Friday, 28 September 2018

Auditing the digital archive filestore

A couple of months ago I blogged about checksums and the methodology I have in place to ensure that I can verify the integrity and authenticity of the files within the digital archive.

I was aware that my current workflows for integrity checking were 'good enough' for the scale at which I'm currently working, but that there was room for improvement. This is often the case when there are humans involved in a process. What if I forget to create checksums for a directory? What happens if I forget to run the checksum verification?

I am also aware that checksum verification does not solve everything. For example, read all about The mysterious case of the changed last modified dates. And as When checksums don't match describes, the checksum verification process doesn't tell you what has changed, who changed it or when it was changed... it just tells you that something has changed. So perhaps we need more information.

A colleague in IT Services here at York mentioned to me that after an operating system upgrade on the filestore server last year, there is now auditing support (a bit more information here). This wasn't being widely used yet but it was an option if I wanted to give it a try and see what it did.

This seemed like an interesting idea so we have given it a whirl. With a bit of help (and the right level of permissions on the filestore), I have switched on auditing for the digital archive.

My helpful IT colleague showed me an example of the logs that were coming through. It has been a busy week in the digital archive. I have ingested 11 memory sticks, 24 CD-ROMs and a pile of floppy disks. The logs were extensive and not very user friendly in the first instance.

That morning I had wanted to find out the total size of the born digital archives in the digital archive filestore and had right-clicked on the folder and selected 'properties'. This had produced tens of thousands of lines of XML in the filestore logs as the attributes of each individual file had to be accessed by the server in order to make the calculation. The audit logs really are capable of auditing everything that happens to the files!

...but do I really need that level of information? Too much information is a problem if it hides the useful stuff.

It is possible to configure the logging so that it looks for specific types of events. So, while I am not specifically interested in accesses to the files, I am interested in changes to them. We configured the auditing to record only certain types of events (as illustrated below). This cuts down the size of the resulting logs and restricts it just to those things that might be of interest to me.




There is little point in switching this on if it is not going to be of use. So what do I intend to do with the output?

The logs are created in XML, but the information would be more user-friendly in a spreadsheet. IT have worked out how to pull the relevant bits of the log into a tab delimited format that I can then open in a spreadsheet application.

What I have is some basic information about the date and time of the event, who initiated it, the type of event (e.g. RENAME, WRITE, ATTRIBUTE|WRITE) and the folder/file that was affected.
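
As an illustration of the sort of transformation involved, here is a minimal sketch in Python (the element names are hypothetical - a real filestore audit log would need its own mapping):

```python
# A minimal sketch of pulling audit events out of an XML log into
# tab delimited form. The element names used here ('event', 'time',
# 'user', 'type', 'path') are hypothetical - a real audit log will
# need its own mapping.
import csv
import xml.etree.ElementTree as ET

def audit_xml_to_tsv(xml_path, tsv_path):
    tree = ET.parse(xml_path)
    with open(tsv_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["datetime", "user", "event", "path"])
        for event in tree.getroot().iter("event"):
            writer.writerow([
                event.findtext("time", ""),
                event.findtext("user", ""),
                event.findtext("type", ""),  # e.g. RENAME, WRITE
                event.findtext("path", ""),
            ])

audit_xml_to_tsv("audit.xml", "audit.tsv")
```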

As I can view this in a spreadsheet application, it is easy to reorder the columns to look for unexpected or unusual activity:

  • Was there anyone other than me working on the filestore? (there shouldn't be right now)
  • Was there any activity on a date I wasn't in the office?
  • Was there any activity in a folder I wasn't intentionally working on?
The current plan is that these logs will be emailed to me on a weekly basis and I will have a brief check to ensure all looks OK. This will sit alongside my regular integrity checking as another means of assuring that all is as it should be.

We'll review how this is working in a few weeks to see if it continues to be a valuable exercise or should be tweaked further.

In my Benchmarking with the NDSA Levels of Preservation post last year, I put us at level 2 for Information Security (as highlighted in green below).



See the full NDSA levels here

Now we have switched on this auditing feature and have a plan in place for regular checking of the logs, does this now take us to level 4 or is more work required?

I'd be really interested to find out whether other digital archivists are utilising filestore audit logs and what processes and procedures are in place to monitor these.

Final thoughts...

This was a quick win and hopefully will prove a useful tool for the digital archive here at the University of York. It is also a nice little example of collaboration between IT and Archives staff.

I sometimes think that IT people and digital preservation folk don't talk enough. If we take the time to talk and to explain our needs and use cases, then the chances are that IT might have some helpful solutions to share. The tools that we need to do our jobs effectively are sometimes already in place in our institutions. We just need to talk to the right people to get them working for us.

Thursday, 16 August 2018

What are the significant properties of a WordStar file?

I blogged a couple of months ago about an imperfect file migration.

One of the reasons this was imperfect (aside from the fact that perhaps all file migrations are imperfect - see below) was because it was an emergency rescue triggered by our Windows 10 upgrade.



Digital preservation is about making best use of your resources to mitigate the most pressing preservation threats and risks. This is a point that Trevor Owens makes very clearly in his excellent new book The Theory and Craft of Digital Preservation (draft).

I saw an immediate risk and I happened to have available resource (someone working with me on placement), so it seemed a good idea to dive in and take action.

This has led to a slightly back-to-front approach to file migration. We took urgent action and in the following months have had time to reflect, carry out QA and document the significant properties of the files.

Significant properties are the characteristics of a file that should be retained in order to create an adequate representation. We've been thinking about what it is we are really trying to preserve. What are the important features of these documents?

Again, Trevor Owens has some really useful insights on this process and numerous helpful examples in The Theory and Craft of Digital Preservation. The following is one of my favourite quotes from his book, and is particularly relevant in this context:
“The answer to nearly all digital preservation questions is ‘it depends.’ In almost every case, the details matter. Deciding what matters about an object or a set of objects is largely contingent on what their future use might be.”
So, in fact the title of this blog post is wrong. There is no use in me asking "What are the significant properties of a WordStar file?" - the real question is "What are the significant properties of this particular set of WordStar files from the Marks and Gran archive?"

To answer this question, a selection of the WordStar files were manually inspected (within a copy of WordStar) to understand how the files were constructed and formatted.

Particular attention was given to how the document was laid out and to the presence of Control and Dot commands. Control commands are markup preceded by ^ within WordStar (for example, ^B to denote bold text). Dot commands are (not surprisingly) preceded by ‘.’ within WordStar - for example, ‘.OP’ to indicate that page numbering should be omitted from the printed document.

These commands, along with the use of carriage returns and horizontal spacing, show the intention of the authors.

A few other things have helped with this piece of research.


It is worth also considering the intention of the authors. It seems logical to assume that these documents were created with the intention of printing them out. The use of WordStar was a means to an end - the printed copy to hand out to the actors being the end goal.

I've made the assumption that what we are trying to preserve is the content of the documents in the format that the creator intended, not the original experience of working with the documents within WordStar.


Properties considered to be significant


The words

Perhaps stating the obvious...but the characters and words present, and their order on the page, are the primary intellectual content of this set of documents. This includes the use of upper and lower case. Upper case is typically used to indicate actions or instructions to the actor and also to indicate who is speaking.

It also includes formatting mistakes or typos, for example in some files the character # is used instead of £. # and £ are often confused depending on whether your keyboard is set up in a US or UK configuration. This is a problem that people experience today but appears to go back to the mid-1980s.


Carriage returns

Another key characteristic of the documents is the arrangement of words into paragraphs. This is done in the original document using carriage returns. The screenplays would make little sense without them.

New line for the name of the character.
Blank line.
New line for the dialogue.
Another blank line.

The carriage returns make the screenplay readable. Without them it is very difficult to follow what is going on, and the look and feel of the documents is entirely different.


Bold text

Some of the text in these files is marked up as bold (using ^B), for example headings at the start of each scene and information on the title page. Bold text is occasionally used for emphasis within the dialogue and thus gives additional information to the reader as to how the text should be delivered: for example “^Bgood^B dog”.


Alternate pitch

Alternate pitch is a new concept to me, but it appears in the WordStar files with an opening command of ^A and a closing command of ^N to mark the point at which the formatting should return to ‘Standard pitch’.

The Marks and Gran WordStar files appear to make use of this command to emphasise particular sections of text. For example, one character tells another that he has seen Hubert with “^Aanother woman^N”. The fact that these words are displayed differently on the page gives the actor additional instruction as to how these words should be spoken.

The WordStar 4.0 manual describes how alternate pitch relates to the character width that is used when the file is printed and that “WordStar has a default alternate pitch of 12 characters per inch”.

However, in a printed physical script that was located for reference (which appears to correspond to some of the files in the digital archive), it appears that text marked as alternate pitch was printed in italics. We cannot be sure that this would always be the case.

However this may be interpreted, the most important point is that Marks and Gran wanted these sections of text to stand out in some way and this is therefore a property that is significant.


Underlined text

In a small number of documents there is underlined text (marked up with ^S in WordStar). This is typically used for titles and headings.

As well as being marked up with ^S, underlined text in the documents typically has underscores instead of spaces. This is no doubt because (as described in the manual) spaces between underlined words are not underlined when you print. Using underscores presumably ensures that spaces are also underlined without impacting on the meaning of the text.
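
As an aside, this control-code markup is regular enough that a mechanical translation is conceivable. Below is a sketch (mine, not part of any established migration workflow) that converts paired ^B and ^S codes - stored in the file as control characters 0x02 and 0x13 - to HTML tags. Real files would first need the high-bit decoding described in my earlier migration post, and this handles only the simplest cases:

```python
# A sketch of translating paired WordStar control codes to HTML:
# ^B (0x02) toggles bold and ^S (0x13) toggles underlining. Assumes
# the text has already had its high bits cleared, and ignores dot
# commands and all the other control codes.
def control_codes_to_html(text):
    tags = {"\x02": ("<b>", "</b>"), "\x13": ("<u>", "</u>")}
    open_state = {}
    out = []
    for ch in text:
        if ch in tags:
            is_open = open_state.get(ch, False)
            out.append(tags[ch][1] if is_open else tags[ch][0])
            open_state[ch] = not is_open
        else:
            out.append(ch)
    return "".join(out)

print(control_codes_to_html("a \x02good\x02 dog"))  # a <b>good</b> dog
```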


Monospace font

Although font type can make a substantial difference to the layout of a page, the concept of font (as we understand it today) does not seem to be a property of the WordStar 4.0 files themselves. However, I do think we can comfortably say that the files were designed to be printed using a monospace font.

WordStar itself does not provide the ability to assign a specific font, and the fact that the interface is not WYSIWYG means the font cannot be inferred by viewing the document in its native environment.

Searching for 'font' within the WordStar manual brings up references to 'italics font', for example, but not to font type as we understand it today. It does, however, talk about using the .PS command to change to 'proportional spacing'. As described in the manual:

"Proportional spacing means that each character is allocated space that is proportional to the character's actual width. For example, an i is narrower than an m, so it is allocated less horizontal space in the line when printed. In monospacing (nonproportional spacing), all characters are allocated the same horizontal space regardless of the actual character width."

The .PS command is not used in the Marks and Gran WordStar files so we can assume that monospace font is used.
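
As an aside, a check like this can be scripted. Here is a sketch (the folder path and extension are assumptions) that counts the dot commands used across a set of files - dot commands sit at the start of a line in plain ASCII, so a simple byte scan is enough:

```python
# A sketch of scanning a folder of WordStar files for dot commands
# (.PS, .PA, .OP, .PN etc.). The folder name and file extension are
# assumptions. '.PN31' and '.PN1' are both normalised to '.PN'.
from collections import Counter
from pathlib import Path

counts = Counter()
for ws_file in Path("marks_and_gran").glob("*.WS4"):
    for line in ws_file.read_bytes().splitlines():
        if line.startswith(b"."):
            command = line[:3].decode("ascii", errors="replace").upper()
            counts[command] += 1

for command, n in counts.most_common():
    print(command, n)
# If '.PS' never appears, the files print at the monospaced default.
```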

This is backed up by looking at the physical screenplays that we have in the Marks and Gran archive. The font on contemporary physical items is a serif font that looks similar to Courier.

This is also consistent with the description of screenplays on Wikipedia: “The standard font is 12 point, 10 pitch Courier Typeface”.

Courier font is also mentioned in the description of a WordStar migration by Jay Gattuso and Peter McKinney (2014).


Hard page breaks

The Marks and Gran WordStar files make frequent use of hard page breaks. In a small selection of files that were inspected in detail, 65% of pages ended with a hard page break. A hard page break is visible in the WordStar file as the .PA command that appears at the bottom of the page.

As described on the Wikipedia page on screenplays, “The format is structured so that one page equates to roughly one minute of screen time, though this is only used as a ballpark estimate”.

This may help explain the frequent use of hard page breaks in these documents. As this is a deliberate action and impacts on the look of the final screenplay this is a property that is considered significant.


Text justification

In most of the documents, the authors have positioned the text in a particular way on the page, centering the headings and indenting the text so it sits towards the right of the page. In many documents, the name of a character sits on a new line and is centred, and the actual dialogue appears below. This formatting is deliberate and impacts on the look and feel of the document, and is thus considered a significant property.


Page numbering

Page numbering is another feature that was deliberately controlled by the document creators.

Many documents start with the .OP command that means ‘omit page numbering’.

In some documents page numbering is deliberately started at a later point in the document (after the title pages) with the .PN1 command to indicate (in this instance) that the page numbering should start at this point with page 1.

Screenplay files in this archive are characteristically split into several files (as is the recommended practice for longer documents created in WordStar). As these separate files are intended to be combined into a single document once printed, the inclusion of page numbers would have been helpful. In some cases Marks and Gran have deliberately edited the starting page number for each individual file to ensure that the order of the final screenplay is clear. For example the file CRIME5 starts with .PN31 (the first 30 pages clearly being in files CRIME1 to CRIME4).


Number of pages

The number of pages is considered to be significant for this collection of WordStar files. This is because of the way that Marks and Gran made frequent use of hard page breaks to control how much text appeared on each page and occasionally used the page numbering command in WordStar.

Note however that this is not an exact science given that other properties (for example font type and font size) also have an impact on how much text is included on each page.

Just to go back to my previous point that the question in the title of this blog is not really valid...

Other work that has been carried out on the preservation of a collection of WordStar files at the National Library of New Zealand reached a different conclusion about the number of pages. As described by Jay Gattuso and Peter McKinney, the documents they were working with were not screenplays, they were oral history transcripts and they describe one of their decisions below:
"We had to consider therefore if people had referenced these documents and how they did so. Did they (or would they in future) reference by page number? The decision was made that in this case, the movement of text across pages was allowable as accurate reference would be made through timepoints noted in the text rather than page numbers. However, it was an impact that required some considerable attention."
Different type of content = different decisions.


Headers

Several documents make use of a document header that defines the text that should appear at the top of every page in the printed copy. Sometimes the information in the header is not included elsewhere in the document and provides valuable metadata. For example, the fact that a file is described in the header as "REVISED SECOND DRAFT” is potentially very useful to future users of the resource, so this information (and ideally its placement within the header of the documents as appropriate) should be retained.


Corruption

This is an interesting one. Can corruption be considered to be a significant property of a file? I think perhaps it can.

One of the 19 disks from the Marks and Gran digital archive appears to have suffered from some sort of corruption at some stage in its life. Five of the files on this disk display a jumble of apparently meaningless characters at one or more points within the text. This behaviour has not been noted on any of the other files on the other disks.



The corruption cannot be fixed. The original content that has been lost cannot be replaced. It therefore needs to be retained in some form.

There is a question around how this corruption is presented to future users of the digital archive. It should be clear that some content is missing because corruption has occurred, but it is not essential that the exact manifestation of the corruption is preserved in access copies. Perhaps a note along the lines of ...

[THE FILE WAS CORRUPT AT THIS POINT. SOME CONTENT IS NO LONGER AVAILABLE]

...would be helpful?

Very interested to hear how others have dealt with this issue.



Properties not considered to be significant:

Other properties noted within the documents were thought to be less significant and are described below:

Font size

The size of a font will have a substantial impact on the layout and pagination of a document. This appears to have been controlled using the Character Width (.CW) command as described in the manual:

"In WordStar, the default character width is 12/120 inch. To change character width, use the dot command .CW followed by the new width in 120ths of an inch. For example, the 12/120 inch default is written as .CW 12. This is 10 characters per inch, which is normal pitch for pica type. "

The documents I'm working with do not use the .CW command so will use the defaults. Trying to work out what this actually means in modern font sizes is making my head hurt. Help needed!
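
Here is my working so far - just the manual's arithmetic written out. Note that the 12 point equivalence at the end is only the usual convention for 10 pitch Courier, not something recorded in the files themselves:

```python
# Working through the manual's arithmetic for the default .CW value.
cw = 12                        # default: character width of 12/120 inch
char_width_inches = cw / 120   # 0.1 inch per character
pitch = 1 / char_width_inches  # 10 characters per inch ("10 pitch")
print(f"{pitch:.0f} characters per inch")
# 10 pitch monospace text is conventionally rendered as 12 point
# Courier - a convention, not a property of the WordStar file.
```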

As mentioned above, the description of screenplays on Wikipedia states that: “The standard font is 12 point, 10 pitch Courier Typeface”. We could use this as a guide but can't always be sure that this standard was followed.

In the National Library of New Zealand WordStar migration the original font is considered to be 10 point Courier.


Word wrap

Where hard carriage returns are not used to denote a new line, WordStar will wrap the text onto a new line as appropriate.

As this operation was outside the control of the document creator, this property isn’t considered to be significant. This decision is also documented by the National Library of New Zealand in their work with WordStar files as discussed in Gattuso and McKinney (2014).


Soft page breaks

Where hard page breaks are not used, text flows on to the next page automatically.

As this operation was not directly controlled by the document creator it is not considered to be significant.


In conclusion

Defining significant properties is not an exact science, particularly given that properties are often interlinked. Note that I have considered number of pages to be significant but other factors such as font size, word wrap and soft page breaks (that will clearly influence the number of pages) to be not so significant. Perhaps there is a flaw in my approach but I'm running with this for the time being!

This is a work in progress and comments and thoughts are very welcome.

I hope to blog another time about how these properties are (or are not) being preserved in the migrated files.

Tuesday, 31 July 2018

Checksum or Fixity? Which tool is for me?

The digital preservation community are in agreement that file fixity and data integrity are important.

Indeed there is a whole row devoted to this in the NDSA Levels of Preservation. But how do we all do it? There are many tools out there - see for example those listed in COPTR.

It was noted in the NDSA Fixity Survey Report of 2017 that there isn't really a consensus on how checksums are created and verified across the digital preservation community. Many respondents to the survey also gave the impression that current procedures were a work in progress and that other options should be explored.

From the conclusion of the report:
"Respondents also talked frequently about improvements they wanted to make, such as looking for new tools, developing workflows, and making other changes to their fixity activities. It may be useful, then, for practitioners to think of their fixity practices within a maturity or continuous improvement model; digital preservation practitioners develop a set of ideal practices for their institution and continuously evaluate methods and the allocation of available resources to get as close as possible to their ideal state."

This nicely sums up how I feel about a lot of the digital preservation routines I put into practice ...but continuous improvement needs time.

Life is busy and technology moves fast.

I realised that I haven't reviewed my tools and procedures for fixity checking since 2013.

A recent upgrade of my PC to Windows 10 gave me a good excuse to change this. Given that I was going to have to re-install and configure all my tools post-upgrade anyway, this was the catalyst I needed to review the way that I currently create and verify checksums within the digital archive.

Current procedures using Checksum

Since 2013 I have been using a tool called Checksum to generate and verify checksums. Checksum describes itself as a "blisteringly fast, no-nonsense file hashing application for Windows" and this has worked well for me in practice. One of the key selling points of Checksum is its speed. This has become a more important requirement over time as the digital archive has grown in size.

I have a set of procedures around this tool that document and describe how it is configured and how it is used, both as part of the ingest process and as a routine integrity checking task. I keep a log of the dates that checksums are verified, numbers of files checked and the results of these checks and am able to respond to issues and errors as and when they occur.

There has been little drama on this score over the last 5 years apart from a brief blip when my checksums didn't match and the realisation that my integrity checking routine wasn't going to catch things like changed last modified dates.

I'm not unhappy with Checksum but I am aware that there is room for improvement. Getting it set up and configured correctly isn't as easy as I would like. I sometimes wonder if there are things I'm missing. In the past I have scheduled a regular checksum verification task using Windows Task Scheduler (as this is not a feature of Checksum itself), but more recently I've just been initiating it manually on a regular schedule.


Introducing Fixity

Fixity is a free tool from AVP. It has been around since 2013 but hadn't hit my radar until recently. It was mentioned several times in the NDSA Fixity Survey Report and I was keen to try it out.



Fixity was created in recognition of the key role that checksum generation and validation have in digital preservation workflows. The intention of the developers was to provide institutions with a simple and low cost (in fact...free) tool that allows checksums to be generated and validated and that enables tasks to be scheduled and reports to be generated.

The Fixity User Guide is a wonderful thing. From personal experience I would say that one of the key barriers to adopting new tools is not being able to understand exactly what they do and why.

Documentation for open source tools can sometimes be a bit scattered and impenetrable, or occasionally too technical for me to understand - not the Fixity User Guide!

It starts off by explaining what problem it is trying to solve and it includes step by step instructions with screen shots and an FAQ at the back. Full marks!

Testing Fixity

The Graphical User Interface (GUI)

I like the interface for Fixity. It is clear and easy to understand, and it gives you the flexibility to manage different fixity checks you may want to set up for different types of content or areas of your filestore.

First impressions were that it is certainly easier to use than the Checksum tool I use currently. On the downside though, there were a few glitches or bugs that I encountered when using it.

Yes, I did manage to break it.

Yes, I did have to use the Task Manager to shut it down on several occasions.

Reporting

The reporting is good. I found this to be clearer and more flexible than the reports generated by Checksum. It helps that it is presented in a format that can be opened in a spreadsheet application - this means that you can explore the results in whatever way you like.

Fixity will send you an email summary with statistics about how many files were changed, new, removed or moved/renamed. Attached to this email is a tab delimited file that includes all the details. This can be read into a spreadsheet application for closer inspection. You can quickly filter by the status column and focus in on those files that are new or changed. A helpful way of checking whether all is as it should be.

A useful summary report emailed to me from Fixity
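
As an example of the kind of check this enables, the attachment can also be filtered programmatically. A sketch below - the column names 'status' and 'path' are my assumptions, so check the headers of the real report first:

```python
# A sketch of filtering the tab delimited Fixity report for files
# flagged as new or changed. The column names 'status' and 'path'
# are assumptions - check the headers of the real report.
import csv

with open("fixity_report.tsv", newline="") as report:
    for row in csv.DictReader(report, delimiter="\t"):
        if row.get("status", "").lower() in ("new", "changed"):
            print(row.get("status"), row.get("path", ""))
```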

One of the limitations of Checksum is that if I add new files into a directory and forget to update the checksum file, it will not tell me that I have a new file - in theory that file could go without having its fixity created or checked for some time. Alternatively, if I enable the 'synchronise' option, it will add the new checksums to the hash file when it next runs a verification task. This is helpful, but perhaps I don't want them to be added without notifying me in some way. I would prefer to cast my eye over them to double check that they should be in there.


Scheduling

You can use the Fixity GUI to schedule checksum verification events - so you don't have to kick off your weekly or monthly scan manually. This is a nice feature. If your PC isn't switched on when a scheduled scan is due to start, it simply delays it until you next switch your computer on.

One of the downsides of running a scheduled scan in Fixity is that there is little feedback on the screen to let you know what it is doing and where it has got to. Also, given that a scan can take quite a while (over 5 hours in my case), it would be helpful to have a notification to remind you that the scheduled scan is going to start and to allow you to opt out or reschedule if you know that your PC isn't going to be switched on for long enough to make it worthwhile.


Managing projects

The Fixity GUI will allow you to manage more than one project - meaning that you could set up and save different scans on different content using different settings. This is a great feature and a real selling point if you have this use case (I don't at the moment).

One downside I found when testing this out was that when moving between different projects, Fixity kept giving me the message that there were unsaved changes in my project and asking if I wanted to save. I don't believe there were unsaved changes at this point, so this is perhaps a bug?

Also, selecting the checksum algorithm for your project can be a little clunky. You have to save your project before you can choose which algorithm you would like to use. This feature is hidden away in the Preferences menu but I'd prefer to see it up front and visible in the GUI so you can't forget about it or ignore it.

I thought I had set my project up to use the MD5 algorithm but when looking at the summary report I realised it was using SHA256. I thought it would be a fairly easy procedure to go back into my project and change the algorithm but now Fixity is stuck on a message saying 'Reading Files, please wait...'. It may be reprocessing all of the files in my project but I don't know for sure because there is no indicator of progress. If this is the case I expect I will have to switch my PC off before it has finished.


Progress

Following on from the comment above...one really useful addition to Fixity would be for it to give an indication of how far through a checksum verification task it is (and how long it has left). To be fair, the feedback around progress in Checksum is not exact on timings - but it does give a clear notification about how many checksum files it still has to look at (there is one in each directory) and how many individual files are left to look at in the directory it is currently in.


Timings

When you kick off a checksum verification task Fixity comes up with a message that says 'Please do not close until a report is generated'. This is useful, but it doesn't give an indication of how long the scan might take, so you have no idea when you kick off the task how long you will need to keep your computer on for. I've had to close it down in the middle of the scan on a few occasions and I don't know whether this has any impact on the checksums that are stored for next time.


The message you get when you manually initiate a scan using Fixity

Fixity certainly takes longer to run a scan than the Checksum tool that I have used over the last few years. The last scan of the digital archive took 5 hours and 49 minutes (though this was using SHA256). The fixity check using Checksum (with MD5) takes around 1 and a half hours so the difference is not inconsequential.

Note: I'm currently trying to change the checksum algorithm on my Fixity test project to MD5 so I can do a fairer comparison and it has been thinking about it for over 3 hours.
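
Out of curiosity, the raw difference between the two algorithms is easy to measure in isolation (a rough sketch - in practice disk speed and each tool's own overheads will dominate, so this won't explain the whole gap):

```python
# A rough benchmark of MD5 vs SHA-256 over the same file. Absolute
# numbers will vary with hardware, and disk speed often dominates.
import hashlib
import time

def hash_file(path, algorithm):
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

for algorithm in ("md5", "sha256"):
    start = time.perf_counter()
    hash_file("large_test_file.bin", algorithm)
    print(algorithm, f"{time.perf_counter() - start:.2f}s")
```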


Where are the checksums stored?

In the Checksum tool I use currently, the checksums themselves are stored in a little text file in each directory of the digital archive. This could be seen as a good thing (all the information you need to verify the files is stored and backed up with the files) or a bad thing (the digital archive is being contaminated with additional administrative files, as are my DROID reports!).

Fixity, however, stores the actual checksums in a /History/ folder and also in a SQLite database. When you first install Fixity it asks you to specify a location for the History folder. I initially changed this to my C: drive to stop it clogging up my limited profile space, but this may not have been the best option in retrospect as my C: drive isn't backed up. It would certainly be useful to have backup copies of the checksums elsewhere, though we could debate at length what exactly should be kept and for how long.
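
One option would be to copy the History folder to backed-up storage on a schedule. A sketch below - both paths are purely illustrative:

```python
# A sketch of copying Fixity's History folder to a backed-up
# location, stamped with the date. Both paths are illustrative, and
# the SQLite database may live elsewhere - adjust as needed.
import shutil
from datetime import date

source = r"C:\Fixity\History"
destination = rf"H:\fixity_backups\History_{date.today():%Y-%m-%d}"
shutil.copytree(source, destination)
print(f"Copied {source} to {destination}")
```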


How to keep checksums up to date

My current workflow using the Checksum application is to create or update checksums as part of the ingest process or as preservation actions are carried out so that I always have a current checksum for every file that is being preserved.

I'm unsure how I would do this if I was using Fixity. Is there a way of adding checksums into the database or history log without actually running the whole integrity check? I don't think so.

Given the time it takes Fixity to run, running the whole project might not be a practical solution in many cases so files may sit for a little while before checksums are created. There are risks involved with this.


The verdict

Fixity is a good tool and I would recommend it to digital archivists who are looking for a user-friendly introduction to fixity checking if they don't have a large quantity of digital content. The documentation, reporting and GUI are clear and user friendly and it is pretty easy to get started with it.

One of the nice things about Fixity is that it has been developed for the digital preservation community with our digital preservation use cases in mind. Perhaps the reason why the documentation is so clear is because it feels like it is written for me (I like to think I am a part of the designated community for Fixity!). In contrast, I find it quite difficult to understand the documentation for Checksum or to know whether I have discovered all that it is capable of.

However, after careful consideration I have decided to stick with Checksum for the time being. The main reason is speed. Clearly Checksum's claims to be a "blisteringly fast, no-nonsense file hashing application for Windows" are not unfounded! I don't have a huge digital archive to manage at 187 GB but the ability to verify all of my checksums in 1.5 hours instead of 5+ hours is a compelling argument. The digital archive is only going to grow, and speed of operation will become more important over time.

Knowing that I can quickly create or update checksums as part of the ingest or preservation planning process is also a big reason for me to stick to current workflows.

It has been really interesting testing Fixity and comparing it with the functionality of my existing Checksum tool. There are lots of good things about it and it has helped me to review what I want from a fixity checking tool and to examine the workflows and preferences that I have in place with Checksum.

As part of a cycle of continuous improvement as mentioned in the NDSA Fixity Survey Report of 2017 I hope to review file fixity procedures again in another few years. Fixity has potential and I'll certainly be keeping my eye on it as an option for the future.

Wednesday, 25 July 2018

Some observations on digital appraisal

A couple of months ago I attended a Jisc sponsored workshop at the University of Westminster on digital appraisal. There were some useful presentations and discussions on a topic that I find both interesting and challenging.

Within the workshop I made the point that my approaches to some elements of digital appraisal may differ depending on the age of the born digital material I'm looking at.

For example, I may wish to take a firm line about removing modern system generated files such as Thumbs.db files and Apple Resource Forks that come into the archives - my reasons being that this is not really the content that the donor or depositor intended to give us, rather an artifact of the computer system that they were using.

However I also stated that for an older born digital archive I am much more reluctant to weed out system files or software.

It seems easy to weed out things that you recognise and understand - as is often the case with contemporary digital archives - but for older archives our lack of understanding of what we are seeing can make appraisal decisions much harder and the temptation is to keep things until we understand better what is useful.

I was thinking of a couple of examples we have here at the Borthwick Institute.

The archive of Trevor Wishart includes files dating back to 1985. Trevor Wishart specialises in electroacoustic composition, in particular the transformation of the human voice and natural sounds through technological means. He has also been involved in developing software for electronic composition. His digital archive is a great case study for us with interesting challenges around how we might be able to preserve and provide access to it.

Of course when I look at this archive there are numerous files that cannot be identified by DROID. It is not always immediately obvious which files are system files, and which are software. Certainly for the time being, there is no intention to appraise out any of the content until we understand it better.

Another good case study...and one I am actively working on right now...is the archive of Marks and Gran, a comedy screenwriting duo who have been writing together since the 1970s.

The digital element of this archive was deposited on a set of 5 1/4 inch floppy disks and includes files dating back to 1984.

When I carried out a first pass at the content of this archive to establish what it contained I encountered 100+ digital examples of screenplays, cast lists and plot outlines (in WordStar 4.0 format) and about 60 other files with various file extensions (.COM, .EXE, .BAT etc) that didn't appear to be created by Marks and Gran themselves.

Software and other system files were clearly present on these disks and this was also evidenced by the disk labels.

But do we want to keep this...are we even allowed to keep it? How can we preserve it effectively if we don't know what it is? Are we allowed to provide access to this material? If not, then what is the point of keeping it at all?

Given that rescuing this archive from the 5 1/4 inch floppy disks in the first place was quite a task and the fact that the size of the digital archive was small, it didn't seem right to appraise anything out until our knowledge and understanding increased.

As I spend a bit more time working with the Marks and Gran digital archive, this decision turns out to have had direct benefits. Here are a few examples of how:


WordStar

One of the floppy disks that was recovered had the label "Copy of MASTER WORDSTAR DISK (WITH INSTALL)" and indeed this is what it appeared to contain.

Why do we have actual copies of software in archives like this one?

Users of computers in the 1980s and 1990s were often encouraged to make their own backup copies of software. I've mentioned this before in a previous blog but there is this gem of information in the online WordStar 4 manual:


There will undoubtedly be numerous copies of early software deposited with archives as a result of this practice of creating a backup disk.

Of course there was an opportunity here - I had lots of WordStar files that were hard to read and I also had a copy of WordStar!

I installed the copy of WordStar on an ancient unsupported PC that sits under my desk and was pretty pleased with myself when I got it working.



Then I had to work out how to use it...

But the manual (and Twitter) helped and it has turned out to be incredibly useful in helping to understand the files within the archive and also to check back on the significant properties of the originals while carrying out file migrations.


WC.EXE

Another file within the archive that I didn't fully understand the purpose of until recently has turned out to be another tool to help with the process of file migration.

After the imperfect file migration triggered by a Windows 10 upgrade I wanted to go back and do some checks on how well this process worked.

If I could find out the number of words and characters within the WordStar files I could then compare these with similar statistics from the migrated files and see if they matched up.

But the first hurdle was how to get this information from a WordStar file in the first place. As with many things, to my modern brain, this was not entirely intuitive!

However, reading the manual revealed that there is an additional word counting utility called WC.EXE that ships with WordStar.

Word counting advice from the WordStar 4.0 manual


Wouldn't it be great if I could find a copy of this utility!

As luck would have it, there was a copy in the digital archive!

I copied it across (on a high tech 3.5 inch floppy disk) to the old PC and got it working very quickly.

And it does what it says it will - here is the result for a file called REDHEAD.WS4


Using WC.EXE to count words and characters in WordStar


I then checked these stats against the stats for the migrated copy of REDHEAD.WS4 in Word 2016 and naively hoped they would match up.


Word counts for the same file in Word 2016

As you can see, the results were alarmingly different! (and note that unticking the box for textboxes, footnotes and endnotes doesn't help).

Twitter is a great help!

Furthermore, it was suggested by Andy Jackson on Twitter that WordStar may also be counting formatting information at the start of a file, though I'm still unclear as to how this would add approximately 1,300 words. It is apparent that word and character counts are not to be trusted!
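
If I wanted counts that were at least comparable, one approach would be to strip the WordStar markup before counting. A sketch below - it clears the high bit WordStar sets on the last byte of each word, drops dot command lines and removes control codes, and certainly won't reproduce WC.EXE's exact rules:

```python
# A sketch of a word count over a WordStar 4.0 file that is at least
# comparable with a count over the migrated text: clear the high
# bits, skip dot command lines, strip remaining control codes.
# It will not match WC.EXE's own counting rules exactly.
def wordstar_word_count(path):
    with open(path, "rb") as f:
        cleaned = bytes(b & 0x7F for b in f.read())  # clear high bits
    words = 0
    for line in cleaned.splitlines():
        if line.startswith(b"."):  # skip dot commands like .OP, .PA
            continue
        text = bytes(b for b in line if b >= 0x20)  # drop control codes
        words += len(text.split())
    return words

print(wordstar_word_count("REDHEAD.WS4"))
```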

So back to equally imperfect manual methods and visual inspection...having spent some time with these files I am fairly confident that the content of the documents has been captured adequately.

Although WC.EXE didn't turn out to be such a useful file for assessing my migrations, if I hadn't had a copy of it I could have wasted a lot of time looking for it!


Print test pages

Another file within the Marks and Gran digital archive that would not necessarily be considered to be archival is PRINT.BAK. This WordStar file (illustrated below) doesn't look like something that was created by Marks and Gran.

However, the file has turned out to be hugely useful to me as I try to understand and migrate the WordStar files. It describes (and most importantly demonstrates) some of the ways that you can format text within WordStar and (in theory) shows how they will appear when printed.

This would have been quite important information for users of a word processor that is not WYSIWYG!

A migrated version of a file called PRINT.BAK, present on the MASTER WORDSTAR DISK

From my perspective, the great thing about this file is that by putting it through the same migration methodology as the other WordStar files within the archive, I can make an assessment of how successful the migration has been at capturing all the different types of formatting.


Here is how the print test page looks in WordStar - it shows the mark up used to display different formatting.

I thought my migrated version of the print test page looked pretty good until I opened up the original in WordStar and noted all the additional commands that have not been captured as part of the migration process.
  • See for example the .he command at the top of the page, which specifies that the document should be printed with a header. Two lines of header text are defined, along with 'Page #' for the page number. 
  • Note also the ^D mark up that wraps the text 'WordStar' - this tells the printer to double strike the text - effectively creating a light bold face as the printer prints the characters twice. 
  • The print test page includes an example of the strikeout command which should look like this.
  • It also includes examples of variable pitch which should be visible as a change in character width (font size).


Backup files

I described in a previous blog how WordStar automatically creates backup files each time a file is saved and I asked the question 'should we keep them?'

At this point in time I'm very glad that we have kept them.

The backup files have been useful as a point of comparison when encountering what appears to be some localised corruption on one particular floppy disk.

A fragment of one of the corrupt files in WordStar. The text no longer makes sense at a particular point in the document.

Looking at the backup file of the document illustrated above I can see a similar, but not identical, issue. It happens in roughly the same location in the document but some of the missing words can be recovered, providing opportunities to repair or retrieve more information from the archive because of the presence of these contemporary backups.


To conclude

When I first encountered the Marks and Gran digital archive I was not convinced about the archival value of some of the files.

Specifically I was not convinced that we should be keeping things like system files and software or automatically generated backup files unless they were created with deliberate intent by Marks and Gran themselves.

However, as I have worked with the archive more and come to understand the files and how they were created, I have found some of these system files to be incredibly useful in moving forward my understanding of the WordStar files as a whole.

I'm not suggesting we should keep everything but I am suggesting that we should be cautious about getting rid of things that we don't fully understand...they may be useful one day.


Friday, 6 July 2018

Accessibility and usability report for AtoM

Earlier this year I blogged about our recent upgrade to AtoM 2.4 and hinted at a follow up post on the subject of usability and accessibility.

AtoM is our Archives Management System and as well as being a system that staff use to enter information about the archives that we hold, it is also the means by which our users find out about our holdings. We care very much what our users think of it.

When we first released the Borthwick Catalogue (using AtoM 2.2) back in 2016 we were lucky enough to have some staff resource available to carry out a couple of rounds of user testing - this user testing is documented here and here.

We knew that some of the new features of AtoM 2.4 would help address issues that were raised by our users in these earlier tests. The addition of ‘shopping bag’ functionality in the form of the new clipboard feature, and the introduction of an advanced search by date range being two notable examples.

We did not have the capacity to carry out a similar user testing project when we upgraded to 2.4 but as a department committed to Customer Service Excellence we were very keen to consider our users as we rolled out the new version.

We decided to take a light touch approach to this problem and take advantage of the expertise of other colleagues across the University. 

We approached our Marketing team with a request for support and were pleased that the Senior User Experience Designer was able to act as a critical friend, giving us a couple of hours of his time to take a look at our catalogue from a usability perspective and give us some feedback. Thanks!

A quick note:
We use a slightly customised version of the Dominion theme. Some of the issues raised will be relevant to other AtoM users (particularly those with the Dominion theme) and others are more specific to our own customisations (for example the colours we use on our interface).
The Borthwick Catalogue at iPhone 5 size with browse button partially visible

Mobile responsiveness

We have already done some work on mobile responsiveness but our usability guru highlighted that there was still work to do. At some screen sizes parts of the top navigation were cut off the screen. At other screen sizes a white block appears to the left of the search box. This appears to be an issue not limited to our AtoM instance.

One of the customisations we have is a footer (similar to that found on our website) at the bottom of every page. It was noted that at certain screen sizes (below 1200px) the right end of the footer is cut off.

I had some discussion with a technical colleague who supports AtoM and we agreed that whilst this isn’t a deal breaker, we would like it to be better. He is going to investigate this in more detail at a later date.

Our footer for the Borthwick Catalogue - with information on the right truncated at smaller screen sizes


Bold text not appearing on a Mac

When viewing our catalogue on a Mac, it was noted that text that should have been bold was not appearing as such. This is a minor irritation but again something we would like to investigate further.


Colour contrast

There was a fair amount of feedback on colour contrast and how this may be problematic for some of our users.

Specific areas of concern were:

  • The text colours we had specifically chosen for our theme - for example the light orange and grey in this example
The light orange and grey text that we use for our catalogue may not stand out enough to be legible
  • The buttons (for example ‘search’ and ‘reset’) in this example look a little faded.
Buttons in AtoM use subdued colours which may cause problems for some users
  • The clipboard notification bubble in the top menu bar is blue on dark grey. There were concerns about the visibility of this.
Lack of contrast between the blue and dark grey here

I need to try and find some more suitable (stronger) colours for our theme. Never an easy task but the WebAIM colour contrast checker is my friend!
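
For reference, the calculation behind that checker is the WCAG 2 contrast ratio, which is simple enough to script (a sketch of the published formula - WCAG AA asks for a ratio of at least 4.5:1 for normal text):

```python
# A sketch of the WCAG 2 contrast ratio calculation that sits behind
# the WebAIM checker. AA conformance requires at least 4.5:1 for
# normal sized text.
def relative_luminance(hex_colour):
    channels = [int(hex_colour.lstrip("#")[i:i + 2], 16) / 255
                for i in (0, 2, 4)]
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(colour1, colour2):
    lighter, darker = sorted(
        (relative_luminance(colour1), relative_luminance(colour2)),
        reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

print(f"{contrast_ratio('#777777', '#ffffff'):.2f}:1")  # about 4.48:1
```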


Colour alone used to indicate links

It was noted that colour was the only method of indicating that some of the text within the interface was a link. 

This is not too much of a problem in some areas of the site (for example in the navigation bar on the left where it is more obvious that this will be linked text) but particularly when included in the static pages this could be problematic for some users.

An example of how colour is used to indicate links in our static pages - these visual clues will not be apparent to some users


Alt text for images

Alt text gives visually impaired users a description of an image on a website and is particularly necessary if an image conveys important information. Read more about alt text here.

The images within the AtoM interface (for example in the static pages) utilise the image filename as the alt text.

Having a file name as alt text is not always very helpful. It would be better to have no alt text if the images are purely illustrative or to allow AtoM administrators to add their own alt text to those images that do convey important information.

There doesn’t appear to be a method of including alt text in AtoM currently. It would be helpful if the syntax to include an image in static pages supported an additional alt text argument. 

Living in hope, I tried to include some alt text for an image within the static pages but unfortunately this was not picked up when the page was rendered in the web browser.


Login button

There is a login button in the top right corner of AtoM. This is because staff use the same interface to create records in AtoM as users use to search and browse. By logging in, staff have access to information and features that are not available to other users (for example the accessions records, draft records and import and export options).

A user coming to our catalogue may see the login button and wonder if they need to login in order to unlock additional functionality or perhaps to enable them to save their session or the contents of their clipboard.

By default the login button on the AtoM interface is simply labelled 'Log in'. I discussed with colleagues and the AtoM mailing list how we could avoid confusion for users and we talked through several different options.

For now we have taken the quick approach of altering the label on the button so it now reads 'Admin Login'. This makes it clearer to our users that the feature is not intended for them. 

Artefactual later announced that they had been able to carry out some work to resolve our problem and have added the log in button to the Visible Elements module of AtoM - enabling administrators to configure whether they want to show or hide the button to unauthenticated users. This will be available in AtoM 2.5 - thanks Artefactual!


Clipboard feature

A couple of comments were made about the clipboard functionality. 

Firstly, that users may be confused by the fact that the clipboard can appear to be empty if for example you have been collecting Authority Records rather than Archival Descriptions. As the clipboard shows Archival Descriptions by default, it is not immediately obvious that you may need to change the entity type in order to see the items you have selected. 

Secondly that the feedback available to users when adding or removing items from the clipboard is not always clear and consistent. When viewing a list of results the clipboard icon just changes colour to indicate an item has been added.

The small clipboard icon on the right of each item in a list changes colour when you add or remove the item


However, when on an archival description or authority record page, there is also a text prompt which helps explain to the user what the icon is for and what action is available. This is much more helpful for users, particularly if they haven't used an AtoM catalogue or encountered the clipboard feature before.

When viewing a record, the clipboard icon to the right is much clearer and more user friendly



So, some useful food for thought here. It is always good to be aware of potential usability and accessibility issues and to open up discussions about how things could be improved.

I hope these findings are of interest to other AtoM users.

Friday, 8 June 2018

An imperfect migration story

Over the past six years as a digital archivist at the Borthwick Institute I have carried out a very very small amount of file migration. The focus here has been on getting things 'safe', backed up and documented (along with running a few tools to find out what exactly we have and ensure that what we have doesn't change).

I've been deliberately avoiding file migration because:

  1. there is little time to do this sort of stuff 
  2. we don't have a digital archiving system in place
  3. we don't have a means to record the PREMIS metadata about the migrations (and who wants to create PREMIS by hand?)


The catalyst for a file migration

Recently I had to update my work PC to Windows 10.

Whereas colleagues might be able to just set this upgrade off and get it done while they had lunch, I left myself a big chunk of time to try and manage the process. As a digital archivist I have downloaded and installed lots of tools to help me do my job - some I rely on quite heavily to help me ingest digital content, monitor files over time and understand the born digital archives that I work with.

So, I wanted to spend some time capturing all the information about the tools I use and how I have them set up before the upgrade, and then more time post-upgrade to get them all installed and configured again.

...so with a bit of thought and preparation, everything should be fine...shouldn't it?

Well it turns out everything wasn't fine.


Backwards compatibility is not always guaranteed

One of the tools I rely on and have blogged about previously is Quick View Plus. I have been using Quick View Plus version 12 for the last 6 years and it is a great tool for viewing a range of files that I might not have the software to read otherwise.

In particular it was invaluable in allowing me to access and identify a set of WordStar 4.0 files from the Marks and Gran archive. These files were not accessible through any of the other software that was available to me (apart from in a version of WordStar I have installed on an old Windows 98 PC that I keep under my desk for special occasions).

But when I tried to install Quick View Plus 12 on my PC after upgrading to Windows 10 I discovered it was not compatible with Windows 10.

This was an opportunity to try out a newer version of the Quick View Plus software, so I duly downloaded an evaluation copy of Quick View Plus 2017. My first impressions were good. It seemed the tool had come along a bit in the last few years and there was some nice new functionality around the display of metadata (a potential big selling point for digital archivists).

However, when I tried to open some of the 120 or so WordStar files we have in our digital archive I discovered they were no longer supported.

They were no longer identified as WordStar 4.0.

They were no longer displaying correctly in the viewer.

They looked just like they do in a basic text processing application

...which isn't ideal because as described in the PRONOM record for WordStar 4.0 files:

"On the surface it's a plain text file, however the format 'shifts' the last byte of each word. Effectively it is 'flipping' the first bit of the ASCII character from 0 to 1. so a lower case 'r' (hex value 0x72) becomes 'ò' (hex value 0xF2); lower case 'd' (hex 0x64) becomes 'ä' (hex 0xE4) and so on."

This means that viewing a WordStar file in an application that doesn't interpret and decode this behaviour can be a bit taxing for the brain.
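
Happily, undoing this is mechanical once you know the trick. A sketch based on the PRONOM description above - clearing the high bit of every byte recovers readable ASCII, though dot commands and control characters remain in place:

```python
# A sketch of undoing the 'bit flipping' PRONOM describes: clearing
# the high bit of every byte recovers readable ASCII. Dot commands
# and control characters are left in place.
def decode_wordstar(path):
    with open(path, "rb") as f:
        raw = f.read()
    return bytes(b & 0x7F for b in raw).decode("ascii", errors="replace")

print(decode_wordstar("REDHEAD.WS4")[:500])
```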

Having looked back at the product description for Quick View Plus 2017 I discovered that WordStar for DOS is one of their supported file formats. It seems this functionality had not been intentionally deprecated.

I emailed Avantstar Customer Technical Support to report this issue and with a bit of testing they confirmed my findings. However, they were not able to tell me whether this would be fixed or not in a future release.


A 'good enough' rescue

This prompted me to kick off a little rescue mission. Whilst we still had one or two computers in the building on Windows 7, I installed Quick View Plus 12 on one of them and started a colleague off on a basic file migration task to ensure we have a copy of the files that can be more easily accessed on current software.

A two-pronged attack using limited resources is described below:

  • Open file in QVP12 and print to PDF/A-1b. This appears to effectively capture the words on the page, the layout and the pagination of the document as displayed in QVP12.
  • Open file in QVP12, select all the text and copy and paste into MS Word (keeping the source formatting). The file is then saved as DOCX. Although this doesn’t maintain the pagination, it does effectively capture the content and some of the formatting of the document in a reusable format and gives us an alternative preservation version of the document that we can work with in the future.

Files were saved with the same names as the originals (including the use of SHOUTY 1980s upper case) but with new file extensions. The original file extensions were also captured in the names of the migrated files - one possible naming pattern is sketched below. This is because (as described in a previous post) users of early WordStar for DOS packages were encouraged to make use of the 3 character file extension to add additional contextual information related to the file (gulp!).
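As an illustration only (the exact pattern and filenames here are hypothetical, not necessarily the ones we used), folding the original extension into the migrated filename could look something like this:

```python
# Hypothetical sketch of a renaming convention that keeps the original
# name and folds the original (often meaningful) 3 character extension
# into the name of the migrated file.
from pathlib import Path

def migrated_name(original: Path, new_suffix: str) -> str:
    # e.g. "MEMO.JIM" -> "MEMO_JIM.pdf"
    return f"{original.stem}_{original.suffix.lstrip('.')}{new_suffix}"

print(migrated_name(Path("MEMO.JIM"), ".pdf"))   # MEMO_JIM.pdf
print(migrated_name(Path("MEMO.JIM"), ".docx"))  # MEMO_JIM.docx
```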

The methodology was fully documented and progress has been noted on a spreadsheet. In the absence of a system for me to record PREMIS metadata, all of this information will be stored alongside the migrated files in the digital archive.
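As an aside, even without a PREMIS-capable system, a log like the spreadsheet mentioned above can be kept programmatically. A minimal sketch (the column choices and filenames are my own, not a formal PREMIS serialisation):

```python
# Minimal sketch: append one PREMIS-style event row per migrated file
# to a CSV log. The columns loosely mirror the fields a PREMIS
# migration event would record; this is not a formal serialisation.
import csv
from datetime import datetime, timezone

with open("migration_log.csv", "a", newline="") as log:
    csv.writer(log).writerow([
        "MEMO.JIM",                               # source file (hypothetical)
        "MEMO_JIM.pdf",                           # migrated file
        "migration",                              # PREMIS eventType
        datetime.now(timezone.utc).isoformat(),   # eventDateTime
        "Quick View Plus 12, print to PDF/A-1b",  # event detail
        "successful",                             # event outcome
    ])
```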


Future work

We've still got some work to do. For example, some spot checking against the original files in their native WordStar environment - I believe that the text has been captured well, but there are a few formatting issues that I'd like to investigate.

I'd also like to use veraPDF to check whether the PDF/A files that we have created are actually valid (I'm keeping my fingers firmly crossed!).
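That validation step can be scripted. A minimal sketch, assuming the veraPDF command line tool is installed and on the PATH, and that the migrated PDFs sit in a folder of my own invention here:

```python
# Minimal sketch: batch-validate migrated PDF/A-1b files with the
# veraPDF command line tool. The --flavour and --format flags are
# documented veraPDF CLI options, but check them against your
# installed version.
import subprocess
from pathlib import Path

for pdf in Path("migrated").glob("*.pdf"):  # hypothetical folder
    result = subprocess.run(
        ["verapdf", "--flavour", "1b", "--format", "text", str(pdf)],
        capture_output=True, text=True,
    )
    print(result.stdout)
```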

This was possibly not the best thought out migration strategy, but with little time available my focus was on coming up with a methodology that was 'good enough' to enable continued access to the content of these documents. Of course the original files are also retained, and we can go back to these at any time to carry out further (better?) migrations in the future.*

In the meantime, a follow up e-mail from Avantstar Technical Support has given me an alternative solution. Apparently, Quick View Plus version 13 (which our current licence for version 12 enables us to install at no extra cost) is compatible with Windows 10 and will enable me to continue to view WordStar 4.0 files on my PC. Good news!



* I'm very interested in the work carried out at the National Library of New Zealand to convert WordStar to HTML and would be interested in exploring this approach at a later date if resources allow.

Friday, 18 May 2018

UK Archivematica meeting at Westminster School

Yesterday the UK Archivematica user group meeting was held in the historic location of Westminster School in central London.

A pretty impressive location for a meeting!
(credit: Elizabeth Wells)


In the morning, once fuelled with tea, coffee and biscuits, we set about talking about our infrastructures and workflows. It was great to hear from a range of institutions and how Archivematica fits into the bigger picture for them. One of the points that many attendees made was that progress can be slow. Many of us were slightly frustrated that we aren't making faster progress in establishing our preservation infrastructures, but I think it was a comfort to know that we were not alone in this!

I kicked things off by showing a couple of diagrams of our proposed and developing workflows at the University of York: the first illustrating our infrastructure for preserving and providing access to research data, the second looking at our hypothetical workflow for born digital content that comes to the Borthwick Institute.

Now that our AtoM upgrade is complete and Archivematica 1.7 has been released, I am hoping that colleagues can set up a test instance of AtoM talking to Archivematica that I can start to play with. In a parallel strand, I am encouraging colleagues to consider and document access requirements for digital content. This will be invaluable when thinking about what sort of experience we are trying to provide for our users. The decision is yet to be made as to whether AtoM and Archivematica will meet our needs on their own or whether additional functionality is needed through an integration with Fedora and Samvera (the software on which our digital library runs)...but that decision will come once we better understand what we are trying to achieve and what the solutions offer.

Elizabeth Wells from Westminster School talked about the different types of digital content that she would like Archivematica to handle and different workflows that may be required depending on whether it is born digital or digitised content, whether a hybrid or fully digital archive and whether it has been catalogued or not. She is using Archivematica alongside AtoM and considers that her primary problems are not technical but revolve around metadata and cataloguing. We had some interesting discussion around how we would provide access to digital content through AtoM if the archive hadn't been catalogued.

Anna McNally from the University of Westminster reminded us that information about how they are using Archivematica is already well described in a webinar that is now available on YouTube: Work in Progress: reflections on our first year of digital preservation. They are using the PERPETUA service from Arkivum, with an automated upload folder in NextCloud to move digital content into Archivematica. They are in the process of migrating from CALM to AtoM to provide access to their digital content. One of the key selling points of AtoM for them is its support for different languages and character sets.

Chris Grygiel from the University of Leeds showed us some infrastructure diagrams and explained that this is still very much a work in progress. Alongside Archivematica, he is using BitCurator to help appraise the content, and EPrints and EMu for access.

Rachel MacGregor from Lancaster University updated us on work with Archivematica at Lancaster. They have been investigating both Archivematica and Preservica as part of the Jisc Research Data Shared Service pilot. The system that they use has to be integrated in some way with PURE for research data management.

After lunch in the dining hall (yes, it did feel a bit like being back at school), Rachel MacGregor (shouting to be heard over the sound of the bells at Westminster) kicked off the afternoon with a presentation about DMAonline. This tool, originally created as part of the Jisc Research Data Spring project, is under further development as part of the Jisc Research Data Shared Service pilot.

It provides reporting functionality for a range of systems in use for research data management including Archivematica. Archivematica itself does not come with advanced reporting functionality - it is focused on the primary task of creating an archival information package (AIP).

The tool (once in production) could be used by anyone regardless of whether they are part of the Jisc Shared Service or not. Rachel also stressed that it is modular - though it can gather data from a whole range of systems, it could also work just with Archivematica if that is the only system you are interested in reporting on.

An important part of developing a tool like this is to ensure that communication is clear - if you don't adequately communicate to the developers what you want it to do, you won't get what you want. With that in mind, Rachel has been working collaboratively to establish clear reporting requirements for preservation. She talked us through these requirements and asked for feedback; they are also available online for people to comment on.


Sean Rippington from the University of St Andrews talked us through some testing he has carried out, looking at how files in SharePoint could be handled by Archivematica. St Andrews are one of the pilot organisations for the Jisc Research Data Shared Service, and they are also interested in the preservation of their corporate records. There doesn’t seem to be much information out there about how SharePoint and Archivematica might work together, so it was really useful to hear about Sean’s work.

He showed us inside a sample SharePoint export file (a .cmp file). It consisted of various office documents (the documents that had been put into SharePoint) along with metadata files. The office documents themselves had lost much of their original metadata: they had been renamed with a consecutive number and given a .DAT file extension, and the date last modified had changed to the date of export from SharePoint. However, all was not lost - a manifest file was included in the export and contained lots of valuable metadata, including the last modified date, the filename, the file extension and the names of the people who created and last modified each file.

Sean tried putting the .cmp file through Archivematica to see what would happen. He found that Archivematica correctly identified the MS Office files (despite the change of file extension), but of course the correct (original) metadata was not associated with the files; it remained in the separate manifest file. This has the potential to confuse future users of the digital archive - the metadata gives useful context to the files, and if it is hidden in a separate manifest file it may never be discovered.

Another approach he took was to use the information in the manifest file to rename the files and assign them their correct file extensions before pushing them into Archivematica. This might be a better solution in that the files served up in the dissemination information package (DIP) will be named correctly and be easier for users to locate and understand. However, this was a manual process and probably not scalable unless it could be automated in some way - a possible automation is sketched below.
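As a thought experiment, here is roughly what that automation might look like. The manifest filename and the element/attribute names below are assumptions based on Sean's description, not a tested recipe - real SharePoint export manifests may also use XML namespaces, which would need handling:

```python
# Hypothetical sketch: use the manifest inside an unpacked SharePoint
# export to copy each renamed .DAT file back to its original filename
# before ingest into Archivematica. Element and attribute names are
# assumptions - check them against a real export before use.
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

export_dir = Path("sharepoint_export")  # unpacked .cmp contents (assumed)
output_dir = Path("for_archivematica")
output_dir.mkdir(exist_ok=True)

tree = ET.parse(export_dir / "Manifest.xml")  # assumed manifest name
for f in tree.iter("File"):                   # assumed element name
    exported = f.get("FileValue")             # e.g. "00000001.dat" (assumed)
    original = f.get("Name")                  # original filename (assumed)
    if exported and original and (export_dir / exported).exists():
        shutil.copy2(export_dir / exported, output_dir / original)
```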

He ended with lots of questions and would be very glad to hear from anyone who has done further work in this area.

Hrafn Malmquist from the University of Edinburgh talked about his use of Archivematica's appraisal tab and described a specific use case with its own particular requirements. The records of the University court have been deposited as born digital since 2007 and need to be preserved and made accessible, with full text searching to aid retrieval. This has been achieved using a combination of Archivematica and DSpace, adding a package.csv file containing appropriate metadata that can be understood by DSpace.

Laura Giles from the University of Hull described ongoing work to establish a digital archive infrastructure for the Hull City of Culture archive. They had an appetite for open source and prior experience with Archivematica so they were keen to use this solution, but they did not have the in-house resource to implement it. Hull are now working with CoSector at the University of London to plan and establish a digital preservation solution that works alongside their existing repository (Fedora and Samvera) and archives management system (CALM). Once this is in place they hope to use similar principles for other preservation use cases at Hull.

We then had time for a quick tour of Westminster School archives followed by more biscuits before Sarah Romkey from Artefactual Systems joined us remotely to update us on the recent new Archivematica release and future plans. The group is considering taking her up on her suggestion to provide some more detailed and focused feedback on the appraisal tab within Archivematica - perhaps a task for one of our future meetings.

Talking of future meetings... we have agreed that the next UK Archivematica meeting will be held at the University of Warwick at some point in the autumn.