Tuesday 31 July 2018

Checksum or Fixity? Which tool is for me?

The digital preservation community are in agreement that file fixity and data integrity are important.

Indeed there is a whole row devoted to this in the NDSA Levels of Preservation. But how do we all do it? There are many tools out there - see for example those listed in COPTR.

It was noted in the NDSA Fixity Survey Report of 2017 that there isn't really a consensus on how checksums are created and verified across the digital preservation community. Many respondents to the survey also gave the impression that current procedures were a work in progress and that other options should be explored.

From the conclusion of the report:
"Respondents also talked frequently about improvements they wanted to make, such as looking for new tools, developing workflows, and making other changes to their fixity activities. It may be useful, then, for practitioners to think of their fixity practices within a maturity or continuous improvement model; digital preservation practitioners develop a set of ideal practices for their institution and continuously evaluate methods and the allocation of available resources to get as close as possible to their ideal state."

This nicely sums up how I feel about a lot of the digital preservation routines I put into practice ...but continuous improvement needs time.

Life is busy and technology moves fast.

I realised that I hadn't reviewed my tools and procedures for fixity checking since 2013.

A recent upgrade of my PC to Windows 10 gave me a good excuse to change this. Since I was going to have to re-install and configure all my tools post-upgrade anyway, this was the catalyst I needed to review the way that I currently create and verify checksums within the digital archive.

Current procedures using Checksum

Since 2013 I have been using a tool called Checksum to generate and verify checksums. Checksum describes itself as a "blisteringly fast, no-nonsense file hashing application for Windows" and this has worked well for me in practice. One of the key selling points of Checksum is its speed. This has become a more important requirement over time as the digital archive has grown in size.

I have a set of procedures around this tool that document and describe how it is configured and how it is used, both as part of the ingest process and as a routine integrity checking task. I keep a log of the dates that checksums are verified, numbers of files checked and the results of these checks and am able to respond to issues and errors as and when they occur.

There has been little drama on this score over the last 5 years apart from a brief blip when my checksums didn't match and the realisation that my integrity checking routine wasn't going to catch things like changed last modified dates.

I'm not unhappy with Checksum, but I am aware that there is room for improvement. Getting it set up and configured correctly isn't as easy as I would like, and I sometimes wonder if there are things I'm missing. In the past I scheduled a regular checksum verification task using Windows Task Scheduler (scheduling is not a feature of Checksum itself), but more recently I've just been initiating it manually on a regular schedule.


Introducing Fixity

Fixity is a free tool from AVP. It has been around since 2013 but hadn't appeared on my radar until recently. It was mentioned several times in the NDSA Fixity Survey Report and I was keen to try it out.



Fixity was created in recognition of the key role that checksum generation and validation have in digital preservation workflows. The intention of the developers was to provide institutions with a simple and low cost (in fact...free) tool that allows checksums to be generated and validated and that enables tasks to be scheduled and reports to be generated.

The Fixity User Guide is a wonderful thing. From personal experience I would say that one of the key barriers to adopting new tools is not being able to understand exactly what they do and why.

Documentation for open source tools can sometimes be a bit scattered and impenetrable, or occasionally too technical for me to understand - not the Fixity User Guide!

It starts off by explaining what problem it is trying to solve and it includes step by step instructions with screen shots and an FAQ at the back. Full marks!

Testing Fixity

The Graphical User Interface (GUI)

I like the interface for Fixity. It is clear and easy to understand, and it gives you the flexibility to manage different fixity checks you may want to set up for different types of content or areas of your filestore.

First impressions were that it is certainly easier to use than the Checksum tool I use currently. On the downside though, there were a few glitches or bugs that I encountered when using it.

Yes, I did manage to break it.

Yes, I did have to use the Task Manager to shut it down on several occasions.

Reporting

The reporting is good. I found this to be clearer and more flexible than the reports generated by Checksum. It helps that it is presented in a format that can be opened in a spreadsheet application - this means that you can explore the results in whatever way you like.

Fixity will send you an email summary with statistics about how many files were changed, new, removed or moved/renamed. Attached to this email is a tab delimited file that includes all the details. This can be read into a spreadsheet application for closer inspection. You can quickly filter by the status column and focus in on those files that are new or changed. A helpful way of checking whether all is as it should be.

A useful summary report emailed to me from Fixity
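If you'd rather script this check than open the report in a spreadsheet, the filtering is straightforward. Here is a minimal Python sketch - note that the report filename and the 'Status' and 'Path' column names are assumptions on my part, so check them against a real Fixity report before relying on this:

```python
import csv

REPORT = "fixity_report.tsv"  # placeholder filename

# Read the tab delimited report and pull out files flagged as new or
# changed - the column names here are assumed, not confirmed.
with open(REPORT, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["Status"] in ("New", "Changed"):
            print(row["Status"], row["Path"])
```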

One of the limitations of Checksum is that I can add new files into a directory and, if I forget to update the checksum file, it will not tell me that I have a new file - in theory that file could go without having its fixity created or checked for some time. Alternatively, if I enable the 'synchronise' option it will add the new checksums to the hash file when it next does a verification task. This is helpful, but perhaps I don't want them to be added without notifying me in some way. I would prefer to cast my eye over them to double check that they should be in there.
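This kind of check is easy enough to script independently of either tool. Below is a rough Python sketch that lists files in a directory with no entry in a checksum manifest; it assumes an md5sum-style manifest ('<hash> *<filename>' per line), which may not match the format your tool actually writes:

```python
import os

def untracked_files(directory, manifest_path):
    # Filenames already recorded in the manifest (md5sum-style lines)
    with open(manifest_path, encoding="utf-8") as f:
        known = {line.split("*", 1)[1].strip() for line in f if "*" in line}
    # Files actually sitting in the directory right now
    on_disk = {name for name in os.listdir(directory)
               if os.path.isfile(os.path.join(directory, name))}
    return sorted(on_disk - known - {os.path.basename(manifest_path)})

# Example (placeholder paths):
# for name in untracked_files("D:/archive/box1", "D:/archive/box1/box1.hash"):
#     print("No checksum recorded for:", name)
```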


Scheduling

You can use the Fixity GUI to schedule checksum verification events - so you don't have to kick off your weekly or monthly scan manually. This is a nice feature. If your PC isn't switched on when a scheduled scan is due to start, it simply delays it until you next switch your computer on.

One of the downsides of running a scheduled scan in Fixity is that there is little feedback on the screen to let you know what it is doing and where it has got to. Also, given that a scan can take quite a while (over 5 hours in my case), it would be helpful to have a notification to remind you that the scheduled scan is going to start and to allow you to opt out or reschedule if you know that your PC isn't going to be switched on for long enough to make it worthwhile.


Managing projects

The Fixity GUI will allow you to manage more than one project - meaning that you could set up and save different scans on different content using different settings. This is a great feature and a real selling point if you have this use case (I don't at the moment).

One downside I found when testing this out was that, when moving between different projects, Fixity kept giving me the message that there were unsaved changes in my project and asking if I wanted to save. I don't believe there were unsaved changes at this point, so this is perhaps a bug?

Also, selecting the checksum algorithm for your project can be a little clunky. You have to save your project before you can choose which algorithm you would like to use, and the setting is hidden away in the Preferences menu. I'd prefer to see it up front and visible in the GUI so you can't forget about it or ignore it.

I thought I had set my project up to use the MD5 algorithm but when looking at the summary report I realised it was using SHA256. I thought it would be a fairly easy procedure to go back into my project and change the algorithm, but now Fixity is stuck on a message saying 'Reading Files, please wait...'. It may be reprocessing all of the files in my project but I don't know for sure because there is no indicator of progress. If this is the case I expect I will have to switch my PC off before it has finished.


Progress

Following on from the comment above...one really useful addition to Fixity would be for it to give an indication of how far through a checksum verification task it is (and how long it has left). To be fair, the feedback on progress in Checksum is not exact on timings, but it does give a clear notification of how many checksum files it still has to look at (there is one in each directory) and how many individual files are left to check in the directory it is currently in.


Timings

When you kick off a checksum verification task Fixity comes up with a message that says 'Please do not close until a report is generated'. This is useful, but it doesn't give an indication of how long the scan might take, so you have no idea when you kick off the task how long you will need to keep your computer on for. I've had to close it down in the middle of the scan on a few occasions and I don't know whether this has any impact on the checksums that are stored for next time.


The message you get when you manually initiate a scan using Fixity

Fixity certainly takes longer to run a scan than the Checksum tool that I have used over the last few years. The last scan of the digital archive took 5 hours and 49 minutes (though this was using SHA256). The fixity check using Checksum (with MD5) takes around an hour and a half, so the difference is not inconsequential.

Note: I'm currently trying to change the checksum algorithm on my Fixity test project to MD5 so I can do a fairer comparison, and it has been thinking about it for over 3 hours.
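Part of this difference is simply the algorithm - SHA256 does more work per byte than MD5 - though in a real scan disk speed and tool overhead will dominate. A quick Python sketch like this gives a feel for the raw difference on your own hardware:

```python
import hashlib
import time

data = b"x" * (64 * 1024 * 1024)  # 64 MB of dummy data

# Hash the same data with each algorithm and report the throughput.
for name in ("md5", "sha256"):
    start = time.perf_counter()
    hashlib.new(name, data).hexdigest()
    print(f"{name}: {64 / (time.perf_counter() - start):.0f} MB/s")
```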


Where are the checksums stored?

In the Checksum tool I use currently, the checksums themselves are stored in a little text file in each directory of the digital archive. This could be seen as a good thing (all the information you need to verify the files is stored and backed up with the files) or a bad thing (the digital archive is being contaminated with additional administrative files, as are my DROID reports!).

Fixity, however, stores the actual checksums in a /History/ folder and also in a SQLite database. When you first install Fixity it asks you to specify a location for the History folder. I initially changed this to my C: drive to stop it clogging up my limited profile space, but this may not have been the best option in retrospect as my C: drive isn't backed up. It would certainly be useful to have backup copies of the checksums elsewhere, though we could debate at length what exactly should be kept and for how long.
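Out of interest, a SQLite database can be inspected directly with a few lines of Python. I haven't dug into Fixity's actual schema, so the safe first step is simply to list the tables (the database filename below is a placeholder):

```python
import sqlite3

conn = sqlite3.connect("fixity.db")  # placeholder path

# List the tables before assuming anything about the schema
for (table_name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"):
    print(table_name)

conn.close()
```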


How to keep checksums up to date

My current workflow using the Checksum application is to create or update checksums as part of the ingest process or as preservation actions are carried out so that I always have a current checksum for every file that is being preserved.
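The ingest-time step itself is conceptually simple: walk the new accession and record a checksum for each file. Here is a minimal Python sketch of that step (the directory is a placeholder, and the results are simply printed rather than written in any particular manifest format):

```python
import hashlib
import os

def md5_of(path, chunk_size=1024 * 1024):
    # Hash the file in chunks so large files don't exhaust memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for root, _dirs, files in os.walk("new_accession"):  # placeholder path
    for name in files:
        path = os.path.join(root, name)
        print(md5_of(path), path)
```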

I'm unsure how I would do this if I was using Fixity. Is there a way of adding checksums into the database or history log without actually running the whole integrity check? I don't think so.

Given the time it takes Fixity to run, running the whole project might not be a practical solution in many cases so files may sit for a little while before checksums are created. There are risks involved with this.


The verdict

Fixity is a good tool and I would recommend it to digital archivists who are looking for a user-friendly introduction to fixity checking if they don't have a large quantity of digital content. The documentation, reporting and GUI are clear and user friendly and it is pretty easy to get started with it.

One of the nice things about Fixity is that it has been developed for the digital preservation community with our digital preservation use cases in mind. Perhaps the documentation is so clear because it feels like it is written for me (I like to think I am a part of the designated community for Fixity!). In contrast, I find it quite difficult to understand the documentation for Checksum or to know whether I have discovered all that it is capable of.

However, after careful consideration I have decided to stick with Checksum for the time being. The main reason is speed. Clearly Checksum's claims to be a "blisteringly fast, no-nonsense file hashing application for Windows" are not unfounded! I don't have a huge digital archive to manage at 187 GB, but the ability to verify all of my checksums in 1.5 hours instead of 5+ hours is a compelling argument. The digital archive is only going to grow, and speed of operation will become more important over time.

Knowing that I can quickly create or update checksums as part of the ingest or preservation planning process is also a big reason for me to stick to current workflows.

It has been really interesting testing Fixity and comparing it with the functionality of my existing Checksum tool. There are lots of good things about it and it has helped me to review what I want from a fixity checking tool and to examine the workflows and preferences that I have in place with Checksum.

As part of a cycle of continuous improvement as mentioned in the NDSA Fixity Survey Report of 2017 I hope to review file fixity procedures again in another few years. Fixity has potential and I'll certainly be keeping my eye on it as an option for the future.



Jenny Mitcham, Digital Archivist

Wednesday 25 July 2018

Some observations on digital appraisal

A couple of months ago I attended a Jisc sponsored workshop at the University of Westminster on digital appraisal. There were some useful presentations and discussions on a topic that I find both interesting and challenging.

Within the workshop I made the point that my approaches to some elements of digital appraisal may differ depending on the age of the born digital material I'm looking at.

For example, I may wish to take a firm line about removing modern system generated files such as Thumbs.db files and Apple Resource Forks that come into the archives - my reasoning being that this is not really the content that the donor or depositor intended to give us, but rather an artefact of the computer system that they were using.

However I also stated that for an older born digital archive I am much more reluctant to weed out system files or software.

It seems easy to weed out things that you recognise and understand - as is often the case with contemporary digital archives - but for older archives our lack of understanding of what we are seeing can make appraisal decisions much harder and the temptation is to keep things until we understand better what is useful.

I was thinking of a couple of examples we have here at the Borthwick Institute.

The archive of Trevor Wishart includes files dating back to 1985. Trevor Wishart specialises in electroacoustic composition, in particular the transformation of the human voice and natural sounds through technological means. He has also been involved in developing software for electronic composition. His digital archive is a great case study for us with interesting challenges around how we might be able to preserve and provide access to it.

Of course, when I look at this archive there are numerous files that cannot be identified by DROID. It is not always immediately obvious which files are system files, and which are software. Certainly for the time being, there is no intention to appraise out any of the content until we understand it better.

Another good case study...and one I am actively working on right now...is the archive of Marks and Gran, a comedy screenwriting duo who have been writing together since the 1970s.

The digital element of this archive was deposited on a set of 5 1/4 inch floppy disks and includes files dating back to 1984.

When I carried out a first pass at the content of this archive to establish what it contained I encountered 100+ digital examples of screenplays, cast lists and plot outlines (in WordStar 4.0 format) and about 60 other files with various file extensions (.COM, .EXE, .BAT etc) that didn't appear to be created by Marks and Gran themselves.

Software and other system files were clearly present on these disks and this was also evidenced by the disk labels.

But do we want to keep this...are we even allowed to keep it? How can we preserve it effectively if we don't know what it is? Are we allowed to provide access to this material? If not, then what is the point of keeping it at all?

Given that rescuing this archive from the 5 1/4 inch floppy disks was quite a task in the first place, and that the digital archive was small, it didn't seem right to appraise anything out until our knowledge and understanding increased.

As I spend a bit more time working with the Marks and Gran digital archive, this decision turns out to have had direct benefits. Here are a few examples of how:


WordStar

One of the floppy disks that was recovered had the label "Copy of MASTER WORDSTAR DISK (WITH INSTALL)" and indeed this is what it appeared to contain.

Why do we have actual copies of software in archives like this one?

Users of computers in the 1980s and 1990s were often encouraged to make their own backup copies of software. I've mentioned this before in a previous blog post, but there is this gem of information in the online WordStar 4 manual:


There will undoubtedly be numerous copies of early software deposited with archives as a result of this practice of creating a backup disk.

Of course there was an opportunity here - I had lots of WordStar files that were hard to read and I also had a copy of WordStar!

I installed the copy of WordStar on an ancient unsupported PC that sits under my desk and was pretty pleased with myself when I got it working.



Then I had to work out how to use it...

But the manual (and Twitter) helped and it has turned out to be incredibly useful in helping to understand the files within the archive and also to check back on the significant properties of the originals while carrying out file migrations.


WC.EXE

Another file within the archive that I didn't fully understand the purpose of until recently has turned out to be another tool to help with the process of file migration.

After the imperfect file migration triggered by a Windows 10 upgrade I wanted to go back and do some checks on how well this process worked.

If I could find out the number of words and characters within the WordStar files I could then compare these with similar statistics from the migrated files and see if they matched up.

But the first hurdle was how to get this information from a WordStar file in the first place. As with many things, to my modern brain, this was not entirely intuitive!
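(These days you could also attempt a rough count with a script. The Python sketch below masks the high bit that WordStar sets on some characters, skips dot command lines and strips control codes before counting - a crude approximation at best, and as becomes clear below, different tools disagree wildly about what counts as a word anyway.)

```python
def wordstar_counts(path):
    with open(path, "rb") as f:
        raw = f.read()
    # WordStar uses the 8th bit as a flag, so mask it off first
    text = bytes(b & 0x7F for b in raw).decode("ascii", errors="replace")
    # Drop dot command lines (formatting instructions like .he)
    lines = [ln for ln in text.splitlines() if not ln.startswith(".")]
    # Replace remaining control codes (^B, ^D, ^S etc.) with spaces
    cleaned = "".join(ch if ch.isprintable() or ch.isspace() else " "
                      for ch in "\n".join(lines))
    return len(cleaned.split()), len(cleaned)

words, chars = wordstar_counts("REDHEAD.WS4")
print(words, "words,", chars, "characters")
```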

However, reading the manual revealed that there is an additional word counting utility called WC.EXE that comes with WordStar.

Word counting advice from the WordStar 4.0 manual


Wouldn't it be great if I could find a copy of this utility!

As luck would have it, there was a copy in the digital archive!

I copied it across (on a high tech 3.5 inch floppy disk) to the old PC and got it working very quickly.

And it does what it says it will - here is the result for a file called REDHEAD.WS4


Using WC.EXE to count words and characters in WordStar


I then checked these stats against the stats for the migrated copy of REDHEAD.WS4 in Word 2016 and naively hoped they would match up.


Word counts for the same file in Word 2016

As you can see, the results were alarmingly different! (and note that unticking the box for textboxes, footnotes and endnotes doesn't help).

Twitter is a great help!

Furthermore, it was suggested by Andy Jackson on Twitter that WordStar may also be counting formatting information at the start of a file, though I'm still unclear as to how this would add approximately 1,300 words. It is apparent that word and character counts are not to be trusted!

So back to equally imperfect manual methods and visual inspection...having spent some time with these files I am fairly confident that the content of the documents has been captured adequately.

Although WC.EXE didn't turn out to be such a useful file for assessing my migrations, if I hadn't had a copy of it I could have wasted a lot of time looking for it!


Print test pages

Another file within the Marks and Gran digital archive that would not necessarily be considered to be archival is PRINT.BAK. This WordStar file (illustrated below) doesn't look like something that was created by Marks and Gran.

However, the file has turned out to be hugely useful to me as I try to understand and migrate the WordStar files. It describes (and most importantly demonstrates) some of the ways that you can format text within WordStar and (in theory) shows how they will appear when printed.

This would have been quite important information for users of a word processor that is not WYSIWYG!

A migrated version of a file called PRINT.BAK, present on the MASTER WORDSTAR DISK

From my perspective, the great thing about this file is that by putting it through the same migration methodology as the other WordStar files within the archive, I can make an assessment of how successful the migration has been at capturing all the different types of formatting.


Here is how the print test page looks in WordStar - it shows the mark up used to display different formatting.

I thought my migrated version of the print test page looked pretty good until I opened up the original in WordStar and noted all the additional commands that have not been captured as part of the migration process.
  • See for example the .he command at the top of the page, which specifies that the document should be printed with a header - two lines of header text are defined, along with 'Page #' for the page number.
  • Note also the ^D mark up that wraps the text 'WordStar' - this tells the printer to double strike the text - effectively creating a light bold face as the printer prints the characters twice. 
  • The print test page includes an example of the strikeout command which should look like this.
  • It also includes examples of variable pitch which should be visible as a change in character width (font size).


Backup files

I described in a previous blog how WordStar automatically creates backup files each time a file is saved and I asked the question 'should we keep them?'

At this point in time I'm very glad that we have kept them.

The backup files have been useful as a point of comparison when encountering what appears to be some localised corruption on one particular floppy disk.

A fragment of one of the corrupt files in WordStar. The text no longer makes sense at a particular point in the document.

Looking at the backup file of the document illustrated above I can see a similar, but not identical, issue. It happens in roughly the same location in the document but some of the missing words can be recovered, providing opportunities to repair or retrieve more information from the archive because of the presence of these contemporary backups.


To conclude

When I first encountered the Marks and Gran digital archive I was not convinced about the archival value of some of the files.

Specifically I was not convinced that we should be keeping things like system files and software or automatically generated backup files unless they were created with deliberate intent by Marks and Gran themselves.

However, as I have worked with the archive more and come to understand the files and how they were created, I have found some of these system files to be incredibly useful in moving forward my understanding of the WordStar files as a whole.

I'm not suggesting we should keep everything but I am suggesting that we should be cautious about getting rid of things that we don't fully understand...they may be useful one day.




Jenny Mitcham, Digital Archivist

Friday 6 July 2018

Accessibility and usability report for AtoM

Earlier this year I blogged about our recent upgrade to AtoM 2.4 and hinted at a follow up post on the subject of usability and accessibility.

AtoM is our Archives Management System and as well as being a system that staff use to enter information about the archives that we hold, it is also the means by which our users find out about our holdings. We care very much what our users think of it.

When we first released the Borthwick Catalogue (using AtoM 2.2) back in 2016 we were lucky enough to have some staff resource available to carry out a couple of rounds of user testing - this user testing is documented here and here.

We knew that some of the new features of AtoM 2.4 would help address issues that were raised by our users in these earlier tests. The addition of ‘shopping bag’ functionality in the form of the new clipboard feature, and the introduction of an advanced search by date range being two notable examples.

We did not have the capacity to carry out a similar user testing project when we upgraded to 2.4 but as a department committed to Customer Service Excellence we were very keen to consider our users as we rolled out the new version.

We decided to take a light touch approach to this problem and take advantage of the expertise of other colleagues across the University.

We approached our Marketing team with a request for support and were pleased that the Senior User Experience Designer was able to act as a critical friend, giving us a couple of hours of his time to take a look at our catalogue from a usability perspective and give us some feedback. Thanks!

A quick note:
We use a slightly customised version of the Dominion theme. Some of the issues raised will be relevant to other AtoM users (particularly those with the Dominion theme) and others are more specific to our own customisations (for example the colours we use on our interface).
The Borthwick Catalogue at iPhone 5 size, with the browse button partially visible

Mobile responsiveness

We have already done some work on mobile responsiveness but our usability guru highlighted that there was still work to do. At some screen sizes parts of the top navigation were cut off the screen. At other screen sizes a white block appears to the left of the search box. This appears to be an issue not limited to our AtoM instance.

One of the customisations we have is a footer (similar to that found on our website) at the bottom of every page. It was noted that at certain screen sizes (below 1200px) the right end of the footer is cut off.

I had some discussion with a technical colleague who supports AtoM and we agreed that whilst this isn’t a deal breaker, we would like it to be better. He is going to investigate this in more detail at a later date.

Our footer for the Borthwick Catalogue - with information on the right truncated at smaller screen sizes


Bold text not appearing on a Mac

When viewing our catalogue on a Mac, it was noted that text that should have been bold was not appearing as such. This is a minor irritation but again something we would like to investigate further.


Colour contrast

There was a fair amount of feedback on colour contrast and how this may be problematic for some of our users.

Specific areas of concern were:

  • The text colours we had specifically chosen for our theme - for example the light orange and grey in this example
The light orange and grey text that we use for our catalogue may not stand out enough to be legible
  • The buttons (for example ‘search’ and ‘reset’) in this example look a little faded.

Buttons in AtoM use subdued colours which may cause problems for some users
  • The clipboard notification bubble in the top menu bar is blue on dark grey. There were concerns about the visibility of this.
Lack of contrast between the blue and dark grey here

I need to try and find some more suitable (stronger) colours for our theme. Never an easy task but the WebAIM colour contrast checker is my friend!
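For reference, the number these checkers compute is the WCAG contrast ratio (WCAG AA asks for at least 4.5:1 for normal text), and it is easy enough to reproduce. A small Python sketch, with illustrative rather than actual theme colours:

```python
def relative_luminance(hex_colour):
    # Convert #RRGGBB to linearised sRGB and weight the channels (WCAG 2.x)
    r, g, b = (int(hex_colour.lstrip("#")[i:i + 2], 16) / 255 for i in (0, 2, 4))
    lin = lambda c: c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b)

def contrast_ratio(fg, bg):
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# Illustrative: a light orange on white (not our exact theme colours)
print(f"{contrast_ratio('#E8A33D', '#FFFFFF'):.2f}:1")
```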


Colour alone used to indicate links

It was noted that colour was the only method of indicating that some of the text within the interface was a link. 

This is not too much of a problem in some areas of the site (for example in the navigation bar on the left where it is more obvious that this will be linked text) but particularly when included in the static pages this could be problematic for some users.

An example of how colour is used to indicate links in our static pages - these visual clues will not be apparent to some users


Alt text for images

Alt text gives visually impaired users a description of an image on a website and is particularly necessary if an image conveys important information. Read more about alt text here.

The images within the AtoM interface (for example in the static pages) utilise the image filename as the alt text.

Having a file name as alt text is not always very helpful. It would be better to have no alt text if the images are purely illustrative or to allow AtoM administrators to add their own alt text to those images that do convey important information.

There doesn’t appear to be a method of including alt text in AtoM currently. It would be helpful if the syntax to include an image in static pages supported an additional alt text argument. 

Living in hope, I tried to include some alt text for an image within the static pages but unfortunately this was not picked up when the page was rendered in the web browser.
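In the meantime it is at least easy to audit a page for this problem. A small Python sketch using only the standard library, flagging images whose alt text looks like a bare filename (the URL is a placeholder):

```python
import urllib.request
from html.parser import HTMLParser

class AltTextAudit(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        alt = a.get("alt") or ""
        # Flag alt text that is just a filename, as described above
        if alt.lower().endswith((".jpg", ".jpeg", ".png", ".gif")):
            print("Filename used as alt text:", repr(alt), "on", a.get("src"))

html = urllib.request.urlopen("https://example.org/catalogue").read()
AltTextAudit().feed(html.decode("utf-8", "replace"))
```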


Login button

There is a login button in the top right corner of AtoM. This is because staff use the same interface to create records in AtoM as users use to search and browse. By logging in, staff have access to information and features that are not available to other users (for example the accessions records, draft records and import and export options).

A user coming to our catalogue may see the login button and wonder if they need to login in order to unlock additional functionality or perhaps to enable them to save their session or the contents of their clipboard.

By default the login button on the AtoM interface is simply labelled 'Log in'. I discussed with colleagues and the AtoM mailing list how we could avoid confusion for users and we talked through several different options.

For now we have taken the quick approach of altering the label on the button so it now reads 'Admin Login'. This makes it clearer to our users that the feature is not intended for them. 

Artefactual later announced that they had been able to carry out some work to resolve our problem and have added the login button to the Visible Elements module of AtoM - enabling administrators to configure whether they want to show or hide the button for unauthenticated users. This will be available in AtoM 2.5 - thanks Artefactual!


Clipboard feature

A couple of comments were made about the clipboard functionality. 

Firstly, that users may be confused by the fact that the clipboard can appear to be empty if for example you have been collecting Authority Records rather than Archival Descriptions. As the clipboard shows Archival Descriptions by default, it is not immediately obvious that you may need to change the entity type in order to see the items you have selected. 

Secondly that the feedback available to users when adding or removing items from the clipboard is not always clear and consistent. When viewing a list of results the clipboard icon just changes colour to indicate an item has been added.

The small clipboard icon on the right of each item in a list changes colour when you add or remove the item


However, when on an archival description or authority record page, there is also a text prompt which helps explain to the user what the icon is for and what action is available. This is much more helpful for users, particularly if they haven't used an AtoM catalogue or encountered the clipboard feature before.

When viewing a record, the clipboard icon to the right is much clearer and more user friendly



So, some useful food for thought here. It is always good to be aware of potential usability and accessibility issues and to open up discussions about how things could be improved.

I hope these findings are of interest to other AtoM users.


Jenny Mitcham, Digital Archivist
