Tuesday 31 July 2018

Checksum or Fixity? Which tool is for me?

The digital preservation community are in agreement that file fixity and data integrity are important.

Indeed there is a whole row devoted to this in the NDSA Levels of Preservation. But how do we all do it? There are many tools out there - see for example those listed in COPTR.

It was noted in the NDSA Fixity Survey Report of 2017 that there isn't really a consensus on how checksums are created and verified across the digital preservation community. Many respondents to the survey also gave the impression that current procedures were a work in progress and that other options should be explored.

From the conclusion of the report:
"Respondents also talked frequently about improvements they wanted to make, such as looking for new tools, developing workflows, and making other changes to their fixity activities. It may be useful, then, for practitioners to think of their fixity practices within a maturity or continuous improvement model; digital preservation practitioners develop a set of ideal practices for their institution and continuously evaluate methods and the allocation of available resources to get as close as possible to their ideal state."

This nicely sums up how I feel about a lot of the digital preservation routines I put into practice ...but continuous improvement needs time.

Life is busy and technology moves fast.

I realised that I haven't reviewed my tools and procedures for fixity checking since 2013.

A recent upgrade of my PC to Windows 10 gave me a good excuse to change this. Since I was going to have to re-install and configure all my tools post-upgrade anyway, this was the catalyst I needed to review the way that I currently create and verify checksums within the digital archive.

Current procedures using Checksum

Since 2013 I have been using a tool called Checksum to generate and verify checksums. Checksum describes itself as a "blisteringly fast, no-nonsense file hashing application for Windows" and this has worked well for me in practice. One of the key selling points of Checksum is its speed. This has become a more important requirement over time as the digital archive has grown in size.

I have a set of procedures around this tool that document and describe how it is configured and how it is used, both as part of the ingest process and as a routine integrity checking task. I keep a log of the dates that checksums are verified, numbers of files checked and the results of these checks and am able to respond to issues and errors as and when they occur.

There has been little drama on this score over the last 5 years apart from a brief blip when my checksums didn't match and the realisation that my integrity checking routine wasn't going to catch things like changed last modified dates.
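To illustrate the point about last modified dates: a checksum is calculated from a file's contents only, so a change to the timestamp leaves it untouched. The following is a minimal Python sketch of that behaviour, not a description of how Checksum itself works; the helper names md5_of and demo_mtime_change are made up for this example.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the MD5 hex digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo_mtime_change():
    """Alter a file's last modified date and return (checksum, mtime) before and after."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "example.txt")
        with open(path, "wb") as handle:
            handle.write(b"archived content")

        before = (md5_of(path), os.stat(path).st_mtime)
        # Push the last modified date back by one day; the contents are untouched.
        one_day_earlier = before[1] - 86400
        os.utime(path, (one_day_earlier, one_day_earlier))
        after = (md5_of(path), os.stat(path).st_mtime)

    return before, after


before, after = demo_mtime_change()
print("checksum unchanged:", before[0] == after[0])
print("modified date unchanged:", before[1] == after[1])
```

If dates matter for your archive, they would need to be recorded and compared separately from the checksum.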

I'm not unhappy with Checksum but I am aware that there is room for improvement. Getting it set up and configured correctly isn't as easy as I would like. I sometimes wonder if there are things I'm missing. In the past I have scheduled a regular checksum verification task using Windows Task Scheduler as this is not a feature of Checksum itself but more recently I've just been initiating it manually on a regular schedule.


Introducing Fixity

Fixity is a free tool from AVP. It has been around since 2013 but hasn't hit my radar until recent times. It was mentioned several times in the NDSA Fixity Survey Report and I was keen to try it out.



Fixity was created in recognition of the key role that checksum generation and validation have in digital preservation workflows. The intention of the developers was to provide institutions with a simple and low cost (in fact...free) tool that allows checksums to be generated and validated and that enables tasks to be scheduled and reports to be generated.

The Fixity User Guide is a wonderful thing. From personal experience I would say that one of the key barriers to adopting new tools is not being able to understand exactly what they do and why.

Documentation for open source tools can sometimes be a bit scattered and impenetrable, or occasionally too technical for me to understand - not the Fixity User Guide!

It starts off by explaining what problem it is trying to solve and it includes step by step instructions with screen shots and an FAQ at the back. Full marks!

Testing Fixity

The Graphical User Interface (GUI)

I like the interface for Fixity. It is clear and easy to understand, and it gives you the flexibility to manage different fixity checks you may want to set up for different types of content or areas of your filestore.

First impressions were that it is certainly easier to use than the Checksum tool I use currently. On the downside though, there were a few glitches or bugs that I encountered when using it.

Yes, I did manage to break it.

Yes, I did have to use the Task Manager to shut it down on several occasions.

Reporting

The reporting is good. I found this to be clearer and more flexible than the reports generated by Checksum. It helps that it is presented in a format that can be opened in a spreadsheet application - this means that you can explore the results in whatever way you like.

Fixity will send you an email summary with statistics about how many files were changed, new, removed or moved/renamed. Attached to this email is a tab delimited file that includes all the details. This can be read into a spreadsheet application for closer inspection. You can quickly filter by the status column and focus in on those files that are new or changed. A helpful way of checking whether all is as it should be.
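The same filtering can be done outside a spreadsheet. Here is a rough Python sketch that reads a tab delimited report and keeps only the rows with a status of interest. The column names ("status", "path"), the status labels and the sample rows are all assumptions made for illustration; a real Fixity report may be laid out differently.

```python
import csv
import io

# Hypothetical sample data: real Fixity reports may use different
# column names and status labels.
SAMPLE_REPORT = (
    "status\tpath\n"
    "Confirmed\tarchive/a.tif\n"
    "New\tarchive/b.tif\n"
    "Changed\tarchive/c.tif\n"
)


def rows_needing_attention(report_file, statuses=("New", "Changed")):
    """Return the report rows whose status is one of the given labels."""
    reader = csv.DictReader(report_file, delimiter="\t")
    return [row for row in reader if row["status"] in statuses]


flagged = rows_needing_attention(io.StringIO(SAMPLE_REPORT))
for row in flagged:
    print(row["status"], row["path"])
```

To use this on a real report, open the attached file and pass the file object in place of the io.StringIO sample, adjusting the column name and labels to match.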

A useful summary report emailed to me from Fixity

One of the limitations of Checksum is that I can add new files into a directory and if I forget to update the checksum file it will not tell me that I have a new file - in theory that file could go without having its fixity created or checked for some time...or, if I enable the 'synchronise' option it will add the new checksums to the hash file when it next does a verification task. This is helpful, but perhaps I don't want them to be added without notifying me in some way. I would prefer to cast my eye over them to double check that they should be in there.
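The kind of check described here, listing new files for review rather than silently adding them, could be sketched in Python as below. This is an illustration only: the manifest file name and its "digest, two spaces, file name" layout are assumptions for the example and are not taken from Checksum's own hash file format.

```python
import os
import tempfile

MANIFEST_NAME = "checksums.txt"  # hypothetical name, not Checksum's own


def read_manifest(folder):
    """Parse '<hex digest>  <file name>' lines into a {name: digest} dict."""
    entries = {}
    with open(os.path.join(folder, MANIFEST_NAME), encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            digest, name = line.split(maxsplit=1)
            entries[name] = digest
    return entries


def unlisted_files(folder):
    """Return files present in the folder that have no manifest entry."""
    listed = read_manifest(folder)
    present = {
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name)) and name != MANIFEST_NAME
    }
    return sorted(present - set(listed))


# Small demonstration: two files on disk, only one listed in the manifest.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write("content of " + name)
    with open(os.path.join(folder, MANIFEST_NAME), "w", encoding="utf-8") as handle:
        handle.write("0" * 32 + "  a.txt\n")  # placeholder digest
    missing = unlisted_files(folder)

print(missing)
```

The list of unlisted files could then be reviewed by eye before any checksums are created for them.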


Scheduling

You can use the Fixity GUI to schedule checksum verification events - so you don't have to kick off your weekly or monthly scan manually. This is a nice feature. If your PC isn't switched on when a scheduled scan is due to start, it simply delays it until you next switch your computer on.

One of the downsides of running a scheduled scan in Fixity is that there is little feedback on the screen to let you know what it is doing and where it has got to. Also given that a scan can take quite a while (over 5 hours in my case) it would be helpful to have a notification to remind you that the scheduled scan is going to start and to allow you to opt out or reschedule if you know that your PC isn't going to be switched on for long enough to make it worthwhile.


Managing projects

The Fixity GUI will allow you to manage more than one project - meaning that you could set up and save different scans on different content using different settings. This is a great feature and a real selling point if you have this use case (I don't at the moment).

One downside I found when testing this was that, when moving between different projects, Fixity kept telling me that there were unsaved changes in my project and asking if I wanted to save. I don't believe there were any unsaved changes at this point, so this is perhaps a bug?

Also, selecting the checksum algorithm for your project can be a little clunky. You have to save your project before you can choose which algorithm you would like to use. This feature is hidden up in the Preferences menu but I'd prefer to see it up front and visible in the GUI so you can't forget about it or ignore it.

I thought I had set my project up to use the MD5 algorithm but when looking at the summary report I realise it is using SHA256. I thought it would be a fairly easy procedure to go back into my project and change the algorithm but now Fixity is stuck on a message saying 'Reading Files, please wait...'. It may be reprocessing all of the files in my project but I don't know for sure because there is no indicator of progress. If this is the case I expect I will have to switch my PC off before it has finished.


Progress

Following on from the comment above...one really useful addition to Fixity would be for it to give an indication of how far through a checksum verification task it is (and how long it has left). To be fair, the feedback around progress in Checksum is not exact on timings - but it does give a clear notification about how many checksum files it still has to look at (there is one in each directory) and how many individual files are left to look at in the directory it is currently in.


Timings

When you kick off a checksum verification task Fixity comes up with a message that says 'Please do not close until a report is generated'. This is useful, but it doesn't give an indication of how long the scan might take, so you have no idea when you kick off the task how long you will need to keep your computer on for. I've had to close it down in the middle of the scan on a few occasions and I don't know whether this has any impact on the checksums that are stored for next time.


The message you get when you manually initiate a scan using Fixity

Fixity certainly takes longer to run a scan than the Checksum tool that I have used over the last few years. The last scan of the digital archive took 5 hours and 49 minutes (though this was using SHA256). The fixity check using Checksum (with MD5) takes around one and a half hours so the difference is not inconsequential.

Note I'm currently trying to change the checksum algorithm on my Fixity test project to MD5 so I can do a fairer comparison and it has been thinking about it for over 3 hours.
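The algorithm is only one of the things that affects how long a scan takes (disk and network speed matter too), but its contribution can be measured in isolation. The Python sketch below hashes the same block of data with MD5 and SHA256 and times each one. It is a rough illustration using Python's standard hashlib module, not a reflection of how either Checksum or Fixity is implemented, and the timings will vary from machine to machine.

```python
import hashlib
import time


def time_algorithm(name, data, chunk_size=1024 * 1024):
    """Hash data in chunks with the named algorithm; return (hex digest, seconds)."""
    digest = hashlib.new(name)
    start = time.perf_counter()
    for offset in range(0, len(data), chunk_size):
        digest.update(data[offset:offset + chunk_size])
    elapsed = time.perf_counter() - start
    return digest.hexdigest(), elapsed


# 16 MB of zero bytes held in memory, so disk speed does not affect the result.
data = bytes(16 * 1024 * 1024)

results = {name: time_algorithm(name, data) for name in ("md5", "sha256")}
for name, (hexdigest, seconds) in results.items():
    print(f"{name}: {seconds:.3f} s, digest length {len(hexdigest)}")
```

Running both algorithms over the same data on the same machine gives a like-for-like comparison of the hashing step alone.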


Where are the checksums stored?

In the Checksum tool I use currently, the checksums themselves are stored in a little text file in each directory of the digital archive. This could be seen as a good thing (all the information you need to verify the files is stored and backed up with the files) or a bad thing (the digital archive is being contaminated with additional administrative files, as are my DROID reports!).

Fixity, however, stores the actual checksums in a /History/ folder and also in a SQLite database. When you first install Fixity it asks you to specify a location for the History folder. I initially changed this to my C: drive to stop it clogging up my limited profile space, but this may not have been the best option in retrospect as my C: drive isn't backed up. It would certainly be useful to have backup copies of the checksums elsewhere, though we could debate at length what exactly should be kept and for how long.


How to keep checksums up to date

My current workflow using the Checksum application is to create or update checksums as part of the ingest process or as preservation actions are carried out so that I always have a current checksum for every file that is being preserved.
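The idea of adding checksums at ingest without re-checking everything else can be sketched in a few lines of Python. This is a simplified illustration rather than the way Checksum does it: the function name add_missing_checksums is invented for the example, the store of checksums is an in-memory dictionary, and file contents are passed in as bytes to keep the sketch short.

```python
import hashlib


def add_missing_checksums(store, files):
    """Add digests for names not yet in the store; return the names added.

    store: dict of {name: hex digest}; files: dict of {name: bytes}.
    Existing entries are left untouched, so later verification still
    compares against the value that was originally recorded.
    """
    added = []
    for name, content in sorted(files.items()):
        if name not in store:
            store[name] = hashlib.sha256(content).hexdigest()
            added.append(name)
    return added


# One file already has a recorded checksum; one has just been ingested.
store = {"a.txt": hashlib.sha256(b"first file").hexdigest()}
original_digest = store["a.txt"]
files = {"a.txt": b"first file", "b.txt": b"newly ingested file"}

added = add_missing_checksums(store, files)
print(added)
```

Only the newly ingested file is hashed, which is what keeps this step quick compared with a full verification run.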

I'm unsure how I would do this if I was using Fixity. Is there a way of adding checksums into the database or history log without actually running the whole integrity check? I don't think so.

Given the time it takes Fixity to run, running the whole project might not be a practical solution in many cases so files may sit for a little while before checksums are created. There are risks involved with this.


The verdict

Fixity is a good tool and I would recommend it to digital archivists who are looking for a user-friendly introduction to fixity checking if they don't have a large quantity of digital content. The documentation, reporting and GUI are clear and user friendly and it is pretty easy to get started with it.

One of the nice things about Fixity is that it has been developed for the digital preservation community with our digital preservation use cases in mind. Perhaps the reason why the documentation is so clear is because it feels like it is written for me (I like to think I am a part of the designated community for Fixity!). In contrast, I find it quite difficult to understand the documentation for Checksum or to know whether I have discovered all that it is capable of.

However, after careful consideration I have decided to stick with Checksum for the time being. The main reason is speed. Clearly Checksum's claims to be a "blisteringly fast, no-nonsense file hashing application for Windows" are not unfounded! I don't have a huge digital archive to manage at 187 GB but the ability to verify all of my checksums in 1.5 hours instead of 5+ hours is a compelling argument. The digital archive is only going to grow, and speed of operation will become more important over time.

Knowing that I can quickly create or update checksums as part of the ingest or preservation planning process is also a big reason for me to stick to current workflows.

It has been really interesting testing Fixity and comparing it with the functionality of my existing Checksum tool. There are lots of good things about it and it has helped me to review what I want from a fixity checking tool and to examine the workflows and preferences that I have in place with Checksum.

As part of a cycle of continuous improvement as mentioned in the NDSA Fixity Survey Report of 2017 I hope to review file fixity procedures again in another few years. Fixity has potential and I'll certainly be keeping my eye on it as an option for the future.



Jenny Mitcham, Digital Archivist

1 comment:

  1. Hi Jenny!

    Thank you so much for such a thorough review of our tool Fixity. We truly appreciate users’ feedback as it allows us to build better tools for the community. We have taken note of the bugs and feature requests you have mentioned and will incorporate these into ongoing development efforts for Fixity. Throughout your comments there are two primary points that emerge:

    1. The need to provide users with feedback on progress and timing. We hear that and agree that this will be a useful addition in the future.
    2. The length of time it takes to perform scans. This is something we’re looking into and we believe that the lack of speed may be specific to the Windows version of the application. Regardless, we will continue to look into this and see what’s going on there.

    Setting those issues aside there are a few comments that we wanted to offer which may serve as helpful tips and make the Fixity experience easier and more understandable. The headings below correspond with the headings in your blog.

    Scheduling
    Regardless of speed, the issues raised under this section of your blog can be tricky with large amounts of data, especially if the data being scanned is on a network drive. We have a couple of thoughts to share:

    1. It may very well be worthwhile to dedicate a machine to this task because it is rather intensive.
    2. Launching Fixity as an admin when setting up schedules will create the tasks within Windows Task Scheduler such that they will be able to run when you are logged off. This allows you to log off and still have your scheduled scans run at night or over the weekend when you’re not there.

    Managing Projects

    Regarding the switching of checksum algorithms we use an approach which requires that all files be scanned and validated before the checksum algorithm change is implemented. If we didn’t do this there would be a loophole in the logic where a change in algorithm could result in incomplete reporting and a failure to identify changes, additions, deletions, moves, and file renames. It is time intensive but the diligent approach lives up to the application’s charge of making sure you know what’s going on with your files, especially at points where things are liable to slip through the cracks. We hope that changing algorithms is an infrequent event. However when it is done we believe that it’s significant enough to require the thorough and careful process we have designed for Fixity.

    Where are the checksums stored?

    The latest version of Fixity (v1.2) allows you to change the location of where your reports are stored as you like.

    How to keep checksums up to date

    Many of the workflows that use Fixity have checksums prior to arriving at the archive. For instance, when Exactly and Fixity are used in tandem, Exactly creates checksums at the point of bagging files. At the point of receipt the bag can be validated, with file attendance confirmed and all file checksums verified. Upon deposit to the archive, there may be a period of time before Fixity is run and the new files are added into the Fixity database. In these instances, the checksum from Exactly can serve as the interim checksum of record until Fixity is run. When Fixity is run the two checksums (one in Exactly and one in Fixity) can be compared to confirm that they match and the checksum of record can now live in Fixity. Having said all of this, it would be good to hear what your ideal scenario would be so that we can think about accommodating additional workflows.

    Thanks again, and please keep your comments coming!

    Best,

    Pamela
    (On behalf of Fixity Team)

