Digital Archiving at the University of York: Checksum

Showing posts with label Checksum. Show all posts

Tuesday, 31 July 2018

Checksum or Fixity? Which tool is for me?

The digital preservation community are in agreement that file fixity and data integrity are important.

Indeed there is a whole row devoted to this in the NDSA Levels of Preservation. But how do we all do it? There are many tools out there - see for example those listed in COPTR.

It was noted in the NDSA Fixity Survey Report of 2017 that there isn't really a consensus on how checksums are created and verified across the digital preservation community. Many respondents to the survey also gave the impression that current procedures were a work in progress and that other options should be explored.

From the conclusion of the report:

"Respondents also talked frequently about improvements they wanted to make, such as looking for new tools, developing workflows, and making other changes to their fixity activities. It may be useful, then, for practitioners to think of their fixity practices within a maturity or continuous improvement model; digital preservation practitioners develop a set of ideal practices for their institution and continuously evaluate methods and the allocation of available resources to get as close as possible to their ideal state."

This nicely sums up how I feel about a lot of the digital preservation routines I put into practice ...but continuous improvement needs time.

Life is busy and technology moves fast.

I realised that I haven't reviewed my tools and procedures for fixity checking since 2013.

A recent upgrade of my PC to Windows 10 gave me a good excuse to change this. Being that I was going to have to re-install and configure all my tools post-upgrade anyway, this was the catalyst I needed to review the way that I currently create and verify checksums within the digital archive.

Current procedures using Checksum

Since 2013 I have been using a tool called Checksum to generate and verify checksums. Checksum describes itself as a "blisteringly fast, no-nonsense file hashing application for Windows" and this has worked well for me in practice. One of the key selling points of Checksum is it's speed. This has become a more important requirement over time as the digital archive has grown in size.

I have a set of procedures around this tool that document and describe how it is configured and how it is used, both as part of the ingest process and as a routine integrity checking task. I keep a log of the dates that checksums are verified, numbers of files checked and the results of these checks and am able to respond to issues and errors as and when they occur.

There has been little drama on this score over the last 5 years apart from a brief blip when my checksums didn't match and the realisation that my integrity checking routine wasn't going to catch things like changed last modified dates.

I'm not unhappy with Checksum but I am aware that there is room for improvement. Getting it set up and configured correctly isn't as easy as I would like. I sometimes wonder if there are things I'm missing. In the past I have scheduled a regular checksum verification task using Windows Task Scheduler as this is not a feature of Checksum itself but more recently I've just been initiating it manually on a regular schedule.

Introducing Fixity

Fixity is a free tool from AVP. It has been around since 2013 but hasn't hit my radar until recent times. It was mentioned several times in the NDSA Fixity Survey Report and I was keen to try it out.

Fixity was created in recognition of the key role that checksum generation and validation have in digital preservation workflows. The intention of the developers was to provide institutions with a simple and low cost (in fact...free) tool that allows checksums to be generated and validated and that enables tasks to be scheduled and reports to be generated.

The Fixity User Guide is a wonderful thing. From personal experience I would say that one of the key barriers to adopting new tools is not being able to understand exactly what they do and why.

Documentation for open source tools can sometimes be a bit scattered and impenetrable, or occasionally too technical for me to understand - not the Fixity User Guide!

It starts off by explaining what problem it is trying to solve and it includes step by step instructions with screen shots and an FAQ at the back. Full marks!

Testing Fixity

The Graphical User Interface (GUI)

I like the interface for Fixity. It is clear and easy to understand, and it gives you the flexibility to manage different fixity checks you may want to set up for different types of content or areas of your filestore.

First impressions were that it is certainly easier to use than the Checksum tool I use currently. On the downside though, there were a few glitches or bugs that I encountered when using it.

Yes, I did manage to break it.

Yes, I did have to use the Task Manager to shut it down on several occasions.

Reporting

The reporting is good. I found this to be clearer and more more flexible than the reports generated by Checksum. It helps that it is presented in a format that can be opened in a spreadsheet application - this means that you can explore the results in whatever way you like.

Fixity will send you an email summary with statistics about how many files were changed, new, removed or moved/renamed. Attached to this email is a tab delimited file that includes all the details. This can be read into a spreadsheet application for closer inspection. You can quickly filter by the status column and focus in on those files that are new or changed. A helpful way of checking whether all is as it should be.

A useful summary report emailed to me from Fixity

One of the limitations of Checksum is that I can add new files into a directory and if I forget to update the checksum file it will not tell me that I have a new file - in theory that file could go without having its fixity created or checked for some time...or, if I enable the 'synchronise' option it will add the new checksums to the hash file when it next does a verification task. This is helpful, but perhaps I don't want them to be added without notifying me in some way. I would prefer to cast my eye over them to double check that they should be in there.

Scheduling

You can use the Fixity GUI to schedule checksum verification events - so you don't have to kick off your weekly or monthly scan manually. This is a nice feature. If your PC isn't switched on when a scheduled scan is due to start, it simply delays it until you next switch your computer on.

One of the downsides of running a scheduled scan in Fixity is that there is little feedback on the screen to let you know what it is doing and where it has got to. Also given that a scan can take quite a while (over 5 hours in my case) it would be helpful to have a notification to remind you that the scheduled scan is going to start and to allow you to opt out or reschedule if you know that your PC isn't going to be switched on for long enough to make it worthwhile.

Managing projects

The Fixity GUI will allow you to manage more than one project - meaning that you could set up and save different scans on different content using different settings. This is a great feature and a real selling point if you have this use case (I don't at the moment).

One downside I found when trying to test this out was when moving between different projects Fixity kept giving me the message that there are unsaved changes in my project and asking if I want to save - I don't believe there were unsaved changes at this point so this is perhaps a bug?

Also, selecting the checksum algorithm for your project can be a little clunky. You have to save your project before you can choose which algorithm you would like to use. This feature is hidden up in the Preferences menu but I'd prefer to see it up front and visible in the GUI so you can't forget about it or ignore it.

I thought I had set my project up to use the MD5 algorithm but when looking at the summary report I realise it is using SHA256. I thought it would be a fairly easy procedure to go back into my project and change the algorithm but now Fixity is stuck on a message saying 'Reading Files, please wait...'. It may be reprocessing all of the files in my project but I don't know for sure because there is no indicator of progress. If this is the case I expect I will have to switch my PC off before it has finished.

Progress

Following on from the comment above...one really useful addition to Fixity would be for it to give an indication of how far through a checksum verification task it is (and how long it has left). To be fair, the feedback around progress in Checksum, is not exact around timings - but it does give a clear notification about how many checksum files it still has to look at (there is one in each directory) and how many individual files are left to look at in the directory it is currently in.

Timings

When you kick off a checksum verification task Fixity comes up with a message that says 'Please do not close until a report is generated'. This is useful, but it doesn't give an indication of how long the scan might take, so you have no idea when you kick off the task how long you will need to keep your computer on for. I've had to close it down in the middle of the scan on a few occasions and I don't know whether this has any impact on the checksums that are stored for next time.

The message you get when you manually initiate a scan using Fixity

Fixity certainly takes longer to run a scan than the Checksum tool that I have used over the last few years. The last scan of the digital archive took 5 hours and 49 minutes (though this was using SHA256). The fixity check using Checksum (with MD5) takes around 1 and a half hours so the difference is not inconsequential.

Note I'm currently trying to change the checksum algorithm on my Fixity test project to MD5 so I can do a fairer comparison and it has been thinking about it for over 3 hours.

Where are the checksums stored?

In the Checksum tool I use currently, the checksums themselves are stored in a little text file in each directory of the digital archive. This could be seen as a good thing (all the information you need to verify the files is stored and backed up with the files) or a bad thing (the digital archive is being contaminated with additional administrative files, as are my DROID reports!).

Fixity however, stores the actual checksums in a /History/ folder and also in a SQLite database. When you first install Fixity it asks you to specify a location for the History folder. I initially changed this to my C: drive to stop it clogging up my limited profile space, but this may not have been the best option in retrospect as my C: drive isn't backed up. It would certainly be useful to have backup copies of the checksums elsewhere, though we could debate at length what exactly should be kept and for how long.

How to keep checksums up to date

My current workflow using the Checksum application is to create or update checksums as part of the ingest process or as preservation actions are carried out so that I always have a current checksum for every file that is being preserved.

I'm unsure how I would do this if I was using Fixity. Is there a way of adding checksums into the database or history log without actually running the whole integrity check? I don't think so.

Given the time it takes Fixity to run, running the whole project might not be a practical solution in many cases so files may sit for a little while before checksums are created. There are risks involved with this.

The verdict

Fixity is a good tool and I would recommend it to digital archivists who are looking for a user-friendly introduction to fixity checking if they don't have a large quantity of digital content. The documentation, reporting and GUI are clear and user friendly and it is pretty easy to get started with it.

One of the nice things about Fixity is that it has been developed for the digital preservation community with our digital preservation use cases in mind. Perhaps the reason why the documentation is so clear is because it feels like it is written for me (I like to think I am a part of the designated community for Fixity!). In contrast, I find it quite difficult to understand the documentation for Checksum or to know whether I have discovered all that it is capable of.

However, after careful consideration I have decided to stick with Checksum for the time being. The main reason being speed. Clearly checksum's claims to be a "blisteringly fast, no-nonsense file hashing application for Windows" are not unfounded! I don't have a huge digital archive to manage at 187 GB but the ability to verify all of my checksums in 1.5 hours instead of 5+ hours is a compelling argument. The digital archive is only going to grow, and speed of operation will become more important over time.

Knowing that I can quickly create or update checksums as part of the ingest or preservation planning process is also a big reason for me to stick to current workflows.

It has been really interesting testing Fixity and comparing it with the functionality of my existing Checksum tool. There are lots of good things about it and it has helped me to review what I want from a fixity checking tool and to examine the workflows and preferences that I have in place with Checksum.

As part of a cycle of continuous improvement as mentioned in the NDSA Fixity Survey Report of 2017 I hope to review file fixity procedures again in another few years. Fixity has potential and I'll certainly be keeping my eye on it as an option for the future.

Jenny Mitcham, Digital Archivist

Monday, 31 July 2017

The mysterious case of the changed last modified dates

Today's blog post is effectively a mystery story.

Like any good story it has a beginning (the problem is discovered, the digital archive is temporarily thrown into chaos), a middle (attempts are made to solve the mystery and make things better, several different avenues are explored) and an end (the digital preservation community come to my aid).

This story has a happy ending (hooray) but also includes some food for thought (all the best stories do) and as always I'd be very pleased to hear what you think.

The beginning

I have probably mentioned before that I don't have a full digital archive in place just yet. While I work towards a bigger and better solution, I have a set of temporary procedures in place to ingest digital archives on to what is effectively a piece of locked down university filestore. The procedures and workflows are both 'better than nothing' and 'good enough' as a temporary measure and actually appear to take us pretty much up to Level 2 of the NDSA Levels of Preservation (and beyond in some places).

One of the ways I ensure that all is well in the little bit of filestore that I call 'The Digital Archive' is to run frequent integrity checks over the data, using a free checksum utility. Checksums (effectively unique digital fingerprints) for each file in the digital archive are created when content is ingested and these are checked periodically to ensure that nothing has changed. IT keep back-ups of the filestore for a period of three months, so as long as this integrity checking happens within this three month period (in reality I actually do this 3 or 4 times a month) then problems can be rectified and digital preservation nirvana can be seamlessly restored.

Checksum checking is normally quite dull. Thankfully it is an automated process that runs in the background and I can just get on with my work and cheer when I get a notification that tells me all is well. Generally all is well, it is very rare that any errors are highlighted - when that happens I blog about it!

I have perhaps naively believed for some time that I'm doing everything I need to do to keep those files safe and unchanged because if the checksum is the same then all is well, however this month I encountered a problem...

I've been doing some tidying of the digital archive structure and alongside this have been gathering a bit of data about the archives, specifically looking at things like file formats, number of unidentified files and last modified dates.

Whilst doing this I noticed that one of the archives that I had received in 2013 contained 26 files with a last modified date of 18th January 2017 at 09:53. How could this be so if I have been looking after these files carefully and the checksums are the same as they were when the files were deposited?

The 26 files were all EML files - email messages exported from Microsoft Outlook. These were the only EML files within the whole digital archive. The files weren't all in the same directory and other files sitting in those directories retained their original last modified dates.

The middle

So this was all a bit strange...and worrying too. Am I doing my job properly? Is this something I should be bringing to the supportive environment of the DPC's Fail Club?

The last modified dates of files are important to us as digital archivists. This is part of the metadata that comes with a file. It tells us something about the file. If we lose this date are we losing a little piece of the authentic digital object that we are trying to preserve?

Instead of beating myself up about it I wanted to do three things:

Solve the mystery (find out what happened and why)
See if I could fix it
Stop it happening again

So how could it have happened? Has someone tampered with these 26 files? Perhaps unlikely considering they all have the exact same date/time stamp which to me suggests a more automated process. Also, the digital archive isn't widely accessible. Quite deliberately it is only really me (and the filestore administrators) who have access.

I asked IT whether they could explain it. Had some process been carried out across all filestores that involved EML files specifically? They couldn't think of a reason why this may have occurred. They also confirmed my suspicions that we have no backups of the files with the original last modified dates.

I spoke to a digital forensics expert from the Computer Science department and he said he could analyse the files for me and see if he could work out what had acted on them and also suggest a methodology of restoring the dates.

I have a record of the last modified dates of these 26 files when they arrived - the checksum tool that I use writes the last modified date to the hash file it creates. I wondered whether manually changing the last modified dates back to what they were originally was the right thing to do or whether I should just accept and record the change.

...but I decided to sit on it until I understood the problem better.

The end

I threw the question out to the digital preservation community on Twitter and as usual I was not disappointed!

In fact, along with a whole load of discussion and debate, Andy Jackson was able to track down what appears to be the cause of the problem.

He very helpfully pointed me to a thread on StackExchange which described the issue I was seeing.

It was a great comfort to discover that the cause of this problem was apparently a bug and not something more sinister. It appears I am not alone!

...but what now?

So I now I think I know what caused the problem but questions remain around how to catch issues like this more quickly (not six months after it has happened) and what to do with the files themselves.

IT have mentioned to me that an OS upgrade may provide us with better auditing support on the filestore. Being able to view reports on changes made to digital objects within the digital archive would be potentially very useful (though perhaps even that wouldn't have picked up this Windows bug?). I'm also exploring whether I can make particular directories read only and whether that would stop issues such as this occurring in the future.

If anyone knows of any other tools that can help, please let me know.

The other decision to make is what to do with the files themselves. Should I try and fix them? More interesting debate on Twitter on this topic and even on the value of these dates in the first place. If we can fudge them then so can others - they may have already been fudged before they got to the digital archive - in which case, how much value do they really have?

So should we try and fix last modified dates or should we focus our attention on capturing and storing them within the metadata. The later may be a more sustainable solution in the longer term, given their slightly slippery nature!

I know there are lots of people interested in this topic - just see this recent blog post by Sarah Mason and in particular the comments - When was that?: Maintaining or changing ‘created’ and ‘last modified’ dates. It is great that we are talking about real nuts and bolts of digital preservation and that there are so many people willing to share their thoughts with the community.

...and perhaps if you have EML files in your digital archive you should check them too!

Jenny Mitcham, Digital Archivist

Monday, 2 February 2015

When checksums don't match...

No one really likes an error message but it is strangely satisfying when integrity checking of files within the digital archive throws up an issue. I know it shouldn't be, but having some confirmation that these basic administrative tasks that us digital archivists carry out are truly necessary and worthwhile is always encouraging. Furthermore, it is useful to have real life examples to hand when trying to make a business case for a level of archival storage that includes regular monitoring and comparison of checksums.

We don't have a full blown digital archiving system yet at the University of York, but as a minimum, born digital content that comes into the archives is copied on to University filestore and checksums are created. A checksum is a string of characters that acts as a unique fingerprint of a digital object, if the digital object remains unchanged, a checksumming tool will come up with the same string of characters each time the algorithm is run. This allows us digital archivists to ensure that files within our care remain authentic and free from accidental damage or corruption - and this is really one of our most basic roles as professionals.

The tool we are using at the moment to create and verify our checksums is the Checksum tool from Corz. A simple but comprehensive tool that it is quick and easy to get started with, but that gives ample scope for configuration and levels of automation for users who want to get more from it.

This morning when running integrity checks over the digital archive filestore I came across a problem. Two files in one of my directories that hold original born digital content came up with an MD5 error. Very strange.

I've just located the original CDs in the strongroom and had a look at the 2 files in question to try and work out what has gone wrong.

Another great tool that I use to manage our digital assets is Foldermatch. Foldermatch allows you to compare 2 folders and tells you with a neat graphical interface whether the contents of them are identical or not. Foldermatch has different comparison options and you can either take the quick approach and compare contents by file size, date and time or you can go for the belt and braces approach and compare using checksums. As a digital archivist I normally go for the belt and braces approach and here is a clear example of why this is necessary.

Comparison of folders using size, date and time - all looks well!

When comparing what is on the original CD from the strongroom with what is on our digital archive filestore by size, date and time, Foldermatch does not highlight any problems. All looks to be above board. The green column of the bar chart above shows that Foldermatch the files in the filestore to be 'identical' to those on the CD.

Similarly when running a comparison of contents, the results look the same. No problems highlighted.

Comparison of folders using SHA-1 checksums - differences emerge

However, when performing the same task using the SHA-1 checksum algorithm, this is where the problems are apparent. Two of the files (the purple column) are recorded as being 'different but same date/time'.

These changes appear not to have altered the file content, its size or date/time stamp. Indeed I am not clear on what specifically has been altered. Although checksum comparisons are helpful at flagging problems, they are not so helpful at giving specifics about what has changed.

As these files have sat gathering dust on the filestore, something has happened to subtly alter them, and these subtle changes are hard to spot but do have an impact on the their authenticity. This is the sort of thing we need to watch out for and this is why we digital archivists do need to worry about the integrity of our files and take steps to ensure we can prove that we are preserving what we think we are preserving.

Jenny Mitcham, Digital Archivist

Digital Archiving at the University of York

Tuesday, 31 July 2018

Checksum or Fixity? Which tool is for me?

Current procedures using Checksum

Introducing Fixity

Testing Fixity

The Graphical User Interface (GUI)

Reporting

Scheduling

Managing projects

Progress

Timings

Where are the checksums stored?

How to keep checksums up to date

The verdict

Monday, 31 July 2017

The mysterious case of the changed last modified dates

The beginning

The middle

The end

...but what now?

Monday, 2 February 2015

When checksums don't match...

The sustainability of a digital preservation blog...

Twitter

Subscribe