Monday 2 February 2015

When checksums don't match...

No one really likes an error message, but it is strangely satisfying when integrity checking of files within the digital archive throws up an issue. I know it shouldn't be, but having some confirmation that these basic administrative tasks that we digital archivists carry out are truly necessary and worthwhile is always encouraging. Furthermore, it is useful to have real-life examples to hand when trying to make a business case for a level of archival storage that includes regular monitoring and comparison of checksums.

We don't have a full blown digital archiving system yet at the University of York, but as a minimum, born digital content that comes into the archives is copied on to University filestore and checksums are created. A checksum is a string of characters that acts as a unique fingerprint of a digital object: if the digital object remains unchanged, a checksumming tool will produce the same string of characters each time the algorithm is run. This allows us digital archivists to ensure that files within our care remain authentic and free from accidental damage or corruption - and this is really one of our most basic roles as professionals.
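
For anyone who wants to see what is going on under the bonnet, here is a rough Python sketch of the idea (this is not the Corz tool we actually use, and the file path below is invented purely for illustration):

    import hashlib

    def file_checksum(path, algorithm="md5", chunk_size=65536):
        """Return the hex checksum of a file, reading it in chunks."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # On ingest: record the checksum alongside the file.
    stored = file_checksum("accession_001/report.doc")

    # Later, at audit time: recompute and compare. Any difference means the
    # bitstream is no longer the one we took custody of.
    if file_checksum("accession_001/report.doc") != stored:
        print("Integrity error: file has changed since ingest")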

The tool we are using at the moment to create and verify our checksums is the Checksum tool from Corz. It is a simple but comprehensive tool that is quick and easy to get started with, but that gives ample scope for configuration and levels of automation for users who want to get more from it.

This morning when running integrity checks over the digital archive filestore I came across a problem. Two files in one of my directories that hold original born digital content came up with an MD5 error. Very strange.

I've just located the original CDs in the strongroom and had a look at the 2 files in question to try and work out what has gone wrong.

Another great tool that I use to manage our digital assets is Foldermatch. Foldermatch allows you to compare two folders and tells you, via a neat graphical interface, whether or not their contents are identical. Foldermatch has different comparison options: you can either take the quick approach and compare contents by file size, date and time, or you can go for the belt and braces approach and compare using checksums. As a digital archivist I normally go for the belt and braces approach, and here is a clear example of why this is necessary.
Comparison of folders using size, date and time - all looks well!

When comparing what is on the original CD from the strongroom with what is on our digital archive filestore by size, date and time, Foldermatch does not highlight any problems. All looks to be above board. The green column of the bar chart above shows that Foldermatch considers the files in the filestore to be 'identical' to those on the CD.

Similarly when running a comparison of contents, the results look the same. No problems highlighted.

Comparison of folders using SHA-1 checksums - differences emerge

However, when performing the same comparison using the SHA-1 checksum algorithm, the problems become apparent. Two of the files (the purple column) are recorded as being 'different but same date/time'.
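
To illustrate the difference between the two approaches, here is a rough Python sketch (this is not how Foldermatch works internally, and the folder names 'cd_copy' and 'filestore_copy' are invented for illustration):

    import hashlib
    import os

    def sha1(path, chunk_size=65536):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    for name in sorted(os.listdir("cd_copy")):
        a = os.path.join("cd_copy", name)
        b = os.path.join("filestore_copy", name)
        same_metadata = (os.path.getsize(a) == os.path.getsize(b) and
                         int(os.path.getmtime(a)) == int(os.path.getmtime(b)))
        same_content = sha1(a) == sha1(b)
        if same_metadata and not same_content:
            # Exactly the situation described above: size and date/time agree,
            # but the bitstreams differ.
            print(name + ": different but same date/time")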

Whatever has happened to these files has not altered their size or date/time stamp, and I am not clear on what specifically within the content has been altered. Although checksum comparisons are helpful at flagging problems, they are not so helpful at giving specifics about what has changed.
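
One rough option for digging further (a sketch only, with invented file names) would be to walk both copies of a file byte by byte and report the offsets where they diverge:

    def differing_offsets(path_a, path_b, limit=10):
        """Return the first few byte positions at which two files differ."""
        offsets = []
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            position = 0
            while len(offsets) < limit:
                byte_a = fa.read(1)
                byte_b = fb.read(1)
                if not byte_a and not byte_b:
                    break  # both files exhausted
                if byte_a != byte_b:
                    offsets.append(position)
                position += 1
        return offsets

    print(differing_offsets("cd_copy/report.doc", "filestore_copy/report.doc"))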

As these files have sat gathering dust on the filestore, something has happened to subtly alter them. These subtle changes are hard to spot but do have an impact on their authenticity. This is the sort of thing we need to watch out for, and this is why we digital archivists do need to worry about the integrity of our files and take steps to ensure we can prove that we are preserving what we think we are preserving.




Jenny Mitcham, Digital Archivist

3 comments:

  1. Have you tried running a tool like FC or diff on these two files? It should be able to tell you exactly what has changed.
    https://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/fc.mspx?mfr=true

  2. No I didn't know about that one but thanks for the heads up Nick. Gareth Knight has also just pointed me to SSDeep (ssdeep.sourceforge.net) for fuzzy hashing. Now I'm wishing I had the forethought to take copies of the files that had changed. You don't think to back up corrupt files. I should be able to find them on the filestore snapshots from last week though.

  3. It is important to understand that a file system is much like a library: there are the artefacts and there is the catalogue. Parameters such as creation and modification dates and the file size and type are attributes stored in the file system's catalogue, but they are entirely separate from the file itself (the artefact). In the same way that a book can become damaged or degrade, so can a file; generally a library catalogue will not note a book as being 'slightly foxed', but it may in fact be so.

    In digital preservation there are a range of perils, and to appreciate them and guard against them you must understand the aims of preservation and the intricacies of digital media and file systems. To be clear, SAN and NAS solutions in all their myriad forms are very powerful tools; however, in general they come up short in respect of data integrity, i.e. the active management needed to ensure that at all times the content of a file or artefact has not altered in any respect.

    From the description you have provided I would suggest you have experienced 'bit rot': this is essentially the failure of individual storage bits, which means the overall artefact is subtly altered (thus failing to match the checksum). With live data this is not a big issue because media has a relatively short life (3-5 years in most cases) and there are backups to restore from. However, in digital archiving artefacts are retained for long periods of time and may be accessed infrequently, so errors may go undetected for an extended period of time, way beyond the retention of a backup routine (if one is applied at all; after all, the data is not changing).

    In this scenario you need a storage service where the data at rest is exercised and validated periodically, where multiple copies are held on media that is suited to being at rest, and where throughout the lifecycle multiple checksum algorithms are used to check not only the properties of a file but also its content (the artefact itself). Few solutions or services meet this rather specialist set of requirements.
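
    As a rough illustration only (the manifest format, algorithms and paths here are invented), a scheduled fixity audit of that kind might look something like this Python sketch:

        import hashlib
        import json

        ALGORITHMS = ("md5", "sha256")

        def checksums(path, chunk_size=65536):
            """Compute several checksums of a file in one pass."""
            hashes = {name: hashlib.new(name) for name in ALGORITHMS}
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    for h in hashes.values():
                        h.update(chunk)
            return {name: h.hexdigest() for name, h in hashes.items()}

        def audit(manifest_path="manifest.json"):
            # Manifest maps each file path to its expected checksums, e.g.
            # {"archive/file.doc": {"md5": "...", "sha256": "..."}}
            with open(manifest_path) as f:
                manifest = json.load(f)
            for path, expected in manifest.items():
                if checksums(path) != expected:
                    print("Fixity failure: " + path)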

