Wednesday, 18 February 2015

Digital preservation hits the headlines

It is not often that stories directly related to my profession make front page news, so it was with mixed feelings that I read following headline on the front of the Guardian this weekend:

"Digital is decaying. The future may have to be print"

While I agree with one of those statements, I strongly disagree with the other. Having worked in digital preservation for 12 years now, the idea of  a 'digital dark age' caused by obsolescence of the rapidly evolving hardware and software landscape is not a new one. Digital will decay if it is not properly looked after, and that is ultimately why there is a profession and practice that has built up in this area.

I would however disagree with the idea that the future may have to be print. At the Borthwick Institute for Archives we are now encouraging depositors to give us their archives in their original form. If the files are born digital we would like to receive and archive the digital files. Printouts lack the richness, accessibility and flexibility of the digital originals, which can often tell us far more about a file and the process of creation than a hard copy and can be used in a variety of different ways.

This headline was of course prompted by a BBC interview with Vint Cerf (VC of Google) on Friday in which he makes some very valid points. Digital preservation isn't just about preserving the physical bits, it is also about what they mean. We need to capture and preserve information about the digital environment as well as the data itself in order to enable future re-use. Again this is not new, but it is good to see story this hitting the headlines. Raising general awareness of the issues us digital archivists think about on a daily basis can only be a good thing in my mind. What the coverage misses though is the huge amount of work that has been going on in this area already...

The Jisc Research at Risk workshop at IDCC15 in the
fabulous surroundings of 30 Euston Square, London
Last week I spent several days at the International Digital Curation Conference (IDCC) in London. Surprisingly, this was my first IDCC (I'm not sure how I had managed to miss it until now), but this was a landmark birthday for the conference, celebrating a decade of bringing people together to talk about digital curation.

The theme of the conference was "Ten years back, ten years forward: achievements, lessons and the future for digital curation", consequently, many of the papers focused on how far we had come in the last ten years and on next steps. In ten years, we have clearly achieved much but the digital preservation problem is not yet solved. Progress is not always as fast as we would like, but enacting a culture change in the way we manage our digital assets is was never going to be a quick win, and this is sometimes not helped by the fact that as we solve one element of the problem, we adjust our expectations on what we consider to be a success. This is a constantly evolving field and we take on new challenges all the time.

It is great that public awareness about obsolescence and the fragility of digital data has been raised in the mainstream media, but it is also important for people to know that there is already a huge amount of work going on in this area and many of us who think about these issues all the time.

Jenny Mitcham, Digital Archivist

Monday, 2 February 2015

When checksums don't match...

No one really likes an error message but it is strangely satisfying when integrity checking of files within the digital archive throws up an issue. I know it shouldn't be, but having some confirmation that these basic administrative tasks that us digital archivists carry out are truly necessary and worthwhile is always encouraging. Furthermore, it is useful to have real life examples to hand when trying to make a business case for a level of archival storage that includes regular monitoring and comparison of checksums.

We don't have a full blown digital archiving system yet at the University of York, but as a minimum, born digital content that comes into the archives is copied on to University filestore and checksums are created. A checksum is a string of characters that acts as a unique fingerprint of a digital object, if the digital object remains unchanged, a checksumming tool will come up with the same string of characters each time the algorithm is run. This allows us digital archivists to ensure that files within our care remain authentic and free from accidental damage or corruption - and this is really one of our most basic roles as professionals.

The tool we are using at the moment to create and verify our checksums is the Checksum tool from Corz. A simple but comprehensive tool that it is quick and easy to get started with, but that gives ample scope for configuration and levels of automation for users who want to get more from it.

This morning when running integrity checks over the digital archive filestore I came across a problem. Two files in one of my directories that hold original born digital content came up with an MD5 error. Very strange.

I've just located the original CDs in the strongroom and had a look at the 2 files in question to try and work out what has gone wrong.

Another great tool that I use to manage our digital assets is Foldermatch. Foldermatch allows you to compare 2 folders and tells you with a neat graphical interface whether the contents of them are identical or not. Foldermatch has different comparison options and you can either take the quick approach and compare contents by file size, date and time or you can go for the belt and braces approach and compare using checksums. As a digital archivist I normally go for the belt and braces approach and here is a clear example of why this is necessary.
Comparison of folders using size, date and time - all looks well!

When comparing what is on the original CD from the strongroom with what is on our digital archive filestore by size, date and time, Foldermatch does not highlight any problems. All looks to be above board. The green column of the bar chart above shows that Foldermatch the files in the filestore to be 'identical' to those on the CD.

Similarly when running a comparison of contents, the results look the same. No problems highlighted.

Comparison of folders using SHA-1 checksums - differences emerge

However, when performing the same task using the SHA-1 checksum algorithm, this is where the problems are apparent. Two of the files (the purple column) are recorded as being 'different but same date/time'.

These changes appear not to have altered the file content, its size or date/time stamp. Indeed I am not clear on what specifically has been altered. Although checksum comparisons are helpful at flagging problems, they are not so helpful at giving specifics about what has changed.

As these files have sat gathering dust on the filestore, something has happened to subtly alter them, and these subtle changes are hard to spot but do have an impact on the their authenticity. This is the sort of thing we need to watch out for and this is why we digital archivists do need to worry about the integrity of our files and take steps to ensure we can prove that we are preserving what we think we are preserving.

Jenny Mitcham, Digital Archivist