Monday, 2 March 2015

Pitching for Archivematica at Research Data Spring

It was with trepidation that I attended the Research Data Spring sandpit last week. A sandpit has childhood associations with lazy long summers, but reading more about the format of the event it sounded more like an episode of Dragon’s Den.

The ‘sandpit’ event was held at Birmingham’s Aston University campus
 and we were at least rewarded with summer beach weather on the second day

The first day of this event was set aside to explore and discuss ideas submitted through Jisc’s latest funding call (Research Data Spring via Ideascale) leading to potential new collaborations and partnerships.

On the second day, idea holders were given the chance to pitch their idea to a panel of judges. In a kind of four minute ‘elevator pitch’ we had to say enough about our project idea to explain what we wanted to achieve, why it was important and what resources were required, as well as persuading the judges that we should receive funding. This was followed by a four minute grilling by the judges on the finer details of the idea. No pressure!

After a late night doing sums, I have a final run through of my pitch

I was pitching a version of an idea originally submitted by the University of Hull that I had been quick to volunteer to collaborate on. The idea is based around investigating Archivematica for use within a wider technical infrastructure for research data management. We are hoping to plug the digital preservation gap and demonstrate the benefits that Archivematica can bring as part of an automated workflow for ingesting and processing research data.

My poster advertising the idea
Initially we would like to do a piece of investigative work looking at what Archivematica has to offer Higher Education Institutions in the UK who are currently setting up workflows for managing research data. We want to look at its strengths and weaknesses, and highlight areas for development in order to inform a further phase of the project in which we would sponsor some development work to enhance Archivematica. In the last phase of the project we hope to implement a proof of concept integrated into our existing Fedora repositories and push some real research data through the system. We would write up our findings as a case study and intend to continue to spread awareness of our work through presentations and blogs.

Many universities in the UK will not yet be ready to start thinking seriously about digital preservation, but the hope is that if this idea is funded, our work with Archivematica will help inform future decision making in this area for other institutions.

Wednesday, 18 February 2015

Digital preservation hits the headlines

It is not often that stories directly related to my profession make front page news, so it was with mixed feelings that I read following headline on the front of the Guardian this weekend:

"Digital is decaying. The future may have to be print"

While I agree with one of those statements, I strongly disagree with the other. Having worked in digital preservation for 12 years now, the idea of  a 'digital dark age' caused by obsolescence of the rapidly evolving hardware and software landscape is not a new one. Digital will decay if it is not properly looked after, and that is ultimately why there is a profession and practice that has built up in this area.

I would however disagree with the idea that the future may have to be print. At the Borthwick Institute for Archives we are now encouraging depositors to give us their archives in their original form. If the files are born digital we would like to receive and archive the digital files. Printouts lack the richness, accessibility and flexibility of the digital originals, which can often tell us far more about a file and the process of creation than a hard copy and can be used in a variety of different ways.

This headline was of course prompted by a BBC interview with Vint Cerf (VC of Google) on Friday in which he makes some very valid points. Digital preservation isn't just about preserving the physical bits, it is also about what they mean. We need to capture and preserve information about the digital environment as well as the data itself in order to enable future re-use. Again this is not new, but it is good to see story this hitting the headlines. Raising general awareness of the issues us digital archivists think about on a daily basis can only be a good thing in my mind. What the coverage misses though is the huge amount of work that has been going on in this area already...

The Jisc Research at Risk workshop at IDCC15 in the
fabulous surroundings of 30 Euston Square, London
Last week I spent several days at the International Digital Curation Conference (IDCC) in London. Surprisingly, this was my first IDCC (I'm not sure how I had managed to miss it until now), but this was a landmark birthday for the conference, celebrating a decade of bringing people together to talk about digital curation.

The theme of the conference was "Ten years back, ten years forward: achievements, lessons and the future for digital curation", consequently, many of the papers focused on how far we had come in the last ten years and on next steps. In ten years, we have clearly achieved much but the digital preservation problem is not yet solved. Progress is not always as fast as we would like, but enacting a culture change in the way we manage our digital assets is was never going to be a quick win, and this is sometimes not helped by the fact that as we solve one element of the problem, we adjust our expectations on what we consider to be a success. This is a constantly evolving field and we take on new challenges all the time.

It is great that public awareness about obsolescence and the fragility of digital data has been raised in the mainstream media, but it is also important for people to know that there is already a huge amount of work going on in this area and many of us who think about these issues all the time.

Monday, 2 February 2015

When checksums don't match...

No one really likes an error message but it is strangely satisfying when integrity checking of files within the digital archive throws up an issue. I know it shouldn't be, but having some confirmation that these basic administrative tasks that us digital archivists carry out are truly necessary and worthwhile is always encouraging. Furthermore, it is useful to have real life examples to hand when trying to make a business case for a level of archival storage that includes regular monitoring and comparison of checksums.

We don't have a full blown digital archiving system yet at the University of York, but as a minimum, born digital content that comes into the archives is copied on to University filestore and checksums are created. A checksum is a string of characters that acts as a unique fingerprint of a digital object, if the digital object remains unchanged, a checksumming tool will come up with the same string of characters each time the algorithm is run. This allows us digital archivists to ensure that files within our care remain authentic and free from accidental damage or corruption - and this is really one of our most basic roles as professionals.

The tool we are using at the moment to create and verify our checksums is the Checksum tool from Corz. A simple but comprehensive tool that it is quick and easy to get started with, but that gives ample scope for configuration and levels of automation for users who want to get more from it.

This morning when running integrity checks over the digital archive filestore I came across a problem. Two files in one of my directories that hold original born digital content came up with an MD5 error. Very strange.

I've just located the original CDs in the strongroom and had a look at the 2 files in question to try and work out what has gone wrong.

Another great tool that I use to manage our digital assets is Foldermatch. Foldermatch allows you to compare 2 folders and tells you with a neat graphical interface whether the contents of them are identical or not. Foldermatch has different comparison options and you can either take the quick approach and compare contents by file size, date and time or you can go for the belt and braces approach and compare using checksums. As a digital archivist I normally go for the belt and braces approach and here is a clear example of why this is necessary.
Comparison of folders using size, date and time - all looks well!

When comparing what is on the original CD from the strongroom with what is on our digital archive filestore by size, date and time, Foldermatch does not highlight any problems. All looks to be above board. The green column of the bar chart above shows that Foldermatch the files in the filestore to be 'identical' to those on the CD.

Similarly when running a comparison of contents, the results look the same. No problems highlighted.

Comparison of folders using SHA-1 checksums - differences emerge

However, when performing the same task using the SHA-1 checksum algorithm, this is where the problems are apparent. Two of the files (the purple column) are recorded as being 'different but same date/time'.

These changes appear not to have altered the file content, its size or date/time stamp. Indeed I am not clear on what specifically has been altered. Although checksum comparisons are helpful at flagging problems, they are not so helpful at giving specifics about what has changed.

As these files have sat gathering dust on the filestore, something has happened to subtly alter them, and these subtle changes are hard to spot but do have an impact on the their authenticity. This is the sort of thing we need to watch out for and this is why we digital archivists do need to worry about the integrity of our files and take steps to ensure we can prove that we are preserving what we think we are preserving.

Thursday, 29 January 2015

Reacquainting myself with OAIS

Hands up if you have read ISO:14721:2012 (otherwise known as the Reference Model for an Open Archival Information System)…..I mean properly read it…..yes, I suspected there wouldn’t be many of you. It seems like such a key document to us digital archivists – we use the terminology, the concepts within it, even the diagrams on a regular basis, but I'll be the first to confess I have never read it in full.

Standards such as this become so familiar to those of us working in this field that it is possible to get a little complacent about keeping our knowledge of them up to date as they undergo review.

Hats off to the Digital Preservation Coalition (DPC) for updating their Technology Watch Report on the OAIS Reference Model last year. Published in October 2014 I admit I have only just managed to read it. Digital preservation reading material typically comes out on long train journeys and this report kept me company all the way from Birmingham to Coventry and then back home as far as Sheffield (I am a slow reader!). Imagine how far I would have had to travel to read the 135 pages of the full standard!

This is the 2nd edition of the first in the DPC’s series of Technology Watch reports. I remember reading the original report about 10 years ago and trying to map the active digital preservation we were doing at the Archaeology Data Service to the model. 

Ten years is quite a long time in a developing field such as digital preservation and the standard has now been updated (but as mentioned in the Technology Watch Report, the updates haven’t been extensive – the authors largely got it right first time). 

Now reading this updated report in a different job setting I can think about OAIS in a slightly different way. We don't currently do much digital preservation at the Borthwick Institute, but we do do a lot of thinking about how we would like the digital archive to look. Going back to the basics of the OAIS standard at this point in the process encourages fresh thinking about how OAIS could be implemented in practice. It was really encouraging to read the OCLC Digital Archive example cited within the Technology Watch Report (pg 14) which neatly demonstrates a modular approach to fulfilling all the necessary functions across different departments and systems. This ties in with our current thinking at the University of York about how we can build a digital archive using different systems and expertise across the Information Directorate.

Brian Lavoie mentions in his conclusion that "This Technology Watch Report has sought to re-introduce digital preservation practitioners to OAIS, by recounting its development and recognition as a standard; its key revisions; and the many channels through which its influence has been felt." This aim has certainly been met. I feel thoroughly reacquainted with OAIS and have learnt some things about the changes to the standard and even reminded myself of some things that I had forgotten I said, 10 years is a long time.


Friday, 16 January 2015

The first meeting of Archivematica UK users (or explorers)

Last week I was happy to be able to host the first meeting of a group of UK users (or potential users) of Archivematica here in York

There are several organisations within the UK that are exploring Archivematica and thinking about how they could use it within existing data management workflows to help preserve their digital holdings. I thought it would be good to get us in a room together and talking about our ideas and experiences. 

Of the institutions who attended the meeting, most were at a similar stage. Perhaps we would not yet call ourselves 'Archivematica users', but having recognised its potential, we are now in the process of testing and exploring the system to evaluate exactly how we could use it and what systems it would integrate with. 

Each of the nine institutions attending the meeting were able to give a short presentation with an overview of their experience with Archivematica and intended use of it. I asked each speaker to think about the following points:

  • Where are you with Archivematica (investigating, testing, using)?
  • What do you intend to use it for - eg: research data, born digital archives, digitised content?
  • What do you like about it / what works?
  • What don't you like about it / what doesn't work?
  • How do you see Archivematica fitting in with your wider technical infrastructure - eg: what other systems will you use for storage, access, pre-ingest?
  • What unanswered questions do you have?
By getting an overview of where we all are, we could not only learn from each other, but also see areas where we might be able to put our heads together or collaborate. Exploring new territory always seems easier when you have others to keep you company.

Over the course of the afternoon I took down pages of notes - a clear sign of how useful I found it. I can't report on everything in this blog post but I'll just summarise a couple of the elements that the presentations touched on - our likes and dislikes.

What do you like about Archivematica? What works well for you?

  • It meets our requirements (or most of them)
  • It uses METS to package and structure the metadata
  • It creates an Archival Information Package (AIP) that can be stored anywhere you like
  • You can capture the 'original order' of the deposit and then appraise and structure your package as required (this keeps us in line with standard archival theory and practice)
  • There is a strong user-community and this makes it more sustainable and attractive
  • Artefactual publish their development roadmap and wishlist so we can see the direction they are hoping to take it in
  • It is flexible and can talk to other systems
  • It doesn't tie you in to proprietary systems
  • It connects with AtoM
  • It is supported by Artefactual Systems (who are good to work with)
  • It is freely available and open source - for those of us who don't have a budget for digital preservation this is a big selling point
  • It is managed in collaboration with UNESCO
  • It has an evolving UK user base
  • The interface mirrors the Open Archival Information System (OAIS) functional entities - this is good if this is the language you speak
  • It allows for customisable workflows
  • It has come a long way since the first version was released
  • It fills a gap that lots of us seem to have within our existing data curation infrastructures
  • As well as offering a migration-based strategy to digital preservation, it also stores technical metadata which should allow for future emulation strategies
  • It is configurable - you can add your own tools and put your own preservation policies in place
  • It isn't a finished product but is continually developing - new releases with more functionality are always on the horizon
  • We can influence it's future development

What don't you like about Archivematica? What doesn't work for you?

  • It can be time consuming - you can automate a lot of the decision points but not all of them
  • Rights metadata can only be applied to a whole package (AIP) not at a more granular level (eg: per individual file)
  • Storage and storage functionality (such as integrity checking and data loss reporting) isn't included
  • Normalisation/migrations of files only happens on initial ingest but we are likely to need to carry out further migrations at a later date
  • Upgrading is always an adventure!
  • Documentation isn't always complete and up-to-date
  • It doesn't store enough information about the normalised preservation version of the files (more is stored about the original files)
  • It is not just a question of installing it and running with it - lots of thought has to go into how we really want to use it
  • You can't delete files from within an AIP
  • There is no reporting facility to enable you to check on what files of each type/version you have within your archive and use this to inform your preservation planning

A Q&A session with Artefactual Systems via WebEx later in the afternoon was really helpful in answering our unanswered questions and describing some interesting new functionality we can expect to see in the next few versions of Archivematica.

All in all a very worthwhile session and I hope this will be just the first of many meetings of the Archivematica UK group. Please do contact me if you are a UK Archivematica user (or explorer) and want to share experiences with other UK users.

Thursday, 18 December 2014

Plugging the gaps: Linking Arkivum with Archivematica

In September this year, Arkivum and Artefactual Systems (who develop and support the Open Source software Archivematica) announced that they were collaborating on a digital preservation system. This is a piece of work that myself and colleagues at the University of York were very pleased to be able to fund.

We don't currently have a digital archive for the University of York but we are in the process of planning how we can best implement one. Myself and colleagues have been thinking about requirements and assessing systems and in particular looking at ways we might create a digital archive that interfaces with existing systems, automates as much of the digital preservation process as possible ....and is affordable.

My first point of call is normally the Open Archival Information System (OAIS) reference model. I regularly wheel out the image below in presentations and meetings because I always think it helps in focusing the mind and summarising what we are trying to achieve.

OAIS Functional Entities (CCSDS 2002)

From the start we have favoured a modular approach to a technical infrastructure to support digital archiving. There doesn't appear to be any single solution that "just does it all for us" and we are not keen to sweep away established systems that already carry out some of the required functionality.

We need to keep in mind the range of data management scenarios we have to support. As a university we have a huge quantity of digital data to manage and formal digital archiving of the type described in the OAIS reference model is not always necessary. We need an architecture that has the flexibility to support a range of different workflows depending on the retention periods or perceived value of the data that we are working with. All data is not born equal so it does not make sense to treat it all in the same way.

How we've approached this challenge is to look at the systems we have currently, find the gaps and work out how best to fill them. We also need to think about how we can get different systems to talk to each other in order to create the automated workflows that are so crucial to all of this working effectively.

Looking at the OAIS model, we already have a system to provide access to data with York Digital Library which is built using Fedora, we also have some of the data management functionality ticked with various systems to store descriptive metadata about our assets (both digital and physical). We have various ingest workflows in place to get content to us from the data producers. What we don't have currently is a system that manages the preservation planning side of digital archiving or a robust and secure method of providing archival storage for the long term.

This is where Archivematica and Arkivum could come in.

Archivematica is an open source digital archiving solution. In a nutshell, it takes a microservices approach to digital archiving, running several different tools as part of the ingest process to characterise and validate the files, extracting metadata, normalising data and packing the data up into an AIP which contains both the original files (unchanged), any derived or normalised versions of these files as appropriate and technical and preservation metadata to help people make sense of that data in the future. The metadata are captured as PREMIS and METS XML, two established standards for digital preservation that ensure the AIPs are self-documenting and system-independent. Archivematica is agnostic to the storage service that is used. It merely produces the AIP which can then be stored anywhere.

Arkivum is a bit perfect preservation solution. If you store your data with Arkivum they can guarantee that you will get that data back in the same condition it was in when you deposited it. They keep multiple copies of the data and carry out regular integrity checks to ensure they can fulfil this promise. Files are not characterised or migrated through different formats. This is all about archival storage. Arkivum is agnostic to the content. It will store any file that you wish to deposit.

There does seem to be a natural partnership between Archivematica and Arkivum - there is no overlap in functionality, and they both perform a key role within the OAIS model. In actual fact, even without integration, Archivematica and Arkivum can work together. Archivematica will happily pass AIPs through to to Arkivum, but with the integration we can make this work much better.

So, the new functionality includes the following features:

  • Archivematica will let Arkivum know when there is an Archival Information Package (AIP) to ingest
  • Once the Arkivum storage service receives the data from Archivematica it will check the size of the file received matches the expected file size
  • A checksum will be calculated for the AIP in Arkivum and will be automatically compared against the checksum supplied by Archivematica. Using this, the system can accurately verify whether transfer has been successful
  • Using the Archivematica dashboard it is possible to ascertain the status of the AIP within Arkivum to ensure that all required copies of the files have been created and it has been fully archived

I'm still testing this work and had to work hard to manage my own expectations. The integration doesn't actually do anything particularly visual or exciting, it is the sort of back-end stuff that you don't even notice when everything is working as it should. It is however good to know that these sorts of checks are going on behind the scenes, automating tasks that would otherwise have to be done by hand. It is the functionality that you don't see that is all important!

Getting systems such as these to work together well is key to building up a digital archiving solution and we hope that this is of interest and use to others within the digital preservation community.

Friday, 7 November 2014

A non-archivist's perspective on cataloguing born digital archives

As blogged in my previous post, earlier this week I attended an ARA Section for Archives and Technology event on Born Digital Cataloguing and also had the opportunity to talk about some of the Borthwick's current work in this area.

I gave a non-archivist's perspective on born digital cataloguing. These were the main points I tried to put across, though some of the points below were also informed by discussions on the day:

  • Born digital cataloguing within a purely digital archive is reasonably straightforward. The real complexity comes when working with hybrid archives where content is both physical and digital
  • The Archaeology Data Service are good at born digital cataloguing. This is partly because they only have digital material to worry about, but also down to the fact that they have many years of experience and the necessary systems in place. Their new ADS Easy system allows depositors to submit data for archiving along with the required metadata (which they can enter both at project level and individual file level). A web interface for disseminating this data can then be created in a largely automated fashion. It makes sense to ask the person who knows the most about the data to catalogue it, freeing up the digital archivists' time to focus on checking the received data and metadata and more specialist digital preservation work.
  • Communication can be a problem between traditional archivists and digital archivists. We may use different metadata standards and we may not always know what the other is talking about. I was at the Borthwick Institute for approximately a year before I worked out that when my colleagues talked about describing archives at file level (which may cover multiple physical documents within the same physical file), they didn't mean the same as my perception of 'file level metadata' (which would apply to a single digital item). It is important to recognise these differences and try and work around them so that we understand each other better when working with hybrid archives.
A digital archivist may speak a different language to traditional archivists,
but we can work around this

  • At the Borthwick we are in the process of implementing a new system for accessioning and cataloguing archives (both physical and digital archives). We have installed a version of AtoM (Access to Memory) and have imported one of our more complex catalogues into it. We now need to build on this proof of concept and fully establish and populate this system. As well as holding information about our physical holdings, this system will provide a means of cataloguing born digital data and also the foundations on which a digital archiving system can be built. It will also provide the means by which we can disseminate digital objects to our users.
  • There are other types of metadata that are required for digital material and these are outside the scope of AtoM which is primarily for resource discovery metadata. More technical metadata relating to digital objects and any transformations they undergo needs to reside within a digital archiving system. This is where Archivematica comes in. We are currently testing this digital preservation system to establish whether it meets our digital archiving needs.
  • I worry about the identifiers we use within archival catalogues. The traditional archival identifier is performing two jobs – firstly acting as a unique identifier or reference for an item or group of items, and secondly showing where those items sit within the archival hierarchy. This can lead to problems...
    • ...if the arrangement of the archive changes – this may lead to the identifier changing – never a good thing once it has been published and made available, or, if that identifer is being used to link between different systems.
    • ...if we want to start describing objects before we know where they sit within the hierarchy. This may be the case in particular for digital material where we may want to start working with it with greater urgency than the physical element of the archive.*
  • We can argue that digital isn't different, but with digital we do tend to think more at item level. Digital preservation activities and the technical and preservation metadata that this generates are all at file (item) level, so perhaps it makes sense for the resource discovery metadata to follow this pattern. Unlike physical archives, for digital archives we can pretty easily generate a title (or file name) for every item. If we are to deal with digital archives at file level would this cause confusion when cataloguing a hybrid archive?
  • Before we incorporate digital material into a digital archive, some selection and appraisal needs to be carried out  - depending on the digital archiving system in use, it can be non-trivial to remove files from an AIP (archival information package) once they have been transferred, so we really do need to have a good idea of what is and isn't included before we carry out our preservation activities. In order to carry out this selection we may wish to start putting together a skeletal description of each item. Wouldn't it be nice if we could start to do this in a way which could be easily transferred into an archival management system? At the moment I have been doing this in a separate spreadsheet but we need strategies that are more sustainable and scalable.
  • Workflows are crucially important. Who does the born digital cataloguing where hybrid archives are concerned? It's place within the archive as a whole is key so it should be catalogued in tandem with the physical, but if we want to archive the digital material more rapidly than the physical how do we ensure we have the right workflows and procedures in place? Much of this will come down to institutional policies and procedures and the capabilities of the technologies you are using. These are still issues we are grappling with here at the Borthwick as we try and establish a framework for carrying out born digital cataloguing.

* as an aside (and a bit off-topic), my other bugbear with archival identifers is that they contain slashes (which means we can’t use them in directory or file names) and that they don’t order correctly in a spreadsheet or database as they are a mixture of numeric and alphabetical characters