Friday, 17 April 2015

Jisc Archivematica project update ...because digital preservation won’t just go away

Last month I was excited to discover that Jisc had agreed to fund a joint project between the Universities of York and Hull as part of their Research Data Spring initiative. The aim of this project (as mentioned in my previous blog) is to investigate the potential of Archivematica for Research Data Management. There is a brief summary including my pitch here. We have had a number of other higher education institutions in the UK express an interest in this project and it is fabulous to see that there are others who recognise that the tools that digital archivists use could have much to offer those who are charged with managing research data. Of course we hope this project will also be of interest to a more diverse and international audience and we would like to benefit from the experience and knowledge that already exists within the wider digital preservation community.

We are three weeks into this project now and here is the first of a series of updates on progress.

One of the initial tasks for teams at both York and Hull was to ensure we had a current version of Archivematica installed. Over the next few weeks there will be a fair amount of testing going on to give us a greater understanding of Archivematica's strengths and weaknesses particularly with regard to how it may handle research data.

A pod on the lake - a great venue for our kick off meeting
(though in reality the weather wasn't this nice)
We also got together for a kick off meeting in one of the pods on the lake on York's Heslington East campus. We defined our work packages and established responsibilities and deadlines and now have a clear idea of what we are focusing on in this initial 3 month phase that Jisc have agreed to fund.

Much of the research we will be carrying out and reporting on at the end of this phase of the project will be based around the following questions:

  • Why? Why are we bothering to 'preserve' research data? What are the drivers here and what are the risks if we don't?
  • What? What are the characteristics of research data and how might it differ from other born digital data that memory institutions are establishing digital archives to manage and preserve? What types of files are our researchers producing and how would Archivematica handle these? What does Archivematica offer us and what benefits does it bring?
  • How? How would we incorporate Archivematica into a wider technical infrastructure for research data management and what workflows would we put in place? Where would it sit and what other systems would it need to talk to?
  • Who? Who else is using Archivematica (or other digital preservation systems) to do similar things and what can we learn from them?

Working in the pod - nice to have ducks for neighbours
I've started off by giving some thought to the What?

A key part of this project is to look at what a digital preservation system such as Archivematica can offer us as part of an RDM infrastructure. In order to answer this we need to understand a bit about the nature of the research data that we will be asking it to handle. We have been collating existing sources of information about the types of software and data that researchers at the University of York are using, in order to get a clearer picture of what research data is likely to look like. Following on from this we can then start to look at how Archivematica would handle this data.

A couple of years ago at York we carried out some work looking at current data management practice by researchers. We interviewed chairs of research committee for each academic department to get an overview of data management within the department and also put out an online questionnaire to capture a wider and more detailed set of information from individual researchers across the University. This has given us an overview of the types of data that researchers collect or create. This data doesn't go down to the level of specific file types and versions but does talk about broad categories of data (for example which departments are working with databases, audio, video, images etc).

A subsequent survey carried out at York looked more specifically at software packages used by researchers and is a gold mine of information for our project, giving us a list of software packages that we can investigate in more detail. The 'top 20' software packages highlighted by this survey largely consist of software I have never used and never tried to preserve the outputs of - packages such as MATLAB, SPSS, NVivo, Gaussian and ChemDraw. We are investigating how existing digital preservation tools would handle these types of data and looking initially at whether their native file formats appear in PRONOM. We are talking to the helpful team at the National Archives about creating new file signatures for formats that are not currently represented. Knowing what you've got is one of the key challenges in digital preservation and if the file identification tools out there can automatically recognise a wider range of formats then this is clearly going to be a step in the right direction for not just this project but the digital preservation community.
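To give a flavour of what a file signature is, here is a minimal sketch of signature-based identification in Python. It is an illustration only: the table holds three widely documented 'magic number' byte sequences, the names KNOWN_SIGNATURES and identify_format are made up for this example, and real signature definitions are considerably richer than a check on the first few bytes of a file.

```python
# Illustrative only: a tiny table of well-known leading byte sequences.
# KNOWN_SIGNATURES and identify_format are hypothetical names for this sketch.
KNOWN_SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"PK\x03\x04": "ZIP container",
}

# Read only as many bytes as the longest signature needs.
MAX_SIGNATURE_LENGTH = max(len(magic) for magic in KNOWN_SIGNATURES)


def identify_format(path):
    """Return a format name if the file starts with a known signature, else None."""
    with open(path, "rb") as handle:
        header = handle.read(MAX_SIGNATURE_LENGTH)
    for magic, name in KNOWN_SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None
```

A file whose format has no entry in the table simply comes back as None, which is the position we are in with some of the research data formats until a signature exists for them.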

Work will continue - watch this space for updates ...and in the meantime, we'd love to hear from you if you are using Archivematica (or another digital preservation system) for research data so we can find out about your workflows.

Monday, 2 March 2015

Pitching for Archivematica at Research Data Spring

It was with trepidation that I attended the Research Data Spring sandpit last week. A sandpit has childhood associations with lazy long summers, but reading more about the format of the event it sounded more like an episode of Dragon’s Den.

The ‘sandpit’ event was held at Birmingham’s Aston University campus
 and we were at least rewarded with summer beach weather on the second day

The first day of this event was set aside to explore and discuss ideas submitted through Jisc’s latest funding call (Research Data Spring via Ideascale) leading to potential new collaborations and partnerships.

On the second day, idea holders were given the chance to pitch their idea to a panel of judges. In a kind of four minute ‘elevator pitch’ we had to say enough about our project idea to explain what we wanted to achieve, why it was important and what resources were required, as well as persuading the judges that we should receive funding. This was followed by a four minute grilling by the judges on the finer details of the idea. No pressure!

After a late night doing sums, I have a final run through of my pitch

I was pitching a version of an idea originally submitted by the University of Hull that I had been quick to volunteer to collaborate on. The idea is based around investigating Archivematica for use within a wider technical infrastructure for research data management. We are hoping to plug the digital preservation gap and demonstrate the benefits that Archivematica can bring as part of an automated workflow for ingesting and processing research data.

My poster advertising the idea
Initially we would like to do a piece of investigative work looking at what Archivematica has to offer Higher Education Institutions in the UK who are currently setting up workflows for managing research data. We want to look at its strengths and weaknesses, and highlight areas for development in order to inform a further phase of the project in which we would sponsor some development work to enhance Archivematica. In the last phase of the project we hope to implement a proof of concept integrated into our existing Fedora repositories and push some real research data through the system. We would write up our findings as a case study and intend to continue to spread awareness of our work through presentations and blogs.

Many universities in the UK will not yet be ready to start thinking seriously about digital preservation, but the hope is that if this idea is funded, our work with Archivematica will help inform future decision making in this area for other institutions.

Wednesday, 18 February 2015

Digital preservation hits the headlines

It is not often that stories directly related to my profession make front page news, so it was with mixed feelings that I read the following headline on the front of the Guardian this weekend:

"Digital is decaying. The future may have to be print"

While I agree with one of those statements, I strongly disagree with the other. Having worked in digital preservation for 12 years now, I know that the idea of a 'digital dark age' caused by obsolescence in the rapidly evolving hardware and software landscape is not a new one. Digital will decay if it is not properly looked after, and that is ultimately why there is a profession and practice that has built up in this area.

I would however disagree with the idea that the future may have to be print. At the Borthwick Institute for Archives we are now encouraging depositors to give us their archives in their original form. If the files are born digital we would like to receive and archive the digital files. Printouts lack the richness, accessibility and flexibility of the digital originals, which can often tell us far more about a file and the process of creation than a hard copy and can be used in a variety of different ways.

This headline was of course prompted by a BBC interview with Vint Cerf (a vice president at Google) on Friday in which he makes some very valid points. Digital preservation isn't just about preserving the physical bits, it is also about what they mean. We need to capture and preserve information about the digital environment as well as the data itself in order to enable future re-use. Again this is not new, but it is good to see this story hitting the headlines. Raising general awareness of the issues we digital archivists think about on a daily basis can only be a good thing in my mind. What the coverage misses though is the huge amount of work that has been going on in this area already...

The Jisc Research at Risk workshop at IDCC15 in the
fabulous surroundings of 30 Euston Square, London
Last week I spent several days at the International Digital Curation Conference (IDCC) in London. Surprisingly, this was my first IDCC (I'm not sure how I had managed to miss it until now), but this was a landmark birthday for the conference, celebrating a decade of bringing people together to talk about digital curation.

The theme of the conference was "Ten years back, ten years forward: achievements, lessons and the future for digital curation"; consequently, many of the papers focused on how far we had come in the last ten years and on next steps. In ten years, we have clearly achieved much but the digital preservation problem is not yet solved. Progress is not always as fast as we would like, but enacting a culture change in the way we manage our digital assets was never going to be a quick win, and this is sometimes not helped by the fact that as we solve one element of the problem, we adjust our expectations on what we consider to be a success. This is a constantly evolving field and we take on new challenges all the time.

It is great that public awareness about obsolescence and the fragility of digital data has been raised in the mainstream media, but it is also important for people to know that there is already a huge amount of work going on in this area and that many of us think about these issues all the time.

Monday, 2 February 2015

When checksums don't match...

No one really likes an error message but it is strangely satisfying when integrity checking of files within the digital archive throws up an issue. I know it shouldn't be, but having some confirmation that these basic administrative tasks that we digital archivists carry out are truly necessary and worthwhile is always encouraging. Furthermore, it is useful to have real life examples to hand when trying to make a business case for a level of archival storage that includes regular monitoring and comparison of checksums.

We don't have a full blown digital archiving system yet at the University of York, but as a minimum, born digital content that comes into the archives is copied on to University filestore and checksums are created. A checksum is a string of characters that acts as a unique fingerprint of a digital object. If the digital object remains unchanged, a checksumming tool will come up with the same string of characters each time the algorithm is run. This allows us digital archivists to ensure that files within our care remain authentic and free from accidental damage or corruption - and this is really one of our most basic roles as professionals.
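For anyone who has not seen a checksum being made, the idea can be sketched in a few lines of Python using the standard hashlib module. This is not the tool we use, just a minimal illustration; the function name file_checksum and its parameters are my own choices for the example.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until an empty bytes object comes back
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running the function twice over an unchanged file returns the same string both times. Change a single byte and the string changes, which is exactly the property that integrity checking relies on.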

The tool we are using at the moment to create and verify our checksums is the Checksum tool from Corz. It is a simple but comprehensive tool that is quick and easy to get started with, but that gives ample scope for configuration and levels of automation for users who want to get more from it.

This morning when running integrity checks over the digital archive filestore I came across a problem. Two files in one of my directories that hold original born digital content came up with an MD5 error. Very strange.

I've just located the original CDs in the strongroom and had a look at the 2 files in question to try and work out what has gone wrong.

Another great tool that I use to manage our digital assets is Foldermatch. Foldermatch allows you to compare 2 folders and tells you with a neat graphical interface whether the contents of them are identical or not. Foldermatch has different comparison options and you can either take the quick approach and compare contents by file size, date and time or you can go for the belt and braces approach and compare using checksums. As a digital archivist I normally go for the belt and braces approach and here is a clear example of why this is necessary.
Comparison of folders using size, date and time - all looks well!

When comparing what is on the original CD from the strongroom with what is on our digital archive filestore by size, date and time, Foldermatch does not highlight any problems. All looks to be above board. The green column of the bar chart above shows that Foldermatch considers the files in the filestore to be 'identical' to those on the CD.

Similarly when running a comparison of contents, the results look the same. No problems highlighted.

Comparison of folders using SHA-1 checksums - differences emerge

However, when the same comparison is run using the SHA-1 checksum algorithm, the problems become apparent. Two of the files (the purple column) are recorded as being 'different but same date/time'.
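The difference between the quick approach and the belt and braces approach can be sketched in Python. This is not how Foldermatch works internally (I have no insight into that) and it is not a reconstruction of what happened to my two files. It simply shows, under my own assumptions, that a file can keep its size and timestamp and still produce a different SHA-1 checksum. The function names are made up for the example.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_folders(original, copy):
    """Compare files under two folders.

    Returns a list of (relative_path, same_size_and_time, same_checksum)
    tuples, one per file found under the original folder.
    """
    original, copy = Path(original), Path(copy)
    results = []
    for source in sorted(p for p in original.rglob("*") if p.is_file()):
        relative = source.relative_to(original)
        target = copy / relative
        if not target.is_file():
            # Missing from the copy: fails both kinds of comparison
            results.append((relative.as_posix(), False, False))
            continue
        a, b = source.stat(), target.stat()
        # Quick check: size and modification time (to the nearest second)
        same_metadata = a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)
        # Thorough check: the bytes themselves, via SHA-1
        same_checksum = sha1_of(source) == sha1_of(target)
        results.append((relative.as_posix(), same_metadata, same_checksum))
    return results
```

Any row that comes back as True for size and time but False for checksum is the equivalent of the purple column: it looks fine at a glance but is not the same file.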

These changes appear not to have altered the file content, its size or date/time stamp. Indeed I am not clear on what specifically has been altered. Although checksum comparisons are helpful at flagging problems, they are not so helpful at giving specifics about what has changed.

As these files have sat gathering dust on the filestore, something has happened to subtly alter them, and these subtle changes are hard to spot but do have an impact on their authenticity. This is the sort of thing we need to watch out for and this is why we digital archivists do need to worry about the integrity of our files and take steps to ensure we can prove that we are preserving what we think we are preserving.

Thursday, 29 January 2015

Reacquainting myself with OAIS

Hands up if you have read ISO 14721:2012 (otherwise known as the Reference Model for an Open Archival Information System)…..I mean properly read it…..yes, I suspected there wouldn’t be many of you. It seems like such a key document to us digital archivists – we use the terminology, the concepts within it, even the diagrams on a regular basis, but I'll be the first to confess I have never read it in full.

Standards such as this become so familiar to those of us working in this field that it is possible to get a little complacent about keeping our knowledge of them up to date as they undergo review.

Hats off to the Digital Preservation Coalition (DPC) for updating their Technology Watch Report on the OAIS Reference Model last year. It was published in October 2014 and I admit I have only just managed to read it. Digital preservation reading material typically comes out on long train journeys and this report kept me company all the way from Birmingham to Coventry and then back home as far as Sheffield (I am a slow reader!). Imagine how far I would have had to travel to read the 135 pages of the full standard!

This is the 2nd edition of the first in the DPC’s series of Technology Watch reports. I remember reading the original report about 10 years ago and trying to map the active digital preservation we were doing at the Archaeology Data Service to the model. 

Ten years is quite a long time in a developing field such as digital preservation and the standard has now been updated (but as mentioned in the Technology Watch Report, the updates haven’t been extensive – the authors largely got it right first time). 

Now reading this updated report in a different job setting I can think about OAIS in a slightly different way. We don't currently do much digital preservation at the Borthwick Institute, but we do do a lot of thinking about how we would like the digital archive to look. Going back to the basics of the OAIS standard at this point in the process encourages fresh thinking about how OAIS could be implemented in practice. It was really encouraging to read the OCLC Digital Archive example cited within the Technology Watch Report (pg 14) which neatly demonstrates a modular approach to fulfilling all the necessary functions across different departments and systems. This ties in with our current thinking at the University of York about how we can build a digital archive using different systems and expertise across the Information Directorate.

Brian Lavoie mentions in his conclusion that "This Technology Watch Report has sought to re-introduce digital preservation practitioners to OAIS, by recounting its development and recognition as a standard; its key revisions; and the many channels through which its influence has been felt." This aim has certainly been met. I feel thoroughly reacquainted with OAIS and have learnt some things about the changes to the standard and even reminded myself of some things that I had forgotten. As I said, 10 years is a long time.


Friday, 16 January 2015

The first meeting of Archivematica UK users (or explorers)

Last week I was happy to be able to host the first meeting of a group of UK users (or potential users) of Archivematica here in York.

There are several organisations within the UK that are exploring Archivematica and thinking about how they could use it within existing data management workflows to help preserve their digital holdings. I thought it would be good to get us in a room together and talking about our ideas and experiences. 

Of the institutions who attended the meeting, most were at a similar stage. Perhaps we would not yet call ourselves 'Archivematica users', but having recognised its potential, we are now in the process of testing and exploring the system to evaluate exactly how we could use it and what systems it would integrate with. 

Each of the nine institutions attending the meeting was able to give a short presentation with an overview of their experience with Archivematica and intended use of it. I asked each speaker to think about the following points:

  • Where are you with Archivematica (investigating, testing, using)?
  • What do you intend to use it for - eg: research data, born digital archives, digitised content?
  • What do you like about it / what works?
  • What don't you like about it / what doesn't work?
  • How do you see Archivematica fitting in with your wider technical infrastructure - eg: what other systems will you use for storage, access, pre-ingest?
  • What unanswered questions do you have?
By getting an overview of where we all are, we could not only learn from each other, but also see areas where we might be able to put our heads together or collaborate. Exploring new territory always seems easier when you have others to keep you company.

Over the course of the afternoon I took down pages of notes - a clear sign of how useful I found it. I can't report on everything in this blog post but I'll just summarise a couple of the elements that the presentations touched on - our likes and dislikes.

What do you like about Archivematica? What works well for you?

  • It meets our requirements (or most of them)
  • It uses METS to package and structure the metadata
  • It creates an Archival Information Package (AIP) that can be stored anywhere you like
  • You can capture the 'original order' of the deposit and then appraise and structure your package as required (this keeps us in line with standard archival theory and practice)
  • There is a strong user-community and this makes it more sustainable and attractive
  • Artefactual publish their development roadmap and wishlist so we can see the direction they are hoping to take it in
  • It is flexible and can talk to other systems
  • It doesn't tie you in to proprietary systems
  • It connects with AtoM
  • It is supported by Artefactual Systems (who are good to work with)
  • It is freely available and open source - for those of us who don't have a budget for digital preservation this is a big selling point
  • It is managed in collaboration with UNESCO
  • It has an evolving UK user base
  • The interface mirrors the Open Archival Information System (OAIS) functional entities - this is good if this is the language you speak
  • It allows for customisable workflows
  • It has come a long way since the first version was released
  • It fills a gap that lots of us seem to have within our existing data curation infrastructures
  • As well as offering a migration-based strategy to digital preservation, it also stores technical metadata which should allow for future emulation strategies
  • It is configurable - you can add your own tools and put your own preservation policies in place
  • It isn't a finished product but is continually developing - new releases with more functionality are always on the horizon
  • We can influence its future development

What don't you like about Archivematica? What doesn't work for you?

  • It can be time consuming - you can automate a lot of the decision points but not all of them
  • Rights metadata can only be applied to a whole package (AIP) not at a more granular level (eg: per individual file)
  • Storage and storage functionality (such as integrity checking and data loss reporting) isn't included
  • Normalisation/migrations of files only happens on initial ingest but we are likely to need to carry out further migrations at a later date
  • Upgrading is always an adventure!
  • Documentation isn't always complete and up-to-date
  • It doesn't store enough information about the normalised preservation version of the files (more is stored about the original files)
  • It is not just a question of installing it and running with it - lots of thought has to go into how we really want to use it
  • You can't delete files from within an AIP
  • There is no reporting facility to enable you to check on what files of each type/version you have within your archive and use this to inform your preservation planning

A Q&A session with Artefactual Systems via WebEx later in the afternoon was really helpful in answering our unanswered questions and describing some interesting new functionality we can expect to see in the next few versions of Archivematica.

All in all a very worthwhile session and I hope this will be just the first of many meetings of the Archivematica UK group. Please do contact me if you are a UK Archivematica user (or explorer) and want to share experiences with other UK users.

Thursday, 18 December 2014

Plugging the gaps: Linking Arkivum with Archivematica

In September this year, Arkivum and Artefactual Systems (who develop and support the Open Source software Archivematica) announced that they were collaborating on a digital preservation system. This is a piece of work that my colleagues and I at the University of York were very pleased to be able to fund.

We don't currently have a digital archive for the University of York but we are in the process of planning how we can best implement one. My colleagues and I have been thinking about requirements and assessing systems and in particular looking at ways we might create a digital archive that interfaces with existing systems, automates as much of the digital preservation process as possible ....and is affordable.

My first port of call is normally the Open Archival Information System (OAIS) reference model. I regularly wheel out the image below in presentations and meetings because I always think it helps in focusing the mind and summarising what we are trying to achieve.

OAIS Functional Entities (CCSDS 2002)

From the start we have favoured a modular approach to a technical infrastructure to support digital archiving. There doesn't appear to be any single solution that "just does it all for us" and we are not keen to sweep away established systems that already carry out some of the required functionality.

We need to keep in mind the range of data management scenarios we have to support. As a university we have a huge quantity of digital data to manage and formal digital archiving of the type described in the OAIS reference model is not always necessary. We need an architecture that has the flexibility to support a range of different workflows depending on the retention periods or perceived value of the data that we are working with. All data is not born equal so it does not make sense to treat it all in the same way.

How we've approached this challenge is to look at the systems we have currently, find the gaps and work out how best to fill them. We also need to think about how we can get different systems to talk to each other in order to create the automated workflows that are so crucial to all of this working effectively.

Looking at the OAIS model, we already have a system to provide access to data with York Digital Library, which is built using Fedora. We also have some of the data management functionality ticked with various systems to store descriptive metadata about our assets (both digital and physical). We have various ingest workflows in place to get content to us from the data producers. What we don't have currently is a system that manages the preservation planning side of digital archiving or a robust and secure method of providing archival storage for the long term.

This is where Archivematica and Arkivum could come in.

Archivematica is an open source digital archiving solution. In a nutshell, it takes a microservices approach to digital archiving, running several different tools as part of the ingest process to characterise and validate the files, extracting metadata, normalising data and packing the data up into an AIP which contains both the original files (unchanged), any derived or normalised versions of these files as appropriate and technical and preservation metadata to help people make sense of that data in the future. The metadata are captured as PREMIS and METS XML, two established standards for digital preservation that ensure the AIPs are self-documenting and system-independent. Archivematica is agnostic to the storage service that is used. It merely produces the AIP which can then be stored anywhere.

Arkivum is a bit-perfect preservation solution. If you store your data with Arkivum they can guarantee that you will get that data back in the same condition it was in when you deposited it. They keep multiple copies of the data and carry out regular integrity checks to ensure they can fulfil this promise. Files are not characterised or migrated through different formats. This is all about archival storage. Arkivum is agnostic to the content. It will store any file that you wish to deposit.

There does seem to be a natural partnership between Archivematica and Arkivum - there is no overlap in functionality, and they both perform a key role within the OAIS model. In actual fact, even without integration, Archivematica and Arkivum can work together. Archivematica will happily pass AIPs through to Arkivum, but with the integration we can make this work much better.

So, the new functionality includes the following features:

  • Archivematica will let Arkivum know when there is an Archival Information Package (AIP) to ingest
  • Once the Arkivum storage service receives the data from Archivematica it will check the size of the file received matches the expected file size
  • A checksum will be calculated for the AIP in Arkivum and will be automatically compared against the checksum supplied by Archivematica. Using this, the system can accurately verify whether transfer has been successful
  • Using the Archivematica dashboard it is possible to ascertain the status of the AIP within Arkivum to ensure that all required copies of the files have been created and it has been fully archived
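
The size check and checksum comparison described above follow a general pattern that can be sketched in Python. To be clear, this is not the Archivematica or Arkivum code and it does not use either system's API. The function verify_transfer, its parameters and the choice of SHA-256 as the default are all assumptions I have made for the purpose of illustration.

```python
import hashlib
import os


def verify_transfer(received_path, expected_size, expected_checksum, algorithm="sha256"):
    """Check a received file against the size and checksum supplied by the sender.

    Returns a (success, message) tuple. The cheap size check runs first so
    that an obviously incomplete transfer is rejected without hashing it.
    """
    actual_size = os.path.getsize(received_path)
    if actual_size != expected_size:
        return False, f"size mismatch: expected {expected_size}, got {actual_size}"

    digest = hashlib.new(algorithm)
    with open(received_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)

    if digest.hexdigest() != expected_checksum.lower():
        return False, "checksum mismatch"
    return True, "ok"
```

Doing the size check before the checksum is a design choice rather than a requirement: the checksum alone would catch a truncated file, but comparing sizes is almost free and gives a more helpful message when a transfer has simply been cut short.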

I'm still testing this work and have had to work hard to manage my own expectations. The integration doesn't actually do anything particularly visual or exciting, it is the sort of back-end stuff that you don't even notice when everything is working as it should. It is however good to know that these sorts of checks are going on behind the scenes, automating tasks that would otherwise have to be done by hand. It is the functionality that you don't see that is all important!

Getting systems such as these to work together well is key to building up a digital archiving solution and we hope that this is of interest and use to others within the digital preservation community.