Jisc Archivematica project update ...because digital preservation won’t just go away

Last month I was excited to discover that Jisc had agreed to fund a joint project between the Universities of York and Hull as part of their Research Data Spring initiative. The aim of this project (as mentioned in my previous blog post) is to investigate the potential of Archivematica for Research Data Management. There is a brief summary including my pitch here. A number of other higher education institutions in the UK have expressed an interest in this project, and it is fabulous to see that others recognise that the tools digital archivists use could have much to offer those charged with managing research data. Of course we hope this project will also be of interest to a more diverse and international audience, and we would like to benefit from the experience and knowledge that already exists within the wider digital preservation community.

We are three weeks into this project now and here is the first of a series of updates on progress.

One of the initial tasks for teams at both York and Hull was to ensure we had a current version of Archivematica installed. Over the next few weeks there will be a fair amount of testing going on to give us a greater understanding of Archivematica's strengths and weaknesses, particularly with regard to how it may handle research data.

[Image caption: A pod on the lake - a great venue for our kick-off meeting (though in reality the weather wasn't this nice)]

We also got together for a kick-off meeting in one of the pods on the lake on York's Heslington East campus. We defined our work packages, established responsibilities and deadlines, and now have a clear idea of what we are focusing on in this initial three-month phase that Jisc have agreed to fund.

Much of the research we will be carrying out and reporting on at the end of this phase of the project will be based around the following questions:

  • Why? Why are we bothering to 'preserve' research data? What are the drivers here and what are the risks if we don't?
  • What? What are the characteristics of research data and how might it differ from other born-digital data that memory institutions are establishing digital archives to manage and preserve? What types of files are our researchers producing and how would Archivematica handle these? What does Archivematica offer us and what benefits does it bring?
  • How? How would we incorporate Archivematica into a wider technical infrastructure for research data management and what workflows would we put in place? Where would it sit and what other systems would it need to talk to?
  • Who? Who else is using Archivematica (or other digital preservation systems) to do similar things and what can we learn from them?

[Image caption: Working in the pod - nice to have ducks for neighbours]

I've started off by giving some thought to the What?

A key part of this project is to look at what a digital preservation system such as Archivematica can offer us as part of an RDM infrastructure. In order to answer this we need to understand a bit about the nature of the research data that we will be asking it to handle. We have been collating existing sources of information about the types of software and data that researchers at the University of York are using, in order to get a clearer picture of what research data is likely to look like. Following on from this we can then start to look at how Archivematica would handle this data.

A couple of years ago at York we carried out some work looking at current data management practice by researchers. We interviewed the chair of the research committee for each academic department to get an overview of data management within the department, and also put out an online questionnaire to capture a wider and more detailed set of information from individual researchers across the University. This has given us an overview of the types of data that researchers collect or create. This information doesn't go down to the level of specific file types and versions, but it does cover broad categories of data (for example, which departments are working with databases, audio, video, images etc).

A subsequent survey carried out at York looked more specifically at software packages used by researchers and is a gold mine of information for our project, giving us a list of software packages that we can investigate in more detail. The 'top 20' software packages highlighted by this survey largely consist of software I have never used and whose outputs I have never tried to preserve - packages such as MATLAB, SPSS, NVivo, Gaussian and ChemDraw. We are investigating how existing digital preservation tools would handle these types of data, looking initially at whether their native file formats appear in PRONOM. We are talking to the helpful team at The National Archives about creating new file signatures for formats that are not currently represented. Knowing what you've got is one of the key challenges in digital preservation, and if the file identification tools out there can automatically recognise a wider range of formats then this is clearly going to be a step in the right direction, not just for this project but for the digital preservation community as a whole.
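
For readers less familiar with what a file signature is, here is a minimal Python sketch of the general idea: look at the first few bytes of a file and compare them against a table of known byte sequences. This is only an illustration and not how PRONOM-based tools work internally (real signatures can include offsets, wildcards and end-of-file patterns). The function names and the tiny signature table are made up for this example, and the table holds just a handful of widely documented leading byte sequences rather than anything taken from the PRONOM registry.

```python
from typing import Optional

# Illustrative table only (an assumption for this sketch, not PRONOM data):
# each entry pairs a human-readable label with the bytes that files of
# that format are commonly documented to start with.
SIGNATURES = [
    ("PDF document", b"%PDF-"),
    ("PNG image", b"\x89PNG\r\n\x1a\n"),
    ("ZIP container", b"PK\x03\x04"),
    ("MATLAB Level 5 MAT-file", b"MATLAB 5.0 MAT-file"),
]


def identify_bytes(header: bytes) -> Optional[str]:
    """Return the label of the first signature the header starts with."""
    for label, magic in SIGNATURES:
        if header.startswith(magic):
            return label
    # No match: the format is unknown to this (very small) table.
    return None


def identify_file(path: str) -> Optional[str]:
    """Read the first 32 bytes of a file and try to identify its format."""
    with open(path, "rb") as handle:
        header = handle.read(32)
    return identify_bytes(header)
```

A file that matches nothing in the table comes back as `None`, which is the situation we currently face with formats that have no signature: the tool simply cannot say what the file is until someone defines a pattern for it.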

Work will continue - watch this space for updates. In the meantime, we'd love to hear from you if you are using Archivematica (or another digital preservation system) for research data so we can find out about your workflows.

Jenny Mitcham, Digital Archivist

