Friday 28 August 2015

Enhancing Archivematica for Research Data Management

Where has the time gone? We are now one month into phase two of "Filling the Digital Preservation Gap", and I have spent much of this first month on holiday!

Not quite a 'digital preservation gap' - just an excuse to show you my holiday snaps!

So with no time to waste, here is an update on what we are doing:

In phase two of our project we have two main areas of work. Locally, at York and Hull, we will be planning in more detail the proof-of-concept implementations of Archivematica for research data that we hope to get up and running in phase three.

Meanwhile over in Canada, our collaborators at Artefactual Systems are starting work on a number of sponsored developments to help move Archivematica into a better position for us to incorporate it into our implementations for managing and preserving research data.

We have a project kick-off call with Artefactual Systems scheduled for next week, where we will discuss our requirements and specifications for development in more detail. In the meantime, here is a summary of the areas we are focusing on:

Automation of DIP generation on request 

Archivematica's AIP re-ingest functionality allows an AIP to be re-processed, which means that the generation of a DIP can be delayed until it is requested. This feature will build on that functionality to enable further automation of the process.

This feature is of particular benefit in situations where the value of data is not fully understood. It is unnecessary to create an access copy of all research datasets, as some of them will never be requested. In our workflows for long term management of research data we would like to trigger the creation of a copy of the data for dissemination and re-use on request, rather than create one by default. This piece of work will make that workflow possible.

METS parsing tools

This development will involve creating a Python library which could be used by third party applications to understand the METS file that is contained within an Archivematica DIP. Additionally an HTTP REST service would be developed to allow third party applications to interact with the library in a programming language agnostic fashion.

This is key to being able to work with the DIP that is created by Archivematica within other repository or access systems that are not integrated with Archivematica. Both York and Hull have repositories built with Fedora and Hydra and this feature will allow the repositories to better understand the DIP that Archivematica creates. This development is in no way specific to a Fedora/Hydra repository and will equally benefit other repository platforms in use for RDM.
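To give a flavour of what "understanding the METS file" might mean in practice, here is a minimal sketch using only the Python standard library. It is not the library Artefactual Systems will be developing; the sample METS fragment, the file IDs and paths, and the list_files function are all made up for illustration, and a real Archivematica METS file contains a great deal more than this.

```python
import xml.etree.ElementTree as ET

# Namespaces used by METS and XLink.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# A tiny, hand-written METS fragment (an assumption for illustration only).
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-001">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"
                     xlink:href="objects/data.csv"/>
      </mets:file>
      <mets:file ID="file-002">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"
                     xlink:href="objects/readme.txt"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
""".strip()


def list_files(mets_xml):
    """Return one dict per file listed in the METS fileSec (hypothetical helper)."""
    root = ET.fromstring(mets_xml)
    href_attr = "{%s}href" % NS["xlink"]
    files = []
    for group in root.findall("mets:fileSec/mets:fileGrp", NS):
        use = group.get("USE")
        for entry in group.findall("mets:file", NS):
            location = entry.find("mets:FLocat", NS)
            path = location.get(href_attr) if location is not None else None
            files.append({"id": entry.get("ID"), "use": use, "path": path})
    return files


if __name__ == "__main__":
    for item in list_files(SAMPLE_METS):
        print(item["id"], item["use"], item["path"])
```

A repository or access system could use something along these lines to find out which files a DIP contains before deciding how to ingest or display them.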

Improved file identification

This feature will enable Archivematica to report on any unidentified files within a transfer alongside access to the file identification tool output. Further enhancements could help curatorial staff to submit information to PRONOM by partially automating this process.

It was highlighted in our phase one report that the identification of research data file formats is a key area of concern when managing research data for the longer term. This feature will help users of Archivematica see which files haven’t been identified and thus enable them to take action to establish what they hold. This feature will also encourage communication with PRONOM to enhance the database of file formats for the future, thus enabling a more sustainable community approach to addressing this problem.
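As a rough illustration of the kind of report we have in mind, the sketch below picks out the files in a transfer that have no format identifier. The data structure, the function name and the example values are all assumptions made for this post; they do not reflect how Archivematica stores or exposes identification results.

```python
def unidentified_files(results):
    """Return the paths of files that have no format identifier recorded.

    `results` is a hypothetical list of dicts with a "path" key and an
    optional "puid" key (a PRONOM-style identifier, or None if the
    identification tool could not identify the file).
    """
    return [entry["path"] for entry in results if not entry.get("puid")]


if __name__ == "__main__":
    # Illustrative values only, not taken from a real transfer.
    example_results = [
        {"path": "objects/data.csv", "puid": "x-fmt/18"},
        {"path": "objects/instrument_output.xyz", "puid": None},
        {"path": "objects/model.dat"},
    ]
    for path in unidentified_files(example_results):
        print("Unidentified:", path)
```

A list like this would give curatorial staff a starting point for investigating what they hold and, where appropriate, for submitting information to PRONOM.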

Generic Search API

This involves the development of a proof-of-concept search REST API for Archivematica, allowing third party applications to query Archivematica for information about the objects in archival storage.

There is a need to be able to produce statistics or reports on RDM and digital preservation processes in order to obtain a clear picture of what data has been archived. This development will enable these statistics to be generated more easily and sustainably. For example this would enable tools such as the DMAonline dashboard in development at Lancaster University to pull out summary statistics from Archivematica.

Support for multiple checksum algorithms 

Currently Archivematica generates SHA256 checksums for all files, and inserts those into PREMIS fixity tags in the METS file. In addition, two premis:events are generated for each file. All three of these entries are currently hardcoded to assume SHA256. This development would include support for other hash algorithms such as MD5, SHA1 and SHA512.

Research data files can be large in size and/or quantity and may take some time to process through the Archivematica pipeline. One of the potential bottlenecks highlighted in the pipeline is the creation of checksums, which happens at more than one point in the process. SHA256 checksums can take a long time to create, and it has been highlighted that having the option to alter the checksum algorithm within Archivematica could speed things up. Having additional configuration options within Archivematica will give institutions the flexibility to refine and configure their pipelines to reduce bottlenecks where appropriate.
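To show what a configurable checksum algorithm could look like in general terms, here is a small sketch using Python's standard hashlib module. This is not Archivematica code; the file_checksum function and its parameters are hypothetical, and the example simply shows the algorithm being passed in as a setting rather than hardcoded.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file using the named hash algorithm.

    The file is read in chunks so that large research data files do not
    need to be held in memory all at once.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python checksum.py <file> [algorithm]
    target = sys.argv[1]
    chosen = sys.argv[2] if len(sys.argv) > 2 else "sha256"
    print(chosen, file_checksum(target, chosen))
```

How much time a different algorithm actually saves will depend on the files and the hardware, so it would be worth timing this against your own data before changing any settings.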

Documentation

The ability to automate processes relating to preservation is of primary importance where few resources are available to manually process data of unknown value. Fuller documentation of how an automated workflow can be configured within Archivematica using the APIs that exist would be very helpful for those considering using Archivematica for RDM and will help remove some of the barriers to its use. We will therefore be funding a small piece of work to help improve Archivematica documentation for developers and those installing or administering the system.


We very much hope these enhancements will be useful to the wider community of Archivematica users and not just to those looking specifically at preserving research data.

Our thanks go to Artefactual Systems for helping to turn our initial development ideas into these more concrete proposals.

As ever we are happy to receive your thoughts and feedback so do get in touch if you are interested in the work we are carrying out or have things to share with us around these development ideas.



Jenny Mitcham, Digital Archivist
