Tuesday, 30 August 2016

Filling the Digital Preservation Gap - a brief update

As we near the end of the active phase of Filling the Digital Preservation Gap* here is a brief update about where we are with the main strands of work we highlighted in our phase 3 kick off blog post.

Archivematica implementation

Work at York

Work is ongoing at York to get our proof of concept implementation of Archivematica up and running. The purpose of this work was not to get a production service in place but to demonstrate that the implementation plan we published in our phase 2 report was feasible. The implementation we are developing pulls metadata (about deposited research datasets) from PURE and provides a method for capturing additional information for managing datasets  (filling some of the information gaps that are not collected through the PURE datasets module). It also includes an automated process to ingest deposited datasets (along with their metadata) into Archivematica, package them up for longer term preservation and provide a dissemination copy of the dataset to our repository. 

We have been doing this work in consultation with the staff at York who actually work with datasets that are deposited through our Research Data York service to ensure that the workflows and processes we are putting in place will make their lives easier rather than harder! We are keen to ensure that those processes that can be automated are automated and those areas where human input is required trigger e-mail notifications to relevant staff and a pause in the workflow to enable the relevant checks to be made.

Work at Hull

Like York, Hull is looking to produce a proof-of-concept system within the timeframe of the project. Whilst concentrating on research data for this Jisc funded work, we have our eye also on later using our approach for other forms of repository content that deserve long-term preservation.  To that end, we are taking as our starting point the institutional Box folder that each of our staff has access to; we will be asking depositors to assemble their material for the repository in a folder within their Box account.  As well as the content itself they will be asked for basic metadata and processing instructions in a very simple format.  When the folder is ready they share it with another Box account “owned” by Archivematica.

Hull has developed a “Box watcher” which detects the new share and instigates processing of the contents, keeping the depositor aware of progress along the way.  The contents of the folder are examined and, depending on what is found and how it is configured, one or more Bags (as in the BagIt standard) are created and handed off to Archivematica.

Like York we are then looking to have a fully automated Archivematica workflow which produces Archival Information Packages corresponding to each of the bags.  In addition, Hull will have Archivematica create Dissemination Information Package(s) which, once created, will automatically be processed to produce objects in the quality assurance queue of our Hydra repository.

Unidentified file formats

It has been clear from our project work during phase 3 that research data is much harder to identify in an automated fashion than other types of born digital data that an archive would typically hold. If you don’t believe us, read these 2 blog posts that show contrasting results when trying to identify two different types of born digital data: 


So, how are we working towards a solution? As well as directly sponsoring the development of a small selection of research data file formats by the PRONOM team at The National Archives, we also had a go at creating our own. York’s new signature will be incorporated into PRONOM in due course. Hull’s signature has been submitted and is just being tested by the PRONOM team. There have also been positive discussions with colleagues at The National Archives about wider public engagement around file format signature development and how we work towards increasing the coverage of PRONOM for research data file formats.


Dissemination and outreach

The project team have been keen to continue their focus on dissemination during phase 3 of the project. This has included presentations or posters at the following conferences and events:

  • International Digital Curation Conference (IDCC16), Amsterdam
  • 'Digital Preservation: Strategic Issues' - National Library of Wales
  • UK Archives Discovery Forum, Kew
  • UK Archivematica meeting, York
  • Research Data, Records and Archives: Breaking the Boundaries, Edinburgh
  • Open Repositories, Dublin
  • Jisc CNI conference, Oxford
  • Hydra Virtual Connect
  • TNA Digital Transformation day, Kew

...and our outreach work continues. Watch out for us at the Jisc Research Data Network event in Cambridge next week, the next UK Archivematica meeting in Lancaster the week after, the iPRES conference and Hydra Connect in October and of course the final Jisc Research Data Spring showcase event which will be later on in October.

And of course we have been blogging as usual throughout this phase of the project so do read back to see our previous posts for more information and watch out for our phase 3 final report in mid-October.



* we formally complete the project work on 14th Sept and will focus on writing up our final report over the following month




Jenny Mitcham, Digital Archivist

No comments:

Post a Comment

The sustainability of a digital preservation blog...

So this is a topic pretty close to home for me. Oh the irony of spending much of the last couple of months fretting about the future prese...