Improving RDM workflows with Archivematica

In a previous post we promised diagrams to help answer the 'How?' of the 'Filling the Digital Preservation Gap' project. In answer to this, here is a guest post by Julie Allinson, Manager of Digital York, who looks after all things Library and Archives IT.

Having just completed phase one of a Jisc Research Data Spring project with the University of Hull, we have been thinking a lot about the potential phases two and three which we are hoping will follow. But even if we aren’t funded to continue the project, the work so far won’t be wasted here at York (and certainly the wider community will benefit from our project report!) as it has given us some space to look at our current RDM workflow and look for ways it might be improved, particularly to include a level of curation and preservation.

My perspective on all of this is to look at how things can fit together and I am a believer in using the right system for the right job. Out of lack of resource, or misunderstanding, I feel we often retro-fit existing systems and try to make them meet a need for which they weren’t designed. To a degree I think this happens to PURE. PURE is a Current Research Information System (CRIS), and a leading product in that space. But it isn’t (in my opinion) a data repository, a preservation system, a management tool for handling data access requests or any other of a long list of things we might want it to be.

For York, PURE is where we collect information about our research and our researchers. PURE provides both an internal and, through the PURE portal, an external view of York’s research. For the Library, it is where we collect information about research datasets and outputs. In our current workflow, the full-text is sent to our research repository for storage and dissemination. The researcher interfaces only with PURE for deposit; the rest is done by magic. For research data the picture is similar, but we currently take copies of data in only some cases, where there is no suitable external data archive, and we do this ‘offline’, using different mechanisms depending on the size of the dataset.

Our current workflow for dealing with Research Data deposits is OK: it works, and it is probably similar to that of many institutions still feeling their way in this new area of activity. It looks broadly like this:

  • Researcher enters dataset metadata into PURE
  • Member of Library staff contacts them about their data and, if appropriate, takes a copy for our long term store
  • Library staff check and verify the metadata, and record extra information as needed.
  • PURE creates a DOI
  • DOI is updated to point to our interim datasets page (default PURE behaviour is to create a link to the PURE portal, which we have not currently implemented for datasets)

But … it has a lot of manual steps in it, and we don’t check or verify the data deposited with us in any systematic way. Note all of the manual (dotted) steps in the following diagram.

How we do things at the moment

I’d like to improve this workflow by removing as many manual steps as possible, thereby increasing efficiency, eradicating re-keying and reducing the margin for error, whilst at the same time adding in some proper curation of the data.

The Jisc Research Data Spring project allowed us to ask the question: ‘Can Archivematica help?’, and without going into much detail here, the answer was a resounding ‘Yes, but...’.

What Archivematica can most certainly give us is a sense of what we’ve got, to help us ‘peer into the bag’ as it were, and it can have a good go at both identifying what’s in there and at making sure no nasties, like viruses and corrupt files, lurk.

Research data has a few prominent features:

  1. lots of files of types that standard preservation tools can’t identify
  2. bags of data we don’t know a helluva lot about and where the selection and retention process has been done by the researcher (we are unlikely to ever have the resource or the expertise to examine and assess each deposit)
  3. bags of data that, at this point, we don’t really know will ever be requested

My colleague, Jen, has been looking at 1) and I’ve focussed my work on 2) and 3). I had this thought:

Wouldn’t it be nice if we could push research data automatically through a simple Archivematica pipeline for safe and secure storage, but only deliver an access copy if one is requested?

Well, Archivematica CAN be almost entirely automated and it WILL be able to generate the DIP at a later step, manually, in the coming months. And with some extra funded development it COULD support automated DIP creation via an API call.

So, what’s missing now? We can gather metadata in PURE and make sure the data we get from researchers is verified and looked after, and service access requests by generating a DIP.

How will we make that DIP available? Well, we already have a public facing Digital Library which can offer that service.

What’s missing is the workflow, the glue that ties all of these steps together, and really I’d argue that’s not the job of either Archivematica or PURE.

In the diagram below I’ve proposed an architecture for both deposit and access that would both remove most manual steps and add Archivematica into the workflow.

In the deposit workflow the researcher would continue to add metadata to PURE. We have kept a manual communication step here as we feel that there will almost certainly be some human discussion needed about the nature of the data. The researcher would then upload their data, and we would be able to manage all subsequent steps in a largely automated way, using a lightweight ‘RDMonitor’ tool to match up the data with the PURE metadata using information gleaned from the PURE web services, to log the receipt of data and to initiate the transfer to Archivematica for archiving.

How we'd like to do things in the future - deposit workflow
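
As a rough illustration of the deposit workflow described above, here is a minimal Python sketch of what the ‘RDMonitor’ glue might look like. Nothing here is the project's actual code: the `RDMonitor` class, `DatasetRecord`, and the two callables `fetch_metadata` and `start_transfer` are hypothetical stand-ins for a lookup against the PURE web services and a transfer into Archivematica, neither of which is called for real.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical subset of the dataset metadata held in PURE."""
    pure_id: str
    title: str


@dataclass
class RDMonitor:
    """Illustrative glue: match an upload to PURE metadata, log it, hand it on."""
    # Assumption: a real tool would query the PURE web services here.
    fetch_metadata: Callable[[str], Optional[DatasetRecord]]
    # Assumption: a real tool would start an Archivematica transfer here.
    start_transfer: Callable[[str, DatasetRecord], None]
    log: List[dict] = field(default_factory=list)

    def receive_upload(self, pure_id: str, upload_path: str) -> bool:
        """Return True if the upload was matched and passed on for archiving."""
        record = self.fetch_metadata(pure_id)
        if record is None:
            # No matching PURE record: log it and leave it for a human.
            self._log("unmatched", pure_id, upload_path)
            return False
        self._log("received", pure_id, upload_path)
        self.start_transfer(upload_path, record)
        self._log("transfer_started", pure_id, upload_path)
        return True

    def _log(self, event: str, pure_id: str, path: str) -> None:
        self.log.append({
            "event": event,
            "pure_id": pure_id,
            "path": path,
            "at": datetime.now(timezone.utc).isoformat(),
        })


# Stand-in data and services, for demonstration only.
records = {"pure-123": DatasetRecord("pure-123", "Example dataset")}
transfers = []

monitor = RDMonitor(
    fetch_metadata=records.get,
    start_transfer=lambda path, rec: transfers.append((path, rec.pure_id)),
)
monitor.receive_upload("pure-123", "/uploads/example.zip")
monitor.receive_upload("pure-999", "/uploads/unknown.zip")

for entry in monitor.log:
    print(entry["event"], entry["pure_id"])
```

Passing the two services in as plain callables keeps the monitor itself small and independent of either system, which is in the spirit of a lightweight interface that only connects things together.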

In the standard access workflow, the requestor would send an email which would automatically be logged by the Monitor, initiate the creation of a dissemination version of the data and its transfer to our repository, and send an automated email to the requestor to alert them to the location of the data. Manual steps in this workflow are needed to add the direct link to the data in PURE (PURE has no APIs for POSTing updates, only GETting information) and also to ensure due diligence in making data available.

How we'd like to do things in the future - access to research data
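
The access workflow can be sketched in the same hedged way. This is only an illustration under stated assumptions: `AccessMonitor` and the callables `create_dip`, `publish` and `notify` are hypothetical placeholders for DIP generation, transfer to the repository and the automated email. Reusing an already published DIP for repeat requests is an assumption of this sketch, and the due diligence check is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AccessMonitor:
    """Illustrative glue for on-demand access copies."""
    create_dip: Callable[[str], str]    # hypothetical: dataset id -> DIP location
    publish: Callable[[str], str]       # hypothetical: DIP location -> public URL
    notify: Callable[[str, str], None]  # hypothetical: (requestor, URL) -> email
    published: Dict[str, str] = field(default_factory=dict)
    requests: List[dict] = field(default_factory=list)
    manual_tasks: List[str] = field(default_factory=list)

    def handle_request(self, dataset_id: str, requestor: str) -> str:
        # Every request is logged, whether or not a DIP already exists.
        self.requests.append({"dataset": dataset_id, "requestor": requestor})

        url = self.published.get(dataset_id)
        if url is None:
            # First request for this dataset: only now is an access copy made.
            dip_location = self.create_dip(dataset_id)
            url = self.publish(dip_location)
            self.published[dataset_id] = url
            # PURE cannot be updated via its API, so queue this for a person.
            self.manual_tasks.append(
                f"Add link {url} to the PURE record for {dataset_id}"
            )

        self.notify(requestor, url)
        return url


# Stand-in services, for demonstration only.
outbox = []
access = AccessMonitor(
    create_dip=lambda dataset_id: f"/dips/{dataset_id}",
    publish=lambda location: "https://repository.example/" + location.rsplit("/", 1)[-1],
    notify=lambda requestor, url: outbox.append((requestor, url)),
)
access.handle_request("dataset-1", "reader@example.org")
access.handle_request("dataset-1", "another@example.org")

print(len(access.requests), "requests,", len(access.published), "DIP published")
```

The `manual_tasks` list stands in for the one step in the text that cannot be automated: someone still has to add the direct link to the record in PURE.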

These ideas need more unpicking over the coming months, and there are other ways that they could be delivered, using PURE for the upload of data, for example. We can also see potential extensions, like pulling in a Data Management Plan from DMPOnline to store with the archive package. This post should be viewed as a first stab at illustrating our thinking in the project, motivated by the idea of making lightweight interfaces to connect systems together. Hull will have similar, but different, requirements that they want to explore in further phases too.

Jenny Mitcham, Digital Archivist

