Monday, 14 November 2016

Automating transfers with Automation Tools

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie has been working on York's implementation for the 'Filling the Digital Preservation Gap' project. This post describes how we have used Artefactual Systems' Automation Tools at York.

For Phase three of our 'Filling the Digital Preservation Gap' project we have delivered a proof-of-concept implementation to illustrate how PURE and Archivematica can be used as part of a research data management lifecycle.

One of the requirements for this work was the ability to fully automate a transfer in Archivematica. Automation Tools is a set of Python scripts from Artefactual Systems designed to help with this.

Automation Tools works by running a script (transfer.py) at a regular interval (as a cron task). The script is fed a set of parameters and, based on these, checks for new transfers in the given transfer source directory. On finding one, it initiates and approves a transfer in Archivematica.

One of the neat features of Automation Tools is that if you need custom behaviour, there are hooks in the transfer.py script that can run other scripts within specified directories. The 'pre-transfer' scripts are run before the transfer starts and 'user input' scripts can be used to act when manual steps in the processing are reached. A processing configuration can be supplied and this can fully automate all steps, or leave some manual as desired.
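As a minimal sketch of what such a hook can look like: the script below drops a marker file into the transfer directory before ingest begins. It assumes (as with the example scripts shipped in the repository) that pre-transfer scripts receive the transfer UUID and the transfer path as command-line arguments — check the bundled examples for the exact convention in your version.

```python
#!/usr/bin/env python
# Hypothetical pre-transfer hook for Automation Tools. Assumes the
# convention of passing the transfer UUID and transfer path as
# command-line arguments; this is a sketch, not York's actual script.
import os
import sys


def main(transfer_uuid, transfer_path):
    # Any per-transfer preparation goes here; as an illustration,
    # record the UUID inside the transfer directory.
    marker = os.path.join(transfer_path, "transfer_uuid.txt")
    with open(marker, "w") as f:
        f.write(transfer_uuid + "\n")
    return 0


if __name__ == "__main__" and len(sys.argv) > 2:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Scripts in the pre-transfer directory are run in alphabetical order, which is why the scripts described below carry numeric prefixes.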

The best way to use Automation Tools is to fork the GitHub repository and then add local scripts into the pre-transfer and/or user-input directories.

So, how have we used Automation Tools at York?

When a user deposits data through our Research Data York (RDYork) application, the data is written into a folder within the transfer source directory, named with the ID of our local Fedora resource for the data package. The directory sits on filestore shared between the Archivematica and RDYork servers. On seeing a new transfer, three scripts run:

1_datasets_config.py - this script copies the dedicated datasets processing config into the directory where the new data resides.
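In outline (the paths here are illustrative, not York's actual layout), such a script only needs to copy a processingMCP.xml into place — Archivematica uses a processingMCP.xml found at the root of a transfer in preference to the default configuration:

```python
# Sketch of a pre-transfer script that copies a dedicated processing
# configuration (processingMCP.xml) into a new transfer directory.
# CONFIG_SOURCE is a hypothetical location for the datasets config.
import os
import shutil

CONFIG_SOURCE = "/etc/archivematica/datasets-processingMCP.xml"


def install_processing_config(transfer_path, config_source=CONFIG_SOURCE):
    # Archivematica picks up processingMCP.xml at the transfer root
    # and applies it instead of the default processing configuration.
    destination = os.path.join(transfer_path, "processingMCP.xml")
    shutil.copyfile(config_source, destination)
    return destination
```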

2_arrange_transfer.py - this script simply makes sure the correct file permissions are in place so that Archivematica can access the data.
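A permissions fix-up of this kind can be as simple as walking the transfer directory and making everything readable (and directories traversable); the modes below are an illustrative choice, not necessarily the ones York uses:

```python
# Sketch of a pre-transfer permissions fix-up: make the whole transfer
# readable by the Archivematica user. Modes are illustrative.
import os


def fix_permissions(transfer_path):
    os.chmod(transfer_path, 0o775)  # rwxrwxr-x on the transfer root
    for root, dirs, files in os.walk(transfer_path):
        for d in dirs:
            os.chmod(os.path.join(root, d), 0o775)  # rwxrwxr-x
        for f in files:
            os.chmod(os.path.join(root, f), 0o664)  # rw-rw-r--
```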

3_create_metadata_csv.py - this script looks for a file called 'metadata.json', which contains metadata from PURE, and, if it finds one, processes the contents and writes out a metadata.csv file in a format that Archivematica will understand.

These scripts are all fairly rudimentary, but could be extended for other use cases, for example to process metadata files from different sources or to select a processing configuration for different types of deposit.
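As a sketch of the metadata conversion step: Archivematica's metadata.csv format is a header row with a filename column plus Dublin Core columns, placed in a metadata/ subdirectory of the transfer, with the filename value "objects" applying the metadata to the package as a whole. The shape of the metadata.json read here is invented for illustration — the real PURE export is richer.

```python
# Sketch of converting a deposited metadata.json into the metadata.csv
# format Archivematica understands. The structure of metadata.json
# assumed below is illustrative, not the actual PURE export format.
import csv
import json
import os


def create_metadata_csv(transfer_path):
    json_path = os.path.join(transfer_path, "metadata.json")
    if not os.path.exists(json_path):
        return None
    with open(json_path) as f:
        record = json.load(f)  # e.g. {"title": ..., "creator": ...}

    metadata_dir = os.path.join(transfer_path, "metadata")
    os.makedirs(metadata_dir, exist_ok=True)
    csv_path = os.path.join(metadata_dir, "metadata.csv")
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "dc.title", "dc.creator", "dc.date"])
        # A filename of 'objects' applies the metadata to the whole
        # package rather than to an individual file.
        writer.writerow([
            "objects",
            record.get("title", ""),
            record.get("creator", ""),
            record.get("date", ""),
        ])
    return csv_path
```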

Our processing configuration for datasets is fully automated, so by using Automation Tools we never have to look at the Archivematica interface.

Using transfer.py as inspiration, I have added a second script called status.py. This speaks directly to APIs in our researchdatayork application and updates our repository objects with information from Archivematica, such as the UUID of the AIP and the location of the package itself. In this way our two 'automation' scripts keep researchdatayork and Archivematica in sync: Archivematica is alerted when new transfers appear and automates the ingest, and researchdatayork is updated with the status once Archivematica has finished processing.
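The researchdatayork side of status.py is local to us, but the Archivematica side uses the standard ingest status endpoint, which authenticates with an ApiKey header. A hedged sketch of that half (host, username and API key are placeholders):

```python
# Sketch of how a script like status.py can poll Archivematica's
# ingest status API. Host, username and API key are placeholders;
# pushing the result into the repository is left out.
import json
from urllib.request import Request, urlopen

AM_URL = "http://archivematica.example.ac.uk"
AM_USER = "automation"
AM_API_KEY = "REPLACE_ME"


def build_status_request(sip_uuid):
    # Archivematica's API authenticates with an 'ApiKey user:key' header.
    url = "{}/api/ingest/status/{}/".format(AM_URL, sip_uuid)
    return Request(url, headers={
        "Authorization": "ApiKey {}:{}".format(AM_USER, AM_API_KEY),
    })


def get_ingest_status(sip_uuid):
    # Returns the JSON status document, which includes a 'status' field
    # (e.g. PROCESSING or COMPLETE) and, once stored, the AIP's UUID.
    with urlopen(build_status_request(sip_uuid)) as response:
        return json.load(response)
```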

The good news is that the documentation for Automation Tools is very clear, which makes it pretty easy to get started. Read more at https://github.com/artefactual/automation-tools
