A guest post by Richard Green, who has been leading the University of Hull's technical investigations for "Filling the Preservation Gap".
Jenny is away from her desk at the moment, so I've been deputised to provide a blog post about the work we've been doing at the University of Hull as part of the Jisc-funded "Filling the Preservation Gap" (FPG) project. In particular, we (the FPG team) want to mention a poster that we prepared for a recent conference in the US.
Hull has had a digital repository in place for a considerable number of years. It has always had the Fedora (now Fedora Commons) repository software at its heart and for several years has deployed Hydra on top of that - indeed, Hull was a founder member of the Hydra Project. With the established repository goes an established three-stage workflow for adding content (sketched as code below). Content is initially created in a “proto-queue” by a user who, when happy with it, transfers it to the ownership of the Library, which takes it through a quality assurance process. When the Library team is satisfied, the content is transferred to the repository "proper" with appropriate access permissions. The repository contains a wide range of materials, and over time we are developing variants of this basic workflow suited to each content type. That activity is constrained by limited resources, however, and we know there are other variations we would like to, and can, develop when circumstances permit. The lack of a specific workflow for research data management (RDM), encompassing the possible need for long-term preservation, was one of the reasons for getting involved in the FPG project.
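For readers who find code clearer than prose, the three stages can be pictured as a simple, forward-only state machine. This is a minimal sketch in Python with illustrative names, not our actual implementation:

from enum import Enum

class QueueStage(Enum):
    PROTO_QUEUE = "proto-queue"   # the creator is still shaping the content
    QA_QUEUE = "qa-queue"         # owned by the Library for quality assurance
    REPOSITORY = "repository"     # live, with access permissions applied

# Each hand-off moves strictly forward and changes who owns the object.
NEXT_STAGE = {
    QueueStage.PROTO_QUEUE: QueueStage.QA_QUEUE,
    QueueStage.QA_QUEUE: QueueStage.REPOSITORY,
}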
Whilst the focus of the FPG project is clearly research data, it became apparent during our initial work with Archivematica that the preservation needs of research data are not so far removed from those of some of our other content. That being the case, we have kept our eye on the bigger picture whilst concentrating on RDM for the purposes of our Jisc project. We have spent some time putting together an initial all-encompassing design within which an RDM workflow for the FPG project would be but one possible path. It is that overall picture that became our poster.
The Hydra Community holds one major get-together each year, the "Hydra Connect" conference. The last full week in September saw 200 people from 55 institutions gather in Minneapolis for Connect 2015. A regular feature of these conferences, and one much appreciated by attendees, is an afternoon given over to a poster session during which people can talk about the work they are doing with Hydra. Each institution is strongly encouraged to contribute, and so Hull took along its grand design as its offering.
Poster for Hydra Connect 2015 in Minneapolis, MN, September 2015
So that’s the poster and here’s a somewhat simplified explanation!
Essentially, content comes in at the left-hand side. The upper entry point corresponds to a human workflow of the type we already have. The diagram proposes that this workflow gain the option of sending the digital content of an object through Archivematica, both to create an archival information package (AIP) for preservation and to take advantage of capabilities such as the software’s generation of technical metadata. The dissemination information package (DIP) that Archivematica produces is then “mined” for content and metadata to be inserted into the repository object that already contains the creator’s descriptive metadata record. One of the items mined is a UUID that ties the record in the repository to the AIP, which goes to a separate preservation store.
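As a rough illustration of that “mining” step, the sketch below assumes the conventional Archivematica DIP layout - a METS.<uuid>.xml file at the top level, whose name carries the package UUID, and access copies under objects/ - and mine_dip is a hypothetical helper rather than our actual DIP processor:

import re
from pathlib import Path

# Archivematica names the DIP's METS file after the package UUID
# (an assumption based on the standard DIP layout).
METS_PATTERN = re.compile(r"METS\.([0-9a-f\-]{36})\.xml")

def mine_dip(dip_dir):
    """Pull the AIP UUID and the dissemination files out of a DIP."""
    dip = Path(dip_dir)
    aip_uuid = None
    for entry in dip.iterdir():
        match = METS_PATTERN.match(entry.name)
        if match:
            aip_uuid = match.group(1)  # ties the repository record to the AIP
            break
    # Access copies live under objects/, each prefixed with a file UUID.
    dissemination_files = sorted(p for p in (dip / "objects").iterdir() if p.is_file())
    return aip_uuid, dissemination_files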
The lower entry point corresponds to an automated (well, maybe semi-automated) batch ingest process. In this case the DIP processor creates a repository object from scratch and, in addition to any dissemination files and technical metadata, provides the descriptive metadata too. There are a number of scenarios for generating that descriptive metadata: at one extreme it might be detailed fields extracted from an accompanying file; at the other, minimal metadata derived from context (the particular ingest folder and the title of the master file, for instance). There will be circumstances in which we create a metadata-only record for the repository and do not include dissemination files in the repository object; under these circumstances the UUID in the metadata would allow us to retrieve the AIP from the store and create a new DIP should anyone ever request the data itself.
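At the minimal end of that spectrum, deriving a record from context might look like this sketch (the field names and the minimal_metadata helper are illustrative assumptions, not our actual processor):

from pathlib import Path

def minimal_metadata(master_file, ingest_folder):
    """Fallback descriptive metadata when no accompanying record is
    supplied: a title taken from the master file's name and a collection
    hint taken from the ingest folder."""
    return {
        "title": Path(master_file).stem.replace("_", " "),
        "collection": Path(ingest_folder).name,
    }

# e.g. minimal_metadata("rainfall_2014_hull.csv", "incoming/geography")
#      -> {"title": "rainfall 2014 hull", "collection": "geography"}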
Finally, we have content already in the repository where it is being “kept safe” but which really justifies a proper preservation copy. We shall create a workflow that allows such content to be passed to Archivematica so that an AIP can be created and stored. It is probable that this route would use the persistent identifier (PID) of the Fedora object, rather than a UUID mined from a DIP, as the link to the AIP.
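In outline, that retrospective route might look like the following sketch, where fetch_datastreams and start_transfer are hypothetical stand-ins for the repository and Archivematica calls:

def preserve_existing_object(pid, fetch_datastreams, start_transfer):
    """Sketch of the retrospective route: pull the files belonging to an
    existing Fedora object and hand them to Archivematica, embedding the
    object's PID in the transfer name so that the eventual AIP can be
    tied back to the repository record."""
    files = fetch_datastreams(pid)  # e.g. "hull:1234" -> list of file paths
    transfer_name = "retrospective-" + pid.replace(":", "-")
    return start_transfer(transfer_name, files)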
Suffice it to say that the poster was well received. It generated quite a lot of interest, some of it from surprising quarters. In conversation with one well-established practitioner in the repository field, from a major US university, I was told “I’ve never thought of things quite like that – and you’re probably right!” It’s sometimes reassuring to know that the work we undertake at the smaller UK universities is nevertheless respected in some of the major US institutions!
If you have any comments about our work, or would like further details, please get in touch via this blog. We’re interested in your thoughts, perspectives and ideas about the approach.
Jenny Mitcham, Digital Archivist