Digital Archiving at the University of York: July 2016

Thursday, 7 July 2016

On the closure of Edina's Unlock service

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie managed the technical side of the 'York's Archbishops' Registers Revealed' project. This post discusses the demise of Edina's Unlock service and wonders how sustainable open data services are.

It has recently come to my attention that Edina are retiring their 'Unlock' service on the 31st July 2016. Currently that's all I know as, AFAIK, Edina haven't provided any background or any information about why, or what users of this service might do instead. I also wasn't aware of any kind of consultation with users.

Edina's message about the Unlock service - not very informative.

At York we've been using Unlock to search the DEEP gazetteer of English place names in our Archbishops' Registers editing tool. DEEP is a fantastic resource, an online gazetteer of the 86 volume corpus of the Survey of English Place-Names (SEPN). Without Edina's Unlock service, I don't know any way of programmatically searching it.

Records from DEEP via Edina's Unlock service in our Archbishops' Registers editing tool

Co-incidentally (or not?) the http://www.placenames.org.uk/ web site, which the records returned via the Unlock DEEP search link to, is unavailable. The site is a Web site for the DEEP data and our editors use this site to help identify and disambiguate place names.

I contacted Edina and they have promised to pass on further information about the end of the Unlock service as it becomes available later in the month. They also pointed me to the Institute for Name Studies at the University of Nottingham (INS) to find out why the place names site was unavailable. The initial response from INS was 'this is not our website'. I mentioned that placenames.org.uk is listed as one of their resources and they are now following up for me with a colleague, who is away at the moment. I've also contacted the Centre for Data Digitisation and Analysis (CDDA) at Queen's University Belfast who mention the site as one of their 'project databases' on the CDDA web site. As of writing, I haven't had a response.

UPDATE: As of Thursday 7th July, the site seems to be back. Yay!

Entry from the Archbishops' Registers places list, complete with broken link

All of this is quite worrying as it means many wasted development hours implementing this feature and reflects badly on our site - displaying broken links is not something we want to do.

So what do we do now? Well, in practical terms, unless I receive any other information from Edina to the contrary, I'll write out our use of Unlock / DEEP from the editing tool at the beginning of August and our editors will have to switch to manually creating place records in the tool. We've also been using Unlock to search the Ordnance Survey, so I'll hopefully be able to add a search of their linked data services directly. But we particularly liked as it DEEP gave us historical place names and enough information to help editors make sure they were selecting the correct contemporary place.

The bigger questions that this raises, though, are:

how do we ensure that important datasets and services coming out of projects can be sustained?

how can we trust that open data services will continue to be available? even those that appear to have the backing of a service provide like Edina or Jisc

how do we find out when they aren't?

and how do we have a voice when decisions like this are being made?

Jenny Mitcham, Digital Archivist

Monday, 4 July 2016

New research data file formats now available in PRONOM

Like all good digital archivists I am interested in file formats and how we can use tools to automatically identify them. This is primarily so that when we package our digital archives up for long term preservation we can do so with a level of knowledge about how we might go about preserving and providing access to them in the future. This information is key whether migration or emulation is the preservation or access strategy of choice (or indeed a combination of both).

It has been really valuable to have some time and money as part of our "Filling the Digital Preservation Gap" project to be able to investigate issues around the identification of research data file formats and very pleasing to see the latest PRONOM signature release on 29th June which includes a couple of research data formats that we have sponsored as part of our project work.

I sent a batch of sample files off to the team who look after PRONOM at The National Archives (TNA) with a bit of contextual information about the formats and software/hardware that creates them (that I had uncovered after a bit of research on Google). TNA did the rest of the hard work and these new signatures are now available for all to use.

The formats in question are:

Gaussian input files - These are created for an application called Gaussian which is used by many researchers in the Chemistry department here in York. In a previous project update you can see that Gaussian was listed in the top 10 research software applications in use at the University of York. These files are essentially just ascii text files containing instructions for Gaussian and they can have a range of file extensions (though the samples I submitted were all .gjf). Though there is a recommended format or syntax for these instructions, there also appears to be flexibility in how these can be applied. Consequently this was a slightly challenging signature for TNA to work on and it would be useful if other institutions that have Gaussian input files could help test this signature and feedback to TNA if there are any problems or issues. In instances like this being able to develop against a range of sample files created at different times in different institutions by different researchers would help.
JEOL NMR Spectroscopy files - These are data files produced by JEOL's Nuclear Magnetic Resonance Spectrometers. These facilities at the University of York are clearly well used as data of this type was well represented in an initial assessment of the data that I reported on in a blog post last month (130 .jdf files were present in the sample of 3752 files). As these files are created by a piece of hardware in a very standard way, I am told that signature developers at TNA were able to create a signature without too many problems.

Further formats submitted from our project will appear in PRONOM within the next couple of months.

The project team are also interested in finding out how easy it is to create our own file format signatures. This is an alternative option for those who want to contribute but not something we have attempted before. Watch this space to find out how we get on!

Jenny Mitcham, Digital Archivist

Modelling Research Data with PCDM

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie has been working on York's implementation for the 'Filling the Digital Preservation Gap' project. This post discusses preliminary work to define a data model for 'datasets'.

For Phase three of our 'Filling the Digital Preservation Gap', I've been working on implementing a prototype to illustrate how PURE and Archivematica can be used as part of a Research Data management lifecycle. Our technology stack at York is Hydra and Fedora 4 but it's an important aspect of the project to ensure that the thinking behind the prototype is applicable to other stacks. Central to adding any kind of data to a research information system, repository or preservation tool is the data model that underpins the metadata. For this I've been making use of the Portland Common Data Model (PCDM) and it's various extensions (particularly Works).

In the past couple of years there has been a lot of work happening around PCDM, described as "a flexible, extensible domain model that is intended to underlie a wide array of repository and DAMS applications". PCDM provides a small Models ontology of classes and properties, with extension ontologies for Works and Use, among others. I like PCDM because it is high level enough to provide a language to talk across different domains and use cases.

Datasets data model version 1

My first attempt at a data model for datasets based on PCDM can be seen below.

Datasets Data Model v1

Starting with the Dataset object above, for York this is equivalent to a dataset record in our PURE research information system. The only metadata expected at this level is about the dataset as a whole, and for us, will largely come from PURE.

Below this, you'll see that I've begun to model in some OAIS constructs: the Archival Information Package (AIP) and Dissemination Information Package (DIP). The AIP is the deposit of data, prepared for preservation, a single package of data "consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS." (DCC Glossary). The DIP is a representation of the Content Information in the AIP, produced for dissemination. OAIS, by the way, gives no standard approach for structuring the AIP or DIP.

As it stands, this model does not consider how to 'unpack' the AIP at all and for our prototype - we are (or were - see below) intending to simply point to the AIP in Archivematica.

The DIP as illustrated above is based on what Archivematica generates. The diagram includes the processing configuration and METS files that Archivematica produces by default as filesets, for illustration. These aren't part of the dataset as deposited, hence not making them 'Works' in their own right.

GenericWork is intended for each Unit or 'Work' in the dataset as deposited, or representation thereof. GenericWork is intended for use with any kind of data object. it might be independently re-used, eg. as a member of another dataset. In most cases for research data we probably won't know much about what the data is and so GenericWork will be used, but sometimes it may make sense to use existing models. For example, if the dataset is a collection of images then a more tailored Image model could be used for each Image, or if a dataset includes some existing objects that are already in our repository, those might already have different models. They can still be members of our dataset.

The model is intended to allow for a Dataset to include multiple AIPs and thus I have suggested a local predicate for hasDIP / hasAIP to establish the relationship between the AIP and the DIP.

The trouble with DIPs : an alternative model

Discussing this model with Justin Simpson from Artefactual Systems recently, I got to thinking that the DIP is a really artificial construct, effectively a presentation 'view' on the data and not really separate 'Work' in it's own right. Archivematica's DIP, at present, provides only files, and doesn't reflect the original folder structure, which may well be meaningful and necessary for a dataset. Perhaps what we really need is the AIP, described fully in Fedora, leaving presentation of the DIP to the interface level? A set of rules for producing a DIP to our own local specification might go like this (using the PCDM Use ontology): if there is a Preservation Master file, produce a Service File for user access, otherwise present the Original File to the user.

The new model would look something like this:

Datasets Data Model alternative

The DIP would be a view constructed from elements of the AIP:

This model feels like a better approach than that in version 1 as it facilitates describing the 'whole' dataset. I do have some immediate questions about the model, though:

Is a dataset really a pcdm:Collection, rather than a Work?
If the GenericWork is each data file irrespective of whether it can be used/understood on it's own, how useful is that in reality? Is the GenericWork really needed, or are FileSets enough? Is there genuinely value is identifying each individual piece of data as a 'Work'? (re-use outside of the dataset, for example)

And when thinking beyond the model, about how this would actually work for different use cases, implementations questions start to surface.

Beyond the model

1) Dataset size and structure

Datasets may contain thousands, millions even, of files structured into folders where folders may impart meaning to the data, or be purely arbitrary. Fedora 4 can, by design, handle a folder structure using it's implementation of LDP Basic Containers. As illustrated below, each folder is a 'Basic Container' and each data file is a Work, with FileSet and File objects.

AIP ldp:contains folder1

folder1 ldp:contains folder2

folder2 ldp:contains folder3

folder3 ldp:contains GenericWork

GenericWork pcdm:hasMember FileSet

But if each of those folders are objects in Fedora and each file is in Fedora with a Work, FileSet and several File objects, then the number of Fedora objects begins to rise exponentially

Would it be better to avoid object-cost to Fedora of creating many many objects?
What alternative approaches could we take? Use BagIt and have Fedora reference only the 'bag'? Store only data files and create an 'order' to represent the folder structure (as outlined in PCDM 2.0)?

2) Storing Files

'Where should data actually be stored permanently?' is another practical question I've been thinking about. On the one hand, Archivematica makes the AIP and it's files available via URLs in the Storage Service and stores the file in a location of your choosing. On the other, Fedora can contain data files, or reference them via a URI. This gives gives us the flexibility to do several things:

Leave the AIP and DIP in Archivematica's stores and use URIs in Fedora to reference all of the files and build a PCDM-modelled view of the data (Archivematica as preservation and access store).
Manage all files in Fedora, treating the Archivematica copy of the data as temporary (Archivematica as sausage factory).
Have two copies of AIP data, one in Archivematica and one in Fedora (LOCKSS model).
Manage Preservation files in Archivematica and delivery/access files in Fedora (Archivematica as preservation and Fedora as access).

Keeping data in Archivematica makes is easy to do additional preservation actions in future, such as re-ingesting when format policy rules change, whereas managing all files within Fedora unlocks the possibilities of Fedora's audit functionality, fixity checking and versioning. Having two copies is attractive as a preservation strategy, but could be difficult to justify and sustain if data collections grow to a significant size.

On balance I think option (4) is best for the short-term with other options worth re-considering as both Archivematica and Fedora mature. But I'd be really keen to hear different views.

Conclusions

Hopefully this post illustrates that creating an outline data model is pretty easy, but when it comes to thinking about it in terms of implementation decisions, all kinds of ifs and buts start coming up.

In the model above, each data file is a Work. Each work contains one or more FileSets, and each FileSet contains one or more different representations of the file.

Is it really possible to define a general datasets model that could encompass data from across disciplines, of various sizes, structures and created for a variety of purposes? Data that might be independently re-usable (a series of oral history interviews) or might only be understandable in combination with other files (a database schema document, for example)?

This is very much a work in progress, and I'd really welcome feedback from others who have done allied work or anyone who has suggestions and comments on the approaches and issues outlined above.