Tuesday, 30 August 2016

Filling the Digital Preservation Gap - a brief update

As we near the end of the active phase of Filling the Digital Preservation Gap* here is a brief update about where we are with the main strands of work we highlighted in our phase 3 kick off blog post.

Archivematica implementation

Work at York

Work is ongoing at York to get our proof of concept implementation of Archivematica up and running. The purpose of this work was not to get a production service in place but to demonstrate that the implementation plan we published in our phase 2 report was feasible. The implementation we are developing pulls metadata (about deposited research datasets) from PURE and provides a method for capturing additional information for managing datasets (filling some of the information gaps that are not collected through the PURE datasets module). It also includes an automated process to ingest deposited datasets (along with their metadata) into Archivematica, package them up for longer term preservation and provide a dissemination copy of the dataset to our repository.

We have been doing this work in consultation with the staff at York who actually work with datasets that are deposited through our Research Data York service to ensure that the workflows and processes we are putting in place will make their lives easier rather than harder! We are keen to ensure that those processes that can be automated are automated and those areas where human input is required trigger e-mail notifications to relevant staff and a pause in the workflow to enable the relevant checks to be made.

Work at Hull

Like York, Hull is looking to produce a proof-of-concept system within the timeframe of the project. Whilst concentrating on research data for this Jisc funded work, we have our eye also on later using our approach for other forms of repository content that deserve long-term preservation.  To that end, we are taking as our starting point the institutional Box folder that each of our staff has access to; we will be asking depositors to assemble their material for the repository in a folder within their Box account.  As well as the content itself they will be asked for basic metadata and processing instructions in a very simple format.  When the folder is ready they share it with another Box account “owned” by Archivematica.

Hull has developed a “Box watcher” which detects the new share and instigates processing of the contents, keeping the depositor aware of progress along the way.  The contents of the folder are examined and, depending on what is found and how it is configured, one or more Bags (as in the BagIt standard) are created and handed off to Archivematica.
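To make the idea of a Bag a little more concrete, here is a minimal sketch of what creating one involves. It is not Hull's Box watcher code: the function name `make_bag` and the choice of SHA-256 checksums are assumptions made purely for illustration, and only the tag files required by the BagIt specification are written.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: str, bag_dir: str) -> Path:
    """Copy source_dir into bag_dir/data and write the minimal BagIt tag files.

    Hypothetical helper for illustration only; not the Hull implementation.
    """
    source = Path(source_dir)
    bag = Path(bag_dir)
    payload = bag / "data"
    shutil.copytree(source, payload)  # creates bag_dir and data/ in one step

    # Bag declaration, required by the BagIt specification.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <relative path>" line per payload file.
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

A real workflow would also want to carry the depositor's metadata in a tag file such as bag-info.txt and to validate the bag before handing it on.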

Like York we are then looking to have a fully automated Archivematica workflow which produces Archival Information Packages corresponding to each of the bags.  In addition, Hull will have Archivematica create Dissemination Information Package(s) which, once created, will automatically be processed to produce objects in the quality assurance queue of our Hydra repository.

Unidentified file formats

It has been clear from our project work during phase 3 that research data is much harder to identify in an automated fashion than other types of born digital data that an archive would typically hold. If you don’t believe us, read these 2 blog posts that show contrasting results when trying to identify two different types of born digital data: 

So, how are we working towards a solution? As well as directly sponsoring the development of a small selection of research data file formats by the PRONOM team at The National Archives, we also had a go at creating our own. York’s new signature will be incorporated into PRONOM in due course. Hull’s signature has been submitted and is just being tested by the PRONOM team. There have also been positive discussions with colleagues at The National Archives about wider public engagement around file format signature development and how we work towards increasing the coverage of PRONOM for research data file formats.

Dissemination and outreach

The project team have been keen to continue their focus on dissemination during phase 3 of the project. This has included presentations or posters at the following conferences and events:

  • International Digital Curation Conference (IDCC16), Amsterdam
  • 'Digital Preservation: Strategic Issues' - National Library of Wales
  • UK Archives Discovery Forum, Kew
  • UK Archivematica meeting, York
  • Research Data, Records and Archives: Breaking the Boundaries, Edinburgh
  • Open Repositories, Dublin
  • Jisc CNI conference, Oxford
  • Hydra Virtual Connect
  • TNA Digital Transformation day, Kew

...and our outreach work continues. Watch out for us at the Jisc Research Data Network event in Cambridge next week, the next UK Archivematica meeting in Lancaster the week after, the iPRES conference and Hydra Connect in October and of course the final Jisc Research Data Spring showcase event which will be later on in October.

And of course we have been blogging as usual throughout this phase of the project so do read back to see our previous posts for more information and watch out for our phase 3 final report in mid-October.

* we formally complete the project work on 14th Sept and will focus on writing up our final report over the following month

Friday, 19 August 2016

My first file format signature

As part of Filling the Digital Preservation Gap we've been doing a lot of talking about the importance of accurate file format identification and the challenges of doing so for research data.

Now we are thinking about how we can help solve the problem.

As promised in a post last month, I wanted to have a go at file format signature creation for PRONOM to see whether it is something that an average digital archivist could get their head around. Never before had I created my own signature. In the past I had considered this to be work that only technical people could carry out and it would be foolhardy to attempt it myself.

However, given the extent of the file format identification challenge for research data, wouldn't it be great if the community could engage more directly? Also, shouldn't file signature development be something every digital archivist has a good understanding of?

Encouraged by Ross Spencer's blog post Five Star File Format Signature Development and a meeting with the PRONOM team at The National Archives in which the tricks and challenges of signature creation were explained, I decided to give it a go.

Where to begin?

  • First I read TNA's How to research and develop signatures for file format identification. This is an accessible and readable guide which tells you how to get started with signature development - from gathering samples, doing internet research on the format and using a Hex Editor to spot patterns. You don't need to be very technically minded to get your head around it.
  • Then I downloaded and installed a Hex editor. Though it is possible to view files as hexadecimal within Quick View Plus, I followed TNA's advice and used HxD Hex Editor as this allows you to compare files thus partially automating the process of spotting sequences.
  • Once I'd spotted a pattern which could be used to create a signature, I planned to use PRONOM's Signature Development Utility to create it. 
  • Once the signature is created I'm told it is possible to test this using DROID. Within DROID, go to Tools... Install signature file and replace the current signature with your new one (but remember to put it back again once you are done otherwise you may wonder why DROID isn't working properly!). Run this over your directory of sample files to see if they are all correctly identified using the signature you have developed. 

I decided to start off with something that had been at the back of my mind for a while...those Wordstar 4.0 files from the Marks and Gran archive that I blogged about some time ago and had struggled to identify. When I wrote that post three years ago, Wordstar 4.0 files were not represented in PRONOM. They have more recently been added and the files can be identified but this is by extension only - not the more accurate file signature. I thought it would be fun to try and create a file signature for them.

I was very wrong.

My attempts to see a pattern within the files using the Hex Editor were unsuccessful. I decided to send the sample files to the experts at TNA to see if I was missing something. It was quickly confirmed that this was a rather awkward file type and not one that lent itself well to being automatically identified. Disappointing but at least it confirmed that my own investigations were not lacking.

For my next attempt I decided to tackle some of the unidentified research data that I had highlighted in my previous post Research data - what does it *really* look like?

I looked through the top ten most frequent unidentified file extensions in my sample and started to dig out the files themselves and assess whether they were a good candidate for me to work on. Ross Spencer suggests that PRONOM lends itself best to the creation of signatures for binary formats so this is what I wanted to focus on. No point in trying to make it hard for myself!

  • dat - 286
  • crl - 259
  • sd - 246
  • jdf - 130 (a signature for these JEOL NMR Spectroscopy files is now available)
  • out - 50
  • mrc - 47
  • inp - 46
  • xyz - 37
  • prefs - 32
  • spa - 32

Unfortunately, looking through the list (and digging out some samples) I discovered that many of these are ASCII formats rather than binary. It is possible to create signatures to identify ASCII files but it can be challenging (involving quite complex regular expressions) and not a great place for a first timer to start. I certainly did not want to start to tackle the confusing landscape of .dat files either!

After a little bit of investigation I discovered that the .spa files were something I could work with. I knew nothing about this format but found the relevant files and started doing some internet research looking for more information and perhaps some additional samples. I soon discovered they were one of many formats for optical spectroscopy and are known as Thermo Fisher’s OMNIC file format or Thermo Scientific OMNIC or Nicolet/Thermo OMNIC.

Looking at some of the files using a Hex Editor it was immediately apparent that there was a consistent pattern of bytes at the start of each file: a string which read 'Spectral Data File', represented by 53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65 in hexadecimal. Note that I actually thought the pattern was longer but advice received from the PRONOM team suggested that it was better to cut it down.
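For anyone who wants to try the same check outside a Hex Editor, a few lines of Python will do it. This is only a sketch of the idea of a byte sequence anchored at the start of a file, not how DROID evaluates PRONOM signatures internally; the names `SPA_MAGIC` and `looks_like_spa` are made up for this example.

```python
# The 18 bytes quoted above, written exactly as they appear in the Hex Editor.
SPA_MAGIC = bytes.fromhex("53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65")


def looks_like_spa(path):
    """Return True if the file at path begins with the 'Spectral Data File' bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(SPA_MAGIC)) == SPA_MAGIC


# Decoding the hex confirms the human-readable string seen in the editor.
print(SPA_MAGIC.decode("ascii"))  # Spectral Data File
```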

I also looked at the end of each file and at first sight there appeared to be consistency here too with each file ending with the same few bytes. This hypothesis was blown out of the water when I looked at a sample file that I had discovered online which did not display this pattern (but luckily did have the same bytes at the start of the file).

This is why it is so important to have sample data that comes from more than one source. A set of files from a single researcher may have misleading patterns that have occurred just because of the consistent way in which they work, rather than this being a true feature of the format itself.
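The comparison that a Hex Editor helps with by eye can also be sketched in code: read the first bytes of every sample and keep only what they all share. The function name and the sample bytes below are invented for illustration; the point is that adding a file from a second source can shorten the shared run.

```python
def common_leading_bytes(samples):
    """Return the longest byte sequence found at the start of every sample."""
    if not samples:
        return b""
    shortest = min(samples, key=len)
    for index, byte in enumerate(shortest):
        # Stop at the first position where any sample disagrees.
        if any(sample[index] != byte for sample in samples):
            return shortest[:index]
    return shortest


# Hypothetical headers: two files from one researcher, one from elsewhere.
same_source = [b"Spectral Data File\x00LabA-run1", b"Spectral Data File\x00LabA-run2"]
other_source = [b"Spectral Data File\x00\x07other"]

print(common_leading_bytes(same_source))
print(common_leading_bytes(same_source + other_source))
```

The first call reports a longer shared prefix than the second, which is exactly the kind of misleading pattern a single-source sample set can produce.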

So, once I'd looked at all 33 files and had convinced myself the hypothesis was solid, I went to the online signature development tool provided by The National Archives and created my signature.

PRONOM signature development tool

This was relatively easy to use but there were areas where more guidance was needed (so questions were fired off to the PRONOM team and a speedy response was received). I'm hoping that in the future there will be more documentation to help guide the completion of this form - so that people know how best to name the signature, where to find a definitive list of Mimetypes (this is the list they suggested I looked at), and what the Version field should contain (it is for the version of the file format if this is apparent/relevant - not the version of the signature you are creating).

Once I was done, I clicked on the 'Save Signature File' button and I was presented with the finished XML file:

Ta daaaaa!

I briefly admired my handiwork before sending it off to The National Archives for feedback.*

How long did it take me to do all of this? I would say one full day is a fair estimate (that would include reading the guidance, downloading the Hex Editor and a few false starts as I tried to find a format that I thought I could handle). The next signature would be much quicker.

The biggest challenges:

  1. It took me a while to find a binary format that I could work with. Much of the research data we hold appears to be in ASCII formats, which has benefits from a digital preservation perspective but wasn't what I was looking for with regard to this exercise.
  2. I did not really understand the file format I was working with. I am not a chemist. I had never heard of the .spa format. I struggle to even say 'Spectroscopy' let alone understand it. When I started to research it online I found the results quite confusing. If I had known more about the format in the first place it would have made life much easier.
  3. There are limitations with the metadata we get from researchers when they deposit data with Research Data York. Reading the brief descriptions of the dataset that are provided did not really help me work out what the individual files are or what software and hardware was used to create them.
  4. I could not locate the file format specification online - I think next time I try this I may approach the software vendor direct and ask them for help. 
  5. Available documentation for creating and testing signatures could be enhanced. I had several questions as I went along and these were answered promptly by the PRONOM team, but if the information was all online then this would certainly help other newbies.

Despite the challenges this exercise has been both enjoyable and useful. The thing I like about being a digital archivist is being able to get hands on with the data and solve problems. Over the last few years I've done very little of this type of work so it was great to get stuck in. On top of the obvious benefit that after the next signature release these .spa files will now be recognised by DROID and other PRONOM-based file identification tools, I have also increased my knowledge and understanding of the process and this is a positive result.

I would definitely encourage other digital archivists, repository managers and research data managers to try this out for themselves.

* Feedback from the PRONOM team was positive. With a couple of modifications they were happy to include the signature in the next PRONOM signature release

Friday, 5 August 2016

Research data is different

This is a guest post from Simon Wilson who has been profiling the born-digital data at the Hull History Centre to provide another point of comparison with the research data at York reported on in this blog back in May.

Inspired by Jen’s blog Research data - what does it *really* look like? about the profile of the  research data at York and the responses it generated including that from the Bentley Historical Library, I decided to take a look at some of the born-digital archives we have at Hull. This data is not research data from academics, it is data that has been donated to or deposited with the Hull History Centre and it comes from a variety of different sources.

Whilst I had previously created a DROID report for each distinct accession I have never really looked into the detail, so for each accession I did the following:

  1. Run the DROID software and export the results into csv format with one row per file 
  2. Open the file in MS Excel and copy the data to a second tab for the subsequent actions
  3. Sort the data by Type field into A-Z order and then delete all of the records relating to folders 
  4. Sort the data on the PUID field into A-Z order
  5. For large datasets highlight the data and then select the subtotal tool and use it to count each time the PUID field changes and record the sub-total
  6. Once the subtotal tool has completed its calculations, select the entire dataset and select Hide Detail (adjacent to Subtotal in the Outline tools box) to leave you with just a row for each distinct PUID and the total count value

I then created a simple spreadsheet with a column for each distinct accession and added a row for each unique PUID, copying the MIME type, software and version details from the DROID report results.  I also noted the number of files that were not identified. There may be quicker ways to get the same results and I would love to hear other suggestions or shortcuts.
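One possible shortcut for the sort-and-subtotal steps is a short script that counts the files per PUID directly from the DROID export. This is a sketch rather than a tested workflow: it assumes the CSV has column headers named TYPE and PUID, that folder rows carry the value "Folder" in the TYPE column, and that unidentified files have an empty PUID. Check these against your own export before relying on it.

```python
import csv
from collections import Counter


def profile_droid_csv(csv_path):
    """Count files per PUID in a DROID CSV export (one row per file)."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["TYPE"] == "Folder":  # assumed label for folder rows
                continue
            # An empty PUID is treated as "not identified" (an assumption).
            counts[row["PUID"] or "unidentified"] += 1
    return counts


def print_profile(counts):
    """Print one line per PUID, most frequent first."""
    for puid, total in counts.most_common():
        print(f"{puid}\t{total}")
```

Running `print_profile(profile_droid_csv("accession.csv"))` for each accession (the file name here is just a placeholder) gives the same per-PUID totals as the Excel subtotal steps.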

After having completed this for 24 accessions - totalling 270,867 files, what have I discovered?

  • An impressive 97.96% of files were identified by DROID (compared with only 37% in Jen's smaller sample of research data)
  • So far 228 different PUIDs have been identified (compared with 34 formats in Jen’s sample)
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%). See the top ten identified formats in the table below...

File format (version):

  • Microsoft Word Document (97-2003)
  • Microsoft Word for Windows (2007 onwards)
  • Microsoft Excel 97 Workbook
  • Graphics Interchange Format
  • Acrobat PDF 1.4 - Portable Document Format
  • JPEG File Interchange Format (1.01)
  • Microsoft Word Document (6.0 / 95)
  • Acrobat PDF 1.3 - Portable Document Format
  • JPEG File Interchange Format (1.02)
  • Hypertext Markup Language (v4)

I can now quickly look-up whether an individual archive has a particular file type, and see how frequently it occurs.  Once I have processed a few more accessions it may be possible to create a "profile" for an individual literary collection or a small business and use this to inform discussions with depositors.  I can also start to look at the identified file formats and determine whether there is a strategy in place to migrate that format. Where this isn’t the case, knowing the number and frequency of the format amongst the collections will allow me to prioritise my efforts.  I will also look to aggregate the data – for example merging all of the different versions of Adobe Acrobat or MS Word.

I haven’t forgotten the 5520 unidentified files. By noting the PRONOM signature file number used to profile each archive, it is easy to repeat the process with a later signature file.  This could validate the previous results or enable previously unidentified files to be identified (particularly if I use the results of this exercise to feed information back to the PRONOM team). Knowing which accessions have the largest number of unidentified files will allow me to focus my effort as appropriate.
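The headline percentages can be checked from the three counts quoted in this post. The short calculation below uses only those figures; it also shows that the 44.5% quoted for fmt/40 is a share of all 270,867 files (as a share of the identified files only, it would be about 45.4%).

```python
total_files = 270_867    # files across the 24 accessions
unidentified = 5_520     # files DROID could not identify
word_97_2003 = 120_595   # files identified as fmt/40

identified = total_files - unidentified
identified_pct = 100 * identified / total_files
word_pct_of_total = 100 * word_97_2003 / total_files
word_pct_of_identified = 100 * word_97_2003 / identified

print(f"{identified_pct:.2f}% of files identified")                   # 97.96%
print(f"{word_pct_of_total:.1f}% of all files are fmt/40")            # 44.5%
print(f"{word_pct_of_identified:.1f}% of identified files are fmt/40")  # 45.4%
```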

Whilst this has certainly been a useful exercise in its own right, it is also interesting to note the similarities between this and the born-digital archives profile published by the Bentley Historical Library and the contrast with the research data profile Jen reported on.

The top ten identified formats from Hull and Bentley are quite similar. Both have a good success rate for identifying file formats with 90% identified at Bentley and 98% at Hull. Though the formats do not appear in the same order in the top ten, they do contain similar types of file (MS Word, PDF, JPEGs, GIFs and HTML).

In contrast, only 37% of files were identified in York's research data sample and the top ten file formats that were identified look very different. The only area of overlap is MS Excel files, which appear high up in the York research dataset as well as being in the top ten for the Hull History Centre.

Research data is different.

Thursday, 7 July 2016

On the closure of Edina's Unlock service

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie managed the technical side of the 'York's Archbishops' Registers Revealed' project. This post discusses the demise of Edina's Unlock service and wonders how sustainable open data services are.

It has recently come to my attention that Edina are retiring their 'Unlock' service on the 31st July 2016. Currently that's all I know as, AFAIK, Edina haven't provided any background or any information about why, or what users of this service might do instead. I also wasn't aware of any kind of consultation with users.

Edina's message about the Unlock service - not very informative. 

At York we've been using Unlock to search the DEEP gazetteer of English place names in our Archbishops' Registers editing tool. DEEP is a fantastic resource, an online gazetteer of the 86 volume corpus of the Survey of English Place-Names (SEPN). Without Edina's Unlock service, I don't know any way of programmatically searching it.

Records from DEEP via Edina's Unlock service in our Archbishops' Registers editing tool 

Coincidentally (or not?), the web site that the records returned via the Unlock DEEP search link to is unavailable. This is the web site for the DEEP data, and our editors use it to help identify and disambiguate place names.

I contacted Edina and they have promised to pass on further information about the end of the Unlock service as it becomes available later in the month. They also pointed me to the Institute for Name Studies at the University of Nottingham (INS) to find out why the place names site was unavailable. The initial response from INS was 'this is not our website'. I mentioned that it is listed as one of their resources and they are now following up for me with a colleague, who is away at the moment. I've also contacted the Centre for Data Digitisation and Analysis (CDDA) at Queen's University Belfast who mention the site as one of their 'project databases' on the CDDA web site. As of writing, I haven't had a response.

UPDATE: As of Thursday 7th July, the site seems to be back. Yay!

Entry from the Archbishops' Registers places list, complete with broken link 

All of this is quite worrying as it means many wasted development hours implementing this feature and reflects badly on our site - displaying broken links is not something we want to do.

So what do we do now? Well, in practical terms, unless I receive any other information from Edina to the contrary, I'll write out our use of Unlock / DEEP from the editing tool at the beginning of August and our editors will have to switch to manually creating place records in the tool. We've also been using Unlock to search the Ordnance Survey, so I'll hopefully be able to add a search of their linked data services directly. But we particularly liked DEEP as it gave us historical place names and enough information to help editors make sure they were selecting the correct contemporary place.

The bigger questions that this raises, though, are:

  • how do we ensure that important datasets and services coming out of projects can be sustained?
  • how can we trust that open data services will continue to be available, even those that appear to have the backing of a service provider like Edina or Jisc?
  • how do we find out when they aren't?
  • and how do we have a voice when decisions like this are being made?

Monday, 4 July 2016

New research data file formats now available in PRONOM

Like all good digital archivists I am interested in file formats and how we can use tools to automatically identify them. This is primarily so that when we package our digital archives up for long term preservation we can do so with a level of knowledge about how we might go about preserving and providing access to them in the future. This information is key whether migration or emulation is the preservation or access strategy of choice (or indeed a combination of both).

It has been really valuable to have some time and money as part of our "Filling the Digital Preservation Gap" project to be able to investigate issues around the identification of research data file formats and very pleasing to see the latest PRONOM signature release on 29th June which includes a couple of research data formats that we have sponsored as part of our project work.

 Pronom release notes

I sent a batch of sample files off to the team who look after PRONOM at The National Archives (TNA) with a bit of contextual information about the formats and software/hardware that creates them (that I had uncovered after a bit of research on Google). TNA did the rest of the hard work and these new signatures are now available for all to use.

The formats in question are:

  • Gaussian input files - These are created for an application called Gaussian which is used by many researchers in the Chemistry department here in York. In a previous project update you can see that Gaussian was listed in the top 10 research software applications in use at the University of York. These files are essentially just ASCII text files containing instructions for Gaussian and they can have a range of file extensions (though the samples I submitted were all .gjf). Though there is a recommended format or syntax for these instructions, there also appears to be flexibility in how these can be applied. Consequently this was a slightly challenging signature for TNA to work on and it would be useful if other institutions that have Gaussian input files could help test this signature and feed back to TNA if there are any problems or issues. In instances like this being able to develop against a range of sample files created at different times in different institutions by different researchers would help.
  • JEOL NMR Spectroscopy files - These are data files produced by JEOL's Nuclear Magnetic Resonance Spectrometers. These facilities at the University of York are clearly well used as data of this type was well represented in an initial assessment of the data that I reported on in a blog post last month (130 .jdf files were present in the sample of 3752 files). As these files are created by a piece of hardware in a very standard way, I am told that signature developers at TNA were able to create a signature without too many problems.

Further formats submitted from our project will appear in PRONOM within the next couple of months.

The project team are also interested in finding out how easy it is to create our own file format signatures. This is an alternative option for those who want to contribute but not something we have attempted before. Watch this space to find out how we get on!

Modelling Research Data with PCDM

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie has been working on York's implementation for the 'Filling the Digital Preservation Gap' project. This post discusses preliminary work to define a data model for 'datasets'.

For Phase three of our 'Filling the Digital Preservation Gap' project, I've been working on implementing a prototype to illustrate how PURE and Archivematica can be used as part of a research data management lifecycle. Our technology stack at York is Hydra and Fedora 4 but it's an important aspect of the project to ensure that the thinking behind the prototype is applicable to other stacks. Central to adding any kind of data to a research information system, repository or preservation tool is the data model that underpins the metadata. For this I've been making use of the Portland Common Data Model (PCDM) and its various extensions (particularly Works).

In the past couple of years there has been a lot of work happening around PCDM, described as "a flexible, extensible domain model that is intended to underlie a wide array of repository and DAMS applications". PCDM provides a small Models ontology of classes and properties, with extension ontologies for Works and Use, among others. I like PCDM because it is high level enough to provide a language to talk across different domains and use cases.

Datasets data model version 1

My first attempt at a data model for datasets based on PCDM can be seen below.

Datasets Data Model v1

Starting with the Dataset object above, for York this is equivalent to a dataset record in our PURE research information system. The only metadata expected at this level is about the dataset as a whole, and for us, will largely come from PURE.

Below this, you'll see that I've begun to model in some OAIS constructs: the Archival Information Package (AIP) and Dissemination Information Package (DIP). The AIP is the deposit of data, prepared for preservation, a single package of data "consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS." (DCC Glossary). The DIP is a representation of the Content Information in the AIP, produced for dissemination. OAIS, by the way, gives no standard approach for structuring the AIP or DIP.

As it stands, this model does not consider how to 'unpack' the AIP at all, and for our prototype we are (or were - see below) intending simply to point to the AIP in Archivematica.

The DIP as illustrated above is based on what Archivematica generates. The diagram includes the processing configuration and METS files that Archivematica produces by default as filesets, for illustration. These aren't part of the dataset as deposited, hence not making them 'Works' in their own right.

GenericWork is intended for each Unit or 'Work' in the dataset as deposited, or representation thereof. GenericWork is intended for use with any kind of data object. It might be independently re-used, e.g. as a member of another dataset. In most cases for research data we probably won't know much about what the data is and so GenericWork will be used, but sometimes it may make sense to use existing models. For example, if the dataset is a collection of images then a more tailored Image model could be used for each Image, or if a dataset includes some existing objects that are already in our repository, those might already have different models. They can still be members of our dataset.

The model is intended to allow for a Dataset to include multiple AIPs and thus I have suggested a local predicate for hasDIP / hasAIP to establish the relationship between the AIP and the DIP.

The trouble with DIPs : an alternative model

Discussing this model with Justin Simpson from Artefactual Systems recently, I got to thinking that the DIP is a really artificial construct, effectively a presentation 'view' on the data and not really a separate 'Work' in its own right. Archivematica's DIP, at present, provides only files, and doesn't reflect the original folder structure, which may well be meaningful and necessary for a dataset. Perhaps what we really need is the AIP, described fully in Fedora, leaving presentation of the DIP to the interface level? A set of rules for producing a DIP to our own local specification might go like this (using the PCDM Use ontology): if there is a Preservation Master file, produce a Service File for user access, otherwise present the Original File to the user.
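Expressed as code, a rule like that is very small. The sketch below is illustrative only: it represents a FileSet as a plain dictionary keyed by PCDM Use class name, and `make_service_file` stands in for whatever process would derive an access copy; neither is part of the prototype.

```python
def choose_access_copy(fileset, make_service_file):
    """Apply the local DIP rule to one FileSet.

    fileset: dict mapping a PCDM Use class name to a file reference
             (hypothetical structure for this example).
    make_service_file: callable that derives an access copy from the
             Preservation Master file (hypothetical).
    Returns a (use_class, file_reference) pair for presentation.
    """
    if "PreservationMasterFile" in fileset:
        # A preservation master exists: derive a Service File for users.
        derived = make_service_file(fileset["PreservationMasterFile"])
        return ("ServiceFile", derived)
    # Otherwise fall back to the file exactly as deposited.
    return ("OriginalFile", fileset["OriginalFile"])


# Example with placeholder file names and a stand-in derivation step.
to_access = lambda name: name + ".access"
print(choose_access_copy({"OriginalFile": "data.xyz"}, to_access))
print(choose_access_copy(
    {"OriginalFile": "scan.bmp", "PreservationMasterFile": "scan.tif"}, to_access))
```

Keeping the rule at the interface level like this means it can change without touching what is stored in the AIP.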

The new model would look something like this:

Datasets Data Model alternative

The DIP would be a view constructed from elements of the AIP:

This model feels like a better approach than that in version 1 as it facilitates describing the 'whole' dataset. I do have some immediate questions about the model, though:

  • Is a dataset really a pcdm:Collection, rather than a Work?
  • If the GenericWork is each data file irrespective of whether it can be used/understood on its own, how useful is that in reality? Is the GenericWork really needed, or are FileSets enough? Is there genuinely value in identifying each individual piece of data as a 'Work'? (re-use outside of the dataset, for example)

And when thinking beyond the model, about how this would actually work for different use cases, implementation questions start to surface.

Beyond the model

1) Dataset size and structure

Datasets may contain thousands, millions even, of files structured into folders where folders may impart meaning to the data, or be purely arbitrary. Fedora 4 can, by design, handle a folder structure using its implementation of LDP Basic Containers. As illustrated below, each folder is a 'Basic Container' and each data file is a Work, with FileSet and File objects.
  • AIP ldp:contains folder1
    • folder1 ldp:contains folder2
      • folder2 ldp:contains folder3
        • folder3 ldp:contains GenericWork
          • GenericWork pcdm:hasMember FileSet 
But if each of those folders is an object in Fedora and each file is in Fedora with a Work, FileSet and several File objects, then the number of Fedora objects begins to rise very quickly.
  • Would it be better to avoid the object cost to Fedora of creating many, many objects?
  • What alternative approaches could we take? Use BagIt and have Fedora reference only the 'bag'? Store only data files and create an 'order' to represent the folder structure (as outlined in PCDM 2.0)?
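A rough way to get a feel for the object count is to estimate it from a list of file paths. The Python sketch below assumes one container per distinct folder and, for each data file, one GenericWork, one FileSet and a configurable number of File objects. The function name and the default of two File objects per FileSet are my own assumptions for illustration.

```python
from pathlib import PurePosixPath


def count_fedora_objects(file_paths, files_per_fileset=2):
    """Estimate how many Fedora objects a dataset would create.

    Assumes one Basic Container per distinct folder, and per data file
    one GenericWork + one FileSet + `files_per_fileset` File objects.
    """
    folders = set()
    for path in file_paths:
        for parent in PurePosixPath(path).parents:
            if str(parent) != ".":  # skip the implicit top level
                folders.add(parent)
    per_file = 1 + 1 + files_per_fileset
    return {
        "containers": len(folders),
        "objects_per_data_file": per_file,
        "total": len(folders) + len(file_paths) * per_file,
    }


paths = [
    "folder1/folder2/folder3/a.dat",
    "folder1/folder2/folder3/b.dat",
    "folder1/c.dat",
]
print(count_fedora_objects(paths))
```

Under these assumptions the total grows in proportion to the number of files, with a multiplier of several objects per file plus one per folder, which is what makes very large datasets a concern.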

2) Storing Files

'Where should data actually be stored permanently?' is another practical question I've been thinking about. On the one hand, Archivematica makes the AIP and its files available via URLs in the Storage Service and stores the file in a location of your choosing. On the other, Fedora can contain data files, or reference them via a URI. This gives us the flexibility to do several things:
  1. Leave the AIP and DIP in Archivematica's stores and use URIs in Fedora to reference all of the files and build a PCDM-modelled view of the data (Archivematica as preservation and access store).
  2. Manage all files in Fedora, treating the Archivematica copy of the data as temporary (Archivematica as sausage factory).
  3. Have two copies of AIP data, one in Archivematica and one in Fedora (LOCKSS model).
  4. Manage Preservation files in Archivematica and delivery/access files in Fedora (Archivematica as preservation and Fedora as access).
Keeping data in Archivematica makes it easy to do additional preservation actions in future, such as re-ingesting when format policy rules change, whereas managing all files within Fedora unlocks the possibilities of Fedora's audit functionality, fixity checking and versioning. Having two copies is attractive as a preservation strategy, but could be difficult to justify and sustain if data collections grow to a significant size.

On balance I think option (4) is best for the short-term with other options worth re-considering as both Archivematica and Fedora mature. But I'd be really keen to hear different views.


Hopefully this post illustrates that creating an outline data model is pretty easy, but when it comes to thinking about it in terms of implementation decisions, all kinds of ifs and buts start coming up.

In the model above, each data file is a Work. Each work contains one or more FileSets, and each FileSet contains one or more different representations of the file. 

Is it really possible to define a general datasets model that could encompass data from across disciplines, of various sizes, structures and created for a variety of purposes? Data that might be independently re-usable (a series of oral history interviews) or might only be understandable in combination with other files (a database schema document, for example)?

This is very much a work in progress, and I'd really welcome feedback from others who have done allied work or anyone who has suggestions and comments on the approaches and issues outlined above.

Tuesday, 31 May 2016

Research data - what does it *really* look like?

Work continues on our Filling the Digital Preservation Gap project and I thought it was about time we updated you on some of the things we have been doing.

While my colleague Julie has been focusing on the more technical issues of implementing Archivematica for research data, I have been looking at some real research data and exploring in more detail some of the issues we discussed in our phase 1 report.

For the past year, we have been accepting research data for longer term curation. Though the systems for preservation and access to this data are still in development, we are for the time being able to allocate a DOI for each dataset, manage access and store it safely (ensuring it isn't altered) and intend to ingest it into our data curation systems once they are ready.

Having this data in one place on our filestore does give me the opportunity to test the hypothesis in our first report about the wide range of file formats that will be present in a research dataset and also the assertion that many of these will not be identified by the tools and registries in use for the creation of technical metadata.

So, I have done a fairly quick piece of analysis on the research data, running a tool called Droid developed by The National Archives over the data to get an indication of whether the files can be recognised and identified in an automated fashion.

All the data in our research data sample has been deposited with us since May 2015. The majority of the data is scientific in nature - much of it coming from the departments of Chemistry and Physics (this may be a direct result of expectations from the EPSRC around data management). The data is mostly fairly recent, as suggested by the last modified dates on these files, which range from 2006 to 2016 with the vast majority having been modified in the last five years. The distribution of dates is illustrated below.

Here are some of the findings of this exercise:

Summary statistics

  • Droid reported that 3752 individual files were present*

  • 1382 (37%) of the files were given a file format identification by Droid

  • 1368 (99%) of those files that were identified were given just one possible identification. 12 files were given two possible identifications and a further two were given 18 possible identifications. In all these cases, the identification was done by file extension rather than signature - which perhaps explains the uncertainty
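For anyone wanting to check the arithmetic, the percentages above follow directly from the reported counts. The short Python sketch below uses only the figures quoted in this post; the helper name `pct` is just for illustration.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to a whole number."""
    return round(100 * part / whole)


total_files = 3752       # files reported by Droid
identified = 1382        # files given at least one identification
single_id = 1368         # files with exactly one possible identification
two_ids = 12             # files with two possible identifications
eighteen_ids = 2         # files with 18 possible identifications

# The three groups should account for every identified file.
assert single_id + two_ids + eighteen_ids == identified

print(pct(identified, total_files))   # share of files identified
print(total_files - identified)       # files left unidentified
print(pct(single_id, identified))     # share with a single identification
```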

Files that were identified

  • Of the 1382 files that were identified: 
    • 668 (48%) were identified by signature (which suggests a fairly accurate identification - if a file is identified by signature it means that Droid has looked inside the file and seen something that it recognises. I'm told it does this by some sort of magic!)
    • 648 (47%) were identified by extension alone (which implies a less accurate identification)
    • 65 (5%) were identified by container (which suggests a high level of accuracy). These were all Microsoft Office files (xlsx and docx), which are in effect zip files

  • 111 (8%) of the identified files had a file extension mismatch - this means that the file extension was not what you would expect given the identification by signature. 
    • All but 16 of these files were XML files that didn't have the .xml file extension (there were a range of extensions for these files including .orig, .plot, .xpr, .sc, .svg, .xci, .hwh, .bxml, .history). This isn't a very surprising finding given the ubiquity of XML files and the fact that applications often give their own XML output different extensions.

  • 34 different file formats were identified within the collection of research data

  • Of the identified files 360 (26%) were XML files. This was by far the most common file format identified within the research dataset. The top 10 identified file formats are as follows:
    • Extensible Markup Language - 360
    • Log File - 212
    • Plain Text File - 186
    • AppleDouble Resource Fork - 133
    • Comma Separated Values - 109
    • Microsoft Print File - 77
    • Portable Network Graphics - 73
    • Microsoft Excel for Windows - 57
    • ZIP Format - 23
    • XYWrite Document - 21
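A breakdown like the one above can be produced from a Droid CSV export with a few lines of Python. This is a sketch only: it assumes the export has TYPE, METHOD, EXTENSION_MISMATCH and FORMAT_NAME columns (check these against your own export), and the rows in `SAMPLE_EXPORT` are made up for illustration and are not taken from our data.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; column names are assumed to match a Droid CSV export.
SAMPLE_EXPORT = """NAME,TYPE,EXT,METHOD,EXTENSION_MISMATCH,FORMAT_NAME
run1.xml,File,xml,Signature,false,Extensible Markup Language
run1.plot,File,plot,Signature,true,Extensible Markup Language
notes.txt,File,txt,Extension,false,Plain Text File
book.xlsx,File,xlsx,Container,false,Microsoft Excel for Windows
scan.dat,File,dat,,false,
"""


def breakdown(csv_text):
    """Tally identification methods, extension mismatches and top formats."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["TYPE"] == "File"]
    # Treat a row as identified when Droid recorded an identification method.
    identified = [r for r in rows if r["METHOD"]]
    methods = Counter(r["METHOD"] for r in identified)
    mismatches = sum(1 for r in identified if r["EXTENSION_MISMATCH"].lower() == "true")
    top_formats = Counter(r["FORMAT_NAME"] for r in identified).most_common(10)
    return methods, mismatches, top_formats


methods, mismatches, top_formats = breakdown(SAMPLE_EXPORT)
print(methods, mismatches, top_formats)
```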

Files that weren't identified

  • The 2370 files that weren't identified by Droid had 107 different file extensions between them

  • 614 (26%) of the unidentified files had no file extension at all. This does rather limit the chance of identification, given that identification by file extension is relied on quite heavily by Droid and other similar tools. Of course it also limits our ability to actively curate this data unless we can identify it by another means.

  • The most common file extensions for the files that were not identified are as follows:
    • dat - 286
    • crl - 259
    • sd - 246
    • jdf - 130
    • out - 50
    • mrc - 47
    • inp - 46
    • xyz - 37
    • prefs - 32
    • spa - 32
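Counting extensions, including files with no extension at all, needs nothing more than the file names. The Python sketch below is one way to do it; the function name, the "(none)" label and the example file names are hypothetical.

```python
import os
from collections import Counter


def extension_tally(names):
    """Count files per lower-cased extension; files without one go under '(none)'."""
    tally = Counter()
    for name in names:
        ext = os.path.splitext(name)[1].lower().lstrip(".")
        tally[ext or "(none)"] += 1
    return tally


# Hypothetical file names for illustration.
names = ["scan1.dat", "scan2.DAT", "cell.crl", "README"]
for ext, count in extension_tally(names).most_common():
    print(f"{ext} - {count}")
```

Lower-casing the extension means that .dat and .DAT are counted together, which may or may not be what you want when the extension is the only clue to the format.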

Some thoughts

  • This is all very interesting and does back up our assertions about the long tail of file formats within a collection of research data and the challenges of identifying this data using current tools. I'd be interested to know whether for other collections of born digital data (not research data) a higher success rate would be expected? Is identification of 37% of files a particularly bad result or is it similar to what others have experienced?

  • As mentioned in a previous blog post, one area of work for us is to get some more research data file formats into PRONOM (a key database of file formats that is utilised by digital preservation tools). Alongside previous work on the top software applications used by researchers (a little of this is reported here) this has been helpful in informing our priorities when considering which formats we would like The National Archives to develop signatures for in phase 3 of the project.

  • Given the point made above, it could be suggested that one of our priorities for file format research should be .dat files. This would make sense given that we have 286 of these files and they are not identified by any means. However, here lies a problem. This is actually a fairly common file extension. There are many different types of .dat files produced by many different applications. PRONOM already holds information on two varieties of .dat file and the .dat files that we hold appear to come from several different software applications. In short, solving the .dat problem is not a trivial exercise!

  • It strikes me that we are really just scratching the surface here. Though it is good we are getting a few file signatures developed as an output for this project, this is clearly not going to make a big impact given the size of the problem. We will need to think about how the community should continue this work going forward.

  • It has been really helpful having some genuine research data to investigate when thinking about preservation workflows - particularly those workflows for unidentified files that we were considering in some detail during phase 2 of our project. The unidentified file report that has been developed within Archivematica as a result of this project helpfully organises the files by extension and I had not envisaged at the time that so many files would have no file extension at all. We have discussed previously how useful it is to fall back on identification by file extension if identification by signature is unsuccessful but clearly for so many of our files this will not be worth attempting.

* Note that this turns out not to be entirely accurate given that eight of the datasets were zipped up into a .rar archive. Though Droid looks inside several types of archive file (including .zip, .gzip, .tar, and the web archival formats .arc and .warc), it does not yet look inside .rar, .7z, .bz, and .iso files. I didn't realise this until after I had carried out some analysis on the results. Consequently there are another 1291 files which I have not reported on here (though a quick run through with Droid after manually unzipping them identified 33% of the files, so a similar ratio to the rest of the data). Note that this functionality is something that the team at The National Archives intend to develop in the future.