Monday, 4 July 2016

New research data file formats now available in PRONOM

Like all good digital archivists, I am interested in file formats and in how we can use tools to identify them automatically. This is primarily so that when we package up our digital archives for long-term preservation, we do so with some knowledge of how we might preserve and provide access to them in the future. This information is key whether migration or emulation is the preservation or access strategy of choice (or indeed a combination of both).

It has been really valuable to have some time and money, as part of our "Filling the Digital Preservation Gap" project, to investigate issues around the identification of research data file formats. It is very pleasing to see that the latest PRONOM signature release, on 29th June, includes a couple of research data formats that we have sponsored as part of our project work.

PRONOM release notes

I sent a batch of sample files off to the team who look after PRONOM at The National Archives (TNA), along with a bit of contextual information about the formats and the software or hardware that creates them (which I had uncovered after a bit of research on Google). TNA did the rest of the hard work and these new signatures are now available for all to use.

The formats in question are:

  • Gaussian input files - These are created for an application called Gaussian, which is used by many researchers in the Chemistry department here in York. In a previous project update you can see that Gaussian was listed in the top 10 research software applications in use at the University of York. These files are essentially just ASCII text files containing instructions for Gaussian, and they can have a range of file extensions (though the samples I submitted were all .gjf). Though there is a recommended format or syntax for these instructions, there also appears to be flexibility in how it can be applied. Consequently, this was a slightly challenging signature for TNA to work on. It would be useful if other institutions that hold Gaussian input files could help test this signature and feed back to TNA if there are any problems or issues. In instances like this, being able to develop against a range of sample files created at different times, in different institutions, by different researchers would help.
  • JEOL NMR Spectroscopy files - These are data files produced by JEOL's Nuclear Magnetic Resonance Spectrometers. These facilities at the University of York are clearly well used as data of this type was well represented in an initial assessment of the data that I reported on in a blog post last month (130 .jdf files were present in the sample of 3752 files). As these files are created by a piece of hardware in a very standard way, I am told that signature developers at TNA were able to create a signature without too many problems.
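
To give a feel for why a plain text format like the Gaussian input file is awkward to identify, here is a rough sketch in Python of the kind of check one might write by hand. It is emphatically not the PRONOM signature that TNA developed. It assumes only the commonly documented layout of a Gaussian input file: optional "Link 0" lines starting with %, a route section starting with #, and later a line giving the charge and spin multiplicity as two integers. The function name and the sample file content are made up for illustration.

```python
import re

# Hypothetical sample, written for illustration only (not one of the files sent to TNA).
SAMPLE = """%chk=water.chk
# HF/6-31G(d) Opt

Water optimisation

0 1
O 0.000 0.000 0.000
H 0.000 0.000 0.960
H 0.930 0.000 -0.240
"""


def looks_like_gaussian_input(text: str) -> bool:
    """Heuristic guess only; this is NOT the PRONOM signature."""
    lines = text.splitlines()
    i = 0
    # Skip leading blank lines and optional Link 0 commands (e.g. %chk=..., %mem=...).
    while i < len(lines) and (not lines[i].strip() or lines[i].lstrip().startswith("%")):
        i += 1
    # The next line should open the route section, which starts with '#'.
    if i >= len(lines) or not lines[i].lstrip().startswith("#"):
        return False
    # Somewhere after the route section, expect a charge/multiplicity line: two integers.
    charge_mult = re.compile(r"\s*-?\d+[\s,]+\d+\s*")
    return any(charge_mult.fullmatch(line) for line in lines[i + 1:])


if __name__ == "__main__":
    print(looks_like_gaussian_input(SAMPLE))
    print(looks_like_gaussian_input("Dear all,\nPlease find the minutes attached.\n"))
```

A check this loose is easily fooled: any text file that happens to begin with a # line and contains a line of two integers would also pass. That is exactly why testing against samples from lots of different researchers and institutions matters.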

Further formats submitted from our project will appear in PRONOM within the next couple of months.

The project team are also interested in finding out how easy it is to create our own file format signatures. This is an alternative option for those who want to contribute, but it is not something we have attempted before. Watch this space to find out how we get on!

