Wednesday, 19 October 2016

Filling the Digital Preservation Gap - final report available

Today we have published our third and final Filling the Digital Preservation Gap report.

The report can be accessed from Figshare.

This report details work the team at the Universities of York and Hull have been carrying out over the last six months (from March to September 2016) during phase 3 of the project.

The first section of the report focuses on our implementation work. It describes how each institution has established a proof of concept implementation of Archivematica integrated with other systems used for research data management. As well as describing how these implementations work it also discusses future priorities and lessons learned.

The second section of the report looks in more detail at the file format problem for research data. It discusses DROID profiling work that has been carried out over the course of the project (both for research data and other data types) and signature development to increase the number of research data signatures in the PRONOM registry. In recognition of the fact that this is an issue that can only be solved as a community, it also includes recommendations for a variety of different stakeholder groups.

The final section of the report details the outreach work that we have carried out over the course of this final phase of the project. It has been a real pleasure to have been given an opportunity to speak about our work at so many different events and to such a variety of different groups over the last few months!

The last of this run of events in our calendars is the final Jisc Research Data Spring showcase in Birmingham tomorrow (20th October). I hope to see you there!

Tuesday, 11 October 2016

Some highlights from iPRES 2016

A lovely view of the mountains from Bern
Last week I was at iPRES 2016 - the 13th International Conference on Digital Preservation and one of the highlights of the digital preservation year.

This year the conference was held in the Swiss town of Bern. A great place to be based for the week  - fantastic public transport, some lovely little restaurants and cellar bars, miles of shopping arcades, bizarre statues and non-stop sunshine!

There was so much content over the course of the 4 days that it is impossible to cover it all in one blog post. Instead I offer up a selection of highlights and takeaway thoughts.

Jeremy York from the University of Michigan gave an excellent paper about ‘The Stewardship Gap’. An interesting project with the aim of understanding the gap between valuable digital data and long term curation.  Jeremy reported on the results of a series of interviews with researchers at his institution where they were asked about the value of the data they created and their plans for longer term curation. A theme throughout the paper was around data value and how we assess this. Most researchers interviewed felt that their data did have long term value (and were able to articulate the reasons why). Most of the respondents expressed an intention to preserve the data for the longer term but did not have any concrete plans as to how they would achieve this. It was not yet clear to the project whether an intention to preserve actually leads to deposit with a repository or not. Work on this project is ongoing and I’ll look forward to finding out more when it is available.

Bern at night
As always there was an array of excellent posters. There were two in particular that caught my eye this year.

Firstly a poster from the University of Illinois at Urbana-Champaign entitled Should We Keep Everything Forever?: Determining Long-Term Value of Research Data.

The poster discussed an issue that we have also been grappling with recently as part of Filling the Digital Preservation Gap, that of the value of research data. It proposed an approach to assessing the value of content within the Illinois Data Bank using automated methods and measurable criteria. Recognising that a human eye is also important in assessing value, it would highlight those datasets that appear to have a low value which can then be assessed in a more manual fashion. This pragmatic two-stage approach will ensure that data thought to be of low value can be discarded after 5 years but that time intensive manual checking of datasets is kept to a minimum. This is a useful model that I would like to hear more about once they get it fully established. There was a lot of buzz around this poster and I wasn’t surprised to see it shortlisted for the best poster award.

Another excellent poster (and worthy winner of the best poster award) was To Act or Not to Act - Handling File Format Identification Issues in Practice. This poster from ETH Zurich described how the institution handles file identification and validation errors within their digital archive and showed some worked examples of the types of problems they encountered. This kind of practical sharing of the nuts and bolts of digital preservation is really good to see, and very much in line with the recommendations we are making as part of Filling the Digital Preservation Gap. As well as finding internal solutions to these problems I hope that ETH Zurich are also passing feedback to the tool providers to ensure that the tools work more effectively and efficiently for other users. It is this feedback loop that is so important in helping the discipline as a whole progress.

OAIS panel session in full swing
A panel session on Monday afternoon entitled ‘OAIS for us all’ was also a highlight. I was of course already aware that the OAIS standard is currently under review and that DPC members and other digital preservation practitioners are invited and encouraged to contribute to the discussion. Despite best intentions and an obvious interest in the standard I had not yet managed to engage with the review. This workshop was therefore a valuable opportunity to get up to speed with the process (as far as the panel understood it!) and the community feedback so far.

It was really useful to hear about the OAIS discussions that have been held internationally and of course interesting to note the common themes recurring throughout – for example the desire for a pre-ingest step within the model, the need to firm up the reference model to accommodate changes to AIPs that may occur through re-ingest, and the need for openness with regard to audit and certification standards.

This session was a great example of an international collaboration to help shape the standards that we rely so much on. I do hope that the feedback from our community is given full consideration in the revised OAIS Reference Model.

Me and Steve presenting our paper
(image from @shirapeltzman)
On Tuesday morning I gave a paper with Steve Mackey from Arkivum in the Research Data Preservation session (and I was really pleased that there was a whole session devoted to this topic). I presented on our work to link repositories to Archivematica, through the Filling the Digital Preservation Gap project, and focused in particular on the long tail of research data file formats and the need to address this as a community. It was great to be able to talk to such a packed room and this led to some really useful discussions over the lunch break and at the conference dinner that evening.

One of the most interesting sessions of the conference for me was one that was devoted to ingest tools and methods. At a conference such as this, I'm always drawn to the sessions that focus on practical tools and first-hand experiences of doing things rather than the more theoretical strands, so this one was an obvious choice for me. First we had Bruno Ferreira from KEEP SOLUTIONS talking about the Database Preservation Toolkit (more about this toolkit later). This was followed by "Exploring Friedrich Kittler's Digital Legacy on Different Levels: Tools to Equip the Future Archivist" from Jurgen Enge and Heinz Werner Kramski of the University of Art and Design in Basel.

It was fascinating to see how they have handled the preservation of a large, diverse and complex digital legacy and overcome some of the challenges and hurdles that this has thrown at them. The speakers also made the point that the hardware itself is important evidence in its physical form, showing for instance how regularly Friedrich Kittler clearly used the reset button on his PC!

Conference delegates relaxing on the terrace
Two further presentations focused on the preservation of e-mail - something that I have little experience of but I am sure I will need to work on in the future. Claus Jensen from the Royal Library in Denmark presented a solution for acquisition of email. This seemed a very pragmatic approach and the team had clearly thought through their requirements well and learned from their initial prototype before moving to a second iteration. I'm keen to follow up on this and read the paper in more detail.

Brent West from the University of Illinois followed on with another interesting presentation on Processing Capstone Email using Predictive Coding. This talk focused on the problems of making appraisal decisions and sensitivity assessments for email and how a technology assisted review could help, enabling the software to learn from human decisions that are made and allowing human effort to be reduced and targeted. Again, I felt that this sort of work could be really useful to me in the future if I am faced with the task of e-mail preservation at scale.

A very expensive gin and tonic!
The BitCurator Mixer on the Tuesday night provided a good opportunity to talk to other users of BitCurator. I confess to not actually being a user just yet but having now got my new ingest PC on my desk, it is only a matter of time before I get this installed and start playing and testing. Good to talk to some experienced users and pick up some tips regarding how to install it and where to find example workflows. What sticks most in my mind though is the price of the gin and tonic at the bar we were in!

On Wednesday afternoon I took part in a workshop called OSS4PRES 2.0: Building Bridges and Filling Gaps – essentially a follow on from a workshop called Using Open-Source Tools to Fulfill Digital Preservation Requirements that I blogged about from iPRES last year. This was one of those workshops where we were actually expected to work (always a bit of a shock after lunch!) and the participants split into groups to address 3 different areas. One group was looking at gaps in the open source digital preservation tool set that we should be looking to fill (either by enhancing existing tools or with the development of new tools). Another group was working on drawing up a set of guidelines for providers of open source tools. The group I was in was thinking about the creation of a community space for sharing digital preservation workflows. This is something that I think could turn into a really valuable resource for practitioners who want to see how others have implemented tools. All the groups came out with lots of ideas and an action plan by the end of the afternoon and work in these areas is scheduled to continue outside of the workshop. Great to see a workshop that is more than just a talking shop but that will lead to some more concrete results.

My laptop working hard at the Database Preservation workshop
On Thursday morning I attended another really useful hands-on workshop called Relational Database Preservation Standards and Tools. Participants were encouraged to try out the SIARD Suite and Database Preservation Toolkit on their own laptops. The value and outcomes of this workshop were clear and it really gave a sense of how we might use these tools to create preservation versions of data from relational databases. Designed to work with a number of widely used relational database systems, the tools allow data to be extracted into the SIARD 2 format. This format is essentially a zip file containing the relevant information in XML. It goes one better than the csv format (the means by which I have preserved databases in the past) as it contains both information about the structure of the data and the content itself, and allows you to add metadata about how the data was extracted. It looks to be particularly useful for taking snapshots of live and active databases for preservation on a regular cycle. I could definitely see myself using these tools in the future.
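Since a SIARD 2 package is just a zip file full of XML, it is easy to peek inside one to see what has been captured. Below is a minimal sketch of my own (not part of the SIARD Suite or the Database Preservation Toolkit) that lists the entries in a package and prints the first line of each XML file; the example file name is hypothetical and the exact internal layout (for example a header/metadata.xml entry) is my assumption based on my reading of the SIARD 2 specification.

```python
import zipfile

def inspect_siard(path):
    """List the contents of a SIARD 2 package and peek at its XML entries.

    A SIARD 2 file is a zip archive containing XML: descriptive and
    structural metadata (e.g. a header/metadata.xml entry) alongside
    the table content itself.
    """
    with zipfile.ZipFile(path) as siard:
        for entry in siard.namelist():
            print(entry)
            if entry.lower().endswith(".xml"):
                # Show the first line of each XML entry to get a feel
                # for what information the package records.
                with siard.open(entry) as xml_file:
                    head = xml_file.read(500).decode("utf-8", errors="replace")
                    print("   ", head.splitlines()[0] if head else "(empty)")

if __name__ == "__main__":
    inspect_siard("example-database.siard")  # hypothetical file name
```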

iPRES2016 and Suisse Toy 2016 meet outside the venue
There was some useful discussion at the end of the session about how these tools would actually fit into a wider preservation workflow and whether they could be incorporated into digital preservation systems (for example Archivematica) and configured as an automatic migration path for Microsoft Access databases. The answer was yes, this is technically possible, but the tool creators cautioned that full automation may not be the best approach: a human eye is typically required to establish which parts of the database should be preserved and retained, and to tailor the creation of the SIARD 2 file accordingly.

On the last afternoon of the conference it was good to be able to pop into the Swiss Toy Fair which was being held at the same venue as the conference. A great opportunity to buy some presents for the family before heading back to the UK.

Wednesday, 21 September 2016

File format identification at Norfolk Record Office

This is a guest post from Pawel Jaskulski who has recently completed a Transforming Archives traineeship at Norfolk Record Office (NRO). As part of his work at Norfolk and in response to a question I posed in a previous blog post ("Is identification of 37% of files a particularly bad result?") he profiled their digital holdings using DROID and has written up his findings. Coming from a local authority context, his results provide an interesting comparison with other profiles that have emerged from both the Hull History Centre and the Bentley Historical Library and again help to demonstrate that the figure of 37% identified files for my test research dataset is unusual.

King's Lynn's borough archives are cared for jointly by the Borough Council and the Norfolk Record Office

Profiling Digital Records with DROID

With any local authority archive there is an assumption that an accession might contain literally anything. In 'digital terms' this means it is impossible to predict what sort of data might be coming in in the future. That is the reason why NRO have been actively developing their digital preservation strategy, with the aim of building the capability to choose digital records over their paper-based equivalents (hard copies/printouts).

The archive service has been receiving digital records accessions since the late 1990s. The majority of digitally born archives came in as hybrid accessions from local schools that were being closed down. For many records there were no paper equivalents. Other deposits containing digital records include architectural surveys, archives of private individuals and local organisations (for example Parish Council meeting minutes).

The archive service has been using DROID as part of its digital records archival processing procedure as it connects to PRONOM, the most comprehensive and continuously updated file format registry. Archivematica, an ingest system that also draws on the PRONOM registry, is currently being introduced at NRO; it includes other file format identification tools such as FIDO and Siegfried (both of which use PRONOM identifiers).

The results of the DROID survey were as follows:

With the latest signature file (v.86), identification was successful for 96.46% of the 49,117 files.

DROID identified 107 different file formats. The ten most recurring file formats were:

  • JPEG File Interchange Format, versions 1.01 and 1.02 (Image (Raster); fmt/43, fmt/44)
  • Exchangeable Image File Format (Compressed), versions 2.1 and 2.2 (Image (Raster); x-fmt/390, x-fmt/391)
  • Windows Bitmap (Image (Raster))
  • Hypertext Markup Language (Text (Mark-up); fmt/96, fmt/99)
  • Microsoft Word Document (Word Processor)
  • Tagged Image File Format (Image (Raster))
  • Microsoft Outlook Email Message
  • AppleDouble Resource Fork
  • Graphics Interchange Format (Image (Raster))
  • Exchangeable Image File Format (Compressed) (Image (Raster))

Identification method breakdown:

  • 83.31% were identified by signature
  • 14.95% by container
  • 1.73% by extension

458 files had their extensions mismatched - that amounts to less than one per cent (0.97%). These were a variety of common raster image formats (JPEG, PNG, TIFF), word processor formats (Microsoft Word Document, ClarisWorks Word Processor) and desktop publishing formats (Adobe Illustrator, Adobe InDesign Document, Quark Xpress Data File).

Among the 3.54% of unidentified files there were 160 different unknown file extensions. The top five were as follows (see the sketch after this list for one way of pulling figures like these out of a DROID export):

  • .cmp
  • .mov
  • .info
  • .eml
  • .mdb
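Figures like the identification rate, the method breakdown and the most common unidentified extensions can all be derived from a DROID CSV export with a short script. The sketch below is illustrative only - it assumes a one-row-per-file export with DROID's default column headings (TYPE, METHOD, EXT, PUID), and the file name is hypothetical - so check it against your own export before trusting the numbers.

```python
import csv
from collections import Counter

def summarise_droid_export(csv_path):
    """Summarise identification methods and unidentified extensions
    from a DROID CSV export (assumes one row per file or folder)."""
    methods = Counter()
    unidentified_exts = Counter()
    total_files = 0

    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("TYPE") == "Folder":      # skip folder rows
                continue
            total_files += 1
            if row.get("PUID"):                  # identified: note the method
                methods[row.get("METHOD") or "Unknown"] += 1
            else:                                # unidentified: note the extension
                unidentified_exts[(row.get("EXT") or "").lower() or "(none)"] += 1

    identified = sum(methods.values())
    if total_files:
        print(f"Identified {identified} of {total_files} files "
              f"({identified / total_files:.2%})")
    for method, count in methods.most_common():
        print(f"  {method}: {count / identified:.2%} of identified files")
    print("Most common unidentified extensions:",
          unidentified_exts.most_common(5))

summarise_droid_export("nro_droid_export.csv")  # hypothetical file name
```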

Two files returned more than 1 identification:

A spreadsheet file with .xls extension (last modified date 2006-12-17) had 3 possible file format matches:

  • fmt/175 Microsoft Excel for Macintosh 2001
  • fmt/176 Microsoft Excel for Macintosh 2002
  • fmt/177 Microsoft Excel for Macintosh 2004

And an image file with extension .bmp (last modified date 2007-02-06) received two file format matches:

  • fmt/116 Windows Bitmap 3
  • fmt/625 Apple Disk Copy Image 4.2

After closer inspection the actual file was a bitmap image file and PUID fmt/116 was the correct one.

Understanding the Results

DROID offers a very useful classification of file formats and puts all results into categories, which enables an overview of the digital collection: it is easy to understand what sort of digital content predominates within a digitally born accession, archive or collection. The classification system assigns file formats to broader groups such as Audio, Word Processor, Page Description, Aggregate etc., which helps enormously in getting a grasp on the variety of digital records. For example it was interesting to discover that over half of our digitally born archives are in various raster image file formats.

Files profiled at Norfolk Record Office as classified by DROID

I am of course also interested in the levels of risk associated with particular formats so have started to work on an additional classification for the data, creating further categories that can help with preservation planning. This would help demonstrate where preservation efforts should be focused in the future.

Friday, 16 September 2016

UK Archivematica group at Lancaster

Earlier this week UK Archivematica users descended on the University of Lancaster for our 5th user group meeting. As always it was a packed agenda, with lots of members keen to talk to the group and share their project plans or their experiences of working with Archivematica. Here are some edited highlights of the day. Also well worth a read is a blog about the day from our host which is better than mine because it contains pictures and links!

Rachel MacGregor and Adrian Albin-Clark from the University of Lancaster kicked off the meeting with an update on recent work to set up Archivematica for the preservation of research data. Adrian has been working on two Ruby gems to handle two specific parts of the workflow. The puree gem gets metadata out of the PURE CRIS system in a format that is easy to work with (we are big fans of this gem at York, having used it in our phase 3 implementation work for Filling the Digital Preservation Gap). The other gem solves a different problem: getting the deposited research data and associated data packaged up in a format that is suitable for Archivematica to ingest. Again, this is something we may be able to utilise in our own workflows.

Jasmin Boehmer, a student from Aberystwyth University presented some of the findings from the work she has been doing for her dissertation. She has been testing how metadata can be added to a Submission Information Package (SIP) for inclusion within an Archival Information Package (AIP) and has been looking at a range of different scenarios. It was interesting to hear her findings, particularly useful for those of us who haven’t managed to carry out systematic testing ourselves. She concluded that if you want to store descriptive metadata at a per file level within Archivematica you should submit this via a csv file as part of your SIP. If you only use the Archivematica interface itself for adding metadata, you can only do this on a per SIP basis rather than at file level. It was interesting to see that if you include rights metadata within your file level csv file this will not be stored within the PREMIS section of the XML as you might expect so this does not solve a problem we raised during our phase 1 project work for Filling the Digital Preservation Gap regarding ingesting a SIP with different rights recorded per file.

Jake Henry from the National Library of Wales discussed some work newly underway to build on the work of the ARCW digital preservation group. The project will enable members of ARCW to use Archivematica without having to install their own local version, using pydio as a means of storing files before transfer. As part of this project they are now looking at a variety of systems that they would like Archivematica to integrate with. They are hoping to work on an integration with CALM. There was some interest in this development around the room and I expect there would be many other institutions who would be keen to see this work carried out.

Kirsty Lee from the University of Edinburgh reported on her recent trip to the States to attend the inaugural ArchivematiCamp with her colleague Robin Taylor. It sounded like a great event with some really interesting sessions and discussions, particularly regarding workflows (recognising that there are many different ways you can use Archivematica) as well as some nice social events. We are looking forward to seeing an ArchivematiCamp in the UK next year!

Julie and I presented on some of the implementation work we have been doing over the last few months as we complete phase 3 of Filling the Digital Preservation Gap. Julie talked about what we were trying to achieve with our proof of concept implementation and then showed a screencast of the application itself. The challenges we faced and things that worked well during phase 3 were discussed before I summarised our plans for the future.

I went on to introduce the file formats problem (which I have previously touched on in other blog posts) before taking the opportunity to pick people’s brains on a number of discussion points. I wanted to understand workflows around unidentified files (not just for research data). I was interested to know three things:

  1. At what point would you pick up on unidentified file formats in a deposit - prior to using Archivematica or during the transfer stage within Archivematica?
  2. What action would you take to resolve this situation (if any)?
  3. Would you continue to ingest material into the archive whilst waiting for a solution, or keep files in the backlog until the correct identification could be made?
Answers from the floor suggested that one institution would always stop and carry out further investigations before ingesting the material and creating an Archival Information Package (AIP) but that most others would continue processing the data. With limited staff resource for curating research data in particular, it is possible that institutions will favour a fully automated workflow such as the one we have established in our proof of concept implementation, and regular interventions around file format identification may not be practical. Perhaps we need to consider how we can intervene in a sustainable and manageable way, rather than looking at each deposit of data separately. One of the new features in Archivematica is the AIP re-ingest which will allow you to pull AIPs back from storage so that tools (such as file identification) can be re-run - this was thought to be a good solution.

John Kaye from Jisc updated us on the Research Data Shared Service project. Archivematica is one of the products selected by Jisc to fulfill the preservation element of the Shared Service and John reported on the developments and enhancements to Archivematica that are proposed as part of this project. It is likely that these developments will be incorporated into the main code base thus be available to all Archivematica users in the future. The growth in interest in Archivematica within the research data community in the UK is only likely to continue as a result of this project.

Heather Roberts from the Royal Northern College of Music described where her institution is with digital preservation and asked for advice on how to get started with Archivematica. Attendees were keen to share their thoughts (many of which were not specific to Archivematica itself but would be things to consider whatever solution was being implemented) and Heather went away with some ideas and some further contacts to follow up with.

To round off the meeting we had an update and Q&A session with Sarah Romkey from Artefactual Systems (who is always cheerful no matter what time you get her out of bed for a transatlantic Skype call).

Some of the attendees even managed to find the recommended ice cream shop before heading back home!

We look forward to meeting at the University of Edinburgh next time.

Tuesday, 30 August 2016

Filling the Digital Preservation Gap - a brief update

As we near the end of the active phase of Filling the Digital Preservation Gap* here is a brief update about where we are with the main strands of work we highlighted in our phase 3 kick off blog post.

Archivematica implementation

Work at York

Work is ongoing at York to get our proof of concept implementation of Archivematica up and running. The purpose of this work was not to get a production service in place but to demonstrate that the implementation plan we published in our phase 2 report was feasible. The implementation we are developing pulls metadata (about deposited research datasets) from PURE and provides a method for capturing additional information for managing datasets (filling some of the gaps in the information collected through the PURE datasets module). It also includes an automated process to ingest deposited datasets (along with their metadata) into Archivematica, package them up for longer term preservation and provide a dissemination copy of the dataset to our repository.

We have been doing this work in consultation with the staff at York who actually work with datasets that are deposited through our Research Data York service to ensure that the workflows and processes we are putting in place will make their lives easier rather than harder! We are keen to ensure that those processes that can be automated are automated and those areas where human input is required trigger e-mail notifications to relevant staff and a pause in the workflow to enable the relevant checks to be made.

Work at Hull

Like York, Hull is looking to produce a proof-of-concept system within the timeframe of the project. Whilst concentrating on research data for this Jisc funded work, we also have an eye on later applying our approach to other forms of repository content that deserve long-term preservation. To that end, we are taking as our starting point the institutional Box folder that each of our staff has access to; we will be asking depositors to assemble their material for the repository in a folder within their Box account. As well as the content itself they will be asked for basic metadata and processing instructions in a very simple format. When the folder is ready they share it with another Box account “owned” by Archivematica.

Hull has developed a “Box watcher” which detects the new share and instigates processing of the contents, keeping the depositor aware of progress along the way.  The contents of the folder are examined and, depending on what is found and how it is configured, one or more Bags (as in the BagIt standard) are created and handed off to Archivematica.

Like York we are then looking to have a fully automated Archivematica workflow which produces Archival Information Packages corresponding to each of the bags.  In addition, Hull will have Archivematica create Dissemination Information Package(s) which, once created, will automatically be processed to produce objects in the quality assurance queue of our Hydra repository.
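As an illustration of the bagging step (this is a sketch of mine, not Hull's actual Box watcher code), the Python bagit library can turn an assembled deposit folder into a BagIt bag ready for hand-off to Archivematica; the folder path and the metadata fields shown are placeholders.

```python
import bagit  # pip install bagit

def bag_deposit(deposit_dir, depositor_email):
    """Turn an assembled deposit folder into a BagIt bag, in place,
    ready to be handed off to Archivematica."""
    bag = bagit.make_bag(
        deposit_dir,
        {
            # Placeholder bag-info metadata - in practice this would come
            # from the simple metadata form completed by the depositor.
            "Contact-Email": depositor_email,
            "Source-Organization": "University of Hull",
        },
        checksums=["sha256"],
    )
    bag.validate()  # confirm completeness and checksums before hand-off
    return bag

bag_deposit("/path/to/shared-box-folder", "researcher@example.ac.uk")
```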

Unidentified file formats

It has been clear from our project work during phase 3 that research data is much harder to identify in an automated fashion than other types of born digital data that an archive would typically hold. If you don’t believe us, read the two posts on this blog that show contrasting results when trying to identify two different types of born digital data: 'Research data - what does it *really* look like?' and 'Research data is different'.

So, how are we working towards a solution? As well as directly sponsoring the development of a small selection of research data file formats by the PRONOM team at The National Archives, we also had a go at creating our own. York’s new signature will be incorporated into PRONOM in due course. Hull’s signature has been submitted and is just being tested by the PRONOM team. There have also been positive discussions with colleagues at The National Archives about wider public engagement around file format signature development and how we work towards increasing the coverage of PRONOM for research data file formats.

Dissemination and outreach

The project team have been keen to continue their focus on dissemination during phase 3 of the project. This has included presentations or posters at the following conferences and events:

  • International Digital Curation Conference (IDCC16), Amsterdam
  • 'Digital Preservation: Strategic Issues' - National Library of Wales
  • UK Archives Discovery Forum, Kew
  • UK Archivematica meeting, York
  • Research Data, Records and Archives: Breaking the Boundaries, Edinburgh
  • Open Repositories, Dublin
  • Jisc CNI conference, Oxford
  • Hydra Virtual Connect
  • TNA Digital Transformation day, Kew

...and our outreach work continues. Watch out for us at the Jisc Research Data Network event in Cambridge next week, the next UK Archivematica meeting in Lancaster the week after, the iPRES conference and Hydra Connect in October and of course the final Jisc Research Data Spring showcase event which will be later on in October.

And of course we have been blogging as usual throughout this phase of the project so do read back to see our previous posts for more information and watch out for our phase 3 final report in mid-October.

* we formally complete the project work on 14th Sept and will focus on writing up our final report over the following month

Friday, 19 August 2016

My first file format signature

As part of Filling the Digital Preservation Gap we've been doing a lot of talking about the importance of accurate file format identification and the challenges of doing so for research data.

Now we are thinking about how we can help solve the problem.

As promised in a post last month, I wanted to have a go at file format signature creation for PRONOM to see whether it is something that an average digital archivist could get their head around. Never before had I created my own signature. In the past I had considered this to be work that only technical people could carry out and it would be foolhardy to attempt it myself.

However, given the extent of the file formats identification challenge for research data wouldn't it be great if the community could engage more directly? Also, shouldn't file signature development be something every digital archivist should have a good understanding of?

Encouraged by Ross Spencer's blog post Five Star File Format Signature Development and a meeting with the PRONOM team at The National Archives in which the tricks and challenges of signature creation were explained, I decided to give it a go.

Where to begin?

  • First I read TNA's How to research and develop signatures for file format identification. This is an accessible and readable guide which tells you how to get started with signature development - from gathering samples, doing internet research on the format and using a Hex Editor to spot patterns. You don't need to be very technically minded to get your head around it.
  • Then I downloaded and installed a Hex editor. Though it is possible to view files as hexadecimal within Quick View Plus, I followed TNA's advice and used HxD Hex Editor as this allows you to compare files thus partially automating the process of spotting sequences.
  • Once I'd spotted a pattern which could be used to create a signature, I planned to use PRONOM's Signature Development Utility to create it. 
  • Once the signature is created I'm told it is possible to test this using DROID. Within DROID, go to Tools... Install signature file and replace the current signature with your new one (but remember to put it back again once you are done otherwise you may wonder why DROID isn't working properly!). Run this over your directory of sample files to see if they are all correctly identified using the signature you have developed. 

I decided to start off with something that had been at the back of my mind for a while...those Wordstar 4.0 files from the Marks and Gran archive that I blogged about some time ago and had struggled to identify. When I wrote that post three years ago, Wordstar 4.0 files were not represented in PRONOM. They have more recently been added and the files can now be identified, but this is by extension only - not by the more accurate file signature. I thought it would be fun to try and create a file signature for them.

I was very wrong.

My attempts to see a pattern within the files using the Hex Editor were unsuccessful. I decided to send the sample files to the experts at TNA to see if I was missing something. It was quickly confirmed that this was a rather awkward file type and not one that lent itself well to being automatically identified. Disappointing but at least it confirmed that my own investigations were not lacking.

For my next attempt I decided to tackle some of the unidentified research data that I had highlighted in my previous post Research data - what does it *really* look like?

I looked through the top ten most frequent unidentified file extensions in my sample and started to dig out the files themselves and assess whether they were a good candidate for me to work on. Ross Spencer suggests that PRONOM lends itself best to the creation of signatures for binary formats so this is what I wanted to focus on. No point in trying to make it hard for myself!

  • dat - 286
  • crl - 259
  • sd - 246
  • jdf - 130 (a signature for these JEOL NMR Spectroscopy files is now available)
  • out - 50
  • mrc - 47
  • inp - 46
  • xyz - 37
  • prefs - 32
  • spa - 32
Unfortunately, looking through the list (and digging out some samples) I discovered that many of these are ASCII formats rather than binary. It is possible to create signatures to identify ASCII files but it can be challenging (involving quite complex regular expressions) and not a great place for a first timer to start. I certainly did not want to start to tackle the confusing landscape of .dat files either!

After a little bit of investigation I discovered that the .spa files were something I could work with. I knew nothing about this format but found the relevant files and started doing some internet research looking for more information and perhaps some additional samples. I soon discovered they were one of many formats for optical spectroscopy and are known as Thermo Fisher’s OMNIC file format or Thermo Scientific OMNIC or Nicolet/Thermo OMNIC.

Looking at some of the files using a Hex Editor it was immediately apparent that there was a consistent pattern of bytes at the start of each file: a string reading 'Spectral Data File', represented in hexadecimal as 53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65. Note that I initially thought the pattern was longer, but advice received from the PRONOM team suggested that it was better to cut it down.

I also looked at the end of each file and at first sight there appeared to be consistency here too with each file ending with the same few bytes. This hypothesis was blown out of the water when I looked at a sample file that I had discovered online which did not display this pattern (but luckily did have the same bytes at the start of the file).

This is why it is so important to have sample data that comes from more than one source. A set of files from a single researcher may have misleading patterns that have occurred just because of the consistent way in which they work, rather than this being a true feature of the format itself.
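This checking can also be scripted rather than done entirely by eye in the hex editor. The snippet below is a little sketch of my own (not part of DROID or the PRONOM tooling) that looks for the 'Spectral Data File' byte sequence near the beginning of each sample file; searching within the first 64 bytes rather than insisting on byte zero is a cautious assumption on my part, and the folder name is hypothetical.

```python
from pathlib import Path

# 'Spectral Data File' as spotted in the hex editor:
# 53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65
MAGIC = b"Spectral Data File"

def check_samples(sample_dir, window=64):
    """Report whether each sample .spa file contains the magic bytes
    within the first `window` bytes."""
    for path in sorted(Path(sample_dir).glob("*.spa")):
        head = path.read_bytes()[:window]
        offset = head.find(MAGIC)
        if offset >= 0:
            print(f"{path.name}: magic found at offset {offset}")
        else:
            print(f"{path.name}: magic NOT found - hypothesis fails")

check_samples("spa_samples")  # hypothetical folder of sample files
```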

So, once I'd looked at all 33 files and had convinced myself the hypothesis was solid, I went to the online signature development tool provided by The National Archives and created my signature.

PRONOM signature development tool

This was relatively easy to use but there were areas where more guidance was needed (so questions were fired off to the PRONOM team and a speedy response was received). I'm hoping that in the future there will be more documentation to help guide the completion of this form - so that people know how best to name the signature, where to find a definitive list of Mimetypes (this is the list they suggested I looked at), and what the Version field should contain (it is for the version of the file format if this is apparent/relevant - not the version of the signature you are creating).

Once I was done, I clicked on the 'Save Signature File' button and I was presented with the finished XML file:

Ta daaaaa!

I briefly admired my handiwork before sending it off to The National Archives for feedback.*

How long did it take me to do all of this? I would say one full day is a fair estimate (that would include reading the guidance, downloading the Hex Editor and a few false starts as I tried to find a format that I thought I could handle). The next signature would be much quicker.

The biggest challenges:

  1. It took me a while to find a binary format that I could work with. Much of the research data we hold appears to be in ASCII formats, which has benefits from a digital preservation perspective but wasn't what I was looking for with regard to this exercise.
  2. I did not really understand the file format I was working with. I am not a chemist. I have never heard of the .spa format. I struggle to even say 'Spectroscopy' let alone understand it. When I started to research it online I found the results quite confusing. If I knew more about the format in the first place it would have made life much easier. 
  3. There are limitations with the metadata we get from researchers when they deposit data with Research Data York. Reading the brief descriptions of the dataset that are provided did not really help me work out what the individual files are or what software and hardware was used to create them.
  4. I could not locate the file format specification online - I think next time I try this I may approach the software vendor direct and ask them for help. 
  5. Available documentation for creating and testing signatures could be enhanced. I had several questions as I went along and these were answered promptly by the PRONOM team, but if the information was all online then this would certainly help other newbies.

Despite the challenges this exercise has been both enjoyable and useful. The thing I like about being a digital archivist is being able to get hands on with the data and solve problems. Over the last few years I've done very little of this type of work so it was great to get stuck in. On top of the obvious benefit that after the next signature release these .spa files will now be recognised by DROID and other PRONOM-based file identification tools, I have also increased my knowledge and understanding of the process and this is a positive result.

I would definitely encourage other digital archivists, repository managers and research data managers to try this out for themselves.

* Feedback from the PRONOM team was positive. With a couple of modifications they were happy to include the signature in the next PRONOM signature release

Friday, 5 August 2016

Research data is different

This is a guest post from Simon Wilson who has been profiling the born-digital data at the Hull History Centre to provide another point of comparison with the research data at York reported on in this blog back in May.

Inspired by Jen’s blog post Research data - what does it *really* look like? about the profile of the research data at York and the responses it generated, including one from the Bentley Historical Library, I decided to take a look at some of the born-digital archives we have at Hull. This data is not research data from academics; it is data that has been donated to or deposited with the Hull History Centre and it comes from a variety of different sources.

Whilst I had previously created a DROID report for each distinct accession I have never really looked into the detail, so for each accession I did the following:

  1. Run the DROID software and export the results into csv format with one row per file 
  2. Open the file in MS Excel and copy the data to a second tab for the subsequent actions
  3. Sort the data by Type field into A-Z order and then delete all of the records relating to folders 
  4. Sort the data on the PUID field into A-Z order
  5. For large datasets highlight the data and then select the subtotal tool and use it to count each time the PUID field changes and record the sub-total
  6. Once the subtotal tool has completed its calculations, select the entire dataset and select Hide Detail (adjacent to Subtotal in the Outline tools box) to leave you with just a row for each distinct PUID and the total count value

I then created a simple spreadsheet with a column for each distinct accession and added a row for each unique PUID, copying the MIME type, software and version details from the DROID report results.  I also noted the number of files that were not identified. There may be quicker ways to get the same results and I would love to hear other suggestions or shortcuts.
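One possible shortcut - offered here as an untested sketch rather than a recipe - would be to let pandas do the grouping instead of the Excel subtotal tool. It assumes DROID's default CSV column headings (TYPE, PUID, FORMAT_NAME, FORMAT_VERSION, MIME_TYPE) and a hypothetical folder of per-accession exports, so the column names should be checked against a real export first.

```python
import pandas as pd
from pathlib import Path

def puid_counts(csv_path):
    """Count files per PUID in a single DROID export, ignoring folder rows.
    Unidentified files appear as a row with a blank PUID."""
    df = pd.read_csv(csv_path)
    files = df[df["TYPE"] != "Folder"]
    return (files.groupby(["PUID", "FORMAT_NAME", "FORMAT_VERSION", "MIME_TYPE"],
                          dropna=False)
                 .size()
                 .rename(Path(csv_path).stem))  # name the column after the accession

# Combine the per-accession counts into one summary:
# a row per PUID and a column per accession, as in the spreadsheet described above.
exports = sorted(Path("droid_exports").glob("*.csv"))  # hypothetical folder
summary = pd.concat([puid_counts(p) for p in exports], axis=1).fillna(0)
summary.to_csv("puid_summary_by_accession.csv")
```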

After completing this for 24 accessions - totalling 270,867 files - what have I discovered?

  • An impressive 97.96% of files were identified by DROID (compared with only 37% in Jen's smaller sample of research data)
  • So far 228 different PUIDs have been identified (compared with 34 formats in Jen’s sample)
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%). The top ten identified formats, in descending order of frequency, were:

  1. Microsoft Word Document (97-2003)
  2. Microsoft Word for Windows (2007 onwards)
  3. Microsoft Excel 97 Workbook
  4. Graphics Interchange Format
  5. Acrobat PDF 1.4 - Portable Document Format
  6. JPEG File Interchange Format (1.01)
  7. Microsoft Word Document (6.0 / 95)
  8. Acrobat PDF 1.3 - Portable Document Format
  9. JPEG File Interchange Format (1.02)
  10. Hypertext Markup Language (v4)

I can now quickly look up whether an individual archive has a particular file type, and see how frequently it occurs. Once I have processed a few more accessions it may be possible to create a "profile" for an individual literary collection or a small business and use this to inform discussions with depositors. I can also start to look at the identified file formats and determine whether there is a strategy in place to migrate each format. Where this isn’t the case, knowing the number and frequency of the format amongst the collections will allow me to prioritise my efforts. I will also look to aggregate the data – for example merging all of the different versions of Adobe Acrobat or MS Word.

I haven’t forgotten the 5520 unidentified files. By noting the PRONOM signature file number used to profile each archive, it is easy to repeat the process with a later signature file.  This could validate the previous results or enable previously unidentified files to be identified (particularly if I use the results of this exercise to feed information back to the PRONOM team). Knowing which accessions have the largest number of unidentified files will allow me to focus my effort as appropriate.

Whilst this has certainly been a useful exercise in its own right, it is also interesting to note the similarities between this profile and the born-digital archives profile published by the Bentley Historical Library, and the contrast with the research data profile Jen reported on.

The top ten identified formats from Hull and Bentley are quite similar. Both have a good success rate for identifying file formats with 90% identified at Bentley and 98% at Hull. Though the formats do not appear in the same order in the top ten, they do contain similar types of file (MS Word, PDF, JPEGs, GIFs and HTML).

In contrast, only 37% of files were identified in York's research data sample and the top ten file formats that were identified look very different. The only area of overlap is MS Excel files, which appear high up in the York research dataset as well as in the top ten for the Hull History Centre.

Research data is different.