Wednesday, 21 September 2016

File format identification at Norfolk Record Office

This is a guest post from Pawel Jaskulski who has recently completed a Transforming Archives traineeship at Norfolk Record Office (NRO). As part of his work at Norfolk and in response to a question I posed in a previous blog post ("Is identification of 37% of files a particularly bad result?") he profiled their digital holdings using DROID and has written up his findings. Coming from a local authority context, his results provide an interesting comparison with other profiles that have emerged from both the Hull History Centre and the Bentley Historical Library and again help to demonstrate that the figure of 37% identified files for my test research dataset is unusual.

King's Lynn's borough archives are cared for jointly by the Borough Council and the Norfolk Record Office

Profiling Digital Records with DROID

With any local authority archive there is an assumption that the accession deposited might be literally anything. What it means in 'digital terms' is that it is impossible to predict what sort of data might be coming in in the future. That is the reason why NRO have been actively involved in developing their digital preservation strategy, aiming at achieving capability so as to be able to choose digital records over their paper-based equivalents (hard copies/printouts).

The archive service has been receiving digital records accessions since the late 1990's. The majority of digitally born archives came in as hybrid accessions from local schools that were being closed down. For many records there were no paper equivalents. Among other deposits containing digital records are architectural surveys, archives of private individuals and local organisations (for example Parish Council meetings minutes).

The archive service have been using DROID as part of their digital records archival processing procedure as it connects to the most comprehensive and continuously updated file formats registry PRONOM. Archivematica, an ingest system that uses the PRONOM registry, is currently being introduced at NRO. It contains other file format identification tools like FIDO or Siegfried (which both use PRONOM identifiers).

The results of DROID survey were as follows:

With the latest signature file (v.86) out of 49,117 files identification was successful for 96.46%.

DROID identified 107 various file formats. The ten most recurring file formats were:

File Format Name
Image (Raster)
JPEG File Interchange Format
1.01, 1.02
fmt/43, fmt/44
Image (Raster)
Exchangeable Image File Format (Compressed)
2.1, 2.2
x-fmt/390, x-fmt/391
Image (Raster)
Windows Bitmap
Text (Mark-up)
Hypertext Markup Language
fmt/96, fmt/99
Word Processor
Microsoft Word Document
Image (Raster)
Tagged Image File Format
Microsoft Outlook Email Message
AppleDouble Resource Fork
Image (Raster)
Graphics Interchange Format
Image (Raster)
Exchangeable Image File Format (Compressed)

Identification method breakdown:

  • 83.31% was identified by signature
  • 14.95% by container
  • 1.73% by Extension 

458 files had their extensions mismatched - that amounts to less than one per cent (0.97%). These were a variety of common raster image file formats (JPEG, PNG, TIFF) word processor (Microsoft Word Document, ClarisWorks Word Processor) and desktop publishing (Adobe Illustrator, Adobe InDesign Document, Quark Xpress Data File).

Among 3.54% of unidentified files there were 160 different unknown file extensions. Top five were:

  • .cmp
  • .mov
  • .info
  • .eml
  • .mdb

Two files returned more than 1 identification:

A spreadsheet file with .xls extension (last modified date 2006-12-17) had 3 possible file format matches:

  • fmt/175 Microsoft Excel for Macintosh 2001
  • fmt/176 Microsoft Excel for Macintosh 2002
  • fmt/177 Microsoft Excel for Macintosh 2004

And an image file with extension .bmp (last modified date 2007-02-06) received 2 file format matches

  • fmt/116 Windows Bitmap 3
  • fmt/625 Apple Disk Copy Image 4.2

After closer inspection the actual file was a bitmap image file and PUID fmt/116 was the correct one.

Understanding the Results

DROID offers very useful classification of file formats and puts all results into categories, which enables an overview of the digital collection. It is easy to understand what sort of digital content is predominantly included within the digitally born accession/archive/collection. It uses classification system that assigns file formats to broader groups like: Audio, Word Processor, Page Description, Aggregate etc. These help enormously in having a grasp on the variety of digital records. For example it was interesting to discover that over half of our digitally born archives are in various raster image file formats.

Files profiled at Norfolk Record Office as classified by DROID

I am of course also interested in the levels of risk associated with particular formats so have started to work on an additional classification for the data, creating further categories that can help with preservation planning. This would help demonstrate where preservation efforts should be focused in the future.

Jenny Mitcham, Digital Archivist

Friday, 16 September 2016

UK Archivematica group at Lancaster

Earlier this week UK Archivematica users descended on the University of Lancaster for our 5th user group meeting. As always it was a packed agenda, with lots of members keen to talk to the group and share their project plans or their experiences of working with Archivematica. Here are some edited highlights of the day. Also well worth a read is a blog about the day from our host which is better than mine because it contains pictures and links!

Rachel MacGregor and Adrian Albin-Clark from the University of Lancaster kicked off the meeting with an update on recent work to set up Archivematica for the preservation of research data. Adrian has been working on two Ruby gems to handle two specific parts of the workflow. The puree gem which gets metadata out of the PURE CRIS system in a format that it is easy to work with (we are big fans of this gem at York having used it in our phase 3 implementation work for Filling the Digital Preservation Gap). Another gem helps solve another problem, getting the deposited research data and associated data packaged up in a format that is suitable for Archivematica to ingest. Again, this is something we may be able to utilise in our own workflows.

Jasmin Boehmer, a student from Aberystwyth University presented some of the findings from the work she has been doing for her dissertation. She has been testing how metadata can be added to a Submission Information Package (SIP) for inclusion within an Archival Information Package (AIP) and has been looking at a range of different scenarios. It was interesting to hear her findings, particularly useful for those of us who haven’t managed to carry out systematic testing ourselves. She concluded that if you want to store descriptive metadata at a per file level within Archivematica you should submit this via a csv file as part of your SIP. If you only use the Archivematica interface itself for adding metadata, you can only do this on a per SIP basis rather than at file level. It was interesting to see that if you include rights metadata within your file level csv file this will not be stored within the PREMIS section of the XML as you might expect so this does not solve a problem we raised during our phase 1 project work for Filling the Digital Preservation Gap regarding ingesting a SIP with different rights recorded per file.

Jake Henry from the National Library of Wales discussed some work newly underway to build on the work of the ARCW digital preservation group. The project will enable members of ARCW to use Archivematica without having to install their own local version, using pydio as a means of storing files before transfer. As part of this project they are now looking at a variety of systems that they would like Archivematica to integrate with. They are hoping to work on an integration with CALM. There was some interest in this development around the room and I expect there would be many other institutions who would be keen to see this work carried out.

Kirsty Lee from the University of Edinburgh reported on her recent trip to the States to attend the inaugural ArchivematiCamp with her colleague Robin Taylor. It sounded like a great event with some really interesting sessions and discussions, particularly regarding workflows (recognising that there are many different ways you can use Archivematica) as well as some nice social events. We are looking forward to seeing an ArchivematiCamp in the UK next year!

Myself and Julie presented on some of the implementation work we have been doing over the last few months as we complete phase 3 of Filling the Digital Preservation Gap. Julie talked about what we were trying to achieve with our proof of concept implementation and then showed a screencast of the application itself. The challenges we faced and things that worked well during phase 3 were discussed before I summarised our plans for the future.

I went on to introduce the file formats problem (which I have previously touched on other blog posts) before taking the opportunity to pick people’s brains on a number of discussion points. I wanted to understand workflows around non identified files (not just for research data). I was interested to know three things really:

  1. At what point would you pick up on unidentified file formats in a deposit - prior to using Archivematica or during the transfer stage within Archivematica?
  2. What action would you take to resolve this situation (if any)?
  3. Would you continue to ingest material into the archive whilst waiting for a solution, or keep files in the backlog until the correct identification could be made?
Answers from the floor suggested that one institution would always stop and carry out further investigations before ingesting the material and creating an Archival Information Package (AIP) but that most others would continue processing the data. With limited staff resource for curating research data in particular, it is possible that institutions will favour a fully automated workflow such as the one we have established in our proof of concept implementation, and regular interventions around file format identification may not be practical. Perhaps we need to consider how we can intervene in a sustainable and manageable way, rather than looking at each deposit of data separately. One of the new features in Archivematica is the AIP re-ingest which will allow you to pull AIPs back from storage so that tools (such as file identification) can be re-run - this was thought to be a good solution.

John Kaye from Jisc updated us on the Research Data Shared Service project. Archivematica is one of the products selected by Jisc to fulfill the preservation element of the Shared Service and John reported on the developments and enhancements to Archivematica that are proposed as part of this project. It is likely that these developments will be incorporated into the main code base thus be available to all Archivematica users in the future. The growth in interest in Archivematica within the research data community in the UK is only likely to continue as a result of this project.

Heather Roberts from the Royal Northern College of Music described where her institution is with digital preservation and asked for advice on how to get started with Archivematica. Attendees were keen to share their thoughts (many of which were not specific to Archivematica itself but would be things to consider whatever solution was being implemented) and Heather went away with some ideas and some further contacts to follow up with.

To round off the meeting we had an update and Q&A session with Sarah Romkey from Artefactual Systems (who is always cheerful no matter what time you get her out of bed for an transatlantic Skype call).

Some of the attendees even managed to find the recommended ice cream shop before heading back home!

We look forward to meeting at the University of Edinburgh next time.

Jenny Mitcham, Digital Archivist