File format identification at Norfolk Record Office

This is a guest post from Pawel Jaskulski, who has recently completed a Transforming Archives traineeship at Norfolk Record Office (NRO). As part of his work at Norfolk, and in response to a question I posed in a previous blog post ("Is identification of 37% of files a particularly bad result?"), he profiled their digital holdings using DROID and has written up his findings. Coming from a local authority context, his results provide an interesting comparison with the profiles that have emerged from the Hull History Centre and the Bentley Historical Library, and again help to demonstrate that the figure of 37% identified files for my test research dataset is unusual.

King's Lynn's borough archives are cared for jointly by the Borough Council and the Norfolk Record Office

Profiling Digital Records with DROID

With any local authority archive there is an assumption that a deposited accession might contain literally anything. In digital terms, this means it is impossible to predict what sort of data might arrive in the future. That is why NRO has been actively developing its digital preservation strategy, with the aim of being able to choose digital records over their paper-based equivalents (hard copies/printouts).

The archive service has been receiving accessions of digital records since the late 1990s. The majority of digitally born archives came in as hybrid accessions from local schools that were being closed down. For many records there were no paper equivalents. Other deposits containing digital records include architectural surveys and the archives of private individuals and local organisations (for example, Parish Council meeting minutes).

The archive service has been using DROID as part of its archival processing procedure for digital records because it draws on PRONOM, the most comprehensive and continuously updated file format registry. Archivematica, an ingest system that also uses the PRONOM registry, is currently being introduced at NRO. It includes other file format identification tools such as FIDO and Siegfried (both of which use PRONOM identifiers).

The results of the DROID survey were as follows:

With the latest signature file (v.86), identification was successful for 96.46% of the 49,117 files profiled.
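As a rough illustration of how a figure like this could be reproduced from a DROID CSV export, here is a minimal Python sketch. The column names TYPE and PUID are assumed to match the export, the sample rows are made up, and a file is treated as "identified" when its PUID is not empty. It is not necessarily how the NRO figure was calculated.

```python
import csv
import io


def identification_rate(rows):
    """Return (total_files, identified_files, percent_identified)."""
    total = identified = 0
    for row in rows:
        # Folders are listed in the export but are not files to identify.
        if row.get("TYPE") == "Folder":
            continue
        total += 1
        # Assumption: an empty PUID means no format was identified.
        if (row.get("PUID") or "").strip():
            identified += 1
    percent = 100.0 * identified / total if total else 0.0
    return total, identified, percent


# Made-up sample standing in for a real export (not NRO data).
SAMPLE = """TYPE,NAME,PUID
Folder,photos,
File,a.jpg,fmt/43
File,b.doc,fmt/40
File,c.cmp,
File,d.bmp,fmt/116
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
total, identified, percent = identification_rate(rows)
print(f"{identified} of {total} files identified ({percent:.2f}%)")
# prints: 3 of 4 files identified (75.00%)
```

For a real export, the in-memory sample would be replaced by a csv.DictReader over the exported file.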

DROID identified 107 different file formats. The ten most frequently occurring were:

(Classification: file format name, with versions and PUIDs where given)

  • Image (Raster): JPEG File Interchange Format, versions 1.01, 1.02 (fmt/43, fmt/44)
  • Image (Raster): Exchangeable Image File Format (Compressed), versions 2.1, 2.2 (x-fmt/390, x-fmt/391)
  • Image (Raster): Windows Bitmap
  • Text (Mark-up): Hypertext Markup Language (fmt/96, fmt/99)
  • Word Processor: Microsoft Word Document
  • Image (Raster): Tagged Image File Format
  • Microsoft Outlook Email Message
  • AppleDouble Resource Fork
  • Image (Raster): Graphics Interchange Format
  • Image (Raster): Exchangeable Image File Format (Compressed)

Identification method breakdown:

  • 83.31% were identified by signature
  • 14.95% by container
  • 1.73% by extension
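The three shares above add up to 99.99%, which suggests they are shares of the identified files rather than of all files profiled. A breakdown of that kind could be tallied along these lines. This is a sketch only: the METHOD and PUID column names are assumed to match a DROID CSV export, and the sample rows are invented.

```python
from collections import Counter


def method_breakdown(rows):
    """Return {method: percent of identified files}, rounded to 2 places."""
    counts = Counter(
        (row.get("METHOD") or "").strip()
        for row in rows
        # Only count rows where a format was actually identified.
        if (row.get("PUID") or "").strip()
    )
    identified = sum(counts.values())
    return {m: round(100.0 * n / identified, 2) for m, n in counts.items()}


# Invented rows for illustration (not NRO data).
SAMPLE_ROWS = [
    {"NAME": "a.jpg", "METHOD": "Signature", "PUID": "fmt/43"},
    {"NAME": "b.docx", "METHOD": "Container", "PUID": "fmt/412"},
    {"NAME": "c.bmp", "METHOD": "Signature", "PUID": "fmt/116"},
    {"NAME": "d.txt", "METHOD": "Extension", "PUID": "x-fmt/111"},
    {"NAME": "e.cmp", "METHOD": "", "PUID": ""},
]

print(method_breakdown(SAMPLE_ROWS))
```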

458 files had mismatched extensions, which amounts to less than one per cent (0.97%). These were a variety of common raster image formats (JPEG, PNG, TIFF), word processor formats (Microsoft Word Document, ClarisWorks Word Processor) and desktop publishing formats (Adobe Illustrator, Adobe InDesign Document, Quark Xpress Data File).
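The post does not say what the 0.97% is measured against. A quick check using only the figures quoted in this post suggests it is a share of the identified files rather than of all 49,117 files, although that is an inference and the identified count below is an estimate derived from the 96.46% figure.

```python
# Figures quoted in the post.
total_files = 49_117
identified_share = 0.9646
mismatched = 458

# Estimated number of identified files (not stated directly in the post).
identified_files = round(total_files * identified_share)

share_of_all = 100.0 * mismatched / total_files
share_of_identified = 100.0 * mismatched / identified_files

print(f"Share of all files:        {share_of_all:.2f}%")         # 0.93%
print(f"Share of identified files: {share_of_identified:.2f}%")  # 0.97%
```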

Among the 3.54% of files that were not identified there were 160 different unknown file extensions. The top five were:

  • .cmp
  • .mov
  • .info
  • .eml
  • .mdb
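A ranking like the one above could be produced by tallying the extensions of files that have no PUID. The sketch below assumes NAME, TYPE and PUID columns as in a DROID CSV export and takes the extension from the file name. The sample rows are made up.

```python
from collections import Counter
from pathlib import PurePath


def top_unidentified_extensions(rows, n=5):
    """Return the n most common extensions among unidentified files."""
    counts = Counter()
    for row in rows:
        # Skip folders and anything that already has a PUID.
        if row.get("TYPE") == "Folder" or (row.get("PUID") or "").strip():
            continue
        # Lower-case so that .CMP and .cmp are counted together.
        ext = PurePath(row.get("NAME") or "").suffix.lower()
        counts[ext or "(none)"] += 1
    return counts.most_common(n)


# Made-up rows for illustration (not NRO data).
SAMPLE_ROWS = [
    {"TYPE": "Folder", "NAME": "survey", "PUID": ""},
    {"TYPE": "File", "NAME": "plan.CMP", "PUID": ""},
    {"TYPE": "File", "NAME": "site.cmp", "PUID": ""},
    {"TYPE": "File", "NAME": "clip.mov", "PUID": ""},
    {"TYPE": "File", "NAME": "photo.jpg", "PUID": "fmt/43"},
]

print(top_unidentified_extensions(SAMPLE_ROWS))
```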

Two files returned more than one identification:

A spreadsheet file with an .xls extension (last modified 2006-12-17) had three possible file format matches:

  • fmt/175 Microsoft Excel for Macintosh 2001
  • fmt/176 Microsoft Excel for Macintosh 2002
  • fmt/177 Microsoft Excel for Macintosh 2004

An image file with a .bmp extension (last modified 2007-02-06) received two file format matches:

  • fmt/116 Windows Bitmap 3
  • fmt/625 Apple Disk Copy Image 4.2

On closer inspection, the file was indeed a bitmap image, so PUID fmt/116 was the correct one.
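Files like these two need a manual check, so it helps to pull them out of the results. A minimal sketch, assuming the export has a FORMAT_COUNT column giving the number of matches per file. The file names here are hypothetical, as the post does not give the real ones.

```python
def multiple_matches(rows):
    """Return (name, match_count) for files with more than one match."""
    flagged = []
    for row in rows:
        try:
            count = int(row.get("FORMAT_COUNT") or 0)
        except ValueError:
            # Ignore rows where the count is not a number.
            continue
        if count > 1:
            flagged.append((row.get("NAME"), count))
    return flagged


# Hypothetical rows mirroring the two cases described in the post.
SAMPLE_ROWS = [
    {"NAME": "accounts.xls", "FORMAT_COUNT": "3"},
    {"NAME": "plan.bmp", "FORMAT_COUNT": "2"},
    {"NAME": "photo.jpg", "FORMAT_COUNT": "1"},
    {"NAME": "notes.cmp", "FORMAT_COUNT": ""},
]

for name, count in multiple_matches(SAMPLE_ROWS):
    print(f"{name}: {count} possible formats, needs manual review")
```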

Understanding the Results

DROID offers a very useful classification of file formats and puts all results into categories, which gives an overview of the digital collection. It uses a classification system that assigns file formats to broader groups such as Audio, Word Processor, Page Description and Aggregate. This makes it easy to see what sort of digital content predominates within a digitally born accession, archive or collection, and helps enormously in grasping the variety of digital records. For example, it was interesting to discover that over half of our digitally born archives are in various raster image file formats.

Files profiled at Norfolk Record Office as classified by DROID
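To illustrate the idea of grouping formats into broader classes, here is a small sketch that rolls format names up into shares per class. The lookup table is hand-written from a few of the classifications listed earlier in this post and is an assumption for illustration only. It is not how DROID stores or reports its classifications.

```python
from collections import Counter

# Hand-written lookup for illustration, based on classes named in this post.
CLASSIFICATION = {
    "JPEG File Interchange Format": "Image (Raster)",
    "Windows Bitmap": "Image (Raster)",
    "Hypertext Markup Language": "Text (Mark-up)",
    "Microsoft Word Document": "Word Processor",
}


def classification_shares(format_names):
    """Return {classification: percent of files}, rounded to 1 place."""
    counts = Counter(
        CLASSIFICATION.get(name, "Unclassified") for name in format_names
    )
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Made-up list of identified format names, one per file.
SAMPLE_FORMATS = [
    "JPEG File Interchange Format",
    "Windows Bitmap",
    "Microsoft Word Document",
    "Hypertext Markup Language",
]

print(classification_shares(SAMPLE_FORMATS))
```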

I am of course also interested in the levels of risk associated with particular formats so have started to work on an additional classification for the data, creating further categories that can help with preservation planning. This would help demonstrate where preservation efforts should be focused in the future.

Jenny Mitcham, Digital Archivist

