Research data is different

This is a guest post from Simon Wilson, who has been profiling the born-digital data at the Hull History Centre to provide another point of comparison with the research data at York reported on in this blog back in May.

Inspired by Jen's blog post Research data - what does it *really* look like? about the profile of the research data at York, and the responses it generated (including that from the Bentley Historical Library), I decided to take a look at some of the born-digital archives we hold at Hull. This is not research data from academics; it is data that has been donated to or deposited with the Hull History Centre, and it comes from a variety of different sources.

Whilst I had previously created a DROID report for each distinct accession, I had never really looked into the detail, so for each accession I did the following:

  1. Run the DROID software and export the results to CSV format, with one row per file
  2. Open the file in MS Excel and copy the data to a second tab for the subsequent actions
  3. Sort the data A-Z on the Type field, then delete all of the records relating to folders
  4. Sort the data A-Z on the PUID field
  5. For large datasets, highlight the data, then use the Subtotal tool to count each time the PUID field changes and record the subtotal
  6. Once the Subtotal tool has completed its calculations, select the entire dataset and choose Hide Detail (adjacent to Subtotal in the Outline tools box) to leave just one row per distinct PUID with its total count
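The same per-accession tally could also be scripted rather than done by hand in Excel. The sketch below is a minimal illustration of the steps above (not part of the original workflow); it assumes DROID's default CSV export headers, where the TYPE column distinguishes files from folders and the PUID column holds the format identifier:

```python
import csv
from collections import Counter

def puid_counts(droid_csv_path):
    """Count files per PUID in a DROID CSV export, skipping folder
    rows (steps 3-5 above), and tally unidentified files."""
    counts = Counter()
    unidentified = 0
    with open(droid_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("TYPE") == "Folder":
                continue  # step 3: ignore folder records
            puid = (row.get("PUID") or "").strip()
            if puid:
                counts[puid] += 1  # one tally per identified format
            else:
                unidentified += 1  # no PUID assigned by DROID
    return counts, unidentified
```

The returned Counter gives one entry per distinct PUID with its total, equivalent to the subtotal rows left after Hide Detail.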

I then created a simple spreadsheet with a column for each distinct accession and a row for each unique PUID, copying in the MIME type, software and version details from the DROID report results. I also noted the number of files that were not identified. There may be quicker ways to get the same results, and I would love to hear other suggestions or shortcuts.
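One possible shortcut for that cross-accession spreadsheet would be to merge the per-accession counts programmatically. This is my own sketch, not the method actually used; it takes a dictionary mapping accession names to per-PUID counts and produces one row per PUID with a column per accession:

```python
def build_profile(accession_counts):
    """Combine per-accession PUID counts into one table:
    first row is the header, then one row per PUID with a
    count column for each accession (0 where absent)."""
    accessions = sorted(accession_counts)
    puids = sorted({p for counts in accession_counts.values() for p in counts})
    header = ["PUID"] + accessions
    rows = [[puid] + [accession_counts[a].get(puid, 0) for a in accessions]
            for puid in puids]
    return [header] + rows
```

The resulting list of rows can be written straight out with Python's csv module and opened in Excel.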

After completing this for 24 accessions (totalling 270,867 files), what have I discovered?

  • An impressive 97.96% of files were identified by DROID (compared with only 37% in Jen's smaller sample of research data)
  • So far 228 different PUIDs have been identified (compared with 34 formats in Jen’s sample)
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%). The top ten identified formats, in descending order by total number of files, were:

  1. Microsoft Word Document (97-2003)
  2. Microsoft Word for Windows (2007 onwards)
  3. Microsoft Excel 97 Workbook
  4. Graphics Interchange Format
  5. Acrobat PDF 1.4 - Portable Document Format
  6. JPEG File Interchange Format (1.01)
  7. Microsoft Word Document (6.0 / 95)
  8. Acrobat PDF 1.3 - Portable Document Format
  9. JPEG File Interchange Format (1.02)
  10. Hypertext Markup Language (v4)

I can now quickly look up whether an individual archive has a particular file type, and see how frequently it occurs. Once I have processed a few more accessions it may be possible to create a "profile" for an individual literary collection or a small business and use this to inform discussions with depositors. I can also start to look at the identified file formats and determine whether there is a strategy in place to migrate each format. Where this isn't the case, knowing the number and frequency of the format amongst the collections will allow me to prioritise my efforts. I will also look to aggregate the data, for example by merging all of the different versions of Adobe Acrobat or MS Word.
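That aggregation could be done with a simple PUID-to-family lookup. In this sketch the mapping is illustrative only (a real one would be drawn from PRONOM and cover many more versions):

```python
from collections import Counter

# Illustrative PUID-to-family groupings (assumption, not a
# complete PRONOM mapping).
FAMILIES = {
    "fmt/17": "Acrobat PDF",  # PDF 1.3
    "fmt/18": "Acrobat PDF",  # PDF 1.4
    "fmt/39": "MS Word",      # Word 6.0 / 95
    "fmt/40": "MS Word",      # Word 97-2003
}

def aggregate_by_family(puid_counts):
    """Merge counts for individual format versions into family
    totals; PUIDs without a mapping keep their own identifier."""
    totals = Counter()
    for puid, n in puid_counts.items():
        totals[FAMILIES.get(puid, puid)] += n
    return totals
```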

I haven’t forgotten the 5520 unidentified files. By noting the PRONOM signature file number used to profile each archive, it is easy to repeat the process with a later signature file.  This could validate the previous results or enable previously unidentified files to be identified (particularly if I use the results of this exercise to feed information back to the PRONOM team). Knowing which accessions have the largest number of unidentified files will allow me to focus my effort as appropriate.
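Comparing a re-run against the earlier results could also be scripted. This is my own illustration rather than part of the workflow above: given the per-PUID counts from two DROID runs, it reports every PUID whose count changed, which would surface files newly identified under a later signature file:

```python
def compare_runs(old, new):
    """Return {puid: (old_count, new_count)} for every PUID whose
    count differs between two DROID profiling runs."""
    return {p: (old.get(p, 0), new.get(p, 0))
            for p in set(old) | set(new)
            if old.get(p, 0) != new.get(p, 0)}
```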

Whilst this has certainly been a useful exercise in its own right, it is also interesting to note the similarities between this and the born-digital archives profile published by the Bentley Historical Library, and the contrast with the research data profile Jen reported on.

The top ten identified formats from Hull and Bentley are quite similar. Both have a good success rate for identifying file formats: 90% identified at Bentley and 98% at Hull. Though the formats do not appear in the same order in the top ten, they contain similar types of file (MS Word, PDF, JPEG, GIF and HTML).

In contrast, only 37% of files were identified in York's research data sample, and the top ten file formats that were identified look very different. The only area of overlap is MS Excel files, which appear high up in the York research dataset as well as in the top ten for the Hull History Centre.

Research data is different.

Jenny Mitcham, Digital Archivist

