
Monday, 21 November 2016

Every little bit helps: File format identification at Lancaster University

This is a guest post from Rachel MacGregor, Digital Archivist at Lancaster University. Her work on identifying research data follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported in a previous blog post and our final project report.

Here at Lancaster University I have been very inspired by the work at York on file format identification and we thought it was high time I did my own analysis of the one hundred or so datasets held here.  The aim is to aid understanding of the nature of research data as well as to inform our approaches to preservation.  Our results are comparable to York's in that the data is characterised as research data (as yet we don't have any born digital archives or digitised image files).  I used DROID (version 6.2.1) as the tool for file identification - there are others and it would be interesting to compare results at some stage with results from using other software such as FILE (FITS), Apache Tika etc.

The exercise was carried out using the following signature files: DROID_SignatureFile_V88 and container-signature-file-20160927.  The maximum number of bytes DROID was set to scan at the start and end of each file was 65536 (which is the default setting when you install DROID).
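
To illustrate what "identification by signature" means in practice, here is a minimal Python sketch that reads up to 65536 bytes from the start of a file and looks for a known byte sequence, such as the two-byte GZIP magic number (1f 8b). This is not how DROID itself is implemented: the function name and the tiny signature table are hypothetical, and a real PRONOM signature file holds far more patterns, including sequences anchored at the end of a file.

```python
import gzip
import os
import tempfile

MAX_BYTES = 65536  # same scan limit as the DROID setting mentioned above

# Hypothetical, tiny signature table: label -> bytes expected at offset 0.
SIGNATURES = {
    "GZIP": b"\x1f\x8b",
    "PDF": b"%PDF-",
}


def identify_by_signature(path):
    """Return the first label whose magic bytes start the file, else None."""
    with open(path, "rb") as handle:
        head = handle.read(MAX_BYTES)
    for label, magic in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None


def demo():
    """Write one gzip file and one plain text file, then identify both."""
    with tempfile.TemporaryDirectory() as folder:
        # The .dat extension is deliberately misleading: only content is checked.
        gz_path = os.path.join(folder, "sample.dat")
        txt_path = os.path.join(folder, "notes.txt")
        with gzip.open(gz_path, "wb") as handle:
            handle.write(b"some research data")
        with open(txt_path, "w") as handle:
            handle.write("plain text has no magic number")
        return identify_by_signature(gz_path), identify_by_signature(txt_path)
```

Calling `demo()` identifies the gzip file despite its extension and returns no match for the plain text file, which is why files without a recognisable signature fall back to extension-based identification or go unidentified.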

Summary of the statistics:

There were a total of 24,705 files (so a substantially larger sample than in the comparable study at York).

Of these: 
  • 11008 (44.6%) were identified by DROID and 13697 (55.4%) were not.
  • 99.3% were given one file identification and 76 files had multiple identifications.  
    • 59 files had two possible identifications
    • 13 had 3 identifications
    • 4 had 4 possible identifications.  
  • 50 of these files were asc files identified (by extension) as either 8-bit or 7-bit ASCII text files.  The remaining 26 were identified by container as various types of Microsoft files. 

Files that were identified

Of the 11008 identified files:
  • 89.34% were identified by signature: this is the overwhelming majority, far more than in Jen's survey
  • 9.2% were identified by extension, a much smaller proportion than at York
  • 1.46% identified by container

However, one large dataset contained over 7,000 gzip files, all identified by signature, and this skewed the results considerably.  With those files removed, the percentages identified by different methods were as follows:

  • 68% (2505) by signature
  • 27.5% (1013) by extension
  • 4.5% (161) by container
This was still different from York's results but not so dramatically.
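
For anyone wanting to repeat this kind of comparison, the recalculation can be sketched in a few lines of Python. The sketch assumes a DROID CSV export with `METHOD` and `FORMAT_NAME` columns and method values such as "Signature", "Extension" and "Container"; check these against your own export before relying on them. The function name and the sample rows are made up for illustration and are not the Lancaster data.

```python
import csv
import io
from collections import Counter


def method_percentages(rows, exclude_format=None):
    """Percentage of identified rows per identification method."""
    counts = Counter(
        row["METHOD"]
        for row in rows
        # Rows with an empty METHOD were not identified, so skip them.
        if row["METHOD"] and row["FORMAT_NAME"] != exclude_format
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {method: round(100 * n / total, 1) for method, n in counts.items()}


# Made-up sample rows (assumed column names), not real results.
SAMPLE = """NAME,METHOD,FORMAT_NAME
a.gz,Signature,GZIP Format
b.gz,Signature,GZIP Format
c.gz,Signature,GZIP Format
d.xml,Signature,Extensible Markup Language
e.asc,Extension,7-bit ASCII Text
f.docx,Container,Microsoft Word for Windows
g.dat,,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
overall = method_percentages(rows)
without_gzip = method_percentages(rows, exclude_format="GZIP Format")
```

In this toy sample the signature share drops from two thirds to one third once the gzip rows are excluded, the same kind of shift described above when a single format dominates a collection.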

Only 38 were identified as having a file extension mismatch (0.3%) but closer inspection may reveal more.  Most of these were Microsoft files with multiple identifications (see above), but there was also a set of lsm files identified as TIFFs.  This is not a format I'm familiar with. It seems that lsm is a form of TIFF file, but how do I know whether this is a "correct" identification or not?

59 different file formats were identified, the most frequently occurring being the GZIP format (as mentioned above) with 7331 instances.  The next most popular was, unsurprisingly, xml (similar to results at York) with 1456 files spread across the datasets.  The top 11 were:

Top formats identified by DROID for Lancaster University's research data


Files that weren't identified

There were 13697 files not identified by DROID of which 4947 (36%) had file extensions.  This means there was a substantial proportion of files with no file extension (64%). This is much higher than the result at York which was 26%. As at York there were 107 different extensions in the unidentified files of which the top ten were:

Top counts of unidentified file extensions


Top extensions of unidentified files


This top ten is quite different to York's results, though in both institutions dat files topped the list by some margin! We also found 20 inp and 32 out files, extensions which also occur in York's analysis. 
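
A tally like this one is easy to reproduce from a plain list of file names. The following is a minimal Python sketch under stated assumptions: the function name and the example file names are hypothetical, and extensions are lower-cased so that "DAT" and "dat" are counted together, which may or may not match how your own analysis treats case.

```python
import os
from collections import Counter


def extension_counts(filenames):
    """Count lower-cased extensions; files without one are grouped under ''."""
    return Counter(
        os.path.splitext(name)[1].lower().lstrip(".") for name in filenames
    )


# Hypothetical names of files that received no identification.
unidentified = ["run1.dat", "run2.DAT", "model.inp", "model.out", "README", "trace"]

counts = extension_counts(unidentified)
# Share of unidentified files that have no extension at all.
no_extension_share = round(100 * counts[""] / len(unidentified), 1)
```

`counts.most_common(10)` then gives a top ten, and `no_extension_share` corresponds to the "no file extension" proportion discussed above.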

Like Jen at York I will be looking for a format to analyse further to create a signature - this will be a big step for me but will help my understanding of the work I am trying to do as well as contribute towards our overall understanding of file format types.

Every little bit helps.

3 comments:

  1. Any chance that you could post the raw DROID data? I'm experimenting with some data analysis and would appreciate having more examples to work with. I can also help to anonymize the data if you have concerns about sharing it.

  2. Did you experiment with identifying the gzip files further? It would be interesting to see how unzipping them affects the ratios.

    Replies
    1. Thanks for this. Yes – I did set it to unzip the gzip files, so the files they contain are included in the stats. One dataset alone (the one referred to in the blog) contained 14,936 files of which 7343 were zip files. Of the remaining 7593 files only one (!) had an id (x-fmt/408). Of the rest, 5,590 had extensions of some description, leaving 2002 files with no extension or id.
      What I haven’t done is double check that all compressed formats across the other datasets were interrogated (i.e. where DROID does not do this), although I think this would account for a low proportion of the overall files.
