Digital Archiving at the University of York: Every little bit helps: File format identification at Lancaster University

Monday, 21 November 2016

Every little bit helps: File format identification at Lancaster University

This is a guest post from Rachel MacGregor, Digital Archivist at Lancaster University. Her work on identifying research data follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported in a previous blog post and our final project report.

Here at Lancaster University I have been very inspired by the work at York on file format identification and we thought it was high time I did my own analysis of the one hundred or so datasets held here. The aim is to aid understanding of the nature of research data as well as to inform our approaches to preservation. Our results are comparable to York's in that the data is characterised as research data (as yet we don't have any born digital archives or digitised image files). I used DROID (version 6.2.1) as the tool for file identification - there are others and it would be interesting to compare results at some stage with results from using other software such as FILE (FITS), Apache Tika etc.

The exercise was carried out using the following signature files: DROID_SignatureFile_V88 and container-signature-file-20160927. The maximum number of bytes DROID was set to scan at the start and end of each file was 65536 (which is the default setting when you install DROID).
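To make the byte limit concrete, here is a minimal Python sketch of what "scanning at most 65536 bytes at the start and end of each file" amounts to. It is only an illustration of the idea, not DROID's own code, and the function name scan_windows is made up for this example.

```python
import os


def scan_windows(path, max_bytes=65536):
    """Return the (head, tail) byte windows a scanner limited to
    max_bytes at each end of the file would get to look at.

    Illustrative only: this is not how DROID is implemented.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(max_bytes)
        if size > max_bytes:
            # Jump to the last max_bytes of the file and read them.
            handle.seek(size - max_bytes)
            tail = handle.read(max_bytes)
        else:
            # Small files fit entirely inside one window.
            tail = head
    return head, tail
```

Anything that sits in the middle of a large file, outside both windows, would never be seen by a signature that relies on this limit.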

Summary of the statistics:

There were a total of 24,705 files (so a substantially larger sample than in the comparable study at York).

Of these:

11008 (44.6%) were identified by DROID and 13697 (55.4%) were not.
Of the identified files, 99.3% were given one file identification and 76 files had multiple identifications.

59 files had two possible identifications
13 had 3 identifications
4 had 4 possible identifications.

50 of these files were asc files identified (by extension) as either 8-bit or 7-bit ASCII text files. The remaining 26 were identified by container as various types of Microsoft files.
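For anyone wanting to produce the same kind of summary from their own DROID results, the counts above can be sketched in a few lines of Python. This is a rough sketch rather than the method actually used here: it assumes a CSV export with one row per file and columns named TYPE and FORMAT_COUNT, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

# Made-up rows in the assumed shape of a DROID CSV export.
SAMPLE_EXPORT = """ID,NAME,TYPE,EXT,METHOD,FORMAT_COUNT
1,dataset01,Folder,,,
2,readme.xml,File,xml,Signature,1
3,results.asc,File,asc,Extension,2
4,run01.dat,File,dat,,0
5,notes,File,,,0
"""


def summarise(export_text):
    """Count identified, unidentified and multiply-identified files."""
    reader = csv.DictReader(io.StringIO(export_text))
    # Folders appear in the export too, so leave them out of the totals.
    files = [row for row in reader if row["TYPE"] != "Folder"]
    format_counts = [int(row["FORMAT_COUNT"] or 0) for row in files]
    identified = sum(1 for n in format_counts if n > 0)
    total = len(files)
    return {
        "total": total,
        "identified": identified,
        "unidentified": total - identified,
        "identified_pct": round(100 * identified / total, 1) if total else 0.0,
        # Maps number of identifications -> number of files with that many.
        "multiple_ids": dict(Counter(n for n in format_counts if n > 1)),
    }


summary = summarise(SAMPLE_EXPORT)
print(summary)
```

The column names should be checked against a real export before relying on this.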

Files that were identified

Of the 11008 identified files:

89.34% were identified by signature: this is the overwhelming majority, far more than in Jen's survey
9.2% were identified by extension, a much smaller proportion than at York
1.46% identified by container

However, there was one large dataset containing over 7,000 gzip files, all identified by signature, which skewed the results considerably. With those files removed, the percentages identified by different methods were as follows:

68% (2505) by signature
27.5% (1013) by extension
4.5% (161) by container

This was still different from York's results but not so dramatically.
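The adjusted percentages are simple shares of the remaining identified files, and can be recomputed from the counts quoted above. A small Python sketch (the function name method_shares is just for illustration):

```python
def method_shares(counts):
    """Return each identification method's share of the total, in percent."""
    total = sum(counts.values())
    if total == 0:
        return {method: 0.0 for method in counts}
    return {method: 100 * n / total for method, n in counts.items()}


# Counts quoted in the post once the large gzip dataset is set aside.
without_gzip = {"Signature": 2505, "Extension": 1013, "Container": 161}

for method, share in method_shares(without_gzip).items():
    print(f"{method}: {share:.1f}%")
```

The results come out close to the rounded figures quoted in the post; small differences are down to rounding.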

Only 38 files (0.3%) were flagged as having a file extension mismatch, though closer inspection may reveal more. Most of these were the Microsoft files with multiple identifications (see above), but there was also a set of lsm files identified as TIFFs. This is not a format I'm familiar with. It seems that lsm is a form of TIFF file, but how do I know whether this is a "correct" identification or not?
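One rough way to sanity check an identification like this is to look at the first few bytes of the file yourself. A TIFF file starts with a byte-order mark ("II" or "MM") followed by the number 42, so a short Python sketch can test for that. This is not DROID's signature logic, and a match only shows that the file has a TIFF-style header, not that TIFF is the most specific description of it.

```python
import struct


def tiff_byte_order(header):
    """Return 'little-endian' or 'big-endian' if the bytes start like a
    TIFF file, or None if they do not."""
    if len(header) < 4:
        return None
    if header[:2] == b"II" and struct.unpack("<H", header[2:4])[0] == 42:
        return "little-endian"
    if header[:2] == b"MM" and struct.unpack(">H", header[2:4])[0] == 42:
        return "big-endian"
    return None


def looks_like_tiff(path):
    """Check the first four bytes of the file at path for a TIFF header."""
    with open(path, "rb") as handle:
        return tiff_byte_order(handle.read(4)) is not None
```

If the lsm files pass this check, the TIFF identification is at least consistent with how the files are built, even if a more specific lsm signature would be preferable.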

59 different file formats were identified, the most frequently occurring being the GZIP format (as mentioned above) with 7331 instances. The next most popular was, unsurprisingly, xml (similar to results at York) with 1456 files spread across the datasets. The top 11 were:

Top formats identified by DROID for Lancaster University's research data

Files that weren't identified

There were 13697 files not identified by DROID, of which 4947 (36%) had file extensions. This means a substantial proportion of the unidentified files (64%) had no file extension, which is much higher than the result at York (26%). As at York, there were 107 different extensions in the unidentified files, of which the top ten were:

Top extensions of unidentified files (by count)

This top ten is quite different to York's results, though in both institutions dat files topped the list by some margin! We also found 20 inp and 32 out files, which also occur in York's analysis.
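A tally like this is straightforward to produce once you have a list of the unidentified file names. The Python sketch below is one way to do it; the file names are made up, and the "(none)" label for files without an extension is just a convention chosen for this example.

```python
import os
from collections import Counter


def extension_counts(filenames):
    """Count files per extension, ignoring case; files with no
    extension are grouped under '(none)'."""
    counts = Counter()
    for name in filenames:
        ext = os.path.splitext(name)[1].lower().lstrip(".")
        counts[ext or "(none)"] += 1
    return counts


# Made-up names standing in for the unidentified files.
unidentified = ["run01.dat", "run02.DAT", "model.inp", "model.out", "README", "notes"]

counts = extension_counts(unidentified)
no_extension_share = 100 * counts["(none)"] / len(unidentified)

for ext, n in counts.most_common(10):
    print(ext, n)
print(f"No extension: {no_extension_share:.0f}%")
```

Lower-casing the extension means that dat and DAT are counted together, which may or may not be what you want for your own data.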

Like Jen at York I will be looking for a format to analyse further to create a signature - this will be a big step for me but will help my understanding of the work I am trying to do as well as contribute towards our overall understanding of file format types.

Every little bit helps.

Jenny Mitcham, Digital Archivist

4 comments:

Nick Krabbenhoeft, 21 November 2016 at 16:26
Any chance that you could post the raw DROID data? I'm experimenting with some data analysis and would appreciate having more examples to work with. I can also help to anonymize the data if you have concerns about sharing it.

Arthur, 22 November 2016 at 11:03
Did you experiment with identifying the gzip files further? It would be interesting to see how unzipping them affects the ratios.

