Here at Lancaster University I have been very inspired by the work at York on file format identification, and thought it was high time I did my own analysis of the hundred or so datasets held here. The aim is to aid understanding of the nature of research data as well as to inform our approaches to preservation. Our results should be comparable to York's because the data is all research data (as yet we don't have any born-digital archives or digitised image files). I used DROID (version 6.2.1) as the tool for file identification. There are others, and it would be interesting at some stage to compare these results with those from other software such as FILE (FITS), Apache Tika etc.
The exercise was carried out using the following signature files: DROID_SignatureFile_V88 and container-signature-file-20160927. The maximum number of bytes DROID was set to scan at the start and end of each file was 65536 (which is the default setting when you install DROID).
Summary of the statistics:
There were a total of 24,705 files (so a substantially larger sample than in the comparable study at York).
- 11008 (44.5%) were identified by DROID and 13697 (55.5%) were not.
- Of the identified files, 99.3% were given a single identification and 76 files had multiple identifications:
- 59 files had two possible identifications
- 13 had three possible identifications
- 4 had four possible identifications
- 50 of these 76 files were asc files identified (by extension) as either 8-bit or 7-bit ASCII text files. The remaining 26 were identified by container as various types of Microsoft files.
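For anyone wanting to reproduce this kind of tally, here is a minimal sketch that counts files by the number of identifications DROID gave them. It assumes the profile has been exported from DROID as a CSV with one row per file, and that the export has `TYPE` and `FORMAT_COUNT` columns; the `load_rows` helper and any file name passed to it are purely illustrative, so check the column names against your own export.

```python
import csv
from collections import Counter


def load_rows(csv_path):
    """Read a DROID CSV export into a list of dicts (one per row)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarise_identifications(rows):
    """Count files by how many identifications DROID assigned.

    Assumes each row has TYPE and FORMAT_COUNT columns. Folder rows
    are skipped; an empty FORMAT_COUNT is treated as zero
    (that is, the file was not identified).
    """
    counts = Counter()
    for row in rows:
        if row.get("TYPE") == "Folder":
            continue
        counts[int(row.get("FORMAT_COUNT") or 0)] += 1
    return counts
```

In the result, key `0` holds the unidentified files, key `1` the files with a single identification, and any higher keys the files with multiple identifications.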
Files that were identified
Of the 11008 identified files:
- 89.34% were identified by signature: this is the overwhelming majority, far more than in Jen's survey at York
- 9.2% were identified by extension, a much smaller proportion than at York
- 1.46% were identified by container
However, there was one large dataset containing over 7,000 gzip files, all identified by signature, which did skew the results rather. With those files removed, the percentages identified by different methods were as follows:
- 68% (2505) by signature
- 27.5% (1013) by extension
- 4.5% (161) by container
This was still different from York's results but not so dramatically.
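The effect of removing one dominant format can be checked with a few lines of code. The sketch below works out the percentage of identified files per identification method, optionally leaving out one PUID. It assumes rows read from a DROID CSV export (for example with `csv.DictReader`) that have `METHOD` and `PUID` columns; `x-fmt/266` is, as far as I know, the PRONOM identifier for GZIP, but do check it against the PUID in your own results.

```python
from collections import Counter


def method_shares(rows, exclude_puid=None):
    """Percentage of identified files per identification method.

    Rows with an empty PUID are treated as unidentified and ignored.
    If exclude_puid is given, files with that PUID are left out
    before the percentages are calculated.
    """
    methods = Counter()
    for row in rows:
        puid = row.get("PUID")
        if not puid or puid == exclude_puid:
            continue
        methods[row.get("METHOD")] += 1
    total = sum(methods.values())
    if total == 0:
        return {}
    return {method: round(100 * n / total, 1) for method, n in methods.items()}
```

Calling `method_shares(rows)` and then `method_shares(rows, exclude_puid="x-fmt/266")` gives the before and after figures side by side.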
Only 38 files (0.3%) were identified as having a file extension mismatch, but closer inspection may reveal more. Most of these were Microsoft files with multiple ids (see above), but there was also a set of lsm files identified as TIFFs. This is not a format I'm familiar with. It seems that lsm is a form of TIFF file, but how do I know whether this is a "correct" id or not?
59 different file formats were identified, the most frequently occurring being the GZIP format (as mentioned above) with 7331 instances. The next most popular was, unsurprisingly, XML (similar to results at York) with 1456 files spread across the datasets. The top 11 were:
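One rough way to sanity-check an identification like this is to look at the first few bytes of the file. A TIFF file starts with either `II*\0` (little-endian) or `MM\0*` (big-endian), so the sketch below simply tests for those four bytes. The function names are my own, and a match only shows that the file has a TIFF header; it does not prove the file is a valid TIFF, nor say anything about what lsm-specific data it carries.

```python
# The two standard TIFF byte-order markers followed by the magic number 42.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def is_tiff_header(first_bytes):
    """Return True if the given bytes begin with a TIFF header."""
    return first_bytes[:4] in TIFF_MAGIC


def file_has_tiff_header(path):
    """Read the first four bytes of a file and test them for a TIFF header."""
    with open(path, "rb") as handle:
        return is_tiff_header(handle.read(4))
```

If the lsm files pass this check, the TIFF identification is at least consistent with their content, even though the extension does not match.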
[Table: Top formats identified by DROID for Lancaster University's research data]
Files that weren't identified
There were 13697 files not identified by DROID, of which 4947 (36%) had file extensions. This means a substantial proportion of the unidentified files (64%) had no file extension at all, which is much higher than the result at York (26%). As at York there were 107 different extensions in the unidentified files, of which the top ten were:
[Table: Top counts of unidentified file extensions]
[Table: Top extensions of unidentified files]
This top ten is quite different from York's, though in both institutions dat files topped the list by some margin! We also found 20 inp and 32 out files, both of which also occur in York's analysis.
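A ranking like this can be produced from the same kind of export. The sketch below counts the extensions of unidentified files, again assuming rows read from a DROID CSV export (for example with `csv.DictReader`) with `TYPE`, `PUID` and `EXT` columns, and treating an empty `PUID` as "not identified". Extensions are lower-cased so that DAT and dat are counted together, which is my own choice rather than anything DROID does.

```python
from collections import Counter


def unidentified_extensions(rows):
    """Count file extensions among files DROID could not identify.

    Folder rows and rows with a PUID are skipped. Files with no
    extension are counted under the empty string.
    """
    extensions = Counter()
    for row in rows:
        if row.get("TYPE") == "Folder" or row.get("PUID"):
            continue
        extensions[(row.get("EXT") or "").lower()] += 1
    return extensions
```

`unidentified_extensions(rows).most_common(10)` then gives a top ten, with the empty string standing for files that have no extension.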
Like Jen at York I will be looking for a format to analyse further to create a signature - this will be a big step for me but will help my understanding of the work I am trying to do as well as contribute towards our overall understanding of file format types.
Every little bit helps.