Tuesday, 31 May 2016

Research data - what does it *really* look like?

Work continues on our Filling the Digital Preservation Gap project and I thought it was about time we updated you on some of the things we have been doing.

While my colleague Julie has been focusing on the more technical issues of implementing Archivematica for research data, I have been looking at some real research data and exploring in more detail some of the issues we discussed in our phase 1 report.

For the past year, we have been accepting research data for longer term curation. Though the systems for preservation and access to this data are still in development, we are for the time being able to allocate a DOI for each dataset, manage access and store it safely (ensuring it isn't altered) and intend to ingest it into our data curation systems once they are ready.

Having this data in one place on our filestore does give me the opportunity to test the hypothesis in our first report about the wide range of file formats that will be present in a research dataset and also the assertion that many of these will not be identified by the tools and registries in use for the creation of technical metadata.

So, I have done a fairly quick piece of analysis on the research data, running a tool called Droid developed by The National Archives over the data to get an indication of whether the files can be recognised and identified in an automated fashion.

All the data in our research data sample has been deposited with us since May 2015. The majority of the data is scientific in nature - much of it coming from the departments of Chemistry and Physics (this may be a direct result of expectations from the EPSRC around data management). The data is mostly fairly recent, as suggested by the last modified dates on these files, which range from 2006 to 2016 with the vast majority having been modified in the last five years. The distribution of dates is illustrated below.

Here are some of the findings of this exercise:

Summary statistics

  • Droid reported that 3752 individual files were present*

  • 1382 (37%) of the files were given a file format identification by Droid

  • 1368 (99%) of those files that were identified were given just one possible identification. 12 files were given two possible identifications and a further two were given 18 possible identifications. In all these cases, the identification was done by file extension rather than signature - which perhaps explains the uncertainty
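For anyone wanting to reproduce headline figures like these, here is a minimal sketch in Python. It assumes the Droid results have been exported to CSV with one row per file and with columns named TYPE, PUID and FORMAT_COUNT (check the header row of your own export, as this is an assumption). The sample rows are made up for illustration and are not taken from our data.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for a Droid CSV export.
# The column names are an assumption; check them against your own export.
SAMPLE = """NAME,TYPE,PUID,FORMAT_COUNT
data,Folder,,
results.xml,File,fmt/101,1
notes.txt,File,x-fmt/111,1
run1.dat,File,,0
plot.svg,File,fmt/91,2
spectrum,File,,0
"""


def summarise(csv_text):
    """Count files, identified files and how many identifications each received."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["TYPE"] != "Folder"]
    # Treat a file as identified if it was given at least one PUID.
    identified = [r for r in rows if r["PUID"]]
    per_file = Counter(int(r["FORMAT_COUNT"]) for r in identified)
    percent = round(100 * len(identified) / len(rows)) if rows else 0
    return {
        "files": len(rows),
        "identified": len(identified),
        "percent": percent,
        "per_file": dict(per_file),
    }


summary = summarise(SAMPLE)
print(summary)
```

The "per_file" entry maps the number of possible identifications to the number of files that received that many, which is the breakdown given in the last bullet above.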

Files that were identified

  • Of the 1382 files that were identified: 
    • 668 (48%) were identified by signature (which suggests a fairly accurate identification - if a file is identified by signature it means that Droid has looked inside the file and seen something that it recognises. I'm told it does this by some sort of magic!)
    • 648 (47%) were identified by extension alone (which implies a less accurate identification)
    • 65 (5%) were identified by container. These were all Microsoft Office files (xlsx and docx), which are in effect zip files. Identification by container suggests a high level of accuracy

  • 111 (8%) of the identified files had a file extension mismatch - this means that the file extension was not what you would expect given the identification by signature. 
    • All but 16 of these files were XML files that didn't have the .xml file extension (there were a range of extensions for these files including .orig, .plot, .xpr, .sc, .svg, .xci, .hwh, .bxml, .history). This isn't a very surprising finding given the ubiquity of XML files and the fact that applications often give their own XML output different extensions.

  • 34 different file formats were identified within the collection of research data

  • Of the identified files 360 (26%) were XML files. This was by far the most common file format identified within the research dataset. The top 10 identified file formats are as follows:
    • Extensible Markup Language - 360
    • Log File - 212
    • Plain Text File - 186
    • AppleDouble Resource Fork - 133
    • Comma Separated Values - 109
    • Microsoft Print File - 77
    • Portable Network Graphics - 73
    • Microsoft Excel for Windows - 57
    • ZIP Format - 23
    • XYWrite Document - 21
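The breakdown above (identification method, extension mismatches and most common formats) can be sketched along the following lines. Again this assumes a Droid CSV export, this time with METHOD, EXTENSION_MISMATCH and FORMAT_NAME columns; the column names, the values in them and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; column names and values are assumptions about the export.
SAMPLE = """NAME,METHOD,EXTENSION_MISMATCH,FORMAT_NAME
a.xml,Signature,false,Extensible Markup Language
b.plot,Signature,true,Extensible Markup Language
c.log,Extension,false,Log File
d.xlsx,Container,false,Microsoft Excel for Windows
e.dat,,false,
"""


def breakdown(csv_text, top=10):
    """Tally identified files by method, count mismatches and rank formats."""
    # An empty METHOD is taken to mean the file was not identified.
    identified = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["METHOD"]]
    methods = Counter(r["METHOD"] for r in identified)
    mismatches = sum(
        1 for r in identified if r["EXTENSION_MISMATCH"].strip().lower() == "true"
    )
    formats = Counter(r["FORMAT_NAME"] for r in identified)
    return methods, mismatches, formats.most_common(top)


methods, mismatches, top_formats = breakdown(SAMPLE)
print(methods)
print(mismatches)
print(top_formats)
```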

Files that weren't identified

  • The 2370 files that weren't identified by Droid had 107 different file extensions between them

  • 614 (26%) of the unidentified files had no file extension at all. This does rather limit the chance of identification, given that identification by file extension is relied on quite heavily by Droid and other similar tools. Of course it also limits our ability to actively curate this data unless we can identify it by another means.

  • The most common file extensions for the files that were not identified are as follows:
    • dat - 286
    • crl - 259
    • sd - 246
    • jdf - 130
    • out - 50
    • mrc - 47
    • inp - 46
    • xyz - 37
    • prefs - 32
    • spa - 32
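A tally like the one above needs nothing more than a list of file paths. Here is a minimal sketch that counts extensions and groups files with no extension under "(none)". The paths are made up, and the function name is just for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_counts(paths):
    """Count file extensions (lower-cased), grouping extensionless files together."""
    counts = Counter()
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower().lstrip(".")
        counts[suffix or "(none)"] += 1
    return counts


# Made-up paths standing in for the unidentified files.
unidentified = [
    "set1/run1.dat",
    "set1/run2.DAT",
    "set1/README",
    "set2/cell.crl",
    "set2/output",
]
counts = extension_counts(unidentified)
for extension, number in counts.most_common():
    print(extension, "-", number)
```

Lower-casing the extension means .dat and .DAT are counted together, which seems sensible for a first pass but may hide real differences between applications.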

Some thoughts

  • This is all very interesting and does back up our assertions about the long tail of file formats within a collection of research data and the challenges of identifying this data using current tools. I'd be interested to know whether for other collections of born digital data (not research data) a higher success rate would be expected? Is identification of 37% of files a particularly bad result or is it similar to what others have experienced?

  • As mentioned in a previous blog post, one area of work for us is to get some more research data file formats into PRONOM (a key database of file formats that is utilised by digital preservation tools). Alongside previous work on the top software applications used by researchers (a little of this is reported here) this has been helpful in informing our priorities when considering which formats we would like The National Archives to develop signatures for in phase 3 of the project.

  • Given the point made above, it could be suggested that one of our priorities for file format research should be .dat files. This would make sense given that we have 286 of these files and they are not identified by any means. However, here lies a problem. This is actually a fairly common file extension. There are many different types of .dat files produced by many different applications. PRONOM already holds information on two varieties of .dat file and the .dat files that we hold appear to come from several different software applications. In short, solving the .dat problem is not a trivial exercise!

  • It strikes me that we are really just scratching the surface here. Though it is good we are getting a few file signatures developed as an output for this project, this is clearly not going to make a big impact given the size of the problem. We will need to think about how the community should continue this work going forward.

  • It has been really helpful having some genuine research data to investigate when thinking about preservation workflows - particularly those workflows for unidentified files that we were considering in some detail during phase 2 of our project. The unidentified file report that has been developed within Archivematica as a result of this project helpfully organises the files by extension and I had not envisaged at the time that so many files would have no file extension at all. We have discussed previously how useful it is to fall back on identification by file extension if identification by signature is unsuccessful but clearly for so many of our files this will not be worth attempting.
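One crude way to get a feel for the .dat problem mentioned above is to group files by their first few bytes and see how many distinct varieties turn up. The sketch below does this for some made-up files. It is not how Droid or PRONOM signatures work (real signatures can sit at other offsets and be far more subtle), and the function name and the choice of eight bytes are arbitrary assumptions.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_leading_bytes(paths, length=8):
    """Group file names by the first `length` bytes of each file."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as handle:
            groups[handle.read(length)].append(Path(path).name)
    return dict(groups)


# Made-up .dat files: two text-based files from one imaginary application
# and one binary file from another.
samples = {
    "a.dat": b"#SPECTRUM v1\n1 2 3",
    "b.dat": b"#SPECTRUM v1\n4 5 6",
    "c.dat": b"\x00\x01BINARYDATA",
}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, content in samples.items():
        target = Path(tmp) / name
        target.write_bytes(content)
        paths.append(target)
    groups = group_by_leading_bytes(paths)

for leading, names in groups.items():
    print(leading, names)
```

Files that share their leading bytes are at least candidates for having come from the same application, which could help decide which varieties are worth researching first.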

* Note that this turns out not to be entirely accurate given that eight of the datasets were zipped up into a .rar archive. Though Droid looks inside several types of archive file (including .zip, .gzip, .tar, and the web archival formats .arc and .warc), it does not yet look inside .rar, .7z, .bz, and .iso files. I didn't realise this until after I had carried out some analysis on the results. Consequently there are another 1291 files which I have not reported on here (though a quick run through with Droid after unzipping them manually identified 33% of the files, so a similar ratio to the rest of the data). Note that this functionality is something that the team at The National Archives intend to develop in the future.
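To avoid being caught out by this again, a file listing could be checked for archives that need unpacking by hand before Droid is run. This is a minimal sketch: the set of extensions is taken from the note above and reflects Droid at the time of writing, and the function name and paths are made up.

```python
from pathlib import PurePosixPath

# Archive types the note above lists as not opened by Droid at the time of writing.
NOT_OPENED = {".rar", ".7z", ".bz", ".iso"}


def archives_needing_unpacking(paths):
    """Return the paths whose extension suggests an archive Droid will not open."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in NOT_OPENED]


# Made-up listing of deposited files.
listing = [
    "dataset01/readme.txt",
    "dataset02/raw.RAR",
    "dataset03/images.zip",
    "dataset04/disk.iso",
]
to_unpack = archives_needing_unpacking(listing)
print(to_unpack)
```

This only looks at file extensions, so an archive with a missing or misleading extension would slip through.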

Jenny Mitcham, Digital Archivist

Friday, 27 May 2016

Why AtoM?

A long time ago I started to talk about how we needed a new archival management system. I described the process of how we came up with a list of requirements to find a system that would meet our needs.

People sometimes ask me about AtoM and why we chose it so I thought it would be useful to publish our requirements and how the system performed against these.

It is also interesting at this point - post launch - to go back and review what we thought we knew about AtoM when we first made the selection back in 2014. Not only has our understanding of AtoM moved on since then, but the functionality of the system itself has moved on (and will move on again when version 2.3 is released). For several of the requirements, my initial assessment was negative or recorded only a partial success and coming back to it now, it is clear that things have improved.

I also discovered that for one of the requirements, I had recorded AtoM as having met it but experience and further research have demonstrated that it doesn't - this is something we are hoping to address with further funding in the future.

So here goes with the requirements and a revised and hopefully up-to-date commentary of how AtoM (version 2.2) meets them.

Cataloguing and accessioning

The system must allow us to record accessions

The short answer is Yes.

We looked at this one in some detail and produced a whole document specifically about whether AtoM met all of our requirements for accessioning. This included a mapping of the fields within AtoM to the data we held from previous systems and further thoughts related to our own accessioning workflows.

We have found that AtoM is a good tool for recording accessions but there are some issues. One of the problems we highlighted with the accessions module initially was the lack of ability to record covering dates of the material. We have been able to sponsor the development of covering dates fields by Artefactual Systems so this one is resolved now.

Another issue we have noted is the rather complex way that rights and licences are recorded within an accession record. I am still not sure we fully understand how best to use this section!

Need to be able to generate a receipt for accessioning

This one is still a No.

In our previous system for accessioning this was an auto generated report of key fields from the accessions data which could be printed out and sent to the donor or depositor with a covering letter as a record of us having received their archive. We did explore with Artefactual Systems the options around the creation of this feature within AtoM but were not able to find the money to follow up on this development.

The good news is that we have found a temporary workaround to this internally using a print style sheet to create a report. This solution isn't perfect but does at least mean we don't have to retype or copy and paste the accessions data into an accessions receipt. We are hoping a better solution emerges at some point in the future.

The system must allow us to enter catalogues/lists

Yes - this is very much what AtoM is designed to do.

Internally we are still discussing and testing out the methodology for data entry into AtoM. In some situations it makes sense to enter archival descriptions directly into the interface and in other situations the CSV or EAD import options are utilised. There are also situations where both of these methodologies might be used, for example importing the basic structure and then enhancing the records directly within AtoM.

Catalogue metadata should be ISAD(G) compatible

Yes - does what it says on the tin.

Data entry form for accessioning/cataloguing should include customisable drop down lists/controlled vocabularies where appropriate for ease and consistency of data entry

Yes - there is a taxonomies section of AtoM that allows you to manage the wordlists that are used within the system, editing and adding terms as appropriate. By default, some taxonomies (which contain controlled vocabularies directly from related standards) are locked, but can be edited by a developer.

Accessions data should be linked to catalogue data for ease of running queries and managing data (eg: producing lists of collections which haven't been catalogued)

Yes - Accessions data can be linked to archival descriptions within AtoM. This isn't mandatory (nor should it be) and since importing all of our accessions records into AtoM there would be quite a big piece of work involved in making these links. However, the functionality is there.

Running queries as described in this requirement isn't something that AtoM does currently. However, access to a MySQL client (such as Squirrel) and a working knowledge of SQL do open up opportunities for querying the underlying data in a variety of ways.
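To give a flavour of the kind of query meant here, the sketch below lists accessions that have no linked archival description. Please note that the table and column names are invented for illustration and are not AtoM's actual schema, and that an in-memory SQLite database stands in for MySQL so that the example runs on its own.

```python
import sqlite3

# Hypothetical, much simplified schema: NOT AtoM's real tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accession (id INTEGER PRIMARY KEY, identifier TEXT, title TEXT);
    CREATE TABLE description_link (accession_id INTEGER, description_id INTEGER);
    INSERT INTO accession VALUES
        (1, '2015/01', 'Parish papers'),
        (2, '2015/02', 'Estate maps'),
        (3, '2016/01', 'Research data');
    INSERT INTO description_link VALUES (1, 100);
    """
)

# Accessions with no row in the link table have not been catalogued yet.
UNCATALOGUED = """
SELECT a.identifier, a.title
FROM accession AS a
LEFT JOIN description_link AS l ON l.accession_id = a.id
WHERE l.accession_id IS NULL
ORDER BY a.identifier
"""

uncatalogued = conn.execute(UNCATALOGUED).fetchall()
for identifier, title in uncatalogued:
    print(identifier, title)
```

The LEFT JOIN keeps every accession and the WHERE clause then picks out those with nothing on the other side of the join. The same pattern should carry over to a real database once the actual table names are known.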

It should be possible to create a collection record directly from the accessions record (avoiding the need for re-typing duplicate fields)

Yes - An accessions record can be turned into an archival description - and this is a feature we want to explore more as we incorporate AtoM fully into our internal workflows.

We should be able to create hard copy catalogues in a format suitable for search room use

Initially no...but now yes!

Functionality on this score was fairly limited when we assessed AtoM version 2.0 three years ago. However, we knew this was on the AtoM development roadmap and since adopting AtoM there has been further work in this area. In AtoM 2.2 it is possible to create finding aids in PDF or RTF format and administrators can select whether a full finding aid or inventory summary is required.

We would like to record locations of material within the archive

In our initial assessment of AtoM we recorded this as a Yes

...but I'm reserving my judgement about whether we can use this functionality ourselves until we've finished testing this out.

We have a locations register (currently a separate spreadsheet) that records what is where within the strongroom. We also have signs for the end of each aisle which record which archives you can find in that aisle. I'd like to be able to say we could do away with these separate systems and store this information within AtoM but I'm just not sure whether the locations section of AtoM meets our needs currently. Further investigation required on this one and I'll try and blog about this another time.

We would like the system to be able to manage and log de-accessioning

Yes we believe this to be so - after creating an accession record, a simple click of a button allows the archivist to deaccession the whole or a part of the accession.

As far as I know we haven't had cause to test this one just yet.

Import and export of data

Data should be portable. We need to be able to export it to other platforms/portals in EAD, EAC (for authority records) and other formats.

Yes - this is certainly something that AtoM can do.

Data should be portable. We would like to be able to set up OAI-PMH targets (or equivalent) so that our data can be harvested and re-used by others

Ahhhh - now this is an area where perhaps we weren't specific enough with our requirement!

On the surface, this is a 'yes' - we have been able to set up OAI-PMH harvesting to our Library Catalogue using Dublin Core.

...however, what we really wanted to be able to do, which isn't articulated well enough in the requirement, was to expose EAD metadata via OAI-PMH. This isn't currently something that AtoM can do but watch this space! We hope to be able to make this happen at some point in the future.
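For those less familiar with OAI-PMH, a harvester simply requests a URL with a verb and a metadataPrefix and gets XML back. The sketch below builds a ListRecords request for Dublin Core (oai_dc) and pulls the titles out of a response. The base URL is hypothetical and the sample response is made up and heavily trimmed, so treat this as an illustration of the protocol rather than of our own set-up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the OAI-PMH base URL of a real repository.
BASE_URL = "https://archives.example.org/oai"

DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"

# Made-up, heavily trimmed response of the kind a ListRecords request returns.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example fonds</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def titles(response_xml):
    """Extract every Dublin Core title from an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iter(DC_TITLE)]


url = list_records_url(BASE_URL)
found = titles(SAMPLE_RESPONSE)
print(url)
print(found)
```

The metadataPrefix parameter is where the metadata format is chosen, so exposing EAD would in principle mean the repository offering a further prefix alongside oai_dc.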

We need adequate import mechanisms for incorporating backlog data in a variety of formats. This should include EAD and a means of importing accessions data

Yes - EAD can be imported, as can metadata in other formats (for example Dublin Core, MODS, EAC, SKOS). Having tested this we have discovered that the EAD import works but isn't necessarily as simple and straightforward as we would like. We know there are good reasons why this is the case but it has proved to be a barrier for us in getting more of our existing structured finding aids into AtoM. We will be doing more data import work over the next year or so.

CSV import is also an option and this is the method we used to import all of our accessions data. We are currently testing how we can use this import functionality for archival descriptions and think this will be very useful to us.


We need to be able to run queries and reports on the data that we hold - both routine and custom

Yes and No - Reporting within AtoM itself is limited; however, this doesn't matter if you have a MySQL client and the ability to query the underlying database. We have already had success at running specific reports on the data within AtoM (for our annual accessions return). Though this solution may not suit everyone, the ability to query the data outside of the web interface itself does offer flexibility above and beyond the functionality that could be programmed into the interface and is a really powerful tool.


Our users should be able to search and browse all of our collections on-line and access born-digital and digitised material that we hold

Yes absolutely - this is what AtoM does well. There are a variety of ways to search and browse the data within AtoM and one of the real strengths of the system is the ability to link between records using hyperlinked subject terms, places and authority records.

Our only limitation currently is that the vision of getting 'all' of our finding aids into AtoM may not be realised for some time!

We have not really started working with the digital object functionality within AtoM as we are waiting to get Archivematica installed and in use so that we can be sure that access to digital content is backed up by a preservation system.

The system should allow different levels of access, and provide a high degree of data security

Yes - there is scope to configure what public users can and can't see within AtoM. Accessions records (including details of donor names and addresses) are only visible to staff when logged in to the system. Public users only see the records you want them to see. Within AtoM you can keep things in draft until you are ready to publish them.

AtoM also allows you to hide certain fields from public view using the 'Visible Elements' feature. This means that within an archival description you can hide the physical location field for example or a notes field if you want to keep this field for internal use only.

AtoM also allows different levels of access to specific user groups. Staff can be given access as either contributors, editors or administrators as appropriate depending on the functions that they are required to carry out within the system.

We should be able to get usage statistics from the web interface

In hindsight this perhaps wasn't a very sensible requirement. Yes, we can see web statistics for our AtoM instance, but this is not through AtoM but through Google Analytics.

We should be able to get error reports from the web site so we know if there is a problem

As above - this requirement is not being met by AtoM itself but is met by other tools our systems team have at their disposal. For example, if the AtoM search facility breaks, we have an automatic notification that is sent out to those people that have the skills to fix it!


We need to be able to record born-digital material

Yes - but looking back, I am not entirely clear what I meant here. Using AtoM you can record the presence of born digital material in an accessions record (by describing the format in a free text field) and you can also present born digital material via the web interface and describe it as you would any other item within an archive.

Of course AtoM is not a digital archive and in order to fully record born digital material (specifically all of the technical metadata required) you also need a digital preservation system. AtoM integrates with Archivematica which ticks our boxes for digital archiving. For more information on this see my blog post about how Archivematica meets our digital archiving requirements.

We need to be able to associate digitised files with their analogue masters

Yes - AtoM allows you to upload or link digitised files to an archival description. We have done a bit of testing but not really started using this feature in earnest. Watch this space though - when we come to finish our ongoing project to digitise the archive of the Retreat we will carry out a piece of work to make the links between the archival catalogue and the digitised material.

We would like to be able to record preservation actions and other technical metadata for digital material

No - this is very much outside of scope for AtoM...and this is why we are also looking at Archivematica which does tick the boxes in these areas and is designed to work alongside AtoM.

The system should allow us to allocate unique identifiers to digital objects

Not really.

In AtoM the identifier would be the archival reference code - but this might not be a unique identifier as such as there may be more than one digital object associated with an archival description.

However, this is actually a job that could be performed elsewhere. Archivematica will allocate identifiers to digital objects and AtoM is designed to work alongside Archivematica.

Other modules

We need to create authority records where appropriate

Yes - Authority records are one of the core entity types within AtoM and are based on ICA standards. Now we are using AtoM we have found this feature to be very powerful.

We would like to be able to record conservation actions that we have carried out on a particular collection or item (analogue)

No - this is not a feature within AtoM and is not currently on the roadmap (though it could be in the future if someone was to fund the necessary development work).

There is a field in the accessions module to record the physical condition of an archive which is useful but isn't enough. We will continue to maintain a separate system for recording conservation actions for the time being.

Searchroom staff need to be able to log enquiries, access, register users etc

No - again this is not a feature of AtoM and will need substantial work (and financial resource) to develop it. In the meantime we are happy to continue to maintain separate systems and processes for the day-to-day work of the searchroom staff.

We would like to have a workflow or management actions checklist so we can keep track of what work needs to be done

No - without a conservation module being included within AtoM this requirement is perhaps a bit ambitious. What we were hoping for here was a feature that tells you where you are with an archive - for example, reporting on whether a new accession has been catalogued or not or has had the necessary conservation treatment.

As described elsewhere in this list, AtoM users can choose to run reports directly on the MySQL database using an SQL client. These could be tailored to help identify archives that fulfil particular requirements and could be used to help inform task allocation and team priorities.


The system should be under active development with established feedback routes for requesting enhancements

AtoM is under continuous active development. You can see what is coming up in the new release here. As well as showing how many new features and enhancements are being developed, it also shows the wide range of institutions who are involved in funding the development of AtoM.

Feedback and requests for enhancements are encouraged via the AtoM user forum. Requests for enhancements often lead to the same response from Artefactual Systems (the lead developers for AtoM) which goes along the lines of "yes we can do that if someone pays us to do it!" and that is fair enough really. It allows institutions to push various features to the top of the wishlist if they have the resources to pay for the development work. This does mean that these funded features are also available for other AtoM users to make use of if they wish.

The system should be flexible and customisable so it can be modified for our specific needs

Yes - AtoM is open source which does mean that we could tinker with the code and customise it as much as we like (assuming we have the resource and expertise to do so). We have done a bit of this - customising the way the global search works and the look and feel of the interface.

There is also scope within the AtoM interface to tweak the admin settings. Any AtoM user will want to spend a fair bit of time investigating these settings and considering what options will be best for their implementation.

The system should include technical support

Yes - technical support is available for AtoM. This can be bought as a package from Artefactual Systems and this will be particularly valuable if no technical staff resource is available in house.

Technical support is also available for free via the user forum - both from Artefactual Systems and from other AtoM users and all AtoM users are encouraged to join in the discussions and share their experiences.

The system should be used by other archives. This will provide us with another mechanism for advice, support and feedback

Yes - it is encouraging to see that AtoM is used by many institutions across the world. Some of these are listed on the Users page on the AtoM wiki.

For us it was also useful to make contact with a friendly AtoM user in the UK who we could talk to directly and get advice on how they use the system.

So, that was a run down of our requirements and how AtoM performs against them. 

As you will have seen, AtoM did not get full marks when we assessed it originally, and still doesn't do everything we had originally wanted it to. However, over the last few years I would say that we have revised our expectations and have accepted the fact that one system can't necessarily do everything! AtoM works alongside other systems and processes at the Borthwick Institute in order to meet our needs. For other requirements we have developed workarounds to ensure that we have a solution that works for us.

When people ask us why we selected AtoM as our archival management system I do mention the requirements assessment but I think ultimately it was the fact that we saw potential and wanted to get on board, be part of the user community and influence the future development of this system. We have seen numerous enhancements over the last couple of years and are looking forward to seeing many more developments in the future.

Jenny Mitcham, Digital Archivist