Thursday, 7 July 2016

On the closure of Edina's Unlock service

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie managed the technical side of the 'York's Archbishops' Registers Revealed' project. This post discusses the demise of Edina's Unlock service and wonders how sustainable open data services are.

It has recently come to my attention that Edina are retiring their 'Unlock' service on the 31st July 2016. Currently that's all I know as, AFAIK, Edina haven't provided any background or any information about why, or what users of this service might do instead. I also wasn't aware of any kind of consultation with users.

Edina's message about the Unlock service - not very informative. 

At York we've been using Unlock to search the DEEP gazetteer of English place names in our Archbishops' Registers editing tool. DEEP is a fantastic resource, an online gazetteer of the 86 volume corpus of the Survey of English Place-Names (SEPN). Without Edina's Unlock service, I don't know any way of programmatically searching it.
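For what it's worth, the integration itself is little more than a single HTTP call. The sketch below shows the kind of request our editing tool builds; the endpoint path and parameter names are from memory rather than Edina's documentation, so treat them as assumptions.

```python
# Sketch of the kind of call our editing tool makes to Unlock.
# The endpoint path and parameter names below are approximate
# assumptions, not a definitive record of the Unlock API.
from urllib.parse import urlencode

UNLOCK_BASE = "http://unlock.edina.ac.uk/ws/nameSearch"  # assumed endpoint

def deep_search_url(place_name):
    """Build a DEEP gazetteer search URL for a historical place name."""
    params = {
        "name": place_name,
        "gazetteer": "deep",   # restrict the search to the DEEP gazetteer
        "format": "json",
    }
    return UNLOCK_BASE + "?" + urlencode(params)

print(deep_search_url("Bishopthorpe"))
```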

Records from DEEP via Edina's Unlock service in our Archbishops' Registers editing tool 

Coincidentally (or not?), the web site that the records returned via the Unlock DEEP search link to is unavailable. This site presents the DEEP data and our editors use it to help identify and disambiguate place names.

I contacted Edina and they have promised to pass on further information about the end of the Unlock service as it becomes available later in the month. They also pointed me to the Institute for Name Studies at the University of Nottingham (INS) to find out why the place names site was unavailable. The initial response from INS was 'this is not our website'. I mentioned that it is listed as one of their resources and they are now following up for me with a colleague, who is away at the moment. I've also contacted the Centre for Data Digitisation and Analysis (CDDA) at Queen's University Belfast who mention the site as one of their 'project databases' on the CDDA web site. As of writing, I haven't had a response.

UPDATE: As of Thursday 7th July, the site seems to be back. Yay!

Entry from the Archbishops' Registers places list, complete with broken link 

All of this is quite worrying as it means many wasted development hours implementing this feature and reflects badly on our site - displaying broken links is not something we want to do.

So what do we do now? Well, in practical terms, unless I receive any other information from Edina to the contrary, I'll write out our use of Unlock / DEEP from the editing tool at the beginning of August and our editors will have to switch to manually creating place records in the tool. We've also been using Unlock to search the Ordnance Survey, so I'll hopefully be able to add a search of their linked data services directly. But we particularly liked DEEP as it gave us historical place names and enough information to help editors make sure they were selecting the correct contemporary place.

The bigger questions that this raises, though, are:

  • how do we ensure that important datasets and services coming out of projects can be sustained?
  • how can we trust that open data services will continue to be available, even those that appear to have the backing of a service provider like Edina or Jisc?
  • how do we find out when they aren't?
  • and how do we have a voice when decisions like this are being made?

Monday, 4 July 2016

New research data file formats now available in PRONOM

Like all good digital archivists I am interested in file formats and how we can use tools to automatically identify them. This is primarily so that when we package our digital archives up for long term preservation we can do so with a level of knowledge about how we might go about preserving and providing access to them in the future. This information is key whether migration or emulation is the preservation or access strategy of choice (or indeed a combination of both).

It has been really valuable to have some time and money as part of our "Filling the Digital Preservation Gap" project to be able to investigate issues around the identification of research data file formats and very pleasing to see the latest PRONOM signature release on 29th June which includes a couple of research data formats that we have sponsored as part of our project work.

PRONOM release notes

I sent a batch of sample files off to the team who look after PRONOM at The National Archives (TNA) with a bit of contextual information about the formats and software/hardware that creates them (that I had uncovered after a bit of research on Google). TNA did the rest of the hard work and these new signatures are now available for all to use.

The formats in question are:

  • Gaussian input files - These are created for an application called Gaussian which is used by many researchers in the Chemistry department here in York. In a previous project update you can see that Gaussian was listed in the top 10 research software applications in use at the University of York. These files are essentially just ASCII text files containing instructions for Gaussian and they can have a range of file extensions (though the samples I submitted were all .jdf). Though there is a recommended format or syntax for these instructions, there also appears to be flexibility in how these can be applied. Consequently this was a slightly challenging signature for TNA to work on and it would be useful if other institutions that have Gaussian input files could help test this signature and feed back to TNA if there are any problems or issues. In instances like this, being able to develop against a range of sample files created at different times in different institutions by different researchers would help.
  • JEOL NMR Spectroscopy files - These are data files produced by JEOL's Nuclear Magnetic Resonance Spectrometers. These facilities at the University of York are clearly well used as data of this type was well represented in an initial assessment of the data that I reported on in a blog post last month (130 .jdf files were present in the sample of 3752 files). As these files are created by a piece of hardware in a very standard way, I am told that signature developers at TNA were able to create a signature without too many problems.

Further formats submitted from our project will appear in PRONOM within the next couple of months.

The project team are also interested in finding out how easy it is to create our own file format signatures. This is an alternative option for those who want to contribute but not something we have attempted before. Watch this space to find out how we get on!

Modelling Research Data with PCDM

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie has been working on York's implementation for the 'Filling the Digital Preservation Gap' project. This post discusses preliminary work to define a data model for 'datasets'.

For Phase three of our 'Filling the Digital Preservation Gap' project, I've been working on implementing a prototype to illustrate how PURE and Archivematica can be used as part of a Research Data management lifecycle. Our technology stack at York is Hydra and Fedora 4 but it's an important aspect of the project to ensure that the thinking behind the prototype is applicable to other stacks. Central to adding any kind of data to a research information system, repository or preservation tool is the data model that underpins the metadata. For this I've been making use of the Portland Common Data Model (PCDM) and its various extensions (particularly Works).

In the past couple of years there has been a lot of work happening around PCDM, described as "a flexible, extensible domain model that is intended to underlie a wide array of repository and DAMS applications". PCDM provides a small Models ontology of classes and properties, with extension ontologies for Works and Use, among others. I like PCDM because it is high level enough to provide a language to talk across different domains and use cases.

Datasets data model version 1

My first attempt at a data model for datasets based on PCDM can be seen below.

Datasets Data Model v1

Starting with the Dataset object above, for York this is equivalent to a dataset record in our PURE research information system. The only metadata expected at this level is about the dataset as a whole, and for us, will largely come from PURE.

Below this, you'll see that I've begun to model in some OAIS constructs: the Archival Information Package (AIP) and Dissemination Information Package (DIP). The AIP is the deposit of data, prepared for preservation, a single package of data "consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS." (DCC Glossary). The DIP is a representation of the Content Information in the AIP, produced for dissemination. OAIS, by the way, gives no standard approach for structuring the AIP or DIP.

As it stands, this model does not consider how to 'unpack' the AIP at all; for our prototype we are (or were - see below) intending simply to point to the AIP in Archivematica.

The DIP as illustrated above is based on what Archivematica generates. The diagram includes the processing configuration and METS files that Archivematica produces by default as filesets, for illustration. These aren't part of the dataset as deposited, hence not making them 'Works' in their own right.

GenericWork is intended for each unit or 'Work' in the dataset as deposited, or a representation thereof. It is intended for use with any kind of data object and might be independently re-used, e.g. as a member of another dataset. In most cases for research data we probably won't know much about what the data is and so GenericWork will be used, but sometimes it may make sense to use existing models. For example, if the dataset is a collection of images then a more tailored Image model could be used for each Image, or if a dataset includes some existing objects that are already in our repository, those might already have different models. They can still be members of our dataset.

The model is intended to allow for a Dataset to include multiple AIPs and thus I have suggested a local predicate for hasDIP / hasAIP to establish the relationship between the AIP and the DIP.
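To make the version 1 model concrete, here it is expressed as a handful of triples. I've used plain Python tuples rather than an RDF library to keep the sketch self-contained; the exact arrangement of hasAIP / hasDIP follows my reading of the diagram, and the `ex:` identifiers are invented for illustration.

```python
# The version 1 model as subject-predicate-object triples.
# Prefixes: pcdm: (PCDM Models ontology), ex: (a local namespace);
# ex:hasAIP and ex:hasDIP are the local predicates suggested above.
triples = [
    ("ex:dataset1", "rdf:type",       "pcdm:Collection"),  # the Dataset as a whole
    ("ex:dataset1", "ex:hasAIP",      "ex:aip1"),          # local predicate
    ("ex:dataset1", "ex:hasDIP",      "ex:dip1"),          # local predicate
    ("ex:dip1",     "pcdm:hasMember", "ex:work1"),         # a GenericWork in the DIP
    ("ex:work1",    "pcdm:hasMember", "ex:fileset1"),      # Work -> FileSet
    ("ex:fileset1", "pcdm:hasFile",   "ex:file1"),         # FileSet -> File
]

def objects_of(subject, predicate):
    """All objects for a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("ex:dataset1", "ex:hasAIP"))
```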

The trouble with DIPs: an alternative model

Discussing this model with Justin Simpson from Artefactual Systems recently, I got to thinking that the DIP is a really artificial construct, effectively a presentation 'view' on the data and not really a separate 'Work' in its own right. Archivematica's DIP, at present, provides only files, and doesn't reflect the original folder structure, which may well be meaningful and necessary for a dataset. Perhaps what we really need is the AIP, described fully in Fedora, leaving presentation of the DIP to the interface level? A set of rules for producing a DIP to our own local specification might go like this (using the PCDM Use ontology): if there is a Preservation Master file, produce a Service File for user access, otherwise present the Original File to the user.
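That rule is simple enough to express in a few lines. The sketch below is a hypothetical illustration using PCDM Use class names as plain strings; the FileSet-as-dictionary structure is invented for the example and is not Fedora's actual representation.

```python
# A sketch of the local DIP rule: serve a Service File where a
# Preservation Master exists, otherwise fall back to the Original File.
# PCDM Use classes are represented as plain strings for illustration.
PRESERVATION_MASTER = "pcdmuse:PreservationMasterFile"
SERVICE_FILE = "pcdmuse:ServiceFile"
ORIGINAL_FILE = "pcdmuse:OriginalFile"

def file_for_access(fileset):
    """Pick which file in a FileSet to present to the user.

    `fileset` maps a PCDM Use class to a file path -- an illustrative
    structure invented for this sketch.
    """
    if PRESERVATION_MASTER in fileset and SERVICE_FILE in fileset:
        return fileset[SERVICE_FILE]
    return fileset[ORIGINAL_FILE]

# A FileSet with a preservation master: the user gets the service copy.
print(file_for_access({
    PRESERVATION_MASTER: "master.tiff",
    SERVICE_FILE: "access.jpg",
    ORIGINAL_FILE: "original.tiff",
}))  # access.jpg
```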

The new model would look something like this:

Datasets Data Model alternative

The DIP would be a view constructed from elements of the AIP.

This model feels like a better approach than that in version 1 as it facilitates describing the 'whole' dataset. I do have some immediate questions about the model, though:
  • Is a dataset really a pcdm:Collection, rather than a Work?
  • If the GenericWork is each data file irrespective of whether it can be used/understood on its own, how useful is that in reality? Is the GenericWork really needed, or are FileSets enough? Is there genuinely value in identifying each individual piece of data as a 'Work' (re-use outside of the dataset, for example)?
And when thinking beyond the model, about how this would actually work for different use cases, implementation questions start to surface.

Beyond the model

1) Dataset size and structure

Datasets may contain thousands, millions even, of files structured into folders where folders may impart meaning to the data, or be purely arbitrary. Fedora 4 can, by design, handle a folder structure using its implementation of LDP Basic Containers. As illustrated below, each folder is a 'Basic Container' and each data file is a Work, with FileSet and File objects.
  • AIP ldp:contains folder1
    • folder1 ldp:contains folder2
      • folder2 ldp:contains folder3
        • folder3 ldp:contains GenericWork
          • GenericWork pcdm:hasMember FileSet 
But if each of those folders is an object in Fedora and each file is in Fedora with a Work, a FileSet and several File objects, then the number of Fedora objects quickly becomes very large.
  • Would it be better to avoid the cost to Fedora of creating so many objects?
  • What alternative approaches could we take? Use BagIt and have Fedora reference only the 'bag'? Store only data files and create an 'order' to represent the folder structure (as outlined in PCDM 2.0)?
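To put a rough number on the object-cost question, here is a back-of-the-envelope sketch. The per-file object counts are illustrative assumptions (one Work, one FileSet, and some number of File objects per data file), not a statement about any particular Fedora configuration.

```python
# A rough count of Fedora objects implied by mirroring a folder tree with
# LDP Basic Containers: one object per folder, and for each data file a
# Work, a FileSet, and some number of File objects (original plus any
# derivatives). The numbers are illustrative assumptions.
def fedora_object_count(folders, data_files, files_per_fileset=2):
    per_file = 1 + 1 + files_per_fileset   # Work + FileSet + File objects
    return folders + data_files * per_file

# A modest dataset: 200 folders, 10,000 files, original + one derivative each.
print(fedora_object_count(200, 10_000))  # 40200
```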

2) Storing Files

'Where should data actually be stored permanently?' is another practical question I've been thinking about. On the one hand, Archivematica makes the AIP and its files available via URLs in the Storage Service and stores the file in a location of your choosing. On the other, Fedora can contain data files, or reference them via a URI. This gives us the flexibility to do several things:
  1. Leave the AIP and DIP in Archivematica's stores and use URIs in Fedora to reference all of the files and build a PCDM-modelled view of the data (Archivematica as preservation and access store).
  2. Manage all files in Fedora, treating the Archivematica copy of the data as temporary (Archivematica as sausage factory).
  3. Have two copies of AIP data, one in Archivematica and one in Fedora (LOCKSS model).
  4. Manage Preservation files in Archivematica and delivery/access files in Fedora (Archivematica as preservation and Fedora as access).
Keeping data in Archivematica makes it easy to do additional preservation actions in future, such as re-ingesting when format policy rules change, whereas managing all files within Fedora unlocks the possibilities of Fedora's audit functionality, fixity checking and versioning. Having two copies is attractive as a preservation strategy, but could be difficult to justify and sustain if data collections grow to a significant size.

On balance I think option (4) is best for the short-term with other options worth re-considering as both Archivematica and Fedora mature. But I'd be really keen to hear different views.


Hopefully this post illustrates that creating an outline data model is pretty easy, but when it comes to thinking about it in terms of implementation decisions, all kinds of ifs and buts start coming up.

In the model above, each data file is a Work. Each work contains one or more FileSets, and each FileSet contains one or more different representations of the file. 

Is it really possible to define a general datasets model that could encompass data from across disciplines, of various sizes, structures and created for a variety of purposes? Data that might be independently re-usable (a series of oral history interviews) or might only be understandable in combination with other files (a database schema document, for example)?

This is very much a work in progress, and I'd really welcome feedback from others who have done allied work or anyone who has suggestions and comments on the approaches and issues outlined above.

Tuesday, 31 May 2016

Research data - what does it *really* look like?

Work continues on our Filling the Digital Preservation Gap project and I thought it was about time we updated you on some of the things we have been doing.

While my colleague Julie has been focusing on the more technical issues of implementing Archivematica for research data, I have been looking at some real research data and exploring in more detail some of the issues we discussed in our phase 1 report.

For the past year, we have been accepting research data for longer term curation. Though the systems for preservation and access to this data are still in development, we are for the time being able to allocate a DOI for each dataset, manage access and store it safely (ensuring it isn't altered), and we intend to ingest it into our data curation systems once they are ready.

Having this data in one place on our filestore does give me the opportunity to test the hypothesis in our first report about the wide range of file formats that will be present in a research dataset and also the assertion that many of these will not be identified by the tools and registries in use for the creation of technical metadata.

So, I have done a fairly quick piece of analysis on the research data, running a tool called Droid developed by The National Archives over the data to get an indication of whether the files can be recognised and identified in an automated fashion.
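For anyone wanting to repeat this kind of analysis, a short script over DROID's CSV export is enough. The column names below ('TYPE', 'METHOD', 'PUID') match the export I worked with, but do check the header row of your own export before relying on them.

```python
# Tally DROID's identification methods from a CSV export: total files,
# counts per identification method (Signature / Extension / Container),
# and the number of unidentified files (no PUID assigned).
import csv
from collections import Counter

def summarise_droid_export(csv_path):
    methods = Counter()
    unidentified = 0
    total = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("TYPE") == "Folder":   # count files only
                continue
            total += 1
            if row.get("PUID"):               # identified: note the method
                methods[row["METHOD"]] += 1
            else:
                unidentified += 1
    return total, methods, unidentified
```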

All the data in our research data sample has been deposited with us since May 2015. The majority of the data is scientific in nature - much of it coming from the departments of Chemistry and Physics (this may be a direct result of expectations from the EPSRC around data management). The data is mostly fairly recent, as suggested by the last modified dates on these files, which range from 2006 to 2016 with the vast majority having been modified in the last five years. The distribution of dates is illustrated below.

Here are some of the findings of this exercise:

Summary statistics

  • Droid reported that 3752 individual files were present*

  • 1382 (37%) of the files were given a file format identification by Droid

  • 1368 (99%) of those files that were identified were given just one possible identification. 12 files were given two possible identifications and a further two were given 18 possible identifications. In all these cases, the identification was done by file extension rather than signature - which perhaps explains the uncertainty.

Files that were identified

  • Of the 1382 files that were identified: 
    • 668 (48%) were identified by signature (which suggests a fairly accurate identification - if a file is identified by signature it means that Droid has looked inside the file and seen something that it recognises. I'm told it does this by some sort of magic!)
    • 648 (47%) were identified by extension alone (which implies a less accurate identification)
    • 65 (5%) were identified by container. These were all Microsoft Office files (xlsx and docx), as these are in effect zip files (which suggests a high level of accuracy)

  • 111 (8%) of the identified files had a file extension mismatch - this means that the file extension was not what you would expect given the identification by signature. 
    • All but 16 of these files were XML files that didn't have the .xml file extension (there were a range of extensions for these files including .orig, .plot, .xpr, .sc, .svg, .xci, .hwh, .bxml, .history). This isn't a very surprising finding given the ubiquity of XML files and the fact that applications often give their own XML output different extensions.

  • 34 different file formats were identified within the collection of research data

  • Of the identified files 360 (26%) were XML files. This was by far the most common file format identified within the research dataset. The top 10 identified files are as follows:
    • Extensible Markup Language - 360
    • Log File - 212
    • Plain Text File - 186
    • AppleDouble Resource Fork - 133
    • Comma Separated Values - 109
    • Microsoft Print File - 77
    • Portable Network Graphics - 73
    • Microsoft Excel for Windows - 57
    • ZIP Format - 23
    • XYWrite Document - 21

Files that weren't identified

  • Of the 2370 files that weren't identified by Droid, 107 different file extensions were represented

  • 614 (26%) of the unidentified files had no file extension at all. This does rather limit the chance of identification, given that identification by file extension is relied on quite heavily by Droid and other similar tools. Of course it also limits our ability to actively curate this data unless we can identify it by another means.

  • The most common file extensions for the files that were not identified are as follows:
    • dat - 286
    • crl - 259
    • sd - 246
    • jdf - 130
    • out - 50
    • mrc - 47
    • inp - 46
    • xyz - 37
    • prefs - 32
    • spa - 32
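As a first pass before any signature work, the same kind of extension breakdown (including the no-extension group mentioned above) can be produced with a few lines of Python:

```python
# Group files by extension, with files lacking any suffix counted
# under "(none)" -- mirroring the breakdown above.
from collections import Counter
from pathlib import Path

def extension_counts(paths):
    """Count file extensions; files with no suffix fall under '(none)'."""
    return Counter(
        Path(p).suffix.lstrip(".").lower() or "(none)" for p in paths
    )

print(extension_counts(["a.dat", "b.DAT", "notes", "run.out", "README"]))
```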

Some thoughts

  • This is all very interesting and does back up our assertions about the long tail of file formats within a collection of research data and the challenges of identifying this data using current tools. I'd be interested to know whether for other collections of born digital data (not research data) a higher success rate would be expected? Is identification of 37% of files a particularly bad result or is it similar to what others have experienced?

  • As mentioned in a previous blog post, one area of work for us is to get some more research data file formats into PRONOM (a key database of file formats that is utilised by digital preservation tools). Alongside previous work on the top software applications used by researchers (a little of this is reported here) this has been helpful in informing our priorities when considering which formats we would like The National Archives to develop signatures for in phase 3 of the project.

  • Given the point made above, it could be suggested that one of our priorities for file format research should be .dat files. This would make sense given that we have 286 of these files and they are not identified by any means. However, here lies a problem. This is actually a fairly common file extension. There are many different types of .dat files produced by many different applications. PRONOM already holds information on two varieties of .dat file and the .dat files that we hold appear to come from several different software applications. In short, solving the .dat problem is not a trivial exercise!

  • It strikes me that we are really just scratching the surface here. Though it is good we are getting a few file signatures developed as an output for this project, this is clearly not going to make a big impact given the size of the problem. We will need to think about how the community should continue this work going forward.

  • It has been really helpful having some genuine research data to investigate when thinking about preservation workflows - particularly those workflows for unidentified files that we were considering in some detail during phase 2 of our project. The unidentified file report that has been developed within Archivematica as a result of this project helpfully organises the files by extension and I had not envisaged at the time that so many files would have no file extension at all. We have discussed previously how useful it is to fall back on identification by file extension if identification by signature is unsuccessful but clearly for so many of our files this will not be worth attempting.

* note that this turns out not to be entirely accurate given that eight of the datasets were zipped up into a .rar archive. Though Droid looks inside several types of zip files (including .zip, .gzip, .tar, and the web archival formats .arc and .warc) it does not yet look inside .rar, .7z, .bz, and .iso files. I didn't realise this until after I had carried out some analysis on the results. Consequently there are another 1291 files which I have not reported on here (though a quick run through with Droid after unzipping them manually identified 33% of the files, so a similar ratio to the rest of the data). Note that this functionality is something that the team at The National Archives intend to develop in the future.

Friday, 27 May 2016

Why AtoM?

A long time ago I started to talk about how we needed a new archival management system. I described the process of how we came up with a list of requirements to find a system that would meet our needs.

People sometimes ask me about AtoM and why we chose it so I thought it would be useful to publish our requirements and how the system performed against these.

It is also interesting at this point - post launch - to go back and review what we thought we knew about AtoM when we first made the selection back in 2014. Not only has our understanding of AtoM moved on since then, but the functionality of the system itself has moved on (and will move on again when version 2.3 is released). For several of the requirements, my initial assessment was negative or recorded only a partial success and coming back to it now, it is clear that things have improved.

I also discovered that for one of the requirements, I had recorded AtoM as having met it but experience and further research has demonstrated that it doesn't - this is something we are hoping to address with further funding in the future.

So here goes with the requirements and a revised and hopefully up-to-date commentary of how AtoM (version 2.2) meets them.

Cataloguing and accessioning

The system must allow us to record accessions

The short answer is Yes.

We looked at this one in some detail and produced a whole document specifically about whether AtoM met all of our requirements for accessioning. This included a mapping of the fields within AtoM to the data we held from previous systems and further thoughts related to our own accessioning workflows.

We have found that AtoM is a good tool for recording accessions but there are some issues. One of the problems we highlighted with the accessions module initially was the lack of ability to record covering dates of the material. We have been able to sponsor the development of covering dates fields by Artefactual Systems so this one is resolved now.

Another issue we have noted is the rather complex way that rights and licences are recorded within an accession record. I am still not sure we fully understand how best to use this section!

Need to be able to generate a receipt for accessioning

This one is still a No.

In our previous system for accessioning this was an auto generated report of key fields from the accessions data which could be printed out and sent to the donor or depositor with a covering letter as a record of us having received their archive. We did explore with Artefactual Systems the options around the creation of this feature within AtoM but were not able to find the money to follow up on this development.

The good news is that we have found a temporary workaround to this internally using a print style sheet to create a report. This solution isn't perfect but does at least mean we don't have to retype or copy and paste the accessions data into an accessions receipt. We are hoping a better solution emerges at some point in the future.

The system must allow us to enter catalogues/lists

Yes - this is very much what AtoM is designed to do.

Internally we are still discussing and testing out the methodology for data entry into AtoM. In some situations it makes sense to enter archival descriptions directly into the interface and in other situations the CSV or EAD import options are utilised. There are also situations where both of these methodologies might be used, for example importing the basic structure and then enhancing the records directly within AtoM.

Catalogue metadata should be ISAD(G) compatible

Yes - does what it says on the tin.

Data entry form for accessioning/cataloguing should include customisable drop down lists/controlled vocabularies where appropriate for ease and consistency of data entry

Yes - there is a taxonomies section of AtoM that allows you to manage the wordlists that are used within the system, editing and adding terms as appropriate. By default, some taxonomies (which contain controlled vocabularies directly from related standards) are locked, but can be edited by a developer.

Accessions data should be linked to catalogue data for ease of running queries and managing data (eg: producing lists of collections which haven't been catalogued)

Yes - Accessions data can be linked to archival descriptions within AtoM. This isn't mandatory (nor should it be) and since importing all of our accessions records into AtoM there would be quite a big piece of work involved in making these links. However, the functionality is there.

Running queries as described in this requirement isn't something that AtoM does currently, however access to a MySQL client (such as Squirrel) and a working knowledge of SQL does open up opportunities for querying the underlying data in a variety of ways.
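To illustrate the join pattern such a query uses (e.g. finding accessions with no linked archival description), here is a toy example against an in-memory SQLite database. AtoM's real MySQL schema uses different table and column names, so the schema below is purely illustrative, not AtoM's.

```python
# The kind of query we run: accessions with no linked description.
# In-memory SQLite with an invented toy schema -- AtoM's actual MySQL
# schema differs, so treat these table/column names as assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accession (id INTEGER PRIMARY KEY, identifier TEXT);
    CREATE TABLE description (id INTEGER PRIMARY KEY, accession_id INTEGER);
    INSERT INTO accession VALUES (1, 'ACC-2014-001'), (2, 'ACC-2014-002');
    INSERT INTO description VALUES (10, 1);   -- only accession 1 is catalogued
""")

uncatalogued = conn.execute("""
    SELECT a.identifier
    FROM accession a
    LEFT JOIN description d ON d.accession_id = a.id
    WHERE d.id IS NULL
""").fetchall()

print(uncatalogued)  # [('ACC-2014-002',)]
```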

It should be possible to create a collection record directly from the accessions record (avoiding the need for re-typing duplicate fields)

Yes - An accessions record can be turned into an archival description - and this is a feature we want to explore more as we incorporate AtoM fully into our internal workflows.

We should be able to create hard copy catalogues in a format suitable for search room use

Initially no...but now yes!

Functionality on this score was fairly limited when we assessed AtoM version 2.0 three years ago. However, we knew this was on the AtoM development roadmap and since adopting AtoM there has been further work in this area. In AtoM 2.2 it is possible to create finding aids in PDF or RTF format and administrators can select whether a full finding aid or inventory summary is required.

We would like to record locations of material within the archive

In our initial assessment of AtoM we recorded this as a Yes

...but I'm reserving my judgement about whether we can use this functionality ourselves until we've finished testing this out.

We have a locations register (currently a separate spreadsheet) that records what is where within the strongroom. We also have signs for the end of each aisle which record which archives you can find in that aisle. I'd like to be able to say we could do away with these separate systems and store this information within AtoM but I'm just not sure whether the locations section of AtoM meets our needs currently. Further investigation required on this one and I'll try and blog about this another time.

We would like the system to be able to manage and log de-accessioning

Yes we believe this to be so - after creating an accession record, a simple click of a button allows the archivist to deaccession the whole or a part of the accession.

As far as I know we haven't had cause to test this one just yet.

Import and export of data

Data should be portable. We need to be able to export it to other platforms/portals in EAD, EAC (for authority records) and other formats.

Yes - this is certainly something that AtoM can do.

Data should be portable. We would like to be able to set up OAI PMH targets (or equivalent) so that our data can be harvested and re-used by others 

Ahhhh - now this is an area where perhaps we weren't specific enough with our requirement!

On the surface, this is a 'yes' - we have been able to set up OAI-PMH harvesting to our Library Catalogue using Dublin Core.

...however, what we really wanted to be able to do, which isn't articulated well enough in the requirement was to be able to expose EAD metadata via OAI-PMH. This isn't currently something that AtoM can do but watch this space! We hope to be able to make this happen at some point in the future.
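In the meantime, the Dublin Core harvesting we do have follows the standard OAI-PMH request pattern. The sketch below builds such a request; the verb and metadataPrefix parameters come from the OAI-PMH 2.0 specification, while the base URL (and the `;oai` path, which is my understanding of how AtoM exposes its endpoint) is a placeholder to check against your own instance.

```python
# Build a standard OAI-PMH ListRecords request. The verb and
# metadataPrefix parameters are defined by the OAI-PMH 2.0 spec;
# the base URL here is a placeholder for an AtoM instance.
from urllib.parse import urlencode

def oai_list_records(base_url, metadata_prefix="oai_dc", set_spec=None):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec   # optional selective harvesting
    return base_url + "?" + urlencode(params)

print(oai_list_records("https://archives.example.org/;oai"))
```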

We need adequate import mechanisms for incorporating backlog data in a variety of formats. This should include EAD and a means of importing accessions data

Yes - EAD can be imported, as can metadata in other formats (for example Dublin Core, MODS, EAC, SKOS). Having tested this we have discovered that the EAD import works but isn't necessarily as simple and straightforward as we would like. We know there are good reasons why this is the case but it has proved to be a barrier for us in getting more of our existing structured finding aids into AtoM. We will be doing more data import work over the next year or so.

CSV import is also an option and this is the method we used to import all of our accessions data. We are currently testing how we can use this import functionality for archival descriptions and think it will be very useful to us.
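To give a flavour of the general shape of a CSV import, here is a hedged Python sketch that writes a one-row accessions file. The column names and values are illustrative assumptions only - AtoM ships its own canonical CSV templates and command-line import tasks, and those should be followed for real imports:

```python
import csv
import io

# Invented example row - not AtoM's canonical accessions template.
rows = [
    {"accessionNumber": "2016/03",
     "title": "Parish records deposit",
     "acquisitionDate": "2016-03-14",
     "donorName": "Example Donor"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
# The resulting file would then be loaded on the server with AtoM's
# command-line CSV import task.
```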


We need to be able to run queries and reports on the data that we hold - both routine and custom

Yes and no - reporting within AtoM itself is limited, but this doesn't matter if you have a MySQL client and the ability to query the underlying database. We have already had success running specific reports on the data within AtoM (for our annual accessions return). Though this solution may not suit everyone, the ability to query the data outside of the web interface offers flexibility above and beyond what could be programmed into the interface and is a really powerful tool.
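As a flavour of the kind of report we mean, here is a sketch that counts accessions per year. We run this sort of query against AtoM's MySQL database; sqlite3 stands in below so the example is self-contained, and the table and column names are deliberately simplified assumptions rather than AtoM's real (more normalised) schema:

```python
import sqlite3

# In-memory stand-in for AtoM's MySQL database; schema is simplified.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accession (identifier TEXT, date TEXT)")
conn.executemany("INSERT INTO accession VALUES (?, ?)", [
    ("2015/01", "2015-02-10"),
    ("2015/02", "2015-11-03"),
    ("2016/01", "2016-01-20"),
])

# Annual accessions return: how many accessions arrived in each year?
report = conn.execute(
    "SELECT substr(date, 1, 4) AS year, COUNT(*) "
    "FROM accession GROUP BY year ORDER BY year"
).fetchall()
print(report)  # [('2015', 2), ('2016', 1)]
```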


Our users should be able to search and browse all of our collections on-line and access born-digital and digitised material that we hold

Yes absolutely - this is what AtoM does well. There are a variety of ways to search and browse the data within AtoM and one of the real strengths of the system is the ability to link between records using hyperlinked subject terms, places and authority records.

Our only limitation currently is that the vision of getting 'all' of our finding aids into AtoM may not be realised for some time!

We have not really started working with the digital object functionality within AtoM as we are waiting to get Archivematica installed and in use, so that we can be sure that access to digital content is backed up by a preservation system.

The system should allow different levels of access, and provide a high degree of data security

Yes - there is scope to configure what public users can and can't see within AtoM. Accessions records (including details of donor names and addresses) are only visible to staff when logged in to the system. Public users only see the records you want them to see. Within AtoM you can keep things in draft until you are ready to publish them.

AtoM also allows you to hide certain fields from public view using the 'Visible Elements' feature. This means that within an archival description you can hide the physical location field for example or a notes field if you want to keep this field for internal use only.

AtoM also allows different levels of access to specific user groups. Staff can be given access as either contributors, editors or administrators as appropriate depending on the functions that they are required to carry out within the system.

We should be able to get usage statistics from the web interface

In hindsight this perhaps wasn't a very sensible requirement. Yes, we can see web statistics for our AtoM instance, but this is not through AtoM but through Google Analytics.

We should be able to get error reports from the web site so we know if there is a problem

As above - this requirement is not being met by AtoM itself but is met by other tools our systems team have at their disposal. For example, if the AtoM search facility breaks, we have an automatic notification that is sent out to those people that have the skills to fix it!


We need to be able to record born-digital material

Yes - but looking back, I am not entirely clear what I meant here. Using AtoM you can record the presence of born digital material in an accessions record (by describing the format in a free text field) and you can also present born digital material via the web interface and describe it as you would any other item within an archive.

Of course AtoM is not a digital archive and in order to fully record born digital material (specifically all of the technical metadata required) you also need a digital preservation system. AtoM integrates with Archivematica which ticks our boxes for digital archiving. For more information on this see my blog post about how Archivematica meets our digital archiving requirements.

We need to be able to associate digitised files with their analogue masters

Yes - AtoM allows you to upload or link digitised files to an archival description. We have done a bit of testing but not really started using this feature in earnest. Watch this space though - when we come to finish our ongoing project to digitise the archive of the Retreat we will carry out a piece of work to make the links between the archival catalogue and the digitised material.

We would like to be able to record preservation actions and other technical metadata for digital material

No - this is very much outside of scope for AtoM...and this is why we are also looking at Archivematica which does tick the boxes in these areas and is designed to work alongside AtoM.

The system should allow us to allocate unique identifiers to digital objects

Not really.

In AtoM the identifier would be the archival reference code - but this might not be a unique identifier as such, as there may be more than one digital object associated with a single archival description.

However, this is actually a job that could be performed elsewhere. Archivematica will allocate identifiers to digital objects and AtoM is designed to work alongside Archivematica.
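Archivematica uses UUIDs for this job. A toy Python sketch of the idea - several digital objects can share one archival reference code but each gets its own identifier (the function and reference code here are invented for illustration):

```python
import uuid

def allocate(ref_code):
    """Pair an archival reference code with a fresh per-object UUID."""
    return {"reference": ref_code, "object_uuid": str(uuid.uuid4())}

# Two digital objects attached to the same archival description:
a = allocate("RET/1/2/3")
b = allocate("RET/1/2/3")
print(a["object_uuid"] != b["object_uuid"])  # True - same reference, distinct objects
```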

Other modules

We need to create authority records where appropriate

Yes - Authority records are one of the core entity types within AtoM and are based on ICA standards. Now we are using AtoM we have found this feature to be very powerful.

We would like to be able to record conservation actions that we have carried out on a particular collection or item (analogue)

No - this is not a feature within AtoM and is not currently on the roadmap (though it could be in the future if someone was to fund the necessary development work).

There is a field in the accessions module to record the physical condition of an archive which is useful but isn't enough. We will continue to maintain a separate system for recording conservation actions for the time being.

Searchroom staff need to be able to log enquiries, access, register users etc

No - again this is not a feature of AtoM and will need substantial work (and financial resource) to develop it. In the meantime we are happy to continue to maintain separate systems and processes for the day-to-day work of the searchroom staff.

We would like to have a workflow or management actions checklist so we can keep track of what work needs to be done

No - without a conservation module being included within AtoM this requirement is perhaps a bit ambitious. What we were hoping for here was a feature that tells you where you are with an archive - for example, reporting on whether a new accession has been catalogued or not or has had the necessary conservation treatment.

As described elsewhere in this list, AtoM users can choose to run reports directly on the MySQL database using an SQL client. These could be tailored to help identify archives that fulfil particular requirements and could be used to help inform task allocation and team priorities.


The system should be under active development with established feedback routes for requesting enhancements

AtoM is under continuous active development. You can see what is coming up in the new release here. As well as showing how many new features and enhancements are being developed, it also shows the wide range of institutions who are involved in funding the development of AtoM.

Feedback and requests for enhancements are encouraged via the AtoM user forum. Requests for enhancements often lead to the same response from Artefactual Systems (the lead developers for AtoM) which goes along the lines of "yes we can do that if someone pays us to do it!" and that is fair enough really. It allows institutions to push various features to the top of the wishlist if they have the resources to pay for the development work. This does mean that these funded features are also available for other AtoM users to make use of if they wish.

The system should be flexible and customisable so it can be modified for our specific needs

Yes - AtoM is open source which does mean that we could tinker with the code and customise it as much as we like (assuming we have the resource and expertise to do so). We have done a bit of this - customising the way the global search works and the look and feel of the interface.

There is also scope within the AtoM interface to tweak the admin settings. Any AtoM user will want to spend a fair bit of time investigating these settings and considering what options will be best for their implementation.

The system should include technical support

Yes - technical support is available for AtoM. This can be bought as a package from Artefactual Systems and this will be particularly valuable if no technical staff resource is available in house.

Technical support is also available for free via the user forum - both from Artefactual Systems and from other AtoM users - and all AtoM users are encouraged to join in the discussions and share their experiences.

The system should be used by other archives. This will provide us with another mechanism for advice, support and feedback

Yes - it is encouraging to see that AtoM is used by many institutions across the world. Some of these are listed on the Users page on the AtoM wiki.

For us it was also useful to make contact with a friendly AtoM user in the UK who we could talk to directly and get advice on how they use the system.

So, that was a run down of our requirements and how AtoM performs against them. 

As you will have seen, AtoM did not get full marks when we assessed it originally, and still doesn't do everything we had originally wanted it to. However, over the last few years I would say that we have revised our expectations and have accepted the fact that one system can't necessarily do everything! AtoM works alongside other systems and processes at the Borthwick Institute in order to meet our needs. For other requirements we have developed workarounds to ensure that we have a solution that works for us.

When people ask us why we selected AtoM as our archival management system I do mention the requirements assessment but I think ultimately it was the fact that we saw potential and wanted to get on board, be part of the user community and influence the future development of this system. We have seen numerous enhancements over the last couple of years and are looking forward to seeing many more developments in the future.

Monday, 11 April 2016

Responding to the results of user testing

Did you notice that we launched our new AtoM catalogue last week? I hope so!

In the month whilst preparing for launch we wanted to take the time to find out what a sample of users thought about our new catalogue and here I will summarise some of the findings and the steps that we have taken to react to this feedback.

We had 14 people test the catalogue for us off-site and fill out an online questionnaire which was put together using Google forms. Testing was carried out on AtoM version 2.2.0. The volunteers for user testing were found by putting out a call on Twitter and the results were helpful and constructive (though one user could not access the site so was not able to answer the questions in any meaningful way). Despite the small sample size there were several themes that were mentioned more than once. Interestingly, these weren't necessarily the themes we had expected to come up!

Let's start with the positives....

The good things

It's always nice to receive positive feedback and we were encouraged to see that there was plenty of this to come out of the user testing - things that were praised fell into the following categories:

Look and feel - The vast majority of users found the catalogue visually appealing. A couple of people mentioned that they liked the colour scheme and one appreciated the fact that it flowed nicely from our website. The image on the home page was also praised. Others commented on the fact that it was well set out with a clean and clear appearance. One respondent compared it very favourably with other leading archival catalogues.

Our home page image

Functionality - The search functionality of the catalogue was praised as was the faceted classification that allows you to filter your search results. The browse by subject feature had several positive mentions and one person liked the ability to download XML files. Navigation within the catalogue was praised, including a specific comment about the tree-view feature on the left side of the interface.

The data - We were pleased to hear people saying good things about the quality of the data that we have in the catalogue. The information was described as being 'full' and 'comprehensive'. The level of detail held in the Conditions of Access and Use field was mentioned specifically and the fact that you could see when each description was last updated. One respondent stated that they liked the fact that the catalogue conformed to recognised archival standards and that it was clear from the interface which rules had been used to create the data.

Digital objects - Several of the testers mentioned specifically that they liked the inclusion of digital objects within the catalogue. We have not utilised this feature to full effect just yet, but for some of our descriptions a finding aid or an image is available. Users liked the way that AtoM displays the thumbnails in the results list. An archival catalogue can be quite text-heavy so using digital objects to break the text up was seen as a good thing.

The help pages - Our glossary page had a positive mention. We put this together as we recognised that archival terminology can be a bit of a mystery to non-archivists (myself included) so being able to define some of the key terms we use was a priority for us.

My favourite comment under the question "What did you like about the catalogue?" was "Almost everything". This highlights to me that we have pretty much got it right but of course we shouldn't put our feet up - there is always room for improvement!

The not-so-good things

We also received comments about the things which weren't working so well in our new catalogue:

Look and feel - Of the users who did not think the catalogue was visually appealing, one comment was that it was 'bland' and that too much space on the front page was taken up by the image. The same person didn't like the fact that all the navigation was on the left and they couldn't find the search box. Another respondent thought that the links on the left hand side were too small and their eye wasn't drawn to them because of the large image on the front page. It was thought by one person that the location of the main image on the front page looked odd because it wasn't central.

Our response: We wondered about trying to increase the size of the text in the left hand navigation bar in order to make these links stand out a bit more but concluded that this may well upset the balance of the current design. Given that the majority of respondents were very happy with the visual appearance of the site, we decided that no changes were needed at this point in time.

Search box - The visibility of the search box was an issue that was raised a couple of times. We are using a slightly customised version of the default Dominion theme within AtoM and this puts the search box at the top of the screen. One person didn't find the search box at all whilst testing the catalogue. Another found it but wasn't immediately sure of its purpose as its location and proximity to the University of York logo suggested it would search our website rather than our catalogue. This may have been a direct result of our decision to style the catalogue to mirror the look and feel of our website as we do have a similar sized website search box in the top bar of our website.

Our response: We have given some serious thought to how to make the search box more prominent within AtoM but I'm not convinced there is an obvious solution to this. Prior to the user testing we had already changed the colour of the search box from dark grey to white to make it more visible. We have since made another minor tweak to the default theme to turn the 'Search' text within the search box from grey to black to make it stand out more. We considered making the search box bigger (longer) but our top bar is already getting quite crowded and filling it up any more than necessary does have knock-on effects on the responsive design when viewed on smaller screens.

While I can see a benefit to having the main search box taking centre stage on the catalogue front page, I also see it is useful having it up in the top bar so it is always accessible where ever a user is within the catalogue. We don't intend to make any further changes for the time being.

Search results - Several people mentioned that there were simply too many results when you carry out a search... and the results that come up are not always relevant. We had already been discussing this very issue on the AtoM mailing list and were not surprised that our users were struggling with this.

Our response: We are hoping that this is something that will be resolved in future versions of AtoM, but in the meantime we are focusing on educating our users by giving them the information they need in order to run more effective and precise searches (even just using the powerful functionality that is available within the basic search box). 

We think that a change to AtoM's default behaviour, which currently combines multiple search terms with an 'OR' operator rather than an 'AND', would produce search results more in line with what our users were expecting. Also, although users of Google will happily run a search that produces many thousands of results and feel comfortable not moving beyond the first ten 'hits', users of archival catalogues do not necessarily take the same approach. There seems to be more of an assumption that the list of results will be relevant and that each should be worked through in turn. This is something we are definitely hoping for a solution to in the future.
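A toy Python illustration of why the operator matters: with 'OR', a record matching any query term is returned, so result counts balloon; with 'AND', it must match them all. The documents and search function here are invented for illustration, not how AtoM's search engine actually works:

```python
# A tiny invented "catalogue" of three descriptions.
docs = {
    1: "wills and probate records",
    2: "parish register transcripts",
    3: "probate register index",
}

def search(query, operator="OR"):
    """Return ids of docs matching the query terms with OR or AND."""
    terms = query.lower().split()
    combine = any if operator == "OR" else all
    return [i for i, text in docs.items()
            if combine(t in text.split() for t in terms)]

print(search("probate register"))          # OR:  [1, 2, 3]
print(search("probate register", "AND"))   # AND: [3]
```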

Filtering the search results - One person expressed a desire to be able to filter a search by date.

Our response: We agree that this would be a really useful feature and we were pleased to hear from Artefactual Systems that this will be possible within the next version of AtoM (2.3) which is due out soon. This will also introduce the ability to search within the date field in the advanced search and order results by start date in the results list. I think these features are going to be really valuable to our users.

Navigation - One person reported that the catalogue was hard to navigate but didn't give further details. Another struggled with navigation and described a scenario in which they had got lost within the catalogue. 

Our response: I can easily understand how someone could get lost within our catalogue - it has happened to me too! In some respects this problem is directly related to the powerful functionality of the AtoM interface and relational nature of the underlying data structure. Searching and browsing AtoM isn't a linear journey but rather an opportunity to follow links between one record and another based on shared subject terms or creators. Getting lost is a fairly inevitable consequence of this functionality and I struggle to think of an effective solution (apart from encouraging repeated use of the browser 'back' button to get back to where you started!)

The data - One user reported that there is "not much material yet" and another asked for more digitised documents. It was also mentioned that there were "not enough categories for searching" (we speculate that this might relate to the subject terms we have entered). Another comment received was about the term 'accrual' which is used as a field name within AtoM and also within the data that we enter in that field. It was suggested that this word might be a bit off-putting for some users. It was also mentioned that the lists within the Scope and Content field were "pretty hard reading" and a suggestion was made that this would be more user-friendly if presented as a bulleted list rather than a paragraph of text.

Our response: We did expect to get comments about our data. Just because we have launched our catalogue we do not consider it to be a finished piece of work. Further work on populating the catalogue and a fuller exploration of the functionality around digital objects will follow over the next couple of years. It was interesting to get the feedback about the word 'accrual' - we had actually anticipated much more feedback about the terminology that we use but hadn't considered this word in particular. I do agree that this word is a tricky one for non-archivists and I'm pretty sure I had not encountered it before I came to work at the Borthwick Institute. We don't want to change it on the basis of one comment but did decide to add the term to our glossary (one of the help pages we have created within AtoM) and hope that this helps our users.

The help pages - In our questionnaire we asked people specifically whether they used the catalogue help pages. The majority of users surveyed didn't use the help pages and this was not a surprising result. One person's reason for not using the help was because they "should not need to in a well designed information system". Another person stated that they preferred to "just see if I could use the catalogue instinctively". A couple of people mentioned that the page was too text heavy and someone else reported that they didn't know there were any help pages. Someone also suggested that the help pages should open in a new window.

Our response: As a result of the user testing we have made several changes to our help pages. We have updated the text (specifically to explain how to reduce the number of search results) and added a number of screenshots to help convey the information in a more visual way. 

Our help pages are now more visual and include screenshots - the first graphic simply shows how to access the search box. We have also created some printed and laminated copies of these for use in the searchroom.

Of course we can put a lot of effort into putting the right level of information into our help pages but we can not force people to use them! So, over the last couple of weeks we have been ensuring that our searchroom assistants (the people who will be providing front line support to our users as they grapple with our new catalogue) are aware of the different search options within AtoM and understand how they can be used to best effect.

There are also things we can do to make it clearer to users where the help pages are so that they can easily find them if they want to. By default the help pages in AtoM appear under an 'i' icon alongside other static pages. Replacing this 'i' icon with a '?' seemed to be a sensible step to take in order to make it clearer where help could be found. Artefactual Systems were able to point us to the relevant icon in Font Awesome which was just what we needed to implement this little change. 

We agreed that it may be useful for the help pages to open in a new tab so that someone could access them without losing their place within the catalogue (particularly given that 'getting lost' was also an issue that had been reported). Our help pages now open within a separate tab. We will monitor how users respond to this and whether the potential proliferation of tabs becomes a problem.

It has been a useful exercise reviewing this initial sample of responses and giving some thought to how AtoM and our own implementation of it can be improved. We will be continuing to gather user feedback through further, more detailed testing with a smaller sample of users and by pulling together the ad hoc comments we are likely to receive now that our catalogue is live.

Thursday, 7 April 2016

Our catalogue is now live!

Was it really 3.5 years ago when I first blogged about requirements for a new archival management system?

My main aim in getting involved in this project was to create a stable base to build a digital archive on.

If you build a digital archive on wobbly foundations there is a strong chance that it will fall over.

Much safer to build it on top of a system established as the single point of truth for all accessions information your organisation holds. A system which will become the means by which you disseminate information about your digital holdings (alongside the physical ones) and enable users to access copies of born digital and digitised material.

Finally we have such a solution in place!

We chose Access to Memory (AtoM) as our new archival management system, and over the last few years there has been a huge amount of work going on behind the scenes getting it up and running. I'm so pleased that today we are in a position to unveil the results of all of that hard work.

Our new catalogue can be viewed at

In a previous blog post "A is for AtoM" I talked about some of the tasks that have been going on and decisions that have been made to get us up and running, so I won't repeat all of that here.

Suffice to say that a considerable amount of work has gone in to getting AtoM installed, configured and styled. While this has been going on, Project Genesis has been key to getting the catalogue populated with archival descriptions. The task of populating our catalogue will continue via Project Genesis until April 2017 and by other channels beyond that.

While our initial focus has been to get a collection level description for each of our archives into the catalogue, further work is required on the wider task of retroconversion - getting a variety of finding aids in a range of different formats into the system. We have managed to tackle some of this in an ad hoc way but there is still much to do.

Our AtoM catalogue is live, but our work is not yet done. I need to start thinking about how we can build digital preservation functionality on top of this (via Archivematica) and of course how we can start to provide more access to our digital holdings through the catalogue interface. Watch this space!

In the meantime, we'd be happy to hear any feedback about our catalogue so do get in touch.