Monday 18 September 2017

Harvesting EAD from AtoM: we need your help!

Back in February I published a blog post about a project to develop AtoM to allow EAD (Encoded Archival Description) to be harvested via OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting): “Harvesting EAD from AtoM: a collaborative approach”.

Now that AtoM version 2.4 is released (hooray!), containing the functionality we have sponsored, I thought it was high time I updated you on what has been achieved by this project, where more work is needed and how the wider AtoM community can help.


What was our aim?


Our development work had a few key aims:

  • To enable finding aids from AtoM to be exposed as EAD 2002 XML for others to harvest. The partners who sponsored this project were particularly keen to enable the Archives Hub to harvest their EAD.
  • To change the way that EAD was generated by AtoM in order to make it more scalable. Moving EAD generation from the web browser to the job scheduler was considered to be the best approach here.
  • To change the existing DC (Dublin Core) metadata generation feature so that it also works through the job scheduler - making this existing feature more scalable and able to handle larger quantities of data.

A screen shot of the job scheduler in AtoM - showing the EAD and DC creation jobs that have been completed

What have we achieved?

The good

We believe that the EAD harvesting feature as released in AtoM version 2.4 will enable a harvester such as the Archives Hub to harvest our catalogue metadata from AtoM as EAD. As we add new top level archival descriptions to our catalogue, subsequent harvests should pick up and display these additional records. 

This is a considerable achievement and something that has been on our wishlist for some time. This will allow our finding aids to be more widely signposted. Having our data aggregated and exposed by others is key to ensuring that potential users of our archives can find the information that they need.

Changes have also been made to the way metadata (both EAD and Dublin Core) is generated in AtoM. This makes the solution more scalable going forward for those AtoM instances that have very large numbers of records or large descriptive hierarchies.

The new functionality in AtoM around OAI-PMH harvesting of EAD and settings for moving XML creation to the job scheduler is described in the AtoM documentation.
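To give a flavour of what this looks like in practice, here is a minimal sketch (in Python, standard library only) of a first ListRecords request for EAD. The base URL is a hypothetical AtoM instance, and I've assumed the EAD metadata prefix is oai_ead - a ListMetadataFormats request against your own instance will confirm what it actually advertises.

# A minimal sketch of a first EAD harvest from an AtoM 2.4 OAI-PMH endpoint.
# The base URL and endpoint path are hypothetical - adjust to match your
# installation; confirm the metadata prefix via ListMetadataFormats.
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://atom.example.org/;oai"  # hypothetical AtoM instance

params = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_ead"})
with urlopen(f"{BASE_URL}?{params}") as response:
    tree = ET.parse(response)

# Print the OAI identifier of each harvested top-level description.
for header in tree.iter(f"{OAI_NS}header"):
    print(header.findtext(f"{OAI_NS}identifier"))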

The not-so-good

Unfortunately the EAD harvesting functionality within AtoM 2.4 will not do everything we would like it to do. 

At this point it does not give the harvester any way of knowing when metadata records have been updated or deleted. Nor does it pick up new child records that are added into an existing descriptive hierarchy.

We want to be able to edit our records once within AtoM and have any changes reflected in the harvested versions of the data; we don’t want the two copies to drift out of sync.

So clearly this isn't ideal.

The task of enabling full harvesting functionality for EAD turned out to be considerably more complex than first anticipated. This has no doubt been compounded by the hierarchical nature of EAD, which differs from the relative simplicity of the traditional Dublin Core approach.

The problems encountered are certainly not insurmountable, but a lack of additional resources and the timeline for the release of AtoM 2.4 stopped us from finishing this work in full.

A note on scalability


Although the development work deliberately set out to consider issues of scalability, it turns out that scalability is actually on a sliding scale!

The National Library of Wales had the forethought to include one of their largest archival descriptions as sample data in the version of AtoM 2.4 that Artefactual deployed for testing. Their finding aid for the St David’s Diocesan Records is a very large descriptive hierarchy consisting of 33,961 individual entries. This pushed the limits of EAD creation (even when done via the job scheduler) and also led to discussions with The Archives Hub about exactly how they would process and display such a large description at their end, even if EAD generation within AtoM were successful.

Some more thought and more manual workarounds will need to be put in place to manage the harvesting and subsequent display of large descriptions such as these.

So what next?


We are keen to get AtoM 2.4 installed at the Borthwick Institute for Archives over the next couple of months. We are currently on version 2.2 and would like to start benefiting from all the new features that have been introduced... and of course to test in earnest the EAD harvesting feature that we have jointly sponsored.

We already know that this feature will not fully meet our needs in its current form, but would like to set up an initial harvest with the Archives Hub and further test some of our assumptions about how this will work.

We may need to put some workarounds in place to ensure that we have a way of reflecting updates and deletions in the harvested data – either through manual deletes and updates, or through a periodic full delete and re-harvest.
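As a sketch of what the second of those workarounds might look like, the fragment below (Python, standard library only; the endpoint and the local storage layout are both hypothetical) discards the local copy and re-harvests everything, following resumptionToken paging as the OAI-PMH specification requires, so that stale updates and deletions cannot linger.

# A sketch of a periodic 'full delete and re-harvest' workaround.
# Endpoint and local store are hypothetical.
import shutil
from pathlib import Path
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://atom.example.org/;oai"   # hypothetical AtoM instance
STORE = Path("harvested_ead")                # hypothetical local store

def full_reharvest():
    # Start from scratch so records updated or deleted in AtoM cannot linger.
    shutil.rmtree(STORE, ignore_errors=True)
    STORE.mkdir()
    params = {"verb": "ListRecords", "metadataPrefix": "oai_ead"}
    while True:
        with urlopen(f"{BASE_URL}?{urlencode(params)}") as response:
            tree = ET.parse(response)
        for record in tree.iter(f"{OAI_NS}record"):
            ident = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
            filename = ident.replace("/", "_").replace(":", "_") + ".xml"
            (STORE / filename).write_bytes(ET.tostring(record))
        token = tree.findtext(f".//{OAI_NS}resumptionToken")
        if not token:
            break
        # Subsequent pages carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token}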

Harvesting in AtoM 2.4 - some things that need to change


So we have a list of priority things that need to be improved in order to get EAD harvesting working more smoothly in the future:


In line with the OAI-PMH specification

  • AtoM needs to expose updates to the metadata to the harvester
  • AtoM needs to expose new records (at any level of description) to the harvester
  • AtoM needs to expose information about deletions to the harvester (a sketch of the incremental harvest these changes would enable follows this list)
  • AtoM also needs to expose information about deletions of DC metadata to the harvester (it came to my attention during the course of this project that this isn’t happening at the moment)
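To make the first three items concrete, this is roughly what a compliant incremental harvest would look like from the harvester’s side once AtoM exposes updates and deletions: the harvester passes a 'from' datestamp and acts on the status attribute of each record header, as the OAI-PMH specification describes. The endpoint is hypothetical.

# A sketch of the incremental harvest that spec-compliant behaviour would
# enable: fetch only records added, changed or deleted since a given date,
# and act on any 'deleted' status in the header. Endpoint is hypothetical.
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://atom.example.org/;oai"  # hypothetical AtoM instance

params = urlencode({
    "verb": "ListRecords",
    "metadataPrefix": "oai_ead",
    "from": "2017-09-01",  # everything touched since the last harvest
})
with urlopen(f"{BASE_URL}?{params}") as response:
    tree = ET.parse(response)

for record in tree.iter(f"{OAI_NS}record"):
    header = record.find(f"{OAI_NS}header")
    ident = header.findtext(f"{OAI_NS}identifier")
    if header.get("status") == "deleted":
        print(f"remove local copy of {ident}")            # reflect the deletion
    else:
        print(f"create or update local copy of {ident}")  # new or changed record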

Some other areas of potential work


I also wanted to bring together and highlight some other areas of potential work for the future. These are all things that were discussed during the course of the project but were not within the scope of our original development goals.

  • Harvesting of EAC (Encoded Archival Context) - this is the metadata standard for authority records. Is this something people would like to see enabled in the future? Of course this is only useful if you have someone who actually wants to harvest this information!
  • On the subject of authority records, it would be useful to change the current AtoM EAD template to use @authfilenumber and @source, so that an EAD record can link back to the relevant authority record in the local AtoM site (a sketch of such a fragment follows this list). The ability to create rich authority records is a key strength of AtoM, allowing an institution to weave rich interconnecting stories about their holdings. If harvesting doesn’t preserve this inter-connectivity then I think we are missing a trick!
  • EAD3 - this development work has deliberately not touched on the new EAD standard. Firstly, this would have been a much bigger job and secondly, we are looking to have our EAD harvested by The Archives Hub and they are not currently working with EAD3. This may be a priority area of work for the future.
  • Subject source - the subject source (for example "Library of Congress Subject Headings") can be entered into AtoM but doesn't currently appear in AtoM-generated EAD - this would be a really useful addition to the EAD.
  • Visible elements - AtoM allows you to decide which elements you wish to display or hide in your local AtoM interface. With the exception of information relating to physical storage, the XML generation tasks currently take no account of visible elements and will export all fields. This deserves further investigation: if an institution is using the visible elements feature to hide certain bits of information that should not be more widely distributed, it would be a real concern if that information were harvested and displayed elsewhere. As certain elements will be required in order to create valid EAD, this may get complicated!
  • ‘Manual’ EAD generation - the project team discussed the possibility of adding a button to the AtoM user interface so that staff users can manually kick off EAD regeneration for a single descriptive hierarchy. Artefactual suggested this as a way of managing EAD generation for large descriptive hierarchies. You would not want the EAD to regenerate with each minor tweak while a large archival description was undergoing several updates; however, you do need to be able to trigger the task when you are ready. It should be possible to switch off the automatic EAD re-generation (which normally triggers when a record is edited and saved) but have a button on the interface that staff can click when they want to initiate the process - for example when all edits are complete.
  • As part of their work on this project, Artefactual created a simple script to help with the process of generating EAD for large descriptive hierarchies - it basically provides a way of finding out which XML files relate to a specific archival description so that EAD can be manually enhanced and updated if it is too large for AtoM to generate via the job scheduler. It would be useful to turn this script into a command-line task that is maintained as part of the AtoM codebase.
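To illustrate the authority-linking suggestion above, the fragment below builds the kind of creator element a revised EAD template could emit, with @source and @authfilenumber pointing back to an authority record in the local AtoM site. All of the values (the source label, the authority URL, the name itself) are hypothetical.

# A sketch of an EAD 2002 creator name carrying @source and @authfilenumber,
# so harvested EAD can link back to the authority record in the local AtoM
# site. All values here are hypothetical.
import xml.etree.ElementTree as ET

origination = ET.Element("origination", label="Creator")
persname = ET.SubElement(
    origination,
    "persname",
    source="AtoM authority records",                     # hypothetical @source
    authfilenumber="https://atom.example.org/doe-jane",  # hypothetical authority URL
)
persname.text = "Doe, Jane"

print(ET.tostring(origination, encoding="unicode"))
# -> <origination label="Creator"><persname source="AtoM authority records"
#    authfilenumber="https://atom.example.org/doe-jane">Doe, Jane</persname></origination>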

We need your help!


Although we believe we have something we can work with here and now, we are not under any illusions that this feature does all that it needs to in order to meet our requirements in the longer term. 

I would love to find out what other AtoM users (and harvesters) think of the feature. Is it useful to you? Are there other things we should put on the wishlist? 

There is a lot of additional work described in this post which the original group of project partners are unlikely to be able to fund on their own. If EAD harvesting is a priority for you and your organisation, and you think you can contribute to further work in this area either on your own or as part of a collaborative project, please do get in touch.


Thanks


I’d like to finish with a huge thanks to those organisations who have helped make this project happen, either through sponsorship, development or testing and feedback.




Jenny Mitcham, Digital Archivist
