Digital Archiving at the University of York: May 2018

Friday, 18 May 2018

UK Archivematica meeting at Westminster School

Yesterday the UK Archivematica user group meeting was held in the historic location of Westminster School in central London.

A pretty impressive location for a meeting!
(credit: Elizabeth Wells)

In the morning, once fuelled with tea, coffee and biscuits, we set about discussing our infrastructures and workflows. It was great to hear from a range of institutions about how Archivematica fits into the bigger picture for them. One of the points that lots of attendees made was that progress can be slow. Many of us were slightly frustrated that we aren't making faster progress in establishing our preservation infrastructures but I think it was a comfort to know that we were not alone in this!

I kicked things off by showing a couple of diagrams of our proposed and developing workflows at the University of York. Firstly illustrating our infrastructure for preserving and providing access to research data and secondly looking at our hypothetical workflow for born digital content that comes to the Borthwick Institute.

Now that our AtoM upgrade is complete and Archivematica 1.7 has been released, I am hoping that colleagues can set up a test instance of AtoM talking to Archivematica that I can start to play with. In a parallel strand, I am encouraging colleagues to consider and document access requirements for digital content. This will be invaluable when thinking about what sort of experience we are trying to implement for our users. The decision is yet to be made around whether AtoM and Archivematica will meet our needs on their own or whether additional functionality is needed through an integration with Fedora and Samvera (the software on which our digital library runs)...but that decision will come once we better understand what we are trying to achieve and what the solutions offer.

Elizabeth Wells from Westminster School talked about the different types of digital content that she would like Archivematica to handle and different workflows that may be required depending on whether it is born digital or digitised content, whether a hybrid or fully digital archive and whether it has been catalogued or not. She is using Archivematica alongside AtoM and considers that her primary problems are not technical but revolve around metadata and cataloguing. We had some interesting discussion around how we would provide access to digital content through AtoM if the archive hadn't been catalogued.

Anna McNally from the University of Westminster reminded us that information about how they are using Archivematica is already well described in a webinar that is now available on YouTube: Work in Progress: reflections on our first year of digital preservation. They are using the PERPETUA service from Arkivum and they use an automated upload folder in NextCloud to move digital content into Archivematica. They are in the process of migrating from CALM to AtoM to provide access to their digital content. One of the key selling points of AtoM for them is its support for different languages and character sets.

Chris Grygiel from the University of Leeds showed us some infrastructure diagrams and explained that this is still very much a work in progress. Alongside Archivematica, he is using BitCurator to help appraise the content and EPrints and EMU for access.

Rachel MacGregor from Lancaster University updated us on work with Archivematica at Lancaster. They have been investigating both Archivematica and Preservica as part of the Jisc Research Data Shared Service pilot. The system that they use has to be integrated in some way with PURE for research data management.

After lunch in the dining hall (yes it did feel a bit like being back at school), Rachel MacGregor (shouting to be heard over the sound of the bells at Westminster) kicked off the afternoon with a presentation about DMAonline. This tool, originally created as part of the Jisc Research Data Spring project, is under further development as part of the Jisc Research Data Shared Service pilot.

It provides reporting functionality for a range of systems in use for research data management including Archivematica. Archivematica itself does not come with advanced reporting functionality - it is focused on the primary task of creating an archival information package (AIP).

The tool (once in production) could be used by anyone regardless of whether they are part of the Jisc Shared Service or not. Rachel also stressed that it is modular - though it can gather data from a whole range of systems, it could also work just with Archivematica if that is the only system you are interested in reporting on.

An important part of developing a tool like this is to ensure that communication is clear - if you don’t adequately communicate to the developers what you want it to do, you won’t get what you want. With that in mind, Rachel has been working collaboratively to establish clear reporting requirements for preservation. She talked us through these requirements and asked for feedback. They are also available online for people to comment on:

Go to jira.dmao.org and click on create an account to create your account
Then go to: https://confluence.dmao.org/display/DMAO/DMAonline
To see all the preservation requirements you can click on Feature requests and choose the option Preservation features

Sean Rippington from the University of St Andrews talked us through some testing he has carried out, looking at how files in SharePoint could be handled by Archivematica. St Andrews are one of the pilot organisations for the Jisc Research Data Shared Service, and they are also interested in the preservation of their corporate records. There doesn’t seem to be much information out there about how SharePoint and Archivematica might work together, so it was really useful to hear about Sean’s work.

He showed us inside a sample SharePoint export file (a .cmp file). It consisted of various office documents (the documents that had been put into SharePoint) and other metadata files. The office documents themselves had lost much of their original metadata - they had been renamed with a consecutive number and given a .DAT file extension. The date last modified had changed to the date of export from SharePoint. However, all was not lost: a manifest file was included in the export and contained lots of valuable metadata, including the last modified date, the filename, the file extension and the names of the people who created and last modified each file.

Sean tried putting the .cmp file through Archivematica to see what happens. He found that Archivematica correctly identified the MS Office files (regardless of change of file extension) but obviously the correct (original) metadata was not associated with the files. This continued to be stored in the associated manifest file. This has potential for confusing future users of the digital archive - the metadata gives useful context to the files and if hidden in a separate manifest file it may not be discovered.
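
To illustrate why a changed file extension need not stop identification, here is a minimal sketch of a signature ("magic number") check in Python. It is not how Archivematica itself works - it relies on dedicated format identification tools with much richer rules - and the function names are my own, purely for illustration. The two byte signatures are the well-known ones for the older binary Office container and for ZIP packages (which the newer Office formats are built on).

```python
# Simplified signature check; real identification tools use far richer rules.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"  # .docx/.xlsx/.pptx are ZIP packages


def sniff_container(header: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Classify a file by its first 8 bytes; the file extension is ignored."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(8))
```

A file exported as 00000001.DAT would still be reported as "ole2" or "zip" here, because only the content of the file is inspected.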

Another approach he took was to use the information in the manifest file to rename the files and assign them with their correct file extensions before pushing them into Archivematica. This might be a better solution in that the files that will be served up in the dissemination information package (DIP) will be named correctly and be easier for users to locate and understand. However, this was a manual process and probably not scalable unless it could be automated in some way.
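
If this renaming step were to be automated, it might look something like the sketch below. This is only an illustration under assumptions: it supposes the manifest has already been read into a list of (exported name, original name) pairs, which is a hypothetical structure rather than the real layout of a SharePoint manifest, and the function names are invented for the example.

```python
from pathlib import Path


def plan_renames(entries):
    """Map each exported file name to its original name, keeping targets unique.

    `entries` is a list of (exported_name, original_name) pairs, assumed to
    have been read from the export manifest beforehand (hypothetical structure).
    """
    plan = {}
    used = set()
    for exported_name, original_name in entries:
        target = original_name
        stem = Path(original_name).stem
        suffix = Path(original_name).suffix
        counter = 1
        # Documents from different folders may share a name; keep both.
        while target.lower() in used:
            counter += 1
            target = f"{stem}_{counter}{suffix}"
        used.add(target.lower())
        plan[exported_name] = target
    return plan


def apply_renames(directory, plan):
    """Rename files inside `directory` according to `plan`."""
    base = Path(directory)
    for exported_name, target in plan.items():
        (base / exported_name).rename(base / target)
```

Building the plan separately from applying it means the proposed names can be checked by a person before any file is touched, and it would be sensible to work on a copy of the export rather than the original.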

He ended with lots of questions and would be very glad to hear from anyone who has done further work in this area.

Hrafn Malmquist from the University of Edinburgh talked about his use of Archivematica’s appraisal tab and described a specific use case for Archivematica with its own particular requirements. The records of the University court have been deposited as born digital since 2007 and need to be preserved and made accessible with full text searching to aid retrieval. This has been achieved using a combination of Archivematica and DSpace and by adding a package.csv file containing appropriate metadata that can be understood by DSpace.

Laura Giles from the University of Hull described ongoing work to establish a digital archive infrastructure for the Hull City of Culture archive. They had an appetite for open source and prior experience with Archivematica so they were keen to use this solution, but they did not have the in-house resource to implement it. Hull are now working with CoSector at the University of London to plan and establish a digital preservation solution that works alongside their existing repository (Fedora and Samvera) and archives management system (CALM). Once this is in place they hope to use similar principles for other preservation use cases at Hull.

We then had time for a quick tour of Westminster School archives followed by more biscuits before Sarah Romkey from Artefactual Systems joined us remotely to update us on the recent new Archivematica release and future plans. The group is considering taking her up on her suggestion to provide some more detailed and focused feedback on the appraisal tab within Archivematica - perhaps a task for one of our future meetings.

Talking of future meetings ...we have agreed that the next UK Archivematica meeting will be held at the University of Warwick at some point in the autumn.

Jenny Mitcham, Digital Archivist

Friday, 4 May 2018

The anatomy of an AtoM upgrade

Yesterday we went live with our new upgraded production version of AtoM.

We've been using AtoM version 2.2 since we first unveiled the Borthwick Catalogue to the world two years ago. Now we have finally taken the leap to version 2.4.

We are thrilled to benefit from some of the new features - including the clipboard, being able to search by date range and the full width treeview. Of course we are also keen to test the work we jointly sponsored last year around exposing EAD via OAI-PMH for harvesting.

But what has taken us so long you might ask?

...well, upgrading AtoM has been a new experience for us and one that has involved a lot of planning behind the scenes. The technical process of upgrading has been ably handled by our systems administrator. Much of his initial work behind the scenes has been on 'puppetising' AtoM to make it easier to manage multiple versions of AtoM going forward. In this post though I will focus on the less technical steps we have taken to manage the upgrade and the decisions we have made along the way.

Checking the admin settings

One of the first things I did when I was given a test version of 2.4 to play with was to check out all of the admin settings to see what had changed.

All of our admin settings for AtoM are documented in a spreadsheet alongside a rationale for our decisions. I wanted to take some time to understand the new settings, read the documentation and decide what would work for us.

Some of these decisions were taken to a meeting for a larger group of staff to discuss. I've got a good sense of how we use AtoM but I am not really an AtoM user so it was important that others were involved in the decision making.

Most decisions were relatively straightforward and uncontroversial but the one that we spent most time on was deciding whether or not to change the slugs...

Slugs

In AtoM, the 'slug' is the last element of the url for each individual record within the catalogue - it has to be unique so that all the urls go to the right place. In previous versions of AtoM the slugs were automatically generated from the title of each record. This led to some interesting and varied urls.

Some of them were really long - if the title of the record was really long
Some of them were short and very cryptic - if the record hadn't been given a title prior to the first save
Many of our titles are not unique - for example, we have lots of records simply called 'correspondence' in the catalogue. Where titles are not unique, AtoM will use the title and then append it with a number in order to create a unique slug (eg: correspondence-150)

Slugs are therefore hard to predict ...and it is not always possible to look at a slug and know which archive it refers to.

This possibly doesn't matter, but could become an issue for us in the future should we wish to carry out more automated data manipulation or system integrations.

AtoM 2.4 now allows you to choose which fields your slugs are generated from. We have decided that it would be better if ours were generated from the identifier of the record rather than the title. The reason is that identifiers are generally quite short and sweet and of course should be unique (though we recently realised that this isn't enforced in AtoM).
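
To make the difference concrete, here is a rough sketch in Python of how a slug might be generated from a piece of text and kept unique by adding a number. This is not AtoM's actual code - the function names, the numbering scheme and the example identifier are all assumptions for illustration.

```python
import re


def slugify(text):
    """Lower-case the text and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def unique_slug(text, taken):
    """Return a slug not yet in `taken`, appending -2, -3, ... on collision."""
    base = slugify(text) or "untitled"
    slug, counter = base, 1
    while slug in taken:
        counter += 1
        slug = f"{base}-{counter}"
    taken.add(slug)
    return slug
```

Starting from an empty set, two records titled 'Correspondence' would come out as correspondence and correspondence-2, whereas a made-up identifier such as ABC/1/2 would give abc-1-2, which tells you rather more about where the record sits.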

But of course this is not a decision that can be taken lightly. Our catalogue has been live for 2 years now and users will have set up links and bookmarks to particular records within it. On balance we decided that it would be better to change the slugs and do our best to limit the impact on users.

So, we have changed the admin setting to ensure future slugs are generated using the identifier. We have run a script provided by Artefactual Systems that changed all the slugs that are already in the database. We have set up a series of redirects from all the old urls of top level descriptions in the catalogue to the new urls (note that having had a good look at the referrer report in Google Analytics it was apparent that external links to the catalogue generally point at top level descriptions).
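
For anyone facing a similar job, a list of redirects can be generated from a simple mapping of old slugs to new ones. The sketch below is an illustration rather than a record of what we did: the web server (nginx), the URL prefix and the function name are all assumptions, so the output would need adapting to the actual server configuration.

```python
def redirect_rules(slug_pairs, prefix="/index.php/"):
    """Build one permanent-redirect rule per (old_slug, new_slug) pair.

    The nginx `location = ... { return 301 ...; }` form and the URL prefix
    are assumptions for illustration; adapt both to the actual server setup.
    """
    rules = []
    for old_slug, new_slug in slug_pairs:
        if old_slug == new_slug:
            continue  # nothing to redirect
        rules.append(
            f"location = {prefix}{old_slug} {{ return 301 {prefix}{new_slug}; }}"
        )
    return rules
```

Using a permanent (301) redirect tells browsers and search engines that the old address has been replaced for good, which matters when existing links and bookmarks need to keep working.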

Playing and testing

It was important to do a certain amount of testing and playing around with AtoM 2.4, and equally important that I wasn't the only one doing it - I encouraged all my colleagues to have a go too.

First I checked the release notes for versions 2.3 and 2.4 so I had a good sense of what had changed and where I should focus my attention. I was then able to test these new features and direct colleagues to them as appropriate for further testing or discussion.

While doing so, I tried to think about whether any of these changes would necessitate changes in our workflows and processes or updates to our staff handbook.

As an example - it was noted that there was a new field to record occupations for authority records. Rather than letting individuals decide how to use this field, it is important to agree an institutional approach and consider an appropriate methodology or taxonomy. As it happens, we have decided not to use this field for the time being and this will be documented accordingly.

Assessing known bugs

Being a bit late to the upgrade party gives us the opportunity to assess known bugs and issues with a release. I spent some time looking at Artefactual's issues log for AtoM to establish whether any of them were going to cause us major problems or require a workaround to be put in place.

There are lots of issues recorded and I looked through many of them (but not all!). Fortunately, very few looked like they would have an impact on us. Most related to functionality we don't utilise - such as the ability to use AtoM with multiple institutions or translate it into multiple languages.

The one bug that I thought would be irritating for us was related to the accessions counter which was not incrementing in version 2.4. Having spent a bit of time testing, it seemed that this wasn't a deal breaker for us and there was a workaround we could put in place to enable staff to continue to create accession records with a unique identifier relatively easily.

Testing local workarounds

Next I tested one of the local workarounds we have for AtoM. We use a CSS print stylesheet to help us to generate an accessions report to send donors and depositors to confirm receipt of an archive. This still worked in the new version of AtoM with no issues. Hoorah!

Look and feel

We gave a bit of thought to how AtoM should be styled. Two years ago we went live with a slightly customised version of the Dominion theme. This had been styled to look similar to our website (which at the time was branded orange).

In the last year, the look and feel of the University website has changed and we are no longer orange! Some thought needed to be given to whether we should change the look of our catalogue now to keep it consistent with our website. After some discussion it was agreed that our existing AtoM theme should be maintained for the time being.

We did however think it was a good idea to adopt the font of the University website, but when we tested this out on our AtoM instance it didn't look as clear...so that decision was quickly reversed.

Usability testing

When we first launched our catalogue we carried out a couple of rounds of user testing (read about it here and here) but this was quite a major piece of work and took up a substantial amount of staff time.

With this upgrade we were keen to give some consideration to the user experience but didn't have resource to invest in more user testing.

Instead we recruited the Senior User Experience Designer at our institution to cast his eye over our version of AtoM 2.4 and give us some independent feedback on usability and accessibility. It was really useful to get a fresh pair of eyes to look at our site, but as this could be a whole blog post in itself I won't say any more here...watch this space!

Updating our help pages

Another job was to update both the text and the screenshots on our static help pages within AtoM. There have been several changes since 2.2 and some of these are reflected in the look and feel of the interface.

The advanced search looks a bit different in version 2.4 - here is the refreshed screenshot for our help pages

We were also keen to add in some help for our users around the clipboard feature and to explain how the full width treeview works.

The icons for different types of information within AtoM have also been brought out more strongly in this version, so we also wanted to flag up what these meant for our users.

...and that reminds me, we really do need a less Canada-centric way to indicate place!

Updating our staff handbook

Since we adopted AtoM a few years ago we have developed a whole suite of staff manuals which record how we use AtoM, including tips for carrying out certain procedures and information about what to put in each field. With the new changes brought in with this upgrade, we of course had to update our internal documentation.

When to upgrade?

As we drew ever closer to our 'go live' date for the upgrade we were aware that Artefactual were busy preparing their 2.4.1 bug fix release. We were very keen to get the bug fixes (particularly for that accessions counter bug that I mentioned) but were not sure how long we were prepared to wait.

Luckily with helpful advice from Artefactual we were able to follow some instructions from the user forum and install from the GitHub code repository instead of the tarball download on the website. This meant we could benefit from those bug fixes that were already stable (and pull others to test as they become available) without having to wait for the formal 2.4.1 release.

No need to delay our upgrade further!

As it happens it was good news we upgraded when we did. The day before the upgrade we hit a bug in version 2.2 during a re-index of Elasticsearch. It was good to know we had a nice clean version of 2.4 ready to go the next day!

Finishing touches

On the 'go live' date we'd put word around to staff not to edit the catalogue while we did the switch. Our systems administrator got all the data from our production version of 2.2 freshly loaded into 2.4, ran the scripts to change the slugs and re-indexed the database. I just needed to do a few things before we asked IT to do the Domain Name System switch.

First I needed to check all the admin settings were right - a few final tweaks were required here and there. Second I needed to load up the Borthwick logo and banner to our archival institution record. Thirdly I needed to paste the new help and FAQ text into the static pages (I already had this prepared and saved elsewhere).

Once the DNS switch was done we were live at last!

Sharing the news

Of course we wanted to publicise the upgrade to our users and tell them about the new features that it brings.

We've put AtoM back on the front page of our website and added a news item.

Let's tell the world all about it, with a catalogue banner and news item

My colleague has written a great blog post aimed at our users and telling them all about the new features, and of course we've all been enthusiastically tweeting!

...and a whole lot of tweeting

Future work

The upgrade is done but work continues. We need to ensure harvesting to our library catalogue still works and of course test out the new EAD harvesting functionality. Later today we will be looking at Search Engine Optimisation (particularly important since we changed our slugs). We also have some remaining tasks around finding aids - uploading pdfs of finding aids for those archives that aren't yet fully catalogued in AtoM using the new functionality in 2.4.
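
As a starting point for testing the harvesting, an OAI-PMH request is just a URL with a verb and a metadata format. The sketch below builds a ListRecords request in Python. The base URL is a placeholder and the helper name is my own; the metadata prefix for EAD used in the example (oai_ead) is an assumption that should be checked against what the repository reports in response to a ListMetadataFormats request.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return f"{base_url}?{urlencode(params)}"
```

For example, list_records_url("https://example.org/oai", "oai_ead") gives https://example.org/oai?verb=ListRecords&metadataPrefix=oai_ead, which can be pasted into a browser to see what a harvester would receive.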

But right now I've got a few broken links to fix...

Jenny Mitcham, Digital Archivist
