
Friday, 7 April 2017

Archivematica Camp York: Some thoughts from the lake

Well, that was a busy week!

Yesterday was the last day of Archivematica Camp York - an event organised by Artefactual Systems and hosted here at the University of York. The camp's intention was to provide a space for anyone interested in or currently using Archivematica to come together, learn about the platform from other users, and share their experiences. I think it succeeded in this, bringing together 30+ 'campers' from across the UK, Europe and as far afield as Brazil for three days of sessions covering different aspects of Archivematica.

Our pod on the lake (definitely a lake - not a pond!)
My main goal at camp was to ensure everyone found their way to the rooms (including the lakeside pod) and that we were suitably fuelled with coffee, popcorn and cake. Alongside these vital tasks I also managed to partake in the sessions, have a play with the new version of Archivematica (1.6) and learn a lot in the process.

I can't possibly capture everything in this brief blog post so if you want to know more, have a look back at all the #AMCampYork tweets.

What I've focused on below are some of the recurring themes that came up over the three days.

Workflows

Archivematica is just one part of a bigger picture for institutions that are carrying out digital preservation, so it is always very helpful to see how others are implementing it and which systems they are integrating it with. A session on workflows, in which participants were invited to talk about their own implementations, was really interesting.

Other sessions also helped highlight the variety of configurations and workflows that are possible with Archivematica. I hadn't quite realised there were so many different ways you could carry out a transfer!

In a session on specialised workflows, Sara Allain talked us through the different options. One workflow I hadn't been aware of before was the ability to include checksums as part of your transfer. This sounds like something I need to take advantage of when I get Archivematica into production for the Borthwick. 
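For anyone wanting to picture how that works: the idea (as I understand it, and worth verifying against the Archivematica documentation for your version) is that a transfer can include a pre-existing checksum manifest in a metadata folder, which Archivematica then verifies on ingest. Here is a minimal Python sketch for generating one, assuming a hypothetical transfer directory containing an objects folder:

```python
import hashlib
from pathlib import Path

# Hypothetical transfer layout (directory names are placeholders):
#   my_transfer/
#       objects/    <- the digital objects being transferred
#       metadata/   <- Archivematica looks here for checksum.md5/.sha1/.sha256
transfer = Path("my_transfer")
manifest = transfer / "metadata" / "checksum.sha256"
manifest.parent.mkdir(parents=True, exist_ok=True)

with manifest.open("w") as out:
    for f in sorted((transfer / "objects").rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            # Paths are written relative to the metadata directory here;
            # check the Archivematica docs for the exact convention your
            # version expects.
            rel = f.relative_to(transfer / "objects")
            out.write(f"{digest}  ../objects/{rel}\n")
```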

Justin talking about Automation Tools
A session on Automation Tools with Justin Simpson highlighted other possibilities - using Archivematica in a more automated fashion. 

We already have some experience of using Automation Tools at York from the work we carried out during phase 3 of Filling the Digital Preservation Gap; however, I was struck by how many different ways these can be applied. Hearing examples from other institutions and for a variety of different use cases was really helpful.
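To give a flavour of what "more automated" means in practice: the Automation Tools scripts essentially drive Archivematica's REST API on a schedule (for example from cron). Below is a rough Python sketch of the kind of API call involved, based on my recollection of the 1.x API - the endpoint, field names and authentication scheme should all be checked against the documentation, and every value here is a placeholder:

```python
import base64
import requests

# Placeholders - none of these values are real.
AM_URL = "http://archivematica.example.ac.uk"
API_USER = "demo"
API_KEY = "REPLACE_ME"
TS_LOCATION = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # transfer source UUID

# As I recall, each path is sent as base64("location_uuid:path").
path = base64.b64encode(f"{TS_LOCATION}:/deposits/my_transfer".encode())

resp = requests.post(
    f"{AM_URL}/api/transfer/start_transfer/",
    headers={"Authorization": f"ApiKey {API_USER}:{API_KEY}"},
    data={"name": "my_transfer", "type": "standard", "paths[]": [path]},
)
resp.raise_for_status()
print(resp.json())
```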


Appraisal

The camp included a chance to play with Archivematica version 1.6 (which was only released a couple of weeks ago) as well as an introduction to the new Appraisal and Arrangement tab.

A session in progress at Archivematica Camp York
I'd been following this project with interest so it was great to be able to finally test out the new features (including the rather pleasing pie charts showing which file formats you have in your transfer). A few improvements could still be made to make the tab more intuitive - for example, the ability to edit or delete tags - but it is certainly an interesting feature and one that I would like to explore further using some real data from our digital archive.

Throughout camp there was a fair bit of discussion around digital appraisal and at what point in your workflow it should be carried out. This was of particular interest to me, being a topic I had recently raised with colleagues back at base.

The Bentley Historical Library, which funded the work to create the new tab within Archivematica, is clearly keen to get its digital archives into Archivematica as soon as possible and then carry out appraisal there after transfer. The addition of this new tab now makes that workflow possible.

Kirsty Lee from the University of Edinburgh described her own pre-ingest methodology and the tools she uses to help her appraise material before transfer to Archivematica. She talked about some tools (such as TreeSize Pro) that I'm really keen to follow up on.

At the moment I'm undecided about exactly where and how this appraisal work will be carried out at York, and in particular how it will work for hybrid collections, so, as always, it is interesting to hear from others about what works for them.


Metadata and reporting

Evelyn admitting she loves PREMIS and METS
Evelyn McLellan from Artefactual led a 'Metadata Deep Dive' on day 2 and, despite the title, this was actually a pretty interesting session!

We got into the details of METS and PREMIS and how they are implemented within Archivematica. Although I generally try not to look too closely at METS and PREMIS, it was good to have them demystified. On the first day, a series of exercises had encouraged us to examine a METS file created by Archivematica and pick out some information from it ourselves, so the two sessions in combination were really useful.

Across various sessions of the camp there was also a running discussion around reporting. Given that Archivematica stores such a detailed range of metadata in the METS file, how do we actually make use of it? Being able to report on how many AIPs have been created, how many files they contain and their total size is useful. These are statistics that I currently collect (manually) on a quarterly basis and share with colleagues. Once Archivematica is in place at York, digging further into those rich METS files to find out which file formats are in the digital archive would be really helpful for preservation planning (among other things). There was discussion about whether reporting should be a feature of Archivematica or a job best done outside it.
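As a sketch of what that digging might look like, here is a short Python example that counts PREMIS formatName values across a hypothetical folder of METS files extracted from AIPs. The PREMIS 2 namespace shown is the one I'd expect in METS files of this vintage; newer Archivematica releases embed PREMIS 3, which has a different namespace URI:

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET

# PREMIS 2 namespace as embedded in Archivematica METS files of this era.
NS = {"premis": "info:lc/xmlns/premis-v2"}

totals = Counter()
# "mets_files" is a placeholder for wherever you have gathered the METS
# documents extracted from your AIPs.
for mets in Path("mets_files").glob("*.xml"):
    tree = ET.parse(mets)
    for el in tree.findall(".//premis:formatName", NS):
        if el.text:
            totals[el.text] += 1

# Print a simple format report, most common first.
for fmt, count in totals.most_common():
    print(f"{count:6d}  {fmt}")
```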

In relation to the latter option, I described in one session how some of our phase 2 work on Filling the Digital Preservation Gap was designed to expose metadata from Archivematica to a third-party reporting system. The Jisc Research Data Shared Service was also mentioned in this context, as reporting outside of Archivematica will need to be addressed as part of that project.

Community

As with most open source software, community is important. This was touched on throughout the camp and was the focus of the last session on the last day.

There was a discussion about the role of Artefactual Systems and the role of Archivematica users. Obviously we are all encouraged to engage and help sustain the project in whatever way we are able. This could be by sharing successes and failures (I was pleased that my blog got a mention here!), submitting code and bug reports, sponsoring new features (perhaps something listed on the development roadmap) or helping others by responding to queries on the mailing list. It doesn't matter - just get involved!

I was also able to highlight the UK Archivematica group and talk about what we do and what we get out of it. As well as encouraging new members to the group, there was also discussion about the potential for forming other regional groups like this in other countries.

Some of the Archivematica community - class of Archivematica Camp York 2017

...and finally

Another real success was the opportunity for technical staff at York to work with Artefactual to resolve some problems we had with getting our first Archivematica implementation into production. Real progress was made and I'm hoping we can finally start using Archivematica for real at the end of next month.

So, that was Archivematica Camp!

A big thanks to all who came to York and to Artefactual for organising the programme. As promised, the sun shone and there were ducks on the lake - what more could you ask for?



Thanks to Paul Shields for the photos

Jenny Mitcham, Digital Archivist

Tuesday, 11 October 2016

Some highlights from iPRES 2016

A lovely view of the mountains from Bern
Last week I was at iPRES 2016 - the 13th International Conference on Digital Preservation and one of the highlights of the digital preservation year.

This year the conference was held in the Swiss town of Bern. A great place to be based for the week - fantastic public transport, some lovely little restaurants and cellar bars, miles of shopping arcades, bizarre statues and non-stop sunshine!

There was so much content over the course of the four days that it is impossible to cover it all in one blog post. Instead I offer up a selection of highlights and takeaway thoughts.

Jeremy York from the University of Michigan gave an excellent paper about ‘The Stewardship Gap’, an interesting project with the aim of understanding the gap between valuable digital data and long-term curation. Jeremy reported on the results of a series of interviews with researchers at his institution, who were asked about the value of the data they created and their plans for longer term curation. A theme throughout the paper was data value and how we assess it. Most researchers interviewed felt that their data did have long-term value (and were able to articulate the reasons why). Most of the respondents expressed an intention to preserve the data for the longer term but did not have any concrete plans for how they would achieve this. It was not yet clear to the project whether an intention to preserve actually leads to deposit with a repository. Work on this project is ongoing and I look forward to finding out more as results become available.

Bern at night
As always there was an array of excellent posters. There were two in particular that caught my eye this year.

Firstly a poster from the University of Illinois at Urbana-Champaign entitled Should We Keep Everything Forever?: Determining Long-Term Value of Research Data.

The poster discussed an issue that we have also been grappling with recently as part of Filling the Digital Preservation Gap: the value of research data. It proposed an approach to assessing the value of content within the Illinois Data Bank using automated methods and measurable criteria. Recognising that a human eye is also important in assessing value, it would highlight those datasets that appear to have a low value, which can then be assessed in a more manual fashion. This pragmatic two-stage approach will ensure that data thought to be of low value can be discarded after 5 years but that time-intensive manual checking of datasets is kept to a minimum. This is a useful model that I would like to hear more about once they get it fully established. There was a lot of buzz around this poster and I wasn't surprised to see it shortlisted for the best poster award.

Another excellent poster (and worthy winner of the best poster award) was To Act or Not to Act - Handling File Format Identification Issues in Practice. This poster from ETH Zurich described how the institution handles file identification and validation errors within their digital archive and showed some worked examples of the types of problems they encountered. This kind of practical sharing of the nuts and bolts of digital preservation is really good to see, and very much in line with the recommendations we are making as part of Filling the Digital Preservation Gap. As well as finding internal solutions to these problems I hope that ETH Zurich are also passing feedback to the tool providers to ensure that the tools work more effectively and efficiently for other users. It is this feedback loop that is so important in helping the discipline as a whole progress.

OAIS panel session in full swing
A panel session on Monday afternoon entitled ‘OAIS for us all’ was also a highlight. I was of course already aware that the OAIS standard is currently under review and that DPC members and other digital preservation practitioners are invited and encouraged to contribute to the discussion. Despite best intentions and an obvious interest in the standard, I had not yet managed to engage with the review. This workshop was therefore a valuable opportunity to get up to speed with the process (as far as the panel understood it!) and the community feedback so far.

It was really useful to hear about the OAIS discussions that have been held internationally, and of course interesting to note the common themes recurring throughout - for example, the desire for a pre-ingest step within the model, the need to firm up the reference model to accommodate changes to AIPs that may occur through re-ingest, and the need for openness with regard to audit and certification standards.

This session was a great example of an international collaboration to help shape the standards that we rely so much on. I do hope that the feedback from our community is given full consideration in the revised OAIS Reference Model.

Me and Steve presenting our paper
(image from @shirapeltzman)
On Tuesday morning I gave a paper with Steve Mackey from Arkivum in the Research Data Preservation session (and I was really pleased that there was a whole session devoted to this topic). I presented on our work to link repositories to Archivematica, through the Filling the Digital Preservation Gap project, and focused in particular on the long tail of research data file formats and the need to address this as a community. It was great to be able to talk to such a packed room and this led to some really useful discussions over the lunch break and at the conference dinner that evening.

One of the most interesting sessions of the conference for me was one devoted to ingest tools and methods. At a conference such as this, I'm always drawn to the sessions that focus on practical tools and first-hand experiences of doing things rather than the more theoretical strands, so this one was an obvious choice for me. First we had Bruno Ferreira from KEEP SOLUTIONS talking about the Database Preservation Toolkit (more about this toolkit later). This was followed by "Exploring Friedrich Kittler's Digital Legacy on Different Levels: Tools to Equip the Future Archivist" by Jurgen Enge and Heinz Werner Kramski from the University of Art and Design in Basel.

It was fascinating to see how they have handled the preservation of a large, diverse and complex digital legacy and overcome some of the challenges and hurdles that this has thrown at them. The speakers also made the point that the hardware itself is important evidence in its physical form, showing for instance how regularly Friedrich Kittler clearly used the reset button on his PC!

Conference delegates relaxing on the terrace
Two further presentations focused on the preservation of e-mail - something that I have little experience of but I am sure I will need to work on in the future. Claus Jensen from the Royal Library in Denmark presented a solution for acquisition of email. This seemed a very pragmatic approach and the team had clearly thought through their requirements well and learned from their initial prototype before moving to a second iteration. I'm keen to follow up on this and read the paper in more detail.

Brent West from the University of Illinois followed on with another interesting presentation on Processing Capstone Email using Predictive Coding. This talk focused on the problems of making appraisal decisions and sensitivity assessments for email and how a technology assisted review could help, enabling the software to learn from human decisions that are made and allowing human effort to be reduced and targeted. Again, I felt that this sort of work could be really useful to me in the future if I am faced with the task of e-mail preservation at scale.

A very expensive gin and tonic!
The BitCurator Mixer on the Tuesday night provided a good opportunity to talk to other users of BitCurator. I confess to not actually being a user just yet, but having now got my new ingest PC on my desk, it is only a matter of time before I get it installed and start playing and testing. It was good to talk to some experienced users and pick up some tips on how to install it and where to find example workflows. What sticks most in my mind, though, is the price of the gin and tonic at the bar we were in!

On Wednesday afternoon I took part in a workshop called OSS4PRES 2.0: Building Bridges and Filling Gaps - essentially a follow-on from a workshop called Using Open-Source Tools to Fulfill Digital Preservation Requirements that I blogged about from iPRES last year. This was one of those workshops where we were actually expected to work (always a bit of a shock after lunch!) and the participants split into groups to address three different areas. One group looked at gaps in the open source digital preservation tool set that we should be looking to fill (either by enhancing existing tools or by developing new ones). Another group worked on drawing up a set of guidelines for providers of open source tools. The group I was in was thinking about the creation of a community space for sharing digital preservation workflows. This is something that I think could turn into a really valuable resource for practitioners who want to see how others have implemented tools. All the groups came out with lots of ideas and an action plan by the end of the afternoon, and work in these areas is scheduled to continue outside of the workshop. It was great to see a workshop that is more than just a talking shop and that should lead to some concrete results.

My laptop working hard at the Database Preservation workshop
On Thursday morning I attended another really useful hands-on workshop called Relational Database Preservation Standards and Tools. Participants were encouraged to try out the SIARD Suite and Database Preservation Toolkit on their own laptops. The value and outcomes of this workshop were clear and it really gave a sense of how we might use these tools to create preservation versions of data from relational databases. Designed to work with a number of widely used relational database systems, the tools allow data to be extracted into the SIARD 2 format. This format is essentially a zip file containing the relevant information as XML. It goes one better than the CSV format (the means by which I have preserved databases in the past) as it contains both information about the structure of the data and the content itself, and it allows you to add metadata about how the data was extracted. It looks to be particularly useful for taking snapshots of live and active databases for preservation on a regular cycle. I could definitely see myself using these tools in the future.
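To illustrate the "zip file containing XML" point, here is a tiny Python sketch that peeks inside a SIARD 2 package. The filename is a placeholder and the internal paths reflect my reading of the SIARD 2 specification, so verify against a real package:

```python
import zipfile

# "mydb.siard" is a placeholder for a package produced by the SIARD Suite
# or the Database Preservation Toolkit.
with zipfile.ZipFile("mydb.siard") as siard:
    # Expected SIARD 2 layout (from my reading of the specification):
    #   header/metadata.xml                - schemas, tables, columns, keys
    #   content/schema0/table0/table0.xml  - the row data, one file per table
    for name in siard.namelist():
        print(name)

    # The descriptive metadata is plain XML, so it can be read directly.
    with siard.open("header/metadata.xml") as meta:
        print(meta.read(300).decode("utf-8"))
```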

iPRES 2016 and Suisse Toy 2016 meet outside the venue
There was some useful discussion at the end of the session about how these tools would fit into a wider preservation workflow and whether they could be incorporated into digital preservation systems (for example Archivematica) and configured as an automatic migration path for Microsoft Access databases. The answer was yes, but the tool creators cautioned that full automation may not be the best approach: a human eye is typically required to establish which parts of the database should be preserved and to tailor the creation of the SIARD 2 file accordingly.

On the last afternoon of the conference it was good to be able to pop into the Swiss Toy Fair which was being held at the same venue as the conference. A great opportunity to buy some presents for the family before heading back to the UK.



Jenny Mitcham, Digital Archivist
