Digital Archiving at the University of York: October 2016

Today we have published our third and final Filling the Digital Preservation Gap report.

The report can be accessed from Figshare: https://dx.doi.org/10.6084/m9.figshare.4040787

This report details work the team at the Universities of York and Hull have been carrying out over the last six months (from March to September 2016) during phase 3 of the project.

The first section of the report focuses on our implementation work. It describes how each institution has established a proof of concept implementation of Archivematica integrated with other systems used for research data management. As well as describing how these implementations work it also discusses future priorities and lessons learned.

The second section of the report looks in more detail at the file format problem for research data. It discusses DROID profiling work that has been carried out over the course of the project (both for research data and other data types) and signature development to increase the number of research data signatures in the PRONOM registry. In recognition of the fact that this is an issue that can only be solved as a community, it also includes recommendations for a variety of different stakeholder groups.
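
For readers less familiar with signature-based identification, here is a minimal Python sketch of the general idea: a tool compares the opening bytes of a file against a list of known byte sequences. This is an illustration only, not how DROID is implemented. The `SIGNATURES` list and the `identify` function are hypothetical names, and real PRONOM signatures are considerably richer (byte offsets, end-of-file sequences and container signatures, for example).

```python
# Illustrative only: a tiny, hypothetical signature list.
# Real PRONOM signatures are far more detailed than a fixed prefix.
SIGNATURES = [
    ("PDF document", b"%PDF-"),
    ("PNG image", b"\x89PNG\r\n\x1a\n"),
    ("ZIP container", b"PK\x03\x04"),
]


def identify(data):
    """Return the name of the first signature matching the start of data, or None."""
    for name, magic in SIGNATURES:
        if data.startswith(magic):
            return name
    # No signature matched: this is the 'unidentified' case that is
    # so common for research data formats.
    return None


if __name__ == "__main__":
    print(identify(b"%PDF-1.4 example"))
    print(identify(b"instrument output 42"))
```

A file whose format has no signature in the list simply comes back unidentified, which is why adding signatures for research data formats matters.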

The final section of the report details the outreach work that we have carried out over the course of this final phase of the project. It has been a real pleasure to have been given an opportunity to speak about our work at so many different events and to such a variety of different groups over the last few months!

The last of this run of events in our calendars is the final Jisc Research Data Spring showcase in Birmingham tomorrow (20th October). I hope to see you there!

Jenny Mitcham, Digital Archivist

A lovely view of the mountains from Bern

Last week I was at iPRES 2016 - the 13th International Conference on Digital Preservation and one of the highlights of the digital preservation year.

This year the conference was held in the Swiss town of Bern. A great place to be based for the week - fantastic public transport, some lovely little restaurants and cellar bars, miles of shopping arcades, bizarre statues and non-stop sunshine!

There was so much content over the course of the 4 days that it is impossible to cover it all in one blog post. Instead I offer up a selection of highlights and takeaway thoughts.

Jeremy York from the University of Michigan gave an excellent paper about ‘The Stewardship Gap’, an interesting project that aims to understand the gap between valuable digital data and long term curation. Jeremy reported on the results of a series of interviews with researchers at his institution, who were asked about the value of the data they created and their plans for longer term curation. A recurring theme of the paper was data value and how we assess it. Most researchers interviewed felt that their data did have long term value (and were able to articulate the reasons why). Most also expressed an intention to preserve the data for the longer term but did not have any concrete plans as to how they would achieve this. It was not yet clear to the project whether an intention to preserve actually leads to deposit with a repository or not. Work on this project is ongoing and I’ll look forward to finding out more when it is available.

Bern at night

As always there was an array of excellent posters. There were two in particular that caught my eye this year.

Firstly a poster from the University of Illinois at Urbana-Champaign entitled Should We Keep Everything Forever?: Determining Long-Term Value of Research Data.

The poster discussed an issue that we have also been grappling with recently as part of Filling the Digital Preservation Gap, that of the value of research data. It proposed an approach to assessing the value of content within the Illinois Data Bank using automated methods and measurable criteria. Recognising that a human eye is also important in assessing value, it would highlight those datasets that appear to have a low value which can then be assessed in a more manual fashion. This pragmatic two-stage approach will ensure that data thought to be of low value can be discarded after 5 years but that time intensive manual checking of datasets is kept to a minimum. This is a useful model that I would like to hear more about once they get it fully established. There was a lot of buzz around this poster and I wasn’t surprised to see it shortlisted for the best poster award.
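
The two-stage idea (automated scoring first, human review only for the low scorers) can be sketched in a few lines of Python. This is a hedged illustration, not the Illinois Data Bank's actual method: the criteria (`downloads`, `citations`, `has_documentation`), the thresholds and all names below are hypothetical assumptions chosen purely to show the shape of the approach.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    identifier: str
    downloads: int           # hypothetical measurable criterion
    citations: int           # hypothetical measurable criterion
    has_documentation: bool  # hypothetical measurable criterion


def value_score(dataset):
    """Stage 1: count how many of the (assumed) criteria a dataset meets."""
    score = 0
    if dataset.downloads >= 10:
        score += 1
    if dataset.citations >= 1:
        score += 1
    if dataset.has_documentation:
        score += 1
    return score


def flag_for_manual_review(datasets, threshold=1):
    """Stage 2 input: identifiers of datasets scoring at or below the threshold."""
    return [d.identifier for d in datasets if value_score(d) <= threshold]


if __name__ == "__main__":
    holdings = [
        Dataset("a", downloads=50, citations=2, has_documentation=True),
        Dataset("b", downloads=0, citations=0, has_documentation=True),
        Dataset("c", downloads=3, citations=0, has_documentation=False),
    ]
    print(flag_for_manual_review(holdings))
```

Only the flagged datasets would then go to a person for assessment, which is what keeps the manual effort small.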

Another excellent poster (and worthy winner of the best poster award) was To Act or Not to Act - Handling File Format Identification Issues in Practice. This poster from ETH Zurich described how the institution handles file identification and validation errors within their digital archive and showed some worked examples of the types of problems they encountered. This kind of practical sharing of the nuts and bolts of digital preservation is really good to see, and very much in line with the recommendations we are making as part of Filling the Digital Preservation Gap. As well as finding internal solutions to these problems I hope that ETH Zurich are also passing feedback to the tool providers to ensure that the tools work more effectively and efficiently for other users. It is this feedback loop that is so important in helping the discipline as a whole progress.

OAIS panel session in full swing

A panel session on Monday afternoon entitled ‘OAIS for us all’ was also a highlight. I was of course already aware that the OAIS standard is currently under review and that DPC members and other digital preservation practitioners are invited and encouraged to contribute to the discussion. Despite best intentions and an obvious interest in the standard I had not yet managed to engage with the review. This session was therefore a valuable opportunity to get up to speed with the process (as far as the panel understood it!) and the community feedback so far.

It was really useful to hear about the OAIS discussions that have been held internationally and of course interesting to note the common themes recurring throughout, for example the desire for a pre-ingest step within the model, the need to firm up the reference model to accommodate changes to AIPs that may occur through re-ingest, and the need for openness with regard to audit and certification standards.

This session was a great example of an international collaboration to help shape the standards that we rely so much on. I do hope that the feedback from our community is given full consideration in the revised OAIS Reference Model.

Me and Steve presenting our paper
(image from @shirapeltzman)

On Tuesday morning I gave a paper with Steve Mackey from Arkivum in the Research Data Preservation session (and I was really pleased that there was a whole session devoted to this topic). I presented on our work to link repositories to Archivematica, through the Filling the Digital Preservation Gap project, and focused in particular on the long tail of research data file formats and the need to address this as a community. It was great to be able to talk to such a packed room and this led to some really useful discussions over the lunch break and at the conference dinner that evening.

One of the most interesting sessions of the conference for me was one that was devoted to ingest tools and methods. At a conference such as this, I'm always drawn to the sessions that focus on practical tools and first-hand experiences of doing things rather than the more theoretical strands so this one was an obvious choice for me. First we had Bruno Ferreira from KEEP SOLUTIONS talking about the Database Preservation Toolkit (more about this toolkit later). Then came "Exploring Friedrich Kittler's Digital Legacy on Different Levels: Tools to Equip the Future Archivist" by Jurgen Enge and Heinz Werner Kramski from the University of Art and Design in Basel.

It was fascinating to see how they have handled the preservation of a large, diverse and complex digital legacy and overcome some of the challenges and hurdles that this has thrown at them. The speakers also made the point that the hardware itself is important evidence in its physical form, showing for instance how regularly Friedrich Kittler clearly used the reset button on his PC!

Conference delegates relaxing on the terrace

Two further presentations focused on the preservation of e-mail - something that I have little experience of but I am sure I will need to work on in the future. Claus Jensen from the Royal Library in Denmark presented a solution for acquisition of email. This seemed a very pragmatic approach and the team had clearly thought through their requirements well and learned from their initial prototype before moving to a second iteration. I'm keen to follow up on this and read the paper in more detail.

Brent West from the University of Illinois followed on with another interesting presentation on Processing Capstone Email using Predictive Coding. This talk focused on the problems of making appraisal decisions and sensitivity assessments for email and how a technology assisted review could help, enabling the software to learn from human decisions that are made and allowing human effort to be reduced and targeted. Again, I felt that this sort of work could be really useful to me in the future if I am faced with the task of e-mail preservation at scale.

A very expensive gin and tonic!

The BitCurator Mixer on the Tuesday night provided a good opportunity to talk to other users of BitCurator. I confess to not actually being a user just yet but having now got my new ingest PC on my desk, it is only a matter of time before I get this installed and start playing and testing. Good to talk to some experienced users and pick up some tips regarding how to install it and where to find example workflows. What sticks most in my mind though is the price of the gin and tonic at the bar we were in!

On Wednesday afternoon I took part in a workshop called OSS4PRES 2.0: Building Bridges and Filling Gaps, essentially a follow-on from a workshop called Using Open-Source Tools to Fulfill Digital Preservation Requirements that I blogged about from iPRES last year. This was one of those workshops where we were actually expected to work (always a bit of a shock after lunch!) and the participants split into groups to address three different areas. One group was looking at gaps in the open source digital preservation tool set that we should be looking to fill (either by enhancing existing tools or by developing new ones). Another group was working on drawing up a set of guidelines for providers of open source tools. The group I was in was thinking about the creation of a community space for sharing digital preservation workflows. This is something that I think could turn into a really valuable resource for practitioners who want to see how others have implemented tools. All the groups came out with lots of ideas and an action plan by the end of the afternoon and work in these areas is scheduled to continue outside of the workshop. Great to see a workshop that is more than just a talking shop and that will lead to some more concrete results.

My laptop working hard at the Database Preservation workshop

On Thursday morning I attended another really useful hands-on workshop called Relational Database Preservation Standards and Tools. Participants were encouraged to try out the SIARD Suite and Database Preservation Toolkit on their own laptops. The value and outcomes of this workshop were clear and it really gave a sense of how we might use these tools to create preservation versions of data from relational databases. Designed to work with a number of widely used relational database systems, the tools allow data to be extracted into the SIARD 2 format. This format is essentially a zip file containing the relevant information in XML. It goes one better than the CSV format (the means by which I have preserved databases in the past) as it contains information about the structure of the data as well as the content itself, and allows you to add metadata about how the data was extracted. It looks to be particularly useful for taking snapshots of live and active databases for preservation on a regular cycle. I could definitely see myself using these tools in the future.
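
To make the "zip file containing XML" description a little more concrete, here is a minimal Python sketch that opens such an archive with the standard library, lists the XML members and reads table names from a metadata file. It is an illustration under stated assumptions, not a SIARD reader: the `header/metadata.xml` location and the `table`/`name` element layout are assumed here and should be checked against the SIARD specification, and the function names and demo archive contents are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def local_name(tag):
    """Strip any XML namespace so '{ns}table' and 'table' compare equal."""
    return tag.rsplit("}", 1)[-1]


def summarise_archive(source, metadata_member="header/metadata.xml"):
    """Return (sorted XML member names, table names found in the metadata file).

    metadata_member is an assumed location, not taken from the workshop.
    """
    table_names = []
    with zipfile.ZipFile(source) as archive:
        members = archive.namelist()
        xml_members = sorted(n for n in members if n.lower().endswith(".xml"))
        if metadata_member in members:
            root = ET.fromstring(archive.read(metadata_member))
            for element in root.iter():
                if local_name(element.tag) != "table":
                    continue
                # Take the table's own <name> child, ignoring column names.
                for child in element:
                    if local_name(child.tag) == "name":
                        table_names.append(child.text)
                        break
    return xml_members, table_names


def build_demo_archive():
    """Create a tiny in-memory zip with a made-up, SIARD-like layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "header/metadata.xml",
            "<siardArchive><schemas><schema><name>public</name><tables>"
            "<table><name>samples</name></table>"
            "<table><name>readings</name></table>"
            "</tables></schema></schemas></siardArchive>",
        )
        archive.writestr(
            "content/schema0/table0/table0.xml",
            "<table><row><c1>1</c1></row></table>",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    xml_members, table_names = summarise_archive(build_demo_archive())
    print(xml_members)
    print(table_names)
```

Because structure and content travel together in one package, a quick inspection like this already tells you more than a folder of CSV files would.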

iPRES2016 and Suisse Toy 2016 meet outside the venue

There was some useful discussion at the end of the session about how these tools would actually fit into a wider preservation workflow and whether they could be incorporated into digital preservation systems (for example Archivematica) and configured as an automatic migration path for Microsoft Access databases. The answer was yes, but the tool creators suggested that full automation may not be the best approach. A human eye is typically required to establish which bits of the database should be preserved and retained and to tailor the creation of the SIARD 2 file accordingly.

On the last afternoon of the conference it was good to be able to pop into the Swiss Toy Fair which was being held at the same venue as the conference. A great opportunity to buy some presents for the family before heading back to the UK.

Jenny Mitcham, Digital Archivist

Wednesday, 19 October 2016

Filling the Digital Preservation Gap - final report available

Tuesday, 11 October 2016

Some highlights from iPRES 2016
