
Tuesday, 11 October 2016

Some highlights from iPRES 2016

A lovely view of the mountains from Bern
Last week I was at iPRES 2016 - the 13th International Conference on Digital Preservation and one of the highlights of the digital preservation year.

This year the conference was held in the Swiss town of Bern. A great place to be based for the week  - fantastic public transport, some lovely little restaurants and cellar bars, miles of shopping arcades, bizarre statues and non-stop sunshine!

There was so much content over the course of the 4 days that it is impossible to cover it all in one blog post. Instead I offer up a selection of highlights and takeaway thoughts.

Jeremy York from the University of Michigan gave an excellent paper about ‘The Stewardship Gap’. An interesting project with the aim of understanding the gap between valuable digital data and long term curation.  Jeremy reported on the results of a series of interviews with researchers at his institution where they were asked about the value of the data they created and their plans for longer term curation. A theme throughout the paper was around data value and how we assess this. Most researchers interviewed felt that their data did have long term value (and were able to articulate the reasons why). Most of the respondents expressed an intention to preserve the data for the longer term but did not have any concrete plans as to how they would achieve this. It was not yet clear to the project whether an intention to preserve actually leads to deposit with a repository or not. Work on this project is ongoing and I’ll look forward to finding out more when it is available.

Bern at night
As always there was an array of excellent posters. There were two in particular that caught my eye this year.

Firstly a poster from the University of Illinois at Urbana-Champaign entitled Should We Keep Everything Forever?: Determining Long-Term Value of Research Data.

The poster discussed an issue that we have also been grappling with recently as part of Filling the Digital Preservation Gap, that of the value of research data. It proposed an approach to assessing the value of content within the Illinois Data Bank using automated methods and measurable criteria. Recognising that a human eye is also important in assessing value, it would highlight those datasets that appear to have a low value, which can then be assessed in a more manual fashion. This pragmatic two-stage approach will ensure that data thought to be of low value can be discarded after 5 years but that time-intensive manual checking of datasets is kept to a minimum. This is a useful model that I would like to hear more about once they get it fully established. There was a lot of buzz around this poster and I wasn't surprised to see it shortlisted for the best poster award.

Another excellent poster (and worthy winner of the best poster award) was To Act or Not to Act - Handling File Format Identification Issues in Practice. This poster from ETH Zurich described how the institution handles file identification and validation errors within their digital archive and showed some worked examples of the types of problems they encountered. This kind of practical sharing of the nuts and bolts of digital preservation is really good to see, and very much in line with the recommendations we are making as part of Filling the Digital Preservation Gap. As well as finding internal solutions to these problems I hope that ETH Zurich are also passing feedback to the tool providers to ensure that the tools work more effectively and efficiently for other users. It is this feedback loop that is so important in helping the discipline as a whole progress.

OAIS panel session in full swing
A panel session on Monday afternoon entitled ‘OAIS for us all’ was also a highlight. I was of course already aware that the OAIS standard is currently under review and that DPC members and other digital preservation practitioners are invited and encouraged to contribute to the discussion. Despite best intentions and an obvious interest in the standard I had not yet managed to engage with the review. This workshop was therefore a valuable opportunity to get up to speed with the process (as far as the panel understood it!) and the community feedback so far.

It was really useful to hear about the discussions about OAIS that have been held internationally and of course interesting to note the common themes recurring throughout – for example around the desire for a pre-ingest step within the model, the need to firm up the reference model to accommodate changes to AIPs that may occur through re-ingest, and around the need for openness with regard to audit and certification standards.

This session was a great example of an international collaboration to help shape the standards that we rely so much on. I do hope that the feedback from our community is given full consideration in the revised OAIS Reference Model.

Me and Steve presenting our paper
(image from @shirapeltzman)
On Tuesday morning I gave a paper with Steve Mackey from Arkivum in the Research Data Preservation session (and I was really pleased that there was a whole session devoted to this topic). I presented on our work to link repositories to Archivematica, through the Filling the Digital Preservation Gap project, and focused in particular on the long tail of research data file formats and the need to address this as a community. It was great to be able to talk to such a packed room and this led to some really useful discussions over the lunch break and at the conference dinner that evening.

One of the most interesting sessions of the conference for me was one devoted to ingest tools and methods. At a conference such as this, I'm always drawn to the sessions that focus on practical tools and first-hand experiences of doing things rather than the more theoretical strands, so this one was an obvious choice for me. First we had Bruno Ferreira from KEEP SOLUTIONS talking about the Database Preservation Toolkit (more about this toolkit later). This was followed by "Exploring Friedrich Kittler's Digital Legacy on Different Levels: Tools to Equip the Future Archivist" from Jurgen Enge and Heinz Werner Kramski of the University of Art and Design in Basel.

It was fascinating to see how they have handled the preservation of a large, diverse and complex digital legacy and overcome some of the challenges and hurdles that this has thrown at them. The speakers also made the point that the hardware itself is important evidence in its physical form, showing for instance how regularly Friedrich Kittler clearly used the reset button on his PC!

Conference delegates relaxing on the terrace
Two further presentations focused on the preservation of e-mail - something that I have little experience of but I am sure I will need to work on in the future. Claus Jensen from the Royal Library in Denmark presented a solution for acquisition of email. This seemed a very pragmatic approach and the team had clearly thought through their requirements well and learned from their initial prototype before moving to a second iteration. I'm keen to follow up on this and read the paper in more detail.

Brent West from the University of Illinois followed on with another interesting presentation on Processing Capstone Email using Predictive Coding. This talk focused on the problems of making appraisal decisions and sensitivity assessments for email and how a technology assisted review could help, enabling the software to learn from human decisions that are made and allowing human effort to be reduced and targeted. Again, I felt that this sort of work could be really useful to me in the future if I am faced with the task of e-mail preservation at scale.

A very expensive gin and tonic!
The BitCurator Mixer on the Tuesday night provided a good opportunity to talk to other users of BitCurator. I confess to not actually being a user just yet but having now got my new ingest PC on my desk, it is only a matter of time before I get this installed and start playing and testing. Good to talk to some experienced users and pick up some tips regarding how to install it and where to find example workflows. What sticks most in my mind though is the price of the gin and tonic at the bar we were in!

On Wednesday afternoon I took part in a workshop called OSS4PRES 2.0: Building Bridges and Filling Gaps – essentially a follow-on from a workshop called Using Open-Source Tools to Fulfill Digital Preservation Requirements that I blogged about from iPRES last year. This was one of those workshops where we were actually expected to work (always a bit of a shock after lunch!) and the participants split into groups to address 3 different areas. One group was looking at gaps in the open source digital preservation tool set that we should be looking to fill (either by enhancing existing tools or by developing new ones). Another group was working on drawing up a set of guidelines for providers of open source tools. The group I was in was thinking about the creation of a community space for sharing digital preservation workflows. This is something that I think could turn into a really valuable resource for practitioners who want to see how others have implemented tools. All the groups came out with lots of ideas and an action plan by the end of the afternoon, and work in these areas is scheduled to continue outside of the workshop. Great to see a workshop that is more than just a talking shop and that will lead to some concrete results.

My laptop working hard at the Database
Preservation workshop
On Thursday morning I attended another really useful hands-on workshop called Relational Database Preservation Standards and Tools. Participants were encouraged to try out the SIARD Suite and Database Preservation Toolkit on their own laptops. The value and outcomes of this workshop were clear and it really gave a sense of how we might use these tools to create preservation versions of data from relational databases. Designed to work with a number of widely used relational database systems, the tools allow data to be extracted into the SIARD 2 format. This format is essentially a zip file containing the relevant information in XML. It goes one better than the CSV format (the means by which I have preserved databases in the past) as it contains information about the structure of the data as well as the content itself, and allows you to add metadata about how the data was extracted. It looks to be particularly useful for taking regular snapshots of live and active databases for preservation. I could definitely see myself using these tools in the future.
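
For anyone curious about what that looks like in practice, here is a minimal sketch (my own illustration, not taken from the workshop materials) that peeks inside a SIARD 2 package using nothing more than Python's zipfile module. The file name and the header/ and content/ path layout are my assumptions about the format rather than anything definitive.

```python
import zipfile

# A minimal sketch: peek inside a SIARD 2 package, which is a zip container
# holding XML descriptions of the database structure and its content.
# The path layout below (header/, content/) is my assumption about the
# format, not taken from the workshop materials.
SIARD_FILE = "example-database.siard"  # hypothetical file name

with zipfile.ZipFile(SIARD_FILE) as siard:
    names = siard.namelist()

    # The structural description of the database (schemas, tables, columns)
    structure = [n for n in names if n.startswith("header/")]

    # The table data itself, stored as XML under content/
    data = [n for n in names if n.startswith("content/")]

    print(f"{len(structure)} structural files, {len(data)} content files")
    for name in structure[:5]:
        print("  header:", name)
```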

iPRES 2016 and Suisse Toy 2016 meet outside the venue
There was some useful discussion at the end of the session about how these tools would actually fit into a wider preservation workflow and whether they could be incorporated into digital preservation systems (for example Archivematica) and configured as an automatic migration path for Microsoft Access databases. The answer was yes, but the tool creators suggested that full automation may not be the best approach: a human eye is typically required to establish which parts of the database should be preserved and retained, and to tailor the creation of the SIARD 2 file accordingly.

On the last afternoon of the conference it was good to be able to pop into the Swiss Toy Fair which was being held at the same venue as the conference. A great opportunity to buy some presents for the family before heading back to the UK.



Jenny Mitcham, Digital Archivist

Thursday, 29 January 2015

Reacquainting myself with OAIS



Hands up if you have read ISO 14721:2012 (otherwise known as the Reference Model for an Open Archival Information System)… I mean properly read it… yes, I suspected there wouldn't be many of you. It seems like such a key document to us digital archivists – we use the terminology, the concepts within it, even the diagrams on a regular basis, but I'll be the first to confess I have never read it in full.

Standards such as this become so familiar to those of us working in this field that it is possible to get a little complacent about keeping our knowledge of them up to date as they undergo review.

Hats off to the Digital Preservation Coalition (DPC) for updating their Technology Watch Report on the OAIS Reference Model last year. It was published in October 2014 and I admit I have only just managed to read it. Digital preservation reading material typically comes out on long train journeys, and this report kept me company all the way from Birmingham to Coventry and then back home as far as Sheffield (I am a slow reader!). Imagine how far I would have had to travel to read the 135 pages of the full standard!

This is the 2nd edition of the first in the DPC’s series of Technology Watch reports. I remember reading the original report about 10 years ago and trying to map the active digital preservation we were doing at the Archaeology Data Service to the model. 

Ten years is quite a long time in a developing field such as digital preservation and the standard has now been updated (but as mentioned in the Technology Watch Report, the updates haven’t been extensive – the authors largely got it right first time). 

Now reading this updated report in a different job setting I can think about OAIS in a slightly different way. We don't currently do much digital preservation at the Borthwick Institute, but we do do a lot of thinking about how we would like the digital archive to look. Going back to the basics of the OAIS standard at this point in the process encourages fresh thinking about how OAIS could be implemented in practice. It was really encouraging to read the OCLC Digital Archive example cited within the Technology Watch Report (pg 14) which neatly demonstrates a modular approach to fulfilling all the necessary functions across different departments and systems. This ties in with our current thinking at the University of York about how we can build a digital archive using different systems and expertise across the Information Directorate.

Brian Lavoie mentions in his conclusion that "This Technology Watch Report has sought to re-introduce digital preservation practitioners to OAIS, by recounting its development and recognition as a standard; its key revisions; and the many channels through which its influence has been felt." This aim has certainly been met. I feel thoroughly reacquainted with OAIS and have learnt some things about the changes to the standard and even reminded myself of some things that I had forgotten… as I said, 10 years is a long time.






Jenny Mitcham, Digital Archivist

Thursday, 18 December 2014

Plugging the gaps: Linking Arkivum with Archivematica

In September this year, Arkivum and Artefactual Systems (who develop and support the open source software Archivematica) announced that they were collaborating on a digital preservation system. This is a piece of work that my colleagues and I at the University of York were very pleased to be able to fund.

We don't currently have a digital archive for the University of York but we are in the process of planning how we can best implement one. My colleagues and I have been thinking about requirements and assessing systems, looking in particular at ways we might create a digital archive that interfaces with existing systems, automates as much of the digital preservation process as possible… and is affordable.

My first port of call is normally the Open Archival Information System (OAIS) reference model. I regularly wheel out the image below in presentations and meetings because I always think it helps in focusing the mind and summarising what we are trying to achieve.

OAIS Functional Entities (CCSDS 2002)

From the start we have favoured a modular approach to a technical infrastructure to support digital archiving. There doesn't appear to be any single solution that "just does it all for us" and we are not keen to sweep away established systems that already carry out some of the required functionality.

We need to keep in mind the range of data management scenarios we have to support. As a university we have a huge quantity of digital data to manage and formal digital archiving of the type described in the OAIS reference model is not always necessary. We need an architecture that has the flexibility to support a range of different workflows depending on the retention periods or perceived value of the data that we are working with. All data is not born equal so it does not make sense to treat it all in the same way.

How we've approached this challenge is to look at the systems we have currently, find the gaps and work out how best to fill them. We also need to think about how we can get different systems to talk to each other in order to create the automated workflows that are so crucial to all of this working effectively.

Looking at the OAIS model, we already have a system to provide access to data with York Digital Library, which is built using Fedora. We also have some of the data management functionality ticked, with various systems to store descriptive metadata about our assets (both digital and physical). We have various ingest workflows in place to get content to us from the data producers. What we don't have currently is a system that manages the preservation planning side of digital archiving, or a robust and secure method of providing archival storage for the long term.

This is where Archivematica and Arkivum could come in.
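
To make that mapping a little more concrete, here is a rough sketch in Python (my own summary of the paragraphs above, not a formal architecture document) of the OAIS functional entities alongside the systems mentioned, with the gaps that Archivematica and Arkivum could fill:

```python
# A rough sketch (my own summary of the paragraphs above, not a formal
# architecture document): OAIS functional entities mapped to the systems
# described, with the gaps that Archivematica and Arkivum could fill.
oais_functions = {
    "Ingest": "existing ingest workflows from data producers",
    "Data Management": "various systems holding descriptive metadata",
    "Access": "York Digital Library (built on Fedora)",
    "Preservation Planning": None,  # gap - where Archivematica could come in
    "Archival Storage": None,       # gap - where Arkivum could come in
    "Administration": "not covered in this post",
}

for function, system in oais_functions.items():
    print(f"{function:21} -> {system or 'GAP'}")
```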

Archivematica is an open source digital archiving solution. In a nutshell, it takes a microservices approach to digital archiving, running several different tools as part of the ingest process to characterise and validate files, extract metadata, normalise data and package everything up into an Archival Information Package (AIP). The AIP contains the original files (unchanged), any derived or normalised versions of those files as appropriate, and technical and preservation metadata to help people make sense of the data in the future. The metadata are captured as PREMIS and METS XML, two established standards for digital preservation that ensure the AIPs are self-documenting and system-independent. Archivematica is agnostic to the storage service that is used; it merely produces the AIP, which can then be stored anywhere.
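
As a small illustration of what that metadata makes possible, here is a minimal sketch (my own code, not part of Archivematica) that reads an AIP's METS file and lists the PREMIS events recorded during ingest. The file name is hypothetical, and matching on local element names is just a convenience so the sketch does not depend on a particular PREMIS namespace version.

```python
import xml.etree.ElementTree as ET

# A minimal sketch (my own code, not Archivematica's): list the PREMIS events
# recorded in an AIP's METS file to see what the ingest microservices did.
METS_FILE = "METS.example.xml"  # hypothetical path to an AIP's METS file

tree = ET.parse(METS_FILE)

# Match on local element names so the sketch works regardless of which
# PREMIS namespace version the METS document declares.
def local(tag):
    return tag.rsplit("}", 1)[-1]

events = [el for el in tree.iter() if local(el.tag) == "event"]
for event in events:
    etype = next((e.text for e in event.iter() if local(e.tag) == "eventType"), "?")
    edate = next((e.text for e in event.iter() if local(e.tag) == "eventDateTime"), "?")
    print(f"{edate}  {etype}")
```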

Arkivum is a bit-perfect preservation solution. If you store your data with Arkivum they guarantee that you will get that data back in the same condition it was in when you deposited it. They keep multiple copies of the data and carry out regular integrity checks to ensure they can fulfil this promise. Files are not characterised or migrated to different formats; this is all about archival storage. Arkivum is agnostic to the content and will store any file that you wish to deposit.

There does seem to be a natural partnership between Archivematica and Arkivum - there is no overlap in functionality, and each performs a key role within the OAIS model. In fact, even without integration, Archivematica and Arkivum can work together: Archivematica will happily pass AIPs through to Arkivum, but with the integration we can make this work much better.

So, the new functionality includes the following features:

  • Archivematica will let Arkivum know when there is an Archival Information Package (AIP) to ingest
  • Once the Arkivum storage service receives the data from Archivematica it will check that the size of the file received matches the expected file size
  • A checksum will be calculated for the AIP in Arkivum and automatically compared against the checksum supplied by Archivematica, allowing the system to verify whether the transfer has been successful (see the sketch after this list)
  • Using the Archivematica dashboard it is possible to ascertain the status of the AIP within Arkivum to ensure that all required copies of the files have been created and it has been fully archived
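
The checksum comparison is the easiest of these to picture. Below is a minimal sketch of that kind of fixity check (my own illustration of the principle, not the actual integration code; the file name and checksum value are hypothetical).

```python
import hashlib

# A minimal sketch of the kind of fixity check described above: compare the
# checksum of the AIP as received against the checksum supplied by the sender.
# This is my own illustration, not the Archivematica/Arkivum integration code.

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

received_aip = "aip-received.7z"    # hypothetical AIP as stored
expected_checksum = "d2a8..."       # hypothetical value supplied at transfer

if sha256_of(received_aip) == expected_checksum:
    print("Transfer verified: checksums match")
else:
    print("Transfer failed: checksums differ")
```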

I'm still testing this work and have had to work hard to manage my own expectations. The integration doesn't actually do anything particularly visual or exciting; it is the sort of back-end stuff that you don't even notice when everything is working as it should. It is, however, good to know that these sorts of checks are going on behind the scenes, automating tasks that would otherwise have to be done by hand. It is the functionality that you don't see that is all important!

Getting systems such as these to work together well is key to building up a digital archiving solution and we hope that this is of interest and use to others within the digital preservation community.



Jenny Mitcham, Digital Archivist

Monday, 17 March 2014

'Routine encounters with the unexpected' (or what we should tell our digital depositors)


I was very interested a few months back to hear about the release of a new and much-needed report on acquiring born-digital archives: Born Digital: Guidance for Donors, Dealers, and Archival Repositories published by the Council on Library and Information Resources. I read it soon after it was published and have been mulling over its content ever since.

The quote within the title of this post "routine encounters with the unexpected" is taken from the concluding section of the report and describes the stewardship of born-digital archival collections. The report intends to describe good practices that can help reduce these archival surprises.

The publication takes an interesting and inclusive approach, being aimed both at archivists who will be taking in born-digital material and at those individuals and organisations involved in offering born-digital material to an archive or repository.

It appeared at a time when I was developing new content for our new website aimed specifically at donors and depositors, and a couple of weeks before I went on my first trip to collect someone's digital legacy for inclusion in our archive. Over the last few months, alongside archivist colleagues, I have also been planning and documenting our own digital accessions workflow. This report has been a rich source of information and advice and has helped inform all of these activities.

There is lots of food for thought within the publication but what I like best are the checklists at the end which neatly summarise many of the key issues highlighted within the report and provide a handy quick reference guide.

Much as I find this a very useful and interesting publication, it got me thinking about the alternative and apparently conflicting advice that I give depositors, and how the two relate.

I have always thought that one of the most important things that anyone can do to ensure that their digital legacy survives into the future is to put into practice good data management strategies. These strategies are often just simple, common-sense rules: things like weeding out duplicate or unnecessary files, organising your data into sensible and logical directory structures, and naming files and folders well.
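
As a small illustration of how mechanical some of this housekeeping can be, here is a minimal sketch (my own example, not guidance from the report) that flags exact duplicate files in a folder by comparing checksums, so a depositor could review them before anything is handed over. The folder name is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# A minimal sketch (my own example, not from the report): flag exact duplicate
# files in a folder tree by comparing SHA-256 checksums, so a depositor can
# review them before handing material over. Reads whole files into memory,
# so this is only suitable as an illustration for modest collections.
def find_duplicates(root):
    by_checksum = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_checksum[digest].append(path)
    return {d: paths for d, paths in by_checksum.items() if len(paths) > 1}

for digest, paths in find_duplicates("deposit-folder").items():  # hypothetical folder
    print("Duplicate set:")
    for p in paths:
        print("   ", p)
```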

Where we have depositors who wish to give us born-digital material for our archive, I would like to encourage them to follow rules like these to help ensure that we can make better sense of their data when it comes our way. This also helps fulfil the OAIS responsibility to ensure the independent utility of data - the more we know about data from the original source, the greater the likelihood that others will be able to make sense of it in the future. I have put guidance to this effect on our new website which is based on an advice sheet from the Archaeology Data Service.

Screenshot of the donor and depositor FAQ page on the Borthwick Institute's new website

However, this goes against the advice in the 'Born Digital' report which states that "...donors and dealers should not manipulate, rearrange, extract, or copy files from their original sources in anticipation of offering the material for gift or purchase."

In a blog post last year I talked about a digital rescue project I had been working on, looking at the data on some 5 1/4 inch floppy disks from the Marks and Gran archive. This project would not have been nearly as interesting if someone had cleaned up the data before deposit - rationalising and re-naming files and deleting earlier versions. There would have been no detective story and information about the creative process would have been lost. However, if all digital deposits came to us like this would we be able to resource the amount of work required to make sense of them?

So, my question is as follows. What do we tell our depositors? Is there room for both sets of advice - the 'organise your data before deposit' approach aimed at those organisations who regularly deposit their administrative information with us, and the 'leave well alone' approach for the digital legacies of individuals? This is the route I have tried to take on our new website; however, I have concerns about whether it will be clear enough to donors and depositors which advice they should follow, especially where there are areas of cross-over. I'm interested to hear how other archives handle this question.




Jenny Mitcham, Digital Archivist
