Digital Archiving at the University of York: 2016

Wednesday, 7 December 2016

Digital Preservation Awards 2016 - celebrating collaboration and innovation

Last week members of the Filling the Digital Preservation Gap project team were lucky enough to experience the excitement and drama of the biennial Digital Preservation Awards!

The Awards ceremony was held at the Wellcome Collection in London on the evening of the 30th November. As always it was a glittering affair, complete with dramatic bagpipe music (I believe it coincided with St Andrew's Day!) and numerous references to Strictly Come Dancing from the judges and hosts!

This year our project had been shortlisted for the Software Sustainability Institute award for Research and Innovation. It was fantastic to be a finalist considering the number of nominations from across the world in this category and we certainly felt we had some strong competition from the other shortlisted projects.

One of the key strengths of our own project has been the collaboration between the Universities of York and Hull. Collaboration with Artefactual Systems, The National Archives and the wider digital preservation community has also been hugely beneficial.

Interestingly, collaboration was a key feature of all the finalists in this category, perhaps demonstrating just how important this is in order to make effective progress in this area.

The 4C project "Collaboration to Clarify the Costs of Curation" was a European project which looked at costs and benefits relating to digital preservation activities within its partner organisations and beyond. Project outputs in use across the sector include the Curation Costs Exchange.

The winner in our category however was the Dutch National Coalition for Digital Preservation (NCDD) with Constructing a Network of Nationwide Facilities Together. Again there was a strong focus on collaboration - this time cross-domain collaboration within the Netherlands. Under the motto "Joining forces for our digital memory", the project has been constructing a framework for a national shared infrastructure for digital preservation. This collaboration aimed to ensure that each institution does not have to reinvent the wheel as they establish their own digital preservation facilities. Clearly an ambitious project, and perhaps one we can learn from in the UK Higher Education sector as we work with Jisc on their Shared Service for Research Data.

Some of the project team from York and Hull at the awards reception

The awards ceremony itself came at the end of day one of the PERICLES conference where there was an excellent keynote speech from Kara Van Malssen from AV Preserve (her slides are available on SlideShare - I'd love to know how she creates such beautiful slides!).

In the context of the awards ceremony I was pondering one of the messages of Kara's talk that discussed our culture of encouraging and rewarding constant innovation and the challenges that this brings - especially for those of us who are 'maintainers'.

Maintainers maintain systems, services and the status quo - some of us maintain digital objects for the longer term and ensure we can continue to provide access to them. She argued that there are few rewards for maintainers and the incentives generally go to those who are innovating. If those around us are always chasing the next shiny new thing, how can the digital preservation community keep pace?

I would argue however that in the world of digital preservation itself, rewards for innovation are not always forthcoming. It can be risky for an institution to be an innovator in this area rather than doing what we have always done (which may actually bring risks of a different kind!) and this can stifle progress or lead to inaction.

This is why for me, the Digital Preservation Awards are so important. Being recognised as a finalist for the Research and Innovation award sends a message that what we have achieved is worthwhile and demonstrates that doing something different is A Good Thing.

For that I am very grateful. :-)

Jenny Mitcham, Digital Archivist

Monday, 21 November 2016

Every little bit helps: File format identification at Lancaster University

This is a guest post from Rachel MacGregor, Digital Archivist at Lancaster University. Her work on identifying research data follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported in a previous blog post and our final project report.

Here at Lancaster University I have been very inspired by the work at York on file format identification and we thought it was high time I did my own analysis of the one hundred or so datasets held here. The aim is to aid understanding of the nature of research data as well as to inform our approaches to preservation. Our results are comparable to York's in that the data is characterised as research data (as yet we don't have any born digital archives or digitised image files). I used DROID (version 6.2.1) as the tool for file identification - there are others and it would be interesting to compare results at some stage with results from using other software such as FILE (FITS), Apache Tika etc.

The exercise was carried out using the following signature files: DROID_SignatureFile_V88 and container-signature-file-20160927. The maximum number of bytes DROID was set to scan at the start and end of each file was 65536 (which is the default setting when you install DROID).

Summary of the statistics:

There were a total of 24,705 files (so a substantially larger sample than in the comparable study at York)

Of these:

11008 (44.5%) were identified by DROID and 13697 (55.5%) not.
99.3% were given one file identification and 76 files had multiple identifications.

59 files had two possible identifications
13 had 3 identifications
4 had 4 possible identifications.

50 of these files were asc files identified (by extension) as either 8-bit or 7-bit ASCII text files. The remaining 26 were identified by container as various types of Microsoft files.

Files that were identified

Of the 11008 identified files:

89.34% were identified by signature: this is the overwhelming majority, far more than in Jen's survey
9.2% were identified by extension, a much smaller proportion than at York
1.46% identified by container

However there was one large dataset containing over 7,000 gzip files, all identified by signature which did skew the results rather. With those files removed, the percentages identified by different methods were as follows:

68% (2505) by signature
27.5% (1013) by extension
4.5% (161) by container

This was still different from York's results but not so dramatically.
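For anyone wanting to repeat this sort of calculation on their own DROID results, here is a minimal sketch of the arithmetic. It uses only the three counts quoted above; the function name `shares` is my own. The results should agree with the rounded figures in the list to within about a tenth of a percentage point.

```python
# Counts reported above, after the large gzip dataset was set aside.
counts = {"signature": 2505, "extension": 1013, "container": 161}


def shares(counts):
    """Return each method's share of the total as a percentage (one decimal place)."""
    total = sum(counts.values())
    return {method: round(100 * n / total, 1) for method, n in counts.items()}


if __name__ == "__main__":
    for method, pct in shares(counts).items():
        print(f"{method}: {pct}%")
```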

Only 38 were identified as having a file extension mismatch (0.3%) but closer inspection may reveal more. Of these, most were Microsoft files with multiple identifications (see above), but there was also a set of lsm files identified as TIFFs. This is not a format I'm familiar with. It seems that lsm is a form of TIFF file, but how do I know whether this is a "correct" identification or not?

59 different file formats were identified, the most frequently occurring being the GZIP format (as mentioned above) with 7331 instances. The next most popular was, unsurprisingly xml (similar to results at York) with 1456 files spread across the datasets. The top 11 were:

Top formats identified by DROID for Lancaster University's research data

Files that weren't identified

There were 13697 files not identified by DROID of which 4947 (36%) had file extensions. This means there was a substantial proportion of files with no file extension (64%). This is much higher than the result at York which was 26%. As at York there were 107 different extensions in the unidentified files of which the top ten were:

Top counts of unidentified file extensions

Top extensions of unidentified files

This top ten are quite different to York's results, though in both institutions dat files topped the list by some margin! We also found 20 inp and 32 out files which also occur in York's analysis.
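A tally like the one above could be produced from a DROID CSV export with a few lines of Python. This is only a sketch: the sample rows are invented, and the column names (NAME, TYPE, METHOD, PUID) are my assumption about the export layout, so check them against your own export before relying on it.

```python
import csv
import io
from collections import Counter
from pathlib import PurePosixPath

# Invented sample in the style of a DROID CSV export; column names are assumed.
SAMPLE = """NAME,TYPE,METHOD,PUID
run1.dat,File,,
run2.dat,File,,
model.inp,File,,
README,File,,
notes.xml,File,Signature,fmt/101
"""


def unidentified_extensions(csv_text):
    """Count extensions of files with no PUID; files with no extension are tallied under ''."""
    tally = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["TYPE"] == "File" and not row["PUID"]:
            extension = PurePosixPath(row["NAME"]).suffix.lstrip(".").lower()
            tally[extension] += 1
    return tally


if __name__ == "__main__":
    for ext, n in unidentified_extensions(SAMPLE).most_common():
        print(ext or "(no extension)", n)
```

Counting the empty extension separately makes it easy to see what proportion of the unidentified files have no extension at all.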

Like Jen at York I will be looking for a format to analyse further to create a signature - this will be a big step for me but will help my understanding of the work I am trying to do as well as contribute towards our overall understanding of file format types.

Every little bit helps.

Jenny Mitcham, Digital Archivist

Tuesday, 15 November 2016

AtoM harvesting (part 1) - it works!

When we first started using Access to Memory (AtoM) to create the Borthwick Catalogue we were keen to enable our data to be harvested via OAI-PMH (more about this feature of AtoM is available in the documentation). Indeed the ability to do this was one of our requirements when we were looking to select a new Archival Management System (read about our system requirements here).

Look! Archives now available in Library Catalogue search

So it is with great pleasure that I can announce that we are now exposing some of our data from AtoM through our University Library catalogue YorSearch. Dublin Core metadata is automatically harvested nightly from our production AtoM instance - so we don't need to worry about manual updates or old versions of our data hanging around.

Our hope is that doing this will allow users of the Library Catalogue (primarily staff and students at the University of York) to happen upon relevant information about the archives that we hold here at the Borthwick whilst they are carrying out searches for other information resources.

We believe that enabling serendipitous discovery in this way will benefit those users of the Library Catalogue who may have no idea of the extent and breadth of our holdings and who may not know that we hold archives of relevance to their research interests. Increasing the visibility of the archives within the University of York is a useful way of signposting our holdings and we think this should bring benefits both to us and our potential user base.

A fair bit of thought (and a certain amount of tweaking within YorSearch) went into getting this set up. From the archives perspective, the main decision was around exactly what should be harvested. It was agreed that only top level records from the Borthwick Catalogue should be made available in this way. If we had enabled the harvesting of all levels of records, there was a risk that search results would have been swamped by hundreds of lower level records from those archives that have been fully catalogued. This would have made the search results difficult to understand, particularly given the fact that these results could not have been displayed in a hierarchical way so the relationships between the different levels would be unclear. We would still encourage users to go direct to the Borthwick Catalogue itself to search and browse lower levels of description.

It should also be noted that only a subset of the metadata within the Borthwick Catalogue will be available through the Library Catalogue. The metadata we create within AtoM is compliant with ISAD(G): General International Standard Archival Description which contains 26 different data elements. In order to facilitate harvesting using OAI-PMH, data within AtoM is mapped to simple Dublin Core and this information is available for search and retrieval via YorSearch. As you can see from the screen shot below, Dublin Core does allow a useful level of information to be harvested, but it is not as detailed as the original record.

An example of one of our archival descriptions converted to Dublin Core within YorSearch
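For those curious about what a harvest looks like under the hood, here is a minimal sketch of an OAI-PMH request and of reading Dublin Core titles from a response. The `ListRecords` verb and the `oai_dc` metadata prefix come from the OAI-PMH specification. The base URL, the function names and the sample response are placeholders of my own, not our production setup or anything Primo actually runs.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for the given (hypothetical) endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Invented response fragment, trimmed to the parts the sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example archive</dc:title>
          <dc:identifier>EX/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def titles(xml_text):
    """Return the text of every dc:title element in a harvested response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(DC + "title")]


if __name__ == "__main__":
    print(list_records_url("https://archive.example.org/oai"))
    print(titles(SAMPLE_RESPONSE))
```

Simple Dublin Core has only fifteen elements, which is why the harvested record is less detailed than the ISAD(G) original.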

Further work was necessary to change the default behaviour within Primo (the software that YorSearch runs on) which displayed results from the Borthwick Catalogue with the label Electronic resource. This is what it calls anything that is harvested as Dublin Core. We didn't think this would be helpful to users because even though the finding aid itself (within AtoM) is indeed an electronic resource, the actual archive that it refers to isn't. We were keen that users didn't come to us expecting everything to be digitised! Fortunately it was possible to change this label to Borthwick Finding Aid, a term that we think will be more helpful to users.

Searches within our library catalogue (YorSearch) now surface Borthwick finding aids, harvested from AtoM.
These are clearly labelled as Borthwick Finding Aids.

Click through to a Borthwick Finding Aid and you can see the full archival description in AtoM in an iFrame

Now this development has gone live we will be able to monitor the impact. It will be interesting to see whether traffic to the Borthwick Catalogue increases and whether a greater number of University of York staff and students engage with the archives as a result.

However, note that I called this blog post AtoM harvesting (part 1).

Of course that means we would like to do more.

Specifically we would like to move beyond just harvesting our top level records as Dublin Core and enable harvesting of all of our archival descriptions in full in Encoded Archival Description (EAD) - an XML standard that is closely modelled on ISAD(G). This is currently not possible within AtoM but we are hoping to change this in the future.

Part 2 of this blog post will follow once we get further along with this aim...

Jenny Mitcham, Digital Archivist

Monday, 14 November 2016

Automating transfers with Automation Tools

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie has been working on York's implementation for the 'Filling the Digital Preservation Gap' project. This post describes how we have used Artefactual Systems' Automation Tools at York.

For Phase three of our 'Filling the Digital Preservation Gap' project we have delivered a proof-of-concept implementation to illustrate how PURE and Archivematica can be used as part of a research data management lifecycle.

One of the requirements for this work was the ability to fully automate a transfer in Archivematica. Automation Tools is a set of python scripts from Artefactual Systems that are designed to help.

The way Automation Tools works is that a script (transfer.py) runs regularly at a set interval (as a cron task). The script is fed a set of parameters and, based on these, checks for new transfers in the given transfer source directory. On finding something, a transfer in Archivematica is initiated and approved.
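To make the pattern concrete, here is a minimal sketch of the "check for something new on each run" idea. To be clear, this is not the code in transfer.py: the function names and the `start_transfer` callback are hypothetical stand-ins, and a real script would need to persist its record of started transfers between runs.

```python
import tempfile
from pathlib import Path


def find_new_transfers(source_dir, already_started):
    """Return names of sub-directories not yet started, in name order."""
    names = (p.name for p in Path(source_dir).iterdir() if p.is_dir())
    return sorted(n for n in names if n not in already_started)


def run_once(source_dir, already_started, start_transfer):
    """One scheduled pass: hand each new folder to start_transfer exactly once."""
    for name in find_new_transfers(source_dir, already_started):
        start_transfer(name)  # hypothetical callback standing in for the Archivematica call
        already_started.add(name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as source:
        Path(source, "package-1").mkdir()
        started = set()
        run_once(source, started, print)  # prints package-1
        run_once(source, started, print)  # second pass finds nothing new
```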

One of the neat features of Automation Tools is that if you need custom behaviour, there are hooks in the transfer.py script that can run other scripts within specified directories. The 'pre-transfer' scripts are run before the transfer starts and 'user input' scripts can be used to act when manual steps in the processing are reached. A processing configuration can be supplied and this can fully automate all steps, or leave some manual as desired.

The best way to use Automation Tools is to fork the github repository and then add local scripts into the pre-transfer and/or user-input directories.

So, how have we used Automation Tools at York?

When a user deposits data through our Research Data York (RDYork) application, the data is written into a folder within the transfer source directory named with the id of our local Fedora resource for the data package. The directory sits on filestore that is shared between the Archivematica and RDYork servers. On seeing a new transfer, three scripts run:

1_datasets_config.py - this script copies the dedicated datasets processing config into the directory where the new data resides.

2_arrange_transfer.py - this script simply makes sure the correct file permissions are in place so that Archivematica can access the data.

3_create_metadata_csv.py - this script looks for a file called 'metadata.json' which contains metadata from PURE and, if it finds it, processes the contents and writes out a metadata.csv file in a format that Archivematica will understand.

These scripts are all fairly rudimentary, but could be extended for other use cases, for example to process metadata files from different sources or to select a processing config for different types of deposit.
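As an illustration of the kind of conversion the third script performs, here is a minimal sketch that turns a small JSON record into a metadata.csv with Dublin Core columns. The JSON keys ("title", "creators", "date") are invented for the example and are not the real PURE export. The CSV layout (a "filename" column followed by dc.* columns, with an "objects/" row describing the whole transfer) is my understanding of what Archivematica expects, so check it against the Archivematica documentation. Joining several creators into one cell is a simplification.

```python
import csv
import io
import json


def metadata_csv_from_json(json_text):
    """Convert a (hypothetical) JSON metadata record into metadata.csv text."""
    record = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["filename", "dc.title", "dc.creator", "dc.date"])
    writer.writerow([
        "objects/",  # assumed to mean "this row describes the whole transfer"
        record.get("title", ""),
        "; ".join(record.get("creators", [])),
        record.get("date", ""),
    ])
    return out.getvalue()


if __name__ == "__main__":
    sample = json.dumps({
        "title": "Test dataset",
        "creators": ["A. Researcher", "B. Researcher"],
        "date": "2016",
    })
    print(metadata_csv_from_json(sample), end="")
```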

Our processing configuration for datasets is fully automated so by using automation tools we never have to look at the Archivematica interface.

With transfer.py as inspiration I have added a second script called status.py. This one speaks directly to APIs in our researchdatayork application and updates our repository objects with information from Archivematica, such as the UUID for the AIP and the location of the package itself. In this way our two 'automation' scripts keep researchdatayork and Archivematica in sync. Archivematica is alerted when new transfers appear and automates the ingest, and researchdatayork is updated with the status once Archivematica has finished processing.

The good news is, the documentation for Automation Tools is very clear and that makes it pretty easy to get started. Read more at https://github.com/artefactual/automation-tools

Jenny Mitcham, Digital Archivist

Friday, 4 November 2016

From old York to New York: PASIG 2016

My walk to the conference on the first day

Last week I was lucky enough to attend PASIG 2016 (Preservation and Archiving Special Interest Group) at the Museum of Modern Art in New York. A big thanks to Jisc who generously funded my conference fee and travel expenses. This was the first time I have attended PASIG but I had heard excellent reports from previous conferences and knew I would be in for a treat.

On the conference website PASIG is described as "a place to learn from each other's practical experiences, success stories, and challenges in practising digital preservation." This sounded right up my street and I was not disappointed. The practical focus proved to be a real strength.

The conference was three days long and I took pages of notes (and lots of photographs!). As always, it would be impossible to cover everything in one blog post so here is a round up of some of my highlights. Apologies to all of those speakers who I haven't mentioned.

Bootcamp!

The first day was Bootcamp - all about finding your feet and getting started with digital preservation. However, this session had value not just for beginners but for those of us who have been working in this area for some time. There are always new things to learn in this field and sometimes a benefit in being walked through some of the basics.

The highlight of the first day for me was an excellent talk by Bert Lyons from AVPreserve called "The Anatomy of Digital Files". This talk was a bit of a whirlwind (I couldn't type my notes fast enough) but it was so informative and hugely valuable. Bert talked us through the binary and hexadecimal notation systems and how they relate to content within a file. This information backed up some of the things I had learnt when investigating how file format signatures are created and really should be essential learning for all digital archivists. If we don't really understand what digital files are made up of then it is hard to preserve them.
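To give a flavour of how binary and hexadecimal notation relate to the bytes in a file, here is a minimal sketch of my own (not taken from Bert's talk). It shows the first bytes of some data in both notations and checks them against the two-byte signature that marks the start of a GZIP file (1f 8b). The function names are hypothetical.

```python
import gzip

GZIP_MAGIC = bytes([0x1F, 0x8B])  # the two bytes that open every GZIP file


def describe(data, n=4):
    """Return (hexadecimal, binary) notation for each of the first n bytes."""
    return [(f"{byte:02x}", f"{byte:08b}") for byte in data[:n]]


def looks_like_gzip(data):
    """True if the data starts with the GZIP signature bytes."""
    return data.startswith(GZIP_MAGIC)


if __name__ == "__main__":
    sample = gzip.compress(b"hello")
    for hex_form, binary_form in describe(sample):
        print(hex_form, binary_form)
    print(looks_like_gzip(sample))
```

Each hexadecimal digit stands for four binary digits, which is why one byte is always written as two hex characters.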

Bert also went on to talk about the file system information - which is additional to the bytes within the file - and how crucial it is to also preserve this information alongside the file itself. If you want to know more, there is a great blog post by Bert that I read earlier this year - What is the chemistry of digital preservation?. It includes a comparison about the need to understand the materials you are working with whether you are working in physical conservation or digital preservation. One of the best blog posts I've read this year so pleased to get the chance to shout about it here!

Hands up if you love ISO 16363!

Kara Van Malssen, also from AVPreserve, gave another good presentation called "How I learned to stop worrying and love ISO16363". Although the standard is specifically intended for formal certification, she talked about its value outside the certification process - for self assessment, to identify gaps and to prioritise further work. She concluded by saying that ISO16363 is one of the most valuable digital preservation tools we have.

Jon Tilbury from Preservica gave a thought provoking talk entitled "Preservation Architectures - Now and in the Future". He talked about how tool provision has evolved, from individual tools (like PRONOM and DROID) to integrated tools designed for an institution, to out of the box solutions. He suggested that the fourth age of digital preservation will be embedded tools - with digital preservation being seamless and invisible and very much business as usual. This will take digital preservation from the libraries and archives sector to the business world. Users will be expecting systems to be intuitive and highly automated - they won't want to think in OAIS terms. He went on to suggest that the fifth age will be when every day consumers (specifically his mum!) are using the tools without even thinking about it! This is a great vision - I wonder how long it will take us to get there?

Erin O'Meara from University of Arizona Libraries gave an interesting talk entitled "Digital Storage: Choose your own adventure". She discussed how we select suitable preservation storage and how we can get a seat at the table for storage discussions and decisions within our institutions. She suggested that often we are just getting what we are given rather than what we actually need. She referenced the excellent NDSA Levels of Digital Preservation which are a good starting point when trying to articulate preservation storage needs (and one which I have used myself). Further discussions on Twitter following on from this presentation highlighted the work on preservation storage requirements being carried out as a result of a workshop at iPRES 2016, so this is well worth following up on.

A talk from Amy Rushing and Julianna Barrera-Gomez from the University of Texas at San Antonio entitled "Jumping in and Staying Afloat: Creating Digital Preservation Capacity as a Balancing Act" really highlighted for me one of the key messages that has come out of our recent project work for Filling the Digital Preservation Gap. This is that choosing a digital preservation system is relatively easy, but actually deciding how to use it is harder! After ArchivesDirect (a combination of Archivematica and DuraSpace) was selected as their preservation system (which included 6TB of storage), Amy and Julianna had a lot of decisions to make in order to balance the needs of their collections with the available resources. It was a really interesting case study and valuable to hear how they approached the problem and prioritised their collections.

The Museum of Modern Art in New York

Andrew French from Ex Libris Solutions gave an interesting insight into a more open future for their digital preservation system Rosetta. He pointed out that institutions, when selecting digital preservation systems, focus on best practice and what is known. They tend to have key requirements relating to known standards such as OAIS, Dublin Core, PREMIS and METS as well as a need for automated workflows and a scalable infrastructure. However, once they start using the tool, they find they want other things too - they want to plug in different tools that suit their own needs.

In order to meet these needs, Rosetta is moving towards greater openness, enabling institutions to swap out any of the tools for ingest, preservation, deposit or publication. This flexibility allows the system to be better suited for a greater range of use cases. They are also being more open with their documentation and this is a very encouraging sign. The Rosetta Developer Network documentation is open to all and includes information, case studies and workflows from Rosetta users that help describe how Rosetta can be used in practice. We can all learn a lot from other people even if we are not using the same DP system so this kind of sharing is really great to see.

MOMA in the rain on day 2!

Day two of PASIG was a practitioners' knowledge exchange. The morning sessions around reproducibility of research were of particular interest to me given my work on research data preservation and it was great to see two of the presentations referencing the work of the Filling the Digital Preservation Gap project. I'm really pleased to see our work has been of interest to others working in this area.

One of the most valuable talks of the day for me was from Fernando Chirigati from New York University. He introduced us to a useful new tool called ReproZip. He made the point that the computational environment is as important as the data itself for the reproducibility of research data. This could include information about libraries used, environment variables and options. You cannot expect your depositors to find or document all of the dependencies (or your future users to install them). What ReproZip does is package up all the necessary dependencies along with the data itself. This package can then be archived and re-used in the future. ReproZip can also be used to unpack and re-use the data in the future. I can see a very real use case for this for researchers within our institution.

Another engaging talk from Joanna Phillips from the Guggenheim Museum and Deena Engel of New York University described a really productive collaboration between the two institutions. Computer Science students from NYU have been working closely with the time-based media conservator at the museum on the digital artworks in their care. This symbiotic relationship enables the students to earn credit towards their academic studies whilst the museum receives valuable help towards understanding and preserving some of their complex digital objects. Work that the students carry out includes source code analysis and the creation of full documentation of the code so that it can be understood by others. Some also engage with the unique preservation challenges within the artwork, considering how it could be migrated or exhibited again. It was clear from the speakers that both institutions get a huge amount of benefit from this collaboration. A great case study!

Karen Cariani from WGBH Educational Foundation talked about their work (with Indiana University Libraries) to build HydraDAM2. This presentation was of real interest to me given our recent Filling the Digital Preservation Gap project in which we introduced digital preservation functionality to Hydra by integrating it with Archivematica. HydraDAM2 was a different approach, building a preservation head for audio-visual material within Hydra itself. Interesting to see a contrasting solution and to note the commonalities between their project and ours (particularly around the data modelling work and difficulties recruiting skilled developers).

More rain at the end of day 2

Ben Fino-Radin from the Museum of Modern Art in "More Data, More Problems: Designing Efficient Workflows at Petabyte Scale" highlighted the challenges of digitising their time-based media holdings and shared some calculations around how much digital storage space would be required if they were to digitise all of their analogue holdings. This again really highlighted some big issues and questions around digital preservation. When working with large collections, organisations need to prioritise and compromise and these decisions cannot be taken lightly. This theme was picked up again on day 3 in the session around environmental sustainability.

The lightning talks on the afternoon of the second day were also of interest. Great to hear from such a range of practitioners.... though I did feel guilty that I didn't volunteer to give one myself! Next time!

On the morning of day 3 we were treated to an excellent presentation by Dragan Espenschied from Rhizome who showed us Webrecorder. Webrecorder is a new open source tool for creating web archives. It uses a single system both for initial capture and subsequent access. One of its many strengths appears to be the ability to capture dynamic websites as you browse them and it looks like it will be particularly useful for websites that are also digital artworks. This is definitely one to watch!

MOMA again!

Also on day 3 was a really interesting session on environmental responsibility and sustainability. This was one of the reasons that PASIG made me think...this is not the sort of stuff we normally talk about so it was really refreshing to see a whole session dedicated to it.

Eira Tansey from the University of Cincinnati gave a very thought provoking talk with a key question for us to think about - why do we continue to buy more storage rather than appraise? This is particularly important considering the environmental costs of continuing to store more and more data of unknown value.

Ben Goldman of Penn State University also picked up this theme, looking at the carbon footprint of digital preservation. He pointed out the paradox in the fact we are preserving data for future generations but we are powering this work with fossil fuels. Is preserving the environment not going to be more important to future generations than our digital data? He suggested that we consider the long term impacts of our decision making and look at our own professional assumptions. Are there things that we do currently that we could do with less impact? Are we saving too many copies of things? Are we running too many integrity checks? Is capturing a full disk image wasteful? He ended his talk by suggesting that we should engage in a debate about the impacts of what we do.

Amelia Acker from the University of Texas at Austin presented another interesting perspective on digital preservation in mobile networks, asking how our collections will change as we move from an information society to a networked era and how mobile phones change the ways we read, write and create the cultural record. The atomic level of the file is no longer there on mobile devices. Most people don't really know where the actual data is on their phones or tablets, they can't show you the file structure. Data is typically tied up with an app and stored in the cloud and apps come and go rapidly. There are obvious preservation challenges here! She also mentioned the concept of the legacy contact on Facebook...something which had passed me by, but which will be of interest to many of us who care about our own personal digital legacy.

Yes, there really is steam coming out of the pavements in NYC

The stand out presentation of the conference for me was "Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files" from Elvia Arroyo-Ramirez from Princeton University. She described the archive of Juan Gelman, an Argentinian poet and human rights activist. Much of the archive was received on floppy disks and included documents relating to his human rights work and campaigns for the return of his missing son and daughter-in-law. The area she focused on within her talk was about how we preserve files with accented characters in the file names.

Diacritics can cause problems when trying to open the files or use our preservation tools (for example Bagger). When she encountered problems like these she put a question out to the digital preservation community asking how to solve the problem and she was grateful to receive so many responses but at the same time was concerned about the language used. It was suggested that she 'scrub', 'clean' or 'detox' the file names in order to remove the 'illegal characters' but she was concerned that our attitudes towards accented characters further marginalises those who do not fit into our western ideals.

She also explored how removing or replacing these accented characters would impact on the files themselves and it was clear that meaning would change significantly. 'Campaign' (a word included in so many of the filenames) would change to 'bell'. She decided not to change the file names but to try and find a work around and she was eventually successful in finding a way to keep the filenames as they were (using the command line to turn the latin characters to UTF8). The message that she ended on was that we as archivists should do no harm whether we are dealing with physical or digital archives. We must juggle our priorities but think hard about where we compromise and what is important to preserve. It is possible to work through problems rather than work around them and we need to be conscious of the needs of collections that fall outside our defaults. This was real food for thought and prompted an interesting conversation on twitter afterwards.
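Here is a minimal sketch of my own that shows both sides of this in a few lines. It is not the method described in the talk (that used the command line), and the assumption that the original names were stored as Latin-1 is mine. It uses the Spanish word campaña (campaign), which becomes campana (bell) if the diacritic is stripped, and shows that re-encoding the bytes as UTF-8 keeps the name intact.

```python
import unicodedata


def strip_diacritics(text):
    """The 'scrubbing' approach: decompose characters and drop the accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def latin1_to_utf8(raw):
    """Re-encode bytes assumed to be Latin-1 as UTF-8 without changing any character."""
    return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    original = "campa\u00f1a"  # campaña
    print(strip_diacritics(original))  # the meaning changes

    stored = original.encode("latin-1")  # how the name might sit on an old disk
    converted = latin1_to_utf8(stored)
    print(converted.decode("utf-8") == original)  # the name survives
```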

Times Square selfie!

Not only did I have a fantastic week in New York (it's not every day you can pop out in your lunch break to take a selfie in Times Square!), but I also came away with lots to think about. PASIG is a bit closer to home next year (in Oxford) so I am hoping I'll be there!

Jenny Mitcham, Digital Archivist

Wednesday, 19 October 2016

Filling the Digital Preservation Gap - final report available

Today we have published our third and final Filling the Digital Preservation Gap report.

The report can be accessed from Figshare: https://dx.doi.org/10.6084/m9.figshare.4040787

This report details work the team at the Universities of York and Hull have been carrying out over the last six months (from March to September 2016) during phase 3 of the project.

The first section of the report focuses on our implementation work. It describes how each institution has established a proof of concept implementation of Archivematica integrated with other systems used for research data management. As well as describing how these implementations work it also discusses future priorities and lessons learned.

The second section of the report looks in more detail at the file format problem for research data. It discusses DROID profiling work that has been carried out over the course of the project (both for research data and other data types) and signature development to increase the number of research data signatures in the PRONOM registry. In recognition of the fact that this is an issue that can only be solved as a community, it also includes recommendations for a variety of different stakeholder groups.

The final section of the report details the outreach work that we have carried out over the course of this final phase of the project. It has been a real pleasure to have been given an opportunity to speak about our work at so many different events and to such a variety of different groups over the last few months!

The last of this run of events in our calendars is the final Jisc Research Data Spring showcase in Birmingham tomorrow (20th October). I hope to see you there!

Jenny Mitcham, Digital Archivist

Tuesday, 11 October 2016

Some highlights from iPRES 2016

A lovely view of the mountains from Bern

Last week I was at iPRES 2016 - the 13th International Conference on Digital Preservation and one of the highlights of the digital preservation year.

This year the conference was held in the Swiss town of Bern. A great place to be based for the week - fantastic public transport, some lovely little restaurants and cellar bars, miles of shopping arcades, bizarre statues and non-stop sunshine!

There was so much content over the course of the 4 days that it is impossible to cover it all in one blog post. Instead I offer up a selection of highlights and takeaway thoughts.

Jeremy York from the University of Michigan gave an excellent paper about ‘The Stewardship Gap’, an interesting project that aims to understand the gap between valuable digital data and long term curation. Jeremy reported on the results of a series of interviews with researchers at his institution where they were asked about the value of the data they created and their plans for longer term curation. A theme throughout the paper was data value and how we assess it. Most researchers interviewed felt that their data did have long term value (and were able to articulate the reasons why). Most of the respondents expressed an intention to preserve the data for the longer term but did not have any concrete plans as to how they would achieve this. It was not yet clear to the project whether an intention to preserve actually leads to deposit with a repository or not. Work on this project is ongoing and I’ll look forward to finding out more when it is available.

Bern at night

As always there was an array of excellent posters. There were two in particular that caught my eye this year.

Firstly a poster from the University of Illinois at Urbana-Champaign entitled Should We Keep Everything Forever?: Determining Long-Term Value of Research Data.

The poster discussed an issue that we have also been grappling with recently as part of Filling the Digital Preservation Gap, that of the value of research data. It proposed an approach to assessing the value of content within the Illinois Data Bank using automated methods and measurable criteria. Recognising that a human eye is also important in assessing value, it would highlight those datasets that appear to have a low value which can then be assessed in a more manual fashion. This pragmatic two-stage approach will ensure that data thought to be of low value can be discarded after 5 years but that time intensive manual checking of datasets is kept to a minimum. This is a useful model that I would like to hear more about once they get it fully established. There was a lot of buzz around this poster and I wasn’t surprised to see it shortlisted for the best poster award.

Another excellent poster (and worthy winner of the best poster award) was To Act or Not to Act - Handling File Format Identification Issues in Practice. This poster from ETH Zurich described how the institution handles file identification and validation errors within their digital archive and showed some worked examples of the types of problems they encountered. This kind of practical sharing of the nuts and bolts of digital preservation is really good to see, and very much in line with the recommendations we are making as part of Filling the Digital Preservation Gap. As well as finding internal solutions to these problems I hope that ETH Zurich are also passing feedback to the tool providers to ensure that the tools work more effectively and efficiently for other users. It is this feedback loop that is so important in helping the discipline as a whole progress.

OAIS panel session in full swing

A panel session on Monday afternoon entitled ‘OAIS for us all’ was also a highlight. I was of course already aware that the OAIS standard is currently under review and that DPC members and other digital preservation practitioners are invited and encouraged to contribute to the discussion. Despite best intentions and an obvious interest in the standard I had not yet managed to engage with the review. This workshop was therefore a valuable opportunity to get up to speed with the process (as far as the panel understood it!) and the community feedback so far.

It was really useful to hear about the discussions about OAIS that have been held internationally and of course interesting to note the common themes recurring throughout – for example around the desire for a pre-ingest step within the model, the need to firm up the reference model to accommodate changes to AIPs that may occur through re-ingest, and around the need for openness with regard to audit and certification standards.

This session was a great example of an international collaboration to help shape the standards that we rely so much on. I do hope that the feedback from our community is given full consideration in the revised OAIS Reference Model.

Me and Steve presenting our paper
(image from @shirapeltzman)

On Tuesday morning I gave a paper with Steve Mackey from Arkivum in the Research Data Preservation session (and I was really pleased that there was a whole session devoted to this topic). I presented on our work to link repositories to Archivematica, through the Filling the Digital Preservation Gap project, and focused in particular on the long tail of research data file formats and the need to address this as a community. It was great to be able to talk to such a packed room and this led to some really useful discussions over the lunch break and at the conference dinner that evening.

One of the most interesting sessions of the conference for me was one that was devoted to ingest tools and methods. At a conference such as this, I'm always drawn to the sessions that focus on practical tools and first hand experiences of doing things rather than the more theoretical strands so this one was an obvious choice for me. First we had Bruno Ferreira from KEEP SOLUTIONS talking about the Database Preservation Toolkit (more about this toolkit later). Then "Exploring Friedrich Kittler's Digital Legacy on Different Levels: Tools to Equip the Future Archivist" by Jurgen Enge and Heinz Werner Kramski from the University of Art and Design in Basel.

It was fascinating to see how they have handled the preservation of a large, diverse and complex digital legacy and overcome some of the challenges and hurdles that this has thrown at them. The speakers also made the point that the hardware itself is important evidence in its physical form, showing for instance how regularly Friedrich Kittler clearly used the reset button on his PC!

Conference delegates relaxing on the terrace

Two further presentations focused on the preservation of e-mail - something that I have little experience of but I am sure I will need to work on in the future. Claus Jensen from the Royal Library in Denmark presented a solution for acquisition of email. This seemed a very pragmatic approach and the team had clearly thought through their requirements well and learned from their initial prototype before moving to a second iteration. I'm keen to follow up on this and read the paper in more detail.

Brent West from the University of Illinois followed on with another interesting presentation on Processing Capstone Email using Predictive Coding. This talk focused on the problems of making appraisal decisions and sensitivity assessments for email and how a technology assisted review could help, enabling the software to learn from human decisions that are made and allowing human effort to be reduced and targeted. Again, I felt that this sort of work could be really useful to me in the future if I am faced with the task of e-mail preservation at scale.

A very expensive gin and tonic!

The BitCurator Mixer on the Tuesday night provided a good opportunity to talk to other users of BitCurator. I confess to not actually being a user just yet but having now got my new ingest PC on my desk, it is only a matter of time before I get this installed and start playing and testing. Good to talk to some experienced users and pick up some tips regarding how to install it and where to find example workflows. What sticks most in my mind though is the price of the gin and tonic at the bar we were in!

On Wednesday afternoon I took part in a workshop called OSS4PRES 2.0: Building Bridges and Filling Gaps – essentially a follow on from a workshop called Using Open-Source Tools to Fulfill Digital Preservation Requirements that I blogged about from iPRES last year. This was one of those workshops where we were actually expected to work (always a bit of a shock after lunch!) and the participants split into groups to address 3 different areas. One group was looking at gaps in the open source digital preservation tool set that we should be looking to fill (either by enhancing existing tools or with the development of new tools). Another group was working on drawing up a set of guidelines for providers of open source tools. The group I was in was thinking about the creation of a community space for sharing digital preservation workflows. This is something that I think could turn into a really valuable resource for practitioners who want to see how others have implemented tools. All the groups came out with lots of ideas and an action plan by the end of the afternoon and work in these areas is scheduled to continue outside of the workshop. Great to see a workshop that is more than just a talking shop but that will lead to some more concrete results.

My laptop working hard at the Database
Preservation workshop

On Thursday morning I attended another really useful hands-on workshop called Relational Database Preservation Standards and Tools. Participants were encouraged to try out the SIARD Suite and Database Preservation Toolkit on their own laptops. The value and outcomes of this workshop were clear and it really gave a sense of how we might use these tools to create preservation versions of data from relational databases. Designed to work with a number of widely used relational database systems, the tools allow data to be extracted into the SIARD 2 format. This format is essentially a zip file containing the relevant information in XML. It goes one better than the CSV format (the means by which I have preserved databases in the past) as it contains information about the structure of the data as well as the content itself, and allows you to add metadata about how the data was extracted. It looks to be particularly useful for taking snapshots of live and active databases for preservation on a regular cycle. I could definitely see myself using these tools in the future.
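As a rough illustration of the 'zip file containing XML' idea, the Python sketch below builds a tiny zip archive in memory and then summarises the XML files inside it. The member names and XML elements are invented for the example and are not taken from the SIARD specification; a real SIARD 2 file follows its own defined layout and would be produced by the tools mentioned above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def build_toy_archive() -> bytes:
    """Build an in-memory zip with a made-up layout standing in for a SIARD file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Hypothetical member names and XML, for illustration only
        archive.writestr("header/metadata.xml",
                         "<archive><table name='deposits' rows='2'/></archive>")
        archive.writestr("content/deposits.xml",
                         "<table><row><c1>1</c1></row><row><c1>2</c1></row></table>")
    return buffer.getvalue()

def summarise(data: bytes) -> dict:
    """Return each XML member's root tag and its number of direct children."""
    summary = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for member in archive.namelist():
            if member.endswith(".xml"):
                root = ET.fromstring(archive.read(member))
                summary[member] = (root.tag, len(root))
    return summary

summary = summarise(build_toy_archive())
for member, (tag, children) in summary.items():
    print(member, tag, children)
```

The point is simply that structure (the metadata file) and content (the table file) travel together in one package, which a CSV export on its own cannot offer.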

iPRES2016 and Suisse Toy 2016 meet outside the venue

There was some useful discussion at the end of the session about how these tools would actually fit into a wider preservation workflow and whether they could be incorporated into digital preservation systems (for example Archivematica) and configured as an automatic migration path for Microsoft Access databases. The answer was yes, but the tool creators suggested that full automation may not be the best approach. A human eye is typically required to establish which bits of the database should be preserved and retained and to tailor the creation of the SIARD 2 file accordingly.

On the last afternoon of the conference it was good to be able to pop into the Swiss Toy Fair which was being held at the same venue as the conference. A great opportunity to buy some presents for the family before heading back to the UK.

Jenny Mitcham, Digital Archivist

Wednesday, 21 September 2016

File format identification at Norfolk Record Office

This is a guest post from Pawel Jaskulski who has recently completed a Transforming Archives traineeship at Norfolk Record Office (NRO). As part of his work at Norfolk and in response to a question I posed in a previous blog post ("Is identification of 37% of files a particularly bad result?") he profiled their digital holdings using DROID and has written up his findings. Coming from a local authority context, his results provide an interesting comparison with other profiles that have emerged from both the Hull History Centre and the Bentley Historical Library and again help to demonstrate that the figure of 37% identified files for my test research dataset is unusual.

King's Lynn's borough archives are cared for jointly by the Borough Council and the Norfolk Record Office

Profiling Digital Records with DROID

With any local authority archive there is an assumption that an accession might contain literally anything. In 'digital terms' this means that it is impossible to predict what sort of data might arrive in the future. That is why NRO have been actively developing their digital preservation strategy, with the aim of building enough capability to be able to choose digital records over their paper-based equivalents (hard copies/printouts).

The archive service has been receiving accessions of digital records since the late 1990s. The majority of digitally born archives came in as hybrid accessions from local schools that were being closed down. For many records there were no paper equivalents. Other deposits containing digital records include architectural surveys and the archives of private individuals and local organisations (for example Parish Council meeting minutes).

The archive service have been using DROID as part of their archival processing procedure for digital records, as it connects to PRONOM, the most comprehensive and continuously updated file format registry. Archivematica, an ingest system that uses the PRONOM registry, is currently being introduced at NRO. It contains other file format identification tools such as FIDO and Siegfried (both of which use PRONOM identifiers).

The results of the DROID survey were as follows:

Using the latest signature file (v.86), identification was successful for 96.46% of the 49,117 files.

DROID identified 107 different file formats. The ten most frequently occurring file formats were:

Classification	File Format Name	Versions	PUIDs
Image (Raster)	JPEG File Interchange Format	1.01, 1.02	fmt/43, fmt/44
Image (Raster)	Exchangeable Image File Format (Compressed)	2.1, 2.2	x-fmt/390, x-fmt/391
Image (Raster)	Windows Bitmap	3	fmt/116
Text (Mark-up)	Hypertext Markup Language	4	fmt/96, fmt/99
Word Processor	Microsoft Word Document	97-2003	fmt/40
Image (Raster)	Tagged Image File Format		fmt/353
Email	Microsoft Outlook Email Message	97-2003	x-fmt/430
Miscellaneous	AppleDouble Resource Fork		fmt/503
Image (Raster)	Graphics Interchange Format	89a	fmt/4
Image (Raster)	Exchangeable Image File Format (Compressed)	2.2.1	fmt/645

Identification method breakdown:

83.31% was identified by signature
14.95% by container
1.73% by extension

458 files had mismatched extensions - that amounts to less than one per cent (0.97%). These were a variety of common raster image file formats (JPEG, PNG, TIFF), word processor formats (Microsoft Word Document, ClarisWorks Word Processor) and desktop publishing formats (Adobe Illustrator, Adobe InDesign Document, Quark Xpress Data File).
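For anyone wanting to check the arithmetic, the short Python sketch below works back from the figures quoted above. The file counts are derived from a rounded percentage, so they are approximate, and the idea that the 0.97% figure is measured against identified files rather than all files is my own inference from the numbers, not something stated in the survey.

```python
total_files = 49_117
identified_share = 0.9646   # 96.46% identified, as reported
mismatched = 458            # files with mismatched extensions, as reported

# Approximate counts, derived from the rounded percentage
identified = round(total_files * identified_share)
unidentified = total_files - identified

print(identified)                                   # 47378
print(unidentified)                                 # 1739
print(round(100 * unidentified / total_files, 2))   # 3.54
print(round(100 * mismatched / identified, 2))      # 0.97
print(round(100 * mismatched / total_files, 2))     # 0.93
```

Either way, mismatched extensions account for fewer than one file in a hundred.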

Among the 3.54% of files that were not identified there were 160 different unknown file extensions. The top five were:

.cmp
.mov
.info
.eml
.mdb
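A list like this can be produced by tallying the extensions of files that have no PUID. The Python sketch below shows one way this might be done from a CSV export of a profile. The sample rows are made up (they are not NRO results), and the column names are an assumption modelled on a DROID export, so check them against your own file.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for a DROID CSV export; column names are an assumption
SAMPLE_EXPORT = """NAME,EXT,METHOD,PUID
plan.cmp,cmp,,
clip.mov,mov,,
photo.jpg,jpg,Signature,fmt/43
notes.cmp,cmp,,
letter.doc,doc,Container,fmt/40
"""

def unidentified_extensions(export_text: str) -> Counter:
    """Count extensions for rows that have no PUID (i.e. no identification)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(export_text)):
        if not row["PUID"]:
            counts[row["EXT"].lower()] += 1
    return counts

counts = unidentified_extensions(SAMPLE_EXPORT)
for ext, n in counts.most_common(5):
    print(f".{ext}: {n}")
```

Swapping the sample text for the contents of a real export would give the equivalent list for your own holdings.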

Two files returned more than 1 identification:

A spreadsheet file with .xls extension (last modified date 2006-12-17) had 3 possible file format matches:

fmt/175 Microsoft Excel for Macintosh 2001
fmt/176 Microsoft Excel for Macintosh 2002
fmt/177 Microsoft Excel for Macintosh 2004

And an image file with a .bmp extension (last modified date 2007-02-06) received 2 file format matches:

fmt/116 Windows Bitmap 3
fmt/625 Apple Disk Copy Image 4.2

After closer inspection the actual file was a bitmap image file and PUID fmt/116 was the correct one.

Understanding the Results

DROID offers a very useful classification of file formats and puts all results into categories, which gives an overview of the digital collection. It is easy to understand what sort of digital content predominates within a digitally born accession, archive or collection. It uses a classification system that assigns file formats to broader groups such as Audio, Word Processor, Page Description and Aggregate. These help enormously in getting a grasp of the variety of digital records. For example it was interesting to discover that over half of our digitally born archives are in various raster image file formats.

Files profiled at Norfolk Record Office as classified by DROID

I am of course also interested in the levels of risk associated with particular formats so have started to work on an additional classification for the data, creating further categories that can help with preservation planning. This would help demonstrate where preservation efforts should be focused in the future.

Jenny Mitcham, Digital Archivist
