Friday, 27 November 2015

File identification ...let's talk about the workflows

When receiving any new batch of files to add to the digital archive, there are lots of things I want to know about them, but "What file formats have we got here?" is often my first question.

Knowing what you've got is of great importance to digital archivists because...
  • It enables you to find the right software to open the file and view the contents (all being well)
  • It can trigger a dialogue with your donor or depositor about alternative formats you might wish to receive the data in (...all not being well)
  • It allows you to consider the risks that relate to that format and if appropriate define a migration pathway for preservation and/or access
We've come a long way in the last few years and we now have lots of tools to choose from to identify files. This could be seen as both a blessing and a curse. Each tool has strengths and weaknesses and it is not easy to decide which one to use (or indeed which combination of tools would give the best results) ...and once we've started using a tool, in what way do we actually use it?

So currently I have more questions about workflows - how do we use these tools and at what points do we interact with them or take manual steps?

Where file format identification tools are used in isolation, we can do what we want with the results. Where multiple identifications are given, we may be able to gather further evidence to convince us what the file actually is. Where there is no identification given, we may decide we can assign an identification manually. However, where file identification tools are incorporated into larger digital preservation systems, the workflow will be handled by the system and the digital archivist will only be able to interact in ways that have been configured by the developers.

As part of our Jisc funded "Filling the Digital Preservation Gap" project, one of the areas of development we are working on is around file identification within Archivematica. This was seen to be a development priority because our project is looking specifically at research data and research data comes in a huge array of file formats, many of which will not currently be recognised by file format identification tools.

The project team...discussing file identification workflows...probably

Here are some of the questions we've been exploring:
  • What should happen if you ingest data that can't be identified? Should you get notification of this? Should you be offered the option to try other file id methods/tools for those non-identified files?
  • Should we allow the curator/digital archivist to override file identifications - e.g. "I know this isn't really xxxx format so I'm going to record this fact" (and record this manual intervention in the metadata)? Can you envisage ever wanting to do this?
  • Where a tool gives more than one possible identification should you be allowed to select which identification you trust or should the metadata just keep a record of all the possible identifications?
  • Where a file is not identified at all, should you have the option to add a manual identification? If there is no PRONOM ID for a file (because it isn't yet in PRONOM) how would you record the identification? Would it simply be a case of writing "MATLAB file" for example? How sustainable is this?
  • How should you share information about file formats and file identifications with the wider digital preservation community? What is the best way to contribute to file format registries such as PRONOM?

We've been talking to people but don't necessarily have all the answers just yet. Thanks to everyone who has been feeding into our discussions so far! The key point to make here is that perhaps there isn't really a right answer - our systems need to be configurable enough in order that different institutions can work in different ways depending on local policies. It seems fairly obvious that this is quite a big nut to crack and it isn't something that we can fully resolve within our current project.

For the time being our Archivematica development work is focusing in the first instance on allowing the digital curator to see a report of the files that are not identified, as a prompt for working out how to handle them. This will be an important step towards helping us to understand the problem. Watch this space for further information.
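To make this concrete, here is a minimal sketch of the kind of report we have in mind, using Siegfried (a freely available file identification tool) rather than Archivematica itself. It assumes Siegfried's 'sf' command is installed and that its JSON output lists each file's matches, with the id "UNKNOWN" when no PRONOM match is found - details worth checking against the version you have installed.

```python
import json
import subprocess
import sys

def unidentified_files(transfer_dir):
    """Return paths of files Siegfried could not match to a PRONOM id."""
    result = subprocess.run(
        ["sf", "-json", transfer_dir],
        capture_output=True, text=True, check=True
    )
    report = json.loads(result.stdout)
    unknown = []
    for f in report.get("files", []):
        matches = f.get("matches", [])
        # treat a file as unidentified if every match came back as UNKNOWN
        if not matches or all(m.get("id") == "UNKNOWN" for m in matches):
            unknown.append(f["filename"])
    return unknown

if __name__ == "__main__":
    for path in unidentified_files(sys.argv[1]):
        print(path)
```

A list like this is exactly the prompt the digital curator needs: each path can then be investigated, assigned a manual identification, or used as the starting point for a submission to PRONOM.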

Wednesday, 25 November 2015

Sharing the load: Jisc RDM Shared Services events

This is a guest post from Chris Awre, Head of Information Services, Library and Learning Innovation at the University of Hull. Chris has been working with me on the "Filling the Digital Preservation Gap" project.

On 18th/19th November, Jenny and I attended two events held by Jisc at Aston University looking at shared services for research data management.  This initiative has come about as many, if not all, institutions have struggled to identify a concrete way forward for managing research data, and there is widespread acknowledgement that some form of shared service provision will be of benefit.  To this end, the first day was about refining requirements for this provision, and saw over 70 representatives from across Higher Education feed in their ideas and views.  The day took an initial requirements list and refined, extended and clarified these extensively.  Jisc has provided its own write-up of the day that usefully describes the process undertaken.

Jenny and I were kindly invited to the event to contribute our experience of analysing requirements for digital preservation for research data management.  The brief presentation we gave highlighted the importance of digital preservation as part of a full RDM service, stressing how a lack of digital preservation planning has led to data loss over time, and how consideration of requirements has been based on long established principles from the OAIS Reference Model and earlier work at York. Essentially the message was – make sure that any RDM shared service encompasses digital preservation, even if institutions have different policies about what does and does not get pushed through it.

Thankfully, it seems that Jisc has indeed taken this on board as part of the planning process, and the key message was re-iterated on a number of occasions during the day.  Digital preservation is also built into the procurement process that Jisc is putting together (of which more below).  It was great to be having discussions about research data management during the day where digital preservation was an assumed component.  The group was broken up to discuss different elements of the requirements for the latter half of the morning, and by chance I was on the table discussing digital preservation.  This highlighted most of the draft requirements as mandatory, but also split up some of the others and expanded most of them.  Context is everything when defining digital preservation workflows, and the challenge was to identify requirements that could work across many different institutions.  We await the final list to see how successful we have all been.

The second day was focused on suppliers who may have an interest in bidding for the tender that Jisc will be issuing shortly.  A range of companies were represented covering the different areas that could be bid for.  What became apparent during Day 1 was the need to provide a suite of shared services, not a single entity.  The tender process acknowledges this, and there are 8 Lots covering different aspects.  These are to be confirmed, and will be presented in the tender itself.  However, suffice it to say that digital preservation is central to two of these: one for providing a shared service platform for digital preservation; and one to provide digital preservation tools that can be used independently by institutions wishing to build them in outside of a platform.  This separation offers flexibility in how DP is embedded, and it will be interesting to see what options emerge from the procurement process.

Jenny and I have been invited to sit on the Advisory Group for the development of the RDM shared service(s), so will have an ongoing ability to raise digital preservation as a key component of the RDM service.  Jisc is also looking for institutions to act as pilots for the service over the next two years.  This provides a good opportunity to work with service providers to establish what works locally, and the experiences will serve the wider sector well as we continue to tackle the issues of managing research data.

Monday, 16 November 2015

The third UK Archivematica user group meeting

This is a guest post from Simon Wilson, University Archivist at the University of Hull based within the Hull History Centre. Simon has been working with me on the "Filling the Digital Preservation Gap" project and agreed to provide a short write up of the UK Archivematica group meeting in my absence.

With Jen presenting at iPRES in North Carolina, Julie Allinson and I attended the UK Archivematica user group meeting at the Laidlaw Library in Leeds. After the round table introductions from the 11 institutions that were represented, Julie began proceedings with a presentation on our Jisc "Filling the Digital Preservation Gap" project. She updated the group on the progress within this project since the last user group meeting 5 months previously and focused in particular on the development work and enhancements to Archivematica that are being undertaken in Phase 2.

A presentation from Fergus O'Connor and Claudia Roeck at the Tate highlighted their use of Archivematica for video art, with an estimated 500 items including video, film and slide material and the largest file some 20GB in size. It was interesting to hear how digital content had impacted on their video format migration policies and practices. As they were looking at just one particular format at this stage, they had been able to identify some of the micro-services that weren't appropriate (for example OCR tools and bulk extractor) - a timely reminder of the value of adjusting the workflow within Archivematica as necessary. This is something we will look at when developing the workflows at Hull for research data and born-digital archives.

One question raised was that of scalability, and Matthew Addis from Arkivum reported that Archivematica had been successfully tested with 100,000 files. There was an interesting discussion about whether availability of IT support was a barrier to take-up in institutions.

John Beaman from Leeds gave a thought provoking session about data security and the issue of personally identifiable information and the impact this had on the processing of content. This is an issue we are familiar with for paper material but haven't spent a lot of time translating these experiences to digital material. There was lots of note taking in the discussion about anonymisation (removing references to personally identifiable information) and pseudonymisation (changing the personally identifiable information across the dataset) and the respective impact on security and data re-use (in summary, anonymisation is best for security and pseudonymisation best for re-use). The pointer to the ISO Code of practice on this has been added to my reading list. John also discussed encryption which seems to be an important consideration for some data. These are important issues for anyone working with born digital data regardless of the system they are using.

Jonathan Ainsworth, also from Leeds, talked us through their work with their collections management system KE EMu and EPrints - and the challenges of fitting Archivematica into an existing workflow. He also highlighted the impossibility of trying to predict every possible scenario for receiving or processing digital content. There was an interesting discussion about providing evidence to support a business case, what might be considered useful measures, and cost models.

The day concluded with Sarah Romkey from Artefactual Systems joining us via Skype and bringing us up to speed with developments for v1.5 due later this year and v1.6 due in 2016. I am especially looking forward to getting my hands on the arrangement and appraisal tab being developed by colleagues at Bentley Historical Library.

Thursday, 12 November 2015

iPRES workshop report: Using Open-Source Tools to Fulfill Digital Preservation Requirements

As promised by the conference hosts, it was definitely Autumn in Chapel Hill!
Last week I was lucky enough to be at the iPRES conference.

iPRES is the international conference on digital preservation and is exactly the sort of conference I should be at (though somehow I have managed to miss the last 4 years). The conference was generally a fantastic opportunity to meet other people doing digital preservation and share experiences. Regardless of international borders, we are all facing very similar problems and grappling with the same issues.

Breakfast as provided at Friday's workshop
iPRES 2015 was in Chapel Hill, North Carolina this year. Jetlag aside (I gave up in the end and decided to maintain a more European concept of time) it was a really valuable experience. The large quantities of cakes, pastries and bagels also helped - hats off to the conference hosts for this!

One of the most useful sessions for me was Friday's workshop on ‘Using Open-Source Tools to Fulfill Digital Preservation Requirements’. This workshop was billed as a space to talk about open-source software and share experiences about implementing open-source solutions. As well as listening to a really interesting set of talks from others, it also gave me a valuable opportunity to talk about the Jisc “Filling the Digital Preservation Gap” project to an international audience.

Archivematica featured very heavily in the scheduled talks. Other tools such as ArchivesSpace, Islandora and BitCurator (and BitCurator Access) were also discussed, so it was good to learn more about them.

Of particular interest was an announcement from Sam Meister of the Educopia Institute about a project proposal called OSSArcFlow. This project will attempt to help institutions combine open source tools in order to meet their institutional needs. It will look at issues such as how systems can be combined and how integration and hand-offs (such as transfer of metadata) can be successfully established. They will be working directly with 11 partner institutions but the lessons learned (including workflow models, guidance and training) will be available to other interested partners. This project sounds really valuable and of relevance to the work we are currently doing in our "Filling the Digital Preservation Gap" project.

The workshop was held in the Sonja Haynes Center for Black Culture and History
Some highlights and takeaway thoughts from the contributed talks:
  • Some great ongoing work with Archivematica was described by Andrew Berger of the Computer History Museum in California. He mentioned that the largest file he has ingested so far is 320GB and that he has also successfully ingested 17,000 files in one go. The material he is working with spans 40 years and includes lots of unidentified files. Having used Archivematica for real for 6 months now, he feels he understands what each microservice is doing and has had some success with troubleshooting problems.
  • Ben Fino-Radin from the Museum of Modern Art reported that they have ingested 20TB in total using Archivematica, the largest file being 580GB. He anticipates that soon they will be attempting to ingest larger files than this. He uses Archivematica with high levels of automation. The only time he logs in to the Archivematica dashboard is to change a policy - he doesn't watch the ingest process and see the microservices running. From my perspective this is great to know as this high level of automation is something we are keen to establish at York for our institutional research data workflows.
  • Bonnie Gordon from the Rockefeller Archive Center talked about their work integrating Archivematica with ArchivesSpace. This integration was designed to pass rights and technical metadata from Archivematica to ArchivesSpace through automated processes.
  • Cal Lee from the University of North Carolina talked to us about BitCurator - now this is a tool I would really like to get playing with. I'm holding back until project work calms down, but I could see that it would be useful to use BitCurator as an initial step before data is ingested into Archivematica.
  • Mark Leggott from University of Prince Edward Island talked about Islandora and also put out a general plea to everyone to find a way to support or contribute to an open source project. This is an idea I very much support! Although open source tools are freely available for anyone to use, this doesn't mean that we should just use them and give nothing back. Even if a contribution cannot be made technically or financially, it could just be done through advocacy and publicity.
  • Me talking about "Filling the Digital Preservation Gap" - can I be one of my own highlights or is that bad form?
  • Courtney Mumma spoke on behalf of Artefactual Systems and gave us a step by step walk through of how to create a new Format Policy Rule in Archivematica. This was useful to see as it is not something I have ever attempted. Good to note also that instructions are available here.
  • Mike Shallcross and Max Eckard from Bentley Historical Library at the University of Michigan talked about their Mellon funded project to integrate Archivematica and ArchivesSpace in an end-to-end workflow that also includes the deposit of content into a DSpace repository. This project should be of great interest to any institution who is using Archivematica due to the enhancements that are being made to the interface. A new appraisal and arrangement tab will enable digital curators to see in a more interactive and visual way which file types are represented within the archive, tag files to aid arrangement and view a variety of reports. This project is a good example of open source tools working alongside each other, all fulfilling very specific functions.
  • Kari Smith from MIT Libraries is using BitCurator alongside Archivematica for ingest and described some of the challenges of establishing the right workflows and levels of automation. Here's hoping some of the work of the proposed OSSArcFlow project will help with these sorts of issues.
  • Nathan Tallman of the University of Cincinnati Libraries is working with Fedora and Hydra along with other systems and is actively exploring Archivematica. He raised some interesting issues and questions about scalability of systems, how many copies of the data we need to keep (and the importance of getting this right), whether we should reprocess whole AIPs just because of a small metadata change and how we make sensible and pragmatic appraisal decisions. He reminded us all of how complicated and expensive this all is and how making the wrong decisions can impact in a big way on an organisation's budget.
I had to leave the workshop early to catch a flight home, but before I left I was able to participate in an interesting breakout discussion about the greatest opportunities and challenges of using open source tools for digital curation and the gaps that we see in the current workflows.

Goodbye iPRES and I very much hope to be back next year!

Thursday, 29 October 2015

Spreading the word on the "other side of the pond"

A guest post by Richard Green who has been leading the University of Hull's technical investigations for "Filling the Preservation Gap".

Jenny is away from her desk at the moment so I've been deputised to provide a blog post around the work we've been doing at the University of Hull as part of the Jisc-funded "Filling the Preservation Gap" (FPG) project.  In particular we (the FPG team) want to mention a poster that we prepared for a recent conference in the US.

Hull has had a digital repository in place for a considerable number of years.  It has always had the Fedora (now Fedora Commons) repository software at its heart and for several years now has deployed Hydra over the top of that - indeed, Hull was a founder member of the Hydra Project.  With the established repository goes an established three-stage workflow for adding content.  Content is initially created in a “proto-queue” by a user who, when (s)he is happy with it, transfers it to the ownership of the Library who take it through a quality assurance process.  When the team in the Library is happy with it the content is transferred to the repository "proper" with appropriate access permissions.  The repository contains a wide range of materials and over time we are developing variants of this basic workflow suited to each content type but this activity is constrained by limited resources and we know there are other variations we would like to, and can, develop when circumstances permit. The lack of a specific workflow for research data management (RDM), encompassing the possible need for long-term preservation, was one of the reasons for getting involved in the FPG project.

Whilst the focus of the FPG project is clearly research data, it became apparent during our initial work with Archivematica that the preservation needs of research data were not so far removed from those we have for some of our other content.  That being the case we have kept our eye on the bigger picture whilst concentrating on RDM for the purposes of our Jisc project.  We have spent some time putting together an initial all-encompassing design through which an RDM workflow for the FPG project would be but one possible path. It is that overall picture that became our poster.

The Hydra Community holds one major get-together each year, the "Hydra Connect" conference.  The last full week in September saw 200 people from 55 institutions gather in Minneapolis for Connect 2015.  A regular feature of the conferences, much appreciated by the audience, is an afternoon given over to a poster session during which attendees can talk about the work they are doing with Hydra.  Each institution is strongly encouraged to contribute and so Hull took along its grand design as its offering.

Poster for Hydra Connect 2015 in Minneapolis, MN, September 2015

So that’s the poster and here’s a somewhat simplified explanation!

Essentially, content comes in at the left-hand side.  The upper entry point corresponds to a human workflow of the type we already have.  The diagram proposes that the workflow gain the option of sending the digital content of an object through Archivematica in order to create an archival information package (AIP) for preservation and also to take advantage of such things as the software’s capability to generate technical metadata.  The dissemination information package (DIP) that Archivematica produces is then “mined” for content and metadata that will be inserted into the repository object already containing the creator’s descriptive metadata record.  One of the items mined is a UUID that provides a tie-up between the record in the repository and the AIP which goes to a separate preservation store.
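As an illustration of that mining step, the sketch below shows one way the UUID tie-up might be picked out of a DIP on disk. The layout it assumes (a METS.<uuid>.xml file at the top of the DIP and access copies under objects/) reflects the DIPs we have seen from Archivematica, but both the layout and the function names are illustrative rather than part of any agreed Hull implementation.

```python
import re
from pathlib import Path

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)

def dip_uuid(dip_dir):
    """Pull the package UUID out of the DIP's METS.<uuid>.xml filename."""
    for mets in Path(dip_dir).glob("METS.*.xml"):
        match = UUID_RE.search(mets.name)
        if match:
            return match.group(0)
    raise ValueError("No METS.<uuid>.xml found in %s" % dip_dir)

def dissemination_files(dip_dir):
    """List the access copies held in the DIP's objects/ directory."""
    objects = Path(dip_dir) / "objects"
    return sorted(p for p in objects.rglob("*") if p.is_file())
```

The UUID returned here is the value that would be written into the repository object's metadata so that the matching AIP can always be located in the preservation store.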

The lower entry point corresponds to an automated (well, maybe semi-automated) batch ingest process.  In this case, the DIP processor creates a repository object from scratch and, in addition to possible dissemination files and technical metadata, provides the descriptive metadata too.  There are a number of scenarios for generating the descriptive metadata; at one extreme it might be detailed fields extracted from an accompanying file, at the other it might be minimal metadata derived from the context (the particular ingest folder and the title of the master file, for instance).  There will be circumstances when we create a metadata-only record for the repository and do not include dissemination files in the repository object; under these circumstances the UUID in the metadata would allow us to retrieve the AIP from store and create a new DIP should anyone ever request the data itself.
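For the minimal end of that spectrum, something as simple as the following sketch might be enough. The field names are purely illustrative (they are not Hull's actual metadata schema) and the rule for deriving a title from the master file's name is just one possible convention.

```python
from datetime import date
from pathlib import Path

def minimal_descriptive_metadata(master_file, ingest_folder):
    """Derive a bare-bones descriptive record from the ingest context alone."""
    master = Path(master_file)
    return {
        "title": master.stem.replace("_", " "),   # title taken from the master file's name
        "collection": Path(ingest_folder).name,   # e.g. the depositing project's folder
        "date_deposited": date.today().isoformat(),
        "source_filename": master.name,
    }
```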

Finally, we have content already in the repository where it is being “kept safe” but which really justifies a proper preservation copy.  We shall create a workflow that allows this to be passed to Archivematica so that an AIP can be created and stored.  It is probable that this route would use the persistent identifier (PID) of the Fedora object as the link to the AIP.

Suffice it to say that the poster was well received.  It generated quite a lot of interest, some of it from surprising quarters.  In conversation with one well-established practitioner in the repository field, from a major US university, I was told “I’ve never thought of things quite like that – and you’re probably right!” It’s sometimes reassuring to know that the work we undertake at the smaller UK universities is nevertheless respected in some of the major US institutions!

If you have any comments, or want further details, about our work then please get in touch via this blog.  We’re interested in your thoughts, perspectives and ideas about the approach.

Friday, 18 September 2015

Spreading the word at the Northern Collaboration Conference

Collaborating with other delegates at the start of the day
Photo credit: Northern Collaboration Conference, Kiran Mehta
I gave a presentation last week at the 2015 Northern Collaboration Conference. It was my first trip to this conference, which is primarily aimed at those working in academic libraries, and it proved to be an interesting day.

The theme of the day was 'Being digital: opportunities for collaboration in academic libraries' so I thought our collaborative Jisc Research Data Spring project was a perfect fit. It was great to have a new audience to talk to about our plans to 'fill the digital preservation gap' for research data. Though it is academic libraries that are taking on this challenge, my typical audience tends to be those working in archives.

Slides are available on slideshare.
(epic fail on getting the embed code to work in Blogger)
My slides are available on Slideshare for those who want to see what I was talking about.

I began by making sure that we were speaking the same language. Communication is a big issue for us digital archivists. If we talk in OAIS-speak, only other digital archivists will understand us. If however we use terms such as 'archiving' and 'curation' we fall into the trap of the multiple layers of meanings and (mis-)interpretations of these terms. This being not my usual audience, it was best to put my cards on the table at the start and establish basic principles.

Key takeaway message: This is not all about storage*

I then covered many of the questions included in the project FAQs that we produced in phase one of our project. Essentially:
  • Why are we doing this/ why do we need digital preservation?
  • What does research data look like?
  • What does Archivematica do?
  • What are its strengths and weaknesses?
  • How can we use it?

I was able to touch on the topic of the value of research data and how it is regarded by different researchers working in different disciplines. 

Researchers at York have different opinions on the value of their data and the challenges of curating it

The lack of clarity on the value of much of the data we will be looking after is the main reason why we propose the approach we are taking.

I'm inspired by Tim Gollins' paper 'Parsimonious Preservation: Preventing Pointless Processes!' which focuses on the primary need simply to collect the data and find out what you've got. Crucially, the 'knowing what you've got' step can be done with minimum expense through the use of available open source tools. Taking a pragmatic approach such as this is particularly appealing when the value of the data we are curating is such an unknown.

I then spoke briefly about phase two of the project through which we are trying to define our own workflows and implementation plans at York and Hull. I mentioned the development work that we are sponsoring as part of this project. Artefactual Systems are currently working on six different areas of development for us (as described in a previous blog post).

At the end of the session I handed out a short feedback form to try and gauge the level of interest in the project. Though only 6 from a total of 20 questionnaires were returned, respondents unanimously agreed they would go away and read the project FAQs in more detail and expressed an interest in a show and tell event once our proof of concepts were up and running. Most also thought they would download our report and talk to their colleagues about Archivematica and our project.

Quite a good result I think!

* though Archival Storage is still essential to any digital archive

Friday, 28 August 2015

Enhancing Archivematica for Research Data Management

Where has the time gone? ....we are now one month into phase two of "Filling the Digital Preservation Gap" ....and I have spent much of this first month on holiday!

Not quite a 'digital preservation gap' - just an excuse to show you my holiday snaps!
So with no time to waste, here is an update on what we are doing:

In phase two of our project we have two main areas of work. Locally at York and Hull we are going to be planning in more detail the proof of concept implementations of Archivematica for research data we hope to get up and running in phase three.

Meanwhile over in Canada, our collaborators at Artefactual Systems are starting work on a number of sponsored developments to help move Archivematica into a better position for us to incorporate it into our implementations for managing and preserving research data.

We have a project kick off call with Artefactual Systems scheduled for next week and we will be discussing our requirements and specifications for development in more detail, but in the meantime, here is a summary of the areas we are focusing on:

Automation of DIP generation on request 

Building on the AIP re-ingest functionality within Archivematica, which allows an AIP to be re-processed so that generation of a DIP can be delayed until it is actually requested, this feature will enable further automation of that process.

This feature is of particular benefit to those situations where the value of data is not fully understood. It is unnecessary to create an access copy of all research datasets as some of them will never be requested. In our workflows for long term management of research data we would like to trigger the creation of a copy of the data for dissemination and re-use on request rather than create one by default and this piece of work will make this workflow possible.
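To give a flavour of the workflow this enables, the sketch below shows how a repository front end might ask the Storage Service to re-process a stored AIP when a user requests the data, so that a fresh DIP is produced. The endpoint path, request body and credentials are assumptions based on our reading of the Storage Service API; check them against the documentation for your version before relying on them.

```python
import requests

STORAGE_SERVICE = "http://storage.example.ac.uk:8000"     # hypothetical host
API_KEY = "ApiKey archivist:xxxxxxxx"                     # placeholder credentials

def request_dip(aip_uuid, pipeline_uuid):
    """Ask the Storage Service to reprocess an AIP so that a DIP is produced."""
    response = requests.post(
        "%s/api/v2/file/%s/reingest/" % (STORAGE_SERVICE, aip_uuid),
        headers={"Authorization": API_KEY},
        json={"pipeline": pipeline_uuid, "reingest_type": "objects"},
    )
    response.raise_for_status()
    return response.json()
```

In our envisaged workflow a call like this would be made automatically when a dataset is requested, rather than a DIP being generated for every dataset at ingest.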

METS parsing tools

This development will involve creating a Python library which could be used by third party applications to understand the METS file that is contained within an Archivematica DIP. Additionally an HTTP REST service would be developed to allow third party applications to interact with the library in a programming language agnostic fashion.

This is key to being able to work with the DIP that is created by Archivematica within other repository or access systems that are not integrated with Archivematica. Both York and Hull have repositories built with Fedora and Hydra and this feature will allow the repositories to better understand the DIP that Archivematica creates. This development is in no way specific to a Fedora/Hydra repository and will equally benefit other repository platforms in use for RDM.
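By way of illustration, the kind of convenience such a library could offer might look like the sketch below: given the METS file from a DIP, list the original files and any PRONOM keys recorded in the embedded PREMIS metadata. The namespaces and element names reflect Archivematica METS files as we have seen them, but this is a sketch of the idea rather than the library Artefactual will build.

```python
from lxml import etree

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

def original_files(mets_path):
    """Yield the paths of files in the METS fileSec marked USE='original'."""
    tree = etree.parse(mets_path)
    for flocat in tree.xpath(
        '//mets:fileGrp[@USE="original"]/mets:file/mets:FLocat', namespaces=NS
    ):
        yield flocat.get("{http://www.w3.org/1999/xlink}href")

def pronom_keys(mets_path):
    """Collect PRONOM format registry keys from the embedded PREMIS records."""
    tree = etree.parse(mets_path)
    # local-name() sidesteps differences between PREMIS namespace versions
    return [e.text for e in tree.xpath('//*[local-name()="formatRegistryKey"]')]
```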

Improved file identification

This feature will enable Archivematica to report on any unidentified files within a transfer alongside access to the file identification tool output. Further enhancements could help curatorial staff to submit information to PRONOM by partially automating this process.

It was highlighted in our phase one report that the identification of research data file formats is a key area of concern when managing research data for the longer term. This feature will help users of Archivematica see which files haven’t been identified and thus enable them to take action to establish what they hold. This feature will also encourage communication with PRONOM to enhance the database of file formats for the future, thus enabling a more sustainable community approach to addressing this problem.

Generic Search API

This development will produce a proof of concept search REST API for Archivematica, allowing third party applications to query Archivematica for information about the objects in archival storage.

There is a need to be able to produce statistics or reports on RDM and digital preservation processes in order to obtain a clear picture of what data has been archived. This development will enable these statistics to be generated more easily and sustainably. For example this would enable tools such as the DMAonline dashboard in development at Lancaster University to pull out summary statistics from Archivematica.
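As a rough illustration of the sort of reporting this should make easy, the sketch below counts the AIPs in archival storage and totals their size using the existing Storage Service package listing. The endpoint, pagination and field names ("objects", "size", "meta"/"next") are assumptions based on our reading of the Storage Service API rather than the new search API, which does not yet exist.

```python
import requests

STORAGE_SERVICE = "http://storage.example.ac.uk:8000"       # hypothetical host
HEADERS = {"Authorization": "ApiKey archivist:xxxxxxxx"}    # placeholder credentials

def aip_summary():
    """Return (number of AIPs, total bytes) held in archival storage."""
    url = STORAGE_SERVICE + "/api/v2/file/?package_type=AIP"
    count, total_bytes = 0, 0
    while url:
        page = requests.get(url, headers=HEADERS).json()
        for package in page.get("objects", []):
            count += 1
            total_bytes += package.get("size", 0)
        next_path = page.get("meta", {}).get("next")
        url = STORAGE_SERVICE + next_path if next_path else None
    return count, total_bytes
```

A dashboard such as DMAonline would run queries like this on a schedule and present the results alongside other RDM statistics.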

Support for multiple checksum algorithms 

Currently Archivematica generates SHA256 checksums for all files, and inserts those into PREMIS fixity tags in the METS file. In addition, two premis:events are generated for each file. All three of these entries are currently hardcoded to assume SHA256. This development would include support for other hash algorithms such as MD5, SHA1 and SHA512.

Research data files can be large in size and/or quantity and may take some time to process through the Archivematica pipeline. One of the potential bottlenecks highlighted is checksum generation, which happens at more than one point in the process. SHA256 checksums can take a long time to create and it has been highlighted that having the option to alter the checksum algorithm within Archivematica could speed things up. Having additional configuration options within Archivematica will give institutions the flexibility to refine and configure their pipelines to reduce bottlenecks where appropriate.
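As a rough way of seeing why this matters, the snippet below times each of the algorithms mentioned over the same (hypothetical) large file using Python's hashlib. It is purely illustrative; the development work itself is about exposing the choice of algorithm as a configuration option within Archivematica.

```python
import hashlib
import time

def checksum(path, algorithm, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    for algorithm in ("md5", "sha1", "sha256", "sha512"):
        start = time.time()
        value = checksum("large_dataset.dat", algorithm)   # hypothetical test file
        print("%s: %.1fs %s..." % (algorithm, time.time() - start, value[:16]))
```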


Improved documentation

The ability to automate processes relating to preservation is of primary importance where few resources are available to manually process data of unknown value. Fuller documentation of how an automated workflow can be configured within Archivematica using the APIs that exist would be very helpful for those considering using Archivematica for RDM and will help remove some of the barriers to its use. We will therefore be funding a small piece of work to help improve Archivematica documentation for developers and those installing or administering the system.

We very much hope these enhancements will be useful to the wider community of Archivematica users and not just to those looking specifically at preserving research data.

Our thanks go to Artefactual Systems for helping to turn our initial development ideas into these more concrete proposals.

As ever we are happy to receive your thoughts and feedback so do get in touch if you are interested in the work we are carrying out or have things to share with us around these development ideas.