Friday, 27 November 2015

File identification ...let's talk about the workflows

When receiving any new batch of files to add to the digital archive there are lots of things I want to know about them but "What file formats have we got here?" is often my first question.

Knowing what you've got is of great importance to digital archivists because...
  • It enables you to find the right software to open the file and view the contents (all being well)
  • It can trigger a dialog with your donor or depositor about alternative formats you might wish to receive the data in (...all not being well)
  • It allows you to consider the risks that relate to that format and if appropriate define a migration pathway for preservation and/or access
We've come a long way in the last few years and we now have lots of tools to choose from to identify files. This could be seen as both a blessing and a curse. Each tool has strengths and weaknesses and it is not easy to decide which one to use (or indeed which combination of tools would give the best results) ...and once we've started using a tool, in what way do we actually use it?

So currently I have more questions about workflows - how do we use these tools and at what points do we interact with them or take manual steps?

Where file format identification tools are used in isolation, we can do what we want with the results. Where multiple identifications are given, we may be able to gather further evidence to convince us what the file actually is. Where there is no identification given, we may decide we can assign an identification manually. However, where file identification tools are incorporated into larger digital preservation systems, the workflow will be handled by the system and the digital archivist will only be able to interact in ways that have been configured by the developers.
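As a rough illustration of the three standalone cases described above (one identification, several, or none), the following Python sketch sorts tool output into groups. The `triage` function and the input shape, a mapping from file path to a list of candidate format identifiers, are assumptions made for this example and do not reflect the output format of any particular tool.

```python
from collections import defaultdict


def triage(results):
    """Group files by how many candidate identifications they received.

    `results` is a hypothetical mapping: file path -> list of candidate
    format identifiers (for example PRONOM PUIDs) reported by some tool.
    """
    groups = defaultdict(list)
    for path, candidates in results.items():
        unique = sorted(set(candidates))
        if not unique:
            # No identification: a candidate for manual identification.
            groups["unidentified"].append(path)
        elif len(unique) == 1:
            groups["identified"].append(path)
        else:
            # Several candidates: further evidence is needed.
            groups["ambiguous"].append(path)
    return dict(groups)


# Illustrative values only; the identifiers are placeholders.
example = {
    "report.pdf": ["fmt/18"],
    "data.bin": [],
    "notes.doc": ["fmt/40", "fmt/39"],
}
print(triage(example))
```

The "unidentified" and "ambiguous" groups are the points where a manual step would come in.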

As part of our Jisc-funded "Filling the Digital Preservation Gap" project, one of the areas of development we are working on is file identification within Archivematica. This was seen as a development priority because our project is looking specifically at research data, and research data comes in a huge array of file formats, many of which will not currently be recognised by file format identification tools.

The project team...discussing file identification workflows...probably

Here are some of the questions we've been exploring:
  • What should happen if you ingest data that can't be identified? Should you get notification of this? Should you be offered the option to try other file id methods/tools for those non-identified files?
  • Should we allow the curator/digital archivist to override file identifications, e.g. "I know this isn't really xxxx format so I'm going to record this fact" (and record this manual intervention in the metadata)? Can you envisage ever wanting to do this?
  • Where a tool gives more than one possible identification should you be allowed to select which identification you trust or should the metadata just keep a record of all the possible identifications?
  • Where a file is not identified at all, should you have the option to add a manual identification? If there is no Pronom id for a file (because it isn't yet in Pronom) how would you record the identification? Would it simply be a case of writing "MATLAB file" for example? How sustainable is this?
  • How should you share info around file formats/file identifications with the wider digital preservation community? What is the best way to contribute to file format registries such as Pronom?
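One way to think about the override and manual identification questions is as a record that keeps every identification and notes who made each one. The Python sketch below is purely hypothetical: the class and field names are invented for illustration and are not Archivematica's data model or any metadata standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Identification:
    """One format identification for a file (all field names are hypothetical)."""
    format_name: str             # free text, e.g. "MATLAB file"
    registry_id: Optional[str]   # e.g. a Pronom id, or None if not yet registered
    source: str                  # name of the tool, or "manual"
    agent: Optional[str] = None  # who made a manual decision
    note: Optional[str] = None   # why the decision was made
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class FileRecord:
    path: str
    identifications: List[Identification] = field(default_factory=list)

    def add(self, ident):
        # Append rather than overwrite, so earlier tool output is retained.
        self.identifications.append(ident)

    def current(self):
        # In this sketch the most recent manual entry takes precedence;
        # otherwise the latest tool result is used.
        manual = [i for i in self.identifications if i.source == "manual"]
        if manual:
            return manual[-1]
        return self.identifications[-1] if self.identifications else None


record = FileRecord("results.mat")
record.add(Identification("MATLAB file", None, "manual",
                          agent="digital archivist",
                          note="No registry entry available"))
print(record.current().format_name)
```

Letting a manual entry take precedence is just one possible policy. An institution could equally keep all identifications on an equal footing, which is why configurability matters here.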

We've been talking to people but don't necessarily have all the answers just yet. Thanks to everyone who has been feeding into our discussions so far! The key point to make here is that perhaps there isn't really a right answer - our systems need to be configurable enough in order that different institutions can work in different ways depending on local policies. It seems fairly obvious that this is quite a big nut to crack and it isn't something that we can fully resolve within our current project.

For the time being, our Archivematica development work is focusing on allowing the digital curator to see a report of the files that are not identified, as a prompt for working out how to handle them. This will be an important step towards helping us to understand the problem. Watch this space for further information.

Jenny Mitcham, Digital Archivist

Wednesday, 25 November 2015

Sharing the load: Jisc RDM Shared Services events

This is a guest post from Chris Awre, Head of Information Services, Library and Learning Innovation at the University of Hull. Chris has been working with me on the "Filling the Digital Preservation Gap" project.

On 18th and 19th November, Jenny and I attended two events held by Jisc at Aston University looking at shared services for research data management. This initiative has come about because many, if not all, institutions have struggled to identify a concrete way forward for managing research data, and there is widespread acknowledgement that some form of shared service provision will be of benefit. To this end, the first day was about refining requirements for this provision, and saw over 70 representatives from across Higher Education feed in their ideas and views. Over the day, an initial requirements list was extensively refined, extended and clarified. Jisc has provided its own write-up of the day, which usefully describes the process undertaken.

Jenny and I were kindly invited to the event to contribute our experience of analysing requirements for digital preservation for research data management. The brief presentation we gave highlighted the importance of digital preservation as part of a full RDM service, stressing how a lack of digital preservation planning has led to data loss over time, and how consideration of requirements has been based on long established principles from the OAIS Reference Model and earlier work at York. Essentially the message was – make sure that any RDM shared service encompasses digital preservation, even if institutions have different policies about what does and does not get pushed through it.

Thankfully, it seems that Jisc has indeed taken this on board as part of the planning process, and the key message was reiterated on a number of occasions during the day. Digital preservation is also built into the procurement process that Jisc is putting together (of which more below). It was great to be having discussions about research data management during the day where digital preservation was an assumed component. The group was broken up to discuss different elements of the requirements for the latter half of the morning, and by chance I was on the table discussing digital preservation. This discussion highlighted most of the draft requirements as mandatory, but also split up some of the others and expanded most of them. Context is everything when defining digital preservation workflows, and the challenge was to identify requirements that could work across many different institutions. We await the final list to see how successful we have all been.

The second day was focused on suppliers who may have an interest in bidding for the tender that Jisc will be issuing shortly. A range of companies were represented, covering the different areas that could be bid for. What became apparent during Day 1 was the need to provide a suite of shared services, not a single entity. The tender process acknowledges this, and there are 8 Lots covering different aspects. These are to be confirmed, and will be presented in the tender itself. However, suffice it to say that digital preservation is central to two of these: one for providing a shared service platform for digital preservation; and one to provide digital preservation tools that can be used independently by institutions wishing to build them in outside of a platform. This separation offers flexibility in how DP is embedded, and it will be interesting to see what options emerge from the procurement process.

Jenny and I have been invited to sit on the Advisory Group for the development of the RDM shared service(s), so will have an ongoing ability to raise digital preservation as a key component of the RDM service. Jisc is also looking for institutions to act as pilots for the service over the next two years. This provides a good opportunity to work with service providers to establish what works locally, and the experiences will serve the wider sector well as we continue to tackle the issues of managing research data.

Jenny Mitcham, Digital Archivist

Monday, 16 November 2015

The third UK Archivematica user group meeting

This is a guest post from Simon Wilson, University Archivist at the University of Hull based within the Hull History Centre. Simon has been working with me on the "Filling the Digital Preservation Gap" project and agreed to provide a short write up of the UK Archivematica group meeting in my absence.

With Jen presenting at iPRES in North Carolina, Julie Allinson and I attended the UK Archivematica user group meeting at the Laidlaw Library in Leeds. After the round table introductions from the 11 institutions that were represented, Julie began proceedings with a presentation on our Jisc "Filling the Digital Preservation Gap" project. She updated the group on the progress within this project since the last user group meeting 5 months previously and focused in particular on the development work and enhancements to Archivematica that are being undertaken in Phase 2.

A presentation from Fergus O'Connor and Claudia Roeck at the Tate highlighted their use of Archivematica for video art, with an estimated 500 items including video, film and slide material and the largest file some 20GB in size. It was interesting to hear how digital content had impacted on their video format migration policies and practices. As they were looking at this stage at just one particular format, they had been able to identify some of the micro services that weren't appropriate (for example OCR tools and bulk extractor). This was a timely reminder of the value of adjusting the workflow within Archivematica as necessary, and is something we will look at when developing the workflows at Hull for research data and born-digital archives.

One question raised was that of scalability, and Matthew Addis from Arkivum reported that Archivematica had been successfully tested with 100,000 files. There was an interesting discussion about whether availability of IT support was a barrier to take-up in institutions.

John Beaman from Leeds gave a thought-provoking session about data security and the issue of personally identifiable information and the impact this had on the processing of content. This is an issue we are familiar with for paper material, but we haven't spent a lot of time translating these experiences to digital material. There was lots of note taking in the discussion about anonymisation (removing references to personally identifiable information) and pseudonymisation (changing the personally identifiable information across the dataset) and the respective impact on security and data re-use (in summary, anonymisation is best for security and pseudonymisation best for re-use). The pointer to the ISO Code of practice on this has been added to my reading list. John also discussed encryption, which seems to be an important consideration for some data. These are important issues for anyone working with born-digital data regardless of the system they are using.
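The difference between the two approaches can be shown with a small Python sketch. This is a simplified illustration under stated assumptions, not a recommended implementation: the function names, the row structure and the placeholder scheme are all invented for the example, and real datasets would need far more care (for instance, indirect identifiers in other fields).

```python
def anonymise(rows, personal_fields):
    """Return copies of the rows with the personal fields removed entirely."""
    return [
        {key: value for key, value in row.items() if key not in personal_fields}
        for row in rows
    ]


def pseudonymise(rows, personal_fields):
    """Return copies of the rows with personal values replaced by placeholders.

    The same original value always maps to the same placeholder, so records
    about one person can still be linked after the change.
    """
    mapping = {}
    output = []
    for row in rows:
        new_row = dict(row)
        for field_name in personal_fields:
            if field_name in new_row:
                key = (field_name, new_row[field_name])
                if key not in mapping:
                    mapping[key] = f"{field_name}-{len(mapping) + 1:03d}"
                new_row[field_name] = mapping[key]
        output.append(new_row)
    return output, mapping


# Fictional example data.
rows = [
    {"name": "A. Smith", "score": 7},
    {"name": "B. Jones", "score": 4},
    {"name": "A. Smith", "score": 9},
]
print(anonymise(rows, ["name"]))
pseudonymised, mapping = pseudonymise(rows, ["name"])
print(pseudonymised)
```

After pseudonymisation the two "A. Smith" rows still share a placeholder, which is what keeps the data useful for re-use. The returned `mapping` would allow re-identification, so in practice it would need to be stored securely or destroyed.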

Jonathan Ainsworth, also from Leeds, talked us through their work with their collections management system KE Emu and ePrints, and the challenges of fitting Archivematica into an existing workflow. He also highlighted the impossibility of trying to predict every possible scenario for receiving or processing digital content. There was an interesting discussion about providing evidence to support a business case, what might be considered useful measures, and cost models.

The day concluded with Sarah Romkey from Artefactual Systems joining us via Skype and bringing us up to speed with developments for v1.5 due later this year and v1.6 due in 2016. I am especially looking forward to getting my hands on the arrangement and appraisal tab being developed by colleagues at the Bentley Historical Library.

Jenny Mitcham, Digital Archivist

Thursday, 12 November 2015

iPRES workshop report: Using Open-Source Tools to Fulfill Digital Preservation Requirements

As promised by the conference hosts it was definitely Autumn in Chapel Hill!
Last week I was lucky enough to be at the iPRES conference.

iPRES is the international conference on digital preservation and is exactly the sort of conference I should be at (though somehow I have managed to miss the last 4 years). The conference was generally a fantastic opportunity to meet other people doing digital preservation and share experiences. Regardless of international borders, we are all facing very similar problems and grappling with the same issues.

Breakfast as provided at Friday's workshop
iPRES 2015 was in Chapel Hill, North Carolina this year. Jetlag aside (I gave up in the end and decided to maintain a more European concept of time) it was a really valuable experience. The large quantities of cakes, pastries and bagels also helped - hats off to the conference hosts for this!

One of the most useful sessions for me was Friday's workshop on ‘Using Open-Source Tools to Fulfill Digital Preservation Requirements’. This workshop was billed as a space to talk about open-source software and share experiences about implementing open-source solutions. As well as listening to a really interesting set of talks from others, it also gave me a valuable opportunity to talk about the Jisc “Filling the Digital Preservation Gap” project to an international audience.

Archivematica featured very heavily in the scheduled talks. Other tools such as ArchivesSpace, Islandora and BitCurator (and BitCurator Access) were also discussed, so it was good to learn more about them.

Of particular interest was an announcement from Sam Meister of the Educopia Institute about a project proposal called OSSArcFlow. This project will attempt to help institutions combine open source tools in order to meet their institutional needs. It will look at issues such as how systems can be combined and how integration and hand-offs (such as transfer of metadata) can be successfully established. They will be working directly with 11 partner institutions but the lessons learned (including workflow models, guidance and training) will be available to other interested partners. This project sounds really valuable and of relevance to the work we are currently doing in our "Filling the Digital Preservation Gap" project.

The workshop was held in the Sonja Haynes Center for Black Culture and History
Some highlights and takeaway thoughts from the contributed talks:
  • Some great ongoing work with Archivematica was described by Andrew Berger of the Computer History Museum in California. He mentioned that the largest file he has ingested so far is 320GB and that he has also successfully ingested 17,000 files in one go. The material he is working with spans 40 years and includes lots of unidentified files. Having used Archivematica for real for 6 months now, he feels he understands what each microservice is doing and has had some success with troubleshooting problems.
  • Ben Fino-Radin from the Museum of Modern Art reported that they have ingested 20TB in total using Archivematica, the largest file being 580GB. He anticipates that soon they will be attempting to ingest larger files than this. He uses Archivematica with high levels of automation. The only time he logs in to the Archivematica dashboard is to change a policy - he doesn't watch the ingest process and see the microservices running. From my perspective this is great to know as this high level of automation is something we are keen to establish at York for our institutional research data workflows.
  • Bonnie Gordon from the Rockefeller Archive Center talked about their work integrating Archivematica with ArchivesSpace. This integration was designed to pass rights and technical metadata from Archivematica to ArchivesSpace through automated processes.
  • Cal Lee from the University of North Carolina talked to us about BitCurator - now this is a tool I would really like to get playing with. I'm holding back until project work calms down, but I could see that it would be useful to use BitCurator as an initial step before data is ingested into Archivematica.
  • Mark Leggott from University of Prince Edward Island talked about Islandora and also put out a general plea to everyone to find a way to support or contribute to an open source project. This is an idea I very much support! Although open source tools are freely available for anyone to use, this doesn't mean that we should just use them and give nothing back. Even if a contribution cannot be made technically or financially, it could just be done through advocacy and publicity.
  • Me talking about "Filling the Digital Preservation Gap" - can I be one of my own highlights or is that bad form?
  • Courtney Mumma spoke on behalf of Artefactual Systems and gave us a step by step walk through of how to create a new Format Policy Rule in Archivematica. This was useful to see as it is not something I have ever attempted. Good to note also that instructions are available here.
  • Mike Shallcross and Max Eckard from Bentley Historical Library at the University of Michigan talked about their Mellon-funded project to integrate Archivematica and ArchivesSpace in an end-to-end workflow that also includes the deposit of content into a DSpace repository. This project should be of great interest to any institution that is using Archivematica due to the enhancements that are being made to the interface. A new appraisal and arrangement tab will enable digital curators to see in a more interactive and visual way which file types are represented within the archive, tag files to aid arrangement and view a variety of reports. This project is a good example of open source tools working alongside each other, all fulfilling very specific functions.
  • Kari Smith from MIT Libraries is using BitCurator alongside Archivematica for ingest and described some of the challenges of establishing the right workflows and levels of automation. Here's hoping some of the work of the proposed OSSArcFlow project will help with these sorts of issues.
  • Nathan Tallman of the University of Cincinnati Libraries is working with Fedora and Hydra along with other systems and is actively exploring Archivematica. He raised some interesting issues and questions about scalability of systems, how many copies of the data we need to keep (and the importance of getting this right), whether we should reprocess whole AIPs just because of a small metadata change and how we make sensible and pragmatic appraisal decisions. He reminded us all of how complicated and expensive this all is and how making the wrong decisions can impact in a big way on an organisation's budget.
I had to leave the workshop early to catch a flight home, but before I left I was able to participate in an interesting breakout discussion about the greatest opportunities and challenges of using open source tools for digital curation and the gaps that we see in the current workflows.

Goodbye iPRES and I very much hope to be back next year!

Jenny Mitcham, Digital Archivist