Tuesday, 8 December 2015

Addressing digital preservation challenges through Research Data Spring

With the short time scales at play in the Jisc Research Data Spring initiative it is very easy to find yourself so focussed on your own project that you don’t have time to look around and see what everyone else is doing. As phase 2 of Research Data Spring comes to an end we are taking time to reflect, to think about digital preservation for research data management, to look at the other projects and think about how all the different pieces of the puzzle fit together.

Our “Filling the Digital Preservation Gap” project is very specifically about digital preservation and we are focusing primarily on what happens once the researchers have handed over their data to us for long term safekeeping. However, ‘digital preservation’ is not a thing that exists in isolation. It is very much a part of the wider ecosystem for managing data. Different projects within Research Data Spring are working on specific elements of this infrastructure and this blog post will try to unpick who is doing what and how this work contributes to helping the community address the bigger challenges of digital preservation.

The series of podcast interviews that Jisc produced for each project provided a great starting point for finding out about the projects, and this has been complemented by some follow-up questions and discussions with project teams. Any errors or misinterpretations are my own. A follow-up session on digital preservation is planned for the next Research Data Spring sandpit later this week, so an update may follow next week in the light of that.

So here is a bit of a synthesis of the projects and how they relate to digital preservation and more specifically the Open Archival Information System (OAIS) reference model. If you are new to OAIS, this DPC technology watch report is a great introduction.

OAIS Functional Model (taken from the DPC Technology Watch report)

So, starting at the left of the diagram, at the point at which researchers (producers) are creating their data and preparing it for submission to a digital archive, the CREAM project (or “Collaboration for Research Enhancement by Active Metadata”) led by the University of Southampton hopes to change the way researchers use metadata. It is looking at how different disciplines capture metadata and how this enhances the data in the long run. They are encouraging dynamic capture of metadata at the point of data creation which is the point at which researchers know most about their data. The project is investigating the use of lab notebooks (not just for scientists) and also looking at templates for metadata to help streamline the research process and enable future reuse of data.

Whilst the key aims of this project do fall within the active data creation phase and thus outside of the OAIS model, they are still fundamental to the success of a digital archive and the value of working in this area is clear. One of the mandatory responsibilities of an OAIS is to ensure the independent utility of the data that it holds. In simple terms this means that the digital archive should ensure that as well as preserving the data itself, it also preserves enough contextual information and documentation to make that data re-usable for its designated community. This sounds simple enough but, speaking from experience as a digital archivist, this is the area that often causes frustration - going back to ask a data producer for documentation after the point of submission, at a time when they have moved on to a new project, can be a less than fruitful exercise. A methodology for encouraging metadata generation at the point of data creation, and for enabling it to be seamlessly submitted to the archive along with the data itself, would be most welcome.

Another project that sits cleanly outside of the OAIS model but impacts on it in a similar way is “Artivity” from the University of the Arts London. This is again about capturing metadata but with a slightly different angle. This project is looking at metadata to capture the creative process as an artist or designer creates a piece of digital art. They are looking at tools to capture both the context and the methodology so that in the future we can ask questions such as ‘how were the software tools actually used to create this artwork?’. As above, this project is enabling an institution to fulfil the OAIS responsibility of ensuring the independent utility of the data, but the documentation and metadata it captures is quite specific to the artistic process.

For both of these projects we would need to ensure that this rich metadata and documentation was deposited in the digital archive or repository alongside the data itself in a format that could be re-used in the future. As well as thinking about the longevity of file formats used for research data we clearly also need to think about file formats for documentation and metadata. Of course, when incorporating this data and metadata into a data archive or repository, finding a way of ensuring the link between the data and associated documentation is retained is also a key consideration.

The Clipper project (“Clipper: Enhancing Time-based Media for Research”) from City of Glasgow College provides another way of generating metadata - this time specifically aimed at time-based media (audio and video). Clipper is a simple tool that allows a researcher to cite a specific clip of a digital audio or video file. This solves a real problem in the citation and re-use of time-based media. The project doesn't relate directly to digital preservation but it could interface with the OAIS model at either end. Data produced from Clipper could be deposited in a digital archive (either alongside the audio or video file itself, or referencing a file held elsewhere). This scenario could occur when a researcher needs to reference or highlight a particular section to back up their research. At the other end of the spectrum, Clipper could also be a tool that the OAIS Access system encourages data consumers to utilise, for example by highlighting it as a way of citing a particular section of a video that they are enabling access to. The good news is that the Clipper team have already been thinking about how metadata from Clipper could be stored for the long term within a digital archive alongside the media itself. The choice of HTML as the native file format for metadata should ensure that this data can be fairly easily managed into the future.

Still on the edges of the OAIS model (and perhaps most comfortably sitting within the Producer-Archive Interface) is a project called “Streamlining deposit: OJS to Repository Plugin” from City University London, which intends to streamline the submission of papers to journals, and of associated datasets to repositories, for researchers. They are developing a plugin to send data direct from a journal to a data repository, smoothing the submission process for authors who need to make additional data available alongside their publications. This will ensure that the appropriate data gets deposited and linked to a publication in order to ultimately enable access for others.

Along a similar theme is “Giving Researchers Credit for their Data” from the University of Oxford. This project is also looking at more streamlined ways of linking data in repositories with publisher platforms and avoiding retyping of metadata by researchers. They are working on practical prototypes with Ubiquity, Elsevier and Figshare and looking specifically at the communication between the repository platform and publication platform.

Ultimately these two projects are all about giving researchers the tools to make depositing data easier and, in doing so, ensuring that the repository also gets the information it needs to manage the data in the long term. This impacts on digital preservation in two ways. First, easier deposit processes will encourage more data to be deposited in repositories where it can then be preserved. Secondly, data submitted in this way should include better metadata (with a direct link to a related publication), which will make the repository's job of providing access to this data easier and ultimately encourage re-use.

Other projects explore the stages of the research lifecycle that occur once the active research phase is over, addressing what happens when data is handed over to an archive or repository for longer term storage.

The “DataVault” project at the Universities of Edinburgh and Manchester is primarily addressing the Archival Storage entity of the OAIS model. They are establishing a DataVault - a safe place to store research data arising from research that has been completed. This facility will ensure that the data is stored unchanged for an appropriate period of time. Researchers will be encouraged to use this facility for data that isn’t suitable for deposit via the repository but that they wish to keep copies of. This will enable them to fulfil funder requirements around retention periods. The DataVault, whilst primarily a storage facility, will also carry out other digital preservation functions. Data will be packaged using the BagIt specification, an initial stab at file identification will be carried out using Apache Tika, and fixity checks will be run periodically to monitor the file store and ensure files remain unchanged. The project team have highlighted the fact that file identification is problematic in the sphere of research data as you work with so many data types across disciplines. This is certainly a concern that the “Filling the Digital Preservation Gap” project has shared.
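To illustrate the kind of periodic fixity checking described above, here is a minimal sketch in Python using SHA-256 checksums. This is purely illustrative - the DataVault's actual implementation may differ, and the function names are my own:

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Record a digest for every file under root, as at deposit time."""
    return {str(p.relative_to(root)): sha256_of(p)
            for p in Path(root).rglob("*") if p.is_file()}

def check_fixity(root, manifest):
    """Re-hash the store and report files that changed or went missing."""
    problems = []
    for name, digest in manifest.items():
        p = Path(root) / name
        if not p.is_file():
            problems.append((name, "missing"))
        elif sha256_of(p) != digest:
            problems.append((name, "checksum mismatch"))
    return problems
```

A BagIt bag works on much the same principle: the manifest of checksums travels inside the package alongside the payload files, so fixity can be verified wherever the bag ends up.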

Our own “Filling the Digital Preservation Gap” project focuses on some of the more hidden elements of a digital preservation system. We are not looking at digital preservation software or tools that a researcher will interact with, but with the help of Archivematica are looking at among other things the OAIS Ingest entity (how we process the data as it arrives in the digital archive) and the Preservation Planning entity (how we monitor preservation risks and react to them). In phase 3 we plan to address OAIS more holistically with our proof of concepts. I won’t go into any further detail here as our project already gets so much air space on this blog!

Another project looking more holistically at OAIS is “A Consortial Approach to Building an Integrated RDM System - Small and Specialist” led by the University for the Creative Arts. This project is looking at the whole technical infrastructure for RDM and in particular at how this infrastructure can be achievable for small and specialist research institutes with limited resources. In a phase 1 project report by Matthew Addis from Arkivum, a full range of workflows is described, covering many of the different elements of an OAIS. To give a few examples, there are workflows around data deposit (Producer-Archive Interface), research data archiving using Arkivum (Archival Storage), access using EPrints (Access), gathering and reporting usage metrics (Data Management) and, last but not least, a workflow for research data preservation using Archivematica which has parallels with some of the work we are doing in “Filling the Digital Preservation Gap”.

“DMAOnline” sits firmly within the Data Management entity of the OAIS, running queries on the functions of the other entities and producing reports. This tool, being created by the University of Lancaster, will report on the administrative data around research data management systems (including statistics around access, storage and the preservation of that data). Using a tool like this, institutions will be able to monitor their RDM activities at a high level, drill down to see some of the detail, and use this information to monitor the uptake of their RDM services or to make an assessment of their level of compliance with funder mandates. From the perspective of the “Filling the Digital Preservation Gap” project we are pleased that the DMAOnline team have agreed to include reporting on statistics from Archivematica in their phase 3 plans. One of the limitations of Archivematica highlighted in the requirements section of our own phase 1 report was the lack of reporting options within the system. A development we have been sponsoring during phase 2 of our project will enable third party systems such as DMAOnline to extract information from Archivematica for reporting purposes.

Much focus in RDM activities typically goes into the Access functional entity, which naturally follows on from viewing a summary of activity through DMAOnline. This is one of the more visible parts of the model - the end product if you like of all the work that goes on behind the scenes. A project with a key focus on access is “Software Reuse, Repurposing and Reproducibility” from the University of St Andrews. However, as is the case for many of these projects, it also touches on other areas of the model. At the end of the day, access isn't sustainable without preservation so the project team are also thinking more broadly about these issues.

This project is looking at software that is created through research (the software that much research data actually depends on). What happens to software written by researchers, or created through projects when the person who was maintaining it leaves? How do people who want to reuse the data get hold of the right software? The project team have been looking at how you assign identifiers to software, how you capture software in such a way to make it usable in the future and how you then make that software accessible. Versioning is also a key concern in this area - different versions of software may need to be maintained with their own unique identifiers in order to allow future users of the data to replicate the results of a particular study. Issues around the preservation of and access to software are a bit of a hot topic in the digital preservation world so it is great to see an RDS project looking specifically at this.

The Administration entity of an OAIS coordinates the other high level functional entities, oversees their operation and serves as a central hub for internal and external interactions. The “Extending OPD to cover RDM” project from the University of Edinburgh could be one of these external interactions. It has put in place a framework for recording what facilities and services your institution has in place for managing research data - covering technical infrastructure, policy and training. It allows an institution to make visible the information about its infrastructure and facilities and to compare or benchmark it against others. The level of detail in this profile goes far above and beyond OAIS but allows an organisation to report on how it is meeting, for example, the ‘Data repository for longer term access and preservation’ component.

In summary, it has been a useful exercise thinking about the OAIS model and how the different RDS projects in phase 2 fit within this framework. It is good to see how they all impact on and address digital preservation in some way - some by helping get the necessary metadata into the system or enabling a more streamlined deposit process, others by helping monitor or assess the performance of the systems in place, and some projects more directly addressing key entities within the model. The outputs from these projects complement each other - designed to solve different problems and addressing discrete elements of the complex puzzle that is research data management.

Jenny Mitcham, Digital Archivist

Wednesday, 2 December 2015

Research Data Spring - a case study for collaboration

Digital preservation is not a problem that any single institution can realistically find a solution to on their own. Collaboration with others is a great way of working towards sustainable solutions in a more effective way. This post is a case study about how we have benefited from collaboration whilst working on the "Filling the Digital Preservation Gap" project.

In late 2014 Jisc announced a new collaborative initiative called Research Data Spring. The project model specifically aimed to create innovative partnerships and collaborations between stakeholders at different HE institutions working within the field of Research Data Management. Project teams were asked to work in short sprints of between three and six months and were funded for a maximum of three phases of work. One of the projects lucky enough to be funded as part of this initiative was the “Filling the Digital Preservation Gap” project, a collaboration between the Universities of Hull and York. This was a valuable opportunity for teams at the two universities to work together on a shared solution to a shared problem and come up with a solution that might be beneficial to others.

The project team from Hull and York
The aim of the project was to address a perceived gap in existing research data management infrastructures around the active preservation of the data. Both Hull and York had existing digital repositories and sufficient storage provision but were lacking systems and workflows for fully addressing preservation. The project aimed to investigate the open source tool Archivematica and establish whether this would be a suitable solution to fill this gap.

As well as the collaboration between Hull and York, further collaborations emerged as the project progressed. 

Artefactual Systems are the organisation who support and develop Archivematica, and the project team worked closely with them throughout the project. Having concluded that Archivematica has great potential for helping to preserve research data, the project team highlighted several areas where they felt additional development was required in order to enhance existing functionality. Artefactual Systems were consulted in detail as the project team scoped out priorities for further work. They were able to offer many useful insights about the best way of tackling the problems we described. Their extensive knowledge of the system put them in a good position to look at the issues from various angles and find a solution which would meet our needs as well as the needs of the wider community of users. Artefactual Systems were also able to help us with one of our outreach activities, joining us (virtually) to give a presentation about our work.

The UK Archivematica group was kept informed about the project and invited to help shape the priorities for development (you can read a bit about this in a previous blog post). Experienced and established Archivematica users from the international community were also consulted to discuss the new features and to review how the proposed features would impact on their workflows. Ultimately, none of us wanted to create bespoke developments that were only going to be of use to Hull and York.

Collaboration with another Research Data Spring project being carried out at Lancaster University was also necessary to enable future join up of these two initiatives. One of the areas highlighted for further work was improved reporting within Archivematica. By sponsoring a development to enable data to be more easily exposed to third party applications, the project team worked closely with the DMAOnline project team at Lancaster to ensure the data would be made available in a manner that was suitable for their tool to work with.  

Another area of work that called for additional collaboration was in the area of file format identification. This is very much an area that the digital preservation community as a whole needs to work together on. For research data in particular, there are many types of file that are not identified by current identification tools and are not present within the Pronom registry of file types. We wanted to get greater representation of research data file formats within Pronom and also enhance Archivematica to enable better workflows for non-identified files (see my previous post for more about file identification workflows). This is why we have also been collaborating with the team at The National Archives who develop new file signatures for Pronom.

The collaborative nature of this project brought several benefits. Despite the short time scales at play (or perhaps because of them) there was a strength in working together on a new and innovative solution to preserve research data.

The universities of Hull and York were similar enough to share the same problem and see the need to fill the digital preservation gap, but different enough to introduce interesting variations in workflows and implementation strategies. This demonstrated that there is often more than one way to implement a solution depending on institutional differences.  

By collaborating and consulting widely, the project hoped to create a better final outcome and produce a set of enhancements and case studies that would benefit a wide community of users.

Jenny Mitcham, Digital Archivist

Friday, 27 November 2015

File identification ...let's talk about the workflows

When receiving any new batch of files to add to the digital archive there are lots of things I want to know about them but "What file formats have we got here?" is often my first question.

Knowing what you've got is of great importance to digital archivists because...
  • It enables you to find the right software to open the file and view the contents (all being well)
  • It can trigger a dialogue with your donor or depositor about alternative formats you might wish to receive the data in (...all not being well)
  • It allows you to consider the risks that relate to that format and if appropriate define a migration pathway for preservation and/or access
We've come a long way in the last few years and we now have lots of tools to choose from to identify files. This could be seen as both a blessing and a curse. Each tool has strengths and weaknesses and it is not easy to decide which one to use (or indeed which combination of tools would give the best results) ...and once we've started using a tool, in what way do we actually use it?

So currently I have more questions about workflows - how do we use these tools and at what points do we interact with them or take manual steps?

Where file format identification tools are used in isolation, we can do what we want with the results. Where multiple identifications are given, we may be able to gather further evidence to convince us what the file actually is. Where there is no identification given, we may decide we can assign an identification manually. However, where file identification tools are incorporated into larger digital preservation systems, the workflow will be handled by the system and the digital archivist will only be able to interact in ways that have been configured by the developers.
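As a toy illustration of how signature-based identification tools work under the hood, the sketch below matches a file's leading bytes against a tiny table of format signatures. Real tools such as DROID or Siegfried match thousands of byte patterns from the Pronom registry, often at non-zero offsets; the table and identifiers here are indicative only:

```python
# A toy signature table: (magic bytes at offset 0, PRONOM-style ID, label).
# Real registries hold thousands of signatures; this is purely illustrative.
SIGNATURES = [
    (b"%PDF-", "fmt/276", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "fmt/13", "PNG image"),
    (b"PK\x03\x04", "x-fmt/263", "ZIP container"),
]

def identify(path):
    """Return (puid, label) for the first matching signature, or None."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, puid, label in SIGNATURES:
        if header.startswith(magic):
            return (puid, label)
    return None  # unidentified - this is where the workflow questions begin
```

The interesting workflow decisions all sit around that final `return None`: what the system should do, and what the digital archivist should be allowed to do, when no signature matches.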

As part of our Jisc funded "Filling the Digital Preservation Gap" project, one of the areas of development we are working on is around file identification within Archivematica. This was seen to be a development priority because our project is looking specifically at research data and research data comes in a huge array of file formats, many of which will not currently be recognised by file format identification tools.

The project team...discussing file identification workflows...probably

Here are some of the questions we've been exploring:
  • What should happen if you ingest data that can't be identified? Should you get notification of this? Should you be offered the option to try other file id methods/tools for those non-identified files?
  • Should we allow the curator/digital archivist to override file identifications - e.g. "I know this isn't really xxxx format so I'm going to record this fact" (and record this manual intervention in the metadata)? Can you envisage ever wanting to do this? 
  • Where a tool gives more than one possible identification should you be allowed to select which identification you trust or should the metadata just keep a record of all the possible identifications?
  • Where a file is not identified at all, should you have the option to add a manual identification? If there is no Pronom id for a file (because it isn't yet in Pronom) how would you record the identification? Would it simply be a case of writing "MATLAB file" for example? How sustainable is this?
  • How should you share information around file formats/file identifications with the wider digital preservation community? What is the best way to contribute to file format registries such as Pronom?
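One hypothetical answer to the manual identification questions above is to record the curator's assertion alongside (rather than instead of) the tool's output, so that the intervention itself becomes part of the preservation metadata. The field names below are illustrative only - they are not Archivematica's actual schema:

```python
from datetime import datetime, timezone

def record_manual_identification(tool_result, asserted_format, curator, note):
    """Keep the tool's original output alongside the curator's assertion,
    so the manual intervention is itself part of the preservation record."""
    return {
        "tool_identification": tool_result,        # None, or candidate IDs
        "manual_identification": asserted_format,  # e.g. "MATLAB MAT-file"
        "asserted_by": curator,
        "asserted_on": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
```

Keeping both values means a later re-run of improved identification tools can be compared against the human judgement rather than silently overwriting it.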

We've been talking to people but don't necessarily have all the answers just yet. Thanks to everyone who has been feeding into our discussions so far! The key point to make here is that perhaps there isn't really a right answer - our systems need to be configurable enough in order that different institutions can work in different ways depending on local policies. It seems fairly obvious that this is quite a big nut to crack and it isn't something that we can fully resolve within our current project.

For the time being our Archivematica development work is focusing, in the first instance, on allowing the digital curator to see a report of the files that have not been identified, as a prompt to working out how to handle them. This will be an important step towards helping us to understand the problem. Watch this space for further information.

Jenny Mitcham, Digital Archivist

Wednesday, 25 November 2015

Sharing the load: Jisc RDM Shared Services events

This is a guest post from Chris Awre, Head of Information Services, Library and Learning Innovation at the University of Hull. Chris has been working with me on the "Filling the Digital Preservation Gap" project.

On 18th/19th November, Jenny and I attended two events held by Jisc at Aston University looking at shared services for research data management.  This initiative has come about as many, if not all, institutions have struggled to identify a concrete way forward for managing research data, and there is widespread acknowledgement that some form of shared service provision will be of benefit.  To this end, the first day was about refining requirements for this provision, and saw over 70 representatives from across Higher Education feed in their ideas and views.  The day took an initial requirements list and refined, extended and clarified it extensively.  Jisc has provided its own write-up of the day that usefully describes the process undertaken.

Jenny and I were kindly invited to the event to contribute our experience of analysing requirements for digital preservation for research data management.  The brief presentation we gave highlighted the importance of digital preservation as part of a full RDM service, stressing how a lack of digital preservation planning has led to data loss over time, and how our consideration of requirements has been based on long established principles from the OAIS Reference Model and earlier work at York.  Essentially the message was – make sure that any RDM shared service encompasses digital preservation, even if institutions have different policies about what does and does not get pushed through it.

Thankfully, it seems that Jisc has indeed taken this on board as part of the planning process, and the key message was re-iterated on a number of occasions during the day.  Digital preservation is also built into the procurement process that Jisc is putting together (of which more below).  It was great to be having discussions about research data management during the day where digital preservation was an assumed component.  The group was broken up to discuss different elements of the requirements for the latter half of the morning, and by chance I was on the table discussing digital preservation.  This highlighted most of the draft requirements as mandatory, but also split up some of the others and expanded most of them.  Context is everything when defining digital preservation workflows, and the challenge was to identify requirements that could work across many different institutions.  We await the final list to see how successful we have all been.

The second day was focused on suppliers who may have an interest in bidding to the tender that Jisc will be issuing shortly.  A range of companies were represented covering the different areas that could be bid for.  What became apparent during Day 1 was the need to provide a suite of shared services, not a single entity.  The tender process acknowledges this, and there are 8 Lots covering different aspects.  These are to be confirmed, and will be presented in the tender itself.  However, suffice to say that digital preservation is central to two of these: one for providing a shared service platform for digital preservation; and one to provide digital preservation tools that can be used independently by institutions wishing to build them in outside of a platform.  This separation offers flexibility in how DP is embedded, and it will be interesting to see what options emerge from the procurement process.

Jenny and I have been invited to sit on the Advisory Group for the development of the RDM shared service(s), so will have ongoing ability to raise digital preservation as a key component of RDM service.  Jisc is also looking for institutions to act as pilots for the service over the next two years.  This provides a good opportunity to work with service providers to establish what works locally, and the experiences will serve the wider sector well as we continue to tackle the issues of managing research data.

Jenny Mitcham, Digital Archivist

Monday, 16 November 2015

The third UK Archivematica user group meeting

This is a guest post from Simon Wilson, University Archivist at the University of Hull based within the Hull History Centre. Simon has been working with me on the "Filling the Digital Preservation Gap" project and agreed to provide a short write up of the UK Archivematica group meeting in my absence.

With Jen presenting at iPRES in North Carolina, Julie Allinson and I attended the UK Archivematica user group meeting at the Laidlaw Library in Leeds. After the round table introductions from the 11 institutions that were represented, Julie began proceedings with a presentation on our Jisc "Filling the Digital Preservation Gap" project. She updated the group on the progress made since the last user group meeting 5 months previously and focused in particular on the development work and enhancements to Archivematica being undertaken in Phase 2.

A presentation from Fergus O'Connor and Claudia Roeck at the Tate highlighted their use of Archivematica for video art - an estimated 500 items including video, film and slide material, with the largest file some 20GB in size. It was interesting to hear how digital content had impacted on their video format migration policies and practices. As at this stage they were looking at just one particular format, they had been able to identify some of the micro-services that weren't appropriate (for example OCR tools and bulk extractor) - a timely reminder of the value of adjusting the workflow within Archivematica as necessary. This is something we will look at when developing the workflows at Hull for research data and born-digital archives.

One question raised was that of scalability, and Matthew Addis from Arkivum reported that Archivematica had been successfully tested with 100,000 files. There was also an interesting discussion about whether the availability of IT support was a barrier to take-up in institutions.

John Beaman from Leeds gave a thought provoking session about data security and the issue of personally identifiable information and the impact this has on the processing of content. This is an issue we are familiar with for paper material but haven't spent a lot of time translating these experiences to digital material. There was lots of note taking in the discussion about anonymisation (removing references to personally identifiable information) and pseudonymisation (changing the personally identifiable information consistently across the dataset) and the respective impact on security and data re-use (in summary, anonymisation is best for security and pseudonymisation best for re-use). The pointer to the ISO Code of Practice on this has been added to my reading list. John also discussed encryption, which seems to be an important consideration for some data. These are important issues for anyone working with born digital data regardless of the system they are using.
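The distinction between the two approaches can be sketched in a few lines of Python: anonymisation discards the identifying field altogether, while pseudonymisation replaces it with a consistent keyed token so that records for the same person remain linkable. The record structure and field names here are hypothetical:

```python
import hashlib
import hmac

def anonymise(records, field):
    """Remove the identifying field entirely - safest for security,
    but records can no longer be linked to the same person."""
    return [{k: v for k, v in r.items() if k != field} for r in records]

def pseudonymise(records, field, secret):
    """Replace the identifier with a consistent keyed pseudonym, so the
    same person links across records without revealing who they are."""
    def pseudo(value):
        return hmac.new(secret.encode(), value.encode(),
                        hashlib.sha256).hexdigest()[:12]
    return [{**r, field: pseudo(r[field])} for r in records]
```

Because the pseudonym is derived with a secret key, anyone holding that key could re-identify individuals - which is exactly why pseudonymised data still needs careful security handling, whereas properly anonymised data does not.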

Jonathan Ainsworth, also from Leeds, talked us through their work with their collections management system KE EMu and EPrints - and the challenges of fitting Archivematica into an existing workflow. He also highlighted the impossibility of trying to predict every possible scenario for receiving or processing digital content. There was an interesting discussion about providing evidence to support a business case, what might be considered useful measures, and cost models.

The day concluded with Sarah Romkey from Artefactual Systems joining us via Skype and bringing us up to speed with developments for v1.5, due later this year, and v1.6, due in 2016. I am especially looking forward to getting my hands on the arrangement and appraisal tab being developed by colleagues at the Bentley Historical Library.

Jenny Mitcham, Digital Archivist

Thursday, 12 November 2015

iPRES workshop report: Using Open-Source Tools to Fulfill Digital Preservation Requirements

As promised by the conference hosts, it was definitely Autumn in Chapel Hill!
Last week I was lucky enough to be at the iPRES conference.

iPRES is the international conference on digital preservation and is exactly the sort of conference I should be at (though somehow I have managed to miss the last 4 years). The conference was generally a fantastic opportunity to meet other people doing digital preservation and share experiences. Regardless of international borders, we are all facing very similar problems and grappling with the same issues.

Breakfast as provided at Friday's workshop
iPRES 2015 was in Chapel Hill, North Carolina this year. Jetlag aside (I gave up in the end and decided to maintain a more European concept of time) it was a really valuable experience. The large quantities of cakes, pastries and bagels also helped - hats off to the conference hosts for this!

One of the most useful sessions for me was Friday's workshop on ‘Using Open-Source Tools to Fulfill Digital Preservation Requirements’. This workshop was billed as a space to talk about open-source software and share experiences about implementing open-source solutions. As well as listening to a really interesting set of talks from others, it also gave me a valuable opportunity to talk about the Jisc “Filling the Digital Preservation Gap” project to an international audience.

Archivematica featured very heavily in the scheduled talks. Other tools such as ArchivesSpace, Islandora and BitCurator (and BitCurator Access) were also discussed, so it was good to learn more about them.

Of particular interest was an announcement from Sam Meister of the Educopia Institute about a project proposal called OSSArcFlow. This project will attempt to help institutions combine open source tools in order to meet their institutional needs. It will look at issues such as how systems can be combined and how integration and hand-offs (such as transfer of metadata) can be successfully established. They will be working directly with 11 partner institutions but the lessons learned (including workflow models, guidance and training) will be available to other interested partners. This project sounds really valuable and of relevance to the work we are currently doing in our "Filling the Digital Preservation Gap" project.

The workshop was held in the Sonja Haynes Stone Center for Black Culture and History
Some highlights and takeaway thoughts from the contributed talks:
  • Some great ongoing work with Archivematica was described by Andrew Berger of the Computer History Museum in California. He mentioned that the largest file he has ingested so far is 320GB and that he has also successfully ingested 17,000 files in one go. The material he is working with spans 40 years and includes lots of unidentified files. Having used Archivematica in earnest for 6 months now, he feels he understands what each microservice is doing and has had some success with troubleshooting problems.
  • Ben Fino-Radin from the Museum of Modern Art reported that they have ingested 20TB in total using Archivematica, the largest file being 580GB. He anticipates that they will soon be attempting to ingest larger files than this. He uses Archivematica with high levels of automation: the only time he logs in to the Archivematica dashboard is to change a policy - he doesn't watch the ingest process and see the microservices running. From my perspective this is great to know, as this high level of automation is something we are keen to establish at York for our institutional research data workflows.
  • Bonnie Gordon from the Rockefeller Archive Center talked about their work integrating Archivematica with ArchivesSpace. This integration was designed to pass rights and technical metadata from Archivematica to ArchivesSpace through automated processes.
  • Cal Lee from the University of North Carolina talked to us about BitCurator - now this is a tool I would really like to get playing with. I'm holding back until project work calms down, but I could see that it would be useful to use BitCurator as an initial step before data is ingested into Archivematica.
  • Mark Leggott from University of Prince Edward Island talked about Islandora and also put out a general plea to everyone to find a way to support or contribute to an open source project. This is an idea I very much support! Although open source tools are freely available for anyone to use, this doesn't mean that we should just use them and give nothing back. Even if a contribution cannot be made technically or financially, it can be made through advocacy and publicity.
  • Me talking about "Filling the Digital Preservation Gap" - can I be one of my own highlights or is that bad form?
  • Courtney Mumma spoke on behalf of Artefactual Systems and gave us a step by step walk through of how to create a new Format Policy Rule in Archivematica. This was useful to see as it is not something I have ever attempted. Good to note also that instructions are available here.
  • Mike Shallcross and Max Eckard from Bentley Historical Library at the University of Michigan talked about their Mellon funded project to integrate Archivematica and ArchivesSpace in an end-to-end workflow that also includes the deposit of content into a DSpace repository. This project should be of great interest to any institution that is using Archivematica due to the enhancements that are being made to the interface. A new appraisal and arrangement tab will enable digital curators to see in a more interactive and visual way which file types are represented within the archive, tag files to aid arrangement and view a variety of reports. This project is a good example of open source tools working alongside each other, all fulfilling very specific functions.
  • Kari Smith from MIT Libraries is using BitCurator alongside Archivematica for ingest and described some of the challenges of establishing the right workflows and levels of automation. Here's hoping some of the work of the proposed OSSArcFlow project will help with these sorts of issues.
  • Nathan Tallman of the University of Cincinnati Libraries is working with Fedora and Hydra along with other systems and is actively exploring Archivematica. He raised some interesting issues and questions about scalability of systems, how many copies of the data we need to keep (and the importance of getting this right), whether we should reprocess whole AIPs just because of a small metadata change and how we make sensible and pragmatic appraisal decisions. He reminded us all of how complicated and expensive this all is and how making the wrong decisions can impact in a big way on an organisation's budget.
I had to leave the workshop early to catch a flight home, but before I left I was able to participate in an interesting breakout discussion about the greatest opportunities and challenges of using open source tools for digital curation and the gaps that we see in current workflows.

Goodbye iPRES and I very much hope to be back next year!

Jenny Mitcham, Digital Archivist

Thursday, 29 October 2015

Spreading the word on the "other side of the pond"

A guest post by Richard Green, who has been leading the University of Hull's technical investigations for "Filling the Digital Preservation Gap".

Jenny is away from her desk at the moment so I've been deputised to provide a blog post around the work we've been doing at the University of Hull as part of the Jisc-funded "Filling the Digital Preservation Gap" (FPG) project.  In particular we (the FPG team) want to mention a poster that we prepared for a recent conference in the US.

Hull has had a digital repository in place for a considerable number of years.  It has always had the Fedora (now Fedora Commons) repository software at its heart and for several years now has deployed Hydra over the top of that - indeed, Hull was a founder member of the Hydra Project.  With the established repository goes an established three-stage workflow for adding content.  Content is initially created in a “proto-queue” by a user who, when (s)he is happy with it, transfers it to the ownership of the Library who take it through a quality assurance process.  When the team in the Library is happy with it the content is transferred to the repository "proper" with appropriate access permissions.  The repository contains a wide range of materials and over time we are developing variants of this basic workflow suited to each content type but this activity is constrained by limited resources and we know there are other variations we would like to, and can, develop when circumstances permit. The lack of a specific workflow for research data management (RDM), encompassing the possible need for long-term preservation, was one of the reasons for getting involved in the FPG project.

Whilst the focus of the FPG project is clearly research data, it became apparent during our initial work with Archivematica that the preservation needs of research data are not so far removed from those of some of our other content.  That being the case, we have kept our eye on the bigger picture whilst concentrating on RDM for the purposes of our Jisc project.  We have spent some time putting together an initial all-encompassing design through which an RDM workflow for the FPG project would be but one possible path.  It is that overall picture that became our poster.

The Hydra Community holds one major get-together each year, the "Hydra Connect" conference.  The last full week in September saw 200 people from 55 institutions gather in Minneapolis for Connect 2015.  A regular feature of the conferences, much appreciated by the audience, is an afternoon given over to a poster session during which attendees can talk about the work they are doing with Hydra.  Each institution is strongly encouraged to contribute and so Hull took along its grand design as its offering.

Poster for Hydra Connect 2015 in Minneapolis, MN, September 2015

So that’s the poster and here’s a somewhat simplified explanation!

Essentially, content comes in at the left-hand side.  The upper entry point corresponds to a human workflow of the type we already have.  The diagram proposes that the workflow gain the option of sending the digital content of an object through Archivematica in order to create an archival information package (AIP) for preservation and also to take advantage of such things as the software’s capability to generate technical metadata.  The dissemination information package (DIP) that Archivematica produces is then “mined” for content and metadata that will be inserted into the repository object already containing the creator’s descriptive metadata record.  One of the items mined is a UUID that provides a tie-up between the record in the repository and the AIP which goes to a separate preservation store.

The lower entry point corresponds to an automated (well, maybe semi-automated) batch ingest process.  In this case, the DIP processor creates a repository object from scratch and, in addition to possible dissemination files and technical metadata, provides the descriptive metadata too.  There are a number of scenarios for generating the descriptive metadata; at one extreme it might be detailed fields extracted from an accompanying file, at the other it might be minimal metadata derived from the context (the particular ingest folder and the title of the master file, for instance).  There will be circumstances when we create a metadata-only record for the repository and do not include dissemination files in the repository object; under these circumstances the UUID in the metadata would allow us to retrieve the AIP from store and create a new DIP should anyone ever request the data itself.

Finally, we have content already in the repository where it is being “kept safe” but which really justifies a proper preservation copy.  We shall create a workflow that allows this to be passed to Archivematica so that an AIP can be created and stored.  It is probable that this route would use the persistent identifier (PID) of the Fedora object as the link to the AIP.

Suffice it to say that the poster was well received.  It generated quite a lot of interest, some of it from surprising quarters.  In conversation with one well-established practitioner in the repository field, from a major US university, I was told “I’ve never thought of things quite like that – and you’re probably right!” It’s sometimes reassuring to know that the work we undertake at the smaller UK universities is nevertheless respected in some of the major US institutions!

If you have any comments, or want further details, about our work then please get in touch via this blog.  We’re interested in your thoughts, perspectives and ideas about the approach.

Jenny Mitcham, Digital Archivist

Friday, 18 September 2015

Spreading the word at the Northern Collaboration Conference

Collaborating with other delegates at the start of the day
Photo credit: Northern Collaboration Conference, Kiran Mehta
I gave a presentation last week at the 2015 Northern Collaboration Conference. It was my first trip to this conference, which is primarily aimed at those working in academic libraries, and it proved to be an interesting day.

The theme of the day was 'Being digital: opportunities for collaboration in academic libraries' so I thought our collaborative Jisc Research Data Spring project was a perfect fit. It was great to have a new audience to talk to about our plans to 'fill the digital preservation gap' for research data. Though it is academic libraries that are taking on this challenge, my typical audience tends to be those working in archives.

My slides are available on Slideshare for those who want to see what I was talking about.

I began by making sure that we were speaking the same language. Communication is a big issue for us digital archivists. If we talk in OAIS-speak only other digital archivists will understand us. If however we use terms such as 'archiving' and 'curation' we fall into the trap of the multiple layers of meanings and (mis-) interpretations of these terms. This being not my usual audience, it was best to put my cards on the table at the start and establish basic principles.

Key takeaway message: This is not all about storage*

I then covered many of the questions included in the project FAQs that we produced in phase one of our project. Essentially:
  • Why are we doing this/ why do we need digital preservation?
  • What does research data look like?
  • What does Archivematica do?
  • What are its strengths and weaknesses?
  • How can we use it?

I was able to touch on the topic of the value of research data and how it is regarded by different researchers working in different disciplines. 

Researchers at York have different opinions on the value of their data and the challenges of curating it

The lack of clarity on the value of much of the data we will be looking after is the main reason why we propose the approach we are taking.

I'm inspired by Tim Gollins' paper 'Parsimonious Preservation: Preventing Pointless Processes!' which focuses on the primary need simply to collect the data and find out what you've got. Crucially, the 'knowing what you've got' step can be done with minimum expense through the use of available open source tools. Taking a pragmatic approach such as this is particularly appealing when the value of the data we are curating is such an unknown.

I then spoke briefly about phase two of the project through which we are trying to define our own workflows and implementation plans at York and Hull. I mentioned the development work that we are sponsoring as part of this project. Artefactual Systems are currently working on six different areas of development for us (as described in a previous blog post).

At the end of the session I handed out a short feedback form to try and gauge the level of interest in the project. Though only 6 from a total of 20 questionnaires were returned, respondents unanimously agreed they would go away and read the project FAQs in more detail and expressed an interest in a show and tell event once our proof of concepts were up and running. Most also thought they would download our report and talk to their colleagues about Archivematica and our project.

Quite a good result I think!

* though Archival Storage is still essential to any digital archive

Jenny Mitcham, Digital Archivist

Friday, 28 August 2015

Enhancing Archivematica for Research Data Management

Where has the time gone? ....we are now one month into phase two of "Filling the Digital Preservation Gap" ....and I have spent much of this first month on holiday!

Not quite a 'digital preservation gap' - just an excuse to show you my holiday snaps!
So with no time to waste, here is an update on what we are doing:

In phase two of our project we have two main areas of work. Locally at York and Hull we are going to be planning in more detail the proof of concept implementations of Archivematica for research data we hope to get up and running in phase three.

Meanwhile over in Canada, our collaborators at Artefactual Systems are starting work on a number of sponsored developments to help move Archivematica into a better position for us to incorporate it into our implementations for managing and preserving research data.

We have a project kick off call with Artefactual Systems scheduled for next week and we will be discussing our requirements and specifications for development in more detail, but in the meantime, here is a summary of the areas we are focusing on:

Automation of DIP generation on request 

This feature builds on the AIP re-ingest functionality within Archivematica, which allows an AIP to be re-processed and allows the generation of a DIP to be delayed until such time as one is requested; the enhancement will enable further automation of this process.

This feature is of particular benefit to those situations where the value of data is not fully understood. It is unnecessary to create an access copy of all research datasets as some of them will never be requested. In our workflows for long term management of research data we would like to trigger the creation of a copy of the data for dissemination and re-use on request rather than create one by default and this piece of work will make this workflow possible.

METS parsing tools

This development will involve creating a Python library which could be used by third party applications to understand the METS file that is contained within an Archivematica DIP. Additionally an HTTP REST service would be developed to allow third party applications to interact with the library in a programming language agnostic fashion.

This is key to being able to work with the DIP that is created by Archivematica within other repository or access systems that are not integrated with Archivematica. Both York and Hull have repositories built with Fedora and Hydra and this feature will allow the repositories to better understand the DIP that Archivematica creates. This development is in no way specific to a Fedora/Hydra repository and will equally benefit other repository platforms in use for RDM.
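The proposed library was still to be written at the time of writing, so as a rough illustration only, here is a minimal sketch of what "understanding the METS file" involves, using just Python's standard library. The METS and XLink namespaces are the standard ones, but the sample document and the helper function are my own inventions and are far simpler than a real Archivematica METS file:

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespaces.
NS = {"mets": "http://www.loc.gov/METS/",
      "xlink": "http://www.w3.org/1999/xlink"}

# Invented, heavily simplified sample in the general shape of a DIP METS file.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-1234">
        <mets:FLocat xlink:href="objects/dataset.csv"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

def list_files(mets_xml, use="original"):
    """Return the file paths recorded for a given file group (e.g. 'original')."""
    root = ET.fromstring(mets_xml)
    paths = []
    for grp in root.findall(".//mets:fileSec/mets:fileGrp", NS):
        if grp.get("USE") != use:
            continue
        for flocat in grp.findall(".//mets:FLocat", NS):
            paths.append(flocat.get("{http://www.w3.org/1999/xlink}href"))
    return paths

print(list_files(SAMPLE_METS))  # → ['objects/dataset.csv']
```

A repository such as Hydra/Fedora could use the same kind of lookup to find the content files and metadata it should pull out of a DIP.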

Improved file identification

This feature will enable Archivematica to report on any unidentified files within a transfer alongside access to the file identification tool output. Further enhancements could help curatorial staff to submit information to PRONOM by partially automating this process.

It was highlighted in our phase one report that the identification of research data file formats is a key area of concern when managing research data for the longer term. This feature will help users of Archivematica see which files haven’t been identified and thus enable them to take action to establish what they hold. This feature will also encourage communication with PRONOM to enhance the database of file formats for the future, thus enabling a more sustainable community approach to addressing this problem.
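As a hypothetical illustration of the kind of report this enhancement would surface, the sketch below takes per-file results in the general shape that identification tools produce (the sample data and field names are invented) and lists the files with no PRONOM identifier, ready for a curator to follow up:

```python
# Invented sample output from a file format identification tool.
# "puid" is the PRONOM unique identifier; None marks a file the tool
# could not identify and which a curator might investigate and report
# to PRONOM.
identification_results = [
    {"path": "objects/survey.csv", "puid": "x-fmt/18"},  # identified as CSV
    {"path": "objects/readings.dat", "puid": None},
    {"path": "objects/model.xyz", "puid": None},
]

unidentified = [r["path"] for r in identification_results if r["puid"] is None]
print("Unidentified files:", unidentified)
# → Unidentified files: ['objects/readings.dat', 'objects/model.xyz']
```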

Generic Search API

This development will create a proof of concept search REST API for Archivematica, allowing third party applications to query Archivematica for information about the objects in archival storage.

There is a need to be able to produce statistics or reports on RDM and digital preservation processes in order to obtain a clear picture of what data has been archived. This development will enable these statistics to be generated more easily and sustainably. For example this would enable tools such as the DMAonline dashboard in development at Lancaster University to pull out summary statistics from Archivematica.
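Since the API was only a proposal at this stage, any concrete example is necessarily guesswork: the endpoint URL, query parameters and response shape below are all assumptions, sketched to show how a reporting tool such as DMAonline might consume such a service:

```python
import json
import urllib.parse

# Hypothetical base URL - the real endpoint would depend on the
# proof of concept eventually delivered.
BASE_URL = "http://archivematica.example.ac.uk/api/v2/search/"

def build_query(**filters):
    """Build the query URL a reporting dashboard might request."""
    return BASE_URL + "?" + urllib.parse.urlencode(filters)

def summarise(response_text):
    """Reduce a (hypothetical) JSON list of AIP records to headline figures."""
    records = json.loads(response_text)
    return {
        "aip_count": len(records),
        "total_bytes": sum(r.get("size", 0) for r in records),
    }

print(build_query(file_format="text/csv", ingested_after="2015-01-01"))
sample = json.dumps([{"uuid": "a1b2", "size": 2048},
                     {"uuid": "c3d4", "size": 1024}])
print(summarise(sample))  # → {'aip_count': 2, 'total_bytes': 3072}
```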

Support for multiple checksum algorithms 

Currently Archivematica generates SHA256 checksums for all files, and inserts those into PREMIS fixity tags in the METS file. In addition, two premis:events are generated for each file. All three of these entries are currently hardcoded to assume SHA256. This development would include support for other hash algorithms such as MD5, SHA1 and SHA512.

Research data files can be large in size and/or quantity and may take some time to process through the Archivematica pipeline. One of the potential bottlenecks highlighted in the pipeline is checksum generation, which happens at more than one point in the process. SHA256 checksums can take a long time to create, and it has been highlighted that having the option to alter the checksum algorithm within Archivematica could speed things up. Having additional configuration options within Archivematica will give institutions the flexibility to refine and configure their pipelines to reduce bottlenecks where appropriate.
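The speed difference between algorithms is easy to demonstrate with Python's standard hashlib module; nothing here is Archivematica-specific, and actual timings will vary by machine:

```python
import hashlib
import os
import time

# Compare the hash algorithms the enhancement would support.
# SHA-256 is what Archivematica currently hardcodes into the PREMIS
# fixity metadata; the relative timings illustrate why a configurable
# algorithm can relieve a bottleneck on large datasets.
payload = os.urandom(16 * 1024 * 1024)  # 16 MiB of sample data

for algorithm in ("md5", "sha1", "sha256", "sha512"):
    start = time.perf_counter()
    digest = hashlib.new(algorithm, payload).hexdigest()
    elapsed = time.perf_counter() - start
    print(f"{algorithm:>6}: {elapsed * 1000:6.1f} ms  {digest[:16]}...")
```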

Improved documentation

The ability to automate preservation processes is of primary importance where few resources are available to manually process data of unknown value. Fuller documentation of how an automated workflow can be configured within Archivematica using the existing APIs would be very helpful for those considering using Archivematica for RDM and will help remove some of the barriers to its use. We will therefore be funding a small piece of work to improve Archivematica's documentation for developers and for those installing or administering the system.

We very much hope these enhancements will be useful to the wider community of Archivematica users and not just to those looking specifically at preserving research data.

Our thanks go to Artefactual Systems for helping to turn our initial development ideas into these more concrete proposals.

As ever we are happy to receive your thoughts and feedback so do get in touch if you are interested in the work we are carrying out or have things to share with us around these development ideas.

Jenny Mitcham, Digital Archivist

Friday, 24 July 2015

Archivematica for research data? The FAQs

Thinking of preserving research data? Wondering what Archivematica does? Interested in what it might cost?

What follows is a series of FAQs put together by the "Filling the digital preservation gap" project team and included as part A of our phase 1 project report. We hope you find this useful in helping to work out whether Archivematica is something you could use for RDM. There are bound to be questions we haven't answered so let us know if you want to know more...


Why do we need a digital preservation system for research data?

Research data should be seen as a valuable institutional asset and treated accordingly. Research data is often unique and irreplaceable. It may need to be kept to validate or verify conclusions recorded in publications. Funder, publisher and often internal university requirements ask that research data is available for others to consult and is preserved in a usable form after the project that generated it is complete.

In order to facilitate future access to research data we need to actively manage and curate it. Digital preservation is not just about implementing a good archival storage system or ‘preserving the bits’; it is about working within the framework set out by international standards (for example the Open Archival Information System) and taking steps to increase the chances of enabling meaningful re-use in the future.

What are the risks if we don't address digital preservation?

Digital preservation has been in the news this year (2015). An interview with Google vice president Vint Cerf in February grabbed the attention of the mainstream media, with headlines about the fragility of digital media and the onset of a 'digital dark age'.

This is clearly already a problem for researchers with issues around format and media obsolescence already being encountered. In a 2013 Research Data Management (RDM) survey undertaken at the University of York just under a quarter of respondents to the question “Which data management issues have you come across in your research over the last five years?” selected the answer “Inability to read files in old software formats on old media or because of expired software licences”. These are the sorts of problems that a digital preservation system is designed to address.

Due to its complexity digital preservation is very easy to put in the ‘too difficult’ box. There is no single perfect solution out there and it could be argued that we should sit it out and wait until a fuller set of tools emerges. A better approach is to join the existing community of practice and embrace some of the working and evolving solutions that are available.

Why are we interested in Archivematica?

Archivematica is an open source digital preservation system that is based on recognised standards in the field. Its functionality and the design of its interfaces were based on the Open Archival Information System and it uses standards such as PREMIS and METS to store metadata about the objects that are being preserved. Archivematica is flexible and configurable and can interface with a range of other systems.

A fully fledged RDM solution is likely to consist of a variety of different systems performing different functions within the workflow; Archivematica will fit well into this modular architecture and fills the digital preservation gap in the infrastructure.

The Archivematica website states that “The goal of the Archivematica project is to give archivists and librarians with limited technical and financial capacity the tools, methodology and confidence to begin preserving digital information today.” This vision appears to be a good fit with the needs and resources of those who are charged with managing an institution’s research data.

It should be noted that there are other digital preservation solutions available, both commercial and open source, but these were not assessed as part of this project.

Why do we recommend Archivematica to help preserve research data?

  • It is flexible and can be configured in different ways for different institutional needs and workflows
  • It allows many of the tasks around digital preservation to be carried out in an automated fashion
  • It can be used alongside other existing systems as part of a wider workflow for research data
  • It is a good digital preservation solution for those with limited resources
  • It is an evolving solution that is continually driven and enhanced by and for the digital preservation community; it is responsive to developments in the field of digital preservation
  • It gives institutions greater confidence that they will be able to continue to provide access to usable copies of research data over time.

What does Archivematica actually do?

Archivematica runs a series of microservices on the data and packages it up (with any metadata that has been extracted from it) in a standards compliant way for long term storage. Where a migration path exists, it will create preservation or dissemination versions of the data files to store alongside the originals and create metadata to record the preservation actions that have been carried out.

A more in depth discussion of what Archivematica does can be found in the report text. Full documentation for Archivematica is available online.

How could Archivematica be incorporated into a wider technical infrastructure for research data management?

Archivematica performs a very specific task within a wider infrastructure for research data management - that of preparing data for long term storage and access. It is also worth stating here what it doesn’t do:

  • It does not help with the transfer of data (and/or metadata) from researchers
  • It does not provide storage
  • It does not provide public access to data
  • It does not allocate Digital Object Identifiers (DOIs)
  • It does not provide statistics on when data was last accessed
  • It does not manage retention periods and trigger disposal actions when that period has passed

These functions and activities will need to be established elsewhere within the infrastructure as appropriate.

What does research data look like?

Research data is hard to characterise, varying across institutions, disciplines and individual projects. A wide range of software applications are in use by researchers and the file formats generated are diverse and often specialist.

Higher education institutions typically have little control over the data types and file formats that their researchers are producing. We ask researchers to consider file formats as a part of their data management plan and can provide generic advice on preferred file formats if asked, but where many of the specialist data formats are concerned it is likely that there is no ‘preservation-friendly’ alternative that retains the significant properties of the data.

Research data can be large in size and/or quantity. It often includes elements that are confidential or sensitive. Sensitivities are likely to vary across a dataset, with some files being suitable for wider access and others being restricted. A one-size-fits-all approach to rights metadata is not appropriate. In some cases there will be different versions of the data that need to be preserved, or different deposits of data for a single research project. Scenarios such as these are likely to come about where data is being used to support multiple publications over the course of a piece of research.

Research data may come with administrative data and documentation. These may be documents relating to ethical approval or grant funding, data management plans or documentation or metadata relating to particular files. The association between the research data and any associated administrative information should be maintained.

It can be difficult to ascertain the value of research data at the point of ingest. Some data will be widely used and should be preserved for the long term and other data will never be accessed and will be disposed of at the end of its retention period.

How would Archivematica handle research data?

Archivematica can handle any type of data but it should be noted that a richer level of preservation will only be available for some file formats. Archivematica (like other digital preservation systems) will recognise and identify a large number of research data formats but by no means the full range. For a smaller subset of these file formats (for example a range of raster and vector image and audio visual formats) it comes with normalisation rules and tools. It can be configured to normalise other file formats as required (where open source command line tools are available). Archivematica also allows for the flexibility of manual normalisations. This gives data curators the opportunity to migrate files in a more manual way and update the PREMIS metadata by hand accordingly.

For other data types (and this will include many of the file formats that are created by researchers), Archivematica may not be able to identify, characterise or normalise the files, but will still be able to perform certain functions such as virus checking, cleaning up file names, creating checksums and packaging the data and metadata up to create an Archival Information Package.

Archivematica can handle large files (or large volumes of small files) but its abilities in this area are very much dependent on the processing power that has been allocated to it. Users of Archivematica should be aware of the capabilities of their own implementation and be prepared to establish a cut-off point over which data files of a certain size may not be processed, or may need to be processed in a different way.
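A cut-off policy like this sits outside Archivematica itself, in whatever script or workflow decides what to send for processing. A minimal sketch in Python, where the 50 GB threshold and the routing labels are purely illustrative assumptions:

```python
import os

# Hypothetical policy: the cut-off value and routing labels are
# assumptions for illustration, not part of Archivematica itself.
SIZE_CUTOFF_BYTES = 50 * 1024 ** 3  # 50 GB


def dataset_size(path):
    """Total size in bytes of all files under a dataset directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def route_for_processing(size_bytes, cutoff=SIZE_CUTOFF_BYTES):
    """Decide whether a dataset goes through the standard automated
    pipeline or is flagged for different handling."""
    return "automated" if size_bytes <= cutoff else "manual-review"
```

A pre-ingest check along these lines lets an institution apply its cut-off consistently rather than discovering mid-ingest that a transfer is too large for the available processing power.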

Archivematica uses the PREMIS metadata standard to record rights metadata. Rights metadata can be added for the Submission Information Package as a whole rather than in a granular fashion. This is not ideal for research data for which there are likely to be different levels of sensitivity for different elements within the final submitted dataset. The Archivematica manual suggests that fuller rights information would be added to the access system (outside of Archivematica).

The use of Archival Information Collections (AICs) in Archivematica enables the loose association of groups of related Archival Information Packages (AIPs). This may be a useful feature for research data where different versions of a dataset or parts of a dataset are deposited at different times but are all associated with the same research project.

Archivematica is a suitable tool for preserving data of unknown value. Workflows within Archivematica and the processing of a transferred dataset from a Submission Information Package (SIP) to an Archival Information Package (AIP) can be automated. This means that some control over the data and a level of confidence that the data is being looked after adequately can be gained, without expending a large amount of staff time on curating the data in a manual fashion. If the value of the data is seen to increase (by frequent requests to access that data or as a result of assessment by curatorial staff) further efforts can be made to preserve the data using the AIP re-ingest feature and perhaps by carrying out a level of manual curation. The extent of automation within Archivematica can be configured so staff are able to treat datasets in different ways as appropriate. Institutions may have a range of approaches here, but the levels of automation that are possible provide a compelling argument for the adoption of Archivematica if few staff resources are available for manual preservation.

What are the limitations of Archivematica for research data?

Archivematica should not be seen as a magic bullet. It does not guarantee that data will be preserved in a re-usable state into the future. It can only be as good as current digital preservation theory and practice allow, and digital preservation itself is not a fully solved problem.

Research data is particularly challenging from a preservation point of view due to the range of data types and formats in existence, many of which are not covered by existing digital preservation tools and policies, so they will not receive as high a level of curation when ingested into Archivematica.

As mentioned above, the rights metadata within Archivematica may not fit the granularity that would be required for research data. This information would need to be held elsewhere within the infrastructure.

One of Archivematica’s strengths is its flexibility and the fact it can be configured to suit the needs of the institution or a particular workflow. This, however, may also act as an initial barrier to use. It takes time to become familiar with Archivematica and to work out how you want to set it up. It is also a tool that most people would not want to use in isolation, and considerable thought needs to go into how it will interact with other systems and what workflow may best suit your institution.

The user interface for Archivematica is not always intuitive and takes some time to fully understand. There is currently no indication within the GUI that Archivematica is processing, or any estimate of how long a particular microservice has left to run. This is a limitation for large datasets if you are processing them through Archivematica’s dashboard. For a more automated curation workflow this will not have any impact.

What costs are associated with using Archivematica?

Archivematica is a free and open source application but this does not mean it will not cost anything to run. As a minimum an organisation wishing to run Archivematica locally will need both technical and curatorial staff resource. A level of technical knowledge is required to install and troubleshoot Archivematica and perform necessary upgrades. Further technical knowledge is required to consider how Archivematica fits into a wider infrastructure and to get systems talking to each other. As Archivematica is open source, developer time could be devoted to enhancing it to suit institutional needs. Developer time can also be bought from Artefactual Systems, the lead developer of Archivematica, to fund specific enhancements or new functionality which will be made available to the wider user base. In order to make the most of the system, organisations may want to consider factoring in a budget for developments and enhancements.

It is essential to have at least one member of curatorial staff who can get to grips with the Archivematica interface, make administrative decisions about the workflow and edit the format policy registry where appropriate. A level of knowledge of digital preservation is required for this, particularly where changes or additions to normalisation rules within the format policy registry are being considered. The more manual steps there are within the workflow (for example if manual selection and arrangement, metadata entry or normalisations are carried out), the more curatorial staff will be needed, and this requirement will increase in line with the volumes of data being processed.

Technical costs of establishing an Archivematica installation should also be considered. For a production system the following server configuration is recommended as a minimum:
  • Processor: dual core i5 3rd generation CPU or better
  • Memory: 8GB+
  • Disk space: 20GB plus the disk space required for the collection
The software can be installed on a single machine or across a number of machines to share the workload. At the time of writing, the software requires a current version of Ubuntu LTS (14.04 or 12.04) as its operating system.

What other systems is Archivematica integrated with?

Archivematica provides various levels of integration with DSpace, CONTENTdm, AtoM, Islandora and Archivists’ Toolkit for access, and with Arkivum, DuraCloud and LOCKSS for storage. Integrations with ArchivesSpace, Hydra, BitCurator and DataVerse are underway.

In addition, Archivematica provides a transfer REST API that can be used to initiate transfers within the software, the first step of the preservation workflow. Archivematica’s underlying Storage Service also provides RESTful APIs to facilitate the creation, retrieval and deletion of AIPs.
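To give a flavour of what scripting against the transfer API looks like, here is a hedged sketch in Python. The endpoint path, the `ApiKey` authorisation header and the base64-encoded `location-uuid:path` form of the `paths[]` parameter reflect our reading of the Archivematica API documentation at the time of writing; check the documentation for your version before relying on the details.

```python
import base64
import urllib.parse
import urllib.request


def encode_transfer_path(location_uuid, relative_path):
    """The transfer API expects each path as base64 of
    '<transfer-source-location-uuid>:<path>'."""
    raw = "{}:{}".format(location_uuid, relative_path)
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def build_start_transfer_request(base_url, username, api_key,
                                 transfer_name, location_uuid, path):
    """Build (but do not send) the POST request that initiates a
    standard transfer. Sending it would then be a single call to
    urllib.request.urlopen(req)."""
    data = urllib.parse.urlencode({
        "name": transfer_name,
        "type": "standard",
        "paths[]": encode_transfer_path(location_uuid, path),
    }).encode("ascii")
    req = urllib.request.Request(
        base_url + "/api/transfer/start_transfer/", data=data)
    req.add_header("Authorization",
                   "ApiKey {}:{}".format(username, api_key))
    return req
```

Separating "build the request" from "send it" keeps the network call out of the way for testing, which matters when the initiation step is to be embedded in a larger automated workflow.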

How can you use Archivematica?

There are three different ways that an institution might wish to use Archivematica:
  • Local - institutions may install and host Archivematica locally and link it to their preferred storage option
  • Arkivum - a managed service from Arkivum will allow Archivematica to be hosted locally within the institution with upgrades and support available through Arkivum in partnership with Artefactual Systems. A remote hosting option is also available. Both include integration of Archivematica and Arkivum storage.
  • ArchivesDirect - a hosted service from DuraSpace that combines Archivematica’s preservation functionality with DuraCloud for storage.

How could Archivematica be improved for research data?

It should be noted that Archivematica is an evolving tool that is under active development. During the short three months of phase 1 of our project “Filling the Digital Preservation Gap” we have been assessing a moving target. Version 1.3 was installed for initial testing. A month in, version 1.4 was released to the community. As we write this report, version 1.5 is under development and due for imminent release; a version of this has been made available to us for testing.

Archivematica is open source, but much of the development work is carried out by Artefactual Systems, the company that supports it. They have their own development roadmap for Archivematica, but most new features that appear are directly sponsored by the user community. Users can pay to have new features, functionality or integrations built into Archivematica; Artefactual Systems try to carry out this work in such a way as to make the features useful to the wider user community, and agree to continue to support and maintain the code base through subsequent versions of the software. This ‘bounty model’ for open source development seems to work well and keeps the software evolving in line with the priorities of its user base.

During the testing phase of this project we have highlighted several areas where Archivematica could be improved or enhanced to provide a better solution for research data and several of these features are already in development (sponsored by other Archivematica users). In phase 2 of the project we hope to be able to contribute to the continued development of Archivematica.

Who else is using Archivematica to do similar things?

Archivematica has been adopted by several institutions internationally but its key user base is in Canada and the United States. A list of a selection of Archivematica users can be found on their community wiki pages. Some institutions are using Archivematica to preserve research data. Both the Zuse Institute Berlin and Research Data Canada’s Federated Pilot for Data Ingest and Preservation are important to mention in this context.

Archivematica is not widely used in the UK but there are current implementations at the National Library of Wales and the University of Warwick’s Modern Records Centre. Interest in Archivematica in the UK is growing. This is evidenced by the establishment this year of a UK Archivematica group which provides a local forum to share ideas and case studies. Representatives from 15 different organisations were present at the last meeting at Tate Britain in June 2015 and a further meeting is planned at the University of Leeds in the autumn.

Where can I find out more?

Our full project report is available here:

Read Part B of the report for fuller details of our findings.

Jenny Mitcham, Digital Archivist

Friday, 17 July 2015

Improving RDM workflows with Archivematica

In a previous post we promised diagrams to help answer the 'How?' of the 'Filling the Digital Preservation Gap' project. In answer to this, here is a guest post by Julie Allinson, Manager of Digital York, who looks after all things Library and Archives IT.

Having just completed phase one of a Jisc Research Data Spring project with the University of Hull, we have been thinking a lot about the potential phases two and three which we are hoping will follow. But even if we aren’t funded to continue the project, the work so far won’t be wasted here at York (and certainly the wider community will benefit from our project report!) as it has given us some space to look at our current RDM workflow and look for ways it might be improved, particularly to include a level of curation and preservation.

My perspective on all of this is to look at how things can fit together, and I am a believer in using the right system for the right job. Through lack of resource, or misunderstanding, I feel we often retro-fit existing systems and try to make them meet a need for which they weren’t designed. To a degree I think this happens to PURE. PURE is a Current Research Information System (CRIS), and a leading product in that space. But it isn’t (in my opinion) a data repository, a preservation system, a management tool for handling data access requests or any other of a long list of things we might want it to be.

For York, PURE is where we collect information about our research and our researchers. PURE provides both an internal and, through the PURE portal, an external view of York’s research. For the Library, it is where we collect information about research datasets and outputs. In our current workflow, the full-text is sent to our research repository for storage and dissemination. The researcher interfaces only with PURE for deposit, the rest is done by magic. For research data the picture is similar, but we currently take copies of data in only some cases, where there is no suitable external data archive, and we do this ‘offline’, using different mechanisms depending on the size of the dataset.

Our current workflow for dealing with Research Data deposits is OK, it works and it is probably similar to many institutions still feeling their way in this new area of activity. It looks broadly like this:

  • Researcher enters dataset metadata into PURE
  • Member of Library staff contacts them about their data and, if appropriate, takes a copy for our long term store
  • Library staff check and verify the metadata, and record extra information as needed.
  • PURE creates a DOI
  • DOI is updated to point to our interim datasets page (default PURE behaviour is to create a link to the PURE portal, which we have not currently implemented for datasets)

But … it has a lot of manual steps in it, and we don’t check or verify the data deposited with us in any systematic way. Note all of the manual (dotted) steps in the following diagram.

How we do things at the moment

I’d like to improve this workflow by removing as many manual steps as possible, thereby, increasing efficiency, eradicating re-keying and reducing the margin for error, whilst at the same time adding in some proper curation of the data.

The Jisc Research Data Spring project allowed us to ask the question: ‘Can Archivematica help?’, and without going into much detail here, the answer was a resounding ‘Yes, but...’.

What Archivematica can most certainly give us is a sense of what we’ve got, to help us ‘peer into the bag’ as it were, and it can have a good go at both identifying what’s in there and at making sure no nasties, like viruses and corrupt files, lurk.

Research data has a few prominent features:

  1. lots of files of types that standard preservation tools can’t identify
  2. bags of data we don’t know a helluva lot about and where the selection and retention process has been done by the researcher (we are unlikely to ever have the resource or the expertise to examine and assess each deposit)
  3. bags of data that we don’t really know at this point are going to be requested

My colleague, Jen, has been looking at 1) and I’ve focussed my work on 2) and 3). I had this thought:

Wouldn’t it be nice if we could push research data automatically through a simple Archivematica pipeline for safe and secure storage, but only deliver an access copy if one is requested?

Well, Archivematica CAN be almost entirely automated, and in the coming months it WILL be able to generate the DIP manually as a later step. And with some extra funded development it COULD support automated DIP creation via an API call.

So, what’s missing now? We can gather metadata in PURE and make sure the data we get from researchers is verified and looked after, and service access requests by generating a DIP.

How will we make that DIP available? Well, we already have a public facing Digital Library which can offer that service.

What’s missing is the workflow, the glue that ties all of these steps together, and really I’d argue that’s not the job of either Archivematica or PURE.

In the diagram below I’ve proposed an architecture for both deposit and access that would both remove most manual steps and add Archivematica into the workflow.

In the deposit workflow the researcher would continue to add metadata to PURE. We have kept a manual communication step here as we feel that there will almost certainly be some human discussion needed about the nature of the data. The researcher would then upload their data, and we would be able to manage all subsequent steps in a largely automated way, using a lightweight ‘RDMonitor’ tool to match the data with the PURE metadata (using information gleaned from the PURE web services), to log the receipt of data, and to initiate the transfer to Archivematica for archiving.
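The matching step at the heart of the proposed ‘RDMonitor’ could be very small. A hypothetical sketch in Python, where the flat record structure and the use of the DOI as the matching key are illustrative assumptions (the real PURE web services return richer records):

```python
# Hypothetical sketch of the 'RDMonitor' matching step: given metadata
# records pulled from the PURE web services and a freshly uploaded
# deposit, find the PURE record the deposit belongs to.


def match_deposit_to_pure(deposit, pure_records):
    """Return the PURE metadata record whose DOI matches the deposited
    dataset, or None if no match is found."""
    doi = deposit.get("doi", "").strip().lower()
    if not doi:
        return None
    for record in pure_records:
        if record.get("doi", "").strip().lower() == doi:
            return record
    return None
```

Once a deposit is matched to its PURE record, the tool has everything it needs to log the receipt and hand the package on to Archivematica without any re-keying of metadata.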

How we'd like to do things in the future - deposit workflow

In the standard access workflow, the requestor would send an email which would automatically be logged by the Monitor, initiating the creation of a dissemination version of the data and its transfer to our repository, and triggering an automated email to the requestor to alert them to the location of the data. Manual steps in this workflow are needed to add the direct link to the data in PURE (PURE has no APIs for POSTing updates, only for GETting information) and also to ensure due diligence in making data available.
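The automated part of that access workflow can be sketched as a single handler: log the request, trigger DIP creation (passed in here as a callable, which in practice might wrap an Archivematica API call), and compose the notification message. All the names and fields below are illustrative assumptions, not an implemented system:

```python
import datetime


def handle_access_request(requestor_email, dataset_id, log, create_dip):
    """Log an access request, kick off DIP creation and return the
    message to be emailed back to the requestor."""
    log.append({
        "requestor": requestor_email,
        "dataset": dataset_id,
        "received": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    dip_url = create_dip(dataset_id)  # e.g. a call out to Archivematica
    return ("Your requested dataset {} is now available at: {}"
            .format(dataset_id, dip_url))
```

Keeping the DIP-creation step behind a callable means the same handler works whether the dissemination copy comes from Archivematica directly or from an intermediate repository.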

How we'd like to do things in the future - access to research data

These ideas need more unpicking over the coming months, and there are other ways it could be delivered, using PURE for the upload of data, for example. We can also see potential extensions, like pulling in a Data Management Plan from DMPOnline to store with the archive package. This post should be viewed as a first stab at illustrating our thinking in the project, motivated by the idea of making lightweight interfaces to connect systems together. Hull will have similar, but different, requirements that they want to explore in further phases too.

Jenny Mitcham, Digital Archivist