Digital Archiving at the University of York: Archivematica for research data? The FAQs

Thinking of preserving research data? Wondering what Archivematica does? Interested in what it might cost?

What follows is a series of FAQs put together by the "Filling the digital preservation gap" project team and included as part A of our phase 1 project report. We hope you find this useful in helping to work out whether Archivematica is something you could use for RDM. There are bound to be questions we haven't answered so let us know if you want to know more...

************************************************

Why do we need a digital preservation system for research data?

Research data should be seen as a valuable institutional asset and treated accordingly. Research data is often unique and irreplaceable. It may need to be kept to validate or verify conclusions recorded in publications. Funder, publisher and often internal university requirements ask that research data is available for others to consult and is preserved in a usable form after the project that generated it is complete.

In order to facilitate future access to research data we need to actively manage and curate it. Digital preservation is not just about implementing a good archival storage system or ‘preserving the bits’; it is about working within the framework set out by international standards (for example the Open Archival Information System) and taking steps to increase the chances of enabling meaningful re-use in the future.

What are the risks if we don't address digital preservation?

Digital preservation has been in the news this year (2015). An interview with Google Vice President Vint Cerf in February grabbed the attention of the mainstream media with headlines about the fragility of digital media and the onset of a digital dark age.

This is clearly already a problem for researchers, who are encountering issues around format and media obsolescence. In a 2013 Research Data Management (RDM) survey undertaken at the University of York just under a quarter of respondents to the question “Which data management issues have you come across in your research over the last five years?” selected the answer “Inability to read files in old software formats on old media or because of expired software licences”. These are the sorts of problems that a digital preservation system is designed to address.

Due to its complexity digital preservation is very easy to put in the ‘too difficult’ box. There is no single perfect solution out there and it could be argued that we should sit it out and wait until a fuller set of tools emerges. A better approach is to join the existing community of practice and embrace some of the working and evolving solutions that are available.

Why are we interested in Archivematica?

Archivematica is an open source digital preservation system that is based on recognised standards in the field. Its functionality and the design of its interfaces were based on the Open Archival Information System and it uses standards such as PREMIS and METS to store metadata about the objects that are being preserved. Archivematica is flexible and configurable and can interface with a range of other systems.

A fully fledged RDM solution is likely to consist of a variety of different systems performing different functions within the workflow; Archivematica will fit well into this modular architecture and fills the digital preservation gap in the infrastructure.

The Archivematica website states that “The goal of the Archivematica project is to give archivists and librarians with limited technical and financial capacity the tools, methodology and confidence to begin preserving digital information today.” This vision appears to be a good fit with the needs and resources of those who are charged with managing an institution’s research data.

It should be noted that there are other digital preservation solutions available, both commercial and open source, but these were not assessed as part of this project.

Why do we recommend Archivematica to help preserve research data?

It is flexible and can be configured in different ways for different institutional needs and workflows
It allows many of the tasks around digital preservation to be carried out in an automated fashion
It can be used alongside other existing systems as part of a wider workflow for research data
It is a good digital preservation solution for those with limited resources
It is an evolving solution that is continually driven and enhanced by and for the digital preservation community; it is responsive to developments in the field of digital preservation
It gives institutions greater confidence that they will be able to continue to provide access to usable copies of research data over time

What does Archivematica actually do?

Archivematica runs a series of microservices on the data and packages it up (with any metadata that has been extracted from it) in a standards-compliant way for long-term storage. Where a migration path exists, it will create preservation or dissemination versions of the data files to store alongside the originals and create metadata to record the preservation actions that have been carried out.

A more in depth discussion of what Archivematica does can be found in the report text. Full documentation for Archivematica is available online.

How could Archivematica be incorporated into a wider technical infrastructure for research data management?

Archivematica performs a very specific task within a wider infrastructure for research data management - that of preparing data for long term storage and access. It is also worth stating here what it doesn’t do:

It does not help with the transfer of data (and/or metadata) from researchers
It does not provide storage
It does not provide public access to data
It does not allocate Digital Object Identifiers (DOIs)
It does not provide statistics on when data was last accessed
It does not manage retention periods and trigger disposal actions when that period has passed

These functions and activities will need to be established elsewhere within the infrastructure as appropriate.

What does research data look like?

Research data is hard to characterise, varying across institutions, disciplines and individual projects. A wide range of software applications are in use by researchers and the file formats generated are diverse and often specialist.

Higher education institutions typically have little control over the data types and file formats that their researchers are producing. We ask researchers to consider file formats as a part of their data management plan and can provide generic advice on preferred file formats if asked, but where many of the specialist data formats are concerned it is likely that there is no ‘preservation-friendly’ alternative that retains the significant properties of the data.

Research data can be large in size and/or quantity. It often includes elements that are confidential or sensitive. Sensitivities are likely to vary across a dataset with some files being suitable for wider access and others being restricted. A one-size-fits-all approach to rights metadata is not appropriate. In some cases there will be different versions of the data that need to be preserved or different deposits of data for a single research project. Scenarios such as these are likely to come about where data is being used to support multiple publications over the course of a piece of research.

Research data may come with administrative data and documentation. These may be documents relating to ethical approval or grant funding, data management plans or documentation or metadata relating to particular files. The association between the research data and any associated administrative information should be maintained.

It can be difficult to ascertain the value of research data at the point of ingest. Some data will be widely used and should be preserved for the long term and other data will never be accessed and will be disposed of at the end of its retention period.

How would Archivematica handle research data?

Archivematica can handle any type of data but it should be noted that a richer level of preservation will only be available for some file formats. Archivematica (like other digital preservation systems) will recognise and identify a large number of research data formats but by no means the full range. For a smaller subset of these file formats (for example a range of raster and vector image and audio visual formats) it comes with normalisation rules and tools. It can be configured to normalise other file formats as required (where open source command line tools are available). Archivematica also allows for the flexibility of manual normalisations. This gives data curators the opportunity to migrate files in a more manual way and update the PREMIS metadata by hand accordingly.

For other data types (and this will include many of the file formats that are created by researchers), Archivematica may not be able to identify, characterise or normalise the files but will still be able to perform certain functions such as virus checking, cleaning up file names, creating checksums and packaging the data and metadata up to create an archival information package.
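To give a feel for the kind of format-independent tasks mentioned above, here is a minimal sketch of file name clean-up and checksum creation in Python. It is not Archivematica's own code: the function names, the choice of SHA-256 and the set of "safe" characters are all assumptions made for illustration.

```python
import hashlib
import re
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clean_name(name: str) -> str:
    """Replace characters outside a conservative safe set with underscores.

    The safe set (letters, digits, dot, underscore, hyphen) is an
    illustrative assumption, not Archivematica's actual rule.
    """
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def build_manifest(folder: Path) -> dict:
    """Map each cleaned relative file path to the checksum of its content."""
    manifest = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            relative = path.relative_to(folder)
            cleaned = "/".join(clean_name(part) for part in relative.parts)
            manifest[cleaned] = sha256_of(path)
    return manifest
```

Calling `build_manifest(Path("my_dataset"))` on a hypothetical dataset folder would return a dictionary of cleaned paths and checksums that can be checked again later to confirm the files have not changed. Note that two different original names could clean to the same name; a real system would need to handle that case.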

Archivematica can handle large files (or large volumes of small files) but its abilities in this area are very much dependent on the processing power that has been allocated to it. Users of Archivematica should be aware of the capabilities of their own implementation and be prepared to establish a size cut-off point above which data files may not be processed, or may need to be processed in a different way.
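One way to apply such a cut-off before data reaches Archivematica is to sort incoming files by size. The sketch below is an assumption about how an institution might do this in its own pre-ingest scripts; the 2 GiB limit and the function name are hypothetical and would need to be set according to local testing.

```python
from pathlib import Path

# Hypothetical cut-off: pick a value that suits your own installation.
SIZE_LIMIT_BYTES = 2 * 1024**3  # 2 GiB


def split_by_size(folder: Path, limit: int = SIZE_LIMIT_BYTES):
    """Return (within_limit, over_limit) lists of the files under folder."""
    within, over = [], []
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            if path.stat().st_size > limit:
                over.append(path)    # candidates for a different workflow
            else:
                within.append(path)  # candidates for standard processing
    return within, over
```

Files in the second list could then be routed to a lighter-touch workflow or held back for a decision by curatorial staff.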

Archivematica uses the PREMIS metadata standard to record rights metadata. Rights metadata can be added for the Submission Information Package as a whole rather than in a granular fashion. This is not ideal for research data for which there are likely to be different levels of sensitivity for different elements within the final submitted dataset. The Archivematica manual suggests that fuller rights information would be added to the access system (outside of Archivematica).

The use of Archival Information Collections (AICs) in Archivematica enables the loose association of groups of related Archival Information Packages (AIPs). This may be a useful feature for research data where different versions of a dataset or parts of a dataset are deposited at different times but are all associated with the same research project.

Archivematica is a suitable tool for preserving data of unknown value. Workflows within Archivematica and the processing of a transferred dataset from a Submission Information Package (SIP) to an Archival Information Package (AIP) can be automated. This means that some control over the data and a level of confidence that the data is being looked after adequately can be gained, without expending a large amount of staff time on curating the data in a manual fashion. If the value of the data is seen to increase (by frequent requests to access that data or as a result of assessment by curatorial staff) further efforts can be made to preserve the data using the AIP re-ingest feature and perhaps by carrying out a level of manual curation. The extent of automation within Archivematica can be configured so staff are able to treat datasets in different ways as appropriate. Institutions may have a range of approaches here, but the levels of automation that are possible provide a compelling argument for the adoption of Archivematica if few staff resources are available for manual preservation.

What are the limitations of Archivematica for research data?

Archivematica should not be seen as a magic bullet. It does not guarantee that data will be preserved in a re-usable state into the future. It can only be as good as current digital preservation theory and practice, and digital preservation itself is not a fully solved problem.

Research data is particularly challenging from a preservation point of view because of the range of data types and formats in existence. Digital preservation tools and policies do not exist for many of these formats, so they will not receive as high a level of curation when ingested into Archivematica.

As mentioned above, the rights metadata within Archivematica may not fit the granularity that would be required for research data. This information would need to be held elsewhere within the infrastructure.

One of Archivematica’s strengths is its flexibility and the fact it can be configured to suit the needs of the institution or for a particular workflow. This however may also act as an initial barrier to use. It takes a bit of time to become familiar with Archivematica and to work out how you want to set it up. It is also a tool that most people would not want to use in isolation, and considerable thought needs to go into how it needs to interact with other systems and what workflow may best suit your institution.

The user interface for Archivematica is not always intuitive and takes some time to fully understand. There is currently no indication within the GUI that Archivematica is processing, nor any estimate of how long a particular microservice may have left to run. This is a limitation for large datasets if you are processing them through Archivematica’s dashboard. For a more automated curation workflow this will not have any impact.

What costs are associated with using Archivematica?

Archivematica is a free and open source application but this does not mean it will not cost anything to run. As a minimum an organisation wishing to run Archivematica locally will need both technical and curatorial staff resource. A level of technical knowledge is required to install and troubleshoot Archivematica and perform necessary upgrades. Further technical knowledge is required to consider how Archivematica fits into a wider infrastructure and to get systems talking to each other. As Archivematica is open source, developer time could be devoted to enhancing it to suit institutional needs. Developer time can also be bought from Artefactual Systems, the lead developer of Archivematica, to fund specific enhancements or new functionality which will be made available to the wider user base. In order to make the most of the system, organisations may want to consider factoring in a budget for developments and enhancements.

It is essential to have at least one member of curatorial staff who can get to grips with the Archivematica interface, make administrative decisions about the workflow and edit the format policy registry where appropriate. A level of knowledge of digital preservation is required for this, particularly where changes or additions to normalisation rules within the format policy registry are being considered. A greater number of curatorial staff working on Archivematica will be necessary the more manual steps there are within the workflow (for example if manual selection and arrangement, metadata entry or normalisations are carried out). This requirement for curatorial staff will increase in line with the volumes of data that are being processed.

Technical costs of establishing an Archivematica installation should also be considered. For a production system the following server configuration is recommended as a minimum:

Processor: dual core i5 3rd generation CPU or better
Memory: 8GB+
Disk space: 20GB plus the disk space required for the collection

The software can be installed on a single machine or across a number of machines to share the workload. At the time of writing, the software requires a current version of Ubuntu LTS (14.04 or 12.04) as its operating system.
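The disk space figure above is simple arithmetic: a fixed baseline plus the size of the collection. The sketch below shows one way to check a candidate server against it, along with the dual core requirement. It is only a rough check under stated assumptions: the function names are hypothetical, 1GB is taken as 1000^3 bytes, and `os.cpu_count()` reports logical rather than physical cores. It makes no allowance for any normalised copies stored alongside the originals.

```python
import os
import shutil

BASE_REQUIREMENT_BYTES = 20 * 1000**3  # the 20GB baseline quoted above
MINIMUM_CORES = 2                      # "dual core" from the list above


def required_disk_bytes(collection_bytes: int) -> int:
    """Baseline disk space plus the space needed for the collection."""
    return BASE_REQUIREMENT_BYTES + collection_bytes


def has_enough_disk(path: str, collection_bytes: int) -> bool:
    """True if the volume holding path has enough free space."""
    return shutil.disk_usage(path).free >= required_disk_bytes(collection_bytes)


def has_enough_cores() -> bool:
    """True if at least two logical CPUs are reported (None counts as 0)."""
    return (os.cpu_count() or 0) >= MINIMUM_CORES
```

For example, a 500GB collection would call for `required_disk_bytes(500 * 1000**3)`, which is 520GB in total.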

What other systems is Archivematica integrated with?

Archivematica provides various levels of integration with DSpace, CONTENTdm, AtoM, Islandora and Archivist’s Toolkit for access and Arkivum, DuraCloud and LOCKSS for storage. Integrations with ArchivesSpace, Hydra, BitCurator and DataVerse are underway.

In addition, Archivematica provides a transfer REST API that can be used to initiate transfers within the software, the first step of the preservation workflow. Archivematica’s underlying Storage Service also provides RESTful APIs to facilitate the creation, retrieval and deletion of AIPs.
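As a rough illustration of what initiating a transfer through the REST API might look like, the sketch below builds (but does not send) an HTTP request in Python. The endpoint path, the parameter names, the base64 path encoding and the `ApiKey` header format are our assumptions about the API and should be checked against the Archivematica documentation for your version; the server address, credentials and location identifier are hypothetical.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_transfer_request(base_url, user, api_key, name,
                           location_uuid, relative_path):
    """Build (but do not send) a POST request that would start a transfer."""
    # Assumption: paths are sent base64-encoded as "<location uuid>:<path>".
    raw_path = f"{location_uuid}:{relative_path}".encode("utf-8")
    encoded_path = base64.b64encode(raw_path).decode("ascii")

    # Assumption: these are the form fields the transfer endpoint expects.
    body = urlencode({
        "name": name,
        "type": "standard",
        "paths[]": encoded_path,
    }).encode("ascii")

    return Request(
        url=f"{base_url.rstrip('/')}/api/transfer/start_transfer/",
        data=body,
        headers={"Authorization": f"ApiKey {user}:{api_key}"},
        method="POST",
    )
```

The returned object could be passed to `urllib.request.urlopen` to send it. Keeping the building step separate from the sending step makes it easy to inspect the request before pointing it at a live system.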

How can you use Archivematica?

There are three different ways that an institution might wish to use Archivematica:

Local - institutions may install and host Archivematica locally and link it to their preferred storage option
Arkivum - a managed service from Arkivum will allow Archivematica to be hosted locally within the institution with upgrades and support available through Arkivum in partnership with Artefactual Systems. A remote hosting option is also available. Both include integration of Archivematica and Arkivum storage.
ArchivesDirect - a hosted service from DuraSpace that combines Archivematica’s preservation functionality with DuraCloud for storage.

How could Archivematica be improved for research data?

It should be noted that Archivematica is an evolving tool that is under active development. During the short 3 months of phase 1 of our project “Filling the Digital Preservation Gap” we have been assessing a moving target. Version 1.3 was installed for initial testing. A month in, version 1.4 was released to the community. As we write this report, version 1.5 is under development and due for imminent release; a version of this has been made available to us for testing.

Archivematica is open source but much of the development work is carried out by Artefactual Systems, the company that supports it. They have their own development roadmap for Archivematica but most new features that appear are directly sponsored by the user community. Users can pay to have new features, functionality or integrations built into Archivematica. Artefactual Systems try to carry out this work in a way that makes the features useful to the wider user community, and agree to continue to support and maintain the code base through subsequent versions of the software. This ‘bounty model’ for open source development seems to work well and keeps the software evolving in line with the priorities of its user base.

During the testing phase of this project we have highlighted several areas where Archivematica could be improved or enhanced to provide a better solution for research data and several of these features are already in development (sponsored by other Archivematica users). In phase 2 of the project we hope to be able to contribute to the continued development of Archivematica.

Who else is using Archivematica to do similar things?

Archivematica has been adopted by several institutions internationally but its key user base is in Canada and the United States. A selection of Archivematica users is listed on the project's community wiki pages. Some institutions are using Archivematica to preserve research data. Both the Zuse Institute Berlin and Research Data Canada’s Federated Pilot for Data Ingest and Preservation are important to mention in this context.

Archivematica is not widely used in the UK but there are current implementations at the National Library of Wales and the University of Warwick’s Modern Records Centre. Interest in Archivematica in the UK is growing. This is evidenced by the establishment this year of a UK Archivematica group which provides a local forum to share ideas and case studies. Representatives from 15 different organisations were present at the last meeting at Tate Britain in June 2015 and a further meeting is planned at the University of Leeds in the autumn.

Where can I find out more?

Our full project report is available here: http://dx.doi.org/10.6084/m9.figshare.1481170

Read Part B of the report for fuller details of our findings.

Jenny Mitcham, Digital Archivist

Friday, 24 July 2015