Addressing digital preservation challenges through Research Data Spring

With the short time scales at play in the Jisc Research Data Spring initiative it is very easy to find yourself so focussed on your own project that you don’t have time to look around and see what everyone else is doing. As phase 2 of Research Data Spring comes to an end we are taking time to reflect, to think about digital preservation for research data management, to look at the other projects and think about how all the different pieces of the puzzle fit together.

Our “Filling the Digital Preservation Gap” project is very specifically about digital preservation and we are focusing primarily on what happens once the researchers have handed over their data to us for long term safekeeping. However, ‘digital preservation’ is not a thing that exists in isolation. It is very much a part of the wider ecosystem for managing data. Different projects within Research Data Spring are working on specific elements of this infrastructure and this blog post will try and unpick who is doing what and how this work contributes to helping the community address the bigger challenges of digital preservation.

The series of podcast interviews that Jisc produced for each project was a great starting point for finding out about the projects and this has been complemented by some follow up questions and discussions with project teams. Any errors or misinterpretations are my own. A follow up session on digital preservation is planned for the next Research Data Spring sandpit later this week so an update may follow next week in the light of that.

So here is a bit of a synthesis of the projects and how they relate to digital preservation and more specifically the Open Archival Information System (OAIS) reference model. If you are new to OAIS, this DPC technology watch report is a great introduction.

OAIS Functional Model (taken from the DPC Technology Watch report)

So, starting at the left of the diagram, at the point at which researchers (producers) are creating their data and preparing it for submission to a digital archive, the CREAM project (or “Collaboration for Research Enhancement by Active Metadata”) led by the University of Southampton hopes to change the way researchers use metadata. It is looking at how different disciplines capture metadata and how this enhances the data in the long run. They are encouraging dynamic capture of metadata at the point of data creation which is the point at which researchers know most about their data. The project is investigating the use of lab notebooks (not just for scientists) and also looking at templates for metadata to help streamline the research process and enable future reuse of data.

Whilst the key aims of this project do fall within the active data creation phase and thus outside of the OAIS model, they are still fundamental to the success of a digital archive and the value of working in this area is clear. One of the mandatory responsibilities of an OAIS is to ensure the independent utility of the data that it holds. In simple terms this means that the digital archive should ensure that as well as preserving the data itself, it also preserves enough contextual information and documentation to make that data re-usable for its designated community. This sounds simple enough but speaking from experience, as a digital archivist, this is the area that often causes frustration - going back to ask a data producer for documentation after the point of submission and at a time when they have moved on to a new project can be a less than fruitful exercise. A methodology for encouraging metadata generation at the point of data creation and to enable this to be seamlessly submitted to the archive along with the data itself would be most welcome.

Another project that sits cleanly outside of the OAIS model but impacts on it in a similar way is “Artivity” from the University of the Arts London. This is again about capturing metadata but with a slightly different angle. This project is looking at metadata to capture the creative process as an artist or designer creates a piece of digital art. They are looking at tools to capture both the context and the methodology so that in the future we can ask questions such as ‘how were the software tools actually used to create this artwork?’. As above, this project is enabling an institution to fulfil the OAIS responsibility of ensuring the independent utility of the data, but the documentation and metadata it captures is quite specific to the artistic process.

For both of these projects we would need to ensure that this rich metadata and documentation was deposited in the digital archive or repository alongside the data itself in a format that could be re-used in the future. As well as thinking about the longevity of file formats used for research data we clearly also need to think about file formats for documentation and metadata. Of course, when incorporating this data and metadata into a data archive or repository, finding a way of ensuring the link between the data and associated documentation is retained is also a key consideration.

The Clipper project (“Clipper: Enhancing Time-based Media for Research”) from City of Glasgow College provides another way of generating metadata - this time specifically aimed at time-based media (audio and video). Clipper is a simple tool that allows a researcher to cite a specific clip of a digital audio or video file. This solves a real problem in the citation and re-use of time-based media. The project doesn't relate directly to digital preservation but it could interface with the OAIS model at either end. Data produced from Clipper could be deposited in a digital archive (either alongside the audio or video file itself, or referencing a file held elsewhere). This scenario could occur when a researcher needs to reference or highlight a particular section to back up their research. At the other end of the spectrum, Clipper could also be a tool that the OAIS Access system encourages data consumers to utilise, for example, by highlighting it as a way of citing a particular section of a video that they are enabling access to. The good news is that the Clipper team have already been thinking about how metadata from Clipper could be stored for the long term within a digital archive alongside the media itself. The choice of HTML as the native file format for metadata should ensure that this data can be fairly easily managed into the future.

Still on the edges of the OAIS model (and perhaps most comfortably sitting within the Producer-Archive Interface) is a project called “Streamlining deposit: OJS to Repository Plugin” from City University London which intends to make the process of submission of papers to journals and associated datasets to repositories more streamlined for researchers. They are developing a plugin to send data direct from a journal to a data repository. They want to streamline the submission process for authors who need to make additional data available alongside their publications. This will ensure that the appropriate data gets deposited and linked to a publication in order to ultimately enable access to others.

Along a similar theme is “Giving Researchers Credit for their Data” from the University of Oxford. This project is also looking at more streamlined ways of linking data in repositories with publisher platforms and avoiding retyping of metadata by researchers. They are working on practical prototypes with Ubiquity, Elsevier and Figshare and looking specifically at the communication between the repository platform and publication platform.

Ultimately these two projects are all about giving researchers the tools to make depositing data easier and, in doing so, ensuring that the repository also gets the information it needs to manage the data in the long term. This impacts on digital preservation in two ways. First, the easier processes for deposit will encourage more data to be deposited in repositories where it can then be preserved. Second, data submitted in this way should include better metadata (with a direct link to a related publication) which will make the job of the repository in providing access to this data easier and ultimately encourage re-use.

Other projects explore the stages of the research lifecycle that occur once the active research phase is over, addressing what happens when data is handed over to an archive or repository for longer term storage.

The “DataVault” project at the Universities of Edinburgh and Manchester is primarily addressing the Archival Storage entity of the OAIS model. They are establishing a DataVault - a safe place to store research data arising from research that has been completed. This facility will ensure that the data is stored unchanged for an appropriate period of time. Researchers will be encouraged to use this facility for data that isn’t suitable for deposit via the repository but that they wish to keep copies of. This will enable them to fulfil funder requirements around retention periods. The DataVault, whilst primarily a storage facility, will also carry out other digital preservation functions. Data will be packaged using the BagIt specification, an initial stab at file identification will be carried out using Apache Tika and fixity checks will be run periodically to monitor the file store and ensure files remain unchanged. The project team have highlighted the fact that file identification is problematic in the sphere of research data as you work with so many data types across disciplines. This is certainly a concern that the “Filling the Digital Preservation Gap” project has shared.
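
To make the idea of a periodic fixity check a little more concrete, here is a minimal sketch in Python. It is not the DataVault implementation: the function names (`sha256_of`, `build_manifest`, `check_fixity`, `demo`) and the choice of SHA-256 are assumptions made purely for illustration. The sketch records a checksum for every file at the point of deposit and later reports any file that has gone missing or no longer matches its recorded checksum.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Record a checksum for every file under data_dir, keyed by relative path."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def check_fixity(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose checksum has changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = data_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems


def demo() -> tuple[list[str], list[str]]:
    """Check a throwaway directory before and after one file is altered."""
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        # Hypothetical deposit: one data file and one piece of documentation.
        (data_dir / "results.csv").write_text("a,b\n1,2\n")
        (data_dir / "readme.txt").write_text("Context for the dataset\n")
        manifest = build_manifest(data_dir)
        before = check_fixity(data_dir, manifest)
        # Simulate unwanted change to a stored file.
        (data_dir / "results.csv").write_text("a,b\n1,3\n")
        after = check_fixity(data_dir, manifest)
    return before, after


if __name__ == "__main__":
    print(demo())
```

In a real service the manifest would be stored with the package (BagIt, for example, keeps checksums in a manifest file alongside the payload) and the check would be scheduled to run at regular intervals rather than called by hand.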

Our own “Filling the Digital Preservation Gap” project focuses on some of the more hidden elements of a digital preservation system. We are not looking at digital preservation software or tools that a researcher will interact with, but with the help of Archivematica are looking at, among other things, the OAIS Ingest entity (how we process the data as it arrives in the digital archive) and the Preservation Planning entity (how we monitor preservation risks and react to them). In phase 3 we plan to address OAIS more holistically with our proofs of concept. I won’t go into any further detail here as our project already gets so much air space on this blog!

Another project looking more holistically at OAIS is “A Consortial Approach to Building an Integrated RDM System - Small and Specialist” led by the University for the Creative Arts. This project is looking at the whole technical infrastructure for RDM and in particular looking at how this infrastructure can be achievable for small and specialist research institutes with limited resources. In a phase 1 project report by Matthew Addis from Arkivum there is a full range of workflows described which cover many of the different elements of an OAIS. To give a few examples, there are workflows around data deposit (Producer-Archive Interface), research data archiving using Arkivum (Archival Storage), access using EPrints (Access), gathering and reporting usage metrics (Data Management) and last but not least a workflow for research data preservation using Archivematica which has parallels with some of the work we are doing in “Filling the Digital Preservation Gap”.

“DMAOnline” sits firmly within the Data Management entity of the OAIS, running queries on the functions of the other entities and producing reports. This tool being created by the University of Lancaster will report on the administrative data around research data management systems (including statistics around access, storage and the preservation of that data). Using a tool like this, institutions will be able to monitor their RDM activities at a high level, drill down to see some of the detail and use this information to monitor the uptake of their RDM services or to make an assessment of their level of compliance with funder mandates. From the perspective of the “Filling the Digital Preservation Gap” project we are pleased that the DMAOnline team have agreed to include reporting on the statistics from Archivematica in their phase 3 plans. One of the limitations of Archivematica that was highlighted in the requirements section of our own phase 1 report was the lack of reporting options within the system. A development we have been sponsoring during phase 2 of our project will enable third party systems such as DMAOnline to extract information from Archivematica for reporting purposes.

Much focus in RDM activities typically goes into the Access functional entity, which naturally follows on from viewing a summary of activity through DMAOnline. This is one of the more visible parts of the model - the end product if you like of all the work that goes on behind the scenes. A project with a key focus on access is “Software Reuse, Repurposing and Reproducibility” from the University of St Andrews. However, as is the case for many of these projects, it also touches on other areas of the model. At the end of the day, access isn't sustainable without preservation so the project team are also thinking more broadly about these issues.

This project is looking at software that is created through research (the software that much research data actually depends on). What happens to software written by researchers, or created through projects when the person who was maintaining it leaves? How do people who want to reuse the data get hold of the right software? The project team have been looking at how you assign identifiers to software, how you capture software in such a way to make it usable in the future and how you then make that software accessible. Versioning is also a key concern in this area - different versions of software may need to be maintained with their own unique identifiers in order to allow future users of the data to replicate the results of a particular study. Issues around the preservation of and access to software are a bit of a hot topic in the digital preservation world so it is great to see an RDS project looking specifically at this.

The Administration entity of an OAIS coordinates the other high level functional entities, oversees the operation of them and serves as a central hub for internal and external interactions. The “Extending OPD to cover RDM” project from the University of Edinburgh could be one of these external interactions. It has put in place a framework for recording what facilities and services your institution has in place for managing research data, covering technical infrastructure, policy and training. It allows an institution to make visible the information about their infrastructure and facilities and to compare it or benchmark it against others. The level of detail in this profile goes far above and beyond OAIS but allows an organisation to report on how it is meeting the ‘Data repository for longer term access and preservation’ component for example.

In summary it has been a useful exercise thinking about the OAIS model and how the different RDS projects in phase 2 fit within this framework. It is good to see how they all impact on and address digital preservation in some way - some by helping get the necessary metadata into the system, or enabling a more streamlined deposit process, others helping monitor or assess the performance of the systems in place and some projects more directly addressing key entities within the model. The outputs from these projects complement each other - designed to solve different problems and addressing discrete elements of the complex puzzle that is research data management.

Jenny Mitcham, Digital Archivist

