Friday, 18 September 2015

Spreading the word at the Northern Collaboration Conference

Collaborating with other delegates at the start of the day
Photo credit: Northern Collaboration Conference, Kiran Mehta
I gave a presentation last week at the 2015 Northern Collaboration Conference. This was my first trip to this conference, which is primarily aimed at those working in academic libraries, and it proved to be an interesting day.

The theme of the day was 'Being digital: opportunities for collaboration in academic libraries' so I thought our collaborative Jisc Research Data Spring project was a perfect fit. It was great to have a new audience to talk to about our plans to 'fill the digital preservation gap' for research data. Though it is academic libraries that are taking on this challenge, my typical audience tends to be those working in archives.

My slides are available on Slideshare for those who want to see what I was talking about.

I began by making sure that we were speaking the same language. Communication is a big issue for us digital archivists. If we talk in OAIS-speak, only other digital archivists will understand us. If, however, we use terms such as 'archiving' and 'curation', we fall into the trap of the multiple layers of meaning and (mis-)interpretation of these terms. As this was not my usual audience, it seemed best to put my cards on the table at the start and establish some basic principles.

Key takeaway message: This is not all about storage*

I then covered many of the questions included in the project FAQs that we produced in phase one of our project. Essentially:
  • Why are we doing this/ why do we need digital preservation?
  • What does research data look like?
  • What does Archivematica do?
  • What are its strengths and weaknesses?
  • How can we use it?

I was able to touch on the topic of the value of research data and how it is regarded by different researchers working in different disciplines. 

Researchers at York have different opinions on the value of their data and the challenges of curating it

The lack of clarity on the value of much of the data we will be looking after is the main reason why we propose the approach we are taking.

I'm inspired by Tim Gollins' paper 'Parsimonious Preservation: Preventing Pointless Processes!', which focuses on the primary need simply to collect the data and find out what you've got. Crucially, the 'knowing what you've got' step can be done at minimum expense through the use of available open source tools. Taking a pragmatic approach such as this is particularly appealing when the value of the data we are curating is such an unknown.

I then spoke briefly about phase two of the project through which we are trying to define our own workflows and implementation plans at York and Hull. I mentioned the development work that we are sponsoring as part of this project. Artefactual Systems are currently working on six different areas of development for us (as described in a previous blog post).

At the end of the session I handed out a short feedback form to try and gauge the level of interest in the project. Though only 6 of the 20 questionnaires were returned, respondents unanimously agreed they would go away and read the project FAQs in more detail and expressed an interest in a show and tell event once our proof of concept implementations were up and running. Most also thought they would download our report and talk to their colleagues about Archivematica and our project.

Quite a good result I think!

* though Archival Storage is still essential to any digital archive

Friday, 28 August 2015

Enhancing Archivematica for Research Data Management

Where has the time gone? ....we are now one month into phase two of "Filling the Digital Preservation Gap" ....and I have spent much of this first month on holiday!

Not quite a 'digital preservation gap' - just an excuse to show you my holiday snaps!
So with no time to waste, here is an update on what we are doing:

In phase two of our project we have two main areas of work. Locally at York and Hull we are going to be planning in more detail the proof of concept implementations of Archivematica for research data we hope to get up and running in phase three.

Meanwhile over in Canada, our collaborators at Artefactual Systems are starting work on a number of sponsored developments to help move Archivematica into a better position for us to incorporate it into our implementations for managing and preserving research data.

We have a project kick off call with Artefactual Systems scheduled for next week and we will be discussing our requirements and specifications for development in more detail, but in the meantime, here is a summary of the areas we are focusing on:

Automation of DIP generation on request 

Archivematica's AIP re-ingest functionality allows an AIP to be re-processed, making it possible to delay the generation of a DIP until such a time as one is requested. Building on this, the feature will enable further automation of the process.

This feature is of particular benefit to those situations where the value of data is not fully understood. It is unnecessary to create an access copy of all research datasets as some of them will never be requested. In our workflows for long term management of research data we would like to trigger the creation of a copy of the data for dissemination and re-use on request rather than create one by default and this piece of work will make this workflow possible.
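To make this concrete, here is a rough sketch of the kind of call a repository might make once the development is in place. To be clear, this endpoint does not exist yet - the URL, credentials and response shape are all invented for illustration:

```python
import requests

AM_URL = "https://archivematica.example.ac.uk"  # hypothetical host
HEADERS = {"Authorization": "ApiKey demo:xxxxxxxx"}  # placeholder credentials

def request_dip(aip_uuid):
    """Ask Archivematica to generate a DIP for a stored AIP, on demand.

    The endpoint below is an invented example of the kind of call this
    sponsored development would enable; it is not an existing API.
    """
    response = requests.post(
        f"{AM_URL}/api/v2/file/{aip_uuid}/generate_dip/",  # illustrative only
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Called the first time a user requests access to a dataset:
# job = request_dip("uuid-of-the-stored-aip")
```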

METS parsing tools

This development will involve creating a Python library which could be used by third party applications to understand the METS file that is contained within an Archivematica DIP. Additionally an HTTP REST service would be developed to allow third party applications to interact with the library in a programming language agnostic fashion.

This is key to being able to work with the DIP that is created by Archivematica within other repository or access systems that are not integrated with Archivematica. Both York and Hull have repositories built with Fedora and Hydra and this feature will allow the repositories to better understand the DIP that Archivematica creates. This development is in no way specific to a Fedora/Hydra repository and will equally benefit other repository platforms in use for RDM.
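As a flavour of what such a library would hide from repository developers, here is a minimal sketch that pulls file locations out of a METS document using the published METS and XLink namespaces (real Archivematica METS files are considerably more involved than this suggests):

```python
from lxml import etree

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

def list_dip_files(mets_path):
    """Yield (file ID, path) pairs for the files described in a METS document."""
    tree = etree.parse(mets_path)
    for f in tree.iter("{http://www.loc.gov/METS/}file"):
        flocat = f.find("mets:FLocat", NS)
        href = None
        if flocat is not None:
            href = flocat.get("{http://www.w3.org/1999/xlink}href")
        yield f.get("ID"), href

for file_id, path in list_dip_files("METS.example.xml"):
    print(file_id, path)
```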

Improved file identification

This feature will enable Archivematica to report on any unidentified files within a transfer alongside access to the file identification tool output. Further enhancements could help curatorial staff to submit information to PRONOM by partially automating this process.

It was highlighted in our phase one report that the identification of research data file formats is a key area of concern when managing research data for the longer term. This feature will help users of Archivematica see which files haven’t been identified and thus enable them to take action to establish what they hold. This feature will also encourage communication with PRONOM to enhance the database of file formats for the future, thus enabling a more sustainable community approach to addressing this problem.
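In the meantime, because Archivematica records identification results as PREMIS metadata within the METS file, a rough report of unidentified files can already be approximated with a few lines of Python. This is a simplified sketch; the exact element structure varies between Archivematica versions:

```python
from lxml import etree

NS = {
    "mets": "http://www.loc.gov/METS/",
    "premis": "info:lc/xmlns/premis-v2",
}

def unidentified_files(mets_path):
    """List original file names whose PREMIS format has no PRONOM key."""
    tree = etree.parse(mets_path)
    for obj in tree.iter("{info:lc/xmlns/premis-v2}object"):
        name = obj.findtext(".//premis:originalName", namespaces=NS)
        puid = obj.findtext(
            ".//premis:formatRegistry/premis:formatRegistryKey", namespaces=NS
        )
        if name and not puid:
            yield name

for name in unidentified_files("METS.example.xml"):
    print("No PRONOM match:", name)
```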

Generic Search API

This development is a proof of concept search REST API for Archivematica, allowing third party applications to query Archivematica for information about the objects in archival storage.

There is a need to be able to produce statistics or reports on RDM and digital preservation processes in order to obtain a clear picture of what data has been archived. This development will enable these statistics to be generated more easily and sustainably. For example this would enable tools such as the DMAonline dashboard in development at Lancaster University to pull out summary statistics from Archivematica.
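As this is only a proof of concept at this stage, the sketch below is purely illustrative of the sort of query a reporting tool might make - the endpoint, parameters and response shape are assumptions rather than a real interface:

```python
import requests

AM_URL = "https://archivematica.example.ac.uk"  # hypothetical host
HEADERS = {"Authorization": "ApiKey demo:xxxxxxxx"}  # placeholder credentials

def summarise_archived_data():
    """Query a hypothetical search API for summary statistics on stored AIPs."""
    response = requests.get(
        f"{AM_URL}/api/v2/search/aips/",        # illustrative endpoint only
        params={"fields": "size,stored_date"},  # assumed parameters
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()
    aips = response.json()["results"]  # assumed response shape
    total_bytes = sum(a["size"] for a in aips)
    print(f"{len(aips)} AIPs in archival storage, {total_bytes / 1e9:.1f} GB total")

summarise_archived_data()
```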

Support for multiple checksum algorithms 

Currently Archivematica generates SHA256 checksums for all files, and inserts those into PREMIS fixity tags in the METS file. In addition, two premis:events are generated for each file. All three of these entries are currently hardcoded to assume SHA256. This development would include support for other hash algorithms such as MD5, SHA1 and SHA512.

Research data files can be large in size and/or quantity and may take some time to process through the Archivematica pipeline. One of the potential bottlenecks highlighted in the pipeline is checksum generation, which happens at more than one point in the process. SHA256 checksums can take a long time to create and it has been suggested that having the option to alter the checksum algorithm within Archivematica could speed things up. Having additional configuration options within Archivematica will give institutions the flexibility to refine and configure their pipelines to reduce bottlenecks where appropriate.
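The trade-off between algorithms is easy to demonstrate with Python's standard hashlib module. This is a quick and rough benchmark rather than a rigorous one - timings will vary with hardware, but the relative costs are the point:

```python
import hashlib
import os
import time

data = os.urandom(64 * 1024 * 1024)  # 64 MB of random data as a stand-in file

for name in ("md5", "sha1", "sha256", "sha512"):
    start = time.perf_counter()
    digest = hashlib.new(name)
    digest.update(data)
    digest.hexdigest()
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```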


Improved documentation

The ability to automate preservation processes is of primary importance where few resources are available to manually process data of unknown value. Fuller documentation of how an automated workflow can be configured within Archivematica using the existing APIs would be very helpful for those considering Archivematica for RDM and will help remove some of the barriers to its use. We will therefore be funding a small piece of work to improve the Archivematica documentation for developers and for those installing or administering the system.

We very much hope these enhancements will be useful to the wider community of Archivematica users and not just to those looking specifically at preserving research data.

Our thanks go to Artefactual Systems for helping to turn our initial development ideas into these more concrete proposals.

As ever we are happy to receive your thoughts and feedback so do get in touch if you are interested in the work we are carrying out or have things to share with us around these development ideas.

Friday, 24 July 2015

Archivematica for research data? The FAQs

Thinking of preserving research data? Wondering what Archivematica does? Interested in what it might cost?

What follows is a series of FAQs put together by the "Filling the digital preservation gap" project team and included as part A of our phase 1 project report. We hope you find this useful in helping to work out whether Archivematica is something you could use for RDM. There are bound to be questions we haven't answered so let us know if you want to know more...


Why do we need a digital preservation system for research data?

Research data should be seen as a valuable institutional asset and treated accordingly. Research data is often unique and irreplaceable. It may need to be kept to validate or verify conclusions recorded in publications. Funder, publisher and often internal university requirements ask that research data is available for others to consult and is preserved in a usable form after the project that generated it is complete.

In order to facilitate future access to research data we need to actively manage and curate it. Digital preservation is not just about implementing a good archival storage system or 'preserving the bits'; it is about working within the framework set out by international standards (for example the Open Archival Information System) and taking steps to increase the chances of enabling meaningful re-use in the future.

What are the risks if we don't address digital preservation?

Digital preservation has been in the news this year (2015). An interview with Google vice-president Vint Cerf in February grabbed the attention of the mainstream media with headlines about the fragility of digital media and the onset of a 'digital dark age'.

This is clearly already a problem for researchers with issues around format and media obsolescence already being encountered. In a 2013 Research Data Management (RDM) survey undertaken at the University of York just under a quarter of respondents to the question “Which data management issues have you come across in your research over the last five years?” selected the answer “Inability to read files in old software formats on old media or because of expired software licences”. These are the sorts of problems that a digital preservation system is designed to address.

Due to its complexity digital preservation is very easy to put in the ‘too difficult’ box. There is no single perfect solution out there and it could be argued that we should sit it out and wait until a fuller set of tools emerges. A better approach is to join the existing community of practice and embrace some of the working and evolving solutions that are available.

Why are we interested in Archivematica?

Archivematica is an open source digital preservation system that is based on recognised standards in the field. Its functionality and the design of its interfaces were based on the Open Archival Information System and it uses standards such as PREMIS and METS to store metadata about the objects that are being preserved. Archivematica is flexible and configurable and can interface with a range of other systems.

A fully fledged RDM solution is likely to consist of a variety of different systems performing different functions within the workflow; Archivematica will fit well into this modular architecture and fills the digital preservation gap in the infrastructure.

The Archivematica website states that “The goal of the Archivematica project is to give archivists and librarians with limited technical and financial capacity the tools, methodology and confidence to begin preserving digital information today.” This vision appears to be a good fit with the needs and resources of those who are charged with managing an institution’s research data.

It should be noted that there are other digital preservation solutions available, both commercial and open source, but these were not assessed as part of this project.

Why do we recommend Archivematica to help preserve research data?

  • It is flexible and can be configured in different ways for different institutional needs and workflows
  • It allows many of the tasks around digital preservation to be carried out in an automated fashion
  • It can be used alongside other existing systems as part of a wider workflow for research data
  • It is a good digital preservation solution for those with limited resources
  • It is an evolving solution that is continually driven and enhanced by and for the digital preservation community; it is responsive to developments in the field of digital preservation
  • It gives institutions greater confidence that they will be able to continue to provide access to usable copies of research data over time.

What does Archivematica actually do?

Archivematica runs a series of microservices on the data and packages it up (with any metadata that has been extracted from it) in a standards compliant way for long term storage. Where a migration path exists, it will create preservation or dissemination versions of the data files to store alongside the originals and create metadata to record the preservation actions that have been carried out.
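As a concrete example of the 'packaging up', Archivematica's AIPs follow the BagIt specification, which means the fixity of a package can be checked with a few lines of standard Python. This is a minimal sketch (the bagit-python library does the same job more robustly):

```python
import hashlib
from pathlib import Path

def verify_bag(bag_dir):
    """Check each file in a BagIt sha256 manifest against its recorded checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text()
    for line in manifest.splitlines():
        expected, relpath = line.split(maxsplit=1)
        # Reads whole files into memory - fine for a sketch, not for huge data
        actual = hashlib.sha256((bag / relpath).read_bytes()).hexdigest()
        print("OK " if actual == expected else "FAILED", relpath)

# verify_bag("/path/to/an/extracted/aip")
```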

A more in depth discussion of what Archivematica does can be found in the report text. Full documentation for Archivematica is available online.

How could Archivematica be incorporated into a wider technical infrastructure for research data management?

Archivematica performs a very specific task within a wider infrastructure for research data management - that of preparing data for long term storage and access. It is also worth stating here what it doesn’t do:

  • It does not help with the transfer of data (and/or metadata) from researchers
  • It does not provide storage
  • It does not provide public access to data
  • It does not allocate Digital Object Identifiers (DOIs)
  • It does not provide statistics on when data was last accessed
  • It does not manage retention periods and trigger disposal actions when that period has passed

These functions and activities will need to be established elsewhere within the infrastructure as appropriate.

What does research data look like?

Research data is hard to characterise, varying across institutions, disciplines and individual projects. A wide range of software applications are in use by researchers and the file formats generated are diverse and often specialist.

Higher education institutions typically have little control over the data types and file formats that their researchers are producing. We ask researchers to consider file formats as a part of their data management plan and can provide generic advice on preferred file formats if asked, but where many of the specialist data formats are concerned it is likely that there is no ‘preservation-friendly’ alternative that retains the significant properties of the data.

Research data can be large in size and/or quantity. It often includes elements that are confidential or sensitive. Sensitivities are likely to vary across a dataset, with some files being suitable for wider access and others being restricted; a one-size-fits-all approach to rights metadata is not appropriate. In some cases there will be different versions of the data that need to be preserved or different deposits of data for a single research project. Scenarios such as these are likely to come about where data is being used to support multiple publications over the course of a piece of research.

Research data may come with administrative data and documentation. These may be documents relating to ethical approval or grant funding, data management plans or documentation or metadata relating to particular files. The association between the research data and any associated administrative information should be maintained.

It can be difficult to ascertain the value of research data at the point of ingest. Some data will be widely used and should be preserved for the long term and other data will never be accessed and will be disposed of at the end of its retention period.

How would Archivematica handle research data?

Archivematica can handle any type of data but it should be noted that a richer level of preservation will only be available for some file formats. Archivematica (like other digital preservation systems) will recognise and identify a large number of research data formats but by no means the full range. For a smaller subset of these file formats (for example a range of raster and vector image and audio visual formats) it comes with normalisation rules and tools. It can be configured to normalise other file formats as required (where open source command line tools are available). Archivematica also allows for the flexibility of manual normalisations. This gives data curators the opportunity to migrate files in a more manual way and update the PREMIS metadata by hand accordingly.

For other data types (and this will include many of the file formats that are created by researchers), Archivematica may not be able to identify, characterise or normalise the files but will still be able to perform certain functions such as virus checking, cleaning up file names, creating checksums and packaging the data and metadata up to create an Archival Information Package.

Archivematica can handle large files (or large volumes of small files) but its abilities in this area are very much dependent on the processing power that has been allocated to it. Users of Archivematica should be aware of the capabilities of their own implementation and be prepared to establish a cut off point over which data files of a certain size may not be processed, or may need to be processed in a different way.

Archivematica uses the PREMIS metadata standard to record rights metadata. Rights metadata can be added for the Submission Information Package as a whole rather than in a granular fashion. This is not ideal for research data for which there are likely to be different levels of sensitivity for different elements within the final submitted dataset. The Archivematica manual suggests that fuller rights information would be added to the access system (outside of Archivematica).

The use of Archival Information Collections (AICs) in Archivematica enables the loose association of groups of related Archival Information Packages (AIPs). This may be a useful feature for research data where different versions of a dataset or parts of a dataset are deposited at different times but are all associated with the same research project.

Archivematica is a suitable tool for preserving data of unknown value. Workflows within Archivematica and the processing of a transferred dataset from a Submission Information Package (SIP) to an Archival Information Package (AIP) can be automated. This means that some control over the data and a level of confidence that the data is being looked after adequately can be gained, without expending a large amount of staff time on curating the data in a manual fashion. If the value of the data is seen to increase (by frequent requests to access that data or as a result of assessment by curatorial staff) further efforts can be made to preserve the data using the AIP re-ingest feature and perhaps by carrying out a level of manual curation. The extent of automation within Archivematica can be configured so staff are able to treat datasets in different ways as appropriate. Institutions may have a range of approaches here, but the levels of automation that are possible provide a compelling argument for the adoption of Archivematica if few staff resources are available for manual preservation.

What are the limitations of Archivematica for research data?

Archivematica should not be seen as a magic bullet. It does not guarantee that data will be preserved in a re-usable state into the future; it can only be as good as current digital preservation theory and practice, and digital preservation itself is not a fully solved problem.

Research data is particularly challenging from a preservation point of view due to the range of data types and formats in existence, many of which are not formats for which digital preservation tools and policies exist; these will not receive as high a level of curation when ingested into Archivematica.

As mentioned above, the rights metadata within Archivematica may not fit the granularity that would be required for research data. This information would need to be held elsewhere within the infrastructure.

One of Archivematica’s strengths is its flexibility and the fact it can be configured to suit the needs of the institution or for a particular workflow. This however may also act as an initial barrier to use. It takes a bit of time to become familiar with Archivematica and to work out how you want to set it up. It is also a tool that most people would not want to use in isolation, and considerable thought needs to go into how it needs to interact with other systems and what workflow may best suit your institution.

The user interface for Archivematica is not always intuitive and takes some time to fully understand. There is currently no indication within the GUI of how far along Archivematica is with processing, or any estimate of how long a particular microservice has left to run. This is a limitation for large datasets if you are processing them through Archivematica's dashboard; for a more automated curation workflow it will have no impact.

What costs are associated with using Archivematica?

Archivematica is a free and open source application but this does not mean it will not cost anything to run. As a minimum an organisation wishing to run Archivematica locally will need both technical and curatorial staff resource. A level of technical knowledge is required to install and troubleshoot Archivematica and perform necessary upgrades. Further technical knowledge is required to consider how Archivematica fits into a wider infrastructure and to get systems talking to each other. As Archivematica is open source, developer time could be devoted to enhancing it to suit institutional needs. Developer time can also be bought from Artefactual Systems, the lead developer of Archivematica, to fund specific enhancements or new functionality which will be made available to the wider user base. In order to make the most of the system, organisations may want to consider factoring in a budget for developments and enhancements.

It is essential to have at least one member of curatorial staff who can get to grips with the Archivematica interface, make administrative decisions about the workflow and edit the format policy registry where appropriate. A level of knowledge of digital preservation is required for this, particularly where changes or additions to normalisation rules within the format policy registry are being considered. A greater number of curatorial staff working on Archivematica will be necessary the more manual steps there are within the workflow (for example if manual selection and arrangement, metadata entry or normalisations are carried out). This requirement for curatorial staff will increase in line with the volumes of data that are being processed.

Technical costs of establishing an Archivematica installation should also be considered. For a production system the following server configuration is recommended as a minimum:
  • Processor: dual core i5 3rd generation CPU or better
  • Memory: 8GB+
  • Disk space: 20GB plus the disk space required for the collection
The software can be installed on a single machine or across a number of machines to share the workload. At the time of writing, the software requires a current version of Ubuntu LTS (14.04 or 12.04) as its operating system.

What other systems is Archivematica integrated with?

Archivematica provides various levels of integration with DSpace, CONTENTdm, AtoM, Islandora and Archivist’s Toolkit for access and Arkivum, DuraCloud and LOCKSS for storage. There are ongoing integrations underway with ArchivesSpace, Hydra, BitCurator and DataVerse.

In addition, Archivematica provides a transfer REST API that can be used to initiate transfers within the software, the first step of the preservation workflow. Archivematica’s underlying Storage Service also provides RESTful APIs to facilitate the creation, retrieval and deletion of AIPs.
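As a rough illustration, starting a transfer through this API looks something like the sketch below. This is based on the Archivematica 1.x API as documented at the time of writing - check the current documentation for exact parameter names, and note the host, credentials, location UUID and paths here are placeholders:

```python
import base64
import requests

AM_URL = "https://archivematica.example.ac.uk"  # placeholder host
HEADERS = {"Authorization": "ApiKey demo:xxxxxxxx"}  # placeholder credentials

def start_transfer(name, location_uuid, rel_path):
    """Kick off a standard transfer for a directory in a transfer source location."""
    # The API expects each path as base64("location-uuid:relative/path")
    path = base64.b64encode(f"{location_uuid}:{rel_path}".encode()).decode()
    response = requests.post(
        f"{AM_URL}/api/transfer/start_transfer/",
        headers=HEADERS,
        data={"name": name, "type": "standard", "paths[]": [path]},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# start_transfer("dataset-0001", "aaaa-bbbb-placeholder-uuid", "incoming/dataset-0001")
```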

How can you use Archivematica?

There are 3 different ways that an institution might wish to use Archivematica:
  • Local - institutions may install and host Archivematica locally and link it to their preferred storage option
  • Arkivum - a managed service from Arkivum will allow Archivematica to be hosted locally within the institution with upgrades and support available through Arkivum in partnership with Artefactual Systems. A remote hosting option is also available. Both include integration of Archivematica and Arkivum storage.
  • ArchivesDirect - a hosted service from DuraSpace that combines Archivematica’s preservation functionality with DuraCloud for storage.

How could Archivematica be improved for research data?

It should be noted that Archivematica is an evolving tool that is under active development. During the short 3 months of phase 1 of our project “Filling the Digital Preservation Gap” we have been assessing a moving target. Version 1.3 was installed for initial testing. A month in, version 1.4 was released to the community. As we write this report, version 1.5 is under development and due for imminent release; a version of this has been made available to us for testing.

Archivematica is open source but much of the development work is carried out by Artefactual Systems, the company that supports it. They have their own development roadmap for Archivematica but most new features that appear are directly sponsored by the user community. Users can pay to have new features, functionality or integrations built into Archivematica; Artefactual Systems try to carry out this work in such a way as to make the features useful to the wider user community, and agree to continue to support and maintain the code base through subsequent versions of the software. This 'bounty model' for open source development seems to work well and keeps the software evolving in line with the priorities of its user base.

During the testing phase of this project we have highlighted several areas where Archivematica could be improved or enhanced to provide a better solution for research data and several of these features are already in development (sponsored by other Archivematica users). In phase 2 of the project we hope to be able to contribute to the continued development of Archivematica.

Who else is using Archivematica to do similar things?

Archivematica has been adopted by several institutions internationally but its key user base is in Canada and the United States. A list of a selection of Archivematica users can be found on their community wiki pages. Some institutions are using Archivematica to preserve research data. Both the Zuse Institute Berlin and Research Data Canada’s Federated Pilot for Data Ingest and Preservation are important to mention in this context.

Archivematica is not widely used in the UK but there are current implementations at the National Library of Wales and the University of Warwick’s Modern Records Centre. Interest in Archivematica in the UK is growing. This is evidenced by the establishment this year of a UK Archivematica group which provides a local forum to share ideas and case studies. Representatives from 15 different organisations were present at the last meeting at Tate Britain in June 2015 and a further meeting is planned at the University of Leeds in the autumn.

Where can I find out more?

Our full project report is available on Figshare.

Read Part B of the report for fuller details of our findings.

Friday, 17 July 2015

Improving RDM workflows with Archivematica

In a previous post we promised diagrams to help answer the 'How?' of the 'Filling the Digital Preservation Gap' project. In answer to this, here is a guest post by Julie Allinson, Manager of Digital York, who looks after all things Library and Archives IT.

Having just completed phase one of a Jisc Research Data Spring project with the University of Hull, we have been thinking a lot about the potential phases two and three which we hope will follow. But even if we aren't funded to continue the project, the work so far won't be wasted here at York (and certainly the wider community will benefit from our project report!) as it has given us some space to look at our current RDM workflow and identify ways it might be improved, particularly to include a level of curation and preservation.

My perspective on all of this is to look at how things can fit together, and I am a believer in using the right system for the right job. Whether out of lack of resource or misunderstanding, I feel we often retro-fit existing systems and try to make them meet a need for which they weren't designed. To a degree I think this happens to PURE. PURE is a Current Research Information System (CRIS), and a leading product in that space. But it isn't (in my opinion) a data repository, a preservation system, a management tool for handling data access requests or any other of a long list of things we might want it to be.

For York, PURE is where we collect information about our research and our researchers. PURE provides both an internal and, through the PURE portal, an external view of York’s research. For the Library, it is where we collect information about research datasets and outputs. In our current workflow, the full-text is sent to our research repository for storage and dissemination. The researcher interfaces only with PURE for deposit, the rest is done by magic. For research data the picture is similar, but we currently take copies of data in only some cases, where there is no suitable external data archive, and we do this ‘offline’, using different mechanisms depending on the size of the dataset.

Our current workflow for dealing with Research Data deposits is OK; it works, and it is probably similar to that of many institutions still feeling their way in this new area of activity. It looks broadly like this:

  • Researcher enters dataset metadata into PURE
  • Member of Library staff contacts them about their data and, if appropriate, takes a copy for our long term store
  • Library staff check and verify the metadata, and record extra information as needed.
  • PURE creates a DOI
  • DOI is updated to point to our interim datasets page (default PURE behaviour is to create a link to the PURE portal, which we have not currently implemented for datasets)

But … it has a lot of manual steps in it, and we don’t check or verify the data deposited with us in any systematic way. Note all of the manual (dotted) steps in the following diagram.

How we do things at the moment

I'd like to improve this workflow by removing as many manual steps as possible, thereby increasing efficiency, eradicating re-keying and reducing the margin for error, whilst at the same time adding in some proper curation of the data.

The Jisc Research Data Spring project allowed us to ask the question: ‘Can Archivematica help?’, and without going into much detail here, the answer was a resounding ‘Yes, but...’.

What Archivematica can most certainly give us is a sense of what we’ve got, to help us ‘peer into the bag’ as it were, and it can have a good go at both identifying what’s in there and at making sure no nasties, like viruses and corrupt files, lurk.

Research data has a few prominent features:

  1. lots of files of types that standard preservation tools can’t identify
  2. bags of data we don’t know a helluva lot about and where the selection and retention process has been done by the researcher (we are unlikely to ever have the resource or the expertise to examine and assess each deposit)
  3. bags of data that we don’t really know at this point are going to be requested

My colleague, Jen, has been looking at 1) and I’ve focussed my work on 2) and 3). I had this thought:

Wouldn’t it be nice if we could push research data automatically through a simple Archivematica pipeline for safe and secure storage, but only deliver an access copy if one is requested?

Well, Archivematica CAN be almost entirely automated and it WILL be able to generate the DIP at a later step, manually, in the coming months. And with some extra funded development it COULD support automated DIP creation via an API call.

So, what’s missing now? We can gather metadata in PURE and make sure the data we get from researchers is verified and looked after, and service access requests by generating a DIP.

How will we make that DIP available? Well, we already have a public facing Digital Library which can offer that service.

What’s missing is the workflow, the glue that ties all of these steps together, and really I’d argue that’s not the job of either Archivematica or PURE.

In the diagram below I’ve proposed an architecture for both deposit and access that would both remove most manual steps and add Archivematica into the workflow.

In the deposit workflow the researcher would continue to add metadata to PURE. We have kept a manual communication step here as we feel that there will almost certainly be some human discussion needed about the nature of the data. The researcher would then upload their data, and we would be able to manage all subsequent steps in a largely automated way, using a lightweight ‘RDMonitor’ tool to match up the data with the PURE metadata using information gleaned from the PURE web services, to log the receipt of data and to initiate the transfer to Archivematica for archiving.

How we'd like to do things in the future - deposit workflow
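To make the idea of 'RDMonitor' more concrete, here is a very rough sketch of the glue logic described above. RDMonitor does not exist yet - the PURE endpoint, response fields and helper functions are all invented for illustration:

```python
import requests

PURE_URL = "https://pure.example.ac.uk/ws"  # hypothetical PURE web service host

def match_deposit_to_pure(doi):
    """Look up a dataset record in PURE by DOI (illustrative endpoint and fields)."""
    response = requests.get(f"{PURE_URL}/datasets", params={"doi": doi}, timeout=30)
    response.raise_for_status()
    return response.json()["items"][0]  # assumed response shape

def handle_deposit(doi, uploaded_path):
    """Log receipt of a dataset and hand it over to Archivematica for archiving."""
    record = match_deposit_to_pure(doi)
    print(f"Received data for: {record['title']}")  # stand-in for proper logging
    # Initiate the transfer to Archivematica, e.g. via its transfer API
    # (see the start_transfer() sketch in the FAQs post above).
    # start_transfer(record["title"], TRANSFER_SOURCE_UUID, uploaded_path)
```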

In the standard access workflow, the requestor would send an email which would automatically be logged by the Monitor, initiating the creation of a dissemination version of the data and its transfer to our repository, and an automated email would be sent to the requestor alerting them to the location of the data. Manual steps in this workflow are needed to add the direct link to the data in PURE (PURE has no APIs for POSTing updates, only GETting information) and also to ensure due diligence in making data available.

How we'd like to do things in the future - access to research data

These ideas need more unpicking over the coming months, and there are other ways that it could be delivered, using PURE for the upload of data, for example. We can also see potential extensions too, like pulling in a Data Management Plan from DMPOnline to store with the archive package. This post should be viewed as a first stab at illustrating our thinking in the project, motivated by the idea of making lightweight interfaces to connect systems together. Hull will have similar, but different, requirements that they want to explore in further phases too.

Tuesday, 14 July 2015

Archivematica fills the digital preservation gap – report now available

Phase 1 of our Jisc Research Data Spring project “Filling the Digital Preservation Gap” is now complete.

Over the 3 months of this project, the team from the Universities of York and Hull have been testing Archivematica, talking to other users, exploring the nature of research data, and thinking about workflows and the wider RDM infrastructure. 

In conclusion, we are happy to recommend Archivematica as a solution for the preservation of research data for the following reasons:

  • It is freely available
  • It is an evolving and developing system
  • It has an engaged international user community (who directly help fund and drive developments and improvements)
  • It is configurable to the needs of different institutions
  • Automated workflows can be created
  • It is standards compliant

We have produced a report that details our findings and this is now available on Figshare.

The report has two distinct parts:

  • Part A has been created as a series of FAQs that answer the main questions you might have about using Archivematica for research data. This is a summary of the questions in our mind at the start of the project and pulls together some of the conclusions we have reached as the project has progressed. We hope this will provide a quick and easy reference for those who just want to know a little bit more about Archivematica, digital preservation and research data, including information on how you could use Archivematica and what it might cost to do so.
  • Part B contains more detail about the project findings and the areas we have been working in, and provides evidence for the conclusions in part A. It includes information about file formats for research data, our testing of Archivematica, technical details of how Archivematica works and our plans for developing and enhancing Archivematica should we get funding for phase 2 of the project. Report appendices focus on our digital preservation requirements for research data and how Archivematica meets these, technical details of our local Archivematica implementations (and how we have configured them) and a brief analysis of how research data file formats are currently represented in PRONOM.

We are confident that this report will be of interest not just to the growing community of Archivematica users, but to those who are still in the early stages of thinking about digital preservation. Though much of the report is focussed specifically on Archivematica, there are also several sections that will be relevant to those who are considering using other digital preservation systems for research data. It should also be noted that though the project has looked specifically at research data many of the lessons and findings are transferable to other data types or workflows.

Please do tell us what you think!

Tuesday, 9 June 2015

The second meeting of the UK Archivematica group

Unaffected by the threatened rail strike and predicted thundery showers, the second UK Archivematica meeting went ahead at Tate Britain in London last week.

This second meeting saw an increase in attendees with 22 individuals representing 15 different organisations in the UK and beyond (a representative from the Zuse Institute in Berlin joined us). It is great to see a growing awareness and engagement with Archivematica in the UK. The meeting provided another valuable chance to catch up, compare notes and talk about some of the substantial progress that has been made since our last meeting in January.

Updates were given by Marco Klindt from the Zuse Institute on their plans for an infrastructure for digital preservation as a service, and Chris Grygeil from the University of Leeds on their latest thinking on a workflow for digital archiving. Marco plans to use Archivematica as an ingest tool before pushing the data to Fedora for access. The Zuse Institute have been working with Artefactual Systems on sponsoring some useful AIP re-ingest functionality which will allow Archivematica to re-process AIPs at a later date. Chris updated us on ongoing work at Leeds to define their Archivematica workflows. Here BitCurator is being used before ingest into Archivematica and there is ongoing discussion about how exactly the two tools fit together in the workflow. BitCurator can highlight sensitive information and perform redactions, but do you want to do this to original files before ingesting them with Archivematica?

Matthew Addis from Arkivum gave a really interesting presentation on some work he has been doing on testing how Archivematica handles scientific datasets, specifically genomics data. He described this as being large in size, unidentified by Archivematica and with no normalisation pathways. This struck a chord with me being that I have spent much of the past few weeks looking at the types of data that researchers produce and finding a long tail of specialist or scientific data formats that are of a similar nature. His testing of the capabilities of Archivematica has produced some useful results, with success at processing a 500GB file in 5 hours.
Donuts at J.Co by Masked Card on Flickr, CC BY-NC-ND 2.0

Next I gave an update on our Jisc Research Data Spring project "Filling the digital preservation gap". Apart from going on for too long and keeping people from their coffee and doughnuts, I gave an introduction to our project, focusing on the nature of research data and our findings about file formats in use by researchers at the University of York. See previous blog posts (first, second) for more information on where we are with this work.

I talked about how file identification is key to digital preservation, as demonstrated by the NDSA Levels of Digital Preservation where having an inventory of the file formats in your archive comes in quite early at level 2. If you don't know what you've got it is very difficult to make decisions about how you can manage and preserve that content for the long term. This is the case whether we are talking about a migration or an emulation based strategy for digital preservation.

I went on to discuss briefly the 3 basic workflows for using Archivematica and asked for feedback on these:
  1. Archivematica is the start of the process. Archivematica produces both the Archival Information Package (AIP) and the Dissemination Information Package (DIP) and the DIP is sent to the repository
  2. The repository is the start of the process and the Submission Information Package (SIP) goes from there into Archivematica. There are potential variations in the workflow here depending on whether you want Archivematica or the repository to produce the DIP
  3. Archivematica is utilised as a tool that is called separate to the repository as part of the workflow

Are there any others that I've missed?

I also talked through some of the ideas we have had for enhancements to Archivematica. We are hoping that a subsequent phase of this project will enable us to sponsor some development work to make Archivematica better suited for inclusion within a wider infrastructure for managing research data. I highlighted the development ideas on our current short list and asked attendees to rate whether each idea would be 'very useful', 'quite useful' or 'not at all useful' for their own proposed implementations.

It is really helpful for us to get feedback from other Archivematica users so that we can ensure that what we are proposing will be more widely useful (and that we haven't missed an alternative solution or workaround). Over the next week the project team will be reviewing the development ideas and the feedback received at the UK Archivematica meeting and speaking to Artefactual Systems (who support Archivematica) about our ideas.

The day finished with an introduction to Binder ("an open source digital repository management application designed to meet the needs and complex digital preservation requirements of cultural heritage institutions"). We watched this video as an introduction to the system and then had a conference call with Ben Fino-Radin from the Museum of Modern Art who was able to answer our questions. Binder looks to be an impressive tool for helping to manage digital assets. Building on the basics of Archivematica (which essentially packages things up for preservation), Binder provides an attractive front end enabling curators to more effectively manage and better understand their digital collections.

The next meeting of the UK Archivematica group is planned to be held in Leeds in October/November 2015. It was agreed that we would schedule a longer session in order to allow for more informal discussion and networking alongside the scheduled presentations and progress reports. I'm confident that the group will have lots more Archivematica activity to report on by the time we next meet.

Friday, 15 May 2015

Jisc Archivematica project update - making progress with the 'how?' and the 'what?'

I mentioned in a previous post that we have funding through the Jisc Research Data Spring initiative for phase 1 of a project to look at the potential of using Archivematica to help manage research data for the longer term. Here is the second of our project updates showing what progress we have made over the last few weeks.

I cannot quite believe we are already halfway through the first three-month phase of this project. Where has all the time gone?

The most exciting moment of the last few weeks came during a Skype call between the project teams in York and Hull when, within minutes of each other, both institutions managed to get their first test archives transferred into their institutional implementations of Archivematica! A few technical hiccups after the initial setting up of Archivematica had stopped this momentous occasion happening earlier. This does to my mind highlight one of the resourcing implications of Archivematica: a certain amount of technical skill is required to understand and troubleshoot the error messages and configure the system to ensure that all is working smoothly. With my limited technical abilities, this is not something I would have been able to do myself!

We are now continuing to test the capabilities of Archivematica and alongside this I have read the Archivematica documentation from virtual cover to virtual cover, followed the mailing list posts with interest and chatted to other users. I am hugely grateful to other institutions that have been happy to share information about their infrastructure and workflows.

The project team have a brainstorming meeting planned for next week to discuss the project 'how?'...

  • How? How would we incorporate Archivematica into a wider technical infrastructure for research data management and what workflows would we put in place? Where would it sit and what other systems would it need to talk to?

If all goes well, expect diagrams next time!

Following up from my previous blog post which talked about the project 'what?'...

  • What? What are the characteristics of research data and how might it differ from other born digital data that memory institutions are establishing digital archives to manage and preserve? What types of files are our researchers producing and how would Archivematica handle these?
...I've been having some interesting conversations with real researchers.

Though most of my time is spent hidden away in my office within the archives, it is a real bonus being involved with the wider Research Data Management project at the University of York and helping deliver data management training courses to researchers. Getting out and having the opportunity to talk to researchers about their data is invaluable in helping to keep an eye on the longer term goals of this project.

In a recent training session I encouraged researchers to complete a simple questionnaire, which tells us a bit more about the software packages and file formats they use.

Helping to answer the project 'what?'

Some researchers, on completing this basic level of information, have also agreed to be contacted by me and have subsequently provided some samples of their data. Nothing sensitive or confidential, but files that they have agreed I can share with The National Archives to create file signatures within PRONOM. I hope this will lead to more types of research data being identifiable within digital preservation systems (Archivematica included).

I'm not reaching a huge number of researchers through this and subsequent training sessions over the next few weeks, so with help from colleagues we've also sent an e-mail out requesting sample files from the top 20 software packages used by researchers at York. Sample files are coming in at a slow trickle rather than a deluge but hopefully we will soon have a suitable test set to share with The National Archives.

The most popular applications and software used by researchers at the University of York (from Software and Training Questionnaire report by Emma Barnes and Andrew Smith, 2014)