
Wednesday, 20 September 2017

Moving a proof of concept into production? It's harder than you might think...

My colleagues and I blogged a lot during the Filling the Digital Preservation Gap project, but I’m aware that I’ve gone a bit quiet on this topic since…

I was going to wait until we had a big success to announce, but follow on work has taken longer than expected. So in the meantime here is an update on where we are and what we are up to.

Background


Just to re-cap, by the end of phase 3 of Filling the Digital Preservation Gap we had created a working proof of concept at the University of York that demonstrated that it is possible to create an automated preservation workflow for research data using PURE, Archivematica, Fedora and Samvera (then called Hydra!).

This is described in our phase 3 project report (and a detailed description of the workflow we were trying to implement was included as an appendix in the phase 2 report).

After the project was over, it was agreed that we should go ahead and move this into production.

Progress has been slower than expected. I hadn’t quite appreciated just how different a proof of concept is to a production-ready environment!

Here are some of the obstacles we have encountered (and in some cases overcome):

Error reporting


One of the key things that we have had to build into the existing code in order to get it ready for production is error handling.

This was not a priority for the proof of concept. A proof of concept is really designed to demonstrate that something is possible, not to be used in earnest.

If errors happen and things stop working (which they sometimes do) you can just kill it and rebuild.

In a production environment we want to be alerted when something goes wrong so we can work out how to fix it. Alerts and error reporting are crucial to a system like this.

We are sorting this out by enabling Archivematica's own error handling and error catching within Automation Tools.
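To give a flavour of what I mean, here is a minimal sketch (not our production code) of the kind of alerting we are building around the workflow: if a step fails, an email goes to the team rather than the failure passing silently. The addresses and SMTP host are placeholders.

```python
# A minimal sketch of failure alerting around an automated workflow step.
# Addresses and SMTP host are placeholders, not our real configuration.
import smtplib
from email.message import EmailMessage

def send_alert(subject, body,
               smtp_host="localhost",
               to_addr="digital-archive@example.ac.uk"):
    """Email a simple alert to the team."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "archivematica-alerts@example.ac.uk"
    msg["To"] = to_addr
    msg.set_content(body)
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)

def run_step(description, func, *args, **kwargs):
    """Run one step of the workflow, alerting (and re-raising) on failure."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        send_alert(f"Workflow error: {description}",
                   f"The step '{description}' failed with: {exc}")
        raise
```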


What happens when something goes wrong?


...and of course once things have gone wrong in Archivematica and you've fixed the underlying technical issue, you then need to deal with any remaining problems with your information packages in Archivematica.

For example, if the problems have resulted in failed transfers in Archivematica then you need to work out what you are going to do with those failed transfers. Although it is (very) tempting to just clear out Archivematica and start again, colleagues have advised me that it is far more useful to actually try and solve the problems and establish how we might handle a multitude of problematic scenarios if we were in a production environment!

We now have scenarios in which an automated transfer has failed, so to get things moving again we need to carry out a manual transfer of the dataset into Archivematica. Will the other parts of our workflow still work if we intervene in this way?

One issue we have encountered along the way is that though our automated transfer uses a specific 'datasets' processing configuration that we have set up within Archivematica, when we push things through manually it uses the 'default' processing configuration which is not what we want.

We are now looking at how we can encourage Archivematica to use the specified processing configuration. As described in the Archivematica documentation, you can do this by including an XML file describing your processing configuration within your transfer.
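As a sketch of that approach (paths are illustrative, and the behaviour should be checked against the Archivematica documentation for your version), a processing configuration saved from the Administration tab can be copied into the top level of a manual transfer as processingMCP.xml before the transfer is started:

```python
# A minimal sketch, assuming a saved processing configuration is available on
# disk. Including it at the top level of a transfer as 'processingMCP.xml'
# tells Archivematica to use it instead of the default configuration.
import shutil
from pathlib import Path

DATASETS_CONFIG = Path("/etc/archivematica/datasets-processingMCP.xml")  # placeholder path

def add_processing_config(transfer_dir):
    """Copy the 'datasets' processing config into a manual transfer directory."""
    shutil.copyfile(DATASETS_CONFIG, Path(transfer_dir) / "processingMCP.xml")

# e.g. add_processing_config("/var/archivematica/transfer-source/dataset-1234")
```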

It is useful to learn lessons like this outside of a production environment!


File size/upload


Although our project recognised that there would be a limit to the size of dataset that we could accept and process with our application, we didn't really bottom out what size of dataset we intended to support.

It has now been agreed that we should reasonably expect the data deposit form to accept datasets of up to 20 GB in size. Anything larger than this would need to be handled in a different way.

Testing the proof of concept in earnest showed that it was not able to handle datasets of over 1 GB in size. Its primary purpose was to demonstrate the necessary integrations and workflow, not to handle larger files.

Additional (and ongoing) work was required to enable the web deposit form to work with larger datasets.


Space


In testing the application we of course ended up trying to push some quite substantial datasets through it.

This was fine until everything abruptly seemed to stop working!

The problem was actually a fairly simple one but because of our own inexperience with Archivematica it took a while to troubleshoot and get things moving in the right direction again.

It turned out that we hadn’t allocated enough space in one of the bits of filestore that Archivematica uses for failed transfers (/var/archivematica/sharedDirectory/failed). This had filled up and was stopping Archivematica from doing anything else.

Once we knew the cause of the problem the available space was increased, but then everything ground to a halt again because we had quickly used that up too. Increasing the space had got things moving, but while we were trying to demonstrate that it wasn't working we had deposited several further datasets, which were waiting in the transfer directory and quickly blocked things up again.

On a related issue, one of the test datasets I had been using to see how well Research Data York could handle larger datasets was around 5 GB in size and consisted of about 2,000 JPEG images. Of course one of the default normalisation tasks in Archivematica is to convert all of these JPEGs to TIFF.

Once this collection of JPEGs was converted to TIFF the size of the dataset increased to around 80 GB. Until I witnessed this it hadn't really occurred to me that this could cause problems.

The solution - allocate Archivematica much more space than you think it will need!

We also now have the filestore set up so that it will inform us when the space in these directories gets to 75% full. Hopefully this will allow us to stop the filestore filling up in the future.
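The check itself is simple. Something along these lines, run on a schedule, is enough to flag the problem before Archivematica grinds to a halt (a minimal sketch with illustrative paths, not our actual monitoring configuration):

```python
# A minimal sketch: warn when a watched directory's filesystem is over 75% full.
# Paths and the alerting mechanism (here just a print) are illustrative.
import shutil

WATCHED = [
    "/var/archivematica/sharedDirectory",
    "/var/archivematica/sharedDirectory/failed",
]
THRESHOLD = 0.75

for path in WATCHED:
    usage = shutil.disk_usage(path)
    fraction_used = usage.used / usage.total
    if fraction_used > THRESHOLD:
        print(f"WARNING: {path} is {fraction_used:.0%} full")
```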


Workflow


The proof of concept did not undergo rigorous testing - it was designed for demonstration purposes only.

During the project we thought long and hard about the deposit, request and preservation workflows that we wanted to support, but we were always aware that once we had it in an environment that we could all play with and test, additional requirements would emerge.

As it happens, we have discovered that the workflow implemented is very true to that described in the appendix of our phase 2 report and does meet our needs. However, there are lots of bits of fine tuning required to enhance the functionality and make the interface more user friendly.

The challenge here is to try to carry out the minimum of work required to turn it into an adequate solution to take into production. There are so many enhancements we could make – I have a wish list as long as my arm – but until we better understand whether a local solution or a shared solution (provided by the Jisc Research Data Shared Service) will be adopted in the future it is not worth trying to make this application perfect.

Making it fit for production is the priority. Bells and whistles can be added later as necessary!





My thanks to all those who have worked on creating, developing, troubleshooting and testing this application and workflow. It couldn't have happened without you!



Jenny Mitcham, Digital Archivist

Friday, 7 April 2017

Archivematica Camp York: Some thoughts from the lake

Well, that was a busy week!

Yesterday was the last day of Archivematica Camp York - an event organised by Artefactual Systems and hosted here at the University of York. The camp's intention was to provide a space for anyone interested in or currently using Archivematica to come together, learn about the platform from other users, and share their experiences. I think it succeeded in this, bringing together 30+ 'campers' from across the UK, Europe and as far afield as Brazil for three days of sessions covering different aspects of Archivematica.

Our pod on the lake (definitely a lake - not a pond!)
My main goal at camp was to ensure everyone found their way to the rooms (including the lakeside pod) and that we were suitably fuelled with coffee, popcorn and cake. Alongside these vital tasks I also managed to partake in the sessions, have a play with the new version of Archivematica (1.6) and learn a lot in the process.

I can't possibly capture everything in this brief blog post so if you want to know more, have a look back at all the #AMCampYork tweets.

What I've focused on below are some of the recurring themes that came up over the three days.

Workflows

Archivematica is just one part of a bigger picture for institutions that are carrying out digital preservation, so it is always very helpful to see how others are implementing it and what systems they will be integrating with. A session on workflows in which participants were invited to talk about their own implementations was really interesting. 

Other sessions  also helped highlight the variety of different configurations and workflows that are possible using Archivematica. I hadn't quite realised there were so many different ways you could carry out a transfer! 

In a session on specialised workflows, Sara Allain talked us through the different options. One workflow I hadn't been aware of before was the ability to include checksums as part of your transfer. This sounds like something I need to take advantage of when I get Archivematica into production for the Borthwick. 
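For anyone curious, a transfer can include pre-calculated checksums in its metadata folder (for example a checksum.md5 file) which Archivematica will then verify. The sketch below is purely illustrative of how such a manifest might be generated; the exact path convention expected inside the checksum file should be confirmed against the Archivematica documentation.

```python
# Purely illustrative: generating an md5 manifest for the files in a transfer's
# objects directory and writing it to metadata/checksum.md5. Check the
# Archivematica docs for the exact path convention before relying on this.
import hashlib
from pathlib import Path

transfer = Path("my_transfer")        # placeholder transfer directory
objects = transfer / "objects"
metadata = transfer / "metadata"
metadata.mkdir(parents=True, exist_ok=True)

with open(metadata / "checksum.md5", "w") as manifest:
    for path in sorted(objects.rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            manifest.write(f"{digest}  {path.relative_to(transfer)}\n")
```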

Justin talking about Automation Tools
A session on Automation Tools with Justin Simpson highlighted other possibilities - using Archivematica in a more automated fashion. 

We already have some experience of using Automation Tools at York as part of the work we carried out during phase 3 of Filling the Digital Preservation Gap, however I was struck by how many different ways these can be applied. Hearing examples from other institutions and for a variety of different use cases was really helpful.


Appraisal

The camp included a chance to play with Archivematica version 1.6 (which was only released a couple of weeks ago) as well as an introduction to the new Appraisal and Arrangement tab.

A session in progress at Archivematica Camp York
I'd been following this project with interest so it was great to be able to finally test out the new features (including the rather pleasing pie charts showing what file formats you have in your transfer). It was clear that there were a few improvements that could be made to the tab to make it more intuitive to use and to deal with things such as the ability to edit or delete tags, but it is certainly an interesting feature and one that I would like to explore more using some real data from our digital archive.

Throughout camp there was a fair bit of discussion around digital appraisal and at what point in your workflow this would be carried out. This was of particular interest to me being a topic I had recently raised with colleagues back at base.

The Bentley Historical Library, who funded the work to create the new tab within Archivematica, are clearly keen to get their digital archives into Archivematica as soon as possible and then carry out the work there after transfer. The addition of this new tab now makes this workflow possible.

Kirsty Lee from the University of Edinburgh described her own pre-ingest methodology and the tools she uses to help her appraise material before transfer to Archivematica. She talked about some tools (such as TreeSize Pro) that I'm really keen to follow up on.

At the moment I'm undecided about exactly where and how this appraisal work will be carried out at York, and in particular how this will work for hybrid collections, so as always it is interesting to hear from others about what works for them.


Metadata and reporting

Evelyn admitting she loves PREMIS and METS
Evelyn McLellan from Artefactual led a 'Metadata Deep Dive' on day 2 and despite the title, this was actually a pretty interesting session!

We got into the details of METS and PREMIS and how they are implemented within Archivematica. Although I generally try not to look too closely at METS and PREMIS, it was good to have them demystified. On the first day, through a series of exercises, we had been encouraged to look at a METS file created by Archivematica and try to pick out some information from it ourselves, so these sessions in combination were really useful.

Across various sessions of the camp there was also a running discussion around reporting. Given that Archivematica stores such a detailed range of metadata in the METS file, how do we actually make use of this? Being able to report on how many AIPs have been created, how many files they contain and how big they are is useful. These are statistics that I currently collect (manually) on a quarterly basis and share with colleagues. Once Archivematica is in place at York, digging further into those rich METS files to find out which file formats are in the digital archive would be really helpful for preservation planning (among other things). There was discussion about whether reporting should be a feature of Archivematica or a job that should be done outside Archivematica.
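As a rough illustration of the 'outside Archivematica' option, here is a sketch (not a tested reporting tool) of pulling format names out of an Archivematica METS file. The PREMIS namespace shown is the PREMIS 2 one used by the Archivematica versions we have been working with; newer METS files may declare PREMIS 3, so check the namespace in your own METS.

```python
# A rough sketch: count the premis:formatName values recorded in a METS file.
# The PREMIS namespace is an assumption based on current Archivematica output.
import sys
from collections import Counter
import xml.etree.ElementTree as ET

NS = {"premis": "info:lc/xmlns/premis-v2"}

def format_counts(mets_path):
    """Count premis:formatName values recorded in a METS file."""
    tree = ET.parse(mets_path)
    names = [el.text for el in tree.findall(".//premis:formatName", NS) if el.text]
    return Counter(names)

if __name__ == "__main__":
    for fmt, count in format_counts(sys.argv[1]).most_common():
        print(f"{count:6d}  {fmt}")
```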

In relation to the latter option - I described in one session how some of our phase 2 work for Filling the Digital Preservation Gap was designed to help expose metadata from Archivematica to a third party reporting system. The Jisc Research Data Shared Service was also mentioned in this context as reporting outside of Archivematica will need to be addressed as part of that project.

Community

As with most open source software, community is important. This was touched on throughout the camp and was the focus of the last session on the last day.

There was a discussion about the role of Artefactual Systems and the role of Archivematica users. Obviously we are all encouraged to engage and help sustain the project in whatever way we are able. This could be by sharing successes and failures (I was pleased that my blog got a mention here!), submitting code and bug reports, sponsoring new features (perhaps something listed on the development roadmap) or helping others by responding to queries on the mailing list. It doesn't matter - just get involved!

I was also able to highlight the UK Archivematica group and talk about what we do and what we get out of it. As well as encouraging new members to the group, there was also discussion about the potential for forming other regional groups like this in other countries.

Some of the Archivematica community - class of Archivematica Camp York 2017

...and finally

Another real success for us at York was having the opportunity to get technical staff at York working with Artefactual to resolve some problems we had with getting our first Archivematica implementation into production. Real progress was made and I'm hoping we can finally start using Archivematica for real at the end of next month.

So, that was Archivematica Camp!

A big thanks to all who came to York and to Artefactual for organising the programme. As promised, the sun shone and there were ducks on the lake - what more could you ask for?



Thanks to Paul Shields for the photos

Jenny Mitcham, Digital Archivist

Wednesday, 7 December 2016

Digital Preservation Awards 2016 - celebrating collaboration and innovation

Last week members of the Filling the Digital Preservation Gap project team were lucky enough to experience the excitement and drama of the biennial Digital Preservation Awards!

The Awards ceremony was held at the Wellcome Collection in London on the evening of the 30th November. As always it was a glittering affair, complete with dramatic bagpipe music (I believe it coincided with St Andrew's Day!) and numerous references to Strictly Come Dancing from the judges and hosts!

This year our project had been shortlisted for the Software Sustainability Institute award for Research and Innovation. It was fantastic to be a finalist considering the number of nominations from across the world in this category and we certainly felt we had some strong competition from the other shortlisted projects.

One of the key strengths of our own project has been the collaboration between the Universities of York and Hull. Collaboration with Artefactual Systems, The National Archives and the wider digital preservation community has also been hugely beneficial.

Interestingly, collaboration was a key feature of all the finalists in this category, perhaps demonstrating just how important this is in order to make effective progress in this area.

The 4C project "Collaboration to Clarify the Costs of Curation" was a European project which looked at costs and benefits relating to digital preservation activities within its partner organisations and beyond. Project outputs in use across the sector include the Curation Costs Exchange.

The winner in our category however was the Dutch National Coalition for Digital Preservation (NCDD) with Constructing a Network of Nationwide Facilities Together. Again there was a strong focus on collaboration - this time cross-domain collaboration within the Netherlands. Under the motto "Joining forces for our digital memory", the project has been constructing a framework for a national shared infrastructure for digital preservation. This collaboration aimed to ensure that each institution does not have to reinvent the wheel as they establish their own digital preservation facilities. Clearly an ambitious project, and perhaps one we can learn from in the UK Higher Education sector as we work with Jisc on their Shared Service for Research Data.

Some of the project team from York and Hull at the awards reception

The awards ceremony itself came at the end of day one of the PERICLES conference where there was an excellent keynote speech from Kara Van Malssen from AV Preserve (her slides are available on SlideShare - I'd love to know how she creates such beautiful slides!).

In the context of the awards ceremony I was pondering one of the messages of Kara's talk that discussed our culture of encouraging and rewarding constant innovation and the challenges that this brings - especially for those of us who are 'maintainers'.

Maintainers maintain systems, services and the status quo - some of us maintain digital objects for the longer term and ensure we can continue to provide access to them. She argued that there are few rewards for maintainers and the incentives generally go to those who are innovating. If those around us are always chasing the next shiny new thing, how can the digital preservation community keep pace?

I would argue however that in the world of digital preservation itself, rewards for innovation are not always forthcoming. It can be risky for an institution to be an innovator in this area rather than doing what we have always done (which may actually bring risks of a different kind!) and this can stifle progress or lead to inaction.

This is why for me, the Digital Preservation Awards are so important. Being recognised as a finalist for the Research and Innovation award sends a message that what we have achieved is worthwhile and demonstrates that doing something different is A Good Thing.

For that I am very grateful. :-)



Jenny Mitcham, Digital Archivist

Monday, 21 November 2016

Every little bit helps: File format identification at Lancaster University

This is a guest post from Rachel MacGregor, Digital Archivist at Lancaster University. Her work on identifying research data follows on from the work of Filling the Digital Preservation Gap and provides an interesting comparison with the statistics reported in a previous blog post and our final project report.

Here at Lancaster University I have been very inspired by the work at York on file format identification and we thought it was high time I did my own analysis of the one hundred or so datasets held here.  The aim is to aid understanding of the nature of research data as well as to inform our approaches to preservation.  Our results are comparable to York's in that the data is characterised as research data (as yet we don't have any born digital archives or digitised image files).  I used DROID (version 6.2.1) as the tool for file identification - there are others and it would be interesting to compare results at some stage with results from using other software such as FILE (FITS), Apache Tika etc.

The exercise was carried out using the following signature files: DROID_SignatureFile_V88 and container-signature-file-20160927.  The maximum number of bytes DROID was set to scan at the start and end of each file was 65536 (which is the default setting when you install DROID).

Summary of the statistics:

There were a total of 24,705 files (so a substantially larger sample than in the comparable study at York)

Of these: 
  • 11008 (44.5%) were identified by DROID and 13697 (55.5%) not.
  • Of the identified files, 99.3% were given a single identification and 76 files had multiple identifications.  
    • 59 files had two possible identifications
    • 13 had 3 identifications
    • 4 had 4 possible identifications.  
  • 50 of these files were asc files identified (by extension) as either 8-bit or 7-bit ASCII text files.  The remaining 26 were identified by container as various types of Microsoft files. 

Files that were identified

Of the 11008 identified files:
  • 89.34% were identified by signature: this is the overwhelming majority, far more than in Jen's survey
  • 9.2% were identified by extension, a much smaller proportion than at York
  • 1.46% identified by container

However there was one large dataset containing over 7,000 gzip files, all identified by signature which did skew the results rather.  With those files removed, the percentages identified by different methods were as follows:

  • 68% (2505) by signature
  • 27.5% (1013) by extension
  • 4.5% (161) by container
This was still different from York's results but not so dramatically.
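For anyone who wants to reproduce this kind of breakdown, here is a rough sketch of how it can be derived from a DROID CSV export. The column names ('TYPE', 'PUID', 'METHOD') are assumptions based on recent DROID exports, so check your own export if they differ.

```python
# A sketch of producing an identification-method breakdown from a DROID CSV
# export. Column names are assumptions - check your own export.
import csv
from collections import Counter

def method_breakdown(droid_csv):
    """Return (Counter of identification methods, count of unidentified files)."""
    methods = Counter()
    unidentified = 0
    with open(droid_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("TYPE") == "Folder":
                continue  # only count files, not directories
            if row.get("PUID"):
                methods[row.get("METHOD", "")] += 1
            else:
                unidentified += 1
    return methods, unidentified

# methods, unidentified = method_breakdown("research_data_profile.csv")
```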

Only 38 files were identified as having a file extension mismatch (0.3%), but closer inspection may reveal more. Most of these were Microsoft files with multiple identifications (see above), but there was also a set of lsm files identified as TIFFs. This is not a format I'm familiar with; it seems that lsm is a form of TIFF file, but how do I know whether this is a "correct" identification or not?

59 different file formats were identified, the most frequently occurring being the GZIP format (as mentioned above) with 7331 instances. The next most popular was, unsurprisingly, XML (similar to the results at York) with 1456 files spread across the datasets. The top 11 were:

Top formats identified by DROID for Lancaster University's research data


Files that weren't identified

There were 13697 files not identified by DROID, of which 4947 (36%) had file extensions. This means a substantial proportion of files (64%) had no file extension at all, much higher than the result at York, which was 26%. As at York, there were 107 different extensions among the unidentified files, of which the top ten were:

Top counts of unidentified file extensions


Top extensions of unidentified files


These top ten are quite different from York's results, though in both institutions dat files topped the list by some margin! We also found 20 inp and 32 out files, which also occur in York's analysis. 

Like Jen at York I will be looking for a format to analyse further to create a signature - this will be a big step for me but will help my understanding of the work I am trying to do as well as contribute towards our overall understanding of file format types.

Every little bit helps.



Jenny Mitcham, Digital Archivist

Monday, 14 November 2016

Automating transfers with Automation Tools

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie has been working on York's implementation for the 'Filling the Digital Preservation Gap' project. This post describes how we have used Artefactual Systems' Automation Tools at York.

For phase three of our 'Filling the Digital Preservation Gap' project we have delivered a proof-of-concept implementation to illustrate how PURE and Archivematica can be used as part of a research data management lifecycle.

One of the requirements for this work was the ability to fully automate a transfer in Archivematica. Automation Tools is a set of python scripts from Artefactual Systems that are designed to help.

The way Automation Tools works is that a script (transfer.py) runs regularly at a set interval (as a cron task). The script is fed a set of parameters and, based on these, checks for new transfers in the given transfer source directory. On finding something, a transfer in Archivematica is initiated and approved.

One of the neat features of Automation Tools is that if you need custom behaviour, there are hooks in the transfer.py script that can run other scripts within specified directories. The 'pre-transfer' scripts are run before the transfer starts and 'user input' scripts can be used to act when manual steps in the processing are reached. A processing configuration can be supplied and this can fully automate all steps, or leave some manual as desired.

The best way to use Automation Tools is to fork the GitHub repository and then add local scripts into the pre-transfer and/or user-input directories.

So, how have we used Automation Tools at York?

When a user deposits data through our Research Data York (RDYork) application, the data is written into a folder within the transfer source directory named with the id of our local Fedora resource for the data package. The directory sits on filestore that is shared between the Archivematica and RDYork servers. On seeing a new transfer, three scripts run:

1_datasets_config.py - this script copies the dedicated datasets processing config into the directory where the new data resides.

2_arrange_transfer.py - this script simply makes sure the correct file permissions are in place so that Archivematica can access the data.

3_create_metadata_csv.py - this script looks for a file called 'metadata.json' which contains metadata from PURE and if it finds it, processes the contents and writes out a metadata.csv file in a format that Archivematica will understand. These scripts are all fairly rudimentary, but could be extended for other use cases, for example to process metadata files from different sources or to select a processing config for different types of deposit.
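To give a flavour of what these scripts look like, here is a simplified sketch of the sort of thing 3_create_metadata_csv.py does. The JSON keys and Dublin Core columns shown are illustrative rather than the exact mapping used at York.

```python
# A simplified sketch: read metadata.json (exported from PURE) and write a
# metadata.csv that Archivematica can pick up during transfer. Keys and
# columns are illustrative, not the exact mapping used in production.
import csv
import json
import os
import sys

def create_metadata_csv(transfer_dir):
    json_path = os.path.join(transfer_dir, "metadata.json")
    if not os.path.exists(json_path):
        return  # nothing to do for this transfer
    with open(json_path) as f:
        pure_metadata = json.load(f)
    metadata_dir = os.path.join(transfer_dir, "metadata")
    os.makedirs(metadata_dir, exist_ok=True)
    with open(os.path.join(metadata_dir, "metadata.csv"), "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "dc.title", "dc.creator", "dc.description"])
        writer.writerow([
            "objects",  # apply the metadata at package level (illustrative)
            pure_metadata.get("title", ""),
            pure_metadata.get("creator", ""),
            pure_metadata.get("description", ""),
        ])

if __name__ == "__main__":
    create_metadata_csv(sys.argv[1])
```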

Our processing configuration for datasets is fully automated so by using automation tools we never have to look at the Archivematica interface.

With transfer.py as inspiration I have added a second script called status.py. This one speaks directly to APIs in our researchdatayork application and updates our repository objects with information from Archivematica, such as the UUID for the AIP and the location of the package itself. In this way our two 'automation' scripts keep researchdatayork and Archivematica in sync. Archivematica is alerted when new transfers appear and automates the ingest, and researchdatayork is updated with the status once Archivematica has finished processing.
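For illustration, the kind of call a script like status.py can make is sketched below: asking the Archivematica Storage Service for stored AIPs and their locations. The URL, API key and response fields are placeholders and assumptions, so check them against the Storage Service API documentation for your version.

```python
# An illustrative sketch of querying the Storage Service for AIP information.
# URL, API key and response fields are placeholders/assumptions.
import requests

SS_URL = "http://storage-service.example.ac.uk:8000"
HEADERS = {"Authorization": "ApiKey myuser:myapikey"}

def list_aips():
    """Return the AIP packages known to the Storage Service."""
    response = requests.get(f"{SS_URL}/api/v2/file/",
                            headers=HEADERS,
                            params={"package_type": "AIP"})
    response.raise_for_status()
    return response.json().get("objects", [])

for aip in list_aips():
    # here status.py would update the matching repository object instead
    print(aip.get("uuid"), aip.get("current_full_path"))
```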

The good news is, the documentation for Automation Tools is very clear and that makes it pretty easy to get started. Read more at https://github.com/artefactual/automation-tools



Jenny Mitcham, Digital Archivist

Wednesday, 19 October 2016

Filling the Digital Preservation Gap - final report available

Today we have published our third and final Filling the Digital Preservation Gap report.

The report can be accessed from Figshare: https://dx.doi.org/10.6084/m9.figshare.4040787

This report details work the team at the Universities of York and Hull have been carrying out over the last six months (from March to September 2016) during phase 3 of the project.

The first section of the report focuses on our implementation work. It describes how each institution has established a proof of concept implementation of Archivematica integrated with other systems used for research data management. As well as describing how these implementations work it also discusses future priorities and lessons learned.

The second section of the report looks in more detail at the file format problem for research data. It discusses DROID profiling work that has been carried out over the course of the project (both for research data and other data types) and signature development to increase the number of research data signatures in the PRONOM registry. In recognition of the fact that this is an issue that can only be solved as a community, it also includes recommendations for a variety of different stakeholder groups.

The final section of the report details the outreach work that we have carried out over the course of this final phase of the project. It has been a real pleasure to have been given an opportunity to speak about our work at so many different events and to such a variety of different groups over the last few months!

The last of this run of events in our calendars is the final Jisc Research Data Spring showcase in Birmingham tomorrow (20th October). I hope to see you there!




Jenny Mitcham, Digital Archivist

Tuesday, 11 October 2016

Some highlights from iPRES 2016

A lovely view of the mountains from Bern
Last week I was at iPRES 2016 - the 13th International Conference on Digital Preservation and one of the highlights of the digital preservation year.

This year the conference was held in the Swiss town of Bern. A great place to be based for the week  - fantastic public transport, some lovely little restaurants and cellar bars, miles of shopping arcades, bizarre statues and non-stop sunshine!

There was so much content over the course of the 4 days that it is impossible to cover it all in one blog post. Instead I offer up a selection of highlights and takeaway thoughts.

Jeremy York from the University of Michigan gave an excellent paper about ‘The Stewardship Gap’. An interesting project with the aim of understanding the gap between valuable digital data and long term curation.  Jeremy reported on the results of a series of interviews with researchers at his institution where they were asked about the value of the data they created and their plans for longer term curation. A theme throughout the paper was around data value and how we assess this. Most researchers interviewed felt that their data did have long term value (and were able to articulate the reasons why). Most of the respondents expressed an intention to preserve the data for the longer term but did not have any concrete plans as to how they would achieve this. It was not yet clear to the project whether an intention to preserve actually leads to deposit with a repository or not. Work on this project is ongoing and I’ll look forward to finding out more when it is available.

Bern at night
As always there was an array of excellent posters. There were two in particular that caught my eye this year.

Firstly a poster from the University of Illinois at Urbana-Champaign entitled Should We Keep Everything Forever?: Determining Long-Term Value of Research Data.

The poster discussed an issue that we have also been grappling with recently as part of Filling the Digital Preservation Gap, that of the value of research data. It proposed an approach to assessing the value of content within the Illinois Data Bank using automated methods and measurable criteria. Recognising that a human eye is also important in assessing value, it would highlight those datasets that appear to have a low value which can then be assessed in a more manual fashion. This pragmatic two-stage approach will ensure that data thought to be of low value can be discarded after 5 years but that time intensive manual checking of datasets is kept to a minimum. This is a useful model that I would like to hear more about once they get it fully established. There was a lot of buzz around this poster and I wasn’t surprised to see it shortlisted for the best poster award.

Another excellent poster (and worthy winner of the best poster award) was To Act or Not to Act - Handling File Format Identification Issues in Practice. This poster from ETH Zurich described how the institution handles file identification and validation errors within their digital archive and showed some worked examples of the types of problems they encountered. This kind of practical sharing of the nuts and bolts of digital preservation is really good to see, and very much in line with the recommendations we are making as part of Filling the Digital Preservation Gap. As well as finding internal solutions to these problems I hope that ETH Zurich are also passing feedback to the tool providers to ensure that the tools work more effectively and efficiently for other users. It is this feedback loop that is so important in helping the discipline as a whole progress.

OAIS panel session in full swing
A panel session on Monday afternoon entitled ‘OAIS for us all’ was also a highlight. I was of course already aware that the OAIS standard is currently under review and that DPC members and other digital preservation practitioners are invited and encouraged to contribute to the discussion. Despite best intentions and an obvious interest in the standard I had not yet managed to engage with the review. This workshop was therefore a valuable opportunity to get up to speed with the process (as far as the panel understood it!) and the community feedback so far.

It was really useful to hear about the discussions about OAIS that have been held internationally and of course interesting to note the common themes recurring throughout – for example around the desire for a pre-ingest step within the model, the need to firm up the reference model to accommodate changes to AIPs that may occur through re-ingest, and around the need for openness with regard to audit and certification standards.

This session was a great example of an international collaboration to help shape the standards that we rely so much on. I do hope that the feedback from our community is given full consideration in the revised OAIS Reference Model.

Me and Steve presenting our paper
(image from @shirapeltzman)
On Tuesday morning I gave a paper with Steve Mackey from Arkivum in the Research Data Preservation session (and I was really pleased that there was a whole session devoted to this topic). I presented on our work to link repositories to Archivematica, through the Filling the Digital Preservation Gap project, and focused in particular on the long tail of research data file formats and the need to address this as a community. It was great to be able to talk to such a packed room and this led to some really useful discussions over the lunch break and at the conference dinner that evening.

One of the most interesting sessions of the conference for me was one that was devoted to ingest tools and methods. At a conference such as this, I'm always drawn to the sessions that focus on practical tools and first-hand experiences of doing things rather than the more theoretical strands, so this one was an obvious choice for me. First we had Bruno Ferreira from KEEP SOLUTIONS talking about the Database Preservation Toolkit (more about this toolkit later). Then came "Exploring Friedrich Kittler's Digital Legacy on Different Levels: Tools to Equip the Future Archivist" by Jurgen Enge and Heinz Werner Kramski from the University of Art and Design in Basel.

It was fascinating to see how they have handled the preservation of a large, diverse and complex digital legacy and overcome some of the challenges and hurdles that this has thrown at them. The speakers also made the point that the hardware itself is important evidence in its physical form, showing for instance how regularly Friedrich Kittler clearly used the reset button on his PC!

Conference delegates relaxing on the terrace
Two further presentations focused on the preservation of e-mail - something that I have little experience of but I am sure I will need to work on in the future. Claus Jensen from the Royal Library in Denmark presented a solution for acquisition of email. This seemed a very pragmatic approach and the team had clearly thought through their requirements well and learned from their initial prototype before moving to a second iteration. I'm keen to follow up on this and read the paper in more detail.

Brent West from the University of Illinois followed on with another interesting presentation on Processing Capstone Email using Predictive Coding. This talk focused on the problems of making appraisal decisions and sensitivity assessments for email and how a technology assisted review could help, enabling the software to learn from human decisions that are made and allowing human effort to be reduced and targeted. Again, I felt that this sort of work could be really useful to me in the future if I am faced with the task of e-mail preservation at scale.

A very expensive gin and tonic!
The BitCurator Mixer on the Tuesday night provided a good opportunity to talk to other users of BitCurator. I confess to not actually being a user just yet but having now got my new ingest PC on my desk, it is only a matter of time before I get this installed and start playing and testing. Good to talk to some experienced users and pick up some tips regarding how to install it and where to find example workflows. What sticks most in my mind though is the price of the gin and tonic at the bar we were in!

On Wednesday afternoon I took part in a workshop called OSS4PRES 2.0: Building Bridges and Filling Gaps – essentially a follow on from a workshop called Using Open-Source Tools to Fulfill Digital Preservation Requirements that I blogged about from iPRES last year. This was one of those workshops where we were actually expected to work (always a bit of a shock after lunch!) and the participants split into groups to address 3 different areas. One group was looking at gaps in the open source digital preservation tool set that we should be looking to fill (either by enhancing existing tools or with the development of new tools). Another group was working on drawing up a set of guidelines for providers of open source tools. The group I was in was thinking about the creation of a community space for sharing digital preservation workflows. This is something that I think could turn into a really valuable resource for practitioners who want to see how others have implemented tools. All the groups came out with lots of ideas and an action plan by the end of the afternoon and work in these areas is scheduled to continue outside of the workshop. Great to see a workshop that is more than just a talking shop but that will lead to some more concrete results.

My laptop working hard at the Database Preservation workshop
On Thursday morning I attended another really useful hands-on workshop called Relational Database Preservation Standards and Tools. Participants were encouraged to try out the SIARD Suite and Database Preservation Toolkit on their own laptops. The value and outcomes of this workshop were clear and it really gave a sense of how we might use these tools to create preservation versions of data from relational databases. Designed to work with a number of widely used relational database systems, the tools allow data to be extracted into the SIARD 2 format. This format is essentially a zip file containing the relevant information in XML. It goes one better than the csv format (the means by which I have preserved databases in the past) as it contains both information about the structure of the data and the content itself, and it allows you to add metadata about how the data was extracted. It looks to be particularly useful for taking snapshots of live and active databases for preservation on a regular cycle. I could definitely see myself using these tools in the future.
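As a tiny illustration of that point, because a SIARD 2 file is essentially a zip package of XML you can peek inside one with standard tools (the filename below is a placeholder):

```python
# A tiny illustration: list the contents of a SIARD 2 package, which is
# essentially a zip file of XML. The filename is a placeholder.
import zipfile

with zipfile.ZipFile("my_database.siard") as siard:
    for name in siard.namelist():
        print(name)  # metadata XML plus the table content XML files
```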

iPRES 2016 and Swiss Toy 2016 meet outside the venue
There was some useful discussion at the end of the session about how these tools would actually fit into a wider preservation workflow and whether they could be incorporated into digital preservation systems (for example Archivematica) and configured as an automatic migration path for Microsoft Access databases. The answer was yes, but the tool creators suggested that full automation may not be the best approach: a human eye is typically required to establish which bits of the database should be preserved and retained, and to tailor the creation of the SIARD 2 file accordingly.

On the last afternoon of the conference it was good to be able to pop into the Swiss Toy Fair which was being held at the same venue as the conference. A great opportunity to buy some presents for the family before heading back to the UK.



Jenny Mitcham, Digital Archivist

Wednesday, 21 September 2016

File format identification at Norfolk Record Office

This is a guest post from Pawel Jaskulski who has recently completed a Transforming Archives traineeship at Norfolk Record Office (NRO). As part of his work at Norfolk and in response to a question I posed in a previous blog post ("Is identification of 37% of files a particularly bad result?") he profiled their digital holdings using DROID and has written up his findings. Coming from a local authority context, his results provide an interesting comparison with other profiles that have emerged from both the Hull History Centre and the Bentley Historical Library and again help to demonstrate that the figure of 37% identified files for my test research dataset is unusual.

King's Lynn's borough archives are cared for jointly by the Borough Council and the Norfolk Record Office


Profiling Digital Records with DROID

With any local authority archive there is an assumption that the accessions deposited might be literally anything. What this means in 'digital terms' is that it is impossible to predict what sort of data might be coming in in the future. That is the reason why NRO have been actively involved in developing their digital preservation strategy, aiming to achieve the capability to choose digital records over their paper-based equivalents (hard copies/printouts).

The archive service has been receiving digital records accessions since the late 1990s. The majority of digitally born archives came in as hybrid accessions from local schools that were being closed down. For many records there were no paper equivalents. Among other deposits containing digital records are architectural surveys, archives of private individuals and local organisations (for example Parish Council meeting minutes).

The archive service has been using DROID as part of their digital records archival processing procedure as it connects to PRONOM, the most comprehensive and continuously updated file format registry. Archivematica, an ingest system that uses the PRONOM registry, is currently being introduced at NRO; it incorporates other file format identification tools such as FIDO and Siegfried (which both use PRONOM identifiers).

The results of the DROID survey were as follows:

With the latest signature file (v.86), identification was successful for 96.46% of the 49,117 files.

DROID identified 107 different file formats. The ten most frequently occurring were:

Classification | File Format Name | Versions | PUIDs
Image (Raster) | JPEG File Interchange Format | 1.01, 1.02 | fmt/43, fmt/44
Image (Raster) | Exchangeable Image File Format (Compressed) | 2.1, 2.2 | x-fmt/390, x-fmt/391
Image (Raster) | Windows Bitmap | 3 | fmt/116
Text (Mark-up) | Hypertext Markup Language | 4 | fmt/96, fmt/99
Word Processor | Microsoft Word Document | 97-2003 | fmt/40
Image (Raster) | Tagged Image File Format | - | fmt/353
Email | Microsoft Outlook Email Message | 97-2003 | x-fmt/430
Miscellaneous | AppleDouble Resource Fork | - | fmt/503
Image (Raster) | Graphics Interchange Format | 89a | fmt/4
Image (Raster) | Exchangeable Image File Format (Compressed) | 2.2.1 | fmt/645

Identification method breakdown:

  • 83.31% was identified by signature
  • 14.95% by container
  • 1.73% by Extension 


458 files had their extensions mismatched - that amounts to less than one per cent (0.97%). These were a variety of common raster image formats (JPEG, PNG, TIFF), word processor formats (Microsoft Word Document, ClarisWorks Word Processor) and desktop publishing formats (Adobe Illustrator, Adobe InDesign Document, Quark Xpress Data File).

Among the 3.54% of files that were not identified there were 160 different file extensions. The top five were:

  • .cmp
  • .mov
  • .info
  • .eml
  • .mdb


Two files returned more than one identification:

A spreadsheet file with .xls extension (last modified date 2006-12-17) had 3 possible file format matches:

  • fmt/175 Microsoft Excel for Macintosh 2001
  • fmt/176 Microsoft Excel for Macintosh 2002
  • fmt/177 Microsoft Excel for Macintosh 2004


And an image file with extension .bmp (last modified date 2007-02-06) received two file format matches:

  • fmt/116 Windows Bitmap 3
  • fmt/625 Apple Disk Copy Image 4.2

After closer inspection the actual file was a bitmap image file and PUID fmt/116 was the correct one.


Understanding the Results


DROID offers a very useful classification of file formats and puts all results into categories, which enables an overview of the digital collection. It is easy to understand what sort of digital content predominates within a digitally born accession, archive or collection. It uses a classification system that assigns file formats to broader groups like Audio, Word Processor, Page Description, Aggregate etc. These help enormously in getting a grasp on the variety of digital records. For example, it was interesting to discover that over half of our digitally born archives are in various raster image file formats.

Files profiled at Norfolk Record Office as classified by DROID


I am of course also interested in the levels of risk associated with particular formats so have started to work on an additional classification for the data, creating further categories that can help with preservation planning. This would help demonstrate where preservation efforts should be focused in the future.






Jenny Mitcham, Digital Archivist

Friday, 16 September 2016

UK Archivematica group at Lancaster

Earlier this week UK Archivematica users descended on the University of Lancaster for our 5th user group meeting. As always it was a packed agenda, with lots of members keen to talk to the group and share their project plans or their experiences of working with Archivematica. Here are some edited highlights of the day. Also well worth a read is a blog about the day from our host which is better than mine because it contains pictures and links!

Rachel MacGregor and Adrian Albin-Clark from the University of Lancaster kicked off the meeting with an update on recent work to set up Archivematica for the preservation of research data. Adrian has been working on two Ruby gems to handle two specific parts of the workflow. The puree gem gets metadata out of the PURE CRIS system in a format that is easy to work with (we are big fans of this gem at York, having used it in our phase 3 implementation work for Filling the Digital Preservation Gap). Another gem solves a different problem: getting the deposited research data and associated data packaged up in a format that is suitable for Archivematica to ingest. Again, this is something we may be able to utilise in our own workflows.

Jasmin Boehmer, a student from Aberystwyth University presented some of the findings from the work she has been doing for her dissertation. She has been testing how metadata can be added to a Submission Information Package (SIP) for inclusion within an Archival Information Package (AIP) and has been looking at a range of different scenarios. It was interesting to hear her findings, particularly useful for those of us who haven’t managed to carry out systematic testing ourselves. She concluded that if you want to store descriptive metadata at a per file level within Archivematica you should submit this via a csv file as part of your SIP. If you only use the Archivematica interface itself for adding metadata, you can only do this on a per SIP basis rather than at file level. It was interesting to see that if you include rights metadata within your file level csv file this will not be stored within the PREMIS section of the XML as you might expect so this does not solve a problem we raised during our phase 1 project work for Filling the Digital Preservation Gap regarding ingesting a SIP with different rights recorded per file.
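For anyone who hasn't seen one, here is a minimal illustration of the per-file metadata.csv approach Jasmin described. The column names follow the Dublin Core style used in the Archivematica documentation; the file paths are made up.

```python
# A minimal illustration: a metadata.csv placed in the transfer's metadata
# folder, with one row per file. Paths and values are made up.
import csv
import os

rows = [
    ["filename", "dc.title", "dc.description"],
    ["objects/interviews/interview01.wav", "Interview 1", "Oral history recording"],
    ["objects/interviews/interview02.wav", "Interview 2", "Oral history recording"],
]

os.makedirs("metadata", exist_ok=True)
with open("metadata/metadata.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```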

Jake Henry from the National Library of Wales discussed some work newly underway to build on the work of the ARCW digital preservation group. The project will enable members of ARCW to use Archivematica without having to install their own local version, using pydio as a means of storing files before transfer. As part of this project they are now looking at a variety of systems that they would like Archivematica to integrate with. They are hoping to work on an integration with CALM. There was some interest in this development around the room and I expect there would be many other institutions who would be keen to see this work carried out.

Kirsty Lee from the University of Edinburgh reported on her recent trip to the States to attend the inaugural ArchivematiCamp with her colleague Robin Taylor. It sounded like a great event with some really interesting sessions and discussions, particularly regarding workflows (recognising that there are many different ways you can use Archivematica) as well as some nice social events. We are looking forward to seeing an ArchivematiCamp in the UK next year!

Myself and Julie presented on some of the implementation work we have been doing over the last few months as we complete phase 3 of Filling the Digital Preservation Gap. Julie talked about what we were trying to achieve with our proof of concept implementation and then showed a screencast of the application itself. The challenges we faced and things that worked well during phase 3 were discussed before I summarised our plans for the future.

I went on to introduce the file formats problem (which I have previously touched on in other blog posts) before taking the opportunity to pick people’s brains on a number of discussion points. I wanted to understand workflows around non-identified files (not just for research data). I was interested to know three things really:

  1. At what point would you pick up on unidentified file formats in a deposit - prior to using Archivematica or during the transfer stage within Archivematica?
  2. What action would you take to resolve this situation (if any)?
  3. Would you continue to ingest material into the archive whilst waiting for a solution, or keep files in the backlog until the correct identification could be made?
Answers from the floor suggested that one institution would always stop and carry out further investigations before ingesting the material and creating an Archival Information Package (AIP) but that most others would continue processing the data. With limited staff resource for curating research data in particular, it is possible that institutions will favour a fully automated workflow such as the one we have established in our proof of concept implementation, and regular interventions around file format identification may not be practical. Perhaps we need to consider how we can intervene in a sustainable and manageable way, rather than looking at each deposit of data separately. One of the new features in Archivematica is the AIP re-ingest which will allow you to pull AIPs back from storage so that tools (such as file identification) can be re-run - this was thought to be a good solution.

John Kaye from Jisc updated us on the Research Data Shared Service project. Archivematica is one of the products selected by Jisc to fulfill the preservation element of the Shared Service and John reported on the developments and enhancements to Archivematica that are proposed as part of this project. It is likely that these developments will be incorporated into the main code base and thus be available to all Archivematica users in the future. The growth in interest in Archivematica within the research data community in the UK is only likely to continue as a result of this project.

Heather Roberts from the Royal Northern College of Music described where her institution is with digital preservation and asked for advice on how to get started with Archivematica. Attendees were keen to share their thoughts (many of which were not specific to Archivematica itself but would be things to consider whatever solution was being implemented) and Heather went away with some ideas and some further contacts to follow up with.

To round off the meeting we had an update and Q&A session with Sarah Romkey from Artefactual Systems (who is always cheerful no matter what time you get her out of bed for a transatlantic Skype call).

Some of the attendees even managed to find the recommended ice cream shop before heading back home!

We look forward to meeting at the University of Edinburgh next time.


Jenny Mitcham, Digital Archivist
