Tuesday, 9 June 2015

The second meeting of the UK Archivematica group

Unaffected by the threatened rail strike and predicted thundery showers, the 2nd UK Archivematica meeting went ahead at the Tate Britain in London last week.

This second meeting saw an increase in attendees with 22 individuals representing 15 different organisations in the UK and beyond (a representative from the Zuse Institute in Berlin joined us). It is great to see a growing awareness and engagement with Archivematica in the UK. The meeting provided another valuable chance to catch up, compare notes and talk about some of the substantial progress that has been made since our last meeting in January.

Updates were given by Marco Klindt from the Zuse Institute on their plans for an infrastructure for digital preservation as a service and Chris Grygeil from the University of Leeds on their latest thinking on a workflow for digital archiving. Marco plans to use Archivematica as an ingest tool before pushing the data to Fedora for access. The Zuse Institute have been working with Artefactual Systems on sponsoring some useful AIP re-ingest functionality which will allow Archivematica to re-process AIPs at a later date. Chris updated us on ongoing work at Leeds to define their Archivematica workflows. Here BitCurator is being used before ingest into Archivematica and there is ongoing discussion about how exactly the two tools fit together in the workflow. BitCurator can highlight sensitive information and perform redactions, but do you want to do this to original files before ingesting with Archivematica?

Matthew Addis from Arkivum gave a really interesting presentation on some work he has been doing on testing how Archivematica handles scientific datasets, specifically genomics data. He described this data as large in size, unidentified by Archivematica and with no normalisation pathways. This struck a chord with me, as I have spent much of the past few weeks looking at the types of data that researchers produce and finding a long tail of specialist or scientific data formats that are of a similar nature. His testing of the capabilities of Archivematica has produced some useful results, with success at processing a 500GB file in 5 hours.

Donuts at J.Co by Masked Card on Flickr, CC BY-NC-ND 2.0

Next I gave an update on our Jisc Research Data Spring project “Filling the digital preservation gap”. Apart from going on for too long and keeping people from their coffee and doughnuts, I gave an introduction to our project, focusing on the nature of research data and our findings about file formats in use by researchers at the University of York. See previous blogs (first, second) for more information on where we are with this work.

I talked about how file identification was key to digital preservation, as demonstrated by the NDSA Levels of Digital Preservation, where having an inventory of the file formats in your archive comes in quite early at level 2. If you don’t know what you’ve got it is very difficult to make decisions about how you can manage and preserve that content for the long term. This is the case whether we are talking about a migration or an emulation based strategy for digital preservation.
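As a rough illustration of what a first-pass format inventory might look like, here is a minimal Python sketch that counts files by extension. The function name `format_inventory` is hypothetical, and counting extensions is only a crude stand-in for the signature-based identification that tools drawing on Pronom perform, so treat this as a starting point under those assumptions rather than a real identification workflow.

```python
from collections import Counter
from pathlib import Path


def format_inventory(root):
    """Count the files under `root` by lower-cased file extension.

    Extensions are only a hint at the real format; this is an
    illustrative sketch, not signature-based identification.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under one label.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts
```

Calling `format_inventory` on a directory and then `most_common()` on the result lists the extensions in order of frequency, which is often enough to show where the long tail of unfamiliar formats begins.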

I went on to discuss briefly the 3 basic workflows for using Archivematica and asked for feedback on these:
  1. Archivematica is the start of the process. Archivematica produces both the Archival Information Package (AIP) and the Dissemination Information Package (DIP) and the DIP is sent to the repository
  2. The repository is the start of the process and the Submission Information Package (SIP) goes from there into Archivematica. There are potential variations in the workflow here depending on whether you want Archivematica or the repository to produce the DIP
  3. Archivematica is utilised as a tool that is called separately from the repository as part of the workflow

Are there any others that I've missed?

I also talked through some of the ideas we have had for enhancements to Archivematica. We are hoping that a subsequent phase of this project will enable us to sponsor some development work which will make Archivematica better or more suitable for inclusion within a wider infrastructure for managing research data. I highlighted the development ideas on our current short list and asked attendees to select whether the ideas were 'very useful', 'quite useful' or 'not at all useful' for their own proposed implementations.

It is really helpful for us to get feedback from other Archivematica users so that we can ensure that what we are proposing will be more widely useful (and that we haven't missed an alternative solution or workaround). Over the next week the project team will be reviewing the development ideas and the feedback received at the UK Archivematica meeting and speaking to Artefactual Systems (who support Archivematica) about our ideas.

The day finished with an introduction to Binder ("an open source digital repository management application designed to meet the needs and complex digital preservation requirements of cultural heritage institutions"). We watched this video as an introduction to the system and then had a conference call with Ben Fino-Radin from the Museum of Modern Art who was able to answer our questions. Binder looks to be an impressive tool for helping to manage digital assets. Building on the basics of Archivematica (which essentially packages things up for preservation), Binder provides an attractive front end enabling curators to more effectively manage and better understand their digital collections.

The next meeting of the UK Archivematica group is planned to be held in Leeds in October/November 2015. It was agreed that we would schedule a longer session in order to allow for more informal discussion and networking alongside the scheduled presentations and progress reports. I'm confident that the group will have lots more Archivematica activity to report on by the time we next meet.

Friday, 15 May 2015

Jisc Archivematica project update - making progress with the 'how?' and the 'what?'

I mentioned in a previous post that we have funding through the Jisc Research Data Spring initiative for phase 1 of a project to look at the potential of using Archivematica to help manage research data for the longer term. Here is the second of our project updates showing what progress we have made over the last few weeks.

I cannot quite believe we are already halfway through the first 3 month phase of this project. Where has all the time gone?

The most exciting moment of the last few weeks was during a Skype call between the project teams in York and Hull when within minutes of each other, both institutions managed to get their first test archives transferred into their institutional implementations of Archivematica! A few technical hiccups after the initial setting up of Archivematica had stopped this momentous occasion happening earlier. This does to my mind highlight one of the resourcing implications of Archivematica, that a certain amount of technical skill is required to understand and troubleshoot the error messages and configure the system to ensure that all is working smoothly. With my limited technical abilities, this is not something I would have been able to do myself!

We are now continuing to test the capabilities of Archivematica and alongside this I have read the Archivematica documentation from virtual cover to virtual cover, followed the mailing list posts with interest and chatted to other users. I am hugely grateful to other institutions that have been happy to share information about their infrastructure and workflows.

The project team have a brainstorming meeting planned for next week to discuss the project 'how?'...

  • How? How would we incorporate Archivematica into a wider technical infrastructure for research data management and what workflows would we put in place? Where would it sit and what other systems would it need to talk to?

If all goes well, expect diagrams next time!

Following up from my previous blog post which talked about the project 'what?'...

  • What? What are the characteristics of research data and how might it differ from other born digital data that memory institutions are establishing digital archives to manage and preserve? What types of files are our researchers producing and how would Archivematica handle these?
...I've been having some interesting conversations with real researchers.

Though most of my time is spent hidden away in my office within the archives, it is a real bonus being involved with the wider Research Data Management project at the University of York and helping deliver data management training courses to researchers. Getting out and having the opportunity to talk to researchers about their data is invaluable in helping to keep an eye on the longer term goals of this project.

In a recent training session I've encouraged researchers to complete a simple questionnaire, which tells us a bit more about software packages and file formats.

Helping to answer the project 'what?'

Some researchers, on completing this basic level of information, have also agreed to be contacted by me and have subsequently provided some samples of their data. Nothing sensitive or confidential, but files that they have agreed that I can share with The National Archives to create file signatures within Pronom. I hope this will lead to more types of research data being identifiable within digital preservation systems (Archivematica included).

I'm not reaching a huge number of researchers through this and subsequent training sessions over the next few weeks, so with help from colleagues, we've also sent an e-mail out requesting sample files from the top 20 software packages used by researchers at York. Sample files are coming in at a slow trickle rather than a deluge but hopefully we will soon have a suitable test set to share with The National Archives.

 The most popular applications and software used by researchers at the University of York (from Software and Training Questionnaire report by  Emma Barnes and Andrew Smith, 2014)

Tuesday, 28 April 2015

IT's personal: some thoughts from the journey home

Today I went to London to attend a Digital Preservation Coalition event on Personal Digital Archives. I like going to London and I like going to these sorts of workshops. I also like the time for reflection sat in the quiet carriage of a Virgin train on the way home. There is something about being cut off from the internet and away from the everyday distractions of the office which helps focus the mind. 

Today was interesting because I expect like many of the attendees I was there with two hats on – being able to benefit from the day both as a digital archivist and as an individual with my own personal digital archive to maintain.

What follows is not so much a summing up of the day, but just a quick mention of some of the thoughts I’m taking away. There were some interesting presentations that I haven’t mentioned (apologies). 

Gabriella Redwine from the Beinecke Library at the University of Yale gave a great introduction to the topic of personal digital archives, defining them as the things created by or about an individual: a rather formal term for the digital stuff we all create over the course of our lives. We all have them. They are fragile, regularly neglected and at risk of loss. People tend to manage them when faced with a crisis (eg: computer virus), problem (eg: running out of storage space) or life changing event (eg: moving house or job). We as digital archivists need to be able to advise individuals on how to manage their own digital archives in the hope that the material will survive long enough to be deposited within an archive in the future if appropriate.

Amber Cushing from University College Dublin gave a really interesting talk on how people assign value to their digital files. Both she and Gabriella made the point (that I had only been partially aware of) that people tend to place less value on digital than physical things, that the born-digital is seen as less important than something you can more easily see or hold. I realise that I appear to be guilty of this myself. Every year I take hundreds of digital photographs which I store on my computer. These are of high value to me. They provide a record of my life and my family and I want to keep them so that I and subsequent generations can look back on them. Despite the high value I place on them I don’t back them up as often as I should and have even been known to lose some (see previous confession).

At the end of each year I create a photo book for that year. A printed, glossy, hard back album of selected photos from that year, with a title page and captions (documentation and metadata!). I love to receive the finished photo book through the post and place even more value on this physical object than I did on the original photos. This is clear from the fact that I hover around the kids as they look at it, checking that they don’t have grubby hands and worrying that they might inadvertently rip a page whilst turning it.

Damage to one of my photo books - imagine my distress!
(don't worry I have ordered a replacement)

Do I have the same level of worry when they access my digital originals? No, I happily let them click through them on the computer, never checking whether they have accidentally edited or deleted one or moved an image out of its context from one folder to another. These are eventualities which are probably just as likely as damage to the physical book (but harder to spot and thus rectify)*.

Is this slightly skewed notion of value a result of the extra time and effort I have put into arranging the photographs into a physical book, the expense of having had to pay for it to be printed, or simply down to the fact that it is shiny and I can hold it?

Anyway, this is a slight tangent. It was really interesting to hear about Amber’s research on possession and self extension in relation to personal digital archives and how we as individuals may or may not assign value to the digital stuff that we create.

I was also really pleased to hear James Baxter from the British Library talk about a practical way they had set up workflows for dealing with personal digital archives that have been put in their care. Shutting himself and colleagues in a room for 3 days with some media, and some tools in order to brainstorm workflows and make progress with trying to access, identify and preserve some of this born digital material seemed like a great approach and there were some useful lessons learned from the process. I liked the ‘learning by doing’ approach that he advocated. I tend to agree that the best way to find out if something is going to work is to roll up your sleeves and have a go.

Another repeated message of the day was about language and how we can communicate and bring people along with us. Mike Ashenfelder from the Library of Congress mentioned that though libraries may run personal digital archiving courses for the public, it is hard to compete with other courses and learning opportunities with more appealing names. Amber mentioned that when interviewing people for her research, she avoided use of the term 'archiving' instead asking them about how they ‘maintained’ their digital files.

Having over the last week taught two sessions at the University of York on ‘Research Data Management’ I can relate to this problem. Getting people to come along and engage with a topic that has quite a dry title can certainly be a challenge. Perhaps as Mike suggested “looking after your digital stuff” would make it clearer what we were talking about and its immediate relevance to all of us!

My train journey is nearly over so I’ll leave it there, having over the course of this journey created yet another thing to add to my own digital legacy. 

I’m looking forward to reading the new DPC technology watch report on the subject of personal digital archiving in the near future.

* yes, I know I could do this with checksums but I do not create checksums for my personal digital archive - life is busy!

Friday, 17 April 2015

Jisc Archivematica project update ...because digital preservation won’t just go away

Last month I was excited to discover that Jisc had agreed to fund a joint project between the Universities of York and Hull as part of their Research Data Spring initiative. The aim of this project (as mentioned in my previous blog) is to investigate the potential of Archivematica for Research Data Management. There is a brief summary including my pitch here. We have had a number of other higher education institutions in the UK express an interest in this project and it is fabulous to see that there are others who recognise that the tools that digital archivists use could have much to offer those who are charged with managing research data. Of course we hope this project will also be of interest to a more diverse and international audience and we would like to benefit from the experience and knowledge that already exists within the wider digital preservation community.

We are three weeks into this project now and here is the first of a series of updates on progress.

One of the initial tasks for teams at both York and Hull was to ensure we had a current version of Archivematica installed. Over the next few weeks there will be a fair amount of testing going on to give us a greater understanding of Archivematica's strengths and weaknesses particularly with regard to how it may handle research data.

A pod on the lake - a great venue for our kick off meeting
(though in reality the weather wasn't this nice)

We also got together for a kick off meeting in one of the pods on the lake on York's Heslington East campus. We defined our work packages and established responsibilities and deadlines and now have a clear idea of what we are focusing on in this initial 3 month phase that Jisc have agreed to fund.

Much of the research we will be carrying out and reporting on at the end of this phase of the project will be based around the following questions:

  • Why? Why are we bothering to 'preserve' research data? What are the drivers here and what are the risks if we don't?
  • What? What are the characteristics of research data and how might it differ from other born digital data that memory institutions are establishing digital archives to manage and preserve? What types of files are our researchers producing and how would Archivematica handle these? What does Archivematica offer us and what benefits does it bring?
  • How? How would we incorporate Archivematica into a wider technical infrastructure for research data management and what workflows would we put in place? Where would it sit and what other systems would it need to talk to?
  • Who? Who else is using Archivematica (or other digital preservation systems) to do similar things and what can we learn from them?

Working in the pod - nice to have ducks for neighbours

I've started off by giving some thought to the What?

A key part of this project is to look at what a digital preservation system such as Archivematica can offer us as part of an RDM infrastructure. In order to answer this we need to understand a bit about the nature of the research data that we will be asking it to handle. We have been collating existing sources of information about the types of software and data that researchers at the University of York are using, in order to get a clearer picture of what research data is likely to look like. Following on from this we can then start to look at how Archivematica would handle this data.

A couple of years ago at York we carried out some work looking at current data management practice by researchers. We interviewed the chair of the research committee for each academic department to get an overview of data management within the department and also put out an online questionnaire to capture a wider and more detailed set of information from individual researchers across the University. This has given us an overview of the types of data that researchers collect or create. This data doesn't go down to the level of specific file types and versions but does talk about broad categories of data (for example which departments are working with databases, audio, video, images etc).

A subsequent survey carried out at York looked more specifically at software packages used by researchers and is a gold mine of information for our project, giving us a list of software packages that we can investigate in more detail. The 'top 20' software packages highlighted by this survey largely consist of software I have never used and whose outputs I have never tried to preserve - packages such as MATLAB, SPSS, NVivo, Gaussian and ChemDraw. We are investigating how existing digital preservation tools would handle these types of data and looking initially at whether their native file formats appear in Pronom. We are talking to the helpful team at the National Archives about creating new file signatures for formats that are not currently represented. Knowing what you've got is one of the key challenges in digital preservation and if the file identification tools out there can automatically recognise a wider range of formats then this is clearly going to be a step in the right direction for not just this project but the digital preservation community.
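To give a feel for what a file signature does, here is a deliberately simplified Python sketch that matches the opening bytes of a file against a tiny, hand-written table. The table `SIGNATURES` and the function `identify` are hypothetical names for this illustration; real Pronom signatures are considerably richer than a fixed prefix, so this only shows the general idea of identifying a file by its content rather than its name.

```python
# A tiny, illustrative signature table: leading bytes -> format label.
# These three byte sequences are well-known file headers; the table itself
# is an assumption for this sketch and is not taken from Pronom.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP container",
}


def identify(path):
    """Return a format label if the file starts with a known signature."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    # No match: the file would show up as 'unidentified' in an inventory.
    return None
```

A file from a specialist package that matches nothing in the table comes back as `None`, which is exactly the situation that new signatures are meant to address.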

Work will continue - watch this space for updates ...and in the meantime, we'd love to hear from you if you are using Archivematica (or another digital preservation system) for research data so we can find out about your workflows.

Monday, 2 March 2015

Pitching for Archivematica at Research Data Spring

It was with trepidation that I attended the Research Data Spring sandpit last week. A sandpit has childhood associations with lazy long summers, but reading more about the format of the event it sounded more like an episode of Dragon’s Den.

The ‘sandpit’ event was held at Birmingham’s Aston University campus
 and we were at least rewarded with summer beach weather on the second day

The first day of this event was set aside to explore and discuss ideas submitted through Jisc’s latest funding call (Research Data Spring via Ideascale) leading to potential new collaborations and partnerships.

On the second day, idea holders were given the chance to pitch their idea to a panel of judges. In a kind of four minute ‘elevator pitch’ we had to say enough about our project idea to explain what we wanted to achieve, why it was important and what resources were required, as well as persuading the judges that we should receive funding. This was followed by a four minute grilling by the judges on the finer details of the idea. No pressure!

After a late night doing sums, I have a final run through of my pitch

I was pitching a version of an idea originally submitted by the University of Hull that I had been quick to volunteer to collaborate on. The idea is based around investigating Archivematica for use within a wider technical infrastructure for research data management. We are hoping to plug the digital preservation gap and demonstrate the benefits that Archivematica can bring as part of an automated workflow for ingesting and processing research data.

My poster advertising the idea

Initially we would like to do a piece of investigative work looking at what Archivematica has to offer Higher Education Institutions in the UK who are currently setting up workflows for managing research data. We want to look at its strengths and weaknesses, and highlight areas for development in order to inform a further phase of the project in which we would sponsor some development work to enhance Archivematica. In the last phase of the project we hope to implement a proof of concept integrated into our existing Fedora repositories and push some real research data through the system. We would write up our findings as a case study and intend to continue to spread awareness of our work through presentations and blogs.

Many universities in the UK will not yet be ready to start thinking seriously about digital preservation, but the hope is that if this idea is funded, our work with Archivematica will help inform future decision making in this area for other institutions.

Wednesday, 18 February 2015

Digital preservation hits the headlines

It is not often that stories directly related to my profession make front page news, so it was with mixed feelings that I read the following headline on the front of the Guardian this weekend:

"Digital is decaying. The future may have to be print"

While I agree with one of those statements, I strongly disagree with the other. Having worked in digital preservation for 12 years now, I know that the idea of a 'digital dark age' caused by obsolescence of the rapidly evolving hardware and software landscape is not a new one. Digital will decay if it is not properly looked after, and that is ultimately why there is a profession and practice that has built up in this area.

I would however disagree with the idea that the future may have to be print. At the Borthwick Institute for Archives we are now encouraging depositors to give us their archives in their original form. If the files are born digital we would like to receive and archive the digital files. Printouts lack the richness, accessibility and flexibility of the digital originals, which can often tell us far more about a file and the process of creation than a hard copy and can be used in a variety of different ways.

This headline was of course prompted by a BBC interview with Vint Cerf (a vice president at Google) on Friday in which he makes some very valid points. Digital preservation isn't just about preserving the physical bits, it is also about what they mean. We need to capture and preserve information about the digital environment as well as the data itself in order to enable future re-use. Again this is not new, but it is good to see this story hitting the headlines. Raising general awareness of the issues we digital archivists think about on a daily basis can only be a good thing in my mind. What the coverage misses though is the huge amount of work that has been going on in this area already...

The Jisc Research at Risk workshop at IDCC15 in the fabulous surroundings of 30 Euston Square, London

Last week I spent several days at the International Digital Curation Conference (IDCC) in London. Surprisingly, this was my first IDCC (I'm not sure how I had managed to miss it until now), but this was a landmark birthday for the conference, celebrating a decade of bringing people together to talk about digital curation.

The theme of the conference was "Ten years back, ten years forward: achievements, lessons and the future for digital curation"; consequently, many of the papers focused on how far we had come in the last ten years and on next steps. In ten years, we have clearly achieved much but the digital preservation problem is not yet solved. Progress is not always as fast as we would like, but enacting a culture change in the way we manage our digital assets was never going to be a quick win, and this is sometimes not helped by the fact that as we solve one element of the problem, we adjust our expectations on what we consider to be a success. This is a constantly evolving field and we take on new challenges all the time.

It is great that public awareness about obsolescence and the fragility of digital data has been raised in the mainstream media, but it is also important for people to know that there is already a huge amount of work going on in this area and that many of us think about these issues all the time.

Monday, 2 February 2015

When checksums don't match...

No one really likes an error message but it is strangely satisfying when integrity checking of files within the digital archive throws up an issue. I know it shouldn't be, but having some confirmation that these basic administrative tasks that we digital archivists carry out are truly necessary and worthwhile is always encouraging. Furthermore, it is useful to have real life examples to hand when trying to make a business case for a level of archival storage that includes regular monitoring and comparison of checksums.

We don't have a full blown digital archiving system yet at the University of York, but as a minimum, born digital content that comes into the archives is copied on to University filestore and checksums are created. A checksum is a string of characters that acts as a unique fingerprint of a digital object: if the digital object remains unchanged, a checksumming tool will come up with the same string of characters each time the algorithm is run. This allows us digital archivists to ensure that files within our care remain authentic and free from accidental damage or corruption - and this is really one of our most basic roles as professionals.
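For anyone curious about what happens under the bonnet, here is a minimal Python sketch of creating a checksum using the standard library's `hashlib` module. The function name `file_checksum` and its parameters are my own choices for illustration; this is not how the Corz tool itself is implemented.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`.

    The file is read in chunks so that large files do not need to fit
    in memory. `algorithm` can be any name hashlib supports, such as
    "md5" or "sha1".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running `file_checksum` on the same unchanged file twice gives the same string; if even a single byte of the file changes, the string changes, and that difference is what an integrity check reports.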

The tool we are using at the moment to create and verify our checksums is the Checksum tool from Corz. It is a simple but comprehensive tool that is quick and easy to get started with, but that gives ample scope for configuration and levels of automation for users who want to get more from it.

This morning when running integrity checks over the digital archive filestore I came across a problem. Two files in one of my directories that hold original born digital content came up with an MD5 error. Very strange.

I've just located the original CDs in the strongroom and had a look at the 2 files in question to try and work out what has gone wrong.

Another great tool that I use to manage our digital assets is Foldermatch. Foldermatch allows you to compare 2 folders and tells you with a neat graphical interface whether the contents of them are identical or not. Foldermatch has different comparison options and you can either take the quick approach and compare contents by file size, date and time or you can go for the belt and braces approach and compare using checksums. As a digital archivist I normally go for the belt and braces approach and here is a clear example of why this is necessary.
Comparison of folders using size, date and time - all looks well!

When comparing what is on the original CD from the strongroom with what is on our digital archive filestore by size, date and time, Foldermatch does not highlight any problems. All looks to be above board. The green column of the bar chart above shows that Foldermatch considers the files in the filestore to be 'identical' to those on the CD.

Similarly when running a comparison of contents, the results look the same. No problems highlighted.

Comparison of folders using SHA-1 checksums - differences emerge

However, when performing the same task using the SHA-1 checksum algorithm, the problems become apparent. Two of the files (the purple column) are recorded as being 'different but same date/time'.

These changes appear not to have altered the file content, its size or date/time stamp. Indeed I am not clear on what specifically has been altered. Although checksum comparisons are helpful at flagging problems, they are not so helpful at giving specifics about what has changed.
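The difference between the quick comparison and the belt and braces comparison can be sketched in a few lines of Python. This is only an illustration of the two approaches under my own assumptions, not a description of how Foldermatch works internally; the function names `sha1_of` and `compare_folders` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_folders(original, copy):
    """Compare files found in both folders, keyed by relative path.

    Each value is a pair (quick_match, checksum_match):
    quick_match    - same size and same modification time (whole seconds)
    checksum_match - same SHA-1 digest
    """
    results = {}
    for src in Path(original).rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(original)
        dst = Path(copy) / rel
        if not dst.is_file():
            continue  # files missing from the copy are ignored in this sketch
        s, d = src.stat(), dst.stat()
        quick_match = (s.st_size == d.st_size
                       and int(s.st_mtime) == int(d.st_mtime))
        checksum_match = sha1_of(src) == sha1_of(dst)
        results[rel.as_posix()] = (quick_match, checksum_match)
    return results
```

A result of `(True, False)` for a file corresponds to the 'different but same date/time' case: the quick check is satisfied but the checksums disagree.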

As these files have sat gathering dust on the filestore, something has happened to subtly alter them, and these subtle changes are hard to spot but do have an impact on their authenticity. This is the sort of thing we need to watch out for and this is why we digital archivists do need to worry about the integrity of our files and take steps to ensure we can prove that we are preserving what we think we are preserving.