Digital Archiving at the University of York: July 2017

Monday, 31 July 2017

The mysterious case of the changed last modified dates

Today's blog post is effectively a mystery story.

Like any good story it has a beginning (the problem is discovered, the digital archive is temporarily thrown into chaos), a middle (attempts are made to solve the mystery and make things better, several different avenues are explored) and an end (the digital preservation community come to my aid).

This story has a happy ending (hooray) but also includes some food for thought (all the best stories do) and as always I'd be very pleased to hear what you think.

The beginning

I have probably mentioned before that I don't have a full digital archive in place just yet. While I work towards a bigger and better solution, I have a set of temporary procedures in place to ingest digital archives on to what is effectively a piece of locked down university filestore. The procedures and workflows are both 'better than nothing' and 'good enough' as a temporary measure and actually appear to take us pretty much up to Level 2 of the NDSA Levels of Preservation (and beyond in some places).

One of the ways I ensure that all is well in the little bit of filestore that I call 'The Digital Archive' is to run frequent integrity checks over the data, using a free checksum utility. Checksums (effectively unique digital fingerprints) for each file in the digital archive are created when content is ingested and these are checked periodically to ensure that nothing has changed. IT keep back-ups of the filestore for a period of three months, so as long as this integrity checking happens within this three month period (in reality I actually do this 3 or 4 times a month) then problems can be rectified and digital preservation nirvana can be seamlessly restored.
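For readers who would like to see the idea in code, the following is a minimal sketch of this kind of integrity check. It is not the free checksum utility mentioned above; it assumes SHA-256 as the algorithm, and the function names (`sha256_of`, `build_manifest`, `verify_manifest`) are hypothetical ones chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (run once, at ingest)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Re-hash each recorded file and list anything missing or changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {rel_path}")
    return problems
```

An empty list from `verify_manifest` means every recorded file is present with unchanged content. A check like this only reads file content, so it says nothing about filesystem metadata such as dates.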

Checksum checking is normally quite dull. Thankfully it is an automated process that runs in the background and I can just get on with my work and cheer when I get a notification that tells me all is well. Generally all is well; it is very rare that any errors are highlighted, and when that happens I blog about it!

I have perhaps naively believed for some time that I'm doing everything I need to do to keep those files safe and unchanged, because if the checksum is the same then all is well. However, this month I encountered a problem...

I've been doing some tidying of the digital archive structure and alongside this have been gathering a bit of data about the archives, specifically looking at things like file formats, number of unidentified files and last modified dates.

Whilst doing this I noticed that one of the archives that I had received in 2013 contained 26 files with a last modified date of 18th January 2017 at 09:53. How could this be so if I have been looking after these files carefully and the checksums are the same as they were when the files were deposited?

The 26 files were all EML files - email messages exported from Microsoft Outlook. These were the only EML files within the whole digital archive. The files weren't all in the same directory and other files sitting in those directories retained their original last modified dates.

The middle

So this was all a bit strange...and worrying too. Am I doing my job properly? Is this something I should be bringing to the supportive environment of the DPC's Fail Club?

The last modified dates of files are important to us as digital archivists. This is part of the metadata that comes with a file. It tells us something about the file. If we lose this date are we losing a little piece of the authentic digital object that we are trying to preserve?

Instead of beating myself up about it I wanted to do three things:

Solve the mystery (find out what happened and why)
See if I could fix it
Stop it happening again

So how could it have happened? Has someone tampered with these 26 files? Perhaps unlikely considering they all have the exact same date/time stamp which to me suggests a more automated process. Also, the digital archive isn't widely accessible. Quite deliberately it is only really me (and the filestore administrators) who have access.

I asked IT whether they could explain it. Had some process been carried out across all filestores that involved EML files specifically? They couldn't think of a reason why this may have occurred. They also confirmed my suspicions that we have no backups of the files with the original last modified dates.

I spoke to a digital forensics expert from the Computer Science department and he said he could analyse the files for me, see if he could work out what had acted on them, and also suggest a method for restoring the dates.

I have a record of the last modified dates of these 26 files when they arrived - the checksum tool that I use writes the last modified date to the hash file it creates. I wondered whether manually changing the last modified dates back to what they were originally was the right thing to do or whether I should just accept and record the change.

...but I decided to sit on it until I understood the problem better.
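If restoring the dates did turn out to be the right thing to do, it could be sketched along the following lines. This is an illustration under assumptions rather than a tested procedure: the format of the checksum tool's hash file is not described here, so the sketch assumes the recorded dates have already been parsed into a mapping of relative path to original last modified time (seconds since the epoch). The function name `restore_mtimes` and its `dry_run` flag are hypothetical.

```python
import os
from pathlib import Path


def restore_mtimes(root: Path, recorded: dict[str, float], dry_run: bool = True) -> list[str]:
    """Reset last modified times that differ from the recorded values.

    `recorded` maps relative paths to original mtimes in epoch seconds
    (an assumed structure, not the real hash file format). With
    dry_run=True nothing is changed; the affected paths are only reported.
    """
    affected = []
    for rel_path, original_mtime in recorded.items():
        target = root / rel_path
        if not target.is_file():
            continue  # nothing to restore for a file that is not there
        current = target.stat()
        # Allow one second of slack for coarse filesystem timestamps.
        if abs(current.st_mtime - original_mtime) > 1.0:
            affected.append(rel_path)
            if not dry_run:
                # Keep the access time as it is; reset only the modified time.
                os.utime(target, (current.st_atime, original_mtime))
    return affected
```

Running it first with the default `dry_run=True` gives a list of affected files to record before anything is touched. The last modified date is filesystem metadata rather than file content, so resetting it does not alter the checksums.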

The end

I threw the question out to the digital preservation community on Twitter and as usual I was not disappointed!

In fact, along with a whole load of discussion and debate, Andy Jackson was able to track down what appears to be the cause of the problem.

He very helpfully pointed me to a thread on StackExchange which described the issue I was seeing.

It was a great comfort to discover that the cause of this problem was apparently a bug and not something more sinister. It appears I am not alone!

...but what now?

So now I think I know what caused the problem, but questions remain around how to catch issues like this more quickly (not six months after they have happened) and what to do with the files themselves.

IT have mentioned to me that an OS upgrade may provide us with better auditing support on the filestore. Being able to view reports on changes made to digital objects within the digital archive would be potentially very useful (though perhaps even that wouldn't have picked up this Windows bug?). I'm also exploring whether I can make particular directories read only and whether that would stop issues such as this occurring in the future.

If anyone knows of any other tools that can help, please let me know.

The other decision to make is what to do with the files themselves. Should I try and fix them? There was more interesting debate on Twitter on this topic and even on the value of these dates in the first place. If we can fudge them then so can others - they may have already been fudged before they got to the digital archive - in which case, how much value do they really have?

So should we try and fix last modified dates or should we focus our attention on capturing and storing them within the metadata? The latter may be a more sustainable solution in the longer term, given their slightly slippery nature!
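As a rough illustration of the 'capture and store' option, the sketch below records each file's last modified date as an ISO 8601 string in UTC and writes the result to a JSON sidecar file. The JSON layout and the names `capture_mtimes` and `write_sidecar` are assumptions made for this example; a real workflow would more likely record the dates in whatever metadata schema the archive already uses.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def capture_mtimes(root: Path) -> dict[str, str]:
    """Map each file under root to its last modified date (ISO 8601, UTC)."""
    record = {}
    for p in sorted(root.rglob("*")):
        if p.is_file():
            mtime = datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc)
            record[p.relative_to(root).as_posix()] = mtime.isoformat()
    return record


def write_sidecar(record: dict[str, str], sidecar: Path) -> None:
    """Store the captured dates as JSON, ideally outside the folder described."""
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
```

Captured at the point of ingest, a record like this survives even if the dates on the filestore later change, and it can itself be checksummed along with the rest of the archive.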

I know there are lots of people interested in this topic - just see this recent blog post by Sarah Mason and in particular the comments - When was that?: Maintaining or changing ‘created’ and ‘last modified’ dates. It is great that we are talking about real nuts and bolts of digital preservation and that there are so many people willing to share their thoughts with the community.

...and perhaps if you have EML files in your digital archive you should check them too!
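One possible way of carrying out such a check is sketched below: it lists files with a given extension whose last modified date is later than a cutoff, for example the date the archive was received. The function name `modified_since` and its parameters are hypothetical, and the cutoff is expected to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from pathlib import Path


def modified_since(root: Path, cutoff: datetime, suffix: str = ".eml") -> list[tuple[str, datetime]]:
    """List files with the given suffix whose mtime is later than cutoff."""
    hits = []
    for p in sorted(root.rglob("*")):
        if p.is_file() and p.suffix.lower() == suffix:
            mtime = datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc)
            if mtime > cutoff:
                hits.append((p.relative_to(root).as_posix(), mtime))
    return hits
```

Calling `modified_since(Path("archive"), datetime(2014, 1, 1, tzinfo=timezone.utc))` on a folder received in 2013 (the folder name here is a placeholder) would list any EML files whose dates have moved since then.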

Jenny Mitcham, Digital Archivist

Friday, 7 July 2017

Preserving Google docs - decisions and a way forward

Back in April I blogged about some work I had been doing around finding a suitable export (and ultimately preservation) format for Google documents.

This post has generated a lot of interest and I've had some great comments both on the post itself and via Twitter.

I was also able to take advantage of a slot I had been given at last week's Jisc Research Data Network event to introduce the issue to the audience (who had really come to hear me talk about something else but I don't think they minded).

There were lots of questions and discussion at the end of this session, mostly focused on the Google Drive issue rather than the rest of the talk. I was really pleased to see that the topic had made people think. In a lightning talk later that day, William Kilbride, Executive Director of The Digital Preservation Coalition, mused on the subject of "What is data?". Google Drive was one of the examples he used, asking where does the data end and the software application start?

I just wanted to write a quick update on a couple of things - decisions that have been made as a result of this work and attempts to move the issue forward.

Decisions decisions

I took a summary of the Google docs data export work to my colleagues in a Research Data Management meeting last month in order to discuss a practical way forward for the institutional research data we are planning on capturing and preserving.

One element of the Proof of Concept that we had established at the end of phase 3 of Filling the Digital Preservation Gap was a deposit form to allow researchers to deposit data to the Research Data York service.

As well as the ability to enable researchers to browse and select a file or a folder on their computer or network, this deposit form also included a button to allow deposit to be carried out via Google Drive.

As I mentioned in a previous post, Google Drive is widely used at our institution. It is clear that many researchers are using Google Drive to collect, create and analyse their research data so it made sense to provide an easy way for them to deposit direct from Google Drive. I just needed to check out the export options and decide which one we should support as part of this automated export.

However, given the inconclusive findings of my research into export options it didn't seem that there was one clear option that adequately preserved the data.

As a group we decided the best way out of this imperfect situation was to ask researchers to export their own data from Google Drive in whatever format they consider best captures the significant properties of the item. Exporting the data manually prior to upload does give them the opportunity to review and check their files and make their own decision on issues such as whether comments are included in the version of their data that they upload to Research Data York.

So for the time being we are disabling the Google Drive upload button in our data deposit interface... which is a shame because a certain amount of effort went into getting that working in the first place.

This is the right decision for the time being though. Two things need to happen before we can make this available again:

Understanding the use case - We need to gain a greater understanding of how researchers use Google Drive and what they consider to be 'significant' about their native Google Drive files.
Improving the technology - We need to make some requests to Google to make the export options better.

Understanding the use case

We've known for a while that some researchers use Google Drive to store their research data. The graphic below was taken from a survey we carried out with researchers in 2013 to find out about current practice across the institution.

Of the 188 researchers who answered the question "Where is your digital research data stored (excluding back up copies)?" 22 mentioned Google Drive. This is only around 12% of respondents but I would speculate that over the last four years, use of Google Drive will have increased considerably as Google applications have become more embedded within the working practices of staff and students at the University.

[Graphic: survey responses to the question "Where is your digital research data stored (excluding back up copies)?"]

To understand the Google Drive use case today I really need to talk to researchers.

We've run a couple of Research Data Management teaching sessions over the last term. These sessions are typically attended by PhD students but occasionally a member of research staff also comes along. When we talk about data storage I've been asking the researchers to give a show of hands as to who is using Google Drive to store at least some of their research data.

About half of the researchers in the room raise their hand.

So this is a real issue.

Of course what I'd like to do is find out exactly how they are using it. Whether they are creating native Google Drive files or just using Google Drive as a storage location or filing system for data that they create in another application.

I did manage to get a bit more detail from one researcher who said that they used Google Drive as a way of collaborating on their research with colleagues working at another institution but that once a document has been completed they will export the data out of Google Drive for storage elsewhere.

This fits well with the solution described above.

I also arranged a meeting with a Researcher in our BioArCh department. Professor Matthew Collins is known to be an enthusiastic user of Google Drive.

Talking to Matthew gave me a really interesting perspective on Google Drive. For him it has become an essential research tool. He and his colleagues use many of the features of the Google Suite of tools for their day to day work and as a means to collaborate and share ideas and resources, both internally and with researchers in other institutions. He showed me PaperPile, an extension to Google Drive that I had not been aware of. He uses this to manage his references and share them with colleagues. This clearly adds huge value to the Google Drive suite for researchers.

He talked me through a few scenarios of how they use Google - some (such as the comments facility) I was very much aware of. Others I've not used myself, such as the use of the Google APIs to visualise, for example, activity on preparing a report in Google Drive - showing a timeline and when different individuals edited the document. Now that looks like fun!

He also talked about the importance of the 'previous versions' information that is stored within a native Google Drive file. When working collaboratively it can be useful to be able to track back and see who edited what and when.

He described a real scenario in which he had had to go back to a previous version of a Google Sheet to show exactly when a particular piece of data had been entered. I hadn't considered that the previous versions feature could be used to demonstrate that you made a particular discovery first. Potentially quite important in the competitive world of academic research.

For this reason Matthew considered the native Google Drive file itself to be "the ultimate archive" and "a virtual collaborative lab notebook". A flat, static export of the data would not be an adequate replacement.

He did however acknowledge that the data can only exist for as long as Google provides us with the facility and that there are situations where it is a good idea to take a static back up copy.

He mentioned that the precursor to Google Docs was a product called Writely (which he was also an early adopter of). Google bought Writely in 2006 after seeing the huge potential in this online word processing tool. Matthew commented that backwards compatibility became a problem when Google started making some fundamental changes to the way the application worked. This is perhaps the issue that is being described in this blog post: Google Docs and Backwards Compatibility.

So, I'm still convinced that even if we can't preserve a native Google Drive file perfectly in a static form, this shouldn't stop us having a go!

Improving the technology

Alongside trying to understand how researchers use Google Drive and what they consider to be significant and worthy of preservation, I have also been making some requests and suggestions to Google around their export options. There are a few ideas I've noted that would make it easier for us to archive the data.

I contacted the Google Drive forum and was told that as a Google customer I was able to log in and add my suggestions to Google Cloud Connect so this I did...and what I asked for was as follows:

Please can we have a PDF/A export option?
Please could we choose whether or not to export comments ...and if we are exporting comments can we choose whether historic/resolved comments are also exported?
Please can metadata be retained - specifically the created and last modified dates. (Author is a bit trickier - in Google Drive a document has an owner rather than an author. The owner probably is the author (or one of them) but not necessarily if ownership has been transferred).
I also mentioned a little bug relating to comment dates that I found when exporting a Google document containing comments out into docx format and then importing it back again.

Since I submitted these feature requests and comments in early May it has all gone very very quiet...

I have a feeling that ideas only get anywhere if they are popular ...and none of my ideas are popular ...because they do not lead to new and shiny functionality.

Only one of my suggestions (re comments) has received a vote by another member of the community.

So, what to do?

Luckily, since having spoken about my problem at the Jisc Research Data Network, two people have mentioned they have Google contacts who might be interested in hearing my ideas.

I'd like to follow up on this, but in the meantime it would be great if people could feed back to me.

Are my suggestions sensible?
Are there any other features that would help the digital preservation community preserve Google Drive? I can't imagine I've captured everything...

Jenny Mitcham, Digital Archivist

Thursday, 6 July 2017

The UK Archivematica group goes to Scotland

Yesterday the UK Archivematica group met in Scotland for the first time. The meeting was hosted by the University of Edinburgh and as always it was great to be able to chat informally to other Archivematica users in the UK and find out what everyone is up to.

The first thing to note was that since this group of Archivematica ‘explorers’ first met in 2015 real and tangible progress seems to have been made. This was encouraging to see. This is particularly the case at the University of Edinburgh. Kirsty Lee talked us through their Archivematica implementation (now in production) and the steps they are taking to ingest digital content.

One of the most interesting bits of her presentation was a discussion about appraisal of digital material and how to manage this at scale using the available tools. When using Archivematica (or other digital preservation systems) it is necessary to carry out appraisal at an early stage before an Archival Information Package (AIP) is created and stored. It is very difficult (perhaps impossible) to unpick specific files from an AIP at a later date.

Kirsty described how one of her test collections has been reduced from 5.9GB to 753MB using a combination of traditional and technical appraisal techniques.

Appraisal is something that is mentioned frequently in digital preservation discussions. There was a group talking about just this a couple of weeks ago at the recent DPC unconference ‘Connecting the Bits’.

As ever it was really valuable to hear how someone is moving forward with this in a practical way.

It will be interesting to find out how these techniques can be applied at the scale of some of the larger collections Kirsty intends to work with.

Kirsty recommended an article by Victoria Sloyan, Born-digital archives at the Wellcome Library: appraisal and sensitivity review of two hard drives which was helpful to her and her colleagues when formulating their approach to this thorny problem.

She also referenced the work that the Bentley Historical Library at University of Michigan have carried out with Archivematica and we watched a video showing how they have integrated Archivematica with DSpace. This approach has influenced Edinburgh’s internal discussions about workflow.

Kirsty concluded with something that rings very true for me (in fact I think I said it myself in the two presentations I gave last week!). Striving for perfection isn’t helpful; the main thing is just to get started and learn as you go along.

Rachel McGregor from the University of Lancaster gave an entertaining presentation about the UK Archivematica Camp that was held in York in April, covering topics as wide ranging as the weather, the food and finally feeling the love for PREMIS!

I gave a talk on work at York to move Archivematica and our Research Data York application towards production. I had given similar talks last week at the Jisc Research Data Network event and a DPC briefing day but I took a slightly different focus this time. I wanted to drill in a bit more detail into our workflow, the processing configuration within Archivematica and some problems I was grappling with.

It was really helpful to get some feedback and solutions from the group on an error message I’d encountered whilst preparing my slides the previous day and to have a broader discussion on the limitations of web forms for data upload. This is what is so good about presenting within a small group setting like this as it allows for informality and genuinely productive discussion. As a result of this I overran and made people wait for their lunch (very bad form I know!).

After lunch John Kaye updated the group on the Jisc Research Data Shared Service. This is becoming a regular feature of our meetings! There are many members of the UK Archivematica group who are not involved in the Jisc Shared Service so it is really useful to be able to keep them in the loop.

It is clear that there will be a substantial amount of development work within Archivematica as a result of its inclusion in the Shared Service and features will be made available to all users (not just those who engage directly with Jisc). One example of this is containerisation which will allow Archivematica to be more quickly and easily installed. This is going to make life easier for everyone!

Sean Rippington from the University of St Andrews gave an interesting perspective on some of the comparison work he has been doing of Preservica and Archivematica.

Both of these digital preservation systems are on offer through the Jisc Shared Service and as a pilot institution St Andrews has decided to test them side by side. Although he hasn’t yet got his hands on both, he was still able to offer some really useful insights on the solutions based on observations he has made so far.

First he listed a number of similarities - for example alignment with the OAIS Reference Model, the migration-based approach, the use of microservices and many of the tools and standards that they are built on.

He also listed a lot of differences - some are obvious, for example one system is commercial and the other open source. This leads to slightly different models for support and development. He mentioned some of the additional functionality that Preservica has, for example the ability to handle emails and web archives and the inclusion of an access front end.

He also touched on reporting. Preservica does this out of the box whereas with Archivematica you will need to use a third party reporting system. He talked a bit about the communities that have adopted each solution and concluded that Preservica seems to have a broader user base (in terms of the types of institution that use it). The engaged, active and honest user community for Archivematica was highlighted as a specific selling point, as was the work of the Filling the Digital Preservation Gap project (thanks!).

Sean intends to do some more detailed comparison work once he has access to both systems and we hope he will report back to a future meeting.

Next up we had a collaborative session called ‘Room 101’ (even though our meeting had been moved to room 109). Considering we were encouraged to grumble about our pet hates this session came out with some useful nuggets:

Check your migrated files. Don’t assume everything is always OK.
Don’t assume that just because Archivematica is installed all your digital preservation problems are solved.
Just because a feature exists within Archivematica it doesn’t mean you have to use it - it may not suit your workflow
There is no single ‘right’ way to set up Archivematica and integrate with other systems - we need to talk more about workflows and share experiences!

After coffee break we were joined (remotely) by several representatives from the OSSArcFlow project from Educopia and the University of North Carolina. This project is very new but it was great that they were able to share with us some information about what they intend to achieve over the course of the two year project.

They are looking specifically at preservation workflows using open source tools (specifically Archivematica, BitCurator and ArchivesSpace) and they are working with 12 partner institutions who will all be using at least two of these tools. The project will not only provide training and technical support, but will fully document the workflows put in place at each institution. This information will be shared with the wider community.

This is going to be really helpful for those of us who are adopting open source preservation tools, helping to answer some of those niggling questions such as how to fill the gaps and what happens when there are overlaps in the functionality of two tools.

We registered our interest in continuing to be kept in the loop about this project and we hope to hear more at a future meeting.

The day finished with a brief update from Sara Allain from Artifactual Systems. She talked about some of the new things that are coming in versions 1.6.1 and 1.7 of Archivematica.

Before leaving Edinburgh it was a pleasure to be able to join the University at an event celebrating their progress in digital preservation. Celebrations such as this are pretty few and far between - perhaps because digital preservation is a task that doesn’t have an obvious end point. It was really refreshing to see an institution publicly celebrating the considerable achievements made so far. Congratulations to the University of Edinburgh!

Jenny Mitcham, Digital Archivist
