Friday, 7 July 2017

Preserving Google docs - decisions and a way forward

Back in April I blogged about some work I had been doing around finding a suitable export (and ultimately preservation) format for Google documents.

This post has generated a lot of interest and I've had some great comments both on the post itself and via Twitter.

I was also able to take advantage of a slot I had been given at last week's Jisc Research Data Network event to introduce the issue to the audience (who had really come to hear me talk about something else but I don't think they minded).

There were lots of questions and discussion at the end of this session, mostly focused on the Google Drive issue rather than the rest of the talk. I was really pleased to see that the topic had made people think. In a lightening talk later that day, William Kilbride, Executive Director of The Digital Preservation Coalition mused on the subject of "What is data?". Google Drive was one of the examples he used, asking where does the data end and the software application start?

I just wanted to write a quick update on a couple of things - decisions that have been made as a result of this work and attempts to move the issue forward.

Decisions decisions

I took a summary of the Google docs data export work to my colleagues in a Research Data Management meeting last month in order to discuss a practical way forward for the institutional research data we are planning on capturing and preserving.

One element of the Proof of Concept that we had established at the end of phase 3 of Filling the Digital Preservation Gap was a deposit form to allow researchers to deposit data to the Research Data York service.

As well as the ability to enable researchers to browse and select a file or a folder on their computer or network, this deposit form also included a button to allow deposit to be carried out via Google Drive.

As I mentioned in a previous post, Google Drive is widely used at our institution. It is clear that many researchers are using Google Drive to collect, create and analyse their research data so it made sense to provide an easy way for them to deposit direct from Google Drive. I just needed to check out the export options and decide which one we should support as part of this automated export.

However, given the inconclusive findings of my research into export options it didn't seem that there was one clear option that adequately preserved the data.

As a group we decided the best way out of this imperfect situation was to ask researchers to export their own data from Google Drive in whatever format they consider best captures the significant properties of the item. By exporting themselves in a manual fashion prior to upload, this does give them the opportunity to review and check their files and make their own decision on issues such as whether comments are included in the version of their data that they upload to Research Data York.

So for the time being we are disabling the Google Drive upload button from our data deposit interface....which is a shame because a certain amount of effort went into getting that working in the first place.

This is the right decision for the time being though. Two things need to happen before we can make this available again:

  1. Understanding the use case - We need to gain a greater understanding of how researchers use Google Drive and what they consider to be 'significant' about their native Google Drive files.
  2. Improving the technology - We need to make some requests to Google to make the export options better.

Understanding the use case

We've known for a while that some researchers use Google Drive to store their research data. The graphic below was taken from a survey we carried out with researchers in 2013 to find out about current practice across the institution. 

Of the 188 researchers who answered the question "Where is your digital research data stored (excluding back up copies)?" 22 mentioned Google Drive. This is only around 12% of respondents but I would speculate that over the last four years, use of Google Drive will have increased considerably as Google applications have become more embedded within the working practices of staff and students at the University.

Where is your digital research data stored (excluding back up copies)?

To understand the Google Drive use case today I really need to talk to researchers.

We've run a couple of Research Data Management teaching sessions over the last term. These sessions are typically attended by PhD students but occasionally a member of research staff also comes along. When we talk about data storage I've been asking the researchers to give a show of hands as to who is using Google Drive to store at least some of their research data.

About half of the researchers in the room raise their hand.

So this is a real issue. 

Of course what I'd like to do is find out exactly how they are using it. Whether they are creating native Google Drive files or just using Google Drive as a storage location or filing system for data that they create in another application.

I did manage to get a bit more detail from one researcher who said that they used Google Drive as a way of collaborating on their research with colleagues working at another institution but that once a document has been completed they will export the data out of Google Drive for storage elsewhere. 

This fits well with the solution described above.

I also arranged a meeting with a Researcher in our BioArCh department. Professor Matthew Collins is known to be an enthusiastic user of Google Drive.

Talking to Matthew gave me a really interesting perspective on Google Drive. For him it has become an essential research tool. He and his colleagues use many of the features of the Google Suite of tools for their day to day work and as a means to collaborate and share ideas and resources, both internally and with researchers in other institutions. He showed me PaperPile, an extension to Google Drive that I had not been aware of. He uses this to manage his references and share them with colleagues. This clearly adds huge value to the Google Drive suite for researchers.

He talked me through a few scenarios of how they use Google - some, (such as the comments facility) I was very much aware of. Others, I've not used myself such as the use of the Google APIs to visualise for example activity on preparing a report in Google Drive - showing a time line and when different individuals edited the document. Now that looks like fun!

He also talked about the importance of the 'previous versions' information that is stored within a native Google Drive file. When working collaboratively it can be useful to be able to track back and see who edited what and when. 

He described a real scenario in which he had had to go back to a previous version of a Google Sheet to show exactly when a particular piece of data had been entered. I hadn't considered that the previous versions feature could be used to demonstrate that you made a particular discovery first. Potentially quite important in the competitive world of academic research.

For this reason Matthew considered the native Google Drive file itself to be "the ultimate archive" and "a virtual collaborative lab notebook". A flat, static export of the data would not be an adequate replacement.

He did however acknowledge that the data can only exist for as long as Google provides us with the facility and that there are situations where it is a good idea to take a static back up copy.

He mentioned that the precursor to Google Docs was a product called Writely (which he was also an early adopter of). Google bought Writely in 2006 after seeing the huge potential in this online word processing tool. Matthew commented that backwards compatibility became a problem when Google started making some fundamental changes to the way the application worked. This is perhaps the issue that is being described in this blog post: Google Docs and Backwards Compatibility.

So, I'm still convinced that even if we can't preserve a native Google Drive file perfectly in a static form, this shouldn't stop us having a go!

Improving the technology

Along side trying to understand how researchers use Google Drive and what they consider to be significant and worthy of preservation, I have also been making some requests and suggestions to Google around their export options. There are a few ideas I've noted that would make it easier for us to archive the data.

I contacted the Google Drive forum and was told that as a Google customer I was able to log in and add my suggestions to Google Cloud Connect so this I did...and what I asked for was as follows:

  • Please can we have a PDF/A export option?
  • Please could we choose whether or not to export comments or not ...and if we are exporting comments can we choose whether historic/resolved comments are also exported
  • Please can metadata be retained - specifically the created and last modified dates. (Author is a bit trickier - in Google Drive a document has an owner rather than an author. The owner probably is the author (or one of them) but not necessarily if ownership has been transferred).
  • I also mentioned a little bug relating to comment dates that I found when exporting a Google document containing comments out into docx format and then importing it back again.
Since I submitted these feature requests and comments in early May it has all gone very very quiet...

I have a feeling that ideas only get anywhere if they are popular ...and none of my ideas are popular ...because they do not lead to new and shiny functionality.

Only one of my suggestions (re comments) has received a vote by another member of the community.

So, what to do?

Luckily, since having spoken about my problem at the Jisc Research Data Network, two people have mentioned they have Google contacts who might be interested in hearing my ideas.

I'd like to follow up on this, but in the meantime it would be great if people could feedback to me. 

  • Are my suggestions sensible? 
  • Are there are any other features that would help the digital preservation community preserve Google Drive? I can't imagine I've captured everything...

Thursday, 6 July 2017

The UK Archivematica group goes to Scotland

Yesterday the UK Archivematica group met in Scotland for the first time. The meeting was hosted by the University of Edinburgh and as always it was great to be able to chat informally to other Archivematica users in the UK and find out what everyone is up to.

The first thing to note was that since this group of Archivematica ‘explorers’ first met in 2015 real and tangible progress seems to have been made. This was encouraging to see. This is particularly the case at the University of Edinburgh. Kirsty Lee talked us through their Archivematica implementation (now in production) and the steps they are taking to ingest digital content.

One of the most interesting bits of her presentation was a discussion about appraisal of digital material and how to manage this at scale using the available tools. When using Archivematica (or other digital preservation systems) it is necessary to carry out appraisal at an early stage before an Archival Information Package (AIP) is created and stored. It is very difficult (perhaps impossible) to unpick specific files from an AIP at a later date.

Kirsty described how one of her test collections has been reduced from 5.9GB to 753MB using a combination of traditional and technical appraisal techniques. 

Appraisal is something that is mentioned frequently in digital preservation discussions. There was a group talking about just this a couple of weeks ago at the recent DPC unconference ‘Connecting the Bits’. 

As ever it was really valuable to hear how someone is moving forward with this in a practical way. 

It will be interesting to find out how these techniques can be applied at scale of some of the larger collections Kirsty intends to work with.

Kirsty recommended an article by Victoria Sloyan, Born-digital archives at the Wellcome Library: appraisal and sensitivity review of two hard drives which was helpful to her and her colleagues when formulating their approach to this thorny problem.

She also referenced the work that the Bentley Historical Library at University of Michigan have carried out with Archivematica and we watched a video showing how they have integrated Archivematica with DSpace. This approach has influenced Edinburgh’s internal discussions about workflow.

Kirsty concluded with something that rings very true for me (in fact I think I said it myself the two presentations I gave last week!). Striving for perfection isn’t helpful, the main thing is just to get started and learn as you go along.

Rachel McGregor from the University of Lancaster gave an entertaining presentation about the UK Archivematica Camp that was held in York in April, covering topics as wide ranging as the weather, the food and finally feeling the love for PREMIS!

I gave a talk on work at York to move Archivematica and our Research Data York application towards production. I had given similar talks last week at the Jisc Research Data Network event and a DPC briefing day but I took a slightly different focus this time. I wanted to drill in a bit more detail into our workflow, the processing configuration within Archivematica and some problems I was grappling with. 

It was really helpful to get some feedback and solutions from the group on an error message I’d encountered whilst preparing my slides the previous day and to have a broader discussion on the limitations of web forms for data upload. This is what is so good about presenting within a small group setting like this as it allows for informality and genuinely productive discussion. As a result of this I over ran and made people wait for their lunch (very bad form I know!)

After lunch John Kaye updated the group on the Jisc Research Data Shared Service. This is becoming a regular feature of our meetings! There are many members of the UK Archivematica group who are not involved in the Jisc Shared Service so it is really useful to be able to keep them in the loop. 

It is clear that there will be a substantial amount of development work within Archivematica as a result of its inclusion in the Shared Service and features will be made available to all users (not just those who engage directly with Jisc). One example of this is containerisation which will allow Archivematica to be more quickly and easily installed. This is going to make life easier for everyone!

Sean Rippington from the University of St Andrews gave an interesting perspective on some of the comparison work he has been doing of Preservica and Archivematica. 

Both of these digital preservation systems are on offer through the Jisc Shared Service and as a pilot institution St Andrews has decided to test them side by side. Although he hasn’t yet got his hands on both, he was still able to offer some really useful insights on the solutions based on observations he has made so far. 

First he listed a number of similarities - for example alignment with the OAIS Reference Model, the migration-based approach, the use of microservices and many of the tools and standards that they are built on.

He also listed a lot of differences - some are obvious, for example one system is commercial and the other open source. This leads to slightly different models for support and development. He mentioned some of the additional functionality that Preservica has, for example the ability to handle emails and web archives and the inclusion of an access front end. 

He also touched on reporting. Preservica does this out of the box whereas with Archivematica you will need to use a third party reporting system. He talked a bit about the communities that have adopted each solution and concluded that Preservica seems to have a broader user base (in terms of the types of institution that use it). The engaged, active and honest user community for Archivematica was highlighted as a specific selling point and the work of the Filling the Digital Preservation Gap project (thanks!).

Sean intends to do some more detailed comparison work once he has access to both systems and we hope he will report back to a future meeting.

Next up we had a collaborative session called ‘Room 101’ (even though our meeting had been moved to room 109). Considering we were encouraged to grumble about our pet hates this session came out with some useful nuggets:

  • Check your migrated files. Don’t assume everything is always OK.
  • Don’t assume that just because Archivematica is installed all your digital preservation problems are solved.
  • Just because a feature exists within Archivematica it doesn’t mean you have to use it - it may not suit your workflow
  • There is no single ‘right’ way to set up Archivematica and integrate with other systems - we need to talk more about workflows and share experiences!

After coffee break we were joined (remotely) by several representatives from the OSSArcFlow project from Educopia and the University of North Carolina. This project is very new but it was great that they were able to share with us some information about what they intend to achieve over the course of the two year project. 

They are looking specifically at preservation workflows using open source tools (specifically Archivematica, BitCurator and ArchivesSpace) and they are working with 12 partner institutions who will all be using at least two of these tools. The project will not only provide training and technical support, but will fully document the workflows put in place at each institution. This information will be shared with the wider community. 

This is going to be really helpful for those of us who are adopting open source preservation tools, helping to answer some of those niggling questions such as how to fill the gaps and what happens when there are overlaps in the functionality of two tools.

We registered our interest in continuing to be kept in the loop about this project and we hope to hear more at a future meeting.

The day finished with a brief update from Sara Allain from Artifactual Systems. She talked about some of the new things that are coming in version 1.6.1 and 1.7 of Archivematica.

Before leaving Edinburgh it was a pleasure to be able to join the University at an event celebrating their progress in digital preservation. Celebrations such as this are pretty few and far between - perhaps because digital preservation is a task that doesn’t have an obvious end point. It was really refreshing to see an institution publicly celebrating the considerable achievements made so far. Congratulations to the University of Edinburgh!

Friday, 23 June 2017

Emulation for preservation - is it for me?

I’ve previously been of the opinion that emulation isn’t really for me.

I’ve seen presentations about emulation at conferences such as iPRES and it is fair to say that much of it normally goes over my head.

This hasn’t been helped by the fact that I’ve not really had a concrete use case for it in my own work - I find it so much easier to relate and engage to a topic or technology if I can see how it might be directly useful to me.

However, for a while now I’ve been aware that emulation is what all the ‘cool kids’ in the digital preservation world seem to be talking about. From the very migration heavy thinking of the 2000’s it appears that things are now moving in a different direction.

This fact first hit my radar at the 2014 Digital Preservation Awards where the University of Freiburg won the The OPF Award for Research and Innovation award for their work on Emulation as a Service with bwFLA Functional Long Term Archiving and Access.

So I was keen to attend the DPC event Halcyon, On and On: Emulating to Preserve to keep up to speed... not only because it was hosted on the doorstep in the centre of my home town of York!

It was an interesting and enlightening day. As usual the Digital Preservation Coalition did a great job of getting all the right experts in the room (sometimes virtually) at the same time, and a range of topics and perspectives were covered.

After an introduction from Paul Wheatley we heard from the British Library about their experiences of doing emulation as part of their Flashback project. No day on emulation would be complete without a contribution from the University of Freiburg. We had a thought provoking talk via WebEx from Euan Cochrane of Yale University Library and an excellent short film created by Jason Scott from the Internet Archive. One of the highlights for me was Jim Boulton talking about Digital Archaeology - and that wasn’t just because it had ‘Archaeology’ in the title (honest!). His talk didn’t really cover emulation, it related more to that other preservation strategy that we don’t talk about much anymore - hardware preservation. However, many of the points he raised were entirely relevant to emulation - for example, how to maintain an authentic experience, how you define what the significant properties of an item actually are and what decisions you have to make as a curator of the digital past. It was great to see how engaged the public were with his exhibitions and how people interacted with it.

Some of the themes of the day and take away thoughts for me:

  • Choosing the best strategy - It is not all about which preservation strategy to use it is more about how we can use them together - as Paul Wheatley pointed out - emulation is a good partner to migration as it can help you to test a migration strategy. The British Library showed off their lab of old hardware - they use this to check whether their emulators are working OK. As digital archivists we can (and should) use all of the tools at our disposal to make sure we are doing the job well.
  • A window of emulation opportunity? - Simon Whibley from the British Library mentioned that older material tends to emulate better than the more recent material they worked with. Later on in the day Euan Cochrane talked about the ways technology is rapidly moving forward (see for example The Internet of Things). This offers up new challenges for those working in digital preservation, whatever strategy they employ. Will there be a relatively small window of opportunity for emulation (from the 1980's to the 2000's)? Beyond that point, will it all get just too complex?
  • Software is a problem - setting up the emulation environments is easy (in that some people have this solved) but if you don’t have the necessary software to install in order to read your files then you are stuck. Obviously this is a thorny problem due to licencing and IPR and not one which has been systematically solved. The British Library have been ‘accidentally’ collecting software but this area continues to be a problematic one.
  • What constitutes an 'authentic experience'? - most of the presentations mentioned this idea of the authentic experience - ultimately this is what we are trying to provide. Simon Whibley asked whether an emulation that appears in full colour is authentic if it would have been monochrome on the original hardware? Jim Boulton mentioned that some of the artists he worked with wanted the bandwidth to be throttled on their historic websites to recreate the authentic speed (or lack of it!). Some of the emulators demonstrated over the course of the day also provided the original sounds of the operating system and this is an important element in providing an authentic experience. It isn't just about serving up the data.
Thinking about how this all relates to me and my work, I am immediately struck by two use cases.

Firstly research data - we are taking great steps forward in enabling this data to be preserved and maintained for the long term but will it be re-usable? For many types of research data there is no clear migration strategy. Emulation as a strategy for accessing this data ten or twenty years from now needs to be seriously considered. In the meantime we need to ensure we can identify the files themselves and collect adequate documentation - it is these things that will help us to enable reuse through emulators in the future.

Secondly, there are some digital archives that we hold at the Borthwick Institute from the 1980's. For example I have been working on a batch of WordStar files in my spare moments over the last few years. I'd love to get a contemporary emulator fired up and see if I could install WordStar and work with these files in their native setting. I've already gone a little way down the technology preservation route, getting WordStar installed on an old Windows 98 PC and viewing the files, but this isn't exactly contemporary. These approaches will help to establish the significant properties of the files and assess how successful subsequent migration strategies are....but this is a future blog post.

It was a fun event and it was clear that everybody loves a bit of nostalgia. Jim Boulton ended his presentation saying "There is something quite romantic about letting people play with old hardware".

We have come a long way and this is most apparent when seeing artefacts (hardware, software, operating systems, data) from early computing. Only this week whilst taking the kids to school we got into a conversation about floppy disks (yes, I know...). I asked the kids if they knew what they looked like and they answered "Yes, it is the save icon on the computer"(see Why is the save icon still a floppy disk?)...but of course they've never seen a real one. Clearly some obsolete elements of our computer history will remain in our collective consciousness for many years and perhaps it is our job to continue to keep them alive in some form.

Friday, 16 June 2017

A typical week as a digital archivist?

Sometimes (admittedly not very often) I'm asked what I actually do all day. So at the end of a busy week being a digital archivist I've decided to blog about what I've been up to.


Today I had a couple of meetings. One specifically to talk about digital preservation of electronic theses submissions. I've also had a work experience placement in this week so have set up a metadata creation task which he has been busy working on.

When I had a spare moment I did a little more testing work on the EAD harvesting feature the University of York is jointly sponsoring Artefactual Systems to develop in AtoM. Testing this feature from my perspective involves logging into the test site that Artefactual has created for us and tweaking some of the archival descriptions. Once those descriptions are saved, I can take a peek at the job scheduler and make sure that new EAD files are being created behind the scenes for the Archives Hub to attempt to harvest at a later date.

This piece of development work has been going on for a few months now and communications have been technically quite complex so I'm also trying to ensure all the organisations involved are happy with what has been achieved and will be arranging a virtual meeting so we can all get together and talk through any remaining issues.

I was slightly surprised today to have a couple of requests to talk to the media. This has sprung from the news that the Queen's Speech will be delayed. One of the reasons for the delay relates to the fact that the speech has to be written on goat's skin parchment, which takes a few days to dry. I had previously been interviewed for a article entitled Why is the UK still printing its laws on vellum? and am now mistaken for someone who knows about vellum. I explained to potential interviewers that this is not my specialist subject!


In the morning I went to visit a researcher at the University of York. I wanted to talk to him about how he uses Google Drive in relation to his research. This is a really interesting topic to me right now as I consider how best we might be able to preserve current research datasets. Seeing how exactly Google Drive is used and what features the researcher considers to be significant (and necessary for reuse) is really helpful when thinking about a suitable approach to this problem. I sometimes think I work a little bit too much in my own echo chamber, so getting out and hearing different perspectives is incredibly valuable.

Later that afternoon I had an unexpected meeting with one of our depositors (well, there were two of them actually). I've not met them before but have been working with their data for a little while. In our brief meeting it was really interesting to chat and see the data from a fresh perspective. I was able to reunite them with some digital files that they had created in the mid 1980's, had saved on to floppy disk and had not been able to access for a long time.

Digital preservation can be quite a behind the scenes sort of job - we always give a nod to the reason why we do what we do (ie: we preserve for future reuse), but actually seeing the results of that work unfold in front of your eyes is genuinely rewarding. I had rescued something from the jaws of digital obsolescence so it could now be reused and revitalised!

At the end of the day I presented a joint webinar for the Open Preservation Foundation called 'PRONOM in practice'. Alongside David Clipsham (The National Archives) and Justin Simpson (Artefactual Systems), I talked about my own experiences with PRONOM, particularly relating to file signature creation, and ending with a call to arms "Do try this at home!". It would be great if more of the community could get involved!

I was really pleased that the webinar platform worked OK for me this time round (always a bit stressful when it doesn't) and that I got to use the yellow highlighter pen on my slides.

In my spare moments (which were few and far between), I put together a powerpoint presentation for the following day...


I spent the day at the British Library in Boston Spa. I'd been invited to speak at a training event they regularly hold for members of staff who want to find out a bit more about digital preservation and the work of the team.

I was asked specifically to talk through some of the challenges and issues that I face in my work. I found this pretty easy - there are lots of challenges - and I eventually realised I had too many slides so had to cut it short! I suppose that is better than not having enough to say!

Visiting Boston Spa meant that I could also chat to the team over lunch and visit their lab. They had a very impressive range of old computers and were able to give me a demonstration of Kryoflux (which I've never seen in action before) and talk a little about emulation. This was a good warm up for the DPC event about emulation I'm attending next week: Halcyon On and On: Emulating to Preserve.

Still left on my to do list from my trip is to download Teracopy. I currently use Foldermatch for checking that files I have copied have remained unchanged. From the quick demo I saw at the British Library I think that Teracopy would be a more simple one step solution. I need to have a play with this and then think about incorporating it into the digital ingest workflow.

Sharing information and collaborating with others working in the digital preservation field really is directly beneficial to the day to day work that we do!


Back in the office today and a much quieter day.

I extracted some reports from our AtoM catalogue for a colleague and did a bit of work with our test version of Research Data York. I also met with another colleague to talk about storing and providing access to digitised images.

In the afternoon I wrote another powerpoint presentation, this time for a forthcoming DPC event: From Planning to Deployment: Digital Preservation and Organizational Change.

I'm going to be talking about our experiences of moving our Research Data York application from proof of concept to production. We are not yet in production and some of the reasons why will be explored in the presentation! Again I was asked to talk about barriers and challenges and again, this brief is fairly easy to fit! The event itself is over a week away so this is unprecedentedly well organised. Long may it continue!


On Fridays I try to catch up on the week just gone and plan for the week ahead as well as reading the relevant blogs that have appeared over the week. It is also a good chance to catch up with some admin tasks and emails.

Lunch time reading today was provided by William Kilbride's latest blog post. Some of it went over my head but the final messages around value and reuse and the need to "do more with less" rang very true.

Sometimes I even blog myself - as I am today!

Was this a typical week - perhaps not, but in this job there is probably no such thing! Every week brings new ideas, challenges and surprises!

I would say the only real constant is that I've always got lots of things to keep me busy.

Friday, 12 May 2017

AtoM Camp take aways

The view from the window at AtoM Camp ...not that there was
any time to gaze out of the window of course...
I’ve spent the last three days in Cambridge at AtoM Camp. This was the second ever AtoM Camp, and the first in Europe. A big thanks to St John’s College for hosting it and to Artefactual Systems for putting it on.

It really has been an interesting few days, with a packed programme and an engaged group of attendees from across Europe and beyond bringing different levels of experience with AtoM.

As a ‘camp counsellor’ I was able to take to the floor at regular intervals to share some of our experiences of implementing AtoM at the Borthwick, covering topics such as system selection, querying the MySQL database, building the community and overcoming implementation challenges.

However, I was also there to learn!

Here are some bits and pieces that I’ve taken away.

My first real take away is that I now have a working copy of the soon to be released AtoM 2.4 on my Macbook - this is really quite cool. I'll never again be bored on a train - I can just fire up Ubuntu and have a play!

Walk to Camp takes you over Cambridge's Bridge of Sighs
During the camp it was great to be able to hear about some of the new features that will be available in this latest release.

At the Borthwick Institute our catalogue is still running on AtoM 2.2 so we are pretty excited about moving to 2.4 and being able to take advantage of all of this new functionality.

Just some of the new features I learnt about that I can see an immediate use case are:

  • Being able to generate slugs (the end bit of the URL to a record in AtoM) from archival reference numbers rather than titles - this makes perfect sense to me and would make for neater links
  • A modification of the re-indexing script which allows you to specify which elements you want to re-index. I like this one as it means I will not need to get out of bed so early to carry out re-indexes if for example it is only the (non-public facing) accessions records that need indexing.
  • Some really helpful changes to the search results - The default operator in an AtoM search has now been changed from ‘OR’ to ‘AND’. This is a change we already made to our local instance (as have several others) but it is good to see that AtoM now has this built in. Another change focuses on weighting of results and ensures that the most relevant results appear first. This relevance ranking is related to the fields in which the search terms appear - thus, a hit in the title field would appear higher than a hit in scope and content.
  • Importing data through the interface will be carried out through the job scheduler so will be better and won't time out. This is great news as it will give colleagues the ability to do all imports themselves rather than having to wait until someone can do this through the command line

On day two of camp I enjoyed the implementation tours, seeing how other institutions have implemented AtoM and the tweaks and modifications they have made. For example it was interesting to see the shopping cart feature developed for the Mennonite Archival Image Database and most popular image carousel feature on front page of the Chinese Canadian Artifacts Project. I was also interested in some of the modifications the National Library of Wales have made to meet their own needs.

It was also nice to hear the Borthwick Catalogue described  by Dan as “elegant”!

There was a great session on community and governance at the end of day two which was one of the highlights of the camp for me. It gave attendees the chance to really understand the business model of Artefactual (as well as alternatives to the bounty model in use by other open source projects). We also got a full history of the evolution of AtoM and saw the very first project logo and vision.

The AtoM vision hasn't changed too much but the name and logo have!

Dan Gillean from Artefactual articulated the problem of trying to get funding for essential and ongoing tasks, such as code modernisation. Two examples he used were updating AtoM to work with the latest version of Symfony and Elasticsearch - both of these tasks need to happen in order to keep AtoM moving in the right direction but both require a substantial amount of work and are not likely to be picked up and funded by the community.

I was interested to see Artefactual’s vision for a new AtoM 3.0 which would see some fundamental changes to the way AtoM works and a more up-to-date, modular and scalable architecture designed to meet the future use cases of the growing AtoM community.

Artefactual's proposed modular architecture for AtoM 3.0

There is no time line for AtoM 3.0, and whether it goes ahead or not is entirely dependent on a substantial source of funding being available. It was great to see Artefactual sharing their vision and encouraging feedback from the community at this early stage though.

Another highlight of Camp:
a tour of the archives of St John's College from Tracy Deakin
A session on data migrations on day three included a demo of OpenRefine from Sara Allain from Artefactual. I’d heard of this tool before but wasn’t entirely sure what it did and whether it would be of use to me. Sara demonstrated how it could be used to bash data into shape before import into AtoM. It seemed to be capable of doing all the things that I’ve previously done in Excel (and more) ...but without so much pain. I’ll definitely be looking to try this out when I next have some data to clean up.

Dan Gillean and Pete Vox from IMAGIZ talked through the process of importing data into AtoM. Pete focused on an example from Croydon Museum Service who's data needed to be migrated from CALM. He talked through some of the challenges of the task and how he would approach this differently in future. It is clear that the complexities of data migration may be one of the biggest barriers to institutions moving to AtoM from an alternative system, but it was encouraging to hear that none of these challenges are insurmountable.

My final take away from AtoM Camp is a long list of actions - new things I have learnt that I want to read up on or try out for myself ...I best crack on!

Friday, 28 April 2017

How can we preserve Google Documents?

Last month I asked (and tried to answer) the question How can we preserve our wiki pages?

This month I am investigating the slightly more challenging issue of how to preserve native Google Drive files, specifically documents*.


At the University of York we work a lot with Google Drive. We have the G Suite for Education (formally known as Google Apps for Education) and as part of this we have embraced Google Drive and it is now widely used across the University. For many (me included) it has become the tool of choice for creating documents, spreadsheets and presentations. The ability to share documents and directly collaborate are key.

So of course it is inevitable that at some point we will need to think about how to preserve them.

How hard can it be?

Quite hard actually.

The basic problem is that documents created in Google Drive are not really "files" at all.

The majority of the techniques and models that we use in digital preservation are based around the fact that you have a digital object that you can see in your file system, copy from place to place and package up into an Archival Information Package (AIP).

In the digital preservation community we're all pretty comfortable with that way of working.

The key challenge with stuff created in Google Drive is that it doesn't really exist as a file.

Always living in hope that someone has already solved the problem, I asked the question on Twitter and that really helped with my research.

Isn't the digital preservation community great?

Exporting Documents from Google Drive

I started off testing the different download options available within Google docs. For my tests I used 2 native Google documents. One was the working version of our Phase 1 Filling the Digital Preservation Gap report. This report was originally authored as a Google doc, was 56 pages long and consisted of text, tables, images, footnotes, links, formatted text, page numbers, colours etc (ie: lots of significant properties I could assess). I also used another more simple document for testing - this one was just basic text and tables but also included comments by several contributors.

I exported both of these documents into all of the different export formats that Google supports and assessed the results, looking at each characteristic of the document in turn and establishing whether or not I felt it was adequately retained.

Here is a summary of my findings, looking specifically at the Filling the Digital Preservation Gap phase 1 report document:

  • docx - This was a pretty good copy of the original. It retained all of the key features of the report that I was looking for (images, tables, footnotes, links, colours, formatting etc), however, the 56 page report was now only 55 pages (in the original, page 48 was blank, but in the docx version this blank page wasn't there).
  • odt - Again, this was a good copy of the originals and much like the docx version in terms of the features it retained. However, the 56 page report was now only 54 pages long. Again it omitted page 48 which was blank in the Google version, but also slightly more words were squeezed on to each page which meant that it comprised of fewer pages. Initially I thought the quality of the images was degraded slightly but this turned out to be just the way they appeared to render in LibreOffice. Looking inside the actual odt file structure and viewing the images as files demonstrated to me that they were actually OK. 
  • rtf - First of all it is worth saying that the Rich Text Format file was *massive*. The key features of the document were retained, although the report document was now 60 pages long instead of 56!
  • txt - Not surprisingly this produces a tiny file that retains only the text of the original document. Obviously the original images, tables, colours, formatting etc were all lost. About the only other notable feature that was retained were the footnotes and these appeared together right at the end of the document. Also a txt file does not have a number of 'pages'... not until you print it at least.
  • pdf - This was a good copy of the original report and retained all the formatting and features that I was looking for. This was also the only copy of the report that had the right number of pages. However, it seems that this is not something we can rely on. A close comparison of the pages of the pdf compared with the original shows that there are some differences regarding which words fall on to which page - it isn't exact!
  • epub - Many features of the report were retained but like the text file it was not paginated and the footnotes were all at the end of the document. The formatting was partially retained - the images were there, but were not always placed in the same positions as in the original. For example on the title page, the logos were not aligned correctly. Similarly, the title on the front page was not central.
  • html - This was very similar to the epub file regarding what was and wasn't retained. It included footnotes at the end and had the same issues with inconsistent formatting.

...but what about the comments?

My second test document was chosen so I could look specifically at the comments feature and how these were retained (or not) in the exported version.

  • docx - Comments are exported. On first inspection they appear to be anonymised, however this seems to be just how they are rendered in Microsoft Word. Having unzipped and dug into the actual docx file and looked at the XML file that holds the comments, it is clear that a more detailed level of information is retained - see images below. The placement of the comments is not always accurate. In one instance the reply to a comment is assigned to text within a subsequent row of the table rather than to the same row as the original comment.
  • odt -  Comments are included, are attributed to individuals and have a date and time. Again, matching up of comments with right section of text is not always accurate - in one instance a comment and it's reply are linked to the table cell underneath the one that they referenced in the original document.
  • rtf - Comments are included but appear to be anonymised when displayed in MS Word...I haven't dug around enough to establish whether or not this is just a rendering issue.
  • txt - Comments are retained but appear at the end of the document with a [a], [b] etc prefix - these letters appear in the main body text to show where the comments appeared. No information about who made the comment is preserved.
  • pdf - Comments not exported
  • epub - Comments not exported
  • html - Comments are present but appear at the end of the document with a code which also acts as a placeholder in the text where the comment appeared. References to the comments in the text are hyperlinks which take you to the right comment at the bottom of the document. There is no indication of who made the comment (not even hidden within the html tags).

A comment in original Google doc

The same comment in docx as rendered by MS Word

...but in the XML buried deep within the docx file structure - we do have attribution and date/time
(though clearly in a different time zone)

What about bulk export options?

Ed Pinsent pointed me to the Google Takeout Service which allows you to:
"Create an archive with your data from Google products"
[Google's words not mine - and perhaps this is a good time to point you to Ed's blog post on the meaning of the term 'Archive']

This is really useful. It allows you to download Google Drive files in bulk and to select which formats you want to export them as.

I tested this a couple of times and was surprised to discover that if you select pdf or docx (and perhaps other formats that I didn't test) as your export format of choice, the takeout service creates the file in the format requested and an html file which includes all comments within the document (even those that have been resolved). The content of the comments/responses including dates and times is all included within the html file, as are names of individuals.

The downside of the Google Takeout Service is that it only allows you to select folders and not individual files. There is another incentive for us to organise our files better! The other issue is that it will only export documents that you are the owner of - and you may not own everything that you want to archive!

What's missing?

Quite a lot actually.

The owner, creation and last modified dates of a document in Google Drive are visible when you click on Document details... within the File menu. Obviously this is really useful information for the archive but is lost as soon as you download it into one of the available export formats.

Creation and last modified dates as visible in Document details

Update: I was pleased to see that if using the Google Takeout Service to bulk export files from Drive, the last modified dates are retained, however on single file export/download these dates are lost and the last modified date of the file becomes the date that you carried out the export. 

Part of the revision history of my Google doc
But of course in a Google document there is more metadata. Similar to the 'Page History' that I mentioned when talking about preserving wiki pages, a Google document has a 'Revision history'

Again, this *could* be useful to the archive. Perhaps not so much so for my document which I worked on by myself in March, but I could see more of a use case for mapping and recording the creative process of writing a novel for example. 

Having this revision history would also allow you to do some pretty cool stuff such as that described in this blog post: How I reverse engineered Google Docs to play back any documents Keystrokes (thanks to Nick Krabbenhoft for the link).

It would seem that the only obvious way to retain this information would be to keep the documents in their original native Google format within Google Drive but how much confidence do we have that it will be safe there for the long term?


If you want to preserve a Google Drive document there are several options but no one-size-fits-all solution.

As always it boils down to what the significant properties of the document are. What is it we are actually trying to preserve?

  • If we want a fairly accurate but non interactive digital 'print' of the document, pdf might be the most accurate representation though even the pdf export can't be relied on to retain the exact pagination. Note that I didn't try and validate the pdf files that I exported and sadly there is no pdf/a export option.
  • If comments are seen to be a key feature of the document then docx or odt will be a good option but again this is not perfect. With the test document I used, comments were not always linked to the correct point within the document.
  • If it is possible to get the owner of the files to export them, the Google Takeout Service could be used. Perhaps creating a pdf version of the static document along with a separate html file to capture the comments.

A key point to note is that all export options are imperfect so it would be important to check the exported document against the original to ensure it accurately retains the important features.

Another option would be simply keeping them in their native format but trying to get some level of control over them - taking ownership and managing sharing and edit permissions so that they can't be changed. I've been speaking to one of our Google Drive experts in IT about the logistics of this. A Google Team Drive belonging to the Archives could be used to temporarily store and lock down Google documents of archival value whilst we wait and see what happens next. 

...I live in hope that export options will improve in the future.

This is a work in progress and I'd love to find out what others think.

* note, I've also been looking at Google Sheets and that may be the subject of another blog post

Friday, 7 April 2017

Archivematica Camp York: Some thoughts from the lake

Well, that was a busy week!

Yesterday was the last day of Archivematica Camp York - an event organised by Artefactual Systems and hosted here at the University of York. The camp's intention was to provide a space for anyone interested in or currently using Archivematica to come together, learn about the platform from other users, and share their experiences. I think it succeeded in this, bringing together 30+ 'campers' from across the UK, Europe and as far afield as Brazil for three days of sessions covering different aspects of Archivematica.

Our pod on the lake (definitely a lake - not a pond!)
My main goal at camp was to ensure everyone found their way to the rooms (including the lakeside pod) and that we were suitably fuelled with coffee, popcorn and cake. Alongside these vital tasks I also managed to partake in the sessions, have a play with the new version of Archivematica (1.6) and learn a lot in the process.

I can't possibly capture everything in this brief blog post so if you want to know more, have a look back at all the #AMCampYork tweets.

What I've focused on below are some of the recurring themes that came up over the three days.


Archivematica is just one part of a bigger picture for institutions that are carrying out digital preservation, so it is always very helpful to see how others are implementing it and what systems they will be integrating with. A session on workflows in which participants were invited to talk about their own implementations was really interesting. 

Other sessions  also helped highlight the variety of different configurations and workflows that are possible using Archivematica. I hadn't quite realised there were so many different ways you could carry out a transfer! 

In a session on specialised workflows, Sara Allain talked us through the different options. One workflow I hadn't been aware of before was the ability to include checksums as part of your transfer. This sounds like something I need to take advantage of when I get Archivematica into production for the Borthwick. 

Justin talking about Automation Tools
A session on Automation Tools with Justin Simpson highlighted other possibilities - using Archivematica in a more automated fashion. 

We already have some experience of using Automation Tools at York as part of the work we carried out during phase 3 of Filling the Digital Preservation Gap, however I was struck by how many different ways these can be applied. Hearing examples from other institutions and for a variety of different use cases was really helpful.


The camp included a chance to play with Archivematica version 1.6 (which was only released a couple of weeks ago) as well as an introduction to the new Appraisal and Arrangement tab.

A session in progress at Archivematica Camp York
I'd been following this project with interest so it was great to be able to finally test out the new features (including the rather pleasing pie charts showing what file formats you have in your transfer). It was clear that there were a few improvements that could be made to the tab to make it more intuitive to use and to deal with things such as the ability to edit or delete tags, but it is certainly an interesting feature and one that I would like to explore more using some real data from our digital archive.

Throughout camp there was a fair bit of discussion around digital appraisal and at what point in your workflow this would be carried out. This was of particular interest to me being a topic I had recently raised with colleagues back at base.

The Bentley Historical Library who funded the work to create the new tab within Archivematica are clearly keen to get their digital archives into Archivematica as soon as possible and then carry out the work there after transfer. The addition of this new tab now makes this workflow possible.

Kirsty Lee from the University of Edinburgh described her own pre-ingest methodology and the tools she uses to help her appraise material before transfer to Archivematica. She talked about some tools (such as TreeSize Pro) that I'm really keen to follow up on.

At the moment I'm undecided about exactly where and how this appraisal work will be carried out at York, and in particular how this will work for hybrid collections so as always it is interesting to hear from others about what works for them.

Metadata and reporting

Evelyn admitting she loves PREMIS and METS
Evelyn McLellan from Artefactual led a 'Metadata Deep Dive' on day 2 and despite the title, this was actually a pretty interesting session!

We got into the details of METS and PREMIS and how they are implemented within Archivematica. Although I generally try not to look too closely at METS and PREMIS it was good to have them demystified. On the first day through a series of exercises we had been encouraged to look at a METS file created by Archivematica ourselves and try and pick out some information from it so these sessions in combination were really useful.

Across various sessions of the camp there was also a running discussion around reporting. Given that Archivematica stores such a detailed range of metadata in the METS file, how do we actually make use of this? Being able to report on how many AIPs have been created, how many files and what size is useful. These are statistics that I currently collect (manually) on a quarterly basis and share with colleagues. Once Archivematica is in place at York, digging further into those rich METS files to find out which file formats are in the digital archive would be really helpful for preservation planning (among other things). There was discussion about whether reporting should be a feature of Archivematica or a job that should be done outside Archivematica.

In relation to the later option - I described in one session how some of our phase 2 work of Filling the Digital Preservation Gap was designed to help expose metadata from Archivematica to a third party reporting system. The Jisc Research Data Shared Service was also mentioned in this context as reporting outside of Archivematica will need to be addressed as part of this project.


As with most open source software, community is important. This was touched on throughout the camp and was the focus of the last session on the last day.

There was a discussion about the role of Artefactual Systems and the role of Archivematica users. Obviously we are all encouraged to engage and help sustain the project in whatever way we are able. This could be by sharing successes and failures (I was pleased that my blog got a mention here!), submitting code and bug reports, sponsoring new features (perhaps something listed on the development roadmap) or helping others by responding to queries on the mailing list. It doesn't matter - just get involved!

I was also able to highlight the UK Archivematica group and talk about what we do and what we get out of it. As well as encouraging new members to the group, there was also discussion about the potential for forming other regional groups like this in other countries.

Some of the Archivematica community - class of Archivematica Camp York 2017

...and finally

Another real success for us at York was having the opportunity to get technical staff at York working with Artefactual to resolve some problems we had with getting our first Archivematica implementation into production. Real progress was made and I'm hoping we can finally start using Archivematica for real at the end of next month.

So, that was Archivematica Camp!

A big thanks to all who came to York and to Artefactual for organising the programme. As promised, the sun shined and there were ducks on the lake - what more could you ask for?

Thanks to Paul Shields for the photos