Monday 19 November 2018

The sustainability of a digital preservation blog...

So this is a topic pretty close to home for me.

Oh the irony of spending much of the last couple of months fretting about the future preservation of my digital preservation blog...!

I have invested a fair bit of time over the last 6 or so years writing posts for this blog. Not only has it been useful for me as a way of documenting what I've done or what I've learnt, but it has also been of interest to a much wider audience.

The most viewed posts have been on the topic of preserving Google Drive formats and disseminating the outputs of the Filling the Digital Preservation Gap project. Access statistics show that the audience is truly international.

When I decided to accept a job elsewhere I was of course concerned about what would happen to my blog. I hoped that all would be well, given that Blogger is a Google-supported solution and part of the suite of Google tools that University of York staff and students use. But what would happen when my institutional Google account was closed down?

Initially I believed that as long as I handed over ownership of the blog to another member of staff who remained at the University, then all would be well. However, I soon realised that there were going to be some bigger challenges.

The problem

Once I leave the institution and my IT account is closed, Blogger will no longer have a record of who I am.

All posts that have been written by me will be marked as 'Unknown'. They will no longer have my name automatically associated with them. Not ideal from my perspective, and also not ideal for anyone who might want to cite the blog posts in the future.

The other problem is that once my account is closed down, all of the images within the blog posts I have published will completely disappear.

This is pretty bad news!

When a member of staff adds images to a blog post, the usual method is to select an image from the local PC or network drive. Google then stores a copy of that image in https://get.google.com/albumarchive/ (in a location that is tied to that individual's account). When the account is closed, all of these blog-related images are also wiped. The images are not recoverable.

So, I could make copies of all my images now and hand them to my colleagues, so that they could put them all back in again once I leave...but who is going to want to do that?

A solution of sorts

I asked IT Support to help me, and a colleague has had some success at extracting the contents of my blog, amending the image URLs in the XML and importing the posts back into a test Blogger account with images hosted in a location that isn't associated with an individual staff account.

There is a description of how this result was achieved here and I'm hugely grateful for all of the time that was spent trying to fix this problem for me.

The XML was also amended directly to add the words 'Jenny Mitcham, Digital Archivist' to the end of every blog post, to save me having to open each of the 120 posts in turn and add my name to them. That was a big help too.
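For anyone tackling a similar migration, the kind of transformation involved can be scripted. The sketch below is not the actual process used here (that is described in the write-up linked above); it is just a minimal illustration in Python, with invented hostnames for the old and new image locations and an assumed local copy of the Blogger export file.

```python
import re

# Invented values for illustration only: a pattern matching the old Blogger
# image host, and a new, account-independent location for the images.
OLD_IMAGE_URL = re.compile(r"https://\d+\.bp\.blogspot\.com/\S+?\.(?:png|jpe?g|gif)",
                           re.IGNORECASE)
NEW_IMAGE_BASE = "https://www.example.ac.uk/blog-images/"
SIGNATURE = "<p>Jenny Mitcham, Digital Archivist</p>"

with open("blog-export.xml", encoding="utf-8") as f:
    xml = f.read()

# Point each image at the new host, keeping only the original filename
def relocate(match):
    filename = match.group(0).rsplit("/", 1)[-1]
    return NEW_IMAGE_BASE + filename

xml = OLD_IMAGE_URL.sub(relocate, xml)

# Append the author's name to every <content> element. A real script would
# need to be more selective (comments in the export also use <content>).
xml = xml.replace("</content>", SIGNATURE + "</content>")

with open("blog-export-amended.xml", "w", encoding="utf-8") as f:
    f.write(xml)
```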

So, in my last couple of weeks at work I have been experimenting with importing the tweaked XML file back into Blogger.

Initially, I just imported the XML file back into the blog without deleting the original blog posts. I had understood that the imported posts would merge with the originals and that all would be well. Unfortunately though, I ended up with two versions of each blog post - the original one and a new one at a slightly different URL.

So, I held my breath, took the plunge and deleted everything and then re-imported the amended XML.

I had envisaged that the imported blog posts would be assigned their original URLs but was disappointed to see that this was not the case. Despite the URLs being included within the XML, Blogger clearly had a record that these URLs had already been used and would not re-use them.

I know some people link to the blog posts from other blogs and websites, and I also link between posts within the blog, so a change to all the URLs will lead to lots of broken links. Bad news!

I tried going into individual posts and changing the permalink by hand back to the original link, but Blogger would not accept this and kept adding a different number to the end of the URL to ensure it did not replicate the URL of one of my deleted posts. Hugely frustrating!

Luckily my colleague in IT came up with an alternative solution: adding some clever code into the header of the blog which carries out a search every time a page is requested. This seems to work well, serving up one or more posts based on the URL that is requested. Given that the new URLs are very similar to the old ones (essentially the same but with some numbers added to the end), the search is very effective and the right post is served up at the top of the page. Hopefully this will work for the foreseeable future and should lead to minimal impact for users of the blog.


Advice for Blogger users

If you are using Blogger from an institutional Google account, think about what will happen to your posts after your account is closed down.

There are a few things you can do to help future proof the blog:
  • Host images externally in a location that isn't tied to your institutional account - for example a Google Team Drive or an institutional website - link to this location from the blog post rather than uploading images directly.
  • Ensure that your name is associated with the blog posts you write by hard-coding it into the text of your blog post - don't rely on Blogger knowing who you are forever.
  • Ensure that there are others who have administrative control of the blog so that it continues after your account has been closed.
And lastly - if just starting out, consider using a different blogging platform. Presumably they are not all this unsustainable...!

Apologies...

Unfortunately, with the tweak that has been made to how the images are hosted and pulled into the posts, some of the images appear to have degraded in quality. I began editing each post and resizing the images (which appears to fix the problem) but have run out of time to work through all 120 posts before my account is closed.

Generally, if an image looks bad in the blog, you can see a clearer version of it if you click on it so this isn't a disaster.

Also, there may be some images that are out of place - I have found (and fixed) one example of this but have not had time to check all of them.

Apologies to anyone who subscribes to this blog - I understand you may have had a lot of random emails as a result of me re-importing or republishing blog posts over the last few weeks!

Thanks to...

As well as thanking Tom Smith at the University of York for his help with fixing the blog, I'd also like to thank the Web Archiving team at the British Library who very promptly harvested my blog before we started messing around with it. Knowing that it was already preserved and available within a web archive did give me comfort as I repeatedly broke it!

A plea to Google

Blogger could (and should) be a much more sustainable blogging platform. It should be able to handle situations where someone's account closes down. It should be possible to make the blogs (including images) more portable. It should be possible for an institution to create a blog that can be handed from one staff member to another without breaking it. A blog should be able to outlive its primary author.

I genuinely don't think these things would be that hard for a clever developer at Google to fix. The current situation creates a very real headache for those of us who have put a lot of time and effort into creating content within this platform.

It really doesn't need to be this way!


Thursday 15 November 2018

Goodbye and thanks

This is my last day as Digital Archivist for the University of York.

Next week I will be taking on a brand new post as Head of Standards and Good Practice at the Digital Preservation Coalition. This is an exciting move for me but it is with some sadness that I leave the Borthwick Institute and University of York behind.

I have been working in digital preservation at the University of York for the last 15 years: initially with the Archaeology Data Service, as part of the team that preserves and disseminates digital data produced by archaeologists in the UK, and since 2012 at the Borthwick Institute, branching out to work with many other types of digital material.

These last six years have been both interesting and challenging, and I have learnt a huge amount.

Perhaps the biggest change for me was moving from being one of a team of digital archivists to being a lone digital archivist. I think this is one of the reasons I started this blog. I missed having other digital archivists around who were happy to endlessly discuss the merits of different preservation file formats and tools!

Blogging about my work at the Borthwick became a helpful way for me to use the wider digital preservation community as a sounding board and for sense checking what I was doing. I have received some really helpful advice in the comments and the blogs have led to many interesting discussions on Twitter.

In a discipline where resources are often scarce, it makes no sense for us all to quietly invent the same wheel in our own local contexts. Admittedly there is no one-size-fits-all solution to digital preservation, but talking about what we do and learning from each other is so very important.

Of course there have been challenges along the way...

It is difficult to solve a problem that does not have clear boundaries. The use cases for digital preservation in a large institution are complex and ever growing.

I began by focusing on the born digital archives that come to the Borthwick from our donors and depositors. Perhaps if that were the only challenge, we would be further down the line of solving it...

However, we also have the complexities of research data to consider, the huge volumes of digitised content we are producing, the need to digitise audio-visual archives and preserve them in digital formats, the need to preserve the institutional record (including websites, social media and email), and the desire to preserve theses in digital formats. On top of this is the increasing need to be able to provide access to digital resources.

The use cases overlap and are not neatly bounded. Multiple integrations with other systems are required to ensure that preservation processes are seamless and can be implemented at scale.

I have frequently reached the limit of my own technical ability. I am an archaeologist with probably above average IT skills but I can only get so far with the knowledge I have. Getting the right level of technical support to move digital preservation forward is key. 

So, I’ve made some mistakes, I’ve changed my mind about some things, I’ve often tried to do too much, but ultimately I've had the freedom to try things out and to share those experiences with the wider community.

Some lessons learned from my 6 years at the Borthwick:
  • Doing something is normally better than doing nothing
  • Accept solutions that are 'good enough' ...don't hold out for 'perfect'
  • Try things out. Research and planning are important, but it is hard to fully understand things without diving in and having a go
  • Digital continuity actually begins quite close to home - consider the sustainability of your blogging platform!

The biggest lesson for me perhaps has been that I have spent much of my 6 years chasing the somewhat elusive dream of an all-singing-all-dancing 'digital preservation system', but in actual fact, the interim measures I have put in place at the Borthwick might be just about ‘good enough’ for the time being.

It is not always helpful to think about digital preservation in 'forever' terms. It is more realistic to consider our role to be to keep digital archives safe to hand over to the next person. Indeed, digital preservation has frequently been likened to a relay race.

So I hereby finish my leg of this particular race and hand over the baton to the next digital archivist...

A big thank you and goodbye to all my colleagues at the Borthwick Institute and across Information Services. It has been fun! :-)




Thursday 8 November 2018

Testing manual normalisation workflows in Archivematica

This week I traveled to Warwick University for the UK Archivematica meeting. As usual, it was a really interesting day. I’m not going to blog about all of it (I’ll leave that to our host, Rachel MacGregor) but I will blog about the work I presented there.

Followers of my blog will be aware that I recently carried out some file migration work on a batch of WordStar 4 files from screenwriting duo Marks and Gran.

The final piece of work I wanted to carry out was to consider how we might move the original files along with the migrated versions of those files into Archivematica (if we were to adopt Archivematica in the future).

I knew the migration I had carried out was a bit of an odd one so I was particularly interested to see how Archivematica would handle it.

It was odd for a number of reasons.

1. Firstly, I ended up creating three different versions of each WordStar file – DOCX, PDF/A and ASCII TXT. After an assessment of the significant properties of the files (essentially, the features that I wanted to preserve) and some intensive QA, it was clear that all three versions were imperfect. None of them preserved all of the properties I wanted; they all had strengths and weaknesses, but between them they pretty much captured everything.


2. Secondly, I wasn’t able to say whether the files I had created were for preservation or for access purposes. Typically, a file migration process makes a clear distinction between files that are for dissemination to users and files that are preservation copies to be kept behind the scenes.

After talking to my colleagues about the file migrations and discussing the pros and cons of the resulting files, it was agreed that we could potentially use any (or all) of the formats to provide access to future users, depending on user needs. I’m not particularly happy with any of the versions I created being treated as preservation files, given that none of them capture all the elements of the file that I considered to be of value, but they may need to become preservation versions in future if WordStar 4 files become impossible to read.


3. Thirdly, the names of the migrated versions of the files did not exactly match the original filenames. The original WordStar files were created in the mid-1980s. In this early period of home computing, file extensions appear to have been optional. The WordStar 4 manual actually suggests using the three characters available for the file extension to record additional information that won't fit in the eight-character filename.

For many of the WordStar files in this archive there is no file extension at all. For other files, the advice in the manual has been followed, in that an extension has been used which gives additional context to the filename. For many of the files, WordStar has also created a backup file (essentially an earlier version of the file saved with a .BAK extension, but still a WordStar file). There are therefore scenarios where we need to retain information about the original file extension in the migrated version, both to avoid filenaming conflicts and to avoid losing information from the original filename (a rough sketch of this kind of renaming logic is included below).
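The sketch below (Python, with invented filenames) shows the kind of logic I mean: the original extension is folded into the new filename so that, for example, a script and its .BAK backup do not collide once both are migrated to the same target format.

```python
from pathlib import Path

def migrated_name(original: Path, new_suffix: str) -> str:
    """Keep any meaningful original extension (including .BAK) as part of
    the migrated filename, so that a file and its backup do not end up
    competing for the same name once both carry the new extension."""
    if original.suffix:
        # e.g. EP1.V2 -> EP1_V2.docx, EP1.BAK -> EP1_BAK.docx
        stem = f"{original.stem}_{original.suffix.lstrip('.')}"
    else:
        # No extension at all, common for these mid-1980s files
        stem = original.name
    return f"{stem}{new_suffix}"

# Invented examples, just to show the behaviour:
for name in ("EPISODE1", "EPISODE1.BAK", "EP2.V2"):
    print(migrated_name(Path(name), ".docx"))
# -> EPISODE1.docx, EPISODE1_BAK.docx, EP2_V2.docx
```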



Why not just normalise within Archivematica?

  • These WordStar files are not recognised by the file identification tools in Archivematica (I don’t think they are recognised by any file identification tools). The Format Policy Registry only works on files that have been identified and for which a rule/policy has been set up
  • Even if they were identifiable, some of the manual steps we went through to create the migrated versions could not be replicated with the command line tools that Archivematica calls.
  • As part of the migration process itself, several days were spent doing QA and checking the migrated files against the originals as viewed in WordStar 4, and detailed documentation (to be stored alongside the files) was created. Archivematica does give you a decision point after a normalisation so that checking or QA can be carried out, but we’d need to find a way of doing the QA, creating the necessary documentation and associating it with the AIP halfway through the ingest process.



How are manually normalised files handled in Archivematica?

I’d been aware for a while that there is a workflow for manually normalised files in Archivematica and I was keen to see how it would work. Reading the documentation, I could see that the workflow allows for a number of different approaches (for example, normalising files before ingest or at a later date) but is also quite specific about the assumptions it is based on.

There is an assumption that the names of the original and normalised files will be identical (apart from the file extension, which will have changed). However, there is a workaround which allows you to include a CSV file with your transfer. The CSV file should describe how the originals relate to the preservation and/or access files. Given the filename issues described above, this was something I would need to include.
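As an illustration (not a tested transfer), a CSV of this kind could be generated with a few lines of Python. The filenames below are invented, and the exact column headings, the location of normalization.csv within the transfer and the expected layout of the manualNormalization folders should all be checked against the Archivematica documentation for the version in use; the sketch simply shows the original-to-access-to-preservation mapping the file needs to express.

```python
import csv

# Invented mapping: each original WordStar file alongside its manually
# created access and preservation derivatives, whose names differ from
# the original (which is exactly why the CSV is needed).
mapping = [
    ("EPISODE1", "EPISODE1.pdf", "EPISODE1.docx"),
    ("EPISODE1.BAK", "EPISODE1_BAK.pdf", "EPISODE1_BAK.docx"),
]

with open("normalization.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # Check the documentation for the exact header row and whether paths
    # should be given relative to the transfer root.
    writer.writerow(["original", "access file", "preservation file"])
    writer.writerows(mapping)
```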

There is an assumption that you will know whether the migrated versions of files are intended for preservation or for access. This is a fair assumption – and very much in line with digital preservation thinking – but does it reflect the imperfect real world?

There is also an assumption that there will be no more than one preservation file and no more than one access file for each original (happy to be corrected on this if I am wrong).



Testing the manual normalisation workflow in Archivematica

Without a fully working test version of Archivematica to try things out on, my experimentation was limited. However, a friendly Archivematica guru (thanks Matthew) was able to push a couple of test files through for me and provide me with the AIP and the DIP to inspect.

The good news is that the basic workflow did work – we were able to push an already normalised ‘access’ file and ‘preservation’ file into Archivematica along with the original as part of the ingest process. The preservation files appeared in the AIP and the access files appeared in the DIP as expected.

We also investigated the workflow for adding additional metadata about the migration.

Archivematica creates PREMIS metadata as part of the ingest process, recording the details and outcomes of events (such as virus checks and file normalisations) that it carries out. The fact that Archivematica creates PREMIS events automatically has always been a big selling point for me. As I have mentioned before – who wants to create PREMIS by hand?

Where files are included using the manual normalisation workflow, Archivematica will always create a PREMIS event for the normalisation and if you set up your processing configuration in the right way, it will stop and prompt you to add additional information into the PREMIS eventDetail field. This is a good start but it would be great if a more detailed level of information could be included in the PREMIS metadata.

I wondered what would happen to all the documentation I had created. I concluded that the best way of keeping this documentation alongside the files would be to put it in the SubmissionDocumentation directory as described in the manual and submit it along with the transfer. This information will be stored with the AIP, but the link between the documentation and the normalised files may not be immediately apparent.

What I didn’t test was whether it is possible to push more than one access file into Archivematica using this workflow. I'm assuming that Archivematica will not support this scenario.


Some suggested improvements to the manual normalisation workflow in Archivematica

So, I wanted to highlight a few improvements that could be made.

  1. Allow the user to add PREMIS eventDetail in bulk – at the moment you have to enter the information one file at a time
  2. Allow the user to add information into more than one PREMIS field. Being able to add the actual date of the manual migration into the date field would be a good start. Being able to add an event outcome would also be helpful (by default event outcome is 'None' for files added using the manual normalisation workflow).
  3. Even better – allow the user to add this more detailed PREMIS information through a CSV import. Another spreadsheet containing information about the normalisation event could be included in the transfer and be automatically associated with the files and included in the METS (a hypothetical layout is sketched after this list).
  4. Allow Archivematica to accept more than one access or preservation file
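To make suggestion 3 more concrete, the sketch below shows the sort of spreadsheet I have in mind. It is entirely hypothetical - no such import currently exists in Archivematica - but it illustrates the PREMIS event information (date, detail, outcome) that could usefully be supplied in bulk for manually normalised files.

```python
import csv

# Hypothetical layout for a bulk PREMIS event import - a suggestion only,
# not a feature Archivematica currently supports. One row per manually
# normalised file.
events = [
    {
        "file": "EPISODE1.docx",
        "event_type": "normalization",
        "event_date": "2018-09-14",
        "event_detail": "Manual migration from WordStar 4 to DOCX; "
                        "see submission documentation for QA notes",
        "event_outcome": "successful",
    },
]

with open("premis_events.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(events[0].keys()))
    writer.writeheader()
    writer.writerows(events)
```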


In conclusion

Testing shows that Archivematica's workflow for manual normalisation will work for a standard scenario and will cope well with changes in filenames (with the addition of a normalization.csv file). However, do not assume it will currently handle more complex scenarios.

I accept that it is impossible for a digital preservation system to do everything we might want it to do. It can’t be flexible and adaptable to every imperfect use case. As a baseline it does have to know whether to put a manually migrated file into the DIP for access or AIP for preservation and it would perhaps not be fair to suggest it should cope with uncertainty.

In his forthcoming book, Trevor Owens makes the interesting point that "Specialized digital preservation tools and software are just as likely to get in the way of solving your digital preservation problems as they are to help." I do therefore wonder how we should address the gap between somewhat rigid, rule-based tools and the sometimes imperfect scenarios of the real world.


