Pages

Friday, 15 December 2017

How would you change Archivematica's Format Policy Registry?

A train trip through snowy Shropshire to get to Aberystwyth
This week the UK Archivematica user group fought through the snow and braved the winds and driving rain to meet at the National Library of Wales in Aberystwyth.

This was the first time the group had visited Wales and we celebrated with a night out at a lovely restaurant on the evening before our meeting. Our visit also coincided with the National Library cafe’s Christmas menu so we were treated to a generous Christmas lunch (and crackers) at lunch time. Thanks NLW!

As usual the meeting covered an interesting range of projects and perspectives from Archivematica users in the UK and beyond. As usual there was too much to talk about and not nearly enough time. Fortunately this took my mind off the fact I had damp feet for most of the day.

This post focuses on a discussion we had about Archivematica's Format Policy Registry or FPR. The FPR in Archivematica is a fairly complex beast, but a crucial tool for the 'Preservation Planning' step in digital archiving. It is essentially a database which allows users to define policies for handling different file formats (including the actions, tools and settings to apply to specific file type for the purposes of preservation or access). The FPR comes ready populated with a set of rules based on agreed best practice in the sector, but institutions are free to change these and add new tools and rules to meet their own requirements.

Jake Henry from the National Library of Wales kicked off the discussion by telling us about some work they had done to make the thumbnail generation for pdf files more useful. Instead of supplying a generic thumbnail image for all pdfs they wanted the thumbnail to actually represent the file in question. They made changes to the FPR to change the pdf thumbnail generation to use GhostScript.

NLW liked the fact that Archivematica converted pdf files to pdf/a but also wanted that same normalisation pathway to apply to existing pdf/a files. Just because a pdf/a file is already in a preservation file format it doesn’t mean it is a valid file. By also putting pdf/a files through a normalisation step they had more confidence that they were creating and preserving pdf/a files with some consistency.

Sea view from our meeting room!
Some institutions had not had any time to look in any detail at the default FPR rules. It was mentioned that there was trust in how the rules had been set up by Artefactual and that people didn’t feel expert enough to override these rules. The interface to the FPR within Archivematica itself is also not totally intuative and requires quite a bit of time to understand. It was mentioned that adding a tool and a new rule for a specific file format in Archivematica is quite an complex task and not for the faint hearted...!

Discussion also touched on the subject of those files that are not identified. A file needs to be identified before a FPR rule can be set up for it. Ensuring files are identified in the first instance was seen to be a crucial step. Even once a format makes its way into PRONOM (TNA’s database of file formats) Artefactual Systems have to carry out extra work to get Archivematica to pick up that new PUID.

Unfortunately normalisation tools do not exist for all files and in many cases you may just have to accept that a file will stay in the format in which it was received. For example a Microsoft Word document (.doc) may not be an ideal preservation format but in the absence of open source command line migration tools we may just have to accept the level of risk associated with this format.

Moving on from this, we also discussed manual normalisations. This approach may be too resource intensive for many (particularly those of us who are implementing automated workflows) but others would see this as an essential part of the digital preservation process. I gave the example of the WordStar files I have been working with this year. These files are already obsolete and though there are other ways of viewing them, I plan to migrate them to a format more suitable for preservation and access. This would need to be carried out outside of Archivematica using the manual normalisation workflow. I haven’t tried this yet but would very much like to test it out in the future.

I shared some other examples that I'd gathered outside the meeting. Kirsty Chatwin-Lee from the University of Edinburgh had a proactive approach to handling the FPR on a collection by collection and PUID by PUID basis. She checks all of the FPR rules for the PUIDs she is working with as she transfers a collection of digital objects into Archivematica and ensures she is happy before proceding with the normalisation step.

Back in October I'd tweeted to the wider Archivematica community to find out what people do with the FPR and had a few additional examples to share. For example, using Unoconv to convert office documents and creating PDF access versions of Microsoft Word documents. We also looked at some more detailed preservation planning documentation that Robert Gillesse from the International Institute of Social History had shared with the group.

We had a discussion about the benefits (or not) of normalising a compressed file (such as a JPEG) to an uncompressed format (such as TIFF). I had already mentioned in my presentation earlier that this default migration rule was turning 5GB of JPEG images into 80GB of TIFFs - and this is without improving the quality or the amount of information contained within the image. The same situation would apply to compressed audio and video which would increase even more in size when converted to an uncompressed format.

If storage space is at a premium (or if you are running this as a service and charging for storage space used) this could be seen as a big problem. We discussed the reasons for and against leaving this rule in the FPR. It is true that we may have more confidence in the longevity of TIFFs and see them as more robust in the face of corruption, but if we are doing digital preservation properly (checking checksums, keeping multiple copies etc) shouldn't corruption be easily spotted and fixed?

Another reason we may migrate or normalise files is to restrict the file formats we are preserving to a limited set of known formats in the hope that this will lead to less headaches in the future. This would be a reason to keep on converting all those JPEGs to TIFFs.

The FPR is there to be changed and being that not all organisations have exactly the same requirements it is not surprising that we are starting to tweak it here and there – if we don’t understand it, don’t look at it and don’t consider changing it perhaps we aren’t really doing our jobs properly.

However there was also a strong feeling in the room that we shouldn’t all be re-inventing the wheel. It is incredibly useful to hear what others have done with the FPR and the rationale behind their decisions.

Hopefully it is helpful to capture this discussion in a blog post, but this isn’t a sustainable way to communicate FPR changes for the longer term. There was a strong feeling in the room that we need a better way of communicating with each other around our preservation planning - the decisions we have made and the reasons for those decisions. This feeling was echoed by Kari Smith (MIT Libraries) and Nick Krabbenhoeft (New York Public Library) who joined us remotely to talk about the OSSArcFlow project - so this is clearly an international problem! This is something that Jisc are considering as part of their Research Data Shared Service project so it will be interesting to see how this might develop in the future.

Thanks to the UK Archivematica group meeting attendees for contributing to the discussion and informing this blog post.

2 comments:

  1. Hi Jenny, thanks for this blog! Totally agree that we should share our FPR rules/file format policies and discuss the rationale behind them. I promise that we shall publish ours very soon. Also this issue is of course broader than the Archivematica community alone. We could have the same discussion with our Preservica user friends I guess.

    We have been reviewing our (IISH) file format policy (for born digital material) recently and changed quite a bit. One interesting discussion was about packaged files (as ZIP or RAR): should we preserve the packaged file as a sort of "original file" and extract them as well or should we simply extract these files and throw the original ZIP's away? The first case seems the most elegant but will also consume a lot of extra storage space. We have decided for now to take the latter route: unpack the files and throw away the original packaged files. With extra argument that the original packaged file can also be found in the METS. So you could - the basis of the METS file - always reconstruct the original ZIP or RAR.

    Another question for us are the video files. As these file formats are prone to a lot of change (especially in the codecs used within the container files) we see a clear case for normalisation. But also here the dilemma is that of storage space. Video files can be really big. Also normalizing them will most probably take up a lot of processing time. For now we have decided to go the normalizing route and see in practice (as we are only starting out with our born digital collections) what we will end up with.

    In short: The IISH file format policy is still a theoretical document and will have to prove itself in a real archiving workflow. Begin 2018 we will start with archiving our born digital so then we will really see the consequences of our choices...

    ReplyDelete
    Replies
    1. Thanks for your thoughts Robert. Re your zip file discussions, what we will be doing here (for research data at least) is unzipping the files but leaving them in a directory that reflects the name of the original package. I believe our researchers see the zip format as just a convenient way of packaging up their data to be sent to us rather than as a significant part of their dataset that needs to be preserved. Don't know if that helps you? Like you, this is largely theoretical for us at the moment. Would be keen to find out how you get on next year!

      Delete