How would you change Archivematica's Format Policy Registry?
|A train trip through snowy Shropshire to get to Aberystwyth|
This was the first time the group had visited Wales and we celebrated with a night out at a lovely restaurant on the evening before our meeting. Our visit also coincided with the National Library cafe’s Christmas menu so we were treated to a generous Christmas lunch (and crackers). Thanks NLW!
As usual the meeting covered an interesting range of projects and perspectives from Archivematica users in the UK and beyond. As usual there was too much to talk about and not nearly enough time. Fortunately this took my mind off the fact I had damp feet for most of the day.
This post focuses on a discussion we had about Archivematica's Format Policy Registry or FPR. The FPR in Archivematica is a fairly complex beast, but a crucial tool for the 'Preservation Planning' step in digital archiving. It is essentially a database which allows users to define policies for handling different file formats (including the actions, tools and settings to apply to specific file types for the purposes of preservation or access). The FPR comes ready populated with a set of rules based on agreed best practice in the sector, but institutions are free to change these and add new tools and rules to meet their own requirements.
Jake Henry from the National Library of Wales kicked off the discussion by telling us about some work they had done to make the thumbnail generation for pdf files more useful. Instead of supplying a generic thumbnail image for all pdfs they wanted the thumbnail to actually represent the file in question. They changed the FPR so that pdf thumbnail generation uses Ghostscript.
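To give a flavour of what a Ghostscript-based thumbnail command might look like, here is a minimal Python sketch. It is not NLW's actual FPR command (which wasn't shared in detail). The function names, the 72 dpi default and the choice of JPEG output are my own assumptions, and it assumes the `gs` executable is on the path.

```python
import subprocess


def ghostscript_thumbnail_command(pdf_path, thumb_path, dpi=72):
    """Build a Ghostscript command that renders page 1 of a PDF as a JPEG.

    Hypothetical helper for illustration only, not part of Archivematica.
    """
    return [
        "gs",
        "-dNOPAUSE",       # don't pause between pages
        "-dBATCH",         # exit when processing is finished
        "-dSAFER",         # restrict file system access
        "-sDEVICE=jpeg",   # write a JPEG image
        "-dFirstPage=1",   # only render the first page...
        "-dLastPage=1",    # ...so the thumbnail shows the cover
        f"-r{dpi}",        # low resolution is enough for a thumbnail
        f"-sOutputFile={thumb_path}",
        str(pdf_path),
    ]


def make_thumbnail(pdf_path, thumb_path, dpi=72):
    """Run Ghostscript; raises CalledProcessError if gs reports a failure."""
    subprocess.run(
        ghostscript_thumbnail_command(pdf_path, thumb_path, dpi),
        check=True,
    )
```

Calling `make_thumbnail("report.pdf", "thumb.jpg")` would then produce a thumbnail of the first page of that particular file rather than a generic icon.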
NLW liked the fact that Archivematica converted pdf files to pdf/a but also wanted that same normalisation pathway to apply to existing pdf/a files. Just because a pdf/a file is already in a preservation file format it doesn’t mean it is a valid file. By also putting pdf/a files through a normalisation step they had more confidence that they were creating and preserving pdf/a files with some consistency.
|Sea view from our meeting room!|
Discussion also touched on the subject of those files that are not identified. A file needs to be identified before an FPR rule can be set up for it. Ensuring files are identified in the first instance was seen to be a crucial step. Even once a format makes its way into PRONOM (TNA’s database of file formats) Artefactual Systems have to carry out extra work to get Archivematica to pick up that new PUID.
Unfortunately normalisation tools do not exist for all files and in many cases you may just have to accept that a file will stay in the format in which it was received. For example a Microsoft Word document (.doc) may not be an ideal preservation format but in the absence of open source command line migration tools we may just have to accept the level of risk associated with this format.
Moving on from this, we also discussed manual normalisations. This approach may be too resource intensive for many (particularly those of us who are implementing automated workflows) but others would see this as an essential part of the digital preservation process. I gave the example of the WordStar files I have been working with this year. These files are already obsolete and though there are other ways of viewing them, I plan to migrate them to a format more suitable for preservation and access. This would need to be carried out outside of Archivematica using the manual normalisation workflow. I haven’t tried this yet but would very much like to test it out in the future.
I shared some other examples that I'd gathered outside the meeting. Kirsty Chatwin-Lee from the University of Edinburgh had a proactive approach to handling the FPR on a collection by collection and PUID by PUID basis. She checks all of the FPR rules for the PUIDs she is working with as she transfers a collection of digital objects into Archivematica and ensures she is happy before proceeding with the normalisation step.
Back in October I'd tweeted to the wider Archivematica community to find out what people do with the FPR and had a few additional examples to share. These included using Unoconv to convert office documents and creating PDF access versions of Microsoft Word documents. We also looked at some more detailed preservation planning documentation that Robert Gillesse from the International Institute of Social History had shared with the group.
We had a discussion about the benefits (or not) of normalising a compressed file (such as a JPEG) to an uncompressed format (such as TIFF). I had already mentioned in my presentation earlier that this default migration rule was turning 5GB of JPEG images into 80GB of TIFFs - and this is without improving the quality or the amount of information contained within the image. The same situation would apply to compressed audio and video which would increase even more in size when converted to an uncompressed format.
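The arithmetic behind that growth is easy to sketch. The 5GB and 80GB figures are the ones from my presentation; the 4000 x 3000 pixel image is a made-up example, and the calculation ignores the small overhead of TIFF headers and metadata.

```python
def uncompressed_size_bytes(width, height, channels=3, bits_per_channel=8):
    """Approximate size of the raw pixel data in an uncompressed image."""
    return width * height * channels * bits_per_channel // 8


def growth_factor(before_gb, after_gb):
    """How many times larger the normalised files are than the originals."""
    return after_gb / before_gb


# The figures quoted above: 5GB of JPEGs became 80GB of TIFFs.
print(growth_factor(5, 80))                 # 16.0

# A hypothetical 4000 x 3000 pixel, 24-bit colour photograph.
print(uncompressed_size_bytes(4000, 3000))  # 36000000 bytes, roughly 36MB
```

In other words, every image takes up sixteen times the storage on average, and the uncompressed size depends only on the pixel dimensions and bit depth, not on how well the original JPEG happened to compress.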
If storage space is at a premium (or if you are running this as a service and charging for storage space used) this could be seen as a big problem. We discussed the reasons for and against leaving this rule in the FPR. It is true that we may have more confidence in the longevity of TIFFs and see them as more robust in the face of corruption, but if we are doing digital preservation properly (checking checksums, keeping multiple copies etc) shouldn't corruption be easily spotted and fixed?
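As a rough illustration of what "checking checksums" and "keeping multiple copies" buys us, here is a minimal Python sketch of a fixity check across several copies of the same file. This is not how Archivematica itself implements fixity checking. The function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_copies(copies, expected_sha256):
    """Return the copies whose checksum no longer matches the recorded one."""
    return [path for path in copies if sha256_of(path) != expected_sha256]
```

Any copy returned by `find_corrupt_copies` can be replaced from one that still matches the checksum recorded at ingest, which is the sense in which corruption should be "easily spotted and fixed" whatever the file format.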
Another reason we may migrate or normalise files is to restrict the file formats we are preserving to a limited set of known formats in the hope that this will lead to fewer headaches in the future. This would be a reason to keep on converting all those JPEGs to TIFFs.
The FPR is there to be changed, and given that not all organisations have exactly the same requirements it is not surprising that we are starting to tweak it here and there – if we don’t understand it, don’t look at it and don’t consider changing it, perhaps we aren’t really doing our jobs properly.
However there was also a strong feeling in the room that we shouldn’t all be re-inventing the wheel. It is incredibly useful to hear what others have done with the FPR and the rationale behind their decisions.
Hopefully it is helpful to capture this discussion in a blog post, but this isn’t a sustainable way to communicate FPR changes for the longer term. There was a strong feeling in the room that we need a better way of communicating with each other around our preservation planning - the decisions we have made and the reasons for those decisions. This feeling was echoed by Kari Smith (MIT Libraries) and Nick Krabbenhoeft (New York Public Library) who joined us remotely to talk about the OSSArcFlow project - so this is clearly an international problem! This is something that Jisc are considering as part of their Research Data Shared Service project so it will be interesting to see how this might develop in the future.
Thanks to the UK Archivematica group meeting attendees for contributing to the discussion and informing this blog post.
Jenny Mitcham, Digital Archivist