Testing manual normalisation workflows in Archivematica

This week I traveled to Warwick University for the UK Archivematica meeting. As usual, it was a really interesting day. I’m not going to blog about all of it (I’ll leave that to our host, Rachel MacGregor) but I will blog about the work I presented there.

Followers of my blog will be aware that I recently carried out some file migration work on a batch of WordStar 4 files from screenwriting duo Marks and Gran.

The final piece of work I wanted to carry out was to consider how we might move the original files along with the migrated versions of those files into Archivematica (if we were to adopt Archivematica in the future).

I knew the migration I had carried out was a bit of an odd one so I was particularly interested to see how Archivematica would handle it.

It was odd for a number of reasons.

1. Firstly, I ended up creating 3 different versions of each WordStar file – DOCX, PDF/A and ASCII TXT. After an assessment of the significant properties of the files (essentially, the features that I wanted to preserve) and some intense QA of the files, it was clear that all versions were imperfect. None of them preserved all the properties I wanted them to. They all had strengths and weaknesses and between them, they pretty much captured everything.

2. Secondly I wasn’t able to say whether the files I had created were for preservation or access purposes. Typically, a file migration process will make a clear distinction between those files that are for dissemination to users and those files that are preservation copies to keep behind the scenes.

After talking to my colleagues about the file migrations and discussing the pros and cons of the resulting files, it was agreed that we could potentially use any (or all) of the formats to provide access to future users, depending on user needs. I’m not particularly happy with any of the versions I created being a preservation files, given the fact that none of them capture all the elements of the file that I considered to be of value, but they may indeed need to become preservation versions in future if WordStar 4 files become impossible to read.

3. Thirdly the names of the migrated versions of the files did not exactly match up with the original files. The original WordStar files were created in the mid 1980’s. In this early period of home computing, file extensions appeared to be optional. The WordStar 4 manual actually suggests that you use the 3 characters available for the file extension to record additional information that won't fit in the 8 character filename.

For many of the WordStar files in this archive, there is no file extension at all. For other files, the advice in the manual has been followed, in that an extension has been used which gives additional context to the filename. For many of the files WordStar has also created a backup file (essentially an earlier version of the file is saved with a .BAK extension but it is still a Wordstar file). There are therefore scenarios where we need to save information about the original file extension in the migrated version in order to ensure that we don’t create filenaming conflicts and don’t lose information from the original filename.

Why not just normalise within Archivematica?

  • These WordStar files are not recognised by the file identification tools in Archivematica (I don’t think they are recognised by any file identification tools). The Format Policy Registry only works on those files that are identified for which there is a rule/policy set up
  • Even if they were identifiable, it would not be possible to replicate some of the manual steps we went through to create the migrated versions with command line tools called by Archivematica. 
  • As part of the migration process itself, several days were spent doing QA and checking of the migrated files against the originals as viewed in WordStar 4 and detailed documentation (to be stored alongside the files) was created. Archivematica does give you a decision point after a normalisation so that checking or QA can be carried out, but we’d need to find a way of doing the QA, creating the necessary documentation and associating it with the AIP half way through the ingest process.

How are manually normalised files handled in Archivematica?

I’d been aware for a while that there is a workflow for manually normalised files in Archivematica and I was keen to see how it would work. Reading the documentation,  it was clear that the workflow allows for a number of different approaches (for example normalising files before ingest or at a later date) but was also quite specific regarding the assumptions it is based on.

There is an assumption that the names of original and normalised files will be identical (apart from the file extension which will have changed). However, there is a workaround in place which allows you to include a csv file with your transfer. The csv file should provide information about how the originals are related to the preservation and/or access files. Given the filename issues described above, this was something I would need to include.

There is an assumption that you will know whether the migrated versions of files are intended for preservation or for access. This is a fair assumption – and very much in line with digital preservation thinking, but does it reflect the imperfect real world?

There is also an assumption that there will be no more than one preservation file and no more than one access file for each original (happy to be corrected on this if I am wrong).

Testing the manual normalisation workflow in Archivematica

Without a fully working test version of Archivematica to try things out on, my experimentation was limited. However, a friendly Archivematica guru (thanks Matthew) was able to push a couple of test files through for me and provide me with the AIP and the DIP to inspect.

The good news is that the basic workflow did work – we were able to push an already normalised ‘access’ file and ‘preservation’ file into Archivematica along with the original as part of the ingest process. The preservation files appeared in the AIP and the access files appeared in the DIP as expected.

We also investigated the workflow for adding additional metadata about the migration.

Archivematica creates PREMIS metadata as part of the ingest process, recording the details and outcomes of events (such as virus checks and file normalisations) that it carries out. The fact that Archivematica creates PREMIS events automatically has always been a big selling point for me. As I have mentioned before – who wants to create PREMIS by hand?

Where files are included using the manual normalisation workflow, Archivematica will always create a PREMIS event for the normalisation and if you set up your processing configuration in the right way, it will stop and prompt you to add additional information into the PREMIS eventDetail field. This is a good start but it would be great if a more detailed level of information could be included in the PREMIS metadata.

I wondered what would happen to all the documentation I had created? I concluded that the best way of keeping this documentation alongside the files would be by putting it in the SubmissionDocumentation directory as described in the manual and submitting along with the transfer. This information will be stored with the AIP but the link between the documentation and the normalised files may not be immediately apparent.

What I didn’t test was whether it is possible to push more than one access file into Archivematica using this workflow. I'm assuming that Archivematica will not support this scenario.

Some suggested improvements to the manual normalisation workflow in Archivematica

So, I wanted to highlight a few improvements that could be made.

  1. Allow the user to add PREMIS eventDetail in bulk – at the moment you have to enter the information one file at a time
  2. Allow the user to add information into more than one PREMIS field. Being able to add the actual date of the manual migration into the date field would be a good start. Being able to add an event outcome would also be helpful (by default event outcome is 'None' for files added using the manual normalisation workflow).
  3. Even better – allow the user to be able to add this more detailed PREMIS information through a csv import. Another spreadsheet containing information about the normalisation event could be included in the transfer and be automatically associated with the files and included in the METS.
  4. Allow Archivematica to accept more than one access or preservation file

In conclusion

Testing shows that Archivematica's workflow for manual normalisation will work for a standard scenario and will cope well with changes in filenames (with the addition of a normalization.csv file). However, do not assume it will currently handle more complex scenarios.

I accept that it is impossible for a digital preservation system to do everything we might want it to do. It can’t be flexible and adaptable to every imperfect use case. As a baseline it does have to know whether to put a manually migrated file into the DIP for access or AIP for preservation and it would perhaps not be fair to suggest it should cope with uncertainty.

In his forthcoming book, Trevor Owens makes the interesting point that "Specialized digital preservation tools and software are just as likely to get in the way of solving your digital preservation problems as they are to help." I do therefore wonder how we should address the gap between the somewhat rigid, rule-based tools and the sometimes imperfect real world scenarios?

Jenny Mitcham, Digital Archivist


Popular posts from this blog

How can we preserve Google Documents?

Preserving emails. How hard can it be?

Checksum or Fixity? Which tool is for me?