File identification ...let's talk about the workflows

When I receive a new batch of files to add to the digital archive, there are lots of things I want to know about them, but "What file formats have we got here?" is often my first question.

Knowing what you've got is of great importance to digital archivists because...
  • It enables you to find the right software to open the file and view the contents (all being well)
  • It can trigger a dialog with your donor or depositor about alternative formats you might wish to receive the data in (...all not being well)
  • It allows you to consider the risks that relate to that format and if appropriate define a migration pathway for preservation and/or access

We've come a long way in the last few years and we now have lots of tools to choose from to identify files. This could be seen as both a blessing and a curse. Each tool has strengths and weaknesses, and it is not easy to decide which one to use (or indeed which combination of tools would give the best results). And once we've started using a tool, how do we actually use it?

So currently I have more questions about workflows - how do we use these tools and at what points do we interact with them or take manual steps?

Where file format identification tools are used in isolation, we can do what we want with the results. Where multiple identifications are given, we may be able to gather further evidence to convince us what the file actually is. Where there is no identification given, we may decide we can assign an identification manually. However, where file identification tools are incorporated into larger digital preservation systems, the workflow will be handled by the system and the digital archivist will only be able to interact in ways that have been configured by the developers.
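
To make those three situations concrete, here is a minimal sketch in Python. It assumes the identification tool's output has already been collected into a simple mapping from file path to a list of format identifiers. The function name `triage`, the file names and the PUID-style strings are purely illustrative and do not come from any particular tool.

```python
def triage(results):
    """Sort files into identified, ambiguous and unidentified groups.

    `results` is an assumed structure: file path -> list of format ids
    returned by whatever identification tool was run.
    """
    groups = {"identified": [], "ambiguous": [], "unidentified": []}
    for path, ids in sorted(results.items()):
        unique = sorted(set(ids))
        if len(unique) == 1:
            groups["identified"].append(path)    # one clear answer
        elif unique:
            groups["ambiguous"].append(path)     # needs further evidence
        else:
            groups["unidentified"].append(path)  # candidate for manual id
    return groups


# Hypothetical tool output, for illustration only.
sample = {
    "report.pdf": ["fmt/18"],
    "notes.doc": ["fmt/39", "fmt/40"],
    "data.mat": [],
}
groups = triage(sample)
for name, paths in groups.items():
    print(name, paths)
```

Used in isolation like this, we are free to decide what happens to each group. Inside a larger system, that decision is only ours if the developers have made it configurable.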

As part of our Jisc-funded "Filling the Digital Preservation Gap" project, one of the areas we are developing is file identification within Archivematica. This was seen as a development priority because our project is looking specifically at research data, and research data comes in a huge array of file formats, many of which will not currently be recognised by file format identification tools.

The project team...discussing file identification workflows...probably

Here are some of the questions we've been exploring:
  • What should happen if you ingest data that can't be identified? Should you get notification of this? Should you be offered the option to try other file id methods/tools for those non-identified files?
  • Should we allow the curator/digital archivist to override file identifications, e.g. "I know this isn't really xxxx format so I'm going to record this fact" (and record this manual intervention in the metadata)? Can you envisage ever wanting to do this?
  • Where a tool gives more than one possible identification, should you be allowed to select which identification you trust, or should the metadata just keep a record of all the possible identifications?
  • Where a file is not identified at all, should you have the option to add a manual identification? If there is no Pronom id for a file (because it isn't yet in Pronom) how would you record the identification? Would it simply be a case of writing "MATLAB file" for example? How sustainable is this?
  • How should you share info around file formats/file identifications with the wider digital preservation community? What is the best way to contribute to file format registries such as Pronom?
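
One way to picture a couple of these options is as a small record that keeps every identification the tools suggested and logs any manual decision alongside it. The Python sketch below is only a thought experiment: the class `Identification`, its fields and the `override` method are hypothetical names, and it does not reflect how Archivematica or any other system actually stores this metadata.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Identification:
    """Hypothetical record of what we know about one file's format."""
    path: str
    tool_ids: List[str] = field(default_factory=list)  # every id the tools gave
    chosen_id: Optional[str] = None      # registry id we trust, if one exists
    chosen_label: Optional[str] = None   # free text, e.g. "MATLAB file"
    events: List[dict] = field(default_factory=list)   # manual interventions

    def override(self, agent, label, format_id=None, note=""):
        """Record a manual identification without discarding tool output."""
        self.events.append({
            "type": "manual identification",
            "agent": agent,
            "when": datetime.now(timezone.utc).isoformat(),
            "previous_id": self.chosen_id,
            "new_id": format_id,
            "new_label": label,
            "note": note,
        })
        self.chosen_id = format_id
        self.chosen_label = label


# A file the tools could not identify, labelled by hand.
record = Identification(path="data.mat")
record.override(agent="digital archivist", label="MATLAB file",
                note="No registry id available; free-text label only.")
print(record.chosen_label, len(record.events))
```

A free-text label like this is easy to record, but it is only as consistent as the people typing it, which is exactly the sustainability worry raised above.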

We've been talking to people but don't necessarily have all the answers just yet. Thanks to everyone who has been feeding into our discussions so far! The key point to make here is that perhaps there isn't really a right answer: our systems need to be configurable enough that different institutions can work in different ways depending on local policies. It seems fairly obvious that this is quite a big nut to crack and it isn't something that we can fully resolve within our current project.

For the time being, our Archivematica development work is focusing in the first instance on allowing the digital curator to see a report of the files that are not identified, as a prompt for working out how to handle them. This will be an important step towards helping us to understand the problem. Watch this space for further information.
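
As a rough illustration of what such a report might contain (and not a description of the Archivematica feature itself), the Python sketch below counts unidentified files by extension. It assumes the same made-up structure of file path mapped to a list of format identifiers; `unidentified_report` and the sample paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def unidentified_report(results):
    """Summarise unidentified files by extension, most common first.

    `results` is an assumed structure: file path -> list of format ids.
    An empty list means the tool returned no identification.
    """
    counts = Counter()
    for path, ids in results.items():
        if not ids:
            ext = PurePosixPath(path).suffix.lower() or "(no extension)"
            counts[ext] += 1
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "\n".join(f"{ext}\t{n}" for ext, n in ordered)


# Hypothetical ingest results, for illustration only.
ingest = {
    "a/run1.mat": [],
    "a/run2.MAT": [],
    "b/readme": [],
    "c/paper.pdf": ["fmt/18"],
}
report = unidentified_report(ingest)
print(report)
```

Even a summary this simple shows at a glance which formats are slipping through, and so which ones might be worth investigating first.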

Jenny Mitcham, Digital Archivist

