Some thoughts on pdf/a 3

As a digital archivist, I need to keep my ear to the ground with regard to new file formats, particularly when they are billed as being suitable for long-term preservation. This is why I attended a DPC event today on the new version of the pdf/a standard (version 3). With pdf/a the clue is in the name: the ‘a’ stands for ‘archive’.

The original pdf/a file format was the source of endless debate in my previous job at the Archaeology Data Service (see summary blog post). It is a format that we eventually embraced as an acceptable preservation format for documents deposited with us in standard pdf format. The self-contained nature of pdf/a also makes it an excellent format for providing online access to reports, with far greater longevity than standard pdf files, some of which were starting to produce error messages ten years after deposit. Again there is a related blog post on this issue, a problem I was grappling with in my last couple of months working at the ADS (not the cause of my leaving, I might add!).

Today’s event was very useful, giving me enough background information about the new format to feel I could now hold my own in a discussion of its pros and cons.

The main difference between pdf/a 3 and previous versions of the standard is the ability to include embedded objects. You can, for example, include the raw data that sits behind one of the graphs in your report, the original MS Word document that you created your pdf/a file from, or an alternative version of the report (an audio file, for example). The relationship of the embedded object to the pdf/a file is recorded in the associated metadata (whether it is data, source or alternative).

It is easy to see the benefits of this. However, the objects that you embed can be in any format and may therefore not be suitable for preservation. This creates a headache for digital archivists, as any file deposited in an archive in pdf/a 3 format would then have to be assessed for the presence of embedded files, and a separate check on both their value and longevity would need to be made. It was stated in the briefing today that material with long-term archival value should not be embedded in a pdf/a file. This would be a difficult concept to express to our donors and depositors when negotiating submission of their data into our archives.

I cannot currently imagine a situation where I would want to embed data within a pdf file in a preservation context. Holding each element as a separate file, with metadata that explains the relationships between them, would always be my preferred option.
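
As a rough illustration of what "separate files with metadata" might look like in practice, here is a minimal Python sketch that records a report alongside its related files in a simple manifest. The manifest layout, the field names and the `build_manifest` function are all hypothetical and not taken from any standard or archive system; the relationship labels simply reuse the three mentioned above (data, source, alternative).

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Relationship labels reuse the three mentioned in the post;
# the manifest structure itself is an assumption made for illustration.
RELATIONSHIPS = {"data", "source", "alternative"}


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(report: Path, related: dict[Path, str]) -> dict:
    """Describe a report and the separate files that relate to it."""
    entries = []
    for path, relationship in related.items():
        if relationship not in RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {relationship}")
        entries.append({
            "file": path.name,
            "relationship": relationship,
            "sha256": sha256_of(path),
        })
    return {
        "report": {"file": report.name, "sha256": sha256_of(report)},
        "related_files": entries,
    }


if __name__ == "__main__":
    # Demo with throwaway files; real use would point at deposited files.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        report = base / "report.pdf"
        report.write_bytes(b"placeholder report")
        table = base / "graph-data.csv"
        table.write_bytes(b"year,count\n2012,10\n")
        manifest = build_manifest(report, {table: "data"})
        print(json.dumps(manifest, indent=2))
```

Each file keeps its own checksum, so its format and fixity can be assessed independently, while the manifest carries the relationships that pdf/a 3 would otherwise hold inside the pdf itself.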

The only use case I could envisage right now for pdf/a version 3 would be as a future dissemination option, allowing a user to download a report with its associated data (as embedded files) as a single bundle. Whether this would have any major benefits over the use of zip files I am not sure. Before this happens, tools for creating, reading and editing files of this nature would need to be widely and freely available, allowing pdf/a 3 to become mainstream. I know from my previous job that educating depositors about the benefits of creating pdf/a files over standard pdf files was a long process, and my concern is that this new standard might give us even more explaining to do. Rather than advising depositors that pdf/a is simply ‘A Good Thing’, we may need to add caveats about which version they should use, or advice on the format of embedded objects. Confusion may well ensue!

Jenny Mitcham, Digital Archivist

