
Thursday, 2 May 2013

Some thoughts about our preservation policy

We are updating the Borthwick Institute's preservation policy. Originally written in 2007, it is due for review by our preservation archivist and conservation team but my interest in this task is to ensure that digital archives are represented within the policy. In the current policy there is no mention of the preservation of digital archives. There was no need for us to mention them six years ago but times are changing - this has now become a priority for us.

It was interesting having the opportunity to sit down and read our existing preservation policy. The only other preservation policies I had ever read up to this point (not that many of them I might add) related purely to digital material so had quite a different emphasis. As I am still quite new to the world of traditional archives, reading about everything that is in place to protect and preserve the documents in our care is an education for me.

The first question in my head is whether to crack on with this now or wait until such a time that the structure of the digital archive is more firmly in place. By getting the policy for the preservation of digital material written now we are jumping the gun a bit as procedures are not yet established. However by writing a policy at this point we are at least setting out our intentions and providing an overview of how we will approach the preservation of digital material. There are certain things I am confident we will be doing. The finer detail of how this will occur will follow in time and may be incorporated into a more detailed preservation strategy document in the future.

The second decision to make regarding the revised preservation policy was whether we integrate the digital within the current policy or create a separate document. Both approaches are legitimate and widely used. For the Borthwick we have decided that an integrated policy is the way forward. Ultimately, we want our systems for receiving, managing and providing access to archives to be seamless and media blind. It shouldn't matter to our depositors or users whether the media is digital or not: they should be confident in our ability to preserve the archives and provide access to them regardless. Digital preservation should not be a specialist outpost; it should be fully integrated in the psyche of both our staff and our users.

There are differences in the way we preserve physical and digital archives:

  • With physical archives (paper and parchment) it is most important for the material to be appropriately packaged and stored in the correct environmental conditions. Once the conditions are right, they can be largely left alone. Intervention should only be required where specific issues occur.
  • With digital archives it is all about continuous active management. The digital environment is fast moving and the threat of obsolescence is never far away. Leaving the data alone in a static environment is a very risky approach.
Despite these differences, the basic premise of preservation is the same. What is highlighted in our current preservation policy is the idea of "preservation for access". This is why we are all here after all. Whether the material is physical or digital, we need to ensure that we preserve it so that others may access it in the future.



Wednesday, 24 April 2013

Preservation metadata


Yesterday I attended the DPC event on preservation metadata which focused on the PREMIS and METS standards. Coming at a time when I am reviewing various systems and software for managing digital archives and research data this was quite timely. The main message I took away from the day is that the key to actually creating and managing this metadata as ever is having the necessary tools!

Jelly Beans by kayaker1204, on Flickr
 - lots of these were consumed yesterday!
Earlier this year with the help of an intern we collected together digital media deposited with the Borthwick Institute and got the data safely stored on our digital archive file store. For the time being, the metadata I am holding about these files is pretty limited (I am currently working to level one of the NDSA Levels of Digital Preservation – metadata comes further down the line so I’m temporarily off the hook!).

I do have some technical metadata for files (MD5 checksums and output from DROID). However, the need to have a system in place to hold information about preservation actions is becoming more pressing. Working on the premise that doing something is better than doing nothing, there have been certain files that I have already been compelled to migrate to file formats more suitable for long term archiving. Documenting these kinds of actions is an essential part of any digital archivist's job but my problem at the moment is that I have no proper system in which to hold this information.
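For anyone wondering what the checksum part of this looks like in practice, here is a minimal sketch in Python. It is not the tool we actually use; the function names (`md5_of_file`, `build_manifest`, `changed_files`) are invented for illustration. It builds a manifest of MD5 checksums for a folder and then compares two manifests to flag files that have changed or gone missing.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its MD5 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files whose checksum differs in, or is missing from, the new manifest."""
    return sorted(
        name
        for name, checksum in old_manifest.items()
        if new_manifest.get(name) != checksum
    )
```

Running `build_manifest` when files are first stored, and again at intervals, gives two dictionaries that `changed_files` can compare. An empty list suggests that nothing recorded in the first manifest has been altered or lost.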

The benefits of using PREMIS are clear. As a couple of the speakers at yesterday's event stated, it provides a metadata schema that includes all things “that most working preservation repositories are likely to need to know to support digital preservation”. It is flexible enough that you can pick and choose from the elements, using only those that you require. If this is the case then I do not see a reason to re-invent the wheel and design a new schema to record the technical metadata and preservation actions that I need to record.

So what is stopping me from starting to use this today? Firstly a confession – I like databases. Given the choice I would much rather work with databases than XML. This isn’t a problem in itself. Rob Sharpe from Tessella explained that a system can be externally PREMIS compliant even if PREMIS isn’t used internally. So long as the metadata recorded by an archive can be mapped to PREMIS then PREMIS XML could be exported and made available to others.
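To make the "database inside, PREMIS outside" idea a little more concrete, here is a rough sketch in Python using SQLite. The table and column names are my own invention, not taken from any real system. The XML element names are intended to follow the PREMIS 2.x event entity, but the output has not been validated against the schema and should be checked against the data dictionary before being relied on.

```python
import sqlite3
import xml.etree.ElementTree as ET

PREMIS_NS = "info:lc/xmlns/premis-v2"  # namespace used by PREMIS 2.x
ET.register_namespace("premis", PREMIS_NS)


def create_store():
    """Create an in-memory database with a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE preservation_event (
               event_id TEXT PRIMARY KEY,
               event_type TEXT NOT NULL,
               event_datetime TEXT NOT NULL,
               detail TEXT,
               outcome TEXT,
               object_id TEXT NOT NULL)"""
    )
    return conn


def event_to_premis(row):
    """Map one database row to a PREMIS-style event element."""

    def child(parent, tag, text=None):
        element = ET.SubElement(parent, f"{{{PREMIS_NS}}}{tag}")
        if text is not None:
            element.text = text
        return element

    event_id, event_type, when, detail, outcome, object_id = row
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    identifier = child(event, "eventIdentifier")
    child(identifier, "eventIdentifierType", "local")
    child(identifier, "eventIdentifierValue", event_id)
    child(event, "eventType", event_type)
    child(event, "eventDateTime", when)
    child(event, "eventDetail", detail)
    outcome_info = child(event, "eventOutcomeInformation")
    child(outcome_info, "eventOutcome", outcome)
    linked = child(event, "linkingObjectIdentifier")
    child(linked, "linkingObjectIdentifierType", "local")
    child(linked, "linkingObjectIdentifierValue", object_id)
    return event


def export_events(conn):
    """Serialise every stored event as an XML string, oldest first."""
    rows = conn.execute(
        "SELECT event_id, event_type, event_datetime, detail, outcome, object_id "
        "FROM preservation_event ORDER BY event_datetime"
    )
    return [ET.tostring(event_to_premis(row), encoding="unicode") for row in rows]
```

The point of the sketch is only that the mapping is mechanical: as long as each column has an obvious PREMIS counterpart, the export can be written once and the day-to-day work can stay in the database.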

Following on from this, it should not really matter whether you prefer databases or XML so long as you have the right tools for the job. A preservation system in which much of the preservation metadata is automatically generated and where a user-friendly interface is provided for the creation and editing of preservation metadata would be the ideal. How the data is stored behind the scenes is not so important as long as you know you are storing the right bits of information.

Fortunately there are several digital preservation tools into which PREMIS is incorporated. Some of these such as Rosetta and Archivematica are already on my list of systems to look at. Armed with the knowledge I gained yesterday (and a pocket full of jelly beans) I am in a good position to push forward with my investigations.

Wednesday, 13 March 2013

Some thoughts on pdf/a 3


As a digital archivist, I need to keep my ear to the ground with regard to new file formats, particularly when they are billed as being suitable for long term preservation. This is why I attended a DPC event today on the new version of the pdf/a standard (version 3). With pdf/a the clue is in the name, the ‘a’ stands for ‘archive’.

The original pdf/a file format was one that was the source of endless debate in my previous job at the Archaeology Data Service (see summary blog post). It is a format that we eventually embraced as an acceptable preservation format for documents deposited with us in standard pdf format. The self-contained nature of pdf/a also provides an excellent format for providing on-line access to reports, having far greater longevity than standard pdf files, some of which were starting to produce error messages ten years after deposit – again there is a related blog post on this issue – a problem I was grappling with in my last couple of months working at the ADS (not the cause of my leaving I might add!)

Today’s event was very useful, giving me enough background information about the new format to feel I could now hold my own in a discussion of its pros and cons.

The main difference between pdf/a 3 and previous versions of the standard is the ability to include embedded objects. You can for example include the raw data that sits behind one of the graphs in your report, the original MS Word document that you created your pdf/a file from, or an alternative version of the report (an audio file for example). The relationship of the embedded object to the pdf/a file will be recorded in the associated metadata (whether it is data, source or alternative).

It is easy to see the benefits of this. However, the objects that you embed can be in any format and may therefore not be in a suitable format for preservation. This provides a headache for digital archivists as any file that was deposited in an archive in pdf/a 3 format would then have to be assessed for the presence of embedded files and a separate check on both their value and longevity would need to be made. It was stated in the briefing today that material with long term archival value should not be embedded in a pdf/a file. This would be a difficult concept to express to our donors and depositors when negotiating submission of their data into our archives.

I cannot currently imagine a situation where I would want to embed data within a pdf file in a preservation context. Having each element as separate files with metadata that explains the relationships between them would always be my preferred option. 

The only use case I could envisage right now for pdf/a version 3 would be as a future dissemination option, allowing a user to download a report with associated data (as embedded files) as a single bundle. Whether this would have any major benefits over the use of zip files I am not sure. Before this happens, tools for creating, reading and editing files of this nature would need to be widely and freely available, allowing pdf/a 3 to become mainstream. I know from my previous job that educating depositors about the benefits of creating pdf/a files over standard pdf files was a long process and my concern is that this new standard might give us even more explaining to do. Rather than advising depositors that pdf/a is simply ‘A Good Thing’, we may need to add on caveats relating to which version they should use or advice on the format of embedded objects. Confusion may well ensue!

Friday, 15 February 2013

In praise of Quick View Plus

I just wanted to sing the praises of a great little bit of software that we have been using here to help us recover old files.

Quick View Plus is a simple tool for viewing lots of different file formats. As we have been working through boxes of old floppy disks we have found many files that we can read in modern software that we have installed on our PCs (old versions of WordPerfect for example) but many more that we cannot view. Rather than purchasing and installing lots of different bits of software to read all of our obsolete Paradox databases and Microsoft Works spreadsheets from the 1990s, it has proved to be far more efficient just to purchase a single copy of Quick View Plus Standard Edition. This software allows you to view the contents of over 300 different file formats (often with many different versions of each format) and is a really useful tool for anyone who needs to view or extract content from a wide range of file formats. Though it is purely a tool for viewing files (rather than migrating files), it does allow the contents of the files to be copied and then pasted into a different application.

From one of the boxes of floppy disks from the office that we had been looking at as part of our digital rescue warm-up mission, it was decided that about 300 files were worthy of rescue. The range of formats of these files (dating from between 1990 and 2004) is shown in the graph below.



File identification was carried out using DROID from the National Archives (another great little tool!). The files listed as 'Not recognised' included FileMaker Pro and others with a qic extension which appeared to be some sort of backup file. We have shared some of these unrecognised formats with the DROID and PRONOM team so that they can incorporate them into future versions where appropriate.

Anyway, the good news is that Quick View Plus was successful at viewing pretty much all of the files that we needed to rescue and once viewed they could be copied and saved into a modern MS Office format suitable for current staff to view and use. The next step is to find appropriate homes for the material we have rescued. What we do not want to do is copy them back on to a different sort of portable media that will become obsolete or corrupt again in another 20 years' time!


Wednesday, 6 February 2013

The Atlas of Digital Damages

One of our 'dead'!
Last week I attended a Digital Preservation Coalition day of action on collaborative approaches to digital archiving (and file format identification in particular) jokingly subtitled 'Bring out your Dead'.

Our current work has certainly uncovered some digital media and files that could be described as 'dead' and though I didn't really have cause to bring them out on the day, one of the key things that was reinforced by many of the speakers on the day was the importance of collaboration.

Digital archiving is a complex and evolving field and we cannot hope to solve all of the problems we are faced with alone. Although sometimes we may struggle to find the time to actively engage with collaboration initiatives, the importance of making time to do so was highlighted and at the end of the workshop we were asked to commit to at least one of the collaborative initiatives neatly summarised on the OPF wiki page Support your digital preservation community.

Only a small step I know, but I decided that something we could easily contribute to was the Atlas of Digital Damages. This is a group on Flickr with a remit to collect "visual examples of digital preservation challenges, failed renderings, encoding damage, corrupt data...". These images all tell a story and visually highlight and describe preservation issues that many of us will face. I hope to use some of these images to illustrate future presentations.

So, I have re-registered with Flickr (it has been a long time!) and uploaded my first picture (see above). It is that easy! I feel the compact disc photographed represents a very real digital preservation challenge! I was relieved to be told today that we do hold the data on this corrupt CD (burnt less than 6 years ago) elsewhere in the office, so in this case at least, the level of digital damage is minimal.

Friday, 1 February 2013

CDs versus floppies

The digital rescue mission continues...

Here are a few words from James who is in the middle of his internship on this project and has been moving his focus away from floppy disks and on to CDs within the office:

Floppy disks - more robust than we expected!
"I am finding it very enjoyable rifling through collections of CDs and floppy disks to try and discover what they contain, what is recoverable, what has duplicates saved elsewhere and what is important. Something that has been a big surprise to me, and I have found really interesting, was to discover that the floppy disks appear to have a greater lifespan than the CDs that superseded them. Out of the 97 floppy disks I’ve been through (most of these were from the 1990s) only one is completely corrupt. This is in contrast to the 62 CDs I have been working with (mostly from the 2000s) of which there were 4 entirely corrupt disks. It just seems odd that the older technology outlasts the more modern."

So in our sample, the floppies have a 1% disk corruption rate whereas for the CDs it is somewhere around 6.5%. Is this typical? It will be interesting to see if this pattern continues as we move on to digital material in the strongrooms.
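For the record, the arithmetic behind those percentages can be checked with a few lines of Python (the function name `corruption_rate` is just for illustration):

```python
def corruption_rate(corrupt, total):
    """Return the share of entirely corrupt disks as a percentage."""
    return 100 * corrupt / total


floppy_rate = corruption_rate(1, 97)  # 1 corrupt floppy disk out of 97
cd_rate = corruption_rate(4, 62)      # 4 corrupt CDs out of 62

print(f"Floppy disks: {floppy_rate:.1f}%")  # prints "Floppy disks: 1.0%"
print(f"CDs: {cd_rate:.1f}%")               # prints "CDs: 6.5%"
```

With only one and four failures respectively, these are small counts, so the figures are best read as a first impression and not as a firm failure rate.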

There is an interesting analogy from my colleagues who work on the conservation of analogue material within the archive. Apparently, the same is true of paper. Old paper is generally of better quality thus in a better state of physical preservation than more modern paper. I love it when my work on digital material finds parallels in the traditional archival experiences.

It just goes to show, they don't make things like they used to!

Wednesday, 16 January 2013

Digital Rescue Mission: Operation Warm-up

Some of the floppy disks we have been looking at
My digital archiving intern has started work. His mission is to rescue digital media from the strongroom and prepare it for inclusion into the digital archive.

The first stage of this work has been to search our accessions database and catalogues for any mention of digital media. James has had the rather tedious job of running keyword searches over our finding aids in order to create a list of the archives that contain digital media. This list is pretty much complete now and next week we will start to pull material out of the strongrooms and see what we actually have.
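As a rough illustration of this kind of keyword search, here is a small Python sketch. It is not how James actually did it. The keyword list and the archive references in the usage note are made up, and the finding aids are assumed to be available as plain text.

```python
import re

# Hypothetical search terms; a real list would be longer and need refining.
KEYWORDS = ["floppy", "diskette", "CD-ROM", "CD", "DVD"]


def build_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    # Longest terms first so that "CD-ROM" is preferred over "CD".
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")s?\b", re.IGNORECASE)


def archives_mentioning_media(finding_aids, keywords=KEYWORDS):
    """Return {archive reference: matched terms} for texts that mention a keyword."""
    pattern = build_pattern(keywords)
    hits = {}
    for reference, text in finding_aids.items():
        found = sorted({match.group(0).lower() for match in pattern.finditer(text)})
        if found:
            hits[reference] = found
    return hits
```

Given a dictionary such as `{"ABC/1": "Box 3 contains two floppy disks", "ABC/2": "Parchment deeds"}`, the function returns only the entries that mention one of the terms. A person still needs to check each hit, since a search like this cannot tell a real floppy disk from a passing mention.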

However, in preparation for working with the real archives, we have also been doing a bit of a warm-up exercise on some old digital media found tucked away in people's offices. As well as being a means to hone our skills in this area, and test tools and workflows, this is also a very useful exercise in its own right. As a long-established organisation that has existed over a period of many technological changes, it is not surprising that we have a variety of different digital storage solutions for our administrative files. We are currently establishing a more cohesive strategy for storing all of our own digital material and part of this process is to ensure that any digital files that are important to us are stored in a logical location on a server and regularly backed up.

In the past floppy disks were regularly used to back up files and transfer files between computers. Some of these floppy disks still exist (and most are still readable) so James is currently working through them, virus checking them, running DROID to find out the file formats and opening the files to look at the content. He is creating a list from which we can highlight anything that is important and unique and in need of rescue. Many will be back-ups or earlier versions of things we already have on our shared filestore. Many of the unique files will be of limited value. However, we are expecting there will be other files that will be more interesting, and will be useful in showing the history and development of our organisation.
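One part of this sifting that lends itself to automation is spotting exact copies of files we already hold. The sketch below is a simplified illustration and not our actual workflow; the function names and folder arguments are hypothetical. It compares MD5 checksums of files copied from a disk against those on a shared filestore.

```python
import hashlib
from pathlib import Path


def md5_of_file(path):
    """Return the MD5 hex digest of a file (read whole; fine for floppy-sized files)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def split_unique_and_duplicates(disk_dir, filestore_dir):
    """Sort files from a disk into those already on the filestore and those that are not."""
    known = {
        md5_of_file(p) for p in Path(filestore_dir).rglob("*") if p.is_file()
    }
    unique, duplicates = [], []
    for path in sorted(Path(disk_dir).rglob("*")):
        if not path.is_file():
            continue
        if md5_of_file(path) in known:
            duplicates.append(path.name)
        else:
            unique.append(path.name)
    return unique, duplicates
```

This only catches byte-for-byte identical copies. An earlier draft of a document we already have will still show up as unique, so opening the files and looking at the content remains necessary.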

In terms of file formats, James is finding a lot of Microsoft Word 97-2003, WordPerfect 6.0 and PDF 1.4 files. No major problems with being able to read these at the moment so that is encouraging. A couple of old databases have emerged in Paradox and an early version of Microsoft Access which we are struggling to open. We expect to find earlier data going back to the 1990s as James works through the stratigraphy of the pile of disks on his desk. What better job for a recent archaeology graduate!

After the floppy disks have been tackled we may decide to move on to the CDs in the office. They are far more numerous, but it could be argued they are equally vulnerable.