Tuesday, 17 December 2013

Updating my requirements

Last week I published my digital preservation Christmas wishlist. A bit tongue in cheek really but I saw it as my homework in advance of the latest Digital Preservation Coalition (DPC) day on Friday which was specifically about articulating requirements for digital preservation systems.

This turned out to be a very timely and incredibly useful event. Along with many other digital preservation practitioners I am currently thinking about what I really need a digital preservation system to do and which systems and software might be able to help.

Angela Dappert from the DPC started off the day with a very useful summary of requirements gathering methodology. I have since returned to my list and tidied it up a bit to get my requirements in line with her SMART framework – specific, measurable, attainable, relevant and time-bound. I also realised that by focusing on the framework of the OAIS model I have omitted some of the ‘non-functional’ requirements that are essential to having a working system – requirements related to the quality of a service, its reliability and performance for example.

As Carl Wilson of the Open Planets Foundation (OPF) mentioned, it can be quite hard to create sensible measurable requirements for digital preservation when we are talking about time frames which are so far in the future. How do we measure the fact that a particular digital object will still be readable in 50 years' time? In digital preservation we regularly use phrases such as ‘always’, ‘forever’ and ‘in perpetuity’. Use of these terms in a requirements document inevitably leads us to requirements that cannot be tested and this can be problematic.

I was interested to hear Carl describing requirements as being primarily about communication - communication with your colleagues and communication with the software vendors or developers. This idea tallies well with the thoughts I voiced last week. One of my main drivers for getting the requirements down in writing was to communicate my ideas with colleagues and stakeholders.

The Service Providers Forum at the end of the morning with representatives from Ex Libris, Tessella, Arkivum, Archivematica, Keep Solutions and the OPF was incredibly useful. Just hearing a little bit about each of the products and services on offer and some of the history behind their creation was interesting. There was lots of talk about community and the benefits of adopting a solution that other people are also using. Individual digital preservation tools have communities that grow around them and feed into their development. Ed Fay (soon to be of the OPF) made an important point that the wider digital preservation community is as important as the digital preservation solution that you adopt. Digital preservation is still not a solved problem. The community is where standards and best practice come from and these are still evolving outside of the arena of digital preservation vendors and service providers.

Following on from this discussion about community there was further talk about how useful it is for organisations to share their requirements. Are one organisation's needs going to differ wildly from another's? There is likely to be a core set of digital preservation requirements that is going to be relevant for most organisations.

Also discussed was how we best compare the range of digital preservation software and solutions that are available. This can be hard to do when each vendor markets themselves or describes their product in a different way. Having a grid in which we can compare products against a baseline of requirements would be incredibly useful. Something like the excellent tool grid provided by POWRR with a higher level of detail in the criteria used would be good.

I am not surprised that after spending a day learning about requirements gathering I now feel the need to go back and review my previous list. I was comforted by the fact that Maite Braud from Tessella stated that “requirements are never right first time round” and Susan Corrigall from the National Records of Scotland informed us that requirements gathering exercises can take months and will often go through many iterations before they are complete. Going back to the drawing board is not such a bad thing.

Wednesday, 11 December 2013

My digital preservation Christmas wish list

All I want for Christmas is a digital archive.

By paparutzi on Flickr CC BY 2.0
Since I started at the Borthwick Institute for Archives I have been keen to adopt a digital preservation solution. Up until this point, exploratory work on the digital archive has been overtaken by other priorities, perhaps the most important of these being an audit of digital data held at the Borthwick and an audit of research data management practices across the University. The outcome is clear to me – we hold a lot of data and if we are to manage this data effectively over time, a digital archiving system is required.

In a talk at the SPRUCE end of project workshop a couple of weeks ago both Ed Fay and Chris Fryer spoke about the importance of the language that we use when we talk about digital archiving. This is a known problem for the digital preservation community and one I have myself come up against on a number of different levels.

In an institution relatively new to digital preservation the term ‘digital archiving’ can mean a variety of different things and on the most basic IT level it implies static storage, a conceptual box we can put data in, a place where we put data when we have finished using it, a place where data will be stored but no longer maintained.

Those of us who work in digital preservation have a different understanding of digital archiving. We see digital archiving as the continuous active management of our digital assets, the curation of data over its whole life cycle, the systems that ensure data remains not only preserved, but fit for reuse over the long term. Digital archiving is more than just storage and needs to encompass activities as described within the Open Archival Information System reference model such as preservation planning and data management. Storage should be seen as just one part of a digital preservation solution.

To this end, and to inform discussions about what digital preservation really is, I pulled together a list of digital preservation requirements which any digital preservation system or software should be assessed against. This became my wish list for a digital preservation system. I do not really expect to have a system such as this unwrapped and ready to go on Christmas morning this year but maybe some time in the future!

In order to create this list of requirements I looked at the OAIS reference model and the main functional entities within this model. The list below is structured around these entities. 

I also bravely revisited ISO16363: Audit and Certification of Trustworthy Digital Repositories. This is the key (and most rigorous) certification route for those organisations who would like to become Trusted Digital Repositories. It goes into great detail about some of the activities which should be taking place within a digital archive and many of these are processes which would be most effectively carried out by an automated system built into the software or system on which the digital archive runs.

This list of requirements I have come up with has a slightly different emphasis from other lists of this nature due to the omission of the OAIS entity for Access. 

Access should be a key part of any digital archive. What is the point of preserving information if we are not going to allow others to access it at some point down the line? However, at York we already have an established system for providing access to digital data in the shape of York Digital Library. Any digital preservation system we adopt would need to build on and work alongside this existing repository, not replace it.

Functional requirements for access have also been well articulated by colleagues at Leeds University as part of their RoaDMaP project and I was keen not to duplicate effort here.

As well as helping to articulate what I actually mean when I talk about my hypothetical ‘digital archive’, one of the purposes of this is to provide a grid for comparing the functionality of different digital preservation systems and software.

Thanks to Julie Allinson and Chris Fryer for providing comment thus far. Chris's excellent case study for the SPRUCE project helped inform this exercise.

My requirements are listed below. Feedback is most welcome.


The digital archive will enable us to record/store administrative information relating to the Submission Information Package (information and correspondence relating to receipt of the SIP)
The digital archive will include a means for recording decisions regarding selection/retention/disposal of material from the Submission Information Package
The digital archive will be able to identify and characterise data objects (where appropriate tools exist)
The digital archive will be able to validate files (where appropriate tools exist)
The digital archive will support automated extraction of metadata from files
The digital archive will incorporate virus checking as part of the ingest process
The digital archive will be able to record the presence and location of related physical material

The digital archive will generate persistent, unique internal identifiers
The digital archive will ensure that preservation description information (PDI) is persistently associated with the relevant content information. The relationship between a file and its metadata/documentation should be permanent
The digital archive will support the PREMIS metadata schema and use it to store preservation metadata
The digital archive will enable us to describe data at different levels of granularity – for example metadata could be attached to a collection, a group of files or an individual file
The digital archive will accurately record and maintain relationships between different representations of a file (for example, from submitted originals to dissemination and preservation versions and subsequent migrations)
The digital archive must store technical metadata extracted from files (for example that created as part of the ingest process)

The digital archive will allow preservation plans (such as file migration or refreshment) to be enacted on individual or groups of files.
Automated checking of significant properties of files will be carried out post-migration to ensure they are preserved (where tools exist).
The digital archive will record actions, migrations and administrative processes that occur whilst the digital objects are contained within the digital archive

The digital archive will allow for disposal of data where appropriate. A record must be kept of this data and when disposal occurred
The digital archive will have reporting capabilities so statistics can be gathered on numbers of files, types of files etc.

The digital archive will actively monitor the integrity of digital objects with the use of checksums
Where problems of data loss or corruption occur, the digital archive will have a reporting/notification system to prompt appropriate action

The digital archive will be able to connect to, and support a range of storage systems

The digital archive will be compliant with the Open Archival Information System (OAIS) reference model
The digital archive will integrate with the access system/repository
The digital archive will have APIs or other services for integrating with other systems
The digital archive will be able to incorporate new digital preservation tools (for migration, file validation, characterisation etc) as they become available
The digital archive will include functionality for extracting and exporting the data and associated metadata in standards compliant formats
The software or system chosen for the digital archive will be supported and technical help will be available
The software or system chosen for the digital archive will be under active development
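
To make the integrity-monitoring requirement above a little more concrete, here is a minimal sketch in Python of what checksum-based fixity checking involves. It is an illustration only and not how any particular digital preservation system works. The function names (sha256_of, build_manifest, check_fixity) and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root with a previously stored manifest."""
    current = build_manifest(root)
    return {
        # Files recorded in the manifest that can no longer be found
        "missing": sorted(set(manifest) - set(current)),
        # Files whose content no longer matches the stored checksum
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Files that have appeared since the manifest was made
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

In use, the manifest would be created at ingest and stored separately from the data. Running check_fixity on a schedule and passing any non-empty "missing" or "changed" lists to a notification step would cover the reporting requirement too.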

Friday, 29 November 2013

COPTR: It's short for "Making my Thursday much easier"

This is a guest post from Nathan Williams, Archives Assistant.
For four days of a working week I can largely be found on the front desk of the Borthwick Institute assisting people with their research, fetching up documents within our vast holdings, and assisting people with interpreting the materials they have in front of them. Part of the role of an Archives Assistant is one of providing researchers with the tools of discovery.

On the fifth day of a working week I don a different cap altogether, for on Thursday I head on up to Jen Mitcham’s office to help with a different challenge altogether: digital preservation.

So it was somewhat of a pleasant surprise when I received an email circulated through the jisc-digital-preservation list regarding the beta launch of COPTR or the Community Owned digital Preservation Tool Registry. Ok, so my title is silly, but here’s why it really should stand for “Making my Thursday much easier”:

  • As an institutional repository with strong University, Research, Diocesan and local and national collections of import, we have varying and ever-increasing demands on our ability to manage digital objects.
  • We don’t currently have an overarching OAIS-compliant preservation system, but we still have to take action on digital objects both in our care and yet to be created.
  • We have to act but resources are limited and the correct tools, used properly, can help us to act now instead of risking our digital assets.
  • Sometimes finding those tools, especially for the entry level practitioner, isn’t easy - COPTR should help to make it easy.
COPTR is not the first such ‘tool registry’ to exist but its aim is to collate the contents of five previously used registries (amongst them those present via the Digital Curation Centre and Open Planets Foundation to name just two).

Here are just a few potentially great things about it:

  1. It’s working to collate all the information that’s currently out there into one place.
  2. It’s managed by us.
  3. Its browse function already looks really promising - show all the tools, or tools by functional category, or even tools by content they act upon. I don’t think it can be overstated how useful this is for the entry level practitioner!
  4. It brings together advanced and entry level practitioners and allows for collaboration across the digital preservation spectrum.
  5. User experiences go beyond just descriptions but actually provide use cases and general experiences from people who have used a tool. These sections will hopefully get a lot more material added to them as time goes on.
  6. There is already quite a bit to get your teeth into and entries are added to all the time - the activity log already looks promising.
I’ve already found some potential tools for investigation for my second look at finding us a temporary fixity solution. It’s also great to just browse and see what else is out there. What tools will you discover through COPTR?

Tuesday, 26 November 2013

Fund it, Solve it, Keep it – a personal perspective on SPRUCE

Yesterday I attended the SPRUCE end of project event at the fabulous new Library of Birmingham. The SPRUCE project was lauded by Neil Grindley as one of the best digital preservation projects that JISC has funded and it is easy to see why. Over the 2 years it has run, SPRUCE has done a great deal for the digital preservation community. Bringing people together to come up with solutions for some of our digital preservation problems is one of the most important of its achievements. The SPRUCE project is perhaps most well-known for its mash-up events* but should also be praised for its involvement and leadership in other community based digital preservation initiatives such as the recently launched tool registry COPTR (more about this in a future blog post).
Library of Birmingham by KellyNicholls27 on Flickr

SPRUCE can’t fix all the problems of the digital preservation community but what it has done very effectively is what William Kilbride describes as “productive small scale problem solving”. 

This event was a good opportunity to learn more about some of the tools and resources that have come out of the SPRUCE project. 

I was interested to hear Toni Sant of the Malta Music Memory Project describing their tool for extracting data from audio CDs that was made available last week. I have not had a chance to investigate this in any detail yet but think this could be exactly what we need in order to move us forward from our audit of audio formats at the Borthwick Institute earlier this year to a methodology for ensuring their long term preservation in line with the proposed 15 year digitisation strategy as described last month. Obviously this deals only with audio CDs so its scope is limited, but given that audio CDs are a high priority for digital preservation this is an important development.

Another interesting tool described by Eleonora Nicchiarelli at Nottingham University allowed them to put XMP metadata into the headers of TIF images produced by their digitisation team. This avoids the separation of the images from the contextual information that is so important in making sense of them.

It was also good to hear Ray Moore from the Archaeology Data Service talk about the ReACT tool (Resource Audit and Comparison), the proposal for which I wrote in my last few weeks at the ADS. A simple tool written in VBA with a friendly Excel GUI capable of automatically checking for the presence of related files in different directories. Originally created for those situations where you want to ensure a dissemination or preservation version of a file is present for each of your archival originals, it could have many use cases in alternative scenarios. As Ray articulated, “simple solutions are sometimes the best solutions”. Thanks due to Ray and Andrew Amato of LSE for seeing that project through.
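
The kind of check described above is simple enough to sketch in a few lines. The following is a rough Python illustration of the idea, not the ReACT tool itself (which is written in VBA with an Excel front end). The function name missing_counterparts and the rule of matching files by name while ignoring extension and case are assumptions made for the example.

```python
from pathlib import Path


def missing_counterparts(originals: Path, derivatives: Path) -> list[str]:
    """List files under 'originals' with no same-named file under 'derivatives'.

    Files are matched on their name without the extension, ignoring case,
    so plan01.tif is treated as matching PLAN01.jpg.
    """
    derivative_stems = {
        p.stem.lower() for p in derivatives.rglob("*") if p.is_file()
    }
    return sorted(
        p.name
        for p in originals.rglob("*")
        if p.is_file() and p.stem.lower() not in derivative_stems
    )
```

An empty list means every archival original has a counterpart. Anything returned is a file that still needs a dissemination or preservation version.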

Chris Fryer of Northumberland Estates described some great work he has done (along with Ed Pinsent of ULCC) on defining digital preservation requirements and assessing a number of solutions against these requirements. He has produced a set of resources that could be widely re-used by others going through a similar process.

When I attended the first SPRUCE mash-up in Glasgow early last year participants did a bit of work on defining the business case for digital preservation in the context of their own organisations and roles. At the time this seemed barely relevant to me, working as I was within an organisation for which digital preservation was its very reason for being and for which the business case had already been well defined using the Keeping Research Data Safe model. Since Glasgow I have moved to a different job within the University of York so it was useful yesterday to have a reminder of this work from Ed Fay who was able to summarise some of the key tools and techniques and highlight why a business case is so important in order to get senior buy-in for digital preservation. This is something I need to go back to and review. The recently published Digital Preservation Business Case Toolkit should be a great resource to help me with this.

The need to have a well prepared elevator pitch to persuade senior managers that more resources should be put into digital preservation has also become more real for me. The one I wrote at the time in Glasgow was a good start but perhaps needs to be a little bit less tongue in cheek!

* as an ex-archaeologist I see SPRUCE mash-ups as being the digital preservation equivalent of Channel 4's Time Team but without the TV cameras, and with Paul Wheatley ably taking on the role of Tony Robinson. Instead of 3 days to excavate an archaeological site we have 3 days to solve a selection of digital preservation problems and issues.

Friday, 25 October 2013

Advice for our donors and depositors

Anyone who knows anything about digital archiving knows that one of the best ways to ensure the longevity of your digital data is to plan for it at the point of creation.

If data is created with long term archiving in mind and following a few simple and common sense data management rules, then the files that are created are not only much easier for the digital archivist to manage in the future, but also easier for the creator to work with. How much easier is it to locate and retrieve files that are ordered in a sensible and logical hierarchy of folders and named in a way that is helpful? We are producing more and more data over time and as the quantity of data increases, so does the size of our problems in managing it.

We do not have many donors and depositors at the Borthwick who regularly put digital archives into our care but this picture will no doubt change over time. For those who do deposit digital archives, it is important that we encourage them to put good data management into practice and the earlier we speak to them about this the better.

'Bitrot' from the Digital Preservation Business Case Toolkit

Last week I was fortunate enough to be invited to speak to a group of people from one of our depositor organisations who are likely to start giving us digital data to archive in the future. They were from a large organisation with no central IT infrastructure and many people working from home on their own computers. Good data management is particularly important in this sort of scenario. This was a great opportunity for me to test out what could become the basis of a standard presentation on digital data management techniques that could be delivered to our donors and depositors.

I started off talking about what digital preservation is and why we really need to do it. It is always handy to throw in a few cautionary tales at this point as to what happens when we don't look after our data. I think these sorts of stories resonate with people more than just hearing the dry facts about obsolescence and corruption. I made a good plug for the 'Sorry for your data loss' cards put together by the Library of Congress earlier this year as this is something that any of us who have experienced data loss can relate to.

I then moved on to my own recent tale of digital rescue, using the 5 1/4 inch floppy disks from the Marks and Gran archive as my example (discussed in a previous blog post). This was partly because this is my current pet project, but also because it is a good way to cement and describe the real issues of hardware and software obsolescence and how we can work around these.

In the last section of the presentation I gave out my top tips on data management. I wanted the audience to go home with a positive sense of what they can start to do immediately in order to help protect their data from corruption, loss or misinterpretation.

Much of what I discussed in this section was common sense stuff really. Topics covered included:
  • how to name files sensibly
  • how to organise files well within a directory structure
  • how to document files
  • the importance of back up
  • the importance of anti-virus software
The presentation went well and sparked lots of interesting questions and debate and it was encouraging to see just how accessible this topic is to a non-specialist audience. Some of the questions raised related to current 'hot topics' in the digital archiving world which I hadn't had time to mention in any depth in my presentation:

  • How do you archive e-mails?
  • Is cloud storage safe?
  • What is wrong with pdf files?
  • What is the life span of a memory stick?
I had an answer to all of these apart from the last one, for which I have since found out the answer is 'it depends'. I have recently been told on Twitter that most digital preservation questions can be answered in this way!

Monday, 7 October 2013

Do sound engineers have more fun?

At the end of last week I was at the British Library on their excellent ‘Understanding and Preserving Audio Collections’ course.

British Library and Newton by Joanna Penn on Flickr
The concept of ‘Preserving audio’ is not a new one to me. Audio needs to be digitised for preservation and access and that pushes it firmly into my domain as digital archivist. I know the very basics such as the recommended file formats for long term preservation, but when faced with a real life physical audio collection on a variety of digital and analogue carriers it is hard to know what the priorities are and where exactly to start. This is where the ‘Understanding audio’ part of the course came to my rescue, filling in some of the gaps in my knowledge.

The course

The first day of the course was fascinating. We were given a run-down of the history of audio media and were introduced to (and in many cases, allowed to handle) many different physical carriers of audio. Hearing a wax cylinder being played on an original phonograph was a highlight for me. Digital archivists don’t normally get to play with the physical artefacts held within archives! Perhaps most useful in this session was learning how to recognise different types of physical media and spot the signs of physical degradation.

In the following two days the emphasis moved on to digital carriers and digital files. Interestingly digital carriers were seen to be more vulnerable than analogue in many respects. Digitisation workflows were also discussed and we got the chance to see around the digitisation studios with a wide range of equipment demonstrated. This was the point at which I started to wish I was a sound engineer!

Not so different after all…

One of the things that struck me throughout the three days was that this really isn't an alien subject to me at all. Familiar concepts were repeated again and again about obsolescence of technology, lack of standards (particularly when a new type of media takes off), the importance of metadata, the idea that future technologies may be able to do a better job of this than us, and the vain hopes that an ‘everlasting’ media carrier may be made available to us and solve all of our problems. Standard topics in any introductory presentation concerning digital archiving!

What was new and interesting to me though was that for audio media a time limit has been internationally agreed for taking action. We need to plan to digitise our audio and preserve it within a digital archive within the next 15 years because there will come a point at which this strategy will not be possible any more. We have a limited window of opportunity to work in. Digitising obsolete analogue and digital carriers is becoming harder to do (as the media degrades in a variety of different ways) and more expensive (as the necessary hardware and parts become harder to source). In fact, whereas digitisation of documents is becoming cheaper over time as new technologies are introduced, the digitisation of audio is becoming more expensive over time as the necessary equipment becomes harder to get hold of.

Has such a time limit ever been discussed for rescuing data from obsolete digital media such as floppies and zip disks? If so, it is not one that has hit my radar.

Putting the learning into context

The Borthwick Institute and the University of York curate some substantial music collections, but we have also been carrying out an audit of all the other bits and pieces of audio that are buried within some of our other collections. Currently we have a list with basic information about each item including the media type, the location in the strongroom and descriptive information taken from the label or packaging. The next step was to work out a digitisation strategy for these items.

This is all well and good, but work on this stalled as it quickly found its way into the ‘too difficult’ box. Following on from the information absorbed on this course, I now have the ability to start the process of prioritising the audio for digitisation based on variables such as vulnerability of the physical media and the condition of individual items, also taking into account whether the content is unique or of particular interest.

Another benefit is that I now feel that I could hold a conversation with a sound engineer! This is key to planning a digitisation project. Every discipline has its own particular language or jargon and happily I now have some understanding of waveforms, equalisation curves and sampling rates. At the very least I know what to ask for if writing a specification for an audio digitisation project and have a wealth of references, resources and contacts at my finger tips if I need to find out more.

Tuesday, 13 August 2013

A short detective story involving 5 ¼ inch floppy disks

Earlier this year my colleague encountered two small boxes of 5 ¼ inch floppy disks buried within the Marks and Gran archive in the strongrooms of the Borthwick Institute. He had been performing an audit of audio visual material in our care and came across these in an unlisted archive.

This was exciting to me as I had not worked with this media before. As a digital archivist I had often encountered 3 ½ inch floppies but not their larger (and floppier) precursors. The story and detective work that follow took us firmly into the realm of ‘digital archaeology’.

Digital archaeology: 
“The process of reclaiming digital information that has been damaged or become unusable due to technological obsolescence of formats and/or media” (definition from Glossaurus)

Marks and Gran were a writing duo who wrote the scripts of many TV sitcoms from the late 1970s onwards. ‘Birds of a Feather’ was the one that I remember watching myself in the 80s and 90s but their credits include many others such as ‘Goodnight Sweetheart’ and ‘The New Statesman’. Their archive had been entrusted to us and was awaiting cataloguing.

Clues on the labels

There were some clues on the physical disks themselves about what they might contain. All the disks were labelled and many of the labels referred to the TV shows they had written, sometimes with other information such as an episode or series number. Some disks had dates written on them (1986 and 1987). One disk was intriguingly labelled 'IDEAS'. WordStar was also mentioned on several labels: 'WordStar 2000' and 'Copy of Master WordStar disk'. WordStar was a popular and quite pioneering word processing package of the early 80s.

However, clues on labels must always be taken with a pinch of salt. I remember being guilty of not keeping the labels on floppy disks up to date, of replacing or deleting files and not recording the fact. The information on these labels will be stored in some form in the digital archive but the files may have a different story to tell.

Reading the disks

The first challenge was to see if the data could be rescued from the obsolete media they were on. A fortuitous set of circumstances led me to a nice chap in IT who is somewhat of an enthusiast in this area. I was really pleased to learn that he had a working 5 ¼ inch drive on one of his old PCs at home. He very kindly agreed to have a go at copying the data off the disks for me and did so with very few problems. Data from 18 of the 19 disks was recovered. The only one of the disks that was no longer readable appeared from the label to be a backup disk of some accounting software - this is a level of loss I am happy to accept.

Looking at file names

Looking at the files that were recovered is like stepping back in time. Many of us remember the days when file names looked like this - capital letters, very short, rather cryptic, missing file extensions. WordStar, like many of the software packages from this era, was not a stickler for enforcing use of file extensions! File extensions were also not always used correctly to define the file type but sometimes were used to hold additional information about the file.

Looking back at files from 30 years ago really does present a challenge. Modern operating systems allow long and descriptive file names to be created. When used well, file names often provide an invaluable source of metadata about the file. 30 years ago computer users had only 8 characters at their disposal. Creating file names that were both unique and descriptive was difficult. The file names in this collection do not always give many clues as to what is contained within them.

Missing clues

For a digital archivist, the file extension of a file is a really valuable piece of information. It gives us an idea of what software the file might have been created in and from this we can start to look for a suitable tool that we could use to work with these files. In a similar way, the lack of a file extension confuses modern operating systems. Double click on a file without one and Windows 7 has no idea what it is or what software to fire up to open it. File characterisation tools such as Droid, used on a day to day basis by digital archivists, also rely heavily on file extensions to help identify the file type. Running Droid on this collection (not surprisingly) produced lots of blank rows and inaccurate results*.
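As a rough illustration of how the scale of the missing-extension problem could be measured before reaching for a characterisation tool, here is a minimal Python sketch that tallies file names by extension. The function name and the example file names are assumptions made up for illustration (only PHONE is a name mentioned in this post); it is not how Droid works.

```python
from collections import Counter
from pathlib import PurePath


def extension_summary(file_names):
    """Count file names by extension; names with none go under '(none)'."""
    counts = Counter()
    for name in file_names:
        # PurePath.suffix returns '' when the name has no extension.
        suffix = PurePath(name).suffix.upper()
        counts[suffix or "(none)"] += 1
    return counts


# Hypothetical file names in the style of an early DOS collection.
example_names = ["PHONE", "CAST.LST", "SCENE1", "EP3.2"]
summary = extension_summary(example_names)
```

A large '(none)' count, or extensions such as '.2' that look like episode or version numbers rather than file types, would be an early warning that extension-based identification is going to struggle.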

Another observation on initial inspection of this set of files is that the creation dates associated with them are very misleading. It is really useful to know the creation date of a file and this is the sort of information that digital archivists put some effort into recording as accurately as they can. The creation dates on this set of files were rather strange. The vast majority of files appeared to have been created on 1st January 1980 but there were a handful of files with creation dates between 1984 and 1987. It does seem unlikely that Marks and Gran produced the main body of their work on a bank holiday in 1980, so it would seem that this date is not very accurate. My contact in IT pointed out that on old DOS computers it was up to the user to enter the correct date each time they used the PC. If no date was entered the PC defaulted to 1/1/1980. Not a great system clearly and we should be thankful that technology has moved on in this regard!
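The default date problem described above lends itself to a simple automated check. The sketch below, in Python, separates files stamped with 1 January 1980 from those with other dates. The function names and the sample records are hypothetical, and it assumes the timestamps have already been read from the files.

```python
from datetime import date, datetime

# The date old DOS PCs fell back to when the user did not enter one.
DOS_DEFAULT_DATE = date(1980, 1, 1)


def has_default_dos_date(timestamp):
    """Return True if the datetime falls on 1 January 1980."""
    return timestamp.date() == DOS_DEFAULT_DATE


def split_by_date_reliability(files):
    """Split (name, datetime) pairs into suspect and plausible file names."""
    suspect, plausible = [], []
    for name, stamp in files:
        if has_default_dos_date(stamp):
            suspect.append(name)
        else:
            plausible.append(name)
    return suspect, plausible


# Hypothetical records, not taken from the real collection.
records = [
    ("PHONE", datetime(1980, 1, 1, 0, 3)),
    ("SCENE1", datetime(1986, 5, 14, 10, 30)),
    ("CAST", datetime(1980, 1, 1, 1, 15)),
]
suspect, plausible = split_by_date_reliability(records)
```

In practice the timestamps might come from something like datetime.fromtimestamp(path.stat().st_mtime), bearing in mind that copying files off the original media can itself alter the recorded dates.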

So, we are missing important metadata that would help us understand the files, but all is not lost. The next step is to see whether we can read and make sense of them with our modern software.

Reading the files

I have previously blogged about one of my favourite little software programs, Quick View Plus – a useful addition to any digital archivist’s toolkit. True to form, Quick View Plus quickly identified the majority of these files as WordStar version 4.0 and was able to display them as well-formatted text documents. The vast majority of files appear to be sections of scripts and cast lists for various sitcoms from the 1980s but there are other documents of a more administrative nature such as PHONE which looks to be a list of speed dial shortcuts to individuals and organisations that Marks and Gran were working with (including a number of famous names).

Unanswered questions

I have not finished investigating this collection yet and still have many questions in my head:

  • How do these digital files relate to what we hold in our physical archive? The majority of the Marks and Gran archive is in paper form. Do we also have this same data as print outs? 
  • Do the digital versions of these files give us useful information that the physical do not (or vice versa)? 
  • Many of the scripts are split into a number of separate files, some are just small snippets of dialogue. How do all of these relate to each other? 
  • What can these files tell us about the creative process and about early word processing practices?

I am hoping that when I have a chance to investigate further I will come up with some answers.

* I will be providing some sample files to the Droid techies at The National Archives to see if they can tackle this issue.

Tuesday, 23 July 2013

Twelve interesting things I learnt last week in Glasgow

The Cloisters were a good place to shelter from the heat! 
Photo Credit: _skynet via Compfight

Last week I was lucky enough to attend the first iteration of the Digital Preservation Coalition's Advanced Practitioner Course. This was a week-long course organised by the APARSEN project and based at the University of Glasgow on the warmest week of 2013 so far. On the first day Ingrid Dillo began by telling us that 'data is hot' - by the end of the week it was not only data that was hot.

It would be a very long blog post if I were to try to do justice to each and every presentation over the course of the week (I took 30 pages of notes), so here is the abridged version:

A list of twelve interesting things:

These are my main 'take home' snippets of information. Some things I already knew but were reinforced at some point over the week, and others are things that were totally new to me or provided me with different ways of looking at things. Some of these things are facts, some are tools and some are rather open-ended challenges.

A novel way to present a cake.
Photo credit: Jenny Mitcham

1) We can think about data and interpretations of data using the analogy of a cake (Ingrid Dillo). The raw ingredients of the cake (eggs, flour, sugar etc) represent the raw data. The cake as it comes out of the oven is the information that is held within that raw data. The iced and decorated cake is the presentation of the data - data can be presented in lots of different ways just as the same cake could be decorated in many different ways. The leftover crumbs of the cake on the plate after it is eaten represent the intangible knowledge that we have gained. This really reinforces for me the reason behind our digital preservation actions - curating and preserving the raw data so that others can create alternative interpretations of that data and we can all benefit from the knowledge that this will bring.

2) Quantifying the costs of digital curation is hard (Kirnn Kaur). No surprise really considering there have been so many different projects looking at cost models for digital preservation. If it was easy perhaps it would have been solved by a single project. The main problem seems to be that many of the cost models that are currently out there are quite specific to the organisation or domain that produced them. Some interesting work on comparing and testing cost models has been carried out by the APARSEN project and a report on all of this work is available here.

3) In 20 years' time we won't be talking about 'digital preservation' any more, we will just be talking about 'preservation' (Ruben Riestra). There will be nothing special about preserving digital material, it will just be business as usual. I eagerly await 2033 when all my headaches will be over!

4) I am right in my feeling that uncompressed TIFF is the best preservation file format for images. The main contender, JPEG2000, is complicated and not widely supported (Photoshop Elements dropped support for it due to lack of interest from their users) (Tomasz Parkola).

5) There is a tool created by the SCAPE project called Matchbox. I haven't had a chance to try it out yet but it sounds like one of those things that could be worth its weight in gold. The tool lets you compare scanned pages of text to try and find duplicates and corresponding images. It uses visual keypoints and looks for structural similarity (Rainer Schmidt and Roman Graf). More tools like this that automate some of the more tedious and time-consuming jobs in digital curation are welcome!

6) Digital preservation policy needs to be on several levels (Catherine Jones). I was already aware of the concept of high level guidance (we tend to call this 'the policy') and then preservation procedures (which we would call a 'method statement' or 'strategy') but Catherine suggests we go one step further with our digital preservation policy and create a control policy - a set of specific measurable objectives. These could also be written in a machine readable form so that they could be understood by preservation monitoring and planning tools, and the process of decision making based on these controls could be further automated. I really like this idea but think we may be a long, long way from achieving this particular policy nirvana!
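To make the idea of a machine readable control policy a little more concrete, here is a minimal Python sketch in which each control is a measurable objective that a monitoring script can test. The structure, field names, objectives and thresholds are all assumptions invented for illustration. This is not the format used by SCAPE or by any particular preservation planning tool.

```python
import operator

# Map comparator symbols used in the policy to Python comparison functions.
COMPARATORS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}

# A hypothetical control policy: each entry is one measurable objective.
control_policy = [
    {"objective": "All files have an identified format",
     "measure": "percent_identified", "comparator": ">=", "target": 100},
    {"objective": "Fixity is checked at least every 90 days",
     "measure": "days_since_fixity_check", "comparator": "<=", "target": 90},
]


def evaluate(policy, measurements):
    """Return the objectives that the current measurements fail to meet."""
    failures = []
    for control in policy:
        compare = COMPARATORS[control["comparator"]]
        value = measurements[control["measure"]]
        if not compare(value, control["target"]):
            failures.append(control["objective"])
    return failures


# Hypothetical measurements gathered by a monitoring process.
measurements = {"percent_identified": 87, "days_since_fixity_check": 30}
failed = evaluate(control_policy, measurements)
```

Here only the format identification objective fails, which is the sort of result a planning tool could use to trigger further action.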

7) The SCAPE project has produced a tool that can help with preservation and technology watch ...that elusive digital preservation function that we all say we do ...but actually do in such an ad hoc way that it feels like we may have missed something. Scout is designed to automate this process (Luis Faria). Obviously it needs a certain amount of work to give it the information that it needs in order for it to know what exactly it is meant to be monitoring (see comment on machine readable preservation policies above), but this is one of those important things the digital preservation community as a whole should be investing its efforts in and collaborating on.

8) There is a new version of the Open Planets Foundation preservation planning tool Plato that sounds like it is definitely worth a look (Kresimir Duretec). I came across Plato when it was first released several years ago but, like many, decided it was too complex and time-consuming to bring into our preservation workflows in earnest. The project team have now developed the tool further and many of the more time-consuming elements of the workflow have been automated. I really need to re-visit this.

9) The beautifully named C3PO (Clever, crafty content profiling for objects) is another tool we can use to characterise and identify the digital files that we hold in our archives (Artur Kulmukhametov). We were lucky enough to have a chance to play with this during the training course. C3PO gives a really nice visual and interactive interface for viewing data from the FITS characterisation tool. I must admit I haven't really used FITS in earnest because of the complexity of managing conflicts between different tools, and the fact that the individual tools currently wrapped in FITS are not kept up to date. C3PO helps with the first of these problems but not so much with the latter. The SCAPE project team suggested that if more people start to use FITS then the issue might be resolved more quickly. It does become one of those difficult situations where people may not use the tool until the elements are updated but the tool developers may not update the elements until they see the tool being more widely used! I think I will wait and see.

Of the handful of example files I threw at FITS, my plain text file was identified as an HTML file by JHove, so not ideal. We need greater accuracy in all of these characterisation tools so that we can have greater trust in the output they give us. How can we base our preservation planning actions on these tools unless we have this trust?

10) Different persistent identifier schemes offer different levels of persistence (Maurizio Lunghi). For a Digital Object Identifier (DOI) you only need the identifier and metadata to remain persistent, whereas the NBN namespace scheme requires that the accessibility of the resource itself is also persistent. Another reason why I think DOIs may be the way to go. We do not want a situation where we are not allowed to de-accession or remove access to any of the digital data within our care.

11) I now know the difference between a URN, a URL and a URI (Maurizio Lunghi) - this is progress!

12) In the future data is likely to be more active more of the time, making it harder to curate. Data in a repository or archive could be annotated, tagged and linked. This is particularly the case with research data as it is re-used over time. Can our current best practices of archiving be maintained on live datasets? Repositories and archives may need to rethink how we support researchers in this to allow for different and more dynamic research methodologies, particularly with regard to 'big data' which cannot simply be downloaded (Adam Carter).

So, there is much food for thought here and a lot more besides.

I also learnt that Glasgow too has heat waves, the Botanic Gardens are beautiful and that there is a very good ice cream parlour on Byres Road!

Friday, 24 May 2013

Assessing the value of digital surrogates

I had an interesting meeting with colleagues yesterday to discuss how we manage digital surrogates - digitised versions of physical items we hold in the archives.

At the Borthwick Institute we do a fair amount of digitisation for a variety of reasons. These range from large scale digitisation projects such as the York Cause Papers, to digitisation to create images for publications and exhibitions, to single pages of Parish Registers for family history researchers.

As I am in the process of setting up a digital archive for the Borthwick Institute, effectively managing these digital surrogates also becomes my concern. The need to preserve these items is not as pressing as for born digital data (because they are only copies, not originals). However, to start to build up the collections that we have in digital form, to allow access to users who cannot visit our searchroom, and to avoid having to carry out the same work twice, appropriate creation and management of this data is important. Although there will be a clear distinction in the digital archive between material that has been donated or deposited with us in digital form, and material we have digitised in-house, both these types of data need to be actively managed and migrated over time.

One of the big questions we have been mulling over is how we decide which digitised material we keep and which we discard. There is no pressing need to keep everything we digitise. Much of the reprographics work we carry out for orders would consist of a single page of a larger volume. Creating appropriate metadata to describe exactly which section of the item had been digitised would undoubtedly become an administrative burden and the re-use potential of that individual section would be limited. We therefore need to make some pragmatic decisions on what we keep and what we throw away.

Here are some points to consider:
  • What is the condition of the original document? Is it fragile? Do we need a digital surrogate so we can create a copy for access that avoids any further handling of the original?
  • How easy is it to digitise the original? If problematic because of its large size or tight binding then we should ensure we maintain the digital surrogate to avoid having to digitise again in the future.
  • Is the section to be digitised suitable for re-use or would it make little sense out of context (for example if it is only one small section of a larger item)?
  • Is the archive catalogued to item level? This would certainly make it easier to administer the digital surrogates and ensure they can be related to their physical counterparts.
  • What are the access conditions for the original document? Is there any value in maintaining a digital copy of a document that we cannot make more widely available?

I am interested in how others have addressed this decision making process and how digital surrogates are treated within a digital archive so would welcome your thoughts...

Thursday, 2 May 2013

Some thoughts about our preservation policy

We are updating the Borthwick Institute's preservation policy. Originally written in 2007, it is due for review by our preservation archivist and conservation team but my interest in this task is to ensure that digital archives are represented within the policy. In the current policy there is no mention of the preservation of digital archives. There was no need for us to mention them six years ago but times are changing - this has now become a priority for us.

It was interesting having the opportunity to sit down and read our existing preservation policy. The only other preservation policies I had ever read up to this point (not that many of them I might add) related purely to digital material so had quite a different emphasis. As I am still quite new to the world of traditional archives, reading about everything that is in place to protect and preserve the documents in our care is an education for me.

The first question in my head is whether to crack on with this now or wait until such a time that the structure of the digital archive is more firmly in place. By getting the policy for the preservation of digital material written now we are jumping the gun a bit as procedures are not yet established. However by writing a policy at this point we are at least setting out our intentions and providing an overview of how we will approach the preservation of digital material. There are certain things I am confident we will be doing. The finer detail of how this will occur will follow in time and may be incorporated into a more detailed preservation strategy document in the future.

The second decision to make regarding the revised preservation policy was whether we integrate the digital within the current policy or create a separate document. Both approaches are legitimate and widely used. For the Borthwick we have decided that an integrated policy is the way forward. Ultimately, we want our systems for receiving, managing and providing access to archives to be seamless and media blind. It shouldn't matter to our depositors or users whether the media is digital or not, they should be confident in our ability to preserve and provide access to them regardless. Digital preservation should not be a specialist outpost, it should be fully integrated in the psyche of both our staff and our users.

There are differences in the way we preserve physical and digital archives:

  • With physical archives (paper and parchment) it is most important for the material to be appropriately packaged and stored in the correct environmental conditions. Once the conditions are right, they can be largely left alone. Intervention should only be required where specific issues occur.
  • With digital archives it is all about continuous active management. The digital environment is fast moving and the threat of obsolescence is never far away. Leaving the data alone in a static environment is a very risky approach.

Despite these differences, the basic premise of preservation is the same. What is highlighted in our current preservation policy is the idea of "preservation for access". This is why we are all here after all. Whether the material is physical or digital, we need to ensure that we preserve it so that others may access it in the future.