Digital Archiving at the University of York: August 2017

Anyone who has heard me talk about digital preservation will know that I am a big fan of the NDSA Levels of Preservation.

This is also pretty obvious if you visit me in my office – a print out of the NDSA Levels is pinned to the notice board above my PC monitor!

When talking to students and peers about how to get started in digital preservation in a logical, pragmatic and iterative way, I always recommend the NDSA Levels as a starting point. Start at level 1 and move forward to the more advanced levels as and when you are able. This is a much simpler and more accessible way to start addressing digital preservation than digesting some of the bigger and more complex certification standards and benchmarking tools.

Over the last few months I have been doing a lot of documentation work: both ensuring that our digital archiving procedures are written down somewhere and documenting where we are going in the future.

As part of this documentation it seemed like a good idea to use the NDSA Levels:

to demonstrate where we are
to show where improvements need to be made
to demonstrate progress in the future

Previously I have used the NDSA Levels in quite a superficial way – as a guide and a talking point. It has been quite a different exercise actually mapping where we stand.

It was not always straightforward to establish where we are and to unpick and interpret exactly what each level meant in practice. I guess this is one of the problems of using a relatively simple set of metrics to describe what is really quite a complex set of processes.

Without publishing the whole document that I've written on this, here is a summary of where I think we are currently. I'm also including some questions I've been grappling with as part of the process.

Storage and geographic location

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 and 4 in place

See the full NDSA levels here

Four years ago we carried out a ‘rescue mission’ to get all digital data in the archives off portable media and on to the digital archive filestore. This now happens as a matter of course when born digital media is received by the archives.

The data isn’t in what I would call a proper digital archive but it is on a fairly well locked down area of University of York filestore.

There are three copies of the data available at any one time (not including the copy that is on original media within the strongrooms). The University stores two copies of the data on spinning disk, one at a data centre on one campus and the other at a data centre on another campus, with a third copy backed up to tape which is kept for 90 days.

I think I can argue that storing the data on two different campuses counts as two different geographic locations, but these locations are both in York and only about 1 mile apart. I'm not sure whether they could be described as having different disaster threats, so I'm going to hold back from putting us at Level 3, though IT do seem to have systems in place to ensure that filestore is migrated on a regular schedule.

Questions:

On a practical level, what really constitutes a different geographic location with a different disaster threat? How far away is good enough?

File fixity and data integrity

Currently at LEVEL 4: 'repair your data'

See the full NDSA levels here

Having been in this job for five years now I can say with confidence that I have never once received file fixity information alongside data that has been submitted to us. Obviously if I did receive it I would check it on ingest, but I cannot envisage this scenario occurring in the near future! I do however create fixity information for all content as part of the ingest process.

I use a tool called Foldermatch to ensure that the digital data I have copied into the archive is identical to the original. Foldermatch allows you to compare the contents of two folders and one of the comparison methods (the one I use at ingest) uses checksums to do this.
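As a rough illustration of what a checksum-based folder comparison involves, here is a minimal Python sketch. It is not how Foldermatch itself works internally (that isn't covered here); the function names, the choice of SHA-256 and the example paths are all assumptions made for the illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def folder_checksums(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_folders(original, copy):
    """Report files missing from the copy, extra in the copy, or different."""
    source = folder_checksums(original)
    target = folder_checksums(copy)
    return {
        "missing": sorted(set(source) - set(target)),
        "extra": sorted(set(target) - set(source)),
        "changed": sorted(
            name for name in set(source) & set(target)
            if source[name] != target[name]
        ),
    }
```

Calling compare_folders with the original media folder and the archive copy (for example a hypothetical "E:/deposit" and "archive/deposit") returns three lists; if all three are empty, the copy matches the original file for file.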

Last year I purchased a write blocker for use when working with digital content delivered to us on portable hard drives and memory sticks. A check for viruses is also carried out on all content that is ingested into the digital archive, so together these fulfil the requirements of level 2 and some of level 3.

Despite putting us at Level 4, I am still very keen to improve our processes and procedures around fixity. Fixity checks are carried out at intervals (several times a month) and these checks are logged but at the moment this is all initiated manually. As the digital archive gets bigger, we will need to re-think our approaches to this important area and find solutions that are scalable.
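One possible direction for making these checks less manual is sketched below in Python. This is an assumption-laden illustration rather than a description of our actual process: the JSON manifest, the tab-separated log format and all the function names are hypothetical. The idea is to record checksums once at ingest, then re-hash the files later and log the result.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_root, manifest_path):
    """Record a checksum for every file under archive_root (run at ingest).

    The manifest should live outside archive_root so it is not hashed itself.
    """
    root = Path(archive_root)
    manifest = {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


def check_fixity(archive_root, manifest_path, log_path):
    """Re-hash every file in the manifest and append the outcome to a log."""
    root = Path(archive_root)
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    failures = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            failures.append((name, "missing"))
        elif sha256_of(target) != expected:
            failures.append((name, "checksum mismatch"))
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a", encoding="utf-8") as log:
        # One summary line per run, then one line per failed file.
        log.write(f"{timestamp}\tchecked={len(manifest)}\tfailures={len(failures)}\n")
        for name, reason in failures:
            log.write(f"{timestamp}\tFAIL\t{name}\t{reason}\n")
    return failures
```

A script like this could in principle be run by a scheduler (cron or Windows Task Scheduler, say) rather than by hand, which would address the manual initiation, though re-hashing everything on each run may itself need rethinking as the archive grows.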

Questions:

Does it really matter if fixity isn't checked at 'fixed intervals'? That to me suggests a certain rigidity. Do the intervals really need to be fixed or does it not matter as long as it happens within an agreed time frame?
At level 2 we are meant to ‘check fixity on all ingests’ - I am unclear as to what is expected here. What would I check if fixity information hasn’t been supplied (as is always the case currently)? Perhaps it means check fixity of the copy of the data that has been made against the fixity information on the original media? I do do that.

Information Security

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 in place

See the full NDSA levels here

Access to the digital archive filestore is limited to the digital archivist and IT staff who administer the filestore. If staff or others need to see copies of data within the digital archive filestore, copies are made elsewhere after appropriate checks are made regarding access permissions. The master copy is always kept on the digital archive filestore to ensure that the authentic original version of the data is maintained. Access restrictions are documented.

We are also moving towards the higher levels here. A recent issue reported on a mysterious change of last modified dates for .eml files has led to discussions with colleagues in IT, and I have been informed that an operating system upgrade for the server should include the ability to provide logs of who has done what to files in the archive.

It is worth pointing out that, as I don't currently have systems in place for recording PREMIS (preservation) metadata, I am taking a hands-off approach to preservation planning within the digital archive. Preservation actions such as file migration are few and far between and are recorded in a temporary way until a more robust system is established.

Metadata

Currently at LEVEL 3: 'monitor your data'

See the full NDSA levels here

We do OK with metadata currently (considering a full preservation system is not yet in place). Using DROID at ingest is helpful in fulfilling some of the requirements of levels 1 to 3 (essentially, having a record of what was received and where it is).

Our implementation of AtoM as our archival management system has helped fulfil some of the other metadata requirements. It gives us a place to store administrative metadata (who gave us it and when) as well as providing a platform to surface descriptive metadata about the digital archives that we hold.

Whether we actually have descriptive metadata or not for digital archives will remain an issue. Much metadata for the digital archive can be generated automatically but descriptive metadata isn't quite as straightforward. In some cases a basic listing is created for files within the digital archive (using Dublin Core as a framework) but this will not happen in all cases. Descriptive metadata typically will not be created until an archive is catalogued which may come at a later date.

Our plans to implement Archivematica next year will help us get to Level 4 as this will create full preservation metadata for us as PREMIS.

Questions:

What is the difference between the 'transformative metadata' as mentioned at Level 2 and Preservation metadata as mentioned at Level 4? Is this to do with the standards used? For example, at Level 2 you need to be storing metadata about transformations and events that have occurred, but at Level 4 this must be in PREMIS?

File formats

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 in place

See the full NDSA levels here

It took me a while to convince myself that we fulfilled Level 1 here! This is a pretty hard one to crack, especially if you have lots of different archives coming in from different sources, and sometimes with little notice. I think it is useful that the requirement at this level is prefaced with "When you can..."!

Thinking about it, we do do some work in this area - for example:

Giving presentations or arranging meetings with depositors who are likely to give us digital archives in the future and covering topics such as file naming, documentation and file formats
Investigating how to extract documents from Google Drive for preservation so that we can advise internal depositors about the best approach
Talking to depositors during the transfer process, for example to ask if they could provide a file in a different format

To get us to Level 2, as part of the ingest process we run DROID to get a list of file formats included within a digital archive. Summary stats are kept within a spreadsheet that covers all content within the digital archive so we can quickly see the range of formats that we hold and find out which archives they are in.
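The summary step could be sketched in Python along the following lines. This is only an illustration: it assumes a DROID CSV export with columns named TYPE, PUID and FORMAT_NAME (check these against your own export), and the function name is made up for the example.

```python
import csv
from collections import Counter


def summarise_formats(droid_csv_path):
    """Count files per (PUID, format name) pair in a DROID CSV export."""
    counts = Counter()
    with open(droid_csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumed column names; folder rows are skipped as they have no format.
            if row.get("TYPE") == "Folder":
                continue
            puid = row.get("PUID") or "unidentified"
            name = row.get("FORMAT_NAME") or "unidentified"
            counts[(puid, name)] += 1
    return counts
```

Running this over the export for each archive and pasting counts.most_common() into the spreadsheet would give the per-archive format breakdown, with anything DROID could not identify grouped under "unidentified".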

This should allow us to move towards Level 3 but we are not there yet. Some pretty informal and fairly ad hoc thinking goes into file format obsolescence but I won't go as far as saying that we 'monitor' it. I have an awareness of some specific areas of concern in terms of obsolete files (for example I've still got those WordStar 4.0 files and I really do want to do something with them!) but there are no doubt other formats that need attention that haven't hit my radar yet.

As mentioned earlier, we are not really doing migration right now - not until I have a better system for creating the PREMIS metadata, so Level 4 is still out of reach.

Questions:

I do think there is more we could do at Level 1, but there has also been concern raised by colleagues that in being too dictatorial you are altering the authenticity of the original archive and perhaps losing information about how a person or an organisation worked. I'd be interested to hear how others walk this tricky line.
Is it a valid answer to simply note at Level 1 that input into the creation of digital files is never given because it has been decided not to be appropriate in the context in which you are working?
I'd love to hear examples of how others monitor and report on file obsolescence - particularly if this is done in a systematic way

Conclusions

This has been a useful exercise and it is good to see where we need to progress. Going from using the Levels in the abstract to actually trying to apply them as a tool has been a bit challenging in some areas. I think additional information and examples would be useful to help clear up some of the questions that I have raised.

I've also found that even where we meet a level there are often other ways we could do things better. File fixity and data integrity looks like a strong area for us but I am all too aware that I would like to find a more sustainable and scalable way to do this. This is something we'll be working on as we get Archivematica in place. Reaching Level 4 shouldn't lead to complacency!

An interesting blog post last year by Shira Peltzman from the UCLA Library talked about Expanding the NDSA Levels of Preservation to include an additional row focused on Access. This seems sensible given that the ability to provide access is the reason why we preserve archives. I would be keen to see this developed further so long as the bar wasn't set too high. At the Borthwick my initial consideration has been preservation - getting the stuff and keeping it safe - but access is something that will be addressed over the next couple of years as we move forward with our plans for Archivematica and AtoM.

Has anyone else assessed themselves against the NDSA Levels? I would be keen to see how others have interpreted the requirements.

Jenny Mitcham, Digital Archivist

Friday, 18 August 2017

Benchmarking with the NDSA Levels of Preservation
