Friday, 15 December 2017

How would you change Archivematica's Format Policy Registry?

A train trip through snowy Shropshire to get to Aberystwyth
This week the UK Archivematica user group fought through the snow and braved the winds and driving rain to meet at the National Library of Wales in Aberystwyth.

This was the first time the group had visited Wales and we celebrated with a night out at a lovely restaurant on the evening before our meeting. Our visit also coincided with the National Library cafe’s Christmas menu so we were treated to a generous Christmas lunch (and crackers) at lunch time. Thanks NLW!

As usual the meeting covered an interesting range of projects and perspectives from Archivematica users in the UK and beyond. As usual there was too much to talk about and not nearly enough time. Fortunately this took my mind off the fact I had damp feet for most of the day.

This post focuses on a discussion we had about Archivematica's Format Policy Registry or FPR. The FPR in Archivematica is a fairly complex beast, but a crucial tool for the 'Preservation Planning' step in digital archiving. It is essentially a database which allows users to define policies for handling different file formats (including the actions, tools and settings to apply to specific file types for the purposes of preservation or access). The FPR comes ready populated with a set of rules based on agreed best practice in the sector, but institutions are free to change these and add new tools and rules to meet their own requirements.

Jake Henry from the National Library of Wales kicked off the discussion by telling us about some work they had done to make the thumbnail generation for pdf files more useful. Instead of supplying a generic thumbnail image for all pdfs they wanted the thumbnail to actually represent the file in question. They updated the FPR rule for pdf thumbnail generation so that it uses Ghostscript.
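To give a flavour of what a rule like this involves, here is a minimal sketch (not NLW's actual configuration) of the kind of Ghostscript call a pdf thumbnailing rule might wrap, shown here in Python with purely illustrative file names:

    # A minimal sketch (not NLW's actual FPR command): render the first page
    # of a pdf to a low resolution JPEG thumbnail using Ghostscript. In a real
    # FPR command the input and output locations would be supplied by
    # Archivematica rather than hard-coded.
    import subprocess

    def pdf_thumbnail(pdf_path, thumb_path, resolution=72):
        """Render page one of a pdf to a JPEG thumbnail via Ghostscript."""
        subprocess.run([
            "gs",
            "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",  # quiet, non-interactive
            "-sDEVICE=jpeg",                          # write JPEG output
            "-dFirstPage=1", "-dLastPage=1",          # first page only
            f"-r{resolution}",                        # thumbnail resolution
            f"-sOutputFile={thumb_path}",
            pdf_path,
        ], check=True)

    pdf_thumbnail("example.pdf", "example_thumbnail.jpg")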

NLW liked the fact that Archivematica converted pdf files to pdf/a but also wanted that same normalisation pathway to apply to existing pdf/a files. Just because a pdf/a file is already in a preservation file format, it doesn't mean it is a valid file. By also putting pdf/a files through a normalisation step they had more confidence that they were creating and preserving pdf/a files with some consistency.

Sea view from our meeting room!
Some institutions had not had any time to look in any detail at the default FPR rules. It was mentioned that there was trust in how the rules had been set up by Artefactual and that people didn’t feel expert enough to override these rules. The interface to the FPR within Archivematica itself is also not totally intuitive and requires quite a bit of time to understand. It was mentioned that adding a tool and a new rule for a specific file format in Archivematica is quite a complex task and not for the faint hearted...!

Discussion also touched on the subject of those files that are not identified. A file needs to be identified before an FPR rule can be set up for it. Ensuring files are identified in the first instance was seen to be a crucial step. Even once a format makes its way into PRONOM (TNA’s database of file formats), Artefactual Systems have to carry out extra work to get Archivematica to pick up that new PUID.
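As a practical first check, it can be worth reviewing identification coverage before material goes anywhere near the FPR. Below is a minimal sketch, assuming a DROID profile exported to CSV (using its FILE_PATH, TYPE and PUID columns), that lists the files which received no PUID and therefore have nothing for an FPR rule to match against:

    # A minimal sketch, assuming a DROID profile exported as CSV: list the
    # files that received no PUID, since these are the ones no FPR rule can
    # currently act upon. The CSV file name is illustrative.
    import csv

    def unidentified_files(droid_csv):
        """Return paths from a DROID CSV export that have no PUID assigned."""
        with open(droid_csv, newline="") as f:
            return [row["FILE_PATH"] for row in csv.DictReader(f)
                    if row.get("TYPE") == "File" and not row.get("PUID")]

    for path in unidentified_files("droid_report.csv"):
        print(path)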

Unfortunately normalisation tools do not exist for all files and in many cases you may just have to accept that a file will stay in the format in which it was received. For example a Microsoft Word document (.doc) may not be an ideal preservation format but in the absence of open source command line migration tools we may just have to accept the level of risk associated with this format.

Moving on from this, we also discussed manual normalisations. This approach may be too resource intensive for many (particularly those of us who are implementing automated workflows) but others would see this as an essential part of the digital preservation process. I gave the example of the WordStar files I have been working with this year. These files are already obsolete and though there are other ways of viewing them, I plan to migrate them to a format more suitable for preservation and access. This would need to be carried out outside of Archivematica using the manual normalisation workflow. I haven’t tried this yet but would very much like to test it out in the future.

I shared some other examples that I'd gathered outside the meeting. Kirsty Chatwin-Lee from the University of Edinburgh had a proactive approach to handling the FPR on a collection-by-collection and PUID-by-PUID basis. She checks all of the FPR rules for the PUIDs she is working with as she transfers a collection of digital objects into Archivematica and ensures she is happy before proceeding with the normalisation step.

Back in October I'd tweeted to the wider Archivematica community to find out what people do with the FPR and had a few additional examples to share. For example, using Unoconv to convert office documents and creating PDF access versions of Microsoft Word documents. We also looked at some more detailed preservation planning documentation that Robert Gillesse from the International Institute of Social History had shared with the group.

We had a discussion about the benefits (or not) of normalising a compressed file (such as a JPEG) to an uncompressed format (such as TIFF). I had already mentioned in my presentation earlier that this default migration rule was turning 5GB of JPEG images into 80GB of TIFFs - and this is without improving the quality or the amount of information contained within the image. The same situation would apply to compressed audio and video which would increase even more in size when converted to an uncompressed format.

If storage space is at a premium (or if you are running this as a service and charging for storage space used) this could be seen as a big problem. We discussed the reasons for and against leaving this rule in the FPR. It is true that we may have more confidence in the longevity of TIFFs and see them as more robust in the face of corruption, but if we are doing digital preservation properly (checking checksums, keeping multiple copies etc) shouldn't corruption be easily spotted and fixed?

Another reason we may migrate or normalise files is to restrict the file formats we are preserving to a limited set of known formats in the hope that this will lead to less headaches in the future. This would be a reason to keep on converting all those JPEGs to TIFFs.

The FPR is there to be changed, and given that not all organisations have exactly the same requirements it is not surprising that we are starting to tweak it here and there. If we don’t understand it, don’t look at it and don’t consider changing it, perhaps we aren’t really doing our jobs properly.

However there was also a strong feeling in the room that we shouldn’t all be re-inventing the wheel. It is incredibly useful to hear what others have done with the FPR and the rationale behind their decisions.

Hopefully it is helpful to capture this discussion in a blog post, but this isn’t a sustainable way to communicate FPR changes for the longer term. There was a strong feeling in the room that we need a better way of communicating with each other around our preservation planning - the decisions we have made and the reasons for those decisions. This feeling was echoed by Kari Smith (MIT Libraries) and Nick Krabbenhoeft (New York Public Library) who joined us remotely to talk about the OSSArcFlow project - so this is clearly an international problem! This is something that Jisc are considering as part of their Research Data Shared Service project so it will be interesting to see how this might develop in the future.

Thanks to the UK Archivematica group meeting attendees for contributing to the discussion and informing this blog post.

Jenny Mitcham, Digital Archivist

Monday, 4 December 2017

Cakes, quizzes, blogs and advocacy

Last Thursday was International Digital Preservation Day and I think I needed the weekend to recover.

It was pretty intense...

...but also pretty amazing!

Amazing to see what a fabulous international community there is out there working on the same sorts of problems as me!

Amazing to see quite what a lot of noise we can make if we all talk at once!

Amazing to see such a huge amount of advocacy and awareness raising going on in such a small space of time!

International Digital Preservation Day was crazy but now I have had a bit more time to reflect, catch up...and of course read a selection of the many blog posts and tweets that were posted.

So here are some of my selected highlights:

Cakes

Of course the highlights have to include the cakes and biscuits including those produced by Rachel MacGregor and Sharon McMeekin. Turning the problems that we face into something edible really does seem to make our challenges easier to digest!

Quizzes and puzzles

A few quizzes and puzzles were posed on the day via social media - a great way to engage the wider world and have a bit of fun in the process.


There was a great quiz from the Parliamentary Archives (the answers are now available here) and a digital preservation pop quiz from Ed Pinsent of CoSector which started here. Also, for the hexadecimal geeks out there, there was a puzzle from the DPOC Fellows at Oxford and Cambridge which came, as it happens, just at the point that I was firing up a hexadecimal viewer!

In a blog post called Name that item in...? Kirsty Chatwin-Lee at Edinburgh University encourages the digital preservation community to help her to identify a mysterious large metal disk found in their early computing collections. Follow the link to the blog to see a picture - I'm sure someone out there can help!

Announcements and releases

There were lots of big announcements on the day too. IDPD just kept on giving!

Of course the 'Bit List' (a list of digitally endangered species) was announced and I was able to watch this live. Kevin Ashley from the Digital Curation Centre discusses this in a blog post. It was interesting to finally see what was on the list (and then think further about how we can use this for further advocacy and awareness raising).

I celebrated this fact with some Fake News but to be fair, William Kilbride had already been on the BBC World Service the previous evening talking about just this so it wasn't too far from the truth!

New versions of JHOVE and veraPDF were released, alongside a new PRONOM release. A digital preservation policy for Wales was announced and a new course on file migration was launched by CoSector at the University of London. Two new members also joined the Digital Preservation Coalition - and what a great day to join!

Roadshows

Some institutions did a roadshow or a pop up museum in order to spread the message about digital preservation more widely. This included the revival of the 'fish screensaver' at Trinity College Dublin and a pop up computer museum at the British Geological Survey.

Digital Preservation at Oxford and Cambridge blogged about their portable digital preservation roadshow kit. I for one found this a particularly helpful resource - perhaps I will manage to do something similar myself next IDPD!

A day in the life

Several institutions chose to mark the occasion by blogging or tweeting about the details of their day. This gives an insight into what we DP folks actually do all day and can be really useful, given that the processes behind digital preservation work are often less tangible and understandable than those used for physical archives!

I particularly enjoyed the nostalgia of following ex-colleagues at the Archaeology Data Service for the day (including references to those much loved checklists!) and hearing from Artefactual Systems about the testing, breaking and fixing of Archivematica that was going on behind the scenes.

The Danish National Archives blogged about 'a day in the life' and I was particularly interested to hear about the life-cycle perspective they have as new software is introduced, assessed and approved.

Exploring specific problems and challenges

Plans are my reality from Yvonne Tunnat of the ZBW Leibniz Information Centre for Economics was of particular interest to me as it demonstrates just how hard the preservation tasks can be. I like it when people are upfront and honest about the limitations of the tools or the imperfections of the processes they are using. We all need to share more of this!

In Sustaining the software that preserves access to web archives, Andy Jackson from the British Library tells the story of an attempt to maintain a community of practice around open source software over time and shares some of the lessons learned - essential reading for any of us that care about collaborating to sustain open source.

Kirsty Chatwin-Lee from Edinburgh University invites us to head back to 1985 with her as she describes their Kryoflux-athon challenge for the day. What a fabulous way to spend the day!

Disaster stories

Digital Preservation Day wouldn't be Digital Preservation Day without a few disaster stories too! Despite our desire to move beyond the 'digital dark age' narrative, it is often helpful to refer to worst-case scenarios when advocating for digital preservation.

Cees Hof from DANS in the Netherlands talks about the loss of digital data related to rare or threatened species in The threat of double extinction; Sarah Mason from Oxford University uses the recent example of the shutdown of DCist to discuss institutional risk; José Borbinha from Lisbon University, Portugal talks about his own experiences of digital preservation disaster; and Neil Beagrie from Charles Beagrie Ltd highlights the costs of inaction.

The bigger picture

Other blogs looked at the bigger picture:

Preservation as a present by Barbara Sierman from the National Library of the Netherlands is a forward thinking piece about how we could communicate and plan better in order to move forward.

Shira Peltzman from the University of California, Los Angeles tries to understand some of the results of the 2017 NDSA Staffing Survey in It's difficult to solve a problem if you don't know what's wrong.

David Minor from the University of San Diego Library, provides his thoughts on What we’ve done well, and some things we still need to figure out.

I enjoyed reading a post from Euan Cochrane from Yale University Library on The Emergence of “Digital Patinas”. A really interesting piece... and who doesn't like to be reminded of the friendly and helpful Word 97 paperclip?

In Towards a philosophy of digital preservation, Stacey Erdman from Beloit College, Wisconsin USA asks whether archivists are born or made and discusses her own 'archivist "gene"'.




So much going on and there were so many other excellent contributions that I missed.

I'll end with a tweet from Euan Cochrane which I thought nicely summed up what International Digital Preservation Day is all about. Of course the day was also concluded by William Kilbride of the DPC with a suitably inspirational blog post.



Congratulations to the Digital Preservation Coalition for organising the day and to the whole digital preservation community for making such a lot of noise!




Jenny Mitcham, Digital Archivist

Thursday, 30 November 2017

What shall I do for International Digital Preservation Day?

I have been thinking about this question for a few months now and have only recently come up with a solution.

I wanted to do something big on International Digital Preservation Day. Unfortunately other priorities have limited the amount of time available and I am doing something a bit more low key. To take a positive from a negative I would like to suggest that as with digital preservation more generally, it is better to just do something rather than wait for the perfect solution to come along!

I am sometimes aware that I spend a lot of time in my own echo chamber - for example talking on Twitter and through this blog to other folks who also work in digital preservation. Though this is undoubtedly a useful two-way conversation, for International Digital Preservation Day I wanted to target some new audiences.

So instead of blogging here (yes I know I am blogging here too) I have blogged on the Borthwick Institute for Archives blog.

The audience for the Borthwick blog is a bit different to my usual readership. It is more likely to be read by users of our services at the Borthwick Institute and those who donate or deposit with us, perhaps also by staff working in other archives in the UK and beyond. Perfect for what I had planned.

In response to the tagline of International Digital Preservation Day ‘Bits Decay: Do Something Today’ I wanted to encourage as many people as possible to ‘Do Something’. This shouldn’t be just limited to us digital preservation folks, but to anyone anywhere who uses a computer to create or manage data.

This is why I decided to focus on Personal Digital Archiving. The blog post is called “Save your digital stuff!” (credit for this inspiring title goes to the DPC Technology Watch Report on Personal Digital Archiving, which notes that at a briefing day hosted by the Digital Preservation Coalition (DPC) in April 2015 one of the speakers suggested that the term ‘personal digital archiving’ be replaced by the more urgent exhortation ‘Save your digital stuff!’).

The blog post aimed to highlight the fragility of digital resources and then give a few tips on how to protect them. Nothing too complicated or technical, but hopefully just enough to raise awareness and perhaps encourage engagement. Not wishing to replicate all the great work that has already been done on Personal Digital Archiving by the Library of Congress, the Paradigm project and others, I decided to focus on just a few simple pieces of advice and then link out to other resources.

At the end of the post I encourage people to share information about any actions they have taken to protect their own digital legacies (of course using the #IDPD17 hashtag). If I inspire just one person to take action I'll consider it a win!

I'm also doing a 'Digital Preservation Takeover' of the Borthwick twitter account @UoYBorthwick. I have lined up a series of 'fascinating facts' about the digital archives we hold here at the Borthwick to tweet over the course of the day.

  • There are 28 archives at the Borthwick for which we hold at least some digital material - this may be some of the most fragile and vulnerable material that we hold
  • The first digital archive received at the Borthwick arrived in 2004 as part of the York Peptic Ulcer Trust archive
  • We hold 135GB of deposited digital archive material here at the Borthwick (10896 individual files to preserve) - not a huge amount but we do expect this to grow!
  • The largest digital archive we hold at the Borthwick is the Historic Masters Archive which consists of 997 files and is 82 GB in size - it came in yesterday and I’m processing it right now!
  • We believe that the oldest files in the digital archive go back to 1984 - these are in the Marks and Gran archive
  • Approximately a quarter of the digital archives that we hold contain file formats that are not automatically identified by DROID
  • The average number of files received in a digital archive deposit at the Borthwick is 300 (though in reality it can range from 1 to 2400)
  • The average number of different file formats (at least those that can be identified) in a new digital accession received at the Borthwick is 6, though our Richard Orton archive contains 48 different identified file formats and many more that are not identified
  • The file format that gets deposited with us the most is the Microsoft Word Document 97-2003 (we have over 1700 of these)


OK - admittedly they won't be fascinating to everyone, but if nothing else they help us to move further away from the notion that an archive is where you go to look at very old documents!

...and of course I now have a whole year to plan for International Digital Preservation Day 2018 so perhaps I'll be able to do something bigger and better?! I'm certainly feeling inspired by the range of activities going on across the globe today.



Jenny Mitcham, Digital Archivist

Wednesday, 29 November 2017

Preserving Google Drive: What about Google Sheets?

There was lots of interest in a blog post earlier this year about preserving Google Docs.

Often the issues we grapple with in the field of digital preservation are not what you'd call 'solved problems' and that is what makes them so interesting. I always like to hear how others are approaching these same challenges so it is great to see so many comments on the blog itself and via Twitter.

This time I'm turning my focus to the related issue of Google Sheets. This is the native spreadsheet application for Google Drive.

Why?

Again, this is an application that is widely used at the University of York in a variety of different contexts, including for academic research data. We need to think about how we might preserve data created in Google Sheets for the longer term.


How hard can it be?

Quite hard actually - see my earlier post!


Exporting from Google Drive

For Google Sheets I followed a similar methodology to Google Docs: I took a couple of sample spreadsheets, downloaded them in the formats that Google provides, and then examined these exported versions to assess how well specific features of the spreadsheet were retained.

I used the File...Download as... menu in Google Sheets to test out the available export formats
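The same exports can also be scripted, which may be more practical at scale. The sketch below is illustrative only: it assumes you already have Google Drive API credentials and the Drive file ID of the sheet, and it uses the Drive API's files().export method with the export MIME types that correspond to the menu options:

    # An illustrative sketch: exporting a Google Sheet in several formats via
    # the Drive API rather than the File...Download as... menu. Assumes a
    # credentials object and the spreadsheet's Drive file ID are to hand.
    from googleapiclient.discovery import build

    EXPORT_TYPES = {
        "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "ods":  "application/x-vnd.oasis.opendocument.spreadsheet",
        "pdf":  "application/pdf",
        "csv":  "text/csv",          # like the menu option, current sheet only
        "zip":  "application/zip",   # zipped HTML version
    }

    def export_sheet(creds, file_id, stem="flexisheet"):
        """Download a Google Sheet in each export format for comparison."""
        drive = build("drive", "v3", credentials=creds)
        for ext, mime in EXPORT_TYPES.items():
            data = drive.files().export(fileId=file_id, mimeType=mime).execute()
            with open(f"{stem}.{ext}", "wb") as out:
                out.write(data)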

The two spreadsheets I worked with were as follows:

  • A simple spreadsheet which staff had used to select their menu choices for a celebration event. This consisted of just one sheet of data and no particularly advanced features. The sheet did include use of the Google Drive comments facility
  • My flexitime sheet which is provided by my department and used to record the hours I work over the course of the year. It seems to be about as complex as it gets and includes a whole range of features: multiple sheets (that reference each other), controlled data entry through drop down lists, calculations of hours using formulas, conditional formatting (i.e. specific cells turning red if you have left work too early or taken an inadequate lunch break), and code that jumps straight to today's date when you first open it up.

Here is a summary of my findings:

Microsoft Excel - xlsx

I had high hopes for the xlsx export option - however, on opening the exported xlsx version of my flexisheet I was immediately faced with an error message telling me that the file contained unreadable content and asking whether I wanted to recover the contents.

This doesn't look encouraging...

Clicking 'Yes' on this dialogue box then allows the sheet to open and another message appears telling you what has been repaired. In this case it tells me that a formula has been removed.


Excel can open the file if it removes the formula

This is not ideal if the formula is considered to be worthy of preservation.

So clearly we already know that this isn't going to be a perfect copy of the Google sheet.

This version of my flexisheet looks pretty messed up. The dates and values look OK, but none of the calculated values are there - they are all replaced with "#VALUE".

The colours on the original flexisheet are important as they flag up problems and issues with the data entered. These however are not fully retained - for example, weekends are largely (but not consistently) marked as red and in the original file they are green (because it is assumed that I am not actually meant to be working weekends).

The XLSX export does however give a better representation of the simpler menu choices Google sheet. The data is accurate, and comments are partially present. Unfortunately though, replies to comments are not displayed and the comments are not associated with a date or time.


Open Document Format - ods

I tried opening the ODS version of the flexisheet in LibreOffice on a Macbook. There were no error messages (which was nice) but the sheet was a bit of a mess. There were similar issues to those that I encountered in the Excel export, though it wasn't identical. The colours were certainly applied differently, with neither version entirely accurate to the original.

If I actually try to use the sheet to enter more data, the formulas do not work - they do not calculate anything, though the formulas themselves do appear to be retained. Any values that are calculated on the original sheet are not present.

Comments are retained (and replies to comments) but no date or time appears to be associated with them (note that the data may be there but just not displaying in LibreOffice).

I also tried opening the ODS file in Microsoft Office. On opening it, the same error message was displayed as the one originally encountered with the XLSX version described above, and this was followed by notification that “Excel completed file level validation and repair. Some parts of this workbook may have been repaired or discarded.” Unlike the XLSX file there didn't appear to be any additional information available about exactly what had been repaired or discarded - this didn't exactly fill me with confidence!

PDF document - pdf

When downloading a spreadsheet as a PDF you are presented with a few choices - for example:
  • Should the export include all sheets, just the current sheet or the current selection? (note that current sheet is the default)
  • Should the export include the document title?
  • Should the export include sheet names?
To make the export as thorough as possible I chose to export all sheets and include document title and sheet names.

As you might expect this was a good representation of the values on the spreadsheet - a digital print if you like - but all functionality and interactivity was lost. In order to re-use the data, it would need to be copied and pasted or re-typed back into a spreadsheet application.

Note that comments within the sheet were not retained and also there was no option to export sheets that were hidden.

Web page - html

This gave an accurate representation of the values on the spreadsheet but, similar to the PDF version, not in a way that really encourages reuse. Formulas were not retained and the resulting copy is just a static snapshot.

Interestingly, the comments in the menu choices example weren't retained. This surprised me because when using the html export option for Google documents one of the noted benefits was that comments were retained. Seems to be a lack of consistency here.

Another thing that surprised me about this version of the flexisheet was that it included hidden sheets (I hadn't until this point realised that there were hidden sheets!). I later discovered that the XLSX and ODS versions also retained the hidden sheets... but they were (of course) hidden so I didn't immediately notice them!

Tab delimited and comma separated values - tsv and csv

It is made clear on export that only the current sheet is exported so if using this as an export strategy you would need to ensure you exported each individual sheet one by one.

The tab delimited export of the flexisheet surprised me. In order to look at the data properly I tried importing it into MS Excel. It came up with a circular reference warning - were some of the dynamic properties of the sheet somehow being retained (albeit in a way that was broken)?

A circular reference warning when opening the tab delimited file in Microsoft Excel

Both of these formats did a reasonable job of capturing the simple menu choices data (though note that the comments were not retained) but neither did an acceptable job of representing the complex data within the flexisheet (given that the more complex elements such as formulas and colours were not retained).

What about the metadata?

I won't go into detail again about the other features of a Google Sheet that won't be saved with these export options - for example information about who created it and when and the complete revision history that is available through Google Drive - this is covered in a previous post. Given my findings when I interviewed a researcher here at the University of York about their use of Google Sheets, the inability of the export options to capture the version history will be seen as problematic for some use cases.

What is the best export format for Google Sheets?

The short answer is 'it depends'.

The export options available all have pros and cons and as ever, the most suitable one will very much depend on the nature of the original file and the properties that you consider to be most worthy of preservation.


  • If, for example, the inclusion of comments is an essential requirement, XLSX and ODS are the only formats that retain them (with varying degrees of success).
  • If you just want a static snapshot of the data in its final form, PDF will do a good job (you must specify that all sheets are saved), but note that if you want to include hidden sheets, HTML may be a better option. 
  • If the data is required in a usable form (including a record of the formulas used) you will need to try XLSX or ODS, but note that calculated values present in the original sheet may be missing. Similar but not identical results were noted with XLSX and ODS so it would be worth trying them both and seeing if either is suitable for the data in question.


It should be possible to export an acceptable version of the data for a simple Google Sheet but for a complex dataset it will be difficult to find an export option that adequately retains all features.

Exporting Google Sheets seems even more problematic and variable than Google Documents and for a sheet as complex as my flexisheet it appears that there is no suitable option that retains the functionality of the sheet as well as the content.

So, here's hoping that native Google Drive files appear on the list of World's Endangered Digital Species...due to be released on International Digital Preservation Day! We will have to wait until tomorrow to find out...



A disclaimer: I carried out the best part of this work about 6 months ago but have only just got around to publishing it. Since I originally carried out the exports and noted my findings, things may have changed!



Jenny Mitcham, Digital Archivist

Friday, 20 October 2017

Understanding WordStar - check out the manuals!

Last month I was pleased to be able to give a presentation at 'After the Digital Revolution' about some of the work I have been doing on the WordStar 4.0 files in the Marks and Gran digital archive that we hold here at the Borthwick Institute for Archives. This event specifically focused on literary archives.

It was some time ago now that I first wrote about these files that were recovered from 5.25 inch floppy (really floppy) disks deposited with us in 2009.

My original post described the process of re-discovery, data capture and file format identification - basically the steps that were carried out to get some level of control over the material and put it somewhere safe.

I recorded some of my initial observations about the files but offered no conclusions about the reasons for the idiosyncrasies.

I’ve since been able to spend a bit more time looking at the files and investigating the creating application (WordStar) so in my presentation at this event I was able to talk at length (too long as usual) about WordStar and early word processing. A topic guaranteed to bring out my inner geek!

WordStar is not an application I had any experience with in the past. I didn’t start word processing until the early 90’s when my archaeology essays and undergraduate dissertation were typed up in a DOS version of WordPerfect. Prior to that I used a typewriter (now I feel old!).

WordStar by all accounts was ahead of its time. It was the first Word Processing application to include mail merge functionality. It was hugely influential, introducing a number of keyboard shortcuts that are still used today in modern word processing applications (for example control-B to make text bold). Users interacted with WordStar using their keyboard, selecting the necessary keystrokes from a set of different menus. The computer mouse (if it was present at all) was entirely redundant.

WordStar was widely used as home computing and word processing increased in popularity through the 1980’s and into the early 90’s. However, with the introduction of Word for Windows in 1989 and Windows 3.0 in 1990, WordStar gradually fell out of favour (info from Wikipedia).

Despite this it seems that WordStar had a loyal band of followers, particularly among writers. Of course the word processor was the key tool of their trade so if they found an application they were comfortable with it is understandable that they might want to stick with it.

I was therefore not surprised to hear that others presenting at 'After the Digital Revolution' also had WordStar files in their literary archives. Clear opportunities for collaboration here! If we are all thinking about how to provide access to and preserve these files for the future then wouldn't it be useful to talk about it together?

I've already learnt a lot through conversations with the National Library of New Zealand who have been carrying out work in this area (read all about it here: Gattuso J, McKinney P (2014) Converting WordStar to HTML4. iPres.)

However, this blog post is not about defining a preservation strategy for the files; it is about better understanding them. My efforts have been greatly helped by finding a copy of both a WordStar 3 manual and a WordStar 4 manual online.

As noted in my previous post on this subject there were a few things that stand out when first looking at the recovered WordStar files and I've used the manuals and other research avenues to try and understand these better.


Created and last modified dates

The Marks and Gran digital archive consists of 174 files, most of which are WordStar files (and I believe them to be WordStar version 4).

Looking at the details that appear on the title pages of some of the scripts, the material appears to be from the period 1984 to 1987 (though not everything is dated).

However the system dates associated with the files themselves tell a different story. 

The majority of files in the archive have a creation date of 1st January 1980.

This was odd. Not only would that have been a very busy New Year's Day for the screen writing duo, but the timestamps on the files suggest that they were also working in the very early hours of the morning - perhaps unexpected when many people are out celebrating having just seen in the New Year!

This is the point at which I properly lost my faith in technical metadata!

In this period computers weren't quite as clever as they are today. When you switched them on they would ask you what date it was. If you didn't tell them the date, the PC would fall back to a system default... which just so happens to be 1st January 1980.

I was interested to see Abby Adams from the Harry Ransom Center, University of Texas at Austin (also presenting at 'After the Digital Revolution') flag up some similarly suspicious dates on files in a digital archive held at her institution. Her dates differed just slightly from mine, falling on the evening of the 31st December 1979. Again, these dates looked unreliable as they were clearly out of line with the rest of the collection.

This is the same issue as mine; the difference relates to the timezone. There is further explanation here, highlighted by David Clipsham when I threw the question out to Twitter. Thanks!
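For anyone wanting to flag up suspect dates like these automatically, here is a minimal sketch (the directory name is purely illustrative) that reports any file whose modification date sits within a day of the DOS default of 1st January 1980 - the day's leeway catches the timezone shift described above:

    # A minimal sketch: flag files whose timestamps have fallen back to the
    # DOS default date of 1st January 1980, a sign that the system clock was
    # never set rather than a record of real activity. The directory name is
    # illustrative only.
    import datetime
    import pathlib

    DOS_DEFAULT = datetime.date(1980, 1, 1)

    def suspect_dates(directory, tolerance_days=1):
        """Yield files whose modification date is within a day of the DOS
        default (a timezone shift can push it back to 31st December 1979)."""
        for path in pathlib.Path(directory).rglob("*"):
            if path.is_file():
                modified = datetime.date.fromtimestamp(path.stat().st_mtime)
                if abs((modified - DOS_DEFAULT).days) <= tolerance_days:
                    yield path, modified

    for path, modified in suspect_dates("marks_and_gran_files"):
        print(f"{path}: {modified}")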


Fragmentation

Another thing I had noticed about the files was the way that they were broken up into fragments. The script for a single episode was not saved as a single file but typically as 3 or 4 separate files. These files were named in such a way that it was clear they were related and what order they should be viewed or accessed in - for example GINGER1, GINGER2 or PILOT and PILOTB.

This seemed curious to me - why not just save the document as a single file? The WordStar 4 manual didn't offer any clues but I found this piece of information in the WordStar 3 manual which describes how files should be split up to help manage the storage space on your diskettes:

From the WordStar 3 manual




Perhaps some of the files in the digital archive are from WordStar 3, or perhaps Marks and Gran had been previously using WordStar 3 and had just got into the habit of splitting a document into several files in order to ensure they didn't run out of space on their floppy disks.

I cannot imagine working this way today! Technology really has come on a long way. Imagine trying to format, review or spell check a document that exists as several discrete files potentially sitting on different media!


Filenames

One thing that stands out when browsing the disks is that all the filenames are in capital letters. DOES ANYONE KNOW WHY THIS WAS THE CASE?

File names in this digital archive were also quite cryptic. This is the 1980’s, so filenames conform to the 8.3 limit. Only 8 characters are allowed in a filename and it *may* also include a 3 character file extension.

Note that the file extension really is optional and WordStar version 4 doesn’t enforce the use of a standard file extension. Users were encouraged to use those last 3 characters of the file name to give additional context to the file content rather than to describe the file format itself.

Guidance on file naming from the WordStar 4 manual
Some of the tools and processes we have in place to analyse and process the files in our digital archives use the file extension information to help understand the format. The file naming methodology described here therefore makes me quite uncomfortable!

Marks and Gran tended not to use the file extension in this way (though there are a few examples of this in the archive). The majority of WordStar files have no extension at all. The only really consistent use of file extensions related to their backup files.


Backup files

Scattered amongst the recovered data were a set of files that had the extension BAK. This is clearly a file extension that WordStar creates and uses consistently. These files contained very similar content to other documents within the archive, but typically with just a few differences. They were clearly backup files of some sort, but I wondered whether they had been created automatically or by the writers themselves.

Again the manual was helpful in moving forward my understanding on this:

Backup files from the WordStar 4 manual

This backup procedure is also summarised with the help of a diagram in the WordStar 3 manual:


The backup procedure from WordStar 3 manual


This does help explain why there were so many backup files in the archive. I guess the next question is 'should we keep them?'. It does seem that they are an artefact of the application rather than representing a conscious decision by the writers to back up their files at a particular point in time, and that may impact on their value. However, as discussed in a previous post on preserving Google documents, there could be some benefit in preserving revision history (even if only partial).
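If we do want to appraise them properly, a simple first step might be to pair each BAK file with the current version of the same document so that the differences can be reviewed. Here is a rough sketch of that idea (the directory name is illustrative, and this is not a tested workflow):

    # A rough sketch: pair each WordStar .BAK file with the current version of
    # the same document so the two can be compared during appraisal. WordStar
    # names the backup after the original, so the filename stem is enough to
    # match them up. The directory name is illustrative only.
    import pathlib

    def pair_backups(directory):
        """Return (current_file, backup_file) pairs found in a directory."""
        files = list(pathlib.Path(directory).iterdir())
        backups = {f.stem.upper(): f for f in files if f.suffix.upper() == ".BAK"}
        return [(f, backups[f.stem.upper()]) for f in files
                if f.suffix.upper() != ".BAK" and f.stem.upper() in backups]

    for current, backup in pair_backups("recovered_disk_01"):
        print(f"{current.name} has backup {backup.name}")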



...and finally

My understanding of these WordStar files has come on in leaps and bounds by doing a bit of research and in particular through finding copies of the manuals.

The manuals even explain why, alongside the scripts within the digital archive, we also have a disk that contains a copy of the WordStar application itself.

The very first step in the manual asks users to make a copy of the software:


I do remember having to do this sort of thing in the past! From WordStar 4 manual


Of course the manuals themselves are also incredibly useful in teaching me how to actually use the software. Keystroke based navigation is hardly intuitive to those of us who are now used to using a mouse, but I think that might be the subject of another blog post!




Jenny Mitcham, Digital Archivist

Wednesday, 27 September 2017

The first UK AtoM user group meeting

Yesterday the newly formed UK AtoM user group met for the first time at St John's College Cambridge and I was really pleased that a colleague and I were able to attend.
Bridge of Sighs in Autumn (photo by Sally-Anne Shearn)

This group has been established to provide the growing UK AtoM community with a much needed forum for exchanging ideas and sharing experiences of using AtoM.

The meeting was attended by about 15 people though we were informed that there are nearly 50 people on the email distribution list. Interest in AtoM is certainly increasing in the UK.

As this was our first meeting, those who had made progress with AtoM were encouraged to give a brief presentation covering the following points:
  1. Where are you with AtoM (investigating, testing, using)?
  2. What do you use it for? (cataloguing, accessions, physical storage locations)
  3. What do you like about it/ what works?
  4. What don’t you like about it/ what doesn’t work?
  5. How do you see AtoM fitting into your wider technical infrastructure? (do you have separate location or accession databases etc?)
  6. What unanswered questions do you have?
It was really interesting to find out how others are using AtoM in the UK. A couple of attendees had already upgraded to the new 2.4 release so that was encouraging to see.

I'm not going to summarise the whole meeting but I made a note of people's likes and dislikes (questions 3 and 4 above). There were some common themes that came up.

Note that most users are still using AtoM 2.2 or 2.3; those who have moved to 2.4 haven't had much chance to explore it yet. It may be that some of these comments are already out of date and fixed in the new release.


What works?


AtoM seems to have lots going for it!

The words 'intuitive', 'user friendly', 'simple', 'clear' and 'flexible' were mentioned several times. One attendee described some user testing she carried out during which she found her users just getting on and using it without any introduction or explanation! Clearly a good sign!

The fact that it is standards compliant was mentioned, as was the fact that consistency is enforced. When moving from unstructured finding aids to AtoM it really does help ensure that the right bits of information are included. The fact that AtoM highlights which mandatory fields are missing at the top of a page is really helpful when checking through your own or others' records.

The ability to display digital images was highlighted by others as a key selling point, particularly the browse by digital objects feature.

The way that different bits of the AtoM database interlink was a plus point that was mentioned more than once - this allows you to build up complex interconnecting records using archival descriptions and authority records and these can also be linked to accession records and a physical location.

The locations section of AtoM was thought to be 'a good thing' - for recording information about where in the building each archive is stored. This works well once you get your head around how best to use it.

Integration with Archivematica was mentioned by one user as being a key selling point for them - several people in the room were either using, or thinking of using Archivematica for digital preservation.

The user community itself and the quick and helpful responses to queries posted on the user forum were mentioned by more than one attendee. Also praised was the fact that AtoM is in continuous active development and very much moving in the right direction.


What doesn't work?


Several attendees mentioned the digital object functionality in AtoM. As well as being a clear selling point, it was also highlighted as an area that could be improved. The one-to-one relationship between an archival description and a digital object wasn't thought to be ideal and there was some discussion about linking through to external repositories - it would be nice if items linked in this way could be displayed in the AtoM image carousel even where the url doesn't end in a filename.

The typeahead search suggestions when you enter search terms were not thought to be helpful all of the time. Sometimes the closest matches do not appear in the list of suggested results.

One user mentioned that they would like a publication status that is somewhere in between draft and published. This would be useful for those records that are complete and can be viewed internally by a selected group of users who are logged in but are not available to the wider public.

More than one person mentioned that they would like to see a conservation module in AtoM.

There was some discussion about the lack of an audit trail for descriptions within AtoM. It isn't possible to see who created a record, when it was created and information about updates. This would be really useful for data quality checking, particularly when training new members of staff and volunteers.

Some concerns about scalability were mentioned - for one user with a very large number of records within AtoM, the process of re-indexing can take three days.

When creating creator or access points, the drop down menu doesn’t display all the options so this causes difficulties when trying to link to the right point or establishing whether the desired record is in the system or not. This can be particularly problematic for common surnames as several different records may exist.

There are some issues with the way authority records are created currently, with no automated way of creating a unique identifier and no ability to keep authority records in draft.

A comment about the lack of auto-save and the issue of the web form timing out and losing all of your work seemed to be a shared concern for many attendees.

Other things that were mentioned included an integration with Active Directory and local workarounds that had to be put in place to make finding aids bi-lingual.


Moving forward


The group agreed that it would be useful to keep a running list of these potential areas of development for AtoM and that perhaps in the future members may be able to collaborate to jointly sponsor work to improve AtoM. This would be a really positive outcome for this new network.

I was also able to present on a recent collaboration to enable OAI-PMH harvesting of EAD from AtoM and use it as an opportunity to try to drum up support for further development of this new feature. I had to try and remember what OAI-PMH stood for and I think I got 83% of it right!

Thanks to St John's College Cambridge for hosting. I look forward to our next meeting which we hope to hold here in York in the Spring.


Jenny Mitcham, Digital Archivist

Wednesday, 20 September 2017

Moving a proof of concept into production? It's harder than you might think...

My colleagues and I blogged a lot during the Filling the Digital Preservation Gap Project but I’m aware that I’ve gone a bit quiet on this topic since…

I was going to wait until we had a big success to announce, but follow on work has taken longer than expected. So in the meantime here is an update on where we are and what we are up to.

Background


Just to re-cap, by the end of phase 3 of Filling the Digital Preservation Gap we had created a working proof of concept at the University of York that demonstrated that it is possible to create an automated preservation workflow for research data using PURE, Archivematica, Fedora and Samvera (then called Hydra!).

This is described in our phase 3 project report (and a detailed description of the workflow we were trying to implement was included as an appendix in the phase 2 report).

After the project was over, it was agreed that we should go ahead and move this into production.

Progress has been slower than expected. I hadn’t quite appreciated just how different a proof of concept is to a production-ready environment!

Here are some of the obstacles we have encountered (and in some cases overcome):

Error reporting


One of the key things that we have had to build in to the existing code in order to get it ready for production is error handling.

This was not a priority for the proof of concept. A proof of concept is really designed to demonstrate that something is possible, not to be used in earnest.

If errors happen and things stop working (which they sometimes do) you can just kill it and rebuild.

In a production environment we want to be alerted when something goes wrong so we can work out how to fix it. Alerts and errors are crucial to a system like this.

We are sorting this out by enabling Archivematica's own error handling and error catching within Automation Tools.


What happens when something goes wrong?


...and of course once things have gone wrong in Archivematica and you've fixed the underlying technical issue, you then need to deal with any remaining problems with your information packages in Archivematica.

For example, if the problems have resulted in failed transfers in Archivematica then you need to work out what you are going to do with those failed transfers. Although it is (very) tempting to just clear out Archivematica and start again, colleagues have advised me that it is far more useful to actually try and solve the problems and establish how we might handle a multitude of problematic scenarios if we were in a production environment!

So we now have scenarios in which an automated transfer has failed and, in order to get things moving again, we need to carry out a manual transfer of the dataset into Archivematica. Will the other parts of our workflow still work if we intervene in this way?

One issue we have encountered along the way is that though our automated transfer uses a specific 'datasets' processing configuration that we have set up within Archivematica, when we push things through manually it uses the 'default' processing configuration, which is not what we want.

We are now looking at how we can encourage Archivematica to use the specified processing configuration. As described in the Archivematica documentation, you can do this by including an XML file describing your processing configuration within your transfer.
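In practice this means placing a copy of the chosen processing configuration file at the top level of the transfer before it is started. A minimal sketch of that step is below; the paths and the name of the saved configuration file are illustrative, while the target filename of processingMCP.xml is the one described in the Archivematica documentation:

    # A minimal sketch (paths illustrative): make sure a manual transfer uses
    # our 'datasets' processing configuration by copying the saved
    # configuration into the root of the transfer directory as
    # processingMCP.xml before the transfer is started.
    import pathlib
    import shutil

    def add_processing_config(transfer_dir, config_file="datasets_processingMCP.xml"):
        """Place the chosen processing configuration at the root of a transfer."""
        target = pathlib.Path(transfer_dir) / "processingMCP.xml"
        shutil.copyfile(config_file, target)
        return target

    add_processing_config("/home/transfers/dataset_0123")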

It is useful to learn lessons like this outside of a production environment!


File size/upload


Although our project recognised that there would be a limit to the size of dataset that we could accept and process with our application, we didn't really bottom out what size dataset we intended to support.

It has now been agreed that we should reasonably expect the data deposit form to accept datasets of up to 20 GB in size. Anything larger than this would need to be handled in a different way.

Testing the proof of concept in earnest showed that it was not able to handle datasets of over 1 GB in size. Its primary purpose was to demonstrate the necessary integrations and workflow, not to handle larger files.

Additional (and ongoing) work was required to enable the web deposit form to work with larger datasets.


Space


In testing the application we of course ended up trying to push some quite substantial datasets through it.

This was fine until everything abruptly seemed to stop working!

The problem was actually a fairly simple one but because of our own inexperience with Archivematica it took a while to troubleshoot and get things moving in the right direction again.

It turned out that we hadn’t allocated enough space in one of the bits of filestore that Archivematica uses for failed transfers (/var/archivematica/sharedDirectory/failed). This had filled up and was stopping Archivematica from doing anything else.

Once we knew the cause of the problem the available space was increased, but then everything ground to a halt again because we had quickly used that up too. Increasing the space had got things moving, but of course while we were trying to demonstrate the fact that it wasn't working we had deposited several further datasets, which were waiting in the transfer directory and quickly blocked things up again.

On a related issue, one of the test datasets I had been using to see how well Research Data York could handle larger datasets consisted of c.5 GB of data made up of about 2000 JPEG images. Of course one of the default normalisation tasks in Archivematica is to convert all of these JPEGs to TIFF.

Once this collection of JPEGs was converted to TIFF the size of the dataset increased to around 80 GB. Until I witnessed this it hadn't really occurred to me that this could cause problems.

The solution - allocate Archivematica much more space than you think it will need!

We also now have the filestore set up so that it will inform us when the space in these directories gets to 75% full. Hopefully this will allow us to stop the filestore filling up in the future.
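The check itself doesn't need to be anything sophisticated. Here is a simple sketch of the kind of monitoring we have in mind; the 75% threshold matches the figure above, while the alerting mechanism (a print statement standing in for an email) is illustrative:

    # A simple sketch: warn when the Archivematica shared directory passes a
    # usage threshold (75% here). The alerting mechanism shown (a print
    # statement) is a stand-in for whatever notification is used locally.
    import shutil

    def check_usage(path="/var/archivematica/sharedDirectory", threshold=0.75):
        """Print a warning if the filesystem holding `path` is over the threshold."""
        usage = shutil.disk_usage(path)
        fraction_used = usage.used / usage.total
        if fraction_used >= threshold:
            print(f"WARNING: {path} is {fraction_used:.0%} full "
                  f"({usage.free // 2**30} GB free)")

    check_usage()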


Workflow


The proof of concept did not undergo rigorous testing - it was designed for demonstration purposes only.

During the project we thought long and hard about the deposit, request and preservation workflows that we wanted to support, but we were always aware that once we had it in an environment that we could all play with and test, additional requirements would emerge.

As it happens, we have discovered that the workflow implemented is very true to that described in the appendix of our phase 2 report and does meet our needs. However, there are lots of bits of fine tuning required to enhance the functionality and make the interface more user friendly.

The challenge here is to try to carry out the minimum of work required to turn it into an adequate solution to take into production. There are so many enhancements we could make – I have a wish list as long as my arm – but until we better understand whether a local solution or a shared solution (provided by the Jisc Research Data Shared Service) will be adopted in the future it is not worth trying to make this application perfect.

Making it fit for production is the priority. Bells and whistles can be added later as necessary!





My thanks to all those who have worked on creating, developing, troubleshooting and testing this application and workflow. It couldn't have happened without you!



Jenny Mitcham, Digital Archivist

Monday, 18 September 2017

Harvesting EAD from AtoM: we need your help!

Back in February I published a blog post about a project to develop AtoM to allow EAD (Encoded Archival Description) to be harvested via OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting): “Harvesting EAD from AtoM: a collaborative approach”.

Now that AtoM version 2.4 is released (hooray!), containing the functionality we have sponsored, I thought it was high time I updated you on what has been achieved by this project, where more work is needed and how the wider AtoM community can help.


What was our aim?


Our development work had a few key aims:

  • To enable finding aids from AtoM to be exposed as EAD 2002 XML for others to harvest. The partners who sponsored this project were particularly keen to enable the Archives Hub to harvest their EAD.
  • To change the way that EAD was generated by AtoM in order to make it more scalable. Moving EAD generation from the web browser to the job scheduler was considered to be the best approach here.
  • To make changes to the existing DC (Dublin Core) metadata generation feature so that it also works through the job scheduler - making this existing feature more scalable and able to handle larger quantities of data

A screen shot of the job scheduler in AtoM - showing the EAD and
DC creation jobs that have been completed

What have we achieved?

The good

We believe that the EAD harvesting feature as released in AtoM version 2.4 will enable a harvester such as the Archives Hub to harvest our catalogue metadata from AtoM as EAD. As we add new top level archival descriptions to our catalogue, subsequent harvests should pick up and display these additional records. 

This is a considerable achievement and something that has been on our wishlist for some time. This will allow our finding aids to be more widely signposted. Having our data aggregated and exposed by others is key to ensuring that potential users of our archives can find the information that they need.

Changes have also been made to the way metadata (both EAD and Dublin Core) are generated in AtoM. This means that the solution going forward is more scalable for those AtoM instances that have very large numbers of records or large descriptive hierarchies.

The new functionality in AtoM around OAI-PMH harvesting of EAD and settings for moving XML creation to the job scheduler is described in the AtoM documentation.
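For anyone wanting to check what a harvester will actually see, the OAI-PMH requests themselves are straightforward to script. The sketch below is illustrative only: the base URL is made up, and the 'oai_ead' metadataPrefix is an assumption that should be checked against your own instance's ListMetadataFormats response. It simply pages through ListRecords using resumption tokens:

    # An illustrative sketch of harvesting EAD from an AtoM OAI-PMH endpoint.
    # The base URL is invented and the 'oai_ead' metadataPrefix is an
    # assumption - check your repository's ListMetadataFormats response for
    # the correct value.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url, metadata_prefix="oai_ead"):
        """Yield each <record> from ListRecords, following resumption tokens."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            for record in tree.iter(f"{OAI}record"):
                yield record
            token = tree.find(f"{OAI}ListRecords/{OAI}resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    for record in harvest("https://archives.example.ac.uk/;oai"):
        print(record.find(f"{OAI}header/{OAI}identifier").text)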

The not-so-good

Unfortunately the EAD harvesting functionality within AtoM 2.4 will not do everything we would like it to do. 

It does not at this point include the ability for the harvester to know when metadata records have been updated or deleted. It also does not pick up new child records that are added into an existing descriptive hierarchy. 

We want to be able to edit our records once within AtoM and have any changes reflected in the harvested versions of the data. 

We don’t want our data to become out of sync. 

So clearly this isn't ideal.

The task of enabling full harvesting functionality for EAD was found to be considerably more complex than first anticipated. This has no doubt been compounded by the hierarchical nature of EAD, which differs from the relative simplicity of the traditional Dublin Core approach.

The problems encountered are certainly not insurmountable, but a lack of additional resources and the timeline for the release of AtoM 2.4 stopped us from finishing this work in full.

A note on scalability


Although the development work deliberately set out to consider issues of scalability, it turns out that scalability is actually on a sliding scale!

The National Library of Wales had the forethought to include one of their largest archival descriptions as sample data in the version of AtoM 2.4 that Artefactual deployed for testing. Their finding aid for St David’s Diocesan Records is a very large descriptive hierarchy consisting of 33,961 individual entries. This pushed the limits of EAD creation (even when done via the job scheduler) and also led to discussions with The Archives Hub about exactly how they would process and display such a large description at their end, even if EAD generation within AtoM were successful.

Some more thought and more manual workarounds will need to be put in place to manage the harvesting and subsequent display of large descriptions such as these.

So what next?


We are keen to get AtoM 2.4 installed at the Borthwick Institute for Archives over the next couple of months. We are currently on version 2.2 and would like to start benefiting from all the new features that have since been introduced... and of course to test in earnest the EAD harvesting feature that we have jointly sponsored.

We already know that this feature will not fully meet our needs in its current form, but would like to set up an initial harvest with the Archives Hub and further test some of our assumptions about how this will work.

We may need to put some workarounds in place to ensure that we have a way of reflecting updates and deletions in the harvested data – either through manual deletions and updates or a periodic full delete and re-harvest.

Harvesting in AtoM 2.4 - some things that need to change


So we have a list of priority things that need to be improved in order to get EAD harvesting working more smoothly in the future:


In line with the OAI-PMH specification

  • AtoM needs to expose updates to the metadata to the harvester
  • AtoM needs to expose new records (at any level of description) to the harvester
  • AtoM needs to expose information about deletions to the harvester
  • AtoM also needs to expose information about deletions of DC metadata records to the harvester (it has come to my attention during the course of this project that this isn’t happening at the moment); see the example below for how a deleted record is flagged in an OAI-PMH response
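For illustration, the OAI-PMH specification already provides the mechanism for signalling the deletions mentioned in the list above: a repository that tracks deletions marks the record header with a status attribute, and the harvester then knows to remove its copy. The identifier and datestamp in the fragment below are entirely hypothetical.

<record>
  <header status="deleted">
    <identifier>oai:atom.example.org:top-level-description-123</identifier>
    <datestamp>2017-09-18</datestamp>
  </header>
</record>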

Some other areas of potential work


I also wanted to bring together and highlight some other areas of potential work for the future. These are all things that were discussed during the course of the project but were not within the scope of our original development goals.

  • Harvesting of EAC (Encoded Archival Context) - this is the metadata standard for authority records. Is this something people would like to see enabled in the future? Of course this is only useful if you have someone who actually wants to harvest this information!
  • On the subject of authority records, it would be useful to change the current AtoM EAD template to use @authfilenumber and @source - so that an EAD record can link back to the relevant authority record in the local AtoM site (a hypothetical snippet is sketched after this list). The ability to create rich authority records is such a key strength of AtoM, allowing an institution to weave interconnecting stories about their holdings. If harvesting doesn’t preserve this inter-connectivity then I think we are missing a trick!
  • EAD3 - this development work has deliberately not touched on the new EAD standard. Firstly, this would have been a much bigger job and secondly, we are looking to have our EAD harvested by The Archives Hub and they are not currently working with EAD3. This may be a priority area of work for the future.
  • Subject source - the subject source (for example "Library of Congress Subject Headings") doesn't appear in AtoM generated EAD at the moment even though it can be entered into AtoM - this would be a really useful addition to the EAD.
  • Visible elements - AtoM allows you to decide which elements you wish to display/hide in your local AtoM interface. With the exception of information relating to physical storage, the XML generation tasks currently do not take account of visible elements and will carry out an export of all fields. Further investigation of this should be carried out in the future. If an institution is using the visible elements feature to hide certain bits of information that should not be more widely distributed, they would be concerned if this information was being harvested and displayed elsewhere. As certain elements will be required in order to create valid EAD, this may get complicated!
  • ‘Manual’ EAD generation - the project team discussed the possibility of adding a button to the AtoM user interface so that staff users can manually kick off EAD regeneration for a single descriptive hierarchy. Artefactual suggested this as a method of managing the process of EAD generation for large descriptive hierarchies. You would not want the EAD to regenerate with each minor tweak while a large archival description was undergoing several updates; however, you need to be able to trigger this task when you are ready to do so. It should be possible to switch off the automatic EAD re-generation (which normally triggers when a record is edited and saved) but have a button on the interface that staff can click when they want to initiate the process - for example when all edits are complete. 
  • As part of their work on this project, Artefactual created a simple script to help with the process of generating EAD for large descriptive hierarchies - it basically provides a way of finding out which XML files relate to a specific archival description so that EAD can be manually enhanced and updated if it is too large for AtoM to generate via the job scheduler. It would be useful to turn this script into a command-line task that is maintained as part of the AtoM codebase.
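To illustrate the authority record linking idea mentioned in the list above, the relevant bit of EAD 2002 might look something like the hypothetical fragment below, where @source names the authority file and @authfilenumber carries the identifier (or URL) pointing back to the authority record in the local AtoM site. Exactly what goes in each attribute would be a decision for any future development work.

<origination label="Creator">
  <persname source="local AtoM authority records"
            authfilenumber="https://atom.example.org/doe-jane">
    Doe, Jane (1900-1980)
  </persname>
</origination>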

We need your help!


Although we believe we have something we can work with here and now, we are not under any illusions that this feature does all that it needs to in order to meet our requirements in the longer term. 

I would love to find out what other AtoM users (and harvesters) think of the feature. Is it useful to you? Are there other things we should put on the wishlist? 

There is a lot of additional work described in this post which the original group of project partners are unlikely to be able to fund on their own. If EAD harvesting is a priority to you and your organisation and you think you can contribute to further work in this area either on your own or as part of a collaborative project please do get in touch.


Thanks


I’d like to finish with a huge thanks to those organisations who have helped make this project happen, either through sponsorship, development or testing and feedback.




Jenny Mitcham, Digital Archivist

Friday, 18 August 2017

Benchmarking with the NDSA Levels of Preservation

Anyone who has heard me talk about digital preservation will know that I am a big fan of the NDSA Levels of Preservation.

This is also pretty obvious if you visit me in my office – a print out of the NDSA Levels is pinned to the notice board above my PC monitor!

When talking to students and peers about how to get started in digital preservation in a logical, pragmatic and iterative way, I always recommend the NDSA Levels. Start at level 1 and move forward to the more advanced levels as and when you are able. This is a much more accessible and simpler way to start addressing digital preservation than digesting some of the bigger and more complex certification standards and benchmarking tools.

Over the last few months I have been doing a lot of documentation work, both ensuring that our digital archiving procedures are written down somewhere and setting out where we are going in the future.

As part of this documentation it seemed like a good idea to use the NDSA Levels:

  • to demonstrate where we are
  • to show where improvements need to be made
  • to demonstrate progress in the future


Previously I have used the NDSA Levels in quite a superficial way, as a guide and a talking point; it has been quite a different exercise actually mapping where we stand.

It was not always straightforward to establish where we are and to unpick and interpret exactly what each level meant in practice. I guess this is one of the problems of using a relatively simple set of metrics to describe what is really quite a complex set of processes.

Without publishing the whole document that I've written on this, here is a summary of where I think we are currently. I'm also including some questions I've been grappling with as part of the process.

Storage and geographic location

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 and 4 in place

See the full NDSA levels here


Four years ago we carried out a ‘rescue mission’ to get all digital data in the archives off portable media and on to the digital archive filestore. This now happens as a matter of course when born digital media is received by the archives.

The data isn’t in what I would call a proper digital archive but it is on a fairly well locked down area of University of York filestore.

There are three copies of the data available at any one time (not including the copy that is on original media within the strongrooms). The University stores two copies of the data on spinning disk, one at a data centre on each of its two campuses, with a third copy backed up to tape and kept for 90 days.

I think I can argue that storing the data on two different campuses counts as two different geographic locations, but these locations are both in York and only about a mile apart. I'm not sure they could be described as having different disaster threats, so I'm going to hold back from putting us at Level 3, though IT do seem to have systems in place to ensure that the filestore is migrated on a regular schedule.

Questions:

  • On a practical level, what really constitutes a different geographic location with a different disaster threat? How far away is good enough?


File fixity and data integrity

Currently at LEVEL 4: 'repair your data'

See the full NDSA levels here


Having been in this job for five years now, I can say with confidence that I have never once received file fixity information alongside data that has been submitted to us. Obviously if I did receive it I would check it on ingest, but I cannot envisage this scenario occurring in the near future! I do, however, create fixity information for all content as part of the ingest process.

I use a tool called Foldermatch to ensure that the digital data I have copied into the archive is identical to the original. Foldermatch allows you to compare the contents of two folders and one of the comparison methods (the one I use at ingest) uses checksums to do this.
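For anyone without access to Foldermatch, the underlying idea is straightforward to sketch. The Python below is a minimal illustration of comparing two folders by checksum; it is not what Foldermatch itself does internally, and the hash algorithm and layout are just assumptions.

# A minimal sketch of the idea behind a folder comparison at ingest: confirm
# that a copy is identical to the original by comparing checksums.
# Illustrative only, not a substitute for the tool itself.

import hashlib
from pathlib import Path

def checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a single file, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def folder_manifest(folder):
    """Map each file's path (relative to the folder) to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): checksum(p)
        for p in sorted(folder.rglob("*")) if p.is_file()
    }

def compare(original, copy):
    """Report files that are missing, unexpected or altered in the copy."""
    a, b = folder_manifest(original), folder_manifest(copy)
    missing = sorted(set(a) - set(b))
    extra = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return missing, extra, changed

Keeping the manifest produced here would also give you something to verify against later on.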

Last year I purchased a write blocker for use when working with digital content delivered to us on portable hard drives and memory sticks. A check for viruses is carried out on all content that is ingested into the digital archive, so this fulfils the requirements of level 2 and some of level 3.

Despite putting us at Level 4, I am still very keen to improve our processes and procedures around fixity. Fixity checks are carried out at intervals (several times a month) and these checks are logged but at the moment this is all initiated manually. As the digital archive gets bigger, we will need to re-think our approaches to this important area and find solutions that are scalable.
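As a sketch of what a more scalable approach might look like, something along the following lines could be run from a scheduler (cron, for example) to verify the archive against a stored manifest and log the outcome. The manifest format, paths and log location here are all assumptions for the purposes of illustration, not a description of our actual set-up.

# A sketch of a scheduled fixity check: verify files against a stored manifest
# (one "checksum  relative/path" line per file, as written by tools such as
# sha256sum) and append a summary line to a log. All names are hypothetical.

import hashlib
from datetime import datetime
from pathlib import Path

def sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(archive_root, manifest_file, log_file):
    """Check every file listed in the manifest and log how many failed."""
    root = Path(archive_root)
    failures = []
    for line in Path(manifest_file).read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, rel_path = parts
        target = root / rel_path.strip().lstrip("*")
        if not target.is_file() or sha256(target) != expected:
            failures.append(rel_path.strip())
    with open(log_file, "a") as log:
        log.write(f"{datetime.now().isoformat()} checked manifest: "
                  f"{len(failures)} problem file(s)\n")
    return failures

# Run from a scheduler rather than by hand, e.g.:
# verify("/path/to/digital-archive", "manifest.sha256", "fixity.log")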

Questions:


  • Does it really matter if fixity isn't checked at 'fixed intervals'? That to me suggests a certain rigidity. Do the intervals really need to be fixed or does it not matter as long as it happens within an agreed time frame?
  • At level 2 we are meant to ‘check fixity on all ingests’ - I am unclear as to what is expected here. What would I check if fixity information hasn’t been supplied (as is always the case currently)? Perhaps it means check fixity of the copy of the data that has been made against the fixity information on the original media? I do do that.


Information Security

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 in place

See the full NDSA levels here


Access to the digital archive filestore is limited to the digital archivist and IT staff who administer the filestore. If staff or others need to see copies of data within the digital archive filestore, copies are made elsewhere after appropriate checks are made regarding access permissions. The master copy is always kept on the digital archive filestore to ensure that the authentic original version of the data is maintained. Access restrictions are documented.

We are also moving towards the higher levels here. A recently reported issue concerning a mysterious change of last modified dates for .eml files has led to discussions with colleagues in IT, and I have been informed that an operating system upgrade for the server should include the ability to provide logs of who has done what to files in the archive.

It is worth pointing out that, as I don't currently have systems in place for recording PREMIS (preservation) metadata, I am taking a hands-off approach to preservation planning within the digital archive. Preservation actions such as file migration are few and far between and are recorded in a temporary way until a more robust system is established.


Metadata

Currently at LEVEL 3: 'monitor your data'

See the full NDSA levels here


We do OK with metadata currently (considering a full preservation system is not yet in place). Using DROID at ingest helps fulfil some of the requirements of levels 1 to 3 (essentially, having a record of what was received and where it is).

Our implementation of AtoM as our archival management system has helped fulfil some of the other metadata requirements. It gives us a place to store administrative metadata (who gave it to us and when) as well as providing a platform to surface descriptive metadata about the digital archives that we hold.

Whether or not we actually have descriptive metadata for digital archives will remain an issue. Much metadata for the digital archive can be generated automatically, but descriptive metadata isn't quite as straightforward. In some cases a basic listing is created for files within the digital archive (using Dublin Core as a framework), but this will not happen in all cases. Descriptive metadata typically will not be created until an archive is catalogued, which may come at a later date.

Our plans to implement Archivematica next year will help us get to Level 4, as this will create full preservation metadata for us in the form of PREMIS.

Questions:


  • What is the difference between the 'transformative metadata' mentioned at Level 2 and the preservation metadata mentioned at Level 4? Is this to do with the standards used? For example, at Level 2 you need to be storing metadata about transformations and events that have occurred, but at Level 4 this must be in PREMIS?


File formats

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 in place

See the full NDSA levels here


It took me a while to convince myself that we fulfilled Level 1 here! This is a pretty hard one to crack, especially if you have lots of different archives coming in from different sources, and sometimes with little notice. I think it is useful that the requirement at this level is prefaced with "When you can..."!

Thinking about it, we do do some work in this area - for example:

To get us to Level 2, as part of the ingest process we run DROID to get a list of file formats included within a digital archive. Summary stats are kept within a spreadsheet that covers all content within the digital archive so we can quickly see the range of formats that we hold and find out which archives they are in.
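For anyone doing something similar, producing those summary stats from a DROID export can be scripted fairly easily. The sketch below assumes a CSV export with PUID and FORMAT_NAME columns; column names can vary between DROID versions and export profiles, so treat these as assumptions rather than gospel.

# A sketch of summarising a DROID CSV export by format, roughly the kind of
# summary stats described above. The "PUID" and "FORMAT_NAME" column names
# are assumptions and may differ between DROID versions and export options.

import csv
from collections import Counter

def summarise_droid_export(csv_path):
    """Count occurrences of each identified format in a DROID CSV export."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            puid = (row.get("PUID") or "").strip()
            name = (row.get("FORMAT_NAME") or "").strip()
            if puid or name:
                counts[(puid or "unidentified", name or "unknown")] += 1
    return counts

if __name__ == "__main__":
    for (puid, name), count in summarise_droid_export("droid_export.csv").most_common():
        print(f"{count:6}  {puid:<15} {name}")

The resulting counts could then feed the spreadsheet or be reported in whatever way works for you.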

This record of formats should allow us to move towards Level 3, but we are not there yet. Some pretty informal and fairly ad hoc thinking goes into file format obsolescence, but I won't go as far as saying that we 'monitor' it. I have an awareness of some specific areas of concern in terms of obsolete files (for example, I've still got those WordStar 4.0 files and I really do want to do something with them!) but there are no doubt other formats that need attention that haven't hit my radar yet.

As mentioned earlier, we are not really doing migration right now - not until I have a better system for creating the PREMIS metadata, so Level 4 is still out of reach.

Questions:


  • I do think there is more we could do at Level 1, but there has also been concern raised by colleagues that in being too dictatorial you risk altering the authenticity of the original archive and perhaps losing information about how a person or an organisation worked. I'd be interested to hear how others walk this tricky line.
  • Is it a valid answer to simply note at Level 1 that input into the creation of digital files is never given because it has been decided not to be appropriate in the context in which you are working?
  • I'd love to hear examples of how others monitor and report on file format obsolescence - particularly if this is done in a systematic way.


Conclusions

This has been a useful exercise and it is good to see where we need to progress. Going from using the Levels in the abstract to actually trying to apply them as a tool has been a bit challenging in some areas. I think additional information and examples would be useful to help clear up some of the questions that I have raised.

I've also found that even where we meet a level there are often other ways we could do things better. File fixity and data integrity looks like a strong area for us, but I am all too aware that I would like to find a more sustainable and scalable way to do this. This is something we'll be working on as we get Archivematica in place. Reaching Level 4 shouldn't lead to complacency!

An interesting blog post last year by Shira Peltzman from the UCLA Library talked about Expanding the NDSA Levels of Preservation to include an additional row focused on Access. This seems sensible given that the ability to provide access is the reason why we preserve archives. I would be keen to see this developed further so long as the bar wasn't set too high. At the Borthwick my initial consideration has been preservation - getting the stuff and keeping it safe - but access is something that will be addressed over the next couple of years as we move forward with our plans for Archivematica and AtoM.

Has anyone else assessed themselves against the NDSA Levels?  I would be keen to see how others have interpreted the requirements.








Jenny Mitcham, Digital Archivist
