Monday 19 November 2018

The sustainability of a digital preservation blog...

So this is a topic pretty close to home for me.

Oh the irony of spending much of the last couple of months fretting about the future preservation of my digital preservation blog...!

I have invested a fair bit of time over the last 6 or so years writing posts for this blog. Not only has it been useful for me as a way of documenting what I've done or what I've learnt, but it has also been of interest to a much wider audience.

The most viewed posts have been on the topic of preserving Google Drive formats and disseminating the outputs of the Filling the Digital Preservation Gap project. Access statistics show that the audience is truly international.

When I decided to accept a job elsewhere I was of course concerned about what would happen to my blog. I hoped that all would be well, given that Blogger is a Google supported solution and part of the suite of Google tools that University of York staff and students use. But what would happen when my institutional Google account was closed down?

Initially I believed that as long as I handed over ownership of the blog to another member of staff who remained at the University, then all would be well. However, I soon realised that there were going to be some bigger challenges.

The problem

Once I leave the institution and my IT account is closed, Blogger will no longer have a record of who I am.

All posts that have been written my me will be marked as 'Unknown'. They will no longer have my name automatically associated with them. Not ideal from my perspective and also not ideal for anyone who might want to cite the blog posts in the future.

The other problem is the fact that once my account is closed down, all images within blog posts that I have posted will completely disappear.

This is pretty bad news!

When a member of staff adds images to a blog post the usual method of doing this is to select an image from the local PC or network drive. What Google then does is stores a copy of that image in https://get.google.com/albumarchive/ (in a location that is tied to that individual's account). When the account is closed, all of these blog related images are also wiped. The images are not recoverable.

So, I could make copies of all my images now and hand them to my colleagues, so that they could put them all back in again once I leave...but who is going to want to do that?

A solution of sorts

I asked IT Support to help me, and a colleague has had some success at extracting the contents of my blog, amending the image urls in the XML and importing the posts back into a test Blogger account with images hosted in a location that isn't associated with an individual staff account.

There is a description of how this result was achieved here and I'm hugely grateful for all of the time that was spent trying to fix this problem for me.

The XML was also amended directly to add the words 'Jenny Mitcham, Digital Archivist' to the end of every blog post, to save me having to open each of the 120 posts in turn and adding my name to them. That was a big help too.

So, in my last couple of weeks at work I have been experimenting with importing the tweaked XML file back into blogger.

Initially, I just imported the XML file back into the blog without deleting the original blog posts. I had understood that the imported blogs would merge with the original ones and that all would be well. Unfortunately though, I ended up with two versions of each blog post - the original one and the new one at a slightly different url.

So, I held my breath, took the plunge and deleted everything and then re-imported the amended XML.

I had envisaged that the imported blog posts would be assigned their original urls but was disappointed to see that this was not the case. Despite the url being included within the XML, blogger clearly had a record that these urls had already been used and would not re-use them.

I know some people link to the blog posts from other blogs and websites. I also interlink between blog posts from within the blog, so a change to all the urls will lead to lots of broken links. Bad news!

I tried going into individual posts and changing the permalink by hand back to the original link, but Blogger would not accept this and kept adding a different number to the end of the url to ensure it did not replicate the url of one of my deleted posts. Hugely frustrating!

Luckily my colleague in IT came up with an alternative solution, adding some clever code into the header of the blog which carries out a search every time a page is requested. This seems to work well, serving up one or more posts based on the url that is requested. Being that the new urls are very similar to the old ones (essentially the same but with some numbers added to the end), the search is very effective and the right post is served up at the top of the page. Hopefully this will work for the foreseeable future and should lead to minimal impact for users of the blog.


Advice for Blogger users

If you are using Blogger from an institutional Google account, think about what will happen to your posts after your account is closed down.

There are a few things you can do to help future proof the blog:
  • Host images externally in a location that isn't tied to your institutional account - for example a Google Team Drive or an institutional website - link to this location from the blog post rather than uploading images directly.
  • Ensure that your name is associated with the blog posts you write by hard coding it in to the text of your blog post - don't rely on blogger knowing who you are forever.
  • Ensure that there are others who have administrative control of the blog so that it continues after your account has been closed.
And lastly - if just starting out, consider using a different blogging platform. Presumably they are not all this unsustainable...!

Apologies...

Unfortunately, with the tweak that has been made to how the images are hosted and pulled in to the posts, some of them appear to have degraded in quality. I began to edit each post and resizing the images (which appears to fix the problem) but have run out of time to work through 120 posts before my account is closed.

Generally, if an image looks bad in the blog, you can see a clearer version of it if you click on it so this isn't a disaster.

Also, there may be some images that are out of place - I have found (and fixed) one example of this but have not had time to check all of them.

Apologies to anyone who subscribes to this blog - I understand you may have had a lot of random emails as a result of me re-importing or republishing blog posts over the last few weeks!

Thanks to...

As well as thanking Tom Smith at the University of York for his help with fixing the blog, I'd also like to thank the Web Archiving team at the British Library who very promptly harvested my blog before we started messing around with it. Knowing that it was already preserved and available within a web archive did give me comfort as I repeatedly broke it!

A plea to Google

Blogger could (and should) be a much more sustainable blogging platform. It should be able to handle situations where someone's account closes down. It should be possible to make the blogs (including images) more portable. It should be possible for an institution to create a blog that can be handed from one staff member to another without breaking it. A blog should be able to outlive its primary author.

I genuinely don't think these things would be that hard for a clever developer at Google to fix. The current situation creates a very real headache for those of us who have put a lot of time and effort into creating content within this platform.

It really doesn't need to be this way!


Thursday 15 November 2018

Goodbye and thanks

This is my last day as Digital Archivist for the University of York.

Next week I will be taking on a brand new post as Head of Standards and Good Practice at the Digital Preservation Coalition. This is an exciting move for me but it is with some sadness that I leave the Borthwick Institute and University of York behind.

I have been working in digital preservation at the University of York for the last 15 years. Initially with the Archaeology Data Service as part of the team that preserves and disseminates digital data produced by archaeologists in the UK; and since 2012, branching out to work with many other types of digital material at the Borthwick Institute.

This last six years has been both interesting and challenging and I have learnt a huge amount.

Perhaps the biggest change for me was moving from being one of a team of digital archivists to being a lone digital archivist. I think this is one of the reasons I started this blog. I missed having other digital archivists around who were happy to endlessly discuss the merits of different preservation file formats and tools!

Blogging about my work at the Borthwick became a helpful way for me to use the wider digital preservation community as a sounding board and for sense checking what I was doing. I have received some really helpful advice in the comments and the blogs have led to many interesting discussions on Twitter.

In a discipline where resources are often scarce, it makes no sense for us all to quietly invent the same wheel in our own local contexts. Admittedly there is no one-size-fits-all solution to digital preservation, but talking about what we do and learning from each other is so very important.

Of course there have been challenges along the way...

It is difficult to solve a problem that does not have clear boundaries. The use cases for digital preservation in a large institution are complex and ever growing.

I began by focusing on the born digital archives that come to the Borthwick from our donors and depositors. Perhaps if that were the only challenge, we would be further down the line of solving it...

However, we also have the complexities of research data to consider, the huge volumes of digitised content we are producing, the need to digitise audio-visual archives and preserve them in digital formats, the need to preserve the institutional record (including websites, social media, email), and the desire to preserve theses in digital formats. On top of this, is the increasing need to be able to provide access to digital resources. 

The use cases overlap and are not neatly bounded. Multiple integrations with other systems are required to ensure that preservation processes are seamless and can be implemented at scale.

I have frequently reached the limit of my own technical ability. I am an archaeologist with probably above average IT skills but I can only get so far with the knowledge I have. Getting the right level of technical support to move digital preservation forward is key. 

So, I’ve made some mistakes, I’ve changed my mind about some things, I’ve often tried to do too much, but ultimately I've had the freedom to try things out and to share those experiences with the wider community.

Some lessons learned from my 6 years at the Borthwick:
  • Doing something is normally better than doing nothing
  • Accept solutions that are 'good enough' ...don't hold out for 'perfect'
  • Try things out. Research and planning are important, but it is hard to fully understand things without diving in and having a go
  • Digital continuity actually begins quite close to home - consider the sustainability of your blogging platform!

The biggest lesson for me perhaps has been that I have spent much of my 6 years chasing the somewhat elusive dream of an all-singing-all-dancing 'digital preservation system', but in actual fact, the interim measures I have put in place at the Borthwick might be just about ‘good enough’ for the time being.

It is not always helpful to think about digital preservation in 'forever' terms. It is more realistic to consider our role to be to keep digital archives safe to hand over to the next person. Indeed, digital preservation has frequently been likened to a relay race.

So I hereby finish my leg of this particular race and hand over the baton to the next digital archivist...

A big thank you and goodbye to all my colleagues at the Borthwick Institute and across Information Services. It has been fun! :-)




Thursday 8 November 2018

Testing manual normalisation workflows in Archivematica

This week I traveled to Warwick University for the UK Archivematica meeting. As usual, it was a really interesting day. I’m not going to blog about all of it (I’ll leave that to our host, Rachel MacGregor) but I will blog about the work I presented there.

Followers of my blog will be aware that I recently carried out some file migration work on a batch of WordStar 4 files from screenwriting duo Marks and Gran.

The final piece of work I wanted to carry out was to consider how we might move the original files along with the migrated versions of those files into Archivematica (if we were to adopt Archivematica in the future).

I knew the migration I had carried out was a bit of an odd one so I was particularly interested to see how Archivematica would handle it.

It was odd for a number of reasons.

1. Firstly, I ended up creating 3 different versions of each WordStar file – DOCX, PDF/A and ASCII TXT. After an assessment of the significant properties of the files (essentially, the features that I wanted to preserve) and some intense QA of the files, it was clear that all versions were imperfect. None of them preserved all the properties I wanted them to. They all had strengths and weaknesses and between them, they pretty much captured everything.


2. Secondly I wasn’t able to say whether the files I had created were for preservation or access purposes. Typically, a file migration process will make a clear distinction between those files that are for dissemination to users and those files that are preservation copies to keep behind the scenes.

After talking to my colleagues about the file migrations and discussing the pros and cons of the resulting files, it was agreed that we could potentially use any (or all) of the formats to provide access to future users, depending on user needs. I’m not particularly happy with any of the versions I created being a preservation files, given the fact that none of them capture all the elements of the file that I considered to be of value, but they may indeed need to become preservation versions in future if WordStar 4 files become impossible to read.


3. Thirdly the names of the migrated versions of the files did not exactly match up with the original files. The original WordStar files were created in the mid 1980’s. In this early period of home computing, file extensions appeared to be optional. The WordStar 4 manual actually suggests that you use the 3 characters available for the file extension to record additional information that won't fit in the 8 character filename.

For many of the WordStar files in this archive, there is no file extension at all. For other files, the advice in the manual has been followed, in that an extension has been used which gives additional context to the filename. For many of the files WordStar has also created a backup file (essentially an earlier version of the file is saved with a .BAK extension but it is still a Wordstar file). There are therefore scenarios where we need to save information about the original file extension in the migrated version in order to ensure that we don’t create filenaming conflicts and don’t lose information from the original filename.



Why not just normalise within Archivematica?

  • These WordStar files are not recognised by the file identification tools in Archivematica (I don’t think they are recognised by any file identification tools). The Format Policy Registry only works on those files that are identified for which there is a rule/policy set up
  • Even if they were identifiable, it would not be possible to replicate some of the manual steps we went through to create the migrated versions with command line tools called by Archivematica. 
  • As part of the migration process itself, several days were spent doing QA and checking of the migrated files against the originals as viewed in WordStar 4 and detailed documentation (to be stored alongside the files) was created. Archivematica does give you a decision point after a normalisation so that checking or QA can be carried out, but we’d need to find a way of doing the QA, creating the necessary documentation and associating it with the AIP half way through the ingest process.



How are manually normalised files handled in Archivematica?

I’d been aware for a while that there is a workflow for manually normalised files in Archivematica and I was keen to see how it would work. Reading the documentation,  it was clear that the workflow allows for a number of different approaches (for example normalising files before ingest or at a later date) but was also quite specific regarding the assumptions it is based on.

There is an assumption that the names of original and normalised files will be identical (apart from the file extension which will have changed). However, there is a workaround in place which allows you to include a csv file with your transfer. The csv file should provide information about how the originals are related to the preservation and/or access files. Given the filename issues described above, this was something I would need to include.

There is an assumption that you will know whether the migrated versions of files are intended for preservation or for access. This is a fair assumption – and very much in line with digital preservation thinking, but does it reflect the imperfect real world?

There is also an assumption that there will be no more than one preservation file and no more than one access file for each original (happy to be corrected on this if I am wrong).



Testing the manual normalisation workflow in Archivematica

Without a fully working test version of Archivematica to try things out on, my experimentation was limited. However, a friendly Archivematica guru (thanks Matthew) was able to push a couple of test files through for me and provide me with the AIP and the DIP to inspect.

The good news is that the basic workflow did work – we were able to push an already normalised ‘access’ file and ‘preservation’ file into Archivematica along with the original as part of the ingest process. The preservation files appeared in the AIP and the access files appeared in the DIP as expected.

We also investigated the workflow for adding additional metadata about the migration.

Archivematica creates PREMIS metadata as part of the ingest process, recording the details and outcomes of events (such as virus checks and file normalisations) that it carries out. The fact that Archivematica creates PREMIS events automatically has always been a big selling point for me. As I have mentioned before – who wants to create PREMIS by hand?

Where files are included using the manual normalisation workflow, Archivematica will always create a PREMIS event for the normalisation and if you set up your processing configuration in the right way, it will stop and prompt you to add additional information into the PREMIS eventDetail field. This is a good start but it would be great if a more detailed level of information could be included in the PREMIS metadata.

I wondered what would happen to all the documentation I had created? I concluded that the best way of keeping this documentation alongside the files would be by putting it in the SubmissionDocumentation directory as described in the manual and submitting along with the transfer. This information will be stored with the AIP but the link between the documentation and the normalised files may not be immediately apparent.

What I didn’t test was whether it is possible to push more than one access file into Archivematica using this workflow. I'm assuming that Archivematica will not support this scenario.


Some suggested improvements to the manual normalisation workflow in Archivematica

So, I wanted to highlight a few improvements that could be made.

  1. Allow the user to add PREMIS eventDetail in bulk – at the moment you have to enter the information one file at a time
  2. Allow the user to add information into more than one PREMIS field. Being able to add the actual date of the manual migration into the date field would be a good start. Being able to add an event outcome would also be helpful (by default event outcome is 'None' for files added using the manual normalisation workflow).
  3. Even better – allow the user to be able to add this more detailed PREMIS information through a csv import. Another spreadsheet containing information about the normalisation event could be included in the transfer and be automatically associated with the files and included in the METS.
  4. Allow Archivematica to accept more than one access or preservation file


In conclusion

Testing shows that Archivematica's workflow for manual normalisation will work for a standard scenario and will cope well with changes in filenames (with the addition of a normalization.csv file). However, do not assume it will currently handle more complex scenarios.

I accept that it is impossible for a digital preservation system to do everything we might want it to do. It can’t be flexible and adaptable to every imperfect use case. As a baseline it does have to know whether to put a manually migrated file into the DIP for access or AIP for preservation and it would perhaps not be fair to suggest it should cope with uncertainty.

In his forthcoming book, Trevor Owens makes the interesting point that "Specialized digital preservation tools and software are just as likely to get in the way of solving your digital preservation problems as they are to help." I do therefore wonder how we should address the gap between the somewhat rigid, rule-based tools and the sometimes imperfect real world scenarios?



Jenny Mitcham, Digital Archivist

Monday 22 October 2018

Probably my last UK AtoM user group meeting

This week the 3rd UK AtoM users group meeting was held at the Honourable Artillery Company (HAC) in London. A packed and interesting programme had been put together by Justine Taylor and it was great to see how well attended it was. Indeed a room change was required to accommodate the number of people who wanted to attend.

Elizabeth Wells from Westminster School Archives started off the presentations by talking about how she is using AtoM to catalogue objects and artefacts. Several of us in the room have items in their care that are not archives, but I think Westminster School were the only archive to be looking after a 92 year old pancake! Being able to catalogue such items in AtoM is a high priority for many AtoM users given that they don’t want to manage multiple systems.

It is really interesting to hear how different institutions use AtoM and in particular the workarounds they use to resolve specific problems. Elizabeth talked us through the processes she has put in place for storing additional information about objects (such as valuations) that she doesn’t want to make available to the wider public. She mentioned how useful a previous UK AtoM meeting was in highlighting the fact that information within an archival description that is hidden from view within the AtoM interface will still be available to users if they download the EAD. This was a concern so she is using the accessions module of AtoM to store information that is confidential.

She also mentioned that she was using the RAD template for describing the objects in her collections. These can sit within an ISAD(G) hierarchy, but the RAD standard gives more flexibility to record different types of items. I had not realised that AtoM allowed you to chop and change between the templates in this way so this was really interesting to hear.

Victoria Peters from Strathclyde University talked to us about their work to link AtoM with their Library Catalogue interface Primo (SUPrimo - the best name ever for a Primo library catalogue!). Following on from York’s own work in this area, they enabled top level records from AtoM to be harvested into Primo and this allows staff and students to more easily discover things that are available in the archives.

They have also been thinking about how to best surface special collections. Special collections are catalogued at item level within the library catalogue but there is no overarching record describing each of the collections (for example who collected the material and why), and no obvious way to enter this information into the library catalogue, which doesn't support hierarchical descriptions. Information about special collections isn't discoverable from AtoM and there is no way to cross link with information that is held by the archives even though there are obvious links between material held in the archives and special collections.

The solution they have come up with is to add a description of each of the special collections into AtoM. This allows links to be made between related archives and special collections and will really help those users who are browsing the archives catalogue to see which special collections they may also be interested in. The description within AtoM then links back to the individual items within SUPrimo for more detailed item level information.

Victoria summed this work up by saying that it isn’t perfect but was a pretty quick and effective way of solving a problem. As a consequence, both archives and special collections are more discoverable and the links between them are clearer. Users do not need to know whether they should go to the library catalogue or the archives catalogue as both archives and special collections are signposted from both systems.

I then updated the group on work to enable EAD harvesting in AtoM. I have previously blogged about phase 1 of the project and wanted to talk about more recent testing since we have upgraded to AtoM 2.4 and future plans to make the harvesting functionality better. This may be the subject of a future blog post….if I have time!

Caroline Catchpole from The National Archives followed on from my presentation to tell us about Discovery and their future plans. The ability to harvest EAD from systems like AtoM is still very much on their wishlist but the development resource is not currently available. She has however extracted some EAD from various AtoM sites in the UK so that she can explore how easy it would be to incorporate it into Discovery. She talked through some of the problems with the “unwieldy beast” that is EAD and how different implementations and lack of consistency can cause problems for aggregators.

After lunch Justine Taylor our host talked us through how she is using the Function entity in AtoM. She has been experimenting with AtoM’s functions as a way to create a useful company structure to hold information about what key activities HAC carries out. This will be another useful way for users to browse the catalogue and find information that is of interest to them.

Lucy Shepherd from Imperial College gave us a brief overview of preparatory work around establishing AtoM and Archivematica. They have not yet got this up and running but she is thinking about how it will be used and what deposit workflows will be put in place. She sees the AtoM community as a key selling point, but mentioned that were potential challenges around finding the time to complete this exploratory work and what systems their IT department would support.

Matthew Addis from Arkivum gave us a nice demo of the integration between AtoM and Archivematica and talked through an issue around how the two systems share metadata (or not as the case may be). He has been investigating this because Arkivum's Perpetua service includes both AtoM and Archivematica and a good integration between the two products is something that is required by their customers. He described the use case where clients have digital objects and metadata to add in batches. They want automated preservation using Archivematica, the master copy protected in long term storage and an access version accessible in AtoM with rich and hierarchical metadata to give context and enable search and retrieval.

AtoM supports bulk imports and hierarchical description, but when digital objects are passed through Archivematica, the metadata within the Dissemination Information Package (DIP) is flattened - only Dublin Core metadata is passed to AtoM through the DIP. Archivematica however, will accept various types of metadata and will store them in its Archival Information Package (AIP). This is a potential problem because valuable metadata that is stored in Archivematica will not be associated with the dissemination copy in AtoM unless it is Dublin Core.

Matthew demonstrated a workaround he has been using to get the right level of metadata into AtoM. After digital objects have been transferred from Archivematica to AtoM at the right point in an existing hierarchy, he then imported additional metadata directly into AtoM using the CSV import to enhance the basic Dublin Core metadata that has come through AtoM. He suggested that configuring AtoM with the slugs generated from the identifier field makes this process easier to automate. He is still thinking about this issue, and in particular whether the AIP in Archivematica could be enhanced by metadata from AtoM.

Geoff Browell from King's College London talked to us about an ambitious project to create an AtoM catalogue for the whole of Africa. The Archives Africa project has been working with The National Archives of Madagascar and exploring a lightweight way of getting local descriptions into an AtoM instance hosted in the UK using spreadsheets and email.

Lastly, we had an update from Dan Gillean from Artefactual Systems which included some news about initial technical planning for AtoM 3 and an update on the AtoM Foundation. The Foundation has been set up to oversee and support the development, sustainability and adoption of AtoM, specifically in relation to AtoM 3. Dan talked about the benefits in moving the governance of AtoM outside of Artefactual Systems and establishing a more diverse ecosystem. The Foundation will be collecting information from AtoM users about the functionality that is required in AtoM 3 at some point in the future. Dan also revealed that AtoM version 2.4.1 should be with us very soon and that the next UK AtoM Camp will be held at the University of Westminster in July 2019.

I anticipate this will be my last UK AtoM user group meeting given that I am moving on to pastures new next month. It has been really encouraging to see how much the user community in the UK has grown since my first involvement in AtoM back in 2014 and it is great to see the active knowledge sharing and collaboration in the UK user group. Long may it continue!



This post was written by Jenny Mitcham, Digital Archivist

Friday 28 September 2018

Auditing the digital archive filestore

A couple of months ago I blogged about checksums and the methodology I have in place to ensure that I can verify the integrity and authenticity of the files within the digital archive.

I was aware that my current workflows for integrity checking were 'good enough' for the scale at which I'm currently working, but that there was room for improvement. This is often the case when there are humans involved in a process. What if I forget to create checksums to a directory? What happens if I forget to run the checksum verification?

Also, I am aware that checksum verification does not solve everything. For example, read all about The mysterious case of the changed last modified dates. Also, When checksums don't match... the checksum verification process doesn't tell you what has changed, who has changed it, when it was changed...it just tells you that something has changed. So perhaps we need more information.

A colleague in IT Services here at York mentioned to me that after an operating system upgrade on the filestore server last year, there is now auditing support (a bit more information here). This wasn't being widely used yet but it was an option if I wanted to give it a try and see what it did.

This seemed like an interesting idea so we have given it a whirl. With a bit of help (and the right level of permissions on the filestore), I have switched on auditing for the digital archive.

My helpful IT colleague showed me an example of the logs that were coming through. It has been a busy week in the digital archive. I have ingested 11 memory sticks, 24 CD-ROM and a pile of floppy disks. The logs were extensive and not very user friendly in the first instance.

That morning I had wanted to find out the total size of the born digital archives in the digital archive filestore and had right clicked on the folder and selected 'properties'. This had produced tens of thousands of lines of XML in the filestore logs as the attributes of each individual file had to be accessed by the server in order to make the calculation. The audit logs really are capable of auditing everything that happens to the files!

...but do I really need that level of information? Too much information is a problem if it hides the useful stuff.

It is possible to configure the logging so that it looks for specific types of events. So, while I am not specifically interested in accesses to the files, I am interested in changes to them. We configured the auditing to record only certain types of events (as illustrated below). This cuts down the size of the resulting logs and restricts it just to those things that might be of interest to me.




There is little point in switching this on if it is not going to be of use. So what do I intend to do with the output?

The format this is created in is XML, but this would be more user-friendly in a spreadsheet. IT have worked out how to pull out the relevant bits of the log into a tab delimited format that I can then open in a spreadsheet application.

What I have is some basic information about the date and time of the event, who initiated it, the type of event (eg RENAME, WRITE, ATTRIBUTE|WRITE) and the folder/file that was affected.

As I can view this in a spreadsheet application, it is so easy to reorder the columns to look for unexpected or unusual activity.

  • Was there anyone other than me working on the filestore? (there shouldn't be right now)
  • Was there any activity on a date I wasn't in the office?
  • Were there any activity in a folder I wasn't intentionally working on?
The current plan is that these logs will be emailed to me on a weekly basis and I will have a brief check to ensure all looks OK. This will sit alongside my regular integrity checking as another means of assuring that all is as it should be.

We'll review how this is working a few weeks to see if it continues to be a valuable exercise or should be tweaked further.

In my Benchmarking with the NDSA Levels of Preservation post last year, I put us at level 2 for Information Security (as highlighted in green below).



See the full NDSA levels here

Now we have switched on this auditing feature and have a plan in place for regular checking of the logs, does this now take us to level 4 or is more work required?

I'd be really interested to find out whether other digital archivists are utilising filestore audit logs and what processes and procedures are in place to monitor these.

Final thoughts...

This was a quick win and hopefully will prove a useful tool for the digital archive her at the University of York. It is also a nice little example of collaboration between IT and Archives staff.

I sometimes think that IT people and digital preservation folk don't talk enough. If we take the time to talk and to explain our needs and use cases, then the chances are that IT might have some helpful solutions to share. The tools that we need to do our jobs effectively are sometimes already in place in our institutions. We just need to talk to the right people to get them working for us.

Jenny Mitcham, Digital Archivist

Thursday 16 August 2018

What are the significant properties of a WordStar file?

I blogged a couple of months ago about an imperfect file migration.

One of the reasons this was imperfect (aside from the fact that perhaps all file migrations are imperfect - see below) was because it was an emergency rescue triggered by our Windows 10 upgrade.



Digital preservation is about making best use of your resources to mitigate the most pressing preservation threats and risks. This is a point that Trevor Owens makes very clearly in his excellent new book The Theory and Craft of Digital Preservation (draft).

I saw an immediate risk and I happened to have available resource (someone working with me on placement), so it seemed a good idea to dive in and take action.

This has led to a slightly back-to-front approach to file migration. We took urgent action and in the following months have had time to reflect, carry out QA and document the significant properties of the files.

Significant properties are the characteristics of a file that should be retained in order to create an adequate representation. We've been thinking about what it is we are really trying to preserve? What are the important features of these documents?

Again, Trevor Owens has some really useful insights on this process and numerous helpful examples in The Theory and Craft of Digital Preservation. The following is one of my favourite quotes from his book, and is particularly relevant in this context:
“The answer to nearly all-digital preservation question is “it depends.” In almost every case, the details matter. Deciding what matters about an object or a set of objects is largely contingent on what their future use might be.”
So, in fact the title of this blog post is wrong. There is no use in me asking "What are the significant properties of a WordStar file?" - the real question is "What are the significant properties of this particular set of WordStar files from the Marks and Gran archive?"

To answer this question, a selection of the WordStar files were manually inspected (within a copy of WordStar) to understand how the files were constructed and formatted.

Particular attention was given to how the document was laid out and to the presence of Control and Dot commands. Control commands are markup proceeded by ^ within WordStar - for example ^B to denote bold text). Dot commands are (not surprisingly) proceeded by ‘.’ within WordStar - for example, ‘.OP’ to indicate that page numbering should be omitted within the printed document.

These commands, along with use of carriage returns and horizontal spacing show the intention of the authors.

A few other things have helped with this piece of research.


It is worth also considering the intention of the authors. It seems logical to assume that these documents were created with the intention of printing them out. The use of WordStar was a means to an end - the printed copy to hand out to the actors being the end goal.

I've made the assumption that what we are trying to preserve is the content of the documents in the format that the creator intended, not the original experience of working with the documents within WordStar.


Properties considered to be significant


The words

Perhaps stating the obvious...but the characters and words present, and their order on the page are the primary intellectual property of this set of documents. This includes the use of upper and lower case. Upper case is typically used to indicate actions or instructions to the actor and also to indicate who is speaking.

It also includes formatting mistakes or  typos, for example in some files the character # is used instead of £. # and £ are often confused depending on whether your keyboard is set up as US or UK configuration. This is a problem that people experience today but appears to go back to the mid 1980’s.


Carriage returns

Another key characteristic of the documents is the arrangement of words into paragraphs. This is done in the original document using carriage returns. The screenplays would make little sense without them.

New line for the name of the character.
Blank line.
New line for the dialogue
Another blank line

The carriage returns make the screenplay readable. Without them it is very difficult to follow what is going on, and the look and feel of the documents is entirely different.


Bold text

Some of the text in these files is marked up as bold (using ^B). For example headings at the start of each scene and information on the title page. Bold text is occasionally used for emphasis within the dialogue and thus gives additional information to the reader as to how the text should be delivered: for example “^Bgood^B dog”


Alternate pitch

Alternate pitch is a new concept to me, but it appears in the WordStar files with an opening command of ^A and a closing command of ^N to mark the point at which the formatting should return to ‘Standard pitch’.

The Marks and Gran WordStar files appear make use of this command to emphasise particular sections of text. For example, one character tells another that he has seen Hubert with “^Aanother woman^N”. The fact that these words are displayed differently on the page, gives the actor additional instruction as to how these words should be spoken.

The Wordstar 4.0 manual describes how alternate pitch relates to the character width that is used when the file is printed and that “WordStar has a default alternate pitch of 12 characters per inch”.

However, in a printed physical script that was located for reference (which appears to correspond to some of the files in the digital archive), it appears that text marked as alternate pitch was printed in italics. We can not be sure that this would always be the case.

However this may be interpreted, the most important point is that Marks and Gran wanted these sections of text to stand out in some way and this is therefore a property that is significant.


Underlined text

In a small number of documents there is underlined text (marked up with ^S in WordStar). This is typically used for titles and headings.

As well as being marked up with ^S, underlined text in the documents typically has underscores instead of spaces. This is no doubt because (as described in the manual), spaces between underlined words are not underlined when you print. Using underscores presumably ensures that spaces are also underlined without impacting on the meaning of the text.


Monospace font

Although font type can make a substantial difference to the layout of a page, the concept of font (as we understand it today) does not seem to be a property of the WordStar 4.0 files themselves, however I do think we can comfortably say that the files were designed to be printed using a monospace font.

WordStar itself does not provide the ability to assign a specific font, and the fact that the interface is not WYSIWYG means font can not be assumed by seeing the document in a native environment.

Searching for 'font' within the WordStar manual brings up references to 'italics font' for example but not modern font type as we know it. It does however talk about using the .PS command to change to 'proportional spacing'. As described in the manual:

"Proportional spacing means that each character is allocated space that is proportional to the character's actual width. For example, an i is narrower than an m, so it is allocated less horizontal space in the line when printed. In monospacing (nonproportional spacing), all characters are allocated the same horizontal space regardless of the actual character width."

The .PS command is not used in the Marks and Gran WordStar files so we can assume that monospace font is used.

This is backed up by looking at the physical screenplays that we have in the Marks and Gran archive. The font on contemporary physical items is a serif font that looks similar to Courier.

This is also consistent with the description of screenplays as described on wikipedia: “The standard font is 12 point, 10 pitch Courier Typeface”.

Courier font is also mentioned in the description of a WordStar migration by Jay Gattuso and Peter McKinney (2014).


Hard page breaks

The Marks and Gran WordStar files make frequent use of hard page breaks. In a small selection of files that were inspected in detail, 65% of pages ended with a hard page break. A hard page break is visible in the WordStar file as the .PA command that appears at the bottom of the page.

As described in the wikipedia page on screenplays “The format is structured so that one page equates to roughly one minute of screen time, though this is only used as a ballpark estimate”.

This may help explain the frequent use of hard page breaks in these documents. As this is a deliberate action and impacts on the look of the final screenplay this is a property that is considered significant.


Text justification

In most of the documents, the authors have positioned the text in a particular way on the page, centering the headings and indenting the text so it sits towards the right of the page. In many documents, the name of a character sits on a new line and is centred, and the actual dialogue appears below. This formatting is deliberate and impacts on the look and feel of the document thus is considered a significant property.


Page numbering

Page numbering is another feature that was deliberately controlled by the document creators.

Many documents start with the .OP command that means ‘omit page numbering’.

In some documents page numbering is deliberately started at a later point in the document (after the title pages) with the .PN1 command to indicate (in this instance) that the page numbering should start at this point with page 1.

Screenplay files in this archive are characteristically split into several files (as is the recommended practice for longer documents created in WordStar). As these separate files are intended to be combined into a single document once printed, the inclusion of page numbers would have been helpful. In some cases Marks and Gran have deliberately edited the starting page number for each individual file to ensure that the order of the final screenplay is clear. For example the file CRIME5 starts with .PN31 (the first 30 pages clearly being in files CRIME1 to CRIME4).


Number of pages

The number of pages is considered to be significant for this collection of WordStar files. This is because of the way that Marks and Gran made frequent use of hard page breaks to control how much text appeared on each page and occasionally used the page numbering command in WordStar.

Note however that this is not an exact science given that other properties (for example font type and font size) also have an impact on how much text is included on each page.

Just to go back to my previous point that the question in the title of this blog is not really valid...

Other work that has been carried out on the preservation of a collection of WordStar files at the National Library of New Zealand reached a different conclusion about the number of pages. As described by Jay Gattuso and Peter McKinney, the documents they were working with were not screenplays, they were oral history transcripts and they describe one of their decisions below:
"We had to consider therefore if people had referenced these documents and how they did so. Did they (or would they in future) reference by page number? The decision was made that in this case, the movement of text across pages was allowable as accurate reference would be made through timepoints noted in the text rather than page numbers. However, it was an impact that required some considerable attention."
Different type of content = different decisions.


Headers

Several documents make use of a document header that defines the text that should appear at the top of every document in the printed copy. Sometimes the information in the header is not included elsewhere in the document and provides valuable metadata - for example the fact that a file is described in the header as "REVISED SECOND DRAFT” is potentially very useful to future users of the resource so this information (and ideally it's placement within the header of the documents as appropriate) should be retained.


Corruption

This is an interesting one. Can corruption be considered to be a significant property of a file? I think perhaps it can.

One of the 19 disks from the Marks and Gran digital archive appears to have suffered from some sort of corruption at some stage in its life. Five of the files on this disk display a jumble of apparently meaningless characters at one or more points within the text. This behaviour has not been noted on any of the other files on the other disks.



The corruption can not be fixed. The original content that has been lost can not be replaced. It therefore needs to be retained in some form.

There is a question around how this corruption is presented to future users of the digital archive. It should be clear that some content is missing because corruption has occurred, but it is not essential that the exact manifestation of the corruption is preserved in access copies. Perhaps a note along the lines of ...

[THE FILE WAS CORRUPT AT THIS POINT. SOME CONTENT IS NO LONGER AVAILABLE]

...would be helpful?

Very interested to hear how others have dealt with this issue.



Properties not considered to be significant:

Other properties noted within the document were thought to be less significant and are described below:

Font size

The size of a font will have a substantial impact on the layout and pagination of a document. This appears to have been controlled using the Character Width (.CW) command as described in the manual:

"In WordStar, the default character width is 12/120 inch. To change character width, use the dot command .CW followed by the new width in 120ths of an inch. For example, the 12/120 inch default is written as .CW 12. This is 10 characters per inch, which is normal pitch for pica type. "

The documents I'm working with do not use the .CW command so will accept the defaults. Trying to work out what this actually means in modern font sizes is making my head hurt. Help needed!

As mentioned above, the description of screenplays on wikipedia states that: “The standard font is 12 point, 10 pitch Courier Typeface”. We could use this as a guide but can't always be sure that this standard was followed.

In the National Library of New Zealand WordStar migration the original font is considered to be 10 point Courier.


Word wrap

Where hard carriage returns are not used to denote a new line,  WordStar will wrap the text onto a new line as appropriate.

As this operation was outside the control of the document creator, this property isn’t considered to be significant. This decision is also documented by the National Library of New Zealand in their work with WordStar files as discussed in Gattuso and McKinney (2014).


Soft page breaks

Where hard page breaks are not used, text flows on to the next page automatically.

As this operation was not directly controlled by the document creator it is not considered to be significant.


In conclusion

Defining significant properties is not an exact science, particularly given that properties are often interlinked. Note, that I have considered number of pages to be significant but other factors such as font size, word wrap and soft page breaks (that will clearly influence the number of pages) to be not so significant. Perhaps there is a flaw in my approach but I'm running with this for the time being!

This is a work in progress and comments and thoughts are very welcome.

I hope to blog another time about how these properties are (or are not) being preserved in the migrated files.


Jenny Mitcham, Digital Archivist

Tuesday 31 July 2018

Checksum or Fixity? Which tool is for me?

The digital preservation community are in agreement that file fixity and data integrity are important.

Indeed there is a whole row devoted to this in the NDSA Levels of Preservation. But how do we all do it? There are many tools out there - see for example those listed in COPTR.

It was noted in the NDSA Fixity Survey Report of 2017 that there isn't really a consensus on how checksums are created and verified across the digital preservation community. Many respondents to the survey also gave the impression that current procedures were a work in progress and that other options should be explored.

From the conclusion of the report:
"Respondents also talked frequently about improvements they wanted to make, such as looking for new tools, developing workflows, and making other changes to their fixity activities. It may be useful, then, for practitioners to think of their fixity practices within a maturity or continuous improvement model; digital preservation practitioners develop a set of ideal practices for their institution and continuously evaluate methods and the allocation of available resources to get as close as possible to their ideal state."

This nicely sums up how I feel about a lot of the digital preservation routines I put into practice ...but continuous improvement needs time.

Life is busy and technology moves fast.

I realised that I haven't reviewed my tools and procedures for fixity checking since 2013.

A  recent upgrade of my PC to Windows 10 gave me a good excuse to change this. Being that I was going to have to re-install and configure all my tools post-upgrade anyway, this was the catalyst I needed to review the way that I currently create and verify checksums within the digital archive.

Current procedures using Checksum

Since 2013 I have been using a tool called Checksum to generate and verify checksums. Checksum describes itself as a "blisteringly fast, no-nonsense file hashing application for Windows" and this has worked well for me in practice. One of the key selling points of Checksum is it's speed. This has become a more important requirement over time as the digital archive has grown in size.

I have a set of procedures around this tool that document and describe how it is configured and how it is used, both as part of the ingest process and as a routine integrity checking task. I keep a log of the dates that checksums are verified, numbers of files checked and the results of these checks and am able to respond to issues and errors as and when they occur.

There has been little drama on this score over the last 5 years apart from a brief blip when my checksums didn't match and the realisation that my integrity checking routine wasn't going to catch things like changed last modified dates.

I'm not unhappy with Checksum but I am aware that there is room for improvement. Getting it set up and configured correctly isn't as easy as I would like. I sometimes wonder if there are things I'm missing. In the past I have scheduled a regular checksum verification task using Windows Task Scheduler as this is not a feature of Checksum itself but more recently I've just been initiating it manually on a regular schedule.


Introducing Fixity

Fixity is a free tool from AVP. It has been around since 2013 but hasn't hit my radar until recent times. It was mentioned several times in the NDSA Fixity Survey Report and I was keen to try it out.



Fixity was created in recognition of the key role that checksum generation and validation have in digital preservation workflows. The intention of the developers was to provide institutions with a simple and low cost (in fact...free) tool that allows checksums to be generated and validated and that enables tasks to be scheduled and reports to be generated.

The Fixity User Guide is a wonderful thing. From personal experience I would say that one of the key barriers to adopting new tools is not being able to understand exactly what they do and why.

Documentation for open source tools can sometimes be a bit scattered and impenetrable, or occasionally too technical for me to understand - not the Fixity User Guide!

It starts off by explaining what problem it is trying to solve and it includes step by step instructions with screen shots and an FAQ at the back. Full marks!

Testing Fixity

The Graphical User Interface (GUI)

I like the interface for Fixity. It is clear and easy to understand, and it gives you the flexibility to manage different fixity checks you may want to set up for different types of content or areas of your filestore.

First impressions were that it is certainly easier to use than the Checksum tool I use currently. On the downside though, there were a few glitches or bugs that I encountered when using it.

Yes, I did manage to break it.

Yes, I did have to use the Task Manager to shut it down on several occasions.

Reporting

The reporting is good. I found this to be clearer and more more flexible than the reports generated by Checksum. It helps that it is presented in a format that can be opened in a spreadsheet application - this means that you can explore the results in whatever way you like.

Fixity will send you an email summary with statistics about how many files were changed, new, removed or moved/renamed. Attached to this email is a tab delimited file that includes all the details. This can be read into a spreadsheet application for closer inspection. You can quickly filter by the status column and focus in on those files that are new or changed. A helpful way of checking whether all is as it should be.

A useful summary report emailed to me from Fixity

One of the limitations of Checksum is that I can add new files into a directory and if I forget to update the checksum file it will not tell me that I have a new file - in theory that file could go without having its fixity created or checked for some time...or, if I enable the 'synchronise' option it will add the new checksums to the hash file when it next does a verification task. This is helpful, but perhaps I don't want them to be added without notifying me in some way. I would prefer to cast my eye over them to double check that they should be in there.


Scheduling

You can use the Fixity GUI to schedule checksum verification events - so you don't have to kick off your weekly or monthly scan manually. This is a nice feature. If your PC isn't switched on when a scheduled scan is due to start, it simply delays it until you next switch your computer on.

One of the downsides of running a scheduled scan in Fixity is that there is little feedback on the screen to let you know what it is doing and where it has got to. Also given that a scan can take quite a while (over 5 hours in my case) it would be helpful to have a notification to remind you that the scheduled scan is going to start and to allow you to opt out or reschedule if you know that your PC isn't going to be switched on for long enough to make it worthwhile.


Managing projects

The Fixity GUI will allow you to manage more than one project - meaning that you could set up and save different scans on different content using different settings. This is a great feature and a real selling point if you have this use case (I don't at the moment).

One downside I found when trying to test this out was when moving between different projects Fixity kept giving me the message that there are unsaved changes in my project and asking if I want to save - I don't believe there were unsaved changes at this point so this is perhaps a bug?

Also, selecting the checksum algorithm for your project can be a little clunky. You have to save your project before you can choose which algorithm you would like to use. This feature is hidden up in the Preferences menu but I'd prefer to see it up front and visible in the GUI so you can't forget about it or ignore it.

I thought I had set my project up to use the MD5 algorithm but when looking at the summary report I realise it is using SHA256. I thought it would be a fairly easy procedure to go back into my project and change the algorithm but now Fixity is stuck on a message saying 'Reading Files, please wait...'. It may be reprocessing all of the files in my project but I don't know for sure because there is no indicator of progress. If this is the case I expect I will have to switch my PC off before it has finished.


Progress

Following on from the comment above...one really useful addition to Fixity would be for it to give an indication of how far through a checksum verification task it is (and how long it has left). To be fair, the feedback around progress in Checksum, is not exact around timings - but it does give a clear notification about how many checksum files it still has to look at (there is one in each directory) and how many individual files are left to look at in the directory it is currently in.


Timings

When you kick off a checksum verification task Fixity comes up with a message that says 'Please do not close until a report is generated'. This is useful, but it doesn't give an indication of how long the scan might take, so you have no idea when you kick off the task how long you will need to keep your computer on for. I've had to close it down in the middle of the scan on a few occasions and I don't know whether this has any impact on the checksums that are stored for next time.


The message you get when you manually initiate a scan using Fixity

Fixity certainly takes longer to run a scan than the Checksum tool that I have used over the last few years. The last scan of the digital archive took 5 hours and 49 minutes (though this was using SHA256). The fixity check using Checksum (with MD5) takes around 1 and a half hours so the difference is not inconsequential.

Note I'm currently trying to change the checksum algorithm on my Fixity test project to MD5 so I can do a fairer comparison and it has been thinking about it for over 3 hours.


Where are the checksums stored?

In the Checksum tool I use currently, the checksums themselves are stored in a little text file in each directory of the digital archive. This could be seen as a good thing (all the information you need to verify the files is stored and backed up with the files) or a bad thing (the digital archive is being contaminated with additional administrative files, as are my DROID reports!).

Fixity however, stores the actual checksums in a /History/ folder and also in a SQLite database. When you first install Fixity it asks you to specify a location for the History folder. I initially changed this to my C: drive to stop it clogging up my limited profile space, but this may not have been the best option in retrospect as my C: drive isn't backed up. It would certainly be useful to have backup copies of the checksums elsewhere, though we could debate at length what exactly should be kept and for how long.


How to keep checksums up to date

My current workflow using the Checksum application is to create or update checksums as part of the ingest process or as preservation actions are carried out so that I always have a current checksum for every file that is being preserved.

I'm unsure how I would do this if I was using Fixity. Is there a way of adding checksums into the database or history log without actually running the whole integrity check? I don't think so.

Given the time it takes Fixity to run, running the whole project might not be a practical solution in many cases so files may sit for a little while before checksums are created. There are risks involved with this.


The verdict

Fixity is a good tool and I would recommend it to digital archivists who are looking for a user-friendly introduction to fixity checking if they don't have a large quantity of digital content. The documentation, reporting and GUI are clear and user friendly and it is pretty easy to get started with it.

One of the nice things about Fixity is that it has been developed for the digital preservation community with our digital preservation use cases in mind. Perhaps the reason why the documentation is so clear is because it feels like it is written for me (I like to think I am a part of the designated community for Fixity!). In contrast, I find it quite difficult to understand the documentation for Checksum or to know whether I have discovered all that it is capable of.

However, after careful consideration I have decided to stick with Checksum for the time being. The main reason being speed. Clearly checksum's claims to be a "blisteringly fast, no-nonsense file hashing application for Windows" are not unfounded! I don't have a huge digital archive to manage at 187 GB but the ability to verify all of my checksums in 1.5 hours instead of 5+ hours is a compelling argument. The digital archive is only going to grow, and speed of operation will become more important over time.

Knowing that I can quickly create or update checksums as part of the ingest or preservation planning process is also a big reason for me to stick to current workflows.

It has been really interesting testing Fixity and comparing it with the functionality of my existing Checksum tool. There are lots of good things about it and it has helped me to review what I want from a fixity checking tool and to examine the workflows and preferences that I have in place with Checksum.

As part of a cycle of continuous improvement as mentioned in the NDSA Fixity Survey Report of 2017 I hope to review file fixity procedures again in another few years. Fixity has potential and I'll certainly be keeping my eye on it as an option for the future.



Jenny Mitcham, Digital Archivist

The sustainability of a digital preservation blog...

So this is a topic pretty close to home for me. Oh the irony of spending much of the last couple of months fretting about the future prese...