
Friday, 20 April 2018

The 2nd UK AtoM user group meeting

I was pleased to be able to host the second meeting of the UK AtoM user group here in York at the end of last week. AtoM (or Access to Memory) is the Archival Management System that we use here at the Borthwick Institute and it seems to be increasing in popularity across the UK.

We had 18 attendees from across England, Scotland and Wales representing both archives and service providers. It was great to see several new faces and meet people at different stages of their AtoM implementation.

We started off with introductions and everyone had the chance to mention one recent AtoM triumph and one current problem or challenge. This was a good way to start the conversation, and perhaps a way of identifying future development opportunities and topics for future meetings.

Here is a selection of the successes that were mentioned:

  • Establishing a search facility that searches across two AtoM instances
  • Getting senior management to agree to establishing AtoM
  • Getting AtoM up and running
  • Finally having an online catalogue
  • Working with authority records in AtoM
  • Working with other contributors and getting their records displaying on AtoM
  • Using the API to drive another website
  • Upgrading to version 2.4
  • Importing legacy EAD into AtoM
  • Uploading finding aids into AtoM 2.4
  • Adding 1,000+ URLs to digital resources into AtoM using a set of SQL update statements (see the sketch below)
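
Bulk updates like that last one are easy to script. As a minimal sketch, the Python below generates UPDATE statements from a CSV of id/URL pairs; the table and column names are hypothetical placeholders rather than AtoM's actual schema, so check your own database (and take a backup) before adapting anything like this.

```python
import csv

# Minimal sketch: generate SQL UPDATE statements from a CSV of
# (description_id, url) pairs. The table and column names below are
# hypothetical placeholders - inspect your own AtoM schema (and take
# a backup) before running anything like this for real.
TEMPLATE = (
    "UPDATE digital_resource "        # hypothetical table name
    "SET url = '{url}' "              # hypothetical column name
    "WHERE description_id = {id};"
)

with open("urls.csv", newline="") as f:   # rows of: id,url (no header)
    for row in csv.DictReader(f, fieldnames=["id", "url"]):
        url = row["url"].replace("'", "''")   # escape quotes for SQL
        print(TEMPLATE.format(url=url, id=int(row["id"])))
```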

...and here are some of the current challenges or problems users are trying to solve:
  • How to barcode boxes - can this be linked to AtoM?
  • Moving from CALM to AtoM
  • Not being able to see the record you want to link to when trying to select related records
  • Using the API to move things into an online showcase
  • Advocacy for taking the open source approach
  • Working out where to start and how best to use AtoM
  • Sharing data with the Archives Hub
  • How to record objects alongside archives
  • Issues with harvesting EAD via OAI-PMH
  • Building up the right level of expertise to be able to contribute code back to AtoM
  • Working out what to do when AtoM stops working
  • Discovering that AtoM doesn't enforce uniqueness in identifiers for archival descriptions (see the query sketch below)
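
On that last point, duplicate identifiers are at least easy to detect with a query. Here is a minimal sketch using pymysql; it assumes descriptions live in AtoM's information_object table with an identifier column (verify this against your own instance), and the connection details are placeholders.

```python
import pymysql

# Minimal sketch: report archival description identifiers that appear
# more than once. Assumes AtoM's MySQL schema stores identifiers in
# information_object.identifier - verify against your own instance.
conn = pymysql.connect(host="localhost", user="atom",
                       password="secret", database="atom")  # placeholders
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT identifier, COUNT(*) AS n "
            "FROM information_object "
            "WHERE identifier IS NOT NULL "
            "GROUP BY identifier HAVING n > 1"
        )
        for identifier, count in cur.fetchall():
            print(f"{identifier}: {count} descriptions")
finally:
    conn.close()
```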

After some discussion of the issues that had been raised, Louise Hughes from the University of Gloucestershire showed us her catalogue and talked us through some of the decisions they had made as they set this up.

The University of Gloucestershire's AtoM instance

She praised the digital object functionality and has been using this to add images and audio to the archival descriptions. She was also really happy with the authority records, in particular, being able to view a person and easily see which archives relate to them. She discussed ongoing work to enable records from AtoM to be picked up and displayed within the library catalogue. She hasn't yet started to use AtoM for accessioning but hopes to do so in the future. Adopting all the functionality available within AtoM needs time and thought, and tackling it one step at a time (particularly if you are a lone archivist) makes a lot of sense.

Tracy Deakin from St John's College, Cambridge talked us through some recent work to establish a shared search page for their two institutional AtoM instances. One holds the catalogue of the college archives and the other is for the Special Collections Library. They had taken the decision to implement two separate instances of AtoM as they required separate front pages and the ability to manage editing rights separately. However, as some researchers will find it helpful to search across both instances, a search page has been developed that accesses the Elasticsearch index of each site in order to cross search.

The interface for a shared search across St John's College AtoM sites
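
I don't know the details of St John's implementation, but the general shape of such a cross-search is straightforward: send the same query to each site's Elasticsearch index over HTTP and merge the hits by relevance score. A minimal sketch, in which the hosts, index names and query style are all assumptions rather than St John's actual setup:

```python
import requests

# Minimal sketch: cross-search two AtoM Elasticsearch indexes and merge
# results by relevance score. Hosts, index names and the query style
# are illustrative assumptions, not St John's actual configuration.
SITES = [
    ("archives", "http://atom-archives.example:9200/atom/_search"),
    ("library", "http://atom-library.example:9200/atom/_search"),
]

def cross_search(term, size=10):
    hits = []
    for site, url in SITES:
        query = {"size": size,
                 "query": {"query_string": {"query": term}}}
        resp = requests.post(url, json=query, timeout=10)
        resp.raise_for_status()
        for hit in resp.json()["hits"]["hits"]:
            hits.append((hit["_score"], site, hit["_id"]))
    # Interleave the two result sets by descending score.
    return sorted(hits, reverse=True)[:size]

for score, site, doc_id in cross_search("letters"):
    print(f"{score:6.2f}  {site}  {doc_id}")
```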

Vicky Phillips from the National Library of Wales talked us through their processes for upgrading their AtoM instance to version 2.4 and discussed some of the benefits of moving to 2.4. They are really happy to have the full width treeview and the drag and drop functionality within it.

The upgrade has not been without its challenges though. They have had to sort out some issues with invalid slugs, ongoing issues due to the size of some of their archives (they think the XML caching functionality will help with this), and they sometimes find that MySQL gets overwhelmed by the number of queries and needs a restart. They still have some testing to do around bilingual finding aids and have also been testing the new functionality around OAI-PMH harvesting of EAD.

Following on from this I gave a presentation on upgrading AtoM to 2.4 at the Borthwick Institute. We are not quite there yet but I talked about the upgrade plan and process and some decisions we have made along the way. I won't say any more for the time being as I think this will be the subject of a future blog post.

Before lunch my colleague Charles Fonge introduced VIAF (Virtual International Authority File) to the group. This initiative will enable Authority Records created by different organisations across the world to be linked together more effectively. Several institutions may create an authority record about the same individual and currently it is difficult to allow these to be linked together when data is aggregated by services such as The Archives Hub. It is worth thinking about how we might use VIAF in an AtoM context. At the moment there is no place to store a VIAF ID in AtoM and it was agreed this would be a useful development for the future.

After lunch Justine Taylor from the Honourable Artillery Company introduced us to the topic of back up and disaster recovery of AtoM. She gave the group some useful food for thought, covering techniques and the types of data that would need to be included (hint: it's not solely about the database). This was particularly useful for those working in small institutions who don't have an IT department that just does all this for them as a matter of course. Some useful and relevant information on this subject can be found in the AtoM documentation.
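
Justine's hint that it isn't solely about the database is worth underlining: AtoM also keeps uploaded digital objects and their derivatives on the filesystem. As a minimal sketch of a nightly job, under the assumption of placeholder paths and credentials (the AtoM documentation lists everything worth capturing):

```python
import subprocess, tarfile
from datetime import date
from pathlib import Path

# Minimal sketch of an AtoM backup job. Paths and credentials are
# placeholders; see the AtoM docs for the full list of what to
# capture (database, uploads, downloads, themes, configuration...).
ATOM_ROOT = Path("/usr/share/nginx/atom")     # placeholder install path
BACKUP_DIR = Path("/mnt/backups/atom")
stamp = date.today().isoformat()

# 1. Dump the MySQL database.
with open(BACKUP_DIR / f"atom-{stamp}.sql", "wb") as out:
    subprocess.run(["mysqldump", "-u", "atom", "-psecret", "atom"],
                   stdout=out, check=True)

# 2. Archive the filesystem data that lives outside the database.
with tarfile.open(BACKUP_DIR / f"atom-files-{stamp}.tar.gz", "w:gz") as tar:
    for sub in ("uploads", "downloads"):      # digital objects live here
        tar.add(ATOM_ROOT / sub, arcname=sub)
```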

Max Communications are a company who provide services around AtoM. They talked through some of their work with institutions and what services they can offer. As well as being able to provide hosting and support for AtoM in the UK, they can also help with data migration from other archival management systems (such as CALM). They demonstrated their crosswalker tool that allows archivists to map structured data to ISAD(G) before import to AtoM.

They showed us an AtoM theme they had developed to allow Vimeo videos to be embedded and accessible to users. Although AtoM does have support for video, the files can be very large and there are significant overheads involved in running a video server if substantial quantities are held. Keeping the video outside of AtoM and managing the permissions through Vimeo provided a good solution for one of their clients.

They also demonstrated an AtoM plugin they had developed for Wordpress. Though they are big fans of AtoM, they pointed out that it is not the best platform for creating interesting narratives around archives. They were keen to be able to create stories about archives by pulling in data from AtoM where appropriate.

At the end of the meeting Dan Gillean from Artefactual Systems updated us (via Skype) on the latest AtoM developments. It was really interesting to hear about the new features that will be in version 2.5. None of this is ever a secret - Artefactual make their roadmap and release notes publicly available on their wiki - but it is still helpful to hear it enthusiastically described.

The group was really pleased to hear about the forthcoming audit logging feature, the clever new functionality around calculating creation dates, and the ability for users to save their clipboard across sessions (and share them with the searchroom when they want to access the items). Thanks to those organisations that are funding this exciting new functionality. Also worth a mention is the slightly less sexy, but very valuable work that Artefactual is doing behind the scenes to upgrade Elasticsearch.

Another very useful meeting and my thanks go to all who contributed. It is certainly encouraging to see the thriving and collaborative AtoM community we have here in the UK.

Our next meeting will be in London in the autumn.


Jenny Mitcham, Digital Archivist

Thursday, 29 March 2018

Digital preservation begins at home

A couple of things happened recently to remind me of the fact that I sometimes need to step out of my little bubble of digital preservation expertise.

It is a bubble in which I assume that everyone knows what language I'm speaking, in which everyone knows how important it is to back up your data, knows where their digital assets are stored, how big they might be and even what file formats they hold.

But in order to communicate with donors and depositors I need to move outside that bubble otherwise opportunities may be missed.

A disaster story

Firstly a relative of mine lost their laptop...along with all their digital photographs, documents etc.

I won't tell you who they are or how they lost it for fear of embarrassing them...

It wasn’t backed up...or at least not in a consistent way.

How can this have happened?

I am such a vocal advocate of digital preservation and do try and communicate outside my echo chamber (see for example my blog for International Digital Preservation Day "Save your digital stuff!") but perhaps I should take this message closer to home.

Lesson #1:

Digital preservation advocacy should definitely begin at home

When a back up is not a back up...

In a slightly delayed response to this sad event I resolved to help another family member ensure that their data was 'safe'. I was directed to their computer and a portable hard drive that is used as their back up. They confessed that they didn’t back up their digital photographs very often...and couldn’t remember the last time they had actually done so.

I asked where their files were stored on the computer and they didn’t know (well at least, they couldn’t explain it to me verbally).

They could however show me how they get to them, so from that point I could work it out. Essentially everything was in ‘My Documents’ or ‘My Pictures’.

Lesson #2:

Don’t assume anything. Just because someone uses a computer regularly it doesn’t mean they know where they put things.

Having looked firstly at what was on the computer and then what was on the hard drive it became apparent that the hard drive was not actually a ‘back up’ of the PC at all, but contained copies of data from a previous PC.

Nothing on the current PC was backed up and nothing on the hard drive was backed up.

There were however multiple copies of the same thing on the portable hard drive. I guess some people might consider that a back up of sorts but certainly not a very robust one.

So I spent a bit of time ensuring that there were two copies of everything (one on the PC and one on the portable hard drive) and promised to come back and do it again in a few months' time.

Lesson #3:

Just because someone says they have 'a back up' it does not mean it actually is a back up.
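
One way to test whether 'a back up' really is one is to checksum both copies and compare them. A minimal sketch (the two root paths are placeholders):

```python
import hashlib
from pathlib import Path

# Minimal sketch: verify that two copies of a folder really match by
# comparing SHA-256 checksums. The two root paths are placeholders.
def checksums(root):
    result = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root)] = digest
    return result

pc = checksums("C:/Users/family/Pictures")
backup = checksums("E:/Pictures")

for name in sorted(set(pc) | set(backup)):
    if name not in backup:
        print(f"missing from backup: {name}")
    elif name not in pc:
        print(f"only in backup:      {name}")
    elif pc[name] != backup[name]:
        print(f"copies differ:       {name}")
```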

Talking to donors and depositors

All of this made me re-evaluate my communication with potential donors and depositors.

Not everyone is confident in communicating about digital archives. Not everyone speaks the same language or uses the same words to mean the same thing.

In a recent example of this, someone who was discussing the transfer of a digital archive to the Borthwick talked about a 'database'. I prepared myself to receive a set of related tables of structured data alongside accompanying documentation describing field names and table relationships. However, as the conversation evolved it became apparent that there was no database at all: the term had simply been used to describe a collection of unstructured documents and images.

I'm taking this as a timely reminder that, from this point forth, I should leave my assumptions behind when communicating about digital archives or digital housekeeping practices.

Jenny Mitcham, Digital Archivist

Friday, 20 October 2017

Understanding WordStar - check out the manuals!

Last month I was pleased to be able to give a presentation at 'After the Digital Revolution' about some of the work I have been doing on the WordStar 4.0 files in the Marks and Gran digital archive that we hold here at the Borthwick Institute for Archives. This event specifically focused on literary archives.

It was some time ago now that I first wrote about these files that were recovered from 5.25 inch floppy (really floppy) disks deposited with us in 2009.

My original post described the process of re-discovery, data capture and file format identification - basically the steps that were carried out to get some level of control over the material and put it somewhere safe.

I recorded some of my initial observations about the files but offered no conclusions about the reasons for the idiosyncrasies.

I’ve since been able to spend a bit more time looking at the files and investigating the creating application (WordStar) so in my presentation at this event I was able to talk at length (too long as usual) about WordStar and early word processing. A topic guaranteed to bring out my inner geek!

WordStar is not an application I had any experience with in the past. I didn't start word processing until the early 1990s when my archaeology essays and undergraduate dissertation were typed up in a DOS version of WordPerfect. Prior to that I used a typewriter (now I feel old!).

WordStar by all accounts was ahead of its time. It was the first word processing application to include mail merge functionality. It was hugely influential, introducing a number of keyboard shortcuts that are still used in modern word processing applications today (for example Ctrl+B to make text bold). Users interacted with WordStar entirely through the keyboard, selecting the necessary keystrokes from a set of menus. The computer mouse (if one was present at all) was entirely redundant.

WordStar was widely used as home computing and word processing increased in popularity through the 1980s and into the early 1990s. However, with the introduction of Word for Windows in 1989 and Windows 3.0 the following year, WordStar gradually fell out of favour (info from Wikipedia).

Despite this it seems that WordStar had a loyal band of followers, particularly among writers. Of course the word processor was the key tool of their trade so if they found an application they were comfortable with it is understandable that they might want to stick with it.

I was therefore not surprised to hear that others presenting at 'After the Digital Revolution' also had WordStar files in their literary archives. Clear opportunities for collaboration here! If we are all thinking about how to provide access to and preserve these files for the future then wouldn't it be useful to talk about it together?

I've already learnt a lot through conversations with the National Library of New Zealand who have been carrying out work in this area (read all about it here: Gattuso J, McKinney P (2014) Converting WordStar to HTML4. iPres.)

However, this blog post is not about defining a preservation strategy for the files; it is about better understanding them. My efforts have been greatly helped by finding copies of both a WordStar 3 manual and a WordStar 4 manual online.

As noted in my previous post on this subject there were a few things that stand out when first looking at the recovered WordStar files and I've used the manuals and other research avenues to try and understand these better.


Created and last modified dates

The Marks and Gran digital archive consists of 174 files, most of which are WordStar files (and I believe them to be WordStar version 4).

Looking at the details that appear on the title pages of some of the scripts, the material appears to be from the period 1984 to 1987 (though not everything is dated).

However the system dates associated with the files themselves tell a different story. 

The majority of files in the archive have a creation date of 1st January 1980.

This was odd. Not only would that have been a very busy New Year's Day for the screen writing duo, but the timestamps on the files suggest that they were also working in the very early hours of the morning - perhaps unexpected when many people are out celebrating having just seen in the New Year!

This is the point at which I properly lost my faith in technical metadata!

In this period computers weren't quite as clever as they are today. When you switched them on they would ask you what date it was. If you didn't tell them the date, the PC would fall back to a system default...which just so happens to be 1st January 1980.

I was interested to see Abby Adams from the Harry Ransom Center, University of Texas at Austin (also presenting at 'After the Digital Revolution') flag up some similarly suspicious dates on files in a digital archive held at her institution. Her dates differed just slightly from mine, falling on the evening of 31st December 1979. Again, these dates looked unreliable as they were clearly out of line with the rest of the collection.

This is the same issue as mine, but the differences relate to the timezone. There is further explanation here highlighted by David Clipsham when I threw the question out to Twitter. Thanks!
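
This default is also easy to spot programmatically: the DOS/FAT filesystem can't record anything earlier than 1st January 1980, so any capture whose timestamps cluster there (or, once timezones intervene, late on 31st December 1979) deserves suspicion. A quick sketch for flagging such files, with a placeholder directory name:

```python
from datetime import date, datetime, timedelta
from pathlib import Path

# Minimal sketch: flag files whose modification dates sit suspiciously
# close to the DOS/FAT epoch of 1 January 1980 (allowing a day either
# side for timezone shifts like the 31 December 1979 dates above).
DOS_EPOCH = date(1980, 1, 1)

def suspicious(root):
    for path in Path(root).rglob("*"):
        if path.is_file():
            mtime = datetime.fromtimestamp(path.stat().st_mtime)
            if abs(mtime.date() - DOS_EPOCH) <= timedelta(days=1):
                yield path, mtime

for path, mtime in suspicious("recovered_disks"):   # placeholder path
    print(f"{mtime:%Y-%m-%d %H:%M}  {path}")
```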


Fragmentation

Another thing I had noticed about the files was the way that they were broken up into fragments. The script for a single episode was not saved as a single file but typically as 3 or 4 separate files. These files were named in such a way that it was clear that they were related and that the order that the files should be viewed or accessed was apparent - for example GINGER1, GINGER2 or PILOT and PILOTB.

This seemed curious to me - why not just save the document as a single file? The WordStar 4 manual didn't offer any clues but I found this piece of information in the WordStar 3 manual which describes how files should be split up to help manage the storage space on your diskettes:

From the WordStar 3 manual

Perhaps some of the files in the digital archive are from WordStar 3, or perhaps Marks and Gran had been previously using WordStar 3 and had just got into the habit of splitting a document into several files in order to ensure they didn't run out of space on their floppy disks.

I cannot imagine working this way today! Technology really has come on a long way. Imagine trying to format, review or spell check a document that exists as several discrete files, potentially sitting on different media!


Filenames

One thing that stands out when browsing the disks is that all the filenames are in capital letters. DOES ANYONE KNOW WHY THIS WAS THE CASE?

File names in this digital archive were also quite cryptic. These files date from the 1980s, so filenames conform to the 8.3 limit: only 8 characters are allowed in a filename and it *may* also include a 3 character file extension.

Note that the file extension really is optional and WordStar version 4 doesn’t enforce the use of a standard file extension. Users were encouraged to use those last 3 characters of the file name to give additional context to the file content rather than to describe the file format itself.

Guidance on file naming from the WordStar 4 manual
Some of the tools and processes we have in place to analyse and process the files in our digital archives use the file extension information to help understand the format. The file naming methodology described here therefore makes me quite uncomfortable!

Marks and Gran tended not to use the file extension in this way (though there are a few examples of this in the archive). The majority of their WordStar files have no extension at all. The only consistent use of file extensions related to their backup files.
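
A quick way to see how (in)consistently extensions were used across a captured disk is simply to count them. A minimal sketch (the path is a placeholder):

```python
from collections import Counter
from pathlib import Path

# Minimal sketch: survey file extensions across a directory of
# recovered files to see how (in)consistently they were used.
counts = Counter(
    path.suffix.upper() or "(no extension)"
    for path in Path("recovered_disks").rglob("*")   # placeholder path
    if path.is_file()
)

for ext, n in counts.most_common():
    print(f"{ext:15} {n}")
```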


Backup files

Scattered amongst the recovered data were a set of files with the extension BAK - a file extension that WordStar creates and uses consistently. These files contained very similar content to other documents within the archive, typically with just a few differences. They were clearly backup files of some sort, but I wondered whether they had been created automatically or by the writers themselves.

Again the manual was helpful in moving forward my understanding on this:

Backup files from the WordStar 4 manual

This backup procedure is also summarised with the help of a diagram in the WordStar 3 manual:


The backup procedure from WordStar 3 manual


This does help explain why there were so many backup files in the archive. I guess the next question is 'should we keep them?'. It does seem that they are an artefact of the application rather than a conscious effort by the writers to back their files up at a particular point in time, and that may have a bearing on their value. However, as discussed in a previous post on preserving Google documents, there could be some benefit in preserving revision history (even if only partial).
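
Comparing a BAK file with its current counterpart is one way to gauge how much revision history actually survives. A rough sketch, with placeholder file names; WordStar sets the high bit on some characters, so the code crudely strips it before diffing rather than assuming plain text:

```python
import difflib

# Rough sketch: compare a WordStar file with its .BAK predecessor to
# see how much the two versions differ. WordStar sets the high bit on
# some characters, so strip it first - a crude normalisation that is
# good enough for eyeballing. The file names are placeholders.
def wordstar_lines(path):
    with open(path, "rb") as f:
        text = "".join(chr(b & 0x7F) for b in f.read())
    return text.splitlines(keepends=True)

previous = wordstar_lines("GINGER1.BAK")
current = wordstar_lines("GINGER1")

for line in difflib.unified_diff(previous, current,
                                 fromfile="GINGER1.BAK",
                                 tofile="GINGER1", n=1):
    print(line, end="")
```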



...and finally

My understanding of these WordStar files has come on in leaps and bounds by doing a bit of research and in particular through finding copies of the manuals.

The manuals even explain why alongside the scripts within the digital archive we also have a disk that contains a copy of the WordStar application itself. 

The very first step in the manual asks users to make a copy of the software:


From the WordStar 4 manual - I do remember having to do this sort of thing in the past!


Of course the manuals themselves are also incredibly useful in teaching me how to actually use the software. Keystroke based navigation is hardly intuitive to those of us who are now used to using a mouse, but I think that might be the subject of another blog post!




Jenny Mitcham, Digital Archivist

Tuesday, 5 January 2016

When digital preservation really matters...

Of course digital preservation always matters* but recent events in York and beyond over the festive period really do highlight the importance of looking after your stuff - both physical and digital.

Not everyone is lucky enough to get much warning before a disaster of any type strikes but in some situations (such as that which I found myself in just after Christmas) we have some time to prepare.

Hang on...there isn't normally a lake near my house

Beyond relocating important things such as the hamster and photo albums upstairs and moving the Christmas decorations higher up the tree, it is also important to remember the digital...



Digital is robust in some respects but perhaps more at risk in others. Robust in that it is possible to very quickly make as many additional copies as you like and store them in different places (perfect for a disaster scenario such as this), but the risk is that it is more easily forgotten.

Of course I back up my personal data (digital photographs mostly) regularly, but with the chaos of the build up to Christmas I had not done so for a few weeks, so was prompted to do so before unplugging the PC and moving it to higher ground.

We were some of the lucky ones in York - the water levels didn't reach us so the preparations were not necessary, but others were not so lucky. Many houses and businesses in York and in other areas of the country were flooded, and many did not have the luxury of time to prepare for the worst. The very basics of digital preservation (maintaining a regular back up strategy and storing copies of the data in different locations) really are something that should happen proactively, not just in response to specific threats.



* I have to say that - it is in my job description

Jenny Mitcham, Digital Archivist
