Wednesday, 24 September 2014

To crop or not to crop? Preparing images for page turning applications

How do you prepare digital images of physical archive volumes for display within a web-based page turning application?

I thought this was going to be a fairly straightforward question when I was faced with it a couple of months ago.

Over the summer I have been supervising an internship project with the goal of finalising a set of existing digital images for display within a page turning application. The images were digital surrogates of the visitation records for the Archdeaconry of York between 1598 and 1690 (for more information about these records see our project page on the Borthwick website).

I soon realised that there are many ways of approaching this problem and few standard answers.

Google is normally my friend but googling the problem surfaced only guidelines geared towards particular tools and technologies - not the generic guides to good practice in this area that I was hoping for.

Page turning for digital versions of modern books is fairly straightforward. They will be uniform in size and shape, with few idiosyncrasies. The images will be cropped right down to the edges of the page resulting in a crisp and consistent presentation. 

However, we have slightly different expectations of digital surrogates of an archival volume. When photographing material from the archives it is good practice to leave a clear border around the edge of the physical document. This makes it explicit that the whole page has been captured and helps people to make a judgement on the authenticity of the digital surrogate. 

For archival volumes we have decided the best strategy is to leave a thin border around the edges of the page as shown on the left. The problem with the right image is that it is not clear that the whole page has been captured.

Volumes that we find in the archives are unique and idiosyncratic and often refuse to conform to the standard that we see in modern books. Exceptions are the norm in archives and this can make digitisation and display slightly more challenging. Page turning can work in this context, but it does require a little more thought:

Volumes within the archives do not always have straight edges!

  • Bound volumes within the archives are not always uniform. Straight edges are rare. Damage is sometimes present, and pages may even have holes in them, allowing other pages to be visible underneath. Should such pages be imaged as is, or should we insert a sheet underneath the page so we can see only the page that is being imaged?
  • Page size may not be consistent. A volume may contain pages of all different shapes and sizes. Fold outs may be present - meaning that a page may be larger than the size of the cover. Fold outs may have writing on both sides.
  • Inserts may be present and can occur in all shapes and sizes. They may be scattered throughout the volume or may be all inserted at the start or the end of the volume. Is their current position in the volume indicative of where they should appear within the page turning application? Should they be photographed in situ (difficult if they are folded and are larger than the volume) or removed from the volume for photography? Should they be displayed as part of the page turning application or as separate (but related) items within the interface?
  • Archival volumes may not all be in one piece. The original cover for the volume may have been separated from the pages. The pages may be loose. Should the page turning application display these volumes as they exist today, or attempt to reconstruct the volume as it once was?

There are lots of different ways we could address these challenges. Here is a summary of some of the lessons we have learnt:

  • Thoroughly assess the physical copies before digitisation commences - having an idea of what challenges you will encounter will help. It is best to work out a strategy for the volume as a whole at the start of the process and have to image the volume only once, rather than have to go back and re-image specific pages (bearing in mind you will need to try and ensure any new images are consistent with the previous ones to ensure a good page turning experience for the end user). If you come across fold outs, inserts or holes, decide how you are going to image them.
  • As part of this assessment process, seek the help of a conservator if there are pages for which a good image could not be easily captured (for example if a corner of the page is folded over and obscuring text). A conservator may be able to treat the document prior to digitisation to enable a better image to be captured.
  • Choose a background that will be suitable for the whole volume and stick to it.
  • Crop images to the spine of the book but with a small border around the other edges of the page. Try to keep a consistent crop size for the resulting images, but accept the fact that where there are fold outs or large inserts, the image will have to be larger. A good page turning application should be able to handle this.
  • Different page turning technologies will be able to support different things. Work out what technology you are using and know its capabilities.
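The cropping rule in the list above (tight to the spine, a small border on the other three edges) can be sketched as a small calculation. This is only an illustration under stated assumptions: the `Box` type, the `crop_box` function and the 50 pixel border are hypothetical names and values, and the page rectangle is assumed to have been measured already, by eye or by some other tool.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A rectangle in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def crop_box(page: Box, image_width: int, image_height: int,
             border: int, spine_side: str) -> Box:
    """Return a crop rectangle: tight on the spine edge, `border` pixels elsewhere."""
    if spine_side not in ("left", "right"):
        raise ValueError("spine_side must be 'left' or 'right'")
    # No border is added on the spine edge, so facing pages meet cleanly.
    left = page.left if spine_side == "left" else page.left - border
    right = page.right if spine_side == "right" else page.right + border
    top = page.top - border
    bottom = page.bottom + border
    # Clamp to the image so the crop never extends past the photograph.
    return Box(max(left, 0), max(top, 0),
               min(right, image_width), min(bottom, image_height))


# A right-hand page (spine on its left edge) in a 4000 x 3000 pixel photograph.
recto = crop_box(Box(1000, 200, 3000, 2800), 4000, 3000,
                 border=50, spine_side="left")
print(recto)
```

Using the same border value for every page keeps the crop consistent across a volume. A fold out or large insert simply produces a larger page rectangle, and therefore a larger crop.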

The last point to make is that we should not focus solely on dissemination. Image dissemination strategies, tools and applications will come and go but ultimately when you are taking high quality digital images of archives you will need to maintain a high resolution preservation version of those images within a digital archive.

An insert found within Visitations Court Book 2 - should this be photographed within the volume or separate from the volume?

These preservation images will be around for the long term and can be used to make further dissemination copies where necessary. Think carefully about what is required here and remember to save your preservation originals at the right point within the workflow (for example once the images have been checked and a sensible file naming strategy implemented, but before any loss of information or degradation in image quality occurs).

Also think about what other images may be needed to fully record the physical object for preservation purposes. It may be necessary to take some images that would not be used within the page turning application but that record valuable information about the physical volume. For example, the spine of the book, or a small detail on the cover that needs to be captured at a higher resolution. 

Monday, 8 September 2014

Physical diaries versus digital calendars - a digital perspective

This summer as part of our annual staff festival I had the chance to play at being a ‘real’ archivist. Coming to work at a traditional archive through a digital route with no formal archives training means that there are many traditional archives activities that I have not had any experience of. It was great to have the chance to handle some physical archives as Borthwick staff embarked on a ‘mass list in’ of the Alan Ayckbourn archive.

Given a couple of heavy brown archive boxes and a pencil (no pens please!) and paper I was tasked with creating a box list (essentially just a brief description of what the boxes contained) for a selection of Ayckbourn’s diaries. This proved to be an interesting way to spend a morning.

My job doesn't take me into the strongrooms or searchroom very often and opportunities to handle physical archives are rare. Opening a box from the archives and lifting out the contents was reminiscent of my past career in archaeological fieldwork, in particular the excitement of not quite knowing what you may find.

The diaries I was looking at were appointments diaries rather than personal journals. The more recent diaries were used by Ayckbourn in a fairly standard way (as I use my physical appointments diary today). They were brief and factual, recording events happening on a particular day, be it the dress rehearsal of a particular performance, dinner with friends, Christmas parties or a reminder to take the cat to the cattery.

Earlier diaries from the late eighties were used in a slightly different way by Ayckbourn. These are A4 diaries with a page devoted to each day of the year. This format provided more space and allowed for uses beyond the simple appointments format. The diaries were used for to-do lists (with lots of crossings out as tasks were completed), names and addresses, notes and thoughts and thus had more points of interest as I looked through them. Much of the content I couldn’t make sense of – the handwriting was often a challenge (particularly when crossed out), and notes were often present without relevant contextual information required to fully understand them. These diaries were very much a personal tool and not created with future access in mind but this does not mean they could never be a valuable resource for research.

Whilst looking at these diaries it occurred to me to think about the modern day digital equivalent of these hard backed physical diaries and how they might be preserved and re-used into the future.

I am a keen user of a digital calendar in my professional life. At York University we have embraced the Google suite of tools and this includes Google Calendar. It is an incredibly valuable tool with benefits far beyond anything that could easily be achieved with its paper equivalent. I can share the calendar with colleagues to enable them to see where I am when, check multiple people's calendars at the same time and invite colleagues to meetings. Of course it also helps me manage my time in a more immediate way by popping reminders up 10 minutes before I am meant to be at a particular meeting or appointment.

Will we be archiving Google calendars in the future instead of (or alongside – I certainly use both at the moment) their paper equivalents? I think so. In December last year Google announced a new (and long awaited) feature which enables users of the calendar to download their appointments to a file. This of course would enable donors and depositors to hand their digital calendar over to a digital archive for longer term curation and access just as they would with their physical diaries and no doubt this is something we might expect to see delivered to us in the future.

This is the message Google sends once your calendar
has been prepared for export and archiving

Information from a Google calendar can be downloaded as described in the Gmail blog post. The export delivers the calendar data in iCalendar format (.ics), which is an independent format for the exchange of calendar information (rather than something that is specific to Google). The fact that it is essentially a plain text file is great news for us digital archivists. It means we can open it up in a simple text editor and make some sense of the content without any specialist software.

After downloading my calendar from Google I had a look at it to see what level of detail was included within the iCalendar file and whether all the significant properties of my online calendar were preserved. Initial inspection shows that this is a pretty good version, though of course not as easy to read or understand as it is in its creating application. All the information appears to be there:
  • the date and time of each event
  • the date and time the event was created and last modified
  • whether my attendance is confirmed or not
  • the location of the meeting
  • who created the calendar event (including e-mail address)
  • who else is invited (including e-mail addresses)
  • any further details of the meeting that have been included in the entry
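Because the file is plain text, the properties listed above can be pulled out with very little code. The following is a minimal sketch rather than a full iCalendar parser. The function names and the sample data are invented for illustration (the sample is not taken from a real Google export), and the property names follow the iCalendar specification, RFC 5545. It handles folded lines and skips nested components such as alarms, but it does not handle quoted parameter values that contain a colon.

```python
def unfold(text: str) -> list[str]:
    """Join folded lines: a line starting with a space or tab continues the previous one."""
    lines: list[str] = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        elif raw:
            lines.append(raw)
    return lines


def parse_events(text: str) -> list[dict[str, list[str]]]:
    """Return one dict per VEVENT, mapping property name to a list of values."""
    events: list[dict[str, list[str]]] = []
    current = None   # the event being read, if any
    nested = 0       # depth of components nested inside the event (e.g. VALARM)
    for line in unfold(text):
        name_and_params, _, value = line.partition(":")
        name = name_and_params.split(";", 1)[0].upper()
        if name == "BEGIN":
            if current is None and value.upper() == "VEVENT":
                current = {}
            elif current is not None:
                nested += 1
        elif name == "END":
            if nested:
                nested -= 1
            elif current is not None and value.upper() == "VEVENT":
                events.append(current)
                current = None
        elif current is not None and nested == 0:
            # Properties such as ATTENDEE can repeat, so values are kept in a list.
            current.setdefault(name, []).append(value)
    return events


# Invented sample data, for illustration only.
SAMPLE = "\r\n".join([
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "BEGIN:VEVENT",
    "DTSTART:20140924T100000Z",
    "DTEND:20140924T110000Z",
    "CREATED:20140901T083000Z",
    "LAST-MODIFIED:20140902T091500Z",
    "STATUS:CONFIRMED",
    "LOCATION:Meeting room 1",
    "ORGANIZER;CN=Example Organiser:mailto:organiser@example.org",
    "ATTENDEE;CN=Example Attendee:mailto:attendee@example.org",
    "SUMMARY:Digitisation project ",
    " catch-up",
    "END:VEVENT",
    "END:VCALENDAR",
])

for event in parse_events(SAMPLE):
    for name, values in event.items():
        print(name, values)
```

The parameters after the semicolon (such as the display name in `CN=`) are discarded here for brevity. An archive wanting to keep invitee names as well as e-mail addresses would need to retain them.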

So although this is the modern equivalent (and even the future) of the physical appointments diaries in the Alan Ayckbourn archive, it is a very different beast. In some ways the data within it is better - more consistent and more detailed - than the physical diary and this can be one of the key benefits to working in a digital sphere. In other ways it is far less rich - there are no crossings out, no scribbles within the margin, no coffee stains and very little personality. The very things that are good about the digital calendar are the things which make it harder to get a sense of the real person behind the appointments.

Musings on value aside, it is good to know that when I'm faced with this question in the future I am in a better position to understand how we might preserve a digital calendar for the long term within our archive.

Monday, 21 July 2014

How much data can you afford to lose?

“How often should I back up my data?” is a question I am sometimes asked. There are several answers to this.

A bee - one of the rescued digital images

An ideal solution would be a regular and frequent automatic backup that ‘just happens’ behind the scenes. What is often closer to reality (particularly in a personal sphere) is a manual process managed by an individual in a slightly more ad hoc way. Frequency of back up may vary depending on need (how much new data has been added) or engagement (how often the individual remembers or has the inclination to do it!). In the fast-paced digital world in which we live, backing up our data is often seen as an additional administrative overhead that can fall to the bottom of our overflowing to-do list.

My standard answer to the question posed above is “How much can you afford to lose?” Back up strategies are essentially all about risk management. This approach works well across the full range of different types of data and working practices. If your data is fairly static, with new additions added infrequently, a back up every 2-3 weeks may be perfectly adequate. On the other hand if losing just an hour’s work would be catastrophic then the regularity of your back-ups should reflect this and minimise the risk of this loss.
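One way to make this concrete is to compare the time since the last back up with the amount of work you could tolerate losing. The sketch below is purely illustrative: the function names, the dates and the two-week tolerance are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def exposure(last_backup: datetime, now: datetime) -> timedelta:
    """Everything created since the last back up is what would be lost today."""
    return now - last_backup


def backup_overdue(last_backup: datetime, now: datetime,
                   tolerable_loss: timedelta) -> bool:
    """True when the current exposure exceeds the loss you are prepared to accept."""
    return exposure(last_backup, now) > tolerable_loss


# Illustrative values: a back up three weeks ago against a two-week tolerance.
last_backup = datetime(2014, 6, 30)
now = datetime(2014, 7, 21)
print(exposure(last_backup, now).days, "days of work at risk")
print("Overdue:", backup_overdue(last_backup, now, timedelta(weeks=2)))
```

The same check works whether the tolerance is an hour or a month, so the only thing that changes between working practices is the value of `tolerable_loss`.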

In a professional sphere I spend much of my time managing digital assets - good back up strategies are an essential part of this. However, in a personal sphere (where family life ultimately takes precedence) I may not always practise what I preach.

A makeshift dolls house - another rescued digital image

Like many of us, the data that I create (and curate) in a personal sphere consists almost solely of digital photographs. I have a long standing interest in photography, but my current subjects are limited primarily to photographs of my children, their toys and their hamster interspersed (in the summer months at least) with photographs of insects (mainly butterflies). I take photographs and download them to my home computer every weekend. I then go through some basic selection, deleting any that I don’t want to keep long term. I upload a selection of my favourites to Facebook and have a couple of portable hard drives to back them up to.

The process of back up is a manual one and fitting it in with a busy working and family life results in a fairly ad hoc schedule – one I would not tolerate in my professional sphere. None of this data is of any importance to anyone but myself and my close family so though there is risk of data loss, the impact of this loss will not be large. This was a level of risk I thought I was comfortable with.

…but then a couple of weeks ago my home PC died resulting in a complete inability to access the files on it.

Nibbles the hamster - another rescued digital image

This was not a good time to realise that I hadn’t backed up my data for at least 3 weeks – a period that included my children’s school sports days and a successful butterfly photography session. Of course all was not lost as the best shots were duplicated on Facebook, but only as a low resolution version that was not really suitable for anything but viewing on a screen.

After my initial acceptance of the level of risk in my back up strategy, I started to feel that perhaps the system should have been more robust. Hindsight is a wonderful thing. This was one of those points where a ‘Sorry for your data loss’ card may have been welcome.

A Ringlet - another rescued digital image

This story has a happy ending. After two weeks of communications with the supplier of the PC we have reached the point where we are once again able to switch on the PC and access my digital photographs.

Digital data loss has been averted and back-up nirvana is restored. This has prompted a much needed re-think about my personal back-up strategy. Even a simple tweak to the workflow to ensure that images are not deleted from the memory card of the digital camera by default at the point of download would ensure that two copies of the data are always available. This would provide a valuable stop gap until such a point as back up occurs.

This near-data-loss experience was a wake up call I would rather not have had but is certainly something I can learn from.

What level of risk are you happy to accept?

Friday, 25 April 2014

How does Archivematica meet my requirements?

It seems a long time ago that I first blogged about my failed attempts to install Archivematica. This is probably because it *was* quite a long time ago... other priorities had a habit of getting in the way!

With the help of a colleague (more technically able than I) I've now had a chance to see the new version of Archivematica. I have been assured that Archivematica version 1.0 is easier to install than its predecessors so that is good news!

Any decent digital preservation system is going to have to be pretty complex in order to carry out the required tasks and workflows so assessing products such as this one is not something that can be done in one sitting.

As well as playing with the software itself, I've watched the video, I've signed up to the mailing list and I'm talking to others who are using it. A recent 'Technology Bytes' webinar hosted by the DPC (Digital Preservation Coalition) also helped me find out more. Artefactual Systems (who support and develop the software) have been really helpful in answering all of my many awkward questions.

In a more recent blog I talked about my digital preservation requirements, so one of the things I've been trying to do as I've been looking at Archivematica is see whether it could meet these requirements.

Below is a list of my requirements again (possibly slightly altered since the last time I published them) and an assessment of Archivematica against them.

It does seem to be a pretty good match and it is worth noting that any digital preservation system we implement will be just one part of a wider technical infrastructure for data management (that will also include a deposit workflow, data storage and an access system). There is some functionality within my requirements that could doubtless be fulfilled elsewhere within that infrastructure so I am not too concerned that we do not have a clear 'Yes' on all of these requirements. Where there are bits of functionality that we really do need Archivematica to perform, we have the option of either building it ourselves, or sponsoring Artefactual Systems to develop it for us and for the wider user community.

It is encouraging to see just how many developments are being sponsored at the moment and how many organisations are involved in this process.

It is also worth noting that while Archivematica is free and open source, Artefactual Systems are always keen to state that it is free as in 'free kittens': time and money need to go into looking after it, feeding it and taking it to the vet. There will clearly always be some element of cost involved with the implementation of an open source system that needs to be configured and integrated with existing systems.

Just to end with one very interesting piece of information that was mentioned in the Technology Bytes webinar:

Archivematica runs lots of microservices as part of the ingest and preservation workflow. You can configure it in various ways but there are a couple of points where the system waits for instructions from an administrator before proceeding with an operation. I was very interested to learn that one Archivematica user has configured his system to bypass these prompts for human interaction and has it set up as a fully automated workflow for a particular set of content.

Am I scared that this development might put digital archivists such as me out of a job? ....only a little bit

Am I excited by the opportunities to automate many of the repetitive and previously manual processes that digital archivists can spend a lot of time doing? ....very much so!

Does Archivematica meet this requirement?

The digital archive will enable us to store administrative information relating to the Submission Information Package (information and correspondence relating to receipt of the SIP)
Yes – a transfer can be made with submission documentation and this will be preserved within the AIP. Note that submission information as described in the Archivematica wiki can be “donor agreements, transfer forms, copyright agreements and any correspondence or other documentation relating to the transfer”. Any SIPs generated will automatically include copies of this information too. We do need to establish where the best place to store supporting information is within our technical architecture.
The digital archive will include a means for recording appraisal decisions relating to the Submission Information Package and individual elements within it
No – appears to be out of scope for Archivematica but as we are not considering using this system in isolation, this information may be best stored elsewhere within the technical infrastructure.
The digital archive will be able to identify and characterise data objects (where appropriate tools exist)
Yes – this is an automated process. Uses FITS (Bundles file utility, ffident, DROID, JHOVE, FIDO, Tika, mediainfo). Output is stored in the METS and PREMIS XML within the AIP. New tools for identification will be included in future releases of Archivematica, and there is also the option for users of the system to add their own tools via the Format Policy Registry.
The digital archive will be able to validate files (where appropriate tools exist)
Yes – JHOVE is part of the package and output from JHOVE is stored in the METS and PREMIS XML within the AIP
The digital archive will support automated extraction of metadata from files
Yes – Tika is part of the package and output is stored in the METS and PREMIS XML within the AIP
The digital archive will virus check files on ingest
Yes – ClamAV is part of the package and information about virus checking is included within the PREMIS and METS XML. If a virus is detected within a file it will be sent to the ‘failed’ directory and all processing on that SIP will stop until the problem is resolved by an administrator.
The digital archive will be able to record the presence and location of related physical material
No – this is out of scope for Archivematica but we would be able to store this metadata within Fedora

The digital archive will generate persistent, unique internal identifiers
Yes – a unique internal identifier is generated, incorporated into filenames and stored in the METS.xml for both packages and digital objects.
The digital archive will ensure that preservation description information (PDI) is persistently associated with the relevant content information. The relationship between a file and its metadata/documentation must be permanent
Yes – any documentation that is included in the SIP will be included in the AIP. All technical and preservation metadata generated by Archivematica will also be wrapped up in the AIP.
The digital archive will support the PREMIS metadata schema and use it to store preservation metadata
Yes – creates and stores PREMIS/METS as part of the ingest process and as preservation actions are carried out. This XML is stored within the AIP
The digital archive will enable us to describe data at different levels of granularity – for example metadata may be attached to a collection, a group of files or an individual file
Partial – Preservation and technical metadata are generated at file level. Descriptive (Dublin Core) metadata appears to be only at project/collection level. If we require more detailed or granular metadata this will be stored elsewhere within the technical architecture.
The digital archive will accurately record and maintain relationships between different representations of a file (for example, from submitted originals to dissemination and preservation versions that will be created over time)
Yes – this is very much a part of the system. This is achieved using a unique identifier which is allocated to a submitted file, and included in any subsequent representations that are created
The digital archive will store technical metadata extracted from files (for example that is created as part of the ingest process)
Yes – very comprehensive technical metadata including details of all of the tools used are stored as part of the AIP

The digital archive will allow preservation plans (such as file migration or refreshment) to be enacted on individual or groups of files.
Partial(?) – on ingest, rules are in place to normalise files (migrate them) to different formats as appropriate for preservation/dissemination. These rules can be updated to meet local needs.

Need to explore how these rules can be run on all files of a certain type within the archive. Artefactual Systems report that a new AIP re-ingest feature will fulfil this need.
Automated checking of significant properties of files will be carried out post-migration to ensure these properties are adequately preserved (where appropriate tools exist).
Partial – default format policy choices are based on a comprehensive analysis of the significant properties of the samples as well as tests of many tools. Results of these tests are publicly available on the wiki. Archivematica users are able to run their own tests using other migration tools and if they are thought to adequately preserve significant properties they can be added to the system to serve local needs.
The digital archive will record actions, migrations and administrative processes that occur whilst the digital objects are contained within the digital archive
Yes – detailed information (in PREMIS and METS format) is stored within the AIP. The AIP keeps various logs which are gathered throughout the ingest process. Where migrations are carried out manually, PREMIS metadata can be added. This is a new feature in the 1.1 release. Note that it assumes a one-to-one relationship between original and migrated file, which may not always be the case.

The digital archive will allow for disposal of data where appropriate.
Partial – it is possible to delete an AIP and set a reason, but file level deletions within an AIP are not supported. The system deliberately makes it difficult to carry out deletions and they can only be carried out by administrative users.
A record must be kept of data disposal including what was disposed of, when it was disposed of and reasons for disposal.
Yes – It is possible to set a reason for deletion in Archivematica and this will be visible to the storage service administrator. Disposal decisions may be best recorded elsewhere within the infrastructure (Fedora/AtoM).
The digital archive will have reporting capabilities so statistics can be collated. For example it would be useful to be able to report on numbers of files, types of files, size of files, preservation actions carried out
No – This may be something we have to set up ourselves using the MySQL data that sits behind the system.

Artefactual Systems are keen that better reporting capabilities are sponsored in future releases of the software.

The digital archive will actively monitor the integrity of digital objects on a regular and automated schedule with the use of checksums
No – Checksums are generated by Archivematica and stored as part of the AIP but integrity checking is not performed. There is a plan to include active fixity checking in a future release of Archivematica, but in the meantime this could be carried out somewhere else within the technical infrastructure.
Where problems of data loss or corruption occur, the digital archive will have a reporting/notification system to prompt appropriate action
No – this is out of scope for Archivematica. The archival storage module will need to carry out integrity checking and a notification system (or automatic restore from backup) will need to be in place to guard against data loss.
The digital archive will be able to connect to, and support a range of storage systems
Yes – a number of different storage options can be configured within Archivematica and it is possible to have several different options depending on the nature of the data.

The digital archive will be compliant with the Open Archival Information System (OAIS) reference model
Yes – the design of Archivematica was created with OAIS in mind. The GUI leads you through the relevant OAIS functional entities and the language used throughout the application is consistent with that used within the OAIS reference model
The digital archive will integrate with our Fedora repository
Partial – Fedora is not directly supported but this may be something we can configure ourselves. Artefactual Systems are working with related systems (Islandora) which will go a little way towards Fedora integration.
The digital archive will integrate with our archival management system (AtoM)
Yes – Archivematica and AtoM are both supported by Artefactual Systems and are designed to complement each other. AtoM is the recommended access front end to Archivematica.
The digital archive will have APIs or other services for integrating with other systems
Yes – it has a REST API and a SWORD API planned
The digital archive will be able to incorporate new digital preservation tools (for migration, file validation, characterisation etc) as they become available
Yes – In terms of migration tools there is a handy interface for adding tools or commands and setting up new rules. The Roadmap includes plans for updating the tools that are internal to the system. Archivematica developers contribute to the development of tools such as FITS to make them better and more scalable.
The digital archive will include functionality for extracting and exporting the data and associated metadata in standards compliant formats
Yes – Archivematica uses open standards where possible. Metadata is in XML format, uses recognised standards and is packaged with the AIP. Archivematica packages its AIPs using BagIt which is an open standard for storage and transfer of files and metadata. Archival storage is separate so extracting the information from here needs to be a feature of the storage system.
The software or system chosen for the digital archive will be supported and technical help should be available
Yes – Open Source but supported by Artefactual Systems. An active mailing list exists for technical support and Artefactual Systems seem to be quick to respond to any queries
The software or system chosen for the digital archive will be under active development
Yes – Archivematica is very much in development. Wish lists are published online. Specific developments happen quicker if we are able to sponsor them. Alternatively, our own developers could help develop the system to meet our needs.
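The assessment above notes that checksums are generated and stored with the AIP but that regular integrity checking would, for now, need to happen elsewhere in the infrastructure. A scheduled check of that kind could look something like the sketch below. This is an assumption-laden illustration and not part of Archivematica: the function names are hypothetical, and the manifest is assumed to be a plain text file with one `<checksum> <relative path>` pair per line, similar in layout to a BagIt payload manifest.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum, reading in chunks so large images fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_manifest(manifest: Path) -> dict[str, str]:
    """Parse lines of the form '<checksum> <relative path>' into {path: checksum}."""
    expected: dict[str, str] = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, relative_path = line.split(maxsplit=1)
            expected[relative_path.strip()] = checksum.lower()
    return expected


def check_fixity(root: Path, manifest: Path) -> dict[str, str]:
    """Return {relative path: problem} for every file that is missing or altered."""
    problems: dict[str, str] = {}
    for relative_path, checksum in read_manifest(manifest).items():
        target = root / relative_path
        if not target.is_file():
            problems[relative_path] = "missing"
        elif sha256_of(target) != checksum:
            problems[relative_path] = "checksum mismatch"
    return problems
```

An empty result means every listed file is present and unchanged. Anything else could feed the reporting/notification requirement, for example by e-mailing the list of problems to an administrator or triggering a restore from backup.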