
Friday, 7 April 2017

Archivematica Camp York: Some thoughts from the lake

Well, that was a busy week!

Yesterday was the last day of Archivematica Camp York - an event organised by Artefactual Systems and hosted here at the University of York. The camp's intention was to provide a space for anyone interested in or currently using Archivematica to come together, learn about the platform from other users, and share their experiences. I think it succeeded in this, bringing together 30+ 'campers' from across the UK, Europe and as far afield as Brazil for three days of sessions covering different aspects of Archivematica.

Our pod on the lake (definitely a lake - not a pond!)
My main goal at camp was to ensure everyone found their way to the rooms (including the lakeside pod) and that we were suitably fuelled with coffee, popcorn and cake. Alongside these vital tasks I also managed to partake in the sessions, have a play with the new version of Archivematica (1.6) and learn a lot in the process.

I can't possibly capture everything in this brief blog post so if you want to know more, have a look back at all the #AMCampYork tweets.

What I've focused on below are some of the recurring themes that came up over the three days.

Workflows

Archivematica is just one part of a bigger picture for institutions that are carrying out digital preservation, so it is always very helpful to see how others are implementing it and what systems they are integrating with. A session on workflows in which participants were invited to talk about their own implementations was really interesting.

Other sessions also helped highlight the variety of different configurations and workflows that are possible using Archivematica. I hadn't quite realised there were so many different ways you could carry out a transfer!

In a session on specialised workflows, Sara Allain talked us through the different options. One workflow I hadn't been aware of before was the ability to include checksums as part of your transfer. This sounds like something I need to take advantage of when I get Archivematica into production for the Borthwick. 
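For anyone curious about what this looks like in practice: as I understand it, Archivematica can verify files against a checksum manifest supplied in the transfer's metadata/ directory. Below is a minimal Python sketch of generating such a manifest before transfer. The file name (checksum.md5) and the md5deep-style layout are my reading of the documentation, so do check them against your own version of Archivematica:

```python
import hashlib
from pathlib import Path

def write_checksum_manifest(transfer_dir: str) -> Path:
    """Write an md5 manifest into the transfer's metadata/ directory.

    Layout assumed here (md5deep-style, '<hash>  <relative path>') is
    what I believe Archivematica expects - verify for your version.
    """
    transfer = Path(transfer_dir)
    metadata_dir = transfer / "metadata"
    metadata_dir.mkdir(exist_ok=True)
    manifest = metadata_dir / "checksum.md5"
    lines = []
    for f in sorted(transfer.rglob("*")):
        # Skip directories, anything already in metadata/, and the
        # manifest itself.
        if f.is_file() and metadata_dir not in f.parents and f != manifest:
            digest = hashlib.md5(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f.relative_to(transfer)}")
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

Archivematica should then check each file against these values during transfer and flag any mismatch, which gives you end-to-end integrity checking from the moment the material leaves its source system.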

Justin talking about Automation Tools
A session on Automation Tools with Justin Simpson highlighted other possibilities - using Archivematica in a more automated fashion. 

We already have some experience of using Automation Tools at York as part of the work we carried out during phase 3 of Filling the Digital Preservation Gap; however, I was struck by how many different ways these can be applied. Hearing examples from other institutions and for a variety of different use cases was really helpful.
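To give a flavour of the kind of automation on offer: transfers can be started programmatically through Archivematica's REST API, which is what Automation Tools builds on. The sketch below only constructs the request rather than sending it; the endpoint, field names and base64-encoded path format reflect my reading of the 1.6 API and may well differ in your installation:

```python
import base64
import urllib.parse
import urllib.request

def build_start_transfer_request(base_url, username, api_key,
                                 name, location_uuid, rel_path,
                                 transfer_type="standard"):
    """Build (but do not send) a start_transfer request.

    Endpoint and field names are assumptions based on the
    Archivematica 1.6 REST API - check your version's docs.
    """
    # The API (as I understand it) expects each path encoded as
    # base64("location_uuid:relative_path").
    path_b64 = base64.b64encode(
        f"{location_uuid}:{rel_path}".encode()).decode()
    data = urllib.parse.urlencode({
        "name": name,
        "type": transfer_type,
        "paths[]": path_b64,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/api/transfer/start_transfer/", data=data)
    req.add_header("Authorization", f"ApiKey {username}:{api_key}")
    return req
```

A scheduled script wrapping something like this is essentially what lets institutions feed a watched folder of deposits into Archivematica without anyone clicking through the dashboard.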


Appraisal

The camp included a chance to play with Archivematica version 1.6 (which was only released a couple of weeks ago) as well as an introduction to the new Appraisal and Arrangement tab.

A session in progress at Archivematica Camp York
I'd been following this project with interest so it was great to be able to finally test out the new features (including the rather pleasing pie charts showing what file formats you have in your transfer). It was clear that a few improvements could be made to make the tab more intuitive to use and to handle things such as editing or deleting tags, but it is certainly an interesting feature and one that I would like to explore more using some real data from our digital archive.

Throughout camp there was a fair bit of discussion around digital appraisal and at what point in your workflow this would be carried out. This was of particular interest to me being a topic I had recently raised with colleagues back at base.

The Bentley Historical Library, who funded the work to create the new tab within Archivematica, are clearly keen to get their digital archives into Archivematica as soon as possible and then carry out the appraisal work there after transfer. The addition of this new tab now makes this workflow possible.

Kirsty Lee from the University of Edinburgh described her own pre-ingest methodology and the tools she uses to help her appraise material before transfer to Archivematica. She talked about some tools (such as TreeSize Pro) that I'm really keen to follow up on.

At the moment I'm undecided about exactly where and how this appraisal work will be carried out at York, and in particular how this will work for hybrid collections, so as always it is interesting to hear from others about what works for them.


Metadata and reporting

Evelyn admitting she loves PREMIS and METS
Evelyn McLellan from Artefactual led a 'Metadata Deep Dive' on day 2 and despite the title, this was actually a pretty interesting session!

We got into the details of METS and PREMIS and how they are implemented within Archivematica. Although I generally try not to look too closely at METS and PREMIS, it was good to have them demystified. On the first day we had been encouraged, through a series of exercises, to look at a METS file created by Archivematica and pick out some information from it ourselves, so these sessions in combination were really useful.

Across various sessions of the camp there was also a running discussion around reporting. Given that Archivematica stores such a detailed range of metadata in the METS file, how do we actually make use of this? Being able to report on how many AIPs have been created, how many files they contain and how big they are is useful. These are statistics that I currently collect (manually) on a quarterly basis and share with colleagues. Once Archivematica is in place at York, digging further into those rich METS files to find out which file formats are in the digital archive would be really helpful for preservation planning (among other things). There was discussion about whether reporting should be a feature of Archivematica or a job that should be done outside Archivematica.

In relation to the latter option, I described in one session how some of our phase 2 work on Filling the Digital Preservation Gap was designed to help expose metadata from Archivematica to a third-party reporting system. The Jisc Research Data Shared Service was also mentioned in this context, as reporting outside of Archivematica will need to be addressed as part of that project.
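As a small taste of what reporting outside Archivematica might look like, here is a rough Python sketch that tallies the premis:formatName values recorded in an AIP's METS file. The namespaces are the ones I have seen in Archivematica 1.6 METS files (PREMIS v2); other versions may record things differently, so treat this as a starting point rather than a finished report:

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Namespaces as used (I believe) in Archivematica 1.6 METS files;
# later releases may move to PREMIS v3 (http://www.loc.gov/premis/v3).
NS = {
    "mets": "http://www.loc.gov/METS/",
    "premis": "info:lc/xmlns/premis-v2",
}

def count_formats(mets_path):
    """Tally premis:formatName values in an AIP METS file - a rough
    basis for file-format reports across a digital archive."""
    tree = ET.parse(mets_path)
    return Counter(
        el.text for el in tree.iter(f"{{{NS['premis']}}}formatName")
        if el.text)
```

Run over every stored AIP's METS file, a tally like this would answer the "which file formats are in the digital archive?" question without needing any new features inside Archivematica itself.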

Community

As with most open source software, community is important. This was touched on throughout the camp and was the focus of the last session on the last day.

There was a discussion about the role of Artefactual Systems and the role of Archivematica users. Obviously we are all encouraged to engage and help sustain the project in whatever way we are able. This could be by sharing successes and failures (I was pleased that my blog got a mention here!), submitting code and bug reports, sponsoring new features (perhaps something listed on the development roadmap) or helping others by responding to queries on the mailing list. It doesn't matter - just get involved!

I was also able to highlight the UK Archivematica group and talk about what we do and what we get out of it. As well as encouraging new members to the group, there was also discussion about the potential for forming other regional groups like this in other countries.

Some of the Archivematica community - class of Archivematica Camp York 2017

...and finally

Another real success for us at York was having the opportunity to get technical staff at York working with Artefactual to resolve some problems we had with getting our first Archivematica implementation into production. Real progress was made and I'm hoping we can finally start using Archivematica for real at the end of next month.

So, that was Archivematica Camp!

A big thanks to all who came to York and to Artefactual for organising the programme. As promised, the sun shone and there were ducks on the lake - what more could you ask for?



Thanks to Paul Shields for the photos

Jenny Mitcham, Digital Archivist

Friday, 28 March 2014

Discovering archives: it's all about the standards

Yesterday at the UK Archives Discovery Forum we mostly talked about standards.*

Specifically, metadata standards for resource discovery of archives, both physical and digital. Standards are key to making archival data discoverable, and of course this is our main reason for being - we preserve things so that they can be reused, and they can only be reused if they can be discovered.

The day was really relevant to work we are currently doing at the Borthwick Institute, with the installation of a new archival management system (AtoM) underway and scoping work ongoing for a retroconversion project which will help us move our legacy catalogues into this new system - both major initiatives intended to make our catalogue data more widely discoverable.

Nick Poole from the Collections Trust talked about user-focused design (both for physical buildings and digital interfaces) and how we should avoid putting barriers between our users and the information they need. The gov.uk website is an obvious example of how this approach to design can work in a digital sphere, and their design principles are online. This is something I think we can all learn from.

He also touched on the Open Data agenda and how the principles of making data ‘open by default’ are sometimes seen as being at odds with traditional models for income generation in the archives sector. Nick argues that by opening up data we are allowing more people to find us and making way for new opportunities and transactions as they engage further with the other services we have to offer.

He also mentioned that we can be ‘digitally promiscuous’ - making our data available in many different ways via many different platforms. We do not need to keep our data close to our chests but should be signposting what we have and drawing people in. We can only really do this if we make use of data standards. Standards help us to exchange and share our data and allow others to find and interpret it.

Jane Stevenson talked about the importance of standards to the Archives Hub. Aggregating data from multiple sources would be very tricky if no-one used metadata standards. The problem is that the standards that we have are not perfect. Encoded Archival Description (EAD), the XML realisation of ISAD(G), can be too flexible and thus is realised in different ways by different institutions. Even those archives using CALM as their archival cataloguing system may have individual differences in how they use the metadata fields available to them. This does make life as an aggregator more challenging.
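A contrived illustration of the sort of variation aggregators face - both of these are perfectly valid EAD encodings of the same description, but only the first gives an aggregator a machine-readable date to work with:

```xml
<!-- Institution A: machine-readable date on unitdate -->
<did>
  <unittitle>Correspondence</unittitle>
  <unitdate normal="1986/1987">1986-1987</unitdate>
</did>

<!-- Institution B: the same information, but the date is only
     free text inside unittitle - much harder to aggregate -->
<did>
  <unittitle>Correspondence, 1986-1987</unittitle>
</did>
```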

Once data is standardised into the Archives Hub flavour of EAD it can be transformed again into other data standards allowing it to be cross searchable beyond the UK archives sector. Jane touched on their work with RDF and linked data and the opportunities this can bring.

We should make use of opportunities to join the European stage. The Archives Hub are 'country manager' for Archives Portal Europe (APE) thus making it a simple matter for Hub contributors to push their data out beyond national borders. For those archival descriptions that link directly to a digital object, the opportunity exists to make this data available through Europeana. This takes our data beyond the archives sector, allowing our collections to be cross-searched alongside other European digital cultural heritage resources. In my mind, this really is the start of ‘digital promiscuity’ and an opportunity I feel we should be embracing (if we can accept the necessity to open up our metadata with a CC0 licence).

Geoff Browell from Kings College London talked about what we as archivists can offer our users over and above what they can get by visiting Google. He highlighted our years of experience at indexing data and pointed out that approximately half of the users of the AIM25 search interface appreciate our efforts in this area and use the index terms provided to browse for data in preference to the Google-style free text search. He thinks that we should be talking more closely with both our users and the interface developers to ensure we are giving people what they need. He mentioned that delivery of data to users should be a conversation, not a one-sided process.

The National Archives asked us for comments on a beta version of the new Discovery interface, which will provide a new portal into selected UK archival holdings. They are fostering conversation with users by encouraging 'tagging' of pages within the search interface.

Malcolm Howitt from Axiell discussed how systems and software can support standards. Standards are a topic that is often raised, and vendors are asked to support many of them. They are keen to help where they can and need to work with the community to ensure that they know what is required of them. The different flavours of EAD were again raised as an issue, but Malcolm pointed out that when standards work, the user doesn't even need to be aware of them.

The National Archives, Kew, London by Jim Linwood on Flickr CC BY 2.0


Reflections

I think we are all in agreement that metadata standards are necessary and we need to work with them in order to make our catalogue data more visible. Some further issues were picked out in the final session of the day where attendees were invited to share their thoughts on the standards they use and the ones they would like to know more about.


  1. Do we need a standard for accessions data? Would this be a specific subset of ISAD(G) or does it need further definition? The next step in our planned implementation of AtoM is to populate it with accessions data from various different sources and I expect there will be some issues to deal with along the way as a result of lack of standards in this area.
  2. How do we describe digital material? Is ISAD(G) fit for this purpose? As born digital material becomes more and more prevalent in our collections this will become more of an issue. The use of PREMIS to hold technical preservation metadata will be essential alongside the resource discovery metadata but is this enough? This is undoubtedly an area for future exploration.
  3. Does the hierarchical nature of ISAD(G) and EAD hold us back? If we can’t create detailed resource discovery metadata for an archive until we know both the hierarchy and its place in the hierarchy does this slow us down in getting the information out there?



*…mostly standards - with the addition of a surprisingly entertaining session on copyright from Ronan Deazley – check out the CREATe project for more on this topic


Jenny Mitcham, Digital Archivist

Tuesday, 13 August 2013

A short detective story involving 5 ¼ inch floppy disks

Earlier this year my colleague encountered two small boxes of 5 ¼ inch floppy disks buried within the Marks and Gran archive in the strongrooms of the Borthwick Institute. He had been performing an audit of audio visual material in our care and came across these in an unlisted archive.

This was exciting to me as I had not worked with this media before. As a digital archivist I had often encountered 3 ½ inch floppies but not their larger (and floppier) precursors. The story and detective work that follow took us firmly into the realm of ‘digital archaeology’.

Digital archaeology:
“The process of reclaiming digital information that has been damaged or become unusable due to technological obsolescence of formats and/or media” (definition from Glossaurus)

Marks and Gran were a writing duo who wrote the scripts of many TV sitcoms from the late 1970s onwards. ‘Birds of a Feather’ was the one that I remember watching myself in the 80s and 90s, but their credits include many others such as ‘Goodnight Sweetheart’ and ‘The New Statesman’. Their archive had been entrusted to us and was awaiting cataloguing.

Clues on the labels

There were some clues on the physical disks themselves about what they might contain. All the disks were labelled and many of the labels referred to the TV shows they had written, sometimes with other information such as an episode or series number. Some disks had dates written on them (1986 and 1987). One disk was intriguingly labelled 'IDEAS'. WordStar was also mentioned on several labels: 'WordStar 2000' and 'Copy of Master WordStar disk'. WordStar was a popular and quite pioneering word processing package of the early 80s.

However, clues on labels must always be taken with a pinch of salt. I remember being guilty of not keeping the labels on floppy disks up to date, of replacing or deleting files and not recording the fact. The information on these labels will be stored in some form in the digital archive but the files may have a different story to tell.

Reading the disks

The first challenge was to see if the data could be rescued from the obsolete media they were on. A fortuitous set of circumstances led me to a nice chap in IT who is somewhat of an enthusiast in this area. I was really pleased to learn that he had a working 5 ¼ inch drive on one of his old PCs at home. He very kindly agreed to have a go at copying the data off the disks for me and did so with very few problems. Data from 18 of the 19 disks was recovered. The only one of the disks that was no longer readable appeared from the label to be a backup disk of some accounting software - this is a level of loss I am happy to accept.

Looking at file names

Looking at the files that were recovered is like stepping back in time. Many of us remember the days when file names looked like this - capital letters, very short, rather cryptic, missing file extensions. WordStar, like many software packages from this era, was not a stickler for enforcing the use of file extensions! File extensions were also not always used correctly to define the file type but were sometimes used to hold additional information about the file.



Looking back at files from 30 years ago really does present a challenge. Modern operating systems allow long and descriptive file names to be created. When used well, file names often provide an invaluable source of metadata about the file. 30 years ago computer users had only 8 characters at their disposal. Creating file names that were both unique and descriptive was difficult. The file names in this collection do not always give many clues as to what is contained within them.

Missing clues

For a digital archivist, the file extension is a really valuable piece of information. It gives us an idea of what software the file might have been created in, and from this we can start to look for a suitable tool that we could use to work with these files. In a similar way, a lack of file extension confuses modern operating systems. Double click on a file and Windows 7 has no idea what it is or what software to fire up to open it. File characterisation tools such as Droid, used on a day-to-day basis by digital archivists, also rely heavily on file extensions to help identify the file type. Running Droid on this collection (not surprisingly) produced lots of blank rows and inaccurate results*.
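When extensions are missing, even a crude content-based screen can help triage a collection like this. WordStar's document mode famously set the high bit of the final character of each word, so stripping bit 7 from a WordStar file yields mostly printable ASCII while a fair share of the raw bytes have the high bit set. The thresholds below are my own guesses, and this is emphatically a screening aid, not a substitute for signature-based identification with a tool like Droid:

```python
def looks_like_wordstar(data: bytes) -> bool:
    """Rough heuristic for classic WordStar document files.

    Checks two things: a noticeable share of bytes with the high bit
    set (WordStar's word-end marking), and mostly printable ASCII once
    that bit is stripped. Threshold values are arbitrary assumptions.
    """
    if not data:
        return False
    high = sum(1 for b in data if b & 0x80)
    stripped = bytes(b & 0x7F for b in data)
    printable = sum(1 for b in stripped
                    if 32 <= b < 127 or b in (9, 10, 13))
    return high / len(data) > 0.05 and printable / len(data) > 0.95
```

Something like this could at least separate probable word-processing documents from, say, binary accounting data before any manual inspection begins.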

Another observation on initial inspection of this set of files is that the creation dates associated with them are very misleading. It is really useful to know the creation date of a file and this is the sort of information that digital archivists put some effort into recording as accurately as they can. The creation dates on this set of files were rather strange. The vast majority of files appeared to have been created on 1st January 1980 but there were a handful of files with creation dates between 1984 and 1987. It does seem unlikely that Marks and Gran produced the main body of their work on a bank holiday in 1980, so it would seem that this date is not very accurate. My contact in IT pointed out that on old DOS computers it was up to the user to enter the correct date each time they used the PC. If no date was entered the PC defaulted to 1/1/1980. Not a great system clearly and we should be thankful that technology has moved on in this regard!
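When harvesting dates from files like these, it would seem sensible to treat that DOS default as "unknown" rather than record it as a genuine creation date. A small sketch of that idea (the decision to discard anything dated 1 January 1980 is of course a heuristic - a file really could have been written that day):

```python
from datetime import date, datetime

# The date an old DOS machine defaulted to when nobody set the clock.
DOS_EPOCH = date(1980, 1, 1)

def reliable_timestamp(ts: float):
    """Return the file timestamp as a datetime, or None if it falls on
    1 January 1980 and so should not be trusted as a creation date."""
    dt = datetime.fromtimestamp(ts)
    if dt.date() == DOS_EPOCH:
        return None
    return dt
```

Recording "no reliable date" honestly is more useful to future researchers than silently propagating a default the original users never chose.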

So, we are missing important metadata that will help us understand the files, but all is not lost, the next step is to see whether we can read and make sense of them with our modern software. 

Reading the files

I have previously blogged about one of my favourite little software programmes, Quick View Plus – a useful addition to any digital archivist’s toolkit. True to form, Quick View Plus quickly identified the majority of these files as WordStar version 4.0 and was able to display them as well-formatted text documents. The vast majority of files appear to be sections of scripts and cast lists for various sitcoms from the 1980s, but there are other documents of a more administrative nature such as PHONE, which looks to be a list of speed dial shortcuts to individuals and organisations that Marks and Gran were working with (including a number of famous names).

Unanswered questions

I have not finished investigating this collection yet and still have many questions in my head:

  • How do these digital files relate to what we hold in our physical archive? The majority of the Marks and Gran archive is in paper form. Do we also have this same data as print outs? 
  • Do the digital versions of these files give us useful information that the physical ones do not (or vice versa)? 
  • Many of the scripts are split into a number of separate files, some are just small snippets of dialogue. How do all of these relate to each other? 
  • What can these files tell us about the creative process and about early word processing practices?

I am hoping that when I have a chance to investigate further I will come up with some answers.


* I will be providing some sample files to the Droid techies at The National Archives to see if they can tackle this issue.


Jenny Mitcham, Digital Archivist
