Digital Archiving at the University of York: born digital

Friday, 5 August 2016

Research data is different

This is a guest post from Simon Wilson who has been profiling the born-digital data at the Hull History Centre to provide another point of comparison with the research data at York reported on in this blog back in May.

Inspired by Jen’s blog post ‘Research data - what does it *really* look like?’ about the profile of the research data at York, and by the responses it generated (including one from the Bentley Historical Library), I decided to take a look at some of the born-digital archives we have at Hull. This is not research data from academics; it is data that has been donated to or deposited with the Hull History Centre, and it comes from a variety of sources.

Whilst I had previously created a DROID report for each distinct accession, I had never really looked into the detail, so for each accession I did the following:

Run the DROID software and export the results into csv format with one row per file
Open the file in MS Excel and copy the data to a second tab for the subsequent actions
Sort the data by Type field into A-Z order and then delete all of the records relating to folders
Sort the data on the PUID field into A-Z order
For large datasets highlight the data and then select the subtotal tool and use it to count each time the PUID field changes and record the sub-total
Once the subtotal tool has completed its calculations, select the entire dataset and select Hide Detail (adjacent to Subtotal in the Outline tools box) to leave you with just a row for each distinct PUID and the total count value
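
The same count could also be scripted rather than done in Excel. The following is only a minimal sketch, not the method used for the figures in this post: it assumes the DROID CSV export has columns named TYPE and PUID (matching the Type and PUID fields mentioned above), that folder rows carry the value "Folder", and that an empty PUID means the file was not identified. The sample data is made up.

```python
import csv
import io
from collections import Counter


def count_puids(csv_file):
    """Count files per PUID in a DROID CSV export (one row per file or folder)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Skip folder rows, mirroring the manual "delete the folders" step.
        if (row.get("TYPE") or "").strip().lower() == "folder":
            continue
        # Assumption: an empty PUID means DROID could not identify the file.
        puid = (row.get("PUID") or "").strip()
        counts[puid or "unidentified"] += 1
    return counts


# Tiny made-up export, used only to show the shape of the result.
sample = io.StringIO(
    "NAME,TYPE,PUID\n"
    "letters,Folder,\n"
    "a.doc,File,fmt/40\n"
    "b.doc,File,fmt/40\n"
    "mystery.bin,File,\n"
)
counts = count_puids(sample)
for puid, total in counts.most_common():
    print(puid, total)
```

For a real export, the in-memory sample would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points at the CSV saved from DROID.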

I then created a simple spreadsheet with a column for each distinct accession and added a row for each unique PUID, copying the MIME type, software and version details from the DROID report results. I also noted the number of files that were not identified. There may be quicker ways to get the same results and I would love to hear other suggestions or shortcuts.
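
As one possible shortcut, the combined spreadsheet (one row per PUID, one column per accession) could be assembled from per-accession counts with a few lines of Python. This is a sketch under stated assumptions: the accession names and counts below are invented for illustration, and `build_profile` is a hypothetical helper, not part of DROID.

```python
def build_profile(counts_by_accession):
    """Turn {accession: {PUID: count}} into a header row and one row per PUID."""
    accessions = sorted(counts_by_accession)
    puids = sorted({p for counts in counts_by_accession.values() for p in counts})
    header = ["PUID"] + accessions + ["Total"]
    rows = []
    for puid in puids:
        # A PUID that never occurs in an accession is recorded as 0.
        per_accession = [counts_by_accession[a].get(puid, 0) for a in accessions]
        rows.append([puid] + per_accession + [sum(per_accession)])
    return header, rows


# Invented counts for two hypothetical accessions.
counts_by_accession = {
    "accession_A": {"fmt/40": 120, "fmt/18": 15, "unidentified": 3},
    "accession_B": {"fmt/40": 40, "unidentified": 7},
}
header, rows = build_profile(counts_by_accession)

# Print as tab-separated text that can be pasted into a spreadsheet.
print("\t".join(header))
for row in rows:
    print("\t".join(str(cell) for cell in row))
```

The MIME type, software and version details would still need to be carried across from the DROID report; they are left out here to keep the sketch short.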

Having completed this for 24 accessions, totalling 270,867 files, what have I discovered?

An impressive 97.96% of files were identified by DROID (compared with only 37% in Jen's smaller sample of research data)
So far 228 different PUIDs have been identified (compared with 34 formats in Jen’s sample)
The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%). See the top ten identified formats in the table below...

File format (version)	Total no. of files	% of all files
Microsoft Word Document (97-2003)	120,595	44.52%
Microsoft Word for Windows (2007 onwards)	15,261	5.63%
Microsoft Excel 97 Workbook	13,750	5.08%
Graphics Interchange Format	11,240	4.15%
Acrobat PDF 1.4 - Portable Document Format	8,458	3.12%
JPEG File Interchange Format (1.01)	7,357	2.72%
Microsoft Word Document (6.0 / 95)	6,656	2.46%
Acrobat PDF 1.3 - Portable Document Format	6,462	2.39%
JPEG File Interchange Format (1.02)	4,947	1.83%
Hypertext Markup Language (v4)	4,512	1.67%

I can now quickly look-up whether an individual archive has a particular file type, and see how frequently it occurs. Once I have processed a few more accessions it may be possible to create a "profile" for an individual literary collection or a small business and use this to inform discussions with depositors. I can also start to look at the identified file formats and determine whether there is a strategy in place to migrate that format. Where this isn’t the case, knowing the number and frequency of the format amongst the collections will allow me to prioritise my efforts. I will also look to aggregate the data – for example merging all of the different versions of Adobe Acrobat or MS Word.
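
To illustrate what aggregating versions might look like, here is a small sketch that merges the Word, PDF and JPEG rows of the top ten. The file counts are taken from the table in this post; the family rules are a hypothetical keyword match on the format name, chosen only for this example, and a real mapping would more likely be keyed on PUID.

```python
from collections import Counter

# Hypothetical rules: (text to look for in the format name, family label).
FAMILY_RULES = [
    ("Microsoft Word", "MS Word"),
    ("Acrobat PDF", "PDF"),
    ("JPEG", "JPEG"),
]


def aggregate_by_family(counts_by_format):
    """Merge format versions into families; unmatched formats keep their own name."""
    families = Counter()
    for name, count in counts_by_format.items():
        family = next((label for key, label in FAMILY_RULES if key in name), name)
        families[family] += count
    return families


# File counts for the top ten formats, as reported in this post.
top_ten = {
    "Microsoft Word Document (97-2003)": 120595,
    "Microsoft Word for Windows (2007 onwards)": 15261,
    "Microsoft Excel 97 Workbook": 13750,
    "Graphics Interchange Format": 11240,
    "Acrobat PDF 1.4 - Portable Document Format": 8458,
    "JPEG File Interchange Format (1.01)": 7357,
    "Microsoft Word Document (6.0 / 95)": 6656,
    "Acrobat PDF 1.3 - Portable Document Format": 6462,
    "JPEG File Interchange Format (1.02)": 4947,
    "Hypertext Markup Language (v4)": 4512,
}
families = aggregate_by_family(top_ten)
for family, total in families.most_common():
    print(family, total)
```

On these figures the three Word versions in the top ten together account for 142,512 files, the two PDF versions for 14,920 and the two JPEG versions for 12,304.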

I haven’t forgotten the 5520 unidentified files. By noting the PRONOM signature file number used to profile each archive, it is easy to repeat the process with a later signature file. This could validate the previous results or enable previously unidentified files to be identified (particularly if I use the results of this exercise to feed information back to the PRONOM team). Knowing which accessions have the largest number of unidentified files will allow me to focus my effort as appropriate.
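
Repeating the process with a later signature file raises the question of how to compare the two sets of results. One way, sketched here under the assumption that each run has been reduced to a mapping of file path to PUID (with an empty string for unidentified files), is to list the files that have become identified and those whose identification has changed. The paths and the placeholder PUID "fmt/NEW" are made up.

```python
def compare_runs(old, new):
    """Compare two {file path: PUID} mappings from runs with different signature files."""
    newly_identified = sorted(
        path for path, puid in old.items() if not puid and new.get(path)
    )
    changed = sorted(
        path for path, puid in old.items()
        if puid and new.get(path) and new[path] != puid
    )
    still_unidentified = sorted(
        path for path, puid in old.items() if not puid and not new.get(path)
    )
    return newly_identified, changed, still_unidentified


# Made-up results from an earlier and a later signature file.
old_run = {"letters/a.doc": "fmt/40", "misc/b.xyz": "", "misc/c.dat": ""}
new_run = {"letters/a.doc": "fmt/40", "misc/b.xyz": "fmt/NEW", "misc/c.dat": ""}

newly_identified, changed, still_unidentified = compare_runs(old_run, new_run)
print("Newly identified:", newly_identified)
print("Identification changed:", changed)
print("Still unidentified:", still_unidentified)
```

Files that remain unidentified after the later run are the natural candidates for feeding back to the PRONOM team.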

Whilst this has certainly been a useful exercise in its own right, it is also interesting to note the similarities between this and the born-digital archives profile published by the Bentley Historical Library, and the contrast with the research data profile Jen reported on.

The top ten identified formats from Hull and Bentley are quite similar. Both have a good success rate for identifying file formats with 90% identified at Bentley and 98% at Hull. Though the formats do not appear in the same order in the top ten, they do contain similar types of file (MS Word, PDF, JPEGs, GIFs and HTML).

In contrast, only 37% of files were identified in York's research data sample and the top ten file formats that were identified look very different. The only area of overlap is MS Excel files, which appear high up in the York research dataset as well as being in the top ten for the Hull History Centre.

Research data is different.

Jenny Mitcham, Digital Archivist

Friday, 7 November 2014

A non-archivist's perspective on cataloguing born digital archives

As blogged in my previous post, earlier this week I attended an ARA Section for Archives and Technology event on Born Digital Cataloguing and also had the opportunity to talk about some of the Borthwick's current work in this area.

I gave a non-archivist's perspective on born digital cataloguing. These were the main points I tried to put across, though some of the points below were also informed by discussions on the day:

Born digital cataloguing within a purely digital archive is reasonably straightforward. The real complexity comes when working with hybrid archives where content is both physical and digital
The Archaeology Data Service are good at born digital cataloguing. This is partly because they only have digital material to worry about, but also down to the fact that they have many years of experience and the necessary systems in place. Their new ADS Easy system allows depositors to submit data for archiving along with the required metadata (which they can enter both at project level and individual file level). A web interface for disseminating this data can then be created in a largely automated fashion. It makes sense to ask the person who knows the most about the data to catalogue it, freeing up the digital archivists' time to focus on checking the received data and metadata and more specialist digital preservation work.
Communication can be a problem between traditional archivists and digital archivists. We may use different metadata standards and we may not always know what the other is talking about. I was at the Borthwick Institute for approximately a year before I worked out that when my colleagues talked about describing archives at file level (which may cover multiple physical documents within the same physical file), they didn't mean the same as my perception of 'file level metadata' (which would apply to a single digital item). It is important to recognise these differences and try and work around them so that we understand each other better when working with hybrid archives.

A digital archivist may speak a different language to traditional archivists, but we can work around this

At the Borthwick we are in the process of implementing a new system for accessioning and cataloguing archives (both physical and digital archives). We have installed a version of AtoM (Access to Memory) and have imported one of our more complex catalogues into it. We now need to build on this proof of concept and fully establish and populate this system. As well as holding information about our physical holdings, this system will provide a means of cataloguing born digital data and also the foundations on which a digital archiving system can be built. It will also provide the means by which we can disseminate digital objects to our users.
There are other types of metadata that are required for digital material and these are outside the scope of AtoM which is primarily for resource discovery metadata. More technical metadata relating to digital objects and any transformations they undergo needs to reside within a digital archiving system. This is where Archivematica comes in. We are currently testing this digital preservation system to establish whether it meets our digital archiving needs.
I worry about the identifiers we use within archival catalogues. The traditional archival identifier is performing two jobs – firstly acting as a unique identifier or reference for an item or group of items, and secondly showing where those items sit within the archival hierarchy. This can lead to problems...

...if the arrangement of the archive changes – this may lead to the identifier changing, which is never a good thing once it has been published and made available, or if that identifier is being used to link between different systems.
...if we want to start describing objects before we know where they sit within the hierarchy. This may be the case in particular for digital material where we may want to start working with it with greater urgency than the physical element of the archive.*

We can argue that digital isn't different, but with digital we do tend to think more at item level. Digital preservation activities and the technical and preservation metadata that this generates are all at file (item) level, so perhaps it makes sense for the resource discovery metadata to follow this pattern. Unlike physical archives, for digital archives we can pretty easily generate a title (or file name) for every item. If we are to deal with digital archives at file level would this cause confusion when cataloguing a hybrid archive?
Before we incorporate digital material into a digital archive, some selection and appraisal needs to be carried out - depending on the digital archiving system in use, it can be non-trivial to remove files from an AIP (archival information package) once they have been transferred, so we really do need to have a good idea of what is and isn't included before we carry out our preservation activities. In order to carry out this selection we may wish to start putting together a skeletal description of each item. Wouldn't it be nice if we could start to do this in a way which could be easily transferred into an archival management system? At the moment I have been doing this in a separate spreadsheet but we need strategies that are more sustainable and scalable.
Workflows are crucially important. Who does the born digital cataloguing where hybrid archives are concerned? Its place within the archive as a whole is key, so it should be catalogued in tandem with the physical, but if we want to archive the digital material more rapidly than the physical, how do we ensure we have the right workflows and procedures in place? Much of this will come down to institutional policies and procedures and the capabilities of the technologies you are using. These are still issues we are grappling with here at the Borthwick as we try and establish a framework for carrying out born digital cataloguing.

* as an aside (and a bit off-topic), my other bugbear with archival identifiers is that they contain slashes (which means we can’t use them in directory or file names) and that they don’t order correctly in a spreadsheet or database as they are a mixture of numeric and alphabetical characters
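
Both of these bugbears can be worked around in code, at least partially. The sketch below uses invented identifiers in a slash-separated style: `natural_key` sorts the numeric parts as numbers rather than text, and `to_filename` swaps slashes for underscores. Both helper names are hypothetical, and the underscore substitution is only safe to reverse if the identifiers never contain underscores themselves.

```python
import re


def natural_key(identifier):
    """Sort key that compares runs of digits as numbers, so 2 comes before 10."""
    parts = re.split(r"(\d+)", identifier)
    return [int(part) if part.isdigit() else part.lower() for part in parts]


def to_filename(identifier):
    """Replace slashes so the identifier can be used in a file or directory name."""
    return identifier.replace("/", "_")


# Invented identifiers, for illustration only.
identifiers = ["XYZ/1/10", "XYZ/1/2", "XYZ/1/1"]

print(sorted(identifiers))                    # plain text sort
print(sorted(identifiers, key=natural_key))   # numeric-aware sort
print([to_filename(i) for i in identifiers])
```

A plain text sort places "XYZ/1/10" before "XYZ/1/2", which is the ordering problem described above; the numeric-aware key puts them in the expected sequence.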

Jenny Mitcham, Digital Archivist

Wednesday, 5 November 2014

Born Digital Cataloguing (some thoughts from the ARA SAT event)

First day back to work after my holiday and I am straight back into the fray – no quiet day catching up on e-mails and getting my head slowly back into work mode for me! On Monday I attended an ARA Section for Archives and Technology event on Born Digital Cataloguing and also had the opportunity to talk about some of our current work in this area.

It was great to see the event so well attended (the organisers had to find a bigger room due to the huge amount of interest!). This is clearly an important and interesting subject for many archives professionals and it was clear throughout the day that many of us are grappling with very similar issues. Here are some of the main points that I latched on to from the morning’s presentations:

It is important to preserve the directory structure of digital files as submitted into the archive – even if you subsequently move the files into a different structure. This is the equivalent of original order and can give context to the files. End users of the digital archive should also have access to this information so it is included in the description within the Discovery interface (Anthea Seles, TNA).
Users don’t really know what they want or need with regard to born digital material – it is too early to say and too new a field. We need to try and predict what they will require and also need to learn from our experiences as we go along (Anthea Seles, TNA).
“It’s all just stuff” – born digital archives should be treated the same as paper as far as possible (Chris Hilton, Wellcome Library).
Interesting case study from the Wellcome Library about how an archival management system (Calm) and a digital preservation system (Preservica) can work together. It is important to establish which data is duplicated between the 2 systems (there may be some overlap) and if this is the case, which is the master data and how the information is synchronised between systems. In this case study, digital data starts off in Preservica and overnight catalogue records are copied over into Calm. Calm then becomes the master for resource discovery metadata and any subsequent edits need to be made in Calm before syncing back to the digital archive (Chris Hilton, Wellcome Library).
Original order – in the Wellcome Library’s case study, the method the creator or donor used to store and order his digital files was different to the system of arrangement used for paper. Digital files were arranged chronologically but the paper archive was arranged according to themes. This results in a hybrid archive that is ordered or arranged inconsistently depending on the media and leaves the archivists with a decision to make (Victoria Sloyan, Wellcome Library).
Workflow is crucially important. It matters what happens when. Once data is ingested into a digital archive such as Preservica (I believe Archivematica is the same), it becomes difficult to remove individual items from the Archival Information Package. This becomes more of a problem when that information has also been replicated into an archival management system. Selection and appraisal therefore need to happen at an early stage in the workflow, and we also need to accept that our digital archives may not be perfect – we are unlikely to be able to weed out all redundant files on a first pass so we may end up with items in the digital archive that are not needed (Victoria Sloyan, Wellcome Library).
Should we stop using the word cataloguing and instead talk about ‘enabling discovery’? This is really what we are trying to do. We may end up moving away from the traditional archival catalogue (particularly for digital data) but we still need to ensure that we can enable our users to find the information they require. Digital collections may lead to alternative (less labour intensive) ways of enabling resource discovery (Jessica Womack and Rebecca Webster, Institute of Education).
We should be working with donors and depositors to get them to structure and label their data appropriately (and thus help with born digital cataloguing). It is very hard for archivists to deal with large quantities of digital data that has been created with little order or structure (Jessica Womack and Rebecca Webster, Institute of Education).
Digital is different to paper in that it requires more immediate action once it has been accepted into an archive and we need to ensure our processes, procedures and workflows can cope with this (Jessica Womack and Rebecca Webster, Institute of Education).
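
On the first point above, about preserving the directory structure as submitted: one lightweight way to capture original order before any files are moved is to write out a manifest of relative paths. This is a minimal sketch and not a description of how TNA or anyone else at the event does it; the folder and file names are made up and `record_original_order` is a hypothetical helper.

```python
import os
import tempfile


def record_original_order(root):
    """List every file under root as a path relative to root, in a stable order."""
    manifest = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk sub-folders in a predictable order
        for name in sorted(filenames):
            relative = os.path.relpath(os.path.join(dirpath, name), root)
            # Store with forward slashes so the manifest reads the same on any OS.
            manifest.append(relative.replace(os.sep, "/"))
    return manifest


# Demonstration on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as deposit:
    os.makedirs(os.path.join(deposit, "drafts"))
    for rel in ("notes.txt", os.path.join("drafts", "chapter1.doc")):
        with open(os.path.join(deposit, rel), "w") as handle:
            handle.write("example")
    manifest = record_original_order(deposit)

print(manifest)
```

The resulting list could be stored alongside the deposit so that the original arrangement can still be shown to users after the files have been reorganised.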

The last scheduled presentation of the morning was mine, in which I gave a non-archivist's perspective on born digital cataloguing. I'll try and summarise some of my points in a separate post later this week.

And here are some of the main messages I took away from the day as a whole:

Try things out – it is better to do something now than to wait until we have a perfect solution. This is the best way of learning what works and what doesn't.
Accept that the solutions you put in place may be temporary. We are all learning, and born digital cataloguing is not a solved problem (particularly with regard to hybrid archives).
Be honest about failures as well as successes – others can learn as much from finding out what didn't work and why as they can from finding out what did.
Think about which approaches are scalable in the longer term. Digital archives are going to increase in size and volume and we need to explore different ways of enabling discovery.

Despite the fact that there were more problems than solutions highlighted during the course of the day, it was comforting (as always) to discover that we are not alone!

Jenny Mitcham, Digital Archivist

Monday, 17 March 2014

'Routine encounters with the unexpected' (or what we should tell our digital depositors)

I was very interested a few months back to hear about the release of a new and much-needed report on acquiring born-digital archives: Born Digital: Guidance for Donors, Dealers, and Archival Repositories published by the Council on Library and Information Resources. I read it soon after it was published and have been mulling over its content ever since.

The quote within the title of this post "routine encounters with the unexpected" is taken from the concluding section of the report and describes the stewardship of born-digital archival collections. The report intends to describe good practices that can help reduce these archival surprises.

The publication takes an interesting and inclusive approach, being aimed both at archivists who will be taking in born-digital material and at those individuals and organisations involved in offering born-digital material to an archive or repository.

It appeared at a time when I was developing new content for our new website aimed specifically at donors and depositors, and a couple of weeks before I went on my first trip to collect someone's digital legacy for inclusion in our archive. Over the last few months, alongside archivist colleagues, I have also been planning and documenting our own digital accessions workflow. This report has been a rich source of information and advice and has helped inform all of these activities.

There is lots of food for thought within the publication but what I like best are the checklists at the end which neatly summarise many of the key issues highlighted within the report and provide a handy quick reference guide.

Much as I find this a very useful and interesting publication it got me thinking about the alternative and apparently conflicting advice that I give depositors and how the two relate.

I have always thought that one of the most important things that anyone can do to ensure that their digital legacy survives into the future is to put into practice good data management strategies. These strategies are often just simple common sense rules, things like weeding out duplicate or unnecessary files, organising your data into sensible and logical directory structures and naming them well.
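
As an illustration of the 'weeding out duplicates' rule, exact duplicates can be found by comparing checksums. The sketch below only reports groups of identical files and deletes nothing; it assumes that 'duplicate' means byte-for-byte identical content, and the file names in the demonstration are made up.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def find_duplicates(root):
    """Group files under root whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in chunks so large files do not need to fit in memory.
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]


# Demonstration on a throwaway folder with made-up files.
with tempfile.TemporaryDirectory() as folder:
    for name, text in (("a.txt", "same"), ("b.txt", "same"), ("c.txt", "different")):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write(text)
    duplicates = [
        [os.path.basename(path) for path in group]
        for group in find_duplicates(folder)
    ]

print(duplicates)
```

Whether any of the reported files should actually be removed remains a decision for the person who knows the data.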

Where we have depositors who wish to give us born-digital material for our archive, I would like to encourage them to follow rules like these to help ensure that we can make better sense of their data when it comes our way. This also helps fulfil the OAIS responsibility to ensure the independent utility of data - the more we know about data from the original source, the greater the likelihood that others will be able to make sense of it in the future. I have put guidance to this effect on our new website which is based on an advice sheet from the Archaeology Data Service.

Screenshot of the donor and depositor FAQ page on the Borthwick Institute's new website

However, this goes against the advice in the 'Born Digital' report which states that "...donors and dealers should not manipulate, rearrange, extract, or copy files from their original sources in anticipation of offering the material for gift or purchase."

In a blog post last year I talked about a digital rescue project I had been working on, looking at the data on some 5 1/4 inch floppy disks from the Marks and Gran archive. This project would not have been nearly as interesting if someone had cleaned up the data before deposit - rationalising and re-naming files and deleting earlier versions. There would have been no detective story and information about the creative process would have been lost. However, if all digital deposits came to us like this would we be able to resource the amount of work required to make sense of them?

So, my question is as follows. What do we tell our depositors? Is there room for both sets of advice - the 'organise your data before deposit' approach aimed at those organisations who regularly deposit their administrative information with us, and the 'leave well alone' approach for the digital legacies of individuals? This is the route I have tried to take on our new website, however, I have concerns as to whether it will be clear enough to donors and depositors as to which advice they should follow, especially where there are areas of cross-over. I'm interested to hear how other archives handle this question.

Jenny Mitcham, Digital Archivist
