Showing posts sorted by relevance for query research data what does it really look. Sort by date Show all posts
Showing posts sorted by relevance for query research data what does it really look. Sort by date Show all posts

Tuesday, 31 May 2016

Research data - what does it *really* look like?

Work continues on our Filling the Digital Preservation Gap project and I thought it was about time we updated you on some of the things we have been doing.

While my colleague Julie has been focusing on the more technical issues of implementing Archivematica for research data. I have been looking at some real research data and exploring in more detail some of the issues we discussed in our phase 1 report.

For the past year, we have been accepting research data for longer term curation. Though the systems for preservation and access to this data are still in development, we are for the time being able to allocate a DOI for each dataset, manage access and store it safely (ensuring it isn't altered) and intend to ingest it into our data curation systems once they are ready.

Having this data in one place on our filestore does give me the opportunity to test the hypothesis in our first report about the wide range of file formats that will be present in a research dataset and also the assertion that many of these will not be identified by the tools and registries in use for the creation of technical metadata.

So, I have done a fairly quick piece of analysis on the research data, running a tool called Droid developed by The National Archives over the data to get an indication of whether the files can be recognised and identified in an automated fashion.

All the data in our research data sample has been deposited with us since May 2015. The majority of the data is scientific in nature - much of it coming from the departments of Chemistry and Physics. (this may be a direct result of expectations from the EPSRC around data management). The data is mostly fairly recent, as suggested by the last modified dates on these files, which range from 2006 to 2016 with the vast majority having been modified in the last five years. The distribution of dates is illustrated below.




Here are some of the findings of this exercise:

Summary statistics


  • Droid reported that 3752 individual files were present*

  • 1382 (37%) of the files were given a file format identification by Droid

  • 1368 (99%) of those files that were identified were given just one possible identification. 12 files were given two possible identifications and a further two were given 18 possible identifications. In all these cases, the identification was done by file extension rather than signature - which perhaps explains the uncertainty

Files that were identified

  • Of the 1382 files that were identified: 
    • 668 (48%) were identified by signature (which suggests a fairly accurate identification - if a file is identified by signature it means that Droid has looked inside the file and seen something that it recognises. I'm told it does this by some sort of magic!)
    • 648 (47%) were identified by extension alone (which implies a less accurate identification)
    • 65 (5%) were identified by container. These were all Microsoft Office files - xlsx and docx as these are in effect zip files (which suggests a high level of accuracy)

  • 111 (8%) of the identified files had a file extension mismatch - this means that the file extension was not what you would expect given the identification by signature. 
    • All but 16 of these files were XML files that didn't have the .xml file extension (there were a range of extensions for these files including .orig, .plot, .xpr, .sc, .svg, .xci, .hwh, .bxml, .history). This isn't a very surprising finding given the ubiquity of XML files and the fact that applications often give their own XML output different extensions.

  • 34 different file formats were identified within the collection of research data

  • Of the identified files 360 (26%) were XML files. This was by far the most common file format identified within the research dataset. The top 10 identified files are as follows:
    • Extensible Markup Language - 360
    • Log File - 212
    • Plain Text File - 186
    • AppleDouble Resource Fork - 133
    • Comma Separated Values - 109
    • Microsoft Print File - 77
    • Portable Network Graphics - 73
    • Microsoft Excel for Windows - 57
    • ZIP Format - 23
    • XYWrite Document - 21

Files that weren't identified

  • Of the 2370 files that weren't identified by Droid, 107 different file extensions were represented

  • 614 (26%) of the unidentified files had no file extension at all. This does rather limit the chance of identification being that identification by file extension is relied on quite heavily by Droid and other similar tools. Of course it also limits our ability to actively curate this data unless we can identify it by another means.

  • The most common file extensions for the files that were not identified are as follows:
    • dat - 286
    • crl - 259
    • sd - 246
    • jdf - 130
    • out - 50
    • mrc - 47
    • inp - 46
    • xyz - 37
    • prefs - 32
    • spa - 32



Some thoughts

  • This is all very interesting and does back up our assertions about the long tail of file formats within a collection of research data and the challenges of identifying this data using current tools. I'd be interested to know whether for other collections of born digital data (not research data) a higher success rate would be expected? Is identification of 37% of files a particularly bad result or is it similar to what others have experienced?

  • As mentioned in a previous blog post, one area of work for us is to get some more research data file formats into PRONOM (a key database of file formats that is utilised by digital preservation tools). Alongside previous work on the top software applications used by researchers (a little of this is reported here) this has been helpful in informing our priorities when considering which formats we would like The National Archives to develop signatures for in phase 3 of the project.

  • Given the point made above, it could be suggested that one of our priorities for file format research should be .dat files. This would make sense being that we have 286 of these files and they are not identified by any means. However, here lies a problem. This is actually a fairly common file extension. There are many different types of .dat files produced by many different applications. PRONOM already holds information on two varieties of .dat file and the .dat files that we hold appear to come from several different software applications. In short, solving the .dat problem is not a trivial exercise!

  • It strikes me that we are really just scratching the surface here. Though it is good we are getting a few file signatures developed as an output for this project, this is clearly not going to make a big impact given the size of the problem. We will need to think about the community should continue this work going forward.

  • It has been really helpful having some genuine research data to investigate when thinking about preservation workflows - particularly those workflows for unidentified files that we were considering in some detail during phase 2 of our project. The unidentified file report that has been developed within Archivematica as a result of this project helpfully organises the files by extension and I had not envisaged at the time that so many files would have no file extension at all. We have discussed previously how useful it is to fall back on identification by file extension if identification by signature is unsuccessful but clearly for so many of our files this will not be worth attempting.




* note that this turns out not to be entirely accurate given that eight of the datasets were zipped up into a .rar archive. Though Droid looks inside several types of zip files, (including .zip, .gzip, .tar, and the web archival formats .arc and .warc) it does not yet look inside .rar, 7z, .bz, and .iso files. I didn't realise this until after I had carried out some analysis on the results. Consequently there are another 1291 files which I have not reported on here (though a quick run through with Droid after unzipping them manual identified 33% of the files so a similar ratio to the rest of the data. Note that this functionality is something that the team at The National Archives intend to develop in the future.

Jenny Mitcham, Digital Archivist

Friday, 5 August 2016

Research data is different

This is a guest post from Simon Wilson who has been profiling the born-digital data at the Hull History Centre to provide another point of comparison with the research data at York reported on in this blog back in May.

Inspired by Jen’s blog Research data - what does it *really* look like? about the profile of the  research data at York and the responses it generated including that from the Bentley Historical Library, I decided to take a look at some of the born-digital archives we have at Hull. This data is not research data from academics, it is data that has been donated to or deposited with the Hull History Centre and it comes from a variety of different sources.

Whilst I had previously created a DROID report for each distinct accession I have never really looked into the detail, so for each accession I did the following;

  1. Run the DROID software and export the results into csv format with one row per file 
  2. Open the file in MS Excel and copy the data to a second tab for the subsequent actions
  3. Sort the data by Type field into A-Z order and then delete all of the records relating to folders 
  4. Sort the data on the PUID field into A-Z order
  5. For large datasets highlight the data and then select the subtotal tool and use it to count each time the PUID field changes and record the sub-total
  6. Once the subtotal tool has completed its calculations, select the entire dataset and select Hide Detail (adjacent to Subtotal in the Outline tools box) to leave you with just a row for each distinct PUID and the total count value


I then created a simple spreadsheet with a column for each distinct accession and added a row for each unique PUID, copying the MIME type, software and version details from the DROID report results.  I also noted the number of files that were not identified. There may be quicker ways to get the same results and I would love to hear other suggestions or shortcuts.

After having completed this for 24 accessions - totalling 270,867 files, what have I discovered?

  • An impressive 97.96% of files were identified by DROID (compared with only 37% in Jen's smaller sample of research data)
  • So far 228 different PUIDs have been identified (compared with 34 formats in Jen’s sample)
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%). See the top ten identified formats in the table below...


File format (version)
Total No files
% of total identified files
Microsoft Word Document (97-2003)
120,595
44.52%
Microsoft Word for Windows (2007 onwards)
15,261
5.63%
Microsoft Excel 97 Workbook
13,750
5.08%
Graphics Interchange Format
11,240
4.15%
Acrobat PDF 1.4 - Portable Document Format
8458
3.12%
JPEG File Interchange Format (1.01)
7357
2.72%
Microsoft Word Document (6.0 / 95)
6656
2.46%
Acrobat PDF 1.3 - Portable Document Format
6462
2.39%
JPEG File Interchange Format (1.02)
4947
1.83%
Hypertext Markup Language (v4)
4512
1.67%


I can now quickly look-up whether an individual archive has a particular file type, and see how frequently it occurs.  Once I have processed a few more accessions it may be possible to create a "profile" for an individual literary collection or a small business and use this to inform discussions with depositors.  I can also start to look at the identified file formats and determine whether there is a strategy in place to migrate that format. Where this isn’t the case, knowing the number and frequency of the format amongst the collections will allow me to prioritise my efforts.  I will also look to aggregate the data – for example merging all of the different versions of Adobe Acrobat or MS Word.

I haven’t forgotten the 5520 unidentified files. By noting the PRONOM signature file number used to profile each archive, it is easy to repeat the process with a later signature file.  This could validate the previous results or enable previously unidentified files to be identified (particularly if I use the results of this exercise to feed information back to the PRONOM team). Knowing which accessions have the largest number of unidentified files will allow me to focus my effort as appropriate.

Whilst this has certainly been a useful exercise in its own right, it is also interesting to note the similarities between this the and the born-digital archives profile published by the Bentley Historical Library and the contrast with the research data profile Jen reported on.

The top ten identified formats from Hull and Bentley are quite similar. Both have a good success rate for identifying file formats with 90% identified at Bentley and 98% at Hull. Though the formats do not appear in the same order in the top ten, they do contain similar types of file (MS Word, PDF, JPEGs, GIFs and HTML).

In contrast, only 37% of files were identified in York's research data sample and the top ten file formats that were identified look very different. The only area of overlap being MS Excel files which appear high up in the York research dataset as well as being in the top ten for the Hull History Centre.

Research data is different.



Jenny Mitcham, Digital Archivist

Wednesday, 4 January 2017

Hello 2017

Looking back


2016 was a busy year.

I can tell that from just looking at my untidy desk...I was going to include a photo at this point but that would be too embarrassing.

The highlights of 2016 for me were getting our AtoM catalogue released and available to the world in April, completing Filling the Digital Preservation Gap (and seeing the project move from the early 'thinking' phases to actual implementation) and of course having our work on this project shortlisted in the Research and Innovation category of the Digital Preservation Awards.

...but other things happened too. Blogging really is a great way of keeping track of what I've been working on and of course what people are most interested to read about.

The top 5 most viewed posts from 2016 on this blog have been as follows:

  • Research Data - what does it *really* look like? - A post describing my (not entirely successful) efforts to automatically identify the file formats of research data deposited with Research Data York using DROID. This post spawned other similar posts profiling data using DROID and the cumulative value of all of these profiles is gradually increasing over time. I'm still keen to follow this up with a comparison using the born digital data that we hold at the Borthwick Institute so hopefully that is something for 2017.
  • A is for AtoM - An A-Z (actually I only got to 'Y'!) of implementing AtoM at the Borthwick. This post covers some of the problems and issues we have had to address and decisions we have made as we have gone through the process of getting our new archival management system up and running.
  • Modelling Research Data with PCDM - A guest post by Julie Allinson on some thinking carried out as part of the implementation work for Filling the Digital Preservation Gap project. The post describes some preliminary work to define a data model for datasets using the Portland Common Data Model.
  • Why AtoM? - A look back at why we selected AtoM for our archival management system and how it meets our requirements. This post was in response to a question I was frequently asked and hopefully is useful to others who are going through a similar selection process.
  • From Old York to New York: PASIG 2016 - Quite a long summary of the highlights of the PASIG conference that I attended in New York in October 2016. There was some fantastic content at this event and my post really just scrapes the surface of this!


Looking forward


So what is on the horizon for 2017?

Here are some of the things I'm going to be working on - expect blog posts on some or all of these things as the year progresses.

AtoM

I blogged about AtoM a fair bit last year as we prepared our new catalogue for release in the wild! I expect I'll be talking less about AtoM this year as it becomes business as usual at the Borthwick, but don't expect me to be completely silent on this topic.

A group of AtoM users in the UK is sponsoring some development work within AtoM to enable EAD to be harvested via OAI-PMH. This is a very exciting new collaboration and will see us being able to expose our catalogue entries to the wider world, enabling them to be harvested by aggregators such as the Archives Hub. I'm very much looking forward to seeing this take shape.

This year I'm also keen to explore the Locations functionality of AtoM to see whether it is fit for our purposes.

Archivematica

Work with Archivematica is of course continuing. 

Post Filling the Digital Preservation Gap at York we are working on moving our proof of concept into production. We are also continuing our work with Jisc on the Research Data Shared Service. York is a pilot institution for this project so we will be improving and refining our processes and workflows for the management and preservation of research data through this collaboration.

Another priority for the year is to make progress with the preservation of the born digital data that is held by the Borthwick Institute for Archives. Over the year we will be planning a different set of Archivematica workflows specifically for the archives. I'm really excited about seeing this take shape.

We are also thrilled to be hosting the first European ArchivematiCamp here in York in the Spring. This will be a great opportunity to get current and potential Archivematica users across the UK and the rest of Europe together to share experiences and find out more about the system. There will no doubt be announcements about this over the next couple of months once the details are finalised so watch this space.

Ingest processes

Last year a new ingest PC arrived on my desk. I haven't yet had much chance to play with this but the plan is to get this set up for digital ingest work.

I'm keen to get BitCurator installed and to refine our current digital ingest procedures. After some useful chats about BitCurator with colleagues in the UK and the US over 2016 I'm very much looking forward to getting stuck into this.




...but really the first challenge of 2017 is to tidy my desk!


Jenny Mitcham, Digital Archivist

Friday, 19 August 2016

My first file format signature

As part of Filling the Digital Preservation Gap we've been doing a lot of talking about the importance of accurate file format identification and the challenges of doing so for research data.

Now we are thinking about how we can help solve the problem.

As promised in a post last month, I wanted to have a go at file format signature creation for PRONOM to see whether it is something that an average digital archivist could get their head around. Never before had I created my own signature. In the past I had considered this to be work that only technical people could carry out and it would be foolhardy to attempt it myself.

However, given the extent of the file formats identification challenge for research data wouldn't it be great if the community could engage more directly? Also, shouldn't file signature development be something every digital archivist should have a good understanding of?

Encouraged by Ross Spencer's blog post Five Star File Format Signature Development and a meeting with the PRONOM team at The National Archives in which the tricks and challenges of signature creation were explained, I decided to give it a go.

Where to begin?

  • First I read TNA's How to research and develop signatures for file format identification. This is an accessible and readable guide which tells you how to get started with signature development - from gathering samples, doing internet research on the format and using a Hex Editor to spot patterns. You don't need to be very technically minded to get your head around it.
  • Then I downloaded and installed a Hex editor. Though it is possible to view files as hexadecimal within Quick View Plus, I followed TNA's advice and used HxD Hex Editor as this allows you to compare files thus partially automating the process of spotting sequences.
  • Once I'd spotted a pattern which could be used to create a signature, I planned to use PRONOM's Signature Development Utility to create it. 
  • Once the signature is created I'm told it is possible to test this using DROID. Within DROID, go to Tools... Install signature file and replace the current signature with your new one (but remember to put it back again once you are done otherwise you may wonder why DROID isn't working properly!). Run this over your directory of sample files to see if they are all correctly identified using the signature you have developed. 


I decided start off with something that had been at the back of my mind for a while...those Wordstar 4.0 files from the Marks and Gran archive that I blogged about some time ago and had struggled to identify. When I wrote that post three years ago, Wordstar 4.0 files were not represented in PRONOM. They have more recently been added and the files can be identified but this is by extension only - not the more accurate file signature. I thought it would be fun to try and create a file signature for them.

I was very wrong.

My attempts to see a pattern within the files using the Hex Editor were unsuccessful. I decided to send the sample files to the experts at TNA to see if I was missing something. It was quickly confirmed that this was a rather awkward file type and not one that lent itself well to being automatically identified. Disappointing but at least it confirmed that my own investigations were not lacking.

For my next attempt I decided to tackle some of the unidentified research data that I had highlighted in my previous post Research data - what does it *really* look like?

I looked through the top ten most frequent unidentified file extensions in my sample and started to dig out the files themselves and assess whether they were a good candidate for me to work on. Ross Spencer suggests that PRONOM lends itself best to the creation of signatures for binary formats so this is what I wanted to focus on. No point in trying to make it hard for myself!

  • dat - 286
  • crl - 259
  • sd - 246
  • jdf - 130 (a signature for these JEOL NMR Spectroscopy files is now available)
  • out - 50
  • mrc - 47
  • inp - 46
  • xyz - 37
  • prefs - 32
  • spa - 32
Unfortunately, looking through the list (and digging out some samples) I discovered that many of these are ASCII formats rather than binary. It is possible to create signatures to identify ASCII files but it can be challenging (involving quite complex regular expressions) and not a great place for a first timer to start. I certainly did not want to start to tackle the confusing landscape of .dat files either!

After a little bit of investigation I discovered that the .spa files were something I could work with. I knew nothing about this format but found the relevant files and started doing some internet research looking for more information and perhaps some additional samples. I soon discovered they were one of many formats for optical spectroscopy and are known as Thermo Fisher’s OMNIC file format or Thermo Scientific OMNIC or Nicolet/Thermo OMNIC.

Looking at some of the files using a Hex Editor it was immediately apparent that there was a consistent pattern of bytes at the start of each file. A string which read 'Spectral Data File' which was represented by 53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65 in hexadecimal. Note that I actually thought the pattern was longer but advice received from the PRONOM team suggested that it was better to cut it down.

I also looked at the end of each file and at first sight there appeared to be consistency here too with each file ending with the same few bytes. This hypothesis was blown out of the water when I looked at a sample file that I had discovered online which did not display this pattern (but luckily did have the same bytes at the start of the file).

This is why it is so important to have sample data that comes from more than one source. A set of files from a single researcher may have misleading patterns that have occurred just because of the consistent way in which they work, rather than this being a true feature of the format itself.

So, once I'd looked at all 33 files and had convinced myself the hypothesis was solid, I went to the online signature development tool provided by The National Archives and created my signature.

PRONOM signature development tool

This was relatively easy to use but there were areas where more guidance was needed (so questions were fired off to the PRONOM team and a speedy response was received). I'm hoping that in the future there will be more documentation to help guide the completion of this form - so that people know how best to name the signature, where to find a definitive list of Mimetypes (this is the list they suggested I looked at), and what the Version field should contain (it is for the version of the file format if this is apparent/relevant - not the version of the signature you are creating).

Once I was done, I clicked on the 'Save Signature File' button and I was presented with the finished XML file:

Ta daaaaa!

I briefly admired my handiwork before sending it off to The National Archives for feedback.*

How long did it take me to do all of this? I would say one full day is a fair estimate (that would include reading the guidance, downloading the Hex Editor and a few false starts as I tried to find a format that I thought I could handle). The next signature would be much quicker.

The biggest challenges:

  1. It took me a while to find a binary format that I could work with. Much of the research data we hold appears to be ASCII formats ....which has benefits from a digital preservation perspective, but wasn't what I was looking for with regard to this exercise
  2. I did not really understand the file format I was working with. I am not a chemist. I have never heard of the .spa format. I struggle to even say 'Spectroscopy' let alone understand it. When I started to research it online I found the results quite confusing. If I knew more about the format in the first place it would have made life much easier. 
  3. There are limitations with the metadata we get from researchers when they deposit data with Research Data York. Reading the brief descriptions of the dataset that are provided did not really help me work out what the individual files are or what software and hardware was used to create them.
  4. I could not locate the file format specification online - I think next time I try this I may approach the software vendor direct and ask them for help. 
  5. Available documentation for creating and testing signatures could be enhanced. I had several questions as I went along and these were answered promptly by the PRONOM team, but if the information was all online then this would certainly help other newbies.


Despite the challenges this exercise has been both enjoyable and useful. The thing I like about being a digital archivist is being able to get hands on with the data and solve problems. Over the last few years I've done very little of this type of work so it was great to get stuck in. On top of the obvious benefit that after the next signature release these .spa files will now be recognised by DROID and other PRONOM-based file identification tools, I have also increased my knowledge and understanding of the process and this is a positive result.

I would definitely encourage other digital archivists, repository managers and research data managers to try this out for themselves.



* Feedback from the PRONOM team was positive. With a couple of modifications they were happy to include the signature in the next PRONOM signature release


Jenny Mitcham, Digital Archivist

Tuesday, 30 August 2016

Filling the Digital Preservation Gap - a brief update

As we near the end of the active phase of Filling the Digital Preservation Gap* here is a brief update about where we are with the main strands of work we highlighted in our phase 3 kick off blog post.

Archivematica implementation

Work at York

Work is ongoing at York to get our proof of concept implementation of Archivematica up and running. The purpose of this work was not to get a production service in place but to demonstrate that the implementation plan we published in our phase 2 report was feasible. The implementation we are developing pulls metadata (about deposited research datasets) from PURE and provides a method for capturing additional information for managing datasets  (filling some of the information gaps that are not collected through the PURE datasets module). It also includes an automated process to ingest deposited datasets (along with their metadata) into Archivematica, package them up for longer term preservation and provide a dissemination copy of the dataset to our repository. 

We have been doing this work in consultation with the staff at York who actually work with datasets that are deposited through our Research Data York service to ensure that the workflows and processes we are putting in place will make their lives easier rather than harder! We are keen to ensure that those processes that can be automated are automated and those areas where human input is required trigger e-mail notifications to relevant staff and a pause in the workflow to enable the relevant checks to be made.

Work at Hull

Like York, Hull is looking to produce a proof-of-concept system within the timeframe of the project. Whilst concentrating on research data for this Jisc funded work, we have our eye also on later using our approach for other forms of repository content that deserve long-term preservation.  To that end, we are taking as our starting point the institutional Box folder that each of our staff has access to; we will be asking depositors to assemble their material for the repository in a folder within their Box account.  As well as the content itself they will be asked for basic metadata and processing instructions in a very simple format.  When the folder is ready they share it with another Box account “owned” by Archivematica.

Hull has developed a “Box watcher” which detects the new share and instigates processing of the contents, keeping the depositor aware of progress along the way.  The contents of the folder are examined and, depending on what is found and how it is configured, one or more Bags (as in the BagIt standard) are created and handed off to Archivematica.

Like York we are then looking to have a fully automated Archivematica workflow which produces Archival Information Packages corresponding to each of the bags.  In addition, Hull will have Archivematica create Dissemination Information Package(s) which, once created, will automatically be processed to produce objects in the quality assurance queue of our Hydra repository.

Unidentified file formats

It has been clear from our project work during phase 3 that research data is much harder to identify in an automated fashion than other types of born digital data that an archive would typically hold. If you don’t believe us, read these 2 blog posts that show contrasting results when trying to identify two different types of born digital data: 


So, how are we working towards a solution? As well as directly sponsoring the development of a small selection of research data file formats by the PRONOM team at The National Archives, we also had a go at creating our own. York’s new signature will be incorporated into PRONOM in due course. Hull’s signature has been submitted and is just being tested by the PRONOM team. There have also been positive discussions with colleagues at The National Archives about wider public engagement around file format signature development and how we work towards increasing the coverage of PRONOM for research data file formats.


Dissemination and outreach

The project team have been keen to continue their focus on dissemination during phase 3 of the project. This has included presentations or posters at the following conferences and events:

  • International Digital Curation Conference (IDCC16), Amsterdam
  • 'Digital Preservation: Strategic Issues' - National Library of Wales
  • UK Archives Discovery Forum, Kew
  • UK Archivematica meeting, York
  • Research Data, Records and Archives: Breaking the Boundaries, Edinburgh
  • Open Repositories, Dublin
  • Jisc CNI conference, Oxford
  • Hydra Virtual Connect
  • TNA Digital Transformation day, Kew

...and our outreach work continues. Watch out for us at the Jisc Research Data Network event in Cambridge next week, the next UK Archivematica meeting in Lancaster the week after, the iPRES conference and Hydra Connect in October and of course the final Jisc Research Data Spring showcase event which will be later on in October.

And of course we have been blogging as usual throughout this phase of the project so do read back to see our previous posts for more information and watch out for our phase 3 final report in mid-October.



* we formally complete the project work on 14th Sept and will focus on writing up our final report over the following month




Jenny Mitcham, Digital Archivist

Monday, 4 July 2016

Modelling Research Data with PCDM

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie has been working on York's implementation for the 'Filling the Digital Preservation Gap' project. This post discusses preliminary work to define a data model for 'datasets'.

For Phase three of our 'Filling the Digital Preservation Gap', I've been working on implementing a prototype to illustrate how PURE and Archivematica can be used as part of a Research Data management lifecycle. Our technology stack at York is Hydra and Fedora 4 but it's an important aspect of the project to ensure that the thinking behind the prototype is applicable to other stacks. Central to adding any kind of data to a research information system, repository or preservation tool is the data model that underpins the metadata. For this I've been making use of the Portland Common Data Model (PCDM) and it's various extensions (particularly Works).

In the past couple of years there has been a lot of work happening around PCDM, described as "a flexible, extensible domain model that is intended to underlie a wide array of repository and DAMS applications". PCDM provides a small Models ontology of classes and properties, with extension ontologies for Works and Use, among others. I like PCDM because it is high level enough to provide a language to talk across different domains and use cases.


Datasets data model version 1

My first attempt at a data model for datasets based on PCDM can be seen below.



Datasets Data Model v1

Starting with the Dataset object above, for York this is equivalent to a dataset record in our PURE research information system. The only metadata expected at this level is about the dataset as a whole, and for us, will largely come from PURE.

Below this, you'll see that I've begun to model in some OAIS constructs: the Archival Information Package (AIP) and Dissemination Information Package (DIP). The AIP is the deposit of data, prepared for preservation, a single package of data "consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS." (DCC Glossary). The DIP is a representation of the Content Information in the AIP, produced for dissemination. OAIS, by the way, gives no standard approach for structuring the AIP or DIP.

As it stands, this model does not consider how to 'unpack' the AIP at all and for our prototype - we are (or were - see below) intending to simply point to the AIP in Archivematica.

The DIP as illustrated above is based on what Archivematica generates. The diagram includes the processing configuration and METS files that Archivematica produces by default as filesets, for illustration. These aren't part of the dataset as deposited, hence not making them 'Works' in their own right.

GenericWork is intended for each Unit or 'Work' in the dataset as deposited, or representation thereof. GenericWork is intended for use with any kind of data object. it might be independently re-used, eg. as a member of another dataset. In most cases for research data we probably won't know much about what the data is and so GenericWork will be used, but sometimes it may make sense to use existing models. For example, if the dataset is a collection of images then a more tailored Image model could be used for each Image, or if a dataset includes some existing objects that are already in our repository, those might already have different models. They can still be members of our dataset.

The model is intended to allow for a Dataset to include multiple AIPs and thus I have suggested a local predicate for hasDIP / hasAIP to establish the relationship between the AIP and the DIP.


The trouble with DIPs : an alternative model

Discussing this model with Justin Simpson from Artefactual Systems recently, I got to thinking that the DIP is a really artificial construct, effectively a presentation 'view' on the data and not really separate 'Work' in it's own right.  Archivematica's DIP, at present, provides only files, and doesn't reflect the original folder structure, which may well be meaningful and necessary for a dataset. Perhaps what we really need is the AIP, described fully in Fedora, leaving presentation of the DIP to the interface level? A set of rules for producing a DIP to our own local specification might go like this (using the PCDM Use ontology): if there is a Preservation Master file, produce a Service File for user access, otherwise present the Original File to the user.

The new model would look something like this:


Datasets Data Model alternative

The DIP would be a view constructed from elements of the AIP:




This model feels like a better approach than that in version 1 as it facilitates describing the 'whole' dataset. I do have some immediate questions about the model, though:
  • Is a dataset really a pcdm:Collection, rather than a Work?
  • If the GenericWork is each data file irrespective of whether it can be used/understood on it's own, how useful is that in reality? Is the GenericWork really needed, or are FileSets enough? Is there genuinely value is identifying each individual piece of data as a 'Work'? (re-use outside of the dataset, for example)
And when thinking beyond the model, about how this would actually work for different use cases, implementations questions start to surface. 


Beyond the model

1) Dataset size and structure

Datasets may contain thousands, millions even, of files structured into folders where folders may impart meaning to the data, or be purely arbitrary. Fedora 4 can, by design, handle a folder structure using it's implementation of LDP Basic Containers. As illustrated below, each folder is a 'Basic Container' and each data file is a Work, with FileSet and File objects.
  • AIP ldp:contains folder1
    • folder1 ldp:contains folder2
      • folder2 ldp:contains folder3
        • folder3 ldp:contains GenericWork
          • GenericWork pcdm:hasMember FileSet 
But if each of those folders are objects in Fedora and each file is in Fedora with a Work, FileSet and several File objects, then the number of Fedora objects begins to rise exponentially
  • Would it be better to avoid object-cost to Fedora of creating many many objects? 
  • What alternative approaches could we take? Use BagIt and have Fedora reference only the 'bag'? Store only data files and create an 'order' to represent the folder structure (as outlined in PCDM 2.0)?

2) Storing Files

'Where should data actually be stored permanently?' is another practical question I've been thinking about. On the one hand, Archivematica makes the AIP and it's files available via URLs in the Storage Service and stores the file in a location of your choosing. On the other, Fedora can contain data files, or reference them via a URI. This gives gives us the flexibility to do several things:
  1. Leave the AIP and DIP in Archivematica's stores and use URIs in Fedora to reference all of the files and build a PCDM-modelled view of the data (Archivematica as preservation and access store).
  2. Manage all files in Fedora, treating the Archivematica copy of the data as temporary (Archivematica as sausage factory).
  3. Have two copies of AIP data, one in Archivematica and one in Fedora (LOCKSS model).
  4. Manage Preservation files in Archivematica and delivery/access files in Fedora (Archivematica as preservation and Fedora as access).
Keeping data in Archivematica makes is easy to do additional preservation actions in future, such as re-ingesting when format policy rules change, whereas managing all files within Fedora unlocks the possibilities of Fedora's audit functionality, fixity checking and versioning. Having two copies is attractive as a preservation strategy, but could be difficult to justify and sustain if data collections grow to a significant size. 

On balance I think option (4) is best for the short-term with other options worth re-considering as both Archivematica and Fedora mature. But I'd be really keen to hear different views.

Conclusions

Hopefully this post illustrates that creating an outline data model is pretty easy, but when it comes to thinking about it in terms of implementation decisions, all kinds of ifs and buts start coming up.

In the model above, each data file is a Work. Each work contains one or more FileSets, and each FileSet contains one or more different representations of the file. 

Is it really possible to define a general datasets model that could encompass data from across disciplines, of various sizes, structures and created for a variety of purposes? Data that might be independently re-usable (a series of oral history interviews) or might only be understandable in combination with other files (a database schema document, for example)?

This is very much a work in progress, and I'd really welcome feedback from others who have done allied work or anyone who has suggestions and comments on the approaches and issues outlined above.








Jenny Mitcham, Digital Archivist

Friday, 4 November 2016

From old York to New York: PASIG 2016

My walk to the conference on the first day
Last week I was lucky enough to attend PASIG 2016 (Preservation and Archiving Special Interest Group) at the Museum of Modern Art in New York. A big thanks to Jisc who generously funded my conference fee and travel expenses. This was the first time I have attended PASIG but I had heard excellent reports from previous conferences and knew I would be in for a treat.

On the conference website PASIG is described as "a place to learn from each other's practical experiences, success stories, and challenges in practising digital preservation." This sounded right up my street and I was not disappointed. The practical focus proved to be a real strength.

The conference was three days long and I took pages of notes (and lots of photographs!). As always, it would be impossible to cover everything in one blog post so here is a round up of some of my highlights. Apologies to all of those speakers who I haven't mentioned.

Bootcamp!
 The first day was Bootcamp - all about finding your feet and getting started with digital preservation. However, this session had value not just for beginners but for those of us who have been working in this area for some time. There are always new things to learn in this field and a sometimes a benefit in being walked through some of the basics.

The highlight of the first day for me was an excellent talk by Bert Lyons from AVPreserve called "The Anatomy of Digital Files". This talk was a bit of a whirlwind (I couldn't type my notes fast enough) but it was so informative and hugely valuable. Bert talked us through the binary and hexadecimal notation systems and how they relate to content within a file. This information backed up some of the things I had learnt when investigating how file format signatures are created and really should be essential learning for all digital archivists. If we don't really understand what digital files are made up of then it is hard to preserve them.

Bert also went on to talk about the file system information - which is additional to the bytes within the file - and how crucial it is to also preserve this information alongside the file itself. If you want to know more, there is a great blog post by Bert that I read earlier this year - What is the chemistry of digital preservation?. It includes a comparison about the need to understand the materials you are working with whether you are working in physical conservation or digital preservation. One of the best blog posts I've read this year so pleased to get the chance to shout about it here!

Hands up if you love ISO 16363!
Kara Van Malssen, also from AVPreserve gave another good presentation called "How I learned to stop worrying and love ISO16363". Although specifically intended for formal certification, she talked about its value outside the certification process - for self assessment, to identify gaps and to prioritise further work. She concluded by saying that ISO16363 is one of the most valuable digital preservation tools we have.

Jon Tilbury from Preservica gave a thought provoking talk entitled "Preservation Architectures - Now and in the Future". He talked about how tool provision has evolved, from individual tools (like PRONOM and DROID) to integrated tools designed for an institution, to out of the box solutions. He suggested that the fourth age of digital preservation will be embedded tools - with digital preservation being seamless and invisible and very much business as usual. This will take digital preservation from the libraries and archives sector to the business world. Users will be expecting systems to be intuitive and highly automated - they won't want to think in OAIS terms. He went on to suggest that the fifth age will be when every day consumers (specifically his mum!) are using the tools without even thinking about it! This is a great vision - I wonder how long it will take us to get there?

Erin O'Meara from University of Arizona Libraries gave an interesting talk entitled "Digital Storage: Choose your own adventure". She discussed how we select suitable preservation storage and how we can get a seat at the table for storage discussions and decisions within our institutions. She suggested that often we are just getting what we are given rather than what we actually need. She referenced the excellent NDSA Levels of Digital Preservation which are a good starting point when trying to articulate preservation storage needs (and one which I have used myself). Further discussions on Twitter following on from this presentation highlighted the work on preservation storage requirements being carried out as a result of a workshop at iPRES 2016, so this is well worth following up on.

A talk from Amy Rushing and Julianna Barrera-Gomez from the University of Texas at San Antonio entitled "Jumping in and Staying Afloat: Creating Digital Preservation Capacity as a Balancing Act" really highlighted for me one of the key messages that has come out of our recent project work for Filling the Digital Preservation Gap. This is that, choosing a digital preservation system is relatively easy but actually deciding how to use it is the harder! After ArchivesDirect (a combination of Archivematica and DuraSpace) was selected as their preservation system (which included 6TB of storage), Amy and Julianna had a lot of decisions to make in order to balance the needs of their collections with the available resources. It was a really interesting case study and valuable to hear how they approached the problem and prioritised their collections.

The Museum of Modern Art in New York
Andrew French from Ex Libris Solutions gave an interesting insight into a more open future for their digital preservation system Rosetta. He pointed out that institutions when selected digital preservation systems focus on best practice and what is known. They tend to have key requirements relating to known standards such as OAIS, Dublin Core, PREMIS and METS as well as a need for automated workflows and a scalable infrastructure. However, once they start using the tool, they find they want other things too - they want to plug in different tools that suit their own needs.

In order to meet these needs, Rosetta is moving towards greater openness, enabling institutions to swap out any of the tools for ingest, preservation, deposit or publication. This flexibility allows the system to be better suited for a greater range of use cases. They are also being more open with their documentation and this is a very encouraging sign. The Rosetta Developer Network documentation is open to all and includes information, case studies and workflows from Rosetta users that help describe how Rosetta can be used in practice. We can all learn a lot from other people even if we are not using the same DP system so this kind of sharing is really great to see.

MOMA in the rain on day 2!
Day two of PASIG was a practitioners knowledge exchange. The morning sessions around reproducibility of research were of particular interest to me given my work on research data preservation and it was great to see two of the presentations referencing the work of the Filling the Digital Preservation Gap project. I'm really pleased to see our work has been of interest to others working in this area.

One of the most valuable talks of the day for me was from Fernando Chirigati from New York University. He introduced us to a useful new tool called ReproZip. He made the point that the computational environment is as important as the data itself for the reproducibility of research data. This could include information about libraries used, environment variables and options. You can not expect your depositors to find or document all of the dependencies (or your future users to install them). What ReproZip does is package up all the necessary dependencies along with the data itself. This package can then be archived and re-used in the future. ReproZip can also be used to unpack and re-use the data in the future. I can see a very real use case for this for researchers within our institution.

Another engaging talk from Joanna Phillips from the Guggenheim Museum and and Deena Engel of New York University described a really productive collaboration between the two institutions. Computer Science students from NYU have been working closely with the time-based media conservator at the museum on the digital artworks in their care. This symbiotic relationship enables the students to earn credit towards their academic studies whilst the museum receives valuable help towards understanding and preserving some of their complex digital objects. Work that the students carry out includes source code analysis and the creation of full documentation of the code so that is can be understood by others. Some also engage with the unique preservation challenges within the artwork, considering how it could be migrated or exhibited again. It was clear from the speakers that both institutions get a huge amount of benefit from this collaboration. A great case study!

Karen Cariani from WGBH Educational Foundation talked about their work (with Indiana University Libraries) to build HydraDAM2. This presentation was of real interest to me given our recent Filling the Digital Preservation Gap project in which we introduced digital preservation functionality to Hydra by integrating it with Archivematica.  HydraDAM2 was a different approach, building a preservation head for audio-visual material within Hydra itself. Interesting to see a contrasting solution and to note the commonalities between their project and ours (particularly around the data modelling work and difficulties recruiting skilled developers).

More rain at the end of day 2
Ben Fino Radin from the Museum of Modern Art in "More Data, More Problems: Designing Efficient Workflows at Petabyte Scale" highlighted the challenges of digitising their time-based media holdings and shared some calculations around how much digital storage space would be required if they were to digitise all of their analogue holdings. This again really highlighted some big issues and questions around digital preservation. When working with large collections, organisations need to prioritise and compromise and these decisions can not be taken lightly. This theme was picked up again on day 3 in the session around environmental sustainability.

The lightning talks on the afternoon of the second day were also of interest. Great to hear from such a range of practitioners.... though I did feel guilty that I didn't volunteer to give one myself! Next time!

On the morning of day 3 we were treated to an excellent presentation by Dragan Espenschied from Rhizome who showed us Webrecorder. Webrecorder is a new open source tool for creating web archives. It uses a single system both for initial capture and subsequent access. One of its many strengths appears to be the ability to capture dynamic websites as you browse them and it looks like it will be particularly useful for websites that are also digital artworks. This is definitely one to watch!

MOMA again!
Also on day 3 was a really interesting session on environmental responsibility and sustainability. This was one of the reasons that PASIG made me think...this is not the sort of stuff we normally talk about so it was really refreshing to see a whole session dedicated to it.

Eira Tansey from the University of Cincinnati gave a very thought provoking talk with a key question for us to think about - why do we continue to buy more storage rather than appraise? This is particularly important considering the environmental costs of continuing to store more and more data of unknown value.

Ben Goldman of Penn State University also picked up this theme, looking at the carbon footprint of digital preservation. He pointed out the paradox in the fact we are preserving data for future generations but we are powering this work with fossil fuels. Is preserving the environment not going to be more important to future generations than our digital data? He suggested that we consider the long term impacts of our decision making and look at our own professional assumptions. Are there things that we do currently that we could do with less impact? Are we saving too many copies of things? Are we running too many integrity checks? Is capturing a full disk image wasteful? He ended his talk by suggesting that we should engage in a debate about the impacts of what we do.

Amelia Acker from the University of Texas at Austin presented another interesting perspective on digital preservation in mobile networks, asking how our collections will change as we move from an information society to a networked era and how mobile phones change the ways we read, write and create the cultural record. The atomic level of the file is no longer there on mobile devices.  Most people don't really know where the actual data is on their phones or tablets, they can't show you the file structure. Data is typically tied up with an app and stored in the cloud and apps come and go rapidly. There are obvious preservation challenges here! She also mentioned the concept of the legacy contact on Facebook...something which had passed me by, but which will be of interest to many of us who care about our own personal digital legacy.

Yes, there really is steam coming out of the pavements in NYC
The stand out presentation of the conference for me was "Invisible Defaults and Percieved Limitations: Processing the Juan Gelman Files" from Eliva Arroyo-Ramirez from Princeton University. She described the archive of Juan Gelman, an Argentinian poet and human rights activist. Much of the archive was received on floppy disks and included documents relating to his human rights work and campaigns for the return of his missing son and daughter-in-law. The area she focused on within her talk was about how we preserve files with accented characters in the file names.

Diacritics can cause problems when trying to open the files or use our preservation tools (for example Bagger). When she encountered problems like these she put a question out to the digital preservation community asking how to solve the problem and she was grateful to receive so many responses but at the same time was concerned about the language used. It was suggested that she 'scrub', 'clean' or 'detox' the file names in order to remove the 'illegal characters' but she was concerned that our attitudes towards accented characters further marginalises those who do not fit into our western ideals.

She also explored how removing or replacing these accented characters would impact on the files themselves and it was clear that meaning would change significantly.  'Campaign' (a word included in so many of the filenames) would change to 'bell'. She decided not to change the file names but to try and find a work around and she was eventually successful in finding a way to keep the filenames as they were (using the command line to turn the latin characters to UTF8). The message that she ended on was that we as archivists should do no harm whether we are dealing with physical or digital archives. We must juggle our priorities but think hard about where we compromise and what is important to preserve. It is possible to work through problems rather than work around them and we need to be conscious of the needs of collections that fall outside our defaults. This was real food for thought and prompted an interesting conversation on twitter afterwards.

Times Square selfie!
Not only did I have a fantastic week in New York (its not every day you can pop out in your lunch break to take a selfie in Times Square!), but I also came away with lots to think about. PASIG is a bit closer to home next year (in Oxford) so I am hoping I'll be there!







Jenny Mitcham, Digital Archivist

The sustainability of a digital preservation blog...

So this is a topic pretty close to home for me. Oh the irony of spending much of the last couple of months fretting about the future prese...