Digital Archiving at the University of York: From old York to New York: PASIG 2016

My walk to the conference on the first day

Last week I was lucky enough to attend PASIG 2016 (Preservation and Archiving Special Interest Group) at the Museum of Modern Art in New York. A big thanks to Jisc who generously funded my conference fee and travel expenses. This was the first time I have attended PASIG but I had heard excellent reports from previous conferences and knew I would be in for a treat.

On the conference website PASIG is described as "a place to learn from each other's practical experiences, success stories, and challenges in practising digital preservation." This sounded right up my street and I was not disappointed. The practical focus proved to be a real strength.

The conference was three days long and I took pages of notes (and lots of photographs!). As always, it would be impossible to cover everything in one blog post so here is a round up of some of my highlights. Apologies to all of those speakers who I haven't mentioned.

Bootcamp!

The first day was Bootcamp - all about finding your feet and getting started with digital preservation. However, this session had value not just for beginners but for those of us who have been working in this area for some time. There are always new things to learn in this field and sometimes a benefit in being walked through some of the basics.

The highlight of the first day for me was an excellent talk by Bert Lyons from AVPreserve called "The Anatomy of Digital Files". This talk was a bit of a whirlwind (I couldn't type my notes fast enough) but it was so informative and hugely valuable. Bert talked us through the binary and hexadecimal notation systems and how they relate to content within a file. This information backed up some of the things I had learnt when investigating how file format signatures are created and really should be essential learning for all digital archivists. If we don't really understand what digital files are made up of then it is hard to preserve them.
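To make the link between binary, hexadecimal and file format signatures a little more concrete, here is a minimal Python sketch. It is my own illustration rather than anything shown in Bert's talk: the three signatures are a small hand-picked sample, and the function names `describe_bytes` and `guess_format` are made up for this example. Real identification tools such as DROID work from the much fuller PRONOM registry.

```python
# Illustrative only: a tiny hand-picked sample of well-known signatures,
# not the PRONOM registry that identification tools rely on.
SIGNATURES = {
    "PDF": b"%PDF",
    "PNG": b"\x89PNG\r\n\x1a\n",
    "ZIP": b"PK\x03\x04",
}


def describe_bytes(data):
    """Return (binary, hexadecimal, decimal) for each byte in data."""
    return [(format(b, "08b"), format(b, "02X"), b) for b in data]


def guess_format(data):
    """Return the name of the first signature the data starts with, or None."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


# The first bytes of a hypothetical PDF file.
header = b"%PDF-1.4"
for bits, hexa, dec in describe_bytes(header[:4]):
    print(bits, hexa, dec, chr(dec))
print(guess_format(header))
```

Each row printed is the same byte written three ways, followed by the character it represents. The signature check is nothing more than comparing those leading bytes against a known pattern.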

Bert also went on to talk about the file system information - which is additional to the bytes within the file - and how crucial it is to also preserve this information alongside the file itself. If you want to know more, there is a great blog post by Bert that I read earlier this year, "What is the chemistry of digital preservation?". It compares physical conservation with digital preservation and argues that in both cases you need to understand the materials you are working with. It is one of the best blog posts I've read this year, so I'm pleased to get the chance to shout about it here!
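As a rough illustration of what "file system information" means in practice, the sketch below records a few properties that the file system holds about a file but that are not part of the file's own bytes. It is an assumption-laden example of mine, not a method from the talk: `filesystem_record` is a hypothetical helper, and the fields chosen are just a sample of what `os.stat` exposes.

```python
import os
import tempfile
from datetime import datetime, timezone


def filesystem_record(path):
    """Collect a few facts the file system holds about a file.

    None of these values are stored inside the file's own bytes,
    so copying the content alone would lose them.
    """
    st = os.stat(path)
    modified = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": st.st_size,
        "modified_utc": modified.isoformat(),
        "mode_octal": oct(st.st_mode & 0o777),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as fh:
        fh.write(b"hello")
    record = filesystem_record(path)
    print(record)
```

The file here contains only the five bytes of "hello". Its name, last modified date and permissions all live outside the content, which is why they need to be captured separately.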

Hands up if you love ISO 16363!

Kara Van Malssen, also from AVPreserve, gave another good presentation called "How I learned to stop worrying and love ISO16363". Although the standard is specifically intended for formal certification, she talked about its value outside the certification process - for self assessment, to identify gaps and to prioritise further work. She concluded by saying that ISO 16363 is one of the most valuable digital preservation tools we have.

Jon Tilbury from Preservica gave a thought-provoking talk entitled "Preservation Architectures - Now and in the Future". He talked about how tool provision has evolved, from individual tools (like PRONOM and DROID) to integrated tools designed for an institution, to out of the box solutions. He suggested that the fourth age of digital preservation will be embedded tools - with digital preservation being seamless and invisible and very much business as usual. This will take digital preservation from the libraries and archives sector to the business world. Users will be expecting systems to be intuitive and highly automated - they won't want to think in OAIS terms. He went on to suggest that the fifth age will be when everyday consumers (specifically his mum!) are using the tools without even thinking about it! This is a great vision - I wonder how long it will take us to get there?

Erin O'Meara from University of Arizona Libraries gave an interesting talk entitled "Digital Storage: Choose your own adventure". She discussed how we select suitable preservation storage and how we can get a seat at the table for storage discussions and decisions within our institutions. She suggested that often we are just getting what we are given rather than what we actually need. She referenced the excellent NDSA Levels of Digital Preservation which are a good starting point when trying to articulate preservation storage needs (and one which I have used myself). Further discussions on Twitter following on from this presentation highlighted the work on preservation storage requirements being carried out as a result of a workshop at iPRES 2016, so this is well worth following up on.

A talk from Amy Rushing and Julianna Barrera-Gomez from the University of Texas at San Antonio entitled "Jumping in and Staying Afloat: Creating Digital Preservation Capacity as a Balancing Act" really highlighted for me one of the key messages that has come out of our recent project work for Filling the Digital Preservation Gap. This is that choosing a digital preservation system is relatively easy, but actually deciding how to use it is much harder! After ArchivesDirect (a combination of Archivematica and DuraSpace) was selected as their preservation system (which included 6TB of storage), Amy and Julianna had a lot of decisions to make in order to balance the needs of their collections with the available resources. It was a really interesting case study and valuable to hear how they approached the problem and prioritised their collections.

The Museum of Modern Art in New York

Andrew French from Ex Libris Solutions gave an interesting insight into a more open future for their digital preservation system Rosetta. He pointed out that institutions, when selecting digital preservation systems, focus on best practice and what is known. They tend to have key requirements relating to known standards such as OAIS, Dublin Core, PREMIS and METS as well as a need for automated workflows and a scalable infrastructure. However, once they start using the tool, they find they want other things too - they want to plug in different tools that suit their own needs.

In order to meet these needs, Rosetta is moving towards greater openness, enabling institutions to swap out any of the tools for ingest, preservation, deposit or publication. This flexibility allows the system to be better suited for a greater range of use cases. They are also being more open with their documentation and this is a very encouraging sign. The Rosetta Developer Network documentation is open to all and includes information, case studies and workflows from Rosetta users that help describe how Rosetta can be used in practice. We can all learn a lot from other people even if we are not using the same DP system so this kind of sharing is really great to see.

MOMA in the rain on day 2!

Day two of PASIG was a practitioners' knowledge exchange. The morning sessions around reproducibility of research were of particular interest to me given my work on research data preservation and it was great to see two of the presentations referencing the work of the Filling the Digital Preservation Gap project. I'm really pleased to see our work has been of interest to others working in this area.

One of the most valuable talks of the day for me was from Fernando Chirigati from New York University. He introduced us to a useful new tool called ReproZip. He made the point that the computational environment is as important as the data itself for the reproducibility of research data. This could include information about libraries used, environment variables and options. You cannot expect your depositors to find or document all of the dependencies (or your future users to install them). What ReproZip does is package up all the necessary dependencies along with the data itself. This package can then be archived, and ReproZip can later be used to unpack and re-use it. I can see a very real use case for this for researchers within our institution.
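To give a feel for the kind of environment information Fernando was talking about (libraries used, environment variables and options), here is a small hand-rolled Python sketch. To be clear, this is not ReproZip and does not use it: it packs nothing and only lists what is visible to the running Python interpreter. The function name `environment_snapshot` and the choice of `PATH` and `LANG` as example variables are my own assumptions.

```python
import json
import os
import platform
import sys
from importlib import metadata


def environment_snapshot(env_vars=("PATH", "LANG")):
    """Describe the current Python environment as a plain dictionary.

    env_vars is a hypothetical selection of environment variables to
    record; variables that are not set are stored as None.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": packages,
        "environment": {var: os.environ.get(var) for var in env_vars},
    }


snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Even this simple listing shows how much a depositor would have to write down by hand, and it still only covers one language's packages. That gap is exactly why a tool that captures dependencies automatically is so appealing.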

Another engaging talk from Joanna Phillips from the Guggenheim Museum and Deena Engel of New York University described a really productive collaboration between the two institutions. Computer Science students from NYU have been working closely with the time-based media conservator at the museum on the digital artworks in their care. This symbiotic relationship enables the students to earn credit towards their academic studies whilst the museum receives valuable help towards understanding and preserving some of their complex digital objects. Work that the students carry out includes source code analysis and the creation of full documentation of the code so that it can be understood by others. Some also engage with the unique preservation challenges within the artwork, considering how it could be migrated or exhibited again. It was clear from the speakers that both institutions get a huge amount of benefit from this collaboration. A great case study!

Karen Cariani from WGBH Educational Foundation talked about their work (with Indiana University Libraries) to build HydraDAM2. This presentation was of real interest to me given our recent Filling the Digital Preservation Gap project in which we introduced digital preservation functionality to Hydra by integrating it with Archivematica. HydraDAM2 was a different approach, building a preservation head for audio-visual material within Hydra itself. Interesting to see a contrasting solution and to note the commonalities between their project and ours (particularly around the data modelling work and difficulties recruiting skilled developers).

More rain at the end of day 2

Ben Fino-Radin from the Museum of Modern Art in "More Data, More Problems: Designing Efficient Workflows at Petabyte Scale" highlighted the challenges of digitising their time-based media holdings and shared some calculations around how much digital storage space would be required if they were to digitise all of their analogue holdings. This again really highlighted some big issues and questions around digital preservation. When working with large collections, organisations need to prioritise and compromise and these decisions cannot be taken lightly. This theme was picked up again on day 3 in the session around environmental sustainability.

The lightning talks on the afternoon of the second day were also of interest. Great to hear from such a range of practitioners, though I did feel guilty that I didn't volunteer to give one myself! Next time!

On the morning of day 3 we were treated to an excellent presentation by Dragan Espenschied from Rhizome who showed us Webrecorder. Webrecorder is a new open source tool for creating web archives. It uses a single system both for initial capture and subsequent access. One of its many strengths appears to be the ability to capture dynamic websites as you browse them and it looks like it will be particularly useful for websites that are also digital artworks. This is definitely one to watch!

MOMA again!

Also on day 3 was a really interesting session on environmental responsibility and sustainability. This session was one of the reasons that PASIG made me think. This is not the sort of stuff we normally talk about, so it was really refreshing to see a whole session dedicated to it.

Eira Tansey from the University of Cincinnati gave a very thought-provoking talk with a key question for us to think about - why do we continue to buy more storage rather than appraise? This is particularly important considering the environmental costs of continuing to store more and more data of unknown value.

Ben Goldman of Penn State University also picked up this theme, looking at the carbon footprint of digital preservation. He pointed out the paradox in the fact we are preserving data for future generations but we are powering this work with fossil fuels. Is preserving the environment not going to be more important to future generations than our digital data? He suggested that we consider the long term impacts of our decision making and look at our own professional assumptions. Are there things that we do currently that we could do with less impact? Are we saving too many copies of things? Are we running too many integrity checks? Is capturing a full disk image wasteful? He ended his talk by suggesting that we should engage in a debate about the impacts of what we do.

Amelia Acker from the University of Texas at Austin presented another interesting perspective on digital preservation in mobile networks, asking how our collections will change as we move from an information society to a networked era and how mobile phones change the ways we read, write and create the cultural record. The atomic level of the file is no longer there on mobile devices. Most people don't really know where the actual data is on their phones or tablets and can't show you the file structure. Data is typically tied up with an app and stored in the cloud, and apps come and go rapidly. There are obvious preservation challenges here! She also mentioned the concept of the legacy contact on Facebook, something which had passed me by, but which will be of interest to many of us who care about our own personal digital legacy.

Yes, there really is steam coming out of the pavements in NYC

The stand out presentation of the conference for me was "Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files" from Elvia Arroyo-Ramirez from Princeton University. She described the archive of Juan Gelman, an Argentinian poet and human rights activist. Much of the archive was received on floppy disks and included documents relating to his human rights work and campaigns for the return of his missing son and daughter-in-law. The area she focused on within her talk was how we preserve files with accented characters in the file names.

Diacritics can cause problems when trying to open the files or use our preservation tools (for example Bagger). When she encountered problems like these she put a question out to the digital preservation community asking how to solve the problem. She was grateful to receive so many responses but at the same time was concerned about the language used. It was suggested that she 'scrub', 'clean' or 'detox' the file names in order to remove the 'illegal characters' but she was concerned that our attitudes towards accented characters further marginalise those who do not fit into our western ideals.

She also explored how removing or replacing these accented characters would impact on the files themselves and it was clear that meaning would change significantly. 'Campaign' (a word included in so many of the filenames) would change to 'bell'. She decided not to change the file names but to look for another solution, and she eventually found a way to keep the filenames as they were (using the command line to turn the Latin characters to UTF-8). The message that she ended on was that we as archivists should do no harm whether we are dealing with physical or digital archives. We must juggle our priorities but think hard about where we compromise and what is important to preserve. It is possible to work through problems rather than work around them and we need to be conscious of the needs of collections that fall outside our defaults. This was real food for thought and prompted an interesting conversation on Twitter afterwards.
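The 'campaign' to 'bell' example is easy to reproduce: in Spanish, "campaña" means campaign and "campana" means bell. The Python sketch below is my own illustration and not the command that was used at Princeton, which was not described in detail. The filename `campaña.doc` is hypothetical, and the sketch assumes the legacy names were stored in Latin-1 (ISO 8859-1), which may not match the real disks.

```python
import unicodedata


def strip_diacritics(name):
    """The 'scrub' approach: decompose characters and drop the accents."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def redecode_latin1_as_utf8(raw_name):
    """Re-encode a filename from Latin-1 bytes to UTF-8 bytes.

    Assumes the original bytes really are Latin-1; every character
    is kept, only its byte representation changes.
    """
    return raw_name.decode("latin-1").encode("utf-8")


original = "campaña.doc"  # hypothetical filename

# Scrubbing changes the word itself: 'campaign' becomes 'bell'.
print(strip_diacritics(original))

# Re-encoding keeps the name intact.
legacy_bytes = original.encode("latin-1")
utf8_bytes = redecode_latin1_as_utf8(legacy_bytes)
print(utf8_bytes.decode("utf-8") == original)
```

The first approach changes what the filename says. The second changes only how the same characters are stored, which is much closer to the 'do no harm' principle.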

Times Square selfie!

Not only did I have a fantastic week in New York (it's not every day you can pop out in your lunch break to take a selfie in Times Square!), but I also came away with lots to think about. PASIG is a bit closer to home next year (in Oxford) so I am hoping I'll be there!

Jenny Mitcham, Digital Archivist

4 comments:

Jaana Pinnick, 7 November 2016 at 09:55
Thank you for taking the time to summarize your experience from PASIG. For those of us not being able to attend, it's the next best thing and gives a great overview of what is being discussed. Especially the bits about understanding what you are trying to preserve are of interest, both in terms of files and the materials themselves. I also finished reading the Filling the Digital Preservation Gap report trilogy from York/Hull last week (would have read it earlier but was busy with my MSc diss on DP...) which perfectly captures the complexity of what we're trying to achieve in terms of preserving digital research data and has given us some ideas for the future.

Critical Steph, 10 November 2016 at 15:02
Thanks so much for this, Jen. I wasn't able to attend, and trying to keep up on Twitter didn't work out very well. Really appreciate the time you've taken to write this up. So many useful links and projects in one place :)

Hannah Silverman, 11 November 2016 at 11:25
Thank you for writing up and sharing your takeaways with us. I found the information very helpful!


Friday, 4 November 2016
