Digital Archiving at the University of York: How can we preserve our wiki pages

I was recently prompted by a colleague to investigate options for preserving institutional wiki pages. At the University of York we use the Confluence wiki and this is available for all staff to use for a variety of purposes. In the Archives we have our own wiki space on Confluence which we use primarily for our meeting agendas and minutes. The question asked of me was how can we best capture content on the wiki that needs to be preserved for the long term?

Good question and just the sort of thing I like to investigate. Here are my findings...

Space export

The most sensible way to approach the transfer of a set of wiki pages to the digital archive would be to export them using the export options available within the Space Tools.

The main problem with this approach is that a user will need to have the necessary permissions on the wiki space in order to be able to use these tools. I found that I only had the necessary permissions on those wiki spaces that I administer myself.

There are three export options as illustrated below:

Space export options - available if you have the right permissions!

HTML

Once you select HTML, there are two options - a standard export (which exports the whole space) or a custom export (which allows you to select the pages you would like included within the export).

I went for a custom export and selected just one section of meeting papers. Each wiki page is saved as an HTML file. DROID identifies these as HTML version 5. All relevant attachments are included in the download in their original format.

There are some really good things about this export option:

The inclusion of attachments in the export - these are often going to be as valuable to us as the wiki page content itself. Note that they were all renamed with a number that tied them to the page that they were associated with. The original file name did, however, seem to be preserved in the text of the linking wiki page
The metadata at the top of a wiki page is present in the HTML pages, i.e. Created by Jenny Mitcham, last modified by Jenny Mitcham on 31, Oct, 2016 - this is really important to us from an archival point of view
The links work - including links to the downloaded attachments, other wiki pages and external websites or Google Docs
The export includes an index page which can act as a table of contents for the exported files - this also includes some basic metadata about the wiki space
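
If the creation and last modified metadata needs to be pulled out of the exported HTML files for a catalogue record, something along the following lines might work. This is only a sketch: it assumes the byline uses the wording quoted above ("Created by ..., last modified by ... on ..."), and the function name extract_byline is made up for illustration. The real export may word or mark up the byline differently.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text content of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Assumption: the byline follows the wording quoted in the post.
BYLINE = re.compile(
    r"Created by\s+(?P<creator>.+?)\s*,\s*"
    r"last modified by\s+(?P<modifier>.+?)\s+on\s+"
    r"(?P<date>[\w ,]+?\d{4})"
)


def extract_byline(html_text):
    """Return creator, modifier and date from a page byline, or None."""
    parser = _TextCollector()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace so the pattern can span several tags.
    text = " ".join(" ".join(parser.chunks).split())
    match = BYLINE.search(text)
    return match.groupdict() if match else None
```

Running this over every HTML file in the export and writing the results to a spreadsheet would give a simple page-level inventory to accompany the transfer.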

XML

Again, there are two options here - either a standard export (of the whole space) or a custom export, which allows you to select whether or not you want comments to be exported and choose exactly which pages you want to export.

I tried the custom export. It seemed to work and also did export all the relevant attachments. The attachments were all renamed as '1' (with no file extension), and the wiki page content is all bundled up into one huge XML file.

On the plus side, this export option may contain more metadata than the other options (for example the page history) but it is difficult to tell as the XML file is so big and unwieldy and hard to interpret. Really it isn't designed to be usable. The main function of this export option is to move wiki pages into another instance of Confluence.
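
One way to get a first impression of what a very large XML file contains, without opening it in an editor, is to stream through it and count the element names. The sketch below does this with Python's standard library. It makes no assumptions about the structure of the Confluence export, and the function name survey_xml is invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def survey_xml(source):
    """Count element names in an XML file without loading it all at once.

    `source` can be a file name or a file object opened in binary mode.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # The element is complete at this point, so its content can be
        # discarded to keep memory use down on very large files.
        elem.clear()
    return counts
```

Printing counts.most_common(20) shows which elements dominate the file. That may be enough to tell whether something like page history is likely to be in there, before investing time in reading the XML itself.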

PDF

Again you have the option to export whole space or choose your pages. There are also other configurations you can make to the output but these are mostly cosmetic.

I chose the same batch of meeting papers to export as PDF and this produced a 111-page PDF document. The first page is a contents page which lists all the other pages alphabetically, with hyperlinks to the right section of the document. The document is hard to use: the wiki pages seem to run into each other without adequate spacing and, because of the linear nature of a PDF document, you feel drawn to read it in the order it is presented (which in this case is not a logical order for the content). Attachments are not included in the download, though links to the attachments are maintained in the PDF file and continue to resolve to the right place on the wiki. Creation and last modified metadata is also not included in the export.

Single page export

As well as the Space Export options in Confluence there are also single page export options. These are available to anyone who can access the wiki page so may be useful if people do not have necessary permissions for a space export.

I exported a range of test pages using the 'Export to PDF' and 'Export to Word' options.

Export to PDF

The PDF files created in this manner are version 1.4. Sadly no option to export as PDF/A, but at least version 1.4 is closer to the PDF/A standard than some, so perhaps a subsequent migration to PDF/A would be successful.
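
For a quick check of the PDF version across a batch of exported files without running a full identification tool, the version can be read from the header at the start of each file. The following is a minimal sketch and the function name pdf_header_version is hypothetical. Note that it only reads the header: a PDF can declare a later version elsewhere in the file, so a tool such as DROID remains the more reliable check.

```python
import re


def pdf_header_version(path):
    """Return the version declared in a PDF header (e.g. '1.4'), or None."""
    with open(path, "rb") as handle:
        head = handle.read(1024)
    # The header normally sits at the very start of the file: %PDF-1.4
    match = re.search(rb"%PDF-(\d\.\d)", head)
    return match.group(1).decode("ascii") if match else None
```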

Export to Word

Surprisingly the 'Word' files produced by Confluence appear not to be Word files at all!

Double click on the files in Windows Explorer and they open in Microsoft Word no problem, but DROID identifies the files as HTML (with no version number) and reports a file extension mismatch (because the files have a .doc extension).

If you view the files in a text application you can clearly see the Content-Type marked as text/html and <html> tags within the document. Quick View Plus, however, views them as an Internet Mail Message with the following text displayed at the top of each page:

Subject: Exported From Confluence

1024x640 72 Print 90

All very confusing and certainly not giving me a lot of faith in this particular export format!
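
The mismatch described above can also be confirmed by looking at the first bytes of a file. Binary Word documents (.doc) are stored in an OLE2 container which starts with a fixed eight-byte signature, whereas the files described here contain mail-style headers and HTML. The sketch below is a rough classifier along those lines. The function name sniff_doc and the labels it returns are made up for this example, and it is no substitute for a proper identification tool.

```python
# Signature at the start of OLE2 files such as binary Word documents.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_doc(data):
    """Roughly classify the leading bytes of a file with a .doc extension."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    head = data[:4096].lower()
    # Mail-style headers suggest HTML wrapped up as a MIME message.
    if b"mime-version:" in head or b"content-type: text/html" in head:
        return "mime-html"
    if b"<html" in head:
        return "html"
    return "unknown"
```

To use it, read the start of a file in binary mode and pass the bytes in, for example sniff_doc(open(path, "rb").read(4096)).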

Comparison

Both of these single page export formats do a reasonable job of retaining the basic content of the wiki pages - both versions include many of the key features I was looking for - text, images, tables, bullet points, colours.

Where advanced formatting has been used to lay out a page using coloured boxes, the PDF version does a better job at replicating this than the 'Word' version. Whilst the PDF attempts to retain the original formatting, the 'Word' version displays the information in a much more linear fashion.

Links were also more usefully replicated in the PDF version. The absolute URL of all links, whether internal, external or to attachments, was included within the PDF file so that it is possible to follow them to their original location (if you have the necessary permissions to view the pages). In the 'Word' versions, only external links worked in this way. Internal wiki links and links to attachments were exported as relative links, which become 'broken' once the page is taken out of its original context.
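
Relative links of this kind can in principle be repaired after export, as long as the original address of the page is recorded. The sketch below shows the idea using Python's standard URL handling. The wiki address and link paths in the usage note are placeholders rather than real Confluence URLs, and the function names are invented for illustration.

```python
from urllib.parse import urljoin, urlparse


def is_relative(link):
    """True if a link has no scheme and no host of its own."""
    parts = urlparse(link)
    return not parts.scheme and not parts.netloc


def absolutise(links, base_url):
    """Resolve links against the address the page was exported from.

    Links that are already absolute are returned unchanged.
    """
    return [urljoin(base_url, link) for link in links]
```

For example, with a hypothetical page address of https://wiki.example.org/display/ARCH/Meetings, the relative link /download/attachments/123/minutes.pdf would resolve to https://wiki.example.org/download/attachments/123/minutes.pdf. The resolved links will of course only work for as long as the wiki itself is still available.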

The naming of the files that were produced is also worthy of comment. The 'Word' versions are given a name which mirrors the name of the page within the wiki space, but the naming of the PDF versions is much more useful: it includes the name of the wiki space itself, the page name, and a date and timestamp showing when the page was exported.

Neither of these single page export formats retained the creation and last modified metadata for each page and this is something that it would be very helpful to retain.

Conclusions

So, if we want to preserve pages from our institutional wiki, what is the best approach?

The Space Export in HTML format is a clear winner. It reproduces the wiki pages in a reusable form that replicates the page content well. As HTML is essentially just plain text it is also a good format for long term preservation.

What impressed me about the HTML export was the fact that it retained the content, included basic creation and last modified metadata for each page and downloaded all relevant attachments, updating the links to point to these local copies.

What if someone does not have the necessary permissions to do a space export? My first suggestion would be that they ask for their permissions to be upgraded. If not, perhaps someone who does have necessary permissions could carry out the export?

If all else fails, the export of a single page using the 'Export to PDF' option could be used to provide ad hoc content for the digital archive. PDF is not the best preservation format but may be able to be converted to PDF/A. Note that any attachments would have to be exported separately and manually if this option were selected.

Final thoughts

A wiki space is a dynamic thing which can involve several different types of content - blog posts, labels/tags and comments can all be added to wiki spaces and pages. If these elements are thought to be significant then more work is required to see how they can be captured. It was apparent that comments could be captured using the HTML and XML exports and I believe blog posts can be captured individually as PDF files.

What is also available within the wiki platform itself is a very detailed Page History. Within each wiki page it is possible to view the Page History and see how a page has evolved over time - who has edited it and when those edits occurred. As far as I could see, none of the export formats included this level of information. The only exception may be the XML export but this was so difficult to view that I could not be sure either way.

So, there are limitations to all these approaches and as ever this goes back to the age old discussion about Significant Properties. What is significant about the wiki pages? What is it that we are trying to preserve? None of the export options preserve everything. All are compromises, but perhaps some are compromises we could live with.

Jenny Mitcham, Digital Archivist

2 comments:

Ed Pinsent, 10 March 2017 at 16:26
Hi Jen

I agree, there still remains to be written a definitive view on significant properties of wikis. It is disappointing that these export methods which you tried apparently failed to capture the editing history. But maybe it also indicates that application designers have not always been informed by archival thinking...this was not an eventuality they thought of.

I used to do it by web archiving, using Heritrix for the UK Web Archive. See http://dart.blogs.ulcc.ac.uk/2009/03/10/working-with-web-curator-tool-part-2-wikis/ for a discussion along similar lines, particularly Mo Pennock's point about reasons for keeping the collaborative editing history (which I was trying to systematically exclude, because it created a messy and over-large web harvest).

Also http://dart.blogs.ulcc.ac.uk/2009/03/25/archiving-a-wiki/, where I speculated on how we might frame our thinking within a formal selection decision.

Ed


Friday, 10 March 2017
