Digital Archiving at the University of York: Some observations on digital appraisal

A couple of months ago I attended a Jisc sponsored workshop at the University of Westminster on digital appraisal. There were some useful presentations and discussions on a topic that I find both interesting and challenging.

Within the workshop I made the point that my approaches to some elements of digital appraisal may differ depending on the age of the born digital material I'm looking at.

For example, I may wish to take a firm line about removing modern system generated files such as Thumbs.db files and Apple Resource Forks that come into the archives - my reasons being that this is not really the content that the donor or depositor intended to give us, rather an artifact of the computer system that they were using.

However I also stated that for an older born digital archive I am much more reluctant to weed out system files or software.

It seems easy to weed out things that you recognise and understand - as is often the case with contemporary digital archives - but for older archives our lack of understanding of what we are seeing can make appraisal decisions much harder and the temptation is to keep things until we understand better what is useful.
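For the contemporary case, flagging well-known system-generated files is simple enough to script. The following is a minimal Python sketch rather than a description of our actual workflow. The function names are made up for illustration, and the names beyond Thumbs.db and the "._" prefix (used for Apple resource forks stored on non-Apple file systems) are assumptions that would need checking against each accession.

```python
from pathlib import PurePath

# Illustrative list only: Thumbs.db comes from the text above,
# the other names are assumptions and the list is not exhaustive.
SYSTEM_NAMES = {"thumbs.db", ".ds_store", "desktop.ini"}


def is_system_generated(path: str) -> bool:
    """Return True if the file name looks like an operating-system artefact."""
    name = PurePath(path).name
    # AppleDouble files ("._name") carry resource forks on non-Apple file systems.
    if name.startswith("._"):
        return True
    return name.lower() in SYSTEM_NAMES


def flag_candidates(paths):
    """Split a listing into (candidates for weeding, everything else)."""
    flagged = [p for p in paths if is_system_generated(p)]
    kept = [p for p in paths if not is_system_generated(p)]
    return flagged, kept
```

A script like this only works for files we already recognise, which is exactly why it is of little help with older material.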

I was thinking of a couple of examples we have here at the Borthwick Institute.

The archive of Trevor Wishart includes files dating back to 1985. Trevor Wishart specialises in electroacoustic composition, in particular the transformation of the human voice and natural sounds through technological means. He has also been involved in developing software for electronic composition. His digital archive is a great case study for us with interesting challenges around how we might be able to preserve and provide access to it.

Of course when I look at this archive there are numerous files that cannot be identified by DROID. It is not always immediately obvious which files are system files, and which are software. Certainly for the time being, there is no intention to appraise out any of the content until we understand it better.

Another good case study, and one I am actively working on right now, is the archive of Marks and Gran, a comedy screenwriting duo who have been writing together since the 1970s.

The digital element of this archive was deposited on a set of 5 1/4 inch floppy disks and includes files dating back to 1984.

When I carried out a first pass at the content of this archive to establish what it contained I encountered 100+ digital examples of screenplays, cast lists and plot outlines (in WordStar 4.0 format) and about 60 other files with various file extensions (.COM, .EXE, .BAT etc) that didn't appear to be created by Marks and Gran themselves.
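A first pass like this can be as simple as tallying file extensions across the recovered disks. The sketch below is in Python and is only an illustration: the function name is hypothetical, and extensions are a crude guide to what a file really is.

```python
from collections import Counter
from pathlib import PurePath


def tally_extensions(paths):
    """Count files per upper-cased extension; files with no extension fall under ''."""
    return Counter(PurePath(p).suffix.upper() for p in paths)
```

Sorting the result with `most_common()` gives a quick view of which extensions dominate and which are one-off oddities worth a closer look.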

Software and other system files were clearly present on these disks and this was also evidenced by the disk labels.

But do we want to keep this...are we even allowed to keep it? How can we preserve it effectively if we don't know what it is? Are we allowed to provide access to this material? If not, then what is the point of keeping it at all?

Given that rescuing this archive from the 5 1/4 inch floppy disks in the first place was quite a task and the fact that the size of the digital archive was small, it didn't seem right to appraise anything out until our knowledge and understanding increased.

As I spend a bit more time working with the Marks and Gran digital archive, this decision turns out to have had direct benefits. Here are a few examples of how:

WordStar

One of the floppy disks that was recovered had the label "Copy of MASTER WORDSTAR DISK (WITH INSTALL)" and indeed this is what it appeared to contain.

Why do we have actual copies of software in archives like this one?

Users of computers in the 1980s and 1990s were often encouraged to make their own backup copies of software. I've mentioned this before in a previous blog post, and there is a gem of information on this in the online WordStar 4 manual.

There will undoubtedly be numerous copies of early software deposited with archives as a result of this practice of creating a backup disk.

Of course there was an opportunity here - I had lots of WordStar files that were hard to read and I also had a copy of WordStar!

I installed the copy of WordStar on an ancient unsupported PC that sits under my desk and was pretty pleased with myself when I got it working.

Then I had to work out how to use it...

But the manual (and Twitter) helped and it has turned out to be incredibly useful in helping to understand the files within the archive and also to check back on the significant properties of the originals while carrying out file migrations.

WC.EXE

Another file within the archive that I didn't fully understand the purpose of until recently has turned out to be another tool to help with the process of file migration.

After the imperfect file migration triggered by a Windows 10 upgrade I wanted to go back and do some checks on how well this process worked.

If I could find out the number of words and characters within the WordStar files I could then compare these with similar statistics from the migrated files and see if they matched up.

But the first hurdle was how to get this information from a WordStar file in the first place. As with many things, to my modern brain, this was not entirely intuitive!

However, reading the manual revealed that there is an additional word counting utility called WC.EXE that ships with WordStar.

Word counting advice from the WordStar 4.0 manual

Wouldn't it be great if I could find a copy of this utility!

As luck would have it, there was a copy in the digital archive!

I copied it across (on a high tech 3.5 inch floppy disk) to the old PC and got it working very quickly.

And it does what it says it will - here is the result for a file called REDHEAD.WS4

Using WC.EXE to count words and characters in WordStar

I then checked these stats against the stats for the migrated copy of REDHEAD.WS4 in Word 2016 and naively hoped they would match up.

Word counts for the same file in Word 2016

As you can see, the results were alarmingly different! (and note that unticking the box for textboxes, footnotes and endnotes doesn't help).

Twitter is a great help!

Furthermore it was suggested by Andy Jackson on Twitter that WordStar may also be counting formatting information at the start of a file, though I'm still unclear as to how this would add approximately 1,300 words. It is apparent that word and character counts are not to be trusted!
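One way to explore this kind of discrepancy is to count the same file in two ways: once naively, treating every byte and every whitespace-separated token as content, and once after stripping out anything that looks like markup. The Python sketch below is a rough illustration under stated assumptions, not a description of how WC.EXE works. It assumes that lines beginning with a full stop are dot commands, that control characters are formatting codes, and that the high bit of each byte should be cleared before counting. The function names are hypothetical.

```python
def wc_style_counts(data: bytes):
    """Count whitespace-separated tokens and total bytes, markup included."""
    return len(data.split()), len(data)


def text_only_counts(data: bytes):
    """Count words and characters after dropping dot-command lines and control codes.

    Assumption: some characters are stored with the high bit set,
    so it is cleared here before counting.
    """
    lines = []
    for raw in data.split(b"\n"):
        line = bytes(b & 0x7F for b in raw)
        if line.startswith(b"."):
            continue  # treated as a dot command such as .he, not document text
        # Keep printable characters and tabs; drop control codes and carriage returns.
        lines.append(bytes(b for b in line if b >= 0x20 or b == 0x09))
    text = b"\n".join(lines)
    return len(text.split()), len(text)
```

If the two sets of numbers differ widely for a file, that would be consistent with a counting tool including markup, though it would not prove it.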

So back to equally imperfect manual methods and visual inspection...having spent some time with these files I am fairly confident that the content of the documents has been captured adequately.

Although WC.EXE didn't turn out to be such a useful file for assessing my migrations, if I hadn't had a copy of it I could have wasted a lot of time looking for it!

Print test pages

Another file within the Marks and Gran digital archive that would not necessarily be considered to be archival is PRINT.BAK. This WordStar file (illustrated below) doesn't look like something that was created by Marks and Gran.

However, the file has turned out to be hugely useful to me as I try to understand and migrate the WordStar files. It describes (and most importantly demonstrates) some of the ways that you can format text within WordStar and (in theory) shows how they will appear when printed.

This would have been quite important information for users of a word processor that is not WYSIWYG!

A migrated version of a file called PRINT.BAK, present on the MASTER WORDSTAR DISK

From my perspective, the great thing about this file is that by putting it through the same migration methodology as the other WordStar files within the archive, I can make an assessment of how successful the migration has been at capturing all the different types of formatting.

Here is how the print test page looks in WordStar - it shows the mark up used to display different formatting.

I thought my migrated version of the print test page looked pretty good until I opened up the original in WordStar and noted all the additional commands that have not been captured as part of the migration process.

See, for example, the .he command at the top of the page, which specifies that the document should be printed with a header. Two lines of header text are defined, with Page # for the page number.
Note also the ^D mark up that wraps the text 'WordStar' - this tells the printer to double strike the text - effectively creating a light bold face as the printer prints the characters twice.
The print test page includes an example of the strikeout command ~~which should look like this.~~
It also includes examples of variable pitch which should be visible as a change in character width (font size).
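Spotting these commands by eye works for one test page, but an inventory of the markup in each original file could also be generated automatically and then checked against the migrated copy. The Python sketch below is illustrative only. It assumes that dot commands are a full stop plus two letters at the start of a line (as with .he) and that formatting codes such as ^D are stored as single control characters. The function name and the set of bytes excluded as non-formatting are my own assumptions.

```python
import re
from collections import Counter

# A full stop followed by two letters at the start of a line, e.g. ".he".
DOT_COMMAND = re.compile(rb"^\.([A-Za-z]{2})", re.MULTILINE)
# Control bytes assumed to be tabs, line endings or end-of-file padding, not formatting.
NOT_FORMATTING = {0x09, 0x0A, 0x0D, 0x1A}


def formatting_inventory(data: bytes) -> Counter:
    """Count dot commands (e.g. '.he') and control codes (e.g. '^D') in a file."""
    found = Counter(
        "." + m.group(1).decode("ascii").lower()
        for m in DOT_COMMAND.finditer(data)
    )
    for b in data:
        if b < 0x20 and b not in NOT_FORMATTING:
            found["^" + chr(b + 0x40)] += 1  # caret notation: byte 0x04 becomes ^D
    return found
```

Any command that appears in the inventory but has no visible counterpart in the migrated file is a candidate for closer inspection.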

Backup files

I described in a previous blog how WordStar automatically creates backup files each time a file is saved and I asked the question 'should we keep them?'

At this point in time I'm very glad that we have kept them.

The backup files have been useful as a point of comparison when encountering what appears to be some localised corruption on one particular floppy disk.

A fragment of one of the corrupt files in WordStar. The text no longer makes sense at a particular point in the document.

Looking at the backup file of the document illustrated above I can see a similar, but not identical, issue. It happens in roughly the same location in the document but some of the missing words can be recovered, providing opportunities to repair or retrieve more information from the archive because of the presence of these contemporary backups.
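This sort of comparison can be partly automated once the text of both versions has been extracted. The Python sketch below uses the standard library's difflib to list the runs of words where a backup differs from the current file. It is a hypothetical helper, not the method used here, and it assumes both files have already been decoded to plain text. Since the backup may be damaged too, every difference still needs a human to judge which reading is correct.

```python
import difflib


def backup_differences(current: str, backup: str):
    """Return runs of words from the backup that are missing or different in the current file."""
    cur_words, bak_words = current.split(), backup.split()
    matcher = difflib.SequenceMatcher(a=cur_words, b=bak_words, autojunk=False)
    differences = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" mark spans where the backup holds other or extra words.
        if tag in ("replace", "insert"):
            differences.append(" ".join(bak_words[j1:j2]))
    return differences
```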

To conclude

When I first encountered the Marks and Gran digital archive I was not convinced about the archival value of some of the files.

Specifically I was not convinced that we should be keeping things like system files and software or automatically generated backup files unless they were created with deliberate intent by Marks and Gran themselves.

However, as I have worked with the archive more and come to understand the files and how they were created, I have found some of these system files to be incredibly useful in moving forward my understanding of the WordStar files as a whole.

I'm not suggesting we should keep everything but I am suggesting that we should be cautious about getting rid of things that we don't fully understand...they may be useful one day.

Jenny Mitcham, Digital Archivist

1 comment:

Karl Pettersson, 26 July 2018 at 13:48
From the output, the WC.EXE looks very much like the Unix wc program, which counts everything in the file and is not specifically adapted to the Wordstar format (which would explain the inclusion of information not part of the real text, as you mention Andy Jackson suggested). Perhaps you can use this program on a copy of the original file (from e.g. a Linux/macOS/Cygwin installation) and compare with the output of WC.EXE.


Wednesday, 25 July 2018
