
Friday, 20 October 2017

Understanding WordStar - check out the manuals!

Last month I was pleased to be able to give a presentation at 'After the Digital Revolution' about some of the work I have been doing on the WordStar 4.0 files in the Marks and Gran digital archive that we hold here at the Borthwick Institute for Archives. This event specifically focused on literary archives.

It was some time ago now that I first wrote about these files that were recovered from 5.25 inch floppy (really floppy) disks deposited with us in 2009.

My original post described the process of re-discovery, data capture and file format identification - basically the steps that were carried out to get some level of control over the material and put it somewhere safe.

I recorded some of my initial observations about the files but offered no conclusions about the reasons for the idiosyncrasies.

I’ve since been able to spend a bit more time looking at the files and investigating the creating application (WordStar) so in my presentation at this event I was able to talk at length (too long as usual) about WordStar and early word processing. A topic guaranteed to bring out my inner geek!

WordStar is not an application I had any experience with in the past. I didn’t start word processing until the early 90’s when my archaeology essays and undergraduate dissertation were typed up into a DOS version of Word Perfect. Prior to that I used a typewriter (now I feel old!).

WordStar by all accounts was ahead of its time. It was the first word processing application to include mail merge functionality. It was hugely influential, introducing a number of keyboard shortcuts that are still used today in modern word processing applications (for example control-B to make text bold). Users interacted with WordStar entirely through the keyboard, selecting the necessary keystrokes from a set of different menus. The computer mouse (if it was present at all) was entirely redundant.

WordStar was widely used as home computing and word processing grew in popularity through the 1980s and into the early 90s. However, following the release of Word for Windows in 1989 and Windows 3.0 in 1990, WordStar gradually fell out of favour (info from Wikipedia).

Despite this it seems that WordStar had a loyal band of followers, particularly among writers. Of course the word processor was the key tool of their trade so if they found an application they were comfortable with it is understandable that they might want to stick with it.

I was therefore not surprised to hear that others presenting at 'After the Digital Revolution' also had WordStar files in their literary archives. Clear opportunities for collaboration here! If we are all thinking about how to provide access to and preserve these files for the future then wouldn't it be useful to talk about it together?

I've already learnt a lot through conversations with the National Library of New Zealand, who have been carrying out work in this area (read all about it here: Gattuso, J. and McKinney, P. (2014) 'Converting WordStar to HTML4', iPres 2014).

However, this blog post is not about defining a preservation strategy for the files; it is about better understanding them. My efforts have been greatly helped by finding copies of both a WordStar 3 manual and a WordStar 4 manual online.

As noted in my previous post on this subject, a few things stood out when first looking at the recovered WordStar files, and I've used the manuals and other research avenues to try and understand these better.


Created and last modified dates

The Marks and Gran digital archive consists of 174 files, most of which are WordStar files (and I believe them to be WordStar version 4).

Looking at the details that appear on the title pages of some of the scripts, the material appears to be from the period 1984 to 1987 (though not everything is dated).

However the system dates associated with the files themselves tell a different story. 

The majority of files in the archive have a creation date of 1st January 1980.

This was odd. Not only would that have been a very busy New Year's Day for the screen writing duo, but the timestamps on the files suggest that they were also working in the very early hours of the morning - perhaps unexpected when many people are out celebrating having just seen in the New Year!

This is the point at which I properly lost my faith in technical metadata!

In this period computers weren't quite as clever as they are today. When you switched them on they would ask you what date it was. If you didn't tell them, the PC would fall back to a system default, which just so happens to be 1st January 1980.

I was interested to see Abby Adams from the Harry Ransom Center, University of Texas at Austin (also presenting at 'After the Digital Revolution') flag up some similarly suspicious dates on files in a digital archive held at her institution. Her dates differed just slightly from mine, falling on the evening of 31st December 1979. Again, these dates looked unreliable as they were clearly out of line with the rest of the collection.

This is the same issue as mine; the difference comes down to the timezone. David Clipsham highlighted a further explanation when I threw the question out to Twitter. Thanks!
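As an aside, this fallback date is easy to screen for programmatically. Below is a minimal sketch in Python (the folder name is hypothetical) of how you might flag files whose timestamps sit suspiciously close to the DOS default, allowing a day either side to catch the timezone-shifted variants described above:

```python
from datetime import datetime
from pathlib import Path

# The DOS fallback date: what a PC assumed if nobody set the clock at boot.
DOS_EPOCH = datetime(1980, 1, 1)

def suspicious_dates(folder):
    """Yield files whose last modified date falls within a day of
    1st January 1980 - a strong hint the system clock was never set."""
    for f in Path(folder).iterdir():
        if f.is_file():
            mtime = datetime.fromtimestamp(f.stat().st_mtime)
            # A day's leeway either side catches timezone-shifted variants,
            # e.g. the 31st December 1979 evening dates mentioned above.
            if abs((mtime - DOS_EPOCH).days) <= 1:
                yield f.name, mtime

for name, mtime in suspicious_dates("recovered_disks"):
    print(f"{name}: {mtime} - treat with suspicion!")
```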


Fragmentation

Another thing I had noticed about the files was the way that they were broken up into fragments. The script for a single episode was not saved as a single file but typically as three or four separate files. These files were named in such a way that their relationship and intended order were clear - for example GINGER1, GINGER2 or PILOT and PILOTB.

This seemed curious to me - why not just save the document as a single file? The WordStar 4 manual didn't offer any clues but I found this piece of information in the WordStar 3 manual which describes how files should be split up to help manage the storage space on your diskettes:

From the WordStar 3 manual




Perhaps some of the files in the digital archive are from WordStar 3, or perhaps Marks and Gran had been previously using WordStar 3 and had just got into the habit of splitting a document into several files in order to ensure they didn't run out of space on their floppy disks.

I cannot imagine working this way today! Technology really has come a long way. Imagine trying to format, review or spell check a document that exists as several discrete files, potentially sitting on different media!


Filenames

One thing that stands out when browsing the disks is that all the filenames are in capital letters. DOES ANYONE KNOW WHY THIS WAS THE CASE?

File names in this digital archive were also quite cryptic. This is the 1980s, so filenames conform to the 8.3 limit: only 8 characters are allowed in a filename, and it *may* also include a 3 character file extension.

Note that the file extension really is optional and WordStar version 4 doesn’t enforce the use of a standard file extension. Users were encouraged to use those last 3 characters of the file name to give additional context to the file content rather than to describe the file format itself.

Guidance on file naming from the WordStar 4 manual
Some of the tools and processes we have in place to analyse and process the files in our digital archives use the file extension information to help understand the format. The file naming methodology described here therefore makes me quite uncomfortable!

Marks and Gran tended not to use the file extension in this way (though there are a few examples of this in the archive). The majority of the WordStar files have no extension at all. The only really consistent use of file extensions related to their backup files.
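Before moving on to those backup files: the 8.3 rule itself is simple enough to check mechanically. Here is a quick Python sketch (using a simplified character set - real DOS allowed a few more punctuation marks) that tests names against the convention:

```python
import re

# Simplified 8.3 check: up to 8 characters, then optionally a dot and
# up to 3 more. DOS stored names in upper case - hence all the capitals.
EIGHT_THREE = re.compile(r"^[A-Z0-9_\-]{1,8}(\.[A-Z0-9_\-]{1,3})?$")

for name in ["GINGER1", "PILOTB", "SCRIPT.BAK", "AVERYLONGNAME.TXT"]:
    verdict = "fits 8.3" if EIGHT_THREE.match(name) else "does not fit 8.3"
    print(f"{name}: {verdict}")
```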


Backup files

Scattered amongst the recovered data were a set of files with the extension BAK. This is clearly a file extension that WordStar creates and uses consistently. These files contained very similar content to other documents within the archive, typically with just a few differences. They were evidently backup files of some sort, but I wondered whether they had been created automatically or by the writers themselves.

Again the manual was helpful in moving my understanding forward:

Backup files from the WordStar 4 manual

This backup procedure is also summarised with the help of a diagram in the WordStar 3 manual:


The backup procedure from WordStar 3 manual
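In other words: each time you save, the previous version of the document is renamed with a .BAK extension, and only one generation is kept. Here is a rough sketch of the same logic in Python - an illustration of the procedure the manuals describe, not WordStar's actual behaviour in every detail:

```python
from pathlib import Path

def save_with_backup(doc: Path, new_content: str) -> None:
    """Mimic the WordStar-style save: the version you started editing
    becomes the .BAK file, and the new version takes its place."""
    if doc.exists():
        backup = doc.with_suffix(".BAK")
        backup.unlink(missing_ok=True)  # only one generation is kept
        doc.rename(backup)              # old GINGER1 -> GINGER1.BAK
    doc.write_text(new_content)         # write the newly edited version

save_with_backup(Path("GINGER1"), "Scene 1. Interior, day...")
```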


This does help explain why there were so many backup files in the archive. I guess the next question is 'should we keep them?'. It does seem that they are an artefact of the application rather than the result of a conscious decision by the writers to back up their files at a particular point in time, and that may have an impact on their value. However, as discussed in a previous post on preserving Google documents, there could be some benefit in preserving revision history (even if only partial).



...and finally

My understanding of these WordStar files has come on in leaps and bounds by doing a bit of research and in particular through finding copies of the manuals.

The manuals even explain why alongside the scripts within the digital archive we also have a disk that contains a copy of the WordStar application itself. 

The very first step in the manual asks users to make a copy of the software:


I do remember having to do this sort of thing in the past! From WordStar 4 manual


Of course the manuals themselves are also incredibly useful in teaching me how to actually use the software. Keystroke based navigation is hardly intuitive to those of us who are now used to using a mouse, but I think that might be the subject of another blog post!




Jenny Mitcham, Digital Archivist

Friday, 19 August 2016

My first file format signature

As part of Filling the Digital Preservation Gap we've been doing a lot of talking about the importance of accurate file format identification and the challenges of doing so for research data.

Now we are thinking about how we can help solve the problem.

As promised in a post last month, I wanted to have a go at file format signature creation for PRONOM to see whether it is something that an average digital archivist could get their head around. Never before had I created my own signature. In the past I had considered this to be work that only technical people could carry out and it would be foolhardy to attempt it myself.

However, given the extent of the file formats identification challenge for research data wouldn't it be great if the community could engage more directly? Also, shouldn't file signature development be something every digital archivist should have a good understanding of?

Encouraged by Ross Spencer's blog post Five Star File Format Signature Development and a meeting with the PRONOM team at The National Archives in which the tricks and challenges of signature creation were explained, I decided to give it a go.

Where to begin?

  • First I read TNA's How to research and develop signatures for file format identification. This is an accessible and readable guide which tells you how to get started with signature development - from gathering samples, doing internet research on the format and using a Hex Editor to spot patterns. You don't need to be very technically minded to get your head around it.
  • Then I downloaded and installed a Hex editor. Though it is possible to view files as hexadecimal within Quick View Plus, I followed TNA's advice and used HxD Hex Editor as this allows you to compare files, thus partially automating the process of spotting sequences (a scripted take on the same idea is sketched after this list).
  • Once I'd spotted a pattern which could be used to create a signature, I planned to use PRONOM's Signature Development Utility to create it. 
  • Once the signature is created I'm told it is possible to test this using DROID. Within DROID, go to Tools... Install signature file and replace the current signature with your new one (but remember to put it back again once you are done otherwise you may wonder why DROID isn't working properly!). Run this over your directory of sample files to see if they are all correctly identified using the signature you have developed. 
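To give a flavour of what the hex comparison involves, here is a small Python sketch along the same lines: it reads the opening bytes of every file in a folder of samples (the folder name is hypothetical) and reports the longest starting sequence they all share - a candidate for a beginning-of-file signature:

```python
from pathlib import Path

def common_prefix(folder, length=32):
    """Return the longest byte sequence shared at the start of every
    file in the folder - a candidate beginning-of-file signature."""
    headers = [f.read_bytes()[:length]
               for f in Path(folder).iterdir() if f.is_file()]
    prefix = headers[0]
    for header in headers[1:]:
        # Trim the candidate back until this file's header also starts with it.
        while not header.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

prefix = common_prefix("sample_files")
print(prefix.hex(" ").upper() if prefix else "No common starting sequence")
```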


I decided to start off with something that had been at the back of my mind for a while... those WordStar 4.0 files from the Marks and Gran archive that I blogged about some time ago and had struggled to identify. When I wrote that post three years ago, WordStar 4.0 files were not represented in PRONOM. They have more recently been added and the files can be identified, but by extension only - not by the more accurate file signature. I thought it would be fun to try and create a file signature for them.

I was very wrong.

My attempts to see a pattern within the files using the Hex Editor were unsuccessful. I decided to send the sample files to the experts at TNA to see if I was missing something. It was quickly confirmed that this was a rather awkward file type and not one that lent itself well to being automatically identified. Disappointing but at least it confirmed that my own investigations were not lacking.

For my next attempt I decided to tackle some of the unidentified research data that I had highlighted in my previous post Research data - what does it *really* look like?

I looked through the top ten most frequent unidentified file extensions in my sample and started to dig out the files themselves and assess whether they were a good candidate for me to work on. Ross Spencer suggests that PRONOM lends itself best to the creation of signatures for binary formats so this is what I wanted to focus on. No point in trying to make it hard for myself!

  • dat - 286
  • crl - 259
  • sd - 246
  • jdf - 130 (a signature for these JEOL NMR Spectroscopy files is now available)
  • out - 50
  • mrc - 47
  • inp - 46
  • xyz - 37
  • prefs - 32
  • spa - 32
Unfortunately, looking through the list (and digging out some samples) I discovered that many of these are ASCII formats rather than binary. It is possible to create signatures to identify ASCII files, but it can be challenging (involving quite complex regular expressions) and is not a great place for a first timer to start. I certainly did not want to start to tackle the confusing landscape of .dat files either!

After a little bit of investigation I discovered that the .spa files were something I could work with. I knew nothing about this format but found the relevant files and started doing some internet research, looking for more information and perhaps some additional samples. I soon discovered they were one of many formats for optical spectroscopy, known variously as Thermo Fisher's OMNIC file format, Thermo Scientific OMNIC or Nicolet/Thermo OMNIC.

Looking at some of the files using a Hex Editor, it was immediately apparent that there was a consistent pattern of bytes at the start of each file: a string reading 'Spectral Data File', represented in hexadecimal as 53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65. Note that I actually thought the pattern was longer, but advice received from the PRONOM team suggested that it was better to cut it down.

I also looked at the end of each file and at first sight there appeared to be consistency here too with each file ending with the same few bytes. This hypothesis was blown out of the water when I looked at a sample file that I had discovered online which did not display this pattern (but luckily did have the same bytes at the start of the file).

This is why it is so important to have sample data that comes from more than one source. A set of files from a single researcher may have misleading patterns that have occurred just because of the consistent way in which they work, rather than this being a true feature of the format itself.
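Checking the hypothesis across a set of sample files is also trivially scriptable. A short sketch (assuming the sequence sits right at the start of the file, and a hypothetical folder of samples):

```python
from pathlib import Path

# 'Spectral Data File' = 53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65
MAGIC = b"Spectral Data File"

for spa in sorted(Path("spa_samples").iterdir()):
    if spa.is_file():
        verdict = "matches" if spa.read_bytes().startswith(MAGIC) else "DOES NOT match"
        print(f"{spa.name}: {verdict}")
```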

So, once I'd looked at all 33 files and had convinced myself the hypothesis was solid, I went to the online signature development tool provided by The National Archives and created my signature.

PRONOM signature development tool

This was relatively easy to use but there were areas where more guidance was needed (so questions were fired off to the PRONOM team and a speedy response was received). I'm hoping that in the future there will be more documentation to guide the completion of this form - so that people know how best to name the signature, where to find a definitive list of MIME types (this is the list they suggested I look at), and what the Version field should contain (the version of the file format, if apparent or relevant - not the version of the signature you are creating).

Once I was done, I clicked on the 'Save Signature File' button and I was presented with the finished XML file:

Ta daaaaa!

I briefly admired my handiwork before sending it off to The National Archives for feedback.*

How long did it take me to do all of this? I would say one full day is a fair estimate (that would include reading the guidance, downloading the Hex Editor and a few false starts as I tried to find a format that I thought I could handle). The next signature would be much quicker.

The biggest challenges:

  1. It took me a while to find a binary format that I could work with. Much of the research data we hold appears to be in ASCII formats, which has benefits from a digital preservation perspective but wasn't what I was looking for with regard to this exercise.
  2. I did not really understand the file format I was working with. I am not a chemist. I had never heard of the .spa format. I struggle to even say 'spectroscopy' let alone understand it. When I started to research it online I found the results quite confusing. If I had known more about the format in the first place it would have made life much easier. 
  3. There are limitations with the metadata we get from researchers when they deposit data with Research Data York. Reading the brief descriptions of the dataset that are provided did not really help me work out what the individual files are or what software and hardware was used to create them.
  4. I could not locate the file format specification online - I think next time I try this I may approach the software vendor direct and ask them for help. 
  5. Available documentation for creating and testing signatures could be enhanced. I had several questions as I went along and these were answered promptly by the PRONOM team, but if the information was all online then this would certainly help other newbies.


Despite the challenges this exercise has been both enjoyable and useful. The thing I like about being a digital archivist is being able to get hands on with the data and solve problems. Over the last few years I've done very little of this type of work so it was great to get stuck in. On top of the obvious benefit that after the next signature release these .spa files will now be recognised by DROID and other PRONOM-based file identification tools, I have also increased my knowledge and understanding of the process and this is a positive result.

I would definitely encourage other digital archivists, repository managers and research data managers to try this out for themselves.



* Feedback from the PRONOM team was positive. With a couple of modifications they were happy to include the signature in the next PRONOM signature release.


Jenny Mitcham, Digital Archivist

Monday, 4 July 2016

New research data file formats now available in PRONOM

Like all good digital archivists I am interested in file formats and how we can use tools to automatically identify them. This is primarily so that when we package our digital archives up for long term preservation we can do so with a level of knowledge about how we might go about preserving and providing access to them in the future. This information is key whether migration or emulation is the preservation or access strategy of choice (or indeed a combination of both).

It has been really valuable to have some time and money as part of our "Filling the Digital Preservation Gap" project to investigate issues around the identification of research data file formats, and very pleasing to see the latest PRONOM signature release on 29th June, which includes a couple of research data formats that we have sponsored as part of our project work.




PRONOM release notes


I sent a batch of sample files off to the team who look after PRONOM at The National Archives (TNA) with a bit of contextual information about the formats and software/hardware that creates them (that I had uncovered after a bit of research on Google). TNA did the rest of the hard work and these new signatures are now available for all to use.

The formats in question are:

  • Gaussian input files - These are created for an application called Gaussian which is used by many researchers in the Chemistry department here in York. In a previous project update you can see that Gaussian was listed in the top 10 research software applications in use at the University of York. These files are essentially just ASCII text files containing instructions for Gaussian, and they can have a range of file extensions (though the samples I submitted were all .gjf). Though there is a recommended format or syntax for these instructions, there also appears to be flexibility in how these can be applied. Consequently this was a slightly challenging signature for TNA to work on, and it would be useful if other institutions that have Gaussian input files could help test this signature and feed back to TNA if there are any problems or issues. In instances like this, being able to develop against a range of sample files created at different times, in different institutions, by different researchers would help.
  • JEOL NMR Spectroscopy files - These are data files produced by JEOL's Nuclear Magnetic Resonance Spectrometers. These facilities at the University of York are clearly well used as data of this type was well represented in an initial assessment of the data that I reported on in a blog post last month (130 .jdf files were present in the sample of 3752 files). As these files are created by a piece of hardware in a very standard way, I am told that signature developers at TNA were able to create a signature without too many problems.


Further formats submitted from our project will appear in PRONOM within the next couple of months.

The project team are also interested in finding out how easy it is to create our own file format signatures. This is an alternative option for those who want to contribute but not something we have attempted before. Watch this space to find out how we get on!



Jenny Mitcham, Digital Archivist

Tuesday, 31 May 2016

Research data - what does it *really* look like?

Work continues on our Filling the Digital Preservation Gap project and I thought it was about time we updated you on some of the things we have been doing.

While my colleague Julie has been focusing on the more technical issues of implementing Archivematica for research data, I have been looking at some real research data and exploring in more detail some of the issues we discussed in our phase 1 report.

For the past year, we have been accepting research data for longer term curation. Though the systems for preserving and providing access to this data are still in development, we are for the time being able to allocate a DOI for each dataset, manage access and store it safely (ensuring it isn't altered), and we intend to ingest it into our data curation systems once they are ready.

Having this data in one place on our filestore does give me the opportunity to test the hypothesis in our first report about the wide range of file formats that will be present in a research dataset and also the assertion that many of these will not be identified by the tools and registries in use for the creation of technical metadata.

So, I have done a fairly quick piece of analysis on the research data, running Droid (a tool developed by The National Archives) over the data to get an indication of whether the files can be recognised and identified in an automated fashion.

All the data in our research data sample has been deposited with us since May 2015. The majority of the data is scientific in nature, much of it coming from the departments of Chemistry and Physics (this may be a direct result of expectations from the EPSRC around data management). The data is mostly fairly recent, as suggested by the last modified dates on the files, which range from 2006 to 2016, with the vast majority having been modified in the last five years. The distribution of dates is illustrated below.




Here are some of the findings of this exercise:

Summary statistics


  • Droid reported that 3752 individual files were present*

  • 1382 (37%) of the files were given a file format identification by Droid

  • 1368 (99%) of those files that were identified were given just one possible identification. 12 files were given two possible identifications and a further two were given 18 possible identifications. In all these cases, the identification was done by file extension rather than signature - which perhaps explains the uncertainty

Files that were identified

  • Of the 1382 files that were identified: 
    • 668 (48%) were identified by signature (which suggests a fairly accurate identification - if a file is identified by signature it means that Droid has looked inside the file and seen something that it recognises. I'm told it does this by some sort of magic!)
    • 648 (47%) were identified by extension alone (which implies a less accurate identification)
    • 65 (5%) were identified by container. These were all Microsoft Office files - xlsx and docx as these are in effect zip files (which suggests a high level of accuracy)

  • 111 (8%) of the identified files had a file extension mismatch - this means that the file extension was not what you would expect given the identification by signature. 
    • All but 16 of these files were XML files that didn't have the .xml file extension (there were a range of extensions for these files including .orig, .plot, .xpr, .sc, .svg, .xci, .hwh, .bxml, .history). This isn't a very surprising finding given the ubiquity of XML files and the fact that applications often give their own XML output different extensions.

  • 34 different file formats were identified within the collection of research data

  • Of the identified files 360 (26%) were XML files. This was by far the most common file format identified within the research dataset. The top 10 identified files are as follows:
    • Extensible Markup Language - 360
    • Log File - 212
    • Plain Text File - 186
    • AppleDouble Resource Fork - 133
    • Comma Separated Values - 109
    • Microsoft Print File - 77
    • Portable Network Graphics - 73
    • Microsoft Excel for Windows - 57
    • ZIP Format - 23
    • XYWrite Document - 21

Files that weren't identified

  • Of the 2370 files that weren't identified by Droid, 107 different file extensions were represented

  • 614 (26%) of the unidentified files had no file extension at all. This does rather limit the chance of identification, given that identification by file extension is relied on quite heavily by Droid and other similar tools. Of course it also limits our ability to actively curate this data unless we can identify it by another means.

  • The most common file extensions for the files that were not identified are as follows:
    • dat - 286
    • crl - 259
    • sd - 246
    • jdf - 130
    • out - 50
    • mrc - 47
    • inp - 46
    • xyz - 37
    • prefs - 32
    • spa - 32
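
For anyone wanting to reproduce this sort of breakdown, the counts above can be pulled out of a Droid CSV export with a few lines of Python. A minimal sketch, assuming the column names in Droid's default export (PUID, METHOD, EXT and TYPE) and a hypothetical export file name:

```python
import csv
from collections import Counter

methods = Counter()            # how identified files were matched
unidentified_exts = Counter()  # extensions of unidentified files
no_extension = 0

with open("droid_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["TYPE"] == "Folder":
            continue  # only count files, not directories
        if row["PUID"]:  # a PUID means Droid identified the file
            methods[row["METHOD"]] += 1
        elif row["EXT"]:
            unidentified_exts[row["EXT"]] += 1
        else:
            no_extension += 1

print("Identification methods:", dict(methods))
print("Unidentified files with no extension:", no_extension)
print("Top unidentified extensions:", unidentified_exts.most_common(10))
```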



Some thoughts

  • This is all very interesting and does back up our assertions about the long tail of file formats within a collection of research data and the challenges of identifying this data using current tools. I'd be interested to know whether a higher success rate would be expected for other collections of born digital data (not research data). Is identification of 37% of files a particularly bad result, or is it similar to what others have experienced?

  • As mentioned in a previous blog post, one area of work for us is to get some more research data file formats into PRONOM (a key database of file formats that is utilised by digital preservation tools). Alongside previous work on the top software applications used by researchers (a little of this is reported here) this has been helpful in informing our priorities when considering which formats we would like The National Archives to develop signatures for in phase 3 of the project.

  • Given the point made above, it could be suggested that one of our priorities for file format research should be .dat files. This would make sense given that we have 286 of these files and they are not identified by any means. However, here lies a problem. This is actually a fairly common file extension. There are many different types of .dat files produced by many different applications. PRONOM already holds information on two varieties of .dat file, and the .dat files that we hold appear to come from several different software applications. In short, solving the .dat problem is not a trivial exercise!

  • It strikes me that we are really just scratching the surface here. Though it is good that we are getting a few file signatures developed as an output of this project, this is clearly not going to make a big impact given the size of the problem. We will need to think about how the community should continue this work going forward.

  • It has been really helpful having some genuine research data to investigate when thinking about preservation workflows - particularly those workflows for unidentified files that we were considering in some detail during phase 2 of our project. The unidentified file report that has been developed within Archivematica as a result of this project helpfully organises the files by extension, and I had not envisaged at the time that so many files would have no file extension at all. We have discussed previously how useful it is to fall back on identification by file extension if identification by signature is unsuccessful, but clearly for so many of our files this will not be worth attempting.




* Note that this figure turns out not to be entirely accurate, given that eight of the datasets were zipped up into .rar archives. Though Droid looks inside several types of zip files (including .zip, .gzip, .tar, and the web archival formats .arc and .warc), it does not yet look inside .rar, .7z, .bz, and .iso files. I didn't realise this until after I had carried out some analysis on the results. Consequently there are another 1291 files which I have not reported on here (though a quick run through with Droid after unzipping them manually identified 33% of the files, a similar ratio to the rest of the data). This functionality is something that the team at The National Archives intend to develop in the future.

Jenny Mitcham, Digital Archivist

Tuesday, 13 August 2013

A short detective story involving 5 ¼ inch floppy disks

Earlier this year my colleague encountered two small boxes of 5 ¼ inch floppy disks buried within the Marks and Gran archive in the strongrooms of the Borthwick Institute. He had been performing an audit of audio visual material in our care and came across these in an unlisted archive.

This was exciting to me as I had not worked with this media before. As a digital archivist I had often encountered 3 ½ inch floppies but not their larger (and floppier) precursors. The story and detective work that follows took us firmly into the realm of 'digital archaeology'.

Digital archaeology: 
“The process of reclaiming digital information that has been damaged or become unusable due to technological obsolescence of formats and/or media” (definition from Glossaurus)

Marks and Gran were a writing duo who wrote the scripts of many TV sitcoms from the late 1970s onwards. ‘Birds of a Feather’ was the one I remember watching myself in the 80s and 90s, but their credits include many others such as ‘Goodnight Sweetheart’ and ‘The New Statesman’. Their archive had been entrusted to us and was awaiting cataloguing.

Clues on the labels

There were some clues on the physical disks themselves about what they might contain. All the disks were labelled and many of the labels referred to the TV shows they had written, sometimes with other information such as an episode or series number. Some disks had dates written on them (1986 and 1987). One disk was intriguingly labelled 'IDEAS'. WordStar was also mentioned on several labels: 'WordStar 2000' and 'Copy of Master WordStar disk'. WordStar was a popular and quite pioneering word processing package of the early 80s.

However, clues on labels must always be taken with a pinch of salt. I remember being guilty of not keeping the labels on floppy disks up to date, of replacing or deleting files and not recording the fact. The information on these labels will be stored in some form in the digital archive but the files may have a different story to tell.

Reading the disks

The first challenge was to see if the data could be rescued from the obsolete media they were on. A fortuitous set of circumstances led me to a nice chap in IT who is somewhat of an enthusiast in this area. I was really pleased to learn that he had a working 5 ¼ inch drive on one of his old PCs at home. He very kindly agreed to have a go at copying the data off the disks for me and did so with very few problems. Data from 18 of the 19 disks was recovered. The only one of the disks that was no longer readable appeared from the label to be a backup disk of some accounting software - this is a level of loss I am happy to accept.

Looking at file names

Looking at the files that were recovered is like stepping back in time. Many of us remember the days when file names looked like this: capital letters, very short, rather cryptic, missing file extensions. WordStar, like many software packages from this era, was not a stickler for enforcing the use of file extensions! File extensions were also not always used correctly to define the file type but were sometimes used to hold additional information about the file.



Looking back at files from 30 years ago really does present a challenge. Modern operating systems allow long and descriptive file names to be created. When used well, file names often provide an invaluable source of metadata about the file. 30 years ago computer users had only 8 characters at their disposal. Creating file names that were both unique and descriptive was difficult. The file names in this collection do not always give many clues as to what is contained within them.

Missing clues

For a digital archivist, the file extension of a file is a really valuable piece of information. It gives us an idea of what software the file might have been created in, and from this we can start to look for a suitable tool that we could use to work with these files. In a similar way, a lack of file extension confuses modern operating systems. Double click on a file and Windows 7 has no idea what it is or what software to fire up to open it. File characterisation tools such as Droid, used on a day to day basis by digital archivists, also rely heavily on file extensions to help identify the file type. Running Droid on this collection (not surprisingly) produced lots of blank rows and inaccurate results*.

Another observation on initial inspection of this set of files is that the creation dates associated with them are very misleading. It is really useful to know the creation date of a file, and this is the sort of information that digital archivists put some effort into recording as accurately as they can. The creation dates on this set of files were rather strange. The vast majority of files appeared to have been created on 1st January 1980, but there were a handful of files with creation dates between 1984 and 1987. It does seem unlikely that Marks and Gran produced the main body of their work on a bank holiday in 1980, so it would seem that this date is not very accurate. My contact in IT pointed out that on old DOS computers it was up to the user to enter the correct date each time they used the PC. If no date was entered, the PC defaulted to 1/1/1980. Not a great system clearly, and we should be thankful that technology has moved on in this regard!

So, we are missing important metadata that will help us understand the files, but all is not lost; the next step is to see whether we can read and make sense of them with our modern software. 

Reading the files

I have previously blogged about one of my favourite little software programs, Quick View Plus – a useful addition to any digital archivist’s toolkit. True to form, Quick View Plus quickly identified the majority of these files as WordStar version 4.0 and was able to display them as well-formatted text documents. The vast majority of files appear to be sections of scripts and cast lists for various sitcoms from the 1980s, but there are other documents of a more administrative nature, such as PHONE, which looks to be a list of speed dial shortcuts for individuals and organisations that Marks and Gran were working with (including a number of famous names).

Unanswered questions

I have not finished investigating this collection yet and still have many questions in my head:

  • How do these digital files relate to what we hold in our physical archive? The majority of the Marks and Gran archive is in paper form. Do we also have this same data as print outs? 
  • Do the digital versions of these files give us useful information that the physical versions do not (or vice versa)? 
  • Many of the scripts are split into a number of separate files; some are just small snippets of dialogue. How do all of these relate to each other? 
  • What can these files tell us about the creative process and about early word processing practices?

I am hoping that when I have a chance to investigate further I will come up with some answers.


* I will be providing some sample files to the Droid techies at The National Archives to see if they can tackle this issue.


Jenny Mitcham, Digital Archivist
