
Friday, 28 September 2018

Auditing the digital archive filestore

A couple of months ago I blogged about checksums and the methodology I have in place to ensure that I can verify the integrity and authenticity of the files within the digital archive.

I was aware that my current workflows for integrity checking were 'good enough' for the scale at which I'm currently working, but that there was room for improvement. This is often the case when there are humans involved in a process. What if I forget to create checksums for a directory? What happens if I forget to run the checksum verification?

I am also aware that checksum verification does not solve everything. For example, read all about The mysterious case of the changed last modified dates. And as I found in When checksums don't match, the checksum verification process doesn't tell you what has changed, who changed it or when it was changed...it just tells you that something has changed. So perhaps we need more information.

A colleague in IT Services here at York mentioned to me that after an operating system upgrade on the filestore server last year, there is now auditing support (a bit more information here). This wasn't being widely used yet but it was an option if I wanted to give it a try and see what it did.

This seemed like an interesting idea so we have given it a whirl. With a bit of help (and the right level of permissions on the filestore), I have switched on auditing for the digital archive.

My helpful IT colleague showed me an example of the logs that were coming through. It has been a busy week in the digital archive. I have ingested 11 memory sticks, 24 CD-ROMs and a pile of floppy disks. The logs were extensive and not very user friendly in the first instance.

That morning I had wanted to find out the total size of the born digital archives in the digital archive filestore and had right clicked on the folder and selected 'properties'. This had produced tens of thousands of lines of XML in the filestore logs as the attributes of each individual file had to be accessed by the server in order to make the calculation. The audit logs really are capable of auditing everything that happens to the files!

...but do I really need that level of information? Too much information is a problem if it hides the useful stuff.

It is possible to configure the logging so that it looks for specific types of events. So, while I am not specifically interested in accesses to the files, I am interested in changes to them. We configured the auditing to record only certain types of events (as illustrated below). This cuts down the size of the resulting logs and restricts it just to those things that might be of interest to me.




There is little point in switching this on if it is not going to be of use. So what do I intend to do with the output?

The logs are created in XML, but this information would be more user-friendly in a spreadsheet. IT have worked out how to pull the relevant bits of the log out into a tab delimited format that I can then open in a spreadsheet application.
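
To give a flavour of what that conversion involves (this is not the actual script IT use, and the element and field names here are hypothetical because the real log schema will differ), it boils down to something like this:

    import csv
    import xml.etree.ElementTree as ET

    # Hypothetical field names - the real audit log schema will differ
    FIELDS = ["timestamp", "user", "event_type", "path"]

    def audit_xml_to_tsv(xml_path, tsv_path):
        tree = ET.parse(xml_path)
        with open(tsv_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out, delimiter="\t")
            writer.writerow(FIELDS)
            # assume one <event> element per audited action
            for event in tree.getroot().iter("event"):
                writer.writerow([event.findtext(field, default="") for field in FIELDS])

    audit_xml_to_tsv("audit_log.xml", "audit_log.tsv")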

What I have is some basic information about the date and time of the event, who initiated it, the type of event (e.g. RENAME, WRITE, ATTRIBUTE|WRITE) and the folder/file that was affected.

As I can view this in a spreadsheet application, it is easy to reorder and filter the columns to look for unexpected or unusual activity (a scripted version of these checks is sketched after the list below):

  • Was there anyone other than me working on the filestore? (there shouldn't be right now)
  • Was there any activity on a date I wasn't in the office?
  • Was there any activity in a folder I wasn't intentionally working on?
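
These checks lend themselves to a little scripting too. A rough sketch of the idea, assuming the tab delimited log uses the same hypothetical column names as above, and treating weekend activity as a stand-in for 'a date I wasn't in the office':

    import csv
    from datetime import datetime

    EXPECTED_USERS = {"jmitcham"}        # hypothetical username
    EXPECTED_FOLDER = "born_digital"     # hypothetical folder I was working in

    with open("audit_log.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # assumes an ISO style timestamp, e.g. 2018-09-24T10:15:00
            event_date = datetime.fromisoformat(row["timestamp"]).date()
            if row["user"] not in EXPECTED_USERS:
                print("Unexpected user:", row)
            elif event_date.weekday() >= 5:     # Saturday or Sunday
                print("Activity on a day I wasn't in the office:", row)
            elif EXPECTED_FOLDER not in row["path"]:
                print("Activity in a folder I wasn't working on:", row)
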
The current plan is that these logs will be emailed to me on a weekly basis and I will have a brief check to ensure all looks OK. This will sit alongside my regular integrity checking as another means of assuring that all is as it should be.

We'll review how this is working in a few weeks to see if it continues to be a valuable exercise or should be tweaked further.

In my Benchmarking with the NDSA Levels of Preservation post last year, I put us at level 2 for Information Security (as highlighted in green below).



See the full NDSA levels here

Now we have switched on this auditing feature and have a plan in place for regular checking of the logs, does this now take us to level 4 or is more work required?

I'd be really interested to find out whether other digital archivists are utilising filestore audit logs and what processes and procedures are in place to monitor these.

Final thoughts...

This was a quick win and hopefully will prove a useful tool for the digital archive here at the University of York. It is also a nice little example of collaboration between IT and Archives staff.

I sometimes think that IT people and digital preservation folk don't talk enough. If we take the time to talk and to explain our needs and use cases, then the chances are that IT might have some helpful solutions to share. The tools that we need to do our jobs effectively are sometimes already in place in our institutions. We just need to talk to the right people to get them working for us.

Jenny Mitcham, Digital Archivist

Tuesday, 31 July 2018

Checksum or Fixity? Which tool is for me?

The digital preservation community are in agreement that file fixity and data integrity are important.

Indeed there is a whole row devoted to this in the NDSA Levels of Preservation. But how do we all do it? There are many tools out there - see for example those listed in COPTR.

It was noted in the NDSA Fixity Survey Report of 2017 that there isn't really a consensus on how checksums are created and verified across the digital preservation community. Many respondents to the survey also gave the impression that current procedures were a work in progress and that other options should be explored.

From the conclusion of the report:
"Respondents also talked frequently about improvements they wanted to make, such as looking for new tools, developing workflows, and making other changes to their fixity activities. It may be useful, then, for practitioners to think of their fixity practices within a maturity or continuous improvement model; digital preservation practitioners develop a set of ideal practices for their institution and continuously evaluate methods and the allocation of available resources to get as close as possible to their ideal state."

This nicely sums up how I feel about a lot of the digital preservation routines I put into practice ...but continuous improvement needs time.

Life is busy and technology moves fast.

I realised that I haven't reviewed my tools and procedures for fixity checking since 2013.

A recent upgrade of my PC to Windows 10 gave me a good excuse to change this. Given that I was going to have to re-install and configure all my tools post-upgrade anyway, this was the catalyst I needed to review the way that I currently create and verify checksums within the digital archive.

Current procedures using Checksum

Since 2013 I have been using a tool called Checksum to generate and verify checksums. Checksum describes itself as a "blisteringly fast, no-nonsense file hashing application for Windows" and this has worked well for me in practice. One of the key selling points of Checksum is its speed. This has become a more important requirement over time as the digital archive has grown in size.

I have a set of procedures around this tool that document and describe how it is configured and how it is used, both as part of the ingest process and as a routine integrity checking task. I keep a log of the dates that checksums are verified, numbers of files checked and the results of these checks and am able to respond to issues and errors as and when they occur.

There has been little drama on this score over the last 5 years apart from a brief blip when my checksums didn't match and the realisation that my integrity checking routine wasn't going to catch things like changed last modified dates.

I'm not unhappy with Checksum but I am aware that there is room for improvement. Getting it set up and configured correctly isn't as easy as I would like. I sometimes wonder if there are things I'm missing. In the past I have scheduled a regular checksum verification task using Windows Task Scheduler as this is not a feature of Checksum itself but more recently I've just been initiating it manually on a regular schedule.


Introducing Fixity

Fixity is a free tool from AVP. It has been around since 2013 but hadn't hit my radar until recently. It was mentioned several times in the NDSA Fixity Survey Report and I was keen to try it out.



Fixity was created in recognition of the key role that checksum generation and validation have in digital preservation workflows. The intention of the developers was to provide institutions with a simple and low cost (in fact...free) tool that allows checksums to be generated and validated and that enables tasks to be scheduled and reports to be generated.

The Fixity User Guide is a wonderful thing. From personal experience I would say that one of the key barriers to adopting new tools is not being able to understand exactly what they do and why.

Documentation for open source tools can sometimes be a bit scattered and impenetrable, or occasionally too technical for me to understand - not the Fixity User Guide!

It starts off by explaining what problem it is trying to solve and it includes step by step instructions with screen shots and an FAQ at the back. Full marks!

Testing Fixity

The Graphical User Interface (GUI)

I like the interface for Fixity. It is clear and easy to understand, and it gives you the flexibility to manage different fixity checks you may want to set up for different types of content or areas of your filestore.

First impressions were that it is certainly easier to use than the Checksum tool I use currently. On the downside though, there were a few glitches or bugs that I encountered when using it.

Yes, I did manage to break it.

Yes, I did have to use the Task Manager to shut it down on several occasions.

Reporting

The reporting is good. I found this to be clearer and more flexible than the reports generated by Checksum. It helps that it is presented in a format that can be opened in a spreadsheet application - this means that you can explore the results in whatever way you like.

Fixity will send you an email summary with statistics about how many files were changed, new, removed or moved/renamed. Attached to this email is a tab delimited file that includes all the details. This can be read into a spreadsheet application for closer inspection. You can quickly filter by the status column and focus in on those files that are new or changed. A helpful way of checking whether all is as it should be.

A useful summary report emailed to me from Fixity

One of the limitations of Checksum is that if I add new files into a directory and forget to update the checksum file, it will not tell me that I have a new file - in theory that file could go without having its fixity created or checked for some time. Alternatively, if I enable the 'synchronise' option it will add the new checksums to the hash file when it next does a verification task. This is helpful, but perhaps I don't want them to be added without notifying me in some way. I would prefer to cast my eye over them to double check that they should be in there.


Scheduling

You can use the Fixity GUI to schedule checksum verification events - so you don't have to kick off your weekly or monthly scan manually. This is a nice feature. If your PC isn't switched on when a scheduled scan is due to start, it simply delays it until you next switch your computer on.

One of the downsides of running a scheduled scan in Fixity is that there is little feedback on the screen to let you know what it is doing and where it has got to. Also given that a scan can take quite a while (over 5 hours in my case) it would be helpful to have a notification to remind you that the scheduled scan is going to start and to allow you to opt out or reschedule if you know that your PC isn't going to be switched on for long enough to make it worthwhile.


Managing projects

The Fixity GUI will allow you to manage more than one project - meaning that you could set up and save different scans on different content using different settings. This is a great feature and a real selling point if you have this use case (I don't at the moment).

One downside I found when testing this was that when moving between different projects, Fixity kept giving me the message that there were unsaved changes in my project and asking if I wanted to save - I don't believe there were unsaved changes at this point so this is perhaps a bug?

Also, selecting the checksum algorithm for your project can be a little clunky. You have to save your project before you can choose which algorithm you would like to use. This feature is hidden up in the Preferences menu but I'd prefer to see it up front and visible in the GUI so you can't forget about it or ignore it.

I thought I had set my project up to use the MD5 algorithm but when looking at the summary report I realised it was using SHA256. I thought it would be a fairly easy procedure to go back into my project and change the algorithm but now Fixity is stuck on a message saying 'Reading Files, please wait...'. It may be reprocessing all of the files in my project but I don't know for sure because there is no indicator of progress. If this is the case I expect I will have to switch my PC off before it has finished.


Progress

Following on from the comment above...one really useful addition to Fixity would be for it to give an indication of how far through a checksum verification task it is (and how long it has left). To be fair, the feedback around progress in Checksum is not exact on timings either - but it does give a clear notification of how many checksum files it still has to look at (there is one in each directory) and how many individual files are left to look at in the directory it is currently in.


Timings

When you kick off a checksum verification task Fixity comes up with a message that says 'Please do not close until a report is generated'. This is useful, but it doesn't give an indication of how long the scan might take, so you have no idea when you kick off the task how long you will need to keep your computer on for. I've had to close it down in the middle of the scan on a few occasions and I don't know whether this has any impact on the checksums that are stored for next time.


The message you get when you manually initiate a scan using Fixity

Fixity certainly takes longer to run a scan than the Checksum tool that I have used over the last few years. The last scan of the digital archive took 5 hours and 49 minutes (though this was using SHA256). The fixity check using Checksum (with MD5) takes around 1 and a half hours so the difference is not inconsequential.

Note I'm currently trying to change the checksum algorithm on my Fixity test project to MD5 so I can do a fairer comparison and it has been thinking about it for over 3 hours.
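
Out of curiosity, it is fairly easy to get a rough feel for how much of that difference is down to the algorithm itself rather than the tool. A quick sketch (using a hypothetical test file) that times MD5 against SHA256 on the same data:

    import hashlib
    import time

    def time_hash(path, algorithm):
        start = time.perf_counter()
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            # hash the file in 1 MB chunks so memory use stays flat
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return time.perf_counter() - start, h.hexdigest()

    for algorithm in ("md5", "sha256"):
        seconds, digest = time_hash("sample_disk_image.iso", algorithm)  # hypothetical test file
        print(algorithm, round(seconds, 1), "seconds", digest)

In practice the speed of the filestore itself probably accounts for a fair chunk of the time too, so a test like this only tells part of the story.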


Where are the checksums stored?

In the Checksum tool I use currently, the checksums themselves are stored in a little text file in each directory of the digital archive. This could be seen as a good thing (all the information you need to verify the files is stored and backed up with the files) or a bad thing (the digital archive is being contaminated with additional administrative files, as are my DROID reports!).
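
To illustrate the per-directory approach (this is just a sketch of the general idea rather than what Checksum actually does internally, and the manifest name and archive path are hypothetical), a verification pass over those little text files might look like this:

    import hashlib
    import os

    def md5_of(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_directory(directory, manifest_name="checksums.md5"):
        # manifest lines are assumed to look like: <md5 hash>  <filename>
        manifest = os.path.join(directory, manifest_name)
        if not os.path.exists(manifest):
            return
        with open(manifest, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                expected, filename = line.strip().split(None, 1)
                if md5_of(os.path.join(directory, filename)) != expected:
                    print("MISMATCH:", os.path.join(directory, filename))

    for root, dirs, files in os.walk(r"D:\digital_archive"):   # hypothetical location
        verify_directory(root)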

Fixity, however, stores the actual checksums in a /History/ folder and also in a SQLite database. When you first install Fixity it asks you to specify a location for the History folder. I initially changed this to my C: drive to stop it clogging up my limited profile space, but in retrospect this may not have been the best option as my C: drive isn't backed up. It would certainly be useful to have backup copies of the checksums elsewhere, though we could debate at length what exactly should be kept and for how long.


How to keep checksums up to date

My current workflow using the Checksum application is to create or update checksums as part of the ingest process or as preservation actions are carried out so that I always have a current checksum for every file that is being preserved.

I'm unsure how I would do this if I was using Fixity. Is there a way of adding checksums into the database or history log without actually running the whole integrity check? I don't think so.

Given the time it takes Fixity to run, running the whole project might not be a practical solution in many cases so files may sit for a little while before checksums are created. There are risks involved with this.
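
For comparison, with the per-directory manifests I use with Checksum the equivalent ingest step is essentially 'hash anything new and append it to the manifest'. A rough sketch of that idea (using the same hypothetical manifest layout and made-up paths as the earlier example):

    import hashlib
    import os

    def add_new_checksums(directory, manifest_name="checksums.md5"):
        manifest = os.path.join(directory, manifest_name)
        # note which files already have a checksum recorded
        already_listed = set()
        if os.path.exists(manifest):
            with open(manifest, encoding="utf-8") as f:
                already_listed = {line.strip().split(None, 1)[1]
                                  for line in f if line.strip()}
        with open(manifest, "a", encoding="utf-8") as out:
            for name in sorted(os.listdir(directory)):
                path = os.path.join(directory, name)
                if name == manifest_name or name in already_listed or not os.path.isfile(path):
                    continue
                h = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1024 * 1024), b""):
                        h.update(chunk)
                out.write(h.hexdigest() + "  " + name + "\n")
                print("Checksum added for new file:", path)

    add_new_checksums(r"D:\digital_archive\accession_2018_42")   # hypothetical accession folder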


The verdict

Fixity is a good tool and I would recommend it to digital archivists who are looking for a user-friendly introduction to fixity checking if they don't have a large quantity of digital content. The documentation, reporting and GUI are clear and user friendly and it is pretty easy to get started with it.

One of the nice things about Fixity is that it has been developed for the digital preservation community with our digital preservation use cases in mind. Perhaps the reason why the documentation is so clear is because it feels like it is written for me (I like to think I am a part of the designated community for Fixity!). In contrast, I find it quite difficult to understand the documentation for Checksum or to know whether I have discovered all that it is capable of.

However, after careful consideration I have decided to stick with Checksum for the time being. The main reason is speed. Clearly Checksum's claims to be a "blisteringly fast, no-nonsense file hashing application for Windows" are not unfounded! I don't have a huge digital archive to manage at 187 GB, but the ability to verify all of my checksums in 1.5 hours instead of 5+ hours is a compelling argument. The digital archive is only going to grow, and speed of operation will become more important over time.

Knowing that I can quickly create or update checksums as part of the ingest or preservation planning process is also a big reason for me to stick to current workflows.

It has been really interesting testing Fixity and comparing it with the functionality of my existing Checksum tool. There are lots of good things about it and it has helped me to review what I want from a fixity checking tool and to examine the workflows and preferences that I have in place with Checksum.

As part of a cycle of continuous improvement as mentioned in the NDSA Fixity Survey Report of 2017 I hope to review file fixity procedures again in another few years. Fixity has potential and I'll certainly be keeping my eye on it as an option for the future.



Jenny Mitcham, Digital Archivist

Friday, 12 January 2018

New year, new tool - TeraCopy

For various reasons I'm not going to start 2018 with an ambitious to do list as I did in 2017 ...I've still got to do much of what I said I was going to do in 2017 and my desk needs another tidy!

In 2017 I struggled to make as much progress as I would have liked - that old problem of having too much to do and simply not enough hours in the day.

So it seems like a good idea to blog about a new tool I have just adopted this week to help me use the limited amount of time I've got more effectively!

The latest batch of material I've been given to ingest into the digital archive consists of 34 CD-ROMs and I've realised that my current ingest procedures were not as efficient as they could be. Virus checking, copying files over from one CD and then verifying the checksums is not very time consuming, but when you have to do this 34 times, you do start to wonder whether your processes could be improved!

In my previous ingest processes, copying files and then verifying checksums had been a two stage process. I would copy files over using Windows Explorer and then use FolderMatch to confirm (using checksums) that my copy was identical to the original.

But why use a two stage process when you can do it in one go?

I'd seen TeraCopy last year whilst visiting The British Library (thanks Simon!) so decided to give it a go. It is a free file transfer utility with a focus on data integrity.

So, I've installed it on my PC. Now, whenever I try and copy anything in Windows it pops up and asks me whether I want to use TeraCopy to make my copy.

One of the nice things about this is that it will also pop up when you accidentally drag and drop a directory into another directory in Windows Explorer (who hasn't done that at least once?) and gives you the opportunity to cancel the operation.

When you copy with TeraCopy it doesn't just copy the files for you, but also creates checksums as it goes along and then at the end of the process verifies that the checksums are the same as they were originally. Nice! You need to tweak the settings a little to get this to work.


TeraCopy busy copying some files for me and creating checksums as it goes


When copying and verifying is complete it tells you how many files it has
verified and shows matching checksums for both copies - job done!
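
Under the hood, a copy-and-verify step boils down to something like the sketch below. This is just the general idea rather than TeraCopy's actual implementation - the paths are made up and MD5 is used purely for illustration:

    import hashlib
    import shutil

    def md5_of(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    def copy_and_verify(source, destination):
        source_hash = md5_of(source)
        shutil.copy2(source, destination)            # copies content and timestamps
        if md5_of(destination) != source_hash:
            raise IOError("Checksum mismatch copying " + source + " to " + destination)
        return source_hash

    # hypothetical paths: a file on a CD being copied into the digital archive
    copy_and_verify(r"E:\report.doc", r"D:\digital_archive\accession_2018_01\report.doc")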

So, this has made the task of copying data from 34 CDs into the digital archive a little bit less painful and has made my digital ingest process a little bit more efficient.

...and that from my perspective is a pretty good start to 2018!

Jenny Mitcham, Digital Archivist

Friday, 18 August 2017

Benchmarking with the NDSA Levels of Preservation

Anyone who has heard me talk about digital preservation will know that I am a big fan of the NDSA Levels of Preservation.

This is also pretty obvious if you visit me in my office – a print out of the NDSA Levels is pinned to the notice board above my PC monitor!

When talking to students and peers about how to get started in digital preservation in a logical, pragmatic and iterative way, I always recommend using the NDSA Levels to get started. Start at level 1 and move forward to the more advanced levels as and when you are able. This is a much more accessible and simple way to start addressing digital preservation than digesting some of the bigger and more complex certification standards and benchmarking tools.

Over the last few months I have been doing a lot of documentation work. Both ensuring that our digital archiving procedures are written down somewhere and documenting where we are going in the future.

As part of this documentation it seemed like a good idea to use the NDSA Levels:

  • to demonstrate where we are
  • to show where improvements need to be made
  • to demonstrate progress in the future


Previously I have used the NDSA Levels in quite a superficial way – as a guide and a talking point. It has been quite a different exercise actually mapping where we stand.

It was not always straightforward to establish where we are and to unpick and interpret exactly what each level meant in practice. I guess this is one of the problems of using a relatively simple set of metrics to describe what is really quite a complex set of processes.

Without publishing the whole document that I've written on this, here is a summary of where I think we are currently. I'm also including some questions I've been grappling with as part of the process.

Storage and geographic location

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 and 4 in place

See the full NDSA levels here


Four years ago we carried out a ‘rescue mission’ to get all digital data in the archives off portable media and on to the digital archive filestore. This now happens as a matter of course when born digital media is received by the archives.

The data isn’t in what I would call a proper digital archive but it is on a fairly well locked down area of University of York filestore.

There are three copies of the data available at any one time (not including the copy that is on original media within the strongrooms). The University stores two copies of the data on spinning disk: one at a data centre on one campus and the other at a data centre on another campus. A further copy is backed up to tape, which is kept for 90 days.

I think I can argue that storage of the data on two different campuses counts as two different geographic locations, but these locations are both in York and only about a mile apart. I'm not sure whether they could be described as having different disaster threats, so I'm going to hold back from putting us at Level 3, though IT do seem to have systems in place to ensure that filestore is migrated on a regular schedule.

Questions:

  • On a practical level, what really constitutes a different geographic location with a different disaster threat? How far away is good enough?


File fixity and data integrity

Currently at LEVEL 4: 'repair your data'

See the full NDSA levels here


Having been in this job for five years now I can say with confidence that I have never once received file fixity information alongside data that has been submitted to us. Obviously if I did receive it I would check it on ingest, but I cannot envisage this scenario occurring in the near future! I do however create fixity information for all content as part of the ingest process.

I use a tool called Foldermatch to ensure that the digital data I have copied into the archive is identical to the original. Foldermatch allows you to compare the contents of two folders and one of the comparison methods (the one I use at ingest) uses checksums to do this.
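
For anyone curious what that comparison involves, it boils down to hashing every file under both folders and flagging any differences. A rough sketch of the idea (not Foldermatch itself, and with made-up paths):

    import hashlib
    from pathlib import Path

    def checksums(folder):
        result = {}
        for path in Path(folder).rglob("*"):
            if path.is_file():
                relative = path.relative_to(folder).as_posix()
                result[relative] = hashlib.md5(path.read_bytes()).hexdigest()
        return result

    original = checksums(r"E:\original_media")                     # hypothetical source
    archive_copy = checksums(r"D:\digital_archive\accession_01")   # hypothetical archive copy

    for name in sorted(set(original) | set(archive_copy)):
        if original.get(name) != archive_copy.get(name):
            print("Difference found:", name)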

Last year I purchased a write blocker for use when working with digital content delivered to us on portable hard drives and memory sticks. A check for viruses is carried out on all content that is ingested into the digital archive so this fulfills the requirements of level 2 and some of level 3.

Despite putting us at Level 4, I am still very keen to improve our processes and procedures around fixity. Fixity checks are carried out at intervals (several times a month) and these checks are logged but at the moment this is all initiated manually. As the digital archive gets bigger, we will need to re-think our approaches to this important area and find solutions that are scalable.

Questions:


  • Does it really matter if fixity isn't checked at 'fixed intervals'? That to me suggests a certain rigidity. Do the intervals really need to be fixed or does it not matter as long as it happens within an agreed time frame?
  • At level 2 we are meant to ‘check fixity on all ingests’ - I am unclear as to what is expected here. What would I check if fixity information hasn’t been supplied (as is always the case currently)? Perhaps it means check fixity of the copy of the data that has been made against the fixity information on the original media? I do do that.


Information Security

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 in place

See the full NDSA levels here


Access to the digital archive filestore is limited to the digital archivist and IT staff who administer the filestore. If staff or others need to see copies of data within the digital archive filestore, copies are made elsewhere after appropriate checks are made regarding access permissions. The master copy is always kept on the digital archive filestore to ensure that the authentic original version of the data is maintained. Access restrictions are documented.

We are also moving towards the higher levels here. A recent issue reported on a mysterious change of last modified dates for .eml files has led to discussions with colleagues in IT, and I have been informed that an operating system upgrade for the server should include the ability to provide logs of who has done what to files in the archive.

It is worth pointing out that, as I don't currently have systems in place for recording PREMIS (preservation) metadata, I am currently taking a hands-off approach to preservation planning within the digital archive. Preservation actions such as file migration are few and far between and are recorded in a temporary way until a more robust system is established.


Metadata

Currently at LEVEL 3: 'monitor your data'

See the full NDSA levels here


We do OK with metadata currently (considering a full preservation system is not yet in place). Using DROID at ingest is helpful in fulfilling some of the requirements of levels 1 to 3 (essentially, having a record of what was received and where it is).

Our implementation of AtoM as our archival management system has helped fulfil some of the other metadata requirements. It gives us a place to store administrative metadata (who gave it to us and when) as well as providing a platform to surface descriptive metadata about the digital archives that we hold.

Whether we actually have descriptive metadata for digital archives will remain an issue. Much metadata for the digital archive can be generated automatically but descriptive metadata isn't quite as straightforward. In some cases a basic listing is created for files within the digital archive (using Dublin Core as a framework) but this will not happen in all cases. Descriptive metadata typically will not be created until an archive is catalogued, which may come at a later date.

Our plans to implement Archivematica next year will help us get to Level 4 as this will create full preservation metadata for us in PREMIS format.

Questions:


  • What is the difference between the 'transformative metadata' mentioned at Level 2 and the preservation metadata mentioned at Level 4? Is this to do with the standards used? For example, at Level 2 you need to be storing metadata about transformations and events that have occurred, but at Level 4 this must be in PREMIS?


File formats

Currently at LEVEL 2: 'know your data' with some elements of LEVEL 3 in place

See the full NDSA levels here


It took me a while to convince myself that we fulfilled Level 1 here! This is a pretty hard one to crack, especially if you have lots of different archives coming in from different sources, and sometimes with little notice. I think it is useful that the requirement at this level is prefaced with "When you can..."!

Thinking about it, we do do some work in this area - for example:

To get us to Level 2, as part of the ingest process we run DROID to get a list of file formats included within a digital archive. Summary stats are kept within a spreadsheet that covers all content within the digital archive so we can quickly see the range of formats that we hold and find out which archives they are in.
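
The summary stats themselves can be pulled straight out of a DROID CSV export with a few lines of script. A rough sketch below, assuming the export includes PUID, FORMAT_NAME and TYPE columns (worth checking against your own export, as column headings can vary between DROID versions):

    import csv
    from collections import Counter

    counts = Counter()
    with open("droid_export.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("TYPE") == "Folder":     # skip folder rows if the export includes them
                continue
            counts[(row.get("PUID", ""), row.get("FORMAT_NAME", ""))] += 1

    for (puid, format_name), count in counts.most_common():
        print(puid, format_name, count)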

This should allow us to move towards Level 3 but we are not there yet. Some pretty informal and fairly ad hoc thinking goes into  file format obsolescence but I won't go as far as saying that we 'monitor' it. I have an awareness of some specific areas of concern in terms of obsolete files (for example I've still got those WordStar 4.0 files and I really do want to do something with them!) but there are no doubt other formats that need attention that haven't hit my radar yet.

As mentioned earlier, we are not really doing migration right now - not until I have a better system for creating the PREMIS metadata, so Level 4 is still out of reach.

Questions:


  • I do think there is more we could do at Level 1, but there has also been concern raised by colleagues that in being too dictatorial you risk altering the authenticity of the original archive and perhaps losing information about how a person or an organisation worked. I'd be interested to hear how others walk this tricky line.
  • Is it a valid answer to simply note at Level 1 that input into the creation of digital files is never given because it has been decided not to be appropriate in the context in which you are working?
  • I'd love to hear examples of how others monitor and report on file obsolescence - particularly if this is done in a systematic way


Conclusions

This has been a useful exercise and it is good to see where we need to progress. Going from using the Levels in the abstract and actually trying to apply them as a tool has been a bit challenging in some areas. I think additional information and examples would be useful to help clear up some of the questions that I have raised.

I've also found that even where we meet a level there are often other ways we could do things better. File fixity and data integrity looks like a strong area for us but I am all too aware that I would like to find a more sustainable and scalable way to do this. This is something we'll be working on as we get Archivematica in place. Reaching Level 4 shouldn't lead to complacency!

An interesting blog post last year by Shira Peltzman from the UCLA Library talked about Expanding the NDSA Levels of Preservation to include an additional row focused on Access. This seems sensible given that the ability to provide access is the reason why we preserve archives. I would be keen to see this developed further so long as the bar wasn't set too high. At the Borthwick my initial consideration has been preservation - getting the stuff and keeping it safe - but access is something that will be addressed over the next couple of years as we move forward with our plans for Archivematica and AtoM.

Has anyone else assessed themselves against the NDSA Levels?  I would be keen to see how others have interpreted the requirements.








Jenny Mitcham, Digital Archivist
