Friday, 28 September 2018

Auditing the digital archive filestore

A couple of months ago I blogged about checksums and the methodology I have in place to ensure that I can verify the integrity and authenticity of the files within the digital archive.

I was aware that my current workflows for integrity checking were 'good enough' for the scale at which I'm currently working, but that there was room for improvement. This is often the case when there are humans involved in a process. What if I forget to create checksums to a directory? What happens if I forget to run the checksum verification?

Also, I am aware that checksum verification does not solve everything. For example, read all about The mysterious case of the changed last modified dates. Also, When checksums don't match... the checksum verification process doesn't tell you what has changed, who has changed it, when it was changed...it just tells you that something has changed. So perhaps we need more information.

A colleague in IT Services here at York mentioned to me that after an operating system upgrade on the filestore server last year, there is now auditing support (a bit more information here). This wasn't being widely used yet but it was an option if I wanted to give it a try and see what it did.

This seemed like an interesting idea so we have given it a whirl. With a bit of help (and the right level of permissions on the filestore), I have switched on auditing for the digital archive.

My helpful IT colleague showed me an example of the logs that were coming through. It has been a busy week in the digital archive. I have ingested 11 memory sticks, 24 CD-ROM and a pile of floppy disks. The logs were extensive and not very user friendly in the first instance.

That morning I had wanted to find out the total size of the born digital archives in the digital archive filestore and had right clicked on the folder and selected 'properties'. This had produced tens of thousands of lines of XML in the filestore logs as the attributes of each individual file had to be accessed by the server in order to make the calculation. The audit logs really are capable of auditing everything that happens to the files!

...but do I really need that level of information? Too much information is a problem if it hides the useful stuff.

It is possible to configure the logging so that it looks for specific types of events. So, while I am not specifically interested in accesses to the files, I am interested in changes to them. We configured the auditing to record only certain types of events (as illustrated below). This cuts down the size of the resulting logs and restricts it just to those things that might be of interest to me.




There is little point in switching this on if it is not going to be of use. So what do I intend to do with the output?

The format this is created in is XML, but this would be more user-friendly in a spreadsheet. IT have worked out how to pull out the relevant bits of the log into a tab delimited format that I can then open in a spreadsheet application.

What I have is some basic information about the date and time of the event, who initiated it, the type of event (eg RENAME, WRITE, ATTRIBUTE|WRITE) and the folder/file that was affected.

As I can view this in a spreadsheet application, it is so easy to reorder the columns to look for unexpected or unusual activity.

  • Was there anyone other than me working on the filestore? (there shouldn't be right now)
  • Was there any activity on a date I wasn't in the office?
  • Were there any activity in a folder I wasn't intentionally working on?
The current plan is that these logs will be emailed to me on a weekly basis and I will have a brief check to ensure all looks OK. This will sit alongside my regular integrity checking as another means of assuring that all is as it should be.

We'll review how this is working a few weeks to see if it continues to be a valuable exercise or should be tweaked further.

In my Benchmarking with the NDSA Levels of Preservation post last year, I put us at level 2 for Information Security (as highlighted in green below).



See the full NDSA levels here

Now we have switched on this auditing feature and have a plan in place for regular checking of the logs, does this now take us to level 4 or is more work required?

I'd be really interested to find out whether other digital archivists are utilising filestore audit logs and what processes and procedures are in place to monitor these.

Final thoughts...

This was a quick win and hopefully will prove a useful tool for the digital archive her at the University of York. It is also a nice little example of collaboration between IT and Archives staff.

I sometimes think that IT people and digital preservation folk don't talk enough. If we take the time to talk and to explain our needs and use cases, then the chances are that IT might have some helpful solutions to share. The tools that we need to do our jobs effectively are sometimes already in place in our institutions. We just need to talk to the right people to get them working for us.

Jenny Mitcham, Digital Archivist

The sustainability of a digital preservation blog...

So this is a topic pretty close to home for me. Oh the irony of spending much of the last couple of months fretting about the future prese...