Message ID: 431     Entry time: 13 Feb 2008     In reply to: 429     Reply to this: 482
Author: Stefan Ritt 
Topic: Info 
Subject: Roll-back for history system added 
> But to make things more interesting we had another history outage this week - we
> happen to write history files to an NFS server (not recommended! do not do this!) and
> when the NFS server had a glitch, history files got corrupted - because during the
> glitch NFS was not available, I think this roll-back feature would not have helped.

Actually I put our history data on a separate file system, on a separate disk controlled
by a separate RAID controller! If you write bulk data with the logger and want to read
history files at the same time with mhttpd, you get a bottleneck if both live on the
same physical disk. Separating the disks (and even the controllers) sped things up
dramatically.

The roll-back will not work over NFS, since it requires truncating the file if an event
gets only partially written. While on a full file system you can always *delete* data,
this does not work if NFS is down. This explains the behavior.
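
As a rough illustration only - the function and names below are made up, not the
actual mlogger code - the roll-back idea amounts to remembering the end of the file
before writing an event and truncating back to that offset if the write comes up
short:

  #include <sys/types.h>
  #include <unistd.h>

  /* write one history event; on a partial write, roll the file back so
     no truncated record is left behind */
  int write_event_with_rollback(int fd, const void *buf, size_t size)
  {
     off_t start = lseek(fd, 0, SEEK_END);   /* end of file before the event */
     ssize_t n   = write(fd, buf, size);

     if (n != (ssize_t) size) {
        /* on a dead NFS mount this ftruncate() itself may fail or hang,
           which is exactly why the roll-back cannot help there */
        (void) ftruncate(fd, start);
        return -1;
     }
     return 0;
  }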

> Anyhow, I now have a patch to allow hs_read() to "skip the bad spots" in the history
> files. (hs_gen_index() also needs a patch).
> 
> In a nutshell, if invalid history data is detected, the code continues to read the
> data one byte at a time, looking for valid event_id markers (etc).
> 
> The code looks sane by inspection, and if nobody objects, I would like to commit it
> in the next few days.

Great. I was thinking of something like this myself. From a quick look, your code
looks good. The best solution would of course be some "magic number" for
re-synchronizing the data stream, but that would blow up the file length. So searching
for the right event id is good, although it will not work 100% of the time. Also, the check

  if (irec.time < last_irec_time)

to see if the history is broken is very weak: with random data it is true 50% of the
time and false 50% of the time. If one instead makes the check

  if ((irec.time - last_irec_time) > 3600*24)

it would work correctly with random data in >99% of all cases, since only a window of
3600*24 out of the 2^32 possible time differences slips through (3600*24/2^32 ~ 2e-5).
Maybe you should change that.
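
Put as code - a sketch only, with made-up names and a simplified record layout
rather than the real midas.h structures (alignment and padding ignored for brevity) -
the byte-wise resync together with the stricter time test would look roughly like this:

  #include <string.h>

  /* simplified, hypothetical record header - not the real history format */
  typedef struct {
     unsigned short event_id;
     unsigned int   time;        /* seconds since the Epoch */
     unsigned int   data_size;
  } HDR;

  /* a time stamp is plausible if it neither runs backwards nor jumps
     ahead by more than one day; random 32-bit data passes this test
     only with probability 3600*24/2^32, i.e. about 2e-5 */
  static int time_ok(unsigned int t, unsigned int last_t)
  {
     return t >= last_t && t - last_t < 3600 * 24;
  }

  /* after a corrupt spot at 'pos', advance one byte at a time until a
     candidate header carries the wanted event id, a sane size and a
     plausible time stamp; return -1 if none is found before 'size' */
  long resync(const unsigned char *buf, long size, long pos,
              unsigned short wanted_id, unsigned int last_time)
  {
     HDR h;
     for (; pos + (long) sizeof(h) <= size; pos++) {
        memcpy(&h, buf + pos, sizeof(h));
        if (h.event_id == wanted_id &&
            h.data_size > 0 && h.data_size < 1000000 &&
            time_ok(h.time, last_time))
           return pos;            /* looks like a valid record header */
     }
     return -1;
  }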