> But to make things more interesting we had another history outage this week - we
> happen to write history files to an NFS server (not recommended! do not do this!) and
> when the NFS server had a glitch, history files got corrupted - because during the
> glitch NFS was not available, I think this roll-back feature would not have helped.
Actually I put our history data on a separate file system, on a separate disk controlled
by a separate RAID controller! If you write bulk data with the logger and want to read
history files at the same time with mhttpd, you get a bottleneck if both sets of data sit
on the same physical disk. Separating them (and even the controllers) sped things up
dramatically.
The roll-back will not work for NFS, since it requires truncating the file if an event
gets only partially written. While on a full file system you can always *delete* data,
truncating does not work if NFS is down. This explains the behavior.
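For illustration only, a minimal sketch of the roll-back idea, assuming plain POSIX I/O
(the function name and error handling are mine, this is not the actual mlogger code):
remember the file size before writing an event and truncate back to it if the write
comes out short. The truncation is itself a file system operation, so it fails just like
the write when the NFS server is unreachable.

  #include <sys/types.h>
  #include <unistd.h>

  /* sketch: write one event, roll back on a partial write */
  int write_event_with_rollback(int fd, const void *buf, size_t size)
  {
     off_t good_size = lseek(fd, 0, SEEK_END);  /* file size before this event */
     ssize_t n = write(fd, buf, size);

     if (n != (ssize_t) size) {
        /* partial write: truncate back to the last complete event */
        ftruncate(fd, good_size);               /* fails as well if NFS is down */
        return -1;
     }
     return 0;
  }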
> Anyhow, I now have a patch to allow hs_read() to "skip the bad spots" in the history
> files. (hs_gen_index() also needs a patch).
>
> In a nutshell, if invalid history data is detected, the code continues to read the
> data one byte at a time, looking for valid event_id markers (etc).
>
> The code looks sane by inspection, and if nobody objects, I would like to commit it
> in the next few days.
Great. I was thinking of something like this myself. From a quick look, your code looks
good. The best solution, of course, would be to have some "magic number" for
re-synchronizing the data stream, but that would blow up the file length. So searching
for the right event id is good, although it will not work 100% of the time. Also the check
if (irec.time < last_irec_time)
to see if the history is broken is very weak: with random data it will be true 50% of the
time and false 50% of the time. If one instead makes the check
if ((irec.time - last_irec_time) > 3600*24)
it would flag random data correctly in more than 99.99% of all cases (the chance of
random data passing is 3600*24/2^32, about 2*10^-5). Maybe you should change that.
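To make that concrete, here is a rough, self-contained sketch of how the byte-by-byte
resynchronization combined with the tighter time check could look. The record layout,
the size limit and the function names are my assumptions for illustration, not the
actual patch:

  #include <stdio.h>

  typedef unsigned int DWORD;

  typedef struct {            /* assumed history record header */
     DWORD event_id;
     DWORD time;
     DWORD data_size;
  } HIST_RECORD;

  /* a record is plausible if the event id matches, the size is sane and the
     time does not jump by more than one day; a backward jump wraps the unsigned
     difference to a huge value, so random data passes the time test only with
     probability 3600*24/2^32 */
  static int plausible(const HIST_RECORD *r, DWORD event_id, DWORD last_time)
  {
     return r->event_id == event_id &&
            r->data_size > 0 && r->data_size < 1024 * 1024 &&
            (DWORD) (r->time - last_time) <= 3600 * 24;
  }

  /* advance one byte at a time until a plausible header is found;
     return its file offset, or -1 at end of file */
  long resync(FILE *f, DWORD event_id, DWORD last_time)
  {
     HIST_RECORD rec;
     long pos = ftell(f);

     for (;; pos++) {
        if (fseek(f, pos, SEEK_SET) != 0 ||
            fread(&rec, sizeof(rec), 1, f) != 1)
           return -1;          /* ran out of data */
        if (plausible(&rec, event_id, last_time))
           return pos;         /* candidate record found */
     }
  }

Something like hs_read() could then call such a resync whenever a record fails
validation and continue reading normally from the returned offset.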