> But to make things more interesting we had another history outage this week - we
> happen to write history files to an NFS server (not recommended! do not do this!) and
> when the NFS server had a glitch, history files got corrupted - because during the
> glitch NFS was not available, I think this roll-back feature would not have helped.
Actually I put our history data on a separate file system, on a separate disk controlled
by a separate RAID controller! If you write bulk data with the logger and want to read
history files at the same time with mhttpd, you get a bottleneck if both sets of data sit
on the same physical disk. Separating them (and even the controllers) sped things up
dramatically.
The roll-back will not work for NFS, since it requires truncating the file if an event
gets only partially written. While on a full file system you can always *delete* data,
truncating does not work if NFS is down. This explains the behavior.
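For illustration only, a minimal sketch of the roll-back idea, assuming plain POSIX I/O
(the function name and error handling are mine, this is not the actual mlogger code):
remember the file size before writing an event and truncate back to it if the write
comes out short. The truncation is itself a file system operation, so it fails just like
the write when the NFS server is unreachable.

  #include <sys/types.h>
  #include <unistd.h>

  /* sketch: write one event, roll back on a partial write */
  int write_event_with_rollback(int fd, const void *buf, size_t size)
  {
     off_t good_size = lseek(fd, 0, SEEK_END);  /* file size before this event */
     ssize_t n = write(fd, buf, size);

     if (n != (ssize_t) size) {
        /* partial write: truncate back to the last complete event */
        ftruncate(fd, good_size);               /* fails as well if NFS is down */
        return -1;
     }
     return 0;
  }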
> Anyhow, I now have a patch to allow hs_read() to "skip the bad spots" in the history
> files. (hs_gen_index() also needs a patch).
>
> In a nutshell, if invalid history data is detected, the code continues to read the
> data one byte at a time, looking for valid event_id markers (etc).
>
> The code looks sane by inspection, and if nobody objects, I would like to commit it
> in the next few days.
Great. I was thinking of something like this myself. From a quick look, your code looks
good. The best solution, of course, would be to have some "magic number" for
re-synchronizing the data stream, but that would blow up the file length. So searching
for the right event id is good, although it will not work 100% of the time. Also the check
if (irec.time < last_irec_time)
to see if the history is broken is very weak: with random data it will be true 50% of the
time and false 50% of the time. If one instead makes the check
if ((irec.time - last_irec_time) > 3600*24)
it would flag random data correctly in more than 99.99% of all cases (the chance of
random data passing is 3600*24/2^32, about 2*10^-5). Maybe you should change that.
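To make that concrete, here is a rough, self-contained sketch of how the byte-by-byte
resynchronization combined with the tighter time check could look. The record layout,
the size limit and the function names are my assumptions for illustration, not the
actual patch:

  #include <stdio.h>

  typedef unsigned int DWORD;

  typedef struct {            /* assumed history record header */
     DWORD event_id;
     DWORD time;
     DWORD data_size;
  } HIST_RECORD;

  /* a record is plausible if the event id matches, the size is sane and the
     time does not jump by more than one day; a backward jump wraps the unsigned
     difference to a huge value, so random data passes the time test only with
     probability 3600*24/2^32 */
  static int plausible(const HIST_RECORD *r, DWORD event_id, DWORD last_time)
  {
     return r->event_id == event_id &&
            r->data_size > 0 && r->data_size < 1024 * 1024 &&
            (DWORD) (r->time - last_time) <= 3600 * 24;
  }

  /* advance one byte at a time until a plausible header is found;
     return its file offset, or -1 at end of file */
  long resync(FILE *f, DWORD event_id, DWORD last_time)
  {
     HIST_RECORD rec;
     long pos = ftell(f);

     for (;; pos++) {
        if (fseek(f, pos, SEEK_SET) != 0 ||
            fread(&rec, sizeof(rec), 1, f) != 1)
           return -1;          /* ran out of data */
        if (plausible(&rec, event_id, last_time))
           return pos;         /* candidate record found */
     }
  }

Something like hs_read() could then call such a resync whenever a record fails
validation and continue reading normally from the returned offset.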