Midas DAQ System, Page 78 of 155
    Reply  26 May 2014, Konstantin Olchanski, Forum, Running a frontend on Arduino Yun 
> I'm trying to get a frontend running on an arduino yun single board computer
> (cpu is Atheros AR9331 and OS is a linux derivate
> http://arduino.cc/en/Main/ArduinoBoardYun )

What you want to do should be possible.

Here, the smallest machine we have used to run a MIDAS frontend was a 300 MHz PowerPC processor inside a 
Virtex4 FPGA with 256 Mbytes of RAM. It looks like your machine is a 400 MHz MIPS with 64 Mbytes of RAM, 
so there should be enough hardware available to run a MIDAS frontend under Linux.

One source of trouble could be if your MIPS CPU is running in big-endian mode (MIPS can do either big-
endian or little-endian). MIDAS supports big-endian frontends connecting to little-endian x86 PC hosts, 
but with big-endian machines getting less common, this code does not get much testing. If you run into 
trouble with this, please let us know and we will fix it for you.
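
A quick way to find out which byte order the board actually runs in is to compile and run a tiny probe on the target. This is a minimal sketch, not MIDAS code; the function names are made up for illustration:

```c
#include <stdint.h>

/* Returns 1 on a big-endian CPU, 0 on a little-endian one.
   Looks at which byte of a 16-bit value is stored first in memory. */
int is_big_endian(void)
{
   const uint16_t probe = 0x0102;
   return *(const unsigned char *) &probe == 0x01;
}

/* Reverse the byte order of a 32-bit word, as needed when a
   big-endian frontend exchanges raw data with a little-endian host. */
uint32_t swap32(uint32_t v)
{
   return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
          ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

If is_big_endian() returns 1 on the Yun, the less-tested big-endian code path mentioned above is the one that will be exercised.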

> The idea is to use this device for some slow control for our experiment (ASACUSA
> Antihydrogen) we are using midas as main DAQ system and we would like to
> integrate the slow control with this small boards.

> My question is: How can I compile the midas library with the openwrt crosscompiler?

In the MIDAS Makefile, look for the "crosscompile" target, which we use to cross-build MIDAS for our 
PowerPC target using the regular GCC cross-compiler chain. If you have a very new MIDAS, you will also see 
some make targets for ARM Linux machines, also using GCC cross compilers.

> the system discspace is very limited (6 MB) therefore I don't want to have mysql, zlib an so on.

The MIDAS Makefile "crosscompile" target builds a very minimalistic version of MIDAS: no mysql, no sqlite, etc. 
are required for the MIDAS libraries and frontend. zlib may be required, but it is not used by frontend 
code, so you may try to disable it.

If that is still too big, there is a possibility of building a super-minimal version of MIDAS just for running 
cross-compiled frontends. We use this function to build MIDAS for VxWorks. If you want to try that, I 
think it is not in the main Makefile, but in the VxWorks Makefile. Let me know if you want this and I can 
probably restore this function into the main Makefile fairly quickly.

> Do you have any suggestions on how to realize something like that?

1) cross compile MIDAS (see the Makefile "make crosscompile" target)
2) cross compile your frontend
3) run it; with luck, it will fit into your 64 Mbytes of RAM

If you run into problems, please post them here (so other people can see the problems and the solutions).

K.O.
    Reply  27 May 2014, Clemens Sauerzopf, Forum, Running a frontend on Arduino Yun 
OK, I'm currently trying to get things running. Setting up a cross-compiler toolchain for the Arduino Yun is fairly
easy: just follow the tutorial on the OpenWrt web page.

The main problem is that OpenWrt uses the uClibc library instead of glibc, which causes several difficulties. The first
one is that building the shared library fails with complaints about symbol name mismatches. I guess this can be
fixed somehow, but I won't use the midas-shared library, so I just disabled it in the Makefile.

The next problem is the backtrace functions that are used within system.c: backtrace and
backtrace_symbols are only available in glibc. As a quick fix I changed the #ifdef directive so that this
code is not built.

There is a trickier problem: the compiler complains about mismatched function definitions:

In file included from include/midasinc.h:17:0,
                 from include/msystem.h:35,
                 from src/sequencer.cxx:13:
/home/clemens/arduino/openwrt-yun/build_dir/toolchain-mips_r2_gcc-4.6-linaro_uClibc-0.9.33.2/uClibc-0.9.33.2/include/string.h:495:41:
error: declaration of 'size_t strlcat(char*, const char*, size_t) throw ()' has a different exception specifier
include/midas.h:1955:17: error: from previous declaration 'size_t strlcat(char*, const char*, size_t)'
/home/clemens/arduino/openwrt-yun/build_dir/toolchain-mips_r2_gcc-4.6-linaro_uClibc-0.9.33.2/uClibc-0.9.33.2/include/string.h:498:41:
error: declaration of 'size_t strlcpy(char*, const char*, size_t) throw ()' has a different exception specifier
include/midas.h:1954:17: error: from previous declaration 'size_t strlcpy(char*, const char*, size_t)'

This can be solved by editing the midas.h file. Change

size_t EXPRT strlcpy(char *dst, const char *src, size_t size);

to

size_t EXPRT strlcpy(char *dst, const char *src, size_t size) __THROW __nonnull ((1, 2));

and change

size_t EXPRT strlcat(char *dst, const char *src, size_t size);

to

size_t EXPRT strlcat(char *dst, const char *src, size_t size) __THROW __nonnull ((1, 2));

The same trick has to be done in ../mxml/strlcpy.h

After changing this, midas compiles with the cross-compiler and the resulting programs are executable on the Arduino
Yun. I'll report back once I get my frontend to run and connect to the midas server.
    Reply  27 May 2014, Konstantin Olchanski, Forum, Running a frontend on Arduino Yun 
> Ok, I'm currently trying to get things running, setting up a crosscompiler toolchain for the Arduino Yun is fairly
> easy, just follow the tutorial on the  OpenWrt webpage.
> 
> The main problem is that openwrt uses the uClibc library instead of glibc this produces lots of difficulties
>

OK, I see. I do not think we have used uClibc with MIDAS yet.

>
> one is that building of the shared library is complaining about symbol name mismatches, but I guess this can be
> fixed somehow, I wont use the midas-shared library, therefore I just disabled it in the Makefile. 
> 

The shared library is generally not used. The Makefile builds it as a convenience for things like pymidas, etc.

> 
> The next problem is the backtrace functions tjhat are used within system.c, the functions backtrace and
> backtrace_symbols are only available in glibc for a quick fix I just changed the #ifdef directive in a way that this
> code is not built.
>

Yes. They should probably be behind an #ifdef GLIBC (whatever the GLIBC identifier is)
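
A sketch of what such a guard could look like, not the actual system.c code. The glibc identifier is __GLIBC__, but to my knowledge uClibc also defines __GLIBC__ for compatibility and additionally defines __UCLIBC__, so the test below excludes it explicitly. The function name and the HAVE_BACKTRACE macro are made up for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

/* Use backtrace() only with real glibc; assumption: uClibc defines
   __GLIBC__ too, so it has to be excluded via __UCLIBC__. */
#if defined(__GLIBC__) && !defined(__UCLIBC__)
#include <execinfo.h>
#define HAVE_BACKTRACE 1
#else
#define HAVE_BACKTRACE 0
#endif

/* Print up to max_frames stack frames to stderr.
   Returns the number of frames printed, 0 if not supported. */
int print_stack_trace(int max_frames)
{
#if HAVE_BACKTRACE
   void *frames[64];
   if (max_frames > 64)
      max_frames = 64;
   if (max_frames < 0)
      max_frames = 0;

   int n = backtrace(frames, max_frames);
   char **symbols = backtrace_symbols(frames, n);
   if (symbols == NULL)
      return 0;

   for (int i = 0; i < n; i++)
      fprintf(stderr, "  [%d] %s\n", i, symbols[i]);

   free(symbols);   /* backtrace_symbols() returns one malloc'ed block */
   return n;
#else
   (void) max_frames;
   fprintf(stderr, "stack trace not available with this C library\n");
   return 0;
#endif
}
```

With a guard like this the same source builds against glibc and against C libraries without backtrace support, at the price of losing the stack dump on the latter.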

> 
> There is a more tricky problem, the compiler complains about mismatched function defintions:
> 
> error: declaration of 'size_t strlcat(char*, const char*, size_t) throw ()' has a different exception specifier
> error: declaration of 'size_t strlcpy(char*, const char*, size_t) throw ()' has a different exception specifier
> 
> This can be solved by editing the midas.h file:
> size_t EXPRT strlcpy(char *dst, const char *src, size_t size); -> size_t EXPRT strlcpy(char *dst, const char *src,
> size_t size) __THROW __nonnull ((1, 2));
> 

No need to edit anything, this is controlled by NEED_STRLCPY in the Makefile - to enable our own strlcpy on systems that do not provide it (hello, GLIBC!)
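
The usual shape of such a switch, sketched outside of MIDAS: the fallback implementation is compiled only when the build asks for it. The names DEMO_NEED_STRLCPY and demo_strlcpy are made up so that the example cannot clash with a C library that already provides strlcpy; in MIDAS the switch is NEED_STRLCPY, and how exactly the Makefile passes it to the compiler is an assumption here:

```c
#include <stddef.h>
#include <string.h>

/* In a real build this would come from the compiler command line
   (a -D option set in the Makefile), not from the source file. */
#define DEMO_NEED_STRLCPY 1

#if DEMO_NEED_STRLCPY
/* Copy src into dst, writing at most size bytes including the
   terminating '\0'. Returns strlen(src), so the caller can detect
   truncation by checking for a return value >= size. */
size_t demo_strlcpy(char *dst, const char *src, size_t size)
{
   size_t len = strlen(src);

   if (size > 0) {
      size_t n = (len < size - 1) ? len : size - 1;
      memcpy(dst, src, n);
      dst[n] = '\0';
   }
   return len;
}
#endif
```

On a C library that ships its own strlcpy (uClibc does, as the compiler errors above show), the switch is simply left off and no conflicting declaration is seen.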

> 
> After changing this midas compiles with the crosscompiler and the resulting programs are executable on the Arduino
> Yun. I'll report back if I got my frontend to run and connect to the midas server.

Congratulations!

K.O.
    Reply  28 May 2014, Clemens Sauerzopf, Forum, Running a frontend on Arduino Yun 
Thank you very much for your input, it finally works. I succeeded in cross-compiling the frontend and running it on the Arduino Yun. The 64 MB of RAM is more than
enough to run the mserver and a frontend and connect to a remote midas server over ethernet or wifi.

Just for reference, if someone tries something similar: to directly access the serial interface between the processor running Linux and the Atmel processor, it
is required to comment out a line in /etc/inittab:

#ttyATH0::askfirst:/bin/ash --login

This line starts a shell on the serial connection. With it disabled, it is possible to run more or less unmodified code (the serial interface needs to be
Serial1) on the Atmel side and use the Linux processor as slow control PC.

Thanks again for your help!
    Reply  24 Oct 2014, Clemens Sauerzopf, Forum, Running a frontend on Arduino Yun 
Hello,

I'm currently trying to create a midas bank for basic temperature readings from the Arduino Yun, but when creating a bank the frontend crashes with a segfault. My
code currently looks like this:

INT read_event(char *pevent, INT off)
{
  WORD *data;
  //printf("before init\n");
  bk_init(pevent);
  //printf("after init\n");
  bk_create(pevent, "TEM0", TID_WORD, data); // <= we are dieing at this line
  //printf("after create\n");

  bk_close(pevent, data);

  return bk_size(pevent);
}

Does anyone have an idea how to track this problem down? Running a debugger is a little bit tricky on this processor.

Thanks!
    Reply  24 Oct 2014, Stefan Ritt, Forum, Running a frontend on Arduino Yun 
> Hello,
> 
> I'm currently trying to create a midas bank for basic temperature reading from the Arduino Yun, but when creating a bank the frontend crashed with a segfault, my
> code currently looks like this:
> 
> INT read_event(char *pevent, INT off)
> {
>   WORD *data;
>   //printf("before init\n");
>   bk_init(pevent);
>   //printf("after init\n");
>   bk_create(pevent, "TEM0", TID_WORD, data); // <= we are dieing at this line
>   //printf("after create\n");
> 
>   bk_close(pevent, data);
> 
>   return bk_size(pevent);
> }
> 
> Does anyone have an Idea how to tackle this problem down? running a debugger is a little bit tricky on a this processor..
> 
> Thanks!

Two bugs:

bk_create(pevent, "TEM0", TID_WORD, &data);

Note the "&" in front of data. Then you have to increment the pointer for each value you add to the bank:

  *data = <temp>;
  data++;
  bk_close(pevent, data);

This way the bk_close() function knows how much data you added to the bank.

Cheers,
Stefan
    Reply  24 Oct 2014, Konstantin Olchanski, Forum, Running a frontend on Arduino Yun 
> INT read_event(char *pevent, INT off)
> {
>   WORD *data;
>   bk_create(pevent, "TEM0", TID_WORD, data); // <= we are dieing at this line
> }

The declaration of bk_create() in midas.h is wrong:

void EXPRT bk_create(void *pbh, const char *name, WORD type, void *pdata);
should be
void EXPRT bk_create(void *pbh, const char *name, WORD type, void **pdata);

Notice the extra "*" in "void**pdata" to indicate that it takes a pointer to the pointer to the data.

With the correct definition, you should get a compile error (type mismatch).

With the wrong current definition, you should have gotten a warning about "use of uninitialized variable 'data'", but some compilers with some settings do not generate this warning.

As it is, without looking at an example (highly recommended) and reading documentation (do we even have a "frontend writing guide"?!?) you have
no way to tell if you should pass "data" or "&data" to bk_create().

Thank you for reporting this problem.

P.S. As for running on Arduino, for slow-controls type applications any CPU and network speed should be OK,
but memory use is always a concern, so please speak up if you run into problems. We routinely run MIDAS frontends
on Linux machines with 512M and 128M RAM (1GHz CPU, 100 and 1000 Mbit/s ethernet).

K.O.
    Reply  02 Nov 2014, Stefan Ritt, Forum, Running a frontend on Arduino Yun 
> With the correct definition, you should get a compile error (type mismatch).
> 
> With the wrong current definition, you should have gotten a warning about "use of uninitialized variable 'data'", but some compilers with some settings do not generate this warning.

I changed the declaration of the bk_create function to take a void **pdata pointer, but that did not really help. Now I get a compiler error:

"Incompatible pointer type passing 'DWORD **' to parameter of type 'void **'", so I need an explicit cast each time:

bk_create(... (void **)&pdata);

But I think this is better than what we had before, so I leave it. Please note that all front-ends using bk_create need to be modified accordingly to suppress this warning.
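
The error comes from the fact that C converts any object pointer to and from void * implicitly, but does not convert DWORD ** to void **. A toy sketch of the out-parameter pattern follows. It is not the MIDAS API: demo_bank_create, demo_bank_close and demo_fill_event are made-up names. It also shows one way to avoid the cast, by going through a void * temporary:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a bank-creating call: it hands the address of the
   data area back to the caller through the pointer-to-pointer. */
void demo_bank_create(void *event_buffer, void **pdata)
{
   *pdata = event_buffer;
}

/* Toy stand-in for closing a bank: the bank size is the distance
   between the start of the data area and the advanced data pointer. */
size_t demo_bank_close(const void *event_buffer, const void *pdata)
{
   return (size_t) ((const char *) pdata - (const char *) event_buffer);
}

/* Store one 16-bit reading and return the number of bytes used. */
size_t demo_fill_event(void *event_buffer, uint16_t temperature)
{
   void *p = NULL;
   uint16_t *data;

   demo_bank_create(event_buffer, &p);   /* pass the ADDRESS of the pointer */
   data = p;                             /* implicit void * conversion, no cast */

   *data = temperature;
   data++;                               /* advance once per value stored */

   return demo_bank_close(event_buffer, data);
}
```

Passing an uninitialized pointer by value, as in the original read_event(), gives the callee nothing to write the address into, which is why the frontend crashed.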

/Stefan
Entry  06 Nov 2009, Jimmy Ngai, Forum, Run multiple frontend on the same host 
Dear All,

I want to run two frontend programs (one for trigger and one for slow control)
concurrently on the same computer, but I failed. The second frontend said: 

Semaphore already present
 There is another process using the semaphore.
 Or a process using the semaphore exited abnormally.
 In That case try to manually release the semaphore with:
   ipcrm sem XXX.

The two frontends are connected to the same experiment. Is there any way I can
overcome this problem?

Thanks!

Jimmy
    Reply  27 Nov 2009, Stefan Ritt, Forum, Run multiple frontend on the same host 
> Dear All,
> 
> I want to run two frontend programs (one for trigger and one for slow control)
> concurrently on the same computer, but I failed. The second frontend said: 
> 
> Semaphore already present
>  There is another process using the semaphore.
>  Or a process using the semaphore exited abnormally.
>  In That case try to manually release the semaphore with:
>    ipcrm sem XXX.
> 
> The two frontends are connected to the same experiment. Is there any way I can
> overcome this problem?

That might be related to the RPC mutex, which gets created system wide now. I 
modified this in midas.c rev. 4628, so there will be one mutex per process. Can you 
try that temporary patch and tell me if it works for you?
    Reply  07 Dec 2009, Jimmy Ngai, Forum, Run multiple frontend on the same host 
Dear Stefan,

Thanks for the reply. I have tried your patch and it didn't solve my problem. Maybe I 
have not written my question clearly. The two frontends can run on the same computer 
if I use the remote method, i.e. by setting up the mserver and connecting to the 
experiment by specifying "-h localhost"; the frontend programs also need to be put in 
different directories. What I want to know is whether I can simply start multiple 
frontends in the same directory without setting up the mserver etc. I noticed that 
there are several *.SHM files. I'm not familiar with semaphores, but I guess they are 
the key to the problem. Please correct me if I misunderstood something.

Best Regards,
Jimmy


    Reply  08 Dec 2009, Stefan Ritt, Forum, Run multiple frontend on the same host Capture.png
Hi Jimmy,

ok, now I understand. Well, I don't see your problem. I just tried with the 
current SVN version to start

midas/examples/experiment/frontend
midas/examples/slowcont/scfe

in the same directory (without "-h localhost") and it works just fine (see 
attachment). I even started them from the same directory. Yes, there are *.SHM 
files and they correspond to shared memory, but both front-ends use this shared 
memory together (that's why it's called 'shared').

Your error message 'Semaphore already present' is strange. The string is not 
contained in any midas program, so it must come from somewhere else. Do you 
maybe try to access the same hardware with the two front-end programs?

I would propose you do the following: Use the two front-ends from the 
distribution (see above). They do not access any hardware. See if you can run 
them with the current SVN version of midas. If not, report back to me.

Best regards,

  Stefan


    Reply  12 Dec 2009, Jimmy Ngai, Forum, Run multiple frontend on the same host 
Dear Stefan,

I followed your suggestion to try the sample front-ends from the distribution and 
they work fine. They also work fine with any one of my front-ends. Only my two 
front-ends cannot run concurrently in the same directory. I later found that the 
problem is in the CAEN HV wrapper library. The problem arises when the front-ends 
are both linked to that library and it is solved now.

Thanks & Best Regards,
Jimmy


Entry  11 Mar 2019, Francesco Renga, Forum, Run length 
Dear all,
        I need to implement a DAQ sequence where a short run (100 events, which takes a couple of 
minutes) is taken every hour, with a long run in between two short runs. In the sequencer, I can do:

LOOP infinite

.... some ODB settings ....
     TRANSITION START
     WAIT events 100
     TRANSITION STOP

.... some ODB settings ....
     TRANSITION START
     WAIT seconds 3600
     TRANSITION STOP

ENDLOOP


I have two questions: 

- for the long run, I want to write on disk only a maximum number of events. I think I can suppress 
the event polling in the frontend, with an ODB query of the number of collected events. I'm 
wondering if there is a smarter way to do that. It is also ok if the run is stopped after a maximum 
number of events, but the subsequent short run should still start exactly after 1h from the previous 
short run. 

- with the script above, the real time lapse between the start of two short runs would depend on 
the duration of the short run itself. Is there a way to start the short run exactly 1 h after the starting 
of the previous short run?

Thank you in advance for your help,
              Francesco
    Reply  12 Mar 2019, Stefan Ritt, Forum, Run length 
> Is there a way to start the short run exactly 1 h after the starting 
> of the previous short run?

This is not possible with the current sequencer.
    Reply  12 Mar 2019, Pierre Gorel, Forum, Run length 
> 
> .... some ODB settings ....
>      TRANSITION START
>      WAIT events 100
>      TRANSITION STOP
> I have two questions: 
> 
> - for the long run, I want to write on disk only a maximum number of events. I think I can suppress 
> the event polling in the frontend, with an ODB query of the number of collected events. I'm 
> wondering if there is a smarter way to do that. It is also ok if the run is stopped after a maximum 
> number of events, but the subsequent short run should still start exactly after 1h from the previous 
> short run. 

I don't know about a way to give you an exact number of events (maybe /Logger/Run duration). 

I personally use 
    WAIT ODBValue,"/Equipment/DTM/Statistics/Events sent",>,100

Where DTM is the frontend of my trigger. Because of the lag in stopping the run, the run will always overshoot by a few
seconds times the event rate.

Hope it helps
    Reply  13 Mar 2019, Konstantin Olchanski, Forum, Run length 
I did not quite understand your desired sequence, is this what you want:

- at 1pm
- start a run
- record 100 events
- end the run
- (this will be, say, 1:15pm)
- start a run
- at 2pm
- end the run
- start a run
- record 100 events
- ad infinitum

There are 2 difficulties with this:

1) If you want your cycle to be exactly 1 hour, you need to use cron or something similar - if you just "start; sleep 3600; stop", 
your cycle will be slightly longer than 1 hour because starting and stopping runs takes some time to complete.

2) if you want your "100 event" run to start exactly precisely on the hour, you need to stop the previous run a few 
minutes/seconds before the hour to avoid the "run stop" delay.
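
To make difficulty (1) concrete: a scheduler that sleeps a fixed 3600 seconds after each cycle adds the transition time to every cycle, while one that computes the time remaining until the next absolute boundary does not drift. A minimal sketch in C, not part of MIDAS; the function name and the idea of anchoring all cycles at one fixed start time are assumptions for illustration:

```c
#include <time.h>

/* Seconds from `now` until the next cycle boundary, for cycles of
   `period` seconds anchored at `start`. If `now` falls exactly on a
   boundary, the following boundary is returned (one full period). */
long seconds_until_next_cycle(time_t start, long period, time_t now)
{
   if (period <= 0)
      return 0;
   if (now < start)
      return (long) (start - now);

   long elapsed = (long) (now - start);
   return period - (elapsed % period);
}
```

For example, if the short run started at the top of the hour and the transitions finished 15 minutes later, the function returns 2700 rather than 3600, so the next short run still starts on the hour. Subtracting a few seconds from the result gives the lead time needed for difficulty (2).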

Instead of using the sequencer, I would use a shell script (run it from crontab to avoid problem (1)):

#!/bin/sh
mtransition STOP # stop the previous long run
odbedit -c 'set "/logger/channels/0/settings/event limit" 100'
odbedit -c 'set "/logger/auto restart" "y"'
mtransition START # start the short run
# end

In your frontend end_of_run() function, reset the limit (pseudocode; in C this would be an ODB write, for example with db_set_value()):
odb set "/logger/channels/0/settings/event limit" 0

This will produce the following sequence:

- script will stop previous long run
- set event limit to 100, start the "100 events" run
- logger will stop at 100 events, call your frontend end_of_run(), set event limit to 0
- logger auto restart will start a new run, event limit is now 0, this is your long run
- on the hour, cron runs your script, cycle repeats from the top.

Instead of cron, you can use a looper script. Note that you must run
the main script in the background (note the "&") to avoid problem (1).

#!/bin/sh
# looper script
while true; do
   main_script &
   sleep 3600
done
# end

To stop the sequence, kill the looper script.

K.O.


Entry  07 Jan 2008, Stefan Ritt, Info, Roll-back for history system added 
The midas history system always had the problem that the database can get
corrupted if the disk gets full where the history records (*.hst & *.idx) are
stored. This can happen if a history event can only be written partially on the
almost full disk. If later some space is freed up (by deleting other files), the
writing continues at the old position, leaving the partial event in the data
base. In that case the whole history data of the current day cannot be read
because it is corrupted.

To solve the problem, a roll-back system has been implemented in the
hs_write_event() function. If an event cannot be written fully, the history file
is restored to the old state, so the partial event is removed from the end of
the file via truncation. This way only the data which could not be written to
the disk is missing in the history file, but the other data from that day is
still valid and readable. The change has been committed in revision 4107.
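
The roll-back idea can be sketched with plain POSIX calls. This is not the actual hs_write_event() code; the function name is hypothetical and error handling is reduced to the minimum:

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Append len bytes to the file behind fd. If the write fails or is
   only partial (e.g. disk full), truncate the file back to the length
   it had before, so no half-written record is left at the end.
   Returns 0 on success, -1 on failure. */
int append_with_rollback(int fd, const void *buf, size_t len)
{
   off_t start = lseek(fd, 0, SEEK_END);   /* remember the old end of file */
   if (start < 0) {
      perror("lseek");
      return -1;
   }

   ssize_t written = write(fd, buf, len);
   if (written == (ssize_t) len)
      return 0;

   /* partial or failed write: drop the incomplete record */
   if (ftruncate(fd, start) == 0)
      lseek(fd, start, SEEK_SET);
   else
      perror("ftruncate");

   return -1;
}
```

Truncation only frees space, so it is expected to succeed on a full local disk; it cannot succeed while the file system itself is unreachable.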
    Reply  13 Feb 2008, Konstantin Olchanski, Info, Roll-back for history system added 
> The midas history system always had the problem that the database can get
> corrupted if the disk gets full where the history records (*.hst & *.idx) are
> stored.

Stefan - big thanks for fixing this problem - it is one of those cases of "how come I
did not think of doing it!".

This change should fix the last remaining problem with history at CERN - we seem to
be unable to avoid running out of disk space once in a while (runaway scripts, fat
fingers, etc) and history got corrupted every time.

But to make things more interesting, we had another history outage this week - we
happen to write history files to an NFS server (not recommended! do not do this!) and
when the NFS server had a glitch, history files got corrupted. Because NFS was not
available during the glitch, I think this roll-back feature would not have helped.

Anyhow, I now have a patch to allow hs_read() to "skip the bad spots" in the history
files. (hs_gen_index() also needs a patch).

In a nutshell, if invalid history data is detected, the code continues to read the
data one byte at a time, looking for valid event_id markers (etc).

The code looks sane by inspection, and if nobody objects, I would like to commit it
in the next few days.

Here is the diff against src/history.c rev 4114

Index: history.c
===================================================================
--- history.c	(revision 4118)
+++ history.c	(working copy)
@@ -129,6 +129,7 @@
    HIST_RECORD rec;
    INDEX_RECORD irec;
    DEF_RECORD def_rec;
+   int recovering = 0;
 
    printf("Recovering index files...\n");
 
@@ -171,7 +172,7 @@
 
          /* skip tags */
          lseek(fh, rec.data_size, SEEK_CUR);
-      } else {
+      } else if (rec.record_type == RT_DATA) {
          /* write index record */
          irec.event_id = rec.event_id;
          irec.time = rec.time;
@@ -180,6 +181,15 @@
 
          /* skip data */
          lseek(fh, rec.data_size, SEEK_CUR);
+      } else {
+
+         if (!recovering)
+            cm_msg(MERROR, "hs_gen_index", "broken history file %d, trying to recover", (int)ltime);
+
+	 recovering = 1;
+         lseek(fh, -sizeof(rec)+1, SEEK_CUR);
+
+         continue;
       }
 
    } while (TRUE);
@@ -220,6 +230,7 @@
    time_t lt;
    int fh, fhd, fhi;
    struct tm *tms;
+   int idxsize = 0;
 
    if (*ltime == 0)
       *ltime = ss_time();
@@ -250,12 +261,15 @@
    hs_open_file(*ltime, "idf", O_RDONLY, &fhd);
    hs_open_file(*ltime, "idx", O_RDONLY, &fhi);
 
+   if (fhi >= 0)
+     idxsize = lseek(fhi, 0, SEEK_END);
+
    close(fh);
    close(fhd);
    close(fhi);
 
    /* generate them if not */
-   if (fhd < 0 || fhi < 0)
+   if (fhd < 0 || fhi < 0 || idxsize == 0)
       hs_gen_index(*ltime);
 
    return HS_SUCCESS;
@@ -1480,12 +1494,33 @@
             i = -1;
             M_FREE(cache);
             cache = NULL;
-         } else
+         } else {
+
+	 try_again:
+
             i = sizeof(irec);
-
-         if (cp < cache_size) {
             memcpy(&irec, cache + cp, sizeof(irec));
             cp += sizeof(irec);
+
+	    /* if history file is broken ... */
+	    if (irec.time < last_irec_time) {
+	      //printf("time %d -> %d, cache_size %d, cp %d\n", last_irec_time, irec.time, cache_size, cp);
+
+	      //printf("Seeking next record...\n");
+
+	      while (cp < cache_size)
+		{
+		  DWORD* evidp = (DWORD*)(cache + cp);
+		  if (*evidp == event_id) {
+		    //printf("Found at cp %d\n", cp);
+		    goto try_again;
+		  }
+
+		  cp++;
+		}
+
+	      i = -1;
+	    }
          }
       } else
          i = read(fhi, (char *) &irec, sizeof(irec));

K.O.
    Reply  13 Feb 2008, Stefan Ritt, Info, Roll-back for history system added 
> But to make things more interesting we had another history outage this week - we
> happen to write history files to an NFS server (not recommened! do not do this!) and
> when the NFS server had a glitch, history files got corrupted - because during the
> glitch NFS was not available, I think this roll-back feature would not have helped.

Actually I put our history data on a separate file system, on a separate disk controlled
by a separate RAID controller! If you write bulk data with the logger, and want to read
history files at the same time with mhttpd, you get a bottleneck if both data are at the
same physical disk. Separating this (and even the controller) speeded things up
dramatically.

The rollback will not work for NFS, since it requires truncating the file if an event
gets only partially written. While on a full file system you always can *delete* data,
this does not work if NFS is down. This explains the behavior.

> Anyhow, I now have a patch to allow hs_read() to "skip the bad spots" in the history
> files. (hs_gen_index() also needs a patch).
> 
> In the nutshell, if invalid history data is detected, the code continues to read the
> data one byte at a time, looking for valid event_id markers (etc).
> 
> The code looks sane by inspection, and if nobody objects, I would like to commit it
> in the next few days.

Great. I was thinking of something like this myself. From a quick look, your code
looks good. The best of course would be if we had some "magic number" for
re-synchronizing the data stream, but that would blow up the file length. So searching
for the right event id is good, but will not work 100%. Also the check

  if (irec.time < last_irec_time)

to see if the history is broken is very weak. If you take random data, it will be true
50% and false 50%. If one makes however a check

  if ((irec.time - last_irec_time) > 3600*24)

this would work correctly with random data in >99% of all cases (3600*24/2^32). Maybe
you should change that.
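
A small sketch of the stronger check, assuming the timestamps are unsigned 32-bit values; the function and macro names are made up for illustration. With unsigned arithmetic a step backwards in time wraps around to a huge difference, so the single comparison rejects both backward jumps and forward jumps of more than one day:

```c
#include <stdint.h>

/* Largest step between consecutive records that is still believed:
   one day, since a history file covers one day. */
#define MAX_PLAUSIBLE_GAP (3600u * 24u)

/* Returns 1 if next_time cannot plausibly follow last_time.
   A backward step makes the unsigned difference wrap to a value
   close to 2^32, which is far above the limit. */
int time_step_is_suspicious(uint32_t last_time, uint32_t next_time)
{
   return (uint32_t) (next_time - last_time) > MAX_PLAUSIBLE_GAP;
}
```

Only 86400 of the 2^32 possible differences pass the check, which is where the estimate for random data comes from.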