Back Midas Rome Roody Rootana
  Midas DAQ System, Page 115 of 136  Not logged in ELOG logo
ID Date Author Topic Subjectdown
  1601   08 Jul 2019 Vinzenz BildsteinBug ReportFrontend killed at stop of run
> run the frontend inside gdb and post the stack trace after the crash?
> 
> if there is no crash (the program is stopped by exit()), you may need
> to set a breakpoint in exit() or _exit() (not sure what it's latest name is)
> then with luck your stack trace will show who/what called it from where.
> 

If I remember correctly from the last time I tried that, it doesn't use the exit
function but gdb just reports that the program was terminated and no longer exists. I
can't set a breakpoint on SIGKILL as the point of SIGKILL is to kill the program and
gdb can't set a break at that point afaik.
  1602   08 Jul 2019 Konstantin OlchanskiBug ReportFrontend killed at stop of run
> > run the frontend inside gdb and post the stack trace after the crash?
> > 
> > if there is no crash (the program is stopped by exit()), you may need
> > to set a breakpoint in exit() or _exit() (not sure what it's latest name is)
> > then with luck your stack trace will show who/what called it from where.
> > 
> 
> If I remember correctly from the last time I tried that, it doesn't use the exit
> function but gdb just reports that the program was terminated and no longer exists. I
> can't set a breakpoint on SIGKILL as the point of SIGKILL is to kill the program and
> gdb can't set a break at that point afaik.

For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack 
trace. Is this what you see?

If your program stops "normally", not from receiving some signal, set breakpoints on "exit" and 
"_exit".

The normal stop sequence is to call exit(), which runs all the atexit() functions (the midas atexit() 
function prints the message about "cm_disconnect_experiment not called at end of program") and 
calls _exit() to stop the program.

So if you see the midas message "cm_disconnect_experiment not called at end of program", it is a 
good indication that somebody (not mfe.c) called exit() on you. A breakpoint on "exit" should catch 
who does it.

Good luck,
K.O.
  1603   08 Jul 2019 Vinzenz BildsteinBug ReportFrontend killed at stop of run
> > > run the frontend inside gdb and post the stack trace after the crash?
> > > 
> > > if there is no crash (the program is stopped by exit()), you may need
> > > to set a breakpoint in exit() or _exit() (not sure what it's latest name is)
> > > then with luck your stack trace will show who/what called it from where.
> > > 
> > 
> > If I remember correctly from the last time I tried that, it doesn't use the exit
> > function but gdb just reports that the program was terminated and no longer exists. I
> > can't set a breakpoint on SIGKILL as the point of SIGKILL is to kill the program and
> > gdb can't set a break at that point afaik.
> 
> For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack 
> trace. Is this what you see?

Yes, that is exactly what I remember seeing.

> 
> If your program stops "normally", not from receiving some signal, set breakpoints on "exit" and 
> "_exit".
> 
> The normal stop sequence is to call exit(), which runs all the atexit() functions (the midas atexit() 
> function prints the message about "cm_disconnect_experiment not called at end of program") and 
> calls _exit() to stop the program.
> 
> So if you see the midas message "cm_disconnect_experiment not called at end of program", it is a 
> good indication that somebody (not mfe.c) called exit() on you. A breakpoint on "exit" should catch 
> who does it.
> 
> Good luck,
> K.O.

So far I haven't seen the issue with the message "cm_disconnect_experiment not called at end of program"
again. Now I just have to restart the frontend after the run has (failed?) to stop. After restarting the
frontend everything seems to work again. 

I haven't been writing data while doing these tests, so I can't say if there is any data missing or if the
runs were actually stopped properly (with a second dump of the ODB at the end).
  1604   08 Jul 2019 Konstantin OlchanskiBug ReportFrontend killed at stop of run
> > 
> > For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack 
> > trace. Is this what you see?
> 
> Yes, that is exactly what I remember seeing.
> 

Where would a SIGKILL come from?!?

Look in the syslog (/var/log/messages). If the program was killed by the linux kernel, it would be logged there,
the usual cause is the machine runs out of memory and programs are killed by the OOM killer, this is logged
into the syslog, always.

MIDAS also can issue a SIGKILL sometimes, again this is always logged in midas.log. see src/midas.c, search for SIGKILL to see 
the exact messages printed before it is sent out.

K.O.
  1605   08 Jul 2019 Vinzenz BildsteinBug ReportFrontend killed at stop of run
> > > 
> > > For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack 
> > > trace. Is this what you see?
> > 
> > Yes, that is exactly what I remember seeing.
> > 
> 
> Where would a SIGKILL come from?!?
> 
> Look in the syslog (/var/log/messages). If the program was killed by the linux kernel, it would be logged there,
> the usual cause is the machine runs out of memory and programs are killed by the OOM killer, this is logged
> into the syslog, always.
> 
> MIDAS also can issue a SIGKILL sometimes, again this is always logged in midas.log. see src/midas.c, search for SIGKILL to see 
> the exact messages printed before it is sent out.
> 
> K.O.

I haven't been able to reproduce the error from the overnight run so far. I will try and leave this running in gdb overnight to see
if I can get that error again. 
  1608   10 Jul 2019 Vinzenz BildsteinBug ReportFrontend killed at stop of run
> > > > 
> > > > For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack 
> > > > trace. Is this what you see?
> > > 
> > > Yes, that is exactly what I remember seeing.
> > > 
> > 
> > Where would a SIGKILL come from?!?
> > 
> > Look in the syslog (/var/log/messages). If the program was killed by the linux kernel, it would be logged there,
> > the usual cause is the machine runs out of memory and programs are killed by the OOM killer, this is logged
> > into the syslog, always.
> > 
> > MIDAS also can issue a SIGKILL sometimes, again this is always logged in midas.log. see src/midas.c, search for SIGKILL to see 
> > the exact messages printed before it is sent out.
> > 
> > K.O.
> 
> I haven't been able to reproduce the error from the overnight run so far. I will try and leave this running in gdb overnight to see
> if I can get that error again. 

I was able to reproduce the error after an overnight run. gdb reported that the program received a SIGKILL, but no sign of it in 
/var/log/messages. I've tried finding a current midas.log file, but it seems we don't have one? The most recent one was last updated 
on May 24th this year.
  1610   10 Jul 2019 Konstantin OlchanskiBug ReportFrontend killed at stop of run
> ... finding a current midas.log file

On the "help" page, see "midas.log".

Same information is in ODB, the midas log file name is concatenation of "/Logger/Data dir" and "message file".

K.O.
  1613   11 Jul 2019 Vinzenz BildsteinBug ReportFrontend killed at stop of run
> > ... finding a current midas.log file
>
> On the "help" page, see "midas.log".
>
> Same information is in ODB, the midas log file name is concatenation of "/Logger/Data dir" and "message file".
>
> K.O.

Sorry, should have found that myself ...

Anyway, the output from midas is
Tue Jul  9 07:24:06 2019 [mhttpd,INFO] Run #13456 started
Wed Jul 10 06:23:58 2019 [mhttpd,ERROR] [system.c:4580:ss_recv_net_command,ERROR] timeout receiving network 
command header
Wed Jul 10 06:23:58 2019 [mhttpd,ERROR] [midas.c:10322:rpc_client_call,ERROR] call to "fedescant" on 
"grsmid00.triumf.ca" RPC "rc_transition": timeout waiting for reply
Wed Jul 10 06:24:02 2019 [mhttpd,ERROR] [midas.c:5495:cm_shutdown,ERROR] Client 'fedescant' not responding to 
shutdown command
Wed Jul 10 06:24:02 2019 [mhttpd,ERROR] [midas.c:5497:cm_shutdown,ERROR] Killing and Deleting client 'fedescant' 
pid 31482
Wed Jul 10 06:24:02 2019 [Logger,INFO] Client 'fedescant' on buffer 'SYSMSG' removed by cm_watchdog because 
process pid 31482 does not exist
Wed Jul 10 06:24:02 2019 [fegrifip09,INFO] Client 'fedescant' on buffer 'SYSTEM' removed by cm_watchdog because 
process pid 31482 does not exist
Wed Jul 10 06:24:03 2019 [mhttpd,INFO] Run #13456 stopped

And I think I tracked down where this comes from with help from Thomas Lindner. It is a problem in the communication via the A3818 card from CAEN. This seems to block the frontend, even though it still reacts normal to a shutdown. So no issue with midas, even if it seemed that way at first. Thanks for all your help!
  1614   11 Jul 2019 Konstantin OlchanskiBug ReportFrontend killed at stop of run
> Wed Jul 10 06:23:58 2019 [mhttpd,ERROR] [system.c:4580:ss_recv_net_command,ERROR] timeout receiving network  command header
> Wed Jul 10 06:23:58 2019 [mhttpd,ERROR] [midas.c:10322:rpc_client_call,ERROR] call to "fedescant" on  "grsmid00.triumf.ca" RPC "rc_transition": timeout waiting for reply

We should have started debugging from here. The error messages mean: your frontend is not responding to run transition (RPC timeout).

> problem in the communication via the A3818 card from CAEN.

Yes, this has been problematic before.

K.O.
  181   14 Dec 2004 Jan WoutersForumFrontend index
What is the api call to determine the index of the frontend when specifying the
-i parameter during execution of the frontend? 
  183   15 Dec 2004 Stefan RittForumFrontend index
> What is the api call to determine the index of the frontend when specifying the
> -i parameter during execution of the frontend? 

INT get_frontend_index();

- Stefan
  1183   06 Jul 2016 Zhe WangSuggestionFrontend crush on high event rate
Dear friends,

We have some questions on using midas.
We use a Caen digitizer V1751 to take waveforms.
When testing with caen provided programs, we roughly know it can work fine at 1000 Hz event rate, and 30 M/s data can be written to disk.
The test with Midas, however, is a little confusing. We use CAENDigitizer library with Midas. First, it works, data were taken, and there seems no error.
The only problem is we cannot go to a higher event rate, for example we can only work on a rate of 40 Hz, and only 3 M/s data recording. Otherwise it will crush.

We may miss something really simple. Would you please give some suggestions? for example, other people's discussions or documents?

Thank you very much.
  1184   09 Jul 2016 Zhe WangSuggestionFrontend crush on high event rate
Dear friends,

I may add a little more information.
For polling event, we check the data-ready register for the status of the digitizer.
In the readout routine, we create a bank, readout the data and write it out.

We commented out or made some replacement for each part of the subroutines to figure our where exactly goes wrong.
for example, replace the readout from the digitizer with a random generation of some fake events.
By replacing the readout by a random generation, the program runs fine and reach a very high event rates.

Any suggestions or ideas from experts?

Thank you very much.

--
Best regards,
Zhe Wang


> Dear friends,
> 
> We have some questions on using midas.
> We use a Caen digitizer V1751 to take waveforms.
> When testing with caen provided programs, we roughly know it can work fine at 1000 Hz event rate, and 30 M/s data can be written to disk.
> The test with Midas, however, is a little confusing. We use CAENDigitizer library with Midas. First, it works, data were taken, and there seems no error.
> The only problem is we cannot go to a higher event rate, for example we can only work on a rate of 40 Hz, and only 3 M/s data recording. Otherwise it will crush.
> 
> We may miss something really simple. Would you please give some suggestions? for example, other people's discussions or documents?
> 
> Thank you very much.
  1185   10 Jul 2016 Zhe WangSuggestionFrontend crush on high event rate
Dear friends,

In case anyone need the source code, it is attached. 
We use optic fiber to connect to a VME controler, which talks to V1751 via VME bus.

--
Zhe Wang

> Dear friends,
> 
> I may add a little more information.
> For polling event, we check the data-ready register for the status of the digitizer.
> In the readout routine, we create a bank, readout the data and write it out.
> 
> We commented out or made some replacement for each part of the subroutines to figure our where exactly goes wrong.
> for example, replace the readout from the digitizer with a random generation of some fake events.
> By replacing the readout by a random generation, the program runs fine and reach a very high event rates.
> 
> Any suggestions or ideas from experts?
> 
> Thank you very much.
> 
> --
> Best regards,
> Zhe Wang
> 
> 
> > Dear friends,
> > 
> > We have some questions on using midas.
> > We use a Caen digitizer V1751 to take waveforms.
> > When testing with caen provided programs, we roughly know it can work fine at 1000 Hz event rate, and 30 M/s data can be written to disk.
> > The test with Midas, however, is a little confusing. We use CAENDigitizer library with Midas. First, it works, data were taken, and there seems no error.
> > The only problem is we cannot go to a higher event rate, for example we can only work on a rate of 40 Hz, and only 3 M/s data recording. Otherwise it will crush.
> > 
> > We may miss something really simple. Would you please give some suggestions? for example, other people's discussions or documents?
> > 
> > Thank you very much.
  1186   13 Jul 2016 Zhe WangSuggestionFrontend crush on high event rate
Somehow I don't understand why people's reply is only in my mail box.
So I pasted them here. I hope they don't mind and these information may be useful for others.

The following is some discussion.
==========================================================================================
> In read_trigger_event(), you creating a secondary bank with time in
> second. For your information, this time in second is already written in
> the event header. You can retrieve the time using macros from the
> midas.h   time = TIME_STAMP(pevent)

Removed.

>
> In frontend_init() you loop over NFADC (1) and call for each loop
> frontend_config() after opening the device on that card. In
> frontend_config() you redo a loop over NFADC, meaning that in case of
> more than one card you will find the second one not open on the first
> frontend_config (ok for one card though).
>

Corrected.

> In frontend_config() what is the return sCAEN from MallocReadoutBuffer()?
> What is the size of the requested allocated buffer?

The return size of allocated buffer is 134936.

>
> What is the value of the sCAEN from the ReadData() function in
> read_trigger_event()?

It is always 0 for success until it crashes.
However, even for the event it crashes, it also appears as 0.

>
> I didn't check all the config parameters!
>
> What is the value of count in the poll_event(). It is true if the test
> in poll_event() is too short, it cause timing corruption during
> calibration. 

Do you mean Midas timing calibration for poll_event() before all finally start up?
We havn't observed corruption at this stage.

> This never happen during CAMAC time... to be fixed!
> The alternative is to include a ss_sleep(1) instead of the prescale.
> a 1ms delay between every poll is short enough to ensure your 1KHz trigger.

We tried ss_sleep(1) in poll_event(), and it doesn't help.
We also tried add a ss_sleep(10) in the read_trigger_event().
This may work. But we can only reach 100 Hz and 1 MB/s rate. Still low.

>
> How long do you spend in the read_trigger_event()? To be measured.

We add some timers in this part of the program.
The time spent on CAEN_DGTZ_ReadData is about 100 us.
To sleep 1 ms in read_trigger_event may delay the crush, but just one minute.
To sleep 10 ms works.

>
> I still don't understand your setup as you mention using optic fiber to
> access the VME controller? do you have a A3818 or similar to the
> controller? If so why don't you connect directly the optic to the VX1751
> and prevent the use of the VME backplane?

Our connect is:
A2818 (PCI) - fiber - V2718 (Bridge) - VME - V1751
We probably need to configure other vme boards through VME at the same time,
however, these boards don't have a fiber connection.

We also tested direct fiber connect for V1751 today.
But it crashes with the same symptom.
==========================================================================================
  1187   13 Jul 2016 Zhe WangSuggestionFrontend crush on high event rate
Suggestion from John and my reply.

> We have achieved very high rates, but only with some care.

> The biggest issue was to make sure when you compile the CAEN driver for the A3818 board that you turn on the MIDAS switch.  Without that problems occur with some 
> probability given by the number of bytes processed - which translates into very soon if you have a high rate.  (The underlying cause is that both MIDAS and the A3818
> use unix Alarm signals, but the CAEN folks have a compile option to turn this off.)

> We use as little as possible of the CAENDigitizerLibrary - instead we program the registers directly on the board.

> There is still some kind of memory leak which we have not yet tracked down, so every few hours we shut down the frontend then restart it. 

We use A2818 (PCI) - fiber - V2718 (Bridge) - VME - V1751.
I actually didn't find a MIDAS switch in the Makefile.
  1188   13 Jul 2016 Zhe WangSuggestionFrontend crush on high event rate

More suggestions from John and my reply.

> we also don't use the VME back plane - it's just too slow - mixing VME commands to plain modules and digitizer modules is unreliable....

> We use CAEN fiberoptic version 2 to talk to the digitizers directly, we have upto 12 digitizers, and can use all channels for several hours, and can fill to about 75% 
of the A3818 bandwidth... 

So far we are limitted to 30 MB/s, if tested with CAEN examples, for example, the wavedump program by CAEN.
I think is kind of the limit by IDE hard drive.
Unfortunately we are still far from that limit, only ~ 1 MB/s now.  :(
  1207   30 Sep 2016 Konstantin OlchanskiSuggestionFrontend crush on high event rate
> 
> More suggestions from John and my reply.
> 
> > we also don't use the VME back plane - it's just too slow - mixing VME commands to plain modules and digitizer modules is unreliable....
> 
> > We use CAEN fiberoptic version 2 to talk to the digitizers directly, we have upto 12 digitizers, and can use all channels for several hours, and can fill to about 75% 
> of the A3818 bandwidth... 
> 
> So far we are limitted to 30 MB/s, if tested with CAEN examples, for example, the wavedump program by CAEN.
> I think is kind of the limit by IDE hard drive.
> Unfortunately we are still far from that limit, only ~ 1 MB/s now.  :(
>

From writing MIDAS frontends for many years, I am starting to form an opinion that this type of problem is undebuggable
in the current midas frontend framework - it is impossible to separate problems in vendor-supplied libraries and linux kernel modules
from problems with midas (i.e. incorrectly created data banks, too-small event buffers getting full) from problems with
bad interaction (collision over the SIGALARM handlers).

I am pondering on a new scheme for midas frontend writing. Perhaps such a new scheme should have a "no midas" mode where you can
compile and link a midas frontend "without midas", leaving you to debug just your code and the vendor code and their interactions.

K.O.
  172   04 Nov 2004 Jan WoutersForumFrontend code and the ODB
I would like to know whether all parameters used by the frontend code have to be in the "Experiment/
Run Parameters" section.  This section can become big and difficult to maintain, because it is one single 
big section of experim.h (EXP_PARAM_DEFINED).  I have parameters the various frontends read at the 
beginning of each run, which set the hardware settings of various devices.  I would like to place these in 
a section all their own, organized by device.  Is this doable? 
  173   04 Nov 2004 Stefan RittForumFrontend code and the ODB
Hi Jan,

I usually keep under /Experiment/Run Parameters only those settings which are kind of "global" and thus of
interest to frontend *and* analyzer, like a run mode (data/calibration/cosmic/...). Settings more specific to a
frontend I keep under /Equipment/<name>/Settings where <name> is the equipment name the specific frontend
produces. In your case each frontend will then get its own tree (related to each fragment). Please note that
both discussed trees can contain a whole tree with subdirectories, which lets you organize your data better.

Best regards, Stefan.
ELOG V3.1.4-2e1708b5