11 Oct 2024, Stefan Ritt, Bug Report, Frontend name must differ from others by more than the last three characters
|
Hi Denis,
indeed a bug. Will fix it next week.
Best,
Stefan
> Hi,
> I have developed two Midas front-end programs for different hardware. The frontend_name of the first one is "FSCD_SC" (slow control) and that of the second one is "FSCD_PS" (power supply).
>
> Each front-end program runs fine separately, but when attempting to start FSCD_SC while FSCD_PS is running, FSCD_PS is terminated and Midas indicates "Previous frontend stopped" in the window where it starts FSCD_SC.
>
> The problem is that these two frontend names only differ in their last two characters, and Midas currently does not distinguish them properly. |
18 Oct 2024, Stefan Ritt, Bug Report, Frontend name must differ from others by more than the last three characters
|
Fixed and committed.
Best,
Stefan
> Hi Denis,
>
> indeed a bug. Will fix it next week.
>
> Best,
> Stefan
>
>
> > Hi,
> > I have developed two Midas front-end programs for different hardware. The frontend_name of the first one is "FSCD_SC" (slow control) and that of the second one is "FSCD_PS" (power supply).
> >
> > Each front-end program runs fine separately, but when attempting to start FSCD_SC while FSCD_PS is running, FSCD_PS is terminated and Midas indicates "Previous frontend stopped" in the window where it starts FSCD_SC.
> >
> > The problem is that these two frontend names only differ in their last two characters, and Midas currently does not distinguish them properly. |
08 Jul 2019, Vinzenz Bildstein, Bug Report, Frontend killed at stop of run
|
I wrote a c++ frontend to read data from CAEN VX1730 digitizers which is used in
parallel with the GRIFFIN frontend to read out DESCANT.
After a long overnight run to check that the frontend runs smoothly for a longer
time, I stopped the run and the frontend was killed by midas. I am not sure why
this happened, as the end_of_run function returned successfully (at least the
print statement right before "return SUCCESS;" appeared right away). So
something else must have timed-out and caused it to be killed, I guess?
Any suggestions on where to look to find out what causes this?
Thanks in advance for your help! |
08 Jul 2019, Konstantin Olchanski, Bug Report, Frontend killed at stop of run
|
> After a long overnight run to check that the frontend runs smoothly for a longer
> time, I stopped the run and the frontend was killed by midas.
run the frontend inside gdb and post the stack trace after the crash?
if there is no crash (the program is stopped by exit()), you may need
to set a breakpoint in exit() or _exit() (not sure what it's latest name is)
then with luck your stack trace will show who/what called it from where.
if it is hard to start the frontend inside gdb, you can start it normally,
and attach gdb later, using a "gdb frontend.exe pid" command.
K.O.
> I am not sure why
> this happened, as the end_of_run function returned successfully (at least the
> print statement right before "return SUCCESS;" appeared right away). So
> something else must have timed-out and caused it to be killed, I guess?
>
> Any suggestions on where to look to find out what causes this?
>
> Thanks in advance for your help! |
08 Jul 2019, Vinzenz Bildstein, Bug Report, Frontend killed at stop of run
|
> run the frontend inside gdb and post the stack trace after the crash?
>
> if there is no crash (the program is stopped by exit()), you may need
> to set a breakpoint in exit() or _exit() (not sure what it's latest name is)
> then with luck your stack trace will show who/what called it from where.
>
If I remember correctly from the last time I tried that, it doesn't use the exit
function but gdb just reports that the program was terminated and no longer exists. I
can't set a breakpoint on SIGKILL as the point of SIGKILL is to kill the program and
gdb can't set a break at that point afaik. |
08 Jul 2019, Konstantin Olchanski, Bug Report, Frontend killed at stop of run
|
> > run the frontend inside gdb and post the stack trace after the crash?
> >
> > if there is no crash (the program is stopped by exit()), you may need
> > to set a breakpoint in exit() or _exit() (not sure what it's latest name is)
> > then with luck your stack trace will show who/what called it from where.
> >
>
> If I remember correctly from the last time I tried that, it doesn't use the exit
> function but gdb just reports that the program was terminated and no longer exists. I
> can't set a breakpoint on SIGKILL as the point of SIGKILL is to kill the program and
> gdb can't set a break at that point afaik.
For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack
trace. Is this what you see?
If your program stops "normally", not from receiving some signal, set breakpoints on "exit" and
"_exit".
The normal stop sequence is to call exit(), which runs all the atexit() functions (the midas atexit()
function prints the message about "cm_disconnect_experiment not called at end of program") and
calls _exit() to stop the program.
So if you see the midas message "cm_disconnect_experiment not called at end of program", it is a
good indication that somebody (not mfe.c) called exit() on you. A breakpoint on "exit" should catch
who does it.
Good luck,
K.O. |
08 Jul 2019, Vinzenz Bildstein, Bug Report, Frontend killed at stop of run
|
> > > run the frontend inside gdb and post the stack trace after the crash?
> > >
> > > if there is no crash (the program is stopped by exit()), you may need
> > > to set a breakpoint in exit() or _exit() (not sure what it's latest name is)
> > > then with luck your stack trace will show who/what called it from where.
> > >
> >
> > If I remember correctly from the last time I tried that, it doesn't use the exit
> > function but gdb just reports that the program was terminated and no longer exists. I
> > can't set a breakpoint on SIGKILL as the point of SIGKILL is to kill the program and
> > gdb can't set a break at that point afaik.
>
> For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack
> trace. Is this what you see?
Yes, that is exactly what I remember seeing.
>
> If your program stops "normally", not from receiving some signal, set breakpoints on "exit" and
> "_exit".
>
> The normal stop sequence is to call exit(), which runs all the atexit() functions (the midas atexit()
> function prints the message about "cm_disconnect_experiment not called at end of program") and
> calls _exit() to stop the program.
>
> So if you see the midas message "cm_disconnect_experiment not called at end of program", it is a
> good indication that somebody (not mfe.c) called exit() on you. A breakpoint on "exit" should catch
> who does it.
>
> Good luck,
> K.O.
So far I haven't seen the issue with the message "cm_disconnect_experiment not called at end of program"
again. Now I just have to restart the frontend after the run has (failed?) to stop. After restarting the
frontend everything seems to work again.
I haven't been writing data while doing these tests, so I can't say if there is any data missing or if the
runs were actually stopped properly (with a second dump of the ODB at the end). |
08 Jul 2019, Konstantin Olchanski, Bug Report, Frontend killed at stop of run
|
> >
> > For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack
> > trace. Is this what you see?
>
> Yes, that is exactly what I remember seeing.
>
Where would a SIGKILL come from?!?
Look in the syslog (/var/log/messages). If the program was killed by the linux kernel, it would be logged there,
the usual cause is the machine runs out of memory and programs are killed by the OOM killer, this is logged
into the syslog, always.
MIDAS also can issue a SIGKILL sometimes, again this is always logged in midas.log. see src/midas.c, search for SIGKILL to see
the exact messages printed before it is sent out.
K.O. |
08 Jul 2019, Vinzenz Bildstein, Bug Report, Frontend killed at stop of run
|
> > >
> > > For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack
> > > trace. Is this what you see?
> >
> > Yes, that is exactly what I remember seeing.
> >
>
> Where would a SIGKILL come from?!?
>
> Look in the syslog (/var/log/messages). If the program was killed by the linux kernel, it would be logged there,
> the usual cause is the machine runs out of memory and programs are killed by the OOM killer, this is logged
> into the syslog, always.
>
> MIDAS also can issue a SIGKILL sometimes, again this is always logged in midas.log. see src/midas.c, search for SIGKILL to see
> the exact messages printed before it is sent out.
>
> K.O.
I haven't been able to reproduce the error from the overnight run so far. I will try and leave this running in gdb overnight to see
if I can get that error again. |
10 Jul 2019, Vinzenz Bildstein, Bug Report, Frontend killed at stop of run
|
> > > >
> > > > For SIGKILL, my gdb reports "Program terminated with signal SIGKILL, Killed." and there is no stack
> > > > trace. Is this what you see?
> > >
> > > Yes, that is exactly what I remember seeing.
> > >
> >
> > Where would a SIGKILL come from?!?
> >
> > Look in the syslog (/var/log/messages). If the program was killed by the linux kernel, it would be logged there,
> > the usual cause is the machine runs out of memory and programs are killed by the OOM killer, this is logged
> > into the syslog, always.
> >
> > MIDAS also can issue a SIGKILL sometimes, again this is always logged in midas.log. see src/midas.c, search for SIGKILL to see
> > the exact messages printed before it is sent out.
> >
> > K.O.
>
> I haven't been able to reproduce the error from the overnight run so far. I will try and leave this running in gdb overnight to see
> if I can get that error again.
I was able to reproduce the error after an overnight run. gdb reported that the program received a SIGKILL, but no sign of it in
/var/log/messages. I've tried finding a current midas.log file, but it seems we don't have one? The most recent one was last updated
on May 24th this year. |
10 Jul 2019, Konstantin Olchanski, Bug Report, Frontend killed at stop of run
|
> ... finding a current midas.log file
On the "help" page, see "midas.log".
Same information is in ODB, the midas log file name is concatenation of "/Logger/Data dir" and "message file".
K.O. |
11 Jul 2019, Vinzenz Bildstein, Bug Report, Frontend killed at stop of run
|
> > ... finding a current midas.log file
>
> On the "help" page, see "midas.log".
>
> Same information is in ODB, the midas log file name is concatenation of "/Logger/Data dir" and "message file".
>
> K.O.
Sorry, should have found that myself ...
Anyway, the output from midas is
Tue Jul 9 07:24:06 2019 [mhttpd,INFO] Run #13456 started
Wed Jul 10 06:23:58 2019 [mhttpd,ERROR] [system.c:4580:ss_recv_net_command,ERROR] timeout receiving network
command header
Wed Jul 10 06:23:58 2019 [mhttpd,ERROR] [midas.c:10322:rpc_client_call,ERROR] call to "fedescant" on
"grsmid00.triumf.ca" RPC "rc_transition": timeout waiting for reply
Wed Jul 10 06:24:02 2019 [mhttpd,ERROR] [midas.c:5495:cm_shutdown,ERROR] Client 'fedescant' not responding to
shutdown command
Wed Jul 10 06:24:02 2019 [mhttpd,ERROR] [midas.c:5497:cm_shutdown,ERROR] Killing and Deleting client 'fedescant'
pid 31482
Wed Jul 10 06:24:02 2019 [Logger,INFO] Client 'fedescant' on buffer 'SYSMSG' removed by cm_watchdog because
process pid 31482 does not exist
Wed Jul 10 06:24:02 2019 [fegrifip09,INFO] Client 'fedescant' on buffer 'SYSTEM' removed by cm_watchdog because
process pid 31482 does not exist
Wed Jul 10 06:24:03 2019 [mhttpd,INFO] Run #13456 stopped
And I think I tracked down where this comes from with help from Thomas Lindner. It is a problem in the communication via the A3818 card from CAEN. This seems to block the frontend, even though it still reacts normal to a shutdown. So no issue with midas, even if it seemed that way at first. Thanks for all your help! |
11 Jul 2019, Konstantin Olchanski, Bug Report, Frontend killed at stop of run
|
> Wed Jul 10 06:23:58 2019 [mhttpd,ERROR] [system.c:4580:ss_recv_net_command,ERROR] timeout receiving network command header
> Wed Jul 10 06:23:58 2019 [mhttpd,ERROR] [midas.c:10322:rpc_client_call,ERROR] call to "fedescant" on "grsmid00.triumf.ca" RPC "rc_transition": timeout waiting for reply
We should have started debugging from here. The error messages mean: your frontend is not responding to run transition (RPC timeout).
> problem in the communication via the A3818 card from CAEN.
Yes, this has been problematic before.
K.O. |
14 Dec 2004, Jan Wouters, Forum, Frontend index
|
What is the api call to determine the index of the frontend when specifying the
-i parameter during execution of the frontend? |
15 Dec 2004, Stefan Ritt, Forum, Frontend index
|
> What is the api call to determine the index of the frontend when specifying the
> -i parameter during execution of the frontend?
INT get_frontend_index();
- Stefan |
06 Jul 2016, Zhe Wang, Suggestion, Frontend crush on high event rate
|
Dear friends,
We have some questions on using midas.
We use a Caen digitizer V1751 to take waveforms.
When testing with caen provided programs, we roughly know it can work fine at 1000 Hz event rate, and 30 M/s data can be written to disk.
The test with Midas, however, is a little confusing. We use CAENDigitizer library with Midas. First, it works, data were taken, and there seems no error.
The only problem is we cannot go to a higher event rate, for example we can only work on a rate of 40 Hz, and only 3 M/s data recording. Otherwise it will crush.
We may miss something really simple. Would you please give some suggestions? for example, other people's discussions or documents?
Thank you very much. |
09 Jul 2016, Zhe Wang, Suggestion, Frontend crush on high event rate
|
Dear friends,
I may add a little more information.
For polling event, we check the data-ready register for the status of the digitizer.
In the readout routine, we create a bank, readout the data and write it out.
We commented out or made some replacement for each part of the subroutines to figure our where exactly goes wrong.
for example, replace the readout from the digitizer with a random generation of some fake events.
By replacing the readout by a random generation, the program runs fine and reach a very high event rates.
Any suggestions or ideas from experts?
Thank you very much.
--
Best regards,
Zhe Wang
> Dear friends,
>
> We have some questions on using midas.
> We use a Caen digitizer V1751 to take waveforms.
> When testing with caen provided programs, we roughly know it can work fine at 1000 Hz event rate, and 30 M/s data can be written to disk.
> The test with Midas, however, is a little confusing. We use CAENDigitizer library with Midas. First, it works, data were taken, and there seems no error.
> The only problem is we cannot go to a higher event rate, for example we can only work on a rate of 40 Hz, and only 3 M/s data recording. Otherwise it will crush.
>
> We may miss something really simple. Would you please give some suggestions? for example, other people's discussions or documents?
>
> Thank you very much. |
10 Jul 2016, Zhe Wang, Suggestion, Frontend crush on high event rate
|
Dear friends,
In case anyone need the source code, it is attached.
We use optic fiber to connect to a VME controler, which talks to V1751 via VME bus.
--
Zhe Wang
> Dear friends,
>
> I may add a little more information.
> For polling event, we check the data-ready register for the status of the digitizer.
> In the readout routine, we create a bank, readout the data and write it out.
>
> We commented out or made some replacement for each part of the subroutines to figure our where exactly goes wrong.
> for example, replace the readout from the digitizer with a random generation of some fake events.
> By replacing the readout by a random generation, the program runs fine and reach a very high event rates.
>
> Any suggestions or ideas from experts?
>
> Thank you very much.
>
> --
> Best regards,
> Zhe Wang
>
>
> > Dear friends,
> >
> > We have some questions on using midas.
> > We use a Caen digitizer V1751 to take waveforms.
> > When testing with caen provided programs, we roughly know it can work fine at 1000 Hz event rate, and 30 M/s data can be written to disk.
> > The test with Midas, however, is a little confusing. We use CAENDigitizer library with Midas. First, it works, data were taken, and there seems no error.
> > The only problem is we cannot go to a higher event rate, for example we can only work on a rate of 40 Hz, and only 3 M/s data recording. Otherwise it will crush.
> >
> > We may miss something really simple. Would you please give some suggestions? for example, other people's discussions or documents?
> >
> > Thank you very much. |
13 Jul 2016, Zhe Wang, Suggestion, Frontend crush on high event rate
|
Somehow I don't understand why people's reply is only in my mail box.
So I pasted them here. I hope they don't mind and these information may be useful for others.
The following is some discussion.
==========================================================================================
> In read_trigger_event(), you creating a secondary bank with time in
> second. For your information, this time in second is already written in
> the event header. You can retrieve the time using macros from the
> midas.h time = TIME_STAMP(pevent)
Removed.
>
> In frontend_init() you loop over NFADC (1) and call for each loop
> frontend_config() after opening the device on that card. In
> frontend_config() you redo a loop over NFADC, meaning that in case of
> more than one card you will find the second one not open on the first
> frontend_config (ok for one card though).
>
Corrected.
> In frontend_config() what is the return sCAEN from MallocReadoutBuffer()?
> What is the size of the requested allocated buffer?
The return size of allocated buffer is 134936.
>
> What is the value of the sCAEN from the ReadData() function in
> read_trigger_event()?
It is always 0 for success until it crashes.
However, even for the event it crashes, it also appears as 0.
>
> I didn't check all the config parameters!
>
> What is the value of count in the poll_event(). It is true if the test
> in poll_event() is too short, it cause timing corruption during
> calibration.
Do you mean Midas timing calibration for poll_event() before all finally start up?
We havn't observed corruption at this stage.
> This never happen during CAMAC time... to be fixed!
> The alternative is to include a ss_sleep(1) instead of the prescale.
> a 1ms delay between every poll is short enough to ensure your 1KHz trigger.
We tried ss_sleep(1) in poll_event(), and it doesn't help.
We also tried add a ss_sleep(10) in the read_trigger_event().
This may work. But we can only reach 100 Hz and 1 MB/s rate. Still low.
>
> How long do you spend in the read_trigger_event()? To be measured.
We add some timers in this part of the program.
The time spent on CAEN_DGTZ_ReadData is about 100 us.
To sleep 1 ms in read_trigger_event may delay the crush, but just one minute.
To sleep 10 ms works.
>
> I still don't understand your setup as you mention using optic fiber to
> access the VME controller? do you have a A3818 or similar to the
> controller? If so why don't you connect directly the optic to the VX1751
> and prevent the use of the VME backplane?
Our connect is:
A2818 (PCI) - fiber - V2718 (Bridge) - VME - V1751
We probably need to configure other vme boards through VME at the same time,
however, these boards don't have a fiber connection.
We also tested direct fiber connect for V1751 today.
But it crashes with the same symptom.
========================================================================================== |
13 Jul 2016, Zhe Wang, Suggestion, Frontend crush on high event rate
|
Suggestion from John and my reply.
> We have achieved very high rates, but only with some care.
> The biggest issue was to make sure when you compile the CAEN driver for the A3818 board that you turn on the MIDAS switch. Without that problems occur with some
> probability given by the number of bytes processed - which translates into very soon if you have a high rate. (The underlying cause is that both MIDAS and the A3818
> use unix Alarm signals, but the CAEN folks have a compile option to turn this off.)
> We use as little as possible of the CAENDigitizerLibrary - instead we program the registers directly on the board.
> There is still some kind of memory leak which we have not yet tracked down, so every few hours we shut down the frontend then restart it.
We use A2818 (PCI) - fiber - V2718 (Bridge) - VME - V1751.
I actually didn't find a MIDAS switch in the Makefile. |
|