06 Dec 2024, Stefan Ritt, Info, New slow control framework "mdev"
|
A new slow control mini-framework has been developed for MIDAS and has been successfully tested in the Mu3e experiment. It
might be suited for other experiments as well.
Background
Since the late 90’s we have had a three-tier slow control framework in MIDAS with class drivers, device drivers and bus
drivers. While it has been used successfully for many years, it is complicated to understand and limited in its flexibility. If we
have an HV device with a demand value, a measured voltage and a current it’s fine, but if we want to control more things like
trip voltage, temperature and status readout, it soon hits its limits. With the development of the new odbxx API
(https://daq00.triumf.ca/MidasWiki/index.php/Odbxx) there is now an opportunity to make everything much simpler.
Design principles
Instead of a three-tier system, the new “mdev” framework (“m”idas “dev”ices) uses a simple base class which is attached to
a certain MIDAS equipment. It implements five simple functions:
- odb_setup() to set up /Equipment/<name>/Settings and /Equipment/<name>/Variables with their desired structure
- init() to initialize the slow control device
- exit() to close the connection to the device
- loop() which is called periodically to read the device
- read_event() which returns a MIDAS event going to the data stream
A device driver inherits from this base class and implements the functions. A simple example can be found in
midas/drivers/mdev/mdev_mscb.[h,cxx]
for the MSCB field bus system used at TRIUMF and PSI. It basically boils down to two calls:
Init:
   m_variables.connect("/Equipment/<name>/Variables");
   m_variables["Output"].watch([this](midas::odb &o) {
      m_mscb["HV"] = o[0]; // transfer value from ODB to MSCB device
   });
Reading a value in the loop function:
   m_variables["Input"][0] = m_mscb["HVMeas"];
The member variable m_variables is a midas::odb object attached to the “Input” and “Output” variables in the ODB. The
watch() function executes the lambda whenever “Output” in the ODB changes. It then simply transfers the new
value to the device. The reading of measured values works in the opposite direction, from the device to the ODB.
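Putting these pieces together, a complete driver could look roughly as follows. This is only a sketch: the base class name "mdev", the equipment name and the hardware access are assumptions, only the five functions and the odbxx calls follow the description above.

   // Sketch of a hypothetical mdev-style driver (base class name assumed)
   class mdev_example : public mdev {
      midas::odb m_variables;        // attached to /Equipment/<name>/Variables
      float      m_demand_mirror{0}; // mirror, so values are only sent when they change

   public:
      void odb_setup() override {
         // create the desired Variables structure in the ODB
         midas::odb vars = { {"Output", 0.0f}, {"Input", 0.0f} };
         vars.connect("/Equipment/Example/Variables");
      }

      void init() override {
         m_variables.connect("/Equipment/Example/Variables");
         m_variables["Output"].watch([this](midas::odb &o) {
            float v = o[0];
            if (v != m_demand_mirror) {   // reduce communication traffic
               m_demand_mirror = v;
               // ... send the new demand value to the hardware here ...
            }
         });
      }

      void exit() override { /* close the connection to the device */ }

      void loop() override {
         // periodic readout: hardware -> ODB
         m_variables["Input"][0] = 0.0f; // replace with the measured value read from the device
      }
   };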
If you look at the mdev_mscb.cxx code, you of course see some more things, like connecting to the MSCB device with proper
error handling, looping over several devices and variables, and setting up the “Settings” directory in the ODB to define labels for
all variables. In addition we have a mirror for output variables, so that new values are only sent to the device if they differ
from the previous value (needed to reduce communication traffic).
The midas/drivers/mdev directory also contains an example frontend in the mfe.cxx framework, but this is not a requirement.
The mdev framework can be used in the tmfe framework and others as well. Please note how compact the frontend
code now looks.
User interface
Since the beginning, MIDAS has allowed access to the slow control devices through the “equipment” page (on the main status
page, click on one equipment). A few more options can now control the behavior of this page, allowing quite some flexibility
without having to write a dedicated custom page (which of course can still be done). Attached is an example from Mu3e where
the details of the equipment display are controlled through some options in the Settings subdirectory, as described in
https://daq00.triumf.ca/MidasWiki/index.php//Equipment_ODB_tree (especially the “grid display”, “Editable” and “Format”
flags).
Conclusions
The new “mdev” framework offers a compact and effective way to communicate from MIDAS to slow control devices. Since
all interface code is no longer “hidden” in system class and device drivers, the user has much more flexibility in
controlling different devices. If a device has a new parameter, the user can add a single line of code to connect this
parameter to an ODB entry.
The framework is very simple and lacks some features of the old system. Ramping of HV voltages and current trips are not
available in the framework (unlike with the old HV class driver), but modern devices usually implement this in hardware, which
is much better. The new framework is not multi-threaded, but modern devices are these days much faster than in the ’90s.
Since the ODB is thread safe, nothing prevents us from putting a device readout into its own thread in the frontend.
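As an illustration of that last point, a frontend could run the readout in a dedicated thread, roughly like this (a sketch only; the driver class from the sketch above and the one-second period are assumptions):

   #include <atomic>
   #include <chrono>
   #include <thread>

   std::atomic<bool> running{true};

   // call the driver's loop() periodically; the ODB accesses inside are thread safe
   void readout_thread(mdev_example &driver) {
      while (running) {
         driver.loop();
         std::this_thread::sleep_for(std::chrono::seconds(1));
      }
   }

   // in frontend init:  static std::thread t(readout_thread, std::ref(driver));
   // in frontend exit:  running = false; t.join();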
We will use the new system for all devices in Mu3e, with probably some new features being added soon, so stay tuned.
/Stefan |
23 Oct 2024, Lukas Gerritzen, Bug Report, ODB key picker does not close when creating link / Edit-on-run string box too large
|
To reproduce:
In the interactive ODB, click the 🔗 icon to create a link. Next to the target, click the "..." button to open
the key picker browser. Then try to close it by either:
- Selecting a key and clicking ok
- Clicking "cancel"
- Clicking the red circle at the top left
Expected result:
The key picker closes
Actual result:
The key picker does not close.
Depending on how you try to close the picker, the error messages in the debug console differ slightly.
On the red circle:
Uncaught TypeError: dlg is null
dlgClose http://localhost:8080/controls.js:791
onclick http://localhost:8080/?cmd=ODB&odb_path=/Test:1
On "ok" or "cancel":
Uncaught TypeError: dlg is null
dlgMessageDestroy http://localhost:8080/controls.js:828
pickerButton http://localhost:8080/odbbrowser.js:453
onclick http://localhost:8080/?cmd=ODB&odb_path=/Test:1
Another, more minor, visual problem is the edit-on-start dialog. There seems to be no upper bound on the
size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters,
LongString has 255. Both are empty at the time of the screenshot. Maybe the size should be limited to a
reasonable width. |
02 Dec 2024, Stefan Ritt, Bug Report, ODB key picker does not close when creating link / Edit-on-run string box too large
|
> Actual result:
> The key picker does not close.
Thanks for reporting that bug. It has been fixed in the current commit (already installed on megon02).
Stefan |
04 Dec 2024, Konstantin Olchanski, Bug Report, ODB key picker does not close when creating link / Edit-on-run string box too large
|
> > Actual result:
> > The key picker does not close.
>
> Thanks for reporting that bug. It has been fixed in the current commit (installed already on megon02)
>
Stefan, thank you for fixing both problems, I have seen them, too, but no time to deal with them.
K.O. |
02 Dec 2024, Stefan Ritt, Bug Report, ODB key picker does not close when creating link / Edit-on-run string box too large
|
> Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the
> size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters,
> LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a
> reasonable width.
I have now limited the input size to (arbitrarily) 100 chars. The string can still be longer than 100 chars, and you then start scrolling inside the input box. Let me know if
that's ok this way.
Stefan |
05 Dec 2024, Konstantin Olchanski, Bug Report, ODB key picker does not close when creating link / Edit-on-run string box too large
|
> > Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the
> > size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters,
> > LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a
> > reasonable width.
>
> I limited the input size now to (arbitrarily) 100 chars. The string can still be longer than 100 chars, and you start then scrolling inside the input box. Let me know if
> that's ok this way.
I am moving the dragon experiment to the new midas and we see this problem on the begin-of-run page.
Old midas: no horizontal scroll bar, edit-on-start names, values and comments are all squeezed into the visible frame.
New midas: the page is very wide, the value entry fields are very long and there is a horizontal scroll bar.
So something got broken in the html formatting or sizing. I should be able to spot the change by doing
a diff between old resources/start.html and the new one.
K.O. |
15 Aug 2024, Scott Oser, Forum, "Safe" abort of sequencer scripts
|
We often use the MIDAS sequencer to temporarily control detector settings, such as:
* <change some setting>
* WAIT 60 seconds
* <revert setting to original value>
The question arises of what happens if the sequencer script gets aborted during that wait, preventing the value from being reset. Depending on the setting, this could be undesirable or even damage something if left uncorrected for too long.
Is there any way to have a "safe abort" from the sequencer so that the "Stop immediately" button will call some cleanup script to leave things in a safe state? Or what about if the sequencer process itself gets killed in the middle of a script?
How have other experiments using MIDAS protected themselves from unplanned terminations of sequencer scripts? |
19 Aug 2024, Stefan Ritt, Forum, "Safe" abort of sequencer scripts
|
This request has come up more than once in the past. One thing I could implement is an "atexit" function, similar to the C function atexit().
Then we would have a function in the script which gets called whenever one does "stop immediately". This function can then restore
some ODB values or do whatever is necessary.
If the sequencer gets killed in the middle, it can safely be restarted since the complete sequencer state is kept in the ODB under
/Sequencer/State. After the restart, the sequencer continues exactly where it was killed.
Would that solve your problem?
Stefan |
22 Aug 2024, Scott Oser, Forum, "Safe" abort of sequencer scripts
|
> This request came more than once in the past. One thing I could implement is a "atexit" function similarly to the C funciton atexit().
>
> Then we would have a function in the script which gets called whenever one does "stop immediately". This function can then restore
> some ODB values or do whatever is necessary.
>
> If the sequencer gets killed in the middle, it can safely be restarted since the complete sequencer state is kept in the ODB under
> /Sequencer/State. After the restart, the sequencer continues exactly where it has been killed before.
>
> Would that solve your problem?
>
> Stefan
Yes, an "atexit" functionality within the Midas Sequencer Language would be useful for us with this issue. Is this easy for you to implement?
Thanks,
Scott Oser |
02 Dec 2024, Stefan Ritt, Forum, "Safe" abort of sequencer scripts
|
The atexit() function has been implemented in the current develop branch of midas, see
https://daq00.triumf.ca/MidasWiki/index.php/Sequencer#ATEXIT_subroutine
Stefan |
11 Sep 2024, Konstantin Olchanski, Forum, "Safe" abort of sequencer scripts
|
> We often use the MIDAS sequencer to temporarily control detector settings, such as:
>
> * <change some setting>
> * WAIT 60 seconds
> * <revert setting to original value>
>
> The question arises of what happens if the sequencer scripts gets aborted during that wait, preventing the value from being reset.
Common problem. Go has an elegant solution using the "defer" keyword.
https://go.dev/tour/flowcontrol/12
K.O. |
30 Nov 2024, Pavel Murat, Bug Report, EQ_PERIODIC-only equipment ?
|
Dear Midas experts,
I'm running into something which looks like an initialization problem.
I have a mfe.cxx-style frontend which introduces an equipment of the EQ_PERIODIC type (EQ_PERIODIC-only!).
When Midas enters the running state, I see the frontend crashing.
Stepping through the code shows that the frontend is crashing because its equipment has been ignored
by initialize_equipment@mfe.cxx - see
https://bitbucket.org/tmidas/midas/src/5d0dae001712164ae43137dced2fbbb594f0201e/src/mfe.cxx#lines-630
Is there an assumption that the initialization of EQ_PERIODIC-only equipment is the user's responsibility?
Or should EQ_PERIODIC always come paired with some other type?
-- many thanks, regards, Pasha |
01 Dec 2024, Stefan Ritt, Bug Report, EQ_PERIODIC-only equipment ?
|
There is no requirement that you pair an EQ_PERIODIC with an EQ_TRIGGER. Take for example
midas/examples/experiment/frontend.cxx
and remove the triggered event there. The frontend runs happily with the periodic event only (I just tried that myself). You probably have some problem in
your event definition. Start with the running example frontend, and add your code line by line until you see the problem.
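For reference, the equipment definition of such a periodic-only frontend looks roughly like the following (a sketch along the lines of the example; please check the actual frontend.cxx for the exact field values):

EQUIPMENT equipment[] = {

   {"Periodic",              /* equipment name */
    {2, 0,                   /* event ID, trigger mask */
     "SYSTEM",               /* event buffer */
     EQ_PERIODIC,            /* equipment type */
     0,                      /* event source */
     "MIDAS",                /* format */
     TRUE,                   /* enabled */
     RO_RUNNING | RO_TRANSITIONS |   /* read when running and on transitions */
     RO_ODB,                 /* and update ODB */
     1000,                   /* read every second */
     0,                      /* stop run after this event limit */
     0,                      /* number of sub events */
     TRUE,                   /* log history */
     "", "", "",},
    read_periodic_event,     /* readout routine */
   },

   {""}
};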
Stefan |
01 Dec 2024, Pavel Murat, Bug Report, EQ_PERIODIC-only equipment ?
|
> There is no requirement that you pair an EQ_PERIODIC with an EQ_TRIGGER. Take for exmaple
>
> midas/examples/experiment/frontend.cxx
>
> and remove there the triggered event. The frontend runs happily with the periodic event only (I just tried that myself). You have probably some problem in
> your event definition. Start with the running example frontend, and add your code line by line until you see the problem.
Hi Stefan, thank you very much!
As the pointer to the readout function and pointers to device drivers are all defined in the same structure (EQUIPMENT),
I was naively assuming that the readout function should be set during the class driver initialization.
Now it is clear that the equipment responding to EQ_PERIODIC events doesn't have to have drivers,
and specifying the readout function is the responsibility of the user.
I got around it exactly this way yesterday, but was thinking that I was hacking the system :)
-- regards, Pasha |
07 Oct 2024, Amy Roberts, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual
Machine and are getting a persistent error when we run mserver. As far as I
know there are minimal changes between this and the MIDAS branch, but Ben Smith
may have more to say on this.
[lekhraj@sdfcdmsdaq online]$ mserver
mserver started interactively
[mserver,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer
because process pid 481051 does not exist
mserver will listen on TCP port 1175
[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
timeout 10000 ms, exiting...
[mserver,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment
not called at end of program
db_lock_database: Detected recursive call to db_{lock,unlock}_database() while
already inside db_{lock,unlock}_database(). Maybe this is a call from a signal
handler. Cannot continue, aborting...
Aborted (core dumped)
We thought perhaps we had a corrupted ODB file, so we removed the ODB file and
tried to create a new one (sized correctly for our experiment):
[lekhraj@sdfcdmsdaq online]$ odbedit -s 50000000
[ODBEdit,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client
'mserver', index 0 because process pid 481326 does not exists
[ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC
hosts/Allowed hosts"
[ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC
hosts/Allowed hosts"
[ODBEdit,INFO] Corrected 1 ODB entries
[ODBEdit,INFO] Deleted entry '/System/Clients/481326' for client 'mserver'
because it is not connected to ODB
[ODBEdit,INFO] Client 'mserver' on buffer 'SYSMSG' removed by bm_open_buffer
because process pid 481326 does not exist
[local:test:S]/>Bus error (core dumped) |
07 Oct 2024, Ben Smith, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual
> Machine and are getting a persistent error when we run mserver. As far as I
> know there are minimal changes between this and the MIDAS branch, but Ben Smith
> may have more to say on this.
For reference, "the SuperCDMS version of MIDAS" is just a fork that no longer has any meaningful differences vs the main MIDAS repo, but we only pull updates infrequently after testing a bunch. We last pulled from the develop branch in November 2023. But that should be irrelevant here as semaphore code hasn't been touched for a very long time.
We're running Alma 9.4 on a machine at TRIUMF and the same version of midas works fine there (Amy, you may already have access to scdms-zeus). I believe Alma and Rocky should be basically identical for this.
So the questions are:
* Have you tried other midas programs, or only mserver? E.g. did odbedit and mhttpd work?
* If other programs work, have you been running them all as the same user? In particular, if you ran one program as root and another as an unprivileged user, then you will likely get odd permissions issues.
* What do you see if you run `ls -l /dev/shm` and `ls -l ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM`? (Or wherever your online dir is for the 2nd one).
* Did you follow the full instructions for recovering from a corrupt ODB? https://daq00.triumf.ca/MidasWiki/index.php/FAQ#How_to_recover_from_a_corrupted_ODB In particular the bit about running odbinit with the --cleanup flag? |
08 Oct 2024, Amy Roberts, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> > We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual
> > Machine and are getting a persistent error when we run mserver. As far as I
> > know there are minimal changes between this and the MIDAS branch, but Ben Smith
> > may have more to say on this.
>
> For reference, "the SuperCDMS version of MIDAS" is just a fork that no longer has any meaningful differences vs the main MIDAS repo, but we only pull updates infrequently after testing a bunch. We last pulled from the develop branch in November 2023. But that should be irrelevant here as semaphore code hasn't been touched for a very long time.
>
> We're running Alma 9.4 on a machine at TRIUMF and the same version of midas works fine there (Amy, you may already have access to scdms-zeus). I believe Alma and Rocky should be basically identical for this.
>
> So the questions are:
> * Have you tried other midas programs, or only mserver? E.g. did odbedit and mhttpd work?
> * If other programs work, have you been running them all as the same user? In particular, if you ran one program as root and another as an unprivileged user, then you will likely get odd permissions issues.
> * What do you see if you run `ls -l /dev/shm` and `ls -l ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM`? (Or wherever your online dir is for the 2nd one).
> * Did you follow the full instructions for recovering from a corrupt ODB? https://daq00.triumf.ca/MidasWiki/index.php/FAQ#How_to_recover_from_a_corrupted_ODB In particular the bit about running odbinit with the --cleanup flag?
Here's what happens when I try to run odbedit:
[lekhraj@sdfcdmsdaq setup]$ odbedit
[ODBEdit,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 481823 does not exists
[ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Corrected 1 ODB entries
[ODBEdit,INFO] Deleted entry '/System/Clients/481823' for client 'ODBEdit' because it is not connected to ODB
[ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 481823 does not exist
[ODBEdit,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, exiting...
[ODBEdit,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment not called at end of program
db_lock_database: Detected recursive call to db_{lock,unlock}_database() while already inside db_{lock,unlock}_database(). Maybe this is a call from a signal handler. Cannot continue, aborting...
Aborted (core dumped)
[lekhraj@sdfcdmsdaq setup]$
And mhttpd:
[lekhraj@sdfcdmsdaq setup]$ mhttpd
[mhttpd,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 601054 does not exists
[mhttpd,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
[mhttpd,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
[mhttpd,INFO] Corrected 1 ODB entries
[mhttpd,INFO] Deleted entry '/System/Clients/601054' for client 'ODBEdit' because it is not connected to ODB
[mhttpd,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 601054 does not exist
[mhttpd,INFO] ODB subtree /Runinfo corrected successfully
Password protection is off
Hostlist off, connections from anywhere will be accepted
Listening on "http://localhost:8080", passwords OFF, hostlist OFF
Listening on "http://[::1]:8080", passwords OFF, hostlist OFF
bm_lock_buffer: Lock buffer "SYSMSG" is taking longer than 1 second!
[mhttpd,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, exiting...
[mhttpd,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment not called at end of program
db_lock_database: Detected recursive call to db_{lock,unlock}_database() while already inside db_{lock,unlock}_database(). Maybe this is a call from a signal handler. Cannot continue, aborting...
Aborted (core dumped)
[lekhraj@sdfcdmsdaq setup]$
We have been running everything as a single user, the user who cloned the repositories and owns the directories.
We did follow the corrupted-ODB cleanup instructions.
[lekhraj@sdfcdmsdaq setup]$ ls -lh /dev/shm
total 1.3M
-rw------- 1 lekhraj dm 1.2M Oct 8 14:13 17468_test_ODB__sdf_home_l_lekhraj_packages_SuperCDMS_DAQ_MidasDAQ_online_
-rw------- 1 lekhraj dm 114K Oct 7 14:06 17468_test_SYSMSG__sdf_home_l_lekhraj_packages_SuperCDMS_DAQ_MidasDAQ_online_
[lekhraj@sdfcdmsdaq setup]$ ls -lh ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM
-rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ALARM.SHM
-rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ELOG.SHM
-rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.HISTORY.SHM
-rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.LAZY.SHM
-rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.MSG.SHM
-rw-r--r-- 1 lekhraj dm 1.2M Oct 8 14:12 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ODB.SHM
-rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.SYSMSG.SHM
-rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.SYSTEM.SHM |
08 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
I read these error messages. There is no ODB corruption. The ODB semaphore is locked, so all midas programs will fail: they will time out trying to get the lock and report the timeout. Then it looks like a bug was introduced where, instead of a hard exit or abort(), they attempt a clean shutdown which crashes from a recursive call in db_lock_database(). Amy's
core dump should confirm this.
K.O.
> > > We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual
> > > Machine and are getting a persistent error when we run mserver. As far as I
> > > know there are minimal changes between this and the MIDAS branch, but Ben Smith
> > > may have more to say on this.
> >
> > For reference, "the SuperCDMS version of MIDAS" is just a fork that no longer has any meaningful differences vs the main MIDAS repo, but we only pull updates infrequently after testing a bunch. We last pulled from the develop branch in November 2023. But that should be irrelevant here as semaphore code hasn't been touched for a very long time.
> >
> > We're running Alma 9.4 on a machine at TRIUMF and the same version of midas works fine there (Amy, you may already have access to scdms-zeus). I believe Alma and Rocky should be basically identical for this.
> >
> > So the questions are:
> > * Have you tried other midas programs, or only mserver? E.g. did odbedit and mhttpd work?
> > * If other programs work, have you been running them all as the same user? In particular, if you ran one program as root and another as an unprivileged user, then you will likely get odd permissions issues.
> > * What do you see if you run `ls -l /dev/shm` and `ls -l ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM`? (Or wherever your online dir is for the 2nd one).
> > * Did you follow the full instructions for recovering from a corrupt ODB? https://daq00.triumf.ca/MidasWiki/index.php/FAQ#How_to_recover_from_a_corrupted_ODB In particular the bit about running odbinit with the --cleanup flag?
>
> Here's what happens when I try to run odbedit:
>
> [lekhraj@sdfcdmsdaq setup]$ odbedit
> [ODBEdit,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 481823 does not exists
> [ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Corrected 1 ODB entries
> [ODBEdit,INFO] Deleted entry '/System/Clients/481823' for client 'ODBEdit' because it is not connected to ODB
> [ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 481823 does not exist
> [ODBEdit,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, exiting...
> [ODBEdit,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment not called at end of program
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while already inside db_{lock,unlock}_database(). Maybe this is a call from a signal handler. Cannot continue, aborting...
> Aborted (core dumped)
> [lekhraj@sdfcdmsdaq setup]$
>
> And mhttpd:
>
> [lekhraj@sdfcdmsdaq setup]$ mhttpd
> [mhttpd,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 601054 does not exists
> [mhttpd,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [mhttpd,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [mhttpd,INFO] Corrected 1 ODB entries
> [mhttpd,INFO] Deleted entry '/System/Clients/601054' for client 'ODBEdit' because it is not connected to ODB
> [mhttpd,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 601054 does not exist
> [mhttpd,INFO] ODB subtree /Runinfo corrected successfully
> Password protection is off
> Hostlist off, connections from anywhere will be accepted
> Listening on "http://localhost:8080", passwords OFF, hostlist OFF
> Listening on "http://[::1]:8080", passwords OFF, hostlist OFF
> bm_lock_buffer: Lock buffer "SYSMSG" is taking longer than 1 second!
> [mhttpd,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, exiting...
> [mhttpd,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment not called at end of program
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while already inside db_{lock,unlock}_database(). Maybe this is a call from a signal handler. Cannot continue, aborting...
> Aborted (core dumped)
> [lekhraj@sdfcdmsdaq setup]$
>
> We have been running everything as a single user, the user who cloned the repositories and owns the directories.
>
> We did follow the corrupted-ODB cleanup instructions.
>
> [lekhraj@sdfcdmsdaq setup]$ ls -lh /dev/shm
> total 1.3M
> -rw------- 1 lekhraj dm 1.2M Oct 8 14:13 17468_test_ODB__sdf_home_l_lekhraj_packages_SuperCDMS_DAQ_MidasDAQ_online_
> -rw------- 1 lekhraj dm 114K Oct 7 14:06 17468_test_SYSMSG__sdf_home_l_lekhraj_packages_SuperCDMS_DAQ_MidasDAQ_online_
>
> [lekhraj@sdfcdmsdaq setup]$ ls -lh ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM
> -rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ALARM.SHM
> -rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ELOG.SHM
> -rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.HISTORY.SHM
> -rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.LAZY.SHM
> -rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.MSG.SHM
> -rw-r--r-- 1 lekhraj dm 1.2M Oct 8 14:12 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ODB.SHM
> -rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.SYSMSG.SHM
> -rw-r--r-- 1 lekhraj dm 0 Oct 3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.SYSTEM.SHM |
08 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> I read these error messages. There is no ODB corruption. ODB semaphore is locked and all midas programs will fail...
Recovery from this is:
- stop all midas programs (actually they should have all crashed by now)
- identify the ODB semaphore with: ipcs -s -t
- remove the ODB semaphore with: ipcrm sem <semid>
- where <semid> is from the first column of ipcs
- keep deleting semaphores until odbedit works.
- if you delete extra midas semaphores, odbedit will recreate them
- if you delete non-midas semaphores, oh, well...
Slightly better steps for this recovery may have been written up by Suzannah in this forum
or in the midas wiki... good luck finding them.
K.O. |
07 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual
> Machine and are getting a persistent error when we run mserver.
>
> [mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
> timeout 10000 ms, exiting...
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while
> already inside db_{lock,unlock}_database(). Maybe this is a call from a signal
> handler. Cannot continue, aborting...
> Aborted (core dumped)
This is super very bad. Since you have a core dump, please post the stack trace here (or email it to me).
I probably cannot debug your private version of midas and I will recommend that you install and run vanilla midas
mserver (just while we debug this problem).
Let's look at the core dump stack trace first, but likely we see a problem with System-V semaphores and hopefully it
is not some breakage due to Red Hat bogosity or due to something specific to running on a virtual machine.
If indeed this is Linux-kernel level breakage of System-V semaphores, solution would be to start using Posix
semaphores, something I wanted to do for a long time. We already switched MIDAS shared memory from System-V to Posix
shared memory.
If we are lucky it is just one more crasher bug in ODB. Let's see that core dump stack trace.
K.O. |
08 Oct 2024, Mark Grimes, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
We run Midas with no problems on Rocky 9.4, although not in a virtual machine. We're very close to the head of `develop`.
I'm fairly sure I've seen an error like this before. I didn't pay it much attention because it was transitory and I was doing something weird at the time - probably
stepping through with a debugger and hit a timeout. It was definitely about an ODB semaphore but I can't recall if it was about a recursive call.
Basically I think Rocky 9.4 is a red herring. Do you have another crashed copy of mserver/mhttpd running somewhere and stuck in limbo? If it's a virtual machine
are you sharing the shared memory location with the host, and running another midas on there?
> > We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual
> > Machine and are getting a persistent error when we run mserver.
> >
> > [mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
> > timeout 10000 ms, exiting...
> > db_lock_database: Detected recursive call to db_{lock,unlock}_database() while
> > already inside db_{lock,unlock}_database(). Maybe this is a call from a signal
> > handler. Cannot continue, aborting...
> > Aborted (core dumped)
>
> This is super very bad. Since you have a core dump, please post the stack trace here (or email it to me).
>
> I probably cannot debug your private version of midas and I will recommend that you install and run vanilla midas
> mserver (just while we debug this problem).
>
> Let's look at the core dump stack trace first, but likely we see a problem with System-V semaphores and hopefully it
> is not some breakage due to Red Hat bogosity or due to something specific to running on a virtual machine.
>
> If indeed this is Linux-kernel level breakage of System-V semaphores, solution would be to start using Posix
> semaphores, something I wanted to do for a long time. We already switched MIDAS shared memory from System-V to Posix
> shared memory.
>
> If we are lucky it is just one more crasher bug in ODB. Let's see that core dump stack trace.
>
> K.O. |
08 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> Basically I think Rocky 9.4 is a red herring.
This is what likely happened:
- some program crashed while holding the ODB lock semaphore (by ctrl-C at the wrong time or by kill -KILL at the wrong time)
- semaphore is locked with a flag "unlock if this program stops" (see the sketch after this list)
- this is supposed to ensure ODB lock semaphore never gets stuck in the locked state
- (there is no code in MIDAS to unlock the ODB lock semaphore without locking it first)
- we have observed a malfunction in the Linux kernel, where this automatic unlock does not happen.
- it is rare and so far cannot be reproduced. you can find more about it by searching this forum.
- I think it is a bug in the System-V semaphore code or in the Linux "program stop" code (a path where they fail to call the semaphore unlock handler, and who knows what
other handlers).
- System-V semaphores are very obsolete, replaced by POSIX semaphores.
- POSIX semaphores do not have the "unlock if this program stops" magic, the user (MIDAS) is responsible for detecting that the program that locked the semaphore is gone
and for taking corrective action, i.e. releasing the lock, automatically.
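The "unlock if this program stops" flag mentioned above corresponds to SEM_UNDO on System-V semaphore operations. A minimal sketch of how such a lock is typically taken and released (not the actual MIDAS code):

   #include <sys/types.h>
   #include <sys/ipc.h>
   #include <sys/sem.h>

   int lock_odb_style(int semid) {
      struct sembuf sb;
      sb.sem_num = 0;
      sb.sem_op  = -1;         /* acquire */
      sb.sem_flg = SEM_UNDO;   /* kernel is supposed to undo this if the process dies */
      return semop(semid, &sb, 1);
   }

   int unlock_odb_style(int semid) {
      struct sembuf sb;
      sb.sem_num = 0;
      sb.sem_op  = 1;          /* release */
      sb.sem_flg = SEM_UNDO;
      return semop(semid, &sb, 1);
   }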
I do not know if this problem with System-V semaphores is in the generic Linux kernel, or if it is specific
to the Red Hat kernels (they are known to have many patches, deviating quite far from vanilla kernels).
I do not know if this problem is somehow sensitive to virtual machines.
So yes/no, Red Hat derived linux on a virtual machine could be where this problem happens more often than elsewhere.
K.O. |
08 Oct 2024, Amy Roberts, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> > We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual
> > Machine and are getting a persistent error when we run mserver.
> >
> > [mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
> > timeout 10000 ms, exiting...
> > db_lock_database: Detected recursive call to db_{lock,unlock}_database() while
> > already inside db_{lock,unlock}_database(). Maybe this is a call from a signal
> > handler. Cannot continue, aborting...
> > Aborted (core dumped)
>
> This is super very bad. Since you have a core dump, please post the stack trace here (or email it to me).
>
> I probably cannot debug your private version of midas and I will recommend that you install and run vanilla midas
> mserver (just while we debug this problem).
>
> Let's look at the core dump stack trace first, but likely we see a problem with System-V semaphores and hopefully it
> is not some breakage due to Red Hat bogosity or due to something specific to running on a virtual machine.
>
> If indeed this is Linux-kernel level breakage of System-V semaphores, solution would be to start using Posix
> semaphores, something I wanted to do for a long time. We already switched MIDAS shared memory from System-V to Posix
> shared memory.
>
> If we are lucky it is just one more crasher bug in ODB. Let's see that core dump stack trace.
>
> K.O.
I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.
This was done using the "CDMS" version of MIDAS; I'll compile the current MIDAS repository just to be sure we're seeing
the same error and report back here! |
08 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.
I cannot read the core dump without the corresponding executable (and likely all its shared libraries).
It is best if you run gdb and extract the stack traces on your end.
In case you are not familiar with gdb:
gdb mserver core # start gdb
bt # stack trace of crashed thread
info thr # get list of threads
thr 1
bt
thr 2
bt
# etc, get stack trace of each thread, there should not be too many of them
K.O. |
10 Oct 2024, Amy Roberts, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> > I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.
>
> I cannot read the core dump without the corresponding executable (and likely all it's shared libraries).
>
> It is best if you run gdb and extract the stack traces on your end.
>
> In case you are not familiar with gdb:
>
> gdb mserver core # start gdb
> bt # stack trace of crashed thread
> info thr # get list of threads
> thr 1
> bt
> thr 2
> bt
> # etc, get stack trace of each thread, there should not be too many of them
>
> K.O.
Hi Konstantin, thanks for the instructions. I do appear to be missing some debug symbols, but the output
looks potentially useful:
[lekhraj@sdfcdmsdaq ~]$ gdb mserver
core.mserver.17468.b174bb74f2bb44f9a0905e78ec6b2677.601715.1728422354000000
GNU gdb (GDB) Rocky Linux 10.2-11.1.el9_3
...
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mserver...
[New LWP 601715]
warning: Section `.reg-xstate/601715' in core file too small.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `mserver'.
Program terminated with signal SIGABRT, Aborted.
warning: Section `.reg-xstate/601715' in core file too small.
#0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-83.el9.12.x86_64 libgcc-11.4.1-
3.el9.x86_64 libstdc++-11.4.1-2.1.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 mysql-libs-8.0.36-1.el9_3.x86_64
openssl-libs-3.0.7-25.el9_3.x86_64 zlib-1.2.11-40.el9.x86_64
(gdb)
(gdb) bt
#0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
#2 0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
#3 0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
#4 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date
format",
hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
#5 db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format",
subhKey=0x7ffcc536d348)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
#6 0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
#7 0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>,
filename=0x7ffcc536d690,
linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
#8 0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
#9 0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
#10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
#11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
#12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
#13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0,
hDB=1)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
#14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
subhKey=subhKey@entry=0x7ffcc536da04) at
/sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
#15 0x0000000000455fd2 in al_check () at
/sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
--Type <RET> for more, q to quit, c to continue without paging--
#16 0x000000000041ff85 in cm_periodic_tasks ()
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
#17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
#18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
(gdb) info thr
Id Target Id Frame
* 1 Thread 0x7fbdec0b1740 (LWP 601715) 0x00007fbdeaca154c in __pthread_kill_implementation () from
/lib64/libc.so.6
(gdb) thr 1
[Switching to thread 1 (Thread 0x7fbdec0b1740 (LWP 601715))]
#0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
#2 0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
#3 0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
#4 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date
format",
hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
#5 db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format",
subhKey=0x7ffcc536d348)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
#6 0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
#7 0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>,
filename=0x7ffcc536d690,
linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
#8 0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
#9 0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
#10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
#11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
#12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
#13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0,
hDB=1)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
#14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
subhKey=subhKey@entry=0x7ffcc536da04) at
/sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
#15 0x0000000000455fd2 in al_check () at
/sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
#16 0x000000000041ff85 in cm_periodic_tasks ()
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
#17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
#18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
(gdb) |
13 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
Thank you for the stack trace. I fixed the buglet that caused midas programs to crash twice:
once on the failure to lock ODB, then again via exit() -> atexit() handlers -> cm_check_connect() -> crash on ODB lock
failure in the cm_msg() code.
Replaced exit(1) with abort(). Could have used kill(getpid(),SIGKILL) to avoid making a core dump, but what the
heck...
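In other words, the failure path now goes roughly like this (a sketch of the change described above, not the exact diff):

   /* inside db_lock_database(), after the semaphore lock has timed out */
   fprintf(stderr, "db_lock_database: cannot lock ODB semaphore, timeout 10000 ms, aborting...\n");
   abort();   /* previously exit(1), whose atexit() handlers called back into db_lock_database() and crashed */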
Of course this does nothing to the original bug where ODB was locked and nobody will ever unlock it (reboot will
unlock it!).
commit bdd1d7fdc093b5a8d54a1b8467002bb3cac3ac11
K.O.
> > > I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.
> >
> > I cannot read the core dump without the corresponding executable (and likely all it's shared libraries).
> >
> > It is best if you run gdb and extract the stack traces on your end.
> >
> > In case you are not familiar with gdb:
> >
> > gdb mserver core # start gdb
> > bt # stack trace of crashed thread
> > info thr # get list of threads
> > thr 1
> > bt
> > thr 2
> > bt
> > # etc, get stack trace of each thread, there should not be too many of them
> >
> > K.O.
>
> Hi Konstantin, thanks for the instructions. I do appear to be missing some debug symbols, but the output
> looks potentially useful:
>
> [lekhraj@sdfcdmsdaq ~]$ gdb mserver
> core.mserver.17468.b174bb74f2bb44f9a0905e78ec6b2677.601715.1728422354000000
> GNU gdb (GDB) Rocky Linux 10.2-11.1.el9_3
> ...
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from mserver...
> [New LWP 601715]
>
> warning: Section `.reg-xstate/601715' in core file too small.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `mserver'.
> Program terminated with signal SIGABRT, Aborted.
>
> warning: Section `.reg-xstate/601715' in core file too small.
> #0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-83.el9.12.x86_64 libgcc-11.4.1-
> 3.el9.x86_64 libstdc++-11.4.1-2.1.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 mysql-libs-8.0.36-1.el9_3.x86_64
> openssl-libs-3.0.7-25.el9_3.x86_64 zlib-1.2.11-40.el9.x86_64
> (gdb)
> (gdb) bt
> #0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> #1 0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
> #2 0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
> #3 0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
> #4 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date
> format",
> hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #5 db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format",
> subhKey=0x7ffcc536d348)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #6 0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
> key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
> s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
> #7 0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>,
> filename=0x7ffcc536d690,
> linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
> #8 0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
> message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
> timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
> #9 0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
> #10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
> #11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
> #12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
> #13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0,
> hDB=1)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
> subhKey=subhKey@entry=0x7ffcc536da04) at
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #15 0x0000000000455fd2 in al_check () at
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
> --Type <RET> for more, q to quit, c to continue without paging--
> #16 0x000000000041ff85 in cm_periodic_tasks ()
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
> #17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
> #18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
> (gdb) info thr
> Id Target Id Frame
> * 1 Thread 0x7fbdec0b1740 (LWP 601715) 0x00007fbdeaca154c in __pthread_kill_implementation () from
> /lib64/libc.so.6
> (gdb) thr 1
> [Switching to thread 1 (Thread 0x7fbdec0b1740 (LWP 601715))]
> #0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> (gdb) bt
> #0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> #1 0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
> #2 0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
> #3 0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
> #4 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date
> format",
> hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #5 db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format",
> subhKey=0x7ffcc536d348)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #6 0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
> key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
> s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
> #7 0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>,
> filename=0x7ffcc536d690,
> linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
> #8 0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
> message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
> timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
> #9 0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
> #10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
> #11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
> #12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
> #13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0,
> hDB=1)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
> subhKey=subhKey@entry=0x7ffcc536da04) at
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #15 0x0000000000455fd2 in al_check () at
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
> #16 0x000000000041ff85 in cm_periodic_tasks ()
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
> #17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
> #18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
> at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
> (gdb) |
16 Oct 2024, Amy Roberts, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> Thank you for the stack trace, I fixed the buglet that cause midas programs to crash twice,
> once on failure to lock ODB, then call exit() -> atexit() handlers -> cm_check_connect() -> crash on ODB lock
> failure is the cm_msg() codes.
>
> Replaced exit(1) with abort(). Could have used kill(getpid(),SIGKILL) to avoid making a core dump, but what the
> heck...
>
> Of course this does nothing to the original bug where ODB was locked and nobody will ever unlock it (reboot will
> unlock it!).
>
> commit bdd1d7fdc093b5a8d54a1b8467002bb3cac3ac11
>
> K.O.
>
>
> > > > I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.
> > >
> > > I cannot read the core dump without the corresponding executable (and likely all it's shared libraries).
> > >
> > > It is best if you run gdb and extract the stack traces on your end.
> > >
> > > In case you are not familiar with gdb:
> > >
> > > gdb mserver core # start gdb
> > > bt # stack trace of crashed thread
> > > info thr # get list of threads
> > > thr 1
> > > bt
> > > thr 2
> > > bt
> > > # etc, get stack trace of each thread, there should not be too many of them
> > >
> > > K.O.
> >
> > Hi Konstantin, thanks for the instructions. I do appear to be missing some debug symbols, but the output
> > looks potentially useful:
> >
> > [lekhraj@sdfcdmsdaq ~]$ gdb mserver
> > core.mserver.17468.b174bb74f2bb44f9a0905e78ec6b2677.601715.1728422354000000
> > GNU gdb (GDB) Rocky Linux 10.2-11.1.el9_3
> > ...
> > For help, type "help".
> > Type "apropos word" to search for commands related to "word"...
> > Reading symbols from mserver...
> > [New LWP 601715]
> >
> > warning: Section `.reg-xstate/601715' in core file too small.
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > Core was generated by `mserver'.
> > Program terminated with signal SIGABRT, Aborted.
> >
> > warning: Section `.reg-xstate/601715' in core file too small.
> > #0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> > Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-83.el9.12.x86_64 libgcc-11.4.1-
> > 3.el9.x86_64 libstdc++-11.4.1-2.1.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 mysql-libs-8.0.36-1.el9_3.x86_64
> > openssl-libs-3.0.7-25.el9_3.x86_64 zlib-1.2.11-40.el9.x86_64
> > (gdb)
> > (gdb) bt
> > #0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> > #1 0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
> > #2 0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
> > #3 0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
> > #4 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date
> > format",
> > hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> > #5 db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format",
> > subhKey=0x7ffcc536d348)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> > #6 0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
> > key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
> > s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
> > #7 0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>,
> > filename=0x7ffcc536d690,
> > linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
> > #8 0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
> > message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
> > timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
> > #9 0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
> > #10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
> > #11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
> > #12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
> > #13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0,
> > hDB=1)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> > #14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
> > subhKey=subhKey@entry=0x7ffcc536da04) at
> > /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> > #15 0x0000000000455fd2 in al_check () at
> > /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
> > --Type <RET> for more, q to quit, c to continue without paging--
> > #16 0x000000000041ff85 in cm_periodic_tasks ()
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
> > #17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
> > #18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
> > (gdb) info thr
> > Id Target Id Frame
> > * 1 Thread 0x7fbdec0b1740 (LWP 601715) 0x00007fbdeaca154c in __pthread_kill_implementation () from
> > /lib64/libc.so.6
> > (gdb) thr 1
> > [Switching to thread 1 (Thread 0x7fbdec0b1740 (LWP 601715))]
> > #0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> > (gdb) bt
> > #0 0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> > #1 0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
> > #2 0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
> > #3 0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
> > #4 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date
> > format",
> > hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> > #5 db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format",
> > subhKey=0x7ffcc536d348)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> > #6 0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
> > key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
> > s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
> > #7 0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>,
> > filename=0x7ffcc536d690,
> > linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
> > #8 0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
> > message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore,
> > timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
> > #9 0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
> > #10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
> > #11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
> > #12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
> > #13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0,
> > hDB=1)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> > #14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
> > subhKey=subhKey@entry=0x7ffcc536da04) at
> > /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> > #15 0x0000000000455fd2 in al_check () at
> > /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
> > #16 0x000000000041ff85 in cm_periodic_tasks ()
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
> > #17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
> > #18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
> > at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
> > (gdb)
I checked out the modified version of Midas and recompiled, and am still getting a similar error when I try to run
odbedit:
[aroberts@sdfcdmsdaq midas]$ odbedit
[ODBEdit,ERROR] [odb.cxx:2043:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid
1615051 does not exists
[ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Corrected 1 ODB entries
[ODBEdit,INFO] Deleted entry '/System/Clients/1615051' for client 'ODBEdit' because it is not connected to ODB
[ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 1615051 does not
exist
[ODBEdit,ERROR] [odb.cxx:2489:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, aborting...
Aborted (core dumped)
I'm not sure what's causing the call to lock the database; all I'm doing is typing "odbedit" at the command prompt.
I should add that I followed the instructions for unlocking the ODB, but when I call "odbedit" this error still
appears:
[aroberts@sdfcdmsdaq midas]$ ipcs -s -t
------ Semaphore Operation/Change Times --------
semid owner last-op last-changed
4 aroberts Wed Oct 16 11:44:10 2024 Wed Oct 16 11:44:00 2024
[aroberts@sdfcdmsdaq midas]$ ipcrm sem 4
resource(s) deleted
[aroberts@sdfcdmsdaq midas]$ odbedit
[ODBEdit,ERROR] [odb.cxx:2043:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid
1617050 does not exists
[ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Corrected 1 ODB entries
[ODBEdit,INFO] Deleted entry '/System/Clients/1617050' for client 'ODBEdit' because it is not connected to ODB
[ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 1617050 does not
exist
[ODBEdit,ERROR] [odb.cxx:2489:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, aborting...
Aborted (core dumped) |
18 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> [aroberts@sdfcdmsdaq midas]$ odbedit
> [ODBEdit,ERROR] [odb.cxx:2043:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid
> 1615051 does not exists
> [ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Corrected 1 ODB entries
> [ODBEdit,INFO] Deleted entry '/System/Clients/1615051' for client 'ODBEdit' because it is not connected to ODB
> [ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 1615051 does not
> exist
so far, so good, we connected to ODB (lock was not stuck), cleared out client "odbedit" with pid 1615051 that crashed
without properly disconnecting. ODB semaphore is working correctly.
> [ODBEdit,ERROR] [odb.cxx:2489:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, aborting...
> Aborted (core dumped)
suddenly, an ODB semaphore timeout...
can you post the stack trace from this core dump? I am pretty sure it will be boring, but just in case...
K.O. |
18 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> suddenly, an ODB semaphore timeout...
I am wondering if something bizarre is going on, like the system clock going backwards. I heard of things like that
happening in virtual environments.
https://stackoverflow.com/questions/4801122/how-to-stop-time-from-running-backwards-on-linux
I added some debugging information to the semaphore locking code. Please update to commit
eb625af119067f6d702211542d88a28ccb57ad2c of src/system.cxx (plus small change in include/msystem.h) and try again.
Now for each timeout it will print detailed syscall and timing information; if time goes backwards, it should catch it.
K.O. |
28 Oct 2024, Amy Roberts, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
> Now for each timeout it will print detailed syscall and timing information, if time goes backwards, it should catch it.
It appears that time is moving forward:
[aroberts@sdfcdmsdaq build]$ odbedit
[ODBEdit,ERROR] [odb.cxx:2043:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 1617119 does
not exists
[ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Corrected 1 ODB entries
[ODBEdit,INFO] Deleted entry '/System/Clients/1617119' for client 'ODBEdit' because it is not connected to ODB
[ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 1617119 does not exist
[local:amy_test:S]/>ss_semaphore_wait_for: semop/semtimedop(5) returned -1, errno 11 (Resource temporarily unavailable),
start time 0xd4fd98f6, now 0xd4fdc0ef, dt 0x000027f9, timeout 0x00002710 ms, SEMAPHORE TIMEOUT!
[ODBEdit,ERROR] [odb.cxx:2489:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, aborting...
Aborted (core dumped) |
06 Nov 2024, Amy Roberts, Bug Report, Difficulty running MIDAS on Rocky 9.4
|
After following Konstantin's debugging suggestions, I thought I would try to replicate
the issue on my own computer. My hope was that I could provide instructions for
replicating the bug so that the MIDAS team could try debugging things more easily.
However, when I ran the current version of MIDAS in a Rocky 9.4 VM on my laptop (both
VMWare and VirtualBox), mserver and odbedit ran just fine (!).
I'm currently trying to find out if there's a way to compare the VMs on my machine and
the machine that's being problematic; I'll report back if I learn anything. |
22 Nov 2024, Konstantin Olchanski, Bug Report, ODB lock timeout, Difficulty running MIDAS on Rocky 9.4
|
> try to replicate the issue ...
I see ODB lock timeout (and abort() of everything) in the dsvslice test station. We have
about 15-20 MIDAS clients connected.
I am pretty sure we have not seen this problem until recently (and I have not seen it
personally for a very long time). There were no changes to the MIDAS ODB locking code in a
long time.
I suspect a recent change in the linux kernel. But I am likely to be wrong.
I have 1000 core dumps from this crash of dsvslice, and among them should be the 1 thread
that has ODB locked. Wish me luck finding it. Worst case is to discover that ODB is locked
but nobody is holding a lock ("missing unlock bug"). This is hard to debug; I would have to add
tracking of "who was the last one to lock it, who forgot to unlock it".
K.O. |
24 Nov 2024, Pavel Murat, Bug Report, ODB lock timeout, Difficulty running MIDAS on Rocky 9.4
|
There is a really good software tool developed by the Fermilab DAQ group, called TRACE
(https://github.com/art-daq/trace). It could be useful for debugging cases like this one.
In short, TRACE instruments the code with printouts which can be selectively turned on and off
without recompiling the executable. TRACE output can go to /dev/stdout (slow output) and/or to
a circular buffer implemented via a shared memory segment (fast output). Sending unlimited output
to the shared memory segment is extremely useful.
TRACE also allows triggering on certain conditions, again without recompiling the executable.
For debugging cases like the one in question, that could turn out to be even more useful;
however, I didn't try the triggering functionality myself.
-- regards, Pasha |
21 Nov 2024, Mann Gandhi, Info, What do the status numbers mean and where can I find more information about them?
|
Hello,
This is the error message I got:
[RP Streaming Frontend,ERROR] [midas.cxx:17806:cm_write_event_to_odb,ERROR]
cannot create key for bank "DATA" with tid 24 in ODB, db_create_key() status 309
read_periodic_event: No data in ring buffer or error occurred
[RP Streaming Frontend,ERROR] [odb.cxx:3373:db_create_key,ERROR] invalid key type
24 to create 'DATA' in '/Equipment/Periodic/Variables'
[RP Streaming Frontend,ERROR] [midas.cxx:17806:cm_write_event_to_odb,ERROR]
cannot create key for bank "DATA" with tid 24 in ODB, db_create_key() status 309
I just need more information on what the error message means. Which data type
does tid 24 refer to, and what does status 309 indicate?
There is definitely data in the ring buffer but I keep on getting this error.
Thank you!
M.G |
21 Nov 2024, Stefan Ritt, Info, What do the status numbers mean and where can I find more information about them?
|
> [RP Streaming Frontend,ERROR] [midas.cxx:17806:cm_write_event_to_odb,ERROR]
> cannot create key for bank "DATA" with tid 24 in ODB, db_create_key() status 309
>
>
>
> I just need more information on what the error message means. Which data type
> refers to tid 24 and what does status 309 indicate??
A tid (type identification) of 24 does not actually exist (see midas.h:327), so this tells
me that your bank header got corrupted. Somewhere you write over your data.
Stefan |
13 Nov 2024, Stefan Ritt, Info, New sequencer command ODBLOOKUP
|
A new sequencer command "ODBLOOKUP" has been implemented, which does a lookup of a string in a string
array in the ODB given by a path and returns its index as a number. If we have for example an array
/Examples/Names
[0] Hello
[1] Test
[2] Other
and do a
ODBLOOKUP "/Examples/Names", "Test", index
we get an index equal to 1.
/Stefan |
15 Nov 2024, Konstantin Olchanski, Info, New sequencer command ODBLOOKUP
|
> A new sequencer command "ODBLOOKUP" has been implemented, which does a lookup of a string in a string
> array in the ODB given by a path and returns its index as a number. If we have for example an array
>
> /Examples/Names
> [0] Hello
> [1] Test
> [2] Other
>
> and do a
>
> ODBLOOKUP "/Examples/Names", "Test", index
>
> we get a index equal 1.
>
"value not found" sets "index" to ?
"odb key not found" sets "index" to ?
link to documentation?
K.O. |
18 Nov 2024, Stefan Ritt, Info, New sequencer command ODBLOOKUP
|
> "value not found" sets "index" to ?
It actually sets it to "not found". Since all variables are strings in the sequencer, you can then do a test like
ODBLOOKUP ..., index
if ($index == "not found")
...
> "odb key not found" sets "index" to ?
If the odb key is not found, the sequencer aborts.
> link to documentation?
The documentation is where it always has been:
https://daq00.triumf.ca/MidasWiki/index.php/Sequencer#Sequencer_Commands
/Stefan |
14 Nov 2024, Mann Gandhi, Suggestion, Issue with creating banks
|
Hello, I am a co-op student working at SNOLAB. I am currently setting up a frontend
program to collect data for an experiment, and I am having an issue with my bank not being
initialized with the correct name. I will attach an image of the error and
a code snippet for clarity. This is a multi-thread program using ring buffers. The
first thread is only responsible for data collection of ADC values from the Red
Pitaya (FPGA) and the second thread does a simple derivative calculation. The
frontend makes use of the TCP connection to stream data from the Red Pitaya.
Here is the code snippet. This is the only place in the frontend code where I
initialize and create a bank to store the ADC values from the Red Pitaya.
void* data_acquisition_thread(void* param)
{
printf("Data acquisition thread started\n");
// Obtain ring buffer for inter-thread data exchange
EVENT_HEADER *pevent;
WORD *pdata;
int status;
//Set a timeout for the recv function to prevent indefinite blocking
struct timeval timeout;
timeout.tv_sec = 10; //seconds
timeout.tv_usec = 0; // 0 microseconds
setsockopt(stream_sockfd, SOL_SOCKET, SO_RCVTIMEO, (char *)&timeout, sizeof(timeout));
while (is_readout_thread_enabled())
{
if (!readout_enabled())
{
usleep(10); // do not produce events when run is stopped
continue;
}
// Acquire a write pointer in the ring buffer
int status;
do {
status = rb_get_wp(rbh, (void **) &pevent, 0);
if (status == DB_TIMEOUT)
{
usleep(5);
if (!is_readout_thread_enabled()) break;
}
} while (status != DB_SUCCESS);
if (status != DB_SUCCESS) continue;
// Lock mutex before accessing shared resources
pthread_mutex_lock(&lock);
// Buffer for incoming data
//int16_t temp_buffer[4096] = {0};
bm_compose_event_threadsafe(pevent, 1, 0, 0, &equipment[0].serial_number);
pdata = (WORD *)(pevent + 1); // Set pdata to point to the data section of the event
// Initialize the bank and read data directly into the bank
bk_init32(pevent);
bk_create(pevent, "RPD0", TID_WORD, (void **)&pdata);
int bytes_read = recv(stream_sockfd, pdata, max_event_size * sizeof(WORD), 0);
printf("Data received: %d bytes\n", bytes_read);
if (bytes_read <= 0)
{
if (bytes_read == 0)
{
printf("Red Pitaya disconnected\n");
pthread_mutex_unlock(&lock);
break;
} else if (errno == EWOULDBLOCK || errno ==EAGAIN)
{
printf("Receive timeout\n");
pthread_mutex_unlock(&lock);
continue;
}
else
{
printf("Error reading from the Red Pitaya: %s\n",
strerror(errno));
pthread_mutex_unlock(&lock);
continue;
}
}
// Adjust data pointers after reading
pdata += bytes_read / sizeof(WORD);
bk_close(pevent, pdata);
pevent->data_size = bk_size(pevent);
// Unlock mutex after writing to the buffer
pthread_mutex_unlock(&lock);
// Send event to ring buffer
rb_increment_wp(rbh, sizeof(EVENT_HEADER) + pevent->data_size);
}
pthread_mutex_unlock(&lock);
return NULL;
} |
14 Nov 2024, Stefan Ritt, Suggestion, Issue with creating banks
|
All I can see is that your bank header gets corrupted along the way. The funny character reported by
cm_write_event_to_odb indicates that your original name "RPD0" got overwritten somewhere, but I could not spot any
mistake in your code.
I would play around: change max_event_size, produce dummy data of size N instead of the recv() and so on. Also monitor
the bank header to see when it gets overwritten. I guess you only write from one thread, so that should be safe, right?
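For example, a minimal sketch of that test, reusing pevent/pdata and the bk_* calls from the snippet above and replacing the recv() with a made-up payload of N words (N is just an arbitrary test size):

   const int N = 1024;                               // dummy payload size in WORDs
   bk_init32(pevent);
   bk_create(pevent, "RPD0", TID_WORD, (void **)&pdata);
   for (int i = 0; i < N; i++)
      pdata[i] = (WORD) i;                           // known pattern instead of recv()
   pdata += N;                                       // advance by exactly what was written
   bk_close(pevent, pdata);
   printf("bank size: %d bytes\n", bk_size(pevent)); // should now be small and predictable

If the bank name or size still gets corrupted with this fixed payload, the problem is outside the readout itself.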
Best,
Stefan |
14 Nov 2024, Mann Gandhi, Suggestion, Issue with creating banks
|
> All I can see is that your bank header gets corrupted along the way. The funny character reported by
> cm_write_event_to_odb indicates that your original name "RPD0" got overwritten somewhere, but I could not spot any
> mistake in your code.
>
> I would play around: change max_event_size, produce dummy data of size N instead of the recv() and so on. Also monitor
> the bank header to see when it gets overwritten. I guess you only write form one thread, so that should be safe, right?
>
> Best,
> Stefan
Hello Stefan,
Thank you for the advice. On inspection, I noticed that my event size (when I print bk_size(pevent)) is around 1.4 billion,
which seems absurd, so I am not sure why that is the case either. In addition, is mdump the way to monitor the bank header?
I just recently started using MIDAS so I am a little bit confused. I can attach a link to the github repository where I am
currently working on this for further clarity since I am sure there is an issue in my code somewhere.
(https://github.com/mgandhi-1/red-pitaya-frontend/blob/10-issue-with-bank-creation-neeed-to-figure-out-why-banks-are-not-
being-created-correctly/frontend.cxx)
I appreciate the help. Thank you once more.
Best,
Mann |
15 Nov 2024, Konstantin Olchanski, Suggestion, Issue with creating banks
|
> Hello, I am a coop student working at SNOLAB.
> void* data_acquisition_thread(void* param)
> {
> EVENT_HEADER *pevent;
> if (complicated) {
> status = rb_get_wp(rbh, (void **) &pevent, 0);
> }
> bm_compose_event_threadsafe(pevent, 1, 0, 0, &equipment[0].serial_number);
> }
this code is buggy. it should read "EVENT_HEADER *pevent = NULL;" to avoid an uninitialized variable
and bm_compose_event() & co should be inside an "if (pevent != NULL)" block, unless you can absolutely
prove that rb_get_wp() is always called and pevent is never NULL (even if somebody changes the code later).
if you build your code with "gcc -O2 -g -Wall -Wuninitialized" it would probably warn you about use of uninitialized
"pevent".
P.S. for building multithreaded frontends, you are much better off starting from the c++ tmfe frontend framework,
a good starting point is study tmfe_example_everything.cxx.
K.O. |
07 Nov 2024, Lukas Gerritzen, Suggestion, Stop run and sequencer button
|
Due to popular demand among our students, I added a button to the sequencer that stops the run and the sequence. If you find it useful, please consider merging this upstream.
$ git diff sequencer.html
diff --git a/resources/sequencer.html b/resources/sequencer.html
index e7f8a79d..95c7e3d8 100644
--- a/resources/sequencer.html
+++ b/resources/sequencer.html
@@ -115,6 +115,7 @@
<img src="icons/play.svg" title="Start" class="seqbtn Stopped" onclick="startSeq();">
<img src="icons/debug.svg" title="Debug" class="seqbtn Stopped" onclick="debugSeq();">
<img src="icons/square.svg" title="Stop" class="seqbtn Running Paused" onclick="stopSeq();">
+ <img src="icons/x-octagon.svg" title="Stop Run and Sequencer immediately" class="seqbtn Running Paused" onclick="stopRunAndSeq();">
<img src="icons/pause.svg" title="Pause" class="seqbtn Running" onclick="modbset('/Sequencer/Command/Pause script',true);">
<img src="icons/resume.svg" title="Resume" class="seqbtn Paused" onclick="modbset('/Sequencer/Command/Resume script',true);">
<img src="icons/step-over.svg" title="Step Over" class="seqbtn Running Paused" onclick="modbset('/Sequencer/Command/Step over',true);">
[gac-megj@pc13513 resources]$ git diff sequencer.js
diff --git a/resources/sequencer.js b/resources/sequencer.js
index cc5398ef..b75c926c 100644
--- a/resources/sequencer.js
+++ b/resources/sequencer.js
@@ -1582,6 +1582,23 @@ function stopSeq() {
});
}
+function stopRunAndSeq() {
+ const message = `Are you sure you want to stop the run and sequence?`;
+ dlgConfirm(message,function(resp) {
+ if (resp) {
+ modbset('/Sequencer/Command/Stop immediately',true);
+
+ mjsonrpc_call("cm_transition", {"transition": "TR_STOP"}).then(function (rpc) {
+ if (rpc.result.status !== 1) {
+ throw new Error("Cannot stop run, cm_transition() status " + rpc.result.status + ", see MIDAS messages");
+ }
+ }).catch(function (error) {
+ mjsonrpc_error_alert(error);
+ });
+ }
+ });
+}
+
// Show or hide parameters table
function showParTable(varContainer) {
let e = document.getElementById(varContainer);
|
07 Nov 2024, Stefan Ritt, Suggestion, Stop run and sequencer button
|
I don't find this very useful. Some experiments do not only want to stop the run, but also do other cleanup things. To do that, I proposed an "atexit" function like C has. Then the user can put a run stop there, plus any other cleanup. This will be much more flexible. Think about the "reset" script we have to run manually if we abort a sequencer. The atexit function will come next week, so you should consider using it instead of your additional button.
Stefan |
05 Nov 2024, Jack Carlton, Forum, How to properly write a client that listens for events on a given buffer?
|
If there's some template for writing a client to access event data, that would be
very useful (and you can probably just ignore the context I gave below in that
case).
Some context:
Quite a while ago, I wrote the attached "data pipeline" client whose job was to
listen for events, copy their data, and pipe them to a python script. I believe I
just stole bits and pieces from mdump.cxx to accomplish this. Later I wrote the
attached wrapper class "MidasConnector.cpp" and a main.cpp to generalize
data_pipeline.cxx a bit. There were a lot of iterations to the code where I had the
below problems; so don't take the logic in the attached code as the exact code that
caused the issues below.
However, I'm unable to resolve a couple issues:
1. If a timeout is set, everything will work until that timeout is reached. Then
regardless of what kind of logic I tried to implement (retry receiving event,
disconnect and reconnect client, etc.) the client would refuse to receive more data.
2. When I ctrl-C main, it hangs; this is expected because it's stuck in a while
loop. But because I can't set a timeout I have to ctrl-C twice; this would
occasionally corrupt the ODB which was not ideal. I was able to get around this with
some impractical solution involving ncurses I believe.
Thanks,
Jack |
05 Nov 2024, Maia Henriksson-Ward, Forum, How to properly write a client that listens for events on a given buffer?
|
> If there's some template for writing a client to access event data, that would be
> very useful (and you can probably just ignore the context I gave below in that
> case).
>
>
> Some context:
>
> Quite a while ago, I wrote the attached "data pipeline" client whose job was to
> listen for events, copy their data, and pipe them to a python script. I believe I
> just stole bits and pieces from mdump.cxx to accomplish this. Later I wrote the
> attached wrapper class "MidasConnector.cpp" and a main.cpp to generalize
> data_pipeline.cxx a bit. There were a lot of iterations to the code where I had the
> below problems; so don't take the logic in the attached code as the exact code that
> caused the issues below.
>
> However, I'm unable to resolve a couple issues:
>
> 1. If a timeout is set, everything will work until that timeout is reached. Then
> regardless of what kind of logic I tried to implement (retry receiving event,
> disconnect and reconnect client, etc.) the client would refuse to receive more data.
>
> 2. When I ctrl-C main, it hangs; this is expected because it's stuck in a while
> loop. But because I can't set a timeout I have to ctrl-C twice; this would
> occasionally corrupt the ODB which was not ideal. I was able to get around this with
> some impractical solution involving ncurses I believe.
>
>
> Thanks,
> Jack
midas/examples/lowlevel/consume.cxx might be what you're looking for, but I think all
you're missing is a call to cm_yield() in your loop, so your midas client doesn't get
killed when the timeout is reached (and also so you can act on shutdown requests from
midas)
Something like
int status = cm_yield(100);
if (status == SS_ABORT || status == RPC_SHUTDOWN)
break;
There might be a recommended way to handle the ctrl-c and disconnect from the ODB, but
off the top of my head I don't remember it.
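In any case, a minimal skeleton putting the above together might look roughly like this (a sketch only; the cm_connect_experiment()/bm_open_buffer()/bm_request_event() setup is elided and is best copied from consume.cxx):

#include "midas.h"

int main()
{
   // ... cm_connect_experiment(), bm_open_buffer(), bm_request_event()
   //     exactly as in midas/examples/lowlevel/consume.cxx ...

   while (true) {
      int status = cm_yield(100);        // lets midas dispatch events and RPC requests
      if (status == SS_ABORT || status == RPC_SHUTDOWN)
         break;                          // midas asked us to shut down
   }

   cm_disconnect_experiment();           // disconnect cleanly so the ODB stays consistent
   return 0;
}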
Also check out Ben's new(ish) python library, midas/python/examples/event_receiver.py
might be a much easier solution. And you can use the context manager, which will take
care of safely disconnecting from midas after you ctrl-C. |
28 Oct 2024, Lukas Gerritzen, Bug Report, Visual glitch in history system 
|
Today, I encountered the bug shown in the attached video. The value of the plotted curve does not match the mouseover number.
When trying to understand it better, I stopped being able to replicate it. Has anyone else observed a similar problem? |
11 Oct 2024, Denis Calvet, Bug Report, Frontend name must differ from others by more than the last three characters
|
Hi,
I have developed two Midas front-end programs for different hardware. The frontend_name of the first one is "FSCD_SC" (slow control) and that of the second one is "FSCD_PS" (power supply).
Each front-end program runs fine separately, but when attempting to start FSCD_SC while FSCD_PS is running, FSCD_PS is terminated and Midas indicates "Previous frontend stopped" in the window where it starts FSCD_SC.
The problem is that these two frontend names only differ in their last two characters, and Midas currently does not distinguish them properly.
Looking in mfe.cxx we have:
int main(int argc, char *argv[])
{
...
/* shutdown previous frontend */
status = cm_shutdown(full_frontend_name, FALSE);
...
And looking in midas.cxx we have:
INT cm_shutdown(const char *name, BOOL bUnique) {
...
if (!bUnique)
client_name[strlen(name)] = 0; /* strip number */
...
The above line removes the last 3 characters of the front-end name before the subsequent comparison with other frontend names. Stripping the last 3 characters of the front-end name is correct for frontend programs that use the "-i" command line option to specify an index for that frontend, but all the characters of the front-end name should otherwise be kept for comparison.
I have changed the names of my frontend programs to avoid the interference, but it would be nice if the code that determines whether an instance of a frontend program is already running were corrected.
I hope this can help.
Best regards,
Denis. |
11 Oct 2024, Stefan Ritt, Bug Report, Frontend name must differ from others by more than the last three characters
|
Hi Denis,
indeed a bug. Will fix it next week.
Best,
Stefan
> Hi,
> I have developed two Midas front-end programs for different hardware. The frontend_name of the first one is "FSCD_SC" (slow control) and that of the second one is "FSCD_PS" (power supply).
>
> Each front-end program runs fine separately, but when attempting to start FSCD_SC while FSCD_PS is running, FSCD_PS is terminated and Midas indicates "Previous frontend stopped" in the window where it starts FSCD_SC.
>
> The problem is that these two frontend names only differ in their last two characters, and Midas currently does not distinguish them properly. |
18 Oct 2024, Stefan Ritt, Bug Report, Frontend name must differ from others by more than the last three characters
|
Fixed and committed.
Best,
Stefan
> Hi Denis,
>
> indeed a bug. Will fix it next week.
>
> Best,
> Stefan
>
>
> > Hi,
> > I have developed two Midas front-end programs for different hardware. The frontend_name of the first one is "FSCD_SC" (slow control) and that of the second one is "FSCD_PS" (power supply).
> >
> > Each front-end program runs fine separately, but when attempting to start FSCD_SC while FSCD_PS is running, FSCD_PS is terminated and Midas indicates "Previous frontend stopped" in the window where it starts FSCD_SC.
> >
> > The problem is that these two frontend names only differ in their last two characters, and Midas currently does not distinguish them properly. |
09 Oct 2024, Lukas Gerritzen, Suggestion, odbedit minor quality of life
|
I have made two minor quality of life changes to odbedit.
- cd command: Typing cd without arguments now changes the directory to /, similar to the behaviour of the cd command in Linux sending you to the home directory.
- Exit behavior: Upon exiting the program with Ctrl+C, a newline character is printed so that the command line starts on an empty line rather than the last line from odbedit.
Here's the diff:
@@ -1668,7 +1668,10 @@ int command_loop(char *host_name, char *exp_name, char *cmd, char *start_dir)
/* cd */
else if (param[0][0] == 'c' && param[0][1] == 'd') {
- compose_name(pwd, param[1], str);
+ if (strlen(param[1]) == 0)
+ strcpy(str, "/");
+ else
+ compose_name(pwd, param[1], str);
status = db_find_key(hDB, 0, str, &hKey);
@@ -2962,6 +2965,7 @@ void ctrlc_odbedit(INT i)
cm_disconnect_experiment();
+ printf("\n");
exit(EXIT_SUCCESS);
}
Please consider incorporating those changes to odbedit.
Lukas |
09 Oct 2024, Stefan Ritt, Suggestion, odbedit minor quality of life
|
Ok, accepted, done and pushed.
Stefan
Lukas Gerritzen wrote: | I have made two minor quality of life changes to odbedit.
- cd command: Typing cd without arguments now changes the directory to /, similar to the behaviour of the cd command in Linux sending you to the home directory.
- Exit behavior: Upon exiting the program with Ctrl+C, a newline character is printed so that the command line starts on an empty line rather than the last line from odbedit.
Here's the diff:
@@ -1668,7 +1668,10 @@ int command_loop(char *host_name, char *exp_name, char *cmd, char *start_dir)
/* cd */
else if (param[0][0] == 'c' && param[0][1] == 'd') {
- compose_name(pwd, param[1], str);
+ if (strlen(param[1]) == 0)
+ strcpy(str, "/");
+ else
+ compose_name(pwd, param[1], str);
status = db_find_key(hDB, 0, str, &hKey);
@@ -2962,6 +2965,7 @@ void ctrlc_odbedit(INT i)
cm_disconnect_experiment();
+ printf("\n");
exit(EXIT_SUCCESS);
}
Please consider incorporating those changes to odbedit.
Lukas |
|
05 Sep 2024, Jack Carlton, Forum, Python frontend rate limitations? 
|
I'm trying to get a sense of the rate limitations of a python frontend. I
understand this will vary from system to system.
I adapted two frontends from the example templates, one in C++ and one in python.
Both simply fill a midas bank with a fixed length array of zeros at a given polled
rate. However, the C++ frontend is about 100 times faster in both data and event
rates. This seems slow, even for an interpreted language like python. Furthermore,
I can effectively increase the maximum rate by concurrently running a second
python frontend (this is not the case for the C++ frontend). In short, there is
some limitation with using python here unrelated to hardware.
In my case, poll_func appears to be called at 100Hz at best. What limits the rate
that poll_func is called in a python frontend? Is there a more appropriate
solution for increasing the python frontend data/event rate than simply launching
more frontends?
I've attached my C++ and python frontend files for reference.
Thanks,
Jack |
05 Sep 2024, Ben Smith, Forum, Python frontend rate limitations?
|
> What limits the rate that poll_func is called in a python frontend?
First the general advice: if you reduce the "period" of your equipment, then your function will get called more frequently. You can set it to 0 and we'll call it as often as possible. You can set this in the ODB at "/Equipment/Python Data Simulator/Common/Period"
If that's still not fast enough, then you can return a *list* of events from your readout_func. I've seen real-world cases of 25kHz+ of midas events generated in this fashion.
However in your case the limitation is likely that you're sending 1.25MB per event and we have a lot of data marshalling to do between the python and C++ layer. In particular it takes 15ms on my machine to just pack the data into a memory buffer (see timeit command below). I am sure there must be a faster way to do this packing, especially in the case where the bank contains a numpy array rather than a python list.
I'll add it to my to-do list to investigate improving the performance of medium-to-large events in the python code.
Cheers,
Ben
P.S. You may have a bug in your calculations (depending on how you did your testing). In poll_func I think you should be updating the stats every time the function is called, not just the times when you return True.
P.P.S. Command I used to test how slow it is to pack the data. One-time setup of creating the buffers, then multiple tests of the pack_into function:
python -m timeit -s "import struct;import ctypes;arr = [0]*1250001;buf = ctypes.create_string_buffer(10000000);fmt = \">1250000d\"" "struct.pack_into(fmt, buf, *arr)"
20 loops, best of 5: 15.3 msec per loop |
05 Sep 2024, Stefan Ritt, Forum, Python frontend rate limitations?
|
> First the general advice: if you reduce the "period" of your equipment, then your function will get called more frequently.
> You can set it to 0 and we'll call it as often as possible. You can set this in the ODB at "/Equipment/Python Data Simulator/Common/Period"
Just for your general understanding: The "period" in the C framework works differently. It calls the poll function with a number,
and then that number is used in the poll function like (simplified):
poll(INT count) {
for (i=0 ; i<count ; i++)
if (new_event())
return TRUE;
return FALSE;
}
This ensures that polling is done as quickly as possible, even staying in the same function (poll) rather than called from the
framework in a loop (which would require a function call to poll each time). The "count" is determined from the framework
during startup of the framework such that the execution time of the poll() routine equals the "period". Like if the period
is 0.1, the count might be a few millions, so that the poll routine returns immediately when a new event occurs or when
100ms have expired. During the polling the frontend is "dead" meaning it cannot react on run transitions for example. That's
why most experiments use 0.1-0.5 seconds. But this does then NOT mean that you can only have 10-2 events per second, but that
the reaction time if the frontend is at maximum 0.1-0.5 seconds which is acceptable most of the case.
Due to this design, the C frontend is capable of producing millions of events per second. It took me some while in the early 1990's
to work out that scheme sitting in the "R" trailer at TRIUMF (old guys will remember...).
Best,
Stefan |
06 Sep 2024, Jack Carlton, Forum, Python frontend rate limitations?
|
Thanks for the responses, they were very helpful.
>First the general advice: if you reduce the "period" of your equipment, then your function will get called more frequently. You can set it to 0 and we'll
call it as often as possible.
Thanks, this solves the event rate limitation I described. I didn't think to change this because the "period" did not affect the observed rate in C (and now
I know why thanks to Stefan).
A couple more questions:
1.
For me,
python -m timeit -s "import struct;import ctypes;arr = [0]*1250001;buf = ctypes.create_string_buffer(10000000);fmt = \">1250000d\"" "struct.pack_into(fmt,
buf, *arr)"
10 loops, best of 3: 43.7 msec per loop
which suggests my maximum data rate is about 1.25 MB * 1000/43.7 Hz = 23 MB/s (?). But I see data rates up to 60 MB/s with a python frontend. Am I
misinterpreting the meaning of this result?
2. I can effectively bypass the rate limitations in python by running two concurrent frontends. For example, with one python frontend at best I can generate
60 MB/s of data (setting "period" to 0 now); but with two frontends I can double this to 120 MB/s. This implies one python frontend is not bottlenecked by
hardware limitations in my case.
Am I doing something wrong to artificially bottleneck my frontends? Perhaps there's a multi-threading solution I can implement to avoid needing multiple
frontends?
Thanks,
Jack |
11 Sep 2024, Konstantin Olchanski, Forum, Python frontend rate limitations?
|
>
> poll(INT count) {
> for (i=0 ; i<count ; i++)
> if (new_event())
> return TRUE;
> return FALSE;
> }
in the c++ frontend (tmfe.h) this loop usually runs in a separate thread, and I am now working on the linux magic to assign this thread maximum
uninterruptible priority. otherwise on my Cyclone-V FPGA SoC I see 1-10 msec dropouts, I think from taking ethernet interrupts.
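For reference, one common way to do that on Linux is SCHED_FIFO via pthread_setschedparam() (just a sketch, not necessarily what tmfe will end up doing; it needs CAP_SYS_NICE or root, and error handling is minimal):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void make_current_thread_realtime()
{
   sched_param sp{};
   sp.sched_priority = sched_get_priority_max(SCHED_FIFO);   // highest FIFO priority
   int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
   if (err != 0)
      std::fprintf(stderr, "pthread_setschedparam failed, error %d\n", err);
}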
K.O. |
27 Sep 2024, Ben Smith, Forum, Python frontend rate limitations?
|
> in your case the limitation is likely that you're sending 1.25MB per event and we have a lot of data marshalling to do between the python and C++ layer.
>
> I'll add it to my to-do list to investigate improving the performance of medium-to-large events in the python code.
I've now added better support for numpy arrays in the python code that encodes a `midas.event.Event` object. If you use the "correct" numpy data type then you can get vastly improved performance as numpy already stores the data in memory in the format that we need.
In your example, if you change
self.zero_buffer = [0] * self.total_data_size
to
self.zero_buffer = np.ndarray(self.total_data_size, np.int16)
then the max data rate of the frontend goes from 330MB/s to 7600MB/s on my laptop (a factor 20 improvement from one line of code!)
To ensure you're using the optimal numpy dtype for your bank, you can reference a dict called `midas.tid_np_formats`. For example `midas.tid_np_formats[midas.TID_SHORT]` is equivalent to `np.int16`. If you use an int16 array and write it as a TID_SHORT bank, then we'll use the fast path. If there is a mismatch, we'll have to do type conversions and will end up on the slow path. |
11 Sep 2024, Konstantin Olchanski, Forum, Python frontend rate limitations?
|
> I'm trying to get a sense of the rate limitations of a python frontend.
1) python is single-threaded; for ultimate performance, a MIDAS frontend (or any DAQ
application) has to be multithreaded (see the sketch after this list):
a) a thread with a busy loop to read the data and place it into a FIFO
b) thread to read data from FIFO and send it to SYSTEM buffer shared memory or to
mserver
c) thread to respond to begin-run, end-run, etc RPCs
d) probably a thread to recycle memory from thread (b) back to thread (a) if per-event
malloc()/free() adds too much overhead
2) data readout. C++ AXI bus access is compiled into 1 instruction and results in 1 AXI
bus operation. comparable code in python likely has much more overhead, which slows you down.
3) event bank filling. a C++ for() loop is compiled into very compact machine code,
a python loop cannot be, because each array element can be of a random data type, which slows you down.
bottom line, there is a reason high speed data acquisitions are written in C/C++, not
in shell, perl, tcl/tk, or (today's favourite) python.
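To make 1) concrete, here is a toy sketch of such a producer/consumer thread structure in plain C++ (std::thread plus a mutex-protected FIFO; the hardware read and the bm_send_event() call are only placeholders):

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

static std::mutex gMutex;
static std::condition_variable gCv;
static std::deque<std::vector<char>> gFifo;   // FIFO of raw events
static std::atomic<bool> gRunning{true};

static void reader_thread()   // (a) busy loop: read the data, place it into the FIFO
{
   while (gRunning) {
      std::vector<char> event(65536);   // placeholder: read one event from the hardware
      {
         std::lock_guard<std::mutex> lock(gMutex);
         gFifo.push_back(std::move(event));
      }
      gCv.notify_one();
      std::this_thread::sleep_for(std::chrono::microseconds(100));   // placeholder pacing
   }
}

static void sender_thread()   // (b) drain the FIFO, ship events to the SYSTEM buffer or mserver
{
   while (gRunning) {
      std::unique_lock<std::mutex> lock(gMutex);
      gCv.wait_for(lock, std::chrono::milliseconds(100), [] { return !gFifo.empty(); });
      while (!gFifo.empty()) {
         std::vector<char> event = std::move(gFifo.front());
         gFifo.pop_front();
         lock.unlock();
         // placeholder: bm_send_event() or a network write would go here
         lock.lock();
      }
   }
}

int main()
{
   std::thread a(reader_thread), b(sender_thread);
   // thread (c), answering begin-run/end-run RPCs, would live here in a real frontend
   std::this_thread::sleep_for(std::chrono::seconds(1));   // demo lifetime only
   gRunning = false;
   gCv.notify_all();
   a.join();
   b.join();
   return 0;
}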
> The C++ frontend is about 100 times faster in both data and event rates.
This is as expected. You can probably improve python code to get closer to 10 times
slower than C++. But consider:
a) will it be "fast enough" for the task?
b) learning C++ and optimizing python to within "2-3-10x slower than C++" may involve a
similar amount of time and effort.
And you have not looked at the real-time properties of your frontend. You may discover
that it's actually faster than you think, but occasionally stops for a millisecond (or
two or a hundred). some applications are notorious for running memory garbage collection
just at the wrong time.
I am working right now on exactly this problem, I have a 1 GHz ARM CPU (Cyclone-V FPGA)
and I need to push data out at 100 Mbytes/sec while avoiding any bad real-time dropouts
that cause the FPGA data FIFO to overflow. And I only have 2 CPU cores, 1 to read the
FPGA FIFO, 1 to run the TCP/IP stack and the ethernet driver. There is no way this can be done
with python.
K.O. |
11 Sep 2024, Konstantin Olchanski, Forum, Python frontend rate limitations?
|
> > I'm trying to get a sense of the rate limitations of a python frontend.
forgot one more:
c++ toolchain comes with extensive profiler tools aimed to answer the question "why is my
program so slow, where is it spending all the time?". some of these tools go all the way to
the hardware level and report CPU cache misses, TLB flushes, context switches and any other
hardware events that interrupt or slow down computations. the programmer then uses this
information to restructure the code to avoid the worst slow downs (i.e. avoid branch mis-
predictions, avoid cache misses, etc).
I doubt the python toolchain will ever have profiler tools as good.
K.O. |
22 Jun 2024, Joseph McKenna, Suggestion, manalyzer thread safety and custom http IP binding
|
Hi all, I hope this is the right place to post two pull requests, if not, please let me know where I should be submitting them
Both are fairly small changes, please see them listed below (more details written on the PRs themselves)
- Enable ROOT's thread safety when running in multithreaded mode
This helps avoid users having to write their call to a global thread lock when calling ->Fill() on ROOT histograms and Trees
https://bitbucket.org/tmidas/manalyzer/pull-requests/5
- Add command argument to specify an IP of the root HTTP server to bind to
This was a problem I painted around when at ALPHA (quickly hardcoding the right external IP address into the local build. Obviously a bad habit)
https://bitbucket.org/tmidas/manalyzer/pull-requests/6 |
05 Jul 2024, Joseph McKenna, Suggestion, shared pointers for more flexible memory managment of the analysis 'flow' and TMEvent
|
> Hi all, I hope this is the right place to post two pull requests, if not, please let me know where I should be submitting them
>
> Both are fairly small changes, please see them listed below (more details written on the PRs themselves)
>
>
> - Enable ROOT's thread safety when running in multithreaded mode
>
> This helps avoid users having to write their call to a global thread lock when calling ->Fill() on ROOT histograms and Trees
> https://bitbucket.org/tmidas/manalyzer/pull-requests/5
>
>
> - Add command argument to specify an IP of the root HTTP server to bind to
>
> This was a problem I painted around when at ALPHA (quickly hardcoding the right external IP address into the local build. Obviously a bad habit)
> https://bitbucket.org/tmidas/manalyzer/pull-requests/6
Further to the manalyzer pull requests above, I have another feature I would like to add. It took a little longer to test than planned... here I present an effort to use smart pointers to manage the lifetime of TMEvents and TAFlow.
I will be interested to discuss the implications of this pull request (it's possible to return to the previous 'raw' pointers via a cmake toggle)
https://bitbucket.org/tmidas/manalyzer/pull-requests/8 |
05 Jul 2024, Joseph McKenna, Suggestion, Clean up compiler warning in manalyzer
|
This is a super small pull request, simple replace deprecated sprintf with snprintf
https://bitbucket.org/tmidas/manalyzer/pull-requests/9 |
13 Sep 2024, Konstantin Olchanski, Suggestion, Clean up compiler warning in manalyzer
|
> This is a super small pull request, simple replace deprecated sprintf with snprintf
> https://bitbucket.org/tmidas/manalyzer/pull-requests/9
sprintf() is not deprecated and "char buf[256]; sprintf(buf, "%05d", 64-bit-int);" is safe, will never overflow.
we could bulk-convert all these sprintf() to snprintf() but I would rather wait for this:
https://en.cppreference.com/w/cpp/utility/format/format
let me think on this for a bit.
K.O. |
20 Sep 2024, Joseph McKenna, Suggestion, Clean up compiler warning in manalyzer
|
> > This is a super small pull request, simple replace deprecated sprintf with snprintf
> > https://bitbucket.org/tmidas/manalyzer/pull-requests/9
>
> sprintf() is not deprecated and "char buf[256]; sprintf(buf, "%05d", 64-bit-int);" is safe, will never overflow.
>
> we could bulk-convert all these sprintf() to snprintf() but I would rather wait for this:
>
> https://en.cppreference.com/w/cpp/utility/format/format
>
> let me think on this for a bit.
>
> K.O.
I completely agree that the 64-bit int is safe and will never overflow. Doing a little digging, both clang and gcc don't raise warnings on x86_64 (even with -Wall -Wextra -Wpedantic), even when I give it an impossibly small buffer (two bytes). However, I've narrowed down where the deprecation warning comes from: macOS
https://developer.apple.com/documentation/kernel/1441083-sprintf
I like the look of std::format, looks cleaner than string streams |
20 Sep 2024, Stefan Ritt, Suggestion, Clean up compiler warning in manalyzer
|
> I like the look of std::format, looks cleaner than string streams
I fully agree. String streams is a pain if you want to do zero-leading hex output mixed with decimal output. Yes it's easier to read if you don't know printf syntax,
but 10-20 times more chars to write and not necessarily cleaner.
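For the record, here is the comparison being made, with the three styles side by side for zero-padded hex mixed with decimal (the std::format line needs C++20):

#include <cstdio>
#include <format>      // C++20
#include <iomanip>
#include <iostream>
#include <sstream>

int main()
{
   int addr = 0x1a2b, count = 42;

   char buf[64];
   snprintf(buf, sizeof(buf), "addr=0x%08x count=%d", addr, count);      // printf style

   std::ostringstream oss;                                               // stream style
   oss << "addr=0x" << std::hex << std::setw(8) << std::setfill('0') << addr
       << " count=" << std::dec << count;

   std::string s = std::format("addr=0x{:08x} count={}", addr, count);   // std::format

   std::cout << buf << "\n" << oss.str() << "\n" << s << "\n";
   return 0;
}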
Problem is that we would have to convert a few thousand sprintf()'s in midas.
Stefan |
24 Sep 2024, Konstantin Olchanski, Suggestion, Clean up compiler warning in manalyzer
|
> > I like the look of std::format, looks cleaner than string streams
>
> I fully agree. String streams is a pain if you want to do zero-leading hex output mixed with decimal output. Yes it's easier to read if you don't know printf syntax,
> but 10-20 times more chars to write and not necessarily cleaner.
>
IMO c++ string streams formatting is optimized for "hello world" and is useless for printing hex numbers, table-formatted data and generally anything real-life.
plus the borked std::to_string() (it takes a global lock for the "C" locale), "fixed" it by introducing std::to_chars() in C++17,
with "ultimate fix" in std::format in C++26.
no question why C++ has the bad reputation. for a "done right" example, take a look at the Go standard library.
>
> Problem is that we would have to convert a few thousand sprintf()'s in midas.
>
surprisingly few bare sprintf() remaining in MIDAS, most of them overflow-safe and most of them to be converted to msprintf().
K.O. |
13 Sep 2024, Konstantin Olchanski, Suggestion, manalyzer thread safety and custom http IP binding
|
> - Enable ROOT's thread safety when running in multithreaded mode
> This helps avoid users having to write their call to a global thread lock when calling ->Fill() on ROOT histograms and Trees
> https://bitbucket.org/tmidas/manalyzer/pull-requests/5
merged by hand. (pull request shows a "rejected", bitbucket has no "merged manually" button).
also noted this change in the documentation: README.md
K.O. |
22 Sep 2024, Tam Kai Chung, Bug Report, Can we convert the .mid file into .root file
|
Dear experts,
I am a new user of MIDAS. I have just created some banks with a frontend.cxx code.
Now, I would like to do some analysis of the data.
I have an analyzer.cxx code (a very simple one without complicated routines).
I try to link the analyzer.o with rmana.o and libmidas.a to create analyzer.exe.
I am not sure whether I can do the analysis offline in the following way:
analyzer.exe -i run00001.mid -o run00001.root
When I run this command, I get the following error:
Error in <TClass::LoadClassInfo>: no interpreter information for class TSocket is available even though it has a TClass initialization routine.
I am using root 6.30
Any suggestion about this issue? Thank you.
Best,
Terry |
24 Sep 2024, Konstantin Olchanski, Bug Report, Can we convert the .mid file into .root file
|
"Can we convert the .mid file into .root file".
yes, you can, but the operation is under-defined. it's like asking "can I convert these stones into houses". the answer is "yes", but it involves
more than running a universal conversion program.
For this reason, I recommend against converting midas files "to root". for some types of midas data such a conversion makes no sense (i.e. alpha-g
streamed udp packets with chopped compressed waveforms).
I recommend that you analyze your data in the midas analyzer. You can start with manalyzer_example_root.cxx,
it shows how to create a ROOT histogram, how to access midas event bank data and call the TH1 "Fill" method.
Instead of filling histograms in the analyzer, you can create a ROOT TTree and fill it with data from midas data banks,
effectively you will create your own custom converter from midas to root.
The key thing is that it has to be a custom converter, because only you know the meaning of midas bank data
and how it should be best stored in a root tree.
K.O. |
04 Sep 2024, Stefan Ritt, Info, News MSCB++ API
|
I had two free afternoons and took the opportunity to write a new API for the MSCB
system. I'm not sure if anybody else actually uses MSCB (MIDAS slow control bus),
but anyhow.
The new API is contained in a single header file mscbxx.h, and it's extremely
simple to use. Here is some example code:
#include "mscbxx.h"
...
// connect to node 10 at submaster mscb123
midas::mscb m("mscb123", 10);
// print node info and all variables
std::cout << m << std::endl;
// refresh all variables (read from MSCB device)
m.read_range();
// access individual variables
float f = m[5]; // index access
f = m["In0"]; // name access
// write value to MSCB device
m["In0"] = 1.234;
...
Any feedback is welcome.
Stefan |
11 Sep 2024, Konstantin Olchanski, Info, News MSCB++ API
|
> Here is some example code:
>
> #include "mscbxx.h"
> f = m["In0"]; // name access
> m["In0"] = 1.234;
> Any feedback is welcome.
Where is the example of error handling?
K.O. |
24 Sep 2024, Stefan Ritt, Info, News MSCB++ API
|
> Where is the example of error handling?
#include "mscbxx.h"
#include "mexcept.h"
...
try {
// connect to node 10 at submaster mscb123
midas::mscb m("mscb123", 10);
// print a variable
std::cout << m["Input0"] << std::endl;
} catch (mexception e) {
std::cout << e << std::endl; // simply print exception
}
... |
16 Sep 2024, Marius Köppel, Bug Report, Crash using ODB watch
|
Hi all,
last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
In our software we have two functions: one which sets up the ODB values of the frontend and another which sets up all watch functions. So overall we connect two times to the ODB during frontend_init: one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
INT frontend_init() {
cm_msg(MINFO, "frontend_init() setup", "Test FE");
odb settings = {
{"Test", 123},
{"sub", {}}
};
settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
// settings.watch(watch); <-- this works without segmentation fault
odb new_settings("/Equipment/Test FE/Settings");
new_settings.watch(watch); // <-- here I am getting a segmentation fault
return CM_SUCCESS;
}
When I directly set the watch everything runs fine; however, when I create a new ODB object and use this one to set a watch I get the following segmentation fault:
Process 18474 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
93 if (po->m_data == nullptr)
94 mthrow("Callback received for a midas::odb object which went out of scope");
95 midas::odb *poh = search_hkey(po, hKey);
-> 96 poh->m_last_index = index;
97 po->m_watch_callback(*poh);
98 poh->m_last_index = -1;
99 }
Best,
Marius |
16 Sep 2024, Stefan Ritt, Bug Report, Crash using ODB watch
|
The answer is in the error message: „Object went out of scope“. When your frontend_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
destroyed object. This is like having a local string and returning a reference to that string from the function.
Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
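For illustration, a minimal sketch of that suggestion, with the path and watch() callback taken from the example in this thread (whether you use a struct or plain globals is a matter of taste):

#include "midas.h"
#include "odbxx.h"

void watch(midas::odb &o);   // the watch callback from the example above

// global structure holding all odb objects, so they stay alive after frontend_init() returns
struct FrontendOdb {
   midas::odb settings;
};
static FrontendOdb gOdb;

INT frontend_init()
{
   gOdb.settings.connect("/Equipment/Test FE/Settings");
   gOdb.settings.watch(watch);   // the watched object now outlives frontend_init()
   return CM_SUCCESS;
}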
Stefan
>
> last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
>
> In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
>
> INT frontend_init() {
>
> cm_msg(MINFO, "frontend_init() setup", "Test FE");
>
> odb settings = {
> {"Test", 123},
> {"sub", {}}
> };
> settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> // settings.watch(watch); <-- this works without segmentation fault
>
> odb new_settings("/Equipment/Test FE/Settings");
> new_settings.watch(watch); // <-- here I am getting a segmentation fault
>
> return CM_SUCCESS;
> }
>
> When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
>
> Process 18474 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> 93 if (po->m_data == nullptr)
> 94 mthrow("Callback received for a midas::odb object which went out of scope");
> 95 midas::odb *poh = search_hkey(po, hKey);
> -> 96 poh->m_last_index = index;
> 97 po->m_watch_callback(*poh);
> 98 poh->m_last_index = -1;
> 99 }
>
> Best,
> Marius |
16 Sep 2024, Marius Koeppel, Bug Report, Crash using ODB watch
|
This is not the case here. Note that the error message "Callback received for a midas::odb object which went out of scope" is not triggered! The segmentation fault happens later, at line 96.
> The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
> destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.
>
> Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
>
> Stefan
>
> >
> > last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> >
> > In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> >
> > INT frontend_init() {
> >
> > cm_msg(MINFO, "frontend_init() setup", "Test FE");
> >
> > odb settings = {
> > {"Test", 123},
> > {"sub", {}}
> > };
> > settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> > // settings.watch(watch); <-- this works without segmentation fault
> >
> > odb new_settings("/Equipment/Test FE/Settings");
> > new_settings.watch(watch); // <-- here I am getting a segmentation fault
> >
> > return CM_SUCCESS;
> > }
> >
> > When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> >
> > Process 18474 stopped
> > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> > frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> > 93 if (po->m_data == nullptr)
> > 94 mthrow("Callback received for a midas::odb object which went out of scope");
> > 95 midas::odb *poh = search_hkey(po, hKey);
> > -> 96 poh->m_last_index = index;
> > 97 po->m_watch_callback(*poh);
> > 98 poh->m_last_index = -1;
> > 99 }
> >
> > Best,
> > Marius |
16 Sep 2024, Stefan Ritt, Bug Report, Crash using ODB watch
|
Well, the object *went* out of scope. It is hard for my code to detect this, so the error reporting is poor. Also, the first object should have the same
problem; it is just by accident that it does not crash.
Stefan
> This is not the case here. Note that the error message: "Callback received for a midas::odb object which went out of scope" is not called! The segmentation fault happens later line 96.
>
> > The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
> > destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.
> >
> > Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
> >
> > Stefan
> >
> > >
> > > last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> > >
> > > In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> > >
> > > INT frontend_init() {
> > >
> > > cm_msg(MINFO, "frontend_init() setup", "Test FE");
> > >
> > > odb settings = {
> > > {"Test", 123},
> > > {"sub", {}}
> > > };
> > > settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> > > // settings.watch(watch); <-- this works without segmentation fault
> > >
> > > odb new_settings("/Equipment/Test FE/Settings");
> > > new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > >
> > > return CM_SUCCESS;
> > > }
> > >
> > > When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> > >
> > > Process 18474 stopped
> > > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> > > frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> > > 93 if (po->m_data == nullptr)
> > > 94 mthrow("Callback received for a midas::odb object which went out of scope");
> > > 95 midas::odb *poh = search_hkey(po, hKey);
> > > -> 96 poh->m_last_index = index;
> > > 97 po->m_watch_callback(*poh);
> > > 98 poh->m_last_index = -1;
> > > 99 }
> > >
> > > Best,
> > > Marius |
16 Sep 2024, Marius Koeppel, Bug Report, Crash using ODB watch
|
Okay, but then this is a big issue IMO. For Mu3e we do this in every frontend, and I checked again: all of these watches are broken at the moment (with commit 3ad98c5 they worked).
In the old style we did, for example (see https://bitbucket.org/tmidas/midas/src/develop/examples/crfe/crfe.cxx):
INT frontend_init()
{
   HNDLE hKey;

   // create Settings structure in ODB
   db_create_record(hDB, 0, "Equipment/Clock Reset/Settings", strcomb1(cr_settings_str).c_str());
   db_find_key(hDB, 0, "/Equipment/Clock Reset", &hKey);
   assert(hKey);

   db_watch(hDB, hKey, cr_settings_changed, NULL);

   /*
    * Set our transition sequence. The default is 500. Setting it
    * to 600 means we are called AFTER most other clients.
    */
   cm_set_transition_sequence(TR_START, 600);

   return CM_SUCCESS;
}
I thought this would be the same (under the hood) in the current odbxx way:
odb settings("Equipment/Clock Reset/Settings");
settings.watch(cr_settings_changed);
Best,
Marius
> Well, the object *went* out of scope. For my code it‘s hard to realize this, so the error reporting is poor. Also the first object should have the same
> problem. Just by accident that it does not crash.
>
> Stefan
>
> > This is not the case here. Note that the error message: "Callback received for a midas::odb object which went out of scope" is not called! The segmentation fault happens later line 96.
> >
> > > The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
> > > destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.
> > >
> > > Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
> > >
> > > Stefan
> > >
> > > >
> > > > last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> > > >
> > > > In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> > > >
> > > > INT frontend_init() {
> > > >
> > > > cm_msg(MINFO, "frontend_init() setup", "Test FE");
> > > >
> > > > odb settings = {
> > > > {"Test", 123},
> > > > {"sub", {}}
> > > > };
> > > > settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> > > > // settings.watch(watch); <-- this works without segmentation fault
> > > >
> > > > odb new_settings("/Equipment/Test FE/Settings");
> > > > new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > > >
> > > > return CM_SUCCESS;
> > > > }
> > > >
> > > > When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> > > >
> > > > Process 18474 stopped
> > > > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> > > > frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> > > > 93 if (po->m_data == nullptr)
> > > > 94 mthrow("Callback received for a midas::odb object which went out of scope");
> > > > 95 midas::odb *poh = search_hkey(po, hKey);
> > > > -> 96 poh->m_last_index = index;
> > > > 97 po->m_watch_callback(*poh);
> > > > 98 poh->m_last_index = -1;
> > > > 99 }
> > > >
> > > > Best,
> > > > Marius |
16 Sep 2024, Mark Grimes, Bug Report, Crash using ODB watch
|
Hi,
Maybe I've misunderstood the code, but odb::watch() creates a deep copy of itself and attaches the watch to that copy. The comment where this happens states that this is done in case the current object goes out of scope. See https://bitbucket.org/tmidas/midas/src/2878647fb73648474b35223ce53a125180f751b3/src/odbxx.cxx#lines-1393:1395
So as far as I can tell, allowing the current odb instance to go out of scope is supported.
Thanks,
Mark.
> Okay, but this is then a big issue IMO. For Mu3e we do this in every frontend and I also checked again all of these watches are broken at the moment (with commit 3ad98c5 they worked).
>
> In the old style we did for example (see https://bitbucket.org/tmidas/midas/src/develop/examples/crfe/crfe.cxx):
>
> INT frontend_init()
> {
> HNDLE hKey;
>
> // create Settings structure in ODB
> db_create_record(hDB, 0, "Equipment/Clock Reset/Settings", strcomb1(cr_settings_str).c_str());
> db_find_key(hDB, 0, "/Equipment/Clock Reset", &hKey);
> assert(hKey);
>
> db_watch(hDB, hKey, cr_settings_changed, NULL);
>
> /*
> * Set our transition sequence. The default is 500. Setting it
> * to 600 means we are called AFTER most other clients.
> */
> cm_set_transition_sequence(TR_START, 600);
>
> return CM_SUCCESS;
> }
>
> I thought this will be the same (under the hood) in the current odbxx way via:
>
> odb settings("Equipment/Clock Reset/Settings");
> settings.watch(cr_settings_changed);
>
> Best,
> Marius
>
>
> > Well, the object *went* out of scope. For my code it‘s hard to realize this, so the error reporting is poor. Also the first object should have the same
> > problem. Just by accident that it does not crash.
> >
> > Stefan
> >
> > > This is not the case here. Note that the error message: "Callback received for a midas::odb object which went out of scope" is not called! The segmentation fault happens later line 96.
> > >
> > > > The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
> > > > destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.
> > > >
> > > > Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
> > > >
> > > > Stefan
> > > >
> > > > >
> > > > > last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> > > > >
> > > > > In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> > > > >
> > > > > INT frontend_init() {
> > > > >
> > > > > cm_msg(MINFO, "frontend_init() setup", "Test FE");
> > > > >
> > > > > odb settings = {
> > > > > {"Test", 123},
> > > > > {"sub", {}}
> > > > > };
> > > > > settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> > > > > // settings.watch(watch); <-- this works without segmentation fault
> > > > >
> > > > > odb new_settings("/Equipment/Test FE/Settings");
> > > > > new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > > > >
> > > > > return CM_SUCCESS;
> > > > > }
> > > > >
> > > > > When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> > > > >
> > > > > Process 18474 stopped
> > > > > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> > > > > frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> > > > > 93 if (po->m_data == nullptr)
> > > > > 94 mthrow("Callback received for a midas::odb object which went out of scope");
> > > > > 95 midas::odb *poh = search_hkey(po, hKey);
> > > > > -> 96 poh->m_last_index = index;
> > > > > 97 po->m_watch_callback(*poh);
> > > > > 98 poh->m_last_index = -1;
> > > > > 99 }
> > > > >
> > > > > Best,
> > > > > Marius |
17 Sep 2024, Konstantin Olchanski, Bug Report, Crash using ODB watch
|
> {
> odb new_settings("/Equipment/Test FE/Settings");
> new_settings.watch(watch); // <-- here I am getting a segmentation fault
> }
This code has a bug: "watch" is attached to the object "new_settings", which is deleted
after the closing curly bracket.
I would say Stefan's odb API should not allow you to write code like this; an API defect.
K.O. |
18 Sep 2024, Marius Koeppel, Bug Report, Crash using ODB watch
|
I created a PR to fix this issue: https://bitbucket.org/tmidas/midas/pull-requests/42.
The crash happened because, since the change in commit 3ad98c5, the ODB is always retrieved via XML.
However, the creation from XML should only be used when a user wants to read fast (and when we are on a remote machine), so I added the flag use_from_xml to specify this explicitly.
> > {
> > odb new_settings("/Equipment/Test FE/Settings");
> > new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > }
>
> this code has a bug. "watch" is attached to object "new_settings" that is deleted
> after the closing curly bracket.
> I would say Stefan's odb API should not allow you to write code like this. an API defect.
As pointed out in the thread, this feature is explicitly supported by odbxx.cxx:
void odb::watch(std::function<void(midas::odb &)> f) {
   if (m_hKey == 0 || m_hKey == -1)
      mthrow("watch() called for ODB key \"" + m_name +
             "\" which is not connected to ODB");

   // create a deep copy of current object in case it
   // goes out of scope
   midas::odb* ow = new midas::odb(*this);

   ow->m_watch_callback = f;
   db_watch(s_hDB, m_hKey, midas::odb::watch_callback, ow);

   // put object into watchlist
   g_watchlist.push_back(ow);
}
Also, in the old way (see for example https://bitbucket.org/tmidas/midas/src/191d13f98626fae533cbca17b00df7ee361edf16/examples/crfe/crfe.cxx#lines-126) it was possible to create a watch inside a scope without the user having to make sure that the "object" does not go out of scope.
I think this feature should be supported by the framework.
Best,
Marius |
20 Sep 2024, Stefan Ritt, Bug Report, Crash using ODB watch
|
The problem has been fixed in the current version. Here is my analysis:
- The midas::odb object *can* go out of scope in the function, since the odb::watch() function creates a deep copy of the object.
This does not cause a memory leak if one calls odb::unwatch_all() at the end of a program.
- The creation from XML had a flaw: the ODB key handle ("hKey") was not initialized, since it is not passed by the db_copy_xml() function.
I added code to db_copy_xml() to also fetch the key handle in the XML file, which fixes the issue. Please note that you have to
update both the server and the client side of midas to get this functionality if you are using it from a remote client.
- I saw the flag MK added in his pull request to the constructor of odb::odb(). This is a way to fight the symptoms (by creating an
object the "old" way if not otherwise needed), but now we have the cause cured. Nevertheless I added that parameter, but set it to true by default:
odb::odb(const std::string &str, bool init_via_xml = true);
since this should be fully working now and should always be faster than the old method. I only keep it for debugging, should we observe
another flaw in odb_from_xml().
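To make the resulting usage pattern concrete, here is a minimal sketch (the frontend hooks and the callback name are placeholders, assuming the odbxx API as quoted above): the watch is set on a local object, which is safe because watch() keeps its own deep copy, and odb::unwatch_all() is called once at program end to release those copies.

#include "midas.h"
#include "odbxx.h"

// callback invoked when something below the watched path changes
void settings_changed(midas::odb &o) {
   // react to the new settings here
}

INT frontend_init() {
   midas::odb settings("/Equipment/Test FE/Settings");  // local object is fine:
   settings.watch(settings_changed);                    // watch() stores a deep copy internally
   return CM_SUCCESS;
}

INT frontend_exit() {
   midas::odb::unwatch_all();   // free the deep copies created by watch()
   return CM_SUCCESS;
}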
Best regards,
Stefan |
26 Jul 2024, Lukas Gerritzen, Bug Fix, strlcpy and strlcat added to glibc 2.38
|
A year ago, these two functions were included in glibc. When trying to compile MIDAS with a recent version of
Ubuntu or Fedora, one gets errors like this:
/usr/include/string.h:506:15: error: declaration of ‘size_t strlcpy(char*, const char*, size_t) noexcept’ has a
different exception specifier
506 | extern size_t strlcpy (char *__restrict __dest,
| ^~~~~~~
In file included from /home/luk/midas/src/midas.cxx:14:
/home/luk/midas/include/midas.h:2190:17: note: from previous declaration ‘size_t strlcpy(char*, const
char*, size_t)’
My proposed solution is a check in midas.h around line 248:
#if (__GLIBC__ > 2) || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 38)
#ifndef HAVE_STRLCPY
#define HAVE_STRLCPY 1
#endif
#endif
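For context, the intent of defining HAVE_STRLCPY is presumably to suppress MIDAS's own prototypes (the ones at midas.h:2190 in the error above), roughly along these lines (a sketch; the exact declarations in midas.h may differ):

/* sketch of the guarded prototypes in midas.h; with glibc >= 2.38 the
   system <string.h> already provides these, so HAVE_STRLCPY skips them */
#ifndef HAVE_STRLCPY
extern size_t strlcpy(char *dst, const char *src, size_t size);
extern size_t strlcat(char *dst, const char *src, size_t size);
#endif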
|
26 Jul 2024, Stefan Ritt, Bug Fix, strlcpy and strlcat added to glibc 2.38
|
Good catch. I added your code to the current develop branch of MIDAS.
Stefan |
13 Sep 2024, Konstantin Olchanski, Bug Fix, mstrcpy, was: strlcpy and strlcat added to glibc 2.38
|
For the record, as the ultimate solution, strlcpy() and strlcat() were wholesale
replaced by mstrlcpy() and mstrlcat(). This should fix the "missing strlcpy()"
problem for good and make midas more consistent across all platforms (including
non-linux, non-unix). On my side, I continue replacing these functions with proper
std::string operations. K.O. |
04 Jul 2024, Nick Hastings, Forum, mfe.cxx with RO_STOPPED and EQ_POLLED
|
Dear Midas experts,
I noticed that a check was added to mfe.cxx in 1961af0d6:
+ /* check for consistent common settings */
+ if ((eq_info->read_on & RO_STOPPED) &&
+ (eq_info->eq_type == EQ_POLLED ||
+ eq_info->eq_type == EQ_INTERRUPT ||
+ eq_info->eq_type == EQ_MULTITHREAD ||
+ eq_info->eq_type == EQ_USER)) {
+ cm_msg(MERROR, "register_equipment", "Events \"%s\" cannot be read when run is stopped (RO_STOPPED flag)", equipment[idx].name);
+ return 0;
+ }
This commit was by Stefan in May 2022.
A commit a few days later, 28d9c96bd, removed the "return 0;" and updated the
error message to:
"Equipment \"%s\" contains RO_STOPPED or RO_ALWAYS. This can lead to undesired side-effect and should be removed."
So such FEs can run, but there is still an error at startup. The
documentation at https://daq00.triumf.ca/MidasWiki/index.php/ReadOn_Flags
states that with RO_STOPPED the "Readout Occurs" "Before stopping run".
This seems to indicate that removing the RO_STOPPED bit from a SC FE
would just result in one additional read not happening just prior to a run
stop. However, reading scheduler() in mfe.cxx, I see in the main loop:
if (run_state == STATE_STOPPED && (eq_info->read_on & RO_STOPPED) == 0)
continue;
So it seems to me that an EQ_PERIODIC equipment needs RO_STOPPED to be set,
otherwise it will not read out data while there is no DAQ run.
Can someone explain the purpose of this check and error message? Perhaps it
was put in place with only DAQ FEs, not SC FEs in mind? And should the
documentation in the wiki actually be "s/Before stopping run/While run is stopped/"?
Thanks,
Nick. |
06 Aug 2024, Stefan Ritt, Forum, mfe.cxx with RO_STOPPED and EQ_POLLED
|
> I noticed that a check was added to mfe.cxx in 1961af0d6:
>
> + /* check for consistent common settings */
> + if ((eq_info->read_on & RO_STOPPED) &&
> + (eq_info->eq_type == EQ_POLLED ||
> + eq_info->eq_type == EQ_INTERRUPT ||
> + eq_info->eq_type == EQ_MULTITHREAD ||
> + eq_info->eq_type == EQ_USER)) {
> + cm_msg(MERROR, "register_equipment", "Events \"%s\" cannot be read when run is stopped (RO_STOPPED flag)", equipment[idx].name);
> + return 0;
> + }
>
>
> Can someone explain the purpose of this check and error message? Perhaps it
> was put in place with only DAQ FEs, not SC FEs in mind? And should the
> documentation in the wiki actually be "s/Before stopping run/While run is stopped/"?
Indeed you have two types of events handled by mfe.cxx: slow control events (EQ_SLOW or EQ_PERIODIC) and triggered events (EQ_POLLED,
EQ_INTERRUPT, EQ_MULTITHREAD or EQ_USER). For slow control events it can make sense to read them also when the run is stopped; that's why you
can specify RO_STOPPED or RO_ALWAYS. This does however not make sense for triggered events: reading triggered events when the run is stopped
invalidates the concept of runs (= read triggered events only during a run). We had cases where people mixed this up, so the warning was added.
If you have a slow control event you want to read while the run is stopped, make sure it is of type EQ_SLOW or EQ_PERIODIC.
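As an illustration, here is a sketch of a single slow-control equipment entry in an mfe.cxx frontend, modelled on the standard MIDAS example frontends (the equipment name, event ID, period and readout routine are placeholders, and the rest of the frontend boilerplate such as frontend_init() and poll_event() is omitted); the point is only the EQ_PERIODIC / RO_ALWAYS combination, as opposed to a triggered equipment which would use EQ_POLLED with RO_RUNNING:

#include "midas.h"
#include "mfe.h"

// placeholder readout routine; a real frontend would create a bank here
INT read_environment_event(char *pevent, INT off)
{
   return 0;   // event size in bytes (nothing read in this sketch)
}

EQUIPMENT equipment[] = {

   {"Environment",            /* equipment name (placeholder) */
    {10, 0,                   /* event ID, trigger mask */
     "SYSTEM",                /* event buffer */
     EQ_PERIODIC,             /* slow-control type: may be read outside a run */
     0,                       /* event source */
     "MIDAS",                 /* format */
     TRUE,                    /* enabled */
     RO_ALWAYS | RO_ODB,      /* read during and between runs, copy to ODB */
     10000,                   /* read every 10 s */
     0,                       /* stop run after this event limit */
     0,                       /* number of sub events */
     1,                       /* log history */
     "", "", "",},
    read_environment_event,   /* readout routine */
    },

   {""}
};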
Stefan |
13 Sep 2024, Konstantin Olchanski, Bug Report, mfe.cxx with RO_STOPPED and EQ_POLLED
|
> > I noticed that a check was added to mfe.cxx in 1961af0d6:
This is the reason I recommend against using mfe.cxx-based frontends. There was never any
proper documentation on how they work and what the different settings in ODB Common
and elsewhere do. My attempts to document it by reverse-engineering were only partially
successful. Since then a number of changes were made that were also hard to impossible
to document.
I recommend that everybody use the new C++ tmfe frontend, which was designed for easy documentation
and explanation. See tmfe.md for full documentation.
(Pending improvements are to integrate TMEvent support and to add the data-transmit thread and event FIFO.)
K.O. |
|