Back Midas Rome Roody Rootana
  Midas DAQ System, Page 120 of 146  Not logged in ELOG logo
New entries since:Wed Dec 31 16:00:00 1969
ID Date Authordown Topic Subject
  2843   13 Sep 2024 Konstantin OlchanskiSuggestionmanalyzer thread safety and custom http IP binding
> - Enable ROOT's thread safety when running in multithreaded mode
> This helps avoid users having to write their call to a global thread lock when calling ->Fill() on ROOT histograms and Trees
> https://bitbucket.org/tmidas/manalyzer/pull-requests/5

merged by hand. (pull request shows a "rejected", bitbucket has no "merged manually" button).

also noted this change in the documentation: README.md

K.O.
  2844   13 Sep 2024 Konstantin OlchanskiSuggestionClean up compiler warning in manalyzer
> This is a super small pull request, simple replace deprecated sprintf with snprintf
> https://bitbucket.org/tmidas/manalyzer/pull-requests/9

sprintf() is not deprecated and "char buf[256]; sprintf(buf, "%05d", 64-bit-int);" is safe, will never overflow.

we could bulk-convert all these sprintf() to snprintf() but I would rather wait for this:

https://en.cppreference.com/w/cpp/utility/format/format

let me think on this for a bit.

K.O.
  2851   17 Sep 2024 Konstantin OlchanskiBug ReportCrash using ODB watch
> {
> odb new_settings("/Equipment/Test FE/Settings");
> new_settings.watch(watch); // <-- here I am getting a segmentation fault
> }

this code has a bug. "watch" is attached to object "new_settings" that is deleted
after the closing curly bracket.

I would say Stefan's odb API should not allow you to write code like this. an API defect.

K.O.
  2858   24 Sep 2024 Konstantin OlchanskiBug ReportCan we convert the .mid file into .root file
"Can we convert the .mid file into .root file".

yes, you can, but the operation is under-defined. it's like asking "can I convert these stones into houses". the answer is "yes", but it involves 
more than running a universal conversion program.

For this reason, I recommend against converting midas files "to root". for some types of midas data such a conversion makes no sense (i.e. alpha-g 
streamed udp packets with chopped compressed waveforms).

I recommend that you analyze you data in the midas analyzer. You can start with manalyzer_example_root.cxx,
it shows how to create a ROOT histogram, how to access midas event bank data and call the TH1 "Fill" method.

Instead of filling histograms in the analyzer, you can create a ROOT TTree and fill it with data from midas data banks,
effectively you will create your own custom converter from midas to root.

The key thing is that it has to be a custom converter, because only you know the meaning of midas bank data
and how it should be best stored in a root tree.

K.O.
  2859   24 Sep 2024 Konstantin OlchanskiSuggestionClean up compiler warning in manalyzer
> > I like the look of std::format, looks cleaner than string streams
> 
> I fully agree. String streams is a pain if you want to do zero-leading hex output mixed with decimal output. Yes it's easier to read if you don't know printf syntax,
> but 10-20 times more chars to write and not necessarily cleaner.
>

IMO c++ string streams formatting is optimized for "hello world" and is useless for printing hex numbers, table-formatted data and generally anything real-life.

plus the borked std::to_string() (it takes a global lock for the "C" locale), "fixed" it by introducing std::to_chars() in C++17,
with "ultimate fix" in std::format in C++26.

no question why C++ has the bad reputation. for a "done right" example, take a look at the Go standard library.

> 
> Probable is that we would have to convert about a few thousand of sprintf's() in midas.
> 

surprising few bare sprintf() remaining in MIDAS, most of them overflow-safe and most of them to be converted to msprintf().

K.O.
  2863   07 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual 
> Machine and are getting a persistent error when we run mserver.
>
> [mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, 
> timeout 10000 ms, exiting...
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while 
> already inside db_{lock,unlock}_database(). Maybe this is a call from a signal 
> handler. Cannot continue, aborting...
> Aborted (core dumped)

This is super very bad. Since you have a core dump, please post the stack trace here (or email it to me).

I probably cannot debug your private version of midas and I will recommend that you install and run vanilla midas 
mserver (just while we debug this problem).

Let's look at the core dump stack trace first, but likely we see a problem with System-V semaphores and hopefully it 
is not some breakage due to Red Hat bogosity or due to something specific to running on a virtual machine.

If indeed this is Linux-kernel level breakage of System-V semaphores, solution would be to start using Posix 
semaphores, something I wanted to do for a long time. We already switched MIDAS shared memory from System-V to Posix 
shared memory.

If we are lucky it is just one more crasher bug in ODB. Let's see that core dump stack trace.

K.O.
  2867   08 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.

I cannot read the core dump without the corresponding executable (and likely all it's shared libraries).

It is best if you run gdb and extract the stack traces on your end.

In case you are not familiar with gdb:

gdb mserver core # start gdb
bt # stack trace of crashed thread
info thr # get list of threads
thr 1
bt
thr 2
bt
# etc, get stack trace of each thread, there should not be too many of them

K.O.
  2868   08 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
I read these error messages. There is no ODB corruption. ODB semaphore is locked and all midas programs will fail, they will timeout trying to get the lock, report the timeout, then it looks like a bug was introduced where instead of hard exit or abort() they attempt a clean shutdown which crashes from a recursive call in db_lock_database(). Amy's 
core dump should confirm this.

K.O.


> > > We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual 
> > > Machine and are getting a persistent error when we run mserver.  As far as I 
> > > know there are minimal changes between this and the MIDAS branch, but Ben Smith 
> > > may have more to say on this.
> > 
> > For reference, "the SuperCDMS version of MIDAS" is just a fork that no longer has any meaningful differences vs the main MIDAS repo, but we only pull updates infrequently after testing a bunch. We last pulled from the develop branch in November 2023. But that should be irrelevant here as semaphore code hasn't been touched for a very long time.
> > 
> > We're running Alma 9.4 on a machine at TRIUMF and the same version of midas works fine there (Amy, you may already have access to scdms-zeus). I believe Alma and Rocky should be basically identical for this.
> > 
> > So the questions are:
> > * Have you tried other midas programs, or only mserver? E.g. did odbedit and mhttpd work?
> > * If other programs work, have you been running them all as the same user? In particular, if you ran one program as root and another as an unprivileged user, then you will likely get odd permissions issues.
> > * What do you see if you run `ls -l /dev/shm` and `ls -l ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM`? (Or wherever your online dir is for the 2nd one).
> > * Did you follow the full instructions for recovering from a corrupt ODB? https://daq00.triumf.ca/MidasWiki/index.php/FAQ#How_to_recover_from_a_corrupted_ODB   In particular the bit about running odbinit with the --cleanup flag?
> 
> Here's what happens when I try to run odbedit:
> 
> [lekhraj@sdfcdmsdaq setup]$ odbedit
> [ODBEdit,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 481823 does not exists
> [ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Corrected 1 ODB entries
> [ODBEdit,INFO] Deleted entry '/System/Clients/481823' for client 'ODBEdit' because it is not connected to ODB
> [ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 481823 does not exist
> [ODBEdit,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, exiting...
> [ODBEdit,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment not called at end of program
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while already inside db_{lock,unlock}_database(). Maybe this is a call from a signal handler. Cannot continue, aborting...
> Aborted (core dumped)
> [lekhraj@sdfcdmsdaq setup]$
> 
> And mhttpd:
> 
> [lekhraj@sdfcdmsdaq setup]$ mhttpd
> [mhttpd,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 601054 does not exists
> [mhttpd,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [mhttpd,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [mhttpd,INFO] Corrected 1 ODB entries
> [mhttpd,INFO] Deleted entry '/System/Clients/601054' for client 'ODBEdit' because it is not connected to ODB
> [mhttpd,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 601054 does not exist
> [mhttpd,INFO] ODB subtree /Runinfo corrected successfully
> Password protection is off
> Hostlist off, connections from anywhere will be accepted
> Listening on "http://localhost:8080", passwords OFF, hostlist OFF
> Listening on "http://[::1]:8080", passwords OFF, hostlist OFF
> bm_lock_buffer: Lock buffer "SYSMSG" is taking longer than 1 second!
> [mhttpd,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, exiting...
> [mhttpd,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment not called at end of program
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while already inside db_{lock,unlock}_database(). Maybe this is a call from a signal handler. Cannot continue, aborting...
> Aborted (core dumped)
> [lekhraj@sdfcdmsdaq setup]$
> 
> We have been running everything as a single user, the user who cloned the repositories and owns the directories.
> 
> We did follow the corrupted-ODB cleanup instructions.
> 
> [lekhraj@sdfcdmsdaq setup]$ ls -lh /dev/shm
> total 1.3M
> -rw------- 1 lekhraj dm 1.2M Oct  8 14:13 17468_test_ODB__sdf_home_l_lekhraj_packages_SuperCDMS_DAQ_MidasDAQ_online_
> -rw------- 1 lekhraj dm 114K Oct  7 14:06 17468_test_SYSMSG__sdf_home_l_lekhraj_packages_SuperCDMS_DAQ_MidasDAQ_online_
> 
> [lekhraj@sdfcdmsdaq setup]$ ls -lh ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ALARM.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ELOG.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.HISTORY.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.LAZY.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.MSG.SHM
> -rw-r--r-- 1 lekhraj dm 1.2M Oct  8 14:12 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ODB.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.SYSMSG.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.SYSTEM.SHM
  2869   08 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> Basically I think Rocky 9.4 is a red herring.

This is what likely happened:
- some program crashed while holding the ODB lock semaphore (by ctrl-C at the wrong time or by kill -KILL at the wring time)
- semaphore is locked with a flag "unlock if this program stops"
- this is supposed to ensure ODB lock semaphore never gets stuck in the locked state
- (there is no code in MIDAS to unlock the ODB lock semaphore without locking it first)
- we have observed a malfunction in the Linux kernel, where this automatic unlock does not happen.
- it is rare and so far cannot be reproduced. you can find more about it by searching this forum.
- I think it is a bug in the System-V semaphore code or in the Linux "program stop" code (a path where they fail to call the semaphore unlock handler, and who knows what 
other handlers).
- System-V semaphores are very obsolete, replaced by POSIX semaphores.
- POSIX semaphores do not have the "unlock if this program stops" magic, the user (MIDAS) is responsible with detecting that the program who locked the semaphore is gone 
and and with taking corrective action, i.e. release the lock, automatically.

I do not know if this problem with System-V semaphores is in the generic Linux kernel, or if it is specific
to the Red Hat kernels (they are known to have many patches, deviating quite far from vanilla kernels).

I do not know if this problem is somehow sensitive to virtual machines.

So yes/no, Red Hat derived linux on a virtual machine could be where this problem happens more often that elsewhere.

K.O.
  2870   08 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> I read these error messages. There is no ODB corruption. ODB semaphore is locked and all midas programs will fail...

Recovery from this is:
- stop all midas programs (actually they should have all crashed by now)
- identify the ODB semaphore with: ipcs -s -t
- remove the ODB semaphore with: ipcrm sem <semid>
- where <semid> is from the first column of ipcs
- keep deleting semaphores until odbedit works.
- if you delete extra midas sempahores, odbedit will recreate them
- if you delete non-midas semaphores, oh, well...

Little bit better steps for this recovery may be written up by Suzannah in this forum
or in the midas wiki... good luck finding them.

K.O.
  2876   13 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
Thank you for the stack trace, I fixed the buglet that cause midas programs to crash twice,
once on failure to lock ODB, then call exit() -> atexit() handlers -> cm_check_connect() -> crash on ODB lock 
failure is the cm_msg() codes.

Replaced exit(1) with abort(). Could have used kill(getpid(),SIGKILL) to avoid making a core dump, but what the 
heck...

Of course this does nothing to the original bug where ODB was locked and nobody will ever unlock it (reboot will 
unlock it!).

commit bdd1d7fdc093b5a8d54a1b8467002bb3cac3ac11

K.O.


> > > I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.
> > 
> > I cannot read the core dump without the corresponding executable (and likely all it's shared libraries).
> > 
> > It is best if you run gdb and extract the stack traces on your end.
> > 
> > In case you are not familiar with gdb:
> > 
> > gdb mserver core # start gdb
> > bt # stack trace of crashed thread
> > info thr # get list of threads
> > thr 1
> > bt
> > thr 2
> > bt
> > # etc, get stack trace of each thread, there should not be too many of them
> > 
> > K.O.
> 
> Hi Konstantin, thanks for the instructions.  I do appear to be missing some debug symbols, but the output 
> looks potentially useful:
> 
> [lekhraj@sdfcdmsdaq ~]$ gdb mserver 
> core.mserver.17468.b174bb74f2bb44f9a0905e78ec6b2677.601715.1728422354000000
> GNU gdb (GDB) Rocky Linux 10.2-11.1.el9_3
> ...
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from mserver...
> [New LWP 601715]
> 
> warning: Section `.reg-xstate/601715' in core file too small.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `mserver'.
> Program terminated with signal SIGABRT, Aborted.
> 
> warning: Section `.reg-xstate/601715' in core file too small.
> #0  0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-83.el9.12.x86_64 libgcc-11.4.1-
> 3.el9.x86_64 libstdc++-11.4.1-2.1.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 mysql-libs-8.0.36-1.el9_3.x86_64 
> openssl-libs-3.0.7-25.el9_3.x86_64 zlib-1.2.11-40.el9.x86_64
> (gdb)
> (gdb) bt
> #0  0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> #1  0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
> #2  0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
> #3  0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
> #4  0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date 
> format",
>     hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #5  db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format", 
> subhKey=0x7ffcc536d348)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #6  0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
>     key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
>     s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
> #7  0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>, 
> filename=0x7ffcc536d690,
>     linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
> #8  0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
>     message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, 
> timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
> #9  0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
> #10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
> #11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
> #12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
> #13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0, 
> hDB=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
>     subhKey=subhKey@entry=0x7ffcc536da04) at 
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #15 0x0000000000455fd2 in al_check () at 
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
> --Type <RET> for more, q to quit, c to continue without paging--
> #16 0x000000000041ff85 in cm_periodic_tasks ()
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
> #17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
> #18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
> (gdb) info thr
>   Id   Target Id                          Frame
> * 1    Thread 0x7fbdec0b1740 (LWP 601715) 0x00007fbdeaca154c in __pthread_kill_implementation () from 
> /lib64/libc.so.6
> (gdb) thr 1
> [Switching to thread 1 (Thread 0x7fbdec0b1740 (LWP 601715))]
> #0  0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> #1  0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
> #2  0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
> #3  0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
> #4  0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date 
> format",
>     hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #5  db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format", 
> subhKey=0x7ffcc536d348)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #6  0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
>     key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
>     s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
> #7  0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>, 
> filename=0x7ffcc536d690,
>     linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
> #8  0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
>     message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, 
> timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
> #9  0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
> #10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
> #11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
> #12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
> #13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0, 
> hDB=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
>     subhKey=subhKey@entry=0x7ffcc536da04) at 
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #15 0x0000000000455fd2 in al_check () at 
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
> #16 0x000000000041ff85 in cm_periodic_tasks ()
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
> #17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
> #18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
> (gdb)
  2879   18 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> [aroberts@sdfcdmsdaq midas]$ odbedit
> [ODBEdit,ERROR] [odb.cxx:2043:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 
> 1615051 does not exists
> [ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Corrected 1 ODB entries
> [ODBEdit,INFO] Deleted entry '/System/Clients/1615051' for client 'ODBEdit' because it is not connected to ODB
> [ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 1615051 does not 
> exist

so far, so good, we connected to ODB (lock was not stuck), cleared out client "odbedit" with pid 1615051 that crashed 
without properly disconnecting. ODB semaphore is working correctly.

> [ODBEdit,ERROR] [odb.cxx:2489:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, aborting...
> Aborted (core dumped)

suddenly, an ODB semaphore timeout...

can you post the stack trace from this core dump? I am pretty sure it will be boring, but just in case...

K.O.
  2880   18 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> suddenly, an ODB semaphore timeout...

I am wondering if something bizarre is going on, like the system clock going backwards. I heard of things like that 
happening in virtual environments.

https://stackoverflow.com/questions/4801122/how-to-stop-time-from-running-backwards-on-linux

I added some debugging information to the semaphore locking code. Please update to commit 
eb625af119067f6d702211542d88a28ccb57ad2c of src/system.cxx (plus small change in include/msystem.h) and try again.

Now for each timeout it will print detailed syscall and timing information, if time goes backwards, it should catch it.

K.O.
  2896   15 Nov 2024 Konstantin OlchanskiSuggestionIssue with creating banks
> Hello, I am a coop student working at SNOLAB.
> void* data_acquisition_thread(void* param)
> {
> 	EVENT_HEADER *pevent;
>       if (complicated) {
> 			status = rb_get_wp(rbh, (void **) &pevent, 0);
>       }
>       bm_compose_event_threadsafe(pevent, 1, 0, 0, &equipment[0].serial_number);
> }

this code is buggy. it should read "EVENT_HEADER *pevent = NULL;" to avoid an uninitialized variable
and bm_compose_event() & co should be inside an "if (pevent != NULL)" block, unless you can absolutely
proove that rb_get_wp() is always called and pevent is never NULL. (even is somebody changes the code later).

if you build your code with "gcc -O2 -g -Wall -Wuninitialized" it would probably warn you about use of uninitilialized 
"pevent".

P.S. for building multithreaded frontends, you are much better off starting from the c++ tmfe frontend framework,
a good starting point is study tmfe_example_everything.cxx.

K.O.
  2897   15 Nov 2024 Konstantin OlchanskiInfoNew sequencer command ODBLOOKUP
> A new sequencer command "ODBLOOKUP" has been implemented, which does a lookup of a string in a string 
> array in the ODB given by a path and returns its index as a number. If we have for example an array
> 
> /Examples/Names
>    [0] Hello
>    [1] Test
>    [2] Other
> 
> and do a
>  
> ODBLOOKUP "/Examples/Names", "Test", index
> 
> we get a index equal 1.
> 

"value not found" sets "index" to ?
"odb key not found" sets "index" to ?
link to documentation?

K.O.
  2902   22 Nov 2024 Konstantin OlchanskiBug ReportODB lock timeout, Difficulty running MIDAS on Rocky 9.4
> try to replicate the issue ...

I see ODB lock timeout (and abort() of everything) in the dsvslice test station. We have 
about 15-20 MIDAS clients connected.

I am pretty sure we have not seen this problem until recently (and I have not seen it 
personally for a very long time). There were no changes to the MIDAS ODB locking code in a 
long time.

I suspect a recent change in the linux kernel. But I am likely to be wrong.

I have 1000 core dumps from this crash of dsvslice, and among them should be the 1 thread
that has ODB locked. Wish me luck finding it. Worst case is to discover that ODB is locked 
but nobody is holding a lock ("missing unlock bug"). This is hard to debug, I would have add 
tracking of "who was the last one to lock it, who forgot to unlock it".

K.O.
  2906   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).

The very original copy of this function had an error and was spewing out this error quite often,
this was a missing handler for EINTR.

Now it looks like we are missing a handler for EINVAL.

Most likely sleep is called with a funny sleep time value that fills struct timeval with
values select() does not like.

I see Ben added a check for negative sleep times, and this is good.

I think I will do these changes:

a) add an error message for negative sleep time, I think user should never call ::Sleep with negative or zero sleep times and 
if they do it is a bug and they should fix it, the error message will inform them so.

b) add a handler for EINVAL, which will report the requested sleep time and the values in struct timeval

K.O.
  2907   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> 
> I wonder also if now that midas requires stricter/newer c++ standards if there maybe
> some more straightforward method to sleep that is sufficiently robust and portable.
> 

I believe POSIX defined clock_nanosleep() & co, so on most recent machines that is the most portable way to sleep.

Historically, select() was the only way to sleep for less than 1 sec, but it was never portable
because of differences between BSD UNIX and Linux implementations. (MacOS is BSD UNIX via FreeBSD).

On difference is the update of struct timeval is select() is interrupted.

In this elog entry, I compare sleep using select() with sleep using clock_nanosleep() and see that there is no difference:
https://daq00.triumf.ca/elog-midas/Midas/2115

As you can see tmfe.cxx has both implementations, select() and clock_nanosleep(), and anybody can try which one works better on 
their computer.

K.O.
  2908   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
>       status = select(1, &fdset, NULL, NULL, &timeout);
>
>       On error, -1 is returned, ... timeout becomes undefined.

I have been reading "man select" for 30 years and I do not remember seeing this text.

I believe on BSD UNIX (MacOS) it says timeout is unchanged and on Linux is says timeout is updated to time actually slept.

I will have to investigate, but I suspect the man page was posix-ized, by sweeping BSD/MacOS and Linux implementations
under the same "instead of saying what it actually does, we will just say 'undefined'".

In any case, EINTR is not an error, it's an artefact of UNIX signal handling. Linux implementations always try
very hard to handle signals without causing EINTR to select(), read() and write(). This is most painful
when reading and writing to TCP sockets, because one most handle partial reads and EINTR.

K.O.
  2915   04 Dec 2024 Konstantin OlchanskiBug ReportODB key picker does not close when creating link / Edit-on-run string box too large
> > Actual result:
> > The key picker does not close.
> 
> Thanks for reporting that bug. It has been fixed in the current commit (installed already on megon02)
>

Stefan, thank you for fixing both problems, I have seen them, too, but no time to deal with them.

K.O.
ELOG V3.1.4-2e1708b5