Back Midas Rome Roody Rootana
  Midas DAQ System, Page 122 of 150  Not logged in ELOG logo
New entries since:Wed Dec 31 16:00:00 1969
ID Date Authordown Topic Subject
  2869   08 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> Basically I think Rocky 9.4 is a red herring.

This is what likely happened:
- some program crashed while holding the ODB lock semaphore (by ctrl-C at the wrong time or by kill -KILL at the wring time)
- semaphore is locked with a flag "unlock if this program stops"
- this is supposed to ensure ODB lock semaphore never gets stuck in the locked state
- (there is no code in MIDAS to unlock the ODB lock semaphore without locking it first)
- we have observed a malfunction in the Linux kernel, where this automatic unlock does not happen.
- it is rare and so far cannot be reproduced. you can find more about it by searching this forum.
- I think it is a bug in the System-V semaphore code or in the Linux "program stop" code (a path where they fail to call the semaphore unlock handler, and who knows what 
other handlers).
- System-V semaphores are very obsolete, replaced by POSIX semaphores.
- POSIX semaphores do not have the "unlock if this program stops" magic, the user (MIDAS) is responsible with detecting that the program who locked the semaphore is gone 
and and with taking corrective action, i.e. release the lock, automatically.

I do not know if this problem with System-V semaphores is in the generic Linux kernel, or if it is specific
to the Red Hat kernels (they are known to have many patches, deviating quite far from vanilla kernels).

I do not know if this problem is somehow sensitive to virtual machines.

So yes/no, Red Hat derived linux on a virtual machine could be where this problem happens more often that elsewhere.

K.O.
  2870   08 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> I read these error messages. There is no ODB corruption. ODB semaphore is locked and all midas programs will fail...

Recovery from this is:
- stop all midas programs (actually they should have all crashed by now)
- identify the ODB semaphore with: ipcs -s -t
- remove the ODB semaphore with: ipcrm sem <semid>
- where <semid> is from the first column of ipcs
- keep deleting semaphores until odbedit works.
- if you delete extra midas sempahores, odbedit will recreate them
- if you delete non-midas semaphores, oh, well...

Little bit better steps for this recovery may be written up by Suzannah in this forum
or in the midas wiki... good luck finding them.

K.O.
  2876   13 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
Thank you for the stack trace, I fixed the buglet that cause midas programs to crash twice,
once on failure to lock ODB, then call exit() -> atexit() handlers -> cm_check_connect() -> crash on ODB lock 
failure is the cm_msg() codes.

Replaced exit(1) with abort(). Could have used kill(getpid(),SIGKILL) to avoid making a core dump, but what the 
heck...

Of course this does nothing to the original bug where ODB was locked and nobody will ever unlock it (reboot will 
unlock it!).

commit bdd1d7fdc093b5a8d54a1b8467002bb3cac3ac11

K.O.


> > > I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.
> > 
> > I cannot read the core dump without the corresponding executable (and likely all it's shared libraries).
> > 
> > It is best if you run gdb and extract the stack traces on your end.
> > 
> > In case you are not familiar with gdb:
> > 
> > gdb mserver core # start gdb
> > bt # stack trace of crashed thread
> > info thr # get list of threads
> > thr 1
> > bt
> > thr 2
> > bt
> > # etc, get stack trace of each thread, there should not be too many of them
> > 
> > K.O.
> 
> Hi Konstantin, thanks for the instructions.  I do appear to be missing some debug symbols, but the output 
> looks potentially useful:
> 
> [lekhraj@sdfcdmsdaq ~]$ gdb mserver 
> core.mserver.17468.b174bb74f2bb44f9a0905e78ec6b2677.601715.1728422354000000
> GNU gdb (GDB) Rocky Linux 10.2-11.1.el9_3
> ...
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from mserver...
> [New LWP 601715]
> 
> warning: Section `.reg-xstate/601715' in core file too small.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `mserver'.
> Program terminated with signal SIGABRT, Aborted.
> 
> warning: Section `.reg-xstate/601715' in core file too small.
> #0  0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-83.el9.12.x86_64 libgcc-11.4.1-
> 3.el9.x86_64 libstdc++-11.4.1-2.1.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 mysql-libs-8.0.36-1.el9_3.x86_64 
> openssl-libs-3.0.7-25.el9_3.x86_64 zlib-1.2.11-40.el9.x86_64
> (gdb)
> (gdb) bt
> #0  0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> #1  0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
> #2  0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
> #3  0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
> #4  0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date 
> format",
>     hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #5  db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format", 
> subhKey=0x7ffcc536d348)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #6  0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
>     key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
>     s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
> #7  0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>, 
> filename=0x7ffcc536d690,
>     linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
> #8  0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
>     message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, 
> timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
> #9  0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
> #10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
> #11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
> #12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
> #13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0, 
> hDB=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
>     subhKey=subhKey@entry=0x7ffcc536da04) at 
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #15 0x0000000000455fd2 in al_check () at 
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
> --Type <RET> for more, q to quit, c to continue without paging--
> #16 0x000000000041ff85 in cm_periodic_tasks ()
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
> #17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
> #18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
> (gdb) info thr
>   Id   Target Id                          Frame
> * 1    Thread 0x7fbdec0b1740 (LWP 601715) 0x00007fbdeaca154c in __pthread_kill_implementation () from 
> /lib64/libc.so.6
> (gdb) thr 1
> [Switching to thread 1 (Thread 0x7fbdec0b1740 (LWP 601715))]
> #0  0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00007fbdeaca154c in __pthread_kill_implementation () from /lib64/libc.so.6
> #1  0x00007fbdeac54d06 in raise () from /lib64/libc.so.6
> #2  0x00007fbdeac287f3 in abort () from /lib64/libc.so.6
> #3  0x0000000000430ee4 in db_lock_database (hDB=hDB@entry=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2473
> #4  0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536d348, key_name=0x4687a8 "/Logger/Message file date 
> format",
>     hKey=0, hDB=1) at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #5  db_find_key (hDB=1, hKey=0, key_name=0x4687a8 "/Logger/Message file date format", 
> subhKey=0x7ffcc536d348)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #6  0x0000000000448297 in db_get_value_string (hdb=1, hKeyRoot=hKeyRoot@entry=0,
>     key_name=key_name@entry=0x4687a8 "/Logger/Message file date format", index=index@entry=0,
>     s=s@entry=0x7ffcc536d470, create=create@entry=1, create_string_length=0)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:13950
> #7  0x000000000040a690 in cm_msg_get_logfile (fac=<optimized out>, t=<optimized out>, 
> filename=0x7ffcc536d690,
>     linkname=0x7ffcc536d6b0, linktarget=0x7ffcc536d6d0)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:573
> #8  0x000000000041a307 in cm_msg_log (message_type=1, facility=0x46db0e "midas",
>     message=0x7e4290 "[mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, 
> timeout 10000 ms, exiting...") at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:685
> #9  0x0000000000421fcd in cm_msg_flush_buffer () at /usr/include/c++/11/bits/basic_string.h:194
> #10 0x00007fbdeac574dd in __run_exit_handlers () from /lib64/libc.so.6
> #11 0x00007fbdeac57620 in exit () from /lib64/libc.so.6
> #12 0x0000000000430f7a in db_lock_database (hDB=hDB@entry=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:2499
> #13 0x0000000000437e9c in db_find_key (subhKey=0x7ffcc536da04, key_name=0x476a21 "/Alarms/Alarms", hKey=0, 
> hDB=1)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4099
> #14 db_find_key (hDB=1, hKey=hKey@entry=0, key_name=key_name@entry=0x476a21 "/Alarms/Alarms",
>     subhKey=subhKey@entry=0x7ffcc536da04) at 
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/odb.cxx:4075
> #15 0x0000000000455fd2 in al_check () at 
> /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/alarm.cxx:614
> #16 0x000000000041ff85 in cm_periodic_tasks ()
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5596
> #17 0x00000000004235c5 in cm_yield (millisec=millisec@entry=1000)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/src/midas.cxx:5676
> #18 0x00000000004065c2 in main (argc=<optimized out>, argv=0x7ffcc536e628)
>     at /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/midas_fork/progs/mserver.cxx:295
> (gdb)
  2879   18 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> [aroberts@sdfcdmsdaq midas]$ odbedit
> [ODBEdit,ERROR] [odb.cxx:2043:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 
> 1615051 does not exists
> [ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Corrected 1 ODB entries
> [ODBEdit,INFO] Deleted entry '/System/Clients/1615051' for client 'ODBEdit' because it is not connected to ODB
> [ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 1615051 does not 
> exist

so far, so good, we connected to ODB (lock was not stuck), cleared out client "odbedit" with pid 1615051 that crashed 
without properly disconnecting. ODB semaphore is working correctly.

> [ODBEdit,ERROR] [odb.cxx:2489:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, aborting...
> Aborted (core dumped)

suddenly, an ODB semaphore timeout...

can you post the stack trace from this core dump? I am pretty sure it will be boring, but just in case...

K.O.
  2880   18 Oct 2024 Konstantin OlchanskiBug ReportDifficulty running MIDAS on Rocky 9.4
> suddenly, an ODB semaphore timeout...

I am wondering if something bizarre is going on, like the system clock going backwards. I heard of things like that 
happening in virtual environments.

https://stackoverflow.com/questions/4801122/how-to-stop-time-from-running-backwards-on-linux

I added some debugging information to the semaphore locking code. Please update to commit 
eb625af119067f6d702211542d88a28ccb57ad2c of src/system.cxx (plus small change in include/msystem.h) and try again.

Now for each timeout it will print detailed syscall and timing information, if time goes backwards, it should catch it.

K.O.
  2896   15 Nov 2024 Konstantin OlchanskiSuggestionIssue with creating banks
> Hello, I am a coop student working at SNOLAB.
> void* data_acquisition_thread(void* param)
> {
> 	EVENT_HEADER *pevent;
>       if (complicated) {
> 			status = rb_get_wp(rbh, (void **) &pevent, 0);
>       }
>       bm_compose_event_threadsafe(pevent, 1, 0, 0, &equipment[0].serial_number);
> }

this code is buggy. it should read "EVENT_HEADER *pevent = NULL;" to avoid an uninitialized variable
and bm_compose_event() & co should be inside an "if (pevent != NULL)" block, unless you can absolutely
proove that rb_get_wp() is always called and pevent is never NULL. (even is somebody changes the code later).

if you build your code with "gcc -O2 -g -Wall -Wuninitialized" it would probably warn you about use of uninitilialized 
"pevent".

P.S. for building multithreaded frontends, you are much better off starting from the c++ tmfe frontend framework,
a good starting point is study tmfe_example_everything.cxx.

K.O.
  2897   15 Nov 2024 Konstantin OlchanskiInfoNew sequencer command ODBLOOKUP
> A new sequencer command "ODBLOOKUP" has been implemented, which does a lookup of a string in a string 
> array in the ODB given by a path and returns its index as a number. If we have for example an array
> 
> /Examples/Names
>    [0] Hello
>    [1] Test
>    [2] Other
> 
> and do a
>  
> ODBLOOKUP "/Examples/Names", "Test", index
> 
> we get a index equal 1.
> 

"value not found" sets "index" to ?
"odb key not found" sets "index" to ?
link to documentation?

K.O.
  2902   22 Nov 2024 Konstantin OlchanskiBug ReportODB lock timeout, Difficulty running MIDAS on Rocky 9.4
> try to replicate the issue ...

I see ODB lock timeout (and abort() of everything) in the dsvslice test station. We have 
about 15-20 MIDAS clients connected.

I am pretty sure we have not seen this problem until recently (and I have not seen it 
personally for a very long time). There were no changes to the MIDAS ODB locking code in a 
long time.

I suspect a recent change in the linux kernel. But I am likely to be wrong.

I have 1000 core dumps from this crash of dsvslice, and among them should be the 1 thread
that has ODB locked. Wish me luck finding it. Worst case is to discover that ODB is locked 
but nobody is holding a lock ("missing unlock bug"). This is hard to debug, I would have add 
tracking of "who was the last one to lock it, who forgot to unlock it".

K.O.
  2906   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).

The very original copy of this function had an error and was spewing out this error quite often,
this was a missing handler for EINTR.

Now it looks like we are missing a handler for EINVAL.

Most likely sleep is called with a funny sleep time value that fills struct timeval with
values select() does not like.

I see Ben added a check for negative sleep times, and this is good.

I think I will do these changes:

a) add an error message for negative sleep time, I think user should never call ::Sleep with negative or zero sleep times and 
if they do it is a bug and they should fix it, the error message will inform them so.

b) add a handler for EINVAL, which will report the requested sleep time and the values in struct timeval

K.O.
  2907   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> 
> I wonder also if now that midas requires stricter/newer c++ standards if there maybe
> some more straightforward method to sleep that is sufficiently robust and portable.
> 

I believe POSIX defined clock_nanosleep() & co, so on most recent machines that is the most portable way to sleep.

Historically, select() was the only way to sleep for less than 1 sec, but it was never portable
because of differences between BSD UNIX and Linux implementations. (MacOS is BSD UNIX via FreeBSD).

On difference is the update of struct timeval is select() is interrupted.

In this elog entry, I compare sleep using select() with sleep using clock_nanosleep() and see that there is no difference:
https://daq00.triumf.ca/elog-midas/Midas/2115

As you can see tmfe.cxx has both implementations, select() and clock_nanosleep(), and anybody can try which one works better on 
their computer.

K.O.
  2908   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
>       status = select(1, &fdset, NULL, NULL, &timeout);
>
>       On error, -1 is returned, ... timeout becomes undefined.

I have been reading "man select" for 30 years and I do not remember seeing this text.

I believe on BSD UNIX (MacOS) it says timeout is unchanged and on Linux is says timeout is updated to time actually slept.

I will have to investigate, but I suspect the man page was posix-ized, by sweeping BSD/MacOS and Linux implementations
under the same "instead of saying what it actually does, we will just say 'undefined'".

In any case, EINTR is not an error, it's an artefact of UNIX signal handling. Linux implementations always try
very hard to handle signals without causing EINTR to select(), read() and write(). This is most painful
when reading and writing to TCP sockets, because one most handle partial reads and EINTR.

K.O.
  2915   04 Dec 2024 Konstantin OlchanskiBug ReportODB key picker does not close when creating link / Edit-on-run string box too large
> > Actual result:
> > The key picker does not close.
> 
> Thanks for reporting that bug. It has been fixed in the current commit (installed already on megon02)
>

Stefan, thank you for fixing both problems, I have seen them, too, but no time to deal with them.

K.O.
  2916   05 Dec 2024 Konstantin OlchanskiBug ReportODB key picker does not close when creating link / Edit-on-run string box too large
> > Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the 
> > size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters, 
> > LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a 
> > reasonable width.
> 
> I limited the input size now to (arbitrarily) 100 chars. The string can still be longer than 100 chars, and you start then scrolling inside the input box. Let me know if 
> that's ok this way.

I am moving the dragon experiment to the new midas and we see this problem on the begin-of-run page.

Old midas: no horizontal scroll bar, edit-on-start names, values and comments are all squeezed in into the visible frame.

New midas: page is very wide, values entry fields are very long and there is a horizontal scroll bar.

So something got broken in the htlm formatting or sizing. I should be able to spot the change by doing
a diff between old resources/start.html and the new one.

K.O.
  2918   06 Dec 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> >       status = select(1, &fdset, NULL, NULL, &timeout);
> >       On error, -1 is returned, ... timeout becomes undefined.

I confirm Linux and MacOS man pages and select() with EINTR work as I remember, Linux updates "timeout" to account for the 
time already slept, MacOS does not ("timeout" is unchanged).

So the original code is roughly correct, but long sleeps will not work right if SIGALRM fires during sleeping.

Note that MIDAS no longer uses SIGALRM to fire cm_watchdog() (it was moved to a thread) and MIDAS does not use signals,
so handling of EINTR is now moot.

(Please correct me if I missed something).

The original bug report was about EINVAL, and best I can tell, it was caused by calls to TMFE::Sleep()
with strange sleep times that caused invalid values to be computed into the select() timeout.

To improve on this, I make these changes:

1) TMFE::Sleep(0) will report an error and will not sleep
2) TMFE::Sleep(negative number) will report an error and will not sleep

(please check the sleep time before calling TMFE::Sleep())

3) TMFE::Sleep(1 sec or less) will sleep using select(). (I also looked into using poll(), ppoll() and pselect()).
4) TMFE::Sleep(more than 1 second) will use a loop to sleep in increments of 1 second and will use one additional syscall to 
read the current time to decide how much more to sleep.

5) if select() returns EINVAL, the error message will reporting the sleep time and the values in "timeout".

A side effect of this is that on both Linux and MacOS long sleeps work correctly if interrupted by SIGALRM,
because SIGALRM granularity is 1 sec and our sleep time is also 1 sec.

Commit [develop 06735d29] improve TMFE::Sleep()

K.O.
  2919   06 Dec 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> Commit [develop 06735d29] improve TMFE::Sleep()

Report of test_sleep on Ubuntu-22 (Intel E-2236) and MacOS 14.6.1 (M1 MAX Mac Studio).

It is easy to see Ubuntu-22 (kernel 6.2.x) sleep granularity is ~60 usec, MacOS sleep granularity is <2 usec. (Sub 1 usec sleep likely measures the syscall() speed, 500 ns on Intel and 200 
ns on ARM M1 MAX). (NOTE: long sleep is interrupted by an alarm roughly 10 seconds into the sleep, see progs/test_sleep.cxx)

daq00:midas$ uname -a
Linux daq00.triumf.ca 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
daq00:midas$ ./bin/test_sleep 
test short sleep:
sleep      10 loops,   100000.000 usec per loop,  1.000000000 sec total,  1.002568007 sec actual total,   100256.801 usec actual per loop, oversleep  256.801 usec,    0.3%
sleep     100 loops,    10000.000 usec per loop,  1.000000000 sec total,  1.025897980 sec actual total,    10258.980 usec actual per loop, oversleep  258.980 usec,    2.6%
sleep    1000 loops,     1000.000 usec per loop,  1.000000000 sec total,  1.169670105 sec actual total,     1169.670 usec actual per loop, oversleep  169.670 usec,   17.0%
sleep   10000 loops,      100.000 usec per loop,  1.000000000 sec total,  1.573357105 sec actual total,      157.336 usec actual per loop, oversleep   57.336 usec,   57.3%
sleep   99999 loops,       10.000 usec per loop,  0.999990000 sec total,  6.963442087 sec actual total,       69.635 usec actual per loop, oversleep   59.635 usec,  596.4%
sleep 1000000 loops,        1.000 usec per loop,  1.000000000 sec total, 60.939687967 sec actual total,       60.940 usec actual per loop, oversleep   59.940 usec, 5994.0%
sleep 1000000 loops,        0.100 usec per loop,  0.100000000 sec total,  0.613572121 sec actual total,        0.614 usec actual per loop, oversleep    0.514 usec,  513.6%
sleep 1000000 loops,        0.010 usec per loop,  0.010000000 sec total,  0.576359987 sec actual total,        0.576 usec actual per loop, oversleep    0.566 usec, 5663.6%
test long sleep: requested 126.000000000 sec ... sleeping ...
alarm!
test long sleep: requested 126.000000000 sec, actual 126.000875950 sec

bash-3.2$ uname -a
Darwin send.triumf.ca 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
bash-3.2$ ./bin/test_sleep 
test short sleep:
sleep      10 loops,   100000.000 usec per loop,  1.000000000 sec total,  1.032556057 sec actual total,   103255.606 usec actual per loop, oversleep 3255.606 usec,    3.3%
sleep     100 loops,    10000.000 usec per loop,  1.000000000 sec total,  1.245460033 sec actual total,    12454.600 usec actual per loop, oversleep 2454.600 usec,   24.5%
sleep    1000 loops,     1000.000 usec per loop,  1.000000000 sec total,  1.331466913 sec actual total,     1331.467 usec actual per loop, oversleep  331.467 usec,   33.1%
sleep   10000 loops,      100.000 usec per loop,  1.000000000 sec total,  1.281141996 sec actual total,      128.114 usec actual per loop, oversleep   28.114 usec,   28.1%
sleep   99999 loops,       10.000 usec per loop,  0.999990000 sec total,  1.410759926 sec actual total,       14.108 usec actual per loop, oversleep    4.108 usec,   41.1%
sleep 1000000 loops,        1.000 usec per loop,  1.000000000 sec total,  2.400593996 sec actual total,        2.401 usec actual per loop, oversleep    1.401 usec,  140.1%
sleep 1000000 loops,        0.100 usec per loop,  0.100000000 sec total,  0.188431025 sec actual total,        0.188 usec actual per loop, oversleep    0.088 usec,   88.4%
sleep 1000000 loops,        0.010 usec per loop,  0.010000000 sec total,  0.188102007 sec actual total,        0.188 usec actual per loop, oversleep    0.178 usec, 1781.0%
test long sleep: requested 126.000000000 sec ... sleeping ...
alarm!
test long sleep: requested 126.000000000 sec, actual 126.001244068 sec

K.O.
  2927   01 Jan 2025 Konstantin OlchanskiForumtime ordering of run transition calls to TMFeEquipment things
> I have a question about "tmfe approach" to implementing MIDAS frontends. If I read the code correctly, 
> within this approach it is the TMFeEquipment things, not the TMFrontend's themselves, 
> which handle the run transitions - the TMFrontend class

that's correct and it is documented so in https://bitbucket.org/tmidas/midas/src/develop/tmfe.md

> So how does a user control the sequence in which TMFeEquipment::HandleBeginRun functions of different 
> TMFeEquipment pieces are called at begin run? - there are two cases to consider: TMFeEquipment things 
> defined by the same TMFrontend and by different TMFrontend's.

I am not sure what you are trying to do. It is always easier to suggest a solution to a specific problem.

But I will try to answer anyway:

1) "time ordering of run transitions" - of course midas transitions are ordered by transition sequence numbers 
and the tmfe class provides methods to control this. ditto for the mfe.cxx frontends.

2) for one TMFrontend, the order of calling HandleBeginRun() is the order in which equipments were added to the 
equipment using FeAddEquipment(). HandleEndRun() is called in reverse order. (I better check this).

3) to have multiple TMFrontends in one program would be unusual (mfe.cxx frontends completely do not support 
this), but should work. Everything was coded to support this, but it was never tested in practice because we 
cannot invent any useful use-case for it. HandleBeginRun() handlers are likely to be called in the frontends are 
created. (I could check this and confirm it works, as long as you have a valid use-case for this configuration).

4) Frontend X has EquipmentA and EquipmentB, you want EqA::HandleBeginRun() to be called at run transition 200 
and EqB::HandleBeginRun() to be called at run transition 400.

This is not directly supported by mfe.cxx frontends (the begin_run() handler is a global function) and I did not 
directly implement it in the TMFE frontend.

But I think this would be a useful improvement. I will look into this.

Likely I will add per-equipment data members fEqConfBeginRunSeqNo, fEqConfEndRunSeqno, etc. Value 0 would 
unregister the corresponding run transition handler. This would cleanup the code quite a bit, a bunch
of RegisterTranstionXXX functions could go away.

K.O.
  2939   06 Feb 2025 Konstantin OlchanskiForumTransition from mana -> manalyzer
> Could somebody please give me a boost?

no need to shout into the void, it is pretty easy to identify the author of manalyzer and ask me directly.

> we are planning to migrate from mana to manalyzer. I started to have a look into it and realized that I have some lose ends.
> Is there a clear migration docu somewhere?

README.md and examples in the manalyzer git repository.

If something is missing, or unclear, please ask.

> Currently I understand it the following way (which might be wrong):
> The class TARunObject is used to write analyzer modules which are registered by TAFactory. I hope this is right?

Please read the README file. It explains what is going on there.

Design of manalyzer had 2 main goals:
a) lifetime of all c++ objects (and ROOT objects) is well defined (to fix a design defect in rootana)
b) event flow and data flow are documentable (problem in mfe.c frontends, etc)

> However, in mana there is an analyzer implemented by the user which binds the modules and has additional routines:
> analyzer_init(), analyzer_exit(), analyzer_loop()
> ana_begin_of_run(), ana_end_of_run(), ana_pause_run(), ana_resume_run()
> which we are using.

I have never used the mana analyzer, I wrote the c++ rootana analyzer very early on (first commit in 2006).

But the basic steps should all be there for you:
- initialization (create histograms, open files) can be done in the module constructor or in BeginRun()
- finalization (fit histograms, close files) should be done in EndRun()
- event processing (obviously) in Analyze()
- pause run, resume run and switch to next subrun file have corresponding methods
- all the "flow" and multithreading stuff you can ignore to first order.

To start the migration, I recommend you take manalyzer_example_root.cxx and start stuffing it with your code.

If you run into any problems, I am happy to help with them. Ask here or contact me directly.

> This part I somehow miss in manalyzer, most probably due to lack of understanding, and missing documentation.

True, I wrote a migration guide for the frontend mfe.c to c++ tmfe, because we do this migration
quite often. But I never wrote a migration guide from mana.c analyzer, because we never did such
migration. Most experiments at TRIUMF are post-2006 and use rootana in it's different incarnations.

P.S. I designed the C++ TMFe frontend after manalyzer and I think it came out quite better, I especially
value the design input from Stefan, Thomas, Pierre, Joseph and Ben.

P.P.S. Be free to ignore all this manalyzer business and write your own analyzer based
on the midasio library:

int main()
{
   TMReaderInteraface* f = TMNewReader(file.mid.gz");
   while (1) {
      TMEvent* e = TMReadEvent(f);
      dwim(e);
      delete e;
   }
   delete f;
}

For online processing I use the TMFe class, it has enough bits to be a frontend and an analyzer,
or you can use the older TMidasOnline from rootana.

Access to ODB is via the mvodb library, which is new in midas, but has been part of rootana
and my frontend toolkit since at least 2011 or earlier, inspired by Peter Green's even
older "myload" ODB access library.

K.O.
  2940   06 Feb 2025 Konstantin OlchanskiForumTransition from mana -> manalyzer
> > Is there a clear migration docu somewhere?

I can also give you links to the alpha-g analyzer (very complex) and the DarkLight trigger TDC analyzer (very simple),
there is also analyzer examples of in-between complexity.

K.O.
  2941   07 Feb 2025 Konstantin OlchanskiInfoswitch midas to next c++
to continue where we left off in 2019,
https://daq00.triumf.ca/elog-midas/Midas/1520

time to choose the next c++!

snapshot of 2019:

- Linux RHEL/SL/CentOS6 - gcc 4.4.7, no C++11.
- Linux RHEL/SL/CentOS7 - gcc 4.8.5, full C++11, no C++14, no C++17
- Ubuntu 18.04.2 LTS - gcc 7.3.0, full C++11, full C++14, "experimental" C++17.
- MacOS 10.13 - llvm 10.0.0 (clang-1000.11.45.5), full C++11, full C++14, full C++17

the world moved on:

- el6/SL6 is gone
- el7/CentOS-7 is out the door, only two experiments on my plate (EMMA and ALPHA-g)
- el8 was a still born child of RedHat
- el9 - gcc 11.5 with 12, 13, and 14 available.
- el10 - gcc 14.2

- U-18 - gcc  7.5
- U-20 - gcc  9.4 default, 10.5 available
- U-22 - gcc 11.4 default, 12.3 available
- U-24 - gcc 13.3 default, 14.2 available

- MacOS 15.2 - llvm/clang 16

Next we read C++ level support:

(see here for GCC C++ support: https://gcc.gnu.org/projects/cxx-status.html)
(see here for LLVM clang c++ support: https://clang.llvm.org/cxx_status.html)
(see here for GLIBC c++ support: https://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html)

gcc:
4.4.7 - no C++11
4.8.5 - full C++11, no C++14, no C++17
7.3.0 - full C++11, full C++14, "experimental" C++17.
7.5.0 - c++17 and older
9.4.0 - c++17 and older
10.5 - no c++26, no c++23, mostly c++20, c++17 and older
11.4 - no c++26, no c++23, full c++20 and older
12.3 - no c++26, mostly c++23, full c++20 and older
13.3 - no c++26, mostly c++23, full c++20 and older
14.2 - limited c++26, mostly c++23, full c++20 and older

clang:
16 - no c++26, mostly c++23, mostly c++20, full c++17 and older

I think our preference is c++23, the number of useful improvements is quite big.

This choice will limit us to:
- el9 or older
- U-22 or older
- current MacOS 15.2 with Xcode 16.

It looks like gcc and llvm support for c++23 is only partial, so obviously we will use a subset of c++23 
supported by both.

Next step is to try to build midas with c++23 on el9 and U-22,24 and see what happens.

K.O.
  2949   14 Mar 2025 Konstantin OlchanskiBug Reportpython hist_get_events not returning events, but javascript does
> After starting midas (mhttpd &, and mlogger -D) and running the `fetest` frontend I went into the midas/python/examples directory and ran basic_hist_script.py, and, even though I could see the 'pytest' program in the Programs page,
> 
> Valid events are:
> Enter event name:
> 
> was printed out, which signified that no events were found. No errors were displayed.

To check that MIDAS itself is built correctly, you can try "make test", this will create a sample experiment,
run fetest, start, stop a run and check that data file and history file is created with correct history events.

If "make test" fails, I can help debug it.

In your experiment, you can check that history files are created correctly:

1) "mhist -l" should show all available events
2) "mhdump -L *.hst" should show all events in the .hst history files
3) if you have the newer mhf*.dat files, you can "more mhf_1449770978_20151210_hv.dat" to see what data is inside

If all of that works as expected, there must be a problem with the python side and we will have to figure
out how to reproduce it.

This reminds me, "make test" does not test any of the python code, it should be added (and python should be added
to the bitbucket builds).

K.O.
ELOG V3.1.4-2e1708b5