ELOG Midas

Back Midas Rome Roody Rootana

Midas DAQ System, Page 125 of 158

Not logged in

Find | Login | Help

Full | Summary | Threaded | Collapse | Expand

3149 Entries

Goto page Previous 1, 2, 3 ... 124, 125, 126 ... 156, 157, 158 Next

11 Sep 2024, Konstantin Olchanski, Forum, Python frontend rate limitations?

> > I'm trying to get a sense of the rate limitations of a python frontend.

forgot one more:

c++ toolchain comes with extensive profiler tools aimed to answer the question "why is my 
program so slow, where is it spending all the time?". some of these tools go all the way to 
the hardware level and report CPU cache misses, TLB flushes, context switches and any other 
hardware events that interrupt or slow down computations. programmer than uses this 
information to restructure the code to avoid the worst slow downs (i.e. avoid branch mis-
predictions, avoid cache misses, etc).

I doubt the python toolchain will ever profiler tools as good.

K.O.

11 Sep 2024, Konstantin Olchanski, Bug Report, Multiple issues with mhist

I think I can offer some insight into your problems:

1) your mhist crash is due to the ODB timeout, it is probably set to 30 seconds in ODB /programs/mhist. you will 
have to make it biigger.

2) 1.5 years of files. yes. I have 10 years of files for ALPHA at CERN. and the number of files is a problem. 
But it should be better than the old system with 3 files per day (1000 files per year).

One solution you can try is symlinks. Assuming you have 10 years of history files in 10 per-year directory, you 
symlink as many of them as you need into the "current" directory, then remove the symlinks.

Why remove the symlinks? I use "ls" to read the list of history files and Unix/Linux does not have a syscall to 
"give me the 100 files with the newest mtime". I have to read the whole directory and that takes forever (if ZFS 
on HDD), it is quick with ZFS on SSD if ZFS cache is hot (you can have a cron job do "ls" every 5 minutes to 
keep the ZFS cache hot).

Now that I wrote the above, I think I see a way to make it "automatic", let me ponder this. (plus I always 
wanted to implement compressed history files (using "free" lz4)).

K.O.



I am having some trouble with mhist. I suppose that the problems are at least partially due to our specific 
needs which might exceed what has been tested. For context, in MEG II we have some 10^4 history variables in ~30 
different events. 

1. mhist -l crashes. After displaying around 7000 lines, I get the following error message:
[CODE]
[mhist,ERROR] [midas.cxx:5949:bm_validate_client_index,ERROR] My client index 10 in buffer 'SYSMSG' 
is invalid: client name '', pid 0 should be my pid 3773321
[mhist,ERROR] [midas.cxx:5952:bm_validate_client_index,ERROR] Maybe this client was removed by a 
timeout. See midas.log. Cannot continue, aborting...
Aborted (core dumped)
[/CODE]
Timing the execution shows around 33 seconds before the process is aborted.

I'm not sure if this would actually fix the problem, but while trying to circumvent the issue, I tried the 
following: [CODE]mhist -e "Xenon" -l[/CODE] This doesn't seem to be implemented. Listing only the variables of a 
single event would be nice 
regardless of our specific issue.

2. mhist and history files.
We have a directory directory with about 2500 history files (mhf_...dat) for the past 1.5 years. Older 
history files are archived in other directories with similar numbers of files. When trying to access them, I 
encountered two issues:
It seems like it is not possible to pass a "history directory" as an argument. To dump the history for a full 
year in the archive directory, I would need to run mhist many times with -f and then combine all the dumps.

If it really does not work, please consider this a feature request.
Also, even using single files does not work at the moment:
[CODE]
$ mhist -e "Xenon" -v "Det XeTmp 0-0" -t 100000 -s 200101 -p 250101 -f 
/data2/history/2022/mhf_1644698398_20220212_xenon.dat
ID 980316009, Aug 13 19:10:56, size 1851749486
[/CODE]
This command was supposed to show me the rough time frame covered in this particular history file. I was 
informed that the history files are in the new "FILE" format and mhist might not work with them properly.

tl;dr
[LIST]
[*] Bug: mhist -l crashes
[*] Bug: mhist -f does not work with "FILE" history format
[*] Feature request: mhist -e "Name" -l to only show variables of event "Name"
[*] Feature request: Set temporary history dir with a flag
[/LIST]

Lukas[/quote]

11 Sep 2024, Konstantin Olchanski, Suggestion, Improve Event Documentation

> I am writing a Rust based midas file reader however it was kind of hard to understand the full midas file 
> structure from the documentation.

MIDAS is old-school, when the code was the documentation.

This is very noticeable when you try to document things MIDAS (as I have done many times).

For MIDAS data format, file level and bank level, best if you look at my midasio library (included with MIDAS 
git clone) and translate it to Rust directly. I think a Rust version of C++ midasio would be very welcome.

Many data fields in MIDAS files are mysterious and I reverse-engineered them the best I could.

The main problems were:
- data padding
- "length" fields include padding or not?
- identification of big-endian vs little-endian data
- probably something I forget

K.O.

11 Sep 2024, Konstantin Olchanski, Info, Help parsing scdms_v1 data?

Look at the C++ implementation of the MIDAS data file reader, the code is very 
simple to follow.

Depending on how old are your data files, you may run into a problem with 
misaligned 32-bit data banks. Latest MIDAS creates BANK32A events where all 
banks are aligned to 64 bits. old BANK32 format had banks alternating between 
aligned and misaligned. old 16-bit BANK format data hopefully you do not have.

If you successfully make a data format description file for MIDAS, please post 
it here for the next user.

K.O.



[quote="Adrian Fisher"]Hi! I'm working on creating a ksy file to help with 
parsing some data, but I'm having trouble finding some information. Right now, I 
have it set up very rudimentary - it grabs the event header and then uses the 
data bank size to grab the size of the data, but then I'm needing additional 
padding after the data bank to reach the next event.
However, there's some irregularity in the "padding" between data banks that I 
haven't been able to find any documentation for. For some reason, after the data 
banks, there's sections of data of either 168 or 192 bytes, and it's seemingly 
arbitrary which size is used. 
I'm just wondering if anyone has any information about this so that I'd be able 
to make some more progress in parsing the data.
The data I'm working with can be found at https://github.com/det-
lab/dataReaderWriter/blob/master/data/07180808_1735_F0001.mid.gz
And the ksy file that I've created so far is at https://github.com/det-
lab/dataReaderWriter/blob/master/kaitai/ksy/scdms_v1.ksy

There's also a block of data after the odb that runs for 384 bytes that I'm 
unsure the purpose of, if anyone could point me to some information about that.

Thank you![/quote]

11 Sep 2024, Konstantin Olchanski, Info, mana.cxx

> Ok, no relevant complains so far, so I removed mana and rmana from the CMake build 
> process, but left the file mana.cxx still in the repository for educational 
> purposes ;-)

+1

K.O.

11 Sep 2024, Konstantin Olchanski, Forum, "Safe" abort of sequencer scripts

> We often use the MIDAS sequencer to temporarily control detector settings, such as:
> 
> * <change some setting>
> * WAIT 60 seconds
> * <revert setting to original value>
> 
> The question arises of what happens if the sequencer scripts gets aborted during that wait, preventing the value from being reset.

Common problem. Go have an elegant solution using the "defer" keyword.

https://go.dev/tour/flowcontrol/12

K.O.

12 Sep 2024, Konstantin Olchanski, Bug Fix, bitbucket builds repaired

bitbucket builds work again, also added ubuntu-24 and almalinux-9.

two problems fixed:
- cmake file in examples/experiment was replaced by a non-working version
- unannounced change of strlcpy() to mstrlcpy() broke "make remoteonly"

P.S. I should also fix the rootana and the roody bitbucket builds.

K.O.

13 Sep 2024, Konstantin Olchanski, Bug Fix, rootana bitbucket build fixed

rootana bitbucket build is fixed, only a few minor build problems. I am using the 
root official docker image (which turned out to not work right out of the box 
becuase of missing libvdt-dev package). K.O.

13 Sep 2024, Konstantin Olchanski, Bug Report, mfe.cxx with RO_STOPPED and EQ_POLLED

> > I noticed that a check was added to mfe.cxx in 1961af0d6:

This is the reason I recommend against using mfe.c based frontends. There was never any
proper documentation on how they work and what different settings in ODB common
and elsewhere do. My attempts to document it by reverse-engineering were only partially
successful. Since then a number of changes was made that were also hard-to-impossible
to document.

I recommend that all use the new c++ tmfe frontend, which was designed for easy documentation,
and explanation. See tmfe.md for full documentation.

(pending improvements is to integrate TMEvent support, add the data-transmit thread and event fifo).

K.O.

13 Sep 2024, Konstantin Olchanski, Bug Fix, mstrcpy, was: strlcpy and strlcat added to glibc 2.38

for the record, as ultimate solution, strlcpy() and strlcat() were wholesale 
replaced by mstrlcpy() and mstrlcat(). this should fix "missing strlcpy()" 
problem for good and make midas more consistent across all platforms (including 
non-linux, non-unix). on my side, I continue replacing these function with proper 
std::string operations. K.O.

13 Sep 2024, Konstantin Olchanski, Suggestion, manalyzer thread safety and custom http IP binding

> - Enable ROOT's thread safety when running in multithreaded mode
> This helps avoid users having to write their call to a global thread lock when calling ->Fill() on ROOT histograms and Trees
> https://bitbucket.org/tmidas/manalyzer/pull-requests/5

merged by hand. (pull request shows a "rejected", bitbucket has no "merged manually" button).

also noted this change in the documentation: README.md

K.O.

13 Sep 2024, Konstantin Olchanski, Suggestion, Clean up compiler warning in manalyzer

> This is a super small pull request, simple replace deprecated sprintf with snprintf
> https://bitbucket.org/tmidas/manalyzer/pull-requests/9

sprintf() is not deprecated and "char buf[256]; sprintf(buf, "%05d", 64-bit-int);" is safe, will never overflow.

we could bulk-convert all these sprintf() to snprintf() but I would rather wait for this:

https://en.cppreference.com/w/cpp/utility/format/format

let me think on this for a bit.

K.O.

17 Sep 2024, Konstantin Olchanski, Bug Report, Crash using ODB watch

> {
> odb new_settings("/Equipment/Test FE/Settings");
> new_settings.watch(watch); // <-- here I am getting a segmentation fault
> }

this code has a bug. "watch" is attached to object "new_settings" that is deleted
after the closing curly bracket.

I would say Stefan's odb API should not allow you to write code like this. an API defect.

K.O.

24 Sep 2024, Konstantin Olchanski, Bug Report, Can we convert the .mid file into .root file

"Can we convert the .mid file into .root file".

yes, you can, but the operation is under-defined. it's like asking "can I convert these stones into houses". the answer is "yes", but it involves 
more than running a universal conversion program.

For this reason, I recommend against converting midas files "to root". for some types of midas data such a conversion makes no sense (i.e. alpha-g 
streamed udp packets with chopped compressed waveforms).

I recommend that you analyze you data in the midas analyzer. You can start with manalyzer_example_root.cxx,
it shows how to create a ROOT histogram, how to access midas event bank data and call the TH1 "Fill" method.

Instead of filling histograms in the analyzer, you can create a ROOT TTree and fill it with data from midas data banks,
effectively you will create your own custom converter from midas to root.

The key thing is that it has to be a custom converter, because only you know the meaning of midas bank data
and how it should be best stored in a root tree.

K.O.

24 Sep 2024, Konstantin Olchanski, Suggestion, Clean up compiler warning in manalyzer

> > I like the look of std::format, looks cleaner than string streams
> 
> I fully agree. String streams is a pain if you want to do zero-leading hex output mixed with decimal output. Yes it's easier to read if you don't know printf syntax,
> but 10-20 times more chars to write and not necessarily cleaner.
>

IMO c++ string streams formatting is optimized for "hello world" and is useless for printing hex numbers, table-formatted data and generally anything real-life.

plus the borked std::to_string() (it takes a global lock for the "C" locale), "fixed" it by introducing std::to_chars() in C++17,
with "ultimate fix" in std::format in C++26.

no question why C++ has the bad reputation. for a "done right" example, take a look at the Go standard library.

> 
> Probable is that we would have to convert about a few thousand of sprintf's() in midas.
> 

surprising few bare sprintf() remaining in MIDAS, most of them overflow-safe and most of them to be converted to msprintf().

K.O.

07 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4

> We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual 
> Machine and are getting a persistent error when we run mserver.
>
> [mserver,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, 
> timeout 10000 ms, exiting...
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while 
> already inside db_{lock,unlock}_database(). Maybe this is a call from a signal 
> handler. Cannot continue, aborting...
> Aborted (core dumped)

This is super very bad. Since you have a core dump, please post the stack trace here (or email it to me).

I probably cannot debug your private version of midas and I will recommend that you install and run vanilla midas 
mserver (just while we debug this problem).

Let's look at the core dump stack trace first, but likely we see a problem with System-V semaphores and hopefully it 
is not some breakage due to Red Hat bogosity or due to something specific to running on a virtual machine.

If indeed this is Linux-kernel level breakage of System-V semaphores, solution would be to start using Posix 
semaphores, something I wanted to do for a long time. We already switched MIDAS shared memory from System-V to Posix 
shared memory.

If we are lucky it is just one more crasher bug in ODB. Let's see that core dump stack trace.

K.O.

08 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4

> I've uploaded the current core dump at: https://gitlab.com/det-lab/coredumps#.

I cannot read the core dump without the corresponding executable (and likely all it's shared libraries).

It is best if you run gdb and extract the stack traces on your end.

In case you are not familiar with gdb:

gdb mserver core # start gdb
bt # stack trace of crashed thread
info thr # get list of threads
thr 1
bt
thr 2
bt
# etc, get stack trace of each thread, there should not be too many of them

K.O.

08 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4

I read these error messages. There is no ODB corruption. ODB semaphore is locked and all midas programs will fail, they will timeout trying to get the lock, report the timeout, then it looks like a bug was introduced where instead of hard exit or abort() they attempt a clean shutdown which crashes from a recursive call in db_lock_database(). Amy's 
core dump should confirm this.

K.O.


> > > We're trying to install the SuperCDMS version of MIDAS on a Rocky 9.4 Virtual 
> > > Machine and are getting a persistent error when we run mserver.  As far as I 
> > > know there are minimal changes between this and the MIDAS branch, but Ben Smith 
> > > may have more to say on this.
> > 
> > For reference, "the SuperCDMS version of MIDAS" is just a fork that no longer has any meaningful differences vs the main MIDAS repo, but we only pull updates infrequently after testing a bunch. We last pulled from the develop branch in November 2023. But that should be irrelevant here as semaphore code hasn't been touched for a very long time.
> > 
> > We're running Alma 9.4 on a machine at TRIUMF and the same version of midas works fine there (Amy, you may already have access to scdms-zeus). I believe Alma and Rocky should be basically identical for this.
> > 
> > So the questions are:
> > * Have you tried other midas programs, or only mserver? E.g. did odbedit and mhttpd work?
> > * If other programs work, have you been running them all as the same user? In particular, if you ran one program as root and another as an unprivileged user, then you will likely get odd permissions issues.
> > * What do you see if you run `ls -l /dev/shm` and `ls -l ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM`? (Or wherever your online dir is for the 2nd one).
> > * Did you follow the full instructions for recovering from a corrupt ODB? https://daq00.triumf.ca/MidasWiki/index.php/FAQ#How_to_recover_from_a_corrupted_ODB   In particular the bit about running odbinit with the --cleanup flag?
> 
> Here's what happens when I try to run odbedit:
> 
> [lekhraj@sdfcdmsdaq setup]$ odbedit
> [ODBEdit,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 481823 does not exists
> [ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Corrected 1 ODB entries
> [ODBEdit,INFO] Deleted entry '/System/Clients/481823' for client 'ODBEdit' because it is not connected to ODB
> [ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 481823 does not exist
> [ODBEdit,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, exiting...
> [ODBEdit,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment not called at end of program
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while already inside db_{lock,unlock}_database(). Maybe this is a call from a signal handler. Cannot continue, aborting...
> Aborted (core dumped)
> [lekhraj@sdfcdmsdaq setup]$
> 
> And mhttpd:
> 
> [lekhraj@sdfcdmsdaq setup]$ mhttpd
> [mhttpd,ERROR] [odb.cxx:2052:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 601054 does not exists
> [mhttpd,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [mhttpd,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [mhttpd,INFO] Corrected 1 ODB entries
> [mhttpd,INFO] Deleted entry '/System/Clients/601054' for client 'ODBEdit' because it is not connected to ODB
> [mhttpd,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 601054 does not exist
> [mhttpd,INFO] ODB subtree /Runinfo corrected successfully
> Password protection is off
> Hostlist off, connections from anywhere will be accepted
> Listening on "http://localhost:8080", passwords OFF, hostlist OFF
> Listening on "http://[::1]:8080", passwords OFF, hostlist OFF
> bm_lock_buffer: Lock buffer "SYSMSG" is taking longer than 1 second!
> [mhttpd,ERROR] [odb.cxx:2498:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, exiting...
> [mhttpd,ERROR] [midas.cxx:2205:cm_check_connect,ERROR] cm_disconnect_experiment not called at end of program
> db_lock_database: Detected recursive call to db_{lock,unlock}_database() while already inside db_{lock,unlock}_database(). Maybe this is a call from a signal handler. Cannot continue, aborting...
> Aborted (core dumped)
> [lekhraj@sdfcdmsdaq setup]$
> 
> We have been running everything as a single user, the user who cloned the repositories and owns the directories.
> 
> We did follow the corrupted-ODB cleanup instructions.
> 
> [lekhraj@sdfcdmsdaq setup]$ ls -lh /dev/shm
> total 1.3M
> -rw------- 1 lekhraj dm 1.2M Oct  8 14:13 17468_test_ODB__sdf_home_l_lekhraj_packages_SuperCDMS_DAQ_MidasDAQ_online_
> -rw------- 1 lekhraj dm 114K Oct  7 14:06 17468_test_SYSMSG__sdf_home_l_lekhraj_packages_SuperCDMS_DAQ_MidasDAQ_online_
> 
> [lekhraj@sdfcdmsdaq setup]$ ls -lh ~/packages/SuperCDMS_DAQ/MidasDAQ/online/.*SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ALARM.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ELOG.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.HISTORY.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.LAZY.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.MSG.SHM
> -rw-r--r-- 1 lekhraj dm 1.2M Oct  8 14:12 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.ODB.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.SYSMSG.SHM
> -rw-r--r-- 1 lekhraj dm    0 Oct  3 08:46 /sdf/home/l/lekhraj/packages/SuperCDMS_DAQ/MidasDAQ/online/.SYSTEM.SHM

08 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4

> Basically I think Rocky 9.4 is a red herring.

This is what likely happened:
- some program crashed while holding the ODB lock semaphore (by ctrl-C at the wrong time or by kill -KILL at the wring time)
- semaphore is locked with a flag "unlock if this program stops"
- this is supposed to ensure ODB lock semaphore never gets stuck in the locked state
- (there is no code in MIDAS to unlock the ODB lock semaphore without locking it first)
- we have observed a malfunction in the Linux kernel, where this automatic unlock does not happen.
- it is rare and so far cannot be reproduced. you can find more about it by searching this forum.
- I think it is a bug in the System-V semaphore code or in the Linux "program stop" code (a path where they fail to call the semaphore unlock handler, and who knows what
other handlers).
- System-V semaphores are very obsolete, replaced by POSIX semaphores.
- POSIX semaphores do not have the "unlock if this program stops" magic, the user (MIDAS) is responsible with detecting that the program who locked the semaphore is gone
and and with taking corrective action, i.e. release the lock, automatically.

I do not know if this problem with System-V semaphores is in the generic Linux kernel, or if it is specific
to the Red Hat kernels (they are known to have many patches, deviating quite far from vanilla kernels).

I do not know if this problem is somehow sensitive to virtual machines.

So yes/no, Red Hat derived linux on a virtual machine could be where this problem happens more often that elsewhere.

K.O.

08 Oct 2024, Konstantin Olchanski, Bug Report, Difficulty running MIDAS on Rocky 9.4

> I read these error messages. There is no ODB corruption. ODB semaphore is locked and all midas programs will fail...

Recovery from this is:
- stop all midas programs (actually they should have all crashed by now)
- identify the ODB semaphore with: ipcs -s -t
- remove the ODB semaphore with: ipcrm sem <semid>
- where <semid> is from the first column of ipcs
- keep deleting semaphores until odbedit works.
- if you delete extra midas sempahores, odbedit will recreate them
- if you delete non-midas semaphores, oh, well...

Little bit better steps for this recovery may be written up by Suzannah in this forum
or in the midas wiki... good luck finding them.

K.O.

Goto page Previous 1, 2, 3 ... 124, 125, 126 ... 156, 157, 158 Next

ELOG V3.1.4-2e1708b5