Back Midas Rome Roody Rootana
  Midas DAQ System, Page 109 of 138  Not logged in ELOG logo
ID Date Authordown Topic Subject
  2379   31 Mar 2022 Konstantin OlchanskiBug Fix"run stop" trouble in mlogger, fixed
while debugging something else, I ran into a bit of trouble in mlogger.

I set the mlogger event limit to 100, and after reaching 100 events, mlogger
sayd "stopping run", but nothing happened, run kept going.

it turns out mlogger tried stopping the run too soon, the run-start
transition did not finish yet and the error message about trying
to stop a run while another transition is in progress was missing.

(fixed - if another transition is in progress, we try again later)

it also turns out that cm_transition() checks if another transition
is in progress way too late, all the way in the transition thread,
where it cannot return it is an error to mlogger.

(fixed - first thing done in cm_transition() is this check).

while debugging this, I tested the ODB flags "/Logger/Async transitions"
and /Logger/Multithread transitions". It turns out only two transition
types still work from inside mlogger - multithread transition
and detached transition (via the mtransition helper).

the issue is the dead lock between mlogger and frontend. while mlogger
is inside cm_transition(), it is not reading the SYSTEM buffer,
while at the same time frontends are writing into it. If SYSTEM
buffer happens to be pretty full, we dead lock - frontends are waiting
for free space in the SYSTEM buffer do not respond to RPCs, mlogger is not 
reading from the SYSTEM and it stuck trying to issue "run stop" RPC
to frontend. (this dead lock is not forever, eventually frontend
is killed by RPC timeout, mlogger survives and stops the run).

this is a well known problem and as solution, mlogger has been using the 
multithreaded transitions for years.

now I removed the OBD /Logger/Async transition and /Logger/Multithread 
transition flags, instead, there is now a flag /Logger/Detached transitions
set to FALSE by default. Setting it to TRUE will cause mlogger to fork 
"mtransition STOP" and "mtransition START" for stopping and starting runs,
this is useful in case there is trouble with multithreading in mlogger.

K.O.
  2381   04 Apr 2022 Konstantin OlchanskiSuggestionMaximum ODB size
> Anybody some idea what the maximum ODB size can be?

It turns out ODB size limit is hardwired on db_open_database() at 100 Mbytes.

I now committed an improved error message for this.

I confirm that "odbinit -s 100MB" works and creates ODB with 50 Mbyte data area and 50 
Mbyte key area.

> in the age of 64GB RAM being a standard, we should be able to grow bigger ...

I agree, I think we can safely bump the limit from 100 Mbytes to 1 Gbyte, maybe 1.5 or 
1.99 Gbytes. Above that we run into 32-bit/31-bit cleanliness problems.

And creating extra large 1 GB ODB but using only a few megabytes will not waste any 
RAM, because the .ODB.SHM file is demand-paged and non-used parts of ODB will not be 
mapped into RAM. (It will waste disk space, file .ODB.SHM will be 1 GByte size).

However, 1 GByte (FPGA based) and 4-8 GByte (Raspberry Pi & co) machines are again
becoming popular and relevant for running MIDAS, and they have very slow "disk" 
subsystems, with NAND, SD and USB flash, so we should not go crazy here.

> odbinit -s 1024MB --cleanup

there is a bug in odbinit, if initial odbinit fails, ODB with default size is creates, 
and original rejected ODB size is written to .ODB_SIZE.TXT (an inconsistency).

bitbucket bug 328

> [ how do I resize ODB ??? ]

we need odbresize. bitbucket bug 329.

K.O.
  2382   12 Apr 2022 Konstantin OlchanskiInfoODB JSON support
> > > > odbedit can now save ODB in JSON-formatted files.
> > encode NaN, Inf and -Inf as JSON string values "NaN", "Infinity" and "-Infinity". (Corresponding to the respective Javascript values).
> http://docs.oasis-open.org/odata/odata-json-format/v4.0/os/odata-json-format-v4.0-os.html
> > Values of types [...] Edm.Single, Edm.Double, and Edm.Decimal are represented as JSON numbers,
> except for NaN, INF, and –INF which are represented as strings "NaN", "INF" and "-INF".
> https://xkcd.com/927/

Per xkcd, there is a new json standard "json5". In addition to other things, numeric
values NaN, +Infinity and -Infinity are encoded as literals NaN, Infinity and -Infinity (without quotes):
https://spec.json5.org/#numbers

Good discussion of this mess here:
https://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript

K.O.
  2384   13 Apr 2022 Konstantin OlchanskiInfoODB JSON support
> > Per xkcd, there is a new json standard "json5". In addition to other things, numeric
> > values NaN, +Infinity and -Infinity are encoded as literals NaN, Infinity and -Infinity (without quotes):
> > https://spec.json5.org/#numbers
> 
> Just for curiosity: Is this implemented by the midas json library now?

MIDAS encodes NaN, Infinity and -Infinity as javascript compatible "NaN", "Infinity" and "-Infinity",
this encoding is popular with other projects and allows correct transmission of these values
from ODB to javascript. The test code for this is on the MIDAS "Example" page, scroll down
to "Test nan and inf encoding".

I think this type of encoding, using strings to encode special values, is more in the spirit of json,
compared to other approaches such as adding special literals just for a few special cases
leaving other special cases in the cold (ieee-754 specifies several different types of NaN,
you can encode them into different nan-strings, but not into the one nan-literal (need more nan-literals,
requires change to the standard and change to every json parser).

As editorial comment, it boggles my mind, what university or kindergarden these people went to
who made the biggest number, the smallest number and the imaginary number (sqrt(-1))
all equal to zero (all encoded as literal null).

K.O.
  2386   24 Apr 2022 Konstantin OlchanskiBug Fixmserver buffer overrun and crash
There is a memory allocation bug in the mserver.

ALIGN8() was missing when receiving events from the event socket and data buffer 
was allocated 4 bytes too short. but only for some received events and only in 
very unlucky sequence of received events. result was a rare but obnoxious crash 
of fevme frontend in alpha-2 at CERN. (we do not see any crash from this in 
alpha-g or anywhere else, the best I can tell).

fixed in commit 4dc06ba47ff7caa5251fd8c48d8533f35799f3a6.

If you use the mserver, please update to this commit or apply following patch in 
midas.cxx:

-   int bufsize = sizeof(INT) + event_size;
+   int bufsize = sizeof(INT) + total_size;

K.O.
  2388   30 Apr 2022 Konstantin OlchanskiForumS3 Object Storage
> We are storing raw MIDAS files to S3 Object Storage, but MIDAS file are not 
> optimised for readout from such kind of storage. There is any work around on 
> evolution of midas raw output or, beyond simulated posix fs,  to develop midas 
> python library optimised to stream data from S3 (is not really clear to me if this 
> is possible).

We have plans for adding S3 object storage support to lazylogger, but have not gotten 
around to it yet.

We do not plan to add this in mlogger. mlogger works well for writing data to locally-
attached storage (local ext4, XFS, ZFS) but always runs into problems with timeouts and 
delays when writing to anything network-attached (even writing to NFS).

I envision that each midas raw data file (mid.gz or mid.lz4 or mid.bz2) will
be stored as an S3 object and there will be some kind of directory object
to map object ids to run and subrun numbers.

Choice of best file size is open, normally we use subruns to limit file size to 1-2 
Gbytes. If cloud storage prefers some other object size, we can easily to up to 10 
Gbytes and down to "a few megabytes" (ODB dumps will have to be turned off for this).

Other than that, in your view, what else is needed to optimize midas files for storage 
in the Amazon S3 could?

P.S. For reading files from the cloud, code needs to be written and added to 
midasio/midasio.cxx, for example, see the code that is already there for reading ssh-
attached files and dcache/dccp-attached files. (CERN EOS files can be read directly 
from POSIX mount point /eos).

K.O.
  2389   30 Apr 2022 Konstantin OlchanskiInfoadded web pages for "show odb clients" and "show open records"
for a long time, midas web pages have been missing the equivalent of odbedit 
"scl" and "sor" to display current odb clients and current odb open records.

this is now added as buttons "show open records" and "show odb clients" in the 
odb editor page.

as in odbedit, "sor" shows open records under the current subtree, i.e. if you 
are looking at /equipment, you will not see open records for /experiment. to see 
all open records, go to "/".

commit b1ab7e67ecf785744fff092708d8389f222b14a4

K.O.
  2390   01 May 2022 Konstantin OlchanskiInfoadded web page for "mdump"
added JSON RPC for bm_receive_event() and added a web page for "mdump".

the event dump is a hex dump for now.

if somebody can contribute a javascript decoder for midas bank format, it would be greatly appreciated.

otherwise, I will eventually write my own decoder library patterned on midasio.h and midasio.cxx.

as of commit 5882d55d1f5bbbdb0d9238ada639e63ac27d8825
K.O.
  2391   01 May 2022 Konstantin OlchanskiInfoadded web page for "mdump"
> added JSON RPC for bm_receive_event()

there is a number of problems with implementing bm_receive_event() as a RPC:

1) mhttpd has only event buffer 1 read pointer for all javascript connections, if two browser tabs are 
running mdump, they will "steal" events from each other.
2) javascript connections are state-less and we cannot specify per-connection event_id and trigger_mask 
filters to bm_receive_event(). our bm_request_event() has to be for all event_id and all trigger_mask.
3) for same reason, we cannot have some requests to be GET_ALL, some to be GET_RECENT and some to be 
GET_OLD (if GET_OLD is ever implemented).

Problem (1) is hard to fix. Only solution I can see is to have mhttpd have it's own event buffer that can 
somehow track which events have been sent to which javascript connection.

The same scheme allows implementing GET_ALL and per-connection event_id and trigger_mask filters.

The difficulty is in detecting javascript connections that are no longer active and it's event request and 
events we have buffered for it can be deleted. Unlike proper rpc clients, javascript browser tabs can be 
closed without warning and without opportunity to tell rpc server that they are closed, gone.

K.O.
  2392   01 May 2022 Konstantin OlchanskiInfoadded web page for "mdump"
> added a web page for "mdump".

missing functions:
- get a list of existing event buffers (should read event buffer names from /Experiment/Buffer sizes)
- selector box to select event buffer
- button for "get next" and "get new" (should call bm_skip_event() before bm_receive_event())
- entry fields for event_id and trigger_mask event filter
- check box for "keep getting new data" and entry field for update frequency
- (eventually) entry field for bank name filter

K.O.
  2396   04 May 2022 Konstantin OlchanskiBug Fixmysql history update
the code for writing midas history to mysql has been updated to work against 
MYSQL 8.0.23 (CERN ALPHA-2):

- as ever mysql reports inconsistent data types (I create column with type 
"integer", mysql reports it has type "int" and so forth), the special kludge to 
take care of this had to be tweaked.

- this caused some columns to be marked "inactive" and the code to "reactivate" 
them was missing (fixed)

- binary history event data size was computed incorrectly for events with 
"inactive" columns (fixed) and caused assert() failure and mlogger crash.

- mysql read of column definitions for history event "system" (as in 
/history/links/system) bombed because of incorrect quoting (worked before, why? 
why bombed now?). this caused duplicate columns to be created in mysql table 
"system" and mlogger bomb-out with complaint about "duplicated columns" 
(actually the error message was missing, so it was a silent bomb-out). quoting 
fixed, missing error message fixed, but cleanup of duplicate columns has to be 
done by hand. in case of alpha-2 the fix was to remove the unused 
/history/links/system).

if you are using mysql history please update or patch src/history_schema.cxx.

commit 9d17d2fef233cf457121ca7c2a283c4c76ed33bc

K.O.
  Draft   08 May 2022 Konstantin OlchanskiInfoRO_STOPPED with triggered events
> If events are sent when a run is stopped, this leads to many unexpected results

I think we need to understand what these unexpected results are.

Naively thinking, of would expect midas to not care 
  2400   08 May 2022 Konstantin OlchanskiInfoRO_STOPPED with triggered events
> some old front-end are not running any more since they do use RO_ALWAYS together with 
triggered events.

I confirm, if you have mfe.c frontends that have RO_ALWAYS, after you update MIDAS, 
some of these frontends will fail to start.
https://bitbucket.org/tmidas/midas/commits/1961af0d657e4f76ab9db17f9b70c0c492172b6d

tmfe c++ frontends do not have this restriction but by default only read data when run 
is active (per-equipment fEqConfReadOnlyWhenRunning default is true).

K.O.
  2401   13 May 2022 Konstantin OlchanskiInfoanalysis of corner cases in event buffer write cache
introduction:

to remember, bm_send_event() writes an event to the write cache, bm_flush_cache() 
writes the contents of the write cache into the shared memory event buffer, buffer 
free space is consumed. in the usual case, mlogger is reading events from the shared 
memory event buffer, buffer free space is released. there is also a read cache, not 
part of this discussion.

the purpose of the write cache is to reduce contention for the shared memory 
semaphore. in the case of large number of small events, semaphore is locked per 
cache-flush, instead of per-event. correct tuning of write cache and event size can 
reduce lock rate from >100 kHz to around 100 Hz or lower.

analysis:

for correct operation of bm_send_event() under all conditions we need to consider 
all corner cases:

1) no write cache: (cache size set to 0)

- event_size > buffer_size -> reject the event (obviously)
- event_size > 0.5 * buffer_size -> only 1 event fits into the buffer, next write 
will stall until mlogger reads the previous event (sequential operation, bad)
- event_size < 0.3 * buffer_size -> at least 2 events fit into the buffer (good)

decision: limit event size to 0.5 to 0.3 * buffer_size (current limit is 0.5 * 
buffer_size, I think).

consequence: buffer size limit is 2 Gbytes (32-bit byte offsets, code is only 31-
bit-clean), max event size is between 1 Gbytes and 0.6 Gbytes.

2) writing to write cache:

- event_size > cache_size -> flush cache, write event to directly to buffer
- event_size > 0.5 * cache_size -> inefficient use of cache: write to cache, next 
event does not fit, flush to buffer, repeat. no gain in semaphore locking (bad), one 
additional memcpy() (event to cache and cache to buffer) (bad)
- event_size < 0.3 * cache_size -> multiple events fit into cache, but probably no 
gain in semaphore locking

decision: events that are bigger than 0.3 to 0.1 * cache_size should not go through 
the cache. (flush cache, write directly to buffer).

3) flush write cache to buffer:

- cache_size > buffer_size -> cannot flush in 1 operation, must have a loop and 
flush the cache in pieces
- cache_size between 0.5 and 1.0 * buffer_size -> can flush in 1 operation, but must 
wait for mlogger to fully empty the buffer (sequential operation, bad)
- cache size < 0.3 * buffer_size -> can flush in 1 operation, at least 2 "flushes" 
fit inside the buffer (good)

decision: limit write cache size to 0.3 * buffer_size. (current limit is 
0.25*buffer_size).

consequences:

- write cache size limit is 0.3..0.25 * 2GB = 0.6..0.5 Gbytes
- cached event size limit is 0.3..0.1 * 0.5 GBytes = 150..50 Mbytes
- minimum number of cached events: 3 to 10
- semaphore locks reduced: 3 to 10 locks become 1 lock (all events cached),
4 to 11 locks become 2 locks (big event causes cache flush).

4) complications:

- there is a periodic 1/second bm_flush_cache() that flushes the cache early and 
reduces it's efficiency (but needed to avoid having data stuck in cache for long 
time)
- if multiple frontends use large write cache (~ 0.3..0.5 * buffer_size), again, 
sequential operation can happen (bad)
- write cache is per-frontend, not per-equipment. if different equipments request 
different cache sizes, mfe.c and tmfe c++ frontends complain about this, but the 
user has to sort it out.

K.O.
  2402   16 May 2022 Konstantin OlchanskiInfoRO_STOPPED with triggered events
> > some old front-end are not running any more since they do use RO_ALWAYS together with 
> triggered events.
> 
> I confirm, if you have mfe.c frontends that have RO_ALWAYS, after you update MIDAS, 
> some of these frontends will fail to start.
> https://bitbucket.org/tmidas/midas/commits/1961af0d657e4f76ab9db17f9b70c0c492172b6d
> 
> tmfe c++ frontends do not have this restriction but by default only read data when run 
> is active (per-equipment fEqConfReadOnlyWhenRunning default is true).

As of commit 
https://bitbucket.org/tmidas/midas/commits/28d9c96bd6d4f65346ebcd6a04492ea764c90823 mfe.c 
frontends will no longer fail to start. an error will still be issued "Equipment \"%s\" 
contains RO_STOPPED or RO_ALWAYS. This can lead to undesired side-effect and should be 
removed."

BTW 1:

Some of our old frontends use EQ_MULTITHREAD to implement multithreaded periodic equipments. 
They do not generate any events when there is no run (some of them do not generate any 
events at all). Now they will start printing this error message, for no reason. (no we will 
not be rewriting them justy to get rid of this message. life is too short).

BTW 2:

the c++ tmfe frontend does not have any protections against these "undersired side-effects".

What are these undesired side effects and should we add protection against them?

K.O.
  2403   16 May 2022 Konstantin OlchanskiInfoanalysis of corner cases in event buffer write cache
> for correct operation of bm_send_event() under all conditions we need to ...

to continue computation from last message:

default SYSTEM buffer size: 32 MiBytes
default max event size: 4 MiBytes

hard max buffer size: 2 Gbytes (code is only 31-bit-clean)
hard max event size: 2 Gbytes (code is only 31-bit-clean)

max event size currently: 32 Mbytes (same as buffer size)
max event size per (1) in previous post: 32*0.5..0.3 = 16..9 MiBytes

number of default-max-size events buffered: 32/4 = 8.
number of per (1) max-size events buffered: 2 or 3
number of current max-size events buffered: 0 (bad, frontend is serialized with mlogger)

default write cache size: 100 kbytes

max write cache size currently: buffer size / 4 = 32/4 = 8 MiBytes
max write cache size per (3) in previous post: buffer_size / 3 = 10 Mbytes
hard max write cache size per (3): 2 Gbytes/3 = 600 Mbytes

max size of cached events:

current: 100 kbytes (size as cache size)
per (2) in previous post: 0.1..0.3 * cache size = 10..30 kbytes
per (2), 1 Mbyte cahe: 0.1..0.3 * cache size = 100..300 kbytes
hard max size: 0.1..0.3 * hard_max_cache_size = 0.1..0.3 * 600 = 60..180 Mbytes.

max data rate before event buffer semaphore locking rate exceeds 100 Hz:

1 kbyte events, no write cache: 100 kbytes/sec
1 kbyte events, 100 kbyte cache: 100 events cached, cache flush rate 100 Hz -> 100*1kbyte*100Hz -> 10 Mbytes/sec
1 kbyte events, 1 Mbyte cache: 1000 events cached, cache flush rate 100 Hz -> 100 Mbytes/sec (1gige ethernet)
N kbyte events, 1 Mbyte cache: same thing (data rate is limited by cache flush rate 100 Hz)
100 kbyte events, 1 Mbyte cache, not cached per (2): 100kbyte*100Hz = 10 Mbytes/sec
300 kbyte events, 1 Mbyte cache, not cached per (2): 300kbyte*100Hz = 30 Mbytes/sec
N00 kbyte events: N0 Mbytes/sec (500->50, etc)
1 kbyte events, 10 Mbyte cache: 10000 events cached, cache flush rate 100 Hz -> 1000 Mbytes/sec (10gige ethernet)
N kbyte events, 10 Mbyte cache: same thing (data rate is limited by cache flush rate 100 Hz)
1000 kbyte events, 10 Mbyte cache, not cached per (2): 1000kbyte*100Hz = 100 Mbytes/sec
3000 kbyte events, 10 Mbyte cache, not cached per (2): 3000kbyte*100Hz = 300 Mbytes/sec
N000 kbyte events: N00 Mbytes/sec (4000->400, 5000->500, etc)
default max event size: 4 Mibytes*100Hz = 400 Mbytes/sec (exceeds 1gige ethernet)
hard max event size (divided by 10 to buffer 10 events): 200 Mbytes*100Hz -> 20 Gbytes/sec

max event rate before event buffer semaphore locking rate exceeds 100 Hz:

1 kbyte events, no write cache: 100 Hz (obviously)
1 kbyte events, 100 kbyte cache: 100 events cached, cache flush rate 100 Hz -> 10 kHz
1 kbyte events, 1 Mbyte cache: 1000 events cached, cache flush rate 100 Hz -> 100 kHz
N kbyte events, 1 Mbyte cache: 1000/N events cached, cache flush rate 100 Hz -> 100/N kHz
1 kbyte events, 10 Mbyte cache: 10000 events cached, cache flush rate 100 Hz -> 1000 kHz
N kbyte events, 10 Mbyte cache: 10000/N events cached, cache flush rate 100 Hz -> 1000/N kHz
100 kbyte events, not cached per (2): 100 Hz (obviously)
300 kbyte events, not cached per (2): 100 Hz (obviously)
default max event size: 100 Hz (obviously)

K.O.
  2404   16 May 2022 Konstantin OlchanskiInfoanalysis of corner cases in event buffer write cache
> > for correct operation of bm_send_event() under all conditions we need to ...
> to continue computation from last message:

if I got my numbers right, for present-day hardware (1gige/10gige data rates, 100 Hz max locking rate), we should 
increase the default buffer write cache size from 100 kbytes to 10 Mbytes.

this cache size will permit processing of the full mix of small/big events
at the full mix of event rates without exceeding the 100 Hz semaphore locking rate.

with the 10 Mbyte write cache, default event buffer size should be 30-40 Mbytes (current size is 33 Mbytes, so does 
not need to change).

this computation is for 1 writer (1 reader, mlogger). it is a typical case for our experiments.

multiple writers can run into contention for event buffer space.

consider 10 writers want to flush their 10 Mbyte write cache all at the same time:

if buffer size is the default 33 Mbytes, the first 3 writers will have successful write cache flush,
but the other 7 will stall, there is no space in the buffer, we have to wait for mlogger to free
some (mlogger writing X Mbytes/sec will take Y milliseconds to liberate 10 Mbytes of space for the 4th writer
to successfully flush, writers 5..10 are still stalled).

but in a system with 10 writers writing at 10 Mbytes/sec (1 Hz default cache flush rate) is 100 Mbytes/sec
will likely have SYSTEM buffer size at least 200-300 Mbytes (to buffer 1-2 seconds of data against
any delays in writing to disk/network storage).

so there should be no problem in practice.

K.O.
  2405   16 May 2022 Konstantin OlchanskiBug Fixmserver buffer overrun and crash
> There is a memory allocation bug in the mserver.

Fix for this problem introduced a new problem, an infinite loop in bm_flush_cache, 
bitbucket bugs https://bitbucket.org/tmidas/midas/issues/339/infinite-loop-in-
mserver-due-to-mfes and https://bitbucket.org/tmidas/midas/issues/331/stuck-
semaphore-of-system-buffer

This is now fixed and the buffer write cache logic and size was rejigged
according to calculations in https://daq00.triumf.ca/elog-midas/Midas/2401

Event buffer write cache (as set via ODB Equipment/Common and via 
bm_set_cache_size()) now take 2 possible values:
0 - write cache is disabled and
MIN_WRITE_CACHE_SIZE - (10 Mbytes) minimum permitted cache size
bigger cache size values are permitted, up to buffer_size/3, but probably not useful 
if my calculations are right.
smaller cache size values are generally not useful, if my calculations are right.

mfe.c and tmfe c++ frontends updated to request the new write cache size by default.

if events are getting stuck in the write cache for too long, instead of reducing the 
cache size, one should increase frequency of bm_flush_cache() calls (1/sec by 
default).

commit 373bcc3ab7f83c3c7bf6c051c237de043a982502

K.O.
  2409   17 May 2022 Konstantin OlchanskiInfoMIDAS switched to C++
> Hi, I have three naive questions about this:

all good questions, ask more of them.

>  - have you posted somewhere this guide about converting C frontends to C++?

yes, in this elog here I posted a guide for converting C mfe.c frontends to C++ and
a guide for converting mfe.c frontend to C++ TMFE frontend. please use the "find" function,
if you cannot find them, let me know, I will look for it for you.

>  - it was mentioned previously that there will be a 'tag the last "C" midas', which version is it?

correct. please run "git tag", tags before "midas-2019-05-cxx"is "C", after is "C++".

>  - it means that even a simple example like odb_test.c cannot be compile anymore? Even when using g++?
> g++ -I $HOME/daq/packages/midas/include/ -L $HOME/daq/packages/midas/lib/ odb_test.c -l midas
> is expected to fail or is just me glitching? Is it because of thread library differences?

yes, it is expected to fail, you have spaces after "-I", "-L" and "-l", incorrect g++ command syntax. after
correcting this, it may or may not work depending on what you have inside odb_test.c. I would be happy
to help you debug this, but please start a separate thread instead of necroposting into the C++ announcements.

K.O.
  2415   18 Jul 2022 Konstantin OlchanskiBug ReportRPC timeout for manalyzer over network
> In ALPHA, I get RPC timeouts running a (reasonably heavy) analyzer on a remote machine (connected directly via a ~30 meter 10Gbe Ethernet cable) after ~5 minutes of running. If I run the analyser locally, I dont not see a timeout...

there is a subtle bug in the mserver. under rare conditions, ss_suspend() will recurse in an unexpected way
and mserver will go to sleep waiting for data from a udp socket (that will never arrive, so sleep forever).
remote client will see it as an rpc timeout. in my tests (and in ALPHA-g at CERN, as reported by Joseph),
I see this rare condition to happen about every 5 minutes. in normal use, this is the first time we become
aware of this problem, the best I can tell this bug was in the mserver since day one.

commit https://bitbucket.org/tmidas/midas/commits/fbd06ad9d665b1341bd58b0e28d6625877f3cbd0
to develop and
to release/midas-2022-05

The stack trace that shows the mserver hang/crash (sleep() is the stand-in for the sleep-forever socket read).

(gdb) bt
#0  0x00007f922c53f9e0 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007f922c53f894 in sleep () from /lib64/libc.so.6
#2  0x0000000000451922 in ss_suspend (millisec=millisec@entry=100, msg=msg@entry=1) at /home/agmini/packages/midas/src/system.cxx:4433
#3  0x0000000000411d53 in bm_wait_for_more_events_locked (pbuf_guard=..., pc=pc@entry=0x7f920639b93c, timeout_msec=timeout_msec@entry=100, 
    unlock_read_cache=unlock_read_cache@entry=1) at /home/agmini/packages/midas/src/midas.cxx:9429
#4  0x00000000004238c3 in bm_fill_read_cache_locked (timeout_msec=100, pbuf_guard=...) at /home/agmini/packages/midas/src/midas.cxx:9003
#5  bm_read_buffer (pbuf=pbuf@entry=0xdf8b50, buffer_handle=buffer_handle@entry=2, bufptr=bufptr@entry=0x0, buf=buf@entry=0x7f9203d75020, 
    buf_size=buf_size@entry=0x7f920639aa20, vecptr=vecptr@entry=0x0, timeout_msec=timeout_msec@entry=100, convert_flags=0, 
    dispatch=dispatch@entry=0) at /home/agmini/packages/midas/src/midas.cxx:10279
#6  0x0000000000424161 in bm_receive_event (buffer_handle=2, destination=0x7f9203d75020, buf_size=0x7f920639aa20, timeout_msec=100)
    at /home/agmini/packages/midas/src/midas.cxx:10649
#7  0x0000000000406ae4 in rpc_server_dispatch (index=11111, prpc_param=0x7ffcad70b7a0) at /home/agmini/packages/midas/progs/mserver.cxx:575
#8  0x000000000041ce9c in rpc_execute (sock=10, buffer=buffer@entry=0xe11570 "g+", convert_flags=0)
    at /home/agmini/packages/midas/src/midas.cxx:15003
#9  0x000000000041d7a5 in rpc_server_receive_rpc (idx=idx@entry=0, sa=0xde6ba0) at /home/agmini/packages/midas/src/midas.cxx:15958
#10 0x0000000000451455 in ss_suspend (millisec=millisec@entry=1000, msg=msg@entry=0) at /home/agmini/packages/midas/src/system.cxx:4575
#11 0x000000000041deb2 in rpc_server_loop () at /home/agmini/packages/midas/src/midas.cxx:15907
#12 0x0000000000405266 in main (argc=9, argv=<optimized out>) at /home/agmini/packages/midas/progs/mserver.cxx:390
(gdb) 

K.O.
ELOG V3.1.4-2e1708b5