> > for correct operation of bm_send_event() under all conditions we need to ...
> to continue computation from last message:
if I got my numbers right, for present-day hardware (1gige/10gige data rates, 100 Hz max locking rate), we should
increase the default buffer write cache size from 100 kbytes to 10 Mbytes.
this cache size will permit processing of the full mix of small/big events
at the full mix of event rates without exceeding the 100 Hz semaphore locking rate.
with the 10 Mbyte write cache, default event buffer size should be 30-40 Mbytes (current size is 33 Mbytes, so does
not need to change).
this computation is for 1 writer (1 reader, mlogger). it is a typical case for our experiments.
multiple writers can run into contention for event buffer space.
consider 10 writers want to flush their 10 Mbyte write cache all at the same time:
if buffer size is the default 33 Mbytes, the first 3 writers will have successful write cache flush,
but the other 7 will stall, there is no space in the buffer, we have to wait for mlogger to free
some (mlogger writing X Mbytes/sec will take Y milliseconds to liberate 10 Mbytes of space for the 4th writer
to successfully flush, writers 5..10 are still stalled).
but in a system with 10 writers writing at 10 Mbytes/sec (1 Hz default cache flush rate) is 100 Mbytes/sec
will likely have SYSTEM buffer size at least 200-300 Mbytes (to buffer 1-2 seconds of data against
any delays in writing to disk/network storage).
so there should be no problem in practice.
> for correct operation of bm_send_event() under all conditions we need to ...
to continue computation from last message:
default SYSTEM buffer size: 32 MiBytes
default max event size: 4 MiBytes
hard max buffer size: 2 Gbytes (code is only 31-bit-clean)
hard max event size: 2 Gbytes (code is only 31-bit-clean)
max event size currently: 32 Mbytes (same as buffer size)
max event size per (1) in previous post: 32*0.5..0.3 = 16..9 MiBytes
number of default-max-size events buffered: 32/4 = 8.
number of per (1) max-size events buffered: 2 or 3
number of current max-size events buffered: 0 (bad, frontend is serialized with mlogger)
default write cache size: 100 kbytes
max write cache size currently: buffer size / 4 = 32/4 = 8 MiBytes
max write cache size per (3) in previous post: buffer_size / 3 = 10 Mbytes
hard max write cache size per (3): 2 Gbytes/3 = 600 Mbytes
max size of cached events:
current: 100 kbytes (size as cache size)
per (2) in previous post: 0.1..0.3 * cache size = 10..30 kbytes
per (2), 1 Mbyte cahe: 0.1..0.3 * cache size = 100..300 kbytes
hard max size: 0.1..0.3 * hard_max_cache_size = 0.1..0.3 * 600 = 60..180 Mbytes.
max data rate before event buffer semaphore locking rate exceeds 100 Hz:
1 kbyte events, no write cache: 100 kbytes/sec
1 kbyte events, 100 kbyte cache: 100 events cached, cache flush rate 100 Hz -> 100*1kbyte*100Hz -> 10 Mbytes/sec
1 kbyte events, 1 Mbyte cache: 1000 events cached, cache flush rate 100 Hz -> 100 Mbytes/sec (1gige ethernet)
N kbyte events, 1 Mbyte cache: same thing (data rate is limited by cache flush rate 100 Hz)
100 kbyte events, 1 Mbyte cache, not cached per (2): 100kbyte*100Hz = 10 Mbytes/sec
300 kbyte events, 1 Mbyte cache, not cached per (2): 300kbyte*100Hz = 30 Mbytes/sec
N00 kbyte events: N0 Mbytes/sec (500->50, etc)
1 kbyte events, 10 Mbyte cache: 10000 events cached, cache flush rate 100 Hz -> 1000 Mbytes/sec (10gige ethernet)
N kbyte events, 10 Mbyte cache: same thing (data rate is limited by cache flush rate 100 Hz)
1000 kbyte events, 10 Mbyte cache, not cached per (2): 1000kbyte*100Hz = 100 Mbytes/sec
3000 kbyte events, 10 Mbyte cache, not cached per (2): 3000kbyte*100Hz = 300 Mbytes/sec
N000 kbyte events: N00 Mbytes/sec (4000->400, 5000->500, etc)
default max event size: 4 Mibytes*100Hz = 400 Mbytes/sec (exceeds 1gige ethernet)
hard max event size (divided by 10 to buffer 10 events): 200 Mbytes*100Hz -> 20 Gbytes/sec
max event rate before event buffer semaphore locking rate exceeds 100 Hz:
1 kbyte events, no write cache: 100 Hz (obviously)
1 kbyte events, 100 kbyte cache: 100 events cached, cache flush rate 100 Hz -> 10 kHz
1 kbyte events, 1 Mbyte cache: 1000 events cached, cache flush rate 100 Hz -> 100 kHz
N kbyte events, 1 Mbyte cache: 1000/N events cached, cache flush rate 100 Hz -> 100/N kHz
1 kbyte events, 10 Mbyte cache: 10000 events cached, cache flush rate 100 Hz -> 1000 kHz
N kbyte events, 10 Mbyte cache: 10000/N events cached, cache flush rate 100 Hz -> 1000/N kHz
100 kbyte events, not cached per (2): 100 Hz (obviously)
300 kbyte events, not cached per (2): 100 Hz (obviously)
default max event size: 100 Hz (obviously)
> > some old front-end are not running any more since they do use RO_ALWAYS together with
> triggered events.
> I confirm, if you have mfe.c frontends that have RO_ALWAYS, after you update MIDAS,
> some of these frontends will fail to start.
> tmfe c++ frontends do not have this restriction but by default only read data when run
> is active (per-equipment fEqConfReadOnlyWhenRunning default is true).
As of commit
frontends will no longer fail to start. an error will still be issued "Equipment \"%s\"
contains RO_STOPPED or RO_ALWAYS. This can lead to undesired side-effect and should be
Some of our old frontends use EQ_MULTITHREAD to implement multithreaded periodic equipments.
They do not generate any events when there is no run (some of them do not generate any
events at all). Now they will start printing this error message, for no reason. (no we will
not be rewriting them justy to get rid of this message. life is too short).
the c++ tmfe frontend does not have any protections against these "undersired side-effects".
What are these undesired side effects and should we add protection against them?
to remember, bm_send_event() writes an event to the write cache, bm_flush_cache()
writes the contents of the write cache into the shared memory event buffer, buffer
free space is consumed. in the usual case, mlogger is reading events from the shared
memory event buffer, buffer free space is released. there is also a read cache, not
part of this discussion.
the purpose of the write cache is to reduce contention for the shared memory
semaphore. in the case of large number of small events, semaphore is locked per
cache-flush, instead of per-event. correct tuning of write cache and event size can
reduce lock rate from >100 kHz to around 100 Hz or lower.
for correct operation of bm_send_event() under all conditions we need to consider
all corner cases:
1) no write cache: (cache size set to 0)
- event_size > buffer_size -> reject the event (obviously)
- event_size > 0.5 * buffer_size -> only 1 event fits into the buffer, next write
will stall until mlogger reads the previous event (sequential operation, bad)
- event_size < 0.3 * buffer_size -> at least 2 events fit into the buffer (good)
decision: limit event size to 0.5 to 0.3 * buffer_size (current limit is 0.5 *
buffer_size, I think).
consequence: buffer size limit is 2 Gbytes (32-bit byte offsets, code is only 31-
bit-clean), max event size is between 1 Gbytes and 0.6 Gbytes.
2) writing to write cache:
- event_size > cache_size -> flush cache, write event to directly to buffer
- event_size > 0.5 * cache_size -> inefficient use of cache: write to cache, next
event does not fit, flush to buffer, repeat. no gain in semaphore locking (bad), one
additional memcpy() (event to cache and cache to buffer) (bad)
- event_size < 0.3 * cache_size -> multiple events fit into cache, but probably no
gain in semaphore locking
decision: events that are bigger than 0.3 to 0.1 * cache_size should not go through
the cache. (flush cache, write directly to buffer).
3) flush write cache to buffer:
- cache_size > buffer_size -> cannot flush in 1 operation, must have a loop and
flush the cache in pieces
- cache_size between 0.5 and 1.0 * buffer_size -> can flush in 1 operation, but must
wait for mlogger to fully empty the buffer (sequential operation, bad)
- cache size < 0.3 * buffer_size -> can flush in 1 operation, at least 2 "flushes"
fit inside the buffer (good)
decision: limit write cache size to 0.3 * buffer_size. (current limit is
- write cache size limit is 0.3..0.25 * 2GB = 0.6..0.5 Gbytes
- cached event size limit is 0.3..0.1 * 0.5 GBytes = 150..50 Mbytes
- minimum number of cached events: 3 to 10
- semaphore locks reduced: 3 to 10 locks become 1 lock (all events cached),
4 to 11 locks become 2 locks (big event causes cache flush).
- there is a periodic 1/second bm_flush_cache() that flushes the cache early and
reduces it's efficiency (but needed to avoid having data stuck in cache for long
- if multiple frontends use large write cache (~ 0.3..0.5 * buffer_size), again,
sequential operation can happen (bad)
- write cache is per-frontend, not per-equipment. if different equipments request
different cache sizes, mfe.c and tmfe c++ frontends complain about this, but the
user has to sort it out.
> some old front-end are not running any more since they do use RO_ALWAYS together with
I confirm, if you have mfe.c frontends that have RO_ALWAYS, after you update MIDAS,
some of these frontends will fail to start.
tmfe c++ frontends do not have this restriction but by default only read data when run
is active (per-equipment fEqConfReadOnlyWhenRunning default is true).
> If events are sent when a run is stopped, this leads to many unexpected results
I think we need to understand what these unexpected results are.
Naively thinking, of would expect midas to not care
We had issues in one of our experiment that people used RO_STOPPED in the
equipment list together with triggered events (EQ_USER). If events are sent when
a run is stopped, this leads to many unexpected results, so I added a check in
the mfe.cxx code which prevents RO_STOPPED (or RO_ALWAYS which includes
RO_STOPPED) together with EQ_TRIGGERED, EQ_INTERRUPT, EQ_MULTITHREAD and EQ_USER
type of events.
I got now complaints that some old front-end are not running any more since they
do use RO_ALWAYS together with triggered events. Can the author of these frontend
please tell me the rationale why this is needed, then I can maybe add a better
fix for that.
We had the problem in our lab that a frontend took about 6 seconds to gracefully
shut down, mainly it needed to park some motors. I found that the shutdown command
had a hard-coded timeout of 5 seconds, after which the frontend gets killed, and
cannot finish the park operation. I change the code so that the client timeout
stored in the ODB is taken instead of the hard-coded 5 seconds. This allows each
client to fine-tune its timeout, to allow graceful shutdown, but also not let the
user wait too long if the client gets stuck and needs a hard kill.
The default timeout for mfe.cxx based frontends has been changed to 10 seconds
now, but in the frontend_init function this can be changed by the user code
I hope this char does not trigger any bad side effects, but if it does, please
the code for writing midas history to mysql has been updated to work against
MYSQL 8.0.23 (CERN ALPHA-2):
- as ever mysql reports inconsistent data types (I create column with type
"integer", mysql reports it has type "int" and so forth), the special kludge to
take care of this had to be tweaked.
- this caused some columns to be marked "inactive" and the code to "reactivate"
them was missing (fixed)
- binary history event data size was computed incorrectly for events with
"inactive" columns (fixed) and caused assert() failure and mlogger crash.
- mysql read of column definitions for history event "system" (as in
/history/links/system) bombed because of incorrect quoting (worked before, why?
why bombed now?). this caused duplicate columns to be created in mysql table
"system" and mlogger bomb-out with complaint about "duplicated columns"
(actually the error message was missing, so it was a silent bomb-out). quoting
fixed, missing error message fixed, but cleanup of duplicate columns has to be
done by hand. in case of alpha-2 the fix was to remove the unused
if you are using mysql history please update or patch src/history_schema.cxx.
Concerning the "scl" page, we are currently having a discussion. At the moment, one can
see midas clients in three different places:
1) the main status page at the bottom, only names and hosts are there
2) the programs page, where one can also start/stop program
3) now the new page "Show ODB clients" in the ODB editor page, which shows also the
alive status, PID and timeout
I'm thinking that three locations are two too much, so we are considering to merge the
tree pages into one. That would mean that 1) goes away, and the "Programs" page will
show more information. We have some rare cases that programs are removed from
/System/Clients in the ODB but still attached to the ODB. For those "zombies" we would
add a "hard kill" function.
I would like to hear feedback from the midas community before we proceed with the
plans. Anybody desperately in need of the programs shown on the status page?
Here are some of my thoughts:
midasio.cxx library needed.
would rather like that all connections see the SAME event. So mhttpd keeps one event, serves it to all
links, so displays are consistent. If a browser wants to see the "next" event, it send the old serial
number and days "please send next event AFTER serial number". If the serial number is larger than the
event in the buffer, mhttpd fetches a new event and puts it into its buffer.
request. Then mhttpd can retrieve events until event_id and trigger_mask match, then serve that event.
Since reading events from a midas buffer is fast (many 10'000s of events per second), the won't be much of
- GET_ALL does not make sense for browsers, you don't want to slow down any frontend. If someone wants to
do histogramming in the browser, then GET_SOME (which is kind of GET_OLD) would make sense, but most of
the cases we have some single event display, and there a GET_RECENT is most appropriate.
> > We are storing raw MIDAS files to S3 Object Storage, but MIDAS file are not
> > optimised for readout from such kind of storage. There is any work around on
> > evolution of midas raw output or, beyond simulated posix fs, to develop midas
> > python library optimised to stream data from S3 (is not really clear to me if this
> > is possible).
> We have plans for adding S3 object storage support to lazylogger, but have not gotten
> around to it yet.
> We do not plan to add this in mlogger. mlogger works well for writing data to locally-
> attached storage (local ext4, XFS, ZFS) but always runs into problems with timeouts and
> delays when writing to anything network-attached (even writing to NFS).
> I envision that each midas raw data file (mid.gz or mid.lz4 or mid.bz2) will
> be stored as an S3 object and there will be some kind of directory object
> to map object ids to run and subrun numbers.
> Choice of best file size is open, normally we use subruns to limit file size to 1-2
> Gbytes. If cloud storage prefers some other object size, we can easily to up to 10
> Gbytes and down to "a few megabytes" (ODB dumps will have to be turned off for this).
> Other than that, in your view, what else is needed to optimize midas files for storage
> in the Amazon S3 could?
> P.S. For reading files from the cloud, code needs to be written and added to
> midasio/midasio.cxx, for example, see the code that is already there for reading ssh-
> attached files and dcache/dccp-attached files. (CERN EOS files can be read directly
> from POSIX mount point /eos).
actually a I made a small work around with python boto3 library with file of any size (with
the obviously limitation of opportunity and time to wait) eg:
key = 'TMP/run00060.mid.gz'
aws_session = creds.assumed_session("infncloud-iam")
s3 = aws_session.client('s3', endpoint_url="https://minio.cloud.infn.it/",
s3_obj = s3.get_object(Bucket='cygno-data',Key=key)
buf = BytesIO(s3_obj["Body"]._raw_stream.data)
for event in MidasSream(gzip.GzipFile(fileobj=buf)):
print("Saw a special event")
bank_names = ", ".join(b.name for b in event.banks.values())
print("Event # %s of type ID %s contains banks %s" % (event.header.serial_number,
where in MidasSream I just bypass the open, and the code work, but obviously in this way I
need to have all the buffer in memory and it take time get all the buffer. I was interested to
understand if some one have already develop the stream event by event (better in python but
not mandatory). I'll look to the code you underline.
> added a web page for "mdump".
- get a list of existing event buffers (should read event buffer names from /Experiment/Buffer sizes)
- selector box to select event buffer
- button for "get next" and "get new" (should call bm_skip_event() before bm_receive_event())
- entry fields for event_id and trigger_mask event filter
- check box for "keep getting new data" and entry field for update frequency
- (eventually) entry field for bank name filter
> added JSON RPC for bm_receive_event()
there is a number of problems with implementing bm_receive_event() as a RPC:
running mdump, they will "steal" events from each other.
filters to bm_receive_event(). our bm_request_event() has to be for all event_id and all trigger_mask.
3) for same reason, we cannot have some requests to be GET_ALL, some to be GET_RECENT and some to be
GET_OLD (if GET_OLD is ever implemented).
Problem (1) is hard to fix. Only solution I can see is to have mhttpd have it's own event buffer that can
The same scheme allows implementing GET_ALL and per-connection event_id and trigger_mask filters.
closed without warning and without opportunity to tell rpc server that they are closed, gone.
added JSON RPC for bm_receive_event() and added a web page for "mdump".
the event dump is a hex dump for now.
otherwise, I will eventually write my own decoder library patterned on midasio.h and midasio.cxx.
as of commit 5882d55d1f5bbbdb0d9238ada639e63ac27d8825
for a long time, midas web pages have been missing the equivalent of odbedit
"scl" and "sor" to display current odb clients and current odb open records.
this is now added as buttons "show open records" and "show odb clients" in the
odb editor page.
as in odbedit, "sor" shows open records under the current subtree, i.e. if you
are looking at /equipment, you will not see open records for /experiment. to see
all open records, go to "/".
> We are storing raw MIDAS files to S3 Object Storage, but MIDAS file are not
> optimised for readout from such kind of storage. There is any work around on
> evolution of midas raw output or, beyond simulated posix fs, to develop midas
> python library optimised to stream data from S3 (is not really clear to me if this
> is possible).
We have plans for adding S3 object storage support to lazylogger, but have not gotten
around to it yet.
We do not plan to add this in mlogger. mlogger works well for writing data to locally-
attached storage (local ext4, XFS, ZFS) but always runs into problems with timeouts and
delays when writing to anything network-attached (even writing to NFS).
I envision that each midas raw data file (mid.gz or mid.lz4 or mid.bz2) will
be stored as an S3 object and there will be some kind of directory object
to map object ids to run and subrun numbers.
Choice of best file size is open, normally we use subruns to limit file size to 1-2
Gbytes. If cloud storage prefers some other object size, we can easily to up to 10
Gbytes and down to "a few megabytes" (ODB dumps will have to be turned off for this).
Other than that, in your view, what else is needed to optimize midas files for storage
in the Amazon S3 could?
P.S. For reading files from the cloud, code needs to be written and added to
midasio/midasio.cxx, for example, see the code that is already there for reading ssh-
attached files and dcache/dccp-attached files. (CERN EOS files can be read directly
from POSIX mount point /eos).
We are storing raw MIDAS files to S3 Object Storage, but MIDAS file are not
optimised for readout from such kind of storage. There is any work around on
evolution of midas raw output or, beyond simulated posix fs, to develop midas
python library optimised to stream data from S3 (is not really clear to me if this
There is a memory allocation bug in the mserver.
ALIGN8() was missing when receiving events from the event socket and data buffer
was allocated 4 bytes too short. but only for some received events and only in
very unlucky sequence of received events. result was a rare but obnoxious crash
of fevme frontend in alpha-2 at CERN. (we do not see any crash from this in
alpha-g or anywhere else, the best I can tell).
fixed in commit 4dc06ba47ff7caa5251fd8c48d8533f35799f3a6.
If you use the mserver, please update to this commit or apply following patch in
- int bufsize = sizeof(INT) + event_size;
+ int bufsize = sizeof(INT) + total_size;
I prepared some slides about the new features of the sequencer and post it here so
people can have a quick look at get some inspiration.