Back Midas Rome Roody Rootana
  Midas DAQ System, Page 1 of 40  Not logged in ELOG logo
New entries since:Wed Dec 31 16:00:00 1969
Entry  15 Apr 2019, Konstantin Olchanski, Info, switch of MIDAS to C++ 
For a long time now we have keep the core of midas (odb.c, midas.c, etc) compatible with plain C and by default 
we have built the MIDAS library using the plain C compiler. Over time, we have switched most MIDAS programs 
(mhttpd, mlogger, mdump, odbedit, etc) to C++ (with happy results). (and for a long time now, all of MIDAS 
could be build as C++, even if the default build remained plain C).

The main reason for keeping the core of MIDAS as C has been to allow writing MIDAS frontends in C - for 
example, in environments with no C++ compilers or no C++ runtime (VxWorks) or where C++ had too much 
overhead (small memory machines, etc).

Today, all concerns against using C++ seem to have receded into the past. C++ compilers are now always 
available, even for small embedded systems. C++ overheads are now well understood and one can easily write 
C++ code that is as efficient as C for using limited CPU and memory resources. (While at the same time, today's 
embedded systems tend to have more CPU and RAM than "big" MIDAS DAQ machines had in the past - 1GHz 
CPU, 1GB RAM is pretty typical for embedded ARM).

As examples of small hardware where MIDAS frontends written in C++ worked just fine, consider the T2K ND280 
FGD data collector running on XILINX FPGA with a 300MHz PowerPC and 128 Mbytes of RAM (standard Linux 
kernel) and the GRIFFIN Clock distribution module control running on a Microsemi FPGA with a 300MHz ARM 
CPU (ucLinux without an MMU). More typical Cyclone-5 ARM SoCs with 1GB RAM and 1GHz CPU run standard 
Linux (CentOS7) and can build MIDAS natively (no need for cross-compiling).

With the removal of the requirement to make it possible to write MIDAS frontends in C, we can switch the MIDAS 
default build to C++ and start using C++ features in the MIDAS API (std::string, std::vector, etc).

Next to consider is "which C++ should we use?".

K.O.
    Reply  15 Apr 2019, Konstantin Olchanski, Info, switch of MIDAS to C++, which C++? 
>
> With the removal of the requirement to make it possible to write MIDAS frontends in C, we can switch the MIDAS 
> default build to C++ and start using C++ features in the MIDAS API (std::string, std::vector, etc).
> 

Consider the most basic C++ construct, std::string, and observe how many member functions are annotated "c++11", "c++17", etc:
https://en.cppreference.com/w/cpp/string/basic_string

For MIDAS this means that we cannot target "a" C++ or "the" C++, we have to chose between C++ "before C++11", C++11, C++17 
(plus the incoming c++20).

For example, the ROOT 6 package requires C++11 *and* g++ >= 4.8.

Now consider the platforms we use at TRIUMF:

- Linux RHEL/SL/CentOS6 - gcc 4.4.7, no C++11.
- Linux RHEL/SL/CentOS7 - gcc 4.8.5, full C++11, no C++14, no C++17
- Ubuntu 18.04.2 LTS - gcc 7.3.0, full C++11, full C++14, "experimental" C++17.
- MacOS 10.13 - llvm 10.0.0 (clang-1000.11.45.5), full C++11, full C++14, full C++17.

(see here for GCC C++ support: https://gcc.gnu.org/projects/cxx-status.html)
(see here for LLVM clang c++ support: https://clang.llvm.org/cxx_status.html)

As is easy to see from the std::string reference how C++17 has a large number of very useful new features.

Alas, at TRIUMF we still run MIDAS on many SL6 machines where C++11 and C++17 is not normally available. I estimate another 1-2 
years before all our SL6 machines are upgraded to RHEL/SL/CentOS7 (or Ubuntu LTS).

This means we cannot use C++11 and C++17 in MIDAS yet. We are stuck with pre-C++11 for now.

Remarks:
- there will be trouble right away as both Stefan and myself do MIDAS development on MacOS where full C++17 is available and is 
tempting to use. (as they say, watch this space)
- it is possible to install a newer C++ compiler into RHEL/SL/CentOS 6 and 7 systems, but we are loath to require this (same as we 
are loath to require cmake for building MIDAS) - the "I" in MIDAS means integrated, meaning "does not require installing 100 
additional packages before one can use it".
- the MS Windows situation is unclear, but since one has to install the C++ compiler as an additional package anyway, I do not see 
any problem with requiring C++17 support, with a choice of MS compilers, GCC and LLVM. I doubt we will support anything older 
than Windows 10.

K.O.
       Reply  15 Apr 2019, Konstantin Olchanski, Info, switch of MIDAS to C++, how much C++? 
> >
> > With the removal of the requirement to make it possible to write MIDAS frontends in C, we can switch the MIDAS 
> > default build to C++ and start using C++ features in the MIDAS API (std::string, std::vector, etc).
> > 

C++ is a big animal. Obviously we want to use std::string, std::vector and similar improvements over plain C (we already use "//" for comments).

But in keeping with the Camel's nose fable (https://en.wikipedia.org/wiki/Camel%27s_nose), there are some parts of C++ we definitely do not want to use in MIDAS. Even the C++ FAQ talks 
about "evil features", see https://isocpp.org/wiki/faq/big-picture#use-evil-things-sometimes

Here is my list of things to use and to avoid. Comments on this are very welcome - as everybody's experience with C++ is different (and everybody's experience is very valuable and very 
welcome).

- std::string, std:vector, etc are in. I am already using them in the MIDAS API (midas.h)
- extern "C" is out, everything has to be C++, will remove "extern "C"" from all midas header files.
- exceptions are out, see https://stackoverflow.com/questions/1736146/why-is-exception-handling-bad
- std::thread and std::mutex are in, at least for writing new frontends, but see discussion of "cannot use c++11". (maybe replace ss_mutex_xxx() with out own std::mutex look-alike).
- heavy use of templates and heavy use of argument overloading is out - just by looking at the code, impossible to tell what function will be called
- "auto" is on probation. I need to know if "auto v=f()" is an integer or a double when I write "auto w=v/2" or "auto w=v/2.0". see 
https://softwareengineering.stackexchange.com/questions/180216/does-auto-make-c-code-harder-to-understand
- unreadable gibberish is out (lambdas, etc)
- C-style malloc()/free() is in. C++ new and delete are okey, but "delete[]" confuses me.
- C-style printf() is in. C++ cout and "<<" gunk provide no way to easily format the output for easy reading.

K.O.
          Reply  16 Apr 2019, Pintaudi Giorgio, Info, switch of MIDAS to C++, how much C++? 
Dear Konstantin,

even if I am still quite young and have only limited experience (but not null), I would like to give my two cents. I have reflected a bit about the C++ issue, also because I am developing a 
brand new MIDAS interface for the WAGASCI-T2K experiment, and I feel that the future of MIDAS could influence the future of our DAQ system, too. I'll start from the conclusions: I completely 
agree with you on a practical level, even if I kind of disagree on an "ethical" level.

What you propose in essence is to migrate the MIDAS core from pure C to a version of C with some fancy C++ features. Let's say a kind of C+ with only one plus. Theoretically speaking, even if 
on the surface C and C++ are very similar, they are completely different languages and require different mindsets (and I am sure that everyone is aware of it). This is the reason why even if I 
would have preferred to develop the MIDAS frontend for our experiment in C++, I have chosen to stick to pure C because I feel that MIDAS is still very C-like in its architecture (or from what 
I can see from the documentation). So I wanted to "keep on track" for better internal coherence. What I mean is that, if someone told me to port a C project of mine to C++, I would end up 
rewriting it almost completely, instead of just modifying it (I really don't know how much of the MIDAS core has been written with C++ in mind, so if a large part of it is already C++-like, 
please ignore my comment above).

Anyway, on a practical level, I completely agree with your approach, because I imagine that a complete rewrite of MIDAS is off the table but, at the same time, some new C++ features like 
better string and vector handling are very tempting to use. Moreover, in general, physicists are more familiar with the C syntax than with the C++ one (but thanks to ROOT that is changing). As 
for the use of MIDAS in embedded devices, I have no experience so I refrain from judging. So, in the particular case of MIDAS, what you propose is probably the best and only option.

As far as the C++ standard to adopt, I would say that the C++11 standard is the best fit for the T2K experiment since the official OS for T2K is CentOS7 and, out of the box, it supports C++11 
only. Anyway, I acknowledge that there are many other experiments and requirements. For the records, I do development on Ubuntu 18.04.

Best regards
Giorgio
          Reply  17 Apr 2019, John M O'Donnell, Info, switch of MIDAS to C++, how much C++? 
some semi-random thoughts:

no templates strictly means you can't use std::string, std::vector etc.

printf is in any case part of C++ (#include <cstdio>), but std::ostreams can be faster (for std::cout, endl line causes buffer flushing, whereas "\n" does not flush the buffer but printf
always flushes the buffer), and formatting is possible (though very long winded).  printf does not allow to print things other than simple data, e.g. BANK_HEADER* bh; printf( "%?", *bh);

I've been writing all our DAQ code in C++ for a while now.

> > >
> > > With the removal of the requirement to make it possible to write MIDAS frontends in C, we can switch the MIDAS 
> > > default build to C++ and start using C++ features in the MIDAS API (std::string, std::vector, etc).
> > > 
> 
> C++ is a big animal. Obviously we want to use std::string, std::vector and similar improvements over plain C (we already use "//" for comments).
> 
> But in keeping with the Camel's nose fable (https://en.wikipedia.org/wiki/Camel%27s_nose), there are some parts of C++ we definitely do not want to use in MIDAS. Even the C++ FAQ talks 
> about "evil features", see https://isocpp.org/wiki/faq/big-picture#use-evil-things-sometimes
> 
> Here is my list of things to use and to avoid. Comments on this are very welcome - as everybody's experience with C++ is different (and everybody's experience is very valuable and very 
> welcome).
> 
> - std::string, std:vector, etc are in. I am already using them in the MIDAS API (midas.h)
> - extern "C" is out, everything has to be C++, will remove "extern "C"" from all midas header files.
> - exceptions are out, see https://stackoverflow.com/questions/1736146/why-is-exception-handling-bad
> - std::thread and std::mutex are in, at least for writing new frontends, but see discussion of "cannot use c++11". (maybe replace ss_mutex_xxx() with out own std::mutex look-alike).
> - heavy use of templates and heavy use of argument overloading is out - just by looking at the code, impossible to tell what function will be called
> - "auto" is on probation. I need to know if "auto v=f()" is an integer or a double when I write "auto w=v/2" or "auto w=v/2.0". see 
> https://softwareengineering.stackexchange.com/questions/180216/does-auto-make-c-code-harder-to-understand
> - unreadable gibberish is out (lambdas, etc)
> - C-style malloc()/free() is in. C++ new and delete are okey, but "delete[]" confuses me.
> - C-style printf() is in. C++ cout and "<<" gunk provide no way to easily format the output for easy reading.
> 
> K.O.
             Reply  22 Apr 2019, Pintaudi Giorgio, Info, switch of MIDAS to C++, how much C++? example.cpp
Dear Konstantin and others,
our recent discussion stimulated my curiosity and I wrote a small frontend for the trigger board of our experiment in C++.
The underlying hardware details are not relevant here. I would just like to briefly report and discuss what I found out.

I have written all the frontend files (but the bus driver) in C++11:
  • my_frontend.cpp
  • driver/class/my_class_driver.cpp
  • driver/device/my_device_driver.cpp

All went quite smoothly, but I feel that the overall structure is still very C-like (that may be a good thing or a bad thing depending on the point of view).
As far as I know, the MIDAS frontend mfe.c has still only the C version (I couldn't find any mfe.cxx). This means that all the points of contact between the MIDAS frontend code and the user
frontend code must be C compatible (no C++ features or name mangling). To accomplish this I needed to slightly modify the midas.h header file like this:
@@ -1141,7 +1141,13 @@ typedef struct eqpmnt {
+#ifdef __cplusplus
+extern "C" {
+#endif
 INT device_driver(DEVICE_DRIVER *device_driver, INT cmd, ...);
+#ifdef __cplusplus
+}
+#endif

I also tested the new strcomb1 function and it seems to work OK.

I have attached a source file to show how I implemented the device driver in C++. The code is not meant to be compilable: it is just to show how I implemented it. This is the most C++-like syntax that I could come out with. Feel free to comment it and if you think that it could be improved let me know.

Best Regards
Giorgio
                Reply  23 Apr 2019, Konstantin Olchanski, Info, switch of MIDAS to C++, how much C++? 
> Dear Konstantin and others, our recent discussion stimulated my curiosity and I wrote a small frontend for the trigger board of our 
experiment in C++.

Yay!

> my_frontend.cpp

In MIDAS we are using .cxx, not .cpp, per ROOT coding convention https://root.cern.ch/coding-conventions

> the overall structure is still very C-like

this is object-oriented programming done in C. (actually C++ looks exactly the same if you look behind the curtain)

right now we do not hope to rewrite the slow control class driver framework in C++, but if somebody does it,
we should be happy to add it to midas.

for the mfe.c framework, I have a new C++ class based frontend framework in development (and already in use
in the ALPHA-g experiment at CERN). There is a number of lose ends to polish befire I can add it to midas.
And as usual the last 10% of the work consume 90% of the time.

> the MIDAS frontend mfe.c has still only the C version (I couldn't find any mfe.cxx). 
> This means that all the points of contact between the MIDAS frontend code and the user frontend code must be C compatible
> (no C++ features or name mangling).

this will change with the switch to C++, mfe.c will become mfe.cxx and I shall add the required definitions to mfe.h (or midas.h, TBD)

> To accomplish this I needed to slightly modify the midas.h header file like this:
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
>  INT device_driver(DEVICE_DRIVER *device_driver, INT cmd, ...);

I intend for all "extern "C"" to go away, everything will use the C++ linkage (and name mangling). This will break existing frontends
and I will need to write clear instructions on converting them to the new scheme.

> I also tested the new strcomb1 function and it seems to work OK.

good.

> I have attached a source file to show how I implemented the device driver in C++

Yup, looks familiar, I have a couple of C++ frontends written like this, too.

K.O.
       Reply  11 May 2019, Konstantin Olchanski, Info, switch of MIDAS to C++, which C++? 
> [which c++]
> 
> - Linux RHEL/SL/CentOS6 - gcc 4.4.7, no C++11.
> - Linux RHEL/SL/CentOS7 - gcc 4.8.5, full C++11, no C++14, no C++17
>

The construct I now always use:

class X {
int a = 0; // do not leave data members uninitialized, see "Non-static data member initializers", N2756 and N2628
}

is only available starting from gcc 4.7, see https://gcc.gnu.org/projects/cxx-status.html

Another nail into the coffin of "pre c++11" c++ and el < el7.

Hmm...

K.O.
    Reply  22 May 2019, Konstantin Olchanski, Info, switch of MIDAS to C++ 
> switch MIDAS to C++

switch to C++ will proceed as follows:

- create a new branch off develop (feature/switch_to_cxx)
- remove all extern "C", ifdef c++, etc
- switch Makefile from gcc to g++
- test
- merge into develop
- before merge, tag the last "C" midas
- cut a new release branch (tentatively feature/midas-2019-06)

the last recommended "pre-C++" midas will remain the midas-2019-03 release (where we can retroactively apply bug fixes, as I just did a few minutes ago).

K.O.
       Reply  05 Jun 2019, Konstantin Olchanski, Info, MIDAS switched to C++ 
The last bits of code to switch MIDAS to C++ have been committed, see tag midas-2019-05-cxx.

Since the cmake conversion is still in progress, for now, I recommend using the old "make" build for trying this update.

From the switch to C++, the biggest change is the requirement that frontend programs be build and linked
using the C++ compiler. Since mfe.o and the rest of MIDAS are built with C++, building frontends
with C is no longer possible.

To help with this, I will post a short guide for converting C frontends to C++.

K.O.
          Reply  17 May 2022, Razvan Stefan Gornea, Info, MIDAS switched to C++ 
Hi, I have three naive questions about this:
 - have you posted somewhere this guide about converting C frontends to C++?
 - it was mentioned previously that there will be a 'tag the last "C" midas', which version is it?
 - it means that even a simple example like odb_test.c cannot be compile anymore? Even when using g++?

Something like

g++ -I $HOME/daq/packages/midas/include/ -L $HOME/daq/packages/midas/lib/ odb_test.c -l midas

is expected to fail or is just me glitching? Is it because of thread library differences?

Thanks!


> The last bits of code to switch MIDAS to C++ have been committed, see tag midas-2019-05-cxx.
> 
> Since the cmake conversion is still in progress, for now, I recommend using the old "make" build for trying this update.
> 
> From the switch to C++, the biggest change is the requirement that frontend programs be build and linked
> using the C++ compiler. Since mfe.o and the rest of MIDAS are built with C++, building frontends
> with C is no longer possible.
> 
> To help with this, I will post a short guide for converting C frontends to C++.
> 
> K.O.
             Reply  17 May 2022, Konstantin Olchanski, Info, MIDAS switched to C++ 
> Hi, I have three naive questions about this:

all good questions, ask more of them.

>  - have you posted somewhere this guide about converting C frontends to C++?

yes, in this elog here I posted a guide for converting C mfe.c frontends to C++ and
a guide for converting mfe.c frontend to C++ TMFE frontend. please use the "find" function,
if you cannot find them, let me know, I will look for it for you.

>  - it was mentioned previously that there will be a 'tag the last "C" midas', which version is it?

correct. please run "git tag", tags before "midas-2019-05-cxx"is "C", after is "C++".

>  - it means that even a simple example like odb_test.c cannot be compile anymore? Even when using g++?
> g++ -I $HOME/daq/packages/midas/include/ -L $HOME/daq/packages/midas/lib/ odb_test.c -l midas
> is expected to fail or is just me glitching? Is it because of thread library differences?

yes, it is expected to fail, you have spaces after "-I", "-L" and "-l", incorrect g++ command syntax. after
correcting this, it may or may not work depending on what you have inside odb_test.c. I would be happy
to help you debug this, but please start a separate thread instead of necroposting into the C++ announcements.

K.O.
                Reply  17 May 2022, Ben Smith, Info, MIDAS switched to C++ 
>  - have you posted somewhere this guide about converting C frontends to C++?

There's documentation in the wiki at:
https://daq00.triumf.ca/MidasWiki/index.php/Changelog#2019-06

It includes a step-by-step guide of how to upgrade, what changes need to be made to frontends, and common issues that people had.
Entry  08 May 2022, Stefan Ritt, Info, RO_STOPPED with triggered events 
We had issues in one of our experiment that people used RO_STOPPED in the 
equipment list together with triggered events (EQ_USER). If events are sent when 
a run is stopped, this leads to many unexpected results, so I added a check in 
the mfe.cxx code which prevents RO_STOPPED (or RO_ALWAYS which includes 
RO_STOPPED) together with EQ_TRIGGERED, EQ_INTERRUPT, EQ_MULTITHREAD and EQ_USER 
type of events.

I got now complaints that some old front-end are not running any more since they 
do use RO_ALWAYS together with triggered events. Can the author of these frontend 
please tell me the rationale why this is needed, then I can maybe add a better 
fix for that.

Stefan
    Reply  08 May 2022, Konstantin Olchanski, Info, RO_STOPPED with triggered events 
> some old front-end are not running any more since they do use RO_ALWAYS together with 
triggered events.

I confirm, if you have mfe.c frontends that have RO_ALWAYS, after you update MIDAS, 
some of these frontends will fail to start.
https://bitbucket.org/tmidas/midas/commits/1961af0d657e4f76ab9db17f9b70c0c492172b6d

tmfe c++ frontends do not have this restriction but by default only read data when run 
is active (per-equipment fEqConfReadOnlyWhenRunning default is true).

K.O.
       Reply  16 May 2022, Konstantin Olchanski, Info, RO_STOPPED with triggered events 
> > some old front-end are not running any more since they do use RO_ALWAYS together with 
> triggered events.
> 
> I confirm, if you have mfe.c frontends that have RO_ALWAYS, after you update MIDAS, 
> some of these frontends will fail to start.
> https://bitbucket.org/tmidas/midas/commits/1961af0d657e4f76ab9db17f9b70c0c492172b6d
> 
> tmfe c++ frontends do not have this restriction but by default only read data when run 
> is active (per-equipment fEqConfReadOnlyWhenRunning default is true).

As of commit 
https://bitbucket.org/tmidas/midas/commits/28d9c96bd6d4f65346ebcd6a04492ea764c90823 mfe.c 
frontends will no longer fail to start. an error will still be issued "Equipment \"%s\" 
contains RO_STOPPED or RO_ALWAYS. This can lead to undesired side-effect and should be 
removed."

BTW 1:

Some of our old frontends use EQ_MULTITHREAD to implement multithreaded periodic equipments. 
They do not generate any events when there is no run (some of them do not generate any 
events at all). Now they will start printing this error message, for no reason. (no we will 
not be rewriting them justy to get rid of this message. life is too short).

BTW 2:

the c++ tmfe frontend does not have any protections against these "undersired side-effects".

What are these undesired side effects and should we add protection against them?

K.O.
          Reply  17 May 2022, Stefan Ritt, Info, RO_STOPPED with triggered events 
> > > some old front-end are not running any more since they do use RO_ALWAYS together with 
> > triggered events.
> > 
> > I confirm, if you have mfe.c frontends that have RO_ALWAYS, after you update MIDAS, 
> > some of these frontends will fail to start.
> > https://bitbucket.org/tmidas/midas/commits/1961af0d657e4f76ab9db17f9b70c0c492172b6d
> > 
> > tmfe c++ frontends do not have this restriction but by default only read data when run 
> > is active (per-equipment fEqConfReadOnlyWhenRunning default is true).
> 
> As of commit 
> https://bitbucket.org/tmidas/midas/commits/28d9c96bd6d4f65346ebcd6a04492ea764c90823 mfe.c 
> frontends will no longer fail to start. an error will still be issued "Equipment \"%s\" 
> contains RO_STOPPED or RO_ALWAYS. This can lead to undesired side-effect and should be 
> removed."
> 
> BTW 1:
> 
> Some of our old frontends use EQ_MULTITHREAD to implement multithreaded periodic equipments. 
> They do not generate any events when there is no run (some of them do not generate any 
> events at all). Now they will start printing this error message, for no reason. (no we will 
> not be rewriting them justy to get rid of this message. life is too short).
> 
> BTW 2:
> 
> the c++ tmfe frontend does not have any protections against these "undersired side-effects".
> 
> What are these undesired side effects and should we add protection against them?
> 
> K.O.

The undesired side-effects are the following: The logger tries to collect all events at the end of 
the run by emptying the SYSTEM buffer. If events keep coming after the run is stopped, this loop in 
the logger might be an endless loop, crashing the whole experiment in the end. 

Another issue (and actually the reason for this change) is the funciton receive_trigger_event() in 
mfe.cxx which will get confused if events are still coming in after a run has been stopped and 
actually enters an infinite loop.

Combining EQ_MULTITHREAD with EQ_PERIODIC or EQ_SLOW is a wrong parameter combination as written in 
the documentation. If one wants to have multi-threaded slow control events, one has to use the 
DF_MULTITHREAD flag in the DEVICE_DRIVER structure.

Having triggered events being sent to the system after a run has been stopped I would consider 
simply wrong. Why should we ever use a run start/stop if events are always flowing? Adding 
protections in all places for this case is certainly much more work than just changing one flag for 
frontends which produce this error message now for a wrong parameter combination.
Entry  24 Apr 2022, Konstantin Olchanski, Bug Fix, mserver buffer overrun and crash 
There is a memory allocation bug in the mserver.

ALIGN8() was missing when receiving events from the event socket and data buffer 
was allocated 4 bytes too short. but only for some received events and only in 
very unlucky sequence of received events. result was a rare but obnoxious crash 
of fevme frontend in alpha-2 at CERN. (we do not see any crash from this in 
alpha-g or anywhere else, the best I can tell).

fixed in commit 4dc06ba47ff7caa5251fd8c48d8533f35799f3a6.

If you use the mserver, please update to this commit or apply following patch in 
midas.cxx:

-   int bufsize = sizeof(INT) + event_size;
+   int bufsize = sizeof(INT) + total_size;

K.O.
    Reply  16 May 2022, Konstantin Olchanski, Bug Fix, mserver buffer overrun and crash 
> There is a memory allocation bug in the mserver.

Fix for this problem introduced a new problem, an infinite loop in bm_flush_cache, 
bitbucket bugs https://bitbucket.org/tmidas/midas/issues/339/infinite-loop-in-
mserver-due-to-mfes and https://bitbucket.org/tmidas/midas/issues/331/stuck-
semaphore-of-system-buffer

This is now fixed and the buffer write cache logic and size was rejigged
according to calculations in https://daq00.triumf.ca/elog-midas/Midas/2401

Event buffer write cache (as set via ODB Equipment/Common and via 
bm_set_cache_size()) now take 2 possible values:
0 - write cache is disabled and
MIN_WRITE_CACHE_SIZE - (10 Mbytes) minimum permitted cache size
bigger cache size values are permitted, up to buffer_size/3, but probably not useful 
if my calculations are right.
smaller cache size values are generally not useful, if my calculations are right.

mfe.c and tmfe c++ frontends updated to request the new write cache size by default.

if events are getting stuck in the write cache for too long, instead of reducing the 
cache size, one should increase frequency of bm_flush_cache() calls (1/sec by 
default).

commit 373bcc3ab7f83c3c7bf6c051c237de043a982502

K.O.
Entry  13 May 2022, Konstantin Olchanski, Info, analysis of corner cases in event buffer write cache 
introduction:

to remember, bm_send_event() writes an event to the write cache, bm_flush_cache() 
writes the contents of the write cache into the shared memory event buffer, buffer 
free space is consumed. in the usual case, mlogger is reading events from the shared 
memory event buffer, buffer free space is released. there is also a read cache, not 
part of this discussion.

the purpose of the write cache is to reduce contention for the shared memory 
semaphore. in the case of large number of small events, semaphore is locked per 
cache-flush, instead of per-event. correct tuning of write cache and event size can 
reduce lock rate from >100 kHz to around 100 Hz or lower.

analysis:

for correct operation of bm_send_event() under all conditions we need to consider 
all corner cases:

1) no write cache: (cache size set to 0)

- event_size > buffer_size -> reject the event (obviously)
- event_size > 0.5 * buffer_size -> only 1 event fits into the buffer, next write 
will stall until mlogger reads the previous event (sequential operation, bad)
- event_size < 0.3 * buffer_size -> at least 2 events fit into the buffer (good)

decision: limit event size to 0.5 to 0.3 * buffer_size (current limit is 0.5 * 
buffer_size, I think).

consequence: buffer size limit is 2 Gbytes (32-bit byte offsets, code is only 31-
bit-clean), max event size is between 1 Gbytes and 0.6 Gbytes.

2) writing to write cache:

- event_size > cache_size -> flush cache, write event to directly to buffer
- event_size > 0.5 * cache_size -> inefficient use of cache: write to cache, next 
event does not fit, flush to buffer, repeat. no gain in semaphore locking (bad), one 
additional memcpy() (event to cache and cache to buffer) (bad)
- event_size < 0.3 * cache_size -> multiple events fit into cache, but probably no 
gain in semaphore locking

decision: events that are bigger than 0.3 to 0.1 * cache_size should not go through 
the cache. (flush cache, write directly to buffer).

3) flush write cache to buffer:

- cache_size > buffer_size -> cannot flush in 1 operation, must have a loop and 
flush the cache in pieces
- cache_size between 0.5 and 1.0 * buffer_size -> can flush in 1 operation, but must 
wait for mlogger to fully empty the buffer (sequential operation, bad)
- cache size < 0.3 * buffer_size -> can flush in 1 operation, at least 2 "flushes" 
fit inside the buffer (good)

decision: limit write cache size to 0.3 * buffer_size. (current limit is 
0.25*buffer_size).

consequences:

- write cache size limit is 0.3..0.25 * 2GB = 0.6..0.5 Gbytes
- cached event size limit is 0.3..0.1 * 0.5 GBytes = 150..50 Mbytes
- minimum number of cached events: 3 to 10
- semaphore locks reduced: 3 to 10 locks become 1 lock (all events cached),
4 to 11 locks become 2 locks (big event causes cache flush).

4) complications:

- there is a periodic 1/second bm_flush_cache() that flushes the cache early and 
reduces it's efficiency (but needed to avoid having data stuck in cache for long 
time)
- if multiple frontends use large write cache (~ 0.3..0.5 * buffer_size), again, 
sequential operation can happen (bad)
- write cache is per-frontend, not per-equipment. if different equipments request 
different cache sizes, mfe.c and tmfe c++ frontends complain about this, but the 
user has to sort it out.

K.O.
    Reply  16 May 2022, Konstantin Olchanski, Info, analysis of corner cases in event buffer write cache 
> for correct operation of bm_send_event() under all conditions we need to ...

to continue computation from last message:

default SYSTEM buffer size: 32 MiBytes
default max event size: 4 MiBytes

hard max buffer size: 2 Gbytes (code is only 31-bit-clean)
hard max event size: 2 Gbytes (code is only 31-bit-clean)

max event size currently: 32 Mbytes (same as buffer size)
max event size per (1) in previous post: 32*0.5..0.3 = 16..9 MiBytes

number of default-max-size events buffered: 32/4 = 8.
number of per (1) max-size events buffered: 2 or 3
number of current max-size events buffered: 0 (bad, frontend is serialized with mlogger)

default write cache size: 100 kbytes

max write cache size currently: buffer size / 4 = 32/4 = 8 MiBytes
max write cache size per (3) in previous post: buffer_size / 3 = 10 Mbytes
hard max write cache size per (3): 2 Gbytes/3 = 600 Mbytes

max size of cached events:

current: 100 kbytes (size as cache size)
per (2) in previous post: 0.1..0.3 * cache size = 10..30 kbytes
per (2), 1 Mbyte cahe: 0.1..0.3 * cache size = 100..300 kbytes
hard max size: 0.1..0.3 * hard_max_cache_size = 0.1..0.3 * 600 = 60..180 Mbytes.

max data rate before event buffer semaphore locking rate exceeds 100 Hz:

1 kbyte events, no write cache: 100 kbytes/sec
1 kbyte events, 100 kbyte cache: 100 events cached, cache flush rate 100 Hz -> 100*1kbyte*100Hz -> 10 Mbytes/sec
1 kbyte events, 1 Mbyte cache: 1000 events cached, cache flush rate 100 Hz -> 100 Mbytes/sec (1gige ethernet)
N kbyte events, 1 Mbyte cache: same thing (data rate is limited by cache flush rate 100 Hz)
100 kbyte events, 1 Mbyte cache, not cached per (2): 100kbyte*100Hz = 10 Mbytes/sec
300 kbyte events, 1 Mbyte cache, not cached per (2): 300kbyte*100Hz = 30 Mbytes/sec
N00 kbyte events: N0 Mbytes/sec (500->50, etc)
1 kbyte events, 10 Mbyte cache: 10000 events cached, cache flush rate 100 Hz -> 1000 Mbytes/sec (10gige ethernet)
N kbyte events, 10 Mbyte cache: same thing (data rate is limited by cache flush rate 100 Hz)
1000 kbyte events, 10 Mbyte cache, not cached per (2): 1000kbyte*100Hz = 100 Mbytes/sec
3000 kbyte events, 10 Mbyte cache, not cached per (2): 3000kbyte*100Hz = 300 Mbytes/sec
N000 kbyte events: N00 Mbytes/sec (4000->400, 5000->500, etc)
default max event size: 4 Mibytes*100Hz = 400 Mbytes/sec (exceeds 1gige ethernet)
hard max event size (divided by 10 to buffer 10 events): 200 Mbytes*100Hz -> 20 Gbytes/sec

max event rate before event buffer semaphore locking rate exceeds 100 Hz:

1 kbyte events, no write cache: 100 Hz (obviously)
1 kbyte events, 100 kbyte cache: 100 events cached, cache flush rate 100 Hz -> 10 kHz
1 kbyte events, 1 Mbyte cache: 1000 events cached, cache flush rate 100 Hz -> 100 kHz
N kbyte events, 1 Mbyte cache: 1000/N events cached, cache flush rate 100 Hz -> 100/N kHz
1 kbyte events, 10 Mbyte cache: 10000 events cached, cache flush rate 100 Hz -> 1000 kHz
N kbyte events, 10 Mbyte cache: 10000/N events cached, cache flush rate 100 Hz -> 1000/N kHz
100 kbyte events, not cached per (2): 100 Hz (obviously)
300 kbyte events, not cached per (2): 100 Hz (obviously)
default max event size: 100 Hz (obviously)

K.O.
       Reply  16 May 2022, Konstantin Olchanski, Info, analysis of corner cases in event buffer write cache 
> > for correct operation of bm_send_event() under all conditions we need to ...
> to continue computation from last message:

if I got my numbers right, for present-day hardware (1gige/10gige data rates, 100 Hz max locking rate), we should 
increase the default buffer write cache size from 100 kbytes to 10 Mbytes.

this cache size will permit processing of the full mix of small/big events
at the full mix of event rates without exceeding the 100 Hz semaphore locking rate.

with the 10 Mbyte write cache, default event buffer size should be 30-40 Mbytes (current size is 33 Mbytes, so does 
not need to change).

this computation is for 1 writer (1 reader, mlogger). it is a typical case for our experiments.

multiple writers can run into contention for event buffer space.

consider 10 writers want to flush their 10 Mbyte write cache all at the same time:

if buffer size is the default 33 Mbytes, the first 3 writers will have successful write cache flush,
but the other 7 will stall, there is no space in the buffer, we have to wait for mlogger to free
some (mlogger writing X Mbytes/sec will take Y milliseconds to liberate 10 Mbytes of space for the 4th writer
to successfully flush, writers 5..10 are still stalled).

but in a system with 10 writers writing at 10 Mbytes/sec (1 Hz default cache flush rate) is 100 Mbytes/sec
will likely have SYSTEM buffer size at least 200-300 Mbytes (to buffer 1-2 seconds of data against
any delays in writing to disk/network storage).

so there should be no problem in practice.

K.O.
Entry  06 May 2022, Stefan Ritt, Info, Increased timeout for program shut down 
We had the problem in our lab that a frontend took about 6 seconds to gracefully 
shut down, mainly it needed to park some motors. I found that the shutdown command 
had a hard-coded timeout of 5 seconds, after which the frontend gets killed, and 
cannot finish the park operation. I change the code so that the client timeout 
stored in the ODB is taken instead of the hard-coded 5 seconds. This allows each 
client to fine-tune its timeout, to allow graceful shutdown, but also not let the 
user wait too long if the client gets stuck and needs a hard kill.

The default timeout for mfe.cxx based frontends has been changed to 10 seconds 
now, but in the frontend_init function this can be changed by the user code 
easily.

I hope this char does not trigger any bad side effects, but if it does, please 
report here.

Stefan
Entry  04 May 2022, Konstantin Olchanski, Bug Fix, mysql history update 
the code for writing midas history to mysql has been updated to work against 
MYSQL 8.0.23 (CERN ALPHA-2):

- as ever mysql reports inconsistent data types (I create column with type 
"integer", mysql reports it has type "int" and so forth), the special kludge to 
take care of this had to be tweaked.

- this caused some columns to be marked "inactive" and the code to "reactivate" 
them was missing (fixed)

- binary history event data size was computed incorrectly for events with 
"inactive" columns (fixed) and caused assert() failure and mlogger crash.

- mysql read of column definitions for history event "system" (as in 
/history/links/system) bombed because of incorrect quoting (worked before, why? 
why bombed now?). this caused duplicate columns to be created in mysql table 
"system" and mlogger bomb-out with complaint about "duplicated columns" 
(actually the error message was missing, so it was a silent bomb-out). quoting 
fixed, missing error message fixed, but cleanup of duplicate columns has to be 
done by hand. in case of alpha-2 the fix was to remove the unused 
/history/links/system).

if you are using mysql history please update or patch src/history_schema.cxx.

commit 9d17d2fef233cf457121ca7c2a283c4c76ed33bc

K.O.
Entry  30 Apr 2022, Konstantin Olchanski, Info, added web pages for "show odb clients" and "show open records" 
for a long time, midas web pages have been missing the equivalent of odbedit 
"scl" and "sor" to display current odb clients and current odb open records.

this is now added as buttons "show open records" and "show odb clients" in the 
odb editor page.

as in odbedit, "sor" shows open records under the current subtree, i.e. if you 
are looking at /equipment, you will not see open records for /experiment. to see 
all open records, go to "/".

commit b1ab7e67ecf785744fff092708d8389f222b14a4

K.O.
    Reply  04 May 2022, Stefan Ritt, Info, added web pages for "show odb clients" and "show open records" 
Concerning the "scl" page, we are currently having a discussion. At the moment, one can 
see midas clients in three different places:

1) the main status page at the bottom, only names and hosts are there

2) the programs page, where one can also start/stop program

3) now the new page "Show ODB clients" in the ODB editor page, which shows also the 
alive status, PID and timeout

I'm thinking that three locations are two too much, so we are considering to merge the 
tree pages into one. That would mean that 1) goes away, and the "Programs" page will 
show more information. We have some rare cases that programs are removed from 
/System/Clients in the ODB but still attached to the ODB. For those "zombies" we would 
add a "hard kill" function.

I would like to hear feedback from the midas community before we proceed with the 
plans. Anybody desperately in need of the programs shown on the status page?

Best,
Stefan
Entry  01 May 2022, Konstantin Olchanski, Info, added web page for "mdump" 
added JSON RPC for bm_receive_event() and added a web page for "mdump".

the event dump is a hex dump for now.

if somebody can contribute a javascript decoder for midas bank format, it would be greatly appreciated.

otherwise, I will eventually write my own decoder library patterned on midasio.h and midasio.cxx.

as of commit 5882d55d1f5bbbdb0d9238ada639e63ac27d8825
K.O.
    Reply  01 May 2022, Konstantin Olchanski, Info, added web page for "mdump" 
> added JSON RPC for bm_receive_event()

there is a number of problems with implementing bm_receive_event() as a RPC:

1) mhttpd has only event buffer 1 read pointer for all javascript connections, if two browser tabs are 
running mdump, they will "steal" events from each other.
2) javascript connections are state-less and we cannot specify per-connection event_id and trigger_mask 
filters to bm_receive_event(). our bm_request_event() has to be for all event_id and all trigger_mask.
3) for same reason, we cannot have some requests to be GET_ALL, some to be GET_RECENT and some to be 
GET_OLD (if GET_OLD is ever implemented).

Problem (1) is hard to fix. Only solution I can see is to have mhttpd have it's own event buffer that can 
somehow track which events have been sent to which javascript connection.

The same scheme allows implementing GET_ALL and per-connection event_id and trigger_mask filters.

The difficulty is in detecting javascript connections that are no longer active and it's event request and 
events we have buffered for it can be deleted. Unlike proper rpc clients, javascript browser tabs can be 
closed without warning and without opportunity to tell rpc server that they are closed, gone.

K.O.
    Reply  01 May 2022, Konstantin Olchanski, Info, added web page for "mdump" 
> added a web page for "mdump".

missing functions:
- get a list of existing event buffers (should read event buffer names from /Experiment/Buffer sizes)
- selector box to select event buffer
- button for "get next" and "get new" (should call bm_skip_event() before bm_receive_event())
- entry fields for event_id and trigger_mask event filter
- check box for "keep getting new data" and entry field for update frequency
- (eventually) entry field for bank name filter

K.O.
       Reply  02 May 2022, Stefan Ritt, Info, added web page for "mdump" 
Here are some of my thoughts:

- I volunteer to write the JavaScript midas bank decoder. Just a couple of pure javascript functions, no 
midasio.cxx library needed.

- If different javascript connections "steal" events from each other, I would not be concerned. Actually I 
would rather like that all connections see the SAME event. So mhttpd keeps one event, serves it to all 
links, so displays are consistent. If a browser wants to see the "next" event, it send the old serial 
number and days "please send next event AFTER serial number". If the serial number is larger than the 
event in the buffer, mhttpd fetches a new event and puts it into its buffer.

- Since javascript connections are connectionless, I would rather pass event_id and trigger_mask with each 
request. Then mhttpd can retrieve events until event_id and trigger_mask match, then serve that event. 
Since reading events from a midas buffer is fast (many 10'000s of events per second), the won't be much of 
a delay.

- GET_ALL does not make sense for browsers, you don't want to slow down any frontend. If someone wants to 
do histogramming in the browser, then GET_SOME (which is kind of GET_OLD) would make sense, but most of 
the cases we have some single event display, and there a GET_RECENT is most appropriate.
Entry  30 Apr 2022, Giovanni Mazzitelli, Forum, S3 Object Storage 
Dear all,
We are storing raw MIDAS files to S3 Object Storage, but MIDAS file are not 
optimised for readout from such kind of storage. There is any work around on 
evolution of midas raw output or, beyond simulated posix fs,  to develop midas 
python library optimised to stream data from S3 (is not really clear to me if this 
is possible).
    Reply  30 Apr 2022, Konstantin Olchanski, Forum, S3 Object Storage 
> We are storing raw MIDAS files to S3 Object Storage, but MIDAS file are not 
> optimised for readout from such kind of storage. There is any work around on 
> evolution of midas raw output or, beyond simulated posix fs,  to develop midas 
> python library optimised to stream data from S3 (is not really clear to me if this 
> is possible).

We have plans for adding S3 object storage support to lazylogger, but have not gotten 
around to it yet.

We do not plan to add this in mlogger. mlogger works well for writing data to locally-
attached storage (local ext4, XFS, ZFS) but always runs into problems with timeouts and 
delays when writing to anything network-attached (even writing to NFS).

I envision that each midas raw data file (mid.gz or mid.lz4 or mid.bz2) will
be stored as an S3 object and there will be some kind of directory object
to map object ids to run and subrun numbers.

Choice of best file size is open, normally we use subruns to limit file size to 1-2 
Gbytes. If cloud storage prefers some other object size, we can easily to up to 10 
Gbytes and down to "a few megabytes" (ODB dumps will have to be turned off for this).

Other than that, in your view, what else is needed to optimize midas files for storage 
in the Amazon S3 could?

P.S. For reading files from the cloud, code needs to be written and added to 
midasio/midasio.cxx, for example, see the code that is already there for reading ssh-
attached files and dcache/dccp-attached files. (CERN EOS files can be read directly 
from POSIX mount point /eos).

K.O.
       Reply  01 May 2022, Giovanni Mazzitelli, Forum, S3 Object Storage 
> > We are storing raw MIDAS files to S3 Object Storage, but MIDAS file are not 
> > optimised for readout from such kind of storage. There is any work around on 
> > evolution of midas raw output or, beyond simulated posix fs,  to develop midas 
> > python library optimised to stream data from S3 (is not really clear to me if this 
> > is possible).
> 
> We have plans for adding S3 object storage support to lazylogger, but have not gotten 
> around to it yet.
> 
> We do not plan to add this in mlogger. mlogger works well for writing data to locally-
> attached storage (local ext4, XFS, ZFS) but always runs into problems with timeouts and 
> delays when writing to anything network-attached (even writing to NFS).
> 
> I envision that each midas raw data file (mid.gz or mid.lz4 or mid.bz2) will
> be stored as an S3 object and there will be some kind of directory object
> to map object ids to run and subrun numbers.
> 
> Choice of best file size is open, normally we use subruns to limit file size to 1-2 
> Gbytes. If cloud storage prefers some other object size, we can easily to up to 10 
> Gbytes and down to "a few megabytes" (ODB dumps will have to be turned off for this).
> 
> Other than that, in your view, what else is needed to optimize midas files for storage 
> in the Amazon S3 could?
> 
> P.S. For reading files from the cloud, code needs to be written and added to 
> midasio/midasio.cxx, for example, see the code that is already there for reading ssh-
> attached files and dcache/dccp-attached files. (CERN EOS files can be read directly 
> from POSIX mount point /eos).
> 
> K.O.

thanks, 
actually a I made a small work around with python boto3 library with file of any size (with 
the obviously limitation of opportunity and time to wait) eg:

key = 'TMP/run00060.mid.gz'

aws_session = creds.assumed_session("infncloud-iam")
s3 = aws_session.client('s3', endpoint_url="https://minio.cloud.infn.it/", 
                        config=boto3.session.Config(signature_version='s3v4'),verify=True)

s3_obj = s3.get_object(Bucket='cygno-data',Key=key)
buf = BytesIO(s3_obj["Body"]._raw_stream.data)

for event in MidasSream(gzip.GzipFile(fileobj=buf)):
    if event.header.is_midas_internal_event():
        print("Saw a special event")
        continue

    bank_names = ", ".join(b.name for b in event.banks.values())
    print("Event # %s of type ID %s contains banks %s" % (event.header.serial_number, 
event.header.event_id, bank_names))
    ....


where in MidasSream I just bypass the open, and the code work, but obviously in this way I 
need to have all the buffer in memory and it take time get all the buffer. I was interested to 
understand if some one have already develop the stream event by event (better in python but 
not mandatory). I'll look to the code you underline.
Thanks, G. 
 
Entry  16 Mar 2022, Stefan Ritt, Info, New midas sequencer version 
A new version of the midas sequencer has been developed and now available in the 
develop/seq_eval branch. Many thanks to Lewis Van Winkle and his TinyExpr library 
(https://codeplea.com/tinyexpr), which has now been integrated into the sequencer 
and allow arbitrary Math expressions. Here is a complete list of new features:


* Math is now possible in all expressions, such as "x = $i*3 + sin($y*pi)^2", or 
in "ODBSET /Path/value[$i*2+1], 10"


* "SET <var>,<value>" can be written as "<var>=<value>", but the old syntax is 
still possible.


* There are new functions ODBCREATE and ODBDLETE to create and delete ODB keys, 
including arrays


* Variable arrays are now possible, like "a[5] = 0" and "MESSAGE $a[5]"


If the branch works for us in the next days and I don't get complaints from 
others, I will merge the branch into develop next week.

Stefan
    Reply  22 Mar 2022, Stefan Ritt, Info, New midas sequencer version 
After several days of testing in various experiments, the new sequencer has
been merged into the develop branch. One more feature was added. The path to
the ODB can now contain variables which are substituted with their values.
Instead writing

ODBSET /Equipment/XYZ/Setting/1/Switch, 1
ODBSET /Equipment/XYZ/Setting/2/Switch, 1
ODBSET /Equipment/XYZ/Setting/3/Switch, 1

one can now write

LOOP i, 3
   ODBSET /Equipment/XYZ/Setting/$i/Switch, 1
ENDLOOP

Of course it is not possible for me to test any possible script. So if you 
have issues with the new sequencer, please don't hesitate to report them 
back to me.

Best,
Stefan
       Reply  15 Apr 2022, Stefan Ritt, Info, New midas sequencer version sequencer.pdf
I prepared some slides about the new features of the sequencer and post it here so 
people can have a quick look at get some inspiration.

Stefan
Entry  05 Apr 2013, Konstantin Olchanski, Info, ODB JSON support 
odbedit can now save ODB in JSON-formatted files. (JSON is a popular data encoding standard associated 
with Javascript). The intent is to eventually use the ODB JSON encoder in mhttpd to simplify passing of 
ODB data to custom web pages. In mhttpd I also intend to support the JSON-P variation of JSON (via the 
jQuery "callback=?" notation).

JSON encoding implementation follows specifications at:
http://json.org/
http://www.json-p.org/
http://api.jquery.com/jQuery.getJSON/  (seek to JSONP)

The result passes validation by:
http://jsonlint.com/

Added functions:
   INT EXPRT db_save_json(HNDLE hDB, HNDLE hKey, const char *file_name);
   INT EXPRT db_copy_json(HNDLE hDB, HNDLE hKey, char **buffer, int *buffer_size, int *buffer_end, int 
save_keys, int follow_links);

For example of using this code, see odbedit.c and odb.c::db_save_json().

Example json file:

Notes:
1) hex numbers are quoted "0x1234" - JSON does not permit "hex numbers", but Javascript will 
automatically convert strings containing hex numbers into proper integers.
2) "double" is encoded with full 15 digit precision, "float" with full 7 digit precision. If floating point values 
are actually integers, they are encoded as integers (10.0 -> "10" if (value == (int)value)).
3) in this example I deleted all the "name/key" entries except for "stringvalue" and "sbyte2". I use the 
"/key" notation for ODB KEY data because the "/" character cannot appear inside valid ODB entry names. 
Normally, depending on the setting of "save_keys" argument, KEY data is present or absent for all entries.

ladd03:midas$ odbedit
[local:testexpt:S]/>cd /test
[local:testexpt:S]/test>save test.js
[local:testexpt:S]/test>exit
ladd03:midas$ more test.js
# MIDAS ODB JSON
# FILE test.js
# PATH /test
{
  "test" : {
    "intarr" : [ 15, 0, 0, 3, 0, 0, 0, 0, 0, 9 ],
    "dblvalue" : 2.2199999999999999e+01,
    "fltvalue" : 1.1100000e+01,
    "dwordvalue" : "0x0000007d",
    "wordvalue" : "0x0141",
    "boolvalue" : true,
    "stringvalue" : [ "aaa123bbb", "", "", "", "", "", "", "", "", "" ],
    "stringvalue/key" : {
      "type" : 12,
      "num_values" : 10,
      "item_size" : 1024,
      "last_written" : 1288592982
    },
    "byte1" : 10,
    "byte2" : 241,
    "char1" : "1",
    "char2" : "-",
    "sbyte1" : 10,
    "sbyte2" : -15,
    "sbyte2/key" : {
      "type" : 2,
      "last_written" : 1365101364
    }
  }
}

svn rev 5356
K.O.
    Reply  10 May 2013, Konstantin Olchanski, Info, mhttpd JSON support 
> odbedit can now save ODB in JSON-formatted files.
> Added functions:
>    INT EXPRT db_save_json(HNDLE hDB, HNDLE hKey, const char *file_name);
>    INT EXPRT db_copy_json(HNDLE hDB, HNDLE hKey, char **buffer, int *buffer_size, int *buffer_end, int  save_keys, int follow_links);
> 

Added JSON encoding format to Javascript ODBCopy() ("jcopy"). Use format="json", Javascript example updated with an example example.

Also updated db_copy_json():
- always return NUL-terminated string
- "save_keys" values: 0 - do not save any KEY data, 1 - save all KEY data, 2 - save only KEY.last_written

odb.c, mhttpd.cxx, example.html
svn rev 5362
K.O.
       Reply  17 May 2013, Konstantin Olchanski, Info, mhttpd JSON-P support 
> 
> Added JSON encoding format to Javascript ODBCopy(path,format) ("jcopy"). Use format="json", Javascript example updated with an example example.
> 

More ODBCopy() expansion: format="json-p" returns data suitable for JSON-P ("script tag") messaging.

Also implemented multiple-paths for "jcopy" (similar to "jget"/ODBMGet()). An example ODBMCopy(paths,callback,format) is present in example.html (will move to mhttpd.js).

Added JSON encoding options:
- format="json-nokeys" will omit all KEY information except for "last_written"
- "json-nokeys-nolastwritten" will also omit "last_written"
- "json-nofollowlinks" will return ODB symlink KEYs instead of following them (ODBGet/ODBMGet always follows symlinks)
- "json-p" adds JSON-P encapsulation
All these JSON format options can be used at the same time, i.e. format="json-p-nofollowlinks"

To see how it all works, please look at examples/javascript1/example.html.

The new code seems to be functional enough, but it is still work in progress and there are a few problems:
- ODBMCopy() using the "xml" format returns gibberish (the MIDAS XML encoder has to be told to omit the <?xml> header)
- example.html does not actually parse any of the XML data, so we do not know if XML encoding is okey
- JSON encoding has an extra layer of objects (variables.Variables.foo instead of variables.foo)
- ODBRpc() with JSON/JSON-P encoding not done yet.

mhttpd.cxx, example.html
svn rev 5364
K.O.
          Reply  31 May 2013, Konstantin Olchanski, Info, mhttpd JSON-P support 
> To see how it all works, please look at examples/javascript1/example.html.
> 
> - JSON encoding has an extra layer of objects (variables.Variables.foo instead of variables.foo)
>

This is now fixed. See updated example.html. Current encoding looks like this:

{
  "System" : {
    "Clients" : {
      "24885" : {
        "Name/key" : { "type" : 12, "item_size" : 32, "last_written" : 1370024816 },
        "Name" : "ODBEdit",
        "Host/key" : { "type" : 12, "item_size" : 256, "last_written" : 1370024816 },
        "Host" : "ladd03.triumf.ca",
        "Hardware type/key" : { "type" : 7, "last_written" : 1370024816 },
        "Hardware type" : 44,
        "Server Port/key" : { "type" : 7, "last_written" : 1370024816 },
        "Server Port" : 52539
      }
    },
    "Tmp" : {
...

odb.c, example.html
svn rev 5368
K.O.
    Reply  27 Sep 2013, Konstantin Olchanski, Info, ODB JSON support 
> odbedit can now save ODB in JSON-formatted files.
> 
> JSON encoding implementation follows specifications at:
> http://json.org/
> 
> The result passes validation by:
> http://jsonlint.com/
> 

A bug was reported in my JSON ODB encoder: NaN values are not encoded correctly. A quick review found this:

1) the authors of JSON smoked some bad mushrooms and specifically disallowed NaN and Inf values for floating point numbers: 
http://tools.ietf.org/html/rfc4627
2) most JSON encoders and decoders do reasonable and unreasonable things with NaN and Inf values. The worst ones encode them as zero. More bad 
mushrooms.

There is a quick survey at: http://lavag.org/topic/16217-cr-json-labview/?p=99058

<pre>
Some Javascript engines allow it since it is valid Javascript but not valid Json however there is no concensus.
cmj-JSON4Lua: raw tostring() output (invalid JSON).
dkjson: 'null' (like in the original JSON-implementation).
Fleece: NaN is 0.0000, freezes on +/-Inf.
jf-JSON: NaN is 'null', Inf is 1e+9999 (the encode_pretty function still outputs raw tostring()).
Lua-Yajl: NaN is -0, Inf is 1e+666.
mp-CJSON: raises invalid JSON error by default, but runtime configurable ('null' or Nan/Inf).
nm-luajsonlib: 'null' (like in the original JSON-implementation).
sb-Json: raw tostring() output (invalid JSON).
th-LuaJSON: JavaScript? constants: NaN is 'NaN', Inf is 'Infinity' (this is valid JavaScript?, but invalid JSON).
</pre>

For the MIDAS JSON encoder (and decoder) I have several choices:
a) encode NaN and Inf using the printf("%f") encoding (as strings, making it valid JSON)
b) encode NaN and Inf as strings using the Javascript special values: "NaN", "Infinity" and "-Infinity", see 
http://www.w3schools.com/jsref/jsref_positive_infinity.asp

I note that the Python JSON encoder does (b), see section 18.2.3.3 at http://docs.python.org/2/library/json.html

In either case, behaviour of the JSON decoder on the Javascript side needs to be tested. (Silent conversion to value of zero is not acceptable).

If anybody has an suggestion on this, please let me know.



P.S. If you do not know all about NaN, Inf, "-0" and other floating point funnies, please read:  https://www.ualberta.ca/~kbeach/phys420_580_2010/docs/ACM-Goldberg.pdf

P.P.S. If you ever used the type "float" or "double", used the "/" operator or the function "sqrt()" you also should read that reference.

K.O.
       Reply  09 Oct 2013, Konstantin Olchanski, Info, ODB JSON support 
> > odbedit can now save ODB in JSON-formatted files.
> A bug was reported in my JSON ODB encoder: NaN values are not encoded correctly.

Tested the browser-builtin JSON.stringify() function in google-chrome, firefox, safari, opera:
everybody encodes numeric values NaN and Inf as JSON value [null].

To me, this clearly demonstrates a severe defect in the JSON standard and in it's Javascript implementation:
a) NaN, Inf and -Inf are valid, useful and commonly used numeric values defined by the IEEE754/854 standard (as opposed to the special value "-0", which is also defined by the standard, but is not nearly as useful)
b) they are all distinct numeric values, encoding them all into the same JSON value [null] is the same as encoding all even numbers into the JSON value [42].
c) on the decoding end, JSON value [null] is decoded into Javascript value [null], which works as 0 for numeric computation, so effectively NaN, Inf and -Inf are made equal to zero. A neat trick.

Note that (c) - NaN, Inf is same as 0 - eventually produces incorrect numerical results by breaking the IEEE754/854 standard specification that number+NaN->NaN, number+infinity->infinity, etc.

In MIDAS we have a requirement that results be numerically correct: if an ODB value is "infinity", the corresponding web page should not show "0".

In addition we have a requirement that JSON encoding should be lossess: i.e. ODB contents encoded by JSON should decode back into the same ODB contents.

To satisfy both requirements, I now encode NaN, Inf and -Inf as JSON string values "NaN", "Infinity" and "-Infinity". (Corresponding to the respective Javascript values).

Notes:
1) this is valid JSON
2) it survives decode/encode in the browser (ODBMCopy()/JSON.parse/modify some values/JSON.stringify/ODBMPaste() does not destroy these special values)
3) it is numerically correct for "NaN" values (Javascript [1+"NaN"] -> NaN)
4) it fails in an obvious way for Inf and -Inf values (Javascript [1+"Infinity"] is NaN instead of Infinity).

https://bitbucket.org/tmidas/midas/commits/82dd203cc95dacb6ec9c0a24bc97ffd45bb58427
K.O.
          Reply  17 Mar 2014, Konstantin Olchanski, Info, ODB JSON support 
> > > odbedit can now save ODB in JSON-formatted files.
> encode NaN, Inf and -Inf as JSON string values "NaN", "Infinity" and "-Infinity". (Corresponding to the respective Javascript values).

A new standard just came out - Oasis OData JSON format 4.0 - 
http://docs.oasis-open.org/odata/odata-json-format/v4.0/os/odata-json-format-v4.0-os.html

Section 7.1 reads:

> Values of types [...] Edm.Single, Edm.Double, and Edm.Decimal are represented as JSON numbers, except for NaN, INF, and –INF which are represented as strings.

This is consistent with what we do in MIDAS - encode special numbers as strings. For now I think we stay with Javascript-standard "Infinity", "-Infinity",
but if more standards start using "INF", "-INF", maybe we will switch. It is easy enough to support both encodings in the JSON parser and in the ODB decoder.

https://xkcd.com/927/
K.O.
             Reply  12 Apr 2022, Konstantin Olchanski, Info, ODB JSON support 
> > > > odbedit can now save ODB in JSON-formatted files.
> > encode NaN, Inf and -Inf as JSON string values "NaN", "Infinity" and "-Infinity". (Corresponding to the respective Javascript values).
> http://docs.oasis-open.org/odata/odata-json-format/v4.0/os/odata-json-format-v4.0-os.html
> > Values of types [...] Edm.Single, Edm.Double, and Edm.Decimal are represented as JSON numbers,
> except for NaN, INF, and –INF which are represented as strings "NaN", "INF" and "-INF".
> https://xkcd.com/927/

Per xkcd, there is a new json standard "json5". In addition to other things, numeric
values NaN, +Infinity and -Infinity are encoded as literals NaN, Infinity and -Infinity (without quotes):
https://spec.json5.org/#numbers

Good discussion of this mess here:
https://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript

K.O.
                Reply  13 Apr 2022, Stefan Ritt, Info, ODB JSON support 
> Per xkcd, there is a new json standard "json5". In addition to other things, numeric
> values NaN, +Infinity and -Infinity are encoded as literals NaN, Infinity and -Infinity (without quotes):
> https://spec.json5.org/#numbers

Just for curiosity: Is this implemented by the midas json library now?
                   Reply  13 Apr 2022, Konstantin Olchanski, Info, ODB JSON support 
> > Per xkcd, there is a new json standard "json5". In addition to other things, numeric
> > values NaN, +Infinity and -Infinity are encoded as literals NaN, Infinity and -Infinity (without quotes):
> > https://spec.json5.org/#numbers
> 
> Just for curiosity: Is this implemented by the midas json library now?

MIDAS encodes NaN, Infinity and -Infinity as javascript compatible "NaN", "Infinity" and "-Infinity",
this encoding is popular with other projects and allows correct transmission of these values
from ODB to javascript. The test code for this is on the MIDAS "Example" page, scroll down
to "Test nan and inf encoding".

I think this type of encoding, using strings to encode special values, is more in the spirit of json,
compared to other approaches such as adding special literals just for a few special cases
leaving other special cases in the cold (ieee-754 specifies several different types of NaN,
you can encode them into different nan-strings, but not into the one nan-literal (need more nan-literals,
requires change to the standard and change to every json parser).

As editorial comment, it boggles my mind, what university or kindergarden these people went to
who made the biggest number, the smallest number and the imaginary number (sqrt(-1))
all equal to zero (all encoded as literal null).

K.O.
Entry  31 Mar 2022, Stefan Ritt, Suggestion, Maximum ODB size 
Anybody some idea what the maximum ODB size can be? In the old days, the linux 
kernels had a severe limit on shared memory of usually 8MB, but in the age of 
64GB RAM being a standard, we should be able to grow bigger. Tried

odbinit -s 1024MB --cleanup

which went through without complain, even put that value in to .ODB_SIZE.TXT, but 
when I started odbedit doing "mem", I only see a size of 1MB. Probably somewhere 
deep inside we have a limit which prevents the user to create very large ODBs, 
but this should be mentioned more prominently in odbinit. Like "size too large, 
maximum allowd is xxx MB".

Stefan
    Reply  04 Apr 2022, Konstantin Olchanski, Suggestion, Maximum ODB size 
> Anybody some idea what the maximum ODB size can be?

It turns out ODB size limit is hardwired on db_open_database() at 100 Mbytes.

I now committed an improved error message for this.

I confirm that "odbinit -s 100MB" works and creates ODB with 50 Mbyte data area and 50 
Mbyte key area.

> in the age of 64GB RAM being a standard, we should be able to grow bigger ...

I agree, I think we can safely bump the limit from 100 Mbytes to 1 Gbyte, maybe 1.5 or 
1.99 Gbytes. Above that we run into 32-bit/31-bit cleanliness problems.

And creating extra large 1 GB ODB but using only a few megabytes will not waste any 
RAM, because the .ODB.SHM file is demand-paged and non-used parts of ODB will not be 
mapped into RAM. (It will waste disk space, file .ODB.SHM will be 1 GByte size).

However, 1 GByte (FPGA based) and 4-8 GByte (Raspberry Pi & co) machines are again
becoming popular and relevant for running MIDAS, and they have very slow "disk" 
subsystems, with NAND, SD and USB flash, so we should not go crazy here.

> odbinit -s 1024MB --cleanup

there is a bug in odbinit, if initial odbinit fails, ODB with default size is creates, 
and original rejected ODB size is written to .ODB_SIZE.TXT (an inconsistency).

bitbucket bug 328

> [ how do I resize ODB ??? ]

we need odbresize. bitbucket bug 329.

K.O.
Entry  31 Mar 2022, Konstantin Olchanski, Bug Fix, "run stop" trouble in mlogger, fixed 
while debugging something else, I ran into a bit of trouble in mlogger.

I set the mlogger event limit to 100, and after reaching 100 events, mlogger
sayd "stopping run", but nothing happened, run kept going.

it turns out mlogger tried stopping the run too soon, the run-start
transition did not finish yet and the error message about trying
to stop a run while another transition is in progress was missing.

(fixed - if another transition is in progress, we try again later)

it also turns out that cm_transition() checks if another transition
is in progress way too late, all the way in the transition thread,
where it cannot return it is an error to mlogger.

(fixed - first thing done in cm_transition() is this check).

while debugging this, I tested the ODB flags "/Logger/Async transitions"
and /Logger/Multithread transitions". It turns out only two transition
types still work from inside mlogger - multithread transition
and detached transition (via the mtransition helper).

the issue is the dead lock between mlogger and frontend. while mlogger
is inside cm_transition(), it is not reading the SYSTEM buffer,
while at the same time frontends are writing into it. If SYSTEM
buffer happens to be pretty full, we dead lock - frontends are waiting
for free space in the SYSTEM buffer do not respond to RPCs, mlogger is not 
reading from the SYSTEM and it stuck trying to issue "run stop" RPC
to frontend. (this dead lock is not forever, eventually frontend
is killed by RPC timeout, mlogger survives and stops the run).

this is a well known problem and as solution, mlogger has been using the 
multithreaded transitions for years.

now I removed the OBD /Logger/Async transition and /Logger/Multithread 
transition flags, instead, there is now a flag /Logger/Detached transitions
set to FALSE by default. Setting it to TRUE will cause mlogger to fork 
"mtransition STOP" and "mtransition START" for stopping and starting runs,
this is useful in case there is trouble with multithreading in mlogger.

K.O.
Entry  30 Mar 2022, Konstantin Olchanski, Bug Fix, erroneous removal of odb clients, fixed 
commit https://bitbucket.org/tmidas/midas/commits/b1fe21445109774be3f059c2124727b414abf835
made on 2022-02-21 fixed a serious bug in ODB.

a multithread race condition against an incorrectly updated shared variable caused removal of 
random clients from ODB with error message:

My client index %d in ODB is invalid: out of range 0..%d. Maybe this client was removed by a 
timeout, see midas.log. Cannot continue, aborting...

the race is between db_open_database() in one program (executed when any midas program starts) and 
db_get_my_client_locked() in all running midas programs.

as long as no midas programs are started (db_open_database() is not executed), this bug does not 
happen.

if i.e. odbedit is executed very often, i.e. from a script, probability of hitting this bug becomes 
quite high.

fixed now.

K.O.
Entry  29 Mar 2022, Konstantin Olchanski, Bug Fix, mdump can read lz4 and bz2 files now 
I converted mdump file i/o from older mdsupport library to newer midasio library 
and it can now read .mid, .mid.gz, .mid.lz4 and .mid.bz2 files. Output should be 
identical to what it printed before, if you see any differences, please report 
them here or on bitbucket. K.O.
Entry  29 Mar 2022, Hunter Lowe, Forum, Triggering without LAM signal - mcstd_libgpmc_camac driver 
Hello,

I have a question for anyone experienced with simple CAMAC systems.
 My understanding is that for a single ADC system you can use a gate to generate a
 LAM signal for triggering on ADC.
 The driver that I have "mcstd_libgpmc_camac" has LAM "not implemented" though,
 so I'm not sure how I should trigger DAQ. The frontend code that I have seems to use a TDC
 as trigger for ADC via "EQ_POLLED" type equipment setting. Should I simply plug in TDC in my
 system and use this as trigger? Is it as simple as TDC generates signal via gate and ADC performs job? 

Sorry if question is super basic, just confused how to trigger without LAM signal.

Thank you :)

Hunter Lowe
UNBC Grad Physics
Entry  12 Dec 2021, Marius Koeppel, Bug Report, Writting MIDAS Events via FPGAs  dummy_fe.cpp
Dear all,

in 13 Feb 2020 to 21 Feb 2020 we had a talk about how I try to create MIDAS events directly on a FPGA and 
than use DMA to hand the event over to MIDAS. In the thread I also explained how I do it in my MIDAS frontend. 

For testing the DAQ I created a dummy frontend which was emulating my FPGA (see attached file). The interesting code is 
in the function read_stream_thread and there I just fill a array according to the 32b BANKS which are 64b aligned (more or less
the lines 306-369). And than I do:

    uint32_t * dma_buf_volatile;
    dma_buf_volatile = dma_buf_dummy;

    copy_n(&dma_buf_volatile[0], sizeof(dma_buf_dummy)/4, pdata);

    pdata+=sizeof(dma_buf_dummy);
    rb_increment_wp(rbh, sizeof(dma_buf_dummy)); // in byte length

to send the data to the buffer.

This summer (Mai - July) everything was working fine but today I did not get the data into MIDAS. 
I was hopping around a bit with the commits and everything was at least working until: 3921016ce6d3444e6c647cbc7840e73816564c78.

Thanks,
Marius
    Reply  26 Jan 2022, Konstantin Olchanski, Bug Report, Writting MIDAS Events via FPGAs  
> today I did not get the data into MIDAS. 

Any error messages printed by the frontend? any error message in midas.log? core dumps? crashes?

I do not understand what you mean by "did not get the data into midas". You create events
and send them to a midas event buffer and you do not see them there? With mdump?
Do you see this both connected locally and connected remotely through the mserver?

BTW, I see you are using the mfe.c frontend. Event data handling in mfe.c frontends
is quite convoluted and impossible to straighten out. I recommend that you use
the tmfe c++ frontend instead. Event data handling is much simplified and is easier to debug
compared to the mfe.c frontend. There is examples in the midas repository and there are
tutorials for converting frontends from mfe.c to tmfe posted in this forum here.

BTW, the commit you refer to only changed some html files, could not have affected
your data.

K.O.
       Reply  26 Jan 2022, Marius Koeppel, Bug Report, Writting MIDAS Events via FPGAs  
> Any error messages printed by the frontend? any error message in midas.log? core dumps? crashes? 
> I do not understand what you mean by "did not get the data into midas". You create events
> and send them to a midas event buffer and you do not see them there? With mdump?
> Do you see this both connected locally and connected remotely through the mserver?

I simply don't see the event counter counting up and I also don't see them using mdump. No logs, no dumps and no crashes - every is quite. I only tested it locally.
 
> BTW, I see you are using the mfe.c frontend. Event data handling in mfe.c frontends
> is quite convoluted and impossible to straighten out. I recommend that you use
> the tmfe c++ frontend instead. Event data handling is much simplified and is easier to debug
> compared to the mfe.c frontend. There is examples in the midas repository and there are
> tutorials for converting frontends from mfe.c to tmfe posted in this forum here.

I know the code I used is really old that's why I was so surprised that it suddenly did not work. But I am on the way to change it. Also Stefan gave me some comments on how to improve the code. But still changing them did not really change the behavior. 

> BTW, the commit you refer to only changed some html files, could not have affected
> your data.

I just hopped around and the commit I send was the first one which worked again. But it's of course not the one where the stuff broke. I did a bit of git-bisect and ended up with this commit as the first one where my frontend is not working anymore: 91582e4172d534bf9b10e661a423c399fd1a69f4

Cheers,
Marius
          Reply  26 Jan 2022, Konstantin Olchanski, Bug Report, Writting MIDAS Events via FPGAs  
> 
> > Any error messages printed by the frontend? any error message in midas.log? core dumps? crashes? 
> > I do not understand what you mean by "did not get the data into midas". You create events
> > and send them to a midas event buffer and you do not see them there? With mdump?
> > Do you see this both connected locally and connected remotely through the mserver?
> 
> I simply don't see the event counter counting up and I also don't see them using mdump. No logs, no dumps and no crashes - every is quite. I only tested it locally.
>

If you are connected locally (no mserver), I want to know the value returned by bm_send_event(). Simplest
if you edit mfe.c and everywhere it calls bm_send_event() and rpc_send_event(), print the returned value.

It would be very interesting to see if bm_send_event() returns 1 (SUCCESS), but the event vanishes
without a trace.

Before you do that, try something simpler:

Run "mdump -s -d", it will print some event buffer internals.

Watch to see if any data pointers change when you send your events ("wp", "rp", etc).

If nothing changes at all, then we are not sending anything (fault is in your code or on mfe.c).

If you see "wp" counting up, then we definitely write your events into the buffer and mdump & mlogger should see them.

But there is some funny logic for event_id and trigger_mask and it is worth checking their
values. For a good test, set event_id=1 and trigger_mask=0x1. There might be trouble if either is set to zero.

K.O.
             Reply  26 Jan 2022, Marius Koeppel, Bug Report, Writting MIDAS Events via FPGAs  
> If you are connected locally (no mserver), I want to know the value returned by bm_send_event(). Simplest
> if you edit mfe.c and everywhere it calls bm_send_event() and rpc_send_event(), print the returned value.
> 
> It would be very interesting to see if bm_send_event() returns 1 (SUCCESS), but the event vanishes
> without a trace.

I checked bm_send_event(rbh, (EVENT_HEADER*)(&pdata[0]), 0, 20); which gives me back 1. I also check the status of rb_increment_wp which is also 1.

> Before you do that, try something simpler:
> Run "mdump -s -d", it will print some event buffer internals.
> Watch to see if any data pointers change when you send your events ("wp", "rp", etc).

"rp" & "wp" are not counting up. 

> But there is some funny logic for event_id and trigger_mask and it is worth checking their
> values. For a good test, set event_id=1 and trigger_mask=0x1. There might be trouble if either is set to zero.

Changing both to 0x1 did not change the behavior. 

Cheers,
Marius
    Reply  28 Jan 2022, Stefan Ritt, Bug Report, Writting MIDAS Events via FPGAs  dummy_fe.cpp
I finally got the dummy program working. There were several issues:

- event_buffer_size was defined as 10000 * 32 MB = 320 GB, exceeding the RAM of the computer

- SERIAL number starting with 1. Actually in midas, event serial numbers always started with zero, but this was wrong in the documentation at 
https://midas.triumf.ca/MidasWiki/index.php/Event_Structure, so I also fixed the documentation

- the event header time stamp must be seconds since 1.1.1970, and thus the function ss_time() should be used to set it

- calling set_equipment_status() for each event slows down the event collection considerably, since this function access the ODB each time

- dma_buf_dummy is defined inside the event loop, so it gets allocated and de-allocated on the stack for each event. Of course this might vanish 
when the real FPGA buffer will be used.

- The line pdata+=sizeof(dma_buf_dummy); is wrong. pdata is pointer to uint32_t, but the sizeof() operation returns the size of the 
dma_buf_dummy in bytes. Therefore, pdata gets incremented by four times the size of dma_buf_dummy

- Instead the call to std::this_thread::sleep_for(std::chrono::milliseconds(2000)); one can call the standard midas call ss_sleep(2000); which 
is a bit shorter

- Finally, sending many events to the ring buffer triggered a bug in the midas ring buffer functions which were lingering there since 2007. I'm 
glad that this happened and now could be fixed. Not sure if other experiments where affected in the last decade by that. This could have 
manifested itself in lost events or crashing front-ends. Anyhow, now it's fixed. You need to update midas to get the fix.

I attached a working version of the dummy program for your reference. Banks a different but the principle should become clear.

Stefan
       Reply  16 Feb 2022, Marius Koeppel, Bug Report, Writting MIDAS Events via FPGAs  
I just came back to this and started to use the dummy frontend.
Unfortunately, I have a problem during run cycles: 

Starting the frontend and starting a run works fine -> seeing events with mdump and also on the web GUI. 
But when I stop the run and try to start the next run the frontend is sending no events anymore.
It get stuck at line 221 (if (status == DB_TIMEOUT)).
I tried to reduce the nEvents to 1 which helped in terms of DB_TIMEOUT but still I don't get any events after I did a stop / start cycle -> no events in mdump and no events counting up at the web GUI.
If I kill the frontend in the terminal (ctrl+c) and restart it, while the run is still running, it starts to send events again.

Cheers,
Marius
          Reply  03 Mar 2022, Stefan Ritt, Bug Report, Writting MIDAS Events via FPGAs  
> Starting the frontend and starting a run works fine -> seeing events with mdump and also on the web GUI. 
> But when I stop the run and try to start the next run the frontend is sending no events anymore.
> It get stuck at line 221 (if (status == DB_TIMEOUT)).
> I tried to reduce the nEvents to 1 which helped in terms of DB_TIMEOUT but still I don't get any events after I did a stop / start cycle -> no events in mdump and no events counting up at the web GUI.
> If I kill the frontend in the terminal (ctrl+c) and restart it, while the run is still running, it starts to send events again.

This problem has (likely) been fixed in the current version. Please pull develop and try again. Was a recursive call to the event collection routine which is only triggered if you send events faster than 
the logger can digest, so not many people see it.

Best,
Stefan
             Reply  07 Mar 2022, Marius Koeppel, Bug Report, Writting MIDAS Events via FPGAs  
> This problem has (likely) been fixed in the current version. Please pull develop and try again. Was a recursive call to the event collection routine which is only triggered if you send events faster than 
> the logger can digest, so not many people see it.

I just pulled the current version (d945fa9) but the problem as explained in 2347 stays the same.

Best,
Marius
                Reply  25 Mar 2022, Marius Koeppel, Bug Report, Writting MIDAS Events via FPGAs  
I finally found the problem why the readout stops after a run transition. 

In my dummy frontend the serial number was not reset to zero at run start. 
This leads to a mismatch of the serial number in the function receive_trigger_event of mfe.cxx:1247.
Which is than resulting in the problem that the function founds never a new event in all ring buffers and nothing get read out of the buffer.

Nevertheless, it would be nice that the system would tell the user that there is a mismatch in the serial number (printing a warning / error etc.). 

Cheers,
Marius
Entry  23 Mar 2022, Konstantin Olchanski, Bug Fix, mhttpd bug fixed 
the mhttpd bug should be fixed now (branch feature/buffer_mutex).

simplest way to reproduce:

wget http://localhost:8080/
quickly ctrl-C it
wget http://localhost:8080/
inside mhttpd (by hook or crook) observe that the second wget got the data meant for the first wget.

if you cannot ctrl-C the first wget quickly enough, put a sleep somewhere in the worker thread (in 
mongoose_write(), I think).

this is what happens.

1st wget stops (by ctrl-C), socket is closed, mongoose frees it's mg_connection object
(corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)

2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
nc pointer is no longer stale, but points to 2nd wget's connection.

so we think we are clever and we check the socket file descriptors. but same thing
happens there, too. if 1st wget was file descriptor 7, it is closed, (1st wget worker now has
a stale file handle), then reopened for the 2nd wget, per POSIX, we get back the same
file descriptor 7. 1st wget worker now has the file handle for the 2nd wget tcp socket and
the famous test/crash for "sending data to wrong socket" is defeated.

now, worker thread for the 1st wget wants to send a reply, it has a valid nc pointer (points to 2nd wget's
mg_connection object) and a valid file descriptor (points to 2nd wget's tcp socket),
reply meant for the 1st wget is successfully sent to the 2nd wget, 2nd wget finishes, it's socket
is closed, mg_connection object is free'ed. Now the worker thread for the 2nd wget has stale
connection info, but this is okey, mongoose does not find a matching connection, 2nd wget
worked thread reply goes nowhere, thread finishes silently (no memory leaks here, I checked).

so, connection for 2nd wget completely impersonates the closed connection of 1st wget (I guess I could
check the full socket address info, remote ip address, remote port number, etc, but...)

in practice, this bug does not happen often because modern browsers tend to keep tcp sockets open
for very long time. (not sure about sundry web proxies, etc).

solution of course is very simple. match worker thread data to mongoose mg_connection objects
using our own connection sequential number, which are unique and very easy to keep track
of through the mongoose event handler. all this mess runs in the main thread,
so no locking trouble here, small blessing.

K.O.
    Reply  24 Mar 2022, Stefan Ritt, Bug Fix, mhttpd bug fixed 
> 1st wget stops (by ctrl-C), socket is closed, mongoose frees it's mg_connection object
> (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)
> 
> 2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
> but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
> nc pointer is no longer stale, but points to 2nd wget's connection.

Why don't we CLEAR the memory (memset(object,0,sizeof(object)) before the free(), this way it cannot be 
mistakenly re-used by the next thread.

Stefan
       Reply  24 Mar 2022, Konstantin Olchanski, Bug Fix, mhttpd bug fixed 
> > 1st wget stops (by ctrl-C), socket is closed, mongoose frees it's mg_connection object
> > (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)
> > 
> > 2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
> > but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
> > nc pointer is no longer stale, but points to 2nd wget's connection.
> 
> Why don't we CLEAR the memory (memset(object,0,sizeof(object)) before the free(), this way it cannot be 
> mistakenly re-used by the next thread.
> 

My description was unclear. I will try better now.

When http replies are generated by worker threads, matching of reply to mg_connection is done
by checking the address of the mg_connection object. (mongoose itself unhelpfully offers
to send the reply to every mg_connection, see the responder to mg_broadcast() messages).

This works for open/active connections, addresses of all mg_connections are unique.

But if connection is closed and a new connection is opened, the address is reused (by malloc()/free()
reusing memory blocks or by mongoose using a pool of mg_connection objects, does not matter).

So matching http reply to mg_connection using only address of mg_connection can match the wrong connection.

(contents of mg_connection object does not matter, only address is used by matching. so memzero() of
mg_connection object does not help).

I saw this during my testing - wrong data was sent to wrong browser often enough - but did
not understand that the above problem is happening.

Because I was unable to reliably reproduce the problem, I could not debug it. I tried to add
a check for the tcp socket file descriptor number, in case there is a straight bug or multithread race
or simple memory corruption. This replaced "we sent wrong data to wrong browser, poisoned browser
cache, confused the user" with a crash. This "fix" seemed effective at the time.

Maybe I should mention browser cache poisoning again. What happened is html pages and rpc replies
were returned as responses to load things like CSS files, these bad responses are cached by the browser
pretty much forever, so all subsequent midas pages will look wrong (bad css!) forever, until
user manually clears browser cache. reload of page did not help, restart of browser did not help (I think).

So a very bad bug.

Unfortunately, the check for file descriptor was not effective because file descriptors are also
reused. And I did see wrong data returned by mhttpd, but even more rarely. And everybody (myself
included) complained about mhttpd crashes.

Now, matching of responses to connections is done by connection sequential/serial number,
which is unique 32-bit counter. Mismatch of reply to connection should not happen again.

P.S. Latest version of the mongoose web server library does not help with this problem,
the example code for matching reply to connection in their multithread example looks bogus:
https://github.com/cesanta/mongoose/blob/master/examples/multi-threaded/main.c

K.O.
          Reply  24 Mar 2022, Stefan Ritt, Bug Fix, mhttpd bug fixed 
I see, now I understand.

As for the browser cache problem: This Chrome extension is your friend: 

https://chrome.google.com/webstore/detail/clear-cache/cppjkneekbjaeellbfkmgnhonkkjfpdn?hl=en

I use it all the time I change the CSS or a JS file. Having the "Developer Tools" open in Chrome helps as well 
(cache is then turned off). Firefox has similar extensions.

Stefan
             Reply  24 Mar 2022, Konstantin Olchanski, Bug Fix, mhttpd bug fixed 
> As for the browser cache problem: This Chrome extension is your friend ...

for google chrome, it is easy, open the javascript debugger (left-click "inspect"),
the reload button becomes a left-click menu, one left-click option is "clear cache and reload".
(there is no button for "clear cookies and reload", re recent elog cookie problem).

but this does not help me personally any. if midas web pages get confused, I will also get confused, too,
and I will spend hours debugging mhttpd before thinking "hmm... maybe I should clear the browser cache!"

not sure about firefox, safari, microsoft edge and opera. if I ever need it, I google it.

K.O.
Entry  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
Dear all

We just started our beam time at ILL and just found yesterday that for certain 
settings of our detector the data is not saved into the .mid files. Running "mdump 
-l 10" online we see the data coming in as they should. Nevertheless, if we run 
"mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are 
missing. Any ideas where the data could go lost?

Thanks in advance,
Ivo
    Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
> Dear all
> 
> We just started our beam time at ILL and just found yesterday that for certain 
> settings of our detector the data is not saved into the .mid files. Running "mdump 
> -l 10" online we see the data coming in as they should. Nevertheless, if we run 
> "mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are 
> missing. Any ideas where the data could go lost?
> 
> Thanks in advance,
> Ivo

Have you checked 

/Logger/Channels/0/Settings/Event ID = -1
/Logger/Channels/0/Settings/Trigger mask = -1

If these settings are not -1, they filter the data stream for certain events and trigger 
masks.

Stefan
       Reply  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> > Dear all
> > 
> > We just started our beam time at ILL and just found yesterday that for certain 
> > settings of our detector the data is not saved into the .mid files. Running "mdump 
> > -l 10" online we see the data coming in as they should. Nevertheless, if we run 
> > "mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are 
> > missing. Any ideas where the data could go lost?
> > 
> > Thanks in advance,
> > Ivo
> 
> Have you checked 
> 
> /Logger/Channels/0/Settings/Event ID = -1
> /Logger/Channels/0/Settings/Trigger mask = -1
> 
> If these settings are not -1, they filter the data stream for certain events and trigger 
> masks.
> 
> Stefan

Good morning Stefan

Both set to -1. We only have one logging channel. If we run a sequence with a few runs and the 
same settings, sometimes data is in the .mid file and sometimes it is not.

Best,
Ivo 
          Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
> Both set to -1. We only have one logging channel. If we run a sequence with a few runs and the 
> same settings, sometimes data is in the .mid file and sometimes it is not.

Then I'm running out of ideas. Things I would check:

- Are the file sizes about the same? 

- When you dump the .mid file, you do you see your bank names? 

This would tell you if the events are really missing or if mdump would just not find them.

But I guess without being able to debug the system at ILL I cannot be of any more help. You are the 
first one reporting such a problem, so it must have to do with your local setup.

Stefan
             Reply  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> Then I'm running out of ideas. Things I would check:
> 
> - Are the file sizes about the same? 
> 
> - When you dump the .mid file, you do you see your bank names? 
> 
> This would tell you if the events are really missing or if mdump would just not find them.
> 
> But I guess without being able to debug the system at ILL I cannot be of any more help. You are the 
> first one reporting such a problem, so it must have to do with your local setup.
> 
> Stefan

So I did a quick check. The file size is about the same (322K and 329K). When I dump the .mid I don't see 
the banks. It only prints two lines with "------ Event# 0 ------" and "------ Event# 1 ------" whereas for 
the file with data I get the two banks with all the data. Our online analyzer also fails to see the banks. 
Is there another way to check what is in the .mid file?

Best,
Ivo
                Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
> So I did a quick check. The file size is about the same (322K and 329K). When I dump the .mid I don't see 
> the banks. It only prints two lines with "------ Event# 0 ------" and "------ Event# 1 ------" whereas for 
> the file with data I get the two banks with all the data. Our online analyzer also fails to see the banks. 
> Is there another way to check what is in the .mid file?

with "dump" I meant a true object dump like "hexdump -C run000001.mid". I produced a file with ADC0 and TDC0 
banks (that's the example from the distribution under exampels/experiments/frontend.cxx), and I get

....
00024220  01 00 00 00 41 44 43 30  04 00 08 00 eb 06 35 04  |....ADC0......5.|
00024230  31 09 4f 06 54 44 43 30  04 00 08 00 93 04 fb 07  |1.O.TDC0........|
00024240  5c 09 88 0b 01 00 00 00  01 00 00 00 2a 0b 31 5f  |\...........*.1_|
00024250  28 00 00 00 20 00 00 00  01 00 00 00 41 44 43 30  |(... .......ADC0|
00024260  04 00 08 00 c3 09 24 05  85 05 f3 06 54 44 43 30  |......$.....TDC0|
00024270  04 00 08 00 88 08 2d 03  3b 0d d6 02 01 00 00 00  |......-.;.......|
00024280  02 00 00 00 2a 0b 31 5f  28 00 00 00 20 00 00 00  |....*.1_(... ...|
00024290  01 00 00 00 41 44 43 30  04 00 08 00 a5 0a 69 09  |....ADC0......i.|

where you clearly see the ADC0 and TDC0 banks.

Stefan
                   Reply  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> with "dump" I meant a true object dump like "hexdump -C run000001.mid". I produced a file with ADC0 and TDC0 
> banks (that's the example from the distribution under exampels/experiments/frontend.cxx), and I get
> 
> ....
> 00024220  01 00 00 00 41 44 43 30  04 00 08 00 eb 06 35 04  |....ADC0......5.|
> 00024230  31 09 4f 06 54 44 43 30  04 00 08 00 93 04 fb 07  |1.O.TDC0........|
> 00024240  5c 09 88 0b 01 00 00 00  01 00 00 00 2a 0b 31 5f  |\...........*.1_|
> 00024250  28 00 00 00 20 00 00 00  01 00 00 00 41 44 43 30  |(... .......ADC0|
> 00024260  04 00 08 00 c3 09 24 05  85 05 f3 06 54 44 43 30  |......$.....TDC0|
> 00024270  04 00 08 00 88 08 2d 03  3b 0d d6 02 01 00 00 00  |......-.;.......|
> 00024280  02 00 00 00 2a 0b 31 5f  28 00 00 00 20 00 00 00  |....*.1_(... ...|
> 00024290  01 00 00 00 41 44 43 30  04 00 08 00 a5 0a 69 09  |....ADC0......i.|
> 
> where you clearly see the ADC0 and TDC0 banks.
> 
> Stefan

So at least I learned something new. I tried it with the hexdump and the banks are not existent in the .mid file. I 
only have the ODB inside the file. The 7K difference in size is actually just about what I expect to be the data 
(1792 x 4 bytes)

Best, Ivo
                      Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
Have you tried longer files? Maybe a few 100 MB or so. Maybe a buffer is not flushed correctly at the end of a run.
                         Reply  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> Have you tried longer files? Maybe a few 100 MB or so. Maybe a buffer is not flushed correctly at the end of a run.

Yes, I did. This 7 KB of the data bank is about the limit. If we go only 1 KB higher it seems that we save all data. In 
our specific case, this is the number of time bins (256 pixels with 7 time bins results in data loss, with 8 time bins it 
seems to be okay, data type is DWORD). 

Of course, a workaround for us is to save at least 8 time bins and throw 7 of them away later on. Nevertheless, since we 
are only in the commissioning phase now this is okay, I would just like to avoid data loss in the data taking phase of the 
experiment so knowing where the problem origins could help. 

I did another test with another FE running that produces a lot of data. The behavior is the same though. If the bank size 
is less than about 8 KB, the bank is not saved anymore. But probably this is anyway the expected behavior since it is a 
different FE that produces the data. 

So if it is coming from the buffer, is there something I could change to test or solve the problem?

Best, Ivo
                            Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
I have to reproduce the problem to fix it. Why don't you go and modify midas/examples/experiment/frontend.cxx in such a way that 
it creates exactly the banks you have, just with random data. If you see the same problem, send me your frontend file so that I 
can reproduce it.
                               Reply  11 Aug 2020, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid 
> I have to reproduce the problem to fix it. Why don't you go and modify midas/examples/experiment/frontend.cxx in such a way that 
> it creates exactly the banks you have, just with random data. If you see the same problem, send me your frontend file so that I 
> can reproduce it.

It would be good to pin point there the data is lost. This is the sequence:

frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk

To see if correct data arrives to the SYSTEM buffer, run:
mdump -z SYSTEM

To see if mlogger is receiving events from the SYSTEM buffer, run:
mlogger -v ### mlogger should report all events, history and data

To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).

I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
mlogger is misconfigured (but you already checked that).

If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).

K.O.
                                  Reply  11 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> It would be good to pin point there the data is lost. This is the sequence:
> 
> frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk
> 
> To see if correct data arrives to the SYSTEM buffer, run:
> mdump -z SYSTEM
> 
> To see if mlogger is receiving events from the SYSTEM buffer, run:
> mlogger -v ### mlogger should report all events, history and data
> 
> To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).
> 
> I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
> if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
> mlogger is misconfigured (but you already checked that).
> 
> If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
> to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).
> 
> K.O.

Good evening

I tried to reproduce the behavior in a very simple FE but it did not work out. The next thing for me would be to take the FE that is producing this behavior, replace all the device communication and data with dummies. If the problem is still there I would start to simplify as much as possible. 

Following the inputs of KO, I pin-pointed the data loss. The system buffer still gets the data but the mlogger does not write the data event. Then of course the data is also not anymore present in the data file. Therefore, I checked the logger settings again, Event ID and Trigger Mask still -1. Nothing else, at least from my point of view, that is misconfigured. Nevertheless, if it helps I can send my ODB settings. 

When doing the tests just before I found something else that probably can give a hint to the problem. The data is only lost if the time between two runs is long (a few seconds). As an example: If I run a sequence with a loop and after the FE stops the run the loop ends and the next run is started automatically, then only the first run has no data, which is the one after a longer time of no data taking. When I add a "WAIT Seconds 5" after the run before starting the next, not data is written to the disk for any run. I also found this once when adding a sleep(1) at the end of the FE readout function but back then did not think about it any further. 

Best, Ivo
                                     Reply  24 Mar 2022, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid 
> > It would be good to pin point there the data is lost. This is the sequence:
> > 
> > frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk
> > 
> > To see if correct data arrives to the SYSTEM buffer, run:
> > mdump -z SYSTEM
> > 
> > To see if mlogger is receiving events from the SYSTEM buffer, run:
> > mlogger -v ### mlogger should report all events, history and data
> > 
> > To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).
> > 
> > I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
> > if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
> > mlogger is misconfigured (but you already checked that).
> > 
> > If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
> > to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).
> > 
> > K.O.
> 
> Good evening
> 
> I tried to reproduce the behavior in a very simple FE but it did not work out.
> The next thing for me would be to take the FE that is producing this behavior,
> replace all the device communication and data with dummies. If the problem is still
> there I would start to simplify as much as possible. 
> 
> Following the inputs of KO, I pin-pointed the data loss. The system buffer still
> gets the data but the mlogger does not write the data event. Then of course the data
> is also not anymore present in the data file. Therefore, I checked the logger
> settings again, Event ID and Trigger Mask still -1. Nothing else, at least from my point of view,
> that is misconfigured. Nevertheless, if it helps I can send my ODB settings. 
> 
> When doing the tests just before I found something else that probably
> can give a hint to the problem. The data is only lost if the time between
> two runs is long (a few seconds). As an example: If I run a sequence with a loop
> and after the FE stops the run the loop ends and the next run is started automatically,
> then only the first run has no data, which is the one after a longer time of
> no data taking. When I add a "WAIT Seconds 5" after the run before starting
> the next, not data is written to the disk for any run. I also found this
> once when adding a sleep(1) at the end of the FE readout function
> but back then did not think about it any further. 
> 

Looks like this problem fell into the covid crack.

As far as I know, MIDAS does not lose any events between bm_send_event() and the shared memory
buffer. It does not lose any events in the mlogger (unless the "event request" is misconfigured).
(there is lots of opportunity to lose events in complicated frontends).

If you have some evidence otherwise, I would very much like to hear about
it and I want to fix all problems that cause it.

In your previous report I was under the impression that you lose random events here and there,
but your latest report is about mlogger not writing anything at all.

Which case is it?

If you can definitely say that all your events make it to the SYSTEM buffer
but mlogger sometimes does not see some of them and sometimes does not see all of them,
we should look very closely at bm_receive_event() and mlogger itself.

In the case where mlogger is not seeing any events at all (output file is empty), as this is
happening, I would like to see the output of mdump (to confirm events are written to SYSTEM
buffer with correct event_id and trigger_mask) and the output of (say)
"manalyzer_test.exe --dump run01161.mid.lz4" on your output file.

If the output is very long, you can email it to me directly instead of posting it here.

K.O.
                                        Reply  24 Mar 2022, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
One idea: we should have a look at mlogger::close_channels(). There the SYSTEM buffer is emptied through the cm_yield() call. Instrumenting this with some debugging code will enlighten us. 

Another possible problem: If the frontend requested to be notified for a run stop AFTER the logger, then the problem might happen: Logger closes file, and THEN the frontend flushes events ending up in the SYSTEM buffer and being logged at the beginning of the next run. The mfe.cxx framework takes care of this by calling 

cm_register_transition(TR_SOP, 500);

while the mlogger does 

cm_register_transition(TR_STOP, tr_stop, 800);

and since 800 > 500 the logger will be called AFTER the frontend. If one use a framework different from mfe.cxx, this could however be different.

Stefan
                                           Reply  24 Mar 2022, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid 
> One idea: we should have a look at mlogger::close_channels().
> There the SYSTEM buffer is emptied through the cm_yield() call.
> Instrumenting this with some debugging code will enlighten us.

right. this will "last few events are lost at the end of run".

but that code in the mlogger was not touched in years, if there is a problem there,
we would have seen it by now, most experiments check that the number
of events in the data file is same as number of triggers generated, both
numbers are shown on the midas status page.

> Another possible problem: If the frontend requested to be notified for a run stop AFTER the logger, then the problem might happen: Logger closes file, and THEN the frontend flushes events ending up in the SYSTEM buffer and being logged at the beginning of the next run. The mfe.cxx framework takes care of this by calling 
> cm_register_transition(TR_STOP, 500);

default sequence, both mfe.c frontend and c++ tmfe frontend:

start of run:
- mlogger first (configure history, open data file)
- frontends last
- (if any frontend fails, TR_STARTABORT is sent to mlogger to close the output file and "undo" the run start)

end of run:
- frontends first (must not send any events after after processing the TR_STOP RPC call, inside the TR_STOP handler, bm_flush_cache() takes care of the write cache)
- mlogger last
- (if any frontend fails, failure is ignored, run stops regardless)

wrong order will be only if they manually change it, and whatever order
they set, you see it on the midas transition page (and mtransition -v and odbedit stop now -v, etc).

K.O.
Entry  22 Mar 2022, Konstantin Olchanski, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
multithreaded frontends have an unusual event buffer corruption if the write 
cache is enabled. For a long time now I had to disable the write cache on
all multithreaded frontends in alpha-g, I was hitting this bug quite often.
(somehow I do not see this problem reported on bitbucket!)

last week I reworked the multithread locking of event buffers, in hope
that this bug will turn up, but nope, all mutexes and locking look okey,
except for a number of unrelated problems (races against bm_close_buffer()
were the most troublesome to fix).

but finally found the trouble.

first, some background.

because multiprocess locking is expensive, frontends that generate
a large number of small events can use the write cache to reduce
this overhead. instead of locking the shared memory event buffer for
each event, events are accumulated in the write cache, and periodic
calls to bm_flush_buffer() flush them to shared memory. For best effect,
one should increase the size of the write cache until lock rate is around
10/second.

it turns out introduction of multithreading broke bm_flush_cache().

it does this:

- int ask_free = pbuf->wp; // how much data we have in the write cache now
- call bm_wait_for_free_space(ask_free); // ensure we have this much free shared 
memory space
- copy pbuf->wp worth if events to shared memory

looks okey at first sight. this is what happens to trigger the bug:

- int ask_free = pbuf->wp; // ok
- call bm_wait_for_free_space(ask_free); // ok, but if shared memory is full, it 
will go to sleep waiting for free space
- in the mean time, another thread calls bm_send_event(), this adds more data to 
the write cache, moves pbuf->wp
- bm_wait_for_free_space() eventually returns
- copy pbuf->wp worth of data to shared memo KABOOM! shared memory corruption!

we just overwrote some unlucky event in shared memory: we only have "ask_free"
free bytes available, but pbuf->wp moved and now has more data,
and it does not fit, and there is no check against it.

of course in the single threaded world this bug did not exist, there was no 
other thread to call bm_send_event() while bm_flush_cache() is sleeping.

the obvious fix is to ask for more free space if cached data does not fit.

this is now implemented on the branch feature/buffer_mutex. after a bit more 
tested I will merge it into develop.

so that's it?

not so fast. there was more going on. as described, the bug will only happen
when shared memory event buffer is full. (i.e. rarely or never). It turns
out the old version of thread locking code was defective and permitted
a race between bm_send_event() and bm_send_event() in another thread:

thread 1: while (1) { bm_send_event(very small event); }
thread 2:
-> bm_send_event(very big event)
-> no space in the cache for the very big event, call bm_flush_cache()
-> bm_flush_cache() asks bm_wait_for_free_space() to make space for cached data
-> this was done with write cache mutex released (mistake!)
-> at the same time bm_send_event(very small event) added 1 more small event to 
the cache
-> back in bm_flush_cache() write cache mutex is locked correctly, we copy 
cached data to shared memory and again KABOOM because we now have more data than 
we asked free space for.

So in the original implementation, corruption was possible even when share 
memory event buffer was pretty much empty.

The reworked locking code closed that loop hole - bm_flush_cache() is now
called with write cache locked, and bm_send_event() from another thread
cannot confuse things, unless shared memory buffer is full and we go to
sleep inside bm_wait_for_free_space(). And this is now fixed, too.

K.O.
    Reply  22 Mar 2022, Stefan Ritt, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
Thanks Konstantin for your detailed description. 

I wonder why we never saw this problem at PSI. Here is the reason: In multil-threaded environments, we never call bm_send_event() directly 
from all threads (since in the old days nothing was thread safe in midas). Instead, we use a collector thread which gets all events via the 
rb_xxx functions from the individual readout threads. This is well integrated into the mfe.cxx framework. Look at examples/mtfe/mfte.cxx. 
Each thread does (simplified):

while (true) {
  do {
     status = rb_get_wp(&pevent);
  } while (status == DB_TIMEOUT)

  bm_compose_event_threadsafe(pevent, ..., &serial_number);
  bk_init32(pevent+1);
  ... fill event ...
  bk_close(pevent)

  rb_increment_wp(sizeof(EVENT_HEADER) + pevent->data_size);
}

The framework now collects all these events in receive_trigger_event() which runs in the main thread:

  for (i=0 ; i<n_thread ; i++) {
     rb_get_rp(i, pevent);
     if (pevent->serial_number == prev_serial+1)
        break;
  }
  
  prev_serial = pevent->serial_number;
  rpc_send_event(pevent);
  rb_increment_rp(sizeof(EVENT_HEADER) + pevent->data_size);

This code ensures that all events are in the right sequence (before the serial numbers where mixed up) and that all events are sent only 
from a single thread, so the write buffer can be used effectively without complicated multi-thread locks.

This solution works nicely at PSI since many years, maybe you should put some thought to use it in your tmfe framework in Alpha-g as well 
instead of struggling with all your locks.

Stefan
       Reply  23 Mar 2022, Konstantin Olchanski, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
I confirm, there is no problem in single-threaded programs, and
there is no problem if all bm_send_event() and bm_flush_cache() are called
from the same thread.

> ... instead of struggling with all your locks.

it is better to have midas fully thread safe. ODB has been so for a long time,
event buffer partially (except for this bug), now fully.

without that the problem still exists, because in many frontends,
bm_flush_buffer() is called from the main thread, and will race
against the "bm_send_event() thread". Of course you can do
everything on the main thread, but this opens you to RPC timeouts
during run transitions (if you sleep in bm_wait_for_free_space()).

also the SYSMSG buffer is subject to the same bug. cm_msg() is of course
safe to call from anywhere, but cm_msg_flush_buffer() and cm_periodic_tasks()
can be called from any thread, and they issue bm_send_event(SYSMSG),
and there will be mysterious crashes and SYSMSG corruptions, probably
only during message storms, but still!

K.O.
          Reply  24 Mar 2022, Stefan Ritt, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
> > ... instead of struggling with all your locks.
> 
> it is better to have midas fully thread safe. ODB has been so for a long time,
> event buffer partially (except for this bug), now fully.
> 
> without that the problem still exists, because in many frontends,
> bm_flush_buffer() is called from the main thread, and will race
> against the "bm_send_event() thread". Of course you can do
> everything on the main thread, but this opens you to RPC timeouts
> during run transitions (if you sleep in bm_wait_for_free_space()).

Just for the record: in the mfe.cxx framework both bm_send_event() and 
bm_flush_buffer() are called from the main thread, as can be seen in the 
midas/examples/mtfe/mtfe.cxx example.

But I agree that having all buffer operations thread safe is a clear benefit.

Stefan
    Reply  23 Mar 2022, Ivo Schulthess, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
Thanks for the investigation. Back in 2020, we had some issues of losing data between the system buffer and the logger writing them to disk (https://daq00.triumf.ca/elog-midas/Midas/1966). This was polled equipment but we had a multithreaded FE running at the same time. Could this be related to the same problem?

Best, Ivo
       Reply  24 Mar 2022, Konstantin Olchanski, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
> Thanks for the investigation. Back in 2020, we had some issues
> of losing data between the system buffer and the logger writing them
> to disk (https://daq00.triumf.ca/elog-midas/Midas/1966). This was polled equipment
> but we had a multithreaded FE running at the same time. Could this be related to the same problem?

I think we will have to follow up on your problem 1966 separately.

I think this bug cannot lose events. Writing events to the write cache has correct
locking, no loss here. writing write cache to shared memory has correct locking,
no loss there. the bug will cause the *next* event in the event buffer to be overwritten,
this will be detected by most programs as shared memory corruption and everybody
will quit. (mhttpd, mserver, odbedit will probably survive).

I guess there could be unlucky corruption that looks like nothing was corrupted,
but this will affect only a few events right at the shared memory read/write
pointer, it so happens that they are the oldest events in the buffer and likely
mlogger already wrote them to disk. mlogger read pointer will likely follow
the shared memory write pointer closely, well ahead of the shared memory
read pointer which always pointe to the older event and where this bug's corruption
will happen.

So no, I do not think this bug can cause event loss between frontend and mlogger.

K.O.
ELOG V3.1.4-2e1708b5