Entry  09 Nov 2021, Francesco Renga, Forum, Issue in data writing speed 
Dear all,
       I've a frontend writing a quite big bunch of data into a MIDAS bank (16bit output from a 4MP photo camera). 
I'm experiencing a writing speed problem that I don't understand. When the photo camera is triggered at a low rate (< 2 Hz) 
writing into the bank takes a very short time for each event (indeed, what I measure is the time to write and go back 
into the polling function). If I increase the rate to 4 Hz, I see that writing the first two events takes a short time, 
but the third event takes a very long time (hundreds of ms), then again the fourth and fifth events are very fast, and 
the sixth is very slow. If I further increase the rate, every other event is very slow. The problem is not in the readout 
of the camera, because if I just remove the bank writing and keep the camera readout, the problem disappears. Can you 
explain this behavior? Is there any way to improve it?

Below you can also find the code I use to copy the data from the camera buffer into the bank. If you have any suggestion 
to improve it, it would be really appreciated.

Thank you very much,
          Francesco



  const char* pSrc = (const char*)bufframe.buf;

  for(int y = 0; y < bufframe.height; y++ ){

    //copy one row: pDst walks along the source row
    //(despite its name, pDst is the copy source; the destination is pdata)
    const unsigned short* pDst = (const unsigned short*)pSrc;

    //go through the row
    for(int x = 0; x < bufframe.width; x++ ){

      WORD tmpData = *pDst++; 

      *pdata++ = tmpData;

    }

    pSrc += bufframe.rowbytes;

  }
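
For reference, the same copy can be done with one memcpy per row. A minimal sketch, assuming,
as in the code above, that each row holds width contiguous 16-bit pixels, consecutive rows
start rowbytes apart, and pdata is the WORD pointer obtained from bk_create():

  #include <cstring>   // memcpy; WORD is the 16-bit midas type from midas.h

  const char* pSrc = (const char*)bufframe.buf;

  for(int y = 0; y < bufframe.height; y++ ){
    //copy one full row of width 16-bit pixels in a single call
    memcpy(pdata, pSrc, bufframe.width * sizeof(WORD));
    pdata += bufframe.width;    //advance the bank pointer (WORD units)
    pSrc  += bufframe.rowbytes; //advance to the next row in the camera buffer
  }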
 
    Reply  10 Nov 2021, Stefan Ritt, Forum, Issue in data writing speed 
Midas uses various buffers (in the frontend, at the server side before the SYSTEM buffer, the SYSTEM buffer itself, and on the 
logger before writing to disk). All these buffers are in RAM and have fast access, so you can fill them pretty quickly. When
they are full, the logger writes to disk, which is slower. So I believe at 2 Hz your disk can keep up with your writing 
speed, but at 4 Hz (4 Mpixel x 2 bytes x 4 Hz = 32 MB/sec) your disk starts slowing down the writing process. Now 32 MB/sec is pretty slow for
a disk, so I presume you have turned compression on, which takes quite some time.

To verify this, disable logging. Then disable compression and keep logging. Then report back here again.
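
For example, with odbedit (a sketch; the paths assume the standard mlogger ODB tree, adjust the
channel number to your setup):

odbedit -c 'set "/Logger/Write data" n'                        # step 1: disable logging
odbedit -c 'set "/Logger/Write data" y'                        # step 2: re-enable logging...
odbedit -c 'set "/Logger/Channels/0/Settings/Compression" 0'   # ...but with compression off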

> Dear all,
>        I've a frontend writing a quite big bunch of data into a MIDAS bank (16bit output from a 4MP photo camera). 
> I'm experiencing a writing speed problem that I don't understand. When the photo camera is triggered at a low rate (< 2 Hz) 
> writing into the bank takes a very short time for each event (indeed, what I measure is the time to write and go back 
> into the polling function). If I increase the rate to 4 Hz, I see that writing the first two events takes a short time, 
> but the third event takes a very long time (hundreds of ms), then again the fourth and fifth events are very fast, and 
> the sixth is very slow. If I further increase the rate, every other event is very slow. The problem is not in the readout 
> of the camera, because if I just remove the bank writing and keep the camera readout, the problem disappears. Can you 
> explain this behavior? Is there any way to improve it?
> 
> Below you can also find the code I use to copy the data from the camera buffer into the bank. If you have any suggestion 
> to improve it, it would be really appreciated.
> 
> Thank you very much,
>           Francesco
> 
> 
> 
>   const char* pSrc = (const char*)bufframe.buf;
> 
>   for(int y = 0; y < bufframe.height; y++ ){
> 
>     //Copy one row
>     const unsigned short* pDst = (const unsigned short*)pSrc;
> 
>     //go through the row
>     for(int x = 0; x < bufframe.width; x++ ){
> 
>       WORD tmpData = *pDst++; 
> 
>       *pdata++ = tmpData;
> 
>     }
> 
>     pSrc += bufframe.rowbytes;
> 
>   }
>  
    Reply  26 Jan 2022, Konstantin Olchanski, Forum, Issue in data writing speed 
Francesco, when you say "writing an event is slow", do you mean it in the frontend
or in the output data file?

Stefan is quite right about the data file, it can take seconds between generating
an event in the frontend and seeing it written to the data file. (if compression
buffers are too big, an event can sit there forever, until pushed out by next events
or by run stop).

But maybe you see this on the frontend side.

What you are looking at is "real time" performance of the frontend and of the linux kernel.

The mfe.c frontend has many problems with real time performance, it can stall and take a long
time between calls to read_event(), for many reasons.

There are ways around that, but it is simpler to switch to the tmfe c++ frontend
that was designed for good real time performance.

In the tmfe frontend, if you use the polled equipment and enable the poll thread,
your frontend will be limited only by the linux kernel real time performance, i.e.
on a single-core CPU, other programs will delay execution of your frontend
and you will see it as long delays (usec, millisec) between calls to your read_event().

Next limit to real time performance (common to mfe.c and tmfe frontends) is the writing
of event data to the midas shared event buffer. One has to lock the shared memory semaphore
and this has to wait until other users of the event buffer finish their reading
or writing and unlock it. An arbitrary amount of time (usec, millisec, sec) can pass.
(there are also problems with the "fairness" of the linux semaphores, but that is a different story).

Making things more interesting, midas event buffers implement a write cache (default size 100 kbytes),
events smaller than the cache are quickly accumulated (no need to lock the shared memory semaphore),
then flushed to shared memory when the cache is full. This is done to reduce the number
of shared memory semaphore locks per event, in the case of very high rate of very small events.

The solution to all this is to use 2 threads: read the data from hardware in one thread and write the data to midas
in a different thread. Between the threads would be an event fifo (circular buffer in mfe.c,
std::deque<EVENT> in tmfe c++ frontends).
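
A minimal sketch of this two-thread scheme (plain C++; the function names and the two stubs
are illustrative, not the tmfe or mfe.c API):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

std::deque<std::vector<char>> fifo;  // event fifo between the two threads
std::mutex mtx;
std::condition_variable cv;

std::vector<char> read_hardware() { return std::vector<char>(16, 0); } // stub: read one event
void write_to_midas(const std::vector<char> &) {} // stub: would wrap the midas event buffer write

void readout_thread() // blocks only on the hardware, never on midas
{
   while (true) {
      std::vector<char> ev = read_hardware();
      {
         std::lock_guard<std::mutex> lock(mtx);
         fifo.push_back(std::move(ev));
      }
      cv.notify_one();
   }
}

void writer_thread() // may stall on the shared memory semaphore, the readout does not care
{
   while (true) {
      std::unique_lock<std::mutex> lock(mtx);
      cv.wait(lock, [] { return !fifo.empty(); });
      std::vector<char> ev = std::move(fifo.front());
      fifo.pop_front();
      lock.unlock();
      write_to_midas(ev);
   }
}

int main()
{
   std::thread t1(readout_thread), t2(writer_thread);
   t1.join();
   t2.join(); // this sketch runs forever
   return 0;
}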

For remotely connected frontends, things are a bit different. Event data is written directly
into the TCP socket and as long as socket buffers are big enough, there are no real-time delays,
unless SYSTEM buffer is very congested and mserver does not read the TCP socket quickly enough.
So depending on event size, data rate and tcp socket buffer size, the extra 2nd thread
may not be necessary and poll thread real time performance may be good enough.

I hope this clarifies the situation somewhat.

K.O.

> Dear all,
>        I've a frontend writing a quite big bunch of data into a MIDAS bank (16bit output from a 4MP photo camera). 
> I'm experiencing a writing speed problem that I don't understand. When the photo camera is triggered at a low rate (< 2 Hz) 
> writing into the bank takes a very short time for each event (indeed, what I measure is the time to write and go back 
> into the polling function). If I increase the rate to 4 Hz, I see that writing the first two events takes a short time, 
> but the third event takes a very long time (hundreds of ms), then again the fourth and fifth events are very fast, and 
> the sixth is very slow. If I further increase the rate, every other event is very slow. The problem is not in the readout 
> of the camera, because if I just remove the bank writing and keep the camera readout, the problem disappears. Can you 
> explain this behavior? Is there any way to improve it?
> 
> Below you can also find the code I use to copy the data from the camera buffer into the bank. If you have any suggestion 
> to improve it, it would be really appreciated.
> 
> Thank you very much,
>           Francesco
> 
> 
> 
>   const char* pSrc = (const char*)bufframe.buf;
> 
>   for(int y = 0; y < bufframe.height; y++ ){
> 
>     //Copy one row
>     const unsigned short* pDst = (const unsigned short*)pSrc;
> 
>     //go through the row
>     for(int x = 0; x < bufframe.width; x++ ){
> 
>       WORD tmpData = *pDst++; 
> 
>       *pdata++ = tmpData;
> 
>     }
> 
>     pSrc += bufframe.rowbytes;
> 
>   }
>  
       Reply  26 Jan 2022, Konstantin Olchanski, Forum, Issue in data writing speed 
> Francesco, when you say "writing an event is slow", do you mean it in the frontend
> or in the output data file?

Another explanation just occurred to me. We do not know your event size and we do not
know the size of your SYSTEM buffer. But if you have an unlucky combination,
this can happen:

Consider event size is 6 Mbytes, buffer size is 8 Mbytes, enough space for only 1 event.

The first event is written quickly (the buffer is empty).
The second event will be delayed: there is not enough free space in the buffer, so we have
to wait for mlogger to finish reading the first event.

The same thing happens if the event size is 3 Mbytes: the first 2 events will write quickly,
but writing the 3rd event will be delayed until mlogger does its thing.

The mlogger reads the SYSTEM buffer "fast" and "quickly", but it can be delayed
for a number of reasons, e.g. handling a history event, a delay writing to disk,
a delay writing to network connected storage, etc.

In general, it is best to size the SYSTEM buffer to hold about 1 second worth
of data (of average size, average rate). If your event size is 4 Mbytes, and you
record them at 10/sec, SYSTEM buffer should be at least 40 Mbytes big. (this is
set in ODB /Experiment/Buffer Sizes). (MIDAS event buffer size is limited to 2 GBytes).
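
For example, for the 4 Mbyte x 10/sec case above, with odbedit (assuming the key under
"/Experiment/Buffer Sizes" is named after the buffer):

odbedit -c 'set "/Experiment/Buffer Sizes/SYSTEM" 41943040'   # 40 MBytes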

K.O.
Entry  01 Dec 2021, Lars Martin, Bug Report, Off-by-one in sequencer documentation 
The documentation for the sequencer loop says:

<quote>
LOOP [name ,] n ... ENDLOOP	To execute a loop n times. For infinite loops, "infinite" 
can be specified as n. Optionally, the loop variable running from 0...(n-1) can be accessed 
inside the loop via $name.
</quote>

In fact the loop variable runs from 1...n, as can be seen by running this exciting 
sequencer code:

1 COMMENT "Figuring out MSL"
2 
3 LOOP n,4
4   MESSAGE $n,1
5 ENDLOOP
    Reply  02 Dec 2021, Stefan Ritt, Bug Report, Off-by-one in sequencer documentation 
> The documentation for the sequencer loop says:
> 
> <quote>
> LOOP [name ,] n ... ENDLOOP	To execute a loop n times. For infinite loops, "infinite" 
> can be specified as n. Optionally, the loop variable running from 0...(n-1) can be accessed 
> inside the loop via $name.
> </quote>
> 
> In fact the loop variable runs from 1...n, as can be seen by running this exciting 
> sequencer code:
> 
> 1 COMMENT "Figuring out MSL"
> 2 
> 3 LOOP n,4
> 4   MESSAGE $n,1
> 5 ENDLOOP

Indeed you're right. The loop variable runs from 1...n. I fixed that in the documentation.

Stefan
       Reply  26 Jan 2022, Konstantin Olchanski, Bug Report, Off-by-one in sequencer documentation 
> > 3 LOOP n,4
> > 4   MESSAGE $n,1
> > 5 ENDLOOP
> 
> Indeed you're right. The loop variable runs from 1...n. I fixed that in the documentation.

Shades/ghosts of FORTRAN. c/c++/perl/python loops loop from 0 to n-1.

K.O.
          Reply  26 Jan 2022, Stefan Ritt, Bug Report, Off-by-one in sequencer documentation 
> Shades/ghosts of FORTRAN. c/c++/perl/python loops loop from 0 to n-1.

   for (i=1 ; i<=10 ; i++);     ;-)
             Reply  26 Jan 2022, Konstantin Olchanski, Bug Report, Off-by-one in sequencer documentation 
> > Shades/ghosts of FORTRAN. c/c++/perl/python loops loop from 0 to n-1.
> 
>    for (i=1 ; i<=10 ; i++);     ;-)

Similar code made big news just recently: (scroll down to the example main() program)

https://blog.qualys.com/vulnerabilities-threat-research/2022/01/25/pwnkit-local-privilege-escalation-vulnerability-discovered-in-polkits-pkexec-cve-2021-4034

I forget if the FORTRAN rules were "loop once" or "never loop" or if it was different
between Fortran-4, fortran-77, DEC extensions and IBM extensions, or if it was a compiler switch.

We should check that we do something reasonable with loops to zero:

LOOP n,0
   MESSAGE $n,1
ENDLOOP

P.S. Yup. "man g77" option "-fonetrip".

K.O.
Entry  29 Oct 2021, Kushal Kapoor, Bug Report, Unknown Error 319 from client Screenshot_2021-10-26_114015.png
I’m trying to run MIDAS using a frontend code/client named “fetiglab”. The run stops 
after 2-3 sec with an error saying “Unknown error 319 from client “fetiglab” on 
localhost”.

The frontend code compiled without any errors and MIDAS reads the frontend 
successfully; this only happens when I start a new run in MIDAS. Here are a few 
more details from the terminal:

11:46:32 [fetiglab,ERROR] [odb.cxx:11268:db_get_record,ERROR] struct size 
mismatch for "/" (expected size: 1, size in ODB: 41920)

11:46:32 [Logger,INFO] Deleting previous file 
"/home/rcmp/online3/run00621_000.root"

11:46:32 [ODBEdit,ERROR] [midas.cxx:5073:cm_transition,ERROR] transition START 
aborted: client "fetiglab" returned status 319

11:46:32 [ODBEdit,ERROR] [midas.cxx:5246:cm_transition,ERROR] Could not start a 
run: cm_transition() status 319, message 'Unknown error 319 from client 
'fetiglab' on host "localhost"'

TR_STARTABORT transition: cleanup after failure to start a run


I’ve also enclosed a screenshot for the same, any suggestions would be highly 
appreciated. thanks
    Reply  26 Jan 2022, Konstantin Olchanski, Bug Report, Unknown Error 319 from client 
> I’m trying to run MIDAS using a frontend code/client named “fetiglab”. The run stops 
> after 2-3 sec with an error saying “Unknown error 319 from client “fetiglab” on 
> localhost”.

actually, the run never starts.

> 11:46:32 [fetiglab,ERROR] [odb.cxx:11268:db_get_record,ERROR] struct size 
> mismatch for "/" (expected size: 1, size in ODB: 41920)

this is the error that causes the run start to fail. for reasons unknown,
your frontend is trying to do a db_get_record() from "/" (the ODB root top directory).

if this is an mfe.c frontend, I do not think I have ever seen it do something
like this.

so, a puzzle.

K.O.
Entry  22 Oct 2021, Francesco Renga, Forum, mhttpd error 
Dear all,
      I am trying to make the MIDAS web server for my DAQ project accessible from other machines. In the ODB, I activated the necessary flags:

[local:CYGNUS_RD:S]/WebServer>ls
Enable localhost port           y
localhost port                  8080
localhost port passwords        n
Enable insecure port            y
insecure port                   8081
insecure port passwords         y
insecure port host list         y
Enable https port               y
https port                      8443
https port passwords            y
https port host list            n
Host list
                                localhost
Enable IPv6                     y
Proxy                           
mime.types

Following the instructions on the Wiki I enabled the SSL support. When running mhttpd, I get these messages:

Mongoose web server will use HTTP Digest authentication with realm "CYGNUS_RD" and password file "/home/cygno/DAQ/online/htpasswd.txt"
Mongoose web server will use the hostlist, connections will be accepted only from: localhost
Mongoose web server listening on http address "localhost:8080", passwords OFF, hostlist OFF
Mongoose web server listening on http address "[::1]:8080", passwords OFF, hostlist OFF
Mongoose web server listening on http address "8081", passwords enabled, hostlist enabled
[mhttpd,ERROR] [mhttpd.cxx:19166:mongoose_listen,ERROR] Cannot mg_bind address "[::0]:8081"
Mongoose web server will use https certificate file "/home/cygno/DAQ/online/ssl_cert.pem"
Mongoose web server listening on https address "8443", passwords enabled, hostlist OFF
[mhttpd,ERROR] [mhttpd.cxx:19166:mongoose_listen,ERROR] Cannot mg_bind address "[::0]:8443"

and the server is not accessible from other machines. Any suggestion on how to solve or better investigate this problem?

Thank you very much,
         Francesco
    Reply  22 Oct 2021, Stefan Ritt, Forum, mhttpd error 
> Enable IPv6                     y

Probably the IPv6 problem, see here elog:2269

I asked to turn off IPv6 by default, or at least mention this in the documentation,
but unfortunately nothing happened.
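
In the meantime, the workaround is to turn it off by hand and restart mhttpd, using the key
shown in the ODB listing above:

odbedit -c 'set "/WebServer/Enable IPv6" n'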

Stefan
       Reply  25 Oct 2021, Francesco Renga, Forum, mhttpd error 
It worked, thank you very much!

Francesco

> > Enable IPv6                     y
> 
> Probably the IPv6 problem, see here elog:2269
> 
> I asked to turn off IPv6 by default, or at least mention this in the documentation,
> but unfortunately nothing happened.
> 
> Stefan
       Reply  26 Jan 2022, Konstantin Olchanski, Forum, mhttpd error 
> > Enable IPv6                     y
> 
> Probably the IPv6 problem, see here elog:2269
> 
> I asked to turn off IPv6 by default, or at least mention this in the documentation,
> but unfortunately nothing happened.

But the IPv4 and IPv6 code is completely separate; if the IPv6 bind fails, IPv4 should still 
work.

This is all very strange.

It does not help that the OP does not say in which way things do not work,
"the server is not accessible from other machines" is not an error message
reported by any browser, and we do not know what URL he is using
to access mhttpd - http: or https:

Also he is enabling the "insecure" port 8081, I am pretty sure the documentation
is pretty clear, either use the secure https port or the insecure port,
but not both at the same time.

In any case, I see current versions of mongoose have removed support
for password files, so all this stuff will likely be reworked
and at the end mhttpd will only listen to localhost ports. To make it "accessible
to other machines", one will have to use the apache https proxy. (or mtpcproxy from 
midas).

K.O.
Entry  29 Jan 2022, Isaac Labrie Boulay, Forum, MIDAS and GRIF-16 digitizer (Standalone Mode). 
Hi all,

I was sent a version of the frontend for the TIGRESS Detector lab setup so that 
I can test detectors using a GRIF-16 digitizer in standalone mode.

I followed the GRIF-16 wiki (https://grsi.wiki.triumf.ca/wiki/GRIF-16#One-level_operation) 
to set up the GRIF-16 through the webpage. The digitized data is 
supposed to come into my UDP port 8800 but it is never retrieved in the 
frontend.

Here's the readout scheme:
// readout sequence ...
// poll_event() true (if still have data in buffer or testmsg() true)
// -> read_trigger_event() -> read_grifc_event() - re-buffers into midas events
//                         -> grifc_eventread()  - returns single grif fragment
//                         -> grifc_dataread()   - returns single net-pkt 


Here's poll_event():
INT poll_event(INT source, INT count, BOOL test)
{
   int i, have_data=0;

   for(i=0; i<count; i++){
      if( data_available ){ break; }
      have_data = ( testmsg(data_socket, 0) > 0 );
      if( have_data && !test ){ break; }
   }
   return( (have_data || data_available) && !test );
}

This being said, testmsg() always returns empty and "data_available" is only set 
to TRUE when there's leftover data after a GRIF-C reading (I'm obviously not 
using a GRIF-C).

I know that when the GRIF-16 is in standalone mode, MIDAS does not change the GRIF-16's 
settings based on the ODB; it has to be done through the GRIF-16 webpage. Is the 
user frontend code even responsible for the GRIF-16 data readout in standalone 
mode? If not, could it just be that my UDP offloader is incorrectly set up?

Here are its current settings:

SETTINGS/UDP
- Offloader: ON
- Dst IP: my IP
- Dst Port: 8800 (DATA_PORT)

SETTINGS/MIDAS
- Use MIDAS: OFF
- MIDAS Hostname: my hostname
- MIDAS IP: same as Dst IP from UDP settings
- Dst Port: 8080 (I'm assuming that this is the mhttpd port)

Again, the frontend runs but I get 0 events. What might I be missing?

Thanks for helping me out!

Isaac
Entry  26 Jan 2022, Frederik Wauters, Forum, .gz files 
I adapted our analyzer to compile against the manalyzer included in the midas repo.

All our data files are .mid.gz, which now fail to process :(

frederik@frederik-ThinkPad-T550:~/new_daq/build/analyzer$ ./analyzer -e100 -s100 ../../run_backup_11783.mid.gz 
...
...n
Registered modules: 1
file[0]: ../../run_backup_11783.mid.gz
Setting up the analysis!
TMReadEvent: error: short read 0 instead of -1193512213

This error comes from the TMEvent* TMReadEvent(TMReaderInterface* reader) function in the midasio.cxx file.

Reading the unzipped files works. But we have always processed our .gz files directly; for the unzipping we would need ~2x the disk space.

Am I doing something wrong? I see that there is some activity on lz4 in the midasio repo, is gunzip next?
    Reply  26 Jan 2022, Konstantin Olchanski, Forum, .gz files 
> I adapted our analyzer to compile against the manalyzer included in the midas repo.
> TMReadEvent: error: short read 0 instead of -1193512213

I think this problem is fixed in the latest version of midasio and manalyzer, but this update
was not pulled into midas yet. (Canada is in the middle of a covid wave since December).

What happens is you do not have the gzip library installed on your computer and
your analyzer is built without support for gzip.

The fix is done the hard way, the gzip library is no longer optional, but required.

You do not say what linux you use, so I cannot give exact instructions, but for:
ubuntu: apt -y install libz-dev
centos7: installed by default
centos8: installed by default
debian11/raspbian: same as ubuntu

K.O.
       Reply  31 Jan 2022, Frederik Wauters, Forum, .gz files 
> > I adapted our analyzer to compile against the manalyzer included in the midas repo.
> > TMReadEvent: error: short read 0 instead of -1193512213
> 
> I think this problem is fixed in the latest version of midasio and manalyzer, but this update
> was not pulled into midas yet. (Canada is in the middle of a covid wave since December).
> 
> What happens is you do not have the gzip library installed on your computer and
> your analyzer is built without support for gzip.
> 
> The fix is done the hard way, the gzip library is no longer optional, but required.
> 
> You do not say what linux you use, so I cannot give exact instructions, but for:
> ubuntu: apt -y install libz-dev
> centos7: installed by default
> centos8: installed by default
> debian11/raspbian: same as ubuntu
> 
> K.O.

My libz under ubuntu

-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11") 
-- MIDAS: Found ZLIB version 1.2.11

I got both the manalyzer example and mine going with
* the latest midas dev
* + the latest manalyzer (cf6c233)
* + almost the latest midasio (568a617; otherwise I get a linking error:

./libmidas.a(midasio.cxx.o): In function `Lz4Error(int)':
midasio.cxx:(.text+0x359): undefined reference to `MLZ4F_getErrorName(unsigned long)'

So this works, I will assume that in the near future this all will come together in the standard midas release.

thanks  
Entry  07 Feb 2022, Konstantin Olchanski, Forum, MidasWiki moved from ladd00 to daq00.triumf.ca and updated to MediaWiki 1.35 
MidasWiki moved from ladd00 (obsolete SL6) to daq00.triumf.ca (Ubuntu LTS 20.04) 
and updated from obsolete MediaWiki LTS 1.27.7 to MediaWiki LTS 1.35, supported 
until mid-2023, see https://www.mediawiki.org/wiki/Version_lifecycle

Old URL https://midas.triumf.ca and https://midas.triumf.ca/MidasWiki/...
redirect to new URL https://daq00.triumf.ca/MidasWiki/index.php/Main_Page

All old links and bookmarks should continue to work (via redirect).

To report problems with this MediaWiki instance and to request
any changes in configuration or installed extensions, please reply to this
message here.

K.O.
Entry  30 Sep 2021, Francesco Renga, Forum, OPC client within MIDAS 
Dear all,
     I need to integrate in my MIDAS project the communication with an OPC UA 
server. My plan is to develop an OPC UA client as a "device" in 
midas/drivers/device.

Two questions:

1) Is anybody aware of some similar effort for some other project, so that I can 
get some example?

2) What would be the most appropriate driver class to use? generic.cxx? 
multi.cxx?

Thank you for your help,
           Francesco
    Reply  10 Feb 2022, Francesco Renga, Forum, OPC client within MIDAS opc.cxx opc.h
Dear all,
        I finally succeeded in getting a working driver for the communication with an OPC 
UA server. It is based on the open62541 library and I use it in combination with the 
generic.h driver class. This is still a crude implementation, but let me post it here, 
maybe it can be useful to somebody else.

BTW, if there is somebody more skilled than me with OPC UA and MIDAS drivers, who is 
willing to give suggestions for improving the implementation, it would be extremely 
appreciated.
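
For orientation, the read path boils down to something like this minimal open62541 client
sketch (the endpoint and the node id are placeholders, not the values used in the attached
code):

#include <open62541/client_config_default.h>
#include <open62541/client_highlevel.h>
#include <stdio.h>

int main(void)
{
   UA_Client *client = UA_Client_new();
   UA_ClientConfig_setDefault(UA_Client_getConfig(client));

   /* endpoint is a placeholder */
   if (UA_Client_connect(client, "opc.tcp://localhost:4840") != UA_STATUSCODE_GOOD) {
      UA_Client_delete(client);
      return 1;
   }

   UA_Variant value;
   UA_Variant_init(&value);
   /* node id "ns=1;s=the.answer" is a placeholder */
   UA_StatusCode rc = UA_Client_readValueAttribute(
      client, UA_NODEID_STRING(1, (char *)"the.answer"), &value);
   if (rc == UA_STATUSCODE_GOOD &&
       UA_Variant_hasScalarType(&value, &UA_TYPES[UA_TYPES_FLOAT]))
      printf("value = %f\n", *(UA_Float *)value.data);

   UA_Variant_clear(&value);
   UA_Client_disconnect(client);
   UA_Client_delete(client);
   return 0;
}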

Best Regards,
      Francesco



> Dear all,
>      I need to integrate in my MIDAS project the communication with an OPC UA 
> server. My plan is to develop an OPC UA client as a "device" in 
> midas/drivers/device.
> 
> Two questions:
> 
> 1) Is anybody aware of some similar effort for some other project, so that I can 
> get some example?
> 
> 2) What could be the more appropriate driver's class to be used? generic.cxx? 
> multi.cxx?
> 
> Thank you for your help,
>            Francesco
Entry  28 May 2021, Joseph McKenna, Bug Report, History plots deceiving users into thinking data is still logging flatline.png
I have been trying to fix this myself but my javascript isn't strong... The 'new' history plot render fills in missing data with the last ODB value, even when this value is very old. elog:2180/1 shows this: the data logging stopped, but the history plot can fool users into thinking data is logging (the export button generates CSVs with entries every 10 seconds also).

Grepping through the history files behind the scenes, I found only one match for an example variable from this plot, so it looks like there are no entries after March 24th (although I may be mistaken, I've not studied the history files' data structure in detail), i.e. this is an artifact from mhistory.js rather than the mlogger... Have I missed something simple?

Would it be possible to not draw the line if there are no datapoints in a significant time? Or maybe render a dashed line that doesn't export to CSV? Thanks in advance.

Edit: I see certificate errors on this forum and I think it's preventing me uploading an image... inlining it into the text here:
    Reply  28 May 2021, Stefan Ritt, Bug Report, History plots deceiving users into thinking data is still logging 

This is a known problem and I'm working on. See the discussion at: 

https://bitbucket.org/tmidas/midas/issues/305/log_history_periodic-doesnt-account-for

Stefan

       Reply  02 Jun 2021, Konstantin Olchanski, Bug Report, History plots deceiving users into thinking data is still logging 
https://bitbucket.org/tmidas/midas/issues/305/log_history_periodic-doesnt-account-for

this problem is a blocker for the next midas release.

the best I can tell, current development version of midas writes history data incorrectly,
but I do not have time to look at it at this moment.

I recommend that people use the latest released version, midas-2020-12. (this is what we have on alphag and 
should have in alpha2).

midas-2020-12 uses mlogger from midas-2020-08.

If I cannot find time to figure out what is going on in the mlogger,
the next release may have to be done the same way (with mlogger from midas-2020-08).

K.O.
          Reply  10 Feb 2022, Stefan Ritt, Bug Report, History plots deceiving users into thinking data is still logging 
The problem has been fixed in commit 825935dc of Oct. 2021 and has run fine since then at PSI. If TRIUMF people 
agree, we can close that issue and proceed.

Stefan
Entry  02 Dec 2021, Alexey Kalinin, Bug Report, some frontend kicked by cm_periodic_tasks 
Hello,
We have a small experiment with a MIDAS-based DAQ.
The Status page shows:
ES	ESFrontend@192.168.0.37	207	0.2	0.000
Trigger06	Sample Frontend06@192.168.0.37	1.297M	0.3	0.000
Trigger01	Sample Frontend01@192.168.0.37	1.297M	0.3	0.000
Trigger16	Sample Frontend16@192.168.0.37	1.297M	0.3	0.000
Trigger38	Sample Frontend38@192.168.0.37	1.297M	0.3	0.000
Trigger37	Sample Frontend37@192.168.0.37	1.297M	0.3	0.000
Trigger03	Sample Frontend03@192.168.0.38	1.297M	0.3	0.000
Trigger07	Sample Frontend07@192.168.0.38	1.297M	0.3	0.000
Trigger04	Sample Frontend04@192.168.0.38	59898	0.0	0.000
Trigger08	Sample Frontend08@192.168.0.38	59898	0.0	0.000
Trigger17	Sample Frontend17@192.168.0.38	59898	0.0	0.000


And the SYSTEM buffers page shows:
ESFrontend	1968	198	47520	0	0x00000000	0		193 ms
Sample Frontend06	1332547	1330826	379729872	0	0x00000000	0		1.1 sec
Sample Frontend16	1332542	1330839	361988208	0	0x00000000	0		94 ms
Sample Frontend37	1332530	1330841	337798408	0	0x00000000	0		1.1 sec
Sample Frontend01	1332543	1330829	467136688	0	0x00000000	0		34 ms
Sample Frontend38	1332528	1330830	291453608	0	0x00000000	0		1.1 sec
Sample Frontend04	63254	61467	20882584	0	0x00000000	0		208 ms
Sample Frontend08	63262	61476	27904056	0	0x00000000	0		205 ms
Sample Frontend17	63271	61473	20433840	0	0x00000000	0		213 ms
Sample Frontend03	1332549	1330818	386821728	0	0x00000000	0		82 ms
Sample Frontend07	1332554	1330821	462210896	0	0x00000000	0		37 ms
Logger	968742	0w+9500418r	0w+2718405736r	0	0x00000000	0	GET_ALL Used 0 bytes 0.0%	303 ms
rootana	254561	0w+29856958r	0w+8718288352r	0	0x00000000	0		762 ms


The problem is that eventually some of the frontends are closed with the message:
19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer 
'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist

In the meantime the mserver is logging:
mserver started interactively
mserver will listen on TCP port 1175
double free or corruption (!prev)
double free or corruption (!prev)
free(): invalid next size (normal)
double free or corruption (!prev)


I can find some correlation with the number of events/event size produced by the 
frontend: it fails when they become big enough.

The frontend scheme is like this:

poll event time set to 0;

poll_event{
//if buffer not transferred yet, return (continue cutting the main buffer)
//else read main buffer from hardware
//mark buffer not transferred
}

read event{
//cut the main buffer into subevents (cut one event from the main buffer), return;
//if (last subevent) {mark buffer transferred; return}
}

What is strange to me is that 2 frontends (1 per remote PC) are causing this.

Also, I'm executing one frontend code with the -i # flag, setting the event ID in 
frontend_init, and using the SYSTEM buffer for all.

Is there something I'm missing?
Thanks. 
A.
    Reply  26 Jan 2022, Konstantin Olchanski, Bug Report, some frontend kicked by cm_periodic_tasks 
> The problem is that eventually some of frontend closed with message 
> :19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer 
> 'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist

This message means what it says. A client was registered with the SYSMSG buffer and this 
client had pid 9789. At some point some other client (rootana, in this case) checked it and 
process pid 9789 was no longer running. (it then proceeded to remove the registration).

There are 2 possibilities:
- simplest: your frontend has crashed. best to debug this by running it inside gdb, wait for 
the crash.
- unlikely: reported pid is bogus, real pid of your frontend is different, the client 
registration in SYSMSG is corrupted. this would indicate massive corruption of midas shared 
memory buffers, not impossible if your frontend misbehaves and writes to random memory 
addresses. ODB has protection against this (normally turned off, easy to enable, set ODB 
"/experiment/protect odb" to yes), shared memory buffers do not have protection against this 
(should be added?).
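
For example, to turn on the ODB protection mentioned above:

odbedit -c 'set "/Experiment/Protect ODB" y'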

Do this. When you start your frontend, write down its pid; when you see the crash message, 
confirm pid number printed is the same. As additional test, run your frontend inside gdb, 
after it crashes, you can print the stack trace, etc.

> 
> in the meantime mserver loggging :
> mserver started interactively
> mserver will listen on TCP port 1175
> double free or corruption (!prev)
> double free or corruption (!prev)
> free(): invalid next size (normal)
> double free or corruption (!prev)
> 

Are these "double free" messages coming from the mserver or from your frontend? (i.e. you run 
them in different terminals, not all in the same terminal?).

If messages are coming from the mserver, this confirms possibility (1),
except that for frontends connected remotely, the pid is the pid of the mserver,
and what we see are crashes of mserver, not crashes of your frontend. These are much harder to 
debug.

You will need to enable core dumps (ODB /Experiment/Enable core dumps set to "y"),
confirm that core dumps work (i.e. "killall -SEGV mserver", observe core files are created
in the directory where you started the mserver), reproduce the crash, run "gdb mserver 
core.NNNN", run "bt" to print the stack trace, post the stack trace here (or email to me 
directly).

>
> I can find some correlation between number of events/event size produced by 
> frontend, cause its failed when its become big enough. 
> 

There is no limit on event size or event rate in midas, you should not see any crash
regardless of what you do. (there is a limit of event size, because an event has
to fit inside an event buffer and event buffer size is limited to 2 GB).

Obviously you hit a bug in mserver that makes it crash. Let's debug it.

One thing to try is to set the write cache size to zero and see if your crash goes away. I see
some indication of something rotten in the event buffer code if write cache is enabled. This
is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (beware recent confusion
where odb settings have no effect depending on value of "equipment_common_overwrite").

>
> frontend scheme is like this:
> 

Best if you use the tmfe c++ frontend, event data handling is much simpler and we do not
have to debug the convoluted old code in mfe.c.

K.O.

>
> poll event time set to 0;
> 
> poll_event{
> //if buffer not transferred return (continue cutting the main buffer)
> //read main buffer from hardware
> //buffer not transfered
> }
> 
> read event{
> // cut the main buffer to subevents (cut one event from main buffer) return;
> //if (last subevent) {buffer transfered ;return}
> }
> 
> What is strange to me that 2 frontends (1 per remote pc) causing this.
> 
> Also, I'm executing one FEcode with -i # flag , put setting eventid in 
> frontend_init , and using SYSTEM buffer for all.
> 
> Is there something I'm missing?
> Thanks. 
> A.
       Reply  11 Feb 2022, Alexey Kalinin, Bug Report, some frontend kicked by cm_periodic_tasks 
Thanks for the answer.
As soon as I can (possibly in a month) I'll try the suggestion below:

> One thing to try is set the write cache size to zero and see if your crash goes away. I see
> some indication of something rotten in the event buffer code if write cache is enabled. This
> is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (beware recent confusion
> where odb settings have no effect depending on value of "equipment_common_overwrite").

I tried to change this ODB key for one of the frontends via mhttpd/browser, and eventually it goes back 
to the default value (1000 as I remember). But this frontend has the minimum rate, 50 DWORDs/~10 sec, and 
depending on the cache size it appears in mdump once per 31 events, but all of them arrive. So it is a 
different story, but maybe it has the same solution: play with the Write Cache Size.

The "double free" messages come from the mserver terminal.
All of the frontends are remote.
I can't exclude crashes of the frontend, but when I run ./frontend -i 1 (2, 3, etc.) that means I run 
one code for all, and only several cause the crash. Also, I found that the crash in the frontend happened 
while it was doing nothing with the collected data (last event reached and new data not ready yet), but it 
was trying to watch for ODB changes. I mean it crashes inside (while {odb_changes(value in watchdog)}), and 
I don't know what else happened meanwhile with the cached buffer.

The future plan is to use the event builder for the frontends when the data/signals will be perfectly 
reasonable, i.e. without broken events. For now I rather worry about whether one of the frontends will 
skip one of the events inside its buffer.

Thanks for the way to dig into this.
A.

> > The problem is that eventually some of frontend closed with message 
> > :19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer 
> > 'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist
> 
> This messages means what it says. A client was registered with the SYSMSG buffer and this 
> client had pid 9789. At some point some other client (rootana, in this case) checked it and 
> process pid 9789 was no longer running. (it then proceeded to remove the registration).
> 
> There is 2 possibilities:
> - simplest: your frontend has crashed. best to debug this by running it inside gdb, wait for 
> the crash.
> - unlikely: reported pid is bogus, real pid of your frontend is different, the client 
> registration in SYSMSG is corrupted. this would indicate massive corruption of midas shared 
> memory buffers, not impossible if your frontend misbehaves and writes to random memory 
> addresses. ODB has protection against this (normally turned off, easy to enable, set ODB 
> "/experiment/protect odb" to yes), shared memory buffers do not have protection against this 
> (should be added?).
> 
> Do this. When you start your frontend, write down it's pid, when you see the crash message, 
> confirm pid number printed is the same. As additional test, run your frontend inside gdb, 
> after it crashes, you can print the stack trace, etc.
> 
> > 
> > in the meantime mserver loggging :
> > mserver started interactively
> > mserver will listen on TCP port 1175
> > double free or corruption (!prev)
> > double free or corruption (!prev)
> > free(): invalid next size (normal)
> > double free or corruption (!prev)
> > 
> 
> Are these "double free" messages coming from the mserver or from your frontend? (i.e. you run 
> them in different terminals, not all in the same terminal?).
> 
> If messages are coming from the mserver, this confirms possibility (1),
> except that for frontends connected remotely, the pid is the pid of the mserver,
> and what we see are crashes of mserver, not crashes of your frontend. These are much harder to 
> debug.
> 
> You will need to enable core dumps (ODB /Experiment/Enable core dumps set to "y"),
> confirm that core dumps work (i.e. "killall -SEGV mserver", observe core files are created
> in the directory where you started the mserver), reproduce the crash, run "gdb mserver 
> core.NNNN", run "bt" to print the stack trace, post the stack trace here (or email to me 
> directly).
> 
> >
> > I can find some correlation between number of events/event size produced by 
> > frontend, cause its failed when its become big enough. 
> > 
> 
> There is no limit on event size or event rate in midas, you should not see any crash
> regardless of what you do. (there is a limit of event size, because an event has
> to fit inside an event buffer and event buffer size is limited to 2 GB).
> 
> Obviously you hit a bug in mserver that makes it crash. Let's debug it.
> 
> One thing to try is set the write cache size to zero and see if your crash goes away. I see
> some indication of something rotten in the event buffer code if write cache is enabled. This
> is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (beware recent confusion
> where odb settings have no effect depending on value of "equipment_common_overwrite").
> 
> >
> > frontend scheme is like this:
> > 
> 
> Best if you use the tmfe c++ frontend, event data handling is much simpler and we do not
> have to debug the convoluted old code in mfe.c.
> 
> K.O.
> 
> >
> > poll event time set to 0;
> > 
> > poll_event{
> > //if buffer not transferred return (continue cutting the main buffer)
> > //read main buffer from hardware
> > //buffer not transfered
> > }
> > 
> > read event{
> > // cut the main buffer to subevents (cut one event from main buffer) return;
> > //if (last subevent) {buffer transfered ;return}
> > }
> > 
> > What is strange to me that 2 frontends (1 per remote pc) causing this.
> > 
> > Also, I'm executing one FEcode with -i # flag , put setting eventid in 
> > frontend_init , and using SYSTEM buffer for all.
> > 
> > Is there something I'm missing?
> > Thanks. 
> > A.
Entry  08 Feb 2022, jago aartsen, Bug Fix, ODBINC/Sequencer Issue mslerror.PNG
Hi all,

I am having some issues getting the ODBINC command to work within the MIDAS 
sequencer. I am trying to increment one of the ODB values but it is returning a 
data-type size mismatch error (see attached image).

All the ODB variables are MIDAS data-type FLOAT and should all be 32-bit values, 
but for some reason MIDAS is thinking they are 4-bit values. I have tried creating 
new ODB keys of type INT32, UINT32 and DOUBLE but they all give the same error.

If anybody has any suggestions I would really appreciate some help:)

Thanks



 
    Reply  08 Feb 2022, Konstantin Olchanski, Bug Fix, ODBINC/Sequencer Issue 
Please post the output of odbedit "ls -l" for /eq/ar.../variables. (you posted the 
variable name as an image, and I cannot cut-and-paste the odb path!). BTW data size 4 is 
correct, 4 bytes for INT32/UINT32/FLOAT. For DOUBLE it should be 8. For you it prints 32 
and this is wrong, we need to see the output of "ls -l".
K.O.
       Reply  09 Feb 2022, jago aartsen, Bug Fix, ODBINC/Sequencer Issue 
> Please post the output of odbedit "ls -l" for /eq/ar.../variables. (you posted the 
> variable name as an image, and I cannot cut-and-paste the odb path!). BTW data size 4 is 
> correct, 4 bytes for INT32/UINT32/FLOAT. For DOUBLE it should be 8. For you it prints 32 
> and this is wrong, we need to see the output of "ls -l".
> K.O.

Hi,

Thanks for getting back to me regarding this. The output of "ls -l" is:

[local:mu3eMSci:S]/>cd Equipment/ArduinoTestStation/Variables
[local:mu3eMSci:S]Variables>ls -l
Key name                        Type    #Val  Size  Last Opn Mode Value
---------------------------------------------------------------------------
_T_                             FLOAT   1     4     1h   0   RWD  20.93
_F_                             FLOAT   1     4     1h   0   RWD  12.8
_P_                             FLOAT   1     4     1h   0   RWD  56
_S_                             FLOAT   1     4     1h   0   RWD  5
_H_                             FLOAT   1     4     60h  0   RWD  44.74
_B_                             FLOAT   1     4     60h  0   RWD  18.54
_A_                             FLOAT   1     4     1h   0   RWD  14.41
_RH_                            FLOAT   1     4     1h   0   RWD  41.81
_AT_                            FLOAT   1     4     1h   0   RWD  20.46
SP                              INT16   1     2     1h   0   RWD  10

Many Thanks
Jago
          Reply  09 Feb 2022, Konstantin Olchanski, Bug Fix, ODBINC/Sequencer Issue 
> 
> [local:mu3eMSci:S]/>cd Equipment/ArduinoTestStation/Variables
> [local:mu3eMSci:S]Variables>ls -l
> Key name                        Type    #Val  Size  Last Opn Mode Value
> ---------------------------------------------------------------------------
> _T_                             FLOAT   1     4     1h   0   RWD  20.93
> _F_                             FLOAT   1     4     1h   0   RWD  12.8
> _P_                             FLOAT   1     4     1h   0   RWD  56
> _S_                             FLOAT   1     4     1h   0   RWD  5
> _H_                             FLOAT   1     4     60h  0   RWD  44.74
> _B_                             FLOAT   1     4     60h  0   RWD  18.54
> _A_                             FLOAT   1     4     1h   0   RWD  14.41
> _RH_                            FLOAT   1     4     1h   0   RWD  41.81
> _AT_                            FLOAT   1     4     1h   0   RWD  20.46
> SP                              INT16   1     2     1h   0   RWD  10
> 

This looks okay, so we still have no explanation for your error. Please post your sequencer 
script.

K.O.
             Reply  10 Feb 2022, Stefan Ritt, Bug Fix, ODBINC/Sequencer Issue 
I tried following script:

ODBSET /Equipment/ArduinoTestStation/Variables/_S_, 10

LOOP 10
  WAIT seconds, 3
  ODBINC /Equipment/ArduinoTestStation/Variables/_S_
ENDLOOP

and it worked as expected. So I conclude the problem must be in your script. Probably a typo in 
the ODB path, pointing to a 32-byte string instead of to a 4-byte float.

Stefan  
                Reply  14 Feb 2022, jago aartsen, Bug Fix, ODBINC/Sequencer Issue 
> I tried following script:
> 
> ODBSET /Equipment/ArduinoTestStation/Variables/_S_, 10
> 
> LOOP 10
>   WAIT seconds, 3
>   ODBINC /Equipment/ArduinoTestStation/Variables/_S_
> ENDLOOP
> 
> and it worked as expected. So I conclude the problem must be in your script. Probably a typo in 
> the ODB path pointing to a 32-byte string instead to a 4-byte float.
> 
> Stefan  

Hi Stefan,

Cheers for the reply. I believe the syntax we are using is correct. I have tried copying the script 
you used above and it results in the same error as before. Perhaps something is going wrong 
elsewhere - I will take another look today.

Jago
                   Reply  14 Feb 2022, Stefan Ritt, Bug Fix, ODBINC/Sequencer Issue 
Just post here a minimal script which produces the error, so that I can try myself.

... and make sure that you have the latest develop version of midas.

Stefan
                      Reply  14 Feb 2022, jago aartsen, Bug Fix, ODBINC/Sequencer Issue 
> Just post here a minimal script which produces the error, so that I can try myself.
> 
> ... and make sure that you have the latest develop version of midas.
> 
> Stefan

Here is the simplest script which produces the error:

WAIT seconds, 3
ODBINC /Equipment/ArduinoTestStation/Variables/_S_

I noticed that "Jacob Thorne"  in the forum had the same issue as us in Novemeber last 
year. Indeed we have not installed any later versions of MIDAS since then so we will 
double check we have the latest version.

Jago
                         Reply  14 Feb 2022, Stefan Ritt, Bug Fix, ODBINC/Sequencer Issue 
> I noticed that "Jacob Thorne" in the forum had the same issue as us in November last 
> year. Indeed we have not installed any later versions of MIDAS since then, so we will 
> double check we have the latest version.

As you see from my reply to Jacob, the bug has been fixed in midas since then, so just 
update.

Stefan
                            Reply  14 Feb 2022, jago aartsen, Bug Fix, ODBINC/Sequencer Issue 
> > I noticed that "Jacob Thorne" in the forum had the same issue as us in November last 
> > year. Indeed we have not installed any later versions of MIDAS since then, so we will 
> > double check we have the latest version.
> 
> As you see from my reply to Jacob, the bug has been fixed in midas since then, so just 
> update.
> 
> Stefan

We have tried updating using both:

git submodule update --init --recursive

and:

git pull --recurse-submodules

But the error still persists. Is there another way to update which we are missing?

Cheers
Jago
                               Reply  15 Feb 2022, Stefan Ritt, Bug Fix, ODBINC/Sequencer Issue 
> But the error still persists. Is there another way to update which we are missing?

The bug was definitively fixed in this modification:

https://bitbucket.org/tmidas/midas/commits/5f33f9f7f21bcaa474455ab72b15abc424bbebf2

You probably forgot to compile/install correctly after your pull. If you start "odbedit" and do 
a "ver", you see which git revision you are currently running. Make sure to get this output:

MIDAS version:      2.1
GIT revision:       Fri Feb 11 08:56:02 2022 +0100 - midas-2020-08-a-509-g585faa96 on branch 
develop
ODB version:        3
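
If in doubt, a typical update sequence is (assuming the standard cmake build described in the
midas documentation):

cd midas
git pull --recurse-submodules
mkdir -p build && cd build
cmake ..
make install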


Stefan
                                  Reply  16 Feb 2022, jago aartsen, Bug Fix, ODBINC/Sequencer Issue 
> > But the error still persists. Is there another way to update which we are missing?
> 
> The bug was definitively fixed in this modification:
> 
> https://bitbucket.org/tmidas/midas/commits/5f33f9f7f21bcaa474455ab72b15abc424bbebf2
> 
> You probably forgot to compile/install correctly after your pull. Of you start "odbedit" and do 
> a "ver" you see which git revision you are currently running. Make sure to get this output:
> 
> MIDAS version:      2.1
> GIT revision:       Fri Feb 11 08:56:02 2022 +0100 - midas-2020-08-a-509-g585faa96 on branch 
> develop
> ODB version:        3
> 
> 
> Stefan

We were having some problems compiling but have got it sorted now - thanks for your help:)

Jago
             Reply  14 Feb 2022, jago aartsen, Bug Fix, ODBINC/Sequencer Issue 
> > 
> > [local:mu3eMSci:S]/>cd Equipment/ArduinoTestStation/Variables
> > [local:mu3eMSci:S]Variables>ls -l
> > Key name                        Type    #Val  Size  Last Opn Mode Value
> > ---------------------------------------------------------------------------
> > _T_                             FLOAT   1     4     1h   0   RWD  20.93
> > _F_                             FLOAT   1     4     1h   0   RWD  12.8
> > _P_                             FLOAT   1     4     1h   0   RWD  56
> > _S_                             FLOAT   1     4     1h   0   RWD  5
> > _H_                             FLOAT   1     4     60h  0   RWD  44.74
> > _B_                             FLOAT   1     4     60h  0   RWD  18.54
> > _A_                             FLOAT   1     4     1h   0   RWD  14.41
> > _RH_                            FLOAT   1     4     1h   0   RWD  41.81
> > _AT_                            FLOAT   1     4     1h   0   RWD  20.46
> > SP                              INT16   1     2     1h   0   RWD  10
> > 
> 
> This looks okay, so we still have no explanation for your error. Please post your sequencer 
> script.
> 
> K.O.

Hey, thanks for getting back to me

We are fairly confident the syntax is correct. Having tried the test script posted by Stefan:

> ODBSET /Equipment/ArduinoTestStation/Variables/_S_, 10
> 
> LOOP 10
>   WAIT seconds, 3
>   ODBINC 

The same error is returned:/

We will take another look today.

Jago
Entry  23 Feb 2022, Stefan Ritt, Info, Midas slow control event generation switched to 32-bit banks 
The midas slow control system class drivers automatically read their equipment and generate events containing midas banks. So far these have been 16-bit banks created with bk_init(). But now more and more experiments use a large number of channels, so the 16-bit address space is exceeded. Until last week there was not even a check for this, leading to unpredictable crashes.

Therefore I switched the bank generation in the drivers generic.cxx, hv.cxx and multi.cxx to 32-bit banks via bk_init32(). This should in principle be transparent, since the midas bank functions automatically detect the bank type during reading. But I thought I'd let everybody know just in case.
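
For reference, a minimal sketch of the pattern in a user readout routine (the bank name and
the value are illustrative):

#include "midas.h"

INT read_slow_event(char *pevent, INT off)
{
   float *pdata;

   bk_init32(pevent);   // was bk_init(); 32-bit banks lift the 16-bit size limit
   bk_create(pevent, "SLOW", TID_FLOAT, (void **)&pdata);
   *pdata++ = 1.23f;    // channel values go here
   bk_close(pevent, pdata);

   return bk_size(pevent);
}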

Stefan
Entry  03 Mar 2022, Konstantin Olchanski, Info, zlib required, lz4 internal 
as of commit 8eb18e4ae9c57a8a802219b90d4dc218eb8fdefb, the gzip compression
library is required, not optional.

this fixes midas and manalyzer mis-build if the system gzip library
is accidentally not installed. (is there any situation where
gzip library is not installed on purpose?)

midas internal lz4 compression library was renamed to mlz4 to avoid collision
against system lz4 library (where present). lz4 files from midasio are now
used, lz4 files in midas/include and midas/src are removed.

I see that on recent versions of ubuntu we could switch to the system version 
of the lz4 library. however, on centos-7 systems it is usually not present
and it still is a supported and widely used platform, so we stay
with the midas-internal library for now.

K.O.
Entry  03 Mar 2022, Konstantin Olchanski, Info, manalyzer updated 
manalyzer was updated to latest version. mostly multi-threading improvements from 
Joseph and myself. K.O.
Entry  10 Mar 2022, Gennaro Tortone, Bug Report, Python ODB watch 
Hi,

I have an issue with ODB watch in the MIDAS Python library.

I wrote a simple frontend that reads/writes FPGA registers through 
ODB keys (simplified version at the link below):

https://gist.github.com/gtortone/cd035a9ac4ea7a78ea9cd931e80e2c75

Everything works fine, but there is a boolean array
in Settings (Enable ADC sampling) that I need to "toggle" 
(19 bits to 0 and then 19 bits to 1). This operation is handled by
detailed_settings_changed_func, which writes the value of the
toggled bit to the FPGA.

The issue is that if I quickly toggle the boolean array by
odbedit:

set "/Equipment/odbtest/Settings/Enable ADC sampling[0-18]" 0
set "/Equipment/odbtest/Settings/Enable ADC sampling[0-18]" 1

I see in the Python script the following list of callbacks:

detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[0] - new value 0
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[1] - new value 0
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[2] - new value 0
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[3] - new value 0
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[4] - new value 0
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[5] - new value 0
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[6] - new value 0
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[7] - new value 0
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[8] - new value 1    ***
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[9] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[10] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[11] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[12] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[13] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[14] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[15] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[16] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[17] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[18] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[0] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[1] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[2] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[3] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[4] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[5] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[6] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[7] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[8] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[9] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[10] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[11] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[12] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[13] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[14] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[15] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[16] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[17] - new value 1
detailed_settings_changed_func: /Equipment/odbtest/Settings/Enable ADC sampling[18] - new value 1

It seems that the second write operation "overlaps" the first one...

The same behavior is not observed using a 'watch' in odbedit...

I can overcome this problem by using the value of the register as the ODB key, to avoid an 
array of booleans... but I report this issue as a "possible" bug/limitation of the Python ODB watch.

Cheers,
Gennaro
    Reply  16 Mar 2022, Ben Smith, Bug Report, Python ODB watch 
> It seems that the second write operation "overlaps" the first one...

Hi Gennaro,

In principle the same issue can happen in C++ code, but it is much less likely, as the callbacks get executed more quickly (partly due to C++ vs Python performance in general, and partly because the Python code does some extra work to make the interface more user-friendly). The C++ code at the end of this message adds a 100ms sleep to the callback and can result in output like this when you do quick edits of "Test[0-19]" in odbedit.

Element 1 is 0
Element 2 is 0
Element 3 is 0
Element 4 is 0
Element 5 is 0
Element 6 is 0
Element 7 is 1
Element 8 is 1
Element 9 is 1
etc...


I agree that this can be a really nasty source of bugs if you need to react to every change. I'll add a warning to the python docstrings, but I can't think of a way to make this more robust at the midas level - I think we'd need some sort of ODB "snapshot" system...

#include "midas.h"

void watch_fn(HNDLE hDB, HNDLE hKey, int index, void *info) {
  DWORD data = 0;
  INT buf_size = sizeof(data);
  db_get_data_index(hDB, hKey, &data, &buf_size, index, TID_DWORD);
  printf("Element %d is %u\n", index, data);
  ss_sleep(100);
}

int main() {
  HNDLE hDB, hClient, hTestKey;

  std::string host, expt;
  cm_get_environment(&host, &expt);
  cm_connect_experiment(host.c_str(), expt.c_str(), "test_odb", nullptr);
  cm_get_experiment_database(&hDB, &hClient);

  static const DWORD numValues = 20;
  DWORD data[numValues] = {};
  db_set_value(hDB, 0, "Test", data, sizeof(DWORD) * numValues, numValues, TID_DWORD);
  db_find_key(hDB, 0, "Test", &hTestKey);
  db_watch(hDB, hTestKey, watch_fn, nullptr);

  printf("Press any key to exit loop...\n");

  while (!ss_kbhit()) {
    cm_yield(1);
  }

  db_unwatch_all();
  db_delete_key(hDB, hTestKey, FALSE);
  cm_disconnect_experiment();

  return 0;
}
       Reply  21 Mar 2022, Stefan Ritt, Bug Report, Python ODB watch 
What you describe is a well-known problem with the ODB. At PSI we have similar issues. There are
two approaches to solve it:

1) Write values one-by-one to the ODB, but do not trigger a watch update for each of them. In the
sequencer, this can be achieved with the ODBSET command (see
https://daq00.triumf.ca/MidasWiki/index.php/Sequencer, last paragraph of the ODBSET description).
You use notify=0 for all set commands except the last one, where you use notify=1. In the C++ API,
you can use db_set_data_index1(), which has this notify flag as its last parameter.

2) You add intelligence to your front-end. When you get a watch update, you do not apply it
directly to the hardware, but put it into a FIFO. Once you have not received any more updates for a
certain period (1 s is a good value), you empty the FIFO and apply all settings at once (see the
sketch below).

Both methods have been used at PSI successfully, although 1) is much easier to implement, especially
if you use the midas sequencer.
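
For illustration, here is a minimal sketch of approach 2) as a debounce FIFO in C++. The names
(PendingUpdate, queue_update, flush_if_quiet, apply_to_hardware) are placeholders, not part of the
MIDAS API:

#include <chrono>
#include <deque>

struct PendingUpdate {
   int index;        // array element that changed
   unsigned value;   // new value read from the ODB
};

static std::deque<PendingUpdate> fifo;                     // updates not yet applied
static std::chrono::steady_clock::time_point last_update;  // time of the last watch callback

// called from the db_watch callback instead of touching the hardware directly
void queue_update(int index, unsigned value) {
   fifo.push_back({index, value});
   last_update = std::chrono::steady_clock::now();
}

// called periodically from the main loop (e.g. next to cm_yield());
// apply_to_hardware stands for the experiment-specific setter
void flush_if_quiet(void (*apply_to_hardware)(int, unsigned)) {
   using namespace std::chrono;
   if (!fifo.empty() && steady_clock::now() - last_update > seconds(1)) {
      for (const auto &u : fifo)
         apply_to_hardware(u.index, u.value);  // apply all settings at once
      fifo.clear();
   }
}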

Stefan
Entry  15 Mar 2022, Konstantin Olchanski, Bug Fix, mhttpd ipv6 bind should be fixed now 
Something changed after my initial implementation of ipv6 in mhttpd
and listening to ipv6 http/https connections was broken.

It turns out I do not need to listen to both ipv4 and ipv6 sockets:
it is sufficient to listen to just ipv6, and ipv4 connections will also
magically work. See the linux kernel "bindv6only" sysctl setting:
https://sysctl-explorer.net/net/ipv6/bindv6only/

The specific bug in mhttpd was binding to the ipv4 socket first: the subsequent bind() to the
ipv6 socket then fails with "Address already in use", an error that is silent, not reported by
the mongoose library. For reasons unknown, this does not happen when binding to "localhost" aka
ipv6 "::1".

Apparently other web servers (apache, nginx) are/were also affected by this problem. 
https://chrisjean.com/fix-nginx-emerg-bind-to-80-failed-98-address-already-in-use/

The first fix was to bind to ipv6 first (success) and to ipv4 second (fails). The second fix,
committed to git, is to listen only on ipv6.

This works on both MacOS and Linux. Linux reports the listener socket as "tcp6", MacOS reports
it as "tcp46":

4ed0:javascript1 olchansk$ netstat -an | grep 808 | grep LISTEN
tcp46      0      0  *.8081                 *.*                    LISTEN     
tcp6       0      0  ::1.8080               *.*                    LISTEN     
tcp4       0      0  127.0.0.1.8080         *.*                    LISTEN     
4ed0:javascript1 olchansk$ 
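
For reference, a minimal sketch of this dual-stack listener with plain BSD sockets in C++
(illustrative only, not the actual mongoose/mhttpd code):

#include <cstdio>
#include <cstring>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int open_listener(int port)
{
   int fd = socket(AF_INET6, SOCK_STREAM, 0);
   if (fd < 0)
      return -1;

   int v6only = 0; // 0 = dual stack: ipv4 clients arrive as mapped ::ffff:a.b.c.d addresses
   setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only));

   sockaddr_in6 addr;
   memset(&addr, 0, sizeof(addr));
   addr.sin6_family = AF_INET6;
   addr.sin6_addr   = in6addr_any;
   addr.sin6_port   = htons(port);

   if (bind(fd, (sockaddr *)&addr, sizeof(addr)) < 0) {
      perror("bind"); // the "Address already in use" that was silent inside mongoose
      close(fd);
      return -1;
   }
   listen(fd, 5);
   return fd;
}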

K.O.
    Reply  23 Mar 2022, Konstantin Olchanski, Bug Fix, mhttpd ipv6 bind should be fixed now 
> Something changed after my initial implementation of ipv6 in mhttpd
> and listening to ipv6 http/https connections was broken.

Reporting that mhttpd ipv6 works at CERN. The hostnames for ipv6 connections
come back as alphacpc05.ipv6.cern.ch instead of alphacpc05.cern.ch
so both are added to the http "insecure port" whitelist.

K.O.
Entry  23 Mar 2022, Hunter Lowe, Forum, ODB has issue with example analyzer 
Trying to play with a midas file, but I get this error:

[Analyzer,ERROR] [odb.cxx:845:db_validate_name,ERROR] Invalid name "/Analyzer/Tests/low_sum/Rate [Hz]" passed to db_create_key_wlocked: should not contain "["

I'm not sure what sets the name so I'm not sure how to fix this.

Thanks
Entry  22 Mar 2022, Konstantin Olchanski, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
multithreaded frontends can cause an unusual event buffer corruption if the write
cache is enabled. For a long time now I have had to disable the write cache on
all multithreaded frontends in alpha-g, as I was hitting this bug quite often.
(somehow I do not see this problem reported on bitbucket!)

last week I reworked the multithread locking of event buffers, in the hope
that this bug would turn up, but nope, all mutexes and locking looked okay,
except for a number of unrelated problems (races against bm_close_buffer()
were the most troublesome to fix).

but finally found the trouble.

first, some background.

because multiprocess locking is expensive, frontends that generate
a large number of small events can use the write cache to reduce
this overhead. instead of locking the shared memory event buffer for
each event, events are accumulated in the write cache, and periodic
calls to bm_flush_buffer() flush them to shared memory. For best effect,
one should increase the size of the write cache until lock rate is around
10/second.

it turns out introduction of multithreading broke bm_flush_cache().

it does this:

- int ask_free = pbuf->wp; // how much data we have in the write cache now
- call bm_wait_for_free_space(ask_free); // ensure we have this much free shared memory space
- copy pbuf->wp worth of events to shared memory

looks okay at first sight. this is what happens to trigger the bug:

- int ask_free = pbuf->wp; // ok
- call bm_wait_for_free_space(ask_free); // ok, but if shared memory is full, it will go to sleep waiting for free space
- in the meantime, another thread calls bm_send_event(), which adds more data to the write cache and moves pbuf->wp
- bm_wait_for_free_space() eventually returns
- copy pbuf->wp worth of data to shared memory. KABOOM! shared memory corruption!

we just overwrote some unlucky event in shared memory: we only have "ask_free"
free bytes available, but pbuf->wp moved and now has more data,
and it does not fit, and there is no check against it.

of course in the single threaded world this bug did not exist, there was no 
other thread to call bm_send_event() while bm_flush_cache() is sleeping.

the obvious fix is to ask for more free space if the cached data no longer fits.

this is now implemented on the branch feature/buffer_mutex. after a bit more
testing I will merge it into develop.
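
An illustrative sketch of the fixed flush logic in C++ (hypothetical names, not the actual
midas source; the point is to re-check the cache after the wait, because another thread may
have added events in the meantime):

#include <cstddef>

struct WriteCache {
   size_t wp; // bytes currently in the write cache, moved by bm_send_event()
};

// stand-ins for the real midas calls
void wait_for_free_space(size_t nbytes);  // may sleep if the shared memory buffer is full
void copy_cache_to_shared_memory(WriteCache *pbuf, size_t nbytes);

void flush_cache(WriteCache *pbuf)
{
   size_t ask_free = pbuf->wp;
   for (;;) {
      wait_for_free_space(ask_free);  // may sleep; pbuf->wp can grow while we wait
      if (pbuf->wp <= ask_free)
         break;                       // cached data still fits in the space we asked for
      ask_free = pbuf->wp;            // cache grew while we slept: ask for more space
   }
   copy_cache_to_shared_memory(pbuf, pbuf->wp); // now guaranteed to fit
   pbuf->wp = 0;
}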

so that's it?

not so fast. there was more going on. as described, the bug will only happen
when the shared memory event buffer is full (i.e. rarely or never). It turns
out the old version of the thread locking code was defective and permitted
a race between bm_send_event() in one thread and bm_send_event() in another thread:

thread 1: while (1) { bm_send_event(very small event); }
thread 2:
-> bm_send_event(very big event)
-> no space in the cache for the very big event, call bm_flush_cache()
-> bm_flush_cache() asks bm_wait_for_free_space() to make space for cached data
-> this was done with write cache mutex released (mistake!)
-> at the same time bm_send_event(very small event) added 1 more small event to 
the cache
-> back in bm_flush_cache() write cache mutex is locked correctly, we copy 
cached data to shared memory and again KABOOM because we now have more data than 
we asked free space for.

So in the original implementation, corruption was possible even when the shared
memory event buffer was pretty much empty.

The reworked locking code closed that loophole - bm_flush_cache() is now
called with write cache locked, and bm_send_event() from another thread
cannot confuse things, unless shared memory buffer is full and we go to
sleep inside bm_wait_for_free_space(). And this is now fixed, too.

K.O.
    Reply  22 Mar 2022, Stefan Ritt, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
Thanks Konstantin for your detailed description. 

I wonder why we never saw this problem at PSI. Here is the reason: In multi-threaded environments, we never call bm_send_event() directly 
from all threads (since in the old days nothing was thread safe in midas). Instead, we use a collector thread which gets all events via the 
rb_xxx functions from the individual readout threads. This is well integrated into the mfe.cxx framework. Look at examples/mtfe/mtfe.cxx. 
Each thread does (simplified):

while (true) {
  do {
     status = rb_get_wp(&pevent);
  } while (status == DB_TIMEOUT);

  bm_compose_event_threadsafe(pevent, ..., &serial_number);
  bk_init32(pevent+1);
  ... fill event ...
  bk_close(pevent);

  rb_increment_wp(sizeof(EVENT_HEADER) + pevent->data_size);
}

The framework now collects all these events in receive_trigger_event() which runs in the main thread:

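  // find the readout thread holding the event with the next serial number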
  for (i=0 ; i<n_thread ; i++) {
     rb_get_rp(i, pevent);
     if (pevent->serial_number == prev_serial+1)
        break;
  }
  
  prev_serial = pevent->serial_number;
  rpc_send_event(pevent);
  rb_increment_rp(sizeof(EVENT_HEADER) + pevent->data_size);

This code ensures that all events are in the right sequence (before this, the serial numbers were mixed up) and that all events are sent only 
from a single thread, so the write buffer can be used effectively without complicated multi-thread locks.

This solution has worked nicely at PSI for many years; maybe you should consider using it in your tmfe framework in Alpha-g as well 
instead of struggling with all your locks.

Stefan
       Reply  23 Mar 2022, Konstantin Olchanski, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
I confirm, there is no problem in single-threaded programs, and
there is no problem if all bm_send_event() and bm_flush_cache() are called
from the same thread.

> ... instead of struggling with all your locks.

it is better to have midas fully thread safe. ODB has been so for a long time,
event buffer partially (except for this bug), now fully.

without that the problem still exists, because in many frontends,
bm_flush_buffer() is called from the main thread, and will race
against the "bm_send_event() thread". Of course you can do
everything on the main thread, but this opens you to RPC timeouts
during run transitions (if you sleep in bm_wait_for_free_space()).

also the SYSMSG buffer is subject to the same bug. cm_msg() is of course
safe to call from anywhere, but cm_msg_flush_buffer() and cm_periodic_tasks()
can be called from any thread, and they issue bm_send_event(SYSMSG),
and there will be mysterious crashes and SYSMSG corruptions, probably
only during message storms, but still!

K.O.
          Reply  24 Mar 2022, Stefan Ritt, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
> > ... instead of struggling with all your locks.
> 
> it is better to have midas fully thread safe. ODB has been so for a long time,
> event buffer partially (except for this bug), now fully.
> 
> without that the problem still exists, because in many frontends,
> bm_flush_buffer() is called from the main thread, and will race
> against the "bm_send_event() thread". Of course you can do
> everything on the main thread, but this opens you to RPC timeouts
> during run transitions (if you sleep in bm_wait_for_free_space()).

Just for the record: in the mfe.cxx framework both bm_send_event() and 
bm_flush_buffer() are called from the main thread, as can be seen in the 
midas/examples/mtfe/mtfe.cxx example.

But I agree that having all buffer operations thread safe is a clear benefit.

Stefan
    Reply  23 Mar 2022, Ivo Schulthess, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
Thanks for the investigation. Back in 2020, we had some issues of losing data between the system buffer and the logger writing them to disk (https://daq00.triumf.ca/elog-midas/Midas/1966). This was polled equipment but we had a multithreaded FE running at the same time. Could this be related to the same problem?

Best, Ivo
       Reply  24 Mar 2022, Konstantin Olchanski, Bug Fix, fix for event buffer corruption in bm_flush_cache() 
> Thanks for the investigation. Back in 2020, we had some issues
> of losing data between the system buffer and the logger writing them
> to disk (https://daq00.triumf.ca/elog-midas/Midas/1966). This was polled equipment
> but we had a multithreaded FE running at the same time. Could this be related to the same problem?

I think we will have to follow up on your problem 1966 separately.

I think this bug cannot lose events. Writing events to the write cache has correct
locking, no loss here. writing write cache to shared memory has correct locking,
no loss there. the bug will cause the *next* event in the event buffer to be overwritten,
this will be detected by most programs as shared memory corruption and everybody
will quit. (mhttpd, mserver, odbedit will probably survive).

I guess there could be unlucky corruption that looks like nothing was corrupted,
but this will affect only a few events right at the shared memory read/write
pointer; it so happens that they are the oldest events in the buffer, and likely
mlogger already wrote them to disk. The mlogger read pointer will likely follow
the shared memory write pointer closely, well ahead of the shared memory
read pointer, which always points to the oldest events and is where this bug's
corruption will happen.

So no, I do not think this bug can cause event loss between frontend and mlogger.

K.O.
Entry  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
Dear all

We just started our beam time at ILL and found yesterday that for certain 
settings of our detector the data is not saved into the .mid files. Running "mdump 
-l 10" online we see the data coming in as it should. Nevertheless, if we run 
"mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are 
missing. Any ideas where the data could get lost?

Thanks in advance,
Ivo
    Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
> Dear all
> 
> We just started our beam time at ILL and found yesterday that for certain 
> settings of our detector the data is not saved into the .mid files. Running "mdump 
> -l 10" online we see the data coming in as it should. Nevertheless, if we run 
> "mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are 
> missing. Any ideas where the data could get lost?
> 
> Thanks in advance,
> Ivo

Have you checked 

/Logger/Channels/0/Settings/Event ID = -1
/Logger/Channels/0/Settings/Trigger mask = -1

If these settings are not -1, they filter the data stream for certain events and trigger 
masks.

Stefan
       Reply  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> > Dear all
> > 
> > We just started our beam time at ILL and found yesterday that for certain 
> > settings of our detector the data is not saved into the .mid files. Running "mdump 
> > -l 10" online we see the data coming in as it should. Nevertheless, if we run 
> > "mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are 
> > missing. Any ideas where the data could get lost?
> > 
> > Thanks in advance,
> > Ivo
> 
> Have you checked 
> 
> /Logger/Channels/0/Settings/Event ID = -1
> /Logger/Channels/0/Settings/Trigger mask = -1
> 
> If these settings are not -1, they filter the data stream for certain events and trigger 
> masks.
> 
> Stefan

Good morning Stefan

Both set to -1. We only have one logging channel. If we run a sequence with a few runs and the 
same settings, sometimes data is in the .mid file and sometimes it is not.

Best,
Ivo 
          Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
> Both set to -1. We only have one logging channel. If we run a sequence with a few runs and the 
> same settings, sometimes data is in the .mid file and sometimes it is not.

Then I'm running out of ideas. Things I would check:

- Are the file sizes about the same? 

- When you dump the .mid file, do you see your bank names? 

This would tell you if the events are really missing or if mdump would just not find them.

But I guess without being able to debug the system at ILL I cannot be of any more help. You are the 
first one reporting such a problem, so it must have to do with your local setup.

Stefan
             Reply  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> Then I'm running out of ideas. Things I would check:
> 
> - Are the file sizes about the same? 
> 
> - When you dump the .mid file, do you see your bank names? 
> 
> This would tell you if the events are really missing or if mdump would just not find them.
> 
> But I guess without being able to debug the system at ILL I cannot be of any more help. You are the 
> first one reporting such a problem, so it must have to do with your local setup.
> 
> Stefan

So I did a quick check. The file size is about the same (322K and 329K). When I dump the .mid I don't see 
the banks. It only prints two lines with "------ Event# 0 ------" and "------ Event# 1 ------" whereas for 
the file with data I get the two banks with all the data. Our online analyzer also fails to see the banks. 
Is there another way to check what is in the .mid file?

Best,
Ivo
                Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
> So I did a quick check. The file size is about the same (322K and 329K). When I dump the .mid I don't see 
> the banks. It only prints two lines with "------ Event# 0 ------" and "------ Event# 1 ------" whereas for 
> the file with data I get the two banks with all the data. Our online analyzer also fails to see the banks. 
> Is there another way to check what is in the .mid file?

with "dump" I meant a true object dump like "hexdump -C run000001.mid". I produced a file with ADC0 and TDC0 
banks (that's the example from the distribution under exampels/experiments/frontend.cxx), and I get

....
00024220  01 00 00 00 41 44 43 30  04 00 08 00 eb 06 35 04  |....ADC0......5.|
00024230  31 09 4f 06 54 44 43 30  04 00 08 00 93 04 fb 07  |1.O.TDC0........|
00024240  5c 09 88 0b 01 00 00 00  01 00 00 00 2a 0b 31 5f  |\...........*.1_|
00024250  28 00 00 00 20 00 00 00  01 00 00 00 41 44 43 30  |(... .......ADC0|
00024260  04 00 08 00 c3 09 24 05  85 05 f3 06 54 44 43 30  |......$.....TDC0|
00024270  04 00 08 00 88 08 2d 03  3b 0d d6 02 01 00 00 00  |......-.;.......|
00024280  02 00 00 00 2a 0b 31 5f  28 00 00 00 20 00 00 00  |....*.1_(... ...|
00024290  01 00 00 00 41 44 43 30  04 00 08 00 a5 0a 69 09  |....ADC0......i.|

where you clearly see the ADC0 and TDC0 banks.

Stefan
                   Reply  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> with "dump" I meant a true object dump like "hexdump -C run000001.mid". I produced a file with ADC0 and TDC0 
> banks (that's the example from the distribution under exampels/experiments/frontend.cxx), and I get
> 
> ....
> 00024220  01 00 00 00 41 44 43 30  04 00 08 00 eb 06 35 04  |....ADC0......5.|
> 00024230  31 09 4f 06 54 44 43 30  04 00 08 00 93 04 fb 07  |1.O.TDC0........|
> 00024240  5c 09 88 0b 01 00 00 00  01 00 00 00 2a 0b 31 5f  |\...........*.1_|
> 00024250  28 00 00 00 20 00 00 00  01 00 00 00 41 44 43 30  |(... .......ADC0|
> 00024260  04 00 08 00 c3 09 24 05  85 05 f3 06 54 44 43 30  |......$.....TDC0|
> 00024270  04 00 08 00 88 08 2d 03  3b 0d d6 02 01 00 00 00  |......-.;.......|
> 00024280  02 00 00 00 2a 0b 31 5f  28 00 00 00 20 00 00 00  |....*.1_(... ...|
> 00024290  01 00 00 00 41 44 43 30  04 00 08 00 a5 0a 69 09  |....ADC0......i.|
> 
> where you clearly see the ADC0 and TDC0 banks.
> 
> Stefan

So at least I learned something new. I tried it with the hexdump and the banks are not present in the .mid file; I 
only have the ODB inside the file. The 7K difference in size is actually just about what I expect for the data 
(1792 x 4 bytes).

Best, Ivo
                      Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
Have you tried longer files? Maybe a few 100 MB or so. Maybe a buffer is not flushed correctly at the end of a run.
                         Reply  10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> Have you tried longer files? Maybe a few 100 MB or so. Maybe a buffer is not flushed correctly at the end of a run.

Yes, I did. A data bank of about 7 KB seems to be the limit. If we go only 1 KB higher, it seems that we save all the data. In 
our specific case, this is set by the number of time bins (256 pixels with 7 time bins results in data loss; with 8 time bins it 
seems to be okay; the data type is DWORD).
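(Checking the numbers: 256 pixels x 7 time bins x 4 bytes per DWORD = 7168 bytes, just under 8 KB; with 8 time bins it is exactly 8192 bytes.)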

Of course, a workaround for us is to save at least 8 time bins and throw 7 of them away later on. Nevertheless, since we 
are only in the commissioning phase now this is okay, I would just like to avoid data loss in the data taking phase of the 
experiment so knowing where the problem origins could help. 

I did another test with another FE running that produces a lot of data. The behavior is the same though: if the bank size 
is less than about 8 KB, the bank is not saved anymore. But this is probably the expected behavior anyway, since it is a 
different FE that produces the data. 

So if it is coming from the buffer, is there something I could change to test or solve the problem?

Best, Ivo
                            Reply  10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
I have to reproduce the problem to fix it. Why don't you go and modify midas/examples/experiment/frontend.cxx in such a way that 
it creates exactly the banks you have, just with random data. If you see the same problem, send me your frontend file so that I 
can reproduce it.
                               Reply  11 Aug 2020, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid 
> I have to reproduce the problem to fix it. Why don't you go and modify midas/examples/experiment/frontend.cxx in such a way that 
> it creates exactly the banks you have, just with random data. If you see the same problem, send me your frontend file so that I 
> can reproduce it.

It would be good to pinpoint where the data is lost. This is the sequence:

frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk

To see if correct data arrives to the SYSTEM buffer, run:
mdump -z SYSTEM

To see if mlogger is receiving events from the SYSTEM buffer, run:
mlogger -v ### mlogger should report all events, history and data

To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).

I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
mlogger is misconfigured (but you already checked that).

If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).

K.O.
                                  Reply  11 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid 
> It would be good to pinpoint where the data is lost. This is the sequence:
> 
> frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk
> 
> To see if correct data arrives to the SYSTEM buffer, run:
> mdump -z SYSTEM
> 
> To see if mlogger is receiving events from the SYSTEM buffer, run:
> mlogger -v ### mlogger should report all events, history and data
> 
> To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).
> 
> I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
> if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
> mlogger is misconfigured (but you already checked that).
> 
> If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
> to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).
> 
> K.O.

Good evening

I tried to reproduce the behavior in a very simple FE but it did not work out. The next thing for me would be to take the FE that is producing this behavior, replace all the device communication and data with dummies. If the problem is still there I would start to simplify as much as possible. 

Following the inputs of KO, I pin-pointed the data loss. The system buffer still gets the data but the mlogger does not write the data event. Then of course the data is also not anymore present in the data file. Therefore, I checked the logger settings again, Event ID and Trigger Mask still -1. Nothing else, at least from my point of view, that is misconfigured. Nevertheless, if it helps I can send my ODB settings. 

When doing the tests just before I found something else that probably can give a hint to the problem. The data is only lost if the time between two runs is long (a few seconds). As an example: If I run a sequence with a loop and after the FE stops the run the loop ends and the next run is started automatically, then only the first run has no data, which is the one after a longer time of no data taking. When I add a "WAIT Seconds 5" after the run before starting the next, no data is written to the disk for any run. I also found this once when adding a sleep(1) at the end of the FE readout function but back then did not think about it any further. 

Best, Ivo
                                     Reply  24 Mar 2022, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid 
> > It would be good to pinpoint where the data is lost. This is the sequence:
> > 
> > frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk
> > 
> > To see if correct data arrives to the SYSTEM buffer, run:
> > mdump -z SYSTEM
> > 
> > To see if mlogger is receiving events from the SYSTEM buffer, run:
> > mlogger -v ### mlogger should report all events, history and data
> > 
> > To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).
> > 
> > I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
> > if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
> > mlogger is misconfigured (but you already checked that).
> > 
> > If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
> > to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).
> > 
> > K.O.
> 
> Good evening
> 
> I tried to reproduce the behavior in a very simple FE but it did not work out.
> The next thing for me would be to take the FE that is producing this behavior,
> replace all the device communication and data with dummies. If the problem is still
> there I would start to simplify as much as possible. 
> 
> Following the inputs of KO, I pin-pointed the data loss. The system buffer still
> gets the data but the mlogger does not write the data event. Then of course the data
> is also not anymore present in the data file. Therefore, I checked the logger
> settings again, Event ID and Trigger Mask still -1. Nothing else, at least from my point of view,
> that is misconfigured. Nevertheless, if it helps I can send my ODB settings. 
> 
> When doing the tests just before I found something else that probably
> can give a hint to the problem. The data is only lost if the time between
> two runs is long (a few seconds). As an example: If I run a sequence with a loop
> and after the FE stops the run the loop ends and the next run is started automatically,
> then only the first run has no data, which is the one after a longer time of
> no data taking. When I add a "WAIT Seconds 5" after the run before starting
> the next, no data is written to the disk for any run. I also found this
> once when adding a sleep(1) at the end of the FE readout function
> but back then did not think about it any further. 
> 

Looks like this problem fell into the covid crack.

As far as I know, MIDAS does not lose any events between bm_send_event() and the shared memory
buffer. It does not lose any events in the mlogger (unless the "event request" is misconfigured).
(there is lots of opportunity to lose events in complicated frontends).

If you have some evidence otherwise, I would very much like to hear about
it and I want to fix all problems that cause it.

In your previous report I was under the impression that you lose random events here and there,
but your latest report is about mlogger not writing anything at all.

Which case is it?

If you can definitely say that all your events make it to the SYSTEM buffer
but mlogger sometimes does not see some of them and sometimes does not see all of them,
we should look very closely at bm_receive_event() and mlogger itself.

In the case where mlogger is not seeing any events at all (output file is empty), as this is
happening, I would like to see the output of mdump (to confirm events are written to SYSTEM
buffer with correct event_id and trigger_mask) and the output of (say)
"manalyzer_test.exe --dump run01161.mid.lz4" on your output file.

If the output is very long, you can email it to me directly instead of posting it here.

K.O.
                                        Reply  24 Mar 2022, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid 
One idea: we should have a look at mlogger::close_channels(). There the SYSTEM buffer is emptied through the cm_yield() call. Instrumenting this with some debugging code will enlighten us. 

Another possible problem: If the frontend requested to be notified for a run stop AFTER the logger, then the problem might happen: Logger closes file, and THEN the frontend flushes events ending up in the SYSTEM buffer and being logged at the beginning of the next run. The mfe.cxx framework takes care of this by calling 

cm_register_transition(TR_STOP, tr_stop, 500);

while the mlogger does 

cm_register_transition(TR_STOP, tr_stop, 800);

and since 800 > 500 the logger will be called AFTER the frontend. If one uses a framework different from mfe.cxx, this could however be different (see the sketch below).
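
For a frontend not based on mfe.cxx, a minimal sketch of the registration (tr_stop is the
user-supplied handler; 800 is the mlogger's sequence number):

#include "midas.h"

static INT tr_stop(INT run_number, char *error)
{
   // flush any locally buffered events here, before mlogger closes the file
   return CM_SUCCESS;
}

void register_transitions()
{
   // 500 < 800 (mlogger), so this handler runs before the logger at run stop
   cm_register_transition(TR_STOP, tr_stop, 500);
}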

Stefan
                                           Reply  24 Mar 2022, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid 
> One idea: we should have a look at mlogger::close_channels().
> There the SYSTEM buffer is emptied through the cm_yield() call.
> Instrumenting this with some debugging code will enlighten us.

right. this would cause "last few events are lost at the end of run".

but that code in the mlogger has not been touched in years; if there were a problem there,
we would have seen it by now. most experiments check that the number
of events in the data file is the same as the number of triggers generated; both
numbers are shown on the midas status page.

> Another possible problem: If the frontend requested to be notified for a run stop AFTER the logger, then the problem might happen: Logger closes file, and THEN the frontend flushes events ending up in the SYSTEM buffer and being logged at the beginning of the next run. The mfe.cxx framework takes care of this by calling 
> cm_register_transition(TR_STOP, tr_stop, 500);

default sequence, both mfe.c frontend and c++ tmfe frontend:

start of run:
- mlogger first (configure history, open data file)
- frontends last
- (if any frontend fails, TR_STARTABORT is sent to mlogger to close the output file and "undo" the run start)

end of run:
- frontends first (must not send any events after processing the TR_STOP RPC call; inside the TR_STOP handler, bm_flush_cache() takes care of the write cache)
- mlogger last
- (if any frontend fails, failure is ignored, run stops regardless)

the order will only be wrong if somebody manually changes it, and whatever order
is set, you can see it on the midas transition page (and with mtransition -v, odbedit stop now -v, etc).

K.O.
Entry  23 Mar 2022, Konstantin Olchanski, Bug Fix, mhttpd bug fixed 
the mhttpd bug should be fixed now (branch feature/buffer_mutex).

simplest way to reproduce:

wget http://localhost:8080/
quickly ctrl-C it
wget http://localhost:8080/
inside mhttpd (by hook or crook) observe that the second wget got the data meant for the first wget.

if you cannot ctrl-C the first wget quickly enough, put a sleep somewhere in the worker thread (in 
mongoose_write(), I think).

this is what happens.

1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object
(corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)

2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
nc pointer is no longer stale, but points to 2nd wget's connection.

so we think we are clever and we check the socket file descriptors. but same thing
happens there, too. if 1st wget was file descriptor 7, it is closed, (1st wget worker now has
a stale file handle), then reopened for the 2nd wget, per POSIX, we get back the same
file descriptor 7. 1st wget worker now has the file handle for the 2nd wget tcp socket and
the famous test/crash for "sending data to wrong socket" is defeated.

now, the worker thread for the 1st wget wants to send a reply. it has a valid nc pointer (points to 2nd wget's
mg_connection object) and a valid file descriptor (points to 2nd wget's tcp socket), so the
reply meant for the 1st wget is successfully sent to the 2nd wget. 2nd wget finishes, its socket
is closed, the mg_connection object is free'ed. Now the worker thread for the 2nd wget has stale
connection info, but this is okay: mongoose does not find a matching connection, the 2nd wget
worker thread's reply goes nowhere, and the thread finishes silently (no memory leaks here, I checked).

so, connection for 2nd wget completely impersonates the closed connection of 1st wget (I guess I could
check the full socket address info, remote ip address, remote port number, etc, but...)

in practice, this bug does not happen often because modern browsers tend to keep tcp sockets open
for a very long time. (not sure about sundry web proxies, etc).

the solution of course is very simple: match worker thread data to mongoose mg_connection objects
using our own connection sequential numbers, which are unique and very easy to keep track
of through the mongoose event handler. all this mess runs in the main thread,
so no locking trouble here, small blessing.
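
A minimal sketch of this matching scheme in C++ (illustrative names, not the actual mhttpd code):

#include <cstdint>
#include <map>

struct mg_connection; // opaque, from mongoose

static uint32_t next_serial = 0;
static std::map<uint32_t, mg_connection *> open_conns; // touched by the main thread only

// on "connection accepted" (main thread): file the connection under a fresh serial number
uint32_t on_accept(mg_connection *nc)
{
   uint32_t serial = ++next_serial;
   open_conns[serial] = nc;
   return serial; // workers carry the serial, never the nc pointer
}

// on "connection closed" (main thread): forget it
void on_close(uint32_t serial)
{
   open_conns.erase(serial);
}

// when a worker reply is dispatched (main thread): look the connection up again
mg_connection *find_connection(uint32_t serial)
{
   auto it = open_conns.find(serial);
   return (it == open_conns.end()) ? nullptr : it->second; // nullptr: closed meanwhile, drop the reply
}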

K.O.
    Reply  24 Mar 2022, Stefan Ritt, Bug Fix, mhttpd bug fixed 
> 1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object
> (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)
> 
> 2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
> but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
> nc pointer is no longer stale, but points to 2nd wget's connection.

Why don't we CLEAR the memory (memset(object,0,sizeof(object))) before the free()? That way it cannot be 
mistakenly re-used by the next thread.

Stefan
       Reply  24 Mar 2022, Konstantin Olchanski, Bug Fix, mhttpd bug fixed 
> > 1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object
> > (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)
> > 
> > 2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
> > but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
> > nc pointer is no longer stale, but points to 2nd wget's connection.
> 
> Why don't we CLEAR the memory (memset(object,0,sizeof(object)) before the free(), this way it cannot be 
> mistakenly re-used by the next thread.
> 

My description was unclear. I will try better now.

When http replies are generated by worker threads, matching of reply to mg_connection is done
by checking the address of the mg_connection object. (mongoose itself unhelpfully offers
to send the reply to every mg_connection, see the responder to mg_broadcast() messages).

This works for open/active connections, addresses of all mg_connections are unique.

But if a connection is closed and a new connection is opened, the address may be reused (by malloc()/free()
reusing memory blocks or by mongoose using a pool of mg_connection objects, it does not matter which).

So matching http reply to mg_connection using only address of mg_connection can match the wrong connection.

(the contents of the mg_connection object do not matter, only the address is used for matching,
so zeroing the mg_connection object with memset() does not help).

I saw this during my testing - wrong data was sent to the wrong browser often enough - but did
not understand that the above problem was happening.

Because I was unable to reliably reproduce the problem, I could not debug it. I tried to add
a check for the tcp socket file descriptor number, in case there is a straight bug or multithread race
or simple memory corruption. This replaced "we sent wrong data to wrong browser, poisoned browser
cache, confused the user" with a crash. This "fix" seemed effective at the time.

Maybe I should mention browser cache poisoning again. What happened is that html pages and rpc replies
were returned as responses to requests for things like CSS files; these bad responses are cached by the browser
pretty much forever, so all subsequent midas pages will look wrong (bad css!) forever, until the
user manually clears the browser cache. reloading the page did not help, restarting the browser did not help (I think).

So a very bad bug.

Unfortunately, the check for file descriptor was not effective because file descriptors are also
reused. And I did see wrong data returned by mhttpd, but even more rarely. And everybody (myself
included) complained about mhttpd crashes.

Now, matching of responses to connections is done by connection sequential/serial number,
which is unique 32-bit counter. Mismatch of reply to connection should not happen again.

P.S. Latest version of the mongoose web server library does not help with this problem,
the example code for matching reply to connection in their multithread example looks bogus:
https://github.com/cesanta/mongoose/blob/master/examples/multi-threaded/main.c

K.O.
          Reply  24 Mar 2022, Stefan Ritt, Bug Fix, mhttpd bug fixed 
I see, now I understand.

As for the browser cache problem: This Chrome extension is your friend: 

https://chrome.google.com/webstore/detail/clear-cache/cppjkneekbjaeellbfkmgnhonkkjfpdn?hl=en

I use it all the time when I change the CSS or a JS file. Having the "Developer Tools" open in Chrome helps as well 
(the cache is then turned off). Firefox has similar extensions.

Stefan
             Reply  24 Mar 2022, Konstantin Olchanski, Bug Fix, mhttpd bug fixed 
> As for the browser cache problem: This Chrome extension is your friend ...

for google chrome, it is easy: open the javascript debugger (right-click, "inspect"),
and the reload button gains a click-and-hold menu, where one option is "clear cache and reload".
(there is no button for "clear cookies and reload", re the recent elog cookie problem).

but this does not help me personally. if midas web pages get confused, I will get confused too,
and I will spend hours debugging mhttpd before thinking "hmm... maybe I should clear the browser cache!"

not sure about firefox, safari, microsoft edge and opera. if I ever need it, I google it.

K.O.