Back Midas Rome Roody Rootana
  Midas DAQ System, Page 7 of 146  Not logged in ELOG logo
Entry  02 Dec 2021, Alexey Kalinin, Bug Report, some frontend kicked by cm_periodic_tasks 
Hello,
We have a small experiment with MIDAS based DAQ.
Status page shows :
ES	ESFrontend@192.168.0.37	207	0.2	0.000
Trigger06	Sample Frontend06@192.168.0.37	1.297M	0.3	0.000
Trigger01	Sample Frontend01@192.168.0.37	1.297M	0.3	0.000
Trigger16	Sample Frontend16@192.168.0.37	1.297M	0.3	0.000
Trigger38	Sample Frontend38@192.168.0.37	1.297M	0.3	0.000
Trigger37	Sample Frontend37@192.168.0.37	1.297M	0.3	0.000
Trigger03	Sample Frontend03@192.168.0.38	1.297M	0.3	0.000
Trigger07	Sample Frontend07@192.168.0.38	1.297M	0.3	0.000
Trigger04	Sample Frontend04@192.168.0.38	59898	0.0	0.000
Trigger08	Sample Frontend08@192.168.0.38	59898	0.0	0.000
Trigger17	Sample Frontend17@192.168.0.38	59898	0.0	0.000


And SYSTEM buffers page shows:
ESFrontend	1968	198	47520	0	0x00000000	0		
193 ms
Sample Frontend06	1332547	1330826	379729872	0	0x00000000	
0		1.1 sec
Sample Frontend16	1332542	1330839	361988208	0	0x00000000	
0		94 ms
Sample Frontend37	1332530	1330841	337798408	0	0x00000000	
0		1.1 sec
Sample Frontend01	1332543	1330829	467136688	0	0x00000000	
0		34 ms
Sample Frontend38	1332528	1330830	291453608	0	0x00000000	
0		1.1 sec
Sample Frontend04	63254	61467	20882584	0	0x00000000	
0		208 ms
Sample Frontend08	63262	61476	27904056	0	0x00000000	
0		205 ms
Sample Frontend17	63271	61473	20433840	0	0x00000000	
0		213 ms
Sample Frontend03	1332549	1330818	386821728	0	0x00000000	
0		82 ms
Sample Frontend07	1332554	1330821	462210896	0	0x00000000	
0		37 ms
Logger	968742	0w+9500418r	0w+2718405736r	0	0x00000000	0	
GET_ALL Used 0 bytes 0.0%	303 ms
rootana	254561	0w+29856958r	0w+8718288352r	0	0x00000000	0		
762 ms


The problem is that eventually some of frontend closed with message 
:19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer 
'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist

in the meantime mserver loggging :
mserver started interactively
mserver will listen on TCP port 1175
double free or corruption (!prev)
double free or corruption (!prev)
free(): invalid next size (normal)
double free or corruption (!prev)


I can find some correlation between number of events/event size produced by 
frontend, cause its failed when its become big enough. 

frontend scheme is like this:

poll event time set to 0;

poll_event{
//if buffer not transferred return (continue cutting the main buffer)
//read main buffer from hardware
//buffer not transfered
}

read event{
// cut the main buffer to subevents (cut one event from main buffer) return;
//if (last subevent) {buffer transfered ;return}
}

What is strange to me that 2 frontends (1 per remote pc) causing this.

Also, I'm executing one FEcode with -i # flag , put setting eventid in 
frontend_init , and using SYSTEM buffer for all.

Is there something I'm missing?
Thanks. 
A.
    Reply  26 Jan 2022, Konstantin Olchanski, Bug Report, some frontend kicked by cm_periodic_tasks 
> The problem is that eventually some of frontend closed with message 
> :19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer 
> 'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist

This messages means what it says. A client was registered with the SYSMSG buffer and this 
client had pid 9789. At some point some other client (rootana, in this case) checked it and 
process pid 9789 was no longer running. (it then proceeded to remove the registration).

There is 2 possibilities:
- simplest: your frontend has crashed. best to debug this by running it inside gdb, wait for 
the crash.
- unlikely: reported pid is bogus, real pid of your frontend is different, the client 
registration in SYSMSG is corrupted. this would indicate massive corruption of midas shared 
memory buffers, not impossible if your frontend misbehaves and writes to random memory 
addresses. ODB has protection against this (normally turned off, easy to enable, set ODB 
"/experiment/protect odb" to yes), shared memory buffers do not have protection against this 
(should be added?).

Do this. When you start your frontend, write down it's pid, when you see the crash message, 
confirm pid number printed is the same. As additional test, run your frontend inside gdb, 
after it crashes, you can print the stack trace, etc.

> 
> in the meantime mserver loggging :
> mserver started interactively
> mserver will listen on TCP port 1175
> double free or corruption (!prev)
> double free or corruption (!prev)
> free(): invalid next size (normal)
> double free or corruption (!prev)
> 

Are these "double free" messages coming from the mserver or from your frontend? (i.e. you run 
them in different terminals, not all in the same terminal?).

If messages are coming from the mserver, this confirms possibility (1),
except that for frontends connected remotely, the pid is the pid of the mserver,
and what we see are crashes of mserver, not crashes of your frontend. These are much harder to 
debug.

You will need to enable core dumps (ODB /Experiment/Enable core dumps set to "y"),
confirm that core dumps work (i.e. "killall -SEGV mserver", observe core files are created
in the directory where you started the mserver), reproduce the crash, run "gdb mserver 
core.NNNN", run "bt" to print the stack trace, post the stack trace here (or email to me 
directly).

>
> I can find some correlation between number of events/event size produced by 
> frontend, cause its failed when its become big enough. 
> 

There is no limit on event size or event rate in midas, you should not see any crash
regardless of what you do. (there is a limit of event size, because an event has
to fit inside an event buffer and event buffer size is limited to 2 GB).

Obviously you hit a bug in mserver that makes it crash. Let's debug it.

One thing to try is set the write cache size to zero and see if your crash goes away. I see
some indication of something rotten in the event buffer code if write cache is enabled. This
is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (beware recent confusion
where odb settings have no effect depending on value of "equipment_common_overwrite").

>
> frontend scheme is like this:
> 

Best if you use the tmfe c++ frontend, event data handling is much simpler and we do not
have to debug the convoluted old code in mfe.c.

K.O.

>
> poll event time set to 0;
> 
> poll_event{
> //if buffer not transferred return (continue cutting the main buffer)
> //read main buffer from hardware
> //buffer not transfered
> }
> 
> read event{
> // cut the main buffer to subevents (cut one event from main buffer) return;
> //if (last subevent) {buffer transfered ;return}
> }
> 
> What is strange to me that 2 frontends (1 per remote pc) causing this.
> 
> Also, I'm executing one FEcode with -i # flag , put setting eventid in 
> frontend_init , and using SYSTEM buffer for all.
> 
> Is there something I'm missing?
> Thanks. 
> A.
    Reply  11 Feb 2022, Alexey Kalinin, Bug Report, some frontend kicked by cm_periodic_tasks 
Thanks for the answer.
As soon as I can(possible in a month) I'll try suggestion below:

> One thing to try is set the write cache size to zero and see if your crash goes away. I see
> some indication of something rotten in the event buffer code if write cache is enabled. This
> is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (beware recent confusion
> where odb settings have no effect depending on value of "equipment_common_overwrite").

I tried to change this ODB for one of the frontend via mhttpd/browser, and eventually it goes back 
to default value (1000 as I remember). but this frontend has the minimum rate 50DWORD/~10sec. and 
depending on cashe size it appears in mdump once per 31 events but all aff them . SO its different 
story, but m.b. it has the same solution to play with Write Cashe Size.    


double free message goes from mserver terminal. 
all of the frontends are remote.
I can't exclude crashes of frontend , but when I run ./frontend -i 1(2,3 etc) thet means that I run 
one code for all, and only several causes crash.also I found that crash in frontend happened while 
it do nothing with collected data (last event reached and new data is not ready), but it tries to 
watch for the ODB changes.I mean it crashes iside (while {odb_changes(value in watchdog)}),and I don't 
know what else happenned meanwhile with cahed buffer.

Future plans is to use event buider for frontends when data/signals will be perfectly reasonable 
i/e/ without broken events. for now i kinda worry about if one of frontends will skip one of the 
event inside its buffer.


Thanks for the way to dig into.
A.    

> > The problem is that eventually some of frontend closed with message 
> > :19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer 
> > 'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist
> 
> This messages means what it says. A client was registered with the SYSMSG buffer and this 
> client had pid 9789. At some point some other client (rootana, in this case) checked it and 
> process pid 9789 was no longer running. (it then proceeded to remove the registration).
> 
> There is 2 possibilities:
> - simplest: your frontend has crashed. best to debug this by running it inside gdb, wait for 
> the crash.
> - unlikely: reported pid is bogus, real pid of your frontend is different, the client 
> registration in SYSMSG is corrupted. this would indicate massive corruption of midas shared 
> memory buffers, not impossible if your frontend misbehaves and writes to random memory 
> addresses. ODB has protection against this (normally turned off, easy to enable, set ODB 
> "/experiment/protect odb" to yes), shared memory buffers do not have protection against this 
> (should be added?).
> 
> Do this. When you start your frontend, write down it's pid, when you see the crash message, 
> confirm pid number printed is the same. As additional test, run your frontend inside gdb, 
> after it crashes, you can print the stack trace, etc.
> 
> > 
> > in the meantime mserver loggging :
> > mserver started interactively
> > mserver will listen on TCP port 1175
> > double free or corruption (!prev)
> > double free or corruption (!prev)
> > free(): invalid next size (normal)
> > double free or corruption (!prev)
> > 
> 
> Are these "double free" messages coming from the mserver or from your frontend? (i.e. you run 
> them in different terminals, not all in the same terminal?).
> 
> If messages are coming from the mserver, this confirms possibility (1),
> except that for frontends connected remotely, the pid is the pid of the mserver,
> and what we see are crashes of mserver, not crashes of your frontend. These are much harder to 
> debug.
> 
> You will need to enable core dumps (ODB /Experiment/Enable core dumps set to "y"),
> confirm that core dumps work (i.e. "killall -SEGV mserver", observe core files are created
> in the directory where you started the mserver), reproduce the crash, run "gdb mserver 
> core.NNNN", run "bt" to print the stack trace, post the stack trace here (or email to me 
> directly).
> 
> >
> > I can find some correlation between number of events/event size produced by 
> > frontend, cause its failed when its become big enough. 
> > 
> 
> There is no limit on event size or event rate in midas, you should not see any crash
> regardless of what you do. (there is a limit of event size, because an event has
> to fit inside an event buffer and event buffer size is limited to 2 GB).
> 
> Obviously you hit a bug in mserver that makes it crash. Let's debug it.
> 
> One thing to try is set the write cache size to zero and see if your crash goes away. I see
> some indication of something rotten in the event buffer code if write cache is enabled. This
> is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (beware recent confusion
> where odb settings have no effect depending on value of "equipment_common_overwrite").
> 
> >
> > frontend scheme is like this:
> > 
> 
> Best if you use the tmfe c++ frontend, event data handling is much simpler and we do not
> have to debug the convoluted old code in mfe.c.
> 
> K.O.
> 
> >
> > poll event time set to 0;
> > 
> > poll_event{
> > //if buffer not transferred return (continue cutting the main buffer)
> > //read main buffer from hardware
> > //buffer not transfered
> > }
> > 
> > read event{
> > // cut the main buffer to subevents (cut one event from main buffer) return;
> > //if (last subevent) {buffer transfered ;return}
> > }
> > 
> > What is strange to me that 2 frontends (1 per remote pc) causing this.
> > 
> > Also, I'm executing one FEcode with -i # flag , put setting eventid in 
> > frontend_init , and using SYSTEM buffer for all.
> > 
> > Is there something I'm missing?
> > Thanks. 
> > A.
Entry  01 Dec 2017, Frederik Wauters, Bug Report, small bug in mfe.c init 
There is a small bug in the mfe.c initialization for the EQ_POLLED mode. There 
is a routine where the number of polls fitting in eq_info->period is counted:


         count = 1;
         do {
            if (display_period)
               printf(".");

            start_time = ss_millitime();

            poll_event(equipment[idx].info.source, (INT)count, TRUE);

            delta_time = ss_millitime() - start_time;

            ...

            if (delta_time > 0)
               count = count * eq_info->period / delta_time;
            else
               count *= 100;
            
            // avoid overflows
            if (count > 2147483647.0) {
               count = 2147483647.0;
               break;
            }
            
         } while (delta_time > eq_info->period * 1.2 || delta_time < eq_info-
>period * 0.8);

As "start_time = ss_millitime();" resets "delta_time" each time, only the 
"avoid overflows" addition saves the day. 

start_time = ss_millitime(); show be out of the loop.
    Reply  01 Dec 2017, Stefan Ritt, Bug Report, small bug in mfe.c init 
> There is a small bug in the mfe.c initialization for the EQ_POLLED mode. There 
> is a routine where the number of polls fitting in eq_info->period is counted:
> 
> 
>          count = 1;
>          do {
>             if (display_period)
>                printf(".");
> 
>             start_time = ss_millitime();
> 
>             poll_event(equipment[idx].info.source, (INT)count, TRUE);
> 
>             delta_time = ss_millitime() - start_time;
> 
>             ...
> 
>             if (delta_time > 0)
>                count = count * eq_info->period / delta_time;
>             else
>                count *= 100;
>             
>             // avoid overflows
>             if (count > 2147483647.0) {
>                count = 2147483647.0;
>                break;
>             }
>             
>          } while (delta_time > eq_info->period * 1.2 || delta_time < eq_info-
> >period * 0.8);
> 
> As "start_time = ss_millitime();" resets "delta_time" each time, only the 
> "avoid overflows" addition saves the day. 
> 
> start_time = ss_millitime(); show be out of the loop.

Nope.

What I want is to determine how often I have to call poll_event to stay there for a certain time (usually 100ms). So I iterate "count" until I roughly get to my 100ms. Each call to 
poll_event with a different count is a new measurement, therefore I initialize start_time before each measurement. If i do it outside the loop, and kind of incrementally increase 
it, then the whole code inside the loop is added to the measurement which makes it (slightly) wrong.

The whole loop optimization has some background. Polling can be sow (think of talking to a device via Ethernet which can easily take milli seconds). So how often do we poll 
before we do other things in the main look (like looking if a run has been started). If I only poll once, then the average front-end response time would be poor, because I mostly 
look if a run has been started in the main loop. This is not effective. If I poll too often inside the poll_event loop, then the front-end does not react on run stops any more. So 
there is some optimum, and this is set by the polling time of usually 100ms. This ensures that the front-end does optimal polling - without ANYTHING in between - for about 
100ms. But how can I know how often I should poll for 100 ms? As said above, polling can be very fast (reading a memory cell) or very slow (network). The the best method I 
found is to do a calibration at the startup, and this is what the code above does. Maybe there are better ways today, but that code worked nicely in the last 25 years.

Stefan
    Reply  04 Dec 2017, Frederik Wauters, Bug Report, small bug in mfe.c init 
> > There is a small bug in the mfe.c initialization for the EQ_POLLED mode. There 
> > is a routine where the number of polls fitting in eq_info->period is counted:
> > 
> > 
> >          count = 1;
> >          do {
> >             if (display_period)
> >                printf(".");
> > 
> >             start_time = ss_millitime();
> > 
> >             poll_event(equipment[idx].info.source, (INT)count, TRUE);
> > 
> >             delta_time = ss_millitime() - start_time;
> > 
> >             ...
> > 
> >             if (delta_time > 0)
> >                count = count * eq_info->period / delta_time;
> >             else
> >                count *= 100;
> >             
> >             // avoid overflows
> >             if (count > 2147483647.0) {
> >                count = 2147483647.0;
> >                break;
> >             }
> >             
> >          } while (delta_time > eq_info->period * 1.2 || delta_time < eq_info-
> > >period * 0.8);
> > 
> > As "start_time = ss_millitime();" resets "delta_time" each time, only the 
> > "avoid overflows" addition saves the day. 
> > 
> > start_time = ss_millitime(); show be out of the loop.
> 
> Nope.
> 
> What I want is to determine how often I have to call poll_event to stay there for a certain time (usually 100ms). So I iterate "count" until I roughly get to my 100ms. Each call to 
> poll_event with a different count is a new measurement, therefore I initialize start_time before each measurement. If i do it outside the loop, and kind of incrementally increase 
> it, then the whole code inside the loop is added to the measurement which makes it (slightly) wrong.
> 
> The whole loop optimization has some background. Polling can be sow (think of talking to a device via Ethernet which can easily take milli seconds). So how often do we poll 
> before we do other things in the main look (like looking if a run has been started). If I only poll once, then the average front-end response time would be poor, because I mostly 
> look if a run has been started in the main loop. This is not effective. If I poll too often inside the poll_event loop, then the front-end does not react on run stops any more. So 
> there is some optimum, and this is set by the polling time of usually 100ms. This ensures that the front-end does optimal polling - without ANYTHING in between - for about 
> 100ms. But how can I know how often I should poll for 100 ms? As said above, polling can be very fast (reading a memory cell) or very slow (network). The the best method I 
> found is to do a calibration at the startup, and this is what the code above does. Maybe there are better ways today, but that code worked nicely in the last 25 years.
> 
> Stefan

Thanks, I misunderstood the loop then. If poll_event(equipment[idx].info.source, (INT)count, TRUE); doesn`t do anything with "count", the loop becomes infinite except for the overflow 
check. 
    Reply  04 Dec 2017, Stefan Ritt, Bug Report, small bug in mfe.c init 
> Thanks, I misunderstood the loop then. If poll_event(equipment[idx].info.source, (INT)count, TRUE); doesn`t do anything with "count", the loop becomes infinite except for the overflow 
> check. 

Well, the function poll_event() is _supposed_ to use "count" in a for loop as written in the example frontend:

   for (i = 0; i < count; i++) {
      /* poll hardware and set flag to TRUE if new event is available */
      flag = TRUE;

      if (flag)
         if (!test)
            return TRUE;
   }

where "flag = TRUE" must be replaced with the proper hardware check. This can be a VME access, a network TCP exchange with some Ethernet based hardware, or even a mutex check if the events are collected by a 
separate thread in the frontend.

The idea of having the for (i=0 ; i<count ; i++) loop _inside_ the poll_event() function and not outside is the fact that each function call to poll_event() takes time, and we want the minimal possible response time to new 
events. It might be just a micro-second, but having an experiment running at 100 Hz for one year (like Mu3e), this adds up to about one hour per year, which is a considerable amount of precious beam time.

Stefan
Entry  10 Jun 2020, Ivo Schulthess, Forum, slow-control equipment crashes when running multi-threaded on a remote machine 
Dear all

To reduce the time needed by Midas between runs, we want to change some of our periodic equipment to multi-threaded slow-control equipment. To do that I wanted to start from 
the slowcont with the multi/hv class driver and the nulldev device driver and null bus driver. The example runs fine as it is on the local midas machine and also on remote 
machines. When adding the DF_MULTITHREAD flag to the device driver list, it does not run anymore on remote machines but aborts with the following assertion:

scfe: /home/neutron/packages/midas/src/midas.cxx:1569: INT cm_get_path(char*, int): Assertion `_path_name.length() > 0' failed.

Running the frontend with GDB and set a breakpoint at the exit leads to the following: 

(gdb) where
#0  0x00007ffff68d599f in raise () from /lib64/libc.so.6
#1  0x00007ffff68bfcf5 in abort () from /lib64/libc.so.6
#2  0x00007ffff68bfbc9 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007ffff68cde56 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000041efbf in cm_get_path (path=0x7fffffffd060 "P\373g", path_size=256)
    at /home/neutron/packages/midas/src/midas.cxx:1563
#5  cm_get_path (path=path@entry=0x7fffffffd060 "P\373g", path_size=path_size@entry=256)
    at /home/neutron/packages/midas/src/midas.cxx:1563
#6  0x0000000000453dd8 in ss_semaphore_create (name=name@entry=0x7fffffffd2c0 "DD_Input", 
    semaphore_handle=semaphore_handle@entry=0x67f700 <multi_driver+96>)
    at /home/neutron/packages/midas/src/system.cxx:2340
#7  0x0000000000451d25 in device_driver (device_drv=0x67f6a0 <multi_driver>, cmd=<optimized out>)
    at /home/neutron/packages/midas/src/device_driver.cxx:155
#8  0x00000000004175f8 in multi_init(eqpmnt*) ()
#9  0x00000000004185c8 in cd_multi(int, eqpmnt*) ()
#10 0x000000000041c20c in initialize_equipment () at /home/neutron/packages/midas/src/mfe.cxx:827
#11 0x000000000040da60 in main (argc=1, argv=0x7fffffffda48)
    at /home/neutron/packages/midas/src/mfe.cxx:2757

I also tried to use the generic class driver which results in the same. I am not sure if this is a problem of the multi-threaded frontend running on a remote machine or is it 
something of our system which is not properly set up. Anyway I am running out of ideas how to solve this and would appreciate any input. 

Thanks in advance,
Ivo
    Reply  10 Jun 2020, Konstantin Olchanski, Forum, slow-control equipment crashes when running multi-threaded on a remote machine 
Yes, it is supposed to crash. On a remote frontend, cm_get_path() cannot be used
(we are on a different computer, all filesystems maybe no the same!) and is actually not set and
triggers a trap if something tries to use it. (this is the crash you see).

The caller to cm_get_path() is a MIDAS semaphore function.

And I think there is a mistake here. It is unusual for the driver framework to use a semaphore. For multithreaded
protection inside the frontend, a mutex would normally be used. (and mutexes do not use cm_get_path(), so
all is good). But if a semaphore is used, than all frontends running on the same computer become
serialized across the locked section. This is the right thing to do if you have multiple frontends
sharing the same hardware, i.e. a /dev/ttyUSB serial line, but why would a generic framework function
do this. I am not sure, I will have to take a look at why there is a semaphore and what it is locking/protecting.

(In midas, semaphores are normally used to protect global memory, such as ODB, or global resources, such as alarms,
against concurrent access by multiple programs, but of course that does not work for remote frontends,
they are on a different computer! semaphores only work locally, do not work across the network!)

K.O.

> 
> scfe: /home/neutron/packages/midas/src/midas.cxx:1569: INT cm_get_path(char*, int): Assertion `_path_name.length() > 0' failed.
> 
> Running the frontend with GDB and set a breakpoint at the exit leads to the following: 
> 
> (gdb) where
> #0  0x00007ffff68d599f in raise () from /lib64/libc.so.6
> #1  0x00007ffff68bfcf5 in abort () from /lib64/libc.so.6
> #2  0x00007ffff68bfbc9 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
> #3  0x00007ffff68cde56 in __assert_fail () from /lib64/libc.so.6
> #4  0x000000000041efbf in cm_get_path (path=0x7fffffffd060 "P\373g", path_size=256)
>     at /home/neutron/packages/midas/src/midas.cxx:1563
> #5  cm_get_path (path=path@entry=0x7fffffffd060 "P\373g", path_size=path_size@entry=256)
>     at /home/neutron/packages/midas/src/midas.cxx:1563
> #6  0x0000000000453dd8 in ss_semaphore_create (name=name@entry=0x7fffffffd2c0 "DD_Input", 
>     semaphore_handle=semaphore_handle@entry=0x67f700 <multi_driver+96>)
>     at /home/neutron/packages/midas/src/system.cxx:2340
> #7  0x0000000000451d25 in device_driver (device_drv=0x67f6a0 <multi_driver>, cmd=<optimized out>)
>     at /home/neutron/packages/midas/src/device_driver.cxx:155
> #8  0x00000000004175f8 in multi_init(eqpmnt*) ()
> #9  0x00000000004185c8 in cd_multi(int, eqpmnt*) ()
> #10 0x000000000041c20c in initialize_equipment () at /home/neutron/packages/midas/src/mfe.cxx:827
> #11 0x000000000040da60 in main (argc=1, argv=0x7fffffffda48)
>     at /home/neutron/packages/midas/src/mfe.cxx:2757
> 
> I also tried to use the generic class driver which results in the same. I am not sure if this is a problem of the multi-threaded frontend running on a remote machine or is it 
> something of our system which is not properly set up. Anyway I am running out of ideas how to solve this and would appreciate any input. 
> 
> Thanks in advance,
> Ivo
    Reply  10 Jun 2020, Stefan Ritt, Forum, slow-control equipment crashes when running multi-threaded on a remote machine 
Few comments:

- As KO write, we might need semaphores also on a remote front-end, in case several programs share the same hardware. So it should work and cm_get_path() should not just exit

- When I wrote the multi-threaded device drivers, I did use semaphores instead of mutexes, but I forgot why. Might be that midas semaphores have a timeout and mutexes not, or 
something along those lines.

- I do need either semaphores or mutexes since in a multi-threaded slow-control font-end (too many dashes...) several threads have to access an internal data exchange buffer, which 
needs protection for multi-threaded environments.

So we can how either fix cm_get_path() or replace all semaphores in with mutexes in midas/src/device_driver.cxx. I have kind of a feeling that we should do both. And what about 
switching to c++ std::mutex instead of pthread mutexes?

Stefan
    Reply  12 Jun 2020, Ivo Schulthess, Forum, slow-control equipment crashes when running multi-threaded on a remote machine 
Thanks you two once again for the very fast answers. I tested the example on the local machine and it works perfectly fine. In the meantime I also created two new drivers for our devices 
and everything works with them, the improvement in time is significant and I will create drivers for all our devices where possible. If they are in a working state I can also provide 
them to add to the Midas drivers. Of course if it would be possible to run the front-end also on our remote machines this would be even better. I am not experienced in any multi-threaded 
programming but if I can provide any help or input, please let me know. 

Have a great weekend,
Ivo
Entry  10 Jan 2024, Pavel Murat, Forum, slow control frontends - how much do they sleep and how often their drivers are called?  
Dear all,

I have implemented a number of slow control frontends which are directed to update the 
history once in every 10 sec, and they do just that. 

I expected that such frontends would be spending most of the time sleeping and waking up 
once in ten seconds to call their respective drivers and send the data to the server. 

However I observe that each frontend process consumes almost 100% of a single core CPU time 
and the frontend driver is called many times per second. 

Is that the expected behavior ?

So far, I couldn't find the place in the system part of the frontend code (is that the right 
place to look for?) which regulates the frequency of the frontend driver calls, so I'd greatly 
appreciate if someone could point me to that place.

I'm using the following commit:

commit 30a03c4c develop origin/develop Make sure line numbers and sequence lines are aligned.

-- many thanks, regards, Pasha
    Reply  11 Jan 2024, Stefan Ritt, Forum, slow control frontends - how much do they sleep and how often their drivers are called?  
Put a 

  ss_sleep(10);

into your frontend_loop(), then you should be fine.

The event loop runs as fast as possible in order not to miss any (triggered) event, so no seep in the 
event loop, because this would limit the (triggered) event rate to 100 Hz (minimum sleep is 10 ms). 
Therefore, you have to slow down the event loop manually with the method described above.

Best,
Stefan
    Reply  11 Jan 2024, Pavel Murat, Forum, slow control frontends - how much do they sleep and how often their drivers are called?  
Hi Stefan, thanks a lot !

I just thought that for the EQ_SLOW type equipment calls to sleep() could be hidden in mfe.cxx 
and handled based on the requested frequency of the history updates.

Doing the same in the user side is straighforward - the important part is to know where the 
responsibility line goes (: smile :) 

-- regards, Pasha
    Reply  12 Jan 2024, Stefan Ritt, Forum, slow control frontends - how much do they sleep and how often their drivers are called?  
> Hi Stefan, thanks a lot !
> 
> I just thought that for the EQ_SLOW type equipment calls to sleep() could be hidden in mfe.cxx 
> and handled based on the requested frequency of the history updates.

Most people combine EQ_SLOW with EQ_POLLED, so they want to read out as quickly as possible. Since 
the framework cannot "guess" what the users want there, I removed all sleep() in the framework.



> Doing the same in the user side is straighforward - the important part is to know where the 
> responsibility line goes (: smile :) 


Pushing this to the user gives you more freedom. Like you can add sleep() for some frontends, but not 
for others, only when the run is stopped and more.

Stefan
    Reply  28 Jan 2024, Konstantin Olchanski, Forum, slow control frontends - how much do they sleep and how often their drivers are called?  
> I have implemented a number of slow control frontends which are directed to update the 
> history once in every 10 sec, and they do just that. 

I suggest that you switch from the old mfe.c frontend framework to the new tmfe framework that was 
designed to solve exactly this type of problems.

Look at .../midas/progs/tmfe_example*.cxx

You have a choice of:
- single threaded frontend, most robust, no race conditions, but readout is interrupted during 
begin/end of run.
- two-threaded frontend, your periodic equipments run in one thread, midas loop and rpc run in a 
different thread, you have to handle locking yourself.
- you can run each of your equipments in it's own thread without help from the framework, it is 
obvious how to do it if you can program c++ threads, "new std::thread" to create/start a thread, 
stop threads using a binary flag, thread->join() to reap them at the end (or thread sanitizer will 
complain).

K.O.
Entry  10 May 2011, Jianglai Liu, Forum, simple example frontend for V1720  
Hi,

Who has a good example of a frontend program using CAEN V1718 VME-USB bridge and
V1720 FADC? I am trying to set up the DAQ for such a simple system.

I put together a frontend which talks to the VME. However it gets stuck at
"Calibrating" in initialize_equipment().

I'd appreciate some help!

Thanks,
Jianglai
    Reply  10 May 2011, Stefan Ritt, Forum, simple example frontend for V1720  

Jianglai Liu wrote:
Hi,

Who has a good example of a frontend program using CAEN V1718 VME-USB bridge and
V1720 FADC? I am trying to set up the DAQ for such a simple system.

I put together a frontend which talks to the VME. However it gets stuck at
"Calibrating" in initialize_equipment().

I'd appreciate some help!

Thanks,
Jianglai


During "Calibrating", the framework calls your poll_event() routine. You code there accesses for the first time the VME crate and probably gets stuck.
    Reply  10 May 2011, Pierre-Andre Amaudruz, Forum, simple example frontend for V1720  

Jianglai Liu wrote:
Hi,

Who has a good example of a frontend program using CAEN V1718 VME-USB bridge and
V1720 FADC? I am trying to set up the DAQ for such a simple system.

I put together a frontend which talks to the VME. However it gets stuck at
"Calibrating" in initialize_equipment().

I'd appreciate some help!

Thanks,
Jianglai


Under the drivers/vme you can find code for the v1720.c (VME access) and ov1720.c
(A2818/A3818 PCIe optical link access). For testing the hardware, we use this code compiled and linked
with MAIN_ENABLE to confirm its functionality. You may want to do the same for your USB. Once this
is under control, the Midas frontend implementation using the same driver shouldn't give you trouble.
    Reply  18 May 2011, Jimmy Ngai, Forum, simple example frontend for V1720  frontend.cv1718.hv1718.cv792n.hv792n.c

Jianglai Liu wrote:
Hi,

Who has a good example of a frontend program using CAEN V1718 VME-USB bridge and
V1720 FADC? I am trying to set up the DAQ for such a simple system.

I put together a frontend which talks to the VME. However it gets stuck at
"Calibrating" in initialize_equipment().

I'd appreciate some help!

Thanks,
Jianglai


Hi Jianglai,

I don't have an exmaple of using V1718 with V1720, but I have been using V1718 with V792N for a long time.

You may find in the attachment an example frontend program and my drivers for V1718 and V792N written in MVMESTD format. They have to be linked with the CAENVMELib library and other essential MIDAS stuffs.

Regards,
Jimmy
ELOG V3.1.4-2e1708b5