Back Midas Rome Roody Rootana
  Midas DAQ System, Page 132 of 150  Not logged in ELOG logo
ID Date Author Topicdown Subject
  2904   26 Nov 2024 Nick HastingsBug ReportTMFE::Sleep() errors
Hello,

I've noticed that SC FEs that use the TMFE class with midas-2022-05-c often report errors when calling TMFE:Sleep().
The error is :

[tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).

This seems to happen in two different ways:

1. Error being reported repeatedly
2. Occasional single errors being reported

When the first of these presents, we typically restart the FE to "solve" the problem.
Case 2. is typically ignored.

The code in question is:

void TMFE::Sleep(double time)
{
   int status;
   fd_set fdset;
   struct timeval timeout;
      
   FD_ZERO(&fdset);
      
   timeout.tv_sec = time;
   timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;

   while (1) {
      status = select(1, &fdset, NULL, NULL, &timeout);
#ifdef EINTR
      if (status < 0 && errno == EINTR) {
         continue;
      }
#endif
      break;
   }
      
   if (status < 0) {
      TMFE::Instance()->Msg(MERROR, "TMFE::Sleep", "select() returned %d, errno %d (%s)", status, errno, strerror(errno));
   }
}

So it looks like either file descriptor of the timeval struct must have a problem.
From some reading it seems that invalid timeval structs are often caused by one or both
of tv_sec or tv_usec not being set. In the code above we can see that both appear to be
correctly set initially.

From the select() man page I see:

RETURN VALUE
       On success, select() and pselect() return the number of file descriptors contained in
       the three returned descriptor sets (that is, the total number of bits that are set in
       readfds,  writefds,  exceptfds).  The return value may be zero if the timeout expired
       before any file descriptors became ready.

       On error, -1 is returned, and errno is set to indicate the error; the file descriptor
       sets are unmodified, and timeout becomes undefined.

The second paragraph quoted from the man page above would indicate to me that perhaps the
timeout needs to be reset inside the if block. eg:

      if (status < 0 && errno == EINTR) {
         timeout.tv_sec = time;
         timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
         continue;
      }

Please note that I've only just briefly looked at this and was hoping someone more
familiar with using select() as a way to sleep() might be better able to understand
what is happening.

I wonder also if now that midas requires stricter/newer c++ standards if there maybe
some more straightforward method to sleep that is sufficiently robust and portable.

Thanks,

Nick.
  2905   26 Nov 2024 Maia Henriksson-WardBug ReportTMFE::Sleep() errors
> Hello,
> 
> I've noticed that SC FEs that use the TMFE class with midas-2022-05-c often report errors when calling TMFE:Sleep().
> The error is :
> 
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).
> 
> This seems to happen in two different ways:
> 
> 1. Error being reported repeatedly
> 2. Occasional single errors being reported
> 
> When the first of these presents, we typically restart the FE to "solve" the problem.
> Case 2. is typically ignored.
> 
> The code in question is:
> 
> void TMFE::Sleep(double time)
> {
>    int status;
>    fd_set fdset;
>    struct timeval timeout;
>       
>    FD_ZERO(&fdset);
>       
>    timeout.tv_sec = time;
>    timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
> 
>    while (1) {
>       status = select(1, &fdset, NULL, NULL, &timeout);
> #ifdef EINTR
>       if (status < 0 && errno == EINTR) {
>          continue;
>       }
> #endif
>       break;
>    }
>       
>    if (status < 0) {
>       TMFE::Instance()->Msg(MERROR, "TMFE::Sleep", "select() returned %d, errno %d (%s)", status, errno, strerror(errno));
>    }
> }
> 
> So it looks like either file descriptor of the timeval struct must have a problem.
> From some reading it seems that invalid timeval structs are often caused by one or both
> of tv_sec or tv_usec not being set. In the code above we can see that both appear to be
> correctly set initially.
> 
> From the select() man page I see:
> 
> RETURN VALUE
>        On success, select() and pselect() return the number of file descriptors contained in
>        the three returned descriptor sets (that is, the total number of bits that are set in
>        readfds,  writefds,  exceptfds).  The return value may be zero if the timeout expired
>        before any file descriptors became ready.
> 
>        On error, -1 is returned, and errno is set to indicate the error; the file descriptor
>        sets are unmodified, and timeout becomes undefined.
> 
> The second paragraph quoted from the man page above would indicate to me that perhaps the
> timeout needs to be reset inside the if block. eg:
> 
>       if (status < 0 && errno == EINTR) {
>          timeout.tv_sec = time;
>          timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
>          continue;
>       }
> 
> Please note that I've only just briefly looked at this and was hoping someone more
> familiar with using select() as a way to sleep() might be better able to understand
> what is happening.
> 
> I wonder also if now that midas requires stricter/newer c++ standards if there maybe
> some more straightforward method to sleep that is sufficiently robust and portable.
> 
> Thanks,
> 
> Nick.

I had the same error a few months ago, though I wasn't using a tagged release. It happened because I was calling TMFE::Sleep() 
with a negative time. If your issues were caused by the same reason, TMFE::Sleep() can handle negative times since commit 
591f78f (https://bitbucket.org/tmidas/midas/commits/591f78f52893d5ffd64bf4e52a1daac537ebd672).

Early in my debugging, I did come to the same conclusions you did, and actually tried a similar solution the one you suggested. 
This was a few months ago and I didn't write down what happened, but I believe it didn't work because in my case the errno was 
something other than EINTR, and/or the timeval was still an invalid argument for sleep because the timeout was still negative. I 
never followed it up because I was able to fix my problem by fixing my frontend.
  2906   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).

The very original copy of this function had an error and was spewing out this error quite often,
this was a missing handler for EINTR.

Now it looks like we are missing a handler for EINVAL.

Most likely sleep is called with a funny sleep time value that fills struct timeval with
values select() does not like.

I see Ben added a check for negative sleep times, and this is good.

I think I will do these changes:

a) add an error message for negative sleep time, I think user should never call ::Sleep with negative or zero sleep times and 
if they do it is a bug and they should fix it, the error message will inform them so.

b) add a handler for EINVAL, which will report the requested sleep time and the values in struct timeval

K.O.
  2907   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> 
> I wonder also if now that midas requires stricter/newer c++ standards if there maybe
> some more straightforward method to sleep that is sufficiently robust and portable.
> 

I believe POSIX defined clock_nanosleep() & co, so on most recent machines that is the most portable way to sleep.

Historically, select() was the only way to sleep for less than 1 sec, but it was never portable
because of differences between BSD UNIX and Linux implementations. (MacOS is BSD UNIX via FreeBSD).

On difference is the update of struct timeval is select() is interrupted.

In this elog entry, I compare sleep using select() with sleep using clock_nanosleep() and see that there is no difference:
https://daq00.triumf.ca/elog-midas/Midas/2115

As you can see tmfe.cxx has both implementations, select() and clock_nanosleep(), and anybody can try which one works better on 
their computer.

K.O.
  2908   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
>       status = select(1, &fdset, NULL, NULL, &timeout);
>
>       On error, -1 is returned, ... timeout becomes undefined.

I have been reading "man select" for 30 years and I do not remember seeing this text.

I believe on BSD UNIX (MacOS) it says timeout is unchanged and on Linux is says timeout is updated to time actually slept.

I will have to investigate, but I suspect the man page was posix-ized, by sweeping BSD/MacOS and Linux implementations
under the same "instead of saying what it actually does, we will just say 'undefined'".

In any case, EINTR is not an error, it's an artefact of UNIX signal handling. Linux implementations always try
very hard to handle signals without causing EINTR to select(), read() and write(). This is most painful
when reading and writing to TCP sockets, because one most handle partial reads and EINTR.

K.O.
  2909   30 Nov 2024 Pavel MuratBug ReportEQ_PERIODIC-only equipment ?
Dear Midas experts, 

I'm running into something which looks like an initialization problem. 
I have a mfe.cxx-style frontend which introduces an equipment of the EQ_PERIODIC type (EQ_PERIODIC-only!). 
When Midas enters the running state, I see the frontend crashing. 
Stepping through the code shows that the frontend is crashing because its equipment has been ignored 
by the initialize_equipment@mfe.cxx - see
 
https://bitbucket.org/tmidas/midas/src/5d0dae001712164ae43137dced2fbbb594f0201e/src/mfe.cxx#lines-630

Is there an assumption that the initialization of the EQ_PERIODIC-only equipment is the user responsibility? 
Or EQ_PERIODIC should always come paired with some other type?

-- many thanks, regards, Pasha
  2910   01 Dec 2024 Stefan RittBug ReportEQ_PERIODIC-only equipment ?
There is no requirement that you pair an EQ_PERIODIC with an EQ_TRIGGER. Take for exmaple

  midas/examples/experiment/frontend.cxx

and remove there the triggered event. The frontend runs happily with the periodic event only (I just tried that myself). You have probably some problem in 
your event definition. Start with the running example frontend, and add your code line by line until you see the problem.

Stefan
  2911   01 Dec 2024 Pavel MuratBug ReportEQ_PERIODIC-only equipment ?
> There is no requirement that you pair an EQ_PERIODIC with an EQ_TRIGGER. Take for exmaple
> 
>   midas/examples/experiment/frontend.cxx
> 
> and remove there the triggered event. The frontend runs happily with the periodic event only (I just tried that myself). You have probably some problem in 
> your event definition. Start with the running example frontend, and add your code line by line until you see the problem.

Hi Stefan, thank you very much! 

As the pointer to the readout function and pointers to device drivers are all defined in the same structure (EQUIPMENT), 
I was naively assuming that the readout function should be set during the class driver initialization.
Now it is clear that the equipment responding to EQ_PERIODIC events doesn't have to have drivers, 
and specifying the readout function is the responsibility of the user.

I got around exactly this way yesterday, but was thinking that I was hacking the system :)
 
-- regards, Pasha
  2912   02 Dec 2024 Stefan RittBug ReportODB key picker does not close when creating link / Edit-on-run string box too large
> Actual result:
> The key picker does not close.

Thanks for reporting that bug. It has been fixed in the current commit (installed already on megon02)

Stefan
  2913   02 Dec 2024 Stefan RittBug ReportODB key picker does not close when creating link / Edit-on-run string box too large
> Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the 
> size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters, 
> LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a 
> reasonable width.

I limited the input size now to (arbitrarily) 100 chars. The string can still be longer than 100 chars, and you start then scrolling inside the input box. Let me know if 
that's ok this way.

Stefan
  2915   04 Dec 2024 Konstantin OlchanskiBug ReportODB key picker does not close when creating link / Edit-on-run string box too large
> > Actual result:
> > The key picker does not close.
> 
> Thanks for reporting that bug. It has been fixed in the current commit (installed already on megon02)
>

Stefan, thank you for fixing both problems, I have seen them, too, but no time to deal with them.

K.O.
  2916   05 Dec 2024 Konstantin OlchanskiBug ReportODB key picker does not close when creating link / Edit-on-run string box too large
> > Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the 
> > size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters, 
> > LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a 
> > reasonable width.
> 
> I limited the input size now to (arbitrarily) 100 chars. The string can still be longer than 100 chars, and you start then scrolling inside the input box. Let me know if 
> that's ok this way.

I am moving the dragon experiment to the new midas and we see this problem on the begin-of-run page.

Old midas: no horizontal scroll bar, edit-on-start names, values and comments are all squeezed in into the visible frame.

New midas: page is very wide, values entry fields are very long and there is a horizontal scroll bar.

So something got broken in the htlm formatting or sizing. I should be able to spot the change by doing
a diff between old resources/start.html and the new one.

K.O.
  2918   06 Dec 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> >       status = select(1, &fdset, NULL, NULL, &timeout);
> >       On error, -1 is returned, ... timeout becomes undefined.

I confirm Linux and MacOS man pages and select() with EINTR work as I remember, Linux updates "timeout" to account for the 
time already slept, MacOS does not ("timeout" is unchanged).

So the original code is roughly correct, but long sleeps will not work right if SIGALRM fires during sleeping.

Note that MIDAS no longer uses SIGALRM to fire cm_watchdog() (it was moved to a thread) and MIDAS does not use signals,
so handling of EINTR is now moot.

(Please correct me if I missed something).

The original bug report was about EINVAL, and best I can tell, it was caused by calls to TMFE::Sleep()
with strange sleep times that caused invalid values to be computed into the select() timeout.

To improve on this, I make these changes:

1) TMFE::Sleep(0) will report an error and will not sleep
2) TMFE::Sleep(negative number) will report an error and will not sleep

(please check the sleep time before calling TMFE::Sleep())

3) TMFE::Sleep(1 sec or less) will sleep using select(). (I also looked into using poll(), ppoll() and pselect()).
4) TMFE::Sleep(more than 1 second) will use a loop to sleep in increments of 1 second and will use one additional syscall to 
read the current time to decide how much more to sleep.

5) if select() returns EINVAL, the error message will reporting the sleep time and the values in "timeout".

A side effect of this is that on both Linux and MacOS long sleeps work correctly if interrupted by SIGALRM,
because SIGALRM granularity is 1 sec and our sleep time is also 1 sec.

Commit [develop 06735d29] improve TMFE::Sleep()

K.O.
  2919   06 Dec 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> Commit [develop 06735d29] improve TMFE::Sleep()

Report of test_sleep on Ubuntu-22 (Intel E-2236) and MacOS 14.6.1 (M1 MAX Mac Studio).

It is easy to see Ubuntu-22 (kernel 6.2.x) sleep granularity is ~60 usec, MacOS sleep granularity is <2 usec. (Sub 1 usec sleep likely measures the syscall() speed, 500 ns on Intel and 200 
ns on ARM M1 MAX). (NOTE: long sleep is interrupted by an alarm roughly 10 seconds into the sleep, see progs/test_sleep.cxx)

daq00:midas$ uname -a
Linux daq00.triumf.ca 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
daq00:midas$ ./bin/test_sleep 
test short sleep:
sleep      10 loops,   100000.000 usec per loop,  1.000000000 sec total,  1.002568007 sec actual total,   100256.801 usec actual per loop, oversleep  256.801 usec,    0.3%
sleep     100 loops,    10000.000 usec per loop,  1.000000000 sec total,  1.025897980 sec actual total,    10258.980 usec actual per loop, oversleep  258.980 usec,    2.6%
sleep    1000 loops,     1000.000 usec per loop,  1.000000000 sec total,  1.169670105 sec actual total,     1169.670 usec actual per loop, oversleep  169.670 usec,   17.0%
sleep   10000 loops,      100.000 usec per loop,  1.000000000 sec total,  1.573357105 sec actual total,      157.336 usec actual per loop, oversleep   57.336 usec,   57.3%
sleep   99999 loops,       10.000 usec per loop,  0.999990000 sec total,  6.963442087 sec actual total,       69.635 usec actual per loop, oversleep   59.635 usec,  596.4%
sleep 1000000 loops,        1.000 usec per loop,  1.000000000 sec total, 60.939687967 sec actual total,       60.940 usec actual per loop, oversleep   59.940 usec, 5994.0%
sleep 1000000 loops,        0.100 usec per loop,  0.100000000 sec total,  0.613572121 sec actual total,        0.614 usec actual per loop, oversleep    0.514 usec,  513.6%
sleep 1000000 loops,        0.010 usec per loop,  0.010000000 sec total,  0.576359987 sec actual total,        0.576 usec actual per loop, oversleep    0.566 usec, 5663.6%
test long sleep: requested 126.000000000 sec ... sleeping ...
alarm!
test long sleep: requested 126.000000000 sec, actual 126.000875950 sec

bash-3.2$ uname -a
Darwin send.triumf.ca 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
bash-3.2$ ./bin/test_sleep 
test short sleep:
sleep      10 loops,   100000.000 usec per loop,  1.000000000 sec total,  1.032556057 sec actual total,   103255.606 usec actual per loop, oversleep 3255.606 usec,    3.3%
sleep     100 loops,    10000.000 usec per loop,  1.000000000 sec total,  1.245460033 sec actual total,    12454.600 usec actual per loop, oversleep 2454.600 usec,   24.5%
sleep    1000 loops,     1000.000 usec per loop,  1.000000000 sec total,  1.331466913 sec actual total,     1331.467 usec actual per loop, oversleep  331.467 usec,   33.1%
sleep   10000 loops,      100.000 usec per loop,  1.000000000 sec total,  1.281141996 sec actual total,      128.114 usec actual per loop, oversleep   28.114 usec,   28.1%
sleep   99999 loops,       10.000 usec per loop,  0.999990000 sec total,  1.410759926 sec actual total,       14.108 usec actual per loop, oversleep    4.108 usec,   41.1%
sleep 1000000 loops,        1.000 usec per loop,  1.000000000 sec total,  2.400593996 sec actual total,        2.401 usec actual per loop, oversleep    1.401 usec,  140.1%
sleep 1000000 loops,        0.100 usec per loop,  0.100000000 sec total,  0.188431025 sec actual total,        0.188 usec actual per loop, oversleep    0.088 usec,   88.4%
sleep 1000000 loops,        0.010 usec per loop,  0.010000000 sec total,  0.188102007 sec actual total,        0.188 usec actual per loop, oversleep    0.178 usec, 1781.0%
test long sleep: requested 126.000000000 sec ... sleeping ...
alarm!
test long sleep: requested 126.000000000 sec, actual 126.001244068 sec

K.O.
  2923   17 Dec 2024 Lukas GerritzenBug Report[History plots] "Jump to current time" resets x range to 7d
To reproduce:
- Open a history plot, click [-] a few times until the x axis shows more than 7 days.
- Scroll to the past (left)
- Click "Jump to current time" (the triangle)

Expected result:
The upper limit of the x axis is at the current time and the lower range is now - whatever range you had before 
(>7d)

Actual result:
The upper limit is the current time, the lower limit is now - 7d

(The interval seems unchanged if the range was < 7d before clicking "Jump to current time")
  2924   19 Dec 2024 Stefan RittBug Report[History plots] "Jump to current time" resets x range to 7d
I had put in a check which limits the range to 7d into the past if you press the "play" button, but now I'm not sure why this was needed. I removed it again and 
things seem to be fine. Change is committed to develop.

Stefan
  2936   01 Feb 2025 Pavel MuratBug ReportMIDAS history system not using the event timestamps ?
> I have a time series of slow control measurements in an ASCII format - 
> data records in a format (run_number, time, temperature, voltage1, ..., voltageN), 
> and, if possible, would like to convert them into a MIDAS history format. 
> 
> Making MIDAS events out of that data is easy, but is it possible to preserve 
> the time stamps?  - Logically, this boils down to whether it is possible to have  
> the event time set by a user frontend

It looks that the original question was not as naive as I expected and may be pointing to a subtle bug. 
I have implemented a python frontend - essentially a clone of 

https://bitbucket.org/tmidas/midas/src/develop/python/midas/frontend.py

reading the old slow control data and setting the event.header.timestamp's to some dates from the year of 2022. 

When I run MIDAS and read the "old slow control events", one event in 10 seconds, 
the MIDAS Event Dump utility shows the data with the correct event timestamps, from the year of 2022. 

However the history plots show the event parameters with the timestamps from Feb 01 2025 and the adjacent 
data points separated by 10 sec.

Is it possible that the history system uses its own timestamp setting instead of using timestamps from the event headers? 
- Under normal circumstances, the two should be very close, and that could've kept the issue hidden... 

-- thanks, regards, Pasha

UPDATE:  I attached the frontend code and the input data file it is reading. The data file should reside in the local directory
- the frontend code doesn't have everything fully automated for the test, 
  -- an integer field "/Mu2e/Offline/Ops/LastTime" would need to be created manually
  -- the history plots would need to be declared manually
  2943   19 Feb 2025 Lukas GerritzenBug ReportDefault write cache size for new equipments breaks compatibility with older equipments
We have a frontend for slow control with a lot of legacy code. I wanted to add a new equipment using the
mdev_mscb class. It seems like the default write cache size is 1000000B now, which produces error
messages like this:
12:51:20.154 2025/02/19 [SC Frontend,ERROR] [mfe.cxx:620:register_equipment,ERROR] Write cache size mismatch for buffer "SYSTEM": equipment "Environment" asked for 0, while eqiupment "LED" asked for 10000000
12:51:20.154 2025/02/19 [SC Frontend,ERROR] [mfe.cxx:620:register_equipment,ERROR] Write cache size mismatch for buffer "SYSTEM": equipment "LED" asked for 10000000, while eqiupment "Xenon" asked for 0

I can manually change the write cache size in /Equipment/LED/Common/Write cache size to 0. However, if I delete the LED tree in the ODB, then I get the same problems again. It would be nice if I could either choose the size as 0 in the frontend code, or if the defaults were compatible with our legacy code.

The commit that made the write cache size configurable seems to be from 2019: https://bitbucket.org/tmidas/midas/commits/3619ecc6ba1d29d74c16aa6571e40920018184c0
  2944   24 Feb 2025 Stefan RittBug ReportDefault write cache size for new equipments breaks compatibility with older equipments
The commit that introduced the write cache size check is https://bitbucket.org/tmidas/midas/commits/3619ecc6ba1d29d74c16aa6571e40920018184c0

Unfortunately K.O. added the write cache size to the equipment list, but there is currently no way to change this programmatically from the user frontend code. The options I see are

1) Re-arrange the equipment settings so that the write case size comes to the end of the list which the user initializes, like
   {"Trigger",               /* equipment name */
      {1, 0,                 /* event ID, trigger mask */
         "SYSTEM",           /* event buffer */
         EQ_POLLED,          /* equipment type */
         0,                  /* event source */
         "MIDAS",            /* format */
         TRUE,               /* enabled */
         RO_RUNNING |        /* read only when running */
         RO_ODB,             /* and update ODB */
         100,                /* poll for 100ms */
         0,                  /* stop run after this event limit */
         0,                  /* number of sub events */
         0,                  /* don't log history */
         "", "", "", "", "", 0, 0},
      read_trigger_event,    /* readout routine */
      10000000,              /* write cache size */
   },

2) Add a function
fe_set_write_case(int size);
which goes through the local equipment list and sets the cache size for all equipments to be the same.

I would appreciate some guidance from K.O. who introduced that code above.

/Stefan
  2947   11 Mar 2025 Federico RezzonicoBug Reportpython hist_get_events not returning events, but javascript does
After starting midas (mhttpd &, and mlogger -D) and running the `fetest` frontend I went into the midas/python/examples directory and ran basic_hist_script.py, and, even though I could see the 'pytest' program in the Programs page,

Valid events are:
Enter event name:

was printed out, which signified that no events were found. No errors were displayed.

Instead, when trying to do the same in javascript (using mjsonrpc_send_request( mjsonrpc_make_request("hs_get_events")).then(console.log)), I was able to get the expected events.

The History page also displayed the expected data and the plots worked correctly.

Device info: Chip: Apple M1 Pro, OS: Sequoia (15.3)

MIDAS version: bitbucket commit 84c7ef7

Python version: 3.13.2
ELOG V3.1.4-2e1708b5