ID |
Date |
Author |
Topic |
Subject |
2904
|
26 Nov 2024 |
Nick Hastings | Bug Report | TMFE::Sleep() errors | Hello,
I've noticed that SC FEs that use the TMFE class with midas-2022-05-c often report errors when calling TMFE:Sleep().
The error is :
[tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).
This seems to happen in two different ways:
1. Error being reported repeatedly
2. Occasional single errors being reported
When the first of these presents, we typically restart the FE to "solve" the problem.
Case 2. is typically ignored.
The code in question is:
void TMFE::Sleep(double time)
{
int status;
fd_set fdset;
struct timeval timeout;
FD_ZERO(&fdset);
timeout.tv_sec = time;
timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
while (1) {
status = select(1, &fdset, NULL, NULL, &timeout);
#ifdef EINTR
if (status < 0 && errno == EINTR) {
continue;
}
#endif
break;
}
if (status < 0) {
TMFE::Instance()->Msg(MERROR, "TMFE::Sleep", "select() returned %d, errno %d (%s)", status, errno, strerror(errno));
}
}
So it looks like either file descriptor of the timeval struct must have a problem.
From some reading it seems that invalid timeval structs are often caused by one or both
of tv_sec or tv_usec not being set. In the code above we can see that both appear to be
correctly set initially.
From the select() man page I see:
RETURN VALUE
On success, select() and pselect() return the number of file descriptors contained in
the three returned descriptor sets (that is, the total number of bits that are set in
readfds, writefds, exceptfds). The return value may be zero if the timeout expired
before any file descriptors became ready.
On error, -1 is returned, and errno is set to indicate the error; the file descriptor
sets are unmodified, and timeout becomes undefined.
The second paragraph quoted from the man page above would indicate to me that perhaps the
timeout needs to be reset inside the if block. eg:
if (status < 0 && errno == EINTR) {
timeout.tv_sec = time;
timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
continue;
}
Please note that I've only just briefly looked at this and was hoping someone more
familiar with using select() as a way to sleep() might be better able to understand
what is happening.
I wonder also if now that midas requires stricter/newer c++ standards if there maybe
some more straightforward method to sleep that is sufficiently robust and portable.
Thanks,
Nick. |
2905
|
26 Nov 2024 |
Maia Henriksson-Ward | Bug Report | TMFE::Sleep() errors | > Hello,
>
> I've noticed that SC FEs that use the TMFE class with midas-2022-05-c often report errors when calling TMFE:Sleep().
> The error is :
>
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).
>
> This seems to happen in two different ways:
>
> 1. Error being reported repeatedly
> 2. Occasional single errors being reported
>
> When the first of these presents, we typically restart the FE to "solve" the problem.
> Case 2. is typically ignored.
>
> The code in question is:
>
> void TMFE::Sleep(double time)
> {
> int status;
> fd_set fdset;
> struct timeval timeout;
>
> FD_ZERO(&fdset);
>
> timeout.tv_sec = time;
> timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
>
> while (1) {
> status = select(1, &fdset, NULL, NULL, &timeout);
> #ifdef EINTR
> if (status < 0 && errno == EINTR) {
> continue;
> }
> #endif
> break;
> }
>
> if (status < 0) {
> TMFE::Instance()->Msg(MERROR, "TMFE::Sleep", "select() returned %d, errno %d (%s)", status, errno, strerror(errno));
> }
> }
>
> So it looks like either file descriptor of the timeval struct must have a problem.
> From some reading it seems that invalid timeval structs are often caused by one or both
> of tv_sec or tv_usec not being set. In the code above we can see that both appear to be
> correctly set initially.
>
> From the select() man page I see:
>
> RETURN VALUE
> On success, select() and pselect() return the number of file descriptors contained in
> the three returned descriptor sets (that is, the total number of bits that are set in
> readfds, writefds, exceptfds). The return value may be zero if the timeout expired
> before any file descriptors became ready.
>
> On error, -1 is returned, and errno is set to indicate the error; the file descriptor
> sets are unmodified, and timeout becomes undefined.
>
> The second paragraph quoted from the man page above would indicate to me that perhaps the
> timeout needs to be reset inside the if block. eg:
>
> if (status < 0 && errno == EINTR) {
> timeout.tv_sec = time;
> timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
> continue;
> }
>
> Please note that I've only just briefly looked at this and was hoping someone more
> familiar with using select() as a way to sleep() might be better able to understand
> what is happening.
>
> I wonder also if now that midas requires stricter/newer c++ standards if there maybe
> some more straightforward method to sleep that is sufficiently robust and portable.
>
> Thanks,
>
> Nick.
I had the same error a few months ago, though I wasn't using a tagged release. It happened because I was calling TMFE::Sleep()
with a negative time. If your issues were caused by the same reason, TMFE::Sleep() can handle negative times since commit
591f78f (https://bitbucket.org/tmidas/midas/commits/591f78f52893d5ffd64bf4e52a1daac537ebd672).
Early in my debugging, I did come to the same conclusions you did, and actually tried a similar solution the one you suggested.
This was a few months ago and I didn't write down what happened, but I believe it didn't work because in my case the errno was
something other than EINTR, and/or the timeval was still an invalid argument for sleep because the timeout was still negative. I
never followed it up because I was able to fix my problem by fixing my frontend. |
2906
|
27 Nov 2024 |
Konstantin Olchanski | Bug Report | TMFE::Sleep() errors | > [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).
The very original copy of this function had an error and was spewing out this error quite often,
this was a missing handler for EINTR.
Now it looks like we are missing a handler for EINVAL.
Most likely sleep is called with a funny sleep time value that fills struct timeval with
values select() does not like.
I see Ben added a check for negative sleep times, and this is good.
I think I will do these changes:
a) add an error message for negative sleep time, I think user should never call ::Sleep with negative or zero sleep times and
if they do it is a bug and they should fix it, the error message will inform them so.
b) add a handler for EINVAL, which will report the requested sleep time and the values in struct timeval
K.O. |
2907
|
27 Nov 2024 |
Konstantin Olchanski | Bug Report | TMFE::Sleep() errors | >
> I wonder also if now that midas requires stricter/newer c++ standards if there maybe
> some more straightforward method to sleep that is sufficiently robust and portable.
>
I believe POSIX defined clock_nanosleep() & co, so on most recent machines that is the most portable way to sleep.
Historically, select() was the only way to sleep for less than 1 sec, but it was never portable
because of differences between BSD UNIX and Linux implementations. (MacOS is BSD UNIX via FreeBSD).
On difference is the update of struct timeval is select() is interrupted.
In this elog entry, I compare sleep using select() with sleep using clock_nanosleep() and see that there is no difference:
https://daq00.triumf.ca/elog-midas/Midas/2115
As you can see tmfe.cxx has both implementations, select() and clock_nanosleep(), and anybody can try which one works better on
their computer.
K.O. |
2908
|
27 Nov 2024 |
Konstantin Olchanski | Bug Report | TMFE::Sleep() errors | > status = select(1, &fdset, NULL, NULL, &timeout);
>
> On error, -1 is returned, ... timeout becomes undefined.
I have been reading "man select" for 30 years and I do not remember seeing this text.
I believe on BSD UNIX (MacOS) it says timeout is unchanged and on Linux is says timeout is updated to time actually slept.
I will have to investigate, but I suspect the man page was posix-ized, by sweeping BSD/MacOS and Linux implementations
under the same "instead of saying what it actually does, we will just say 'undefined'".
In any case, EINTR is not an error, it's an artefact of UNIX signal handling. Linux implementations always try
very hard to handle signals without causing EINTR to select(), read() and write(). This is most painful
when reading and writing to TCP sockets, because one most handle partial reads and EINTR.
K.O. |
2909
|
30 Nov 2024 |
Pavel Murat | Bug Report | EQ_PERIODIC-only equipment ? | Dear Midas experts,
I'm running into something which looks like an initialization problem.
I have a mfe.cxx-style frontend which introduces an equipment of the EQ_PERIODIC type (EQ_PERIODIC-only!).
When Midas enters the running state, I see the frontend crashing.
Stepping through the code shows that the frontend is crashing because its equipment has been ignored
by the initialize_equipment@mfe.cxx - see
https://bitbucket.org/tmidas/midas/src/5d0dae001712164ae43137dced2fbbb594f0201e/src/mfe.cxx#lines-630
Is there an assumption that the initialization of the EQ_PERIODIC-only equipment is the user responsibility?
Or EQ_PERIODIC should always come paired with some other type?
-- many thanks, regards, Pasha |
2910
|
01 Dec 2024 |
Stefan Ritt | Bug Report | EQ_PERIODIC-only equipment ? | There is no requirement that you pair an EQ_PERIODIC with an EQ_TRIGGER. Take for exmaple
midas/examples/experiment/frontend.cxx
and remove there the triggered event. The frontend runs happily with the periodic event only (I just tried that myself). You have probably some problem in
your event definition. Start with the running example frontend, and add your code line by line until you see the problem.
Stefan |
2911
|
01 Dec 2024 |
Pavel Murat | Bug Report | EQ_PERIODIC-only equipment ? | > There is no requirement that you pair an EQ_PERIODIC with an EQ_TRIGGER. Take for exmaple
>
> midas/examples/experiment/frontend.cxx
>
> and remove there the triggered event. The frontend runs happily with the periodic event only (I just tried that myself). You have probably some problem in
> your event definition. Start with the running example frontend, and add your code line by line until you see the problem.
Hi Stefan, thank you very much!
As the pointer to the readout function and pointers to device drivers are all defined in the same structure (EQUIPMENT),
I was naively assuming that the readout function should be set during the class driver initialization.
Now it is clear that the equipment responding to EQ_PERIODIC events doesn't have to have drivers,
and specifying the readout function is the responsibility of the user.
I got around exactly this way yesterday, but was thinking that I was hacking the system :)
-- regards, Pasha |
2912
|
02 Dec 2024 |
Stefan Ritt | Bug Report | ODB key picker does not close when creating link / Edit-on-run string box too large | > Actual result:
> The key picker does not close.
Thanks for reporting that bug. It has been fixed in the current commit (installed already on megon02)
Stefan |
2913
|
02 Dec 2024 |
Stefan Ritt | Bug Report | ODB key picker does not close when creating link / Edit-on-run string box too large | > Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the
> size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters,
> LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a
> reasonable width.
I limited the input size now to (arbitrarily) 100 chars. The string can still be longer than 100 chars, and you start then scrolling inside the input box. Let me know if
that's ok this way.
Stefan |
2915
|
04 Dec 2024 |
Konstantin Olchanski | Bug Report | ODB key picker does not close when creating link / Edit-on-run string box too large | > > Actual result:
> > The key picker does not close.
>
> Thanks for reporting that bug. It has been fixed in the current commit (installed already on megon02)
>
Stefan, thank you for fixing both problems, I have seen them, too, but no time to deal with them.
K.O. |
2916
|
05 Dec 2024 |
Konstantin Olchanski | Bug Report | ODB key picker does not close when creating link / Edit-on-run string box too large | > > Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the
> > size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters,
> > LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a
> > reasonable width.
>
> I limited the input size now to (arbitrarily) 100 chars. The string can still be longer than 100 chars, and you start then scrolling inside the input box. Let me know if
> that's ok this way.
I am moving the dragon experiment to the new midas and we see this problem on the begin-of-run page.
Old midas: no horizontal scroll bar, edit-on-start names, values and comments are all squeezed in into the visible frame.
New midas: page is very wide, values entry fields are very long and there is a horizontal scroll bar.
So something got broken in the htlm formatting or sizing. I should be able to spot the change by doing
a diff between old resources/start.html and the new one.
K.O. |
2918
|
06 Dec 2024 |
Konstantin Olchanski | Bug Report | TMFE::Sleep() errors | > > status = select(1, &fdset, NULL, NULL, &timeout);
> > On error, -1 is returned, ... timeout becomes undefined.
I confirm Linux and MacOS man pages and select() with EINTR work as I remember, Linux updates "timeout" to account for the
time already slept, MacOS does not ("timeout" is unchanged).
So the original code is roughly correct, but long sleeps will not work right if SIGALRM fires during sleeping.
Note that MIDAS no longer uses SIGALRM to fire cm_watchdog() (it was moved to a thread) and MIDAS does not use signals,
so handling of EINTR is now moot.
(Please correct me if I missed something).
The original bug report was about EINVAL, and best I can tell, it was caused by calls to TMFE::Sleep()
with strange sleep times that caused invalid values to be computed into the select() timeout.
To improve on this, I make these changes:
1) TMFE::Sleep(0) will report an error and will not sleep
2) TMFE::Sleep(negative number) will report an error and will not sleep
(please check the sleep time before calling TMFE::Sleep())
3) TMFE::Sleep(1 sec or less) will sleep using select(). (I also looked into using poll(), ppoll() and pselect()).
4) TMFE::Sleep(more than 1 second) will use a loop to sleep in increments of 1 second and will use one additional syscall to
read the current time to decide how much more to sleep.
5) if select() returns EINVAL, the error message will reporting the sleep time and the values in "timeout".
A side effect of this is that on both Linux and MacOS long sleeps work correctly if interrupted by SIGALRM,
because SIGALRM granularity is 1 sec and our sleep time is also 1 sec.
Commit [develop 06735d29] improve TMFE::Sleep()
K.O. |
2919
|
06 Dec 2024 |
Konstantin Olchanski | Bug Report | TMFE::Sleep() errors | > Commit [develop 06735d29] improve TMFE::Sleep()
Report of test_sleep on Ubuntu-22 (Intel E-2236) and MacOS 14.6.1 (M1 MAX Mac Studio).
It is easy to see Ubuntu-22 (kernel 6.2.x) sleep granularity is ~60 usec, MacOS sleep granularity is <2 usec. (Sub 1 usec sleep likely measures the syscall() speed, 500 ns on Intel and 200
ns on ARM M1 MAX). (NOTE: long sleep is interrupted by an alarm roughly 10 seconds into the sleep, see progs/test_sleep.cxx)
daq00:midas$ uname -a
Linux daq00.triumf.ca 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
daq00:midas$ ./bin/test_sleep
test short sleep:
sleep 10 loops, 100000.000 usec per loop, 1.000000000 sec total, 1.002568007 sec actual total, 100256.801 usec actual per loop, oversleep 256.801 usec, 0.3%
sleep 100 loops, 10000.000 usec per loop, 1.000000000 sec total, 1.025897980 sec actual total, 10258.980 usec actual per loop, oversleep 258.980 usec, 2.6%
sleep 1000 loops, 1000.000 usec per loop, 1.000000000 sec total, 1.169670105 sec actual total, 1169.670 usec actual per loop, oversleep 169.670 usec, 17.0%
sleep 10000 loops, 100.000 usec per loop, 1.000000000 sec total, 1.573357105 sec actual total, 157.336 usec actual per loop, oversleep 57.336 usec, 57.3%
sleep 99999 loops, 10.000 usec per loop, 0.999990000 sec total, 6.963442087 sec actual total, 69.635 usec actual per loop, oversleep 59.635 usec, 596.4%
sleep 1000000 loops, 1.000 usec per loop, 1.000000000 sec total, 60.939687967 sec actual total, 60.940 usec actual per loop, oversleep 59.940 usec, 5994.0%
sleep 1000000 loops, 0.100 usec per loop, 0.100000000 sec total, 0.613572121 sec actual total, 0.614 usec actual per loop, oversleep 0.514 usec, 513.6%
sleep 1000000 loops, 0.010 usec per loop, 0.010000000 sec total, 0.576359987 sec actual total, 0.576 usec actual per loop, oversleep 0.566 usec, 5663.6%
test long sleep: requested 126.000000000 sec ... sleeping ...
alarm!
test long sleep: requested 126.000000000 sec, actual 126.000875950 sec
bash-3.2$ uname -a
Darwin send.triumf.ca 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
bash-3.2$ ./bin/test_sleep
test short sleep:
sleep 10 loops, 100000.000 usec per loop, 1.000000000 sec total, 1.032556057 sec actual total, 103255.606 usec actual per loop, oversleep 3255.606 usec, 3.3%
sleep 100 loops, 10000.000 usec per loop, 1.000000000 sec total, 1.245460033 sec actual total, 12454.600 usec actual per loop, oversleep 2454.600 usec, 24.5%
sleep 1000 loops, 1000.000 usec per loop, 1.000000000 sec total, 1.331466913 sec actual total, 1331.467 usec actual per loop, oversleep 331.467 usec, 33.1%
sleep 10000 loops, 100.000 usec per loop, 1.000000000 sec total, 1.281141996 sec actual total, 128.114 usec actual per loop, oversleep 28.114 usec, 28.1%
sleep 99999 loops, 10.000 usec per loop, 0.999990000 sec total, 1.410759926 sec actual total, 14.108 usec actual per loop, oversleep 4.108 usec, 41.1%
sleep 1000000 loops, 1.000 usec per loop, 1.000000000 sec total, 2.400593996 sec actual total, 2.401 usec actual per loop, oversleep 1.401 usec, 140.1%
sleep 1000000 loops, 0.100 usec per loop, 0.100000000 sec total, 0.188431025 sec actual total, 0.188 usec actual per loop, oversleep 0.088 usec, 88.4%
sleep 1000000 loops, 0.010 usec per loop, 0.010000000 sec total, 0.188102007 sec actual total, 0.188 usec actual per loop, oversleep 0.178 usec, 1781.0%
test long sleep: requested 126.000000000 sec ... sleeping ...
alarm!
test long sleep: requested 126.000000000 sec, actual 126.001244068 sec
K.O. |
2923
|
17 Dec 2024 |
Lukas Gerritzen | Bug Report | [History plots] "Jump to current time" resets x range to 7d | To reproduce:
- Open a history plot, click [-] a few times until the x axis shows more than 7 days.
- Scroll to the past (left)
- Click "Jump to current time" (the triangle)
Expected result:
The upper limit of the x axis is at the current time and the lower range is now - whatever range you had before
(>7d)
Actual result:
The upper limit is the current time, the lower limit is now - 7d
(The interval seems unchanged if the range was < 7d before clicking "Jump to current time") |
2924
|
19 Dec 2024 |
Stefan Ritt | Bug Report | [History plots] "Jump to current time" resets x range to 7d | I had put in a check which limits the range to 7d into the past if you press the "play" button, but now I'm not sure why this was needed. I removed it again and
things seem to be fine. Change is committed to develop.
Stefan |
2936
|
01 Feb 2025 |
Pavel Murat | Bug Report | MIDAS history system not using the event timestamps ? | > I have a time series of slow control measurements in an ASCII format -
> data records in a format (run_number, time, temperature, voltage1, ..., voltageN),
> and, if possible, would like to convert them into a MIDAS history format.
>
> Making MIDAS events out of that data is easy, but is it possible to preserve
> the time stamps? - Logically, this boils down to whether it is possible to have
> the event time set by a user frontend
It looks that the original question was not as naive as I expected and may be pointing to a subtle bug.
I have implemented a python frontend - essentially a clone of
https://bitbucket.org/tmidas/midas/src/develop/python/midas/frontend.py
reading the old slow control data and setting the event.header.timestamp's to some dates from the year of 2022.
When I run MIDAS and read the "old slow control events", one event in 10 seconds,
the MIDAS Event Dump utility shows the data with the correct event timestamps, from the year of 2022.
However the history plots show the event parameters with the timestamps from Feb 01 2025 and the adjacent
data points separated by 10 sec.
Is it possible that the history system uses its own timestamp setting instead of using timestamps from the event headers?
- Under normal circumstances, the two should be very close, and that could've kept the issue hidden...
-- thanks, regards, Pasha
UPDATE: I attached the frontend code and the input data file it is reading. The data file should reside in the local directory
- the frontend code doesn't have everything fully automated for the test,
-- an integer field "/Mu2e/Offline/Ops/LastTime" would need to be created manually
-- the history plots would need to be declared manually |
2943
|
19 Feb 2025 |
Lukas Gerritzen | Bug Report | Default write cache size for new equipments breaks compatibility with older equipments | We have a frontend for slow control with a lot of legacy code. I wanted to add a new equipment using the
mdev_mscb class. It seems like the default write cache size is 1000000B now, which produces error
messages like this:
12:51:20.154 2025/02/19 [SC Frontend,ERROR] [mfe.cxx:620:register_equipment,ERROR] Write cache size mismatch for buffer "SYSTEM": equipment "Environment" asked for 0, while eqiupment "LED" asked for 10000000
12:51:20.154 2025/02/19 [SC Frontend,ERROR] [mfe.cxx:620:register_equipment,ERROR] Write cache size mismatch for buffer "SYSTEM": equipment "LED" asked for 10000000, while eqiupment "Xenon" asked for 0
I can manually change the write cache size in /Equipment/LED/Common/Write cache size to 0. However, if I delete the LED tree in the ODB, then I get the same problems again. It would be nice if I could either choose the size as 0 in the frontend code, or if the defaults were compatible with our legacy code.
The commit that made the write cache size configurable seems to be from 2019: https://bitbucket.org/tmidas/midas/commits/3619ecc6ba1d29d74c16aa6571e40920018184c0 |
2944
|
24 Feb 2025 |
Stefan Ritt | Bug Report | Default write cache size for new equipments breaks compatibility with older equipments | The commit that introduced the write cache size check is https://bitbucket.org/tmidas/midas/commits/3619ecc6ba1d29d74c16aa6571e40920018184c0
Unfortunately K.O. added the write cache size to the equipment list, but there is currently no way to change this programmatically from the user frontend code. The options I see are
1) Re-arrange the equipment settings so that the write case size comes to the end of the list which the user initializes, like
{"Trigger", /* equipment name */
{1, 0, /* event ID, trigger mask */
"SYSTEM", /* event buffer */
EQ_POLLED, /* equipment type */
0, /* event source */
"MIDAS", /* format */
TRUE, /* enabled */
RO_RUNNING | /* read only when running */
RO_ODB, /* and update ODB */
100, /* poll for 100ms */
0, /* stop run after this event limit */
0, /* number of sub events */
0, /* don't log history */
"", "", "", "", "", 0, 0},
read_trigger_event, /* readout routine */
10000000, /* write cache size */
},
2) Add a function fe_set_write_case(int size); which goes through the local equipment list and sets the cache size for all equipments to be the same.
I would appreciate some guidance from K.O. who introduced that code above.
/Stefan |
2947
|
11 Mar 2025 |
Federico Rezzonico | Bug Report | python hist_get_events not returning events, but javascript does | After starting midas (mhttpd &, and mlogger -D) and running the `fetest` frontend I went into the midas/python/examples directory and ran basic_hist_script.py, and, even though I could see the 'pytest' program in the Programs page,
Valid events are:
Enter event name:
was printed out, which signified that no events were found. No errors were displayed.
Instead, when trying to do the same in javascript (using mjsonrpc_send_request( mjsonrpc_make_request("hs_get_events")).then(console.log)), I was able to get the expected events.
The History page also displayed the expected data and the plots worked correctly.
Device info: Chip: Apple M1 Pro, OS: Sequoia (15.3)
MIDAS version: bitbucket commit 84c7ef7
Python version: 3.13.2 |
|