Midas DAQ System
ID   Date   Author   Topic   Subject
  2878   18 Oct 2024   Stefan Ritt   Bug Report   Frontend name must differ from others by more than the last three characters
Fixed and committed.

Best,
Stefan

> Hi Denis,
> 
> indeed a bug. Will fix it next week.
> 
> Best,
> Stefan
> 
> 
> > Hi,
> > I have developed two Midas front-end programs for different hardware. The frontend_name of the first one is "FSCD_SC" (slow control) and that of the second one is "FSCD_PS" (power supply).
> > 
> > Each front-end program runs fine separately, but when attempting to start FSCD_SC while FSCD_PS is running, FSCD_PS is terminated and Midas indicates "Previous frontend stopped" in the window where it starts FSCD_SC.
> > 
> > The problem is that these two frontend names only differ in their last two characters, and Midas currently does not distinguish them properly.
  2879   18 Oct 2024   Konstantin Olchanski   Bug Report   Difficulty running MIDAS on Rocky 9.4
> [aroberts@sdfcdmsdaq midas]$ odbedit
> [ODBEdit,ERROR] [odb.cxx:2043:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 
> 1615051 does not exists
> [ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
> [ODBEdit,INFO] Corrected 1 ODB entries
> [ODBEdit,INFO] Deleted entry '/System/Clients/1615051' for client 'ODBEdit' because it is not connected to ODB
> [ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 1615051 does not 
> exist

so far, so good, we connected to ODB (lock was not stuck), cleared out client "odbedit" with pid 1615051 that crashed 
without properly disconnecting. ODB semaphore is working correctly.

> [ODBEdit,ERROR] [odb.cxx:2489:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, aborting...
> Aborted (core dumped)

suddenly, an ODB semaphore timeout...

can you post the stack trace from this core dump? I am pretty sure it will be boring, but just in case...

K.O.
  2880   18 Oct 2024   Konstantin Olchanski   Bug Report   Difficulty running MIDAS on Rocky 9.4
> suddenly, an ODB semaphore timeout...

I am wondering if something bizarre is going on, like the system clock going backwards. I heard of things like that 
happening in virtual environments.

https://stackoverflow.com/questions/4801122/how-to-stop-time-from-running-backwards-on-linux

I added some debugging information to the semaphore locking code. Please update to commit 
eb625af119067f6d702211542d88a28ccb57ad2c of src/system.cxx (plus small change in include/msystem.h) and try again.

Now for each timeout it will print detailed syscall and timing information, if time goes backwards, it should catch it.

K.O.
  2881   23 Oct 2024   Lukas Gerritzen   Bug Report   ODB key picker does not close when creating link / Edit-on-run string box too large
To reproduce:
In the interactive ODB, click the 🔗 icon to create a link. Next to the target, click the "..." button to open 
the key picker browser. Then try to close it by either:
- Selecting a key and clicking ok
- Clicking "cancel"
- Clicking the red circle at the top left

Expected result:
The key picker closes

Actual result:
The key picker does not close.

Depending on how you try to close the picker, the error messages in the debug console differ slightly.

On the red circle:
Uncaught TypeError: dlg is null
    dlgClose http://localhost:8080/controls.js:791
    onclick http://localhost:8080/?cmd=ODB&odb_path=/Test:1

On "ok" or "cancel":
Uncaught TypeError: dlg is null
    dlgMessageDestroy http://localhost:8080/controls.js:828
    pickerButton http://localhost:8080/odbbrowser.js:453
    onclick http://localhost:8080/?cmd=ODB&odb_path=/Test:1


Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the 
size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters, 
LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a 
reasonable width.
Attachment 1: Screenshot_2024-10-23_at_11.38.38.png
  2882   28 Oct 2024   Lukas Gerritzen   Bug Report   Visual glitch in history system
Today, I encountered the bug shown in the attached video. The value of the plotted curve does not match the mouseover number.

When I tried to understand it better, I was no longer able to reproduce it. Has anyone else observed a similar problem? 
Attachment 1: Screen_Recording_2024-10-28_at_17.23.57.mov
Attachment 2: Screenshot_2024-10-28_at_17.29.34.png
  Draft   28 Oct 2024   Lukas Gerritzen   Bug Report   Visual glitch in history system
Attachment 1: Screenshot_2024-10-28_at_17.34.26.png
  2884   28 Oct 2024   Amy Roberts   Bug Report   Difficulty running MIDAS on Rocky 9.4
> Now for each timeout it will print detailed syscall and timing information, if time goes backwards, it should catch it.

It appears that time is moving forward:

[aroberts@sdfcdmsdaq build]$ odbedit
[ODBEdit,ERROR] [odb.cxx:2043:db_open_database,ERROR] Removed ODB client 'ODBEdit', index 0 because process pid 1617119 does 
not exists
[ODBEdit,INFO] Removed open record flag from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Removed exclusive access mode from "/Experiment/Security/RPC hosts/Allowed hosts"
[ODBEdit,INFO] Corrected 1 ODB entries
[ODBEdit,INFO] Deleted entry '/System/Clients/1617119' for client 'ODBEdit' because it is not connected to ODB
[ODBEdit,INFO] Client 'ODBEdit' on buffer 'SYSMSG' removed by bm_open_buffer because process pid 1617119 does not exist
[local:amy_test:S]/>ss_semaphore_wait_for: semop/semtimedop(5) returned -1, errno 11 (Resource temporarily unavailable), 
start time 0xd4fd98f6, now 0xd4fdc0ef, dt 0x000027f9, timeout 0x00002710 ms, SEMAPHORE TIMEOUT!
[ODBEdit,ERROR] [odb.cxx:2489:db_lock_database,ERROR] cannot lock ODB semaphore, timeout 10000 ms, aborting...
Aborted (core dumped)
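
The hexadecimal fields in the ss_semaphore_wait_for line above can be checked by hand. The following is a minimal sketch, assuming that "start time" and "now" are 32-bit millisecond counters (suggested by the "ms" unit on the timeout, but not confirmed in this thread). The helper names elapsed_ms and timeout_is_genuine are hypothetical and are not part of MIDAS.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: elapsed time between two 32-bit millisecond
// counters. Unsigned subtraction stays correct even if the counter
// wraps around between "start" and "now".
inline uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
   return now - start;
}

// Values copied from the log line above.
constexpr uint32_t kStart   = 0xd4fd98f6;
constexpr uint32_t kNow     = 0xd4fdc0ef;
constexpr uint32_t kTimeout = 0x00002710; // 10000 ms

// True if the wait lasted at least as long as the timeout, i.e. the
// timeout is genuine and not an artefact of a clock jump.
inline bool timeout_is_genuine(uint32_t start, uint32_t now, uint32_t timeout)
{
   return elapsed_ms(start, now) >= timeout;
}
```

With the logged values, elapsed_ms(kStart, kNow) gives 0x27f9 = 10233 ms, matching the printed dt and slightly exceeding the 10000 ms timeout, which is consistent with time moving forward.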
  2889   06 Nov 2024   Amy Roberts   Bug Report   Difficulty running MIDAS on Rocky 9.4
After following Konstantin's debugging suggestions, I thought I would try to replicate 
the issue on my own computer.  My hope was that I could provide instructions for 
replicating the bug so that the MIDAS team could try debugging things more easily.

However, when I ran the current version of MIDAS in a Rocky 9.4 VM on my laptop (both 
VMWare and VirtualBox), mserver and odbedit ran just fine (!).

I'm currently trying to find out if there's a way to compare the VMs on my machine with 
the machine that's being problematic. I'll report back if I learn anything.
  2902   22 Nov 2024   Konstantin Olchanski   Bug Report   ODB lock timeout, Difficulty running MIDAS on Rocky 9.4
> try to replicate the issue ...

I see ODB lock timeout (and abort() of everything) in the dsvslice test station. We have 
about 15-20 MIDAS clients connected.

I am pretty sure we have not seen this problem until recently (and I have not seen it 
personally for a very long time). There were no changes to the MIDAS ODB locking code in a 
long time.

I suspect a recent change in the linux kernel. But I am likely to be wrong.

I have 1000 core dumps from this crash of dsvslice, and among them should be the one thread
that has ODB locked. Wish me luck finding it. Worst case is to discover that ODB is locked 
but nobody is holding a lock ("missing unlock bug"). This is hard to debug, I would have to add 
tracking of "who was the last one to lock it, who forgot to unlock it".

K.O.
  2903   24 Nov 2024   Pavel Murat   Bug Report   ODB lock timeout, Difficulty running MIDAS on Rocky 9.4
There is a really good software tool developed by the Fermilab DAQ group, called TRACE:

https://github.com/art-daq/trace

It could be useful for debugging cases like this one. In short, TRACE instruments the code 
with printouts that can be selectively turned on and off without recompiling the executable. 

TRACE output can go to /dev/stdout (slow output) and/or to a circular buffer implemented via a shared 
memory segment (fast output). Sending unlimited output to the shared memory segment is extremely useful.

TRACE also allows triggering on certain conditions, again without recompiling the executable. 
For debugging cases like the one in question, that could turn out to be even more useful; 
however, I have not tried the triggering functionality myself. 

-- regards, Pasha
  2904   26 Nov 2024   Nick Hastings   Bug Report   TMFE::Sleep() errors
Hello,

I've noticed that SC FEs that use the TMFE class with midas-2022-05-c often report errors when calling TMFE::Sleep().
The error is:

[tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).

This seems to happen in two different ways:

1. Error being reported repeatedly
2. Occasional single errors being reported

When the first of these presents, we typically restart the FE to "solve" the problem.
Case 2. is typically ignored.

The code in question is:

void TMFE::Sleep(double time)
{
   int status;
   fd_set fdset;
   struct timeval timeout;
      
   FD_ZERO(&fdset);
      
   timeout.tv_sec = time;
   timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;

   while (1) {
      status = select(1, &fdset, NULL, NULL, &timeout);
#ifdef EINTR
      if (status < 0 && errno == EINTR) {
         continue;
      }
#endif
      break;
   }
      
   if (status < 0) {
      TMFE::Instance()->Msg(MERROR, "TMFE::Sleep", "select() returned %d, errno %d (%s)", status, errno, strerror(errno));
   }
}

So it looks like either the file descriptor set or the timeval struct must have a problem.
From some reading it seems that invalid timeval structs are often caused by one or both
of tv_sec or tv_usec not being set. In the code above we can see that both appear to be
correctly set initially.

From the select() man page I see:

RETURN VALUE
       On success, select() and pselect() return the number of file descriptors contained in
       the three returned descriptor sets (that is, the total number of bits that are set in
       readfds,  writefds,  exceptfds).  The return value may be zero if the timeout expired
       before any file descriptors became ready.

       On error, -1 is returned, and errno is set to indicate the error; the file descriptor
       sets are unmodified, and timeout becomes undefined.

The second paragraph quoted from the man page above would indicate to me that perhaps the
timeout needs to be reset inside the if block. eg:

      if (status < 0 && errno == EINTR) {
         timeout.tv_sec = time;
         timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
         continue;
      }

Please note that I've only just briefly looked at this and was hoping someone more
familiar with using select() as a way to sleep() might be better able to understand
what is happening.

I also wonder whether, now that midas requires stricter/newer C++ standards, there may be
some more straightforward method to sleep that is sufficiently robust and portable.

Thanks,

Nick.
  2905   26 Nov 2024   Maia Henriksson-Ward   Bug Report   TMFE::Sleep() errors
> Hello,
> 
> I've noticed that SC FEs that use the TMFE class with midas-2022-05-c often report errors when calling TMFE::Sleep().
> The error is:
> 
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).
> 
> This seems to happen in two different ways:
> 
> 1. Error being reported repeatedly
> 2. Occasional single errors being reported
> 
> When the first of these presents, we typically restart the FE to "solve" the problem.
> Case 2. is typically ignored.
> 
> The code in question is:
> 
> void TMFE::Sleep(double time)
> {
>    int status;
>    fd_set fdset;
>    struct timeval timeout;
>       
>    FD_ZERO(&fdset);
>       
>    timeout.tv_sec = time;
>    timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
> 
>    while (1) {
>       status = select(1, &fdset, NULL, NULL, &timeout);
> #ifdef EINTR
>       if (status < 0 && errno == EINTR) {
>          continue;
>       }
> #endif
>       break;
>    }
>       
>    if (status < 0) {
>       TMFE::Instance()->Msg(MERROR, "TMFE::Sleep", "select() returned %d, errno %d (%s)", status, errno, strerror(errno));
>    }
> }
> 
> So it looks like either the file descriptor set or the timeval struct must have a problem.
> From some reading it seems that invalid timeval structs are often caused by one or both
> of tv_sec or tv_usec not being set. In the code above we can see that both appear to be
> correctly set initially.
> 
> From the select() man page I see:
> 
> RETURN VALUE
>        On success, select() and pselect() return the number of file descriptors contained in
>        the three returned descriptor sets (that is, the total number of bits that are set in
>        readfds,  writefds,  exceptfds).  The return value may be zero if the timeout expired
>        before any file descriptors became ready.
> 
>        On error, -1 is returned, and errno is set to indicate the error; the file descriptor
>        sets are unmodified, and timeout becomes undefined.
> 
> The second paragraph quoted from the man page above would indicate to me that perhaps the
> timeout needs to be reset inside the if block. eg:
> 
>       if (status < 0 && errno == EINTR) {
>          timeout.tv_sec = time;
>          timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
>          continue;
>       }
> 
> Please note that I've only just briefly looked at this and was hoping someone more
> familiar with using select() as a way to sleep() might be better able to understand
> what is happening.
> 
> I also wonder whether, now that midas requires stricter/newer C++ standards, there may be
> some more straightforward method to sleep that is sufficiently robust and portable.
> 
> Thanks,
> 
> Nick.

I had the same error a few months ago, though I wasn't using a tagged release. It happened because I was calling TMFE::Sleep() 
with a negative time. If your issues were caused by the same reason, TMFE::Sleep() can handle negative times since commit 
591f78f (https://bitbucket.org/tmidas/midas/commits/591f78f52893d5ffd64bf4e52a1daac537ebd672).

Early in my debugging, I did come to the same conclusions you did, and actually tried a solution similar to the one you suggested. 
This was a few months ago and I didn't write down what happened, but I believe it didn't work because in my case the errno was 
something other than EINTR, and/or the timeval was still an invalid argument for sleep because the timeout was still negative. I 
never followed it up because I was able to fix my problem by fixing my frontend.
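
To illustrate why a negative sleep time can end in "Invalid argument", here is a minimal sketch that repeats the double-to-timeval conversion from the quoted TMFE::Sleep() and checks the result. The helper names to_timeval, timeval_is_valid and to_timeval_clamped are hypothetical. The validity rule (both fields non-negative, microseconds below one million) is an assumption about what select() typically accepts, and the clamping guard is only one possible way to handle negative requests, not necessarily what commit 591f78f does.

```cpp
#include <cassert>
#include <sys/time.h>

// Same conversion as in the quoted TMFE::Sleep(): the integer part goes
// to tv_sec (truncated toward zero), the remainder goes to tv_usec.
inline struct timeval to_timeval(double time)
{
   struct timeval tv;
   tv.tv_sec  = time;
   tv.tv_usec = (time - tv.tv_sec) * 1000000.0;
   return tv;
}

// Assumed validity rule for a select() timeout.
inline bool timeval_is_valid(const struct timeval& tv)
{
   return tv.tv_sec >= 0 && tv.tv_usec >= 0 && tv.tv_usec < 1000000;
}

// Hypothetical guard: treat negative requests as "do not sleep at all".
inline struct timeval to_timeval_clamped(double time)
{
   return to_timeval(time < 0 ? 0.0 : time);
}
```

For example, to_timeval(-0.5) yields tv_sec = 0 and tv_usec = -500000, which fails the validity check, while to_timeval_clamped(-0.5) yields a zero timeout.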
  2906   27 Nov 2024   Konstantin Olchanski   Bug Report   TMFE::Sleep() errors
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).

The very original copy of this function had a bug and was spewing out this error quite often;
the cause was a missing handler for EINTR.

Now it looks like we are missing a handler for EINVAL.

Most likely sleep is called with a funny sleep time value that fills struct timeval with
values select() does not like.

I see Ben added a check for negative sleep times, and this is good.

I think I will do these changes:

a) add an error message for negative sleep time. I think the user should never call ::Sleep with negative or zero sleep times; 
if they do, it is a bug and they should fix it, and the error message will tell them so.

b) add a handler for EINVAL, which will report the requested sleep time and the values in struct timeval

K.O.
  2907   27 Nov 2024   Konstantin Olchanski   Bug Report   TMFE::Sleep() errors
> 
> I also wonder whether, now that midas requires stricter/newer C++ standards, there may be
> some more straightforward method to sleep that is sufficiently robust and portable.
> 

I believe POSIX defined clock_nanosleep() & co, so on most recent machines that is the most portable way to sleep.

Historically, select() was the only way to sleep for less than 1 sec, but it was never portable
because of differences between BSD UNIX and Linux implementations. (MacOS is BSD UNIX via FreeBSD).

One difference is the update of struct timeval if select() is interrupted.

In this elog entry, I compare sleep using select() with sleep using clock_nanosleep() and see that there is no difference:
https://daq00.triumf.ca/elog-midas/Midas/2115

As you can see tmfe.cxx has both implementations, select() and clock_nanosleep(), and anybody can try which one works better on 
their computer.
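
For readers who want to try the clock_nanosleep() route, a minimal sketch follows. It is not the implementation in tmfe.cxx; the function name sleep_seconds is hypothetical. It assumes a Linux-like system that provides clock_nanosleep() with CLOCK_MONOTONIC and TIMER_ABSTIME (macOS does not provide this call). Sleeping until an absolute deadline means that a restart after EINTR only waits for the remaining time, and non-positive requests simply return.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <ctime>

// Hypothetical helper: sleep for "seconds" using an absolute deadline on
// the monotonic clock. Returns 0 on success or an errno-style code.
inline int sleep_seconds(double seconds)
{
   if (!(seconds > 0))            // negative, zero or NaN: nothing to do
      return 0;
   if (!std::isfinite(seconds))   // refuse to sleep forever
      return EINVAL;

   struct timespec deadline;
   if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
      return errno;

   double whole = std::floor(seconds);
   deadline.tv_sec  += static_cast<time_t>(whole);
   deadline.tv_nsec += static_cast<long>((seconds - whole) * 1e9);
   if (deadline.tv_nsec >= 1000000000L) {   // carry into seconds
      deadline.tv_sec  += 1;
      deadline.tv_nsec -= 1000000000L;
   }

   int rc;
   do {
      // clock_nanosleep() returns the error number directly, not via errno
      rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, nullptr);
   } while (rc == EINTR);         // interrupted by a signal: wait again
   return rc;
}
```

A call such as sleep_seconds(0.05) is intended to block for at least 50 ms, while sleep_seconds(-1.0) returns immediately with 0.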

K.O.
  2908   27 Nov 2024   Konstantin Olchanski   Bug Report   TMFE::Sleep() errors
>       status = select(1, &fdset, NULL, NULL, &timeout);
>
>       On error, -1 is returned, ... timeout becomes undefined.

I have been reading "man select" for 30 years and I do not remember seeing this text.

I believe on BSD UNIX (MacOS) it says timeout is unchanged and on Linux it says timeout is updated to time actually slept.

I will have to investigate, but I suspect the man page was posix-ized, by sweeping BSD/MacOS and Linux implementations
under the same "instead of saying what it actually does, we will just say 'undefined'".

In any case, EINTR is not an error, it's an artefact of UNIX signal handling. Linux implementations always try
very hard to handle signals without causing EINTR to select(), read() and write(). This is most painful
when reading and writing to TCP sockets, because one must handle partial reads and EINTR.
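
The usual way to cope with partial reads and EINTR is a retry loop around read(). The following is a minimal sketch of that pattern; the helper name read_fully is hypothetical and this is not code taken from MIDAS.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: read exactly "count" bytes from "fd" unless end of
// file or a real error occurs. Returns the number of bytes read (short
// only on end of file), or -1 on error with errno set by read().
inline ssize_t read_fully(int fd, void* buf, size_t count)
{
   char* p = static_cast<char*>(buf);
   size_t done = 0;
   while (done < count) {
      ssize_t n = read(fd, p + done, count - done);
      if (n < 0) {
         if (errno == EINTR)
            continue;                    // interrupted by a signal: retry
         return -1;                      // real error
      }
      if (n == 0)
         break;                          // end of file or peer closed
      done += static_cast<size_t>(n);    // partial read: keep going
   }
   return static_cast<ssize_t>(done);
}
```

The same structure applies to write(), with the difference that a return value of 0 is not an end-of-file indication there.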

K.O.
  2909   30 Nov 2024   Pavel Murat   Bug Report   EQ_PERIODIC-only equipment ?
Dear Midas experts, 

I'm running into something which looks like an initialization problem. 
I have a mfe.cxx-style frontend which introduces an equipment of the EQ_PERIODIC type (EQ_PERIODIC-only!). 
When Midas enters the running state, I see the frontend crashing. 
Stepping through the code shows that the frontend is crashing because its equipment has been ignored 
by the initialize_equipment@mfe.cxx - see
 
https://bitbucket.org/tmidas/midas/src/5d0dae001712164ae43137dced2fbbb594f0201e/src/mfe.cxx#lines-630

Is there an assumption that the initialization of the EQ_PERIODIC-only equipment is the user's responsibility? 
Or should EQ_PERIODIC always come paired with some other type?

-- many thanks, regards, Pasha
  2910   01 Dec 2024   Stefan Ritt   Bug Report   EQ_PERIODIC-only equipment ?
There is no requirement that you pair an EQ_PERIODIC with an EQ_TRIGGER. Take for example

  midas/examples/experiment/frontend.cxx

and remove the triggered event there. The frontend runs happily with the periodic event only (I just tried that myself). You probably have some problem in 
your event definition. Start with the running example frontend, and add your code line by line until you see the problem.

Stefan
  2911   01 Dec 2024   Pavel Murat   Bug Report   EQ_PERIODIC-only equipment ?
> There is no requirement that you pair an EQ_PERIODIC with an EQ_TRIGGER. Take for example
> 
>   midas/examples/experiment/frontend.cxx
> 
> and remove the triggered event there. The frontend runs happily with the periodic event only (I just tried that myself). You probably have some problem in 
> your event definition. Start with the running example frontend, and add your code line by line until you see the problem.

Hi Stefan, thank you very much! 

As the pointer to the readout function and pointers to device drivers are all defined in the same structure (EQUIPMENT), 
I was naively assuming that the readout function should be set during the class driver initialization.
Now it is clear that the equipment responding to EQ_PERIODIC events doesn't have to have drivers, 
and specifying the readout function is the responsibility of the user.

I worked around it in exactly this way yesterday, but was thinking that I was hacking the system :)
 
-- regards, Pasha
  2912   02 Dec 2024   Stefan Ritt   Bug Report   ODB key picker does not close when creating link / Edit-on-run string box too large
> Actual result:
> The key picker does not close.

Thanks for reporting that bug. It has been fixed in the current commit (already installed on megon02).

Stefan
  2913   02 Dec 2024   Stefan Ritt   Bug Report   ODB key picker does not close when creating link / Edit-on-run string box too large
> Another more minor visual problem is the edit-on-start dialog. There seems to be no upper bound to the 
> size of the text box. In the attached screenshot, ShortString has a maximum length of 32 characters, 
> LongString has 255. Both are empty at the time of the screenshot. Maybe, the size should be limited to a 
> reasonable width.

I have now limited the input size to (arbitrarily) 100 chars. The string can still be longer than 100 chars, and you then start scrolling inside the input box. Let me know if 
that's ok this way.

Stefan