Back Midas Rome Roody Rootana
  Midas DAQ System, Page 8 of 151  Not logged in ELOG logo
ID Date Author Topic Subject
  2908   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
>       status = select(1, &fdset, NULL, NULL, &timeout);
>
>       On error, -1 is returned, ... timeout becomes undefined.

I have been reading "man select" for 30 years and I do not remember seeing this text.

I believe on BSD UNIX (MacOS) it says timeout is unchanged and on Linux is says timeout is updated to time actually slept.

I will have to investigate, but I suspect the man page was posix-ized, by sweeping BSD/MacOS and Linux implementations
under the same "instead of saying what it actually does, we will just say 'undefined'".

In any case, EINTR is not an error, it's an artefact of UNIX signal handling. Linux implementations always try
very hard to handle signals without causing EINTR to select(), read() and write(). This is most painful
when reading and writing to TCP sockets, because one most handle partial reads and EINTR.

K.O.
  2907   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> 
> I wonder also if now that midas requires stricter/newer c++ standards if there maybe
> some more straightforward method to sleep that is sufficiently robust and portable.
> 

I believe POSIX defined clock_nanosleep() & co, so on most recent machines that is the most portable way to sleep.

Historically, select() was the only way to sleep for less than 1 sec, but it was never portable
because of differences between BSD UNIX and Linux implementations. (MacOS is BSD UNIX via FreeBSD).

On difference is the update of struct timeval is select() is interrupted.

In this elog entry, I compare sleep using select() with sleep using clock_nanosleep() and see that there is no difference:
https://daq00.triumf.ca/elog-midas/Midas/2115

As you can see tmfe.cxx has both implementations, select() and clock_nanosleep(), and anybody can try which one works better on 
their computer.

K.O.
  2906   27 Nov 2024 Konstantin OlchanskiBug ReportTMFE::Sleep() errors
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).

The very original copy of this function had an error and was spewing out this error quite often,
this was a missing handler for EINTR.

Now it looks like we are missing a handler for EINVAL.

Most likely sleep is called with a funny sleep time value that fills struct timeval with
values select() does not like.

I see Ben added a check for negative sleep times, and this is good.

I think I will do these changes:

a) add an error message for negative sleep time, I think user should never call ::Sleep with negative or zero sleep times and 
if they do it is a bug and they should fix it, the error message will inform them so.

b) add a handler for EINVAL, which will report the requested sleep time and the values in struct timeval

K.O.
  2905   26 Nov 2024 Maia Henriksson-WardBug ReportTMFE::Sleep() errors
> Hello,
> 
> I've noticed that SC FEs that use the TMFE class with midas-2022-05-c often report errors when calling TMFE:Sleep().
> The error is :
> 
> [tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).
> 
> This seems to happen in two different ways:
> 
> 1. Error being reported repeatedly
> 2. Occasional single errors being reported
> 
> When the first of these presents, we typically restart the FE to "solve" the problem.
> Case 2. is typically ignored.
> 
> The code in question is:
> 
> void TMFE::Sleep(double time)
> {
>    int status;
>    fd_set fdset;
>    struct timeval timeout;
>       
>    FD_ZERO(&fdset);
>       
>    timeout.tv_sec = time;
>    timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
> 
>    while (1) {
>       status = select(1, &fdset, NULL, NULL, &timeout);
> #ifdef EINTR
>       if (status < 0 && errno == EINTR) {
>          continue;
>       }
> #endif
>       break;
>    }
>       
>    if (status < 0) {
>       TMFE::Instance()->Msg(MERROR, "TMFE::Sleep", "select() returned %d, errno %d (%s)", status, errno, strerror(errno));
>    }
> }
> 
> So it looks like either file descriptor of the timeval struct must have a problem.
> From some reading it seems that invalid timeval structs are often caused by one or both
> of tv_sec or tv_usec not being set. In the code above we can see that both appear to be
> correctly set initially.
> 
> From the select() man page I see:
> 
> RETURN VALUE
>        On success, select() and pselect() return the number of file descriptors contained in
>        the three returned descriptor sets (that is, the total number of bits that are set in
>        readfds,  writefds,  exceptfds).  The return value may be zero if the timeout expired
>        before any file descriptors became ready.
> 
>        On error, -1 is returned, and errno is set to indicate the error; the file descriptor
>        sets are unmodified, and timeout becomes undefined.
> 
> The second paragraph quoted from the man page above would indicate to me that perhaps the
> timeout needs to be reset inside the if block. eg:
> 
>       if (status < 0 && errno == EINTR) {
>          timeout.tv_sec = time;
>          timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
>          continue;
>       }
> 
> Please note that I've only just briefly looked at this and was hoping someone more
> familiar with using select() as a way to sleep() might be better able to understand
> what is happening.
> 
> I wonder also if now that midas requires stricter/newer c++ standards if there maybe
> some more straightforward method to sleep that is sufficiently robust and portable.
> 
> Thanks,
> 
> Nick.

I had the same error a few months ago, though I wasn't using a tagged release. It happened because I was calling TMFE::Sleep() 
with a negative time. If your issues were caused by the same reason, TMFE::Sleep() can handle negative times since commit 
591f78f (https://bitbucket.org/tmidas/midas/commits/591f78f52893d5ffd64bf4e52a1daac537ebd672).

Early in my debugging, I did come to the same conclusions you did, and actually tried a similar solution the one you suggested. 
This was a few months ago and I didn't write down what happened, but I believe it didn't work because in my case the errno was 
something other than EINTR, and/or the timeval was still an invalid argument for sleep because the timeout was still negative. I 
never followed it up because I was able to fix my problem by fixing my frontend.
  2904   26 Nov 2024 Nick HastingsBug ReportTMFE::Sleep() errors
Hello,

I've noticed that SC FEs that use the TMFE class with midas-2022-05-c often report errors when calling TMFE:Sleep().
The error is :

[tmfe.cxx:1033:TMFE::Sleep,ERROR] select() returned -1, errno 22 (Invalid argument).

This seems to happen in two different ways:

1. Error being reported repeatedly
2. Occasional single errors being reported

When the first of these presents, we typically restart the FE to "solve" the problem.
Case 2. is typically ignored.

The code in question is:

void TMFE::Sleep(double time)
{
   int status;
   fd_set fdset;
   struct timeval timeout;
      
   FD_ZERO(&fdset);
      
   timeout.tv_sec = time;
   timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;

   while (1) {
      status = select(1, &fdset, NULL, NULL, &timeout);
#ifdef EINTR
      if (status < 0 && errno == EINTR) {
         continue;
      }
#endif
      break;
   }
      
   if (status < 0) {
      TMFE::Instance()->Msg(MERROR, "TMFE::Sleep", "select() returned %d, errno %d (%s)", status, errno, strerror(errno));
   }
}

So it looks like either file descriptor of the timeval struct must have a problem.
From some reading it seems that invalid timeval structs are often caused by one or both
of tv_sec or tv_usec not being set. In the code above we can see that both appear to be
correctly set initially.

From the select() man page I see:

RETURN VALUE
       On success, select() and pselect() return the number of file descriptors contained in
       the three returned descriptor sets (that is, the total number of bits that are set in
       readfds,  writefds,  exceptfds).  The return value may be zero if the timeout expired
       before any file descriptors became ready.

       On error, -1 is returned, and errno is set to indicate the error; the file descriptor
       sets are unmodified, and timeout becomes undefined.

The second paragraph quoted from the man page above would indicate to me that perhaps the
timeout needs to be reset inside the if block. eg:

      if (status < 0 && errno == EINTR) {
         timeout.tv_sec = time;
         timeout.tv_usec = (time-timeout.tv_sec)*1000000.0;
         continue;
      }

Please note that I've only just briefly looked at this and was hoping someone more
familiar with using select() as a way to sleep() might be better able to understand
what is happening.

I wonder also if now that midas requires stricter/newer c++ standards if there maybe
some more straightforward method to sleep that is sufficiently robust and portable.

Thanks,

Nick.
  2903   24 Nov 2024 Pavel MuratBug ReportODB lock timeout, Difficulty running MIDAS on Rocky 9.4
there is a really good software tool developed by the Fermilab DAQ group, called TRACE - 

https://github.com/art-daq/trace ,

It could be useful for debugging cases like this one. In short, TRACE instruments the code 
with the printouts which could be selectively turned on and off without recompiling the executable. 

TRACE output could go to /dev/stdout (slow output) and/or to a circular buffer implemented via a shared 
memory segment (fast output). Sending unlimited output to the shared memory segment is extremely useful.

TRACE also allows to trigger on certain conditions, again, w/o recompiling the executable. 
For debugging cases like the one in question, that could turn out even more useful, 
however I didn't try the triggering functionality myself. 

-- regards, Pasha
  2902   22 Nov 2024 Konstantin OlchanskiBug ReportODB lock timeout, Difficulty running MIDAS on Rocky 9.4
> try to replicate the issue ...

I see ODB lock timeout (and abort() of everything) in the dsvslice test station. We have 
about 15-20 MIDAS clients connected.

I am pretty sure we have not seen this problem until recently (and I have not seen it 
personally for a very long time). There were no changes to the MIDAS ODB locking code in a 
long time.

I suspect a recent change in the linux kernel. But I am likely to be wrong.

I have 1000 core dumps from this crash of dsvslice, and among them should be the 1 thread
that has ODB locked. Wish me luck finding it. Worst case is to discover that ODB is locked 
but nobody is holding a lock ("missing unlock bug"). This is hard to debug, I would have add 
tracking of "who was the last one to lock it, who forgot to unlock it".

K.O.
  2901   21 Nov 2024 Stefan RittInfoWhat do the status numbers mean and where can I find more information about them?
> [RP Streaming Frontend,ERROR] [midas.cxx:17806:cm_write_event_to_odb,ERROR] 
> cannot create key for bank "DATA" with tid 24 in ODB, db_create_key() status 309
> 
> 
> 
> I just need more information on what the error message means. Which data type 
> refers to tid 24 and what does status 309 indicate?? 

A tid (type identification) of 24 does actually not exist. See midas.h:327, so this tells
me that your bank header got corrupted. Somewhere you write over your data.

Stefan
  2900   21 Nov 2024 Mann GandhiInfoWhat do the status numbers mean and where can I find more information about them?
Hello, 

This is the error message I got:

[RP Streaming Frontend,ERROR] [midas.cxx:17806:cm_write_event_to_odb,ERROR] 
cannot create key for bank "DATA" with tid 24 in ODB, db_create_key() status 309
read_periodic_event: No data in ring buffer or error occurred

[RP Streaming Frontend,ERROR] [odb.cxx:3373:db_create_key,ERROR] invalid key type 
24 to create 'DATA' in '/Equipment/Periodic/Variables'

[RP Streaming Frontend,ERROR] [midas.cxx:17806:cm_write_event_to_odb,ERROR] 
cannot create key for bank "DATA" with tid 24 in ODB, db_create_key() status 309



I just need more information on what the error message means. Which data type 
refers to tid 24 and what does status 309 indicate?? 

There is definitely data in the ring buffer but I keep on getting this error.

Thank you!

M.G 
  2899   18 Nov 2024 Lukas GerritzenSuggestionComma-separated indices in alarm conditions
I have the following use case: I would like to check if two elements of an array exceed a certain threshold. 
However, they are not consecutive. Currently, I have to write two alarms, one checking Array[8] and one 
checking Array [10].

It would be nice if we could enter conditions such as "/Path/To/Array[8,10] > 0.5".

I looked into the code of al_evaluate_condition() and it seems very C-style. I know that you have been 
refactoring a lot of code to work with STL strings and their functions. If you find the time to refactor 
alarm.cxx, I ask that you consider adding comma-separated lists as a new feature.

Cheers
Lukas
  2898   18 Nov 2024 Stefan RittInfoNew sequencer command ODBLOOKUP
> "value not found" sets "index" to ?

It sets it actually to "not found". Since all variables are stings in the sequencer, you can then do a test like

ODBLOOKUP ..., index
if ($index == "not found")
  ...


> "odb key not found" sets "index" to ?

If the odb key is not found, the sequencer aborts.

> link to documentation?

The documentation is where it always has been: 

https://daq00.triumf.ca/MidasWiki/index.php/Sequencer#Sequencer_Commands

/Stefan
  2897   15 Nov 2024 Konstantin OlchanskiInfoNew sequencer command ODBLOOKUP
> A new sequencer command "ODBLOOKUP" has been implemented, which does a lookup of a string in a string 
> array in the ODB given by a path and returns its index as a number. If we have for example an array
> 
> /Examples/Names
>    [0] Hello
>    [1] Test
>    [2] Other
> 
> and do a
>  
> ODBLOOKUP "/Examples/Names", "Test", index
> 
> we get a index equal 1.
> 

"value not found" sets "index" to ?
"odb key not found" sets "index" to ?
link to documentation?

K.O.
  2896   15 Nov 2024 Konstantin OlchanskiSuggestionIssue with creating banks
> Hello, I am a coop student working at SNOLAB.
> void* data_acquisition_thread(void* param)
> {
> 	EVENT_HEADER *pevent;
>       if (complicated) {
> 			status = rb_get_wp(rbh, (void **) &pevent, 0);
>       }
>       bm_compose_event_threadsafe(pevent, 1, 0, 0, &equipment[0].serial_number);
> }

this code is buggy. it should read "EVENT_HEADER *pevent = NULL;" to avoid an uninitialized variable
and bm_compose_event() & co should be inside an "if (pevent != NULL)" block, unless you can absolutely
proove that rb_get_wp() is always called and pevent is never NULL. (even is somebody changes the code later).

if you build your code with "gcc -O2 -g -Wall -Wuninitialized" it would probably warn you about use of uninitilialized 
"pevent".

P.S. for building multithreaded frontends, you are much better off starting from the c++ tmfe frontend framework,
a good starting point is study tmfe_example_everything.cxx.

K.O.
  2895   14 Nov 2024 Mann GandhiSuggestionIssue with creating banks
> All I can see is that your bank header gets corrupted along the way. The funny character reported by 
> cm_write_event_to_odb indicates that your original name "RPD0" got overwritten somewhere, but I could not spot any 
> mistake in your code. 
> 
> I would play around: change max_event_size, produce dummy data of size N instead of the recv() and so on. Also monitor 
> the bank header to see when it gets overwritten. I guess you only write form one thread, so that should be safe, right?
> 
> Best,
> Stefan

Hello Stefan, 

Thank you for the advice. On inspection, I noticed that my event size (when I print bk_size(pevent)) is around 1.4 billion 
which seems absurd so I am not sure why this is the case as well. In addition, is mdump the way to monitor the bank header?
I just recently started using MIDAS so I am a little bit confused. I can attach a link to the github repository where I am 
currently working on this for further clarity since I am sure there is an issue in my code somewhere. 
(https://github.com/mgandhi-1/red-pitaya-frontend/blob/10-issue-with-bank-creation-neeed-to-figure-out-why-banks-are-not-
being-created-correctly/frontend.cxx)

I appreciate the help. Thank you once more.

Best, 
Mann
  2894   14 Nov 2024 Stefan RittSuggestionIssue with creating banks
All I can see is that your bank header gets corrupted along the way. The funny character reported by 
cm_write_event_to_odb indicates that your original name "RPD0" got overwritten somewhere, but I could not spot any 
mistake in your code. 

I would play around: change max_event_size, produce dummy data of size N instead of the recv() and so on. Also monitor 
the bank header to see when it gets overwritten. I guess you only write form one thread, so that should be safe, right?

Best,
Stefan
  2893   14 Nov 2024 Mann GandhiSuggestionIssue with creating banks
Hello, I am a coop student working at SNOLAB. I am currently setting up a frontend 
program to collect data for an experiment I am currently having with my bank being 
initialized correctly with the correct name. I will attach an image of the error and 
a code snippet for clarity. This is a multi-thread program using ring buffers. The 
first thread  is only responsible for data collection of ADC values from the Red 
Pitaya (FPGA) and the second thread does a simple derivative calculation. The 
frontend makes use of the TCP connection to stream data from the Red Pitaya. 

Here is the code snippet. This is the only place in the frontend code where I 
initialize and create a bank to store the ADC values from the Red Pitaya. 

void* data_acquisition_thread(void* param)
{
	printf("Data acquisition thread started\n");
	// Obtain ring buffer for inter-thread data exchange
	EVENT_HEADER *pevent;
	WORD *pdata;
	int status;

	//Set a timeout for the recv function to prevent indefinite blocking
	struct timeval timeout;
	timeout.tv_sec = 10; //seconds
	timeout.tv_usec = 0; // 0 microseconds
	setsockopt(stream_sockfd, SOL_SOCKET, SO_RCVTIMEO, (char *)&timeout, 
sizeof(timeout));



	while (is_readout_thread_enabled())
	{

		if (!readout_enabled())
		{
			usleep(10); // do not produce events when run is stopped
			continue;
		}
		// Acquire a write pointer in the ring buffer
		int status;
		do {
			status = rb_get_wp(rbh, (void **) &pevent, 0);
			if (status == DB_TIMEOUT)
			{
				usleep(5);
				if (!is_readout_thread_enabled()) break;
			}
		} while (status != DB_SUCCESS);

		if (status != DB_SUCCESS) continue;

		// Lock mutex before accessing shared resources
		pthread_mutex_lock(&lock);

		// Buffer for incoming data
		//int16_t temp_buffer[4096] = {0};

		bm_compose_event_threadsafe(pevent, 1, 0, 0, 
&equipment[0].serial_number);
        pdata = (WORD *)(pevent + 1);  // Set pdata to point to the data section of 
the event

		// Initialize the bank and read data directly into the bank
        bk_init32(pevent);
        bk_create(pevent, "RPD0", TID_WORD, (void **)&pdata);

		int bytes_read = recv(stream_sockfd, pdata, max_event_size * 
sizeof(WORD), 0);
		printf("Data received: %d bytes\n", bytes_read);


		if (bytes_read <= 0)
		{
			if (bytes_read == 0)
			{
				printf("Red Pitaya disconnected\n");
				pthread_mutex_unlock(&lock);
				break;

			} else if (errno == EWOULDBLOCK || errno ==EAGAIN)
			{
				printf("Receive timeout\n");
				pthread_mutex_unlock(&lock);
				continue;
			}

			else
			{
				printf("Error reading from the Red Pitaya: %s\n", 
strerror(errno));
				pthread_mutex_unlock(&lock);
				continue;
			}

		}
		
		 // Adjust data pointers after reading
        pdata += bytes_read / sizeof(WORD);
        bk_close(pevent, pdata);

        pevent->data_size = bk_size(pevent);

		// Unlock mutex after writing to the buffer
		pthread_mutex_unlock(&lock);

		// Send event to ring buffer
		rb_increment_wp(rbh, sizeof(EVENT_HEADER) + pevent->data_size);
	}
	pthread_mutex_unlock(&lock);

	return NULL;
}
Attachment 1: Screenshot_from_2024-11-14_12-35-06.png
Screenshot_from_2024-11-14_12-35-06.png
  2892   13 Nov 2024 Stefan RittInfoNew sequencer command ODBLOOKUP
A new sequencer command "ODBLOOKUP" has been implemented, which does a lookup of a string in a string 
array in the ODB given by a path and returns its index as a number. If we have for example an array

/Examples/Names
   [0] Hello
   [1] Test
   [2] Other

and do a
 
ODBLOOKUP "/Examples/Names", "Test", index

we get a index equal 1.


/Stefan
  2891   07 Nov 2024 Stefan RittSuggestionStop run and sequencer button
I don't find this very useful. Some experiments do not only want to stop the run, but also do other cleanup things. To do that, I proposed and "atexit" function like C has it. Then the user can put a run stop there, plus any other cleanup. This will be much more flexible. Think about the "reset" script we have to manually run if we abort a sequencer. The atexit function will come next week, so you should consider to use it instead your additional button.

Stefan
  2890   07 Nov 2024 Lukas GerritzenSuggestionStop run and sequencer button
Due to popular demand among our students, I added a button to the sequencer that stops the run and the sequence. If you find it useful, please consider merging this upstream.
$ git diff sequencer.html
diff --git a/resources/sequencer.html b/resources/sequencer.html
index e7f8a79d..95c7e3d8 100644
--- a/resources/sequencer.html
+++ b/resources/sequencer.html
@@ -115,6 +115,7 @@
               <img src="icons/play.svg" title="Start" class="seqbtn Stopped" onclick="startSeq();">
               <img src="icons/debug.svg" title="Debug" class="seqbtn Stopped" onclick="debugSeq();">
               <img src="icons/square.svg" title="Stop" class="seqbtn Running Paused" onclick="stopSeq();">
+              <img src="icons/x-octagon.svg" title="Stop Run and Sequencer immediately" class="seqbtn Running Paused" onclick="stopRunAndSeq();">
               <img src="icons/pause.svg" title="Pause" class="seqbtn Running" onclick="modbset('/Sequencer/Command/Pause script',true);">
               <img src="icons/resume.svg" title="Resume" class="seqbtn Paused" onclick="modbset('/Sequencer/Command/Resume script',true);">
               <img src="icons/step-over.svg" title="Step Over" class="seqbtn Running Paused" onclick="modbset('/Sequencer/Command/Step over',true);">
[gac-megj@pc13513 resources]$ git diff sequencer.js
diff --git a/resources/sequencer.js b/resources/sequencer.js
index cc5398ef..b75c926c 100644
--- a/resources/sequencer.js
+++ b/resources/sequencer.js
@@ -1582,6 +1582,23 @@ function stopSeq() {
    });
 }

+function stopRunAndSeq() {
+   const message = `Are you sure you want to stop the run and sequence?`;
+   dlgConfirm(message,function(resp) {
+      if (resp) {
+         modbset('/Sequencer/Command/Stop immediately',true);
+
+         mjsonrpc_call("cm_transition", {"transition": "TR_STOP"}).then(function (rpc) {
+            if (rpc.result.status !== 1) {
+               throw new Error("Cannot stop run, cm_transition() status " + rpc.result.status + ", see MIDAS messages");
+            }
+         }).catch(function (error) {
+            mjsonrpc_error_alert(error);
+         });
+      }
+   });
+}
+
 // Show or hide parameters table
 function showParTable(varContainer) {
    let e = document.getElementById(varContainer);
  2889   06 Nov 2024 Amy RobertsBug ReportDifficulty running MIDAS on Rocky 9.4
After following Konstantin's debugging suggestions, I thought I would try to replicate 
the issue on my own computer.  My hope was that I could provide instructions for 
replicating the bug so that the MIDAS team could try debugging things more easily.

However, when I ran the current version of MIDAS in a Rocky 9.4 VM on my laptop (both 
VMWare and VirtualBox), mserver and odbedit ran just fine (!).

I'm currently trying to find out if there's a way to compare the VMs on my machine and 
the machine that's being problematic, I'll report back if I learn anything.
ELOG V3.1.4-2e1708b5