For the record, I am stumped by this problem. We have definitely ruled out any data overflow inside the midas message code (there are no
long messages sent). My only guess is that the frontend itself is corrupting the midas message buffer, but this corruption
must be unlikely lucky to corrupt just the "_" character (and maybe what follows it) from the "MSG_" header inside the buffer.
If indeed this is memory corruption inside the frontend, to find and fix it, one would have to roll out valgrind and other malloc() debugging
tools and good luck...
K.O.
> > > [dfe01,INFO] Slow control equipment initialized
> > > dfe: src/midas.c:838: cm_msg_flush_buffer: Assertion `rp[3]=='_'' failed.
> > > if I remove line 838 from midas.c (fixing message length) the problem disappear...
> >
> > Thank you for reporting this problem.
> >
> > It is very strange, the check is for message start "MSG_", why "M", "S" and "G" are there
> > but "_" is missing? And you remove the check for "_" and the rest of the message is also okey?
> > Very odd.
>
> if I remove the check for "_" then the first message is empty and next messages are ok...
> If I don't remove the check the frontend fails at start and I find these lines in midas.log:
>
> 14:46:29.719 2019/03/19 [dfe01,INFO] Program dfe01 on host lxaria02 started
> 14:46:29.731 2019/03/19 [dfe01,INFO] Dome FE initialized
> 14:46:29.737 2019/03/19 [dfe01,ERROR] [system.c:4709:recv_tcp2,ERROR] unexpected connection closure
> 14:46:29.737 2019/03/19 [dfe01,ERROR] [midas.c:12814:recv_event_server,ERROR] recv_tcp2(header) returned -1
> 14:46:29.737 2019/03/19 [dfe01,ERROR] [midas.c:14699:rpc_server_receive,ERROR] recv_event_server() returned -1, abort
> 14:46:29.737 2019/03/19 [dfe01,TALK] Program 'dfe01' on host 'lxaria02' aborted
>
> > You can also add this code "assert(4+3*sizeof(int)+len < 1020)" in cm_msg_buffer() right before
> > rb_increment_wp() - it this assert fails, we definitely determine that we have a buffer overflow.
>
> I added assert you suggested in cm_msg_buffer() function before rp_increment_wp() and
> result is always the same at same line:
>
> [dfe01,INFO] Dome FE initialized
> Dome0001-rc:
> [dfe01,INFO] Slow control equipment initialized
> dfe: src/midas.c:839: cm_msg_flush_buffer: Assertion `rp[3]=='_'' failed.
> Aborted
>
> Regards,
> Gennaro |