Back Midas Rome Roody Rootana
  Midas DAQ System, Page 100 of 146  Not logged in ELOG logo
Entry  12 Dec 2011, Michael Murray, Bug Report, bk_delete uses memcpy instead of memmove 
In midas.c, the bk_delete function removes a bank by decrementing the total
event size and then copying the remaining banks into the location of the first
using memcpy from string.h.

memcpy is not specified to handle overlapping memory regions (such as MIDAS
banks), though it seems most common implementations do.

memmove should be used instead, which is specified to behave as if copying
through an intermediate buffer.

I noticed the misbehavior using glibc with gcc version 4.4.4 and scientific
linux 6.0. Other gcc versions changed nothing, as this originates from the
implementation of memcpy in libc.

libc version:
GNU C Library stable release version 2.12, by Roland McGrath et al.
Compiled by GNU CC version 4.4.5 20110214 (Red Hat 4.4.5-6).
Compiled on a Linux 2.6.32 system on 2011-12-06.
    Reply  16 Dec 2011, Konstantin Olchanski, Bug Report, bk_delete uses memcpy instead of memmove 
> In midas.c, the bk_delete function removes a bank by decrementing the total
> event size and then copying the remaining banks into the location of the first
> using memcpy from string.h.


I confirm the documented difference between memcpy() and memmove() and I confirm the 
questionable use of memcpy() in bk_delete(). I think it should be memmove(). I made it so in my copy 
of midas, so this change will not be lost.

But I am not sure how to test it - I do not think I ever used bk_delete(). I will probably ponder upon 
this and do a blind commit.


K.O.
    Reply  19 Dec 2011, Stefan Ritt, Bug Report, bk_delete uses memcpy instead of memmove 
> > In midas.c, the bk_delete function removes a bank by decrementing the total
> > event size and then copying the remaining banks into the location of the first
> > using memcpy from string.h.
> 
> 
> I confirm the documented difference between memcpy() and memmove() and I confirm the 
> questionable use of memcpy() in bk_delete(). I think it should be memmove(). I made it so in my copy 
> of midas, so this change will not be lost.
> 
> But I am not sure how to test it - I do not think I ever used bk_delete(). I will probably ponder upon 
> this and do a blind commit.
> 
> 
> K.O.

It cannot hurt to use memmove(), so please go ahead to commit the changes.

- Stefan
Entry  29 Feb 2012, Konstantin Olchanski, Bug Report, Problem with semaphores 
Hi there! In the T2K/ND280 experiment in Japan, we keep having problems with MIDAS locking (probably 
of ODB). The symptoms are: some program reports a timeout waiting for the ODB lock, then all programs 
eventually die with this same error. Complete system meltdown. This does not look like the deadlock 
between locks for ODB, cm_msg and the data buffers that I looked into last year. It looks more like 
somebody locks ODB, dies and the Linux kernel fails to unlock the lock (via the SYSV "sem undo" 
function). But it is hard to confirm, hence this message:

The implementation of semaphores in MIDAS (used for locking ODB and the shared memory data buffers) 
uses the straight SYSV semaphore API - which lacks basic debugging features - there is no tracking of 
who locked what when, so if anything at all goes wrong at all, i.e. we are confronted with a timeout 
waiting for the ODB lock, the only corrective action possible is to kill all MIDAS clients and tell the user to 
start from scratch. There is no additional information available from the SYSV semaphore API to identify 
which MIDAS program caused the fault.

The POSIX semaphore API is even worse - no debugging features are available, *and* if a program dies 
while holding a lock, the lock stays locked forever (everybody else will wait forever or see a semaphore 
timeout, and then what?).

So I am looking for an "advanced semaphore library" to use in MIDAS. In addition to the boring functions 
of reliable locking and unlocking, it should support:
- wait with timeout
- remember who is holding the lock
- detect that the process holding the lock is dead and take corrective action (automatic unlock as done by 
SYSV semaphores, call back to user code where we can cleanup and unlock ourselves, etc)
- maybe permit recursive locking (not really required as ODB locks are already made recursive "by hand")
- maybe remember some of the locking history (so we can dump it into a log file when we detect a 
deadlock or other lock malfunction).

Quick google search only find sundry wrappers for SYSV and POSIX semaphores. How they deal with the 
problem of processes locking the semaphore and dying remains a mystery to me (other than telling users 
to remove the Ctrl-C button from their keyboard). BTW, we have seen this problem with several 
commercial applications that use SYSV semaphores but forget to enable the SEM_UNDO function).

Anyhow, if anybody can suggest such an advanced locking library it would be great. Will save me the 
effort of writing one.

K.O.
    Reply  01 Mar 2012, Stefan Ritt, Bug Report, Problem with semaphores 
> Anyhow, if anybody can suggest such an advanced locking library it would be great. Will save me the 
> effort of writing one.

Hi Konstantin,

yes there is a good way, which I used during development of the buffer manager function. Put in each sm_xxx function a cm_msg(M_DEBUG, ...) to 
generate a debug system message. They go only into the SYSMSG ring buffer and thus are light weight and don't influence the timing much. You can 
keep odbedit open to see these messages, but there is also another way. You can write a little program which dumps the whole SYSMSG buffer, which 
you can call when the lock happens. You then look "backwards" in time and get all messages stored there, depending of the size of the SYSMSG buffer of 
course. Of course this only works if the lock does not happen on the SYSMSB buffer itself. In that case you have to produce M_LOG messages which are 
written to the logging file. This will influence the timing slightly (the file might grow rapidly) but you are independent of semaphores.

The interesting thing is that in the MEG experiment (9 Front-ends, Event Builder, Logger, Lazylogger, ....) we run for months without any lock up. So I 
might suspect it's caused in your case from a program only you are using.

Best regards,
Stefan
Entry  18 Apr 2012, Exaos Lee, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’ mlogger-err.log.gz
I tried to build MIDAS under ArchLinux, failed on errors as following:
src/mlogger.cxx: In function ‘INT midas_flush_buffer(LOG_CHN*)’:
src/mlogger.cxx:1011:54: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
/usr/include/zlib.h:1318:21: error:   initializing argument 1 of ‘int gzwrite(gzFile, voidpc, unsigned int)’ [-fpermissive]
src/mlogger.cxx: In function ‘INT midas_log_open(LOG_CHN*, INT)’:
src/mlogger.cxx:1200:79: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
Please refer to attachment elog:786/1 for detail. There are also many warnings listed.

This error can be supressed by adding -fpermissive to CXXFLAGS. But the error message is correct."gzFile" is not equal to "void *"! C allows implicit casts between void* and any pointer type, C++ doesn't allow that. It's better to fix this error. A quick fix would be adding explicit casts. But I'm not sure what is the proper way to fix this.
    Reply  19 Apr 2012, Stefan Ritt, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’ 

Exaos Lee wrote:
I tried to build MIDAS under ArchLinux, failed on errors as following:
src/mlogger.cxx: In function ‘INT midas_flush_buffer(LOG_CHN*)’:
src/mlogger.cxx:1011:54: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
/usr/include/zlib.h:1318:21: error:   initializing argument 1 of ‘int gzwrite(gzFile, voidpc, unsigned int)’ [-fpermissive]
src/mlogger.cxx: In function ‘INT midas_log_open(LOG_CHN*, INT)’:
src/mlogger.cxx:1200:79: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
Please refer to attachment elog:786/1 for detail. There are also many warnings listed.

This error can be supressed by adding -fpermissive to CXXFLAGS. But the error message is correct."gzFile" is not equal to "void *"! C allows implicit casts between void* and any pointer type, C++ doesn't allow that. It's better to fix this error. A quick fix would be adding explicit casts. But I'm not sure what is the proper way to fix this.


Ah, dumb gcc gets pickier and pickier. I added a case (gzFile)log_chn->gzfile which fixes the error. I cannot put gzFile already into the header file since the zlib header is included after the midas header, otherwise we get some other problems. The SVN version with the fix is 5275.
    Reply  25 Apr 2012, Konstantin Olchanski, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’ 
Stefan's fix is incomplete - the "gzFile" cast is needed for all calls to zlib, not just those that some version 
of GCC happens to complain about. Fixed.
svn rev 5286.

BTW, I read the midas elog via email and if you post html or elcode messages, I receive complete 
gibberish. For prompt service, please select message type "plain". (yes, you cannot use fancy colours and 
blinking text, but better than me not reading your stuff at all).

BTW2, for easier reading, please include error messages as plain text in your message. As opposed to 
compressed attachements.

K.O.
    Reply  27 Apr 2012, Stefan Ritt, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’ 

KO wrote:
BTW, I read the midas elog via email and if you post html or elcode messages, I receive complete
gibberish. For prompt service, please select message type "plain". (yes, you cannot use fancy colours and
blinking text, but better than me not reading your stuff at all).

BTW2, for easier reading, please include error messages as plain text in your message. As opposed to
compressed attachements.

K.O.


BTW3, if you use a real email program you don't get glibberish. I know some people prefer good-old-text-only pine, but I'm sure you do not use the ascii-only browser lynx to browse the internet, right? So if you browse the web in graphics, why not read your email in graphics as well. Better change yourself than the whole rest of the world Wink
Entry  09 Jun 2012, Greg Christian, Bug Report, _net_send_buffer realloc 
In midas.c, I noticed that memory is only allocated to the global buffer 
_net_send_buffer by calling realloc() from within the function 
resize_net_send_buffer() (at least this was the only place I could find 
allocation to _net_send_buffer happening). This can cause problems for a couple 
of reasons:

1) _net_send_buffer is not set to NULL when declared. To my understanding, this 
makes the first call to realloc(_net_send_buffer, /*size*/) undefined. When 
passed a pointer that has not previously been allocated, realloc() acts like 
malloc() only if the pointer equal to NULL. Otherwise, the behavior is undefined 
and usually causes a crash.

2) cm_disconect_experiment() calls free(_net_send_buffer) but does not set its 
value to NULL. Thus if a client tries to include more than one 
connect...disconnect cycle within an application, there is undefined behavior 
the next time realloc(_net_send_buffer, ...) gets called.

I think that any potential allocation issues involving _net_send_buffer could be 
solved by:

1) Initializing _net_send_buffer to NULL.

2) In cm_disconnect_experiment(), changing
>   M_FREE(_net_send_buffer); 
to 
>   M_FREE(_net_send_buffer);
>   _net_send_buffer = NULL;
    Reply  10 Jun 2012, Konstantin Olchanski, Bug Report, _net_send_buffer realloc 
> In midas.c, ...
>
> 1) _net_send_buffer is not set to NULL when declared.

_net_send_buffer is a global variable. All global variables are automatically initialized to zero before the program 
starts.

static char*x; // = NULL; is redundant
char*y=realloc(x, 100);  // x is NULL, usage is correct

> 2) cm_disconect_experiment() calls free(_net_send_buffer) but does not set its 
> value to NULL.

My copy of midas.c (svn rev 5256) sets _net_send_buffer to NULL:

   if (_net_send_buffer_size > 0) { 
      M_FREE(_net_send_buffer); 
      _net_send_buffer_size = 0; 
   } 
 
What version of midas do you have? (svn info .)

K.O.
    Reply  10 Jun 2012, Greg Christian, Bug Report, _net_send_buffer realloc 
> > In midas.c, ...
> >
> > 1) _net_send_buffer is not set to NULL when declared.
> 
> _net_send_buffer is a global variable. All global variables are automatically 
initialized to zero before the program 
> starts.
> 
> static char*x; // = NULL; is redundant
> char*y=realloc(x, 100);  // x is NULL, usage is correct
>

Ah,okay. I was not aware of this feature of global variables.

 
> > 2) cm_disconect_experiment() calls free(_net_send_buffer) but does not set 
its 
> > value to NULL.
> 
> My copy of midas.c (svn rev 5256) sets _net_send_buffer to NULL:
> 
>    if (_net_send_buffer_size > 0) { 
>       M_FREE(_net_send_buffer); 
>       _net_send_buffer_size = 0; 
>    } 
>  
> What version of midas do you have? (svn info .)
> 
> K.O.

I have version 5256 also (matches what you posted), but I only see 
_net_send_buffer_size being set to 0, not _net_send_buffer itself. In midas.h, 
M_FREE(x) only expands to free(x) if _MEM_DBG is not defined.
    Reply  11 Jun 2012, Konstantin Olchanski, Bug Report, _net_send_buffer realloc 
> > > In midas.c, ...
> > >
> > > 1) _net_send_buffer is not set to NULL when declared.
> 
> Ah,okay. I was not aware of this feature of global variables.
> 

RTFM K&R "The C programming language". 
http://en.wikipedia.org/wiki/The_C_Programming_Language

>  
> > > 2) cm_disconect_experiment() calls free(_net_send_buffer) but does not set 
> its value to NULL.
>

Confirmed. Sorry for confusion in my previous message. Set the pointer to NULL after free() is good practice.

But note that calling cm_connect and cm_disconnect multiple times is unusual use of MIDAS and you will most 
likely find more breakage.

K.O.
Entry  13 Jun 2012, Exaos Lee, Bug Report, Cannot start/stop run through mhttpd 
Revision: r5286 
Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports 
Problem:
After building and installation, using the script 'start_daq.sh' to start
'sampleexpt'. Everything seems fine. But I cannot start a run through web. Using
'odbedit' and 'mtransition' to start/stop a run works fine. So, what may cause
such a problem?
    Reply  13 Jun 2012, Konstantin Olchanski, Bug Report, Cannot start/stop run through mhttpd 
> Revision: r5286 
> Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports 
> Problem:
> After building and installation, using the script 'start_daq.sh' to start
> 'sampleexpt'. Everything seems fine. But I cannot start a run through web. Using
> 'odbedit' and 'mtransition' to start/stop a run works fine. So, what may cause
> such a problem?

Well, it's mhttpd who cannot start the run, not you. So what happens when you press
the "start run" button? Any errors in midas.log or in midas messages? Is mtransition
in your PATH?

K.O.
    Reply  13 Jun 2012, Exaos Lee, Bug Report, Cannot start/stop run through mhttpd 
> Well, it's mhttpd who cannot start the run, not you. So what happens when you press
> the "start run" button? Any errors in midas.log or in midas messages? Is mtransition
> in your PATH?
After pressing "start run", there is a message displayed: "Run start requested". There
is no error in midas.log. And mtransition is actually in my PATH. I even looked into
"mhttpd.cxx" and found where "cm_transition" is called for starting a run. I have no
clue to grasp the reason.
    Reply  14 Jun 2012, Exaos Lee, Bug Report, Cannot start/stop run through mhttpd 
> > Revision: r5286 
> > Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports 
> > Problem:
> > After building and installation, using the script 'start_daq.sh' to start
> > 'sampleexpt'. Everything seems fine. But I cannot start a run through web. Using
> > 'odbedit' and 'mtransition' to start/stop a run works fine. So, what may cause
> > such a problem?
> 
> Well, it's mhttpd who cannot start the run, not you. So what happens when you press
> the "start run" button? Any errors in midas.log or in midas messages? Is mtransition
> in your PATH?
> 
> K.O.

I found the problem only appears when I run mhttpd in scripts, whether bash or python.
And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
fine. So, what's the difference between invoking mhttpd directly and through a script?
    Reply  14 Jun 2012, Stefan Ritt, Bug Report, Cannot start/stop run through mhttpd 
> I found the problem only appears when I run mhttpd in scripts, whether bash or python.
> And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
> are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
> fine. So, what's the difference between invoking mhttpd directly and through a script?

When you start it with "-D", then mhttpd become a daemon. According to linux rules, it has to "cd /", so it lives in the 
root directory, in order not to block any NFS mount/unmount. If something with the path is not correct then, mhttpd 
cannot find mtransition then. Once I fixed that problem my moving mtransition to /usr/bin.

Stefan
    Reply  14 Jun 2012, Konstantin Olchanski, Bug Report, Cannot start/stop run through mhttpd 
> > I found the problem only appears when I run mhttpd in scripts, whether bash or python.
> > And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
> > are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
> > fine. So, what's the difference between invoking mhttpd directly and through a script?
> 
> When you start it with "-D", then mhttpd become a daemon. According to linux rules, it has to "cd /", so it lives in the 
> root directory, in order not to block any NFS mount/unmount. If something with the path is not correct then, mhttpd 
> cannot find mtransition then. Once I fixed that problem my moving mtransition to /usr/bin.
> 

I agree. Somehow mhttpd cannot run mtransition. I am not super happy with this dependance on user $PATH settings and the inability to capture error messages 
from attempts to start mtransition. I am now thinking in the direction of running mtransition code by forking. But remember that mlogger and the event builder also
have to use mtransition to stop runs (otherwise they can dead-lock). So an mhttpd-only solution is not good enough...

K.O.
    Reply  14 Jun 2012, Konstantin Olchanski, Bug Report, Cannot start/stop run through mhttpd 
> > > Revision: r5286 
> > > Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports 
> 
> I found the problem only appears when I run mhttpd in scripts, whether bash or python.
> And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
> are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
> fine.

Right. I see Debian 6.0.5 just came out hot off the presses. Would be good to fix this problem.

As a work around, can you run mhttpd without "-D", but in the background, i.e. "mhttpd -p xxx >& mhttpd.log &"?

Also what are your $PATH settings?

> So, what's the difference between invoking mhttpd directly and through a script?

As Stefan mentioned, "-D" invokes some nasty unix magic to disconnect the process from the user login session. It is 
possible that this magic breaks in the latest Debian.

MIDAS "-D" does roughly the same thing as "nohup".

K.O.
ELOG V3.1.4-2e1708b5