> mhttpd function for starting and stopping runs now uses cm_transition(DETACH) which spawns an
> external helper program called mtransition to handle the transition sequencing.
>
> P.S. In one of our experiments, I sometimes see mhttpd getting "stuck" when starting or stopping a run
> using this feature. strace shows it is stuck in repeated calls to wait(), but I am unable to reproduce this
> problem in a test system and it happens only sometimes in the experiment. When it does, mhttpd has to
> be restarted. Replacing system("mtransition ...") with ss_sysem("mtransition ...") seems to fix this problem,
> but there are downsides to this (mtransition debug output vanishes) so I am not committing this yet.
> K.O.
Found the problem. As observed on SL5 systems, the GLIBC "system()" function breaks if the user application
installs a SIGCHLD handler that "steals" wait() notifications. Such a handler is installed by the MIDAS ss_exec()
function in system.c.
I would count this as a GLIBC bug - their "system()" function should survive in the presence of non-default signal
handlers installed by the user, and in fact my copy of "man signal" talks about the "system()" doing something
special about SIGCHLD. Obviously whatever they do is broken, at least in the SL5 GLIBC.
I am now testing an implementation using MIDAS ss_spawnvp().
The simplest way to reproduce the problem: start mhttpd; start/stop runs - mtransition works perfectly; start some
program from the MIDAS "programs" page (this calls "ss_exec()"), try to start a run - mhttpd will hang inside the
system() GLIBC function, every time. mhttpd has to be killed with "kill -KILL" to recover.
K.O. |