> I suspect mlogger uses ASYNC transactions exactly to avoid
> this type of deadlock (mlogger used ASYNC transactions since svn revision 2, the
> beginning of time).
That's exactly the case. If you had asked me, I would have told you
immediately, but it is also good that you re-confirmed the deadlock behavior with
the SYNC flag. I haven't checked this for the last ten years or so.
Making the buffers bigger is only a partial solution. If the disk gets slow for
some reason, any buffer will eventually fill up and you get the deadlock again.
The only real solution is to put the run stop logic into a separate thread. That
thread does all the RPC communication with the clients, while the main logger thread
logs data as usual in parallel. The problem is that the RPC layer has not yet been
completely tested for thread safety. I put in some mutexes, and you correctly
realized that these are system wide, whereas what you want is a local mutex just for
the logger process. You also need some basic communication between the "run stop
thread" and the "logger main thread". Maybe Pierre remembers that there once was the
problem that the logger did not know when all events had "come down the pipe", so it
did not know when it could close the file. He added some delay, which helped most of
the time. But if we had some communication from the "run stop thread" telling the
main thread that all programs except the logger have stopped the run, then the
logger only has to empty the local system buffer and knows 100% that everything is
done.
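
To make the hand-off concrete, here is a minimal sketch in C++ of what I have in
mind. It is not mlogger code and does not use the MIDAS API: LoggerState,
run_stop_thread, logger_main_loop and stop_run_and_close are hypothetical names, the
deque stands in for the local system buffer, the vector stands in for the output
file, and the RPC call to each client is only a placeholder comment. The mutex is an
ordinary process-local std::mutex.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins, not MIDAS data structures.
struct LoggerState {
    std::mutex mtx;                            // local to the logger process
    std::deque<int> buffer;                    // stands in for the local system buffer
    std::atomic<bool> clients_stopped{false};  // set by the run stop thread
    std::vector<int> file;                     // stands in for the output file
    bool file_closed = false;
};

// Run stop thread: talks to every client, then tells the main thread it is done.
void run_stop_thread(LoggerState& s, int n_clients) {
    for (int i = 0; i < n_clients; ++i) {
        // Placeholder for the blocking RPC that stops client i. We assume a
        // stopping client may still send one last event down the pipe.
        std::lock_guard<std::mutex> lock(s.mtx);
        s.buffer.push_back(1000 + i);
    }
    s.clients_stopped.store(true);  // all programs except the logger have stopped
}

// Main logger thread: keeps logging, and closes the file only after the flag
// was seen AND the buffer was emptied once more afterwards.
void logger_main_loop(LoggerState& s) {
    for (;;) {
        const bool stopped = s.clients_stopped.load();  // read the flag before draining
        {
            std::lock_guard<std::mutex> lock(s.mtx);
            while (!s.buffer.empty()) {
                s.file.push_back(s.buffer.front());
                s.buffer.pop_front();
            }
        }
        if (stopped) break;  // nothing can arrive after this drain
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    std::lock_guard<std::mutex> lock(s.mtx);
    assert(s.buffer.empty());
    s.file_closed = true;  // no fixed delay needed
}

// Runs one stop sequence and returns how many events ended up in the "file".
std::size_t stop_run_and_close(int n_clients, int n_events) {
    LoggerState s;
    for (int i = 0; i < n_events; ++i) s.buffer.push_back(i);  // events already in the pipe
    std::thread stopper(run_stop_thread, std::ref(s), n_clients);
    logger_main_loop(s);  // the calling thread plays the main logger thread
    stopper.join();
    return s.file.size();
}
```

The important detail is the order in the main loop: the flag is read first and the
buffer is drained afterwards, so a "stopped" flag always implies one more complete
drain before the file is closed. With 10 events already buffered and 3 clients,
stop_run_and_close(3, 10) returns 13 in this sketch.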
In the MEG experiment we have the same problem. We need a certain sequence
(basically because we have 9 front-ends and one event builder, which has to be
called after the front-ends). We realized quickly that the logger cannot stop the
run, so we wrote a little tool, "RunSubmit", which is a run sequencer with a
scripting facility. You write an XML file telling RunSubmit, for example, to start
10 runs, each with 5000 events. RunSubmit then watches the run statistics and stops
the run. Since it's outside the logger process, there is no deadlock. Unfortunately,
RunSubmit was written by one of our students and contains some MEG-specific code.
Otherwise it could be committed to the distribution.
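
The idea of such an external sequencer can be sketched in a few lines of C++. This
is not the RunSubmit source: SequenceStep, RunControl and run_sequence are
hypothetical names, the three callbacks stand in for whatever the tool uses to start
a run, read the run statistics and stop a run, and the XML parsing is left out
completely.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// One entry of the sequence, e.g. "10 runs with 5000 events each".
// Hypothetical structure; the real tool reads its sequence from an XML file.
struct SequenceStep {
    int n_runs;
    long events_per_run;
};

// Hypothetical hooks into the experiment, supplied by the caller.
struct RunControl {
    std::function<void()> start_run;
    std::function<long()> events_so_far;  // run statistics
    std::function<void()> stop_run;
};

// Executes the sequence from outside the logger process and returns the
// number of completed runs.
int run_sequence(const std::vector<SequenceStep>& steps, const RunControl& rc,
                 std::chrono::milliseconds poll_interval) {
    int runs_done = 0;
    for (const SequenceStep& step : steps) {
        for (int r = 0; r < step.n_runs; ++r) {
            rc.start_run();
            // Watch the statistics until the requested number of events is reached.
            while (rc.events_so_far() < step.events_per_run) {
                std::this_thread::sleep_for(poll_interval);
            }
            rc.stop_run();  // issued by the sequencer, not by the logger
            ++runs_done;
        }
    }
    return runs_done;
}
```

Because the stop request comes from a separate process, the logger stays free to
empty its buffers while the transition is in progress.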
So I feel that a separate thread for run stop (and maybe even start) would be a
good thing, but I'm not sure when I will have time to address this issue.
- Stefan