Midas Talk

Midas DAQ System

Not logged in

Back | Find | Login | Help

Message ID: 500 Entry time: 18 Sep 2008

Author:	Stefan Ritt
Topic:	Info
Subject:	Potential problems in multi-threaded slow control front-end

We had recently some problems at our experiment which I would like to share 
with the community. This affects however only experiments which have a slow 
control front-end in multi-threaded mode.

The problem is related with the fact that the midas API is not thread safe, so 
a device driver or bus driver from the slow control system may not call any ODB 
function. We found several drivers (mainly psi_separator.c, psi_beamline.c etc) 
which use inside read/write function the midas PAI function cm_msg() to report 
any error. While this is ok for the init section (which is executed in the main 
frontend thread) this is not ok for the read/write function inside the driver. 
If this is done anyhow, it can happen that the main thread locks the ODB (via 
db_lock_database()) and the thread interrupts that call and locks the ODB 
again. In rare cases this can cause a stale lock on the ODB. This blocks all 
other programs to access the ODB and the experiment will die loudly. It is hard 
to identify, since error messages cannot be produced any more, and remote 
programs (not affected by the lock) just show a rpc timeout.

I fixed all drivers now in our experiment which solved the problem for us, but 
I urge other people to double check their device drivers as well.

In case of problems, there is a thread ID check in 
db_lock_database()/db_unlock_database() which can be activated by supplying 

-DCHECK_THREAD_ID

in the compile command line. If then these functions are called from different 
threads, the program aborts with an assertion failure, which can then be 
debugged. 

There is also a stack history system implemented with new functions 
ss_stack_xxxx. Using this system, one can check which functions called 
db_lock_database() *before* an error occurs. Using this system, I identified 
the malicious drivers. Maybe this system can also be used in other error 
debugging scenarios.

ELOG V3.1.4-2e1708b5