We recently had some problems at our experiment which I would like to share
with the community. They only affect experiments which run a slow control
front-end in multi-threaded mode.
The problem is related to the fact that the midas API is not thread safe, so
a device driver or bus driver from the slow control system must not call any ODB
function. We found several drivers (mainly psi_separator.c, psi_beamline.c etc.)
which call the midas API function cm_msg() inside their read/write functions to
report errors. While this is OK for the init section (which is executed in the
main frontend thread), it is not OK for the read/write functions inside the
driver.
If it is done anyway, it can happen that the main thread locks the ODB (via
db_lock_database()) and the driver thread interrupts that call and locks the
ODB again. In rare cases this can cause a stale lock on the ODB. This blocks all
other programs from accessing the ODB and the experiment will die loudly. The
problem is hard to identify, since error messages cannot be produced any more,
and remote programs (not affected by the lock) just show an RPC timeout.
I have now fixed all drivers in our experiment, which solved the problem for
us, but I urge other people to double-check their device drivers as well.
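
For anyone fixing their own drivers, one possible pattern is to let the driver
thread only store the error text and have the main frontend thread report it
later. The following is a minimal sketch of that idea and not the fix that was
applied to the drivers mentioned above. The names ERROR_QUEUE, errq_push() and
errq_pop() are hypothetical and not part of midas; the only assumption is that
the main thread calls errq_pop() periodically and passes the text on to
cm_msg() from there.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define ERRQ_SIZE    16   /* hypothetical queue depth */
#define ERRQ_MSG_LEN 128  /* hypothetical maximum message length */

/* Hypothetical error queue shared by the driver threads and the main thread */
typedef struct {
   pthread_mutex_t mutex;
   char msg[ERRQ_SIZE][ERRQ_MSG_LEN];
   int head;   /* index of the oldest stored message */
   int count;  /* number of stored messages */
} ERROR_QUEUE;

static ERROR_QUEUE errq = { PTHREAD_MUTEX_INITIALIZER, {{0}}, 0, 0 };

/* Called from a driver thread in place of a direct error report.
   Touches only the queue, never the ODB.
   Returns 1 if the message was stored, 0 if the queue was full. */
int errq_push(const char *text)
{
   int stored = 0;

   assert(text != NULL);
   pthread_mutex_lock(&errq.mutex);
   if (errq.count < ERRQ_SIZE) {
      int tail = (errq.head + errq.count) % ERRQ_SIZE;
      strncpy(errq.msg[tail], text, ERRQ_MSG_LEN - 1);
      errq.msg[tail][ERRQ_MSG_LEN - 1] = '\0';
      errq.count++;
      stored = 1;
   }
   pthread_mutex_unlock(&errq.mutex);
   return stored;
}

/* Called from the main thread. Copies the oldest message into buf.
   Returns 1 if a message was available, 0 if the queue was empty. */
int errq_pop(char *buf, size_t size)
{
   int found = 0;

   assert(buf != NULL && size > 0);
   pthread_mutex_lock(&errq.mutex);
   if (errq.count > 0) {
      strncpy(buf, errq.msg[errq.head], size - 1);
      buf[size - 1] = '\0';
      errq.head = (errq.head + 1) % ERRQ_SIZE;
      errq.count--;
      found = 1;
   }
   pthread_mutex_unlock(&errq.mutex);
   return found;
}
```

In this sketch the driver's read/write function would call errq_push(), and
the main thread would drain the queue with errq_pop() and report each message
itself. Messages are dropped when the queue is full, which seemed preferable
to blocking the driver thread.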
In case of problems, there is a thread ID check in
db_lock_database()/db_unlock_database() which can be activated by supplying
-DCHECK_THREAD_ID
in the compile command line. If these functions are then called from different
threads, the program aborts with an assertion failure, which can then be
debugged.
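
To illustrate the idea behind such a check, here is a minimal sketch. It is
not the midas implementation: my_lock(), called_from_owner_thread() and
check_from_second_thread() are hypothetical names, and the sketch assumes that
the very first call comes from the main thread during initialization, before
any driver thread has been started.

```c
#include <assert.h>
#include <pthread.h>

static pthread_t owner_thread;  /* thread that made the first call */
static int owner_set = 0;

/* Returns 1 if the caller is the thread that made the first call, else 0.
   Assumption: the first call happens before other threads exist. */
int called_from_owner_thread(void)
{
   if (!owner_set) {
      owner_thread = pthread_self();
      owner_set = 1;
   }
   return pthread_equal(owner_thread, pthread_self()) != 0;
}

/* Hypothetical stand-in for a lock function guarded by the check */
void my_lock(void)
{
#ifdef CHECK_THREAD_ID
   /* abort here, so that the debugger shows the offending caller */
   assert(called_from_owner_thread());
#endif
   /* ... the actual locking would go here ... */
}

/* Worker that stores what the check reports inside a second thread */
static void *worker(void *arg)
{
   *(int *) arg = called_from_owner_thread();
   return NULL;
}

/* Runs the check from a newly created thread.
   Returns the result of the check, or -1 if no thread could be created. */
int check_from_second_thread(void)
{
   pthread_t t;
   int result = -1;

   if (pthread_create(&t, NULL, worker, &result) != 0)
      return -1;
   pthread_join(t, NULL);
   return result;
}
```

When compiled with -DCHECK_THREAD_ID, a call to my_lock() from any thread
other than the first caller stops the program at the assertion, so the stack
trace points directly at the driver function responsible.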
There is also a stack history system implemented with new functions
ss_stack_xxxx. With it, one can check which functions called
db_lock_database() *before* an error occurs. This is how I identified
the offending drivers. Maybe this system can also be used in other error
debugging scenarios.
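
The general idea of such a call history can be sketched with a small ring
buffer that remembers the most recent callers. This is only an illustration
under my own assumptions and does not show the ss_stack_xxxx functions
themselves; hist_record(), hist_get() and the HIST_HERE() macro are
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define HIST_DEPTH 8   /* hypothetical: number of entries kept */
#define HIST_LEN   64  /* hypothetical: maximum length of one entry */

static char hist[HIST_DEPTH][HIST_LEN];
static unsigned long hist_total = 0;  /* entries recorded so far */
static pthread_mutex_t hist_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Record one entry, overwriting the oldest once the buffer is full */
void hist_record(const char *caller)
{
   assert(caller != NULL);
   pthread_mutex_lock(&hist_mutex);
   char *slot = hist[hist_total % HIST_DEPTH];
   strncpy(slot, caller, HIST_LEN - 1);
   slot[HIST_LEN - 1] = '\0';
   hist_total++;
   pthread_mutex_unlock(&hist_mutex);
}

/* Convenience macro: record the name of the enclosing function */
#define HIST_HERE() hist_record(__func__)

/* Copy the entry that lies `age` steps back (0 = most recent) into buf.
   Returns 1 on success, 0 if no such entry is held any more. */
int hist_get(int age, char *buf, size_t size)
{
   int ok = 0;

   assert(buf != NULL && size > 0);
   pthread_mutex_lock(&hist_mutex);
   unsigned long held = hist_total < HIST_DEPTH ? hist_total : HIST_DEPTH;
   if (age >= 0 && (unsigned long) age < held) {
      const char *slot = hist[(hist_total - 1 - (unsigned long) age) % HIST_DEPTH];
      strncpy(buf, slot, size - 1);
      buf[size - 1] = '\0';
      ok = 1;
   }
   pthread_mutex_unlock(&hist_mutex);
   return ok;
}
```

A function that takes the lock would call HIST_HERE() (or hist_record() with
its caller's name) on entry. When an error is detected later, hist_get() can
be used to list the last few callers, newest first.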