Back Midas Rome Roody Rootana
  Midas DAQ System  Not logged in ELOG logo
Entry  29 Feb 2012, Konstantin Olchanski, Bug Report, Problem with semaphores 
    Reply  01 Mar 2012, Stefan Ritt, Bug Report, Problem with semaphores 
Message ID: 784     Entry time: 29 Feb 2012     Reply to this: 785
Author: Konstantin Olchanski 
Topic: Bug Report 
Subject: Problem with semaphores 
Hi there! In the T2K/ND280 experiment in Japan, we keep having problems with MIDAS locking (probably 
of ODB). The symptoms are: some program reports a timeout waiting for the ODB lock, then all programs 
eventually die with this same error. Complete system meltdown. This does not look like the deadlock 
between locks for ODB, cm_msg and the data buffers that I looked into last year. It looks more like 
somebody locks ODB, dies and the Linux kernel fails to unlock the lock (via the SYSV "sem undo" 
function). But it is hard to confirm, hence this message:

The implementation of semaphores in MIDAS (used for locking ODB and the shared memory data buffers) 
uses the straight SYSV semaphore API - which lacks basic debugging features - there is no tracking of 
who locked what when, so if anything at all goes wrong at all, i.e. we are confronted with a timeout 
waiting for the ODB lock, the only corrective action possible is to kill all MIDAS clients and tell the user to 
start from scratch. There is no additional information available from the SYSV semaphore API to identify 
which MIDAS program caused the fault.

The POSIX semaphore API is even worse - no debugging features are available, *and* if a program dies 
while holding a lock, the lock stays locked forever (everybody else will wait forever or see a semaphore 
timeout, and then what?).

So I am looking for an "advanced semaphore library" to use in MIDAS. In addition to the boring functions 
of reliable locking and unlocking, it should support:
- wait with timeout
- remember who is holding the lock
- detect that the process holding the lock is dead and take corrective action (automatic unlock as done by 
SYSV semaphores, call back to user code where we can cleanup and unlock ourselves, etc)
- maybe permit recursive locking (not really required as ODB locks are already made recursive "by hand")
- maybe remember some of the locking history (so we can dump it into a log file when we detect a 
deadlock or other lock malfunction).

Quick google search only find sundry wrappers for SYSV and POSIX semaphores. How they deal with the 
problem of processes locking the semaphore and dying remains a mystery to me (other than telling users 
to remove the Ctrl-C button from their keyboard). BTW, we have seen this problem with several 
commercial applications that use SYSV semaphores but forget to enable the SEM_UNDO function).

Anyhow, if anybody can suggest such an advanced locking library it would be great. Will save me the 
effort of writing one.

ELOG V3.1.4-2e1708b5