Back Midas Rome Roody Rootana
  Midas DAQ System  Not logged in ELOG logo
Entry  31 Aug 2004, Konstantin Olchanski, , midas odb locking 
    Reply  15 Sep 2004, Konstantin Olchanski, , midas odb locking 
       Reply  16 Sep 2004, Stefan Ritt, , midas odb locking 
Message ID: 40     Entry time: 31 Aug 2004     Reply to this: 41
Author: Konstantin Olchanski 
Topic:  
Subject: midas odb locking 
One of our experiments is suffering from periodic ODB corruption and I
suspected that there might be a problem with ODB locking. In the last few
days, I finally had time to read the ODB locking code, to write a little
test program and to play with ODB. This is what I found:

1) ODB locking appears to be sound, my test program failed to find any
locking flaws, except for a big problem, described below. Please read on.

2) ODB locking is "unfair". A program "while (1) { lock(); do_stuff();
unlock(); /* no sleep here */ }" would lock out other users of ODB
(including odbedit) for seconds and minutes at a time. I see this as a flaw
in the semop() implementation in the Linux kernel and I cannot think of an
easy way to fix it in our code. (I tested only on RHL9 2.4.20-31.9smp on a
dual CPU machine. 2.6 kernels may work better).

3) presently, we use an infinite timeout waiting for the ODB lock. I suggest
we set the timeout to, say, 5 minutes, to protect against dead (or live)
locks that we saw a few times here at TRIUMF- every ODB client would hang
without any error messages or explanations forever waiting for the ODB lock
that is held by some rogue ODB client stuck in an infinite loop in corrupted
ODB.

4) while reading the locking code in db_{lock,unlock}_database(), I thought
that there is a race condition against the "lock_cnt" variable, until I
realized that this variable is local and there is no race condition. I would
like to comment this in the code?

5) I found a failure mode where db_close_database() erroneously deletes the
lock semaphore. Once the semaphore is deleted, ODB locking silently fails
(in db_lock_database() we do not check for success status of
mutex_wait_for()) and remaining ODB clients operate without locking protection.

This failure happens after ODB undercounts active clients after losing track
of clients removed by "idle timeouts" (and by other checks?). At some point,
db_close_database() decides that there are no more clients left, attempts to
delete the shared memory (this fails because there are still active clients
attached) and deletes the lock semaphore. Afterwards, the remaining "lost"
active clients operate without lock protection. This would tend to happen
while shutting down all clients, a time when they all rush-in to delete
themselves from "/system/clients", unsuring ODB corruption.

A quick solution I just coded would not work for mmap()-based shared memory
(I destroy the lock semaphore after the ODB shared memory is destroyed) as
this relies on "shm_nattch" counting feature of System-V shared memories,
absent in the mmap() based shared memories. Since the Windows implementation
uses mmap(), my "solution" is an obvious no-go. Alternatives would be to add
a second semaphore, just for counting active ODB clients (kluncky); or never
delete the semaphore in the first place (dirty, and how does one clear it if
it gets stuck in the locked state?).

For now, I would like to add a check to ss_mutex_waitfor() call in
db_lock_database() and crash if we can't get the mutex. Returning an error
code would be cleaner, but would not work because nobody checks the return
status of db_lock_database(). If can't get the mutex for (say) 5 minutes, I
think we should crash, too- something is very wrong and it is pointless to
continue waiting.

K.O.
ELOG V3.1.4-2e1708b5