Message ID: 737     Entry time: 26 Dec 2010     In reply to: 736     Reply to this: 738
Author: Konstantin Olchanski 
Topic: Bug Report 
Subject: race condition and deadlock between ODB lock and SYSMSG lock in cm_msg() 
> 
> The only remaining problem when running my script is some kind of deadlock between the ODB and SYSMSG semaphores...
> 


In theory, we understand how programs that use 2 semaphores to protect 2 shared resources can deadlock
if there are mistakes in how locks are used.

For example, consider 2 semaphores A and B and 2 concurrent
subroutines foo() and bar() running at exactly the same time:

foo() { lock(A); lock(B); do stuff; unlock(B); unlock(A); } and
bar() { lock(B); lock(A); do stuff; unlock(A); unlock(B); }

This system will deadlock immediately with foo() taking semaphore A, bar() taking semaphore B,
then foo() waiting for B and bar() waiting for A forever.

This situation can also be described as a race condition, where foo() and bar() race each
other to get the semaphores; the outcome depends on who gets there first,
and sometimes the outcome is a deadlock.
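
To make this concrete, here is a minimal compilable sketch of the same situation using POSIX
threads and mutexes (the pthread mutexes stand in for the semaphores; this is not MIDAS code,
just the textbook lock-inversion pattern):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

void *foo(void *arg) {                /* locks A, then B */
   pthread_mutex_lock(&A);
   pthread_mutex_lock(&B);
   /* do stuff */
   pthread_mutex_unlock(&B);
   pthread_mutex_unlock(&A);
   return NULL;
}

void *bar(void *arg) {                /* locks B, then A: inverted order */
   pthread_mutex_lock(&B);
   pthread_mutex_lock(&A);
   /* do stuff */
   pthread_mutex_unlock(&A);
   pthread_mutex_unlock(&B);
   return NULL;
}

int main() {
   /* run foo() and bar() concurrently over and over; sooner or later foo()
      holds A while bar() holds B and both wait forever */
   for (int i = 0; i < 1000000; i++) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, foo, NULL);
      pthread_create(&t2, NULL, bar, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
   }
   printf("no deadlock this time\n");
   return 0;
}

Compiled with "cc -pthread", the loop may run to completion many times before it ever hangs,
which matches the probabilistic behaviour described below.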

In this example, the size of the race condition time window is the wall clock time
between actually locking the two semaphores in the sequence "lock(X); lock(Y);". While
locking a semaphore is "instantaneous", the lock() function itself takes time to call
and execute, and this time is not fixed: it grows if the CPU takes a hardware
interrupt (quick), a page fault (where we may have to wait until data is read back from the swap file)
or a scheduler interrupt (where we are stopped outright for milliseconds while the CPU runs
some other process).

In reality, subroutines foo() and bar() do not run at exactly the same time, so the probability
of deadlock will depend on how often foo() and bar() are executed, the size of the race condition time window,
the number of processes executing foo() and bar(), and the amount of background activity
like swapping, hardware interrupts, etc.

(Also note that on a single-CPU system, we will probably never see a deadlock between foo() and bar()
because they will never be running at the same time. But the deadlock is still there, waiting
for the unlucky moment when the scheduler switches from foo() to bar() at just the wrong place.)

There is more about deadlocks and race conditions written at:
http://en.wikipedia.org/wiki/Deadlock
http://en.wikipedia.org/wiki/Race_condition

In the case of MIDAS, the 2 semaphores are the ODB lock and the SYSMSG lock (there are also locks
for the shared memory event buffers, SYSTEM, etc., but those seem unlikely to deadlock).

The function foo() is any ODB function (db_xxx) that locks ODB and then calls cm_msg() (which locks SYSMSG).

The function bar() is cm_msg() which locks SYSMSG and then calls some ODB db_xxx() function which tries to lock ODB.

(This is made more interesting by cm_watchdog(), periodically called by alarm(), where we alternately
take the SYSMSG lock (via bm_cleanup) and the ODB lock.)
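
Schematically (with made-up stand-in names, not the real code paths in midas.c and odb.c),
the inverted call pattern looks like this:

#include <pthread.h>

static pthread_mutex_t odb_lock    = PTHREAD_MUTEX_INITIALIZER;   /* stands in for the ODB semaphore    */
static pthread_mutex_t sysmsg_lock = PTHREAD_MUTEX_INITIALIZER;   /* stands in for the SYSMSG semaphore */

static void message(const char *text)    /* plays the role of cm_msg(): SYSMSG first, then ODB */
{
   pthread_mutex_lock(&sysmsg_lock);
   pthread_mutex_lock(&odb_lock);        /* e.g. to read message settings from ODB */
   (void)text;                           /* ... write the message ... */
   pthread_mutex_unlock(&odb_lock);
   pthread_mutex_unlock(&sysmsg_lock);
}

static void odb_access(void)             /* plays the role of a db_xxx() function: ODB first, then SYSMSG */
{
   pthread_mutex_lock(&odb_lock);
   /* ... something looks wrong with the ODB internal structure ... */
   message("odb problem");               /* called while still holding the ODB lock: inverted order */
   pthread_mutex_unlock(&odb_lock);
}

Running odb_access() and message() concurrently from two threads (or, in MIDAS, from two processes
sharing the semaphores) reproduces exactly the foo()/bar() deadlock above.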

I think this establishes a theoretical possibility for MIDAS to deadlock on the ODB and SYSMSG semaphores.

In practice, I think we almost never see this deadlock because cm_msg() is not called very often, and during normal
operation it is almost never called from inside ODB functions holding the ODB lock - almost all calls to cm_msg() from
ODB functions are made to report some kind of problem with the ODB internal structure, something that "never"
happens.

By "luck" I stumbled into this deadlock when doing the "odbedit" fork-bomb torture tests, when high ODB lock
activity is combined with high cm_msg() activity reporting clients starting and stopping, combined with a large
number of MIDAS clients running, starting and stopping.

So a deadlock that I see within 1 minute of running the torture test, other lucky people will only see after running
an experiment for 1 year, or 1 month, or 1 day, depending on these factors.

In theory, this deadlock can be removed by establishing a fixed order of taking locks. There will never be a deadlock
if we always take the SYSMSG lock first, then ask for the ODB lock.

In practice, it means that calling cm_msg() while holding the ODB lock is automatically dangerous
and should be avoided, if not forbidden outright.
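
Continuing the same stand-in sketch from above, the rule translates into never holding the ODB lock
while calling cm_msg(); for example:

static void odb_access_fixed(void)       /* db_xxx() analogue, refactored */
{
   int problem = 0;

   pthread_mutex_lock(&odb_lock);
   /* ... do the ODB work; only note that something is wrong, do not report it yet ... */
   problem = 1;
   pthread_mutex_unlock(&odb_lock);      /* release ODB before reporting */

   if (problem)
      message("odb problem");            /* every two-lock sequence is now SYSMSG -> ODB; no inversion */
}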

And it does work. By refactoring a few places in client startup, shutdown and cleanup code, I made the deadlock "go away",
and my test script (posted in my first message) no longer deadlocks, even if I run hundreds of odbedit's at the same time.

Unfortunately, it is impractical to audit and refactor all of MIDAS to completely remove this problem. MIDAS call graphs
are complicated enough to make manual analysis of lock sequences infeasible, and
I expect any automatic lock-analysis tool would be defeated by the cm_watchdog() periodic interrupt.

An improvement is possible if we make cm_msg() safe to call from inside the ODB db_xxx() functions. Instead
of immediately sending messages to SYSMSG (which requires the SYSMSG lock), if ODB is locked, cm_msg() could
save the messages in a buffer, to be flushed when the ODB lock is released. (This does not fix
all the other places that take the ODB and SYSMSG locks in arbitrary order, but I think those places are not as
likely to deadlock as cm_msg().)
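
A possible shape for such a change, shown here only as a hypothetical sketch (the flag, buffer and
function names are invented for illustration; they are not the actual midas.c code):

#include <stdio.h>
#include <string.h>

#define MAX_DEFERRED 100

static char deferred_msg[MAX_DEFERRED][256];   /* messages waiting until the ODB lock is released */
static int  n_deferred = 0;
static int  odb_locked_by_me = 0;              /* maintained by the (hypothetical) ODB lock/unlock wrappers */

static void send_to_sysmsg(const char *msg)    /* in the real system this is what takes the SYSMSG lock */
{
   printf("SYSMSG: %s\n", msg);
}

void my_cm_msg(const char *msg)                /* cm_msg() analogue */
{
   if (odb_locked_by_me) {
      /* we hold the ODB lock: do not touch SYSMSG now, remember the message instead */
      if (n_deferred < MAX_DEFERRED)
         strncpy(deferred_msg[n_deferred++], msg, sizeof(deferred_msg[0]) - 1);
   } else {
      send_to_sysmsg(msg);
   }
}

void my_db_unlock_database(void)               /* wrapper around the real ODB unlock */
{
   odb_locked_by_me = 0;                       /* the ODB semaphore is released here in the real code */
   for (int i = 0; i < n_deferred; i++)        /* with no locks held it is safe to take SYSMSG */
      send_to_sysmsg(deferred_msg[i]);
   n_deferred = 0;
}

The flush happens after the ODB lock is released, so this code path never asks for the SYSMSG lock
while holding the ODB lock.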

However, now that I have greatly reduced the probability of deadlock in the client startup/shutdown/cleanup code,
maybe there is no urgency for changing cm_msg() - remember that if we do not call cm_msg() we will never deadlock -
and during normal operation, cm_msg() is almost never called.

Investigation completed, I will now clean up, retest and commit my changes to midas.c and odb.c. Looking into this
and writing it up was a good intellectual exercise.

P.S. Also remember that there are locks for the shared memory event buffers (SYSTEM, etc.), but those do not involve
a lock inversion leading to deadlock. I think all lock sequences look like this: SYSTEM->ODB or SYSTEM->SYSMSG->ODB;
there are no inverted sequences SYSMSG->SYSTEM or ODB->SYSTEM, and the only deadlocking
sequence, SYSTEM->ODB->SYSMSG, does not really involve the SYSTEM lock.

K.O.