Cm msg deadlock note


Note on a race condition and deadlock between the ODB lock and the SYSMSG lock in cm_msg()

Between December 2010 and February 2011 I identified and fixed a number of race conditions and deadlocks that were severely affecting the T2K/ND280 experiment in Japan. Removing these problems was an important improvement to MIDAS. To preserve the lessons learned and to avoid having these problems come back, I document the relevant information in this wiki.

The whole blow-by-blow account can be read on the MIDAS forum:

From https://midas.triumf.ca/elog/Midas/734

The following (odb torture) script makes MIDAS very unhappy and eventually causes ODB corruption. I suspect the reason is some kind of race condition between the client creation/destruction code and the watchdog activity (each client periodically runs cm_watchdog() to check whether the other clients are still alive, O(NxN) total complexity).

#!/usr/bin/perl -w
use strict;
# ODB torture test: spawn many concurrent odbedit clients to stress
# MIDAS client creation and removal.
#my $cmd = "odbedit -c \'scl -w\' &";
my $cmd = "odbedit -c \'ls -l /system/clients\' &";
for (my $i = 0; $i < 50; $i++)
{
    system $cmd;
}
#end

(Note: How is this ODB torture test relevant? The T2K/ND280 experiment had a large number of MIDAS clients that were started and stopped frequently. Any bugs or problems in client creation/removal caused trouble very quickly - within days of operation we would see strange MIDAS errors - see the elog thread. The ODB torture script helps to expose such problems quickly.)

From https://midas.triumf.ca/elog/Midas/735

> Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING

This is caused by a race condition between client removal in cm_delete_client_info() and cm_exist().

The race condition in cm_exist() works like this:

  • db_enum_key() returns the hkey of (a pointer to) the next /System/Clients/PID directory
  • the client corresponding to that PID is removed, so our hkey now refers to a deleted entry
  • db_get_value() tries to use the now-stale hkey pointing to the deleted entry and complains about an invalid key TID.

Because the offending db_get_value() is called with the "create if not found" argument set to TRUE, there is potential for writing into ODB using a stale hkey, maybe leading to ODB corruption. Other than that, this race condition seems to be benign.
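
The enumerate-then-dereference pattern looks roughly like this (a simplified sketch of the cm_exist() loop, not the actual midas.c code; db_enum_key() and db_get_value() are real MIDAS ODB functions, but the argument lists here are written from memory and should be treated as approximate):

/* Simplified sketch of the cm_exist() client scan, not actual midas.c code. */
#include "midas.h"   /* assumed to provide HNDLE, INT, NAME_LENGTH, TID_STRING, DB_SUCCESS */

static void scan_clients_sketch(HNDLE hDB, HNDLE hKeyClients)
{
   for (INT i = 0; ; i++) {
      HNDLE hSubkey;
      if (db_enum_key(hDB, hKeyClients, i, &hSubkey) != DB_SUCCESS)
         break;                     /* no more /System/Clients entries */

      /* RACE WINDOW: between db_enum_key() and db_get_value(), another
         process can run cm_delete_client_info() and remove this client,
         leaving hSubkey pointing at a deleted ODB entry. */

      char name[NAME_LENGTH];
      INT  size = sizeof(name);
      /* "create if not found" == TRUE: with a stale hSubkey this could
         write into deleted ODB space, i.e. possible ODB corruption. */
      db_get_value(hDB, hSubkey, "Name", name, &size, TID_STRING, TRUE);
   }
}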

cm_exist() is called from: everybody->cm_yield()->al_check()->cm_exist()

Further analysis:

  • cm_yield() calls al_check() every 10 sec, al_check() calls cm_exist() to check for "program is not running" alarms.
  • in al_check() cm_exist() is called once for each entry in /Programs/xxx, even for programs with no alarms. (Maybe I should change this?)
  • assuming 10 programs are running (10 clients), every 10 seconds cm_exist() will be called 10 times, and inside each call it will loop over the 10 clients, exposing the enum-get race condition 10*10 = 100 times every 10 seconds. Usually, ODB /Programs/ has many more entries than there are active clients, further increasing how often this race condition is exposed.

K.O.

From https://midas.triumf.ca/elog/Midas/737

> > The only remaining problem when running my script is some kind of deadlock between the ODB and SYSMSG semaphores...

In theory, we understand how programs that use 2 semaphores to protect 2 shared resources can deadlock if there are mistakes in how locks are used.

For example, consider 2 semaphores A and B and 2 concurrent subroutines foo() and bar() running at exactly the same time:

  • foo() { lock(A); lock(B); do stuff; unlock(B); unlock(A); } and
  • bar() { lock(B); lock(A); do stuff; unlock(A); unlock(B); }

This system will deadlock immediately with foo() taking semaphore A, bar() taking semaphore B, then foo() waiting for B and bar() waiting for A forever.
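
This textbook inversion is easy to reproduce outside of MIDAS. The following self-contained pthread program (plain C, not MIDAS code) implements foo() and bar() exactly as above; because the race window is only the short time between the two lock calls, it loops many times, and it will typically hang sooner or later with foo() holding A and bar() holding B. If it finishes, the scheduler simply never interleaved the two lock calls at the wrong moment.

/* Demonstration of the foo()/bar() lock-order deadlock.
   Build with: cc -pthread deadlock_demo.c                      */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

static void *foo(void *arg)
{
   (void)arg;
   pthread_mutex_lock(&A);      /* lock order: A then B */
   pthread_mutex_lock(&B);
   /* do stuff */
   pthread_mutex_unlock(&B);
   pthread_mutex_unlock(&A);
   return NULL;
}

static void *bar(void *arg)
{
   (void)arg;
   pthread_mutex_lock(&B);      /* lock order: B then A - inverted! */
   pthread_mutex_lock(&A);
   /* do stuff */
   pthread_mutex_unlock(&A);
   pthread_mutex_unlock(&B);
   return NULL;
}

int main(void)
{
   for (int i = 0; i < 100000; i++) {   /* race foo() against bar() many times */
      pthread_t t1, t2;
      pthread_create(&t1, NULL, foo, NULL);
      pthread_create(&t2, NULL, bar, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
   }
   printf("survived all iterations without deadlocking (this time)\n");
   return 0;
}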

This situation can also be described as a race condition where foo() and bar() are racing each other to get the semaphores, with the result depending on who gets there first and, in this case, sometimes the result is deadlock.

In this example, the size of the race condition time window is the wall clock time between actually locking both semaphores in the sequence "lock(X); lock(Y);". While locking a semaphore is "instantaneous", the actual function lock() takes time to call and execute, and this time is not fixed - it can change if the CPU takes a hardware interrupt (quick), a page fault (when we may have to wait until data is read from the swap file) or a scheduler interrupt (when we are outright stopped for milliseconds while the CPU runs some other process).

In reality, subroutines foo() and bar() do not run at exactly the same time, so the probability of deadlock will depend on how often foo() and bar() are executed, the size of the race condition time window, the number of processes executing foo() and bar(), and the amount of background activity like swapping, hardware interrupts, etc.

(Also note that on a single-cpu system, we will probably never see a deadlock between foo() and bar() because they will never be running at the same time. But the deadlock is still there, waiting for the lucky moment when the scheduler switches from foo() to bar() just at the wrong place).

There is more about deadlocks and race conditions at http://en.wikipedia.org/wiki/Deadlock and http://en.wikipedia.org/wiki/Race_condition

In the case of MIDAS, the 2 semaphores are the ODB lock and the SYSMSG lock (also remember the locks for the shared memory event buffers, SYSTEM, etc., but those seem unlikely to deadlock).

The function foo() is any ODB function (db_xxx) that locks ODB and then calls cm_msg() (which locks SYSMSG).

The function bar() is cm_msg() which locks SYSMSG and then calls some ODB db_xxx() function which tries to lock ODB.

(This is made more interesting by cm_watchdog() periodically called by alarm(), where we alternately take SYSMSG (via bm_cleanup) and ODB locks.)

I think this establishes a theoretical possibility for MIDAS to deadlock on the ODB and SYSMSG semaphores.

In practice, I think we almost never see this deadlock because cm_msg() is not called very often, and during normal operation, is almost never called from inside ODB functions holding the ODB lock - almost all calls to cm_msg from ODB functions are made to report some kind of problem with the ODB internal structure, something that "never" happens.

By "luck" I stumbled into this deadlock when doing the "odbedit" fork-bomb torture tests, when high ODB lock activity is combined with high cm_msg() activity reporting clients starting and stopping, combined with a large number of MIDAS clients running, starting and stopping.

So a deadlock that I see within 1 minute of running the torture test, others will only see after running an experiment for 1 day, 1 month or 1 year, depending.

In theory, this deadlock can be removed by establishing a fixed order of taking locks. There will never be a deadlock if we always take the SYSMSG lock first, then ask for the ODB lock.

In practice, it means that using cm_msg() while holding an ODB lock is automatically dangerous and should be avoided if not forbidden.
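
A minimal sketch of what the fixed-order rule looks like, using pthread mutexes as hypothetical stand-ins for the SYSMSG and ODB semaphores (this is not the actual midas.c locking code):

/* Hypothetical sketch, not MIDAS code: every code path that may need
   both semaphores acquires them in one fixed order, SYSMSG before ODB,
   so the circular wait of the foo()/bar() example cannot form.        */
#include <pthread.h>

static pthread_mutex_t sysmsg_sem = PTHREAD_MUTEX_INITIALIZER;  /* SYSMSG */
static pthread_mutex_t odb_sem    = PTHREAD_MUTEX_INITIALIZER;  /* ODB    */

static void lock_both(void)
{
   pthread_mutex_lock(&sysmsg_sem);   /* always first  */
   pthread_mutex_lock(&odb_sem);      /* always second */
}

static void unlock_both(void)
{
   pthread_mutex_unlock(&odb_sem);    /* release in reverse order */
   pthread_mutex_unlock(&sysmsg_sem);
}

/* Under this rule, calling cm_msg() (which needs SYSMSG) while already
   holding only the ODB lock would violate the order, which is why the
   text above treats cm_msg() under an ODB lock as dangerous.          */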

And it does work. By refactoring a few places in client startup, shutdown and cleanup code, I made the deadlock "go away", and my test script (posted in my first message) no longer deadlocks, even if I run hundreds of odbedit's at the same time.

Unfortunately, it is impractical to audit and refactor all of MIDAS to completely remove this problem. MIDAS call graphs are complicated enough that manual analysis of lock sequences is infeasible, and I expect any automatic lock analysis tool would be defeated by the cm_watchdog() periodic interrupt.

An improvement is possible if we make cm_msg() safe to call from inside the ODB db_xxx() functions. Instead of immediately sending messages to SYSMSG (which requires the SYSMSG lock), if ODB is locked, cm_msg() could save the messages in a buffer, which would be flushed when the ODB lock is released. (This does not fix all the other places that take the ODB and SYSMSG locks in arbitrary order, but I think those places are not as likely to deadlock as cm_msg().)
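
A sketch of that buffering idea, using hypothetical helpers (odb_is_locked_by_me() and send_to_sysmsg() are stand-ins invented here, not real MIDAS functions; this is not the actual cm_msg() implementation):

/* Hypothetical sketch of the deferred-message idea, not actual MIDAS code.
   If the ODB lock is held, the message is queued instead of taking the
   SYSMSG lock; the queue is flushed after the ODB lock is released.      */
#include <string.h>

#define MSG_BUF_SIZE  16
#define MSG_MAX_LEN   256

static char msg_buf[MSG_BUF_SIZE][MSG_MAX_LEN];
static int  msg_buf_count = 0;

extern int  odb_is_locked_by_me(void);        /* assumed helper                 */
extern void send_to_sysmsg(const char *msg);  /* assumed helper: takes SYSMSG   */

void cm_msg_sketch(const char *msg)
{
   if (odb_is_locked_by_me()) {
      /* Cannot safely take SYSMSG while holding ODB: queue the message. */
      if (msg_buf_count < MSG_BUF_SIZE) {
         strncpy(msg_buf[msg_buf_count], msg, MSG_MAX_LEN - 1);
         msg_buf[msg_buf_count][MSG_MAX_LEN - 1] = '\0';
         msg_buf_count++;
      }
      return;
   }
   send_to_sysmsg(msg);
}

/* Called from the ODB unlock path, after the ODB lock has been released. */
void cm_msg_flush_sketch(void)
{
   for (int i = 0; i < msg_buf_count; i++)
      send_to_sysmsg(msg_buf[i]);
   msg_buf_count = 0;
}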

However, now that I have greatly reduced the probability of deadlock in the client startup/shutdown/cleanup code, maybe there is no urgency for changing cm_msg() - remember that if we do not call cm_msg() we will never deadlock - and during normal operation, cm_msg() is almost never called.

With the investigation completed, I will now clean up, retest and commit my changes to midas.c and odb.c. Looking into this and writing it up was a good intellectual exercise.

P.S. Also remember that there are locks for the shared memory event buffers (SYSTEM, etc.), but those do not involve a lock inversion leading to deadlock. I think all lock sequences look like SYSTEM->ODB or SYSTEM->SYSMSG->ODB; there are no inverted sequences SYSMSG->SYSTEM or ODB->SYSTEM, and the only deadlocking sequence, SYSTEM->ODB->SYSMSG, does not really involve the SYSTEM lock.

K.O.

From https://midas.triumf.ca/elog/Midas/738
