Cm msg deadlock note

From MidasWiki
Jump to navigation Jump to search

Note on race condition and deadlock between ODB lock and SYSMSG lock in cm_msg()

In December 2010/January/February 2011 I identified and fixed a number of race conditions and deadlocks that were severely affecting the T2K/ND280 experiment in Japan. Removal of these problems was an important improvement to MIDAS. To remember lessons learned and to avoid having these problems come back, I document the relevant information in this wiki.

The whole blow-by-blow account can be read on the MIDAS forum:


The following (odb torture) script makes midas very unhappy and eventually causes odb corruption. I suspect the reason is some kind of race condition collision between client creation and destruction code and the watchdog activity (each client periodically runs cm_watchdog() to check if other clients are still alive, O(NxN) total complexity).

#!/usr/bin/perl -w
#$cmd = "odbedit -c \'scl -w\' &";
$cmd = "odbedit -c \'ls -l /system/clients\' &";
for (my $i=0; $i<50; $i++)
system $cmd;

(Note: How is this odb torture test relevant? The T2K/ND280 experiment had a large number of MIDAS clients and they are started and stopped frequently. Any bugs/problems in client creation/removal cause problems very quickly - within days of operation we see strange midas errors - see the elog thread. The odb torture script helps to expose such problems quickly).


> Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING

This is caused by a race condition between client removal in cm_delete_client_info() and cm_exist().

The race condition in cm_exist() works like this:

  • db_enum_key() returns the hkey (pointer to) the next /System/Clients/PID directory
  • the client corresponding to PID is removed, our hkey now refers to a deleted entry
  • db_get_value() tries to use the now stale hkey pointing to a deleted entry, complains about invalid key TID.

Because the offending db_get_value() is called with the "create if not found" argument set to TRUE, there is potential for writing into ODB using a stale hkey, maybe leading to ODB corruption. Other than that, this race condition seems to be benign.

cm_exist() is called from: everybody->cm_yield()->al_check()->cm_exist()

Further analysis:

  • cm_yield() calls al_check() every 10 sec, al_check() calls cm_exist() to check for "program is not running" alarms.
  • in al_check() cm_exist() is called once for each entry in /Programs/xxx, even for programs with no alarms. (Maybe I should change this?)
  • assuming 10 programs are running (10 clients), every 10 seconds, cm_exist() will be called 10 times and inside, will loop over 10 clients, exposing the enum-get race condition 10*10=100 times every 10 seconds. Usually,

ODB /Programs/ has many more entries than there are active clients, further increasing the frequency of exposure of this race condition.



> > The only remaining problem when running my script is some kind of deadlock between the ODB and SYSMSG semaphores... >

In theory, we understand how programs that use 2 semaphores to protect 2 shared resources can deadlock if there are mistakes in how locks are used.

For example, consider 2 semaphores A and B and 2 concurrent subroutines foo() and bar() running at exactly the same time:

  • foo() { lock(A); lock(B); do stuff; unlock(B); unlock(A); } and
  • bar() { lock(B); lock(A); do stuff; unlock(A); unlock(B); }

This system will deadlock immediately with foo() taking semaphore A, bar() taking semaphore B, then foo() waiting for B and bar() waiting for A forever.

This situation can also be described as a race condition where foo() and bar() are racing each other to get the semaphores, with the result depending on who gets there first and, in this case, sometimes the result is deadlock.

In this example, the size of the race condition time window is the wall clock time between actually locking both semaphores in the sequence "lock(X); lock(Y);". While locking a semaphore is "instantaneous", the actual function lock() takes time to call and execute, and this time is not fixed - it can change if the CPU takes a hardware interrupt (quick), a page fault (when we may have to wait until data is read from the swap file) or a scheduler interrupt (when we are outright stopped for milliseconds while the CPU runs some other process).

In reality, subroutines foo() and bar() do not run at exactly the same time, so the probability of deadlock will depend on how often foo() and bar() are executed, the size of the race condition time window, the number of processes executing foo() and bar(), and the amount of background activity like swapping, hardware interrupts, etc.

(Also note that on a single-cpu system, we will probably never see a deadlock between foo() and bar() because they will never be running at the same time. But the deadlock is still there, waiting for the lucky moment when the scheduler switches from foo() to bar() just at the wrong place).

There is more on deadlocks and stuff written at:

In case of MIDAS, the 2 semaphores are the ODB lock and the SYSMSG lock (also remember about locks for the shared memory event buffers, SYSTEM, etc, but they seem to be unlikely to deadlock).

The function foo() is any ODB function (db_xxx) that locks ODB and then calls cm_msg() (which locks SYSMSG).

The function bar() is cm_msg() which locks SYSMSG and then calls some ODB db_xxx() function which tries to lock ODB.

(This is made more interesting by cm_watchdog() periodically called by alarm(), where we alternately take SYSMSG (via bm_cleanup) and ODB locks.)

I think this establishes a theoretical possibility for MIDAS to deadlock on the ODB and SYSMSG semaphores.

In practice, I think we almost never see this deadlock because cm_msg() is not called very often, and during normal operation, is almost never called from inside ODB functions holding the ODB lock - almost all calls to cm_msg from ODB functions are made to report some kind of problem with the ODB internal structure, something that "never" happens.

By "luck" I stumbled into this deadlock when doing the "odbedit" fork-bomb torture tests, when high ODB lock activity is combined with high cm_msg() activity reporting clients starting and stopping, combined with a large number of MIDAS clients running, starting and stopping.

So a deadlock I see within 1 minute of running the torture test, other lucky people will see after running an experiment for 1 year, or 1 month, or 1 day, depending.

In theory, this deadlock can be removed by establishing a fixed order of taking locks. There will never be a deadlock if we always take the SYSMSG lock first, then ask for the ODB lock.

In practice, it means that using cm_msg() while holding an ODB lock is automatically dangerous and should be avoided if not forbidden.

And it does work. By refactoring a few places in client startup, shutdown and cleanup code, I made the deadlock "go away", and my test script (posted in my first message) no longer deadlocks, even if I run hundreds of odbedit's at the same time.

Unfortunately, it is impractical to audit and refactor all of MIDAS to completely remove this problem. MIDAS call graphs are sufficiently complicated for making manual analysis of lock sequences infeasible and I expect any automatic lock analysis tool will be defeated by the cm_watchdog() periodic interrupt.

An improvement is possible if we make cm_msg() safe for calling from inside the ODB db_xxx() function. Instead of immediately sending messages to SYSMSG (requiring a SYSMSG lock), if ODB is locked, cm_msg() could save the messages in a buffer, which would be flushed when the ODB lock is released. (This does not fix all the other places that take ODB and SYSMSG locks in arbitrary order, but I think those places are not as likely to deadlock, compared to cm_msg()).

However, now that I have greatly reduced the probability of deadlock in the client startup/shutdown/cleanup code, maybe there is no urgency for changing cm_msg() - remember that if we do not call cm_msg() we will never deadlock - and during normal operation, cm_msg() is almost never called.

Investigation completed, I will now cleanup, retest and commit my changes to midas.c and odb.c. Looking into this and writing it up was a good intellectual exercise.

P.S. Also remember that there are locks for shared memory event buffers (SYSTEM, etc), but those do not involve lock inversion leading to deadlock. I think all lock sequences are like this: SYSTEM->ODB, SYSTEM->SYSMSG->ODB, there are no inverted sequences SYSMSG->SYSTEM or ODB->SYSTEM and the only deadlocking sequence SYSTEM->ODB->SYSMSG, does not really involve the SYSTEM lock.



A "nested" or "recursive" lock is a special type of lock that permits a lock holder to lock the same resources again and again, without deadlocking on itself. They are very useful, but tricky to implement because most system lock primitives (SYSV semaphores, POSIX mutexes, etc) do not permit nested locks, so all the logic for "yes, I am the holder of the lock, yes, I can go ahead without taking it again" (plus the reverse on unlocking) has to be done "by hand". As ever, if implemented wrong or used wrong, Bad Things happen. Many people dislike nested locks because of the added complexity, but realistically, it is impossible to build a system that does not require nested locking at least somewhere.

MIDAS lock primitives - ss_semaphore_wait_for(), db_lock_database() and bm_lock_buffer() implement a type of nested locks.

ODB locks implemented in db_lock_database() fully support nested (recursive) locking and this feature is heavily used by the ODB library. Many ODB db_xxx() functions take the ODB lock, do something, then call another ODB function that also takes the ODB lock recursively. This works well.

Unfortunately, the ODB nested lock implementation is NOT thread-safe. (Unless one is connected through the mserver, in which case, db_xxx() functions ARE thread-safe because all ODB access is serialized by the mserver RPC mutex).

Event buffer locks implemented in bm_lock_buffer() rely on ss_semaphore_xxx() to provide nested locking.

ss_semaphore_wait_for() uses SYSV semaphores, which do not provide nested locking, except when called from cm_watchdog(). (keep reading).

Because bm_lock_buffer() does not implement nested locking, use of cm_msg() in buffer management code will lead to self-deadlock, as shown in the following stack trace, where bm_cleanup() is working on the SYSMSG buffer, locked it, then called cm_msg() which is now waiting on the SYSMSG lock, which we are holding ourselves.

(gdb) where
#0  0x00007fff87274e9e in semop ()
#1  0x0000000100024075 in ss_semaphore_wait_for (semaphore_handle=1179654, timeout=300000) at src/system.c:2280
#2  0x0000000100015292 in bm_lock_buffer (buffer_handle=<value temporarily unavailable, due to optimizations>) at src/midas.c:5386
#3  0x000000010000df97 in bm_send_event (buffer_handle=1, source=0x7fff5fbfd430, buf_size=<value temporarily unavailable, due to optimizations>, 
async_flag=0) at src/midas.c:6484
#4  0x000000010000e6f5 in cm_msg (message_type=2, filename=<value temporarily unavailable, due to optimizations>, line=4226, routine=0x10004559f 
"bm_cleanup", format=0x100045550 "Client '%s' on buffer '%s' removed by %s because process pid %d does not exist") at src/midas.c:722
#5  0x000000010001553c in bm_cleanup_buffer_locked (i=<value temporarily unavailable, due to optimizations>, who=0x100045f42 "bm_open_buffer", 
actual_time=869425784) at src/midas.c:4226
#6  0x00000001000167ee in bm_cleanup (who=0x100045f42 "bm_open_buffer", actual_time=869425784, wrong_interval=0) at src/midas.c:4286
#7  0x000000010001ae27 in bm_open_buffer (buffer_name=<value temporarily unavailable, due to optimizations>, buffer_size=100000, 
buffer_handle=0x10006e9ac) at src/midas.c:4550
#8  0x000000010001ae90 in cm_msg_register (func=0x100000c60 <process_message>) at src/midas.c:895
#9  0x0000000100009a13 in main (argc=3, argv=0x7fff5fbff3d8) at src/odbedit.c:2790

This example deadlock is not a normal code path - I accidentally exposed this deadlock sequence by adding some extra locking.

But in normal use, cm_msg() is called quite often from cm_watchdog() and as protection against this type of deadlock, MIDAS ss_semaphore_xxx() has a special case that permits one level of nesting for locks called by code executed from cm_watchdog(). This is a very clever implementation of partial nested locking.

So again, we are running into problems with cm_msg() - logically it should be at the very bottom of the system hierarchy - everybody calls it from their most delicate places, while holding various locks, etc - but instead, cm_msg() call the whole MIDAS system all over again - it calls ODB functions, event buffer functions, etc - mostly to open and to write into the SYSMSG buffer.

If you are reading this, I hope you are getting a better idea of the difference between textbook systems and systems that are used in the field to get some work done.



Found one more deadlock between ODB and SYSMSG semaphores, this time through cm_watchdog():

If cm_watchdog somehow runs while we are holding the ODB semaphore, it will eventually try to lock SYSMSG (through bm_cleanup & co) in violation of our semaphore locking order. If at the same time another application tries to lock stuff using the correct order (SYSMSG first, ODB last), the two programs will deadlock (wait for each other forever). I presently have two copies of gdb attached to two copies of odbedit waiting for each other in a deadlock through this cm_watchdog scenario...


Over the last couple of days I made a series of commits to odb.c and midas.c to implement a buffer-based cm_msg() and fix the latest deadlock problem, also to help with the race conditions in client creation and cleanup.


Problems fixed as of svn rev 4976.

One down side of the implemented solution using buffered cm_msg() is reduced cm_msg() performance - messages can be delayed by quite a bit and if a program does cm_msg() and dies right away, one may never see this last message.