Back Midas Rome Roody Rootana
  Midas DAQ System  Not logged in ELOG logo
Entry  23 Dec 2010, Konstantin Olchanski, Bug Report, odb corruption, odb race condition? 
    Reply  24 Dec 2010, Konstantin Olchanski, Bug Report, odb corruption, odb race condition? 
       Reply  24 Dec 2010, Konstantin Olchanski, Bug Report, odb corruption, odb race condition? 
          Reply  26 Dec 2010, Konstantin Olchanski, Bug Report, race condition and deadlock between ODB lock and SYSMSG lock in cm_msg() 
             Reply  29 Dec 2010, Konstantin Olchanski, Bug Report, use of nested locks in MIDAS 
          Reply  29 Dec 2010, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition? 
             Reply  11 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition? 
                Reply  15 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition? 
                   Reply  16 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition? 
Message ID: 735     Entry time: 24 Dec 2010     In reply to: 734     Reply to this: 736
Author: Konstantin Olchanski 
Topic: Bug Report 
Subject: odb corruption, odb race condition? 
> Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING

This is caused by a race condition between client removal in cm_delete_client_info() and cm_exist().

The race condition in cm_exist() works like this:
- db_enum_key() returns the hkey (pointer to) the next /System/Clients/PID directory
- the client corresponding to PID is removed, our hkey now refers to a deleted entry
- db_get_value() tries to use the now stale hkey pointing to a deleted entry, complains about invalid key TID.

Because the offending db_get_value() is called with the "create if not found" argument set to TRUE, there is potential
for writing into ODB using a stale hkey, maybe leading to ODB corruption. Other than that, this race condition seems
to be benign.

cm_exist() is called from:
everybody->cm_yield()->al_check()->cm_exist()

Further analysis:
- cm_yield() calls al_check() every 10 sec, al_check() calls cm_exist() to check for "program is not running" alarms.
- in al_check() cm_exist() is called once for each entry in /Programs/xxx, even for programs with no alarms. (Maybe I should change this?)
- assuming 10 programs are running (10 clients), every 10 seconds, cm_exist() will be called 10 times and inside, will loop over 10 clients, exposing the enum-get race condition 10*10=100 times every 10 seconds. Usually, 
ODB /Programs/ has many more entries than there are active clients, further increasing the frequency of exposure of this race condition.

K.O.
ELOG V3.1.4-2e1708b5