Back Midas Rome Roody Rootana
  Midas DAQ System  Not logged in ELOG logo
Entry  23 Dec 2010, Konstantin Olchanski, Bug Report, odb corruption, odb race condition? 
    Reply  24 Dec 2010, Konstantin Olchanski, Bug Report, odb corruption, odb race condition? 
       Reply  24 Dec 2010, Konstantin Olchanski, Bug Report, odb corruption, odb race condition? 
          Reply  26 Dec 2010, Konstantin Olchanski, Bug Report, race condition and deadlock between ODB lock and SYSMSG lock in cm_msg() 
             Reply  29 Dec 2010, Konstantin Olchanski, Bug Report, use of nested locks in MIDAS 
          Reply  29 Dec 2010, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition? 
             Reply  11 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition? 
                Reply  15 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition? 
                   Reply  16 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition? 
Message ID: 734     Entry time: 23 Dec 2010     Reply to this: 735
Author: Konstantin Olchanski 
Topic: Bug Report 
Subject: odb corruption, odb race condition? 
The following script makes midas very unhappy and eventually causes odb corruption. I suspect the reason is some kind of race condition collision between client 
creation and destruction code and the watchdog activity (each client periodically runs cm_watchdog() to check if other clients are still alive, O(NxN) total complexity). 
Amongst messages appearing in midas.log:

Thu Dec 23 11:59:08 2010 [ODBEdit28,INFO] Client 'unknown' on buffer 'SYSMSG' removed by bm_open_buffer because client pid 20463 does not exist
Thu Dec 23 11:59:09 2010 [ODBEdit43,INFO] Client 'unknown' on buffer 'SYSMSG' removed by cm_watchdog because client pid 20465 does not exist
Thu Dec 23 12:11:21 2010 [ODBEdit,ERROR] [odb.c:1061:db_open_database,ERROR] Removing client 'ODBEdit11', pid 21536, index 27 because the pid no longer exists
Thu Dec 23 17:06:15 2010 [ODBEdit,ERROR] [odb.c:988:db_open_database,ERROR] maximum number of clients exceeded
Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING

The last message about <"Name" is of type NULL> appears during normal operation of the ND280 DAQ, leading me into these investigations.

Notes:
a) the script runs at most 50 copies of odbedit, never exceeding midas.h MAX_CLIENTS value 64, so one does not expect to see messages about "maximum number of 
clients exceeded"
b) the script runs 50 copies of odbedit in parallel, increasing the likelihood of whatever race condition is causing this. In the ND280 system, likelihood of failure is 
increased by the large number of running clients (10-20-30 clients), each client running periodic cm_watchdog, to collide with new client creation or destruction.
c) in other experiments, we do not see this (ok, we do have midas meltdowns once in a while) because (1) we tend to have fewer clients (reduced frequency of 
cm_watchdog), (2) we tend to not start and stop midas clients too often (reduced frequency of running client creation and destruction). (NB it seems like ND280 people 
tend to run many scripts containing odbedit commands, so they effectively start and stop midas clients more often than usual).


#!/usr/bin/perl -w
#$cmd = "odbedit -c \'scl -w\' &";
$cmd = "odbedit -c \'ls -l /system/clients\' &";
for (my $i=0; $i<50; $i++)
{
system $cmd;
}
#end
ELOG V3.1.4-2e1708b5