Note on race condition and deadlock between ODB lock and SYSMSG lock in cm_msg()
Between December 2010 and February 2011 I identified and fixed a number of race conditions and deadlocks that were severely affecting the T2K/ND280 experiment in Japan. Removal of these problems was an important improvement to MIDAS. To remember the lessons learned and to avoid having these problems come back, I document the relevant information in this wiki.
The whole blow-by-blow account can be read on the MIDAS forum:
* https://midas.triumf.ca/elog/Midas/734 - start of thread
* https://midas.triumf.ca/elog/Midas/737 - deadlock involving cm_msg()
* https://midas.triumf.ca/elog/Midas/741 - cm_msg() deadlock through cm_watchdog()
== From https://midas.triumf.ca/elog/Midas/734 ==
The following (odb torture) script makes midas very unhappy and eventually causes odb corruption. I suspect the reason is some kind of race condition between the client creation and destruction code and the watchdog activity (each client periodically runs cm_watchdog() to check if other clients are still alive, O(NxN) total complexity).
 #!/usr/bin/perl -w
 #$cmd = "odbedit -c \'scl -w\' &";
 $cmd = "odbedit -c \'ls -l /system/clients\' &";
 for (my $i=0; $i<50; $i++) {
    system $cmd;
 }
 #end
(Note: How is this odb torture test relevant? The T2K/ND280 experiment had a large number of MIDAS clients and they are started and stopped frequently. Any bugs/problems in client creation/removal cause problems very quickly - within days of operation we see strange midas errors - see the elog thread. The odb torture script helps to expose such problems quickly).
== From https://midas.triumf.ca/elog/Midas/735 ==
 > Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING
This is caused by a race condition between client removal in cm_delete_client_info() and cm_exist().
The race condition in cm_exist() works like this:
* db_enum_key() returns the hkey (pointer to) the next /System/Clients/PID directory
* the client corresponding to PID is removed, our hkey now refers to a deleted entry
* db_get_value() tries to use the now stale hkey pointing to a deleted entry, complains about invalid key TID.
Because the offending db_get_value() is called with the "create if not found" argument set to TRUE, there is potential for writing into ODB using a stale hkey, maybe leading to ODB corruption. Other than that, this race condition seems to be benign.
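To make the window concrete, here is a minimal sketch of the enumerate-then-read pattern described above, using the MIDAS ODB calls named in the analysis (db_find_key(), db_enum_key(), db_get_value()). This is an assumption about the shape of the cm_exist() scan, not the verbatim MIDAS source, and the helper name scan_clients() is hypothetical:

 #include "midas.h"   /* MIDAS ODB API: HNDLE, db_enum_key(), db_get_value() */
 
 /* Hypothetical sketch of the cm_exist() scan over /System/Clients;
    not the verbatim MIDAS source. */
 static BOOL scan_clients(HNDLE hDB, const char *name)
 {
    HNDLE hKey, hSubkey;
    char client_name[NAME_LENGTH];
    INT i, size, status;
 
    if (db_find_key(hDB, 0, "/System/Clients", &hKey) != DB_SUCCESS)
       return FALSE;
 
    for (i = 0;; i++) {
       status = db_enum_key(hDB, hKey, i, &hSubkey);  /* hkey of next PID subdirectory */
       if (status == DB_NO_MORE_SUBKEYS)
          break;
 
       /* RACE WINDOW: if cm_delete_client_info() removes this client right
          here, hSubkey is now a stale reference to a deleted ODB entry. */
 
       size = sizeof(client_name);
       status = db_get_value(hDB, hSubkey, "Name", client_name, &size,
                             TID_STRING, TRUE);  /* create==TRUE may WRITE
                                                    through the stale hkey */
       if (status == DB_SUCCESS && equal_ustring(client_name, name))
          return TRUE;   /* a client with this name exists */
    }
    return FALSE;
 }

The window between db_enum_key() and db_get_value() is exactly what the odb torture script above hammers by rapidly creating and deleting clients.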
cm_exist() is called from:
 everybody->cm_yield()->al_check()->cm_exist()
Further analysis:
* cm_yield() calls al_check() every 10 sec, al_check() calls cm_exist() to check for "program is not running" alarms.
* in al_check() cm_exist() is called once for each entry in /Programs/xxx, even for programs with no alarms. (Maybe I should change this?)
* assuming 10 programs are running (10 clients), every 10 seconds cm_exist() will be called 10 times and inside will loop over 10 clients, exposing the enum-get race condition 10*10=100 times every 10 seconds. Usually, ODB /Programs/ has many more entries than there are active clients, further increasing the frequency of exposure of this race condition (see the sketch after this list).
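For reference, a minimal sketch of the per-program scan described in the list above, assuming al_check() is structured roughly like this; the helper name check_program_alarms() is hypothetical and this is not the verbatim MIDAS source:

 #include "midas.h"
 
 /* Hypothetical sketch of the /Programs scan inside al_check(); not the
    verbatim MIDAS source. Every 10 sec each entry under /Programs
    triggers one cm_exist() call, and each call loops over all entries
    under /System/Clients. */
 static void check_program_alarms(HNDLE hDB)
 {
    HNDLE hKey, hSubkey;
    KEY key;
    INT i, status;
 
    if (db_find_key(hDB, 0, "/Programs", &hKey) != DB_SUCCESS)
       return;
 
    for (i = 0;; i++) {
       status = db_enum_key(hDB, hKey, i, &hSubkey);
       if (status == DB_NO_MORE_SUBKEYS)
          break;
       db_get_key(hDB, hSubkey, &key);
       /* called even for programs with no alarms defined */
       if (cm_exist(key.name, FALSE) != CM_SUCCESS) {
          /* the "program is not running" alarm would be raised here */
       }
    }
 }

With P entries under /Programs and C clients, this yields P*C stale-hkey windows every 10 seconds, matching the 10*10=100 figure above.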
K.O.