Cm msg deadlock note

From MidasWiki
* https://midas.triumf.ca/elog/Midas/737 - deadlock involving cm_msg()

* https://midas.triumf.ca/elog/Midas/741 - cm_msg() deadlock through cm_watchdog()

Revision as of 09:40, 6 August 2013

Note on race condition and deadlock between ODB lock and SYSMSG lock in cm_msg()

Between December 2010 and February 2011 I identified and fixed a number of race conditions and deadlocks that were severely affecting the T2K/ND280 experiment in Japan. Removing these problems was an important improvement to MIDAS. To preserve the lessons learned and to keep these problems from coming back, I document the relevant information in this wiki.

The whole blow-by-blow account can be read on the MIDAS forum:

From https://midas.triumf.ca/elog/Midas/734:

The following (odb torture) script makes MIDAS very unhappy and eventually causes ODB corruption. I suspect the reason is a race condition between the client creation/destruction code and the watchdog activity (each client periodically runs cm_watchdog() to check whether the other clients are still alive, for O(NxN) total work).

<pre>
#!/usr/bin/perl -w
#$cmd = "odbedit -c \'scl -w\' &";
$cmd = "odbedit -c \'ls -l /system/clients\' &";
for (my $i = 0; $i < 50; $i++) {
    system $cmd;
}
#end
</pre>

(Note: How is this odb torture test relevant? The T2K/ND280 experiment had a large number of MIDAS clients, and they were started and stopped frequently. Any bugs in client creation/removal caused problems very quickly - within days of operation we saw strange MIDAS errors - see the elog thread. The odb torture script helps to expose such problems quickly.)