Back Midas Rome Roody Rootana
  Midas DAQ System, Page 12 of 146  Not logged in ELOG logo
ID Date Author Topic Subjectdown
  256   18 May 2006 Konstantin OlchanskiBug Fixremoved a few "//" comments to fix compilation on VxWorks
Our VxWorks C compiler (gcc-2.8-something) does not like the "//" comments. Luckily, on VxWorks, we 
only compile a small subset of midas, so there is no point in banning all "//" comments. But I did have to 
convert a couple of them to /* commens */ in odb.c to make it compile. Changes to odb.c commited. K.O.
  1435   10 Jan 2019 Konstantin OlchanskiInforemoval of cm_watchdog()
cm_watchdog() has been removed from the latest midas sources. The watchdog functions performed by cm_watchdog() were 
moved to cm_yield() - those are - maintaining odb and event buffer "last active" timestamps and checking for and removing of 
timed-out clients.

Those who write midas programs should ensure that they call cm_yield() at least every 1-2 seconds (for the normal 10 second 
timeout settings). As always, before calling potentially time consuming operations, such as accessing slow responding 
hardware, one should increase or disable their watchdog timeout.

Those who write midas programs that do not use cm_yield() (i.e. the mserver), should call cm_periodic_tasks() instead with 
the same period as described for cm_yield() above.

Removal of cm_watchdog() solves many problems in the midas code base:
- firing of cm_watchdog() at random times in random places makes it difficult to do static code path analysis (call paths, etc) - 
this is needed to ensure correct multithread locking, etc
- ditto caused trouble with multithread locking when cm_watchdog() fires *inside* the pthread mutex locking library itself 
causing mutexes to be in an inconsistent state. This had to be kludged against in the ODB multithread locks - now this kludge 
can be removed.
- many non-midas codes in experiment frontends, etc was not expecting and did not correctly handle firing of the 
cm_watchdog() SIGALARM at random times (i.e. select() with timeout returned too soon, etc).
- today, UNIX signals are pretty obscure, best avoided. (i.e. interaction of signals and threads is not super will defined, etc).

commits up to (and including) plus merge of branch feature/remove_cm_watchdog
https://bitbucket.org/tmidas/midas/commits/9f1775d2fc75d0de0b9d4ef1abc7b2fb9bacca28

K.O.
  1439   21 Jan 2019 Konstantin OlchanskiInforemoval of cm_watchdog()
> cm_watchdog() has been removed from the latest midas sources
> Removal of cm_watchdog() solves many problems in the midas code base:

Removal of cm_watchdog() creates new problems:

a) the bm_send_event(BM_WAIT) and bm_receive_event(BM_WAIT) wait for free space and wait for new event do not update the timeouts (need to add a call 
to cm_periodic_tasks())
b) frontends that talk to slow external equipment now die unless they have their timeout adjusted to be longer than the longest equipment operation (they 
were already supposed to do this, but...)
c) mhttpd sometimes dies from from an odb timeout (with the default 10 sec timeout).

As one solution, we may bring an automatic cm_watchdog() back, but running from a thread instead of from SIGALARM.

K.O.
  1442   24 Jan 2019 Konstantin OlchanskiInforemoval of cm_watchdog()
> > cm_watchdog() has been removed from the latest midas sources
> > Removal of cm_watchdog() solves many problems in the midas code base:
> Removal of cm_watchdog() creates new problems:
> a) the bm_send_event(BM_WAIT) and bm_receive_event(BM_WAIT)
> b) frontends that talk to slow external equipment
> c) mhttpd sometimes dies from from an odb timeout (with the default 10 sec timeout).

The watchdog is back, in a "light" form. Added:

- cm_watchdog_thread() - runs every 2 seconds and updates the timestamps on ODB and all open event buffers (SYSMSG, SYSTEM, etc).
- cm_start_watchdog_thread() - added to mfe.c and mhttpd - so user frontends work the same as before cm_watchdog() removal
- cm_stop_watchdog_thread() - added to cm_disconnect_experiment() to avoid leaving the thread running after we closed odb and all event buffers.

As before, the watchdog only runs on locally attached midas programs. For programs attached remotely via the mserver, the mserver handles the watchdog functions.

This new light-weight watchdog thread only updates the timestamps, it does not check and remove dead clients, it does not check the alarms. These functions are now performed 
by cm_yield() and cm_periodic_tasks(). At least some program in an experiment should call them periodically. (normally, at least mlogger and mhttpd will do that).

Programs that accidentally relied on SIGALRM firing at 1Hz may still be affected - i.e. with the old cm_watchdog(), ::sleep(1000) will only sleep for 1 second (interrupted by 
SIGALARM), now it will sleep for the full 1000 seconds. Other syscalls, i.e. select(), are similarly affected.

For now, I think only mfe.c frontends and mhttpd need the watchdog thread. With luck all the other midas programs (mlogger, mdump, etc) will run fine without it.

K.O.
  1159   05 Feb 2016 Thomas LindnerSuggestionreducing sleep time in mhttpd main loop (for sequencer)
There were some complaints that the MIDAS sequencer was slow.  Specifically, the
complaint was that even lines in the sequence that didn't do any (like COMMENT
commands) tooks > 100ms to execute.  These slow sequencer steps could be a
little annoying if a script had to change a large number of ODB variables before
starting.

I tested this a little using a trivial sequence; note that I did all tests using
mhttpd with mongoose enabled on a newer macbook pro.  I found that with the
mongoose server each line in a sequencer script was taking ~100ms.  This is
consistent with the loop in the main thread, which is only doing a cm_yield and
a sleep:

   while (!_abort) {
      status = ss_mutex_wait_for(request_mutex, 0);
      status = cm_yield(0);
      if (status == RPC_SHUTDOWN)
         break;
      sequencer();
      status = ss_mutex_release(request_mutex);
      ss_sleep(100);
   }

I tested reducing the sleep to 20ms.  As expected, this made the sequencer more
zippy, able to execute ~50 commands per second.

I tried to think what would be downsides to making this change.  I think that
the main web communication should not be affected, because that communication is
all handled by the separate mongoose thread.

I checked how much extra CPU was used if the sleep was reduced from 100ms to
20ms.  I found that when a sequence was not running the CPU increased from 0% to
0.2% with my change.  When a sequence was running the CPU increased from 0.8% to
4% with my change.  4% is a little high, though I'd say still reasonable.  I
found that most of the CPU usage was occuring because every call to
'sequencer()' resulted in a call to db_set_record("/Sequencer/State"...).  I
guess that making that call 50 times causes the somewhat heavy CPU usage.

I would argue that it would still be worth making that change, so that the
sequencer can be more zippy.
  1160   05 Feb 2016 Thomas LindnerSuggestionreducing sleep time in mhttpd main loop (for sequencer)
> There were some complaints that the MIDAS sequencer was slow.  Specifically, the
> complaint was that even lines in the sequence that didn't do any (like COMMENT
> commands) tooks > 100ms to execute.  These slow sequencer steps could be a
> little annoying if a script had to change a large number of ODB variables before
> starting.
> ...
> I checked how much extra CPU was used if the sleep was reduced from 100ms to
> 20ms.  I found that when a sequence was not running the CPU increased from 0% to
> 0.2% with my change.  When a sequence was running the CPU increased from 0.8% to
> 4% with my change.  4% is a little high, though I'd say still reasonable.  I
> found that most of the CPU usage was occuring because every call to
> 'sequencer()' resulted in a call to db_set_record("/Sequencer/State"...).  I
> guess that making that call 50 times causes the somewhat heavy CPU usage.

One additional point: I think that it would be reasonably simple to reduce this CPU
usage even while a sequence was going on.  I would guess that for many sequences a
lot of time was spent in a 'WAIT SECONDS' command, since you would presumably want
to wait while data was being taken or conditions stabilizing.  I think that if you
are in a 'WAIT SECONDS' command that hasn't been satisfied then there probably isn't
any reason to do the db_set_record at the end of the sequencer() method.
  1161   06 Feb 2016 Stefan RittSuggestionreducing sleep time in mhttpd main loop (for sequencer)
> There were some complaints that the MIDAS sequencer was slow.  Specifically, the
> complaint was that even lines in the sequence that didn't do any (like COMMENT
> commands) tooks > 100ms to execute.  These slow sequencer steps could be a
> little annoying if a script had to change a large number of ODB variables before
> starting.
> 
> I tested this a little using a trivial sequence; note that I did all tests using
> mhttpd with mongoose enabled on a newer macbook pro.  I found that with the
> mongoose server each line in a sequencer script was taking ~100ms.  This is
> consistent with the loop in the main thread, which is only doing a cm_yield and
> a sleep:
> 
>    while (!_abort) {
>       status = ss_mutex_wait_for(request_mutex, 0);
>       status = cm_yield(0);
>       if (status == RPC_SHUTDOWN)
>          break;
>       sequencer();
>       status = ss_mutex_release(request_mutex);
>       ss_sleep(100);
>    }
> 
> I tested reducing the sleep to 20ms.  As expected, this made the sequencer more
> zippy, able to execute ~50 commands per second.
> 
> I tried to think what would be downsides to making this change.  I think that
> the main web communication should not be affected, because that communication is
> all handled by the separate mongoose thread.
> 
> I checked how much extra CPU was used if the sleep was reduced from 100ms to
> 20ms.  I found that when a sequence was not running the CPU increased from 0% to
> 0.2% with my change.  When a sequence was running the CPU increased from 0.8% to
> 4% with my change.  4% is a little high, though I'd say still reasonable.  I
> found that most of the CPU usage was occuring because every call to
> 'sequencer()' resulted in a call to db_set_record("/Sequencer/State"...).  I
> guess that making that call 50 times causes the somewhat heavy CPU usage.
> 
> I would argue that it would still be worth making that change, so that the
> sequencer can be more zippy.

The minimal time slice on most systems is 10 ms, and nothing prevents us from switching to
that. The original 100 ms was more for the fact that you can see the sequencer statements
executed one after the other (with the color bar). But this is more a "debugging" feature which 
we not really need. 

To do it "right" the sequencer would have to _return_ a sleep time. Like if it is in a wait loop (as
most of the time), the sleep time could be close to 1 second, to correctly update the wait
progress bar. If the sequencer executes ODB set statements, the wait time could be zero, so
thousands of statements can be executed in one second. The problem we will then have of course
that the sequencer will block the "request_mutex" almost always, which would prevent the
mongoose server from serving anything. So this should be carefully tested. It could be (on most OS)
that releasing the mutex by the main loop immediately switches to the mongoose thread, which would
make the web server still quite responsive, but I'm not sure about that. So as a first change making
the sleep time 10ms should be fine.

Stefan
  1162   15 Feb 2016 Thomas LindnerSuggestionreducing sleep time in mhttpd main loop (for sequencer)
> > I checked how much extra CPU was used if the sleep was reduced from 100ms to
> > 20ms.  I found that when a sequence was not running the CPU increased from 0% to
> > 0.2% with my change.  When a sequence was running the CPU increased from 0.8% to
> > 4% with my change.  4% is a little high, though I'd say still reasonable.  I
> > found that most of the CPU usage was occuring because every call to
> > 'sequencer()' resulted in a call to db_set_record("/Sequencer/State"...).  I
> > guess that making that call 50 times causes the somewhat heavy CPU usage.
> > 
> > I would argue that it would still be worth making that change, so that the
> > sequencer can be more zippy.
> 
> The minimal time slice on most systems is 10 ms, and nothing prevents us from switching to
> that. The original 100 ms was more for the fact that you can see the sequencer statements
> executed one after the other (with the color bar). But this is more a "debugging" feature which 
> we not really need. 

OK, I made this change; sleep is now 10ms on main thread.  Seems to work fine on SL6 and MacOS.

> To do it "right" the sequencer would have to _return_ a sleep time. Like if it is in a wait loop (as
> most of the time), the sleep time could be close to 1 second, to correctly update the wait
> progress bar. If the sequencer executes ODB set statements, the wait time could be zero, so
> thousands of statements can be executed in one second. The problem we will then have of course
> that the sequencer will block the "request_mutex" almost always, which would prevent the
> mongoose server from serving anything. So this should be carefully tested. It could be (on most OS)
> that releasing the mutex by the main loop immediately switches to the mongoose thread, which would
> make the web server still quite responsive, but I'm not sure about that. So as a first change making
> the sleep time 10ms should be fine.

Hmm, yeah, I'm not sure about how to handle reducing the wait time to zero after ODB set commands.

But it does seem like it would be straight-forward to increase the sleep time for waits; I'll look into
a clean way of doing that.
  1163   15 Feb 2016 Stefan RittSuggestionreducing sleep time in mhttpd main loop (for sequencer)
> Hmm, yeah, I'm not sure about how to handle reducing the wait time to zero after ODB set commands.
> 
> But it does seem like it would be straight-forward to increase the sleep time for waits; I'll look into
> a clean way of doing that.

Let's see how your 10 ms work in real life. If we need variable wait times, I can implement this for your without much effort.

Stefan
  1701   23 Sep 2019 Frederik WautersSuggestionrecover daq and hardware safety.

We have encountered a safety issue with our HPGe HV and it's midas frontend. Turning off or changing HV unknowingly has to be avoided at all costs.

 

Current safety protection

We use the DF_REPORT_STATUS flag to give the hardware settings precedence over odb settings. This all takes place in the init.

 

DAQ recovery Issue?

In the setup / development state, we sometimes have to remove the SHM files and reload an odb dump to recover the DAQ. When the FE is running, this can modify hardware settings. E.g. change a voltage

 

Question

Is there a way one can let the frontend know the "load"  function is called in odbedit? Or other suggestions to build in this safety.

 

  1708   27 Sep 2019 Konstantin OlchanskiSuggestionrecover daq and hardware safety.
> We have encountered a safety issue with our HPGe HV and it's midas frontend.

At TRIUMF and other labs the words "safety issue" have very specific meaning and
we tend to follow this guidance: MIDAS is not certified for and is not intended for use with 
safety critical applications as defined here:
https://en.wikipedia.org/wiki/Safety-critical_system

> A safety-critical system ... malfunction may result in ... following outcomes:
> death or serious injury to people
> loss or severe damage to equipment/property
> environmental harm

If this is your case, you should use properly certified software *and hardware*. Safety 
officers at most institutions require certified hardware interlocks and other protections to 
prevent such undesirable outcomes. Use of certified PLCs is sometimes permitted.

But I suspect in your case, there is no "safety issue", you only want to protect some 
valuable but not critical equipment against accidental damage.

In this case, you can probably use midas, but if midas malfunction may result in destroying 
your experiment (i.e. accidentally set wrong voltage on 3000 phototubes), you should also 
have hardware based protections (hardware limits on max/min high voltage). Most HV 
power supplies implement such protections (screw-driver actuated max voltage limits).

If there is danger of destroying your experiment you should also have an independent 
review of your control system to avoid avoidable mistakes and obvious problems.

> Turning off or changing HV unknowingly has to be avoided at all costs

The function of changing high-voltage is implemented in your frontend program. Right in 
the place in this program where you transmit the voltage setting from ODB to the hardware 
is where you implement your protections (validate the voltage range, check that changing 
the voltage is permitted, etc). This protects you against unexpected/incorrect/erroneous
changes in ODB (wrong ODB is loaded, wrong values in ODB, ODB is corrupted, etc).

In addition, it is wise to set software based limits in the HV power supply (software 
controlled max high voltage, software controlled max current, etc). Most HV power supplies 
implement such functions.

To ensure high voltage cannot be changed at the wrong times, you can also implement 
procedural and hardware protections, such as unplug the power supply control connection 
(usually ethernet or serial or usb cable). This will prevent you from monitoring the high 
voltage currents and the only solution is to use a  power supply with a hardware "write 
protect" function (a key needs to be inserted and turned to allow changing anything).

All of this is generic and applies to any controls software, not just MIDAS.

Without at least some of these protections (especially protections in your frontend 
program), the questions you asked about loading ODB are insufficient.

K.O.
  1712   28 Sep 2019 Frederik WautersSuggestionrecover daq and hardware safety.
Dear Konstantin,

So let me retract the term "safety issue" then, it was more a request/question for this type of 
info between the fe and the odb.

We have most of what you mention:
* The HV hardware has current limits
* The Hardware has fixed ramping limits.

same for the software. 

The issue occurs when e.g. one channel can not be turned on and ramp for some temp/specific 
reason, and someone else is working on the daq and reloads the odb for e.g. 1h ago.  

> > We have encountered a safety issue with our HPGe HV and it's midas frontend.
> 
> At TRIUMF and other labs the words "safety issue" have very specific meaning and
> we tend to follow this guidance: MIDAS is not certified for and is not intended for use with 
> safety critical applications as defined here:
> https://en.wikipedia.org/wiki/Safety-critical_system
> 
> > A safety-critical system ... malfunction may result in ... following outcomes:
> > death or serious injury to people
> > loss or severe damage to equipment/property
> > environmental harm
> 
> If this is your case, you should use properly certified software *and hardware*. Safety 
> officers at most institutions require certified hardware interlocks and other protections to 
> prevent such undesirable outcomes. Use of certified PLCs is sometimes permitted.
> 
> But I suspect in your case, there is no "safety issue", you only want to protect some 
> valuable but not critical equipment against accidental damage.
> 
> In this case, you can probably use midas, but if midas malfunction may result in destroying 
> your experiment (i.e. accidentally set wrong voltage on 3000 phototubes), you should also 
> have hardware based protections (hardware limits on max/min high voltage). Most HV 
> power supplies implement such protections (screw-driver actuated max voltage limits).
> 
> If there is danger of destroying your experiment you should also have an independent 
> review of your control system to avoid avoidable mistakes and obvious problems.
> 
> > Turning off or changing HV unknowingly has to be avoided at all costs
> 
> The function of changing high-voltage is implemented in your frontend program. Right in 
> the place in this program where you transmit the voltage setting from ODB to the hardware 
> is where you implement your protections (validate the voltage range, check that changing 
> the voltage is permitted, etc). This protects you against unexpected/incorrect/erroneous
> changes in ODB (wrong ODB is loaded, wrong values in ODB, ODB is corrupted, etc).
> 
> In addition, it is wise to set software based limits in the HV power supply (software 
> controlled max high voltage, software controlled max current, etc). Most HV power supplies 
> implement such functions.
> 
> To ensure high voltage cannot be changed at the wrong times, you can also implement 
> procedural and hardware protections, such as unplug the power supply control connection 
> (usually ethernet or serial or usb cable). This will prevent you from monitoring the high 
> voltage currents and the only solution is to use a  power supply with a hardware "write 
> protect" function (a key needs to be inserted and turned to allow changing anything).
> 
> All of this is generic and applies to any controls software, not just MIDAS.
> 
> Without at least some of these protections (especially protections in your frontend 
> program), the questions you asked about loading ODB are insufficient.
> 
> K.O.
  1715   29 Sep 2019 Konstantin OlchanskiSuggestionrecover daq and hardware safety.
> 
> The issue occurs when e.g. one channel can not be turned on and ramp for some temp/specific 
> reason, and someone else is working on the daq and reloads the odb for e.g. 1h ago.  
> 

So you want to ensure that some HV channels are turned off and stay turned off. Yes?

Most effective solution will depend on the consequences of unwanted turning-on of your channels:

- if hardware is destroyed if turned on - I think you should have a hardware lock-out. (unplug the HV cable)
- if hardware malfunctions and will degrade if left turned on for long time (i.e. a hot phototube or sparking wire chamber) - your data 
monitoring software should detect the anomaly (it will show up as a hot channel, dead channel, etc) and the people running the 
experiment will realize the mistake and turn the channel back off. also hardware monitoring (HV currents, etc) should detect this, with 
same effect.
- if collected data becomes useless (the turned-off channel make big noise in all other channels), then same thing, your data 
monitoring should catch it.

The next consideration is what are you protecting against:

a) one person flags channel defective, turns it off, next person knows nothing, turns it back on - you need to work on documentation, 
shift hand-off and other human-level procedures
b) people running experiment load random odb files - same thing, from human-level procedures and documentation it should be made 
clear which odb files are correct and which should not be used
c) software malfunction (not human person) causes data change in odb causes turned-off channel to turn back on
d) hardware malfunction causes turned-off channel to turn back on (HV power supply hardware or firmware malfunctions and decides 
that all channels should be turned on at maximum high voltage)

In the experiments I am most familiar with, problem (b) is avoided by never loading/reloading odb files directly, most/all interaction
with the experiment is done through web pages, and these web pages are carefully coded to be safe against most user mistakes.

Cases (a), (b) and (c) you can protect against by changing the frontend code to refuse to turn on some channels:

int set_hv(int channel, int voltage) {
   if (channel == 35) return COMMAND_REFUSED;
   write_to_hardware(channel, voltage);
   return COMMAND_SUCCESS;
}

But in reality this solution only creates problem (e):

e) people running the experiment start random versions of the frontend program, make random changes to the frontend source code, 
multiple people working on the frontend have their own personal versions/copies of the source code, etc.

This is the worst-case scenario, meaning the experiment lost control of software configuration, and even basic software version 
control tools (like svn or git) are not being used. If your experiment gets that chaotic, all protections are likely to be ineffective - 
documentation will not work (people will ignore post-it notes "do not turn on!"), hardware protections will not work (unplugged cable 
labeled "do not plug in!" will be plugged back in and powered), etc. good luck, then.

K.O.
  1726   15 Oct 2019 Stefan RittSuggestionrecover daq and hardware safety.
There is a not-so-well-known function in the ODB to write protect some keys. You can do

odbedit> chmod 1 /Equipment/HV/Demand

which will write protect your Demand values. You see that by doing

odbedit> ls -ls

where you only see a "R" at the right end instead a "RWD". I haven't tried it yet (so better do a dry run yourself), but that should prevent an odb load to overwrite your demand values. To change the values, put some logic on a custom page to unprotect the 
values, change them, and then protect them again.

Stefan
  2441   22 Oct 2022 Lars MartinSuggestionread_only odbxx?
I really like the concept of the odbxx interface.
I think it would be a nice feature if one could have a read_only connection, e.g. by declaring a "const midas::odb".
Just for fun I tried if this already works, but the compiler doesn't allow const midas::odb for e.g. the [] operator. I'm guessing this would be non-trivial to implement, but I like the idea of certain Midas clients being able to read the odb without risking corruption.
  2443   24 Oct 2022 Stefan RittSuggestionread_only odbxx?
> I really like the concept of the odbxx interface.
> I think it would be a nice feature if one could have a read_only connection, e.g. by declaring a "const midas::odb".
> Just for fun I tried if this already works, but the compiler doesn't allow const midas::odb for e.g. the [] operator. I'm guessing this would be non-trivial to implement, but I like the idea of certain Midas clients being able to read the odb without risking corruption.

Having a "const midas::odb" probably does not work (at least I would not know how to implement that).

But I could make an internal flag analog to the auto refresh flags. So you would have

    o.set_write_protect(true);

to turn on write protection. Would that work for you?

Best,
Stefan
  2444   26 Oct 2022 Lars MartinSuggestionread_only odbxx?
> Having a "const midas::odb" probably does not work (at least I would not know how to implement that).
> 
> But I could make an internal flag analog to the auto refresh flags. So you would have
> 
>     o.set_write_protect(true);
> 
> to turn on write protection. Would that work for you?

Absolutely. Looking at the underlying code I was also at a loss how const would work.
I'm mostly just interested in having small clients that only read from the odb (for whatever reason) without risking corrupting it by messing something 
up in the code, especially since such small clients are almost by definition hacked together quickly on the fly.
  2445   29 Oct 2022 Stefan RittSuggestionread_only odbxx?
Ok, I implemented and committed that. Just call

o.set_write_protect(true)

on a key you don't want to modify. If you do so, an exception gets thrown.

Best,
Stefan
  737   26 Dec 2010 Konstantin OlchanskiBug Reportrace condition and deadlock between ODB lock and SYSMSG lock in cm_msg()
> 
> The only remaining problem when running my script is some kind of deadlock between the ODB and SYSMSG semaphores...
> 


In theory, we understand how programs that use 2 semaphores to protect 2 shared resources can deadlock
if there are mistakes in how locks are used.

For example, consider 2 semaphores A and B and 2 concurrent
subroutines foo() and bar() running at exactly the same time:

foo() { lock(A); lock(B); do stuff; unlock(B); unlock(A); } and
bar() { lock(B); lock(A); do stuff; unlock(A); unlock(B); }

This system will deadlock immediately with foo() taking semaphore A, bar() taking semaphore B,
then foo() waiting for B and bar() waiting for A forever.

This situation can also be described as a race condition where foo() and bar() are racing each
other to get the semaphores, with the result depending on who gets there first
and, in this case, sometimes the result is deadlock.

In this example, the size of the race condition time window is the wall clock time
between actually locking both semaphores in the sequence "lock(X); lock(Y);". While
locking a semaphore is "instantaneous", the actual function lock() takes time to call
and execute, and this time is not fixed - it can change if the CPU takes a hardware
interrupt (quick), a page fault (when we may have to wait until data is read from the swap file)
or a scheduler interrupt (when we are outright stopped for milliseconds while the CPU runs
some other process).

In reality, subroutines foo() and bar() do not run at exactly the same time, so the probability
of deadlock will depend on how often foo() and bar() are executed, the size of the race condition time window,
the number of processes executing foo() and bar(), and the amount of background activity
like swapping, hardware interrupts, etc.

(Also note that on a single-cpu system, we will probably never see a deadlock between foo() and bar()
because they will never be running at the same time. But the deadlock is still there, waiting
for the lucky moment when the scheduler switches from foo() to bar() just at the wrong place).

There is more on deadlocks and stuff written at:
http://en.wikipedia.org/wiki/Deadlock
http://en.wikipedia.org/wiki/Race_condition

In case of MIDAS, the 2 semaphores are the ODB lock and the SYSMSG lock (also remember about locks
for the shared memory event buffers, SYSTEM, etc, but they seem to be unlikely to deadlock).

The function foo() is any ODB function (db_xxx) that locks ODB and then calls cm_msg() (which locks SYSMSG).

The function bar() is cm_msg() which locks SYSMSG and then calls some ODB db_xxx() function which tries to lock ODB.

(This is made more interesting by cm_watchdog() periodically called by alarm(), where we alternately
take SYSMSG (via bm_cleanup) and ODB locks.)

I think this establishes a theoretical possibility for MIDAS to deadlock on the ODB and SYSMSG semaphores.

In practice, I think we almost never see this deadlock because cm_msg() is not called very often, and during normal
operation, is almost never called from inside ODB functions holding the ODB lock - almost all calls to cm_msg from
ODB functions are made to report some kind of problem with the ODB internal structure, something that "never"
happens.

By "luck" I stumbled into this deadlock when doing the "odbedit" fork-bomb torture tests, when high ODB lock
activity is combined with high cm_msg() activity reporting clients starting and stopping, combined with a large
number of MIDAS clients running, starting and stopping.

So a deadlock I see within 1 minute of running the torture test, other lucky people will see after running an experiment
for 1 year, or 1 month, or 1 day, depending.

In theory, this deadlock can be removed by establishing a fixed order of taking locks. There will never be a deadlock
if we always take the SYSMSG lock first, then ask for the ODB lock.

In practice, it means that using cm_msg() while holding an ODB lock is automatically dangerous
and should be avoided if not forbidden.

And it does work. By refactoring a few places in client startup, shutdown and cleanup code, I made the deadlock "go away",
and my test script (posted in my first message) no longer deadlocks, even if I run hundreds of odbedit's at the same time.

Unfortunately,  it is impractical to audit and refactor all of MIDAS to completely remove this problem. MIDAS call graphs
are sufficiently complicated for making manual analysis of lock sequences infeasible and
I expect any automatic lock analysis tool will be defeated by the cm_watchdog() periodic interrupt.

An improvement is possible if we make cm_msg() safe for calling from inside the ODB db_xxx() function. Instead
of immediately sending messages to SYSMSG (requiring a SYSMSG lock), if ODB is locked, cm_msg() could
save the messages in a buffer, which would be flushed when the ODB lock is released. (This does not fix
all the other places that take ODB and SYSMSG locks in arbitrary order, but I think those places are not as
likely to deadlock, compared to cm_msg()).

However, now that I have greatly reduced the probability of deadlock in the client startup/shutdown/cleanup code,
maybe there is no urgency for changing cm_msg() - remember that if we do not call cm_msg() we will never deadlock -
and during normal operation, cm_msg() is almost never called.

Investigation completed, I will now cleanup, retest and commit my changes to midas.c and odb.c. Looking into this
and writing it up was a good intellectual exercise.

P.S. Also remember that there are locks for shared memory event buffers (SYSTEM, etc), but those do not involve
lock inversion leading to deadlock. I think all lock sequences are like this: SYSTEM->ODB, SYSTEM->SYSMSG->ODB,
there are no inverted sequences SYSMSG->SYSTEM or ODB->SYSTEM and the only deadlocking
sequence SYSTEM->ODB->SYSMSG, does not really involve the SYSTEM lock.

K.O.
  2493   01 May 2023 Giovanni MazzitelliBug Reportpython issue with mathplot lib vs odb query
Ciao, 
we have a very strange issue with python lib with client.odb_get("/") function 
when running as midas process and matplotlib is used.


we are developing a remote console by means of sending via kafka producer the odb, 
camera image and pmt waveforms, in the INFN cloud where grafana make available 
data for non expert shifters, as well as sending midas events for online 
reconstruction to the htcondr queue on cloud. The process work perfectly and allow 
use to parallelise to standard midas pipeline for file production, ecc the online 
monitoring and data processing where we have computing resources (our DAQ is 
underground at LNGS). Part of the work will be presented next weak at CHEP
the full code is available at https://github.com/CYGNUS-
RD/middleware/blob/master/dev/event_producer_s3.py
but to get the strange behaviour I report here a test script:

----
def main(verbose=False):
    from matplotlib import pyplot as plt

    import time

    import midas
    import midas.client

    
    client = midas.client.MidasClient("middleware")
    buffer_handle = client.open_event_buffer("SYSTEM",None,1000000000)
    request_id = client.register_event_request(buffer_handle, sampling_type = 2) 
    
    fpath = os.path.dirname(os.path.realpath(sys.argv[0]))
    
    
    while True:
        # 
        odb = client.odb_get("/")
        if verbose:
            print(odb)
        start1 = time.time()
              
        client.communicate(10)
        time.sleep(1)
        

    client.deregister_event_request(buffer_handle, request_id)

    client.disconnect()
----
if I run it as cli interactivity including or not matplotlib the everything si ok. 
As I run it as midas "program" I get: 
-----
Traceback (most recent call last):
  File "/home/standard/daq/middleware/dev/test_midas_error.py", line 48, in 
<module>
    main(verbose=options.verbose)
  File "/home/standard/daq/middleware/dev/test_midas_error.py", line 29, in main
    odb = client.odb_get("/")
  File "/home/standard/packages/midas/python/midas/client.py", line 354, in 
odb_get
    retval = midas.safe_to_json(buf.value, use_ordered_dict=True)
  File "/home/standard/packages/midas/python/midas/__init__.py", line 552, in 
safe_to_json
    return json.loads(decoded, strict=False, 
object_pairs_hook=collections.OrderedDict)
  File "/usr/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: 
line 300 column 26 (char 17535)
----
if I comment out the import of matplotlib every think works perfectly again also 
as midas program. 

it seams that there is a difference between the to way of use the code, and that 
is sufficient the call to matplotlib to corrupt in some way the odb. any ideas?
Attachment 1: Screenshot_2023-05-01_at_09.57.01.png
Screenshot_2023-05-01_at_09.57.01.png
ELOG V3.1.4-2e1708b5