Notes on the implementation of the MIDAS alarm system.
Alarms are checked inside alarm.c::al_check(). This function is called by
cm_yield() every 10 seconds and by rpc_server_thread(), also every 10 seconds.
For remote midas clients, their al_check() issues an RPC_AL_CHECK RPC call into
the mserver, where rpc_server_dispatch() calls the local al_check().
As result, all alarm checks run inside a process directly attached to the local
midas shared memory (inside a local client or inside an mserver process for a
remote client).
Each and every midas client runs the alarm checks. To prevent race conditions
between different midas clients, access to al_check() is serialized using the
ALARM semaphore.
Inside al_check(), alarms are triggered using al_trigger_alarm(), which in turn
calls al_trigger_class(). Inside al_trigger_class(), the alarm is recorded into
an elog or into midas.log using cm_msg(MTALK).
Special note should be made of the ODB setting "/Alarm/Classes/xxx/System
message interval", which has a surprising effect - after an alarm is recorded
into system messages (using cm_msg(MTALK)), no record is made of any subsequent
alarms until the time interval set by this variable elapses. With default value
of 60 seconds, after one alarm, no more alarms are recorded for 60 seconds.
Also, because all the alarms are checked at the same time, only the first
triggered alarm will be recorded.
As of alarm.c rev 4683, "System message interval" set to 0 ensures that every
alarm is recorded into the midas log file. (In previous revisions, this setting
may still miss some alarms).
There are 3 types of alarms:
1) "program not running" alarms.
These alarms are enabled in ODB by setting "/Programs/ppp/Alarm class". Each
time al_check() runs, every program listed in "/Programs" is tested using
"cm_exist()" and if the program is not running, the time of first failure is
remembered in "/Programs/ppp/First failed".
If the program has not been running for longer than the time set in ODB
"/Programs/ppp/Check interval", an alarm is triggered (if enabled by
"/Programs/ppp/Alarm class" and the program is restarted (if enabled by
"/Programs/ppp/Auto restart").
The "not running" condition is tested every 10 seconds (each time al_check() is
called), but the frequency of "program not running" alarms can be reduced by
increasing the value of "/Alarms/Alarms/ppp/Check interval" (default value 60
seconds). This can be useful if "System message interval" is set to zero.
2) "evaluated" alarms
3) "periodic" alarms
There is nothing surprising in these alarms. Each alarm is checked with a time
period set by "/Alarm/xxx/Check interval". The value of an evaluated alarm is
computed using al_evaluate_condition().
K.O. |