Alarm System: Difference between revisions
Revision as of 20:01, 5 January 2018
Introduction
MIDAS provides an alarm system, which is turned off by default. When the alarm system is activated and an alarm condition is detected, the system sends alarm messages, which appear as an alarm banner on the mhttpd status page and as a message in any window running an odbedit client. The alarm system is flexible and can be extensively customized for each experiment using the mhttpd Alarms Page or odbedit.
The alarm system is built-in and part of the main experiment scheduler. This means no separate task is necessary to benefit from the alarm system. Its setup and activation are done through the /Alarms ODB tree. The alarm system includes several other features such as sequencing and control of the experiment. The alarm capabilities are:
- Alarm setting on any ODB variable against a threshold parameter.
- Alarm triggered by evaluated condition
- Selection of Alarm check frequency
- Selection of Alarm trigger frequency
- Customizable alarm scheme; under this scheme multiple choices of alarm type can be selected
- Selection of alarm message destination (to system message log or to elog)
- email or SMS alerts can be sent
- Alarm triggered when a Program is not running
Implementation of the MIDAS Alarm System
The alarm system source code is alarm.c. Alarms are checked inside alarm.c::al_check(). This function is called by cm_yield() every 10 seconds and by rpc_server_thread(), also every 10 seconds. For remote MIDAS clients, their al_check() issues an RPC_AL_CHECK RPC call into the MIDAS server utility mserver, where rpc_server_dispatch() calls the local al_check(). As a result, all alarm checks run inside a process directly attached to the local MIDAS shared memory (inside a local client, or inside an mserver process for a remote client). Every MIDAS client runs the alarm checks. To prevent race conditions between different MIDAS clients, access to al_check() is serialized using the ALARM semaphore. Inside al_check(), alarms are triggered using al_trigger_alarm(), which in turn calls al_trigger_class(). Inside al_trigger_class(), the alarm is recorded into an elog or into midas.log using cm_msg(MTALK).
Special note should be made of the ODB setting system message interval, which has a surprising effect: after an alarm is recorded into system messages (using cm_msg(MTALK)), no record is made of any subsequent alarms until the time interval set by this variable elapses. With the default value of 60 seconds, after one alarm, no more alarms are recorded for 60 seconds. Also, because all the alarms are checked at the same time, only the first triggered alarm will be recorded.
As of alarm.c rev 4683, setting /Alarms/System message interval to 0 ensures that every alarm is recorded into the MIDAS log file. (In previous revisions, some alarms may still be missed with this setting.)
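The effect of the system message interval can be illustrated with a small toy model. This is only a sketch of the behaviour described above, not the code in alarm.c; the class name MessageThrottle and the exact handling of the interval boundary are assumptions made for illustration.

```python
class MessageThrottle:
    """Toy model of the 'System message interval' behaviour described above.

    Hypothetical helper for illustration; it is not part of MIDAS.
    """

    def __init__(self, interval_s):
        self.interval_s = interval_s   # 0 means "record every alarm"
        self.last_recorded = None      # time of the last recorded message

    def record(self, now_s):
        """Return True if an alarm raised at now_s would be recorded."""
        if (self.interval_s > 0 and self.last_recorded is not None
                and now_s - self.last_recorded < self.interval_s):
            return False               # still inside the quiet period
        self.last_recorded = now_s
        return True


throttled = MessageThrottle(interval_s=60)
unthrottled = MessageThrottle(interval_s=0)
alarm_times = [0, 0, 10, 59, 60, 75]   # seconds; two alarms fire together at t=0
recorded_throttled = [throttled.record(t) for t in alarm_times]
recorded_unthrottled = [unthrottled.record(t) for t in alarm_times]
print(recorded_throttled)
print(recorded_unthrottled)
```

With a 60 second interval, only the first of the two simultaneous alarms at t=0 is recorded and the alarms at 10 and 59 seconds are dropped, which is the "only the first" effect noted above. With an interval of 0 every alarm is recorded.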
Alarms structure
The /Alarms ODB tree structure is split into 2 sections:
- "Alarms" which define the condition to be tested. The user can create as many Alarms as desired, but each must be one of the four defined Alarm types.
- "Classes" which define the action to be taken when the alarm occurs. Two Classes (Alarm and Warning) are defined by default. The user can add more Classes as desired.
In order to make the system flexible, each alarm class may perform different actions when an alarm is given. For example, it may
- write a system message (see Write System Message)
- write to the elog (see Write Elog Message)
- stop the run (see Stop run)
- spawn a detached script listed in the ODB variable Execute command. This feature is used when an Alarm triggers Email or SMS alerts (see example).
Alarm Types
The four available Alarm Types are shown in Table 1. They are defined in midas.h. The alarm type is entered into the Type key.
Table 1: Defined Alarm Types
| Alarm Type | Constant | INT value | Explanation |
| Internal alarms | AT_INTERNAL | 1 | Trigger on internal (program) alarm setting through the use of the al_...() functions. |
| Program alarms | AT_PROGRAM | 2 | Triggered on condition of the state of the defined task (i.e. program not running) |
| Evaluated alarms | AT_EVALUATED | 3 | Triggered by ODB value on given arithmetical condition. |
| Periodic alarms | AT_PERIODIC | 4 | Triggered by timeout condition defined in the alarm setting. |
Program Alarm
Program (or rather "Program not running") alarms, when enabled, warn the user when a program is not running.
Program alarms are enabled by setting the ODB key /Programs/<client-name>/Alarm class to a valid Alarm class specified in the /Alarms ODB tree. The first time the alarm is triggered, an /Alarms/Alarms/<client-name> subtree will be created automatically. The program alarm will not be visible in the Alarms Page until the alarm has triggered, and the subtree created.
The alarm system periodically calls al_check(). This causes every client listed in the /Programs ODB tree to be tested using cm_exist() to see if it is running. If the client is not running, the time of first failure is recorded in the ODB key /Programs/<client-name>/First failed.
If the client has not been running for longer than the time set in ODB key /Programs/<client-name>/Check interval, a "Program not running" alarm is triggered (if enabled by Alarm class) and the program is restarted (if enabled by /Programs/<client-name>/Auto restart and a valid Start command is supplied).
The "not running" condition is tested every 10 seconds (each time al_check() is called), but the frequency of Program not running alarms can be reduced by increasing the value of the ODB key /Programs/<client-name>/Check interval (default value 60 seconds). This can be useful if System message interval in the specified alarm class subtree is set to zero.
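The interplay between the 10 second check, First failed and Check interval can be sketched as follows. This is a simplified model under stated assumptions, not the MIDAS implementation: the function check_program and the state dictionary are hypothetical, times are plain seconds, and auto restart is not modelled.

```python
def check_program(running, now_s, state, check_interval_s=60):
    """Toy model of one al_check() pass for a single client.

    state is a dict holding 'first_failed' (None while the client is running).
    Returns True when a "Program not running" alarm would be triggered.
    """
    if running:
        state["first_failed"] = None       # client is alive, clear the failure time
        return False
    if state["first_failed"] is None:
        state["first_failed"] = now_s      # remember when the client was first missed
    # Alarm only once the client has been missing for longer than the interval.
    return now_s - state["first_failed"] > check_interval_s


state = {"first_failed": None}
triggered_at = []
for t in range(0, 100, 10):          # al_check() is described as running every 10 s
    client_running = t < 20          # hypothetical client that dies at t = 20 s
    if check_program(client_running, t, state):
        triggered_at.append(t)
print(triggered_at)
```

In this model the client is first seen missing at 20 seconds, and the alarm is raised on the first check that falls more than 60 seconds later. Increasing check_interval_s delays the alarm accordingly.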
Periodic Alarm
The periodic alarm is activated periodically according to the time in /Programs/<client-name>/Check interval. An example of a periodic alarm is "Demo Periodic" in the example.
Evaluated Alarm
Evaluated alarms require an alarm condition which is entered into the ODB key Condition in the <alarm_name> subtree. The condition may be simply a comparison between any ODB variable and a threshold parameter, e.g.
/Runinfo/Run number > 100
or it may be an evaluated condition. One can write conditions like
/Equipment/HV/Variables/Input[*] < 100
or
/Equipment/HV/Variables/Input[2-3] < 100
to check all values from an array or a certain range. If one array element fulfills the alarm condition, the alarm is triggered. In addition, bit-wise alarm conditions are possible, e.g.
/Equipment/Environment/Variables/Input[0] & 8
The alarm is triggered if bit #3 is set in Input[0].
The value of an evaluated alarm is computed using al_evaluate_condition() in alarm.c.
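To make the condition forms above concrete, here is a minimal sketch of an evaluator that works on a plain dictionary standing in for the ODB. It is not a port of al_evaluate_condition(); the function name evaluate_condition, the regular expression and the restriction to the three operators shown in the examples (<, > and &) are assumptions made for illustration.

```python
import operator
import re

# Only the operators that appear in the examples above are modelled.
_OPS = {
    "<": operator.lt,
    ">": operator.gt,
    "&": lambda a, b: bool(int(a) & int(b)),   # bit-wise test
}

# <path>[<index>] <op> <number>, where the index is optional: n, n-m or *
_COND = re.compile(
    r"^(?P<path>.+?)(?:\[(?P<index>\*|\d+(?:-\d+)?)\])?"
    r"\s*(?P<op><|>|&)\s*(?P<value>-?\d+(?:\.\d+)?)$"
)


def evaluate_condition(condition, odb):
    """Return True if any selected value fulfils the condition."""
    m = _COND.match(condition.strip())
    if m is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    values = odb[m["path"]]
    if not isinstance(values, list):
        values = [values]                      # treat a scalar as a 1-element array
    index = m["index"]
    if index is None or index == "*":
        selected = values                      # whole array
    elif "-" in index:
        first, last = (int(i) for i in index.split("-"))
        selected = values[first:last + 1]      # inclusive range, e.g. [2-3]
    else:
        selected = [values[int(index)]]
    threshold = float(m["value"])
    return any(_OPS[m["op"]](v, threshold) for v in selected)


# Hypothetical ODB contents, for illustration only.
odb = {
    "/Runinfo/Run number": 101,
    "/Equipment/HV/Variables/Input": [150, 120, 90, 200],
    "/Equipment/Environment/Variables/Input": [12, 3],
}
print(evaluate_condition("/Equipment/HV/Variables/Input[2-3] < 100", odb))
```

As in the description above, one matching array element is enough to fulfil the condition, which is why the evaluator uses any() over the selected values.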
Internal Alarm
These are triggered in a program using a call to al_trigger_alarm(). See also description of al_trigger_alarm() sequence above.
There is nothing surprising in these alarms. Each alarm is checked with a time period set by ODB key Check interval in the /Alarms ODB tree.
Alarm triggering Email or SMS alerts
It is possible to have the MIDAS alarm system send email or SMS alerts to cell phones when alarms are triggered. This can be configured by defining an ODB alarm on a critical ODB parameter, e.g.
 /Alarms/Alarms/Liquid Level
 Active                   y
 Triggered                0 (0x0)
 Type                     3 (0x3)
 Check interval          60 (0x3C)
 Checked last    1227690148 (0x492D10A4)
 Time triggered first    (empty)
 Time triggered last     (empty)
 Condition               /Equipment/Environment/Variables/Input[0] < 10
 Alarm Class             Level Alarm
 Alarm Message           Liquid Level is only %s
In this example, the alarm triggers an alarm of class "Level Alarm". This alarm class is defined as follows:
 /Alarms/Classes/Level Alarm
 Write system message    y
 Write Elog message      n
 System message interval 600 (0x258)
 System message last     0 (0x0)
 Execute command         /home/midas/level_alarm '%s'
 Execute interval        1800 (0x708)
 Execute last            0 (0x0)
 Stop run                n
 Display BGColor         red
 Display FGColor         black
The key here is to call a script "level_alarm", which can send emails. Use something like:
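 #!/bin/csh
 # $1 is the alarm message passed in by "Execute command"
 echo $1 | mail -s "Level Alarm" your.name@domain.edu
 # Write a confirmation into the MIDAS system messages
 odbedit -c 'msg 2 level_alarm "Alarm was sent to your.name@domain.edu"'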
The second command just generates a MIDAS system message for confirmation. Most cell phones (depending on the provider) have an email address. If you send an email there, it will be translated into an SMS message.
The script file above can of course be more complicated. A perl script could be used that parses an address list, so that other interested parties can register by adding their email address to that list. The script may also collect some other slow control variables (like pressure, temperature) and combine them into the SMS message.
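A more elaborate script along these lines might look like the sketch below, written in Python rather than perl. The address list format (one address per line, with # comments), the function names and the readings dictionary are all assumptions for illustration. The sketch only builds the message; actually sending it (for example through a local mail command or an SMTP server) is left out.

```python
from email.message import EmailMessage


def parse_address_list(text):
    """Return the addresses in a list file body: one per line, '#' starts a comment."""
    addresses = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if line:
            addresses.append(line)
    return addresses


def build_alert(alarm_text, recipients, readings):
    """Combine the alarm text with extra slow control readings into one message."""
    msg = EmailMessage()
    msg["Subject"] = "Level Alarm"
    msg["To"] = ", ".join(recipients)
    extra = ", ".join(f"{name}={value}" for name, value in readings.items())
    msg.set_content(f"{alarm_text} ({extra})")
    return msg


# Hypothetical address list and readings, for illustration only.
address_file = """
# people who want level alarms
first.user@example.org
second.user@example.org   # SMS gateway address
"""
recipients = parse_address_list(address_file)
alert = build_alert("Liquid Level is only 8", recipients,
                    {"pressure": 1.2, "temperature": 77.4})
print(alert.get_content())
```

Keeping the address list in a separate file means that new recipients can register without anyone editing the script or the ODB.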
For very sensitive systems, having an alarm via SMS may not be sufficient, since the alarm system could be down (e.g. computer crash, network failure). In this case 'negative alarms' can be used. For example, every 30 minutes the system may send an SMS with the current parameter values. If the expected message is not received, it may indicate that something in the MIDAS system is wrong.
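On the receiving side, a 'negative alarm' amounts to a watchdog that complains when the expected report is overdue. The following is a minimal sketch of that check; the function name is_overdue and the grace period are assumptions for illustration, and the check has to run on a machine that is independent of the MIDAS system being watched.

```python
def is_overdue(last_received_s, now_s, period_s=1800, grace_s=300):
    """Return True if the periodic report has been missing for too long.

    period_s is the expected reporting period (30 minutes in the example above);
    grace_s is a hypothetical allowance for delivery delays.
    """
    return now_s - last_received_s > period_s + grace_s


# Last report arrived at t = 0 s.
print(is_overdue(0, 2000))   # within period plus grace
print(is_overdue(0, 2200))   # report is overdue
```

A missing report does not say what went wrong, only that something between the experiment and the recipient has stopped working.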