Back Midas Rome Roody Rootana
  Midas DAQ System  Not logged in ELOG logo
Message ID: 3017     Entry time: 01 Apr 2025
Author: Konstantin Olchanski 
Topic: Bug Fix 
Subject: ODB and event buffer - release semaphore before abort() and core dump 
There is a long standing problem with ODB and event buffers. If they detect an 
internal data inconsistency and cannot continue running, they call abort() to 
dump core and stop.

Problem is in some code paths, they do this while holding the ODB or event 
buffer semaphore. (Linux kernel automatically releases SYSV semaphores after 
core dump is finished and program holding them is stopped).

If core dump takes longer than 10 seconds (for whatever reason, but we see this 
often enough), all other programs that wait for ODB or event buffer access, will 
also timeout and also crash (with core dump). Result is a core dump storm, at 
the end all MIDAS programs are crashed. (Luckily recovery is easy, simply 
restart everything).

Now I realize that in many situation, we do not need to hold the semaphore while 
dumping core - the content of ODB and event buffer shared memories is not 
important for debugging the crash - and it is safe to release the semaphore 
before calling abort().

This is now implemented for ODB and event buffers. Hopefully core dump storms 
will not happen again.

commit 96369c29deba1752fd3d25bed53e6594773d7e1a
release ODB semaphore before calling abort() to dump core. if core dump takes 
longer than 10 sec all other midas programs will timeout and crash.

commit 2506406813f1e7581572f0d5721d3761b7c8e8dd
unlock event buffer before calling abort() in bm_validate_client_index_locked(), 
refactor bm_get_my_client_locked()


K.O.
ELOG V3.1.4-2e1708b5