We have several times seen a problem with stuck data buffers. The symptoms are very confusing -
frontends cannot start and instead hang forever in a state that is very hard to kill. Also,
"mdump -s -d -z BUF03" gets stuck for the affected data buffers.
We have identified the source of this problem - the semaphore for the buffer is locked and nobody
will ever unlock it. MIDAS relies on a feature of SYSV semaphores where the OS automatically undoes
the lock when the process holding it exits, so a semaphore should never stay stuck (see man semop,
SEM_UNDO flag).
I think this SEM_UNDO feature is broken in recent Linux kernels and sometimes the semaphore
remains locked after the process that locked it has died. MIDAS is not programmed to deal with this
situation and the stuck semaphore has to be cleared manually.
Here, "BUF03" is used as an example, but we have seen "SYSTEM" and ODB with stuck semaphores, too.
Steps:
a) confirm that we are using SYSV semaphores: "ipcs" should show many semaphores
b) identify the stuck semaphore: "strace mdump -s -d -z BUF03".
c) there will be a large printout, but ultimately you will see repeated entries of
"semtimedop(9633800, {{0, -1, SEM_UNDO}}, 1, {1, 0}^C <unfinished ...>"
d) erase the stuck semaphore: "ipcrm -s 9633800", where the number is the first argument of
semtimedop() in the strace output.
e) try again: "mdump -s -d -z BUF03" should work now.
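Steps b) to d) can be partly scripted. Below is a minimal sketch, not part of MIDAS: the helper
name extract_semids is made up for illustration, and it assumes the strace lines look like the
example in step c). It only prints the candidate semaphore ids and does not remove anything.

```shell
#!/bin/sh
# extract_semids (hypothetical helper): read strace output on stdin and print
# each distinct semaphore id that appears as the first argument of a
# semtimedop() or semop() call using SEM_UNDO.
extract_semids() {
    sed -n \
        -e 's/.*semtimedop(\([0-9][0-9]*\),.*SEM_UNDO.*/\1/p' \
        -e 's/.*semop(\([0-9][0-9]*\),.*SEM_UNDO.*/\1/p' \
        | sort -u
}
```

Possible usage, assuming "timeout" is available to stop the hung mdump after a few seconds:
"timeout 10 strace mdump -s -d -z BUF03 2>&1 | extract_semids". Before running "ipcrm -s" on the
printed id, "ipcs -s -i <id>" can be used to look at the semaphore and check that the process
that last touched it is really gone.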
Ultimately, I think we should switch to POSIX semaphores - they are easier to manage (the strace
and ipcrm dance becomes "rm /dev/shm/deap_BUF03.sem") - but they do not have the SEM_UNDO
feature, so detection of locked and stuck semaphores will have to be done by MIDAS (unless we
can find some library of semaphore functions that already provides such advanced functionality).
K.O.