ELOG Midas

Back Midas Rome Roody Rootana

Midas DAQ System, Page 122 of 137

Not logged in

Find | Login | Help

Full | Summary | Threaded | Hide attachments

Show only new entries

2721 Entries

Goto page Previous 1, 2, 3 ... 121, 122, 123 ... 135, 136, 137 Next

odb "hot link" magic explored

KO wrote:

note 1: I do not completely understand the ss_suspend_xxx() stuff. The best I can tell is it creates a number of udp sockets bound to the local host and at least one udp rpc receive socket ultimately connected to the cm_dispatch_rpc() function.

The ss_suspend_xxx() stuff is indeed the most complicated thing in midas an I have to remind myself always
on how this works. So let me try again:

The basic idea is that for a high performance system, you cannot do the inter-process communication via
polling. That would waste CPU time. Inter-process communication is necessary for for buffer manager
(producer notifies consumer when new events are there), for the RPC mechanism (odbedit tells mlogger to
start a run) or for ODB hot-links. To avoid polling, the inter-process communication works with sockets (UDP
and TCP). This allows to use the select() call, which suspends the calling process until some socket
receives data or a pre-defined time-out expires. This is the only portable method I found which works under
unix and windows (signals are only poorly supported under windows).

So after creating all sockets, ss_suspend() does a select() on these sockets:

_suspend_struct[idx].listen_socket	Server side for any new RPC connection (each client is also a RPC server which gets contacted directly during run transitions for example
_suspend_struct[idx].server_acception.recv_sock	Receive socket (TCP) for any active RPC connection
_suspend_struct[idx].server_acception.event_sock	Receive socket (TCP) for bare events (bypassing RPC layer for performance reasons)
_suspend_struct[idx].server_connection->recv_sock	Outgoing TCP connection to mserver. Used for example for hot-link notifications from mserver
_suspend_struct[idx].ipc_recv_socket	UDP socket for inter-process notification

For each socket there is a dispatch function, which gets called if that socket receives some data. Hope this sheds some light on the guts of that.

Konstantin Olchanski

odb corruption, odb race condition?

The following script makes midas very unhappy and eventually causes odb corruption. I suspect the reason is some kind of race condition collision between client 
creation and destruction code and the watchdog activity (each client periodically runs cm_watchdog() to check if other clients are still alive, O(NxN) total complexity). 
Amongst messages appearing in midas.log:

Thu Dec 23 11:59:08 2010 [ODBEdit28,INFO] Client 'unknown' on buffer 'SYSMSG' removed by bm_open_buffer because client pid 20463 does not exist
Thu Dec 23 11:59:09 2010 [ODBEdit43,INFO] Client 'unknown' on buffer 'SYSMSG' removed by cm_watchdog because client pid 20465 does not exist
Thu Dec 23 12:11:21 2010 [ODBEdit,ERROR] [odb.c:1061:db_open_database,ERROR] Removing client 'ODBEdit11', pid 21536, index 27 because the pid no longer exists
Thu Dec 23 17:06:15 2010 [ODBEdit,ERROR] [odb.c:988:db_open_database,ERROR] maximum number of clients exceeded
Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING

The last message about <"Name" is of type NULL> appears during normal operation of the ND280 DAQ, leading me into these investigations.

Notes:
a) the script runs at most 50 copies of odbedit, never exceeding midas.h MAX_CLIENTS value 64, so one does not expect to see messages about "maximum number of 
clients exceeded"
b) the script runs 50 copies of odbedit in parallel, increasing the likelihood of whatever race condition is causing this. In the ND280 system, likelihood of failure is 
increased by the large number of running clients (10-20-30 clients), each client running periodic cm_watchdog, to collide with new client creation or destruction.
c) in other experiments, we do not see this (ok, we do have midas meltdowns once in a while) because (1) we tend to have fewer clients (reduced frequency of 
cm_watchdog), (2) we tend to not start and stop midas clients too often (reduced frequency of running client creation and destruction). (NB it seems like ND280 people 
tend to run many scripts containing odbedit commands, so they effectively start and stop midas clients more often than usual).


#!/usr/bin/perl -w
#$cmd = "odbedit -c \'scl -w\' &";
$cmd = "odbedit -c \'ls -l /system/clients\' &";
for (my $i=0; $i<50; $i++)
{
system $cmd;
}
#end

Konstantin Olchanski

odb corruption, odb race condition?

> Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING

This is caused by a race condition between client removal in cm_delete_client_info() and cm_exist().

The race condition in cm_exist() works like this:
- db_enum_key() returns the hkey (pointer to) the next /System/Clients/PID directory
- the client corresponding to PID is removed, our hkey now refers to a deleted entry
- db_get_value() tries to use the now stale hkey pointing to a deleted entry, complains about invalid key TID.

Because the offending db_get_value() is called with the "create if not found" argument set to TRUE, there is potential
for writing into ODB using a stale hkey, maybe leading to ODB corruption. Other than that, this race condition seems
to be benign.

cm_exist() is called from:
everybody->cm_yield()->al_check()->cm_exist()

Further analysis:
- cm_yield() calls al_check() every 10 sec, al_check() calls cm_exist() to check for "program is not running" alarms.
- in al_check() cm_exist() is called once for each entry in /Programs/xxx, even for programs with no alarms. (Maybe I should change this?)
- assuming 10 programs are running (10 clients), every 10 seconds, cm_exist() will be called 10 times and inside, will loop over 10 clients, exposing the enum-get race condition 10*10=100 times every 10 seconds. Usually, 
ODB /Programs/ has many more entries than there are active clients, further increasing the frequency of exposure of this race condition.

K.O.

Konstantin Olchanski

odb corruption, odb race condition?

> > Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING
> This is caused by a race condition between client removal in cm_delete_client_info() and cm_exist().
> ... this race condition seems to be benign.

Not so benign - after fixing cm_exist() to check the return value of db_get_value() and calling it without the "create" flag,
a crasher turned up inside db_find_key() called by db_get_value() with these stale hkeys. For invalid keys (not TID_KEY),
it would call db_get_path() and crash.

So after adding a check for valid key types, my test script runs much better - all the major weirdness is gone, I only see
rare messages from db_find_key(), db_get_key() and db_get_value() about invalid key and data types (after all,
I did not fix the underlying race condition).

The only remaining problem when running my script is some kind of deadlock between the ODB and SYSMSG semaphores...

K.O.

Konstantin Olchanski

odb disallow key names that start or end with spaces

while testing the new odb editor, we ran into a number of problems with key names 
that start or end with spaces. we cannot think of any valid use case for such key 
names (subdirectories and variables) and we think they could only have been 
created by mistake. ODB now disallows such names. K.O.

Konstantin Olchanski

odb multithread support repaired

multithreaded access to odb was implemented back in 2013-2014. but recently a bug surfaced - 
there was a race condition in the odb locking code against cm_watchdog(). Somehow this only 
affected the mserver for the DRAGON experiment at TRIUMF. This is now fixed on the branch 
feature/midas-2017-10. (this branch collects all the code that needs additional testing before 
merging into develop and becoming the next release of midas).
K.O.

Konstantin Olchanski

odb needs protection against ctrl-c

Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.

This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
and strange things will happen (including odb corruption).

In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
arrangements to make it happen, but I have seen it happen in normal use when running experiments.

K.O.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
    frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
    frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
    frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
    frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
    frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
    frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
    frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
    frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
    frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
    frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
    frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
    frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
    frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
    frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
alarm.c:310 [opt]
    frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
    frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
    frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
    frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
    frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
    frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
    frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
    frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1

odb needs protection against ctrl-c

Not sure if you realized, but there is a two-stage Ctrl-C handling inside midas. The first time you hit ctrl-c, the handler just sets a flag for the main event loop, so that the program can gracefully exit without trouble. This is 
done inside cm_ctrlc_handler(), which sets _ctrlc_pressed true if called. Then cm_yield() tests this flag and returns RPC_SHUTDOWN if so. I agree not very obvious, maybe we should return a more appropriate status. So the 
main loop must check the return status of cm_yield() and break if it's RPC_SHUTDOWN. The frontend framework mfe.c does this for example in

 while (status != RPC_SHUTDOWN && status != SS_ABORT);

Any use-written program should do the same (well, probably this is nowhere documented).

Now when the program does not exit (e.g. if it's in an infinite loop), then the second ctrl-c creates a hard abort and terminates the program non-gracefully, which as you noticed can lead to undesired results. All the 
semaphores (at least when I implemented it) had a SEM_UNDO flag when obtaining ownership. This means that if the semaphore is locked and the process who owns it terminates (even with a hard kill), then the semaphore 
is released by the OS. This way a crashed program cannot keep the ODB locked for example. Not sure that with all your modifications in the semaphore calls this functionality is still guaranteed.

Stefan

> Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.
> 
> This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
> and strange things will happen (including odb corruption).
> 
> In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
> arrangements to make it happen, but I have seen it happen in normal use when running experiments.
> 
> K.O.
> 
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
>   * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
>     frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
>     frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
>     frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
>     frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
>     frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
>     frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
>     frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
>     frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
>     frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
>     frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
>     frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
>     frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
>     frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
> y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
> 0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
>     frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
>     frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
> alarm.c:310 [opt]
>     frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
>     frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
>     frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
>     frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
>     frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
> odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
>     frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
>     frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
>     frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1

Konstantin Olchanski

odb needs protection against ctrl-c

Commit f81ff3c protects db_lock/unlock, but not any of the other functions. What if we do ctrl-c in the middle
of some odb write operation in the middle of memory allocation, etc.

A sure way to corrupt odb.

Perhaps we should disallow odb access from signal handlers? But we still want to be able to stop midas
programs using ctrl-c, even if the program is in some infinite loop somewhere and is not processing
midas events (no calls to cm_yield(), etc).

Maybe I should change the ctrl-c handler to set a flag for cm_yield() to return SS_EXIT,
and additional ctrl-c do nothing if this flag is already set? (maybe abort() if they do ctrl-c 10 times?).

K.O.

> Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.
> 
> This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
> and strange things will happen (including odb corruption).
> 
> In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
> arrangements to make it happen, but I have seen it happen in normal use when running experiments.
> 
> K.O.
> 
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
>   * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
>     frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
>     frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
>     frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
>     frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
>     frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
>     frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
>     frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
>     frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
>     frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
>     frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
>     frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
>     frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
>     frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
> y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
> 0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
>     frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
>     frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
> alarm.c:310 [opt]
>     frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
>     frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
>     frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
>     frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
>     frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
> odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
>     frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
>     frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
>     frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1

Konstantin Olchanski

odb needs protection against ctrl-c

> Not sure if you realized, but there is a two-stage Ctrl-C handling inside midas.

Hmm... I am looking at the ctrl-c handler inside odbedit.

Yes, and the original bug report is against odbedit - press of ctrl-c in odbedit corrupts odb,
see stack trace in https://bitbucket.org/tmidas/midas/issues/99

So maybe only the odbedit ctrl-c handler is defective...

I will take a look at what the other ctrl-c handler does.

Safest is probably to call exit() without calling cm_disconnect_experiment().

From the ctrl-c handler, if we call cm_disconnect_experiment() -
- if we hold odb locked, we deadlock (after I remove the recursive mutex) or corrupt odb (if we run form inside db_create or db_set_data).
- if we run from inside db_lock/unlock, we abort() (with my newly added protection) or explode (if we run from inside mprotect(), like in the stack strace in the bug report)

I would say, from a signal handler, only safe things are - set a flag, or abort()/exit().

exit() is not super safe because the user may have attached some code to it that may access odb. (our
default atexit() handler just prints an error message).

K.O.


> The first time you hit ctrl-c, the handler just sets a flag for the main event loop, so that the program can gracefully exit without trouble. This is 
> done inside cm_ctrlc_handler(), which sets _ctrlc_pressed true if called. Then cm_yield() tests this flag and returns RPC_SHUTDOWN if so. I agree not very obvious, maybe we should return a more appropriate status. So the 
> main loop must check the return status of cm_yield() and break if it's RPC_SHUTDOWN. The frontend framework mfe.c does this for example in
> 
>  while (status != RPC_SHUTDOWN && status != SS_ABORT);
> 
> Any use-written program should do the same (well, probably this is nowhere documented).
> 
> Now when the program does not exit (e.g. if it's in an infinite loop), then the second ctrl-c creates a hard abort and terminates the program non-gracefully, which as you noticed can lead to undesired results. All the 
> semaphores (at least when I implemented it) had a SEM_UNDO flag when obtaining ownership. This means that if the semaphore is locked and the process who owns it terminates (even with a hard kill), then the semaphore 
> is released by the OS. This way a crashed program cannot keep the ODB locked for example. Not sure that with all your modifications in the semaphore calls this functionality is still guaranteed.
> 
> Stefan
> 
> > Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.
> > 
> > This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
> > and strange things will happen (including odb corruption).
> > 
> > In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
> > arrangements to make it happen, but I have seen it happen in normal use when running experiments.
> > 
> > K.O.
> > 
> > (lldb) bt
> > * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
> >   * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
> >     frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
> >     frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
> >     frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
> >     frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
> >     frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
> >     frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
> >     frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
> >     frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
> >     frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
> >     frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
> >     frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
> >     frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
> >     frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
> > y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
> > 0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
> >     frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
> >     frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
> > alarm.c:310 [opt]
> >     frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
> >     frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
> >     frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
> >     frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
> >     frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
> > odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
> >     frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
> >     frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
> >     frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1

odb needs protection against ctrl-c

Have you read what I wrote? The current ctrl-c handler just sets the _ctrlc_pressed flag. It might be that some programs do not correctly interprete the return of cm_yield(), certainly the frontend does it correctly. On the SECOND ctrl-c, the program gets 
(internally) hard aborted, equivalent to calling abort(). Not sure if the code works everywhere, I see now that cm_yield(() should maybe return SS_ABORT like 

if (_ctrlc_pressed)
  return SS_ABORT;

then command_loop will exit gracefully. Have to check mfe.c, mlogger.c etc. if they all check for SS_ABORT returned from cm_yield().

> > Not sure if you realized, but there is a two-stage Ctrl-C handling inside midas.
> 
> Hmm... I am looking at the ctrl-c handler inside odbedit.
> 
> Yes, and the original bug report is against odbedit - press of ctrl-c in odbedit corrupts odb,
> see stack trace in https://bitbucket.org/tmidas/midas/issues/99
> 
> So maybe only the odbedit ctrl-c handler is defective...
> 
> I will take a look at what the other ctrl-c handler does.
> 
> Safest is probably to call exit() without calling cm_disconnect_experiment().
> 
> From the ctrl-c handler, if we call cm_disconnect_experiment() -
> - if we hold odb locked, we deadlock (after I remove the recursive mutex) or corrupt odb (if we run form inside db_create or db_set_data).
> - if we run from inside db_lock/unlock, we abort() (with my newly added protection) or explode (if we run from inside mprotect(), like in the stack strace in the bug report)
> 
> I would say, from a signal handler, only safe things are - set a flag, or abort()/exit().
> 
> exit() is not super safe because the user may have attached some code to it that may access odb. (our
> default atexit() handler just prints an error message).
> 
> K.O.
> 
> 
> > The first time you hit ctrl-c, the handler just sets a flag for the main event loop, so that the program can gracefully exit without trouble. This is 
> > done inside cm_ctrlc_handler(), which sets _ctrlc_pressed true if called. Then cm_yield() tests this flag and returns RPC_SHUTDOWN if so. I agree not very obvious, maybe we should return a more appropriate status. So the 
> > main loop must check the return status of cm_yield() and break if it's RPC_SHUTDOWN. The frontend framework mfe.c does this for example in
> > 
> >  while (status != RPC_SHUTDOWN && status != SS_ABORT);
> > 
> > Any use-written program should do the same (well, probably this is nowhere documented).
> > 
> > Now when the program does not exit (e.g. if it's in an infinite loop), then the second ctrl-c creates a hard abort and terminates the program non-gracefully, which as you noticed can lead to undesired results. All the 
> > semaphores (at least when I implemented it) had a SEM_UNDO flag when obtaining ownership. This means that if the semaphore is locked and the process who owns it terminates (even with a hard kill), then the semaphore 
> > is released by the OS. This way a crashed program cannot keep the ODB locked for example. Not sure that with all your modifications in the semaphore calls this functionality is still guaranteed.
> > 
> > Stefan
> > 
> > > Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.
> > > 
> > > This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
> > > and strange things will happen (including odb corruption).
> > > 
> > > In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
> > > arrangements to make it happen, but I have seen it happen in normal use when running experiments.
> > > 
> > > K.O.
> > > 
> > > (lldb) bt
> > > * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
> > >   * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
> > >     frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
> > >     frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
> > >     frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
> > >     frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
> > >     frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
> > >     frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
> > >     frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
> > >     frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
> > >     frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
> > >     frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
> > >     frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
> > >     frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
> > >     frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
> > > y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
> > > 0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
> > >     frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
> > >     frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
> > > alarm.c:310 [opt]
> > >     frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
> > >     frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
> > >     frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
> > >     frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
> > >     frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
> > > odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
> > >     frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
> > >     frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
> > >     frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1

Konstantin Olchanski

odbc sql history mlogger update

> mhttpd and mlogger have been updated with potentially troublesome changes.
> These new features are now available:
> - a "feature complete" implementation of "history in an SQL database".

The mlogger SQL history driver has been updated with improvements that make this new system usable in 
production environment: the silly "create all tables on startup, every time, even if they already exist" is fixed,
mlogger survives restarts of mysqld and checks that existing sql columns have data types compatible with the 
data we are trying to write.

There are still a few trouble spots remaining. For example, in mapping midas names into sql names (sql names 
have more restrictions on permitted characters) and in reverse mapping of sql data types to midas data types. 
To properly solve this, I may have to save the midas names and data types into an additional index table.

Included is the mh2sql utility for importing existing history files into an SQL database (in the same way as if 
they were written into the database by mlogger).

The mhttpd side of this system still needs polishing, but should be already fully functional.

A preliminary version of documentation for this new SQL history system is here. After additional review and 
editing it will be committed to the midas midox documentation. Included are full instructions on enabling 
writing of midas history into a MySQL database.
http://ladd00.triumf.ca/~olchansk/midas/Internal.html#History_sql_internal

svn revision 4452
K.O.

Konstantin Olchanski

odbedit bad ctrl-C

When using "/bin/bash" shell, if I exit odbedit (and other midas programs) using ctrl-C, the terminal 
enters a funny state, "echo" is turned off (I cannot see what I type), "delete" key does not work (echoes 
^H instead).

This problem does not happen if I exit using the "exit" command or if I use the "/bin/tcsh" shell.

When this happens, the terminal can be restored to close to normal state using "stty sane", and "stty 
erase ^H".

The terminal is set into this funny state by system.c::getchar() and normal settings are never restored 
unless the midas program calls getchar(1) at the end. If the program does not finish normally, original 
terminal settings are never restored and the terminal is left in a funny state.

It is not clear why the problem does not happen with /bin/tcsh - perhaps they restore sane terminal 
settings automatically for us.
K.O.

odbedit bad ctrl-C

> When using "/bin/bash" shell, if I exit odbedit (and other midas programs) using ctrl-C, the terminal 
> enters a funny state, "echo" is turned off (I cannot see what I type), "delete" key does not work (echoes 
> ^H instead).
> 
> This problem does not happen if I exit using the "exit" command or if I use the "/bin/tcsh" shell.
> 
> When this happens, the terminal can be restored to close to normal state using "stty sane", and "stty 
> erase ^H".
> 
> The terminal is set into this funny state by system.c::getchar() and normal settings are never restored 
> unless the midas program calls getchar(1) at the end. If the program does not finish normally, original 
> terminal settings are never restored and the terminal is left in a funny state.
> 
> It is not clear why the problem does not happen with /bin/tcsh - perhaps they restore sane terminal 
> settings automatically for us.
> K.O.

Who uses bash ??? And who keeps baning on Ctrl-C, when there is a nice "exit" command ;-)

Well, I implemented a simple CTRL-C handler in odbedit (Rev. 4503) which resets the terminal before exiting. 
Give it a try. Of course this cannot catch a hard kill (-9), but CTRL-C works now correctly under bash at 
least.

Konstantin Olchanski

odbedit fixed size buffer overrun

odbedit uses a fixed size buffer for ODB data. If an array in ODB is bugger than this size, 
db_get_data() correctly returns DB_TRUNCATED and there is no memory overwrite, but the following 
code for printing the data does not know about this truncation and proceeds printing memory 
values contained after the end of the fixed size data buffer - in the current case, this memory 
somehow had the contents of the shell history - this very confused the MIDAS users as they though 
that the funny printout was actually in ODB. This is in the print_key() function. If db_get_data() 
returns DB_TRUNCATED, it should allocate a buffer of bigger size. This (and the previous) bug found 
by the TIGRESS experiment. K.O.

Konstantin Olchanski

once in 100 years midas shared memory bug

We were debugging a strange problem in the event builder, where out of 14
fragments, two fragments were always getting serial number mismatches and the
serial numbers were not sequentially increasing (the other 12 fragments were
just fine).

Then we noticed in the event builder debug output that these 2 fragments were
getting assigned the same buffer handle number, despite having different names -
BUF09 and BUFTPC.

Then we looked at "ipcs", counted the buffers, and there are only 13 buffers for
14 frontends.

Aha, we went, maybe we have unlucky buffer names, renamed BUFTPC to BUFAAA and
everything started to work just fine.

It turns out that the MIDAS ss_shm_open() function uses "ftok" to convert buffer
names to SystemV shared memory keys. This "ftok" function promises to create
unique keys, but I guess, just not today.

Using a short test program, I confirmed that indeed we have unlucky buffer names
and ftok() returns duplicate keys, see below.

Apparently ftok() uses the low 16 bits of the file inode number, but in our
case, the files are NFS mounted and inode numbers are faked inside NFS. When I
run my test program on a different computer, I get non-duplicate keys. So I
guess we are double unlucky.

Test program:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>

int main(int argc, char* argv[])
{
  //key_t ftok(const char *pathname, int proj_id);
  
  int k1 = ftok("/home/t2kdaq/midas/nd280/backend/.BUF09.SHM", 'M');
  int k2 = ftok("/home/t2kdaq/midas/nd280/backend/.BUFTPC.SHM", 'M');
  int k3 = ftok("/home/t2kdaq/midas/nd280/backend/.BUFFGD.SHM", 'M');

  printf("key1: 0x%08x, key2: 0x%08x, key3: 0x%08x\n", k1, k2, k3);
  return 0;
}

[t2kfgd@t2knd280logger ~/xxx]$ g++ -o ftok -Wall ftok.cxx
[t2kfgd@t2knd280logger ~/xxx]$ ./ftok
key1: 0x4d138154, key2: 0x4d138154, key3: 0x4d138152

Also:

[t2kfgd@t2knd280logger ~/xxx]$ ls -li ...
14385492 -rw-r--r-- 1 t2kdaq t2kdaq  8405052 Nov 24 17:42
/home/t2kdaq/midas/nd280/backend/.BUF09.SHM
36077906 -rw-r--r-- 1 t2kdaq t2kdaq 67125308 Nov 26 10:19
/home/t2kdaq/midas/nd280/backend/.BUFFGD.SHM
36077908 -rw-r--r-- 1 t2kdaq t2kdaq  8405052 Nov 25 15:53
/home/t2kdaq/midas/nd280/backend/.BUFTPC.SHM

(hint: print the inode numbers in hex and compare to shm keys printed by the
program)

K.O.

Konstantin Olchanski

open MIDAS RPC ports

we had a bit of trouble with open network ports recently and I now think security of MIDAS RPC
ports needs to be tightened.

TL;DR, this is a non-trivial network configuration problem, TL required, DR up to you.

as background, right now we have two settings in ODB, "/expt/security/enable non-localhost
RPC" set to "no" (the default) and set to "yes". Set to "no" is very secure, all RPC sockets
listen only on the "localhost" interface (127.0.0.1) and do not accept connections from other
computers. Set to "yes", RPC sockets accept connections from everywhere in the world, but
immediately close them without reading any data unless connection origins are listed in ODB
"/expt/security/RPC hosts" (white-listed).

the problem, one. for security and robustness we place most equipments on a private network
(192.168.1.x). MIDAS frontends running on these equipments must connect to MIDAS running on
the main computer. This requires setting "enable non-localhost RPC" to "yes" and white-listing
all private network equipments. so far so good.

the problem, one, continued. in this configuration, the MIDAS main computer is usually also
the network gateway (with NAT, IP forwarding, DHCP, DNS, etc). so now MIDAS RPC ports are open
to all external connections (in the absence of restrictive firewall rules). one would hope for
security-through-obscurity and expect that "external threat actors" will try to bother them,
but in reality we eventually see large numbers of rejected unwanted connections logged in
midas.log (we log the first 10 rejected connections to help with maintaining the RPC
connections white-list).

the problem, two. central IT do not like open network ports. they run their scanners, discover
the MIDAS RPC ports, complain about them, require lengthy explanations, etc.

it would be much better if in the typical configuration, MIDAS RPC ports did not listen on
external interfaces (campus network). only listen on localhost and on private network
interfaces (192.168.1.x).

I am not yet of the simplest way to implement this. But I think this is the direction we
should go.

P.S. what about firewall rules? two problems: one: from statistic-of-one, I make mistakes
writing firewall rules, others also will make mistakes, a literally fool-proof protection of
MIDAS RPC ports is needed. two: RHEL-derived Linuxes by-default have restrictive firewall
rules, and this is good for security, except that there is a failure mode where at boot time
something can go wrong and firewall rules are not loaded at all. we have seen this happen.
this is a complete disaster on a system that depends on firewall rules for security. better to
have secure applications (TCP ports protected by design and by app internals) with firewall
rules providing a secondary layer of protection.

P.P.S. what about MIDAS frontend initial connection to the mserver? this is currently very
insecure, but the vulnerability window is very small. Ideally we should rework the mserver
connection to make it simpler, more secure and compatible with SSH tunneling.

P.P.S. Typical network diagram:

internet - campus firewall - campus network - MIDAS host (MIDAS) - 192.168.1.x network - power
supplies, digitizers, MIDAS frontends.

P.P.S. mserver connection sequence:

1) midas frontend opens 3 tcp sockets, connections permitted from anywhere
2) midas frontend opens tcp socket to main mserver, sends port numbers of the 3 tcp sockets
3) main mserver forks out a secondary (per-client) mserver
4) secondary mserver connects to the 3 tcp sockets of the midas frontend created in (1)
5) from here midas rpc works
6) midas frontend loads the RPC white-list
7) from here MIDAS RPC sockets are secure (protected by the white-list).

(the 3 sockets are: RPC recv_sock, RPC send_sock and event_sock)

P.P.S. MIDAS UDP sockets used for event buffer and odb notifications are secure, they bind to
localhost interface and do not accept external connections.

K.O.

open MIDAS RPC ports

One thing coming to my mind is the interface binding. If you have a midas host with two networks 
("global" and "local"=192.168.x.x), you can tell to which interface a socket should bind. 
By default it binds both interfaces, but we could restrict the socket only to bind to the local 
interface 192.168.x.x. This way the open port would be invisible from the outside, but from 
your local network everybody can connect easily without the need of a white list.

Stefan

Konstantin Olchanski

openssl situation, MacOS 10.11 (El Capitan) openssl compilation errors

> I recently upgraded my macbook to MacOS 10.11. 
> [ and midas would not compile ]
> It seems that MacOS has now fully removed openssl ...

My read of tea leaves - the macos version of openssl was so old it was almost useless, did not support any of the modern HTTPS 
features. So to use mhttpd with https you pretty much had to install openssl from macports anyway. For macos 10.11 maybe they 
looked at upgrading to newer version, but since the openssl kerfuffle last year, there is several forks of openssl (the OpenBSD fork 
named libressl is the best, IMO), so rather than picking and choosing, they deleted the whole thing.

Now back to MIDAS.

We use the mongoose web server module and I have expected by now for them to make a move on improving HTTPS support, but no 
move happened.

Right now mongoose support OpenSSL only (I would expect the OpenBSD LibreSSL fork to work to of the box, too). Other then that, 
they have:
a) their own mickey-mouse https library (krypton) which does not support any modern cryptography (RC4 only - when RC4 is known to 
be useless).
b) an adapter library (polar) for interfacing with PolarSSL (mbedtls)

At this point I would rather abandon the implicit dependency on the system-provided openssl and have an explicit dependancy on a 
modern https crypto library.

Option (b) would work for us - 
1) add "git clone mbedtls; cd mbedtls; make" to midas build instructions
2) add polarssl_compat.c to midas git (from cessanta/polar repo)
3) retest mhttpd against ssllabs https scanner, retest against all web browsers.

The downside of this route is loss of automatic nightly updates to the https crypto library (for better or for worse).

K.O.

P.S. Because on MacOS use of openssl from macports is pretty much required, it should be moved from the "tricks" page to the 
standard midas installation instructions ("install required packages").

Konstantin Olchanski

pending problems and fixes from triumf

Here is the list of known problems I am aware of and of fixes not yet committed
to midas svn:

Goto page Previous 1, 2, 3 ... 121, 122, 123 ... 135, 136, 137 Next

ELOG V3.1.4-2e1708b5