Back Midas Rome Roody Rootana
  Midas DAQ System, Page 121 of 136  Not logged in ELOG logo
ID Date Author Topic Subjectup
  2708   13 Feb 2024 Stefan RittForumnumber of entries in a given ODB subdirectory ?
> > For ODB keys of type TID_KEY, the value num_values IS the number of subkeys. 
> 
> this logic makes sense, however it doesn't seem to be consistent with the printout of the test example
> at the end of https://daq00.triumf.ca/elog-midas/Midas/240203_095803/a.cc . The printout reports 
> 
> key.num_values = 1, but the actual number of subkeys = 6, and all subkeys being of TID_KEY type
> 
> I'm certain that the ODB subtree in question was not accessed concurrently during the test.

You are right, num_values is always 1 for TID_KEYS. The number of subkeys is stored in 

  ((KEYLIST *) ((char *)pheader + pkey->data))->num_keys

Maybe we should add a function to return this. But so far db_enum_key() was enough.

Stefan
  2713   15 Feb 2024 Konstantin OlchanskiForumnumber of entries in a given ODB subdirectory ?
> > > For ODB keys of type TID_KEY, the value num_values IS the number of subkeys. 
> > 
> > this logic makes sense, however it doesn't seem to be consistent with the printout of the test example
> > at the end of https://daq00.triumf.ca/elog-midas/Midas/240203_095803/a.cc . The printout reports 
> > 
> > key.num_values = 1, but the actual number of subkeys = 6, and all subkeys being of TID_KEY type
> > 
> > I'm certain that the ODB subtree in question was not accessed concurrently during the test.
> 
> You are right, num_values is always 1 for TID_KEYS. The number of subkeys is stored in 
> 
>   ((KEYLIST *) ((char *)pheader + pkey->data))->num_keys
> 
> Maybe we should add a function to return this. But so far db_enum_key() was enough.
> 
> Stefan

I would rather add a function that atomically returns an std::vector<KEY>. number of entries
is vector size, entry names are in key.name. If you need to do something with an entry,
like iterate a subdirectory, you have to go by name (not by HNDLE), and if somebody deleted
it, you get an error "entry deleted, tough!", (HNDLE becomes invalid without any error message about it, 
subsequent db_get_data() likely returns gibberish, subsequent db_set_data() likely corrupts ODB).

K.O.
  2714   15 Feb 2024 Konstantin OlchanskiForumnumber of entries in a given ODB subdirectory ?
> > You are right, num_values is always 1 for TID_KEYS. The number of subkeys is stored in 
> >   ((KEYLIST *) ((char *)pheader + pkey->data))->num_keys
> > Maybe we should add a function to return this. But so far db_enum_key() was enough.

Hmm... is there any use case where you want to know the number of directory entries, but you will not iterate 
over them later?

K.O.
  2715   15 Feb 2024 Stefan RittForumnumber of entries in a given ODB subdirectory ?
> Hmm... is there any use case where you want to know the number of directory entries, but you will not iterate 
> over them later?

I agree. 

One more way to iterate over subkeys by name is by using the new odbxx API:


   midas::odb tree("/Test/Settings");
   for (midas::odb& key : tree)
      std::cout << key.get_name() << std::endl;


Stefan
  2717   19 Feb 2024 Pavel MuratForumnumber of entries in a given ODB subdirectory ?
> > Hmm... is there any use case where you want to know the number of directory entries, but you will not iterate 
> > over them later?
> 
> I agree. 

here comes the use case: 

I have a slow control frontend which monitors several DAQ components - software processes. 
The components are listed in the system configuration stored in ODB, a subkey per component.

Each component has its own driver, so the length of the driver list, defined by the number of components, 
needs to be determined at run time.

I calculate the number of components by iterating over the list of component subkeys in the system configuration, 
allocate space for the driver list, and store the pointer to the driver list in the equipment record.

The approach works, but it does require pre-calculating the number of subkeys of a given key.

-- regards, Pasha
  546   01 Jan 2009 Konstantin OlchanskiInfoodb "hot link" magic explored
Here are my notes on the MIDAS ODB "hot link" function. Perhaps others can find them useful.

Using db_open_record(key,function), the user can tell MIDAS to call the specified user function when 
the specified ODB key is modified by any other MIDAS program. This function works both locally 
(shared memory odb access) and remotely (odb access through mserver tcp rpc). For example, the 
MIDAS "history" mechanism is implemented in the mlogger by "hot-linking" ODB 
"/equipment/xxx/Variables".

First, the relevant data structures defined in midas.h and msystem.h (ODB database headers, etc)

(in midas.h)
#define NAME_LENGTH            32            /**< length of names, mult.of 8! */
#define MAX_CLIENTS            64            /**< client processes per buf/db */
#define MAX_OPEN_RECORDS       256           /**< number of open DB records   */

(in msystem.h)
DATABASE buf <--- local, private to each client)
  DATABASE_HEADER* database_header <--- odb in shared memory
    char name[NAME_LENGTH]
    DATABASE_CLIENT client[MAX_CLIENTS]
      char name[NAME_LENGTH]
      OPEN_RECORD open_record[MAX_OPEN_RECORDS]
        handle
        access_mode
        flags

(the above means that each midas client has access to the list of all open records through
buf->database_header.client[i].open_record[j])

Second, the data path through db_set_data & co: (other odb "write" functions work the same way)

db_set_data(key)
  lock db
  update odb <--- memcpy(), really
  db_notify_clients(key)
  unlock db
  return

db_notify_clients(key)
  loop: <--- data for this key changed and so data for all keys containing it
             also changed, and we need to notify anybody who has an open record
             on the parents of this key. need to loop over parents of this key (follow "..")
  if (key->notify_count)
    foreach client
      foreach open_record
        if (open_record.handle == key)
          ss_resume(client->port, "O hDB hKey")
  key = key.parent
  goto loop;

ss_resume(port, message)
  idx = ss_suspend_get_index()  <--- magic here
  send udp message ("O hDB hKey") to localhost:port <-- notifications sent only to local host!

note 1: I do not completely understand the ss_suspend_xxx() stuff. The best I can tell
is it creates a number of udp sockets bound to the local host and at least one udp rpc
receive socket ultimately connected to the cm_dispatch_rpc() function.

note 2: More magic here: database_header->client[i].port appears to be the udp rpc server
port of the mserver, while ODB /Clients/xxx/Port is the tcp rpc server port
of the client itself, on the remote host

note 3: the following is for remote odb clients connected through the mserver. For local
clients, cm_dispatch_rpc() calls the local db_update_record() as shown at the very end.

note 4: this uses udp rpc. If the udp datagram is lost inside the os kernel (it looks like these udp/rpc 
datagrams never go out to the network), "hot-link" silently fails: code below is not executed. Some 
OSes (namely, Linux) are known to lose udp datagrams with high probability under certain
not very well understood conditions.

local mserver receives the udp datagram
  ...
  cm_dispatch_ipc()
    if (message=="O hDB hKey")
      decode message (hDB, hKey)
      db_update_record(hDB, hKey)
        send tcp rpc with args(MSG_ODB, hDB, hKey)

(note- unlike udp rpc, tcp rpc are never "lost")

remote client receives tcp rpc:
rpc_client_dispatch()
  recv_tcp(net_buffer)
  if (net_buffer.routine_id == MSG_ODB)
    db_update_record(hDB, hKey)

db_update_record(hDB, hKey)
  if remote delivery, see cm_dispatch_ipc() above
  <--- local delivery
  foreach (_recordlist)
    if (recordlist.handle == hKey)
      if (!recordlist.access_mode&MODE_WRITE)
        db_get_record(hDB,hKey,recordlist.data,recordlist.size)
        recordlist.dispatcher(hDB,hKey,recordlist.info); <-- user-supplied handler

Note: the dispatcher() above is the function supplied by the user in db_open_record().

K.O.
  553   14 Jan 2009 Stefan RittInfoodb "hot link" magic explored

KO wrote:
note 1: I do not completely understand the ss_suspend_xxx() stuff. The best I can tell is it creates a number of udp sockets bound to the local host and at least one udp rpc receive socket ultimately connected to the cm_dispatch_rpc() function.


The ss_suspend_xxx() stuff is indeed the most complicated thing in midas an I have to remind myself always
on how this works. So let me try again:

The basic idea is that for a high performance system, you cannot do the inter-process communication via
polling. That would waste CPU time. Inter-process communication is necessary for for buffer manager
(producer notifies consumer when new events are there), for the RPC mechanism (odbedit tells mlogger to
start a run) or for ODB hot-links. To avoid polling, the inter-process communication works with sockets (UDP
and TCP). This allows to use the select() call, which suspends the calling process until some socket
receives data or a pre-defined time-out expires. This is the only portable method I found which works under
unix and windows (signals are only poorly supported under windows).

So after creating all sockets, ss_suspend() does a select() on these sockets:

_suspend_struct[idx].listen_socket Server side for any new RPC connection (each client is also a RPC server which gets contacted directly during run transitions for example
_suspend_struct[idx].server_acception.recv_sock Receive socket (TCP) for any active RPC connection
_suspend_struct[idx].server_acception.event_sock Receive socket (TCP) for bare events (bypassing RPC layer for performance reasons)
_suspend_struct[idx].server_connection->recv_sock Outgoing TCP connection to mserver. Used for example for hot-link notifications from mserver
_suspend_struct[idx].ipc_recv_socket UDP socket for inter-process notification


For each socket there is a dispatch function, which gets called if that socket receives some data. Hope this sheds some light on the guts of that.
  734   23 Dec 2010 Konstantin OlchanskiBug Reportodb corruption, odb race condition?
The following script makes midas very unhappy and eventually causes odb corruption. I suspect the reason is some kind of race condition collision between client 
creation and destruction code and the watchdog activity (each client periodically runs cm_watchdog() to check if other clients are still alive, O(NxN) total complexity). 
Amongst messages appearing in midas.log:

Thu Dec 23 11:59:08 2010 [ODBEdit28,INFO] Client 'unknown' on buffer 'SYSMSG' removed by bm_open_buffer because client pid 20463 does not exist
Thu Dec 23 11:59:09 2010 [ODBEdit43,INFO] Client 'unknown' on buffer 'SYSMSG' removed by cm_watchdog because client pid 20465 does not exist
Thu Dec 23 12:11:21 2010 [ODBEdit,ERROR] [odb.c:1061:db_open_database,ERROR] Removing client 'ODBEdit11', pid 21536, index 27 because the pid no longer exists
Thu Dec 23 17:06:15 2010 [ODBEdit,ERROR] [odb.c:988:db_open_database,ERROR] maximum number of clients exceeded
Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING

The last message about <"Name" is of type NULL> appears during normal operation of the ND280 DAQ, leading me into these investigations.

Notes:
a) the script runs at most 50 copies of odbedit, never exceeding midas.h MAX_CLIENTS value 64, so one does not expect to see messages about "maximum number of 
clients exceeded"
b) the script runs 50 copies of odbedit in parallel, increasing the likelihood of whatever race condition is causing this. In the ND280 system, likelihood of failure is 
increased by the large number of running clients (10-20-30 clients), each client running periodic cm_watchdog, to collide with new client creation or destruction.
c) in other experiments, we do not see this (ok, we do have midas meltdowns once in a while) because (1) we tend to have fewer clients (reduced frequency of 
cm_watchdog), (2) we tend to not start and stop midas clients too often (reduced frequency of running client creation and destruction). (NB it seems like ND280 people 
tend to run many scripts containing odbedit commands, so they effectively start and stop midas clients more often than usual).


#!/usr/bin/perl -w
#$cmd = "odbedit -c \'scl -w\' &";
$cmd = "odbedit -c \'ls -l /system/clients\' &";
for (my $i=0; $i<50; $i++)
{
system $cmd;
}
#end
  735   24 Dec 2010 Konstantin OlchanskiBug Reportodb corruption, odb race condition?
> Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING

This is caused by a race condition between client removal in cm_delete_client_info() and cm_exist().

The race condition in cm_exist() works like this:
- db_enum_key() returns the hkey (pointer to) the next /System/Clients/PID directory
- the client corresponding to PID is removed, our hkey now refers to a deleted entry
- db_get_value() tries to use the now stale hkey pointing to a deleted entry, complains about invalid key TID.

Because the offending db_get_value() is called with the "create if not found" argument set to TRUE, there is potential
for writing into ODB using a stale hkey, maybe leading to ODB corruption. Other than that, this race condition seems
to be benign.

cm_exist() is called from:
everybody->cm_yield()->al_check()->cm_exist()

Further analysis:
- cm_yield() calls al_check() every 10 sec, al_check() calls cm_exist() to check for "program is not running" alarms.
- in al_check() cm_exist() is called once for each entry in /Programs/xxx, even for programs with no alarms. (Maybe I should change this?)
- assuming 10 programs are running (10 clients), every 10 seconds, cm_exist() will be called 10 times and inside, will loop over 10 clients, exposing the enum-get race condition 10*10=100 times every 10 seconds. Usually, 
ODB /Programs/ has many more entries than there are active clients, further increasing the frequency of exposure of this race condition.

K.O.
  736   24 Dec 2010 Konstantin OlchanskiBug Reportodb corruption, odb race condition?
> > Thu Dec 23 12:10:30 2010 [ODBEdit9,ERROR] [odb.c:3247:db_get_value,ERROR] "Name" is of type NULL, not STRING
> This is caused by a race condition between client removal in cm_delete_client_info() and cm_exist().
> ... this race condition seems to be benign.

Not so benign - after fixing cm_exist() to check the return value of db_get_value() and calling it without the "create" flag,
a crasher turned up inside db_find_key() called by db_get_value() with these stale hkeys. For invalid keys (not TID_KEY),
it would call db_get_path() and crash.

So after adding a check for valid key types, my test script runs much better - all the major weirdness is gone, I only see
rare messages from db_find_key(), db_get_key() and db_get_value() about invalid key and data types (after all,
I did not fix the underlying race condition).

The only remaining problem when running my script is some kind of deadlock between the ODB and SYSMSG semaphores...

K.O.
  2422   08 Aug 2022 Konstantin OlchanskiInfoodb disallow key names that start or end with spaces
while testing the new odb editor, we ran into a number of problems with key names 
that start or end with spaces. we cannot think of any valid use case for such key 
names (subdirectories and variables) and we think they could only have been 
created by mistake. ODB now disallows such names. K.O.
  1318   13 Oct 2017 Konstantin OlchanskiInfoodb multithread support repaired
multithreaded access to odb was implemented back in 2013-2014. but recently a bug surfaced - 
there was a race condition in the odb locking code against cm_watchdog(). Somehow this only 
affected the mserver for the DRAGON experiment at TRIUMF. This is now fixed on the branch 
feature/midas-2017-10. (this branch collects all the code that needs additional testing before 
merging into develop and becoming the next release of midas).
K.O.
  1448   20 Feb 2019 Konstantin OlchanskiInfoodb needs protection against ctrl-c
Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.

This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
and strange things will happen (including odb corruption).

In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
arrangements to make it happen, but I have seen it happen in normal use when running experiments.

K.O.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
    frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
    frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
    frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
    frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
    frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
    frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
    frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
    frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
    frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
    frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
    frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
    frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
    frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
    frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
alarm.c:310 [opt]
    frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
    frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
    frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
    frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
    frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
    frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
    frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
    frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1
  1449   20 Feb 2019 Stefan RittInfoodb needs protection against ctrl-c
Not sure if you realized, but there is a two-stage Ctrl-C handling inside midas. The first time you hit ctrl-c, the handler just sets a flag for the main event loop, so that the program can gracefully exit without trouble. This is 
done inside cm_ctrlc_handler(), which sets _ctrlc_pressed true if called. Then cm_yield() tests this flag and returns RPC_SHUTDOWN if so. I agree not very obvious, maybe we should return a more appropriate status. So the 
main loop must check the return status of cm_yield() and break if it's RPC_SHUTDOWN. The frontend framework mfe.c does this for example in

 while (status != RPC_SHUTDOWN && status != SS_ABORT);

Any use-written program should do the same (well, probably this is nowhere documented).

Now when the program does not exit (e.g. if it's in an infinite loop), then the second ctrl-c creates a hard abort and terminates the program non-gracefully, which as you noticed can lead to undesired results. All the 
semaphores (at least when I implemented it) had a SEM_UNDO flag when obtaining ownership. This means that if the semaphore is locked and the process who owns it terminates (even with a hard kill), then the semaphore 
is released by the OS. This way a crashed program cannot keep the ODB locked for example. Not sure that with all your modifications in the semaphore calls this functionality is still guaranteed.

Stefan

> Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.
> 
> This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
> and strange things will happen (including odb corruption).
> 
> In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
> arrangements to make it happen, but I have seen it happen in normal use when running experiments.
> 
> K.O.
> 
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
>   * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
>     frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
>     frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
>     frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
>     frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
>     frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
>     frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
>     frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
>     frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
>     frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
>     frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
>     frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
>     frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
>     frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
> y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
> 0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
>     frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
>     frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
> alarm.c:310 [opt]
>     frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
>     frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
>     frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
>     frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
>     frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
> odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
>     frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
>     frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
>     frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1
  1450   20 Feb 2019 Konstantin OlchanskiInfoodb needs protection against ctrl-c
Commit f81ff3c protects db_lock/unlock, but not any of the other functions. What if we do ctrl-c in the middle
of some odb write operation in the middle of memory allocation, etc.

A sure way to corrupt odb.

Perhaps we should disallow odb access from signal handlers? But we still want to be able to stop midas
programs using ctrl-c, even if the program is in some infinite loop somewhere and is not processing
midas events (no calls to cm_yield(), etc).

Maybe I should change the ctrl-c handler to set a flag for cm_yield() to return SS_EXIT,
and additional ctrl-c do nothing if this flag is already set? (maybe abort() if they do ctrl-c 10 times?).

K.O.

> Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.
> 
> This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
> and strange things will happen (including odb corruption).
> 
> In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
> arrangements to make it happen, but I have seen it happen in normal use when running experiments.
> 
> K.O.
> 
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
>   * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
>     frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
>     frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
>     frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
>     frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
>     frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
>     frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
>     frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
>     frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
>     frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
>     frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
>     frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
>     frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
>     frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
> y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
> 0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
>     frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
>     frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
> alarm.c:310 [opt]
>     frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
>     frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
>     frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
>     frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
>     frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
> odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
>     frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
>     frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
>     frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1
  1451   20 Feb 2019 Konstantin OlchanskiInfoodb needs protection against ctrl-c
> Not sure if you realized, but there is a two-stage Ctrl-C handling inside midas.

Hmm... I am looking at the ctrl-c handler inside odbedit.

Yes, and the original bug report is against odbedit - press of ctrl-c in odbedit corrupts odb,
see stack trace in https://bitbucket.org/tmidas/midas/issues/99

So maybe only the odbedit ctrl-c handler is defective...

I will take a look at what the other ctrl-c handler does.

Safest is probably to call exit() without calling cm_disconnect_experiment().

From the ctrl-c handler, if we call cm_disconnect_experiment() -
- if we hold odb locked, we deadlock (after I remove the recursive mutex) or corrupt odb (if we run form inside db_create or db_set_data).
- if we run from inside db_lock/unlock, we abort() (with my newly added protection) or explode (if we run from inside mprotect(), like in the stack strace in the bug report)

I would say, from a signal handler, only safe things are - set a flag, or abort()/exit().

exit() is not super safe because the user may have attached some code to it that may access odb. (our
default atexit() handler just prints an error message).

K.O.


> The first time you hit ctrl-c, the handler just sets a flag for the main event loop, so that the program can gracefully exit without trouble. This is 
> done inside cm_ctrlc_handler(), which sets _ctrlc_pressed true if called. Then cm_yield() tests this flag and returns RPC_SHUTDOWN if so. I agree not very obvious, maybe we should return a more appropriate status. So the 
> main loop must check the return status of cm_yield() and break if it's RPC_SHUTDOWN. The frontend framework mfe.c does this for example in
> 
>  while (status != RPC_SHUTDOWN && status != SS_ABORT);
> 
> Any use-written program should do the same (well, probably this is nowhere documented).
> 
> Now when the program does not exit (e.g. if it's in an infinite loop), then the second ctrl-c creates a hard abort and terminates the program non-gracefully, which as you noticed can lead to undesired results. All the 
> semaphores (at least when I implemented it) had a SEM_UNDO flag when obtaining ownership. This means that if the semaphore is locked and the process who owns it terminates (even with a hard kill), then the semaphore 
> is released by the OS. This way a crashed program cannot keep the ODB locked for example. Not sure that with all your modifications in the semaphore calls this functionality is still guaranteed.
> 
> Stefan
> 
> > Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.
> > 
> > This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
> > and strange things will happen (including odb corruption).
> > 
> > In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
> > arrangements to make it happen, but I have seen it happen in normal use when running experiments.
> > 
> > K.O.
> > 
> > (lldb) bt
> > * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
> >   * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
> >     frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
> >     frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
> >     frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
> >     frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
> >     frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
> >     frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
> >     frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
> >     frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
> >     frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
> >     frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
> >     frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
> >     frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
> >     frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
> > y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
> > 0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
> >     frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
> >     frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
> > alarm.c:310 [opt]
> >     frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
> >     frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
> >     frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
> >     frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
> >     frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
> > odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
> >     frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
> >     frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
> >     frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1
  1452   20 Feb 2019 Stefan RittInfoodb needs protection against ctrl-c
Have you read what I wrote? The current ctrl-c handler just sets the _ctrlc_pressed flag. It might be that some programs do not correctly interprete the return of cm_yield(), certainly the frontend does it correctly. On the SECOND ctrl-c, the program gets 
(internally) hard aborted, equivalent to calling abort(). Not sure if the code works everywhere, I see now that cm_yield(() should maybe return SS_ABORT like 

if (_ctrlc_pressed)
  return SS_ABORT;

then command_loop will exit gracefully. Have to check mfe.c, mlogger.c etc. if they all check for SS_ABORT returned from cm_yield().

> > Not sure if you realized, but there is a two-stage Ctrl-C handling inside midas.
> 
> Hmm... I am looking at the ctrl-c handler inside odbedit.
> 
> Yes, and the original bug report is against odbedit - press of ctrl-c in odbedit corrupts odb,
> see stack trace in https://bitbucket.org/tmidas/midas/issues/99
> 
> So maybe only the odbedit ctrl-c handler is defective...
> 
> I will take a look at what the other ctrl-c handler does.
> 
> Safest is probably to call exit() without calling cm_disconnect_experiment().
> 
> From the ctrl-c handler, if we call cm_disconnect_experiment() -
> - if we hold odb locked, we deadlock (after I remove the recursive mutex) or corrupt odb (if we run form inside db_create or db_set_data).
> - if we run from inside db_lock/unlock, we abort() (with my newly added protection) or explode (if we run from inside mprotect(), like in the stack strace in the bug report)
> 
> I would say, from a signal handler, only safe things are - set a flag, or abort()/exit().
> 
> exit() is not super safe because the user may have attached some code to it that may access odb. (our
> default atexit() handler just prints an error message).
> 
> K.O.
> 
> 
> > The first time you hit ctrl-c, the handler just sets a flag for the main event loop, so that the program can gracefully exit without trouble. This is 
> > done inside cm_ctrlc_handler(), which sets _ctrlc_pressed true if called. Then cm_yield() tests this flag and returns RPC_SHUTDOWN if so. I agree not very obvious, maybe we should return a more appropriate status. So the 
> > main loop must check the return status of cm_yield() and break if it's RPC_SHUTDOWN. The frontend framework mfe.c does this for example in
> > 
> >  while (status != RPC_SHUTDOWN && status != SS_ABORT);
> > 
> > Any use-written program should do the same (well, probably this is nowhere documented).
> > 
> > Now when the program does not exit (e.g. if it's in an infinite loop), then the second ctrl-c creates a hard abort and terminates the program non-gracefully, which as you noticed can lead to undesired results. All the 
> > semaphores (at least when I implemented it) had a SEM_UNDO flag when obtaining ownership. This means that if the semaphore is locked and the process who owns it terminates (even with a hard kill), then the semaphore 
> > is released by the OS. This way a crashed program cannot keep the ODB locked for example. Not sure that with all your modifications in the semaphore calls this functionality is still guaranteed.
> > 
> > Stefan
> > 
> > > Even with the cm_watchdog signal removed, some trouble from UNIX signals remains.
> > > 
> > > This time, when one presses Ctrl-C at the wrong time, the Ctrl-C signal handler will run at the wrong time
> > > and strange things will happen (including odb corruption).
> > > 
> > > In the captured stack trace, I pressed Ctrl-C right when odbedit was inside db_lock_database(). I had to make special
> > > arrangements to make it happen, but I have seen it happen in normal use when running experiments.
> > > 
> > > K.O.
> > > 
> > > (lldb) bt
> > > * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
> > >   * frame #0: 0x00007fff6c2ceb66 libsystem_kernel.dylib`__pthread_kill + 10
> > >     frame #1: 0x00007fff6c499080 libsystem_pthread.dylib`pthread_kill + 333
> > >     frame #2: 0x00007fff6c22a1ae libsystem_c.dylib`abort + 127
> > >     frame #3: 0x00000001057ccf95 odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2048 [opt]
> > >     frame #4: 0x00000001057aed9d odbedit`cm_delete_client_info(hDB=1, pid=46856) at midas.c:1702 [opt]
> > >     frame #5: 0x00000001057b08fe odbedit`cm_disconnect_experiment at midas.c:2704 [opt]
> > >     frame #6: 0x00000001057a8231 odbedit`ctrlc_odbedit(i=<unavailable>) at odbedit.cxx:2863 [opt]
> > >     frame #7: 0x00007fff6c48cf5a libsystem_platform.dylib`_sigtramp + 26
> > >     frame #8: 0x00007fff6c2ced83 libsystem_kernel.dylib`__semwait_signal + 11
> > >     frame #9: 0x00007fff6c249724 libsystem_c.dylib`nanosleep + 199
> > >     frame #10: 0x00007fff6c249586 libsystem_c.dylib`sleep + 41
> > >     frame #11: 0x00000001057cce6b odbedit`db_lock_database(hDB=<unavailable>) at odb.c:2057 [opt]
> > >     frame #12: 0x00000001057e129a odbedit`db_get_record_size(hDB=1, hKey=141848, align=8, buf_size=0x00007ffeea44c14c) at odb.c:10232 [opt]
> > >     frame #13: 0x00000001057e1b58 odbedit`db_get_record1(hDB=1, hKey=141848, data=0x00007ffeea44c320, buf_size=0x00007ffeea44c2a8, align=0, rec_str="[.]\nWrite system message = BOOL : 
> > > y\nWrite Elog message = BOOL : n\nSystem message interval = INT : 60\nSystem message last = DWORD : 0\nExecute command = STRING : [256] \nExecute interval = INT : 0\nExecute last = DWORD : 
> > > 0\nStop run = BOOL : n\nDisplay BGColor = STRING : [32] red\nDisplay FGColor = STRING : [32] black\n\n") at odb.c:10390 [opt]
> > >     frame #14: 0x00000001057f1de3 odbedit`al_trigger_class(alarm_class="Warning", alarm_message="This is an example alarm", first=YES) at alarm.c:389 [opt]
> > >     frame #15: 0x00000001057f19e8 odbedit`al_trigger_alarm(alarm_name="Example alarm", alarm_message="This is an example alarm", default_class="Warning", cond_str="", type=<unavailable>) at 
> > > alarm.c:310 [opt]
> > >     frame #16: 0x00000001057f2c4e odbedit`al_check at alarm.c:655 [opt]
> > >     frame #17: 0x00000001057b9f88 odbedit`cm_periodic_tasks at midas.c:5066 [opt]
> > >     frame #18: 0x00000001057ba26d odbedit`cm_yield(millisec=100) at midas.c:5137 [opt]
> > >     frame #19: 0x00000001057a30b8 odbedit`cmd_idle() at odbedit.cxx:1238 [opt]
> > >     frame #20: 0x00000001057a92df odbedit`cmd_edit(prompt="[local:javascript1:S]/>", cmd=<unavailable>, dir=(odbedit`cmd_dir(char*, int*) at odbedit.cxx:705), idle=(odbedit`cmd_idle() at 
> > > odbedit.cxx:1233))(char*, int*), int (*)()) at cmdedit.cxx:235 [opt]
> > >     frame #21: 0x00000001057a3863 odbedit`command_loop(host_name="", exp_name="javascript1", cmd="", start_dir=<unavailable>) at odbedit.cxx:1435 [opt]
> > >     frame #22: 0x00000001057a8664 odbedit`main(argc=1, argv=<unavailable>) at odbedit.cxx:2997 [opt]
> > >     frame #23: 0x00007fff6c17e015 libdyld.dylib`start + 1
  562   18 Feb 2009 Konstantin OlchanskiInfoodbc sql history mlogger update
> mhttpd and mlogger have been updated with potentially troublesome changes.
> These new features are now available:
> - a "feature complete" implementation of "history in an SQL database".

The mlogger SQL history driver has been updated with improvements that make this new system usable in 
production environment: the silly "create all tables on startup, every time, even if they already exist" is fixed,
mlogger survives restarts of mysqld and checks that existing sql columns have data types compatible with the 
data we are trying to write.

There are still a few trouble spots remaining. For example, in mapping midas names into sql names (sql names 
have more restrictions on permitted characters) and in reverse mapping of sql data types to midas data types. 
To properly solve this, I may have to save the midas names and data types into an additional index table.

Included is the mh2sql utility for importing existing history files into an SQL database (in the same way as if 
they were written into the database by mlogger).

The mhttpd side of this system still needs polishing, but should be already fully functional.

A preliminary version of documentation for this new SQL history system is here. After additional review and 
editing it will be committed to the midas midox documentation. Included are full instructions on enabling 
writing of midas history into a MySQL database.
http://ladd00.triumf.ca/~olchansk/midas/Internal.html#History_sql_internal

svn revision 4452
K.O.
  575   07 May 2009 Konstantin OlchanskiBug Reportodbedit bad ctrl-C
When using "/bin/bash" shell, if I exit odbedit (and other midas programs) using ctrl-C, the terminal 
enters a funny state, "echo" is turned off (I cannot see what I type), "delete" key does not work (echoes 
^H instead).

This problem does not happen if I exit using the "exit" command or if I use the "/bin/tcsh" shell.

When this happens, the terminal can be restored to close to normal state using "stty sane", and "stty 
erase ^H".

The terminal is set into this funny state by system.c::getchar() and normal settings are never restored 
unless the midas program calls getchar(1) at the end. If the program does not finish normally, original 
terminal settings are never restored and the terminal is left in a funny state.

It is not clear why the problem does not happen with /bin/tcsh - perhaps they restore sane terminal 
settings automatically for us.
K.O.
  588   04 Jun 2009 Stefan RittBug Reportodbedit bad ctrl-C
> When using "/bin/bash" shell, if I exit odbedit (and other midas programs) using ctrl-C, the terminal 
> enters a funny state, "echo" is turned off (I cannot see what I type), "delete" key does not work (echoes 
> ^H instead).
> 
> This problem does not happen if I exit using the "exit" command or if I use the "/bin/tcsh" shell.
> 
> When this happens, the terminal can be restored to close to normal state using "stty sane", and "stty 
> erase ^H".
> 
> The terminal is set into this funny state by system.c::getchar() and normal settings are never restored 
> unless the midas program calls getchar(1) at the end. If the program does not finish normally, original 
> terminal settings are never restored and the terminal is left in a funny state.
> 
> It is not clear why the problem does not happen with /bin/tcsh - perhaps they restore sane terminal 
> settings automatically for us.
> K.O.

Who uses bash ??? And who keeps baning on Ctrl-C, when there is a nice "exit" command ;-)

Well, I implemented a simple CTRL-C handler in odbedit (Rev. 4503) which resets the terminal before exiting. 
Give it a try. Of course this cannot catch a hard kill (-9), but CTRL-C works now correctly under bash at 
least.
ELOG V3.1.4-2e1708b5