Hi All,
I have found a weird one:
The following code runs in the frontend_exit() routine on the
frontend machine, which is connected to an ODB running on a
separate machine:
...
cm_msg(MINFO, __func__, "line %d", __LINE__);
cm_get_experiment_database(&hdb, NULL);
cm_msg(MINFO, __func__, "line %d", __LINE__);
status = db_find_key(hdb, 0, "/Experiment/Run Parameters", &hkey);
cm_msg(MINFO, __func__, "line %d, hkey=%d, status=%d",
       __LINE__, hkey, status);
checkstat("db_find_key returned status %d", status);
cm_msg(MINFO, __func__, "line %d", __LINE__);
status = db_close_record(hdb, hkey);
/* NOTREACHED!! The call to db_close_record above
   doesn't return! */
cm_msg(MINFO, __func__, "line %d, status=%d", __LINE__, status);
checkstat("db_close_record returned status %d", status);
checkstat is a macro that does the following:
#define checkstat(format, arg...)                     \
    do {                                              \
        if (status != DB_SUCCESS) {                   \
            cm_msg(MERROR, __func__, format, ## arg); \
            return FE_ERR_ODB;                        \
        }                                             \
    } while (0)
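The macro depends on two GCC extensions: named variadic arguments (arg...) and the ", ## arg" comma-swallowing trick. A minimal sketch of a standard C99 variant using __VA_ARGS__ follows. The MIDAS pieces are replaced by local stand-ins so it compiles on its own: cm_msg_stub, demo_step and the numeric value of FE_ERR_ODB are hypothetical placeholders, not MIDAS definitions (only DB_SUCCESS == 1 comes from this post).

```c
#include <stdarg.h>
#include <stdio.h>

/* Stand-ins for MIDAS definitions. DB_SUCCESS == 1 is stated in the
   post; FE_ERR_ODB's value here is a placeholder, not taken from mfe.h. */
#define DB_SUCCESS 1
#define FE_ERR_ODB 602

static int messages_logged = 0;  /* counts error messages, for the demo */

/* Hypothetical stub standing in for cm_msg(MERROR, routine, format, ...). */
static void cm_msg_stub(const char *routine, const char *format, ...)
{
    va_list ap;
    va_start(ap, format);
    fprintf(stderr, "[%s] ", routine);
    vfprintf(stderr, format, ap);
    fputc('\n', stderr);
    va_end(ap);
    messages_logged++;
}

/* C99 form of checkstat: __VA_ARGS__ carries the format string and any
   arguments, so no GNU ", ## arg" extension is needed. Like the original,
   it reads a variable named `status` from the enclosing scope. */
#define checkstat(...)                          \
    do {                                        \
        if (status != DB_SUCCESS) {             \
            cm_msg_stub(__func__, __VA_ARGS__); \
            return FE_ERR_ODB;                  \
        }                                       \
    } while (0)

/* Hypothetical caller: returns FE_ERR_ODB if `status` signals failure,
   otherwise falls through both checks and returns DB_SUCCESS. */
static int demo_step(int status)
{
    checkstat("db call returned status %d", status);
    checkstat("a format-only message also compiles");
    return DB_SUCCESS;
}
```

Because the whole argument list is forwarded, a call with only a format string works without the GNU extension.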
The key exists, db_find_key returns status 1 (i.e., DB_SUCCESS),
and the rest of the code goes on to run. What gets really weird is
that db_close_record _doesn't_ _return_. The code following the
NOTREACHED comment is never executed: I get the message from the
__LINE__ just before the call, but not the message after it (I
tried both cm_msg and printf). Somehow db_close_record is causing
a non-local exit, a signal, or something similar. No error message
is printed, and the frontend goes on to exit with exit code 0. But
since the rest of my frontend_exit/ODB cleanup never happens, the
ODB is left in a lost state that requires a cleanup. If I comment
out the calls to db_close_record, the rest of my frontend_exit runs
normally and the cm_disconnect_experiment() in mfe.c eventually
closes my open records correctly (I expect, anyway); this is the
workaround I am using at present. What terrifies me is that several
of my hotlinked callback routines call db_close_record when
resetting illegal values. No end of hilarity will result there...
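One way to tell which kind of non-local exit this is: mark the call as "in progress" and register an atexit() handler. An exit code of 0 with no message is consistent with something in the call path calling exit() directly, and in that case the handler would name the call. An atexit() handler does not run if the process is killed by a signal or leaves through _exit(), so silence from it also narrows things down. This is a minimal sketch. TRACE_CALL, install_exit_trace and fake_db_close_record are hypothetical names, and the last one only stands in for db_close_record so the code compiles without midas.h.

```c
#include <stdio.h>
#include <stdlib.h>

/* Text and line of the traced call currently executing, or NULL. */
static const char *call_in_progress = NULL;
static int call_line = 0;

/* atexit() handler: if the process exits while a traced call is still
   in progress, report which one. Prints nothing after a normal return. */
static void report_unfinished_call(void)
{
    if (call_in_progress != NULL)
        fprintf(stderr, "exit() reached inside %s (line %d)\n",
                call_in_progress, call_line);
}

/* Register the handler once, e.g. early in frontend_init().
   Returns 0 on success, like atexit(). */
static int install_exit_trace(void)
{
    return atexit(report_unfinished_call);
}

/* Run `call`, store its value in `result`, and keep it marked as in
   progress while it runs. If the call never returns normally (exit(),
   longjmp), call_in_progress stays set. */
#define TRACE_CALL(result, call)  \
    do {                          \
        call_in_progress = #call; \
        call_line = __LINE__;     \
        (result) = (call);        \
        call_in_progress = NULL;  \
    } while (0)

/* Hypothetical stand-in for db_close_record() so the sketch compiles alone. */
static int fake_db_close_record(int hDB, int hKey)
{
    (void)hDB;
    (void)hKey;
    return 1; /* DB_SUCCESS */
}
```

In the frontend this would read TRACE_CALL(status, db_close_record(hdb, hkey)); in place of the plain call.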
I was using the same frontend code under 1.9.2 without any
problems, and have only recently upgraded to the 1.9.3-? tarball
from PAA, so this is a 1.9.3 issue.
I have localized the weirdness to what I think is the RPC interface.
Running the nullfrontend (no CAMAC access) on the same machine that
hosts the ODB (machine "monet"), I can make the problem appear and
disappear as follows:

nullfe -h monet -e acqmonad : db_close_record gets lost
nullfe -e acqmonad          : db_close_record works as expected

With -h, the frontend reaches the ODB through mserver even though
both run on the same machine; without it, the frontend attaches to
the ODB directly.
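Since the failure shows up only on the remote path, the present workaround could be narrowed to that case. Call db_close_record on the local path, where it returns normally. On the remote path, skip it and leave the record to cm_disconnect_experiment(). This is a minimal sketch. close_record_guarded and fake_close_record are hypothetical names, and the close routine is passed in as a function pointer so the code compiles without midas.h. In a real frontend the is_remote flag would presumably come from rpc_is_remote(), which is an assumption about the 1.9.3 API. It does not help callbacks that rely on closing the record.

```c
#include <stdbool.h>

typedef int HNDLE;   /* MIDAS handles are integers */
typedef int INT;
#define DB_SUCCESS 1

typedef INT (*close_record_fn)(HNDLE hDB, HNDLE hKey);

/* Close the record only on the local (direct ODB) path. On the remote
   path the call is deliberately skipped and the open record is left
   for cm_disconnect_experiment() to clean up, as in the workaround. */
static INT close_record_guarded(HNDLE hDB, HNDLE hKey, bool is_remote,
                                close_record_fn close_record)
{
    if (is_remote)
        return DB_SUCCESS;   /* skipped, not actually closed */
    return close_record(hDB, hKey);
}

/* Hypothetical stand-in for db_close_record() that counts its calls. */
static int close_calls = 0;
static INT fake_close_record(HNDLE hDB, HNDLE hKey)
{
    (void)hDB;
    (void)hKey;
    close_calls++;
    return DB_SUCCESS;
}
```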
I've also tried the patch for the 256-byte ODB string bug, since
many of the open records have strings of that length, but that isn't
it. The only substantial-looking change to mserver from 1.9.2 to
1.9.3 is that it now ignores SIGPIPE, and that doesn't look like a
good candidate either.
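For reference, ignoring SIGPIPE changes what happens when a process writes to a socket or pipe whose reader has gone away. Instead of being killed by the signal, the process sees write() fail with -1 and errno set to EPIPE, and the caller has to handle that. The sketch below is POSIX only and uses a pipe in place of the mserver socket; write_to_closed_reader is a hypothetical helper. It also shows why a SIGPIPE death is an unlikely explanation on the frontend side: a process killed by a signal does not report exit code 0.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Write to a pipe whose read end is already closed, with SIGPIPE set to
   SIG_IGN (the same effect as the SIGPIPE ignore mentioned above).
   Returns errno from the failed write, 0 if the write unexpectedly
   succeeded, or -1 if setup failed. */
static int write_to_closed_reader(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPIPE, &sa, NULL) != 0)
        return -1;

    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);   /* reader goes away, like a dropped peer */

    ssize_t n = write(fds[1], "x", 1);
    int err = (n < 0) ? errno : 0;
    close(fds[1]);
    return err;
}
```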
Could it be that some of the #ifdef LOCAL_ROUTINES blocks that were
moved around in odb.c and elsewhere are confusing the remote call?
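For context, many ODB routines follow roughly this shape. They check whether the client is connected remotely and, if so, forward the call over RPC. The shared-memory implementation sits inside #ifdef LOCAL_ROUTINES. The sketch below only imitates that shape with hypothetical names (example_close, forward_rpc_stub, local_close, connected_remotely) and is not the actual odb.c code. It illustrates why code moved across the early return or the #ifdef boundary could change what the remote path does.

```c
#include <stdbool.h>

#define DB_SUCCESS 1
#ifndef LOCAL_ROUTINES
#define LOCAL_ROUTINES   /* library built with direct (shared-memory) access */
#endif

static bool connected_remotely = false;  /* stands in for rpc_is_remote() */
static int rpc_calls = 0, local_calls = 0;

/* Hypothetical stand-ins for the RPC forwarder and the local code. */
static int forward_rpc_stub(int hDB, int hKey)
{
    (void)hDB; (void)hKey;
    rpc_calls++;
    return DB_SUCCESS;
}

static int local_close(int hDB, int hKey)
{
    (void)hDB; (void)hKey;
    local_calls++;
    return DB_SUCCESS;
}

/* The remote case returns early via RPC; only the local case reaches
   the LOCAL_ROUTINES section. Client-side bookkeeping needed in both
   cases must sit before the early return. Moving it below (or inside
   the #ifdef) would silently change the remote behaviour. */
static int example_close(int hDB, int hKey)
{
    if (connected_remotely)
        return forward_rpc_stub(hDB, hKey);
#ifdef LOCAL_ROUTINES
    return local_close(hDB, hKey);
#else
    return DB_SUCCESS;
#endif
}
```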
Clearly the answer is to just use stable and happy 1.9.2, but the
people for whom I am working now really want to use ROOT for
an analyzer...
cheers,
.p.
Paul Knowles. phone: 41 26 300 90 64
email: Paul.Knowles@unifr.ch Fax: 41 26 300 97 47
finger me at pexppc33.unifr.ch for more contact information