09 Dec 2003, Paul Knowles, , db_close_record non-local/non-return
|
Hi All,
I have found a weird one:
The following code executes on the frontend machine in the
frontend_exit() routine, and connects to the odb running on
another separate machine:
...
cm_msg(MINFO,__func__, "line %d", __LINE__);
cm_get_experiment_database(&hdb, NULL);
cm_msg(MINFO,__func__, "line %d", __LINE__);
status = db_find_key(hdb, 0, "/Experiment/Run Parameters", &hkey);
cm_msg(MINFO,__func__, "line %d, hkey=%d, status=%d",
__LINE__, hkey, status);
checkstat("db_find_key returned status %d", status);
cm_msg(MINFO,__func__, "line %d", __LINE__);
status = db_close_record(hdb, hkey);
/* NOTREACHED!! the above call to db_close_record
doesn't return!
*/
cm_msg(MINFO,__func__, "line %d, status=%d", __LINE__, status);
checkstat("db_close_record returned status %d", status);
checkstat is a macro that does the following:
#define checkstat(format, arg...)\
do{ if(status != DB_SUCCESS) {\
cm_msg(MERROR, __func__, format, ## arg);\
return FE_ERR_ODB;}}while(0)
The key exists, and the status of the search is 1
(i.e., DB_SUCCESS) and rest of the code tries to run. What gets
really weird is that the db_close_record _doesn't_ _return_.
The code following the NOTREACHED comment just doesn't get
called. I get the message from the __LINE__ just in front
of the call, but not the message afterwards (cm_msg and printf
were tried). Somehow db_close_record is causing a non-local
exit or signal or something. No error message is printed and the
frontend continues to exit with exit code 0. But, since the rest
of my frontend_exit/odb closing doesn't happen, the odb is left in
a lost state requiring a cleanup. If I comment out the calls to
db_close_record, the rest of my frontend_exit runs normally
and the cm_disconnect_experiment() in mfe.c eventually closes my
open records correctly (I expect, anyway) and this is the present
workaround i am using. The terror i have is that several of my
hotlinked callback routines will call the close_record routine
when resetting illegal values. No end of hilarity will result there...
I was using the same code in the frontend under 1.9.2 and
have only recently upgraded to 1.9.3-? tarball from PAA and
there were no problems using the 1.9.2 code: this is a 1.9.3
issue.
I have localized the weirdness to what I think is the RPC interface.
Running the nullfrontend (no camac access) on the same machine as
hosts the ODB I can make the problem appear and disappear in the
following way:
(odb is local on machine ``monet'')
nullfe -h monet -e acqmonad : db_close_record will get lost
nullfe -e acqmonad : db_close_record works as expected.
I've tried also with the patch for the 256 byte odb string bug since
many of the open records have strings of that length, but that isn't
it. The only substancial looking change to mserver from 1.9.2 to 1.9.3
is the SIGPIPE ignore and that doesn't look like a good candidate either.
Can this be that some of the
#IFDEF LOCAL_ROUTINES
that got moved about in odb.c and others
are causing the remote call to get confused?
Clearly the answer is to just use stable and happy 1.9.2, but the
people for whom I am working now really want to use ROOT for
an analyzer...
cheers,
.p.
Paul Knowles. phone: 41 26 300 90 64
email: Paul.Knowles@unifr.ch Fax: 41 26 300 97 47
finger me at pexppc33.unifr.ch for more contact information |
12 Dec 2003, Stefan Ritt, , db_close_record non-local/non-return
|
Hi Paul,
sorry my late reply, I had to find some time for debugging your problem.
Thank you very much for the detailed description of the problem, I wish all
bug reports would be such elaborate!
You were right that there was a bug in the RPC system. The function
db_remove_open_record() got a new parameter recently, which was not changed
in the RPC call, and caused the mserver side to crash on any
db_close_record() call.
I fixed it and the update is under CVS (http://midas.psi.ch/cgi-
bin/cvsweb/midas/src/). Since you need to update many files, I wonder if I
should enable anonymous CVS read access. Does anybody know how to set this
up using "ssh" as the protocol (via CVS_RSH=ssh)?
Please note that db_close_record() is not necessary as
cm_disconnect_experiment() takes care of this, but having it there does not
hurt. |
23 Feb 2014, William Page, Forum, db_check_record() for verifying structure of ODB subtree
|
Hi,
I have been trying to use db_check_record() in order to verify that a subtree in the ODB has the correct
variables, variable order, and overall size. I'm going off the documentation
(https://midas.psi.ch/htmldoc/group__odbfunctionc.html) and use a string to compare against the ODB
structure. Since the string format is not specified for db_check_record(), I'm formatting my string
according to the db_create_record() example.
Instead of db_check_record() checking the entire ODB subtree against all the variables represented in the
string, I'm finding that only the first variable is checked. The later variables in the string can be
misspelled, out of order, or inexistent, and db_check_record() will still return 1.
Am I using db_check_record incorrectly?
Thank you for any help with this issue.
I also believe that some of the documentation for db_check_record is outdated. For example, init_string
is referenced in the documentation but isn't part of the function definition. |
10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid
|
Dear all
We just started our beam time at ILL and just found yesterday that for certain
settings of our detector the data is not saved into the .mid files. Running "mdump
-l 10" online we see the data coming in as they should. Nevertheless, if we run
"mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are
missing. Any ideas where the data could go lost?
Thanks in advance,
Ivo |
10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid
|
> Dear all
>
> We just started our beam time at ILL and just found yesterday that for certain
> settings of our detector the data is not saved into the .mid files. Running "mdump
> -l 10" online we see the data coming in as they should. Nevertheless, if we run
> "mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are
> missing. Any ideas where the data could go lost?
>
> Thanks in advance,
> Ivo
Have you checked
/Logger/Channels/0/Settings/Event ID = -1
/Logger/Channels/0/Settings/Trigger mask = -1
If these settings are not -1, they filter the data stream for certain events and trigger
masks.
Stefan |
10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid
|
> > Dear all
> >
> > We just started our beam time at ILL and just found yesterday that for certain
> > settings of our detector the data is not saved into the .mid files. Running "mdump
> > -l 10" online we see the data coming in as they should. Nevertheless, if we run
> > "mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are
> > missing. Any ideas where the data could go lost?
> >
> > Thanks in advance,
> > Ivo
>
> Have you checked
>
> /Logger/Channels/0/Settings/Event ID = -1
> /Logger/Channels/0/Settings/Trigger mask = -1
>
> If these settings are not -1, they filter the data stream for certain events and trigger
> masks.
>
> Stefan
Good morning Stefan
Both set to -1. We only have one logging channel. If we run a sequence with a few runs and the
same settings, sometimes data is in the .mid file and sometimes it is not.
Best,
Ivo |
10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid
|
> Both set to -1. We only have one logging channel. If we run a sequence with a few runs and the
> same settings, sometimes data is in the .mid file and sometimes it is not.
Then I'm running out of ideas. Things I would check:
- Are the file sizes about the same?
- When you dump the .mid file, you do you see your bank names?
This would tell you if the events are really missing or if mdump would just not find them.
But I guess without being able to debug the system at ILL I cannot be of any more help. You are the
first one reporting such a problem, so it must have to do with your local setup.
Stefan |
10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid
|
> Then I'm running out of ideas. Things I would check:
>
> - Are the file sizes about the same?
>
> - When you dump the .mid file, you do you see your bank names?
>
> This would tell you if the events are really missing or if mdump would just not find them.
>
> But I guess without being able to debug the system at ILL I cannot be of any more help. You are the
> first one reporting such a problem, so it must have to do with your local setup.
>
> Stefan
So I did a quick check. The file size is about the same (322K and 329K). When I dump the .mid I don't see
the banks. It only prints two lines with "------ Event# 0 ------" and "------ Event# 1 ------" whereas for
the file with data I get the two banks with all the data. Our online analyzer also fails to see the banks.
Is there another way to check what is in the .mid file?
Best,
Ivo |
10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid
|
> So I did a quick check. The file size is about the same (322K and 329K). When I dump the .mid I don't see
> the banks. It only prints two lines with "------ Event# 0 ------" and "------ Event# 1 ------" whereas for
> the file with data I get the two banks with all the data. Our online analyzer also fails to see the banks.
> Is there another way to check what is in the .mid file?
with "dump" I meant a true object dump like "hexdump -C run000001.mid". I produced a file with ADC0 and TDC0
banks (that's the example from the distribution under exampels/experiments/frontend.cxx), and I get
....
00024220 01 00 00 00 41 44 43 30 04 00 08 00 eb 06 35 04 |....ADC0......5.|
00024230 31 09 4f 06 54 44 43 30 04 00 08 00 93 04 fb 07 |1.O.TDC0........|
00024240 5c 09 88 0b 01 00 00 00 01 00 00 00 2a 0b 31 5f |\...........*.1_|
00024250 28 00 00 00 20 00 00 00 01 00 00 00 41 44 43 30 |(... .......ADC0|
00024260 04 00 08 00 c3 09 24 05 85 05 f3 06 54 44 43 30 |......$.....TDC0|
00024270 04 00 08 00 88 08 2d 03 3b 0d d6 02 01 00 00 00 |......-.;.......|
00024280 02 00 00 00 2a 0b 31 5f 28 00 00 00 20 00 00 00 |....*.1_(... ...|
00024290 01 00 00 00 41 44 43 30 04 00 08 00 a5 0a 69 09 |....ADC0......i.|
where you clearly see the ADC0 and TDC0 banks.
Stefan |
10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid
|
> with "dump" I meant a true object dump like "hexdump -C run000001.mid". I produced a file with ADC0 and TDC0
> banks (that's the example from the distribution under exampels/experiments/frontend.cxx), and I get
>
> ....
> 00024220 01 00 00 00 41 44 43 30 04 00 08 00 eb 06 35 04 |....ADC0......5.|
> 00024230 31 09 4f 06 54 44 43 30 04 00 08 00 93 04 fb 07 |1.O.TDC0........|
> 00024240 5c 09 88 0b 01 00 00 00 01 00 00 00 2a 0b 31 5f |\...........*.1_|
> 00024250 28 00 00 00 20 00 00 00 01 00 00 00 41 44 43 30 |(... .......ADC0|
> 00024260 04 00 08 00 c3 09 24 05 85 05 f3 06 54 44 43 30 |......$.....TDC0|
> 00024270 04 00 08 00 88 08 2d 03 3b 0d d6 02 01 00 00 00 |......-.;.......|
> 00024280 02 00 00 00 2a 0b 31 5f 28 00 00 00 20 00 00 00 |....*.1_(... ...|
> 00024290 01 00 00 00 41 44 43 30 04 00 08 00 a5 0a 69 09 |....ADC0......i.|
>
> where you clearly see the ADC0 and TDC0 banks.
>
> Stefan
So at least I learned something new. I tried it with the hexdump and the banks are not existent in the .mid file. I
only have the ODB inside the file. The 7K difference in size is actually just about what I expect to be the data
(1792 x 4 bytes)
Best, Ivo |
10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid
|
Have you tried longer files? Maybe a few 100 MB or so. Maybe a buffer is not flushed correctly at the end of a run. |
10 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid
|
> Have you tried longer files? Maybe a few 100 MB or so. Maybe a buffer is not flushed correctly at the end of a run.
Yes, I did. This 7 KB of the data bank is about the limit. If we go only 1 KB higher it seems that we save all data. In
our specific case, this is the number of time bins (256 pixels with 7 time bins results in data loss, with 8 time bins it
seems to be okay, data type is DWORD).
Of course, a workaround for us is to save at least 8 time bins and throw 7 of them away later on. Nevertheless, since we
are only in the commissioning phase now this is okay, I would just like to avoid data loss in the data taking phase of the
experiment so knowing where the problem origins could help.
I did another test with another FE running that produces a lot of data. The behavior is the same though. If the bank size
is less than about 8 KB, the bank is not saved anymore. But probably this is anyway the expected behavior since it is a
different FE that produces the data.
So if it is coming from the buffer, is there something I could change to test or solve the problem?
Best, Ivo |
10 Aug 2020, Pierre-Andre Amaudruz, Bug Report, data missing in runXXXXXX.mid
|
> > > Dear all
> > >
> > > We just started our beam time at ILL and just found yesterday that for certain
> > > settings of our detector the data is not saved into the .mid files. Running "mdump
> > > -l 10" online we see the data coming in as they should. Nevertheless, if we run
> > > "mdump -x runXXXXXX.mid" offline, the data file has no events and the banks are
> > > missing. Any ideas where the data could go lost?
> > >
> > > Thanks in advance,
> > > Ivo
> >
> > Have you checked
> >
> > /Logger/Channels/0/Settings/Event ID = -1
> > /Logger/Channels/0/Settings/Trigger mask = -1
> >
> > If these settings are not -1, they filter the data stream for certain events and trigger
> > masks.
> >
> > Stefan
>
> Good morning Stefan
>
> Both set to -1. We only have one logging channel. If we run a sequence with a few runs and the
> same settings, sometimes data is in the .mid file and sometimes it is not.
>
> Best,
> Ivo
Hi,
If the online mdump is correct (by default using the -1, -1 filter), the data are in the main SYSTEM buffer.
Similar to the dump -
The fact that the analyzer doesn't see the banks would indicate a buffer handling issue as mentioned by Stefan.
To confirm, I would check at the end of a run that the sum of the equipment "events sent" matches the logger "Events written". |
10 Aug 2020, Pierre-Andre Amaudruz, Bug Report, data missing in runXXXXXX.mid
|
> Have you tried longer files? Maybe a few 100 MB or so. Maybe a buffer is not flushed correctly at the end of a run.
Hi,
If the online mdump is correct (by default using the -1, -1 filter), the data are in the main SYSTEM buffer.
by the way, similar to the dump - mdump -x file.mid -m raw -d x will show you the events.
The fact that the analyzer doesn't see the banks would indicate a buffer handling issue as mentioned by Stefan.
To confirm, I would check at the end of a run that the sum of the equipment "events sent" matches the logger "Events written". |
10 Aug 2020, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid
|
I have to reproduce the problem to fix it. Why don't you go and modify midas/examples/experiment/frontend.cxx in such a way that
it creates exactly the banks you have, just with random data. If you see the same problem, send me your frontend file so that I
can reproduce it. |
11 Aug 2020, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid
|
> I have to reproduce the problem to fix it. Why don't you go and modify midas/examples/experiment/frontend.cxx in such a way that
> it creates exactly the banks you have, just with random data. If you see the same problem, send me your frontend file so that I
> can reproduce it.
It would be good to pin point there the data is lost. This is the sequence:
frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk
To see if correct data arrives to the SYSTEM buffer, run:
mdump -z SYSTEM
To see if mlogger is receiving events from the SYSTEM buffer, run:
mlogger -v ### mlogger should report all events, history and data
To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).
I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
mlogger is misconfigured (but you already checked that).
If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).
K.O. |
11 Aug 2020, Ivo Schulthess, Bug Report, data missing in runXXXXXX.mid
|
> It would be good to pin point there the data is lost. This is the sequence:
>
> frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk
>
> To see if correct data arrives to the SYSTEM buffer, run:
> mdump -z SYSTEM
>
> To see if mlogger is receiving events from the SYSTEM buffer, run:
> mlogger -v ### mlogger should report all events, history and data
>
> To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).
>
> I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
> if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
> mlogger is misconfigured (but you already checked that).
>
> If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
> to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).
>
> K.O.
Good evening
I tried to reproduce the behavior in a very simple FE but it did not work out. The next thing for me would be to take the FE that is producing this behavior, replace all the device communication and data with dummies. If the problem is still there I would start to simplify as much as possible.
Following the inputs of KO, I pin-pointed the data loss. The system buffer still gets the data but the mlogger does not write the data event. Then of course the data is also not anymore present in the data file. Therefore, I checked the logger settings again, Event ID and Trigger Mask still -1. Nothing else, at least from my point of view, that is misconfigured. Nevertheless, if it helps I can send my ODB settings.
When doing the tests just before I found something else that probably can give a hint to the problem. The data is only lost if the time between two runs is long (a few seconds). As an example: If I run a sequence with a loop and after the FE stops the run the loop ends and the next run is started automatically, then only the first run has no data, which is the one after a longer time of no data taking. When I add a "WAIT Seconds 5" after the run before starting the next, not data is written to the disk for any run. I also found this once when adding a sleep(1) at the end of the FE readout function but back then did not think about it any further.
Best, Ivo |
24 Mar 2022, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid
|
> > It would be good to pin point there the data is lost. This is the sequence:
> >
> > frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk
> >
> > To see if correct data arrives to the SYSTEM buffer, run:
> > mdump -z SYSTEM
> >
> > To see if mlogger is receiving events from the SYSTEM buffer, run:
> > mlogger -v ### mlogger should report all events, history and data
> >
> > To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).
> >
> > I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
> > if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
> > mlogger is misconfigured (but you already checked that).
> >
> > If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
> > to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).
> >
> > K.O.
>
> Good evening
>
> I tried to reproduce the behavior in a very simple FE but it did not work out.
> The next thing for me would be to take the FE that is producing this behavior,
> replace all the device communication and data with dummies. If the problem is still
> there I would start to simplify as much as possible.
>
> Following the inputs of KO, I pin-pointed the data loss. The system buffer still
> gets the data but the mlogger does not write the data event. Then of course the data
> is also not anymore present in the data file. Therefore, I checked the logger
> settings again, Event ID and Trigger Mask still -1. Nothing else, at least from my point of view,
> that is misconfigured. Nevertheless, if it helps I can send my ODB settings.
>
> When doing the tests just before I found something else that probably
> can give a hint to the problem. The data is only lost if the time between
> two runs is long (a few seconds). As an example: If I run a sequence with a loop
> and after the FE stops the run the loop ends and the next run is started automatically,
> then only the first run has no data, which is the one after a longer time of
> no data taking. When I add a "WAIT Seconds 5" after the run before starting
> the next, not data is written to the disk for any run. I also found this
> once when adding a sleep(1) at the end of the FE readout function
> but back then did not think about it any further.
>
Looks like this problem fell into the covid crack.
As far as I know, MIDAS does not lose any events between bm_send_event() and the shared memory
buffer. It does not lose any events in the mlogger (unless the "event request" is misconfigured).
(there is lots of opportunity to lose events in complicated frontends).
If you have some evidence otherwise, I would very much like to hear about
it and I want to fix all problems that cause it.
In your previous report I was under the impression that you lose random events here and there,
but your latest report is about mlogger not writing anything at all.
Which case is it?
If you can definitely say that all your events make it to the SYSTEM buffer
but mlogger sometimes does not see some of them and sometimes does not see all of them,
we should look very closely at bm_receive_event() and mlogger itself.
In the case where mlogger is not seeing any events at all (output file is empty), as this is
happening, I would like to see the output of mdump (to confirm events are written to SYSTEM
buffer with correct event_id and trigger_mask) and the output of (say)
"manalyzer_test.exe --dump run01161.mid.lz4" on your output file.
If the output is very long, you can email it to me directly instead of posting it here.
K.O. |
24 Mar 2022, Stefan Ritt, Bug Report, data missing in runXXXXXX.mid
|
One idea: we should have a look at mlogger::close_channels(). There the SYSTEM buffer is emptied through the cm_yield() call. Instrumenting this with some debugging code will enlighten us.
Another possible problem: If the frontend requested to be notified for a run stop AFTER the logger, then the problem might happen: Logger closes file, and THEN the frontend flushes events ending up in the SYSTEM buffer and being logged at the beginning of the next run. The mfe.cxx framework takes care of this by calling
cm_register_transition(TR_SOP, 500);
while the mlogger does
cm_register_transition(TR_STOP, tr_stop, 800);
and since 800 > 500 the logger will be called AFTER the frontend. If one use a framework different from mfe.cxx, this could however be different.
Stefan |
24 Mar 2022, Konstantin Olchanski, Bug Report, data missing in runXXXXXX.mid
|
> One idea: we should have a look at mlogger::close_channels().
> There the SYSTEM buffer is emptied through the cm_yield() call.
> Instrumenting this with some debugging code will enlighten us.
right. this will "last few events are lost at the end of run".
but that code in the mlogger was not touched in years, if there is a problem there,
we would have seen it by now, most experiments check that the number
of events in the data file is same as number of triggers generated, both
numbers are shown on the midas status page.
> Another possible problem: If the frontend requested to be notified for a run stop AFTER the logger, then the problem might happen: Logger closes file, and THEN the frontend flushes events ending up in the SYSTEM buffer and being logged at the beginning of the next run. The mfe.cxx framework takes care of this by calling
> cm_register_transition(TR_STOP, 500);
default sequence, both mfe.c frontend and c++ tmfe frontend:
start of run:
- mlogger first (configure history, open data file)
- frontends last
- (if any frontend fails, TR_STARTABORT is sent to mlogger to close the output file and "undo" the run start)
end of run:
- frontends first (must not send any events after after processing the TR_STOP RPC call, inside the TR_STOP handler, bm_flush_cache() takes care of the write cache)
- mlogger last
- (if any frontend fails, failure is ignored, run stops regardless)
wrong order will be only if they manually change it, and whatever order
they set, you see it on the midas transition page (and mtransition -v and odbedit stop now -v, etc).
K.O. |
|