17 Nov 2003, Pierre-André Amaudruz, , Lazylogger application
|
- Remove temporary "/Programs/Lazy" creation.
- Fix Rate calculation for Web display.
- Change FTP channel description (see help). |
31 Oct 2003, Konstantin Olchanski, , more odb "run number" error checking
|
I added error checking to the places where we read "/runinfo/run number". In
general, I do this:
status = db_get_value("/runinfo/run number",&run_number);
assert(status==SUCCESS);
assert(run_number >= 0); (and run_number>0, where appropriate)
Here is the rationale: if we cannot read the run number, something must be
very terribly wrong. I cannot think of any recovery action other than
abort() and make a core dump for our debugging enjoyment.
I considered and rejected adding a "retry" loop: if we allow db_get_value()
to intermittently fail, then its every use has to be wrapped in a retry
loop, which then should be inside db_get_value(), making it pointless to
have external "retry" loops.
I am now pondering on proposing a "db_get_value_cannot_possibly_fail()"
function (it would abort(), exit() with an error or commit harakiri if it
can't get the value). The way most db_xxx() functions are used in midas,
maybe they should be made "void" and "unfailible", with "STATUS
db_xxx_yes_I_can_fail_and_return_an_error_code()" evil twins. I guess this
is why "they" invented C/C++ exceptions. Anyway, something to think about.
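A minimal sketch of the "cannot possibly fail" wrapper idea, for illustration only. fake_db_get_int() is a hypothetical stand-in for db_get_value() (the real function takes more arguments), and get_int_or_die(), FAKE_SUCCESS and FAKE_NO_KEY are invented names:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_SUCCESS 1   /* stand-in for the success status code */
#define FAKE_NO_KEY  0   /* hypothetical failure code */

/* Hypothetical stand-in for db_get_value(): it knows exactly one key. */
static int fake_db_get_int(const char *path, int *value)
{
   if (strcmp(path, "/runinfo/run number") == 0) {
      *value = 211;
      return FAKE_SUCCESS;
   }
   return FAKE_NO_KEY;
}

/* Infallible variant: either returns the value or aborts with a core dump. */
int get_int_or_die(const char *path)
{
   int value = 0;
   int status = fake_db_get_int(path, &value);

   if (status != FAKE_SUCCESS) {
      fprintf(stderr, "cannot read \"%s\" (status %d), aborting\n", path, status);
      abort();
   }
   return value;
}
```

Callers of get_int_or_die() need no status check at all; callers that can recover from a failure would keep using the status-returning variant.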
Affected files:
src/lazylogger.c
src/odbedit.c
src/mlogger.c
src/mfe.c
src/odb.c
src/mana.c
src/midas.c
src/mhttpd.c
K.O. |
01 Nov 2003, Stefan Ritt, , more odb
|
> I added error checking to the places where we read "/runinfo/run number". In
> general, I do this:
> Affected files:
> src/lazylogger.c
> src/odbedit.c
> src/mlogger.c
> src/mfe.c
> src/odb.c
> src/mana.c
> src/midas.c
> src/mhttpd.c
Now YOU broke the system by editing all these files with something I consider
temporary debugging code. A run number of zero is *VALID*. If I want to make
sure a new experiment starts with run number #1, I put a run number of 0 into
the ODB. So on the first start the number is incremented by one which results
in run number one. So please remove those checks which prevent me from
doing that. Again, your "run number zero" problem is somehow specific to your
environment, and I would not put all these tests into the distribution,
because this can have side effects, like that one I described above.
- Stefan |
01 Nov 2003, Konstantin Olchanski, , more odb
|
> > I added error checking to the places where we read "/runinfo/run number".
> Now YOU broke the system by editing all these files with something I consider
> temporary debugging code. A run number of zero is *VALID*.
I think I broke nothing. I do know that run number 0 is a valid odb value. Here
is an audit of all places where I abort on invalid run numbers:
mana.c: line 3676: assert(current_run_number > 0);
we take the run number from an event and write it into ODB. Events cannot have
run number negative or zero.
mana.c:analyze_run(): line 4632: assert(run_number > 0);
we are asked to analyze run "run_number". zero or negative is not valid.
midas.c:assert(run_number > old_run_number);
midas.c:assert(run_number > 1);
this code is not in CVS.
odbedit.c: line 2563: assert(old_run_number >= 0);
run number zero is valid
odbedit.c: line 2641: assert(new_run_number > 0);
starting a new run number zero is not valid
mfe.c: line 1786: if (run_number<=0) cm_msg(MERROR, "main", "aborting on attempt
to use invalid run number %d", run_number);
auto restart from run 0 to 1 is not valid
midas.c: line 3917: if (run_number<=0) cm_msg(MERROR, "cm_transition", "aborting
on attempt to use invalid run number %d",run_number);
transition to run zero or negative is not valid
midas.c: line 16101: if (run_number<0) cm_msg(MERROR, "el_submit", "aborting on
attempt to use invalid run number %d", run_number);
negative run numbers are not valid
mlogger.c: line 3301: if (run_number<=0) cm_msg(MERROR, "main", "aborting on
attempt to use invalid run number %d", run_number);
auto restart from run 0 to run 1 is not valid
K.O. |
14 Nov 2003, Stefan Ritt, , more odb
|
Ok, I apologize. It's all ok. Thanks for clarifying. Concerning the asserts, it
would be nice to be able to disable them in release code. Under Windows, the
assert() is actually a macro which expands to zero if NDEBUG is defined. I
believe it's the same under linux, but I don't know about VxWorks. So we have
three options:
1) Keep asserts always. This might possibly slow down a DAQ system, but I'm not
sure how much. Might be negligible.
2) Disable asserts by default (standard make). Only the "experts" can enable it
in the make file (by removing NDEBUG), since only they know what to do with the
assertion messages.
3) Let the user decide on the standard installation. Maybe have two libraries,
one debug, one no-debug. The no-debug can even have the compiler optimization
disabled, which makes debugging easier.
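To make the trade-off concrete: the standard assert() is compiled out when NDEBUG is defined, so option 1 needs a check that ignores NDEBUG. A rough sketch follows, assuming a hand-written macro; ALWAYS_CHECK, set_check_handler() and the two handlers are invented names, not part of midas:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*check_handler_t)(const char *expr, const char *file, int line);

static int failed_checks = 0;

/* Default action: report and dump core, like a failed assert(). */
static void abort_handler(const char *expr, const char *file, int line)
{
   fprintf(stderr, "%s:%d: check failed: %s\n", file, line, expr);
   abort();
}

/* Alternative action: only count the failure (useful for testing). */
void counting_handler(const char *expr, const char *file, int line)
{
   (void) expr;
   (void) file;
   (void) line;
   failed_checks++;
}

static check_handler_t check_handler = abort_handler;

void set_check_handler(check_handler_t h)
{
   check_handler = (h != NULL) ? h : abort_handler;
}

/* Unlike assert(), this check stays active even when NDEBUG is defined. */
#define ALWAYS_CHECK(cond) \
   do { if (!(cond)) check_handler(#cond, __FILE__, __LINE__); } while (0)
```

With such a macro the "debug" and "no-debug" builds differ only in the handler installed, not in which conditions get evaluated.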
So what is your opinion (comments from others are welcome as well) of which way
to go? |
31 Oct 2003, Konstantin Olchanski, , Do not frob "/runinfo" in mhttpd.c
|
I found where we tickle the race condition in db_create_record().
1) in mhttpd.c, every time we show the status page, we call
db_create_record(hDB, 0, "/Runinfo", strcomb(runinfo_str));
2) internally db_create_record() deletes /RunInfo
3) other programs that read "/runinfo/run number" while it is deleted do not
check for the db_get_value() error code and happily get a zero run number.
Stefan fixed the race condition, and now I committed an mhttpd.c change that
only calls db_create_record(hDB, 0, "/Runinfo", strcomb(runinfo_str)); if
/runinfo does not exist. This seems to be redundant with a similar call in
cm_connect_experiment1(), called each time a new client starts up.
Files changed:
src/mhttpd.c
K.O. |
01 Nov 2003, Stefan Ritt, , Do not frob
|
> I found where we tickle the race condition in db_create_record().
>
> 1) in mhttpd.c, every time we show the status page, we call
> db_create_record(hDB, 0, "/Runinfo", strcomb(runinfo_str));
> 2) internally db_create_record() deletes /RunInfo
> 3) other programs that read "/runinfo/run number" while it is deleted do not
> check for the db_get_value() error code and happily get a zero run number.
>
> Stefan fixed the race condition, and now I committed an mhttpd.c change that
> only calls db_create_record(hDB, 0, "/Runinfo", strcomb(runinfo_str)); if
> /runinfo does not exist. This seems to be redundant with a similar call in
> cm_connect_experiment1(), called each time a new client starts up.
The reason for the db_create_record() is the following: Assume that we change
the /runinfo structure, by adding an additional variable in the future. If we
run a "new" mhttpd on an "old" experiment, the "runinfo" C structure does not
match the ODB contents. The db_create_record() ensures that the ODB structure
exactly matches the C structure. I agree with you that this can cause
potential problems. But most of them should be fixed by the additional lock()
I added recently. So other programs cannot read the run number while it is
deleted.
One could think of checking the record size, and re-creating the runinfo if
the ODB record size does not match the C record size. But this does not
prevent the potential error that some variables are reversed in order. They
are then mapped wrongly to the C runinfo structure.
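The point about record size can be illustrated with a small sketch. The two structures below are hypothetical and much smaller than the real runinfo structure; they only show that equal sizes do not imply equal layouts:

```c
#include <assert.h>
#include <stddef.h>

/* Two hypothetical versions of a record with the same fields in a different order. */
typedef struct {
   int state;
   int run_number;
} RUNINFO_OLD;

typedef struct {
   int run_number;
   int state;
} RUNINFO_NEW;

/* A size comparison cannot tell the two layouts apart. */
int sizes_match(void)
{
   return sizeof(RUNINFO_OLD) == sizeof(RUNINFO_NEW);
}

/* Comparing field offsets does detect the reordering. */
int layouts_match(void)
{
   return offsetof(RUNINFO_OLD, state) == offsetof(RUNINFO_NEW, state)
       && offsetof(RUNINFO_OLD, run_number) == offsetof(RUNINFO_NEW, run_number);
}
```

Here sizes_match() is true while layouts_match() is false, so a size-only check would silently map state onto run_number.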
I see that you work very hard now on all possible checks for the run number.
But I would not commit that and make it part of the distribution, since all
experiments at PSI for example do not have this run number problem. Run it
locally, determine the cause of your problem (the discovery of the race
condition was already very good, I'm glad that you found it, should make the
system much more stable), and we'll fix it. Putting in ASSERTs is a good idea, I
should have done it from the very beginning. But if you start now, please put
it in all other 100000 places (;-)
I would not add a db_get_value_cannot_possibly_fail() into the standard
distribution, because it probably cannot correct the initial problem and then
just will go into an infinite loop. We should tackle problems always at their
source.
If you cannot resolve your zero run number problem, do the following: There
is a cm_msg(MDEBUG, ...) which only puts a message into the shared memory,
but not in midas.log. This can be used for real time debugging. Add those
messages temporarily in db_get_value() etc. to see what is going on. As soon
as the run number goes to zero, stop all processes immediately (for example
by locking the database with db_lock_database), and then look backwards in the
sysmsg buffer to see what happened *before* the run number went to zero.
- Stefan |
01 Nov 2003, Konstantin Olchanski, , Do not frob
|
> > I found where we tickle the race condition in db_create_record().
> The reason for the db_create_record() is the following: Assume that we change
> the /runinfo structure...
I think there is a deep fundamental problem with changing data structures "on the
fly". Calling db_create_record("/runinfo") at every show_status_page() does not
fix it.
If I change the runinfo structure, rebuild, relink and restart "mhttpd", the
db_create_record("/runinfo") from cm_connect_experiment() will update the runinfo
structure in ODB. In this case, the call from show_status_page() is redundant. As
a side effect, when we do this, we break every running ODB client: they still
have the old runinfo layout. Not good...
If I change the runinfo structure, rebuild, relink and restart all applications,
*except* for mhttpd, "/runinfo" in ODB will be updated when the first updated
client connects to ODB via the db_create_record("/runinfo") from
cm_connect_experiment(). Then, the old mhttpd will restore the old layout via the
db_create_record("/runinfo") in show_status_page(), breaking everything. Not good...
If I change the runinfo structure, rebuild, relink and restart everything,
"/runinfo" in ODB will be updated when the first client connects to ODB via the
db_create_record("/runinfo") from cm_connect_experiment(). In this case, the call
from show_status_page() is redundant. This is the only corruption-free scenario.
This lack of integrity enforcement vs version skew in binary data structures is,
I think, an ODB design error. Perhaps, ODB applications should be prohibited from
direct access to ODB "C" data structures: we cannot ensure that the data layout
in the application and in ODB are the same.
> One could think of checking the record size, and re-creating the runinfo if
> the ODB record size does not match the C record size. But this does not
> prevent the potential error that some variables are reversed in order. They
> are then mapped wrongly to the C runinfo structure.
Exacto.
> I see that you work very hard now on all possible checks for the run number.
> But I would not commit that and make it part of the distribution...
This is a philosophical issue.
My checks are in line with the "design by contract" school of programming. In a
nutshell, this ideology requires that before I do anything, I should enforce the
validity of my inputs and after I am done, I should enforce the validity of my
outputs. In practice, this translates into liberal use of assert()'s *in
production code*.
To ensure that old bugs stay fixed, and that new bugs are promptly discovered, it
is essential that the "contract checks" stay in the production code forever.
But let better writers argue programming philosophy in the literature.
Personally, when hunting down bugs in unstable code, I find this technique to be
vastly superior to the more common approach of "This program has no bugs. Error
checking and assert()s are wasteful. Let's close our eyes and hope no bad things
happen to us (again)".
> But if you start now, please put [asserts] in all other 100000 places (;-)
I know that no good deed goes unpunished, but pewleeze!!!
> If you cannot resolve your zero run number problem, do the following: ...
> [lock ODB, freeze the experiment, look at log files]
This technique is obsolete. Today, we instrument the code with sanity checks
and validity tests. Then all the bugs find themselves with minimal manual
intervention.
K.O. |
31 Oct 2003, Konstantin Olchanski, , mana.c without ROOT and HBOOK
|
Stefan, why did you prohibit building mana.c without ROOT and HBOOK
support? I think such a configuration is valid and should be allowed.
Also, this prohibition broke the Midas Makefile, it now bombs building
mana.c. The Makefile is set up for building hmana.c with HBOOK support,
rmana.c with ROOT support (if ROOTSYS is set) and mana.c without HBOOK and
ROOT support (currently bombs on #error in mana.c).
K.O. |
01 Nov 2003, Stefan Ritt, , mana.c without ROOT and HBOOK
|
> Stefan, why did you prohibit building mana.c without ROOT and HBOOK
> support? I think such a configuration is valid and should be allowed.
Oops, sorry, my fault. I forgot that people use mana.c without ROOT and
HBOOK. The reason I made the change was that people forgot the -DHAVE_HBOOK
in their makefile. In that case, no HBOOK init is done in mana.c and the
first histogram booking in the user code crashes HBOOK.
So please take the #error statement out of mana.c (I'm away in two hours for
one week), but think about preventing the above mentioned problem. I don't
know any way for the makefile or mana.c to figure out if there is any HF1
call in the user code. Actually HF1 should return a "proper" error message
rather than just crashing.
One possibility is that we put an additional layer on top of the histogram
booking/filling. These macros are converted to their HBOOK or ROOT
equivalents depending on the HAVE_HBOOK/HAVE_ROOT. If neither is
present, the histogram booking macro can produce a runtime error. This has
the additional advantage that users can switch from HBOOK to ROOT without
change of their user code. |
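A rough sketch of such a macro layer, under stated assumptions: HIST_BOOK1D, hbook_backend_book1d(), root_backend_book1d() and the HIST_* codes are all invented names, and the two backend functions are hypothetical wrappers that would have to be written around the real HBOOK and ROOT calls. With neither HAVE_HBOOK nor HAVE_ROOT defined, booking reports a runtime error instead of crashing:

```c
#include <assert.h>
#include <stdio.h>

#define HIST_OK          1
#define HIST_NO_BACKEND  0

#if defined(HAVE_HBOOK)
/* hypothetical wrapper around the HBOOK booking call */
int hbook_backend_book1d(int id, const char *title, int nbins, double xmin, double xmax);
#define HIST_BOOK1D(id, title, nbins, xmin, xmax) \
   hbook_backend_book1d((id), (title), (nbins), (xmin), (xmax))
#elif defined(HAVE_ROOT)
/* hypothetical wrapper around the ROOT histogram creation */
int root_backend_book1d(int id, const char *title, int nbins, double xmin, double xmax);
#define HIST_BOOK1D(id, title, nbins, xmin, xmax) \
   root_backend_book1d((id), (title), (nbins), (xmin), (xmax))
#else
/* no histogram package compiled in: report the problem at run time */
static int no_backend_book1d(int id, const char *title)
{
   fprintf(stderr, "cannot book histogram %d \"%s\": built without HBOOK or ROOT support\n",
           id, title);
   return HIST_NO_BACKEND;
}
#define HIST_BOOK1D(id, title, nbins, xmin, xmax) no_backend_book1d((id), (title))
#endif
```

User code would call only HIST_BOOK1D(), so switching between HBOOK and ROOT becomes a matter of compiler flags.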
01 Nov 2003, Konstantin Olchanski, , mana.c without ROOT and HBOOK
|
> > Stefan, why did you prohibit building mana.c without ROOT and HBOOK
> > support? I think such a configuration is valid and should be allowed.
>
> Oops, sorry, my fault. I forgot that people use mana.c without ROOT and
> HBOOK. The reason I made the change was that people forgot the -DHAVE_HBOOK
> in their makefile. In that case, no HBOOK init is done in mana.c and the
> first histogram booking in the user code crashes HBOOK.
Ahem. There is only so much rope we can give out to prevent people from shooting
themselves in the foot...
> So please take the #error statement out of mana.c
Done.
> One possibility is that we put an additional layer on top of the histogram
> booking/filling. These macros are converted to their HBOOK or ROOT
> equivalents depending on the HAVE_HBOOK/HAVE_ROOT. If neither is
> present, the histogram booking macro can produce a runtime error. This has
> the additional advantage that users can switch from HBOOK to ROOT without
> change of their user code.
I can't think of anything other than wrapping every HBOOK call with "if
(!hbook_is_initialized) initialize_hbook();". But then, where is PAWC
coming from anyway?!?
We could also print a warning message "This mana.c has no HBOOK support. If you
see HBOOK crashes, please relink with hmana.c". Ugly, but informative, plus it
points anybody who knows how to read towards a solution.
K.O. |
31 Oct 2003, Konstantin Olchanski, , Disable "tab"s in xemacs
|
The default C indentation style in xemacs uses "tab" characters, violating
the MIDAS coding convention. To disable this misfeature in xemacs (emacs
too?), put this incantation in your .xemacs/custom.el file:
(custom-set-variables
'(indent-tabs-mode nil))
K.O. |
30 Oct 2003, Stefan Ritt, , Fixed several potential problems for ODB corruption
|
I just realized that db_set_value, db_set_data, db_set_num_values and
db_merge_data do not check for num_values == 0. With such a parameter the
ODB can become corrupted, since zero length ODB entries are not allowed. I
fixed the according places in odb.c and committed the changes. Everyone
with ODB corruption problems should update that code. |
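For illustration, a sketch of the kind of guard described above. FAKE_ODB_KEY and fake_set_data() are hypothetical stand-ins and do not reproduce the real odb.c code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified stand-in for an ODB key with inline data. */
typedef struct {
   char data[64];
   int item_size;
   int num_values;
} FAKE_ODB_KEY;

/* Returns 1 on success, 0 if the arguments would create an invalid entry. */
int fake_set_data(FAKE_ODB_KEY *key, const void *data, int item_size, int num_values)
{
   if (key == NULL || data == NULL)
      return 0;
   if (item_size <= 0 || num_values <= 0)
      return 0;                 /* zero length entries are not allowed */
   if (num_values > (int) sizeof(key->data) / item_size)
      return 0;                 /* would not fit into the data area */

   memcpy(key->data, data, (size_t) item_size * (size_t) num_values);
   key->item_size = item_size;
   key->num_values = num_values;
   return 1;
}
```

The rejected call leaves the key untouched, which is the property that keeps the database consistent.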
30 Oct 2003, Stefan Ritt, , 'umask' added to lazylogger for FTP connections
|
I had to add a 'umask' option to the loggers (lazy and mlogger) for the new
PSI archive. One can now put a filename into the settings like:
archive,21,user,pw,dir,run%05d.mid,026
where the optional last parameter is used for a "umask 026" command just
sent to the FTP server after the connection has been established. This
changes the mode bits of the newly transferred file. We needed that so that
the files are group readable, since several people from one group want to
read the data.
I committed mlogger.c and ybos.c which contains the ftp code (should
actually go into lazylogger.c instead of ybos.c). |
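A small sketch of how a channel string with the optional trailing umask could be split. This is not the code from ybos.c; the field meanings (host, port, user, password, directory, filename) are an assumption read off the example above, and FTP_CHANNEL and parse_ftp_channel() are invented names:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 7
#define FIELD_LEN  128

typedef struct {
   char host[FIELD_LEN];
   int  port;
   char user[FIELD_LEN];
   char password[FIELD_LEN];
   char directory[FIELD_LEN];
   char filename[FIELD_LEN];
   int  has_umask;              /* 1 if the optional 7th field was given */
   char umask_str[FIELD_LEN];
} FTP_CHANNEL;

/* Returns 1 on success, 0 on a malformed channel string. */
int parse_ftp_channel(const char *spec, FTP_CHANNEL *ch)
{
   char field[MAX_FIELDS][FIELD_LEN];
   const char *p = spec;
   int n = 0;

   memset(ch, 0, sizeof(*ch));
   memset(field, 0, sizeof(field));

   for (;;) {
      const char *comma = strchr(p, ',');
      size_t len = comma ? (size_t) (comma - p) : strlen(p);

      if (n == MAX_FIELDS || len >= FIELD_LEN)
         return 0;              /* too many fields or field too long */
      memcpy(field[n], p, len);
      field[n][len] = '\0';
      n++;
      if (comma == NULL)
         break;
      p = comma + 1;
   }
   if (n < MAX_FIELDS - 1)
      return 0;                 /* the first six fields are mandatory */

   /* all buffers have the same size, so strcpy() cannot overflow here */
   strcpy(ch->host, field[0]);
   ch->port = atoi(field[1]);
   strcpy(ch->user, field[2]);
   strcpy(ch->password, field[3]);
   strcpy(ch->directory, field[4]);
   strcpy(ch->filename, field[5]);
   if (n == MAX_FIELDS) {
      ch->has_umask = 1;
      strcpy(ch->umask_str, field[6]);
   }
   return 1;
}
```

When has_umask is set, the logger would send "umask" followed by umask_str to the server right after login; otherwise the command is skipped.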
16 Oct 2003, David Morris, , Updated thread functions
|
ss_thread_create now returns the thread ID on success, and zero on failure.
Previously returned SS_SUCCESS or SS_NO_THREAD. User must now test the
return value to determine result.
ss_thread_kill added to kill the passed thread ID. Returns SS_SUCCESS or
SS_NO_THREAD.
Any thread creation must be verified now, and old code must be examined to
ensure the return value is checked. |
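A sketch of the calling pattern implied by the new convention. fake_thread_create() is a hypothetical stand-in for ss_thread_create() (the real arguments are not reproduced here), and FAKE_THREAD_ID and start_worker() are invented names:

```c
#include <assert.h>

typedef unsigned long FAKE_THREAD_ID;

/* Hypothetical stand-in: returns a non-zero thread ID on success, zero on failure. */
static FAKE_THREAD_ID fake_thread_create(int should_succeed)
{
   return should_succeed ? 42UL : 0UL;
}

/* Returns 1 and stores the ID on success, returns 0 if no thread was created. */
int start_worker(int should_succeed, FAKE_THREAD_ID *id_out)
{
   FAKE_THREAD_ID id = fake_thread_create(should_succeed);

   if (id == 0)
      return 0;                 /* creation failed, nothing to kill later */
   *id_out = id;                /* keep the ID for a later kill call */
   return 1;
}
```

Old code that compares the return value against SS_SUCCESS would now misjudge the result, since a valid thread ID is an arbitrary non-zero number.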
28 Oct 2003, Stefan Ritt, , Updated thread functions
|
> ss_thread_create now returns the thread ID on success, and zero on failure.
> Previously returned SS_SUCCESS or SS_NO_THREAD. User must now test the
> return value to determine result.
>
> ss_thread_kill added to kill the passed thread ID. Returns SS_SUCCESS or
> SS_NO_THREAD.
>
> Any thread creation must be verified now, and old code must be examined to
> ensure the return value is checked.
Thank you for that post. Internally, threads are not used in midas, so there
should be no problem. Only experiments using threads explicitly should take
care. |
15 Oct 2003, Konstantin Olchanski, , test
|
test
test
test |
15 Oct 2003, Konstantin Olchanski, , test
|
> test
> test
> test
another test
K.O. |
15 Oct 2003, Stefan Ritt, , test
|
> > test
> > test
> > test
>
> another test
>
> K.O.
I got the two email notifications, if you have tried that... |
12 Oct 2003, Konstantin Olchanski, , mhttpd: add Elog text to outgoing email.
|
This commit adds the elog message text to the outgoing email message. This
functionality has been requested a long time ago, but I guess nobody got
around to implementing it, until now. I also added assert() traps for the most
common array overruns in the Elog code.
Here is the cvs diff:
Index: src/mhttpd.c
===================================================================
RCS file: /usr/local/cvsroot/midas/src/mhttpd.c,v
retrieving revision 1.252
diff -r1.252 mhttpd.c
768a769
> #include <assert.h>
3740c3741
< char mail_to[256], mail_from[256], mail_text[256], mail_list[256],
---
> char mail_to[256], mail_from[256], mail_text[10000], mail_list[256],
3921a3923,3925
> // zero out the array. needed because later strncat() does not always add the trailing '\0'
> memset(mail_text,0,sizeof(mail_text));
>
3931a3936,3945
>
> assert(strlen(mail_text) + 100 < sizeof(mail_text)); // bomb out on array overrun.
>
> strcat(mail_text+strlen(mail_text),"\n");
> // this strncat() depends on the mail_text array being zeroed out:
> // strncat() does not always add the trailing '\0'
> strncat(mail_text+strlen(mail_text),getparam("text"),sizeof(mail_text)-strlen(mail_text)-50);
> strcat(mail_text+strlen(mail_text),"\n");
>
> assert(strlen(mail_text) < sizeof(mail_text)); // bomb out on array overrun.
Index: src/midas.c
===================================================================
RCS file: /usr/local/cvsroot/midas/src/midas.c,v
retrieving revision 1.192
diff -r1.192 midas.c
604a605
> #include <assert.h>
16267a16269,16270
>
> assert(strlen(message) < sizeof(message)); // bomb out on array overrun.
K.O. |
13 Oct 2003, Stefan Ritt, , mhttpd: add Elog text to outgoing email.
|
> around to implement it, until now. I also added assert() traps for the most
> common array overruns in the Elog code.
In addition to the assert() one should use strlcat() and strlcpy() all over
the code to avoid buffer overruns. The ELOG standalone code does that already
properly.
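Since strlcat() is not part of standard C and is missing on some platforms, here is a minimal sketch of a bounded append in the same spirit. bounded_cat() is an invented helper, not the ELOG or midas implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to the string in dst without writing past dst_size bytes.
   The result is always NUL terminated.
   Returns 1 if all of src fit, 0 if it was truncated or dst was not a valid string. */
int bounded_cat(char *dst, size_t dst_size, const char *src)
{
   const char *end;
   size_t used, room, len;

   if (dst_size == 0)
      return 0;
   end = (const char *) memchr(dst, '\0', dst_size);
   if (end == NULL)
      return 0;                 /* dst is not terminated within its buffer */

   used = (size_t) (end - dst);
   room = dst_size - used - 1;  /* bytes available before the final NUL */
   len = strlen(src);

   if (len > room) {
      memcpy(dst + used, src, room);
      dst[used + room] = '\0';
      return 0;                 /* truncated */
   }
   memcpy(dst + used, src, len);
   dst[used + len] = '\0';
   return 1;
}
```

The return value lets the caller decide whether a truncated mail text is acceptable or should be reported.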
- Stefan |
13 Oct 2003, Konstantin Olchanski, , mhttpd: add Elog text to outgoing email.
|
> > around to implement it, until now. I also added assert() traps for the most
> > common array overruns in the Elog code.
>
> In addition to the assert() one should use strlcat() and strlcpy() all over
> the code to avoid buffer overruns. The ELOG standalone code does that already
> properly.
>
> - Stefan
Yes, the original authors should have used strlcat(). Now that I uncovered this source of mhttpd
memory corruption, maybe some volunteer will fix it up properly.
K.O. |
13 Oct 2003, Stefan Ritt, , mhttpd: add Elog text to outgoing email.
|
> > > around to implement it, until now. I also added assert() traps for the most
> > > common array overruns in the Elog code.
> >
> > In addition to the assert() one should use strlcat() and strlcpy() all over
> > the code to avoid buffer overruns. The ELOG standalone code does that already
> > properly.
> >
> > - Stefan
>
> Yes, the original authors should have used strlcat(). Now that I uncovered this source of mhttpd
> memory corruption, maybe some volunteer will fix it up properly.
>
> K.O.
I am the original author and will fix all that once I merged mhttpd and elog.
Due to my current task list, this will happen probably in November.
- Stefan |
12 Oct 2003, Konstantin Olchanski, , Array overruns in mhttpd.c::submit_elog()
|
While adding new functionality to submit_elog() (add the message text to the
outgoing email), I noticed that the email text is being stored into an array
of size 256, mail_text[256], without any checks for array overrun. This
cannot be good. How should this be corrected?
K.O. |
12 Oct 2003, Konstantin Olchanski, , Array overruns in mhttpd.c::submit_elog()
|
> While adding new functionality to submit_elog() (add the message text to the
> outgoing email), I noticed that the email text is being stored into an array
> of size 256, mail_text[256], without any checks for array overrun. This
> cannot be good. How should this be corrected?
> K.O.
Similar problem exists in midas.c::el_submit(). The array "message[10000]" is
easy to overrun by submitting a long elog message.
K.O. |
13 Oct 2003, Stefan Ritt, , Array overruns in mhttpd.c::submit_elog()
|
> > While adding new functionality to submit_elog() (add the message text to the
> > outgoing email), I noticed that the email text is being stored into an array
> > of size 256, mail_text[256], without any checks for array overrun. This
> > cannot be good. How should this be corrected?
> > K.O.
>
> Similar problem exists in midas.c::el_submit(). The array "message[10000]" is
> easy to overrun by submitting a long elog message.
>
> K.O.
The whole elog functionality in mhttpd will be replaced (sometime) by the
standalone ELOG package, linked against mhttpd. The ELOG functionality is
much richer and does not contain all the mentioned problems which have been
fixed there some time ago. For the time being it might however be worth
fixing the mentioned problems, but without spending too much time on it. |
13 Oct 2003, Konstantin Olchanski, , Array overruns in mhttpd.c::submit_elog()
|
> > > While adding new functionality to submit_elog() ....
>
> The whole elog functionality in mhttpd will be replaced (sometime) ...
I humbly submit that this has been the standard reply for the last 2 years since I was aware of
the "last N days does not always work" problem (just saw it again yesterday).
K.O. |
12 Oct 2003, Konstantin Olchanski, , Refuse to set run number zero
|
I am debugging the frequent problem where the run number is mysteriously
reset to zero. As a first step, I am committing changes to mhttpd.c and midas.c:
- abort on obviously corrupted "run number < 0"
- abort on cm_transition() to run 0 (the only place where the run number is
explicitly written to ODB)
- in the mhttpd "Start run" form, reject user setting the run number to <= 0.
Here is the CVS diff:
===================================================================
RCS file: /usr/local/cvsroot/midas/src/mhttpd.c,v
retrieving revision 1.253
diff -r1.253 mhttpd.c
2451a2452,2457
> if (run_number < 0)
> {
> cm_msg(MERROR, "show_elog_new", "aborting on attempt to use invalid run number %d",run_number);
> abort();
> }
>
2506a2513,2519
>
> if (run_number < 0)
> {
> cm_msg(MERROR, "show_elog_new", "aborting on attempt to use invalid run number %d",run_number);
> abort();
> }
>
3582a3596,3602
>
> if (run_number < 0)
> {
> cm_msg(MERROR, "show_form_query", "aborting on attempt to use invalid run number %d",run_number);
> abort();
> }
>
5730a5751,5756
> if (rn < 0) // value "zero" is okey
> {
> cm_msg(MERROR, "show_start_page", "aborting on attempt to use invalid run number %d",rn);
> abort();
> }
>
9684a9711,9719
> if (i <= 0)
> {
> cm_msg(MERROR, "interprete", "Start run: invalid run number %d",i);
> memset(str,0,sizeof(str));
> snprintf(str,sizeof(str)-1,"Invalid run number %d",i);
> show_error(str);
> return;
> }
>
Index: src/midas.c
===================================================================
RCS file: /usr/local/cvsroot/midas/src/midas.c,v
retrieving revision 1.193
diff -r1.193 midas.c
3786c3786
< status = cm_transition(_requested_transition | TR_DEFERRED, 0, str, 256, SYNC, FALSE);
---
> status = cm_transition(_requested_transition | TR_DEFERRED, 0, str, sizeof(str), SYNC, FALSE);
3906a3907,3912
> if (run_number <= 0)
> {
> cm_msg(MERROR, "cm_transition", "aborting on attempt to use invalid run number %d",run_number);
> abort();
> }
>
16069a16076,16081
> }
>
> if (run_number < 0)
> {
> cm_msg(MERROR, "el_submit", "aborting on attempt to use invalid run number %d", run_number);
> abort();
K.O. |
12 Oct 2003, Konstantin Olchanski, , Refuse to set run number zero
|
> I am debugging the frequent problem where the run number is mysteriously
> reset to zero. As a first step, I am committing changes to mhttpd.c and midas.c:
> - abort on obviously corrupted "run number < 0"
> - abort on cm_transition() to run 0 (the only place where the run number is
> explicitly written to ODB)
> - in the mhttpd "Start run" form, reject user setting the run number to <= 0.
- abort on cm_transition() from run 0 to 1 during auto restart in mlogger.
Cvs diff:
RCS file: /usr/local/cvsroot/midas/src/mlogger.c,v
retrieving revision 1.65
diff -r1.65 mlogger.c
3277a3278,3283
> if (run_number <= 0)
> {
> cm_msg(MERROR, "main", "aborting on attempt to use invalid run number %d", run_number);
> abort();
> }
>
K.O. |
11 Aug 2003, Konstantin Olchanski, , mhttpd crash on corrupted ODB /RunInfo
|
Invalid values of ODB /RunInfo/State cause mhttpd crash in
show_status_page() because of an out of bounds access to the array of state
names. Suggest this fix: remove array of state names, use existing ladder of
if/else statements to explicitly set state name. Verified the fix works for
TWIST. Will commit this into MIDAS CVS unless I get feedback.
src/mhttpd.c:show_status_page() {
...
rsprintf("<tr align=center><td>Run #%d", runinfo.run_number);
if (runinfo.state == STATE_STOPPED)
rsprintf("<td colspan=1 bgcolor=#FF0000>Stopped");
else if (runinfo.state == STATE_PAUSED)
rsprintf("<td colspan=1 bgcolor=#FFFF00>Paused");
else if (runinfo.state == STATE_RUNNING)
rsprintf("<td colspan=1 bgcolor=#00FF00>Running");
else
rsprintf("<td colspan=1 bgcolor=#FFFFFF>Unknown");
if (runinfo.requested_transition)
...
K.O. |
10 Oct 2003, Konstantin Olchanski, , mhttpd crash on corrupted ODB /RunInfo
|
There was no feedback. This code has been committed. K.O.
> Invalid values of ODB /RunInfo/State cause mhttpd crash in
> show_status_page() because of an out of bounds access to the array of state
> names. Suggest this fix: remove array of state names, use existing ladder of
> if/else statements to explicitly set state name. Verified the fix works for
> TWIST. Will commit this into MIDAS CVS unless I get feedback.
>
> src/mhttpd.c:show_status_page() {
> ...
> rsprintf("<tr align=center><td>Run #%d", runinfo.run_number);
>
> if (runinfo.state == STATE_STOPPED)
> rsprintf("<td colspan=1 bgcolor=#FF0000>Stopped");
> else if (runinfo.state == STATE_PAUSED)
> rsprintf("<td colspan=1 bgcolor=#FFFF00>Paused");
> else if (runinfo.state == STATE_RUNNING)
> rsprintf("<td colspan=1 bgcolor=#00FF00>Running");
> else
> rsprintf("<td colspan=1 bgcolor=#FFFFFF>Unknown");
>
> if (runinfo.requested_transition)
> ...
>
> K.O. |
02 Sep 2003, Pierre-André Amaudruz, , minor fix, window build
|
- makefile.nt (/examples/experiment, /hbook)
adjusted for local hmana.obj build as for rmana.obj, add cvs tag for
revision comment entry.
- drivers/class/hv.c
change comment // to /* */ |
27 Aug 2003, Pierre-André Amaudruz, , Operation under 1.9.3 with the analyzer
|
1) Prior to upgrading midas to 1.9.3, make sure you've saved your ODB in ASCII
format using "odbedit> save my_odb.odb", as the internal structure is
incompatible with previous version. You will be able to restore it once
the new odb is up using "odbedit> load my_odb.odb".
2) Since version 1.9.2, the analyzer supports ROOT and PAW packages.
The general Midas makefile builds the analyzer core system mana.c
differently depending on presence of the environment variable $ROOTSYS.
In the case $ROOTSYS is not defined, the Makefile will create:
~/os/lib/mana.o, built for NO HBOOK calls.
~/os/lib/hmana.o, built with HBOOK calls for PAW analyzer
(requires /cern/pro/lib to be present).
In the case $ROOTSYS is defined and pointing to a valid root directory:
~/os/lib/mana.o, built for NO HBOOK calls.
~/os/lib/rmana.o, built for ROOT analyzer.
3) Since 1.9.2, the ~/examples/experiment contains the ROOT
analyzer example instead of HBOOK. The local Makefile uses the source
examples and the ~/os/lib/rmana.o for building the final user
application.
The previous HBOOK(PAW) analyzer has been moved into the ~/examples/hbookexpt
directory. The analyzer is built using the ~/os/lib/hmana.o
4) A new application "rmidas" is available when the system is built with
ROOT support. This application is an initial "pure" ROOT GUI implementing
TSocket for remote ROOT histogram display.
Once an ONLINE ROOT analyzer is up and running, by invoking "rmidas"
you will be prompted for a host name. Enter the node name hosting the
analyzer. You will be presented with a list of histograms which can
be displayed in a ROOT frame environment (see attachment).
5) The support of ROOT is also available for the logger by changing
the data format and the destination file name in the ODB structure.
This option will save to file the Midas banks converted into a ROOT Tree.
This file can be opened with ROOT (see attachment).
------- ODB structure of /Logger/Channels/0/Settings
[local:midas:R]Settings>ls
Active y
Type Disk
Filename run%05d.root <<<<<<<<< new extension
Format ROOT <<<<<<<<< new format
Compression 0
ODB dump y
Log messages 0
Buffer SYSTEM
Event ID -1
Trigger mask -1
Event limit 0
Byte limit 0
Tape capacity 0
Subdir format
Current filename run00211.root
-------
. |
19 Aug 2003, Pierre-André Amaudruz, , minor fixes, new tarball 1.9.3-1
|
- add pthread lib to examples/... makefile
- fix ybos_simfe.c for max_event_size
- fix camacnul.c for cam_inhibit_test(), cam_interrupt_test()
- update documentation (1.9.3)
- made midas-1.9.3-1.tar.gz on Triumf site |
29 Jul 2003, Konstantin Olchanski, , Have to link with -lpthread?
|
It appears that all midas applications are now required to link with the
pthreads library even if they do not use threads. This is caused by a
pthread_create() call from ss_thread_create() in system.c.
Is this the intended behaviour?
K.O. |
30 Jul 2003, David Morris, , Have to link with -lpthread?
|
The change is required to support implementation of pthreads in the Linux
compile of Midas. This was added recently. I believe pthreads is also needed
for ROOT based compiles.
David
> It appears that all midas applications are now required to link with the
> pthreads library even if they do not use threads. This is caused by a
> pthread_create() call from ss_thread_create() in system.c.
>
> Is this the intended behaviour?
>
> K.O. |
26 Jul 2003, Konstantin Olchanski, , use "odbedit -C" to connect to corrupted ODB
|
Add switch "-C" to odbedit to allow it to connect to corrupted ODB. Then,
depending on corruption, the user can manually remove or correct the
corrupted entries. Also, some corruption is automatically fixed by "odbedit"
itself. I use this functionality to debug and fix broken ODBs.
K.O.
For your enjoyment, here is the diff:
diff -r1.64 odbedit.c
3058a3059
> BOOL corrupted;
3063c3064
< debug = cmd_mode = FALSE;
---
> debug = corrupted = cmd_mode = FALSE;
3077a3079,3080
> else if (argv[i][0] == '-' && argv[i][1] == 'C')
> corrupted = TRUE;
3104c3107,3108
< printf(" [-c Command] [-c @CommandFile] [-s size] [-g (debug)]\n\n");
---
> printf(" [-c Command] [-c @CommandFile] [-s size]\n");
> printf(" [-g (debug)] [-C (connect to corrupted ODB)]\n\n");
3123c3127,3133
< if (status != CM_SUCCESS)
---
> else if ((status == DB_INVALID_HANDLE)&&corrupted)
> {
> cm_get_error(status, str);
> puts(str);
> printf("ODB is corrupted, connecting anyway...\n");
> }
> else if (status != CM_SUCCESS) |
26 Jul 2003, Konstantin Olchanski, , more ODB checks in src/odb.c
|
Add more checks to db_validate_key() for pkey->total_size, item_size and
num_values. Automatically correct total_size to be item_size*num_values (we
saw this corruption and tested this fix).
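The check-and-correct logic can be sketched in isolation as follows. The
struct is a hypothetical, reduced stand-in for the ODB key holding only the
three fields named above, max_size plays the role of pheader->key_size, and
the return codes are chosen for this sketch; this is not the real
db_validate_key():

```c
/* Hypothetical, reduced stand-in for an ODB key: only the fields
   discussed above. The real KEY structure has more members. */
typedef struct {
   int total_size;
   int item_size;
   int num_values;
} MINI_KEY;

/* Returns 0 if a field is out of range (key rejected), 1 if the key is
   consistent, 2 if total_size was inconsistent and has been corrected.
   max_size plays the role of pheader->key_size. */
int mini_validate_key(MINI_KEY *pkey, int max_size)
{
   if (pkey->total_size < 0 || pkey->total_size > max_size)
      return 0;
   if (pkey->item_size < 0 || pkey->item_size > max_size)
      return 0;
   if (pkey->num_values < 0 || pkey->num_values > max_size)
      return 0;

   /* compute the product in a wider type so it cannot overflow int */
   long long expected = (long long) pkey->item_size * pkey->num_values;
   if (expected > max_size)
      return 0;   /* extra guard added in this sketch, not in the diff */

   if (pkey->total_size != expected) {
      pkey->total_size = (int) expected;   /* automatic correction */
      return 2;
   }
   return 1;
}
```

For a key with total_size=10, item_size=4 and num_values=3, the first call
corrects total_size to 12 and a second call reports the key as consistent.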
K.O.
For your enjoyment, here is the diff:
RCS file: /usr/local/cvsroot/midas/src/odb.c,v
retrieving revision 1.64
diff -r1.64 odb.c
718a719,744
> /* check key sizes */
> if ((pkey->total_size < 0)||(pkey->total_size > pheader->key_size))
> {
> cm_msg(MERROR, "db_validate_key", "Warning: invalid key \"%s\" total_size: %d", path, pkey->total_size);
> return 0;
> }
>
> if ((pkey->item_size < 0)||(pkey->item_size > pheader->key_size))
> {
> cm_msg(MERROR, "db_validate_key", "Warning: invalid key \"%s\" item_size: %d", path, pkey->item_size);
> return 0;
> }
>
> if ((pkey->num_values < 0)||(pkey->num_values > pheader->key_size))
> {
> cm_msg(MERROR, "db_validate_key", "Warning: invalid key \"%s\" num_values: %d", path, pkey->num_values);
> return 0;
> }
>
> /* check and correct key size */
> if (pkey->total_size != pkey->item_size*pkey->num_values)
> {
> cm_msg(MINFO, "db_validate_key", "Warning: corrected key \"%s\" size: total_size=%d, should be %d*%d=%d", path, pkey->total_size, pkey->item_size, pkey->num_values, pkey->item_size*pkey->num_values);
> pkey->total_size = pkey->item_size*pkey->num_values;
> }
> |
02 Jul 2003, Pierre-André Amaudruz, , Midas/ROOT Analyser situation
|
The current and future situation of the Midas analyzer is summarized in the
attachment below.
Box explanation:
================
Front end:
---------
Midas code for accessing/gathering the hardware information into the Midas
format.
Midas SHM:
---------
Midas back end shared memory where the front end data are sent to.
mlogger:
-------
Data logger collecting the midas events and storing them on a physical
logging device (Disk, Tape)
Midas Analyzer:
--------------
Midas client for event-by-event analysis. Incoming data can be either online
or offline.
mserver:
-------
Subprocess interfacing external (remote) midas clients to the centralized
data collection and database system.
PAW:
---
Standalone physics data analyzer (CERN).
ROOT:
----
Standalone Physics data analyser (CERN).
This diagram represents the data path from the Frontend to the analyzer in
online and offline mode. Each data path is annotated with a circled number
discussed below. In all cases, the data will flow from the front end
application to the midas back end data buffers which reside in a specific
shared memory for a given experiment.
Path:
(1): From the shared memory, the midas analyzer can request events directly
and process them for output to various destinations.
(2): The data logger is a specific application which stores all the data to
a storage medium such as a disk or tape. This path is specific to the
creation of the file.mid file format. The actual storage file in this .mid
format can be read back later on by the midas analyzer.
(3): The Midas analyzer has been developed originally for interfacing to the
PAW analyzer which uses its own shared memory segment for online display.
The analyzer can also save the data into a specific data format consistent
with PAW (HBOOK and Ntuples, extension .rz).
(4): Presently the data logger supports creation of the ROOT file format.
This file contains the midas event-by-event data in the form of a Tree. This
file is fully compatible with ROOT and can therefore be read by the
standard ROOT application.
(5): Like the data logger, the analyzer, receiving data from the data
buffer or reading it from a .mid file, can apply an event-by-event analysis
and on request produce a compliant ROOT file for further analysis. This
.root file can be composed of Trees as well as histograms.
(6): The possibility of ONLINE ROOT analysis has been implemented in a first
stage through the TMapFile (ROOT shared memory). While this configuration is
still in use in an experiment, the intention is to deprecate it and replace it
with the data path (7).
(7): This path uses the network socket channel to transfer data out of the
analyzer to the ROOT environment. The current analyzer has limited support
for ROOT analysis: it only publishes, on request, the built-in histograms of
the Midas analysis. No means of passing Trees is implemented yet.
(8): This path has not yet been investigated, but ROOT does provide
access to external function calls, which makes this option possible.
The ROOT framework will then perform dedicated event calls to the main midas
data buffer using the standard midas communication scheme. The data format
translation from Midas banks to ROOT format will have to be taken care of at
the user level in the ROOT environment.
Discussion:
==========
Presently the Socket communication between Midas and ROOT (7) is under
revision by Stefan Ritt and René Brun. This revision will simplify the
remote access of an object such as a histogram. For the Tree itself, the
requirement would be to implement a "ring buffer" mechanism for remote tree
requests. This is currently under discussion.
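Since the "ring buffer" design is still under discussion, the following
only illustrates the generic data structure and not the eventual Midas/ROOT
mechanism. It is a minimal sketch in C of a fixed-size ring that keeps the
most recent entries and overwrites the oldest; all names (RING, ring_push,
ring_get) and the capacity are hypothetical:

```c
#include <stddef.h>

#define RING_CAPACITY 4   /* arbitrary size for the sketch */

/* Generic fixed-size ring buffer of integers (e.g. event serial
   numbers). When full, the oldest entry is overwritten, so a reader
   always sees the most recent RING_CAPACITY entries. */
typedef struct {
   int    slot[RING_CAPACITY];
   size_t head;    /* index of the next slot to write */
   size_t count;   /* number of valid entries, <= RING_CAPACITY */
} RING;

void ring_init(RING *r)
{
   r->head = 0;
   r->count = 0;
}

void ring_push(RING *r, int value)
{
   r->slot[r->head] = value;
   r->head = (r->head + 1) % RING_CAPACITY;
   if (r->count < RING_CAPACITY)
      r->count++;
}

/* Fetch the i-th oldest entry (i = 0 is the oldest one still held).
   Returns 0 on success, -1 if i is out of range. */
int ring_get(const RING *r, size_t i, int *value)
{
   if (i >= r->count)
      return -1;
   size_t start = (r->head + RING_CAPACITY - r->count) % RING_CAPACITY;
   *value = r->slot[(start + i) % RING_CAPACITY];
   return 0;
}
```

After pushing the values 1 to 6 into this ring, the oldest entry still held
is 3 and the newest is 6.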
The path (8) has been suggested by Triumf to address small experiment setups
where only a single analyzer is required. This path minimizes the DAQ
requirements by moving all the data analysis handling to the user.
The same ROOT analysis code would be applicable to ONLINE as well as
OFFLINE analysis.
Cons:
- Necessity of publishing raw data through the network for every instance of
the remote analyzer.
- Sharing of the analysis results cannot yet be done in real time.
Pros:
- No need for an extra task for data translation (midas/root).
- A single data unpacking code, part of the user code.
- Lower CPU requirement.
Other issues:
============
- The midas analyzer currently needs the Midas shared memory in order to
run. This is a concern in particular for offline analysis, where a priori no
Midas installation is available.
- The handling of the run/analyzer parameters. Possible parameter extraction
from file.odb. |
|