Back Midas Rome Roody Rootana
  Midas DAQ System, Page 115 of 121  Not logged in ELOG logo
IDdown Date Author Topic Subject
  121   28 Oct 2003 Stefan Ritt Updated thread functions
> ss_thread_create now returns the thread ID on success, and zero on failure.
> Previously returned SS_SUCCESS or SS_NO_THREAD. User must now test the
> return value to determine result.
> 
> ss_thread_kill added to kill the passed thread ID. Returns SS_SUCCESS or
> SS_NO_THREAD.
> 
> Any thread creation must be verified now, and old code must be examined to
> ensure the return value is checked.

Thank you for that post. Internally, threads are not use in midas, so there 
should be no problem. Only experiments using threads explicitly should take 
care.
  120   16 Oct 2003 David Morris Updated thread functions
ss_thread_create now returns the thread ID on success, and zero on failure.
Previously returned SS_SUCCESS or SS_NO_THREAD. User must now test the
return value to determine result.

ss_thread_kill added to kill the passed thread ID. Returns SS_SUCCESS or
SS_NO_THREAD.

Any thread creation must be verified now, and old code must be examined to
ensure the return value is checked.
  119   30 Oct 2003 Stefan Ritt 'umask' added to lazylogger for FTP connections
I had to add a 'umask' opiton to the loggers (lazy and mlogger) for the new 
PSI archive. One can now put a filename into the settings like:

archive,21,user,pw,dir,run%05d.mid,026

where the optional last parameter is used for a "umask 026" command just 
sent to the FTP server after the connection has been established. This 
changes the mode bits of the newly transferred file. We needed that so that 
the files are group readable, since several people from one group want to 
read the data.

I committed mlogger.c and ybos.c which contains the ftp code (should 
actually go into lazylogger.c instead of ybos.c).
  118   30 Oct 2003 Stefan Ritt Fixed several potential problems for ODB corruption
I just realized that db_set_value, db_set_data, db_set_num_values and 
db_merge_data do not check for num_values == 0. With such a parameter the 
ODB can become corrupted, since zero length ODB entries are not allowed. I 
fixed the according places in odb.c and committed the changes. Everyone 
with ODB corruption problems should update that code.
  117   31 Oct 2003 Konstantin Olchanski Disable "tab"s in xemacs
The default C indentation style in xemacs uses "tab" characters, violating
the MIDAS coding convention. To disable this misfeature in xemacs (emacs
too?), put this incantation in your .xemacs/custom.el file:

(custom-set-variables
 '(indent-tabs-mode nil))

K.O.
  116   01 Nov 2003 Konstantin Olchanski mana.c without ROOT and HBOOK
> > Stephan, why did you prohibit building mana.c without ROOT and HBOOK
> > support? I think such a configuration is valid and should be allowed.
> 
> Oops, sorry, my fault. I forgto that people use mana.c without ROOT and 
> HBOOK. The reason I made the change was that people forgot the -DHVAE_HBOOK 
> in their makefile. In that case, no HBOOK init is done in mana.c and the 
> first histogram booking in the user code crashes HBOOK.

Ahem. There is only so much rope we can give out to prevent people from shooting
themselves in the foot...

> So please take the #error statement out of mana.c

Done.

> One possibility is that we put an additional layer on top of the histogram 
> boooking/filling. These macros are converted to their HBOOK or ROOT 
> equivalents depending on the HAVE_HBOOK/HAVE_ROOT. If none of both is 
> present, the histogram booking macro can produce a runtime error. This has 
> the additional advantage that users can switch from HBOOK to ROOT without 
> change of their user code.

I can't think of anything other than wrapping every HBOOK call with "if
(!hbook_is_initialized) initialize_hbook();". But then, where is PAWC
coming from anyway?!?

We could also print a warning message "This mana.c has no HBOOK support. If you
see HBOOK crashes, please relink with hmana,c". Ugly, but informative, plus it
points anybody who knows how to read towards a solution.

K.O.
  115   01 Nov 2003 Stefan Ritt mana.c without ROOT and HBOOK
> Stephan, why did you prohibit building mana.c without ROOT and HBOOK
> support? I think such a configuration is valid and should be allowed.

Oops, sorry, my fault. I forgto that people use mana.c without ROOT and 
HBOOK. The reason I made the change was that people forgot the -DHVAE_HBOOK 
in their makefile. In that case, no HBOOK init is done in mana.c and the 
first histogram booking in the user code crashes HBOOK.

So please take the #error statement out of mana.c (I'm away in two hours for 
one week), but think about preventing the above mentionend problem. I don't 
know any way for the makefile or mana.c to figure out if there is any HF1 
call in the user code. Actually HF1 should return a "proper" error message 
than just crashing.

One possibility is that we put an additional layer on top of the histogram 
boooking/filling. These macros are converted to their HBOOK or ROOT 
equivalents depending on the HAVE_HBOOK/HAVE_ROOT. If none of both is 
present, the histogram booking macro can produce a runtime error. This has 
the additional advantage that users can switch from HBOOK to ROOT without 
change of their user code.
  114   31 Oct 2003 Konstantin Olchanski mana.c without ROOT and HBOOK
Stephan, why did you prohibit building mana.c without ROOT and HBOOK
support? I think such a configuration is valid and should be allowed.

Also, this prohibition broke the Midas Makefile, it now bombs building
mana.c. The Makefile is setup for building hmana.c with HBOOK support,
rmana.c with ROOT support (if ROOTSYS is set) and mana.c without HBOOK and
ROOT support (currently bombs on #error in mana.c).

K.O.
  113   01 Nov 2003 Konstantin Olchanski Do not frob
> > I found where we tickle the race condition in db_create_record().
> The reason for the db_create_record() is the following: Assume that we change 
> the /runinfo structure...

I think there is a deep fundamental problem with changing data structures "on the
fly". Calling db_create_record("/runinfo") at every show_status_page() does not
fix it.

If I change the runinfo structure, rebuild, relink and restart "mhttpd", the
db_create_record("/runinfo") from cm_connect_experiment() will update the runinfo
structure in ODB. In this case, the call from show_status_page() is redundant. As
a side effect, when we do this, we break every running ODB client- they still
have the old runinfo layout. Not good...

If I change the runinfo structure, rebuild, relink and restart all applications,
*except* for mhttpd, "/runinfo" in ODB will be updated when the first updated
client connects to ODB via the db_create_record("/runinfo") from
cm_connect_experiment(). Then, the old mhttpd will restore the old layout via the 
db_create_record("/runinfo") in show_status_page(), breaking everything. Not good...

If I change the runinfo structure, rebuild, relink and restart everything,
"/runinfo" in ODB will be updated when the first client connects to ODB via the
db_create_record("/runinfo") from cm_connect_experiment(). In this case, the call
from show_status_page() is redundant. This is the only corruption-free scenario.

This lack of integrity enforcement vs version skew in binary data structures is,
I think, an ODB design error. Perhaps, ODB applications should be prohibited from
 direct access to ODB "C" data structures: we cannot ensure that the data layout
in the application and in ODB are the same.

> One could think of checking the record size, and re-creating the runinfo if 
> the ODB record size does not match the C record size. But this does not 
> prevent the potential error that some variable are reversed in order. They 
> are then mapped wrongly to the C runinfo structure.

Exacto.

> I see that you work very hard now on all possible checks for the run number. 
> But I would not commit that and make it part of the distribution...

This is a philosophical issue.

My checks are in line with the "design by contract" school of programming. In a
nutshell, this ideology requires that before I do anything, I should enforce the
validity of my inputs and after I am done, I should enforce the validity of my
outputs. In practice, this translates into liberal use of assert()'s *in
production code*.

To ensure that old bugs stay fixed, and that new bugs are promptly discovered, it
is essential that the "contract checks" stay in the production code forever.

But let better writers argue programming philosophy in the literature.

Personally, when hunting down bugs in unstable code, I find this technique to be
vastly superior to the more common appoach of "This program has no bugs. Error
checking and assert()s are wasteful. Let's close our eyes and hope no bad things
happen to us (again)".

> But if you start now, please put [asserts] in all other 100000 places (;-)

I know that no good deed goes unpunished, but pewleeze!!!

> If you cannot resolve your zero run number problem, do the following: ...
> [lock ODB, freeze the experiment, look at log files]

This technique is obsolete. Today, we instrument the code with sanity checks
and validity tests. Then all the bugs find themselves with minimal manual
intervention.

K.O.
  112   01 Nov 2003 Stefan Ritt Do not frob
> I found where we tickle the race condition in db_create_record().
> 
> 1) in mhttpd.c,  every time we show the status page, we call
> db_create_record(hDB, 0, "/Runinfo", strcomb(runinfo_str));
> 2) internally db_create_record() deletes /RunInfo
> 3) other programs read "/runinfo/run number" while it is deleted do not
> check for the db_get_value() error code and happily get a zero run number.
> 
> Stephan fixed the race condition, and now I commited an mhttpd.c change that
> only calls db_create_record(hDB, 0, "/Runinfo", strcomb(runinfo_str)); if
> /runinfo does not exist. This seems to be redundant with a similar call in
> cm_connect_experiment1(), called each time a new client starts up.

The reason for the db_create_record() is the following: Assume that we change 
the /runinfo structure, by adding an additional variable in the future. If we 
run a "new" mhttpd on an "old" experiment, the "runinfo" C structure does not 
match the ODB contents. The db_create_record() ensures that the ODB structure 
exactly matches the C structure. I agree with you that this can cause 
potential problems. But most of them should be fixed by the additional lock() 
I added recently. So other programs cannot read the run number while it is 
deleted.

One could think of checking the record size, and re-creating the runinfo if 
the ODB record size does not match the C record size. But this does not 
prevent the potential error that some variable are reversed in order. They 
are then mapped wrongly to the C runinfo structure.

I see that you work very hard now on all possible checks for the run number. 
But I would not commit that and make it part of the distribution, since all 
experiments at PSI for example do not have this run number problem. Run it 
locally, determine the cause of your problem (the discovery of the race 
condition was already very good, I'm glad that your found it, should make the 
system much more stable), and we'll fix it. Puttin ASSERT's is a good idea, I 
should have done it from the very beginning. But if you start now, please put 
it in all other 100000 places (;-)

I would not add a db_get_value_cannot_possibly_fail() into the standard 
distribution, because it probably cannot correct the initial problem and then 
just will go into an infinite loop. We should tackle problems always at their 
source. 

If you cannot resolve your zero run number problem, do the following: There 
is a cm_msg(MDEBUG, ...) which only puts a message into the shared memory, 
but not in midas.log. This can be used for real time debugging. Add those 
message temporarily in db_get_value() etc. to see what is going on. As soon 
as the run number goes to zero, stop all processes immediately (for example 
by locking the database with db_lock_database), and the look backwards in the 
sysmsg buffer to see what happened *before* the run number went to zero.

- Stefan
  111   31 Oct 2003 Konstantin Olchanski Do not frob "/runinfo" in mhttpd.c
I found where we tickle the race condition in db_create_record().

1) in mhttpd.c,  every time we show the status page, we call
db_create_record(hDB, 0, "/Runinfo", strcomb(runinfo_str));
2) internally db_create_record() deletes /RunInfo
3) other programs read "/runinfo/run number" while it is deleted do not
check for the db_get_value() error code and happily get a zero run number.

Stephan fixed the race condition, and now I commited an mhttpd.c change that
only calls db_create_record(hDB, 0, "/Runinfo", strcomb(runinfo_str)); if
/runinfo does not exist. This seems to be redundant with a similar call in
cm_connect_experiment1(), called each time a new client starts up.

Files changed:
src/mhttpd.c

K.O.
  110   14 Nov 2003 Stefan Ritt more odb
Ok, I apologize. It's all ok. Thanks for clearifying. Concerning the assert's, it 
would be nice to be able to disable them in release code. Under Windows, the 
assert() is actually a macro which expands to zero if NDEBUG is defined. I 
believe it's the same under linux, but I don't know about VxWorks. So we have 
three options:

1) Keep asserts always. This might possible slow down a DAQ system, but I'm not 
sure how much. Might be negligible.

2) Disable asserts by default (standard make). Only the "experts" can enable it 
in the make file (by removing NDEBUG), since only they know what to do with the 
assertation messages.

3) Let the user decide on the standard installation. Maybe have two libraries, 
one debug, one no-debug. The no-debug can even have the compiler optimization 
disabled, which makes debugging easier.

So what is your opinion (comments from others are welcome as well) of which way 
to go? 
  109   01 Nov 2003 Konstantin Olchanski more odb
> > I added error checking to the places where we read "/runinfo/run number". 
> Now YOU broke the system by editing all these files with something I consider 
> temporary debugging code. A run number of zero is *VALILD*.

I think I broke nothing. I do know that run number 0 is a valid odb value. Here
is an audit of all places where I abort on invalid run numbers:

mana.c: line 3676: assert(current_run_number > 0);
we take the run number from an event and write it into ODB. Events cannot have
run number negative or zero.

mana.c:analyze_run(): line 4632: assert(run_number > 0);
we are asked to analyze run "run_number". zero or negative is not valid.

midas.c:assert(run_number > old_run_number);
midas.c:assert(run_number > 1);
this code is not in CVS.

odbedit.c: line 2563: assert(old_run_number >= 0);
run number zero is valid

odbedit.c: line 2641: assert(new_run_number > 0);
starting a new run number zero is not valid

mfe.c: line 1786: if (run_number<=0) cm_msg(MERROR, "main", "aborting on attempt
to use invalid run number %d", run_number);
auto restart from run 0 to 1 is not valid

midas.c: line 3917: if (run_number<=0) cm_msg(MERROR, "cm_transition", "aborting
on attempt to use invalid run number %d",run_number);
transition to run zero or negative is not valid

midas.c: line 16101: if (run_number<0) cm_msg(MERROR, "el_submit", "aborting on
attempt to use invalid run number %d", run_number);
negative run numbers are not valid

mlogger.c: line 3301: if (run_number<=0) cm_msg(MERROR, "main", "aborting on
attempt to use invalid run number %d", run_number);
auto restart from run 0 to run 1 is not valid

K.O.
  108   01 Nov 2003 Stefan Ritt more odb
> I added error checking to the places where we read "/runinfo/run number". In
> general, I do this:

> Affected files:
> src/lazylogger.c
> src/odbedit.c
> src/mlogger.c
> src/mfe.c
> src/odb.c
> src/mana.c
> src/midas.c
> src/mhttpd.c

Now YOU broke the system by editing all these files with something I consider 
temporary debugging code. A run number of zero is *VALILD*. If I want to make 
sure a new experiment starts with run number #1, I put a run number of 0 into 
the ODB. So on the first start the number is incremented by one which results 
in run number from one. So please remove those checks which prevents me of 
doing that. Again, your "run number zero" problem is soemhow specific to your 
environment, and I would not put all these tests into the distribution, 
because this can have side effects, like that one I described above.

- Stefan
  107   31 Oct 2003 Konstantin Olchanski more odb "run number" error checking
I added error checking to the places where we read "/runinfo/run number". In
general, I do this:

  status = db_get_value("/runinfo/run number",&run_number);
  assert(status==SUCCESS);
  assert(run_number >= 0); (and run_number>0, where appropriate)

Here is the rationale: if we cannot read the run number, something must be
very terribly wrong. I cannot think of any recovery action other than
abort() and make a core dump for our debugging enjoyment.

I considered and rejected adding a "retry" loop: if we allow db_get_value()
to intermittently fail, then it's every use has to be wrapped in a retry
loop, which then should be inside db_get_value(), making it pointless to
have external "retry" loops.

I am now pondering on proposing a "db_get_value_cannot_possibly_fail()"
function (it would abort(), exit() with an error or commit harakiri if it
can't get the value). They way most db_xxx() functions are used in midas,
maybe they should be made "void" and "unfailible", with "STATUS
db_xxx_yes_I_can_fail_and_return_an_error_code()" evil twins. I guess this
is why "they" invented C/C++ exceptions. Anyway, something to think about.

Affected files:
src/lazylogger.c
src/odbedit.c
src/mlogger.c
src/mfe.c
src/odb.c
src/mana.c
src/midas.c
src/mhttpd.c

K.O.
  106   17 Nov 2003 Pierre-André Amaudruz Lazylogger application
- Remove temporary "/Programs/Lazy" creation.
- Fix Rate calculation for Web display.
- Change FTP channel description (see help).
  105   20 Nov 2003 Stefan Ritt Phantom
I tried to reproduce the problem, but without success. So in case this happens 
again, one should debug the code im cm_watchdog() next to the line

/* decrement notify_count for open records and clear exclusive mode */
...

So if a killed client is removed from the ODB via the watchdog (or a "cleanup" 
is done in ODBEdit), the notify_count should be decreased and thus the "open 
records" should be closed.
  104   16 Nov 2003 Stefan Ritt Phantom
I have seen the same behaviour and it annoys me, too. What I did in the past 
is a "cleanup" in ODBEdit which removes these open records. I have soem code 
in cm_watchdog(), which should take care of that. If a client is dead, it 
gets removed from the ODB, and its open records should get its notify_count 
decremented. So obviously this code has some bug. I plan to do in the 
following week (now I got some spare time) the following:

- exchange most db_create_record() by something better. Maybe 
db_check_record(..., correct_flag), which creates the record only if it does 
not exist at all, otherwise checks the structure. If correct_flag is TRUE, it 
corrects the strucure (by calling db_create_record()), if it's false it just 
returns an error code. This way one can decide from case to case which option 
is better. Like for the /Runinfo, the flag would be FALSE, maybe with a 
notification that the /Runinfo is different from the compiled-in structure, 
and one hast to recompile the application.

- revisit the open record issue from dying frontends. I remember vaguely that 
I tried to kill a frontend (kill -9), wait until the watchdog cleans up its 
entries, and it worked fine. So it's more the problem to reproduce the issue 
described in the previous elog entry. 
  103   15 Nov 2003 Konstantin Olchanski Phantom "open records"
Sometimes (maybe after a client uncleanly exits?), I see phantom "open
records", for example:
[local:twist:Running]Gas>sor
/Equipment/Gas/Common open 2 times by fe1hp 
/Equipment/Gas/Variables open 1 times by Logger 
/Equipment/Gas/Variables/Flow1 open 2 times by uBeamTcl1 uBeamTcl 
/Equipment/Gas/Settings/Command open 2 times by fe1hp 
/Equipment/Gas/Statistics open 1 times by 

Note the blank client name in the "/Equipment/Gas/Statistics" line.

This causes these warnings from mfe.c:
Cannot init equipment record, probably other FE is using it
Cannot delete statistics record, error 320
Cannot create statistics record, error 320
Cannot open statistics record, error 318. Probably other FE is using it

Then the number of generated events for this front end is never incremented.

Also attempts to delete this "open" record fail:
[local:twist:Running]Gas>del /Equipment/Gas/Statistics
Are you sure to delete the key
"/Equipment/Gas/Statistics"
and all its subkeys? (y/[n]) y
key is open by other client

How do I go about writing the db_validate_xxx() code to cleanup this
bogosity? I am not too familiar with the implementation of "open record"...

K.O.
  102   20 Nov 2003 Stefan Ritt set-uid-root midas programs
> dio- is required to be setuid-root to gain I/O permissions. I looked at it a
> few times, and it is probably safe, but I would like to get a second
> opinion. Stephan, can you should it to your local security geeks?
> 
> mhttpd- definitely unsafe. It has more buffer overflows than I can shake a
> stick at. Why is it suid-root anyway?
> 
> webpaw- what is it?!?

dio was written by Pierre. 

mhttpd and webpaw both are web servers. webpaw is used to display PAW 
pictures over the web. If you run these programs at a port <1024, and most 
people do run them at port 80 (at least at PSI), they need to be setuid-root. 
Unless you know a better way to do that...
ELOG V3.1.4-2e1708b5