> > I found where we tickle the race condition in db_create_record().
> The reason for the db_create_record() is the following: Assume that we change
> the /runinfo structure...
I think there is a deep fundamental problem with changing data structures "on the
fly". Calling db_create_record("/runinfo") at every show_status_page() does not
fix it.
If I change the runinfo structure, rebuild, relink and restart "mhttpd", the
db_create_record("/runinfo") from cm_connect_experiment() will update the runinfo
structure in ODB. In this case, the call from show_status_page() is redundant. As
a side effect, when we do this, we break every running ODB client: they still
have the old runinfo layout. Not good...
If I change the runinfo structure, rebuild, relink and restart all applications,
*except* for mhttpd, "/runinfo" in ODB will be updated when the first updated
client connects to ODB via the db_create_record("/runinfo") from
cm_connect_experiment(). Then, the old mhttpd will restore the old layout via the
db_create_record("/runinfo") in show_status_page(), breaking everything. Not good...
If I change the runinfo structure, rebuild, relink and restart everything,
"/runinfo" in ODB will be updated when the first client connects to ODB via the
db_create_record("/runinfo") from cm_connect_experiment(). In this case, the call
from show_status_page() is redundant. This is the only corruption-free scenario.
This failure to enforce integrity against version skew in binary data structures
is, I think, an ODB design error. Perhaps ODB applications should be prohibited
from direct access to ODB "C" data structures: we cannot ensure that the data
layouts in the application and in ODB are the same.
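
To illustrate the alternative to overlaying a C struct on the stored record, here is a minimal sketch of access by key name. Everything in it (toy_key, toy_get_int) is hypothetical and invented for illustration; it is not the ODB API. The point is only that a lookup by name does not depend on the order in which the fields are stored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy key/value entry; NOT the real ODB data model. */
typedef struct {
    const char *name;
    int value;
} toy_key;

/* Look up an integer field by name.
   Returns 1 and stores the value in *out on success, 0 if the key is missing. */
int toy_get_int(const toy_key *keys, size_t nkeys, const char *name, int *out)
{
    /* Contract checks on the inputs. */
    assert(keys != NULL);
    assert(name != NULL);
    assert(out != NULL);

    for (size_t i = 0; i < nkeys; i++) {
        if (strcmp(keys[i].name, name) == 0) {
            *out = keys[i].value;
            return 1;
        }
    }
    return 0;
}
```

With this kind of access, a client built against an older field order still reads the right value, and a missing key is reported as an error instead of being silently mapped onto the wrong struct member.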
> One could think of checking the record size, and re-creating the runinfo if
> the ODB record size does not match the C record size. But this does not
> prevent the potential error that some variables are reversed in order. They
> are then mapped wrongly to the C runinfo structure.
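
The point above can be sketched in a few lines of C. The two struct versions and the field_desc table below are hypothetical stand-ins, not the real runinfo definition: they have the same size, so a size check passes, but the fields are swapped. Comparing a per-field descriptor (name and offset) is one possible way to catch the difference; it is shown as an assumption, not as something ODB does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two hypothetical versions of a record: same fields, different order. */
typedef struct { int state; int run_number; } runinfo_v1;
typedef struct { int run_number; int state; } runinfo_v2;

/* One entry per field: its name and its byte offset inside the struct. */
typedef struct { const char *name; size_t offset; } field_desc;

static const field_desc layout_v1[] = {
    {"state",      offsetof(runinfo_v1, state)},
    {"run_number", offsetof(runinfo_v1, run_number)},
};
static const field_desc layout_v2[] = {
    {"run_number", offsetof(runinfo_v2, run_number)},
    {"state",      offsetof(runinfo_v2, state)},
};

/* Returns 1 if both layouts list the same field names at the same offsets,
   in the same order; returns 0 otherwise. */
int layouts_match(const field_desc *a, const field_desc *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(a[i].name, b[i].name) != 0 || a[i].offset != b[i].offset)
            return 0;
    }
    return 1;
}
```

Here sizeof(runinfo_v1) equals sizeof(runinfo_v2), yet layouts_match(layout_v1, layout_v2, 2) returns 0, and copying the bytes of a runinfo_v2 into a runinfo_v1 puts the run number into the state field.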
Exacto.
> I see that you work very hard now on all possible checks for the run number.
> But I would not commit that and make it part of the distribution...
This is a philosophical issue.
My checks are in line with the "design by contract" school of programming. In a
nutshell, this ideology requires that before I do anything, I should enforce the
validity of my inputs and after I am done, I should enforce the validity of my
outputs. In practice, this translates into liberal use of assert()'s *in
production code*.
To ensure that old bugs stay fixed, and that new bugs are promptly discovered, it
is essential that the "contract checks" stay in the production code forever.
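
As a small sketch of what such contract checks look like, consider a helper that computes the next run number. The function next_run_number() is hypothetical and is not part of MIDAS; the specific conditions (run numbers are never negative, the result is strictly positive) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper, not a MIDAS function: compute the next run number. */
int next_run_number(int current)
{
    /* Preconditions: the input must be a valid run number,
       and there must be room to increment without overflow. */
    assert(current >= 0);
    assert(current < INT_MAX);

    int next = current + 1;

    /* Postcondition: the run number we hand back is never zero or negative. */
    assert(next > 0);
    return next;
}
```

Note that the standard assert() macro compiles to nothing when NDEBUG is defined, so keeping these checks in production code means building without -DNDEBUG (or using a custom check macro that is always active).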
But let better writers argue programming philosophy in the literature.
Personally, when hunting down bugs in unstable code, I find this technique to be
vastly superior to the more common approach of "This program has no bugs. Error
checking and assert()s are wasteful. Let's close our eyes and hope no bad things
happen to us (again)".
> But if you start now, please put [asserts] in all other 100000 places (;-)
I know that no good deed goes unpunished, but pewleeze!!!
> If you cannot resolve your zero run number problem, do the following: ...
> [lock ODB, freeze the experiment, look at log files]
This technique is obsolete. Today, we instrument the code with sanity checks
and validity tests. Then all the bugs find themselves with minimal manual
intervention.
K.O.