01 Nov 2003, Konstantin Olchanski, , more odb
|
> > I added error checking to the places where we read "/runinfo/run number".
> Now YOU broke the system by editing all these files with something I consider
> temporary debugging code. A run number of zero is *VALILD*.
I think I broke nothing. I do know that run number 0 is a valid odb value. Here
is an audit of all places where I abort on invalid run numbers:
mana.c: line 3676: assert(current_run_number > 0);
we take the run number from an event and write it into ODB. Events cannot have
run number negative or zero.
mana.c:analyze_run(): line 4632: assert(run_number > 0);
we are asked to analyze run "run_number". zero or negative is not valid.
midas.c:assert(run_number > old_run_number);
midas.c:assert(run_number > 1);
this code is not in CVS.
odbedit.c: line 2563: assert(old_run_number >= 0);
run number zero is valid
odbedit.c: line 2641: assert(new_run_number > 0);
starting a new run number zero is not valid
mfe.c: line 1786: if (run_number<=0) cm_msg(MERROR, "main", "aborting on attempt
to use invalid run number %d", run_number);
auto restart from run 0 to 1 is not valid
midas.c: line 3917: if (run_number<=0) cm_msg(MERROR, "cm_transition", "aborting
on attempt to use invalid run number %d",run_number);
transition to run zero or negative is not valid
midas.c: line 16101: if (run_number<0) cm_msg(MERROR, "el_submit", "aborting on
attempt to use invalid run number %d", run_number);
negative run numbers are not valid
mlogger.c: line 3301: if (run_number<=0) cm_msg(MERROR, "main", "aborting on
attempt to use invalid run number %d", run_number);
auto restart from run 0 to run 1 is not valid
K.O. |
01 Nov 2003, Konstantin Olchanski, , mana.c without ROOT and HBOOK
|
> > Stephan, why did you prohibit building mana.c without ROOT and HBOOK
> > support? I think such a configuration is valid and should be allowed.
>
> Oops, sorry, my fault. I forgto that people use mana.c without ROOT and
> HBOOK. The reason I made the change was that people forgot the -DHVAE_HBOOK
> in their makefile. In that case, no HBOOK init is done in mana.c and the
> first histogram booking in the user code crashes HBOOK.
Ahem. There is only so much rope we can give out to prevent people from shooting
themselves in the foot...
> So please take the #error statement out of mana.c
Done.
> One possibility is that we put an additional layer on top of the histogram
> boooking/filling. These macros are converted to their HBOOK or ROOT
> equivalents depending on the HAVE_HBOOK/HAVE_ROOT. If none of both is
> present, the histogram booking macro can produce a runtime error. This has
> the additional advantage that users can switch from HBOOK to ROOT without
> change of their user code.
I can't think of anything other than wrapping every HBOOK call with "if
(!hbook_is_initialized) initialize_hbook();". But then, where is PAWC
coming from anyway?!?
We could also print a warning message "This mana.c has no HBOOK support. If you
see HBOOK crashes, please relink with hmana,c". Ugly, but informative, plus it
points anybody who knows how to read towards a solution.
K.O. |
01 Nov 2003, Konstantin Olchanski, , Do not frob
|
> > I found where we tickle the race condition in db_create_record().
> The reason for the db_create_record() is the following: Assume that we change
> the /runinfo structure...
I think there is a deep fundamental problem with changing data structures "on the
fly". Calling db_create_record("/runinfo") at every show_status_page() does not
fix it.
If I change the runinfo structure, rebuild, relink and restart "mhttpd", the
db_create_record("/runinfo") from cm_connect_experiment() will update the runinfo
structure in ODB. In this case, the call from show_status_page() is redundant. As
a side effect, when we do this, we break every running ODB client- they still
have the old runinfo layout. Not good...
If I change the runinfo structure, rebuild, relink and restart all applications,
*except* for mhttpd, "/runinfo" in ODB will be updated when the first updated
client connects to ODB via the db_create_record("/runinfo") from
cm_connect_experiment(). Then, the old mhttpd will restore the old layout via the
db_create_record("/runinfo") in show_status_page(), breaking everything. Not good...
If I change the runinfo structure, rebuild, relink and restart everything,
"/runinfo" in ODB will be updated when the first client connects to ODB via the
db_create_record("/runinfo") from cm_connect_experiment(). In this case, the call
from show_status_page() is redundant. This is the only corruption-free scenario.
This lack of integrity enforcement vs version skew in binary data structures is,
I think, an ODB design error. Perhaps, ODB applications should be prohibited from
direct access to ODB "C" data structures: we cannot ensure that the data layout
in the application and in ODB are the same.
> One could think of checking the record size, and re-creating the runinfo if
> the ODB record size does not match the C record size. But this does not
> prevent the potential error that some variable are reversed in order. They
> are then mapped wrongly to the C runinfo structure.
Exacto.
> I see that you work very hard now on all possible checks for the run number.
> But I would not commit that and make it part of the distribution...
This is a philosophical issue.
My checks are in line with the "design by contract" school of programming. In a
nutshell, this ideology requires that before I do anything, I should enforce the
validity of my inputs and after I am done, I should enforce the validity of my
outputs. In practice, this translates into liberal use of assert()'s *in
production code*.
To ensure that old bugs stay fixed, and that new bugs are promptly discovered, it
is essential that the "contract checks" stay in the production code forever.
But let better writers argue programming philosophy in the literature.
Personally, when hunting down bugs in unstable code, I find this technique to be
vastly superior to the more common appoach of "This program has no bugs. Error
checking and assert()s are wasteful. Let's close our eyes and hope no bad things
happen to us (again)".
> But if you start now, please put [asserts] in all other 100000 places (;-)
I know that no good deed goes unpunished, but pewleeze!!!
> If you cannot resolve your zero run number problem, do the following: ...
> [lock ODB, freeze the experiment, look at log files]
This technique is obsolete. Today, we instrument the code with sanity checks
and validity tests. Then all the bugs find themselves with minimal manual
intervention.
K.O. |
15 Nov 2003, Konstantin Olchanski, , Phantom "open records"
|
Sometimes (maybe after a client uncleanly exits?), I see phantom "open
records", for example:
[local:twist:Running]Gas>sor
/Equipment/Gas/Common open 2 times by fe1hp
/Equipment/Gas/Variables open 1 times by Logger
/Equipment/Gas/Variables/Flow1 open 2 times by uBeamTcl1 uBeamTcl
/Equipment/Gas/Settings/Command open 2 times by fe1hp
/Equipment/Gas/Statistics open 1 times by
Note the blank client name in the "/Equipment/Gas/Statistics" line.
This causes these warnings from mfe.c:
Cannot init equipment record, probably other FE is using it
Cannot delete statistics record, error 320
Cannot create statistics record, error 320
Cannot open statistics record, error 318. Probably other FE is using it
Then the number of generated events for this front end is never incremented.
Also attempts to delete this "open" record fail:
[local:twist:Running]Gas>del /Equipment/Gas/Statistics
Are you sure to delete the key
"/Equipment/Gas/Statistics"
and all its subkeys? (y/[n]) y
key is open by other client
How do I go about writing the db_validate_xxx() code to cleanup this
bogosity? I am not too familiar with the implementation of "open record"...
K.O. |
20 Nov 2003, Konstantin Olchanski, , midas timeout wraparound
|
While reviving midas on midtig01 after it was not used for a while, we see
this. Notice negative "last called" numbers. Looks like a time_t wraparound
somewhere...
[local:tigress:S]/>scl -w
Name Host Timeout Last called
mhttpd midtig01.triumf.ca 10000 -2037131082
Logger midtig01.triumf.ca 10000 -2037131166
Analyzer midtig01.triumf.ca 10000 -2037131048
JACQ midtig01.triumf.ca 10000 -2037131667
mhttpd1 midtig01.triumf.ca 10000 325
ODBEdit midtig01.triumf.ca 10000 829
K.O. |
20 Nov 2003, Konstantin Olchanski, , cannot shutdown defunct clients
|
> While reviving midas on midtig01 after it was not used for a while ...
> [local:tigress:S]/>scl -w
> Name Host Timeout Last called
> mhttpd midtig01.triumf.ca 10000 -2037131082
These clients cannot be deleted. I tried:
1) shutdown from mhttpd "programs" page -> "cannot shutdown client"
2) "sh mhttpd" from odbedit ->
[midas.c:5298:cm_shutdown] cannot connect to client mhttpd on host
midtig01.triumf.ca, port 32853
Client mhttpd not active
3) in odbedit: "cd /system/clients; rm xxxx"
refuses to delete the key
Lacking any better ideas, I deleted them via brain surgery on the odb file:
1) stop everything
2) ipcrm the SYSV shared memory segment
3) odbedit -> save xxx.odb
4) xemacs xxx.odb, delete offending odb entries
5) rm .ODB.SHM
6) odbedit -> load xxx.odb
7) voila, bad clients gone, gone, gone.
K.O. |
20 Nov 2003, Konstantin Olchanski, , set-uid-root midas programs
|
I see that MIDAS installs several set-uid-root programs into /usr/local/bin.
In this age and time of evil computer hackers, this is not a good idea and
we should Do Something (TM) about it. Here is my risk assessment:
[olchansk@midtis06 midas]$ ls -l /usr/local/bin | grep wsr
-rwsr-sr-x 1 root root 25811 Nov 20 09:27 dio
-rwsr-sr-x 1 root root 344553 Nov 20 09:27 mhttpd
-rwsr-sr-x 1 root root 70736 Nov 20 09:27 webpaw
dio- is required to be setuid-root to gain I/O permissions. I looked at it a
few times, and it is probably safe, but I would like to get a second
opinion. Stephan, can you should it to your local security geeks?
mhttpd- definitely unsafe. It has more buffer overflows than I can shake a
stick at. Why is it suid-root anyway?
webpaw- what is it?!?
K.O. |
20 Nov 2003, Konstantin Olchanski, , cannot shutdown defunct clients
|
> > 1) shutdown from mhttpd "programs" page -> "cannot shutdown client"
> Have you tried a "cleanup" in ODBEdit?
Nope. Will try next time...
> The "last_activity" is a 32-bit int, filled with milliseconds. So indeed it
> wraps around after about one month.... change last_activity
> from INT to DWORD. This way it's alway positive and the wraparound does not
> hurt.
INT == "int", wraparound in 1 month
DWORD == "unsigned int", wraparound in 2 months
should we make it the 64-bit "long long" (or C98's "int64_t")?
K.O. |
27 Nov 2003, Konstantin Olchanski, , Implementation of db_check_record()
|
> I have therefore implemented the function
> db_check_record(HNDLE hDB, HNDLE hKey, char *keyname, char *rec_str, BOOL
correct)
Stephan, something is very wrong with the new code. My
"/logger/channels/0/settings" is being destroyed on "begin run". Midas
checkout from october 31st is okey. This is a show stopper, but I am in a rush
and cannot debug it. I am falling back to the Oct 31st version... K.O. |
30 Nov 2003, Konstantin Olchanski, , bad call to cm_cleanup() in fal.c
|
fal.c does not compile: it calls cm_cleanup() with one argument when there
should be two arguments. K.O. |
30 Nov 2003, Konstantin Olchanski, , Implementation of db_check_record()
|
> > I have therefore implemented the function
> > db_check_record(HNDLE hDB, HNDLE hKey, char *keyname, char *rec_str, BOOL
> correct)
>
> Stephan, something is very wrong with the new code. My
> "/logger/channels/0/settings" is being destroyed on "begin run".
Okey. I found the problem in db_check_record(): when we decide that we have a
mismatch, we call db_create_record(...,rec_str), but by this time, rec_str no
longer points to the beginning of the ODB string because we started parsing it.
I tried this solution: save rec_str into rec_str_orig, then when we decide that
we have a mismatch, call db_create_record() with this saved rec_str_orig. It
fixes my immediate problem (destruction of "/logger/channels/0/settings"), but is
it correct?
I would like to fix it ASAP to get cvs-head working again: our mhttpd dumps core
on an assert() failure in db_create_record() and the set of db_check_record()
changes might fix it for me.
Here is the CVS diff:
RCS file: /usr/local/cvsroot/midas/src/odb.c,v
retrieving revision 1.73
diff -r1.73 odb.c
7810a7811
> char *rec_str_orig = rec_str;
7820c7821
< return db_create_record(hDB, hKey, keyname, rec_str);
---
> return db_create_record(hDB, hKey, keyname, rec_str_orig);
7838c7839
< return db_create_record(hDB, hKey, keyname, rec_str);
---
> return db_create_record(hDB, hKey, keyname, rec_str_orig);
8023c8024
< return db_create_record(hDB, hKey, keyname, rec_str);
---
> return db_create_record(hDB, hKey, keyname, rec_str_orig);
8037c8038
< return db_create_record(hDB, hKey, keyname, rec_str);
---
> return db_create_record(hDB, hKey, keyname, rec_str_orig);
K.O. |
01 Dec 2003, Konstantin Olchanski, , Implementation of db_check_record()
|
> Fixed and committed. Can you check if it's working?
Yes, it is fixed. Thanks. K.O. |
05 Dec 2003, Konstantin Olchanski, , HOWTO setup MIDAS ROOT tree analysis
|
> root -l
root> TFile *f = new TFile("run00064.root")
root> TTree *t = f->Get("Trigger")
root> t->StartViewer() // look at the ROOT TTree
root> t->MakeSelector() // generates Trigger.h, Trigger.C
edit run.C, the main program:
{
gROOT->Reset();
TFile f("data/run00064.root");
TTree *t = f.Get("Trigger");
TH1D* adc8 = new TH1D("adc8","ADC8",1500,0,1500-1);
TH1D* tdc2 = new TH1D("tdc2","TDC2",1500,0,1500-1);
TH2D* h12 = new TH2D("h2","ADC8 vs TDC2",100,0,1500,100,0,1500);
TH2D* h12cut = new TH2D("h2cut","ADC8 vs TDC2",50,0,1000-1,50,0,1500);
TSelector *s = TSelector::GetSelector("Trigger.C");
t->Process(s);
adc8->Draw();
tdc2->Draw();
h12->Draw();
h12cut->Draw();
}
edit Trigger.C:
Bool_t Trigger::ProcessCut(Int_t entry)
{
fChain->GetTree()->GetEntry(entry);
if (entry%100 == 0) printf("entry %d\r",entry);
return kTRUE;
}
void Trigger::ProcessFill(Int_t entry)
{
adc8->Fill(ADCS_ADCS[8]);
tdc2->Fill(TDCS_TDCS[2]);
h12->Fill(TDCS_TDCS[2],ADCS_ADCS[8]);
if (ADCS_ADCS[8] > 100)
h12cut->Fill(TDCS_TDCS[2],ADCS_ADCS[8]);
}
Run the analysis:
root -l
root> .x run.C
K.O. |
05 Dec 2003, Konstantin Olchanski, , HOWTO setup MIDAS ROOT tree analysis
|
> root -l
root> TFile *f = new TFile("run00064.root")
root> TTree *t = f->Get("Trigger")
root> t->StartViewer() // look at the ROOT TTree
root> t->MakeSelector() // generates Trigger.h, Trigger.C
edit run.C, the main program:
{
gROOT->Reset();
TFile f("data/run00064.root");
TTree *t = f.Get("Trigger");
TH1D* adc8 = new TH1D("adc8","ADC8",1500,0,1500-1);
TH1D* tdc2 = new TH1D("tdc2","TDC2",1500,0,1500-1);
TH2D* h12 = new TH2D("h2","ADC8 vs TDC2",100,0,1500,100,0,1500);
TH2D* h12cut = new TH2D("h2cut","ADC8 vs TDC2",50,0,1000-1,50,0,1500);
TSelector *s = TSelector::GetSelector("Trigger.C");
t->Process(s);
adc8->Draw();
tdc2->Draw();
h12->Draw();
h12cut->Draw();
}
edit Trigger.C:
Bool_t Trigger::ProcessCut(Int_t entry)
{
fChain->GetTree()->GetEntry(entry);
if (entry%100 == 0) printf("entry %d\r",entry);
return kTRUE;
}
void Trigger::ProcessFill(Int_t entry)
{
adc8->Fill(ADCS_ADCS[8]);
tdc2->Fill(TDCS_TDCS[2]);
h12->Fill(TDCS_TDCS[2],ADCS_ADCS[8]);
if (ADCS_ADCS[8] > 100)
h12cut->Fill(TDCS_TDCS[2],ADCS_ADCS[8]);
}
Run the analysis:
root -l
root> .x run.C
K.O. |
05 Dec 2003, Konstantin Olchanski, , HOWTO setup MIDAS ROOT tree analysis
|
> root -l
root> TFile *f = new TFile("run00064.root")
root> TTree *t = f->Get("Trigger")
root> t->StartViewer() // look at the ROOT TTree
root> t->MakeSelector() // generates Trigger.h, Trigger.C
edit run.C, the main program:
{
gROOT->Reset();
TFile f("data/run00064.root");
TTree *t = f.Get("Trigger");
TH1D* adc8 = new TH1D("adc8","ADC8",1500,0,1500-1);
TH1D* tdc2 = new TH1D("tdc2","TDC2",1500,0,1500-1);
TH2D* h12 = new TH2D("h2","ADC8 vs TDC2",100,0,1500,100,0,1500);
TH2D* h12cut = new TH2D("h2cut","ADC8 vs TDC2",50,0,1000-1,50,0,1500);
TSelector *s = TSelector::GetSelector("Trigger.C");
t->Process(s);
adc8->Draw();
tdc2->Draw();
h12->Draw();
h12cut->Draw();
}
edit Trigger.C:
Bool_t Trigger::ProcessCut(Int_t entry)
{
fChain->GetTree()->GetEntry(entry);
if (entry%100 == 0) printf("entry %d\r",entry);
return kTRUE;
}
void Trigger::ProcessFill(Int_t entry)
{
adc8->Fill(ADCS_ADCS[8]);
tdc2->Fill(TDCS_TDCS[2]);
h12->Fill(TDCS_TDCS[2],ADCS_ADCS[8]);
if (ADCS_ADCS[8] > 100)
h12cut->Fill(TDCS_TDCS[2],ADCS_ADCS[8]);
}
Run the analysis:
root -l
root> .x run.C
K.O. |
01 Jan 2004, Konstantin Olchanski, , Poll about default indent style
|
> I don't feel a strong need of giving up a "-i2"...
I am comfortable with the current MIDAS styling convention and I would rather not
have yet another private religious war over the right location for the curley braces.
If we are to consider changing the MIDAS coding convention, I urge all and sundry
to read the ROOT coding convention, as written by Rene Brun and Fons Rademakers at
http://root.cern.ch/root/Conventions.html. The ROOT people did their homework, they
did read the literature and they produced a well considered and well argumented style.
Also, while there, do read the Taligent documentation- by far, one of the most
coherent manuals to C++ programming style.
K.O. |
14 Jan 2004, Konstantin Olchanski, , First try- midas on darwin/macosx
|
While watching "The Wizard of Oz", the greatest movie ever made, I took a shot at building
midas on my macosx computer. After stumbling on a few small and on a few hard problems, I
built almost everything. However, odb does not work- some further debugging is in order.
Anyway, the easy problems are:
- a few missing header files: pty.h, sys/vfs.h, malloc.h
- a few missing features in system.c (stime(), "get tape position")
- /usr/include/string.h already has strlcpy() & co.
- dbg_malloc() has inconsistent prototypes (size_t vs unsigned int)
- for reasons unknown, PVM is #defined. This flushed a bug in mana.c
A few hard problems:
- namespace pollution by Apple- they #define ALIGN in system headers, colliding with ALIGN
in midas.h. I was amazed that the two are almost identical, but MIDAS ALIGN aligns to 8
bytes, while Apple does 4 bytes. ALIGN is used all over the place and I am not sure how to
reconcile this.
- "timezone" in mhttpd.c. On linux, it's an "int", on darwin, it's a function. What gives?
- building libmidas.a requires running ranlib
- building libmidas.so requires unknown macosx specific magic.
For your enjoyment, the "cvs diff" is attached. The resulting code is known to not work.
K.O. |
16 Jan 2004, Konstantin Olchanski, , First try- midas on darwin/macosx
|
> Great, I got already questions about MacOSX support...
> Once it's working, you should commit the changes.
With the ALIGN8() change ODB works, mhttpd works. ALIGN8 change now commited to cvs, verified that "make all" builds
on Linux.
ROOT stuff still blows up because of more namespace pollution (/usr/include/sys/something does #define Free(x)
free(blah...)). Arguably, it is not Apple's fault- portable programs should not include any <sys/foo.h> header files. I
think I can fix it by moving "#include <sys/mount.h>" from midasinc.h to system.h.
Also figured out why PVM is defined- more pollution from "#include <sys/blah...>". This is only in mana.c and I will
repace every "#ifdef PVM" with "#ifdef HAVE_PVM". Is there documentation that should be updated as well? Alternatively I
can try to play games with header files...
> But take into account that using "//" for comments might cause problems for the VxWorks compiler (talk to Pierre
about that!).
Yes, "// comments" stay out of midas. I used them to make the modification more visible.
> You can rename ALIGN to ALIGN8 all over the place.
Done, commited.
> > - "timezone" in mhttpd.c. On linux, it's an "int", on darwin, it's a function. What gives?
> Wrap it into a function get_timezone(). Under linux, just return "timezone", under OSX,
> return timezone() via conditional compiling.
Right. Still on the todo list.
> > - building libmidas.a requires running ranlib
I still have to cleanup the Makefile. Not commiting it yet.
Then, a new problem- on MacOSX, pthread_t is not an "INT" and system.c:ss_thread_create() whines about it. I want to
introduce a system dependant THREAD_T (or whatever) and make ss_thread_create() return that, rather than INT.
ROOT stuff is still not fully tested- it takes a little while to build ROOT on a 600MHz laptop.
Attached is my current CVS diff.
K.O. |
18 Jan 2004, Konstantin Olchanski, , First try- midas on darwin/macosx
|
> I would like to keep all OS specific #includes in midasinc.h
No go. Here is the problem:
midasinc.h includes sys/mount.h, which #defines Free(x) to be something else
mana.c includes msystem.h, which includes midasinc.h
mana.c includes ROOT header files, which blow up because Free(x) is redefined.
I want this:
mana.c does *not* include sys/mount.h
system.c does include sys/mount.h
Simplest solution is to take sys/mount.h out of midasinc.h and include it in system.c
> Right, PVM should be replaced by HAVE_PVM.
Commited.
> > Then, a new problem- on MacOSX, pthread_t is not an "INT" and system.c:ss_thread_create() whines about it. I want to
> > introduce a system dependant THREAD_T (or whatever) and make ss_thread_create() return that, rather than INT.
> Good. If you have a OS_MACOSX, that should help you there.
Okey. In Darwin, pthread_t is not an int. It is a pointer to a struct. In midas.c I typedef midas_pthread_t to HANDLE on Windows and to pthread_t n OS_UNIX.
This uncovered a problem with ss_getthandle(). What is it supposed to do? On Windows it returns a handle to the current thread, on OS_UNIX, it returns getpid().
What gives? I am leaving it alone for now.
Attached is the current diff. Most changes are in system.c: ss_timezone() and midas_pthread_t. The Makefile part is already commited. Building the shared
library was made dependant on NEED_SHLIB. Now, building static midas applications is very simple, use "make SHLIB="
K.O. |
19 Jan 2004, Konstantin Olchanski, , First try- midas on darwin/macosx
|
> > Simplest solution is to take sys/mount.h out of midasinc.h and include it in system.c
> Agree.
Done.
With this, I commited the rest of my changes: midas_thread_t in midas.h, change ss_thread_xxx() prototypes in msystem.h
, implementation in system.c
My cvs diff is now empty.
Midas should compile on Darwin aka macosx, I tested "odbedit" and "mhttpd"- they seem to work.
> > This uncovered a problem with ss_getthandle().
> The Unix version of ss_getthandle() returns the pid since at the time when I wrote that function (many years ago) there were no threads under Unix. It should now
> be replaces with a function which returns the real thread id (at least under Linux).
I do not want to touch this. Sorry.
K.O. |
|