ID |
Date |
Author |
Topic |
Subject |
166
|
13 Oct 2004 |
Konstantin Olchanski | Bug Report | TWIST upgrade bombed... |
> > The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> > crashes during shared memory data buffer accesses. I am looking into it and I
> > will add information as I figure things out. K.O.
>
> Since 1.9.5 the EventBuilder has been modified. Please consult the documentation
> where the new mevb scheme is explained.
> Test of the mevb with up to 16 frontends (15 different CPUs) has been tested
> successfully. Data rate at the EventBuilder were measured about 50MB/s without the
> logger and ~30MB/s with the logger.
It turns out that TWIST uses a private mevb.c. We will consider upgrading to the
standard one.
K.O. |
168
|
14 Oct 2004 |
Konstantin Olchanski | Bug Report | lazylogger complains about zero-size files |
With latest midas, I see this:
Thu Oct 14 19:31:17 2004 [Lazy_Tape] [lazylogger.c:1717:Lazy] lazy_file_exists
file run17567.ybs doesn't exists
Thu Oct 14 19:31:27 2004 [Lazy_Tape] [lazylogger.c:1717:Lazy] lazy_file_exists
file run17567.ybs doesn't exists
The file run17567.ybs has size zero:
-rw-r--r-- 1 twistonl users 950272 Oct 13 19:29 /twist/data_onl/current/run17565.ybs
-rw-r--r-- 1 twistonl users 950272 Oct 13 19:45 /twist/data_onl/current/run17566.ybs
-rw-r--r-- 1 twistonl users 0 Oct 13 20:00 /twist/data_onl/current/run17567.ybs
-rw-r--r-- 1 twistonl users 983040 Oct 13 20:03 /twist/data_onl/current/run17568.ybs
-rw-r--r-- 1 twistonl users 950272 Oct 13 20:26 /twist/data_onl/current/run17569.ybs
I am not sure how to fix this lazylogger logic. Please help.
K.O. |
169
|
14 Oct 2004 |
Konstantin Olchanski | Bug Report | TWIST upgrade bombed... |
> The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> crashes during shared memory data buffer accesses. I am looking into it and I
> will add information as I figure things out. K.O.
On second try, it looks like we are in business- the first try did not work
because of two mistakes:
1) I did not delete *all* old .SHM files (.ODB.SHM, .SYSTEM.SHM, .YBUF1.SHM,
.YBUF2.SHM). I deleted ODB.SHM, so odb worked, but forgot about the data buffers
SYSTEM.SHM & co and ended up with segmentation faults and core dumps in the buffer
management code caused by a mismatch of the old-midas buffers and new-midas code.
2) while debugging these core dumps, I made an error in my test code, so even
after I deleted the old data buffers, things still did not work. Talk about
over-debugging a problem...
K.O. |
170
|
22 Oct 2004 |
Konstantin Olchanski | Bug Fix | mhttpd message colouring |
I committed a fix to the mhttpd logic that decides which messages should be shown in
"red" colour- before, any message with square brackets and colons would be
highlighted in red. Now only messages matching the pattern [...:...] are
highlighted. The decision logic was moved into a function message_red(). K.O. |
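A minimal sketch of the [...:...] test described above. This is not the committed message_red() code; the function name matches_bracket_colon() and its exact rules (a colon must appear between a '[' and the next ']') are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the decision logic: return 1 if the message
   contains a bracketed group with a colon inside it, e.g. "[file.c:123:fn]",
   and 0 otherwise. */
int matches_bracket_colon(const char *msg)
{
    const char *open;

    assert(msg != NULL);

    for (open = strchr(msg, '['); open != NULL; open = strchr(open + 1, '[')) {
        const char *close = strchr(open + 1, ']');
        if (close == NULL)
            return 0; /* no closing bracket after this point */

        /* look for a colon strictly between '[' and ']' */
        if (memchr(open + 1, ':', (size_t) (close - open - 1)) != NULL)
            return 1;
    }
    return 0;
}
```

With this rule, "[Lazy_Tape] [lazylogger.c:1717:Lazy] ..." matches because of the second bracket group, while a message like "Run [5] started: ok" does not, even though it contains both brackets and a colon.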
177
|
14 Dec 2004 |
Konstantin Olchanski | Forum | use of assert in mhttpd |
> We've had mhttpd aborting regularly since upgrading from midas-1.9.3. This
> happens during elog queries, and is due to an elog file that was incorrectly
> modified by hand.
(sorry for delayed reply, for reasons unknown, I did not get an email notice when this was posted)
Yes, I agree, error handling in midas elog code is insufficient (note missing error checks for
read() and lseek() system calls). Anything but "perfect" elog files would cause funny errors and
malfunctions.
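To illustrate the kind of checking that is missing, here is a rough sketch of a helper that verifies every lseek() and read() and reports the file offset where things went wrong. It is not taken from midas.c; the name read_exact() and its return convention are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly `size` bytes starting at `offset`.
   Returns 0 on success and -1 on any failure, including a short read
   (end-of-file before the record is complete). */
int read_exact(int fd, off_t offset, void *buf, size_t size)
{
    size_t done = 0;

    assert(buf != NULL);

    if (lseek(fd, offset, SEEK_SET) == (off_t) -1) {
        fprintf(stderr, "lseek to offset %ld failed: %s\n",
                (long) offset, strerror(errno));
        return -1;
    }

    while (done < size) {
        ssize_t n = read(fd, (char *) buf + done, size - done);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal, try again */
            fprintf(stderr, "read at offset %ld failed: %s\n",
                    (long) (offset + (off_t) done), strerror(errno));
            return -1;
        }
        if (n == 0) {
            fprintf(stderr, "unexpected end of file at offset %ld, %lu bytes missing\n",
                    (long) (offset + (off_t) done), (unsigned long) (size - done));
            return -1;
        }
        done += (size_t) n;
    }
    return 0;
}
```

A caller would treat -1 as "this elog file is broken here" and return an error instead of continuing with a partially filled buffer.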
> The modification to the file occurred 6 months ago.
> el_retrieve(midas.c:15683) now has several assert statements, one of which
> aborts the program on reading the bad entry.
I added those to fix problems with "broken last NN days" and with infinite looping in the elog code
that we observed in TWIST.
You are welcome to replace the assert() statements with proper error handling. I used to have some code
that could report the filename of the bad elog file. Can we also report the exact file location for broken
files?
Please send me the diff, I will commit it to midas cvs.
> Why is assert used, instead of an error return from the function (if
> necessary), and maybe an error message in the log file? Assert statements are
> often removed, using NDEBUG, for normal use.
I use assert() in several ways:
0) I want a core dump each time X happens. (This is the only reasonable action when facing memory/stack
corruption. The problems in the elog code were stack corruption).
1) "I am too lazy to write proper error handling code" so I just crash and burn. This includes the
case where "proper error handling" would be "too invasive".
2) the error is too bad (or too deep) and there is no reasonable way to recover. Print an error message
and dump core (for later analysis). I sometimes use "cm_msg(); abort()". (assert is "printf("error"); abort()")
Please refer to literature for philosophic discussions on uses of assert() (Argh! Stefan will have my
head again!), but I will mention that "abort() early, abort() often" I find very effective. BTW, this technique
is heavily used in the Linux kernel (oops(), bug(), panic()) with some good effect, too.
> The problem elog entry had one character removed, so end-of-file came before
> the end of the message. This could probably occur without the file being
> altered, if the disk containing the elog fills.
Yes, I think you are right. In TWIST, we have seen disk-full conditions break both elog and history.
K.O. |
178
|
14 Dec 2004 |
Konstantin Olchanski | Info | Commit local TWIST modifications |
I am committing MIDAS modifications accumulated during the last few months of running TWIST:
1) system.c::ss_shm_open() fail if trying to map a file that is smaller than we expect.
2) midas.c::bm_lock_buffer(), el_submit(), el_delete_message(): do not wait for mutexes forever, use a 5
minute timeout. If we can't get the lock, cm_msg()/abort().
The above helps dealing with complete midas freezes. I also have code to keep track of "who locked
the mutex *and* is still holding it?!?" but it is way too ugly to commit. I wish we had a "lockedByPid"
entry for all lockable objects.
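A sketch of what a lock with a timeout and a "lockedByPid" entry could look like. This is not the midas mutex code: the type pid_lock_t and the functions below are hypothetical, C11 atomics and a 10 ms polling loop are assumptions, and for use between processes the lock word would have to live in shared memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical lock word: 0 means free, otherwise the pid of the holder. */
typedef struct {
    atomic_int locked_by_pid;
} pid_lock_t;

/* Try to take the lock for `pid`, polling until timeout_ms has elapsed.
   Returns 0 on success, or the pid of the current holder on timeout. */
int pid_lock_timed(pid_lock_t *lock, int pid, long timeout_ms)
{
    const struct timespec nap = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    struct timespec start, now;

    assert(lock != NULL && pid > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&lock->locked_by_pid, &expected, pid))
            return 0; /* we own the lock now */

        /* the failed exchange left the holder's pid in `expected` */
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (long) (now.tv_sec - start.tv_sec) * 1000L
                        + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= timeout_ms)
            return expected;

        nanosleep(&nap, NULL);
    }
}

/* Release the lock; releasing a lock we do not hold is a bug, so dump core. */
void pid_unlock(pid_lock_t *lock, int pid)
{
    int expected = pid;

    assert(lock != NULL);
    if (!atomic_compare_exchange_strong(&lock->locked_by_pid, &expected, 0))
        abort();
}
```

With a 5 minute timeout (300000 ms), a nonzero return value is the pid to put into the cm_msg() text before calling abort().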
K.O.
|
179
|
14 Dec 2004 |
Konstantin Olchanski | Info | Commit local TWIST modifications |
> I am committing MIDAS modifications accumulated during the last few months of running TWIST:
More:
- mfe.c: in error messages "cannot find statistics record", also print
the name of the record we are looking for.
- mlogger.c: in warning message "Write operation took N ms", report the name
of the offending data stream.
- system.c: do not chdir("/") in ss_daemon_init()- it prevents us from ever
getting core dumps from midas daemons. The old behaviour is trivially
restored by "cd /" before starting the daemon; or by "limit coredumpsize 0".
- odb.c: db_validate_db() detect and break infinite looping on free list corruption.
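The loop detection in the last item can be sketched as a bounded walk: a free list cannot have more entries than there are blocks, so any walk that takes more hops than that must be going in circles. The array layout below is a simplification invented for this example and is not the ODB free list structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: next[i] is the index of the free block that follows
   block i, and -1 terminates the list. Returns 1 if the list starting at
   `first` ends properly, 0 if it leaves the pool or loops. */
int free_list_is_sane(const int *next, int nblocks, int first)
{
    int steps = 0;
    int i;

    assert(next != NULL && nblocks >= 0);

    for (i = first; i != -1; i = next[i]) {
        if (i < 0 || i >= nblocks)
            return 0; /* link points outside the pool */
        if (++steps > nblocks)
            return 0; /* more hops than blocks: the list loops */
    }
    return 1;
}
```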
K.O. |
180
|
14 Dec 2004 |
Konstantin Olchanski | Info | mhttpd: Commit local TWIST modifications |
> > I am committing MIDAS modifications accumulated...
mhttpd changes:
- Renee's improvements on http transaction logging
- Implement "minimum" and "maximum" clamping for history graphs. Unfortunately
there is no GUI code for changing the "minimum" and "maximum" settings,
other than directly frobbing the odb.
- When making history graphs, detect NaNs in the history data.
(- status page code for the TWIST event builder (precursor of the standard
event builder) stays uncommitted).
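The clamping and NaN detection amount to something like the following. This is only an illustration, not the mhttpd code; the function name clamp_history() and the choice to leave NaN samples untouched (so the plotting code can skip them) are assumptions.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: clamp n samples into [ymin, ymax] in place.
   NaN samples are counted and left as they are. Returns the NaN count. */
size_t clamp_history(double *y, size_t n, double ymin, double ymax)
{
    size_t nan_count = 0;
    size_t i;

    assert(y != NULL && ymin <= ymax);

    for (i = 0; i < n; i++) {
        if (isnan(y[i])) {
            nan_count++; /* comparisons with NaN are always false, so test first */
            continue;
        }
        if (y[i] < ymin)
            y[i] = ymin;
        else if (y[i] > ymax)
            y[i] = ymax;
    }
    return nan_count;
}
```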
K.O. |
186
|
16 Dec 2004 |
Konstantin Olchanski | Info | "cd /" in ss_daemon_init(), was- Commit local TWIST modifications |
> > - system.c: do not chdir("/") in ss_daemon_init()- it prevents us from ever
> > getting core dumps from midas daemons.
>
> The chdir("/") is from one of the unix text books. They say you HAVE to do it. If you start a
> daemon on an NFS file system, you cannot unmount that file system as long as the daemon is
> running.
Right, I remember this NFS problem from a while back.
This problem does not exist in the current crop of Linux systems (since Red Hat 7.3 at least) - they
either kill off all user programs or use "umount -f" and "umount -l".
"umount -l" works in any case to unmount a "busy" filesystem.
For systems where the NFS problem does still exist, one should do this: "mlogger -D" becomes "(cd /; mlogger -D)".
So I suspect that the "cd /" advice from the unix programming book is no longer as necessary
as it used to be. (Perhaps better advice would have been to "cd /tmp", so we could still get
core dumps from non-root daemons).
K.O. |
191
|
20 Jan 2005 |
Konstantin Olchanski | Suggestion | HOWTO create ROOT objects in the MIDAS analyzer |
With recent changes to mana.c, creation of user ROOT objects in the MIDAS
analyser has changed. Here is the new example code for creating ROOT objects
that are visible in ROODY and are saved into the histogram file.
1) in the "global" context (outside of any function)
#include <TH1D.h>
#include <TProfile.h>
static TH1D* gMyHist1 = 0;
static TProfile* gMyHist2 = 0;
2) In the analyzer "init" or "begin run" method, create the histogram:
//extern TFolder *gManaHistosFolder; // from midas.h
gMyHist1 = new TH1D("gMyHist1",...);
gMyHist2 = new TProfile("gMyHist2",...);
gManaHistosFolder->Add(gMyHist1);
gManaHistosFolder->Add(gMyHist2);
(note: this will produce a warning about "possible memory leak")
3) In the per-event method, fill the histograms
gMyHist1->Fill(x);
gMyHist2->Fill(x,y);
4) In the Makefile, where you compile the frontend, add "-DUSE_ROOT" right after
"-I$(ROOTSYS)/include"
K.O. |
192
|
20 Jan 2005 |
Konstantin Olchanski | Bug Report | Persistency problem with h1_book() & co |
The current h1_book() macros (and the previous example analyzer code) have an
odd persistency problem: for example, the user wants to change some histogram
limits, edits the h1_book() calls, rebuilds and restarts the analyzer, starts a
new run, and observes that all histograms are filled using the old limits, his
changes "did not take". The user panics, I get paged during the Holy Lunch Hour,
everybody is unhappy.
This is what I think happens:
1) analyzer starts
2) LoadRootHistograms() loads old histograms from file
3) user code calls h1_book()
4) h1_book template in midas.h does this (roughly):
hist = (TH1X *) gManaHistosFolder->FindObjectAny(name);
if (hist == NULL) {
   hist = new TH1X(name, title, bins, min, max);
}
5) since the histogram already exists (loaded from the file, with the old
limits), the TH1X constructor is not called at all, new histogram limits are
utterly ignored.
A possible solution is to unconditionally create the ROOT objects, like I do in
the example code posted at http://dasdevpc.triumf.ca:9080/Midas/191. That code
produces an annoying warning from ROOT about possible memory leaks. This could
be fixed by adding a two liner to "find and delete" the object before it is
created, tripling the number of user code lines per histogram (find & delete,
then create). Highly ugly.
midas.h macros (h1_book & co) can be fixed by adding checks for histogram limits
and such, but I would much prefer a generic solution/convention that would work
for arbitrary ROOT objects without MIDAS-specific wrappers (think TProfile,
TGraph, etc...).
Any suggestions?
K.O. |
200
|
25 Feb 2005 |
Konstantin Olchanski | Bug Fix | fixed: double free in FORMAT_MIDAS ybos.c causing lazylogger crashes |
We stumbled upon and fixed a "double free" bug in src/ybos.c causing crashes in
lazylogger writing .mid files in the FORMAT_MIDAS format (why does it use
ybos.c? Pierre says- for generic file i/o). Why this code had ever worked before
remains a mystery. K.O. |
204
|
31 Mar 2005 |
Konstantin Olchanski | Info | ODB dump format switched to XML |
> > All the XML functionality is implemented in the new mxml.c/h library
>
> mxml.c/h ... I separated it's CVS tree.
>
> The midas Makefile has been adjusted accordingly.
Looks like the midas mxml Makefile bits did not make it to CVS. Current Makefile
revision 1.67 does not have them and building midas from cvs sources fails because it
does not find mxml.h and mxml.c
K.O. |
207
|
21 Apr 2005 |
Konstantin Olchanski | Bug Report | pointers and segfault in yb_any_file_rclose |
> I'm getting segfaults in yb_any_file_rclose (closing a file opened with
> yb_any_file_ropen with type MIDAS).
>
> I think there are bugs with freeing from uninitialized pointers my.pmagta,
> my.pyh, and my.pylrl (which are only set when opening a YBOS file). These
> should be set to NULL in yb_any_file_ropen (case MIDAS). Likewise, the MIDAS
> format pointers my.pmp and my.pmrd should be NULLed for YBOS opens.
>
> It might be wise to also initialize the pointers in the "my" structure to null.
Do you see this crash even after my fix to (another?) double free?
K.O. |
208
|
21 Apr 2005 |
Konstantin Olchanski | Suggestion | Correct MIDASSYS setting? |
Current MIDAS versions nag me about setting the env.variable MIDASSYS to the
"midas installation directory", but I do not have one, so what should I set
MIDASSYS to? I checkout MIDAS from cvs into /home/olchansk/daq/midas, build it
there, run it from there. I never do "make install" (I am not "root" on every
machine; I am not the only MIDAS user on every machine). What should I set
MIDASSYS to? K.O. |
211
|
05 May 2005 |
Konstantin Olchanski | Bug Fix | fix: minor bit rot in the example experiment |
I fixed some minor bit rot in the example experiment: a few minor Makefile
problems, make the analyzer use the current histogram creation macros, etc. I
also added startup and shutdown scripts. These will be documented as we work
through them with our Summer student. K.O. |
212
|
02 Aug 2005 |
Konstantin Olchanski | Bug Fix | fix odb corruption when running analyzer for the first time |
I have been plagued by ODB corruption when I run the analyzer for the first time
after setting up the new experiment. Some time ago, I traced this to
mana.c::book_ttree() and now I found and fixed the bug, fix now committed to
midas cvs. In book_ttree(), db_find("/Analyzer/Bank switches") was returning an
error and setting hkey to zero. Then we called db_open_record() with hkey==0,
which caused ODB corruption later on. The normal db_validate_hkey() did not catch
this because it considers hkey==0 to be valid (when most likely it is not). K.O. |
213
|
18 Aug 2005 |
Konstantin Olchanski | Info | midas Makefile changes |
Minor Makefile changes:
- add "-m32" gcc flag to force 32-bit compilation on 64-bit Linux.
- do not link ybos.o into lazylogger and mdump.
K.O. |
214
|
18 Aug 2005 |
Konstantin Olchanski | Info | CAMAC register_cnaf_callback() |
Some time ago, the "remote CAMAC" functionality in mfe.c was made conditional on
HAVE_CAMAC. This flag is not set by default so remote camac calls silently do
not work, unless midas is compiled in a special way. I am too lazy to compile
midas differently depending on what hardware I use, so I split
register_cnaf_callback() into a separate file and made it easy to call directly
from the user front end.
I left the HAVE_CAMAC bits in mfe.c so people who use that would see no change.
Affected files:
Makefile (add cnaf_callback.o)
midas.h (add void register_cnaf_callback(int debug);)
mfe.c (move the rpc code to cnaf_callback.c, call register_cnaf_callback())
cnaf_callback.c (new file)
K.O. |
215
|
18 Aug 2005 |
Konstantin Olchanski | Info | minor changes to run transition code |
Minor changes to run transitions code:
- improve debug messages
- fail transition if cannot connect to one of the clients
K.O. |