ID |
Date |
Author |
Topic |
Subject |
177
|
14 Dec 2004 |
Konstantin Olchanski | Forum | use of assert in mhttpd | > We've had mhttpd aborting regularly since upgrading from midas-1.9.3. This
> happens during elog queries, and is due to an elog file that was incorrectly
> modified by hand.
(sorry for delayed reply, for reasons unknown, I did not get an email notice when this was posted)
Yes, I agree, error handling in midas elog code is insufficient (note missing error checks for
read() and lseek() system calls). Anything but "perfect" elog files would cause funny errors and
malfunctions.
> The modification to the file occurred 6 months ago.
> el_retrieve(midas.c:15683) now has several assert statements, one of which
> aborts the program on reading the bad entry.
I added those to fix problems with "broken last NN days" and with infinite looping in the elog code
that we observed in TWIST.
You are welcome to replace the assert() statements with proper error handling. I used to have some code
that could report the filename of the bad elog file. Can we also report the exact file location for broken
files.
Please send me the diff, I will commit it to midas cvs.
> Why is assert used, instead of an error return from the function (if
> necessary), and maybe an error message in the log file? Assert statements are
> often removed, using NDEBUG, for normal use.
I use assert() in several ways:
0) I want a core dump each time X happens. (This is the only reasonable action when facing memory/stack
corruption. The problems in the elog code were stack corruption).
1) "I am too lazy to write proper error handling code" so I just crash and burn. This includes the
case where "proper error handling" would be "too invasive".
2) the error is too bad (or too deep) and there is no reasonable way to recover. Print an error message
and dump core (for later analysis). I sometimes use "cm_msg(); abort()". (assert is "printf("error"); abort()")
Please refer to literature for philosophic discussions on uses of assert() (Argh! Stefan will have my
head again!), but I will mention that "abort() early, abort() often" I find very effective. BTW, this technique
is heavily used in the Linux kernel (oops(), bug(), panic()) with some good effect, too.
> The problem elog entry had one character removed, so end-of-file came before
> the end of the message. This could probably occur without the file being
> altered, if the disk containing the elog fills.
Yes, I think you are right. In TWIST, we have seen disk-full conditions break both elog and history.
K.O. |
176
|
25 Nov 2004 |
chris pearson | Forum | use of assert in mhttpd | We've had mhttpd aborting regularly since upgrading from midas-1.9.3. This
happens during elog queries, and is due to an elog file that was incorrectly
modified by hand. The modification to the file occurred 6 months ago.
el_retrieve(midas.c:15683) now has several assert statements, one of which
aborts the program on reading the bad entry.
Why is assert used, instead of an error return from the function (if
necessary), and maybe an error message in the log file? Assert statements are
often removed, using NDEBUG, for normal use.
Chris
The problem elog entry had one character removed, so end-of-file came before
the end of the message. This could probably occur without the file being
altered, if the disk containing the elog fills. |
175
|
24 Nov 2004 |
chris pearson | Info | midas on 64bit opteron | Midas, version 1.9.5 of 7th October, was installed, with a few changes, on a
64 bit opteron computer, running linux. For this processor, as for the alpha
processor, long integers and addresses are 64 bits. We added a new flag in the
Makefile,
250a251
> ARCH = $(shell uname -m)
377a379,381
> ifeq ($(ARCH),x86_64)
> OSFLAGS := $(OSFLAGS) -DX86_64
> endif
and extended the alpha-specific definitions, of DWORD and PTYPE, in midas.h to
include this case,
549c549
< #ifdef __alpha
---
> #if defined(__alpha) || defined(X86_64)
598c598
< #ifdef __alpha
---
> #if defined(__alpha) || defined(X86_64)
apart from this, there are a large number of cases where pointers are cast to
integers, without using the PTYPE definition. These all need to be changed by
hand, although these conversions should probably be removed anyway - in almost
all cases they are unnecessary, as just differences are being calculated.
There were also a number of warnings, which we ignored, where printf format
strings specified long integers, but the argument was not a long integer. Casts
should probably be added in all cases where the type of the argument can vary
depending on the machine.
A midas analyser was made, which was able to successfully replay some data, but
this was all that was tested.
Chris |
174
|
09 Nov 2004 |
Pierre-Andre Amaudruz | Bug Fix | New transition scheme | Problem:
If cm_set_transition_sequence() is used for changing the sequence number, the
command odbedit> start/stop/resume/pause -v report the propre sequence but the
action on the client side is actually not performed!
Fix:
Local transition table updated in midas.c (1.226)
Note:
The transition number under /system/clients/<pid>/transition...
is used internally. Changing it won't have any effect on the client action
if sequence number is not registered. |
173
|
04 Nov 2004 |
Stefan Ritt | Forum | Frontend code and the ODB | Hi Jan,
I usually keep under /Experiment/Run Parameters only those settings which are kind of "global" and thus of
interest to frontend *and* analyzer, like a run mode (data/calibration/cosmic/...). Settings more specific to a
frontend I keep under /Equipment/<name>/Settings where <name> is the equipment name the specific frontend
produces. In your case each frontend will then get its own tree (related to each fragment). Please note that
both discussed trees can contain a whole tree with subdirectories, which lets you organize your data better.
Best regards, Stefan. |
172
|
04 Nov 2004 |
Jan Wouters | Forum | Frontend code and the ODB | I would like to know whether all parameters used by the frontend code have to be in the "Experiment/
Run Parameters" section. This section can become big and difficult to maintain, because it is one single
big section of experim.h (EXP_PARAM_DEFINED). I have parameters the various frontends read at the
beginning of each run, which set the hardware settings of various devices. I would like to place these in
a section all their own, organized by device. Is this doable? |
171
|
02 Nov 2004 |
Renee Poutissou | Info | Event Builder info in mhttpd Status page | Information about the Event Builder statistics has been removed from the
Status page in mhttpd. I heard from Pierre that this information might
be redundant when using the new Event Builder format???
For the TWIST experiment, we are running and cannot change on the fly
to a new format Event Builder. It is very important for us to show the users
the rates and statistics coming out of the EventBuilder. I had to put this
piece of code back in mhttpd.
Can I put it back in the distribution? or do I have to put a special TWIST flag?
or do I have to keep reinserting this every time there is an update to mhttpd.c?
At the moment, TWIST is generating a couple of updates/week to mhttpd.c |
170
|
22 Oct 2004 |
Konstantin Olchanski | Bug Fix | mhttpd message colouring | I commited a fix to mhttpd logic that decides which messages should be shown in
"red" colour- before, any message with square brackets and colons would be
highlighted in red. Now only messages matching the pattern [...:...] are
highlighted. The decision logic was moved into a function message_red(). K.O. |
169
|
14 Oct 2004 |
Konstantin Olchanski | Bug Report | TWIST upgrade bombed... | > The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> crashes during shared memory data buffer accesses. I am looking into it and I
> will add information as I figure things out. K.O.
On second try, it looks like we are in business- the first try did not work
because of two mistakes:
1) I did not delete *all* old .SHM files (.ODB.SHM, .SYSTEM.SHM, .YBUF1.SHM,
.YBUF2.SHM). I deleted ODB.SHM, so odb worked, but forgot about the data buffers
SYSTEM.SHM & co and ended up with segmentation faults and core dumps in the buffer
management code caused by a mismatch of the old-midas buffers and new-midas code.
2) while debugging these core dumps, I made an error in my test code, so even
after I deleted the old data buffers, things still did not work. Talk about
over-debugging a problem...
K.O. |
168
|
14 Oct 2004 |
Konstantin Olchanski | Bug Report | lazylogger complains about zero-size files | With latest midas, I see this:
Thu Oct 14 19:31:17 2004 [Lazy_Tape] [lazylogger.c:1717:Lazy] lazy_file_exists
file run17567.ybs doesn't exists
Thu Oct 14 19:31:27 2004 [Lazy_Tape] [lazylogger.c:1717:Lazy] lazy_file_exists
file run17567.ybs doesn't exists
The file run17567.ybs has size zero:
-rw-r--r-- 1 twistonl users 950272 Oct 13 19:29
/twist/data_onl/current/run17565.ybs
-rw-r--r-- 1 twistonl users 950272 Oct 13 19:45
/twist/data_onl/current/run17566.ybs
-rw-r--r-- 1 twistonl users 0 Oct 13 20:00
/twist/data_onl/current/run17567.ybs
-rw-r--r-- 1 twistonl users 983040 Oct 13 20:03
/twist/data_onl/current/run17568.ybs
-rw-r--r-- 1 twistonl users 950272 Oct 13 20:26
/twist/data_onl/current/run17569.ybs
I am not sure how to fix this lazylogger logic. Please help.
K.O. |
167
|
14 Oct 2004 |
Stefan Ritt | Bug Report | TWIST upgrade bombed... | Agree.
Once you did the modification, please check following situation: Create a fresh
ODB withe increased size ("odbedit -s 2000000" for example). Then check that the
other clients "adopt" this increased size. Note that some experiments need a
bigger ODB, and I don't want to have them recompile all clients, that's why the
code in ss_shm_open() can attach to a *larger* shared memory. However, it should
not matter to the process, since the ODB (or SYSTEM) shared memory size is
stored in the pheader->key_size and pheader->data_size of each participating
process. So they should never write beyond the limits defined in that header.
The size to ss_shm_open() is only a "hint" if the shared memory does not exist,
and is nowhere later used in the code. |
166
|
13 Oct 2004 |
Konstantin Olchanski | Bug Report | TWIST upgrade bombed... | > > The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> > crashes during shared memory data buffer accesses. I am looking into it and I
> > will add information as I figure things out. K.O.
>
> Since 1.9.5 the EventBuilder has been modified. Please consult the documentation
> where the new mevb scheme is explained.
> Test of the mevb with up to 16 frontends (15 different CPUs) has been tested
> successfully. Data rate at the EventBuilder were measured about 50MB/s without the
> logger and ~30MB/s with the logger.
It turns out that TWIST uses a private mevb.c. We will consider upgrading to the
standard one.
K.O. |
165
|
13 Oct 2004 |
Konstantin Olchanski | Bug Report | TWIST upgrade bombed... | > The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> crashes during shared memory data buffer accesses. I am looking into it and I
> will add information as I figure things out. K.O.
I traced buffer memory corruption to a logic error in system.c::ss_shm_open(). If
a .SHM file exists, it's size is used as the size of the sysv shared memory
segment, even if the requested shared memory size is bigger, but the caller of
ss_shm_open() thinks it got all the requested memory. Eventually we try to use
the unallocated memory and crash. This is the proposed fix and I will commit it
after I retest the upgrade during the next few days.
[olchansk@send src]$ cvs diff -u system.c
olchansk@midas.psi.ch's password:
Index: system.c
===================================================================
RCS file: /usr/local/cvsroot/midas/src/system.c,v
retrieving revision 1.83
diff -u -r1.83 system.c
--- system.c 4 Oct 2004 07:04:01 -0000 1.83
+++ system.c 14 Oct 2004 05:51:16 -0000
@@ -544,8 +544,14 @@
} else {
/* if file exists, retrieve its size */
file_size = (INT) ss_file_size(file_name);
- if (file_size > 0)
+ if (file_size > 0) {
+ if (file_size < size) {
+ cm_msg(MERROR, "ss_shm_open", "Shared memory segment \'%s\' size
%d is smaller than requested size %d. Please remove it and try
again",file_name,file_size,size);
+ return SS_NO_MEMORY;
+ }
+
size = file_size;
+ }
}
/* get the shared memory, create if not existing */
K.O. |
164
|
13 Oct 2004 |
Pierre-Andre Amaudruz | Bug Report | TWIST upgrade bombed... | > The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> crashes during shared memory data buffer accesses. I am looking into it and I
> will add information as I figure things out. K.O.
Since 1.9.5 the EventBuilder has been modified. Please consult the documentation
where the new mevb scheme is explained.
Test of the mevb with up to 16 frontends (15 different CPUs) has been tested
successfully. Data rate at the EventBuilder were measured about 50MB/s without the
logger and ~30MB/s with the logger. |
163
|
13 Oct 2004 |
Konstantin Olchanski | Bug Report | TWIST upgrade bombed... | The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
crashes during shared memory data buffer accesses. I am looking into it and I
will add information as I figure things out. K.O. |
162
|
13 Oct 2004 |
Konstantin Olchanski | Suggestion | No al_clear_alarm()? | > > We have al_trigger_alarm(), but no matching al_clear_alarm(), and I need it to
> > clear my alarm once the alarm condition no longer exists. Any objections if I
> > add this function? K.O.
>
> call al_reset_alarm()
Thanks. I must be quite blind as I did not see al_reset_alarm() in midas.h. I se eit
now. Thanks.
K.O. |
161
|
13 Oct 2004 |
Stefan Ritt | Suggestion | No al_clear_alarm()? | > We have al_trigger_alarm(), but no matching al_clear_alarm(), and I need it to
> clear my alarm once the alarm condition no longer exists. Any objections if I
> add this function? K.O.
The idea is that once an alarm got triggered, it stays until the user
acknowledged, even if the alarm condition has been disappeared. Through mhttpd,
the user can press the "Reset" button, which then executes al_reset_alarm().
However, it is possible to call al_reset_alarm() directly from user code to
achieve the same thing. |
160
|
13 Oct 2004 |
Konstantin Olchanski | Suggestion | No al_clear_alarm()? | We have al_trigger_alarm(), but no matching al_clear_alarm(), and I need it to
clear my alarm once the alarm condition no longer exists. Any objections if I
add this function? K.O. |
159
|
13 Oct 2004 |
Stefan Ritt | Bug Report | silly odbedit "rename Display xxx/yyy" | > odbedit command "rename Display xxx/yyy" creates a key named "xxx/yyy" (yes,
> with a slash in the name) and this key cannot be deleted or renamed...
> K.O.
"rename" is "rename", not "mv" under Unix. If you want this functionality, put it
in and don't complain! |
158
|
13 Oct 2004 |
Konstantin Olchanski | Bug Report | silly odbedit "rename Display xxx/yyy" | odbedit command "rename Display xxx/yyy" creates a key named "xxx/yyy" (yes,
with a slash in the name) and this key cannot be deleted or renamed...
K.O. |
|