Back Midas Rome Roody Rootana
  Midas DAQ System, Page 39 of 136  Not logged in ELOG logo
New entries since:Wed Dec 31 16:00:00 1969
ID Date Authorup Topic Subject
  770   27 Jun 2011 Konstantin OlchanskiInfoupdated mhttpd history "export" function
The mhttpd history "export" function has been converted to the new midas history
interface and should now work for SQL-based history systems. In the process,
improvements by Eoin Butler (CERN AD-5/ALPHA) were merged - adding a UNIX
timestamp and a better text timestamp. Also now "export" outputs the actual
values from the history file - the scaling values from the definition of the
history plot panel are no longer applied.

Here is an example of the new file format:

Time, Timestamp, Run, Run State, SLOW
2011.06.21 15:45:21, 1308696321, 13292, 3, -89.1007

svn rev 5104
K.O.
  771   27 Jun 2011 Konstantin OlchanskiInfomidas shared memory changes
A number of changes were made to the midas shared memory implementation for
Linux and MacOS:

1) SysV or POSIX shared memory compile-type choice is removed. Both shared
memory types are compiled-in and are selected at run time.
2) the shared memory type used by an experiment is recorded in the file
.SHM_TYPE.TXT. Currently implemented are "POSIXv2_SHM" (the new default for new
experiments), "POSIX_SHM", "MMAP_SHM" and "SYSV_SHM". (see system.c) (MMAP_SHM
is fully functional but is not recommended). The POSIXv2_SHM uses an improved
filename scheme (on Linux, see "ls -l /dev/shm") and permits multiple
experiments to coexist on a MacOS computer (where there is a severe limit on
shared memory filename length).
3) following a number of mishaps where "odbedit" has been run on the wrong
computer (causing havoc with ODB and .xxx.SHM files), for each experiment, the
hostname of the computer where the ODB shared memory is meant to reside is now
recorded in the file .SHM_HOST.TXT. Typically, this is the machine running
mserver, mhttpd and mlogger. If some client is accidentally started on the wrong
machine or if MIDAS_SERVER_HOST is accidentally left undefined, MIDAS will now
print a stern message reporting the hostname mismatch, tell the user to use the
mserver and refuse to run. The user has the choice of starting the client on the
correct computer (as reported in the error message), using the mserver (start
client with -H flag) or edit/delete the .SHM_HOST.TXT file (full pathname is
reported by the error message).

With this update, MIDAS on MacOS becomes fully functional (before, only one
experiment could be used at a time).

svn rev 5105
K.O.
  772   27 Jun 2011 Konstantin OlchanskiInfomlogger lock for runNNN.mid.gz files
By popular request, Stefan R. implemented a locking scheme for mlogger output files.

To use this function, set the mlogger ODB /Logger/Channels/NNN/Settings/Filename
to ".run%05dsub%05d.mid.gz" (note the leading dot).

In this mode, active output files will have a filename with a leading dot
(.run00001sub00001.mid.gz) while the file is being written to. After the file is
closed, it is renamed and the leading dot is removed.

To use this function with the lazylogger, please set ODB
"/Lazy/Foo/Settings/Filename format" to "run*.mid.gz,run*.xml" (note the leading
text "run"). Set "stay behind" to 0.

svn rev 5080 (or so, checking by Stefan R.)
K.O.
  773   05 Jul 2011 Konstantin OlchanskiInfomidas shared memory changes
> 2) the shared memory type used by an experiment is recorded in the file .SHM_TYPE.TXT.

An error with creating the file .SHM_TYPE.TXT was corrected in system.c svn rev 5125 - if file did not exist, it is 
created correctly, but MIDAS reports "cannot connect to ODB". Second try works correctly because the file exists 
now.

> 3) the hostname of the computer where the ODB shared memory is meant to reside is now
> recorded in the file .SHM_HOST.TXT.

This is causing problems on mobile computers where "hostname" changes all the time (i.e. set according to 
DHCP on whatever network happens to be connected).

If you run into this problem, keep deleting .SHM_HOST.TXT or use this workaround: disable the hostname check 
by making the file .SHM_HOST.TXT empty (zero length).

K.O.
  774   05 Jul 2011 Konstantin OlchanskiBug ReportMacOS network socket timeouts non-functional
It turns out that because of differences between select() syscall implementation between UNIX (MacOS, 
maybe BSD) and Linux,  network socket timeouts do not work.

This affects timeouts during run transitions (transition calls to dead clients do not timeout), maybe other 
places.

I am looking into fixing this. The main difficulty is with UNIX select() not updating the timeout parameter 
when it is interrupted by the MIDAS watchdog alarm signal. Linux select() subtracts the elapsed time from 
the timeout value and this code from system.c works correctly: while (1) { status = select(..., &timeout); if 
(status==0) break; } (value of timeout becomes smaller each time), while on MacOS it loops forever (value 
of timeout does not change).
K.O.
  775   10 Jul 2011 Konstantin OlchanskiBug Fixmidas shared memory changes
> > 2) the shared memory type used by an experiment is recorded in the file .SHM_TYPE.TXT.
> > 3) the hostname of the computer where the ODB shared memory is meant to reside is now
> > recorded in the file .SHM_HOST.TXT.

Due to a typo in src/system.c svn rev 5125, ss_shm_delete() did not work at all. This broke "odbedit -R", "odbedit -s 5000000" (to change ODB size), etc. 
Fixed in src/system.c svn rev 5134. (It is safe to update just tis one file to fix this problem).

Sorry for the inconvenience,
K.O.
  776   11 Jul 2011 Konstantin OlchanskiBug Fixmidas shared memory changes
> > > 2) the shared memory type used by an experiment is recorded in the file .SHM_TYPE.TXT.
> > > 3) the hostname of the computer where the ODB shared memory is meant to reside is now
> > > recorded in the file .SHM_HOST.TXT.


Because the mserver did not setup correct experiment name and path, POSIX shared memory did not work at all when used with the mserver. Fixed in mserver.c rev 5135


Sorry for the inconvenience,
K.O.
  777   11 Jul 2011 Konstantin OlchanskiInfoMake "STOP" run transition always succeed
Over the years, there was some back-and-forth changes in what happens to run transitions when some 
of the participants misbehave (do not respond to RPC calls, timeout, crash, etc).

The very original behaviour was to ignore all errors. This resulted in user confusion when some clients 
would start, some would not, data from frontends that missed the transition did not arrive, etc.

So it was changed to fail the transition if any client misbehaves.

This left mlogger (who is usually the first one to see the TR_START transition) in a funny state - output 
file is open, etc, but there is no run active. This was fixed by adding a TR_STARTABORT transition to tell 
mlogger, event builder & co that the just started run did not start after all.

Also at some point code was added to forcefully kill clients that do not respond to run transitions (do 
not respond to RPC, timeout, etc).

Recently, it was observed how during unattended overnight operation of a MIDAS DAQ system, with the 
logger set to "auto restart", some unnecessary clients misbehave during the run stop transition, and 
prevent the run from stopping and restarting. The user comes in the morning and is unhappy that data 
taking stopped some time during the night.

midas.c svn rev 5136 changes the TR_STOP transition to always succeed, even if some clients had 
transition errors. If these clients are unnecessary for normal operation of the DAQ, the following run 
"auto restart" will continue taking data. If those were important clients, data taking will continue the 
best it can - it *is* unattended operation - nobody is looking - but users can always setup alarms for 
checking that important clients are always running during data taking. (For very important clients, one 
can setup alarms to send email, send SMS messages, etc).

K.O.
  781   16 Dec 2011 Konstantin OlchanskiBug Reportbk_delete uses memcpy instead of memmove
> In midas.c, the bk_delete function removes a bank by decrementing the total
> event size and then copying the remaining banks into the location of the first
> using memcpy from string.h.


I confirm the documented difference between memcpy() and memmove() and I confirm the 
questionable use of memcpy() in bk_delete(). I think it should be memmove(). I made it so in my copy 
of midas, so this change will not be lost.

But I am not sure how to test it - I do not think I ever used bk_delete(). I will probably ponder upon 
this and do a blind commit.


K.O.
  784   29 Feb 2012 Konstantin OlchanskiBug ReportProblem with semaphores
Hi there! In the T2K/ND280 experiment in Japan, we keep having problems with MIDAS locking (probably 
of ODB). The symptoms are: some program reports a timeout waiting for the ODB lock, then all programs 
eventually die with this same error. Complete system meltdown. This does not look like the deadlock 
between locks for ODB, cm_msg and the data buffers that I looked into last year. It looks more like 
somebody locks ODB, dies and the Linux kernel fails to unlock the lock (via the SYSV "sem undo" 
function). But it is hard to confirm, hence this message:

The implementation of semaphores in MIDAS (used for locking ODB and the shared memory data buffers) 
uses the straight SYSV semaphore API - which lacks basic debugging features - there is no tracking of 
who locked what when, so if anything at all goes wrong at all, i.e. we are confronted with a timeout 
waiting for the ODB lock, the only corrective action possible is to kill all MIDAS clients and tell the user to 
start from scratch. There is no additional information available from the SYSV semaphore API to identify 
which MIDAS program caused the fault.

The POSIX semaphore API is even worse - no debugging features are available, *and* if a program dies 
while holding a lock, the lock stays locked forever (everybody else will wait forever or see a semaphore 
timeout, and then what?).

So I am looking for an "advanced semaphore library" to use in MIDAS. In addition to the boring functions 
of reliable locking and unlocking, it should support:
- wait with timeout
- remember who is holding the lock
- detect that the process holding the lock is dead and take corrective action (automatic unlock as done by 
SYSV semaphores, call back to user code where we can cleanup and unlock ourselves, etc)
- maybe permit recursive locking (not really required as ODB locks are already made recursive "by hand")
- maybe remember some of the locking history (so we can dump it into a log file when we detect a 
deadlock or other lock malfunction).

Quick google search only find sundry wrappers for SYSV and POSIX semaphores. How they deal with the 
problem of processes locking the semaphore and dying remains a mystery to me (other than telling users 
to remove the Ctrl-C button from their keyboard). BTW, we have seen this problem with several 
commercial applications that use SYSV semaphores but forget to enable the SEM_UNDO function).

Anyhow, if anybody can suggest such an advanced locking library it would be great. Will save me the 
effort of writing one.

K.O.
  788   25 Apr 2012 Konstantin OlchanskiBug ReportBuild error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
Stefan's fix is incomplete - the "gzFile" cast is needed for all calls to zlib, not just those that some version 
of GCC happens to complain about. Fixed.
svn rev 5286.

BTW, I read the midas elog via email and if you post html or elcode messages, I receive complete 
gibberish. For prompt service, please select message type "plain". (yes, you cannot use fancy colours and 
blinking text, but better than me not reading your stuff at all).

BTW2, for easier reading, please include error messages as plain text in your message. As opposed to 
compressed attachements.

K.O.
  791   10 Jun 2012 Konstantin OlchanskiBug Report_net_send_buffer realloc
> In midas.c, ...
>
> 1) _net_send_buffer is not set to NULL when declared.

_net_send_buffer is a global variable. All global variables are automatically initialized to zero before the program 
starts.

static char*x; // = NULL; is redundant
char*y=realloc(x, 100);  // x is NULL, usage is correct

> 2) cm_disconect_experiment() calls free(_net_send_buffer) but does not set its 
> value to NULL.

My copy of midas.c (svn rev 5256) sets _net_send_buffer to NULL:

   if (_net_send_buffer_size > 0) { 
      M_FREE(_net_send_buffer); 
      _net_send_buffer_size = 0; 
   } 
 
What version of midas do you have? (svn info .)

K.O.
  793   11 Jun 2012 Konstantin OlchanskiBug Report_net_send_buffer realloc
> > > In midas.c, ...
> > >
> > > 1) _net_send_buffer is not set to NULL when declared.
> 
> Ah,okay. I was not aware of this feature of global variables.
> 

RTFM K&R "The C programming language". 
http://en.wikipedia.org/wiki/The_C_Programming_Language

>  
> > > 2) cm_disconect_experiment() calls free(_net_send_buffer) but does not set 
> its value to NULL.
>

Confirmed. Sorry for confusion in my previous message. Set the pointer to NULL after free() is good practice.

But note that calling cm_connect and cm_disconnect multiple times is unusual use of MIDAS and you will most 
likely find more breakage.

K.O.
  795   13 Jun 2012 Konstantin OlchanskiBug ReportCannot start/stop run through mhttpd
> Revision: r5286 
> Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports 
> Problem:
> After building and installation, using the script 'start_daq.sh' to start
> 'sampleexpt'. Everything seems fine. But I cannot start a run through web. Using
> 'odbedit' and 'mtransition' to start/stop a run works fine. So, what may cause
> such a problem?

Well, it's mhttpd who cannot start the run, not you. So what happens when you press
the "start run" button? Any errors in midas.log or in midas messages? Is mtransition
in your PATH?

K.O.
  796   13 Jun 2012 Konstantin OlchanskiForumladd00.triumf.ca https ssl certificate update
The HTTPS SSL certificate on ladd00.triumf.ca has been updated. Same as the old
certificate, the new one is self-signed and your web browser may complain about
that and ask you to "save a security exception".

When you save the new certificate, you can verify that you are connected to the
real ladd00.triumf.ca by comparing the "SHA1 fingerprint" reported by your web
browser to the one given below (as reported by "svn update"):

Certificate information:
 - Hostname: ladd00.triumf.ca
 - Valid: from Wed, 13 Jun 2012 22:31:51 GMT until Thu, 13 Jun 2013 22:31:51 GMT
 - Issuer: DAQ, TRIUMF, Vancouver, BC, CA
 - Fingerprint: 82:95:78:cb:78:d3:93:1d:d4:c8:e8:1a:64:0f:62:04:2d:0e:c3:4a

K.O.
  800   14 Jun 2012 Konstantin OlchanskiBug ReportCannot start/stop run through mhttpd
> > I found the problem only appears when I run mhttpd in scripts, whether bash or python.
> > And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
> > are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
> > fine. So, what's the difference between invoking mhttpd directly and through a script?
> 
> When you start it with "-D", then mhttpd become a daemon. According to linux rules, it has to "cd /", so it lives in the 
> root directory, in order not to block any NFS mount/unmount. If something with the path is not correct then, mhttpd 
> cannot find mtransition then. Once I fixed that problem my moving mtransition to /usr/bin.
> 

I agree. Somehow mhttpd cannot run mtransition. I am not super happy with this dependance on user $PATH settings and the inability to capture error messages 
from attempts to start mtransition. I am now thinking in the direction of running mtransition code by forking. But remember that mlogger and the event builder also
have to use mtransition to stop runs (otherwise they can dead-lock). So an mhttpd-only solution is not good enough...

K.O.
  801   14 Jun 2012 Konstantin OlchanskiBug ReportCannot start/stop run through mhttpd
> > > Revision: r5286 
> > > Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports 
> 
> I found the problem only appears when I run mhttpd in scripts, whether bash or python.
> And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
> are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
> fine.

Right. I see Debian 6.0.5 just came out hot off the presses. Would be good to fix this problem.

As a work around, can you run mhttpd without "-D", but in the background, i.e. "mhttpd -p xxx >& mhttpd.log &"?

Also what are your $PATH settings?

> So, what's the difference between invoking mhttpd directly and through a script?

As Stefan mentioned, "-D" invokes some nasty unix magic to disconnect the process from the user login session. It is 
possible that this magic breaks in the latest Debian.

MIDAS "-D" does roughly the same thing as "nohup".

K.O.
  802   15 Jun 2012 Konstantin OlchanskiBug Reportbk_delete uses memcpy instead of memmove
> In midas.c, the bk_delete function removes a bank by decrementing the total
> event size and then copying the remaining banks into the location of the first
> using memcpy from string.h.

Replaced some memcpy() with memmove(), including bk_delete().

svn rev 5293
K.O.
  803   15 Jun 2012 Konstantin OlchanskiBug Report_net_send_buffer realloc
> 2) cm_disconect_experiment() calls free(_net_send_buffer) but does not set its 
> value to NULL.

Set pointer to NULL after free() in these files:

M       odb.c
M       sequencer.cxx
M       mlogger.cxx
M       mhttpd.cxx
M       midas.c

svn rev 5294
K.O.
  804   20 Jun 2012 Konstantin OlchanskiInfolazylogger write to HADOOP HDFS
I tried using the lazylogger "Disk" method to write into a HADOOP HDFS clustered filesystem and found a 
number of problems. I ended up replacing the lazylogger lazy_copy() function that still uses former YBOS 
code with a new lazy_disk_copy() function that uses generic fread/fwrite. Also fixed the situation where 
lazylogger cannot cleanly stop from the mhttpd "programs/stop" button while it is busy writing (the fix 
works only for the "Disk" method).

(Note that one can also use the "Script" method for writing into HDFS)

Anyhow, the new lazylogger writes into HDFS just fine and I expect that it would also work for writing into 
DCACHE using PNFS (if ever we get the SL6 PNFS working with our DCACHE servers).

Writing into our test HDFS cluster runs at about 20 MiBytes/sec for 1GB files with replication set to 3.

svn rev 5295
K.O.
ELOG V3.1.4-2e1708b5