Back Midas Rome Roody Rootana
  Midas DAQ System, Page 21 of 146  Not logged in ELOG logo
New entries since:Wed Dec 31 16:00:00 1969
Entry  22 Dec 2005, Konstantin Olchanski, Info, How do I do custom event building? 
It turns out the the standard event builder fragment matching algorithm cannot
be used in my TPC application. I have two TPC-USB interfaces, which lack any
"busy" or synchronization logic. I send the hardware trigger into both
interfaces, and if one of them misses it, the data is out of sync forever. Consider:

Hardware
trigger    trig1     trig2    trig3    trig4
TPC01      serial1   serial2  serial3  serial4
TPC02      serial1  (missing) serial2  serial3

With the event builder matching only the event serial numbers, the first event
will be okey, but the second event will have trig2 data from TPC01 and trig3
data from TPC02, etc.

The problem exists even if the TPC-USB interfaces do not miss any triggers:
during begin and end of run, the interfaces are enabled one at a time, so if a
trigger arrives after the first interface was enabled, but before the second is
enabled, the data starts being out of sync (and if the same happens during the
end-of-run, the event counts from both frontends will match, but all data would
*still* be out of sync).

Obviously additional data is needed to match the fragments.

So in each frontend, I have a high-precision timestamp (gettimeofday(), usec
resolution) and I would like to have the event builder match the timestamps
instead of event serial numbers. What is the best way to do this? The mevb.c
code does not have any user callbacks for checking "do these fragments belong to
the same event?".

P.S. The event rate will be about 1/sec from cosmic ray tests and at most
10-50/sec in the M11 beam line at TRIUMF, at these low rates, the gettimeofday()
timestamps should be adequate.

K.O.
    Reply  23 Dec 2005, Stefan Ritt, Info, midas max event size? 
> My TPC events are fairly large: 18 FEC cards * 128 channels per card * 2 Kbytes
> per channel = about 4 Mbytes. In my
> frontend, when I request this event size, MIDAS complaints (in mfe.c) that it is
> bigger than MAX_EVENT_SIZE, which
> is set to 0.5 Mbytes in midas.h. What is the best way to deal with this? Should
> we increase MAX_EVENT_SIZE to
> something bigger? Remove the MAX_EVENT_SIZE limitation altogether?

If you teach me how to remove the MAX_EVENT_SIZE, that would be perfect!

Unfortunately the limit comes from the shared memory on the back end (the so-called
"SYSTEM" shared memory). Due to the structure of the buffer manager, the shared
memory has to hold at least two events simultaneously. And once the shared memeory
is created, it's size cannot be changed without restarting all the clients. That's
the origin of the MAX_EVENT_SIZE. In former days, the total allowed shared memory on
a typical linux machine was 2MB. That's why I set MAX_EVENT_SIZE to 0.5 MB, so midas
takes 2*0.5MB=1MB plus 0.2MB for the ODB, leaving 0.8MB for other applications.
Nowadays, the shared memory might be bigger (actually it's a parameter during kernel
compilation), so one could consider increasing the default MAX_EVENT_SIZE. If you
make a survey of the shared memory sizes in some of the current distributions, we
can choose a safe value.

> For now, I increased the value MAX_EVENT_SIZE & co to (10*1024*1024) and it
> seems to work (I also had to bump the
> sanity check in bm_open_buffer() from 10E6 to 100E6). With 1/4 of the FEC cards,
> the event size is 1 Mbyte at ~6
> ev/sec the machine is almost idle, with the biggest CPU user being the event
> builder at 10% CPU utilization.

I made sure that there is no other limitation as the one given by MAX_EVENT_SIZE, so
it should work fine. Thanks for telling me the wrong sanity check, that should be
changed in the repository.
    Reply  23 Dec 2005, Stefan Ritt, Info, How do I do custom event building? 
> It turns out the the standard event builder fragment matching algorithm cannot
> be used in my TPC application. I have two TPC-USB interfaces, which lack any
> "busy" or synchronization logic. I send the hardware trigger into both
> interfaces, and if one of them misses it, the data is out of sync forever. Consider:
> 
> Hardware
> trigger    trig1     trig2    trig3    trig4
> TPC01      serial1   serial2  serial3  serial4
> TPC02      serial1  (missing) serial2  serial3
> 
> With the event builder matching only the event serial numbers, the first event
> will be okey, but the second event will have trig2 data from TPC01 and trig3
> data from TPC02, etc.

Well, I would say: this is a very poor design of an experiment. Before curing the
problems in software, I first would consider a redesign of the data readout scheme with
a global hardware trigger and a hardware busy.

> So in each frontend, I have a high-precision timestamp (gettimeofday(), usec
> resolution) and I would like to have the event builder match the timestamps
> instead of event serial numbers.

What do you do if the frontend clock drifts away? I have seen drifts of up to 10 sec/day
on some PCs, so your required accuracy of 1/50 s would be violated after 3 minutes. You
would have to synchronize your clocks constantly. If your synchronization algorithm
determines a clock is out of sync and adjusts it, and the delta t is more than 1/50 sec,
you are screwed.

So all together I conclude that this proposed synchronization scheme is pretty dangerous
and could ruin the whole experiment.

> What is the best way to do this? The mevb.c
> code does not have any user callbacks for checking "do these fragments belong to
> the same event?".

Pierre can answer that.

- Stefan
    Reply  03 Jan 2006, John O'Donnell, Info, How do I do custom event building? 
At DANCE we have a similar issue.  We are still doing "software
handshaking" between multiple frontends (15 which read data, and 16th
with direct accessto the trigger logic), and we apply a time stamp
using gettimeofday().  We use the regular mevb, sorting on serial number.

In the analyzer (MIDAS or ROME) we then keep a big circular buffer of
event fragments, which are rebuilt into new events based on the time stamp
obtained from gettimeofday().  We keep the system clocks synchronized
(often to within about 1ms) using ntp (need to average over several
ntp servers to avoid issues with network noise).  ntp can take a while
to stabilize, so we never reboot our computers... (well almost never).
We have a slow control frontend which monitors the ntp time offsets and
put's them in the history system for easy visualization.

Occasionally we seem to get in a mess, but somehow this fixes itself on
the next run, so it has been a useable system.  Maybe one day we will
get hardware handshaking between the frontend computers and the trigger
logic, but in the meantime we are taking data.

John.
Entry  23 Mar 2006, Sergio Ballestrero, Info, svn@savannah.psi.ch down ? 
 Hi,
I was trying to update the checkout of Midas, but it looks like something is not
working - maybe a component of the Savannah system:
[sergio@daq-pc midas-SVN]$ svn update
svn@savannah.psi.ch's password: svn
unix dgram connect: Connection refused at /bin/cvssh.pl line 32
no connection to syslog available at /bin/cvssh.pl line 32
svn: Connection closed unexpectedly

my .svn/entries says (amongst the rest)
 url="svn+ssh://svn@savannah.psi.ch/afs/psi.ch/project/meg/svn/midas/trunk"
and yes, it used to work well... 

Cheers,
  Sergio
    Reply  26 Mar 2006, Stefan Ritt, Info, svn@savannah.psi.ch down ? 
>  Hi,
> I was trying to update the checkout of Midas, but it looks like something is not
> working - maybe a component of the Savannah system:
> [sergio@daq-pc midas-SVN]$ svn update
> svn@savannah.psi.ch's password: svn
> unix dgram connect: Connection refused at /bin/cvssh.pl line 32
> no connection to syslog available at /bin/cvssh.pl line 32
> svn: Connection closed unexpectedly
> 
> my .svn/entries says (amongst the rest)
>  url="svn+ssh://svn@savannah.psi.ch/afs/psi.ch/project/meg/svn/midas/trunk"
> and yes, it used to work well... 
> 
> Cheers,
>   Sergio

I just tried now and it seemed to work fine. Do you still have the problem?

- Stefan
    Reply  27 Mar 2006, Sergio Ballestrero, Info, svn@savannah.psi.ch down ? 
> I just tried now and it seemed to work fine. Do you still have the problem?
> 
> - Stefan

 The problem was still there this morning, shortly after seeing your mail, but seems
to be fixed now.
 BTW, which is the best way to submit patches ? I have a version of khyt1331 for Linux
kernel 2.6 (we are running Scientific Linux 4.1), and a few smaller things, mostly in
the examples. 

 Thanks, Sergio
Entry  13 Jun 2006, Stefan Ritt, Info, ZLIB dependency modified 
Due to recent problems with the ROME analyzer having zlib.h both in the
system and in the midas tree it has been decided to change the zlib policy in midas. By default, zlib support is not included in the midas analyzer. If one want it (but I guess only very few experiments need that), one can do a

make NEED_ZLIB=1

to compile zlib support into mana.c

Under linux (&Co), the zlib is these days normally pre-installed. The header file will therefor be taken from /usr/include and the library from /usr/lib/libz.a. Under Windows, the zlib is still included in the distribution, and has to be manually added to the Visual C++ project file.
Entry  13 Jun 2006, Stefan Ritt, Info, Scheduler changed for slow control equipment 
The schedule in mfe.c is used both for "normal" front-ends and for "slow-control" front-ends. Unfortunately it was only optimized for the first class. This lead to the fact that the slow control equipment was read out at different speed depending if the run is started or not. Furthermore, the maximum readout speed was somehow limited. This has been changed in the current version of mfe.c (SVN revision 3146). There are now two ways to control the readout speed of slow control equipment:

1) The "event limit" in the equipment list can be used as minimum time between readouts. I'm not happy about the "mis-use" of this variable, but it has been there since the beginning. If I would change it now, all front-ends on this world would have to be changed, which I maybe not a good idea. If this event limit is set to let's say 10, then the slow control equipment is read out with a maximum speed of 1/10ms = 100Hz. That means up to 100 variables (not complete equipments) are read out per second. If an equipment has 200 variables, each variable is then read out every two seconds of course. This number can be used to limit the readout speed differently for different equipments. Like one might want to read a sensitive pressure as often as possible, but some beamline magnet values only once every minute.

2) By default, the scheduler runs now at "full speed" when slow control equipment is present, resulting in a 100% CPU usage. To avoid this, following code can be added into the frontend_loop function:

BOOL frontend_call_loop = TRUE;

INT frontend_loop()
{
/* don't eat up all CPU time */
return cm_yield(10);
}

This limits the readout speed of all slow control equipment again to 100Hz, but avoids the 100% CPU usage. On most operating systems, the minimum time is 10ms as shown above, since this is the basic time slice of a process.

The readout scheme of slow control equipment will be re-visited this summer, when multi-threaded slow control front-ends will be implemented.
Entry  07 Aug 2006, Stefan Ritt, Info, New multi-threaded midas slow control system 
Multi-threaded slow control system

The Midas slow control system has been modified to support multi-threaded slow control front-ends. Each device gets it's own thread in the front-end, which has several advantages:

- the communication of all devices runs in parallel and therefor is much faster
- slow devices cannot block any more the front-end. Response times to run transitions etc. become therefore much faster.

This modification requires some minor modifications in the existing class and device drivers.

Dropping of CMD_xxx_ALL commands

The slow control commands CMD_SET_ALL, CMD_GET_ALL, CMD_SET_CURRENT_LIMIT_ALL, CMD_GET_CURRENT_LIMIT_ALL, etc. have been dropped. They were there to accomodate some slow devices, which sometimes works a bit faster if all channels are set or read at once. Since the inter-thread communication scheme implemented now does only allow passing one channel at a time, the "ALL" functions cannot be supported any more. On the other hand this is not such an issue any more, since slow devices are handled now in parallel, speeding up things considereably.

The command have been removed from midas.h and from all device and class drivers coming with the midas distribution. If you have your own drivers, just delete the sections wich use these commands.

Calling the device driver inside the class driver

The device drivers have now to be called differently in the class driver. The reason for that is that in a multi-threaded front-end, there is only one central device driver dispatcher, which communicates with the individual device driver threads. The device drivers do not need to be modified, but all existing class drivers need modification, if they are going to be run in a multi-threaded front-end. Old class drivers which are not used in a multi-threaded front-end do not to be modified.

Following modifications are necessary:

  • Remove following line:
    #define DRIVER(_i) ...

  • Find all lines containing
    DRIVER(i)(CMD_xxx, info->dd_info[i], ...)

    and replace them with
    device_driver(info->driver[i], CMD_xxx, ...)

    note that info->dd_info[i] is not passed any more. Instead, you pass info->driver[i]. Pleae note that the arguments passed after CMD_xxx are not checked by the compiler, since they are a variable argument list. Any error there will not produce a compiler warning, but will just crash the front-end.

  • Find the line with
    status = pequipment->driver[i].dd(CMD_INIT, hKey, &pequipment->driver[i].dd_info,
                                            pequipment->driver[i].channels,
                                            pequipment->driver[i].flags,
                                            pequipment->driver[i].bd);

    and replace it with
    status = device_driver(&pequipment->driver[i], CMD_INIT, hKey);

  • Find the line with
    pequipment->driver[i].dd(CMD_EXIT, pequipment->driver[i].dd_info);

    and replace it with
    device_driver(&pequipment->driver[i], CMD_EXIT);

  • Find following lines
    hv_info->driver[i] = pequipment->driver[index].dd;
    hv_info->dd_info[i] = pequipment->driver[index].dd_info;
    hv_info->channel_offset[i] = offset;
    hv_info->flags[i] = pequipment->driver[index].flags;

    and replace them with
    hv_info->driver[i] = &pequipment->driver[index];
    hv_info->channel_offset[i] = offset;

The class drivers multi.c and generic.c can be used as a reference for these modifications.

Implementing CMD_STOP command

For multithread-enabled device drivers it is necessary to support the CMD_STOP command, which is needed to stop all device threads before the actual device gets closed. Following code is necessary:
INT cd_xxx(INT cmd, EQUIPMENT * pequipment)
{
   INT i, status;

   switch (cmd) {
   case CMD_INIT:
      ...

   case CMD_STOP:
      for (i = 0; pequipment->driver[i].dd != NULL &&
                  pequipment->driver[i].flags & DF_MULTITHREAD ; i++)
         status = device_driver(&pequipment->driver[i], CMD_STOP);
      break;

   case CMD_IDLE:
      ...

   return status;
}

Enabling multi-thread support

To turn on multi-thread support for a device, the flag DF_MULTITHREAD must be used in the front-end user code device driver list, such as
DEVICE_DRIVER multi_driver[] = {
   {"Input", nulldev, 2, null, DF_INPUT | DF_MULTITHREAD},
   {"Output", nulldev, 2, null, DF_OUTPUT | DF_MULTITHREAD},
   {""}
};
Entry  11 Feb 2007, Konstantin Olchanski, Info, svn and "make indent" trashed my svn checkout tree... 
Fuming, fuming, fuming.

The combination of "make indent" and "svn update" completely trashed my work copy of midas. Half of 
the files now show as status "M", half as status "C" ("in conflict"), even those I never edited myself (e.g. 
mscb firmware files).

I think what happened as that once I ran "make indent", the indent program did things to the source 
files (changed indentation, added spaces in "foo(a,b,c); --> foo(a, b, c);" etc, so now svn thinks that I 
edited the files and they are in conflict with later modifications.

I suggest that nobody ever ever ever should use "make indent", and if they do, they should better 
commit their "changes" made by indent very quickly, before their midas tree is trashed by the next "svn 
update".

And if they commit the changes made by "make indent", beware that "make indent" is not idempotent, 
running it multiple times, it keeps changing files (keeps moving some dox comments around).

Also beware of entering a tug-of-war with Stefan - at least on my machines, my "make indent" seems 
to produce different output from his.

Still fuming, even after some venting...
K.O.
    Reply  15 Feb 2007, Ryu Sawada, Info, Latest FC5 Compilation attempt 
On February 13, 2007, gcc 4.1.2 was released.
I checked this version, and it compiles midas successfully,

GCC 3                    - OK
GCC 4.0                  - OK
GCC 4.1.0 and 4.1.1      - Bad
GCC 4.1.2                - OK
GCC 4.2                  - This is not released. Development version of GCC 4.2 is OK
Entry  23 Feb 2007, Konstantin Olchanski, Info, RFC- support for writing to removable hard disk storage 
At triumf, we are developing a system to use removable hard drives to store data collected by midas 
daq stations. The basic idea is to replace storage on 300 GB DLT tapes with storage on removable 
esata, usb2 or firewire 750 GB hard drives.

To minimize culture shock, we stay as close as possible to the "tape" paradigm. Two removable disks 
are used in tandem. Data is written to the first removable disk until it is full. Then midas automatically 
switches to the second disk and asks the operator to replace the full disk with a blank disk. Similar to 
handling tapes, the operator takes the full disk and stores it on the shelf (offline); takes a blank disk 
and connects it to the computer. To read data from one of the disks, the operator takes the disk from 
the shelf and connects it to the daq computer or to some other computer equipped with a compatible 
removable storage bay. The full data disks are mounted read-only to prevent accidental data 
modifications.

Two pieces of software are needed to implement this system:

1) midas support for switching to alternate output disks as they become full. Data could be written to 
the removable disk directly by the mlogger (no extra data copy on local disks) or by the lazylogger 
(mlogger writes the data to the local disk, then the lazylogger copies it to the removable disk). Writing 
directly to the removable disk is more efficient as it avoids the one extra data copy operation by the 
lazylogger.

2) a user interface utility for mounting and dismounting removable disks. Handling of removable disks 
cannot be fully automatic: before unplugging a removable disk, the user has to inform the system; after 
connecting a removable disk, the user has to tell the system to mount it read-only (for existing data), 
read-write (to add more data) or to initialize a blank disk (fdisk+mkfs). (Also, some SATA interfaces do 
not implement automatic hot-plug: they have to be manually told "please look for new disks").

We are presently evaluating various internal SATA hot-plug enclosures. We evaluated external eSATA 
and USB2 enclosures and decided not to use them: while the performance is adequate, presence of 
extra bulky components (eSATA and USB cables, non-standardized power bricks) and the extra cost of 
eSATA and USB hard drive enclosures makes them unattractive.

I am open to suggestions and comments. I am most interested in hearing which data path (mlogger or 
the lazylogger) would be most useful for other users.

K.O.
Entry  23 Feb 2007, Konstantin Olchanski, Info, RFC- history system improvements 
While running the ALPHA experiment at CERN, we stressed and broke the MIDAS history system. We 
generated about 0.5 GB of history data per day, and this killed the performance of the history plot 
system in mhttpd - we had to wait for *minutes* to look at any plots of any variables.

One way to address this problem could be by changing the way ALPHA slow controls data is collected.

Another way to address this problem could be by improving the midas history system by removing 
some of the existing limitations and inefficiencies, enabling it to handle the ever increasing data 
volumes we keep throwing at it.

I feel the second approach (improving midas) is more useful in general and it appears that big 
improvements can be made by small modifications of existing code. No rewrites of midas are required. 
Read on.

Issue 1: in the mlogger, history is recorded with fairly coarse granularity.

For an equipment, if any varible changes, *all* variables for that equipment are written into the history 
file.

Historically, this worked fairly well for experiments with low data rates (a few history changes per 
minute) and with variables equally distributed between different equipments. But even for a modest 
sized experiment like TRIUMF-E614-TWIST, recording many variables when only one has changed has 
been a visible inefficiency. Current experiments wish to record more history data more frequently, but 
even with latest and greatest hardware, in the case of ALPHA, this inefficiency has become a 
performance killer.

One could solve this problem by refactoring the data (one variable per equipment/one equipment per 
variable). I find this approach inelegant and contrary to the "midas way" (whatever that is).

An alternative would be to change the mlogger to record history with per-variable granularity. When 
one variable changes, only that variable is recorded. Preliminary examination of the existing code 
indicates that history writing in the mlogger is already structured in a way that makes it easy to 
implement, while the history reading code does not seem to need any changes at all.

Issue 2: all history data is recorded into a single file.

Again, this has worked well historically. In fact, until not so long ago, it was the only sane way to record 
history data because operating systems could not efficiently write data into multiple files at the same 
time. Insifficient data buffering, suboptimal storage allocation strategies - all leading to bad 
performance. Latest Linux kernels have largely resolved all such issues.

The present problem arises when recording large amounts of history data (say 100 variables) and then 
making a history plot of 1 variable. Because data for the one variable of interest is spread across the 
whole file, effectively, the whole file has to be read into memory, data for 1 variable collected and data 
for the other 99 variables skipped.

In this case, a speed up by a factor of 100 could be obtained by recording (say) one variable per history 
file. (Yes, the history code does use "lseek", but the seek granularity of modern disks is very coarse and 
in my tests, reading the whole file (streaming) is almost faster than seeking through it).

One has to be very careful when looking at these numbers and running benchmarks. Modern computers 
with fast disks and large RAM performs very well no matter how history data is stored and organized. 
Performance problems surface only under the load when running the production system, when the 
disks are busy recording the main data stream and all RAM is consumed by user applications doing 
data analysis.

The obvious solution to this problem is to record each variable into a separate data file. This will 
require modifications to the history writing code in the mlogger and to the history reading code in 
mhttpd, mhist & co.

An extra challenge in this tast is to minimize changes to the existing code and to keep compatibility 
with the existing data files - new code should be able to read existing data files.

I propose to organize data into subdirectores:
history/equipmentNNN/variableVVV/YYMMDD.hst

This scheme does two good things for the history plotting in mhttpd:

1) note that mhttpd always plots one variable at a time, and the variables are addressed by equipment 
(int) and variable name (string) (plus the array index). In the proposed scheme, the code would know 
exactly which history file to open to get the data, no scanning of directories or seeking inside the 
history file.

2) when setting up mhttpd history plots, the code can easily see what equipment and variables exist 
and *ever existed*. The present code only examines the latest history file and cannot see variables that 
have been deleted (or not yet written into the existing file). For example, one cannot see variables that 
existed in the 2005 history but were removed (or renamed) in 2006. (Yes, it can be done by an expert 
using mhist to examine the 2005 history files and odbedit to manually setup the history plots).

Over the next few weeks, I will proceed with implementing these two improvements: (1) mlogger write 
history with per-variable granularity; (2) history file split into one-file-per-variable. If my initial 
assessment is correct and the changes indeed are small, contained, non-intrusive and compatible with 
existing history files, I will submit them for inclusion into mainline midas.

K.O.
    Reply  23 Feb 2007, John M O'Donnell, Info, RFC- support for writing to removable hard disk storage 
We stopped using tapes at Los Alamos a while ago.  The model we use is:

write data with mlogger to a local RAID system.  This is NFS mounted read only on teh analysis machines, and
becomes the working copy for most tasks.  Copy data to external hardrives.  We have been using USB.  The USB
system is sometime a little flaky (lnux 2.4.21-7, so we have a computer dedicated to this task.  The USB driver
can be reloaded, or if the user is not so knowledgeable, the copmuter can be rebooted.  users on this computer
have sudo privs, so they can format hard drives.  The disks are inserted into boxes while in use, and stored on
a shelf for data archival, so we don't have a lot of enclosures.

I use the automounter to mount and unmount the drives.  With a 10 second timeout, the user needs only to wait a
few seconds before unplugging the disk.  (cat /proc/mounts allows them to check if they want.) dmesg allows
them to find the drive letter.  This works for any device which appears later as a SCSI disk.  The automounter
manages /mnt/usb for vfat formatted devices, and /mnt/usbl for ext3 formatted devices (preferred for data
archiving).

autofs config files are:

/etc/auto.usb

# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# Details may be found in the autofs(5) manpage
 
*       -fstype=auto,nosuid,nodev,umask=0000,noatime    :/dev/&

/etc/auto.usbl

# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# Details may be found in the autofs(5) manpage
 
*       -fstype=auto,nosuid,nodev       :/dev/&

/etc/auto.master contains

/mnt/usb                /etc/auto.usb  --timeout=10
/mnt/usbl               /etc/auto.usbl --timeout=10


John.

> At triumf, we are developing a system to use removable hard drives to store data collected by midas 
> daq stations. The basic idea is to replace storage on 300 GB DLT tapes with storage on removable 
> esata, usb2 or firewire 750 GB hard drives.
> 
> To minimize culture shock, we stay as close as possible to the "tape" paradigm. Two removable disks 
> are used in tandem. Data is written to the first removable disk until it is full. Then midas automatically 
> switches to the second disk and asks the operator to replace the full disk with a blank disk. Similar to 
> handling tapes, the operator takes the full disk and stores it on the shelf (offline); takes a blank disk 
> and connects it to the computer. To read data from one of the disks, the operator takes the disk from 
> the shelf and connects it to the daq computer or to some other computer equipped with a compatible 
> removable storage bay. The full data disks are mounted read-only to prevent accidental data 
> modifications.
> 
> Two pieces of software are needed to implement this system:
> 
> 1) midas support for switching to alternate output disks as they become full. Data could be written to 
> the removable disk directly by the mlogger (no extra data copy on local disks) or by the lazylogger 
> (mlogger writes the data to the local disk, then the lazylogger copies it to the removable disk). Writing 
> directly to the removable disk is more efficient as it avoids the one extra data copy operation by the 
> lazylogger.
> 
> 2) a user interface utility for mounting and dismounting removable disks. Handling of removable disks 
> cannot be fully automatic: before unplugging a removable disk, the user has to inform the system; after 
> connecting a removable disk, the user has to tell the system to mount it read-only (for existing data), 
> read-write (to add more data) or to initialize a blank disk (fdisk+mkfs). (Also, some SATA interfaces do 
> not implement automatic hot-plug: they have to be manually told "please look for new disks").
> 
> We are presently evaluating various internal SATA hot-plug enclosures. We evaluated external eSATA 
> and USB2 enclosures and decided not to use them: while the performance is adequate, presence of 
> extra bulky components (eSATA and USB cables, non-standardized power bricks) and the extra cost of 
> eSATA and USB hard drive enclosures makes them unattractive.
> 
> I am open to suggestions and comments. I am most interested in hearing which data path (mlogger or 
> the lazylogger) would be most useful for other users.
> 
> K.O.
    Reply  26 Feb 2007, Stefan Ritt, Info, RFC- history system improvements 
I agree to what you propose. I'm pretty sure you are right in getting a significant improvement in readout speed
of the history system. So far there was no big request for improving the history system, since the performance in
the experiments I was involved in was good. In MEG for example, we have ~20MB of history data per day, and all
plots even going back some months can be made in a couple of seconds. Have a look for example at

http://midas.psi.ch/megon00/HS/PCS/Pressures.gif?hscale=1843200&hoffset=-5068800

This plot stretches over two weeks and involves ~500 MB of history data, and is prepared in a couple of seconds.
The key question here is how big the disk cache of the OS is. The above plot does not read all 500 MB, but skips
many data points in order to obtain ~1000 data points (one per pixel) for the requested period. To find these
data points, it reads and scans the history index files (yymmdd.idx), which are only a few percent of the
yymmdd.hst data files. The index file contains only the time stamp, the event id and the location of the event in
the *.hst file. Scanning the index file is as efficient as scanning a history file with a single variable. Now
comes the access of the history file. For ~1000 data points, 1000 locations have to be read. This requires
reading in the FAT table for the history file and accessing the sector clusters containing the data. In worst
case one has to read 1000 clusters. With a cluster size of 2kB this will be 2MB of data, something which can be
read very quickly. On the MEG system I observe that the first history plot takes about 5 seconds, while all
consecutive plots take about 1 second. This indicates that the FAT information is cached by the OS. This depends
of course as you indicated correctly on how much memory is available for disk caching, how many processes are
running etc. and will finally determine how fast your history access will be.

So if you implement your proposed new scheme, please consider the following:

- Scanning a single variable file is about the same as scanning the current index file. You save however the
access to the data file. If you plot several variables together, you have to access several "single variable
files", so your access time scales with the number of variables. In the current system, it's likely that
different variables from the same event are located in the same cluster. So you have to read the history file
once for each variable, but after the first variable the sectors of interest are very likely cached by the OS. So
I would estimate that the break-even point is about 2-3 variables. I mean if you read more that three variables,
your proposed method might get slower than the current one. This is of course not the case if there are very many
events in the history file. In that case the index file might be much bigger, since it gets a new entry if *any*
variable in an event changes. If all index file together are bigger than you disk cache, the system will become
slow (and I guess that's what you see). In MEG, the index file is about 1MB per day, so a few weeks fit easily
into the disk cache.

- In order not to get too much data, the history system needs fine tuning. Each slow control system class driver
as an "update threshold", which is used to determine if a variable has "changed". For some noisy channels, it
might be worth to set the threshold at 3 sigma of the noise level (RMS). This can reduce your history data
dramatically. For some equipment, you even might consider to define a minimum update period. This is done via
"/Equipment/<name>/Common/Log history". If that variable is set to 10, the time between two consecutive history
records is at least 10 seconds. For some temperatures for example it might make sense to set this even to one
minute or so, depending on how fast your temperatures change.

- If you implement a per-variable history, you probably have to use the per-event hot link in the ODB. Otherwise
you would exceed the number of hot links MAX_OPEN_RECORDS which is currently 256. If you then get a hot link
update, you have to check manually which variable(s) have changed in log_history() in mlogger.c

- Before you actually go and implement the full system, I would write some small test code to "simulate" the new
scheme. Write some dummy files with the full data you expect in the ALPHA experiment and see what the improvement
is under realistic conditions. Only if you see a big improvement it's worth to implement the full code. Test this
on various machine to get a better overview. Maybe it's worth testing different file systems and cluster sizes as
well.

- If there is an improvement, I'm more than happy to replace the current history code in midas. It might however
not be clean to have a heterogeneous history system, where some files are in the old format and some in the new.
It might be better to write a little conversion routine which converts the old format into the new one, even
omitting records where single variables did not change. This conversion could be even put into the standard
mlogger code and is executed automatically if the logger is started first and finds some old data files.

Even if the speed improvement is not so big, one will certainly win a lot on disk file size (like if only one
variable out of 100 changes). This will probably make it worth to implement anyhow.
    Reply  26 Feb 2007, Stefan Ritt, Info, RFC- support for writing to removable hard disk storage 
In the MEG experiment, we simply installed 100TB of RAID disks and don't need to change anything Wink

But seriously, you are right that such a system might be beneficial. I propose to extend the current logger code to switch disks. In the current tr_start() funciton in mlogger, the code checks for "subdir_format" to create separate subdirectories like once per week. One could extend this code in the following way:

- Add an array of strings and name it "Path", such as

/dev/sda1/datadir/
/dev/sdb1/datadir/

- On each stop of the run, check if the current disk has enough space for one more run. Take either the "Byte limit" of that channel, or the actual size of the last run and multiply it by two or so. If the disk is "almost full", switch to the next array element in "Path". Append the file name, such as "/dev/sda1/datadir/run1234.mid" and put this into "Current filename" as a feedback for the user. Now write to the new disk/file.

- Add as string like "Execute on switch", which gets called after you switched to the next disk. This shell script can then handle the un-mounting of the full disk, notify the user etc. This is similar to the "/Programs/Execute on start run" in the ODB, but it gets only called if you switch the disk.
Entry  26 Feb 2007, Stefan Ritt, Info, Usage of event channel for improved throughput 
Starting from SVN revision 3642, sending events from the front-end has been revised.

Since long time ago, there is a special TCP socket established between any front-end and the mserver which can be used to bypass the midas RPC layer completely and purely send events. There was a #define USE_EVENT_CHANNEL but to my knowledge nobody used it.

While optimizing data throughput for the MEG experiment, I revisited this mechanism and got it finally working. Here are some benchmark tests made with the produce program on two dual-CPU machines running on Gigabit Ethernet:

Using normal RPC socket:

event size    speed [MB/sec] CPU usage front-end  CPU usage server
==================================================================
    40          3            22                   100                
  1000         44            25                   100
100000        101            14                    50

Using new event socket:

event size    speed [MB/sec] CPU usage front-end  CPU usage server
==================================================================
    40         12            100                   34                
  1000         99            58                    59
100000        101            14                    43

As can be seen, the CPU load on the server drops significantly for smaller events since the processing time per event is reduced. If the transfer was limited by the server, the throughput goes up significantly. For large events the bottleneck on the server side is the memcpy of events, so no big improvement is visible. The saved CPU time however can be used to analyze more events for example.

The event socket is now enabled by default in the front-end by setting
rpc_mode = 1

in mfe.c and should be checked carefully in various experiments. There is a small chance that events get stuck in the buffer cache on the server side at the end of the run, in which case they would show up as the first events of the next run. I know that this problem happened in some experiment before, but that must have been unrelated to the rpc_mode. So please check again and report any problem with the new rpc_mode.
Entry  26 Feb 2007, Stefan Ritt, Info, Fragmented polled events 
Fragmented polled events have been implemented in SVN revision 3625.
Fragmentation is a method of breaking down large (>MB) events into smaller
pieces and send them through the shared memory buffers, reassembling them at the
output. In the past this was only possible for periodic events (such as large
histograms read out once every few seconds), but now this is also possible for
polled events.
Entry  06 Mar 2007, Konstantin Olchanski, Info, commited mhttpd fixes & improvements 
I commited the mhttpd fixes and improvements to the history code accumulated while running the ALPHA 
experiment at CERN:

- fix crashes and infinite loops while generating history plots (also seen in TWIST)
- permit more than 10 variables per history plot
- let users set their own colours for variables on history plot
- (finally) add gui elements for setting mimimum and maximum values on a plot
- implement special "history" mode. In this mode, the master mhttpd does all the work, except for 
generating of history plots, which is done in a separate mhttpd running in history mode, possibly on a 
different computer (via ODB variable "/history/url").

I also have improvements to the mhttpd elog code (better formatting of email) and to the "export history 
plot as CSV" function, which I will not be commiting: for elog, we switched to the standalone elogd; and 
CSV export is still very broken, even with my fixes.

The commited fixes have been in use at CERN since last Summer, but I could have introduced errors 
during the merge & commit. I am now using this new code, so any new errors should surface and get 
squashed quickly.

K.O.
ELOG V3.1.4-2e1708b5