10 Aug 2012, Carl Blaksley, Forum, Problem with CAMAC controlled by CES8210 and read out by CAEN V1718 VME controller
|
Hello all,
I am trying to put together a system to read out several camac adc. The camac is
read by a ces8210 camac to vme controller. The vme is then interfaced to a
computer through a CAEN v1718 usb control module. Has anyone gotten the latter to
work?
Previous users seemed to indicate that they had here:
https://ladd00.triumf.ca/elog/Midas/493
but I am having trouble getting this example frontend to compile. What is set as
the driver in the makefile, for example? If I put v1718 there then I receive
numerous errors from the CAENVMElib files.
If someone else has gotten the V1718 running, I would be grateful for their
insight.
Thanks,
-Carl |
27 Jul 2012, Cheng-Ju Lin, Info, MIDAS under Scientific Linux 6
|
Hi All,
I was wondering if anyone has attempted to install MIDAS under Scientific Linux 6? I am planning to install
Scientific Linux on one of the PCs in our lab to run MIDAS. I would like to know if anyone has been
successful in getting MIDAS to run under SL6. Thanks.
Cheng-Ju |
31 Jul 2012, Pierre-Andre Amaudruz, Info, MIDAS under Scientific Linux 6
|
Hi Cheng-Ju,
Midas will install and run under SL6. We're presently running SL6.2.
Cheers, PAA
> Hi All,
>
> I was wondering if anyone has attempted to install MIDAS under Scientific Linux 6? I am planning to install
> Scientific Linux on one of the PCs in our lab to run MIDAS. I would like to know if anyone has been
> successful in getting MIDAS to run under SL6. Thanks.
>
> Cheng-Ju |
04 Jul 2012, Konstantin Olchanski, Bug Report, Crash after recursive use of rpc_execute()
|
I am looking at a MIDAS kaboom when running out of space on the data disk - everything was freezing
up, even the VME frontend crashed sometimes.
The freeze was traced to ROOT use in mlogger - it turns out that ROOT intercepts many signal handlers,
including SIGSEGV - but instead of crashing the program as God intended, ROOT SEGV handler just hangs,
and the rest of MIDAS hangs with it. One solution is to always build mlogger without ROOT support -
does anybody use this feature anymore? Or reset the signal handlers back to the default setting somehow.
Freeze fixed, now I see a crash (seg fault) inside mlogger, in the newly introduced memmove() function
inside the MIDAS RPC code rpc_execute(). memmove() replaced memcpy() in the same place and I am
surprised we did not see this crash with memcpy().
The crash is caused by crazy arguments passed to memmove() - looks like corrupted RPC arguments
data.
Then I realized that I see a recursive call to rpc_execute(): rpc_execute() calls tr_stop() calls cm_yield() calls
ss_suspend() calls rpc_execute(). The second rpc_execute() successfully completes, but leaves corrupted
data for the original rpc_execute(), which happily crashes. At the moment of the crash, the recursive call to
rpc_execute() is no longer visible.
Note that rpc_execute() cannot be called recursively - it is not re-entrant as it uses a global buffer for RPC
argument processing. (global tls_buffer structure).
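To illustrate why a shared buffer makes a function non re-entrant, here is a minimal sketch. It is not the MIDAS code: fake_execute() and g_args are hypothetical names, and the buffer handling is reduced to a string copy. The nested call stands in for the tr_stop() -> cm_yield() -> ss_suspend() path described above.

```c
#include <stdio.h>

/* Shared scratch space, standing in for a global RPC argument buffer
   (hypothetical, not the real tls_buffer structure). */
static char g_args[32];

/* Hypothetical handler: stores its request in the shared buffer, optionally
   makes a nested call, then reads the buffer back expecting to find its
   own request there. */
static void fake_execute(const char *request, const char *nested_request,
                         char *seen, size_t seen_size)
{
   snprintf(g_args, sizeof(g_args), "%s", request);

   if (nested_request != NULL) {
      char ignored[sizeof(g_args)];
      /* the nested call overwrites g_args while the outer call still needs it */
      fake_execute(nested_request, NULL, ignored, sizeof(ignored));
   }

   /* after a nested call this holds the nested request, not ours */
   snprintf(seen, seen_size, "%s", g_args);
}
```

Called without a nested request, the function reads back what it stored. With a nested request, the outer call reads back the inner call's data, which is the kind of corruption seen by the original rpc_execute().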
Here is the mlogger stack trace:
#0 0x00000032a8032885 in raise () from /lib64/libc.so.6
#1 0x00000032a8034065 in abort () from /lib64/libc.so.6
#2 0x00000032a802b9fe in __assert_fail_base () from /lib64/libc.so.6
#3 0x00000032a802bac0 in __assert_fail () from /lib64/libc.so.6
#4 0x000000000041d3e6 in rpc_execute (sock=14, buffer=0x7ffff73fc010 "\340.", convert_flags=0) at
src/midas.c:11478
#5 0x0000000000429e41 in rpc_server_receive (idx=1, sock=<value optimized out>, check=<value
optimized out>) at src/midas.c:12955
#6 0x0000000000433fcd in ss_suspend (millisec=0, msg=0) at src/system.c:3927
#7 0x0000000000429b12 in cm_yield (millisec=100) at src/midas.c:4268
#8 0x00000000004137c0 in close_channels (run_number=118, p_tape_flag=0x7fffffffcd34) at
src/mlogger.cxx:3705
#9 0x000000000041390e in tr_stop (run_number=118, error=<value optimized out>) at
src/mlogger.cxx:4148
#10 0x000000000041cd42 in rpc_execute (sock=12, buffer=0x7ffff73fc010 "\340.", convert_flags=0) at
src/midas.c:11626
#11 0x0000000000429e41 in rpc_server_receive (idx=0, sock=<value optimized out>, check=<value
optimized out>) at src/midas.c:12955
#12 0x0000000000433fcd in ss_suspend (millisec=0, msg=0) at src/system.c:3927
#13 0x0000000000429b12 in cm_yield (millisec=1000) at src/midas.c:4268
#14 0x0000000000416c50 in main (argc=<value optimized out>, argv=<value optimized out>) at
src/mlogger.cxx:4431
K.O. |
04 Jul 2012, Konstantin Olchanski, Bug Report, Crash after recursive use of rpc_execute()
|
> ... I see a recursive call to rpc_execute(): rpc_execute() calls tr_stop() calls cm_yield() calls
> ss_suspend() calls rpc_execute()
> ... rpc_execute() cannot be called recursively - it is not re-entrant as it uses a global buffer
It turns out that rpc_server_receive() also needs protection against recursive calls - it also uses
a global buffer to receive network data.
My solution is to protect rpc_server_receive() against recursive calls by detecting recursion and returning SS_SUCCESS (to ss_suspend()).
I was worried that this would cause a tight loop inside ss_suspend() but in practice, it looks like ss_suspend() tries to call
us about once per second. I am happy with this solution. Here is the diff:
@@ -12813,7 +12815,7 @@
/********************************************************************/
-INT rpc_server_receive(INT idx, int sock, BOOL check)
+INT rpc_server_receive1(INT idx, int sock, BOOL check)
/********************************************************************\
Routine: rpc_server_receive
@@ -13047,7 +13049,28 @@
return status;
}
+/********************************************************************/
+INT rpc_server_receive(INT idx, int sock, BOOL check)
+{
+ static int level = 0;
+ int status;
+ // Provide protection against recursive calls to rpc_server_receive() and rpc_execute()
+ // via rpc_execute() calls tr_stop() calls cm_yield() calls ss_suspend() calls rpc_execute()
+
+ if (level != 0) {
+ //printf("*** enter rpc_server_receive level %d, idx %d sock %d %d -- protection against recursive use!\n", level, idx, sock, check);
+ return SS_SUCCESS;
+ }
+
+ level++;
+ //printf(">>> enter rpc_server_receive level %d, idx %d sock %d %d\n", level, idx, sock, check);
+ status = rpc_server_receive1(idx, sock, check);
+ //printf("<<< exit rpc_server_receive level %d, idx %d sock %d %d, status %d\n", level, idx, sock, check, status);
+ level--;
+ return status;
+}
+
/********************************************************************/
INT rpc_server_shutdown(void)
/********************************************************************\
ladd02:trinat~/packages/midas>svn info src/midas.c
Path: src/midas.c
Name: midas.c
URL: svn+ssh://svn@savannah.psi.ch/repos/meg/midas/trunk/src/midas.c
Repository Root: svn+ssh://svn@savannah.psi.ch/repos/meg/midas
Repository UUID: 050218f5-8902-0410-8d0e-8a15d521e4f2
Revision: 5297
Node Kind: file
Schedule: normal
Last Changed Author: olchanski
Last Changed Rev: 5294
Last Changed Date: 2012-06-15 10:45:35 -0700 (Fri, 15 Jun 2012)
Text Last Updated: 2012-06-29 17:05:14 -0700 (Fri, 29 Jun 2012)
Checksum: 8d7907bd60723e401a3fceba7cd2ba29
K.O. |
13 Jul 2012, Stefan Ritt, Bug Report, Crash after recursive use of rpc_execute()
|
> Then I realized that I see a recursive call to rpc_execute(): rpc_execute() calls tr_stop() calls cm_yield() calls
> ss_suspend() calls rpc_execute(). The second rpc_execute successfully completes, but leave corrupted
> data for the original rpc_execute(), which happily crashes. At the moment of the crash, recursive call to
> rpc_execute() is no longer visible.
This is really strange. I did not protect rpc_execute against recursive calls since this should not happen. rpc_server_receive() is linked to rpc_call() on the client side. So there cannot be
several rpc_call() since there I do the recursive checking (also multi-thread checking) via a mutex. See line 10142 in midas.c. So there CANNOT be recursive calls to rpc_execute() because
there cannot be recursive calls to rpc_server_receive(). But apparently there are, according to your stack trace.
So even if your patch works fine, I would like to know where the recursive calls to rpc_server_receive() come from. Since we have one subprocess of mserver for each client, there should only
be one client connected to each mserver process, and the client is protected via the mutex in rpc_call(). Can you please debug this? I would like to understand what is going on there. Maybe
there is a deeper underlying problem, which we better solve, otherwise it might fall back on us in the future.
For debugging, you have to see what commands rpc_call() sends and what rpc_server_receive() gets, maybe by writing this into a common file together with a time stamp.
SR |
20 Jun 2012, Konstantin Olchanski, Info, lazylogger write to HADOOP HDFS
|
I tried using the lazylogger "Disk" method to write into a HADOOP HDFS clustered filesystem and found a
number of problems. I ended up replacing the lazylogger lazy_copy() function that still uses former YBOS
code with a new lazy_disk_copy() function that uses generic fread/fwrite. Also fixed the situation where
lazylogger cannot cleanly stop from the mhttpd "programs/stop" button while it is busy writing (the fix
works only for the "Disk" method).
(Note that one can also use the "Script" method for writing into HDFS)
Anyhow, the new lazylogger writes into HDFS just fine and I expect that it would also work for writing into
DCACHE using PNFS (if ever we get the SL6 PNFS working with our DCACHE servers).
Writing into our test HDFS cluster runs at about 20 MiBytes/sec for 1GB files with replication set to 3.
svn rev 5295
K.O. |
29 Jun 2012, Konstantin Olchanski, Info, lazylogger write to HADOOP HDFS
|
> Anyhow, the new lazylogger writes into HDFS just fine and I expect that it would also work for writing into
> DCACHE using PNFS (if ever we get the SL6 PNFS working with our DCACHE servers).
>
> Writing into our test HDFS cluster runs at about 20 MiBytes/sec for 1GB files with replication set to 3.
Minor update to lazylogger and mlogger:
lazylogger default timeout 60 sec is too short for writing into HDFS - changed to 10 min.
mlogger checks for free space were insufficient and it would fill the output disk to 100% full before stopping
the run. Now for disks bigger than 100GB, it will stop the run if there is less than 1GB of free space. (100%
disk full would break the history and the elog if they happen to be on the same disk).
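The free space rule above can be sketched as follows. This is not the mlogger implementation: low_on_space() and check_path() are hypothetical names, gigabytes are assumed to be decimal (10^9 bytes), and the rule for disks of 100GB or smaller is not given in this message, so the sketch simply does not stop the run in that case. The only library call used is POSIX statvfs().

```c
#include <sys/statvfs.h>

/* Returns 1 if the run should be stopped, 0 otherwise.
   Only the rule for disks bigger than 100 GB is described in the post. */
static int low_on_space(double total_bytes, double free_bytes)
{
   const double GB = 1e9; /* assumption: decimal gigabytes */

   if (total_bytes > 100.0 * GB)
      return free_bytes < 1.0 * GB;

   return 0; /* rule for smaller disks is not stated, so do nothing here */
}

/* Query the filesystem holding 'path' with POSIX statvfs().
   Returns -1 on error, otherwise the result of low_on_space(). */
static int check_path(const char *path)
{
   struct statvfs st;

   if (statvfs(path, &st) != 0)
      return -1;

   double total = (double) st.f_blocks * (double) st.f_frsize;
   double avail = (double) st.f_bavail * (double) st.f_frsize; /* space available to non-root users */

   return low_on_space(total, avail);
}
```

A check like this is cheap, but as noted below for rev 5297, querying the filesystem after every event can be costly, so it would normally be called at a low, fixed rate.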
Also I note that mlogger.cxx rev 5297 includes a fix for a performance bug introduced about 6 months ago (mlogger
would query free disk space after writing each event - depending on your filesystem configuration and the event
rate, this bug was observed to severely reduce the midas disk writing performance).
svn rev 5296, 5297
K.O.
P.S. I use these lazylogger settings for writing to HDFS. Write speed varies around 10-20-30 Mbytes/sec (4-node
cluster, 3 replicas of each file).
[local:trinat_detfac:S]Settings>pwd
/Lazy/HDFS/Settings
[local:trinat_detfac:S]Settings>ls -l
Key name Type #Val Size Last Opn Mode Value
---------------------------------------------------------------------------
Period INT 1 4 7m 0 RWD 10
Maintain free space (%) INT 1 4 7m 0 RWD 20
Stay behind INT 1 4 7m 0 RWD 0
Alarm Class STRING 1 32 7m 0 RWD
Running condition STRING 1 128 7m 0 RWD ALWAYS
Data dir STRING 1 256 7m 0 RWD /home/trinat/online/data
Data format STRING 1 8 7m 0 RWD MIDAS
Filename format STRING 1 128 7m 0 RWD run*
Backup type STRING 1 8 7m 0 RWD Disk
Execute after rewind STRING 1 64 7m 0 RWD
Path STRING 1 128 7m 0 RWD /hdfs/users/trinat/data
Capacity (Bytes) FLOAT 1 4 7m 0 RWD 5e+09
List label STRING 1 128 7m 0 RWD HDFS
Execute before writing file STRING 1 64 7m 0 RWD
Execute after writing file STRING 1 64 7m 0 RWD
Modulo.Position STRING 1 8 7m 0 RWD
Tape Data Append BOOL 1 4 7m 0 RWD y
K.O. |
20 Jun 2012, Konstantin Olchanski, Info, midas vme benchmarks
|
I am recording here the results from a test VME system using two VF48 waveform digitizers and a 64-bit
dual-core VME processor (V7865). VF48 data suppression is off, VF48 modules set to read 48 channels,
1000 ADC samples each. mlogger data compression is enabled (gzip -1).
Event rate is about 200/sec
VME Data rate is about 40 Mbytes/sec
System is 100% busy (estimate)
System utilization of host computer (dual-core 2.2GHz, dual-channel DDR333 RAM):
(note high CPU use by mlogger for gzip compression of midas files)
top - 12:23:45 up 68 days, 20:28, 3 users, load average: 1.39, 1.22, 1.04
Tasks: 193 total, 3 running, 190 sleeping, 0 stopped, 0 zombie
Cpu(s): 32.1%us, 6.2%sy, 0.0%ni, 54.4%id, 2.7%wa, 0.1%hi, 4.5%si, 0.0%st
Mem: 3925556k total, 3797440k used, 128116k free, 1780k buffers
Swap: 32766900k total, 8k used, 32766892k free, 2970224k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5169 trinat 20 0 246m 108m 97m R 64.3 2.8 29:36.86 mlogger
5771 trinat 20 0 119m 98m 97m R 14.9 2.6 139:34.03 mserver
6083 root 20 0 0 0 0 S 2.0 0.0 0:35.85 flush-9:3
1097 root 20 0 0 0 0 S 0.9 0.0 86:06.38 md3_raid1
System utilization of VME processor (dual-core 2.16 GHz, single-channel DDR2 RAM):
(note the more than 100% CPU use of multithreaded fevme)
top - 12:24:49 up 70 days, 19:14, 2 users, load average: 1.19, 1.05, 1.01
Tasks: 103 total, 1 running, 101 sleeping, 1 stopped, 0 zombie
Cpu(s): 6.3%us, 45.1%sy, 0.0%ni, 47.7%id, 0.0%wa, 0.2%hi, 0.6%si, 0.0%st
Mem: 1019436k total, 866672k used, 152764k free, 3576k buffers
Swap: 0k total, 0k used, 0k free, 20976k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19740 trinat 20 0 177m 108m 984 S 104.5 10.9 1229:00 fevme_gef.exe
1172 ganglia 20 0 416m 99m 1652 S 0.7 10.0 1101:59 gmond
32353 olchansk 20 0 19240 1416 1096 R 0.2 0.1 0:00.05 top
146 root 15 -5 0 0 0 S 0.1 0.0 42:52.98 kslowd001
Attached are the CPU and network ganglia plots from lxdaq09 (VME) and ladd02 (host).
The regular bursts of "network out" on ladd02 are lazylogger writing mid.gz files to HADOOP HDFS.
K.O. |
20 Jun 2012, Konstantin Olchanski, Info, midas vme benchmarks
|
> I am recording here the results from a test VME system using two VF48 waveform digitizers
Note 1: data compression is about 89% (hence "data to disk" rate is much smaller than the "data from VME" rate)
Note 2: switch from VME MBLT64 block transfer to 2eVME block transfer:
- raises the VME data rate from 40 to 48 M/s
- event rate from 220/sec to 260/sec
- mlogger CPU use from 64% to about 80%
This is consistent with the measured VME block transfer rates for the VF48 module: MBLT64 is about 40 M/s, 2eVME is about 50 M/s (could be
80 M/s if no clock cycles were lost to sync VME signals with the VF48 clocks), 2eSST is implemented but impossible - VF48 cannot drive the
VME BERR and RETRY signals. Evil standards, grumble, grumble, grumble).
K.O. |
24 Jun 2012, Konstantin Olchanski, Info, midas vme benchmarks
|
> > I am recording here the results from a test VME system using two VF48 waveform digitizers
(I now have 4 VF48 waveform digitizers, so the event rates are half of those reported before. Data rate
is up to 51 M/s - event size has doubled, per-event overhead is the same, so the effective data rate goes
up).
This message demonstrates the effects of tuning the MIDAS system for high rate data taking.
Attached is the history plot of the event rate counters which show the real-time performance of the MIDAS
system with better detail compared to the average event rate reported on the MIDAS status page. For an
ideal real-time system, the event rate should be a constant, without any drop-outs.
Seen on the plot:
run 75: the periodic dropouts in the event rate correspond to the lazylogger writing data into HADOOP
HDFS. Clearly the host computer cannot keep up with both data taking and data archiving at the same
time. (see the output of "top" "with HDFS" and "without HDFS" below)
run 76: SYSTEM buffer size increased from 100Mbytes to 300Mbytes. Maybe there is an improvement.
run 77-78: "event_buffer_size" inside the multithreaded (EQ_MULTITHREAD) VME frontend increased from
100Mbytes to 300Mbytes. (6 seconds of data at 50M/s). Much better, yes?
Conclusion: for improved real-time performance, there should be sufficient buffering between the VME
frontend readout thread and the mlogger data compression thread.
For benchmark hardware, at 50M/s, 4 seconds of buffer space (100M in the SYSTEM buffer and 100M in
the frontend) is not enough. 12 seconds of buffer space (300+300) is much better. (Or buy a faster
backend computer).
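The buffer arithmetic above can be written out as a small helper. This is only a sketch of the estimate used in this message (total buffer size divided by data rate); buffer_seconds() is a hypothetical name, and sizes and rates are taken in Mbytes and Mbytes/sec as quoted.

```c
/* Seconds of data that fit into the SYSTEM buffer plus the frontend
   event buffer at a steady data rate. Sizes in Mbytes, rate in Mbytes/sec. */
static double buffer_seconds(double system_buffer_mb,
                             double frontend_buffer_mb,
                             double rate_mb_per_sec)
{
   return (system_buffer_mb + frontend_buffer_mb) / rate_mb_per_sec;
}
```

With the numbers from the benchmark, buffer_seconds(100, 100, 50) gives 4 seconds and buffer_seconds(300, 300, 50) gives 12 seconds.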
P.S. HDFS data rate as measured by lazylogger is around 20M/s for CDH3 HADOOP and around 30M/s for
CDH4 HADOOP.
P.S. Observe the ever present unexplained event rate fluctuations between 130-140 event/sec.
K.O.
---- "top" output during normal data taking, notice mlogger data compression consumes 99% CPU at 51
M/s data rate.
top - 08:55:22 up 72 days, 17:00, 5 users, load average: 2.47, 2.32, 2.27
Tasks: 206 total, 2 running, 204 sleeping, 0 stopped, 0 zombie
Cpu(s): 52.2%us, 6.1%sy, 0.0%ni, 34.4%id, 0.8%wa, 0.1%hi, 6.2%si, 0.0%st
Mem: 3925556k total, 3064928k used, 860628k free, 3788k buffers
Swap: 32766900k total, 200704k used, 32566196k free, 2061048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5826 trinat 20 0 437m 291m 287m R 97.6 7.6 636:39.63 mlogger
27617 trinat 20 0 310m 288m 288m S 24.6 7.5 6:59.28 mserver
1806 ganglia 20 0 415m 62m 1488 S 0.9 1.6 668:43.55 gmond
--- "top" output during lazylogger/HDFS activity. Observe high CPU use by lazylogger and fuse_dfs (the
HADOOP HDFS client). Observe that CPU use adds up to 167% out of 200% available.
top - 08:57:16 up 72 days, 17:01, 5 users, load average: 2.65, 2.35, 2.29
Tasks: 206 total, 2 running, 204 sleeping, 0 stopped, 0 zombie
Cpu(s): 57.6%us, 23.1%sy, 0.0%ni, 8.1%id, 0.0%wa, 0.4%hi, 10.7%si, 0.0%st
Mem: 3925556k total, 3642136k used, 283420k free, 4316k buffers
Swap: 32766900k total, 200692k used, 32566208k free, 2597752k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5826 trinat 20 0 437m 291m 287m R 68.7 7.6 638:24.07 mlogger
23450 root 20 0 1849m 200m 4472 S 64.4 5.2 75:35.64 fuse_dfs
27617 trinat 20 0 310m 288m 288m S 18.5 7.5 7:22.06 mserver
26723 trinat 20 0 38720 11m 1172 S 17.9 0.3 22:37.38 lazylogger
7268 trinat 20 0 1007m 35m 4004 D 1.3 0.9 187:14.52 nautilus
1097 root 20 0 0 0 0 S 0.8 0.0 101:45.55 md3_raid1 |
25 Jun 2012, Stefan Ritt, Info, midas vme benchmarks
|
> P.S. Observe the ever present unexplained event rate fluctuations between 130-140 event/sec.
An important aspect of optimizing your system is to keep the network traffic under control. I use GBit Ethernet between FE and BE, and make sure the switch
can accommodate all accumulated network traffic through its backplane. This way I do not have any TCP retransmits which kill you. Like if a single low-level
ethernet packet is lost due to collision, the TCP stack retransmits it. Depending on the local settings, this can be after a timeout of one (!) second, which
punches already a hole in your data rate. On the MSCB system actually I use UDP packets, where I schedule the retransmit myself. For a LAN, 10-100ms timeout
is there enough. The one second is optimized for a WAN (like between two continents) where this is fine, but it is not what you want on a LAN system. Also
make sure that the outgoing traffic (lazylogger) uses a different network card than the incoming traffic. I found that this also helps a lot.
- Stefan |
25 Jun 2012, Konstantin Olchanski, Info, midas vme benchmarks
|
> > P.S. Observe the ever present unexplained event rate fluctuations between 130-140 event/sec.
>
> An important aspect of optimizing your system is to keep the network traffic under control. I use GBit Ethernet between FE and BE, and make sure the switch
> can accomodate all accumulated network traffic through its backplane. This way I do not have any TCP retransmits which kill you. Like if a single low-level
> ethernet packet is lost due to collision, the TCP stack retransmits it. Depending on the local settings, this can be after a timeout of one (!) second, which
> punches already a hole in your data rate. On the MSCB system actually I use UDP packets, where I schedule the retransmit myself. For a LAN, 10-100ms timeout
> is there enough. The one second is optimized for a WAN (like between two continents) where this is fine, but it is not what you want on a LAN system. Also
> make sure that the outgoing traffic (lazylogger) uses a different network card than the incoming traffic. I found that this also helps a lot.
>
In typical applications at TRIUMF we do not setup a private network for the data traffic - data from VME to backend computer
and data from backend computer to DCACHE all go through the TRIUMF network.
This is justified by the required data rates - the highest data rate experiment running right now is PIENU - running
at about 10 M/s sustained, nominally April through December. (This is 20% of the data rate of the present benchmark).
The next highest data rate experiment is T2K/ND280 in Japan running at about 20 M/s (neutrino beam, data rate
is dominated by calibration events).
All other experiments at TRIUMF run at lower data rates (low intensity light ion beams), but we are planning for an experiment
that will run at 300 M/s sustained over 1 week of scheduled beam time.
But we do have the technical capability to separate data traffic from the TRIUMF network - the VME processors and
the backend computers all have dual GigE NICs.
(I did not say so, but obviously the present benchmark at 50 M/s VME to backend and 20-30 M/s from backend to HDFS is a GigE network).
(I am not monitoring the TCP loss and retransmit rates at present time)
(The network switch between VME and backend is the "cheapest available" rackmountable 8-port GigE switch. The network between
the backend and the HDFS nodes is mostly Nortel 48-port GigE edge switches with single-GigE uplinks to the core router).
K.O. |
26 Jun 2012, Konstantin Olchanski, Info, midas vme benchmarks
|
> > > I am recording here the results from a test VME system using four VF48 waveform digitizers
Now we look at the detail of the event readout, or if you want, the real-time properties of the MIDAS
multithreaded VME frontend program.
The benchmark system includes a TRIUMF-made VME-NIMIO32 VME trigger module which records the
time of the trigger and provides a 20 MHz timestamp register. The frontend program is instrumented to
save the trigger time and readout timing data into a special "trigger" bank ("VTR0"). The ROOTANA-based
MIDAS analyzer is used to analyze this data and to make these plots.
Timing data is recorded like this:
NIM trigger signal ---> latched into the IO32 trigger time register (VTR0 "trigger time")
...
int read_event(pevent, etc) {
VTR0 "trigger time" = io32->latched_trigger_time();
VTR0 "readout start time" = io32->timestamp();
read the VF48 data
io32->release_busy();
VTR0 "readout end time" = io32->timestamp();
}
From the VTR0 time data, we compute these values:
1) "trigger latency" = "readout start time" - "trigger time" --- the time it takes us to "see" the trigger
2) "readout time" = "readout end time" - "readout start time" --- the time it takes to read the VF48 data
3) "busy time" = "readout end time" - "trigger time" --- time during which the "DAQ busy" trigger veto is
active.
also computed is
4) "time between events" = "trigger time" - "time of previous trigger"
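The four quantities above can be computed from raw timestamp values as in the following sketch. It assumes the 20 MHz timestamp is a free-running 32-bit counter and that all three times are in the same units; the vtr0_times struct and the function names are hypothetical and do not reflect the actual layout of the VTR0 bank.

```c
#include <stdint.h>

#define TS_TICKS_PER_USEC 20.0 /* 20 MHz timestamp clock, from the post */

/* Hypothetical container for the three times saved per event. */
typedef struct {
   uint32_t trigger;       /* latched trigger time */
   uint32_t readout_start; /* timestamp at start of readout */
   uint32_t readout_end;   /* timestamp after busy is released */
} vtr0_times;

/* Time difference in microseconds. Unsigned subtraction stays correct
   across a single wrap of the (assumed 32-bit) counter. */
static double ticks_to_usec(uint32_t later, uint32_t earlier)
{
   uint32_t dt = later - earlier;
   return dt / TS_TICKS_PER_USEC;
}

/* 1) time it takes to "see" the trigger */
static double trigger_latency_usec(const vtr0_times *t)
{
   return ticks_to_usec(t->readout_start, t->trigger);
}

/* 2) time it takes to read the data */
static double readout_time_usec(const vtr0_times *t)
{
   return ticks_to_usec(t->readout_end, t->readout_start);
}

/* 3) time during which the "DAQ busy" veto is active */
static double busy_time_usec(const vtr0_times *t)
{
   return ticks_to_usec(t->readout_end, t->trigger);
}

/* 4) time between this trigger and the previous one */
static double time_between_events_usec(const vtr0_times *t, const vtr0_times *prev)
{
   return ticks_to_usec(t->trigger, prev->trigger);
}
```

For example, a readout that starts 100 ticks after the trigger has a trigger latency of 5 usec, and a readout lasting 150000 ticks takes 7500 usec.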
And plot them on the attached graphs:
1) "trigger latency" - we see average trigger latency is 5 usec with hardly any events taking more than 10
usec (notice the log Y scale!). Also notice that there are 35 events that took longer than 100 usec (0.7% out
of 5000 events).
So how "real time" is this? For "hard real time" the trigger latency should never exceed some maximum,
which is determined by formal analysis or experimentally (in which case it will carry an experimental error
bar - "response time is always less than X usec with probability 99.9...%" - the better system will have
smaller X and more nines). Since I did not record the maximum latency, I can only claim that the
"response time is always less than 1 sec, I am pretty sure of it".
For "soft real time" systems, such as subatomic particle physics DAQ systems, one is permitted to exceed
that maximum response time, but "not too often". Such systems are characterized by the quantities
derived from the present plot (mean response time, frequency of exceeding some deadlines, etc). The
quality of a soft real time system is usually judged by non-DAQ criteria (i.e. if the DAQ for the T2K/ND280
experiment does not respond within 20 msec, a neutrino beam spill can be lost and the experiment is
required to report the number of lost spills to the weekly facility management meeting).
Can the trigger latency be improved by using interrupts instead of polling? Remember that on most
hardware, the VME and PCI bus access time is around 1 usec and trigger latency of 5-10 usec corresponds
to roughly 5-10 reads of a PCI or VME register. So there is not much room for speed up. Consider that an
interrupt handler has to perform at least 2-3 PCI register reads (to determine the source of the interrupt
and to clear the interrupt condition), it has to wake up the right process and do a rather slow CPU context
switch, maybe do a cross-CPU interrupt (if VME interrupts are routed to the wrong CPU core). All this
takes time. Then the Linux kernel interrupt latency comes into play. All this is overhead absent in pure-
polling implementations. (Yes, burning a CPU core to poll for data is wasteful, but is there any other use
for this CPU core? With a dual-core CPU, the 1st core polls for data, the 2nd core runs mfe.c, the TCP/IP
stack and the ethernet transmitter.)
2) "readout time" - between 7 and 8 msec, corresponding to the 50 Mbytes/sec VME block transfer rate.
No events taking more than 10 msec. (Could claim hard real time performance here).
3) "busy time" - for the simple benchmark system it is a boring sum of plots (1) and (2). The mean busy
time ("dead time") goes straight into the formula for computing cross-sections (if that is what you do).
4) "time between events" - provides an independent measurement of dead time - one can see that no
event takes less than 7 msec to process and 27 events took longer than 10 msec (0.65% out of 4154
events). If the trigger were cosmic rays instead of a pulser, this plot would also measure the cosmic ray
event rate - one would see the exponential shape of the Poisson distribution (linear on Log scale, with the
slope being the cosmic event rate).
K.O. |
26 Jun 2012, Konstantin Olchanski, Info, midas vme benchmarks
|
> > > > I am recording here the results from a test VME system using four VF48
> > > > waveform digitizers
Last message from this series. After all the tuning, I reduced the trigger rate from 120 Hz to 100 Hz to see
what happens when the backend computer is not overloaded and has some spare capacity.
event rate: 100 Hz (down from 120 Hz)
data rate: 37 Mbytes/sec (down from 50 M/s)
mlogger cpu use: 65% (down from 99%)
Attached:
1) trigger rate event plot: now the rate is solid 100 Hz without dropouts
2) CPU and Network plots from ganglia: the spikes are lazylogger saving mid.gz
files to HDFS storage
3) time structure plots:
a) trigger latency: mean 5 us, most below 10 us, 59 events (0.046%) longer than
100 us, (bottom left graph) 7000 us is longest latency observed.
b) readout time is 7000-8000 us (same as before - VME data rate is independent
of the trigger rate)
c) busy time: mean 7.2 ms, 12 events (0.0094%) longer than 10 ms, longest busy
time ever observed is 17 ms (bottom middle graph)
d) time between events is 10 ms (100 Hz pulser trigger), 1 event was missed
about 10 times (spike at 20 ms) (0.0085%), more than 1 event missed never (no
spike at 30 ms, 40 ms, etc).
CPU use on the backend computer:
top - 16:30:59 up 75 days, 35 min, 6 users, load average: 0.98, 0.99, 1.01
Tasks: 206 total, 3 running, 203 sleeping, 0 stopped, 0 zombie
Cpu(s): 39.3%us, 8.2%sy, 0.0%ni, 39.4%id, 5.7%wa, 0.3%hi, 7.2%si, 0.0%st
Mem: 3925556k total, 3404192k used, 521364k free, 8792k buffers
Swap: 32766900k total, 296304k used, 32470596k free, 2477268k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5826 trinat 20 0 441m 292m 287m R 65.8 7.6 2215:16 mlogger
26756 trinat 20 0 310m 288m 288m S 16.8 7.5 34:32.03 mserver
29005 olchansk 20 0 206m 39m 17m R 14.7 1.0 26:19.42 ana_vf48.exe
7878 olchansk 20 0 99m 3988 740 S 7.7 0.1 27:06.34 sshd
29012 trinat 20 0 314m 288m 288m S 2.8 7.5 4:22.14 mserver
23317 root 20 0 0 0 0 S 1.4 0.0 24:21.52 flush-9:3
K.O. |
21 Jun 2012, Stefan Ritt, Info, midas vme benchmarks
|
Just for completeness: Attached is the VME transfer speed I get with the SIS3100/SIS1100 interface using
2eVME transfer. This curve can be explained exactly with an overhead of 125 us per DMA transfer and a
continuous link speed of 83 MB/sec. |
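One way to read the model quoted above is that each DMA transfer costs a fixed 125 us plus the time to move the block at 83 MB/sec. The sketch below computes the resulting effective rate as a function of block size. The function name is hypothetical, and it assumes 1 MB = 10^6 bytes and that the overhead is paid once per block; neither is stated in the message.

```c
/* Effective transfer rate in MB/sec for one DMA block of 'block_bytes',
   using the two parameters quoted in the post. */
static double effective_rate_mb_per_sec(double block_bytes)
{
   const double overhead_sec = 125e-6;       /* fixed cost per DMA transfer */
   const double link_bytes_per_sec = 83e6;   /* assumes 1 MB = 1e6 bytes */

   double seconds = overhead_sec + block_bytes / link_bytes_per_sec;
   return block_bytes / seconds / 1e6;
}
```

In this model a block of 10375 bytes takes as long to move as the overhead itself, so it reaches half of the link speed (41.5 MB/sec). Larger blocks approach 83 MB/sec but never reach it.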
21 Jun 2012, Konstantin Olchanski, Info, midas vme benchmarks
|
> Just for completeness: Attached is the VME transfer speed I get with the SIS3100/SIS1100 interface using
> 2eVME transfer. This curve can be explained exactly with an overhead of 125 us per DMA transfer and a
> continuous link speed of 83 MB/sec.
What VME module is on the other end?
K.O. |
22 Jun 2012, Stefan Ritt, Info, midas vme benchmarks
|
> > Just for completeness: Attached is the VME transfer speed I get with the SIS3100/SIS1100 interface using
> > 2eVME transfer. This curve can be explained exactly with an overhead of 125 us per DMA transfer and a
> > continuous link speed of 83 MB/sec.
>
> What VME module is on the other end?
>
> K.O.
The PSI-built DRS4 board, where we implemented the 2eVME protocol in the Virtex II FPGA. The same speed can be obtained with the commercial
VME memory module CI-VME64 from Chrislin Industries (see http://www.controlled.com/vme/chinp1.html).
Stefan |
24 Jun 2012, Konstantin Olchanski, Info, midas vme benchmarks
|
> > > Just for completeness: Attached is the VME transfer speed I get with the SIS3100/SIS1100 interface using
> > > 2eVME transfer. This curve can be explained exactly with an overhead of 125 us per DMA transfer and a
> > > continuous link speed of 83 MB/sec.
>
> [with ...] the PSI-built DRS4 board, where we implemented the 2eVME protocol in the Virtex II FPGA.
This is an interesting hardware benchmark. Do you also have benchmarks of the MIDAS system using the DRS4 (measurements
of end-to-end data rates, maximum event rate, maximum trigger rate, any tuning of the frontend program
and of the MIDAS experiment to achieve those rates, etc)?
K.O. |
22 Jun 2012, Zisis Papandreou, Info, adding 2nd ADC and TDC to crate
|
Hi folks:
we've been running midas-1.9.5 for a few years here at Regina. We are now
working on a larger cosmic ray test that requires a second ADC and second TDC
module in our Camac crate (we use the hytek1331 controller by the way). We're
baffled as to how to set this up properly. Specifically we have tried:
frontend.c
/* number of channels */
#define N_ADC 12
(changed this from the old '8' to '12', and it seems to work for Lecroy 2249)
#define SLOT_ADC0 10
#define SLOT_TDC0 9
#define SLOT_ADC1 15
#define SLOT_TDC1 14
Is this the way to define the additional slots (by adding 0, 1 indices)?
Also, we were not able to get a new bank (ADC1) working, so we used a loop to
tag the second ADC values onto those of the first.
If someone has an example of how to handle multiple ADCs and TDCs and
suggestions as to where changes need to be made (header files, analyser, etc)
this would be great.
Thanks, Zisis...
P.S. I am attaching the relevant files. |
13 Jun 2012, Exaos Lee, Bug Report, Cannot start/stop run through mhttpd
|
Revision: r5286
Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports
Problem:
After building and installation, using the script 'start_daq.sh' to start
'sampleexpt'. Everything seems fine. But I cannot start a run through web. Using
'odbedit' and 'mtransition' to start/stop a run works fine. So, what may cause
such a problem? |
13 Jun 2012, Konstantin Olchanski, Bug Report, Cannot start/stop run through mhttpd
|
> Revision: r5286
> Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports
> Problem:
> After building and installation, using the script 'start_daq.sh' to start
> 'sampleexpt'. Everything seems fine. But I cannot start a run through web. Using
> 'odbedit' and 'mtransition' to start/stop a run works fine. So, what may cause
> such a problem?
Well, it's mhttpd who cannot start the run, not you. So what happens when you press
the "start run" button? Any errors in midas.log or in midas messages? Is mtransition
in your PATH?
K.O. |
13 Jun 2012, Exaos Lee, Bug Report, Cannot start/stop run through mhttpd
|
> Well, it's mhttpd who cannot start the run, not you. So what happens when you press
> the "start run" button? Any errors in midas.log or in midas messages? Is mtransition
> in your PATH?
After pressing "start run", there is a message displayed: "Run start requested". There
is no error in midas.log. And mtransition is actually in my PATH. I even looked into
"mhttpd.cxx" and found where "cm_transition" is called for starting a run. I have no
clue what the reason is. |
14 Jun 2012, Exaos Lee, Bug Report, Cannot start/stop run through mhttpd
|
> > Revision: r5286
> > Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports
> > Problem:
> > After building and installation, using the script 'start_daq.sh' to start
> > 'sampleexpt'. Everything seems fine. But I cannot start a run through web. Using
> > 'odbedit' and 'mtransition' to start/stop a run works fine. So, what may cause
> > such a problem?
>
> Well, it's mhttpd who cannot start the run, not you. So what happens when you press
> the "start run" button? Any errors in midas.log or in midas messages? Is mtransition
> in your PATH?
>
> K.O.
I found the problem only appears when I run mhttpd in scripts, whether bash or python.
And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
fine. So, what's the difference between invoking mhttpd directly and through a script? |
14 Jun 2012, Stefan Ritt, Bug Report, Cannot start/stop run through mhttpd
|
> I found the problem only appears when I run mhttpd in scripts, whether bash or python.
> And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
> are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
> fine. So, what's the difference between invoking mhttpd directly and through a script?
When you start it with "-D", mhttpd becomes a daemon. According to Linux rules, it has to "cd /", so it lives in the
root directory, in order not to block any NFS mount/unmount. If something is not correct with the path, mhttpd
cannot find mtransition. I once fixed that problem by moving mtransition to /usr/bin.
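One way this can go wrong (an assumption on my side, not something confirmed in this thread) is a PATH that contains relative or empty entries: they are resolved against the current directory, which is "/" after daemonizing. A minimal sketch to check for that; the function name is made up for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the PATH-style string contains an entry that is resolved
   relative to the current directory (an empty entry or one not starting
   with '/'), 0 otherwise. Such entries stop working after chdir("/"). */
int path_has_relative_entry(const char *path)
{
   const char *p = path;

   if (path == NULL || *path == '\0')
      return 0; /* nothing to check */

   for (;;) {
      const char *end = strchr(p, ':');
      size_t len = end ? (size_t) (end - p) : strlen(p);

      if (len == 0 || p[0] != '/')
         return 1; /* an empty entry means "current directory" */

      if (end == NULL)
         return 0; /* last entry checked, all absolute */
      p = end + 1;
   }
}
```

It could be called with getenv("PATH") in the script that starts mhttpd, before the "-D" option is used.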
Stefan |
14 Jun 2012, Konstantin Olchanski, Bug Report, Cannot start/stop run through mhttpd
|
> > I found the problem only appears when I run mhttpd in scripts, whether bash or python.
> > And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
> > are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
> > fine. So, what's the difference between invoking mhttpd directly and through a script?
>
> When you start it with "-D", mhttpd becomes a daemon. According to Linux rules, it has to "cd /", so it lives in the
> root directory, in order not to block any NFS mount/unmount. If something is not correct with the path, mhttpd
> cannot find mtransition. I once fixed that problem by moving mtransition to /usr/bin.
>
I agree. Somehow mhttpd cannot run mtransition. I am not super happy with this dependence on user $PATH settings and the inability to capture error messages
from attempts to start mtransition. I am now thinking in the direction of running mtransition code by forking. But remember that mlogger and the event builder also
have to use mtransition to stop runs (otherwise they can dead-lock). So an mhttpd-only solution is not good enough...
K.O. |
21 Jun 2012, Stefan Ritt, Bug Report, Cannot start/stop run through mhttpd
|
> I agree. Somehow mhttpd cannot run mtransition. I am not super happy with this dependence on user $PATH settings and the inability to capture error messages
> from attempts to start mtransition. I am now thinking in the direction of running mtransition code by forking. But remember that mlogger and the event builder also
> have to use mtransition to stop runs (otherwise they can dead-lock). So an mhttpd-only solution is not good enough...
The way to go is to make cm_transition multi-threaded, e.g. one thread for each client to be contacted. This way the transitions can go in parallel when there are many frontend computers, for example, which will speed up
transitions significantly. In addition, cm_transition should execute a callback whenever a client succeeded or failed, so as to give immediate feedback to the user. I am thinking of something like implementing WebSockets in mhttpd for that (http://en.wikipedia.org/wiki/WebSocket).
I have had this in mind for many years, but have not had time to implement it yet. Maybe on my next visit to TRIUMF?
Stefan |
14 Jun 2012, Konstantin Olchanski, Bug Report, Cannot start/stop run through mhttpd
|
> > > Revision: r5286
> > > Platform: Debian Linux 6.0.5 AMD64, with packages from squeeze-backports
>
> I found the problem only appears when I run mhttpd in scripts, whether bash or python.
> And I'm quite sure that the MIDAS environments (e.g. PATH, MIDAS_EXPTAB, MIDASSYS, etc.)
> are set in such scripts. If I start mhttpd in an xterm with or without "-D", it works
> fine.
Right. I see Debian 6.0.5 just came out hot off the presses. Would be good to fix this problem.
As a work around, can you run mhttpd without "-D", but in the background, i.e. "mhttpd -p xxx >& mhttpd.log &"?
Also what are your $PATH settings?
> So, what's the difference between invoking mhttpd directly and through a script?
As Stefan mentioned, "-D" invokes some nasty unix magic to disconnect the process from the user login session. It is
possible that this magic breaks in the latest Debian.
MIDAS "-D" does roughly the same thing as "nohup".
K.O. |
09 Jun 2012, Greg Christian, Bug Report, _net_send_buffer realloc
|
In midas.c, I noticed that memory is only allocated to the global buffer
_net_send_buffer by calling realloc() from within the function
resize_net_send_buffer() (at least this was the only place I could find
allocation to _net_send_buffer happening). This can cause problems for a couple
of reasons:
1) _net_send_buffer is not set to NULL when declared. To my understanding, this
makes the first call to realloc(_net_send_buffer, /*size*/) undefined. When
passed a pointer that has not previously been allocated, realloc() acts like
malloc() only if the pointer is equal to NULL. Otherwise, the behavior is undefined
and usually causes a crash.
2) cm_disconnect_experiment() calls free(_net_send_buffer) but does not set its
value to NULL. Thus if a client tries to include more than one
connect...disconnect cycle within an application, there is undefined behavior
the next time realloc(_net_send_buffer, ...) gets called.
I think that any potential allocation issues involving _net_send_buffer could be
solved by:
1) Initializing _net_send_buffer to NULL.
2) In cm_disconnect_experiment(), changing
> M_FREE(_net_send_buffer);
to
> M_FREE(_net_send_buffer);
> _net_send_buffer = NULL; |
10 Jun 2012, Konstantin Olchanski, Bug Report, _net_send_buffer realloc
|
> In midas.c, ...
>
> 1) _net_send_buffer is not set to NULL when declared.
_net_send_buffer is a global variable. All global variables are automatically initialized to zero before the program
starts.
static char*x; // = NULL; is redundant
char*y=realloc(x, 100); // x is NULL, usage is correct
> 2) cm_disconnect_experiment() calls free(_net_send_buffer) but does not set its
> value to NULL.
My copy of midas.c (svn rev 5256) sets _net_send_buffer to NULL:
if (_net_send_buffer_size > 0) {
   M_FREE(_net_send_buffer);
   _net_send_buffer_size = 0;
}
What version of midas do you have? (svn info .)
K.O. |
10 Jun 2012, Greg Christian, Bug Report, _net_send_buffer realloc
|
> > In midas.c, ...
> >
> > 1) _net_send_buffer is not set to NULL when declared.
>
> _net_send_buffer is a global variable. All global variables are automatically initialized to zero before the program
> starts.
>
> static char*x; // = NULL; is redundant
> char*y=realloc(x, 100); // x is NULL, usage is correct
>
Ah, okay. I was not aware of this feature of global variables.
> > 2) cm_disconnect_experiment() calls free(_net_send_buffer) but does not set its
> > value to NULL.
>
> My copy of midas.c (svn rev 5256) sets _net_send_buffer to NULL:
>
> if (_net_send_buffer_size > 0) {
> M_FREE(_net_send_buffer);
> _net_send_buffer_size = 0;
> }
>
> What version of midas do you have? (svn info .)
>
> K.O.
I have version 5256 also (matches what you posted), but I only see
_net_send_buffer_size being set to 0, not _net_send_buffer itself. In midas.h,
M_FREE(x) only expands to free(x) if _MEM_DBG is not defined. |
11 Jun 2012, Konstantin Olchanski, Bug Report, _net_send_buffer realloc
|
> > > In midas.c, ...
> > >
> > > 1) _net_send_buffer is not set to NULL when declared.
>
> Ah,okay. I was not aware of this feature of global variables.
>
RTFM K&R "The C programming language".
http://en.wikipedia.org/wiki/The_C_Programming_Language
>
> > > 2) cm_disconnect_experiment() calls free(_net_send_buffer) but does not set
> > > its value to NULL.
>
Confirmed. Sorry for the confusion in my previous message. Setting the pointer to NULL after free() is good practice.
But note that calling cm_connect and cm_disconnect multiple times is an unusual use of MIDAS and you will most
likely find more breakage.
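To summarize the two points of this thread in code, here is a minimal sketch (not the midas.c code; the names send_buffer, resize_send_buffer() and release_send_buffer() are stand-ins made up for this example). It relies on two facts of standard C: objects with static storage duration start out zero-initialized, and realloc(NULL, n) behaves like malloc(n).

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the globals discussed above. Objects with
   static storage duration start out zero-initialized, so the pointer is NULL. */
static char *send_buffer;
static size_t send_buffer_size;

/* Grow the buffer if needed. realloc(NULL, n) behaves like malloc(n), so the
   first call works without a separate allocation step. Returns 0 on success. */
int resize_send_buffer(size_t new_size)
{
   if (new_size <= send_buffer_size)
      return 0;

   char *p = realloc(send_buffer, new_size);
   if (p == NULL)
      return -1; /* the old buffer is still valid */

   send_buffer = p;
   send_buffer_size = new_size;
   return 0;
}

/* Release the buffer and reset both globals, so that a later
   resize_send_buffer() call starts again from a NULL pointer. */
void release_send_buffer(void)
{
   free(send_buffer);
   send_buffer = NULL;
   send_buffer_size = 0;
}
```

With the reset in release_send_buffer(), a second allocate/release cycle in the same process no longer passes a dangling pointer to realloc().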
K.O. |
15 Jun 2012, Konstantin Olchanski, Bug Report, _net_send_buffer realloc
|
> 2) cm_disconnect_experiment() calls free(_net_send_buffer) but does not set its
> value to NULL.
Set pointer to NULL after free() in these files:
M odb.c
M sequencer.cxx
M mlogger.cxx
M mhttpd.cxx
M midas.c
svn rev 5294
K.O. |
12 Dec 2011, Michael Murray, Bug Report, bk_delete uses memcpy instead of memmove
|
In midas.c, the bk_delete function removes a bank by decrementing the total
event size and then copying the remaining banks into the location of the first
using memcpy from string.h.
memcpy is not specified to handle overlapping memory regions (such as MIDAS
banks), though it seems most common implementations do.
memmove should be used instead, which is specified to behave as if copying
through an intermediate buffer.
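A minimal sketch of the kind of in-place removal described above. This is not the actual bk_delete() code, and remove_region() is a made-up name; it only shows why the copy overlaps:

```c
#include <string.h>

/* Remove `len` bytes starting at `offset` from a packed buffer holding
   `total` valid bytes, closing the gap. Source and destination overlap
   whenever the tail is longer than the removed region, so memmove() is
   required; memcpy() would be undefined behaviour here.
   Returns the new number of valid bytes. */
size_t remove_region(unsigned char *buf, size_t total, size_t offset, size_t len)
{
   if (offset > total || len > total - offset)
      return total; /* region out of range: nothing removed */

   memmove(buf + offset, buf + offset + len, total - offset - len);
   return total - len;
}
```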
I noticed the misbehavior using glibc with gcc version 4.4.4 and scientific
linux 6.0. Other gcc versions changed nothing, as this originates from the
implementation of memcpy in libc.
libc version:
GNU C Library stable release version 2.12, by Roland McGrath et al.
Compiled by GNU CC version 4.4.5 20110214 (Red Hat 4.4.5-6).
Compiled on a Linux 2.6.32 system on 2011-12-06. |
16 Dec 2011, Konstantin Olchanski, Bug Report, bk_delete uses memcpy instead of memmove
|
> In midas.c, the bk_delete function removes a bank by decrementing the total
> event size and then copying the remaining banks into the location of the first
> using memcpy from string.h.
I confirm the documented difference between memcpy() and memmove() and I confirm the
questionable use of memcpy() in bk_delete(). I think it should be memmove(). I made it so in my copy
of midas, so this change will not be lost.
But I am not sure how to test it - I do not think I ever used bk_delete(). I will probably ponder upon
this and do a blind commit.
K.O. |
19 Dec 2011, Stefan Ritt, Bug Report, bk_delete uses memcpy instead of memmove
|
> > In midas.c, the bk_delete function removes a bank by decrementing the total
> > event size and then copying the remaining banks into the location of the first
> > using memcpy from string.h.
>
>
> I confirm the documented difference between memcpy() and memmove() and I confirm the
> questionable use of memcpy() in bk_delete(). I think it should be memmove(). I made it so in my copy
> of midas, so this change will not be lost.
>
> But I am not sure how to test it - I do not think I ever used bk_delete(). I will probably ponder upon
> this and do a blind commit.
>
>
> K.O.
It cannot hurt to use memmove(), so please go ahead and commit the changes.
- Stefan |
15 Jun 2012, Konstantin Olchanski, Bug Report, bk_delete uses memcpy instead of memmove
|
> In midas.c, the bk_delete function removes a bank by decrementing the total
> event size and then copying the remaining banks into the location of the first
> using memcpy from string.h.
Replaced some memcpy() with memmove(), including bk_delete().
svn rev 5293
K.O. |
13 Jun 2012, Konstantin Olchanski, Forum, ladd00.triumf.ca https ssl certificate update
|
The HTTPS SSL certificate on ladd00.triumf.ca has been updated. Same as the old
certificate, the new one is self-signed and your web browser may complain about
that and ask you to "save a security exception".
When you save the new certificate, you can verify that you are connected to the
real ladd00.triumf.ca by comparing the "SHA1 fingerprint" reported by your web
browser to the one given below (as reported by "svn update"):
Certificate information:
- Hostname: ladd00.triumf.ca
- Valid: from Wed, 13 Jun 2012 22:31:51 GMT until Thu, 13 Jun 2013 22:31:51 GMT
- Issuer: DAQ, TRIUMF, Vancouver, BC, CA
- Fingerprint: 82:95:78:cb:78:d3:93:1d:d4:c8:e8:1a:64:0f:62:04:2d:0e:c3:4a
K.O. |
18 Apr 2012, Exaos Lee, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
|
I tried to build MIDAS under ArchLinux, but it failed with the following errors:
src/mlogger.cxx: In function ‘INT midas_flush_buffer(LOG_CHN*)’:
src/mlogger.cxx:1011:54: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
/usr/include/zlib.h:1318:21: error: initializing argument 1 of ‘int gzwrite(gzFile, voidpc, unsigned int)’ [-fpermissive]
src/mlogger.cxx: In function ‘INT midas_log_open(LOG_CHN*, INT)’:
src/mlogger.cxx:1200:79: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
Please refer to attachment elog:786/1 for details. There are also many warnings listed.
This error can be suppressed by adding -fpermissive to CXXFLAGS. But the error message is correct: "gzFile" is not the same as "void *"! C allows implicit conversions between void* and any object pointer type; C++ doesn't. It's better to fix this error. A quick fix would be to add explicit casts, but I'm not sure what the proper way to fix this is. |
19 Apr 2012, Stefan Ritt, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
|
Exaos Lee wrote: | I tried to build MIDAS under ArchLinux, but it failed with the following errors:
src/mlogger.cxx: In function ‘INT midas_flush_buffer(LOG_CHN*)’:
src/mlogger.cxx:1011:54: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
/usr/include/zlib.h:1318:21: error: initializing argument 1 of ‘int gzwrite(gzFile, voidpc, unsigned int)’ [-fpermissive]
src/mlogger.cxx: In function ‘INT midas_log_open(LOG_CHN*, INT)’:
src/mlogger.cxx:1200:79: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
Please refer to attachment elog:786/1 for details. There are also many warnings listed.
This error can be suppressed by adding -fpermissive to CXXFLAGS. But the error message is correct: "gzFile" is not the same as "void *"! C allows implicit conversions between void* and any object pointer type; C++ doesn't. It's better to fix this error. A quick fix would be to add explicit casts, but I'm not sure what the proper way to fix this is. |
Ah, dumb gcc gets pickier and pickier. I added a cast (gzFile)log_chn->gzfile, which fixes the error. I cannot put gzFile into the header file already, since the zlib header is included after the midas header; otherwise we get some other problems. The SVN revision with the fix is 5275. |
25 Apr 2012, Konstantin Olchanski, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
|
Stefan's fix is incomplete - the "gzFile" cast is needed for all calls to zlib, not just those that some version
of GCC happens to complain about. Fixed.
svn rev 5286.
BTW, I read the midas elog via email and if you post html or elcode messages, I receive complete
gibberish. For prompt service, please select message type "plain". (yes, you cannot use fancy colours and
blinking text, but better than me not reading your stuff at all).
BTW2, for easier reading, please include error messages as plain text in your message, as opposed to
compressed attachments.
K.O. |
27 Apr 2012, Stefan Ritt, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
|
KO wrote: | BTW, I read the midas elog via email and if you post html or elcode messages, I receive complete
gibberish. For prompt service, please select message type "plain". (yes, you cannot use fancy colours and
blinking text, but better than me not reading your stuff at all).
BTW2, for easier reading, please include error messages as plain text in your message, as opposed to
compressed attachments.
K.O.
|
BTW3, if you use a real email program you don't get gibberish. I know some people prefer good-old-text-only pine, but I'm sure you do not use the ascii-only browser lynx to browse the internet, right? So if you browse the web in graphics, why not read your email in graphics as well? Better change yourself than the whole rest of the world. |
29 Feb 2012, Konstantin Olchanski, Bug Report, Problem with semaphores
|
Hi there! In the T2K/ND280 experiment in Japan, we keep having problems with MIDAS locking (probably
of ODB). The symptoms are: some program reports a timeout waiting for the ODB lock, then all programs
eventually die with this same error. Complete system meltdown. This does not look like the deadlock
between locks for ODB, cm_msg and the data buffers that I looked into last year. It looks more like
somebody locks ODB, dies and the Linux kernel fails to unlock the lock (via the SYSV "sem undo"
function). But it is hard to confirm, hence this message:
The implementation of semaphores in MIDAS (used for locking ODB and the shared memory data buffers)
uses the straight SYSV semaphore API - which lacks basic debugging features - there is no tracking of
who locked what when, so if anything at all goes wrong, i.e. we are confronted with a timeout
waiting for the ODB lock, the only corrective action possible is to kill all MIDAS clients and tell the user to
start from scratch. There is no additional information available from the SYSV semaphore API to identify
which MIDAS program caused the fault.
The POSIX semaphore API is even worse - no debugging features are available, *and* if a program dies
while holding a lock, the lock stays locked forever (everybody else will wait forever or see a semaphore
timeout, and then what?).
So I am looking for an "advanced semaphore library" to use in MIDAS. In addition to the boring functions
of reliable locking and unlocking, it should support:
- wait with timeout
- remember who is holding the lock
- detect that the process holding the lock is dead and take corrective action (automatic unlock as done by
SYSV semaphores, call back to user code where we can cleanup and unlock ourselves, etc)
- maybe permit recursive locking (not really required as ODB locks are already made recursive "by hand")
- maybe remember some of the locking history (so we can dump it into a log file when we detect a
deadlock or other lock malfunction).
A quick Google search only finds sundry wrappers for SYSV and POSIX semaphores. How they deal with the
problem of processes locking the semaphore and dying remains a mystery to me (other than telling users
to remove the Ctrl-C button from their keyboard). BTW, we have seen this problem with several
commercial applications that use SYSV semaphores but forget to enable the SEM_UNDO function.
Anyhow, if anybody can suggest such an advanced locking library it would be great. Will save me the
effort of writing one.
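To make the wish list above a bit more concrete, here is a rough sketch of how far one can get with plain SYSV calls: a lock with an (approximate) timeout, a query for the PID that last operated on the semaphore, and a check whether that process still exists. It is only an illustration under these assumptions: all lock_xxx() names are invented, the timeout is implemented by polling with IPC_NOWAIT, and the PID reported by GETPID is the lock holder only if taking the lock was the last operation on the semaphore.

```c
#include <errno.h>
#include <signal.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On Linux the caller has to define the semctl() argument union itself. */
union semun_arg {
   int val;
   struct semid_ds *buf;
   unsigned short *array;
};

/* Create a private one-element semaphore set, initially unlocked (value 1).
   Returns the semaphore id or -1 on error. */
int lock_create(void)
{
   union semun_arg arg;
   int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
   if (id < 0)
      return -1;
   arg.val = 1;
   if (semctl(id, 0, SETVAL, arg) < 0) {
      semctl(id, 0, IPC_RMID);
      return -1;
   }
   return id;
}

/* Try to take the lock for roughly timeout_ms milliseconds, polling once per
   millisecond. SEM_UNDO asks the kernel to undo the operation if this process
   dies while holding the lock. Returns 0 on success, -1 on timeout (errno is
   EAGAIN) or on error. */
int lock_take(int id, int timeout_ms)
{
   struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO | IPC_NOWAIT };
   struct timespec pause = { 0, 1000000L };
   int waited_ms = 0;

   for (;;) {
      if (semop(id, &op, 1) == 0)
         return 0;
      if (errno != EAGAIN && errno != EINTR)
         return -1;
      if (waited_ms >= timeout_ms) {
         errno = EAGAIN;
         return -1;
      }
      nanosleep(&pause, NULL);
      waited_ms++;
   }
}

/* Release the lock taken with lock_take(). */
int lock_release(int id)
{
   struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
   return semop(id, &op, 1);
}

/* PID of the process that performed the last operation on the semaphore. */
pid_t lock_last_pid(int id)
{
   return semctl(id, 0, GETPID);
}

/* Returns 1 if the semaphore is locked and the last PID no longer exists.
   With SEM_UNDO the kernel should already have unlocked it in that case, so
   this mainly catches clients that forgot SEM_UNDO. */
int lock_holder_is_dead(int id)
{
   pid_t pid = lock_last_pid(id);
   if (semctl(id, 0, GETVAL) != 0 || pid <= 0)
      return 0; /* not locked, or no information */
   return kill(pid, 0) < 0 && errno == ESRCH;
}

/* Remove the semaphore set from the system. */
int lock_destroy(int id)
{
   return semctl(id, 0, IPC_RMID);
}
```

This covers "wait with timeout" and, to a limited degree, "remember who is holding the lock" and "detect that the holder is dead". It does not provide a locking history or callbacks into user code.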
K.O. |
01 Mar 2012, Stefan Ritt, Bug Report, Problem with semaphores
|
> Anyhow, if anybody can suggest such an advanced locking library it would be great. Will save me the
> effort of writing one.
Hi Konstantin,
yes there is a good way, which I used during development of the buffer manager function. Put in each sm_xxx function a cm_msg(M_DEBUG, ...) to
generate a debug system message. They go only into the SYSMSG ring buffer and thus are light weight and don't influence the timing much. You can
keep odbedit open to see these messages, but there is also another way. You can write a little program which dumps the whole SYSMSG buffer, which
you can call when the lock happens. You then look "backwards" in time and get all messages stored there, depending on the size of the SYSMSG buffer of
course. Of course this only works if the lock does not happen on the SYSMSG buffer itself. In that case you have to produce M_LOG messages which are
written to the logging file. This will influence the timing slightly (the file might grow rapidly) but you are independent of semaphores.
The interesting thing is that in the MEG experiment (9 Front-ends, Event Builder, Logger, Lazylogger, ....) we run for months without any lock up. So I
suspect that in your case it is caused by a program only you are using.
Best regards,
Stefan |
30 Jan 2012, Stefan Ritt, Info, IEEE Real Time 2012 Call for Abstracts
|
Hello,
I'm co-organizing the upcoming Real Time Conference, which also covers the field of data acquisition, so it might be interesting for people working
with MIDAS. If you have something to report, you could consider sending an abstract to this conference. It will be nicely located in Berkeley,
California. We plan excursions to San Francisco and to Napa Valley.
Best regards,
Stefan Ritt
---------------------------
18th Real Time Conference
June 11 – 15, 2012
Berkeley, CA
We invite you to the Hotel Shattuck Plaza in downtown Berkeley, California for
the 2012 Real-Time Conference (RT2012). It will take place Monday, June 11
through Friday, June 15, 2012, with optional pre-conference tutorials Saturday
and Sunday, June 9-10.
Like the previous editions, RT2012 will be a multidisciplinary conference
devoted to the latest developments on realtime techniques in the fields of
plasma and nuclear fusion, particle physics, nuclear physics and astrophysics,
space science, accelerators, medical physics, nuclear power instrumentation and
other radiation instrumentation.
Abstract submission is open as of 18 January (deadline 2 March). Please visit
http://www.npss-confs.org/rtc/welcome.asp?flag=44675.77&Retry=1 to submit an
abstract.
Call for Abstracts
RT 2012 is an interdisciplinary conference on realtime data acquisition and
computing applications in the physical sciences. These applications include:
* High energy physics
* Nuclear physics
* Astrophysics and astroparticle physics
* Nuclear fusion
* Medical physics
* Space instrumentation
* Nuclear power instrumentation
* Realtime security and safety
* General Radiation Instrumentation
Specific topics include (but are certainly not limited to) the list shown below.
We welcome correspondence to see how your research fits our venue.
Key Dates
* Abstract submission opened: January 18, 2012
* Abstract deadline: March 2, 2012
* Program available: April 2
Suggested Topics
* Realtime system architectures
* Intelligent signal processing
* Programmable devices
* Fast data transfer links and networks
* Trigger systems
* Data acquisition
* Processing farms
* Control, monitoring, and test systems
* Upgrades
* Emerging realtime technologies
* New standards
* Realtime safety and security
* Feedback on experiences
Contact Information
If you have a question or wish to opt in for occasional e-mail updates about
RT2012, send us a message at RT2012@lbl.gov. To view full conference
information, visit http://rt2012.lbl.gov/index.html |
05 Sep 2011, John McMillan, Forum, khyt1331 under scientific linux 5.5?
|
Hello,
I'm trying to build khyt1331 under scientific linux 5.5, kernel
2.6.18-238.9.1el5. Has anyone succeeded with this? So far, I've
managed to compile by hacking all the references to man9 pages out
of the makefile. I've then hand installed the kernel driver with
insmod. cat /proc/khyt1331 produces
Hytec 5331 card found at address 0xE800, using interrupt 10
Device not in use
CAMAC crate 0: responding
CAMAC crate 1: not responding
CAMAC crate 2: not responding
CAMAC crate 3: not responding
and the "addr" LED blinks - so progress of some sort.
There's no sign of /dev/camac.
Next up I'm going to compile stuff like camactest.c - though the
makefiles in the drivers folder don't mention these, so I'll have to
work through what is needed by hand.
At some point I'll have to rewrite a bit so that it all loads automatically.
Any hints or tips gratefully received.
John McMillan |
25 Aug 2011, Francesco Prelz, Forum, 64-bit integer support in MIDAS
|
Hi,
I've been doing some preliminary work to use at least the MIDAS
SQL history component for a new CERN experiment (Aegis). I wonder
whether there is any plan to support 64-bit signed/unsigned integer data types
in MIDAS. time_t on 64-bit architectures is actually signed 64-bit
(the 'easy' way to work around the 2038 crisis), and this may be enough to
cause problems.
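To illustrate the concern, here is a small sketch that checks whether a timestamp survives storage in a signed 32-bit field. The helper names are made up for this example; nothing here is MIDAS code.

```c
#include <stdint.h>
#include <time.h>

/* Returns 1 if the timestamp survives a round trip through a signed 32-bit
   field, 0 if it would be truncated. */
int fits_in_32bit(int64_t t)
{
   return t >= INT32_MIN && t <= INT32_MAX;
}

/* Calendar year (UTC) of a timestamp, or -1 if it cannot be converted
   (for example when time_t itself is too narrow to hold the value). */
int utc_year(int64_t t)
{
   time_t tt = (time_t) t;
   struct tm tm_utc;

   if ((int64_t) tt != t || gmtime_r(&tt, &tm_utc) == NULL)
      return -1;
   return tm_utc.tm_year + 1900;
}
```

The largest value that fits, INT32_MAX seconds after the epoch, falls in January 2038; anything later needs a 64-bit field.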
Thanks.
Francesco Prelz
INFN Milano |
11 Jul 2011, Konstantin Olchanski, Info, Make "STOP" run transition always succeed
|
Over the years, there have been some back-and-forth changes in what happens to run transitions when some
of the participants misbehave (do not respond to RPC calls, timeout, crash, etc).
The very original behaviour was to ignore all errors. This resulted in user confusion when some clients
would start, some would not, data from frontends that missed the transition did not arrive, etc.
So it was changed to fail the transition if any client misbehaves.
This left mlogger (who is usually the first one to see the TR_START transition) in a funny state - output
file is open, etc, but there is no run active. This was fixed by adding a TR_STARTABORT transition to tell
mlogger, event builder & co that the just started run did not start after all.
Also at some point code was added to forcefully kill clients that do not respond to run transitions (do
not respond to RPC, timeout, etc).
Recently, it was observed how during unattended overnight operation of a MIDAS DAQ system, with the
logger set to "auto restart", some unnecessary clients misbehave during the run stop transition, and
prevent the run from stopping and restarting. The user comes in the morning and is unhappy that data
taking stopped some time during the night.
midas.c svn rev 5136 changes the TR_STOP transition to always succeed, even if some clients had
transition errors. If these clients are unnecessary for normal operation of the DAQ, the following run
"auto restart" will continue taking data. If those were important clients, data taking will continue the
best it can - it *is* unattended operation - nobody is looking - but users can always setup alarms for
checking that important clients are always running during data taking. (For very important clients, one
can setup alarms to send email, send SMS messages, etc).
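The policy described above can be sketched as follows. This is an illustration only, not the code in midas.c; the enum names and overall_status() are hypothetical and do not match the MIDAS definitions.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; they are not the MIDAS definitions. */
enum sketch_transition { SK_START, SK_STOP };
enum sketch_status { SK_FAILED = 0, SK_SUCCESS = 1 };

/* Combine per-client results into the overall transition status.
   A start fails if any client failed. A stop always succeeds; the number of
   failed clients is still reported through n_failed so that it can be logged
   or used for alarms. */
int overall_status(enum sketch_transition tr, const int *client_ok,
                   size_t n_clients, size_t *n_failed)
{
   size_t failed = 0;

   for (size_t i = 0; i < n_clients; i++)
      if (!client_ok[i])
         failed++;

   if (n_failed != NULL)
      *n_failed = failed;

   if (tr == SK_STOP)
      return SK_SUCCESS;
   return failed == 0 ? SK_SUCCESS : SK_FAILED;
}
```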
K.O. |
27 Jun 2011, Konstantin Olchanski, Info, midas shared memory changes
|
A number of changes were made to the midas shared memory implementation for
Linux and MacOS:
1) SysV or POSIX shared memory compile-type choice is removed. Both shared
memory types are compiled-in and are selected at run time.
2) the shared memory type used by an experiment is recorded in the file
.SHM_TYPE.TXT. Currently implemented are "POSIXv2_SHM" (the new default for new
experiments), "POSIX_SHM", "MMAP_SHM" and "SYSV_SHM". (see system.c) (MMAP_SHM
is fully functional but is not recommended). The POSIXv2_SHM uses an improved
filename scheme (on Linux, see "ls -l /dev/shm") and permits multiple
experiments to coexist on a MacOS computer (where there is a severe limit on
shared memory filename length).
3) following a number of mishaps where "odbedit" has been run on the wrong
computer (causing havoc with ODB and .xxx.SHM files), for each experiment, the
hostname of the computer where the ODB shared memory is meant to reside is now
recorded in the file .SHM_HOST.TXT. Typically, this is the machine running
mserver, mhttpd and mlogger. If some client is accidentally started on the wrong
machine or if MIDAS_SERVER_HOST is accidentally left undefined, MIDAS will now
print a stern message reporting the hostname mismatch, tell the user to use the
mserver and refuse to run. The user has the choice of starting the client on the
correct computer (as reported in the error message), using the mserver (start
client with -H flag) or edit/delete the .SHM_HOST.TXT file (full pathname is
reported by the error message).
With this update, MIDAS on MacOS becomes fully functional (before, only one
experiment could be used at a time).
svn rev 5105
K.O. |
05 Jul 2011, Konstantin Olchanski, Info, midas shared memory changes
|
> 2) the shared memory type used by an experiment is recorded in the file .SHM_TYPE.TXT.
An error with creating the file .SHM_TYPE.TXT was corrected in system.c svn rev 5125 - if the file did not exist, it was
created correctly, but MIDAS reported "cannot connect to ODB". A second try worked correctly because the file existed
by then.
> 3) the hostname of the computer where the ODB shared memory is meant to reside is now
> recorded in the file .SHM_HOST.TXT.
This is causing problems on mobile computers where "hostname" changes all the time (i.e. set according to
DHCP on whatever network happens to be connected).
If you run into this problem, keep deleting .SHM_HOST.TXT or use this workaround: disable the hostname check
by making the file .SHM_HOST.TXT empty (zero length).
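For illustration, the hostname check and the empty-file workaround could look roughly like this. This is a sketch of the described behaviour, not the code in system.c; shm_host_allowed() is a made-up name, and treating a missing file as "no check" is an assumption of this example.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if a client may run on `hostname`, 0 on a hostname mismatch.
   `filename` is the full path of the .SHM_HOST.TXT file. */
int shm_host_allowed(const char *filename, const char *hostname)
{
   char line[256];
   FILE *fp = fopen(filename, "r");

   if (fp == NULL)
      return 1; /* assumption of this sketch: no file, no check */

   if (fgets(line, sizeof(line), fp) == NULL) {
      fclose(fp);
      return 1; /* empty (zero length) file disables the check */
   }
   fclose(fp);

   line[strcspn(line, "\r\n")] = '\0'; /* strip the trailing newline */
   return strcmp(line, hostname) == 0;
}
```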
K.O. |
10 Jul 2011, Konstantin Olchanski, Bug Fix, midas shared memory changes
|
> > 2) the shared memory type used by an experiment is recorded in the file .SHM_TYPE.TXT.
> > 3) the hostname of the computer where the ODB shared memory is meant to reside is now
> > recorded in the file .SHM_HOST.TXT.
Due to a typo in src/system.c svn rev 5125, ss_shm_delete() did not work at all. This broke "odbedit -R", "odbedit -s 5000000" (to change ODB size), etc.
Fixed in src/system.c svn rev 5134. (It is safe to update just this one file to fix this problem).
Sorry for the inconvenience,
K.O. |
11 Jul 2011, Konstantin Olchanski, Bug Fix, midas shared memory changes
|
> > > 2) the shared memory type used by an experiment is recorded in the file .SHM_TYPE.TXT.
> > > 3) the hostname of the computer where the ODB shared memory is meant to reside is now
> > > recorded in the file .SHM_HOST.TXT.
Because the mserver did not set up the correct experiment name and path, POSIX shared memory did not work at all when used with the mserver. Fixed in mserver.c rev 5135
Sorry for the inconvenience,
K.O. |
05 Jul 2011, Konstantin Olchanski, Bug Report, MacOS network socket timeouts non-functional
|
It turns out that because of differences in the select() syscall implementation between UNIX (MacOS,
maybe BSD) and Linux, network socket timeouts do not work on MacOS.
This affects timeouts during run transitions (transition calls to dead clients do not timeout), maybe other
places.
I am looking into fixing this. The main difficulty is with UNIX select() not updating the timeout parameter
when it is interrupted by the MIDAS watchdog alarm signal. Linux select() subtracts the elapsed time from
the timeout value and this code from system.c works correctly: while (1) { status = select(..., &timeout); if
(status==0) break; } (value of timeout becomes smaller each time), while on MacOS it loops forever (value
of timeout does not change).
K.O. |
27 Jun 2011, Konstantin Olchanski, Info, mlogger lock for runNNN.mid.gz files
|
By popular request, Stefan R. implemented a locking scheme for mlogger output files.
To use this function, set the mlogger ODB /Logger/Channels/NNN/Settings/Filename
to ".run%05dsub%05d.mid.gz" (note the leading dot).
In this mode, active output files will have a filename with a leading dot
(.run00001sub00001.mid.gz) while the file is being written to. After the file is
closed, it is renamed and the leading dot is removed.
To use this function with the lazylogger, please set ODB
"/Lazy/Foo/Settings/Filename format" to "run*.mid.gz,run*.xml" (note the leading
text "run"). Set "stay behind" to 0.
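The naming scheme can be sketched like this. It is only an illustration of the leading-dot convention, not the mlogger code; active_name() and publish_file() are made-up names, and the file is assumed to live in the current directory.

```c
#include <stdio.h>
#include <string.h>

/* Build the name used while the file is being written, following the
   ".run%05dsub%05d.mid.gz" pattern quoted above. Returns 0 on success. */
int active_name(char *out, size_t size, int run, int subrun)
{
   int n = snprintf(out, size, ".run%05dsub%05d.mid.gz", run, subrun);
   return (n > 0 && (size_t) n < size) ? 0 : -1;
}

/* Rename a closed file so that the leading dot is removed.
   Returns 0 on success, -1 if the name has no leading dot or rename() fails. */
int publish_file(const char *name)
{
   if (strlen(name) < 2 || name[0] != '.')
      return -1;
   return rename(name, name + 1);
}
```

A pattern such as "run*.mid.gz" then only matches files that have been closed and renamed, since the active file still starts with a dot.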
svn rev 5080 (or so, checked in by Stefan R.)
K.O. |
27 Jun 2011, Konstantin Olchanski, Info, updated mhttpd history "export" function
|
The mhttpd history "export" function has been converted to the new midas history
interface and should now work for SQL-based history systems. In the process,
improvements by Eoin Butler (CERN AD-5/ALPHA) were merged - adding a UNIX
timestamp and a better text timestamp. Also now "export" outputs the actual
values from the history file - the scaling values from the definition of the
history plot panel are no longer applied.
Here is an example of the new file format:
Time, Timestamp, Run, Run State, SLOW
2011.06.21 15:45:21, 1308696321, 13292, 3, -89.1007
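For reading such a file back, a data line with a single variable could be parsed roughly as below. This is a sketch based only on the two example lines above; the struct and function names are made up, and panels with more variables would need more columns.

```c
#include <stdio.h>

/* One row of the export format shown above, for a panel with one variable. */
struct export_row {
   char text_time[20]; /* "YYYY.MM.DD HH:MM:SS" plus terminating NUL */
   long timestamp;     /* UNIX time */
   int run;
   int run_state;
   double value;
};

/* Parse one data line. Returns 1 on success, 0 if the line does not match
   (for example the header line). */
int parse_export_row(const char *line, struct export_row *row)
{
   int n = sscanf(line, "%19[^,], %ld, %d, %d, %lf",
                  row->text_time, &row->timestamp, &row->run,
                  &row->run_state, &row->value);
   return n == 5;
}
```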
svn rev 5104
K.O. |
|