Back Midas Rome Roody Rootana
  Midas DAQ System, Page 2 of 121  Not logged in ELOG logo
ID Date Author Topic Subjectup
  229   17 Oct 2005 Exaos LeeBug Fix"make install" error under MacOS X
Under MacOS X, "make install" will cours an error like this:
...
install: darwin/bin/dio: No such file or directory
make: *** [install] Error 71

This can be fixed as the following diff:
404,405c404,405
< $(BIN_DIR)/mcnaf: $(UTL_DIR)/mcnaf.c $(DRV_DIR)/camac/camacrpc.c
<       $(CC) $(CFLAGS) $(OSFLAGS) -o $@ $(UTL_DIR)/mcnaf.c $(DRV_DIR)/camac/camacrpc.c $(LIB) $(LIBS)
---
> $(BIN_DIR)/mcnaf: $(UTL_DIR)/mcnaf.c $(DRV_DIR)/bus/camacrpc.c
>       $(CC) $(CFLAGS) $(OSFLAGS) -o $@ $(UTL_DIR)/mcnaf.c $(DRV_DIR)/bus/camacrpc.c $(LIB) $(LIBS)
438c438,439
<       @for i in mserver mhttpd odbedit mlogger ; \
---
> 
>       @for i in mserver mhttpd odbedit mlogger dio ; \
444,447d444
<       chmod +s $(SYSBIN_DIR)/mhttpd
< 
< ifeq ($(OSTYPE),linux)
<       install -v -m 755 $(BIN_DIR)/dio $(SYSBIN_DIR)
449c446
< endif
---
>       chmod +s $(SYSBIN_DIR)/mhttpd
  678   26 Nov 2009 Konstantin OlchanskiBug Report"mserver -s" is broken
I notice that "mserver -s" (a non-default mode of operation) does not work right
- if I connect odbedit for the first time, all is okey, if I connect the second
time, mserver crashes - because after the first connection closed,
rpc_deregister_functions() was called, rpc_list is deleted and causes a crash
later on. Because everybody uses the default "mserver -m" mode, I am not sure
how important it is to fix this.
K.O.
  680   27 Nov 2009 Stefan RittBug Report"mserver -s" is broken
> I notice that "mserver -s" (a non-default mode of operation) does not work right
> - if I connect odbedit for the first time, all is okey, if I connect the second
> time, mserver crashes - because after the first connection closed,
> rpc_deregister_functions() was called, rpc_list is deleted and causes a crash
> later on. Because everybody uses the default "mserver -m" mode, I am not sure
> how important it is to fix this.
> K.O.

"mserver -s" is there for historical reasons and for debugging. I started originally 
with a single process server back in the 90's, and only afterwards developed the multi 
process scheme. The single process server now only works for one connection and then 
crashes, as you described. But it can be used for debugging any server connection, 
since you don't have to follow the creation of a subprocess with your debugger, and 
therefore it's much easier. But after the first connection has been closed, you have 
to restart that single server process. Maybe one could add some warning about that, or 
even fix it, but it's nowhere used in production mode.
  681   27 Nov 2009 Konstantin OlchanskiBug Report"mserver -s" is broken
> 
> "mserver -s" is there for historical reasons and for debugging.
>

I confirm that my modification also works for "mserver -s". I also added an assert() to the
place in midas.c were it eventually crashes, to make it more obvious for the next guys.

K.O.
  2379   31 Mar 2022 Konstantin OlchanskiBug Fix"run stop" trouble in mlogger, fixed
while debugging something else, I ran into a bit of trouble in mlogger.

I set the mlogger event limit to 100, and after reaching 100 events, mlogger
sayd "stopping run", but nothing happened, run kept going.

it turns out mlogger tried stopping the run too soon, the run-start
transition did not finish yet and the error message about trying
to stop a run while another transition is in progress was missing.

(fixed - if another transition is in progress, we try again later)

it also turns out that cm_transition() checks if another transition
is in progress way too late, all the way in the transition thread,
where it cannot return it is an error to mlogger.

(fixed - first thing done in cm_transition() is this check).

while debugging this, I tested the ODB flags "/Logger/Async transitions"
and /Logger/Multithread transitions". It turns out only two transition
types still work from inside mlogger - multithread transition
and detached transition (via the mtransition helper).

the issue is the dead lock between mlogger and frontend. while mlogger
is inside cm_transition(), it is not reading the SYSTEM buffer,
while at the same time frontends are writing into it. If SYSTEM
buffer happens to be pretty full, we dead lock - frontends are waiting
for free space in the SYSTEM buffer do not respond to RPCs, mlogger is not 
reading from the SYSTEM and it stuck trying to issue "run stop" RPC
to frontend. (this dead lock is not forever, eventually frontend
is killed by RPC timeout, mlogger survives and stops the run).

this is a well known problem and as solution, mlogger has been using the 
multithreaded transitions for years.

now I removed the OBD /Logger/Async transition and /Logger/Multithread 
transition flags, instead, there is now a flag /Logger/Detached transitions
set to FALSE by default. Setting it to TRUE will cause mlogger to fork 
"mtransition STOP" and "mtransition START" for stopping and starting runs,
this is useful in case there is trouble with multithreading in mlogger.

K.O.
  119   30 Oct 2003 Stefan Ritt 'umask' added to lazylogger for FTP connections
I had to add a 'umask' opiton to the loggers (lazy and mlogger) for the new 
PSI archive. One can now put a filename into the settings like:

archive,21,user,pw,dir,run%05d.mid,026

where the optional last parameter is used for a "umask 026" command just 
sent to the FTP server after the connection has been established. This 
changes the mode bits of the newly transferred file. We needed that so that 
the files are group readable, since several people from one group want to 
read the data.

I committed mlogger.c and ybos.c which contains the ftp code (should 
actually go into lazylogger.c instead of ybos.c).
  2310   26 Jan 2022 Frederik WautersForum.gz files
I adapted our analyzer to compile against the manalyzer included in the midas repo.

All our data files are .mid.gz, which now fail to process :(

frederik@frederik-ThinkPad-T550:~/new_daq/build/analyzer$ ./analyzer -e100 -s100 ../../run_backup_11783.mid.gz 
...
...n
Registered modules: 1
file[0]: ../../run_backup_11783.mid.gz
Setting up the analysis!
TMReadEvent: error: short read 0 instead of -1193512213

Which is in the TMEvent* TMReadEvent(TMReaderInterface* reader) class in the midasio.cxx file

Reading the unzipped files works. But we have always processed our .gz files directly, for the unzipping we would need ~2x disk space.

Am I doing something wrong? I see that there is some activity on lz4 in the midasio repo, is gunzip next?
  2311   26 Jan 2022 Konstantin OlchanskiForum.gz files
> I adapted our analyzer to compile against the manalyzer included in the midas repo.
> TMReadEvent: error: short read 0 instead of -1193512213

I think this problem is fixed in the latest version of midasio and manalyzer, but this update
was not pulled into midas yet. (Canada is in the middle of a covid wave since December).

What happens is you do not have the gzip library installed on your computer and
your analyzer is built without support for gzip.

The fix is done the hard way, the gzip library is no longer optional, but required.

You do not say what linux you use, so I cannot give exact instructions, but for:
ubuntu: apt -y install libz-dev
centos7: installed by default
centos8: installed by default
debian11/raspbian: same as ubuntu

K.O.
  2328   31 Jan 2022 Frederik WautersForum.gz files
> > I adapted our analyzer to compile against the manalyzer included in the midas repo.
> > TMReadEvent: error: short read 0 instead of -1193512213
> 
> I think this problem is fixed in the latest version of midasio and manalyzer, but this update
> was not pulled into midas yet. (Canada is in the middle of a covid wave since December).
> 
> What happens is you do not have the gzip library installed on your computer and
> your analyzer is built without support for gzip.
> 
> The fix is done the hard way, the gzip library is no longer optional, but required.
> 
> You do not say what linux you use, so I cannot give exact instructions, but for:
> ubuntu: apt -y install libz-dev
> centos7: installed by default
> centos8: installed by default
> debian11/raspbian: same as ubuntu
> 
> K.O.

My libz under ubuntu

-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11") 
-- MIDAS: Found ZLIB version 1.2.11

I got both the manalyzer example and mine going with
* the latest midas dev
* + the latest manalyzer (cf6c233)
* + almost latest midasio (568a617, otherwise I get an linking error 

./libmidas.a(midasio.cxx.o): In function `Lz4Error(int)':
midasio.cxx:(.text+0x359): undefined reference to `MLZ4F_getErrorName(unsigned long)'

So this works, I will assume that in the near future this all will come together in the standard midas release.

thanks  
  1168   09 Mar 2016 Konstantin OlchanskiInfo/Experiment/Edit on start/Edit Run number
The MIDAS documentation here:
  https://midas.triumf.ca/MidasWiki/index.php/Edit-on-start_Parameters
is missing informaiton about this ODB entry:
  /Experiment/Edit on start/Edit Run number (TID_BOOL)

This is what it does in mhttpd:
a) if it exists and is of type TID_BOOL and set to "n", run number is not editable
b) "Edit run number" itself is hidden, will not show up on the web page

This is what it does in odbedit:
a) it is hidden, will not show up in the list of run parameters
b) it's value has no effect, run number is always editable.

K.O.
  2216   15 Jun 2021 Konstantin OlchanskiInfo1000 Mbytes/sec through midas achieved!
I am sure everybody else has 10gige and 40gige networks and are sending terabytes of data before breakfast.

Myself, I only have one computer with a 10gige network link and sufficient number of daq boards to fill
it with data. Here is my success story of getting all this data through MIDAS.

This is the anti-matter experiment ALPHA-g now under final assembly at CERN. The main particle detector is a long but 
thin cylindrical TPC. It surrounds the magnetic bottle (particle trap) where we make and study anti-hydrogen. There are 
64 daq boards to read the TPC cathode pads and 8 daq boards to read the anode wires and to form the trigger. Each daq 
board can produce data at 80-90 Mbytes/sec (1gige links). Data is sent as UDP packets (no jumbo frames). Altera FPGA 
firmware was done here at TRIUMF by Bryerton Shaw, Chris Pearson, Yair Lynn and myself.

Network interconnect is a 96-port Juniper switch with a 10gige uplink to the main daq computer (quad core Intel(R) 
Xeon(R) CPU E3-1245 v6 @ 3.70GHz, 64 GBytes of DDR4 memory).

MIDAS data path is: UDP packet receiver frontend -> event builder -> mlogger -> disk -> lazylogger -> CERN EOS cloud 
storage.

First chore was to get all the UDP packets into the main computer. "U" in UDP stands for "unreliable", and at first, UDP 
packets have been disappearing pretty much anywhere they could. To fix this, in order:

- reading from the udp socket must be done in a dedicated thread (in the midas context, pauses to write statistics or 
check alarms result in lost udp packets)
- udp socket buffer has to be very big
- maximum queue sizes must be enabled in the 10gige NIC
- ethernet flow control must be enabled on the 10gige link
- ethernet flow control must be enabled in the switch (to my surprise many switches do not have working end-to-end 
ethernet flow control and lose UDP packets, ask me about this. our big juniper switch balked at first, but I got it 
working eventually).
- ethernet flow control must be enabled on the 1gige links to each daq module
- ethernet flow control must be enabled in the FPGA firmware (it's a checkbox in qsys)
- FPGA firmware internally must have working back pressure and flow control (avalon and axi buses)
- ideally, this back-pressure should feed back to the trigger. ALPHA-g does not have this (it does not need it).

Next chore was to multithread the UDP receiver frontend and to multithread the event builder. Stock single-threaded 
programs quickly max out with 100% CPU use and reach nowhere near 10gige data speeds.

Naive multithreading, with two threads, reader (read UDP packet, lock a mutex, put it into a deque, unlock, repeat) and 
sender (lock a mutex, get a packet from deque, unlock, bm_send_event(), repeat) spends all it's time locking and 
unlocking the mutex and goes nowhere fast (with 1500 byte packets, about 600 kHz of lock/unlock at 10gige speed).

So one has to do everything in batches: reader thread: accumulate 1000 udp packets in an std::vector, lock the mutex, 
dump this batch into a deque, unlock, repeat; sender thread: lock mutex, get 1000 packets from the deque, unlock, stuff 
the 1000 packets into 1 midas event, bm_send_event(), repeat.

It takes me 5 of these multithreaded udp reader frontends to keep up with a 10gige link without dropping any UDP packets. 
My first implementation chewed up 500% CPU, that's all of it, there is only 4 CPU cores available, leaving nothing
for the event builder (and mlogger, and ...)

I had to:
a) switch from plain socket read() to socket recvmmsg() - 100000 udp packets per syscall vs 1 packet per syscall, and
b) switch from plain bm_send_event() to bm_send_event_sg() - using a scatter-gather list to avoid a memcpy() of each udp 
packet into one big midas event.

Next is the event builder.

The event builder needs to read data from the 5 midas event buffers (one buffer per udp reader frontend, each midas event 
contains 1000 udp packets as indovidual data banks), examine trigger timestamps inside each udp packet, collect udp 
packets with matching timestamps into a physics event, bm_send_event() it to the SYSTEM buffer. rinse and repeat.

Initial single threaded implementation maxed out at about 100-200 Mbytes/sec with 100% busy CPU.

After trying several threading schemes, the final implementation has these threads:
- 5 threads to read the 5 event buffers, these threads also examine the udp packets, extract timestamps, etc
- 1 thread to sort udp packets by timestamp and to collect them into physics events
- 1 thread to bm_send_event() physics events to the SYSTEM buffer
- main thread and rpc handler thread (tmfe frontend)

(Again, to reduce lock contention, all data is passed between threads in large batches)

This got me up to about 800 Mbytes/sec. To get more, I had to switch the event builder from old plain bm_send_event() to 
the scatter-gather bm_send_event_sg(), *and* I had to reduce CPU use by other programs, see steps (a) and (b) above.

So, at the end, success, full 10gige data rate from daq boards to the MIDAS SYSTEM buffer.

(But wait, what about the mlogger? In this experiment, we do not have a disk storage array to sink this
much data. But it is an already-solved problem. On the data storage machines I built for GRIFFIN - 8 SATA NAS HDDs using 
raidz2 ZFS - the stock MIDAS mlogger can easily sink 1000 Mbytes/sec from SYSTEM buffer to disk).

Lessons learned:

- do not use UDP. dealing with packet loss will cost you a fortune in headache medicines and hair restorations.
- use jumbo frames. difference in per-packet overhead between 1500 byte and 9000 byte packets is almost a factor of 10.
- everything has to be done in bulk to reduce per-packet overheads. recvmmsg(), batched queue push/pop, etc
- avoid memory allocations (I has a per-packet std::string, replaced it with char[5])
- avoid memcpy(), use writev(), bm_send_event_sg() & co

K.O.

P.S. Let's counting the number of data copies in this system:

x udp reader frontend:
- ethernet NIC DMA into linux network buffers
- recvmmsg() memcpy() from linux network buffer to my memory
- bm_send_event_sg() memcpy() from my memory to the MIDAS shared memory event buffer

x event builder:
- bm_receive_event() memcpy() from MIDAS shared memory event buffer to my event buffer
- my memcpy() from my event buffer to my per-udp-packet buffers
- bm_send_event_sg() memcpy() from my per-udp-packet buffers to the MIDAS shared memory event buffer (SYSTEM)

x mlogger:
- bm_receive_event() memcpy() from MIDAS SYSTEM buffer
- memcpy() in the LZ4 data compressor
- write() syscall memcpy() to linux system disk buffer
- SATA interface DMA from linux system disk buffer to disk.

Would a monolithic massively multithreaded daq application be more efficient?
("udp receiver + event builder + logger"). Yes, about 4 memcpy() out of about 10 will go away.

Would I be able to write such a monolithic daq application?

I think not. Already, at 10gige data rates, for all practical purposes, it is impossible
to debug most problems, especially subtle trouble in multithreading (race conditions)
and in memory allocations. At best, I can sprinkle assert()s and look at core dumps.

So the good old divide-and-conquer approach is still required, MIDAS still rules.

K.O.
  2217   15 Jun 2021 Stefan RittInfo1000 Mbytes/sec through midas achieved!
In MEG II we also kind of achieved this rate. Marco F. will post an entry soon to describe the details. There is only one thing 
I want to mention, which is our network switch. Instead of an expensive high-grade switch, we chose a cheap "Chinese" high-grade 
switch. We have "rack switches", which are collector switch for each rack receiving up to 10 x 1GBit inputs, and outputting 1 x 
10 GBit to an "aggregation switch", which collects all 10 GBit lines form rack switches and forwards it with (currently a single 
) 10 GBit line. For the rack switch we use a 

MikroTik CRS354-48G-4S+2Q+RM 54 port

and for the aggregation switch

MikroTik CRS326-24S-2Q+RM 26 Port

both cost in the order of 500 US$. We were astonished that they don't loose UDP packets when all inputs send a packet at the 
same time, and they have to pipe them to the single output one after the other, but apparently the switch have enough buffers 
(which is usually NOT written in the data sheets). 

To avoid UDP packet loss for several events, we do traffic shaping by arming the trigger only when the previous event is 
completely received by the frontend. This eliminates all flow control and other complicated methods. Marco can tell you the 
details.

Another interesting aspect: While we get the data into the frontend, we have problems in getting it through midas. Your 
bm_send_event_sg() is maybe a good approach which we should try. To benchmark the out-of-the-box midas, I run the dummy frontend 
attached on my MacBook Pro 2.4 GHz, 4 cores, 16 GB RAM, 1 TB SSD disk. I got

Event size: 7 MB

No logging: 900 events/s = 6.7 GBytes/s

Logging with LZ4 compression: 155 events/s = 1.2 GBytes/s

Logging without compression: 170 events/s = 1.3 GBytes/s

So with this simple approach I got already more than 1 GByte of "dummy data" through midas, indicating that the buffer 
management is not so bad. I did use the plain mfe.c frontend framework, no bm_send_event_sg() (but mfe.c uses rpc_send_event() which is an 
optimized version of bm_send_event()).

Best,
Stefan
  2218   16 Jun 2021 Marco FrancesconiInfo1000 Mbytes/sec through midas achieved!
As reported by Stefan, in MEG II we have very similar ethernet throughputs.
In total, we have 34 crates each with 32 DRS4 digitiser chips and a single 1 Gbps readout link through a Xilinx Zynq SoC.
The data arrives in push mode without any external intervention, the only throttling being an optional prescaling on the trigger rate.
We discovered the hard way that 1 Gbps throughput on Zynq is not trivial at all: the embedded ethernet MAC does not support jumbo frames (always read the fine prints in the manuals!) and the embedded Linux ethernet stack seems to struggle when we go beyond 250 Mbps of UDP traffic.

Anyhow, even with the reduced speed, the maximum throughput at network input is around 8.5 Gbps which passes through the Mikrotik switches mentioned by Stefan.
We had very bad experiences in the past with similar price-point switches, observing huge packet drops when the instantaneous switching capacity cannot cope with the traffic, but so far we are happy with the Mikrotik ones.

On the receiver side, we have the DAQ server with an Intel E5-2630 v4 CPU and a 10 Gbit connection to the network using an Intel X710 Network card.
In the past, we used also a "cheap" 10 Gbit card from Tehuti but the driver performance was so bad that it could not digest more than 5 Gbps of data.

The current frontend is based on the mfe.c scheme for historical reasons (the very first version dates back to 2015).
We opted for a monolithic multithread solution so we can reuse the underlying DAQ code for other experiments which may not have the complete Midas backend.
Just to mention them: one is the FOOT experiment (which afaik uses an adapted version of Altas DAQ) and the other is the LOLX experiment (for which we are going to ship to Canada soon a small 32 channel system using Midas).
A major modification to Konstantin scheme is that we need to calibrate all WFMs online so that a software zero suppression can be applied to reduce the final data size (that part is still to be implemented).
This requirement results in additional resource usage to parse the UDP content into floats and calibrate them.
Currently, we have 7 packet collector threads to digest the full packet flow (using recvmmsg), followed by an event building stage that uses 4 threads and 3 other threads for WFM calibration.
We have progressive packet numbers on each packet generated by the hardware and a set of flags marking the start and end of the event; combining the packet number difference between the start and end of the event and the total received packets for that event it is really easy to understand if packet drops are happening.

All the thread infrastructure was tested and we could digest the complete throughput, we still have to finalise the full 10 Gbit connection to Midas because the final system has been installed only recently (April).
We are using EQ_USER flag to push events into mfe.c buffers with up to 4 threads, but I was observing that above ~1.5 Gbps the rb_get_wp() returns almost always DB_TIMEOUT and I'm forced to drop the event.
This conflicts with the measurements reported by Stefan (we were discussing this yesterday), so we are still investigating the possible cause.

It is difficult to report three years of development in a single Elog, I hope I put all the relevant point here.
It looks to me that we opted for very complementary approaches for high throughput ethernet with Midas, and I think there are still a lot of details that could be worth reporting.
In case someone organises some kind of "virtual workshop" on this, I'm willing to participate.
Best,

Marco


> In MEG II we also kind of achieved this rate. Marco F. will post an entry soon to describe the details. There is only one thing 
> I want to mention, which is our network switch. Instead of an expensive high-grade switch, we chose a cheap "Chinese" high-grade 
> switch. We have "rack switches", which are collector switch for each rack receiving up to 10 x 1GBit inputs, and outputting 1 x 
> 10 GBit to an "aggregation switch", which collects all 10 GBit lines form rack switches and forwards it with (currently a single 
> ) 10 GBit line. For the rack switch we use a 
> 
> MikroTik CRS354-48G-4S+2Q+RM 54 port
> 
> and for the aggregation switch
> 
> MikroTik CRS326-24S-2Q+RM 26 Port
> 
> both cost in the order of 500 US$. We were astonished that they don't loose UDP packets when all inputs send a packet at the 
> same time, and they have to pipe them to the single output one after the other, but apparently the switch have enough buffers 
> (which is usually NOT written in the data sheets). 
> 
> To avoid UDP packet loss for several events, we do traffic shaping by arming the trigger only when the previous event is 
> completely received by the frontend. This eliminates all flow control and other complicated methods. Marco can tell you the 
> details.
> 
> Another interesting aspect: While we get the data into the frontend, we have problems in getting it through midas. Your 
> bm_send_event_sg() is maybe a good approach which we should try. To benchmark the out-of-the-box midas, I run the dummy frontend 
> attached on my MacBook Pro 2.4 GHz, 4 cores, 16 GB RAM, 1 TB SSD disk. I got
> 
> Event size: 7 MB
> 
> No logging: 900 events/s = 6.7 GBytes/s
> 
> Logging with LZ4 compression: 155 events/s = 1.2 GBytes/s
> 
> Logging without compression: 170 events/s = 1.3 GBytes/s
> 
> So with this simple approach I got already more than 1 GByte of "dummy data" through midas, indicating that the buffer 
> management is not so bad. I did use the plain mfe.c frontend framework, no bm_send_event_sg() (but mfe.c uses rpc_send_event() which is an 
> optimized version of bm_send_event()).
> 
> Best,
> Stefan
  2222   18 Jun 2021 Konstantin OlchanskiInfo1000 Mbytes/sec through midas achieved!
> In MEG II we also kind of achieved this rate.
>
> Instead of an expensive high-grade switch, we chose a cheap "Chinese" high-grade switch.

Right. We built this DAQ system about 3 years ago and the cheep Chineese switches arrived
on the market about 1 year after we purchased the big 96 port juniper switch. Bad timing/good timing.

Actually I have a very nice 24-port 1gige switch ($2000 about 3 years ago), I could have
used 4 of them in parallel, but they were discontinued and replaced with a $5000 switch
(+$3000 for a 10gige uplink. I think I got the last very last one cheap switch).

But not all Chineese switches are equal. We have an Ubiquity 10gige switch, and it does
not have working end-to-end ethernet flow control. (yikes!).

BTW, for this project we could not use just any cheap switch, we must have 64 fiber SFP ports
for connecting on-TPC electronics. This narrows the market significantly and it does
not match the industry standard port counts 8-16-24-48-96.

> MikroTik CRS354-48G-4S+2Q+RM 54 port
> MikroTik CRS326-24S-2Q+RM 26 Port

We have a hard time buying this stuff in Vancouver BC, Canada. Most of our regular suppliers
are US based and there is a technology trade war still going on between the US and China.
I guess we could buy direct on alibaba, but for the risk of scammers, scalpers and iffy shipping.

> both cost in the order of 500 US$

tell one how much we overpay for US based stuff. not surprising, with how Cisco & co can afford
to buy sports arenas, etc.

> We were astonished that they don't loose UDP packets when all inputs send a packet at the 
> same time, and they have to pipe them to the single output one after the other,
> but apparently the switch have enough buffers.

You probably see ethernet flow control in action. Look at the counters for ethernet pause frames
in your daq boards and in your main computer.

> (which is usually NOT written in the data sheets).

True, when I looked into this, I found a paper by somebody in Berkley for special
technique to measure the size of such buffers.

(The big Juniper switch has only 8 Mbytes of buffer. The current wisdom for backbone networks
is to have as little buffering as possible).

> To avoid UDP packet loss for several events, we do traffic shaping by arming the trigger only when the previous event is 
> completely received by the frontend. This eliminates all flow control and other complicated methods. Marco can tell you the 
> details.

We do not do this. (very bad!). When each trigger arrives, all 64+8 DAQ boards send a train of UDP packets
at maximum line speed (64+8 at 1 gige) all funneled into one 10 gige ((64+8)/10 oversubscription).

Before we got ethernet flow control to work properly, we had to throttle all the 1gige links by about 60%
to get any complete events at all. This would not have been acceptable for physics data taking.

> Another interesting aspect: While we get the data into the frontend, we have problems in getting it through midas. Your 
> bm_send_event_sg() is maybe a good approach which we should try. To benchmark the out-of-the-box midas, I run the dummy frontend 
> attached on my MacBook Pro 2.4 GHz, 4 cores, 16 GB RAM, 1 TB SSD disk.

Dummy frontend is not very representative, because limitation is the memory bandwidth
and CPU load, and a real ethernet receiver has quite a bit of both (interrupt processing,
DMA into memory, implicit memcpy() inside the socket read()).

For example, typical memcpy() speeds are between 22 and 10 Gbytes/sec for current
generation CPUs and DRAM. This translates for a total budget of 22 and 10 memcpy()
at 10gige speeds. Subtract from this 1 memcpy() to DMA data from ethernet into memory
and 1 memcpy() to DMA data from memory to storage. Subtract from this 2 implicit
memcpy() for read() in the frontend and write() in mlogger. (the Linux sendfile() syscall
was invented to cut them out). Subtract from this 1 memcpy() for instruction and incidental
data fetch (no interesting program fits into cache). Subtract from this memory bandwidth
for running the rest of linux (systemd, ssh, cron jobs, NFS, etc). Hardly anything
left when all is said and done. (Found it, the alphagdaq memcpy() runs at 14 Gbytes/sec,
so total budget of 14 memcpy() at 10gige speeds).

And the event builder eats up 2 CPU cores to process the UDP packets at 10gige rate,
there is quite a bit of CPU-expensive data unpacking, inspection and processing
going on that cannot be cut out. (alphagdaq has 4 cores, "8 threads").

K.O.

P.S. Waiting for rack-mounted machines with AMD "X" series processors... K.O.
  2223   18 Jun 2021 Konstantin OlchanskiInfo1000 Mbytes/sec through midas achieved!
> ... MEG II ... 34 crates each with 32 DRS4 digitiser chips and a single 1 Gbps readout link through a Xilinx Zynq SoC.
>
> Zynq ... embedded ethernet MAC does not support jumbo frames (always read the fine prints in the manuals!)
> and the embedded Linux ethernet stack seems to struggle when we go beyond 250 Mbps of UDP traffic.

that's an ouch. we use the altera ethernet mac, and jumbo frames are supported, but the firmware data path
was originally written assuming 1500-byte packets and it is too much work to rewrite it for jumbo frames.

we send the data directly from the FPGA fabric to the ethernet, there is an avalon/axi bus multiplexer
to split the ethernet packets to the NIOS slow control CPU. not sure if such scheme is possible
for SoC FPGAs with embedded ARM CPUs.

and yes, a 1 GHz ARM CPU will not do 10gige. You see it yourself, measure your memcpy() speed. Where
typical PC will have dual-channel 128-bit wide memory (and the famous for it's low latency
Intel memory controller), ARM SoC will have at best 64-bit wide memory (some boards are only 32-bit wide!),
with DDR3 (not DDR4) severely under-clocked (i.e. DDR3-900, etc). This is why the new Apple ARM chips
are so interesting - can Apple ARM memory controller beat the Intel x86 memory controller?

> On the receiver side, we have the DAQ server with an Intel E5-2630 v4 CPU

that's the right gear for the job. quad-channel memory with nominal "Max Memory Bandwidth 68.3 GB/s",
10 CPU cores. My benchmark of memcpy() for the much older duad-channel memory i7-4820 with DDR3-1600 DIMMs
is 20 Gbytes/sec. waiting for ARM CPU with similar specs.

> and a 10 Gbit connection to the network using an Intel X710 Network card.
> In the past, we used also a "cheap" 10 Gbit card from Tehuti but the driver performance was so bad that it could not digest more than 5 Gbps of data.

yup, same here. use Intel ethernet exclusively, even for 1gige links.

> A major modification to Konstantin scheme is that we need to calibrate all WFMs online so that a software zero suppression

I implemented hardware zero suppression in the FPGA code. I think 1 GHz ARM CPU does not have the oomph for this.

> rb_get_wp() returns almost always DB_TIMEOUT

replace rb_xxx() with std::deque<std::vector<char>> (protected by a mutex, of course). lots of stuff in the mfe.c frontend
is obsolete in the same way. check out the newer tmfe frontends (tmfe.md, tmfe.h and tmfe examples).

> It is difficult to report three years of development in a single Elog

but quite successful at it. big thanks for your write-up. I think our info is quite useful for the next people.

K.O.
  1152   05 Jan 2016 Tom StuttardSuggestion64 bit bank type
I've seen that a similar question has been asked in 2011 but I'll ask again in 
case there are any updates. Is there any way to write 64-bit data words to MIDAS 
banks (other than breaking them up in to two 32-bit words, such as 2 DWORDs) 
currently? And if not, is there any plan to introduce this feature in the future?

Many thanks,
Tom
  1153   05 Jan 2016 Konstantin OlchanskiSuggestion64 bit bank type
> I've seen that a similar question has been asked in 2011 but I'll ask again in 
> case there are any updates. Is there any way to write 64-bit data words to MIDAS 
> banks (other than breaking them up in to two 32-bit words, such as 2 DWORDs) 
> currently? And if not, is there any plan to introduce this feature in the future?

There is no "breaking them up" as such, you can treat a midas bank as a char* array
and store arbitrary data inside. In this sense, "there is no need" for a special 64-bit bank type.

For endian-ness conversion (if such things still matter, big-endian PPC CPUs still exist), single 64-bit 
word converts the same as two 32-bit words, so here also "there is no need", once can use banks of 
DWORD with equal effect.

The above applies equally to 64-bit integers and 64-bit double-precision IEEE-754 floating point 
numbers.

But specifically for 64-bit values, such as float64, there is a big gotcha.

The MIDAS banks structure goes to great lengths to make sure each data type is correctly aligned,
and gets it exactly wrong for 64-bit quantities - all because the bank header is three 32-bit words.

bankhheader1
bh2
bh3
bankdata1 <--- misaligned
...
bankdataN
bh1
bh2
bh3
banddata1 <--- aligned
... etc

So we could introduce QWORD banks today, but inside the midas file, they will be misaligned defeating 
the only purpose of adding them.

I guess the misalignement could be cured by adding dummy words, dummy banks, dummy bank 
headers, etc.

I figure this problem dates all the way bank where alignement to 16-bits was just getting important. 
Today, in the VME word, I have to align things on 128-bit boundaries (for 2eSST 2x2 DWORD transfers).

So back to your question, what advantage do you see in using a QWORD bank instead of putting the 
same data in a DWORD bank?

K.O.
  Draft   15 Jan 2016 Tom StuttardSuggestion64 bit bank type
> > I've seen that a similar question has been asked in 2011 but I'll ask again in 
> > case there are any updates. Is there any way to write 64-bit data words to MIDAS 
> > banks (other than breaking them up in to two 32-bit words, such as 2 DWORDs) 
> > currently? And if not, is there any plan to introduce this feature in the future?
> 
> There is no "breaking them up" as such, you can treat a midas bank as a char* array
> and store arbitrary data inside. In this sense, "there is no need" for a special 64-bit bank type.
> 
> For endian-ness conversion (if such things still matter, big-endian PPC CPUs still exist), single 64-bit 
> word converts the same as two 32-bit words, so here also "there is no need", once can use banks of 
> DWORD with equal effect.
> 
> The above applies equally to 64-bit integers and 64-bit double-precision IEEE-754 floating point 
> numbers.
> 
> But specifically for 64-bit values, such as float64, there is a big gotcha.
> 
> The MIDAS banks structure goes to great lengths to make sure each data type is correctly aligned,
> and gets it exactly wrong for 64-bit quantities - all because the bank header is three 32-bit words.
> 
> bankhheader1
> bh2
> bh3
> bankdata1 <--- misaligned
> ...
> bankdataN
> bh1
> bh2
> bh3
> banddata1 <--- aligned
> ... etc
> 
> So we could introduce QWORD banks today, but inside the midas file, they will be misaligned defeating 
> the only purpose of adding them.
> 
> I guess the misalignement could be cured by adding dummy words, dummy banks, dummy bank 
> headers, etc.
> 
> I figure this problem dates all the way bank where alignement to 16-bits was just getting important. 
> Today, in the VME word, I have to align things on 128-bit boundaries (for 2eSST 2x2 DWORD transfers).
> 
> So back to your question, what advantage do you see in using a QWORD bank instead of putting the 
> same data in a DWORD bank?
> 
> K.O.
  1155   19 Jan 2016 Tom StuttardSuggestion64 bit bank type
> > I've seen that a similar question has been asked in 2011 but I'll ask again in 
> > case there are any updates. Is there any way to write 64-bit data words to MIDAS 
> > banks (other than breaking them up in to two 32-bit words, such as 2 DWORDs) 
> > currently? And if not, is there any plan to introduce this feature in the future?
> 
> There is no "breaking them up" as such, you can treat a midas bank as a char* array
> and store arbitrary data inside. In this sense, "there is no need" for a special 64-bit bank type.
> 
> For endian-ness conversion (if such things still matter, big-endian PPC CPUs still exist), single 64-bit 
> word converts the same as two 32-bit words, so here also "there is no need", once can use banks of 
> DWORD with equal effect.
> 
> The above applies equally to 64-bit integers and 64-bit double-precision IEEE-754 floating point 
> numbers.
> 
> But specifically for 64-bit values, such as float64, there is a big gotcha.
> 
> The MIDAS banks structure goes to great lengths to make sure each data type is correctly aligned,
> and gets it exactly wrong for 64-bit quantities - all because the bank header is three 32-bit words.
> 
> bankhheader1
> bh2
> bh3
> bankdata1 <--- misaligned
> ...
> bankdataN
> bh1
> bh2
> bh3
> banddata1 <--- aligned
> ... etc
> 
> So we could introduce QWORD banks today, but inside the midas file, they will be misaligned defeating 
> the only purpose of adding them.
> 
> I guess the misalignement could be cured by adding dummy words, dummy banks, dummy bank 
> headers, etc.
> 
> I figure this problem dates all the way bank where alignement to 16-bits was just getting important. 
> Today, in the VME word, I have to align things on 128-bit boundaries (for 2eSST 2x2 DWORD transfers).
> 
> So back to your question, what advantage do you see in using a QWORD bank instead of putting the 
> same data in a DWORD bank?
> 
> K.O.


Thanks very much for your reply. I have implemented your suggestion of treating the 64-bit array as a 32-bit 
array for the bank write/read and this solution is working for me.

Thanks again for your help.
  778   25 Aug 2011 Francesco PrelzForum64-bit integer support in MIDAS
Hi,

I've been doing some preliminary work to use at least the MIDAS
SQL history component for a new CERN experiment (Aegis). I wonder
whether there is any plan to support 64-bit signed/unsigned integer data types
in MIDAS. time_t on 64-bit architectures is actually signed 64-bit
(the 'easy' way to work around the 2038 crisis), and this may be enough to
cause problems.

Thanks.
Francesco Prelz
INFN Milano
ELOG V3.1.4-2e1708b5