As reported by Stefan, in MEG II we have very similar ethernet throughputs.
In total, we have 34 crates each with 32 DRS4 digitiser chips and a single 1 Gbps readout link through a Xilinx Zynq SoC.
The data arrives in push mode without any external intervention, the only throttling being an optional prescaling on the trigger rate.
We discovered the hard way that 1 Gbps throughput on the Zynq is not trivial at all: the embedded ethernet MAC does not support jumbo frames (always read the fine print in the manuals!) and the embedded Linux ethernet stack seems to struggle once we go beyond 250 Mbps of UDP traffic.
Anyhow, even at the reduced speed, the maximum throughput at the network input is around 8.5 Gbps, which passes through the MikroTik switches mentioned by Stefan.
We had very bad experiences in the past with switches at a similar price point, observing huge packet drops when the instantaneous switching capacity could not cope with the traffic, but so far we are happy with the MikroTik ones.
On the receiver side, we have the DAQ server with an Intel Xeon E5-2630 v4 CPU and a 10 Gbit connection to the network through an Intel X710 network card.
In the past we also used a "cheap" 10 Gbit card from Tehuti, but the driver performance was so bad that it could not digest more than 5 Gbps of data.
The current frontend is based on the mfe.c scheme for historical reasons (the very first version dates back to 2015).
We opted for a monolithic multithreaded solution so that we can reuse the underlying DAQ code for other experiments which may not have the complete Midas backend.
Just to mention them: one is the FOOT experiment (which, as far as I know, uses an adapted version of the ATLAS DAQ) and the other is the LOLX experiment (for which we will soon ship a small 32-channel system using Midas to Canada).
A major modification with respect to Konstantin's scheme is that we need to calibrate all WFMs online so that a software zero suppression can be applied to reduce the final data size (that part is still to be implemented).
This requirement results in additional resource usage to parse the UDP content into floats and calibrate them.
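Just to give an idea of the per-sample work in that stage, the calibration plus the planned zero suppression boils down to something like the following sketch (structure and names are purely illustrative, not our actual code):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstddef>

// Illustrative per-channel calibration constants (not the real MEG II ones).
struct ChannelCal {
   float offset[1024];   // per-cell baseline offset (the DRS4 has 1024 sampling cells)
   float gain;           // ADC counts -> volts
};

// Convert raw ADC samples to calibrated floats and decide whether the waveform
// survives a simple threshold-based zero suppression; returns true if it is kept.
static bool wfm_calibrate(const uint16_t *raw, size_t n, unsigned first_cell,
                          const ChannelCal *cal, float *out, float zs_threshold)
{
   float peak = 0.f;
   for (size_t i = 0; i < n; i++) {
      unsigned cell = (first_cell + i) % 1024;   // the DRS4 samples into a ring of cells
      float v = (static_cast<float>(raw[i]) - cal->offset[cell]) * cal->gain;
      out[i] = v;
      peak = std::max(peak, std::fabs(v));
   }
   return peak > zs_threshold;
}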
Currently we have 7 packet-collector threads to digest the full packet flow (using recvmmsg), followed by an event-building stage with 4 threads and 3 more threads for the WFM calibration.
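For reference, each collector thread is basically a recvmmsg() loop like the sketch below; the batch size, packet size and the hand-off to the event builder are placeholders:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // for recvmmsg()
#endif
#include <sys/socket.h>
#include <cstdio>
#include <cstring>

enum { BATCH = 64, PKT_SIZE = 1500 };   // illustrative values

// One packet-collector thread: drain up to BATCH UDP datagrams per system call.
static void collector_loop(int sock)
{
   char           buf[BATCH][PKT_SIZE];
   struct mmsghdr msgs[BATCH];
   struct iovec   iovs[BATCH];

   memset(msgs, 0, sizeof(msgs));
   for (int i = 0; i < BATCH; i++) {
      iovs[i].iov_base           = buf[i];
      iovs[i].iov_len            = PKT_SIZE;
      msgs[i].msg_hdr.msg_iov    = &iovs[i];
      msgs[i].msg_hdr.msg_iovlen = 1;
   }

   for (;;) {
      int n = recvmmsg(sock, msgs, BATCH, MSG_WAITFORONE, nullptr);
      if (n < 0) {
         perror("recvmmsg");
         break;
      }
      for (int i = 0; i < n; i++) {
         // msgs[i].msg_len bytes are valid in buf[i]; hand them to the
         // event-building stage (the queue is not shown here)
         // enqueue_packet(buf[i], msgs[i].msg_len);   // placeholder
      }
   }
}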
Each packet generated by the hardware carries a progressive packet number, and a set of flags marks the start and end of an event; by comparing the packet-number difference between the start and end of an event with the total number of packets received for that event, it is really easy to tell whether packet drops are happening.
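In other words, the bookkeeping is nothing more than this (field names are illustrative):

#include <cstdint>

// Per-event counters filled by the collector threads (illustrative names).
struct EventCounters {
   uint32_t seq_first;   // packet number of the packet flagged "start of event"
   uint32_t seq_last;    // packet number of the packet flagged "end of event"
   uint32_t received;    // packets actually received for this event
};

// Number of packets that never arrived for this event (0 means no drops).
static uint32_t dropped_packets(const EventCounters &ev)
{
   uint32_t expected = ev.seq_last - ev.seq_first + 1;   // wrap-around safe for unsigned counters
   return expected - ev.received;
}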
All the thread infrastructure was tested and we could digest the complete throughput; we still have to finalise the full 10 Gbit connection to Midas, because the final system was installed only recently (April).
We are using the EQ_USER flag to push events into the mfe.c buffers with up to 4 threads, but I observed that above ~1.5 Gbps rb_get_wp() almost always returns DB_TIMEOUT and I am forced to drop the event.
This conflicts with the measurements reported by Stefan (we were discussing this yesterday), so we are still investigating the possible cause.
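For completeness, the sender threads follow essentially the standard MIDAS multithreaded-equipment pattern; the sketch below is a simplified version of what one of them does (bank name, event id, sizes and the timeout are illustrative, not our production code):

#include <cstring>
#include "midas.h"
#include "mfe.h"

// Simplified EQ_USER sender thread body: copy one built event into the
// mfe.c ring buffer of readout thread 'thread_index'.
static int push_event(int thread_index, const float *wfm, int n_samples)
{
   EVENT_HEADER *pevent;
   int rbh = get_event_rbh(thread_index);       // ring buffer of this readout thread

   // this is the call that starts returning DB_TIMEOUT above ~1.5 Gbps in our setup
   int status = rb_get_wp(rbh, (void **)&pevent, 100);
   if (status == DB_TIMEOUT)
      return status;                            // currently we drop the event here

   bm_compose_event(pevent, 1, 0, 0, 0);        // event id 1, trigger mask 0, serial 0

   float *pdata;
   bk_init32(pevent + 1);
   bk_create(pevent + 1, "WF00", TID_FLOAT, (void **)&pdata);
   memcpy(pdata, wfm, n_samples * sizeof(float));
   bk_close(pevent + 1, pdata + n_samples);

   pevent->data_size = bk_size(pevent + 1);
   rb_increment_wp(rbh, sizeof(EVENT_HEADER) + pevent->data_size);
   return SUCCESS;
}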
It is difficult to report three years of development in a single Elog entry; I hope I have put all the relevant points here.
It looks to me like we opted for very complementary approaches to high-throughput ethernet with Midas, and I think there are still a lot of details that would be worth reporting.
In case someone organises some kind of "virtual workshop" on this, I'm willing to participate.
Best,
Marco
> In MEG II we also kind of achieved this rate. Marco F. will post an entry soon to describe the details. There is only one thing
> I want to mention, which is our network switch. Instead of an expensive high-grade switch, we chose a cheap "Chinese" high-grade
> switch. We have "rack switches", which are collector switches for each rack, receiving up to 10 x 1 GBit inputs and outputting
> 1 x 10 GBit to an "aggregation switch", which collects all 10 GBit lines from the rack switches and forwards them over
> (currently) a single 10 GBit line. For the rack switch we use a
>
> MikroTik CRS354-48G-4S+2Q+RM 54 port
>
> and for the aggregation switch
>
> MikroTik CRS326-24S-2Q+RM 26 Port
>
> both cost in the order of 500 US$. We were astonished that they don't lose UDP packets when all inputs send a packet at the
> same time and the switch has to pipe them to the single output one after the other; apparently the switches have enough buffers
> (which is usually NOT written in the data sheets).
>
> To avoid UDP packet loss for several events, we do traffic shaping by arming the trigger only when the previous event is
> completely received by the frontend. This eliminates all flow control and other complicated methods. Marco can tell you the
> details.
>
> Another interesting aspect: while we get the data into the frontend, we have problems getting it through midas. Your
> bm_send_event_sg() is maybe a good approach which we should try. To benchmark out-of-the-box midas, I ran the attached dummy
> frontend on my MacBook Pro (2.4 GHz, 4 cores, 16 GB RAM, 1 TB SSD disk). I got
>
> Event size: 7 MB
>
> No logging: 900 events/s = 6.7 GBytes/s
>
> Logging with LZ4 compression: 155 events/s = 1.2 GBytes/s
>
> Logging without compression: 170 events/s = 1.3 GBytes/s
>
> So with this simple approach I already got more than 1 GByte/s of "dummy data" through midas, indicating that the buffer
> management is not so bad. I used the plain mfe.c frontend framework, no bm_send_event_sg() (but mfe.c uses rpc_send_event(),
> which is an optimized version of bm_send_event()).
>
> Best,
> Stefan