12 Jun 2003, Pierre-André Amaudruz, Tape handling
|
- remove ss_tape_get_blockn from lazylogger.c
- add ss_tape_get_blockn to system.c
- add ss_tape_get_blockn prototype into midas.h
- fix buffer size for "dir" in mtape.c
- add block# output for "dir" in mtape if the command is successful.
- handle TID_STRUCT bank type by displaying it as 8-bit in ybos.c (mdump) |
01 May 2020, Joseph McKenna, Forum, Taking MIDAS beyond 64 clients
|
Hi all,
I have been experimenting with a frontend solution for my experiment
(ALPHA). The intention is to replace how we log data from PCs running LabVIEW.
I am at the proof of concept stage. So far I have seen some promising
performance, able to handle 10-100x more data in my test setup (the current
limitation is just network bandwidth; MIDAS is impressively efficient).
==========================================================================
Our experiment has many PCs using LabVIEW which all log to MIDAS; the
experiment has grown such that we need some sort of load balancing in our
frontend.
The concept was to have a 'supervisor frontend' and an array of 'worker
frontend' processes.
-A LabVIEW client would connect to the supervisor, then be referred to a
worker frontend for data logging.
-The supervisor could start a 'worker frontend' process as the demand
required.
To increase accountability within the experiment, I intend to have a 'worker
frontend' per connecting PC. Then any rogue behavior would be clear from the
MIDAS frontpage.
Presently there are around 20-30 of these LabVIEW PCs, but given how the group
is growing, I want to be sure that my data logging solution will be viable
for the next 5-10 years. With the increased use of single board computers, I
chose the target of benchmarking up to 1000 worker frontends... but I quickly
hit the '64 MAX CLIENTS' and '64 RPC CONNECTION' limits. Ok...
branching and updating these limits:
https://bitbucket.org/tmidas/midas/branch/experimental-beyond_64_clients
I have two commits.
1. update the memory layout assertions and use MAX_CLIENTS as a variable
https://bitbucket.org/tmidas/midas/commits/302ce33c77860825730ce48849cb810cf
366df96?at=experimental-beyond_64_clients
2. Change the MAX_CLIENTS and MAX_RPC_CONNECTION
https://bitbucket.org/tmidas/midas/commits/f15642eea16102636b4a15c8411330969
6ce3df1?at=experimental-beyond_64_clients
Unintended side effects:
I broke compatibility with existing ODB files... the database layout has
changed and my old ODB is read back as corrupt. In my test setup I can start
from scratch, but this would be horrible for any existing experiment.
Edit: I noticed the 'make testdiff' pipeline is failing... it also fails locally...
investigating
Early performance results:
In early tests, ~700 PCs logging 10 unique arrays of 10 doubles each into
Equipment variables in the ODB seem to perform well... All transactions
from client PCs finish within a couple of ms or less
==========================================================================
Questions:
Does the community here have strong opinions about increasing the
MAX_CLIENTS and MAX_RPC_CONNECTION limits?
Am I looking at this problem in a naive way?
Potential solutions other than increasing the MAX_CLIENTS limit:
-Make worker threads inside the supervisor (not separate processes). I am
using TMFE, so I can dynamically create equipment. I have not yet taken a
deep dive into how the multithreading is implemented
-One could have a round robin system to load balance between a limited pool
of 'worker frontend' processes. I don't like this solution as I want to be
able to clearly see which client PCs have been set up to log too much data
========================================================================== |
01 May 2020, Stefan Ritt, Forum, Taking MIDAS beyond 64 clients
|
Hi Joseph,
here some thoughts from my side:
- Breaking ODB compatibility in the master/develop midas branch is very bad, since almost all experiments worldwide are affected if they blindly do a pull and want to recompile and rerun. Currently,
even during our Corona crisis, some experiments are still running and being monitored remotely.
- On the other hand, if we have to break compatibility, now is maybe a good time since most accelerators worldwide are off. But before doing so, I would like to get feedback from the main experiments
around the world (MEG, T2K, g-2, DEAP besides ALPHA).
- Having a maximum of 64 clients was originally decided when memory was scarce. In the early days one had just a couple of megabytes of shared memory. Now this is not an issue any more, but I see
another problem. The main status page gives a nice overview of the experiment. This only works because there is a limited number of midas clients and equipments. If we blow up to 1000+, the status
page would be rather long and we would have to scroll up and down forever. In such a scenario one would at least have to redesign the status and program pages. To start your experiment, you would have to
click 1000 times to start each front-end, which is also not very practicable.
- Having 100's or 1000's of front-ends calls rather for a hierarchical design, like the LHC experiments have. That would be a major change of midas and cannot be done quickly. It would also result in
much slower run start/stops.
- If you see limitations with your LabVIEW PCs, have you considered multi-threading in your front-ends? Note that the standard midas slow control system supports multithreaded devices
(DF_MULTITHREAD). In MEG, we use about 800 microcontrollers via the MSCB protocol. They are grouped together and each group is a multithreaded device in the midas slow control lingo, meaning the
group gets its own thread for control and readout in the midas frontend. This way, one group cannot slow down all other groups. There is one front-end for all groups, which can be started/stopped with
a single click, it shows up as just one line in the status page, and still it's pretty fast. Have you considered such a scheme? Your LabVIEW PCs would then not be individual front-ends, but would just make a
network connection to the midas front-end, which then manages all LabVIEW PCs. The midas slow control system allows you to define custom commands (besides the usual read/write commands for slow
control data), so you could maybe integrate all you need into that scheme.
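As a rough illustration of this grouping (not the actual MEG code - the device names,
channel counts, and the wiring to the mscbdev device driver are assumptions), a grouped
multithreaded device in a standard midas slow control frontend might be declared like this:

   /* fragment of a slow control frontend: each group becomes one device with
      its own control/readout thread; this table plugs into a standard EQ_SLOW
      equipment, e.g. via the "multi" class driver */
   DEVICE_DRIVER group_driver[] = {
      {"Group1", mscbdev, 40, NULL, DF_MULTITHREAD}, /* 40 channels, own thread */
      {"Group2", mscbdev, 40, NULL, DF_MULTITHREAD},
      {""}                                           /* table terminator */
   };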
Best,
Stefan |
01 May 2020, Pierre Gorel, Forum, Taking MIDAS beyond 64 clients
|
> - On the other hand, if we have to break compatibility, now is maybe a good time since most accelerators worldwide are off. But before doing so, I would like to get feedback from the main experiments
> around the world (MEG, T2K, g-2, DEAP besides ALPHA).
Hello Stefan,
For what it's worth, DEAP will not be impacted: as we have been taking data around the clock for the last few years, we froze the code running on the computers. We may have some window of opportunity for an upgrade in a few months, but such a move has not been discussed yet.
Best regards,
Pierre |
02 May 2020, Stefan Ritt, Forum, Taking MIDAS beyond 64 clients
|
TRIUMF stayed quiet, probably they have other things to do.
I allowed myself to move the maximum number of clients back to its original value, in order not to break running experiments.
This does not mean that the increase is a bad idea, we just have to be careful not to break running experiments. Let's discuss it
more thoroughly here before we make a decision in that direction.
Best regards,
Stefan |
02 May 2020, Joseph McKenna, Forum, Taking MIDAS beyond 64 clients
|
Thank you very much for the feedback.
I am satisfied with not changing the 64 client limit. I will look at re-writing my frontend to spawn threads rather than
processes. The load of my frontend is low, so I do not anticipate issues with a threaded implementation.
In this threaded scenario, it will be quite some time before ALPHA bumps into the 64 client limit.
If it avoids confusion, I am happy for my experimental branch 'experimental-beyond_64_clients' to be deleted.
Perhaps an item for future discussion would be for the odbinit program to be able to 'upgrade' the ODB and enable some backwards
compatibility.
Thanks again
Joseph |
02 May 2020, Stefan Ritt, Forum, Taking MIDAS beyond 64 clients
|
> Perhaps an item for future discussion would be for the odbinit program to be able to 'upgrade' the ODB and enable some backwards
> compatibility.
We had this discussion already a few times. There is an ODB version number (DATABASE_VERSION 3 in midas.h) which is intended for that. If we break the
binary compatibility, programs should complain "ODB version has changed, please run ...", then odbinit (written by KO) should have a well-defined
procedure to upgrade existing ODBs by re-creating them while keeping all old contents. This should be tested on a few systems.
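A minimal sketch of what that complaint could look like, assuming the DATABASE_VERSION
constant and a version field in the ODB shared memory header from midas.h (the message
text and return code are illustrative):

   /* on opening the ODB, refuse to attach to an incompatible layout */
   if (pheader->version != DATABASE_VERSION) {
      cm_msg(MERROR, "db_open_database",
             "ODB version mismatch (found %d, expected %d), please run odbinit",
             pheader->version, DATABASE_VERSION);
      return DB_INVALID_HANDLE;   /* illustrative status code */
   }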
Stefan |
02 May 2020, Konstantin Olchanski, Forum, Taking MIDAS beyond 64 clients
|
>
> Does the community here have strong opinions about increasing the
> MAX_CLIENTS and MAX_RPC_CONNECTION limits?
> Am I looking at this problem in a naive way?
>
I think MAX_CLIENTS set at 64 is on the low side for today.
And in the past, we did have experiments that did not work without increasing MAX_CLIENTS. I
think T2K/ND280 needed MAX_CLIENTS bumped to about 100 (200?).
If ALPHA needs MAX_CLIENTS bigger than the default 64, nothing stops the experiment
from changing this number in the local copy of MIDAS.
It is not necessary to change it in the central repository for everybody.
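For reference, the local change amounts to editing two defines in midas.h and rebuilding
(the exact values and surrounding code differ between midas revisions):

   /* midas.h -- raise locally for your experiment, then run "make" */
   #define MAX_CLIENTS        64   /* e.g. bump to 128 */
   #define MAX_RPC_CONNECTION 64   /* usually kept in step with MAX_CLIENTS */

Note that this changes the shared memory layout, so the existing .SHM files have to be
removed and the ODB re-created, as discussed earlier in this thread.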
K.O. |
02 May 2020, Konstantin Olchanski, Forum, Taking MIDAS beyond 64 clients
|
> >
> > Does the community here have strong opinions about increasing the
> > MAX_CLIENTS and MAX_RPC_CONNECTION limits?
> > Am I looking at this problem in a naive way?
> >
The issue is binary compatibility.
MIDAS has been binary compatible with itself for a long time, 20 years now, easily.
If we are to give this up, we must gain more than we lose.
On the technical level, bumping MAX_CLIENTS from 64 to 100 gives us nothing. Tomorrow an experiment
will come along asking for 101 clients. Any number you pick is too small for somebody. And MIDAS
already has a solution for this: edit midas.h, hit make, done.
If we are to break binary compatibility, we should go big. Remove these limits completely!
Move the MAX_CLIENTS & co fixed-size arrays out of the headers in the ODB and in the event buffers, and put
them where they can be resized as needed.
That's a binary-compatibility breaking solution I would vote for.
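A rough sketch of that direction (field and type names are illustrative, not the actual
midas layout): record the size chosen at creation time in the header instead of embedding
a fixed-size table in it.

   /* instead of e.g. "CLIENT clients[MAX_CLIENTS];" inside the header: */
   typedef struct {
      INT max_clients;        /* chosen when the ODB/buffer is created */
      INT client_table_pos;   /* client table stored after the header, resizable */
      /* ... */
   } HEADER_V2;               /* hypothetical name */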
K.O. |
02 May 2020, Konstantin Olchanski, Forum, Taking MIDAS beyond 64 clients
|
> > >
> > > Does the community here have strong opinions about increasing the
> > > MAX_CLIENTS and MAX_RPC_CONNECTION limits?
> > > Am I looking at this problem in a naive way?
> > >
The issue is: how to organize an experiment? how many frontends should I have?
There are two extremes:
- collect all data in 1 frontend (and today, with c++ threads and c++ ring buffers, this is trivial - see the sketch after this list)
- instantiate 1 frontend for each data source. (for example, the ALPHA-g detector has 8 ADCs, 64 PWBs plus some
small fish. No, that's wrong: each ADC looks like 48 individual data sources and each PWB looks like 4 data sources,
so this would be 8*48+4*64=640 data sources - could be 640 frontends easily, plus small fish).
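For the first extreme, here is a minimal sketch of the '1 frontend, N data sources'
threading pattern in plain C++ (no midas calls; all names and sizes are illustrative):

   #include <condition_variable>
   #include <deque>
   #include <mutex>
   #include <thread>
   #include <utility>
   #include <vector>

   struct Event { int source_id; std::vector<char> data; };

   /* one mutex-protected ring buffer shared by all reader threads */
   class RingBuffer {
      std::deque<Event> q_;
      std::mutex m_;
      std::condition_variable cv_;
   public:
      void push(Event e) {
         std::lock_guard<std::mutex> lk(m_);
         q_.push_back(std::move(e));
         cv_.notify_one();
      }
      Event pop() {
         std::unique_lock<std::mutex> lk(m_);
         cv_.wait(lk, [this] { return !q_.empty(); });
         Event e = std::move(q_.front());
         q_.pop_front();
         return e;
      }
   };

   int main() {
      RingBuffer rb;
      std::vector<std::thread> readers;
      for (int id = 0; id < 8; id++)   /* one reader thread per data source */
         readers.emplace_back([&rb, id] { rb.push({id, std::vector<char>(64)}); });
      for (int i = 0; i < 8; i++) {
         Event e = rb.pop();           /* single consumer assembles/ships events */
         (void) e;
      }
      for (auto &t : readers) t.join();
      return 0;
   }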
Which way is best? Every experiment is different, but consider simple things:
640 frontends writing into 1 event buffer will probably cause large contention for the event buffer lock. bad.
640 frontends running on a 4 core CPU will probably cause unhappiness in the OS. bad.
starting and stopping 640 frontends requires some scripting, monitoring that they all still run, etc. extra work. bad.
640 frontends on the midas status page? your cell phone web browser will explode. bad.
What I am saying is: arbitrary limits are good for you. They make you think about what is going on before throwing
resources at the problem.
K.O. |
13 Oct 2004, Konstantin Olchanski, Bug Report, TWIST upgrade bombed...
|
The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
crashes during shared memory data buffer accesses. I am looking into it and I
will add information as I figure things out. K.O. |
13 Oct 2004, Pierre-Andre Amaudruz, Bug Report, TWIST upgrade bombed...
|
> The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> crashes during shared memory data buffer accesses. I am looking into it and I
> will add information as I figure things out. K.O.
Since 1.9.5 the EventBuilder has been modified. Please consult the documentation,
where the new mevb scheme is explained.
The mevb has been tested successfully with up to 16 frontends (15 different CPUs).
Data rates at the EventBuilder were measured at about 50MB/s without the
logger and ~30MB/s with the logger. |
13 Oct 2004, Konstantin Olchanski, Bug Report, TWIST upgrade bombed...
|
> The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> crashes during shared memory data buffer accesses. I am looking into it and I
> will add information as I figure things out. K.O.
I traced the buffer memory corruption to a logic error in system.c::ss_shm_open(). If
a .SHM file exists, its size is used as the size of the sysv shared memory
segment, even if the requested shared memory size is bigger; the caller of
ss_shm_open() nevertheless thinks it got all the requested memory. Eventually we try to use
the unallocated memory and crash. Below is the proposed fix; I will commit it
after I retest the upgrade during the next few days.
[olchansk@send src]$ cvs diff -u system.c
Index: system.c
===================================================================
RCS file: /usr/local/cvsroot/midas/src/system.c,v
retrieving revision 1.83
diff -u -r1.83 system.c
--- system.c 4 Oct 2004 07:04:01 -0000 1.83
+++ system.c 14 Oct 2004 05:51:16 -0000
@@ -544,8 +544,14 @@
} else {
/* if file exists, retrieve its size */
file_size = (INT) ss_file_size(file_name);
- if (file_size > 0)
+ if (file_size > 0) {
+ if (file_size < size) {
+         cm_msg(MERROR, "ss_shm_open", "Shared memory segment \'%s\' size %d is smaller than requested size %d. Please remove it and try again", file_name, file_size, size);
+ return SS_NO_MEMORY;
+ }
+
size = file_size;
+ }
}
/* get the shared memory, create if not existing */
K.O. |
13 Oct 2004, Konstantin Olchanski, Bug Report, TWIST upgrade bombed...
|
> > The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> > crashes during shared memory data buffer accesses. I am looking into it and I
> > will add information as I figure things out. K.O.
>
> Since 1.9.5 the EventBuilder has been modified. Please consult the documentation
> where the new mevb scheme is explained.
> Test of the mevb with up to 16 frontends (15 different CPUs) has been tested
> successfully. Data rate at the EventBuilder were measured about 50MB/s without the
> logger and ~30MB/s with the logger.
It turns out that TWIST uses a private mevb.c. We will consider upgrading to the
standard one.
K.O. |
14 Oct 2004, Stefan Ritt, Bug Report, TWIST upgrade bombed...
|
Agree.
Once you have made the modification, please check the following situation: create a fresh
ODB with an increased size ("odbedit -s 2000000" for example). Then check that the
other clients "adopt" this increased size. Note that some experiments need a
bigger ODB, and I don't want to have them recompile all clients; that's why the
code in ss_shm_open() can attach to a *larger* shared memory. However, it should
not matter to the process, since the ODB (or SYSTEM) shared memory size is
stored in the pheader->key_size and pheader->data_size of each participating
process, so they should never write beyond the limits defined in that header.
The size passed to ss_shm_open() is only a "hint" if the shared memory does not exist,
and is not used anywhere later in the code.
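In code terms, the guarantee rests on each client reading the limits back from the shared
memory header itself (a sketch using the field names mentioned above, not the actual odb.c
code; the variable names are illustrative):

   /* after attaching, use the sizes recorded in the header, never the
      size that was passed to ss_shm_open() */
   DATABASE_HEADER *pheader = (DATABASE_HEADER *) shm_adr;
   INT key_size  = pheader->key_size;    /* limits all key operations  */
   INT data_size = pheader->data_size;   /* limits all data operations */
|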
14 Oct 2004, Konstantin Olchanski, Bug Report, TWIST upgrade bombed...
|
> The upgrade of TWIST to the latest midas has bombed- we see mevb and mlogger
> crashes during shared memory data buffer accesses. I am looking into it and I
> will add information as I figure things out. K.O.
On the second try, it looks like we are in business - the first try did not work
because of two mistakes:
1) I did not delete *all* the old .SHM files (.ODB.SHM, .SYSTEM.SHM, .YBUF1.SHM,
.YBUF2.SHM). I deleted ODB.SHM, so odb worked, but I forgot about the data buffers
SYSTEM.SHM & co and ended up with segmentation faults and core dumps in the buffer
management code, caused by a mismatch between the old-midas buffers and the new-midas code.
2) while debugging these core dumps, I made an error in my test code, so even
after I deleted the old data buffers, things still did not work. Talk about
over-debugging a problem...
K.O. |
24 Jun 2009, Konstantin Olchanski, Bug Report, TR_STARTABORT transition, mlogger duplicate event problem
|
> > > We have seen on several daq systems this problem: we start a run and observe that the number of
> > > events written by mlogger to the output file is double the number of events actually collected. Upon
> > > inspection of the output file, we see that every event is written twice. Restarting the run usually fixes
> > > this problem.
> >
> > mlogger.c fixed svn rev 4497. (from tr_start(), call tr_stop() if somehow it was not called already by end-run transition).
>
> There is a new problem: after an unsuccessful run start, the next run start bombs with the error "output file runNNN.mid already exists". One way around this is to
> manually remove the useless data file, another is to bump up the run number. A better solution is to automatically erase the output file created by unsuccessful run
> starts.
Stefan suggested implementing a new transition, TR_STARTABORT, issued if TR_START fails. mlogger can use it to cleanup open files, etc, similar to TR_STOP.
This is now implemented. In mlogger, TR_STARTABORT is similar to TR_STOP, but it deletes open output files and does not save end-of-run information into databases, etc. mfe.c does not handle this transition yet, but I
plan to add it - to fix the observed situations where the run failed to start, but some equipment does not know about it and continues to generate events and send data.
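For other clients that want the same cleanup, opting in would look roughly like this
(modeled on the mlogger usage described above; the handler body and the sequence number
are illustrative):

   INT tr_startabort(INT run_number, char *error)
   {
      /* delete any partially written output file for run_number here */
      return CM_SUCCESS;
   }

   /* during initialization, after connecting to the experiment: */
   cm_register_transition(TR_STARTABORT, tr_startabort, 500);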
svn rev 4514
K.O. |
25 Jun 2009, Stefan Ritt, Bug Report, TR_STARTABORT transition, mlogger duplicate event problem
|
> Stefan suggested implementing a new transition, TR_STARTABORT, issued if TR_START fails. mlogger can use it to cleanup open files, etc, similar to TR_STOP.
>
> This is now implemented. In mlogger, TR_STARTABORT is similar to TR_STOP, but it deletes open output files and does not save end-of-run information into databases, etc. mfe.c does not handle this transition yet, but I
> plan to add it - to fix the observed situations where the run failed to start, but some equipment does not know about it and continues to generate events and send data.
>
> svn rev 4514
> K.O.
There is one problem with TR_STARTABORT: if you combine old and new clients, they will crash, since the old clients don't know anything about TR_STARTABORT. The way to prevent this is to increase the Midas version from
2.0.0 to 2.1.0. Then you will get a warning if you mix clients. Please test this and commit the change if it works. |
25 Jun 2009, Konstantin Olchanski, Bug Report, TR_STARTABORT transition, mlogger duplicate event problem
|
> > Stefan suggested implementing a new transition, TR_STARTABORT, issued if TR_START fails. mlogger can use it to cleanup open files, etc, similar to TR_STOP.
> >
> > This is now implemented. In mlogger, TR_STARTABORT is similar to TR_STOP, but it deletes open output files and does not save end-of-run information into databases, etc. mfe.c does not handle this transition yet, but I
> > plan to add it - to fix the observed situations where the run failed to start, but some equipment does not know about it and continues to generate events and send data.
> >
> > svn rev 4514
> > K.O.
>
> There is one problem with the TR_STARTABORT: If you combine old and new clients they will crash, since the old clients don't know anything about TR_STARTABORT. The way to prevent this is to increase the Midas version from
> 2.0.0 to 2.1.0. Then you will get a warning if you mix clients. Please test this and commit the change if it works.
Are you sure? Only clients that register themselves to receive the TR_STARTABORT transition (via cm_register_transition()) will receive this transition.
As of now, the only client that registers and receives this transition is mlogger.
I also confirm that old clients that know nothing about TR_STARTABORT are *not* sent this transition (this has been tested).
K.O. |
|