ELOG Midas

Midas DAQ System, Page 119 of 159

3173 Entries

ID	Date	Author	Topic	Subject
2363	23 Mar 2022	Konstantin Olchanski	Bug Fix	fix for event buffer corruption in bm_flush_cache()
I confirm, there is no problem in single-threaded programs, and there is no problem if all bm_send_event() and bm_flush_cache() are called from the same thread.
> ... instead of struggling with all your locks.
it is better to have midas fully thread safe. ODB has been so for a long time, the event buffer partially (except for this bug), now fully. without that, the problem still exists, because in many frontends bm_flush_buffer() is called from the main thread and will race against the "bm_send_event() thread". Of course you can do everything on the main thread, but this opens you to RPC timeouts during run transitions (if you sleep in bm_wait_for_free_space()).
also the SYSMSG buffer is subject to the same bug. cm_msg() is of course safe to call from anywhere, but cm_msg_flush_buffer() and cm_periodic_tasks() can be called from any thread, and they issue bm_send_event(SYSMSG), so there will be mysterious crashes and SYSMSG corruptions, probably only during message storms, but still!
K.O.
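To illustrate the kind of race described above, here is a minimal sketch of a write cache that is filled from one thread and flushed from another, with both paths taking the same mutex. The class WriteCache and its methods are hypothetical and are not the MIDAS data structures or the actual fix; the sketch only shows why "add to cache" and "flush cache" need a common lock.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a per-buffer write cache (not MIDAS code).
class WriteCache {
public:
    explicit WriteCache(size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Called from the "send" thread: append one event to the cache.
    // Returns false if the event does not fit; the caller must flush first.
    bool add_event(const std::vector<char>& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (cache_.size() + event.size() > capacity_)
            return false;
        cache_.insert(cache_.end(), event.begin(), event.end());
        return true;
    }

    // Called from any thread: move all cached bytes into the (simulated)
    // shared buffer and empty the cache. Returns the number of bytes moved.
    size_t flush(std::vector<char>& shared_buffer) {
        std::lock_guard<std::mutex> lock(mutex_);
        size_t n = cache_.size();
        shared_buffer.insert(shared_buffer.end(), cache_.begin(), cache_.end());
        cache_.clear();
        return n;
    }

private:
    std::mutex mutex_;         // protects cache_ against concurrent add/flush
    size_t capacity_;
    std::vector<char> cache_;
};
```

Without the lock in flush(), a flush from the main thread could copy the cache while the send thread is halfway through appending an event, which is the shape of the problem discussed in this thread.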
2364	23 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd bug fixed
the mhttpd bug should be fixed now (branch feature/buffer_mutex).
simplest way to reproduce:
wget http://localhost:8080/
quickly ctrl-C it
wget http://localhost:8080/
inside mhttpd (by hook or crook) observe that the second wget got the data meant for the first wget.
if you cannot ctrl-C the first wget quickly enough, put a sleep somewhere in the worker thread (in mongoose_write(), I think).
this is what happens.
1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer).
2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object, but malloc() gives it back the same memory we just freed, and the 1st wget's worker thread nc pointer is no longer stale, but points to 2nd wget's connection.
so we think we are clever and we check the socket file descriptors. but the same thing happens there, too. if 1st wget was file descriptor 7, it is closed (1st wget worker now has a stale file handle), then reopened for the 2nd wget; per POSIX, we get back the same file descriptor 7. 1st wget worker now has the file handle for the 2nd wget tcp socket and the famous test/crash for "sending data to wrong socket" is defeated.
now, worker thread for the 1st wget wants to send a reply, it has a valid nc pointer (points to 2nd wget's mg_connection object) and a valid file descriptor (points to 2nd wget's tcp socket), reply meant for the 1st wget is successfully sent to the 2nd wget, 2nd wget finishes, its socket is closed, mg_connection object is freed.
now the worker thread for the 2nd wget has stale connection info, but this is okay, mongoose does not find a matching connection, 2nd wget worker thread reply goes nowhere, thread finishes silently (no memory leaks here, I checked).
so, connection for 2nd wget completely impersonates the closed connection of 1st wget (I guess I could check the full socket address info, remote ip address, remote port number, etc, but...)
in practice, this bug does not happen often because modern browsers tend to keep tcp sockets open for a very long time (not sure about sundry web proxies, etc).
solution of course is very simple. match worker thread data to mongoose mg_connection objects using our own sequential connection numbers, which are unique and very easy to keep track of through the mongoose event handler.
all this mess runs in the main thread, so no locking trouble here, small blessing.
K.O.
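The idea of matching replies by a connection sequence number instead of by address can be sketched as follows. This is an illustration only: Connection and ConnectionRegistry are hypothetical names, not mongoose or mhttpd types, and the sketch assumes (as stated above) that everything runs in one thread, so it has no locking.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a mongoose mg_connection; it only records
// what was sent to it.
struct Connection {
    std::string sent;
};

// Labels each connection with a serial number taken from a counter.
// Serial numbers are not reused when a connection object's address is
// reused (ignoring wrap-around of the 32-bit counter).
class ConnectionRegistry {
public:
    // Called when a connection is accepted; returns its serial number.
    uint32_t open(Connection* c) {
        uint32_t serial = next_serial_++;
        by_serial_[serial] = c;
        return serial;
    }

    // Called when a connection is closed.
    void close(uint32_t serial) { by_serial_.erase(serial); }

    // Called when a worker has a reply ready for the connection it was
    // started for. Returns false (reply dropped) if that connection is gone.
    bool reply(uint32_t serial, const std::string& data) {
        auto it = by_serial_.find(serial);
        if (it == by_serial_.end())
            return false;
        it->second->sent += data;
        return true;
    }

private:
    uint32_t next_serial_ = 1;
    std::map<uint32_t, Connection*> by_serial_;
};
```

Even if the second connection lands at the same address as the first (as with malloc() reuse), a late reply carrying the first connection's serial number finds no match and goes nowhere.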
2368	24 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd bug fixed
> > 1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object
> > (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)
> >
> > 2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
> > but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
> > nc pointer is no longer stale, but points to 2nd wget's connection.
>
> Why don't we CLEAR the memory (memset(object,0,sizeof(object)) before the free(), this way it cannot be
> mistakenly re-used by the next thread.
>
My description was unclear. I will try better now.
When http replies are generated by worker threads, matching of reply to mg_connection is done by checking the address of the mg_connection object. (mongoose itself unhelpfully offers to send the reply to every mg_connection, see the responder to mg_broadcast() messages).
This works for open/active connections, addresses of all mg_connections are unique.
But if a connection is closed and a new connection is opened, the address is reused (by malloc()/free() reusing memory blocks or by mongoose using a pool of mg_connection objects, does not matter).
So matching http reply to mg_connection using only the address of mg_connection can match the wrong connection.
(contents of the mg_connection object do not matter, only the address is used for matching. so memzero() of the mg_connection object does not help).
I saw this during my testing - wrong data was sent to wrong browser often enough - but did not understand that the above problem was happening. Because I was unable to reliably reproduce the problem, I could not debug it.
I tried to add a check for the tcp socket file descriptor number, in case there is a straight bug or multithread race or simple memory corruption. This replaced "we sent wrong data to wrong browser, poisoned browser cache, confused the user" with a crash. This "fix" seemed effective at the time.
Maybe I should mention browser cache poisoning again. What happened is html pages and rpc replies were returned as responses to requests for things like CSS files. these bad responses are cached by the browser pretty much forever, so all subsequent midas pages will look wrong (bad css!) forever, until the user manually clears the browser cache. reload of page did not help, restart of browser did not help (I think). So a very bad bug.
Unfortunately, the check for file descriptor was not effective because file descriptors are also reused. And I did see wrong data returned by mhttpd, but even more rarely. And everybody (myself included) complained about mhttpd crashes.
Now, matching of responses to connections is done by connection sequential/serial number, which is a unique 32-bit counter. Mismatch of reply to connection should not happen again.
P.S. Latest version of the mongoose web server library does not help with this problem, the example code for matching reply to connection in their multithread example looks bogus: https://github.com/cesanta/mongoose/blob/master/examples/multi-threaded/main.c
K.O.
2369	24 Mar 2022	Konstantin Olchanski	Bug Fix	fix for event buffer corruption in bm_flush_cache()
> Thanks for the investigation. Back in 2020, we had some issues
> of losing data between the system buffer and the logger writing them
> to disk (https://daq00.triumf.ca/elog-midas/Midas/1966). This was polled equipment
> but we had a multithreaded FE running at the same time. Could this be related to the same problem?
I think we will have to follow up on your problem 1966 separately.
I think this bug cannot lose events. Writing events to the write cache has correct locking, no loss here. writing the write cache to shared memory has correct locking, no loss there.
the bug will cause the next event in the event buffer to be overwritten, this will be detected by most programs as shared memory corruption and everybody will quit. (mhttpd, mserver, odbedit will probably survive).
I guess there could be unlucky corruption that looks like nothing was corrupted, but this will affect only a few events right at the shared memory read/write pointer, it so happens that they are the oldest events in the buffer and likely mlogger already wrote them to disk. mlogger read pointer will likely follow the shared memory write pointer closely, well ahead of the shared memory read pointer which always points to the oldest event and where this bug's corruption will happen.
So no, I do not think this bug can cause event loss between frontend and mlogger.
K.O.
2370	24 Mar 2022	Konstantin Olchanski	Bug Report	data missing in runXXXXXX.mid
> > It would be good to pinpoint where the data is lost. This is the sequence:
> >
> > frontend user code -> mfe.c code -> SYSTEM buffer -> mlogger -> disk
> >
> > To see if correct data arrives to the SYSTEM buffer, run:
> > mdump -z SYSTEM
> >
> > To see if mlogger is receiving events from the SYSTEM buffer, run:
> > mlogger -v ### mlogger should report all events, history and data
> >
> > To see if mlogger writes events to disk, examine the disk file (in this case, you already did, data is not there).
> >
> > I would guess that your data does not make it out from the frontend (mdump shows "nothing"),
> > if data were to arrive into the SYSTEM buffer, it would make it to disk, unless
> > mlogger is misconfigured (but you already checked that).
> >
> > If you have trouble with the frontend framework code, you can try to switch from the mfe.c frontend
> > to the newer c++ tmfe frontend (see progs/fetest_tmfe.cxx and progs/fetest_tmfe_thread.cxx).
> >
> > K.O.
>
> Good evening
>
> I tried to reproduce the behavior in a very simple FE but it did not work out.
> The next thing for me would be to take the FE that is producing this behavior,
> replace all the device communication and data with dummies. If the problem is still
> there I would start to simplify as much as possible.
>
> Following the inputs of KO, I pin-pointed the data loss. The system buffer still
> gets the data but the mlogger does not write the data event. Then of course the data
> is also not anymore present in the data file. Therefore, I checked the logger
> settings again, Event ID and Trigger Mask still -1. Nothing else, at least from my point of view,
> that is misconfigured. Nevertheless, if it helps I can send my ODB settings.
>
> When doing the tests just before I found something else that probably
> can give a hint to the problem. The data is only lost if the time between
> two runs is long (a few seconds). As an example: If I run a sequence with a loop
> and after the FE stops the run the loop ends and the next run is started automatically,
> then only the first run has no data, which is the one after a longer time of
> no data taking. When I add a "WAIT Seconds 5" after the run before starting
> the next, no data is written to the disk for any run. I also found this
> once when adding a sleep(1) at the end of the FE readout function
> but back then did not think about it any further.
>
Looks like this problem fell into the covid crack.
As far as I know, MIDAS does not lose any events between bm_send_event() and the shared memory buffer. It does not lose any events in the mlogger (unless the "event request" is misconfigured). (there is lots of opportunity to lose events in complicated frontends).
If you have some evidence otherwise, I would very much like to hear about it and I want to fix all problems that cause it.
In your previous report I was under the impression that you lose random events here and there, but your latest report is about mlogger not writing anything at all. Which case is it?
If you can definitely say that all your events make it to the SYSTEM buffer but mlogger sometimes does not see some of them and sometimes does not see all of them, we should look very closely at bm_receive_event() and mlogger itself.
In the case where mlogger is not seeing any events at all (output file is empty), as this is happening, I would like to see the output of mdump (to confirm events are written to SYSTEM buffer with correct event_id and trigger_mask) and the output of (say) "manalyzer_test.exe --dump run01161.mid.lz4" on your output file. If the output is very long, you can email it to me directly instead of posting it here.
K.O.
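Since the discussion turns on whether the logger's "event request" (event_id and trigger_mask) is misconfigured, here is a simplified sketch of how such a request could be matched against an event. It is an assumption-laden illustration, not the MIDAS implementation: the wildcard value -1 comes from the report above ("Event ID and Trigger Mask still -1"), the bitwise-AND rule for the trigger mask is assumed, and EventHeader and matches_request() are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed wildcard: a request of -1 accepts every event id / trigger mask.
const int16_t kAll = -1;

// Hypothetical, reduced event header with only the two fields of interest.
struct EventHeader {
    int16_t event_id;
    int16_t trigger_mask;
};

// Returns true if an event passes a request filter.
// Assumption: ids must be equal, masks must share at least one bit.
bool matches_request(const EventHeader& ev, int16_t req_id, int16_t req_mask) {
    bool id_ok = (req_id == kAll) || (req_id == ev.event_id);
    bool mask_ok = (req_mask == kAll) || ((req_mask & ev.trigger_mask) != 0);
    return id_ok && mask_ok;
}
```

With both request fields at -1 every event passes, so under these assumptions a request of (-1, -1) cannot by itself explain missing events; comparing the event_id and trigger_mask printed by mdump against the request is still the quickest check.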
2373	24 Mar 2022	Konstantin Olchanski	Bug Report	data missing in runXXXXXX.mid
> One idea: we should have a look at mlogger::close_channels().
> There the SYSTEM buffer is emptied through the cm_yield() call.
> Instrumenting this with some debugging code will enlighten us.
right. a problem there would show up as "last few events are lost at the end of run". but that code in the mlogger was not touched in years, if there is a problem there, we would have seen it by now, most experiments check that the number of events in the data file is the same as the number of triggers generated, both numbers are shown on the midas status page.
> Another possible problem: If the frontend requested to be notified for a run stop AFTER the logger, then the problem might happen: Logger closes file, and THEN the frontend flushes events ending up in the SYSTEM buffer and being logged at the beginning of the next run. The mfe.cxx framework takes care of this by calling
> cm_register_transition(TR_STOP, 500);
default sequence, both mfe.c frontend and c++ tmfe frontend:
start of run:
- mlogger first (configure history, open data file)
- frontends last
- (if any frontend fails, TR_STARTABORT is sent to mlogger to close the output file and "undo" the run start)
end of run:
- frontends first (must not send any events after processing the TR_STOP RPC call; inside the TR_STOP handler, bm_flush_cache() takes care of the write cache)
- mlogger last
- (if any frontend fails, failure is ignored, run stops regardless)
wrong order will be only if they manually change it, and whatever order they set, you see it on the midas transition page (and mtransition -v and odbedit stop now -v, etc).
K.O.
2374	24 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd bug fixed
> As for the browser cache problem: This Chrome extension is your friend ...
for google chrome, it is easy, open the javascript debugger (left-click "inspect"), the reload button becomes a left-click menu, one left-click option is "clear cache and reload". (there is no button for "clear cookies and reload", re the recent elog cookie problem).
but this does not help me personally any. if midas web pages get confused, I will get confused, too, and I will spend hours debugging mhttpd before thinking "hmm... maybe I should clear the browser cache!"
not sure about firefox, safari, microsoft edge and opera. if I ever need it, I google it.
K.O.
2377	29 Mar 2022	Konstantin Olchanski	Bug Fix	mdump can read lz4 and bz2 files now
I converted mdump file i/o from older mdsupport library to newer midasio library and it can now read .mid, .mid.gz, .mid.lz4 and .mid.bz2 files. Output should be identical to what it printed before, if you see any differences, please report them here or on bitbucket. K.O.
2378	30 Mar 2022	Konstantin Olchanski	Bug Fix	erroneous removal of odb clients, fixed
commit https://bitbucket.org/tmidas/midas/commits/b1fe21445109774be3f059c2124727b414abf835 made on 2022-02-21 fixed a serious bug in ODB.
a multithread race condition against an incorrectly updated shared variable caused removal of random clients from ODB with error message:
My client index %d in ODB is invalid: out of range 0..%d. Maybe this client was removed by a timeout, see midas.log. Cannot continue, aborting...
the race is between db_open_database() in one program (executed when any midas program starts) and db_get_my_client_locked() in all running midas programs.
as long as no midas programs are started (db_open_database() is not executed), this bug does not happen.
if e.g. odbedit is executed very often, e.g. from a script, the probability of hitting this bug becomes quite high.
fixed now.
K.O.
2379	31 Mar 2022	Konstantin Olchanski	Bug Fix	"run stop" trouble in mlogger, fixed
while debugging something else, I ran into a bit of trouble in mlogger. I set the mlogger event limit to 100, and after reaching 100 events, mlogger said "stopping run", but nothing happened, run kept going.
it turns out mlogger tried stopping the run too soon, the run-start transition did not finish yet and the error message about trying to stop a run while another transition is in progress was missing. (fixed - if another transition is in progress, we try again later)
it also turns out that cm_transition() checks if another transition is in progress way too late, all the way in the transition thread, where it cannot return the error to mlogger. (fixed - first thing done in cm_transition() is this check).
while debugging this, I tested the ODB flags "/Logger/Async transitions" and "/Logger/Multithread transitions". It turns out only two transition types still work from inside mlogger - multithread transition and detached transition (via the mtransition helper).
the issue is the dead lock between mlogger and frontend. while mlogger is inside cm_transition(), it is not reading the SYSTEM buffer, while at the same time frontends are writing into it. If the SYSTEM buffer happens to be pretty full, we dead lock - frontends that are waiting for free space in the SYSTEM buffer do not respond to RPCs, mlogger is not reading from the SYSTEM buffer and is stuck trying to issue the "run stop" RPC to the frontend. (this dead lock is not forever, eventually the frontend is killed by RPC timeout, mlogger survives and stops the run).
this is a well known problem and as solution, mlogger has been using the multithreaded transitions for years.
now I removed the ODB "/Logger/Async transitions" and "/Logger/Multithread transitions" flags, instead, there is now a flag "/Logger/Detached transitions" set to FALSE by default. Setting it to TRUE will cause mlogger to fork "mtransition STOP" and "mtransition START" for stopping and starting runs, this is useful in case there is trouble with multithreading in mlogger.
K.O.
2381	04 Apr 2022	Konstantin Olchanski	Suggestion	Maximum ODB size
> Anybody some idea what the maximum ODB size can be?
It turns out the ODB size limit is hardwired in db_open_database() at 100 Mbytes. I now committed an improved error message for this.
I confirm that "odbinit -s 100MB" works and creates ODB with 50 Mbyte data area and 50 Mbyte key area.
> in the age of 64GB RAM being a standard, we should be able to grow bigger ...
I agree, I think we can safely bump the limit from 100 Mbytes to 1 Gbyte, maybe 1.5 or 1.99 Gbytes. Above that we run into 32-bit/31-bit cleanliness problems.
And creating an extra large 1 GB ODB but using only a few megabytes will not waste any RAM, because the .ODB.SHM file is demand-paged and unused parts of ODB will not be mapped into RAM. (It will waste disk space, file .ODB.SHM will be 1 GByte in size).
However, 1 GByte (FPGA based) and 4-8 GByte (Raspberry Pi & co) machines are again becoming popular and relevant for running MIDAS, and they have very slow "disk" subsystems, with NAND, SD and USB flash, so we should not go crazy here.
> odbinit -s 1024MB --cleanup
there is a bug in odbinit, if the initial odbinit fails, ODB with default size is created, and the original rejected ODB size is written to .ODB_SIZE.TXT (an inconsistency). bitbucket bug 328
> [ how do I resize ODB ??? ]
we need odbresize. bitbucket bug 329.
K.O.
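The "32-bit/31-bit cleanliness" remark can be made concrete with a small sketch: a size is safe for code that stores sizes and offsets in signed 32-bit integers only while it stays at or below 2^31 - 1 bytes. The helper names below are hypothetical, the sketch assumes "MB" means 2^20 bytes, and the even split mirrors the observed "50 Mbyte data area and 50 Mbyte key area" without claiming this is how odbinit computes it.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: one "MB" here is 2^20 bytes.
constexpr uint64_t kMiB = 1024ULL * 1024ULL;

// True if a byte count can be held in a signed 32-bit integer.
bool fits_in_int32(uint64_t size_bytes) {
    return size_bytes <=
           static_cast<uint64_t>(std::numeric_limits<int32_t>::max());
}

// Hypothetical illustration of splitting the total size into two areas.
struct OdbAreas {
    uint64_t data_bytes;
    uint64_t key_bytes;
};

OdbAreas split_evenly(uint64_t total_bytes) {
    return {total_bytes / 2, total_bytes - total_bytes / 2};
}
```

Under these assumptions 100 MB and 1024 MB both fit, 2047 MB still fits, and 2048 MB is the first size that does not, which is consistent with stopping at "1.99 Gbytes".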
2382	12 Apr 2022	Konstantin Olchanski	Info	ODB JSON support
> > > > odbedit can now save ODB in JSON-formatted files.
> > encode NaN, Inf and -Inf as JSON string values "NaN", "Infinity" and "-Infinity". (Corresponding to the respective Javascript values).
> http://docs.oasis-open.org/odata/odata-json-format/v4.0/os/odata-json-format-v4.0-os.html
> > Values of types [...] Edm.Single, Edm.Double, and Edm.Decimal are represented as JSON numbers,
> except for NaN, INF, and -INF which are represented as strings "NaN", "INF" and "-INF".
> https://xkcd.com/927/
Per xkcd, there is a new json standard "json5". In addition to other things, numeric values NaN, +Infinity and -Infinity are encoded as literals NaN, Infinity and -Infinity (without quotes):
https://spec.json5.org/#numbers
Good discussion of this mess here:
https://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript
K.O.
2384	13 Apr 2022	Konstantin Olchanski	Info	ODB JSON support
> > Per xkcd, there is a new json standard "json5". In addition to other things, numeric
> > values NaN, +Infinity and -Infinity are encoded as literals NaN, Infinity and -Infinity (without quotes):
> > https://spec.json5.org/#numbers
>
> Just for curiosity: Is this implemented by the midas json library now?
MIDAS encodes NaN, Infinity and -Infinity as javascript compatible "NaN", "Infinity" and "-Infinity", this encoding is popular with other projects and allows correct transmission of these values from ODB to javascript. The test code for this is on the MIDAS "Example" page, scroll down to "Test nan and inf encoding".
I think this type of encoding, using strings to encode special values, is more in the spirit of json, compared to other approaches such as adding special literals just for a few special cases leaving other special cases in the cold (ieee-754 specifies several different types of NaN, you can encode them into different nan-strings, but not into the one nan-literal (need more nan-literals, requires change to the standard and change to every json parser)).
As editorial comment, it boggles my mind, what university or kindergarten these people went to who made the biggest number, the smallest number and the imaginary number (sqrt(-1)) all equal to zero (all encoded as literal null).
K.O.
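The string encoding described above can be sketched in a few lines. This is an illustration of the convention only, not the MIDAS json library: encode_json_double() is a hypothetical helper, and the "%.17g" format for ordinary numbers is an assumption chosen so that doubles survive a round trip.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Encode a double as a JSON value. Special values that JSON numbers cannot
// represent become the strings "NaN", "Infinity" and "-Infinity"; everything
// else becomes a plain JSON number.
std::string encode_json_double(double v) {
    if (std::isnan(v))
        return "\"NaN\"";
    if (std::isinf(v))
        return v > 0 ? "\"Infinity\"" : "\"-Infinity\"";
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%.17g", v);
    return buf;
}
```

On the javascript side these strings convert back with Number("NaN"), Number("Infinity") and Number("-Infinity"), which is what makes the encoding "javascript compatible".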
2386	24 Apr 2022	Konstantin Olchanski	Bug Fix	mserver buffer overrun and crash
There is a memory allocation bug in the mserver. ALIGN8() was missing when receiving events from the event socket and the data buffer was allocated 4 bytes too short, but only for some received events and only in a very unlucky sequence of received events.
the result was a rare but obnoxious crash of the fevme frontend in alpha-2 at CERN. (we do not see any crash from this in alpha-g or anywhere else, the best I can tell).
fixed in commit 4dc06ba47ff7caa5251fd8c48d8533f35799f3a6.
If you use the mserver, please update to this commit or apply the following patch in midas.cxx:
- int bufsize = sizeof(INT) + event_size;
+ int bufsize = sizeof(INT) + total_size;
K.O.
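A small sketch shows why the missing ALIGN8() bites "only for some received events". It rests on assumptions not stated in the post: that total_size is event_size rounded up to a multiple of 8, that INT is 4 bytes, and that event sizes are multiples of 4. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round x up to the next multiple of 8.
constexpr size_t align8(size_t x) {
    return (x + 7) & ~static_cast<size_t>(7);
}

// Buffer size as computed before the patch (int32_t stands in for INT).
constexpr size_t bufsize_before(size_t event_size) {
    return sizeof(int32_t) + event_size;
}

// Buffer size after the patch, assuming total_size == align8(event_size).
constexpr size_t bufsize_after(size_t event_size) {
    return sizeof(int32_t) + align8(event_size);
}
```

Under these assumptions an event of 24 bytes gets the same buffer either way, while an event of 20 bytes gets a buffer that is 4 bytes shorter than the aligned data written into it, matching the "4 bytes too short" symptom.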
2388	30 Apr 2022	Konstantin Olchanski	Forum	S3 Object Storage
> We are storing raw MIDAS files to S3 Object Storage, but MIDAS file are not
> optimised for readout from such kind of storage. There is any work around on
> evolution of midas raw output or, beyond simulated posix fs, to develop midas
> python library optimised to stream data from S3 (is not really clear to me if this
> is possible).
We have plans for adding S3 object storage support to lazylogger, but have not gotten around to it yet.
We do not plan to add this in mlogger. mlogger works well for writing data to locally-attached storage (local ext4, XFS, ZFS) but always runs into problems with timeouts and delays when writing to anything network-attached (even writing to NFS).
I envision that each midas raw data file (mid.gz or mid.lz4 or mid.bz2) will be stored as an S3 object and there will be some kind of directory object to map object ids to run and subrun numbers.
Choice of best file size is open, normally we use subruns to limit file size to 1-2 Gbytes. If cloud storage prefers some other object size, we can easily go up to 10 Gbytes and down to "a few megabytes" (ODB dumps will have to be turned off for this).
Other than that, in your view, what else is needed to optimize midas files for storage in the Amazon S3 cloud?
P.S. For reading files from the cloud, code needs to be written and added to midasio/midasio.cxx, for example, see the code that is already there for reading ssh-attached files and dcache/dccp-attached files. (CERN EOS files can be read directly from POSIX mount point /eos).
K.O.
2389	30 Apr 2022	Konstantin Olchanski	Info	added web pages for "show odb clients" and "show open records"
for a long time, midas web pages have been missing the equivalent of odbedit "scl" and "sor" to display current odb clients and current odb open records. this is now added as buttons "show open records" and "show odb clients" in the odb editor page. as in odbedit, "sor" shows open records under the current subtree, i.e. if you are looking at /equipment, you will not see open records for /experiment. to see all open records, go to "/". commit b1ab7e67ecf785744fff092708d8389f222b14a4 K.O.
2390	01 May 2022	Konstantin Olchanski	Info	added web page for "mdump"
added JSON RPC for bm_receive_event() and added a web page for "mdump". the event dump is a hex dump for now. if somebody can contribute a javascript decoder for midas bank format, it would be greatly appreciated. otherwise, I will eventually write my own decoder library patterned on midasio.h and midasio.cxx. as of commit 5882d55d1f5bbbdb0d9238ada639e63ac27d8825 K.O.
2391	01 May 2022	Konstantin Olchanski	Info	added web page for "mdump"
> added JSON RPC for bm_receive_event()
there are a number of problems with implementing bm_receive_event() as an RPC:
1) mhttpd has only 1 event buffer read pointer for all javascript connections, if two browser tabs are running mdump, they will "steal" events from each other.
2) javascript connections are state-less and we cannot specify per-connection event_id and trigger_mask filters to bm_receive_event(). our bm_request_event() has to be for all event_id and all trigger_mask.
3) for the same reason, we cannot have some requests be GET_ALL, some GET_RECENT and some GET_OLD (if GET_OLD is ever implemented).
Problem (1) is hard to fix. The only solution I can see is for mhttpd to have its own event buffer that can somehow track which events have been sent to which javascript connection. The same scheme allows implementing GET_ALL and per-connection event_id and trigger_mask filters.
The difficulty is in detecting javascript connections that are no longer active, so that their event requests and the events we have buffered for them can be deleted. Unlike proper rpc clients, javascript browser tabs can be closed without warning and without opportunity to tell the rpc server that they are closed, gone.
K.O.
2392	01 May 2022	Konstantin Olchanski	Info	added web page for "mdump"
> added a web page for "mdump".
missing functions:
- get a list of existing event buffers (should read event buffer names from /Experiment/Buffer sizes)
- selector box to select event buffer
- button for "get next" and "get new" (should call bm_skip_event() before bm_receive_event())
- entry fields for event_id and trigger_mask event filter
- check box for "keep getting new data" and entry field for update frequency
- (eventually) entry field for bank name filter
K.O.
2396	04 May 2022	Konstantin Olchanski	Bug Fix	mysql history update
the code for writing midas history to mysql has been updated to work against MYSQL 8.0.23 (CERN ALPHA-2):
- as ever, mysql reports inconsistent data types (I create a column with type "integer", mysql reports it has type "int" and so forth), the special kludge to take care of this had to be tweaked.
- this caused some columns to be marked "inactive" and the code to "reactivate" them was missing (fixed)
- binary history event data size was computed incorrectly for events with "inactive" columns (fixed) and caused assert() failure and mlogger crash.
- mysql read of column definitions for history event "system" (as in /history/links/system) bombed because of incorrect quoting (worked before, why? why bombed now?). this caused duplicate columns to be created in mysql table "system" and mlogger bomb-out with complaint about "duplicated columns" (actually the error message was missing, so it was a silent bomb-out). quoting fixed, missing error message fixed, but cleanup of duplicate columns has to be done by hand. (in case of alpha-2 the fix was to remove the unused /history/links/system).
if you are using mysql history please update or patch src/history_schema.cxx.
commit 9d17d2fef233cf457121ca7c2a283c4c76ed33bc
K.O.
