ELOG Midas


Midas DAQ System, Page 130 of 139


2765 Entries


ID	Date	Author	Topic	Subject
2340	14 Feb 2022	Stefan Ritt	Bug Fix	ODBINC/Sequencer Issue
Just post here a minimal script which produces the error, so that I can try myself.

... and make sure that you have the latest develop version of midas.

Stefan
2341	14 Feb 2022	jago aartsen	Bug Fix	ODBINC/Sequencer Issue
> Just post here a minimal script which produces the error, so that I can try myself.
>
> ... and make sure that you have the latest develop version of midas.
>
> Stefan

Here is the simplest script which produces the error:

WAIT seconds, 3
ODBINC /Equipment/ArduinoTestStation/Variables/_S_

I noticed that "Jacob Thorne" in the forum had the same issue as us in November last year. Indeed, we have not installed any later version of MIDAS since then, so we will double check that we have the latest version.

Jago
2342	14 Feb 2022	Stefan Ritt	Bug Fix	ODBINC/Sequencer Issue
> I noticed that "Jacob Thorne" in the forum had the same issue as us in November last
> year. Indeed we have not installed any later versions of MIDAS since then so we will
> double check we have the latest version.

As you see from my reply to Jacob, the bug has been fixed in midas since then, so just update.

Stefan
2343	14 Feb 2022	jago aartsen	Bug Fix	ODBINC/Sequencer Issue
> > I noticed that "Jacob Thorne" in the forum had the same issue as us in November last
> > year. Indeed we have not installed any later versions of MIDAS since then so we will
> > double check we have the latest version.
>
> As you see from my reply to Jacob, the bug has been fixed in midas since then, so just
> update.
>
> Stefan

We have tried updating using both:

git submodule update --init --recursive

and:

git pull --recurse-submodules

But the error still persists. Is there another way to update which we are missing?

Cheers
Jago
2344	15 Feb 2022	Stefan Ritt	Bug Fix	ODBINC/Sequencer Issue
> But the error still persists. Is there another way to update which we are missing?

The bug was definitively fixed in this modification:

https://bitbucket.org/tmidas/midas/commits/5f33f9f7f21bcaa474455ab72b15abc424bbebf2

You probably forgot to compile/install correctly after your pull. If you start "odbedit" and do a "ver", you see which git revision you are currently running. Make sure to get this output:

MIDAS version: 2.1
GIT revision: Fri Feb 11 08:56:02 2022 +0100 - midas-2020-08-a-509-g585faa96 on branch develop
ODB version: 3

Stefan
Draft	15 Feb 2022	jago aartsen	Bug Fix	ODBINC/Sequencer Issue
> > But the error still persists. Is there another way to update which we are missing?
>
> The bug was definitively fixed in this modification:
>
> https://bitbucket.org/tmidas/midas/commits/5f33f9f7f21bcaa474455ab72b15abc424bbebf2
>
> You probably forgot to compile/install correctly after your pull. If you start "odbedit" and do
> a "ver" you see which git revision you are currently running. Make sure to get this output:
>
> MIDAS version: 2.1
> GIT revision: Fri Feb 11 08:56:02 2022 +0100 - midas-2020-08-a-509-g585faa96 on branch develop
> ODB version: 3
>
> Stefan

Hey Stefan,

We are running the GIT revision midas-2020-08-a-509-g585faa96:

[local:mu3eMSci:S]/>ver
MIDAS version: 2.1
GIT revision: Tue Feb 15 16:31:07 2022 +0000 - midas-2020-08-a-521-ge43ea7c5 on branch develop
ODB version: 3

which is still giving the error unfortunately.
2346	16 Feb 2022	jago aartsen	Bug Fix	ODBINC/Sequencer Issue
> > But the error still persists. Is there another way to update which we are missing?
>
> The bug was definitively fixed in this modification:
>
> https://bitbucket.org/tmidas/midas/commits/5f33f9f7f21bcaa474455ab72b15abc424bbebf2
>
> You probably forgot to compile/install correctly after your pull. If you start "odbedit" and do
> a "ver" you see which git revision you are currently running. Make sure to get this output:
>
> MIDAS version: 2.1
> GIT revision: Fri Feb 11 08:56:02 2022 +0100 - midas-2020-08-a-509-g585faa96 on branch develop
> ODB version: 3
>
> Stefan

We were having some problems compiling but have got it sorted now - thanks for your help :)

Jago
2354	15 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd ipv6 bind should be fixed now
Something changed after my initial implementation of ipv6 in mhttpd and listening to ipv6 http/https connections was broken.

It turns out, I do not need to listen to both ipv4 and ipv6 sockets, it is sufficient to listen to just ipv6. ipv4 connections will also magically work. see linux kernel "bindv6only" sysctl setting: https://sysctl-explorer.net/net/ipv6/bindv6only/

The specific bug in mhttpd was to bind to ipv4 socket first, subsequent bind() to ipv6 socket fails with error "Address already in use", which is silent, not reported by the mongoose library. For reasons unknown, this does not happen to bind() to "localhost" aka ipv6 "::1".

Apparently other web servers (apache, nginx) are/were also affected by this problem. https://chrisjean.com/fix-nginx-emerg-bind-to-80-failed-98-address-already-in-use/

First fix was to bind to ipv6 first (success) and to ipv4 second (fails).

Second fix committed to git is to only listen to ipv6.

This works both on MacOS and on Linux. Linux reports the listener socket is "tcp6", MacOS reports the listener socket as "tcp46":

4ed0:javascript1 olchansk$ netstat -an | grep 808 | grep LISTEN
tcp46      0      0  *.8081                 *.*                    LISTEN
tcp6       0      0  ::1.8080               *.*                    LISTEN
tcp4       0      0  127.0.0.1.8080         *.*                    LISTEN
4ed0:javascript1 olchansk$

K.O.
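For readers who want to see the "listen only on ipv6" idea in isolation, here is a minimal sketch using plain POSIX sockets. It is not the mhttpd or mongoose code. The helper names open_dual_stack_listener() and get_v6only() are made up for this illustration, and the sketch sets IPV6_V6ONLY to 0 explicitly instead of relying on the system-wide "bindv6only" default, which is an assumption on my part about what one would want.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstring>

// Open one AF_INET6 listening socket that also accepts IPv4 connections.
// Returns the file descriptor, or -1 on failure.
// Hypothetical helper, not taken from mhttpd or mongoose.
int open_dual_stack_listener(unsigned short port)
{
   int fd = socket(AF_INET6, SOCK_STREAM, 0);
   if (fd < 0)
      return -1;

   int off = 0; // 0 = also accept IPv4 (seen as IPv4-mapped IPv6 addresses)
   int on = 1;
   if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof(off)) < 0 ||
       setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
      close(fd);
      return -1;
   }

   sockaddr_in6 addr;
   std::memset(&addr, 0, sizeof(addr));
   addr.sin6_family = AF_INET6;
   addr.sin6_addr = in6addr_any; // "::", all interfaces
   addr.sin6_port = htons(port); // port 0 lets the kernel pick a free port

   if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
       listen(fd, 16) < 0) {
      close(fd);
      return -1;
   }
   return fd;
}

// Read back IPV6_V6ONLY: 0 = dual-stack, 1 = IPv6 only, -1 = error.
int get_v6only(int fd)
{
   int value = -1;
   socklen_t len = sizeof(value);
   if (getsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &value, &len) < 0)
      return -1;
   return value;
}
```

With a single socket like this there is no second bind() that could fail with "Address already in use".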
2359	22 Mar 2022	Konstantin Olchanski	Bug Fix	fix for event buffer corruption in bm_flush_cache()
multithreaded frontends have an unusual event buffer corruption if the write cache is enabled. For a long time now I had to disable the write cache on all multithreaded frontends in alpha-g, I was hitting this bug quite often. (somehow I do not see this problem reported on bitbucket!)

last week I reworked the multithread locking of event buffers, in hope that this bug will turn up, but nope, all mutexes and locking look okay, except for a number of unrelated problems (races against bm_close_buffer() were the most troublesome to fix).

but finally found the trouble.

first, some background.

because multiprocess locking is expensive, frontends that generate a large number of small events can use the write cache to reduce this overhead. instead of locking the shared memory event buffer for each event, events are accumulated in the write cache, and periodic calls to bm_flush_buffer() flush them to shared memory. For best effect, one should increase the size of the write cache until lock rate is around 10/second.

it turns out introduction of multithreading broke bm_flush_cache(). it does this:

- int ask_free = pbuf->wp; // how much data we have in the write cache now
- call bm_wait_for_free_space(ask_free); // ensure we have this much free shared memory space
- copy pbuf->wp worth of events to shared memory

looks okay at first sight. this is what happens to trigger the bug:

- int ask_free = pbuf->wp; // ok
- call bm_wait_for_free_space(ask_free); // ok, but if shared memory is full, it will go to sleep waiting for free space
- in the mean time, another thread calls bm_send_event(), this adds more data to the write cache, moves pbuf->wp
- bm_wait_for_free_space() eventually returns
- copy pbuf->wp worth of data to shared memory

KABOOM! shared memory corruption! we just overwrote some unlucky event in shared memory: we only have "ask_free" free bytes available, but pbuf->wp moved and now has more data, and it does not fit, and there is no check against it.

of course in the single threaded world this bug did not exist, there was no other thread to call bm_send_event() while bm_flush_cache() is sleeping.

the obvious fix is to ask for more free space if cached data does not fit. this is now implemented on the branch feature/buffer_mutex. after a bit more testing I will merge it into develop.

so that's it? not so fast. there was more going on.

as described, the bug will only happen when shared memory event buffer is full. (i.e. rarely or never).

It turns out the old version of thread locking code was defective and permitted a race between bm_send_event() and bm_send_event() in another thread:

thread 1: while (1) { bm_send_event(very small event); }
thread 2:
-> bm_send_event(very big event)
-> no space in the cache for the very big event, call bm_flush_cache()
-> bm_flush_cache() asks bm_wait_for_free_space() to make space for cached data
-> this was done with write cache mutex released (mistake!)
-> at the same time bm_send_event(very small event) added 1 more small event to the cache
-> back in bm_flush_cache() write cache mutex is locked correctly, we copy cached data to shared memory and again KABOOM because we now have more data than we asked free space for.

So in the original implementation, corruption was possible even when shared memory event buffer was pretty much empty.

The reworked locking code closed that loophole - bm_flush_cache() is now called with write cache locked, and bm_send_event() from another thread cannot confuse things, unless shared memory buffer is full and we go to sleep inside bm_wait_for_free_space(). And this is now fixed, too.

K.O.
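The "ask for more free space if cached data does not fit" idea can be sketched with a toy model. This is not the code on feature/buffer_mutex. WriteCache, FlushResult and flush_cache() are hypothetical names, and the callback stands in for bm_wait_for_free_space(), during which another thread may grow the cache. The model is single-threaded, the callback only simulates the other thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Toy model of a write cache; all names here are hypothetical.
struct WriteCache {
   size_t wp = 0; // bytes currently held in the cache
};

// Result of one flush: bytes of shared memory reserved vs. bytes copied.
struct FlushResult {
   size_t reserved;
   size_t copied;
};

// wait_for_free_space(n) stands in for bm_wait_for_free_space(): it may
// sleep, and while it sleeps another thread may add data and move cache.wp.
FlushResult flush_cache(WriteCache& cache,
                        const std::function<void(size_t)>& wait_for_free_space)
{
   size_t reserved = 0;
   // Ask again whenever the cache grew past what we already reserved.
   while (cache.wp > reserved) {
      size_t ask_free = cache.wp;
      wait_for_free_space(ask_free);
      reserved = ask_free;
   }
   size_t copied = cache.wp; // cannot exceed "reserved" after the loop
   cache.wp = 0;             // cache is empty after the copy
   return FlushResult{reserved, copied};
}
```

The buggy sequence corresponds to running the loop body exactly once and then copying cache.wp bytes regardless of how much was reserved.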
2360	22 Mar 2022	Stefan Ritt	Bug Fix	fix for event buffer corruption in bm_flush_cache()
Thanks Konstantin for your detailed description. I wonder why we never saw this problem at PSI. Here is the reason:

In multi-threaded environments, we never call bm_send_event() directly from all threads (since in the old days nothing was thread safe in midas). Instead, we use a collector thread which gets all events via the rb_xxx functions from the individual readout threads. This is well integrated into the mfe.cxx framework. Look at examples/mtfe/mtfe.cxx. Each thread does (simplified):

   while (true) {
      do {
         status = rb_get_wp(&pevent);
      } while (status == DB_TIMEOUT);

      bm_compose_event_threadsafe(pevent, ..., &serial_number);
      bk_init32(pevent+1);
      ... fill event ...
      bk_close(pevent);

      rb_increment_wp(sizeof(EVENT_HEADER) + pevent->data_size);
   }

The framework now collects all these events in receive_trigger_event() which runs in the main thread:

   for (i=0 ; i<n_thread ; i++) {
      rb_get_rp(i, pevent);
      if (pevent->serial_number == prev_serial+1)
         break;
   }
   prev_serial = pevent->serial_number;
   rpc_send_event(pevent);
   rb_increment_rp(sizeof(EVENT_HEADER) + pevent->data_size);

This code ensures that all events are in the right sequence (before, the serial numbers were mixed up) and that all events are sent only from a single thread, so the write buffer can be used effectively without complicated multi-thread locks.

This solution has worked nicely at PSI for many years, maybe you should give some thought to using it in your tmfe framework in Alpha-g as well instead of struggling with all your locks.

Stefan
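The serial-number ordering done by the collector can be tried out without midas. The following is only a sketch under simplifying assumptions: std::deque stands in for the rb_xxx ring buffers, Event carries nothing but a serial number, and collect_in_order() is a made-up name, not a function from mfe.cxx.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for an event header; only the serial number matters.
struct Event {
   uint32_t serial_number;
};

// One FIFO per readout thread, standing in for the rb_xxx ring buffers.
using ThreadQueues = std::vector<std::deque<Event>>;

// Pop events in strict serial-number order, as a single collector thread
// would. Stops when the next expected serial number is not at the head of
// any queue. Returns the serial numbers in the order they were "sent".
std::vector<uint32_t> collect_in_order(ThreadQueues& queues, uint32_t prev_serial)
{
   std::vector<uint32_t> sent;
   bool found = true;
   while (found) {
      found = false;
      for (auto& q : queues) {
         if (!q.empty() && q.front().serial_number == prev_serial + 1) {
            prev_serial = q.front().serial_number;
            sent.push_back(prev_serial); // the real code would send the event here
            q.pop_front();
            found = true;
            break;
         }
      }
   }
   return sent;
}
```

Because only this one function removes events and passes them on, the sending side needs no locking of its own.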
2361	23 Mar 2022	Ivo Schulthess	Bug Fix	fix for event buffer corruption in bm_flush_cache()
Thanks for the investigation. Back in 2020, we had some issues of losing data between the system buffer and the logger writing them to disk (https://daq00.triumf.ca/elog-midas/Midas/1966). This was polled equipment but we had a multithreaded FE running at the same time. Could this be related to the same problem?

Best,
Ivo
2362	23 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd ipv6 bind should be fixed now
> Something changed after my initial implementation of ipv6 in mhttpd
> and listening to ipv6 http/https connections was broken.

Reporting that mhttpd ipv6 works at CERN. The hostnames for ipv6 connections come back as alphacpc05.ipv6.cern.ch instead of alphacpc05.cern.ch so both are added to the http "insecure port" whitelist.

K.O.
2363	23 Mar 2022	Konstantin Olchanski	Bug Fix	fix for event buffer corruption in bm_flush_cache()
I confirm, there is no problem in single-threaded programs, and there is no problem if all bm_send_event() and bm_flush_cache() are called from the same thread.

> ... instead of struggling with all your locks.

it is better to have midas fully thread safe. ODB has been so for a long time, event buffer partially (except for this bug), now fully.

without that the problem still exists, because in many frontends, bm_flush_buffer() is called from the main thread, and will race against the "bm_send_event() thread". Of course you can do everything on the main thread, but this opens you to RPC timeouts during run transitions (if you sleep in bm_wait_for_free_space()).

also the SYSMSG buffer is subject to the same bug. cm_msg() is of course safe to call from anywhere, but cm_msg_flush_buffer() and cm_periodic_tasks() can be called from any thread, and they issue bm_send_event(SYSMSG), and there will be mysterious crashes and SYSMSG corruptions, probably only during message storms, but still!

K.O.
2364	23 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd bug fixed
the mhttpd bug should be fixed now (branch feature/buffer_mutex).

simplest way to reproduce:

wget http://localhost:8080/
quickly ctrl-C it
wget http://localhost:8080/

inside mhttpd (by hook or crook) observe that the second wget got the data meant for the first wget.

if you cannot ctrl-C the first wget quickly enough, put a sleep somewhere in the worker thread (in mongoose_write(), I think).

this is what happens.

1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)

2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object, but malloc() gives it back the same memory we just freed, and the 1st wget's worker thread nc pointer is no longer stale, but points to 2nd wget's connection.

so we think we are clever and we check the socket file descriptors. but same thing happens there, too. if 1st wget was file descriptor 7, it is closed, (1st wget worker now has a stale file handle), then reopened for the 2nd wget, per POSIX, we get back the same file descriptor 7. 1st wget worker now has the file handle for the 2nd wget tcp socket and the famous test/crash for "sending data to wrong socket" is defeated.

now, worker thread for the 1st wget wants to send a reply, it has a valid nc pointer (points to 2nd wget's mg_connection object) and a valid file descriptor (points to 2nd wget's tcp socket), reply meant for the 1st wget is successfully sent to the 2nd wget, 2nd wget finishes, its socket is closed, mg_connection object is freed. Now the worker thread for the 2nd wget has stale connection info, but this is okay, mongoose does not find a matching connection, 2nd wget worker thread reply goes nowhere, thread finishes silently (no memory leaks here, I checked).

so, connection for 2nd wget completely impersonates the closed connection of 1st wget (I guess I could check the full socket address info, remote ip address, remote port number, etc, but...)

in practice, this bug does not happen often because modern browsers tend to keep tcp sockets open for very long time. (not sure about sundry web proxies, etc).

solution of course is very simple. match worker thread data to mongoose mg_connection objects using our own connection sequential number, which are unique and very easy to keep track of through the mongoose event handler.

all this mess runs in the main thread, so no locking trouble here, small blessing.

K.O.
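To illustrate why a sequence number survives address and file descriptor reuse where a pointer comparison does not, here is a small toy model. It is not the mhttpd code and does not use the mongoose API. Connection and ConnectionTable are hypothetical names, and the table is assumed to be used from one thread only, as in the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy model; names are hypothetical and not taken from mhttpd or mongoose.
struct Connection {
   int fd; // may be reused by the OS after close
};

class ConnectionTable {
public:
   // Register a newly accepted connection and return its unique number.
   uint32_t add(Connection* c)
   {
      uint32_t seqno = ++fNextSeqno;
      fBySeqno[seqno] = c;
      return seqno;
   }

   // Forget a connection when it is closed.
   void remove(uint32_t seqno) { fBySeqno.erase(seqno); }

   // Look up the connection a worker reply belongs to; nullptr if it is gone.
   Connection* find(uint32_t seqno) const
   {
      auto it = fBySeqno.find(seqno);
      return it == fBySeqno.end() ? nullptr : it->second;
   }

private:
   uint32_t fNextSeqno = 0;
   std::unordered_map<uint32_t, Connection*> fBySeqno;
};
```

A worker that remembers only the number it was given at request time gets nullptr back once that connection is closed, even if a later connection lands at the same address with the same file descriptor, so its reply can be dropped instead of being sent to the wrong client.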
2366	24 Mar 2022	Stefan Ritt	Bug Fix	mhttpd bug fixed
> 1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object
> (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)
>
> 2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
> but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
> nc pointer is no longer stale, but points to 2nd wget's connection.

Why don't we CLEAR the memory (memset(object,0,sizeof(object))) before the free()? This way it cannot be mistakenly re-used by the next thread.

Stefan
2367	24 Mar 2022	Stefan Ritt	Bug Fix	fix for event buffer corruption in bm_flush_cache()
> > ... instead of struggling with all your locks.
>
> it is better to have midas fully thread safe. ODB has been so for a long time,
> event buffer partially (except for this bug), now fully.
>
> without that the problem still exists, because in many frontends,
> bm_flush_buffer() is called from the main thread, and will race
> against the "bm_send_event() thread". Of course you can do
> everything on the main thread, but this opens you to RPC timeouts
> during run transitions (if you sleep in bm_wait_for_free_space()).

Just for the record: in the mfe.cxx framework both bm_send_event() and bm_flush_buffer() are called from the main thread, as can be seen in the midas/examples/mtfe/mtfe.cxx example. But I agree that having all buffer operations thread safe is a clear benefit.

Stefan
2368	24 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd bug fixed
> > 1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object
> > (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)
> >
> > 2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
> > but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
> > nc pointer is no longer stale, but points to 2nd wget's connection.
>
> Why don't we CLEAR the memory (memset(object,0,sizeof(object))) before the free(), this way it cannot be
> mistakenly re-used by the next thread.
>

My description was unclear. I will try better now.

When http replies are generated by worker threads, matching of reply to mg_connection is done by checking the address of the mg_connection object. (mongoose itself unhelpfully offers to send the reply to every mg_connection, see the responder to mg_broadcast() messages).

This works for open/active connections, addresses of all mg_connections are unique.

But if connection is closed and a new connection is opened, the address is reused (by malloc()/free() reusing memory blocks or by mongoose using a pool of mg_connection objects, does not matter).

So matching http reply to mg_connection using only address of mg_connection can match the wrong connection.

(contents of mg_connection object does not matter, only address is used by matching. so memzero() of mg_connection object does not help).

I saw this during my testing - wrong data was sent to wrong browser often enough - but did not understand that the above problem is happening. Because I was unable to reliably reproduce the problem, I could not debug it.

I tried to add a check for the tcp socket file descriptor number, in case there is a straight bug or multithread race or simple memory corruption.

This replaced "we sent wrong data to wrong browser, poisoned browser cache, confused the user" with a crash. This "fix" seemed effective at the time.

Maybe I should mention browser cache poisoning again. What happened is html pages and rpc replies were returned as responses to load things like CSS files, these bad responses are cached by the browser pretty much forever, so all subsequent midas pages will look wrong (bad css!) forever, until user manually clears browser cache. reload of page did not help, restart of browser did not help (I think). So a very bad bug.

Unfortunately, the check for file descriptor was not effective because file descriptors are also reused. And I did see wrong data returned by mhttpd, but even more rarely. And everybody (myself included) complained about mhttpd crashes.

Now, matching of responses to connections is done by connection sequential/serial number, which is a unique 32-bit counter. Mismatch of reply to connection should not happen again.

P.S. Latest version of the mongoose web server library does not help with this problem, the example code for matching reply to connection in their multithread example looks bogus: https://github.com/cesanta/mongoose/blob/master/examples/multi-threaded/main.c

K.O.
2369	24 Mar 2022	Konstantin Olchanski	Bug Fix	fix for event buffer corruption in bm_flush_cache()
> Thanks for the investigation. Back in 2020, we had some issues
> of losing data between the system buffer and the logger writing them
> to disk (https://daq00.triumf.ca/elog-midas/Midas/1966). This was polled equipment
> but we had a multithreaded FE running at the same time. Could this be related to the same problem?

I think we will have to follow up on your problem 1966 separately.

I think this bug cannot lose events.

Writing events to the write cache has correct locking, no loss here. writing write cache to shared memory has correct locking, no loss there.

the bug will cause the next event in the event buffer to be overwritten, this will be detected by most programs as shared memory corruption and everybody will quit. (mhttpd, mserver, odbedit will probably survive).

I guess there could be unlucky corruption that looks like nothing was corrupted, but this will affect only a few events right at the shared memory read/write pointer, it so happens that they are the oldest events in the buffer and likely mlogger already wrote them to disk. mlogger read pointer will likely follow the shared memory write pointer closely, well ahead of the shared memory read pointer which always points to the oldest event and where this bug's corruption will happen.

So no, I do not think this bug can cause event loss between frontend and mlogger.

K.O.
2371	24 Mar 2022	Stefan Ritt	Bug Fix	mhttpd bug fixed
I see, now I understand.

As for the browser cache problem: This Chrome extension is your friend:

https://chrome.google.com/webstore/detail/clear-cache/cppjkneekbjaeellbfkmgnhonkkjfpdn?hl=en

I use it every time I change the CSS or a JS file. Having the "Developer Tools" open in Chrome helps as well (cache is then turned off). Firefox has similar extensions.

Stefan
2374	24 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd bug fixed
> As for the browser cache problem: This Chrome extension is your friend ...

for google chrome, it is easy, open the javascript debugger (left-click "inspect"), the reload button becomes a left-click menu, one left-click option is "clear cache and reload". (there is no button for "clear cookies and reload", re recent elog cookie problem).

but this does not help me personally any. if midas web pages get confused, I will get confused, too, and I will spend hours debugging mhttpd before thinking "hmm... maybe I should clear the browser cache!"

not sure about firefox, safari, microsoft edge and opera. if I ever need it, I google it.

K.O.

