ELOG Midas

Midas DAQ System, Page 108 of 139

ID	Date	Author	Topic	Subject
2314	26 Jan 2022	Konstantin Olchanski	Bug Report	some frontend kicked by cm_periodic_tasks
> The problem is that eventually some of frontend closed with message
> :19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer
> 'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist

This message means what it says. A client was registered with the SYSMSG buffer and this client had pid 9789. At some point some other client (rootana, in this case) checked it and found that process pid 9789 was no longer running (it then proceeded to remove the registration).

There are 2 possibilities:
- simplest: your frontend has crashed. Best to debug this by running it inside gdb and waiting for the crash.
- unlikely: the reported pid is bogus, the real pid of your frontend is different, and the client registration in SYSMSG is corrupted. This would indicate massive corruption of midas shared memory buffers, not impossible if your frontend misbehaves and writes to random memory addresses. ODB has protection against this (normally turned off, easy to enable, set ODB "/experiment/protect odb" to yes), shared memory buffers do not have protection against this (should be added?).

Do this. When you start your frontend, write down its pid. When you see the crash message, confirm that the pid number printed is the same. As an additional test, run your frontend inside gdb; after it crashes, you can print the stack trace, etc.

> in the meantime mserver logging :
> mserver started interactively
> mserver will listen on TCP port 1175
> double free or corruption (!prev)
> double free or corruption (!prev)
> free(): invalid next size (normal)
> double free or corruption (!prev)

Are these "double free" messages coming from the mserver or from your frontend? (i.e. do you run them in different terminals, not all in the same terminal?)

If the messages are coming from the mserver, this confirms possibility (1), except that for frontends connected remotely, the pid is the pid of the mserver, and what we see are crashes of the mserver, not crashes of your frontend. These are much harder to debug.

You will need to:
- enable core dumps (ODB /Experiment/Enable core dumps set to "y"),
- confirm that core dumps work (i.e. "killall -SEGV mserver", observe core files are created in the directory where you started the mserver),
- reproduce the crash,
- run "gdb mserver core.NNNN", run "bt" to print the stack trace,
- post the stack trace here (or email it to me directly).

> I can find some correlation between number of events/event size produced by
> frontend, cause its failed when its become big enough.

There is no practical limit on event size or event rate in midas; you should not see any crash regardless of what you do. (Strictly, event size is limited because an event has to fit inside an event buffer, and event buffer size is limited to 2 GB.)

Obviously you hit a bug in the mserver that makes it crash. Let's debug it.

One thing to try is to set the write cache size to zero and see if your crash goes away. I see some indication of something rotten in the event buffer code if the write cache is enabled. This is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (Beware the recent confusion where odb settings have no effect depending on the value of "equipment_common_overwrite".)

> frontend scheme is like this:

Best if you use the tmfe c++ frontend, event data handling is much simpler and we do not have to debug the convoluted old code in mfe.c.

K.O.

> poll event time set to 0;
>
> poll_event{
> //if buffer not transferred return (continue cutting the main buffer)
> //read main buffer from hardware
> //buffer not transfered
> }
>
> read event{
> // cut the main buffer to subevents (cut one event from main buffer) return;
> //if (last subevent) {buffer transfered ;return}
> }
>
> What is strange to me that 2 frontends (1 per remote pc) causing this.
>
> Also, I'm executing one FEcode with -i # flag , put setting eventid in
> frontend_init , and using SYSTEM buffer for all.
>
> Is there something I'm missing?
> Thanks.
> A.
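Regarding the "process pid 9789 does not exist" message in entry 2314: the sketch below shows one common POSIX way to test whether a pid is still alive. It is only an illustration of the idea. The helper name process_exists() is made up here, and the post does not say how cm_periodic_tasks actually performs its check.

```cpp
#include <cerrno>
#include <csignal>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper, not a MIDAS function.
// Returns true if a process with this pid currently exists.
bool process_exists(pid_t pid)
{
   if (pid <= 0)
      return false; // 0 and negative values address process groups, not one process

   // Signal 0 runs the existence and permission checks without delivering a signal.
   if (kill(pid, 0) == 0)
      return true;

   // EPERM: the process exists but we may not signal it.
   // ESRCH: there is no such process.
   return errno == EPERM;
}
```

To follow the advice above, one could print getpid() at frontend startup and compare it with the pid in the "removed by cm_periodic_tasks" message.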
2315	26 Jan 2022	Konstantin Olchanski	Bug Report	Off-by-one in sequencer documentation
> > 3 LOOP n,4
> > 4 MESSAGE $n,1
> > 5 ENDLOOP
>
> Indeed you're right. The loop variable runs from 1...n. I fixed that in the documentation.

Shades/ghosts of FORTRAN. c/c++/perl/python loops loop from 0 to n-1.

K.O.
2316	26 Jan 2022	Konstantin Olchanski	Info	MityCAMAC Login
For those curious about CAMAC controllers, this one was built around 2014 to replace the aging CAMAC A1/A2 controllers (parallel and serial) in the TRIUMF cyclotron controls system (around 50 CAMAC crates). It implements the main and the auxiliary controller mode (single width and double width modules).

The design predates the Altera Cyclone-5 SoC and has a separate ARM processor (TI 335x) and Cyclone-4 FPGA connected by a GPMC bus. The ARM processor boots a Linux kernel and CentOS-7 userland from an SD card, the FPGA boots from its own EPCS flash.

A user program running on the ARM processor (i.e. a MIDAS frontend) initiates CAMAC operations, the FPGA executes them. Quite simple.

K.O.
2318	26 Jan 2022	Konstantin Olchanski	Forum	Issue in data writing speed
Francesco, when you say "writing an event is slow", do you mean it in the frontend or in the output data file?

Stefan is quite right about the data file: it can take seconds between generating an event in the frontend and seeing it written to the data file. (If compression buffers are too big, an event can sit there forever, until pushed out by the next events or by run stop.)

But maybe you see this on the frontend side. What you are looking at is the "real time" performance of the frontend and of the linux kernel.

The mfe.c frontend has many problems with real time performance, it can stall and take a long time between calls to read_event(), for many reasons. There are ways around that, but it is simpler to switch to the tmfe c++ frontend that was designed for good real time performance.

In the tmfe frontend, if you use the polled equipment and enable the poll thread, your frontend will be limited only by the linux kernel real time performance (i.e. on a single-core CPU, other programs will delay execution of your frontend and you will see it as long delays (usec, millisec) between calls to your read_event()).

The next limit to real time performance (common to mfe.c and tmfe frontends) is the writing of event data to the midas shared event buffer. One has to lock the shared memory semaphore, and this has to wait until other users of the event buffer finish their reading or writing and unlock it. An arbitrary amount of time (usec, millisec, sec) can pass. (There are also problems with "fairness" of the linux semaphores, a different story, again.)

Making things more interesting, midas event buffers implement a write cache (default size 100 kbytes). Events smaller than the cache are quickly accumulated (no need to lock the shared memory semaphore), then flushed to shared memory when the cache is full. This is done to reduce the number of shared memory semaphore locks per event, in the case of a very high rate of very small events.

The solution to all this is to use 2 threads: read the data from hardware in one thread and write the data to midas in a different thread. Between the threads would be an event fifo (circular buffer in mfe.c, std::deque<EVENT> in tmfe c++ frontends).

For remote connected frontends, things are a bit different. Event data is written directly into the TCP socket, and as long as socket buffers are big enough, there are no real-time delays, unless the SYSTEM buffer is very congested and the mserver does not read the TCP socket quickly enough. So depending on event size, data rate and tcp socket buffer size, the extra 2nd thread may not be necessary and poll thread real time performance may be good enough.

I hope this clarifies the situation somewhat.

K.O.

> Dear all,
> I've a frontend writing a quite big bunch of data into a MIDAS bank (16bit output from a 4MP photo camera).
> I'm experiencing a writing speed problem that I don't understand. When the photo camera is triggered at a low rate (< 2 Hz)
> writing into the bank takes a very short time for each event (indeed, what I measure is the time to write and go back
> into the polling function). If I increase the rate to 4 Hz, I see that writing the first two events takes a short time,
> but the third event takes a very long time (hundreds of ms), then again the fourth and fifth events are very fast, and
> the sixth is very slow. If I further increase the rate, every other event is very slow. The problem is not in the readout
> of the camera, because if I just remove the bank writing and keep the camera readout, the problem disappears. Can you
> explain this behavior? Is there any way to improve it?
>
> Below you can also find the code I use to copy the data from the camera buffer into the bank. If you have any suggestion
> to improve it, it would be really appreciated.
>
> Thank you very much,
> Francesco
>
>
> const char* pSrc = (const char*)bufframe.buf;
>
> for(int y = 0; y < bufframe.height; y++ ){
>
>    //Copy one row
>    const unsigned short* pDst = (const unsigned short*)pSrc;
>
>    //go through the row
>    for(int x = 0; x < bufframe.width; x++ ){
>
>       WORD tmpData = *pDst++;
>
>       *pdata++ = tmpData;
>
>    }
>
>    pSrc += bufframe.rowbytes;
>
> }
>
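The 2-thread scheme from the reply in entry 2318 (one thread reads hardware, another writes to midas, with an event fifo in between) could look roughly like the sketch below. This is a generic C++ producer/consumer queue, not code from mfe.c or tmfe; the names Event and EventFifo are invented for illustration, and the calls that would read the camera or send the event to midas are left out.

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical event type and FIFO, for illustration only (not MIDAS or tmfe classes).
using Event = std::vector<uint8_t>;

class EventFifo {
public:
   // Called by the readout thread: only takes a short local lock,
   // never waits for the shared memory event buffer.
   void push(Event e)
   {
      {
         std::lock_guard<std::mutex> lock(fMutex);
         fQueue.push_back(std::move(e));
      }
      fCond.notify_one();
   }

   // Called by the writer thread: blocks until an event is available
   // or the FIFO is closed. Returns false once closed and drained.
   bool pop(Event& out)
   {
      std::unique_lock<std::mutex> lock(fMutex);
      fCond.wait(lock, [this] { return !fQueue.empty() || fClosed; });
      if (fQueue.empty())
         return false;
      out = std::move(fQueue.front());
      fQueue.pop_front();
      return true;
   }

   // Called at end of run: wakes up the writer thread so it can exit.
   void close()
   {
      {
         std::lock_guard<std::mutex> lock(fMutex);
         fClosed = true;
      }
      fCond.notify_all();
   }

private:
   std::mutex fMutex;
   std::condition_variable fCond;
   std::deque<Event> fQueue;
   bool fClosed = false;
};
```

The readout thread would call push() after each camera frame, and the writer thread would loop on pop() and hand each event to midas. The sketch puts no bound on the queue depth; a real frontend would probably want one, so that memory use stays limited if the writer falls behind.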
2320	26 Jan 2022	Konstantin Olchanski	Forum	Issue in data writing speed
> Francesco, when you say "writing an event is slow", do you mean it in the frontend
> or in the output data file?

Another explanation just occurred to me. We do not know your event size and we do not know the size of your SYSTEM buffer. But if you have an unlucky combination, this can happen:

Consider event size is 6 Mbytes, buffer size is 8 Mbytes, enough space for only 1 event. The first event is written quickly (buffer is empty). The second event will be delayed: there is not enough free space in the buffer, and we have to wait for mlogger to finish reading the first event.

The same thing happens if event size is 3 Mbytes: the first 2 events will write quickly, writing the 3rd event will be delayed until mlogger does its thing.

The mlogger normally reads the SYSTEM buffer quickly, but it can be delayed for a number of reasons, i.e. handling a history event, a delay writing to disk, a delay writing to network connected storage, etc.

In general, it is best to size the SYSTEM buffer to hold about 1 second worth of data (of average size, average rate). If your event size is 4 Mbytes, and you record them at 10/sec, the SYSTEM buffer should be at least 40 Mbytes big (this is set in ODB /Experiment/Buffer Sizes). (MIDAS event buffer size is limited to 2 GBytes.)

K.O.
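The sizing arithmetic from entry 2320 can be written down as a small sketch. The helper names are invented here, 1 Mbyte is assumed to be 2^20 bytes, and per-event header overhead inside the buffer is ignored, so real numbers will be slightly less favourable.

```cpp
#include <cstdint>

// Assumption: 1 Mbyte = 2^20 bytes.
constexpr uint64_t kMByte = 1024 * 1024;

// Event buffer size limit quoted in the post (2 GBytes).
constexpr uint64_t kMaxBufferBytes = 2048 * kMByte;

// Hypothetical helper: bytes needed to hold `seconds` worth of data.
constexpr uint64_t buffer_bytes_for(uint64_t event_bytes, uint64_t events_per_sec, uint64_t seconds = 1)
{
   return event_bytes * events_per_sec * seconds;
}

// Hypothetical helper: how many whole events fit before the writer has to wait
// for the reader (ignores buffer and event header overhead).
constexpr uint64_t events_that_fit(uint64_t buffer_bytes, uint64_t event_bytes)
{
   return event_bytes == 0 ? 0 : buffer_bytes / event_bytes;
}
```

With these definitions, 4 Mbyte events at 10/sec give 40 Mbytes, an 8 Mbyte buffer holds one 6 Mbyte event or two 3 Mbyte events, which matches the "every 2nd" and "every 3rd event is slow" patterns described above.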
2321	26 Jan 2022	Konstantin Olchanski	Bug Report	Off-by-one in sequencer documentation
> > Shades/ghosts of FORTRAN. c/c++/perl/python loops loop from 0 to n-1.
>
> for (i=1 ; i<=10 ; i++);
>
> ;-)

Similar code made big news just recently (scroll down to the example main() program):
https://blog.qualys.com/vulnerabilities-threat-research/2022/01/25/pwnkit-local-privilege-escalation-vulnerability-discovered-in-polkits-pkexec-cve-2021-4034

I forget if the FORTRAN rules were "loop once" or "never loop" or if it was different between Fortran-4, fortran-77, DEC extensions and IBM extension, or if it was a compiler switch.

We should check that we do something reasonable with such loops to zero:

LOOP n,0
MESSAGE $n,1
ENDLOOP

P.S. Yup. "man g77" option "-fonetrip".

K.O.
2322	26 Jan 2022	Konstantin Olchanski	Bug Report	Writting MIDAS Events via FPGAs
> > Any error messages printed by the frontend? any error message in midas.log? core dumps? crashes?
> > I do not understand what you mean by "did not get the data into midas". You create events
> > and send them to a midas event buffer and you do not see them there? With mdump?
> > Do you see this both connected locally and connected remotely through the mserver?
>
> I simply don't see the event counter counting up and I also don't see them using mdump. No logs, no dumps and no crashes - everything is quiet. I only tested it locally.

If you are connected locally (no mserver), I want to know the value returned by bm_send_event(). Simplest is to edit mfe.c and, everywhere it calls bm_send_event() and rpc_send_event(), print the returned value. It would be very interesting to see if bm_send_event() returns 1 (SUCCESS), but the event vanishes without a trace.

Before you do that, try something simpler: run "mdump -s -d", it will print some event buffer internals. Watch to see if any data pointers change when you send your events ("wp", "rp", etc).

If nothing changes at all, then we are not sending anything (the fault is in your code or in mfe.c).

If you see "wp" counting up, then we definitely write your events into the buffer and mdump & mlogger should see them.

But there is some funny logic for event_id and trigger_mask and it is worth checking their values. For a good test, set event_id=1 and trigger_mask=0x1. There might be trouble if either is set to zero.

K.O.
2323	26 Jan 2022	Konstantin Olchanski	Bug Report	Unknown Error 319 from client
> I'm trying to run MIDAS using a frontend code/client named 'fetiglab'. Run stops
> after 2/3sec with an error saying 'Unknown error 319 from client 'fetiglab' on
> localhost.

actually run never starts.

> 11:46:32 [fetiglab,ERROR] [odb.cxx:11268:db_get_record,ERROR] struct size
> mismatch for "/" (expected size: 1, size in ODB: 41920)

this is the error that causes run start to fail. for reasons unknown your frontend is trying to do a db_get_record() from "/" (ODB root top directory).

if this is an mfe.c frontend, I do not think I have ever seen it do something like this. so, a puzzle.

K.O.
2324	26 Jan 2022	Konstantin Olchanski	Forum	mhttpd error
> > Enable IPv6 y
>
> Probably the IPv6 problem, see here elog:2269
>
> I asked to turn off IPv6 by default, or at least mention this in the documentation,
> but unfortunately nothing happened.

But the IPv4 and IPv6 code is completely separate; if the IPv6 bind fails, IPv4 should still work. This is all very strange.

It does not help that the OP does not say in which way things do not work: "the server is not accessible from other machines" is not an error message reported by any browser, and we do not know what URL he is using to access mhttpd - http: or https:

Also he is enabling the "insecure" port 8081. I am pretty sure the documentation is clear: either use the secure https port or the insecure port, but not both at the same time.

In any case, I see the current version of mongoose has removed support for password files, so all this stuff will likely be reworked and in the end mhttpd will only listen to localhost ports. To make it "accessible to other machines", one will have to use the apache https proxy (or mtpcproxy from midas).

K.O.
2329	07 Feb 2022	Konstantin Olchanski	Forum	MidasWiki moved from ladd00 to daq00.triumf.ca and updated to MediaWiki 1.35
MidasWiki moved from ladd00 (obsolete SL6) to daq00.triumf.ca (Ubuntu LTS 20.04) and updated from obsolete MediaWiki LTS 1.27.7 to MediaWiki LTS 1.35, supported until mid-2023, see https://www.mediawiki.org/wiki/Version_lifecycle Old URL https://midas.triumf.ca and https://midas.triumf.ca/MidasWiki/... redirect to new URL https://daq00.triumf.ca/MidasWiki/index.php/Main_Page All old links and bookmarks should continue to work (via redirect). To report problems with this MediaWiki instance and to request any changes in configuration or installed extensions, please reply to this message here. K.O.
2331	08 Feb 2022	Konstantin Olchanski	Bug Fix	ODBINC/Sequencer Issue
Please post the output of odbedit "ls -l" for /eq/ar.../variables. (you posted the variable name as an image, and I cannot cut-and-paste the odb path!). BTW data size 4 is correct, 4 bytes for INT32/UINT32/FLOAT. For DOUBLE it should be 8. For you it prints 32 and this is wrong, we need to see the output of "ls -l". K.O.
2333	09 Feb 2022	Konstantin Olchanski	Bug Fix	ODBINC/Sequencer Issue
> [local:mu3eMSci:S]/>cd Equipment/ArduinoTestStation/Variables
> [local:mu3eMSci:S]Variables>ls -l
> Key name     Type    #Val  Size  Last Opn Mode Value
> ---------------------------------------------------------------------------
> _T_          FLOAT   1     4     1h   0   RWD  20.93
> _F_          FLOAT   1     4     1h   0   RWD  12.8
> _P_          FLOAT   1     4     1h   0   RWD  56
> _S_          FLOAT   1     4     1h   0   RWD  5
> _H_          FLOAT   1     4     60h  0   RWD  44.74
> _B_          FLOAT   1     4     60h  0   RWD  18.54
> _A_          FLOAT   1     4     1h   0   RWD  14.41
> _RH_         FLOAT   1     4     1h   0   RWD  41.81
> _AT_         FLOAT   1     4     1h   0   RWD  20.46
> SP           INT16   1     2     1h   0   RWD  10

This looks okay, so we still have no explanation for your error. Please post your sequencer script?

K.O.
2350	03 Mar 2022	Konstantin Olchanski	Info	zlib required, lz4 internal
as of commit 8eb18e4ae9c57a8a802219b90d4dc218eb8fdefb, the gzip compression library is required, not optional. this fixes midas and manalyzer mis-build if the system gzip library is accidentally not installed. (is there any situation where gzip library is not installed on purpose?) midas internal lz4 compression library was renamed to mlz4 to avoid collision against system lz4 library (where present). lz4 files from midasio are now used, lz4 files in midas/include and midas/src are removed. I see that on recent versions of ubuntu we could switch to the system version of the lz4 library. however, on centos-7 systems it is usually not present and it still is a supported and widely used platform, so we stay with the midas-internal library for now. K.O.
2351	03 Mar 2022	Konstantin Olchanski	Info	manalyzer updated
manalyzer was updated to latest version. mostly multi-threading improvements from Joseph and myself. K.O.
2354	15 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd ipv6 bind should be fixed now
Something changed after my initial implementation of ipv6 in mhttpd and listening to ipv6 http/https connections was broken.

It turns out, I do not need to listen to both ipv4 and ipv6 sockets, it is sufficient to listen to just ipv6. ipv4 connections will also magically work. See the linux kernel "bindv6only" sysctl setting: https://sysctl-explorer.net/net/ipv6/bindv6only/

The specific bug in mhttpd was to bind to the ipv4 socket first; the subsequent bind() to the ipv6 socket then fails with error "Address already in use", which is silent, not reported by the mongoose library. For reasons unknown, this does not happen to bind() to "localhost" aka ipv6 "::1".

Apparently other web servers (apache, nginx) are/were also affected by this problem: https://chrisjean.com/fix-nginx-emerg-bind-to-80-failed-98-address-already-in-use/

The first fix was to bind to ipv6 first (success) and to ipv4 second (fails).

The second fix, committed to git, is to only listen to ipv6.

This works both on MacOS and on Linux. Linux reports the listener socket as "tcp6", MacOS reports the listener socket as "tcp46":

4ed0:javascript1 olchansk$ netstat -an | grep 808 | grep LISTEN
tcp46      0      0  *.8081                 *.*                    LISTEN
tcp6       0      0  ::1.8080               *.*                    LISTEN
tcp4       0      0  127.0.0.1.8080         *.*                    LISTEN
4ed0:javascript1 olchansk$

K.O.
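As an illustration of the "listen only on ipv6" approach in entry 2354, here is a minimal POSIX sockets sketch of a dual-stack listener. It is not the mhttpd or mongoose code, and listen_dual_stack() is a made-up name. One difference from the description above: the sketch clears IPV6_V6ONLY explicitly on the socket instead of relying on the system-wide "bindv6only" default.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not mhttpd or mongoose code.
// Opens one IPv6 listener that also accepts IPv4 connections
// (they show up as IPv4-mapped IPv6 addresses).
// Returns the socket file descriptor, or -1 on failure.
int listen_dual_stack(uint16_t port)
{
   int fd = socket(AF_INET6, SOCK_STREAM, 0);
   if (fd < 0)
      return -1;

   int v6only = 0; // 0 = accept both IPv6 and IPv4 on this socket
   if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only)) != 0) {
      close(fd);
      return -1;
   }

   int reuse = 1; // allow quick restart of the server on the same port
   setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));

   sockaddr_in6 addr;
   std::memset(&addr, 0, sizeof(addr));
   addr.sin6_family = AF_INET6;
   addr.sin6_addr = in6addr_any; // "::", all interfaces
   addr.sin6_port = htons(port);

   if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 || listen(fd, 16) != 0) {
      close(fd);
      return -1;
   }
   return fd;
}
```

Binding a separate ipv4 socket to the same port before or after this one is what leads to the "Address already in use" failure described above.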
2359	22 Mar 2022	Konstantin Olchanski	Bug Fix	fix for event buffer corruption in bm_flush_cache()
multithreaded frontends have an unusual event buffer corruption if the write cache is enabled. For a long time now I had to disable the write cache on all multithreaded frontends in alpha-g, I was hitting this bug quite often. (somehow I do not see this problem reported on bitbucket!)

last week I reworked the multithread locking of event buffers, in the hope that this bug would turn up, but nope, all mutexes and locking look okay, except for a number of unrelated problems (races against bm_close_buffer() were the most troublesome to fix).

but I finally found the trouble.

first, some background. because multiprocess locking is expensive, frontends that generate a large number of small events can use the write cache to reduce this overhead. instead of locking the shared memory event buffer for each event, events are accumulated in the write cache, and periodic calls to bm_flush_cache() flush them to shared memory. For best effect, one should increase the size of the write cache until the lock rate is around 10/second.

it turns out the introduction of multithreading broke bm_flush_cache(). it does this:

- int ask_free = pbuf->wp; // how much data we have in the write cache now
- call bm_wait_for_free_space(ask_free); // ensure we have this much free shared memory space
- copy pbuf->wp worth of events to shared memory

looks okay at first sight. this is what happens to trigger the bug:

- int ask_free = pbuf->wp; // ok
- call bm_wait_for_free_space(ask_free); // ok, but if shared memory is full, it will go to sleep waiting for free space
- in the mean time, another thread calls bm_send_event(), this adds more data to the write cache, moves pbuf->wp
- bm_wait_for_free_space() eventually returns
- copy pbuf->wp worth of data to shared memory

KABOOM! shared memory corruption! we just overwrote some unlucky event in shared memory: we only have "ask_free" free bytes available, but pbuf->wp moved and now has more data, and it does not fit, and there is no check against it.

of course in the single threaded world this bug did not exist, there was no other thread to call bm_send_event() while bm_flush_cache() is sleeping.

the obvious fix is to ask for more free space if the cached data does not fit. this is now implemented on the branch feature/buffer_mutex. after a bit more testing I will merge it into develop.

so that's it? not so fast. there was more going on. as described, the bug will only happen when the shared memory event buffer is full (i.e. rarely or never).

It turns out the old version of the thread locking code was defective and permitted a race between bm_send_event() and bm_send_event() in another thread:

thread 1: while (1) { bm_send_event(very small event); }

thread 2:
-> bm_send_event(very big event)
-> no space in the cache for the very big event, call bm_flush_cache()
-> bm_flush_cache() asks bm_wait_for_free_space() to make space for cached data
-> this was done with the write cache mutex released (mistake!)
-> at the same time bm_send_event(very small event) added 1 more small event to the cache
-> back in bm_flush_cache() the write cache mutex is locked correctly, we copy cached data to shared memory and again KABOOM because we now have more data than we asked free space for.

So in the original implementation, corruption was possible even when the shared memory event buffer was pretty much empty.

The reworked locking code closed that loophole - bm_flush_cache() is now called with the write cache locked, and bm_send_event() from another thread cannot confuse things, unless the shared memory buffer is full and we go to sleep inside bm_wait_for_free_space(). And this is now fixed, too.

K.O.
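The idea of the fix in entry 2359 ("ask for more free space if cached data does not fit") can be modelled with a toy sketch. This is not the MIDAS implementation: WriteCache and flush_cache() are invented for illustration, the callback stands in for bm_wait_for_free_space(), and the growth of the cache by "another thread" is simulated inside that callback rather than with real threads and mutexes.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical model of a write cache, not the MIDAS data structures.
struct WriteCache {
   size_t wp = 0; // bytes currently held in the cache
};

// wait_for_free_space(n) stands in for bm_wait_for_free_space(): it returns once
// at least n bytes are free in shared memory. While it sleeps, another thread
// may add data to the cache, so cache.wp can grow.
// Returns the number of bytes copied to shared memory.
size_t flush_cache(WriteCache& cache, const std::function<void(size_t)>& wait_for_free_space)
{
   size_t ask_free = 0;

   // Keep asking until the reserved space covers everything in the cache now.
   // The buggy version asked only once and then copied whatever wp had become.
   while (ask_free < cache.wp) {
      ask_free = cache.wp;
      wait_for_free_space(ask_free);
   }

   size_t copied = cache.wp; // here copied <= ask_free always holds
   cache.wp = 0;             // cache is empty after the flush
   return copied;
}
```

If the callback lets the cache grow from 100 to 150 bytes during the first wait, the loop asks again for 150 bytes before copying, instead of copying 150 bytes into space reserved for 100.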
2362	23 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd ipv6 bind should be fixed now
> Something changed after my initial implementation of ipv6 in mhttpd
> and listening to ipv6 http/https connections was broken.

Reporting that mhttpd ipv6 works at CERN. The hostnames for ipv6 connections come back as alphacpc05.ipv6.cern.ch instead of alphacpc05.cern.ch so both are added to the http "insecure port" whitelist.

K.O.
2363	23 Mar 2022	Konstantin Olchanski	Bug Fix	fix for event buffer corruption in bm_flush_cache()
I confirm, there is no problem in single-threaded programs, and there is no problem if all bm_send_event() and bm_flush_cache() are called from the same thread.

> ... instead of struggling with all your locks.

it is better to have midas fully thread safe. ODB has been so for a long time, event buffers partially (except for this bug), now fully.

without that the problem still exists, because in many frontends, bm_flush_cache() is called from the main thread, and will race against the "bm_send_event() thread". Of course you can do everything on the main thread, but this opens you to RPC timeouts during run transitions (if you sleep in bm_wait_for_free_space()).

also the SYSMSG buffer is subject to the same bug. cm_msg() is of course safe to call from anywhere, but cm_msg_flush_buffer() and cm_periodic_tasks() can be called from any thread, and they issue bm_send_event(SYSMSG), and there will be mysterious crashes and SYSMSG corruptions, probably only during message storms, but still!

K.O.
2364	23 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd bug fixed
the mhttpd bug should be fixed now (branch feature/buffer_mutex).

simplest way to reproduce:

wget http://localhost:8080/
quickly ctrl-C it
wget http://localhost:8080/

inside mhttpd (by hook or crook) observe that the second wget got the data meant for the first wget.

if you cannot ctrl-C the first wget quickly enough, put a sleep somewhere in the worker thread (in mongoose_write(), I think).

this is what happens.

1st wget stops (by ctrl-C), socket is closed, mongoose frees its mg_connection object (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer).

2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object, but malloc() gives it back the same memory we just freed, and the 1st wget's worker thread nc pointer is no longer stale, but points to 2nd wget's connection.

so we think we are clever and we check the socket file descriptors. but the same thing happens there, too. if 1st wget was file descriptor 7, it is closed (1st wget worker now has a stale file handle), then reopened for the 2nd wget; per POSIX, we get back the same file descriptor 7. 1st wget worker now has the file handle for the 2nd wget tcp socket and the famous test/crash for "sending data to wrong socket" is defeated.

now, the worker thread for the 1st wget wants to send a reply, it has a valid nc pointer (points to 2nd wget's mg_connection object) and a valid file descriptor (points to 2nd wget's tcp socket), the reply meant for the 1st wget is successfully sent to the 2nd wget, 2nd wget finishes, its socket is closed, the mg_connection object is freed.

Now the worker thread for the 2nd wget has stale connection info, but this is okay, mongoose does not find a matching connection, the 2nd wget worker thread reply goes nowhere, the thread finishes silently (no memory leaks here, I checked).

so, the connection for the 2nd wget completely impersonates the closed connection of the 1st wget (I guess I could check the full socket address info, remote ip address, remote port number, etc, but...)

in practice, this bug does not happen often because modern browsers tend to keep tcp sockets open for a very long time. (not sure about sundry web proxies, etc).

the solution of course is very simple. match worker thread data to mongoose mg_connection objects using our own connection sequential numbers, which are unique and very easy to keep track of through the mongoose event handler. all this mess runs in the main thread, so no locking trouble here, small blessing.

K.O.
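The "connection sequential number" idea from entry 2364 can be sketched as follows. Connection and ConnectionTable are stand-ins invented for this example, not the mongoose or mhttpd types, and like the real code described above the table is assumed to be used from a single (main) thread, so no locking is shown.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for a connection object, not mg_connection.
struct Connection {
   int fd; // socket file descriptor; the OS may reuse this number after close()
};

class ConnectionTable {
public:
   // Register a new connection and return its unique sequence number.
   uint32_t open(int fd)
   {
      uint32_t seqno = ++fNextSeqno;
      fOpen[seqno] = Connection{fd};
      return seqno;
   }

   // Forget a closed connection. Its sequence number is never handed out again
   // (until the 32-bit counter wraps around).
   void close(uint32_t seqno) { fOpen.erase(seqno); }

   // Find the connection a worker reply belongs to.
   // Returns nullptr if that connection has been closed in the meantime.
   Connection* find(uint32_t seqno)
   {
      auto it = fOpen.find(seqno);
      return it == fOpen.end() ? nullptr : &it->second;
   }

private:
   uint32_t fNextSeqno = 0;
   std::map<uint32_t, Connection> fOpen;
};
```

A worker thread would keep only the sequence number. If the 1st wget's connection is closed and the 2nd wget reuses file descriptor 7 (or the same memory address), find() for the 1st wget's number still returns nullptr, so the stale reply is dropped instead of being sent to the wrong client.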
2368	24 Mar 2022	Konstantin Olchanski	Bug Fix	mhttpd bug fixed
> > 1st wget stops (by ctrl-C), socket is closed, mongoose frees it's mg_connection object
> > (corresponding worker is still labouring, hmm... actually sleeping, and now has a stale nc pointer)
> >
> > 2nd wget starts, new socket is opened, mongoose allocates a new mg_connection object,
> > but malloc() gives it back the same memory we just freed(), and the 1st wget's worker thread
> > nc pointer is no longer stale, but points to 2nd wget's connection.
>
> Why don't we CLEAR the memory (memset(object,0,sizeof(object)) before the free(), this way it cannot be
> mistakenly re-used by the next thread.

My description was unclear. I will try to do better now.

When http replies are generated by worker threads, matching of reply to mg_connection is done by checking the address of the mg_connection object. (mongoose itself unhelpfully offers to send the reply to every mg_connection, see the responder to mg_broadcast() messages).

This works for open/active connections, addresses of all mg_connections are unique.

But if a connection is closed and a new connection is opened, the address is reused (by malloc()/free() reusing memory blocks or by mongoose using a pool of mg_connection objects, does not matter).

So matching an http reply to an mg_connection using only the address of the mg_connection can match the wrong connection. (The contents of the mg_connection object do not matter, only the address is used by the matching, so memzero() of the mg_connection object does not help.)

I saw this during my testing - wrong data was sent to the wrong browser often enough - but did not understand that the above problem was happening. Because I was unable to reliably reproduce the problem, I could not debug it.

I tried to add a check for the tcp socket file descriptor number, in case there is a straight bug or multithread race or simple memory corruption. This replaced "we sent wrong data to wrong browser, poisoned browser cache, confused the user" with a crash. This "fix" seemed effective at the time.

Maybe I should mention browser cache poisoning again. What happened is that html pages and rpc replies were returned as responses to requests for things like CSS files. These bad responses are cached by the browser pretty much forever, so all subsequent midas pages will look wrong (bad css!) forever, until the user manually clears the browser cache. Reload of the page did not help, restart of the browser did not help (I think). So a very bad bug.

Unfortunately, the check for the file descriptor was not effective because file descriptors are also reused. And I did see wrong data returned by mhttpd, but even more rarely. And everybody (myself included) complained about mhttpd crashes.

Now, matching of responses to connections is done by connection sequential/serial number, which is a unique 32-bit counter. Mismatch of reply to connection should not happen again.

P.S. The latest version of the mongoose web server library does not help with this problem, the example code for matching reply to connection in their multithread example looks bogus: https://github.com/cesanta/mongoose/blob/master/examples/multi-threaded/main.c

K.O.
