ELOG Midas

Midas DAQ System, Page 6 of 137

ID	Date	Author	Topic	Subject
1265	15 Apr 2017	Konstantin Olchanski	Bug Report	stop form odbedit broken
> > when I try to stop a run from odbedit I get a core dump.
> >
> > [ODBEdit1,INFO] Run #31 stopped
> > odbedit: src/system.c:1223: ss_shm_flush: Assertion `size == mmap_size[handle]' failed.
> > Aborted (core dumped)
> >

I am quite puzzled by this situation. We have seen the above error before, tried to track it down, failed. I was always thinking this is some kind of strange size mismatch between odb size and shared memory size and shared memory save file odb.shm size.

Now with your information, it looks like it is memory corruption.

I always thought there is no length limit to odb strings, except for the odb api problem where you have to know the maximum string length for db_get_value() & co, otherwise long strings will be corrupted. Today nobody uses fixed size buffers: either db_get_value() allocates a string of the correct size (replacing buffer overflow errors with memory leak errors), or it returns std::string.

I shall check on the use of MAX_STRING_SIZE at least in odb itself...

The default value 256 seems to be too small for today's use (if you want to store json data, web page fragments, etc).

K.O.

> > midas commit 53af92a5d0...
> >
> > -----
> >
> > I checked what happens if I try to stop a run via the mhttpd web-page: this works! So what is different?
> >
> > -----
> >
> > I placed an issue (# 47) on bitbucket as well.
> >
> > What is the preferred channel to report potential bugs (elog / bitbucket issues)?
>
> I think I found the problem. Some ODB String values which are automatically generated:
>
> CSS File = STRING : [1024] mhttpd.css
> Sqlite dir = STRING : [1024]
> History dir = STRING : [1024]
> Sound = STRING : [1000] alarm.mp3
>
> are exceeding the MAX_STRING_LENGTH 256 (defined in msystem.h)
>
> It looks as if this screws up quite a bit of the system! When deleting .ODB.SHM and afterwards trying to reload the ODB via a dump I previously made with odbedit, the following happens:
>
> 1) I get the error message that some strings are too long (exceeding MAX_STRING_LENGTH). Unfortunately the underlying routine doesn't tell which ODB variables these are.
>
> 2) After this reload, essentially nothing is working anymore. Any client I tried to start just crashed.
>
> Since it seems that the string length of MAX_STRING_LENGTH is very crucial, I would suggest that db_create_record (or whatever routine is dealing with it) checks for STRING variables and ensures that they cannot exceed MAX_STRING_LENGTH.
>
> When I shortened the above variables in my dump to MAX_STRING_LENGTH and regenerated the ODB, the 'stop' problem in odbedit was gone as well.
1270	15 Apr 2017	Konstantin Olchanski	Bug Report	stop form odbedit broken
> when I try to stop a run from odbedit I get a core dump. > [ODBEdit1,INFO] Run #31 stopped odbedit: src/system.c:1223: ss_shm_flush: > Assertion `size == mmap_size[handle]' failed. Aborted (core dumped) > I am puzzled. The crash is at the very end of everything (save odb shared memory to odb.shm), does the run actually stop, or the crash is before the run is fully stopped? (I guess if you want to run more odbedit commands after stopping the run, so you care about not crashing). K.O.
1278	24 Apr 2017	Stefan Ritt	Bug Report	stop form odbedit broken
> CSS File = STRING : [1024] mhttpd.css
> Sqlite dir = STRING : [1024]
> History dir = STRING : [1024]
> Sound = STRING : [1000] alarm.mp3

After a quick discussion with Konstantin, I changed these strings to a length of 256 chars (MAX_STRING_LENGTH). Actually, all the changes I had to make were in code introduced by KO, so I hope I did everything correctly. He should carefully check my changes (actually I would have preferred if he could change his code himself...).

I agree with KO that the preferred format for saving the ODB should be JSON, but there might be experiments which have some old ODB dumps in other formats, so we should not remove the possibility to read those formats back.

Stefan
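The thread asks for a check that names the offending ODB string keys instead of only reporting that "some strings are too long". Below is a minimal standalone sketch of such a check in C. It does not use the MIDAS API: the struct `odb_string_key`, the functions `count_oversized()` and `clamp_size()`, and the constant `SKETCH_MAX_STRING_LENGTH` are hypothetical names for illustration, and the value 256 is simply the limit quoted in this thread.

```c
#include <stdio.h>

/* Assumption: mirrors the 256-byte limit quoted in this thread
   (MAX_STRING_LENGTH in msystem.h); redefined so the sketch is standalone. */
#define SKETCH_MAX_STRING_LENGTH 256

/* Hypothetical description of one ODB string key. */
typedef struct {
    const char *name;     /* key name, e.g. "CSS File" */
    int declared_size;    /* the number in brackets, e.g. [1024] */
} odb_string_key;

/* Print every key whose declared size exceeds the limit
   and return how many such keys were found. */
int count_oversized(const odb_string_key *keys, int n)
{
    int bad = 0;
    for (int i = 0; i < n; i++) {
        if (keys[i].declared_size > SKETCH_MAX_STRING_LENGTH) {
            printf("key \"%s\": size %d exceeds %d\n",
                   keys[i].name, keys[i].declared_size,
                   SKETCH_MAX_STRING_LENGTH);
            bad++;
        }
    }
    return bad;
}

/* Clamp a declared size to the limit, leaving smaller sizes unchanged. */
int clamp_size(int declared_size)
{
    return declared_size > SKETCH_MAX_STRING_LENGTH
               ? SKETCH_MAX_STRING_LENGTH
               : declared_size;
}
```

Fed with the four keys listed above (sizes 1024, 1024, 1024 and 1000), `count_oversized()` would report all four by name, which is the information the original error message was missing.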
Draft	04 Jun 2020	Lukas Gerritzen		stime() deprecated in glibc 2.31
In glibc 2.31, the stime function was deprecated: * The obsolete function stime is no longer available to newly linked binaries, and its declaration has been removed from <time.h>. Programs that set the system time should use clock_settime instead. https://sourceware.org/legacy-ml/libc-announce/2020/msg00001.html This creates a problem in src/system.cxx:3197:4
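As the glibc announcement suggests, a call to stime() can be replaced with clock_settime(). The following is a minimal sketch of such a replacement, not the actual change made in system.cxx; the helper names `timespec_from_seconds()` and `set_system_time()` are hypothetical. Setting the clock requires the appropriate privilege, otherwise the call fails and sets errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Build the timespec that clock_settime() expects from a
   seconds-since-epoch value (sub-second part set to zero). */
struct timespec timespec_from_seconds(time_t seconds)
{
    struct timespec ts;
    ts.tv_sec = seconds;
    ts.tv_nsec = 0;
    return ts;
}

/* Hypothetical stand-in for the removed stime(): sets the realtime
   clock, returns 0 on success and -1 on error with errno set
   (typically EPERM when run without privilege). */
int set_system_time(time_t seconds)
{
    struct timespec ts = timespec_from_seconds(seconds);
    return clock_settime(CLOCK_REALTIME, &ts);
}
```

A caller that previously wrote `stime(&t)` would call `set_system_time(t)` and check the return value in the same way.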
1410	22 Nov 2018	Konstantin Olchanski	Info	status of self-signed https certificates
I just happened to check the current situation with self-signed https certificates as implemented in mhttpd. (As a reminder, the powers-that-be are pushing for universal use of https for all web access. The https implementation in mhttpd at the moment can only generate self-signed certificates, so...)

plain unencrypted http:
- both google chrome and firefox say "connection not secure", but connect without any fuss.
- apple safari does not say anything

https with self-signed certificate:
- google chrome goes through an "are you sure?" page, "red not secure" status in toolbar
- firefox does the same thing, requires adding a security exception, but still shows "not secure" status in toolbar
- apple safari goes through a sequence of "are you sure?" pages, asks for the user password to add the self-signed certificate to the macos key store, then marks the connection as "secure" (good)

So clearly powers-that-be do not want us to use self-signed certificates for https. (And frown on use of unencrypted http even for localhost connections).

Properly signed certificates can be obtained from letsencrypt almost automatically, but of course mhttpd needs to know how to use them and how to handle their automatic renewals.

I plan to update the mongoose web server library inside mhttpd and with luck I will straighten some of this certificate business at the same time.

In the mean time, we continue to recommend that mhttpd should be used behind a password protected https proxy (i.e. apache httpd, etc).

K.O.
1411	30 Nov 2018	Stefan Ritt	Info	status of self-signed https certificates
> In the mean time, we continue to recommend that mhttpd should be used behind a password protected https proxy (i.e. apache
> httpd, etc).

I guess this is what most people do anyhow these days. Do I understand correctly that this then rules out the use of letsencrypt certificates, since the host needs to be accessible from outside, which is not possible if it is running behind a password protected firewall?

Stefan
1412	03 Dec 2018	Konstantin Olchanski	Info	status of self-signed https certificates
> > In the mean time, we continue to recommend that mhttpd should be used behind a password protected https proxy (i.e. apache
> > httpd, etc).
>
> I guess this is what most people do anyhow these days. Do I understand correctly that this then rules out the usage of letsencrypt certificates, since the
> host needs to be accessed from outside, which is not possible if running behind a password protected firewall.
>
> Stefan

Careful, firewall != proxy, very different things.

A firewall prevents network communications, period. (Like fences and locked doors, there are good reasons to have them).

An https proxy is a way to have encrypted (protected) web communications with a machine behind a firewall.

Basically, we have 4 main cases, all with trouble.

1) mhttpd running on localhost, "just for testing", is in trouble. There is no simple way to get a "blessed" certificate, and self-signed certificates are now "almost forbidden". http is "okay for now", but the writing is on the wall. There is no special exception for "local-only" connections.

2a) mhttpd running on an internet-connected machine, with apache httpd, our best case. To get this working one has to configure both apache httpd and the "blessed certificate" certbot tool. With luck, both tools work smoothly on current OSes (they do NOT).

2b) same, but without apache httpd. One still has to run certbot, and the "glue" between mhttpd and certbot is currently missing: we need a way to point mhttpd to the certbot certificate files and a way to reload mhttpd when the certificate is auto-renewed.

3) mhttpd running on a machine behind a corporate firewall. Worst case. If the firewall Gods make an opening for ports 80 and 443, it becomes case (2a/b), otherwise one must use some kind of https proxy. (Plus there is no trivial way to set up an encrypted secure communication channel between mhttpd and this proxy, a double bad).

K.O.

P.S. I guess one can use nginx as the https proxy instead of apache httpd. I did not try yet. My impression is that everybody uses nginx, except for people who started with apache httpd and are too lazy to try nginx.

K.O.
1546	10 Jun 2019	Konstantin Olchanski	Info	status of self-signed https certificates
> > > In the mean time, we continue to recommend that mhttpd should be used behind a password protected https proxy (i.e. apache > > > httpd, etc). There we go. google-chrome 74 refuses to connect to mhttpd configured with a self-signed certificate generated per instructions printed by mhttpd. Here is the full error text (there is no button to "let me connect to it anyway"): Your connection is not private Attackers might be trying to steal your information from musr03.triumf.ca (for example, passwords, messages, or credit cards). Learn more NET::ERR_CERT_AUTHORITY_INVALID Help improve Safe Browsing by sending some system information and page content to Google. Privacy policy musr03.triumf.ca normally uses encryption to protect your information. When Google Chrome tried to connect to musr03.triumf.ca this time, the website sent back unusual and incorrect credentials. This may happen when an attacker is trying to pretend to be musr03.triumf.ca, or a Wi-Fi sign-in screen has interrupted the connection. Your information is still secure because Google Chrome stopped the connection before any data was exchanged. You cannot visit musr03.triumf.ca right now because the website uses HSTS. Network errors and attacks are usually temporary, so this page will probably work later.
1769	13 Jan 2020	Konstantin Olchanski	Info	status of self-signed https certificates
Now firefox returns the same error. version 72.0.1.

> daqlabpc.triumf.ca has a security policy called HTTP Strict Transport Security (HSTS), which means that Firefox can only connect to it securely. You can't add an exception to visit this site.
> Error code: MOZILLA_PKIX_ERROR_SELF_SIGNED_CERT

I think the problem is with HSTS. I enabled HSTS (in mhttpd and in apache httpd) because SSLlabs encourage it and because my reading of its description at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security makes it sound like a good idea without any downsides.

However, the actual HSTS RFC says something completely different: https://tools.ietf.org/html/rfc6797 "The aim is to prevent click-through insecurity and address other potential threats".

To me this explains what I see. In contrast to the description at developer.mozilla.org, firefox (and google chrome) disable "click-through" exceptions for "I do not like this https certificate", and there is no way to connect to self-signed https.

Bottom line, either use certbot to get blessed https certificate or no https for you.

K.O.
841	12 Dec 2012	Shaun Mead	Bug Report	ss_thread_kill() kills entire program
Hi,

I'm having some trouble getting ss_thread_kill() to work properly. It seems to kill the entire program instead of just the thread. Here is a test program to show the error:

_________________________________
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */
#include "midas.h"
#include "msystem.h"

/* thread body: sleeps for up to 100 seconds */
INT f(void *param)
{
  for (int x = 0; x < 100; x++)
    sleep(1);
  return 0;
}

int main()
{
  printf("creating thread\n");
  midas_thread_t thr = ss_thread_create(f, NULL);
  sleep(2);
  printf("killing thread\n");
  ss_thread_kill(thr);
  printf("success\n");
  return 0;
}
_________________________________

Makefile:
_________________________________
FLAGS=-g -Wall -DLINUX -DOS_LINUX -I/home/deap/packages/midas/include
LIBS=-L/home/deap/packages/midas/linux-m64/lib -lmidas -lpthread -lrt -lutil

main.exe: main.cpp
	g++ $(FLAGS) -o $@ $^ $(LIBS)
_________________________________

Output when run:
_________________________________
[deap@deap04 multithread]$ ./main.exe
creating thread
killing thread
Killed
[deap@deap04 multithread]$
_________________________________

The last "Killed" indicates the whole program got killed, when it should actually just kill the thread and then print "success".

I noticed the function in system.c uses pthread_kill(). Some google searches show me that it may be better to use pthread_cancel() (e.g. http://stackoverflow.com/questions/3438536/when-to-use-pthread-cancel-and-not-pthread-kill ).

Shaun
842	13 Dec 2012	Stefan Ritt	Bug Report	ss_thread_kill() kills entire program
The Linux thread functionality was introduced by Konstantin, so he might have a better idea about that. What I usually do is a graceful thread shutdown just by a flag. Like

int stop_thread = 0;

INT f(void *param)
{
  for (int x = 0; x < 100; x++) {
    sleep(1);
    if (stop_thread) {
      // clean up things here...
      return 0;
    }
  }
  return 0;
}

int main()
{
  printf("creating thread\n");
  midas_thread_t thr = ss_thread_create(f, NULL);
  sleep(2);
  printf("killing thread\n");
  stop_thread = 1;
  sleep(2);
  printf("success\n");
  return 0;
}

This way I have a chance to clean up things in the thread, which otherwise I would not be able to.
843	13 Dec 2012	Konstantin Olchanski	Bug Report	ss_thread_kill() kills entire program
> Hi, I'm having some trouble getting ss_thread_kill() to work properly. It seems
> to kill the entire program instead of just the thread.

You cannot kill a thread. It's not a well defined operation. Most OSes do have the technical possibility to kill threads, but if you use them, you will not like the results. For a taste of small trouble, if a thread is holding a lock and you kill it, whose job is it to release the lock?

The best you can do is to ask the thread to gracefully shut itself down (e.g. by using global variable flags).

P.S. I did not implement the ss_thread stuff, I do not know what ss_thread_kill() does, but I recommend that you do not use it.

P.P.S. Programming using threads is complicated, I recommend that you read at least some literature on the topic before using threads. At the least you must understand the common pitfalls and mistakes. At the least, you must know about deadlocks, livelocks, race conditions and semaphore priority inversions.

K.O.
844	13 Dec 2012	Shaun Mead	Bug Report	ss_thread_kill() kills entire program
Yes, but unfortunately what I was attempting to do was use a library function that I can't alter. It sometimes gets stuck and I wanted a way to kill it. Anyway I ended up not doing this at all in c++; I was able to do what I needed in python.

Shaun
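The "stop flag" approach recommended in this thread can be sketched with plain POSIX threads and a C11 atomic flag, so that the flag is safely visible across threads. This is an illustration under assumptions, not MIDAS code: it uses pthread_create() rather than ss_thread_create(), and the names `stop_requested`, `cleaned_up`, `worker()` and `run_and_stop_worker()` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int stop_requested = 0;  /* set by the main thread */
static atomic_int cleaned_up = 0;      /* set by the worker on exit */

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Worker: polls the flag between small units of work and
   cleans up before returning. */
static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_requested))
        sleep_ms(1);               /* stand-in for real work */
    atomic_store(&cleaned_up, 1);  /* release locks, close files, ... */
    return NULL;
}

/* Start the worker, ask it to stop, wait for it to finish.
   Returns 1 if the worker ran its cleanup, -1 if it could not start. */
int run_and_stop_worker(void)
{
    pthread_t thr;
    atomic_store(&stop_requested, 0);
    atomic_store(&cleaned_up, 0);
    if (pthread_create(&thr, NULL, worker, NULL) != 0)
        return -1;
    sleep_ms(10);
    atomic_store(&stop_requested, 1);
    pthread_join(thr, NULL);       /* waits instead of guessing with sleep() */
    return atomic_load(&cleaned_up);
}
```

Joining the thread, rather than sleeping for a fixed time after setting the flag, guarantees the cleanup has completed before the program continues. This pattern does not help with the case Shaun describes, where the thread is stuck inside a library call that never checks the flag.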
2267	31 Jul 2021	Peter Kunz	Bug Report	ss_shm_name: unsupported shared memory type, bye!
I ran into a problem trying to compile the latest MIDAS version on a Fedora system. mhttpd and odbedit return: ss_shm_name: unsupported shared memory type, bye! check_shm_type: preferred POSIXv4_SHM got SYSV_SHM The check returns SYSV_SHM which doesn't seem to be supported in ss_shm_name. Is there an easy solution for this? Thanks.
2307	02 Dec 2021	Alexey Kalinin	Bug Report	some frontend kicked by cm_periodic_tasks
Hello,

We have a small experiment with a MIDAS based DAQ. The status page shows:

ES         ESFrontend@192.168.0.37         207     0.2  0.000
Trigger06  Sample Frontend06@192.168.0.37  1.297M  0.3  0.000
Trigger01  Sample Frontend01@192.168.0.37  1.297M  0.3  0.000
Trigger16  Sample Frontend16@192.168.0.37  1.297M  0.3  0.000
Trigger38  Sample Frontend38@192.168.0.37  1.297M  0.3  0.000
Trigger37  Sample Frontend37@192.168.0.37  1.297M  0.3  0.000
Trigger03  Sample Frontend03@192.168.0.38  1.297M  0.3  0.000
Trigger07  Sample Frontend07@192.168.0.38  1.297M  0.3  0.000
Trigger04  Sample Frontend04@192.168.0.38  59898   0.0  0.000
Trigger08  Sample Frontend08@192.168.0.38  59898   0.0  0.000
Trigger17  Sample Frontend17@192.168.0.38  59898   0.0  0.000

And the SYSTEM buffers page shows:

ESFrontend         1968     198          47520          0  0x00000000  0  193 ms
Sample Frontend06  1332547  1330826      379729872      0  0x00000000  0  1.1 sec
Sample Frontend16  1332542  1330839      361988208      0  0x00000000  0  94 ms
Sample Frontend37  1332530  1330841      337798408      0  0x00000000  0  1.1 sec
Sample Frontend01  1332543  1330829      467136688      0  0x00000000  0  34 ms
Sample Frontend38  1332528  1330830      291453608      0  0x00000000  0  1.1 sec
Sample Frontend04  63254    61467        20882584       0  0x00000000  0  208 ms
Sample Frontend08  63262    61476        27904056       0  0x00000000  0  205 ms
Sample Frontend17  63271    61473        20433840       0  0x00000000  0  213 ms
Sample Frontend03  1332549  1330818      386821728      0  0x00000000  0  82 ms
Sample Frontend07  1332554  1330821      462210896      0  0x00000000  0  37 ms
Logger             968742   0w+9500418r  0w+2718405736r 0  0x00000000  0  GET_ALL  Used 0 bytes 0.0%  303 ms
rootana            254561   0w+29856958r 0w+8718288352r 0  0x00000000  0  762 ms

The problem is that eventually some of the frontends are closed with the message:

19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer 'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist

In the meantime the mserver logs:

mserver started interactively
mserver will listen on TCP port 1175
double free or corruption (!prev)
double free or corruption (!prev)
free(): invalid next size (normal)
double free or corruption (!prev)

I can see some correlation with the number of events/event size produced by the frontend, because it fails when these become big enough.

The frontend scheme is like this:

poll event time set to 0;

poll_event{
//if buffer not transferred return (continue cutting the main buffer)
//read main buffer from hardware
//buffer not transferred
}

read event{
// cut the main buffer to subevents (cut one event from main buffer) return;
//if (last subevent) {buffer transferred; return}
}

What is strange to me is that 2 frontends (1 per remote PC) are causing this.

Also, I'm executing one FEcode with the -i # flag, setting the event id in frontend_init, and using the SYSTEM buffer for all.

Is there something I'm missing?
Thanks.
A.
2314	26 Jan 2022	Konstantin Olchanski	Bug Report	some frontend kicked by cm_periodic_tasks
> The problem is that eventually some of frontend closed with message
> :19:22:31.834 2021/12/02 [rootana,INFO] Client 'Sample Frontend38' on buffer
> 'SYSMSG' removed by cm_periodic_tasks because process pid 9789 does not exist

This message means what it says. A client was registered with the SYSMSG buffer and this client had pid 9789. At some point some other client (rootana, in this case) checked it and process pid 9789 was no longer running. (It then proceeded to remove the registration).

There are 2 possibilities:
- simplest: your frontend has crashed. Best to debug this by running it inside gdb and waiting for the crash.
- unlikely: the reported pid is bogus, the real pid of your frontend is different, the client registration in SYSMSG is corrupted. This would indicate massive corruption of midas shared memory buffers, not impossible if your frontend misbehaves and writes to random memory addresses. ODB has protection against this (normally turned off, easy to enable, set ODB "/experiment/protect odb" to yes), shared memory buffers do not have protection against this (should be added?).

Do this. When you start your frontend, write down its pid; when you see the crash message, confirm the pid number printed is the same. As an additional test, run your frontend inside gdb; after it crashes, you can print the stack trace, etc.

>
> in the meantime mserver loggging :
> mserver started interactively
> mserver will listen on TCP port 1175
> double free or corruption (!prev)
> double free or corruption (!prev)
> free(): invalid next size (normal)
> double free or corruption (!prev)
>

Are these "double free" messages coming from the mserver or from your frontend? (i.e. you run them in different terminals, not all in the same terminal?).

If the messages are coming from the mserver, this confirms possibility (1), except that for frontends connected remotely, the pid is the pid of the mserver, and what we see are crashes of the mserver, not crashes of your frontend. These are much harder to debug.

You will need to enable core dumps (ODB /Experiment/Enable core dumps set to "y"), confirm that core dumps work (i.e. "killall -SEGV mserver", observe core files are created in the directory where you started the mserver), reproduce the crash, run "gdb mserver core.NNNN", run "bt" to print the stack trace, post the stack trace here (or email to me directly).

>
> I can find some correlation between number of events/event size produced by
> frontend, cause its failed when its become big enough.
>

There is no practical limit on event size or event rate in midas, you should not see any crash regardless of what you do. (Strictly, there is a limit on event size, because an event has to fit inside an event buffer and event buffer size is limited to 2 GB).

Obviously you hit a bug in mserver that makes it crash. Let's debug it.

One thing to try is to set the write cache size to zero and see if your crash goes away. I see some indication of something rotten in the event buffer code if write cache is enabled. This is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (Beware recent confusion where odb settings have no effect depending on value of "equipment_common_overwrite").

>
> frontend scheme is like this:
>

Best if you use the tmfe c++ frontend, event data handling is much simpler and we do not have to debug the convoluted old code in mfe.c.

K.O.

>
> poll event time set to 0;
>
> poll_event{
> //if buffer not transferred return (continue cutting the main buffer)
> //read main buffer from hardware
> //buffer not transfered
> }
>
> read event{
> // cut the main buffer to subevents (cut one event from main buffer) return;
> //if (last subevent) {buffer transfered ;return}
> }
>
> What is strange to me that 2 frontends (1 per remote pc) causing this.
>
> Also, I'm executing one FEcode with -i # flag , put setting eventid in
> frontend_init , and using SYSTEM buffer for all.
>
> Is there something I'm missing?
> Thanks.
> A.
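The "process pid ... does not exist" message rests on a liveness check of a registered pid. A common POSIX way to perform such a check is kill() with signal 0, sketched below. This is an assumption for illustration and not necessarily how cm_periodic_tasks implements it; `process_exists()` and `spawn_and_reap()` are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if a process with this pid exists, 0 if it does not,
   -1 on any other error. Signal 0 performs only the existence and
   permission check; nothing is delivered to the target process. */
int process_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;
    if (errno == EPERM)   /* exists, but owned by another user */
        return 1;
    if (errno == ESRCH)   /* no such process */
        return 0;
    return -1;
}

/* Demo helper: create a child that exits at once, reap it, and return
   its pid, which then normally refers to no running process. */
pid_t spawn_and_reap(void)
{
    pid_t child = fork();
    if (child == 0)
        _exit(0);
    if (child > 0)
        waitpid(child, NULL, 0);
    return child;
}
```

Writing down the frontend pid at startup and running a check like `process_exists(pid)` when the message appears is one way to carry out the comparison suggested above. Note that for remotely connected frontends the registered pid is that of the mserver, as explained in the post.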
2337	11 Feb 2022	Alexey Kalinin	Bug Report	some frontend kicked by cm_periodic_tasks
Thanks for the answer. As soon as I can (possibly in a month) I'll try the suggestion below:

> One thing to try is set the write cache size to zero and see if your crash goes away. I see
> some indication of something rotten in the event buffer code if write cache is enabled. This
> is set in ODB "/Eq/XXX/Common/Write Cache Size", set it to zero. (beware recent confusion
> where odb settings have no effect depending on value of "equipment_common_overwrite").

I tried to change this ODB setting for one of the frontends via mhttpd in the browser, and eventually it went back to the default value (1000, as I remember). That frontend has the lowest rate, 50 DWORD per ~10 s, and depending on the cache size its events appear in mdump only once per 31 events, but then all of them at once. So it's a different story, but maybe it has the same solution: playing with Write Cache Size.

The double free messages come from the mserver terminal. All of the frontends are remote.

I can't exclude crashes of the frontend, but when I run ./frontend -i 1 (2, 3, etc.) that means I run one code for all, and only several of them crash. Also I found that the crash in the frontend happened while it was doing nothing with collected data (last event reached and new data not ready yet), but was watching for ODB changes. I mean it crashes inside (while {odb_changes(value in watchdog)}), and I don't know what else happened meanwhile with the cached buffer.

The future plan is to use the event builder for the frontends once the data/signals are perfectly reasonable, i.e. without broken events. For now I'm a bit worried about whether one of the frontends will skip one of the events inside its buffer.

Thanks for the way to dig into this.
A.
1330	01 Dec 2017	Frederik Wauters	Bug Report	small bug in mfe.c init
There is a small bug in the mfe.c initialization for the EQ_POLLED mode. There is a routine where the number of polls fitting in eq_info->period is counted:

count = 1;
do {
   if (display_period)
      printf(".");

   start_time = ss_millitime();

   poll_event(equipment[idx].info.source, (INT)count, TRUE);

   delta_time = ss_millitime() - start_time;

   ...

   if (delta_time > 0)
      count = count * eq_info->period / delta_time;
   else
      count = 100;

   // avoid overflows
   if (count > 2147483647.0) {
      count = 2147483647.0;
      break;
   }

} while (delta_time > eq_info->period * 1.2 || delta_time < eq_info->period * 0.8);

As "start_time = ss_millitime();" resets "delta_time" each time, only the "avoid overflows" addition saves the day.

start_time = ss_millitime(); should be outside the loop.
1332	01 Dec 2017	Stefan Ritt	Bug Report	small bug in mfe.c init
> There is a small bug in the mfe.c initialization for the EQ_POLLED mode. There
> is a routine where the number of polls fitting in eq_info->period is counted:
>
>     count = 1;
>     do {
>        if (display_period)
>           printf(".");
>
>        start_time = ss_millitime();
>
>        poll_event(equipment[idx].info.source, (INT)count, TRUE);
>
>        delta_time = ss_millitime() - start_time;
>
>        ...
>
>        if (delta_time > 0)
>           count = count * eq_info->period / delta_time;
>        else
>           count = 100;
>
>        // avoid overflows
>        if (count > 2147483647.0) {
>           count = 2147483647.0;
>           break;
>        }
>
>     } while (delta_time > eq_info->period * 1.2 || delta_time < eq_info->period * 0.8);
>
> As "start_time = ss_millitime();" resets "delta_time" each time, only the
> "avoid overflows" addition saves the day.
>
> start_time = ss_millitime(); should be out of the loop.

Nope.

What I want is to determine how often I have to call poll_event to stay there for a certain time (usually 100 ms). So I iterate "count" until I roughly get to my 100 ms. Each call to poll_event with a different count is a new measurement, therefore I initialize start_time before each measurement. If I do it outside the loop, and kind of incrementally increase it, then the whole code inside the loop is added to the measurement, which makes it (slightly) wrong.

The whole loop optimization has some background. Polling can be slow (think of talking to a device via Ethernet, which can easily take milliseconds). So how often do we poll before we do other things in the main loop (like checking whether a run has been started)? If I only poll once, then the average front-end response time would be poor, because I would mostly be checking in the main loop whether a run has been started. This is not effective. If I poll too often inside the poll_event loop, then the front-end no longer reacts to run stops. So there is some optimum, and this is set by the polling time of usually 100 ms. This ensures that the front-end does optimal polling - without ANYTHING in between - for about 100 ms. But how can I know how often I should poll for 100 ms? As said above, polling can be very fast (reading a memory cell) or very slow (network). The best method I found is to do a calibration at startup, and this is what the code above does. Maybe there are better ways today, but that code has worked nicely for the last 25 years.

Stefan
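The calibration described above can be tried out in isolation. What follows is only a minimal sketch, not the mfe.c code: sim_millitime() and sim_poll_event() are hypothetical stand-ins for ss_millitime() and poll_event(), driven by a simulated clock (20 microseconds per poll, an arbitrary assumption) so the result is reproducible. One deliberate deviation from the quoted snippet: when the elapsed time is below the clock resolution, the sketch multiplies count by 100 instead of assigning 100, so that count keeps growing.

```c
#include <assert.h>

/* Simulated clock in microseconds. Hypothetical stand-in for the real
   system clock, used here only to make the example deterministic. */
static unsigned long long sim_clock_us = 0;

/* Hypothetical stand-in for ss_millitime(): current time in whole ms. */
static unsigned int sim_millitime(void)
{
   return (unsigned int) (sim_clock_us / 1000);
}

/* Hypothetical stand-in for poll_event(): polls the device `count` times.
   Assumption: every single poll costs 20 us. */
static void sim_poll_event(int count)
{
   sim_clock_us += 20ULL * (unsigned long long) count;
}

/* Find how many polls fit into roughly period_ms (within +-20%). */
double calibrate_poll_count(unsigned int period_ms)
{
   double count = 1;
   unsigned int start_time, delta_time;

   assert(period_ms > 0);

   do {
      /* Each pass is a fresh measurement, so the start time is taken
         immediately before the polls, inside the loop. */
      start_time = sim_millitime();
      sim_poll_event((int) count);
      delta_time = sim_millitime() - start_time;

      if (delta_time > 0)
         count = count * period_ms / delta_time; /* rescale to the target */
      else
         count *= 100; /* too fast to measure: try many more polls */

      /* avoid overflowing the int passed to the poll routine */
      if (count > 2147483647.0) {
         count = 2147483647.0;
         break;
      }
   } while (delta_time > period_ms * 1.2 || delta_time < period_ms * 0.8);

   return count;
}
```

With these assumed numbers, calibrate_poll_count(100) takes three passes (1 poll, then 100, then 5000) and settles on 5000 polls, i.e. 5000 x 20 us = 100 ms.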
1333	04 Dec 2017	Frederik Wauters	Bug Report	small bug in mfe.c init
> > There is a small bug in the mfe.c initialization for the EQ_POLLED mode. There
> > is a routine where the number of polls fitting in eq_info->period is counted:
> >
> >     count = 1;
> >     do {
> >        if (display_period)
> >           printf(".");
> >
> >        start_time = ss_millitime();
> >
> >        poll_event(equipment[idx].info.source, (INT)count, TRUE);
> >
> >        delta_time = ss_millitime() - start_time;
> >
> >        ...
> >
> >        if (delta_time > 0)
> >           count = count * eq_info->period / delta_time;
> >        else
> >           count = 100;
> >
> >        // avoid overflows
> >        if (count > 2147483647.0) {
> >           count = 2147483647.0;
> >           break;
> >        }
> >
> >     } while (delta_time > eq_info->period * 1.2 || delta_time < eq_info->period * 0.8);
> >
> > As "start_time = ss_millitime();" resets "delta_time" each time, only the
> > "avoid overflows" addition saves the day.
> >
> > start_time = ss_millitime(); should be out of the loop.
>
> Nope.
>
> What I want is to determine how often I have to call poll_event to stay there for a certain time (usually 100 ms). So I iterate "count" until I roughly get to my 100 ms. Each call to
> poll_event with a different count is a new measurement, therefore I initialize start_time before each measurement. If I do it outside the loop, and kind of incrementally increase
> it, then the whole code inside the loop is added to the measurement, which makes it (slightly) wrong.
>
> The whole loop optimization has some background. Polling can be slow (think of talking to a device via Ethernet, which can easily take milliseconds). So how often do we poll
> before we do other things in the main loop (like checking whether a run has been started)? If I only poll once, then the average front-end response time would be poor, because I
> would mostly be checking in the main loop whether a run has been started. This is not effective. If I poll too often inside the poll_event loop, then the front-end no longer
> reacts to run stops. So there is some optimum, and this is set by the polling time of usually 100 ms. This ensures that the front-end does optimal polling - without ANYTHING in
> between - for about 100 ms. But how can I know how often I should poll for 100 ms? As said above, polling can be very fast (reading a memory cell) or very slow (network). The best
> method I found is to do a calibration at startup, and this is what the code above does. Maybe there are better ways today, but that code has worked nicely for the last 25 years.
>
> Stefan

Thanks, I misunderstood the loop then.

If poll_event(equipment[idx].info.source, (INT)count, TRUE); doesn't do anything with "count", the loop becomes infinite, apart from the overflow check.
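The last point can be illustrated with a small, self-contained sketch. It is not the mfe.c code: the clock is simulated, and poll_scaling() and poll_ignoring_count() are hypothetical poll routines (20 microseconds per poll and a fixed 5 ms per call, both arbitrary assumptions). The calibration here also reports whether it converged or was stopped by the clamp, which the quoted snippet does not do, and it multiplies count by 100 when the elapsed time is unmeasurable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*poll_fn)(int count);

/* Simulated clock in microseconds (assumption, for reproducibility). */
static unsigned long long sim_clock_us = 0;

/* Hypothetical stand-in for ss_millitime(). */
static unsigned int sim_millitime(void)
{
   return (unsigned int) (sim_clock_us / 1000);
}

/* Poll routine that honours count: 20 us per poll (assumed cost). */
static void poll_scaling(int count)
{
   sim_clock_us += 20ULL * (unsigned long long) count;
}

/* Poll routine that ignores count: always 5 ms per call (assumed cost). */
static void poll_ignoring_count(int count)
{
   (void) count;
   sim_clock_us += 5000;
}

/* Returns true if the measured time landed within +-20% of period_ms,
   false if only the overflow clamp ended the loop. */
bool calibrate(poll_fn poll, unsigned int period_ms, double *count_out)
{
   double count = 1;
   unsigned int start_time, delta_time;
   bool converged = true;

   assert(poll != NULL && count_out != NULL && period_ms > 0);

   do {
      start_time = sim_millitime();
      poll((int) count);
      delta_time = sim_millitime() - start_time;

      if (delta_time > 0)
         count = count * period_ms / delta_time;
      else
         count *= 100;

      /* If the poll time never scales with count, this clamp is the
         only exit from the loop. */
      if (count > 2147483647.0) {
         count = 2147483647.0;
         converged = false;
         break;
      }
   } while (delta_time > period_ms * 1.2 || delta_time < period_ms * 0.8);

   *count_out = count;
   return converged;
}
```

With poll_scaling the loop converges on 5000 polls for a 100 ms period. With poll_ignoring_count every pass measures the same 5 ms, count is multiplied by 20 each time, and the loop stops only when the clamp at 2147483647 is reached.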

