11 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition?
|
> >
> > The only remaining problem when running my script is some kind of deadlock between the ODB and SYSMSG semaphores...
> >
>
> For now, I am happy that we no longer corrupt ODB (nor deadlock) ...
>
Found one more deadlock between ODB and SYSMSG semaphores, this time through cm_watchdog():
If cm_watchdog somehow runs while we are holding the ODB semaphore, it will eventually try to lock SYSMSG (through bm_cleanup & co) in
violation of our semaphore locking order. If at the same time another application tries to lock stuff using the correct order (SYSMSG first, ODB last),
the two programs will deadlock (wait for each other forever). I presently have two copies of gdb attached to two copies of odbedit
waiting for each other in a deadlock through this cm_watchdog scenario...
Solution shall follow quickly, I have been hunting this deadlock for the last couple of weeks...
K.O. |
13 Feb 2011, Lee Pool, Bug Report, Problems with midas history SVN 4936
|
> I have the following problems after updating to midas SVN 4936: the history
> system (web-page via mhttpd) seems to stop working. I checked the history files
> themself and they are indeed written, except that the events ID's are not the
> same anymore (I mean the ones defined under /Equipment/XXX/Common/Event ID),
> rather the mlogger seems to choose an ID by itself.
> Currently the only way to get things working again was to recompile midas with
> adding -DOLD_HISTORY to the CFLAGS which is troublesome since it is likely to be
> forgotton with the next SVN update. When looking into the SVN I have the
> impression there is something going on concerning the history system, however
> I couldn't find any documentation.
> What is the best practice for the future, in order not to run into any problems
> but still being able to look at the old history (also from within the web-page
> via mhttpd)?
Hi...
Do you mind giving little more detail? We might have the same issue, where we got
complaints that midas history stops working after a certain time.
L |
15 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition?
|
> Solution shall follow quickly, I have been hunting this deadlock for the last couple of weeks...
Over the last couple of days I made a series of commits to odb.c and midas.c to implement a buffer-based cm_msg()
and fix the latest deadlock problem, also to help with the race conditions in client creation and cleanup.
My torture test runs okey in my mac now, one remaining problem is spurious client removal caused
by semaphore starvation - I see 2-3-7-10 sec wait times for semaphores - probably caused by some
kind of unfairness in the MacOS SysV semaphore implementation (in a "fair" semaphore implementation,
the process that waited the longest would be woken up the first and one would never see semaphore wait
times measured in seconds). Probably worth investigating fairness of MacOS posix semaphores. On LInux
things are probably different and under normal running conditions one should not see any semaphore starvation.
I will be doing extensive tests of this update at TRIUMF, but I do not expect any problems. If you use this
version and see any anomalies, please report them as replies to this message or email me directly.
svn rev 4976
K.O. |
16 Feb 2011, Konstantin Olchanski, Bug Report, Problems with midas history SVN 4936
|
> I have the following problems after updating to midas SVN 4936: the history
> system (web-page via mhttpd) seems to stop working. I checked the history files
> themself and they are indeed written, except that the events ID's are not the
> same anymore (I mean the ones defined under /Equipment/XXX/Common/Event ID),
> rather the mlogger seems to choose an ID by itself.
Yes, I found the problem - it was introduced around svn rev 4827 in September 2010.
It is fixed now, please do this:
1) update history_midas.c to latest svn rev 4979
1a) do NOT update any other files - update only history_midas.c
2) rebuild mlogger (it will do no harm and no good if you rebuild everything)
3) odbedit save odb.xml
4) in odb, remove /history/events and /history/tags (you can also set "/History/DisableTags" to "y")
5) restart mlogger
6) observe that odb /history/events now has event ids same as equipment ids
7) restart your frontend, observe that history file is growing
8) use mhdump to observe that history is now written with correct event id
9) go to mhttpd history plot, you should see the new data coming in. Plot history in the "1 year" scale, you
should see the old data and you should see a gap where data was written with wrong event id
10) I should still have an mhrewrite program sitting somewhere that can change the event ids inside midas
history files, if you have many data files with wrong event id, let me know, I will find this program and tell you
how to use it to repair your data files.
> Currently the only way to get things working again was to recompile midas with
> adding -DOLD_HISTORY to the CFLAGS which is troublesome since it is likely to be
> forgotton with the next SVN update.
Yes, I am glad you found OLD_HISTORY, I kept it just for the case some breakage like this happens. I will still
keep it around until the dust settles.
> When looking into the SVN I have the impression there is something going on concerning the history
system, however I couldn't find any documentation.
Yes, you found the right stuff, and it is partially documented. mlogger uses /History/Events to map history
event names (equipment names in your case) to history event ids. But in your case, the wrong event id has
been assigned by mlogger so nothing worked right. As a bonus, I now see inconsistency between event_id
code remaining in mlogger (which is not used) and event_id code in history_midas (which *is* used). I will be
straightening this stuff over the next few days.
I hope my correction to history_midas.cxx is good enough to get you going for now.
> What is the best practice for the future, in order not to run into any problems
> but still being able to look at the old history (also from within the web-page
> via mhttpd)?
Personally, I think that the midas history storage into binary files is not robust enough
when facing changes to equipment and event ids, renaming and deleting of stuff, etc. There
are other limitations, as well, i.e. the 16-bit history event id, etc.
The newly implemented SQL history storage (uses ODBC layer, MySQL supported, PgSQL partially
implemented) does not have any of these problems and seems to work well enough
for T2K/ND280. Sometimes MySQL history is even faster when making history plots in mhttpd.
I am now thinking about implementing SQL history storage in SQLite files, and it will not have
any of these problems, too. Performance and robustness for database corruption remain a question, though.
K.O. |
16 Feb 2011, Konstantin Olchanski, Bug Report, Problems with midas history SVN 4936
|
It looks like email notices did not go the first time. Please read my replies below. K.O.
> > I have the following problems after updating to midas SVN 4936: the history
> > system (web-page via mhttpd) seems to stop working. I checked the history files
> > themself and they are indeed written, except that the events ID's are not the
> > same anymore (I mean the ones defined under /Equipment/XXX/Common/Event ID),
> > rather the mlogger seems to choose an ID by itself.
>
> Yes, I found the problem - it was introduced around svn rev 4827 in September 2010.
>
> It is fixed now, please do this:
> 1) update history_midas.c to latest svn rev 4979
> 1a) do NOT update any other files - update only history_midas.c
> 2) rebuild mlogger (it will do no harm and no good if you rebuild everything)
> 3) odbedit save odb.xml
> 4) in odb, remove /history/events and /history/tags (you can also set "/History/DisableTags" to "y")
> 5) restart mlogger
> 6) observe that odb /history/events now has event ids same as equipment ids
> 7) restart your frontend, observe that history file is growing
> 8) use mhdump to observe that history is now written with correct event id
> 9) go to mhttpd history plot, you should see the new data coming in. Plot history in the "1 year" scale, you
> should see the old data and you should see a gap where data was written with wrong event id
> 10) I should still have an mhrewrite program sitting somewhere that can change the event ids inside midas
> history files, if you have many data files with wrong event id, let me know, I will find this program and tell you
> how to use it to repair your data files.
>
> > Currently the only way to get things working again was to recompile midas with
> > adding -DOLD_HISTORY to the CFLAGS which is troublesome since it is likely to be
> > forgotton with the next SVN update.
>
> Yes, I am glad you found OLD_HISTORY, I kept it just for the case some breakage like this happens. I will still
> keep it around until the dust settles.
>
> > When looking into the SVN I have the impression there is something going on concerning the history
> system, however I couldn't find any documentation.
>
> Yes, you found the right stuff, and it is partially documented. mlogger uses /History/Events to map history
> event names (equipment names in your case) to history event ids. But in your case, the wrong event id has
> been assigned by mlogger so nothing worked right. As a bonus, I now see inconsistency between event_id
> code remaining in mlogger (which is not used) and event_id code in history_midas (which *is* used). I will be
> straightening this stuff over the next few days.
>
> I hope my correction to history_midas.cxx is good enough to get you going for now.
>
> > What is the best practice for the future, in order not to run into any problems
> > but still being able to look at the old history (also from within the web-page
> > via mhttpd)?
>
> Personally, I think that the midas history storage into binary files is not robust enough
> when facing changes to equipment and event ids, renaming and deleting of stuff, etc. There
> are other limitations, as well, i.e. the 16-bit history event id, etc.
>
> The newly implemented SQL history storage (uses ODBC layer, MySQL supported, PgSQL partially
> implemented) does not have any of these problems and seems to work well enough
> for T2K/ND280. Sometimes MySQL history is even faster when making history plots in mhttpd.
>
> I am now thinking about implementing SQL history storage in SQLite files, and it will not have
> any of these problems, too. Performance and robustness for database corruption remain a question, though.
>
> K.O. |
16 Feb 2011, Konstantin Olchanski, Bug Report, Problems with midas history SVN 4936
|
>
> Do you mind giving little more detail? We might have the same issue, where we got
> complaints that midas history stops working after a certain time.
>
Yes, please do supply more information. What problems do *you* see?
K.O. |
16 Feb 2011, Konstantin Olchanski, Bug Report, fixed. odb corruption, odb race condition?
|
> My torture test runs okey in my mac now, one remaining problem is spurious client removal caused
> by semaphore starvation...
My torture test runs okey on Linux and I do not see any problems with spurious client removal - actually
I do not see any strange longs waits for semaphores that I was seeing on MacOS. Must be another
proof that MacOS is years behind Linux in kernel technology (but parsecs ahead in user experience)
K.O. |
16 Feb 2011, Lee Pool, Bug Report, Problems with midas history SVN 4936
|
> >
> > Do you mind giving little more detail? We might have the same issue, where we got
> > complaints that midas history stops working after a certain time.
> >
>
>
> Yes, please do supply more information. What problems do *you* see?
>
>
> K.O.
Hi.
uhm, mine might be completely unrelated to this, but it just so happened that the rev.
4936 was one that was used in a recent experiment, in which there was complaints about
the responsiveness of the history plots. The history plots would take up to 30 seconds
to respond, if the run was about 30-40 minutes old. When the run is about < 10 minutes
old , the history plot was responsive to within 1-2 seconds.
I received rather limited information regarding this problem. So hence my apprehension
on stating it as a *problem* or bug. It could be something related to hardware/beam etc.
Lee |
17 Feb 2011, Stefan Ritt, Bug Report, Problems with midas history SVN 4936
|
> uhm, mine might be completely unrelated to this, but it just so happened that the rev.
> 4936 was one that was used in a recent experiment, in which there was complaints about
> the responsiveness of the history plots. The history plots would take up to 30 seconds
> to respond, if the run was about 30-40 minutes old. When the run is about < 10 minutes
> old , the history plot was responsive to within 1-2 seconds.
Ah, that rings a bell. How big are your history files on disk? How much RAM do you have?
What I see in our experiment is that linux buffers everything I write to disk in a cache
located in RAM. But this cache is limited, so after a certain time it's overwritten. Now
this is handled by the OS, the application (mlogger in this case) has no influence on it.
Let's say you write 5 MB/minute of history, and your cache is 50 MB large. Then after 10
minutes you can still read the history data from the RAM cache which is ~10x faster than
your disk. But your older history data (30-40 min) is flushed out of the cache and has to be
re-read from disk. A typical symptom of this is that the first time you display this it
takes maybe 30 seconds, but if you do a "reload" of your page it goes much faster. In that
case the contents is cached again in RAM. If you observe this, you can almost be certain
that you see th "too small RAM cache" problem. In that case just add RAM and things should
run better (I use 16 GByte in my machine).
Best regards,
Stefan |
05 Jul 2011, Konstantin Olchanski, Bug Report, MacOS network socket timeouts non-functional
|
It turns out that because of differences between select() syscall implementation between UNIX (MacOS,
maybe BSD) and Linux, network socket timeouts do not work.
This affects timeouts during run transitions (transition calls to dead clients do not timeout), maybe other
places.
I am looking into fixing this. The main difficulty is with UNIX select() not updating the timeout parameter
when it is interrupted by the MIDAS watchdog alarm signal. Linux select() subtracts the elapsed time from
the timeout value and this code from system.c works correctly: while (1) { status = select(..., &timeout); if
(status==0) break; } (value of timeout becomes smaller each time), while on MacOS it loops forever (value
of timeout does not change).
K.O. |
12 Dec 2011, Michael Murray, Bug Report, bk_delete uses memcpy instead of memmove
|
In midas.c, the bk_delete function removes a bank by decrementing the total
event size and then copying the remaining banks into the location of the first
using memcpy from string.h.
memcpy is not specified to handle overlapping memory regions (such as MIDAS
banks), though it seems most common implementations do.
memmove should be used instead, which is specified to behave as if copying
through an intermediate buffer.
I noticed the misbehavior using glibc with gcc version 4.4.4 and scientific
linux 6.0. Other gcc versions changed nothing, as this originates from the
implementation of memcpy in libc.
libc version:
GNU C Library stable release version 2.12, by Roland McGrath et al.
Compiled by GNU CC version 4.4.5 20110214 (Red Hat 4.4.5-6).
Compiled on a Linux 2.6.32 system on 2011-12-06. |
16 Dec 2011, Konstantin Olchanski, Bug Report, bk_delete uses memcpy instead of memmove
|
> In midas.c, the bk_delete function removes a bank by decrementing the total
> event size and then copying the remaining banks into the location of the first
> using memcpy from string.h.
I confirm the documented difference between memcpy() and memmove() and I confirm the
questionable use of memcpy() in bk_delete(). I think it should be memmove(). I made it so in my copy
of midas, so this change will not be lost.
But I am not sure how to test it - I do not think I ever used bk_delete(). I will probably ponder upon
this and do a blind commit.
K.O. |
19 Dec 2011, Stefan Ritt, Bug Report, bk_delete uses memcpy instead of memmove
|
> > In midas.c, the bk_delete function removes a bank by decrementing the total
> > event size and then copying the remaining banks into the location of the first
> > using memcpy from string.h.
>
>
> I confirm the documented difference between memcpy() and memmove() and I confirm the
> questionable use of memcpy() in bk_delete(). I think it should be memmove(). I made it so in my copy
> of midas, so this change will not be lost.
>
> But I am not sure how to test it - I do not think I ever used bk_delete(). I will probably ponder upon
> this and do a blind commit.
>
>
> K.O.
It cannot hurt to use memmove(), so please go ahead to commit the changes.
- Stefan |
29 Feb 2012, Konstantin Olchanski, Bug Report, Problem with semaphores
|
Hi there! In the T2K/ND280 experiment in Japan, we keep having problems with MIDAS locking (probably
of ODB). The symptoms are: some program reports a timeout waiting for the ODB lock, then all programs
eventually die with this same error. Complete system meltdown. This does not look like the deadlock
between locks for ODB, cm_msg and the data buffers that I looked into last year. It looks more like
somebody locks ODB, dies and the Linux kernel fails to unlock the lock (via the SYSV "sem undo"
function). But it is hard to confirm, hence this message:
The implementation of semaphores in MIDAS (used for locking ODB and the shared memory data buffers)
uses the straight SYSV semaphore API - which lacks basic debugging features - there is no tracking of
who locked what when, so if anything at all goes wrong at all, i.e. we are confronted with a timeout
waiting for the ODB lock, the only corrective action possible is to kill all MIDAS clients and tell the user to
start from scratch. There is no additional information available from the SYSV semaphore API to identify
which MIDAS program caused the fault.
The POSIX semaphore API is even worse - no debugging features are available, *and* if a program dies
while holding a lock, the lock stays locked forever (everybody else will wait forever or see a semaphore
timeout, and then what?).
So I am looking for an "advanced semaphore library" to use in MIDAS. In addition to the boring functions
of reliable locking and unlocking, it should support:
- wait with timeout
- remember who is holding the lock
- detect that the process holding the lock is dead and take corrective action (automatic unlock as done by
SYSV semaphores, call back to user code where we can cleanup and unlock ourselves, etc)
- maybe permit recursive locking (not really required as ODB locks are already made recursive "by hand")
- maybe remember some of the locking history (so we can dump it into a log file when we detect a
deadlock or other lock malfunction).
Quick google search only find sundry wrappers for SYSV and POSIX semaphores. How they deal with the
problem of processes locking the semaphore and dying remains a mystery to me (other than telling users
to remove the Ctrl-C button from their keyboard). BTW, we have seen this problem with several
commercial applications that use SYSV semaphores but forget to enable the SEM_UNDO function).
Anyhow, if anybody can suggest such an advanced locking library it would be great. Will save me the
effort of writing one.
K.O. |
01 Mar 2012, Stefan Ritt, Bug Report, Problem with semaphores
|
> Anyhow, if anybody can suggest such an advanced locking library it would be great. Will save me the
> effort of writing one.
Hi Konstantin,
yes there is a good way, which I used during development of the buffer manager function. Put in each sm_xxx function a cm_msg(M_DEBUG, ...) to
generate a debug system message. They go only into the SYSMSG ring buffer and thus are light weight and don't influence the timing much. You can
keep odbedit open to see these messages, but there is also another way. You can write a little program which dumps the whole SYSMSG buffer, which
you can call when the lock happens. You then look "backwards" in time and get all messages stored there, depending of the size of the SYSMSG buffer of
course. Of course this only works if the lock does not happen on the SYSMSB buffer itself. In that case you have to produce M_LOG messages which are
written to the logging file. This will influence the timing slightly (the file might grow rapidly) but you are independent of semaphores.
The interesting thing is that in the MEG experiment (9 Front-ends, Event Builder, Logger, Lazylogger, ....) we run for months without any lock up. So I
might suspect it's caused in your case from a program only you are using.
Best regards,
Stefan |
18 Apr 2012, Exaos Lee, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
|
I tried to build MIDAS under ArchLinux, failed on errors as following:src/mlogger.cxx: In function ‘INT midas_flush_buffer(LOG_CHN*)’:
src/mlogger.cxx:1011:54: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
/usr/include/zlib.h:1318:21: error: initializing argument 1 of ‘int gzwrite(gzFile, voidpc, unsigned int)’ [-fpermissive]
src/mlogger.cxx: In function ‘INT midas_log_open(LOG_CHN*, INT)’:
src/mlogger.cxx:1200:79: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0: Please refer to attachment elog:786/1 for detail. There are also many warnings listed.
This error can be supressed by adding -fpermissive to CXXFLAGS. But the error message is correct."gzFile" is not equal to "void *"! C allows implicit casts between void* and any pointer type, C++ doesn't allow that. It's better to fix this error. A quick fix would be adding explicit casts. But I'm not sure what is the proper way to fix this. |
19 Apr 2012, Stefan Ritt, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
|
Exaos Lee wrote: | I tried to build MIDAS under ArchLinux, failed on errors as following:src/mlogger.cxx: In function ‘INT midas_flush_buffer(LOG_CHN*)’:
src/mlogger.cxx:1011:54: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0:
/usr/include/zlib.h:1318:21: error: initializing argument 1 of ‘int gzwrite(gzFile, voidpc, unsigned int)’ [-fpermissive]
src/mlogger.cxx: In function ‘INT midas_log_open(LOG_CHN*, INT)’:
src/mlogger.cxx:1200:79: error: invalid conversion from ‘void*’ to ‘gzFile’ [-fpermissive]
In file included from src/mlogger.cxx:33:0: Please refer to attachment elog:786/1 for detail. There are also many warnings listed.
This error can be supressed by adding -fpermissive to CXXFLAGS. But the error message is correct."gzFile" is not equal to "void *"! C allows implicit casts between void* and any pointer type, C++ doesn't allow that. It's better to fix this error. A quick fix would be adding explicit casts. But I'm not sure what is the proper way to fix this. |
Ah, dumb gcc gets pickier and pickier. I added a case (gzFile)log_chn->gzfile which fixes the error. I cannot put gzFile already into the header file since the zlib header is included after the midas header, otherwise we get some other problems. The SVN version with the fix is 5275. |
25 Apr 2012, Konstantin Olchanski, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
|
Stefan's fix is incomplete - the "gzFile" cast is needed for all calls to zlib, not just those that some version
of GCC happens to complain about. Fixed.
svn rev 5286.
BTW, I read the midas elog via email and if you post html or elcode messages, I receive complete
gibberish. For prompt service, please select message type "plain". (yes, you cannot use fancy colours and
blinking text, but better than me not reading your stuff at all).
BTW2, for easier reading, please include error messages as plain text in your message. As opposed to
compressed attachements.
K.O. |
27 Apr 2012, Stefan Ritt, Bug Report, Build error with mlogger: invalid conversion from ‘void*’ to ‘gzFile’
|
KO wrote: | BTW, I read the midas elog via email and if you post html or elcode messages, I receive complete
gibberish. For prompt service, please select message type "plain". (yes, you cannot use fancy colours and
blinking text, but better than me not reading your stuff at all).
BTW2, for easier reading, please include error messages as plain text in your message. As opposed to
compressed attachements.
K.O.
|
BTW3, if you use a real email program you don't get glibberish. I know some people prefer good-old-text-only pine, but I'm sure you do not use the ascii-only browser lynx to browse the internet, right? So if you browse the web in graphics, why not read your email in graphics as well. Better change yourself than the whole rest of the world  |
09 Jun 2012, Greg Christian, Bug Report, _net_send_buffer realloc
|
In midas.c, I noticed that memory is only allocated to the global buffer
_net_send_buffer by calling realloc() from within the function
resize_net_send_buffer() (at least this was the only place I could find
allocation to _net_send_buffer happening). This can cause problems for a couple
of reasons:
1) _net_send_buffer is not set to NULL when declared. To my understanding, this
makes the first call to realloc(_net_send_buffer, /*size*/) undefined. When
passed a pointer that has not previously been allocated, realloc() acts like
malloc() only if the pointer equal to NULL. Otherwise, the behavior is undefined
and usually causes a crash.
2) cm_disconect_experiment() calls free(_net_send_buffer) but does not set its
value to NULL. Thus if a client tries to include more than one
connect...disconnect cycle within an application, there is undefined behavior
the next time realloc(_net_send_buffer, ...) gets called.
I think that any potential allocation issues involving _net_send_buffer could be
solved by:
1) Initializing _net_send_buffer to NULL.
2) In cm_disconnect_experiment(), changing
> M_FREE(_net_send_buffer);
to
> M_FREE(_net_send_buffer);
> _net_send_buffer = NULL; |
|