ID | Date | Author | Topic | Subject |
906 | 14 Sep 2013 | Konstantin Olchanski | Info | mktime() and daylight savings time |
I would like to share with you a silly problem with mktime() and daylight savings time (Summer time/Winter time) that I have run into while working on the mhttpd history query page.
I am implementing 1 hour granularity for the queries (was 1 day granularity) and somehow all my queries
were off by 1 hour.
It turns out that the mktime() and localtime() functions for converting between time_t and normal time
units (days, hours) are not exact inverses of each other.
Daylight savings time (DST) is to blame.
While localtime() always applies the current DST, mktime() will return the wrong answer unless tm_isdst is
set correctly.
For tm_isdst, the default value 0 is wrong 50% of the time in most locations as it means "DST off" (whether
that's Summer time or Winter time depends on your location).
Today in Vancouver, BC, DST is in effect, and localtime(mktime()) is off by 1 hour.
If I were doing this in January, I would not see this problem.
"man mktime" talks about "tm_isdst" special value "-1" that is supposed to fix this. But the wording of
"man mktime" on Linux and on MacOS is different (I am amused by the talk about "attempting to divine
the DST setting"). Wording at http://pubs.opengroup.org/onlinepubs/007908799/xsh/mktime.html is
different again. MS Windows (Visual Studio) documentation says different things for different versions.
So for mhttpd I use the following code. First mktime() gets the approximate time, a call to localtime()
returns the DST setting in effect for that date, a second mktime() with the correct DST setting returns the
correct time. (By "correct" I mean that localtime(mktime(t)) == t).
time_t mktime_with_dst(const struct tm* ptms)
{
   // this silly stuff is required to correctly handle daylight savings time (Summer time/Winter time)
   // when we fill "struct tm" from user input, we cannot know if daylight savings time is in effect
   // and we do not know how to initialize the value of tms->tm_isdst.
   // This can cause the output of mktime() to be off by one hour.
   // (Rules for daylight savings time are set by national and local governments and in some locations change yearly)
   // (There are no locations with 2 hour or half-hour daylight savings that I know of)
   // (Yes, "man mktime" talks about using "tms->tm_isdst = -1")
   //
   // We assume the user is using local time and we convert in two steps:
   //
   // first we convert "struct tm" to "time_t" using mktime() with unknown tm_isdst
   // second we convert "time_t" back to "struct tm" using localtime_r()
   // this fills "tm_isdst" with the correct value from the system time zone database
   // then we reset all the time fields (except for sub-minute fields not affected by daylight savings)
   // and call mktime() again, now with the correct value of "tm_isdst".
   // K.O. 2013-09-14

   struct tm tms = *ptms;
   struct tm tms2;
   time_t t1 = mktime(&tms);
   localtime_r(&t1, &tms2);
   tms2.tm_year = ptms->tm_year;
   tms2.tm_mon  = ptms->tm_mon;
   tms2.tm_mday = ptms->tm_mday;
   tms2.tm_hour = ptms->tm_hour;
   tms2.tm_min  = ptms->tm_min;
   time_t t2 = mktime(&tms2);
   //printf("t1 %.0f, t2 %.0f, diff %d\n", (double)t1, (double)t2, (int)(t1-t2));
   return t2;
}
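For illustration, here is a minimal usage sketch (the date values are arbitrary examples, compiled together with the function above; not actual midas code):

#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
   struct tm tms;
   memset(&tms, 0, sizeof(tms));   // tm_isdst starts at 0, i.e. "DST off"
   tms.tm_year = 2013 - 1900;      // years since 1900
   tms.tm_mon  = 9 - 1;            // September (months count from 0)
   tms.tm_mday = 14;
   tms.tm_hour = 12;
   tms.tm_min  = 0;

   time_t t = mktime_with_dst(&tms);
   printf("%s", ctime(&t));        // localtime() of this time_t matches the input, DST included
   return 0;
}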
K.O. |
907 | 18 Sep 2013 | Konstantin Olchanski | Bug Report | mhttpd truncates string variables to 32 characters |
I confirm the second part of the problem.
Inline edit uses ODBSet(), which uses the "jset" AJAX call to mhttpd, and "jset" does not extend string variables.
This is the jset code. As best I can tell, it truncates string variables to the existing size in ODB:
   db_find_key(hDB, 0, str, &hkey);
   db_get_key(hDB, hkey, &key);
   memset(data, 0, sizeof(data));
   size = sizeof(data);
   db_sscanf(getparam("value"), data, &size, 0, key.type);
   db_set_data_index(hDB, hkey, data, key.item_size, index, key.type);
These original jset/jget functions are a little too complicated and there is no documentation (what documentation exists was written by me while trying to read the existing code).
We now have jcopy/ODBMCopy() as a sane replacement for jget, but nothing comparable for jset yet.
I think this quirk of inline edit cannot be fixed in javascript - the mhttpd code for "jset" has to change.
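For reference, here is a minimal sketch of the two-step pattern described in the quoted text below (the ODB path and the 256-byte length are arbitrary examples, not actual mhttpd code):

   HNDLE hkey;
   char  str[256];

   // step 1: db_create_key(TID_STRING) creates a string of length zero
   db_create_key(hDB, 0, "/Test/My String", TID_STRING);
   db_find_key(hDB, 0, "/Test/My String", &hkey);

   // step 2: db_set_data() sets both the contents and the desired string length
   strlcpy(str, "a string value longer than 32 characters", sizeof(str));
   db_set_data(hDB, hkey, str, sizeof(str), 1, TID_STRING);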
K.O.
>
> I can confirm part of the problem - the new inline-edit function - after you finish editing - shows you what you
> have typed, not what's actually in ODB - at the very end it should do an ODBGet() to load the actual ODB
> contents and show *that* to the user.
>
> The truncation to 32 characters - most likely it is a failure to resize the ODB string - is probably in mhttpd and
> I can take a quick look into it.
>
> There is a 3rd problem - the mhttpd ODB editor "create" function does not ask for the string length to create.
>
> Actually, in ODB, "create" and "set string size" are two separate functions - db_create_key(TID_STRING) creates
> a string of length zero, then db_set_data() creates an empty string of desired length.
>
> In the new AJAX interface these two functions are separate (ODBCreate just calls db_create_key()).
>
> In the present ODBSet() function the two are mixed together - and the ODB inline edit function uses ODBSet().
>
> K.O.
>
>
>
> > I find that new mhttpd has strange behaviour for ODB strings.
> >
> > - I create a new STRING variable in ODB through mhttpd. It defaults to size 32.
> >
> > - I then edit the STRING variable through mhttpd, writing a new string larger
> > than 32 characters.
> >
> > - Initially everything looks fine; it seems as if the new string value has been
> > accepted.
> >
> > - But if you reload the page, or navigate back to the page, you realize that
> > mhttpd has silently truncated the string back to 32 characters.
> >
> > You can reproduce this problem on a test page here:
> >
> > http://midptf01.triumf.ca:8081/AnnMessage
> >
> > Older versions of mhttpd (I'm testing one from 2 years ago) don't have this
> > 'feature'. For older mhttpd the string variable would get resized when a larger
> > string was inputted. That definitely seems like the right behavior to me.
> >
> > I am using fresh copy of midas from bitbucket as of this morning. (How do I get
> > a particular tag/hash of the version of midas that I am using?) |
911 | 24 Sep 2013 | Konstantin Olchanski | Info | mktime() and daylight savings time |
> I vaguely remember that I had a similar problem with ELOG. The solution was to call tzset() at the beginning of the program. The man page says that
> this function is called automatically by programs using time zones, but apparently it is not. Can you try that? There is also the TZ environment
> variable and /etc/localtime. I never understood the details, but playing with these things can influence mktime() and localtime().
I confirm that the timezone is set correctly - I do get the correct time eventually - so there is no missing call to tzset().
K.O. |
915 | 25 Sep 2013 | Konstantin Olchanski | Info | Documentation for ODBGet() & co, Javascript and AJAX functions. |
> > The bulk of the MIDAS AJAX and Javascript functions is now documented on the MIDAS Wiki:
> >
> > https://midas.triumf.ca/MidasWiki/index.php/Mhttpd.js
> > https://midas.triumf.ca/MidasWiki/index.php/AJAX
> >
>
> The documentation was updated again.
>
Newly documented are the additional Javascript and AJAX functions present in the GIT branch "feature/ajax":
ODBMCreate(paths, types);
ODBMCreate(paths, types, arraylengths, stringlengths, callback);
ODBMResize(paths, arraylengths, stringlengths, callback);
ODBMRename(paths, names, callback);
ODBMLink(paths, links, callback);
ODBMReorder(paths, indices, callback);
ODBMKey(paths, callback);
ODBMDelete(paths, callback);
All these functions permit asynchronous use (with callback on completion) and the underlying AJAX functions permit JSON-P encoding.
ODBSetUrl("http://mhttpd.somewhere.com:8080") : this new function removes the restriction that custom scripts had to be loaded from the same mhttpd that they will
access. Together with the newly added CORS support in mhttpd, this allows loading custom scripts from any web server, including a local file, and having them access any
one (or any several) mhttpd data sources.
I think these new functions are now stable (I still had to make some changes to ODBMCreate() recently) and after some more testing this branch will be merged into
"develop".
To use this branch, do either:
a) git clone midas; git pull; git checkout feature/ajax
b) git clone midas; git checkout develop; git pull; git checkout -b ajaxtest; git merge feature/ajax;
(Option (b) creates a local branch with the latest "develop" and "feature/ajax" merged together).
K.O. |
916 | 27 Sep 2013 | Konstantin Olchanski | Info | ODB JSON support |
> odbedit can now save ODB in JSON-formatted files.
>
> JSON encoding implementation follows specifications at:
> http://json.org/
>
> The result passes validation by:
> http://jsonlint.com/
>
A bug was reported in my JSON ODB encoder: NaN values are not encoded correctly. A quick review found this:
1) the authors of JSON smoked some bad mushrooms and specifically disallowed NaN and Inf values for floating point numbers:
http://tools.ietf.org/html/rfc4627
2) most JSON encoders and decoders do reasonable and unreasonable things with NaN and Inf values. The worst ones encode them as zero. More bad
mushrooms.
There is a quick survey at: http://lavag.org/topic/16217-cr-json-labview/?p=99058
<pre>
Some Javascript engines allow it since it is valid Javascript but not valid Json however there is no concensus.
cmj-JSON4Lua: raw tostring() output (invalid JSON).
dkjson: 'null' (like in the original JSON-implementation).
Fleece: NaN is 0.0000, freezes on +/-Inf.
jf-JSON: NaN is 'null', Inf is 1e+9999 (the encode_pretty function still outputs raw tostring()).
Lua-Yajl: NaN is -0, Inf is 1e+666.
mp-CJSON: raises invalid JSON error by default, but runtime configurable ('null' or Nan/Inf).
nm-luajsonlib: 'null' (like in the original JSON-implementation).
sb-Json: raw tostring() output (invalid JSON).
th-LuaJSON: JavaScript? constants: NaN is 'NaN', Inf is 'Infinity' (this is valid JavaScript?, but invalid JSON).
</pre>
For the MIDAS JSON encoder (and decoder) I have several choices:
a) encode NaN and Inf using the printf("%f") encoding (as strings, making it valid JSON)
b) encode NaN and Inf as strings using the Javascript special values: "NaN", "Infinity" and "-Infinity", see
http://www.w3schools.com/jsref/jsref_positive_infinity.asp
I note that the Python JSON encoder does (b), see section 18.2.3.3 at http://docs.python.org/2/library/json.html
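For illustration, a minimal sketch of option (b) (a hypothetical helper, not the actual midas encoder):

#include <math.h>
#include <stdio.h>

// encode one double as a JSON value: finite numbers as plain numbers,
// NaN and +/-Inf as the Javascript-style string values
static void json_encode_double(FILE* f, double v)
{
   if (isnan(v))
      fputs("\"NaN\"", f);
   else if (isinf(v))
      fputs(v > 0 ? "\"Infinity\"" : "\"-Infinity\"", f);
   else
      fprintf(f, "%.16g", v);   // plain number, valid JSON
}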
In either case, behaviour of the JSON decoder on the Javascript side needs to be tested. (Silent conversion to value of zero is not acceptable).
If anybody has a suggestion on this, please let me know.
P.S. If you do not know all about NaN, Inf, "-0" and other floating point funnies, please read: https://www.ualberta.ca/~kbeach/phys420_580_2010/docs/ACM-Goldberg.pdf
P.P.S. If you ever used the type "float" or "double", used the "/" operator or the function "sqrt()" you also should read that reference.
K.O. |
917 | 01 Oct 2013 | Konstantin Olchanski | Info | MacOS select() problem |
The following code found in mhttpd does not work on MacOS (BSD UNIX).
On Linux, the do-loop will finish after 2 seconds as expected. On MacOS (and other BSD systems), it will
loop forever.
The cause is the MIDAS watchdog alarm() signal that fires every 1 second and always interrupts the 2
second sleep of select(). The Linux select() updates its timeout argument to reflect the time already slept, so
eventually we finish. The MacOS (BSD) select() does not update the timeout argument, so select() goes back
to sleep for another 2 seconds (to be again interrupted half-way through).
The POSIX standard (specification for select() & co) permits either behaviour. Compare "man select" on
MacOS and on Linux.
If the select() timeout were not 2 seconds but 0.9 seconds, or if the MIDAS watchdog alarm fired every
2.1 seconds, this problem would not exist.
I think there are several places in MIDAS with code like this. An audit is required.
{
   FD_ZERO(&readfds);
   FD_SET(_sock, &readfds);

   timeout.tv_sec = 2;
   timeout.tv_usec = 0;

   do {
      status = select(FD_SETSIZE, &readfds, NULL, NULL, &timeout);
      /* if an alarm signal was caught, restart with reduced timeout */
   } while (status == -1 && errno == EINTR);
}
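One portable way around this (a sketch, not current midas code) is to recompute the remaining timeout from an absolute deadline, so it does not matter whether select() updates its timeout argument:

{
   time_t deadline = time(NULL) + 2;

   do {
      time_t remain = deadline - time(NULL);
      if (remain < 0)
         remain = 0;

      FD_ZERO(&readfds);
      FD_SET(_sock, &readfds);
      timeout.tv_sec = remain;
      timeout.tv_usec = 0;

      status = select(FD_SETSIZE, &readfds, NULL, NULL, &timeout);
      /* if an alarm signal was caught, retry with the remaining part of the 2 second timeout */
   } while (status == -1 && errno == EINTR && time(NULL) < deadline);
}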
K.O. |
918 | 09 Oct 2013 | Konstantin Olchanski | Info | ODB JSON support |
> > odbedit can now save ODB in JSON-formatted files.
> A bug was reported in my JSON ODB encoder: NaN values are not encoded correctly.
Tested the browser-builtin JSON.stringify() function in google-chrome, firefox, safari, opera:
everybody encodes numeric values NaN and Inf as JSON value [null].
To me, this clearly demonstrates a severe defect in the JSON standard and in its Javascript implementation:
a) NaN, Inf and -Inf are valid, useful and commonly used numeric values defined by the IEEE754/854 standard (as opposed to the special value "-0", which is also defined by the standard, but is not nearly as useful)
b) they are all distinct numeric values, encoding them all into the same JSON value [null] is the same as encoding all even numbers into the JSON value [42].
c) on the decoding end, JSON value [null] is decoded into Javascript value [null], which works as 0 for numeric computation, so effectively NaN, Inf and -Inf are made equal to zero. A neat trick.
Note that (c) - NaN, Inf is same as 0 - eventually produces incorrect numerical results by breaking the IEEE754/854 standard specification that number+NaN->NaN, number+infinity->infinity, etc.
In MIDAS we have a requirement that results be numerically correct: if an ODB value is "infinity", the corresponding web page should not show "0".
In addition we have a requirement that JSON encoding should be lossless: i.e. ODB contents encoded by JSON should decode back into the same ODB contents.
To satisfy both requirements, I now encode NaN, Inf and -Inf as JSON string values "NaN", "Infinity" and "-Infinity". (Corresponding to the respective Javascript values).
Notes:
1) this is valid JSON
2) it survives decode/encode in the browser (ODBMCopy()/JSON.parse/modify some values/JSON.stringify/ODBMPaste() does not destroy these special values)
3) it is numerically correct for "NaN" values (Javascript [1+"NaN"] -> NaN)
4) it fails in an obvious way for Inf and -Inf values (Javascript [1+"Infinity"] is NaN instead of Infinity).
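On the decoding side, a minimal sketch (a hypothetical helper, not the actual midas decoder) of mapping these string values back to IEEE754 doubles:

#include <math.h>
#include <string.h>

// returns 1 and fills *v if the JSON string is one of the special numeric values
static int json_decode_special(const char* s, double* v)
{
   if (strcmp(s, "NaN") == 0)       { *v = NAN;       return 1; }
   if (strcmp(s, "Infinity") == 0)  { *v = INFINITY;  return 1; }
   if (strcmp(s, "-Infinity") == 0) { *v = -INFINITY; return 1; }
   return 0;
}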
https://bitbucket.org/tmidas/midas/commits/82dd203cc95dacb6ec9c0a24bc97ffd45bb58427
K.O. |
919 | 22 Oct 2013 | Konstantin Olchanski | Info | midas programs "auto start", etc |
MIDAS "programs" settings include: /programs/xxx/"auto start", "auto restart" and "auto stop". What do they do?
"auto start":
if set to "y", the program's "start command" will be unconditionally executed at the beginning of the run
start transition.
Because there are no checks or tests, the "start command" will be executed even if the program is already
running. It means that this function cannot be used to start frontend programs - a new copy will be
started each time, and a previously running copy will be killed.
Also the timing of the program startup and run transition is wrong - in my tests, the program starts too
late to see the run transition. If the program is a frontend, it will never see the begin-of-run transition.
1st conclusion: "auto start" should be "n" for frontend programs and for any other programs that are
supposed to be continuously running (mlogger, lazylogger, etc).
2nd conclusion: "auto start" does the same thing as "/programs/execute on start run".
"auto stop":
if set to "y", the program will be stopped after the end of run. (using cm_shutdown).
"auto restart":
this has nothing to do with starting and stopping runs. Instead, it works in conjunction with the alarm
system and the "program is not running" alarm.
The alarm system periodically calls al_check(). al_check() checks all programs defined under /Programs to
see if they are running (using cm_exist()). If a program is not running and an alarm is defined, the alarm is
raised ("program is not running" alarm). If there is a start command and "auto restart" is set to "y", the
start command is executed.
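In pseudo-midas terms, the "auto restart" logic amounts to something like this (a simplified sketch of what al_check() does, not the actual code; the program name is an example):

// periodically, for each entry under /Programs:
if (cm_exist("mlogger", FALSE) != CM_SUCCESS) {
   // raise the "program is not running" alarm if one is defined, then,
   // if "auto restart" is "y" and a start command is set, run it locally
   // (inside the mserver for remote clients)
   ss_system(start_command);
}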
When using these "auto start" and "auto restart" functions, one needs to be careful about the context
where the start command will be executed: midas clients may be running from different directories, under
different user names and on different computers.
In "auto start", the start command is executed from cm_transition. For remote clients, this will happen on
the remote computer. (against the expectation that the program will be started on the main computer).
In "auto restart", the start command is executed by al_check() which always runs locally (for remote
clients, it runs inside the mserver). So the started program will always run on the main computer, but
maybe not in the same directory as when started from the mhttpd "programs -> start" button.
Conclusion:
"programs auto start" : works but has strange interactions and side effects, do not use it.
"programs auto stop" : works, can be used to stop programs at the end of run (but what for?)
"programs auto restart" : works, seems to work correctly, can be used to auto restart mlogger, frontends,
etc.
K.O. |
920 | 22 Oct 2013 | Konstantin Olchanski | Info | audit of db_get_record() |
Record-oriented ODB functions db_create_record(), db_get_record(), db_check_record() and db_set_record() require special attention to the consistency between their "C struct"s (usually defined in midas.h), their initialization strings (usually defined in midas.h) and the contents of ODB.
When these 3 items become inconsistent, the corresponding midas functions tend to break.
Unlike ODB internal structures and event buffer internal structures, these record-oriented functions are
not part of the midas binary-compatibility abi and they are not protected by db_validate_sizes().
From time to time, new items are added to some of these data structures. Usually this does not cause
problems, but recently we had some difficulty with the runinfo and equipment structures, prompting this
audit.
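As a reminder of the pattern being audited, here is a minimal sketch with a hypothetical struct and init string (not taken from midas.h; assumes midas.h is included and hDB is a valid ODB handle). The three pieces that must stay consistent are the C struct, the init string and the ODB subtree:

typedef struct {
   INT  state;            // must match "State = INT : 1"
   char comment[80];      // must match "Comment = STRING : [80] ..."
} MY_SETTINGS;

static const char *my_settings_str[] = {
   "[.]",
   "State = INT : 1",
   "Comment = STRING : [80] none",
   NULL
};

// typical use: make sure the ODB record matches the C struct, then read it
HNDLE hkey;
MY_SETTINGS my;
INT size = sizeof(my);
db_check_record(hDB, 0, "/Test/My Settings", strcomb(my_settings_str), TRUE);
db_find_key(hDB, 0, "/Test/My Settings", &hkey);
db_get_record(hDB, hkey, &my, &size, 0);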
db_check_record: note: (C) means that this record is created there
alarm.c: alarm_odb_str(C)
mana.c: skipped
mfe.c: equipment_common_str, equipment_statistics_str(C), event_descrip(C), bank_list(C)
mhttpd.cxx: cgif_label_str(C), cgif_bar_str(C), runinfo_str(C), equipment_common_str(C)
mlogger.cxx: ch_settings_str(C)
sequencer.cxx: sequencer_str(C)
db_create_record:
alarm.c: alarm_odb_str, alarm_periodic_str, alarm_class_str
fal.c: skipped
mfe.c: equipment_common_str
midas.c: program_info_str (maybe)
odb.c: (maybe)
lazylogger.cxx: lazy_settings, lazy_statistics
mhttpd.cxx: runinfo_str
mlogger.cxx: chn_settings_str
db_get_record: (hard to do with grep, will have to check every db_get_record by hand)
alarm.c: alarm, class, program_info
fal.c: skipped
mana.c: skipped
midas.c: program_info
odb.c: (maybe)
lazylogger.cxx: lazyst
mhttpd.cxx: runinfo, equipment, ?hkeytemp?, chn_settings, chn_stats, ?label?, ?bar?
mlogger.cxx: ?, ?, chn_stats, chn_settings
sequencer.cxx: hkeyseq
db_set_record:
alarm.c: hkeyalarm, hkeyclass, ???, program_info
fal.c: skipped
mana.c: skipped
mfe.c: equipment_info, ?event structure?
odb.c: (maybe)
lazylogger.cxx: lazyst
mlogger.cxx: chn_stat
sequencer.cxx: seq
db_open_record: note: (W) means MODE_WRITE
fal.c: skipped
mana.c: skipped
mfe.c: equipment_info, equipment_stats(W)
midas.c: requested_transition
odbedit.c: key_update - generic test of hotlink
lazylogger.cxx: runstate, lazyst(W), lazy?
mlogger.cxx: history, chn_statistics, chn_settings
sequencer.cxx: seq
K.O. |
921 | 25 Oct 2013 | Konstantin Olchanski | Bug Fix | fixed mlogger run auto restart bug |
A problem existed in midas for some time: when recording long data sets of time (or event) limited runs
with logger run auto restart set to "yes", the runs will automatically stop and restart as expected, but
sometimes a run will stop and never restart, and beam time is lost until the experiment operator on shift
wakes up and restarts the run manually.
I have now traced this problem to a race condition inside the mlogger - when a run is being stopped from
the mlogger, the mlogger run transition handler (tr_stop) triggers an immediate attempt to start the next
run, without waiting for the run-stop transition to actually complete. If the run-stop transition does not
finish quickly enough, a safety check in start_the_run() will cause the run restart attempt to silently fail
without any error message.
This race condition is pretty rare but somehow I managed to replicate it while debugging the
multithreaded transitions. It is fixed by making mlogger wait until the run-stop transition completes.
https://bitbucket.org/tmidas/midas/commits/b2631fbed5f7b1ec80e8a6c8781ada0baed7702b
K.O. |
922 | 25 Oct 2013 | Konstantin Olchanski | Info | MacOS select() problem |
> The following code found in mhttpd does not work on MacOS (BSD UNIX). ...
Because of this problem, on MacOS, run transitions can get stuck forever - most timeouts do not work. (Specifically, recv_string() never times out)
K.O. |
924 | 28 Oct 2013 | Konstantin Olchanski | Bug Fix | fixed mlogger run auto restart bug |
>
> More generally I kind of consider the mlogger auto restart facility as deprecated. It works in the background and the operator does not have a clue
> what is going on. We use now the sequencer to achieve exactly the same functionality.
>
Before subruns were available, most experiments at TRIUMF used the "auto restart" function. Now I think most of them use subruns,
with the notable exception of PIENU, where the analysis framework could not handle subruns. (PIENU is now shut down and disassembled).
>
> It just requires a few lines of sequencer code:
>
> LOOP INFINITE
> TRANSITION start
> WAIT events, 5000
> TRANSITION stop
> ENDLOOP
>
Mouse click "auto restart" to "yes" is a little bit simpler than setting up a sequencer file, and it survives a crash of mhttpd.
Does the sequencer survive a crash or a restart of mhttpd?
K.O. |
929 | 11 Nov 2013 | Konstantin Olchanski | Forum | Installation problem |
> I run into problems while trying to install Midas on Slackware 14.0.
Thank you for reporting this. We do not have any slackware computers, so we do not usually see these messages.
We use SL/RHEL 5/6 and MacOS for most development, plus we now have an Ubuntu test machine, where I see a
whole different spew of compiler messages.
Most of the messages are:
a) useless compiler whining:
src/midas.c: In function 'cm_transition2':
src/midas.c:3769:74: warning: variable 'error' set but not used [-Wunused-but-set-variable]
b) an actual error in fal.c:
src/fal.c:131:0: warning: "EQUIPMENT_COMMON_STR" redefined [enabled by default]
c) actual error in fal.c: assignment into string constant is not permitted: char*x="aaa"; x[0]='c'; // core dump
src/fal.c:383:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
these are fixed by making sure all such pointers are "const char*" and the corresponding midas function
arguments are also "const char*".
d) maybe an error (gcc sometimes gets this one wrong)
./mscb/mscb.c: In function 'int mscb_info(int, short unsigned int, MSCB_INFO*)':
./mscb/mscb.c:1682:8: warning: 'size' may be used uninitialized in this function [-Wuninitialized]
> Apparently there is a problem with a shared library which should be on the
> system, I think make checks for /usr/include/mysql and then supposes that
> libodbc.so should be on disk. I don't know why on my system it is not.
g++ -g -O2 -Wall -Wno-strict-aliasing -Wuninitialized -Iinclude -Idrivers -I../mxml -I./mscb -Llinux/lib -
DHAVE_FTPLIB -D_LARGEFILE64_SOURCE -DHAVE_MYSQL -I/usr/include/mysql -DHAVE_ODBC -DHAVE_SQLITE
-DHAVE_ROOT -pthread -m64 -I/home/exodaq/root_5.35.10/include -DHAVE_ZLIB -DHAVE_MSCB -DOS_LINUX
-fPIC -Wno-unused-function -o linux/bin/mhttpd linux/lib/mhttpd.o linux/lib/mgd.o linux/lib/mscb.o
linux/lib/sequencer.o linux/lib/libmidas.a linux/lib/libmidas.a -lodbc -lsqlite3 -lutil -lpthread -lrt -lz -lm
/usr/lib64/gcc/x86_64-slackware-linux/4.7.1/../../../../x86_64-slackware-linux/bin/ld: cannot find -lodbc
The ODBC library is not found (shared .so or static .a).
The Makefile check is for /usr/include/sql.h (usually part of the ODBC package). On the command line above,
HAVE_ODBC is set, and the rest of MIDAS compiled okay, so the ODBC header files at least are present. But why
is the library not found?
I do not know how slackware packages this stuff and I do not have a slackware system to check what it
should look like, so I cannot suggest anything other than commenting out "HAVE_ODBC := ..." in the
Makefile.
> But I was wondering if I have some other problems (configuration problem?)
> because I get a very large number of warnings. My last installation of Midas is
> like from two years ago but I don't remember getting many warnings.
There are no "many warnings". Mostly it's just one same warning repeated many times that complains about
perfectly valid code:
src/midas.c: In function 'cm_transition':
src/midas.c:4388:19: warning: variable 'tr_main' set but not used [-Wunused-but-set-variable]
They complain about code like:
{ int i=foo(); ... } // yes, "i" is not used, but you have to keep it if you want to be able to see the return value
of foo() in gdb.
> Do I do something obviously wrong?
Not you. The GCC people turned on one more noisy junk warning.
> Thanks a lot!
No idea about your missing ODBC library, I do not even know how to get a package listing on slackware (and
proud of it).
But if you do know how to get a package listing for your odbc package, please send it here. On RHEL/SL, I would
do:
rpm -qf /usr/include/sql.h ### find out the name of the package that owns this file
rpm -ql xxx ### list all files in this package
K.O. |
930 | 11 Nov 2013 | Konstantin Olchanski | Forum | Installation problem |
> > I run into problems while trying to install Midas on Slackware 14.0.
>
> b) an actual error in fal.c:
>
> src/fal.c:131:0: warning: "EQUIPMENT_COMMON_STR" redefined [enabled by default]
>
> c) actual error in fal.c: assignment into string constant is not permitted: char*x="aaa"; x[0]='c'; // core dump
>
> src/fal.c:383:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
>
> these are fixed by making sure all such pointers are "const char*" and the corresponding midas functions are
the warnings in fal.c are now fixed.
K.O. |
932 | 13 Nov 2013 | Konstantin Olchanski | Forum | Installation problem |
> > I run into problems while trying to install Midas on Slackware 14.0.
>
> Thank you for reporting this. We do not have any slackware computers so we cannot see these message usually.
>
>
> src/midas.c: In function 'cm_transition2':
> src/midas.c:3769:74: warning: variable 'error' set but not used [-Wunused-but-set-variable]
>
got around to looking at the compile messages on Ubuntu: in addition to "variable 'error' set but not used" we have these:
warning: ignoring return value of 'ssize_t write(int, const void*, size_t)'
warning: ignoring return value of 'ssize_t read(int, void*, size_t)'
warning: ignoring return value of 'int setuid(__uid_t)'
and a few more of a similar kind.
K.O. |
935 | 14 Nov 2013 | Konstantin Olchanski | Forum | Installation problem |
# slackpkg file-search sql.h
[ installed ] - libiodbc-3.52.7-x86_64-2
...
# slackpkg search package
...
# cat /var/log/packages/libiodbc-3.52.7-x86_64-2
usr/include/sql.h
...
usr/lib64/libiodbc.so.2.1.19
...
Thanks, I am saving the slackpkg commands for future reference. Looks like the immediate problem is
with the library name: libiodbc instead of libodbc. But the header file sql.h is the same.
I am not sure if it is worth making a generic solution for this: on MacOS, all ODBC functions are now
obsoleted, to be removed, and since we are standardized on MySQL anyway, I think I will rewrite the SQL
history driver to use the MySQL interface directly. Then all this ODBC extra layering will go away.
K.O. |
936 | 14 Nov 2013 | Konstantin Olchanski | Forum | Installation problem |
> #include "use.h"
> { USED int i=foo(); }
Sounds nifty, but google does not find use.h.
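Presumably such a header would just be a wrapper around the gcc "unused" attribute, something like this (a guess, not the actual use.h):

/* hypothetical use.h */
#ifdef __GNUC__
#define USED __attribute__((unused))   /* tell gcc the variable is intentionally kept */
#else
#define USED
#endif

/* usage: USED int i = foo();   // keeps "i" visible in gdb without the -Wunused warning */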
As for unused variables, some can be removed, others not so much - there is code like this in there:
   int i = blah...
#if 0
   if (i == 42) printf("wow, we got a 42!\n");
#endif

and

   if (0) printf("debug: i=%d\n", i);
(the difference: if you remove "i" or otherwise break the disabled debug code, "#if 0" will only complain the next time you need that debugging code, while "if (0)" will
complain right away).
Some of this disabled debug code I would rather not remove - I have added, removed, added again and removed again so much debug scaffolding in the same
places that I cannot be bothered with removing it anymore. I "#if 0" it and it stays there until I need it next time. But of course now gcc complains about it.
K.O. |
937 | 14 Nov 2013 | Konstantin Olchanski | Bug Report | MacOS10.9 strlcpy() problem |
On MacOS 10.9 MIDAS crashes in strlcpy() somewhere inside odb.c. We think this is because strlcpy()
in MacOS 10.9 was changed to abort() if input and output strings overlap. For overlapping memory one is
supposed to use memmove(). This is fixed in current midas; for older versions, you can try this patch:
konstantin-olchanskis-macbook:midas olchansk$ git diff
diff --git a/src/odb.c b/src/odb.c
index 1589dfa..762e2ed 100755
--- a/src/odb.c
+++ b/src/odb.c
@@ -6122,7 +6122,10 @@ INT db_paste(HNDLE hDB, HNDLE hKeyRoot, const char *buffer)
pc++;
while ((*pc == ' ' || *pc == ':') && *pc)
pc++;
- strlcpy(data_str, pc, sizeof(data_str));
+
+ //strlcpy(data_str, pc, sizeof(data_str)); // MacOS 10.9 does not permit strlcpy() of overlapping strings
+ assert(strlen(pc) < sizeof(data_str)); // "pc" points at a substring inside "data_str"
+ memmove(data_str, pc, strlen(pc)+1);
if (n_data > 1) {
data_str[0] = 0;
konstantin-olchanskis-macbook:midas olchansk$
As historical reference:
a) MacOS documentation says "behavior is undefined", which is no longer true, the behaviour is KABOOM!
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/strlcpy.3.html
b) the original strlcpy paper from OpenBSD does not contain the word "overlap"
http://www.courtesan.com/todd/papers/strlcpy.html
c) the OpenBSD man page says the same as Apple man page (behaviour undefined)
http://www.openbsd.org/cgi-bin/man.cgi?query=strlcpy
d) the linux kernel strlcpy() uses memcpy() and is probably unsafe for overlapping strings
http://lxr.free-electrons.com/source/lib/string.c#L149
e) midas strlcpy() looks to be safe for overlapping strings.
K.O. |
938 | 15 Nov 2013 | Konstantin Olchanski | Bug Report | stuck data buffers |
We have seen several times a problem with stuck data buffers. The symptoms are very confusing -
frontends cannot start and instead hang forever in a state that is very hard to kill. Also "mdump -s -d -z
BUF03" for the affected data buffers is stuck.
We have identified the source of this problem - the semaphore for the buffer is locked and nobody
will ever unlock it. MIDAS relies on a feature of SYSV semaphores where they are automatically
unlocked by the OS and can never be stuck (see "man semop", the SEM_UNDO function).
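For reference, this is the SysV feature in question (a minimal sketch matching the semtimedop() call seen in the strace output below, not the actual midas code; the semaphore id is the example value from the steps below):

#define _GNU_SOURCE           /* for semtimedop() on Linux */
#include <time.h>
#include <sys/sem.h>

void lock_buffer_semaphore(void)
{
   int semid = 9633800;                       /* semaphore id, as seen in ipcs / strace */
   struct sembuf   sb = { 0, -1, SEM_UNDO };  /* SEM_UNDO: kernel undoes the lock if the process dies */
   struct timespec ts = { 1, 0 };             /* 1 second timeout, retried in a loop */
   semtimedop(semid, &sb, 1, &ts);
}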
I think this SEM_UNDO function is broken in recent Linux kernels and sometimes the semaphore
remains locked after the process that locked it has died. MIDAS is not programmed to deal with this
situation and the stuck semaphore has to be cleared manually.
Here, "BUF3" is used as example, but we have seen "SYSTEM" and ODB with stuck semaphores, too.
Steps:
a) confirm that we are using SYSV semaphores: "ipcs" should show many semaphores
b) identify the stuck semaphore: "strace mdump -s -d -z BUF03".
c) there will be a large printout, but ultimately you will see repeated entries of
"semtimedop(9633800, {{0, -1, SEM_UNDO}}, 1, {1, 0}^C <unfinished ...>"
d) erase the stuck semaphore "ipcrm -s 9633800", where the number comes from semtimedop() in
the strace output.
e) try again: "mdump -s -d -z BUF03" should work now.
Ultimately, I think we should switch to POSIX semaphores - they are easier to manage (the strace
and ipcrm dance becomes "rm /dev/shm/deap_BUF03.sem") - but they do not have the SEM_UNDO
function, so detection of locked and stuck semaphores will have to be done by MIDAS. (Unless we
can find some library of semaphore functions that already provides such advanced functionality).
K.O. |
939 | 20 Nov 2013 | Konstantin Olchanski | Bug Report | Too many bm_flush_cache() in mfe.c |
I was looking at something in the mserver and noticed that for remote frontends, for every periodic event,
there are about 3 RPC calls to bm_flush_cache().
Sure enough, in mfe.c::send_event(), for every event sent, there are 2 calls to bm_flush_cache() (once for
the buffer we used, a second time for all buffers). Then, for good measure, the mfe idle loop calls
bm_flush_cache() for all buffers about once per second (even if no events were generated).
So what is going on here? To allow good performance when processing many small events,
the MIDAS event buffer code (bm_send_event()) buffers small events internally, and only after this internal
buffer is full, the accumulated events are flushed into the shared memory event buffer,
where they become visible to the mlogger, mdump and other consumers.
Because of this internal buffering, infrequent small size periodic events can become
stuck for quite a long time, confusing the user: "my frontend is sending events, how come I do not
see them in mdump?"
To avoid this, mfe.c manually flushes these internal event buffers by calling bm_flush_cache().
And I think that works just fine for frontends directly connected to the shared memory; one call to
bm_flush_cache() should be sufficient.
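Schematically, the direct-attach case looks like this (a simplified sketch of the mfe.c pattern, not the actual code; buffer_handle, pevent and SYNC come from the surrounding frontend code):

   // send the event into the buffer's write cache, then flush the cache so that
   // consumers (mlogger, mdump) see the event right away
   bm_send_event(buffer_handle, pevent, pevent->data_size + sizeof(EVENT_HEADER), SYNC);
   bm_flush_cache(buffer_handle, SYNC);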
But for remote frontends connected through the mserver, it turns out there is a race condition between
sending the event data on one tcp connection and sending the bm_flush_cache() rpc request on another
tcp connection.
I see that the mserver always reads the rpc connection before the event connection, so bm_flush_cache()
is done *before* the event is written into the buffer by bm_send_event(). So the newly
sent event is stuck in the buffer until the bm_flush_cache() for the *next* event shows up:
mfe.c: send_event1 -> flush -> ... wait until next event ... -> send_event2 -> flush
mserver: flush -> receive_event1 -> ... wait ... -> flush -> receive_event2 -> ... wait ...
mdump -> ... nothing ... -> ... nothing ... -> event1 -> ... nothing ...
Enter the 2nd call to bm_flush_cache in mfe.c (flush all buffers) - now because mserver seems to be
alternating between reading the rpc connection and the event connection, the race condition looks like
this:
mfe.c: send_event -> flush -> flush
mserver: flush -> receive_event -> flush
mdump: ... -> event -> ...
So in this configuration, everything works correctly, the data is not stuck anywhere - but by accident, and
at the price of an extra rpc call.
But what about the periodic 1/second bm_flush_cache() on all buffers? I think it does not quite work
either because the race condition is still there: we send an event, and the first flush may race it and only
the 2nd flush gets the job done, so the delay between sending the event and seeing it in mdump would be
around 1-2 seconds. (no more than 2 seconds, I think). Since users expect their events to show up "right
away", a 2 second delay is probably not very good.
Because periodic events are usually not high rate, the current situation (4 network transactions to send 1
event - 1x send event, 3x flush buffer) is probably acceptable. But this definitely sets a limit on the
maximum rate to 3x (2x?) the mserver rpc latency - without the rpc calls to bm_flush_cache() there
would be no limit - the events themselves are sent through a pipelined tcp connection without
handshaking.
One solution to this would be to implement a periodic bm_flush_cache() in the mserver, making all calls to
bm_flush_cache() in mfe.c unnecessary (unless it is a direct connection to shared memory).
Another solution could be to send events with a special flag telling the mserver to "flush the buffer right
away".
P.S. Look ma!!! A race condition with no threads!!!
K.O. |