Back Midas Rome Roody Rootana
  Midas DAQ System, Page 124 of 137  Not logged in ELOG logo
ID Date Author Topic Subject
  261   30 May 2006 Konstantin OlchanskiBug Reportbadness with vxworks/ppc
It appears that the latest version of MIDAS malfunctions on PowerPC/VxWorks
machines, below are two problem reports. As reported, previous versions of MIDAS
work fine, I guess that reduces the probability of it being buggy user code. At
least one of the problems feels like a missing endian conversion somewhere, but
I am not aware of any recent changes in the MIDAS RPC code... We will be trying
to debug both problems, but any insight would be greatly appreciated.

K.O.


From suz@triumf.ca  Tue May 30 16:58:16 2006
Date: Tue, 30 May 2006 16:58:16 -0700 (PDT)
From: Suzannah Daviel <suz@triumf.ca>
To: konstantin olchanski <olchansk@triumf.ca>
Subject: rpc problems

Hi Konstantin,

Herewith a description of the problems,

Suzannah

Problem on system A:
--------------------

After upgrading the Linux operating system from RH9 to SL4, and installing
latest Midas software, the first time a manual trigger is issued, the VxWorks
frontend (running
on a PPC) crashes:


Output on PPC consol:

trigger histo event from status page

rpc_client_accept: starting with sock:11

program
Exception current instruction address: 0x01ac7388
Machine Status Register: 0x0008b030
Condition Register: 0x24000082
Task: 0x1b47908 "mfe"



The histo event is usually large so is fragmented. It is sent out by a
manual trigger and at end of run. When the run is ended (before an event
request using a manual trigger so program has not yet crashed) the histo
event is sent successfully.

After returning to the previous version of Midas but still running SL4,
this problem disappeared.




Problem on system B:
--------------------

Again, SL9 was installed, and the Midas software updated to the latest.
When sending a periodic (non-fragmented) event, after a while, one of the
parameters appears to become corrupted, and a lot of rpc_call error
messages appear. These continue while data is still successfully sent out
until the run is ended.



Tue May  9 05:20:29 2006 [Mdarc] *** data saved in file
/is01_data/bnmr/dlog/2006/040377.msr_v5 at Tue May  9 05:20:29
2006 (SN=5) ***
Tue May  9 05:21:30 2006 [Mdarc] *** data saved in file
/is01_data/bnmr/dlog/2006/040377.msr_v6 at Tue May  9 05:21:30
2006 (SN=6) ***
Tue May  9 05:22:31 2006 [Mdarc] *** data saved in file
/is01_data/bnmr/dlog/2006/040377.msr_v7 at Tue May  9 05:22:31
2006 (SN=7) ***

Tue May  9 05:23:12 2006 [feBNMR] [midas.c:9325:rpc_call] parameters
(1099059848) too large for network buffer
(524344); param_size=1099059808
Tue May  9 05:23:12 2006 [feBNMR] [midas.c:9325:rpc_call] parameters
(1099059848) too large for network buffer
(524344); param_size=1099059808
........................................
Tue May  9 05:23:31 2006 [feBNMR] [midas.c:9325:rpc_call] parameters
(1099059848) too large for network buffer
(524344); param_size=1099059808
Tue May  9 05:23:32 2006 [feBNMR] [midas.c:9325:rpc_call] parameters
(1099059848) too large for network buffer
(524344); param_size=1099059808

Tue May  9 05:23:32 2006 [Mdarc] *** data saved in file
/is01_data/bnmr/dlog/2006/040377.msr_v8 at Tue May  9 05:23:32
2006 (SN=8) ***

Tue May  9 05:23:32 2006 [feBNMR] [midas.c:9325:rpc_call] parameters
(1099059848) too large for network buffer
(524344); param_size=1099059808
Tue May  9 05:23:33 2006 [feBNMR] [midas.c:9325:rpc_call] parameters
(1099059848) too large for network buffer
(524344); param_size=1099059808
etc.

Another example showing that the corrupted parameter varies in size:

Thu Apr 13 19:00:00 2006 [mhttpd] Run #30005 started
Thu Apr 13 19:00:08 2006 [Mdarc] *** Saved data file
/is01_data/bnmr/dlog/2006/030005.msr_v1 at Thu Apr 13 19:00:08 2006 ***
Thu Apr 13 19:01:10 2006 [Mdarc] *** Saved data file
/is01_data/bnmr/dlog/2006/030005.msr_v2 at Thu Apr 13 19:01:10 2006 ***
Thu Apr 13 19:02:14 2006 [Mdarc] *** Saved data file
/is01_data/bnmr/dlog/2006/030005.msr_v3 at Thu Apr 13 19:02:14 2006 ***
Thu Apr 13 19:03:20 2006 [Mdarc] *** Saved data file
/is01_data/bnmr/dlog/2006/030005.msr_v4 at Thu Apr 13 19:03:20 2006 ***
Thu Apr 13 19:04:22 2006 [Mdarc] *** Saved data file
/is01_data/bnmr/dlog/2006/030005.msr_v5 at Thu Apr 13 19:04:22 2006 ***
Thu Apr 13 19:05:12 2006 [feBNMR] [midas.c:9323:rpc_call] parameters
(1077739560) too large for network buffer
(524344)
Thu Apr 13 19:05:13 2006 [feBNMR] [midas.c:9323:rpc_call] parameters
(1077739560) too large for network buffer
(524344)
etc.
  260   25 May 2006 Pierre-Andre AmaudruzBug FixFixed compiler warnings with gcc 3.4.4

Stefan Ritt wrote:
I fixed a couple of compiler warning which came up with the new gcc 3.4.4. Seems like the compiler gets more and more picky. There a still warning left in ybos.c and in mcnaf.c, which I leave to the original author Wink



Pierre-A. Amaudruz wrote:
>ybos.c, cnaf_callback.c, mcnaf.c, mana.c have been corrected too.
  259   25 May 2006 Stefan RittBug FixFixed compiler warnings with gcc 3.4.4
I fixed a couple of compiler warning which came up with the new gcc 3.4.4. Seems like the compiler gets more and more picky. There a still warning left in ybos.c and in mcnaf.c, which I leave to the original author Wink
  258   25 May 2006 Konstantin OlchanskiBug Fixfix crash in xml odb load
There is a crash in odbedit when loading some xml odb files: a missing check for NULL pointer when 
loading an array of strings and one of the array elements is blank. This check is present when loading 
other string values. Here is the diff:

-bash-3.00$ diff odb.c odb.c-new
5621c5621,5624
<                db_set_data_index(hDB, hKey, mxml_get_value(child), size, i, tid);
---
>                if (mxml_get_value(child) == NULL)
>                   db_set_data_index(hDB, hKey, "", size, i, tid);
>                else
>                   db_set_data_index(hDB, hKey, mxml_get_value(child), size, i, tid);

K.O.
  257   18 May 2006 Stefan RittBug FixFixed problems with reload of custom pages
We had a problem with custom pages and reloading of them. If they contain an ODB field which is editable, one can change the ODB value through the custom page. The URL then contains a "?cmd=Set&value=x&index=x" section, which stays in the browser's address bar after the ODB value has been updated. If the value changes later by some other means in the ODB, and one presses "reload" in the browser, the above URL gets executed again and the value gets changed back which is not wanted.

The problem has been fixed such that mhttpd redirects the browser after setting a variable to the URL not containing the "Set" command from above.
  256   18 May 2006 Konstantin OlchanskiBug Fixremoved a few "//" comments to fix compilation on VxWorks
Our VxWorks C compiler (gcc-2.8-something) does not like the "//" comments. Luckily, on VxWorks, we 
only compile a small subset of midas, so there is no point in banning all "//" comments. But I did have to 
convert a couple of them to /* commens */ in odb.c to make it compile. Changes to odb.c commited. K.O.
  255   11 May 2006 Konstantin OlchanskiBug ReportMIDAS and Fedora 4
Fellow Midasites- we are receiving reports that current Midas sources do not compile on Fedora 4 (and 5?) 
with errors "invalid lvalue in assignment". It looks like the new compilers reject what looks to my eye like 
perfectly valid C code that we have been writing since the beginning of C. Any suggestions on the best fix? 
K.O.
  254   08 May 2006 Stefan RittBug Reportcm_register_transition gyrations
> I am debugging a Rome-based DAQ system setup by Pierre A. (the system does not
> work because of bugs in Rome).
> 
> One problem I see is with my copy of cm_register_transition() in midas.c. Rome
> calls it with a NULL function to register a "queued" transition, but the
> cm_register_transition() code has changed around (rev 3051) to make NULL mean
> "unregister" a transition (this broke the queued transitions used by Rome), then
> it got changed back (rev 3085). Of course, I was stuck with the broken version,
> so Rome did not work at all, and it cost me real wall time to get to the bottom
> of all this, only to discover that this problem is already fixed. So-
> 
> I would greatly appreciate it if, in the future, changes (and bug fixes) to the
> MIDAS API were announced on this mailing list here.
> 
> K.O.

Yes you are right. I apologize. Fact was that I was not aware that anybody else uses
already ROME in online mode. Nevertheless, let me at least explain the reason for
that change:

Some experiments at PSI run a slow control front end, which talks to pretty slow
hardware, and thus can be nonresponsive for many seconds. Since each frontend by
default registers in the start and stop transitions, this frontend delayed the start
/stop of each run. To solve this problem in the short run, the frontend should not
register in the transition. Originally I implemented this by using the NULL function
pointer, until we figured out that ROME uses this to register (not de-register)
together with the cm_query_transition() function. Therefore a new function
cm_deregister_transition() was implemented and is used now by the slow frontends.

In the long run this will be solved by implementing multi-threaded frontends which
get one thread for each equipment and therefore do not block any transition anymore.
  253   07 May 2006 Konstantin OlchanskiBug Reportcm_register_transition gyrations
I am debugging a Rome-based DAQ system setup by Pierre A. (the system does not
work because of bugs in Rome).

One problem I see is with my copy of cm_register_transition() in midas.c. Rome
calls it with a NULL function to register a "queued" transition, but the
cm_register_transition() code has changed around (rev 3051) to make NULL mean
"unregister" a transition (this broke the queued transitions used by Rome), then
it got changed back (rev 3085). Of course, I was stuck with the broken version,
so Rome did not work at all, and it cost me real wall time to get to the bottom
of all this, only to discover that this problem is already fixed. So-

I would greatly appreciate it if, in the future, changes (and bug fixes) to the
MIDAS API were announced on this mailing list here.

K.O.
  252   07 May 2006 Konstantin OlchanskiBug FixUpdate & add VME drivers
I commited fixes for a few minor compilation errors in the VME drivers
(vmicvme.c, etc)
I also added new drivers for the v513 latch and v560 scaler that I wrote for
CERN-ALPHA.

(Maybe I should mention that we also have drivers for the SIS 3820 multiscaler,
the v895 VME discriminator and a few more modules. Will commit them as they mature).

K.O.
  251   27 Mar 2006 Sergio BallestreroInfosvn@savannah.psi.ch down ?
> I just tried now and it seemed to work fine. Do you still have the problem?
> 
> - Stefan

 The problem was still there this morning, shortly after seeing your mail, but seems
to be fixed now.
 BTW, which is the best way to submit patches ? I have a version of khyt1331 for Linux
kernel 2.6 (we are running Scientific Linux 4.1), and a few smaller things, mostly in
the examples. 

 Thanks, Sergio
  250   26 Mar 2006 Stefan RittInfosvn@savannah.psi.ch down ?
>  Hi,
> I was trying to update the checkout of Midas, but it looks like something is not
> working - maybe a component of the Savannah system:
> [sergio@daq-pc midas-SVN]$ svn update
> svn@savannah.psi.ch's password: svn
> unix dgram connect: Connection refused at /bin/cvssh.pl line 32
> no connection to syslog available at /bin/cvssh.pl line 32
> svn: Connection closed unexpectedly
> 
> my .svn/entries says (amongst the rest)
>  url="svn+ssh://svn@savannah.psi.ch/afs/psi.ch/project/meg/svn/midas/trunk"
> and yes, it used to work well... 
> 
> Cheers,
>   Sergio

I just tried now and it seemed to work fine. Do you still have the problem?

- Stefan
  249   23 Mar 2006 Sergio BallestreroInfosvn@savannah.psi.ch down ?
 Hi,
I was trying to update the checkout of Midas, but it looks like something is not
working - maybe a component of the Savannah system:
[sergio@daq-pc midas-SVN]$ svn update
svn@savannah.psi.ch's password: svn
unix dgram connect: Connection refused at /bin/cvssh.pl line 32
no connection to syslog available at /bin/cvssh.pl line 32
svn: Connection closed unexpectedly

my .svn/entries says (amongst the rest)
 url="svn+ssh://svn@savannah.psi.ch/afs/psi.ch/project/meg/svn/midas/trunk"
and yes, it used to work well... 

Cheers,
  Sergio
  248   03 Jan 2006 John O'DonnellInfoHow do I do custom event building?
At DANCE we have a similar issue.  We are still doing "software
handshaking" between multiple frontends (15 which read data, and 16th
with direct accessto the trigger logic), and we apply a time stamp
using gettimeofday().  We use the regular mevb, sorting on serial number.

In the analyzer (MIDAS or ROME) we then keep a big circular buffer of
event fragments, which are rebuilt into new events based on the time stamp
obtained from gettimeofday().  We keep the system clocks synchronized
(often to within about 1ms) using ntp (need to average over several
ntp servers to avoid issues with network noise).  ntp can take a while
to stabilize, so we never reboot our computers... (well almost never).
We have a slow control frontend which monitors the ntp time offsets and
put's them in the history system for easy visualization.

Occasionally we seem to get in a mess, but somehow this fixes itself on
the next run, so it has been a useable system.  Maybe one day we will
get hardware handshaking between the frontend computers and the trigger
logic, but in the meantime we are taking data.

John.
  247   03 Jan 2006 Stefan RittSuggestionHandling multiple identical USB devices
> Any thoughts?

I got an idea of how to solve this problem in an OS-independent manner. The USB
devices and hubs form a tree, like this

  Root  HUB
  0   1   2
  |   |   \__...
  |   \___
 DevY     \
         HUB
        0 1 2
        |   |
        |  DevX
       HUB
      0 1 2
          |
        DevZ

This tree can be considered as an ordered tree, if you read it from left to right.
In that order, the devices are orderd

DevY - DevZ - DevX

Since the devices are ordered, the "instance" parameter from musb_init can be used
to identify them uniquely, like

instance==0   => DevY
instance==1   => DevZ
instance==2   => DevX

So I would say that we can use the current API using the "instance" parameter to
uniquely access a device. All we have to do is to build that tree, sort it, and then
use the instance parameter as an entry to that tree. The sorting takes care of
different ordering, which can happen during enumeration (depeding on power-up
sequence, phase of the Moon etc.). So if you have three devices like above, DevZ
should alway be at "instance==1". The only problem is if you unplug DevY for
example, then you get the map

instance==0   => DevZ
instance==1   => DevX

which is different from above. But if you have a different number of devices, you
likely have to change your frontend cody anyhow, so you can change the device
mapping there as well. 

In order to simplify the code, I would not build a complete tree and sort it, but
scan the whole tree hierarchically, i.e. look at

Bus1/Port1
Bus1/Port2
Bus1/...
Bus2/Port1
Bus2/Port2
...

Since there is a maximum of toal 127 USB devices, this scan should be pretty quick.
If you find a device with matching vendor and product ID, you increment an internal
counter. If that counter matches your instance parameter, you open that device.

The ultimate solution of course is to put an additional address into each device, so
you can distinguish them easily. For a out-of-the box Web cam you probably have no
chance, but for the home-made MSCB nodes I put such an address into each node, so I
can distinguish them even if the have the same product and vendor ID.
  246   03 Jan 2006 Stefan RittBug Reportmhttpd "edit on start" broken for arrays
> If a variable under "/experiment/edit on start/" is an array, it is correctly
> offered for editing on the "start run page", but then all elements in the array
> end up set to the value of the first element.

You are right. This was was there from the beginning, you are just the first one
trying "edit on start" with an array. I applied your fix and committed to SVN
reviwion 3013.

Stefan
  245   30 Dec 2005 Konstantin OlchanskiBug Reportmhttpd "edit on start" broken for arrays
If a variable under "/experiment/edit on start/" is an array, it is correctly
offered for editing on the "start run page", but then all elements in the array
end up set to the value of the first element.

This appears to be an error in mhttpd.c:interprete(), in the "start dialog"
section. The non-working version in CVS reads:

               for (j = 0; j < key.num_values; j++) {
                  size = key.item_size;
                  sprintf(str, "x%d", n++);
                  db_sscanf(getparam(str), data, &size, j, key.type);
                  db_set_data_index(hDB, hsubkey, data, size + 1, j, key.type);
               }

the fix that works for me reads:
                  db_sscanf(getparam(str), data, &size, 0, key.type);

(notice: the argument "j" is replaced with "0").

The way I understand this, all array elements are encoded into individual HTTP
thingy strings, named sequentially x0, x1, ... and when we parse the values out
of them, the array index should never show up.

(Stefan, if you can, please commit a fix to svn).

K.O.
  244   28 Dec 2005 Konstantin OlchanskiSuggestionHandling multiple identical USB devices
When I wrote the musbstd.h "open" method, I kind of punted on the problem of
handling multiple identical USB devices. Instead of a real solution, I added an
"instance" parameter, which allows one to "open" the "first", "second", etc USB
device, as listed in a magic random system dependant order.

Normally, USB devices are identified by two 16-bit integers: manufacturer ID and
product ID (i.e. as reported by "lsusb"). This works well until one has more
than one "identical" device. Two years ago, I had 5 identical USB cameras
(optical alignement system for TRIUMF-TWIST); last year, I had multiple USB
serial adapters; today I have two identical USB-TPC interfaces.

Most of the time, the devices are plugged into the same USB ports, so
theoretically, one should be able to tell exactly which one is which ("upstream
camera is plugged into port 1, downstream camera is plugged into port 2"). But
in the magic system dependant enumeration order, they keep moving around,
depending on the order of enumeration, history of powering up and down, phase of
the Moon, etc.

So my generic "musbstd" method of "open first", "open second", etc turned out to
be completely disfunctional.

So far, I am unable to come up with a system independant solution. But I have a
solution for Linux and maybe for MacOSX:

1) on Linux, I can use the information parsed from /proc/bus/usb/devices to say
"please open the USB device on USB bus 1, port 1", the so called USB device
"path", as seen in the system log and in /sys/bus/usb/devices.

2) on MacOSX, I was unable to find a way to discover the USB topology, but they
seem to maintain an uint32_t "location", which they promise to keep at least
across reboots (did not check this yet).

3) Windows I did not look at yet.

So we have a choice:

a) use system dependant "musb_open_linux(usbpath,vendor,product)",
"musb_open_macosx(???,vendor,product)", etc

b) create order out of chaos by manually keeping a map of "instances" (first,
second, third device) to "persistant addresses". On Linux, it would be a file
containing something like this: "USB-TPC-0 is on bus1-port1, USB-TPC-1 is on
bus1-port2". Then again, I can say "please open USB-TPC interface instance 0" or
"instance 1", etc. There is a small difficulty with dealing with devices
temporarily or permanantly going away, or changing physical addresses ("I moved
the USB device from port 1 to port 3"). This could be handled by telling the
user "hmm... USB topology has changed, please delete the map file and try
again", or we could come up with something more user friendly.

Any thoughts?

P.S. For my immediate need (I need this tomorrow), I will write a
musb_open_linux(usbpath,vendor,product) function.

K.O.
  243   24 Dec 2005 Stefan RittBug Reportminor changes to run transition code
> I am now considering allowing the run to end even if some clients cannot be
> contacted. The begin, pause and resume transitions would continue to fail if
> clients cannot be contacted.

Sounds like a good idea.

- Stefan
  242   23 Dec 2005 Konstantin OlchanskiBug Reportminor changes to run transition code
> Minor changes to run transitions code:
> - fail transition if cannot connect to one of the clients

This change introduced a problem:
1) a run is happily taking data
2) a frontend crashes
3) the web interface cannot stop the run (cannot contact the crashed frontend)
until  it is removed by the timeout (10-60 seconds?).

I am now considering allowing the run to end even if some clients cannot be
contacted. The begin, pause and resume transitions would continue to fail if
clients cannot be contacted.

K.O.
ELOG V3.1.4-2e1708b5