Back Midas Rome Roody Rootana
  Midas DAQ System, Page 130 of 142  Not logged in ELOG logo
ID Date Author Topic Subjectdown
  2845   16 Sep 2024 Marius KöppelBug ReportCrash using ODB watch
Hi all,

last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.

In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:

INT frontend_init() {

  cm_msg(MINFO, "frontend_init() setup", "Test FE");

  odb settings = {
    {"Test", 123},
    {"sub", {}}
  };
  settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
  // settings.watch(watch); <-- this works without segmentation fault

  odb new_settings("/Equipment/Test FE/Settings");
  new_settings.watch(watch); // <-- here I am getting a segmentation fault

  return CM_SUCCESS;
}

When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:

Process 18474 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
    frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
   93  	      if (po->m_data == nullptr)
   94  	         mthrow("Callback received for a midas::odb object which went out of scope");
   95  	      midas::odb *poh = search_hkey(po, hKey);
-> 96  	      poh->m_last_index = index;
   97  	      po->m_watch_callback(*poh);
   98  	      poh->m_last_index = -1;
   99  	   }

Best,
Marius
Attachment 1: test_fe.cpp
#include <algorithm>
#include <math.h>
#include <random>
#include <array>
#include <string>
#include <stdio.h>
#include <stdlib.h>

#include "midas.h"
#include "msystem.h"
#include "odbxx.h"
#include "mfe.h"

using namespace std;
using midas::odb;


/*-- Globals -------------------------------------------------------*/
/* The frontend name (client name) as seen by other MIDAS clients   */
const char *frontend_name = "Test FE";
/* The frontend file name, don't change it */
const char *frontend_file_name = __FILE__;

/* frontend_loop is called periodically if this variable is TRUE    */
BOOL frontend_call_loop = FALSE;

/* a frontend status page is displayed with this frequency in ms    */
INT display_period = 0;

/* maximum event size produced by this frontend */
INT max_event_size = 32 * (1024 * 1024); // 32MiB

/* maximum event size for fragmented events (EQ_FRAGMENTED) */
INT max_event_size_frag = 5 * 1024 * 1024;

/* buffer size to hold events */
INT event_buffer_size = 4 * max_event_size;

BOOL equipment_common_overwrite = TRUE;//true is overwriting the common odb

/*-- Function declarations -----------------------------------------*/
INT read_odb(char * pevent, INT);
void watch(odb o);
/*-- Equipment list ------------------------------------------------*/

EQUIPMENT equipment[] = {

  {
    "Test FE",      /* equipment name */
    {1, 0,          /* event ID, trigger mask */
    "SYSTEM",      /* event buffer */
    EQ_PERIODIC,   /* equipment type */
    0,             /* event source */
    "MIDAS",       /* format */
    TRUE,          /* enabled */
    RO_ALWAYS | RO_ODB,  /* read always and update ODB */
    1000,          /* read every 1 sec */
    0,             /* stop run after this event limit */
    0,             /* number of sub events */
    0,             /* log history every event */
    "", "", ""},
    read_odb,       /* readout routine */
  },

  {""}};

/*-- Dummy routines ------------------------------------------------*/

INT poll_event(INT, INT count, BOOL test)  {
    return 0;
}

INT interrupt_configure(INT, INT, POINTER_T) {
    return 1;
}
/*-- Frontend Init -------------------------------------------------*/

INT frontend_init() {

  cm_msg(MINFO, "frontend_init() setup", "Test FE");

  odb settings = {
    {"Test", 123},
    {"sub", {}}
  };
  settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
  // settings.watch(watch); <-- this works without segmentation fault

  odb new_settings("/Equipment/Test FE/Settings");
  new_settings.watch(watch); // <-- here I am getting a segmentation fault

  return CM_SUCCESS;
}

void watch(odb o) {
    std::string name = o.get_name();

    if (name == "Test") {
      printf("I am a watch on Test\n");
    }
}

/*-- Frontend Exit -------------------------------------------------*/

INT frontend_exit() {
  return CM_SUCCESS;
}

/*-- Frontend Loop -------------------------------------------------*/

INT frontend_loop() {
  return CM_SUCCESS;
}

/*-- Begin of Run --------------------------------------------------*/

INT begin_of_run(INT run_number, char *error) {
  return CM_SUCCESS;
}

/*-- End of Run ----------------------------------------------------*/

INT end_of_run(INT run_number, char *error) {
  return CM_SUCCESS;
}

/*-- Pause Run -----------------------------------------------------*/

INT pause_run(INT run_number, char *error) {
  return CM_SUCCESS;
}

/*-- Resume Run ----------------------------------------------------*/

INT resume_run(INT run_number, char *error) {
  return CM_SUCCESS;
}

/* -- Readout --*/
INT read_odb(char *pevent, INT){

    return 0;

}
  2846   16 Sep 2024 Stefan RittBug ReportCrash using ODB watch
The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.

Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.

Stefan
 
> 
> last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> 
> In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> 
> INT frontend_init() {
> 
>   cm_msg(MINFO, "frontend_init() setup", "Test FE");
> 
>   odb settings = {
>     {"Test", 123},
>     {"sub", {}}
>   };
>   settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
>   // settings.watch(watch); <-- this works without segmentation fault
> 
>   odb new_settings("/Equipment/Test FE/Settings");
>   new_settings.watch(watch); // <-- here I am getting a segmentation fault
> 
>   return CM_SUCCESS;
> }
> 
> When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> 
> Process 18474 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
>     frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
>    93  	      if (po->m_data == nullptr)
>    94  	         mthrow("Callback received for a midas::odb object which went out of scope");
>    95  	      midas::odb *poh = search_hkey(po, hKey);
> -> 96  	      poh->m_last_index = index;
>    97  	      po->m_watch_callback(*poh);
>    98  	      poh->m_last_index = -1;
>    99  	   }
> 
> Best,
> Marius
  2847   16 Sep 2024 Marius KoeppelBug ReportCrash using ODB watch
This is not the case here. Note that the error message: "Callback received for a midas::odb object which went out of scope" is not called! The segmentation fault happens later line 96.

> The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
> destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.
> 
> Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
> 
> Stefan
>  
> > 
> > last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> > 
> > In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> > 
> > INT frontend_init() {
> > 
> >   cm_msg(MINFO, "frontend_init() setup", "Test FE");
> > 
> >   odb settings = {
> >     {"Test", 123},
> >     {"sub", {}}
> >   };
> >   settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> >   // settings.watch(watch); <-- this works without segmentation fault
> > 
> >   odb new_settings("/Equipment/Test FE/Settings");
> >   new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > 
> >   return CM_SUCCESS;
> > }
> > 
> > When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> > 
> > Process 18474 stopped
> > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> >     frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> >    93  	      if (po->m_data == nullptr)
> >    94  	         mthrow("Callback received for a midas::odb object which went out of scope");
> >    95  	      midas::odb *poh = search_hkey(po, hKey);
> > -> 96  	      poh->m_last_index = index;
> >    97  	      po->m_watch_callback(*poh);
> >    98  	      poh->m_last_index = -1;
> >    99  	   }
> > 
> > Best,
> > Marius
  2848   16 Sep 2024 Stefan RittBug ReportCrash using ODB watch
Well, the object *went* out of scope. For my code it‘s hard to realize this, so the error reporting is poor. Also the first object should have the same
problem. Just by accident that it does not crash.

Stefan 

> This is not the case here. Note that the error message: "Callback received for a midas::odb object which went out of scope" is not called! The segmentation fault happens later line 96.
> 
> > The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
> > destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.
> > 
> > Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
> > 
> > Stefan
> >  
> > > 
> > > last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> > > 
> > > In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> > > 
> > > INT frontend_init() {
> > > 
> > >   cm_msg(MINFO, "frontend_init() setup", "Test FE");
> > > 
> > >   odb settings = {
> > >     {"Test", 123},
> > >     {"sub", {}}
> > >   };
> > >   settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> > >   // settings.watch(watch); <-- this works without segmentation fault
> > > 
> > >   odb new_settings("/Equipment/Test FE/Settings");
> > >   new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > > 
> > >   return CM_SUCCESS;
> > > }
> > > 
> > > When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> > > 
> > > Process 18474 stopped
> > > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> > >     frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> > >    93  	      if (po->m_data == nullptr)
> > >    94  	         mthrow("Callback received for a midas::odb object which went out of scope");
> > >    95  	      midas::odb *poh = search_hkey(po, hKey);
> > > -> 96  	      poh->m_last_index = index;
> > >    97  	      po->m_watch_callback(*poh);
> > >    98  	      poh->m_last_index = -1;
> > >    99  	   }
> > > 
> > > Best,
> > > Marius
  2849   16 Sep 2024 Marius KoeppelBug ReportCrash using ODB watch
Okay, but this is then a big issue IMO. For Mu3e we do this in every frontend and I also checked again all of these watches are broken at the moment (with commit 3ad98c5 they worked).
 
In the old style we did for example (see https://bitbucket.org/tmidas/midas/src/develop/examples/crfe/crfe.cxx):

INT frontend_init()
{
   HNDLE hKey;

   // create Settings structure in ODB
   db_create_record(hDB, 0, "Equipment/Clock Reset/Settings", strcomb1(cr_settings_str).c_str());
   db_find_key(hDB, 0, "/Equipment/Clock Reset", &hKey);
   assert(hKey);

   db_watch(hDB, hKey, cr_settings_changed, NULL);

   /*
    * Set our transition sequence. The default is 500. Setting it
    * to 600 means we are called AFTER most other clients.
    */
   cm_set_transition_sequence(TR_START, 600);

   return CM_SUCCESS;
}

I thought this will be the same (under the hood) in the current odbxx way via:

odb settings("Equipment/Clock Reset/Settings");
settings.watch(cr_settings_changed);

Best,
Marius


> Well, the object *went* out of scope. For my code it‘s hard to realize this, so the error reporting is poor. Also the first object should have the same
> problem. Just by accident that it does not crash.
> 
> Stefan 
> 
> > This is not the case here. Note that the error message: "Callback received for a midas::odb object which went out of scope" is not called! The segmentation fault happens later line 96.
> > 
> > > The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
> > > destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.
> > > 
> > > Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
> > > 
> > > Stefan
> > >  
> > > > 
> > > > last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> > > > 
> > > > In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> > > > 
> > > > INT frontend_init() {
> > > > 
> > > >   cm_msg(MINFO, "frontend_init() setup", "Test FE");
> > > > 
> > > >   odb settings = {
> > > >     {"Test", 123},
> > > >     {"sub", {}}
> > > >   };
> > > >   settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> > > >   // settings.watch(watch); <-- this works without segmentation fault
> > > > 
> > > >   odb new_settings("/Equipment/Test FE/Settings");
> > > >   new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > > > 
> > > >   return CM_SUCCESS;
> > > > }
> > > > 
> > > > When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> > > > 
> > > > Process 18474 stopped
> > > > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> > > >     frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> > > >    93  	      if (po->m_data == nullptr)
> > > >    94  	         mthrow("Callback received for a midas::odb object which went out of scope");
> > > >    95  	      midas::odb *poh = search_hkey(po, hKey);
> > > > -> 96  	      poh->m_last_index = index;
> > > >    97  	      po->m_watch_callback(*poh);
> > > >    98  	      poh->m_last_index = -1;
> > > >    99  	   }
> > > > 
> > > > Best,
> > > > Marius
  2850   16 Sep 2024 Mark GrimesBug ReportCrash using ODB watch
Hi,
Maybe I've misunderstood the code, but odb::watch() creates a deep copy of itself to set the watch to.  The comment where this happens specifies that this is in case the current one goes out of scope.  See https://bitbucket.org/tmidas/midas/src/2878647fb73648474b35223ce53a125180f751b3/src/odbxx.cxx#lines-1393:1395
So as far as I can tell allowing the current odb instance to go out of scope is supported.

Thanks,

Mark.


> Okay, but this is then a big issue IMO. For Mu3e we do this in every frontend and I also checked again all of these watches are broken at the moment (with commit 3ad98c5 they worked).
>  
> In the old style we did for example (see https://bitbucket.org/tmidas/midas/src/develop/examples/crfe/crfe.cxx):
> 
> INT frontend_init()
> {
>    HNDLE hKey;
> 
>    // create Settings structure in ODB
>    db_create_record(hDB, 0, "Equipment/Clock Reset/Settings", strcomb1(cr_settings_str).c_str());
>    db_find_key(hDB, 0, "/Equipment/Clock Reset", &hKey);
>    assert(hKey);
> 
>    db_watch(hDB, hKey, cr_settings_changed, NULL);
> 
>    /*
>     * Set our transition sequence. The default is 500. Setting it
>     * to 600 means we are called AFTER most other clients.
>     */
>    cm_set_transition_sequence(TR_START, 600);
> 
>    return CM_SUCCESS;
> }
> 
> I thought this will be the same (under the hood) in the current odbxx way via:
> 
> odb settings("Equipment/Clock Reset/Settings");
> settings.watch(cr_settings_changed);
> 
> Best,
> Marius
> 
> 
> > Well, the object *went* out of scope. For my code it‘s hard to realize this, so the error reporting is poor. Also the first object should have the same
> > problem. Just by accident that it does not crash.
> > 
> > Stefan 
> > 
> > > This is not the case here. Note that the error message: "Callback received for a midas::odb object which went out of scope" is not called! The segmentation fault happens later line 96.
> > > 
> > > > The answer is in the error message: „Object went out of scope“. When your frontent_init() exits, the odb objects are destroyed. When you get a callback, it‘s linked to the
> > > > destroyed object. This is like if you have a local string and pass a reference to that string in the return of the function.
> > > > 
> > > > Use a global object (bad) or use „new“ (potential memory leak). I would use a global structure which holds all odb objects.
> > > > 
> > > > Stefan
> > > >  
> > > > > 
> > > > > last week I was running MIDAS with the commit 3ad98c5. Today I updated MIDAS and now all my watch functions are crashing. Attached I have a minimal example frontend of the problem.
> > > > > 
> > > > > In our software we have two functions one which sets up the ODB values of the frontend and another one which sets up all watch functions. So overall we connect two time to the ODB during fronend_init one time to create the values and one time to create the watch. In the example code a simple version of this setup is shown:
> > > > > 
> > > > > INT frontend_init() {
> > > > > 
> > > > >   cm_msg(MINFO, "frontend_init() setup", "Test FE");
> > > > > 
> > > > >   odb settings = {
> > > > >     {"Test", 123},
> > > > >     {"sub", {}}
> > > > >   };
> > > > >   settings.connect_and_fix_structure("/Equipment/Test FE/Settings");
> > > > >   // settings.watch(watch); <-- this works without segmentation fault
> > > > > 
> > > > >   odb new_settings("/Equipment/Test FE/Settings");
> > > > >   new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > > > > 
> > > > >   return CM_SUCCESS;
> > > > > }
> > > > > 
> > > > > When I directly set the watch everything runs fine however, when I create a new ODB object and use this one to set a watch I am getting the following segmentation fault:
> > > > > 
> > > > > Process 18474 stopped
> > > > > * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x34)
> > > > >     frame #0: 0x000000010004fa38 test_fe`midas::odb::watch_callback(hDB=<unavailable>, hKey=<unavailable>, index=0, info=0x00006000002001c0) at odbxx.cxx:96:25 [opt]
> > > > >    93  	      if (po->m_data == nullptr)
> > > > >    94  	         mthrow("Callback received for a midas::odb object which went out of scope");
> > > > >    95  	      midas::odb *poh = search_hkey(po, hKey);
> > > > > -> 96  	      poh->m_last_index = index;
> > > > >    97  	      po->m_watch_callback(*poh);
> > > > >    98  	      poh->m_last_index = -1;
> > > > >    99  	   }
> > > > > 
> > > > > Best,
> > > > > Marius
  2851   17 Sep 2024 Konstantin OlchanskiBug ReportCrash using ODB watch
> {
> odb new_settings("/Equipment/Test FE/Settings");
> new_settings.watch(watch); // <-- here I am getting a segmentation fault
> }

this code has a bug. "watch" is attached to object "new_settings" that is deleted
after the closing curly bracket.

I would say Stefan's odb API should not allow you to write code like this. an API defect.

K.O.
  2852   18 Sep 2024 Marius KoeppelBug ReportCrash using ODB watch
I created a PR to fix this issue https://bitbucket.org/tmidas/midas/pull-requests/42.
The crash happened since the change in commit 3ad98c5 always got the ODB via XML.
However, the creation from XML should only be used when a user wants to read fast (and when we are on a remote machine) so I added the flag use_from_xml to explicitly specify this.


> > {
> > odb new_settings("/Equipment/Test FE/Settings");
> > new_settings.watch(watch); // <-- here I am getting a segmentation fault
> > }
> 
> this code has a bug. "watch" is attached to object "new_settings" that is deleted
> after the closing curly bracket.

> I would say Stefan's odb API should not allow you to write code like this. an API defect.

As pointed out in the thread this feature is explicitly supported by odbxx.cxx:

void odb::watch(std::function<void(midas::odb &)> f) {
      if (m_hKey == 0 || m_hKey == -1)
         mthrow("watch() called for ODB key \"" + m_name +
                "\" which is not connected to ODB");

      // create a deep copy of current object in case it
      // goes out of scope
      midas::odb* ow = new midas::odb(*this);

      ow->m_watch_callback = f;
      db_watch(s_hDB, m_hKey, midas::odb::watch_callback, ow);

      // put object into watchlist
      g_watchlist.push_back(ow);
}

Also in the old way (see for example https://bitbucket.org/tmidas/midas/src/191d13f98626fae533cbca17b00df7ee361edf16/examples/crfe/crfe.cxx#lines-126) it was possible to create a watch in a scope without the user taking care that the "object" does not go out of scope.
I think this feature should be supported by the framework.

Best,
Marius
  2853   20 Sep 2024 Stefan RittBug ReportCrash using ODB watch
The problem has been fixed in the current version. Here is my analysis:

- the midas::odb object *can* go out of scope in the function, since the odb::watch() function creates a deep copy of the object. 
This does not cause a memory leak if one call odb::unwatch_all() at the end of a program.

- The creation from XML had a flaw where the ODB key handle ("hKey") is not initialized since it is not passed by the db_copy_xml() function.
I added code to db_copy_xml() to also fetch the key handle in the XML file, which now fixes the issue. Please note that you have to
update both the server and client side of midas to get this functionality if you are using it by a remote client.

- I saw the flag MK added on his pull request to the constructor of odb::odb(). This is a way to fight the symptoms (by creating an
object the "old" way if not otherwise needed, but how we have the cause cured. Nevertheless I added that parameter, but set to to true by default:

   odb::odb(const std::string &str, bool init_via_xml = true);

since this should be fully working now and should always be faster than the old method. I only keep it for debugging should we observe
another flaw in odb_from_xml(). 

Best regards,
Stefan
  819   04 Jul 2012 Konstantin OlchanskiBug ReportCrash after recursive use of rpc_execute()
I am looking at a MIDAS kaboom when running out of space on the data disk - everything was freezing 
up, even the VME frontend crashed sometimes.

The freeze was traced to ROOT use in mlogger - it turns out that ROOT intercepts many signal handlers, 
including SIGSEGV - but instead of crashing the program as God intended, ROOT SEGV handler just hangs, 
and the rest of MIDAS hangs with it. One solution is to always build mlogger without ROOT support - 
does anybody use this feature anymore? Or reset the signal handlers back to the default setting somehow.

Freeze fixed, now I see a crash (seg fault) inside mlogger, in the newly introduced memmove() function 
inside the MIDAS RPC code rpc_execute(). memmove() replaced memcpy() in the same place and I am 
surprised we did not see this crash with memcpy().

The crash is caused by crazy arguments passed to memmove() - looks like corrupted RPC arguments 
data.

Then I realized that I see a recursive call to rpc_execute(): rpc_execute() calls tr_stop() calls cm_yield() calls 
ss_suspend() calls rpc_execute(). The second rpc_execute successfully completes, but leave corrupted 
data for the original rpc_execute(), which happily crashes. At the moment of the crash, recursive call to 
rpc_execute() is no longer visible.

Note that rpc_execute() cannot be called recursively - it is not re-entrant as it uses a global buffer for RPC 
argument processing. (global tls_buffer structure).

Here is the mlogger stack trace:

#0  0x00000032a8032885 in raise () from /lib64/libc.so.6
#1  0x00000032a8034065 in abort () from /lib64/libc.so.6
#2  0x00000032a802b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000032a802bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000041d3e6 in rpc_execute (sock=14, buffer=0x7ffff73fc010 "\340.", convert_flags=0) at 
src/midas.c:11478
#5  0x0000000000429e41 in rpc_server_receive (idx=1, sock=<value optimized out>, check=<value 
optimized out>) at src/midas.c:12955
#6  0x0000000000433fcd in ss_suspend (millisec=0, msg=0) at src/system.c:3927
#7  0x0000000000429b12 in cm_yield (millisec=100) at src/midas.c:4268
#8  0x00000000004137c0 in close_channels (run_number=118, p_tape_flag=0x7fffffffcd34) at 
src/mlogger.cxx:3705
#9  0x000000000041390e in tr_stop (run_number=118, error=<value optimized out>) at 
src/mlogger.cxx:4148
#10 0x000000000041cd42 in rpc_execute (sock=12, buffer=0x7ffff73fc010 "\340.", convert_flags=0) at 
src/midas.c:11626
#11 0x0000000000429e41 in rpc_server_receive (idx=0, sock=<value optimized out>, check=<value 
optimized out>) at src/midas.c:12955
#12 0x0000000000433fcd in ss_suspend (millisec=0, msg=0) at src/system.c:3927
#13 0x0000000000429b12 in cm_yield (millisec=1000) at src/midas.c:4268
#14 0x0000000000416c50 in main (argc=<value optimized out>, argv=<value optimized out>) at 
src/mlogger.cxx:4431


K.O.
  820   04 Jul 2012 Konstantin OlchanskiBug ReportCrash after recursive use of rpc_execute()
>  ... I see a recursive call to rpc_execute(): rpc_execute() calls tr_stop() calls cm_yield() calls 
> ss_suspend() calls rpc_execute()
> ... rpc_execute() cannot be called recursively - it is not re-entrant as it uses a global buffer

It turns out that rpc_server_receive() also need protection against recursive calls - it also uses
a global buffer to receive network data.

My solution is to protect rpc_server_receive() against recursive calls by detecting recursion and returning SS_SUCCESS (to ss_suspend()).

I was worried that this would cause a tight loop inside ss_suspend() but in practice, it looks like ss_suspend() tries to call
us about once per second. I am happy with this solution. Here is the diff:


@@ -12813,7 +12815,7 @@
 
 
 /********************************************************************/
-INT rpc_server_receive(INT idx, int sock, BOOL check)
+INT rpc_server_receive1(INT idx, int sock, BOOL check)
 /********************************************************************\
 
   Routine: rpc_server_receive
@@ -13047,7 +13049,28 @@
    return status;
 }
 
+/********************************************************************/
+INT rpc_server_receive(INT idx, int sock, BOOL check)
+{
+  static int level = 0;
+  int status;
 
+  // Provide protection against recursive calls to rpc_server_receive() and rpc_execute()
+  // via rpc_execute() calls tr_stop() calls cm_yield() calls ss_suspend() calls rpc_execute()
+
+  if (level != 0) {
+    //printf("*** enter rpc_server_receive level %d, idx %d sock %d %d -- protection against recursive use!\n", level, idx, sock, check);
+    return SS_SUCCESS;
+  }
+
+  level++;
+  //printf(">>> enter rpc_server_receive level %d, idx %d sock %d %d\n", level, idx, sock, check);
+  status = rpc_server_receive1(idx, sock, check);
+  //printf("<<< exit rpc_server_receive level %d, idx %d sock %d %d, status %d\n", level, idx, sock, check, status);
+  level--;
+  return status;
+}
+
 /********************************************************************/
 INT rpc_server_shutdown(void)
 /********************************************************************\


ladd02:trinat~/packages/midas>svn info src/midas.c
Path: src/midas.c
Name: midas.c
URL: svn+ssh://svn@savannah.psi.ch/repos/meg/midas/trunk/src/midas.c
Repository Root: svn+ssh://svn@savannah.psi.ch/repos/meg/midas
Repository UUID: 050218f5-8902-0410-8d0e-8a15d521e4f2
Revision: 5297
Node Kind: file
Schedule: normal
Last Changed Author: olchanski
Last Changed Rev: 5294
Last Changed Date: 2012-06-15 10:45:35 -0700 (Fri, 15 Jun 2012)
Text Last Updated: 2012-06-29 17:05:14 -0700 (Fri, 29 Jun 2012)
Checksum: 8d7907bd60723e401a3fceba7cd2ba29

K.O.
  821   13 Jul 2012 Stefan RittBug ReportCrash after recursive use of rpc_execute()
> Then I realized that I see a recursive call to rpc_execute(): rpc_execute() calls tr_stop() calls cm_yield() calls 
> ss_suspend() calls rpc_execute(). The second rpc_execute successfully completes, but leave corrupted 
> data for the original rpc_execute(), which happily crashes. At the moment of the crash, recursive call to 
> rpc_execute() is no longer visible.

This is really strange. I did not protect rpc_execute against recursive calls since this should not happen. rpc_server_receive() is linked to rpc_call() on the client side. So there cannot be 
several rpc_call() since there I do the recursive checking (also multi-thread checking) via a mutex. See line 10142 in midas.c. So there CANNOT be recursive calls to rpc_execute() because 
there cannot be recursive calls to rpc_server_receive(). But apparently there are, according to your stack trace.

So even if your patch works fine, I would like to know where the recursive calls to rpc_server_receive() come from. Since we have one subproces of mserver for each client, there should only 
be one client connected to each mserver process, and the client is protected via the mutex in rpc_call(). Can you please debug this? I would like to understand what is going on there. Maybe 
there is a deeper underlying problem, which we better solve, otherwise it might fall back on use in the future.

For debugging, you have to see what commands rpc_call() send and what rpc_server_receive() gets, maybe by writing this into a common file together with a time stamp.

SR
  618   18 Aug 2009 Denis CalvetSuggestionCould not create strings other than 32 characters with odbedit -c "..." command
Hi,
I am writing shell scripts to create some tree structure in an ODB. When 
creating an array of strings, the default length of each string element is 32 
characters. If odbedit is used interactively to create the array of strings, 
the user is prompted to enter a different length if desired. But if the 
command odbedit is called from a shell script, I did not succeed in passing 
the argument to get a different length.
I tried:
odbedit -c "create STRING Test[8][40]"
Or:
odbedit -c "create STRING Test[8] 40"
Or:
odbedit -c "create STRING Test[8] \n 40"
etc. all produce an array of 8 strings with 32 characters each.
I haven't tried all possible syntaxes, but I suspect the length argument is 
dropped. If it has not been fixed in a later release than the one I am using, 
could this problem be looked at?
Thanks,
Denis.
  
  627   03 Sep 2009 Stefan RittSuggestionCould not create strings other than 32 characters with odbedit -c "..." command
> Hi,
> I am writing shell scripts to create some tree structure in an ODB. When 
> creating an array of strings, the default length of each string element is 32 
> characters. If odbedit is used interactively to create the array of strings, 
> the user is prompted to enter a different length if desired. But if the 
> command odbedit is called from a shell script, I did not succeed in passing 
> the argument to get a different length.
> I tried:
> odbedit -c "create STRING Test[8][40]"
> Or:
> odbedit -c "create STRING Test[8] 40"
> Or:
> odbedit -c "create STRING Test[8] \n 40"
> etc. all produce an array of 8 strings with 32 characters each.
> I haven't tried all possible syntaxes, but I suspect the length argument is 
> dropped. If it has not been fixed in a later release than the one I am using, 
> could this problem be looked at?

Ok, I added a command

odbedit -c "create STRING Test[8][40]"

which works now. Please update to SVN revision 4555 of odbedit.c

- Stefan
  634   06 Sep 2009 Exaos LeeSuggestionCould not create strings other than 32 characters with odbedit -c "..." command
> Ok, I added a command
> 
> odbedit -c "create STRING Test[8][40]"
> 
> which works now. Please update to SVN revision 4555 of odbedit.c
> 
> - Stefan

If I want to create only one string, should I write like this:

  odbedit -c "create STRING Test[] [256]"

OK. I need it. I will try the new odbedit.
  640   06 Sep 2009 Exaos LeeSuggestionCould not create strings other than 32 characters with odbedit -c "..." command
> > Ok, I added a command
> > 
> > odbedit -c "create STRING Test[8][40]"
> > 
> > which works now. Please update to SVN revision 4555 of odbedit.c
> > 
> > - Stefan
> 
> If I want to create only one string, should I write like this:
> 
>   odbedit -c "create STRING Test[] [256]"
> 
> OK. I need it. I will try the new odbedit.

"create STRING test[1][256]" works.
  208   21 Apr 2005 Konstantin OlchanskiSuggestionCorrect MIDASSYS setting?
Current MIDAS versions nag me about setting the env.variable MIDASSYS to the
"midas installation directory", but I do not have one, so what should I set
MIDASSYS to? I checkout MIDAS from cvs into /home/olchansk/daq/midas, build it
there, run it from there. I never do "make install" (I am not "root" on every
machine; I am not the only MIDAS user on every machine). What should I set
MIDASSYS to? K.O.
  209   22 Apr 2005 Stefan RittSuggestionCorrect MIDASSYS setting?
> Current MIDAS versions nag me about setting the env.variable MIDASSYS to the
> "midas installation directory", but I do not have one, so what should I set
> MIDASSYS to? I checkout MIDAS from cvs into /home/olchansk/daq/midas, build it
> there, run it from there. I never do "make install" (I am not "root" on every
> machine; I am not the only MIDAS user on every machine). What should I set
> MIDASSYS to? K.O.

Then set it to /home/olchansk/daq/midas. The reason for MIDASSYS is the same as
for ROOTSYS. Having it allows other packages like ROME to access the Midas source
code, include files and libraries.
  1905   07 May 2020 EstelleBug ReportConflic between Rootana and midas about the redefinition of TID_xxx data types
Dear Midas and Rootana people,

We have tried to update our midas DAQ with the new TID definitions describe in https://midas.triumf.ca/elog/Midas/1871 

And we have noticed an incompatibility of this new definitions with Rootana when reading an XmlOdb in our offline analyzer. 

The problem comes from  the function FindArrayPath in XmlOdb.cxx and the comparison of bank type as strings.
Ex: comparing the strings "DWORD" and "UNINT32"

An naive solution would be to print the number associated to the type (ex: '6' for DWORD/UNINT32), but that would mean changing Rootana and Midas source code. Moreover, it does decrease the readability of the XmlOdb file. 


Thanks for your time.
Estelle
  1911   20 May 2020 Konstantin OlchanskiBug ReportConflic between Rootana and midas about the redefinition of TID_xxx data types
> Dear Midas and Rootana people,
> 
> We have tried to update our midas DAQ with the new TID definitions describe in https://midas.triumf.ca/elog/Midas/1871 
> 
> And we have noticed an incompatibility of this new definitions with Rootana when reading an XmlOdb in our offline analyzer. 
> 
> The problem comes from  the function FindArrayPath in XmlOdb.cxx and the comparison of bank type as strings.
> Ex: comparing the strings "DWORD" and "UNINT32"
> 
> An naive solution would be to print the number associated to the type (ex: '6' for DWORD/UNINT32), but that would mean changing Rootana and Midas source code. Moreover, it does decrease the readability of the XmlOdb file. 
> 

Hi, it is unfortunate that a change was made in MIDAS that is incompatible with existing analysis software. I shall update the ROOTANA package to deal with this ASAP.

K.O.
ELOG V3.1.4-2e1708b5