On Jun 2, 2009, at 4:25 PM, Manuel Prinz wrote:

I'm putting you in the loop since I'm quite lost here... It would be
great if you could throw in your thoughts!

Sorry for the delay in replying; this week has been crazier than most.

mpicc segfaults when it's called via fakeroot.

What is fakeroot?

Since this tool is needed in the build process of Debian packages,
packages depending on Open MPI fail to build. This blocks our
transition to 1.3.2. Nicholas was so kind to investigate the issue;
his results are quoted below.

As far as I can say, the problem appeared somewhere between Open MPI 1.3
and 1.3.2. I also successfully used mpicc with fakeroot before Debian
switched to eglibc, so this might be the cause. (Though it should be
fully compatible with glibc. At least they claim to be.)

Do you have some idea what might go wrong?

Based on the call stack, yes.  Uccckkk!!  More below...

FWIW, debugging OMPI is easier if you tell OMPI to slurp all the plugins into its libraries -- so there's no dlopen's and all the plugins are physically located in libmpi.so (and friends). You can get better call stacks this way from corefiles, etc.

Although I think a lot of those missing symbols are in glibc, not ompi.

Okay, it looks like the root cause is something that's appeared recently in openmpi - it fails under 1.3.2-2, but works under 1.3-2. Manuel, I'm cloning the bug for tracking purposes, but I'm certainly not sure that it's actually an OpenMPI bug at heart.  Have you seen anything else like this?

% echo "int main(void) { return 0; }" > test.c
% mpicc.openmpi test.c ; echo $?
0
% fakeroot mpicc.openmpi test.c ; echo $?
Segmentation fault
139

No failures with the other MPI implementations, nor with OpenMPI 1.3. I can put it under gdb but am missing some debugging libraries in the middle:

% fakeroot gdb mpicc.openmpi
[...]
(gdb) run test.c
Starting program: /usr/bin/mpicc.openmpi conftest.c
[Thread debugging using libthread_db enabled]
[New Thread 0xb7dc46c0 (LWP 6958)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7dc46c0 (LWP 6958)]
__libc_calloc (n=1, elem_size=20) at malloc.c:3932
3932    malloc.c: No such file or directory.
       in malloc.c
(gdb) bt
#0  __libc_calloc (n=1, elem_size=20) at malloc.c:3932
#1  0xb7f83086 in _dlerror_run (operate=0xb7f82d90 <dlsym_doit>, args=0xbfd3002c) at dlerror.c:142
#2  0xb7f82d43 in __dlsym (handle=0xffffffff, name=0xb800f16a "open") at dlsym.c:71
#3  0xb800db73 in load_library_symbols () from /usr/lib/libfakeroot/libfakeroot-sysv.so
#4  0xb800e687 in tmp___xstat () from /usr/lib/libfakeroot/libfakeroot-sysv.so
#5  0xb800daa3 in __xstat () from /usr/lib/libfakeroot/libfakeroot-sysv.so
#6  0xb7fbcefc in ?? () from /usr/lib/libopen-pal.so.0
#7  0x00000003 in ?? ()
#8  0xb7fc948e in ?? () from /usr/lib/libopen-pal.so.0
#9  0xbfd300e4 in ?? ()
#10 0x00001b2e in ?? ()
#11 0xbfd300e4 in ?? ()
#12 0x00000003 in ?? ()
#13 0x00000003 in ?? ()
#14 0xbfd30164 in ?? ()
#15 0xb7f20ff4 in ?? () from /lib/i686/cmov/libc.so.6
#16 0x00000001 in ?? ()
#17 0xb7dc8d0c in ?? () from /lib/i686/cmov/libc.so.6
#18 0xbfd30148 in ?? ()
#19 0xb7ee4429 in *__GI__dl_addr (address=0xb7e34e70, info=0xbfd30184, mapp=0xbfd30194,
   symbolp=0xb7f2260c) at dl-addr.c:146
#20 0xb7e35096 in ptmalloc_init () at arena.c:571
#21 0xb7e386bc in malloc_hook_ini (sz=12, caller=0xb7f939ab) at hooks.c:37
#22 0xb7e38535 in *__GI___libc_malloc (bytes=12) at malloc.c:3546
#23 0xb7f939ab in opal_class_initialize () from /usr/lib/libopen-pal.so.0
#24 0xb7fb3227 in opal_output_init () from /usr/lib/libopen-pal.so.0
#25 0xb7f96205 in opal_init_util () from /usr/lib/libopen-pal.so.0
#26 0x08049b62 in main (argc=2, argv=0xbfd30404) at ../../../../../opal/tools/wrappers/opal_wrapper.c:480

Ick... I have zero experience with eglibc; this *could* be a compatibility issue...?

In OMPI 1.3.2, we started using the __malloc_initialize_hook functionality to get a function of ours called the first time the memory allocation subsystem is invoked in a process. Specifically, we do this:

void (*__malloc_initialize_hook) (void) =
    opal_memory_ptmalloc2_malloc_init_hook;

which sets up opal_memory_ptmalloc2_malloc_init_hook() to be invoked during the memory subsystem's init (sometimes even pre-main). This function is in opal/mca/memory/ptmalloc2/hooks.c. Note that this is a *different* hooks.c than the one listed at #21 in the stack trace above. That one looks like the ptmalloc2 hooks.c in eglibc, and it is calling the eglibc ptmalloc_init(), which should then call our init hook function (opal_memory_ptmalloc2_malloc_init_hook). Can you step through and see what is happening there?
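
For anyone stepping through this, here is a toy model of the hook mechanism described above. It is only a sketch: it does not use the real __malloc_initialize_hook, and every toy_* name is made up for illustration. It shows an allocator calling a registered function pointer once, on first use, and counts what happens if that hook allocates while initialization is still in progress.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static void *toy_malloc(size_t n);

static int toy_init_state = 0;      /* 0 = not started, 1 = in progress, 2 = done */
static int toy_hook_calls = 0;      /* how many times the init hook ran */
static int toy_reentrant_calls = 0; /* allocations made while init was in progress */

/* Hypothetical init hook that itself allocates, standing in for a hook
 * whose callees eventually reach calloc(). */
static void toy_allocating_hook(void)
{
    toy_hook_calls++;
    free(toy_malloc(16));
}

/* Stand-in for __malloc_initialize_hook, assigned statically as in the
 * snippet above. */
static void (*toy_init_hook)(void) = toy_allocating_hook;

/* Analogous to ptmalloc_init(): runs once and invokes the hook if set. */
static void toy_allocator_init(void)
{
    assert(toy_init_state == 0);
    toy_init_state = 1;
    if (toy_init_hook != NULL)
        toy_init_hook();
    toy_init_state = 2;
}

static void *toy_malloc(size_t n)
{
    if (toy_init_state == 0)
        toy_allocator_init();       /* first use triggers initialization */
    else if (toy_init_state == 1)
        toy_reentrant_calls++;      /* a real allocator may not be usable yet */
    return malloc(n);
}
```

Calling toy_malloc() twice runs the hook once and records one allocation made during initialization. Whether the real segfault follows this pattern is only a guess from frames #0 through #22; nothing in this thread confirms it.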

I wonder if there's a bug in eglibc such that when it's looking up this symbol, it's trying to open libopen-pal.so to find that symbol, and something is going wrong in there...?

It's weird that the gdb #2 is a dlsym of -1 (self) and it's looking for the symbol "open"...? I don't know enough about how dlsym works internally -- perhaps that's normal...?
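
For reference, on glibc the handle value -1 is what RTLD_NEXT expands to, and an interposing library such as an LD_PRELOAD shim commonly uses it to find the real function it wraps. The sketch below shows that lookup pattern in isolation. It is not fakeroot's code; lookup_next_open and check_rtld_next_value are hypothetical names, and I am assuming fakeroot's load_library_symbols() does something along these lines.

```c
#define _GNU_SOURCE   /* must precede all includes so <dlfcn.h> exposes RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

typedef int (*open_fn)(const char *path, int flags, ...);

/* Find the next definition of "open" after the calling object in the
 * library search order. A wrapper library would call the result to reach
 * the real libc function. */
static open_fn lookup_next_open(void)
{
    return (open_fn) dlsym(RTLD_NEXT, "open");
}

/* glibc-specific assumption: RTLD_NEXT is the pointer value -1, which gdb
 * prints as 0xffffffff on a 32-bit system. */
static void check_rtld_next_value(void)
{
    assert(RTLD_NEXT == (void *) -1L);
}
```

On older glibc this needs -ldl at link time. If that is what is happening here, a dlsym with handle=0xffffffff and name "open" would be ordinary behavior for a wrapper library, and the interesting part would be when it runs rather than what it looks up.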

--
Jeff Squyres
Cisco Systems



