On Feb 21, 2007, at 7:45 PM, Chris Samuel wrote:
Hi folks,
We've got an IBM Power5 cluster running SLES9 and using the GM drivers.
We occasionally get users who manage to use up all the DMA memory that
is addressable by the Myrinet card through the Power5 hypervisor.
Through various firmware and driver tweaks (thanks to both IBM and
Myrinet) we've gotten that limit up to almost 1GB and then we use an
undocumented environment variable (GMPI_MAX_LOCKED_MBYTE) to say only
use 248MB of that per process (as we've got 4 cores in each box), which
we enforce through Torque.
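As a quick check of the arithmetic behind that per-process cap, here is a
minimal sketch. The 248MB and 4-core figures come from the message above;
the 1024MB ceiling is only an assumption standing in for "almost 1GB",
since the exact limit is not stated.

```python
# Figures taken from the message: 248MB per process, 4 cores per node.
PER_PROCESS_MB = 248
CORES_PER_NODE = 4

# Assumption: 1024MB is used as a stand-in for "almost 1GB"; the real
# limit on the node is somewhat lower and is not given in the thread.
ASSUMED_CEILING_MB = 1024

total_locked_mb = PER_PROCESS_MB * CORES_PER_NODE
headroom_mb = ASSUMED_CEILING_MB - total_locked_mb

print(f"{CORES_PER_NODE} processes x {PER_PROCESS_MB}MB = {total_locked_mb}MB")
print(f"headroom under the assumed ceiling: {headroom_mb}MB")
```

With one process per core, the cap leaves very little slack, so anything
else holding DMA memory on the node could push a new process over the edge.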
The problems went away. Or at least they did until just now. :-(
The characteristic error we get is:
[13]: alloc_failed, not enough memory (Fatal Error)
Context: <(gmpi_init) gmpi_dma_alloc: dma_recv buffers>
Now Myrinet can handle running out of DMA memory once a process is
running, but when it starts it must be able to allocate a (fairly
trivial) amount of DMA memory otherwise you get that fatal error.
Looking at the node I can confirm that there are only 3 user processes
running, so what I am after is a way of determining how much of that
DMA memory a process has allocated.
I looked at /proc/${PID}/maps and saw this:
40028000-40029000 r--s 00002000 00:0c 8483 /dev/gm0
which to me looks like a memory mapping, but to my eyes that is just
1,000 bytes.
Does anyone have any ideas at all?
Isn't this in hex? If so, it would be 4096 bytes. I do not use GM
much and I do not know what this is. I just loaded GM on one node and
with no GM processes running except the mapper, I have a similar
entry (at a different address, but also 0x1000). I would guess this
is to allow GM and the mapper to communicate. I will check internally.
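To make the hex arithmetic concrete, here is a minimal Python sketch that
parses a /proc/${PID}/maps line and reports the size of the mapping. The
helper names (mapping_size, device_mapped_bytes) are made up for this
example. Note the assumption: this only adds up mappings of the device
file that are visible in the process's address space, which may well not
be the same thing as the DMA memory the GM driver has registered for it.

```python
def mapping_size(maps_line):
    """Return (size_in_bytes, path) for one line of /proc/<pid>/maps."""
    # Fields: address range, perms, offset, dev, inode, optional path.
    fields = maps_line.split(None, 5)
    start_hex, end_hex = fields[0].split("-")
    size = int(end_hex, 16) - int(start_hex, 16)  # addresses are hex
    path = fields[5].strip() if len(fields) > 5 else ""
    return size, path


def device_mapped_bytes(maps_text, device="/dev/gm0"):
    """Sum the sizes of all mappings backed by the given device file."""
    total = 0
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        size, path = mapping_size(line)
        if path == device:
            total += size
    return total


# The line quoted in the message above.
line = "40028000-40029000 r--s 00002000 00:0c 8483 /dev/gm0"
size, path = mapping_size(line)
print(size, path)  # 4096 /dev/gm0
```

On a live node one could pass the contents of /proc/${PID}/maps to
device_mapped_bytes() for each of the running processes and compare the
totals.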
Oh - switching to the Myrinet MX drivers (which don't have this
problem) is not an option. We have an awful lot of users, mostly
(non-computer) scientists, who have their own codes, and trying to
persuade them to recompile would be very hard. Recompiling would be
necessary because we've not been able to convince MPICH-GM to build
shared libraries on Linux on Power with the IBM compilers. :-(
cheers,
Chris
I am sorry you have not had success compiling dynamic libs with
MPICH-GM. Have you sent email to Myricom help?
Regards,
Scott
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org