Hi Chris,

Chris Samuel wrote:
We occasionally get users who manage to use up all the DMA memory that is addressable by the Myrinet card through the Power5 hypervisor.

The IOMMU limit set by the hypervisor varies depending on the machine, the hypervisor version and the phase of the moon. Sometimes it's a limit per PCI slot (i.e. per device); sometimes it's a limit for the whole machine (which can be a virtual machine; that's one of the reasons behind the hypervisor) and is shared by all the devices. Sometimes it's reasonably large (1 or 2 GB), sometimes it's ridiculously small (256 MB).

The hypervisor does not make a lot of sense in an HPC environment, but it would be non-trivial work to remove it on PPC.

Through various firmware and driver tweaks (thanks to both IBM and Myrinet) we've gotten that limit up to almost 1 GB, and then we use an undocumented environment variable (GMPI_MAX_LOCKED_MBYTE) to tell each process to use only 248 MB of that (as we've got 4 cores in each box), which we enforce through Torque.
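The back-of-the-envelope accounting behind those numbers looks roughly like this (a sketch only: GMPI_MAX_LOCKED_MBYTE is the variable above, the helper program itself is purely illustrative):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: per-node DMA accounting when each MPI process
 * is capped by GMPI_MAX_LOCKED_MBYTE. */
int main(void)
{
    const char *cap = getenv("GMPI_MAX_LOCKED_MBYTE"); /* e.g. "248" */
    long per_process_mb = cap ? atol(cap) : 248;       /* 248 MB per process */
    long cores_per_node = 4;                           /* 4 cores in each box */
    long iommu_window_mb = 1024;                       /* ~1 GB after the tweaks */

    long total_mb = per_process_mb * cores_per_node;   /* 4 x 248 = 992 MB */
    printf("locked per node: %ld MB of %ld MB -> %s\n",
           total_mb, iommu_window_mb,
           total_mb <= iommu_window_mb ? "fits" : "over the IOMMU limit");
    return 0;
}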

The problems went away.  Or at least they did until just now. :-(

The characteristic error we get is:

[13]: alloc_failed, not enough memory (Fatal Error)
        Context: <(gmpi_init) gmpi_dma_alloc: dma_recv buffers>

Now, Myrinet can handle running out of DMA memory once a process is running, but when a process starts it must be able to allocate a (fairly trivial) amount of DMA memory, otherwise you get that fatal error.

GM does pipeline large messages in chunks of 1 MB, so you can make progress as long as you can register 1 MB at a time (you can think of pathological deadlocking situations, but that's not the common case). However, GM registers some buffers for Eager messages at init time. From memory, it's on the order of 32 MB per process (constant, it does not depend on the size of the job). If you can't register that, there is nothing you can do, so aborting is a good idea.
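Conceptually, the pipelining works something like the sketch below. The chunk helpers are placeholders rather than the real GM API, and the actual driver overlaps registration of the next chunk with the DMA of the current one instead of doing them serially:

#include <stddef.h>

#define CHUNK_SIZE (1UL << 20)  /* 1 MB chunks, as described above */

/* Stubs standing in for registration and DMA of one chunk (not GM calls). */
static int  register_chunk(void *addr, size_t len)   { (void)addr; (void)len; return 0; }
static int  send_chunk(const void *addr, size_t len) { (void)addr; (void)len; return 0; }
static void unregister_chunk(void *addr, size_t len) { (void)addr; (void)len; }

/* A large message is handled 1 MB at a time, so forward progress only
 * needs 1 MB of registered (DMA-able) memory at any moment. */
int pipelined_send(char *buf, size_t total)
{
    for (size_t off = 0; off < total; off += CHUNK_SIZE) {
        size_t len = (total - off < CHUNK_SIZE) ? total - off : CHUNK_SIZE;

        if (register_chunk(buf + off, len) != 0)
            return -1;                     /* can't even pin 1 MB: give up */
        if (send_chunk(buf + off, len) != 0) {
            unregister_chunk(buf + off, len);
            return -1;
        }
        unregister_chunk(buf + off, len);  /* release the DMA window */
    }
    return 0;
}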

If you limit registration per process, then I can think of one situation that will hit the IOMMU limit: if a process dies an abnormal death (segfault, killed, whatever), the GM port will be "shutting down" while the outstanding messages are dropped. During this time, the memory is still registered. If you start another process at that time, you will effectively have more than 4 processes with registered memory, and that may exceed the limit. A quick workaround would be to modify the MPICH-GM init code to only try to open the first 4 GM ports. That would in effect guarantee that only 4 processes can register memory at any one time (the latest release of GM provides 13 ports).
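In the MPICH-GM init path, that workaround amounts to something like the loop below (a sketch only: try_open_gm_port() is a stand-in for the real gm_open() call, and the port numbering is simplified):

#include <stdio.h>

#define MAX_PORTS_PER_NODE 4  /* one port per core in this setup */

/* Stand-in for the real gm_open() call: returns 0 on success,
 * nonzero if the port is already taken. */
static int try_open_gm_port(unsigned int port_id)
{
    (void)port_id;
    return 0;
}

int open_limited_port(unsigned int *opened_port)
{
    /* Instead of scanning every port a recent GM release offers (13),
     * give up after the first MAX_PORTS_PER_NODE candidates, so at most
     * 4 processes per node can hold registered memory at one time. */
    for (unsigned int p = 0; p < MAX_PORTS_PER_NODE; p++) {
        if (try_open_gm_port(p) == 0) {
            *opened_port = p;
            return 0;
        }
    }
    fprintf(stderr, "all %d allowed GM ports are busy\n", MAX_PORTS_PER_NODE);
    return -1;
}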

I see from your next post that this is not what happened. It could have been, though. :-)

Looking at the node I can confirm that there are only 3 user processes running, so what I am after is a way of determining how much of that DMA memory a process has allocated.

There is no handy way, but it would not be hard to add this info to the output of gm_board_info. There are not many releases of GM these days. Nevertheless, I will add it to the queue; it's simple enough not to be considered a new feature.

Oh - switching to the Myrinet MX drivers (which don't have this problem) is not an option; we have an awful lot of users, mostly (non-computer)

Actually, MX would not behave well in your environment: MX does not pipeline large messages, it registers the whole message at once (MX registration is much faster, and pipelining prevents overlap of communication with computation). With 250 MB of DMA-able memory per process, that would be the maximum message size you can send or receive.
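In other words, with whole-message registration the per-process DMA window is a hard ceiling on any single transfer, roughly this check (illustrative only, no real MX calls):

#include <stdio.h>

#define DMA_WINDOW_MB 250UL  /* the per-process cap mentioned above */

/* A transfer larger than the per-process DMA window cannot be
 * registered in one piece, so it cannot be sent or received at all. */
static int can_transfer(unsigned long message_bytes)
{
    return message_bytes <= DMA_WINDOW_MB * 1024UL * 1024UL;
}

int main(void)
{
    printf("200 MB message: %s\n", can_transfer(200UL << 20) ? "ok" : "too big");
    printf("300 MB message: %s\n", can_transfer(300UL << 20) ? "ok" : "too big");
    return 0;
}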

We have plans to do something about that, but it's not at the top of the queue. The right thing would be to get rid of the hypervisor (by the way, the hypervisor makes the memory registration overhead much more expensive), but that will probably never happen.

scientists, who have their own codes, and trying to persuade them to recompile would be very hard - which would be necessary, as we've not been able to convince MPICH-GM to build shared libraries on Linux on Power with the IBM compilers. :-(

Time for dreaming about an MPI ABI :-)

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
