Hi Chris,

Chris Samuel wrote:
We occasionally get users who manage to use up all the DMA memory that is addressable by the Myrinet card through the Power5 hypervisor.

The IOMMU limit set by the hypervisor varies depending on the machine, the hypervisor version and the phase of the moon. Sometimes it's a limit per PCI slot (i.e. per device); sometimes it's a limit for the whole machine (which can be a virtual machine; that's one of the reasons behind the hypervisor) and is shared by all the devices. Sometimes it's reasonably large (1 or 2 GB), sometimes it's ridiculously small (256 MB).

The hypervisor does not make a lot of sense in an HPC environment, but it would be non-trivial work to remove it on PPC.

Through various firmware and driver tweaks (thanks to both IBM and Myrinet) we've gotten that limit up to almost 1 GB, and then we use an undocumented environment variable (GMPI_MAX_LOCKED_MBYTE) to tell each process to use only 248 MB of that (as we've got 4 cores in each box), which we enforce through Torque.
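The back-of-the-envelope accounting behind those numbers looks roughly like this (a sketch only: GMPI_MAX_LOCKED_MBYTE is the variable above, the helper program itself is purely illustrative):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: per-node DMA accounting when each MPI process
 * is capped by GMPI_MAX_LOCKED_MBYTE. */
int main(void)
{
    const char *cap = getenv("GMPI_MAX_LOCKED_MBYTE"); /* e.g. "248" */
    long per_process_mb = cap ? atol(cap) : 248;       /* 248 MB per process */
    long cores_per_node = 4;                           /* 4 cores in each box */
    long iommu_window_mb = 1024;                       /* ~1 GB after the tweaks */

    long total_mb = per_process_mb * cores_per_node;   /* 4 x 248 = 992 MB */
    printf("locked per node: %ld MB of %ld MB -> %s\n",
           total_mb, iommu_window_mb,
           total_mb <= iommu_window_mb ? "fits" : "over the IOMMU limit");
    return 0;
}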

The problems went away.  Or at least they did until just now. :-(

The characteristic error we get is:

[13]: alloc_failed, not enough memory (Fatal Error)
        Context: <(gmpi_init) gmpi_dma_alloc: dma_recv buffers>

Now, Myrinet can handle running out of DMA memory once a process is running, but when a process starts it must be able to allocate a (fairly trivial) amount of DMA memory, otherwise you get that fatal error.

GM does pipeline large messages in chunks of 1 MB, so you can make progress as long as you can register 1 MB at a time (you can think of pathological deadlocking situations, but that's not the common case). However, GM registers some buffers for Eager messages at init time. From memory, it's on the order of 32 MB per process (constant, it does not depend on the size of the job). If you can't register that, there is nothing you can do, so aborting is a good idea.
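Conceptually, the pipelining works something like the sketch below. The chunk helpers are placeholders rather than the real GM API, and the actual driver overlaps registration of the next chunk with the DMA of the current one instead of doing them serially:

#include <stddef.h>

#define CHUNK_SIZE (1UL << 20)  /* 1 MB chunks, as described above */

/* Stubs standing in for registration and DMA of one chunk (not GM calls). */
static int  register_chunk(void *addr, size_t len)   { (void)addr; (void)len; return 0; }
static int  send_chunk(const void *addr, size_t len) { (void)addr; (void)len; return 0; }
static void unregister_chunk(void *addr, size_t len) { (void)addr; (void)len; }

/* A large message is handled 1 MB at a time, so forward progress only
 * needs 1 MB of registered (DMA-able) memory at any moment. */
int pipelined_send(char *buf, size_t total)
{
    for (size_t off = 0; off < total; off += CHUNK_SIZE) {
        size_t len = (total - off < CHUNK_SIZE) ? total - off : CHUNK_SIZE;

        if (register_chunk(buf + off, len) != 0)
            return -1;                     /* can't even pin 1 MB: give up */
        if (send_chunk(buf + off, len) != 0) {
            unregister_chunk(buf + off, len);
            return -1;
        }
        unregister_chunk(buf + off, len);  /* release the DMA window */
    }
    return 0;
}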

If you limit registration per process, then I can think of one situation that will hit the IOMMU limit: if a process dies an abnormal death (segfault, killed, whatever), the GM port will be "shutting down" while the outstanding messages are dropped. During this time, the memory is still registered. If you start another process at that time, you will effectively have more than 4 processes with registered memory, and that may exceed the limit. A quick workaround would be to modify the MPICH-GM init code to only try to open the first 4 GM ports. That would in effect guarantee that only 4 processes can register memory at any one time (the latest release of GM provides 13 ports).
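In the MPICH-GM init path, that workaround amounts to something like the loop below (a sketch only: try_open_gm_port() is a stand-in for the real gm_open() call, and the port numbering is simplified):

#include <stdio.h>

#define MAX_PORTS_PER_NODE 4  /* one port per core in this setup */

/* Stand-in for the real gm_open() call: returns 0 on success,
 * nonzero if the port is already taken. */
static int try_open_gm_port(unsigned int port_id)
{
    (void)port_id;
    return 0;
}

int open_limited_port(unsigned int *opened_port)
{
    /* Instead of scanning every port a recent GM release offers (13),
     * give up after the first MAX_PORTS_PER_NODE candidates, so at most
     * 4 processes per node can hold registered memory at one time. */
    for (unsigned int p = 0; p < MAX_PORTS_PER_NODE; p++) {
        if (try_open_gm_port(p) == 0) {
            *opened_port = p;
            return 0;
        }
    }
    fprintf(stderr, "all %d allowed GM ports are busy\n", MAX_PORTS_PER_NODE);
    return -1;
}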

I see from your next post that this is not what happened. It could have been, though. :-)

Looking at the node I can confirm that there are only 3 user processes running, so what I am after is a way of determining how much of that DMA memory a process has allocated.

There is no handy way, but it would not be hard to add this info to the output of gm_board_info. There are not many releases of GM these days. Nevertheless, I will add it to the queue; it's simple enough not to be considered a new feature.

Oh - switching to the Myrinet MX drivers (which don't have this problem) is not an option; we have an awful lot of users, mostly (non-computer)

Actually, MX would not behave well in your environment: MX does not pipeline large messages, it registers the whole message at once (MX registration is much faster, and pipelining prevents overlap of communication with computation). With 250 MB of DMA-able memory per process, that would be the maximum message size you can send or receive.
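In other words, with whole-message registration the per-process DMA window is a hard ceiling on any single transfer, roughly this check (illustrative only, no real MX calls):

#include <stdio.h>

#define DMA_WINDOW_MB 250UL  /* the per-process cap mentioned above */

/* A transfer larger than the per-process DMA window cannot be
 * registered in one piece, so it cannot be sent or received at all. */
static int can_transfer(unsigned long message_bytes)
{
    return message_bytes <= DMA_WINDOW_MB * 1024UL * 1024UL;
}

int main(void)
{
    printf("200 MB message: %s\n", can_transfer(200UL << 20) ? "ok" : "too big");
    printf("300 MB message: %s\n", can_transfer(300UL << 20) ? "ok" : "too big");
    return 0;
}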

We have plans to do something about that, but it's not at the top of the queue. The right thing would be to get rid of the hypervisor (by the way, the hypervisor makes the memory registration overhead much more expensive), but that will probably never happen.

scientists, who have their own codes, and trying to persuade them to recompile would be very hard - which would be necessary, as we've not been able to convince MPICH-GM to build shared libraries on Linux on Power with the IBM compilers. :-(

Time for dreaming about an MPI ABI :-)

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
