On Wed, Jul 01, 2020 at 10:46:40AM +0100, Robin Murphy wrote:
> On 2020-06-30 20:08, Jonathan Lemon wrote:
> > On Mon, Jun 29, 2020 at 02:15:16PM +0100, Robin Murphy wrote:
> > > On 2020-06-27 08:02, Christoph Hellwig wrote:
> > > > On Fri, Jun 26, 2020 at 01:54:12PM -0700, Jonathan Lemon wrote:
> > > > > On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:
> > > > > >
> > > > > > Note that this is somewhat urgent, as several of the APIs that
> > > > > > the code is abusing are slated to go away for Linux 5.9, so this
> > > > > > addition comes at a really bad time.
> > > > >
> > > > > Could you elaborate on what is upcoming here?
> > > >
> > > > Moving all these calls out of line, and adding a bypass flag to
> > > > avoid the indirect function call for IOMMUs in direct mapped mode.
> > > >
> > > > > Also, on a semi-related note, are there limitations on how many
> > > > > pages can be left mapped by the iommu?  Some of the page pool
> > > > > work involves leaving the pages mapped instead of constantly
> > > > > mapping/unmapping them.
> > > >
> > > > There are, but I think for all modern IOMMUs they are so big that
> > > > they don't matter.  Maintainers of the individual IOMMU drivers
> > > > might know more.
> > >
> > > Right - I don't know too much about older and more esoteric stuff
> > > like POWER TCE, but for modern pagetable-based stuff like Intel
> > > VT-d, AMD-Vi, and Arm SMMU, the only "limits" are such that
> > > legitimate DMA API use should never get anywhere near them (you'd
> > > run out of RAM for actual buffers long beforehand).  The most
> > > vaguely-realistic concern might be a pathological system topology
> > > where some old 32-bit PCI device doesn't have ACS isolation from
> > > your high-performance NIC such that they have to share an address
> > > space, where the NIC might happen to steal all the low addresses
> > > and prevent the soundcard or whatever from being able to map a
> > > usable buffer.
> > >
> > > With an IOMMU, you typically really *want* to keep a full working
> > > set's worth of pages mapped, since dma_map/unmap are expensive
> > > while dma_sync is somewhere between relatively cheap and free.
> > > With no IOMMU it makes no real difference from the DMA API
> > > perspective, since map/unmap are effectively no more than the
> > > equivalent sync operations anyway (I'm assuming we're not talking
> > > about the kind of constrained hardware that might need SWIOTLB).
> > >
> > > > > On a heavily loaded box with iommu enabled, it seems that quite
> > > > > often there is contention on the iova_lock.  Are there known
> > > > > issues in this area?
> > > >
> > > > I'll have to defer to the IOMMU maintainers, and for that you'll
> > > > need to say what code you are using.  Current mainline doesn't
> > > > even have an iova_lock anywhere.
> > >
> > > Again I can't speak for non-mainstream stuff outside drivers/iommu,
> > > but it's been over 4 years now since merging the initial
> > > scalability work for the generic IOVA allocator there that focused
> > > on minimising lock contention, and it's had considerable evaluation
> > > and tweaking since.  But if we can achieve the goal of efficiently
> > > recycling mapped buffers then we shouldn't need to go anywhere near
> > > IOVA allocation either way except when expanding the pool.
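As a concrete reference for the lifecycle Robin describes (map once when
a page enters the pool, sync on every reuse, unmap only when the page
finally leaves), here is a minimal sketch against the kernel DMA API.
The pool_page struct and helper names are hypothetical illustrations,
not the actual page_pool implementation:

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>
    #include <linux/mm.h>

    struct pool_page {
        struct page *page;
        dma_addr_t dma;
    };

    /* Expensive (IOVA allocation): map once, when the page first
     * enters the pool. */
    static int pool_page_map(struct device *dev, struct pool_page *pp)
    {
        pp->dma = dma_map_page(dev, pp->page, 0, PAGE_SIZE,
                               DMA_FROM_DEVICE);
        return dma_mapping_error(dev, pp->dma) ? -ENOMEM : 0;
    }

    /* Cheap: hand the buffer to the device for RX... */
    static void pool_page_to_hw(struct device *dev, struct pool_page *pp)
    {
        dma_sync_single_for_device(dev, pp->dma, PAGE_SIZE,
                                   DMA_FROM_DEVICE);
    }

    /* ...and take it back so the CPU can build an skb from it. */
    static void pool_page_from_hw(struct device *dev, struct pool_page *pp)
    {
        dma_sync_single_for_cpu(dev, pp->dma, PAGE_SIZE,
                                DMA_FROM_DEVICE);
    }

    /* Expensive: unmap only when the page is freed back to the page
     * allocator, i.e. when it leaves the pool for good. */
    static void pool_page_unmap(struct device *dev, struct pool_page *pp)
    {
        dma_unmap_page(dev, pp->dma, PAGE_SIZE, DMA_FROM_DEVICE);
    }

Only the map and unmap steps touch IOVA allocation; the two syncs in
the hot path do not, which is why recycling mapped pages should avoid
the alloc_iova() contention discussed below.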
> >
> > I'm running a set of patches which uses the page pool to try and
> > keep all the RX buffers mapped as the skb goes up the stack,
> > returning the pages to the pool when the skb is freed.
> >
> > On a dual-socket 12-core Intel machine (48 processors) with 256G of
> > memory, when the iommu is enabled, 'perf top -U' shows the following
> > as the hottest function being run:
> >
> >    - 43.42% worker        [k] queued_spin_lock_slowpath
> >       - 43.42% queued_spin_lock_slowpath
> >          - 41.69% _raw_spin_lock_irqsave
> >             + 41.39% alloc_iova
> >             + 0.28% iova_magazine_free_pfns
> >          + 1.07% lock_sock_nested
> >
> > This is likely heavy contention on the iovad->iova_rbtree_lock.
> > (This is on a 5.6-based system, BTW.)  More scripts and data are
> > below.  Is there a way to reduce the contention here?
>
> Hmm, how big are your DMA mappings?  If you're still hitting the
> rbtree a lot, that most likely implies that either you're making giant
> IOVA allocations that are too big to be cached, or you're
> allocating/freeing IOVAs in a pathological pattern that defeats the
> whole magazine cache mechanism (it's optimised for relatively-balanced
> allocation and freeing of sizes up to order 6).  On a further hunch,
> does the "intel_iommu=forcedac" option make any difference at all?
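(For anyone reproducing this: "forcedac" is a boot-time flag, passed on
the kernel command line as the exact option Robin quotes.  With a
GRUB-based setup that would be something like this in /etc/default/grub:

    GRUB_CMDLINE_LINUX_DEFAULT="... intel_iommu=forcedac"

followed by regenerating the grub config and rebooting.  The GRUB
details are illustrative; any boot loader that sets the kernel command
line works.)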
The allocations are only 4K in size (packet memory), but there are a
lot of them.  I tried running with "forcedac" over the weekend, and it
seems to have made a huge difference.  Why the name "forcedac"?  It
would seem this should be the default for a system with a sane 64-bit
PCI device.

> Either way, if this persists after some initial warm-up period, it
> further implies that the page pool is not doing its job properly (or
> at least not in the way I would have expected).  The alloc_iova() call
> is part of the dma_map_*() overhead, and if the aim is to keep pages
> mapped then it should only be called relatively infrequently.  The
> optimal behaviour would be to dma_map() new clean pages as they are
> added to the pool, use dma_sync() when they are claimed and returned
> by the driver, and only dma_unmap() if they're actually freed back to
> the page allocator.  And if you're still seeing a lot of dma_map/unmap
> time after that, then the pool itself is churning pages and clearly
> needs its size/thresholds tuned.

Most of the dma_map() overhead is coming from the TX path, which isn't
using the page_pool.  With the RX datapath I'm using, pages are
returned to the pool in a non-napi context (so they enter the ring),
and the driver then refills the cache from the ring (rough sketch
below).  I'm seeing about 1% overflow (ring_full), so the default size
seems appropriate.  The single ring lock is likely a problem, but it
can be addressed now that things aren't blowing up in alloc_iova().

From a box with the mlx4 driver + persistent page pool:

    rx_alloc_pages:  35239524767
    cache_hit:       34381377143
    cache_full:            19935
    cache_empty:       859266494
    ring_consume:    31665993135
    ring_produce:    31666119491
    ring_full:         353805844
-- 
Jonathan
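P.S. For readers unfamiliar with the recycle scheme above: the pool is
two-level, with a driver-private cache refilled in batches from a
shared ring, and only the ring takes a lock (the "single ring lock"
mentioned above, and the source of the ring_* counters).  A rough
sketch with hypothetical names and sizes, not the actual mlx4 patches
(a real driver would likely also need the _bh lock variants):

    #include <linux/spinlock.h>
    #include <linux/mm_types.h>
    #include <linux/types.h>

    #define PP_RING_SIZE  2048   /* assumed power of two */
    #define PP_CACHE_SIZE 128

    struct pp_ring {
        spinlock_t lock;                  /* the single ring lock */
        struct page *slots[PP_RING_SIZE];
        unsigned int head, tail;          /* produce at head, consume at tail */
    };

    struct pp_cache {
        struct page *pages[PP_CACHE_SIZE];
        unsigned int count;               /* driver context only, no lock */
    };

    /* Non-NAPI context: an skb is freed and its page goes back to the
     * ring (ring_produce); a full ring means the page must really be
     * freed instead (ring_full). */
    static bool pp_ring_produce(struct pp_ring *r, struct page *page)
    {
        bool ok = false;

        spin_lock(&r->lock);
        if (r->head - r->tail < PP_RING_SIZE) {
            r->slots[r->head++ % PP_RING_SIZE] = page;
            ok = true;
        }
        spin_unlock(&r->lock);
        return ok;
    }

    /* Driver context: when the cache runs dry (cache_empty), refill it
     * from the ring in one batch (ring_consume), amortising the lock
     * acquisition over up to PP_CACHE_SIZE pages. */
    static void pp_cache_refill(struct pp_cache *c, struct pp_ring *r)
    {
        spin_lock(&r->lock);
        while (c->count < PP_CACHE_SIZE && r->tail != r->head)
            c->pages[c->count++] = r->slots[r->tail++ % PP_RING_SIZE];
        spin_unlock(&r->lock);
    }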