On 23/02/2017 4:18 AM, Alexander Duyck wrote:
On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
Right, but you were talking about using both halves one after the
other. If that occurs, you have nothing left that you can reuse. That
was what I was getting at. If you use up both halves, you end up
having to unmap the page.
You must have misunderstood me.
Once we use both halves of a page, we _keep_ the page, we do not unmap
it.
We save the page pointer in a ring buffer of pages.
Call it the 'quarantine'.
When we _need_ to replenish the RX desc, we take a look at the oldest
entry in the quarantine ring.
If the page count is 1 (or matches pagecnt_bias, if needed), we
immediately reuse this saved page.
If not, _then_ we unmap and release the page.
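Roughly, the replenish path could look like this (a minimal sketch of
the idea only, not mlx4 code; all names, the ring size and the helpers
are made up, and error handling plus the pagecnt_bias variant are left
out):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

#define QUARANTINE_SIZE 2048

struct quarantined_page {
        struct page *page;
        dma_addr_t dma;         /* the DMA mapping is kept while parked */
};

struct rx_page_quarantine {
        struct quarantined_page ring[QUARANTINE_SIZE];
        unsigned int head;      /* oldest entry */
        unsigned int tail;      /* next free slot */
};

/* Once both halves of a page have been handed to the stack, park the
 * page instead of unmapping it. The sketch assumes the ring never
 * overflows.
 */
static void quarantine_put(struct rx_page_quarantine *q,
                           struct page *page, dma_addr_t dma)
{
        q->ring[q->tail].page = page;
        q->ring[q->tail].dma = dma;
        q->tail = (q->tail + 1) % QUARANTINE_SIZE;
}

/* When replenishing an RX descriptor, look at the oldest parked page. */
static struct page *quarantine_get(struct rx_page_quarantine *q,
                                   struct device *dev, dma_addr_t *dma)
{
        struct quarantined_page *qp;
        struct page *page;

        if (q->head == q->tail)
                return NULL;            /* nothing parked yet */

        qp = &q->ring[q->head];
        page = qp->page;
        q->head = (q->head + 1) % QUARANTINE_SIZE;

        if (page_count(page) == 1) {
                /* The stack released its references: reuse the page
                 * and its existing DMA mapping.
                 */
                *dma = qp->dma;
                return page;
        }

        /* Still referenced elsewhere: unmap and release, the caller
         * allocates a fresh page instead.
         */
        dma_unmap_page(dev, qp->dma, PAGE_SIZE, DMA_FROM_DEVICE);
        put_page(page);
        return NULL;
}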
Okay, that was what I was referring to when I mentioned a "hybrid
between the mlx5 and the Intel approach". Makes sense.
Indeed, in mlx5 Striding RQ (mpwqe) we do something similar.
Our NIC (ConnectX-4 LX and newer) can write multiple _consecutive_
packets into the same RX buffer (page).
AFAIU, this is what Eric suggests doing in SW in mlx4.
Here are the main characteristics of our page-cache in mlx5:
1) FIFO (for higher chances of an available page).
2) If the page-cache head is busy, it is not freed. This has its pros
and cons; we might reconsider.
3) Pages in the cache have no page-to-WQE assignment (WQE stands for
Work Queue Element, i.e. RX descriptor). They are shared by all WQEs
of an RQ and might be used by different WQEs in different rounds.
4) The cache size is smaller than suggested; we would happily increase
it to reflect a whole ring.
Still, performance tests over mlx5 show that under high load we quickly
end up allocating pages, as the stack does not release its references
in time. Increasing the cache size helps, of course.
As there's no _fixed_ fair size that guarantees the availability of
pages every ring cycle, sizing the cache to match the ring can help,
and would give users the opportunity to tune their performance by
setting their ring size according to how powerful their CPUs are and
what traffic type/load they're running.
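For illustration, a rough sketch of the get path of such a FIFO cache
(not the actual mlx5 code; the names and the 256-entry size are made
up). It shows point 2) above: a busy head simply blocks recycling, it
is not freed:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/types.h>

struct page_cache_entry {
        struct page *page;
        dma_addr_t dma;
};

struct rx_page_cache {
        struct page_cache_entry entries[256];   /* ideally sized to the ring */
        u32 head;                               /* oldest entry (FIFO) */
        u32 tail;
};

/* Pop the oldest cached page, but only if the stack has released it. */
static bool page_cache_get(struct rx_page_cache *c,
                           struct page **page, dma_addr_t *dma)
{
        struct page_cache_entry *e;

        if (c->head == c->tail)
                return false;           /* cache empty */

        e = &c->entries[c->head];
        if (page_count(e->page) != 1)
                return false;           /* head busy: left in place, not freed */

        *page = e->page;
        *dma = e->dma;
        c->head = (c->head + 1) % ARRAY_SIZE(c->entries);
        return true;
}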
Note that we would have received 4096 frames before looking at the page
count, so there is a high chance both halves were consumed.
To recap on x86:
2048 active pages would be visible to the device, because the 4096 RX
descriptors would contain DMA addresses pointing to the 4096 halves.
And 2048 pages would be in the reserve.
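Spelling out the arithmetic (assuming 4 KB pages and the 4096-entry RX
ring discussed here; the macro names are made up):

#define RX_RING_SIZE    4096                    /* RX descriptors */
#define FRAG_SIZE       (PAGE_SIZE / 2)         /* 2048 B, one half-page per descriptor */
#define ACTIVE_PAGES    (RX_RING_SIZE / 2)      /* 2048 pages with halves posted to the device */
#define RESERVE_PAGES   ACTIVE_PAGES            /* another 2048 parked in the quarantine */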
The buffer info layout for something like that would probably be
pretty interesting. Basically you would be doubling up the ring so
that you handle 2 Rx descriptors per single buffer info, since you
would automatically know that it would be an even/odd setup in terms
of the buffer offsets.
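Something like this hypothetical layout, just to make the even/odd idea
concrete (not taken from any existing driver; all names are made up):

#include <linux/mm.h>
#include <linux/types.h>

struct rx_buffer_info {
        struct page *page;      /* one page shared by two descriptors */
        dma_addr_t dma;
};

struct rx_ring {
        struct rx_buffer_info *buffer_info;     /* ring_size / 2 entries */
        u16 ring_size;                          /* number of RX descriptors */
};

/* Descriptors 2n and 2n+1 share buffer_info[n]: even descriptors use
 * the first half of the page, odd descriptors the second half.
 */
static inline struct rx_buffer_info *rx_desc_to_buffer(struct rx_ring *ring,
                                                       unsigned int desc_idx)
{
        return &ring->buffer_info[desc_idx >> 1];
}

static inline unsigned int rx_desc_page_offset(unsigned int desc_idx)
{
        return (desc_idx & 1) ? PAGE_SIZE / 2 : 0;
}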
If you get a chance to do something like that, I would love to know the
result. Otherwise, if I get a chance, I can try messing with i40e or
ixgbe some time and see what kind of impact it has.
The whole idea behind using only half the page per descriptor is to
allow us to loop through the ring before we end up reusing the page
again. That buys us enough time that, usually, the stack has consumed
the frame before we need it again.
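A simplified sketch of that half-page alternation (the real Intel
drivers track a pagecnt_bias rather than comparing against 1; the
struct and helpers below are made up for illustration):

#include <linux/mm.h>
#include <linux/types.h>

struct half_page_rx_buffer {
        struct page *page;
        unsigned int page_offset;       /* 0 or PAGE_SIZE / 2 */
};

/* The page can go around again only once the stack has dropped the
 * reference it took on the half handed up last time.
 */
static bool rx_buffer_can_reuse(const struct half_page_rx_buffer *buf)
{
        return page_count(buf->page) == 1;
}

/* Flip to the other half on every refill; by the time this half comes
 * up again the ring has made a full cycle, which is the breathing room
 * described above.
 */
static void rx_buffer_flip(struct half_page_rx_buffer *buf)
{
        buf->page_offset ^= PAGE_SIZE / 2;
}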
The same will happen really.
Best maybe is for me to send the patch ;)
I think I have the idea now. However, patches are always welcome. :-)
Same here :-)