On 23/02/2017 4:18 AM, Alexander Duyck wrote:
On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
Right, but you were talking about using both halves one after the
other. If that occurs, you have nothing left that you can reuse. That
was what I was getting at. If you use up both halves, you end up
having to unmap the page.
You must have misunderstood me.
Once we use both halves of a page, we _keep_ the page, we do not unmap
it.
We save the page pointer in a ring buffer of pages.
Call it the 'quarantine'.
When we _need_ to replenish the RX desc, we take a look at the oldest
entry in the quarantine ring.
If the page count is 1 (or matches pagecnt_bias, if needed), we
immediately reuse this saved page.
If not, _then_ we unmap and release the page.
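Roughly, the replenish path could look like this (a minimal sketch of
the idea only, not mlx4 code; all names, the ring size and the helpers
are made up, and error handling plus the pagecnt_bias variant are left
out):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

#define QUARANTINE_SIZE 2048

struct quarantined_page {
        struct page *page;
        dma_addr_t dma;         /* the DMA mapping is kept while parked */
};

struct rx_page_quarantine {
        struct quarantined_page ring[QUARANTINE_SIZE];
        unsigned int head;      /* oldest entry */
        unsigned int tail;      /* next free slot */
};

/* Once both halves of a page have been handed to the stack, park the
 * page instead of unmapping it. The sketch assumes the ring never
 * overflows.
 */
static void quarantine_put(struct rx_page_quarantine *q,
                           struct page *page, dma_addr_t dma)
{
        q->ring[q->tail].page = page;
        q->ring[q->tail].dma = dma;
        q->tail = (q->tail + 1) % QUARANTINE_SIZE;
}

/* When replenishing an RX descriptor, look at the oldest parked page. */
static struct page *quarantine_get(struct rx_page_quarantine *q,
                                   struct device *dev, dma_addr_t *dma)
{
        struct quarantined_page *qp;
        struct page *page;

        if (q->head == q->tail)
                return NULL;            /* nothing parked yet */

        qp = &q->ring[q->head];
        page = qp->page;
        q->head = (q->head + 1) % QUARANTINE_SIZE;

        if (page_count(page) == 1) {
                /* The stack released its references: reuse the page
                 * and its existing DMA mapping.
                 */
                *dma = qp->dma;
                return page;
        }

        /* Still referenced elsewhere: unmap and release, the caller
         * allocates a fresh page instead.
         */
        dma_unmap_page(dev, qp->dma, PAGE_SIZE, DMA_FROM_DEVICE);
        put_page(page);
        return NULL;
}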
Okay, that was what I was referring to when I mentioned a "hybrid
between the mlx5 and the Intel approach". Makes sense.
Indeed, in mlx5 Striding RQ (mpwqe) we do something similar.
Our NIC (ConnectX-4 LX and newer) can write multiple _consecutive_
packets into the same RX buffer (page).
AFAIU, this is what Eric suggests doing in SW in mlx4.
Here are the main characteristics of our page-cache in mlx5:
1) FIFO (for higher chances of an available page).
2) If the page-cache head is busy, it is not freed. This has its pros
and cons; we might reconsider.
3) Pages in the cache have no page-to-WQE assignment (WQE stands for
Work Queue Element, i.e. RX descriptor). They are shared by all WQEs
of an RQ and might be used by different WQEs in different rounds.
4) The cache size is smaller than suggested; we would happily increase
it to reflect a whole ring.
Still, performance tests over mlx5 show that under high load we quickly
end up allocating pages, as the stack does not release its references
in time. Increasing the cache size helps, of course.
As there's no _fixed_ fair size that guarantees the availability of
pages every ring cycle, sizing the cache to match the ring can help,
and would give users the opportunity to tune their performance by
setting their ring size according to how powerful their CPUs are and
what traffic type/load they're running.
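For illustration, a rough sketch of the get path of such a FIFO cache
(not the actual mlx5 code; the names and the 256-entry size are made
up). It shows point 2) above: a busy head simply blocks recycling, it
is not freed:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/types.h>

struct page_cache_entry {
        struct page *page;
        dma_addr_t dma;
};

struct rx_page_cache {
        struct page_cache_entry entries[256];   /* ideally sized to the ring */
        u32 head;                               /* oldest entry (FIFO) */
        u32 tail;
};

/* Pop the oldest cached page, but only if the stack has released it. */
static bool page_cache_get(struct rx_page_cache *c,
                           struct page **page, dma_addr_t *dma)
{
        struct page_cache_entry *e;

        if (c->head == c->tail)
                return false;           /* cache empty */

        e = &c->entries[c->head];
        if (page_count(e->page) != 1)
                return false;           /* head busy: left in place, not freed */

        *page = e->page;
        *dma = e->dma;
        c->head = (c->head + 1) % ARRAY_SIZE(c->entries);
        return true;
}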
Note that we would have received 4096 frames before looking at the page
count, so there is a high chance both halves were consumed.
To recap on x86:
2048 active pages would be visible to the device, because the 4096 RX
descriptors would contain DMA addresses pointing to the 4096 halves.
And 2048 pages would be in the reserve.
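Spelling out the arithmetic (assuming 4 KB pages and the 4096-entry RX
ring discussed here; the macro names are made up):

#define RX_RING_SIZE    4096                    /* RX descriptors */
#define FRAG_SIZE       (PAGE_SIZE / 2)         /* 2048 B, one half-page per descriptor */
#define ACTIVE_PAGES    (RX_RING_SIZE / 2)      /* 2048 pages with halves posted to the device */
#define RESERVE_PAGES   ACTIVE_PAGES            /* another 2048 parked in the quarantine */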
The buffer info layout for something like that would probably be
pretty interesting. Basically you would be doubling up the ring so
that you handle 2 Rx descriptors per single buffer info, since you
would automatically know that it would be an even/odd setup in terms
of the buffer offsets.
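Something like this hypothetical layout, just to make the even/odd idea
concrete (not taken from any existing driver; all names are made up):

#include <linux/mm.h>
#include <linux/types.h>

struct rx_buffer_info {
        struct page *page;      /* one page shared by two descriptors */
        dma_addr_t dma;
};

struct rx_ring {
        struct rx_buffer_info *buffer_info;     /* ring_size / 2 entries */
        u16 ring_size;                          /* number of RX descriptors */
};

/* Descriptors 2n and 2n+1 share buffer_info[n]: even descriptors use
 * the first half of the page, odd descriptors the second half.
 */
static inline struct rx_buffer_info *rx_desc_to_buffer(struct rx_ring *ring,
                                                       unsigned int desc_idx)
{
        return &ring->buffer_info[desc_idx >> 1];
}

static inline unsigned int rx_desc_page_offset(unsigned int desc_idx)
{
        return (desc_idx & 1) ? PAGE_SIZE / 2 : 0;
}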
If you get a chance to do something like that, I would love to know the
result. Otherwise, if I get a chance, I can try messing with i40e or
ixgbe some time and see what kind of impact it has.
The whole idea behind using only half the page per descriptor is to
allow us to loop through the ring before we end up reusing the page
again. That buys us enough time that, usually, the stack has consumed
the frame before we need it again.
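A simplified sketch of that half-page alternation (the real Intel
drivers track a pagecnt_bias rather than comparing against 1; the
struct and helpers below are made up for illustration):

#include <linux/mm.h>
#include <linux/types.h>

struct half_page_rx_buffer {
        struct page *page;
        unsigned int page_offset;       /* 0 or PAGE_SIZE / 2 */
};

/* The page can go around again only once the stack has dropped the
 * reference it took on the half handed up last time.
 */
static bool rx_buffer_can_reuse(const struct half_page_rx_buffer *buf)
{
        return page_count(buf->page) == 1;
}

/* Flip to the other half on every refill; by the time this half comes
 * up again the ring has made a full cycle, which is the breathing room
 * described above.
 */
static void rx_buffer_flip(struct half_page_rx_buffer *buf)
{
        buf->page_offset ^= PAGE_SIZE / 2;
}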
The same will happen really.
Best maybe is for me to send the patch ;)
I think I have the idea now. However, patches are always welcome. :-)
Same here :-)