On Mon, 4 Jun 2018 at 11:44, Mykyta Iziumtsev <mykyta.iziumt...@linaro.org> wrote:
>
> Hi Björn and Magnus,
>
> Here is a short update on the proposed changes to AF_XDP.
>
> During the implementation phase I figured out that 'end-of-page' in the RX descriptor's flags won't work, unfortunately. This is because the kernel driver can't know whether the frame will be the last one on the page at the time the RX descriptor is being filled in. Imagine that the driver is processing a frame on a page, and there are still 1000 bytes of unutilized memory beyond this frame. If the next frame (which hasn't yet arrived) is smaller than that -- the NIC can place it on the same page; otherwise the NIC will skip the remaining 1000 bytes and take the next page. Of course, the driver could update the RX descriptor retrospectively, or use 'empty' RX descriptors just to indicate page completion, but that is too complex or inefficient and as such doesn't look good.
>
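For reference, here is a rough application-side sketch of the second option above (the kernel posting 'empty', i.e. zero-length, RX descriptors to indicate page completion). It borrows the xdp_desc layout proposed further down in this thread; the END_OF_PAGE flag bit, the helper functions and the ring-access details are invented for illustration and are not part of any proposed uapi:

#include <linux/types.h>

/* Proposed descriptor layout (quoted further down in this thread). */
struct xdp_desc {
        __u64 offset;
        __u16 headroom;
        __u16 len;
        __u8 flags;
        __u8 padding[3];
};

#define XDP_DESC_END_OF_PAGE (1 << 0)   /* hypothetical flag bit */

/* Hypothetical application helpers. */
extern void handle_frame(__u64 offset, __u16 headroom, __u16 len);
extern void return_page_to_app(__u64 page_offset);

static void drain_rx_ring(struct xdp_desc *ring, __u32 mask,
                          __u32 *cons, __u32 prod, __u64 page_size)
{
        while (*cons != prod) {
                struct xdp_desc *d = &ring[(*cons)++ & mask];

                if (d->len)
                        handle_frame(d->offset, d->headroom, d->len);

                /* A zero-length descriptor (or, equivalently, a
                 * retrospectively set end-of-page flag) tells the
                 * application that the kernel is done with this page. */
                if (!d->len || (d->flags & XDP_DESC_END_OF_PAGE))
                        return_page_to_app(d->offset & ~(page_size - 1));
        }
}

From the application's point of view both variants look the same: a descriptor that closes out a page and hands its ownership back.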
I think both "update RX descriptor retrospectively" and "zero-length end-of-chunk descriptor" are OK (as long as the zero-length descs are much less common than the non-zero ones). Out of curiosity, does Chelsio (or any other type-writer-based NIC) have a public data sheet? It would be interesting to see how this is solved in practice.

> What I could propose instead is to make the 'filling' and 'page completion' symmetric. That is, UMEM will have one queue to pass pages from the application to the kernel driver (fring in your terminology) and one to indicate complete pages back to the application.
>
> That 'page completion' queue can be added as a third queue to UMEM, or alternatively TX completions can be decoupled from UMEM to make it a simple 'memory area' object with two directional queues which work with RX pages only. TX completions can then be handled in a simpler manner as part of the TX queues: just make sure the consumer index is advanced only when the frame is finally sent out. The TX completion processing can then be done by observing consumer index advances from the application. That would not only be more in line with best NIC ring design practices but would also improve cache locality and eliminate the critical section if TX queues are assigned to different CPUs.
>

...and this is also a valid option. Adding an additional umem ring for end-of-chunk completions might make sense, but let the zero-copy implementations drive that requirement.

As you probably noted, Magnus and I just sent out a patch set with an updated descriptor layout. A next step would be adding support for type-writer-style NICs.

Thanks for the information, and hopefully we'll see an additional AF_XDP zero-copy implementation on the list (nudge, nudge). ;-)


Björn

> With best regards,
> Mykyta
>
> On 22 May 2018 at 14:09, Björn Töpel <bjorn.to...@gmail.com> wrote:
> > 2018-05-22 9:45 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumt...@linaro.org>:
> >> On 21 May 2018 at 20:55, Björn Töpel <bjorn.to...@gmail.com> wrote:
> >>> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumt...@linaro.org>:
> >>>> Hi Björn and Magnus,
> >>>>
> >>>> (This thread is a follow-up to a private dialogue. The intention is to let the community know that AF_XDP can be enhanced further to make it compatible with a wider range of NIC vendors.)
> >>>
> >>> Mykyta, thanks for doing the write-up and sending it to the netdev list! The timing could not be better -- we need to settle on an uapi that works for all vendors prior to enabling it in the kernel.
> >>>
> >>>> There are two NIC variations which don't fit well with the current AF_XDP proposal.
> >>>>
> >>>> The first variation is represented by some NXP and Cavium NICs. AF_XDP expects the NIC to put incoming frames into slots defined by the UMEM area. Here the slot size is set in XDP_UMEM_REG xdp_umem.reg.frame_size and the slots available to the NIC are communicated to the kernel via the UMEM fill queue. While Intel NICs support only one slot size, NXP and Cavium support multiple slot sizes to optimize memory usage (e.g. { 128, 512, 2048 }, please refer to [1] for the rationale). On frame reception the NIC pulls a slot from the best-fit pool based on the frame size.
> >>>>
> >>>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope NICs. As shown above, AF_XDP expects the NIC to put incoming frames at predefined addresses. This is not the case here, as the NIC is in charge of placing frames in memory based on its own algorithm.
> >>>> For example, Chelsio T5/T6 expects to get whole pages from the driver and puts incoming frames on the page in a 'tape recorder' fashion. Assuming 'base' is the page address and frame_len[N] is an array of frame lengths, the frame placement in memory will look like this:
> >>>>
> >>>>   base + 0
> >>>>   base + frame_len[0]
> >>>>   base + frame_len[0] + frame_len[1]
> >>>>   base + frame_len[0] + frame_len[1] + frame_len[2]
> >>>>   ...
> >>>>
> >>>> To better support these two NIC variations I suggest abandoning the 'frame size' structuring in UMEM and sticking with 'pages' instead. The NIC kernel driver is then responsible for splitting the provided pages into the slots expected by the underlying HW (or not splitting at all in the case of Chelsio/Netcope).
> >>>
> >>> Let's call the first variation "multi-pool" and the second "type-writer" for simplicity. The type-writer model is very interesting, and makes a lot of sense when the PCIe bus is a constraint.
> >>>
> >>>> On XDP_UMEM_REG the application needs to specify page_size. Then the application can pass empty pages to the kernel driver using the UMEM 'fill' queue by specifying the page offset within the UMEM area. The xdp_desc format needs to be changed as well: the frame location will be defined by an offset from the beginning of the UMEM area instead of a frame index. As the payload headroom can vary with AF_XDP we'll need to specify it in xdp_desc as well. Besides that, it could be worth considering changing the payload length to u16, as 64k+ frames aren't very common in networking. The resulting xdp_desc would look like this:
> >>>>
> >>>> struct xdp_desc {
> >>>>         __u64 offset;
> >>>>         __u16 headroom;
> >>>>         __u16 len;
> >>>>         __u8 flags;
> >>>>         __u8 padding[3];
> >>>> };
> >>>
> >>> Let's expand a bit here: instead of passing indices to fixed-size frames to the fill ring, we now pass a memory region. For simplicity, let's say a page. The fill ring descriptor requires offset and len. The offset is a global offset from a UMEM perspective, and len is the size of the region.
> >>
> >> I would rather stick with the region being equal to a page (regular or huge page, defined by the application). The page size can be extracted from the vm_area_struct in XDP_UMEM_REG (preferred) or configured by the application.
> >
> > Ok, thinking more about it I prefer this as well. This means that we only need to grow the UMEM fring/cring descriptors to u64, and not care about length. As you state below, this makes the validation simple.
> >
> > We might consider exposing a "page size hint" that the user can set. For a 4G hugepage scenario, it might make sense to have a chunk *less* than 4G to avoid HW Rx memory running low when the end of a chunk is approaching.
> >
> > As for THP, I need to think about the proper behavior here.
> >
> >>> On the Rx ring the descriptor must, as you wrote, be changed as well, per your suggestion above. Note that headroom is still needed, since XDP can change the size of a packet, so the fixed headroom stated in the UMEM registration is not sufficient.
> >>>
> >>> This model is obviously more flexible, but then also a bit more complex. E.g. for a fixed-frame NIC (like ixgbe), what should the behaviour be? Should the fill ring entry be used for only *one* frame, or chopped up into multiple receive frames? Should it be configurable? Driver-specific?
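To make the "chopped up into multiple receive frames" option above concrete, here is a hypothetical driver-side sketch; the function name, the signature and the way HW buffers are posted are invented, and a real driver would feed the resulting addresses to its HW RX ring:

#include <linux/types.h>

/* Split one page-sized fill ring entry into fixed-size HW RX buffers, as a
 * fixed-frame NIC driver (ixgbe-like) might do. Returns the number of HW
 * buffer addresses produced; posting them to the HW ring is left out. */
static __u32 chop_fill_entry(__u64 page_offset, __u32 page_size,
                             __u32 frame_size, __u64 *hw_addrs, __u32 max)
{
        __u32 i, n = page_size / frame_size;

        if (n > max)
                n = max;
        for (i = 0; i < n; i++)
                hw_addrs[i] = page_offset + (__u64)i * frame_size;

        return n;
}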
> >> I think driver-specific makes the most sense here. In the case of a fixed-frame NIC, the driver shall chop the ring entry into multiple receive frames.
> >
> > Let's start there, keeping the configuration space small.
> >
> >>> Also, validating the entries in the fill queue requires more work (compared to the current index variant). Currently, we only skip invalid indices. What should we do when, say, you pass a memory window that is too narrow (say 128B) but correct from a UMEM perspective? Going this path, we need to add pretty hard constraints so we don't end up with too complex code -- because then it'll be too slow.
> >>
> >> If we stick with pages -- the only possible erroneous input will be 'page out of UMEM boundaries'. The validation will be essentially:
> >>
> >>   if ((offset > umem->size) || (offset & (umem->page_size - 1)))
> >>           fail
> >>
> >> The question is what shall be done if validation fails? Would SEGFAULT be reasonable? This is more or less equivalent to dereferencing an invalid pointer.
> >
> > The current scheme is simply dropping: the kernel skips the invalid fill ring entry. SIGSEGV is an interesting idea!
> >
> >>>> In the current proposal you have a notion of 'frame ownership': 'owned by kernel' or 'owned by application'. The ownership is transferred by means of enqueueing the frame index in the UMEM 'fill' queue (from application to kernel) or in the UMEM 'tx completion' queue (from kernel to application). If you decide to adopt the 'page' approach this notion needs to be changed a bit. This is because in the case of packet forwarding one and the same page can be used for RX (parts of it enqueued in HW 'free lists') and TX (forwarding of previously RXed packets).
> >>>>
> >>>> I propose to define 'ownership' as the right to manipulate the partitioning of the page into frames. Whenever the application passes a page to the kernel via the UMEM 'fill' queue -- the ownership is transferred to the kernel. The application can't allocate packets on this page until the kernel is done with it, but it can change the payload of RXed packets before forwarding them. The kernel can pass ownership back by means of 'end-of-page' in xdp_desc.flags.
> >>>
> >>> I like the end-of-page mechanism.
> >>>
> >>>> The pages are definitely supposed to be recycled sooner or later. Even if it's not part of the kernel API and the housekeeping implementation resides completely in the application, I would still like to propose a possible (hopefully cost-efficient) solution to that. The recycling could be achieved by keeping a refcount on pages and recycling a page only when it's owned by the application and the refcount reaches 0.
> >>>>
> >>>> Whenever the application transfers page ownership to the kernel the refcount shall be initialized to 0. With each incoming RX xdp_desc the corresponding page needs to be identified (xdp_desc.offset >> PAGE_SHIFT) and the refcount incremented. When the packet gets freed the refcount shall be decremented. If a packet is forwarded in a TX xdp_desc -- the refcount gets decremented only on TX completion (again, tx_completion.desc >> PAGE_SHIFT). For packets originating from the application itself the payload buffers need to be allocated from an empty page owned by the application and the refcount needs to be incremented as well.
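A minimal application-side sketch of that bookkeeping, assuming a page-granular UMEM; all structure, constant and helper names below are illustrative and not part of any proposed uapi:

#include <linux/types.h>
#include <stdbool.h>

#define UMEM_PAGE_SHIFT 12      /* example: 4k pages */
#define UMEM_PAGE_MASK  (~((__u64)(1 << UMEM_PAGE_SHIFT) - 1))

struct umem_page {
        __u32 refcnt;           /* frames on this page still in flight */
        bool owned_by_kernel;   /* set when posted to the fill queue */
};

extern void recycle_page(__u64 page_offset);    /* back to the app's free pool */

/* Call for every incoming RX xdp_desc and for every app-originated TX frame. */
static void page_get(struct umem_page *pages, __u64 offset)
{
        pages[offset >> UMEM_PAGE_SHIFT].refcnt++;
}

/* Call when a packet is freed, or on TX completion for forwarded packets. */
static void page_put(struct umem_page *pages, __u64 offset)
{
        struct umem_page *p = &pages[offset >> UMEM_PAGE_SHIFT];

        if (--p->refcnt == 0 && !p->owned_by_kernel)
                recycle_page(offset & UMEM_PAGE_MASK);
}

/* Call when the kernel signals 'end-of-page' for this page. */
static void page_returned(struct umem_page *pages, __u64 offset)
{
        struct umem_page *p = &pages[offset >> UMEM_PAGE_SHIFT];

        p->owned_by_kernel = false;
        if (p->refcnt == 0)
                recycle_page(offset & UMEM_PAGE_MASK);
}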
> >>>
> >>> Strictly speaking, we're not enforcing correctness in the current solution. If the userspace application passes index 1 multiple times to the fill ring, and at the same time sends index 1, things will break. So, with the existing solution the userspace application *still* needs to track the frames. With this new model, the tracking/refcounting will be more complex. That might be a concern.
> >>>
> >>> For the multi-pool NICs I think we can still just have one UMEM, and let the driver decide in which pool to place a certain chunk of memory. Thoughts?
> >>
> >> Definitely agree with that. This is HW specific and exposing it to the application would only harm portability.
> >
> > Good stuff, we're on the same page then.
> >
> >>> Now, how do we go forward? I think this is very important, and I will hack a copy-mode version for this new API. I'm a bit worried that the complexity/configuration space will grow and impact performance, but let's see.
> >>>
> >>> To prove that the API works for the NICs you mentioned, we need an actual zero-copy implementation for them. Do you think Linaro could work on a zero-copy variant for any of the NICs above?
> >>
> >> Linaro will definitely contribute a zero-copy implementation for some ARM-based NICs with the 'multi-pool' variation.
> >
> > Very nice!
> >
> >> Implementation of the 'type-writer' variation is up to Chelsio/Netcope; we only try to come up with an API which (most probably) will fit them as well.
> >
> > Let's hope we get an implementation from these vendors as well! :-)
> >
> > Björn
> >
> >>> Again thanks for bringing this up!
> >>> Björn
> >>>
> >>>> [1] https://en.wikipedia.org/wiki/Internet_Mix
> >>>>
> >>>> With best regards,
> >>>> Mykyta
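As a closing aside on the multi-pool variation discussed above: keeping the pool handling entirely in the driver could look roughly like the sketch below, which chops a fill ring page into slots for whichever HW pool is running lowest. The pool sizes come from the { 128, 512, 2048 } example earlier in the thread; everything else (names, helpers, the policy itself) is made up for illustration:

#include <linux/types.h>

static const __u32 pool_slot_size[] = { 128, 512, 2048 };
#define NUM_POOLS 3

/* Hypothetical driver internals. */
extern __u32 pool_free_slots(int pool);            /* HW free-list fill level */
extern void pool_post_slot(int pool, __u64 addr);  /* hand one slot to the HW */

/* Give a whole fill ring page to whichever pool is running lowest, chopped
 * into that pool's slot size. The application never sees the pools; it only
 * posts pages. */
static void post_page_to_pools(__u64 page_offset, __u32 page_size)
{
        int pool = 0, i;
        __u64 addr, end = page_offset + page_size;

        for (i = 1; i < NUM_POOLS; i++)
                if (pool_free_slots(i) < pool_free_slots(pool))
                        pool = i;

        for (addr = page_offset; addr + pool_slot_size[pool] <= end;
             addr += pool_slot_size[pool])
                pool_post_slot(pool, addr);
}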