On 21 May 2018 at 20:55, Björn Töpel <bjorn.to...@gmail.com> wrote:
> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumt...@linaro.org>:
>> Hi Björn and Magnus,
>>
>> (This thread is a follow-up to a private dialogue. The intention is to
>> let the community know that AF_XDP can be enhanced further to make it
>> compatible with a wider range of NIC vendors.)
>>
>
> Mykyta, thanks for doing the write-up and sending it to the netdev
> list! The timing could not be better -- we need to settle on a uapi
> that works for all vendors prior to enabling it in the kernel.
>
>> There are two NIC variations which don't fit well with the current
>> AF_XDP proposal.
>>
>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
>> expects the NIC to put incoming frames into slots defined by the UMEM
>> area. Here the slot size is set via XDP_UMEM_REG (xdp_umem_reg.frame_size)
>> and the slots available to the NIC are communicated to the kernel via
>> the UMEM fill queue. While Intel NICs support only one slot size, NXP
>> and Cavium support multiple slot sizes to optimize memory usage (e.g.
>> { 128, 512, 2048 }, please refer to [1] for the rationale). On frame
>> reception the NIC pulls a slot from the best-fit pool based on the
>> frame size.
>>
>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
>> NICs. As shown above, AF_XDP expects the NIC to put incoming frames at
>> predefined addresses. This is not the case here, as the NIC is in
>> charge of placing frames in memory based on its own algorithm. For
>> example, Chelsio T5/T6 expects to get whole pages from the driver
>> and puts incoming frames on the page in a 'tape recorder' fashion.
>> Assuming 'base' is the page address and frame_len[] is an array of
>> frame lengths, the frame placement in memory will look like this:
>> base + 0
>> base + frame_len[0]
>> base + frame_len[0] + frame_len[1]
>> base + frame_len[0] + frame_len[1] + frame_len[2]
>> ...
>>
>> To better support these two NIC variations I suggest abandoning the
>> 'frame size' structuring in UMEM and sticking with 'pages' instead.
>> The NIC kernel driver is then responsible for splitting the provided
>> pages into the slots expected by the underlying HW (or not splitting
>> at all in the case of Chelsio/Netcope).
>>
>
> Let's call the first variation "multi-pool" and the second
> "type-writer" for simplicity. The type-writer model is very
> interesting, and makes a lot of sense when the PCIe bus is a
> constraint.
>
>> On XDP_UMEM_REG the application needs to specify page_size. Then the
>> application can pass empty pages to the kernel driver using the UMEM
>> 'fill' queue by specifying the page offset within the UMEM area. The
>> xdp_desc format needs to be changed as well: the frame location will
>> be defined by an offset from the beginning of the UMEM area instead of
>> a frame index. As the payload headroom can vary with AF_XDP, we'll
>> need to specify it in xdp_desc as well. Besides that, it could be
>> worth considering changing the payload length to u16, as 64k+ frames
>> aren't very common in networking. The resulting xdp_desc would look
>> like this:
>>
>> struct xdp_desc {
>>         __u64 offset;
>>         __u16 headroom;
>>         __u16 len;
>>         __u8 flags;
>>         __u8 padding[3];
>> };
>>
>
> Let's expand a bit here: instead of passing indices of fixed-size
> frames to the fill ring, we now pass a memory region. For simplicity,
> let's say a page. The fill ring descriptor requires offset and
> len. The offset is a global offset from a UMEM perspective, and len
> is the size of the region.
>
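
For concreteness, a fill ring entry under this model could look roughly
like the sketch below. The struct name and exact layout are purely
illustrative, nothing here is part of any posted patch:

struct xdp_fill_desc {
        __u64 offset;   /* global offset of the region within the UMEM area */
        __u32 len;      /* size of the region, e.g. one page */
        __u32 padding;
};

The kernel driver would then carve RX frames out of such a region in
whatever way the underlying HW requires.
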
I would rather stick with the region being equal to a page (regular or
huge page, defined by the application). The page size can be extracted
from the vm_area_struct in XDP_UMEM_REG (preferred) or configured by
the application.

> On the Rx ring the descriptor, as you wrote, must be changed as well
> to your suggestion above. Note that headroom is still needed, since
> XDP can change the size of a packet, so the fixed headroom stated in
> the UMEM registration is not sufficient.
>
> This model is obviously more flexible, but then also a bit more
> complex. E.g. for a fixed-frame NIC (like ixgbe), what should the
> behaviour be? Should the fill ring entry be used only for *one* frame,
> or chopped up into multiple receive frames? Should it be configurable?
> Driver specific?

I think driver-specific makes the most sense here. In the case of a
fixed-frame NIC the driver shall chop the ring entry into multiple
receive frames.

> Also, validating the entries in the fill queue requires more work
> (compared to the current index variant). Currently, we only skip
> invalid indices. What should we do when, say, you pass a memory window
> that is too narrow (say 128B) but correct from a UMEM perspective?
> Going down this path, we need to add pretty hard constraints so we
> don't end up with too complex code -- because then it'll be too slow.

If we stick with pages, the only possible erroneous input will be a
'page out of UMEM boundaries'. The validation will be essentially:

        if ((offset >= umem->size) || (offset & (umem->page_size - 1)))
                fail;

The question is what shall be done if validation fails? Would a SEGFAULT
be reasonable? This is more or less equivalent to dereferencing an
invalid pointer.

>
>> In the current proposal you have a notion of 'frame ownership': 'owned
>> by kernel' or 'owned by application'. The ownership is transferred by
>> means of enqueueing the frame index in the UMEM 'fill' queue (from
>> application to kernel) or in the UMEM 'tx completion' queue (from
>> kernel to application). If you decide to adopt the 'page' approach
>> this notion needs to be changed a bit. This is because in the case of
>> packet forwarding one and the same page can be used for RX (parts of
>> it enqueued in HW 'free lists') and TX (forwarding of previously RXed
>> packets).
>>
>> I propose to define 'ownership' as the right to manipulate the
>> partitioning of the page into frames. Whenever the application passes
>> a page to the kernel via the UMEM 'fill' queue, the ownership is
>> transferred to the kernel. The application can't allocate packets on
>> this page until the kernel is done with it, but it can change the
>> payload of RXed packets before forwarding them. The kernel can pass
>> ownership back by means of 'end-of-page' in xdp_desc.flags.
>>
>
> I like the end-of-page mechanism.
>
>> The pages are definitely supposed to be recycled sooner or later. Even
>> if it's not part of the kernel API and the housekeeping implementation
>> resides completely in the application, I would still like to propose a
>> possible (hopefully cost-efficient) solution to that. The recycling
>> could be achieved by keeping a refcount on pages and recycling a page
>> only when it's owned by the application and the refcount reaches 0.
>>
>> Whenever the application transfers page ownership to the kernel the
>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
>> corresponding page needs to be identified (xdp_desc.offset >>
>> PAGE_SHIFT) and the refcount incremented. When the packet gets freed
>> the refcount shall be decremented.
>> If a packet is forwarded in a TX xdp_desc, the refcount gets
>> decremented only on TX completion (again, tx_completion.desc >>
>> PAGE_SHIFT). For packets originating from the application itself, the
>> payload buffers need to be allocated from an empty page owned by the
>> application and the refcount needs to be incremented as well.
>>
>
> Strictly speaking, we're not enforcing correctness in the current
> solution. If the userspace application passes index 1 multiple times
> to the fill ring, and at the same time sends index 1, things will
> break. So, with the existing solution the userspace application
> *still* needs to track the frames. With this new model, the
> tracking/refcounting will be more complex. That might be a concern.
>
> For the multi-pool NICs I think we can still just have one UMEM, and
> let the driver decide in which pool to place a given chunk of memory.
> Thoughts?

Definitely agree with that. This is HW-specific and exposing it to the
application would only harm portability.

> Now, how do we go forward? I think this is very important, and I will
> hack a copy-mode version for this new API. I'm a bit worried that the
> complexity/configuration space will grow and impact performance, but
> let's see.
>
> To prove that the API works for the NICs you mentioned, we need an
> actual zero-copy implementation for them. Do you think Linaro could
> work on a zero-copy variant for any of the NICs above?
>

Linaro will definitely contribute a zero-copy implementation for some
ARM-based NICs with the 'multi-pool' variation. Implementing the
'type-writer' variation is up to Chelsio/Netcope; we only try to come
up with an API which (most probably) will fit them as well.

> Again thanks for bringing this up!
> Björn
>
>
>
>> [1] https://en.wikipedia.org/wiki/Internet_Mix
>>
>> With best regards,
>> Mykyta
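
P.S. To make the refcounting scheme above a bit more concrete, here is a
rough sketch of the application-side housekeeping it implies. All names
are made up for illustration; none of this is part of the proposed uapi:

#include <stdbool.h>

/* One entry per page in the UMEM area. */
struct umem_page_state {
        unsigned int refcnt;    /* frames on this page still in flight */
        bool owned_by_kernel;   /* set while the kernel partitions the page */
};

/* The application hands an empty page to the kernel via the 'fill' queue. */
static void page_give_to_kernel(struct umem_page_state *p)
{
        p->refcnt = 0;
        p->owned_by_kernel = true;
}

/* Called per RX descriptor; the page is identified via
 * xdp_desc.offset >> PAGE_SHIFT, end_of_page comes from xdp_desc.flags. */
static void page_on_rx_frame(struct umem_page_state *p, bool end_of_page)
{
        p->refcnt++;
        if (end_of_page)
                p->owned_by_kernel = false;     /* ownership back to the app */
}

/* Called when the application allocates a TX buffer on an empty page it
 * owns (packets originating from the application itself). */
static void page_on_tx_alloc(struct umem_page_state *p)
{
        p->refcnt++;
}

/* Called when a frame is freed, or on TX completion for forwarded frames
 * (page identified via tx_completion.desc >> PAGE_SHIFT). Returns true
 * when the page can be recycled, i.e. put back into the 'fill' queue. */
static bool page_put_frame(struct umem_page_state *p)
{
        return --p->refcnt == 0 && !p->owned_by_kernel;
}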