Posting for discussion....

Now that XDP seems to be nicely gaining traction, we can start to consider the next logical step, which is to apply the principles of XDP to accelerating transport protocols in the kernel. For lack of a better name I'll refer to this as Transport eXpress Data Path, or just TXDP :-). Pulling off TXDP might not be the most trivial of problems to solve, but if we can, it may address the performance gap between kernel bypass and the stack for transport layer protocols (XDP addresses the performance gap for stateless packet processing). The problem statement is analogous to the one we had for XDP: can we create a mode in the kernel that offers the same performance seen with L4 protocols over kernel bypass (e.g. TCP/OpenOnload or TCP/DPDK), or perhaps something reasonably close to a full HW offload solution (such as RDMA)?

TXDP is different from XDP in that we are dealing with stateful protocols and it is part of a full protocol implementation; specifically, this would be an accelerated mode for transport connections (e.g. TCP) in the kernel. Also, unlike XDP, we now need to be concerned with the transmit path (both the application generating packets as well as protocol-sourced packets like ACKs, retransmits, clocking out data, etc.). Another distinction is that the user API needs to be considered; for instance, optimizing the nominal protocol stack but then using an unmodified socket interface could easily undo the effects of optimizing the lower layers. This last point actually implies a nice constraint: if we can't keep the accelerated path simple, it's probably not worth trying to accelerate.

One simplifying assumption we might make is that TXDP is primarily for optimizing latency, specifically request/response type operations (think HPC, HFT, flash server, or other tightly coupled communications within the datacenter). Notably, I don't think that saving CPU is as relevant to TXDP; in fact, we have already seen that CPU utilization can be traded off for lower latency via spin polling. Similar to XDP though, we might assume that single CPU performance is relevant (i.e. on a cache server we'd like to spin as few CPUs as needed and no more to handle the load and maintain throughput and latency requirements). High throughput (ops/sec) and low variance should be side effects of any design.

As with XDP, TXDP is _not_ intended to be a completely generic and transparent solution. The application may be specifically optimized for use with TXDP (for instance to implement perfect lockless silo'ing). So TXDP is not going to be for everyone, and it should be as minimally invasive to the rest of the stack as possible.

I imagine there are a few reasons why userspace TCP stacks can get good performance:

- Spin polling (we already can do this in kernel; see the sketch below)
- Lockless; I would assume that threads typically have exclusive access to a queue pair for a connection
- Minimal TCP/IP stack code
- Zero copy TX/RX
- Lightweight structures for queuing
- No context switches
- Fast data path for in order, uncongested flows
- Silo'ing between application and device queues

Not all of these have cognates in the Linux stack; for instance, we probably can't entirely eliminate context switches for a userspace application.
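For reference on the spin polling point: busy polling already exists in the kernel today (the net.core.busy_read/busy_poll sysctls and the SO_BUSY_POLL socket option), so an application can already trade CPU for lower latency before anything TXDP-specific exists. A rough userspace sketch, assuming a connected TCP socket fd and that the requested value is permitted by the sysctls/capabilities:

/* Sketch: ask receive on this socket to busy poll the device queue
 * for up to 'usecs' microseconds before sleeping. Values above the
 * net.core.busy_read sysctl may require CAP_NET_ADMIN.
 */
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif

static int enable_busy_poll(int fd, int usecs)
{
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                       &usecs, sizeof(usecs)) < 0) {
                perror("setsockopt(SO_BUSY_POLL)");
                return -1;
        }
        return 0;
}

/* e.g. enable_busy_poll(fd, 50) before entering the request/response loop */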
So with that, the components of TXDP might look something like:

RX

- Call into the TCP/IP stack with page data directly from the driver -- no skbuff allocation or interface. This is essentially provided by the XDP API, although we would need to generalize the interface to call stack functions (I previously posted patches for that). We will also need a new action, XDP_HELD?, that indicates the XDP function held the packet (put it on a socket for instance).
- Perform connection lookup. If we assume the lockless model described below, then we should be able to perform lockless connection lookup similar to the work Eric did to optimize UDP lookups for tunnel processing.
- Call a function that implements the expedited TCP/IP datapath (something like Van Jacobson's famous 80 instructions? :-) ). A rough sketch of this flow follows the RX list below.
- If there is anything funky about the packet or the connection state, or the TCP connection is not being TXDP accelerated, just return XDP_PASS so that the packet follows normal stack processing. Since we did connection lookup we could return an early demux also. Since we're already in an exception mode, this is where we might want to move packet processing to a different CPU (can be done by RPS/RFS).
- If the packet contains new data we can allocate a "mini" skbuff (talked about that at netdev) for queuing on the socket.
- If the packet is an ACK we can process it directly without ever creating an skbuff.
- There is also the possibility of avoiding the skbuff allocation for in-kernel applications. The stream parser might also be taught how to deal with raw buffers.
- If we're really ambitious we can also consider putting packets into a packet ring for user space, presuming that packets are typically in order (might be a little orthogonal to TXDP).
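To make the RX path above concrete, here is a very rough sketch of what the expedited receive hook might look like. Everything named here is hypothetical (XDP_HELD, txdp_lookup_established(), tcp_txdp_rcv()) and only for illustration; the point is just the shape of the fast path: parse the headers, do a lockless lookup, try an established-state fast path, and fall back to XDP_PASS for anything that isn't completely ordinary.

/* Hypothetical sketch only -- none of these names exist today.
 * XDP_HELD is the proposed new return code meaning "the packet was
 * consumed (e.g. queued on a socket)"; txdp_lookup_established() and
 * tcp_txdp_rcv() stand in for a lockless connection lookup and a
 * minimal established-state receive path.
 */
#include <linux/if_ether.h>
#include <linux/in.h>           /* IPPROTO_TCP */
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/filter.h>       /* struct xdp_buff, enum xdp_action */
#include <net/ip.h>             /* ip_is_fragment() */

#define XDP_HELD        (XDP_TX + 1)    /* hypothetical new XDP action */

/* hypothetical helpers, declared only to make the sketch self-contained */
struct sock *txdp_lookup_established(const struct iphdr *iph,
                                     const struct tcphdr *th);
int tcp_txdp_rcv(struct sock *sk, struct xdp_buff *xdp,
                 const struct tcphdr *th);

static int txdp_rx(struct xdp_buff *xdp)
{
        void *data = xdp->data;
        void *data_end = xdp->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph;
        struct tcphdr *th;
        struct sock *sk;

        if (data + sizeof(*eth) + sizeof(*iph) > data_end)
                return XDP_PASS;
        if (eth->h_proto != htons(ETH_P_IP))
                return XDP_PASS;

        iph = data + sizeof(*eth);
        if (iph->protocol != IPPROTO_TCP || ip_is_fragment(iph))
                return XDP_PASS;

        th = (void *)iph + iph->ihl * 4;
        if ((void *)(th + 1) > data_end)
                return XDP_PASS;

        /* Lockless lookup: only TXDP-enabled, established connections
         * owned by this CPU/queue are returned.
         */
        sk = txdp_lookup_established(iph, th);
        if (!sk)
                return XDP_PASS;

        /* Header-prediction style fast path: in order, expected window,
         * nothing unusual. On success the packet has been consumed
         * (data queued on the socket as a mini skbuff, or a pure ACK
         * processed with no skbuff at all).
         */
        if (tcp_txdp_rcv(sk, xdp, th) == 0)
                return XDP_HELD;

        return XDP_PASS;        /* anything else goes through the normal stack */
}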
TX

- Normal TX socket options apply; however, they might be lockless under the locking constraints described below.
- An skbuff is required for keeping data on the socket for TCP (not necessarily for UDP though), but we might be able to use a mini skbuff for this.
- We'd need an interface to transmit a packet in a page buffer without an skbuff.
- When we transmit, it would be nice to go straight from the TCP connection to an XDP device queue and in particular skip the qdisc layer. This follows the principle of low latency being the first criterion. Effective use of qdiscs, especially non-work-conserving ones, implies longer latencies, which likely means TXDP isn't appropriate in such cases. BQL is also out; however, we would want the TX batching of XDP.
- If we really need priority queuing we should have the option of using multiple device queues.
- Pure ACKs would not require an skb either.
- Zero copy TX (splice) can be used where needed. This might be particularly useful in a flash or remote memory server.
- TX completion would most likely be happening on the same CPU also.

Miscellaneous

- Under the simplicity principle, we really only want TXDP to contain the minimal necessary path. What is "minimal" is a very relevant question. If we constrain the use case to communications within a single rack, such that we can engineer for basically no loss and no congestion (pause might be reasonable here), then the TXDP data path might implement just that. Mostly this would just be the established data path that is accelerated; handling other states might be done in the existing path (which becomes the slow path under TXDP).
- To make transport sockets have a lockless mode, I am contemplating that connections/sockets can be bound to particular CPUs and that any operations (socket operations, timers, receive processing) must occur on that CPU. The CPU would be the one where RX happens. Note this implies perfect silo'ing: everything from driver RX to application processing happens inline on that CPU. The stack would not cross CPUs for a connection while in this mode.
- We might be able to take advantage of per-connection queues or ntuple filters to isolate flows to certain queues and hence certain CPUs. I don't think this can be a requirement though; TXDP should be able to work well with generic MQ devices. Specialized device queues are an optimization; without those we'd want to push to other CPUs as quickly as possible (RPS, but maybe we can get away with a mini skbuff here).

Thanks,
Tom