From: Willem de Bruijn <will...@google.com>

Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of a previous packet socket RFC patch

  http://patchwork.ozlabs.org/patch/413184/

On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.

The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the latter
with features required by TCP and others: reference counting to
support cloning (retransmit queue) and shared fragments (GSO), and
notification coalescing to handle corking.
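
The reference counting requirement can be pictured with a small
user-space model. This is a sketch only, not the kernel's ubuf_info:
struct zc_ubuf and its functions are hypothetical names, and kernel code
would need atomic reference counts rather than a plain integer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a completion descriptor shared by every packet
 * (original, clones, GSO segments) that references the same user pages. */
struct zc_ubuf {
	unsigned int refcnt;			/* number of live users */
	void (*complete)(struct zc_ubuf *);	/* runs once, on last put */
	unsigned int completions;		/* illustration only */
};

/* Stand-in for queuing a notification onto the socket error queue. */
void zc_ubuf_count(struct zc_ubuf *u)
{
	u->completions++;
}

/* The sending packet holds the initial reference. */
void zc_ubuf_init(struct zc_ubuf *u)
{
	assert(u != NULL);
	u->refcnt = 1;
	u->complete = zc_ubuf_count;
	u->completions = 0;
}

/* Taken when a packet is cloned or its fragments are shared. */
void zc_ubuf_get(struct zc_ubuf *u)
{
	assert(u != NULL && u->refcnt > 0);
	u->refcnt++;
}

/* Dropped when a packet is freed; the last put notifies the owner. */
void zc_ubuf_put(struct zc_ubuf *u)
{
	assert(u != NULL && u->refcnt > 0);
	if (--u->refcnt == 0)
		u->complete(u);
}
```

In this model the notification fires only after both the transmitted
packet and any clone held on a retransmit queue have been released.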

Notifications are queued onto the socket error queue as a range
[N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.
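
How consecutive notifications might collapse into one range can be
sketched as follows. The struct and function names are hypothetical, and
the merge rule (extend only when the new counter value directly follows
the pending range) is an assumption for illustration, not taken from the
patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range [lo, hi] of per-socket send counter values. */
struct zc_range {
	uint32_t lo;
	uint32_t hi;
};

/* Try to fold the completion of send call 'id' into a notification
 * that is still pending on the error queue. Returns true if the range
 * was extended, false if a new notification would have to be queued.
 * The counter is 32 bits wide, so hi + 1 wraps around to 0. */
bool zc_range_coalesce(struct zc_range *pending, uint32_t id)
{
	assert(pending != NULL);
	if (id != (uint32_t)(pending->hi + 1))
		return false;
	pending->hi = id;
	return true;
}
```

With a pending range [5, 5], completions for calls 6 and 7 extend it to
[5, 7], so the process reads one notification instead of three.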

* Performance

The table below shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The last three columns
show cycles spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of 3 runs. std
is standard netperf, zc uses zerocopy and % is the ratio zc/std.

NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -- -m $size

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

        --process cycles--      ----cpu cycles----
           std      zc    %        std      zc   %
4K      11,060   5,615   51     20,517  19,694  96
16K      8,706   2,045   23     17,913  15,549  86
64K      8,105   1,152   14     17,592  12,167  69
256K     8,087     926   11     16,953  11,279  66
1M       7,955     826   10     17,228  10,655  62

Perf record indicates the main source of these differences. Process
cycles only (perf record; perf report -n):

std:
 Samples: 15K of event 'cycles', Event count (approx.): 7967793182
 73.02%         11564  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  4.73%           746  netperf  [kernel.kallsyms]  [k] __memset
  2.73%           433  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
  2.41%           383  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
  0.90%           143  netperf  [kernel.kallsyms]  [k] copy_from_iter

zc:
 Samples: 1K of event 'cycles', Event count (approx.): 858290585
 17.11%           182  netperf.zc.aug2  [kernel.kallsyms]   [k] gup_pte_range
  9.31%           100  netperf.zc.aug2  [kernel.kallsyms]   [k] __memset
  7.79%            81  netperf.zc.aug2  [kernel.kallsyms]   [k] __zerocopy_sg_from_iter
  3.87%            44  netperf.zc.aug2  [kernel.kallsyms]   [k] __alloc_skb
  3.75%            18  netperf.zc.aug2  netperf.zc.aug2015  [.] allocate_buffer_ring

The individual patches report additional micro-benchmark results.


* Safety

The number of pages that can be pinned on behalf of a process with
MSG_ZEROCOPY is bounded by the locked memory ulimit.
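
A process can estimate its pinning budget before choosing zerocopy. The
sketch below uses the standard getrlimit(RLIMIT_MEMLOCK) and
sysconf(_SC_PAGESIZE) calls. The helper names are hypothetical, and how
the kernel charges pinned pages against the limit is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <unistd.h>

/* Number of pages touched by a buffer of len bytes that starts
 * off_in_page bytes into its first page. An unaligned buffer can pin
 * one page more than len / page_size suggests. */
size_t zc_pages_spanned(size_t off_in_page, size_t len, size_t page_size)
{
	assert(page_size > 0 && off_in_page < page_size);
	if (len == 0)
		return 0;
	return (off_in_page + len + page_size - 1) / page_size;
}

/* Soft locked-memory limit expressed in pages, or (size_t)-1 if the
 * limit is unlimited or cannot be determined. */
size_t zc_memlock_limit_pages(void)
{
	struct rlimit rl;
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0 ||
	    rl.rlim_cur == RLIM_INFINITY)
		return (size_t)-1;
	return (size_t)(rl.rlim_cur / (rlim_t)page);
}
```

A sender could compare the sum of zc_pages_spanned() over its
outstanding buffers against zc_memlock_limit_pages() and use a copying
send once the sum would exceed the limit.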

Pages are not mapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.

Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional filters (sk_filter, tc u32, ..) and
device and driver logic. The effect of wrong checksums is limited to the
misbehaving process. Filters may have to be addressed by inserting a
preventative skb_copy_ubufs(). Device drivers can be whitelisted,
similar to scatter-gather support (NETIF_F_SG).

Conversely, while the kernel holds process memory pinned, a process
cannot safely reuse those pages for other purposes. Some protocols,
notably TCP, may hold data for an unbounded length of time. Tun and
virtio bound latency by calling skb_copy_ubufs() before cloning and
before injecting packets into unbounded latency paths. This approach
is not feasible for TCP.

Processes can safely avoid OOM conditions by bounding the number of
bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map -- for instance, depending on
type of page, by calling munmap() or with madvise MADV_SOFT_OFFLINE or
MADV_DONTNEED. Long-lived kernel references are an anomaly and this
operation should be rare. The mechanism was suggested in the earlier
zerocopy packet socket patch.
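
Bounding the bytes passed with MSG_ZEROCOPY can be done with simple
accounting in the sender. The following is a sketch with hypothetical
names: the process reserves bytes before a zerocopy send, falls back to
a copying send when the reservation fails, and releases the bytes when
the completion notification for that send arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process (or per-socket) zerocopy byte budget. */
struct zc_budget {
	size_t limit;		/* maximum bytes pinned at any time */
	size_t inflight;	/* bytes sent but not yet completed */
};

/* Reserve len bytes for a zerocopy send. Returns false if the budget
 * would be exceeded, in which case the caller should send with a copy. */
bool zc_budget_reserve(struct zc_budget *b, size_t len)
{
	assert(b != NULL && b->inflight <= b->limit);
	if (len > b->limit - b->inflight)
		return false;
	b->inflight += len;
	return true;
}

/* Return len bytes to the budget once their completion was read. */
void zc_budget_release(struct zc_budget *b, size_t len)
{
	assert(b != NULL && len <= b->inflight);
	b->inflight -= len;
}
```

The limit would typically be chosen well below the locked memory ulimit
so that a stalled TCP connection cannot pin the whole allowance.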


* Limitations / Known Issues

- PF_INET6 and PF_UNIX are not yet supported.
- UDP/RAW/PACKET should sleep on ubuf_info alloc failure;
  they currently return ENOBUFS immediately.
- TCP does not build max GSO packets, especially for
  small send buffers (< 4 KB).

Willem de Bruijn (10):
  sock: skb_copy_ubufs support for compound pages
  sock: add generic socket zerocopy
  sock: enable generic socket zerocopy
  sock: zerocopy coalesce support
  tcp: enable MSG_ZEROCOPY
  udp: enable MSG_ZEROCOPY
  raw: enable MSG_ZEROCOPY with hdrincl
  packet: enable MSG_ZEROCOPY
  sock: RLIMIT number of pinned pages with MSG_ZEROCOPY
  test: add zerocopy tests

 drivers/vhost/net.c                           |   1 +
 include/linux/mm_types.h                      |   1 +
 include/linux/skbuff.h                        |  72 +++-
 include/linux/socket.h                        |   1 +
 include/net/sock.h                            |   2 +
 include/uapi/linux/errqueue.h                 |   1 +
 net/core/datagram.c                           |  37 +-
 net/core/skbuff.c                             | 297 ++++++++++++--
 net/core/sock.c                               |   2 +
 net/ipv4/ip_output.c                          |  30 +-
 net/ipv4/raw.c                                |  27 +-
 net/ipv4/tcp.c                                |  31 +-
 net/packet/af_packet.c                        |  44 ++-
 tools/testing/selftests/net/Makefile          |   2 +-
 tools/testing/selftests/net/snd_zerocopy.c    | 353 +++++++++++++++++
 tools/testing/selftests/net/snd_zerocopy_lo.c | 535 ++++++++++++++++++++++++++
 16 files changed, 1372 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
 create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c

-- 
2.5.0.276.gf5e568e

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html