[PATCH bpf-next 00/17] Improve BPF syscall command documentation

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

The state of the bpf(2) manual page today is not exactly ideal. For the
most part, it was written several years ago and has not kept up with the
pace of development in the kernel tree. For instance, out of a total of
~35 commands available in the BPF syscall today, when I pull the
kernel-man-pages tree I find just 6 documented commands: the very
basics of map interaction and program load.

In contrast, looking at bpf-helpers(7), I am able today to run one
command[0] to fetch API documentation of the very latest eBPF helpers
that have been added to the kernel. This documentation is up to date
because kernel maintainers enforce documenting the APIs as part of
the feature submission process. As far as I can tell, we rely on manual
synchronization from the kernel tree to the kernel-man-pages tree to
distribute these more widely, so not all locations may be completely up
to date. That said, the documentation does in fact exist in the first
place, which is the major initial hurdle to overcome.

Given the relative success of the process around bpf-helpers(7) to
encourage developers to document their user-facing changes, in this
patch series I explore applying this technique to bpf(2) as well.
Unfortunately, even with bpf(2) being so out-of-date, there is still a
lot of content to convert over. In particular, I've identified at least
the following aspects of the bpf syscall which could individually be
generated from separate documentation in the header:
* BPF syscall commands
* BPF map types
* BPF program types
* BPF attachment points

Rather than tackle everything at once, I have focused in this series on
the syscall commands, "enum bpf_cmd". This series is structured to first
import what useful descriptions there are from the kernel-man-pages
tree, then piece-by-piece document a few of the syscalls in more detail
in cases where I could find useful documentation from the git tree or
from a casual read of the code. Not all documentation is comprehensive
at this point, but a basis is provided with examples that can be further
enhanced with subsequent follow-up patches. Note that the series in its
current state only includes documentation for the syscall commands
themselves, so in the short term it doesn't allow us to automate bpf(2)
generation; only one section of the man page could be replaced. If there
is appetite for this approach, though, this should be trivial to improve
on, even if just by importing the remaining static text from the
kernel-man-pages tree.

Following that, the series enhances the Python scripting around parsing
the descriptions from the header files and generating dedicated
reStructuredText and troff output. Finally, to expose the new text and
reduce the likelihood of having it get out of date or break the docs
parser, it is added to the selftests and exposed through the kernel
documentation web pages.

At this point I'd like to put this out for comments. In my mind, the
ideal outcome of this work would be to extend the kernel UAPI headers
such that each of the categories listed above (commands, maps, progs,
hooks) has dedicated documentation in the kernel tree; developers would
be required to update the comments in the headers to document the APIs
prior to patch acceptance; and the latest version of the bpf(2) manual
pages could then be auto-generated from a few static description
sections combined with the dynamically generated output from the header.

Thanks also to Quentin Monnet for initial review.

[0]: make -C tools/bpf -f Makefile.docs bpf-helpers.7

Joe Stringer (17):
  bpf: Import syscall arg documentation
  bpf: Add minimal bpf() command documentation
  bpf: Document BPF_F_LOCK in syscall commands
  bpf: Document BPF_PROG_PIN syscall command
  bpf: Document BPF_PROG_ATTACH syscall command
  bpf: Document BPF_PROG_TEST_RUN syscall command
  bpf: Document BPF_PROG_QUERY syscall command
  bpf: Document BPF_MAP_*_BATCH syscall commands
  scripts/bpf: Rename bpf_helpers_doc.py -> bpf_doc.py
  scripts/bpf: Abstract eBPF API target parameter
  scripts/bpf: Add syscall commands printer
  tools/bpf: Rename Makefile.{helpers,docs}
  tools/bpf: Templatize man page generation
  tools/bpf: Build bpf-syscall.2 in Makefile.docs
  selftests/bpf: Add docs target
  docs/bpf: Add bpf() syscall command reference
  tools: Sync uapi bpf.h header with latest changes

 Documentation/Makefile                     |   2 +
 Documentation/bpf/Makefile                 |  28 +
 Documentation/bpf/bpf_commands.rst         |   5 +
 Documentation/bpf/index.rst                |  14 +-
 include/uapi/linux/bpf.h                   | 709 +-
 scripts/{bpf_helpers_doc.py => bpf_doc.py} | 189 -
 tools/bpf/Makefile.docs                    |  88 +++
 tools/bpf/Makefile.helpers                 |  60 --
 tools/bpf/bpftool/Documentation/Makefile   |  12 +-
 tools/include/uapi/linux/bpf.h             | 709 +++

[PATCH bpf-next 03/17] bpf: Document BPF_F_LOCK in syscall commands

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Document the meaning of the BPF_F_LOCK flag for the map lookup/update
descriptions. Based on commit 96049f3afd50 ("bpf: introduce BPF_F_LOCK
flag").

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ac6880d7b01b..d02259458fd6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -120,6 +120,14 @@ union bpf_iter_link_info {
  * Look up an element with a given *key* in the map referred to
  * by the file descriptor *map_fd*.
  *
+ * The *flags* argument may be specified as one of the
+ * following:
+ *
+ * **BPF_F_LOCK**
+ * Look up the value of a spin-locked map without
+ * returning the lock. This must be specified if the
+ * elements contain a spinlock.
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
@@ -137,6 +145,8 @@ union bpf_iter_link_info {
  * Create a new element only if it did not exist.
  * **BPF_EXIST**
  * Update an existing element.
+ * **BPF_F_LOCK**
+ * Update a spin_lock-ed map element.
  *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
-- 
2.27.0
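
To make the **BPF_F_LOCK** semantics above concrete, here is a minimal
userspace sketch (illustrative only, not part of the patch; it assumes a
map whose value type embeds **struct bpf_spin_lock**):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct val {
            struct bpf_spin_lock lock;  /* element embeds a spinlock */
            long counter;
    };

    /* Locked read: BPF_F_LOCK asks the kernel to copy the value out
     * under the element's spinlock, without returning the lock. */
    int lookup_locked(int map_fd, int *key, struct val *out)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.map_fd = map_fd;
            attr.key    = (__u64)(unsigned long)key;
            attr.value  = (__u64)(unsigned long)out;
            attr.flags  = BPF_F_LOCK;

            return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr,
                           sizeof(attr));
    }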



[PATCH bpf-next 01/17] bpf: Import syscall arg documentation

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

These descriptions are present in the man-pages project from the
original submissions around 2015-2016. Import them so that they can be
kept up to date as developers extend the bpf syscall commands.

These descriptions follow the pattern used by scripts/bpf_helpers_doc.py
so that we can take advantage of the parser to generate more up-to-date
man pages based upon these headers.

Some minor wording adjustments were made to make the descriptions
more consistent for the description / return format.

Reviewed-by: Quentin Monnet 
Co-authored-by: Alexei Starovoitov 
Co-authored-by: Michael Kerrisk 
Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h | 119 ++-
 1 file changed, 118 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4c24daa43bac..56d7db0f3daf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -93,7 +93,124 @@ union bpf_iter_link_info {
} map;
 };
 
-/* BPF syscall commands, see bpf(2) man-page for details. */
+/* BPF syscall commands, see bpf(2) man-page for more details.
+ *
+ * The operation to be performed by the **bpf**\ () system call is determined
+ * by the *cmd* argument. Each operation takes an accompanying argument,
+ * provided via *attr*, which is a pointer to a union of type *bpf_attr* (see
+ * below). The size argument is the size of the union pointed to by *attr*.
+ *
+ * Start of BPF syscall commands:
+ *
+ * BPF_MAP_CREATE
+ * Description
+ * Create a map and return a file descriptor that refers to the
+ * map. The close-on-exec file descriptor flag (see **fcntl**\ (2))
+ * is automatically enabled for the new file descriptor.
+ *
+ * Applying **close**\ (2) to the file descriptor returned by
+ * **BPF_MAP_CREATE** will delete the map (but see NOTES).
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_MAP_LOOKUP_ELEM
+ * Description
+ * Look up an element with a given *key* in the map referred to
+ * by the file descriptor *map_fd*.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_MAP_UPDATE_ELEM
+ * Description
+ * Create or update an element (key/value pair) in a specified map.
+ *
+ * The *flags* argument should be specified as one of the
+ * following:
+ *
+ * **BPF_ANY**
+ * Create a new element or update an existing element.
+ * **BPF_NOEXIST**
+ * Create a new element only if it did not exist.
+ * **BPF_EXIST**
+ * Update an existing element.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * May set *errno* to **EINVAL**, **EPERM**, **ENOMEM**,
+ * **E2BIG**, **EEXIST**, or **ENOENT**.
+ *
+ * **E2BIG**
+ * The number of elements in the map reached the
+ * *max_entries* limit specified at map creation time.
+ * **EEXIST**
+ * If *flags* specifies **BPF_NOEXIST** and the element
+ * with *key* already exists in the map.
+ * **ENOENT**
+ * If *flags* specifies **BPF_EXIST** and the element with
+ * *key* does not exist in the map.
+ *
+ * BPF_MAP_DELETE_ELEM
+ * Description
+ * Look up and delete an element by key in a specified map.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_MAP_GET_NEXT_KEY
+ * Description
+ * Look up an element by key in a specified map and return the key
+ * of the next element. Can be used to iterate over all elements
+ * in the map.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * The following cases can be used to iterate over all elements of
+ * the map:
+ *
+ * * If *key* is not found, the operation returns zero and sets
+ *   the *next_key* pointer to the key of the first element.
+ * * If *key* is found, the operation returns zero and sets the
+ *   *next_key* pointer to the key of the next element.
+ * * If *key* is the last element, returns -1 and *errno* is set
+ *   to **ENOENT**.
+ *
+ * May set *errno* to **ENOMEM**, **EFAULT**, **EPERM**, or
+ * **EINVAL** on error.
+ *
+ * BPF_PROG_LOAD
+ * Description
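
As an illustration of the cmd/attr/size calling convention described
above (a sketch only, not part of the patch; the sys_bpf wrapper name is
ours), creating a small array map might look like:

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* glibc exposes no bpf() symbol, so invoke the syscall directly
     * with the cmd/attr/size triple. */
    static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
                       unsigned int size)
    {
            return syscall(__NR_bpf, cmd, attr, size);
    }

    int create_array_map(void)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.map_type    = BPF_MAP_TYPE_ARRAY;
            attr.key_size    = sizeof(int);  /* array keys are 4 bytes */
            attr.value_size  = sizeof(long);
            attr.max_entries = 64;

            /* On success: a new map fd with close-on-exec enabled. */
            return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    }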

[PATCH bpf-next 04/17] bpf: Document BPF_PROG_PIN syscall command

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Commit b2197755b263 ("bpf: add support for persistent maps/progs")
contains the original implementation and git logs, used as reference for
this documentation.

Also pull in the filename restriction as documented in commit 6d8cb045cde6
("bpf: comment why dots in filenames under BPF virtual FS are not allowed")

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Daniel Borkmann 
---
 include/uapi/linux/bpf.h | 34 +++---
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d02259458fd6..8301a19c97de 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -216,6 +216,22 @@ union bpf_iter_link_info {
  * Pin an eBPF program or map referred by the specified *bpf_fd*
  * to the provided *pathname* on the filesystem.
  *
+ * The *pathname* argument must not contain a dot (".").
+ *
+ * On success, *pathname* retains a reference to the eBPF object,
+ * preventing deallocation of the object when the original
 + * *bpf_fd* is closed. This allows the eBPF object to live beyond
+ * **close**\ (\ *bpf_fd*\ ), and hence the lifetime of the parent
+ * process.
+ *
+ * Applying **unlink**\ (2) or similar calls to the *pathname*
+ * unpins the object from the filesystem, removing the reference.
+ * If no other file descriptors or filesystem nodes refer to the
+ * same object, it will be deallocated (see NOTES).
+ *
+ * The filesystem type for the parent directory of *pathname* must
+ * be **BPF_FS_MAGIC**.
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
@@ -581,13 +597,17 @@ union bpf_iter_link_info {
  *
  * NOTES
  * eBPF objects (maps and programs) can be shared between processes.
- * For example, after **fork**\ (2), the child inherits file descriptors
- * referring to the same eBPF objects. In addition, file descriptors
- * referring to eBPF objects can be transferred over UNIX domain sockets.
- * File descriptors referring to eBPF objects can be duplicated in the
- * usual way, using **dup**\ (2) and similar calls. An eBPF object is
- * deallocated only after all file descriptors referring to the object
- * have been closed.
+ * * After **fork**\ (2), the child inherits file descriptors
+ *   referring to the same eBPF objects.
+ * * File descriptors referring to eBPF objects can be transferred over
+ *   **unix**\ (7) domain sockets.
+ * * File descriptors referring to eBPF objects can be duplicated in the
+ *   usual way, using **dup**\ (2) and similar calls.
+ * * File descriptors referring to eBPF objects can be pinned to the
+ *   filesystem using the **BPF_OBJ_PIN** command of **bpf**\ (2).
+ * An eBPF object is deallocated only after all file descriptors referring
+ * to the object have been closed and no references remain pinned to the
+ * filesystem or attached (for example, bound to a program or device).
  */
 enum bpf_cmd {
BPF_MAP_CREATE,
-- 
2.27.0
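
As a hedged sketch of the pinning flow described above (not part of the
patch; /sys/fs/bpf is just the conventional bpffs mount point):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Pin bpf_fd at a path on a BPF_FS_MAGIC filesystem. Per the text
     * above, the pathname must not contain a dot. */
    int pin_object(int bpf_fd, const char *path /* "/sys/fs/bpf/foo" */)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.pathname = (__u64)(unsigned long)path;
            attr.bpf_fd   = bpf_fd;

            return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
    }

After this call succeeds, closing bpf_fd no longer deallocates the
object; unlink(2) on the path drops the pinned reference again.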



[PATCH bpf-next 02/17] bpf: Add minimal bpf() command documentation

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Introduce high-level descriptions of the intent and return codes of the
bpf() syscall commands. Subsequent patches may further flesh out the
content to provide a more useful programming reference.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h | 368 +++
 1 file changed, 368 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 56d7db0f3daf..ac6880d7b01b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -201,6 +201,374 @@ union bpf_iter_link_info {
  * A new file descriptor (a nonnegative integer), or -1 if an
  * error occurred (in which case, *errno* is set appropriately).
  *
+ * BPF_OBJ_PIN
+ * Description
+ * Pin an eBPF program or map referred by the specified *bpf_fd*
+ * to the provided *pathname* on the filesystem.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_OBJ_GET
+ * Description
+ * Open a file descriptor for the eBPF object pinned to the
+ * specified *pathname*.
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_PROG_ATTACH
+ * Description
+ * Attach an eBPF program to a *target_fd* at the specified
+ * *attach_type* hook.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_PROG_DETACH
+ * Description
+ * Detach the eBPF program associated with the *target_fd* at the
+ * hook specified by *attach_type*. The program must have been
+ * previously attached using **BPF_PROG_ATTACH**.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_PROG_TEST_RUN
+ * Description
+ * Run an eBPF program a number of times against a provided
+ * program context and return the modified program context and
+ * duration of the test run.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_PROG_GET_NEXT_ID
+ * Description
+ * Fetch the next eBPF program currently loaded into the kernel.
+ *
+ * Looks for the eBPF program with an id greater than *start_id*
+ * and updates *next_id* on success. If no other eBPF programs
+ * remain with ids higher than *start_id*, returns -1 and sets
+ * *errno* to **ENOENT**.
+ *
+ * Return
+ * Returns zero on success. On error, or when no id remains, -1
+ * is returned and *errno* is set appropriately.
+ *
+ * BPF_MAP_GET_NEXT_ID
+ * Description
+ * Fetch the next eBPF map currently loaded into the kernel.
+ *
+ * Looks for the eBPF map with an id greater than *start_id*
+ * and updates *next_id* on success. If no other eBPF maps
+ * remain with ids higher than *start_id*, returns -1 and sets
+ * *errno* to **ENOENT**.
+ *
+ * Return
+ * Returns zero on success. On error, or when no id remains, -1
+ * is returned and *errno* is set appropriately.
+ *
+ * BPF_PROG_GET_FD_BY_ID
+ * Description
+ * Open a file descriptor for the eBPF program corresponding to
+ * *prog_id*.
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_MAP_GET_FD_BY_ID
+ * Description
+ * Open a file descriptor for the eBPF map corresponding to
+ * *map_id*.
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_OBJ_GET_INFO_BY_FD
+ * Description
+ * Obtain information about the eBPF object corresponding to
+ * *bpf_fd*.
+ *
+ * Populates up to *info_len* bytes of *info*, which will be in
+ * one of the following formats depending on the eBPF object type
+ * of *bpf_fd*:
+ *
+ * * **struct bpf_prog_info**
+ * * **struct bpf_map_info**
+ * * **struct bpf_btf_info**
+ * * **struct bpf_link_info**
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_PROG_QUERY
+ * Description
+ * Obtain information about eBPF programs associated with the
+ * specified
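
To make the *start_id*/*next_id* iteration contract above concrete, a
hedged userspace sketch (not part of the patch) that walks all loaded
program ids until **ENOENT**:

    #include <errno.h>
    #include <linux/bpf.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    void list_prog_ids(void)
    {
            union bpf_attr attr;
            __u32 id = 0;

            for (;;) {
                    memset(&attr, 0, sizeof(attr));
                    attr.start_id = id;
                    /* Finds the next id greater than start_id. */
                    if (syscall(__NR_bpf, BPF_PROG_GET_NEXT_ID, &attr,
                                sizeof(attr)) < 0) {
                            if (errno != ENOENT)
                                    perror("BPF_PROG_GET_NEXT_ID");
                            break;  /* ENOENT: no ids remain */
                    }
                    id = attr.next_id;
                    printf("prog id %u\n", id);
            }
    }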

[PATCH bpf-next 06/17] bpf: Document BPF_PROG_TEST_RUN syscall command

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Based on a brief read of the corresponding source code.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 603605c5ca03..86fe0445c395 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -303,14 +303,22 @@ union bpf_iter_link_info {
  *
  * BPF_PROG_TEST_RUN
  * Description
- * Run an eBPF program a number of times against a provided
- * program context and return the modified program context and
- * duration of the test run.
+ * Run the eBPF program associated with the *prog_fd* a *repeat*
+ * number of times against a provided program context *ctx_in* and
+ * data *data_in*, and return the modified program context
+ * *ctx_out*, *data_out* (for example, packet data), result of the
+ * execution *retval*, and *duration* of the test run.
  *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
  *
+ * **ENOSPC**
+ * Either *data_size_out* or *ctx_size_out* is too small.
+ * **ENOTSUPP**
+ * This command is not supported by the program type of
+ * the program referred to by *prog_fd*.
+ *
  * BPF_PROG_GET_NEXT_ID
  * Description
  * Fetch the next eBPF program currently loaded into the kernel.
-- 
2.27.0
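
A hedged sketch of a single test run as described above (not part of the
patch; buffer sizing is left to the caller):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Run prog_fd once over a canned input packet; *retval receives
     * the program's return code, out/out_len receive data_out. */
    int test_run_once(int prog_fd, void *data, __u32 data_len,
                      void *out, __u32 out_len, __u32 *retval)
    {
            union bpf_attr attr;
            int err;

            memset(&attr, 0, sizeof(attr));
            attr.test.prog_fd       = prog_fd;
            attr.test.repeat        = 1;
            attr.test.data_in       = (__u64)(unsigned long)data;
            attr.test.data_size_in  = data_len;
            attr.test.data_out      = (__u64)(unsigned long)out;
            attr.test.data_size_out = out_len;

            err = syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr,
                          sizeof(attr));
            if (!err && retval)
                    *retval = attr.test.retval;
            return err;  /* ENOSPC if out_len was too small */
    }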



[PATCH bpf-next 05/17] bpf: Document BPF_PROG_ATTACH syscall command

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Document the prog attach command in more detail, based on git commits:
* commit f4324551489e ("bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH
  commands")
* commit 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor
  socket TX/RX data")
* commit f4364dcfc86d ("media: rc: introduce BPF_PROG_LIRC_MODE2")
* commit d58e468b1112 ("flow_dissector: implements flow dissector BPF
  hook")

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Daniel Mack 
CC: John Fastabend 
CC: Sean Young 
CC: Petar Penkov 
---
 include/uapi/linux/bpf.h | 37 +
 1 file changed, 37 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8301a19c97de..603605c5ca03 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -250,6 +250,43 @@ union bpf_iter_link_info {
  * Attach an eBPF program to a *target_fd* at the specified
  * *attach_type* hook.
  *
+ * The *attach_type* specifies the eBPF attachment point to
+ * attach the program to, and must be one of *bpf_attach_type*
+ * (see below).
+ *
+ * The *attach_bpf_fd* must be a valid file descriptor for a
+ * loaded eBPF program of a cgroup, flow dissector, LIRC, sockmap
+ * or sock_ops type corresponding to the specified *attach_type*.
+ *
+ * The *target_fd* must be a valid file descriptor for a kernel
+ * object which depends on the attach type of *attach_bpf_fd*:
+ *
+ * **BPF_PROG_TYPE_CGROUP_DEVICE**,
+ * **BPF_PROG_TYPE_CGROUP_SKB**,
+ * **BPF_PROG_TYPE_CGROUP_SOCK**,
+ * **BPF_PROG_TYPE_CGROUP_SOCK_ADDR**,
+ * **BPF_PROG_TYPE_CGROUP_SOCKOPT**,
+ * **BPF_PROG_TYPE_CGROUP_SYSCTL**,
+ * **BPF_PROG_TYPE_SOCK_OPS**
+ *
+ * Control Group v2 hierarchy with the eBPF controller
+ * enabled. Requires the kernel to be compiled with
+ * **CONFIG_CGROUP_BPF**.
+ *
+ * **BPF_PROG_TYPE_FLOW_DISSECTOR**
+ *
 + * Network namespace (e.g. /proc/self/ns/net).
+ *
+ * **BPF_PROG_TYPE_LIRC_MODE2**
+ *
 + * LIRC device path (e.g. /dev/lircN). Requires the kernel
+ * to be compiled with **CONFIG_BPF_LIRC_MODE2**.
+ *
+ * **BPF_PROG_TYPE_SK_SKB**,
+ * **BPF_PROG_TYPE_SK_MSG**
+ *
 + * eBPF map of socket type (e.g. **BPF_MAP_TYPE_SOCKHASH**).
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
-- 
2.27.0
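
For illustration of the attach flow above, a sketch (not part of the
patch) that assumes a program already loaded as
**BPF_PROG_TYPE_CGROUP_SKB**:

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* cgroup_fd: open fd for a cgroup v2 directory (the target_fd);
     * prog_fd: fd of the loaded program (the attach_bpf_fd). */
    int attach_cgroup_ingress(int cgroup_fd, int prog_fd)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.target_fd     = cgroup_fd;
            attr.attach_bpf_fd = prog_fd;
            attr.attach_type   = BPF_CGROUP_INET_INGRESS;

            return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr,
                           sizeof(attr));
    }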



[PATCH bpf-next 10/17] scripts/bpf: Abstract eBPF API target parameter

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Abstract out the target parameter so that upcoming commits can invoke
more than just the existing "helpers" target to generate specific
portions of the docs from the eBPF UAPI headers.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 scripts/bpf_doc.py | 87 --
 1 file changed, 61 insertions(+), 26 deletions(-)

diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index ca6e7559d696..5a4f68aab335 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -2,6 +2,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 #
 # Copyright (C) 2018-2019 Netronome Systems, Inc.
+# Copyright (C) 2021 Isovalent, Inc.
 
 # In case user attempts to run with Python 2.
 from __future__ import print_function
@@ -165,10 +166,11 @@ class Printer(object):
 """
 A generic class for printers. Printers should be created with an array of
 Helper objects, and implement a way to print them in the desired fashion.
-@helpers: array of Helper objects to print to standard output
+@parser: A HeaderParser with objects to print to standard output
 """
-def __init__(self, helpers):
-self.helpers = helpers
+def __init__(self, parser):
+self.parser = parser
+self.elements = []
 
 def print_header(self):
 pass
@@ -181,19 +183,23 @@ class Printer(object):
 
 def print_all(self):
 self.print_header()
-for helper in self.helpers:
-self.print_one(helper)
+for elem in self.elements:
+self.print_one(elem)
 self.print_footer()
 
+
 class PrinterRST(Printer):
 """
-A printer for dumping collected information about helpers as a ReStructured
-Text page compatible with the rst2man program, which can be used to
-generate a manual page for the helpers.
-@helpers: array of Helper objects to print to standard output
+A generic class for printers that print ReStructured Text. Printers should
+be created with a HeaderParser object, and implement a way to print API
+elements in the desired fashion.
+@parser: A HeaderParser with objects to print to standard output
 """
-def print_header(self):
-header = '''\
+def __init__(self, parser):
+self.parser = parser
+
+def print_license(self):
+license = '''\
 .. Copyright (C) All BPF authors and contributors from 2014 to present.
 .. See git log include/uapi/linux/bpf.h in kernel tree for details.
 .. 
@@ -223,7 +229,37 @@ class PrinterRST(Printer):
 .. located in file include/uapi/linux/bpf.h of the Linux kernel sources
 .. (helpers description), and from scripts/bpf_doc.py in the same
 .. repository (header and footer).
+'''
+print(license)
+
+def print_elem(self, elem):
+if (elem.desc):
+print('\tDescription')
+# Do not strip all newline characters: formatted code at the end of
+# a section must be followed by a blank line.
+for line in re.sub('\n$', '', elem.desc, count=1).split('\n'):
+print('{}{}'.format('\t\t' if line else '', line))
+
+if (elem.ret):
+print('\tReturn')
+for line in elem.ret.rstrip().split('\n'):
+print('{}{}'.format('\t\t' if line else '', line))
+
+print('')
 
+
+class PrinterHelpersRST(PrinterRST):
+"""
+A printer for dumping collected information about helpers as a ReStructured
+Text page compatible with the rst2man program, which can be used to
+generate a manual page for the helpers.
+@parser: A HeaderParser with Helper objects to print to standard output
+"""
+def __init__(self, parser):
+self.elements = parser.helpers
+
+def print_header(self):
+header = '''\
 ===========
 BPF-HELPERS
 ===========
@@ -264,6 +300,7 @@ kernel at the top).
 HELPERS
 =======
 '''
+PrinterRST.print_license(self)
 print(header)
 
 def print_footer(self):
@@ -380,27 +417,19 @@ SEE ALSO
 
 def print_one(self, helper):
 self.print_proto(helper)
+self.print_elem(helper)
 
-if (helper.desc):
-print('\tDescription')
-# Do not strip all newline characters: formatted code at the end of
-# a section must be followed by a blank line.
-for line in re.sub('\n$', '', helper.desc, count=1).split('\n'):
-print('{}{}'.format('\t\t' if line else '', line))
 
-if (helper.ret):
-print('\tReturn')
-for line in helper.ret.rstrip().split('\n'):

[PATCH bpf-next 07/17] bpf: Document BPF_PROG_QUERY syscall command

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Commit 468e2f64d220 ("bpf: introduce BPF_PROG_QUERY command") originally
introduced this, but there have been several additions since then.
Unlike BPF_PROG_ATTACH, it appears that the sockmap progs are not able
to be queried so far.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 37 +
 1 file changed, 37 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 86fe0445c395..a07cecfd2148 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -386,6 +386,43 @@ union bpf_iter_link_info {
  * Obtain information about eBPF programs associated with the
  * specified *attach_type* hook.
  *
+ * The *target_fd* must be a valid file descriptor for a kernel
+ * object which depends on the attach type of *attach_bpf_fd*:
+ *
+ * **BPF_PROG_TYPE_CGROUP_DEVICE**,
+ * **BPF_PROG_TYPE_CGROUP_SKB**,
+ * **BPF_PROG_TYPE_CGROUP_SOCK**,
+ * **BPF_PROG_TYPE_CGROUP_SOCK_ADDR**,
+ * **BPF_PROG_TYPE_CGROUP_SOCKOPT**,
+ * **BPF_PROG_TYPE_CGROUP_SYSCTL**,
+ * **BPF_PROG_TYPE_SOCK_OPS**
+ *
+ * Control Group v2 hierarchy with the eBPF controller
+ * enabled. Requires the kernel to be compiled with
+ * **CONFIG_CGROUP_BPF**.
+ *
+ * **BPF_PROG_TYPE_FLOW_DISSECTOR**
+ *
 + * Network namespace (e.g. /proc/self/ns/net).
+ *
+ * **BPF_PROG_TYPE_LIRC_MODE2**
+ *
 + * LIRC device path (e.g. /dev/lircN). Requires the kernel
+ * to be compiled with **CONFIG_BPF_LIRC_MODE2**.
+ *
+ * **BPF_PROG_QUERY** always fetches the number of programs
+ * attached and the *attach_flags* which were used to attach those
+ * programs. Additionally, if *prog_ids* is nonzero and the number
+ * of attached programs is less than *prog_cnt*, populates
+ * *prog_ids* with the eBPF program ids of the programs attached
+ * at *target_fd*.
+ *
+ * The following flags may alter the result:
+ *
+ * **BPF_F_QUERY_EFFECTIVE**
+ * Only return information regarding programs which are
+ * currently effective at the specified *target_fd*.
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
-- 
2.27.0
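
A hedged sketch of querying a cgroup hook as described above (not part
of the patch):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* On entry *cnt is the capacity of ids[]; on success it holds the
     * number of attached programs reported by the kernel. */
    int query_cgroup_ingress(int cgroup_fd, __u32 *ids, __u32 *cnt)
    {
            union bpf_attr attr;
            int err;

            memset(&attr, 0, sizeof(attr));
            attr.query.target_fd   = cgroup_fd;
            attr.query.attach_type = BPF_CGROUP_INET_INGRESS;
            attr.query.prog_ids    = (__u64)(unsigned long)ids;
            attr.query.prog_cnt    = *cnt;

            err = syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr));
            if (!err)
                    *cnt = attr.query.prog_cnt;
            return err;
    }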



[PATCH bpf-next 12/17] tools/bpf: Rename Makefile.{helpers,docs}

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

In anticipation of including make targets for other manual pages in this
makefile, rename it to something a bit more generic.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/bpf/{Makefile.helpers => Makefile.docs} | 2 +-
 tools/bpf/bpftool/Documentation/Makefile  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
 rename tools/bpf/{Makefile.helpers => Makefile.docs} (95%)

diff --git a/tools/bpf/Makefile.helpers b/tools/bpf/Makefile.docs
similarity index 95%
rename from tools/bpf/Makefile.helpers
rename to tools/bpf/Makefile.docs
index a26599022fd6..dc4ce82ada33 100644
--- a/tools/bpf/Makefile.helpers
+++ b/tools/bpf/Makefile.docs
@@ -3,7 +3,7 @@ ifndef allow-override
   include ../scripts/Makefile.include
   include ../scripts/utilities.mak
 else
-  # Assume Makefile.helpers is being run from bpftool/Documentation
+  # Assume Makefile.docs is being run from bpftool/Documentation
   # subdirectory. Go up two more directories to fetch bpf.h header and
   # associated script.
   UP2DIR := ../../
diff --git a/tools/bpf/bpftool/Documentation/Makefile b/tools/bpf/bpftool/Documentation/Makefile
index f33cb02de95c..bb7842efffd6 100644
--- a/tools/bpf/bpftool/Documentation/Makefile
+++ b/tools/bpf/bpftool/Documentation/Makefile
@@ -16,8 +16,8 @@ prefix ?= /usr/local
 mandir ?= $(prefix)/man
 man8dir = $(mandir)/man8
 
-# Load targets for building eBPF helpers man page.
-include ../../Makefile.helpers
+# Load targets for building eBPF man page.
+include ../../Makefile.docs
 
 MAN8_RST = $(wildcard bpftool*.rst)
 
-- 
2.27.0



[PATCH bpf-next 14/17] tools/bpf: Build bpf-syscall.2 in Makefile.docs

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Build the bpf(2) syscall commands documentation as part of the docs
build step. This allows us to pick up potential parse errors from the
docs generator script as part of selftests.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/bpf/Makefile.docs | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/Makefile.docs b/tools/bpf/Makefile.docs
index 7111888ca5d8..47da582cdaf2 100644
--- a/tools/bpf/Makefile.docs
+++ b/tools/bpf/Makefile.docs
@@ -21,18 +21,27 @@ endif
 
 prefix ?= /usr/local
 mandir ?= $(prefix)/man
+man2dir = $(mandir)/man2
 man7dir = $(mandir)/man7
 
+SYSCALL_RST = bpf-syscall.rst
+MAN2_RST = $(SYSCALL_RST)
+
 HELPERS_RST = bpf-helpers.rst
 MAN7_RST = $(HELPERS_RST)
 
+_DOC_MAN2 = $(patsubst %.rst,%.2,$(MAN2_RST))
+DOC_MAN2 = $(addprefix $(OUTPUT),$(_DOC_MAN2))
+
 _DOC_MAN7 = $(patsubst %.rst,%.7,$(MAN7_RST))
 DOC_MAN7 = $(addprefix $(OUTPUT),$(_DOC_MAN7))
 
-DOCTARGETS := helpers
+DOCTARGETS := helpers syscall
 
 docs: $(DOCTARGETS)
+syscall: man2
 helpers: man7
+man2: $(DOC_MAN2)
 man7: $(DOC_MAN7)
 
 RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
@@ -70,9 +79,10 @@ endef
 
 # Create the make targets to generate manual pages by name and section
 $(eval $(call DOCS_RULES,helpers,7))
+$(eval $(call DOCS_RULES,syscall,2))
 
 docs-clean: $(foreach doctarget,$(DOCTARGETS), docs-clean-$(doctarget))
 docs-install: $(foreach doctarget,$(DOCTARGETS), docs-install-$(doctarget))
 docs-uninstall: $(foreach doctarget,$(DOCTARGETS), docs-uninstall-$(doctarget))
 
-.PHONY: docs docs-clean docs-install docs-uninstall man7
+.PHONY: docs docs-clean docs-install docs-uninstall man2 man7
-- 
2.27.0



[PATCH bpf-next 08/17] bpf: Document BPF_MAP_*_BATCH syscall commands

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Based roughly on the following commits:
* Commit cb4d03ab499d ("bpf: Add generic support for lookup batch op")
* Commit 057996380a42 ("bpf: Add batch ops to all htab bpf map")
* Commit aa2e93b8e58e ("bpf: Add generic support for update and delete
  batch ops")

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Brian Vazquez 
CC: Yonghong Song 

@Yonghong, would you mind double-checking whether the text is accurate for the
case where BPF_MAP_LOOKUP_AND_DELETE_BATCH returns -EFAULT?
---
 include/uapi/linux/bpf.h | 114 +--
 1 file changed, 111 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a07cecfd2148..893803f69a64 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -550,13 +550,55 @@ union bpf_iter_link_info {
  * Description
  * Iterate and fetch multiple elements in a map.
  *
+ * Two opaque values are used to manage batch operations,
+ * *in_batch* and *out_batch*. Initially, *in_batch* must be set
+ * to NULL to begin the batched operation. After each subsequent
+ * **BPF_MAP_LOOKUP_BATCH**, the caller should pass the resultant
+ * *out_batch* as the *in_batch* for the next operation to
+ * continue iteration from the current point.
+ *
 + * The *keys* and *values* are output parameters which must point
 + * to memory large enough to hold *count* items based on the key
 + * and value size of the map *map_fd*. The *keys* buffer must be
 + * of size *key_size* * *count*. The *values* buffer must be of
 + * size *value_size* * *count*.
+ *
+ * The *elem_flags* argument may be specified as one of the
+ * following:
+ *
+ * **BPF_F_LOCK**
+ * Look up the value of a spin-locked map without
+ * returning the lock. This must be specified if the
+ * elements contain a spinlock.
+ *
+ * On success, *count* elements from the map are copied into the
+ * user buffer, with the keys copied into *keys* and the values
+ * copied into the corresponding indices in *values*.
+ *
+ * If an error is returned and *errno* is not **EFAULT**, *count*
+ * is set to the number of successfully processed elements.
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
  *
+ * May set *errno* to **ENOSPC** to indicate that *keys* or
+ * *values* is too small to dump an entire bucket during
+ * iteration of a hash-based map type.
+ *
  * BPF_MAP_LOOKUP_AND_DELETE_BATCH
  * Description
- * Iterate and delete multiple elements in a map.
+ * Iterate and delete all elements in a map.
+ *
+ * This operation has the same behavior as
+ * **BPF_MAP_LOOKUP_BATCH** with two exceptions:
+ *
+ * * Every element that is successfully returned is also deleted
+ *   from the map. This is at least *count* elements. Note that
+ *   *count* is both an input and an output parameter.
+ * * Upon returning with *errno* set to **EFAULT**, up to
+ *   *count* elements may be deleted without returning the keys
+ *   and values of the deleted elements.
  *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
@@ -564,15 +606,81 @@ union bpf_iter_link_info {
  *
  * BPF_MAP_UPDATE_BATCH
  * Description
- * Iterate and update multiple elements in a map.
+ * Update multiple elements in a map by *key*.
+ *
 + * The *keys* and *values* are input parameters which must point
 + * to memory large enough to hold *count* items based on the key
 + * and value size of the map *map_fd*. The *keys* buffer must be
 + * of size *key_size* * *count*. The *values* buffer must be of
 + * size *value_size* * *count*.
+ *
+ * Each element specified in *keys* is sequentially updated to the
+ * value in the corresponding index in *values*. The *in_batch*
+ * and *out_batch* parameters are ignored and should be zeroed.
+ *
+ * The *elem_flags* argument should be specified as one of the
+ * following:
+ *
+ * **BPF_ANY**
 + * Create new elements or update existing elements.
+ * **BPF_NOEXIST**
+ * Create new elements only if they do not exist.
+ * **BPF_EXIST**
+ * Update existing elements.
+ * **BPF_F_LOCK**
+ * Update spin_lock-ed map elements. This must be
+ * specifi
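
To make the batch cursor protocol above concrete, a hedged userspace
sketch (not part of the patch) that drains a map via
**BPF_MAP_LOOKUP_BATCH**:

    #include <errno.h>
    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* keys/values must hold `count` records of the map's key/value
     * size. A NULL in_batch starts the walk; each call's out_batch
     * token seeds the next call; ENOENT ends the iteration. */
    int dump_map(int map_fd, void *keys, void *values, __u32 count)
    {
            union bpf_attr attr;
            __u64 token = 0;
            int first = 1, err;

            for (;;) {
                    memset(&attr, 0, sizeof(attr));
                    attr.batch.map_fd    = map_fd;
                    attr.batch.in_batch  = first ? 0 :
                            (__u64)(unsigned long)&token;
                    attr.batch.out_batch = (__u64)(unsigned long)&token;
                    attr.batch.keys      = (__u64)(unsigned long)keys;
                    attr.batch.values    = (__u64)(unsigned long)values;
                    attr.batch.count     = count;

                    err = syscall(__NR_bpf, BPF_MAP_LOOKUP_BATCH, &attr,
                                  sizeof(attr));
                    if (err) {
                            if (errno == ENOENT)
                                    err = 0;  /* iteration complete */
                            break;
                    }
                    first = 0;
                    /* attr.batch.count elements are now in keys/values;
                     * process them before the next call. */
            }
            return err;
    }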

[PATCH bpf-next 16/17] docs/bpf: Add bpf() syscall command reference

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Generate the syscall command reference from the UAPI header file and
include it in the main bpf docs page.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 Documentation/Makefile |  2 ++
 Documentation/bpf/Makefile | 28 
 Documentation/bpf/bpf_commands.rst |  5 +
 Documentation/bpf/index.rst| 14 +++---
 4 files changed, 46 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/bpf/Makefile
 create mode 100644 Documentation/bpf/bpf_commands.rst

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 9c42dde97671..408542825cc2 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -73,6 +73,7 @@ loop_cmd = $(echo-cmd) $(cmd_$(1)) || exit;
 
 quiet_cmd_sphinx = SPHINX  $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
      cmd_sphinx = $(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/userspace-api/media $2 && \
+	$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/bpf $2 && \
 	PYTHONDONTWRITEBYTECODE=1 \
 	BUILDDIR=$(abspath $(BUILDDIR)) SPHINX_CONF=$(abspath $(srctree)/$(src)/$5/$(SPHINX_CONF)) \
 	$(PYTHON3) $(srctree)/scripts/jobserver-exec \
@@ -133,6 +134,7 @@ refcheckdocs:
 
 cleandocs:
$(Q)rm -rf $(BUILDDIR)
+	$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/bpf clean
 	$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/userspace-api/media clean
 
 dochelp:
diff --git a/Documentation/bpf/Makefile b/Documentation/bpf/Makefile
new file mode 100644
index ..4f14db0891cc
--- /dev/null
+++ b/Documentation/bpf/Makefile
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# Rules to convert a .h file to inline RST documentation
+
+SRC_DIR = $(srctree)/Documentation/bpf
+PARSER = $(srctree)/scripts/bpf_doc.py
+UAPI = $(srctree)/include/uapi/linux
+
+TARGETS = $(BUILDDIR)/bpf/bpf_syscall.rst
+
+$(BUILDDIR)/bpf/bpf_syscall.rst: $(UAPI)/bpf.h
+   $(PARSER) syscall > $@
+
+.PHONY: all html epub xml latex linkcheck clean
+
+all: $(IMGDOT) $(BUILDDIR)/bpf $(TARGETS)
+
+html: all
+epub: all
+xml: all
+latex: $(IMGPDF) all
+linkcheck:
+
+clean:
+   -rm -f -- $(TARGETS) 2>/dev/null
+
+$(BUILDDIR)/bpf:
+   $(Q)mkdir -p $@
diff --git a/Documentation/bpf/bpf_commands.rst b/Documentation/bpf/bpf_commands.rst
new file mode 100644
index ..da388ffac85b
--- /dev/null
+++ b/Documentation/bpf/bpf_commands.rst
@@ -0,0 +1,5 @@
+**************************
+bpf() subcommand reference
+**************************
+
+.. kernel-include:: $BUILDDIR/bpf/bpf_syscall.rst
diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index 4f2874b729c3..631d02d4dc49 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -12,9 +12,6 @@ BPF instruction-set.
 The Cilium project also maintains a `BPF and XDP Reference Guide`_
 that goes into great technical depth about the BPF Architecture.
 
-The primary info for the bpf syscall is available in the `man-pages`_
-for `bpf(2)`_.
-
 BPF Type Format (BTF)
 =
 
@@ -35,6 +32,17 @@ Two sets of Questions and Answers (Q&A) are maintained.
bpf_design_QA
bpf_devel_QA
 
+Syscall API
+===
+
+The primary info for the bpf syscall is available in the `man-pages`_
+for `bpf(2)`_.
+
+.. toctree::
+   :maxdepth: 1
+
+   bpf_commands
+
 
 Helper functions
 
-- 
2.27.0



[PATCH bpf-next 13/17] tools/bpf: Templatize man page generation

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Previously, the Makefile here was only targeting a single manual page so
it just hardcoded a bunch of individual rules to specifically handle
build, clean, install, uninstall for that particular page.

Upcoming commits will generate manual pages for an additional section,
so this commit prepares the makefile first by converting the existing
targets into an evaluated set of targets based on the manual page name
and section.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/bpf/Makefile.docs  | 52 
 tools/bpf/bpftool/Documentation/Makefile |  8 ++--
 2 files changed, 39 insertions(+), 21 deletions(-)

diff --git a/tools/bpf/Makefile.docs b/tools/bpf/Makefile.docs
index dc4ce82ada33..7111888ca5d8 100644
--- a/tools/bpf/Makefile.docs
+++ b/tools/bpf/Makefile.docs
@@ -29,32 +29,50 @@ MAN7_RST = $(HELPERS_RST)
 _DOC_MAN7 = $(patsubst %.rst,%.7,$(MAN7_RST))
 DOC_MAN7 = $(addprefix $(OUTPUT),$(_DOC_MAN7))
 
+DOCTARGETS := helpers
+
+docs: $(DOCTARGETS)
 helpers: man7
 man7: $(DOC_MAN7)
 
 RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
 
-$(OUTPUT)$(HELPERS_RST): $(UP2DIR)../../include/uapi/linux/bpf.h
-   $(QUIET_GEN)$(UP2DIR)../../scripts/bpf_doc.py --filename $< > $@
+# Configure make rules for the man page bpf-$1.$2.
+# $1 - target for scripts/bpf_doc.py
+# $2 - man page section to generate the troff file
+define DOCS_RULES =
+$(OUTPUT)bpf-$1.rst: $(UP2DIR)../../include/uapi/linux/bpf.h
+   $$(QUIET_GEN)$(UP2DIR)../../scripts/bpf_doc.py $1 \
+   --filename $$< > $$@
 
-$(OUTPUT)%.7: $(OUTPUT)%.rst
+$(OUTPUT)%.$2: $(OUTPUT)%.rst
 ifndef RST2MAN_DEP
-   $(error "rst2man not found, but required to generate man pages")
+   $$(error "rst2man not found, but required to generate man pages")
 endif
-   $(QUIET_GEN)rst2man $< > $@
+   $$(QUIET_GEN)rst2man $$< > $$@
+
+docs-clean-$1:
+   $$(call QUIET_CLEAN, eBPF_$1-manpage)
+   $(Q)$(RM) $$(DOC_MAN$2) $(OUTPUT)bpf-$1.rst
+
+docs-install-$1: docs
+   $$(call QUIET_INSTALL, eBPF_$1-manpage)
+   $(Q)$(INSTALL) -d -m 755 $(DESTDIR)$$(man$2dir)
+   $(Q)$(INSTALL) -m 644 $$(DOC_MAN$2) $(DESTDIR)$$(man$2dir)
+
+docs-uninstall-$1:
+   $$(call QUIET_UNINST, eBPF_$1-manpage)
+   $(Q)$(RM) $$(addprefix $(DESTDIR)$$(man$2dir)/,$$(_DOC_MAN$2))
+   $(Q)$(RMDIR) $(DESTDIR)$$(man$2dir)
 
-helpers-clean:
-   $(call QUIET_CLEAN, eBPF_helpers-manpage)
-   $(Q)$(RM) $(DOC_MAN7) $(OUTPUT)$(HELPERS_RST)
+.PHONY: $1 docs-clean-$1 docs-install-$1 docs-uninstall-$1
+endef
 
-helpers-install: helpers
-   $(call QUIET_INSTALL, eBPF_helpers-manpage)
-   $(Q)$(INSTALL) -d -m 755 $(DESTDIR)$(man7dir)
-   $(Q)$(INSTALL) -m 644 $(DOC_MAN7) $(DESTDIR)$(man7dir)
+# Create the make targets to generate manual pages by name and section
+$(eval $(call DOCS_RULES,helpers,7))
 
-helpers-uninstall:
-   $(call QUIET_UNINST, eBPF_helpers-manpage)
-   $(Q)$(RM) $(addprefix $(DESTDIR)$(man7dir)/,$(_DOC_MAN7))
-   $(Q)$(RMDIR) $(DESTDIR)$(man7dir)
+docs-clean: $(foreach doctarget,$(DOCTARGETS), docs-clean-$(doctarget))
+docs-install: $(foreach doctarget,$(DOCTARGETS), docs-install-$(doctarget))
+docs-uninstall: $(foreach doctarget,$(DOCTARGETS), docs-uninstall-$(doctarget))
 
-.PHONY: helpers helpers-clean helpers-install helpers-uninstall
+.PHONY: docs docs-clean docs-install docs-uninstall man7
diff --git a/tools/bpf/bpftool/Documentation/Makefile b/tools/bpf/bpftool/Documentation/Makefile
index bb7842efffd6..f60b800584a5 100644
--- a/tools/bpf/bpftool/Documentation/Makefile
+++ b/tools/bpf/bpftool/Documentation/Makefile
@@ -24,7 +24,7 @@ MAN8_RST = $(wildcard bpftool*.rst)
 _DOC_MAN8 = $(patsubst %.rst,%.8,$(MAN8_RST))
 DOC_MAN8 = $(addprefix $(OUTPUT),$(_DOC_MAN8))
 
-man: man8 helpers
+man: man8 docs
 man8: $(DOC_MAN8)
 
 RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
@@ -46,16 +46,16 @@ ifndef RST2MAN_DEP
 endif
	$(QUIET_GEN)( cat $< ; printf "%b" $(call see_also,$<) ) | rst2man $(RST2MAN_OPTS) > $@
 
-clean: helpers-clean
+clean: docs-clean
$(call QUIET_CLEAN, Documentation)
$(Q)$(RM) $(DOC_MAN8)
 
-install: man helpers-install
+install: man docs-install
$(call QUIET_INSTALL, Documentation-man)
$(Q)$(INSTALL) -d -m 755 $(DESTDIR)$(man8dir)
$(Q)$(INSTALL) -m 644 $(DOC_MAN8) $(DESTDIR)$(man8dir)
 
-uninstall: helpers-uninstall
+uninstall: docs-uninstall
$(call QUIET_UNINST, Documentation-man)
$(Q)$(RM) $(addprefix $(DESTDIR)$(man8dir)/,$(_DOC_MAN8))
$(Q)$(RMDIR) $(DESTDIR)$(man8dir)
-- 
2.27.0



[PATCH bpf-next 09/17] scripts/bpf: Rename bpf_helpers_doc.py -> bpf_doc.py

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Rename this file in anticipation of it being used for generating more
than just helper man pages.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 2 +-
 scripts/{bpf_helpers_doc.py => bpf_doc.py} | 4 ++--
 tools/bpf/Makefile.helpers | 2 +-
 tools/include/uapi/linux/bpf.h | 2 +-
 tools/lib/bpf/Makefile | 2 +-
 tools/perf/MANIFEST| 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)
 rename scripts/{bpf_helpers_doc.py => bpf_doc.py} (99%)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 893803f69a64..4abf54327612 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1425,7 +1425,7 @@ union bpf_attr {
  * parsed and used to produce a manual page. The workflow is the following,
  * and requires the rst2man utility:
  *
- * $ ./scripts/bpf_helpers_doc.py \
+ * $ ./scripts/bpf_doc.py \
  * --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
  * $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
  * $ man /tmp/bpf-helpers.7
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_doc.py
similarity index 99%
rename from scripts/bpf_helpers_doc.py
rename to scripts/bpf_doc.py
index 867ada23281c..ca6e7559d696 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_doc.py
@@ -221,7 +221,7 @@ class PrinterRST(Printer):
 .. 
 .. Please do not edit this file. It was generated from the documentation
 .. located in file include/uapi/linux/bpf.h of the Linux kernel sources
-.. (helpers description), and from scripts/bpf_helpers_doc.py in the same
+.. (helpers description), and from scripts/bpf_doc.py in the same
 .. repository (header and footer).
 
 ===
@@ -511,7 +511,7 @@ class PrinterHelpers(Printer):
 
 def print_header(self):
 header = '''\
-/* This is auto-generated file. See bpf_helpers_doc.py for details. */
+/* This is auto-generated file. See bpf_doc.py for details. */
 
 /* Forward declarations of BPF structs */'''
 
diff --git a/tools/bpf/Makefile.helpers b/tools/bpf/Makefile.helpers
index 854d084026dd..a26599022fd6 100644
--- a/tools/bpf/Makefile.helpers
+++ b/tools/bpf/Makefile.helpers
@@ -35,7 +35,7 @@ man7: $(DOC_MAN7)
 RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
 
 $(OUTPUT)$(HELPERS_RST): $(UP2DIR)../../include/uapi/linux/bpf.h
-   $(QUIET_GEN)$(UP2DIR)../../scripts/bpf_helpers_doc.py --filename $< > $@
+   $(QUIET_GEN)$(UP2DIR)../../scripts/bpf_doc.py --filename $< > $@
 
 $(OUTPUT)%.7: $(OUTPUT)%.rst
 ifndef RST2MAN_DEP
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4c24daa43bac..16f2f0d2338a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -720,7 +720,7 @@ union bpf_attr {
  * parsed and used to produce a manual page. The workflow is the following,
  * and requires the rst2man utility:
  *
- * $ ./scripts/bpf_helpers_doc.py \
+ * $ ./scripts/bpf_doc.py \
  * --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
  * $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
  * $ man /tmp/bpf-helpers.7
diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 887a494ad5fc..8170f88e8ea6 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -158,7 +158,7 @@ $(BPF_IN_STATIC): force $(BPF_HELPER_DEFS)
$(Q)$(MAKE) $(build)=libbpf OUTPUT=$(STATIC_OBJDIR)
 
 $(BPF_HELPER_DEFS): $(srctree)/tools/include/uapi/linux/bpf.h
-   $(QUIET_GEN)$(srctree)/scripts/bpf_helpers_doc.py --header \
+   $(QUIET_GEN)$(srctree)/scripts/bpf_doc.py --header \
--file $(srctree)/tools/include/uapi/linux/bpf.h > 
$(BPF_HELPER_DEFS)
 
 $(OUTPUT)libbpf.so: $(OUTPUT)libbpf.so.$(LIBBPF_VERSION)
diff --git a/tools/perf/MANIFEST b/tools/perf/MANIFEST
index 5d7b947320fb..f05c4d48fd7e 100644
--- a/tools/perf/MANIFEST
+++ b/tools/perf/MANIFEST
@@ -20,4 +20,4 @@ tools/lib/bitmap.c
 tools/lib/str_error_r.c
 tools/lib/vsprintf.c
 tools/lib/zalloc.c
-scripts/bpf_helpers_doc.py
+scripts/bpf_doc.py
-- 
2.27.0



[PATCH bpf-next 15/17] selftests/bpf: Add docs target

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

This docs target runs scripts/bpf_doc.py against the BPF UAPI headers
to ensure that the parser used for generating manual pages from the
headers doesn't trip on any newly added API documentation.

While we're at it, remove the bpftool-specific docs check target since
that would just be duplicated with the new target anyhow.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/Makefile  | 20 +-
 .../selftests/bpf/test_bpftool_build.sh   | 21 ---
 tools/testing/selftests/bpf/test_doc_build.sh | 13 
 3 files changed, 28 insertions(+), 26 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/test_doc_build.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 044bfdcf5b74..e1a76444670c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -68,6 +68,7 @@ TEST_PROGS := test_kmod.sh \
test_bpftool_build.sh \
test_bpftool.sh \
test_bpftool_metadata.sh \
+	test_doc_build.sh \
test_xsk.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
@@ -103,6 +104,7 @@ override define CLEAN
$(call msg,CLEAN)
	$(Q)$(RM) -r $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) $(EXTRA_CLEAN)
$(Q)$(MAKE) -C bpf_testmod clean
+   $(Q)$(MAKE) docs-clean
 endef
 
 include ../lib.mk
@@ -180,6 +182,7 @@ $(OUTPUT)/runqslower: $(BPFOBJ) | $(DEFAULT_BPFTOOL)
cp $(SCRATCH_DIR)/runqslower $@
 
 $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/test_stub.o $(BPFOBJ)
+$(TEST_GEN_FILES): docs
 
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
 $(OUTPUT)/test_skb_cgroup_id_user: cgroup_helpers.c
@@ -200,11 +203,16 @@ $(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile)\
CC=$(HOSTCC) LD=$(HOSTLD)  \
OUTPUT=$(HOST_BUILD_DIR)/bpftool/  \
prefix= DESTDIR=$(HOST_SCRATCH_DIR)/ install
-   $(Q)mkdir -p $(BUILD_DIR)/bpftool/Documentation
-   $(Q)RST2MAN_OPTS="--exit-status=1" $(MAKE) $(submake_extras)   \
-   -C $(BPFTOOLDIR)/Documentation \
-   OUTPUT=$(BUILD_DIR)/bpftool/Documentation/ \
-   prefix= DESTDIR=$(SCRATCH_DIR)/ install
+
+docs:
+   $(Q)RST2MAN_OPTS="--exit-status=1" $(MAKE) $(submake_extras)\
+   -C $(TOOLSDIR)/bpf -f Makefile.docs \
+   prefix= OUTPUT=$(OUTPUT)/ DESTDIR=$(OUTPUT)/ $@
+
+docs-clean:
+   $(Q)$(MAKE) $(submake_extras)   \
+   -C $(TOOLSDIR)/bpf -f Makefile.docs \
+   prefix= OUTPUT=$(OUTPUT)/ DESTDIR=$(OUTPUT)/ $@
 
 $(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile)\
   ../../../include/uapi/linux/bpf.h   \
@@ -476,3 +484,5 @@ EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)	\
prog_tests/tests.h map_tests/tests.h verifier/tests.h   \
feature \
$(addprefix $(OUTPUT)/,*.o *.skel.h no_alu32 bpf_gcc bpf_testmod.ko)
+
+.PHONY: docs docs-clean
diff --git a/tools/testing/selftests/bpf/test_bpftool_build.sh b/tools/testing/selftests/bpf/test_bpftool_build.sh
index 2db3c60e1e61..ac349a5cea7e 100755
--- a/tools/testing/selftests/bpf/test_bpftool_build.sh
+++ b/tools/testing/selftests/bpf/test_bpftool_build.sh
@@ -85,23 +85,6 @@ make_with_tmpdir() {
echo
 }
 
-make_doc_and_clean() {
-   echo -e "\$PWD:$PWD"
-   echo -e "command: make -s $* doc >/dev/null"
-   RST2MAN_OPTS="--exit-status=1" make $J -s $* doc
-   if [ $? -ne 0 ] ; then
-   ERROR=1
-   printf "FAILURE: Errors or warnings when building 
documentation\n"
-   fi
-   (
-   if [ $# -ge 1 ] ; then
-   cd ${@: -1}
-   fi
-   make -s doc-clean
-   )
-   echo
-}
-
 echo "Trying to build bpftool"
 echo -e "... through kbuild\n"
 
@@ -162,7 +145,3 @@ make_and_clean
 make_with_tmpdir OUTPUT
 
 make_with_tmpdir O
-
-echo -e "Checking documentation build\n"
-# From tools/bpf/bpftool
-make_doc_and_clean
diff --git a/tools/testing/selftests/bpf/test_doc_build.sh b/tools/testing/selftests/bpf/test_doc_build.sh
new file mode 100755
index ..7eb940a7b2eb
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_doc_build.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+# Assume script is located under tools/testing/selftests/bpf/. We want to start
+# build attempts from the

[PATCH bpf-next 11/17] scripts/bpf: Add syscall commands printer

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Add a new target to bpf_doc.py to support generating the list of syscall
commands directly from the UAPI headers. Assuming that developer
submissions keep the main header up to date, this should allow the man
pages to be automatically generated based on the latest API changes
rather than requiring someone to separately go back through the API and
describe each command.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 scripts/bpf_doc.py | 98 +-
 1 file changed, 89 insertions(+), 9 deletions(-)

diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 5a4f68aab335..72a2ba323692 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -14,6 +14,9 @@ import sys, os
 class NoHelperFound(BaseException):
 pass
 
+class NoSyscallCommandFound(BaseException):
+pass
+
 class ParsingError(BaseException):
 def __init__(self, line='', reader=None):
 if reader:
@@ -23,18 +26,27 @@ class ParsingError(BaseException):
 else:
 BaseException.__init__(self, 'Error parsing line: %s' % line)
 
-class Helper(object):
+
+class APIElement(object):
 """
-An object representing the description of an eBPF helper function.
-@proto: function prototype of the helper function
-@desc: textual description of the helper function
-@ret: description of the return value of the helper function
+An object representing the description of an aspect of the eBPF API.
+@proto: prototype of the API symbol
+@desc: textual description of the symbol
+@ret: (optional) description of any associated return value
 """
 def __init__(self, proto='', desc='', ret=''):
 self.proto = proto
 self.desc = desc
 self.ret = ret
 
+
+class Helper(APIElement):
+"""
+An object representing the description of an eBPF helper function.
+@proto: function prototype of the helper function
+@desc: textual description of the helper function
+@ret: description of the return value of the helper function
+"""
 def proto_break_down(self):
 """
 Break down helper function protocol into smaller chunks: return type,
@@ -61,6 +73,7 @@ class Helper(object):
 
 return res
 
+
 class HeaderParser(object):
 """
 An object used to parse a file in order to extract the documentation of a
@@ -73,6 +86,13 @@ class HeaderParser(object):
 self.reader = open(filename, 'r')
 self.line = ''
 self.helpers = []
+self.commands = []
+
+def parse_element(self):
+proto= self.parse_symbol()
+desc = self.parse_desc()
+ret  = self.parse_ret()
+return APIElement(proto=proto, desc=desc, ret=ret)
 
 def parse_helper(self):
 proto= self.parse_proto()
@@ -80,6 +100,18 @@ class HeaderParser(object):
 ret  = self.parse_ret()
 return Helper(proto=proto, desc=desc, ret=ret)
 
+def parse_symbol(self):
+p = re.compile(' \* ?(.+)$')
+capture = p.match(self.line)
+if not capture:
+raise NoSyscallCommandFound
+end_re = re.compile(' \* ?NOTES$')
+end = end_re.match(self.line)
+if end:
+raise NoSyscallCommandFound
+self.line = self.reader.readline()
+return capture.group(1)
+
 def parse_proto(self):
 # Argument can be of shape:
 #   - "void"
@@ -141,16 +173,29 @@ class HeaderParser(object):
 break
 return ret
 
-def run(self):
-# Advance to start of helper function descriptions.
-offset = self.reader.read().find('* Start of BPF helper function descriptions:')
+def seek_to(self, target, help_message):
+self.reader.seek(0)
+offset = self.reader.read().find(target)
 if offset == -1:
-raise Exception('Could not find start of eBPF helper descriptions list')
+raise Exception(help_message)
 self.reader.seek(offset)
 self.reader.readline()
 self.reader.readline()
 self.line = self.reader.readline()
 
+def parse_syscall(self):
+self.seek_to('* Start of BPF syscall commands:',
+ 'Could not find start of eBPF syscall descriptions list')
+while True:
+try:
+command = self.parse_element()
+self.commands.append(command)
+except NoSyscallCommandFound:
+break
+
+def parse_helpers(self):
+self.seek_to('* Start of BPF helper function descriptions:',
+ 'Could not find start of eBPF helper descriptions list')
 while True:

[PATCH bpf-next 17/17] tools: Sync uapi bpf.h header with latest changes

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Synchronize the header after all of the recent changes.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/include/uapi/linux/bpf.h | 707 -
 1 file changed, 706 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 16f2f0d2338a..4abf54327612 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -93,7 +93,712 @@ union bpf_iter_link_info {
 	} map;
 };
 
-/* BPF syscall commands, see bpf(2) man-page for details. */
+/* BPF syscall commands, see bpf(2) man-page for more details.
+ *
+ * The operation to be performed by the **bpf**\ () system call is determined
+ * by the *cmd* argument. Each operation takes an accompanying argument,
+ * provided via *attr*, which is a pointer to a union of type *bpf_attr* (see
+ * below). The *size* argument is the size of the union pointed to by *attr*.
+ *
+ * Start of BPF syscall commands:
+ *
+ * BPF_MAP_CREATE
+ *	Description
+ *		Create a map and return a file descriptor that refers to the
+ *		map. The close-on-exec file descriptor flag (see **fcntl**\ (2))
+ *		is automatically enabled for the new file descriptor.
+ *
+ *		Applying **close**\ (2) to the file descriptor returned by
+ *		**BPF_MAP_CREATE** will delete the map (but see NOTES).
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_MAP_LOOKUP_ELEM
+ *	Description
+ *		Look up an element with a given *key* in the map referred to
+ *		by the file descriptor *map_fd*.
+ *
+ *		The *flags* argument may be specified as one of the
+ *		following:
+ *
+ *		**BPF_F_LOCK**
+ *			Look up the value of a spin-locked map without
+ *			returning the lock. This must be specified if the
+ *			elements contain a spinlock.
+ *
+ *	Return
+ *		Returns zero on success. On error, -1 is returned and *errno*
+ *		is set appropriately.
+ *
+ * BPF_MAP_UPDATE_ELEM
+ *	Description
+ *		Create or update an element (key/value pair) in a specified map.
+ *
+ *		The *flags* argument should be specified as one of the
+ *		following:
+ *
+ *		**BPF_ANY**
+ *			Create a new element or update an existing element.
+ *		**BPF_NOEXIST**
+ *			Create a new element only if it did not exist.
+ *		**BPF_EXIST**
+ *			Update an existing element.
+ *		**BPF_F_LOCK**
+ *			Update a spin_lock-ed map element.
+ *
+ *	Return
+ *		Returns zero on success. On error, -1 is returned and *errno*
+ *		is set appropriately.
+ *
+ *		May set *errno* to **EINVAL**, **EPERM**, **ENOMEM**,
+ *		**E2BIG**, **EEXIST**, or **ENOENT**.
+ *
+ *		**E2BIG**
+ *			The number of elements in the map reached the
+ *			*max_entries* limit specified at map creation time.
+ *		**EEXIST**
+ *			If *flags* specifies **BPF_NOEXIST** and the element
+ *			with *key* already exists in the map.
+ *		**ENOENT**
+ *			If *flags* specifies **BPF_EXIST** and the element with
+ *			*key* does not exist in the map.
+ *
+ * BPF_MAP_DELETE_ELEM
+ *	Description
+ *		Look up and delete an element by key in a specified map.
+ *
+ *	Return
+ *		Returns zero on success. On error, -1 is returned and *errno*
+ *		is set appropriately.
+ *
+ * BPF_MAP_GET_NEXT_KEY
+ *	Description
+ *		Look up an element by key in a specified map and return the key
+ *		of the next element. Can be used to iterate over all elements
+ *		in the map.
+ *
+ *	Return
+ *		Returns zero on success. On error, -1 is returned and *errno*
+ *		is set appropriately.
+ *
+ *		The following cases can be used to iterate over all elements of
+ *		the map:
+ *
+ *		* If *key* is not found, the operation returns zero and sets
+ *		  the *next_key* pointer to the key of the first element.
+ *		* If *key* is found, the operation returns zero and sets the
+ *		  *next_key* pointer to the key of the next element.
+ *		* If *key* is the last element, returns -1 and *errno* is set
+ *		  to **ENOENT**.
+ *
+ *		May set *errno* to **ENOMEM**, **EFAULT**, **EPERM**, or
+ *		**EINVAL** on error.
+ *
+ * BPF_PROG_LOAD
+ *	Description
+ *		Verify and load an eBPF program, returning a new file
Re: [PATCH bpf-next 00/17] Improve BPF syscall command documentation

2021-02-17 Thread Joe Stringer
On Wed, Feb 17, 2021 at 5:55 AM Toke Høiland-Jørgensen  wrote:
>
> Joe Stringer  writes:
> > Given the relative success of the process around bpf-helpers(7) to
> > encourage developers to document their user-facing changes, in this
> > patch series I explore applying this technique to bpf(2) as well.
> > Unfortunately, even with bpf(2) being so out-of-date, there is still a
> > lot of content to convert over. In particular, I've identified at least
> > the following aspects of the bpf syscall which could individually be
> > generated from separate documentation in the header:
> > * BPF syscall commands
> > * BPF map types
> > * BPF program types
> > * BPF attachment points
>
> Does this also include program subtypes (AKA expected_attach_type?)

I seem to have left my lawyerly "including, but not limited to..."
language at home today ;-). Of course, I can add that to the list.

> > At this point I'd like to put this out for comments. In my mind, the
> > ideal eventuation of this work would be to extend kernel UAPI headers
> > such that each of the categories I had listed above (commands, maps,
> > progs, hooks) have dedicated documentation in the kernel tree, and that
> > developers must update the comments in the headers to document the APIs
> > prior to patch acceptance, and that we could auto-generate the latest
> > version of the bpf(2) manual pages based on a few static description
> > sections combined with the dynamically-generated output from the header.
>
> I like the approach, and I don't think it's too onerous to require
> updates to the documentation everywhere like we (as you note) already do
> for helpers.
>
> So with that, please feel free to add my enthusiastic:
>
> Acked-by: Toke Høiland-Jørgensen 

Thanks Toke.


Re: [PATCH bpf-next 00/17] Improve BPF syscall command documentation

2021-02-17 Thread Joe Stringer
On Wed, Feb 17, 2021 at 9:32 AM Jonathan Corbet  wrote:
>
> [CC += linux-doc]
>
> Joe Stringer  writes:
>
> > From: Joe Stringer 
> >
> > The state of bpf(2) manual pages today is not exactly ideal. For the
> > most part, it was written several years ago and has not kept up with the
> > pace of development in the kernel tree. For instance, out of a total of
> > ~35 commands to the BPF syscall available today, when I pull the
> > kernel-man-pages tree today I find just 6 documented commands: The very
> > basics of map interaction and program load.
> >
> > In contrast, looking at bpf-helpers(7), I am able today to run one
> > command[0] to fetch API documentation of the very latest eBPF helpers
> > that have been added to the kernel. This documentation is up to date
> > because kernel maintainers enforce documenting the APIs as part of
> > the feature submission process. As far as I can tell, we rely on manual
> > synchronization from the kernel tree to the kernel-man-pages tree to
> > distribute these more widely, so all locations may not be completely up
> > to date. That said, the documentation does in fact exist in the first
> > place which is a major initial hurdle to overcome.
> >
> > Given the relative success of the process around bpf-helpers(7) to
> > encourage developers to document their user-facing changes, in this
> > patch series I explore applying this technique to bpf(2) as well.
>
> So I am totally in favor of improving the BPF docs, this is great work.
>
> That said, I am a bit less thrilled about creating a new, parallel
> documentation-build system in the kernel.  I don't think that BPF is so
> special that it needs to do its own thing here.
>
> If you started that way, you'd get the whole existing build system for
> free.  You would also have started down a path that could, some bright
> shining day, lead to this kind of documentation for *all* of our system
> calls.  That would be a huge improvement in how we do things.
>
> The troff output would still need implementation, but we'd like to have
> that anyway.  We used to create man pages for internal kernel APIs; that
> was lost in the sphinx transition and hasn't been a priority since
> people haven't been screaming, but it could still be nice to have it
> back.
>
> So...could I ask you to have a look at doing this within the kernel's
> docs system instead of in addition to it?  Even if it means digging into
> scripts/kernel-doc, which isn't all that high on my list of Fun Things
> To Do either?  I'm willing to try to help, and maybe we can get some
> other assistance too - I'm ever the optimist.

Hey Jon, thanks for the feedback. Absolutely, what you say makes
sense. The intent here wasn't to come up with something new. Based on
your prompt from this email (and a quick look at your KR '19
presentation), I'm hearing a few observations:
* Storing the documentation in the code next to the things that
contributors edit is a reasonable approach to documentation of this
kind.
* This series currently proposes adding some new Makefile
infrastructure. However, good use of the "kernel-doc" sphinx directive
+ "DOC: " incantations in the header should be able to achieve the
same without adding such dedicated build system logic to the tree (a
sketch follows below).
* The changes in patch 16 here extended Documentation/bpf/index.rst,
but to assist in improving the overall kernel documentation
organisation / hierarchy, you would prefer to instead introduce a
dedicated Documentation/userspace-api/bpf/ directory where the bpf
uAPI portions can be documented.
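
As a concrete sketch of that second point (my illustration, not code from
this series), the header would carry a kernel-doc "DOC:" heading:

/**
 * DOC: eBPF Syscall Commands
 *
 * BPF_MAP_CREATE
 *	Description
 *		Create a map and return a file descriptor that refers to
 *		the map.
 */

which an .rst page can then pull in with the standard directive:

.. kernel-doc:: include/uapi/linux/bpf.h
   :doc: eBPF Syscall Commands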

From the above, there are a couple of clear actionable items I can look
into for a series v2 which should tidy things up.

In addition to this, today the bpf helpers documentation is built
through the bpftool build process as well as the runtime bpf
selftests, mostly as a way to ensure that the API documentation
conforms to a particular style, which then assists with the generation
of ReStructured Text and troff output. I can probably simplify the
make infrastructure involved in triggering the bpf docs build for bpf
subsystem developers and maintainers. I think there's likely still
interest from bpf folks to keep that particular dependency in the
selftests like today and even extend it to include this new
Documentation, so that we don't either introduce text that fails
against the parser or in some other way break the parser. Whether that
validation is done by scripts/kernel-doc or scripts/bpf_helpers_doc.py
doesn't make a big difference to me, other than I have zero experience
with Perl. My first impressions are that the bpf_helpers_doc.py is
providing stricter formatting requirements than what "DOC

Re: [PATCH bpf-next 00/17] Improve BPF syscall command documentation

2021-02-18 Thread Joe Stringer
On Thu, Feb 18, 2021 at 11:49 AM Jonathan Corbet  wrote:
>
> Joe Stringer  writes:
> > * The changes in patch 16 here extended Documentation/bpf/index.rst,
> > but to assist in improving the overall kernel documentation
> > organisation / hierarchy, you would prefer to instead introduce a
> > dedicated Documentation/userspace-api/bpf/ directory where the bpf
> > uAPI portions can be documented.
>
> An objective I've been working on for some years is reorienting the
> documentation with a focus on who the readers are.  We've tended to
> organize it by subsystem, requiring people to wade through a lot of
> stuff that isn't useful to them.  So yes, my preference would be to
> document the kernel's user-space API in the relevant manual.
>
> That said, I do tend to get pushback here at times, and the BPF API is
> arguably a bit different that much of the rest.  So while the above
> preference exists and is reasonably strong, the higher priority is to
> get good, current documentation in *somewhere* so that it's available to
> users.  I don't want to make life too difficult for people working
> toward that goal, even if I would paint it a different color.

Sure, I'm all for it. Unless I hear alternative feedback I'll roll it
under Documentation/userspace-api/bpf in the next revision.

> > In addition to this, today the bpf helpers documentation is built
> > through the bpftool build process as well as the runtime bpf
> > selftests, mostly as a way to ensure that the API documentation
> > conforms to a particular style, which then assists with the generation
> > of ReStructured Text and troff output. I can probably simplify the
> > make infrastructure involved in triggering the bpf docs build for bpf
> > subsystem developers and maintainers. I think there's likely still
> > interest from bpf folks to keep that particular dependency in the
> > selftests like today and even extend it to include this new
> > Documentation, so that we don't either introduce text that fails
> > against the parser or in some other way break the parser. Whether that
> > validation is done by scripts/kernel-doc or scripts/bpf_helpers_doc.py
> > doesn't make a big difference to me, other than I have zero experience
> > with Perl. My first impressions are that the bpf_helpers_doc.py is
> > providing stricter formatting requirements than what "DOC: " +
> > kernel-doc would provide, so my baseline inclination would be to keep
> > those patches to enhance that script and use that for the validation
> > side (help developers with stronger linting feedback), then use
> > kernel-doc for the actual html docs generation side, which would help
> > to satisfy your concern around duplication of the documentation build
> > systems.
>
> This doesn't sound entirely unreasonable.  I wonder if the BPF helper
> could be built into an sphinx extension to make it easy to pull that
> information into the docs build.  The advantage there is that it can be
> done in Python :)

Probably doable, it's already written in python. One thing at a time
though... :)

Cheers,
Joe


Re: [PATCH bpf-next] libbpf: clarify flags in ringbuf helpers

2021-04-07 Thread Joe Stringer
Hi Pedro,

On Tue, Apr 6, 2021 at 11:58 AM Pedro Tammela  wrote:
>
> In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment.
>
> For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a
> notification to the process if needed.
>
> Signed-off-by: Pedro Tammela 
> ---
>  include/uapi/linux/bpf.h   | 7 +++
>  tools/include/uapi/linux/bpf.h | 7 +++
>  2 files changed, 14 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 49371eba98ba..8c5c7a893b87 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4061,12 +4061,15 @@ union bpf_attr {
>   * of new data availability is sent.
>   * If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, notification
>   * of new data availability is sent unconditionally.
> + * If **0** is specified in *flags*, notification
> + * of new data availability is sent if needed.

Maybe a trivial question, but what does "if needed" mean? Does that
mean "when the buffer is full"?


Re: [PATCH bpf-next] bpf: fix missing * in bpf.h

2021-03-02 Thread Joe Stringer
On Fri, Feb 26, 2021 at 8:51 AM Quentin Monnet  wrote:
>
> 2021-02-24 10:59 UTC-0800 ~ Andrii Nakryiko 
> > On Wed, Feb 24, 2021 at 7:55 AM Daniel Borkmann  
> > wrote:
> >>
> >> On 2/23/21 3:43 PM, Jesper Dangaard Brouer wrote:
> >>> On Tue, 23 Feb 2021 20:45:54 +0800
> >>> Hangbin Liu  wrote:
> >>>
>  Commit 34b2021cc616 ("bpf: Add BPF-helper for MTU checking") lost a *
>  in bpf.h. This will make bpf_helpers_doc.py stop building
>  bpf_helper_defs.h immediately after bpf_check_mtu, which will affect
>  future add functions.
> 
>  Fixes: 34b2021cc616 ("bpf: Add BPF-helper for MTU checking")
>  Signed-off-by: Hangbin Liu 
>  ---
>    include/uapi/linux/bpf.h   | 2 +-
>    tools/include/uapi/linux/bpf.h | 2 +-
>    2 files changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> Thanks for fixing that!
> >>>
> >>> Acked-by: Jesper Dangaard Brouer 
> >>
> >> Thanks guys, applied!
> >>
> >>> I thought I had already fixed that, but I must have missed or reintroduced
> >>> this when rolling back broken ideas in V13.
> >>>
> >>> I usually run this command to check the man-page (before submitting):
> >>>
> >>>   ./scripts/bpf_helpers_doc.py | rst2man | man -l -
> >>
> >> [+ Andrii] maybe this could be included to run as part of CI to catch such
> >> things in advance?
> >
> > We do something like that as part of bpftool build, so there is no
> > reason we can't add this to selftests/bpf/Makefile as well.
>
> Hi, pretty sure this is the case already? [0]
>
> This helps catching RST formatting issues, for example if a description
> is using invalid markup, and reported by rst2man. My understanding is
> that in the current case, the missing star simply ends the block for the
> helpers documentation from the parser point of view, it's not considered
> an error.
>
> I see two possible workarounds:
>
> 1) Check that the number of helpers found ("len(self.helpers)") is equal
> to the number of helpers in the file, but that requires knowing how many
> helpers we have in the first place (e.g. parsing "__BPF_FUNC_MAPPER(FN)").

This is not so difficult as long as we stick to one symbol per line:

diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index e2ffac2b7695..74cdcc2bbf18 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -183,25 +183,51 @@ class HeaderParser(object):
         self.reader.readline()
         self.line = self.reader.readline()
 
+    def get_elem_count(self, target):
+        self.seek_to(target, 'Could not find symbol "%s"' % target)
+        end_re = re.compile('^$')
+        count = 0
+        while True:
+            capture = end_re.match(self.line)
+            if capture:
+                break
+            self.line = self.reader.readline()
+            count += 1
+
+        # The last line (either '};' or '/* */') doesn't count.
+        return count
+
+

I can either roll this into my docs update v2, or hold onto it for
another dedicated patch fixup. Either way I'm trialing this out
locally to regression-test my own docs update PR and make sure I'm not
breaking one of the various output formats.
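
As an aside, C consumers of the header get the same count for free from the
enum sentinel, which makes for an easy cross-check (a sketch of mine, not
part of the thread):

/* __BPF_FUNC_MAPPER() emits one BPF_FUNC_* enumerator per helper plus the
 * __BPF_FUNC_MAX_ID sentinel; subtracting the 'unspec' slot yields the
 * number of helpers this copy of the header knows about.
 */
#include <stdio.h>
#include <linux/bpf.h>

int main(void)
{
	printf("helpers in this header: %d\n", __BPF_FUNC_MAX_ID - 1);
	return 0;
}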


Re: [PATCH] openvswitch: perform refragmentation for packets which pass through conntrack

2021-03-21 Thread Joe Stringer
Hey Aaron, long time no chat :)

On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole  wrote:
>
> When a user instructs a flow pipeline to perform connection tracking,
> there is an implicit L3 operation that occurs - namely the IP fragments
> are reassembled and then processed as a single unit.  After this, new
> fragments are generated and then transmitted, with the hint that they
> should be fragmented along the max rx unit boundary.  In general, this
> behavior works well to forward packets along when the MTUs are congruent
> across the datapath.
>
> However, if using a protocol such as UDP on a network with mismatching
> MTUs, it is possible that the refragmentation will still produce an
> invalid fragment, and that fragmented packet will not be delivered.
> Such a case shouldn't happen because the user explicitly requested a
> layer 3+4 function (conntrack), and that function generates new fragments,
> so we should perform the needed actions in that case (namely, refragment
> IPv4 along a correct boundary, or send a packet too big in the IPv6 case).
>
> Additionally, introduce a test suite for openvswitch with a test case
> that ensures this MTU behavior, with the expectation that new tests are
> added when needed.
>
> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Signed-off-by: Aaron Conole 
> ---
> NOTE: checkpatch reports a whitespace error with the openvswitch.sh
>   script - this is due to using tab as the IFS value.  This part
>   of the script was copied from
>   tools/testing/selftests/net/pmtu.sh so I think should be
>   permissible.
>
>  net/openvswitch/actions.c  |   2 +-
>  tools/testing/selftests/net/.gitignore |   1 +
>  tools/testing/selftests/net/Makefile   |   1 +
>  tools/testing/selftests/net/openvswitch.sh | 394 +
>  4 files changed, 397 insertions(+), 1 deletion(-)
>  create mode 100755 tools/testing/selftests/net/openvswitch.sh
>
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 92a0b67b2728..d858ea580e43 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
> if (likely(!mru ||
>(skb->len <= mru + vport->dev->hard_header_len))) {
> ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
> -   } else if (mru <= vport->dev->mtu) {
> +   } else if (mru) {
> struct net *net = read_pnet(&dp->net);
>
> ovs_fragment(net, vport, skb, mru, key);

I thought about this for a while. For a bit of context, my
recollection is that in the initial design, there was an attempt to
minimize the set of assumptions around L3 behaviour and despite
performing this pseudo-L3 action of connection tracking, attempt a
"bump-in-the-wire" approach where OVS is serving as an L2 switch and
if you wanted L3 features, you need to build them on top or explicitly
define that you're looking for L3 semantics. In this case, you're
interpreting that the combination of the conntrack action + an output
action implies that L3 routing is being performed. Hence, OVS should
act like a router and either refragment or generate ICMP PTB in the
case where MTU differs. According to the flow table, the rest of the
routing functionality (MAC handling for instance) may or may not have
been performed at this point, but we basically leave that up to the
SDN controller to implement the right behaviour. In relation to this
particular check, the idea was to retain the original geometry of the
packet such that it's as though there were no functionality performed
in the middle at all. OVS happened to do connection tracking (which
implicitly involved queueing fragments), but if you treat it as an
opaque box, you have ports connected and OVS is simply performing
forwarding between the ports.

One of the related implications is the contrast between what happens
in this case if you have a conntrack action injected or not when
outputting to another port. If you didn't put a connection tracking
action into the flows when redirecting here, then there would be no
defragmentation or refragmentation. In that case, OVS is just
attempting to forward to another device and if the MTU check fails,
then bad luck, packets will be dropped. Now, with the interpretation
in this patch, it seems like we're trying to say that, well, actually,
if the controller injects a connection tracking action, then the
controller implicitly switches OVS into a sort of half-L3 mode for
this particular flow. This makes the behaviour a bit inconsistent.

Another thought that occurs here is that if you have three interfaces
attached to the switch, say one with MTU 1500 and two with MTU 1450,
and the OVS flows are configured to conntrack and clone the packets
from the higher-MTU interface to the lower-MTU interfaces. If you
receive

Re: [RFC bpf-next 0/7] Programming socket lookup with BPF

2019-06-20 Thread Joe Stringer
On Wed, Jun 19, 2019 at 2:14 AM Jakub Sitnicki  wrote:
>
> Hey Florian,
>
> Thanks for taking a look at it.
>
> On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
> > Jakub Sitnicki  wrote:
> >>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
> >>find the listening socket to check for SYN cookies with TPROXY redirect.
> >
> > Sorry for the question, but where is the problem?
> > (i.e., is it with TPROXY or bpf side)?
>
> The way I see it is that the problem is that we have mappings for
> steering traffic into sockets split between two places: (1) the socket
> lookup tables, and (2) the TPROXY rules.
>
> BPF programs that need to check if there is a socket the packet is
> destined for have access to the socket lookup tables, via the mentioned
> bpf_sk_lookup helper, but are unaware of TPROXY redirects.
>
> For TCP we're able to look up from BPF if there are any established,
> request, and "normal" listening sockets. The listening sockets that
> receive connections via TPROXY are invisible to BPF progs.
>
> Why are we interested in finding all listening sockets? To check if any
> of them had SYN queue overflow recently and if we should honor SYN
> cookies.

Why are they invisible? Can't you look them up with bpf_skc_lookup_tcp()?


Re: [RFC bpf-next 0/7] Programming socket lookup with BPF

2019-06-21 Thread Joe Stringer
On Fri, Jun 21, 2019 at 1:44 AM Jakub Sitnicki  wrote:
>
> On Fri, Jun 21, 2019, 00:20 Joe Stringer  wrote:
>>
>> On Wed, Jun 19, 2019 at 2:14 AM Jakub Sitnicki  wrote:
>> >
>> > Hey Florian,
>> >
>> > Thanks for taking a look at it.
>> >
>> > On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
>> > > Jakub Sitnicki  wrote:
>> > >>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>> > >>find the listening socket to check for SYN cookies with TPROXY 
>> > >> redirect.
>> > >
>> > > Sorry for the question, but where is the problem?
>> > > (i.e., is it with TPROXY or bpf side)?
>> >
>> > The way I see it is that the problem is that we have mappings for
>> > steering traffic into sockets split between two places: (1) the socket
>> > lookup tables, and (2) the TPROXY rules.
>> >
>> > BPF programs that need to check if there is a socket the packet is
>> > destined for have access to the socket lookup tables, via the mentioned
>> > bpf_sk_lookup helper, but are unaware of TPROXY redirects.
>> >
>> > For TCP we're able to look up from BPF if there are any established,
>> > request, and "normal" listening sockets. The listening sockets that
>> > receive connections via TPROXY are invisible to BPF progs.
>> >
>> > Why are we interested in finding all listening sockets? To check if any
>> > of them had SYN queue overflow recently and if we should honor SYN
>> > cookies.
>>
>> Why are they invisible? Can't you look them up with bpf_skc_lookup_tcp()?
>
>
> They are invisible in that sense that you can't look them up using the packet 
> 4-tuple. You have to somehow make the XDP/TC progs aware of the TPROXY 
> redirects to find the target sockets.

Isn't that what you're doing in the example from the cover letter
(reincluded below for reference), except with the new program type
rather than XDP/TC progs?

   switch (bpf_ntohl(ctx->local_ip4) >> 8) {
case NET1:
ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
ctx->local_port = 81;
return BPF_REDIRECT;
case NET2:
ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
ctx->local_port = 82;
return BPF_REDIRECT;
}

That said, I appreciate that even if you find the sockets from XDP,
you'd presumably need some way to retain the socket reference beyond
XDP execution to convince the stack to guide the traffic into that
socket, which would be a whole other effort. For your use case it may
or may not make the most sense.


Removing skb_orphan() from ip_rcv_core()

2019-06-21 Thread Joe Stringer
Hi folks, picking this up again..

As discussed during LSFMM, I've been looking at adding something like
an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
be implemented with integration into other BPF logic, however
currently any attempts to do so are blocked by the skb_orphan() call
in ip_rcv_core() (which will effectively ignore any socket assign
decision made by the TC BPF program).

Recently I was attempting to remove the skb_orphan() call, and I've
been trying different things but there seems to be some context I'm
missing. Here's the core of the patch:

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index ed97724c5e33..16aea980318a 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
 	IPCB(skb)->iif = skb->skb_iif;
 
-	/* Must drop socket now because of tproxy. */
-	skb_orphan(skb);
 
 	return skb;

The statement that the socket must be dropped because of tproxy
doesn't make sense to me, because the PRE_ROUTING hook is hit after
this, which will call into the tproxy logic and eventually
nf_tproxy_assign_sock() which already does the skb_orphan() itself.

However, if I drop these lines then I end up causing sockets to
release references too many times. Seems like if we don't orphan the
skb here, then later logic assumes that we have one more reference
than we actually have, and decrements the count when it shouldn't
(perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
to assume we always have a reference to the socket?)

Splat:

refcount_t hit zero at sk_stop_timer+0x2c/0x30 in cilium-agent[16359],
uid/euid: 0/0
WARNING: CPU: 0 PID: 16359 at kernel/panic.c:686 refcount_error_report+0x9c/0xa1
...
? inet_put_port+0xa6/0xd0
inet_csk_clear_xmit_timers+0x2e/0x50
tcp_done+0x8b/0xf0
tcp_reset+0x49/0xc0
tcp_validate_incoming+0x2f7/0x410
tcp_rcv_state_process+0x250/0xdb6
? tcp_v4_connect+0x46f/0x4e0
tcp_v4_do_rcv+0xbd/0x1f0
__release_sock+0x84/0xd0
release_sock+0x30/0xa0
inet_stream_connect+0x47/0x60

(Full version: 
https://gist.github.com/joestringer/d5313e4bf4231e2c46405bd7a3053936
)

This seems potentially related to some of the socket referencing
discussion in the peer thread "[RFC bpf-next 0/7] Programming socket
lookup with BPF".

During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
necessary in that path in the current version of the code, and that we
may be able to remove it. Florian, I know you weren't in the room for
that discussion, so raising it again now with a stack trace, Do you
have some sense what's going on here and whether there's a path
towards removing it from this path or allowing the skb->sk to be
retained during ip_rcv() in some conditions?


Re: Removing skb_orphan() from ip_rcv_core()

2019-06-24 Thread Joe Stringer
On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal  wrote:
>
> Joe Stringer  wrote:
> > As discussed during LSFMM, I've been looking at adding something like
> > an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
> > be implemented with integration into other BPF logic, however
> > currently any attempts to do so are blocked by the skb_orphan() call
> > in ip_rcv_core() (which will effectively ignore any socket assign
> > decision made by the TC BPF program).
> >
> > Recently I was attempting to remove the skb_orphan() call, and I've
> > been trying different things but there seems to be some context I'm
> > missing. Here's the core of the patch:
> >
> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index ed97724c5e33..16aea980318a 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
> >	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
> >	IPCB(skb)->iif = skb->skb_iif;
> >
> > -	/* Must drop socket now because of tproxy. */
> > -	skb_orphan(skb);
> >
> >	return skb;
> >
> > The statement that the socket must be dropped because of tproxy
> > doesn't make sense to me, because the PRE_ROUTING hook is hit after
> > this, which will call into the tproxy logic and eventually
> > nf_tproxy_assign_sock() which already does the skb_orphan() itself.
>
> in comment: s/tproxy/skb_steal_sock/

For reference, I was following the path like this:

ip_rcv()
( -> ip_rcv_core() for skb_orphan)
-> NF_INET_PRE_ROUTING hook
(... invoke iptables hooks)
-> iptable_mangle_hook()
-> ipt_do_table()
... -> tproxy_tg4()
... -> nf_tproxy_assign_sock()
-> skb_orphan()
(... finish iptables processing)
( -> ip_rcv_finish())
( ... -> ip_rcv_finish_core() for early demux / route lookup )
(... -> dst_input())
(... -> tcp_v4_rcv())
( -> __inet_lookup_skb())
( -> skb_steal_sock() )

> at least thats what I concluded a few years ago when I looked into
> the skb_oprhan() need.
>
> IIRC some device drivers use skb->sk for backpressure, so without this
> non-tcp socket would be stolen by skb_steal_sock.

Do you happen to recall which device drivers? Or have some idea of a
list I could try to go through? Are you referring to virtual drivers
like veth or something else?

> We also recently removed skb orphan when crossing netns:
>
> commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
> Author: Flavio Leitner 
> skbuff: preserve sock reference when scrubbing the skb.
>
> So thats another case where this orphan is needed.

Presumably the orphan is only needed in this case if the packet
crosses a namespace and then is subsequently passed back into the
stack?

> What could be done is adding some way to delay/defer the orphaning
> further, but we would need at the very least some annotation for
> skb_steal_sock to know when the skb->sk is really from TPROXY or
> if it has to orphan.

Eric mentions in another response to this thread that skb_orphan()
should be called from any ndo_start_xmit() which sends traffic back
into the stack. With that, presumably we would be pushing the
orphaning earlier such that the only way that the skb->sk ref can be
non-NULL around this point in receive would be because it was
specifically set by some kind of tproxy logic?

> Same for the safety check in the forwarding path.
> Netfilter modules need o be audited as well, they might make assumptions
> wrt. skb->sk being inet sockets (set by local stack or early demux).
>
> > However, if I drop these lines then I end up causing sockets to
> > release references too many times. Seems like if we don't orphan the
> > skb here, then later logic assumes that we have one more reference
> > than we actually have, and decrements the count when it shouldn't
> > (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> > to assume we always have a reference to the socket?)
>
> We might be calling the wrong destructor (i.e., the one set by tcp
> receive instead of the one set at tx time)?

Hmm, interesting thought. Sure enough, with a bit of bpftrace
debugging we find it's tcp_wfree():

$ cat ip_rcv.bt
#include <linux/skbuff.h>

kprobe:ip_rcv {
	$sk = ((struct sk_buff *)arg0)->sk;
	$des = ((struct sk_buff *)arg0)->destructor;
	if ($sk) {
		if ($des) {
			printf("received %s on %s with sk destructor %s set\n",
			       str(arg0), str(arg1), ksym($des));
			@ip4_stacks[kstack] = count();
		}
	}
}
$ sudo bpftrace ip_rcv.bt
Attaching 1 prob

Re: Removing skb_orphan() from ip_rcv_core()

2019-06-24 Thread Joe Stringer
On Mon, Jun 24, 2019 at 7:47 AM Jamal Hadi Salim  wrote:
>
> On 2019-06-21 1:58 p.m., Joe Stringer wrote:
> > Hi folks, picking this up again..
> [..]
> > During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
> > necessary in that path in the current version of the code, and that we
> > may be able to remove it. Florian, I know you weren't in the room for
> > that discussion, so raising it again now with a stack trace, Do you
> > have some sense what's going on here and whether there's a path
> > towards removing it from this path or allowing the skb->sk to be
> > retained during ip_rcv() in some conditions?
>
>
> Sorry - I havent followed the discussion but saw your email over
> the weekend and wanted to be at work to refresh my memory on some
> code. For maybe 2-3 years we have deployed the tproxy
> equivalent as a tc action on ingress (with no netfilter dependency).
>
> And, of course, we had to work around that specific code you are
> referring to - we didnt remove it. The tc action code increments
> the sk refcount and sets the tc index. The net core doesnt orphan
> the skb if a speacial tc index value is set (see attached patch)
>
> I never bothered up streaming the patch because the hack is a bit
> embarrassing (but worked ;->); and never posted the action code
> either because i thought this was just us that had this requirement.
> I am glad other people see the need for this feature. Is there effort
> to make this _not_ depend on iptables/netfilter? I am guessing if you
> want to do this from ebpf (tc or xdp) that is a requirement.
> Our need was with tcp at the time; so left udp dependency on netfilter
> alone.

I haven't got as far as UDP yet, but I didn't see any need for a
dependency on netfilter.


Re: Removing skb_orphan() from ip_rcv_core()

2019-06-25 Thread Joe Stringer
On Mon, Jun 24, 2019 at 11:37 PM Eric Dumazet  wrote:
> On 6/24/19 8:17 PM, Joe Stringer wrote:
> > On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal  wrote:
> >> Joe Stringer  wrote:
> >>> However, if I drop these lines then I end up causing sockets to
> >>> release references too many times. Seems like if we don't orphan the
> >>> skb here, then later logic assumes that we have one more reference
> >>> than we actually have, and decrements the count when it shouldn't
> >>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> >>> to assume we always have a reference to the socket?)
> >>
> >> We might be calling the wrong destructor (i.e., the one set by tcp
> >> receive instead of the one set at tx time)?
> >
> > Hmm, interesting thought. Sure enough, with a bit of bpftrace
> > debugging we find it's tcp_wfree():
> >
> > $ cat ip_rcv.bt
> > #include <linux/skbuff.h>
> >
> > kprobe:ip_rcv {
> >	$sk = ((struct sk_buff *)arg0)->sk;
> >	$des = ((struct sk_buff *)arg0)->destructor;
> >	if ($sk) {
> >		if ($des) {
> >			printf("received %s on %s with sk destructor %s set\n",
> >			       str(arg0), str(arg1), ksym($des));
> >			@ip4_stacks[kstack] = count();
> >		}
> >	}
> > }
> > $ sudo bpftrace ip_rcv.bt
> > Attaching 1 probe...
> > received  on eth0 with sk destructor tcp_wfree set
> > ^C
> >
> > @ip4_stacks[
> >ip_rcv+1
> >__netif_receive_skb+24
> >process_backlog+179
> >net_rx_action+304
> >__do_softirq+220
> >do_softirq_own_stack+42
> >do_softirq.part.17+70
> >__local_bh_enable_ip+101
> >ip_finish_output2+421
> >__ip_finish_output+187
> >ip_finish_output+44
> >ip_output+109
> >ip_local_out+59
> >__ip_queue_xmit+368
> >ip_queue_xmit+16
> >__tcp_transmit_skb+1303
> >tcp_connect+2758
> >tcp_v4_connect+1135
> >__inet_stream_connect+214
> >inet_stream_connect+59
> >__sys_connect+237
> >__x64_sys_connect+26
> >do_syscall_64+90
> >entry_SYSCALL_64_after_hwframe+68
> > ]: 1
> >
> > Is there a solution here where we call the destructor if it's not
> > sock_efree()? When the socket is later stolen, it will only return the
> > reference via a call to sock_put(), so presumably at that point in the
> > stack we already assume that the skb->destructor is not one of these
> > other destructors (otherwise we wouldn't release the resources
> > correctly).
> >
>
> What was the driver here ? In any case, the following patch should help.
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3699,6 +3699,7 @@ static __always_inline int dev_forward_skb(struct net_device *dev,
> 		return NET_RX_DROP;
> 	}
>
> +	skb_orphan(skb);
> 	skb_scrub_packet(skb, true);
> 	skb->priority = 0;
> 	return 0;

Looks like it was bridge in the end, found by attaching a similar
bpftrace program to __dev_forward_sk(). Interestingly enough, the
device attached to the skb reported its name as "eth0" despite not
having such a named link or named bridge that I could find anywhere
via "ip link" / "brctl show"..

__dev_forward_skb+1
   dev_hard_start_xmit+151
   __dev_queue_xmit+1928
   dev_queue_xmit+16
   br_dev_queue_push_xmit+123
   br_forward_finish+69
   __br_forward+327
   br_forward+204
   br_dev_xmit+598
   dev_hard_start_xmit+151
   __dev_queue_xmit+1928
   dev_queue_xmit+16
   neigh_resolve_output+339
   ip_finish_output2+402
   __ip_finish_output+187
   ip_finish_output+44
   ip_output+109
   ip_local_out+59
   __ip_queue_xmit+368
   ip_queue_xmit+16
   __tcp_transmit_skb+1303
   tcp_connect+2758
   tcp_v4_connect+1135
   __inet_stream_connect+214
   inet_stream_connect+59
   __sys_connect+237
   __x64_sys_connect+26
   do_syscall_64+90
   entry_SYSCALL_64_after_hwframe+68

So I guess something like this could be another alternative:

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 82225b8b54f5..c2de2bb35080 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -65,6 +65,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);

 int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	skb_orphan(skb);
 	skb->tstamp = 0;
 	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
 		       net, sk, skb, NULL, skb->dev,


Re: Removing skb_orphan() from ip_rcv_core()

2019-06-25 Thread Joe Stringer
On Tue, Jun 25, 2019 at 4:07 AM Jamal Hadi Salim  wrote:
>
> On 2019-06-24 11:26 p.m., Joe Stringer wrote:
> [..]
> >
> > I haven't got as far as UDP yet, but I didn't see any need for a
> > dependency on netfilter.
>
> I'd be curious to see what you did. My experience, even for TCP is
> the socket(transparent/tproxy) lookup code (to set skb->sk either
> listening or established) is entangled in
> CONFIG_NETFILTER_SOMETHING_OR_OTHER. You have to rip it out of
> there (in the tproxy tc action into that  code). Only then can you
> compile out netfilter.
> I didnt bother to rip out code for udp case.
> i.e if you needed udp to work with the tc action,
> youd have to turn on NF. But that was because we had
> no need for udp transparent proxying.
> IOW:
> There is really no reason, afaik, for tproxy code to only be
> accessed if netfilter is compiled in. Not sure i made sense.

Oh, I see. Between the existing bpf_skc_lookup_tcp() and
bpf_sk_lookup_tcp() helpers in BPF, plus a new bpf_sk_assign() helper
and a little bit of lookup code using the appropriate tproxy ports
etc. from the BPF side, I was able to get it working. One could
imagine perhaps wrapping all this logic up in a higher level
"bpf_sk_lookup_tproxy()" helper call or similar, but I didn't go that
direction given that the BPF socket primitives seemed to provide the
necessary functionality in a more generic manner.
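
Roughly, the TC-side logic ends up looking like this (a simplified sketch
with a hypothetical proxy address/port, using the bpf_sk_assign() helper as
it later landed upstream; a real program would parse the packet headers to
fill the tuple):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("tc")
int tproxy_like(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	/* Hypothetical local proxy listener on 127.0.0.1:8080. */
	tuple.ipv4.daddr = bpf_htonl(0x7f000001);
	tuple.ipv4.dport = bpf_htons(8080);

	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return TC_ACT_SHOT;

	/* Pin the skb to the proxy socket so the stack delivers to it
	 * instead of doing its own lookup; the lookup reference must
	 * still be released to satisfy the verifier. */
	if (bpf_sk_assign(skb, sk, 0)) {
		bpf_sk_release(sk);
		return TC_ACT_SHOT;
	}
	bpf_sk_release(sk);
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";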


[RFC bpf-next 02/11] bpf: Simplify ptr_min_max_vals adjustment

2018-05-09 Thread Joe Stringer
An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.

Signed-off-by: Joe Stringer 
---
 kernel/bpf/verifier.c                       | 22 ++
 tools/testing/selftests/bpf/test_verifier.c | 14 +++---
 2 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f40e089c3893..a32b560072d7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2602,20 +2602,18 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 		return -EACCES;
 	}
 
-	if (ptr_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-		verbose(env, "R%d pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL prohibited, null-check it first\n",
-			dst);
-		return -EACCES;
-	}
-	if (ptr_reg->type == CONST_PTR_TO_MAP) {
-		verbose(env, "R%d pointer arithmetic on CONST_PTR_TO_MAP prohibited\n",
-			dst);
+	switch (ptr_reg->type) {
+	case PTR_TO_MAP_VALUE_OR_NULL:
+		verbose(env, "R%d pointer arithmetic on %s prohibited, null-check it first\n",
+			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
-	}
-	if (ptr_reg->type == PTR_TO_PACKET_END) {
-		verbose(env, "R%d pointer arithmetic on PTR_TO_PACKET_END prohibited\n",
-			dst);
+	case CONST_PTR_TO_MAP:
+	case PTR_TO_PACKET_END:
+		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
+			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
+	default:
+		break;
 	}
 
/* In case of 'scalar += pointer', dst_reg inherits pointer type and id.
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 275b4570b5b8..53439f40e1de 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3497,7 +3497,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
@@ -4525,7 +4525,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4546,7 +4546,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4567,7 +4567,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -6864,7 +6864,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map_in_map = { 3 },
-   .errstr = "R1 pointer arithmetic on CONST_PTR_TO_MAP 
prohibited",
+   .errstr = "R1 pointer arithmetic on map_ptr prohibited",
.result = REJECT,
},
{
@@ -8538,7 +8538,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
@@ -8557,7 +8557,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
-- 
2.14.1



[RFC bpf-next 09/11] libbpf: Support loading individual progs

2018-05-09 Thread Joe Stringer
Allow the individual program load to be invoked. This will help with
testing, where a single ELF may contain several sections, some of which
denote subprograms that are expected to fail verification, along with
some which are expected to pass verification. By allowing programs to be
iterated and individually loaded, each program can be independently
checked against its expected verification result.

Signed-off-by: Joe Stringer 
---
 tools/lib/bpf/libbpf.c | 4 ++--
 tools/lib/bpf/libbpf.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7bcdca13083a..04e3754bcf30 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -268,7 +268,7 @@ struct bpf_object {
 };
 #define obj_elf_valid(o)   ((o)->efile.elf)
 
-static void bpf_program__unload(struct bpf_program *prog)
+void bpf_program__unload(struct bpf_program *prog)
 {
int i;
 
@@ -1338,7 +1338,7 @@ load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
return ret;
 }
 
-static int
+int
 bpf_program__load(struct bpf_program *prog,
  char *license, u32 kern_version)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 197f9ce2248c..c07e9969e4ed 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -112,10 +112,13 @@ void *bpf_program__priv(struct bpf_program *prog);
 
 const char *bpf_program__title(struct bpf_program *prog, bool needs_copy);
 
+int bpf_program__load(struct bpf_program *prog, char *license,
+ u32 kern_version);
 int bpf_program__fd(struct bpf_program *prog);
 int bpf_program__pin_instance(struct bpf_program *prog, const char *path,
  int instance);
 int bpf_program__pin(struct bpf_program *prog, const char *path);
+void bpf_program__unload(struct bpf_program *prog);
 
 struct bpf_insn;
 
-- 
2.14.1



[RFC bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type

2018-05-09 Thread Joe Stringer
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.

Signed-off-by: Joe Stringer 
---
 include/linux/bpf.h          | 19 +-
 include/linux/bpf_verifier.h |  2 ++
 kernel/bpf/verifier.c        | 86 ++--
 net/core/filter.c            | 30 +---
 4 files changed, 114 insertions(+), 23 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a38e474bf7ee..a03b4b0edcb6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -136,7 +136,7 @@ enum bpf_arg_type {
 	/* the following constraints used to prototype bpf_memcmp() and other
 	 * functions that access data on eBPF program stack
 	 */
-	ARG_PTR_TO_MEM,		/* pointer to valid memory (stack, packet, map value) */
+	ARG_PTR_TO_MEM,		/* pointer to valid memory (stack, packet, map value, socket) */
 	ARG_PTR_TO_MEM_OR_NULL, /* pointer to valid memory or NULL */
 	ARG_PTR_TO_UNINIT_MEM,	/* pointer to memory does not need to be initialized,
 				 * helper function must fill all bytes or clear
@@ -148,6 +148,7 @@ enum bpf_arg_type {
 
 	ARG_PTR_TO_CTX,		/* pointer to context */
 	ARG_ANYTHING,		/* any (initialized) argument is ok */
+	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
 };
 
 /* type of values returned from helper functions */
@@ -155,6 +156,7 @@ enum bpf_return_type {
 	RET_INTEGER,			/* function returns integer */
 	RET_VOID,			/* function doesn't return anything */
 	RET_PTR_TO_MAP_VALUE_OR_NULL,	/* returns a pointer to map elem value or NULL */
+	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -205,6 +207,8 @@ enum bpf_reg_type {
 	PTR_TO_PACKET_META,	 /* skb->data - meta_len */
 	PTR_TO_PACKET,		 /* reg points to skb->data */
 	PTR_TO_PACKET_END,	 /* skb->data + headlen */
+	PTR_TO_SOCKET,		 /* reg points to struct bpf_sock */
+	PTR_TO_SOCKET_OR_NULL,	 /* reg points to struct bpf_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -326,6 +330,11 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
 					unsigned long off, unsigned long len);
+typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
+					const struct bpf_insn *src,
+					struct bpf_insn *dst,
+					struct bpf_prog *prog,
+					u32 *target_size);
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
@@ -729,4 +738,12 @@ extern const struct bpf_func_proto bpf_sock_map_update_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+			      struct bpf_insn_access_aux *info);
+u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
+				const struct bpf_insn *si,
+				struct bpf_insn *insn_buf,
+				struct bpf_prog *prog,
+				u32 *target_size);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index a613b52ce939..9dcd87f1d322 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -57,6 +57,8 @@ struct bpf_reg_state {
 * offset, so they can share range knowledge.
 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
 * came from, when one is tested for != NULL.
+* For PTR_TO_SOCKET this is used to share which pointers retain the
+* same reference to the socket, to determine proper reference freeing.
 */
u32 id;
/* Ordering of fields matters.  See states_equal() */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1b31b805dea4..d38c7c1e9da6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
  * (like pointer plus pointer becomes SCALAR_VALUE type)
  *
  * When verifier sees load or store instructions the type of base register
- * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK. These are three pointer
- * types recognized by check_mem_access() function.
+ * can be: PTR_TO_MAP_VALUE, 

[RFC bpf-next 10/11] selftests/bpf: Add C tests for reference tracking

2018-05-09 Thread Joe Stringer
Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  |  38 +++
 tools/testing/selftests/bpf/test_sk_lookup_kern.c | 127 ++
 3 files changed, 166 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 9d762184b805..cf71baa9d51d 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -33,7 +33,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
 	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
-	test_get_stack_rawtp.o
+	test_get_stack_rawtp.o test_sk_lookup_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index ed197eef1cfc..6d868a031b00 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1409,6 +1409,43 @@ static void test_get_stack_raw_tp(void)
bpf_object__close(obj);
 }
 
+static void test_reference_tracking()
+{
+   const char *file = "./test_sk_lookup_kern.o";
+   struct bpf_object *obj;
+   struct bpf_program *prog;
+   __u32 duration;
+   int err = 0;
+
+   obj = bpf_object__open(file);
+   if (IS_ERR(obj)) {
+   error_cnt++;
+   return;
+   }
+
+   bpf_object__for_each_program(prog, obj) {
+   const char *title;
+
+   /* Ignore .text sections */
+   title = bpf_program__title(prog, false);
+   if (strstr(title, ".text") != NULL)
+   continue;
+
+   bpf_program__set_type(prog, BPF_PROG_TYPE_SCHED_CLS);
+
+   /* Expect verifier failure if test name has 'fail' */
+   if (strstr(title, "fail") != NULL) {
+   libbpf_set_print(NULL, NULL, NULL);
+   err = !bpf_program__load(prog, "GPL", 0);
+   libbpf_set_print(printf, printf, NULL);
+   } else {
+   err = bpf_program__load(prog, "GPL", 0);
+   }
+   CHECK(err, title, "\n");
+   }
+   bpf_object__close(obj);
+}
+
 int main(void)
 {
jit_enabled = is_jit_enabled();
@@ -1427,6 +1464,7 @@ int main(void)
test_stacktrace_build_id();
test_stacktrace_map_raw_tp();
test_get_stack_raw_tp();
+   test_reference_tracking();
 
printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
new file mode 100644
index 000000000000..4f7383a31916
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
@@ -0,0 +1,127 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+
+#include <stdbool.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/pkt_cls.h>
+#include <linux/tcp.h>
+#include <sys/socket.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
+
+/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
+static void *fill_ip(struct bpf_sock_tuple *tuple, void *data, __u64 nh_off,
+void *data_end, __u16 eth_proto)
+{
+   __u64 ihl_len;
+
+   if (eth_proto == bpf_htons(ETH_P_IP)) {
+   struct iphdr *iph = (struct iphdr *)(data + nh_off);
+
+   if (iph + 1 > data_end)
+   return NULL;
+   ihl_len = iph->ihl * 4;
+
+   tuple->family = AF_INET;
+   tuple->proto = iph->protocol;
+   tuple->saddr.ipv4 = iph->saddr;
+   tuple->daddr.ipv4 = iph->daddr;
+   } else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
+   struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
+
+   if (ip6h + 1 > data_end)
+   return NULL;
+   ihl_len = sizeof(*ip6h);
+
+   tuple->family = AF_INET6;
+   tuple->proto = ip6h->nexthdr;
+   *((struct in6_addr *)&tuple->saddr.ipv6) = ip6h->saddr;
+   *((struct in6_addr *)&tuple->daddr.ipv6) = ip6h->daddr;
+   }
+
+   if (tuple->proto != IPPROTO_TCP)
+  

[RFC bpf-next 08/11] selftests/bpf: Add tests for reference tracking

2018-05-09 Thread Joe Stringer
reference tracking: leak potential reference
reference tracking: leak potential reference on stack
reference tracking: leak potential reference on stack 2
reference tracking: zero potential reference
reference tracking: copy and zero potential references
reference tracking: release reference without check
reference tracking: release reference
reference tracking: release reference twice
reference tracking: release reference twice inside branch
reference tracking: alloc, check, free in one subbranch
reference tracking: alloc, check, free in both subbranches
reference tracking in call: free reference in subprog
reference tracking in call: free reference in subprog and outside
reference tracking in call: alloc & leak reference in subprog
reference tracking in call: alloc in subprog, release outside
reference tracking in call: sk_ptr leak into caller stack
reference tracking in call: sk_ptr spill into caller stack

Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/test_verifier.c | 359 ++++++++++++++++++++++++
 1 file changed, 359 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 53439f40e1de..150c7c19eb51 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3,6 +3,7 @@
  *
  * Copyright (c) 2014 PLUMgrid, http://plumgrid.com
  * Copyright (c) 2017 Facebook
+ * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -152,6 +153,23 @@ static void bpf_fill_jump_around_ld_abs(struct bpf_test 
*self)
insn[i] = BPF_EXIT_INSN();
 }
 
+#define BPF_SK_LOOKUP  \
+   /* struct bpf_sock_tuple tuple = {} */  \
+   BPF_MOV64_IMM(BPF_REG_2, 0),\
+   BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),  \
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -16),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -24),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -32),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -40),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -48),\
+   /* sk = sk_lookup(ctx, &tuple, sizeof tuple, 0, 0) */   \
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),   \
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -48), \
+   BPF_MOV64_IMM(BPF_REG_3, 44),   \
+   BPF_MOV64_IMM(BPF_REG_4, 0),\
+   BPF_MOV64_IMM(BPF_REG_5, 0),\
+   BPF_EMIT_CALL(BPF_FUNC_sk_lookup)
+
 static struct bpf_test tests[] = {
{
"add+sub+mul",
@@ -11974,6 +11992,347 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.retval = 10,
},
+   {
+   "reference tracking: leak potential reference",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_0), /* leak reference 
*/
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+   "reference tracking: leak potential reference on stack",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_4, BPF_REG_0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+   "reference tracking: leak potential reference on stack 2",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_4, BPF_REG_0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+   "reference tracking: zero potential reference",
+   .insns = {
+   BPF_SK_LOOKUP,
+  

[RFC bpf-next 05/11] bpf: Macrofy stack state copy

2018-05-09 Thread Joe Stringer
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros to
simplify it.

Signed-off-by: Joe Stringer 
---
 kernel/bpf/verifier.c | 104 --
 1 file changed, 59 insertions(+), 45 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d38c7c1e9da6..f426ebf2b6bf 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -347,60 +347,74 @@ static void print_verifier_state(struct bpf_verifier_env 
*env,
verbose(env, "\n");
 }
 
-static int copy_stack_state(struct bpf_func_state *dst,
-   const struct bpf_func_state *src)
-{
-   if (!src->stack)
-   return 0;
-   if (WARN_ON_ONCE(dst->allocated_stack < src->allocated_stack)) {
-   /* internal bug, make state invalid to reject the program */
-   memset(dst, 0, sizeof(*dst));
-   return -EFAULT;
-   }
-   memcpy(dst->stack, src->stack,
-  sizeof(*src->stack) * (src->allocated_stack / BPF_REG_SIZE));
-   return 0;
-}
+#define COPY_STATE_FN(NAME, COUNT, FIELD, SIZE)
\
+static int copy_##NAME##_state(struct bpf_func_state *dst, \
+  const struct bpf_func_state *src)\
+{  \
+   if (!src->FIELD)\
+   return 0;   \
+   if (WARN_ON_ONCE(dst->COUNT < src->COUNT)) {\
+   /* internal bug, make state invalid to reject the program */ \
+   memset(dst, 0, sizeof(*dst));   \
+   return -EFAULT; \
+   }   \
+   memcpy(dst->FIELD, src->FIELD,  \
+  sizeof(*src->FIELD) * (src->COUNT / SIZE));  \
+   return 0;   \
+}
+/* copy_stack_state() */
+COPY_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef COPY_STATE_FN
+
+#define REALLOC_STATE_FN(NAME, COUNT, FIELD, SIZE) \
+static int realloc_##NAME##_state(struct bpf_func_state *state, int size, \
+ bool copy_old)\
+{  \
+   u32 old_size = state->COUNT;\
+   struct bpf_##NAME##_state *new_##FIELD; \
+   int slot = size / SIZE; \
+   \
+   if (size <= old_size || !size) {\
+   if (copy_old)   \
+   return 0;   \
+   state->COUNT = slot * SIZE; \
+   if (!size && old_size) {\
+   kfree(state->FIELD);\
+   state->FIELD = NULL;\
+   }   \
+   return 0;   \
+   }   \
+   new_##FIELD = kmalloc_array(slot, sizeof(struct bpf_##NAME##_state), \
+   GFP_KERNEL);\
+   if (!new_##FIELD)   \
+   return -ENOMEM; \
+   if (copy_old) { \
+   if (state->FIELD)   \
+   memcpy(new_##FIELD, state->FIELD,   \
+  sizeof(*new_##FIELD) * (old_size / SIZE)); \
+   memset(new_##FIELD + old_size / SIZE, 0,\
+  sizeof(*new_##FIELD) * (size - old_size) / SIZE); \
+   }   \
+   state->COUNT = slot * SIZE; \
+   kfree(state->FIELD);\
+   state->FIELD = new_##FIELD; \
+   return 0;   \
+}
+/* realloc_stack_state() */
+REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef REALLOC_STATE_FN
 
 /* do_check() starts wi

[RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-05-09 Thread Joe Stringer
This patch adds a new BPF helper function, sk_lookup() which allows BPF
programs to find out if there is a socket listening on this host, and
returns a socket pointer which the BPF program can then access to
determine, for instance, whether to forward or drop traffic. sk_lookup()
takes a reference on the socket, so when a BPF program makes use of this
function, it must subsequently pass the returned pointer into the newly
added sk_release() to return the reference.

By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:

  struct bpf_sock_tuple tuple;
  struct bpf_sock_ops *sk;

  populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
  sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
  if (!sk) {
// Couldn't find a socket listening for this traffic. Drop.
return TC_ACT_SHOT;
  }
  bpf_sk_release(sk, 0);
  return TC_ACT_OK;

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h  |  39 +++-
 kernel/bpf/verifier.c |   8 ++-
 net/core/filter.c | 102 ++
 tools/include/uapi/linux/bpf.h|  40 +++-
 tools/testing/selftests/bpf/bpf_helpers.h |   7 ++
 5 files changed, 193 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d615c777b573..29f38838dbca 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1828,6 +1828,25 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * struct bpf_sock_ops *bpf_sk_lookup(ctx, tuple, tuple_size, netns, flags)
+ * Description
+ * Look for a socket matching 'tuple'. The return value must be 
checked,
+ * and if non-NULL, released via bpf_sk_release().
+ * @ctx: pointer to ctx
+ * @tuple: pointer to struct bpf_sock_tuple
+ * @tuple_size: size of the tuple
+ * @flags: flags value
+ * Return
+ * pointer to socket ops on success, or
+ * NULL in case of failure
+ *
+ *  int bpf_sk_release(sock, flags)
+ * Description
+ * Release the reference held by 'sock'.
+ * @sock: Pointer reference to release. Must be found via 
bpf_sk_lookup().
+ * @flags: flags value
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1898,7 +1917,9 @@ union bpf_attr {
FN(xdp_adjust_tail),\
FN(skb_get_xfrm_state), \
FN(get_stack),  \
-   FN(skb_load_bytes_relative),
+   FN(skb_load_bytes_relative),\
+   FN(sk_lookup),  \
+   FN(sk_release),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2060,6 +2081,22 @@ struct bpf_sock {
 */
 };
 
+struct bpf_sock_tuple {
+   union {
+   __be32 ipv6[4];
+   __be32 ipv4;
+   } saddr;
+   union {
+   __be32 ipv6[4];
+   __be32 ipv4;
+   } daddr;
+   __be16 sport;
+   __be16 dport;
+   __u32 dst_if;
+   __u8 family;
+   __u8 proto;
+};
+
 #define XDP_PACKET_HEADROOM 256
 
 /* User return codes for XDP prog type.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 92b9a5dc465a..579012c483e4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -153,6 +153,12 @@ static const struct bpf_verifier_ops * const 
bpf_verifier_ops[] = {
  * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type
  * passes through a NULL-check conditional. For the branch wherein the state is
  * changed to CONST_IMM, the verifier releases the reference.
+ *
+ * For each helper function that allocates a reference, such as 
bpf_sk_lookup(),
+ * there is a corresponding release function, such as bpf_sk_release(). When
+ * a reference type passes into the release function, the verifier also 
releases
+ * the reference. If any unchecked or unreleased reference remains at the end 
of
+ * the program, the verifier rejects it.
  */
 
 /* verifier_state + insn_idx are pushed to stack when branch is encountered */
@@ -277,7 +283,7 @@ static bool arg_type_is_refcounted(enum bpf_arg_type type)
  */
 static bool is_release_function(enum bpf_func_id func_id)
 {
-   return false;
+   return func_id == BPF_FUNC_sk_release;
 }
 
 /* string representation of 'enum bpf_reg_type' */
diff --git a/net/core/filter.c b/net/core/filter.c
index 4c35152fb3a8..751c255d17d3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -58,8 +58,12 @@
 #include 
 #include 
 #include 
+#include 

[RFC bpf-next 11/11] Documentation: Describe bpf reference tracking

2018-05-09 Thread Joe Stringer
Signed-off-by: Joe Stringer 
---
 Documentation/networking/filter.txt | 64 +
 1 file changed, 64 insertions(+)

diff --git a/Documentation/networking/filter.txt 
b/Documentation/networking/filter.txt
index 5032e1263bc9..77be17977bc5 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1125,6 +1125,14 @@ pointer type.  The types of pointers describe their 
base, as follows:
 PTR_TO_STACKFrame pointer.
 PTR_TO_PACKET   skb->data.
 PTR_TO_PACKET_END   skb->data + headlen; arithmetic forbidden.
+PTR_TO_SOCKET   Pointer to struct bpf_sock_ops, implicitly refcounted.
+PTR_TO_SOCKET_OR_NULL
+Either a pointer to a socket, or NULL; socket lookup
+returns this type, which becomes a PTR_TO_SOCKET when
+checked != NULL. PTR_TO_SOCKET is reference-counted,
+so programs must release the reference through the
+socket release function before the end of the program.
+Arithmetic on these pointers is forbidden.
 However, a pointer may be offset from this base (as a result of pointer
 arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
 offset'.  The former is used when an exactly-known value (e.g. an immediate
@@ -1168,6 +1176,13 @@ over the Ethernet header, then reads IHL and adds (IHL 
* 4), the resulting
 pointer will have a variable offset known to be 4n+2 for some n, so adding the 
2
 bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses 
through
 that pointer are safe.
+The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
+to all copies of the pointer returned from a socket lookup. This has similar
+behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
+it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
+represents a reference to the corresponding 'struct sock'. To ensure that the
+reference is not leaked, it is imperative to NULL-check the reference and,
+in the non-NULL case, pass the valid reference to the socket release function.
 
 Direct packet access
 
@@ -1441,6 +1456,55 @@ Error:
   8: (7a) *(u64 *)(r0 +0) = 1
   R0 invalid mem access 'imm'
 
+Program that performs a socket lookup then sets the pointer to NULL without
+checking it:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup),
+  BPF_MOV64_IMM(BPF_REG_0, 0),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup#65
+  8: (b7) r0 = 0
+  9: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
+Program that performs a socket lookup but does not NULL-check the returned
+value:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup#65
+  8: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
 Testing
 ---
 
-- 
2.14.1



[RFC bpf-next 00/11] Add socket lookup support

2018-05-09 Thread Joe Stringer
This series proposes a new helper for the BPF API which allows BPF programs to
perform lookups for sockets in a network namespace. This would allow programs
to determine early on in processing whether the stack is expecting to receive
the packet, and perform some action (eg drop, forward somewhere) based on this
information.

The series is structured roughly into:
* Misc refactor
* Add the socket pointer type
* Add reference tracking to ensure that socket references are freed
* Extend the BPF API to add sk_lookup() / sk_release() functions
* Add tests/documentation

The helper proposed in this series includes a parameter for a tuple which must
be filled in by the caller to determine the socket to look up. The simplest
case would be filling with the contents of the packet, ie mapping the packet's
5-tuple into the parameter. In common cases, it may alternatively be useful to
reverse the direction of the tuple and perform a lookup, to find the socket
that initiates this connection; and if the BPF program ever performs a form of
IP address translation, it may further be useful to be able to look up
arbitrary tuples that are not based upon the packet, but instead based on state
held in BPF maps or hardcoded in the BPF program.
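
To make the reversed-direction case concrete, here is a minimal sketch
(not part of the series) assuming the 'struct bpf_sock_tuple' layout
proposed later in this series:

  /* Swap the endpoints so that the lookup finds the socket that
   * initiated the connection rather than the receiver. memcpy is used
   * because 'saddr' and 'daddr' are distinct anonymous union types.
   */
  static void reverse_tuple(const struct bpf_sock_tuple *fwd,
                            struct bpf_sock_tuple *rev)
  {
          __builtin_memcpy(&rev->saddr, &fwd->daddr, sizeof(rev->saddr));
          __builtin_memcpy(&rev->daddr, &fwd->saddr, sizeof(rev->daddr));
          rev->sport = fwd->dport;
          rev->dport = fwd->sport;
          rev->dst_if = fwd->dst_if;
          rev->family = fwd->family;
          rev->proto = fwd->proto;
  }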

Currently, access to the socket's fields is limited to those which are
otherwise already accessible, and is restricted to read-only access.

A few open points:
* Currently, the lookup interface only returns either a valid socket or a NULL
  pointer. This means that if there is any kind of issue with the tuple, such
  as it provides an unsupported protocol number, or the socket can't be found,
  then we are unable to differentiate these cases from one another. One natural
  approach to improve this could be to return an ERR_PTR from the
  bpf_sk_lookup() helper. This would be more complicated but maybe it's
  worthwhile.
* No ordering is defined between sockets. If the tuple could find multiple
  sockets, then it will arbitrarily return one. It is up to the caller to
  handle this. If we wish to handle this more reliably in future, we could
  encode an ordering preference in the flags field.
* Currently this helper is only defined for the TC hook point, but it should
  also be valid at XDP and perhaps some other hooks.

Joe Stringer (11):
  bpf: Add iterator for spilled registers
  bpf: Simplify ptr_min_max_vals adjustment
  bpf: Generalize ptr_or_null regs check
  bpf: Add PTR_TO_SOCKET verifier type
  bpf: Macrofy stack state copy
  bpf: Add reference tracking to verifier
  bpf: Add helper to retrieve socket in BPF
  selftests/bpf: Add tests for reference tracking
  libbpf: Support loading individual progs
  selftests/bpf: Add C tests for reference tracking
  Documentation: Describe bpf reference tracking

 Documentation/networking/filter.txt   |  64 +++
 include/linux/bpf.h   |  19 +-
 include/linux/bpf_verifier.h  |  31 +-
 include/uapi/linux/bpf.h  |  39 +-
 kernel/bpf/verifier.c | 548 ++
 net/core/filter.c | 132 +-
 tools/include/uapi/linux/bpf.h|  40 +-
 tools/lib/bpf/libbpf.c|   4 +-
 tools/lib/bpf/libbpf.h|   3 +
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h |   7 +
 tools/testing/selftests/bpf/test_progs.c  |  38 ++
 tools/testing/selftests/bpf/test_sk_lookup_kern.c | 127 +
 tools/testing/selftests/bpf/test_verifier.c   | 373 ++-
 14 files changed, 1299 insertions(+), 128 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

-- 
2.14.1



[RFC bpf-next 06/11] bpf: Add reference tracking to verifier

2018-05-09 Thread Joe Stringer
Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.

To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state'; when the function or program exits, the verifier
checks that all of the acquired references have been freed. When the
pointer is passed to a function that frees the reference, it is removed
from the 'bpf_func_state' and all existing copies of the pointer in
registers are marked invalid.
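
From the program's point of view, the contract looks roughly like the
following fragment (a hedged sketch in the style of the pseudocode
elsewhere in this series, using the helper signatures proposed in
patch 07/11):

  struct bpf_sock_tuple tuple = {};
  struct bpf_sock_ops *sk;

  sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, 0, 0); /* acquires */
  if (!sk)
          return TC_ACT_OK;       /* NULL branch owns no reference */
  /* ... read-only access to sk fields ... */
  bpf_sk_release(sk, 0);          /* must release exactly once */
  return TC_ACT_OK;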

Signed-off-by: Joe Stringer 
---
 include/linux/bpf_verifier.h |  18 ++-
 kernel/bpf/verifier.c| 295 ---
 2 files changed, 292 insertions(+), 21 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 9dcd87f1d322..8dbee360b3ec 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -104,6 +104,11 @@ struct bpf_stack_state {
u8 slot_type[BPF_REG_SIZE];
 };
 
+struct bpf_reference_state {
+   int id;
+   int insn_idx; /* allocation insn */
+};
+
 /* state of the program:
  * type of all registers and stack info
  */
@@ -122,7 +127,9 @@ struct bpf_func_state {
 */
u32 subprogno;
 
-   /* should be second to last. See copy_func_state() */
+   /* The following fields should be last. See copy_func_state() */
+   int acquired_refs;
+   struct bpf_reference_state *refs;
int allocated_stack;
struct bpf_stack_state *stack;
 };
@@ -218,11 +225,16 @@ void bpf_verifier_vlog(struct bpf_verifier_log *log, 
const char *fmt,
 __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
   const char *fmt, ...);
 
-static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env)
 {
struct bpf_verifier_state *cur = env->cur_state;
 
-   return cur->frame[cur->curframe]->regs;
+   return cur->frame[cur->curframe];
+}
+
+static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+{
+   return cur_func(env)->regs;
 }
 
 int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f426ebf2b6bf..92b9a5dc465a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1,5 +1,6 @@
 /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
  * Copyright (c) 2016 Facebook
+ * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -140,6 +141,18 @@ static const struct bpf_verifier_ops * const 
bpf_verifier_ops[] = {
  *
  * After the call R0 is set to return type of the function and registers R1-R5
  * are set to NOT_INIT to indicate that they are no longer readable.
+ *
+ * The following reference types represent a potential reference to a kernel
+ * resource which, after first being allocated, must be checked and freed by
+ * the BPF program:
+ * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET
+ *
+ * When the verifier sees a helper call return a reference type, it allocates a
+ * pointer id for the reference and stores it in the current function state.
+ * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into
+ * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type
+ * passes through a NULL-check conditional. For the branch wherein the state is
+ * changed to CONST_IMM, the verifier releases the reference.
  */
 
 /* verifier_state + insn_idx are pushed to stack when branch is encountered */
@@ -229,7 +242,42 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
-   return type == PTR_TO_MAP_VALUE_OR_NULL;
+   return type == PTR_TO_MAP_VALUE_OR_NULL ||
+  type == PTR_TO_SOCKET_OR_NULL;
+}
+
+static bool type_is_refcounted(enum bpf_reg_type type)
+{
+   return type == PTR_TO_SOCKET;
+}
+
+static bool type_is_refcounted_or_null(enum bpf_reg_type type)
+{
+   return type == PTR_TO_SOCKET || type == PTR_TO_SOCKET_OR_NULL;
+}
+
+static bool reg_is_refcounted(const struct bpf_reg_state *reg)
+{
+   return type_is_refcounted(reg->type);
+}
+
+static bool reg_is_refcounted_or_null(const struct bpf_reg_state *reg)
+{
+   return type_is_refcounted_or_null(reg->type);
+}
+
+static bool arg_type_is_refcounted(enum bpf_arg_type type)
+{
+   return type == ARG_PTR_TO_SOCKET;
+}
+
+/* Determine whether the function releases some resources allocated by another
+ * function call. The first reference type argument will be assumed to be
+ * released by release_reference().

[RFC bpf-next 01/11] bpf: Add iterator for spilled registers

2018-05-09 Thread Joe Stringer
Add an iterator for spilled registers. It concentrates the details of
how to get the current frame's spilled registers into a single macro,
while clarifying the intention of the code which is calling the macro.

Signed-off-by: Joe Stringer 
---
 include/linux/bpf_verifier.h | 11 +++
 kernel/bpf/verifier.c| 16 +++-
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 8f70dc181e23..a613b52ce939 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -133,6 +133,17 @@ struct bpf_verifier_state {
u32 curframe;
 };
 
+#define __get_spilled_reg(slot, frame) \
+   (((slot < frame->allocated_stack / BPF_REG_SIZE) && \
+ (frame->stack[slot].slot_type[0] == STACK_SPILL)) \
+? &frame->stack[slot].spilled_ptr : NULL)
+
+/* Iterate over 'frame', setting 'reg' to either NULL or a spilled register. */
+#define for_each_spilled_reg(iter, frame, reg) \
+   for (iter = 0, reg = __get_spilled_reg(iter, frame);\
+iter < frame->allocated_stack / BPF_REG_SIZE;  \
+iter++, reg = __get_spilled_reg(iter, frame))
+
 /* linked list of verifier states used to prune search */
 struct bpf_verifier_state_list {
struct bpf_verifier_state state;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d92d9c37affd..f40e089c3893 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2216,10 +2216,9 @@ static void __clear_all_pkt_pointers(struct 
bpf_verifier_env *env,
if (reg_is_pkt_pointer_any(&regs[i]))
mark_reg_unknown(env, regs, i);
 
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg_is_pkt_pointer_any(reg))
__mark_reg_unknown(reg);
}
@@ -3326,10 +3325,9 @@ static void find_good_pkt_pointers(struct 
bpf_verifier_state *vstate,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg->type == type && reg->id == dst_reg->id)
reg->range = max(reg->range, new_range);
}
@@ -3574,7 +3572,7 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
  bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
-   struct bpf_reg_state *regs = state->regs;
+   struct bpf_reg_state *reg, *regs = state->regs;
u32 id = regs[regno].id;
int i, j;
 
@@ -3583,8 +3581,8 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
mark_map_reg(&state->stack[i].spilled_ptr, 0, id, 
is_null);
}
-- 
2.14.1



[RFC bpf-next 03/11] bpf: Generalize ptr_or_null regs check

2018-05-09 Thread Joe Stringer
This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.

Signed-off-by: Joe Stringer 
---
 kernel/bpf/verifier.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a32b560072d7..1b31b805dea4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -227,6 +227,11 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
   type == PTR_TO_PACKET_META;
 }
 
+static bool reg_type_may_be_null(enum bpf_reg_type type)
+{
+   return type == PTR_TO_MAP_VALUE_OR_NULL;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
[NOT_INIT]  = "?",
@@ -3531,12 +3536,10 @@ static void reg_combine_min_max(struct bpf_reg_state 
*true_src,
}
 }
 
-static void mark_map_reg(struct bpf_reg_state *regs, u32 regno, u32 id,
-bool is_null)
+static void mark_ptr_or_null_reg(struct bpf_reg_state *reg, u32 id,
+bool is_null)
 {
-   struct bpf_reg_state *reg = &regs[regno];
-
-   if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
+   if (reg_type_may_be_null(reg->type) && reg->id == id) {
/* Old offset (both fixed and variable parts) should
 * have been known-zero, because we don't allow pointer
 * arithmetic on pointers that might be NULL.
@@ -3549,11 +3552,13 @@ static void mark_map_reg(struct bpf_reg_state *regs, 
u32 regno, u32 id,
}
if (is_null) {
reg->type = SCALAR_VALUE;
-   } else if (reg->map_ptr->inner_map_meta) {
-   reg->type = CONST_PTR_TO_MAP;
-   reg->map_ptr = reg->map_ptr->inner_map_meta;
-   } else {
-   reg->type = PTR_TO_MAP_VALUE;
+   } else if (reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
+   if (reg->map_ptr->inner_map_meta) {
+   reg->type = CONST_PTR_TO_MAP;
+   reg->map_ptr = reg->map_ptr->inner_map_meta;
+   } else {
+   reg->type = PTR_TO_MAP_VALUE;
+   }
}
/* We don't need id from this point onwards anymore, thus we
 * should better reset it, so that state pruning has chances
@@ -3566,8 +3571,8 @@ static void mark_map_reg(struct bpf_reg_state *regs, u32 
regno, u32 id,
 /* The logic is similar to find_good_pkt_pointers(), both could eventually
  * be folded together at some point.
  */
-static void mark_map_regs(struct bpf_verifier_state *vstate, u32 regno,
- bool is_null)
+static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
+ bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
struct bpf_reg_state *reg, *regs = state->regs;
@@ -3575,14 +3580,14 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
int i, j;
 
for (i = 0; i < MAX_BPF_REG; i++)
-   mark_map_reg(regs, i, id, is_null);
+   mark_ptr_or_null_reg(&regs[i], id, is_null);
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
for_each_spilled_reg(i, state, reg) {
if (!reg)
continue;
-   mark_map_reg(&state->stack[i].spilled_ptr, 0, id, 
is_null);
+   mark_ptr_or_null_reg(reg, id, is_null);
}
}
 }
@@ -3784,12 +3789,14 @@ static int check_cond_jmp_op(struct bpf_verifier_env 
*env,
/* detect if R == 0 where R is returned from bpf_map_lookup_elem() */
if (BPF_SRC(insn->code) == BPF_K &&
insn->imm == 0 && (opcode == BPF_JEQ || opcode == BPF_JNE) &&
-   dst_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-   /* Mark all identical map registers in each branch as either
+   reg_type_may_be_null(dst_reg->type)) {
+   /* Mark all identical registers in each branch as either
 * safe or unknown depending R == 0 or R != 0 conditional.
 */
-   mark_map_regs(this_branch, insn->dst_reg, opcode == BPF_JNE);
-   mark_map_regs(other_branch, insn->dst_reg, opcode == BPF_JEQ);
+   mark_ptr_or_null_regs(this_branch, insn->dst_reg,
+ opcode == BPF_JNE);
+   mark_ptr_or_null_regs(other_branch, insn->dst_reg,
+  

[PATCH bpf-next] selftests/bpf: Fix bash reference in Makefile

2018-05-10 Thread Joe Stringer
'|& ...' is a bash 4.0+ construct which is not guaranteed to be available
when using '$(shell ...)' in a Makefile. Fall back to the more portable
'2>&1 | ...'.

Fixes the following warning during compilation:

/bin/sh: 1: Syntax error: "&" unexpected

Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 9d762184b805..79d29d6cc719 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -90,9 +90,9 @@ CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
 $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
 $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
 
-BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help |& grep dwarfris)
-BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help |& grep BTF)
-BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --version |& grep LLVM)
+BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
+BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
+BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --version 2>&1 | grep LLVM)
 
 ifneq ($(BTF_LLC_PROBE),)
 ifneq ($(BTF_PAHOLE_PROBE),)
-- 
2.14.1



Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-05-11 Thread Joe Stringer
On 10 May 2018 at 22:00, Martin KaFai Lau  wrote:
> On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
>> This patch adds a new BPF helper function, sk_lookup() which allows BPF
>> programs to find out if there is a socket listening on this host, and
>> returns a socket pointer which the BPF program can then access to
>> determine, for instance, whether to forward or drop traffic. sk_lookup()
>> takes a reference on the socket, so when a BPF program makes use of this
>> function, it must subsequently pass the returned pointer into the newly
>> added sk_release() to return the reference.
>>
>> By way of example, the following pseudocode would filter inbound
>> connections at XDP if there is no corresponding service listening for
>> the traffic:
>>
>>   struct bpf_sock_tuple tuple;
>>   struct bpf_sock_ops *sk;
>>
>>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
>>   sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
>>   if (!sk) {
>> // Couldn't find a socket listening for this traffic. Drop.
>> return TC_ACT_SHOT;
>>   }
>>   bpf_sk_release(sk, 0);
>>   return TC_ACT_OK;
>>
>> Signed-off-by: Joe Stringer 
>> ---

...

>> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto 
>> bpf_skb_get_xfrm_state_proto = {
>>  };
>>  #endif
>>
>> +struct sock *
>> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
> Would it be possible to have another version that
> returns a sk without taking its refcnt?
> It may have performance benefit.

Not really. The sockets are not RCU-protected, and established sockets
may be torn down without notice. If we don't take a reference, there's
no guarantee that the socket will continue to exist for the duration
of running the BPF program.

From what I follow, the comment below has a hidden implication, which
is that sockets without SOCK_RCU_FREE, e.g. established sockets, may be
directly freed regardless of RCU.

/* Sockets having SOCK_RCU_FREE will call this function after one RCU
 * grace period. This is the case for UDP sockets and TCP listeners.
 */
static void __sk_destruct(struct rcu_head *head)
...

Therefore without the refcount, it won't be safe.


Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-05-11 Thread Joe Stringer
On 11 May 2018 at 14:41, Martin KaFai Lau  wrote:
> On Fri, May 11, 2018 at 02:08:01PM -0700, Joe Stringer wrote:
>> On 10 May 2018 at 22:00, Martin KaFai Lau  wrote:
>> > On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
>> >> This patch adds a new BPF helper function, sk_lookup() which allows BPF
>> >> programs to find out if there is a socket listening on this host, and
>> >> returns a socket pointer which the BPF program can then access to
>> >> determine, for instance, whether to forward or drop traffic. sk_lookup()
>> >> takes a reference on the socket, so when a BPF program makes use of this
>> >> function, it must subsequently pass the returned pointer into the newly
>> >> added sk_release() to return the reference.
>> >>
>> >> By way of example, the following pseudocode would filter inbound
>> >> connections at XDP if there is no corresponding service listening for
>> >> the traffic:
>> >>
>> >>   struct bpf_sock_tuple tuple;
>> >>   struct bpf_sock_ops *sk;
>> >>
>> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
>> >>   sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
>> >>   if (!sk) {
>> >> // Couldn't find a socket listening for this traffic. Drop.
>> >> return TC_ACT_SHOT;
>> >>   }
>> >>   bpf_sk_release(sk, 0);
>> >>   return TC_ACT_OK;
>> >>
>> >> Signed-off-by: Joe Stringer 
>> >> ---
>>
>> ...
>>
>> >> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto 
>> >> bpf_skb_get_xfrm_state_proto = {
>> >>  };
>> >>  #endif
>> >>
>> >> +struct sock *
>> >> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
>> > Would it be possible to have another version that
>> > returns a sk without taking its refcnt?
>> > It may have performance benefit.
>>
>> Not really. The sockets are not RCU-protected, and established sockets
>> may be torn down without notice. If we don't take a reference, there's
>> no guarantee that the socket will continue to exist for the duration
>> of running the BPF program.
>>
>> From what I follow, the comment below has a hidden implication which
>> is that sockets without SOCK_RCU_FREE, eg established sockets, may be
>> directly freed regardless of RCU.
> Right, SOCK_RCU_FREE sk is the one I am concern about.
> For example, TCP_LISTEN socket does not require taking a refcnt
> now.  Doing a bpf_sk_lookup() may have a rather big
> impact on handling TCP syn flood.  or the usual intention
> is to redirect instead of passing it up to the stack?

I see. If you're only interested in listen sockets, then this series
could probably be extended with a new flag, e.g. something like
BPF_F_SK_FIND_LISTENERS, which restricts the set of possible sockets
found to only listen sockets; the implementation would then call into
__inet_lookup_listener() instead of inet_lookup(). The presence of
that flag in the relevant register during the CALL instruction would
show that the verifier should not reference-track the result, and
there'd need to be a check on the release to ensure that this
unreferenced socket is never released. Just a thought, completely
untested, and I could still be missing some detail (see the sketch at
the end of this message).

That said, I don't completely follow how you would expect to handle
the traffic for sockets that are already established - the helper
would no longer find those sockets, so you wouldn't know whether to
pass the traffic up the stack for established traffic or not.
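
For illustration only, that hypothetical calling convention might look
like the below; BPF_F_SK_FIND_LISTENERS does not exist in this series,
so this is a sketch of the untested idea rather than an API:

  /* Hypothetical: restrict the search to listen sockets. The verifier
   * would skip reference tracking for the result and would have to
   * reject any bpf_sk_release() of the unreferenced socket.
   */
  sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns,
                     BPF_F_SK_FIND_LISTENERS);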


Re: [ovs-dev] openvswitch crash on i386

2019-03-05 Thread Joe Stringer
On Tue, Mar 5, 2019 at 2:12 AM Christian Ehrhardt
 wrote:
>
> On Tue, Mar 5, 2019 at 10:58 AM Juerg Haefliger
>  wrote:
> >
> > Hi,
> >
> > Running the following commands in a loop will crash an i386 5.0 kernel
> > typically within a few iterations:
> >
> > ovs-vsctl add-br test
> > ovs-vsctl del-br test
> >
> > [  106.215748] BUG: unable to handle kernel paging request at e8a35f3b
> > [  106.216733] #PF error: [normal kernel read fault]
> > [  106.217464] *pdpt = 19a76001 *pde = 
> > [  106.218346] Oops:  [#1] SMP PTI
> > [  106.218911] CPU: 0 PID: 2050 Comm: systemd-udevd Tainted: GE 
> > 5.0.0 #25
> > [  106.220103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.11.1-1ubuntu1 04/01/2014
> > [  106.221447] EIP: kmem_cache_alloc_trace+0x7a/0x1b0
> > [  106.222178] Code: 01 00 00 8b 07 64 8b 50 04 64 03 05 28 61 e8 d2 8b 08 
> > 89 4d ec 85 c9 0f 84 03 01 00 00 8b 45 ec 8b 5f 14 8d 4a 01 8b 37 01 c3 
> > <33> 1b 33 9f b4 00 00 00 64 0f c7 0e 75 cb 8b 75 ec 8b 47 14 0f 18
> > [  106.224752] EAX: e8a35f3b EBX: e8a35f3b ECX: 869f EDX: 869e
> > [  106.225683] ESI: d2e96ef0 EDI: da401a00 EBP: d9b85dd0 ESP: d9b85db0
> > [  106.226662] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
> > [  106.227710] CR0: 80050033 CR2: e8a35f3b CR3: 185b8000 CR4: 06f0
> > [  106.228703] DR0:  DR1:  DR2:  DR3: 
> > [  106.229604] DR6: fffe0ff0 DR7: 0400
> > [  106.230114] Call Trace:
> > [  106.230525]  ? kernfs_fop_open+0xb4/0x390
> > [  106.231176]  kernfs_fop_open+0xb4/0x390
> > [  106.231856]  ? security_file_open+0x7c/0xc0
> > [  106.232562]  do_dentry_open+0x131/0x370
> > [  106.233229]  ? kernfs_fop_write+0x180/0x180
> > [  106.233905]  vfs_open+0x25/0x30
> > [  106.234432]  path_openat+0x2fd/0x1450
> > [  106.235084]  ? cp_new_stat64+0x115/0x140
> > [  106.235754]  ? cp_new_stat64+0x115/0x140
> > [  106.236427]  do_filp_open+0x6a/0xd0
> > [  106.237026]  ? cp_new_stat64+0x115/0x140
> > [  106.237748]  ? strncpy_from_user+0x3d/0x180
> > [  106.238539]  ? __alloc_fd+0x36/0x120
> > [  106.239256]  do_sys_open+0x175/0x210
> > [  106.239955]  sys_openat+0x1b/0x20
> > [  106.240596]  do_fast_syscall_32+0x7f/0x1e0
> > [  106.241313]  entry_SYSENTER_32+0x6b/0xbe
> > [  106.242017] EIP: 0xb7fae871
> > [  106.242559] Code: 8b 98 58 cd ff ff 89 c8 85 d2 74 02 89 0a 5b 5d c3 8b 
> > 04 24 c3 8b 14 24 c3 8b 34 24 c3 8b 3c 24 c3 51 52 55 89 e5 0f 34 cd 80 
> > <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> > [  106.245551] EAX: ffda EBX: ff9c ECX: bffdcb60 EDX: 00088000
> > [  106.246651] ESI:  EDI: b7f9e000 EBP: 00088000 ESP: bffdc970
> > [  106.247706] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 0246
> > [  106.248851] Modules linked in: openvswitch(E)
> > [  106.249621] CR2: e8a35f3b
> > [  106.250218] ---[ end trace 6a8d05679a59cda7 ]---
> >
> > I've bisected this down to the following commit that seems to have 
> > introduced
> > the issue:
> >
> > commit 120645513f55a4ac5543120d9e79925d30a0156f (refs/bisect/bad)
> > Author: Jarno Rajahalme 
> > Date:   Fri Apr 21 16:48:06 2017 -0700
> >
> > openvswitch: Add eventmask support to CT action.
> >
> > Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
> > which can be used in conjunction with the commit flag
> > (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
> > conntrack events (IPCT_*) should be delivered via the Netfilter
> > netlink multicast groups.  Default behavior depends on the system
> > configuration, but typically a lot of events are delivered.  This can be
> > very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
> > types of events are of interest.
> >
> > Netfilter core init_conntrack() adds the event cache extension, so we
> > only need to set the ctmask value.  However, if the system is
> > configured without support for events, the setting will be skipped due
> > to extension not being found.
> >
> > Signed-off-by: Jarno Rajahalme 
> > Reviewed-by: Greg Rose 
> > Acked-by: Joe Stringer 
> > Signed-off-by: David S. Miller 
>
> Hi Juerg,
> the symptom, the identified breaking commit and actually all of it
> seems to be [1] which James, Joseph and I worked on already.
> I wanted to make you aware of the past context that already exists.

Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers

2019-05-15 Thread Joe Stringer
On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer  wrote:
>
> In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> that the sk_lookup_* helpers currently return inconsistent results if
> SK_REUSEPORT programs are in play.
>
> SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> to the full packet
> that triggered the look up. To support this, inet_lookup gained a new
> skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> program is skipped and instead the socket is selected by its hash.
>
> The first problem is that not all callers to inet_lookup from BPF have
> an skb, e.g. XDP. This means that a look up from XDP gives an
> incorrect result. For now that is not a huge problem. However, once we
> get sk_assign as proposed by Joe, we can end up circumventing
> SK_REUSEPORT.

To clarify a bit, the reason this is a problem is that a
straightforward implementation may just consider passing the skb
context into the sk_lookup_*() and through to the inet_lookup() so
that it would run the SK_REUSEPORT BPF program for socket selection on
the skb when the packet-path BPF program performs the socket lookup.
However, as this paragraph describes, the skb context is not always
available.

> At the conference, someone suggested using a similar approach to the
> work done on the flow dissector by Stanislav: create a dedicated
> context sk_reuseport which can either take an skb or a plain pointer.
> Patch up load_bytes to deal with both. Pass the context to
> inet_lookup.
>
> This is when we hit the second problem: using the skb or XDP context
> directly is incorrect, because it assumes that the relevant protocol
> headers are at the start of the buffer. In our use case, the correct
> headers are at an offset since we're inspecting encapsulated packets.
>
> The best solution I've come up with is to steal 17 bits from the flags
> argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> the offset itself.

FYI there's also the upper 32 bits of the netns_id parameter, another
option would be to steal 16 bits from there.

> Thoughts?

Internally with skbs, we use `skb_pull()` to manage header offsets;
could we do something similar with `bpf_xdp_adjust_head()` prior to
the call to `bpf_sk_lookup_*()`? A rough, untested sketch follows.
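
Something like the below, assuming 'encap' bytes of encapsulation and
the current bpf_sk_lookup_tcp()/bpf_sk_release() helpers; whether the
adjusted-away bytes survive the round trip is exactly the open
question raised in the follow-up:

  if (bpf_xdp_adjust_head(ctx, encap))    /* skip outer headers */
          return XDP_ABORTED;
  /* ... parse the inner headers into 'tuple' ... */
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4),
                         BPF_F_CURRENT_NETNS, 0);
  if (sk)
          bpf_sk_release(sk);
  if (bpf_xdp_adjust_head(ctx, -encap))   /* attempt to restore */
          return XDP_ABORTED;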


Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers

2019-05-18 Thread Joe Stringer
On Fri, May 17, 2019 at 7:15 AM Lorenz Bauer  wrote:
>
> On Thu, 16 May 2019 at 21:33, Alexei Starovoitov
>  wrote:
> >
> > On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> > > On Wed, 15 May 2019 at 18:16, Joe Stringer  wrote:
> > > >
> > > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer  
> > > > wrote:
> > > > >
> > > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > > > that the sk_lookup_* helpers currently return inconsistent results if
> > > > > SK_REUSEPORT programs are in play.
> > > > >
> > > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > > > to the full packet
> > > > > that triggered the look up. To support this, inet_lookup gained a new
> > > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > > > program is skipped and instead the socket is selected by its hash.
> > > > >
> > > > > The first problem is that not all callers to inet_lookup from BPF have
> > > > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > > > incorrect result. For now that is not a huge problem. However, once we
> > > > > get sk_assign as proposed by Joe, we can end up circumventing
> > > > > SK_REUSEPORT.
> > > >
> > > > To clarify a bit, the reason this is a problem is that a
> > > > straightforward implementation may just consider passing the skb
> > > > context into the sk_lookup_*() and through to the inet_lookup() so
> > > > that it would run the SK_REUSEPORT BPF program for socket selection on
> > > > the skb when the packet-path BPF program performs the socket lookup.
> > > > However, as this paragraph describes, the skb context is not always
> > > > available.
> > > >
> > > > > At the conference, someone suggested using a similar approach to the
> > > > > work done on the flow dissector by Stanislav: create a dedicated
> > > > > context sk_reuseport which can either take an skb or a plain pointer.
> > > > > Patch up load_bytes to deal with both. Pass the context to
> > > > > inet_lookup.
> > > > >
> > > > > This is when we hit the second problem: using the skb or XDP context
> > > > > directly is incorrect, because it assumes that the relevant protocol
> > > > > headers are at the start of the buffer. In our use case, the correct
> > > > > headers are at an offset since we're inspecting encapsulated packets.
> > > > >
> > > > > The best solution I've come up with is to steal 17 bits from the flags
> > > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > > > the offset itself.
> > > >
> > > > FYI there's also the upper 32 bits of the netns_id parameter, another
> > > > option would be to steal 16 bits from there.
> > >
> > > Or len, which is only 16 bits realistically. The offset doesn't really 
> > > fit into
> > > either of them very well, using flags seemed the cleanest to me.
> > > Is there some best practice around this?
> > >
> > > >
> > > > > Thoughts?
> > > >
> > > > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > > > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > > > the call to `bpf_sk_lookup_*()`?
> > >
> > > That would only work if it retained the contents of the skipped
> > > buffer, and if there
> > > was a way to undo the adjustment later. We're doing the sk_lookup to
> > > decide whether to
> > > accept or forward the packet, so at the point of the call we might still 
> > > need
> > > that data. Is that feasible with skb / XDP ctx?
> >
> > While discussing the solution for reuseport I propose to use
> > progs/test_select_reuseport_kern.c as an example of realistic program.
> > It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
> > including payload after the header.
> > It also uses bpf_skb_load_bytes_relative() to fetch IP.
> > I think if we're fixing the sk_lookup from XDP the above program
> > would need to work.
>
> Agreed.
>
> > And I think we can make it work by adding new requirement that
> > 'struct bpf_sock_tuple *&#

Re: [PATCH bpf] bpf: Check sk_fullsock() before returning from bpf_sk_lookup()

2019-05-18 Thread Joe Stringer
On Sat, May 18, 2019, 09:05 Martin Lau  wrote:
>
> On Sat, May 18, 2019 at 08:38:46AM -1000, Joe Stringer wrote:
> > On Fri, May 17, 2019, 12:02 Martin Lau  wrote:
> >
> > > On Fri, May 17, 2019 at 02:51:48PM -0700, Eric Dumazet wrote:
> > > >
> > > >
> > > > On 5/17/19 2:21 PM, Martin KaFai Lau wrote:
> > > > > The BPF_FUNC_sk_lookup_xxx helpers return RET_PTR_TO_SOCKET_OR_NULL.
> > > > > Meaning a fullsock ptr and its fullsock's fields in bpf_sock can be
> > > > > accessed, e.g. type, protocol, mark and priority.
> > > > > Some new helper, like bpf_sk_storage_get(), also expects
> > > > > ARG_PTR_TO_SOCKET is a fullsock.
> > > > >
> > > > > bpf_sk_lookup() currently calls sk_to_full_sk() before returning.
> > > > > However, the ptr returned from sk_to_full_sk() is not guaranteed
> > > > > to be a fullsock.  For example, it cannot get a fullsock if sk
> > > > > is in TCP_TIME_WAIT.
> > > > >
> > > > > This patch checks for sk_fullsock() before returning. If it is not
> > > > > a fullsock, sock_gen_put() is called if needed and then returns NULL.
> > > > >
> > > > > Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> > > > > Cc: Joe Stringer 
> > > > > Signed-off-by: Martin KaFai Lau 
> > > > > ---
> > > > >  net/core/filter.c | 16 ++--
> > > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > index 55bfc941d17a..85def5a20aaf 100644
> > > > > --- a/net/core/filter.c
> > > > > +++ b/net/core/filter.c
> > > > > @@ -5337,8 +5337,14 @@ __bpf_sk_lookup(struct sk_buff *skb, struct
> > > bpf_sock_tuple *tuple, u32 len,
> > > > > struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> > > > >ifindex, proto, netns_id,
> > > flags);
> > > > >
> > > > > -   if (sk)
> > > > > +   if (sk) {
> > > > > sk = sk_to_full_sk(sk);
> > > > > +   if (!sk_fullsock(sk)) {
> > > > > +   if (!sock_flag(sk, SOCK_RCU_FREE))
> > > > > +   sock_gen_put(sk);
> > > >
> > > > This looks a bit convoluted/weird.
> > > >
> > > > What about telling/asking __bpf_skc_lookup() to not return a non
> > > fullsock instead ?
> > > It is becausee some other helpers, like BPF_FUNC_skc_lookup_tcp,
> > > can return non fullsock
> > >
> >
> > FYI this is necessary for finding a transparently proxied socket for a
> > non-local connection (tproxy use case).
> You meant it is necessary to return a non fullsock from the
> BPF_FUNC_sk_lookup_xxx helpers?

Yes, that's what I want to associate with the skb so that delivery
to the SO_TRANSPARENT socket works properly.

For the first packet of a connection, we look up the socket using the
tproxy socket port as the destination, and deliver the packet there.
The SO_TRANSPARENT logic then kicks in and sends back the ack and
creates the non-full sock for the connection tuple, which can be
entirely unrelated to local addresses or ports.

For the second forward-direction packet, (ie ACK in 3-way handshake)
then we must deliver the packet to this non-full sock as that's what
is negotiating the proxied connection. If you look up using the packet
tuple then get the full sock from it, it will go back to the
SO_TRANSPARENT parent socket. Delivering the ACK there will result in
a RST being sent back, because the SO_TRANSPARENT socket is just there
to accept new connections for connections to be proxied. So this is
the case where I need the non-full sock.

(In practice, the lookup logic attempts the packet tuple first then if
that fails, uses the tproxy port for lookup to achieve the above).
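
As a minimal sketch of that two-step lookup, assuming an IPv4 tuple
parsed from the packet and a known 'tproxy_port':

  /* Try the packet 5-tuple first; fall back to the tproxy port. The
   * common (possibly non-full) sock is what the mid-handshake packets
   * described above must reach.
   */
  sk = bpf_skc_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                          BPF_F_CURRENT_NETNS, 0);
  if (!sk) {
          tuple.ipv4.dport = tproxy_port;
          sk = bpf_skc_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                  BPF_F_CURRENT_NETNS, 0);
  }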


Re: [PATCH bpf] bpf: Check sk_fullsock() before returning from bpf_sk_lookup()

2019-05-20 Thread Joe Stringer
On Sat, May 18, 2019 at 7:08 PM Martin Lau  wrote:
>
> On Sat, May 18, 2019 at 06:52:48PM -0700, Joe Stringer wrote:
> > On Sat, May 18, 2019, 09:05 Martin Lau  wrote:
> > >
> > > On Sat, May 18, 2019 at 08:38:46AM -1000, Joe Stringer wrote:
> > > > On Fri, May 17, 2019, 12:02 Martin Lau  wrote:
> > > >
> > > > > On Fri, May 17, 2019 at 02:51:48PM -0700, Eric Dumazet wrote:
> > > > > >
> > > > > >
> > > > > > On 5/17/19 2:21 PM, Martin KaFai Lau wrote:
> > > > > > > The BPF_FUNC_sk_lookup_xxx helpers return 
> > > > > > > RET_PTR_TO_SOCKET_OR_NULL.
> > > > > > > Meaning a fullsock ptr and its fullsock's fields in bpf_sock can 
> > > > > > > be
> > > > > > > accessed, e.g. type, protocol, mark and priority.
> > > > > > > Some new helper, like bpf_sk_storage_get(), also expects
> > > > > > > ARG_PTR_TO_SOCKET is a fullsock.
> > > > > > >
> > > > > > > bpf_sk_lookup() currently calls sk_to_full_sk() before returning.
> > > > > > > However, the ptr returned from sk_to_full_sk() is not guaranteed
> > > > > > > to be a fullsock.  For example, it cannot get a fullsock if sk
> > > > > > > is in TCP_TIME_WAIT.
> > > > > > >
> > > > > > > This patch checks for sk_fullsock() before returning. If it is not
> > > > > > > a fullsock, sock_gen_put() is called if needed and then returns 
> > > > > > > NULL.
> > > > > > >
> > > > > > > Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> > > > > > > Cc: Joe Stringer 
> > > > > > > Signed-off-by: Martin KaFai Lau 
> > > > > > > ---
> > > > > > >  net/core/filter.c | 16 ++--
> > > > > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > > > index 55bfc941d17a..85def5a20aaf 100644
> > > > > > > --- a/net/core/filter.c
> > > > > > > +++ b/net/core/filter.c
> > > > > > > @@ -5337,8 +5337,14 @@ __bpf_sk_lookup(struct sk_buff *skb, struct
> > > > > bpf_sock_tuple *tuple, u32 len,
> > > > > > > struct sock *sk = __bpf_skc_lookup(skb, tuple, len, 
> > > > > > > caller_net,
> > > > > > >ifindex, proto, netns_id,
> > > > > flags);
> > > > > > >
> > > > > > > -   if (sk)
> > > > > > > +   if (sk) {
> > > > > > > sk = sk_to_full_sk(sk);
> > > > > > > +   if (!sk_fullsock(sk)) {
> > > > > > > +   if (!sock_flag(sk, SOCK_RCU_FREE))
> > > > > > > +   sock_gen_put(sk);
> > > > > >
> > > > > > This looks a bit convoluted/weird.
> > > > > >
> > > > > > What about telling/asking __bpf_skc_lookup() to not return a non
> > > > > fullsock instead ?
> > > > > It is becausee some other helpers, like BPF_FUNC_skc_lookup_tcp,
> > > > > can return non fullsock
> > > > >
> > > >
> > > > FYI this is necessary for finding a transparently proxied socket for a
> > > > non-local connection (tproxy use case).
> > > You meant it is necessary to return a non fullsock from the
> > > BPF_FUNC_sk_lookup_xxx helpers?
> >
> > Yes, that's what I want to associate with the skb so that the delivery
> > to the SO_TRANSPARENT is received properly.
> >
> > For the first packet of a connection, we look up the socket using the
> > tproxy socket port as the destination, and deliver the packet there.
> > The SO_TRANSPARENT logic then kicks in and sends back the ack and
> > creates the non-full sock for the connection tuple, which can be
> > entirely unrelated to local addresses or ports.
> >
> > For the second forward-direction packet, (ie ACK in 3-way handshake)
> > then we must deliver the packet to this non-full sock as that's what
> > is negotiating the proxied connection. If you look up using the packet
> > tuple then get the full sock from it, it will go back to the
> > SO_TRANSPARENT parent socket. Delivering the ACK there will result in
> > a RST being sent back, because the SO_TRANSPARENT socket is just there
> > to accept new connections for connections to be proxied. So this is
> > the case where I need the non-full sock.
> >
> > (In practice, the lookup logic attempts the packet tuple first then if
> > that fails, uses the tproxy port for lookup to achieve the above).
> hmm...I am likely missing something.
>
> 1) The above can be done by the "BPF_FUNC_skC_lookup_tcp" which
>returns a non fullsock (RET_PTR_TO_SOCK_COMMON_OR_NULL), no?

Correct, I meant to send as response to Eric as to use cases for
__bpf_skc_lookup() returning non fullsock.

> 2) The bpf_func_proto of "BPF_FUNC_sk_lookup_tcp" returns
>fullsock (RET_PTR_TO_SOCKET_OR_NULL) and the bpf_prog (and
>the verifier) is expecting that.  How to address the bug here?

Your proposal seems fine to me.


Re: [PATCH bpf] bpf: Check sk_fullsock() before returning from bpf_sk_lookup()

2019-05-20 Thread Joe Stringer
On Fri, May 17, 2019 at 2:21 PM Martin KaFai Lau  wrote:
>
> The BPF_FUNC_sk_lookup_xxx helpers return RET_PTR_TO_SOCKET_OR_NULL.
> Meaning a fullsock ptr and its fullsock's fields in bpf_sock can be
> accessed, e.g. type, protocol, mark and priority.
> Some new helper, like bpf_sk_storage_get(), also expects
> ARG_PTR_TO_SOCKET is a fullsock.
>
> bpf_sk_lookup() currently calls sk_to_full_sk() before returning.
> However, the ptr returned from sk_to_full_sk() is not guaranteed
> to be a fullsock.  For example, it cannot get a fullsock if sk
> is in TCP_TIME_WAIT.
>
> This patch checks for sk_fullsock() before returning. If it is not
> a fullsock, sock_gen_put() is called if needed and then returns NULL.
>
> Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> Cc: Joe Stringer 
> Signed-off-by: Martin KaFai Lau 
> ---

Acked-by: Joe Stringer 


Re: [PATCH net-next] openvswitch: add ct_clear action

2017-10-10 Thread Joe Stringer
On 9 October 2017 at 21:41, Pravin Shelar  wrote:
> On Fri, Oct 6, 2017 at 9:44 AM, Eric Garver  wrote:
>> This adds a ct_clear action for clearing conntrack state. ct_clear is
>> currently implemented in OVS userspace, but is not backed by an action
>> in the kernel datapath. This is useful for flows that may modify a
>> packet tuple after a ct lookup has already occurred.
>>
>> Signed-off-by: Eric Garver 
> Patch mostly looks good. I have following comments.
>
>> ---
>>  include/uapi/linux/openvswitch.h |  2 ++
>>  net/openvswitch/actions.c|  5 +
>>  net/openvswitch/conntrack.c  | 12 
>>  net/openvswitch/conntrack.h  |  7 +++
>>  net/openvswitch/flow_netlink.c   |  5 +
>>  5 files changed, 31 insertions(+)
>>
>> diff --git a/include/uapi/linux/openvswitch.h 
>> b/include/uapi/linux/openvswitch.h
>> index 156ee4cab82e..1b6e510e2cc6 100644
>> --- a/include/uapi/linux/openvswitch.h
>> +++ b/include/uapi/linux/openvswitch.h
>> @@ -806,6 +806,7 @@ struct ovs_action_push_eth {
>>   * packet.
>>   * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
>>   * packet.
>> + * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
>>   *
>>   * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not 
>> all
>>   * fields within a header are modifiable, e.g. the IPv4 protocol and 
>> fragment
>> @@ -835,6 +836,7 @@ enum ovs_action_attr {
>> OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
>> OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
>> OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
>> +   OVS_ACTION_ATTR_CT_CLEAR, /* No argument. */
>>
>> __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted
>>* from userspace. */
>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index a54a556fcdb5..db9c7f2e662b 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath *dp, 
>> struct sk_buff *skb,
>> return err == -EINPROGRESS ? 0 : err;
>> break;
>>
>> +   case OVS_ACTION_ATTR_CT_CLEAR:
>> +   err = ovs_ct_clear(skb, key);
>> +   break;
>> +
>> case OVS_ACTION_ATTR_PUSH_ETH:
>> err = push_eth(skb, key, nla_data(a));
>> break;
>> @@ -1210,6 +1214,7 @@ static int do_execute_actions(struct datapath *dp, 
>> struct sk_buff *skb,
>> case OVS_ACTION_ATTR_POP_ETH:
>> err = pop_eth(skb, key);
>> break;
>> +
>> }
> Unrelated change.
>
>>
>> if (unlikely(err)) {
>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> index d558e882ca0c..f9b73c726ad7 100644
>> --- a/net/openvswitch/conntrack.c
>> +++ b/net/openvswitch/conntrack.c
>> @@ -1129,6 +1129,18 @@ int ovs_ct_execute(struct net *net, struct sk_buff 
>> *skb,
>> return err;
>>  }
>>
>> +int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
>> +{
>> +   if (skb_nfct(skb)) {
>> +   nf_conntrack_put(skb_nfct(skb));
>> +   nf_ct_set(skb, NULL, 0);
> Can the new conntrack state be more appropriate? Maybe IP_CT_UNTRACKED?
>
>> +   }
>> +
>> +   ovs_ct_fill_key(skb, key);
>> +
> I do not see the need to refill the key if there is no skb nfct.

Really this is trying to just zero the CT key fields, but reuses
existing functions, right? This means that subsequent upcalls, for
instance, won't have the outdated view of the CT state from the
previous lookup (that was prior to the ct_clear). I'd expect these key
fields to be cleared.


Re: [PATCH net-next] openvswitch: add ct_clear action

2017-10-10 Thread Joe Stringer
On 10 October 2017 at 08:09, Eric Garver  wrote:
> On Tue, Oct 10, 2017 at 05:33:48AM -0700, Joe Stringer wrote:
>> On 9 October 2017 at 21:41, Pravin Shelar  wrote:
>> > On Fri, Oct 6, 2017 at 9:44 AM, Eric Garver  wrote:
>> >> This adds a ct_clear action for clearing conntrack state. ct_clear is
>> >> currently implemented in OVS userspace, but is not backed by an action
>> >> in the kernel datapath. This is useful for flows that may modify a
>> >> packet tuple after a ct lookup has already occurred.
>> >>
>> >> Signed-off-by: Eric Garver 
>> > Patch mostly looks good. I have following comments.
>> >
>> >> ---
>> >>  include/uapi/linux/openvswitch.h |  2 ++
>> >>  net/openvswitch/actions.c        |  5 +++++
>> >>  net/openvswitch/conntrack.c      | 12 ++++++++++++
>> >>  net/openvswitch/conntrack.h      |  7 +++++++
>> >>  net/openvswitch/flow_netlink.c   |  5 +++++
>> >>  5 files changed, 31 insertions(+)
>> >>
>> >> diff --git a/include/uapi/linux/openvswitch.h 
>> >> b/include/uapi/linux/openvswitch.h
>> >> index 156ee4cab82e..1b6e510e2cc6 100644
>> >> --- a/include/uapi/linux/openvswitch.h
>> >> +++ b/include/uapi/linux/openvswitch.h
>> >> @@ -806,6 +806,7 @@ struct ovs_action_push_eth {
>> >>   * packet.
>> >>   * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
>> >>   * packet.
>> >> + * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
>> >>   *
>> >>   * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  
>> >> Not all
>> >>   * fields within a header are modifiable, e.g. the IPv4 protocol and 
>> >> fragment
>> >> @@ -835,6 +836,7 @@ enum ovs_action_attr {
>> >> OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
>> >> OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
>> >> OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
>> >> +   OVS_ACTION_ATTR_CT_CLEAR, /* No argument. */
>> >>
>> >> __OVS_ACTION_ATTR_MAX,/* Nothing past this will be 
>> >> accepted
>> >>* from userspace. */
>> >> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> >> index a54a556fcdb5..db9c7f2e662b 100644
>> >> --- a/net/openvswitch/actions.c
>> >> +++ b/net/openvswitch/actions.c
>> >> @@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath *dp, 
>> >> struct sk_buff *skb,
>> >> return err == -EINPROGRESS ? 0 : err;
>> >> break;
>> >>
>> >> +   case OVS_ACTION_ATTR_CT_CLEAR:
>> >> +   err = ovs_ct_clear(skb, key);
>> >> +   break;
>> >> +
>> >> case OVS_ACTION_ATTR_PUSH_ETH:
>> >> err = push_eth(skb, key, nla_data(a));
>> >> break;
>> >> @@ -1210,6 +1214,7 @@ static int do_execute_actions(struct datapath *dp, 
>> >> struct sk_buff *skb,
>> >> case OVS_ACTION_ATTR_POP_ETH:
>> >> err = pop_eth(skb, key);
>> >> break;
>> >> +
>> >> }
>> > Unrelated change.
>> >
>> >>
>> >> if (unlikely(err)) {
>> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> >> index d558e882ca0c..f9b73c726ad7 100644
>> >> --- a/net/openvswitch/conntrack.c
>> >> +++ b/net/openvswitch/conntrack.c
>> >> @@ -1129,6 +1129,18 @@ int ovs_ct_execute(struct net *net, struct sk_buff 
>> >> *skb,
>> >> return err;
>> >>  }
>> >>
>> >> +int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
>> >> +{
>> >> +   if (skb_nfct(skb)) {
>> >> +   nf_conntrack_put(skb_nfct(skb));
>> >> +   nf_ct_set(skb, NULL, 0);
>> > Can the new conntrack state be more appropriate? Maybe IP_CT_UNTRACKED?
>> >
>> >> +   }
>> >> +
>> >> +   ovs_ct_fill_key(skb, key);
>> >> +
>> > I do not see the need to refill the key if there is no skb nfct.
>>
>> Really this is trying to just zero the CT key fields, but reuses
>> existing functions, right? This means that subsequent upcalls, for
>
> Right.
>
>> instance, won't have the outdated view of the CT state from the
>> previous lookup (that was prior to the ct_clear). I'd expect these key
>> fields to be cleared.
>
> I assumed Pravin was saying that we don't need to clear them if there is
> no conntrack state. They should already be zero.

The conntrack calls aren't going to clear it, so I don't see what else
would clear it?

If you execute ct(),ct_clear(), then the first ct() will set the
values... what will zero them?
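
Pulling the review comments together, folding Pravin's IP_CT_UNTRACKED
suggestion into the patch while keeping the key refill tied to the
presence of CT state might look roughly like this (a sketch of one
possible resolution, not necessarily the final merged code):

  int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
  {
          if (skb_nfct(skb)) {
                  nf_conntrack_put(skb_nfct(skb));
                  /* Mark the skb untracked rather than leaving a stale
                   * conntrack state behind.
                   */
                  nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
                  /* Zero the CT key fields, but only when there was CT
                   * state to clear; otherwise they are already zero.
                   */
                  ovs_ct_fill_key(skb, key);
          }

          return 0;
  }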


Re: [ovs-dev] [PATCH net-next] openvswitch: add ct_clear action

2017-10-10 Thread Joe Stringer
On 10 October 2017 at 12:13, Eric Garver  wrote:
> On Tue, Oct 10, 2017 at 10:24:20AM -0700, Joe Stringer wrote:
>> On 10 October 2017 at 08:09, Eric Garver  wrote:
>> > On Tue, Oct 10, 2017 at 05:33:48AM -0700, Joe Stringer wrote:
>> >> On 9 October 2017 at 21:41, Pravin Shelar  wrote:
>> >> > On Fri, Oct 6, 2017 at 9:44 AM, Eric Garver  wrote:
>> >> >> This adds a ct_clear action for clearing conntrack state. ct_clear is
>> >> >> currently implemented in OVS userspace, but is not backed by an action
>> >> >> in the kernel datapath. This is useful for flows that may modify a
>> >> >> packet tuple after a ct lookup has already occurred.
>> >> >>
>> >> >> Signed-off-by: Eric Garver 
>> >> > Patch mostly looks good. I have following comments.
>> >> >
>> >> >> ---
>> >> >>  include/uapi/linux/openvswitch.h |  2 ++
>> >> >>  net/openvswitch/actions.c|  5 +
>> >> >>  net/openvswitch/conntrack.c  | 12 
>> >> >>  net/openvswitch/conntrack.h  |  7 +++
>> >> >>  net/openvswitch/flow_netlink.c   |  5 +
>> >> >>  5 files changed, 31 insertions(+)
>> >> >>
>> >> >> diff --git a/include/uapi/linux/openvswitch.h 
>> >> >> b/include/uapi/linux/openvswitch.h
>> >> >> index 156ee4cab82e..1b6e510e2cc6 100644
>> >> >> --- a/include/uapi/linux/openvswitch.h
>> >> >> +++ b/include/uapi/linux/openvswitch.h
>> >> >> @@ -806,6 +806,7 @@ struct ovs_action_push_eth {
>> >> >>   * packet.
>> >> >>   * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
>> >> >>   * packet.
>> >> >> + * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
>> >> >>   *
>> >> >>   * Only a single header can be set with a single 
>> >> >> %OVS_ACTION_ATTR_SET.  Not all
>> >> >>   * fields within a header are modifiable, e.g. the IPv4 protocol and 
>> >> >> fragment
>> >> >> @@ -835,6 +836,7 @@ enum ovs_action_attr {
>> >> >> OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. 
>> >> >> */
>> >> >> OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
>> >> >> OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
>> >> >> +   OVS_ACTION_ATTR_CT_CLEAR, /* No argument. */
>> >> >>
>> >> >> __OVS_ACTION_ATTR_MAX,/* Nothing past this will be 
>> >> >> accepted
>> >> >>* from userspace. */
>> >> >> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> >> >> index a54a556fcdb5..db9c7f2e662b 100644
>> >> >> --- a/net/openvswitch/actions.c
>> >> >> +++ b/net/openvswitch/actions.c
>> >> >> @@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath 
>> >> >> *dp, struct sk_buff *skb,
>> >> >> return err == -EINPROGRESS ? 0 : err;
>> >> >> break;
>> >> >>
>> >> >> +   case OVS_ACTION_ATTR_CT_CLEAR:
>> >> >> +   err = ovs_ct_clear(skb, key);
>> >> >> +   break;
>> >> >> +
>> >> >> case OVS_ACTION_ATTR_PUSH_ETH:
>> >> >> err = push_eth(skb, key, nla_data(a));
>> >> >> break;
>> >> >> @@ -1210,6 +1214,7 @@ static int do_execute_actions(struct datapath 
>> >> >> *dp, struct sk_buff *skb,
>> >> >> case OVS_ACTION_ATTR_POP_ETH:
>> >> >> err = pop_eth(skb, key);
>> >> >> break;
>> >> >> +
>> >> >> }
>> >> > Unrelated change.
>> >> >
>> >> >>
>> >> >> if (unlikely(err)) {
>> >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> >> >> index d558e882ca0c..f9b73c726ad7 100644
>> >> >> --- a/net/openvswitc

[PATCH iproute2] bpf: Print section name when hitting non ld64 issue

2018-02-28 Thread Joe Stringer
It's useful to be able to tell which section is being processed in the
ELF when this error is triggered, so print that detail.
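
With this patch applied, the diagnostic might read as follows (the
offset and section name here are invented for illustration):

  ELF contains relo data for non ld64 instruction at offset 29! Compiler bug?!
   - Current section: classifier
   - Try to annotate functions with always_inline attribute!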

Signed-off-by: Joe Stringer 
---
 lib/bpf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/bpf.c b/lib/bpf.c
index 2db151e4dd3c..c38d92d87759 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -2039,6 +2039,7 @@ static int bpf_apply_relo_data(struct bpf_elf_ctx *ctx,
 		    insns[ioff].code != (BPF_LD | BPF_IMM | BPF_DW)) {
 			fprintf(stderr, "ELF contains relo data for non ld64 instruction at offset %u! Compiler bug?!\n",
 				ioff);
+			fprintf(stderr, " - Current section: %s\n",
+				data_relo->sec_name);
 			if (ioff < num_insns &&
 			    insns[ioff].code == (BPF_JMP | BPF_CALL))
 				fprintf(stderr, " - Try to annotate functions with always_inline attribute!\n");
-- 
2.14.1



Re: [RFC PATCH bpf-next v2 0/4] Implement bpf queue/stack maps

2018-09-06 Thread Joe Stringer
On Thu, 6 Sep 2018 at 17:13, Alexei Starovoitov
 wrote:
> bpf_map_pop_elem() is trying to do lookup_and_delete and preserve
> validity of value without races.
> With pcpu_freelist I don't think there is a solution.
> We can have this queue/stack map without prealloc and use kmalloc/kfree
> back and forth. Performance will not be as great, but for your use case,
> I suspect, it will be good enough.
> The key issue with kmalloc/kfree is the unbounded time of RCU callbacks.
> If somebody starts doing push/pop for every packet, the RCU subsystem
> will struggle and there is nothing we can do about it.
>
> The only way I could think of to resolve this problem is to reuse
> the logic that Joe is working on for socket lookups inside the program.
> Joe,
> how is that going? Could you repost the latest patches?

I can rebase & send them out. Was just wanting to get a little more testing in.

Cheers,
Joe


[PATCH bpf-next 03/11] bpf: Generalize ptr_or_null regs check

2018-09-11 Thread Joe Stringer
This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 61b60e141b6a..f2357c8c90de 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -249,6 +249,11 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
   type == PTR_TO_PACKET_META;
 }
 
+static bool reg_type_may_be_null(enum bpf_reg_type type)
+{
+   return type == PTR_TO_MAP_VALUE_OR_NULL;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
[NOT_INIT]  = "?",
@@ -3567,12 +3572,10 @@ static void reg_combine_min_max(struct bpf_reg_state 
*true_src,
}
 }
 
-static void mark_map_reg(struct bpf_reg_state *regs, u32 regno, u32 id,
-bool is_null)
+static void mark_ptr_or_null_reg(struct bpf_reg_state *reg, u32 id,
+bool is_null)
 {
-   struct bpf_reg_state *reg = &regs[regno];
-
-   if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
+   if (reg_type_may_be_null(reg->type) && reg->id == id) {
/* Old offset (both fixed and variable parts) should
 * have been known-zero, because we don't allow pointer
 * arithmetic on pointers that might be NULL.
@@ -3585,11 +3588,13 @@ static void mark_map_reg(struct bpf_reg_state *regs, 
u32 regno, u32 id,
}
if (is_null) {
reg->type = SCALAR_VALUE;
-   } else if (reg->map_ptr->inner_map_meta) {
-   reg->type = CONST_PTR_TO_MAP;
-   reg->map_ptr = reg->map_ptr->inner_map_meta;
-   } else {
-   reg->type = PTR_TO_MAP_VALUE;
+   } else if (reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
+   if (reg->map_ptr->inner_map_meta) {
+   reg->type = CONST_PTR_TO_MAP;
+   reg->map_ptr = reg->map_ptr->inner_map_meta;
+   } else {
+   reg->type = PTR_TO_MAP_VALUE;
+   }
}
/* We don't need id from this point onwards anymore, thus we
 * should better reset it, so that state pruning has chances
@@ -3602,8 +3607,8 @@ static void mark_map_reg(struct bpf_reg_state *regs, u32 
regno, u32 id,
 /* The logic is similar to find_good_pkt_pointers(), both could eventually
  * be folded together at some point.
  */
-static void mark_map_regs(struct bpf_verifier_state *vstate, u32 regno,
- bool is_null)
+static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
+ bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
struct bpf_reg_state *reg, *regs = state->regs;
@@ -3611,14 +3616,14 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
int i, j;
 
for (i = 0; i < MAX_BPF_REG; i++)
-   mark_map_reg(regs, i, id, is_null);
+   mark_ptr_or_null_reg(&regs[i], id, is_null);
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
for_each_spilled_reg(i, state, reg) {
if (!reg)
continue;
-   mark_map_reg(&state->stack[i].spilled_ptr, 0, id, 
is_null);
+   mark_ptr_or_null_reg(reg, id, is_null);
}
}
 }
@@ -3820,12 +3825,14 @@ static int check_cond_jmp_op(struct bpf_verifier_env 
*env,
/* detect if R == 0 where R is returned from bpf_map_lookup_elem() */
if (BPF_SRC(insn->code) == BPF_K &&
insn->imm == 0 && (opcode == BPF_JEQ || opcode == BPF_JNE) &&
-   dst_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-   /* Mark all identical map registers in each branch as either
+   reg_type_may_be_null(dst_reg->type)) {
+   /* Mark all identical registers in each branch as either
 * safe or unknown depending R == 0 or R != 0 conditional.
 */
-   mark_map_regs(this_branch, insn->dst_reg, opcode == BPF_JNE);
-   mark_map_regs(other_branch, insn->dst_reg, opcode == BPF_JEQ);
+   mark_ptr_or_null_regs(this_branch, insn->dst_reg,
+ opcode == BPF_JNE);
+   mark

[PATCH bpf-next 05/11] bpf: Macrofy stack state copy

2018-09-11 Thread Joe Stringer
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros to
simplify it.

Signed-off-by: Joe Stringer 
---
 kernel/bpf/verifier.c | 106 --
 1 file changed, 60 insertions(+), 46 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 111e031cf65d..faa83b3d7011 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -387,60 +387,74 @@ static void print_verifier_state(struct bpf_verifier_env 
*env,
verbose(env, "\n");
 }
 
-static int copy_stack_state(struct bpf_func_state *dst,
-   const struct bpf_func_state *src)
-{
-   if (!src->stack)
-   return 0;
-   if (WARN_ON_ONCE(dst->allocated_stack < src->allocated_stack)) {
-   /* internal bug, make state invalid to reject the program */
-   memset(dst, 0, sizeof(*dst));
-   return -EFAULT;
-   }
-   memcpy(dst->stack, src->stack,
-  sizeof(*src->stack) * (src->allocated_stack / BPF_REG_SIZE));
-   return 0;
-}
+#define COPY_STATE_FN(NAME, COUNT, FIELD, SIZE)
\
+static int copy_##NAME##_state(struct bpf_func_state *dst, \
+  const struct bpf_func_state *src)\
+{  \
+   if (!src->FIELD)\
+   return 0;   \
+   if (WARN_ON_ONCE(dst->COUNT < src->COUNT)) {\
+   /* internal bug, make state invalid to reject the program */ \
+   memset(dst, 0, sizeof(*dst));   \
+   return -EFAULT; \
+   }   \
+   memcpy(dst->FIELD, src->FIELD,  \
+  sizeof(*src->FIELD) * (src->COUNT / SIZE));  \
+   return 0;   \
+}
+/* copy_stack_state() */
+COPY_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef COPY_STATE_FN
+
+#define REALLOC_STATE_FN(NAME, COUNT, FIELD, SIZE) \
+static int realloc_##NAME##_state(struct bpf_func_state *state, int size, \
+ bool copy_old)\
+{  \
+   u32 old_size = state->COUNT;\
+   struct bpf_##NAME##_state *new_##FIELD; \
+   int slot = size / SIZE; \
+   \
+   if (size <= old_size || !size) {\
+   if (copy_old)   \
+   return 0;   \
+   state->COUNT = slot * SIZE; \
+   if (!size && old_size) {\
+   kfree(state->FIELD);\
+   state->FIELD = NULL;\
+   }   \
+   return 0;   \
+   }   \
+   new_##FIELD = kmalloc_array(slot, sizeof(struct bpf_##NAME##_state), \
+   GFP_KERNEL);\
+   if (!new_##FIELD)   \
+   return -ENOMEM; \
+   if (copy_old) { \
+   if (state->FIELD)   \
+   memcpy(new_##FIELD, state->FIELD,   \
+  sizeof(*new_##FIELD) * (old_size / SIZE)); \
+   memset(new_##FIELD + old_size / SIZE, 0,\
+  sizeof(*new_##FIELD) * (size - old_size) / SIZE); \
+   }   \
+   state->COUNT = slot * SIZE; \
+   kfree(state->FIELD);\
+   state->FIELD = new_##FIELD; \
+   return 0;   \
+}
+/* realloc_stack_state() */
+REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef REALLOC_STATE_FN
 
 /* do_check() starts wi

[PATCH bpf-next 01/11] bpf: Add iterator for spilled registers

2018-09-11 Thread Joe Stringer
Add this iterator for spilled registers; it concentrates the details of
how to get the current frame's spilled registers into a single macro
while clarifying the intention of the code which is calling the macro.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf_verifier.h | 11 +++
 kernel/bpf/verifier.c| 16 +++-
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index b42b60a83e19..af262b97f586 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -131,6 +131,17 @@ struct bpf_verifier_state {
u32 curframe;
 };
 
+#define __get_spilled_reg(slot, frame) \
+   (((slot < frame->allocated_stack / BPF_REG_SIZE) && \
+ (frame->stack[slot].slot_type[0] == STACK_SPILL)) \
+? &frame->stack[slot].spilled_ptr : NULL)
+
+/* Iterate over 'frame', setting 'reg' to either NULL or a spilled register. */
+#define for_each_spilled_reg(iter, frame, reg) \
+   for (iter = 0, reg = __get_spilled_reg(iter, frame);\
+iter < frame->allocated_stack / BPF_REG_SIZE;  \
+iter++, reg = __get_spilled_reg(iter, frame))
+
 /* linked list of verifier states used to prune search */
 struct bpf_verifier_state_list {
struct bpf_verifier_state state;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6ff1bac1795d..97aac6ac1b0d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2219,10 +2219,9 @@ static void __clear_all_pkt_pointers(struct 
bpf_verifier_env *env,
if (reg_is_pkt_pointer_any(&regs[i]))
mark_reg_unknown(env, regs, i);
 
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg_is_pkt_pointer_any(reg))
__mark_reg_unknown(reg);
}
@@ -3362,10 +3361,9 @@ static void find_good_pkt_pointers(struct 
bpf_verifier_state *vstate,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg->type == type && reg->id == dst_reg->id)
reg->range = max(reg->range, new_range);
}
@@ -3610,7 +3608,7 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
  bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
-   struct bpf_reg_state *regs = state->regs;
+   struct bpf_reg_state *reg, *regs = state->regs;
u32 id = regs[regno].id;
int i, j;
 
@@ -3619,8 +3617,8 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
mark_map_reg(&state->stack[i].spilled_ptr, 0, id, 
is_null);
}
-- 
2.17.1



[PATCH bpf-next 02/11] bpf: Simplify ptr_min_max_vals adjustment

2018-09-11 Thread Joe Stringer
An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c   | 22 ++---
 tools/testing/selftests/bpf/test_verifier.c | 14 ++---
 2 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 97aac6ac1b0d..61b60e141b6a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2636,20 +2636,18 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
return -EACCES;
}
 
-   if (ptr_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-   verbose(env, "R%d pointer arithmetic on 
PTR_TO_MAP_VALUE_OR_NULL prohibited, null-check it first\n",
-   dst);
-   return -EACCES;
-   }
-   if (ptr_reg->type == CONST_PTR_TO_MAP) {
-   verbose(env, "R%d pointer arithmetic on CONST_PTR_TO_MAP 
prohibited\n",
-   dst);
+   switch (ptr_reg->type) {
+   case PTR_TO_MAP_VALUE_OR_NULL:
+   verbose(env, "R%d pointer arithmetic on %s prohibited, 
null-check it first\n",
+   dst, reg_type_str[ptr_reg->type]);
return -EACCES;
-   }
-   if (ptr_reg->type == PTR_TO_PACKET_END) {
-   verbose(env, "R%d pointer arithmetic on PTR_TO_PACKET_END 
prohibited\n",
-   dst);
+   case CONST_PTR_TO_MAP:
+   case PTR_TO_PACKET_END:
+   verbose(env, "R%d pointer arithmetic on %s prohibited\n",
+   dst, reg_type_str[ptr_reg->type]);
return -EACCES;
+   default:
+   break;
}
 
/* In case of 'scalar += pointer', dst_reg inherits pointer type and id.
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 67c412d19c09..ceb55a9f3da9 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3637,7 +3637,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
@@ -4780,7 +4780,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4801,7 +4801,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4822,7 +4822,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -7137,7 +7137,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map_in_map = { 3 },
-   .errstr = "R1 pointer arithmetic on CONST_PTR_TO_MAP 
prohibited",
+   .errstr = "R1 pointer arithmetic on map_ptr prohibited",
.result = REJECT,
},
{
@@ -8811,7 +8811,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
@@ -8830,7 +8830,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
-- 
2.17.1



[PATCH bpf-next 10/11] selftests/bpf: Add C tests for reference tracking

2018-09-11 Thread Joe Stringer
Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  |  38 ++
 .../selftests/bpf/test_sk_lookup_kern.c   | 128 ++
 3 files changed, 167 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..311b7bc9e37a 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o
+   test_skb_cgroup_id_kern.o test_sk_lookup_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index 63a671803ed6..e8becca9c521 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1698,6 +1698,43 @@ static void test_task_fd_query_tp(void)
   "sys_enter_read");
 }
 
+static void test_reference_tracking()
+{
+   const char *file = "./test_sk_lookup_kern.o";
+   struct bpf_object *obj;
+   struct bpf_program *prog;
+   __u32 duration;
+   int err = 0;
+
+   obj = bpf_object__open(file);
+   if (IS_ERR(obj)) {
+   error_cnt++;
+   return;
+   }
+
+   bpf_object__for_each_program(prog, obj) {
+   const char *title;
+
+   /* Ignore .text sections */
+   title = bpf_program__title(prog, false);
+   if (strstr(title, ".text") != NULL)
+   continue;
+
+   bpf_program__set_type(prog, BPF_PROG_TYPE_SCHED_CLS);
+
+   /* Expect verifier failure if test name has 'fail' */
+   if (strstr(title, "fail") != NULL) {
+   libbpf_set_print(NULL, NULL, NULL);
+   err = !bpf_program__load(prog, "GPL", 0);
+   libbpf_set_print(printf, printf, NULL);
+   } else {
+   err = bpf_program__load(prog, "GPL", 0);
+   }
+   CHECK(err, title, "\n");
+   }
+   bpf_object__close(obj);
+}
+
 int main(void)
 {
jit_enabled = is_jit_enabled();
@@ -1719,6 +1756,7 @@ int main(void)
test_get_stack_raw_tp();
test_task_fd_query_rawtp();
test_task_fd_query_tp();
+   test_reference_tracking();
 
printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_sk_lookup_kern.c 
b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
new file mode 100644
index ..321a2299a3ac
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
@@ -0,0 +1,128 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
+
+/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
+static void *fill_ip(struct bpf_sock_tuple *tuple, void *data, __u64 nh_off,
+void *data_end, __u16 eth_proto)
+{
+   __u64 ihl_len;
+   __u8 proto;
+
+   if (eth_proto == bpf_htons(ETH_P_IP)) {
+   struct iphdr *iph = (struct iphdr *)(data + nh_off);
+
+   if (iph + 1 > data_end)
+   return NULL;
+   ihl_len = iph->ihl * 4;
+   proto = iph->protocol;
+
+   tuple->family = AF_INET;
+   tuple->saddr.ipv4 = iph->saddr;
+   tuple->daddr.ipv4 = iph->daddr;
+   } else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
+   struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
+
+   if (ip6h + 1 > data_end)
+   return NULL;
+   ihl_len = sizeof(*ip6h);
+   proto = ip6h->nexthdr;
+
+   tuple->family = AF_INET6;
+   *((struct in6_addr *)&tuple->saddr.ipv6) = ip6h->saddr;
+   *((struct in6_addr *)&tuple->daddr.ipv6) = ip6h->daddr;
+   }
+
+   if (proto != IPPRO

[PATCH bpf-next 09/11] libbpf: Support loading individual progs

2018-09-11 Thread Joe Stringer
Allow the individual program load to be invoked. This will help with
testing, where a single ELF may contain several sections, some of which
denote subprograms that are expected to fail verification, along with
some which are expected to pass verification. By allowing programs to be
iterated and individually loaded, each program can be independently
checked against its expected verification result.

Signed-off-by: Joe Stringer 
---
 tools/lib/bpf/libbpf.c | 4 ++--
 tools/lib/bpf/libbpf.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8476da7f2720..aadf05f6bfa0 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -227,7 +227,7 @@ struct bpf_object {
 };
 #define obj_elf_valid(o)   ((o)->efile.elf)
 
-static void bpf_program__unload(struct bpf_program *prog)
+void bpf_program__unload(struct bpf_program *prog)
 {
int i;
 
@@ -1375,7 +1375,7 @@ load_program(enum bpf_prog_type type, enum 
bpf_attach_type expected_attach_type,
return ret;
 }
 
-static int
+int
 bpf_program__load(struct bpf_program *prog,
  char *license, u32 kern_version)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e3b00e23e181..40e4395f1c07 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -126,10 +126,13 @@ void bpf_program__set_ifindex(struct bpf_program *prog, 
__u32 ifindex);
 
 const char *bpf_program__title(struct bpf_program *prog, bool needs_copy);
 
+int bpf_program__load(struct bpf_program *prog, char *license,
+ u32 kern_version);
 int bpf_program__fd(struct bpf_program *prog);
 int bpf_program__pin_instance(struct bpf_program *prog, const char *path,
  int instance);
 int bpf_program__pin(struct bpf_program *prog, const char *path);
+void bpf_program__unload(struct bpf_program *prog);
 
 struct bpf_insn;
 
-- 
2.17.1



[PATCH bpf-next 00/11] Add socket lookup support

2018-09-11 Thread Joe Stringer
This series proposes a new helper for the BPF API which allows BPF programs to
perform lookups for sockets in a network namespace. This would allow programs
to determine early on in processing whether the stack is expecting to receive
the packet, and perform some action (eg drop, forward somewhere) based on this
information.

The series is structured roughly into:
* Misc refactor
* Add the socket pointer type
* Add reference tracking to ensure that socket references are freed
* Extend the BPF API to add sk_lookup_xxx() / sk_release() functions
* Add tests/documentation

The helper proposed in this series includes a parameter for a tuple which must
be filled in by the caller to determine the socket to look up. The simplest
case would be to fill it with the contents of the packet, i.e. mapping the packet's
5-tuple into the parameter. In common cases, it may alternatively be useful to
reverse the direction of the tuple and perform a lookup, to find the socket
that initiates this connection; and if the BPF program ever performs a form of
IP address translation, it may further be useful to be able to look up
arbitrary tuples that are not based upon the packet, but instead based on state
held in BPF maps or hardcoded in the BPF program.
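
As a concrete illustration of the reversed-tuple case, a sketch using
the 'struct bpf_sock_tuple' layout and helper signatures proposed in
this series (fill_tuple() is a hypothetical helper that extracts the
packet's 5-tuple, much like fill_ip() in the selftests):

  struct bpf_sock_tuple fwd = {}, rev = {};
  struct bpf_sock *sk;

  if (fill_tuple(skb, &fwd) < 0)
          return TC_ACT_OK;

  /* Swap source and destination to find the socket that initiated
   * this connection rather than the one receiving it.
   */
  rev.saddr = fwd.daddr;
  rev.daddr = fwd.saddr;
  rev.sport = fwd.dport;
  rev.dport = fwd.sport;
  rev.family = fwd.family;

  sk = bpf_sk_lookup_tcp(skb, &rev, sizeof rev, 0, 0);
  if (sk)
          bpf_sk_release(sk, 0);
  return TC_ACT_OK;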

Currently, access into the socket's fields are limited to those which are
otherwise already accessible, and are restricted to read-only access.

Changes since RFC:
* Split up sk_lookup() into sk_lookup_tcp(), sk_lookup_udp().
* Only take references on the socket when necessary.
  * Make sk_release() only free the socket reference in this case.
* Fix some runtime reference leaks:
  * Disallow BPF_LD_[ABS|IND] instructions while holding a reference.
  * Disallow bpf_tail_call() while holding a reference.
* Prevent the same instruction being used for reference and other
  pointer type.
* Simplify locating copies of a reference during helper calls by caching
  the pointer id from the caller.
* Fix kbuild compilation warnings with particular configs.
* Improve code comments describing the new verifier pieces.
* Testing courtesy of Nitin
* Rebase

This tree is also available at:
https://github.com/joestringer/linux/commits/submit/sk-lookup-v1

Joe Stringer (11):
  bpf: Add iterator for spilled registers
  bpf: Simplify ptr_min_max_vals adjustment
  bpf: Generalize ptr_or_null regs check
  bpf: Add PTR_TO_SOCKET verifier type
  bpf: Macrofy stack state copy
  bpf: Add reference tracking to verifier
  bpf: Add helper to retrieve socket in BPF
  selftests/bpf: Add tests for reference tracking
  libbpf: Support loading individual progs
  selftests/bpf: Add C tests for reference tracking
  Documentation: Describe bpf reference tracking

 Documentation/networking/filter.txt   |  64 ++
 include/linux/bpf.h   |  17 +
 include/linux/bpf_verifier.h  |  37 +-
 include/uapi/linux/bpf.h  |  54 +-
 kernel/bpf/verifier.c | 599 ++
 net/core/filter.c | 175 -
 tools/include/uapi/linux/bpf.h|  54 +-
 tools/lib/bpf/libbpf.c|   4 +-
 tools/lib/bpf/libbpf.h|   3 +
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  12 +
 tools/testing/selftests/bpf/test_progs.c  |  38 ++
 .../selftests/bpf/test_sk_lookup_kern.c   | 128 
 tools/testing/selftests/bpf/test_verifier.c   | 373 ++-
 14 files changed, 1426 insertions(+), 134 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

-- 
2.17.1



[PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-11 Thread Joe Stringer
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp(), which allow BPF programs to find out whether there
is a socket listening on this host, and return a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.

By way of example, the following pseudocode would filter inbound
connections at TC ingress if there is no corresponding service listening
for the traffic:

  struct bpf_sock_tuple tuple;
  struct bpf_sock_ops *sk;

  populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (!sk) {
// Couldn't find a socket listening for this traffic. Drop.
return TC_ACT_SHOT;
  }
  bpf_sk_release(sk, 0);
  return TC_ACT_OK;

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h  |  54 +++-
 kernel/bpf/verifier.c |   8 +-
 net/core/filter.c | 145 ++
 tools/include/uapi/linux/bpf.h|  54 +++-
 tools/testing/selftests/bpf/bpf_helpers.h |  12 ++
 5 files changed, 270 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..8ed6e293113f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2141,6 +2141,41 @@ union bpf_attr {
  * request in the skb.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * struct bpf_sock_ops *bpf_sk_lookup_tcp(ctx, tuple, tuple_size, netns, flags)
+ * Description
+ * Look for TCP socket matching 'tuple'. The return value must
+ * be checked, and if non-NULL, released via bpf_sk_release().
+ * @ctx: pointer to ctx
+ * @tuple: pointer to struct bpf_sock_tuple
+ * @tuple_size: size of the tuple
+ * @netns: network namespace id
+ * @flags: flags value
+ * Return
+ * pointer to socket ops on success, or
+ * NULL in case of failure
+ *
+ * struct bpf_sock_ops *bpf_sk_lookup_udp(ctx, tuple, tuple_size, netns, flags)
+ * Description
+ * Look for UDP socket matching 'tuple'. The return value must
+ * be checked, and if non-NULL, released via bpf_sk_release().
+ * @ctx: pointer to ctx
+ * @tuple: pointer to struct bpf_sock_tuple
+ * @tuple_size: size of the tuple
+ * @netns: network namespace id
+ * @flags: flags value
+ * Return
+ * pointer to socket ops on success, or
+ * NULL in case of failure
+ *
+ *  int bpf_sk_release(sock, flags)
+ * Description
+ * Release the reference held by 'sock'.
+ * @sock: Pointer reference to release. Must be found via
+ *bpf_sk_lookup_xxx().
+ * @flags: flags value
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2226,7 +2261,10 @@ union bpf_attr {
FN(get_current_cgroup_id),  \
FN(get_local_storage),  \
FN(sk_select_reuseport),\
-   FN(skb_ancestor_cgroup_id),
+   FN(skb_ancestor_cgroup_id), \
+   FN(sk_lookup_tcp),  \
+   FN(sk_lookup_udp),  \
+   FN(sk_release),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2395,6 +2433,20 @@ struct bpf_sock {
 */
 };
 
+struct bpf_sock_tuple {
+   union {
+   __be32 ipv6[4];
+   __be32 ipv4;
+   } saddr;
+   union {
+   __be32 ipv6[4];
+   __be32 ipv4;
+   } daddr;
+   __be16 sport;
+   __be16 dport;
+   __u8 family;
+};
+
 #define XDP_PACKET_HEADROOM 256
 
 /* User return codes for XDP prog type.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 67c62ef67d37..37feedaaa1c3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -153,6 +153,12 @@ static const struct bpf_verifier_ops * const 
bpf_verifier_ops[] = {
  * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type
  * passes through a NULL-check conditional. For the branch wherein the state is
  * changed to CONST_IMM, the verifier releases the reference.
+ *
+ * For each helper function that allocates a reference, such as
+ * bpf_sk_lookup_tcp(), there is a corresponding release function, such as
+ * bpf_sk_release(). When a reference type passes into the release function,
+ * the veri

[PATCH bpf-next 06/11] bpf: Add reference tracking to verifier

2018-09-11 Thread Joe Stringer
Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.

To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state', then when the function or program exits,
verifies that all of the acquired references have been freed. When the
pointer is passed to a function that frees the reference, it is removed
from the 'bpf_func_state` and all existing copies of the pointer in
registers are marked invalid.
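
In C-level terms, the property enforced here looks roughly like the
following two independent fragments (a sketch in the style of the
pseudocode in patch 7, assuming the sk_lookup/sk_release helpers added
elsewhere in this series):

  /* Rejected: the acquired reference leaks on program exit. */
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  return TC_ACT_OK;

  /* Accepted: the reference is released on every path. */
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (sk)
          bpf_sk_release(sk, 0);
  return TC_ACT_OK;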

Signed-off-by: Joe Stringer 
---
 include/linux/bpf_verifier.h |  24 ++-
 kernel/bpf/verifier.c| 303 ---
 2 files changed, 306 insertions(+), 21 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 23a2b17bfd75..23f222e0cb0b 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -104,6 +104,17 @@ struct bpf_stack_state {
u8 slot_type[BPF_REG_SIZE];
 };
 
+struct bpf_reference_state {
+   /* Track each reference created with a unique id, even if the same
+* instruction creates the reference multiple times (eg, via CALL).
+*/
+   int id;
+   /* Instruction where the allocation of this reference occurred. This
+* is used purely to inform the user of a reference leak.
+*/
+   int insn_idx;
+};
+
 /* state of the program:
  * type of all registers and stack info
  */
@@ -121,7 +132,9 @@ struct bpf_func_state {
 */
u32 subprogno;
 
-   /* should be second to last. See copy_func_state() */
+   /* The following fields should be last. See copy_func_state() */
+   int acquired_refs;
+   struct bpf_reference_state *refs;
int allocated_stack;
struct bpf_stack_state *stack;
 };
@@ -217,11 +230,16 @@ __printf(2, 0) void bpf_verifier_vlog(struct 
bpf_verifier_log *log,
 __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
   const char *fmt, ...);
 
-static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env)
 {
struct bpf_verifier_state *cur = env->cur_state;
 
-   return cur->frame[cur->curframe]->regs;
+   return cur->frame[cur->curframe];
+}
+
+static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+{
+   return cur_func(env)->regs;
 }
 
 int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index faa83b3d7011..67c62ef67d37 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1,5 +1,6 @@
 /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
  * Copyright (c) 2016 Facebook
+ * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -140,6 +141,18 @@ static const struct bpf_verifier_ops * const 
bpf_verifier_ops[] = {
  *
  * After the call R0 is set to return type of the function and registers R1-R5
  * are set to NOT_INIT to indicate that they are no longer readable.
+ *
+ * The following reference types represent a potential reference to a kernel
+ * resource which, after first being allocated, must be checked and freed by
+ * the BPF program:
+ * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET
+ *
+ * When the verifier sees a helper call return a reference type, it allocates a
+ * pointer id for the reference and stores it in the current function state.
+ * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into
+ * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type
+ * passes through a NULL-check conditional. For the branch wherein the state is
+ * changed to CONST_IMM, the verifier releases the reference.
  */
 
 /* verifier_state + insn_idx are pushed to stack when branch is encountered */
@@ -189,6 +202,7 @@ struct bpf_call_arg_meta {
int access_size;
s64 msize_smax_value;
u64 msize_umax_value;
+   int ptr_id;
 };
 
 static DEFINE_MUTEX(bpf_verifier_lock);
@@ -251,7 +265,42 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
-   return type == PTR_TO_MAP_VALUE_OR_NULL;
+   return type == PTR_TO_MAP_VALUE_OR_NULL ||
+  type == PTR_TO_SOCKET_OR_NULL;
+}
+
+static bool type_is_refcounted(enum bpf_reg_type type)
+{
+   return type == PTR_TO_SOCKET;
+}
+
+static bool type_is_refcounted_or_null(enum bpf_reg_type type)
+{
+   return type == PTR_TO_SOCKET || type == PTR_TO_SOCKET_OR_NULL;
+}
+
+static bool reg_is_refcounted(const struct bpf_reg_state *reg)
+{
+  

[PATCH bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type

2018-09-11 Thread Joe Stringer
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.
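
By way of illustration, once the PTR_TO_SOCKET_OR_NULL returned by a
socket lookup passes a NULL check, loads through the resulting
PTR_TO_SOCKET are verified against 'struct bpf_sock' and rewritten into
accesses on the underlying 'struct sock'. A sketch in the pseudocode
style of patch 7, assuming the lookup/release helpers added elsewhere in
this series (the exact set of readable fields is governed by
bpf_sock_is_valid_access(), added below):

  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (sk) {
          if (sk->family == AF_INET)
                  verdict = TC_ACT_OK; /* e.g. act on the socket family */
          bpf_sk_release(sk, 0);
  }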

Signed-off-by: Joe Stringer 
---
 include/linux/bpf.h  |  17 +
 include/linux/bpf_verifier.h |   2 +
 kernel/bpf/verifier.c| 125 ++-
 net/core/filter.c|  30 +
 4 files changed, 147 insertions(+), 27 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..6ec93f3d66dd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -154,6 +154,7 @@ enum bpf_arg_type {
 
ARG_PTR_TO_CTX, /* pointer to context */
ARG_ANYTHING,   /* any (initialized) argument is ok */
+   ARG_PTR_TO_SOCKET,  /* pointer to bpf_sock */
 };
 
 /* type of values returned from helper functions */
@@ -162,6 +163,7 @@ enum bpf_return_type {
RET_VOID,   /* function doesn't return anything */
RET_PTR_TO_MAP_VALUE,   /* returns a pointer to map elem value 
*/
RET_PTR_TO_MAP_VALUE_OR_NULL,   /* returns a pointer to map elem value 
or NULL */
+   RET_PTR_TO_SOCKET_OR_NULL,  /* returns a pointer to a socket or 
NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF 
programs
@@ -212,6 +214,8 @@ enum bpf_reg_type {
PTR_TO_PACKET_META,  /* skb->data - meta_len */
PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
+   PTR_TO_SOCKET,   /* reg points to struct bpf_sock */
+   PTR_TO_SOCKET_OR_NULL,   /* reg points to struct bpf_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -334,6 +338,11 @@ const struct bpf_func_proto 
*bpf_get_trace_printk_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
unsigned long off, unsigned long len);
+typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
+   const struct bpf_insn *src,
+   struct bpf_insn *dst,
+   struct bpf_prog *prog,
+   u32 *target_size);
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
@@ -827,4 +836,12 @@ extern const struct bpf_func_proto 
bpf_get_local_storage_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+ struct bpf_insn_access_aux *info);
+u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
+   const struct bpf_insn *si,
+   struct bpf_insn *insn_buf,
+   struct bpf_prog *prog,
+   u32 *target_size);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index af262b97f586..23a2b17bfd75 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -58,6 +58,8 @@ struct bpf_reg_state {
 * offset, so they can share range knowledge.
 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
 * came from, when one is tested for != NULL.
+* For PTR_TO_SOCKET this is used to share which pointers retain the
+* same reference to the socket, to determine proper reference freeing.
 */
u32 id;
/* For scalar types (SCALAR_VALUE), this represents our knowledge of
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f2357c8c90de..111e031cf65d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const 
bpf_verifier_ops[] = {
  * (like pointer plus pointer becomes SCALAR_VALUE type)
  *
  * When verifier sees load or store instructions the type of base register
- * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK. These are three pointer
- * types recognized by check_mem_access() function.
+ * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK, PTR_TO_SOCKET. These are
+ * four pointer types recognized by check_mem_access() function.
  *
  * PTR_TO_MAP_VALUE means that this register is pointing to 'map element value'
  * and the range of [ptr, ptr + map's value_size) is accessible.
@@ -266,6 +266,8 @@ static const char * const reg_type_str[] = {
[PTR_TO_PACKET] = "pkt",
[PTR_TO_PACKET_META]= "pkt_meta",
[PTR_TO_PACKET_END] = "pkt_end",
+   [PTR_TO_SOCKET] = "sock",
+   [PTR_TO_SOCKET_OR_NULL] = "so

[PATCH bpf-next 11/11] Documentation: Describe bpf reference tracking

2018-09-11 Thread Joe Stringer
Signed-off-by: Joe Stringer 
---
 Documentation/networking/filter.txt | 64 +
 1 file changed, 64 insertions(+)

diff --git a/Documentation/networking/filter.txt 
b/Documentation/networking/filter.txt
index e6b4ebb2b243..4443ce958862 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1125,6 +1125,14 @@ pointer type.  The types of pointers describe their 
base, as follows:
 PTR_TO_STACKFrame pointer.
 PTR_TO_PACKET   skb->data.
 PTR_TO_PACKET_END   skb->data + headlen; arithmetic forbidden.
+PTR_TO_SOCKET   Pointer to struct bpf_sock_ops, implicitly refcounted.
+PTR_TO_SOCKET_OR_NULL
+Either a pointer to a socket, or NULL; socket lookup
+returns this type, which becomes a PTR_TO_SOCKET when
+checked != NULL. PTR_TO_SOCKET is reference-counted,
+so programs must release the reference through the
+socket release function before the end of the program.
+Arithmetic on these pointers is forbidden.
 However, a pointer may be offset from this base (as a result of pointer
 arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
 offset'.  The former is used when an exactly-known value (e.g. an immediate
@@ -1171,6 +1179,13 @@ over the Ethernet header, then reads IHL and addes (IHL 
* 4), the resulting
 pointer will have a variable offset known to be 4n+2 for some n, so adding the 
2
 bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses 
through
 that pointer are safe.
+The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
+to all copies of the pointer returned from a socket lookup. This has similar
+behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
+it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
+represents a reference to the corresponding 'struct sock'. To ensure that the
reference is not leaked, it is imperative to NULL-check the reference and,
in the non-NULL case, pass the valid reference to the socket release function.
 
 Direct packet access
 
@@ -1444,6 +1459,55 @@ Error:
   8: (7a) *(u64 *)(r0 +0) = 1
   R0 invalid mem access 'imm'
 
+Program that performs a socket lookup then sets the pointer to NULL without
+checking it:
+value:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+  BPF_MOV64_IMM(BPF_REG_0, 0),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup_tcp#65
+  8: (b7) r0 = 0
+  9: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
+Program that performs a socket lookup but does not NULL-check the returned
+value:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup_tcp#65
+  8: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
 Testing
 ---
 
-- 
2.17.1



[PATCH bpf-next 08/11] selftests/bpf: Add tests for reference tracking

2018-09-11 Thread Joe Stringer
reference tracking: leak potential reference
reference tracking: leak potential reference on stack
reference tracking: leak potential reference on stack 2
reference tracking: zero potential reference
reference tracking: copy and zero potential references
reference tracking: release reference without check
reference tracking: release reference
reference tracking: release reference twice
reference tracking: release reference twice inside branch
reference tracking: alloc, check, free in one subbranch
reference tracking: alloc, check, free in both subbranches
reference tracking in call: free reference in subprog
reference tracking in call: free reference in subprog and outside
reference tracking in call: alloc & leak reference in subprog
reference tracking in call: alloc in subprog, release outside
reference tracking in call: sk_ptr leak into caller stack
reference tracking in call: sk_ptr spill into caller stack

Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/test_verifier.c | 359 
 1 file changed, 359 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index ceb55a9f3da9..eb760ead257a 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3,6 +3,7 @@
  *
  * Copyright (c) 2014 PLUMgrid, http://plumgrid.com
  * Copyright (c) 2017 Facebook
+ * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -177,6 +178,23 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self)
self->retval = (uint32_t)res;
 }
 
+#define BPF_SK_LOOKUP  \
+   /* struct bpf_sock_tuple tuple = {} */  \
+   BPF_MOV64_IMM(BPF_REG_2, 0),\
+   BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),  \
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -16),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -24),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -32),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -40),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -48),\
+   /* sk = sk_lookup_tcp(ctx, &tuple, sizeof tuple, 0, 0) */   \
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),   \
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -48), \
+   BPF_MOV64_IMM(BPF_REG_3, 44),   \
+   BPF_MOV64_IMM(BPF_REG_4, 0),\
+   BPF_MOV64_IMM(BPF_REG_5, 0),\
+   BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp)
+
 static struct bpf_test tests[] = {
{
"add+sub+mul",
@@ -12441,6 +12459,222 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
},
+   {
+   "reference tracking: leak potential reference",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_0), /* leak reference 
*/
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+   "reference tracking: leak potential reference on stack",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_4, BPF_REG_0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+   "reference tracking: leak potential reference on stack 2",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_4, BPF_REG_0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+   "reference tracking: zero 

Re: [PATCH bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type

2018-09-13 Thread Joe Stringer
On Wed, 12 Sep 2018 at 15:50, Alexei Starovoitov
 wrote:
>
> On Tue, Sep 11, 2018 at 05:36:33PM -0700, Joe Stringer wrote:
> > ...
> > +static bool reg_type_mismatch(enum bpf_reg_type src, enum bpf_reg_type 
> > prev)
> > +{
> > + return src != prev && (!reg_type_mismatch_ok(src) ||
> > +!reg_type_mismatch_ok(prev));
> > +}
> > +
> >  static int do_check(struct bpf_verifier_env *env)
> >  {
> >   struct bpf_verifier_state *state;
> > @@ -4778,9 +4862,7 @@ static int do_check(struct bpf_verifier_env *env)
> >*/
> >   *prev_src_type = src_reg_type;
> >
> > - } else if (src_reg_type != *prev_src_type &&
> > -(src_reg_type == PTR_TO_CTX ||
> > - *prev_src_type == PTR_TO_CTX)) {
> > + } else if (reg_type_mismatch(src_reg_type, *prev_src_type)) {
> >   /* ABuser program is trying to use the same insn
> >* dst_reg = *(u32*) (src_reg + off)
> >* with different pointer types:
> > @@ -4826,8 +4908,8 @@ static int do_check(struct bpf_verifier_env *env)
> >   if (*prev_dst_type == NOT_INIT) {
> >   *prev_dst_type = dst_reg_type;
> >   } else if (dst_reg_type != *prev_dst_type &&
> > -(dst_reg_type == PTR_TO_CTX ||
> > - *prev_dst_type == PTR_TO_CTX)) {
> > +(!reg_type_mismatch_ok(dst_reg_type) ||
> > + !reg_type_mismatch_ok(*prev_dst_type))) {
>
> reg_type_mismatch() could have been used here as well ?

Missed that before, will fix.

> >   verbose(env, "same insn cannot be used with different pointers\n");
> >   return -EINVAL;
> >   }
> > @@ -5244,10 +5326,14 @@ static void sanitize_dead_code(struct bpf_verifier_env *env)
> >   }
> >  }
> >
> > -/* convert load instructions that access fields of 'struct __sk_buff'
> > - * into sequence of instructions that access fields of 'struct sk_buff'
> > +/* convert load instructions that access fields of a context type into a
> > + * sequence of instructions that access fields of the underlying structure:
> > + * struct __sk_buff -> struct sk_buff
> > + * struct bpf_sock_ops -> struct sock
> >   */
> > -static int convert_ctx_accesses(struct bpf_verifier_env *env)
> > +static int convert_ctx_accesses(struct bpf_verifier_env *env,
> > + bpf_convert_ctx_access_t convert_ctx_access,
> > + enum bpf_reg_type ctx_type)
> >  {
> >   const struct bpf_verifier_ops *ops = env->ops;
> >   int i, cnt, size, ctx_field_size, delta = 0;
> > @@ -5274,12 +5360,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >   }
> >   }
> >
> > - if (!ops->convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> > + if (!convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> >   return 0;
> >
> >   insn = env->prog->insnsi + delta;
> >
> >   for (i = 0; i < insn_cnt; i++, insn++) {
> > + enum bpf_reg_type ptr_type;
> > +
> >   if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) ||
> >   insn->code == (BPF_LDX | BPF_MEM | BPF_H) ||
> >   insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
> > @@ -5321,7 +5409,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >   continue;
> >   }
> >
> > - if (env->insn_aux_data[i + delta].ptr_type != PTR_TO_CTX)
> > + ptr_type = env->insn_aux_data[i + delta].ptr_type;
> > + if (ptr_type != ctx_type)
> >   continue;
> >
> >   ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size;
> > @@ -5354,8 +5443,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >   }
> >
> >   target_size = 0;
> > - cnt = ops->convert_ctx_access(type, insn, insn_buf, env->prog,
> >

Re: [PATCH bpf-next 06/11] bpf: Add reference tracking to verifier

2018-09-13 Thread Joe Stringer
On Wed, 12 Sep 2018 at 16:17, Alexei Starovoitov
 wrote:
>
> On Tue, Sep 11, 2018 at 05:36:35PM -0700, Joe Stringer wrote:
> > ...
> > +
> > +/* release function corresponding to acquire_reference_state(). Idempotent. */
> > +static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
> > +{
> > + int i, last_idx;
> > +
> > + if (!ptr_id)
> > + return 0;
>
> Is this defensive programming or this condition can actually happen?
> As far as I can see all callers suppose to pass valid ptr_id into it.
>
> Acked-by: Alexei Starovoitov 
>

Looks like defensive programming to me. That said, if it's being
defensive, why not return `-EFAULT`? I'll try this out locally.
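
For reference, a minimal sketch of how that could look (my illustration
only, not the final patch; the v2 changelog later in this thread notes
that the defensive check is replaced with an internal error):

  /* Sketch: release a tracked reference by id; a missing id means a
   * verifier-internal bug, so reject the program with -EFAULT.
   * Field names follow 'struct bpf_reference_state' from patch 06/11.
   */
  static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
  {
          int last_idx = state->acquired_refs - 1;
          int i;

          for (i = 0; i < state->acquired_refs; i++) {
                  if (state->refs[i].id != ptr_id)
                          continue;
                  /* Swap the match with the last entry and shrink the array. */
                  if (i != last_idx)
                          state->refs[i] = state->refs[last_idx];
                  memset(&state->refs[last_idx], 0, sizeof(*state->refs));
                  state->acquired_refs--;
                  return 0;
          }
          return -EFAULT;
  }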


Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Joe Stringer
On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
 wrote:
>
> On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
>  wrote:
> > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> >> socket listening on this host, and returns a socket pointer which the
> >> BPF program can then access to determine, for instance, whether to
> >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> >> socket, so when a BPF program makes use of this function, it must
> >> subsequently pass the returned pointer into the newly added sk_release()
> >> to return the reference.
> >>
> >> By way of example, the following pseudocode would filter inbound
> >> connections at XDP if there is no corresponding service listening for
> >> the traffic:
> >>
> >>   struct bpf_sock_tuple tuple;
> >>   struct bpf_sock_ops *sk;
> >>
> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > ...
> >> +struct bpf_sock_tuple {
> >> + union {
> >> + __be32 ipv6[4];
> >> + __be32 ipv4;
> >> + } saddr;
> >> + union {
> >> + __be32 ipv6[4];
> >> + __be32 ipv4;
> >> + } daddr;
> >> + __be16 sport;
> >> + __be16 dport;
> >> + __u8 family;
> >> +};
> >
> > since we can pass ptr_to_packet into map lookup and other helpers now,
> > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > then progs wouldn't need to copy bytes from the packet into tuple
> > to do a lookup.

If I follow, you're proposing that users should be able to pass a
pointer to the source address field of the L3 header, and assuming
that the L3 header ends with saddr+daddr (no options/extheaders), and
is immediately followed by the sport/dport then a packet pointer
should work for performing socket lookup. Then it is up to the BPF
program writer to ensure that this is the case, or otherwise fall back
to populating a copy of the sock tuple on the stack.
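
As a sketch of what that would look like from the program side (my
pseudocode; it assumes the reworked 'struct bpf_sock_tuple' that v2 of
this series ends up using, and fallback_stack_copy() is a hypothetical
helper in the program):

  /* Reuse the packet bytes as the lookup tuple when the IPv4 header
   * carries no options; otherwise fall back to a stack copy.
   */
  static int lookup_direct(struct __sk_buff *ctx, void *data,
                           void *data_end, __u64 nh_off)
  {
          struct iphdr *iph = data + nh_off;
          struct bpf_sock_tuple *tuple;
          struct bpf_sock *sk;

          if ((void *)(iph + 1) > data_end)
                  return TC_ACT_SHOT;
          if (iph->ihl != 5 || iph->protocol != IPPROTO_TCP)
                  return fallback_stack_copy(ctx);        /* hypothetical */
          /* saddr, daddr and the TCP ports are contiguous in the packet. */
          tuple = (struct bpf_sock_tuple *)&iph->saddr;
          if ((void *)tuple + sizeof(tuple->ipv4) > data_end)
                  return TC_ACT_SHOT;
          sk = bpf_sk_lookup_tcp(ctx, tuple, sizeof(tuple->ipv4), 0, 0);
          if (!sk)
                  return TC_ACT_SHOT;
          bpf_sk_release(sk, 0);
          return TC_ACT_OK;
  }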

> have been thinking more about it.
> since only ipv4 and ipv6 are supported, maybe use the size of bpf_sock_tuple
> to infer family inside the helper, so it doesn't need to be passed explicitly?

Let me make sure I understand the proposal here.

The current structure and function prototypes are:

struct bpf_sock_tuple {
  union {
  __be32 ipv6[4];
  __be32 ipv4;
  } saddr;
  union {
  __be32 ipv6[4];
  __be32 ipv4;
  } daddr;
  __be16 sport;
  __be16 dport;
  __u8 family;
};

static struct bpf_sock *(*bpf_sk_lookup_tcp)(void *ctx,
   struct bpf_sock_tuple *tuple,
   int size, unsigned int netns_id,
   unsigned long long flags);
static struct bpf_sock *(*bpf_sk_lookup_udp)(void *ctx,
   struct bpf_sock_tuple *tuple,
   int size, unsigned int netns_id,
   unsigned long long flags);
static int (*bpf_sk_release)(struct bpf_sock *sk, unsigned long long flags);

You're proposing something like:

struct bpf_sock_tuple4 {
  __be32 saddr;
  __be32 daddr;
  __be16 sport;
  __be16 dport;
  __u8 family;
};

struct bpf_sock_tuple6 {
  __be32 saddr[4];
  __be32 daddr[4];
  __be16 sport;
  __be16 dport;
  __u8 family;
};

static struct bpf_sock *(*bpf_sk_lookup_tcp)(void *ctx,
   void *tuple,
   int size, unsigned int netns_id,
   unsigned long long flags);
static struct bpf_sock *(*bpf_sk_lookup_udp)(void *ctx,
   void *tuple,
   int size, unsigned int netns_id,
   unsigned long long flags);
static int (*bpf_sk_release)(struct bpf_sock *sk, unsigned long long flags);

Then the implementation will check the size against either
"sizeof(struct bpf_sock_tuple4)" or "sizeof(struct bpf_sock_tuple6)"
and interpret as the v4 or v6 handler from this.

Sure, I can try this out.
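
For illustration, the kernel side of that check might look like the
following (sketch only; lookup_v4()/lookup_v6() are hypothetical
internals, not part of any patch here):

  static struct bpf_sock *sk_lookup(void *ctx, void *tuple, int size,
                                    unsigned int netns_id,
                                    unsigned long long flags)
  {
          if (size == sizeof(struct bpf_sock_tuple4))
                  return lookup_v4(ctx, tuple, netns_id, flags);
          if (size == sizeof(struct bpf_sock_tuple6))
                  return lookup_v6(ctx, tuple, netns_id, flags);
          return NULL;    /* unrecognized tuple size */
  }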


Re: [PATCH bpf-next 10/11] selftests/bpf: Add C tests for reference tracking

2018-09-13 Thread Joe Stringer
On Wed, 12 Sep 2018 at 17:11, Alexei Starovoitov
 wrote:
>
> On Tue, Sep 11, 2018 at 05:36:39PM -0700, Joe Stringer wrote:
> > Signed-off-by: Joe Stringer 
>
> really nice set of tests.
> please describe them briefly in commit log.
>
> Acked-by: Alexei Starovoitov 

Ack, will do.


Re: [PATCH bpf-next 11/11] Documentation: Describe bpf reference tracking

2018-09-13 Thread Joe Stringer
On Wed, 12 Sep 2018 at 17:13, Alexei Starovoitov
 wrote:
>
> On Tue, Sep 11, 2018 at 05:36:40PM -0700, Joe Stringer wrote:
> > Signed-off-by: Joe Stringer 
>
> just few words in commit log would be better than nothing.
>
> Acked-by: Alexei Starovoitov 

Ack, thanks for the review!


Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Joe Stringer
On Thu, 13 Sep 2018 at 13:55, Joe Stringer  wrote:
> struct bpf_sock_tuple4 {
>   __be32 saddr;
>   __be32 daddr;
>   __be16 sport;
>   __be16 dport;
>   __u8 family;
> };
>
> struct bpf_sock_tuple6 {
>   __be32 saddr[4];
>   __be32 daddr[4];
>   __be16 sport;
>   __be16 dport;
>   __u8 family;
> };

(ignore the family bit here, I forgot to remove it..)


Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Joe Stringer
On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
 wrote:
>
> On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> >  wrote:
> > >
> > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > >  wrote:
> > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> > > >> socket listening on this host, and returns a socket pointer which the
> > > >> BPF program can then access to determine, for instance, whether to
> > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on 
> > > >> the
> > > >> socket, so when a BPF program makes use of this function, it must
> > > >> subsequently pass the returned pointer into the newly added 
> > > >> sk_release()
> > > >> to return the reference.
> > > >>
> > > >> By way of example, the following pseudocode would filter inbound
> > > >> connections at XDP if there is no corresponding service listening for
> > > >> the traffic:
> > > >>
> > > >>   struct bpf_sock_tuple tuple;
> > > >>   struct bpf_sock_ops *sk;
> > > >>
> > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > ...
> > > >> +struct bpf_sock_tuple {
> > > >> + union {
> > > >> + __be32 ipv6[4];
> > > >> + __be32 ipv4;
> > > >> + } saddr;
> > > >> + union {
> > > >> + __be32 ipv6[4];
> > > >> + __be32 ipv4;
> > > >> + } daddr;
> > > >> + __be16 sport;
> > > >> + __be16 dport;
> > > >> + __u8 family;
> > > >> +};
> > > >
> > > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > > can you move 'family' out of bpf_sock_tuple and combine with netns_id 
> > > > arg?
> > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > to do a lookup.
> >
> > If I follow, you're proposing that users should be able to pass a
> > pointer to the source address field of the L3 header, and assuming
> > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > is immediately followed by the sport/dport then a packet pointer
> > should work for performing socket lookup. Then it is up to the BPF
> > program writer to ensure that this is the case, or otherwise fall back
> > to populating a copy of the sock tuple on the stack.
>
> yep.
>
> > > have been thinking more about it.
> > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > to infer family inside the helper, so it doesn't need to be passed 
> > > explicitly?
> >
> > Let me make sure I understand the proposal here.
> >
> > The current structure and function prototypes are:
> >
> > struct bpf_sock_tuple {
> >   union {
> >   __be32 ipv6[4];
> >   __be32 ipv4;
> >   } saddr;
> >   union {
> >   __be32 ipv6[4];
> >   __be32 ipv4;
> >   } daddr;
> >   __be16 sport;
> >   __be16 dport;
> >   __u8 family;
> > };
> ...
> > You're proposing something like:
> >
> > struct bpf_sock_tuple4 {
> >   __be32 saddr;
> >   __be32 daddr;
> >   __be16 sport;
> >   __be16 dport;
> >   __u8 family;
> > };
> >
> > struct bpf_sock_tuple6 {
> >   __be32 saddr[4];
> >   __be32 daddr[4];
> >   __be16 sport;
> >   __be16 dport;
> >   __u8 family;
> > };
>
> I think the split is unnecessary.
> I'm proposing:
> struct bpf_sock_tuple {
>   union {
>   __be32 ipv6[4];
>   __be32 ipv4;
>   } saddr;
>   union {
>   __be32 ipv6[4];
>   __be32 ipv4;
>   } daddr;
>   __be16 sport;
>   __be16 dport;
> };
>
> that points directly into the packet (when ipv4 options are not there)
> and bpf_sk_lookup_t

Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Joe Stringer
On Thu, 13 Sep 2018 at 14:22, Alexei Starovoitov
 wrote:
>
> On Thu, Sep 13, 2018 at 02:17:17PM -0700, Joe Stringer wrote:
> > On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
> >  wrote:
> > >
> > > On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > > > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> > > >  wrote:
> > > > >
> > > > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > > > >  wrote:
> > > > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there 
> > > > > >> is a
> > > > > >> socket listening on this host, and returns a socket pointer which 
> > > > > >> the
> > > > > >> BPF program can then access to determine, for instance, whether to
> > > > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference 
> > > > > >> on the
> > > > > >> socket, so when a BPF program makes use of this function, it must
> > > > > >> subsequently pass the returned pointer into the newly added 
> > > > > >> sk_release()
> > > > > >> to return the reference.
> > > > > >>
> > > > > >> By way of example, the following pseudocode would filter inbound
> > > > > >> connections at XDP if there is no corresponding service listening 
> > > > > >> for
> > > > > >> the traffic:
> > > > > >>
> > > > > >>   struct bpf_sock_tuple tuple;
> > > > > >>   struct bpf_sock_ops *sk;
> > > > > >>
> > > > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the 
> > > > > >> packet
> > > > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > > > ...
> > > > > >> +struct bpf_sock_tuple {
> > > > > >> + union {
> > > > > >> + __be32 ipv6[4];
> > > > > >> + __be32 ipv4;
> > > > > >> + } saddr;
> > > > > >> + union {
> > > > > >> + __be32 ipv6[4];
> > > > > >> + __be32 ipv4;
> > > > > >> + } daddr;
> > > > > >> + __be16 sport;
> > > > > >> + __be16 dport;
> > > > > >> + __u8 family;
> > > > > >> +};
> > > > > >
> > > > > > since we can pass ptr_to_packet into map lookup and other helpers 
> > > > > > now,
> > > > > > can you move 'family' out of bpf_sock_tuple and combine with 
> > > > > > netns_id arg?
> > > > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > > > to do a lookup.
> > > >
> > > > If I follow, you're proposing that users should be able to pass a
> > > > pointer to the source address field of the L3 header, and assuming
> > > > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > > > is immediately followed by the sport/dport then a packet pointer
> > > > should work for performing socket lookup. Then it is up to the BPF
> > > > program writer to ensure that this is the case, or otherwise fall back
> > > > to populating a copy of the sock tuple on the stack.
> > >
> > > yep.
> > >
> > > > > have been thinking more about it.
> > > > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > > > to infer family inside the helper, so it doesn't need to be passed 
> > > > > explicitly?
> > > >
> > > > Let me make sure I understand the proposal here.
> > > >
> > > > The current structure and function prototypes are:
> > > >
> > > > struct bpf_sock_tuple {
> > > >   union {
> > > >   __be32 ipv6[4];
> > > >   __be32 ipv4;
> > > >   } saddr;
> > > >   union {
> > > >   __be32 ipv6[4];
> > > >   __be32 ipv4;

[PATCHv2 bpf-next 00/11] Add socket lookup support

2018-09-21 Thread Joe Stringer
This series proposes a new helper for the BPF API which allows BPF programs to
perform lookups for sockets in a network namespace. This would allow programs
to determine early on in processing whether the stack is expecting to receive
the packet, and perform some action (eg drop, forward somewhere) based on this
information.

The series is structured roughly into:
* Misc refactor
* Add the socket pointer type
* Add reference tracking to ensure that socket references are freed
* Extend the BPF API to add sk_lookup_xxx() / sk_release() functions
* Add tests/documentation

The helper proposed in this series includes a parameter for a tuple which must
be filled in by the caller to determine the socket to look up. The simplest
case would be filling with the contents of the packet, ie mapping the packet's
5-tuple into the parameter. In common cases, it may alternatively be useful to
reverse the direction of the tuple and perform a lookup, to find the socket
that initiates this connection; and if the BPF program ever performs a form of
IP address translation, it may further be useful to be able to look up
arbitrary tuples that are not based upon the packet, but instead based on state
held in BPF maps or hardcoded in the BPF program.
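
For instance, a program could reverse the packet's tuple before the
lookup (an illustrative sketch of mine, using the IPv4 member of the
final 'struct bpf_sock_tuple'):

  /* Build the key for the socket that initiated this connection by
   * swapping source and destination in the lookup tuple.
   */
  static inline void reverse_tuple4(const struct bpf_sock_tuple *in,
                                    struct bpf_sock_tuple *out)
  {
          out->ipv4.saddr = in->ipv4.daddr;
          out->ipv4.daddr = in->ipv4.saddr;
          out->ipv4.sport = in->ipv4.dport;
          out->ipv4.dport = in->ipv4.sport;
  }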

Currently, access to the socket's fields is limited to those which are
otherwise already accessible, and is restricted to read-only access.

Changes since v1:
* Limit netns_id field to 32 bits
* Reuse reg_type_mismatch() in more places
* Reduce the number of passes at convert_ctx_access()
* Replace ptr_id defensive coding when releasing reference state with an
  internal error (-EFAULT)
* Rework 'struct bpf_sock_tuple' to allow passing a packet pointer
* Allow direct packet access from helper
* Fix compile error with CONFIG_IPV6 enabled
* Improve commit messages

Changes since RFC:
* Split up sk_lookup() into sk_lookup_tcp(), sk_lookup_udp().
* Only take references on the socket when necessary.
  * Make sk_release() only free the socket reference in this case.
* Fix some runtime reference leaks:
  * Disallow BPF_LD_[ABS|IND] instructions while holding a reference.
  * Disallow bpf_tail_call() while holding a reference.
* Prevent the same instruction being used for reference and other
  pointer type.
* Simplify locating copies of a reference during helper calls by caching
  the pointer id from the caller.
* Fix kbuild compilation warnings with particular configs.
* Improve code comments describing the new verifier pieces.
* Testing courtesy of Nitin
* Rebase

This tree is also available at:
https://github.com/joestringer/linux/commits/submit/sk-lookup-v2

Joe Stringer (11):
  bpf: Add iterator for spilled registers
  bpf: Simplify ptr_min_max_vals adjustment
  bpf: Generalize ptr_or_null regs check
  bpf: Add PTR_TO_SOCKET verifier type
  bpf: Macrofy stack state copy
  bpf: Add reference tracking to verifier
  bpf: Add helper to retrieve socket in BPF
  selftests/bpf: Add tests for reference tracking
  libbpf: Support loading individual progs
  selftests/bpf: Add C tests for reference tracking
  Documentation: Describe bpf reference tracking

 Documentation/networking/filter.txt   |  64 ++
 include/linux/bpf.h   |  17 +
 include/linux/bpf_verifier.h  |  37 +-
 include/uapi/linux/bpf.h  |  57 +-
 kernel/bpf/verifier.c | 594 ++
 net/core/filter.c | 179 +-
 tools/include/uapi/linux/bpf.h|  57 +-
 tools/lib/bpf/libbpf.c|   4 +-
 tools/lib/bpf/libbpf.h|   3 +
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  12 +
 tools/testing/selftests/bpf/test_progs.c  |  38 ++
 .../selftests/bpf/test_sk_lookup_kern.c   | 137 
 tools/testing/selftests/bpf/test_verifier.c   | 373 ++-
 14 files changed, 1441 insertions(+), 133 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

-- 
2.17.1



[PATCHv2 bpf-next 03/11] bpf: Generalize ptr_or_null regs check

2018-09-21 Thread Joe Stringer
This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.
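
Sketch of where this is headed: a later patch in the series extends the
new predicate with the socket type (shown here for illustration; it
matches the diff in patch 06/11 below):

  static bool reg_type_may_be_null(enum bpf_reg_type type)
  {
          return type == PTR_TO_MAP_VALUE_OR_NULL ||
                 type == PTR_TO_SOCKET_OR_NULL;
  }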

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a889398ba43d..7dccb18ede03 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -249,6 +249,11 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
   type == PTR_TO_PACKET_META;
 }
 
+static bool reg_type_may_be_null(enum bpf_reg_type type)
+{
+   return type == PTR_TO_MAP_VALUE_OR_NULL;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
[NOT_INIT]  = "?",
@@ -3598,12 +3603,10 @@ static void reg_combine_min_max(struct bpf_reg_state *true_src,
}
 }
 
-static void mark_map_reg(struct bpf_reg_state *regs, u32 regno, u32 id,
-bool is_null)
+static void mark_ptr_or_null_reg(struct bpf_reg_state *reg, u32 id,
+bool is_null)
 {
-   struct bpf_reg_state *reg = &regs[regno];
-
-   if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
+   if (reg_type_may_be_null(reg->type) && reg->id == id) {
/* Old offset (both fixed and variable parts) should
 * have been known-zero, because we don't allow pointer
 * arithmetic on pointers that might be NULL.
@@ -3616,11 +3619,13 @@ static void mark_map_reg(struct bpf_reg_state *regs, u32 regno, u32 id,
}
if (is_null) {
reg->type = SCALAR_VALUE;
-   } else if (reg->map_ptr->inner_map_meta) {
-   reg->type = CONST_PTR_TO_MAP;
-   reg->map_ptr = reg->map_ptr->inner_map_meta;
-   } else {
-   reg->type = PTR_TO_MAP_VALUE;
+   } else if (reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
+   if (reg->map_ptr->inner_map_meta) {
+   reg->type = CONST_PTR_TO_MAP;
+   reg->map_ptr = reg->map_ptr->inner_map_meta;
+   } else {
+   reg->type = PTR_TO_MAP_VALUE;
+   }
}
/* We don't need id from this point onwards anymore, thus we
 * should better reset it, so that state pruning has chances
@@ -3633,8 +3638,8 @@ static void mark_map_reg(struct bpf_reg_state *regs, u32 regno, u32 id,
 /* The logic is similar to find_good_pkt_pointers(), both could eventually
  * be folded together at some point.
  */
-static void mark_map_regs(struct bpf_verifier_state *vstate, u32 regno,
- bool is_null)
+static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
+ bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
struct bpf_reg_state *reg, *regs = state->regs;
@@ -3642,14 +3647,14 @@ static void mark_map_regs(struct bpf_verifier_state *vstate, u32 regno,
int i, j;
 
for (i = 0; i < MAX_BPF_REG; i++)
-   mark_map_reg(regs, i, id, is_null);
+   mark_ptr_or_null_reg(&regs[i], id, is_null);
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
for_each_spilled_reg(i, state, reg) {
if (!reg)
continue;
-   mark_map_reg(&state->stack[i].spilled_ptr, 0, id, is_null);
+   mark_ptr_or_null_reg(reg, id, is_null);
}
}
 }
@@ -3851,12 +3856,14 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
/* detect if R == 0 where R is returned from bpf_map_lookup_elem() */
if (BPF_SRC(insn->code) == BPF_K &&
insn->imm == 0 && (opcode == BPF_JEQ || opcode == BPF_JNE) &&
-   dst_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-   /* Mark all identical map registers in each branch as either
+   reg_type_may_be_null(dst_reg->type)) {
+   /* Mark all identical registers in each branch as either
 * safe or unknown depending R == 0 or R != 0 conditional.
 */
-   mark_map_regs(this_branch, insn->dst_reg, opcode == BPF_JNE);
-   mark_map_regs(other_branch, insn->dst_reg, opcode == BPF_JEQ);
+   mark_ptr_or_null_regs(this_branch, insn->dst_reg,
+ opcode == BPF_JNE);
+   mark

[PATCHv2 bpf-next 01/11] bpf: Add iterator for spilled registers

2018-09-21 Thread Joe Stringer
Add this iterator for spilled registers; it concentrates the details of
how to get the current frame's spilled registers into a single macro
while clarifying the intention of the code which is calling the macro.
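
Usage sketch (this mirrors the conversions in the diff below):

  struct bpf_reg_state *reg;
  int i;

  for_each_spilled_reg(i, state, reg) {
          if (!reg)
                  continue;       /* slot does not hold a spilled register */
          if (reg_is_pkt_pointer_any(reg))
                  __mark_reg_unknown(reg);
  }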

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf_verifier.h | 11 +++
 kernel/bpf/verifier.c| 16 +++-
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index b42b60a83e19..af262b97f586 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -131,6 +131,17 @@ struct bpf_verifier_state {
u32 curframe;
 };
 
+#define __get_spilled_reg(slot, frame) \
+   (((slot < frame->allocated_stack / BPF_REG_SIZE) && \
+ (frame->stack[slot].slot_type[0] == STACK_SPILL)) \
+? &frame->stack[slot].spilled_ptr : NULL)
+
+/* Iterate over 'frame', setting 'reg' to either NULL or a spilled register. */
+#define for_each_spilled_reg(iter, frame, reg) \
+   for (iter = 0, reg = __get_spilled_reg(iter, frame);\
+iter < frame->allocated_stack / BPF_REG_SIZE;  \
+iter++, reg = __get_spilled_reg(iter, frame))
+
 /* linked list of verifier states used to prune search */
 struct bpf_verifier_state_list {
struct bpf_verifier_state state;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8ccbff4fff93..62ce45d9c558 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2250,10 +2250,9 @@ static void __clear_all_pkt_pointers(struct bpf_verifier_env *env,
if (reg_is_pkt_pointer_any(&regs[i]))
mark_reg_unknown(env, regs, i);
 
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg_is_pkt_pointer_any(reg))
__mark_reg_unknown(reg);
}
@@ -3393,10 +3392,9 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg->type == type && reg->id == dst_reg->id)
reg->range = max(reg->range, new_range);
}
@@ -3641,7 +3639,7 @@ static void mark_map_regs(struct bpf_verifier_state *vstate, u32 regno,
  bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
-   struct bpf_reg_state *regs = state->regs;
+   struct bpf_reg_state *reg, *regs = state->regs;
u32 id = regs[regno].id;
int i, j;
 
@@ -3650,8 +3648,8 @@ static void mark_map_regs(struct bpf_verifier_state *vstate, u32 regno,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
mark_map_reg(&state->stack[i].spilled_ptr, 0, id, is_null);
}
-- 
2.17.1



[PATCHv2 bpf-next 09/11] libbpf: Support loading individual progs

2018-09-21 Thread Joe Stringer
Allow the individual program load to be invoked. This will help with
testing, where a single ELF may contain several sections, some of which
denote subprograms that are expected to fail verification, along with
some which are expected to pass verification. By allowing programs to be
iterated and individually loaded, each program can be independently
checked against its expected verification result.
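
A usage sketch (this mirrors how patch 10/11 drives it from
test_progs.c; handle_failure() is a placeholder):

  /* Load each program in the object individually so that every
   * section's verification outcome can be checked on its own.
   */
  bpf_object__for_each_program(prog, obj) {
          bpf_program__set_type(prog, BPF_PROG_TYPE_SCHED_CLS);
          err = bpf_program__load(prog, "GPL", 0);
          if (err)
                  handle_failure(prog);
  }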

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 tools/lib/bpf/libbpf.c | 4 ++--
 tools/lib/bpf/libbpf.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 9ca8e0e624d8..b758883bed68 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -227,7 +227,7 @@ struct bpf_object {
 };
 #define obj_elf_valid(o)   ((o)->efile.elf)
 
-static void bpf_program__unload(struct bpf_program *prog)
+void bpf_program__unload(struct bpf_program *prog)
 {
int i;
 
@@ -1375,7 +1375,7 @@ load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
return ret;
 }
 
-static int
+int
 bpf_program__load(struct bpf_program *prog,
  char *license, u32 kern_version)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e3b00e23e181..40e4395f1c07 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -126,10 +126,13 @@ void bpf_program__set_ifindex(struct bpf_program *prog, __u32 ifindex);
 
 const char *bpf_program__title(struct bpf_program *prog, bool needs_copy);
 
+int bpf_program__load(struct bpf_program *prog, char *license,
+ u32 kern_version);
 int bpf_program__fd(struct bpf_program *prog);
 int bpf_program__pin_instance(struct bpf_program *prog, const char *path,
  int instance);
 int bpf_program__pin(struct bpf_program *prog, const char *path);
+void bpf_program__unload(struct bpf_program *prog);
 
 struct bpf_insn;
 
-- 
2.17.1



[PATCHv2 bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type

2018-09-21 Thread Joe Stringer
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.
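
Sketch of the pattern this enables, combined with the helpers from
patch 07/11 (illustrative C, not from the patch; field access is
read-only and range-checked by the verifier):

  /* r0 from the lookup is PTR_TO_SOCKET_OR_NULL; after the NULL check
   * it becomes PTR_TO_SOCKET and 'struct bpf_sock' loads are allowed.
   */
  struct bpf_sock *sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple), 0, 0);

  if (sk) {
          __u32 proto = sk->protocol;     /* permitted, read-only */
          bpf_sk_release(sk, 0);
  }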

Signed-off-by: Joe Stringer 

---

v2: Reuse reg_type_mismatch() in more places
Reduce the number of passes at convert_ctx_access()
---
 include/linux/bpf.h  |  17 +
 include/linux/bpf_verifier.h |   2 +
 kernel/bpf/verifier.c| 120 +++
 net/core/filter.c|  30 +
 4 files changed, 143 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 988a00797bcd..daeb0d343d9c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -154,6 +154,7 @@ enum bpf_arg_type {
 
ARG_PTR_TO_CTX, /* pointer to context */
ARG_ANYTHING,   /* any (initialized) argument is ok */
+   ARG_PTR_TO_SOCKET,  /* pointer to bpf_sock */
 };
 
 /* type of values returned from helper functions */
@@ -162,6 +163,7 @@ enum bpf_return_type {
RET_VOID,   /* function doesn't return anything */
RET_PTR_TO_MAP_VALUE,   /* returns a pointer to map elem value */
RET_PTR_TO_MAP_VALUE_OR_NULL,   /* returns a pointer to map elem value or NULL */
+   RET_PTR_TO_SOCKET_OR_NULL,  /* returns a pointer to a socket or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF 
programs
@@ -213,6 +215,8 @@ enum bpf_reg_type {
PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
PTR_TO_FLOW_KEYS,/* reg points to bpf_flow_keys */
+   PTR_TO_SOCKET,   /* reg points to struct bpf_sock */
+   PTR_TO_SOCKET_OR_NULL,   /* reg points to struct bpf_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -335,6 +339,11 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
unsigned long off, unsigned long len);
+typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
+   const struct bpf_insn *src,
+   struct bpf_insn *dst,
+   struct bpf_prog *prog,
+   u32 *target_size);
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
@@ -828,4 +837,12 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+ struct bpf_insn_access_aux *info);
+u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
+   const struct bpf_insn *si,
+   struct bpf_insn *insn_buf,
+   struct bpf_prog *prog,
+   u32 *target_size);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index af262b97f586..23a2b17bfd75 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -58,6 +58,8 @@ struct bpf_reg_state {
 * offset, so they can share range knowledge.
 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
 * came from, when one is tested for != NULL.
+* For PTR_TO_SOCKET this is used to share which pointers retain the
+* same reference to the socket, to determine proper reference freeing.
 */
u32 id;
/* For scalar types (SCALAR_VALUE), this represents our knowledge of
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7dccb18ede03..1fee63d82290 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
  * (like pointer plus pointer becomes SCALAR_VALUE type)
  *
  * When verifier sees load or store instructions the type of base register
- * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK. These are three pointer
- * types recognized by check_mem_access() function.
+ * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK, PTR_TO_SOCKET. These are
+ * four pointer types recognized by check_mem_access() function.
  *
  * PTR_TO_MAP_VALUE means that this register is pointing to 'map element value'
  * and the range of [ptr, ptr + map's value_size) is accessible.
@@ -267,6 +267,8 @@ static const char * const reg_type_str[] = {
[PTR_TO_PACKET_META]= "pkt_meta",
[PTR_TO_PACKET_END] = "pkt_end",
[PTR_TO_FLOW_KEYS]

[PATCHv2 bpf-next 05/11] bpf: Macrofy stack state copy

2018-09-21 Thread Joe Stringer
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros to
simplify it.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 106 --
 1 file changed, 60 insertions(+), 46 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1fee63d82290..311340360aa3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -388,60 +388,74 @@ static void print_verifier_state(struct bpf_verifier_env *env,
verbose(env, "\n");
 }
 
-static int copy_stack_state(struct bpf_func_state *dst,
-   const struct bpf_func_state *src)
-{
-   if (!src->stack)
-   return 0;
-   if (WARN_ON_ONCE(dst->allocated_stack < src->allocated_stack)) {
-   /* internal bug, make state invalid to reject the program */
-   memset(dst, 0, sizeof(*dst));
-   return -EFAULT;
-   }
-   memcpy(dst->stack, src->stack,
-  sizeof(*src->stack) * (src->allocated_stack / BPF_REG_SIZE));
-   return 0;
-}
+#define COPY_STATE_FN(NAME, COUNT, FIELD, SIZE)    \
+static int copy_##NAME##_state(struct bpf_func_state *dst, \
+  const struct bpf_func_state *src)\
+{  \
+   if (!src->FIELD)\
+   return 0;   \
+   if (WARN_ON_ONCE(dst->COUNT < src->COUNT)) {\
+   /* internal bug, make state invalid to reject the program */ \
+   memset(dst, 0, sizeof(*dst));   \
+   return -EFAULT; \
+   }   \
+   memcpy(dst->FIELD, src->FIELD,  \
+  sizeof(*src->FIELD) * (src->COUNT / SIZE));  \
+   return 0;   \
+}
+/* copy_stack_state() */
+COPY_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef COPY_STATE_FN
+
+#define REALLOC_STATE_FN(NAME, COUNT, FIELD, SIZE) \
+static int realloc_##NAME##_state(struct bpf_func_state *state, int size, \
+ bool copy_old)\
+{  \
+   u32 old_size = state->COUNT;\
+   struct bpf_##NAME##_state *new_##FIELD; \
+   int slot = size / SIZE; \
+   \
+   if (size <= old_size || !size) {\
+   if (copy_old)   \
+   return 0;   \
+   state->COUNT = slot * SIZE; \
+   if (!size && old_size) {\
+   kfree(state->FIELD);\
+   state->FIELD = NULL;\
+   }   \
+   return 0;   \
+   }   \
+   new_##FIELD = kmalloc_array(slot, sizeof(struct bpf_##NAME##_state), \
+   GFP_KERNEL);\
+   if (!new_##FIELD)   \
+   return -ENOMEM; \
+   if (copy_old) { \
+   if (state->FIELD)   \
+   memcpy(new_##FIELD, state->FIELD,   \
+  sizeof(*new_##FIELD) * (old_size / SIZE)); \
+   memset(new_##FIELD + old_size / SIZE, 0,\
+  sizeof(*new_##FIELD) * (size - old_size) / SIZE); \
+   }   \
+   state->COUNT = slot * SIZE; \
+   kfree(state->FIELD);\
+   state->FIELD = new_##FIELD; \
+   return 0;   \
+}
+/* realloc_stack_state() */
+REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef REALLOC_STATE_FN
 

[PATCHv2 bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-21 Thread Joe Stringer
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp(), which allow BPF programs to find out if there is a
socket listening on this host, and returns a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.

By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:

  struct bpf_sock_tuple tuple;
  struct bpf_sock_ops *sk;

  populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (!sk) {
// Couldn't find a socket listening for this traffic. Drop.
return TC_ACT_SHOT;
  }
  bpf_sk_release(sk, 0);
  return TC_ACT_OK;

Signed-off-by: Joe Stringer 

---

v2: Rework 'struct bpf_sock_tuple' to allow passing a packet pointer
Limit netns_id field to 32 bits
Fix compile error with CONFIG_IPV6 enabled
Allow direct packet access from helper
---
 include/uapi/linux/bpf.h  |  57 -
 kernel/bpf/verifier.c |   8 +-
 net/core/filter.c | 149 ++
 tools/include/uapi/linux/bpf.h|  57 -
 tools/testing/selftests/bpf/bpf_helpers.h |  12 ++
 5 files changed, 280 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aa5ccd2385ed..620adbb09a94 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2143,6 +2143,41 @@ union bpf_attr {
  * request in the skb.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * struct bpf_sock_ops *bpf_sk_lookup_tcp(ctx, tuple, tuple_size, netns, flags)
+ * Description
+ * Look for TCP socket matching 'tuple'. The return value must
+ * be checked, and if non-NULL, released via bpf_sk_release().
+ * @ctx: pointer to ctx
+ * @tuple: pointer to struct bpf_sock_tuple
+ * @tuple_size: size of the tuple
+ * @netns: network namespace id
+ * @flags: flags value
+ * Return
+ * pointer to socket ops on success, or
+ * NULL in case of failure
+ *
+ * struct bpf_sock_ops *bpf_sk_lookup_udp(ctx, tuple, tuple_size, netns, flags)
+ * Description
+ * Look for UDP socket matching 'tuple'. The return value must
+ * be checked, and if non-NULL, released via bpf_sk_release().
+ * @ctx: pointer to ctx
+ * @tuple: pointer to struct bpf_sock_tuple
+ * @tuple_size: size of the tuple
+ * @netns: network namespace id
+ * @flags: flags value
+ * Return
+ * pointer to socket ops on success, or
+ * NULL in case of failure
+ *
+ *  int bpf_sk_release(sock, flags)
+ * Description
+ * Release the reference held by 'sock'.
+ * @sock: Pointer reference to release. Must be found via
+ *bpf_sk_lookup_xxx().
+ * @flags: flags value
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2228,7 +2263,10 @@ union bpf_attr {
FN(get_current_cgroup_id),  \
FN(get_local_storage),  \
FN(sk_select_reuseport),\
-   FN(skb_ancestor_cgroup_id),
+   FN(skb_ancestor_cgroup_id), \
+   FN(sk_lookup_tcp),  \
+   FN(sk_lookup_udp),  \
+   FN(sk_release),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2398,6 +2436,23 @@ struct bpf_sock {
 */
 };
 
+struct bpf_sock_tuple {
+   union {
+   struct {
+   __be32 saddr;
+   __be32 daddr;
+   __be16 sport;
+   __be16 dport;
+   } ipv4;
+   struct {
+   __be32 saddr[4];
+   __be32 daddr[4];
+   __be16 sport;
+   __be16 dport;
+   } ipv6;
+   };
+};
+
 #define XDP_PACKET_HEADROOM 256
 
 /* User return codes for XDP prog type.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 16818508b225..7b7fa94aba58 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -153,6 +153,12 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
  * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET

[PATCHv2 bpf-next 02/11] bpf: Simplify ptr_min_max_vals adjustment

2018-09-21 Thread Joe Stringer
An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c   | 22 ++---
 tools/testing/selftests/bpf/test_verifier.c | 14 ++---
 2 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 62ce45d9c558..a889398ba43d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2667,20 +2667,18 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
return -EACCES;
}
 
-   if (ptr_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-   verbose(env, "R%d pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL prohibited, null-check it first\n",
-   dst);
-   return -EACCES;
-   }
-   if (ptr_reg->type == CONST_PTR_TO_MAP) {
-   verbose(env, "R%d pointer arithmetic on CONST_PTR_TO_MAP prohibited\n",
-   dst);
+   switch (ptr_reg->type) {
+   case PTR_TO_MAP_VALUE_OR_NULL:
+   verbose(env, "R%d pointer arithmetic on %s prohibited, null-check it first\n",
+   dst, reg_type_str[ptr_reg->type]);
return -EACCES;
-   }
-   if (ptr_reg->type == PTR_TO_PACKET_END) {
-   verbose(env, "R%d pointer arithmetic on PTR_TO_PACKET_END prohibited\n",
-   dst);
+   case CONST_PTR_TO_MAP:
+   case PTR_TO_PACKET_END:
+   verbose(env, "R%d pointer arithmetic on %s prohibited\n",
+   dst, reg_type_str[ptr_reg->type]);
return -EACCES;
+   default:
+   break;
}
 
/* In case of 'scalar += pointer', dst_reg inherits pointer type and id.
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 67c412d19c09..ceb55a9f3da9 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3637,7 +3637,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
@@ -4780,7 +4780,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4801,7 +4801,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4822,7 +4822,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -7137,7 +7137,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map_in_map = { 3 },
-   .errstr = "R1 pointer arithmetic on CONST_PTR_TO_MAP prohibited",
+   .errstr = "R1 pointer arithmetic on map_ptr prohibited",
.result = REJECT,
},
{
@@ -8811,7 +8811,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
@@ -8830,7 +8830,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
-- 
2.17.1



[PATCHv2 bpf-next 06/11] bpf: Add reference tracking to verifier

2018-09-21 Thread Joe Stringer
Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.

To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state', then when the function or program exits,
verifies that all of the acquired references have been freed. When the
pointer is passed to a function that frees the reference, it is removed
from the 'bpf_func_state' and all existing copies of the pointer in
registers are marked invalid.
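
For a feel of what the verifier now accepts, a balanced path in
test-style instruction form (my illustration; BPF_SK_LOOKUP is the
macro added by the selftests patch later in this series):

  BPF_SK_LOOKUP,                          /* r0 = sk_lookup_tcp(...) */
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),  /* if (r0 == NULL) skip release */
  BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),    /* arg1: the acquired socket */
  BPF_MOV64_IMM(BPF_REG_2, 0),            /* arg2: flags */
  BPF_EMIT_CALL(BPF_FUNC_sk_release),     /* reference released exactly once */
  BPF_MOV64_IMM(BPF_REG_0, 0),
  BPF_EXIT_INSN(),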

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 

---
v2: Replace ptr_id defensive coding when releasing reference state with an
internal error (-EFAULT)
---
 include/linux/bpf_verifier.h |  24 ++-
 kernel/bpf/verifier.c| 303 ---
 2 files changed, 306 insertions(+), 21 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 23a2b17bfd75..23f222e0cb0b 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -104,6 +104,17 @@ struct bpf_stack_state {
u8 slot_type[BPF_REG_SIZE];
 };
 
+struct bpf_reference_state {
+   /* Track each reference created with a unique id, even if the same
+* instruction creates the reference multiple times (eg, via CALL).
+*/
+   int id;
+   /* Instruction where the allocation of this reference occurred. This
+* is used purely to inform the user of a reference leak.
+*/
+   int insn_idx;
+};
+
 /* state of the program:
  * type of all registers and stack info
  */
@@ -121,7 +132,9 @@ struct bpf_func_state {
 */
u32 subprogno;
 
-   /* should be second to last. See copy_func_state() */
+   /* The following fields should be last. See copy_func_state() */
+   int acquired_refs;
+   struct bpf_reference_state *refs;
int allocated_stack;
struct bpf_stack_state *stack;
 };
@@ -217,11 +230,16 @@ __printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log,
 __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
   const char *fmt, ...);
 
-static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env)
 {
struct bpf_verifier_state *cur = env->cur_state;
 
-   return cur->frame[cur->curframe]->regs;
+   return cur->frame[cur->curframe];
+}
+
+static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+{
+   return cur_func(env)->regs;
 }
 
 int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 311340360aa3..16818508b225 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1,5 +1,6 @@
 /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
  * Copyright (c) 2016 Facebook
+ * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -140,6 +141,18 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
  *
  * After the call R0 is set to return type of the function and registers R1-R5
  * are set to NOT_INIT to indicate that they are no longer readable.
+ *
+ * The following reference types represent a potential reference to a kernel
+ * resource which, after first being allocated, must be checked and freed by
+ * the BPF program:
+ * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET
+ *
+ * When the verifier sees a helper call return a reference type, it allocates a
+ * pointer id for the reference and stores it in the current function state.
+ * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into
+ * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type
+ * passes through a NULL-check conditional. For the branch wherein the state is
+ * changed to CONST_IMM, the verifier releases the reference.
  */
 
 /* verifier_state + insn_idx are pushed to stack when branch is encountered */
@@ -189,6 +202,7 @@ struct bpf_call_arg_meta {
int access_size;
s64 msize_smax_value;
u64 msize_umax_value;
+   int ptr_id;
 };
 
 static DEFINE_MUTEX(bpf_verifier_lock);
@@ -251,7 +265,42 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
-   return type == PTR_TO_MAP_VALUE_OR_NULL;
+   return type == PTR_TO_MAP_VALUE_OR_NULL ||
+  type == PTR_TO_SOCKET_OR_NULL;
+}
+
+static bool type_is_refcounted(enum bpf_reg_type type)
+{
+   return type == PTR_TO_SOCKET;
+}
+
+static bool type_is_refcounted_or_null(enum bpf_reg_type type)
+{
+   

[PATCHv2 bpf-next 11/11] Documentation: Describe bpf reference tracking

2018-09-21 Thread Joe Stringer
Document the new pointer types in the verifier and how the pointer ID
tracking works to ensure that references which are taken are later
released.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 Documentation/networking/filter.txt | 64 +
 1 file changed, 64 insertions(+)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index e6b4ebb2b243..4443ce958862 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1125,6 +1133,14 @@ pointer type.  The types of pointers describe their base, as follows:
PTR_TO_STACK    Frame pointer.
 PTR_TO_PACKET   skb->data.
 PTR_TO_PACKET_END   skb->data + headlen; arithmetic forbidden.
+PTR_TO_SOCKET   Pointer to struct bpf_sock_ops, implicitly refcounted.
+PTR_TO_SOCKET_OR_NULL
+Either a pointer to a socket, or NULL; socket lookup
+returns this type, which becomes a PTR_TO_SOCKET when
+checked != NULL. PTR_TO_SOCKET is reference-counted,
+so programs must release the reference through the
+socket release function before the end of the program.
+Arithmetic on these pointers is forbidden.
 However, a pointer may be offset from this base (as a result of pointer
 arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
 offset'.  The former is used when an exactly-known value (e.g. an immediate
@@ -1171,6 +1179,13 @@ over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.
+The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
+to all copies of the pointer returned from a socket lookup. This has similar
+behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
+it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
+represents a reference to the corresponding 'struct sock'. To ensure that the
+reference is not leaked, it is imperative to NULL-check the reference and,
+in the non-NULL case, pass the valid reference to the socket release function.
 
 Direct packet access
 
@@ -1444,6 +1459,55 @@ Error:
   8: (7a) *(u64 *)(r0 +0) = 1
   R0 invalid mem access 'imm'
 
+Program that performs a socket lookup then sets the pointer to NULL without
+checking it:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+  BPF_MOV64_IMM(BPF_REG_0, 0),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup_tcp#65
+  8: (b7) r0 = 0
+  9: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
+Program that performs a socket lookup but does not NULL-check the returned
+value:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup_tcp#65
+  8: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
 Testing
 ---
 
-- 
2.17.1



[PATCHv2 bpf-next 10/11] selftests/bpf: Add C tests for reference tracking

2018-09-21 Thread Joe Stringer
Add some tests that demonstrate and test the balanced lookup/free
nature of socket lookup. Section names that start with "fail" represent
programs that are expected to fail verification; all others should
succeed.
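
For example, a leaking program following that convention might look
like this (sketch of mine; the real programs are in the diff below):

  SEC("fail_no_release_sketch") /* "fail" prefix: verifier must reject */
  int bpf_sk_lookup_leak(struct __sk_buff *skb)
  {
          struct bpf_sock_tuple tuple = {};
          struct bpf_sock *sk;

          sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple), 0, 0);
          /* Missing bpf_sk_release(sk, 0), so this must not load. */
          return sk ? TC_ACT_OK : TC_ACT_SHOT;
  }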

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  |  38 +
 .../selftests/bpf/test_sk_lookup_kern.c   | 137 ++
 3 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index fd3851d5c079..a0c9c2208aad 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o bpf_flow.o
+   test_skb_cgroup_id_kern.o bpf_flow.o test_sk_lookup_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 63a671803ed6..e8becca9c521 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1698,6 +1698,43 @@ static void test_task_fd_query_tp(void)
   "sys_enter_read");
 }
 
+static void test_reference_tracking()
+{
+   const char *file = "./test_sk_lookup_kern.o";
+   struct bpf_object *obj;
+   struct bpf_program *prog;
+   __u32 duration;
+   int err = 0;
+
+   obj = bpf_object__open(file);
+   if (IS_ERR(obj)) {
+   error_cnt++;
+   return;
+   }
+
+   bpf_object__for_each_program(prog, obj) {
+   const char *title;
+
+   /* Ignore .text sections */
+   title = bpf_program__title(prog, false);
+   if (strstr(title, ".text") != NULL)
+   continue;
+
+   bpf_program__set_type(prog, BPF_PROG_TYPE_SCHED_CLS);
+
+   /* Expect verifier failure if test name has 'fail' */
+   if (strstr(title, "fail") != NULL) {
+   libbpf_set_print(NULL, NULL, NULL);
+   err = !bpf_program__load(prog, "GPL", 0);
+   libbpf_set_print(printf, printf, NULL);
+   } else {
+   err = bpf_program__load(prog, "GPL", 0);
+   }
+   CHECK(err, title, "\n");
+   }
+   bpf_object__close(obj);
+}
+
 int main(void)
 {
jit_enabled = is_jit_enabled();
@@ -1719,6 +1756,7 @@ int main(void)
test_get_stack_raw_tp();
test_task_fd_query_rawtp();
test_task_fd_query_tp();
+   test_reference_tracking();
 
printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
new file mode 100644
index ..d59a84e80120
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/pkt_cls.h>
+#include <linux/tcp.h>
+#include <sys/socket.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
+
+/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
+static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
+   void *data_end, __u16 eth_proto,
+   bool *ipv4)
+{
+   struct bpf_sock_tuple *result;
+   __u8 proto = 0;
+   __u64 ihl_len;
+
+   if (eth_proto == bpf_htons(ETH_P_IP)) {
+   struct iphdr *iph = (struct iphdr *)(data + nh_off);
+
+   if (iph + 1 > data_end)
+   return NULL;
+   ihl_len = iph->ihl * 4;
+   proto = iph->protocol;
+   *ipv4 = true;
+   result = (struct bpf_sock_tuple *)&iph->saddr;
+   } else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
+   struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
+
+   if (ip6h + 1 > data_end)
+   return NULL;
+  

[PATCHv2 bpf-next 08/11] selftests/bpf: Add tests for reference tracking

2018-09-21 Thread Joe Stringer
reference tracking: leak potential reference
reference tracking: leak potential reference on stack
reference tracking: leak potential reference on stack 2
reference tracking: zero potential reference
reference tracking: copy and zero potential references
reference tracking: release reference without check
reference tracking: release reference
reference tracking: release reference twice
reference tracking: release reference twice inside branch
reference tracking: alloc, check, free in one subbranch
reference tracking: alloc, check, free in both subbranches
reference tracking in call: free reference in subprog
reference tracking in call: free reference in subprog and outside
reference tracking in call: alloc & leak reference in subprog
reference tracking in call: alloc in subprog, release outside
reference tracking in call: sk_ptr leak into caller stack
reference tracking in call: sk_ptr spill into caller stack

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 359 
 1 file changed, 359 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index ceb55a9f3da9..eb760ead257a 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3,6 +3,7 @@
  *
  * Copyright (c) 2014 PLUMgrid, http://plumgrid.com
  * Copyright (c) 2017 Facebook
+ * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -177,6 +178,23 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self)
self->retval = (uint32_t)res;
 }
 
+#define BPF_SK_LOOKUP  \
+   /* struct bpf_sock_tuple tuple = {} */  \
+   BPF_MOV64_IMM(BPF_REG_2, 0),\
+   BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),  \
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -16),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -24),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -32),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -40),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -48),\
+   /* sk = sk_lookup_tcp(ctx, &tuple, sizeof tuple, 0, 0) */   \
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),   \
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -48), \
+   BPF_MOV64_IMM(BPF_REG_3, 44),   \
+   BPF_MOV64_IMM(BPF_REG_4, 0),\
+   BPF_MOV64_IMM(BPF_REG_5, 0),\
+   BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp)
+
 static struct bpf_test tests[] = {
{
"add+sub+mul",
@@ -12441,6 +12459,222 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
},
+   {
+   "reference tracking: leak potential reference",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_0), /* leak reference */
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+   "reference tracking: leak potential reference on stack",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_4, BPF_REG_0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+   "reference tracking: leak potential reference on stack 2",
+   .insns = {
+   BPF_SK_LOOKUP,
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_4, BPF_REG_0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .errstr = "Unreleased reference",
+   .result = REJECT,
+   },
+   {
+

Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-19 Thread Joe Stringer
Hi David, thanks for pointing this out.

This is more of an oversight through iterations; the runtime lookup
will fail to find a socket if the netns value is greater than the
range of a u32. I think it would therefore make more sense to drop
the parameter size from u64 to u32, so that this would be validated
at load time rather than silently returning NULL because of a bad
parameter.
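
For illustration (hypothetical values, assuming the current u64
prototype of the helper):

        struct bpf_sock_tuple tuple = {};
        struct bpf_sock *sk;

        /* Loads fine, but 0x100000000 can never match a u32 netns id,
         * so the lookup silently returns NULL at runtime. With a u32
         * parameter, a bad constant like this could instead be caught
         * when the program is loaded.
         */
        sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                               0x100000000ULL, 0);
        if (sk)
                bpf_sk_release(sk);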

I'll send a patch to bpf tree.

Cheers,
Joe

On Sun, 18 Nov 2018 at 19:27, David Ahern  wrote:
>
> Hi Joe:
>
> The netns_id to the bpf_sk_lookup_{tcp,udp} functions in
> net/core/filter.c is a u64, yet the APIs in include/uapi/linux/bpf.h
> shows a u32. Is that intentional or an oversight through the iterations?
>
> David


Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-19 Thread Joe Stringer
On Mon, 19 Nov 2018 at 10:39, David Ahern  wrote:
>
> On 11/19/18 11:36 AM, Joe Stringer wrote:
> > Hi David, thanks for pointing this out.
> >
> > This is more of an oversight through iterations; the runtime lookup
> > will fail to find a socket if the netns value is greater than the
> > range of a u32. I think it would therefore make more sense to drop
> > the parameter size from u64 to u32, so that this would be validated
> > at load time rather than silently returning NULL because of a bad
> > parameter.
>
> ok. I was wondering if it was a u64 to handle nsid of 0 which as I
> understand it is a legal nsid. If you drop to u32, how do you know when
> nsid has been set?

I was operating under the assumption that 0 represents the root netns
id, and cannot be assigned to another non-root netns.

Looking at __peernet2id_alloc(), it seems to me like it attempts to
find a netns and if it cannot find one, returns 0, which then leads to
a scroll over the idr starting from 0 to INT_MAX to find a legitimate
id for the netns, so I think this is a fair assumption?


Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-19 Thread Joe Stringer
On Mon, 19 Nov 2018 at 12:29, Nicolas Dichtel  wrote:
>
> On 19/11/2018 at 20:54, David Ahern wrote:
> > On 11/19/18 12:47 PM, Joe Stringer wrote:
> >> On Mon, 19 Nov 2018 at 10:39, David Ahern  wrote:
> >>>
> >>> On 11/19/18 11:36 AM, Joe Stringer wrote:
> >>>> Hi David, thanks for pointing this out.
> >>>>
> >>>> This is more of an oversight through iterations; the runtime lookup
> >>>> will fail to find a socket if the netns value is greater than the
> >>>> range of a u32. I think it would therefore make more sense to drop
> >>>> the parameter size from u64 to u32, so that this would be validated
> >>>> at load time rather than silently returning NULL because of a bad
> >>>> parameter.
> >>>
> >>> ok. I was wondering if it was a u64 to handle nsid of 0 which as I
> >>> understand it is a legal nsid. If you drop to u32, how do you know when
> >>> nsid has been set?
> >>
> >> I was operating under the assumption that 0 represents the root netns
> >> id, and cannot be assigned to another non-root netns.
> >>
> >> Looking at __peernet2id_alloc(), it seems to me like it attempts to
> >> find a netns and if it cannot find one, returns 0, which then leads to
> >> a scroll over the idr starting from 0 to INT_MAX to find a legitimate
> >> id for the netns, so I think this is a fair assumption?
> The NET_ID_ZERO trick is used to manage nsid 0 in net_eq_idr() (idr_for_each()
> stops when the callback returns != 0).
>
> >>
> >
> > Maybe Nicolas can give a definitive answer; as I recall he added the
> > NSID option. I have not had time to walk the code. But I do recall
> > seeing an id of 0. e.g, on my dev box:
> > $ ip netns
> > vms (id: 0)
> >
> > And include/uapi/linux/net_namespace.h shows -1 as not assigned.
> Yes, 0 is a valid value and can be assigned to any netns.
> nsid are signed 32 bit values. Note that -1 (NETNSA_NSID_NOT_ASSIGNED) is used
> by the kernel to express that the nsid is not assigned. It can also be used by
> the user to let the kernel chooses a nsid.
>
> $ ip netns add foo
> $ ip netns add bar
> $ ip netns
> bar
> foo
> $ ip netns set foo 0
> $ ip netns set bar auto
> $ ip netns
> bar (id: 1)
> foo (id: 0)

OK, I'll fix this up then.


Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-19 Thread Joe Stringer
On Mon, 19 Nov 2018 at 12:54, Joe Stringer  wrote:
>
> On Mon, 19 Nov 2018 at 12:29, Nicolas Dichtel  
> wrote:
> >
> > On 19/11/2018 at 20:54, David Ahern wrote:
> > > On 11/19/18 12:47 PM, Joe Stringer wrote:
> > >> On Mon, 19 Nov 2018 at 10:39, David Ahern  wrote:
> > >>>
> > >>> On 11/19/18 11:36 AM, Joe Stringer wrote:
> > >>>> Hi David, thanks for pointing this out.
> > >>>>
> > >>>> This is more of an oversight through iterations; the runtime lookup
> > >>>> will fail to find a socket if the netns value is greater than the
> > >>>> range of a u32. I think it would therefore make more sense to drop
> > >>>> the parameter size from u64 to u32, so that this would be validated
> > >>>> at load time rather than silently returning NULL because of a bad
> > >>>> parameter.
> > >>>
> > >>> ok. I was wondering if it was a u64 to handle nsid of 0 which as I
> > >>> understand it is a legal nsid. If you drop to u32, how do you know when
> > >>> nsid has been set?
> > >>
> > >> I was operating under the assumption that 0 represents the root netns
> > >> id, and cannot be assigned to another non-root netns.
> > >>
> > >> Looking at __peernet2id_alloc(), it seems to me like it attempts to
> > >> find a netns and if it cannot find one, returns 0, which then leads to
> > >> a scroll over the idr starting from 0 to INT_MAX to find a legitimate
> > >> id for the netns, so I think this is a fair assumption?
> > The NET_ID_ZERO trick is used to manage nsid 0 in net_eq_idr() 
> > (idr_for_each()
> > stops when the callback returns != 0).
> >
> > >>
> > >
> > > Maybe Nicolas can give a definitive answer; as I recall he added the
> > > NSID option. I have not had time to walk the code. But I do recall
> > > seeing an id of 0. e.g, on my dev box:
> > > $ ip netns
> > > vms (id: 0)
> > >
> > > And include/uapi/linux/net_namespace.h shows -1 as not assigned.
> > Yes, 0 is a valid value and can be assigned to any netns.
> > nsid are signed 32 bit values. Note that -1 (NETNSA_NSID_NOT_ASSIGNED) is 
> > used
> > by the kernel to express that the nsid is not assigned. It can also be used 
> > by
> > the user to let the kernel chooses a nsid.
> >
> > $ ip netns add foo
> > $ ip netns add bar
> > $ ip netns
> > bar
> > foo
> > $ ip netns set foo 0
> > $ ip netns set bar auto
> > $ ip netns
> > bar (id: 1)
> > foo (id: 0)
>
> OK, I'll fix this up then.

Here's what I have in mind:

@@ -2221,12 +2222,13 @@ union bpf_attr {
 * **sizeof**\ (*tuple*\ **->ipv6**)
 * Look for an IPv6 socket.
 *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is **BPF_F_SK_CURRENT_NS** or greater, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is less than **BPF_F_SK_CURRENT_NS**, then it
+ * specifies the ID of the netns relative to the netns associated
+ * with the *ctx*.
 *
 * All values for *flags* are reserved for future usage, and must
 * be left at zero.
@@ -2409,6 +2411,9 @@ enum bpf_func_id {
/* BPF_FUNC_perf_event_output for sk_buff input context. */
#define BPF_F_CTXLEN_MASK  (0xfffffULL << 32)

+/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
+#define BPF_F_SK_CURRENT_NS 0x80000000 /* For netns argument */
+
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
   BPF_ADJ_ROOM_NET,

Plus adjusting all of the internal types and the helper headers to use
u32. With the highest bit used to specify that the netns should be the
current netns, all other netns IDs should be available.
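
In practice a program would then pick the namespace like so (sketch,
with sk and tuple declared as in the snippets earlier in the thread):

        /* Socket in the netns with id 3, relative to the netns of the ctx */
        sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 3, 0);

        /* Socket in the netns associated with the ctx itself */
        sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                               BPF_F_SK_CURRENT_NS, 0);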


[PATCH bpf] bpf: Support sk lookup in netns with id 0

2018-11-26 Thread Joe Stringer
David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the API for the netns to treat all u32 values with
the highest bit set (BPF_F_SK_CURRENT_NS) as a lookup in the current
netns, while any lower value (including zero) results in a lookup for
a socket in the netns corresponding to that id. As
before, if the netns with that ID does not exist, no socket will be
found.
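
The resulting dispatch in the helper looks roughly like this (an
illustrative sketch, not a hunk from the diff below):

        if (netns_id & BPF_F_SK_CURRENT_NS)     /* e.g. 0x80000000 */
                net = caller_net;
        else                                    /* 0 .. 0x7fffffff */
                net = get_net_ns_by_id(caller_net, netns_id);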

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h  | 29 +---
 net/core/filter.c | 16 -
 tools/include/uapi/linux/bpf.h| 33 ---
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 +-
 4 files changed, 55 insertions(+), 41 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..543945d520b9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2187,12 +2187,13 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is **BPF_F_SK_CURRENT_NS** or greater, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is less than **BPF_F_SK_CURRENT_NS**, then it
+ * specifies the ID of the netns relative to the netns associated
+ * with the *ctx*.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2219,12 +2220,13 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is **BPF_F_SK_CURRENT_NS** or greater, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is less than **BPF_F_SK_CURRENT_NS**, then it
+ * specifies the ID of the netns relative to the netns associated
+ * with the *ctx*.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2405,6 +2407,9 @@ enum bpf_func_id {
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
 #define BPF_F_CTXLEN_MASK  (0xfffffULL << 32)
 
+/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
+#define BPF_F_SK_CURRENT_NS 0x80000000 /* For netns field */
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
diff --git a/net/core/filter.c b/net/core/filter.c
index 9a1327eb25fa..8c8a7ad3f5e6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4882,7 +4882,7 @@ static struct sock *sk_lookup(struct net *net, struct 
bpf_sock_tuple *tuple,
  */
 static unsigned long
 bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
- u8 proto, u64 netns_id, u64 flags)
+ u8 proto, u32 netns_id, u64 flags)
 {
struct net *caller_net;
struct sock *sk = NULL;
@@ -4890,22 +4890,22 @@ bpf_sk_lookup(struct sk_buff *skb, struct 
bpf_sock_tuple *tuple, u32 len,
struct net *net;
 
family = len == sizeof(tuple->ipv4) ? AF_INET : AF_INET6;
-   if (unlikely(family == AF_UNSPEC || netns_id > U32_MAX || flags))
+   if (unlikely(family == AF_UNSPEC || flags))
goto out;
 
if (skb->dev)
caller_net = dev_net(skb->dev);
else
caller_net = sock_net(skb->sk);
-   if (netns_id) {
+   if (netns_id & BPF_F_SK_CURRENT_NS) {
+

Re: [PATCH bpf] bpf: Support sk lookup in netns with id 0

2018-11-27 Thread Joe Stringer
On Tue, 27 Nov 2018 at 06:49, Nicolas Dichtel  wrote:
>
> On 26/11/2018 at 23:08, David Ahern wrote:
> > On 11/26/18 2:27 PM, Joe Stringer wrote:
> >> @@ -2405,6 +2407,9 @@ enum bpf_func_id {
> >>  /* BPF_FUNC_perf_event_output for sk_buff input context. */
> >>  #define BPF_F_CTXLEN_MASK   (0xfffffULL << 32)
> >>
> >> +/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
> >> +#define BPF_F_SK_CURRENT_NS 0x80000000 /* For netns field */
> >> +
> >
> > I went down the nsid road because it will be needed for other use cases
> > (e.g., device lookups), and we should have a general API for network
> > namespaces. Given that, I think the _SK should be dropped from the name.

Fair point, I'll drop _SK from the name

> >
> Would it not be possible to have an s32 instead of a u32 for the coming APIs?
> It would be better to match the current netlink and kernel APIs.

Sure, I'll look into this.

I had earlier considered whether it's worth attempting to leave the
upper 32 bits of this parameter open for potential future expansion,
but at this point I'm not taking that into consideration. If anyone
has preferences or thoughts on that I'd be interested to hear them.


Re: [PATCH bpf] bpf: Support sk lookup in netns with id 0

2018-11-28 Thread Joe Stringer
On Tue, 27 Nov 2018 at 13:12, Alexei Starovoitov
 wrote:
>
> On Tue, Nov 27, 2018 at 10:01:40AM -0800, Joe Stringer wrote:
> > On Tue, 27 Nov 2018 at 06:49, Nicolas Dichtel  
> > wrote:
> > >
> > > On 26/11/2018 at 23:08, David Ahern wrote:
> > > > On 11/26/18 2:27 PM, Joe Stringer wrote:
> > > >> @@ -2405,6 +2407,9 @@ enum bpf_func_id {
> > > >>  /* BPF_FUNC_perf_event_output for sk_buff input context. */
> > > >>  #define BPF_F_CTXLEN_MASK   (0xfffffULL << 32)
> > > >>
> > > >> +/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
> > > >> +#define BPF_F_SK_CURRENT_NS 0x80000000 /* For netns field */
> > > >> +
> > > >
> > > > I went down the nsid road because it will be needed for other use cases
> > > > (e.g., device lookups), and we should have a general API for network
> > > > namespaces. Given that, I think the _SK should be dropped from the name.
> >
> > Fair point, I'll drop _SK from the name
> >
> > > >
> > > Would it not be possible to have an s32 instead of a u32 for the coming
> > > APIs?
> > > It would be better to match the current netlink and kernel APIs.
> >
> > Sure, I'll look into this.
> >
> > I had earlier considered whether it's worth attempting to leave the
> > upper 32 bits of this parameter open for potential future expansion,
> > but at this point I'm not taking that into consideration. If anyone
> > has preferences or thoughts on that I'd be interested to hear them.
>
> Can we keep u64 as an argument type and do
> if ((s32)netns_id < 0) {
>   net = caller_net;
> } else {
>   if (netns_id > S32_MAX)
> goto err;
>   net = get_net_ns_by_id(caller_net, netns_id);
> }
>
> No need for extra macro in such case and passing -1 would match the rest of 
> the kernel.
> Upper 32-bit would still be open for future expansion.

Sounds good.


[PATCHv2 bpf 2/2] bpf: Improve socket lookup reuseport documentation

2018-11-29 Thread Joe Stringer
Improve the wording around socket lookup for reuseport sockets, and
ensure that both bpf.h headers are in sync.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 4 
 tools/include/uapi/linux/bpf.h | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 38924b306e9f..b73d574356f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2203,6 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2237,6 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 465ad585c836..b73d574356f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2203,8 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2239,8 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
-- 
2.17.1



[PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-29 Thread Joe Stringer
David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the API for the netns to treat all negative s32
values as a lookup in the current netns, while any non-negative value
in the signed 32-bit integer space results in a lookup for a socket in
the netns corresponding to that id. As before, if
the netns with that ID does not exist, no socket will be found.
Furthermore, if any bits are set in the upper 32-bits, then no socket
will be found.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h  | 35 ++---
 net/core/filter.c | 11 +++---
 tools/include/uapi/linux/bpf.h| 39 ---
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 -
 5 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..38924b306e9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2170,7 +2170,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for TCP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2187,12 +2187,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2202,7 +2204,7 @@ union bpf_attr {
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for UDP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2219,12 +2221,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2405,6 +2409,9 @@ enum bpf_f

Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
On Thu, 29 Nov 2018 at 16:30, Joe Stringer  wrote:
>
> David Ahern and Nicolas Dichtel report that the handling of the netns id
> 0 is incorrect for the BPF socket lookup helpers: rather than finding
> the netns with id 0, it is resolving to the current netns. This renders
> the netns_id 0 inaccessible.
>
> To fix this, adjust the API for the netns to treat all negative s32
> values as a lookup in the current netns, while any non-negative value
> in the signed 32-bit integer space results in a lookup for a socket in
> the netns corresponding to that id. As before, if
> the netns with that ID does not exist, no socket will be found.
> Furthermore, if any bits are set in the upper 32-bits, then no socket
> will be found.

This last sentence is a little misleading; it only applies if the
highest bit in the lower 32 bits is 0.
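
To spell that out (a sketch of the check, with illustrative values):

        if ((s32)netns_id < 0) {
                /* e.g. 0x180000000ULL: upper bits are set, but bit 31
                 * of the lower word is also set, so this still selects
                 * the current netns and may find a socket.
                 */
                net = caller_net;
        } else if (netns_id > S32_MAX) {
                /* e.g. 0x100000000ULL: upper bits set, bit 31 clear,
                 * so no socket will be found.
                 */
                goto err;
        }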


Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
On Fri, 30 Nov 2018 at 14:42, Alexei Starovoitov
 wrote:
>
> On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > the netns with id 0, it is resolving to the current netns. This renders
> > the netns_id 0 inaccessible.
> >
> > To fix this, adjust the API for the netns to treat all negative s32
> > values as a lookup in the current netns, while any non-negative value
> > in the signed 32-bit integer space results in a lookup for a socket in
> > the netns corresponding to that id. As before, if
> > the netns with that ID does not exist, no socket will be found.
> > Furthermore, if any bits are set in the upper 32-bits, then no socket
> > will be found.
> >
> > Signed-off-by: Joe Stringer 
> ..
> > +/* Current network namespace */
> > +#define BPF_CURRENT_NETNS(-1L)
>
> I was about to apply it, but then noticed that the name doesn't match
> the rest of the names.
> Could you rename it to BPF_F_CURRENT_NETNS ?

I skipped the F_ part since it's not really a flag; it's a value. I
can put it back though.

> Also reword the commit log so it's less misleading.

Can do.

Cheers,
Joe

