Re: [ovs-dev] [PATCH v2 6/6] netdev-doca: Introduce doca netdev.

Eli Britstein Wed, 01 Apr 2026 02:15:27 -0700


On 12/03/2026 12:45, Eelco Chaudron wrote:

External email: Use caution opening links or attachments



On 9 Feb 2026, at 14:29, Eli Britstein wrote:

Introduce a new netdev type - "doca".
The code is placed in new files.
- ovs-doca: initialization of doca library and utility functions that
   are used currently by netdev-doca and also will be used for future
   hw-offload code.
- netdev-doca: implementation of the new netdev.

Supported ports are mlx5 ports in switch-dev mode only that with a NIC
that supports hw-steering.

The netdev has the concept of ESW manager. A representor port is
functional only if its ESW manager is attached to OVS. In case it is
not, the representor appears as functional in ovs-vsctl show, but it is
not. Upon initializing of an ESW manager port, each representor is
reconfigured to be functional, and upon destruction, they are first stopped.

Steering infrastructure:
- RX packets of all ports are steered to a common queue. This queue is
   polled using dpdk API and the packets are classified to a per-port
   memory structure.
- TX packets are marked with the target port as metadata and sent to a
   common queue. The egress pipe matches on the metadata and forwards the
   packets accordingly.

Signed-off-by: Eli Britstein<[email protected]>

Thanks for the patch, Eli! I have a few comments below.

I also understand that you plan to add GitHub Actions to at least build
DOCA and run the basic unit tests. I think running this on the same
Ubuntu version we use elsewhere should be sufficient.

Ack

In addition, I feel we should add offload test cases for DOCA, similar
to what we do with 'make check-dpdk-offloads'. Even though there is no
actual hardware offload for now, using the netdev-doca ports should
allow all system-traffic tests to pass.

For real HW-offload, there is no support yet. "system-traffic" - yousubmitted it in a separate patch. Let's discuss it there, separatelyfrom this series.

This above might also give an idea how to configure DOCA ports, but it
would be good to add some initial documentation on how to setup a simple
environment. Similar to dpdk.rst.

Ack

I did not try to compile/run this series with DOCA, but will try it later
this next week. So any hint on how to configure this on real hw would be
appriciated. I have an MT2892 Family [ConnectX-6 Dx], would this work?

   $ ethtool -i ens2f0np0
   driver: mlx5_core
   version: 5.14.0-570.33.2.el9_6.x86_64
   firmware-version: 22.45.1020 (MT_0000000528)

Yes it will work.

Cheers,

Eelco

---
  NEWS                          |    3 +
  lib/automake.mk               |    6 +
  lib/netdev-doca.c             | 4438 +++++++++++++++++++++++++++++++++
  lib/netdev-doca.h             |  238 ++
  lib/ovs-doca.c                |  757 +++++-
  lib/ovs-doca.h                |   87 +
  utilities/checkpatch_dict.txt |    2 +
  vswitchd/vswitch.xml          |   21 +-
  8 files changed, 5549 insertions(+), 3 deletions(-)
  create mode 100644 lib/netdev-doca.c
  create mode 100644 lib/netdev-doca.h

diff --git a/NEWS b/NEWS
index c3470b84e..c8206d7ef 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,8 @@
  Post-v3.7.0
  --------------------
+   - DOCA:
+     * New netev type "doca", available under "netdev" bridge.

netev -> netdev

Ack.

+       other_config:doca-init=true should be configured.

[snip]

+    NETDEV_TX_OUTER_IP_CKSUM_OFFLOAD = 1 << 10,
+    NETDEV_TX_OUTER_UDP_CKSUM_OFFLOAD = 1 << 11,
+    NETDEV_TX_GRE_TNL_TSO_OFFLOAD = 1 << 12,
+};

This file still contains quite a bit of code duplication from
netdev-dpdk.c.

I think we should embrace the fact that a netdev-doca port is a subtype
of netdev-dpdk. It might be better to share functions and possibly some
local definitions via a private include file. This could help avoid code
duplication and make the implementation easier to understand and
maintain.

We could approach this in a similar way to how netdev-afxdp is a subtype
of netdev-linux.

Ack.

+
+struct dpdk_mp {
+     struct rte_mempool *mp;

[snip]

+    struct netdev_doca *dev = netdev_doca_cast(netdev);
+    int ret = DPDK_ETH_PORT_ID_INVALID;
+
+    if (!rte_eth_dev_is_valid_port(dev->port_id) ||
+        !rte_eth_dev_is_valid_port(dev->esw_mgr_port_id)) {
+        goto out;

I would avoid the goto, and just return DPDK_ETH_PORT_ID_INVALID;

Ack

+    }
+
+    ret = dev->esw_mgr_port_id;

Here just return  dev->esw_mgr_port_id;

Ack

+out:
+    return ret;
+}
+
+static uint16_t
+netdev_doca_get_port_id(const struct netdev *netdev)

Rework same as netdev_doca_get_esw_mgr_port_id().

Ack

+{
+    struct netdev_doca *dev = netdev_doca_cast(netdev);
+    int ret = DPDK_ETH_PORT_ID_INVALID;
+
+    if (!rte_eth_dev_is_valid_port(dev->port_id)) {
+        goto out;
+    }
+
+    ret = dev->port_id;
+out:
+    return ret;
+}
+
+static bool
+netdev_doca_is_esw_mgr(const struct netdev *netdev)
+{
+    return netdev_doca_get_esw_mgr_port_id(netdev) ==
+           netdev_doca_get_port_id(netdev) &&
+           netdev_doca_get_esw_mgr_port_id(netdev) != DPDK_ETH_PORT_ID_INVALID;

How about this?

   static bool
   netdev_doca_is_esw_mgr(const struct netdev *netdev)
   {
       uint16_t esw_mgr_id = netdev_doca_get_esw_mgr_port_id(netdev);

       return esw_mgr_id == netdev_doca_get_port_id(netdev)
              && esw_mgr_id != DPDK_ETH_PORT_ID_INVALID;
   }

Ack

+}
+
+static bool
+dev_get_started(const struct netdev_doca *dev)
+{
+    bool started;
+
+    atomic_read(&dev->started, &started);
+    return started;
+}
+
+/* ======== slowpath ======== */

Not really OVS-like formatted comments.

Removed

+static int
+netdev_doca_egress_pipe_init(struct netdev *netdev)
+{
+    struct netdev_doca *dev = netdev_doca_cast(netdev);
+    struct ovs_doca_flow_match match;
+    struct doca_flow_monitor monitor;
+    struct doca_flow_fwd fwd;
+
+    memset(&match, 0, sizeof match);
+    memset(&fwd, 0, sizeof fwd);
+    memset(&monitor, 0, sizeof monitor);
+
+    /* Meta to match on is defined per entry */

All comments should end with a dot (.).

Ack

+    match.d.meta.pkt_meta = (OVS_FORCE doca_be32_t) UINT32_MAX;

match.d.meta.pkt_meta = htonl(UINT32_MAX);

Also add a new line.

Ack

+    /* Port ID to forward to is defined per entry */
+    fwd.type = DOCA_FLOW_FWD_PORT;

[snip]

+}
+
+/* ======== sys-fs ======== */

I did not review the sysfs part, as you mentioned offline you will use
DOCA APIs instead.

Changed since v2.

+static int
+get_sys(const char *prefix, const char *devname, const char *suffix,
+        char *outp, size_t maxlen)

[snip]

+
+/* ======== netdev ======== */

Odd comment, it might be more descriptive to the section.

Ack. removed.

+struct netdev_doca *
+netdev_doca_cast(const struct netdev *netdev)
+{
+    return CONTAINER_OF(netdev, struct netdev_doca, up);
+}
+
+/* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
+ *
+ * Unlike xmalloc(), this function can return NULL on failure. */
+static void *
+doca_rte_mzalloc(const char *type, size_t sz)
+{
+    return rte_zmalloc(type, sz, CACHE_LINE_SIZE);

Any reason why preferring the DPDK heap allocator over the default one?

No special reason. It was just copied from dpdk. We can move to heap ifyou prefer. I don't know why it was like this in DPDK either.

+}
+
+static struct netdev *
+netdev_doca_alloc(void)
+{
+    struct netdev_doca *dev;
+
+    dev = doca_rte_mzalloc("ovs_doca_netdev", sizeof *dev);
+    if (!dev) {
+        return NULL;
+    }
+
+    /* Upon the first port disable dpdk steering to allow doca to work. */
+    if (!atomic_count_inc(&n_doca_ports)) {
+        rte_pmd_mlx5_disable_steering();

See earlier comment on DOCA changing general DPDK settings.  How would
this affect other mlx5 ports used with DPDK / Kernel?

Answered in the other patch.

+    }
+
+    return &dev->up;
+}
+
+static void
+netdev_doca_dealloc(struct netdev *netdev)
+{
+    struct netdev_doca *dev = netdev_doca_cast(netdev);
+
+    /* Upon the last doca port going down, enable back dpdk steering. */
+    if (atomic_count_dec(&n_doca_ports) == 1) {
+        rte_pmd_mlx5_enable_steering();

See above comment.

Ditto.

+    }
+
+    rte_free(dev);

[snip]

+    if (netdev_doca_dev_open_pci(&arg->esw_key->rte_pci, &ctx->dev)) {
+        return ENODEV;
+    }

In general, OVS code has a blank line after the closing brace of an if
block.

Ack.

+    rte_pci_device_name(&arg->esw_key->rte_pci, ctx->pci_addr,
+                        sizeof ctx->pci_addr);

[snip]

+        return -1; /* Error has already been printed by the RTE function. */

This might be confusing to the end user, as the errors are reported by
DPDK and not DOCA.

Ack

+    }
+
+    if (rte_pci_addr_parse(da.name, rte_pci)) {
+        rv = -1;
+        goto out;
+    }
+
+out:
+    rte_devargs_reset(&da);
+    return rv;
+}
+
+static bool
+netdev_doca_foreach_representor(struct netdev_doca *esw_dev,

The name could be more specific, as based on the return value it also
calls netdev_request_reconfigure().  Also some comments could be added
that the callback is called under the device lock, and the return value
causes reconfigure of the netdev.

Ack. dropped the returned value.

+                                bool (*cb)(struct netdev_doca *, uint16_t))
+    OVS_REQUIRES(doca_mutex)

[snip]

+        if (rte_eth_dev_is_valid_port(dev->esw_mgr_port_id)) {
+            err = rte_eth_dev_close(dev->esw_mgr_port_id);

If the last device being closed is the ESW manager (PF) itself, will
this not result in a double closure?

Right. the last device is the ESW, so it's a dead code since it's not avalid port anymore. removing.

+            if (err) {
+                VLOG_ERR("Failed to close esw_mgr port %d: %s",
+                         dev->esw_mgr_port_id, rte_strerror(-err));
+            }
+        }
+        /* esw->cmd_fd is closed inside. */
+        if (dev_info.device) {

I'm a bit confused here.  We are doing an rte_dev_remove() on the
physical port, but we do this only on the last ESW.  Are we removing
the right device?  Should we not use
rte_eth_dev_info_get(dev->esw_mgr_port_id, ...) instead?

The last is the ESW. port_id and esw_mgr_port_id are the same. Insteadof having another get_info(), I used the already ready results.

+            err = rte_dev_remove(dev_info.device);
+            if (err) {
+                VLOG_ERR("Failed to remove device %s: %s", dev->devargs,
+                         rte_strerror(-err));
+            }
+        }
+
+        VLOG_INFO("Closing '%s'", pci_addr);

This should not be an info level.

Moved to DBG.

+        err = doca_dev_close(esw->dev);
+        if (err) {
+            VLOG_ERR("Failed to close doca dev %s. Error: %d (%s)", pci_addr,
+                     err, doca_error_get_descr(err));
+        }
+        esw->dev = NULL;
+        esw->cmd_fd = -1;
+    }
+
+    dev->esw_ctx = NULL;
+    free(pci_addr);
+}
+
+static bool
+netdev_doca_rep_stop(struct netdev_doca *dev,
+                     uint16_t esw_mgr_port_id OVS_UNUSED)
+{
+    if (!dev_get_started(dev)) {
+        return false;
+    }
+
+    netdev_doca_port_stop(&dev->up);
+    netdev_doca_dev_close(dev);
+    dev->port_id = DPDK_ETH_PORT_ID_INVALID;
+    dev->esw_mgr_port_id = DPDK_ETH_PORT_ID_INVALID;
+    dev->attached = false;
+
+    return true;
+}
+
+static int
+netdev_doca_port_stop(struct netdev *netdev)

Should also have the OVS_REQUIRES(doca_mutex) annotation?

Ack.

+{
+    struct netdev_doca *dev = netdev_doca_cast(netdev);
+    int err = 0;
+
+    if (!dev_get_started(dev)) {
+        return 0;
+    }
+
+    if (netdev_doca_get_esw_mgr_port_id(netdev) == dev->port_id &&
+        netdev_doca_foreach_representor(dev, netdev_doca_rep_stop)) {

Does this mean that if I stop a PF it will stop (remove) all VFs?
I need some time to play with the actual environment, but it lacks an
example in the docs (and I need to find time to setup compilation).
That's why I'm asking here.

Yes.

+        /* If a representor is reconfigured a result of its ESW manager
+         * change, it might not be synced in the bridge's database. Signal it

[snip]

+    struct rte_eth_stats rte_stats;
+    bool gg;

The name 'gg' does not make any sense to me.

Moved to common with dpdk. That is how it is there.

+
+    netdev_doca_get_carrier(netdev, &gg);
+    ovs_mutex_lock(&dev->mutex);
+
+    if (!dev_get_started(dev)) {
+        ovs_mutex_unlock(&dev->mutex);
+        return ENODEV;
+    }
+
+    if (rte_eth_stats_get(dev->port_id, &rte_stats)) {
+        VLOG_ERR("Can't get ETH statistics for port: %d",
+                 dev->port_id);
+        ovs_mutex_unlock(&dev->mutex);
+        return EPROTO;
+    }
+
+    /* Get length of statistics */
+    rte_xstats_len = rte_eth_xstats_get_names(dev->port_id, NULL, 0);
+    if (rte_xstats_len < 0) {
+        VLOG_WARN("Cannot get XSTATS values for port: %d",
+                  dev->port_id);
+        goto out;
+    }
+    /* Reserve memory for rte_xstats names and values */
+    rte_xstats_names = xcalloc(rte_xstats_len, sizeof *rte_xstats_names);
+    rte_xstats = xcalloc(rte_xstats_len, sizeof *rte_xstats);
+
+    /* Retreive rte_xstats names */

Retrieve

Fixed in netdev-dpdk.c

+    rte_xstats_new_len = rte_eth_xstats_get_names(dev->port_id,
+                                                  rte_xstats_names,
+                                                  rte_xstats_len);
+    if (rte_xstats_new_len != rte_xstats_len) {
+        VLOG_WARN("Cannot get XSTATS names for port: %d",
+                  dev->port_id);
+        goto out;
+    }
+    /* Retreive rte_xstats values */

Retrieve

Ditto

+    memset(rte_xstats, 0xff, sizeof *rte_xstats * rte_xstats_len);
+    rte_xstats_ret = rte_eth_xstats_get(dev->port_id, rte_xstats,

[snip]

+        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
+                       "ovs%08x%02d%05d%07u",

Will this name not collide with the ones created in netdev-dpdk?

Changed, but for the question - it might, if you have a mixedconfiguration, but those names are only for dpdk debug. they don't haveany other meaning.

+                        hash, socket_id, mtu, n_mbufs);

[snip]

+         * esw_ctx handle in dev.
+         */

The comment closing can be moved to the same line, and fixing the
double-space issue.

Changed the code. In our downstream version we may probe a representorbefore the ESW. here i changed it so it's not necessary.

+        goto out;
+    }
+
+    if (doca_rdma_bridge_get_dev_pd(dev->esw_ctx->dev, &pd)) {
+        VLOG_ERR("Could not get pd for %s", devargs);
+        rv = EINVAL;
+        goto out;
+    }

New line, see earlier comment.

Ack

+    if (dev->esw_ctx->cmd_fd == -1) {
+        dev->esw_ctx->cmd_fd = dup(pd->context->cmd_fd);
+        if (dev->esw_ctx->cmd_fd == -1) {
+            VLOG_ERR("Could not dup fd for %s. Error %s", devargs,
+                     ovs_strerror(errno));
+            rv = EBADF;
+            goto out;
+        }
+    }
+
+    ds_put_format(&rte_devargs, "%s,cmd_fd=%d,pd_handle=%u", devargs,
+                  dev->esw_ctx->cmd_fd, pd->handle);
+
+    VLOG_INFO("Probing '%s'", ds_cstr(&rte_devargs));
+    if (rte_dev_probe(ds_cstr(&rte_devargs))) {
+        rv = ENODEV;
+        goto out;
+    }
+
+out:
+    ds_destroy(&rte_devargs);
+    if (rv) {
+        netdev_doca_dev_close(dev);
+    }
+    return rv;
+}
+
+static int
+netdev_doca_port_start(struct netdev *netdev)
+{
+    struct netdev_doca *dev = netdev_doca_cast(netdev);
+    struct doca_flow_port_cfg *port_cfg;
+    const char *devargs = dev->devargs;
+    uint16_t port_id = dev->port_id;
+    struct netdev_doca_esw_ctx *esw;
+    int err;
+
+    if (!rte_eth_dev_is_valid_port(dev->esw_mgr_port_id)) {
+        VLOG_ERR("Cannot start port %d '%s', invalid proxy port", port_id,
+                 devargs);
+        return -1;

Errors should be positive (as they end up being returned by
netdev_doca_reconfigure()), and we should use a meaningful errno value,
maybe EINVAL.

Same for all -1 occurrences below.

Ack

+    }
+
+    err = doca_flow_port_cfg_create(&port_cfg);
+    if (err) {
+        VLOG_ERR("Failed to create doca flow port_cfg. Error: %d (%s)",
+                 err, doca_error_get_descr(err));
+        return -1;
+    }
+
+    if (netdev_doca_dev_probe(dev, devargs)) {
+        err = -1;
+        goto out;
+    }
+    esw = dev->esw_ctx;
+    if (!esw) {
+        err = -1;
+        goto out;
+    }
+
+    err = doca_flow_port_cfg_set_port_id(port_cfg, port_id);
+    if (err) {
+        VLOG_ERR("%s: Failed to set doca flow port_cfg port_id %d. "
+                 "Error: %d (%s)", netdev_get_name(netdev), port_id, err,
+                 doca_error_get_descr(err));
+        goto out;
+    }
+
+    if (!netdev_doca_is_esw_mgr(netdev)) {
+        VLOG_INFO("%s: Opening doca dev_rep for port_id %d",
+                  netdev_get_name(netdev), port_id);
+        err = doca_dpdk_open_dev_rep_by_port_id(port_id, esw->dev,
+                                                &dev->dev_rep);
+        if (err) {
+            VLOG_ERR("%s: Failed to open doca dev_rep for port_id %d. "
+                     "Error: %d (%s)", netdev_get_name(netdev), port_id, err,
+                     doca_error_get_descr(err));
+            goto out;
+        }
+
+        err = doca_flow_port_cfg_set_dev_rep(port_cfg, dev->dev_rep);
+        if (err) {
+            VLOG_ERR("%s: Failed to set doca flow port_cfg dev_rep. "
+                     "Error: %d (%s)", netdev_get_name(netdev), err,
+                     doca_error_get_descr(err));
+            goto out;
+        }
+    }
+
+    err = doca_flow_port_cfg_set_dev(port_cfg, esw->dev);
+    if (err) {
+        VLOG_ERR("%s: Failed to set doca flow port_cfg dev. Error: %d (%s)",
+                 netdev_get_name(netdev), err, doca_error_get_descr(err));
+        goto out;
+    }
+
+    VLOG_INFO("%s: Starting '%s', port_id=%d", netdev_get_name(netdev),
+              devargs, port_id);
+    if (dev->port_id == dev->esw_mgr_port_id) {
+        err = doca_flow_port_cfg_set_actions_mem_size(
+                port_cfg, NETDEV_DOCA_ACTIONS_MEM_SIZE);
+        if (err) {
+            VLOG_ERR("Failed set_actions_mem_size for port_id %"PRIu16
+                     ". Error: %d (%s)", dev->port_id, err,
+                     doca_error_get_descr(err));
+            goto out;
+        }
+        err = rte_eth_dev_start(dev->port_id);
+        if (err) {
+            VLOG_ERR("Failed to start dpdk port_id %"PRIu16". Error: %d (%s)",
+                     dev->port_id, err, rte_strerror(-err));
+            goto out;
+        }
+
+        err = doca_flow_port_cfg_set_nr_resources(port_cfg,
+                                                  DOCA_FLOW_RESOURCE_COUNTER,
+                                                  ovs_doca_max_counters());
+        if (err) {
+            VLOG_ERR("Failed set_nr_resources counters for port_id %"PRIu16
+                     ". Error: %d (%s)", dev->port_id, err,
+                     doca_error_get_descr(err));
+            goto out;
+        }
+    }
+    err = doca_flow_port_start(port_cfg, &dev->port);

Are these functions called here (and below) returning valid errnos?
Asking as in some places OVS DOCA error codes are returned, and they
seem to be negative.  I guess a proper run-through on error code
handling is needed.

Ack

+    if (err) {
+        VLOG_ERR("Failed to start doca flow port_id %"PRIu16". Error: %d (%s)",
+                 port_id, err, doca_error_get_descr(err));

[snip]

+        return -1;
+    }
+
+    atomic_store(&dev->started, true);
+    ovs_mutex_unlock(&dev->mutex);

Why unlock and immediately relock dev->mutex here? If there's a reason
(e.g., lock ordering, avoiding deadlock, etc. etc.), it should be documented.
Otherwise, this should be removed.

No. there was a code in between in downstream version that is notnecessary here. i'll remove it.

+    ovs_mutex_lock(&dev->mutex);
+

[snip]

+        if (esw_n_rxq < 0) {
+            err = -1;

Errors should be positive (as they end up being returned by
netdev_doca_reconfigure()), and we should use a meaningful errno value,
maybe EINVAL.

Ack

+            goto out;

[snip]

+    if (dev->devargs && dev->devargs[0]) {
+        smap_add_format(args, "dpdk-devargs", "%s", dev->devargs);

Using dpdk-devargs for DOCA ports reinforces that netdev-doca should be
a specialized type of netdev-dpdk (sharing common infrastructure).
Document this relationship in the to-be-created doca.rst documentation.

I refactored netdev-dpdk to be able to share code. However, I don't wantto say doca is a "dpdk subtype". I don't think it's correct.

The RX/TX paths are similar (for now) but later, with offloads it'scompletely different.

+    }
+

[snip]

+
+        /* Device not found in DPDK, attempt to attach it */

In OVS all comments should start with a capital and end with a dot (.).
Can you check the entire patch for this pattern?

Ack

+        err = netdev_doca_dev_probe(dev, devargs);

[snip]

+        return EAGAIN;
+    }
+

I see no ingress_policer handling here. I assume DOCA ports do not support
this. However, I feel like we need to classify this in the doca.rst
documentation.

It will be supported, but in another way, as a future feature. It's notmentioned that it's supported (as well as offloads), so I think weshould documenting only what *IS* supported.

+    esw_mgr_port_id = dev->esw_ctx->port_id;
+    port_id = dev->port_id;

[snip]

+        rte_spinlock_t tx_lock;
+    );
+};
+
+struct netdev_doca {

Based on my earlier comment on sharing code with netdev-dpdk, this could
be implemented (in a separate patch) with something like (or similar
approach, which could save ~1800 lines of duplicate code):

   /* In new lib/netdev-dpdk-private.h */
   struct netdev_dpdk_common {
       PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
           uint16_t port_id;
           bool attached;
           atomic_bool started;
           struct eth_addr hwaddr;
           int mtu;
           int socket_id;
           int max_packet_len;
           enum netdev_flags flags;
           int link_reset_cnt;
           char *devargs;
           void *tx_q;  /* dpdk_tx_queue or doca_tx_queue */
           struct rte_eth_link link;
       );

       PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
           struct ovs_mutex mutex;
           struct dpdk_mp *dpdk_mp;
       );

       PADDED_MEMBERS(CACHE_LINE_SIZE,
           struct netdev up;
           struct ovs_list list_node;
           bool rx_metadata_delivery_configured;
       );

       PADDED_MEMBERS(CACHE_LINE_SIZE,
           struct netdev_stats stats;
           void *sw_stats;  /* Type-specific sw_stats */
           rte_spinlock_t stats_lock;
       );

       PADDED_MEMBERS(CACHE_LINE_SIZE,
           /* Configuration fields */
           int requested_mtu;
           int requested_n_txq;
           int user_n_rxq;
           int requested_n_rxq;
           int requested_rxq_size;
           int requested_txq_size;
           int rxq_size;
           int txq_size;
           int requested_socket_id;
           struct rte_eth_fc_conf fc_conf;
           uint32_t hw_ol_features;
           bool requested_lsc_interrupt_mode;
           bool lsc_interrupt_mode;
           struct eth_addr requested_hwaddr;
       );

       PADDED_MEMBERS(CACHE_LINE_SIZE,
           struct rte_eth_xstat_name *rte_xstats_names;
           int rte_xstats_names_size;
           int rte_xstats_ids_size;
           uint64_t *rte_xstats_ids;
       );
   };

   /* Then in netdev-dpdk.c and netdev-doca.c */
   struct netdev_dpdk {
       struct netdev_dpdk_common common;
       /* vHost-specific fields */
       /* QoS fields */
       /* etc. */
   };

BUILD_ASSERT_DECL(offsetof(struct netdev_dpdk, common) == 0);
BUILD_ASSERT_DECL(offsetof(struct netdev_dpdk, common.up) ==
                   offsetof(struct netdev_dpdk_common, up));

   struct netdev_doca {
       struct netdev_dpdk_common common;
       /* DOCA-specific fields */
   };

BUILD_ASSERT_DECL(offsetof(struct netdev_doca, common) == 0);
BUILD_ASSERT_DECL(offsetof(struct netdev_doca, common.up) ==
                   offsetof(struct netdev_dpdk_common, up));

Ack

+    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,

[snip]

+#include "netdev-doca.h"
+#include "openvswitch/list.h"
+#include "openvswitch/vlog.h"

I normally move the openvswitch/* includes into their own section at
the bottom.

Ack

+#include "smap.h"
+#include "unixctl.h"
+#include "util.h"
+
  /* DOCA disables dpdk steering as a constructor in higher priority.
   * Set a lower priority one to enable it back. Disable it only upon using
   * doca ports.
@@ -38,9 +58,742 @@ RTE_INIT(dpdk_steering_enable)
      rte_pmd_mlx5_enable_steering();
  }

+VLOG_DEFINE_THIS_MODULE(ovs_doca);
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(600, 600);
+
+#define OVS_DOCA_SLOWPATH_COUNTERS \
+    ((NETDEV_DOCA_RSS_NUM_ENTRIES + 1) * RTE_MAX_ETHPORTS)
+
+COVERAGE_DEFINE(ovs_doca_queue_block);
+COVERAGE_DEFINE(ovs_doca_queue_empty);
+COVERAGE_DEFINE(ovs_doca_queue_none_processed);
+
+static atomic_bool doca_initialized = false;
+static unsigned int ovs_doca_max_megaflows_counters;
+static FILE *log_stream = NULL;       /* Stream for DOCA log redirection */
+static struct doca_log_backend *ovs_doca_log = NULL;
+
+static ssize_t
+ovs_doca_log_write(void *c OVS_UNUSED, const char *buf, size_t size)
+{
+    static struct vlog_rate_limit dbg_rl = VLOG_RATE_LIMIT_INIT(600, 600);
+
+    switch (doca_log_level_get_global_sdk_limit()) {

Is 'doca_log_level_get_global_sdk_limit()' the right API to get the current
messages log level?

Doca doesn't support per-message level. I changed it to parse the string.

+        case DOCA_LOG_LEVEL_TRACE:
+            VLOG_DBG_RL(&dbg_rl, "%.*s", (int) size, buf);
+            break;
+        case DOCA_LOG_LEVEL_DEBUG:
+            VLOG_DBG_RL(&dbg_rl, "%.*s", (int) size, buf);
+            break;
+        case DOCA_LOG_LEVEL_INFO:
+            VLOG_INFO_RL(&rl, "%.*s", (int) size, buf);
+            break;
+        case DOCA_LOG_LEVEL_WARNING:
+            VLOG_WARN_RL(&rl, "%.*s", (int) size, buf);
+            break;
+        case DOCA_LOG_LEVEL_ERROR:
+            VLOG_ERR_RL(&rl, "%.*s", (int) size, buf);
+            break;
+        case DOCA_LOG_LEVEL_CRIT:
+            VLOG_EMER("%.*s", (int) size, buf);
+            break;
+        default:
+            OVS_NOT_REACHED();
+    }
+
+    return size;
+}
+
+static cookie_io_functions_t ovs_doca_log_func = {
+    .write = ovs_doca_log_write,
+};
+
+static void
+ovs_doca_unixctl_mem_stream(struct unixctl_conn *conn, int argc OVS_UNUSED,
+                            const char *argv[] OVS_UNUSED, void *aux)

Maybe this can become a function in unixctl.c so it can be reused by both dpdk 
and doca.

Ack.

+{
+    void (*callback)(FILE *) = aux;
+    char *response = NULL;
+    FILE *stream;
+    size_t size;
+
+    stream = open_memstream(&response, &size);
+    if (!stream) {
+        response = xasprintf("Unable to open memstream: %s.",
+                             ovs_strerror(errno));
+        unixctl_command_reply_error(conn, response);
+        goto out;
+    }
+
+    callback(stream);
+    fclose(stream);
+    unixctl_command_reply(conn, response);
+out:
+    free(response);
+}
+
+static const char * const levels[] = {
+    [DOCA_LOG_LEVEL_CRIT]    = "critical",
+    [DOCA_LOG_LEVEL_ERROR]   = "error",
+    [DOCA_LOG_LEVEL_WARNING] = "warning",
+    [DOCA_LOG_LEVEL_INFO]    = "info",
+    [DOCA_LOG_LEVEL_DEBUG]   = "debug",
+    [DOCA_LOG_LEVEL_TRACE]   = "trace",
+};
+
+static int
+ovs_doca_parse_log_level(const char *s)
+{
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(levels); ++i) {

for (int i = 0; ...)

Ack

+        if (levels[i] && !strcmp(s, levels[i])) {
+            return i;
+        }
+    }
+    return -1;
+}
+
+static const char *
+ovs_doca_log_level_to_str(uint32_t log_level)
+{
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(levels); ++i) {

for (int i = 0; ...)

Ack

+        if (i == log_level && levels[i]) {
+            return levels[i];
+        }
+    }
+    return NULL;

Passing NULL to format-like functions is undefined behavior, so maybe just
return "UNKNOWN"?

Ack

+}
+
+static void
+ovs_doca_unixctl_log_set(struct unixctl_conn *conn, int argc,
+                         const char *argv[], void *aux OVS_UNUSED)
+{
+    int level = DOCA_LOG_LEVEL_DEBUG;
+
+    /* With no argument, level is set to 'debug'. */
+
+    if (argc == 2) {
+        const char *level_string;
+
+        level_string = argv[1];
+        level = ovs_doca_parse_log_level(level_string);
+        if (level == -1) {

I would make it level < 0 just to be safe.

Ack

+            char *err_msg;
+
+            err_msg = xasprintf("invalid log level: '%s'", level_string);

These two can be combined:

         if (level < 0) {
             char *err_msg = xasprintf("invalid log level: '%s'", level_string);

             unixctl_command_reply_error(conn, err_msg);
             free(err_msg);
             return;
         }

Ack

+            unixctl_command_reply_error(conn, err_msg);
+            free(err_msg);
+            return;
+        }
+    }
+
+    doca_log_level_set_global_sdk_limit(level);
+    unixctl_command_reply(conn, NULL);
+}
+
+static void
+ovs_doca_log_dump(FILE *stream)

   This name is not very descriptive, it makes me think we are dumping all
   log entries.  Maybe ovs_doca_log_get() might be more appropriate?

Ack

+{
+    uint32_t log_level;
+
+    log_level = doca_log_level_get_global_sdk_limit();
+    fprintf(stream, "DOCA log level is %s",
+            ovs_doca_log_level_to_str(log_level));
+}
+
+static void
+ovs_doca_destroy_defs(struct doca_flow_definitions *defs,
+                      struct doca_flow_definitions_cfg *defs_cfg)
+{
+    if (defs) {
+        doca_flow_definitions_destroy(defs);
+    }
+    if (defs_cfg) {
+        doca_flow_definitions_cfg_destroy(defs_cfg);
+    }
+}
+
+static doca_error_t
+ovs_doca_init_defs(struct doca_flow_cfg *cfg,
+                   struct doca_flow_definitions **defs,
+                   struct doca_flow_definitions_cfg **defs_cfg)
+{
+#define DEF_FIELD(str_val, struct_name, field_name) {     \
+    .str = str_val,                                       \
+    .offset = offsetof(struct struct_name, field_name),   \
+    .size = MEMBER_SIZEOF(struct struct_name, field_name) \
+}
+    struct def_field {
+        const char *str;
+        size_t offset;
+        size_t size;
+    } def_fields[] = {
+        DEF_FIELD("actions.packet.meta.mark", ovs_doca_flow_actions, mark),
+    };
+    doca_error_t result;
+
+    result = doca_flow_definitions_cfg_create(defs_cfg);
+    if (result != DOCA_SUCCESS) {
+        VLOG_ERR("Failed to create defs cfg: %d(%s)", result,
+                 doca_error_get_name(result));
+        return result;
+    }

New line maybe?

Ack

+    result = doca_flow_definitions_create(*defs_cfg, defs);
+    if (result != DOCA_SUCCESS) {
+        VLOG_ERR("definitions creation error: %d(%s)", result,
+                 doca_error_get_name(result));
+        goto out;
+    }
+
+    for (int i = 0; i < ARRAY_SIZE(def_fields); i++) {
+        result = doca_flow_definitions_add_field(*defs, def_fields[i].str,
+                                                 def_fields[i].offset,
+                                                 def_fields[i].size);
+        if (result != DOCA_SUCCESS) {
+            VLOG_ERR("definitions add field '%s' error: result=%d(%s)",
+                     def_fields[i].str, result, doca_error_get_name(result));
+            goto out;
+        }
+    }
+
+    result = doca_flow_cfg_set_definitions(cfg, *defs);
+    if (result != DOCA_SUCCESS) {
+        VLOG_ERR("Failed to set doca_flow_cfg defs: %d(%s)", result,
+                 doca_error_get_name(result));
+        goto out;
+    }
+
+out:
+    if (result) {
+        ovs_doca_destroy_defs(*defs, *defs_cfg);
+    }
+    return result;
+}
+
+static void
+ovs_doca_offload_entry_process(struct doca_flow_pipe_entry *entry,
+                               uint16_t qid,
+                               enum doca_flow_entry_status status,
+                               enum doca_flow_entry_op op,
+                               void *aux)
+{
+    static const char *status_desc[] = {
+        [DOCA_FLOW_ENTRY_STATUS_IN_PROCESS] = "in-process",
+        [DOCA_FLOW_ENTRY_STATUS_SUCCESS] = "success",
+        [DOCA_FLOW_ENTRY_STATUS_ERROR] = "failure",
+    };
+    static const char *op_desc[] = {
+        [DOCA_FLOW_ENTRY_OP_ADD] = "add",
+        [DOCA_FLOW_ENTRY_OP_DEL] = "del",
+        [DOCA_FLOW_ENTRY_OP_UPD] = "mod",
+        [DOCA_FLOW_ENTRY_OP_AGED] = "age",
+    };
+    bool error = status == DOCA_FLOW_ENTRY_STATUS_ERROR;
+    struct ovs_doca_offload_queue *queues = aux;
+
+    VLOG_RL(&rl, error ? VLL_ERR : VLL_DBG,
+            "%s: [qid:%" PRIu16 "] %s aux=%p entry %p %s",
+            __func__, qid, op_desc[op], aux, entry, status_desc[status]);

This code need bounds checking on op and status before indexing into
op_desc[] and status_desc[].

Ack

+
+    if (queues && status != DOCA_FLOW_ENTRY_STATUS_IN_PROCESS) {
+        queues[qid].n_waiting_entries--;
+    }
+}
+
+static int
+ovs_doca_init__(const struct smap *ovs_other_config)
+{
+    struct doca_flow_definitions_cfg *defs_cfg = NULL;
+    struct doca_flow_definitions *defs = NULL;
+    struct doca_flow_cfg *cfg;
+    doca_error_t err;
+
+#define RV_TEST(rv)                                           \
+    do {                                                      \
+        if (!rv) {                                            \
+            break;                                            \
+        }                                                     \
+        VLOG_ERR("Failed at %s: %d (%s)", OVS_SOURCE_LOCATOR, \
+                 rv, doca_error_get_descr(rv));               \
+        return rv;                                            \
+    } while (0)
+
+    /* Need dpdk to be initialized first. */
+    if (!dpdk_available()) {
+        struct smap other_config_dpdk;
+
+        smap_clone(&other_config_dpdk, ovs_other_config);
+        smap_replace(&other_config_dpdk, "dpdk-init", "true");
+        dpdk_init(&other_config_dpdk);
+        smap_destroy(&other_config_dpdk);

Should this warn and exit instead of silently initializing DPDK?  Since
the actual config does not have dpdk-init set to true, it's the user's
responsibility to enable DPDK first.  Enforcing initialization here
could be unexpected behavior.

Ack

+    }
+
+    rte_flow_dynf_metadata_register();

Will this in any way change normal rte_flow offload behavior?

No

+    log_stream = fopencookie(NULL, "w+", ovs_doca_log_func);
+    if (log_stream == NULL) {
+        VLOG_ERR("Can't redirect DOCA log: %s.", ovs_strerror(errno));
+    } else {
+        /* Create a logger back-end that prints to the redirected log */
+        err = doca_log_backend_create_with_file_sdk(log_stream,
+                                                    &ovs_doca_log);
+        if (err != DOCA_SUCCESS) {

Should we add a log message here?  In addition, why do we not error out
if log_stream == NULL, but we do here?

Ack

+            return EXIT_FAILURE;
+        }
+        doca_log_level_set_global_sdk_limit(DOCA_LOG_LEVEL_WARNING);
+    }
+    unixctl_command_register("doca/log-set", "{level}. level=critical/error/"
+                             "warning/info/debug", 0, 1,
+                             ovs_doca_unixctl_log_set, NULL);
+    unixctl_command_register("doca/log-get", "", 0, 0,

Should this be log-list similar to DPDK?  Or should we change DPDK,
and keep the log-list as a shadow parameter for backwards compatibility?

There is no log-list here. DPDK has per-module log level, but docadoesn't, just one global.

+                             ovs_doca_unixctl_mem_stream,
+                             ovs_doca_log_dump);
+

Should we have test cases for these two commands.

We can't run doca without HW (real/simulation).

+    /* DOCA configuration happens earlier than dpif-netdev's.
+     * To avoid reorganizing them, read the relevant item directly. */
+    ovs_doca_max_megaflows_counters =
+        smap_get_uint(ovs_other_config, "flow-limit",
+                      OVS_DOCA_MAX_MEGAFLOWS_COUNTERS);

How are changes to this value handled?

They are not. I'll add it to doc.

+
+    err = doca_flow_cfg_create(&cfg);
+    RV_TEST(err);
+    err = doca_flow_cfg_set_pipe_queues(cfg,OVS_DOCA_MAX_OFFLOAD_QUEUES);

Missing space after comma: (cfg, OVS_DOCA_MAX_OFFLOAD_QUEUES);

Ack

+    RV_TEST(err);
+    err = doca_flow_cfg_set_resource_mode(cfg, DOCA_FLOW_RESOURCE_MODE_PORT);
+    RV_TEST(err);
+    err = doca_flow_cfg_set_mode_args(cfg,
+                                      "switch"
+                                      ",hws"
+                                      ",isolated"
+                                      ",expert"
+                                      "");
+    RV_TEST(err);
+    err = doca_flow_cfg_set_queue_depth(cfg, OVS_DOCA_QUEUE_DEPTH);
+    RV_TEST(err);
+    err = doca_flow_cfg_set_cb_entry_process(cfg,
+                                             ovs_doca_offload_entry_process);
+    RV_TEST(err);
+    err = ovs_doca_init_defs(cfg, &defs, &defs_cfg);
+    RV_TEST(err);
+
+    VLOG_INFO("DOCA Enabled - initializing...");
+    err = doca_flow_init(cfg);
+    RV_TEST(err);
+    ovs_doca_destroy_defs(defs, defs_cfg);
+    err = doca_flow_cfg_destroy(cfg);
+    RV_TEST(err);
+
+    netdev_doca_register();
+    return 0;
+}
+
+static bool
+ovs_doca_available(void)
+{
+    bool available;
+
+    atomic_read_relaxed(&doca_initialized, &available);
+    return available;
+}
+
+/* Complete the queue 'qid' on the netdev's ESW until 'min_room'
+ * entries are available. If 'min_room' is 0, complete any available.
+ */
+static doca_error_t
+ovs_doca_complete_queue_n_protected(struct netdev_doca_esw_ctx *esw,
+                                    unsigned int qid, uint32_t min_room)
+{
+    struct ovs_doca_offload_queue *queue;
+    long long int timeout_ms;
+    unsigned int n_waiting;
+    doca_error_t err;
+    uint32_t room;
+    int retries;
+
+    queue = &esw->offload_queues[qid];
+    n_waiting = queue->n_waiting_entries;
+
+    if (n_waiting == 0) {
+        COVERAGE_INC(ovs_doca_queue_empty);
+        return DOCA_SUCCESS;
+    }
+
+    if (qid == AUX_QUEUE) {
+        /* 10 seconds timeout when synchronizing the AUX_QUEUE. */

10 seconds sounds like a very long time to stall the main thread (or any
thread), even for the worst case.

Ack

+        timeout_ms = time_msec() + 10 * 1000;
+    } else {
+        timeout_ms = 0;
+    }
+
+    retries = 100;
+    do {
+        unsigned int n_processed;
+
+        /* Use 'max_processed_entries' == 0 to always attempt processing
+         * the full length of the queue. */
+        err = doca_flow_entries_process(esw->esw_port, qid,
+                                        OVS_DOCA_ENTRY_PROCESS_TIMEOUT_US,
+                                        0);
+        if (err) {
+            VLOG_WARN_RL(&rl, "%s: Failed to process entries in queue "
+                         "%u: %s", netdev_get_name(esw->esw_netdev), qid,
+                         doca_error_get_descr(err));
+            return err;
+        }
+
+        n_processed = n_waiting - queue->n_waiting_entries;
+        if (min_room && n_processed == 0) {
+            COVERAGE_INC(ovs_doca_queue_none_processed);
+        }
+        n_waiting = queue->n_waiting_entries;
+
+        room = OVS_DOCA_QUEUE_DEPTH - n_waiting;
+        if (n_processed == 0 && retries-- <= 0) {
+            COVERAGE_INC(ovs_doca_queue_block);
+            break;
+        }
+
+        if (timeout_ms && time_msec() > timeout_ms) {
+            ovs_abort(0, "Timeout reached trying to complete queue %u: "
+                      "%u remaining entries", qid, n_waiting);
+        }
+    } while (err == DOCA_SUCCESS && min_room && room < min_room);
+
+    return err;
+}
+
+static doca_error_t
+ovs_doca_complete_queue_n(struct netdev_doca_esw_ctx *esw,
+                          unsigned int qid, uint32_t min_room)
+    OVS_EXCLUDED(esw->mgmt_queue_lock)
+{
+    doca_error_t err;
+
+    if (qid == AUX_QUEUE) {
+        if (qid == AUX_QUEUE) {

This looks odd...

Ack

+            ovs_mutex_lock(&esw->mgmt_queue_lock);
+            err = ovs_doca_complete_queue_n_protected(esw, qid, min_room);
+            ovs_mutex_unlock(&esw->mgmt_queue_lock);
+        } else {
+            err = ovs_doca_complete_queue_n_protected(esw, qid, min_room);
+        }
+    } else {
+        err = ovs_doca_complete_queue_n_protected(esw, qid, min_room);
+    }
+
+    return err;
+}
+
+doca_error_t
+ovs_doca_complete_queue_esw(struct netdev_doca_esw_ctx *esw,
+                            unsigned int qid,
+                            bool sync)

The sync parameter does not make sense to me.

Ack.

+    OVS_EXCLUDED(esw->mgmt_queue_lock)
+{
+    uint32_t min_room = 0;
+
+    if (esw == NULL) {
+        return DOCA_SUCCESS;
+    }
+
+    if (sync) {
+        min_room = OVS_DOCA_QUEUE_DEPTH;
+    }
+
+    return ovs_doca_complete_queue_n(esw, qid, min_room);
+}
+
+static doca_error_t
+ovs_doca_add_generic(unsigned int qid,
+                     uint32_t hash_index,
+                     struct doca_flow_pipe *pipe,
+                     enum doca_flow_pipe_type pipe_type,
+                     const struct ovs_doca_flow_match *omatch,
+                     const struct ovs_doca_flow_actions *oactions,
+                     const struct doca_flow_monitor *monitor,
+                     const struct doca_flow_fwd *fwd,
+                     uint32_t flags,
+                     struct netdev_doca_esw_ctx *esw,
+                     struct doca_flow_pipe_entry **pentry)
+    OVS_NO_THREAD_SAFETY_ANALYSIS
+{
+    const struct doca_flow_actions *actions;
+    struct ovs_doca_offload_queue *queues;
+    const struct doca_flow_match *match;
+    doca_error_t err;
+
+    ovs_assert(esw);
+    queues = esw->offload_queues;
+    if (qid == AUX_QUEUE) {
+        ovs_mutex_lock(&esw->mgmt_queue_lock);
+    }
+
+    match = omatch ? &omatch->d : NULL;
+    actions = oactions ? &oactions->d : NULL;
+
+    switch (pipe_type) {
+    case DOCA_FLOW_PIPE_BASIC:
+        err = doca_flow_pipe_basic_add_entry(qid, pipe, match, 0, actions,
+                                             monitor, fwd, flags, queues,
+                                             pentry);
+        break;
+    case DOCA_FLOW_PIPE_HASH:
+        err = doca_flow_pipe_hash_add_entry(qid, pipe, hash_index, 0, actions,
+                                            monitor, fwd, flags, queues,
+                                            pentry);
+        break;
+    case DOCA_FLOW_PIPE_CONTROL:
+    case DOCA_FLOW_PIPE_LPM:
+    case DOCA_FLOW_PIPE_CT:
+    case DOCA_FLOW_PIPE_ACL:
+    case DOCA_FLOW_PIPE_ORDERED_LIST:
+        OVS_NOT_REACHED();
+    }
+
+    if (qid == AUX_QUEUE) {
+        ovs_mutex_unlock(&esw->mgmt_queue_lock);
+    }
+    if (err == DOCA_SUCCESS) {
+        if (queues) {
+            queues[qid].n_waiting_entries++;

The counter increment needs to happen with the lock held.  Moving this
before the unlock would fix the race condition.  In addition, this
function needs a comment explaining why only the AUX_QUEUE needs a lock
(is it shared across threads while other queues are per-PMD-thread?).


qid is the local thread (unless AUX_QUEUE, in which case, locked).

For now, there is no need for this lock, even for AUX_QUEUE, as only themain thread uses it.


In the future, it may be used. I'll remove it for now.

+        } else {
+            VLOG_DBG("added entry %p without an eswitch handle", *pentry);
+        }
+    }
+    return err;
+}
+
+doca_error_t
+ovs_doca_add_entry(struct netdev *netdev,
+                   unsigned int qid,
+                   struct doca_flow_pipe *pipe,
+                   const struct ovs_doca_flow_match *match,
+                   const struct ovs_doca_flow_actions *actions,
+                   const struct doca_flow_monitor *monitor,
+                   const struct doca_flow_fwd *fwd,
+                   uint32_t flags,
+                   struct doca_flow_pipe_entry **pentry)
+{
+    struct netdev_doca *dev = netdev_doca_cast(netdev);
+    struct netdev_doca_esw_ctx *esw = dev->esw_ctx;
+    doca_error_t err;
+
+    err = ovs_doca_add_generic(qid, 0, pipe, DOCA_FLOW_PIPE_BASIC, match,
+                               actions, monitor, fwd, flags, esw, pentry);
+    if (err) {
+        VLOG_WARN_RL(&rl, "%s: Failed to create basic pipe entry. "
+                     "Error: %d (%s)", netdev_get_name(netdev), err,
+                     doca_error_get_descr(err));
+        return err;
+    }
+
+    if (DOCA_FLOW_FLAGS_IS_SET(flags, DOCA_FLOW_ENTRY_FLAGS_NO_WAIT)) {

Is the logic correct?  It looks reversed.


Yes. DOCA offloads are always async.

"wait" means that doca won't send the request to the HW yet (to enablebursts), so we can't "complete" in this case.

"no-wait" means it is sent immediately to the HW, then we want tocomplete to return to the caller as if the call was "sync".

+        err = ovs_doca_complete_queue_esw(esw, qid, true);
+        if (err != DOCA_SUCCESS) {
+            return err;
+        }
+    }
+
+    return err;
+}
+
+doca_error_t
+ovs_doca_remove_entry(struct netdev_doca_esw_ctx *esw,
+                      unsigned int qid, uint32_t flags,
+                      struct doca_flow_pipe_entry **entry)
+    OVS_EXCLUDED(esw->mgmt_queue_lock)
+{
+    doca_error_t err;
+
+    if (!*entry) {
+        return DOCA_SUCCESS;
+    }
+
+    if (qid == AUX_QUEUE) {
+        ovs_mutex_lock(&esw->mgmt_queue_lock);
+        err = doca_flow_pipe_remove_entry(qid, flags, *entry);
+        ovs_mutex_unlock(&esw->mgmt_queue_lock);
+    } else {
+        err = doca_flow_pipe_remove_entry(qid, flags, *entry);
+    }
+
+    if (err == DOCA_SUCCESS) {
+        esw->offload_queues[qid].n_waiting_entries++;

Same issue as ovs_doca_add_generic: the counter increment needs to happen
with the lock held for AUX_QUEUE.

Removed that lock.

+        if (qid == AUX_QUEUE) {
+            /* Ignore potential errors here, as even if the queue completion
+             * failed, the entry removal would still be issued. The caller

Comments should use double spaces between sentences per OVS style:
"issued.  The caller"

There are a lot of places with this problem, please fix all of them.

Ack.

+             * requires knowing so. */
+            ovs_doca_complete_queue_esw(esw, qid, true);

Could you explain why ovs_doca_complete_queue_esw() is only called for
AUX_QUEUE and not for other queue types?

Fixed.

+        }
+        *entry = NULL;
+    } else {
+        VLOG_ERR("Failed to remove entry %p qid=%d. %d (%s)", *entry, qid, err,
+                 doca_error_get_descr(err));
+    }
+
+    return err;
+}
+
+doca_error_t
+ovs_doca_pipe_cfg_allow_queues(struct doca_flow_pipe_cfg *cfg,
+                               uint64_t queues_bitmap)
+{
+    unsigned int qid;
+    doca_error_t err;

Move these two down to the for loop below.

Ack

+
+    if (!cfg) {
+        return DOCA_ERROR_INVALID_VALUE;
+    }
+
+    for (qid = 0; qid < OVS_DOCA_MAX_OFFLOAD_QUEUES; qid++) {

+    for (unsigned int qid = 0; qid < OVS_DOCA_MAX_OFFLOAD_QUEUES; qid++) {
+       doca_error_t err;

         if ((UINT64_C(1) << qid) & queues_bitmap)

+        if ((UINT64_C(1) << qid) & queues_bitmap) {
+            continue;
+        }

New line?

Ack

+        err = doca_flow_pipe_cfg_set_excluded_queue(cfg, qid);
+        if (DOCA_IS_ERROR(err)) {
+            VLOG_ERR("failed to exclude queue %u in pipe configuration: "
+                     "%d (%s)", qid, err, doca_error_get_descr(err));
+            return err;
+        }
+    }
+
+    return DOCA_SUCCESS;
+}
+
  void
-ovs_doca_init(const struct smap *ovs_other_config OVS_UNUSED)
+ovs_doca_destroy_pipe(struct doca_flow_pipe **ppipe)
+{
+    if (!ppipe || !*ppipe) {
+        return;
+    }
+    doca_flow_pipe_destroy(*ppipe);
+    *ppipe = NULL;
+}
+
+int
+ovs_doca_pipe_create(struct netdev *netdev,
+                     struct ovs_doca_flow_match *match,
+                     struct ovs_doca_flow_match *match_mask,
+                     struct doca_flow_monitor *monitor,
+                     struct ovs_doca_flow_actions *actions,
+                     struct ovs_doca_flow_actions *actions_mask,
+                     struct doca_flow_action_desc *desc,
+                     struct doca_flow_fwd *fwd,
+                     struct doca_flow_fwd *fwd_miss,
+                     uint32_t nr_entries,
+                     bool is_egress, bool is_root,
+                     uint64_t queues_bitmap,
+                     const char *pipe_str,
+                     struct doca_flow_pipe **pipe)
+{
+    struct doca_flow_actions *actions_arr[1], *actions_masks_arr[1];
+    struct netdev_doca *dev = netdev_doca_cast(netdev);
+    struct doca_flow_action_descs descs, *descs_arr[1];
+    char pipe_name[OVS_DOCA_MAX_PIPE_NAME_LEN];
+    struct doca_flow_port *doca_port;
+    struct doca_flow_pipe_cfg *cfg;
+    int ret;
+
+    ovs_assert(*pipe == NULL);
+
+    doca_port = doca_flow_port_switch_get(dev->port);

This API could return NULL, so we should check/assert.

Ack

+
+    snprintf(pipe_name, sizeof pipe_name, "%s: %s", netdev_get_name(netdev),
+             pipe_str);
+
+    ret = doca_flow_pipe_cfg_create(&cfg, doca_port);
+    if (ret) {
+        VLOG_ERR("%s: Could not create doca_flow_pipe_cfg for %s",
+                 netdev_get_name(netdev), pipe_name);

Should we print the doca error? Or will it not help?

+        return ret;
+    }
+    actions_arr[0] = actions ? &actions->d : NULL;
+    actions_masks_arr[0] = actions_mask ? &actions_mask->d : NULL;
+    descs.desc_array = desc;
+    descs.nb_action_desc = 1;

Should nb_action_desc be 0 when desc is NULL?  The code unconditionally
sets it to 1 even when desc_array might be NULL.

Note the usage below. desc_array is used only if desc is not NULL.

+    descs_arr[0] = &descs;
+
+    if (doca_flow_pipe_cfg_set_name(cfg, pipe_name) ||
+        doca_flow_pipe_cfg_set_type(cfg, DOCA_FLOW_PIPE_BASIC) ||
+        doca_flow_pipe_cfg_set_nr_entries(cfg, nr_entries) ||
+        ovs_doca_pipe_cfg_allow_queues(cfg, queues_bitmap) ||
+        (is_egress &&
+         doca_flow_pipe_cfg_set_domain(cfg, DOCA_FLOW_PIPE_DOMAIN_EGRESS)) ||
+        doca_flow_pipe_cfg_set_is_root(cfg, is_root) ||
+        (match && doca_flow_pipe_cfg_set_match(cfg, &match->d,
+                                               match_mask
+                                               ? &match_mask->d
+                                               : &match->d)) ||
+        (monitor && doca_flow_pipe_cfg_set_monitor(cfg, monitor)) ||
+        (actions && doca_flow_pipe_cfg_set_actions(cfg, actions_arr,
+                                                   actions_mask
+                                                   ? actions_masks_arr
+                                                   : actions_arr,
+                                                   desc
+                                                   ? descs_arr
+                                                   : NULL, 1))) {
+        VLOG_ERR("%s: Could not set doca_flow_pipe_cfg for %s",
+                 netdev_get_name(netdev), pipe_name);
+        ret = -1;

Should we use a proper DOCA error code here, as we do when
doca_flow_pipe_cfg_create() fails? Also, does it make sense to report
which specific call failed?

Ack

+        goto error;
+    }
+
+    ret = doca_flow_pipe_create(cfg, fwd, fwd_miss, pipe);
+    if (ret) {
+        VLOG_ERR("%s: Failed to create basic pipe '%s': %d (%s)",
+                 netdev_get_name(netdev), pipe_name, ret,
+                 doca_error_get_descr(ret));
+    }
+
+error:
+    doca_flow_pipe_cfg_destroy(cfg);
+    return ret;
+}
+
+unsigned int
+ovs_doca_max_counters(void)
  {
+    return ovs_doca_max_megaflows_counters +
+        OVS_DOCA_SLOWPATH_COUNTERS;

Same question as before: how will this be adjusted if
ovs_doca_max_megaflows_counters changes?  What will the impact be if
this changes at runtime for already initialized DOCA ports?

+}
+
+void
+ovs_doca_init(const struct smap *ovs_other_config)
+{
+    const char *doca_init_val = smap_get_def(ovs_other_config, "doca-init",
+                                             "false");

Use smap_get_bool() here.

Ack

+    static bool enabled = false;
+    int rv;
+
+    if (enabled || !ovs_other_config) {

Checking enabled here makes no sense, should it be ovs_doca_available()?

it is used to avoid smap read (will be fixed). see the same pattern asin dpdk_init().

+        return;
+    }
+
+    if (!strcasecmp(doca_init_val, "true")) {
+        static struct ovsthread_once once_enable = OVSTHREAD_ONCE_INITIALIZER;
+
+        if (!ovsthread_once_start(&once_enable)) {
+            return;
+        }
+
+        VLOG_INFO("Using DOCA %s", doca_version_runtime());
+        VLOG_INFO("DOCA Enabled - initializing...");
+        rv = ovs_doca_init__(ovs_other_config);
+        if (!rv) {
+            VLOG_INFO("DOCA Enabled - initialized");
+            enabled = true;
+        } else {
+            ovs_abort(rv, "DOCA Initialization Failed.");

I do not think we should abort here. Instead we should fail with an error,
similar to how DPDK handles initialization failures.

DPDK aborts. the same here.

+        }
+        ovsthread_once_done(&once_enable);

[snip]

@@ -60,6 +60,7 @@ dpctl
  dpdk
  dpif
  dst
+eg

This 'eg' and the earlier 'ie' can be removed.


Ack for 'eg'.

For 'ie':

WARNING: Possible misspelled word: "ie"
Did you mean:  ['IE', 'Ir', 'OE']
#581 FILE: lib/refmap.h:76:
/* Called once on a newly created 'value', i.e. when the first

  eip
  enablement
  encap
@@ -228,6 +229,7 @@ recirc
  recirculation
  recirculations
  refmap
+representor
  revalidate
  revalidation
  revalidator
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 9edd1027e..7f24005ed 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -337,6 +337,22 @@
          </p>
        </column>

+      <column name="other_config" key="doca-init"
+              type='{"type": "boolean"}'>
+        <p>
+          A value of <code>true</code> will cause the ovs-vswitchd process to
+          abort if DOCA cannot be initialized.
+        </p>
+        <p>
+          The default value is <code>false</code>. Changing this value requires
+          restarting the daemon
+        </p>
+        <p>
+          If this value is <code>false</code> at startup, any doca ports which
+          are configured in the bridge will fail due to memory errors.
+        </p>
+      </column>
+
        <column name="other_config" key="dpdk-lcore-mask"
                type='{"type": "integer", "minInteger": 1}'>
          <p>
@@ -510,6 +526,7 @@

        <column name="other_config" key="per-port-memory"
                type='{"type": "boolean"}'>
+        <p>NOTE: This option is not available for doca.</p>

Maybe this note can be moved a bit further down and expanded to explain
what exactly is not supported for DOCA, for example how DOCA ports behave
in this case.

I also do not see any code preventing this option from being set. Does
this mean it only affects DOCA ports, while non-DOCA ports continue to
behave as before?

I don't see a better place "below" to put it. I enhanced the note.

          <p>
            By default OVS DPDK uses a shared memory model wherein devices
            that have the same MTU and socket values can share the same
@@ -526,6 +543,7 @@
        </column>

        <column name="other_config" key="shared-mempool-config">
+              <p>NOTE: This option is not available for doca.</p>

Same comment as above.

Same.

                <p>Specifies dpdk shared mempool config.</p>
                <p>Value should be set in the following form:</p>
                <p>
@@ -942,7 +960,8 @@
        </column>

        <column name="doca_initialized">
-        Always false.
+        True if <ref column="other_config" key="doca-init"/> is set to
+        true and the DOCA library is successfully initialized.
        </column>

        <group title="Statistics">
--
2.34.1

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v2 6/6] netdev-doca: Introduce doca netdev.

Reply via email to