So after some more fiddling, it looks like I got the diagram wrong. Here's how the switch really consumes resources. There are 4 lookups in parallel; they are ORed in 2 pairs (ingress with egress forms a pair), and the two results are ANDed. The consumptions for ingress and egress are really completely independent.
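
To make the combination rule concrete, here is a rough model of it in Python. This is only a sketch of my reading of the hardware, not driver code: the Watermark class, lookup() and admit() are made-up names, and the exact comparison (in_use + cost <= limit) is an assumption about how "exceeded" is evaluated.

```python
from dataclasses import dataclass


@dataclass
class Watermark:
    name: str    # register name, e.g. "BUF_Q_RSRV_I"
    limit: int   # configured threshold (assumed unit: cells or frame refs)
    in_use: int  # current consumption counted against this watermark


def lookup(chain, cost):
    """One of the 4 parallel lookups.

    The chain is walked in order (Q_RSRV, P_RSRV, PRIO_SHR, COL_SHR);
    the lookup succeeds at the first watermark that still has room and
    fails only if every watermark in the chain is exceeded.
    """
    return any(wm.in_use + cost <= wm.limit for wm in chain)


def admit(buf_i, buf_e, ref_i, ref_e, cells, refs=1):
    """Accept the frame only if both the memory and the reference pair pass.

    Ingress and egress lookups are ORed within each pair, then the two
    pair results are ANDed.
    """
    memory_ok = lookup(buf_i, cells) or lookup(buf_e, cells)
    refs_ok = lookup(ref_i, refs) or lookup(ref_e, refs)
    return memory_ok and refs_ok
```

For example, a frame whose ingress chains are all exhausted is still accepted as long as the egress memory chain and the egress reference chain each have room somewhere, which is what makes the two directions independent.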

                        Frame forwarding decision taken
                                       |
                                       |
                                       v
        +--------------------+--------------------+--------------------+
        |                    |                    |                    |
        v                    v                    v                    v
 Ingress memory        Egress memory        Ingress frame        Egress frame
      check                check           reference check      reference check
        |                    |                    |                    |
        v                    v                    v                    v
  BUF_Q_RSRV_I   ok    BUF_Q_RSRV_E   ok    REF_Q_RSRV_I   ok    REF_Q_RSRV_E   ok
(src port, prio) -+  (dst port, prio) -+  (src port, prio) -+  (dst port, prio) -+
        |         |          |         |          |         |          |         |
        |exceeded |          |exceeded |          |exceeded |          |exceeded |
        |         |          |         |          |         |          |         |
        v         |          v         |          v         |          v         |
  BUF_P_RSRV_I  ok|    BUF_P_RSRV_E  ok|    REF_P_RSRV_I  ok|    REF_P_RSRV_E  ok|
   (src port) ----+     (dst port) ----+     (src port) ----+     (dst port) ----+
        |         |          |         |          |         |          |         |
        |exceeded |          |exceeded |          |exceeded |          |exceeded |
        |         |          |         |          |         |          |         |
        v         |          v         |          v         |          v         |
 BUF_PRIO_SHR_I ok|   BUF_PRIO_SHR_E ok|   REF_PRIO_SHR_I ok|   REF_PRIO_SHR_E ok|
     (prio) ------+       (prio) ------+       (prio) ------+       (prio) ------+
        |         |          |         |          |         |          |         |
        |exceeded |          |exceeded |          |exceeded |          |exceeded |
        |         |          |         |          |         |          |         |
        v         |          v         |          v         |          v         |
  BUF_COL_SHR_I ok|    BUF_COL_SHR_E ok|    REF_COL_SHR_I ok|    REF_COL_SHR_E ok|
      (dp) -------+        (dp) -------+        (dp) -------+        (dp) -------+
        |         |          |         |          |         |          |         |
        |exceeded |          |exceeded |          |exceeded |          |exceeded |
        |         |          |         |          |         |          |         |
        v         v          v         v          v         v          v         v
       fail    success      fail    success      fail    success      fail    success
        |         |          |         |          |         |          |         |
        v         v          v         v          v         v          v         v
        +----+----+          +----+----+          +----+----+          +----+----+
             |                    |                    |                    |
             +-------> OR <-------+                    +-------> OR <-------+
                       |                                         |
                       v                                         v
                       +-----------------> AND <-----------------+
                                            |
                                            v
                                   FIFO drop / accept

Something which isn't explicitly said in devlink-sb is whether a pool bound to a port-TC is allowed to spill over into the port pool. And whether the port pool, in turn, is allowed to spill over into something else (a shared pool)?

If they are, then I could expose BUF_P_RSRV_I (buffer reservation per ingress port) as the threshold of the port pool, BUF_Q_RSRV_I and BUF_Q_RSRV_E (buffer reservations per QoS class of ingress, and egress, ports) as port-TC pools, and I could implicitly configure the remaining sharing watermarks to consume the rest of the memory available in the pool.

But by looking at some of the selftests, I don't see any clear indication of a test where the occupancy of the port-TC exceeds the size of that pool, and what should happen in that case. Just a vague hint, in tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh, that once the port-TC pool threshold has been exceeded, the excess should be simply dropped:

        # Set the ingress quota high and use the three egress TCs to limit the
        # amount of traffic that is admitted to the shared buffers. This makes
        # sure that there is always enough traffic of all types to select from
        # for the DWRR process.
        devlink_port_pool_th_set $swp1 0 12
        devlink_tc_bind_pool_th_set $swp1 0 ingress 0 12
        devlink_port_pool_th_set $swp2 4 12
        devlink_tc_bind_pool_th_set $swp2 7 egress 4 5
        devlink_tc_bind_pool_th_set $swp2 6 egress 4 5
        devlink_tc_bind_pool_th_set $swp2 5 egress 4 5

So I'm guessing that this is not the same behavior as in ocelot. But, truth be told, it doesn't really help either that nfp and mlxsw are simply passing these parameters to firmware, not really giving any insight into how they are interpreted.

Would it be simpler if I just exposed these watermarks as generic devlink resources? Although in a way that would be a wasted opportunity for devlink-sb. I also don't think I can monitor occupancy if I model them as generic resources.

Am I missing something?

Thanks,
-Vladimir