On 10/7/2025 6:10 AM, Andrii Nakryiko wrote:

[...]

+
+       over_pos = READ_ONCE(rb->overwrite_pos);
+       return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);

I'm trying to understand why you need to min with `rb->mask + 1`, can
you please elaborate?


We need the min because rb->producer_pos and rb->overwrite_pos are read
at different times. During this gap, a fast producer may wrap around once or
more, making over_pos larger than prod_pos; without the clamp, the unsigned
subtraction would then yield a value larger than the ring buffer size.


what if you read overwrite_pos before reading producer_pos? Then it
can't be larger than producer_pos and available data would be
producer_pos - max(consumer_pos, overwrite_pos)? would that work?


No, it won’t work. Between reading overwrite_pos and producer_pos, a producer
on a different CPU may have moved producer_pos forward by more than one ring
buffer size, causing prod_pos - max(cons_pos, over_pos) to exceed the ring
buffer size.
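
To make the race concrete, here is a standalone user-space sketch of the
arithmetic (plain C, not the kernel helper; the numbers are made up for
illustration):

#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
        return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
        return a > b ? a : b;
}

/* Positions are free-running byte counters; only mask + 1 bytes of data
 * exist at any time, so clamp in case the positions were sampled at
 * different times and the producer raced ahead in between.
 */
static unsigned long avail_data_sz(unsigned long prod_pos,
                                   unsigned long cons_pos,
                                   unsigned long over_pos,
                                   unsigned long mask)
{
        return min_ul(prod_pos - max_ul(cons_pos, over_pos), mask + 1);
}

int main(void)
{
        unsigned long mask = 4096 - 1;  /* 4 KiB data area */

        /* over_pos was sampled at 4096, then the producer ran ahead to
         * 10240: without the clamp this would report 6144 bytes of data
         * in a 4096-byte buffer; with it, the result is 4096.
         */
        printf("%lu\n", avail_data_sz(10240, 2048, 4096, mask));
        return 0;
}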


And also, at least for consistency, use smp_load_acquire() for overwrite_pos?


Using READ_ONCE() here mirrors __bpf_ringbuf_reserve(), where overwrite_pos
is updated with WRITE_ONCE() first, followed by smp_store_release(producer_pos).
So here we do smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos),
to get a consistent view of the ring buffer.

That said, to be consistent with how consumer_pos and producer_pos are read,
I’m fine with switching the overwrite_pos read from READ_ONCE() to
smp_load_acquire().


I'm not sure it matters much, but this function is called outside of
rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while
doing that WRITE_ONCE(). So it might not make any difference, but I
have a mild preference for smp_load_acquire() here.


OK, I'll switch to smp_load_acquire.
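
For the record, the pairing agreed on above can be modeled with C11 atomics,
with memory_order_release/acquire standing in for smp_store_release() /
smp_load_acquire(); this is only an illustrative sketch, not the kernel code:

#include <stdatomic.h>

struct rb_model {
        atomic_ulong producer_pos;
        atomic_ulong overwrite_pos;
};

/* Producer side (in the kernel this runs under rb->spinlock in
 * __bpf_ringbuf_reserve()): publish overwrite_pos before releasing
 * producer_pos.
 */
void producer_publish(struct rb_model *rb, unsigned long over,
                      unsigned long prod)
{
        atomic_store_explicit(&rb->overwrite_pos, over,
                              memory_order_relaxed);  /* WRITE_ONCE()        */
        atomic_store_explicit(&rb->producer_pos, prod,
                              memory_order_release);  /* smp_store_release() */
}

/* Consumer side: the acquire on producer_pos synchronizes with the release
 * above, so the overwrite_pos value written before it is visible here.
 */
void consumer_snapshot(struct rb_model *rb, unsigned long *prod,
                       unsigned long *over)
{
        *prod = atomic_load_explicit(&rb->producer_pos,
                                     memory_order_acquire); /* smp_load_acquire() */
        *over = atomic_load_explicit(&rb->overwrite_pos,
                                     memory_order_acquire); /* switched from READ_ONCE() */
}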

   }

   static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
@@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
          return (void*)((addr & PAGE_MASK) - off);
   }

+

[...]

+       /* In overwrite mode, move overwrite_pos to the next record to be
+        * overwritten if the ring buffer is full
+        */

hm... here I think the important point is that we search for the next
record boundary up to which we need to overwrite data so that the newly
reserved record fits. "next record to be overwritten" isn't that
important (we might never need to overwrite it). What matters is
a) staying on a record boundary and b) consuming enough records to
reserve the new one.

Can you please update the comment to mention the above points?


Sure, I'll update the comment to:

In overwrite mode, advance overwrite_pos when the ring buffer is full.
The key points are to stay on record boundaries and consume enough
records to fit the new one.


ok
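
For reference, a hypothetical sketch of the kind of loop the updated comment
describes; record_total_len() is a made-up helper standing in for "read this
record's header and return its rounded-up total length", and the real patch
code is not reproduced here:

/* Starting from overwrite_pos, walk forward one whole record at a time
 * (staying on record boundaries) until the newly reserved record, which
 * ends at new_prod_pos, fits within one buffer size.
 */
unsigned long advance_overwrite_pos(unsigned long over_pos,
                                    unsigned long new_prod_pos,
                                    unsigned long buf_size,
                                    unsigned long (*record_total_len)(unsigned long pos))
{
        while (new_prod_pos - over_pos > buf_size)
                over_pos += record_total_len(over_pos);

        return over_pos;
}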

[...]


+                          unsigned long rec_pos,
+                          unsigned long cons_pos,
+                          u32 len, u64 flags)
+{
+       unsigned long rec_end;
+
+       if (flags & BPF_RB_FORCE_WAKEUP)
+               return true;
+
+       if (flags & BPF_RB_NO_WAKEUP)
+               return false;
+
+       /* for non-overwrite mode, if consumer caught up and is waiting for
+        * our record, notify about new data availability
+        */
+       if (likely(!rb->overwrite_mode))
+               return cons_pos == rec_pos;
+
+       /* for overwrite mode, to give the consumer a chance to catch up
+        * before being overwritten, wake up consumer every half a round
+        * ahead.
+        */
+       rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
+
+       cons_pos &= (rb->mask >> 1);
+       rec_pos &= (rb->mask >> 1);
+       rec_end &= (rb->mask >> 1);
+
+       if (cons_pos == rec_pos)
+               return true;
+
+       if (rec_pos < cons_pos && cons_pos < rec_end)
+               return true;
+
+       if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
+               return true;
+

hm... ok, let's discuss this. Why do we need to do some half-round
heuristic for overwrite mode? If a consumer is falling behind it
should be actively trying to catch up and they don't need notification
(that's the non-overwrite mode logic already).

So there is more to this than a brief comment you left, can you please
elaborate?


The half-round wakeup was originally intended to work with libbpf in the
v1 version. In that version, libbpf used a retry loop to safely copy data
from the ring buffer that hadn’t been overwritten. By waking the consumer
once every half round, there was always a period where the consumer and
producer did not overlap, which helped reduce the number of retries.

I can't say I completely grok the logic here, but do you think we
should still keep this half-round wakeup? It looks like an arbitrary
heuristic, so I'd rather not have it.


Sure, since the related libbpf code is no longer present, I’ll remove this
logic in the next version.
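
For anyone reading the archive later, here is a standalone restatement of the
half-round check being dropped, just to make the heuristic concrete: reduce
all positions modulo half the buffer size and wake the consumer whenever the
committed record crosses cons_pos in that half-sized circular space, i.e.
roughly once per half round. Illustrative only, not the patch code:

#include <stdbool.h>

bool half_round_wakeup(unsigned long cons_pos, unsigned long rec_pos,
                       unsigned long rec_end, unsigned long mask)
{
        unsigned long half_mask = mask >> 1;

        cons_pos &= half_mask;
        rec_pos &= half_mask;
        rec_end &= half_mask;

        /* consumer sits exactly at the start of the committed record */
        if (cons_pos == rec_pos)
                return true;
        /* record does not wrap within the half round and covers cons_pos */
        if (rec_pos < cons_pos && cons_pos < rec_end)
                return true;
        /* record wraps past the end of the half round */
        if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
                return true;

        return false;
}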


pw-bot: cr

+       return false;
+}
+
+static __always_inline

we didn't have always_inline before, any strong reason to add it now?


I just wanted to avoid introducing any performance regression. Before this
patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but
after the patch it wasn’t, so I added always_inline explicitly to keep it
inlined.

how big of a difference was it in benchmarks? It's generally frowned
upon to use __always_inline without a good reason.


The difference is not noticeable on my arm64 test machine, but it is on my
AMD machine.

Below is the benchmark data on AMD EPYC 9654, with and without always_inline
attribute.

- With always_inline

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  13.070 ± 0.158M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  15.440 ± 0.017M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  7.860 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  6.444 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  3.788 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 2.802 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 2.560 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 2.227 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 2.141 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 1.960 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 1.913 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 1.854 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 1.818 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 1.779 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 1.758 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 1.812 ± 0.003M/s (drops 0.000 ± 0.000M/s)

- Without always_inline

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  10.550 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  14.661 ± 0.024M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  7.616 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  6.476 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  3.806 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 2.814 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 2.608 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 2.337 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 2.270 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 1.977 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 1.921 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 1.862 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 1.827 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 1.912 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 1.860 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 1.824 ± 0.001M/s (drops 0.000 ± 0.000M/s)

When nr_prod=1, the performance regression is significant, dropping from
13.070 ± 0.158 M/s with always_inline to 10.550 ± 0.032 M/s without it.

However, since the half-round wakeup logic will be removed in the next
version, the changes to bpf_ringbuf_commit, including always_inline, will
also be removed.

[...]

