On Tue, Jan 02, 2018 at 01:27:23PM -0800, John Fastabend wrote:
> On 01/02/2018 09:17 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 02, 2018 at 07:01:33PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Jan 02, 2018 at 11:52:19AM -0500, David Miller wrote:
> >>> From: John Fastabend <john.fastab...@gmail.com>
> >>> Date: Wed, 27 Dec 2017 19:50:25 -0800
> >>>
> >>>> When running consumer and/or producer operations and empty checks in
> >>>> parallel it's possible to have the empty check run past the end of the
> >>>> array. The scenario occurs when an empty check is run while
> >>>> __ptr_ring_discard_one() is in progress, specifically after the
> >>>> consumer_head is incremented but before the (consumer_head >= ring_size)
> >>>> check is made and the consumer head is zeroed.
> >>>>
> >>>> To resolve this, without having to rework how consumer/producer ops
> >>>> work on the array, simply add an extra dummy slot to the end of the
> >>>> array. Even if we did a rework to avoid the extra slot, it looks
> >>>> like the normal-case checks would suffer some, so it is best to just
> >>>> allocate an extra pointer.
> >>>>
> >>>> Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
> >>>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
> >>>> Signed-off-by: John Fastabend <john.fastab...@gmail.com>
> >>>
> >>> Applied, thanks John.
> >>
> >> I think that patch is wrong. I'd rather it got reverted.
> >
> > And replaced with something like the following - still untested, but
> > apparently there's some urgency to fix it, so posting for review ASAP.
>
> So the above ptr_ring patch is meant for the dequeue() case, not
> the peek case.
> Dequeue case shown here:
>
> static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> {
>         struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>         struct sk_buff *skb = NULL;
>         int band;
>
>         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>                 struct skb_array *q = band2list(priv, band);
>
>                 if (__skb_array_empty(q))
>                         continue;
>
>                 skb = skb_array_consume_bh(q);
>         }
>         if (likely(skb)) {
>                 qdisc_qstats_cpu_backlog_dec(qdisc, skb);
>                 qdisc_bstats_cpu_update(qdisc, skb);
>                 qdisc_qstats_cpu_qlen_dec(qdisc);
>         }
>
>         return skb;
> }
>
> In the dequeue case we use the empty check purely as a hint and then do a
> proper consume (with locks) if needed. A false negative in this case means
> a consume is happening on another CPU due to a reset op or a dequeue op.
> (An aside, but I'm evaluating whether we should only allow a single
> dequeueing CPU at a time; more below.) If it's a reset op that caused the
> false negative, that is OK because we are flushing the array anyway. If it
> is a dequeue op, that is also OK because this core will abort but the
> running core will continue to dequeue, avoiding a stall. So I believe
> false negatives are OK in the above function.
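[Editor's note] The "unlocked empty check as a hint, locked consume as the authority" pattern John describes can be sketched in userspace C. Everything below is made up for illustration (`hint_ring` and friends; a pthread mutex stands in for the consumer spinlock), not the kernel skb_array API:

```c
#include <pthread.h>
#include <stddef.h>

/* A toy single-band ring mirroring the shape of pfifo_fast_dequeue():
 * the unlocked empty check may return a stale answer under
 * concurrency; correctness comes from the locked consume. */
struct hint_ring {
        void *queue[8];
        int consumer_head;
        pthread_mutex_t consumer_lock;
};

/* Unlocked: a cheap hint only.  A false negative just means another
 * consumer (reset or dequeue) is racing with us, which is tolerable. */
static int hint_ring_empty(struct hint_ring *r)
{
        return r->queue[r->consumer_head] == NULL;
}

/* Locked: the authoritative dequeue.  Returns NULL if really empty. */
static void *hint_ring_consume(struct hint_ring *r)
{
        void *p;

        pthread_mutex_lock(&r->consumer_lock);
        p = r->queue[r->consumer_head];
        if (p) {
                r->queue[r->consumer_head] = NULL;
                r->consumer_head = (r->consumer_head + 1) % 8;
        }
        pthread_mutex_unlock(&r->consumer_lock);
        return p;
}

/* Hint first, then the real locked op, as in the dequeue path above. */
static void *hint_ring_dequeue(struct hint_ring *r)
{
        if (hint_ring_empty(r))
                return NULL;    /* false negative is only a missed pass */
        return hint_ring_consume(r);
}
```

Under this scheme a false negative never loses data: the item is still in the ring, and whichever CPU is holding the consumer lock is making progress on it.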
I'm not 100% sure I understand. We don't always dequeue until empty: e.g.
it seems that another CPU could dequeue an skb and then stop because it
reached a byte queue limit, then this CPU gets a false negative and stops
because of that.

More generally, what makes this usage safe? Is there a way to formalize it
at the API level?

> > John, others, could you pls confirm it's not too bad performance-wise?
> > I'll then split it up properly and re-post.
>
> I haven't benchmarked it, but in the dequeue case taking a lock for
> every priority even when it's empty seems unneeded.

Well, it does seem to make the code more comprehensible and more robust.
But OK - I was looking at fixing the unlocked empty API to make sure it
actually does what it's supposed to. I posted a draft earlier in this
thread; it needs to be looked at in depth to figure out whether it can
ever give false negatives or positives, and the results documented.

> > -->
> >
> > net: don't misuse ptr_ring_peek
> >
> > ptr_ring_peek only claims to be safe if the result is never
> > dereferenced, which isn't the case for its use in sch_generic.
> > Add locked API variants and use the bh one here.
> >
> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >
> > ---
> >
> > [...]
> >
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
> >         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> >                 struct skb_array *q = band2list(priv, band);
> >
> > -               skb = __skb_array_peek(q);
> > +               skb = skb_array_peek_bh(q);
>
> Ah, I should have added a comment here. For now peek() is only used from
> locking qdiscs, so peek and consume/produce operations will never happen
> in parallel. In this case we should never hit the false-negative case with
> my patch, or the out-of-bounds reference without my patch.
>
> Doing a peek() op without the qdisc lock is a bit problematic anyway.
> With current code another CPU could consume the skb and free it. Either we
> can ensure a single consumer runs at a time on an array (not the same as
> the qdisc, maybe) or just avoid peek operations in this case. My current
> plan was to just avoid peek() ops altogether; they seem unnecessary with
> the types of qdiscs I want to build.
>
> Thanks,
> John

For the lockless qdisc, for net-next, do we need the patch above until you
fix that, though?

> >         }
> >
> >         return skb;
> >
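[Editor's note] The locked peek discussed above can also be modeled in userspace. The names below are invented (this is not the kernel skb_array API), and a pthread mutex stands in for the consumer spinlock plus BH disabling; the comments capture John's caveat that the lock protects the array read, not the lifetime of the returned object:

```c
#include <pthread.h>
#include <stddef.h>

/* Toy ring with a consumer lock, as in the hint/consume sketches. */
struct peek_ring {
        void *queue[8];
        int consumer_head;
        pthread_mutex_t consumer_lock;
};

/* Locked peek: holding the consumer lock makes the read of the array
 * itself consistent with concurrent consumers.  It does NOT pin the
 * returned object: once the lock is released, a concurrent consumer
 * can dequeue and free it, which is why an unlocked peek is only safe
 * if the result is never dereferenced, and why peek() in sch_generic
 * is only used under the qdisc lock. */
static void *peek_ring_peek_locked(struct peek_ring *r)
{
        void *p;

        pthread_mutex_lock(&r->consumer_lock);
        p = r->queue[r->consumer_head];
        pthread_mutex_unlock(&r->consumer_lock);
        return p;       /* do not dereference without excluding consumers */
}
```

Note that peek leaves the slot in place, so repeated peeks return the same pointer until someone consumes it.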