On Tue, Jan 02, 2018 at 01:27:23PM -0800, John Fastabend wrote:
> On 01/02/2018 09:17 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 02, 2018 at 07:01:33PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Jan 02, 2018 at 11:52:19AM -0500, David Miller wrote:
> >>> From: John Fastabend <john.fastab...@gmail.com>
> >>> Date: Wed, 27 Dec 2017 19:50:25 -0800
> >>>
> >>>> When running consumer and/or producer operations and empty checks in
> >>>> parallel it's possible for the empty check to run past the end of the
> >>>> array. The scenario occurs when an empty check runs while
> >>>> __ptr_ring_discard_one() is in progress: specifically, after the
> >>>> consumer_head is incremented but before the (consumer_head >= ring_size)
> >>>> check is made and the consumer head is zeroed.
> >>>>
> >>>> To resolve this, without having to rework how consumer/producer ops
> >>>> work on the array, simply add an extra dummy slot to the end of the
> >>>> array. Even if we reworked the code to avoid the extra slot, it looks
> >>>> like the normal-case checks would suffer somewhat, so it is best to
> >>>> just allocate an extra pointer.
> >>>>
> >>>> Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
> >>>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
> >>>> Signed-off-by: John Fastabend <john.fastab...@gmail.com>
> >>>
> >>> Applied, thanks John.
> >>
> >> I think that patch is wrong. I'd rather it got reverted.
> > 
> > And replaced with something like the following - still untested, but
> > apparently there's some urgency to fix it, so posting for review ASAP.
> > 
> 
> So the above ptr_ring patch is meant for the dequeue() case, not
> the peek() case. The dequeue case is shown here:
> 
> static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> {
>         struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>         struct sk_buff *skb = NULL;
>         int band;
> 
>         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>                 struct skb_array *q = band2list(priv, band);
> 
>                 if (__skb_array_empty(q))
>                         continue;
> 
>                 skb = skb_array_consume_bh(q);
>         }
>         if (likely(skb)) {
>                 qdisc_qstats_cpu_backlog_dec(qdisc, skb);
>                 qdisc_bstats_cpu_update(qdisc, skb);
>                 qdisc_qstats_cpu_qlen_dec(qdisc);
>         }
> 
>         return skb;
> }
> 
> In the dequeue case we use it purely as a hint and then do a proper
> consume (with locks) if needed. A false negative in this case means
> a consume is happening on another CPU due to a reset op or a
> dequeue op. (As an aside, I'm evaluating whether we should only allow
> a single dequeueing CPU at a time; more below.) If it's a reset op
> that caused the false negative, that is OK because we are flushing the
> array anyway. If it is a dequeue op it is also OK, because this CPU
> will abort but the running CPU will continue to dequeue, avoiding a
> stall. So I believe false negatives are OK in the above function.

I'm not 100% sure I understand. We don't always dequeue until empty.
E.g. it seems that another CPU could dequeue an skb and then stop
because e.g. it reached a byte queue limit; then this CPU gets a false
negative and stops because of that.


More generally, what makes this usage safe?
Is there a way to formalize it at the API level?

> > John, others, could you pls confirm it's not too bad performance-wise?
> > I'll then split it up properly and re-post.
> > 
> 
> I haven't benchmarked it, but in the dequeue case taking a lock for
> every priority even when it's empty seems unneeded.

Well it does seem to make the code more comprehensible and more robust.

But OK - I was looking at fixing the unlocked empty API to make sure it
actually does what it's supposed to do. I posted a draft earlier in this
thread; it needs to be looked at in depth to figure out whether it can
ever give false negatives or positives, and to document the results.



> > -->
> > 
> > net: don't misuse ptr_ring_peek
> > 
> > ptr_ring_peek only claims to be safe if the result is never
> > dereferenced, which isn't the case for its use in sch_generic.
> > Add locked API variants and use the bh one here.
> > 
> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > 
> > ---
> > 
> 
> [...]
> 
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
> >     for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> >             struct skb_array *q = band2list(priv, band);
> >  
> > -           skb = __skb_array_peek(q);
> > +           skb = skb_array_peek_bh(q);
> 
> Ah, I should have added a comment here. For now peek() is only used from
> locking qdiscs, so peek and consume/produce operations will never happen
> in parallel. In this case we should never hit the false-negative case
> with my patch, or the out-of-bounds reference without my patch.
> 
> Doing a peek() op without the qdisc lock is a bit problematic anyway.
> With the current code another CPU could consume the skb and free it.
> Either we ensure a single consumer runs at a time on an array (maybe
> not the same thing as the qdisc lock) or we just avoid peek operations
> in this case. My current plan was to avoid peek() ops altogether; they
> seem unnecessary with the types of qdiscs I want to build.
> 
> Thanks,
> John

For the lockless qdisc, for net-next, do we need the patch above until you
fix that, though?

> >     }
> >  
> >     return skb;
> > 
