On Thu, 14 Jul 2016 17:07:33 -0700 John Fastabend <john.fastab...@gmail.com> wrote:
> On 16-07-14 04:42 PM, Alexei Starovoitov wrote:
> > On Wed, Jul 13, 2016 at 11:23:12PM -0700, John Fastabend wrote:
> >> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
> >> dequeue routines then sets the NOLOCK bit.
> >>
> >> This also removes the logic used to pick the next band to dequeue from
> >> and instead just checks each alf_queue for packets from top priority
> >> to lowest. This might need to be a bit more clever but seems to work
> >> for now.
> >>
> >> Signed-off-by: John Fastabend <john.r.fastab...@intel.com>
> >> ---
> >>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------
> >
> >>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
> >>                                struct sk_buff **to_free)
> >>  {
> >> -	return qdisc_drop(skb, qdisc, to_free);
> >> +	err = skb_array_produce_bh(q, skb);
> ..
> >>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> >>  {
> >> +	skb = skb_array_consume_bh(q);
> >
> > For this particular qdisc the performance gain should come from
> > granularity of spin_lock, right?
>
> And the fact that the consumer and producer are using different
> locks now.

Yes. Splitting up the enqueuers (producers) from the dequeuer (consumer)
is an important step, because today the qdisc layer has the problem that
enqueuers can starve the single dequeuer. The current mitigation tricks
are the enqueue busylock and bulk dequeue. As John says, using skb_array
causes the producers and the consumer to use different locks.

> > Before we were taking the lock much earlier. Here we keep the lock,
> > but for a very short time.
> >
> >         original pps    lockless    diff
> >   1     1418168         1269450     -148718
> >   2     1587390         1553408      -33982
> >   4     1084961         1683639     +598678
> >   8      989636         1522723     +533087
> >  12     1014018         1348172     +334154

It looks problematic that lockless is slower in the base single-CPU
case: orig (1418168 pps) vs lockless (1269450 pps). That is approx
(1/1418168 - 1/1269450) * 10^9 = -82.60 ns per packet.
That base "regression" looks too high to start with. How I view the
results: negative scaling starts early for "original", which is the main
problem we are looking to solve. The lockless variant does not show as
much scaling as expected, and starts to show a trend towards negative
scaling.

> > so perf for 1 cpu case is mainly due to array vs list,
> > since number of locks is still the same and there is no collision ?
> > but then why shorter lock give higher overhead in multi cpu cases?
>
> So in this case running pfifo_fast as the root qdisc with 12 threads
> means we have 12 producers hitting a single enqueue() path where as with
> mq and only looking at pktgen numbers we have one thread for each
> skb_array.

The skb_array allows multiple producers and multiple consumers (MPMC),
but is optimized towards the case of a single producer and a single
consumer (SPSC). An SPSC queue is usually implemented with much less
synchronization, but then you need some other guarantee that only a
single producer and a single consumer can be running at a time.

> > That would have been the main target for performance improvement?
>
> Maybe I should fire up a TCP test with 1000's of threads to see what
> the perf numbers look like.
>
> > Looks like mq gets the most benefit, because it's lockless internally
> > which makes sense.
> > In general I think this is the right direction where tc infra should
> > move to.
> > I'm only not sure whether it's worth converting pfifo to skb_array.
> > Probably alf queue would have been a better alternative.
>
> Tomorrow's task is to resurrect the alf_queue and look at its numbers
> compared to this. Today was spent trying to remove the HARD_TX_LOCK
> that protects the driver; in the mq case it seems this is not really
> needed either.

I would be very interested in the alf_queue numbers, as alf_queue should
perform better with multiple producers. If you keep the single TC
dequeuer rule, then you could use the MPSC variant of alf_queue.
My benchmarks from 2014 are available in this presentation:
 http://people.netfilter.org/hawk/presentations/nfws2014/dp-accel-qdisc-lockless.pdf
It also shows the mitigation effect of the bulk-dequeue patches on
slide 12 (but I would like to see some new benchmarks for alf_queue,
slide 14).

But do notice, alf_queue has a design problem that skb_array solved.
Alf_queue has the problem of read-bouncing the remote/opposite CPU's
cache line, because it needs to (1) at enqueue (producer) check if there
is free space (reading consumer.tail), and (2) at dequeue (consumer)
check if any objects are available (reading producer.tail).

The alf_queue mitigates this design problem by allowing bulk enqueue and
bulk dequeue. But I guess you will not use that "mode"; I think I did in
my NFWS2014 benchmarks.

The skb_array solved this design problem by using the content of the
queue objects themselves as the marker for the free/full condition. The
alf_queue cannot do so easily, because of its bulk design, which needs
to reserve part of the queue. Thus, we likely need some new queue
design, a hybrid between alf_queue and skb_array, for this use-case.

I actually wrote some queue micro-benchmarks that show the strengths and
weaknesses of both skb_array[1] and alf_queue[2].

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer