> Thanks for the reply. Discussion below:
>
> On Tue, Sep 10, 2013 at 2:38 AM, Roberto De Ioris <[email protected]>
> wrote:
>
>> > Hi,
>> >
>> > I'm investigating using uwsgi to run Python code in the
>> > FrameworkBenchmarks
>> > project <http://www.techempower.com/benchmarks/> which compares web
>> > frameworks, languages, platforms, web servers and more. I tried
>> running
>> > another contributor's uwsgi command line, but I can't get uwsgi to
>> fully
>> > saturate all CPU cores when under load.
>> >
>> > uwsgi command line:
>> >
>> > --master -L --http :8080 --http-keepalive -p 2 -w hello --add-header
>> >> "Connection: keep-alive"
>>
>> in this way you are benchmarking a proxied setup, with an http router in
>> front managing all of the requests and forwarding them to the workers.
>>
>> While in terms of raw performance it could be successful, in terms of core
>> usage it could be suboptimal (even if core usage is a bit 'strange' as a
>> benchmark metric, since the operating system scheduler chooses which
>> process to give cpu time to using really complex algorithms).
>>
>
> I ran htop and found that the http router process was ~100% (thus, using
> most of one core). My guess is that the http router is CPU bound and thus
> can't send enough work to the workers, so the worker processes are not
> fully utilized. Basically, the http router is the bottleneck. On my
> system,
> this produces about 6,000-7,000 requests/sec, whereas gunicorn can do
> about
> 10,000 requests/sec, saturating all cores.


Seems reasonable, as you have 1 process with the uwsgi httprouter and 2 with
gunicorn (the meinheld parser is very close in performance to the uwsgi one).

Just to be sure: are you using a 1.9.x release?

My latest numbers (especially those with pypy) are only for the 1.9 codebase.

(1.4's http parsing was not good [4 syscalls per request], and the last time
I tried, meinheld was up to 8% faster in single-process mode.)
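For reference, the `hello` module loaded with `-w hello` (or `gunicorn hello:application`) is presumably a minimal WSGI app along these lines. This is a sketch of my own; the benchmark's actual code is not shown in this thread:

```python
# hello.py -- a minimal WSGI "hello world" app, the kind of module
# that `uwsgi -w hello` would load. Illustrative only; the
# FrameworkBenchmarks app may differ in detail.

def application(environ, start_response):
    body = b"Hello, World!"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```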


>
>
>> The "right" command line would be:
>>
>> --master -L --http-socket :8080 -p 2 -w hello
>>
>> (keep-alive is useless as this is an in-process non-persistent parser)
>>
>
> I tried this (after increasing system somaxconn, using uwsgi -l, and
> removing -H 'Connection: keep-alive' from the wrk args) and only 326
> requests completed and there were 330 read socket errors and 1788 timeout
> socket errors. I'm not sure what's going on, maybe it is a bug in wrk.


-l with which value?

From the results it looks like something is wrong there (you should have
errors in the uwsgi logs).

>
> But at any rate, my goal is to use HTTP Keep-Alive to get the most
> requests/sec, so perhaps --http-socket isn't useful for this benchmark in
> the first place.
>

(yes) you can't, because keepalive is only for the frontend (the
httprouter or nginx), while the --http-socket (the in-process one) is
non-keepalive.

But if you compare a non-proxied setup (gunicorn+meinheld) with a
proxied one (nginx+uwsgi / httprouter+uwsgi), the comparison will be "unfair"
(especially because no one would expose gunicorn+meinheld directly to the
public without nginx or something similar in front, just as no one would
expose uwsgi --http-socket to the public). Even if things like tcp
offloading and dma engines greatly reduce the impact of the IPC, you always
have a little overhead (especially in syscalls).

Yes, we are talking about microseconds, but in this kind of benchmark
they make a difference too.


>
>> If you want to test the http router (something a lot of users use in
>> production) you may want to use --http-processes 2 (this time keepalive
>> works)
>>
>> With this setup the httprouter too will use 2 processes, but again 'cpu
>> cores' usage could be irrelevant.
>>
>
> I used `--master -L --http :8080 --http-processes 2 --http-keepalive -p 2
> -w hello --add-header ...' and I was able to saturate all CPU cores. The
> htop CPU usage was about ~65% for each httprouter process and ~35% for the
> worker processes. The result was ~8,500 requests/sec, an improvement, but
> still not close to gunicorn. These results seem to suggest that the
> original problem was that the httprouter is CPU bound and the bottleneck.
>

Probably you still have problems with the listen queue (so the worker
itself is the bottleneck, as the uwsgi routers are tuned for really high
concurrency). Hello-world benchmarks are not realistic (or rather, they are
close to a DoS), so the first step is tuning the listen queue, as the
network will saturate fast.

You may want to run uwsgitop (with the stats server enabled) to see the
status of the listen queue in real time.
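If you would rather not watch uwsgitop interactively, the stats server (enabled with e.g. `--stats 127.0.0.1:9191`) just serves a JSON document, so a few lines of Python can poll the backlog. A sketch, assuming the stats JSON exposes top-level `listen_queue` / `listen_queue_errors` keys as recent releases do:

```python
import json
import socket

def read_stats(host="127.0.0.1", port=9191):
    """Fetch the raw JSON document served by uWSGI's stats server."""
    chunks = []
    with socket.create_connection((host, port)) as s:
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks))

def listen_queue_status(stats):
    """Extract backlog depth and overflow count from a stats document."""
    return stats.get("listen_queue", 0), stats.get("listen_queue_errors", 0)
```

Under load, a `listen_queue` that sits near the configured backlog (or a growing `listen_queue_errors`) means connections are piling up faster than the workers drain them.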



> So far I haven't seen any data to suggest that this is an affinitization
> problem or that affinity could help, so I haven't bothered with
> --cpu-affinity. So far virtualization doesn't seem to be an issue since my
> physical machine is otherwise idle and has two cores (with
> hyperthreading).

On virtualized systems cpu-affinity simply does not work, because of the way
cpus are abstracted by the hypervisor. Even if your kernel shows the right
distribution, internally you do not know which cpu is effectively used.

But this is not your problem. I have run some tests with a concurrency of
90 (so no need to tune the listen queue), and --http-socket was 1-2%
faster, while httprouter + uwsgi was 3-4% slower (as expected, as you have
the ipc overhead, something you will always have in production
environments).

>
> After doing this research (with your help), my analysis is that the
> (single
> process) uwsgi httprouter becomes CPU bound and becomes the limiting
> factor.

(Always supposing you are using a 1.9.x version)

the httprouter becomes CPU bound only at higher levels of concurrency
(unless you are using a pre-1.9 version, where there are blocking parts).

Workers are heavier in terms of "things to do"; the fact that they are low
in cpu usage suggests a communication problem (again, it could be the
listen queue). The httprouter (like nginx) does not need a tuned listen
queue, as it constantly accept()s and waits again, reducing the need for a
queue. Workers, instead, do the heavy part after the accept(), and
connections arriving while they are in the "heavy part" are enqueued (and
saturating a 100-entry listen queue with 256 concurrent connections and 2
workers is pretty easy, especially because --http-socket applies a 4-second
timeout to protocol traffic).
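The arithmetic is easy to check: with all workers busy in the "heavy part", any burst of concurrent connections beyond workers + backlog has nowhere to go. A back-of-the-envelope sketch (the numbers are the ones from this discussion, not a measurement):

```python
def overflow(concurrency, workers, backlog):
    """Connections that can be neither served nor queued.
    Each busy worker holds one connection; the rest wait in the
    kernel listen queue, which holds at most `backlog` entries."""
    waiting = max(concurrency - workers, 0)
    return max(waiting - backlog, 0)

# 256 concurrent connections, 2 workers, default backlog of 100:
# 254 connections wait, 100 fit in the queue, 154 are dropped or time out.
print(overflow(256, 2, 100))
```

This is why raising the backlog (uwsgi's `-l`/`--listen`, together with the kernel's `net.core.somaxconn`) is the first tuning step for this kind of benchmark.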


> Thus, to increase the performance, one must distribute the load
> amongst more than one httprouter (--http-processes 2), or perhaps use a
> different 'router' such as nginx using the uwsgi protocol. What do you
> think? Is my thinking/analysis/approach wrong? I'm open to suggestions.

The httprouter passes requests to uWSGI workers via the uwsgi protocol. In
terms of performance it should map roughly 1:1 with nginx (and only because
it is way simpler than nginx; nginx's parser is better for sure).
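For context, the uwsgi protocol that both the httprouter and nginx's uwsgi_pass speak to the workers is a very simple binary framing: a 4-byte header (modifier1, a 16-bit payload size, modifier2) followed by length-prefixed key/value request variables, with the sizes little-endian on the usual x86 setups. A sketch of an encoder, illustrative only:

```python
import struct

def encode_uwsgi_packet(env, modifier1=0, modifier2=0):
    """Encode WSGI-style request vars as a uwsgi-protocol packet:
    header = <u8 modifier1><u16le datasize><u8 modifier2>,
    body   = repeated <u16le klen><key><u16le vlen><value>."""
    body = b""
    for key, value in env.items():
        k, v = key.encode("latin-1"), value.encode("latin-1")
        body += struct.pack("<H", len(k)) + k
        body += struct.pack("<H", len(v)) + v
    return struct.pack("<BHB", modifier1, len(body), modifier2) + body

pkt = encode_uwsgi_packet({"REQUEST_METHOD": "GET", "PATH_INFO": "/"})
```

The point of the comparison above is that parsing this framing is much cheaper than parsing HTTP, which is why the router-to-worker hop costs so little.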


>
> Is there a way to use multiple worker processes without a router?
> Basically, is there a way that does the accept()/epoll()/read() from the
> network and then in the same process executes the python code? That seems
> like that might be the fastest because it would eliminate the dispatch
> from
> the router process to the worker process. I have a feeling that
> gunicorn+meinheld might be doing this, but I haven't read the code to
> verify.
>

I do not follow you here; that is the standard way uWSGI works. Even with
the httprouter, the backend workers share the socket. That is the reason
why --thunder-lock is needed in high-load scenarios.
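To make the "workers share the socket" point concrete: multiple accept()ors on one listening socket is plain OS behavior. A sketch below uses threads standing in for uWSGI's forked workers (uWSGI forks after binding, so children inherit the same socket; --thunder-lock then serializes the wake-ups):

```python
import socket
import threading

def serve(listener, handled):
    # Each "worker" blocks in accept() on the SAME listening socket,
    # just as uWSGI's forked workers share the socket bound by the master.
    conn, _ = listener.accept()
    with conn:
        conn.sendall(b"hello from worker\n")
        handled.append(threading.current_thread().name)

listener = socket.create_server(("127.0.0.1", 0))
handled = []
workers = [threading.Thread(target=serve, args=(listener, handled), name=f"w{i}")
           for i in range(2)]
for w in workers:
    w.start()

# Two client connections; the kernel hands each one to exactly one worker.
for _ in range(2):
    with socket.create_connection(listener.getsockname()) as c:
        assert c.recv(64) == b"hello from worker\n"

for w in workers:
    w.join()
```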


-- 
Roberto De Ioris
http://unbit.it
_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
