[issue27906] Socket accept exhaustion during high TCP traffic

2016-08-30 Thread kevinconway

New submission from kevinconway:

My organization noticed this issue after launching several asyncio services 
that would receive either a sustained high number of incoming connections or 
regular bursts of traffic. Our monitoring showed a loss of between 4% and 6% of 
all incoming requests. On the client side we see a socket read error 
"Connection reset by peer". On the asyncio side, with debug turned on, we see 
nothing.

After some more investigation we determined asyncio was not calling 'accept()' 
on the listening socket fast enough. To further test this we put together 
several hello-world type examples and put them under load. I've attached the 
project we used for testing. Included are three Dockerfiles that will run the 
service under different configurations: one runs it as a plain aiohttp 
service, another runs the aiohttp worker behind gunicorn, and the third runs 
the aiohttp service with the proposed asyncio patch in place. For our testing 
we used 'wrk' to generate traffic and collect data on the OS/socket errors.

For anyone attempting to recreate our experiments, we ran three test 
batteries against the service for each endpoint using:

wrk --duration 30s --timeout 10s --latency --threads 2 --connections 10 
wrk --duration 30s --timeout 10s --latency --threads 2 --connections 100 
wrk --duration 30s --timeout 10s --latency --threads 2 --connections 1000 

The endpoints most valuable for us to test were the ones that replicated some 
of our production logic (a rough sketch of these handlers follows the list below):

/                 # Hello World
/sleep?time=100   # Every request is delayed by 100 milliseconds and returns an HTML message
/blocking/inband  # Every request performs a bcrypt with complexity 10, doing the CPU-blocking work on the event loop thread

Our results varied based on the available CPU cycles, but we consistently 
recreated the socket read errors from production using the above tests.

Our proposed solution, attached as a patch file, is to put the socket.accept() 
call in a loop bounded by the listening socket's backlog. We use the backlog 
value as an upper bound to prevent the reverse problem: starving active 
coroutines while the event loop keeps accepting new connections without 
yielding. With the proposed patch in place, the losses disappeared.

For further comparison, we ran similar tests against Twisted and encountered no 
loss. Reviewing its socket accept logic, we found that Twisted already runs 
accept() in a bounded loop to prevent this issue 
(https://github.com/twisted/twisted/blob/trunk/src/twisted/internet/tcp.py#L1028).

--
components: asyncio
files: testservice.zip
messages: 273989
nosy: gvanrossum, haypo, kevinconway, yselivanov
priority: normal
severity: normal
status: open
title: Socket accept exhaustion during high TCP traffic
versions: Python 3.4, Python 3.5, Python 3.6
Added file: http://bugs.python.org/file44286/testservice.zip

[issue27906] Socket accept exhaustion during high TCP traffic

2016-08-30 Thread kevinconway

kevinconway added the comment:

Attaching the patch file.

--
keywords: +patch
Added file: http://bugs.python.org/file44287/multi-accept.patch

[issue27906] Socket accept exhaustion during high TCP traffic

2016-08-30 Thread kevinconway

kevinconway added the comment:

I'll dig into the existing asyncio unit tests and see what I can come up with. 
I'm not yet sure exactly what I might test for.

The variables involved in reproducing the error are mostly environmental: host 
CPU speed, the amount of CPU-bound work happening in handler coroutines, and 
the rate of new connections are the major contributors we've identified. I'm 
not sure how I might simulate those in a unit test.

Would it be sufficient to add a test that ensures _accept_connection calls 
.accept() on the listening socket 'backlog' times in the event there are no 
OS errors?
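
For example, something along these lines might work; it is written against the 
simplified accept_ready/BACKLOG sketch from my first message rather than the 
real asyncio internals, which the actual test would need to target:

    # Sketch of a test asserting accept() is attempted 'backlog' times when
    # no OS error occurs. Assumes accept_ready and BACKLOG from the earlier
    # sketch are importable; the real test would exercise _accept_connection.
    import unittest
    from unittest import mock

    class BoundedAcceptTest(unittest.TestCase):
        def test_accepts_up_to_backlog(self):
            listener = mock.Mock()
            listener.accept.return_value = (mock.Mock(), ("127.0.0.1", 12345))
            handler = mock.Mock()

            accept_ready(listener, handler)

            self.assertEqual(listener.accept.call_count, BACKLOG)
            self.assertEqual(handler.call_count, BACKLOG)

    if __name__ == "__main__":
        unittest.main()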

--

[issue27906] Socket accept exhaustion during high TCP traffic

2016-08-30 Thread kevinconway

kevinconway added the comment:

I've added a unit test to the patch that asserts sock.accept() is called the 
appropriate number of times.

Worth noting: I had to adjust one of the existing tests to account for the new 
backlog argument. The argument has a default value to preserve backwards 
compatibility for any callers, but the mock used in the test did not tolerate 
the extra argument.
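
For illustration, here is the shape of the change and why a strictly specced 
mock needs updating; the function names and the default value are placeholders, 
not the exact patch:

    # Illustrative only: adding a keyword argument with a default keeps
    # existing call sites working, but a mock autospecced from the old
    # signature rejects the new argument until the test is updated.
    from unittest import mock

    def accept_cb_old(protocol_factory, sock):
        ...

    def accept_cb_new(protocol_factory, sock, backlog=100):
        ...

    strict = mock.create_autospec(accept_cb_old)
    strict(object(), object())                    # fine
    # strict(object(), object(), backlog=100)     # TypeError: unexpected kwarg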

--
Added file: http://bugs.python.org/file44289/multi-accept-2.patch

[issue27906] Socket accept exhaustion during high TCP traffic

2016-09-01 Thread kevinconway

kevinconway added the comment:

Added a comment to the .accept() loop with a reference to the issue.

--
Added file: http://bugs.python.org/file44321/multi-accept-3.patch
