Hi Maxim,

Thanks for the suggestion! It looks like packet drop is the culprit here. The initial SYN packet doesn't receive a corresponding SYN-ACK from the upstream servers, so after a 1-second timeout (the TCP retransmission timeout, RTO), the SYN is retransmitted. The question is still *why* this only occurs through nginx.
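In case anyone wants to check for the same signature without wading through ab output, here's a minimal sketch (not the harness I actually ran; the port, concurrency, and 900ms threshold are placeholders) that opens a batch of concurrent TCP connections and counts how many take ~1s or more to establish, which is what a dropped SYN hitting the initial RTO looks like:

    // connect-timing sketch: open N concurrent TCP connections and count
    // how many take ~1s or more to establish (i.e. a retransmitted SYN).
    var net = require('net');

    var PORT = 8080;         // placeholder: whatever nginx is listening on
    var CONCURRENCY = 5000;  // matches the ab -c 5000 runs
    var slow = 0, done = 0;

    function finish() {
      if (++done === CONCURRENCY) {
        console.log(slow + ' of ' + done + ' connects took >= 900ms');
      }
    }

    for (var i = 0; i < CONCURRENCY; i++) {
      (function () {
        var start = process.hrtime();
        var sock = net.connect(PORT, '127.0.0.1', function () {
          var d = process.hrtime(start);
          if (d[0] * 1000 + d[1] / 1e6 >= 900) slow++;  // ~1s => SYN retransmit
          sock.end();
          finish();
        });
        sock.on('error', finish);
      })();
    }

You'll likely need to raise the open-file limit (ulimit -n) to run it at 5,000 connections. Pointing it at the upstream directly and then at nginx should show whether the ~1s connects only appear with nginx in the middle.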
To further narrow down the root cause, I moved my upstream server onto the same machine as nginx, and the issue can still be replicated there. To eliminate my upstream server as the cause (it's written in C with libevent, by the way), I used the nodejs hello-world demo instead (a rough sketch of it is at the bottom of this mail); nodejs has trouble with the 5,000 concurrent connections (go figure), but the connections that are successful (without nginx reverse proxying) all complete in less than one second. When I place nginx between ApacheBench and nodejs, that 1-second TCP RTO shows up again.

To reiterate, this is all happening on a single machine; the TCP stack is involved, but no physical network. The only common denominator is nginx.

On Fri, Mar 15, 2013 at 1:20 AM, Maxim Dounin <mdou...@mdounin.ru> wrote:
> Hello!
>
> On Thu, Mar 14, 2013 at 07:07:20PM -0700, Jay Oster wrote:
>
> [...]
>
> > The access log has 10,000 lines total (i.e. two of these tests with 5,000
> > concurrent connections), and when I sort by upstream_response_time, I get a
> > log with the first 140 lines having about 1s on the upstream_response_time,
> > and the remaining 9,860 lines show 700ms and less. Here's a snippet showing
> > the strangeness, starting with line numbers:
> >
> > 1: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
> > 2: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
> > 3: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 1.026 1.025 234 83
> > ...
> > 138: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 1.000 0.999 234 81
> > 139: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
> > 140: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
> > 141: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
> > 142: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
> > 143: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
> > ...
> > 9998: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
> > 9999: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
> > 10000: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.122 0.002 234 81
> >
> > The upstream_response_time difference between line 140 and 141 is nearly
> > 500ms! The total request_time also displays an interesting gap of almost
> > 300ms. What's going on here?
>
> I would suggests there are packet loss and retransmits for some
> reason. Try tcpdump'ing traffic between nginx and backends to see
> what goes on in details.
>
> --
> Maxim Dounin
> http://nginx.org/en/donation.html
>
> _______________________________________________
> nginx mailing list
> nginx@nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx
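For completeness, the nodejs upstream mentioned above is just the standard hello-world demo, roughly the following (the port is a placeholder; nginx's proxy_pass points at it):

    // stock nodejs hello-world upstream (sketch); port is a placeholder
    var http = require('http');

    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'text/plain' });
      res.end('Hello World\n');
    }).listen(8000, '127.0.0.1');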
_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx