Stephen Hemminger wrote:
On Fri, 10 Mar 2006 16:16:07 -0800
Rick Jones <[EMAIL PROTECTED]> wrote:
I would have thought that byte based growth of the CWND would have meant
that the ACK's above would have allowed more bytes to flow, yet more
bytes are not flowing. That makes it seem like cwnd isn't strictly
bytes, but is also tracked in terms of number of outstanding segments.
Linux cwnd is in packets.
How is the ABC cwnd of bytes mapped to packets? Does it only go up by
one packet after an MSS has been ACKed then?
/*
 * Linear increase during slow start
 */
void tcp_slow_start(struct tcp_sock *tp)
{
	if (sysctl_tcp_abc) {

ah, so there is a sysctl to turn this off :)

		/* RFC3465: Slow Start
		 * TCP sender SHOULD increase cwnd by the number of
		 * previously unacknowledged bytes ACKed by each incoming
		 * acknowledgment, provided the increase is not more than L
		 */
		if (tp->bytes_acked < tp->mss_cache)
			return;

And it only increases cwnd after a full MSS has been ACKed, which IIRC is
not part of the ABC RFC.

		/* We MAY increase by 2 if discovered delayed ack */
		if (sysctl_tcp_abc > 1 && tp->bytes_acked > 2*tp->mss_cache) {
			if (tp->snd_cwnd < tp->snd_cwnd_clamp)
				tp->snd_cwnd++;
		}
	}
	tp->bytes_acked = 0;

	if (tp->snd_cwnd < tp->snd_cwnd_clamp)
		tp->snd_cwnd++;
}
Think of the congestion window as a measurement of the available sewer pipe.
If everyone's idea of the congestion window is too large, then the sewer pipe
would back up and everything would overflow.
Small packets are like a leaky faucet dripping: just because a drip goes
down the drain doesn't tell you much about the available pipe diameter.
I agree that if I can have five drips outstanding I should not be able
to then put five buckets out there, but should I have to exchange
another 1460 drips before I can have six drips outstanding?
The drips count for nothing as far as congestion is concerned when
we need to count toilet bowls (enough with this analogy)...
I didn't take us here :)
I got the impression that ABC was written with a byte-based cwnd in mind,
not a packet-based one, which makes me wonder if the mapping to a packet
cwnd above is really "correct". I really don't think that ABC or any of the
cwnd stuff meant that to go from five single-byte packets outstanding at
one time to six, one first has to send 1460 single-byte packets. And this
application is caught in the middle of an attempt to map a byte cwnd onto
a packet cwnd.
IIRC, all (well, most of) the RFCs talk about the cwnd in bytes because at
the time VJ did his work (in units of packets/segments), none of the common
stacks (MPE did :) actually knew how many segments they had outstanding
at any one time. So we got the "increase by an MSS on each ACK"
heuristic - it didn't overly penalize bulk transfers. It was a proxy
for tracking segments, and given the existence of the ABC RFC we can
assume not all that good a proxy.
In the original VJ paper, when a packet was known to have left the
network, the stack was free to replace it and add another. The queues
in the network are (as near as I can tell) in units of packets, not in
units of bytes.
But in the code above, it is doing cwnd in packets but being _really_
conservative in a "conservation of packets" sense, by only increasing
the packet cwnd by one after a full MSS has been acked. That is much,
Much, MUCH more conservative than the original heuristic. And much much
more conservative than I think ABC was looking to be.
So, seems we don't want too many packets out there, but we also don't
want too many bytes out there, which seems to mean there needs to be two
cwnds - a packet cwnd and a byte cwnd. The packet cwnd increases based on
knowledge of packets having left the network, the byte cwnd based on how
many bytes have left the network. Then the small packet application can get
its cwnd grown in reasonable time, and still not be able to dump a
boatload of bytes onto the network, and the large packet application
will get its cwnd grown in a reasonable time and still not be able to
generate some massive spike of small packets onto the network.
Admittedly, this specific application is a bad client for the case I'm
trying to make, but if it were properly putting each message to the
transport in one call, while trying to have five of them outstanding at a
time, I get the impression it would be a very long time before it could
get all five outstanding. I guess netperf TCP_RR built with
--enable-burst would be one way to check that.
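For reference, burst mode in netperf is a compile-time option; something along these lines (host name is a placeholder) should approximate the five-outstanding-transactions case:

```shell
# Build netperf with burst-mode support enabled at configure time
./configure --enable-burst && make

# Single-byte request/response, with up to 4 additional transactions
# kept in flight ("testhost" is illustrative)
netperf -t TCP_RR -H testhost -- -r 1,1 -b 4
```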
The other problem this application has is that by the time it builds up
enough bytes acked to open the congestion window, it goes back to sleep for
a long enough time for the window to be restarted.
Figures. Can't say as I've ever really liked restarting slow start
after idle to begin with. But that would be an entirely different discussion.
rick