David S. Miller wrote:
> From: Rick Jones <[EMAIL PROTECTED]>
> Date: Wed, 01 Feb 2006 15:50:38 -0800
>
> [ What sucks about this whole thread is that only folks like
> Jeff and myself are attempting to think and use our imagination
> to consider how some roadblocks might be overcome.... ]

My questions are meant to see if something is even a roadblock in the first
place.
>> If the TCP processing is put in the user context, that means there
>> is no more parallelism between the application doing its non-TCP
>> stuff and the TCP stuff for, say, the next request, which presently
>> could be processed on another CPU, right?

> There is no such implicit limitation, really.
>
> Consider the userspace mmap()'d ring buffer being tagged with, say,
> connection IDs. Say, file descriptors. In this way the kernel could
> dump into a single net channel for multiple sockets, and then the app
> can demux this stuff however it likes.
>
> In particular, things like HTTP would want this because web servers
> get lots of tiny requests and using a net channel per socket could
> be very wasteful.

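For concreteness, here is the kind of fd-tagged shared ring I picture from
that description. This is purely an illustration on my part; the structure
names, layout and polling loop are all made up, not code from any tree:

/* Hypothetical sketch of a single mmap()'d net channel shared by several
 * sockets: the kernel (producer) tags each entry with the owning file
 * descriptor, and the application (consumer) demuxes on that tag. */
#include <stdint.h>

struct netchan_entry {
	int32_t  fd;               /* which socket this entry belongs to */
	uint32_t len;              /* valid bytes in data[] */
	uint8_t  data[2048];
};

struct netchan_ring {
	volatile uint32_t head;    /* producer (kernel) index */
	volatile uint32_t tail;    /* consumer (application) index */
	uint32_t mask;             /* ring size - 1, power of two */
	struct netchan_entry slots[];  /* sized when the region is mmap()'d */
};

/* Application-side demux: hand each new entry to the per-connection
 * handler selected by the fd tag, however the app wants to do that. */
static void netchan_poll(struct netchan_ring *ring,
			 void (*handler)(int fd, const uint8_t *buf, uint32_t len))
{
	while (ring->tail != ring->head) {
		struct netchan_entry *e = &ring->slots[ring->tail & ring->mask];

		handler(e->fd, e->data, e->len);
		ring->tail++;
	}
}
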
I'm not meaning to talk about mux/demux of multiple connections; I'm asking
about where all the cycles are consumed and how that affects parallelism
between user space, "TCP/IP processing" and the NIC for a given
flow/connection/whatever.
Maybe I'm not sufficiently clued-in, but in broad handwaving terms, it seems
today that all three can be taking place in parallel for a given TCP connection.
The application is doing its application-level thing on request N on one CPU,
while request N+1 is being processed by TCP on another CPU, while the NIC is
DMA'ing request N+2 into the host.
If the processing is pushed all the way up to user space, will it be the case
that the single-threaded application code can be working on request N while the
TCP code is processing request N+1? That's what I'm trying to ask about.
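To make the question concrete, the pipeline I have in mind looks like this
toy producer/consumer pair. Purely illustrative: the "tcp_thread" here is
just a stand-in for wherever the protocol processing ends up, whether that
is a softirq today or a library thread in a net channel world.

#include <pthread.h>
#include <stdio.h>

#define NREQ 8

static int ready[NREQ];                 /* request -> "TCP work done" flag */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* Stage 2: protocol processing, one request after another. */
static void *tcp_thread(void *arg)
{
	(void)arg;
	for (int n = 0; n < NREQ; n++) {
		/* ... per-segment TCP work for request n would go here ... */
		pthread_mutex_lock(&lock);
		ready[n] = 1;
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* Stage 3: the application, consuming requests as they become ready. */
int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, tcp_thread, NULL);
	for (int n = 0; n < NREQ; n++) {
		pthread_mutex_lock(&lock);
		while (!ready[n])
			pthread_cond_wait(&cond, &lock);
		pthread_mutex_unlock(&lock);
		/* application work on request n can overlap with the "TCP"
		 * thread already chewing on request n+1 */
		printf("application work on request %d\n", n);
	}
	pthread_join(tid, NULL);
	return 0;
}

The point is simply that requests N and N+1 can be in different stages at
the same time only if those stages run in different contexts; whether a
user-space TCP would still arrange that is exactly what I'm asking.
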
I think the data I posted about saturating a GbE link bidirectionally with a
single TCP connection shows an example of exactly that parallelism being
exploited: the application doing its thing on request N while TCP processes
N+1 on another CPU and the NIC brings N+2 into RAM.
["Re: [RFC] Poor Network Performance with e1000 on 2.6.14.3" msg id
<[EMAIL PROTECTED]> ]
What I'm not sure of is if that fully matters. Hence the questions.
rick jones
So, other background... long ago and far away, HP-UX 10.20, which was BSDish
in its networking, had Inbound Packet Scheduling (IPS): the netisr hand-off
included a hash of the header info, and a per-CPU netisr was used for the
"TCP processing". That got HP-UX parallelism for multiple TCP connections
coming through a single NIC. It also meant that a single-threaded
application with multiple connections could have its inbound TCP processing
scattered across all the CPUs while it ran on only one of them. Cache lines
for socket structures going back and forth could indeed be a concern,
although moving a cache line from one CPU to another is not a priori evil
(although the threshold is rather high IMO).
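Just to pin down what I mean by IPS, the idea in rough C (an illustration
only, not HP-UX source; the hash and the names are made up):

#include <stdint.h>

#define NR_CPUS 8

struct flow4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* netisr hand-off: pick the CPU purely from the packet headers. */
static unsigned int ips_pick_cpu(const struct flow4 *f)
{
	/* any stable mix of the header fields will do; this one is made up */
	uint32_t h = f->saddr ^ f->daddr ^
		     ((uint32_t)f->sport << 16 | f->dport);

	h ^= h >> 16;
	h *= 0x45d9f3bu;
	h ^= h >> 16;
	return h % NR_CPUS;   /* same connection -> same netisr CPU, always */
}

The CPU choice is a pure function of the headers, independent of where the
application happens to be running; hence the cache-line concern above.
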
In HP-UX 11.X, IPS was replaced with Thread Optimized Packet Scheduling
(TOPS). There was still a netisr-like hand-off (although not as low in the
stack as I would have liked) where a lookup took place to find where the
application last accessed that connection (I think Solaris Fire Engine does
something very similar today). The idea there was that inbound processing
would take place wherever the application last accessed the socket. You
still get the advantage of multiple CPUs for multiple connections to
multiple threads, but at the price of losing one part of the app/TCP/NIC
parallelism. Both IPS and TOPS were successful in their day; I'm trying to
come to grips with which might be "better", if it is even possible to say
that one was better than the other.
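And TOPS in the same rough terms (again an illustration only, not HP-UX
source):

#include <stdint.h>

struct conn {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	int      last_app_cpu;   /* -1 until the app has touched the socket */
};

/* Socket layer, running in the application's context: remember where the
 * application last accessed this connection. */
static void tops_note_app_access(struct conn *c, int this_cpu)
{
	c->last_app_cpu = this_cpu;
}

/* Netisr-like hand-off for an inbound packet already matched to its
 * connection: follow the application instead of hashing the headers. */
static int tops_pick_cpu(const struct conn *c, int fallback_cpu)
{
	return c->last_app_cpu >= 0 ? c->last_app_cpu : fallback_cpu;
}

The protocol work lands on the application's CPU, which is exactly where
the trade-off shows up: good locality for the socket state, but that part
of the app/TCP overlap is gone.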