David S. Miller wrote:
> From: Rick Jones <[EMAIL PROTECTED]>
> Date: Wed, 01 Feb 2006 15:50:38 -0800

> [ What sucks about this whole thread is that only folks like
>   Jeff and myself are attempting to think and use our imagination
>   to consider how some roadblocks might be overcome.... ]

My questions are meant to see if something is even a roadblock in the first 
place.

> > If the TCP processing is put in the user context, that means there
> > is no more parallelism between the application doing its non-TCP
> > stuff and the TCP stuff for, say, the next request, which presently
> > could be processed on another CPU, right?


> There is no such implicit limitation, really.

> Consider the userspace mmap()'d ring buffer being tagged with, say,
> connection IDs.  Say, file descriptors.  In this way the kernel could
> dump into a single net channel for multiple sockets, and then the app
> can demux this stuff however it likes.
>
> In particular, things like HTTP would want this because web servers
> get lots of tiny requests, and using a net channel per socket could
> be very wasteful.
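Just so I'm parsing that correctly: I take it each entry in the shared
ring would carry a connection tag alongside the data, and the application
sorts on that tag.  A rough sketch of what I picture (every name here is
made up, and barriers/wrap handling are waved away):

    /* Hypothetical sketch: each entry in the mmap()'d ring carries a
     * connection tag so one ring can serve many sockets and the
     * application demuxes on the tag.  None of these names are real. */
    #include <stdint.h>

    #define RING_SLOTS 256
    #define SLOT_DATA  2032

    struct netchan_entry {
            uint32_t conn_id;        /* e.g. the socket's file descriptor */
            uint32_t len;            /* valid bytes in data[]             */
            uint64_t seq;            /* producer sequence number          */
            unsigned char data[SLOT_DATA];
    };

    struct netchan_ring {
            volatile uint64_t head;  /* advanced by the kernel (producer) */
            volatile uint64_t tail;  /* advanced by the app    (consumer) */
            struct netchan_entry slots[RING_SLOTS];
    };

    /* Supplied by the application: handle one request for one connection. */
    void handle_request(uint32_t conn_id, const unsigned char *buf,
                        uint32_t len);

    /* Single-threaded consumer: one ring, many connections, app-side demux. */
    static void drain_ring(struct netchan_ring *ring)
    {
            while (ring->tail != ring->head) {
                    struct netchan_entry *e =
                            &ring->slots[ring->tail % RING_SLOTS];

                    handle_request(e->conn_id, e->data, e->len);
                    ring->tail++;
            }
    }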

I'm not trying to talk about mux/demux of multiple connections; I'm asking where all the cycles are consumed and how that affects the parallelism between user space, "TCP/IP processing", and the NIC for a given flow/connection/whatever.

Maybe I'm not sufficiently clued-in, but in broad handwaving terms, it seems that today all three can take place in parallel for a given TCP connection: the application is doing its application-level thing on request N on one CPU, while request N+1 is being processed by TCP on another CPU, while the NIC is DMA'ing request N+2 into the host.

If the processing is pushed all the way up to user space, will it be the case that the single-threaded application code can be working on request N while the TCP code is processing request N+1? That's what I'm trying to ask about.
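To put the same question in toy-code form, here is a two-stage caricature
of the overlap I mean (just the app/TCP half; the NIC DMA stage is left
out, and none of this is real stack code):

    /* Toy model: a "tcp_side" thread (standing in for TCP processing on
     * another CPU) hands completed requests to a single-threaded
     * "application" thread through a tiny queue, so request N can be
     * consumed while N+1 is still being cooked.  Purely illustrative. */
    #include <pthread.h>
    #include <stdio.h>

    #define NREQ 8

    static int ready[NREQ];
    static int head, tail;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

    static void *tcp_side(void *arg)            /* the "TCP" CPU */
    {
            (void)arg;
            for (int n = 0; n < NREQ; n++) {
                    /* pretend: checksum, reassembly, ACKs for request n */
                    pthread_mutex_lock(&lock);
                    ready[head % NREQ] = n;
                    head++;
                    pthread_cond_signal(&cv);
                    pthread_mutex_unlock(&lock);
            }
            return NULL;
    }

    int main(void)                              /* the "application" CPU */
    {
            pthread_t t;

            pthread_create(&t, NULL, tcp_side, NULL);
            for (int n = 0; n < NREQ; n++) {
                    int req;

                    pthread_mutex_lock(&lock);
                    while (tail == head)
                            pthread_cond_wait(&cv, &lock);
                    req = ready[tail % NREQ];
                    tail++;
                    pthread_mutex_unlock(&lock);

                    /* app works on request req here while the other
                     * thread may already be cooking request req + 1 */
                    printf("app handles request %d\n", req);
            }
            pthread_join(t, NULL);
            return 0;
    }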

I think the data I posted about saturating a GbE link bidirectionally with a single TCP connection shows that kind of parallelism being exploited: the application doing its thing on request N while TCP processes N+1 on another CPU and the NIC brings N+2 into RAM.

["Re: [RFC] Poor Network Performance with e1000 on 2.6.14.3" msg id <[EMAIL PROTECTED]> ]

What I'm not sure of is whether that fully matters.  Hence the questions.

rick jones

So, other background: long ago and far away, HP-UX 10.20, which was BSDish in its networking, had Inbound Packet Scheduling (IPS). The netisr handoff included a hash of the header info, and a per-CPU netisr was used for the "TCP processing." That got HP-UX parallelism for multiple TCP connections coming through a single NIC. It also meant that a single-threaded application with multiple connections could have its inbound TCP processing scattered across all the CPUs while it ran on only one. Cache lines for socket structures going back and forth could indeed be a concern, although moving a cache line from one CPU to another is not a priori evil (though the threshold is rather high, IMO).

In HP-UX 11.X, IPS was replaced with Thread Optimized Packet Scheduling (TOPS). There was still a netisr-like handoff (although not as low in the stack as I would have liked), where a lookup found where the application last accessed that connection (I think Solaris Fire Engine does something very similar today). The idea was that inbound processing would take place wherever the application last accessed the socket. You still get advantage taken of multiple CPUs for multiple connections to multiple threads, but at the price of losing one part of the app/tcp/nic parallelism.

Both IPS and TOPS were successful in their days. I'm trying to come to grips with which might be "better" - if it is even possible to say that one was better than the other.
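In caricature, IPS and TOPS differ only in how they pick the CPU that
does the inbound work.  Something like this pseudo-C of my own (not
HP-UX source; the structs, flow_hash(), and lookup_connection() are all
made up):

    #include <stdint.h>

    struct flow_tuple { uint32_t saddr, daddr; uint16_t sport, dport; };
    struct sock_entry { int last_app_cpu; /* CPU of last socket call */ };

    /* assumed helpers, purely for illustration */
    unsigned int       flow_hash(const struct flow_tuple *ft);
    struct sock_entry *lookup_connection(const struct flow_tuple *ft);

    /* IPS (HP-UX 10.20): hash the headers, so a given connection always
     * lands on the same per-CPU netisr regardless of where the app runs. */
    static int ips_pick_cpu(const struct flow_tuple *ft, int ncpus)
    {
            return flow_hash(ft) % ncpus;
    }

    /* TOPS (HP-UX 11.X): do the inbound processing wherever the
     * application last touched the socket. */
    static int tops_pick_cpu(const struct flow_tuple *ft)
    {
            return lookup_connection(ft)->last_app_cpu;
    }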
