Hi,

On Mon, Mar 17, 2008 at 10:47:40AM +0100, Neal H. Walfield wrote:
> At Sun, 16 Mar 2008 08:21:19 +0100, <[EMAIL PROTECTED]> wrote:
> > On Wed, Mar 12, 2008 at 05:12:03PM +0100, Marcus Brinkmann wrote:
> > > As for the threading model, more than one kernel thread per real
> > > CPU doesn't seem to make much sense in most cases.
> >
> > Well, add a "per processing step" to make this statement more
> > generally true. In some cases, it's useful to have a distinct set of
> > worker threads for each processing step, working in an
> > assembly-line-like fashion: each thread picks a request from one
> > wait queue, does its work, and pushes it to the next wait queue.
> >
> > This model specifically should improve performance for servers that
> > are constantly busy, processing many requests in parallel; and under
> > the condition that the amount of data touched in the processing is
> > relatively small compared to the amount of code.
> >
> > It also simplifies control flow and locking. Certain optimisations
> > become obvious.
>
> What sort of optimizations? It seems to me that if you have a small
> amount of data, then you'll be able to run almost entirely out of
> the L2 cache. In this case, switching to another thread will
> introduce the overhead of shifting the cache to the other CPU.

Of course. But as we have one thread per CPU for each processing step,
it is perfectly possible to have a request go through all processing
steps without ever changing CPU. Whether this is always really
desirable is another question...
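To make this a bit more concrete, here is a rough sketch of the kind of
setup I have in mind, using plain POSIX threads. All the names
(request_t, stage_t, the work callbacks) are made up for illustration;
in the full variant you would start one such worker per processing step
*per CPU* and pin it to that CPU (e.g. with pthread_setaffinity_np() on
GNU/Linux), so that a request can pass through all steps without
migrating:

    /* Rough sketch only -- all names are made up for illustration. */

    #include <pthread.h>
    #include <stddef.h>

    typedef struct request {
        struct request *next;
        /* ... request payload ... */
    } request_t;

    /* A simple blocking wait queue connecting two processing steps. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t nonempty;
        request_t *head, *tail;
    } queue_t;

    static void queue_push(queue_t *q, request_t *r)
    {
        pthread_mutex_lock(&q->lock);
        r->next = NULL;
        if (q->tail)
            q->tail->next = r;
        else
            q->head = r;
        q->tail = r;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    static request_t *queue_pop(queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        while (!q->head)
            pthread_cond_wait(&q->nonempty, &q->lock);
        request_t *r = q->head;
        q->head = r->next;
        if (!q->head)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        return r;
    }

    /* One processing step: its input queue, its output queue (NULL for
       the last step), and the code doing the actual work. */
    typedef struct {
        queue_t *in, *out;
        void (*work)(request_t *);
    } stage_t;

    /* Worker thread for one step: pick a request from this step's wait
       queue, do the work, push it on to the next step's queue. */
    static void *stage_worker(void *arg)
    {
        stage_t *s = arg;
        for (;;) {
            request_t *r = queue_pop(s->in);
            s->work(r);
            if (s->out)
                queue_push(s->out, r);
            /* else: request finished -- send the reply, free it, ... */
        }
        return NULL;
    }

Each of these workers would simply be started with pthread_create(),
passing the stage_t describing its step.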
Note that this approach could also be implemented using just a single
kernel thread per CPU, and some kind of userspace scheduler to switch
between the various processing steps. While this variant gives more
direct control, it seems to me that it's more complicated... Not sure
which is preferable.

In any case, the general advantage of this approach is that rather
than having a single code path handle all stages of each request in
succession before switching to the next request, a particular piece of
code handles a certain stage of many requests in a loop, before
scheduling sets in and another stage is handled in another loop. This
improves code locality, as well as data locality for any kind of data
shared between requests, at the expense of locality in data private to
the individual requests. In many cases, this can be a considerable win.

(Especially on Pentium 4 and related "Netburst" architecture
processors, where it's very expensive to reload the trace cache -- a
first-level code cache storing pretranslated instructions. Note that
the next generation of Intel processors, to be introduced by the end
of this year -- unlike the current "Core 2" models -- will be based on
an enhanced Netburst architecture again, and thus reintroduce the
trace cache...)

Besides this inherent advantage, having the code that handles an
individual stage run in a relatively tight, regular loop can prompt
further optimizations that are not possible, not useful, or at least
less obvious otherwise -- stuff like loop unrolling, using data-driven
algorithms instead of branchy code, etc.

Together with the increased code (and partially data) locality, this
allows for much better use of parallelism in the processor -- it helps
compensate for the fact that modern processors are extremely fast at
executing optimized regular loops, but very poor at typical
application code that is irregular and full of stalls.

I'm not sure how much difference the additional optimizations can
make; but at least the locality part can be very decisive.

-antrik-
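P.S.: For comparison, the single-kernel-thread-per-CPU variant
mentioned above would look something like the loop below. Again just a
hypothetical sketch, reusing the stage_t/queue_t types from the sketch
above and assuming a non-blocking queue_try_pop(), plus made-up NSTAGES
and BATCH constants. It also shows the point about one stage being
handled for many requests in a tight loop before moving on:

    /* One such loop runs on each CPU, acting as a trivial userspace
       scheduler over the per-stage queues from the sketch above. */

    #define NSTAGES 4      /* number of processing steps (made up) */
    #define BATCH   32     /* requests handled per stage before moving on */

    extern stage_t stages[NSTAGES];
    request_t *queue_try_pop(queue_t *q);   /* non-blocking; NULL if empty */

    static void cpu_loop(void)
    {
        for (;;) {
            for (int s = 0; s < NSTAGES; s++) {
                /* Handle up to BATCH requests of this one stage in a
                   row, so its code and any shared data stay hot in the
                   caches before we switch to the next stage. */
                for (int i = 0; i < BATCH; i++) {
                    request_t *r = queue_try_pop(stages[s].in);
                    if (!r)
                        break;
                    stages[s].work(r);
                    if (stages[s].out)
                        queue_push(stages[s].out, r);
                }
            }
            /* If all queues were empty, block here waiting for new
               requests instead of spinning. */
        }
    }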