From: Tom Herbert
...
> > If nothing else reducing the size of this main loop may be desirable.
> > I know the newer x86 is supposed to have a loop buffer so that it can
> > basically loop on already decoded instructions. Normally it is only
> > something like 64 or 128 bytes in size though. You might find that
> > reducing this loop to that smaller size may improve the performance
> > for larger payloads.
>
> I saw 128 to be better in my testing. For large packets this loop does
> all the work. I see performance dependent on the amount of loop
> overhead, i.e. we got it down to two non-adcq instructions but it is
> still noticeable. Also, this helps a lot on sizes up to 128 bytes
> since we only need to do single call in the jump table and no trip
> through the loop.

But one of your 'loop overhead' instructions is 'loop'.
Look at http://www.agner.org/optimize/instruction_tables.pdf:
you don't want to be using 'loop' on Intel CPUs.

You might get some benefit from pipelining the loop (so you do a read
to a register in one iteration and a register-register adc in the next).

	David
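
To make the pipelining suggestion concrete, here is a rough sketch in
portable C rather than the actual assembly under discussion. The names
add_carry64, csum_words_simple and csum_words_pipelined are hypothetical
and are not taken from the patch. Each iteration issues the load for the
next word and then does a register-register add of the word loaded in
the previous iteration. The loop counter is a plain compare-and-branch,
which a compiler will not turn into 'loop'. A compiler may well
reschedule this on its own, so any gain would have to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Add with end-around carry: the portable equivalent of an adc whose
 * carry-out is folded back into the sum.
 */
static uint64_t add_carry64(uint64_t sum, uint64_t v)
{
    sum += v;
    return sum + (sum < v);    /* (sum < v) is 1 exactly when the add wrapped */
}

/* Straightforward version: load and add in the same iteration. */
uint64_t csum_words_simple(const uint64_t *p, size_t nwords)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i < nwords; i++)
        sum = add_carry64(sum, p[i]);
    return sum;
}

/*
 * Software-pipelined version (hypothetical sketch): the load for the
 * next word is issued one iteration before the add that consumes it.
 */
uint64_t csum_words_pipelined(const uint64_t *p, size_t nwords)
{
    uint64_t sum = 0;
    uint64_t next;
    size_t i;

    if (nwords == 0)
        return 0;

    next = p[0];                        /* prologue: first load */
    for (i = 1; i < nwords; i++) {
        uint64_t cur = next;            /* word loaded last iteration */
        next = p[i];                    /* load for the next iteration */
        sum = add_carry64(sum, cur);    /* register-register add */
    }
    return add_carry64(sum, next);      /* epilogue: consume the last load */
}
```

Both functions return the same 64-bit ones' complement sum. Folding it
down to 16 bits is left out, as it is outside the main loop.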