> On Aug 7, 2020, at 12:21 PM, Linus Torvalds <torva...@linux-foundation.org> 
> wrote:
> 
> On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <l...@amacapital.net> wrote:
>> 4 cycles per byte on Core 2
> 
> I took the reference C implementation as-is, and just compiled it with
> O2, so my numbers may not be what some heavily optimized case does.
> 
> But it was way more than that, even when amortizing for "only need to
> do it every 8 cases". I think the 4 cycles/byte might be some "zero
> branch mispredicts" case when you've fully unrolled the thing, but
> then you'll be taking I$ misses out of the wazoo, since by definition
> this won't be in your L1 I$ at all (only called every 8 times).
> 
> Sure, it might look ok on microbenchmarks where it does stay hot in the
> cache all the time, but that's not realistic. I

No one said we have to do only one ChaCha20 block per slow path hit.  In fact,
the more we reduce the number of rounds, the larger the share of time we spend
on I$ misses, branch mispredictions, etc., so reducing rounds may be barking up
the wrong tree
entirely.  We probably don’t want to have more than one page 

I wonder if AES-NI adds any value here.  AES-CTR is almost a drop-in 
replacement for ChaCha20, and maybe the performance for a cache-cold short run 
is better.
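
To make the batching point concrete, here is a rough userspace sketch of
generating several ChaCha20 blocks per slow-path hit, so the cold-cache cost
is paid once per batch instead of once per block.  This is not kernel code and
not a proposal for the actual patch: batch_rng, BATCH_BLOCKS and the function
names are made up for illustration, the batch size of 4 is arbitrary, and
things a real implementation needs (rekeying, wiping consumed output, per-CPU
state) are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS  16                 /* one ChaCha block = 16 x 32-bit words */
#define BATCH_BLOCKS 4                  /* hypothetical batch size */
#define BATCH_WORDS  (BLOCK_WORDS * BATCH_BLOCKS)

static uint32_t rotl32(uint32_t v, int n)
{
    return (v << n) | (v >> (32 - n));
}

#define QR(a, b, c, d) do {                 \
    a += b; d ^= a; d = rotl32(d, 16);      \
    c += d; b ^= c; b = rotl32(b, 12);      \
    a += b; d ^= a; d = rotl32(d, 8);       \
    c += d; b ^= c; b = rotl32(b, 7);       \
} while (0)

/* ChaCha block function; rounds must be even (20 for ChaCha20). */
static void chacha_block(const uint32_t in[16], uint32_t out[16], int rounds)
{
    uint32_t x[16];
    int i;

    assert(rounds % 2 == 0);
    memcpy(x, in, sizeof(x));
    for (i = 0; i < rounds; i += 2) {
        /* column round */
        QR(x[0], x[4], x[8],  x[12]);
        QR(x[1], x[5], x[9],  x[13]);
        QR(x[2], x[6], x[10], x[14]);
        QR(x[3], x[7], x[11], x[15]);
        /* diagonal round */
        QR(x[0], x[5], x[10], x[15]);
        QR(x[1], x[6], x[11], x[12]);
        QR(x[2], x[7], x[8],  x[13]);
        QR(x[3], x[4], x[9],  x[14]);
    }
    for (i = 0; i < 16; i++)
        out[i] = x[i] + in[i];
}

struct batch_rng {
    uint32_t state[16];         /* constants, key, counter, nonce */
    uint32_t buf[BATCH_WORDS];  /* output of the last refill */
    unsigned pos;               /* next unread word; BATCH_WORDS = empty */
    unsigned long refills;      /* slow-path hits, for measurement only */
};

static void batch_rng_init(struct batch_rng *r, const uint32_t key[8])
{
    static const uint32_t sigma[4] = {
        0x61707865, 0x3320646e, 0x79622d32, 0x6b206574
    };

    memcpy(&r->state[0], sigma, sizeof(sigma));
    memcpy(&r->state[4], key, 8 * sizeof(uint32_t));
    memset(&r->state[12], 0, 4 * sizeof(uint32_t));  /* counter and nonce */
    r->pos = BATCH_WORDS;
    r->refills = 0;
}

/* Slow path: produce BATCH_BLOCKS blocks in one go. */
static void batch_rng_refill(struct batch_rng *r)
{
    int b;

    for (b = 0; b < BATCH_BLOCKS; b++) {
        chacha_block(r->state, &r->buf[b * BLOCK_WORDS], 20);
        r->state[12]++;         /* block counter; wraparound ignored here */
    }
    r->pos = 0;
    r->refills++;
}

/* Fast path: hand out one buffered word, refilling only when empty. */
static uint32_t batch_rng_u32(struct batch_rng *r)
{
    if (r->pos >= BATCH_WORDS)
        batch_rng_refill(r);
    return r->buf[r->pos++];
}
```

With BATCH_BLOCKS set to 4, the slow path runs once per 64 calls to
batch_rng_u32() instead of once per 16, at the cost of 256 bytes of buffered
output sitting around between refills.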
