> -----Original Message-----
> From: Linus Torvalds <[email protected]>
> Sent: Friday, September 27, 2019 4:06 AM
> To: Pascal Van Leeuwen <[email protected]>
> Cc: Ard Biesheuvel <[email protected]>;
>     Linux Crypto Mailing List <linux-[email protected]>;
>     Linux ARM <[email protected]>; Herbert Xu <[email protected]>;
>     David Miller <[email protected]>; Greg KH <[email protected]>;
>     Jason A . Donenfeld <[email protected]>; Samuel Neves <[email protected]>;
>     Dan Carpenter <[email protected]>; Arnd Bergmann <[email protected]>;
>     Eric Biggers <[email protected]>; Andy Lutomirski <[email protected]>;
>     Will Deacon <[email protected]>; Marc Zyngier <[email protected]>;
>     Catalin Marinas <[email protected]>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for
>     packet encryption
>
> On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
> <[email protected]> wrote:
> >
> > But even the CPU only thing may have several implementations, of which
> > you want to select the fastest one supported by the _detected_ CPU
> > features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)
> > Do you think this would still be efficient if that would be some
> > large if-else tree? Also, such a fixed implementation wouldn't scale.
>
> Just a note on this part.
>
> Yes, with retpoline a large if-else tree is actually *way* better for
> performance these days than even just one single indirect call. I
> think the cross-over point is somewhere around 20 if-statements.
>
Yikes, that is just _horrible_ :-(
_However_, there are many CPU architectures out there that _don't_ need
the retpoline mitigation and would be unfairly penalized by the deep
if-else tree (as opposed to the indirect branch) for a problem they did
not cause in the first place.

Wouldn't it be fairer to impose the penalty on the CPUs actually
_causing_ this problem? Especially since those are generally the more
powerful CPUs anyway, which would suffer the least from the overhead.

> But those kinds of things also are things that we already handle well
> with instruction rewriting, so they can actually have even less of an
> overhead than a conditional branch. Using code like
>
>    if (static_cpu_has(X86_FEATURE_AVX2))
>
> actually ends up patching the code at run-time, so you end up having
> just an unconditional branch. Exactly because CPU feature choices
> often end up being in critical code-paths where you have
> one-or-the-other kind of setup.
>
> And yes, one of the big users of this is very much the crypto library code.
>
Ok, I didn't know about that. So I suppose we could have something like

   if (static_soc_has(HW_CRYPTO_ACCELERATOR_XYZ)) ...

Hmmm ...

> The code to do the above is disgusting, and when you look at the
> generated code you see odd unreachable jumps and what looks like a
> slow "bts" instruction that does the testing dynamically.
>
> And then the kernel instruction stream gets rewritten fairly early
> during the boot depending on the actual CPU capabilities, and the
> dynamic tests get overwritten by a direct jump.
>
> Admittedly I don't think the arm64 people go to quite those lengths,
> but it certainly wouldn't be impossible there either. It just takes a
> bit of architecture knowledge and a strong stomach ;)
>
>                  Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
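
For illustration only, the two dispatch styles discussed in this thread
can be sketched in plain user-space C. This is a minimal sketch under
stated assumptions, not kernel code: the feature bits, the sum_*()
functions and last_impl are all hypothetical names made up for the
example, and the "SSE" and "AVX2" variants are stand-ins that reuse the
generic loop. It does not show the run-time patching done by
static_cpu_has(); it only contrasts an if-else chain of direct calls
with a single indirect call through a function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; a real system would fill these in from
 * CPU/SoC detection at boot. */
enum { FEAT_SSE = 1u << 0, FEAT_AVX2 = 1u << 1 };

/* Records which variant ran last, purely so the example is observable. */
enum impl_id { IMPL_GENERIC, IMPL_SSE, IMPL_AVX2 };
enum impl_id last_impl = IMPL_GENERIC;

typedef uint32_t (*sum_fn)(const uint8_t *buf, size_t len);

/* Plain byte sum shared by all stand-in variants. */
static uint32_t sum_bytes(const uint8_t *buf, size_t len)
{
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++)
        s += buf[i];
    return s;
}

uint32_t sum_generic(const uint8_t *buf, size_t len)
{
    last_impl = IMPL_GENERIC;
    return sum_bytes(buf, len);
}

/* Stand-in: a real variant would use SSE instructions here. */
uint32_t sum_sse(const uint8_t *buf, size_t len)
{
    last_impl = IMPL_SSE;
    return sum_bytes(buf, len);
}

/* Stand-in: a real variant would use AVX2 instructions here. */
uint32_t sum_avx2(const uint8_t *buf, size_t len)
{
    last_impl = IMPL_AVX2;
    return sum_bytes(buf, len);
}

/* Style 1: if-else chain, best feature first. Every call is a direct
 * call, so no indirect branch (and no retpoline) is involved. */
uint32_t sum_branch(unsigned features, const uint8_t *buf, size_t len)
{
    if (features & FEAT_AVX2)
        return sum_avx2(buf, len);
    if (features & FEAT_SSE)
        return sum_sse(buf, len);
    return sum_generic(buf, len);
}

/* Style 2: pick the implementation once, then call through a pointer.
 * Each later call is an indirect branch. */
sum_fn sum_impl = sum_generic;

void sum_init(unsigned features)
{
    if (features & FEAT_AVX2)
        sum_impl = sum_avx2;
    else if (features & FEAT_SSE)
        sum_impl = sum_sse;
    else
        sum_impl = sum_generic;
}

uint32_t sum_indirect(const uint8_t *buf, size_t len)
{
    assert(sum_impl != NULL);
    return sum_impl(buf, len);
}
```

Both styles select the same variant for a given feature mask; they
differ only in whether the selection is re-evaluated with conditional
branches on each call or paid for once and followed by an indirect call.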
