> -----Original Message-----
> From: Linus Torvalds <torva...@linux-foundation.org>
> Sent: Friday, September 27, 2019 4:06 AM
> To: Pascal Van Leeuwen <pvanleeu...@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>; Linux Crypto Mailing List
> <linux-cry...@vger.kernel.org>; Linux ARM <linux-arm-ker...@lists.infradead.org>;
> Herbert Xu <herb...@gondor.apana.org.au>; David Miller <da...@davemloft.net>;
> Greg KH <gre...@linuxfoundation.org>; Jason A . Donenfeld <ja...@zx2c4.com>;
> Samuel Neves <sne...@dei.uc.pt>; Dan Carpenter <dan.carpen...@oracle.com>;
> Arnd Bergmann <a...@arndb.de>; Eric Biggers <ebigg...@google.com>;
> Andy Lutomirski <l...@kernel.org>; Will Deacon <w...@kernel.org>;
> Marc Zyngier <m...@kernel.org>; Catalin Marinas <catalin.mari...@arm.com>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
> 
> On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
> <pvanleeu...@verimatrix.com> wrote:
> >
> > But even the CPU only thing may have several implementations, of which
> > you want to select the fastest one supported by the _detected_ CPU
> > features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)
> > Do you think this would still be efficient if that would be some
> > large if-else tree? Also, such a fixed implementation wouldn't scale.
> 
> Just a note on this part.
> 
> Yes, with retpoline a large if-else tree is actually *way* better for
> performance these days than even just one single indirect call. I
> think the cross-over point is somewhere around 20 if-statements.
> 
Yikes, that is just _horrible_ :-(

_However_, there are many CPU architectures out there that _don't_ need
the retpoline mitigation and would be unfairly penalized by the deep
if-else tree (as opposed to the indirect branch) for a problem they
did not cause in the first place.

Wouldn't it be fairer to impose the penalty on the CPUs actually
_causing_ this problem? Those are generally the more powerful CPUs
anyway, so they would suffer the least from the overhead.
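
To make the trade-off concrete, here is a minimal user-space sketch of the
two dispatch styles being compared: an if-else chain over detected features
versus a single indirect call through a function pointer. The FEAT_* bits,
detected_features and all encrypt_* names are hypothetical and made up for
this example; they are not kernel APIs, and the stubs only return a tag that
identifies which variant ran.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits, invented for this sketch. */
enum { FEAT_AESNI = 1u << 0, FEAT_AVX2 = 1u << 1, FEAT_AVX512 = 1u << 2 };

/* Would be filled in by feature detection; hard-coded here for illustration. */
unsigned int detected_features = FEAT_AESNI | FEAT_AVX2;

/* Stub implementations: each returns a tag so the caller can see which ran. */
int encrypt_generic(const unsigned char *buf, size_t len) { (void)buf; (void)len; return 0; }
int encrypt_aesni(const unsigned char *buf, size_t len)   { (void)buf; (void)len; return 1; }
int encrypt_avx2(const unsigned char *buf, size_t len)    { (void)buf; (void)len; return 2; }
int encrypt_avx512(const unsigned char *buf, size_t len)  { (void)buf; (void)len; return 3; }

/* Style 1: if-else chain. Every call here is a direct call, which
 * retpoline does not affect; the cost is the chain of conditional branches. */
int encrypt_ifelse(const unsigned char *buf, size_t len)
{
	if (detected_features & FEAT_AVX512)
		return encrypt_avx512(buf, len);
	if (detected_features & FEAT_AVX2)
		return encrypt_avx2(buf, len);
	if (detected_features & FEAT_AESNI)
		return encrypt_aesni(buf, len);
	return encrypt_generic(buf, len);
}

/* Style 2: select once, then every call goes through a function pointer,
 * i.e. an indirect call (the kind retpoline makes expensive). */
typedef int (*encrypt_fn)(const unsigned char *buf, size_t len);
encrypt_fn encrypt_ptr = encrypt_generic;

void encrypt_select(void)
{
	if (detected_features & FEAT_AVX512)
		encrypt_ptr = encrypt_avx512;
	else if (detected_features & FEAT_AVX2)
		encrypt_ptr = encrypt_avx2;
	else if (detected_features & FEAT_AESNI)
		encrypt_ptr = encrypt_aesni;
	else
		encrypt_ptr = encrypt_generic;
}

int encrypt_indirect(const unsigned char *buf, size_t len)
{
	assert(encrypt_ptr != NULL);
	return encrypt_ptr(buf, len);
}
```

Both styles pick the same implementation; they differ only in whether the
selection cost is paid per call as conditional branches or as one indirect
branch.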

> But those kinds of things also are things that we already handle well
> with instruction rewriting, so they can actually have even less of an
> overhead than a conditional branch. Using code like
> 
>   if (static_cpu_has(X86_FEATURE_AVX2))
> 
> actually ends up patching the code at run-time, so you end up having
> just an unconditional branch. Exactly because CPU feature choices
> often end up being in critical code-paths where you have
> one-or-the-other kind of setup.
> 
> And yes, one of the big users of this is very much the crypto library code.
> 
Ok, I didn't know about that. So I suppose we could have something
like if (static_soc_has(HW_CRYPTO_ACCELERATOR_XYZ)) ... Hmmm ...
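
As a rough illustration of what such a check might look like from the
caller's side, here is a user-space sketch. Everything in it is an
assumption: static_soc_has(), soc_feature_set(), encrypt_packet() and the
HW_CRYPTO_ACCELERATOR_XYZ ID are hypothetical names and do not exist in the
kernel. Unlike the real static_cpu_has(), this emulation does no code
patching; it just loads and tests a flag, so it shows the shape of the call
site only, not the performance benefit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical accelerator IDs, invented for this sketch. */
enum soc_feature { HW_CRYPTO_ACCELERATOR_XYZ, SOC_FEATURE_COUNT };

/* One flag per feature; all start out false (not present). */
bool soc_features[SOC_FEATURE_COUNT];

/* Would be called once, when the hardware is detected. */
void soc_feature_set(enum soc_feature f, bool present)
{
	assert(f < SOC_FEATURE_COUNT);
	soc_features[f] = present;
}

/* Emulation only: a plain load and test. The real static_cpu_has()
 * call sites are rewritten into a direct jump at run-time instead. */
static inline bool static_soc_has(enum soc_feature f)
{
	return soc_features[f];
}

/* Call site in the style of the suggestion above; the return value is
 * just a tag saying which path was taken. */
int encrypt_packet(void)
{
	if (static_soc_has(HW_CRYPTO_ACCELERATOR_XYZ))
		return 1;	/* would hand the packet to the accelerator */
	return 0;		/* would fall back to a CPU implementation */
}
```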

> The code to do the above is disgusting, and when you look at the
> generated code you see odd unreachable jumps and what looks like a
> slow "bts" instruction that does the testing dynamically.
> 
> And then the kernel instruction stream gets rewritten fairly early
> during the boot depending on the actual CPU capabilities, and the
> dynamic tests get overwritten by a direct jump.
> 
> Admittedly I don't think the arm64 people go to quite those lengths,
> but it certainly wouldn't be impossible there either.  It just takes a
> bit of architecture knowledge and a strong stomach ;)
> 
>                  Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
