from:"Van Leeuwen, Pascal"

RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"

2020-08-12 Thread Van Leeuwen, Pascal

> -Original Message-
> From: linux-crypto-ow...@vger.kernel.org  
> On Behalf Of Sabrina Dubroca
> Sent: Wednesday, August 12, 2020 12:05 PM
> To: Scott Dial 
> Cc: linux-cry...@vger.kernel.org; Ryan Cox ; 
> netdev@vger.kernel.org; da...@davemloft.net; Antoine Tenart
> ; ebigg...@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress 
> frame ordering"
>
> <<< External Email >>>
> 2020-08-10, 12:09:40 -0400, Scott Dial wrote:
> > On 8/10/2020 9:34 AM, Sabrina Dubroca wrote:
> > > [adding the linux-crypto list]
> > >
> > > 2020-08-06, 23:48:16 -0400, Scott Dial wrote:
> > >> On 8/6/2020 5:11 PM, Ryan Cox wrote:
> > >>> With 5.7 I get:
> > >>> * 9.90 Gb/s with no macsec at all
> > >>> * 1.80 Gb/s with macsec WITHOUT encryption
> > >>> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> > >>>
> > >>> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I 
> > >>> get:
> > >>> * 9.90 Gb/s with no macsec at all
> > >>> * 7.33 Gb/s with macsec WITHOUT encryption
> > >>> * 9.83 Gb/s with macsec WITH encryption
> > >>>
> > >>> On tests where performance is bad (including macsec without encryption),
> > >>> iperf3 is at 100% CPU usage.  I was able to run it under `perf record`on
> > >>> iperf3 in a number of the tests but, unfortunately, I have had trouble
> > >>> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> > >>> would be useful I can work on fixing the perf compilation issues.
> > >>
> > >> For certain, you are measuring the difference between AES-NI doing
> > >> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
> > >> hotspot is ghash-generic's implementation of ghash_update() function.
> > >> I appreciate your testing because I was limited in my ability to test
> > >> beyond 1Gb/s.
> > >>
> > >> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> > >> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> > >> does not have that kind of logic in it and only provides an async 
> > >> version,
> > >> so we are forced to use the ghash-generic implementation, which is a pure
> > >> CPU implementation. The ideal would be for aesni_intel to provide a
> > >> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> > >> busy.
> > >> I don't know if the crypto maintainers would be open to such a change, 
> > >> but
> > >> if the choice was between reverting and patching the crypto code, then I
> > >> would work on patching the crypto code.
> > >
> > > To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
> > > ("net: macsec: preserve ingress frame ordering"), which made MACsec
> > > use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
> > > decryption, but reduces performance. We'd like to restore performance
> > > on systems where the FPU is available without breaking MACsec for
> > > systems where the FPU is often busy.
> > >
> > > A quick and dirty alternative might be to let the administrator decide
> > > if they're ok with some out of order. Maybe they know that their FPU
> > > will be mostly idle so it won't even be an issue (or maybe the
> > > opposite, ie keep the fast default and let admins fix their setups
> > > with an extra flag).
> >
> > I can appreciate favoring performance over correctness as practical
> > concern, but I'd suggest that the out-of-order decryption *is* a
> > performance concern as well. We can debate realness of my workload, but
> > even in Ryan's tests on an otherwise idle server, he showed 0.07% of the
> > frames needed to be dispatched to cryptd, and that for whatever reason
> > it's more often with encryption disabled, which correlates to his
> > decrease in throughput (9.83 Gb/s to 7.33 Gb/s, and 9.19 Gb/s to 6.00
> > Gb/s), perhaps causing exponential backoff from TCP retries. I can
> > resurrect my test setup, but my numbers were worse than Ryan's.
> >
> > In any case, I counted 18 implementations of HW accelerated gcm(aes) in
> > the kernel, with 3 of those implementations are in arch (x86, arm64, and
> > s390) and the rest are crypto device drivers. Of all those
> > implementations, the AES-NI implementation is the only one that
> > dispatches to cryptd (via code in cypto/simd.c). AFAICT, every other
> > implementation of gcm(aes) is synchronous, but they would require closer
> > inspection to be certain.
>
> I randomly picked 2 of them (chcr and inside-secure), and they both
> set CRYPTO_ALG_ASYNC, so I guess not.
>
You can expect most, if not all, HW accelerated crypto to by ASYNC. This is
important to achieve decent performance, as going through some external
(to the CPU) accelerator incurs significant latency.  (Note that I don't 
consider
CPU extensions like AES-NI to be "HW accelerated", anything that uses only
CPU instructions is "just" software in my world). Which implies you need to
pipeline requests to unleash its true performance. So if you need high
throughput

RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"

2020-08-24 Thread Van Leeuwen, Pascal

> -Original Message-
> From: linux-crypto-ow...@vger.kernel.org  
> On Behalf Of Andrew Lunn
> Sent: Wednesday, August 12, 2020 2:42 PM
> To: Van Leeuwen, Pascal 
> Cc: Sabrina Dubroca ; Scott Dial ; 
> linux-cry...@vger.kernel.org; Ryan Cox
> ; netdev@vger.kernel.org; da...@davemloft.net; Antoine 
> Tenart ;
> ebigg...@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress 
> frame ordering"
>
> <<< External Email >>>
> > With networking protocols you often also have a requirement to minimize
> > packet reordering, so I understand it's a careful balance. But it is 
> > possible
> > to serialize the important stuff and still do the crypto out-of-order, which
> > would be really beneficial on _some_ platforms (which have HW crypto
> > acceleration but no such CPU extensions) at least.
>
> Many Ethernet PHYs are also capable of doing MACSeC as they
> send/receive frames. Doing it in hardware in the PHY avoids all these
> out-of-order and latency issues. Unfortunately, we are still at the
> early days for PHY drivers actually implementing MACSeC offload. At
> the moment only the Microsemi PHY and Aquantia PHY via firmware in the
> Atlantic NIC support this.
>
No need to point this out to me as we're the number one supplier of inline 
MACsec IP :-)
In fact, the Microsemi PHY solution you mention is ours, major parts of that 
design were
even created by these 2 hands here.  Full protocol offload is obviously the 
holy grail of HW
acceleration, and what we tend to strive for. The problem is always with the 
software
integration, so I'm happy to see a framework for inline MACsec acceleration 
being added to
the kernel.

Without such a protocol acceleration framework (which AFAIK doesn't exist for 
IPsec yet,
at least not in a generic form supporting all modes and ciphersuites), however, 
you fall
back to basic hash-encrypt/AEAD offload as the "best you can do".  And some 
low-cost
devices may still do it on the CPU to minimize silicon cost. So it is still 
very useful for the
crypto API path to be as efficient as possible, at least for the time being.

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953

Note: The Inside Secure/Verimatrix Silicon IP team was recently acquired by 
Rambus.
Please be so kind to update your e-mail address book with my new e-mail address.


** This message and any attachments are for the sole use of the intended 
recipient(s). It may contain information that is confidential and privileged. 
If you are not the intended recipient of this message, you are prohibited from 
printing, copying, forwarding or saving it. Please delete the message and 
attachments and notify the sender immediately. **

Rambus Inc.<http://www.rambus.com>

RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"

2020-08-25 Thread Van Leeuwen, Pascal

> -Original Message-
> From: Andrew Lunn 
> Sent: Monday, August 24, 2020 3:02 PM
> To: Van Leeuwen, Pascal 
> Cc: Sabrina Dubroca ; Scott Dial ; 
> linux-cry...@vger.kernel.org; Ryan Cox
> ; netdev@vger.kernel.org; da...@davemloft.net; Antoine 
> Tenart ;
> ebigg...@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress 
> frame ordering"
>
> <<< External Email >>>
> On Mon, Aug 24, 2020 at 09:07:26AM +, Van Leeuwen, Pascal wrote:
> > No need to point this out to me as we're the number one supplier of inline 
> > MACsec IP :-)
> > In fact, the Microsemi PHY solution you mention is ours, major parts of 
> > that design were
> > even created by these 2 hands here.
>
> Oh,  O.K.
>
> Do you know of other silicon vendors which are using the same IP?
>
I do, there are many. But unfortunately, I cannot disclose our customers unless 
this is already
public information, e.g. due to some press release or whatever.

> Maybe we can encourage them to share the driver, rather than re-invent
> the wheel, which often happens when nobody realises it is basically
> the same core with a different wrapper.
>
Yes, that could save a lot of duplication of code and effort. And it should be 
rather trivial to
move the MACsec stuff to a higher level as all it needs is some register access 
to PHY control
space and an interrupt callback. So it should be possible to define a simple 
API between the
MACsec driver and the PHY driver for that. I would expect a similar API to be 
useful for
MACsec enabled PHY's using other MACsec solutions (i.e. not ours) as well ...

The problem is: who will do it? We can't do it, because we have no access to 
the actual HW.
Microsemi won't be motivated to do it, because it would only help the 
competition, so why
would they? So it would have to be some competitor also desiring MACsec support 
(for the
same MACsec IP), convincing the maintainer of the Microsemi driver to go along 
with the
changes. I guess it's not all that relevant until we hit that situation.

> Thanks
> Andrew

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953

Note: The Inside Secure/Verimatrix Silicon IP team was recently acquired by 
Rambus.
Please be so kind to update your e-mail address book with my new e-mail address.

** This message and any attachments are for the sole use of the intended 
recipient(s). It may contain information that is confidential and privileged. 
If you are not the intended recipient of this message, you are prohibited from 
printing, copying, forwarding or saving it. Please delete the message and 
attachments and notify the sender immediately. **

Rambus Inc.<http://www.rambus.com>

RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"

RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"

RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"

3 matches

Site Navigation

Mail list logo

Footer information