Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread bige...@linutronix.de
On 2018-07-25 06:57:42 [+], Vakul Garg wrote:
> I tested this patch. It helped but didn't regain the performance to previous
> level.
> Are there more files remaining to be fixed? (In your original patch series
> for adding preemptability check, there were lot more files changed than this
> series with 4 files).
>
> Instead of using hardcoded 32 block/16 block limit, should it be controlled
> using Kconfig?
> I believe that on different cores, these values could be required to be
> different.

What about PREEMPT_NONE (server)?

Sebastian


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread bige...@linutronix.de
On 2018-07-25 07:04:55 [+], Vakul Garg wrote:
> > 
> > What about PREEMPT_NONE (server)?
> 
> Why not have best of both the worlds :)

The NEON code gets interrupted because another task wants to run and the
scheduler allows it. With "low latency desktop" this happens right away.
The lower preemption levels won't schedule so fast. So if you are after
performance, the lower level should give you more. If you are after low
latency…

Sebastian


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-25 Thread bige...@linutronix.de
On 2018-07-25 11:54:53 [+0200], Ard Biesheuvel wrote:
> Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
> 1000 cycle limit to the quantum of work performed with preemption
> disabled is unreasonably low, we can increase the yield block counts
> and approach the optimal numbers a bit closer. But with diminishing
> returns.

So I tested on a SoftIron Overdrive 1000, which has A57 cores. I applied
this series and didn't notice any spikes: cyclictest reported a max
value of ~20us (which means the crypto code was not noticeable).
I played a little with it and ran the tcrypt tests for aes/sha1 and also
saw no huge spikes. So at this point this looks fantastic. I also set up
cryptsetup / dm-crypt with the usual xts(aes) mode and saw no spikes.
At this point, on this hardware, if you want to raise the block count, I
wouldn't mind.

I remember on x86 the SIMD accelerated ciphers led to ~1ms+ spikes once
dm-crypt started its jobs.

Sebastian


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-26 Thread bige...@linutronix.de
On 2018-07-26 09:25:40 [+0200], Ard Biesheuvel wrote:
> Thanks a lot.
> 
> So 20 us ~= 20,000 cycles on my 1 GHz Cortex-A53, and if I am
> understanding you correctly, you wouldn't mind the quantum of work to
> be in the order 16,000 cycles or even substantially more?

I currently have only that one box and it does not seem to be a problem.
It now reports around 20us max when idle. So if we add "only" 20us for
the NEON / your preempt-disable section then we may end up at
20+20 = 40us. At this point I am not sure how "bad" that is. It works,
it does not seem like that much, and you can disable it if you don't
want the extra 20us here.
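
For reference, the arithmetic in this exchange can be sketched as below.
This is only an illustration: the 1 GHz clock, the 20us figures and the
16,000-cycle quantum come from the messages above, the function names are
made up for this sketch, and simply adding the two latencies assumes the
pessimistic case where both delays hit back to back.

```python
# Illustrative arithmetic only; the helper names are hypothetical and do
# not correspond to anything in the kernel or in the patch series.

def us_to_cycles(us, clock_hz):
    """Convert a duration in microseconds to CPU cycles at clock_hz."""
    return us * clock_hz / 1_000_000


def cycles_to_us(cycles, clock_hz):
    """Convert a cycle count to microseconds at clock_hz."""
    return cycles * 1_000_000 / clock_hz


def worst_case_latency_us(baseline_us, preempt_off_us):
    """Assumed worst case: the existing maximum latency and the new
    preempt-disabled section add up in full."""
    return baseline_us + preempt_off_us


CLOCK_HZ = 1_000_000_000  # 1 GHz, as for the Cortex-A53 quoted above

if __name__ == "__main__":
    print("20us at 1 GHz:", us_to_cycles(20, CLOCK_HZ), "cycles")
    print("16000 cycles at 1 GHz:", cycles_to_us(16_000, CLOCK_HZ), "us")
    print("baseline + NEON section:", worst_case_latency_us(20, 20), "us")
```

On a faster core the same cycle count maps to a shorter time, so a block
limit expressed in cycles is not a fixed latency bound across machines.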

> That is good news, but it is also rather interesting, given that these
> algorithms run at ~4 cycles per byte, meaning that you'd manage an
> entire 4 KB page without ever yielding. (GCM is used on network
> packets, XTS on disk sectors which are all smaller than that)
> 
> Do you remember how you found out NEON use is a problem for -rt on
> arm64 in the first place? Which algorithm did you test at the time to
> arrive at this conclusion?

I *think* that yield got in there by chance. The main problem back then
was that the scatter list walk happened within the neon begin/end
section. That walk may invoke kmap() / kmalloc() / kfree(), which is not
allowed on RT within a preempt-disable section. This was my main
concern.

> Note that AES-GCM using ordinary SIMD instructions runs at 29 cpb, and
> plain AES at ~20 (on A53), so perhaps it would make sense to
> distinguish between algos using crypto instructions and ones using
> plain SIMD.
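
To put the cycles-per-byte figures from this thread side by side, here is
a rough sketch. The ~4, ~20 and 29 cpb numbers and the 1000 / 16,000
cycle budgets are taken from the messages; the helper names are
hypothetical, the 16-byte block size is the standard AES block size, and
the model ignores setup cost and the overhead of the yield check itself.

```python
# Rough model only: cycles = bytes * cycles-per-byte. Helper names are
# hypothetical; per-call setup and yield-check overhead are ignored.

AES_BLOCK_BYTES = 16  # standard AES block size


def cycles_for(nbytes, cpb):
    """Estimated cycles to process nbytes at cpb cycles per byte."""
    return nbytes * cpb


def blocks_within(budget_cycles, cpb, block_bytes=AES_BLOCK_BYTES):
    """Whole blocks that fit into budget_cycles at cpb cycles per byte."""
    return int(budget_cycles // (cpb * block_bytes))


# cpb values as quoted in the thread (A53); the labels are informal.
QUOTED_CPB = [
    ("crypto-instruction algos (~4 cpb)", 4),
    ("plain AES, SIMD (~20 cpb)", 20),
    ("AES-GCM, SIMD (29 cpb)", 29),
]

if __name__ == "__main__":
    print("4 KB page at 4 cpb:", cycles_for(4096, 4), "cycles")
    for label, cpb in QUOTED_CPB:
        for budget in (1_000, 16_000):
            print(label, "budget", budget, "cycles:",
                  blocks_within(budget, cpb), "blocks")
```

Under this model a single block limit gives very different
preempt-disabled times for crypto-instruction and plain SIMD
implementations, which is the distinction suggested above.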

I was looking at AES-CE and AES-NEON (aes-neon-blk / aes_ce_blk) with
modprobe tcrypt mode=200 sec=1

and mode=403 +404 for the sha1/256 test.

Sebastian