For the in-order ARM Cortex-A8 (the target for this code), adjacent
multiply-add instructions forward summands quickly. A simple in-order
dot-product computation has no latency problems, while interleaving
computations, as suggested in this thread, creates problems. Also, on
this microarchitecture,
Eric Biggers writes:
> If (more likely) you're talking about things like "use this NEON
> implementation on Cortex-A7 but this other NEON implementation on
> Cortex-A53", it's up the developers and community to test different CPUs
> and make appropriate decisions, and yes it can be very usefu
Eric Biggers writes:
> You'd probably attract more contributors if you followed established
> open source conventions.
SUPERCOP already has thousands of implementations from hundreds of
contributors. New speed records are more likely to appear in SUPERCOP
than in any other cryptographic software c
Eric Biggers writes:
> I've also written a scalar ChaCha20 implementation (no NEON instructions!)
> that is 12.2 cpb on one block at a time on Cortex-A7, taking advantage of
> the free rotates; that would be useful for the single permutation used to
> compute XChaCha's subkey, and also for the e
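
For readers following the "free rotates" remark quoted above, here is a
minimal C sketch of the ChaCha quarter-round, showing where the four
rotations sit relative to the additions and XORs. This is not the quoted
implementation, which is not shown in this thread; the names rotl32,
chacha_quarter_round, and chacha_quarter_round_selftest are chosen here
purely for illustration. On 32-bit ARM, data-processing instructions can
apply a shift or rotate to their second operand, which is presumably what
the remark refers to: a rotation can be folded into the following add or
XOR instead of costing a separate instruction. Whether a compiler does this
for the code below is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits; valid for 0 < n < 32. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter-round on four state words (RFC 8439, section 2.1).
 * Each of the four steps is add, xor, rotate; the rotated word is consumed
 * by the next add or xor, which is where a rotated-operand instruction
 * could absorb the rotation. */
static inline void chacha_quarter_round(uint32_t *a, uint32_t *b,
                                        uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/* Illustrative self-check (hypothetical helper): returns 1 if the
 * quarter-round reproduces the test vector in RFC 8439, section 2.1.1. */
static int chacha_quarter_round_selftest(void)
{
    uint32_t a = 0x11111111, b = 0x01020304;
    uint32_t c = 0x9b8d6f43, d = 0x01234567;

    chacha_quarter_round(&a, &b, &c, &d);

    return a == 0xea2a92f4 && b == 0xcb1cf8ce &&
           c == 0x4581472e && d == 0x5881c4bb;
}
```

A full ChaCha20 block applies this quarter-round to columns and diagonals
of the 16-word state for 20 rounds; the sketch covers only the inner step
where the rotations occur.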