On 2/15/24 08:46, Alexander Monakov wrote:
> Right, so we can pick the cheapest reduction method, and if I'm reading
> the Neoverse-N1 SOG right, SHRN is marginally cheaper than ADDV (latency 2
> instead of 3), and it should be generally preferable on other cores, no?

Fair.  For that matter, cannot UQXTN be used the same way?
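
To make the comparison concrete, here is a rough sketch of both
reductions (the vtstq_u8 mask and the helper names are illustration
on my part, not code from the patch):

#include <arm_neon.h>
#include <stdbool.h>
#include <stdint.h>

/* Both helpers answer "does this 16-byte chunk contain a non-zero
   byte?".  vtstq_u8(x, x) yields 0xFF for each non-zero byte. */

static inline bool chunk_nonzero_addv(uint8x16_t x)
{
    /* ADDV: horizontal byte sum of the mask.  With 0x00/0xFF lanes
       the sum mod 256 is (-k) mod 256 for k set bytes, so it is
       zero iff no byte of the mask is set. */
    return vaddvq_u8(vtstq_u8(x, x)) != 0;
}

static inline bool chunk_nonzero_shrn(uint8x16_t x)
{
    /* SHRN #4: keeps one nibble of every mask byte, compressing the
       128-bit mask into 64 bits testable with one GPR compare. */
    uint8x8_t nib = vshrn_n_u16(vreinterpretq_u16_u8(vtstq_u8(x, x)), 4);
    return vget_lane_u64(vreinterpret_u64_u8(nib), 0) != 0;
}
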
On Wed, 14 Feb 2024, Richard Henderson wrote:
> Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
> double-check with the compiler flags for __ARM_NEON and don't bother with
> a runtime check. Otherwise, model the loop after the x86 SSE2 function,
> and use VADDV to reduce the four vector comparisons.
>
> Signed-off-by: Richard Henderson
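
For concreteness, a hedged sketch of the loop shape described above,
modeled on the x86 SSE2 variant: the function name, the OR-combining
of the four loads, and the overlapping tail are my reconstruction,
not the patch itself, and short buffers (len < 64) are assumed to be
handled elsewhere.

#include <arm_neon.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __ARM_NEON   /* compile-time check only, as the patch intends */
static bool buffer_is_zero_neon(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    const uint8_t *last = p + len - 64;  /* final, possibly overlapping block */

    do {
        uint8x16_t v0 = vld1q_u8(p);
        uint8x16_t v1 = vld1q_u8(p + 16);
        uint8x16_t v2 = vld1q_u8(p + 32);
        uint8x16_t v3 = vld1q_u8(p + 48);
        /* Fold the four loads so each iteration pays for one reduction;
           the vaddvq_u8 below is the spot where SHRN or UMAXV could be
           swapped in. */
        uint8x16_t t = vorrq_u8(vorrq_u8(v0, v1), vorrq_u8(v2, v3));
        if (vaddvq_u8(vtstq_u8(t, t)) != 0) {
            return false;
        }
        p += 64;
    } while (p < last);

    /* Re-check the last 64 bytes, overlapping the loop's coverage. */
    uint8x16_t v0 = vld1q_u8(last);
    uint8x16_t v1 = vld1q_u8(last + 16);
    uint8x16_t v2 = vld1q_u8(last + 32);
    uint8x16_t v3 = vld1q_u8(last + 48);
    uint8x16_t t = vorrq_u8(vorrq_u8(v0, v1), vorrq_u8(v2, v3));
    return vaddvq_u8(vtstq_u8(t, t)) == 0;
}
#endif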