> -----Original Message-----
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Sent: Monday, July 7, 2025 10:38 AM
> To: GCC Patches <gcc-patches@gcc.gnu.org>
> Cc: Richard Sandiford <richard.sandif...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew
> Pinski <pins...@gmail.com>
> Subject: [PATCH] aarch64: Improve popcountti2 with SVE
> 
> Hi all,
> 
> The TImode popcount sequence can be slightly improved with SVE.
> If we generate:
> ldr q31, [x0]
> ptrue p7.b, vl16
> cnt z31.d, p7/m, z31.d
> addp d31, v31.2d
> fmov x0, d31
> ret
> 
> instead of:
> h128:
> ldr q31, [x0]
> cnt v31.16b, v31.16b
> addv b31, v31.16b
> fmov w0, s31
> ret
> 
> we use the ADDP instruction for reduction, which is cheaper on all CPUs AFAIK,
> as it is only a single 64-bit addition vs the tree of additions for ADDV.
> For example, on a CPU like Grace, ADDP has a latency of 2 and a throughput of 4,
> vs a latency of 4 and a throughput of 1 for ADDV.
> We do generate one more instruction due to the PTRUE being materialised, but that
> is cheap itself and can be scheduled away from the critical path or even CSE'd
> with other PTRUE constants.
> As this sequence is larger code size-wise it is avoided for -Os.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> 
> Ok for trunk?
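
As a rough illustration of the two reduction shapes compared in the quoted message (not part of the patch), the following C sketch models them in scalar code. The function names are hypothetical, and `unsigned __int128` and the `__builtin_popcount*` builtins are GCC/Clang extensions assumed to be available on a 64-bit target. The first function mirrors the per-byte CNT followed by an across-vector ADDV; the second mirrors the per-64-bit CNT followed by a single ADDP.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of "cnt v31.16b" + "addv b31": count bits in each of
   the 16 bytes, then sum all 16 partial counts.  */
unsigned popcount128_bytes (const unsigned __int128 *p)
{
  unsigned char b[16];
  memcpy (b, p, sizeof b);

  unsigned sum = 0;
  for (int i = 0; i < 16; i++)
    sum += (unsigned) __builtin_popcount (b[i]);
  return sum;
}

/* Hypothetical model of "cnt z31.d" + "addp d31": count bits in each
   64-bit half, then perform one addition of the two partial counts.  */
unsigned popcount128_dwords (const unsigned __int128 *p)
{
  uint64_t d[2];
  memcpy (d, p, sizeof d);

  return (unsigned) __builtin_popcountll (d[0])
	 + (unsigned) __builtin_popcountll (d[1]);
}
```

Both functions return the same value for any input; the difference is only in how many partial counts have to be summed, which is the point the cost argument above rests on.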

We don't seem to take -Os into consideration for the general vector version when
using SVE. Should we? Or should the size check be dropped here? It seems better
to be consistent.

OK with or without that change.

Thanks,
Tamar

> Thanks,
> Kyrill
> 
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> 
> gcc/
> 
>       * config/aarch64/aarch64.md (popcountti2): Add TARGET_SVE path.
> 
> gcc/testsuite/
> 
>       * gcc.target/aarch64/popcnt9.c: Add +nosve to target pragma.
>       * gcc.target/aarch64/popcnt13.c: New test.
