Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Monday, July 7, 2025 10:38 AM
>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew
>> Pinski <pins...@gmail.com>
>> Subject: [PATCH] aarch64: Improve popcountti2 with SVE
>>
>> Hi all,
>>
>> The TImode popcount sequence can be slightly improved with SVE.
>> If we generate:
>>         ldr     q31, [x0]
>>         ptrue   p7.b, vl16
>>         cnt     z31.d, p7/m, z31.d
>>         addp    d31, v31.2d
>>         fmov    x0, d31
>>         ret
>>
>> instead of:
>> h128:
>>         ldr     q31, [x0]
>>         cnt     v31.16b, v31.16b
>>         addv    b31, v31.16b
>>         fmov    w0, s31
>>         ret
>>
>> we use the ADDP instruction for reduction, which is cheaper on all CPUs
>> AFAIK, as it is only a single 64-bit addition vs the tree of additions
>> for ADDV.
>> For example, on a CPU like Grace we get a latency and throughput of 2,4
>> vs 4,1 for ADDV.
>> We do generate one more instruction due to the PTRUE being materialised,
>> but that is cheap itself and can be scheduled away from the critical path
>> or even CSE'd with other PTRUE constants.
>> As this sequence is larger code size-wise it is avoided for -Os.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Ok for trunk?
>
> We don't seem to take -Os into consideration for the general vector version
> when using SVE.  Should we? or should the size check be dropped here?
> Seems better if we're consistent.
The difference is that for 64-bit and smaller popcounts, SVE CNT provides
the result directly, whereas Advanced SIMD requires CNT+ADDV.  So for
smaller sizes, it's effectively PTRUE+CNT vs CNT+ADDV, with the SVE
version having the advantage of a hoistable and shareable constant.

For 128-bit popcounts we need CNT+an ADD either way, and the SVE CNT has
the added disadvantage of requiring tied registers to avoid a false
dependency (either directly from the RA, or via MOVPRFX).

So keeping the -Os check seems better to me FWIW.

Richard

>
> OK with or without that change.
>
> Thanks,
> Tamar
>
>> Thanks,
>> Kyrill
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>>     * config/aarch64/aarch64.md (popcountti2): Add TARGET_SVE path.
>>
>> gcc/testsuite/
>>
>>     * gcc.target/aarch64/popcnt9.c: Add +nosve to target pragma.
>>     * gcc.target/aarch64/popcnt13.c: New test.
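
For readers following the thread, the two reduction shapes being compared can be modelled roughly in portable C. This is only a sketch of the arithmetic, not the code GCC emits and not part of the patch. The function names are hypothetical, the 128-bit value is passed as two 64-bit halves for simplicity, and the `__builtin_popcount` family is assumed to be available (GCC or Clang).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names for illustration; not from the patch or GCC. */

/* Models the Advanced SIMD shape: CNT produces one count per byte lane
   (16 lanes), then ADDV sums all 16 per-byte counts. */
unsigned popcount128_bytewise(uint64_t lo, uint64_t hi)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += (unsigned)__builtin_popcount((unsigned)((lo >> (8 * i)) & 0xff));
        sum += (unsigned)__builtin_popcount((unsigned)((hi >> (8 * i)) & 0xff));
    }
    return sum;
}

/* Models the SVE shape: CNT produces one count per 64-bit lane (2 lanes),
   then a single ADDP-style addition combines the two lane counts. */
unsigned popcount128_lanewise(uint64_t lo, uint64_t hi)
{
    return (unsigned)__builtin_popcountll(lo)
           + (unsigned)__builtin_popcountll(hi);
}

/* Sanity check that both models agree for one input. */
void check_models_agree(uint64_t lo, uint64_t hi)
{
    assert(popcount128_bytewise(lo, hi) == popcount128_lanewise(lo, hi));
}
```

Both functions compute the same value. The difference the thread cares about is the reduction: sixteen partial counts to sum in the first model versus two in the second, which is what the single ADDP stands for in the proposed sequence.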