> On 7 Jul 2025, at 13:27, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Tamar Christina <tamar.christ...@arm.com> writes:
>>> -----Original Message-----
>>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>>> Sent: Monday, July 7, 2025 10:38 AM
>>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Richard Earnshaw
>>> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew
>>> Pinski <pins...@gmail.com>
>>> Subject: [PATCH] aarch64: Improve popcountti2 with SVE
>>>
>>> Hi all,
>>>
>>> The TImode popcount sequence can be slightly improved with SVE.
>>> If we generate:
>>>         ldr     q31, [x0]
>>>         ptrue   p7.b, vl16
>>>         cnt     z31.d, p7/m, z31.d
>>>         addp    d31, v31.2d
>>>         fmov    x0, d31
>>>         ret
>>>
>>> instead of:
>>> h128:
>>>         ldr     q31, [x0]
>>>         cnt     v31.16b, v31.16b
>>>         addv    b31, v31.16b
>>>         fmov    w0, s31
>>>         ret
>>>
>>> we use the ADDP instruction for reduction, which is cheaper on all CPUs
>>> AFAIK, as it is only a single 64-bit addition vs the tree of additions
>>> for ADDV.
>>> For example, on a CPU like Grace we get a latency and throughput of 2,4
>>> vs 4,1 for ADDV.
>>> We do generate one more instruction due to the PTRUE being materialised,
>>> but that is cheap itself and can be scheduled away from the critical path
>>> or even CSE'd with other PTRUE constants.
>>> As this sequence is larger code size-wise it is avoided for -Os.
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Ok for trunk?
>>
>> We don't seem to take -Os into consideration for the general vector version
>> when using SVE. Should we? or should the size check be dropped here?
>> Seems better if we're consistent.
>
> The difference is that for 64-bit and smaller popcounts, SVE CNT provides
> the result directly, whereas Advanced SIMD requires CNT+ADDV. So for smaller
> sizes, it's effectively PTRUE+CNT vs CNT+ADDV, with the SVE version having
> the advantage of a hoistable and shareable constant.
>
> For 128-bit popcounts we need CNT+an ADD either way, and the SVE CNT has the
> added disadvantage of requiring tied registers to avoid a false dependency
> (either directly from the RA, or via MOVPRFX). So keeping the -Os check
> seems better to me FWIW.
>
> Richard
>
>>
>> OK with or without that change.
>>

Thanks. I added the -Os check because this is a scalar expansion, and I
expected users to care more about code size there than they would for vector
code. I admit that’s a bit handwavy; Richard’s rationale is more technical.
I’ll keep the check when committing.

Thanks,
Kyrill

>> Thanks,
>> Tamar
>>
>>> Thanks,
>>> Kyrill
>>>
>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>
>>> gcc/
>>>
>>> 	* config/aarch64/aarch64.md (popcountti2): Add TARGET_SVE path.
>>>
>>> gcc/testsuite/
>>>
>>> 	* gcc.target/aarch64/popcnt9.c: Add +nosve to target pragma.
>>> 	* gcc.target/aarch64/popcnt13.c: New test.
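
For readers following the "CNT + an ADD either way" point, here is a minimal C sketch of what a 128-bit popcount computes: one count per 64-bit half, then a single addition to combine them. It is an illustration only and is not taken from the patch or its tests. The helper name popcount128 is hypothetical, and the code assumes a GCC-compatible compiler on a 64-bit target, since unsigned __int128 and __builtin_popcountll are compiler extensions.

```c
#include <stdint.h>

/* Hypothetical helper for illustration; not part of the patch.
   Counts the set bits of a 128-bit value by counting each 64-bit
   half separately and adding the two counts.  */
static unsigned
popcount128 (unsigned __int128 x)
{
  uint64_t lo = (uint64_t) x;          /* low 64 bits */
  uint64_t hi = (uint64_t) (x >> 64);  /* high 64 bits */

  /* Two per-half counts, then the one final addition.  */
  return (unsigned) __builtin_popcountll (lo)
	 + (unsigned) __builtin_popcountll (hi);
}
```

The final addition in the sketch corresponds to the reduction step discussed above (ADDP in the SVE sequence, ADDV in the Advanced SIMD one). Which instructions a given compiler actually emits for such a function depends on the target flags and optimisation level.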