Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Monday, July 7, 2025 10:38 AM
>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew
>> Pinski <pins...@gmail.com>
>> Subject: [PATCH] aarch64: Improve popcountti2 with SVE
>> 
>> Hi all,
>> 
>> The TImode popcount sequence can be slightly improved with SVE.
>> If we generate:
>> ldr q31, [x0]
>> ptrue p7.b, vl16
>> cnt z31.d, p7/m, z31.d
>> addp d31, v31.2d
>> fmov x0, d31
>> ret
>> 
>> instead of:
>> h128:
>> ldr q31, [x0]
>> cnt v31.16b, v31.16b
>> addv b31, v31.16b
>> fmov w0, s31
>> ret
>> 
>> we use the ADDP instruction for reduction, which is cheaper on all CPUs AFAIK,
>> as it is only a single 64-bit addition vs the tree of additions for ADDV.
>> For example, on a CPU like Grace we get a latency and throughput of 2,4 vs 4,1
>> for ADDV.
>> We do generate one more instruction due to the PTRUE being materialised, but that
>> is cheap itself and can be scheduled away from the critical path or even CSE'd
>> with other PTRUE constants.
>> As this sequence is larger code size-wise it is avoided for -Os.
>> 
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> 
>> Ok for trunk?
>
> We don't seem to take -Os into consideration for the general vector version when
> using SVE. Should we? Or should the size check be dropped here?  Seems better
> if we're consistent.

The difference is that for 64-bit and smaller popcounts, SVE CNT provides
the result directly, whereas Advanced SIMD requires CNT+ADDV.  So for smaller
sizes, it's effectively PTRUE+CNT vs CNT+ADDV, with the SVE version having
the advantage of a hoistable and shareable constant.

For 128-bit popcounts we need CNT+an ADD either way, and the SVE CNT has the
added disadvantage of requiring tied registers to avoid a false dependency
(either directly from the RA, or via MOVPRFX).  So keeping the -Os check
seems better to me FWIW.
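
As a rough C-level illustration of the "CNT+an ADD either way" point (a sketch
only, not taken from the patch or its tests): a 128-bit popcount is the sum of
the popcounts of its two 64-bit halves, so one final addition remains whichever
CNT form produces the per-half counts.  The helper name popcount128 is
hypothetical, and the code assumes a compiler that provides
__builtin_popcountll (GCC or Clang).

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: population count of a
   128-bit value passed as two 64-bit halves.  Each half is counted
   separately (the role played by the per-lane counts) and the two
   counts are combined by a single addition (the role of the final
   64-bit add).  */
static inline unsigned
popcount128 (uint64_t lo, uint64_t hi)
{
  return (unsigned) __builtin_popcountll (lo)
	 + (unsigned) __builtin_popcountll (hi);
}
```

This says nothing about which instructions the compiler actually selects; it
only shows why the reduction step cannot be dropped for the 128-bit case.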

Richard

>
> OK with or without that change.
>
> Thanks,
> Tamar
>
>> Thanks,
>> Kyrill
>> 
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> 
>> gcc/
>> 
>>      * config/aarch64/aarch64.md (popcountti2): Add TARGET_SVE path.
>> 
>> gcc/testsuite/
>> 
>>      * gcc.target/aarch64/popcnt9.c: Add +nosve to target pragma.
>>      * gcc.target/aarch64/popcnt13.c: New test.
