Re: [PATCH] aarch64: Improve popcountti2 with SVE

Kyrylo Tkachov Mon, 07 Jul 2025 07:10:21 -0700


> On 7 Jul 2025, at 13:27, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
>>> -----Original Message-----
>>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>>> Sent: Monday, July 7, 2025 10:38 AM
>>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Richard Earnshaw
>>> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew
>>> Pinski <pins...@gmail.com>
>>> Subject: [PATCH] aarch64: Improve popcountti2 with SVE
>>> 
>>> Hi all,
>>> 
>>> The TImode popcount sequence can be slightly improved with SVE.
>>> If we generate:
>>> ldr q31, [x0]
>>> ptrue p7.b, vl16
>>> cnt z31.d, p7/m, z31.d
>>> addp d31, v31.2d
>>> fmov x0, d31
>>> ret
>>> 
>>> instead of:
>>> h128:
>>> ldr q31, [x0]
>>> cnt v31.16b, v31.16b
>>> addv b31, v31.16b
>>> fmov w0, s31
>>> ret
>>> 
>>> we use the ADDP instruction for reduction, which is cheaper on all CPUs AFAIK,
>>> as it is only a single 64-bit addition vs the tree of additions for ADDV.
>>> For example, on a CPU like Grace we get a latency and throughput of 2,4 vs
>>> 4,1 for ADDV.
>>> We do generate one more instruction due to the PTRUE being materialised, but
>>> that is cheap itself and can be scheduled away from the critical path or even
>>> CSE'd with other PTRUE constants.
>>> As this sequence is larger code size-wise it is avoided for -Os.
>>> 
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>> 
>>> Ok for trunk?
>> 
>> We don't seem to take -Os into consideration for the general vector version
>> when using SVE. Should we? or should the size check be dropped here?  Seems
>> better if we're consistent.
> 
> The difference is that for 64-bit and smaller popcounts, SVE CNT provides
> the result directly, whereas Advanced SIMD requires CNT+ADDV.  So for smaller
> sizes, it's effectively PTRUE+CNT vs CNT+ADDV, with the SVE version having
> the advantage of a hoistable and shareable constant.
> 
> For 128-bit popcounts we need CNT+an ADD either way, and the SVE CNT has the
> added disadvantage of requiring tied registers to avoid a false dependency
> (either directly from the RA, or via MOVPRFX).  So keeping the -Os check
> seems better to me FWIW.
> 
> Richard
> 
>> 
>> OK with or without that change.
>>


Thanks, I added the -Os check because I expected this distinction to matter in
more contexts: this is a scalar expansion, so the user may care more about
code size here than they would for vector code.
Though I admit that’s a bit handwavy. Richard’s rationale is more technical.
I’ll keep the check when committing.

Thanks,
Kyrill

>> Thanks,
>> Tamar
>> 
>>> Thanks,
>>> Kyrill
>>> 
>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>> 
>>> gcc/
>>> 
>>> * config/aarch64/aarch64.md (popcountti2): Add TARGET_SVE path.
>>> 
>>> gcc/testsuite/
>>> 
>>> * gcc.target/aarch64/popcnt9.c: Add +nosve to target pragma.
>>> * gcc.target/aarch64/popcnt13.c: New test.
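
For reference, the reduction discussed in this thread can be sketched in C. This is only an illustration under assumptions: the function name h128 is borrowed from the label in the quoted assembly, and the body is a guess at the kind of source that reaches the popcountti2 expansion, not the contents of popcnt13.c. It counts the bits of each 64-bit half separately and then does one addition, which is the shape of the per-lane CNT followed by ADDP.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the patch: population count of a
   128-bit value loaded from memory.  */
unsigned
h128 (const unsigned __int128 *p)
{
  unsigned __int128 v = *p;

  /* Count each 64-bit half on its own (compare the per-lane SVE CNT on
     z31.d), then combine with a single addition (compare ADDP).  */
  unsigned lo = (unsigned) __builtin_popcountll ((uint64_t) v);
  unsigned hi = (unsigned) __builtin_popcountll ((uint64_t) (v >> 64));
  return lo + hi;
}
```

Whether a given compiler emits the SVE or the Advanced SIMD sequence for this function depends on the target options and on the -Os check described above; the C source is the same either way.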
