Hi all,

The architecture recommends that load-gather instructions avoid using the same
Z register for the load address and the destination, and the Software
Optimization Guides for Arm cores recommend that as well.
This means that for code like:
#include <arm_sve.h>

svuint64_t
food (svbool_t p, uint64_t *in, svint64_t offsets, svuint64_t a)
{
  return svadd_u64_x (p, a, svld1_gather_offset(p, in, offsets));
}

we'll want to avoid generating the current:
food:
        ld1d    z0.d, p0/z, [x0, z0.d] // Z0 reused as input and output.
        add     z0.d, z1.d, z0.d
        ret

However, we still want to avoid generating extra moves where there were
none before, so the tight aarch64-sve-acle.exp tests for load gathers
should still pass as they are.

This patch implements that recommendation for the load gather patterns by:
* duplicating the alternatives
* marking the output operand as early clobber
* tying the input Z register operand in the original alternatives to
  operand 0 (the output)
* penalising the original alternatives with '?'
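
To make the steps above concrete, here is a rough sketch of what one such
pair of alternatives can look like in the compact syntax. This is not text
from the patch: the pattern name "*example_gather_load", the reduced operand
list and the predicates are simplified assumptions for illustration, and the
real patterns carry more operands and more alternatives.

```
;; Hypothetical, simplified pattern for illustration only.  The operand
;; list and predicates are assumptions, not those of the real patterns.
(define_insn "*example_gather_load"
  [(set (match_operand:VNx2DI 0 "register_operand")
	(unspec:VNx2DI
	  [(match_operand:VNx2BI 3 "register_operand")
	   (match_operand:DI 1 "register_operand")
	   (match_operand:VNx2DI 2 "register_operand")
	   (mem:BLK (scratch))]
	  UNSPEC_LD1_GATHER))]
  "TARGET_SVE"
  {@ [cons: =0, 1, 2, 3]
     ;; Preferred: '&' makes the output early clobber, so it cannot be
     ;; allocated to the same Z register as the offset vector (operand 2).
     [&w, rk, w, Upl] ld1d\t%0.d, %3/z, [%1, %2.d]
     ;; Fallback: operand 2 is tied to operand 0 and the alternative is
     ;; disparaged with '?'.  '^' reuses the output template above.
     [?w, rk, 0, Upl] ^
  }
)
```

The idea is that the register allocator picks the first row whenever a free
Z register is available, and only falls back to the tied row when the
alternative would be an extra move.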

This results in a large-ish patch in terms of diff lines, but the new
compact syntax (thanks Tamar) makes it quite a readable and regular change.

The fprate benchmark numbers on a Neoverse V1 look okay:
                diff
503.bwaves_r    0.00%
507.cactuBSSN_r 0.00%
508.namd_r      0.00%
510.parest_r    0.55%
511.povray_r    0.22%
519.lbm_r       0.00%
521.wrf_r       0.00%
526.blender_r   0.00%
527.cam4_r      0.56%
538.imagick_r   0.00%
544.nab_r       0.00%
549.fotonik3d_r 0.00%
554.roms_r      0.00%
fprate          0.10%

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

P.S. I had messed up my previous commit of
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622456.html
by squashing the config/aarch64 changes with this patch.
I have reverted that one commit and reapplied it properly (as it should have
been a no-op) and am pushing this commit on top of that.
Sorry for the churn in the repo.

gcc/ChangeLog:

        * config/aarch64/aarch64-sve.md
        (mask_gather_load<mode><v_int_container>):
        Add alternatives to prefer to avoid same input and output Z register.
        (mask_gather_load<mode><v_int_container>): Likewise.
        (*mask_gather_load<mode><v_int_container>_<su>xtw_unpacked): Likewise.
        (*mask_gather_load<mode><v_int_container>_sxtw): Likewise.
        (*mask_gather_load<mode><v_int_container>_uxtw): Likewise.
        (@aarch64_gather_load_<ANY_EXTEND:optab><SVE_4HSI:mode><SVE_4BHI:mode>):
        Likewise.
        (@aarch64_gather_load_<ANY_EXTEND:optab><SVE_2HSDI:mode><SVE_2BHSI:mode>):
        Likewise.
        (*aarch64_gather_load_<ANY_EXTEND:optab><SVE_2HSDI:mode>
        <SVE_2BHSI:mode>_<ANY_EXTEND2:su>xtw_unpacked): Likewise.
        (*aarch64_gather_load_<ANY_EXTEND:optab><SVE_2HSDI:mode>
        <SVE_2BHSI:mode>_sxtw): Likewise.
        (*aarch64_gather_load_<ANY_EXTEND:optab><SVE_2HSDI:mode>
        <SVE_2BHSI:mode>_uxtw): Likewise.
        (@aarch64_ldff1_gather<mode>): Likewise.
        (@aarch64_ldff1_gather<mode>): Likewise.
        (*aarch64_ldff1_gather<mode>_sxtw): Likewise.
        (*aarch64_ldff1_gather<mode>_uxtw): Likewise.
        (@aarch64_ldff1_gather_<ANY_EXTEND:optab><VNx4_WIDE:mode>
        <VNx4_NARROW:mode>): Likewise.
        (@aarch64_ldff1_gather_<ANY_EXTEND:optab><VNx2_WIDE:mode>
        <VNx2_NARROW:mode>): Likewise.
        (*aarch64_ldff1_gather_<ANY_EXTEND:optab><VNx2_WIDE:mode>
        <VNx2_NARROW:mode>_sxtw): Likewise.
        (*aarch64_ldff1_gather_<ANY_EXTEND:optab><VNx2_WIDE:mode>
        <VNx2_NARROW:mode>_uxtw): Likewise.
        * config/aarch64/aarch64-sve2.md (@aarch64_gather_ldnt<mode>): Likewise.
        (@aarch64_gather_ldnt_<ANY_EXTEND:optab><SVE_FULL_SDI:mode>
        <SVE_PARTIAL_I:mode>): Likewise.
        
gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/sve/gather_earlyclobber.c: New test.
        * gcc.target/aarch64/sve2/gather_earlyclobber.c: New test.

Attachment: gather-earlyclobber.patch