Hi all,

This patch introduces an internal tuning flag to break up VL-based scalar ops
into GP-reg scalar ops, with the VL read kept as a separate instruction. This
can be preferable on some CPUs.

I went for a tune param rather than extending the RTX costs because our cost
tables aren't set up to track this kind of distinction.

I've confirmed that on the simple loop:
void vadd (int *dst, int *op1, int *op2, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] = op1[i] + op2[i];
}

we now split the incw into a cntw outside the loop and a plain add inside it:

+       cntw    x5
...
loop:
-       incw    x4
+       add     x4, x4, x5
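
For reference, here's a rough sketch (not the attached patch, which has the
authoritative form; the exact operand and predicate handling below is an
assumption) of the kind of check the add<mode>3 expander gains:

  /* Sketch only: if the new tuning flag is on and the addend is a
     CONST_POLY_INT (a VL-based constant such as the one behind incw),
     load it into a GP register first so the cntw becomes a separate,
     loop-invariant instruction that can be hoisted out of the loop.  */
  if ((aarch64_tune_params.extra_tuning_flags
       & AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS)
      && CONST_POLY_INT_P (operands[2]))
    operands[2] = force_reg (<MODE>mode, operands[2]);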

Bootstrapped and tested on aarch64-none-linux-gnu.
This is a minimally invasive fix to help performance with just
-mcpu=neoverse-v1 for GCC 11, so I'd like to have it in stage 4 if possible,
but I'd appreciate some feedback on its risk.

Thanks,
Kyrill

gcc/ChangeLog:

        * config/aarch64/aarch64-tuning-flags.def (cse_sve_vl_constants):
        Define.
        * config/aarch64/aarch64.md (add<mode>3): Force CONST_POLY_INT
        immediates into a register when the above tuning flag is enabled.
        * config/aarch64/aarch64.c (neoversev1_tunings): Enable
        AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS.
        (aarch64_rtx_costs): Use AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/sve/cse_sve_vl_constants_1.c: New test.

Attachment: cse_vl_constants.patch