> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Thursday, January 2, 2025 4:52 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH]AArch64: Fix costing of emulated gathers/scatters
> [PR118188]
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi All,
> >
> > When a target does not support gathers and scatters the vectorizer tries to
> > emulate these using scalar loads/stores and a reconstruction of vectors from
> > scalars.
> >
> > The loads are still marked with VMAT_GATHER_SCATTER to indicate that they
> > are gathers/scatters; however, the vectorizer also asks the target to cost
> > the instruction that generates the indexes for the emulated instructions.
> >
> > This is done by asking the target to cost vec_to_scalar and vec_construct
> > with a stmt_vinfo whose access type is VMAT_GATHER_SCATTER.
> >
> > Since Adv. SIMD does not have an LD1 variant that takes an Adv. SIMD Scalar
> > element, the operation is lowered entirely into a sequence of GPR loads to
> > create the x registers for the indexes.
> >
> > At the moment, however, we don't cost these, and so the vectorizer thinks
> > that when it emulates the instructions it's much cheaper than using an
> > actual gather/scatter with SVE.  Consider:
> >
> > #define iterations 100000
> > #define LEN_1D 32000
> >
> > float a[LEN_1D], b[LEN_1D];
> >
> > float
> > s4115 (int *ip)
> > {
> >     float sum = 0.;
> >     for (int i = 0; i < LEN_1D; i++)
> >         {
> >             sum += a[i] * b[ip[i]];
> >         }
> >     return sum;
> > }
> >
> > which before this patch with -mcpu=<sve-core> generates:
> >
> > .L2:
> >         add     x3, x0, x1
> >         ldrsw   x4, [x0, x1]
> >         ldrsw   x6, [x3, 4]
> >         ldpsw   x3, x5, [x3, 8]
> >         ldr     s1, [x2, x4, lsl 2]
> >         ldr     s30, [x2, x6, lsl 2]
> >         ldr     s31, [x2, x5, lsl 2]
> >         ldr     s29, [x2, x3, lsl 2]
> >         uzp1    v30.2s, v30.2s, v31.2s
> >         ldr     q31, [x7, x1]
> >         add     x1, x1, 16
> >         uzp1    v1.2s, v1.2s, v29.2s
> >         zip1    v30.4s, v1.4s, v30.4s
> >         fmla    v0.4s, v31.4s, v30.4s
> >         cmp     x1, x8
> >         bne     .L2
> >
> > but during costing:
> >
> > a[i_18] 1 times vector_load costs 4 in body
> > *_4 1 times unaligned_load (misalign -1) costs 4 in body
> > b[_5] 4 times vec_to_scalar costs 32 in body
> > b[_5] 4 times scalar_load costs 16 in body
> > b[_5] 1 times vec_construct costs 3 in body
> > _1 * _6 1 times vector_stmt costs 2 in body
> > _7 + sum_16 1 times scalar_to_vec costs 4 in prologue
> > _7 + sum_16 1 times vector_stmt costs 2 in epilogue
> > _7 + sum_16 1 times vec_to_scalar costs 4 in epilogue
> > _7 + sum_16 1 times vector_stmt costs 2 in body
> >
> > Here we see that the latency for the vec_to_scalar is very high.  We know
> > the intermediate vector isn't usable by the target ISA and will always be
> > elided.  However, these latencies need to remain high because when costing
> > gather/scatter IFNs we still pass the nunits of the type along.  In other
> > words, the vectorizer is still costing vector gathers/scatters as scalar
> > loads/stores.
> >
> > Lowering the cost for the emulated gathers would result in emulation being
> > seemingly cheaper.  So while the emulated costs are very high, they need to
> > be higher than those for the IFN costing.
> >
> > i.e. the vectorizer generates:
> >
> >   vect__5.9_8 = MEM <vector(4) intD.7> [(intD.7 *)vectp_ip.7_14];
> >   _35 = BIT_FIELD_REF <vect__5.9_8, 32, 0>;
> >   _36 = (sizetype) _35;
> >   _37 = _36 * 4;
> >   _38 = _34 + _37;
> >   _39 = (voidD.55 *) _38;
> >   # VUSE <.MEM_10(D)>
> >   _40 = MEM[(floatD.32 *)_39];
> >
> > which after IVopts is:
> >
> >   _63 = &MEM <vector(4) int> [(int *)ip_11(D) + ivtmp.19_27 * 1];
> >   _47 = BIT_FIELD_REF <MEM <vector(4) int> [(int *)_63], 32, 64>;
> >   _41 = BIT_FIELD_REF <MEM <vector(4) int> [(int *)_63], 32, 32>;
> >   _35 = BIT_FIELD_REF <MEM <vector(4) int> [(int *)_63], 32, 0>;
> >   _53 = BIT_FIELD_REF <MEM <vector(4) int> [(int *)_63], 32, 96>;
> >
> > Which we correctly lower in RTL to individual loads, avoiding the repeated
> > umovs.
> >
> > As such, we should cost the vec_to_scalar as GPR loads and also do so for
> > the throughput, which we at the moment cost as:
> >
> >   note:  Vector issue estimate:
> >   note:    load operations = 6
> >   note:    store operations = 0
> >   note:    general operations = 6
> >   note:    reduction latency = 2
> >   note:    estimated min cycles per iteration = 2.000000
> >
> > Which means 3 loads for the GPR indexes are missing, making it seem like the
> > emulated loop has a much lower cycles per iter than it actually does, since
> > the bottleneck on the load units is not modelled.
> 
> Yeah, currently the memory operations for an emulated 4-element
> load/store would be:
> 
> - 1 vector load for the indices
> - 4 loads for the gather load, or 4 stores for the scatter store
> 
> (and then +1 for the a[i] access in this case).
> 
> Therefore...
> 
> > But worse, because the vectorizer costs gather/scatter IFNs as scalar
> > load/stores, the number of loads required for an SVE gather is always much
> > higher than for the equivalent emulated variant.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     PR target/118188
> >     * config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops): Adjust
> >     throughput of emulated gathers and scatters.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR target/118188
> >     * gcc.target/aarch64/sve/gather_load_12.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 6bb4bdf2472e62d9b066a06561da8e516f1b3c3e..cb9b155826d12b622ae0df1736e4b042d01cf56a 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -17358,6 +17358,25 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
> >     return;
> >      }
> >
> > +  /* Detect the case where we are using an emulated gather/scatter.  When a
> > +     target does not support gathers and scatters directly the vectorizer
> > +     emulates these by constructing an index vector and then issuing an
> > +     extraction for every lane in the vector.  This is subsequently lowered
> > +     by veclower into a series of loads which creates the scalar indexes for
> > +     the subsequent loads.  After the final loads are done it issues a
> > +     vec_construct to recreate the vector from the scalar.  For costing when
> > +     we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we are dealing
> > +     with an emulated instruction and should adjust costing properly.  */
> > +  if (kind == vec_to_scalar
> > +      && (m_vec_flags & VEC_ADVSIMD)
> > +      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
> > +    {
> > +      if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
> > +   ops->loads += count - 1;
> > +      else
> > +   ops->stores += count - 1;
> 
> ...since the aim is to replace:
> 
> - 1 vector load for the indices
> 
> with
> 
> - 4 vector loads for the indices
> 
> I think this should be changing ops->loads even for scatter stores.
> 

Ah yes that's true... because the indexes for the stores are loads themselves.

> But this assumes that the gather load indices come directly from memory.
> That isn't always the case.  If we have:
> 
> float
> s4115 (int *ip)
> {
>     float sum = 0.;
>     for (int i = 0; i < LEN_1D; i++)
>       {
>         sum += a[i] * b[ip[i] + 1];
>       }
>     return sum;
> }
> 
> then we'll have a vector load, a vector add, and then the 4 umovs
> that, as you said above, were elided by post-vectoriser optimisations
> in your b[ip[i]] example:
> 
> .L2:
>         ldr     q31, [x0, x1]
>         ldr     q29, [x6, x1]
>         add     x1, x1, 16
>         add     v31.4s, v31.4s, v26.4s
>         umov    w5, v31.s[1]
>         umov    w4, v31.s[3]
>         umov    w3, v31.s[2]
>         fmov    w8, s31
>         ldr     s30, [x2, w5, sxtw 2]
>         ldr     s27, [x2, w4, sxtw 2]
>         ldr     s31, [x2, w8, sxtw 2]
>         ldr     s28, [x2, w3, sxtw 2]
>         uzp1    v30.2s, v30.2s, v27.2s
>         uzp1    v31.2s, v31.2s, v28.2s
>         zip1    v31.4s, v31.4s, v30.4s
>         fmla    v0.4s, v29.4s, v31.4s
>         cmp     x1, x7
>         bne     .L2
> 
> These umovs are currently modelled:
> 
>   note:    load operations = 6
>   note:    store operations = 0
>   note:    general operations = 7
>   note:    reduction latency = 2
> 
> although I'm guessing it should be:
> 
>   note:    load operations = 6
>   note:    store operations = 0
>   note:    general operations = 9
>   note:    reduction latency = 2
> 
> instead.  The underaccounting comes from vec_construct, which counts 1
> rather than 3 operations for moving the scalars back to a vector.
> 
> This patch would remove the umov accounting to give:
> 
>   note:    load operations = 9
>   note:    store operations = 0
>   note:    general operations = 3
>   note:    reduction latency = 2
> 
> Counting loads rather than general ops wouldn't matter for tight loops
> like these in which memory dominates anyway, since counting loads is
> then pessimistically correct.  But in a loop that is compute bound,
> it's probably more important to get this right.
> 
> So I think ideally, we should try to detect whether the indices come
> directly from memory or are the result of arithmetic.  In the former case,
> we should do the loads adjustment above.  In the latter case, we should
> keep the vec_to_scalar accounting unchanged.

I can do this but...

> 
> Of course, these umovs are likely to be more throughput-limited than we
> model, but that's a separate pre-existing problem...

I agree with the above.  The reason I only updated loads is, as you already
said, that accounting for the umovs as general operations doesn't account for
the bottleneck.  In general umovs are more throughput-limited than loads, and
the number of general ops we can execute would, in the example above,
misrepresent the throughput, since the model still thinks it can execute all
the transfers plus all the scalar loads in one cycle.  As the number of VX
increases, modelling them as general ops incorrectly favors the emulated
gather.  See e.g. Cortex-X925.

By still modelling them as loads it more accurately models that the data loads
have to wait for the indexes.
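
To make that concrete (purely illustrative issue rates, not taken from any
particular core's software optimisation guide): assume a core that can issue 3
loads and 6 general ops per cycle.  With the umovs counted as general ops the
example above estimates max (6/3, 7/6) = 2 cycles per iteration; with them
counted as loads it becomes max (9/3, 3/6) = 3 cycles, which is much closer to
the transfer-limited behaviour we actually see.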

The problem with modelling them as general ops is that, when compared to the
IFN for SVE, they end up being cheaper.  For instance, the umov case above is
faster using an actual SVE gather.

So if we really want to be accurate we have to model the vec transfers, as
otherwise the model still treats the index transfers as effectively free.
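
For reference, the check you describe (indices coming straight from memory)
could look roughly like the sketch below.  The helper name is made up and how
the scalar offset would be obtained at costing time (e.g. from the stmt's
gather_scatter_info) is an assumption, but the gimple queries themselves exist:

  /* Hypothetical helper: return true if OFFSET, the scalar index of an
     emulated gather/scatter, is produced directly by a memory load.  In that
     case the vec_to_scalar lane extracts are expected to become GPR loads
     rather than UMOVs, so count_ops could bump ops->loads; otherwise it would
     leave the general-op accounting alone.  */
  static bool
  aarch64_gather_offset_loaded_p (tree offset)
  {
    if (TREE_CODE (offset) != SSA_NAME)
      return false;
    gimple *def = SSA_NAME_DEF_STMT (offset);
    return is_gimple_assign (def) && gimple_assign_load_p (def);
  }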

> 
> For the scatter store case:
> 
> float
> s4115 (int *ip)
> {
>   for (int i = 0; i < LEN_1D; i++)
>     {
>       b[ip[i]] = a[i] + 1;
>     }
> }
> 
> the vectoriser (unhelpfully) costs both the index-to-scalars and
> data-to-scalars as vec_to_scalar, meaning that we'll double-count
> the extra loads.
> 

I think that's more accurate though.

This example is load Q -> umov -> store.

This is a 3-insn dependency chain, where modelling the umov as a load more
accurately depicts the dependency on the preceding load.

Thanks,
Tamar

> Thanks,
> Richard
> 
> > +      return;
> > +    }
> >
> >    /* Count the basic operation cost associated with KIND.  */
> >    switch (kind)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..d550f005d638f260973312fbccd05e2bb9aef1a7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
> > +
> > +#define iterations 100000
> > +#define LEN_1D 32000
> > +
> > +float a[LEN_1D], b[LEN_1D];
> > +
> > +float
> > +s4115 (int *ip)
> > +{
> > +    float sum = 0.;
> > +    for (int i = 0; i < LEN_1D; i++)
> > +      {
> > +        sum += a[i] * b[ip[i]];
> > +      }
> > +    return sum;
> > +}
> > +
> > +/* { dg-final { scan-assembler {\s+ld1w\t} } } */
