Richard Biener <[email protected]> writes:
>> On 08.08.2024 at 15:12, Richard Sandiford <[email protected]> wrote:
>>> PR tree-optimization/116274
>>> * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
>>> and stores as simple scalar stmts when they access a non-global,
>>> not address-taken variable that doesn't have BLKmode assigned.
>>>
>>> * gcc.target/i386/pr116274.c: New testcase.
>>> ---
>>> gcc/testsuite/gcc.target/i386/pr116274.c | 9 +++++++++
>>> gcc/tree-vect-slp.cc | 12 +++++++++++-
>>> 2 files changed, 20 insertions(+), 1 deletion(-)
>>> create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>>>
>>> diff --git a/gcc/testsuite/gcc.target/i386/pr116274.c b/gcc/testsuite/gcc.target/i386/pr116274.c
>>> new file mode 100644
>>> index 00000000000..d5811344b93
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/i386/pr116274.c
>>> @@ -0,0 +1,9 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -fdump-tree-slp2-optimized" } */
>>> +
>>> +struct a { long x,y; };
>>> +long test(struct a a) { return a.x+a.y; }
>>> +
>>> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
>>> +/* { dg-final { scan-assembler-times "addl|leaq" 1 } } */
>>> +/* { dg-final { scan-assembler-not "padd" } } */
>>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>>> index 3464d0c0e23..e43ff721100 100644
>>> --- a/gcc/tree-vect-slp.cc
>>> +++ b/gcc/tree-vect-slp.cc
>>> @@ -7807,7 +7807,17 @@ next_lane:
>>> vect_cost_for_stmt kind;
>>> if (STMT_VINFO_DATA_REF (orig_stmt_info))
>>> {
>>> - if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>>> + data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
>>> + tree base = get_base_address (DR_REF (dr));
>>> + /* When the scalar access is to a non-global not address-taken
>>> + decl that is not BLKmode assume we can access it with a single
>>> + non-load/store instruction. */
>>> + if (DECL_P (base)
>>> + && !is_global_var (base)
>>> + && !TREE_ADDRESSABLE (base)
>>> + && DECL_MODE (base) != BLKmode)
>>> + kind = scalar_stmt;
>>> + else if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>>> kind = scalar_load;
>>> else
>>> kind = scalar_store;
>>
>> LGTM FWIW, but did you consider skipping the cost altogether?
>> I'm not sure what the scalar_stmt would correspond to in practice,
>> if we assume that the ABI (for parameters/returns) or RA (for locals)
>> puts the data in a sensible register class for the datatype.
>
> On x86_64 you get up to two eightbytes in two gpr or float regs, so with an
> example with four int we’d get a scalar shift for the second and fourth int
> and with eight short you get to the point where the vector marshaling might
> be profitable. So it’s a heuristic that says it’s likely not zero cost but
> definitely not as high as a load. Anything better would need to know the
> actual register passings.
Ah, yeah, fair enough. I suppose that would be true for aarch64 too
on things like:
struct a { char f[4]; };
struct a test(struct a a) {
a.f[0] += 1;
a.f[1] += 2;
a.f[2] += 3;
a.f[3] += 4;
return a;
}
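And for the x86_64 four-int case you describe, I guess the shape would be
something like this (just a sketch, assuming the usual SysV classification
of the struct into two eightbytes, one GPR each):

/* Hypothetical example: under the SysV ABI, a,b land in one GPR
   and c,d in the other.  */
struct s4 { int a, b, c, d; };
int sum4 (struct s4 s) { return s.a + s.b + s.c + s.d; }

where s.a and s.c are directly accessible but s.b and s.d each need a
32-bit shift out of their register, so the scalar accesses are cheap
but, as you say, not free.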
It's just that there are important cases where it wouldn't happen for
floats on aarch64, and the scalar_stmt cost for floats is typically 2
(see the sketch below).
But like you say, that could be fixed later, and this should be a
strict improvement over the status quo.
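For the float case I mean, a minimal sketch (assuming AAPCS64 passes the
homogeneous float aggregate in individual FP registers) would be:

/* Hypothetical example: under AAPCS64, x and y arrive in s0 and s1.  */
struct b { float x, y; };
float testf (struct b b) { return b.x + b.y; }

where the scalar accesses are essentially free, rather than costing 2.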
Richard