https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117395
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #4)
> (In reply to Andrew Pinski from comment #1)
> > It is a slight regression from GCC 14 though.
> >
> > Which produced:
> > ```
> > foo:
> > ldr q31, [x0, 32]
> > sub sp, sp, #128
> > add sp, sp, 128
> > dup d0, v31.d[1]
> > add v0.4h, v0.4h, v31.4h
> > ret
> > ```
> >
> > But that is only because vget_low_s16/vget_high_s16 didn't expand to using
> > BIT_FIELD_REF before.
> >
>
> That's a good point, lowering it in RTL as we did before prevented the
> subreg inlining so reload didn't have to spill.
>
> I wonder if instead of using BIT_FIELD_REF we should instead use
> VEC_PERM_EXPR + VIEW_CONVERT. This would get us the right rotate again and
> recover the regression.
>
> We'd still need SRA for optimal codegen though without the stack allocations.
>
>
> (In reply to Richard Biener from comment #3)
> > Having memcpy in the IL is preventing SRA. There's probably no type
> > suitable for the single load/store memcpy inlining done by
> > gimple_fold_builtin_memory_op.
> >
>
> Yeah the original loop doesn't have memcpy but it's being idiom recognized.
>
> > We could try to fold all memcpy to aggregate char[] array assignments,
> > at least when a decl is involved on either side with the idea to
> > eventually elide TREE_ADDRESSABLE. But we need to make sure this
> > doesn't pessimize RTL expansion or other code dealing with memcpy but
> > not aggregate array copy.
> >
> > SRA could handle memcpy and friends transparently iff it were to locally
> > compute its own idea of TREE_ADDRESSABLE.
>
> I suppose the second option is better in general? does SRA have the same
> issue with memset? Would it be possible to get a rough sketch of what this
> would entail?

Yes, same with memset. memset can be replaced with aggregate init from {}.
SRA really wants to know whether all accesses to a decl are "direct" and
whether its address escapes. TREE_ADDRESSABLE is a conservative approximation
of that and works for disqualifying candidates early.
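
As a source-level analogy only (the folding discussed here would happen on GIMPLE, not in user code), the two functions below contrast the library-call form with the "direct" aggregate form. The names `struct pair`, `sum_with_calls` and `sum_with_aggregates` are made up for illustration, and `= {0}` stands in for the empty `{}` initializer, which ISO C only accepts from C23 on. Nothing here claims what SRA does with either function today.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical aggregate, loosely modelled on two 4 x int16 halves. */
struct pair { short lo[4]; short hi[4]; };

/* Library-call form: &tmp is passed to memset and memcpy, so a
   TREE_ADDRESSABLE-style check sees tmp as address-taken. */
int sum_with_calls(const struct pair *src)
{
    struct pair tmp;
    assert(src != NULL);
    memset(&tmp, 0, sizeof tmp);      /* zero the whole aggregate */
    memcpy(&tmp, src, sizeof tmp);    /* whole-aggregate copy */
    return tmp.lo[0] + tmp.hi[0];
}

/* Direct form: the same two operations written as aggregate init and
   aggregate assignment, without taking the address of tmp. */
int sum_with_aggregates(const struct pair *src)
{
    struct pair tmp = {0};            /* memset  -> aggregate init */
    assert(src != NULL);
    tmp = *src;                       /* memcpy  -> aggregate assignment */
    return tmp.lo[0] + tmp.hi[0];
}
```

Both functions compute the same value; the only difference is whether the local's address appears as a call argument.
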
SRA already walks all stmts, so it could instead update the disqualified
bitmap using gimple_ior_addresses_taken, like execute_update_addresses_taken
does, and explicitly handle memset and memcpy. It then of course also needs
to handle those calls explicitly during stmt rewriting and access analysis.
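
A toy model of that idea, to make the difference concrete. Everything in it is hypothetical: the statement kinds, `struct stmt`, and both functions are invented for this sketch and do not correspond to GCC's data structures or to gimple_ior_addresses_taken itself. It only shows how a locally computed "disqualified" set can be smaller than a conservative one when memset/memcpy operands are not treated as escapes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy IR; none of these names exist in GCC. */
enum stmt_kind {
    STMT_DIRECT_ACCESS,   /* plain load/store of decl `dst` */
    STMT_ADDR_ESCAPES,    /* &dst is stored or passed to an unknown call */
    STMT_MEMSET,          /* memset (&dst, ...) */
    STMT_MEMCPY           /* memcpy (&dst, &src, ...) */
};

struct stmt {
    enum stmt_kind kind;
    int dst;              /* decl index, 0..31 */
    int src;              /* decl index for memcpy, -1 otherwise */
};

/* Conservative view: any address use, including memset/memcpy operands,
   disqualifies the decl. */
unsigned conservative_disqualified(const struct stmt *s, size_t n)
{
    unsigned mask = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (s[i].kind == STMT_DIRECT_ACCESS)
            continue;
        mask |= 1u << s[i].dst;
        if (s[i].kind == STMT_MEMCPY && s[i].src >= 0)
            mask |= 1u << s[i].src;
    }
    return mask;
}

/* Local view: memset/memcpy are understood by the pass, so only real
   escapes disqualify a decl. */
unsigned local_disqualified(const struct stmt *s, size_t n)
{
    unsigned mask = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (s[i].kind == STMT_ADDR_ESCAPES)
            mask |= 1u << s[i].dst;
    return mask;
}
```

In the real pass the calls that are no longer disqualifying would still have to be modelled as accesses and rewritten, which this sketch does not attempt.
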