https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117395

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> It is a slight regression from GCC 14 though.
> 
> Which produced:
> ```
> foo:
>         ldr     q31, [x0, 32]
>         sub     sp, sp, #128
>         add     sp, sp, 128
>         dup     d0, v31.d[1]
>         add     v0.4h, v0.4h, v31.4h
>         ret
> ```
> 
> But that is only because vget_low_s16/vget_high_s16 didn't expand to using
> BIT_FIELD_REF before.
> 

That's a good point.  Lowering it in RTL, as we did before, prevented the
subreg inlining, so reload didn't have to spill.

I wonder whether, instead of BIT_FIELD_REF, we should use VEC_PERM_EXPR
+ VIEW_CONVERT.  This would get us the right rotate again and fix the
regression.
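
For illustration only, here is a source-level sketch of the permute + view
convert shape, written with GCC's generic vector extensions rather than the
arm_neon.h intrinsics.  It is not the proposed patch and not the test case
from this PR: the type names and the functions rotate_halves, add_halves and
low_half_bits are made up for the example.  __builtin_shuffle is GCC-specific,
and my understanding is that it and the vector cast are represented as
VEC_PERM_EXPR and VIEW_CONVERT_EXPR in GIMPLE.

```c
#include <stdint.h>

/* Generic vector types standing in for the 128-bit NEON types (assumption). */
typedef int16_t v8hi __attribute__ ((vector_size (16)));
typedef int64_t v2di __attribute__ ((vector_size (16)));

/* Swap the two 64-bit halves with a permute: lanes 4..7 move to lanes 0..3
   and lanes 0..3 move to lanes 4..7.  */
static inline v8hi
rotate_halves (v8hi v)
{
  const v8hi sel = { 4, 5, 6, 7, 0, 1, 2, 3 };
  return __builtin_shuffle (v, sel);
}

/* After the rotate, lanes 0..3 of the sum hold low[i] + high[i].  */
v8hi
add_halves (v8hi v)
{
  return v + rotate_halves (v);
}

/* Reinterpret the 8 x 16-bit vector as 2 x 64-bit (same size, no data
   movement) and return element 0, i.e. the bits of lanes 0..3.  */
int64_t
low_half_bits (v8hi v)
{
  v2di wide = (v2di) v;
  return wide[0];
}
```

With this shape the "get low/high half" step is an ordinary permute followed
by a same-size reinterpretation, rather than a bit-field extract of a
sub-vector.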

We'd still need SRA to get optimal codegen without the stack allocations,
though.


(In reply to Richard Biener from comment #3)
> Having memcpy in the IL is preventing SRA.  There's probably no type
> suitable for the single load/store memcpy inlining done by
> gimple_fold_builtin_memory_op.
> 

Yeah, the original loop doesn't contain a memcpy, but one is introduced by
idiom recognition.
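
The loop from the testcase isn't reproduced here, so as a hypothetical
stand-in: a plain element-by-element copy like the one below is the kind of
loop that GCC's loop distribution pattern matching
(-ftree-loop-distribute-patterns) can replace with a memcpy call.  The
function name copy_bytes is made up, and whether the replacement actually
happens depends on the GCC version, the optimization level and the target.

```c
#include <stddef.h>

/* A byte-wise copy loop with no memcpy in the source.  The restrict
   qualifiers tell the compiler the two buffers do not overlap, which is
   what allows a memcpy (rather than memmove) to be used for it.  */
void
copy_bytes (unsigned char *restrict dst, const unsigned char *restrict src,
            size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i];
}
```

Compiling something like this with -fdump-tree-optimized is one way to check
whether the loop ended up as a memcpy call in the IL.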

> We could try to fold all memcpy to aggregate char[] array assignments,
> at least when a decl is involved on either side with the idea to
> eventually elide TREE_ADDRESSABLE.  But we need to make sure this
> doesn't pessimize RTL expansion or other code dealing with memcpy but
> not aggregate array copy.
> 
> SRA could handle memcpy and friends transparently iff it were to locally
> compute its own idea of TREE_ADDRESSABLE.

I suppose the second option is better in general?  Does SRA have the same
issue with memset?  Would it be possible to get a rough sketch of what this
would entail?
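
To make the question concrete, here is a hypothetical pair of functions (not
from the PR; struct quad and both function names are invented) where a local
aggregate only has its address taken by memcpy or memset.  Passing &tmp to
the library call is what makes the variable addressable, which is the
situation described above as blocking SRA.  Whether a given GCC leaves the
call in the IL or folds it away depends on the size, the target and the
version.

```c
#include <string.h>

/* Hypothetical aggregate of four longs with no padding between fields.  */
struct quad { long a, b, c, d; };

/* The only address-taking use of tmp is the memcpy; every other access is
   a plain field read that would ideally be scalarized.  src must point to
   at least four longs.  */
long
sum_via_memcpy (const long *src)
{
  struct quad tmp;
  memcpy (&tmp, src, sizeof tmp);
  return tmp.a + tmp.b + tmp.c + tmp.d;
}

/* The same pattern with memset: tmp is zeroed through its address, then
   one field is overwritten before the fields are summed.  */
long
sum_via_memset (void)
{
  struct quad tmp;
  memset (&tmp, 0, sizeof tmp);
  tmp.b = 5;
  return tmp.a + tmp.b + tmp.c + tmp.d;
}
```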
