https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117395
--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> It is a slight regression from GCC 14 though.
>
> Which produced:
> ```
> foo:
>         ldr     q31, [x0, 32]
>         sub     sp, sp, #128
>         add     sp, sp, 128
>         dup     d0, v31.d[1]
>         add     v0.4h, v0.4h, v31.4h
>         ret
> ```
>
> But that is only because vget_low_s16/vget_high_s16 didn't expand to using
> BIT_FIELD_REF before.

That's a good point: lowering it in RTL as we did before prevented the subreg inlining, so reload didn't have to spill.

I wonder if, instead of using BIT_FIELD_REF, we should use VEC_PERM_EXPR + VIEW_CONVERT.  That would give us the right rotate again and recover the regression.  We'd still need SRA for optimal codegen, though, to avoid the stack allocations.

(In reply to Richard Biener from comment #3)
> Having memcpy in the IL is preventing SRA.  There's probably no type
> suitable for the single load/store memcpy inlining done by
> gimple_fold_builtin_memory_op.

Yeah, the original loop doesn't have a memcpy; it's being idiom-recognized into one.

> We could try to fold all memcpy to aggregate char[] array assignments,
> at least when a decl is involved on either side, with the idea to
> eventually elide TREE_ADDRESSABLE.  But we need to make sure this
> doesn't pessimize RTL expansion or other code dealing with memcpy but
> not aggregate array copy.
>
> SRA could handle memcpy and friends transparently iff it were to locally
> compute its own idea of TREE_ADDRESSABLE.

I suppose the second option is better in general?  Does SRA have the same issue with memset?

Would it be possible to get a rough sketch of what this would entail?
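For reference, a hypothetical reduction (not the original testcase from this PR) of the kind of operation the quoted asm performs: add the high half of a 128-bit vector of int16 to its low half. On AArch64 the vget_low_s16/vget_high_s16 intrinsics are what now expand to BIT_FIELD_REFs in GIMPLE; a portable fallback with the same semantics is included so the sketch compiles anywhere:

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical reduction: out[i] = in[i] + in[i + 4], i.e. add the
   high 64-bit half of an int16x8 vector to its low half.  */
void add_halves(const int16_t in[8], int16_t out[4])
{
#if defined(__ARM_NEON)
    int16x8_t v = vld1q_s16(in);
    /* vget_low_s16/vget_high_s16 are the intrinsics that now lower
       to BIT_FIELD_REF in GIMPLE.  */
    vst1_s16(out, vadd_s16(vget_low_s16(v), vget_high_s16(v)));
#else
    /* Portable scalar fallback with identical semantics.  */
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)(in[i] + in[i + 4]);
#endif
}
```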
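To make the SRA observation concrete, here is a minimal sketch (a hypothetical example, not taken from this PR) of why a memcpy in the IL is harder for SRA than a plain aggregate assignment: memcpy takes the address of the local, so the object is address-taken (TREE_ADDRESSABLE), while a direct aggregate copy leaves the local free to be scalarized:

```c
#include <string.h>

struct pair { int lo; int hi; };

static int sum_via_memcpy(const struct pair *p)
{
    struct pair tmp;
    /* &tmp escapes into the memcpy call; SRA has to see through the
       builtin before it can scalarize tmp.  */
    memcpy(&tmp, p, sizeof tmp);
    return tmp.lo + tmp.hi;
}

static int sum_via_assignment(const struct pair *p)
{
    /* Plain aggregate copy: tmp is never address-taken, so SRA can
       replace it with two scalar registers.  */
    struct pair tmp = *p;
    return tmp.lo + tmp.hi;
}
```

Both functions are semantically identical for a trivially-copyable type, which is what makes folding memcpy to an aggregate assignment (or teaching SRA its own local notion of addressability) attractive.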