https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89967
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Ok, so my current patches actually make things worse, because SRA decides to scalarize in between the builtins. That is because we don't look back over things that don't alias (in this case the clobber):

```
  # .MEM_122 = VDEF <.MEM_121>
  g1b = D.23266;
  # .MEM_123 = VDEF <.MEM_122>
  __b ={v} {CLOBBER(eos)};
  _91 = BIT_FIELD_REF <o1v_39(D), 32, 96>;
  _15 = (sizetype) _91;
  _16 = in_53(D) + _15;
  # .MEM_127 = VDEF <.MEM_123>
  __b = g1b;
  # .MEM_128 = VDEF <.MEM_127>
  D.23259 = __builtin_aarch64_ld2_lanev16qi_usus (_16, g1b, 12);
```

That is with -fstack-reuse=none; we get 4 mov instructions in the non-loop case, while the loop case is still bad (and maybe even worse). Note the 4 mov here is an RTL issue and maybe even a cost issue; I have not fully looked into that. movi should be just as cheap as mov, so I am not sure why we are doing CSE of a constant here.

The loop case is that we have:

```
  a = ...
L1:
  b = ...(a);
  c = ...(b);
  use a.val[0], a.val[1]
  a = b;
  if (...) goto L1;
```

Without the copy propagation, SRA is a little smarter about scalarizing the copies, but afterwards SRA decides it needs to scalarize a, b, and c anyway ("because why not").
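
For reference, a minimal sketch of the loop shape above. This is not the testcase from this PR; the function name, the pointer arithmetic and how the result is consumed are made up, only the vld2q_lane_u8 / loop-carried-copy structure is meant to match the GIMPLE:

```
/* Sketch only: names and pointer math are hypothetical; the point is the
   ld2-lane builtin feeding on the previous aggregate plus the a = b copy.  */
#include <arm_neon.h>

uint8x16_t
loop_shape (const uint8_t *in, uint8x16x2_t a, int n)
{
  uint8x16_t acc = vdupq_n_u8 (0);
  for (int i = 0; i < n; i++)
    {
      /* b = ...(a): vld2q_lane_u8 takes the previous pair as the merge
         operand, like __builtin_aarch64_ld2_lanev16qi_usus above.  */
      uint8x16x2_t b = vld2q_lane_u8 (in + 2 * i, a, 12);
      /* c = ...(b): a second dependent load of the same kind.  */
      uint8x16x2_t c = vld2q_lane_u8 (in + 2 * i + 32, b, 12);
      /* use a.val[0], a.val[1] while the copy a = b is still pending.  */
      acc = veorq_u8 (acc, veorq_u8 (a.val[0], a.val[1]));
      acc = veorq_u8 (acc, c.val[0]);
      /* a = b: the loop-carried aggregate copy that copy-prop and SRA
         interact badly on.  */
      a = b;
    }
  return acc;
}
```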