[Bug target/95962] Inefficient code for simple arm_neon.h iota operation

rsandifo at gcc dot gnu.org via Gcc-bugs Fri, 20 Aug 2021 04:52:13 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962


--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #1)
> We generate the correct code at -O3 but not -O2.
> 
> At -O3 we generate
> 
> foo:
>         adrp    x0, .LC0
>         sub     sp, sp, #16
>         ldr     q0, [x0, #:lo12:.LC0]
>         add     sp, sp, 16
>         ret
> 
> where the problem seems to be that at -O2 store merging has broken up the
> construction of `array` into two separate memory accesses:
> 
>   MEM <unsigned long> [(int *)&array] = 4294967296;
>   MEM <unsigned long> [(int *)&array + 8B] = 12884901890;
> 
> whereas at -O3 we still have a single assignment:
> 
>   MEM <vector(4) int> [(int *)&array] = { 0, 1, 2, 3 };
> 
> I'm not sure that making these loads gimple level would help, even if we
> did it: we'd still have the explicit MEMs created by store merging.
If we folded them to gimple loads, the gimple optimisers should replace
the MEM with an assignment of the VECTOR_CST { 0, 1, 2, 3 } to an SSA name,
with the function returning the SSA name.
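
As a rough source-level analogy of that fold (not GCC internals), the
sketch below contrasts loading the vector from a local array with returning
the vector constant directly.  It assumes the GNU C vector extension as a
portable stand-in for arm_neon.h; the type name v4si and all function names
are hypothetical and do not come from the bug report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GNU C vector extension (GCC/Clang): an assumed stand-in for the
   arm_neon.h vector type, so the sketch builds on non-Arm hosts.  */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Unfolded shape: fill a stack array, then load the vector from it.
   The memcpy plays the role of the intrinsic load.  */
v4si iota_via_array(void)
{
    int32_t array[4] = { 0, 1, 2, 3 };
    v4si v;
    memcpy(&v, array, sizeof v);
    return v;
}

/* Folded shape: the value is a vector constant from the start, so no
   stack temporary is needed at the source level.  */
v4si iota_folded(void)
{
    return (v4si){ 0, 1, 2, 3 };
}

/* Both shapes must produce the same lanes.  */
void check_same_lanes(void)
{
    v4si a = iota_via_array();
    v4si b = iota_folded();
    for (int i = 0; i < 4; i++)
        assert(a[i] == b[i] && a[i] == i);
}
```

Calling check_same_lanes() from a test driver confirms the two shapes are
equivalent, which is what makes the fold legitimate.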

expand will convert this back into a memory access, in the form of
an RTL constant pool load.  But that will avoid the stack temporary
and thus the pointless stack adjustments.
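
For reference, the two 64-bit constants in the quoted store-merged MEMs are
just the pairs { 0, 1 } and { 2, 3 } packed in little-endian lane order.
A minimal sketch of that arithmetic follows; the helper names are
hypothetical and little-endian layout is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 32-bit lanes into one 64-bit value as a little-endian target
   lays them out in memory: `lo` sits at the lower address.  */
uint64_t pack_pair(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Recover the four lanes from the two 64-bit stores.  */
void unpack_stores(uint64_t first, uint64_t second, uint32_t out[4])
{
    out[0] = (uint32_t)first;
    out[1] = (uint32_t)(first >> 32);
    out[2] = (uint32_t)second;
    out[3] = (uint32_t)(second >> 32);
}

/* The constants from the quoted -O2 dump round-trip to { 0, 1, 2, 3 }.  */
void check_quoted_constants(void)
{
    uint32_t lanes[4];
    assert(pack_pair(0, 1) == 4294967296ULL);   /* 1 << 32 */
    assert(pack_pair(2, 3) == 12884901890ULL);  /* (3 << 32) | 2 */
    unpack_stores(4294967296ULL, 12884901890ULL, lanes);
    for (uint32_t i = 0; i < 4; i++)
        assert(lanes[i] == i);
}
```

So the two stores together still write { 0, 1, 2, 3 }; only the shape of
the access has changed.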
