[Bug rtl-optimization/103641] New: [aarch64][11 regression] Severe compile time regression in SLP vectorize step

2021-12-10 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

Bug ID: 103641
   Summary: [aarch64][11 regression] Severe compile time
regression in SLP vectorize step
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Created attachment 51966
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51966&action=edit
aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c -ftime-report -ftime-report-details

While GCC 11.2 has become noticeably better at generating NEON code, on some
files it hangs for 15-30 seconds or more in the SLP vectorization step.

I haven't narrowed this down to a specific cause yet because I don't know much
about the GCC internals, but it is *extremely* noticeable in the xxHash library
(https://github.com/Cyan4973/xxHash).

This is a test compiling xxhash.c from Git revision
a17161efb1d2de151857277628678b0e0b486155.

This was done on a Core i5-430M with 8 GB RAM and an SSD, running Debian
Bullseye amd64. GCC 10 (10.2.1-6) was installed from the Debian repos; GCC 11
(11.2.0) was built from the release tarball with similar flags. While this may
introduce bias, the two compilers get very similar times when the SLP
vectorizer is off.

$ time aarch64-linux-gnu-gcc-10 -O3 -c xxhash.c

real    0m3.596s
user    0m3.270s
sys     0m0.149s
$ time aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c

real    0m31.579s
user    0m31.314s
sys     0m0.112s

When disabling the NEON intrinsics with `-DXXH_VECTOR=0`, it only takes ~21
seconds. 

Time variable                        usr           sys           wall           GGC
 phase opt and generate         :  31.46 ( 97%)  0.24 ( 32%)  31.80 ( 96%)    54M ( 63%)
 callgraph functions expansion  :  31.01 ( 96%)  0.18 ( 24%)  31.29 ( 94%)    42M ( 49%)
 tree slp vectorization         :  28.35 ( 88%)  0.03 (  4%)  28.37 ( 85%)  9941k ( 11%)
 TOTAL                          :  32.34         0.75         33.20            86M

This is significantly worse on my Pi 4B, where an ARMv7->AArch64 cross build
took 3 minutes, although I presume that is mostly due to the compiler being a
32-bit binary and the CPU being much slower.

[Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step

2021-12-10 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #19 from Devin Hussey  ---
> The new costs on AArch64 have a vector multiplication cost of 4, which is 
> very reasonable.

Would this include mulv2di3 by any chance?

Because another thing I noticed is that GCC is also treating 64-bit vector
multiplies as if they were free, but it just ends up scalarizing them.
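
For illustration, here is a minimal reduction of that pattern (my own example,
not from the attachment): a 64x64-bit vector multiply written with GCC's
vector_size extension. NEON has no 64-bit vector multiply instruction, so there
is no mulv2di3 pattern and GCC falls back to the scalar multiplier for each
lane.

/* Hypothetical reduction: a v2di multiply that AArch64 must scalarize. */
#include <stdint.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));

u64x2 mul_u64x2(u64x2 a, u64x2 b)
{
    /* GCC extracts both lanes, uses the scalar MUL instruction,
       and inserts the results back into the vector. */
    return a * b;
}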

[Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

Bug ID: 103781
   Summary: [AArch64, 11 regr.] Failed partial vectorization of
mulv2di3
   Product: gcc
   Version: 11.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

As of GCC 11, the AArch64 backend is very greedy about vectorizing mulv2di3.
However, there is no mulv2di3 instruction pattern, so it extracts the elements
from the vector, multiplies them on the scalar units, and inserts the results
back.

The bad codegen should be obvious.

#include <stdint.h>

void fma_u64(uint64_t *restrict acc, const uint64_t *restrict x,
             const uint64_t *restrict y)
{
    for (int i = 0; i < 16384; i++) {
        acc[0] += *x++ * *y++;
        acc[1] += *x++ * *y++;
    }
}

gcc-11 -O3

fma_u64:
.LFB0:
        .cfi_startproc
        ldr     q1, [x0]
        add     x6, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x4, [x1], 16
        ldr     x5, [x2], 16
        ldr     x3, [x1, -8]
        mul     x4, x4, x5
        ldr     x5, [x2, -8]
        fmov    d0, x4
        ins     v0.d[1], x5
        mul     x3, x3, x5
        ins     v0.d[1], x3
        add     v1.2d, v1.2d, v0.2d
        cmp     x1, x6
        bne     .L2
        str     q1, [x0]
        ret
        .cfi_endproc

GCC 10.2.1 emits better code.

fma_u64:
.LFB0:
        .cfi_startproc
        ldp     x4, x3, [x0]
        add     x9, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x8, [x1], 16
        ldr     x7, [x2], 16
        ldr     x6, [x1, -8]
        ldr     x5, [x2, -8]
        madd    x4, x8, x7, x4
        madd    x3, x6, x5, x3
        cmp     x9, x1
        bne     .L2
        stp     x4, x3, [x0]
        ret
        .cfi_endproc

However, the ideal code would be a 2-iteration unroll.

Side note: why not ldp in the loop?
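
For reference, a hand-written sketch of what I mean (my interpretation of
"ideal", not compiler output and not from the report): two source iterations
per loop iteration with independent accumulators, so the loads can pair up
into ldp and each multiply-add can become a madd.

/* Hand-written sketch, assumed interpretation of the unrolling. */
#include <stdint.h>

void fma_u64_unrolled(uint64_t *restrict acc, const uint64_t *restrict x,
                      const uint64_t *restrict y)
{
    uint64_t a0 = acc[0], a1 = acc[1];
    for (int i = 0; i < 16384; i += 2) {
        a0 += x[0] * y[0];
        a1 += x[1] * y[1];
        a0 += x[2] * y[2];
        a1 += x[3] * y[3];
        x += 4;
        y += 4;
    }
    acc[0] = a0;
    acc[1] = a1;
}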

[Bug target/103781] Cost model for SLP for aarch64 is not so good still

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #2 from Devin Hussey  ---
Yeah my bad, I meant SLP, I get them mixed up all the time.

[Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #4 from Devin Hussey  ---
Makes sense because the multiplier is what, 5 cycles on an A53?

[Bug target/110013] New: [i386] vector_size(8) on 32-bit ABI

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

Bug ID: 110013
   Summary: [i386] vector_size(8) on 32-bit ABI
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Closely related to bug 86541, which was fixed on x64 only.

On 32-bit, GCC passes any vector_size(8) vectors to external functions in MMX
registers, similar to how it passes 16-byte vectors in SSE registers.

This appears to be the only time that GCC will ever naturally generate an MMX
instruction.

This is only acceptable if you are using MMX intrinsics and are manually
handling _mm_empty().
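
For context, here is a sketch of that acceptable case (assumed usage, not taken
from the report): explicit MMX intrinsics with a manual _mm_empty() before any
later floating-point use.

/* Sketch, assumed usage: manual MMX intrinsics with explicit _mm_empty(). */
#include <mmintrin.h>

int sum_pairs(int a0, int a1, int b0, int b1)
{
    __m64 v = _mm_add_pi32(_mm_set_pi32(a1, a0), _mm_set_pi32(b1, b0));
    int lo = _mm_cvtsi64_si32(v);                    /* low lane  */
    int hi = _mm_cvtsi64_si32(_mm_srli_si64(v, 32)); /* high lane */
    _mm_empty();  /* reset the shared x87/MMX state before returning */
    return lo + hi;
}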

Otherwise, if, say, you are porting over NEON code (where I found this issue)
using the vector_size extension, this can cause some sneaky issues if your
function fails to inline:
1. Things will likely break because GCC doesn't handle mixing MMX and x87
   properly (a minimal sketch of this follows the list).
   - Example of broken code (works with -mno-mmx):
     https://godbolt.org/z/xafWPohKb
2. You will take a nasty performance hit, more than just the cost of a cdecl
   call, because GCC doesn't actually know what to do with an MMX register and
   just spills it to memory.
   - This is especially visible when v2sf is used and the floats are placed
     into MMX registers.
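
A minimal sketch of problem 1 (my own hypothetical reduction; the Godbolt link
above is the real reproducer): a vector_size(8) helper fails to inline, the
argument and return value travel through MMX registers on the 32-bit ABI, no
emms is emitted, and x87 float math in the caller afterwards can misbehave.

/* Hypothetical reduction, not from the report. */
#include <stdint.h>

typedef uint32_t u32x2 __attribute__((vector_size(8)));

__attribute__((noinline))
u32x2 add_u32x2(u32x2 a, u32x2 b)
{
    return a + b;               /* return value ends up in %mm0 on ia32 */
}

double then_use_x87(u32x2 a, u32x2 b)
{
    u32x2 s = add_u32x2(a, b);
    return (double)s[0] * 0.5;  /* x87 arithmetic with MMX state still live */
}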

There are two options. The first is to use the weird ABI that Clang seems to
use:

| Type             | SIMD | Params | Return  |
| float            | base | stack  | ST0:ST1 |
| float            | SSE  | XMM0-2 | XMM0    |
| double           | all  | stack  | ST0     |
| long long/__m64  | all  | stack  | EAX:EDX |
| int, short, char | base | stack  | stack   |
| int, short, char | SSE2 | stack  | XMM0    |

However, since the current ABIs aren't 100% compatible anyway, I think that a
much simpler solution is to just convert to SSE like x64 does, falling back to
the stack if SSE is not available.

Changing the ABI this way also allows us to port the MMX-with-SSE work (bug
86541) to 32-bit mode. If you REALLY need MMX intrinsics, your function can't
be inlined, and you don't have SSE2, you can cope with a stack spill.

[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

--- Comment #1 from Devin Hussey  ---
As a side note, the official psABI does say that function call parameters use
MM0-MM2. If Clang follows its own rules instead, then the supposed stability of
the ABI is meaningless.

[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

--- Comment #2 from Devin Hussey  ---
Scratch that. There is a somewhat easy way to fix this that both follows the
psABI AND allows using MMX with SSE.

Upon calling a function, the caller can emit the following sequence:

func:
        movdq2q  mm0, xmm0       ; move the SSE-resident vector argument into MM0
        movq     mm1, [esp + n]  ; load the stack-passed argument into MM1
        call     mmx_func
        movq2dq  xmm0, mm0       ; move the MMX return value back into XMM0
        emms                     ; clear the MMX/x87 state for the caller

Then the callee uses this prologue (and matching epilogue):

mmx_func:
        movq2dq  xmm0, mm0       ; prologue: move incoming MMX arguments into SSE
        movq2dq  xmm1, mm1
        emms                     ; clear MMX state so the body can use x87/SSE
        ...
        movdq2q  mm0, xmm0       ; epilogue: put the return value back into MM0
        ret