[Bug rtl-optimization/103641] New: [aarch64][11 regression] Severe compile time regression in SLP vectorize step
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

           Bug ID: 103641
          Summary: [aarch64][11 regression] Severe compile time regression
                   in SLP vectorize step
          Product: gcc
          Version: 11.2.0
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: rtl-optimization
         Assignee: unassigned at gcc dot gnu.org
         Reporter: husseydevin at gmail dot com
 Target Milestone: ---

Created attachment 51966
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51966&action=edit
aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c -ftime-report -ftime-report-details

While GCC 11.2 has been noticeably better at generating NEON64 code, on some files it hangs for 15-30 seconds or more in the SLP vectorization step. I haven't narrowed this down to a specific cause yet because I don't know much about the GCC internals, but it is *extremely* noticeable in the xxHash library (https://github.com/Cyan4973/xxHash).

This is a test compiling xxhash.c from Git revision a17161efb1d2de151857277628678b0e0b486155. It was done on a Core i5-430M with 8 GB RAM and an SSD on Debian Bullseye amd64. GCC 10 (10.2.1-6) was from the repos; GCC 11 (11.2.0) was built from the tarball with similar flags. While this may introduce bias, the two compilers get very similar times when the SLP vectorizer is off.

$ time aarch64-linux-gnu-gcc-10 -O3 -c xxhash.c

real    0m3.596s
user    0m3.270s
sys     0m0.149s

$ time aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c

real    0m31.579s
user    0m31.314s
sys     0m0.112s

When disabling the NEON intrinsics with `-DXXH_VECTOR=0`, it still takes ~21 seconds.

Time variable                                   usr           sys          wall           GGC
 phase opt and generate             :  31.46 ( 97%)  0.24 ( 32%)  31.80 ( 96%)    54M ( 63%)
 callgraph functions expansion      :  31.01 ( 96%)  0.18 ( 24%)  31.29 ( 94%)    42M ( 49%)
 tree slp vectorization             :  28.35 ( 88%)  0.03 (  4%)  28.37 ( 85%)  9941k ( 11%)
 TOTAL                              :  32.34         0.75         33.20           86M

This is significantly worse on my Pi 4B, where an ARMv7->AArch64 build took 3 minutes, although I presume that is mostly because the compiler is 32-bit and the CPU is much slower.
[Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #19 from Devin Hussey ---

> The new costs on AArch64 have a vector multiplication cost of 4, which is
> very reasonable.

Would this include mulv2di3 by any chance? Because another thing I noticed is that GCC also tries to multiply 64-bit numbers as if it were free, but it just ends up scalarizing.
[Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

           Bug ID: 103781
          Summary: [AArch64, 11 regr.] Failed partial vectorization of
                   mulv2di3
          Product: gcc
          Version: 11.2.1
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: middle-end
         Assignee: unassigned at gcc dot gnu.org
         Reporter: husseydevin at gmail dot com
 Target Milestone: ---

As of GCC 11, the AArch64 backend is very greedy in trying to vectorize mulv2di3. However, there is no mulv2di3 routine, so it extracts from the vector. The bad codegen should be obvious.

#include <stdint.h>

void fma_u64(uint64_t *restrict acc, const uint64_t *restrict x,
             const uint64_t *restrict y)
{
    for (int i = 0; i < 16384; i++) {
        acc[0] += *x++ * *y++;
        acc[1] += *x++ * *y++;
    }
}

gcc-11 -O3:

fma_u64:
.LFB0:
        .cfi_startproc
        ldr     q1, [x0]
        add     x6, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x4, [x1], 16
        ldr     x5, [x2], 16
        ldr     x3, [x1, -8]
        mul     x4, x4, x5
        ldr     x5, [x2, -8]
        fmov    d0, x4
        ins     v0.d[1], x5
        mul     x3, x3, x5
        ins     v0.d[1], x3
        add     v1.2d, v1.2d, v0.2d
        cmp     x1, x6
        bne     .L2
        str     q1, [x0]
        ret
        .cfi_endproc

GCC 10.2.1 emits better code:

fma_u64:
.LFB0:
        .cfi_startproc
        ldp     x4, x3, [x0]
        add     x9, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x8, [x1], 16
        ldr     x7, [x2], 16
        ldr     x6, [x1, -8]
        ldr     x5, [x2, -8]
        madd    x4, x8, x7, x4
        madd    x3, x6, x5, x3
        cmp     x9, x1
        bne     .L2
        stp     x4, x3, [x0]
        ret
        .cfi_endproc

However, the ideal code would be a 2 iteration unroll.

Side note: why not ldp in the loop?
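The "2 iteration unroll" mentioned above could look something like the following (a hypothetical hand-unrolled source-level sketch, not compiler output): each trip processes two of the original iterations, so paired loads such as ldp can cover full 16-byte chunks while the accumulation stays in scalar madd form.

```c
#include <stdint.h>

/* Hypothetical hand-unrolled variant of fma_u64: two original loop
 * iterations (four element pairs) per trip, accumulators kept in
 * locals so they live in registers across the loop. */
void fma_u64_unrolled(uint64_t *restrict acc, const uint64_t *restrict x,
                      const uint64_t *restrict y)
{
    uint64_t a0 = acc[0], a1 = acc[1];
    for (int i = 0; i < 16384; i += 2) {
        a0 += x[0] * y[0];
        a1 += x[1] * y[1];
        a0 += x[2] * y[2];
        a1 += x[3] * y[3];
        x += 4;
        y += 4;
    }
    acc[0] = a0;
    acc[1] = a1;
}
```

This keeps the same semantics as the original fma_u64 (it consumes the same 32768 elements from each input) while giving the backend adjacent loads to pair.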
[Bug target/103781] Cost model for SLP for aarch64 is not so good still
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #2 from Devin Hussey ---

Yeah, my bad, I meant SLP; I get them mixed up all the time.
[Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #4 from Devin Hussey ---

Makes sense, because the multiplier is what, 5 cycles on an A53?
[Bug target/110013] New: [i386] vector_size(8) on 32-bit ABI
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

           Bug ID: 110013
          Summary: [i386] vector_size(8) on 32-bit ABI
          Product: gcc
          Version: 14.0
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: target
         Assignee: unassigned at gcc dot gnu.org
         Reporter: husseydevin at gmail dot com
 Target Milestone: ---

Closely related to bug 86541, which was fixed on x64 only.

On 32-bit, GCC passes any vector_size(8) vectors to external functions in MMX registers, similar to how it passes 16-byte vectors in SSE registers. This appears to be the only time that GCC will ever naturally generate an MMX instruction.

This is good if and only if you are using MMX intrinsics and are manually handling _mm_empty(). Otherwise, if, say, you are porting over NEON code (where I found this issue) using the vector_size intrinsics, this can cause some sneaky issues if your function fails to inline:

1. Things will likely break, because GCC doesn't handle MMX and x87 properly.
   - Example of broken code (works with -mno-mmx): https://godbolt.org/z/xafWPohKb
2. You will pay a nasty performance toll, more than just a cdecl call, as GCC doesn't actually know what to do with an MMX register and just spills it to memory.
   - This is especially visible when v2sf is used and the floats are placed in MMX registers.

There are two options. The first is to use the weird ABI that Clang seems to use:

| Type             | SIMD | Params | Return  |
| float            | base | stack  | ST0:ST1 |
| float            | SSE  | XMM0-2 | XMM0    |
| long long/__m64  | all  | stack  | EAX:EDX |
| double           | all  | stack  | ST0     |
| int, short, char | base | stack  | stack   |
| int, short, char | SSE2 | stack  | XMM0    |

However, since the current ABIs aren't 100% compatible anyway, I think a much simpler solution is to just convert to SSE like x64 does, falling back to the stack if SSE is not available. Changing the ABI this way also allows us to port MMX-with-SSE (bug 86541) to 32-bit mode.
If you REALLY need MMX intrinsics, you can't inline, and you don't have SSE2, you can cope with a stack spill.
[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

--- Comment #1 from Devin Hussey ---

As a side note, the official psABI does say that function call parameters use MM0-MM2. If Clang follows its own rules instead, then the supposed stability of the ABI is meaningless.
[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

--- Comment #2 from Devin Hussey ---

Scratch that. There is a somewhat easy way to fix this that follows the psABI AND allows using MMX with SSE.

Upon calling a function, we can have the following sequence:

func:
        movdq2q mm0, xmm0
        movq    mm1, [esp + n]
        call    mmx_func
        movq2dq xmm0, mm0
        emms

Then, this prologue:

mmx_func:
        movq2dq xmm0, mm0
        movq2dq xmm1, mm1
        emms
        ...
        movdq2q mm0, xmm0
        ret