[Bug rtl-optimization/99346] New: [aarch64] ICE in gen_rtx_SUBREG, at emit-rtl.c:1021
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99346

Bug ID: 99346
Summary: [aarch64] ICE in gen_rtx_SUBREG, at emit-rtl.c:1021
Product: gcc
Version: 8.4.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: spop at gcc dot gnu.org
Target Milestone: ---

Created attachment 50289
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50289&action=edit
pre-processed reduced testcase

gcc-8, gcc-9, and gcc-10 from Ubuntu 20.04 are failing to compile the attached test at -O2 and -O3 on Graviton2 aarch64-linux.

$ g++-10 -O2 a.ii
[...]
a.ii:362:50: internal compiler error: in gen_rtx_SUBREG, at emit-rtl.c:1021

$ g++-8 -O2 a.ii
[...]
a.ii:493:11: internal compiler error: in gen_rtx_SUBREG, at emit-rtl.c:1010

Similar bug was reported/fixed on x86:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83723
[Bug target/97802] New: [AArch64] Incorrect documentation for Arm64 NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97802

Bug ID: 97802
Summary: [AArch64] Incorrect documentation for Arm64 NEON
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spop at gcc dot gnu.org
Target Milestone: ---

The following text in doc/invoke.texi seems to be outdated. To avoid confusion, the text needs to be more specific about which NEON implementations it applies to:

"If the selected floating-point hardware includes the NEON extension (e.g.@: @option{-mfpu=neon}), note that floating-point operations are not generated by GCC's auto-vectorization pass unless @option{-funsafe-math-optimizations} is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision."

This used to be true for older NEON implementations. The NEON implementation in Armv8 and later is IEEE 754 compliant.
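The quoted documentation text turns on whether denormal (subnormal) values are treated as zero. A minimal sketch of a run-time check is given here; the function name subnormals_preserved is hypothetical and not part of GCC or of the report. It exercises scalar arithmetic only, so it probes the floating-point environment of the running program rather than the code produced by the auto-vectorizer.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when halving the smallest normal
   float yields a non-zero (subnormal) result, i.e. when subnormals
   are not flushed to zero in the current floating-point environment. */
bool subnormals_preserved(void)
{
    /* volatile keeps the compiler from folding the division away */
    volatile float smallest_normal = FLT_MIN;
    volatile float half = smallest_normal / 2.0f;
    return half != 0.0f;
}
```

A caller could print the result once at start-up to see whether the environment it runs in flushes subnormals.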
[Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

Bug ID: 98877
Summary: [AArch64] Inefficient code generated for tbl NEON intrinsics
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spop at gcc dot gnu.org
Target Milestone: ---

The use of NEON intrinsics is inefficient and leads developers to prefer inline assembly instead of intrinsics. A similar performance bug for vmlal intrinsics was reported in https://gcc.gnu.org/PR92665

The code generated by GCC for table lookups is also inefficient:

$ cat red.c
#include "arm_neon.h"

uint8x16_t fun(uint8x16_t lo, uint8x16_t hi, uint8x16_t idx) {
  uint8x16x2_t tab = { .val = {lo, hi} };
  uint8x16_t res = vqtbl2q_u8(tab, idx);
  return res;
}

$ gcc -O3 -S -o- red.c
fun:
        mov     v4.16b, v0.16b
        mov     v5.16b, v1.16b
        tbl     v0.16b, {v4.16b - v5.16b}, v2.16b
        ret

$ clang -O3 -S -o- red.c
fun:
        tbl     v0.16b, { v0.16b, v1.16b }, v2.16b
        ret
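For readers unfamiliar with the instruction, the lookup that red.c requests can be modelled in portable C. This is only a sketch of the semantics of a two-register TBL (indices 0 to 31 select a byte from the 32-byte table, any other index yields 0); the name tbl2_model is hypothetical, and the code is a scalar model, not what either compiler emits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a two-register table lookup:
   lo supplies table bytes 0..15, hi supplies table bytes 16..31. */
void tbl2_model(const uint8_t lo[16], const uint8_t hi[16],
                const uint8_t idx[16], uint8_t res[16])
{
    uint8_t table[32];

    memcpy(table, lo, 16);
    memcpy(table + 16, hi, 16);

    for (int i = 0; i < 16; i++) {
        /* out-of-range indices produce 0 */
        res[i] = idx[i] < 32 ? table[idx[i]] : 0;
    }
}
```

Because the two table registers only need to be consecutive, the two mov instructions in the GCC output are register shuffling rather than part of the lookup itself.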
[Bug c++/99012] gcc-8.4.0 on aarch64 hits internal error during RTL pass: expand if `std::copysign` is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99012 Sebastian Pop changed: What|Removed |Added CC||spop at gcc dot gnu.org --- Comment #2 from Sebastian Pop --- I see the bug with $ gcc-8 --version gcc-8 (Ubuntu/Linaro 8.4.0-1ubuntu1~18.04) 8.4.0
[Bug c++/99012] gcc-8.4.0 on aarch64 hits internal error during RTL pass: expand if `std::copysign` is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99012 --- Comment #3 from Sebastian Pop --- I do not see the bug with today's cc1plus from origin/releases/gcc-8
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

Sebastian Pop changed:

What |Removed |Added
CC | |spop at gcc dot gnu.org

--- Comment #18 from Sebastian Pop ---
A new 5% regression happened in gcc-trunk more recently and may be due to another patch.

Rama was bisecting a 15% perf regression on lbm when updating gcc-7 to gcc-10. The regression can be seen on the LNT graph link from comment#3
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot.1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605.477.0&plot.6=759.477.0&plot.7=584.477.0

gcc-6 has execution time of 213 seconds
gcc-7 is at 215 seconds
gcc-8 is at 266
gcc-9 at 259
gcc-10 at 260

Honza's patch seems to be unrelated as it was committed to trunk before gcc-10 release on May 7, 2020:

commit a9a4edf0e71bbac9f1b5dcecdcf9250111d16889
Author: Jan Hubicka
Date:   Sat Nov 30 22:25:24 2019 +0100

    Update max_bb_count in execute_fixup_cfg

We need to git-bisect between gcc-7 and gcc-8.
[Bug target/105162] New: [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

Bug ID: 105162
Summary: [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spop at gcc dot gnu.org
Target Milestone: ---

With -mno-outline-atomics gcc produces a `dmb ish` barrier on __sync builtins as required by the Intel specification (see fix for https://gcc.gnu.org/PR65697
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=f70fb3b635f9618c6d2ee3848ba836914f7951c2
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ab876106eb689947cdd8203f8ecc6e8ac38bf5ba
)

$ cat a.c
int foo(int a) {
  return __sync_bool_compare_and_swap(&a, 4, 5);
}

$ gcc -O2 a.c -S -o- -mno-outline-atomics
foo:
        sub     sp, sp, #16
        mov     w1, 5
        str     w0, [sp, 12]
        add     x0, sp, 12
.L4:
        ldxr    w2, [x0]
        cmp     w2, 4
        bne     .L5
        stlxr   w3, w1, [x0]
        cbnz    w3, .L4
.L5:
        dmb     ish
        cset    w0, eq
        add     sp, sp, 16
        ret

With -moutline-atomics gcc does not generate the barrier:

$ gcc -O2 a.c -S -o- -moutline-atomics
foo:
        stp     x29, x30, [sp, -32]!
        mov     w1, 5
        mov     x29, sp
        add     x2, sp, 28
        str     w0, [sp, 28]
        mov     w0, 4
        bl      __aarch64_cas4_acq_rel
        cmp     w0, 4
        cset    w0, eq
        ldp     x29, x30, [sp], 32
        ret

Happens on gcc-8, 9, 10, 11, and trunk.
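The reproducer uses a legacy __sync builtin, which the GCC manual describes as a full barrier in most cases, whereas the __atomic builtins take an explicit memory order. A minimal sketch that places the two side by side follows; the wrapper names cas_sync and cas_atomic_seq_cst are hypothetical and not taken from the report or from any patch attached to it.

```c
#include <stdbool.h>

/* Legacy builtin: no memory-order argument, documented as a full barrier. */
bool cas_sync(int *p, int expected, int desired)
{
    return __sync_bool_compare_and_swap(p, expected, desired);
}

/* __atomic counterpart: the memory order is spelled out by the caller.
   A strong (non-weak) compare-exchange is used to match the __sync form. */
bool cas_atomic_seq_cst(int *p, int expected, int desired)
{
    return __atomic_compare_exchange_n(p, &expected, desired, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```

Compiling such a file on aarch64 with -O2 -S, once with -moutline-atomics and once with -mno-outline-atomics, gives a direct way to compare where a `dmb ish` appears for each wrapper.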
[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162 --- Comment #1 from Sebastian Pop --- Also happens when compiling with LSE: -march=armv8.1-a or later.
[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162 --- Comment #2 from Sebastian Pop --- Created attachment 52750 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52750&action=edit patch Fix.
[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162 Sebastian Pop changed: What|Removed |Added Attachment #52750|0 |1 is obsolete|| --- Comment #3 from Sebastian Pop --- Created attachment 52755 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52755&action=edit patch LSE atomics do not need a barrier. Updated the patch to only generate the barriers after outline-atomics calls.
[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162 --- Comment #4 from Sebastian Pop --- The attached patch degrades performance on CPUs with LSE: the barrier is not needed when the outline-atomics functions execute an LSE instruction. I was thinking of adding the barrier to the armv8.0 generic path (no LSE) in the outline-atomics functions.
[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162 Sebastian Pop changed: What|Removed |Added Attachment #52755|0 |1 is obsolete|| --- Comment #5 from Sebastian Pop --- Created attachment 52762 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52762&action=edit patch The attached patch fixes the issue for __sync builtins by adding the missing barrier to -march=armv8-a+nolse path in the outline-atomics functions. The patch also changes the behavior of __atomic builtins for -moutline-atomics -march=armv8-a+nolse to be the same as for -march=armv8-a+lse.
[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162 Sebastian Pop changed: What|Removed |Added Attachment #52762|0 |1 is obsolete|| --- Comment #8 from Sebastian Pop --- Created attachment 52826 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52826&action=edit patch You are right. Please see attached an amended patch that only adds the barriers to __sync builtins.
[Bug debug/98776] DW_AT_low_pc is inconsistent with function entry address, when enabling -fpatchable-function-entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98776 Sebastian Pop changed: What|Removed |Added CC||spop at gcc dot gnu.org --- Comment #9 from Sebastian Pop --- Hi, is somebody working on fixing this on arm64? If not I will be working on it. The linux kernel needs this fixed for systemtap and perf probe.
[Bug middle-end/107485] New: gcc-10 ICE with -fnon-call-exception
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107485

Bug ID: 107485
Summary: gcc-10 ICE with -fnon-call-exception
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: spop at gcc dot gnu.org
Target Milestone: ---

On arm64-linux I see the following crash only on gcc-10. I do not see the ICE on gcc-11, 12, and trunk.

$ ~/gcc-10/bld/gcc/cc1plus -fnon-call-exceptions f.ii
[...]
f.ii:29:23: internal compiler error: Segmentation fault
   29 | template void x(double *, b, unsigned long *) { f(); }
      | ^
0x134e58b crash_signal
        ../../gcc/toplev.c:328
0x1639464 tree_vec_extract(gimple_stmt_iterator*, tree_node*, tree_node*, tree_node*, tree_node*)
        ../../gcc/tree-vect-generic.c:140
0x163ca0f expand_vector_condition
        ../../gcc/tree-vect-generic.c:1044
0x164081f expand_vector_operations_1
        ../../gcc/tree-vect-generic.c:1988
0x16419f7 expand_vector_operations
        ../../gcc/tree-vect-generic.c:2240
0x1641b3f execute
        ../../gcc/tree-vect-generic.c:2284
[...]

$ cat f.ii
typedef long a;
typedef double b;
typedef struct {
  a c __attribute__((__vector_size__(32)));
  b d __attribute__((__vector_size__(32)));
} e;
__attribute__((__always_inline__)) b f() {
  e g, h, i;
  g.c = h.d < i.d;
}
class j {
  bool k();
};
template void ab(aa, l, n) {
  int o;
  typename n::p q;
  unsigned long r;
  q(0, o, &r);
}
namespace s {
template void t(j *, long, long, unsigned long *, int u) {
  n ac;
  void v();
  ab(v, u, ac);
}
} // namespace s
struct w {
  template void x(double *, b, unsigned long *) { f(); }
  double ad;
  void operator()(double, double, unsigned long *) {
    unsigned long m;
    x<0>(&ad, 0, &m);
  }
};
using s::t;
struct y {
  using p = w;
};
long ag, ah;
unsigned long ai;
double aj;
bool j::k() {
  using n = y;
  t(this, ag, ah, &ai, aj);
}

git bisect stops on this patch:

commit 1e676cfbe1e13fba2c636b560362ed4f0a56893d
Author: Richard Biener
Date:   Mon May 18 08:51:23 2020 +0200

    middle-end/95171 - inlining of trapping compare into non-call EH fn

    This fixes always-inlining across -fnon-call-exception boundaries
    for conditions which we do not allow to throw.

    2020-05-18  Richard Biener

            PR middle-end/95171
            * tree-inline.c (remap_gimple_stmt): Split out trapping compares
            when inlining into a non-call EH function.

            * gcc.dg/pr95171.c: New testcase.

    (cherry picked from commit fe168751c5c1c517c7c89c9a1e4e561d66b24663)
[Bug middle-end/107485] [10 Regression] gcc-10 ICE with -fnon-call-exception
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107485 --- Comment #10 from Sebastian Pop --- Thanks Richard. The patch fixed the larger test as well.
[Bug debug/98776] DW_AT_low_pc is inconsistent with function entry address, when enabling -fpatchable-function-entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98776 --- Comment #10 from Sebastian Pop --- Patch for arm64: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607601.html
[Bug debug/98776] DW_AT_low_pc is inconsistent with function entry address, when enabling -fpatchable-function-entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98776 Sebastian Pop changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #15 from Sebastian Pop --- Fixed for arm64 as well on master, and backported to active branches gcc-12, 11, and 10.
[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162 Sebastian Pop changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #14 from Sebastian Pop --- Fixed.
[Bug target/109519] New: aarch64: wrong code with NEON intrinsics on gcc-10 and later
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109519

Bug ID: 109519
Summary: aarch64: wrong code with NEON intrinsics on gcc-10 and later
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spop at gcc dot gnu.org
Target Milestone: ---

Steps to reproduce:

$ git clone https://github.com/sebpop/bitshuffle.git -b gcc-10-bug
$ cd bitshuffle/reproduce
$ make
$ ./a.out

The expected output is produced by gcc-7, gcc-9, and clang-15:

16384
4 14 16 33 39 45 51 57 67 102 108 120 126 128 134 138 140 [...]

gcc-9 is the last version of gcc I tested that works.

gcc-10 produces the following output:

$ ./a.out
16384
0 0 0 0 39 45 51 57

gcc-11 and gcc-trunk produce the following output:

$ ./a.out
16384
0 0 0 0 0 0 0

The output is also correct when removing the second-to-last patch from the git repo:
https://github.com/kiyo-masui/bitshuffle/pull/140

This patch exposes the bug in gcc by using NEON intrinsics instead of scalar computations to translate move_mask instructions from SSE2 to NEON.
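When chasing wrong code of this kind, a scalar reference for the operation being translated can help separate a compiler bug from a porting bug. The sketch below models the SSE2 byte move-mask operation (the most significant bit of each of 16 bytes is gathered into a 16-bit mask). The name movemask_u8_ref is hypothetical; this is not the code from the bitshuffle repository, only an illustration of the semantics a NEON translation would need to reproduce.

```c
#include <stdint.h>

/* Hypothetical scalar reference: bit i of the result is the most
   significant bit of bytes[i], for i in 0..15. */
uint16_t movemask_u8_ref(const uint8_t bytes[16])
{
    uint16_t mask = 0;

    for (int i = 0; i < 16; i++) {
        /* bytes[i] >> 7 is 0 or 1; place it at bit position i */
        mask |= (uint16_t)((bytes[i] >> 7) << i);
    }
    return mask;
}
```

Comparing the output of an intrinsics-based version against such a reference on the same input, at -O0 and at -O2, is one way to narrow down where the results start to diverge.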
[Bug target/109519] aarch64: wrong code with NEON intrinsics on gcc-10 and later
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109519 --- Comment #5 from Sebastian Pop --- Thanks Andrew for the patch, it fixes the issue.