https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102206
Bug ID: 102206
Summary: amd zen hosts running zen-optimized gcc: gimplification ICE after 94e24187
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: g...@be-evil.net
Target Milestone: ---

The bad news: why this bug report will be long and confusing
============================================================

There are some things in this bug report that will probably make folks think "This is a hardware or software stability problem and not gcc's fault." Strictly speaking, I can't entirely disprove this hypothesis, but I will present evidence below which has led me to believe it's probably a legit gcc bug.

AFAICT, due to the nature of binary distributions, this bug manifests exclusively on Gentoo. I imagine there are like 50 Gentoo users on Zen and 25 of them have experienced the bug, half of whom filed bug reports, and the rest of whom shrugged it off and decided it must have been a cosmic ray or something :) So, Gentoo-only.

"Rice" is an arguably racially-insensitive term Gentoo people use to describe excessive customization of Gentoo systems resulting in various breakage and non-reproducible problems. Initially I thought this was probably a "rice"-related problem, but I have taken considerable pains to rule this out. It's not rice-related.

But wait, there's more! It's also non-deterministically non-deterministic! That is, almost everyone experiences this bug non-deterministically. But, for reasons not yet understood, some users (I was once one of them) have found <software, hardware> configurations in which this bug manifests fully repeatably and deterministically. Sadly, AFAICT none of these users have managed to preserve these fully deterministic software configurations.

OK, enough prefacing. I just want to prepare the reader: the nature of this bug/issue will raise some doubts, which will likely need to be overcome before this bug looks "legit".
I also wish to encourage the reader not to jump to easy "not-a-bug" conclusions without careful consideration of the circumstances presented below.

Scope/Domain of the bug
=======================

On Gentoo, there are several AMD Zen hardware users who report that they must either A) avoid building gcc with -m{arch,tune}=znver? or the equivalent -m{arch,tune}=native, or B) downgrade to gcc-9 or earlier. If they fail to do so, the bug will occur, eventually.

The compile which seems to most reliably reproduce the bug is boost (any recent version will do the trick). But it appears in other builds as well.

Zen (1xxx) and Zen+ (2xxx) hosts seem most susceptible. But Zen 2 and Zen 3 hosts (3xxx/4xxx(?) and 5xxx, respectively) also appear to be at least occasionally affected.

Note that once an optimized gcc is built, optimizing the target build with similar -m{arch,tune} options is not required to trigger the bug. Such target optimizations do, however, seem to reproduce the problem with considerably greater probability.

Bug Manifestation
=================

The bug itself appears as an ICE, stack smash, or null-pointer-dereference fault during C++ compiles. The problem seems to always manifest during gimplification and to produce distinctive stack dumps, e.g.:

Thread 2.1 "cc1plus" received signal SIGABRT, Aborted.
[Switching to process 15911]
0x00007ffff7bc3f71 in raise () from /lib64/libc.so.6
#0 0x00007ffff7bc3f71 in raise () from /lib64/libc.so.6
#1 0x00007ffff7bad537 in abort () from /lib64/libc.so.6
#2 0x00007ffff7c08207 in ?? () from /lib64/libc.so.6
#3 0x00007ffff7c99892 in __fortify_fail () from /lib64/libc.so.6
#4 0x00007ffff7c99870 in __stack_chk_fail () from /lib64/libc.so.6
#5 0x000000000065a1f2 in cp_gimplify_expr(tree_node**, gimple**, gimple**) ()
#6 0x00000000009f4ffc in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#7 0x00000000009faeb1 in ?? ()
#8 0x00000000009f6304 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#9 0x00000000009f6196 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#10 0x00000000009f5d42 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#11 0x00000000009f6196 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#12 0x00000000009fd5b9 in ?? ()
#13 0x00000000009f638d in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#14 0x00000000009f6196 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#15 0x00000000009f9ed9 in gimplify_body(tree_node*, bool) ()
#16 0x00000000009fa316 in gimplify_function_tree(tree_node*) ()
#17 0x0000000000885f58 in cgraph_node::analyze() ()
#18 0x0000000000888878 in ?? ()
#19 0x00000000008894c3 in symbol_table::finalize_compilation_unit() ()
#20 0x0000000000c769b1 in ?? ()
#21 0x000000000060065a in toplev::main(int, char**) ()
#22 0x000000000060413c in main ()

This is a pretty manageable example; others report stack traces with very deep gimplify_expr recursion*.

Git Bisect: 94e2418780f1d13235f3e2e6e5c09dbe821c1ce3
====================================================

A few months ago I git bisected this thing. Since the bug was manifesting nondeterministically, it took some doing; I wrote scripts to repeatedly build boost, treating a point in history as "good" only after no fewer than 50 consecutive successful builds with the resulting optimized compiler*.

Thankfully, this did result in a culprit which was revertible without crippling gcc:

94e24187 | c++: Avoid unnecessary empty class copy [94175]

I must admit I don't really understand what this commit does. But reverting it and rebuilding gcc-1{0.{1,2,3},1.{1,2}} results in a compiler which seems to work fine and does not suffer from the bug/issue.
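
The bisect scripts themselves are not included in this report. As a rough illustration only, the "good after N consecutive successes" rule could be sketched in Python along the following lines. The function names, the build command, and the failure rate used in the note afterwards are all hypothetical; the only outside facts relied on are that `git bisect run` treats exit status 0 as "good" and 1 as "bad".

```python
import subprocess
from typing import Callable, Sequence

# Exit statuses understood by `git bisect run`: 0 = good, 1 = bad.
GOOD, BAD = 0, 1


def build_once(cmd: Sequence[str]) -> bool:
    """Run one build attempt; True if the command exited with status 0.

    `cmd` is a placeholder for whatever rebuilds boost with the compiler
    under test. The real command is not given in the report.
    """
    return subprocess.run(list(cmd)).returncode == 0


def classify(build: Callable[[], bool], required_successes: int = 50) -> int:
    """Return BAD on the first failed build, GOOD only after
    `required_successes` consecutive successful builds."""
    for _ in range(required_successes):
        if not build():
            return BAD
    return GOOD


def false_good_probability(p_fail: float, required_successes: int = 50) -> float:
    """Chance that a bad revision is wrongly marked good, assuming each
    build fails independently with probability `p_fail` (an assumption)."""
    return (1.0 - p_fail) ** required_successes


# Hypothetical use as a `git bisect run` step:
#   import sys
#   sys.exit(classify(lambda: build_once(["./rebuild-boost.sh"])))
```

Under the independence assumption, a revision whose builds fail 10% of the time (a made-up figure) would slip through 50 consecutive builds with probability 0.9^50, roughly 0.5%. That gives a feel for why a threshold of this size is needed when the failure is intermittent.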
Since then, every reporter so far in Gentoo bug 724314 (where most discussion of this bug has occurred) has reported that applying this patch also solved the problem for them*.

The specific Gentoo-friendly patch folks have been using is available at:

https://724314.bugs.gentoo.org/attachment.cgi?id=718944

Significance of this finding
============================

?

--
* See https://bugs.gentoo.org/724314 for examples/specifics