https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108500
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Thanks for the new testcase.  With -O0 (and a --enable-checking=release built
compiler) this builds in ~11 minutes (on a Ryzen 9 7900X) with

 integrated RA              :  38.96 (  6%)   1.94 ( 20%)  42.00 (  6%)  3392M ( 23%)
 LRA non-specific           :  18.93 (  3%)   1.24 ( 13%)  23.78 (  4%)   450M (  3%)
 LRA virtuals elimination   :   5.67 (  1%)   0.05 (  1%)   5.75 (  1%)   457M (  3%)
 LRA reload inheritance     : 318.25 ( 49%)   0.24 (  2%) 318.51 ( 48%)     0  (  0%)
 LRA create live ranges     : 199.24 ( 31%)   0.12 (  1%) 199.38 ( 30%)   228M (  2%)
645.67user 10.29system 11:04.42elapsed 98%CPU (0avgtext+0avgdata 30577844maxresident)k
3936200inputs+1091808outputs (122053major+10664929minor)pagefaults 0swaps

so register allocation taking all of the time.  There's maybe the possibility
to gate some of its features on the # of BBs or insns (or whatever the actual
"bad" thing is - I didn't look closer yet).

It also seems to use 30GB of peak memory at -O0 ...

For -O the situation is "better":

 tree PTA                   : 987.21 ( 99%)   0.41 ( 12%) 987.70 ( 99%)   128  (  0%)
992.56user 3.53system 16:36.20elapsed 99%CPU (0avgtext+0avgdata 2968740maxresident)k
42576inputs+8outputs (28major+717414minor)pagefaults 0swaps

which suggests a clear workaround, -fno-tree-pta, which makes it compile in 5s
for me.

Doing -O -finline-small-functions -fno-tree-pta we get a very high
compile-time in SRAs propagate_all_subaccesses which probably sees a very
large struct copy chain

  tem1 = s2; s2 = tem1; tem2 = s2; s2 = tem2; ...

and somehow ends up quadratic (possibly switching the candidate_bitmap to
tree form at the start of propagate_all_subaccesses will help a bit).
tree form bitmap doesn't help, I guess we end up queueing all elements in
the copy chain to the worklist and via the chains end up with a O(n^2)
working set.  The testcase can probably be shortened to get at this problem.

SRA is actually quite important here, so disabling SRA as a workaround
doesn't look to improve the situation a lot.  Still with -fno-tree-sra added
we get good compile time and DCE/DSE remove all code plus -fno-tree-pta isn't
required.

Martin, can you look at the SRA issue?  Do you want me to create a separate
bugreport for this?  The IL into SRA looks like

  <bb 2> :
  s2D.2755 = {};
  s1D.2756 = {};
  _unusedD.2002766 = s1D.2756;
  sD.2002767 = s2D.2755;
  s2D.2755 = sD.2002767;
  _unusedD.2002766 ={v} {CLOBBER(eol)};
  sD.2002767 ={v} {CLOBBER(eol)};
  _unusedD.2002764 = s1D.2756;
  sD.2002765 = s2D.2755;
  s2D.2755 = sD.2002765;
  _unusedD.2002764 ={v} {CLOBBER(eol)};
  sD.2002765 ={v} {CLOBBER(eol)};
  _unusedD.2002762 = s1D.2756;
  sD.2002763 = s2D.2755;
  s2D.2755 = sD.2002763;
  _unusedD.2002762 ={v} {CLOBBER(eol)};
  sD.2002763 ={v} {CLOBBER(eol)};
  _unusedD.2002760 = s1D.2756;
  sD.2002761 = s2D.2755;
  s2D.2755 = sD.2002761;
  _unusedD.2002760 ={v} {CLOBBER(eol)};
  sD.2002761 ={v} {CLOBBER(eol)};
  ...
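
For anyone trying to shorten the testcase, here is a guess at a source shape
that could give a copy chain like the IL above once the calls are inlined.
It is a sketch, not the attached testcase: the struct layout, the helper
name pass(), the STEP macros and the repetition count are all made up for
illustration, and the real testcase repeats the pattern far more often.

```c
/* Hypothetical reduced shape; names and sizes are assumptions, not taken
   from the attached testcase.  */
struct S { int a[4]; };

/* Takes two structs by value and returns the second one.  After inlining,
   each call would become "_unused = s1; s = s2; s2 = s;".  */
static inline struct S
pass (struct S _unused, struct S s)
{
  return s;
}

/* One link of the copy chain, repeated by macro expansion.  */
#define STEP   s2 = pass (s1, s2);
#define STEP4  STEP STEP STEP STEP
#define STEP16 STEP4 STEP4 STEP4 STEP4
#define STEP64 STEP16 STEP16 STEP16 STEP16

int
copy_chain (void)
{
  struct S s2 = { 0 };
  struct S s1 = { 0 };
  STEP64   /* 64 links here; scale the macros up to probe the slowdown.  */
  return s2.a[0];
}
```

Scaling the macros up and building with -O -finline-small-functions
-fno-tree-pta -ftime-report should show whether SRA time grows faster than
linearly with the chain length, assuming this shape really matches the
testcase.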