https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108500

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Thanks for the new testcase.  With -O0 (and a --enable-checking=release built
compiler) this builds in ~11 minutes (on a Ryzen 9 7900X) with

 integrated RA                      :  38.96 (  6%)   1.94 ( 20%)  42.00 (  6%)  3392M ( 23%)
 LRA non-specific                   :  18.93 (  3%)   1.24 ( 13%)  23.78 (  4%)   450M (  3%)
 LRA virtuals elimination           :   5.67 (  1%)   0.05 (  1%)   5.75 (  1%)   457M (  3%)
 LRA reload inheritance             : 318.25 ( 49%)   0.24 (  2%) 318.51 ( 48%)     0  (  0%)
 LRA create live ranges             : 199.24 ( 31%)   0.12 (  1%) 199.38 ( 30%)   228M (  2%)
645.67user 10.29system 11:04.42elapsed 98%CPU (0avgtext+0avgdata 30577844maxresident)k
3936200inputs+1091808outputs (122053major+10664929minor)pagefaults 0swaps

so register allocation accounts for most of the time.  It may be possible
to gate some of its features on the number of BBs or insns (or whatever the
actual "bad" thing is; I didn't look closer yet).

It also seems to use 30GB of peak memory at -O0 ...

For -O the situation is "better":

 tree PTA                           : 987.21 ( 99%)   0.41 ( 12%) 987.70 ( 99%)   128  (  0%)
992.56user 3.53system 16:36.20elapsed 99%CPU (0avgtext+0avgdata 2968740maxresident)k
42576inputs+8outputs (28major+717414minor)pagefaults 0swaps

which suggests a clear workaround, -fno-tree-pta, which makes it compile
in 5s for me.

With -O -finline-small-functions -fno-tree-pta we get a very high
compile time in SRA's propagate_all_subaccesses, which probably sees
a very large struct copy chain

  tem1 = s2;
  s2 = tem1;
  tem2 = s2;
  s2 = tem2;
...

and somehow ends up quadratic.  I thought switching the candidate_bitmap
to tree form at the start of propagate_all_subaccesses might help a bit,
but the tree form bitmap doesn't help; I guess we end up queueing all
elements in the copy chain to the worklist and, via the chains, end up with
an O(n^2) working set.  The testcase can probably be shortened to get at
this problem.  SRA is actually quite important here, so disabling SRA as a
workaround doesn't look like it would improve the situation a lot.

Still, with -fno-tree-sra added we get good compile time, DCE/DSE remove
all the code, and -fno-tree-pta is then no longer required.

Martin, can you look at the SRA issue?  Do you want me to create a separate
bugreport for this?  The IL into SRA looks like

  <bb 2> :
  s2D.2755 = {};
  s1D.2756 = {};
  _unusedD.2002766 = s1D.2756;
  sD.2002767 = s2D.2755;
  s2D.2755 = sD.2002767;
  _unusedD.2002766 ={v} {CLOBBER(eol)};
  sD.2002767 ={v} {CLOBBER(eol)};
  _unusedD.2002764 = s1D.2756;
  sD.2002765 = s2D.2755;
  s2D.2755 = sD.2002765;
  _unusedD.2002764 ={v} {CLOBBER(eol)};
  sD.2002765 ={v} {CLOBBER(eol)};
  _unusedD.2002762 = s1D.2756;
  sD.2002763 = s2D.2755;
  s2D.2755 = sD.2002763;
  _unusedD.2002762 ={v} {CLOBBER(eol)};
  sD.2002763 ={v} {CLOBBER(eol)};
  _unusedD.2002760 = s1D.2756;
  sD.2002761 = s2D.2755;
  s2D.2755 = sD.2002761;
  _unusedD.2002760 ={v} {CLOBBER(eol)};
  sD.2002761 ={v} {CLOBBER(eol)};
...
