https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108487

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |tree-optimization
           Keywords|                            |needs-bisection
            Summary|~20-30x slowdown in         |[10/11/12/13 Regression]
                   |populating std::vector from |~20-30x slowdown in
                   |std::ranges::iota_view      |populating std::vector from
                   |                            |std::ranges::iota_view
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Regarding fn1, would you mind re-running the test on your Xeon CPU with fn2
removed from the source code and -falign-loops=32 added to the gcc command
line? For fn1, the assembly of the inner loop should be identical, so I think
the 20% you were seeing may result from different loop alignment with respect
to the 32-byte fetch boundary.

Also, please note that the cloud instances backing godbolt.org have different
CPUs, so timing results from different runs are not directly comparable.

Regarding fn2, this may partially be a library issue: compiling preprocessed
source from gcc-10.4 using gcc-10.2 also exhibits the problem. The inner loop
becomes significantly more complicated. Bisecting should be helpful.