https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83017
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ok, so we do slightly better for the runtime test than for the static test:
  if (loop->inner)
    m_p_thread=2;
  else
    m_p_thread=MIN_PER_THREAD;
so with 2 threads we should have exactly 2 iterations per thread but ... the
runtime check uses the number of latch executions, which is 3, and thus
arrives at 1 iteration per thread.  Fixing this off-by-one gets us
> /usr/bin/time ./a.out
PI 2.98876095
PI 3.14159274
4.02user 0.00system 0:04.02elapsed 99%CPU (0avgtext+0avgdata 2460maxresident)k
0inputs+0outputs (0major+102minor)pagefaults 0swaps
vs.
> /usr/bin/time ./a.out
PI 8.59536934
PI 3.14159274
10.90user 0.00system 0:05.54elapsed 196%CPU (0avgtext+0avgdata 2840maxresident)k
0inputs+0outputs (0major+126minor)pagefaults 0swaps
I guess the different computation outcome (the first PI value differs between
the two runs) means we're doing something wrong somewhere.
Also, at least on my machine the result isn't any faster in elapsed time (when
parallelizing the outer loop).  As usual, auto-parallelization may harm
follow-up transforms.