https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83064
--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 23 Nov 2017, dominiq at lps dot ens.fr wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83064
>
> --- Comment #7 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
> > I looked at the IL from the Fortran FE and it clearly uses a single memory
> > area for tmp for each outer loop iteration.  That is, the memory is allocated
> > by the caller.
>
> I confirm that using
>
>       pik = compute( low(i), high(i) )
>       pi(i) = sum(pik)
>
> gives the right result.
>
> Does it means that the 'sum' in 'sum(compute( low(i), high(i) ))' is not part
> of the parallelization?

No idea, I can't do the above, pik is not declared.

> > > Do you understand why the code is not parallelized with
> > > -ftree-parallelize-loops=4?
>
> > Because the outer loop has four iterations and we statically require
> > at least two per thread for outer loops.
>
> Why is it so? and is it documented?

It is documented:

@item parloops-min-per-thread
The minimum number of iterations per thread of an innermost
parallelized loop for which the parallelized variant is preferred
over the single threaded one.  The default is 100.  Note that for
a parallelized loop nest the minimum number of iterations of the
outermost loop per thread is two.

Note that autopar isn't very well maintained and certainly the cost
modeling needs some work.
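To make the arithmetic behind that rule concrete, here is a minimal sketch in C. It only restates the documented requirement (at least two outermost-loop iterations per thread) and is not GCC's actual implementation; the names MIN_OUTER_ITERS_PER_THREAD and outer_loop_has_enough_iterations are hypothetical, and treating the boundary case as "enough" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constant mirroring the documented rule: the outermost
   loop of a parallelized nest needs at least two iterations per thread.  */
#define MIN_OUTER_ITERS_PER_THREAD 2

/* Hypothetical helper: return true if an outer loop with NITERS
   iterations has enough work for NTHREADS threads under that rule.
   Assumption: exactly two iterations per thread counts as enough.  */
static bool
outer_loop_has_enough_iterations (long niters, long nthreads)
{
  return niters >= MIN_OUTER_ITERS_PER_THREAD * nthreads;
}
```

Under these assumptions, the four-iteration outer loop from this PR fails the check for four threads (4 < 2 * 4) but passes it for two threads (4 >= 2 * 2).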
So for the issue in this bug the .original from the Fortran FE looks ok:

  while (1)
    {
      if (ANNOTATE_EXPR <count.9 <= 0, parallel>) goto L.10;
      {
        real(kind=4) val.5;
        integer(kind=8) * D.3618;
        integer(kind=8) * D.3619;
        struct array1_real(kind=4) atmp.6;
        real(kind=4) A.7[4];

        val.5 = 0.0;
        D.3618 = &low[NON_LVALUE_EXPR <i.4> + -1];
        D.3619 = &high[NON_LVALUE_EXPR <i.4> + -1];
        typedef real(kind=4) [4];
        atmp.6.dtype = 281;
        atmp.6.dim[0].stride = 1;
        atmp.6.dim[0].lbound = 0;
        atmp.6.dim[0].ubound = 3;
        atmp.6.data = (void * restrict) &A.7;
        atmp.6.offset = 0;
        compute (&atmp.6, D.3618, D.3619);

so A.7 is in scope of the concurrent loop body and gimplification adds
a CLOBBER at the end of the scope.  I believe there's no logic in
autopar that would use this to force local allocation of that variable.
It might also be fragile since we can't really rely on those
CLOBBERs persisting(?)

This means a DO CONCURRENT isn't enough to skip the validity check
in autopar.  In fact DO CONCURRENT doesn't tell us anything, except
that we could maybe skip any cost modeling during autopar?
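As a rough illustration of why a shared A.7 matters, the following C sketch interleaves two loop iterations by hand instead of using real threads, so the outcome is deterministic. Everything in it is hypothetical: compute here is only a stand-in that fills a caller-allocated temporary (the real function from the PR is not shown in this comment), and the chosen interleaving is just one ordering that concurrent execution could produce.

```c
#include <assert.h>

#define N 4

/* Hypothetical stand-in for the PR's compute(): fills a
   caller-allocated temporary of N elements.  */
static void
compute (float *tmp, int low, int high)
{
  for (int k = 0; k < N; k++)
    tmp[k] = (float) (low + high + k);
}

/* Sum the N elements of the temporary.  */
static float
sum (const float *tmp)
{
  float s = 0.0f;
  for (int k = 0; k < N; k++)
    s += tmp[k];
  return s;
}

/* Two iterations interleaved by hand while sharing one temporary:
   iteration 1 overwrites the buffer before iteration 0 has summed it.  */
static void
interleaved_shared (float pi[2])
{
  float shared[N];
  compute (shared, 0, 0);    /* iteration 0 fills the temporary */
  compute (shared, 10, 10);  /* iteration 1 overwrites it */
  pi[0] = sum (shared);      /* iteration 0 reads iteration 1's data */
  pi[1] = sum (shared);
}

/* The same interleaving with one temporary per iteration.  */
static void
interleaved_private (float pi[2])
{
  float a0[N], a1[N];
  compute (a0, 0, 0);
  compute (a1, 10, 10);
  pi[0] = sum (a0);
  pi[1] = sum (a1);
}
```

With the shared temporary both results come from iteration 1's data, while the per-iteration temporaries keep the two sums independent, which is what forcing local allocation of the variable would achieve.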