https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think for the case at hand no runtime alias checking is needed, since we have

            DO 30 MK=1,NOC
            DO 30 ML=1,MK
               MKL = MKL+1
               XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *               VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
               XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *               VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30       CONTINUE

so we're dealing with reductions which we can interleave (with -Ofast).
Editing the source with !GCC$ ivdep reduces the vectorization penalty to 5%
(we still need the niter/epilogue checks).  It also shows that only fixing
PR89755 isn't the solution we're looking for.
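
For readers more at home in C, the loop nest and the ivdep hint might look
roughly like the sketch below.  It is not the benchmark source: the function
name update_xpqkl, its signature, the shared leading dimension ld and the
0-based indexing are all assumptions made for illustration, and
#pragma GCC ivdep is the C/C++ counterpart of the Fortran !GCC$ ivdep
directive used in the experiment above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C analog of the Fortran nest above (0-based indices).
   x and co are stored column-major with leading dimension ld, so element
   (row, col) lives at [col * ld + row], mirroring XPQKL(row, MKL) and
   CO(row, col).  All names and the signature are assumptions; x and co
   are assumed not to overlap. */
void update_xpqkl(double *x, const double *co, size_t ld, size_t noc,
                  size_t mpq, size_t mrs,
                  size_t mp, size_t mq, size_t mr, size_t ms,
                  double val1, double val3)
{
    assert(ld > 0);
    size_t mkl = 0;
    for (size_t mk = 0; mk < noc; mk++) {
        /* Each inner iteration updates its own column mkl, so there is no
           loop-carried dependence even when mpq == mrs.  The pragma asserts
           exactly that to the vectorizer. */
#pragma GCC ivdep
        for (size_t ml = 0; ml <= mk; ml++) {
            double *col = x + mkl * ld;
            col[mpq] += val1 * (co[mk * ld + ms] * co[ml * ld + mr]
                                + co[ml * ld + ms] * co[mk * ld + mr]);
            col[mrs] += val3 * (co[mk * ld + mq] * co[ml * ld + mp]
                                + co[ml * ld + mq] * co[mk * ld + mp]);
            mkl++;
        }
    }
}
```

The pragma only removes the dependence checks; the inner trip count is still
unknown at compile time, so the niter and epilogue handling remains.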

In the end the vectorization is unlikely to pay off, since the work V2DF
would do is usually handled well by the dual-issue capabilities for DFmode
arithmetic on modern archs.

The only mitigation I can think of is realizing that the inner loop niter,
as the outer loop advances, is 0, 1, 2, .., NOC - 1, and thus inner loop
vectorization is not profitable for the first outer iterations.  But the
question is what to do with this (not knowing the actual runtime value of
NOC).  As PR87561 says

"Note for 416.gamess it looks like NOC is just 5 but MPQ and MRS are so
that there is no runtime aliasing between iterations most of the time
(sometimes they are indeed equal).  The cost model check skips the
vector loop for MK == 2 and 3 and only will execute it for MK == 4 and 5.
An alternative for this kind of loop nest would be to cost-model for
MK % 2 == 0, thus requiring no epilogue loop."
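
To make the quoted numbers concrete, the sketch below splits each inner trip
count into vector and scalar iterations.  The helper names are hypothetical,
VF = 2 is assumed for V2DF, and the profitability threshold of 4 iterations
is an assumption chosen only because it reproduces the behaviour described
in the quote; it is not taken from the actual cost model.

```c
#include <assert.h>

/* How one execution of the inner loop is split up. */
struct split {
    int vector_iters;   /* iterations of the vectorized loop body */
    int scalar_iters;   /* iterations left to the scalar or epilogue loop */
};

/* niter: scalar trip count of the inner loop (equals MK in the nest).
   vf: vectorization factor (2 for V2DF).
   min_profitable: hypothetical cost-model threshold below which the
   vector loop is skipped entirely. */
struct split split_inner_loop(int niter, int vf, int min_profitable)
{
    assert(niter >= 0 && vf > 0);
    struct split s = {0, niter};
    if (niter >= min_profitable) {
        s.vector_iters = niter / vf;
        s.scalar_iters = niter % vf;
    }
    return s;
}

/* Number of scalar iterations of the whole triangular nest that end up
   being executed by the vector loop. */
int vectorized_scalar_iters(int noc, int vf, int min_profitable)
{
    int covered = 0;
    for (int mk = 1; mk <= noc; mk++)
        covered += split_inner_loop(mk, vf, min_profitable).vector_iters * vf;
    return covered;
}
```

Under these assumptions and NOC = 5, only MK == 4 and MK == 5 enter the
vector loop, covering 8 of the 15 inner iterations, and MK == 5 still needs
one epilogue iteration.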

In general applying no vectorization to this kind of loop looks wrong.
Versioning the outer loop as well as the inner loop, for the case where the
inner number of iterations evolves in the outer loop, looks excessive (but
would eventually help 416.gamess).  Implementation-wise it's also non-trivial.
