https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #22 from Richard Biener <rguenth at gcc dot gnu.org> ---
You also get to measure memory / cache behavior.  The default grid size was
(in the past ...) set to make sure we hit main memory; nowadays it's more
on the order of the size of L2 (or L3) caches.  Once you make the problem
smaller via cmdline parameters you can see the optimization effects more
clearly (ideally tramp3d would run on a cluster with MPI parallelization and,
on each node, with OpenMP enabled and each thread running fully inside the L3
cache - back in the day Itanic was really great here with its gigantic L3/L4
caches).

On an old iCore5 I now see trunk outperforming 4.9 using just -Ofast
(with generic tuning).

The most important thing to check when optimizing is that no calls survive
in any of the hot triple-nested loops and that the innermost loop "looks"
fast ;)

The loops are in functions whose symbols match the pattern
*EvaluateLocLoop*runEv, for example
_ZN14MultiArgKernelI9MultiArg2I5FieldI22UniformRectilinearMeshI10MeshTraitsILi3Ed21UniformRectilinearTag12CartesianTagLi3EEEd10BrickViewUES9_E15EvaluateLocLoopIN4Adv51Z7DensupdILi3EEELi3EEE3runEv
(unfortunately the pattern also matches a great many unrelated functions...)
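
To make the pattern concrete, here is a minimal sketch of a matcher for
*EvaluateLocLoop*runEv on mangled names.  The function name
matches_hot_loop_pattern is made up for illustration; it only mirrors the
glob (the marker occurs somewhere, and the symbol ends in runEv) and does
nothing to weed out the unrelated matches.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of tramp3d or GCC): returns true if the
// mangled symbol matches the glob *EvaluateLocLoop*runEv, i.e. it ends in
// "runEv" and contains "EvaluateLocLoop" somewhere before that suffix.
bool matches_hot_loop_pattern(const std::string& sym) {
    const std::string marker = "EvaluateLocLoop";
    const std::string suffix = "runEv";
    if (sym.size() < marker.size() + suffix.size())
        return false;
    // The symbol has to end with the suffix ...
    const std::size_t head_len = sym.size() - suffix.size();
    if (sym.compare(head_len, suffix.size(), suffix) != 0)
        return false;
    // ... and the marker has to fit entirely in front of it.
    const std::size_t pos = sym.find(marker);
    return pos != std::string::npos && pos + marker.size() <= head_len;
}
```

Feeding it the symbol table of the binary, one name per line, would list
the candidate functions to inspect.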

Note that we seem to vectorize the innermost loops now (yay!) but peel
them for alignment (ugh - the prologue won't make things better - the
innermost loops run only 64 iterations and thus 32 vector iterations
by default).  And then we of course have the epilogue for the remaining
iteration.  Luckily we peel both epilogue and prologue (both have at
most 1 iteration with V2DF vectors).
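
The iteration counts above can be checked with a small model of peeling for
alignment.  This is a simplified sketch, not what the vectorizer actually
does internally; the names PeelSplit and split_iterations are made up, and
the misalignment is assumed to be known as a number of elements.

```cpp
#include <cassert>

// Hypothetical model of how a loop of n scalar iterations is split when it
// is vectorized with vf elements per vector and peeled for alignment.
struct PeelSplit {
    int prologue;     // scalar iterations run until the access is aligned
    int vector_iters; // iterations of the vectorized loop body
    int epilogue;     // scalar iterations left over afterwards
};

// misaligned_elems: how many elements past an aligned boundary the first
// access starts (0 means already aligned).
PeelSplit split_iterations(int n, int vf, int misaligned_elems) {
    assert(n >= 0 && vf > 0 && misaligned_elems >= 0);
    int prologue = (vf - misaligned_elems % vf) % vf;
    if (prologue > n)
        prologue = n;
    const int rest = n - prologue;
    return PeelSplit{prologue, rest / vf, rest % vf};
}
```

With n = 64 and V2DF (vf = 2) an aligned start gives 0 + 32 + 0, while a
start that is off by one element gives 1 + 31 + 1, so prologue and epilogue
each run at most one iteration.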

Code generated for trunk and 4.9 is almost the same for a few cases I looked
at.

And I think the performance to compare against is that of compiling
with -Dleafify=flatten (which makes sure all the desired inlining is done
very early).  On my machine it's even a little slower with flatten enabled.
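
For readers who have not used it, the following sketch shows what GCC's
flatten attribute does.  It assumes that -Dleafify=flatten simply turns an
attribute spelled "leafify" in the source into "flatten"; the function
names here are made up and are not taken from tramp3d.

```cpp
// Small helpers that should end up folded into the kernel.
static double scale(double a, double x) { return a * x; }
static double add(double x, double y) { return x + y; }

// The flatten attribute asks the compiler to inline every call made in the
// body of this function, where possible, so that no call is left inside
// the loop.
__attribute__((flatten))
void update_row(double* out, const double* in, int n, double a) {
    for (int i = 0; i < n; ++i)
        out[i] = add(out[i], scale(a, in[i]));
}
```

Whether any call survived can then be checked in the generated assembly for
update_row.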

The graphs on gcc.opensuse.org show the regression is fixed as well (though
compile-time had quite a surge).

Thus I think we can close this as fixed.
