https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #17 from Alexander Peslyak <solar-gcc at openwall dot com> ---
(In reply to Richard Biener from comment #16)
> I'm completely confused now as to what the original regression was reported
> against.

I'm sorry, I should have re-read my original description of the regression
before I wrote comment 13.  Together, these are indeed confusing.

> I thought it was the default options in the Makefile, -O2
> -fomit-frame-pointer, which showed the regression and you found -Os would
> mitigate it somewhat (and I more specifically told you it is -fno-tree-pre
> that makes the actual difference).

That's one of the regressions I mentioned in the original description.  Yes,
you identified -fno-tree-pre as the component of -Os that makes the difference
- Thank You!  However, I also mentioned in the original description that a
bigger regression with 4.6+ vs. 4.5 and 4.4 remained despite of -Os, and I had
no similar workaround for it at the time (but enabling -fopenmp made it go
away, perhaps due to changes to declarations in the source code in #ifdef
_OPENMP blocks).  I think we can now say that this bigger 4.6+ regression was
primarily caused by the unaligned load instructions.  So two regressions are
figured out, and the remaining slowdown (not investigated yet) vs. 4.1 to 4.3
(which worked best) is only 6% to 10% in recent versions (9% in 4.9.2).

> So - what options give good results with old compilers but bad results with
> new compilers?

On CPUs where movups/movdqu are slower than their aligned counterparts (for
addresses that happen to be aligned), any sane optimization options of 4.6+
give bad results as compared to pre-4.6 with same options.  As you say, this
can be fixed in the source code (and I most likely will fix it there), but I
think many other programs may experience similar slowdowns, so maybe GCC should
do something about this too.

Other than that, either -Os or -fno-tree-pre works around the second worst
slowdown seen in 4.6+.

To avoid confusion, maybe this bug should focus on one of the three
regressions?  Should we keep it for PRE only?

Should we create a new bug for the unnecessary and non-optional use of
unaligned load instructions for source code like this, or is this considered
the new intended behavior despite of the major slowdown on such CPUs? 
(Presumably not only for JtR.  I'd expect this to affect many programs.)

Should we also create a bug for investigating the remaining slowdown of 9% in
4.9.2 (vs. 4.1 to 4.3), or is it considered too minor to bother?

Thank you!

Reply via email to