https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|WAITING                     |NEW
                 CC|                            |rguenth at gcc dot gnu.org
          Component|middle-end                  |tree-optimization
            Summary|GCC 4.6 performance         |GCC 4.6 performance
                   |regression (vs. 4.4/4.5)    |regression (vs. 4.4/4.5),
                   |                            |PRE increases register
                   |                            |pressure

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
As for movaps vs. movups: when movaps actually works (i.e. the data is
aligned), the choice shouldn't make any difference on modern architectures.
So I wonder if you could share the exact CPU type you are using?

We are putting quite heavy register pressure on this code by means of
partial redundancy elimination (PRE), so disabling PRE using -fno-tree-pre
might help (we still spill a lot).

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     103296 c/s real, 103296 c/s virtual
Only one salt:  100736 c/s real, 100736 c/s virtual

improves to

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     126848 c/s real, 126848 c/s virtual
Only one salt:  123008 c/s real, 123008 c/s virtual

with that for me (gcc 4.8, SSE2), which is close to what 4.5.3 gets for me:

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     128384 c/s real, 128384 c/s virtual
Only one salt:  124800 c/s real, 124800 c/s virtual

albeit that doesn't need -fno-tree-pre to fix things.

Note that we have to use movups because DES_bs_all is not aligned as seen
from DES_bs_b.c (it's defined in DES_bs.c and only there annotated with
CC_CACHE_ALIGN, not at the point of declaration in DES_bs.h).  So the
unaligned moves are the source's fault.  Annotating the declaration with
CC_CACHE_ALIGN produces the desired movaps instructions (with no effect on
performance for me).
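
To illustrate the alignment point, here is a minimal sketch rather than the
actual John the Ripper source.  The names bs_state, bs_all, MY_CACHE_ALIGN
and the 64-byte line size are all made up for the example; the real
definition of CC_CACHE_ALIGN is not shown in this report.  The idea is only
that the attribute has to be on the extern declaration that other
translation units see, otherwise the compiler cannot assume alignment there
and has to fall back to unaligned moves.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache line size; hypothetical stand-in for the project's macro. */
#define CACHE_LINE 64
#define MY_CACHE_ALIGN __attribute__((aligned(CACHE_LINE)))

/* Hypothetical stand-in for the bitslice state structure. */
typedef struct {
    unsigned char block[128];
} bs_state;

/* What the header would carry: with the attribute here, every file that
   includes the header may assume the object is aligned. */
extern bs_state bs_all MY_CACHE_ALIGN;

/* The single definition, normally in one .c file. */
bs_state bs_all MY_CACHE_ALIGN;

/* Returns 1 if the object really sits on a CACHE_LINE boundary. */
int bs_all_is_aligned(void)
{
    return ((uintptr_t)&bs_all % CACHE_LINE) == 0;
}

/* Runtime sanity check of the above. */
void check_alignment(void)
{
    assert(bs_all_is_aligned());
}
```

The attribute syntax is the GNU C one, so this assumes GCC or a compatible
compiler.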

I think for the effect of PRE increasing register pressure we do have some
duplicate bugs (but no good heuristic to fix anything).  LIM store-motion can
have the very same issue.
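
For readers not familiar with the transformation, the following is a
hand-written approximation of what PRE does, not output from GCC.  The
function names are invented for the example.  In the first form a * b is
redundant only on the path where cond is true.  PRE inserts the computation
on the other path so the later use can reuse it, which keeps a temporary
live across the join.  With many such temporaries in a large unrolled
kernel, that is where the extra register pressure would come from.

```c
#include <assert.h>

/* Before: a * b is computed under 'cond' and again after the join, so it
   is redundant only along the path where cond is true. */
int before_pre(int a, int b, int cond)
{
    int x = 0;
    if (cond)
        x = a * b;
    return x + a * b;
}

/* After (approximation): the product is computed once per path and kept
   in 't', which is now live from each branch to the final use. */
int after_pre(int a, int b, int cond)
{
    int x = 0;
    int t;
    if (cond) {
        t = a * b;
        x = t;
    } else {
        t = a * b;   /* computation inserted on the path that lacked it */
    }
    return x + t;
}

/* Both forms must compute the same value on either path. */
void check_equivalence(void)
{
    assert(before_pre(3, 4, 1) == after_pre(3, 4, 1));
    assert(before_pre(3, 4, 0) == after_pre(3, 4, 0));
}
```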
