https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|WAITING                     |NEW
                 CC|                            |rguenth at gcc dot gnu.org
          Component|middle-end                  |tree-optimization
            Summary|GCC 4.6 performance         |GCC 4.6 performance
                   |regression (vs. 4.4/4.5)    |regression (vs. 4.4/4.5),
                   |                            |PRE increases register
                   |                            |pressure

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
As for movaps vs. movups: when movaps actually works, the choice shouldn't
make any difference on modern architectures.  So I wonder if you could share
the exact CPU type you are using?

We are putting quite heavy register pressure on the code by means of partial
redundancy elimination (PRE), so disabling PRE with -fno-tree-pre might help
(we still spill a lot).

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     103296 c/s real, 103296 c/s virtual
Only one salt:  100736 c/s real, 100736 c/s virtual

improves to

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     126848 c/s real, 126848 c/s virtual
Only one salt:  123008 c/s real, 123008 c/s virtual

with that for me (gcc 4.8, SSE2).  Which is close to what 4.5.3 gets for me:

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     128384 c/s real, 128384 c/s virtual
Only one salt:  124800 c/s real, 124800 c/s virtual

albeit 4.5.3 doesn't need -fno-tree-pre to get there.

Note that we have to use movups because DES_bs_all is not aligned as seen
from DES_bs_b.c (it's defined in DES_bs.c and only there annotated with
CC_CACHE_ALIGN, not at the point of declaration in DES_bs.h).  So the
unaligned moves are the source's fault.  Annotating the declaration with
CC_CACHE_ALIGN produces the desired movaps instructions (with no effect on
performance for me).

I think for the effect of PRE increasing register pressure we do have
some duplicate bugs (but no good heuristic to fix anything).  LIM
store-motion can have the very same issue.
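
To illustrate the alignment point from the comment: a translation unit that only sees an extern declaration can assume no more alignment than that declaration states, so the attribute has to appear in the header as well as on the definition. The following is a minimal sketch, not the John the Ripper source. The names DEMO_CACHE_ALIGN, demo_state and demo_all are hypothetical stand-ins for CC_CACHE_ALIGN and DES_bs_all, the value 64 is an assumed cache-line size, and the attribute syntax is the GCC extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CC_CACHE_ALIGN; 64 is an assumed value. */
#define DEMO_CACHE_ALIGN __attribute__((aligned(64)))

/* Hypothetical stand-in for the type of DES_bs_all. */
typedef struct {
    unsigned char bytes[256];
} demo_state;

/* What the header would contain: the declaration carries the attribute,
 * so every file including it may assume 64-byte alignment. */
extern DEMO_CACHE_ALIGN demo_state demo_all;

/* What the defining .c file would contain. */
DEMO_CACHE_ALIGN demo_state demo_all;

/* Alignment the compiler associates with the object as declared here
 * (GCC's __alignof__ on an lvalue honours the aligned attribute). */
size_t demo_declared_alignment(void)
{
    return __alignof__(demo_all);
}

/* Run-time check that the object's address really is a multiple of 64. */
int demo_all_is_cache_aligned(void)
{
    return ((uintptr_t)&demo_all % 64) == 0;
}
```

If the attribute were dropped from the extern line, a file that sees only that declaration would have to fall back to the type's natural alignment, which is the situation described for DES_bs_b.c.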