http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754
--- Comment #2 from Matthias Kretz <kretz at kde dot org> 2011-02-15 16:31:39 UTC ---
True, the Optimization Reference Manual and the AVX docs are not very specific about the performance impact of this. But as far as I understand the docs, it will internally be neither slower nor faster than an unaligned load + op, except, of course, where memory fetch latency is involved. So, again AFAIU, it's just about having more registers available. If you want, I can try the same testcase on ICC...
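
For reference, the "unaligned load + op" pattern can be sketched in C roughly as below. This is only an illustration and not the testcase attached to this report: the function name add8 and its pointer-based interface are made up for the example. The point is that VEX-encoded arithmetic instructions place no alignment requirement on their memory operand, so a compiler is free to emit either a separate vmovups followed by vaddps, or a single vaddps with a memory operand. The folded form does not need a temporary register for the loaded value.

```c
#include <immintrin.h>

/* Hypothetical helper (not from the bug report): dst[0..7] = a[0..7] + b[0..7].
   None of the pointers needs to be 32-byte aligned.
   The target attribute lets this build without -mavx on GCC >= 4.9. */
__attribute__((target("avx")))
void add8(float *dst, const float *a, const float *b)
{
    /* Explicit unaligned load of the first operand. */
    __m256 va = _mm256_loadu_ps(a);

    /* Second unaligned load feeding straight into the add. Depending on the
       compiler and version, this load may be folded into the memory operand
       of vaddps, or emitted as a separate vmovups. */
    __m256 sum = _mm256_add_ps(va, _mm256_loadu_ps(b));

    /* Unaligned store of the result. */
    _mm256_storeu_ps(dst, sum);
}
```

Compiling with something like `gcc -O2 -mavx -S` and looking at the generated assembly shows which of the two forms a given compiler version picks; the result is the same either way.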