------- Comment #8 from rguenth at gcc dot gnu dot org  2009-07-07 15:47 -------
The issue is likely the sequence

 load upper half of cache line 1
 load lower half of cache line 2
 store upper half of cache line 1
 store lower half of cache line 2  <---
 load upper half of cache line 2   <---
 load lower half of cache line 3
 ...

where the marked lines probably cause internal delays.  Not using unaligned
stores for this kind of data dependence, or peeling for alignment, will
probably help here.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40648
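
For illustration only, here is roughly what "peeling for alignment" means when written out by hand for an in-place update like the one in the sequence above. This is a sketch, not what the vectorizer actually emits: the function name scale_in_place_peeled, the 16-byte vector width, and the assumption that the cache line size is a multiple of 16 bytes are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: scale an array in place, so every chunk is
   loaded and then stored back to the same addresses. */
#define VEC_BYTES 16u                          /* assumed vector width */
#define VEC_ELEMS (VEC_BYTES / sizeof(float))  /* floats per vector */

void scale_in_place_peeled(float *a, size_t n, float k)
{
    size_t i = 0;

    /* The peel loop can only hit a 16-byte boundary if a is float-aligned. */
    assert((uintptr_t)a % sizeof(float) == 0);

    /* Prologue (the peel): scalar iterations until a + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(a + i) % VEC_BYTES) != 0) {
        a[i] *= k;
        i++;
    }

    /* Main body: each group of VEC_ELEMS floats starts on a 16-byte
       boundary, so a vectorizer can use aligned loads and stores that
       do not straddle two cache lines. */
    for (; i + VEC_ELEMS <= n; i += VEC_ELEMS) {
        for (size_t j = 0; j < VEC_ELEMS; j++)
            a[i + j] *= k;
    }

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; i < n; i++)
        a[i] *= k;
}
```

With the prologue in place, a store to one aligned chunk and the load of the next chunk no longer touch the same pair of cache lines, which is the overlap marked in the sequence.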