------- Comment #5 from hubicka at gcc dot gnu dot org 2009-01-15 00:30 -------
Created an attachment (id=17106)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17106&action=view)
Proposed patch
The patch makes GCC generate a movaps load followed by addps. On Core 2 it speeds up the testcase from 7s to 6.2s, so I guess it works as expected. However, the speedup does not reproduce on an AMD box, and I am not sure whether that is just a coincidence here or whether Core really prefers split read-execute SSE operations (the manual does not recommend it). H.J., perhaps you have some advice here? Or can we at least do some benchmarking?

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
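
For readers who want to see the two instruction forms discussed above side by side, here is a minimal sketch using GCC inline asm. It is not taken from the patch or from the testcase in this PR; the function names (add_split, add_fused, forms_agree) are hypothetical and exist only for illustration. It assumes an x86 target with SSE and 16-byte-aligned data, and falls back to a stub elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__) || defined(__x86_64__)
#include <xmmintrin.h>

/* Split form: an explicit aligned load into a register (movaps),
   followed by a register-register addps. */
static __m128 add_split(__m128 acc, const float *p)
{
    __m128 tmp;
    assert(((uintptr_t)p & 15) == 0);  /* movaps needs 16-byte alignment */
    __asm__("movaps %1, %0" : "=x"(tmp) : "m"(*(const __m128 *)p));
    __asm__("addps %1, %0" : "+x"(acc) : "x"(tmp));
    return acc;
}

/* Fused (read-execute) form: addps takes its source operand from memory. */
static __m128 add_fused(__m128 acc, const float *p)
{
    assert(((uintptr_t)p & 15) == 0);  /* memory operand must be aligned too */
    __asm__("addps %1, %0" : "+x"(acc) : "m"(*(const __m128 *)p));
    return acc;
}

/* Returns 1 when both forms produce the same four sums. */
static int forms_agree(void)
{
    static const float data[4] __attribute__((aligned(16))) =
        {1.0f, 2.0f, 3.0f, 4.0f};
    float a[4] __attribute__((aligned(16)));
    float b[4] __attribute__((aligned(16)));
    __m128 acc = _mm_set1_ps(0.5f);

    _mm_store_ps(a, add_split(acc, data));
    _mm_store_ps(b, add_fused(acc, data));
    for (int i = 0; i < 4; i++)
        if (a[i] != b[i] || a[i] != data[i] + 0.5f)
            return 0;
    return 1;
}
#else
/* Non-x86 targets: nothing to compare, so report agreement. */
static int forms_agree(void) { return 1; }
#endif
```

Both functions compute the same result; only the instruction sequence differs. Any timing difference between them would have to be measured per microarchitecture, which is the benchmarking question raised above.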