http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55829
--- Comment #9 from Uros Bizjak <ubizjak at gmail dot com> 2013-01-09 17:52:19 UTC ---
gcc now generates:

        movq    p1(%rip), %r12  # 56    *movdi_internal_rex64/2 [length = 7]
        movq    %r12, (%rsp)    # 57    *movdi_internal_rex64/4 [length = 4]
        movddup (%rsp), %xmm1   # 23    *vec_concatv2df/3       [length = 5]

is there a reason not to load directly from p1, to avoid extra moves:

        movddup p1(%rip), %xmm1
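
For readers without the PR's testcase at hand, here is a minimal sketch of the kind of source that asks for a load-and-duplicate of a global. It is not the testcase attached to the PR. The type of p1 is not shown in this comment, so declaring it as double is an assumption, and dup_p1 is a hypothetical name. The sketch uses the GCC vector extension rather than intrinsics.

```c
/* Hypothetical reduction, NOT the testcase from the PR.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Stand-in for the global named in the comment; its real type is
   not shown there, so double is an assumption.  */
double p1 = 1.5;

/* Broadcast the scalar stored at p1 into both lanes of a 128-bit
   vector.  This is the operation a single movddup from memory
   performs.  */
v2df
dup_p1 (void)
{
  v2df v = { p1, p1 };
  return v;
}
```

Compiling with something like "gcc -O2 -msse3 -S" lets you inspect the generated code. Whether the load is folded into movddup or goes through a GPR and a stack slot, as in the sequence quoted above, depends on the compiler version and the register allocator.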