https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #28 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
>
> Try this:

The latency problem with the original testcase is solved with:

(define_peephole2
  [(match_scratch:DI 3 "Yv")
   (set (match_operand:V2DI 0 "sse_reg_operand")
        (vec_concat:V2DI
          (match_operand:DI 1 "sse_reg_operand")
          (match_operand:DI 2 "nonimmediate_gr_operand")))]
  ""
  [(set (match_dup 3) (match_dup 2))
   (set (match_dup 0)
        (vec_concat:V2DI (match_dup 1) (match_dup 3)))])

but I don't know if this transformation applies universally to all x86 targets.
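
For readers less used to RTL, here is a rough C sketch of the two-step sequence the peephole2 emits: first copy the general-register (or memory) operand into a scratch SSE register, then concatenate two SSE registers. It is only an illustration, not code from the patch. The names `v2di` and `concat_via_scratch` are made up for this sketch, and the mapping to instructions (a movq into the scratch register followed by punpcklqdq, in place of a single pinsrq from a general register) is my assumption about what the backend would pick. The intrinsics path is used only on x86-64 with SSE2; elsewhere a plain scalar fallback models the same result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical stand-in for a V2DI value: element 0 is the low half. */
typedef struct { uint64_t lo, hi; } v2di;

/* Models (vec_concat:V2DI lo hi) in the two steps produced by the peephole2.
   "lo" plays operand 1 (already in an SSE register), "hi" plays operand 2
   (coming from a general register or memory). */
v2di concat_via_scratch(uint64_t lo, uint64_t hi)
{
    v2di out;
#if defined(__x86_64__) && defined(__SSE2__)
    __m128i op1 = _mm_cvtsi64_si128((long long)lo);
    /* Step 1: (set (match_dup 3) (match_dup 2)), GPR -> scratch SSE reg. */
    __m128i scratch = _mm_cvtsi64_si128((long long)hi);
    /* Step 2: vec_concat of two SSE registers (low halves interleaved). */
    __m128i res = _mm_unpacklo_epi64(op1, scratch);
    memcpy(&out, &res, sizeof out);
#else
    /* Portable fallback: same result, no SSE involved. */
    out.lo = lo;
    out.hi = hi;
#endif
    return out;
}

/* Small self-check that the two-step form yields the expected halves. */
void concat_self_check(void)
{
    v2di r = concat_via_scratch(1u, 2u);
    assert(r.lo == 1u && r.hi == 2u);
}
```

Compiling `concat_via_scratch` with -O2 and inspecting the assembly is one way to see which instruction pair a given GCC version selects; the result can differ between versions and -march settings.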