https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Hmm, the sequence does not use + at all, but I think I know what is going on. While the field is called addss, it is used as a kitchen sink for all other simple operations:

      /* pmuludq under sse2, pmuldq under sse4.1, for sign_extend,
         require extra 4 mul, 4 add, 4 cmp and 2 shift.  */
      if (!TARGET_SSE4_1 && !uns_p)
        extra_cost = (cost->mulss + cost->addss + cost->sse_op) * 4
                     + cost->sse_op * 2;
    ....
    case FLOAT_EXTEND:
      if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
        *total = 0;
      else
        *total = ix86_vec_cost (mode, cost->addss);
      return false;
    ....
    case FLOAT_TRUNCATE:
      if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
        *total = cost->fadd;
      else
        *total = ix86_vec_cost (mode, cost->addss);
      return false;
    ...
    case scalar_stmt:
      return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
    ....
    case vector_stmt:
      return ix86_vec_cost (mode,
                            fp ? ix86_cost->addss : ix86_cost->sse_op);

Only addss was sped up (and apparently only in common contexts); other simple SSE operations are still 3 cycles. We have

    const int sse_op;    /* cost of cheap SSE instruction.  */
    const int addss;     /* cost of ADDSS/SD SUBSS/SD instructions.  */

SSE_OP is used for integer SSE instructions, which are typically 1 cycle, so perhaps we also want

    const int sse_fp_op; /* cost of cheap SSE fp instruction.  */

in addition to addss. But to be precise, builtin_vectorization_cost would need to know whether a scalar_stmt/vector_stmt is an addition or something else, which AFAIK it doesn't.