https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so on Haswell I see (- is bad, + is good):

-0x2342ca0 _40 + _45 1 times scalar_stmt costs 12 in body
+0x2342ca0 _40 + _45 1 times scalar_stmt costs 4 in body

so a simple add changes cost from 4 to 12 with the patch.  Ah, so that goes

      switch (subcode)
	{
	case PLUS_EXPR:
	case POINTER_PLUS_EXPR:
	case MINUS_EXPR:
	  if (kind == scalar_stmt)
	    {
	      if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
		stmt_cost = ix86_cost->addss;
	      else if (X87_FLOAT_MODE_P (mode))
		stmt_cost = ix86_cost->fadd;
	      else
		stmt_cost = ix86_cost->add;
	    }

where with kind == scalar_stmt we now run into the SSE_FLOAT_MODE_P case
(previously mode was something like V2DFmode) and thus use ix86_cost->addss
instead of ix86_cost->add.  That's more correct.

That causes us to (for example) now vectorize mccas.fppized.f:3160 where we
previously figured vectorization is never profitable.  The loop looks like

      DO 10 MK=1,NOC
        DO 10 ML=1,MK
          MKL = MKL+1
          XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *         VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
          XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *         VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   10 CONTINUE

and requires versioning for aliasing plus strided loads and strided stores.
We seem to be too trigger-happy in doing that.  Also the vector version isn't
entered at all at runtime.  But that's not the 10%.  And the big offenders
from looking at perf output do not have any vectorization decision changes...
very strange.