https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83358
--- Comment #3 from Jan Hubicka <hubicka at ucw dot cz> ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83358
>
> --- Comment #2 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
> The following fixes this particular issue:
>
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 312467d9788..00f1dae9085 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -2345,7 +2345,7 @@ struct processor_costs core_cost = {
>    {COSTS_N_INSNS (8),    /* cost of a divide/mod for QI */
>     COSTS_N_INSNS (8),    /*                          HI */
>                           /* 8-11 */
> -   COSTS_N_INSNS (11),   /*                          SI */
> +   COSTS_N_INSNS (13),   /*                          SI */
>                           /* 24-81 */
>     COSTS_N_INSNS (81),   /*                          DI */
>     COSTS_N_INSNS (81)},  /*                       other */
>
> Perhaps the div costs are a bit too tight in general?

The main problem here is that the algorithm expanding div/mod into
shift/add/lea sequences does not consider the available parallelism at
all, so the cost model is not realistic.  I meant to write a benchmark
that tries different constants and checks whether they end up faster
with idiv or with the expanded sequence.

The original costs were still based on Pentium 4, so I brought them to
be consistently based on latencies.  Increasing the values a bit over
the estimated latencies (with a comment explaining why that was done)
is perhaps the easiest short-term solution.

Honza
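For illustration only, a minimal sketch of the kind of benchmark mentioned
above, comparing idiv against the compiler-expanded sequence for one
constant divisor.  The divisor (17), the iteration count and the timing
harness are assumptions for the sketch, not taken from this PR; in a real
run one would sweep over many divisors and compile with the tuning under
test.

/* Compare hardware division against the compiler-expanded sequence for a
   compile-time-constant divisor.  Dividing by a volatile copy of the
   divisor forces a real idiv, while dividing by the literal lets GCC
   expand the division into multiply/shift/add code.  */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define N 100000000u
#define DIVISOR 17u          /* hypothetical constant divisor to test */

static double now (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main (void)
{
  volatile uint32_t runtime_div = DIVISOR;  /* hides the constant -> idiv */
  uint32_t sum = 0;
  double t;

  t = now ();
  for (uint32_t i = 1; i <= N; i++)
    sum += i / runtime_div;                 /* hardware divide */
  printf ("idiv:     %.3fs (sum=%u)\n", now () - t, sum);

  sum = 0;
  t = now ();
  for (uint32_t i = 1; i <= N; i++)
    sum += i / DIVISOR;                     /* expanded sequence */
  printf ("expanded: %.3fs (sum=%u)\n", now () - t, sum);

  return 0;
}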