On Thu, Jun 20, 2019 at 5:19 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Thu, Jun 20, 2019 at 12:43 AM Uros Bizjak <ubiz...@gmail.com> wrote:
> >
> > On Thu, Jun 20, 2019 at 9:40 AM Uros Bizjak <ubiz...@gmail.com> wrote:
> > >
> > > On Mon, Jun 17, 2019 at 6:27 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > processor_costs has costs of RTL expressions and costs of moves:
> > > >
> > > > 1. Costs of RTL expressions is computed as COSTS_N_INSNS which are used
> > > > to generate RTL expressions with the lowest costs. Costs of RTL memory
> > > > operation can be very close to costs of fast instructions to indicate
> > > > fast memory operations.
> > > >
> > > > 2. After RTL expressions have been generated, costs of moves are used by
> > > > TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
> > > > costs for register allocator. Costs of load and store are higher than
> > > > costs of register moves to reduce stack usages by register allocator.
> > > >
> > > > We should separate costs of RTL expressions from costs of moves so that
> > > > they can be adjusted independently. This patch moves costs of moves to
> > > > the new used_by_ra field and duplicates costs of moves which are also
> > > > used for costs of RTL expressions.
> > >
> > > Actually, I think that the current separation is OK. Before reload, we
> > > actually don't know which register set will perform the move (not even
> > > if float mode will be moved in integer registers), the only thing we
> > > can estimate is the number of move instructions. The real cost of
> > > register moves is later calculated by the register allocator, where
> > > the register class is taken into account when calculating the cost.
> >
> > Forgot to say that due to the above reasoning, cost of moves should
> > not be used in the calculation of costs of RTL expressions, as we are
> > talking about two different cost functions. RTL expressions should
> > know nothing about register classes.
> >
>
> Currently, costs of moves are also used for costs of RTL expressions. This
> patch:
>
> https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
>
> includes:
>
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index e943d13..8409a5f 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
>    {4, 4, 4}, /* cost of loading integer registers
>                  in QImode, HImode and SImode.
>                  Relative to reg-reg move (2). */
> -  {6, 6, 6}, /* cost of storing integer registers */
> +  {6, 6, 3}, /* cost of storing integer registers */
>    2, /* cost of reg,reg fld/fst */
>    {6, 6, 8}, /* cost of loading fp registers
>                  in SFmode, DFmode and XFmode */
>
> It lowered the cost for SImode store and made it cheaper than SSE<->integer
> register move. It caused a regression:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
>
> Since the cost for SImode store is also used to compute scalar_store
> in ix86_builtin_vectorization_cost, it changed loop costs in
>
> void
> foo (long p2, long *diag, long d, long i)
> {
>   long k;
>   k = p2 < 3 ? p2 + p2 : p2 + 3;
>   while (i < k)
>     diag[i++] = d;
> }
>
> As the result, the loop is unrolled 4 times with -O3 -march=skylake,
> instead of 3.
>
> My patch separates costs of moves from costs of RTL expressions. We have
> a follow up patch which restores the cost for SImode store back to 6 and leave
> the cost of scalar_store unchanged. It keeps loop unrolling unchanged and
> improves powf performance in glibc by 30%. We are collecting SPEC CPU 2017
> data now.

It looks like the x86 costs are one big mess. I suggest you take this
matter to Honza; he knows this part better than I do.

Uros.
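
For readers following the thread, the separation H.J. proposes can be
sketched roughly as below. This is only an illustration under stated
assumptions, not the actual patch: the struct and function names
(toy_processor_costs, toy_ra_store_cost, toy_scalar_store_cost) are
hypothetical and far simpler than GCC's struct processor_costs. Only the
used_by_ra field name and the numbers (SImode store cost 6 for the
register allocator, 3 for the duplicated copy) come from the thread.

```c
/* Hypothetical, simplified model of splitting move costs from the
   costs used elsewhere (e.g. by the vectorizer).  Not GCC code.  */

enum toy_mode { TOY_QI = 0, TOY_HI = 1, TOY_SI = 2 };

struct toy_ra_costs
{
  int int_load[3];   /* QImode, HImode, SImode; relative to reg-reg move (2) */
  int int_store[3];  /* same modes */
};

struct toy_processor_costs
{
  struct toy_ra_costs used_by_ra; /* consulted only for register allocation */
  int scalar_store;               /* duplicated copy, tuned independently */
};

/* Values taken from the diff quoted above: the register allocator sees
   the store cost restored to 6, the duplicated copy keeps 3.  */
const struct toy_processor_costs toy_costs = {
  { {4, 4, 4}, {6, 6, 6} },
  3
};

/* Cost the register allocator would see for a store in MODE.  */
int
toy_ra_store_cost (const struct toy_processor_costs *c, enum toy_mode mode)
{
  return c->used_by_ra.int_store[mode];
}

/* Cost a vectorizer-style query would see for a scalar store.  */
int
toy_scalar_store_cost (const struct toy_processor_costs *c)
{
  return c->scalar_store;
}
```

With this layout, changing used_by_ra.int_store[TOY_SI] no longer alters
what toy_scalar_store_cost returns, which is the independence the
follow-up patch relies on.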