On Tue, Dec 17, 2013 at 6:50 AM, Alan Modra <amo...@gmail.com> wrote: > This patch is aimed at fixing test failures introduced by my > 2013-12-07 change to bswapdi2_32bit: > FAIL: gcc.target/powerpc/pr53199.c scan-assembler-times lwbrx 6 > FAIL: gcc.target/powerpc/pr53199.c scan-assembler-times stwbrx 6 > > The 2013-12-07 change was necessary to make -m32 -mlra generate good > code for pr53199.c:reg_reverse. Too many '?'s on the r->r alternative > result in lra never choosing that option. Instead we get Z->r, > ie. storing the input reg to a stack temp then using lwbrx from there. > That means we have a load-hit-store flush with a severe slowdown. > (See http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00002.html for the > corresponding -m64 result, a 4x slowdown.) > > A similar problem occurs with -m64 -mcpu=power7 -mlra due to > bswapdi2_ldbrx having two '?'s on r->r. > > To fix this I ran into a whole lot of pain. reload and lra are quite > different in their selection of insn alternatives. I could not find > any combination of '?'s that generated the best code for both reload > an lra on pr53199.c. To see why, it's necessary to look (from a great > height) at how reload chooses amongst alternatives. A particular > alternative gets a "loser" score, with the lowest "loser" being > chosen. "loser" is incremented > a) when an operand does not match its constraint. > b) when an alternative has a '?' in the constraint. > c) when a scratch register is required. > d) when an early clobber output clashes with one of the inputs. > > a) is fairly obvious. For example, if we have a MEM when the operand > alternative needs a reg, then we'll require a reload. > b) is also quite obvious. Multiple '?'s accumulate. > c) is a little more subtle. It bites you when alternatives require > differing numbers of scratch registers. Take for example > bswapdi2_64bit, which before this patch has three alternatives > Z->r with 3 scratch regs, (Z is a subset of m) > r->Z with 2 scratch regs, > r->r with 3 scratch regs. > All other things being equal, with reload you could correct this > disparity by adding a '?' to the r->Z alternative. We might want > to do that so that Z->r and r->r are the same "distance" apart > as r->Z is from r->r. With lra it seems that scratch regs are > weighted differently.. > d) is also tricky, and a trap for anyone optimizing insn selection for > functions like some in pr53199.c that have just one rtl insn with > early clobbers. PowerPC generally returns function results in the > same register as the first argument, so these hit the early > clobber. Code elsewhere in larger functions probably won't. > lra penalizes early clobbers differently to reload (a lot less). > > So, putting this all together.. Point (d) test implication is covered > by the additional functions in pr53199.c. Avoiding early clobbers > where possible is good since it reduces differences between reload and > lra insn alternative costing. We also generate better code. I > managed to do this for all the Z->r bswapdi patterns, but stopped > short of changing r->r as well since at that point everything looked > OK! Avoiding extra scratch registers for one alternative is also good. > The op4 r->r scratch wasn't even used, and op4 for Z->r fairly easy to > do without. Renaming word_high/low to word1/2 was to make it a little > easier to track lifetime of addr1/2. > > Bootstrapped and regression tested powerpc64-linux. Output of > pr53199.c inspected for sanity with -mcpu=power{6,7} -m{32,64} and > {-mlra,}. OK to apply? > > gcc/ > * config/rs6000/rs6000.md (bswapdi2): Remove one scratch reg. > Modify Z->r bswapdi splitter to use dest in place of scratch. > In r->Z and Z->r bswapdi splitter rename word_high, word_low > to word1, word2 and rearrange logic to suit. > (bswapdi2_64bit): Remove early clobber on Z->r alternative. > (bswapdi2_ldbrx): Likewise. Remove '??' on r->r. > (bswapdi2_32bit): Remove early clobber on Z->r alternative. > Add one '?' on r->r. Modify Z->r splitter to avoid need for > early clobber. > gcc/testsuite/ > * gcc.target/powerpc/pr53199.c: Add extra functions.
Okay. Thanks, David