On Wed, Mar 23, 2016 at 2:18 PM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 2:58 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Mar 22, 2016 at 11:01 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>>
>>>>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> > Hi,
>>>>> > When I tried to decrease the number of IV candidates, I removed the
>>>>> > code that adds IV candidates for uses with a constant offset stripped
>>>>> > in use->base.  That was too aggressive and triggers PR69042, so here
>>>>> > is a patch adding back the missing candidates.  Honestly, this patch
>>>>> > doesn't truly fix the issue; it just restores the original behavior
>>>>> > in the IVOPTS part (which I still think is the right thing to do).
>>>>> > The issue still depends on the PIC_OFFSET register used on the x86
>>>>> > target, as discussed in
>>>>> > https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the
>>>>> > real problem could be in the register-pressure modeling of the
>>>>> > PIC_OFFSET symbol in IVOPTS.
>>>>> >
>>>>> > On AArch64, the overall spec2k number is unchanged, though 173.applu
>>>>> > regresses by ~4% because a couple of loops' number of candidates now
>>>>> > hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 on
>>>>> > AArch64, INT is not affected and FP is unchanged overall; for
>>>>> > specific cases, 459.GemsFDTD regresses by 2% and 433.milc improves
>>>>> > by 2%.  To address the regression, I will send another patch
>>>>> > increasing the parameter bound.
>>>>> >
>>>>> > Bootstrapped & tested on x86_64 and AArch64; is it OK?  In the
>>>>> > meantime, I will collect spec2k6 data on x86_64.
>>>>>
>>>>> Ok.
>>>> Hi Richard,
>>>> Hmm, I got spec2k6 data on my x86_64; the patch (along with the patch
>>>> increasing param iv-consider-all-candidates-bound) causes a 1%
>>>> regression for 436.cactusADM in my run.  I looked into the code: for
>>>> function bench_staggeredleapfrog2_ (which takes 99% of the running
>>>> time after patching), IVOPTS chooses one fewer candidate for the
>>>> outer loop, but that results in a couple more instructions there.
>>>
>>> You mean IVOPTS chooses one fewer IV for the outer loop?
>> Yes.
>>>
>>>> For this case, register pressure is the more interesting issue (36
>>>> candidates chosen in the outer loop, many stack accesses).  I am not
>>>> sure whether this 1% regression blocks the patch at this stage or not.
>>>
>>> Is this with or without the increase of the param?  What compiler
>>> options and on what sub-architecture was this?
>> It's not the param: we have both address and generic iv_uses in the
>> form of {base + 4, step}, and there is another address iv_use
>> {base + 8, step}.  With the patch we now add candidate {base, step},
>> which is used for all these uses.  Before the patch, candidates
>> {base + 4, step} and {base + 8, step} were chosen for these iv_uses
>> respectively.
>> As a matter of fact, {base + 4, step} should be the best choice, but
>> we can't make it because a candidate added for one iv_use is not
>> related to the other iv_uses, and we exceed the bound of param
>> iv-consider-all-candidates-bound = 30.
>
> I suppose upping that to 40 doesn't help here either (AFAIR the loops
> are quite big).
Yes, there are ~100 candidates in total for each loop.
>> By fixing this (plus one other local patch), the number of candidates
>> for these loops drops from 36/36/32 to 12/12/7, and the number of
>> lines of assembly drops from ~7150 to ~6850.  Interestingly, the
>> performance isn't improved...  That may be because they are outermost
>> loops.
>
> Yeah, maybe a good default heuristic would be to keep existing BIVs
> and add no related candidates at all for outer loops.  Or to adjust
> the heuristic to prefer the least number of IVs for those.
>
>>> [can you maybe check the IV choice by ICC?]
>> I don't have ICC at hand, but from past experiments, ICC tends to
>> reuse common IVs more than GCC does.
>
> Ok.
>
> I think the patch is ok, and eventually we need to revisit the
> outer-loop handling with this as an excuse.
Yes, currently GCC doesn't handle loop nests at all.  I will revisit
this part in the coming stage1.  Patch applied as revision 234429.
Thanks,
bin