On Wed, Mar 23, 2016 at 2:18 PM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 2:58 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Mar 22, 2016 at 11:01 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>>
>>>>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> > Hi,
>>>>> > When I tried to decrease the number of IV candidates, I removed the
>>>>> > code that adds IV candidates for uses with a constant offset stripped
>>>>> > in use->base.  That was too aggressive and triggers PR69042, so here
>>>>> > is a patch adding back the missing candidates.  Honestly, this patch
>>>>> > doesn't truly fix the issue; it just restores the original behavior
>>>>> > in the IVOPTS part (which I still think is the right thing to do).
>>>>> > The issue still depends on the PIC_OFFSET register used on the x86
>>>>> > target, as discussed in
>>>>> > https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the
>>>>> > real problem could be in the register-pressure modeling of the
>>>>> > PIC_OFFSET symbol in IVOPTS.
>>>>> >
>>>>> > On AArch64, the overall spec2k number is unchanged, though 173.applu
>>>>> > regresses by ~4% because a couple of loops' number of candidates now
>>>>> > hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 on
>>>>> > AArch64, INT is not affected and FP is unchanged overall; for
>>>>> > specific cases, 459.GemsFDTD regresses by 2% and 433.milc improves
>>>>> > by 2%.  To address the regression, I will send another patch
>>>>> > increasing the parameter bound.
>>>>> >
>>>>> > Bootstrapped & tested on x86_64 and AArch64; is it OK?  In the
>>>>> > meantime, I will collect spec2k6 data on x86_64.
>>>>>
>>>>> Ok.
>>>> Hi Richard,
>>>> Hmm, I got spec2k6 data on my x86_64; the patch (along with the patch
>>>> increasing param iv-consider-all-candidates-bound) causes a 1%
>>>> regression for 436.cactusADM in my run.  I looked into the code: for
>>>> function bench_staggeredleapfrog2_ (which takes 99% of the running
>>>> time after patching), IVOPTS chooses one fewer candidate for the
>>>> outer loop, but that results in a couple more instructions there.
>>>
>>> You mean IVOPTS chooses one fewer IV for the outer loop?
>> Yes.
>>>
>>>> For this case, register pressure is the more interesting issue (36
>>>> candidates chosen in the outer loop, many stack accesses).  I am not
>>>> sure whether this 1% regression blocks the patch at this stage or not.
>>>
>>> Is this with or without the increase of the param?  What compiler
>>> options and on what sub-architecture was this?
>> It's not the param: we have both address and generic iv_uses in the
>> form of {base + 4, step}, and there is another address iv_use
>> {base + 8, step}.  With the patch we now add candidate {base, step},
>> which is used for all these uses.  Before the patch, candidates
>> {base + 4, step} and {base + 8, step} were chosen for these iv_uses
>> respectively.
>> As a matter of fact, {base + 4, step} should be the best choice, but
>> we can't make it because a candidate added for one iv_use is not
>> related to the other iv_uses, and we exceed the bound of param
>> iv-consider-all-candidates-bound = 30.
>
> I suppose upping that to 40 doesn't help here either (AFAIR the loops
> are quite big).
Yes, there are ~100 candidates in total for each loop.
>> By fixing this (plus one other local patch), the number of candidates
>> for these loops drops from 36/36/32 to 12/12/7, and the number of
>> lines of assembly drops from ~7150 to ~6850.  Interestingly, the
>> performance isn't improved...  That may be because they are outermost
>> loops.
>
> Yeah, maybe a good default heuristic would be to keep existing BIVs
> and add no related candidates at all for outer loops.  Or to adjust
> the heuristic to prefer the least number of IVs for those.
>
>>> [can you maybe check the IV choice by ICC?]
>> I don't have ICC at hand, but from past experiments, ICC tends to
>> reuse common IVs more than GCC does.
>
> Ok.
>
> I think the patch is ok, and eventually we need to revisit the
> outer-loop handling with this as an excuse.
Yes, currently GCC doesn't handle loop nests at all.  I will revisit
this part in the coming stage1.  Patch applied as revision 234429.
Thanks,
bin