Hello Kewen:

On 17/01/24 12:32 pm, Kewen.Lin wrote:
> on 2024/1/16 06:22, Ajit Agarwal wrote:
>> Hello Richard:
>>
>> On 15/01/24 6:25 pm, Ajit Agarwal wrote:
>>>
>>> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>>>> Hello Richard:
>>>>
>>>> On 15/01/24 3:03 pm, Richard Biener wrote:
>>>>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal <aagar...@linux.ibm.com> wrote:
>>>>>>
>>>>>> Hello All:
>>>>>>
>>>>>> This patch adds the vecload pass, which replaces adjacent lxv memory
>>>>>> accesses with lxvp instructions. This pass runs before the ira pass.
>>>>>>
>>>>>> The vecload pass removes one of the two adjacent lxv (load) definitions
>>>>>> and replaces the other with lxvp. Because one of the defining loads is
>>>>>> removed, its allocno has only uses but no defs.
>>>>>>
>>>>>> Because of this, the IRA pass doesn't assign register pairs as
>>>>>> sequential registers. Changes are made in the IRA register allocator
>>>>>> to assign sequential registers to adjacent loads.
>>>>>>
>>>>>> Some of the registers are cleared and are not treated as profitable
>>>>>> because a zero cost is greater than a negative cost; checks are added
>>>>>> to compare positive costs.
>>>>>>
>>>>>> LRA is changed not to reassign them to different registers, so that
>>>>>> the sequential register pairs stay intact.
>>>>>>
>>>>>> contrib/check_GNU_style.sh run on the patch looks good.
>>>>>>
>>>>>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>>>>>
>>>>>> Spec2017 benchmarks were run and I get impressive benefits for some
>>>>>> of the FP benchmarks.
>>>>>
>>>>> I want to point out that the aarch64 target recently got a ld/st
>>>>> fusion pass which sounds related. It would be nice to have at least
>>>>> common infrastructure for this (the aarch64 one also looks quite a
>>>>> bit more powerful).

> Thanks Richi for pointing out this pass. Yeah, it would be nice if we
> can share something common.
> CC the author Alex as well in case he has more insightful comments.
>
>>>>
>>>> The load/store fusion pass in aarch64 is scheduled before the
>>>> peephole2 pass and after the register allocator pass. In our case, if
>>>> we do it after the register allocator, then we should keep the
>>>> register assigned to the lower-offset load, and the other load, which
>>>> is adjacent to the previous load with an offset difference of 16, is
>>>> removed.
>>>>
>>>> Then we are left with one load with the lower offset, and the register
>>>> assigned by the register allocator to the lower-offset load should be
>>>> lower than that of the other adjacent load. If not, we need to change
>>>> it to the lower register and propagate that to all the uses of the
>>>> variable. Similarly for the other adjacent load that we are removing:
>>>> its register needs to be propagated to all the uses.
>>>>
>>>> In that case we are doing the work of the register allocator. In most
>>>> of our example testcases the lower-offset load is assigned a greater
>>>> register number than the other adjacent load, and hence we are always
>>>> left propagating registers, almost redoing the register allocator's
>>>> work.
>>>>
>>>> Is it okay to use a load/store fusion pass as on aarch64 for our
>>>> cases, considering the above scenario?
>>>>
>>>> Please let me know what you think.
>>
>> I have gone through the implementation of ld/st fusion in aarch64.
>>
>> Here is my understanding:
>>
>> First of all, it was my mistake to say in my earlier mail that this
>> pass runs only before peephole2, after the RA pass.
>>
>> The pass actually runs both before the RA pass (early, before
>> early-remat) and after the RA pass, before peephole2.
>>
>> The pass fuses 2 ldr instructions with adjacent accesses into an ldp
>> instruction.
>>
>> The assembly syntax of the ldp instruction is
>>
>> ldp w3, w7, [x0]
>>
>> It loads [x0] into w3 and [x0+4] into w7.
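To make the quoted ldp semantics concrete, here is a small model of the two loads it performs. This is purely illustrative (the register/memory dictionaries and the `ldp` helper are hypothetical, not real tooling): ldp names both destination registers explicitly, so the pair need not be sequential.

```python
# Illustrative model of aarch64 "ldp wRT1, wRT2, [base]":
# loads [base] into the first register and [base+4] into the second.
WORD = 4  # ldp of w registers moves two 32-bit words

def ldp(regs, mem, rt1, rt2, base_addr):
    """regs and mem are dicts modeling register and memory state."""
    regs[rt1] = mem[base_addr]          # [base]   -> first register
    regs[rt2] = mem[base_addr + WORD]   # [base+4] -> second register

mem = {0x1000: 11, 0x1004: 22}
regs = {}
ldp(regs, mem, 3, 7, 0x1000)   # models: ldp w3, w7, [x0] with x0 = 0x1000
print(regs)  # {3: 11, 7: 22}
```

Note that nothing in this model forces rt2 == rt1 + 1, which is exactly the freedom the lxvp case below does not have.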
>>
>> Both registers that form the pair are named explicitly in the ldp
>> instruction, and they need not be in sequential order; if the first
>> register is w3, the next register need not be w3+1.
>>
>> That's why the pass works before the RA pass: it has both defs, and
>> the registers are not required to be sequential like first_reg and
>> then first_reg+1. They can be any valid registers.
>>
>> But with lxvp instructions:
>>
>> lxv vs32, 0(r2)
>> lxv vs45, 16(r2)
>>
>> When we combine the above lxv instructions into lxvp, the lxvp
>> instruction becomes
>>
>> lxvp vs32, 0(r2)
>>
>> where r2+0 is loaded into vs32 and r2+16 is loaded into the vs33
>> register (sequential registers). vs33 is implicit in the lxvp
>> instruction. This is a mandatory requirement for lxvp and cannot be
>> any other sequence; the register number difference must be 1.
>
> Note that the first register number in the pair should be even; that
> is, the so-called sequential order should be X, X + 1 (X even).
> This is also the reason why we preferred this pairing to be done
> before RA (it can catch more opportunities).
>
>> All the uses of r45 have to be propagated with r33.
>
> I think you meant s/r45/vs45/ and s/r33/vs33/.
>
Yes, I meant the same.

>>
>> Also, the register allocator can allocate the two lxv instructions to
>> the following registers:
>>
>> lxv vs33, 0(r2)
>> lxv vs32, 16(r2)
>>
>> To generate lxvp for the above lxv instructions:
>>
>> lxvp vs32, 0(r2)
>>
>> Then all uses of vs33 have to be propagated with vs32, and vs32 has
>> to be propagated with vs33, if we do the vecload pass after the RA
>> pass.
>>
>> If we do it before the RA pass, IRA and LRA register allocation
>> cannot assign registers with a difference of 1; the difference can be
>> anything positive, in any order.
>
> This sounds unexpected. IMHO if you adopt OOmode for the paired load,
> RA should be able to allocate two sequential vsx registers with the
> first one even, since OOmode is only ok for even vsx registers and its
> size makes it take two consecutive vsx registers.
>
> Hi Peter, is my understanding correct?
>

I tried all the combinations in the past; RA does not allocate
sequential registers. I don't see any code in RA that generates
sequential registers. We need to add explicit code in RA to generate
sequential registers. This is what I have done in the patch I submitted.

>>
>> IRA allocated one in vs32 and the other can be in vs45.
>>
>> In the vecload pass we remove one of the 2 lxv instructions: the 2nd
>> lxv, with offset 16, is removed, and the uses of the 2nd lxv's
>> register will not have defs, so the IRA pass cannot allocate them in
>> order with a difference of 1.
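To summarize the lxvp constraint discussed above as a concrete check: the two fused lxv loads must share a base register, their offsets must differ by exactly 16 bytes (one vsx vector), and the destination registers must be X and X+1 with X even. Here is a minimal sketch of that check (illustrative only, not GCC code; the tuple layout is a hypothetical representation):

```python
VEC_BYTES = 16  # size of one vsx vector register in bytes

def can_pair_lxv(load_a, load_b):
    """Each load is (dest_reg, base_reg, offset); True if lxvp-fusable."""
    # Order the two loads so `lo` is the lower-offset (low-half) access.
    lo, hi = sorted((load_a, load_b), key=lambda l: l[2])
    lo_reg, lo_base, lo_off = lo
    hi_reg, hi_base, hi_off = hi
    return (lo_base == hi_base                # same base register
            and hi_off - lo_off == VEC_BYTES  # adjacent accesses
            and hi_reg == lo_reg + 1          # sequential destinations
            and lo_reg % 2 == 0)              # first register even

# lxv vs32, 0(r2) / lxv vs33, 16(r2) -> fusable as lxvp vs32, 0(r2)
print(can_pair_lxv((32, 2, 0), (33, 2, 16)))   # True
# lxv vs32, 0(r2) / lxv vs45, 16(r2) -> destinations not sequential
print(can_pair_lxv((32, 2, 0), (45, 2, 16)))   # False
```

This is why the vs32/vs45 and vs33/vs32 allocations above both force register propagation when pairing is attempted after RA.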
>
> With Peter's patch to allow subreg from OOmode, I'd expect that we
> replace all uses of the first lxv (from the low part address) with
> (subreg:VnX <the result of lxvp> <low offset>) and all uses of the
> second lxv (from the high address) with (subreg:VnX <the result of
> lxvp> <high offset>). Currently after vecload, with the associated
> vecload.C, we transform from:
>
> (insn 8 4 10 2 (set (reg:V16QI 124 [ *ptr_4(D) ])
>         (mem:V16QI (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
>      (nil))
> (insn 10 8 9 2 (set (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (mem:V16QI (plus:DI (reg/v/f:DI 122 [ ptr ])
>                 (const_int 16 [0x10])) [0 MEM[(__vector unsigned char *)ptr_4(D) + 16B]+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
>      (expr_list:REG_DEAD (reg/v/f:DI 122 [ ptr ])
>         (nil)))
> (insn 9 10 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 124 [ *ptr_4(D) ])
>             ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:V16QI 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>                 (nil)))))
>
> to:
>
> (insn 19 4 9 2 (set (reg:OO 124 [ *ptr_4(D) ])
>         (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
>      (nil))
> (insn 9 19 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:OO 124 [ *ptr_4(D) ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>
> After vecload, IMHO this code sequence doesn't look valid: no insn
> defines pseudo 125, and pseudo 124 has OOmode but is used to replace a
> V16QI use. Both are expected to be subregs of pseudo 124 (it would be
> better to use a new pseudo for the paired load).
>

I tried that in the past, and no sequential registers are generated by RA.

> Without Peter's patch, UNSPEC_MMA_EXTRACT can be used for extracting.
> I think it should only result in sub-optimal code with possible extra
> register moves, but not have any correctness issue.
>
> This patch can NOT be bootstrapped on x86_64-redhat-linux, I guess
> it's caused by the proposed RA changes. And it's NOT regress-tested
> on Power10 LE with some go failures (note that I configured with
> --enable-languages="c,c++,fortran,objc,obj-c++,go"), also on Power9
> BE and LE with a few failures.
>

Did you try to bootstrap on x86_64-redhat-linux? Did this patch fail to
bootstrap? If so, I will fix it.

> btw, I think Mike wants to test this pairing support on both adjacent
> loads and stores, so could you also extend the pairing to cover the
> stores? It can be guarded by a separate flag and disabled by default,
> considering the known issue with paired stores; then Mike can test
> whether the pairing satisfies what he is looking for.
>

Sure, I will do that.

Thanks & Regards
Ajit

> BR,
> Kewen
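For reference, the adjacent-load pairing scan this thread keeps coming back to can be modeled roughly as follows. This is a simplified sketch, not the actual vecload pass (the data structures are hypothetical): before RA the loads use pseudos, so pairing only requires matching base registers and offsets 16 bytes apart; the register allocator is then responsible for the even/odd sequential assignment.

```python
VEC_BYTES = 16  # one vsx vector; adjacent loads are 16 bytes apart

def pair_adjacent_loads(loads):
    """loads: list of (dest_pseudo, base_reg, offset).
    Returns (pairs, leftovers): pairs are candidates for a single lxvp;
    leftovers stay as plain lxv."""
    # Group loads by base register, sorted by offset within each group.
    by_base = {}
    for ld in sorted(loads, key=lambda l: (l[1], l[2])):
        by_base.setdefault(ld[1], []).append(ld)
    pairs, leftovers = [], []
    for base, lds in by_base.items():
        i = 0
        while i < len(lds):
            if i + 1 < len(lds) and lds[i + 1][2] - lds[i][2] == VEC_BYTES:
                pairs.append((lds[i], lds[i + 1]))  # fuse into one lxvp
                i += 2
            else:
                leftovers.append(lds[i])            # no adjacent partner
                i += 1
    return pairs, leftovers

# Pseudos 124/125 loading from base 122 at offsets 0 and 16, as in the
# RTL dump above, form one pair.
pairs, rest = pair_adjacent_loads([(124, 122, 0), (125, 122, 16)])
print(pairs)  # [((124, 122, 0), (125, 122, 16))]
print(rest)   # []
```

After such a scan, the remaining work discussed in the thread is replacing the uses of the two original pseudos with low/high subregs of the paired load's pseudo, and teaching RA to give that pseudo an even-first consecutive register pair.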