Hello Kewen:

On 17/01/24 12:32 pm, Kewen.Lin wrote:
> on 2024/1/16 06:22, Ajit Agarwal wrote:
>> Hello Richard:
>>
>> On 15/01/24 6:25 pm, Ajit Agarwal wrote:
>>>
>>> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>>>> Hello Richard:
>>>>
>>>> On 15/01/24 3:03 pm, Richard Biener wrote:
>>>>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal <aagar...@linux.ibm.com> wrote:
>>>>>>
>>>>>> Hello All:
>>>>>>
>>>>>> This patch adds the vecload pass, which replaces adjacent lxv memory
>>>>>> accesses with lxvp instructions. This pass runs before the ira pass.
>>>>>>
>>>>>> The vecload pass removes one of the two adjacent lxv (load) definitions
>>>>>> and replaces the other with lxvp. Because one of the defining loads is
>>>>>> removed, its allocno has only uses but no defs.
>>>>>>
>>>>>> Because of this, the IRA pass doesn't assign register pairs as
>>>>>> sequential registers. Changes are made in the IRA register allocator
>>>>>> to assign sequential registers to adjacent loads.
>>>>>>
>>>>>> Some of the registers are cleared and are not treated as profitable
>>>>>> because a zero cost is greater than a negative cost; checks are added
>>>>>> to compare positive costs.
>>>>>>
>>>>>> LRA is changed not to reassign them to different registers, so that
>>>>>> the sequential register pairs stay intact.
>>>>>>
>>>>>> contrib/check_GNU_style.sh run on the patch looks good.
>>>>>>
>>>>>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>>>>>
>>>>>> Spec2017 benchmarks were run and I get impressive benefits for some
>>>>>> of the FP benchmarks.
>>>>>
>>>>> I want to point out that the aarch64 target recently got a ld/st
>>>>> fusion pass which sounds related. It would be nice to have at least
>>>>> common infrastructure for this (the aarch64 one also looks quite a
>>>>> bit more powerful).

> Thanks Richi for pointing out this pass. Yeah, it would be nice if we
> can share something common.
> CC the author Alex as well in case he has more insightful comments.
>
>>>>
>>>> The load/store fusion pass in aarch64 is scheduled before the
>>>> peephole2 pass and after the register allocator pass. In our case, if
>>>> we do it after the register allocator, then we should keep the
>>>> register assigned to the lower-offset load, and the other load, which
>>>> is adjacent to the previous load with an offset difference of 16, is
>>>> removed.
>>>>
>>>> Then we are left with one load with the lower offset, and the register
>>>> assigned by the register allocator to the lower-offset load should be
>>>> lower than that of the other adjacent load. If not, we need to change
>>>> it to the lower register and propagate that to all the uses of the
>>>> variable. Similarly for the other adjacent load that we are removing:
>>>> its register needs to be propagated to all the uses.
>>>>
>>>> In that case we are doing the work of the register allocator. In most
>>>> of our example testcases the lower-offset load is assigned a greater
>>>> register number than the other adjacent load, and hence we are always
>>>> left propagating registers, almost redoing the register allocator's
>>>> work.
>>>>
>>>> Is it okay to use a load/store fusion pass as on aarch64 for our
>>>> cases, considering the above scenario?
>>>>
>>>> Please let me know what you think.
>>
>> I have gone through the implementation of ld/st fusion in aarch64.
>>
>> Here is my understanding:
>>
>> First of all, it was my mistake to say in my earlier mail that this
>> pass runs only before peephole2, after the RA pass.
>>
>> The pass actually runs both before the RA pass (early, before
>> early-remat) and after the RA pass, before peephole2.
>>
>> The pass fuses 2 ldr instructions with adjacent accesses into an ldp
>> instruction.
>>
>> The assembly syntax of the ldp instruction is
>>
>> ldp w3, w7, [x0]
>>
>> It loads [x0] into w3 and [x0+4] into w7.
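To make the quoted ldp semantics concrete, here is a small model of the two loads it performs. This is purely illustrative (the register/memory dictionaries and the `ldp` helper are hypothetical, not real tooling): ldp names both destination registers explicitly, so the pair need not be sequential.

```python
# Illustrative model of aarch64 "ldp wRT1, wRT2, [base]":
# loads [base] into the first register and [base+4] into the second.
WORD = 4  # ldp of w registers moves two 32-bit words

def ldp(regs, mem, rt1, rt2, base_addr):
    """regs and mem are dicts modeling register and memory state."""
    regs[rt1] = mem[base_addr]          # [base]   -> first register
    regs[rt2] = mem[base_addr + WORD]   # [base+4] -> second register

mem = {0x1000: 11, 0x1004: 22}
regs = {}
ldp(regs, mem, 3, 7, 0x1000)   # models: ldp w3, w7, [x0] with x0 = 0x1000
print(regs)  # {3: 11, 7: 22}
```

Note that nothing in this model forces rt2 == rt1 + 1, which is exactly the freedom the lxvp case below does not have.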
>>
>> Both registers that form the pair are named explicitly in the ldp
>> instruction, and they need not be in sequential order; if the first
>> register is w3, the next register need not be w3+1.
>>
>> That's why the pass works before the RA pass: it has both defs, and
>> the registers are not required to be sequential like first_reg and
>> then first_reg+1. They can be any valid registers.
>>
>> But with lxvp instructions:
>>
>> lxv vs32, 0(r2)
>> lxv vs45, 16(r2)
>>
>> When we combine the above lxv instructions into lxvp, the lxvp
>> instruction becomes
>>
>> lxvp vs32, 0(r2)
>>
>> where r2+0 is loaded into vs32 and r2+16 is loaded into the vs33
>> register (sequential registers). vs33 is implicit in the lxvp
>> instruction. This is a mandatory requirement for lxvp and cannot be
>> any other sequence; the register number difference must be 1.
>
> Note that the first register number in the pair should be even; that
> is, the so-called sequential order should be X, X + 1 (X even).
> This is also the reason why we preferred this pairing to be done
> before RA (it can catch more opportunities).
>
>> All the uses of r45 have to be propagated with r33.
>
> I think you meant s/r45/vs45/ and s/r33/vs33/.
>
Yes, I meant the same.

>>
>> Also, the register allocator can allocate the two lxv instructions to
>> the following registers:
>>
>> lxv vs33, 0(r2)
>> lxv vs32, 16(r2)
>>
>> To generate lxvp for the above lxv instructions:
>>
>> lxvp vs32, 0(r2)
>>
>> Then all uses of vs33 have to be propagated with vs32, and vs32 has
>> to be propagated with vs33, if we do the vecload pass after the RA
>> pass.
>>
>> If we do it before the RA pass, IRA and LRA register allocation
>> cannot assign registers with a difference of 1; the difference can be
>> anything positive, in any order.
>
> This sounds unexpected. IMHO if you adopt OOmode for the paired load,
> RA should be able to allocate two sequential vsx registers with the
> first one even, since OOmode is only ok for even vsx registers and its
> size makes it take two consecutive vsx registers.
>
> Hi Peter, is my understanding correct?
>

I tried all the combinations in the past; RA does not allocate
sequential registers. I don't see any code in RA that generates
sequential registers. We need to add explicit code in RA to generate
sequential registers. This is what I have done in the patch I submitted.

>>
>> IRA allocated one in vs32 and the other can be in vs45.
>>
>> In the vecload pass we remove one of the 2 lxv instructions: the 2nd
>> lxv, with offset 16, is removed, and the uses of the 2nd lxv's
>> register will not have defs, so the IRA pass cannot allocate them in
>> order with a difference of 1.
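To summarize the lxvp constraint discussed above as a concrete check: the two fused lxv loads must share a base register, their offsets must differ by exactly 16 bytes (one vsx vector), and the destination registers must be X and X+1 with X even. Here is a minimal sketch of that check (illustrative only, not GCC code; the tuple layout is a hypothetical representation):

```python
VEC_BYTES = 16  # size of one vsx vector register in bytes

def can_pair_lxv(load_a, load_b):
    """Each load is (dest_reg, base_reg, offset); True if lxvp-fusable."""
    # Order the two loads so `lo` is the lower-offset (low-half) access.
    lo, hi = sorted((load_a, load_b), key=lambda l: l[2])
    lo_reg, lo_base, lo_off = lo
    hi_reg, hi_base, hi_off = hi
    return (lo_base == hi_base                # same base register
            and hi_off - lo_off == VEC_BYTES  # adjacent accesses
            and hi_reg == lo_reg + 1          # sequential destinations
            and lo_reg % 2 == 0)              # first register even

# lxv vs32, 0(r2) / lxv vs33, 16(r2) -> fusable as lxvp vs32, 0(r2)
print(can_pair_lxv((32, 2, 0), (33, 2, 16)))   # True
# lxv vs32, 0(r2) / lxv vs45, 16(r2) -> destinations not sequential
print(can_pair_lxv((32, 2, 0), (45, 2, 16)))   # False
```

This is why the vs32/vs45 and vs33/vs32 allocations above both force register propagation when pairing is attempted after RA.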
>
> With Peter's patch to allow subreg from OOmode, I'd expect that we
> replace all uses of the first lxv (from the low part address) with
> (subreg:VnX <the result of lxvp> <low offset>) and all uses of the
> second lxv (from the high address) with (subreg:VnX <the result of
> lxvp> <high offset>). Currently after vecload, with the associated
> vecload.C, we transform from:
>
> (insn 8 4 10 2 (set (reg:V16QI 124 [ *ptr_4(D) ])
>         (mem:V16QI (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
>      (nil))
> (insn 10 8 9 2 (set (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (mem:V16QI (plus:DI (reg/v/f:DI 122 [ ptr ])
>                 (const_int 16 [0x10])) [0 MEM[(__vector unsigned char *)ptr_4(D) + 16B]+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
>      (expr_list:REG_DEAD (reg/v/f:DI 122 [ ptr ])
>         (nil)))
> (insn 9 10 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 124 [ *ptr_4(D) ])
>             ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:V16QI 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>                 (nil)))))
>
> to:
>
> (insn 19 4 9 2 (set (reg:OO 124 [ *ptr_4(D) ])
>         (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
>      (nil))
> (insn 9 19 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:OO 124 [ *ptr_4(D) ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>
> After vecload, IMHO this code sequence doesn't look valid: no insn
> defines pseudo 125, and pseudo 124 has OOmode but is used to replace a
> V16QI use. Both are expected to be subregs of pseudo 124 (it would be
> better to use a new pseudo for the paired load).
>

I tried that in the past, and no sequential registers are generated by RA.

> Without Peter's patch, UNSPEC_MMA_EXTRACT can be used for extracting.
> I think it should only result in sub-optimal code with possible extra
> register moves, but not have any correctness issue.
>
> This patch can NOT be bootstrapped on x86_64-redhat-linux, I guess
> it's caused by the proposed RA changes. And it's NOT regress-tested
> on Power10 LE with some go failures (note that I configured with
> --enable-languages="c,c++,fortran,objc,obj-c++,go"), also on Power9
> BE and LE with a few failures.
>

Did you try to bootstrap on x86_64-redhat-linux? Did this patch fail to
bootstrap? If so, I will fix it.

> btw, I think Mike wants to test this pairing support on both adjacent
> loads and stores, so could you also extend the pairing to cover the
> stores? It can be guarded by a separate flag and disabled by default,
> considering the known issue with paired stores; then Mike can test
> whether the pairing satisfies what he is looking for.
>

Sure, I will do that.

Thanks & Regards
Ajit

> BR,
> Kewen
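For reference, the adjacent-load pairing scan this thread keeps coming back to can be modeled roughly as follows. This is a simplified sketch, not the actual vecload pass (the data structures are hypothetical): before RA the loads use pseudos, so pairing only requires matching base registers and offsets 16 bytes apart; the register allocator is then responsible for the even/odd sequential assignment.

```python
VEC_BYTES = 16  # one vsx vector; adjacent loads are 16 bytes apart

def pair_adjacent_loads(loads):
    """loads: list of (dest_pseudo, base_reg, offset).
    Returns (pairs, leftovers): pairs are candidates for a single lxvp;
    leftovers stay as plain lxv."""
    # Group loads by base register, sorted by offset within each group.
    by_base = {}
    for ld in sorted(loads, key=lambda l: (l[1], l[2])):
        by_base.setdefault(ld[1], []).append(ld)
    pairs, leftovers = [], []
    for base, lds in by_base.items():
        i = 0
        while i < len(lds):
            if i + 1 < len(lds) and lds[i + 1][2] - lds[i][2] == VEC_BYTES:
                pairs.append((lds[i], lds[i + 1]))  # fuse into one lxvp
                i += 2
            else:
                leftovers.append(lds[i])            # no adjacent partner
                i += 1
    return pairs, leftovers

# Pseudos 124/125 loading from base 122 at offsets 0 and 16, as in the
# RTL dump above, form one pair.
pairs, rest = pair_adjacent_loads([(124, 122, 0), (125, 122, 16)])
print(pairs)  # [((124, 122, 0), (125, 122, 16))]
print(rest)   # []
```

After such a scan, the remaining work discussed in the thread is replacing the uses of the two original pseudos with low/high subregs of the paired load's pseudo, and teaching RA to give that pseudo an even-first consecutive register pair.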