on 2021/8/2 5:11 PM, Richard Biener wrote:
> On Mon, 2 Aug 2021, Kewen.Lin wrote:
>
>> on 2021/8/2 3:09 PM, Richard Biener wrote:
>>> On Mon, 2 Aug 2021, Kewen.Lin wrote:
>>>
>>>> on 2021/7/30 10:04 PM, Kewen.Lin via Gcc-patches wrote:
>>>>> Hi Richi,
>>>>>
>>>>> on 2021/7/30 7:34 PM, Richard Biener wrote:
>>>>>> This adds a gather vectorization capability to the vectorizer
>>>>>> without target support by decomposing the offset vector, doing
>>>>>> scalar loads and then building a vector from the result.  This
>>>>>> is aimed mainly at cases where vectorizing the rest of the loop
>>>>>> offsets the cost of vectorizing the gather.
>>>>>>
>>>>>> Note it's difficult to avoid vectorizing the offset load, but in
>>>>>> some cases later passes can turn the vector load + extract into
>>>>>> scalar loads, see the followup patch.
>>>>>>
>>>>>> On SPEC CPU 2017 510.parest_r this improves runtime from 250s
>>>>>> to 219s on a Zen2 CPU which has its native gather instructions
>>>>>> disabled (using those the runtime instead increases to 254s)
>>>>>> using -Ofast -march=znver2 [-flto].  It turns out the critical
>>>>>> loops in this benchmark all perform gather operations.
>>>>>>
>>>>>
>>>>> Wow, it sounds promising!
>>>>>
>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>>>>>>
>>>>>> Any comments?  I still plan to run this over full SPEC and
>>>>>> I have to apply TLC to the followup patch before I can post it.
>>>>>>
>>>>>> I think neither power nor z has gather, so I'm curious whether the
>>>>>> patch helps 510.parest_r there; I'm unsure about neon/advsimd.
>>>>>
>>>>> Yes, Power (even the latest Power10) doesn't support gather loads.
>>>>> I just measured 510.parest_r with this patch on Power9 with
>>>>> -Ofast -mcpu=power9 {,-funroll-loops}; both are neutral.
>>>>>
>>>>> It fails to vectorize the loop in vect-gather-1.c:
>>>>>
>>>>> vect-gather.c:12:28: missed: failed: evolution of base is not affine.
>>>>> vect-gather.c:11:46: missed: not vectorized: data ref analysis failed
>>>>>   _6 = *_5;
>>>>> vect-gather.c:12:28: missed: not vectorized: data ref analysis failed:
>>>>>   _6 = *_5;
>>>>> vect-gather.c:11:46: missed: bad data references.
>>>>> vect-gather.c:11:46: missed: couldn't vectorize loop
>>>>>
>>>>
>>>> Further investigation shows it's because rs6000 fails to make
>>>> maybe_gather true in:
>>>>
>>>>   bool maybe_gather
>>>>     = DR_IS_READ (dr)
>>>>       && !TREE_THIS_VOLATILE (DR_REF (dr))
>>>>       && (targetm.vectorize.builtin_gather != NULL
>>>>           || supports_vec_gather_load_p ());
>>>>
>>>> With a hack defining TARGET_VECTORIZE_BUILTIN_GATHER (as
>>>> well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the
>>>> case gets vectorized as expected.
>>>
>>> Ah, yeah - this check needs to be elided.  Checking with a cross,
>>> that's indeed what is missing.
>>>
>>>> But re-evaluating 510.parest_r with this extra hack, the
>>>> runtime performance doesn't change.
>>>
>>> OK, it likely needs the followup as well then.
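
To make sure I read the patch right: for a parest-style indexed-load
reduction like sum += val[i] * x[colnums[i]], my understanding is that
the emulated gather boils down to scalar loads from the extracted
offsets plus a vector constructor feeding the usual vector FMA and
reduction.  Roughly like the hand-written sketch below (my own
illustration using GCC's generic vector extension, assuming n is a
multiple of two; it's not the added testcase nor actual vectorizer
output):

  /* Hypothetical parest-style loop with the gather emulated by hand;
     names are invented for illustration.  */
  typedef double v2df __attribute__ ((vector_size (16)));

  double
  dot_indexed_emulated (const double *val, const unsigned int *colnums,
                        const double *x, int n)
  {
    v2df vsum = { 0.0, 0.0 };
    for (int i = 0; i < n; i += 2)
      {
        unsigned int off0 = colnums[i], off1 = colnums[i + 1];
        v2df vx = { x[off0], x[off1] };    /* two scalar loads + vector CTOR */
        v2df vv = { val[i], val[i + 1] };  /* the normal vector load part    */
        vsum += vv * vx;
      }
    return vsum[0] + vsum[1];
  }

That seems to match the code below, where each mtvsrdd combines a pair
of ldx results into one vector register.
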
>>> For the added testcase I end up with
>>>
>>> .L4:
>>>         lwz %r10,8(%r9)
>>>         lwz %r31,12(%r9)
>>>         addi %r8,%r8,32
>>>         addi %r9,%r9,16
>>>         lwz %r7,-16(%r9)
>>>         lwz %r0,-12(%r9)
>>>         lxv %vs10,-16(%r8)
>>>         lxv %vs12,-32(%r8)
>>>         extswsli %r10,%r10,3
>>>         extswsli %r31,%r31,3
>>>         extswsli %r7,%r7,3
>>>         extswsli %r0,%r0,3
>>>         ldx %r10,%r6,%r10
>>>         ldx %r31,%r6,%r31
>>>         ldx %r7,%r6,%r7
>>>         ldx %r12,%r6,%r0
>>>         mtvsrdd %vs0,%r31,%r10
>>>         mtvsrdd %vs11,%r12,%r7
>>>         xvmuldp %vs0,%vs0,%vs10
>>>         xvmaddadp %vs0,%vs12,%vs11
>>>         xvadddp %vs32,%vs32,%vs0
>>>         bdnz .L4
>>>
>>> as the inner vectorized loop.  Looks like power doesn't have
>>> extending SI->DImode loads?  Otherwise the code looks
>>
>> You meant one instruction for both left shifting and loads?
>
> No, I meant doing
>
>   lwz %r10,8(%r9)
>   extswsli %r10,%r10,3
>
> in one step.  I think extswsli should be a SImode->DImode sign-extend.
> Ah - it does the multiplication by 8 and the sign-extend.  OK, that's
> nice as well.
Ah, OK.  Yeah, extswsli is short for "Extend Sign Word and Shift Left
Immediate", which was introduced with Power9.  It does do both the
sign-extend and the multiplication by 8 as you noted.  :)

> On x86 the multiplication by 8 is done via complex
> addressing mode on the dependent load and the scalar load does
> the sign-extension part.
>

OK, thanks for the clarification!

BR,
Kewen

>> It doesn't if so.  btw, the above code sequence looks nice!
>>> straight-forward (I've used -mcpu=power10), but eventually
>>> the whole gather, which I think boils down to
>>>
>>>         lwz %r10,8(%r9)
>>>         lwz %r31,12(%r9)
>>>         addi %r9,%r9,16
>>>         lwz %r7,-16(%r9)
>>>         lwz %r0,-12(%r9)
>>>         extswsli %r10,%r10,3
>>>         extswsli %r31,%r31,3
>>>         extswsli %r7,%r7,3
>>>         extswsli %r0,%r0,3
>>>         ldx %r10,%r6,%r10
>>>         ldx %r31,%r6,%r31
>>>         ldx %r7,%r6,%r7
>>>         ldx %r12,%r6,%r0
>>>         mtvsrdd %vs0,%r31,%r10
>>>         mtvsrdd %vs11,%r12,%r7
>>>
>>> is too difficult for the integer/load side of the pipeline.  It's
>>> of course the same in the scalar loop but there the FP ops only
>>> need to wait for one lane to complete.  Note the above is with
>>> the followup patch.
>>>
>>
>> Good point!  Will check the actual effect after this and its follow-up
>> are landed; may need some cost tweaking on it.
>
> Yes - I hope the vectorizer costing part is conservative enough.
>
> Richard.
>
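
P.S. For anyone not familiar with the instruction: extswsli folds the
sign-extension and the shift into one step, so the lwz/extswsli pair
computes the byte offset for the following ldx.  A rough C equivalent
of its semantics (my own sketch, ignoring C's caveats about shifting
negative values):

  /* extswsli RA,RS,SH: sign-extend the low 32 bits of RS to 64 bits,
     then shift left by the immediate SH.  With SH = 3 that multiplies
     a word index by 8, i.e. turns it into a byte offset.  */
  static inline long long
  extswsli (long long rs, unsigned int sh)
  {
    return ((long long) (int) rs) << sh;
  }
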