On Mon, 2 Aug 2021, Kewen.Lin wrote: > on 2021/8/2 下午3:09, Richard Biener wrote: > > On Mon, 2 Aug 2021, Kewen.Lin wrote: > > > >> on 2021/7/30 下午10:04, Kewen.Lin via Gcc-patches wrote: > >>> Hi Richi, > >>> > >>> on 2021/7/30 下午7:34, Richard Biener wrote: > >>>> This adds a gather vectorization capability to the vectorizer > >>>> without target support by decomposing the offset vector, doing > >>>> sclar loads and then building a vector from the result. This > >>>> is aimed mainly at cases where vectorizing the rest of the loop > >>>> offsets the cost of vectorizing the gather. > >>>> > >>>> Note it's difficult to avoid vectorizing the offset load, but in > >>>> some cases later passes can turn the vector load + extract into > >>>> scalar loads, see the followup patch. > >>>> > >>>> On SPEC CPU 2017 510.parest_r this improves runtime from 250s > >>>> to 219s on a Zen2 CPU which has its native gather instructions > >>>> disabled (using those the runtime instead increases to 254s) > >>>> using -Ofast -march=znver2 [-flto]. It turns out the critical > >>>> loops in this benchmark all perform gather operations. > >>>> > >>> > >>> Wow, it sounds promising! > >>> > >>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. > >>>> > >>>> Any comments? I still plan to run this over full SPEC and > >>>> I have to apply TLC to the followup patch before I can post it. > >>>> > >>>> I think neither power nor z has gather so I'm curious if the > >>>> patch helps 510.parest_r there, I'm unsure about neon/advsimd. > >>> > >>> Yes, Power (latest Power10) doesn't support gather load. > >>> I just measured 510.parest_r with this patch on Power9 at option > >>> -Ofast -mcpu=power9 {,-funroll-loops}, both are neutral. > >>> > >>> It fails to vectorize the loop in vect-gather-1.c: > >>> > >>> vect-gather.c:12:28: missed: failed: evolution of base is not affine. > >>> vect-gather.c:11:46: missed: not vectorized: data ref analysis failed > >>> _6 = *_5; > >>> vect-gather.c:12:28: missed: not vectorized: data ref analysis failed: > >>> _6 = *_5; > >>> vect-gather.c:11:46: missed: bad data references. > >>> vect-gather.c:11:46: missed: couldn't vectorize loop > >>> > >> > >> By further investigation, it's due to that rs6000 fails to make > >> maybe_gather true in: > >> > >> bool maybe_gather > >> = DR_IS_READ (dr) > >> && !TREE_THIS_VOLATILE (DR_REF (dr)) > >> && (targetm.vectorize.builtin_gather != NULL > >> || supports_vec_gather_load_p ()); > >> > >> With the hacking defining TARGET_VECTORIZE_BUILTIN_GATHER (as > >> well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the > >> case gets vectorized as expected. > > > > Ah, yeah - this check needs to be elided. Checking with a cross > > that's indeed what is missing. > > > >> But re-evaluated 510.parest_r with this extra hacking, the > >> runtime performance doesn't have any changes. > > > > OK, it likely needs the followup as well then. For the added > > testcase I end up with > > > > .L4: > > lwz %r10,8(%r9) > > lwz %r31,12(%r9) > > addi %r8,%r8,32 > > addi %r9,%r9,16 > > lwz %r7,-16(%r9) > > lwz %r0,-12(%r9) > > lxv %vs10,-16(%r8) > > lxv %vs12,-32(%r8) > > extswsli %r10,%r10,3 > > extswsli %r31,%r31,3 > > extswsli %r7,%r7,3 > > extswsli %r0,%r0,3 > > ldx %r10,%r6,%r10 > > ldx %r31,%r6,%r31 > > ldx %r7,%r6,%r7 > > ldx %r12,%r6,%r0 > > mtvsrdd %vs0,%r31,%r10 > > mtvsrdd %vs11,%r12,%r7 > > xvmuldp %vs0,%vs0,%vs10 > > xvmaddadp %vs0,%vs12,%vs11 > > xvadddp %vs32,%vs32,%vs0 > > bdnz .L4 > > > > as the inner vectorized loop. Looks like power doesn't have > > extending SI->DImode loads? Otherwise the code looks > > You meant one instruction for both left shifting and loads?
No, I meant doing lwz %r10,8(%r9) extswsli %r10,%r10,3 in one step. I think extswsli should be a SImode->DImode sign-extend. Ah - it does the multiplication by 8 and the sign-extend. OK, that's nice as well. On x86 the multiplication by 8 is done via complex addressing mode on the dependent load and the scalar load does the sign-extension part. > It doesn't if so. btw, the above code sequence looks nice! > > straight-forward (I've used -mcpu=power10), but eventually > > the whole gather, which I think boils down to > > > > lwz %r10,8(%r9) > > lwz %r31,12(%r9) > > addi %r9,%r9,16 > > lwz %r7,-16(%r9) > > lwz %r0,-12(%r9) > > extswsli %r10,%r10,3 > > extswsli %r31,%r31,3 > > extswsli %r7,%r7,3 > > extswsli %r0,%r0,3 > > ldx %r10,%r6,%r10 > > ldx %r31,%r6,%r31 > > ldx %r7,%r6,%r7 > > ldx %r12,%r6,%r0 > > mtvsrdd %vs0,%r31,%r10 > > mtvsrdd %vs11,%r12,%r7 > > > > is too difficult for the integer/load side of the pipeline. It's > > of course the same in the scalar loop but there the FP ops only > > need to wait for one lane to complete. Note the above is with > > the followup patch. > > > > Good point! Will check the actual effect after this and its follow-up > are landed, may need some cost tweaking on it. Yes - I hope the vectorizer costing part is conservative enough. Richard.