On Wed, 12 Apr 2023, ??? wrote:
> >> It's not so much that we need to do that. But normally it's only worth
> >> adding internal functions if they do something that is too complicated
> >> to express in simple gimple arithmetic. The UQDEC case I mentioned:
>
> >> z = MAX (x, y) - y
>
> >> fell into the "simple arithmetic" category for me. We could have added
> >> an ifn for unsigned saturating decrement, but it didn't seem complicated
> >> enough to merit its own ifn.
>
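As a side illustration of the open-coded form quoted above
(z = MAX (x, y) - y), here is a minimal C sketch of an unsigned
saturating decrement. The function name uqdec_open_coded is made up for
this example and is not a GCC or AArch64 API. It only shows that the
MAX/MINUS pair clamps at zero instead of wrapping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (name invented for this sketch): unsigned
   saturating decrement, open-coded as MAX (x, y) - y.
   Returns x - y when x >= y, and 0 otherwise.  */
uint32_t
uqdec_open_coded (uint32_t x, uint32_t y)
{
  uint32_t max = x > y ? x : y;   /* corresponds to MAX_EXPR <x, y> */
  return max - y;                 /* MINUS_EXPR; cannot wrap since max >= y */
}
```

For example, uqdec_open_coded (10, 3) gives 7, while
uqdec_open_coded (3, 10) gives 0 rather than a wrapped value.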
> Ah, I see your concern. I should admit that WHILE_LEN is a simple
> arithmetic operation
> that just takes the result of
>
> min (remain,vf).
>
> One possible solution is to just use MIN_EXPR (remain, vf),
> then add special handling in the umin_optab pattern to recognize "vf" in the
> backend,
> and finally generate vsetvl in the RISC-V backend.
>
> The "vf" could be recognized because that operand of umin would be a
> const_int/const_poly_int operand.
> Otherwise, just generate the scalar umin instruction.
>
> However, there is a case where I can't tell whether umin should generate
> vsetvl or umin. It is the following case:
> int32_t foo (int32_t a)
> {
> return min (a, 4);
> }
>
> In this case I should generate:
> li a1,4
> umin a1,a0,a1
>
> instead of generating vsetvl
>
> However, in this case:
>
> void foo (int32_t *a...)
> for (int i = 0; i < n; i++)
> a[i] = b[i] + c[i];
>
> with -mriscv-vector-bits=128 (which means each vector can hold 4 INT32
> elements), the VF will be 4 too. If we also use MIN_EXPR instead of WHILE_LEN:
>
> ...
> len = MIN_EXPR (n,4)
> v = len_load (len)
> ....
> ...
>
> In this case, MIN_EXPR should emit vsetvl.
>
> It's hard for me to tell the difference between these 2 cases...
But the issue is the same in the reverse with WHILE_LEN, no?
WHILE_LEN just computes a scalar value - you seem to suggest
there's a hidden side-effect of "coalescing" the result with
a hardware vector length register? I don't think that's good design.
IMHO tying the scalar result to its uses has to be done where
you emit the other vsetvl instructions.
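
To illustrate the point that the length is just a scalar, here is a
minimal C sketch that models a length-controlled loop with a
decrementing IV, where the per-iteration length is an ordinary
MIN (remain, vf). The names select_len and add_with_len are
hypothetical. The inner loop stands in for one len_load/add/len_store
group, and nothing here models vsetvl itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the value WHILE_LEN would produce:
   a plain scalar MIN (remain, vf) with no hidden side effects.  */
size_t
select_len (size_t remain, size_t vf)
{
  return remain < vf ? remain : vf;
}

/* Scalar model of a length-controlled vector loop with a decrementing
   IV.  Each outer iteration handles LEN elements, as one length-based
   load/add/store group would.  Returns the number of outer iterations.  */
size_t
add_with_len (int32_t *a, const int32_t *b, const int32_t *c,
              size_t n, size_t vf)
{
  assert (vf > 0);   /* a zero VF would never make progress */
  size_t iters = 0;
  for (size_t remain = n; remain > 0; )
    {
      size_t len = select_len (remain, vf);
      for (size_t i = 0; i < len; i++)   /* models one vector operation */
        a[i] = b[i] + c[i];
      a += len;                          /* variable IV step */
      b += len;
      c += len;
      remain -= len;                     /* counts down to 0 */
      iters++;
    }
  return iters;
}
```

With n = 10 and vf = 4 the lengths are 4, 4 and 2, so the loop runs
three times. Whether a given MIN feeds a length-based access is only
visible from its uses, which is where the association with vsetvl would
have to be made.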
One convenient thing we have with WHILE_LEN is that it is a key
for the vectorizer to query target capabilities (and preferences).
But of course collecting whether stmts can be vectorized
with length and/or with mask would be better.
Richard.
> CC RISC-V port backend maintainer: Kito.
>
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Sandiford
> Date: 2023-04-12 20:24
> To: juzhe.zhong@rivai.ai
> CC: rguenther; gcc-patches; jeffreyalaw; rdapp; linkw
> Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for
> auto-vectorization
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> >>> I think that already works for them (could be misremembering).
> >>> However, IIUC, they have no special instruction to calculate the
> >>> length (unlike for RVV), and so it's open-coded using vect_get_len.
> >
> > Yeah, the current flow using min, sub, and then min in vect_get_len
> > is working for IBM. But I wonder whether switching the current
> > length-loop-control flow to the WHILE_LEN pattern in this patch could
> > improve their performance.
> >
> >>> (1) How easy would it be to express WHILE_LEN in normal gimple?
> >>> I haven't thought about this at all, so the answer might be
> >>> "very hard". But it reminds me a little of UQDEC on AArch64,
> >>> which we open-code using MAX_EXPR and MINUS_EXPR (see
> >>> vect_set_loop_controls_directly).
> >
> >>> I'm not saying WHILE_LEN is the same operation, just that it seems
> >>> like it might be open-codeable in a similar way.
> >
> >>> Even if we can open-code it, we'd still need some way for the
> >>> target to select the "RVV way" from the "s390/PowerPC way".
> >
> > WHILE_LEN, as I define it in the doc, is
> >   operand0 = MIN (operand1, operand2)
> > operand1 is the residual number of scalar elements that need to be updated.
> > operand2 is the vectorization factor (vf) for a single rgroup; for
> > multiple rgroups, operand2 = vf * nitems_per_ctrl.
> > Do you mean that such a pattern is not well expressed, so we need to
> > replace it with normal tree code (MIN or MAX) and let the RISC-V backend
> > optimize it into vsetvl? Sorry, maybe I am not on the same page.
>
> It's not so much that we need to do that. But normally it's only worth
> adding internal functions if they do something that is too complicated
> to express in simple gimple arithmetic. The UQDEC case I mentioned:
>
> z = MAX (x, y) - y
>
> fell into the "simple arithmetic" category for me. We could have added
> an ifn for unsigned saturating decrement, but it didn't seem complicated
> enough to merit its own ifn.
>
> >>> (2) What effect does using a variable IV step (the result of
> >>> the WHILE_LEN) have on ivopts? I remember experimenting with
> >>> something similar once (can't remember the context) and not
> >>> having a constant step prevented ivopts from making good
> >>> addressing-mode choices.
> >
> > Thank you so much for pointing this out. Currently, a variable IV step and
> > decreasing n down to 0
> > work fine for RISC-V downstream GCC, and we didn't find issues related to
> > addressing-mode selection.
>
> OK, that's good. Sounds like it isn't a problem then.
>
> > I think I must have missed something. Would you mind giving me some hints
> > so that I can study ivopts
> > to find out which cases may generate inferior codegen for a variable IV step?
>
> I think AArch64 was sensitive to this because (a) the vectoriser creates
> separate IVs for each base address and (b) for SVE, we instead want
> invariant base addresses that are indexed by the loop control IV.
> Like Richard says, if the loop control IV isn't a SCEV, ivopts isn't
> able to use it and so (b) fails.
>
> Thanks,
> Richard
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)