Richard Biener <[email protected]> writes:
> On Wed, 12 Apr 2023, [email protected] wrote:
>
>>
>> >> Thanks for the detailed explanation. Just to clarify - with RVV
>> >> there's only a single mask register, v0.t, or did you want to
>> >> say an instruction can only specify a single mask register?
>>
>> RVV has 32 vector registers (v0-v31) in total.
>> We can store a vector data value or a mask value in any of them.
>> We also have mask-logic instructions, for example mask-and, that can operate
>> on any vector registers.
>>
>> However, any vector operation, for example vadd.vv, can only be predicated
>> by v0 (v0.t in asm), which is the first vector register.
>> We can't predicate vadd.vv with v1 - v31.
>>
>> So, you can imagine that every time we want to use a mask to predicate a
>> vector operation, we must first store the mask value
>> into v0.
>>
>> So, we can write intrinsic sequence like this:
>>
>> vmseq v0,v8,v9 (store mask value to v0)
>> vmslt v1,v10,v11 (store mask value to v1)
>> vmand v0,v0,v1
>> vadd.vv ...v0.t (the predicate mask must always be v0).
>
> Ah, I see - that explains it well.
>
>> >> ARM SVE would have a loop control mask and a separate mask
>> >> for the if (cond[i]) which would be combined with a mask-and
>> >> instruction to a third mask which is then used on the
>> >> predicated instructions.
>>
>> Yeah, I know it. The ARM SVE way is more elegant than what RVV does.
>> However, for RVV, we can't follow this flow.
>> We don't have a "whilelo" instruction to generate a loop control mask.
>
> Yep. Similar for AVX512 where I have to use a vector compare. I'm
> currently using
>
> { 0, 1, 2 ... } < { remaining_len, remaining_len, ... }
>
> and careful updating of remaining_len (we know it will either
> be adjusted by the full constant vector length or updated to zero).
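
For readers following along, the compare-based loop mask described above can
be modelled in scalar C. This is only a sketch under stated assumptions:
NLANES, loop_mask and next_remaining are hypothetical names, the mask is
shown as one bit per lane, and nothing here is taken from the actual AVX512
patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed number of lanes per vector (an assumption).  */
#define NLANES 8

/* Model of { 0, 1, 2, ... } < { remaining_len, remaining_len, ... }:
   bit LANE of the result is set when lane LANE is still active.  */
uint32_t loop_mask(size_t remaining_len)
{
    uint32_t mask = 0;
    for (unsigned lane = 0; lane < NLANES; lane++)
        if (lane < remaining_len)
            mask |= (uint32_t)1 << lane;
    return mask;
}

/* remaining_len is either reduced by the full vector length
   or drops to zero on the final iteration.  */
size_t next_remaining(size_t remaining_len)
{
    return remaining_len > NLANES ? remaining_len - NLANES : 0;
}
```

With 11 elements left, the first iteration sees all eight lanes active and
the second sees only the low three.
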
>
>> We can only do loop control with the length generated by vsetvl.
>> And we can only use "v0" as the mask to predicate vadd.vv, and the mask
>> value can only be generated by comparison or mask-logical instructions.
>>
>> >> PowerPC and s390x might be able to use WHILE_LEN as well (though
>> >> they only have LEN variants of loads and stores) - of course
>> >> only "simulating it". For the fixed-vector-length ISAs the
>> >> predicated vector loop IMHO makes most sense for the epilogue to
>> >> handle low-trip loops better.
>>
>> Yeah, I wonder how they do the flow control (if (cond[i])).
>> For RVV, you can imagine I will need to add the patterns
>> LEN_MASK_LOAD/LEN_MASK_STORE (length generated by WHILE_LEN and mask
>> generated by comparison).
>>
>> I think we can CC the IBM folks to see whether we can make WHILE_LEN work
>> for both IBM and RVV?
>
> I've CCed them. Adding WHILE_LEN support to rs6000/s390x would be
> mainly the "easy" way to get len-masked (epilog) loop support.

I think that already works for them (could be misremembering).
However, IIUC, they have no special instruction to calculate the
length (unlike for RVV), and so it's open-coded using vect_get_len.

I suppose my two questions are:

(1) How easy would it be to express WHILE_LEN in normal gimple?
    I haven't thought about this at all, so the answer might be
    "very hard".  But it reminds me a little of UQDEC on AArch64,
    which we open-code using MAX_EXPR and MINUS_EXPR (see
    vect_set_loop_controls_directly).

    I'm not saying WHILE_LEN is the same operation, just that it seems
    like it might be open-codeable in a similar way.

    Even if we can open-code it, we'd still need some way for the
    target to select the "RVV way" from the "s390/PowerPC way".

(2) What effect does using a variable IV step (the result of
    the WHILE_LEN) have on ivopts?  I remember experimenting with
    something similar once (can't remember the context) and not
    having a constant step prevented ivopts from making good
    addressing-mode choices.
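
To make both questions concrete, here is a rough scalar model in C. It is
only a sketch under stated assumptions: it treats WHILE_LEN as
MIN (remaining, VF), which is the simplest open-coding and not necessarily
what vsetvl would return on RVV, and VF, min_sz, sat_dec and add_arrays are
hypothetical names rather than anything in the vectorizer.

```c
#include <stddef.h>

/* Hypothetical vectorization factor (an assumption for this sketch).  */
#define VF 4

/* Assumed open-coding of WHILE_LEN: the number of elements to
   process in this iteration.  */
static size_t min_sz(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* UQDEC-style saturating decrement, written the MAX_EXPR/MINUS_EXPR
   way: MAX (x, step) - step never wraps below zero.  */
static size_t sat_dec(size_t x, size_t step)
{
    return (x > step ? x : step) - step;
}

/* Length-controlled loop: dst[i] = a[i] + b[i] for i < n.
   Returns the number of "vector" iterations executed.  */
size_t add_arrays(int *dst, const int *a, const int *b, size_t n)
{
    size_t iters = 0;
    size_t remaining = n;
    size_t i = 0;

    while (remaining > 0) {
        size_t len = min_sz(remaining, VF);   /* WHILE_LEN stand-in */

        /* Stands in for a len-controlled vector load/add/store.  */
        for (size_t k = 0; k < len; k++)
            dst[i + k] = a[i + k] + b[i + k];

        i += len;                             /* variable IV step, as in (2) */
        remaining = sat_dec(remaining, VF);
        iters++;
    }
    return iters;
}
```

In this model the step is VF on every iteration except possibly the last,
so the address IV could equally advance by the constant VF.  That stops
being true if the length operation is allowed to return something other
than MIN (remaining, VF), which is where the ivopts concern in (2) comes in.
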
Thanks,
Richard