Hi,
On 2020/9/1 21:07, Richard Biener wrote:
> On Tue, Sep 1, 2020 at 10:11 AM luoxhu via Gcc-patches
> <[email protected]> wrote:
>>
>> Hi,
>>
>> On 2020/9/1 01:04, Segher Boessenkool wrote:
>>> Hi!
>>>
>>> On Mon, Aug 31, 2020 at 04:06:47AM -0500, Xiong Hu Luo wrote:
>>>> vec_insert accepts 3 arguments, arg0 is input vector, arg1 is the value
>>>> to be insert, arg2 is the place to insert arg1 to arg0. This patch adds
>>>> __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for vec_insert to
>>>> not expand too early in gimple stage if arg2 is variable, to avoid generate
>>>> store hit load instructions.
>>>>
>>>> For Power9 V4SI:
>>>> addi 9,1,-16
>>>> rldic 6,6,2,60
>>>> stxv 34,-16(1)
>>>> stwx 5,9,6
>>>> lxv 34,-16(1)
>>>> =>
>>>> addis 9,2,.LC0@toc@ha
>>>> addi 9,9,.LC0@toc@l
>>>> mtvsrwz 33,5
>>>> lxv 32,0(9)
>>>> sradi 9,6,2
>>>> addze 9,9
>>>> sldi 9,9,2
>>>> subf 9,9,6
>>>> subfic 9,9,3
>>>> sldi 9,9,2
>>>> subfic 9,9,20
>>>> lvsl 13,0,9
>>>> xxperm 33,33,45
>>>> xxperm 32,32,45
>>>> xxsel 34,34,33,32
>>>
>>> For v a V4SI, x a SI, j some int, what do we generate for
>>> v[j&3] = x;
>>> ?
>>> This should be exactly the same as we generate for
>>> vec_insert(x, v, j);
>>> (the builtin does a modulo 4 automatically).
>>
>> No, even with my patch "stxv 34,-16(1);stwx 5,9,6;lxv 34,-16(1)" generated
>> currently.
>> Is it feasible and acceptable to expand some kind of pattern in expander
>> directly without
>> builtin transition?
>>
>> I borrowed some of implementation from vec_extract. For vec_extract, the
>> issue also exists:
>>
>> source: gimple:
>> expand: asm:
>> 1) i = vec_extract (v, n); => i = __builtin_vec_ext_v4si (v, n); =>
>> {r120:SI=unspec[r118:V4SI,r119:DI] 134;...} => slwi 9,6,2 vextuwrx
>> 3,9,2
>> 2) i = vec_extract (v, 3); => i = __builtin_vec_ext_v4si (v, 3); =>
>> {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12 vextuwrx
>> 3,9,2
>> 3) i = v[n%4]; => _1 = n & 3; i = VIEW_CONVERT_EXPR<int[4]>(v)[_1]; =>
>> ... => stxv 34,-16(1);addi 9,1,-16; rldic 5,5,2,60; lwax
>> 3,9,5
>> 4) i = v[3]; => i = BIT_FIELD_REF <v, 32, 96>; =>
>> {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12; vextuwrx 3,9,2
>
> Why are 1) and 2) handled differently than 3)/4)?
1) and 2) are calling builtin function vec_extract, it is defined to
__builtin_vec_extract and will be resolved to ALTIVEC_BUILTIN_VEC_EXTRACT
by resolve_overloaded_builtin, to generate a call __builtin_vec_ext_v4si
to be expanded only in RTL.
3) is access variable v as array type with opcode VIEW_CONVERT_EXPR, I
guess we should also generate builtin call instead of calling
convert_vector_to_array_for_subscript to generate VIEW_CONVERT_EXPR
expression for such kind of usage.
4) is translated to BIT_FIELD_REF with constant bitstart and bitsize,
variable v could also be accessed by register instead of stack, so optabs
could match the rs6000_expand_vector_insert to generate expected instruction
through extract_bit_field.
>
>> Case 3) also couldn't handle the similar usage, and case 4) doesn't generate
>> builtin as expected,
>> it just expand to vec_select by coincidence. So does this mean both
>> vec_insert and vec_extract
>> and all other similar vector builtins should use IFN as suggested by Richard
>> Biener, to match the
>> pattern in gimple and expand both constant and variable index in expander?
>> Will this also be
>> beneficial for other targets except power? Or we should do that gradually
>> after this patch
>> approved as it seems another independent issue? Thanks:)
>
> If the code generated for 3)/4) isn't optimal you have to figure why
> by tracing the RTL
> expansion code and looking for missing optabs.
>
> Consider the amount of backend code you need to write if ever using
> those in constexpr
> context ...
It seems too complicated to expand the "i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];"
or "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i_6(D);" to
rs6000_expand_vector_insert/rs6000_expand_vector_extract in RTL, as:
1) Vector v is stored to stack with array type; need extra load and store
operation.
2) Requires amount of code to decompose VIEW_CONVERT_EXPR to extract the vector
and
index then call rs6000_expand_vector_insert/rs6000_expand_vector_extract.
which means replace following instructions #9~#12 to new instruction sequences:
1: NOTE_INSN_DELETED
6: NOTE_INSN_BASIC_BLOCK 2
2: r119:V4SI=%2:V4SI
3: r120:DI=%5:DI
4: r121:DI=%6:DI
5: NOTE_INSN_FUNCTION_BEG
8: [r112:DI]=r119:V4SI
9: r122:DI=r121:DI&0x3
10: r123:DI=r122:DI<<0x2
11: r124:DI=r112:DI+r123:DI
12: [r124:DI]=r120:DI#0
13: r118:V4SI=[r112:DI]
17: %2:V4SI=r118:V4SI
18: use %2:V4SI
=>
1: NOTE_INSN_DELETED
6: NOTE_INSN_BASIC_BLOCK 2
2: r119:V4SI=%2:V4SI
3: r120:DI=%5:DI
4: r121:DI=%6:DI
5: NOTE_INSN_FUNCTION_BEG
8: [r112:DI]=r119:V4SI
r130:V4SI=[r112:DI]
rs6000_expand_vector_insert (r130, r121:DI&0x3, r120:DI#0)
[r112:DI]=r130:V4SI
13: r118:V4SI=[r112:DI]
17: %2:V4SI=r118:V4SI
18: use %2:V4SI
so maybe bypass convert_vector_to_array_for_subscript for special circumstance
like "i = v[n%4]" or "v[n&3]=i" to generate vec_extract or vec_insert builtin
call a relative simpler method?
Xionghu