on 2021/7/14 2:38 PM, Richard Biener wrote:
> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> on 2021/7/13 8:42 PM, Richard Biener wrote:
>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>>>>
>>>> Hi Richi,
>>>>
>>>> Thanks for the comments!
>>>>
>>>> on 2021/7/13 5:35 PM, Richard Biener wrote:
>>>>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> When I added support for the multiply highpart instructions newly
>>>>>> introduced on Power10, I noticed that currently the vectorizer
>>>>>> doesn't try to vectorize the multiply highpart pattern; I hope
>>>>>> this isn't intentional?
>>>>>>
>>>>>> This patch extends the existing mulhs pattern handling to cover
>>>>>> multiply highpart.  Another alternative would be to recognize the
>>>>>> mul_highpart operation in a generic place for scalar code when the
>>>>>> target supports the optab for the scalar operation, based on the
>>>>>> assumption that a target which supports the vector version of
>>>>>> multiply highpart should also have the scalar version.  I noticed
>>>>>> that the function can_mult_highpart_p can check/handle
>>>>>> mult_highpart well even without mul_highpart optab support, so I
>>>>>> think recognizing this pattern in the vectorizer is better.  Is it
>>>>>> on the right track?
>>>>>
>>>>> I think it's on the right track, using IFN_LAST is a bit awkward
>>>>> in case yet another case pops up, so maybe you can use
>>>>> a code_helper instance instead which unifies tree_code,
>>>>> builtin_code and internal_fn?
>>>>>
>>>>
>>>> If there is a new requirement which doesn't have/introduce IFN
>>>> stuff but does have an existing tree_code, can we add one more field
>>>> of type tree_code, and then for the IFN_LAST path check the
>>>> different requirements under the guard of that tree_code variable?
>>>>
>>>>> I also notice that can_mult_highpart_p will return true if
>>>>> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
>>>>> but then the result might be less optimal (or even not
>>>>> handled later)?
>>>>>
>>>>
>>>> I think it will always be handled?  The expander calls
>>>>
>>>>   rtx
>>>>   expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
>>>>                         rtx target, bool uns_p)
>>>>
>>>> which will further check with can_mult_highpart_p.
>>>>
>>>> For the below case,
>>>>
>>>> #define SHT_CNT 16
>>>>
>>>> __attribute__ ((noipa)) void
>>>> test ()
>>>> {
>>>>   for (int i = 0; i < N; i++)
>>>>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
>>>> }
>>>>
>>>> Without this patch, it uses widen_mult as below:
>>>>
>>>>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
>>>>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
>>>>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
>>>>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
>>>>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
>>>>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
>>>>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
>>>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
>>>>
>>>> .L2:
>>>>         lxvx 33,7,9
>>>>         lxvx 32,8,9
>>>>         vmulosh 13,0,1    // widen mult
>>>>         vmulesh 0,0,1
>>>>         xxmrglw 33,32,45  // merge
>>>>         xxmrghw 32,32,45
>>>>         vsraw 1,1,12      // shift
>>>>         vsraw 0,0,12
>>>>         vpkuwum 0,0,1     // pack
>>>>         stxvx 32,10,9
>>>>         addi 9,9,16
>>>>         bdnz .L2
>>>>
>>>> With this patch, it ends up with:
>>>>
>>>>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
>>>>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
>>>>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
>>>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;
>>>
>>> Yes, so I'm curious what it ends up with/without the patch on x86_64, which
>>> can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.
>>>
>>
>> For test case:
>>
>> ```
>> #define N 32
>> typedef signed int bigType;
>> typedef signed short smallType;
>> #define SH_CNT 16
>>
>> extern smallType small_a[N], small_b[N], small_c[N];
>>
>> __attribute__((noipa)) void test_si(int n) {
>>   for (int i = 0; i < n; i++)
>>     small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT;
>> }
>> ```
>>
>> on x86_64, with options: -mfpmath=sse -msse2 -O2 -ftree-vectorize
>>
>> 1) without this patch, the pattern isn't recognized; the IR looks like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_34 = niters.4_25 >> 3;
>>   _13 = (sizetype) bnd.5_34;
>>   _29 = _13 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)>
>>   vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1];
>>   vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1];
>>   vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>;
>>   vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>;
>>   vect__6.15_46 = vect_patt_18.14_44 >> 16;
>>   vect__6.15_47 = vect_patt_18.14_45 >> 16;
>>   vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48;
>>   ivtmp.34_5 = ivtmp.34_4 + 16;
>>   if (ivtmp.34_5 != _29)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>> ...
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_b(%rax), %xmm3
>>         movdqu  small_a(%rax), %xmm1
>>         addq    $16, %rax
>>         movdqu  small_a-16(%rax), %xmm2
>>         pmullw  %xmm3, %xmm1
>>         pmulhw  %xmm3, %xmm2
>>         movdqa  %xmm1, %xmm0
>>         punpcklwd %xmm2, %xmm0
>>         punpckhwd %xmm2, %xmm1
>>         psrad   $16, %xmm1
>>         psrad   $16, %xmm0
>>         movdqa  %xmm0, %xmm2
>>         punpcklwd %xmm1, %xmm0
>>         punpckhwd %xmm1, %xmm2
>>         movdqa  %xmm0, %xmm1
>>         punpckhwd %xmm2, %xmm1
>>         punpcklwd %xmm2, %xmm0
>>         punpcklwd %xmm1, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %edi, %eax
>>         andl    $-8, %eax
>>         testb   $7, %dil
>>         je      .L14
>>
>> *insn dist in loop*:
>>
>>       1 addq
>>       3 movdqa
>>       3 movdqu
>>       1 movups
>>       1 pmulhw
>>       1 pmullw
>>       2 psrad
>>       3 punpckhwd
>>       4 punpcklwd
>>
>> 2) with this patch but making the mul_highpart optab query return false,
>> the IR looks like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_37 = niters.4_22 >> 3;
>>   _13 = (sizetype) bnd.5_37;
>>   _32 = _13 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)>
>>   vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1];
>>   vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1];
>>   vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47;
>>   ivtmp.33_5 = ivtmp.33_4 + 16;
>>   if (ivtmp.33_5 != _32)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_b(%rax), %xmm3
>>         movdqu  small_a(%rax), %xmm1
>>         addq    $16, %rax
>>         movdqu  small_a-16(%rax), %xmm2
>>         pmullw  %xmm3, %xmm1
>>         pmulhw  %xmm3, %xmm2
>>         movdqa  %xmm1, %xmm0
>>         punpcklwd %xmm2, %xmm0
>>         punpckhwd %xmm2, %xmm1
>>         movdqa  %xmm0, %xmm2
>>         punpcklwd %xmm1, %xmm0
>>         punpckhwd %xmm1, %xmm2
>>         movdqa  %xmm0, %xmm1
>>         punpckhwd %xmm2, %xmm1
>>         punpcklwd %xmm2, %xmm0
>>         punpckhwd %xmm1, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %edi, %eax
>>         andl    $-8, %eax
>>         testb   $7, %dil
>>         je      .L14
>>
>> *insn dist*:
>>
>>       1 addq
>>       3 movdqa
>>       3 movdqu
>>       1 movups
>>       1 pmulhw
>>       1 pmullw
>>       4 punpckhwd
>>       3 punpcklwd
>>
>> I know little about i386 asm, but this seems better judging from the insn
>> distribution; the one without this patch uses two more psrad instructions.
>
> So the advantage of 1) would be more appropriate costing in the vectorizer
> (x86 does not have a native vector highpart multiply).
>
>> 3) FWIW, with this patch as well as the existing optab support, the IR
>> looks like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_40 = niters.4_19 >> 3;
>>   _75 = (sizetype) bnd.5_40;
>>   _91 = _75 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)>
>>   vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1];
>>   vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1];
>>   vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52;
>>   ivtmp.48_45 = ivtmp.48_53 + 16;
>>   if (ivtmp.48_45 != _91)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>>   <bb 6> [local count: 94607391]:
>>   niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288;
>>   tmp.7_42 = (int) niters_vector_mult_vf.6_41;
>>   _61 = niters.4_19 & 7;
>>   if (_61 == 0)
>>     goto <bb 12>; [12.50%]
>>   else
>>     goto <bb 7>; [87.50%]
>>
>>   <bb 7> [local count: 93293400]:
>>   # i_38 = PHI <tmp.7_42(6), 0(3)>
>>   # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)>
>>   niters.18_60 = niters.4_19 - _44;
>>   _76 = niters.18_60 + 4294967295;
>>   if (_76 <= 2)
>>     goto <bb 9>; [10.00%]
>>   else
>>     goto <bb 8>; [90.00%]
>>
>>   <bb 8> [local count: 167928121]:
>>   _85 = (sizetype) _44;
>>   _86 = _85 * 2;
>>   vectp_small_a.23_84 = &small_a + _86;
>>   vectp_small_b.26_90 = &small_b + _86;
>>   vectp_small_c.31_98 = &small_c + _86;
>>   vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84];
>>   vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90];
>>   vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95;
>>   MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96;
>>   niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292;
>>   _82 = (int) niters_vector_mult_vf.20_80;
>>   tmp.21_81 = i_38 + _82;
>>   _74 = niters.18_60 & 3;
>>   if (_74 == 0)
>>     goto <bb 11>; [25.00%]
>>   else
>>     goto <bb 9>; [75.00%]
>>
>> ...
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_a(%rax), %xmm0
>>         movdqu  small_b(%rax), %xmm2
>>         addq    $16, %rax
>>         pmulhw  %xmm2, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %ecx, %edx
>>         andl    $-8, %edx
>>         movl    %edx, %eax
>>         testb   $7, %cl
>>         je      .L19
>> .L3:
>>         movl    %ecx, %esi
>>         subl    %edx, %esi
>>         leal    -1(%rsi), %edi
>>         cmpl    $2, %edi
>>         jbe     .L6
>>         movq    small_a(%rdx,%rdx), %xmm0
>>         movq    small_b(%rdx,%rdx), %xmm1
>>         pmulhw  %xmm1, %xmm0
>>         movq    %xmm0, small_c(%rdx,%rdx)
>>         movl    %esi, %edx
>>         andl    $-4, %edx
>>         addl    %edx, %eax
>>         andl    $3, %esi
>>         je      .L1
>
> Oh, so indeed x86 has a vector highpart multiply, not grep-friendly
> macroized as <s>mul<mode>3_highpart ;)
>
>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
>
> Yes.
>
Thanks for confirming!  The related patch v2 is attached and its testing
is ongoing.

>> or do you expect it will also cover what we support in can_mult_highpart_p?
>
> See above - since we manage to use widen_mult we should expose this
> detail for the purpose of costing.  So the pattern would directly ask for
> an optab IFN mapping to [us]mul_highpart.
>

OK, the costing issue seems to exist already, since vect_recog_divmod_pattern
checks with can_mult_highpart_p and generates mul_highpart on some paths,
which probably end up with widen_mult_{odd,even,lo,hi}.  I guess we can fix
the related costing routine by querying can_mult_highpart_p and pricing the
mul_highpart higher when it has to be implemented with widen_mult_{odd...}.

By the way, IIUC Richard S. proposed to generate these
widen_mult_{odd,even,lo,hi} in gimple rather than at expand time; since
vector permutations are involved, it seems we have to generate them during
transformation?  Otherwise the mul_highpart in scalar code, like the one
from divmod pattern recognition, seems hard to expand?

BR,
Kewen
-----
gcc/ChangeLog:

	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
	* internal-fn.def (IFN_MULH): New internal function.
	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
	recog normal multiply highpart as IFN_MULH.
---
 gcc/internal-fn.c        |  1 +
 gcc/internal-fn.def      |  2 ++
 gcc/tree-vect-patterns.c | 45 +++++++++++++++++++++++++++-------------
 3 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index fb8b43d1ce2..b1b4289357c 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -3703,6 +3703,7 @@ first_commutative_argument (internal_fn fn)
     case IFN_FNMS:
     case IFN_AVG_FLOOR:
     case IFN_AVG_CEIL:
+    case IFN_MULH:
     case IFN_MULHS:
     case IFN_MULHRS:
     case IFN_FMIN:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index c3b8e730960..ed6d7de1680 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -169,6 +169,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
 DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
                               savg_ceil, uavg_ceil, binary)
 
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
+                              smul_highpart, umul_highpart, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
                               smulhs, umulhs, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index b2e7fc2cc7a..4581cc6b51c 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1896,8 +1896,15 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
 
    1) Multiply high with scaling
      TYPE res = ((TYPE) a * (TYPE) b) >> c;
+     Here, c is bitsize (TYPE) / 2 - 1.
+
    2) ... or also with rounding
      TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+     Here, d is bitsize (TYPE) / 2 - 2.
+
+   3) Normal multiply high
+     TYPE res = ((TYPE) a * (TYPE) b) >> e;
+     Here, e is bitsize (TYPE) / 2.
 
    where only the bottom half of res is used.  */
 
@@ -1942,7 +1949,6 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   stmt_vec_info mulh_stmt_info;
   tree scale_term;
   internal_fn ifn;
-  unsigned int expect_offset;
 
   /* Check for the presence of the rounding term.  */
   if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
@@ -1991,25 +1997,37 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
       /* Get the scaling term.  */
       scale_term = gimple_assign_rhs2 (plus_input_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+        return NULL;
+
+      /* Check pattern 2).  */
+      if (wi::to_widest (scale_term) + target_precision + 2
+          != TYPE_PRECISION (lhs_type))
+        return NULL;
 
-      expect_offset = target_precision + 2;
       ifn = IFN_MULHRS;
     }
   else
     {
       mulh_stmt_info = rshift_input_stmt_info;
       scale_term = gimple_assign_rhs2 (last_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+        return NULL;
 
-      expect_offset = target_precision + 1;
-      ifn = IFN_MULHS;
+      /* Check for pattern 1).  */
+      if (wi::to_widest (scale_term) + target_precision + 1
+          == TYPE_PRECISION (lhs_type))
+        ifn = IFN_MULHS;
+      /* Check for pattern 3).  */
+      else if (wi::to_widest (scale_term) + target_precision
+               == TYPE_PRECISION (lhs_type))
+        ifn = IFN_MULH;
+      else
+        return NULL;
     }
 
-  /* Check that the scaling factor is correct.  */
-  if (TREE_CODE (scale_term) != INTEGER_CST
-      || wi::to_widest (scale_term) + expect_offset
-         != TYPE_PRECISION (lhs_type))
-    return NULL;
-
   /* Check whether the scaling input term can be seen as two widened
      inputs multiplied together.  */
   vect_unpromoted_value unprom_mult[2];
@@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   /* Check for target support.  */
   tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
   if (!new_vectype
-      || !direct_internal_fn_supported_p
-           (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
     return NULL;
 
   /* The IR requires a valid vector type for the cast result, even though
@@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   /* Generate the IFN_MULHRS call.  */
   tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
   tree new_ops[2];
-  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
-                       unprom_mult, new_vectype);
+  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
+                       new_vectype);
   gcall *mulhrs_stmt
     = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
   gimple_call_set_lhs (mulhrs_stmt, new_var);
-- 
2.17.1
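P.S. For readers skimming the thread, here is a small standalone sketch (not
part of the patch; the function and array names are made up) of the three
scalar shapes the updated vect_recog_mulhs_pattern comment describes, using
16-bit inputs widened to 32 bits so that bitsize (TYPE) / 2 is 16.  Case 3)
has the same shape as the sh_c/small_c loops earlier in the thread; whether
each loop actually becomes IFN_MULHS/IFN_MULHRS/IFN_MULH still depends on the
target providing the corresponding optab.

```
#include <stdint.h>

#define N 1024
extern int16_t a[N], b[N], c_hs[N], c_hrs[N], c_h[N];

/* 1) Multiply high with scaling: shift by bitsize (int32_t) / 2 - 1 = 15;
   a candidate for IFN_MULHS when the target has [us]mulhs.  */
void
use_mulhs (void)
{
  for (int i = 0; i < N; i++)
    c_hs[i] = ((int32_t) a[i] * (int32_t) b[i]) >> 15;
}

/* 2) ... or also with rounding: shift by bitsize (int32_t) / 2 - 2 = 14,
   add 1, then shift by 1; a candidate for IFN_MULHRS when [us]mulhrs
   exists.  */
void
use_mulhrs (void)
{
  for (int i = 0; i < N; i++)
    c_hrs[i] = ((((int32_t) a[i] * (int32_t) b[i]) >> 14) + 1) >> 1;
}

/* 3) Normal multiply highpart: shift by bitsize (int32_t) / 2 = 16;
   the new case, a candidate for IFN_MULH ([us]mul_highpart).  */
void
use_mulh (void)
{
  for (int i = 0; i < N; i++)
    c_h[i] = ((int32_t) a[i] * (int32_t) b[i]) >> 16;
}
```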