[pushed] wwwdocs: news: Update links to Craig Burley's notes on g77
The new site does not offer https, so staying with http. On the way remove a mailto: link to offer g77 help from 1999. Pushed. Gerald --- htdocs/news.html | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/htdocs/news.html b/htdocs/news.html index 5e782349..6fac4ea5 100644 --- a/htdocs/news.html +++ b/htdocs/news.html @@ -1612,14 +1612,11 @@ series. On behalf of the entire GCC team, the steering committee would like to thank Craig for his work. Craig has written a detailed analysis of the -<a href="http://world.std.com/~burley/g77-why.html">current state</a> +<a href="http://www.kilmnj.com/g77/why.html">current state</a> and -<a href="http://world.std.com/~burley/g77-next.html">possible future</a> +<a href="http://www.kilmnj.com/g77/next.html">possible future</a> of g77, available at his -<a href="http://world.std.com/~burley/g77.html">g77 web site</a>. - -If you are interested in helping with g77, please <a href="mailto:g...@gcc.gnu.org?subject=g77%20Help%20Wanted?">contact us</a>! +<a href="http://www.kilmnj.com/g77/">g77 web site</a>. October 12, 1999 -- 2.45.2
[PATCH] AVX512BF16: Do not allow permutation with vcvtne2ps2bf16 [PR115889]
Hi, According to the instruction spec of AVX512BF16, the conversion from float to BF16 is not a simple truncation. It has special handling for denormal/NaN, and even for a normal float it will add an extra rounding bias according to the least significant bit of the bf16 number. This means we cannot use vcvtne2ps2bf16 for any bf16 vector shuffle. The optimization introduced in r15-1368 adds a specific split to convert HImode permutation with this instruction, so remove it and treat the BFmode permutation the same as HFmode. Bootstrapped & regtested on x86_64-pc-linux-gnu. OK for trunk? gcc/ChangeLog: PR target/115889 * config/i386/predicates.md (vcvtne2ps2bf_parallel): Remove. * config/i386/sse.md (hi_cvt_bf): Remove. (HI_CVT_BF): Likewise. (vpermt2_sepcial_bf16_shuffle_): Likewise. gcc/testsuite/ChangeLog: PR target/115889 * gcc.target/i386/vpermt2-special-bf16-shufflue.c: Adjust option and output scan. --- gcc/config/i386/predicates.md | 11 -- gcc/config/i386/sse.md| 35 --- .../i386/vpermt2-special-bf16-shufflue.c | 5 ++- 3 files changed, 2 insertions(+), 49 deletions(-) diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md index a894847adaf..5d0bb1e0f54 100644 --- a/gcc/config/i386/predicates.md +++ b/gcc/config/i386/predicates.md @@ -2327,14 +2327,3 @@ (define_predicate "apx_ndd_add_memory_operand" return true; }) - -;; Check that each element is odd and incrementally increasing from 1 -(define_predicate "vcvtne2ps2bf_parallel" - (and (match_code "const_vector") - (match_code "const_int" "a")) {" -{ - for (int i = 0; i < XVECLEN (op, 0); ++i) -if (INTVAL (XVECEXP (op, 0, i)) != (2 * i + 1)) - return false; - return true; -}) diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index b3b4697924b..c134494cd20 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -31460,38 +31460,3 @@ (define_insn "vpdp_" "TARGET_AVXVNNIINT16" "vpdp\t{%3, %2, %0|%0, %2, %3}" [(set_attr "prefix" "vex")]) - -(define_mode_attr hi_cvt_bf - [(V8HI "v8bf") (V16HI "v16bf") (V32HI "v32bf")]) - -(define_mode_attr HI_CVT_BF - [(V8HI "V8BF") (V16HI "V16BF") (V32HI "V32BF")]) - -(define_insn_and_split "vpermt2_sepcial_bf16_shuffle_" - [(set (match_operand:VI2_AVX512F 0 "register_operand") - (unspec:VI2_AVX512F - [(match_operand:VI2_AVX512F 1 "vcvtne2ps2bf_parallel") - (match_operand:VI2_AVX512F 2 "register_operand") - (match_operand:VI2_AVX512F 3 "nonimmediate_operand")] - UNSPEC_VPERMT2))] - "TARGET_AVX512VL && TARGET_AVX512BF16 && ix86_pre_reload_split ()" - "#" - "&& 1" - [(const_int 0)] -{ - rtx op0 = gen_reg_rtx (mode); - operands[2] = lowpart_subreg (mode, - force_reg (mode, operands[2]), - mode); - operands[3] = lowpart_subreg (mode, - force_reg (mode, operands[3]), - mode); - - emit_insn (gen_avx512f_cvtne2ps2bf16_(op0, - operands[3], - operands[2])); - emit_move_insn (operands[0], lowpart_subreg (mode, op0, - mode)); - DONE; -} -[(set_attr "mode" "")]) diff --git a/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c b/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c index 5c65f2a9884..4cbc85735de 100755 --- a/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c +++ b/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c @@ -1,7 +1,6 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -mavx512bf16 -mavx512vl" } */ -/* { dg-final { scan-assembler-not "vpermi2b" } } */ -/* { dg-final { scan-assembler-times "vcvtne2ps2bf16" 3 } } */ +/* { dg-options "-O2 -mavx512vbmi -mavx512vl" } */ +/* { dg-final { scan-assembler-times "vpermi2w" 3 } } */ typedef __bf16 v8bf
__attribute__((vector_size(16))); typedef __bf16 v16bf __attribute__((vector_size(32))); -- 2.34.1
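(Editorial aside on the rounding behaviour described in the patch above: a hand-written C sketch, not part of the patch, with helper names invented for illustration. It mimics the round-to-nearest-even step the spec prescribes for normal floats and shows a bit pattern where that differs from the plain truncation a 16-bit shuffle would perform.)

#include <stdint.h>
#include <stdio.h>

/* What a pure 16-bit shuffle would preserve: the high half of the float.  */
static uint16_t bf16_truncate (uint32_t bits)
{
  return bits >> 16;
}

/* What vcvtne2ps2bf16 computes for a normal float: round to nearest,
   ties to even, via a bias derived from the LSB of the truncated value.  */
static uint16_t bf16_round_ne (uint32_t bits)
{
  bits += 0x7fff + ((bits >> 16) & 1);
  return bits >> 16;
}

int main (void)
{
  uint32_t bits = 0x3f808001;   /* just above the bf16 rounding midpoint */
  printf ("truncated: %#06x  rounded: %#06x\n",
          bf16_truncate (bits), bf16_round_ne (bits));
  /* Prints 0x3f80 vs 0x3f81: the "shuffled" element changes value,
     which is why the split removed above was unsafe.  */
  return 0;
}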
[pushed] wwwdocs: gcc-3.*: Remove download links for MTL from 20 years ago
The entire osl.iu.edu site has gone away without direct replacement. --- htdocs/gcc-3.1/criteria.html | 3 +-- htdocs/gcc-3.3/criteria.html | 3 +-- htdocs/gcc-3.4/criteria.html | 3 +-- 3 files changed, 3 insertions(+), 6 deletions(-) diff --git a/htdocs/gcc-3.1/criteria.html b/htdocs/gcc-3.1/criteria.html index 13d5a7bd..22ffaaf7 100644 --- a/htdocs/gcc-3.1/criteria.html +++ b/htdocs/gcc-3.1/criteria.html @@ -251,8 +251,7 @@ shown here are used for GCC 3.1 integration testing. <a href="http://www.osl.iu.edu/research/mtl/">MTL</a> C++ 2.12.2.-20 -<a href="http://www.osl.iu.edu/research/mtl/download.php3"> -MTL (Download)</a> + POOMA diff --git a/htdocs/gcc-3.3/criteria.html b/htdocs/gcc-3.3/criteria.html index ec32b28d..547c1bff 100644 --- a/htdocs/gcc-3.3/criteria.html +++ b/htdocs/gcc-3.3/criteria.html @@ -255,8 +255,7 @@ shown here are used for GCC 3.3 integration testing. <a href="http://www.osl.iu.edu/research/mtl/">MTL</a> C++ 2.12.2.-20 -<a href="http://www.osl.iu.edu/research/mtl/download.php3"> -MTL (Download)</a>, with <a href="http://www.osl.iu.edu/MailArchives/mtl-devel/msg00311.php">patch</a> + POOMA diff --git a/htdocs/gcc-3.4/criteria.html b/htdocs/gcc-3.4/criteria.html index 8860aa36..1ff98ee9 100644 --- a/htdocs/gcc-3.4/criteria.html +++ b/htdocs/gcc-3.4/criteria.html @@ -255,8 +255,7 @@ shown here are used for GCC 3.4 integration testing. <a href="http://www.osl.iu.edu/research/mtl/">MTL</a> C++ 2.12.2.-20 -<a href="http://www.osl.iu.edu/research/mtl/download.php3"> -MTL (Download)</a>, with <a href="http://www.osl.iu.edu/MailArchives/mtl-devel/msg00311.php">patch</a> + POOMA -- 2.45.2
[PATCH] LoongArch: Organize the code related to split move and merge the same functions.
gcc/ChangeLog: * config/loongarch/loongarch-protos.h (loongarch_split_128bit_move): Delete. (loongarch_split_128bit_move_p): Delete. (loongarch_split_256bit_move): Delete. (loongarch_split_256bit_move_p): Delete. (loongarch_split_vector_move): Add a function declaration. * config/loongarch/loongarch.cc (loongarch_vector_costs::finish_cost): Adjust the code formatting. (loongarch_split_vector_move_p): Merge loongarch_split_128bit_move_p and loongarch_split_256bit_move_p. (loongarch_split_move_p): Merge code. (loongarch_split_move): Likewise. (loongarch_split_128bit_move_p): Delete. (loongarch_split_256bit_move_p): Delete. (loongarch_split_128bit_move): Delete. (loongarch_split_vector_move): Merge loongarch_split_128bit_move and loongarch_split_256bit_move. (loongarch_split_256bit_move): Delete. (loongarch_global_init): Remove the extra semicolon at the end of the function. * config/loongarch/loongarch.md (*movdf_softfloat): Added a new condition TARGET_64BIT. --- gcc/config/loongarch/loongarch-protos.h | 5 +- gcc/config/loongarch/loongarch.cc | 221 ++-- gcc/config/loongarch/loongarch.md | 1 + 3 files changed, 58 insertions(+), 169 deletions(-) diff --git a/gcc/config/loongarch/loongarch-protos.h b/gcc/config/loongarch/loongarch-protos.h index e238d795a73..85f6e894399 100644 --- a/gcc/config/loongarch/loongarch-protos.h +++ b/gcc/config/loongarch/loongarch-protos.h @@ -85,10 +85,7 @@ extern bool loongarch_split_move_p (rtx, rtx); extern void loongarch_split_move (rtx, rtx); extern bool loongarch_addu16i_imm12_operand_p (HOST_WIDE_INT, machine_mode); extern void loongarch_split_plus_constant (rtx *, machine_mode); -extern void loongarch_split_128bit_move (rtx, rtx); -extern bool loongarch_split_128bit_move_p (rtx, rtx); -extern void loongarch_split_256bit_move (rtx, rtx); -extern bool loongarch_split_256bit_move_p (rtx, rtx); +extern void loongarch_split_vector_move (rtx, rtx); extern const char *loongarch_output_move (rtx, rtx); #ifdef RTX_CODE extern void loongarch_expand_scc (rtx *); diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc index 8eb47ff95c3..c7a02103ef5 100644 --- a/gcc/config/loongarch/loongarch.cc +++ b/gcc/config/loongarch/loongarch.cc @@ -4354,10 +4354,10 @@ void loongarch_vector_costs::finish_cost (const vector_costs *scalar_costs) { loop_vec_info loop_vinfo = dyn_cast (m_vinfo); + if (loop_vinfo) -{ - m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo); -} +m_suggested_unroll_factor + = determine_suggested_unroll_factor (loop_vinfo); vector_costs::finish_cost (scalar_costs); } @@ -4423,6 +4423,7 @@ loongarch_subword (rtx op, bool high_p) return simplify_gen_subreg (word_mode, op, mode, byte); } +static bool loongarch_split_vector_move_p (rtx dest, rtx src); /* Return true if a move from SRC to DEST should be split into two. SPLIT_TYPE describes the split condition. */ @@ -,13 +4445,11 @@ loongarch_split_move_p (rtx dest, rtx src) return false; } - /* Check if LSX moves need splitting. */ - if (LSX_SUPPORTED_MODE_P (GET_MODE (dest))) -return loongarch_split_128bit_move_p (dest, src); - /* Check if LASX moves need splitting. */ - if (LASX_SUPPORTED_MODE_P (GET_MODE (dest))) -return loongarch_split_256bit_move_p (dest, src); + /* Check if vector moves need splitting. */ + if (LSX_SUPPORTED_MODE_P (GET_MODE (dest)) + || LASX_SUPPORTED_MODE_P (GET_MODE (dest))) +return loongarch_split_vector_move_p (dest, src); /* Otherwise split all multiword moves. 
*/ return size > UNITS_PER_WORD; @@ -4463,10 +4462,9 @@ void loongarch_split_move (rtx dest, rtx src) { gcc_checking_assert (loongarch_split_move_p (dest, src)); - if (LSX_SUPPORTED_MODE_P (GET_MODE (dest))) -loongarch_split_128bit_move (dest, src); - else if (LASX_SUPPORTED_MODE_P (GET_MODE (dest))) -loongarch_split_256bit_move (dest, src); + if (LSX_SUPPORTED_MODE_P (GET_MODE (dest)) + || LASX_SUPPORTED_MODE_P (GET_MODE (dest))) +loongarch_split_vector_move (dest, src); else gcc_unreachable (); } @@ -4588,224 +4586,117 @@ loongarch_output_move_index_float (rtx x, machine_mode mode, bool ldr) return insn[ldr][index-2]; } -/* Return true if a 128-bit move from SRC to DEST should be split. */ - -bool -loongarch_split_128bit_move_p (rtx dest, rtx src) -{ - /* LSX-to-LSX moves can be done in a single instruction. */ - if (FP_REG_RTX_P (src) && FP_REG_RTX_P (dest)) -return false; - - /* Check for LSX loads and stores. */ - if (FP_REG_RTX_P (dest) && MEM_P (src)) -return false; - if (FP_REG_RTX_P (src) && MEM_P (dest)) -return false; - - /* Check for LSX set to an immediate const vector with valid repli
[PATCH] fortran: Correctly evaluate the MASK argument of MINLOC/MAXLOC
From: Mikael Morin Hello, I'm currently testing this on x86_64-linux. I plan to push to master if all goes well. Mikael -- 8< -- Add the preliminary code that the generated expression for MASK may depend on when generating the inline code to evaluate MINLOC or MAXLOC with a scalar MASK. The generated code was only keeping the generated expression but not the preliminary code, which was sufficient for simple cases such as data references or simple (scalar) function calls, but was bogus with more complicated ones. gcc/fortran/ChangeLog: * trans-intrinsic.cc (gfc_conv_intrinsic_minmaxloc): Add the preliminary code generated for MASK to the preliminary code of MINLOC/MAXLOC. gcc/testsuite/ChangeLog: * gfortran.dg/minmaxloc_17.f90: New test. --- gcc/fortran/trans-intrinsic.cc | 1 + gcc/testsuite/gfortran.dg/minmaxloc_17.f90 | 33 ++ 2 files changed, 34 insertions(+) create mode 100644 gcc/testsuite/gfortran.dg/minmaxloc_17.f90 diff --git a/gcc/fortran/trans-intrinsic.cc b/gcc/fortran/trans-intrinsic.cc index cadbd177452..180d0d7a88c 100644 --- a/gcc/fortran/trans-intrinsic.cc +++ b/gcc/fortran/trans-intrinsic.cc @@ -5749,6 +5749,7 @@ gfc_conv_intrinsic_minmaxloc (gfc_se * se, gfc_expr * expr, enum tree_code op) gfc_init_se (&maskse, NULL); gfc_conv_expr_val (&maskse, maskexpr); + gfc_add_block_to_block (&se->pre, &maskse.pre); gfc_init_block (&block); gfc_add_block_to_block (&block, &loop.pre); gfc_add_block_to_block (&block, &loop.post); diff --git a/gcc/testsuite/gfortran.dg/minmaxloc_17.f90 b/gcc/testsuite/gfortran.dg/minmaxloc_17.f90 new file mode 100644 index 000..7e6e586ab03 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/minmaxloc_17.f90 @@ -0,0 +1,33 @@ +! { dg-do run } +! +! Check that the code necessary to evaluate MINLOC's or MAXLOC's MASK +! argument is correctly generated. + +program p + implicit none + integer, parameter :: data10(*) = (/ 2, 5, 2, 0, 6, 5, 3, 6, 0, 1 /) + logical, parameter :: mask10(*) = (/ .false., .true., .false., & + .false., .true., .true., & + .true. , .true., .false., & + .false. /) + type bool_wrapper +logical :: l + end type + call check_minloc + call check_maxloc +contains + subroutine check_minloc +integer :: a(10) +integer :: r +a = data10 +r = minloc(a, dim = 1, mask = sum(a) > 0) +if (r /= 4) stop 11 + end subroutine + subroutine check_maxloc +integer :: a(10) +integer :: r +a = data10 +r = maxloc(a, dim = 1, mask = sum(a) > 0) +if (r /= 5) stop 18 + end subroutine +end program -- 2.43.0
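(Editorial aside: a hand-written C sketch of the inline expansion, not actual gfortran output, showing why a scalar MASK needs its preliminary block. For mask = sum(a) > 0, the reduction loop is the preliminary code carried in maskse.pre that was previously dropped.)

/* Sketch of the inline expansion of r = minloc(a, dim=1, mask = sum(a) > 0).  */
static int minloc_with_scalar_mask (const int a[10])
{
  /* maskse.pre: preliminary statements the MASK expression depends on;
     this is the block that used to be lost.  */
  int sum = 0;
  for (int i = 0; i < 10; i++)
    sum += a[i];

  /* maskse.expr: the scalar mask value itself.  */
  _Bool mask = sum > 0;

  int r = 0;                    /* MINLOC yields 0 when MASK is false */
  if (mask)
    {
      int best = 0;
      for (int i = 0; i < 10; i++)
        if (r == 0 || a[i] < best)
          {
            best = a[i];
            r = i + 1;          /* 1-based, as MINLOC returns */
          }
    }
  return r;
}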
[Patch, fortran] PR84868 - [11/12/13/14/15 Regression] ICE in gfc_conv_descriptor_offset, at fortran/trans-array.c:208
Hi All, After messing around with argument mapping, where I found and fixed another bug, I realised that the problem lay with simplification of len_trim with an argument that is the element of a parameter array. The fix was then a straightforward lift of existing code in expr.cc. The mapping bug is also fixed by supplying the se string length when building character typespecs. Regtests just fine. OK for mainline? I believe that this is safe for backporting to 14-branch before the 14.2 release - thoughts? Regards Paul Change.Logs Description: Binary data
Re: [RFC] c++: Eagerly substitute auto constraint args in tsubst [PR115030]
On Fri, 12 Jul 2024 at 21:31, Patrick Palka wrote: > > Interesting, thanks for the detailed write-up! Thank you very much. > FWIW it should be much easier to obtain the partially specialised > tmpl/args of a class/variable specialization since GCC 14, by looking > at TI_PARTIAL_INFO of DECL_TEMPLATE_INFO (if it exists). > > Maybe passing the TI_PARTIAL_INFO args to do_auto_deduction where > appropriate might fix this PR too? You are right, TI_PARTIAL_INFO does the job if it is used in decl.cc. I have done some basic tests and everything is fine with it too. > Your approach indeed seems like a nice simplification, and while it > won't change the behavior of the vast majority of code, eager > substitution into constraints versus waiting until satisfaction > can cause us to reject otherwise valid code in some edge cases. > Consider > > template<class T, class U> > concept C = __is_same(T, int) || __is_same(T, U); > > template<class T> > void f(T t) { > C<typename T::type> auto x = t; > } > > int main() { > f(0); > } > > With the eager approach we'd substitute T=int into 'typename T::type' during > instantiation of f(), which fails and causes the program to be ill-formed > since we're not in a SFINAE context. > > With the non-eager approach we pass the original constraint C<typename T::type> > and the full set of arguments to satisfaction, which instead substitutes and > evaluates each atom of the constraint in a SFINAE manner. And since the first > term of the disjunction is satisfied, the entire constraint is satisfied, and > there's no error. > > This is basically the motivation for deferring substitution into constraints, > most importantly the associated constraints of template declaration. > Admittedly it's probably less important to do this for constrained autos, but > IIUC we do so anyway for QoI/consistency. Clang and MSVC reject the above > testcase FWIW, indicating they do substitute into constrained autos eagerly. > I'm not sure if the standard is completely clear about what to do here. > > Jason would have more context, but IIUC we deliberately don't eagerly > substitute into constrained autos and accept the implementation > complexities that this approach entails. What I can understand from 13.5.4.1.4 of N4950 is that SFINAE is not applied to the concepts. As it is stated: "If any such substitution results in an invalid type or expression, the program is ill-formed; no diagnostic is required." It adds that the following code is ill-formed: template<typename T> concept A = T::value || true; template<typename U> concept B = A<U*>; template<typename V> concept C = B<V&>; And indeed we reject it if we add `static_assert(C<int>);` or `C auto x = 1;` to the code. One could conclude that the following code is ill-formed too: template<typename T> concept A = T::value || true; template<typename U> concept B = A<U*>; static_assert(A<int>); But we are accepting it. As far as I can understand, I cannot distinguish between the two cases. I would appreciate it if you could elaborate for me. If SFINAE applies to the disjunctions of the atomic constraints, I can make a new patch to use TI_PARTIAL_INFO, as you suggested. Otherwise, in addition to these changes, we also need to modify constraint_satisfaction_value (constraints.cc) to raise an error for static_assert and other usages of concept expressions as well. On Fri, 12 Jul 2024 at 21:31, Patrick Palka wrote: > > Interesting, thanks for the detailed write-up! > > On Tue, 9 Jul 2024, Seyed Sajad Kahani wrote: > > > Hi.
> > > > While investigating a fix for C++/PR115030 (a bug in constrained auto > > deduction), I was wondering why we are not substituting constraint args of > > an > > auto node in tsubst (pt.cc:16533). Instead, this substitution is delayed > > until > > do_auto_deduction (pt.cc), where we attempt to find the substituted args of > > the > > enclosing scope. At that point, it becomes difficult to find partially > > specialised args, as they are erased in the process. > > FWIW it should be much easier to obtain the partially specialised > tmpl/args of a class/variable specialization since GCC 14, by looking > at TI_PARTIAL_INFO of DECL_TEMPLATE_INFO (if it exists). > > Maybe passing the TI_PARTIAL_INFO args to do_auto_deduction where > appropriate might fix this PR too? > > > I propose modifying tsubst to eagerly substitute the constraint args of an > > auto node. > > > > By making this change, we would not need to provide outer_targs for > > do_auto_deduction in cases where tsubst has been called for the type, which > > covers most scenarios. However, we still need outer_targs for cases like: > > > > template<typename T, C<T> auto V> > > struct S { }; > > > > Hence, outer_targs cannot be completely removed but will be set to > > NULL_TREE in > > all calls except the one in convert_template_argument (pt.cc:8788). > > Additionally, the tmpl argument of do_auto_deduction, which helps to provide > > outer args in the scope, will no longer be necessary and can be safely > > removed. > > > > Also, the trimming hack proposed in > > h
Re: [Patch, tree-optimization, predcom] Improve unroll factor for predictive commoning
Hello Richard: On 12/07/24 6:20 pm, Richard Biener wrote: > On Fri, Jul 12, 2024 at 12:09 PM Ajit Agarwal wrote: >> >> Hello Richard: >> >> On 11/07/24 2:21 pm, Richard Biener wrote: >>> On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal >>> wrote: Hello All: Unroll factor is determined with max distance across loop iterations. The logic for determining the loop unroll factor is based on data dependency across loop iterations. The max distance across loop iterations is the unrolling factor that helps in predictive commoning. >>> >>> The old comment in the code says >>> - /* The best unroll factor for this chain is equal to the number of -temporary variables that we create for it. */ >>> >>> why is that wrong and why is the max dependence distance more correct? >>> >>> Do you have a testcase that shows how this makes a (positive) difference? >>> >> >> There is nothing wrong in the existing implementation of unroll >> factor for predictive commoning. >> >> But with max dependence distance we get performance improvement >> with spec 2017 benchmarks (INT) of 0.01% (Geomean) with and without >> changes. Improvement in benchmarks with max dependence distance >> changes. >> >> I have used the following flags: >> -O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning >> -fno-tree-pre >> >> With above flags I ran with and without changes. > > A 0.01% geomean improvement is noise. Why did you disable PRE? > I have changed the flags used. Now I change the flags to -O3 and -fpredictive-commoning -funroll-loops. I am measuring the performance with the above flags for spec 2017 benchmarks and would let you know the performance by Monday. >> There is no degradation with spec 2017 (FP benchmarks). >> >> Because in predictive commoning we reuse values computed in >> earlier iterations of a loop in the later ones, max distance is the >> better choice. > > The re-use distance is the same though. So your change merely increases > the unroll factor? Or can you explain why there is more re-use with > your change. > With -O3 flag and -fpredictive-commoning -funroll-loops in spec 2017 benchmarks many benchmarks increases the unroll factor with my changes in predictive commoning determine_unroll_factor. I have traversed the data dependence relation vector and get the distance and the max data dependence distance is the unroll factor. > Richard. > Thanks & Regards Ajit >>> Richard. >>> >> >> Thanks & Regards >> Ajit >> Bootstrapped and regtested on powerpc64-linux-gnu. Thanks & Regards Ajit tree-optimization, predcom: Improve unroll factor for predictive commoning Unroll factor is determined with max distance across loop iterations. The logic for determining the loop unroll factor is based on data dependency across loop iterations. The max distance across loop iterations is the unrolling factor that helps in predictive commoning. 2024-07-11 Ajit Kumar Agarwal gcc/ChangeLog: * tree-predcom.cc: Change in determining unroll factor with data dependence across loop iterations. --- gcc/tree-predcom.cc | 51 ++--- 1 file changed, 39 insertions(+), 12 deletions(-) diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc index 9844fee1e97..029b02f5990 100644 --- a/gcc/tree-predcom.cc +++ b/gcc/tree-predcom.cc @@ -409,6 +409,7 @@ public: /* Perform the predictive commoning optimization for chains, make this public for being called in callback execute_pred_commoning_cbck. */ void execute_pred_commoning (bitmap tmp_vars); + unsigned determine_unroll_factor (const vec &chains); private: /* The pointer to the given loop. 
*/ @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p chain, copies as possible. CHAINS is the list of chains that will be optimized. */ -static unsigned -determine_unroll_factor (const vec &chains) +unsigned +pcom_worker::determine_unroll_factor (const vec &chains) { chain_p chain; - unsigned factor = 1, af, nfactor, i; + unsigned factor = 1, i; unsigned max = param_max_unroll_times; + struct data_dependence_relation *ddr; + unsigned nfactor = 0; + int nzfactor = 0; + + /* Best unroll factor is the maximum distance across loop + iterations. */ + FOR_EACH_VEC_ELT (m_dependences, i, ddr) +{ + for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++) + { + lambda_vector vec = DDR_DIST_VECT (ddr, j); + widest_int distance = vec[j]; + unsigned offset = distance.to_uhwi (); + if (offset == 0) + continue; + + int dist
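(Editorial aside for readers following the thread: an illustrative loop, not taken from the patch. The value stored to a[i + 2] is re-read as a[i + 1] on the next iteration and as a[i] two iterations later, so the maximum dependence distance is 2; unrolling by that distance lets predictive commoning keep the re-used values in registers across the unrolled copies.)

void f (double *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i + 2] = a[i] + a[i + 1];   /* cross-iteration re-use, distance 2 */
}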
Re: [Patch, tree-optimization, predcom] Improve unroll factor for predictive commoning
Hello Richard: On 12/07/24 6:20 pm, Richard Biener wrote: > On Fri, Jul 12, 2024 at 12:09 PM Ajit Agarwal wrote: >> >> Hello Richard: >> >> On 11/07/24 2:21 pm, Richard Biener wrote: >>> On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal >>> wrote: Hello All: Unroll factor is determined with max distance across loop iterations. The logic for determining the loop unroll factor is based on data dependency across loop iterations. The max distance across loop iterations is the unrolling factor that helps in predictive commoning. >>> >>> The old comment in the code says >>> - /* The best unroll factor for this chain is equal to the number of -temporary variables that we create for it. */ >>> >>> why is that wrong and why is the max dependence distance more correct? >>> >>> Do you have a testcase that shows how this makes a (positive) difference? >>> >> >> There is nothing wrong in the existing implementation of unroll >> factor for predictive commoning. >> >> But with max dependence distance we get performance improvement >> with spec 2017 benchmarks (INT) of 0.01% (Geomean) with and without >> changes. Improvement in benchmarks with max dependence distance >> changes. >> >> I have used the following flags: >> -O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning >> -fno-tree-pre >> >> With above flags I ran with and without changes. > > A 0.01% geomean improvement is noise. Why did you disable PRE? > I have changed the flags. Now I changed the flags to -O3 -fpredictive-commoning -funroll-loops. With these flags I am measuring the performance with spec 2017 benchmarks. Would let you know the results by Monday. >> There is no degradation with spec 2017 (FP benchmarks). >> >> Because in predictive commoning we reuse values computed in >> earlier iterations of a loop in the later ones, max distance is the >> better choice. > > The re-use distance is the same though. So your change merely increases > the unroll factor? Or can you explain why there is more re-use with > your change. > With -O3 -fpredictive-commoning -funroll-loops many spec 2017 benchmarks increases unroll factor with my changes. I am traversing data dependence relation vector, get the distance and max distance is the unroll factor. > Richard. > Thanks & Regards Ajit >>> Richard. >>> >> >> Thanks & Regards >> Ajit >> Bootstrapped and regtested on powerpc64-linux-gnu. Thanks & Regards Ajit tree-optimization, predcom: Improve unroll factor for predictive commoning Unroll factor is determined with max distance across loop iterations. The logic for determining the loop unroll factor is based on data dependency across loop iterations. The max distance across loop iterations is the unrolling factor that helps in predictive commoning. 2024-07-11 Ajit Kumar Agarwal gcc/ChangeLog: * tree-predcom.cc: Change in determining unroll factor with data dependence across loop iterations. --- gcc/tree-predcom.cc | 51 ++--- 1 file changed, 39 insertions(+), 12 deletions(-) diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc index 9844fee1e97..029b02f5990 100644 --- a/gcc/tree-predcom.cc +++ b/gcc/tree-predcom.cc @@ -409,6 +409,7 @@ public: /* Perform the predictive commoning optimization for chains, make this public for being called in callback execute_pred_commoning_cbck. */ void execute_pred_commoning (bitmap tmp_vars); + unsigned determine_unroll_factor (const vec &chains); private: /* The pointer to the given loop. */ @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p chain, copies as possible. 
CHAINS is the list of chains that will be optimized. */ -static unsigned -determine_unroll_factor (const vec &chains) +unsigned +pcom_worker::determine_unroll_factor (const vec &chains) { chain_p chain; - unsigned factor = 1, af, nfactor, i; + unsigned factor = 1, i; unsigned max = param_max_unroll_times; + struct data_dependence_relation *ddr; + unsigned nfactor = 0; + int nzfactor = 0; + + /* Best unroll factor is the maximum distance across loop + iterations. */ + FOR_EACH_VEC_ELT (m_dependences, i, ddr) +{ + for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++) + { + lambda_vector vec = DDR_DIST_VECT (ddr, j); + widest_int distance = vec[j]; + unsigned offset = distance.to_uhwi (); + if (offset == 0) + continue; + + int dist = offset - nzfactor; + if (dist == 0) + continue; + if (nfactor == 0) +
[pushed] diagnostics: add highlight-a vs highlight-b in colorization and pp_markup
Since r6-4582-g8a64515099e645 (which added class rich_location), ranges of quoted source code have been colorized using the following rules: - the primary range used the same color of the kind of the diagnostic i.e. "error" vs "warning" etc (defaulting to bold red and bold magenta respectively) - secondary ranges alternate between "range1" and "range2" (defaulting to green and blue respectively) This works for cases with large numbers of highlighted ranges, but is suboptimal for common cases. The following patch adds a pair of color names: "highlight-a" and "highlight-b", and uses them whenever it makes sense to highlight and contrast two different things in the source code (e.g. a type mismatch). These are used by diagnostic-show-locus.cc for highlighting quoted source. In addition the patch adds colorization to fragments within the corresponding diagnostic messages themselves, using consistent colorization between the message and the quoted source code for the two different things being contrasted. For example, consider: demo.c: In function ‘test_bad_format_string_args’: ../../src/demo.c:25:18: warning: format ‘%i’ expects argument of type ‘int’, but argument 2 has type ‘const char *’ [-Wformat=] 25 | printf("hello %i", msg); | ~^ ~~~ | | | | int const char * | %s Previously, the types within the message in quotes would be in bold but not colorized, and the labelled ranges of quoted source code would use bold magenta for the "int" and non-bold green for the "const char *". With this patch: - the "%i" and "int" in the message and the "int" in the quoted source are all colored bold green - the "const char *" in the message and in the quoted source are both colored bold blue so that the consistent use of contrasting color draws the reader's eyes to the relationships between the diagnostic message and the source. I've tried this with gnome-terminal with many themes, including a variety of light versus dark backgrounds, solarized versus non-solarized themes, etc, and it was readable in all. My initial version of the patch used the existing %r and %R facilities within pretty-print.cc for the messages, but this turned out to be very uncomfortable, leading to error-prone format strings such as: error_at (richloc, "invalid operands to binary %s (have %<%r%T%R%> and %<%r%T%R%>)", opname, "highlight-a", type0, "highlight-b", type1); To avoid requiring monstrosities such as the above, the patch adds a new "%e" format code to pretty-print.cc, which expects a pp_element *, where pp_element is a new abstract base class (actually a pp_markup::element), along with various useful subclasses. This lets the above be written as: pp_markup::element_quoted_type element_0 (type0, highlight_colors::lhs); pp_markup::element_quoted_type element_1 (type1, highlight_colors::rhs); error_at (richloc, "invalid operands to binary %s (have %e and %e)", opname, &element_0, &element_1); which I feel is maintainable and clear to translators; the use of %e and pp_element * captures the type-unsafe part of the variadic call, and the subclasses allow for type-safety (so e.g. an element_quoted_type expects a type and a highlighting color). This approach allows for some nice simplifications within c-format.cc. The patch also extends -Wformat to "teach" it about the new %e and pp_element *. Doing so requires c-format.cc to be able to determine if a T * is a pp_element * (i.e. if T is a subclass). 
To do so I added a new comp_types callback for comparing types, where the C++ frontend supplies a suitable implementation (and %e will always be wrong for C). I've manually tested this on many diagnostics with both C and C++ and it seems a subtle but significant improvement in readability. I've added a new option -fno-diagnostics-show-highlight-colors in case people prefer the old behavior. Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu. Successful run of analyzer integration tests on x86_64-pc-linux-gnu. Pushed to trunk as r15-2015-g7d73c01ce6d1ab. gcc/c-family/ChangeLog: * c-common.cc: Include "tree-pretty-print-markup.h". (binary_op_error): Use pp_markup::element_quoted_type and %e. (check_function_arguments): Add "comp_types" param and pass it to check_function_format. * c-common.h (check_function_arguments): Add "comp_types" param. (check_function_format): Likewise. * c-format.cc: Include "tree-pretty-print-markup.h". (local_pp_element_ptr_node): New. (PP_FORMAT_CHAR_TABLE): Add entry for %e. (struct format_check_context): Add "m_comp_types" field. (check_function_format): Add "comp_types" param and pass it to check_format_info. (check_format_info): Likewise, passing it to format_ctx's ctor. (check_format_arg): Extract m_
[PATCH v5] RISC-V: use fclass insns to implement isfinite, isnormal and isinf builtins
Changes since v4: - No functional changes. - Use int iterator and attr to implement expanders in md (inspired by loongarch patch. Thx Xi Ruoyao) Changes since v3: - Remove '*' from define_insn for fclass - Remove the dummy expander for fclass. - De-duplicate the expanders code by using a helper which takes fclass val. Changes since v2: - fclass define_insn tightened to check op0 mode "X" with additional expander w/o mode for callers. - builtins expander explicit mode check and FAIL if mode not appropriate. - subreg promoted handling to elide potential extension of ret val. - Added isinf builtin with bimodal return value as others. Changes since v1: - Removed UNSPEC_{INFINITE,ISNORMAL} - Don't hardcode SI in patterns, try to keep X to avoid potential sign extension pitfalls. Implementation-wise this requires skipping the :MODE specifier in match_operand, which is flagged as a missing mode warning. --- Currently these builtins use float compare instructions, which require FP flags to be saved/restored around them. Our perf team complained this could be costly in uarch. RV Base ISA already has FCLASS.{d,s,h} instructions to compare/identify FP values w/o disturbing FP exception flags. Coincidentally, upstream very recently got support for the corresponding optabs. So this just requires wiring up in the backend. Tested for rv64; one additional failure, g++.dg/opt/pr107569.C, needs an upstream ranger fix for the new optab. gcc/ChangeLog: * config/riscv/iterators.md (FCLASS_MASK): New define_int_iterator. (fclass_optab): New define_int_attr. * config/riscv/riscv.md: Add UNSPEC_FCLASS. (fclass): New define_insn. (2): New define_expand template for isfinite, isnormal, isinf. gcc/testsuite/ChangeLog: * gcc.target/riscv/fclass.c: New tests. Signed-off-by: Vineet Gupta --- gcc/config/riscv/iterators.md | 7 +++ gcc/config/riscv/riscv.md | 57 + gcc/testsuite/gcc.target/riscv/fclass.c | 38 + 3 files changed, 102 insertions(+) create mode 100644 gcc/testsuite/gcc.target/riscv/fclass.c diff --git a/gcc/config/riscv/iterators.md b/gcc/config/riscv/iterators.md index 20745faa55ef..20413737f775 100644 --- a/gcc/config/riscv/iterators.md +++ b/gcc/config/riscv/iterators.md @@ -219,6 +219,13 @@ (define_code_attr fix_uns [(fix "fix") (unsigned_fix "fixuns")]) +;; fclass mask values for corresponding builtins +(define_int_iterator FCLASS_MASK [126 66 129]) + +(define_int_attr fclass_optab + [(126"isfinite") + (66 "isnormal") + (129"isinf")]) ;; --- ;; Code Attributes diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md index ff37125e3f28..1e1be1f5c192 100644 --- a/gcc/config/riscv/riscv.md +++ b/gcc/config/riscv/riscv.md @@ -68,6 +68,7 @@ UNSPEC_FMAX UNSPEC_FMINM UNSPEC_FMAXM + UNSPEC_FCLASS ;; Stack tie UNSPEC_TIE @@ -3436,6 +3437,62 @@ (set_attr "mode" "") (set (attr "length") (const_int 16))]) +;; fclass instruction output bitmap +;; 0 negative infinity +;; 1 negative normal number. +;; 2 negative subnormal number. +;; 3 -0 +;; 4 +0 +;; 5 positive subnormal number. +;; 6 positive normal number. +;; 7 positive infinity +;; 8 signaling NaN.
+;; 9 quiet NaN + +(define_insn "fclass" + [(set (match_operand:X0 "register_operand" "=r") + (unspec [(match_operand:ANYF 1 "register_operand" " f")] + UNSPEC_FCLASS))] + "TARGET_HARD_FLOAT" + "fclass.\t%0,%1"; + [(set_attr "type" "fcmp") + (set_attr "mode" "")]) + +;; Implements optab for isfinite, isnormal, isinf + +(define_expand "2" + [(match_operand 0 "register_operand" "=r") + (match_operand:ANYF 1 "register_operand" " f") + (const_int FCLASS_MASK)] + "TARGET_HARD_FLOAT" +{ + if (GET_MODE (operands[0]) != SImode + && GET_MODE (operands[0]) != word_mode) +FAIL; + + rtx t = gen_reg_rtx (word_mode); + rtx t_op0 = gen_reg_rtx (word_mode); + + if (TARGET_64BIT) +emit_insn (gen_fclassdi (t, operands[1])); + else +emit_insn (gen_fclasssi (t, operands[1])); + + riscv_emit_binary (AND, t, t, GEN_INT ()); + rtx cmp = gen_rtx_NE (word_mode, t, const0_rtx); + emit_insn (gen_cstore4 (t_op0, cmp, t, const0_rtx)); + + if (TARGET_64BIT) +{ + t_op0 = gen_lowpart (SImode, t_op0); + SUBREG_PROMOTED_VAR_P (t_op0) = 1; + SUBREG_PROMOTED_SET (t_op0, SRP_SIGNED); +} + + emit_move_insn (operands[0], t_op0); + DONE; +}) + (define_insn "*seq_zero_" [(set (match_operand:GPR 0 "register_operand" "=r") (eq:GPR (match_operand:X 1 "register_operand" " r") diff --git a/gcc/testsuite/gcc.target/riscv/fclass.c b/gcc/testsuite/gcc.target/riscv/fclass.c new file mode 100644 index ..ea0f173ecf4b --- /dev/null +++ b/gcc/testsuite/gcc.target/riscv/fclass.c @@ -0,0 +1,38 @@ +/* { dg-d
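(Editorial aside: where the FCLASS_MASK constants 126/66/129 in the iterator come from, as a C sketch of the fclass result bits, assuming the bit assignments listed in the comment above.)

enum
{
  NEG_INF = 1 << 0, NEG_NORM = 1 << 1, NEG_SUBNORM = 1 << 2, NEG_ZERO = 1 << 3,
  POS_ZERO = 1 << 4, POS_SUBNORM = 1 << 5, POS_NORM = 1 << 6, POS_INF = 1 << 7,
  SNAN = 1 << 8, QNAN = 1 << 9
};

/* isfinite: every class except the infinities and NaNs.  */
_Static_assert ((NEG_NORM | NEG_SUBNORM | NEG_ZERO
                 | POS_ZERO | POS_SUBNORM | POS_NORM) == 126, "isfinite");
/* isnormal: only the two normal-number classes.  */
_Static_assert ((NEG_NORM | POS_NORM) == 66, "isnormal");
/* isinf: both infinities.  */
_Static_assert ((NEG_INF | POS_INF) == 129, "isinf");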
[PATCH] rtl-ssa: Enforce earlyclobbers on hard-coded clobbers [PR115891]
The asm in the testcase has a memory operand and also clobbers ax. The clobber means that ax cannot be used to hold inputs, which extends to the address of the memory. I think I had an implicit assumption that constrain_operands would enforce this, but in hindsight, that clearly wasn't going to be true. constrain_operands only looks at constraints, and these clobbers are by definition outside the constraint system. (And that's why they have to be handled conservatively, since there's no way to distinguish the earlyclobber and non-earlyclobber cases.) The semantics of hard-coded clobbers are generic enough that I think they should be handled directly by rtl-ssa, rather than by consumers. And in the context of rtl-ssa, the easiest way to check for a clash is to walk the list of input registers, which we already have to hand. It therefore seemed better not to push this down to a more generic rtl helper. The patch detects hard-coded clobbers in the same way as regrename: by temporarily stubbing out the operands with pc_rtx. Tested on aarch64-linux-gnu & x86_64-linux-gnu. OK to install? Richard gcc/ PR rtl-optimization/115891 * rtl-ssa/changes.cc (find_clobbered_access): New function. (recog_level2): Use it to check for overlap between input registers and hard-coded clobbers. Conditionally reset recog_data.insn after changing the insn code. gcc/testsuite/ PR rtl-optimization/115891 * gcc.target/i386/pr115891.c: New test. --- gcc/rtl-ssa/changes.cc | 60 +++- gcc/testsuite/gcc.target/i386/pr115891.c | 10 2 files changed, 69 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr115891.c diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc index 6b6f7cd5d3a..a163c6d3bec 100644 --- a/gcc/rtl-ssa/changes.cc +++ b/gcc/rtl-ssa/changes.cc @@ -944,6 +944,25 @@ add_clobber (insn_change &change, add_regno_clobber_fn add_regno_clobber, return true; } +// See if PARALLEL pattern PAT clobbers any of the registers in ACCESSES. +// Return one such access if so, otherwise return null. +static access_info * +find_clobbered_access (access_array accesses, rtx pat) +{ + rtx subpat; + for (int i = 0; i < XVECLEN (pat, 0); ++i) +if (GET_CODE (subpat = XVECEXP (pat, 0, i)) == CLOBBER) + { + rtx x = XEXP (subpat, 0); + if (REG_P (x)) + for (auto *access : accesses) + if (access->regno () >= REGNO (x) + && access->regno () < END_REGNO (x)) + return access; + } + return nullptr; +} + // Try to recognize the new form of the insn associated with CHANGE, // adding any clobbers that are necessary to make the instruction match // an .md pattern. Return true on success. @@ -1035,9 +1054,48 @@ recog_level2 (insn_change &change, add_regno_clobber_fn add_regno_clobber) pat = newpat; } + INSN_CODE (rtl) = icode; + if (recog_data.insn == rtl) +recog_data.insn = nullptr; + + // See if the pattern contains any hard-coded clobbers of registers + // that are also inputs to the instruction. The standard rtl semantics + // treat such clobbers as earlyclobbers, since there is no way of proving + // which clobbers conflict with the inputs and which don't. + // + // (Non-hard-coded clobbers are handled by constraint satisfaction instead.) + rtx subpat; + if (GET_CODE (pat) == PARALLEL) +for (int i = 0; i < XVECLEN (pat, 0); ++i) + if (GET_CODE (subpat = XVECEXP (pat, 0, i)) == CLOBBER + && REG_P (XEXP (subpat, 0))) + { + // Stub out all operands, so that can tell which registers + // are hard-coded. 
+ extract_insn (rtl); + for (int j = 0; j < recog_data.n_operands; ++j) + *recog_data.operand_loc[j] = pc_rtx; + + auto *use = find_clobbered_access (change.new_uses, pat); + + // Restore the operands. + for (int j = 0; j < recog_data.n_operands; ++j) + *recog_data.operand_loc[j] = recog_data.operand[j]; + + if (use) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "register %d is both clobbered" + " and used as an input:\n", use->regno ()); + print_rtl_single (dump_file, pat); + } + return false; + } + } + // check_asm_operands checks the constraints after RA, so we don't // need to do it again. - INSN_CODE (rtl) = icode; if (reload_completed && !asm_p) { extract_insn (rtl); diff --git a/gcc/testsuite/gcc.target/i386/pr115891.c b/gcc/testsuite/gcc.target/i386/pr115891.c new file mode 100644 index 000..b1a1b159c5d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr115891.c @@ -0,0 +1,10 @@ +/* { dg-require-effective-target ia32 } */ +/* { dg-options "-O2" } */ + +void __attribute__((regparm(1))) +f (int *ptr)
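(Editorial aside: a minimal illustration of the conflict being enforced, hypothetical code in the spirit of the PR rather than the committed test. The "eax" clobber lives outside the constraint system, so nothing in constrain_operands prevents the address of *p from being allocated to eax; rtl-ssa must treat the clobber as an earlyclobber itself.)

void
poke (int *p)
{
  /* A memory operand whose address needs a register, plus a hard-coded
     clobber of eax: the two must not end up in the same register.  */
  asm volatile ("" : "+m" (*p) : : "eax");
}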
[PATCH] recog: restrict paradoxical mode punning in insn_propagation [PR115901]
In g:44fc801e97a8dc626a4806ff4124439003420b20 I'd extended insn_propagation to handle simple cases of hard-reg mode punning. One of the checks was that the new use mode occupied the same number of registers as the original definition mode. However, as PR115901 shows, we need to avoid increasing the size of any registers in the punned "to" expression as well. Specifically, the test includes a DImode move from GPR x0 to a vector register, followed by a V2DI use of the vector register. The simplification would then create a V2DI spanning x0 and x1, manufacturing a new, unwanted use of x1. Checking for that kind of thing directly seems too cumbersome, and is not related to the original motivation (which was to improve handling of shared vector zeros on aarch64). This patch therefore restricts the paradoxical case to constants. Tested on aarch64-linux-gnu & x86_64-linux-gnu. OK to install? Richard gcc/ PR rtl-optimization/115901 * recog.cc (insn_propagation::apply_to_rvalue_1): Restrict paradoxical mode punning to cases where "to" is constant. gcc/testsuite/ PR rtl-optimization/115901 * gcc.dg/torture/pr115901.c: New test. --- gcc/recog.cc| 8 gcc/testsuite/gcc.dg/torture/pr115901.c | 14 ++ 2 files changed, 22 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/torture/pr115901.c diff --git a/gcc/recog.cc b/gcc/recog.cc index 7710c55b745..54b317126c2 100644 --- a/gcc/recog.cc +++ b/gcc/recog.cc @@ -1082,6 +1082,14 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc) || !REG_CAN_CHANGE_MODE_P (REGNO (x), GET_MODE (from), GET_MODE (x))) return false; + /* If the reference is paradoxical and the replacement +value contains registers, we would need to check that the +simplification below does not increase REG_NREGS for those +registers either. It seems simpler to punt on nonconstant +values instead. */ + if (paradoxical_subreg_p (GET_MODE (x), GET_MODE (from)) + && !CONSTANT_P (to)) + return false; newval = simplify_subreg (GET_MODE (x), to, GET_MODE (from), subreg_lowpart_offset (GET_MODE (x), GET_MODE (from))); diff --git a/gcc/testsuite/gcc.dg/torture/pr115901.c b/gcc/testsuite/gcc.dg/torture/pr115901.c new file mode 100644 index 000..244af857d88 --- /dev/null +++ b/gcc/testsuite/gcc.dg/torture/pr115901.c @@ -0,0 +1,14 @@ +/* { dg-additional-options "-ftrivial-auto-var-init=zero" } */ + +int p; +void g(long); +#define vec16 __attribute__((vector_size(16))) + +void l(vec16 long *); +void h() +{ + long inv1; + vec16 long inv = {p, inv1}; + g (p); + l(&inv); +} -- 2.25.1
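(Editorial aside: the register-count arithmetic behind the fix, as a runnable C sketch; the helper mirrors what REG_NREGS computes, namely mode size divided by word size, rounded up.)

#include <assert.h>

static unsigned
nregs (unsigned mode_size, unsigned word_size)
{
  return (mode_size + word_size - 1) / word_size;
}

int
main (void)
{
  assert (nregs (8, 8) == 1);    /* DImode x0 occupies one register...   */
  assert (nregs (16, 8) == 2);   /* ...but a V2DI pun at x0 would span
                                    x0 and x1, manufacturing a new use of
                                    x1 -- hence the constants-only rule.  */
  return 0;
}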
Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing
> > Hi, Richard, > > > > Let me explain an idea that had to be chosen for lane-reducing. The key > > complication is that these ops are associated with two kinds of vec_nums: > > one is the number of effective vector stmts, which is used by partial > > vectorization > > functions such as vect_get_loop_mask. The other is the number of total created > > vector stmts. Now we should make the latter aligned with normal ops, in order to > > interoperate with them. Suppose expressions mixing lane-reducing > > and normal ops such as: > > > > temp = lane_reducing<16*char> + expr<4*int>; > > temp = cst<4*int> * lane_reducing<16*char>; > > > > If we only generate the effective vector stmts for lane_reducing, vector def/use > > between ops will never match, so extra pass-through copies are > > necessary. This is why we say "refit a lane-reducing to be a fake normal > > op". And this only happens in vect_transform_reduction, right? Yes, it is. > > The other pre-existing issue is that for single_defuse_cycle optimization > SLP_TREE_NUMBER_OF_VEC_STMTS is also off (too large). But here > the transform also goes through vect_transform_reduction. > > > The requirement for the two vec_nums is independent of how we will implement > > SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor vect code > > to unify ncopies/vec_num computation and completely remove > > SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a large task, and might > > be overkill for these lane-reducing patches. So I will keep it as before, and > > do > > not touch it, as I have done in this patch. > > > > Since one SLP_TREE_NUMBER_OF_VEC_STMTS cannot be used for two purposes, > > your previous suggestion might not work: > > > > > As said, I don't like this much. vect_slp_analyze_node_operations_1 sets > > > this > > > and I think the existing "exception" > > > > > > /* Calculate the number of vector statements to be created for the > > > scalar stmts in this node. For SLP reductions it is equal to the > > > number of vector statements in the children (which has already been > > > calculated by the recursive call). Otherwise it is the number of > > > scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by > > > VF divided by the number of elements in a vector. */ > > > if (SLP_TREE_CODE (node) != VEC_PERM_EXPR > > > && !STMT_VINFO_DATA_REF (stmt_info) > > > && REDUC_GROUP_FIRST_ELEMENT (stmt_info)) > > >{ > > > for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i) > > >if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == > > > vect_internal_def) > > > { > > >SLP_TREE_NUMBER_OF_VEC_STMTS (node) > > > = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]); > > >break; > > > } > > >} > > > > > > could be changed (or amended if replacing doesn't work out) to > > > > > > if (SLP_TREE_CODE (node) != VEC_PERM_EXPR > > > && STMT_VINFO_REDUC_IDX (stmt_info) > > > // do we have this always set? > > > && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info)) > > >{ > > > do the same as in else {} but using VECTYPE_IN > > >} > > > > > > Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN > > > when that's set instead of SLP_TREE_VECTYPE? As said having wrong > > > SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire. > > > > Then the alternative is to limit special handling related to the vec_num only > > inside vect_transform_reduction. Is that ok? Or any other suggestion?
> I think that's kind-of in line with the suggestion of a reduction > specific VF, so yes, > not using SLP_TREE_NUMBER_OF_VEC_STMTS in vect_transform_reduction > sounds fine to me and would be a step towards not having > SLP_TREE_NUMBER_OF_VEC_STMTS > where the function would be responsible for appropriate allocation as well. OK. I remade the 4 patches and sent them in new emails. Thanks, Feng > From: Richard Biener > Sent: Thursday, July 11, 2024 5:43 PM > To: Feng Xue OS; Richard Sandiford > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp > reduction with lane-reducing > > On Thu, Jul 11, 2024 at 10:53 AM Feng Xue OS > wrote: > > > > Vector stmts number of an operation is calculated based on output vectype. > > This is over-estimated for lane-reducing operation. Sometimes, to workaround > > the issue, we have to rely on additional logic to deduce an exactly accurate > > number by other means. Aiming at the inconvenience, in this patch, we would > > "turn" lane-reducing operation into a normal one by inserting new trivial > > statements like zero-valued PHIs and pass-through copies, which could be > > optimized away by later passes. At the same time, a new field is added for > > slp node to hold number of vector stmts that are really effective after > > vectorization. For example: > Adding Richard in
[PATCH 1/4] vect: Add a unified vect_get_num_copies for slp and non-slp
Extend original vect_get_num_copies (pure loop-based) to calculate number of vector stmts for slp node regarding a generic vect region. Thanks, Feng --- gcc/ * tree-vectorizer.h (vect_get_num_copies): New overload function. (vect_get_slp_num_vectors): New function. * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate number of vector stmts for slp node with vect_get_num_copies. (vect_slp_analyze_node_operations): Calculate number of vector elements for constant/external slp node with vect_get_num_copies. --- gcc/tree-vect-slp.cc | 19 +++ gcc/tree-vectorizer.h | 29 - 2 files changed, 31 insertions(+), 17 deletions(-) diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index d0a8531fd3b..4dadbc6854d 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -6573,17 +6573,7 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node, } } else -{ - poly_uint64 vf; - if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) - vf = loop_vinfo->vectorization_factor; - else - vf = 1; - unsigned int group_size = SLP_TREE_LANES (node); - tree vectype = SLP_TREE_VECTYPE (node); - SLP_TREE_NUMBER_OF_VEC_STMTS (node) - = vect_get_num_vectors (vf * group_size, vectype); -} +SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node); /* Handle purely internal nodes. */ if (SLP_TREE_CODE (node) == VEC_PERM_EXPR) @@ -6851,12 +6841,9 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node, && j == 1); continue; } - unsigned group_size = SLP_TREE_LANES (child); - poly_uint64 vf = 1; - if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) - vf = loop_vinfo->vectorization_factor; + SLP_TREE_NUMBER_OF_VEC_STMTS (child) - = vect_get_num_vectors (vf * group_size, vector_type); + = vect_get_num_copies (vinfo, child); /* And cost them. */ vect_prologue_cost_for_slp (child, cost_vec); } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 8eb3ec4df86..09923b9b440 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2080,6 +2080,33 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype) return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant (); } +/* Return the number of vectors in the context of vectorization region VINFO, + needed for a group of total SIZE statements that are supposed to be + interleaved together with no gap, and all operate on vectors of type + VECTYPE. If NULL, SLP_TREE_VECTYPE of NODE is used. */ + +inline unsigned int +vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL) +{ + poly_uint64 vf; + + if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) +vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + else +vf = 1; + + if (node) +{ + vf *= SLP_TREE_LANES (node); + if (!vectype) + vectype = SLP_TREE_VECTYPE (node); +} + else +gcc_checking_assert (vectype); + + return vect_get_num_vectors (vf, vectype); +} + /* Return the number of copies needed for loop vectorization when a statement operates on vectors of type VECTYPE. This is the vectorization factor divided by the number of elements in @@ -2088,7 +2115,7 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype) inline unsigned int vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype) { - return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo), vectype); + return vect_get_num_copies (loop_vinfo, NULL, vectype); } /* Update maximum unit count *MAX_NUNITS so that it accounts for -- 2.17.1
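(Editorial aside: a worked example of what the new overload computes, as a stand-alone C mirror of the arithmetic; the real function of course operates on GCC's poly_uint64/vec_info types.)

#include <assert.h>

/* vf scalar iterations times group-size lanes, divided by the number of
   elements one vector holds (assumed to divide exactly).  */
static unsigned
num_copies (unsigned vf, unsigned lanes, unsigned nunits)
{
  assert (vf * lanes % nunits == 0);
  return vf * lanes / nunits;
}

int
main (void)
{
  assert (num_copies (8, 2, 4) == 4);    /* VF=8, 2-lane SLP node, V4SI */
  assert (num_copies (16, 1, 16) == 1);  /* VF=16, single lane, V16QI  */
  return 0;
}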
[PATCH 2/4] vect: Refit lane-reducing to be normal operation
The number of vector stmts for an operation is calculated based on the output vectype. This is over-estimated for a lane-reducing operation, which would cause vector def/use mismatches when we want to support loop reductions that mix lane-reducing and normal operations. One solution is to refit lane-reducing to make it behave like a normal one, by adding new pass-through copies to fix the possible def/use gap; the resulting superfluous statements can be optimized away after vectorization. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod } The vector size is 128-bit, the vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 1 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 Thanks, Feng --- gcc/ * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Calculate effective vector stmts number with generic vect_get_num_copies. (vect_transform_reduction): Insert copies for lane-reducing so as to fix over-estimated vector stmts number. (vect_transform_cycle_phi): Calculate vector PHI number only based on output vectype. * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove adjustment on vector stmts number specific to slp reduction. --- gcc/tree-vect-loop.cc | 134 +++--- gcc/tree-vect-slp.cc | 27 +++-- 2 files changed, 121 insertions(+), 40 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index a64b5082bd1..5ac83e76975 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7468,12 +7468,8 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, = get_masked_reduction_fn (reduc_fn, vectype_in); vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); - unsigned nvectors; - - if (slp_node) - nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); - else - nvectors = vect_get_num_copies (loop_vinfo, vectype_in); + unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node, + vectype_in); if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS) vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1); @@ -8595,12 +8591,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo, stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)); gphi *reduc_def_phi = as_a (phi_info->stmt); int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info); - tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); + + if (!vectype_in) +vectype_in = STMT_VINFO_VECTYPE (stmt_info); if (slp_node) { ncopies = 1; - vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in); } else { @@ -8658,13 +8657,40 @@ vect_transform_reduction (loop_vec_info loop_vinfo, bool lane_reducing = lane_reducing_op_p (code); gcc_assert (single_defuse_cycle || lane_reducing); + if (lane_reducing) +{ + /* The last operand of lane-reducing op is for reduction.
*/ + gcc_assert (reduc_index == (int) op.num_ops - 1); +} + /* Create the destination vector */ tree scalar_dest = gimple_get_lhs (stmt_info->stmt); tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out); + if (lane_reducing && !slp_node && !single_defuse_cycle) +{ + /* Note: there are still vectorizable cases that can not be handled by +single-lane slp. Probably it would take some time to evolve the +feature to a mature state. So we have to keep the below non-slp code +path as failsafe for lane-reducing support. */ + gcc_assert (op.num_ops <= 3); + for (unsigned i = 0; i < op.num_ops; i++) + { + unsigned oprnd_ncopies = ncopies; + + if ((int) i == reduc_index) + { + tree vectype = STMT_VINFO_VECTYPE (stmt_info); + oprnd_ncopies = vect_get_num_copies (loop_vinfo, vectype); + } + + vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, oprnd_ncopies, +op.ops[i], &vec_oprnds[i]); + } +} /* Get NCOPIES vector definitions for all operands except the reduction definition. */ - if (!cond_fn_p) + else if (!cond_fn_p) { gcc_assert (reduc_index >= 0 && reduc_index <= 2); vect_get_vec_defs (loop_vinf
[PATCH 3/4] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction, the current vectorizer can only handle the pattern if the reduction chain contains no other operation, whether normal or lane-reducing. This patch removes some constraints in reduction analysis to allow multiple arbitrary lane-reducing operations with mixed input vectypes in a loop reduction chain. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad } The vector size is 128-bit; the vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 1 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + sum_v1 Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (vectorizable_lane_reducing): New function declaration. * tree-vect-stmts.cc (vect_analyze_stmt): Call new function vectorizable_lane_reducing to analyze lane-reducing operation. * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation code related to emulated_mixed_dot_prod. (vectorizable_lane_reducing): New function. (vectorizable_reduction): Allow multiple lane-reducing operations in loop reduction. Move some original lane-reducing related code to vectorizable_lane_reducing. (vect_transform_reduction): Adjust comments with updated example.
gcc/testsuite/ PR tree-optimization/114440 * gcc.dg/vect/vect-reduc-chain-1.c * gcc.dg/vect/vect-reduc-chain-2.c * gcc.dg/vect/vect-reduc-chain-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c * gcc.dg/vect/vect-reduc-dot-slp-1.c --- .../gcc.dg/vect/vect-reduc-chain-1.c | 64 + .../gcc.dg/vect/vect-reduc-chain-2.c | 79 ++ .../gcc.dg/vect/vect-reduc-chain-3.c | 68 + .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c | 95 +++ .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c | 67 + .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c | 79 ++ .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c | 63 + .../gcc.dg/vect/vect-reduc-dot-slp-1.c| 60 + gcc/tree-vect-loop.cc | 240 +- gcc/tree-vect-stmts.cc| 2 + gcc/tree-vectorizer.h | 2 + 11 files changed, 750 insertions(+), 69 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c new file mode 100644 index 000..80b0089ea0f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c @@ -0,0 +1,64 @@ +/* Disabling epilogues until we find a better way to deal with scans. */ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#define N 50 + +#ifndef SIGNEDNESS_1 +#define SIGNEDNESS_1 signed +#define SIGNEDNESS_2 signed +#endif + +SIGNEDNESS_1 int __attribute__ ((noipa)) +f (SIGNEDNESS_1 int res, + SIGNEDNESS_2 char *restrict a, + SIGNEDNESS_2 char *restrict b, + SIGNEDNESS_2 char *restrict c, + SIGNEDNESS_2 char *restrict d, + SIGNEDNESS_1 int *restrict e) +{ + for (int i = 0; i < N; ++i) +{ + res += a[i] * b[i]; + res += c[i] * d[i]; + r
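The mixed chain from the commit message can also be written out as a compilable unit; a sketch assuming signed char inputs (names invented for illustration, not taken from the testsuite):

#include <stdlib.h>

#define N 1024

/* One reduction chain mixing the three lane-reducing patterns named
   above: dot-prod, widen-sum and sad.  */
__attribute__ ((noipa)) int
mixed_chain (signed char *d0, signed char *d1, signed char *w,
             signed char *s0, signed char *s1)
{
  int sum = 1;
  for (int i = 0; i < N; i++)
    {
      sum += d0[i] * d1[i];       /* dot-prod  */
      sum += w[i];                /* widen-sum */
      sum += abs (s0[i] - s1[i]); /* sad       */
    }
  return sum;
}

int
main (void)
{
  signed char d0[N], d1[N], w[N], s0[N], s1[N];
  for (int i = 0; i < N; i++)
    {
      d0[i] = i & 3;
      d1[i] = (i >> 2) & 3;
      w[i] = i & 1;
      s0[i] = i & 7;
      s1[i] = (i >> 1) & 7;
    }
  return mixed_chain (d0, d1, w, s0, s1) > 0 ? 0 : 1;
}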
[PATCH 4/4] vect: Optimize order of lane-reducing statements in loop def-use cycles
When transforming multiple lane-reducing operations in a loop reduction chain, the corresponding vectorized statements were originally generated into def-use cycles starting from 0. A def-use cycle with a smaller index would contain more statements, which means more instruction dependencies. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad sum += n[i]; // normal } Original transformation result: for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy ... } For higher instruction parallelism in the final vectorized loop, the optimal approach is to distribute the effective vector lane-reducing ops evenly among all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM and the SADs are generated into disparate cycles, so the instruction dependencies among them can be eliminated. for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = sum_v1; // copy sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2); sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3); ... } Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing statements in an optimized order. --- gcc/tree-vect-loop.cc | 64 ++- gcc/tree-vectorizer.h | 6 2 files changed, 63 insertions(+), 7 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index e72d692ffa3..5bc6e526d43 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8841,6 +8841,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad + sum += n[i]; // normal } The vector size is 128-bit,vectorization factor is 16. Reduction @@ -8858,19 +8859,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); - sum_v1 = sum_v1; // copy + sum_v0 = sum_v0; // copy + sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); - sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); - sum_v2 = sum_v2; // copy + sum_v0 = sum_v0; // copy + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); sum_v3 = sum_v3; // copy + + sum_v0 += n_v0[i: 0 ~ 3 ]; + sum_v1 += n_v1[i: 4 ~ 7 ]; + sum_v2 += n_v2[i: 8 ~ 11]; + sum_v3 += n_v3[i: 12 ~ 15]; } - sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + sum_v1 - */ +Moreover, for a higher instruction parallelism in final vectorized +loop, it is considered to make those effective vector lane-reducing +ops be distributed evenly among all def-use cycles.
In the above +example, DOT_PROD, WIDEN_SUM and SADs are generated into disparate +cycles, instruction dependency among them could be eliminated. */ unsigned effec_ncopies = vec_oprnds[0].length (); unsigned total_ncopies = vec_oprnds[reduc_index].length (); @@ -8884,6 +8893,47 @@ vect_transform_reduction (loop_vec_info loop_vinfo, vec_oprnds[i].safe_grow_cleared (total_ncopies); } } + + tree reduc_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); + gcc_assert (reduc_vectype_in); + + unsigned effec_reduc_ncopies + = vect_get_num_copies (loop_vinfo, slp_no
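The instruction-parallelism argument here is the same one that applies to hand-written reductions; a scalar analogy (illustrative only, not vectorizer output):

/* Four independent accumulators remove the serial dependency that a
   single accumulator would impose, just as spreading the lane-reducing
   ops over the def-use cycles does for the reduction PHIs above.  */
int
sum4 (const int *a, int n)
{
  int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  int i;
  for (i = 0; i + 4 <= n; i += 4)
    {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
  for (; i < n; i++)
    s0 += a[i];
  return s0 + s1 + s2 + s3;
}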
Re: [Patch, fortran] PR84868 - [11/12/13/14/15 Regression] ICE in gfc_conv_descriptor_offset, at fortran/trans-array.c:208
Hi All, Harald has pointed out that I attached the ChangeLog twice and the patch not at all :-( Please find the patch duly attached. Paul On Sat, 13 Jul 2024 at 10:58, Paul Richard Thomas < paul.richard.tho...@gmail.com> wrote: > Hi All, > > After messing around with argument mapping, where I found and fixed > another bug, I realised that the problem lay with simplification of > len_trim with an argument that is the element of a parameter array. The fix > was then a straightforward lift of existing code in expr.cc. The mapping > bug is also fixed by supplying the se string length when building character > typespecs. > > Regtests just fine. OK for mainline? I believe that this is safe for > backporting to 14-branch before the 14.2 release - thoughts? > > Regards > > Paul > diff --git a/gcc/fortran/simplify.cc b/gcc/fortran/simplify.cc index 7a5d31c01a6..5199ede98fe 100644 --- a/gcc/fortran/simplify.cc +++ b/gcc/fortran/simplify.cc @@ -4637,6 +4637,76 @@ gfc_simplify_len_trim (gfc_expr *e, gfc_expr *kind) if (k == -1) return &gfc_bad_expr; + if (e->expr_type == EXPR_VARIABLE + && e->ts.type == BT_CHARACTER + && e->symtree->n.sym->attr.flavor == FL_PARAMETER + && e->ref && e->ref->type == REF_ARRAY + && e->ref->u.ar.dimen_type[0] == DIMEN_ELEMENT + && e->symtree->n.sym->value) +{ + char name[GFC_MAX_SYMBOL_LEN + 10]; + gfc_namespace *ns = e->symtree->n.sym->ns; + gfc_symtree *st; + gfc_expr *expr; + gfc_expr *p; + gfc_constructor *c; + int cnt = 0; + + sprintf (name, "_len_trim_%s", e->symtree->n.sym->name); + st = gfc_find_symtree (ns->sym_root, name); + if (st) + goto already_built; + + /* Recursively call this fcn to simplify the constructor elements. */ + expr = gfc_copy_expr (e->symtree->n.sym->value); + expr->ts.type = BT_INTEGER; + expr->ts.kind = k; + expr->ts.u.cl = NULL; + expr->rank = 1; + c = gfc_constructor_first (expr->value.constructor); + for (; c; c = gfc_constructor_next (c)) + { + if (c->iterator) + continue; + + if (c->expr && c->expr->ts.type == BT_CHARACTER) + { + p = gfc_simplify_len_trim (c->expr, kind); + if (p == NULL) + goto clean_up; + gfc_replace_expr (c->expr, p); + cnt++; + } + } + + if (cnt) + { + /* Build a new parameter to take the result. */ + st = gfc_new_symtree (&ns->sym_root, name); + st->n.sym = gfc_new_symbol (st->name, ns); + st->n.sym->value = expr; + st->n.sym->ts = expr->ts; + st->n.sym->attr.dimension = 1; + st->n.sym->attr.save = SAVE_IMPLICIT; + st->n.sym->attr.flavor = FL_PARAMETER; + st->n.sym->as = gfc_copy_array_spec (e->symtree->n.sym->as); + gfc_set_sym_referenced (st->n.sym); + st->n.sym->refs++; + +already_built: + /* Build a return expression. 
*/ + expr = gfc_copy_expr (e); + expr->ts = st->n.sym->ts; + expr->symtree = st; + expr->rank = 0; + return expr; + } + +clean_up: + gfc_free_expr (expr); + return NULL; +} + if (e->expr_type != EXPR_CONSTANT) return NULL; diff --git a/gcc/fortran/trans-expr.cc b/gcc/fortran/trans-expr.cc index 477c2720187..fe872a661ec 100644 --- a/gcc/fortran/trans-expr.cc +++ b/gcc/fortran/trans-expr.cc @@ -4490,12 +4490,15 @@ gfc_get_interface_mapping_charlen (gfc_interface_mapping * mapping, static tree gfc_get_interface_mapping_array (stmtblock_t * block, gfc_symbol * sym, - gfc_packed packed, tree data) + gfc_packed packed, tree data, tree len) { tree type; tree var; - type = gfc_typenode_for_spec (&sym->ts); + if (len != NULL_TREE && (TREE_CONSTANT (len) || VAR_P (len))) +type = gfc_get_character_type_len (sym->ts.kind, len); + else +type = gfc_typenode_for_spec (&sym->ts); type = gfc_get_nodesc_array_type (type, sym->as, packed, !sym->attr.target && !sym->attr.pointer && !sym->attr.proc_pointer); @@ -4642,7 +4645,8 @@ gfc_add_interface_mapping (gfc_interface_mapping * mapping, convert it to a boundless character type. */ else if (!sym->attr.dimension && sym->ts.type == BT_CHARACTER) { - tmp = gfc_get_character_type_len (sym->ts.kind, NULL); + se->string_length = gfc_evaluate_now (se->string_length, &se->pre); + tmp = gfc_get_character_type_len (sym->ts.kind, se->string_length); tmp = build_pointer_type (tmp); if (sym->attr.pointer) value = build_fold_indirect_ref_loc (input_location, @@ -4661,7 +4665,7 @@ gfc_add_interface_mapping (gfc_interface_mapping * mapping, /* For character(*), use the actual argument's descriptor. */ else if (sym->ts.type == BT_CHARACTER && !new_sym->ts.u.cl->length) value = build_fold_indirect_ref_loc (input_location, - se->expr); + se->expr); /* If the argument is an array descriptor, use it to determine information about the actual argument's shape. */ @@ -4675,7 +4679,8 @@ gfc_add_interface_mapping (gfc_interface_ma
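For readers following the simplify.cc hunk, the control flow reduces to a lookup-or-build memoization keyed on a derived symbol name; a condensed sketch using the same entry points as the patch (not standalone code, and with error handling and attribute setup elided):

/* Sketch only: simplify LEN_TRIM over the PARAMETER array once, cache
   the reduced constructor under a derived name, and let later queries
   reuse the cached symtree.  */
char name[GFC_MAX_SYMBOL_LEN + 10];
gfc_symtree *st;

sprintf (name, "_len_trim_%s", e->symtree->n.sym->name);
st = gfc_find_symtree (ns->sym_root, name);
if (!st)
  {
    st = gfc_new_symtree (&ns->sym_root, name);
    st->n.sym = gfc_new_symbol (st->name, ns);
    /* ... simplify each constructor element and attach the result,
       as done in the patch ...  */
  }
/* Either way, build and return an expression referencing ST.  */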
[PATCH v2] c: Add support for byte arrays in C2Y
This marks structures that include a byte array as typeless storage for all C language modes. Bootstrapped and regression-tested on x86_64. c: Add support for byte arrays in C2Y Getting correct aliasing behavior requires that structures and unions that contain a byte array, i.e. an array of non-atomic character type (N3254), are marked with TYPE_TYPELESS_STORAGE. This change also affects earlier language modes. gcc/c/ * c-decl.cc (grokdeclarator, finish_struct): Set and propagate TYPE_TYPELESS_STORAGE. gcc/testsuite/ * gcc.dg/c2y-byte-alias-1.c: New test. * gcc.dg/c2y-byte-alias-2.c: New test. * gcc.dg/c2y-byte-alias-3.c: New test. diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc index 97f1d346835..7b8f622296f 100644 --- a/gcc/c/c-decl.cc +++ b/gcc/c/c-decl.cc @@ -7502,12 +7502,18 @@ grokdeclarator (const struct c_declarator *declarator, modify the shared type, so we gcc_assert (itype) below. */ { + /* Identify typeless storage as introduced in C2Y + and supported also in earlier language modes. */ + bool typeless = (char_type_p (type) +&& !(type_quals & TYPE_QUAL_ATOMIC)) + || (AGGREGATE_TYPE_P (type) + && TYPE_TYPELESS_STORAGE (type)); + addr_space_t as = DECODE_QUAL_ADDR_SPACE (type_quals); if (!ADDR_SPACE_GENERIC_P (as) && as != TYPE_ADDR_SPACE (type)) type = build_qualified_type (type, ENCODE_QUAL_ADDR_SPACE (as)); - - type = build_array_type (type, itype); + type = build_array_type (type, itype, typeless); } if (type != error_mark_node) @@ -9659,6 +9665,10 @@ finish_struct (location_t loc, tree t, tree fieldlist, tree attributes, if (DECL_NAME (x) || RECORD_OR_UNION_TYPE_P (TREE_TYPE (x))) saw_named_field = true; + + if (AGGREGATE_TYPE_P (TREE_TYPE (x)) + && TYPE_TYPELESS_STORAGE (TREE_TYPE (x))) + TYPE_TYPELESS_STORAGE (t) = true; } detect_field_duplicates (fieldlist); @@ -9859,6 +9869,7 @@ finish_struct (location_t loc, tree t, tree fieldlist, tree attributes, TYPE_FIELDS (x) = TYPE_FIELDS (t); TYPE_LANG_SPECIFIC (x) = TYPE_LANG_SPECIFIC (t); TYPE_TRANSPARENT_AGGR (x) = TYPE_TRANSPARENT_AGGR (t); + TYPE_TYPELESS_STORAGE (x) = TYPE_TYPELESS_STORAGE (t); C_TYPE_FIELDS_READONLY (x) = C_TYPE_FIELDS_READONLY (t); C_TYPE_FIELDS_VOLATILE (x) = C_TYPE_FIELDS_VOLATILE (t); C_TYPE_FIELDS_NON_CONSTEXPR (x) = C_TYPE_FIELDS_NON_CONSTEXPR (t); diff --git a/gcc/testsuite/gcc.dg/c2y-byte-alias-1.c b/gcc/testsuite/gcc.dg/c2y-byte-alias-1.c new file mode 100644 index 000..30bc2c09c2f --- /dev/null +++ b/gcc/testsuite/gcc.dg/c2y-byte-alias-1.c @@ -0,0 +1,46 @@ +/* { dg-do run } */ +/* { dg-options "-std=c2y -O2" } */ + +struct f { _Alignas(int) char buf[sizeof(int)]; }; +struct f2 { struct f x; }; +union g { _Alignas(int) char buf[sizeof(int)]; }; + +[[gnu::noinline]] +int foo(struct f *p, int *q) +{ + *q = 1; + *p = (struct f){ }; + return *q; +} + +[[gnu::noinline]] +int foo2(struct f2 *p, int *q) +{ + *q = 1; + *p = (struct f2){ }; + return *q; +} + +[[gnu::noinline]] +int bar(union g *p, int *q) +{ + *q = 1; + *p = (union g){ }; + return *q; +} + + +int main() +{ + struct f p; + if (0 != foo(&p, (void*)&p.buf)) + __builtin_abort(); + + struct f2 p2; + if (0 != foo2(&p2, (void*)&p2.x.buf)) + __builtin_abort(); + + union g q; + if (0 != bar(&q, (void*)&q.buf)) + __builtin_abort(); +} diff --git a/gcc/testsuite/gcc.dg/c2y-byte-alias-2.c b/gcc/testsuite/gcc.dg/c2y-byte-alias-2.c new file mode 100644 index 000..9bd2d18b386 --- /dev/null +++ b/gcc/testsuite/gcc.dg/c2y-byte-alias-2.c @@ -0,0 +1,43 @@ +/* { dg-do run } */ +/* { dg-options "-std=c2y -O2" } */
+ +struct f2 { + struct f { + _Alignas(int) char buf[sizeof(int)]; + } x[2]; + int i; +}; + +[[gnu::noinline]] +int foo2(struct f2 *p, int *q) +{ + *q = 1; + *p = (struct f2){ }; + return *q; +} + +struct g2 { + union g { + _Alignas(int) char buf[sizeof(int)]; + } x[2]; + int i; +}; + +[[gnu::noinline]] +int bar2(struct g2 *p, int *q) +{ + *q = 1; + *p = (struct g2){ }; + return *q; +} + +int main() +{ + struct f2 p2; + if (0 != foo2(&p2, (void*)&p2.x[0].buf)) + __builtin_abort(); + + struct g2 q2; + if (0 != bar2(&q2, (void*)&q2.x[0].buf)) + __builtin_abort(); +} diff --gi
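Condensing what the new tests check into one line of reasoning: after this change, an aggregate store through the struct must be treated as aliasing an int written into its byte array. A minimal standalone variant (same shape as the tests above, with names changed; not part of the patch):

#include <stdlib.h>

struct arena { _Alignas (int) char buf[sizeof (int)]; };

/* Without typeless storage the optimizer could assume *q and *p do not
   alias and return the stale 1; with it, the aggregate store is seen
   as clobbering *q, so the reload yields 0.  */
__attribute__ ((noinline)) int
reuse (struct arena *p, int *q)
{
  *q = 1;
  *p = (struct arena) { };
  return *q;
}

int
main (void)
{
  struct arena a;
  if (reuse (&a, (void *) a.buf) != 0)
    abort ();
  return 0;
}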
Re: [PATCH v9 05/10] C: Implement musttail attribute for returns
Here's an updated patch with your feedback addressed. Is this version ok? The common code is in the C++ patch. --- Implement a C23, clang-compatible musttail attribute in the C parser, similar to the earlier C++ implementation. gcc/c/ChangeLog: PR c/83324 * c-parser.cc (struct attr_state): Define with musttail_p. (c_parser_statement_after_labels): Handle [[musttail]]. (c_parser_std_attribute): Ditto. (c_parser_handle_musttail): Ditto. (c_parser_compound_statement_nostart): Ditto. (c_parser_all_labels): Ditto. (c_parser_statement): Ditto. * c-tree.h (c_finish_return): Add musttail_p flag. * c-typeck.cc (c_finish_return): Handle musttail_p flag. --- gcc/c/c-parser.cc | 70 ++- gcc/c/c-tree.h| 2 +- gcc/c/c-typeck.cc | 7 +++-- 3 files changed, 63 insertions(+), 16 deletions(-) diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index 8c4e697a4e10..9cb4d5d932ad 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -1621,6 +1621,12 @@ struct omp_for_parse_data { bool fail : 1; }; +struct attr_state +{ + /* True if we parsed a musttail attribute for return. */ + bool musttail_p; +}; + static bool c_parser_nth_token_starts_std_attributes (c_parser *, unsigned int); static tree c_parser_std_attribute_specifier_sequence (c_parser *); @@ -1665,7 +1671,7 @@ static location_t c_parser_compound_statement_nostart (c_parser *); static void c_parser_label (c_parser *, tree); static void c_parser_statement (c_parser *, bool *, location_t * = NULL); static void c_parser_statement_after_labels (c_parser *, bool *, -vec * = NULL); +vec * = NULL, attr_state = {}); static tree c_parser_c99_block_statement (c_parser *, bool *, location_t * = NULL); static void c_parser_if_statement (c_parser *, bool *, vec *); @@ -6982,6 +6988,29 @@ c_parser_handle_directive_omp_attributes (tree &attrs, } } +/* Check if STD_ATTR contains a musttail attribute and remove if it + precedes a return. PARSER is the parser and ATTR is the output + attr_state. */ + +static tree +c_parser_handle_musttail (c_parser *parser, tree std_attrs, attr_state &attr) +{ + if (c_parser_next_token_is_keyword (parser, RID_RETURN)) +{ + if (lookup_attribute ("gnu", "musttail", std_attrs)) + { + std_attrs = remove_attribute ("gnu", "musttail", std_attrs); + attr.musttail_p = true; + } + if (lookup_attribute ("clang", "musttail", std_attrs)) + { + std_attrs = remove_attribute ("clang", "musttail", std_attrs); + attr.musttail_p = true; + } +} + return std_attrs; +} + /* Parse a compound statement except for the opening brace. This is used for parsing both compound statements and statement expressions (which follow different paths to handling the opening). */ @@ -6998,6 +7027,7 @@ c_parser_compound_statement_nostart (c_parser *parser) bool in_omp_loop_block = omp_for_parse_state ?
omp_for_parse_state->want_nested_loop : false; tree sl = NULL_TREE; + attr_state a = {}; if (c_parser_next_token_is (parser, CPP_CLOSE_BRACE)) { @@ -7138,7 +7168,10 @@ c_parser_compound_statement_nostart (c_parser *parser) = c_parser_nth_token_starts_std_attributes (parser, 1); tree std_attrs = NULL_TREE; if (have_std_attrs) - std_attrs = c_parser_std_attribute_specifier_sequence (parser); + { + std_attrs = c_parser_std_attribute_specifier_sequence (parser); + std_attrs = c_parser_handle_musttail (parser, std_attrs, a); + } if (c_parser_next_token_is_keyword (parser, RID_CASE) || c_parser_next_token_is_keyword (parser, RID_DEFAULT) || (c_parser_next_token_is (parser, CPP_NAME) @@ -7286,7 +7319,7 @@ c_parser_compound_statement_nostart (c_parser *parser) last_stmt = true; mark_valid_location_for_stdc_pragma (false); if (!omp_for_parse_state) - c_parser_statement_after_labels (parser, NULL); + c_parser_statement_after_labels (parser, NULL, NULL, a); else { /* In canonical loop nest form, nested loops can only appear @@ -7328,15 +7361,20 @@ c_parser_compound_statement_nostart (c_parser *parser) /* Parse all consecutive labels, possibly preceded by standard attributes. In this context, a statement is required, not a declaration, so attributes must be followed by a statement that is - not just a semicolon. */ + not just a semicolon. Returns an attr_state. */ -static void +static attr_state c_parser_all_labels (c_parser *parser) { + attr_state attr = {}; bool have_std_attrs; tree std_attrs = NULL; if ((have_std_attrs = c_parser_nth_token_starts_std_attrib
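For completeness, the user-facing form this parser work enables, written as a runnable C23 example (the factorial is purely illustrative):

/* Compile with -std=c23; per the patch, [[clang::musttail]] is
   accepted as a compatibility alias for [[gnu::musttail]].  */
static int
fact (int n, int acc)
{
  if (n <= 1)
    return acc;
  [[gnu::musttail]] return fact (n - 1, n * acc);
}

int
main (void)
{
  return fact (10, 1) == 3628800 ? 0 : 1;
}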
Re: [PATCH v9 04/10] C++: Support clang compatible [[musttail]] (PR83324)
Updated version with common code for C/C++ extracted in c-family. Other than that, no changes. Is this version ok to commit? --- This patch implements a clang-compatible [[musttail]] attribute for returns. musttail is useful as an alternative to computed goto for interpreters. With computed goto, the interpreter function usually ends up very big, which causes problems with register allocation and other per-function optimizations not scaling. With musttail, the interpreter can instead be written as a sequence of smaller functions that call each other. To avoid unbounded stack growth this requires forcing a sibling call, which this attribute does. It guarantees an error if the call cannot be tail-called, which allows the programmer to fix it instead of risking a stack overflow. Unlike computed goto, it is also type-safe. It turns out that David Malcolm had already implemented middle/backend support for a musttail attribute back in 2016, but it wasn't exposed to any frontend other than a special plugin. This patch adds a [[gnu::musttail]] attribute for C++ that can be added to return statements. The return statement must be a direct call (it does not follow dependencies), which is similar to what clang implements. It then uses the existing musttail infrastructure. For compatibility it also detects clang::musttail. Passes bootstrap and full test. gcc/c-family/ChangeLog: * c-attribs.cc (set_musttail_on_return): New function. * c-common.h (set_musttail_on_return): Declare new function. gcc/cp/ChangeLog: PR c/83324 * parser.cc (cp_parser_statement): Handle musttail. (cp_parser_jump_statement): Ditto. * pt.cc (tsubst_expr): Copy CALL_EXPR_MUST_TAIL_CALL. --- gcc/c-family/c-attribs.cc | 20 gcc/c-family/c-common.h | 1 + gcc/cp/parser.cc | 26 +++--- gcc/cp/pt.cc | 7 ++- 4 files changed, 50 insertions(+), 4 deletions(-) diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc index 5adc7b775eaf..685f212683f4 100644 --- a/gcc/c-family/c-attribs.cc +++ b/gcc/c-family/c-attribs.cc @@ -672,6 +672,26 @@ attribute_takes_identifier_p (const_tree attr_id) return targetm.attribute_takes_identifier_p (attr_id); } +/* Set a musttail attribute MUSTTAIL_P on return expression RETVAL + at LOC. */ + +void +set_musttail_on_return (tree retval, location_t loc, bool musttail_p) +{ + if (retval && musttail_p) +{ + tree t = retval; + if (TREE_CODE (t) == TARGET_EXPR) + t = TARGET_EXPR_INITIAL (t); + if (TREE_CODE (t) != CALL_EXPR) + error_at (loc, "cannot tail-call: return value must be a call"); + else + CALL_EXPR_MUST_TAIL_CALL (t) = 1; +} + else if (musttail_p && !retval) +error_at (loc, "cannot tail-call: return value must be a call"); +} + /* Verify that argument value POS at position ARGNO to attribute NAME applied to function FN (which is either a function declaration or function type) refers to a function parameter at position POS and the expected type diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h index e84c9c47513b..079c9dc5f08b 100644 --- a/gcc/c-family/c-common.h +++ b/gcc/c-family/c-common.h @@ -1646,6 +1646,7 @@ extern tree handle_noreturn_attribute (tree *, tree, tree, int, bool *); extern tree handle_musttail_attribute (tree *, tree, tree, int, bool *); extern bool has_attribute (location_t, tree, tree, tree (*)(tree)); extern tree build_attr_access_from_parms (tree, bool); +extern void set_musttail_on_return (tree, location_t, bool); /* In c-format.cc.
*/ extern bool valid_format_string_type_p (tree); diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index 31ae9c2fb54d..e2411ee7213c 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -2467,7 +2467,7 @@ static tree cp_parser_perform_range_for_lookup static tree cp_parser_range_for_member_function (tree, tree); static tree cp_parser_jump_statement - (cp_parser *); + (cp_parser *, tree &); static void cp_parser_declaration_statement (cp_parser *); @@ -12756,7 +12756,7 @@ cp_parser_statement (cp_parser* parser, tree in_statement_expr, case RID_CO_RETURN: case RID_GOTO: std_attrs = process_stmt_hotness_attribute (std_attrs, attrs_loc); - statement = cp_parser_jump_statement (parser); + statement = cp_parser_jump_statement (parser, std_attrs); break; /* Objective-C++ exception-handling constructs. */ @@ -14844,10 +14844,11 @@ cp_parser_init_statement (cp_parser *parser, tree *decl) jump-statement: goto * expression ; + STD_ATTRS are the statement attributes. They can be modified. Returns the new BREAK_STMT, CONTINUE_STMT, RETURN_EXPR, or GOTO_EXPR. */ static tree -cp_parser_jump_statement (cp_parser* parser) +cp_parser_jump_statement (cp_parser* parser, tree &std_attrs) { tree statement = error_mark_node; cp_token *token; @@ -14924,6
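The interpreter motivation in the description is easiest to see in code; a small dispatch sketch (the bytecode and handlers are invented for illustration; valid as C23, and as C++ with the same attribute):

/* Each handler tail-calls the next via the dispatch table, so the
   chain runs in constant stack space; [[gnu::musttail]] turns any
   missed sibcall into a compile-time error rather than a potential
   stack overflow.  */
typedef int dispatch_fn (const unsigned char *pc, int acc);

static dispatch_fn op_halt, op_inc, op_dbl;
static dispatch_fn *const table[] = { op_halt, op_inc, op_dbl };

static int
op_halt (const unsigned char *pc, int acc)
{
  (void) pc;
  return acc;
}

static int
op_inc (const unsigned char *pc, int acc)
{
  [[gnu::musttail]] return table[*pc] (pc + 1, acc + 1);
}

static int
op_dbl (const unsigned char *pc, int acc)
{
  [[gnu::musttail]] return table[*pc] (pc + 1, acc * 2);
}

int
main (void)
{
  static const unsigned char prog[] = { 1, 2, 1, 0 }; /* ((0+1)*2)+1 = 3 */
  return table[prog[0]] (prog + 1, 0) == 3 ? 0 : 1;
}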
[PATCH] Output CodeView function information
Translate DW_TAG_subprogram DIEs into CodeView LF_FUNC_ID types and S_GPROC32_ID / S_LPROC32_ID symbols. ld will then transform these into S_GPROC32 / S_LPROC32 symbols, which map addresses to unmangled function names. gcc/ * dwarf2codeview.cc (enum cv_sym_type): Add new values. (struct codeview_symbol): Add function to union. (struct codeview_custom_type): Add lf_func_id to union. (write_function): New function. (write_codeview_symbols): Call write_function. (write_lf_func_id): New function. (write_custom_types): Call write_lf_func_id. (add_function): New function. (codeview_debug_early_finish): Call add_function. --- gcc/dwarf2codeview.cc | 291 +- 1 file changed, 288 insertions(+), 3 deletions(-) diff --git a/gcc/dwarf2codeview.cc b/gcc/dwarf2codeview.cc index df53d8bab9d..c174f320480 100644 --- a/gcc/dwarf2codeview.cc +++ b/gcc/dwarf2codeview.cc @@ -34,6 +34,7 @@ along with GCC; see the file COPYING3. If not see #include "langhooks.h" #include "dwarf2out.h" #include "dwarf2codeview.h" +#include "rtl.h" #ifdef CODEVIEW_DEBUGGING_INFO @@ -71,7 +72,10 @@ along with GCC; see the file COPYING3. If not see enum cv_sym_type { S_LDATA32 = 0x110c, S_GDATA32 = 0x110d, - S_COMPILE3 = 0x113c + S_COMPILE3 = 0x113c, + S_LPROC32_ID = 0x1146, + S_GPROC32_ID = 0x1147, + S_PROC_ID_END = 0x114f }; /* This is enum LEAF_ENUM_e in Microsoft's cvinfo.h. */ @@ -185,6 +189,16 @@ struct codeview_symbol char *name; dw_die_ref die; } data_symbol; +struct +{ + uint32_t parent; + uint32_t end; + uint32_t next; + uint32_t type; + uint8_t flags; + char *name; + dw_die_ref die; +} function; }; }; @@ -309,6 +323,12 @@ struct codeview_custom_type uint32_t num_entries; uint32_t *args; } lf_arglist; +struct +{ + uint32_t parent_scope; + uint32_t function_type; + char *name; +} lf_func_id; }; }; @@ -966,6 +986,152 @@ end: free (s->data_symbol.name); } +/* Write an S_GPROC32_ID symbol, representing a global function, or an + S_LPROC32_ID symbol, for a static function. */ + +static void +write_function (codeview_symbol *s) +{ + unsigned int label_num = ++sym_label_num; + dw_attr_node *loc_low, *loc_high; + const char *label_low, *label_high; + rtx rtx_low, rtx_high; + + /* This is struct procsym in binutils and PROCSYM32 in Microsoft's cvinfo.h: + + struct procsym + { + uint16_t size; + uint16_t kind; + uint32_t parent; + uint32_t end; + uint32_t next; + uint32_t proc_len; + uint32_t debug_start; + uint32_t debug_end; + uint32_t type; + uint32_t offset; + uint16_t section; + uint8_t flags; + char name[]; + } ATTRIBUTE_PACKED; + */ + + loc_low = get_AT (s->function.die, DW_AT_low_pc); + if (!loc_low) +goto end; + + if (loc_low->dw_attr_val.val_class != dw_val_class_lbl_id) +goto end; + + label_low = loc_low->dw_attr_val.v.val_lbl_id; + if (!label_low) +goto end; + + rtx_low = gen_rtx_SYMBOL_REF (Pmode, label_low); + + loc_high = get_AT (s->function.die, DW_AT_high_pc); + if (!loc_high) +goto end; + + if (loc_high->dw_attr_val.val_class != dw_val_class_high_pc) +goto end; + + label_high = loc_high->dw_attr_val.v.val_lbl_id; + if (!label_high) +goto end; + + rtx_high = gen_rtx_SYMBOL_REF (Pmode, label_high); + + /* Output the S_GPROC32_ID / S_LPROC32_ID record. 
*/ + + fputs (integer_asm_op (2, false), asm_out_file); + asm_fprintf (asm_out_file, + "%L" SYMBOL_END_LABEL "%u - %L" SYMBOL_START_LABEL "%u\n", + label_num, label_num); + + targetm.asm_out.internal_label (asm_out_file, SYMBOL_START_LABEL, label_num); + + fputs (integer_asm_op (2, false), asm_out_file); + fprint_whex (asm_out_file, s->kind); + putc ('\n', asm_out_file); + + fputs (integer_asm_op (4, false), asm_out_file); + fprint_whex (asm_out_file, s->function.parent); + putc ('\n', asm_out_file); + + fputs (integer_asm_op (4, false), asm_out_file); + fprint_whex (asm_out_file, s->function.end); + putc ('\n', asm_out_file); + + fputs (integer_asm_op (4, false), asm_out_file); + fprint_whex (asm_out_file, s->function.next); + putc ('\n', asm_out_file); + + fputs (integer_asm_op (4, false), asm_out_file); + output_addr_const (asm_out_file, rtx_high); + fputs (" - ", asm_out_file); + output_addr_const (asm_out_file, rtx_low); + putc ('\n', asm_out_file); + + /* FIXME - debug_start should be the end of the prologue, and debug_end +the beginning of the epilogue. Do the whole function for +now. */ + + fputs (integer_asm_op (4, false), asm_out_file); + fprint_whex (asm_out_file, 0); + putc ('\n', asm_out_file); + + fputs (integer_asm_op (4, false), asm_out_file); + output_addr_const (asm_out_file, rtx_high); +
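For orientation alongside the procsym comment quoted in the patch, the LF_FUNC_ID leaf that pairs with these symbols has roughly the following layout; this is a sketch inferred from the lf_func_id fields above and the usual CodeView record conventions, so treat the details as assumptions rather than patch content:

#include <stdint.h>

/* Sketch of the LF_FUNC_ID record written by write_lf_func_id; every
   CodeView type record starts with a 16-bit length (excluding the
   length field itself) and a 16-bit kind, and is padded to a 4-byte
   boundary.  */
struct lf_func_id_record
{
  uint16_t size;          /* record length, excluding this field */
  uint16_t kind;          /* LF_FUNC_ID */
  uint32_t parent_scope;  /* 0 for functions at global scope */
  uint32_t function_type; /* type index of the function's signature */
  /* NUL-terminated name follows, then alignment padding.  */
} __attribute__ ((packed));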
Re: [RFC 1/2] libbacktrace: add FDPIC support
On Wed, Jul 10, 2024 at 12:49 PM Ian Lance Taylor wrote: > On Sun, May 26, 2024 at 11:51 PM Max Filippov wrote: > > diff --git a/libbacktrace/internal.h b/libbacktrace/internal.h > > index 4fa0af8cb6c9..456911166026 100644 > > --- a/libbacktrace/internal.h > > +++ b/libbacktrace/internal.h > > @@ -323,10 +323,22 @@ struct dwarf_sections > > > > struct dwarf_data; > > > > +#if defined (HAVE_DL_ITERATE_PHDR) && defined (__FDPIC__) > > +typedef struct elf32_fdpic_loadaddr base_address_type; > > +#define __RELOC_UINTPTR(ptr, base) ((uintptr_t)__RELOC_POINTER (ptr, base)) > > +#define no_base_address ((struct elf32_fdpic_loadaddr){0}) > > +#else > > +typedef uintptr_t base_address_type; > > +#define __RELOC_POINTER(ptr, base) ((ptr) + (base)) > > +#define __RELOC_UINTPTR(ptr, base) ((uintptr_t)__RELOC_POINTER (ptr, base)) > > +#define no_base_address ((uintptr_t)0) > > +#endif > > + > > + > > When I look at the uClibc sources, I don't understand how this works. > This sets no_base_address to have a zero map field. But > __RELOC_POINTER will crash when given a zero map field. That's right. But __RELOC_POINTER should never be called for base address set to no_base_address, that's what the following hunk ensures: --->8--- @@ -6636,9 +6636,15 @@ elf_add (struct backtrace_state *state, const char *filename, int descriptor, /* If the executable is ET_DYN, it is either a PIE, or we are running directly a shared library with .interp. We need to wait for - dl_iterate_phdr in that case to determine the actual base_address. */ + dl_iterate_phdr in that case to determine the actual base_address. + In case of FDPIC we always need the actual base_address. */ +#ifndef __FDPIC__ if (exe && ehdr.e_type == ET_DYN) return -1; +#else + if (exe) +return -1; +#endif shoff = ehdr.e_shoff; shnum = ehdr.e_shnum; --->8--- > At least that is what it looks like in > uClibc/libc/sysdeps/linux/bfin/bits/elf-fdpic.h. > What target and what library are you using? I'm using xtensa-linux-uclibcfdpic (gcc part is still WIP, most recent version is available in https://github.com/jcmvbkbc/gcc-xtensa tagged xtensa-fdpic-abi-spec-1.4) with uClibc-ng -- Thanks. -- Max
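For readers not familiar with FDPIC, the reason a single base offset cannot work is that each PT_LOAD segment may be mapped independently; a simplified sketch of the relocation walk (assumed to mirror uClibc's bits/elf-fdpic.h in spirit; the field and function names here are illustrative):

#include <stddef.h>
#include <stdint.h>

struct loadseg { uintptr_t addr; uintptr_t p_vaddr; size_t p_memsz; };
struct loadmap { int nsegs; struct loadseg segs[8]; };

/* Map a link-time address to its run-time location by searching the
   load map, which is what __RELOC_POINTER does for real; with a null
   map (no_base_address) there is nothing to search, hence the crash
   the thread discusses.  */
static uintptr_t
reloc_uintptr (uintptr_t vaddr, const struct loadmap *map)
{
  for (int i = 0; i < map->nsegs; i++)
    {
      const struct loadseg *s = &map->segs[i];
      if (vaddr >= s->p_vaddr && vaddr - s->p_vaddr < s->p_memsz)
        return vaddr - s->p_vaddr + s->addr;
    }
  return 0; /* unmapped address */
}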
[COMMITTED] CRIS: Disable late-combine by default, related PR115883
Heads-up to xtensa maintainers, who might similarly want to move the option-override to TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE (and call it from TARGET_OPTION_OVERRIDE). Regarding disabling that optimization: with the brief description per the below, I think I've done due diligence when it comes to a valid reason for disabling that optimization, if only the bare minimum. Not that it'd be a requirement, but to proactively counter presumed arguments about sweeping target bugs under the carpet. I may revisit later (famous last words), but for now, I'd rather spend time I don't have, writing my own optimizations. ;-) Sorry. Oops! I see I forgot to replace the X templates before committing and pushing, so there'll be a brown-paper bag fixup. Bah. -- >8 -- With late-combine, performance for coremark compiled for cris-elf regresses 2.6% by performance and by size 0.4%, measured at r15-2005-g13757e50ff0b, when compiled with "-O2 -march=v10". Earlier, at r15-1880-gce34fcc572a0, numbers were by performance 3.2% and by size 0.4%, even with the proposed patch to PR115883 (TL;DR: a presumed bug in LRA or combine exposed by late-combine). Without that patch, about the same performance results (at that revision). Similarly around the late-combine commit (r15-1579-g792f97b44ffc5e). I briefly looked at the performance regression for coremark at r15-2005-g13757e50ff0b (with/without this patch) as far as seeing that the stack-frame grew larger (maxing out on hard registers and needing one more slot) for at least two of the top three* functions that regressed the most in terms of cycles per call: matrix_mul_matrix_bitextract (in coremark, 17% slower) and __subdf3 (in libgcc, 6.7% slower). That makes sense when considering that late-combine "naturally" stretches register life-times. But, looking at late_combine::combine_into_uses and late_combine::optimizable_set, nothing stood out to me. I guess there's improvement opportunities in late_combine::check_register_pressure. (*) I opted not to look at _dtoa_r (in newlib) mostly because it's boring and always shows up when something in gcc goes sideways. (It maxes out on hard registers and is big. End of story.) Note that the change of default is done in the TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE worker, not in the TARGET_OPTION_OVERRIDE worker for reasons stated in the comment. * config/cris/cris.cc (cris_option_override_after_change): New function. Disable late-combine by default. (cris_option_override): Call the new function. --- gcc/config/cris/cris.cc | 37 + 1 file changed, 37 insertions(+) diff --git a/gcc/config/cris/cris.cc b/gcc/config/cris/cris.cc index f1accc0f36e5..f2c3955b9665 100644 --- a/gcc/config/cris/cris.cc +++ b/gcc/config/cris/cris.cc @@ -53,6 +53,7 @@ along with GCC; see the file COPYING3. If not see #include "builtins.h" #include "cfgrtl.h" #include "tree-pass.h" +#include "opts.h" /* This file should be included last. 
*/ #include "target-def.h" @@ -156,6 +157,7 @@ static rtx_insn *cris_md_asm_adjust (vec &, vec &, HARD_REG_SET &, location_t); static void cris_option_override (void); +static void cris_option_override_after_change (); static bool cris_frame_pointer_required (void); @@ -281,6 +283,8 @@ int cris_cpu_version = CRIS_DEFAULT_CPU_VERSION; #undef TARGET_OPTION_OVERRIDE #define TARGET_OPTION_OVERRIDE cris_option_override +#undef TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE +#define TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE cris_option_override_after_change #undef TARGET_ASM_TRAMPOLINE_TEMPLATE #define TARGET_ASM_TRAMPOLINE_TEMPLATE cris_asm_trampoline_template @@ -2409,6 +2413,39 @@ cris_option_override (void) /* Set the per-function-data initializer. */ init_machine_status = cris_init_machine_status; + + cris_option_override_after_change (); +} + +/* The TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE worker. + + The CRIS port doesn't have any port-specific function attributes to + handle, but to keep attributes consistent across per-function changes + and not fail per-function optimization settings as exposed by + gcc.dg/ipa/iinline-attr.c, any OPTION_SET_P work need to be done + here, not in the TARGET_OPTION_OVERRIDE function. This function then + instead needs to called from that function. */ + +static void +cris_option_override_after_change () +{ + /* The combine pass inserts extra copies of the incoming parameter + registers in make_more_copies, between the hard registers and + pseudo-registers holding the "original" copies. When doing that, + it does not copy attributes from those original registers. With + the late-combine pass, those extra copies are propagated into more + places than the original copies, and trips up LRA which doesn't see + e.g. REG_POINTER where it's expected. This causes an ICE for + gcc.target/cris/rld-legit1.c. That's a red flag, but also a very
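For the xtensa heads-up at the top, the shape of the change is compact; a sketch of the hook wiring for some other target (names hypothetical, logic mirroring the CRIS hunk above):

/* Do OPTION_SET_P-dependent overrides in the after-change worker and
   call it from the main override hook, so per-function optimization
   attributes (cf. gcc.dg/ipa/iinline-attr.c) see consistent
   settings.  */
static void
example_option_override_after_change (void)
{
  /* Respect an explicit -flate-combine-instructions from the user.  */
  if (!OPTION_SET_P (flag_late_combine_instructions))
    flag_late_combine_instructions = 0;
}

static void
example_option_override (void)
{
  /* ... once-only setup ...  */
  example_option_override_after_change ();
}

#undef TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE
#define TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE \
  example_option_override_after_change
#undef TARGET_OPTION_OVERRIDE
#define TARGET_OPTION_OVERRIDE example_option_override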
[COMMITTED] CRIS: Fix up last comment.
Regarding shortening it: no need to duplicate what's in the git commit log, just keep it at the minimum for at-a-glance use. -- >8 -- * config/cris/cris.cc (cris_option_override_after_change): Fix up comment regarding disabling late_combine. --- gcc/config/cris/cris.cc | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/gcc/config/cris/cris.cc b/gcc/config/cris/cris.cc index f2c3955b9665..617fc0a0cb33 100644 --- a/gcc/config/cris/cris.cc +++ b/gcc/config/cris/cris.cc @@ -2440,10 +2440,9 @@ cris_option_override_after_change () special corner case. A more valid reason is that coremark with -march=v10 -O2 regresses - by X% (@r15-g vs. @r15-g and also at - r15-g with patches to handle the REG_POINTER and - rld-legit1 fallout vs. this additional change). Disable - late-combine by default until that's fixed. */ + by 2.6% @r15-2005-g13757e50ff0b compared to late-combined disabled. + + Disable late-combine by default until that's fixed. */ if (!OPTION_SET_P (flag_late_combine_instructions)) flag_late_combine_instructions = 0; } -- 2.30.2
Re: [PATCH] fortran: Correctly evaluate the MASK argument of MINLOC/MAXLOC
Hi Mikael, The fix is blindingly obvious :-) Not only that, the failing testcase runs correctly. OK for mainline and please backport to 14-branch before the 14.2 release. Thanks for the patch Paul On Sat, 13 Jul 2024 at 10:48, Mikael Morin wrote: > From: Mikael Morin > > Hello, > > I'm currently testing this on x86_64-linux. > I plan to push to master if all goes well. > > Mikael > > -- 8< -- > > Add the preliminary code that the generated expression for MASK may depend > on when generating the inline code to evaluate MINLOC or MAXLOC with a > scalar MASK. > > The generated code was only keeping the generated expression but not the > preliminary code, which was sufficient for simple cases such as data > references or simple (scalar) function calls, but was bogus with more > complicated ones. > > gcc/fortran/ChangeLog: > > * trans-intrinsic.cc (gfc_conv_intrinsic_minmaxloc): Add the > preliminary code generated for MASK to the preliminary code of > MINLOC/MAXLOC. > > gcc/testsuite/ChangeLog: > > * gfortran.dg/minmaxloc_17.f90: New test. > --- > gcc/fortran/trans-intrinsic.cc | 1 + > gcc/testsuite/gfortran.dg/minmaxloc_17.f90 | 33 ++ > 2 files changed, 34 insertions(+) > create mode 100644 gcc/testsuite/gfortran.dg/minmaxloc_17.f90 > > diff --git a/gcc/fortran/trans-intrinsic.cc > b/gcc/fortran/trans-intrinsic.cc > index cadbd177452..180d0d7a88c 100644 > --- a/gcc/fortran/trans-intrinsic.cc > +++ b/gcc/fortran/trans-intrinsic.cc > @@ -5749,6 +5749,7 @@ gfc_conv_intrinsic_minmaxloc (gfc_se * se, gfc_expr > * expr, enum tree_code op) > >gfc_init_se (&maskse, NULL); >gfc_conv_expr_val (&maskse, maskexpr); > + gfc_add_block_to_block (&se->pre, &maskse.pre); >gfc_init_block (&block); >gfc_add_block_to_block (&block, &loop.pre); >gfc_add_block_to_block (&block, &loop.post); > diff --git a/gcc/testsuite/gfortran.dg/minmaxloc_17.f90 > b/gcc/testsuite/gfortran.dg/minmaxloc_17.f90 > new file mode 100644 > index 000..7e6e586ab03 > --- /dev/null > +++ b/gcc/testsuite/gfortran.dg/minmaxloc_17.f90 > @@ -0,0 +1,33 @@ > +! { dg-do run } > +! > +! Check that the code necessary to evaluate MINLOC's or MAXLOC's MASK > +! argument is correctly generated. > + > +program p > + implicit none > + integer, parameter :: data10(*) = (/ 2, 5, 2, 0, 6, 5, 3, 6, 0, 1 /) > + logical, parameter :: mask10(*) = (/ .false., .true., .false., & > + .false., .true., .true., & > + .true. , .true., .false., & > + .false. /) > + type bool_wrapper > +logical :: l > + end type > + call check_minloc > + call check_maxloc > +contains > + subroutine check_minloc > +integer :: a(10) > +integer :: r > +a = data10 > +r = minloc(a, dim = 1, mask = sum(a) > 0) > +if (r /= 4) stop 11 > + end subroutine > + subroutine check_maxloc > +integer :: a(10) > +integer :: r > +a = data10 > +r = maxloc(a, dim = 1, mask = sum(a) > 0) > +if (r /= 5) stop 18 > + end subroutine > +end program > -- > 2.43.0 > >
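The invariant behind the one-line fix deserves a spelled-out restatement: any gfc_conv_* conversion may queue setup statements on se.pre, and every caller must chain that block into the surrounding code. A condensed sketch of the corrected sequence (the same calls as the patch, with surrounding details elided):

/* Converting MASK may generate setup code, e.g. the evaluation of
   SUM(A) > 0, into maskse.pre; before the patch that block was
   dropped, so the mask was tested without ever being computed.  */
gfc_se maskse;

gfc_init_se (&maskse, NULL);
gfc_conv_expr_val (&maskse, maskexpr);
gfc_add_block_to_block (&se->pre, &maskse.pre); /* the fix */
/* maskse.expr is now safe to use in the generated conditional.  */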