> -----Original Message-----
> From: Andrew Stubbs <a...@baylibre.com>
> Sent: 12 November 2024 20:23
> To: Prathamesh Kulkarni <prathame...@nvidia.com>; Jakub Jelinek <ja...@redhat.com>
> Cc: Richard Biener <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>; gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
>
> External email: Use caution opening links or attachments
>
>
> On 12/11/2024 06:01, Prathamesh Kulkarni via Gcc wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jakub Jelinek <ja...@redhat.com>
> >> Sent: 04 November 2024 21:44
> >> To: Prathamesh Kulkarni <prathame...@nvidia.com>
> >> Cc: Richard Biener <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>; gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> >> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> >>
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On Sat, Nov 02, 2024 at 03:53:34PM +0000, Prathamesh Kulkarni wrote:
> >>> The attached patch adds a new bitfield needs_max_vf_lowering to loop,
> >>> and sets that in expand_omp_simd for loops that need delayed lowering
> >>> of safelen and omp simd arrays.  The patch defines a new macro
> >>> OMP_COMMON_MAX_VF (arbitrarily set to 16) as a placeholder value for
> >>> max_vf (instead of INT_MAX), which is later replaced by the appropriate
> >>> max_vf during the omp_adjust_max_vf pass.  Does that look OK?
> >>
> >> No.
> >> The thing is, if the user doesn't specify safelen, it defaults to
> >> infinity (which we represent as INT_MAX); if the user specifies it, then
> >> that is the maximum for it (currently in the OpenMP specification it is
> >> just an integral value, so it can't be a poly_int).
> >> The lowering then uses max_vf as another limit, i.e. what the hw can do
> >> at most, and sizes the magic arrays with it.  So one needs to use the
> >> minimum of what the user specified and what the hw can handle.
> >> Using 16 as some magic value is therefore just wrong: safelen(16) can be
> >> specified in the source as well, or safelen(8), or safelen(32), or
> >> safelen(123).
> >>
> >> Thus, the fact that the hw maximum hasn't been determined yet needs
> >> to be represented in some other flag, not in the loop->safelen value;
> >> before it is determined, loop->safelen should represent what the user
> >> wrote (or what was implied), and the later pass should use the minimum
> >> of loop->safelen and the picked hw maximum.  Of course, if the picked
> >> hw maximum is POLY_INT-ish, the big question is how to compare that
> >> against the user-supplied integer value: either one can just handle the
> >> INT_MAX (aka infinity) special case, or say query the backend for the
> >> maximum runtime value of the POLY_INT and only use the POLY_INT if it
> >> is always known to be smaller than or equal to the user-supplied
> >> safelen.
> >>
> >> Another thing (already mentioned in the thread Andrew referenced) is
> >> that max_vf is used in two separate places.  One is just to size the
> >> magic arrays and to serve as one operand of the minimum (the other
> >> being the user-specified safelen).  In this case it is generally fine
> >> to pick a larger value than strictly necessary (as long as it is never
> >> larger than the user-supplied safelen).
> >> The other case is the simd modifier on the schedule clause.  That value
> >> should better be the right one or slightly larger, but not too much.
> >> I think currently we just use the INTEGER_CST we pick as the maximum;
> >> if this sizing is deferred, maybe it needs to be another internal
> >> function that asks for the value (though it can refer to a loop vf in
> >> another function, which complicates stuff).
> >>
> >> Regarding Richi's question, I'm afraid the OpenMP simd loop lowering
> >> can't be delayed until some later pass.
> >
> > Hi Jakub,
> > Thanks for the suggestions!  The attached patch makes the following
> > changes:
> >
> > (1) Delays setting of safelen for offloading by introducing a new
> > bitfield needs_max_vf_lowering in loop, which is true with offloading
> > enabled; safelen is then set to min(safelen, max_vf) for the target
> > later, in the omp_device_lower pass.
> > Comparing a user-specified safelen with a poly_int max_vf may not
> > always be possible at compile time (say 32 and 16+16x), and even if we
> > determined the runtime VL based on -mcpu flags, I guess relying on that
> > wouldn't be portable?
> > The patch works around this by taking constant_lower_bound (max_vf) and
> > comparing that with safelen instead, with the downside that
> > constant_lower_bound (max_vf) will not be the optimal max_vf for an SVE
> > target that implements a SIMD width > 128 bits.
> >
> > (2) Since max_vf is used as the length of the omp simd array, it gets
> > streamed out to the device, and device compilation fails during
> > streaming-in if max_vf is a poly_int (16+16x) and the device's
> > NUM_POLY_INT_COEFFS < 2 (which is what motivated my patch).  The patch
> > tries to address this by simply setting the length to a placeholder
> > value (INT_MAX?) in lower_rec_simd_input_clauses if offloading is
> > enabled; it is later set to the appropriate value in the
> > omp_device_lower pass.
> >
> > (3) Andrew's patches seem to already fix the case of adjusting
> > chunk_size for the schedule clause with the simd modifier, by
> > introducing a new internal function .GOMP_MAX_VF which is then replaced
> > by the target's max_vf.  To keep it consistent with safelen, the patch
> > here uses constant_lower_bound (max_vf) there too.
> >
> > The patch passes libgomp testing for AArch64/nvptx offloading (with and
> > without GPU).
> > Does it look OK?
>
> I've not reviewed the patch in detail, but I can confirm that this does
> not break my usecase or cause any test regressions, for me.

Hi Andrew,
Thanks for testing the patch!

> However, are you sure that ompdevlow is always running?  I think you
> need to add this, somewhere:
>
>    cfun->curr_properties &= ~PROP_gimple_lomp_dev;
>
> My patch added this into omp_adjust_chunk_size, which triggers it for
> certain schedule clauses, but I think you want it whenever safelen is
> needed?

Ah indeed, thanks for the suggestions.  I have tried to fix it in the
attached patch.
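
To illustrate the comparison problem in (1) outside of GCC, here is a small
standalone sketch; poly2 below is just a stand-in for a two-coefficient
poly_uint64 such as 16+16x, not the real poly-int.h API:

/* Standalone model of the problem: a value c0 + c1*x, where x >= 0
   (the number of extra 128-bit SVE blocks) is only known at runtime.  */
#include <cstdint>
#include <cstdio>
#include <algorithm>

struct poly2 { uint64_t c0, c1; };   /* value = c0 + c1 * x */

/* True iff c0 + c1*x <= bound for every runtime x >= 0.  */
static bool known_le (poly2 p, uint64_t bound)
{
  return p.c1 == 0 && p.c0 <= bound;
}

/* The only bound that holds for all x is the value at x = 0.  */
static uint64_t lower_bound (poly2 p) { return p.c0; }

int main ()
{
  poly2 max_vf = { 16, 16 };   /* e.g. SVE: 16 + 16x */
  uint64_t safelen = 32;       /* user wrote safelen(32) */

  /* 16 + 16x <= 32 for x < 2 but > 32 for x >= 2, so neither ordering
     can be decided at compile time.  */
  std::printf ("known_le (16+16x, 32) = %d\n", known_le (max_vf, safelen));

  /* What the patch falls back to: clamp safelen by the constant lower
     bound of max_vf; always safe, but suboptimal for wider SVE.  */
  uint64_t clamped = std::min (safelen, lower_bound (max_vf));
  std::printf ("clamped safelen = %llu\n", (unsigned long long) clamped);
  return 0;
}

Only the x = 0 coefficient is guaranteed, which is why the patch clamps
loop->safelen (and the omp simd array length) with constant_lower_bound (max_vf).
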
Thanks,
Prathamesh

>
> Andrew
>
> > Thanks,
> > Prathamesh
> >>
> >> Jakub
> >
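
For reference, a minimal offloaded simd loop of the kind the patch is
concerned with could look like the one below (a hypothetical reproducer, not
taken from the testsuite); the lastprivate scalar is the sort of variable
that gets backed by an omp simd array whose length comes from max_vf:

/* Hypothetical reproducer: a simd loop inside a target region.  The
   lastprivate scalar "t" is implemented via an omp simd array whose
   length is max_vf, which is a poly_int on SVE hosts and previously
   broke streaming to the nvptx device.  */
#include <stdio.h>

int main (void)
{
  int a[128], t = 0;

#pragma omp target map(tofrom: a, t)
#pragma omp simd safelen(32) lastprivate(t)
  for (int i = 0; i < 128; i++)
    {
      t = i * 2;
      a[i] = t;
    }

  printf ("%d %d\n", a[127], t);   /* expect: 254 254 */
  return 0;
}
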
Delay lowering safelen if offloading is enabled.

gcc/ChangeLog:
	* cfgloop.h (loop): New bitfield needs_max_vf_lowering.
	* omp-expand.cc (expand_omp_simd): Set loop->needs_max_vf_lowering
	to result of is_in_offload_region.
	* omp-low.cc (lower_rec_simd_input_clauses): Set a placeholder
	value for length of omp simd array if offloading is enabled and
	max_vf is non-constant poly_int.
	* omp-offload.cc (adjust_max_vf): New function.
	(execute_omp_device_lower): Call adjust_max_vf, and use
	constant_lower_bound on result of omp_max_vf while replacing call
	to .GOMP_MAX_VF.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 30b5e40d0d9..f7b57e594fd 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -233,6 +233,12 @@ public:
      flag_finite_loops or similar pragmas state.  */
   unsigned finite_p : 1;
 
+  /* True if SIMD loop needs delayed lowering of artefacts like
+     safelen and length of omp simd arrays that depend on target's
+     max_vf.  This is true for offloading, when max_vf is computed after
+     streaming out to device.  */
+  unsigned needs_max_vf_lowering: 1;
+
   /* The number of times to unroll the loop.  0 means no information given,
      just do what we always do.  A value of 1 means do not unroll the loop.
      A value of USHRT_MAX means unroll with no specific unrolling factor.
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index 80fb1843445..c9581e3d92e 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -7171,6 +7171,10 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
       loop->latch = cont_bb;
       add_loop (loop, l1_bb->loop_father);
       loop->safelen = safelen_int;
+      loop->needs_max_vf_lowering = is_in_offload_region (region);
+      if (loop->needs_max_vf_lowering)
+        cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+
       if (simduid)
         {
           loop->simduid = OMP_CLAUSE__SIMDUID__DECL (simduid);
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index 70a2c108fbc..0f68876a2bc 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4660,7 +4660,16 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx,
     }
   else
     {
-      tree atype = build_array_type_nelts (TREE_TYPE (new_var), sctx->max_vf);
+      poly_uint64 nelts;
+      /* FIXME: If offloading is enabled, and max_vf is poly_int, avoid
+         using it as length of omp simd array, and use a placeholder value
+         instead to avoid streaming out poly_int to device.  The arrays
+         will be set to correct length later in omp_device_lower pass.  */
+      if (omp_maybe_offloaded_ctx (ctx) && !sctx->max_vf.is_constant ())
+        nelts = INT_MAX;
+      else
+        nelts = sctx->max_vf;
+      tree atype = build_array_type_nelts (TREE_TYPE (new_var), nelts);
       tree avar = create_tmp_var_raw (atype);
       if (TREE_ADDRESSABLE (new_var))
         TREE_ADDRESSABLE (avar) = 1;
diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index 372b019f9d6..9d6467cc6a6 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -2618,6 +2618,75 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
   return NULL_TREE;
 }
 
+/* Compute max_vf for target, and accordingly set loop->safelen and length
+   of omp simd arrays.  */
+
+static void
+adjust_max_vf (function *fun)
+{
+  if (!fun->has_simduid_loops)
+    return;
+
+  poly_uint64 max_vf = omp_max_vf (false);
+
+  /* FIXME: Since loop->safelen has to be an integer, it's not always possible
+     to compare against poly_int.  For eg 32 and 16+16x are not comparable at
+     compile-time because 16+16x <= 32 for x < 2, but 16+16x > 32 for x >= 2.
+     Even if we could get runtime VL based on -mcpu/-march, that would not be
+     portable across other SVE archs.
+
+     For now, use constant_lower_bound (max_vf), as a "safer approximation" to
+     max_vf that avoids these issues, with the downside that it will be
+     suboptimal max_vf for SVE archs implementing SIMD width > 128 bits.  */
+
+  uint64_t max_vf_int;
+  if (!max_vf.is_constant (&max_vf_int))
+    max_vf_int = constant_lower_bound (max_vf);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  for (auto loop: loops_list (fun, 0))
+    {
+      if (!loop->needs_max_vf_lowering)
+        continue;
+
+      if (loop->safelen > max_vf_int)
+        loop->safelen = max_vf_int;
+
+      basic_block *bbs = get_loop_body (loop);
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+        for (auto gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+          {
+            gcall *stmt = dyn_cast<gcall *> (gsi_stmt (gsi));
+            tree lhs = NULL_TREE;
+
+            if (stmt
+                && gimple_call_internal_p (stmt)
+                && (gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
+                && ((lhs = gimple_call_lhs (stmt)) != NULL_TREE))
+              {
+                imm_use_iterator use_iter;
+                gimple *use_stmt;
+
+                FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, lhs)
+                  {
+                    gassign *assign_stmt = dyn_cast<gassign *> (use_stmt);
+                    if (!assign_stmt)
+                      continue;
+
+                    tree assign_stmt_lhs = gimple_assign_lhs (assign_stmt);
+                    if (TREE_CODE (assign_stmt_lhs) == ARRAY_REF)
+                      {
+                        tree array = TREE_OPERAND (assign_stmt_lhs, 0);
+                        tree& max = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (array)));
+                        max = size_int (loop->safelen - 1);
+                        relayout_decl (array);
+                      }
+                  }
+              }
+          }
+    }
+}
+
 /* Cleanup uses of SIMT placeholder internal functions: on non-SIMT targets,
    VF is 1 and LANE is 0; on SIMT targets, VF is folded to a constant, and
    LANE is kept to be expanded to RTL later on.  Also cleanup all other SIMT
@@ -2627,6 +2696,8 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
 static unsigned int
 execute_omp_device_lower ()
 {
+  adjust_max_vf (cfun);
+
   int vf = targetm.simt.vf ? targetm.simt.vf () : 1;
   bool regimplify = false;
   basic_block bb;
@@ -2755,7 +2826,10 @@ execute_omp_device_lower ()
             rhs = build_int_cst (type, vf);
             break;
           case IFN_GOMP_MAX_VF:
-            rhs = build_int_cst (type, omp_max_vf (false));
+            /* Use constant_lower_bound, to keep max_vf consistent with
+               adjust_max_vf above.  */
+            rhs = build_int_cst (type,
+                                 constant_lower_bound (omp_max_vf (false)));
             break;
           case IFN_GOMP_SIMT_ORDERED_PRED:
             rhs = vf == 1 ? integer_zero_node : NULL_TREE;