On 12/11/2024 06:01, Prathamesh Kulkarni via Gcc wrote:


-----Original Message-----
From: Jakub Jelinek <ja...@redhat.com>
Sent: 04 November 2024 21:44
To: Prathamesh Kulkarni <prathame...@nvidia.com>
Cc: Richard Biener <richard.guent...@gmail.com>; Richard Biener
<rguent...@suse.de>; gcc@gcc.gnu.org; Thomas Schwinge
<tschwi...@baylibre.com>
Subject: Re: [RFC] Enabling SVE with offloading to nvptx


On Sat, Nov 02, 2024 at 03:53:34PM +0000, Prathamesh Kulkarni wrote:
The attached patch adds a new bitfield needs_max_vf_lowering to loop,
and sets it in expand_omp_simd for loops that need delayed lowering
of safelen and omp simd arrays.  The patch defines a new macro
OMP_COMMON_MAX_VF (arbitrarily set to 16) as a placeholder value for
max_vf (instead of INT_MAX), which is later replaced by the
appropriate max_vf during the omp_adjust_max_vf pass.  Does that look
OK?

No.
The thing is, if the user doesn't specify safelen, it defaults to
infinity (which we represent as INT_MAX); if the user specifies it,
then that is the maximum for it (currently in the OpenMP specification
it is just an integral value, so it can't be a poly_int).
The lowering then uses max_vf as another limit, what the hw can do at
most, and sizes the magic arrays with it.  So one needs to use the
minimum of what the user specified and what the hw can handle.
Using 16 as some magic value is therefore just wrong: safelen(16) can
be specified in the source as well, or safelen(8), or safelen(32), or
safelen(123).

Thus, the fact that the hw maximum hasn't been determined yet needs
to be represented in some other flag, not in the loop->safelen value;
until it is determined, loop->safelen should represent what the user
wrote (or what was implied), and the later pass should use the minimum
of loop->safelen and the picked hw maximum.  Of course, if the picked
hw maximum is POLY_INT-ish, the big question is how to compare it
against the user-supplied integer value: either one can just handle
the INT_MAX (aka infinity) special case, or, say, query the backend
for the maximum runtime value of the POLY_INT and only use the
POLY_INT if it is always known to be smaller than or equal to the
user-supplied safelen.
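To illustrate, the comparison could look roughly like this (an
illustrative sketch only, not actual GCC code; choose_max_vf and
hw_max_vf are made-up names):

  /* Illustrative sketch only: combine the user-written safelen with a
     possibly POLY_INT hardware maximum.  loop->safelen == INT_MAX
     means the user did not specify safelen (i.e. infinity).  */
  static poly_uint64
  choose_max_vf (class loop *loop, poly_uint64 hw_max_vf)
  {
    /* No user limit: the hw maximum is the only constraint.  */
    if (loop->safelen == INT_MAX)
      return hw_max_vf;
    /* Use the POLY_INT maximum only if it can never exceed the
       user-supplied safelen at runtime; otherwise keep the user
       value.  */
    if (known_le (hw_max_vf, (poly_uint64) loop->safelen))
      return hw_max_vf;
    return (poly_uint64) loop->safelen;
  }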

Another thing (already mentioned in the thread Andrew referenced) is
that max_vf is used in two separate places.  One use is just to size
the magic arrays and to act as one operand of the minimum (the other
being the user-specified safelen).  In this case, it is generally fine
to pick a larger value than strictly necessary (as long as it is never
larger than the user-supplied safelen).
The other case is the simd modifier on the schedule clause.  That
value had better be the right one or slightly larger, but not too much
larger.
I think currently we just use the INTEGER_CST we pick as the maximum;
if this sizing is deferred, maybe it needs to be another internal
function that queries the value (though it can refer to a loop vf in
another function, which complicates things).
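To make the schedule-clause case concrete, the adjustment there is
essentially a round-up of the chunk size to a multiple of the VF; a
simplified model of what omp_adjust_chunk_size computes (not actual
GCC code, and assuming a power-of-two vf) is:

  /* Simplified model of the chunk-size adjustment for schedule
     clauses with the simd modifier: round the chunk size up to a
     multiple of the vectorization factor (assumed to be a power of
     two).  If the VF is deferred, the 'vf' operand would have to come
     from some later query instead of an INTEGER_CST.  */
  static inline unsigned
  adjust_chunk_size (unsigned chunk, unsigned vf)
  {
    return (chunk + vf - 1) & -vf;
  }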

Regarding Richi's question, I'm afraid the OpenMP simd loop lowering
can't be delayed until some later pass.

Hi Jakub,
Thanks for the suggestions! The attached patch makes the following changes:
(1) It delays setting safelen for offloading by introducing a new
bitfield needs_max_vf_lowering in loop, which is true with offloading
enabled; safelen is then set to min(safelen, max_vf) for the target
later, in the omp_device_lower pass.
Comparing a user-specified safelen with a poly_int max_vf may not
always be possible at compile time (say, 32 vs. 16+16x), and even if
we determined the runtime VL from -mcpu flags, I guess relying on that
wouldn't be portable?
The patch works around this by taking constant_lower_bound (max_vf)
and comparing that with safelen instead, with the downside that
constant_lower_bound (max_vf) will not be the optimal max_vf for an
SVE target that implements a SIMD width > 128 bits.
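Concretely, the workaround amounts to roughly the following in the
later pass (a sketch only; the exact way the target's max_vf is
obtained may differ):

  /* Sketch of the workaround: clamp safelen with the constant lower
     bound of the target's (possibly POLY_INT) max_vf.  For SVE's
     16+16x this picks 16, which is why a wider-than-128-bit SVE
     implementation loses some of the available vector length.  */
  poly_uint64 max_vf = omp_max_vf ();  /* however the target max is obtained */
  loop->safelen = MIN (loop->safelen,
                       (int) constant_lower_bound (max_vf));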

(2) Since max_vf is used as the length of the omp simd array, it gets
streamed out to the device, and device compilation fails during
stream-in if max_vf is a poly_int (16+16x) and the device's
NUM_POLY_INT_COEFFS < 2 (which is what motivated my patch).  The patch
tries to address this by simply setting the length to a placeholder
value (INT_MAX?) in lower_rec_simd_input_clauses if offloading is
enabled; it is later set to the appropriate value in the
omp_device_lower pass.
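The idea in (2) is roughly the following (illustrative names only, not
the actual patch):

  /* Sketch of (2): when offloading is enabled, size the omp simd
     array with a placeholder constant so that no POLY_INT length is
     streamed to the device; the ompdevlow pass later replaces it with
     the target's real max_vf.  'offloading_enabled_p', 'elt_type' and
     'target_max_vf' are placeholders for illustration.  */
  poly_uint64 nelts = offloading_enabled_p
                      ? (poly_uint64) INT_MAX   /* placeholder length */
                      : target_max_vf;
  tree atype = build_array_type_nelts (elt_type, nelts);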

(3) Andrew's patches seem to already fix adjusting chunk_size for the
schedule clause with the simd modifier, by introducing a new internal
function, .GOMP_MAX_VF, which is then replaced by the target's max_vf.
To keep this consistent with safelen, the patch here uses
constant_lower_bound (max_vf) there too.
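Presumably ompdevlow then folds each .GOMP_MAX_VF call to a constant;
a hypothetical sketch of that replacement (stmt/gsi stand for the call
statement being visited, and the exact way max_vf is queried may
differ):

  /* Hypothetical sketch, not taken from either patch: replace a call
     to the proposed .GOMP_MAX_VF internal function with a constant,
     using constant_lower_bound of the target's max_vf as described
     above for safelen.  */
  tree lhs = gimple_call_lhs (stmt);
  tree vf = build_int_cst (TREE_TYPE (lhs),
                           constant_lower_bound (omp_max_vf ()));
  gsi_replace (&gsi, gimple_build_assign (lhs, vf), true);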

The patch passes libgomp testing for AArch64/nvptx offloading (with
and without a GPU).
Does it look OK?

I've not reviewed the patch in detail, but I can confirm that it does not break my use case or cause any test regressions for me.

However, are you sure that ompdevlow always runs?  I think you need to add this somewhere:

  cfun->curr_properties &= ~PROP_gimple_lomp_dev;

My patch added this in omp_adjust_chunk_size, which triggers it for certain schedule clauses, but I think you want it whenever the safelen adjustment is needed?
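E.g., wherever the delayed lowering is recorded, something like (a
sketch only, tied to the bitfield this patch introduces):

  /* Sketch: when the delayed lowering is recorded (e.g. when
     needs_max_vf_lowering is set on the loop), also clear the
     property so that pass_omp_device_lower is not skipped for this
     function.  */
  if (loop->needs_max_vf_lowering)
    cfun->curr_properties &= ~PROP_gimple_lomp_dev;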

Andrew


Thanks,
Prathamesh

         Jakub

