RE: [RFC] Enabling SVE with offloading to nvptx

Prathamesh Kulkarni via Gcc Wed, 20 Nov 2024 21:10:16 -0800


> -----Original Message-----
> From: Gcc <[email protected]> On Behalf
> Of Prathamesh Kulkarni via Gcc
> Sent: 14 November 2024 13:59
> To: Andrew Stubbs <[email protected]>; Jakub Jelinek <[email protected]>
> Cc: Richard Biener <[email protected]>; Richard Biener
> <[email protected]>; [email protected]; Thomas Schwinge
> <[email protected]>
> Subject: RE: [RFC] Enabling SVE with offloading to nvptx
> 
> External email: Use caution opening links or attachments
> 
> 
> > -----Original Message-----
> > From: Andrew Stubbs <[email protected]>
> > Sent: 12 November 2024 20:23
> > To: Prathamesh Kulkarni <[email protected]>; Jakub Jelinek
> > <[email protected]>
> > Cc: Richard Biener <[email protected]>; Richard Biener
> > <[email protected]>; [email protected]; Thomas Schwinge
> > <[email protected]>
> > Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> >
> > External email: Use caution opening links or attachments
> >
> >
> > On 12/11/2024 06:01, Prathamesh Kulkarni via Gcc wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Jakub Jelinek <[email protected]>
> > >> Sent: 04 November 2024 21:44
> > >> To: Prathamesh Kulkarni <[email protected]>
> > >> Cc: Richard Biener <[email protected]>; Richard Biener
> > >> <[email protected]>; [email protected]; Thomas Schwinge
> > >> <[email protected]>
> > >> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> > >>
> > >> External email: Use caution opening links or attachments
> > >>
> > >>
> > >> On Sat, Nov 02, 2024 at 03:53:34PM +0000, Prathamesh Kulkarni
> > wrote:
> > >>> The attached patch adds a new bitfield needs_max_vf_lowering to
> > >> loop,
> > >>> and sets that in expand_omp_simd for loops that need delayed
> > >> lowering
> > >>> of safelen and omp simd arrays.  The patch defines a new macro
> > >>> OMP_COMMON_MAX_VF (arbitrarily set to 16), as a placeholder
> value
> > >> for
> > >>> max_vf (instead of INT_MAX), and is later replaced by
> appropriate
> > >>> max_vf during omp_adjust_max_vf pass.  Does that look OK ?
> > >>
> > >> No.
> > >> The thing is, if user doesn't specify safelen, it defaults to
> > >> infinity (which we represent as INT_MAX), if user specifies it,
> > then
> > >> that is the maximum for it (currently in OpenMP specification it
> is
> > >> just an integral value, so can't be a poly int).
> > >> And then the lowering uses the max_vf as another limit, what the
> hw
> > >> can do at most and sizes the magic arrays with it.  So, one needs
> > to
> > >> use minimum of what user specified and what the hw can handle.
> > >> So using 16 as some magic value is just wrong, safelen(16) can be
> > >> specified in the source as well, or safelen(8), or safelen(32) or
> > >> safelen(123).
> > >>
> > >> Thus, the fact that the hw minimum hasn't been determined yet
> needs
> > >> to be represented in some other flag, not in loop->safelen value,
> > and
> > >> before that is determined, loop->safelen should then represent
> what
> > >> the user wrote (or was implied) and the later pass should use
> > minimum
> > >> from loop->safelen and the picked hw maximum.  Of course if the
> > >> picked hw maximum is POLY_INT-ish, the big question is how to
> > compare
> > >> that against the user supplied integer value, either one can just
> > >> handle the INT_MAX (aka
> > >> infinity) special case, or say query the backend on what is the
> > >> maximum value of the POLY_INT at runtime and only use the
> POLY_INT
> > if
> > >> it is always known to be smaller or equal to the user supplied
> > >> safelen.
> > >>
> > >> Another thing (already mentioned in the thread Andrew referenced)
> > is
> > >> that max_vf is used in two separate places.  One is just to size
> of
> > >> the magic arrays and one of the operands of the minimum (the
> other
> > is
> > >> user specified safelen).  In this case, it is generally just fine
> > to
> > >> pick later value than strictly necessary (as long as it is never
> > >> larger than user supplied safelen).
> > >> The other case is simd modifier on schedule clause.  That value
> > >> should better be the right one or slightly larger, but not too
> > much.
> > >> I think currently we just use the INTEGER_CST we pick as the
> > maximum,
> > >> if this sizing is deferred, maybe it needs to be another internal
> > >> function that asks the value (though, it can refer to a loop vf
> in
> > >> another function, which complicates stuff).
> > >>
> > >> Regarding Richi's question, I'm afraid the OpenMP simd loop
> > lowering
> > >> can't be delayed until some later pass.
> > > Hi Jakub,
> > > Thanks for the suggestions! The attached patch makes the following
> > changes:
> > > (1) Delays setting of safelen for offloading by introducing a new
> > > bitfield needs_max_vf_lowering in loop, which is true with
> > offloading enabled, and safelen is then set to min(safelen, max_vf)
> > for the target later in omp_device_lower pass.
> > > Comparing user-specified safelen with poly_int max_vf may not be
> > > always possible at compile-time (say 32 and 16+16x), and even if
> we
> > determine runtime VL based on -mcpu flags, I guess relying on that
> > won't be portable ?
> > > The patch works around this by taking constant_lower_bound
> (max_vf),
> > > and comparing it with safelen instead, with the downside that
> > constant_lower_bound(max_vf) will not be the optimal max_vf for SVE
> > target if it implements SIMD width > 128 bits.
> > >
> > > (2) Since max_vf is used as length of omp simd array, it gets
> > streamed
> > > out to device, and device compilation fails during streaming-in if
> > > max_vf is poly_int (16+16x), and device's NUM_POLY_INT_COEFFS < 2
> > (which motivated my patch). The patch tries to address this by
> simply
> > setting length to a placeholder value (INT_MAX?) in
> > lower_rec_simd_input_clauses if offloading is enabled, and will be
> > later set to appropriate value in omp_device_lower pass.
> > >
> > > (3) Andrew's patches seems to already fix the case for adjusting
> > > chunk_size for schedule clause with simd modifier by introducing a
> > new internal function .GOMP_MAX_VF, which is then replaced by
> target's
> > max_vf. To keep it consistent with safelen, the patch here uses
> > constant_lower_bound (max_vf) too.
> > >
> > > Patch passes libgomp testing for AArch64/nvptx offloading (with
> and
> > without GPU).
> > > Does it look OK ?
> >
> > I've not reviewed the patch in detail, but I can confirm that this
> > does not break my usecase or cause any test regressions, for me.
> Hi Andrew,
> Thanks for testing the patch!
> >
> > However, are you sure that ompdevlow is always running? I think you
> > need to add this, somewhere:
> >
> >    cfun->curr_properties &= ~PROP_gimple_lomp_dev;
> >
> > My patch added this into omp_adjust_chunk_size, which triggers it
> for
> > certain schedule clauses, but I think you want it whenever safelen
> is
> > needed?
> Ah indeed, thanks for the suggestions.
> I have tried to fix it in the attached patch.
Hi Jakub,
Could you please take a look at this patch:
https://gcc.gnu.org/pipermail/gcc/2024-November/245139.html


It makes the following changes:
(1) Delays setting of safelen for offloading by introducing a new bitfield 
needs_max_vf_lowering in loop, which is true with offloading enabled,
and safelen is then set to min(safelen, max_vf) for the target later in 
omp_device_lower pass.

Comparing user-specified safelen with poly_int max_vf may not be always 
possible at compile-time (say 32 and 16+16x), and even if we determine runtime 
VL
based on -mcpu flags, I guess relying on that won't be portable ? The patch 
works around this by taking constant_lower_bound (max_vf), and comparing it with
safelen instead, with the downside that constant_lower_bound(max_vf) will not 
be the optimal max_vf for SVE target if it implements SIMD width > 128 bits.

(2) Since max_vf is used as length of omp simd array, it gets streamed out to 
device, and device compilation fails during streaming-in if max_vf is poly_int 
(16+16x),
and device's NUM_POLY_INT_COEFFS < 2 (which motivated my patch). The patch 
tries to address this by simply setting length to a placeholder value 
(INT_MAX?) in
lower_rec_simd_input_clauses if offloading is enabled, and will be later set to 
appropriate value in omp_device_lower pass.

(3) Andrew's patches seems to already fix the case for adjusting chunk_size for 
schedule clause with simd modifier by introducing a new internal function 
.GOMP_MAX_VF,
which is then replaced by target's max_vf. To keep it consistent with safelen, 
the patch here uses constant_lower_bound (max_vf) too.

Patch passes libgomp testing for AArch64/nvptx offloading (with and without 
GPU).
Does it look OK ?

Thanks,
Prathamesh
> 
> Thanks,
> Prathamesh
> >
> > Andrew
> >
> > >
> > > Thanks,
> > > Prathamesh
> > >>
> > >>          Jakub
> > >

RE: [RFC] Enabling SVE with offloading to nvptx

Reply via email to