Questions about macro fusion pass

2025-01-08 Thread Hau Hsu via Gcc
Hi,

I have a question about GCC's macro fusion pass.
In the GCC internals doc, there is a hook for scheduling:
TARGET_SCHED_MACRO_FUSION_PAIR_P

It says:
> If this hook returns true for the given insn pair (prev and curr),
> the scheduler will put them into a sched group, and they will not be
> scheduled apart.

My questions are: does this mean that GCC only marks insns that are
already back-to-back (as the names *prev* and *curr* suggest) so that
they won't be separated during scheduling? Or does GCC also search for
fusible insns that are apart and move them together?

I also traced the code in sched_macro_fuse_insns(), but it seems to
only group back-to-back insns ...

Thanks.

Hau
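
A toy model of the behaviour the question describes, offered as a sketch and not as GCC code. It assumes the reading above is right: only adjacent pairs are tested, a matching pair is marked as a group, and nothing is searched for or moved. `Insn`, `fusible_pair_p` and `mark_fusion_groups` are hypothetical names for illustration; the real logic in sched_macro_fuse_insns() and the target hook is not reproduced here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an insn; not a GCC type.
struct Insn {
  std::string op;
  bool grouped_with_prev;  // models a "sched group" mark on curr
};

// Hypothetical predicate playing the role of the target hook:
// here a compare directly followed by a branch counts as fusible.
bool fusible_pair_p(const Insn &prev, const Insn &curr) {
  return prev.op == "cmp" && curr.op == "branch";
}

// Test only adjacent (prev, curr) pairs and mark curr when the pair
// is fusible.  No insn is reordered to create a new pair.
void mark_fusion_groups(std::vector<Insn> &insns) {
  for (std::size_t i = 1; i < insns.size(); ++i)
    if (fusible_pair_p(insns[i - 1], insns[i]))
      insns[i].grouped_with_prev = true;
}
```

In this model a sequence cmp, add, branch gets no mark at all, because the fusible insns are not adjacent when the pass runs. Whether GCC behaves the same way is exactly what the question asks.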


Re: Link-time resolved constant

2025-01-08 Thread The Cuthour via Gcc



And I heard "That's already here. ELF gABI does the same thing."
Anyway, I can't implement it myself. I hope for a wide discussion.
https://gcc.gnu.org/pipermail/gcc/2025-January/245358.html


On 2025/01/08 22:08, Trampas Stern wrote:
> Go do it!
> Do it so well and with such high quality that no one can ignore the
> results!
>
> On Tue, Jan 7, 2025 at 1:26 PM The Cuthour wrote:
>
> > Thank you.
> >
> > My plan is to make a higher compatibility version of ELF.
> > You may be able to call it ELF2.0
> >


Re: Link-time resolved constant

2025-01-08 Thread Trampas Stern via Gcc
Go do it!
Do it so well and with such high quality that no one can ignore the
results!

On Tue, Jan 7, 2025 at 1:26 PM The Cuthour wrote:

> Thank you.
>
> My plan is to make a higher compatibility version of ELF.
> You may be able to call it ELF2.0
>


RE: [RFC] Enabling SVE with offloading to nvptx

2025-01-08 Thread Prathamesh Kulkarni via Gcc



> -----Original Message-----
> From: Gcc On Behalf Of Prathamesh Kulkarni via Gcc
> Sent: 27 December 2024 18:00
> To: Jakub Jelinek
> Cc: Andrew Stubbs; Richard Biener; Richard Biener; gcc@gcc.gnu.org;
> Thomas Schwinge
> Subject: RE: [RFC] Enabling SVE with offloading to nvptx
>
> > -----Original Message-----
> > From: Jakub Jelinek
> > Sent: 17 December 2024 19:09
> > To: Prathamesh Kulkarni
> > Cc: Andrew Stubbs; Richard Biener; Richard Biener; gcc@gcc.gnu.org;
> > Thomas Schwinge
> > Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> >
> > On Mon, Dec 02, 2024 at 11:17:08AM +, Prathamesh Kulkarni wrote:
> > > --- a/gcc/cfgloop.h
> > > +++ b/gcc/cfgloop.h
> > > @@ -233,6 +233,12 @@ public:
> > >   flag_finite_loops or similar pragmas state.  */
> > >unsigned finite_p : 1;
> > >
> > > +  /* True if SIMD loop needs delayed lowering of artefacts like
> > > +     safelen and length of omp simd arrays that depend on target's
> > > +     max_vf.  This is true for offloading, when max_vf is computed after
> > > +     streaming out to device.  */
> > > +  unsigned needs_max_vf_lowering: 1;
> >
> > Consistency, finite_p above uses space before :, the above line
> > doesn't.
> >
> > > --- a/gcc/omp-expand.cc
> > > +++ b/gcc/omp-expand.cc
> > > @@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
> > >loop->latch = cont_bb;
> > >add_loop (loop, l1_bb->loop_father);
> > >loop->safelen = safelen_int;
> > > +  loop->needs_max_vf_lowering = is_in_offload_region (region);
> > > +  if (loop->needs_max_vf_lowering)
> > > +    cfun->curr_properties &= ~PROP_gimple_lomp_dev;
> >
> > Do you really need this for non-SVE arches?
> > I mean, could you not set loop->needs_max_vf_lowering if maximum
> > number of poly_int coeffs is 1?  Or if omp_max_vf returns constant or
> > something similar?
> Well, I guess the issue is not really about VLA vectors but when host
> and device have different max_vf, and selecting optimal max_vf is not
> really possible during omp-low/omp-expand, since we don't have
> device's target info available at this point. Andrew's recent patch
> works around this limitation by searching for "amdgcn" in
> OFFLOAD_TARGET_NAMES in omp_max_vf, but I guess a more general
> solution would be to delay lowering max_vf after streaming-out to
> device irrespective of VLA/VLS vectors?
> For AArch64/nvptx offloading with SVE, where host is VLA and device is
> VLS, the issue is more pronounced (failing to compile), compared to
> offloading from VLS host to VLS device (selecting sub-optimal max_vf).
> >
> > > --- a/gcc/omp-offload.cc
> > > +++ b/gcc/omp-offload.cc
> > > @@ -2617,6 +2617,77 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
> > >return NULL_TREE;
> > >  }
> > >
> > > +/* Compute max_vf for target, and accordingly set loop->safelen and length
> > > +   of omp simd arrays.  */
> > > +
> > > +static void
> > > +adjust_max_vf (function *fun)
> > > +{
> > > +  if (!fun->has_simduid_loops)
> > > +return;
> > > +
> > > +  poly_uint64 max_vf = omp_max_vf (false);
> > > +
> > > +  /* Since loop->safelen has to be an integer, it's not always possible
> > > +     to compare against poly_int.  For eg 32 and 16+16x are not comparable at
> > > +     compile-time because 16+16x <= 32 for x < 2, but 16+16x > 32 for x >= 2.
> > > +     Even if we could get runtime VL based on -mcpu/-march, that would not be
> > > +     portable across other SVE archs.
> > > +
> > > +     For now, use constant_lower_bound (max_vf), as a "safer approximation" to
> > > +     max_vf that avoids these issues, with the downside that it will be
> > > +     suboptimal max_vf for SVE archs implementing SIMD width > 128
> > > +     bits.  */
> > > +
> > > +  uint64_t max_vf_int;
> > > +  if (!max_vf.is_constant (&max_vf_int))
> > > +max_vf_int = constant_lower_bound (max_vf);
> > > +
> > > +  calculate_dominance_info (CDI_DOMINATORS);
> > > +  for (auto loop: loops_list (fun, 0))
> > > +{
> > > +  if (!loop->needs_max_vf_lowering)
> > > + continue;
> > > +
> > > +  if (loop->safelen > max_vf_int)
> > > + loop->safelen = max_vf_int;
> > > +
> > > +  basic_block *bbs = get_loop_body (loop);
> >
> > I still think using the tree-vectorizer.cc infrastructure is much
> > better here.
> > There is no guarantee all accesses to the simd arrays will be within
> > the loop body, in fact, none of them could be there.  Consider e.g.
> > parts of loop body (in the C meaning) followed by noreturn calls,
> > those aren't considered loop body in the cfg.
> > So, I think it is much better to walk the whole function once, not for
> > each loop walk its loop body (that could be even more expensive if
> > there are nested 
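
The comparison problem in the quoted patch comment (32 versus 16+16x) can be illustrated with a toy two-coefficient value. This is a sketch under stated assumptions, not GCC's poly_int: `ToyPoly`, `toy_known_le`, `toy_constant_lower_bound` and `toy_clamp_safelen` are hypothetical names, and x is assumed to be a non-negative integer known only at run time.

```cpp
#include <cstdint>

// Toy model of the value c0 + c1 * x, where x >= 0 is only known at
// run time.  Not GCC's poly_int; all names here are hypothetical.
struct ToyPoly {
  std::uint64_t c0;
  std::uint64_t c1;
};

// True only if a <= b holds for every x >= 0.
bool toy_known_le(ToyPoly a, ToyPoly b) {
  return a.c0 <= b.c0 && a.c1 <= b.c1;
}

// Smallest value the poly can take, reached at x == 0.
std::uint64_t toy_constant_lower_bound(ToyPoly p) { return p.c0; }

// Clamp an integer safelen against the conservative bound, mirroring
// the "if (safelen > max_vf_int) safelen = max_vf_int" step in the
// quoted patch.
std::uint64_t toy_clamp_safelen(std::uint64_t safelen, ToyPoly max_vf) {
  std::uint64_t bound = toy_constant_lower_bound(max_vf);
  return safelen > bound ? bound : safelen;
}
```

With max_vf modelled as {16, 16} and the constant 32 as {32, 0}, `toy_known_le` is false in both directions, which is the "not comparable at compile-time" case. Clamping a safelen of 32 against the lower bound then gives 16, the conservative choice the patch comment describes as suboptimal for wider vectors.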

nvptx: For '-march=sm_52' and higher, default at least to '-mptx=7.3' (was: Raise nvptx code generation to default PTX ISA 7.3, sm_52, therefore CUDA 11.3 (released 2021-04))

2025-01-08 Thread Thomas Schwinge
Hi!

On 2024-09-20T18:49:46+0200, I wrote:
> We'd like to raise nvptx code generation from PTX ISA 6.0, sm_30 "Kepler"
> to default PTX ISA 7.3, sm_52 "Maxwell", therefore CUDA 11.3 (2021-04).
> This is, primarily, so that we're able to use 'alloca' and related stack
> manipulation instructions, and improve upon the current:
>
> sorry ("target cannot support alloca");
>
> I see, for example:
>
>   - Ubuntu 22.04 "jammy" LTS has 11.5.1-1ubuntu1 packaged
>   - Debian 12 "stable" ("bookworm", 2023-06) has 11.8.89~11.8.0-5~deb12u1 packaged

Pushed to trunk branch commit b7f168644966d451fbe46ee9d06c9763a539c41b
"nvptx: For '-march=sm_52' and higher, default at least to '-mptx=7.3'",
see attached, so that we'll be able to use PTX 'alloca' for sm_52+.


Grüße
 Thomas
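
The selection rule from the commit title and the attached patch can be sketched as a small stand-alone function. This is an illustration under assumptions, not the nvptx.cc code: the integer encoding of versions, the names `kPtx60`, `kPtx73` and `toy_default_ptx_version`, and passing `first` in as a parameter are all hypothetical.

```cpp
#include <algorithm>

// Hypothetical encoding: PTX version major.minor as major * 10 + minor.
constexpr int kPtx60 = 60;
constexpr int kPtx73 = 73;

// `sm` is the numeric part of -march=sm_NN.  `first` is the lowest PTX
// version that supports that sm; it is passed in here because the
// lookup table is not part of the patch shown.
int toy_default_ptx_version(int sm, int first) {
  int res = first;
  res = std::max(res, kPtx60);    // never default below 6.0
  if (sm >= 52)
    res = std::max(res, kPtx73);  // at least 7.3 for sm_52 and higher
  return res;
}
```

A `first` value above 73 is returned unchanged, so the 7.3 floor only raises the default and never lowers it.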


From b7f168644966d451fbe46ee9d06c9763a539c41b Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Tue, 12 Nov 2024 16:26:15 +0100
Subject: [PATCH] nvptx: For '-march=sm_52' and higher, default at least to
 '-mptx=7.3'

	PR target/65181
	gcc/
	* config/nvptx/nvptx.cc (default_ptx_version_option): For
	'-march=sm_52' and higher, default at least to '-mptx=7.3'.
	* doc/invoke.texi (Nvidia PTX Options): Update '-mptx=[...]'.
	gcc/testsuite/
	* gcc.target/nvptx/march-map=sm_52.c: Adjust.
	* gcc.target/nvptx/march-map=sm_53.c: Likewise.
	* gcc.target/nvptx/march-map=sm_60.c: Likewise.
	* gcc.target/nvptx/march-map=sm_61.c: Likewise.
	* gcc.target/nvptx/march-map=sm_62.c: Likewise.
	* gcc.target/nvptx/march-map=sm_70.c: Likewise.
	* gcc.target/nvptx/march-map=sm_72.c: Likewise.
	* gcc.target/nvptx/march-map=sm_75.c: Likewise.
	* gcc.target/nvptx/march-map=sm_80.c: Likewise.
	* gcc.target/nvptx/march-map=sm_86.c: Likewise.
	* gcc.target/nvptx/march-map=sm_87.c: Likewise.
	* gcc.target/nvptx/march=sm_52.c: Likewise.
	* gcc.target/nvptx/march=sm_53.c: Likewise.
	* gcc.target/nvptx/march=sm_70.c: Likewise.
	* gcc.target/nvptx/march=sm_75.c: Likewise.
	* gcc.target/nvptx/march=sm_80.c: Likewise.
	* gcc.target/nvptx/mptx=_.c: Use '-march=sm_89'.
---
 gcc/config/nvptx/nvptx.cc                        |  4 ++++
 gcc/doc/invoke.texi                              |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march-map=sm_52.c |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march-map=sm_53.c |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march-map=sm_60.c |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march-map=sm_61.c |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march-map=sm_62.c |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march-map=sm_70.c |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march-map=sm_72.c |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march-map=sm_75.c |  4 ++--
 gcc/testsuite/gcc.target/nvptx/march-map=sm_80.c |  4 ++--
 gcc/testsuite/gcc.target/nvptx/march-map=sm_86.c |  4 ++--
 gcc/testsuite/gcc.target/nvptx/march-map=sm_87.c |  4 ++--
 gcc/testsuite/gcc.target/nvptx/march=sm_52.c     |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march=sm_53.c     |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march=sm_70.c     |  6 +++---
 gcc/testsuite/gcc.target/nvptx/march=sm_75.c     |  4 ++--
 gcc/testsuite/gcc.target/nvptx/march=sm_80.c     |  4 ++--
 gcc/testsuite/gcc.target/nvptx/mptx=_.c          | 10 +++++-----
 19 files changed, 54 insertions(+), 50 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index eb948d5b07e1..5860b3df6dd7 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -245,6 +245,10 @@ default_ptx_version_option (void)
  warp convergence.  */
   res = MAX (res, PTX_VERSION_6_0);
 
+  /* For sm_52+, pick at least 7.3.  */
+  if (ptx_isa_option >= PTX_ISA_SM52)
+    res = MAX (res, PTX_VERSION_7_3);
+
   /* Verify that we pick a version that supports the sm.  */
   gcc_assert (first <= res);
   return res;
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 480c48c5372a..4583181f4f53 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -30212,9 +30212,9 @@ Valid version strings are
 @samp{4.1}, @samp{4.2},
 @samp{6.0}, @samp{6.3},
 @samp{7.0}, @samp{7.3}, and @samp{7.8}.
-The default PTX ISA version is 6.0, unless a higher
-version is required for specified PTX ISA target architecture via
-option @option{-march=}.
+The default PTX ISA version is the one that added support for the
+selected PTX ISA target architecture, see @option{-march=}, but at
+least @samp{6.0}, or @samp{7.3} for @option{-march=sm_52} and higher.
 
 This option sets the values of the preprocessor macros
 @code{__PTX_ISA_VERSION_MAJOR__} and @code{__PTX_ISA_VERSION_MINOR__};
diff --git a/gcc/testsuite/gcc.target/nvptx/march-map=sm_52.c b/gcc/testsuite/gcc.target/nvptx/march-map=sm_52.c
index f37d13a8b088..027247810ecd 100644
--- a/gcc/testsuite/gcc.target/nvptx/march-map=sm_52.c
+++ b/gcc/testsuite/gcc.target/nvptx/march-map=sm_52.c
@@ -1,14 +1,14 @@
 /* { dg-do assemble } */
 /* { dg-options {-march-map=sm_52 -mptx=_} } */
 /* { dg-additional-options -save-temps } */
-/* { dg-final { scan-assembler-times {(?n)^