date:20241018

Re: [RFC][AArch64] Defining lrotm3 optabs for SVE modes for TARGET_SVE2?

2024-10-18 Thread Richard Sandiford via Gcc

Kyrylo Tkachov  writes:
> Hello,
>
> I’ve been optimizing various code sequences relating to vector rotates 
> recently.
> I ended up proposing we expand the vector-rotate-by-immediate optab rotlm3 for
> the Advanced SIMD (Neon) modes here:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665635.html
> This expands to a ROTATE RTL code that can be later combined into more complex
> instructions like XAR and for certain rotate amounts can be optimized in a 
> single instruction.
> If they fail to be optimized then a splitter breaks it down into an SHL + 
> USRA pair.
>
> For SVE, because we have predicates in the general case it’s not feasible to
> detect these rotates at the RTL level, so I was hoping that GIMPLE could do
> it, and indeed GIMPLE has many places where it can detect rotate idioms:
> forwprop1, bswap detection, pattern matching in the vectorizer, match.pd for
> simple cases etc.
> The vectorizer is probably a good place to do it (rather than asking the
> other places to deal with VLA types) but I think it would need the target to
> affirm that it supports SVE vector rotates through the rotlm3 optab, hence
> my question.
>
> Though some rotate amounts can be implemented with a single instruction
> (REVB, REVH, REVW), the fallback expansion for TARGET_SVE2 would be a
> two-instruction LSL+USRA which is better than what we currently emit in the
> motivating test case:
> https://godbolt.org/z/o55or8hYv
> We currently cannot combine the LSL+LSR+ORR sequence because the predicates
> get in the way during combine (even though the instructions involved are
> actually unpredicated and the predicate would get dropped later anyway).
> It would also allow us to keep an RTL-level ROTATE long enough to combine it
> into the XAR and RAX instructions from TARGET_SVE2_SHA3.
>
> Finally, it would allow us to experiment with more optimal SVE-specific 
> rotate sequences in the future.
> For example, we could consider emitting high-throughput TBLs for rotates that 
> are a multiple of 8.
>
> I’m suggesting doing this for TARGET_SVE2 as we have the combined USRA
> instruction there, but I wouldn’t object to doing this for TARGET_SVE.

I think there are three cases here:

(1) Using permutes for rotates.  That part on its own could be a
target-independent optimisation.  I imagine other targets without
native rotate support would benefit.

(2) Encouraging the use of XAR.  I suppose the question here is:
is XAR so good that we can consider using it instead of LSL/USRA
even when the XOR part isn't needed?  That is, when XAR is available,
one way of implementing the rotate optab would be to zero the
destination register (hopefully free) and then use XAR itself as
the rotate instruction.

If that's a win, then defining the optab like that sounds good.

If it's not a win, then we could end up being too aggressive about
forming XAR in general, since XORs fold with other things too.

(3) Using LSL+USRA for SVE2.

IIUC, one part of the combine issue is the old "should we use IOR,
or should we use PLUS?", for cases where both are equivalent.
Is that right?  I.e. target-independent code normally expands
rotates using two shifts and an ior_optab, but for aarch64 it would
be better to use add_optab.  And the only reason that add_optab is
better is because we then want to combine the addition with one of
the shifts.

If so, then yeah, that does sound too complex to handle in a
target-independent way.  But, given (2), it feels like a separate
issue from XAR/RAX optimisation.
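
To make cases (2) and (3) concrete, here is a minimal scalar sketch in C.
It is only a model under stated assumptions: xar_model and usra_model are
hypothetical helpers that mimic one 32-bit lane of XAR (XOR, then rotate
right by an immediate) and USRA (shift right, then accumulate), and they
ignore predication and the real immediate-encoding limits.  The check shows
that XAR with a zeroed operand acts as a plain rotate, and that the two
shifted halves never overlap, so combining them with an addition gives the
same result as combining them with an inclusive OR.

```c
#include <assert.h>
#include <stdint.h>

/* Reference rotate-left on one 32-bit lane, valid for n in 1..31. */
static uint32_t rotl32_ref(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Hypothetical scalar model of XAR: XOR the inputs, then rotate right
   by imm (1..31).  Encoding limits of the real instruction are ignored. */
static uint32_t xar_model(uint32_t a, uint32_t b, unsigned imm)
{
    uint32_t v = a ^ b;
    return (v >> imm) | (v << (32 - imm));
}

/* Case (2): a rotate-left by n expressed as XAR with a zeroed operand. */
static uint32_t rotl32_via_xar(uint32_t x, unsigned n)
{
    return xar_model(x, 0, 32 - n);
}

/* Hypothetical scalar model of USRA: shift right, then add to acc. */
static uint32_t usra_model(uint32_t acc, uint32_t x, unsigned shift)
{
    return acc + (x >> shift);
}

/* Case (3): LSL followed by USRA.  The left-shifted value has its low n
   bits clear and the right-shifted value only uses those low n bits, so
   the addition cannot carry and equals the inclusive OR. */
static uint32_t rotl32_via_lsl_usra(uint32_t x, unsigned n)
{
    uint32_t shifted = x << n;
    return usra_model(shifted, x, 32 - n);
}

/* Compare both formulations against the reference for every amount. */
static void check_rotate_models(void)
{
    const uint32_t samples[] = { 0x00000000u, 0x12345678u,
                                 0xdeadbeefu, 0xffffffffu };
    for (unsigned s = 0; s < 4; s++) {
        for (unsigned n = 1; n < 32; n++) {
            uint32_t want = rotl32_ref(samples[s], n);
            assert(rotl32_via_xar(samples[s], n) == want);
            assert(rotl32_via_lsl_usra(samples[s], n) == want);
        }
    }
}
```

Calling check_rotate_models() exercises every rotate amount from 1 to 31 on
a few sample values.  Whether XAR-with-zero is actually cheaper than
LSL+USRA on a given core is exactly the open question in (2); the sketch
only shows that the two are interchangeable in terms of the value computed.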

I suppose (1) is somewhat in conflict with (2) and (3).  We'd presumably
still want to use permute-based rotates where possible.  We might even
want to avoid modelling that case as a rotate rtx in the RTL stream,
in case we lose the nice permute to some "simplification".  (Not sure
either way on that last part though.)
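
As an illustration of the permute-based rotates in (1), here is a small
scalar sketch in C.  It assumes a 32-bit element and a rotate amount that is
a multiple of 8; rotl32_by_bytes is a hypothetical helper, not anything in
GCC.  It builds the result purely by moving whole bytes, which is the
property a TBL-style permute would rely on.

```c
#include <assert.h>
#include <stdint.h>

/* Reference rotate-left on a 32-bit element; n is taken modulo 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Rotate left by 8*k bits using only a byte permutation.  Counting bytes
   from the least significant end, byte j of the result is byte
   (j - k) mod 4 of the input. */
static uint32_t rotl32_by_bytes(uint32_t x, unsigned k)
{
    uint32_t result = 0;
    for (unsigned j = 0; j < 4; j++) {
        unsigned src = (j + 4 - (k & 3)) & 3;
        uint32_t byte = (x >> (8 * src)) & 0xff;
        result |= byte << (8 * j);
    }
    return result;
}

/* Check the permutation against the shift-based rotate. */
static void check_permute_rotate(void)
{
    const uint32_t samples[] = { 0x00000000u, 0x12345678u,
                                 0xdeadbeefu, 0xffffffffu };
    for (unsigned s = 0; s < 4; s++)
        for (unsigned k = 0; k < 4; k++)
            assert(rotl32_by_bytes(samples[s], k)
                   == rotl32(samples[s], 8 * k));
}
```

The index mapping depends only on k and the element size, never on the
data, so a vector form could use one constant permute selector per rotate
amount.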

Thanks,
Richard

gcc-13-20241018 is now available

2024-10-18 Thread GCC Administrator via Gcc

Snapshot gcc-13-20241018 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/13-20241018/
and on various mirrors, see https://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 13 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch 
releases/gcc-13 revision 074bea39b01b41b2eab33de051a85107aa43323b

You'll find:

 gcc-13-20241018.tar.xz   Complete GCC

  SHA256=b4de173ca71e52a8bbfb217dba5a86dd060a67909cabfc8c4920f1e3f533d62c
  SHA1=c5088d42526af5305f5c58eeeb39030e42aa65b0

Diffs from 13-20241011 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-13
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.

Re: BUG: realloc(p,0) is not conforming to C99/C11/C17/POSIX.1-2008

2024-10-18 Thread Jason Merrill via Gcc


On 10/17/24 12:09 PM, Alejandro Colomar via Gcc wrote:

> CC += JeanHeyd
>
> CC += Robert, Joseph, gcc@
>
> CC += Doug, as the author of the original malloc(3).


Please don't CC random people (and mailing lists) on a glibc bug report.

See https://sourceware.org/glibc/wiki/FilingBugs for bug reporting 
instructions.


Jason

RE: [RFC] Enabling SVE with offloading to nvptx

2024-10-18 Thread Prathamesh Kulkarni via Gcc



> -----Original Message-----
> From: Richard Biener 
> Sent: 17 October 2024 19:18
> To: Prathamesh Kulkarni 
> Cc: gcc@gcc.gnu.org; Thomas Schwinge 
> Subject: RE: [RFC] Enabling SVE with offloading to nvptx
> 
> External email: Use caution opening links or attachments
> 
> 
> On Thu, 17 Oct 2024, Prathamesh Kulkarni wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener 
> > > Sent: 16 October 2024 13:05
> > > To: Prathamesh Kulkarni 
> > > Cc: gcc@gcc.gnu.org; Thomas Schwinge 
> > > Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> > >
> > > External email: Use caution opening links or attachments
> > >
> > >
> > > On Tue, 15 Oct 2024, Prathamesh Kulkarni wrote:
> > >
> > > > Hi,
> > > > > Testing libgomp with SVE enabled (-mcpu=generic+sve2) results in
> > > > > ~60 UNRESOLVED errors with the following error message:
> > > >
> > > > > lto1: fatal error: degree of 'poly_int' exceeds 'NUM_POLY_INT_COEFFS'
> > > > > compilation terminated.
> > > > > nvptx mkoffload: fatal error:
> > > > > ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc
> > > > > returned 1 exit status
> > > > > compilation terminated.
> > > >
> > > > > This behaviour can be reproduced with the following simple test-case
> > > > > with -fopenmp -foffload=nvptx-none -mcpu=generic+sve2:
> > > >
> > > > #define N 1000
> > > > int main ()
> > > > {
> > > >   int i;
> > > >   int A[N] = {0}, B[N] = {0};
> > > >
> > > >   #pragma omp target map(i), map(tofrom: A), map(from: B)
> > > >   #pragma omp simd
> > > >   for (i = 0; i < N; i++)
> > > > >     A[i] = A[i] + B[i];
> > > >   return A[0];
> > > > }
> > > >
> > > > omplower pass lowers the above loop to the following:
> > > >
> > > > D.4576 = .GOMP_USE_SIMT ();
> > > > > if (D.4576 != 0) goto ; else goto ;
> > > > :
> > > > {
> > > >   unsigned int D.4586;
> > > >   unsigned int D.4587;
> > > >   int D.4588;
> > > >   void * simduid.5;
> > > >   void * .omp_simt.6;
> > > >   int D.4596;
> > > >   _Bool D.4597;
> > > >   int D.4598;
> > > >   unsigned int D.4599;
> > > >   int D.4600;
> > > >   int D.4601;
> > > >   int * D.4602;
> > > >   int i [value-expr: D.4588];
> > > >   int i.0;
> > > >
> > > > >   simduid.5 = .GOMP_SIMT_ENTER (simduid.5, &D.4588);
> > > > >   .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC (simduid.5);
> > > >   D.4587 = 0;
> > > >   i.0 = 0;
> > > > >   #pragma omp simd safelen(32) _simduid_(simduid.5) _simt_ linear(i.0:1) linear(i:1)
> > > >   for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1)
> > > >   ...
> > > > }
> > > > goto ;
> > > > :
> > > > {
> > > >   unsigned int D.4603;
> > > >   unsigned int D.4604;
> > > >   int D.4605[0:POLY_INT_CST [15, 16]];
> > > >   void * simduid.7;
> > > >   unsigned int D.4612;
> > > >   int * D.4613;
> > > >   int D.4614;
> > > >   int i [value-expr: D.4605[D.4604]];
> > > >   int i.0;
> > > >
> > > >   D.4604 = 0;
> > > >   i.0 = 0;
> > > > >   #pragma omp simd safelen(POLY_INT_CST [16, 16]) _simduid_(simduid.7) linear(i.0:1) linear(i:1)
> > > >   ...
> > > >  }
> > > >  :
> > > >  ...
> > > >
> > > > > For offloading to SIMT based device like nvptx, scan_omp_simd
> > > > > duplicates lowering of simd pragma into if-else where the if-part
> > > > > contains simt code-path, and else-part contains simd code-path. In
> > > > > lower_rec_simd_input_clauses, max_vf is set to 16+16x for the above
> > > > > case as determined by omp_max_vf, and that becomes length of the omp
> > > > > simd array:
> > > > > int D.4605[0:POLY_INT_CST [15, 16]];
> > > >
> > > > > The issue here is that, the function containing above if-else
> > > > > condition gets streamed out to LTO bytecode including the simd
> > > > > code-path and the omp simd array, whose domain is
> > > > > [0:POLY_INT_CST[15, 16]], and thus we get the above error while
> > > > > streaming-in POLY_INT_CST in lto_input_ts_poly_tree_pointers on
> > > > > device side.
> > > >
> > > > > Note that, the simd code-path is essentially dead-code on nvptx,
> > > > > since .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass,
> > > > > and later optimization passes (ccp2) remove the dead-code path and
> > > > > unused omp simd arrays while compiling to device. So in this case, we
> > > > > aren't really mapping POLY_INT_CST from host to device, but it gets
> > > > > streamed out t

Re: Alex Coplan appointed maintainer of AArch64 pair fusion pass and pair-fusion pass.

2024-10-18 Thread Alex Coplan via Gcc

On 17/10/2024 16:51, Richard Sandiford wrote:
> Ramana Radhakrishnan via Gcc  writes:
> > I am pleased to announce that the GCC Steering Committee has appointed
> > Alex Coplan as a maintainer for the AArch64 load / store pair fusion
> > pass.
> >
> > In addition the steering committee has also appointed him as
> > maintainer for the Pair Fusion pass in the target independent
> > portions.
> >
> > Please join me in congratulating Alex on his new roles. Alex, please
> > update your listing in the MAINTAINERS file.
> 
> Congrats Alex!  Well deserved :)

Thank you for your trust, I will push a patch to MAINTAINERS shortly.

Alex

> 
> Just wanted to add that, IMO, the AArch64 parts should be taken to
> include things like aarch64-ldpstp.md, other related peepholes,
> and the LDP/STP support in aarch64.cc.
> 
> Richard
