[Bug target/100208] amdgcn fails to build with llvm-mc from llvm12

2021-04-22 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100208

--- Comment #1 from Andrew Stubbs  ---
LLVM changed the default parameters, so we either have to change the
expectations in the ".amdgcn_target" string (which is basically an assert), or
set the attributes we want explicitly on the assembler command line.

(Or port binutils to amdgcn, but there's no plan for that.)

[Bug tree-optimization/84958] int loads not eliminated against larger stores

2020-10-15 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84958

--- Comment #6 from Andrew Stubbs  ---
(In reply to Tom de Vries from comment #5)
> I've removed the xfail for nvptx.
> 
> The only remaining xfail is for gcn.  Is that one still necessary?

The test still fails for gcn.

[Bug target/97521] [11 Regression] wrong code with -mno-sse2 since r11-3394

2020-10-23 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97521

--- Comment #21 from Andrew Stubbs  ---
(In reply to Richard Biener from comment #19)
> GCN also uses MODE_INT for the mask mode and thus may be similarly affected.
> Andrew - are the bits in the mask dense?  Thus for a V4SImode compare
> would the mask occupy only the lowest 4 bits of the DImode mask?

Yes, that's correct.

[Bug target/97521] [11 Regression] wrong code with -mno-sse2 since r11-3394

2020-10-23 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97521

--- Comment #22 from Andrew Stubbs  ---
(In reply to Andrew Stubbs from comment #21)
> (In reply to Richard Biener from comment #19)
> > GCN also uses MODE_INT for the mask mode and thus may be similarly affected.
> > Andrew - are the bits in the mask dense?  Thus for a V4SImode compare
> > would the mask occupy only the lowest 4 bits of the DImode mask?
> 
> Yes, that's correct.

Or rather, I should say that that *will* be the case when I add partial vector
support; right now it can only be done via masking V64SImode.

I have a patch set, but the last problem is that while_ult doesn't operate on
partial integer masks, leading to wrong code. AArch64 doesn't have a problem
with this because it uses VBI masks of the right size. I have a patch that adds
the vector size as an operand to while_ult; this seems to fix the problems on
GCN, but I need to make corresponding changes for AArch64 also before I can
submit those patches, and time is tight.

[Bug libgomp/97332] [gcn] GCN_NUM_GANGS/GCN_NUM_WORKERS override compile-time constants

2020-10-08 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97332

Andrew Stubbs  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2020-10-08
 Status|UNCONFIRMED |NEW

--- Comment #1 from Andrew Stubbs  ---
At the point the overrides are applied (run_kernel) the code only knows what
dimensions were selected at runtime, not how those figures were arrived at. It
then prints (with GCN_DEBUG set) the "launch attributes" and "launch actuals".

To fix this the overrides will have to be applied much earlier, and independently
for OpenACC (gcn_exec) and OpenMP (parse_target_attributes). Either that, or the
automatic balancing must be applied later. Or perhaps the original attributes
could be stored for later inspection (but GOMP_kernel_launch_attributes is
defined by libgomp). The "attributes" and "actuals" will need to be overhauled.
Probably get_group_size can be removed.

It ought to be doable though.

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-05 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

--- Comment #4 from Andrew Stubbs  ---
Alexandre's patch has this:

emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));

Is that generally a valid thing to do? It seems like other places do similar
things...

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-05 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

Andrew Stubbs  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2021-05-05
 Status|UNCONFIRMED |NEW

--- Comment #6 from Andrew Stubbs  ---
Using force_operand does fix Tobias's reduced testcase. I'll test it further
and let you know.

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-05 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

--- Comment #9 from Andrew Stubbs  ---
I found a couple of other places to put force_operand and the full case works
now.

Running more tests

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-06 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

--- Comment #13 from Andrew Stubbs  ---
I found a lot more ICEs when testing my patch. They look to be unrelated
(TImode coming back to haunt us), but it makes it hard to be sure.

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-14 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #17 from Andrew Stubbs  ---
This issue should be fixed now.

[Bug testsuite/108898] [13 Regression] Test introduced by r13-6278-g3da77f217c8b2089ecba3eb201e727c3fcdcd19d failed on i386

2023-02-23 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108898

--- Comment #1 from Andrew Stubbs  ---
I tested it on i686-pc-linux-gnu before I posted the patch, and it was working
then. Can you be more specific about what configuration you were testing, please?

[Bug testsuite/108898] [13 Regression] Test introduced by r13-6278-g3da77f217c8b2089ecba3eb201e727c3fcdcd19d failed on i386

2023-03-15 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108898

--- Comment #4 from Andrew Stubbs  ---
I did not know there was a way to do that! I'll add this to my to-do list.

[Bug target/105246] [amdgcn] Use library call for SQRT with -ffast-math + provide additional option to use single-precsion opcode

2022-04-13 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105246

--- Comment #2 from Andrew Stubbs  ---
When we first coded this we only had the GCN3 ISA manual, which says nothing
about the accuracy.

Now that I look in the Vega manual (GCN5), I see:

  Square root with perhaps not the accuracy you were hoping for --
  (2**29)ULP accuracy. On the upside, denormals are supported.

The most recent CDNA2 manual is a bit less verbose:

  Square root. Precision is (2**29) ULP, and supports denormals.

The compiler already emits Newton-Raphson iterations for division with
-ffast-math, so I'm sure it can be done, but I'm not too clear on the
mathematics myself.
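
For reference, the usual Newton-Raphson refinement for the reciprocal square
root (a sketch of the standard approach, not necessarily what a GCN
implementation would use) is

  y_{n+1} = \frac{1}{2} y_n (3 - a y_n^2),  \qquad  \sqrt{a} \approx a \cdot y_n

and each iteration roughly doubles the number of correct bits, so a couple of
iterations starting from the hardware approximation should reach full precision.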

[Bug tree-optimization/106476] New: ICE generating FOLD_EXTRACT_LAST

2022-07-29 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106476

Bug ID: 106476
   Summary: ICE generating FOLD_EXTRACT_LAST
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ams at gcc dot gnu.org
CC: rguenther at suse dot de
  Target Milestone: ---
Target: amdgcn-amdhsa

Commit 8f4d9c1deda "amdgcn: 64-bit not" exposed an ICE in tree-vect-stmts.cc
when compiling gcc.dg/torture/pr67470.c at -O2 for amdgcn. The newly
implemented op is not the problem, but it allows this optimization (and many
others) to proceed, and the error is no longer hidden.

amdgcn has masked vectors and fold_extract_last, which leads to a code path
through tree-vect-stmts.cc that has

   vec_then_clause = vec_oprnds2[i];
   if (reduction_type != EXTRACT_LAST_REDUCTION)
 vec_else_clause = vec_oprnds3[i];

and then

   /* Instead of doing ~x ? y : z do x ? z : y.  */
   vec_compare = new_temp;
   std::swap (vec_then_clause, vec_else_clause);

and finally

   new_stmt = gimple_build_call_internal
   (IFN_FOLD_EXTRACT_LAST, 3, else_clause, vec_compare,
vec_then_clause);

in which vec_then_clause remains set to NULL_TREE.

The dump shows

   e_lsm.16_32 = .FOLD_EXTRACT_LAST (e_lsm.16_8, _70, );

(note the last argument is missing.)

I can fix the ICE if I add "else vec_else_clause = integer_zero_node", but I'm
not sure that is the correct logical solution.
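
In context, that hack looks like this (sketch only, and possibly not the right
logical fix):

   vec_then_clause = vec_oprnds2[i];
   if (reduction_type != EXTRACT_LAST_REDUCTION)
     vec_else_clause = vec_oprnds3[i];
   else
     /* Avoid passing a NULL_TREE operand after the swap below.  */
     vec_else_clause = integer_zero_node;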

(CC Richi who touched this code last)

[Bug middle-end/107088] [13 Regression] cselib ICE building __trunctfxf2 on ia64

2022-09-30 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107088

Andrew Stubbs  changed:

   What|Removed |Added

 Target|ia64-*-*|ia64-*-*, amdgcn-*-*
 CC||ams at gcc dot gnu.org

--- Comment #7 from Andrew Stubbs  ---
I get the same failure on amdgcn building newlib/libm/math/kf_rem_pio2.c

[Bug middle-end/107088] [13 Regression] cselib ICE building __trunctfxf2 on ia64

2022-09-30 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107088

--- Comment #9 from Andrew Stubbs  ---
I can confirm that the patch fixes the amdgcn build.

[Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c

2022-10-10 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #4 from Andrew Stubbs  ---
I don't understand rgroups, but I can say that GCN masks are very simply
one-bit-one-lane. There are always 64 lanes, regardless of the type, so V64QImode
has fewer bytes and bits than V64DImode (when written to memory).
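
To illustrate the layout (not GCC code, just the bit pattern): the mask is a
single DImode scalar with one bit per lane, so the mask that enables the first
N lanes is simply

  /* Hypothetical helper, for illustration only.  */
  unsigned long long
  mask_first_n_lanes (unsigned n)
  {
    return n >= 64 ? ~0ULL : (1ULL << n) - 1;
  }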

This is different to most other architectures where the bit-size remains the
same and number of lanes varies with the inner type, and has caused us some
issues with invalid assumptions in GCC (e.g. "there's no need for sign-extends
in vector registers" is not true for GCN).

However, I think it's the same as you're describing for AVX512, at least in
this respect.

Incidentally I'm on the cusp of adding multiple "virtual" vector sizes in the
GCN backend (in lieu of implementing full mask support everywhere in the
middle-end and fixing all the cost assumptions), so these VIEW_CONVERT_EXPR
issues are getting worse. I have a bunch of vec_extract patterns that fix up
some of it. Within the backend, the V32, V16, V8, V4 and V2 vectors are all
really just 64-lane vectors with the mask preset, so the mask has to remain
DImode or register allocation becomes tricky.

[Bug middle-end/104026] [12 Regression] ICE in wide_int_to_tree_1, at tree.c:1755 via tree-vect-loop-manip.c:673

2022-01-14 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104026

Andrew Stubbs  changed:

   What|Removed |Added

 CC||ams at gcc dot gnu.org

--- Comment #6 from Andrew Stubbs  ---
amdgcn always uses 64-lane vectors, regardless of type, and relies on masking
to support anything smaller.

The len_store pattern seems to have been introduced in July 2020 which is more
recent than the last major work in the amdgcn backend.

[Bug target/100181] hot-cold partitioned code doesn't assemble

2022-02-11 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100181

--- Comment #13 from Andrew Stubbs  ---
I've updated the LLVM version documentation at
https://gcc.gnu.org/wiki/Offloading#For_AMD_GCN:

It's LLVM 9 or 13.0.1 now (nothing in between), and will be 13.0.1+ for the
next release (dropping LLVM 9 because we'll want to add newer device support to
GCC soonish).

[Bug target/103201] [12 Regression] trunk 20211111 ftbfs for amdgcn – libgomp/teams.c:49:6: error: 'struct gomp_thread' has no member named 'num_teams'

2021-11-12 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103201

--- Comment #3 from Andrew Stubbs  ---
I did some preliminary testing on your patch: the libgomp.c/target-teams-1.c
testcase runs fine on amdgcn. I presume that that covers most of the existing
features of those runtime calls?

[Bug target/103396] [12 Regression][GCN][BUILD] ICE RTL check: access of elt 4 of vector with last elt 3 in move_callee_saved_registers, at config/gcn/gcn.c:2821

2021-11-24 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103396

Andrew Stubbs  changed:

   What|Removed |Added

   Last reconfirmed||2021-11-24
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #4 from Andrew Stubbs  ---
I think I have a fix for this. It happens when the link register has to be
saved because it is used implicitly by a function call, but the register is
never explicitly mentioned anywhere else in the function. I don't know why this
hasn't been a problem before now?

[Bug target/103396] [12 Regression][GCN][BUILD] ICE RTL check: access of elt 4 of vector with last elt 3 in move_callee_saved_registers, at config/gcn/gcn.c:2821

2021-11-25 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103396

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Andrew Stubbs  ---
This problem should be fixed now.

[Bug target/102260] amdgcn offload compiler fails to configure, not matching target directive's target id

2021-09-09 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102260

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-09-09
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

--- Comment #1 from Andrew Stubbs  ---
In addition to changing the amdgcn_target syntax in LLVM 13, the LLVM GCN guys
have also renamed the "sram-ecc" attribute to "sramecc" on the CLI, and have
not provided any backwards compatibility for either change.

These are not helpful decisions and will require configure tests in GCC to
support all the variations. :-(

I'm working on it.

[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'

2021-09-30 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544

--- Comment #1 from Andrew Stubbs  ---
Please set "export GCN_DEBUG=1", try it again, and post the output.

[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'

2021-10-01 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544

--- Comment #3 from Andrew Stubbs  ---
That output shows that we have the correct libgomp and rocm is installed and
working. Libgomp initialized the GCN plugin, but did not attempt to initialize
the device (the next message in the output should have been "Selected kernel
arguments memory region", or at least a GCN error message).

Instead we have a target-independent libgomp error. Presumably the kernel
metadata is malformed, somehow?

I think we need a testcase to debug this further, preferably reduced to be as
simple as possible.

Perhaps it would be a good idea to start with a minimal toy example and see if
that works on the device.

#include <stdio.h>
#include <openacc.h>

int main ()
{
  int v = 1;

#pragma acc parallel copy(v)
  {
if (acc_on_device(acc_device_host))
  v = -1; // error
else {
  v = 2; // success
}
  }

  printf ("v is %d\n", v);
  return v;
}

[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'

2021-10-01 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544

--- Comment #5 from Andrew Stubbs  ---
Sorry, I should have said to compile with -fopenacc.

If you did do that, please post the GCN_DEBUG output.

[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'

2021-10-04 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544

--- Comment #8 from Andrew Stubbs  ---
Did you get the C version to return anything other than "-1"? (The expected
result is "2".)

I'm still trying to determine if the device is compatible, but the mapping
problem looks like a different issue.

Your code works fine on my device using a somewhat more recent GCC build. (I
can't install that exact toolchain right now.)

[Bug target/107510] gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]

2022-11-03 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

Andrew Stubbs  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #2 from Andrew Stubbs  ---
Oops, I thought I fixed that. :(

[Bug target/107510] gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]

2022-11-03 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Andrew Stubbs  ---
Fixed.

[Bug other/89863] [meta-bug] Issues in gcc that other static analyzers (cppcheck, clang-static-analyzer, PVS-studio) find that gcc misses

2022-11-03 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89863
Bug 89863 depends on bug 107510, which changed state.

Bug 107510 Summary: gcc/config/gcn/gcn.cc:4930:9: style: Same expression on 
both sides of '||'. [duplicateExpression]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/105873] [amdgcn][OpenMP] task reductions fail with "team master not responding; slave thread aborting"

2022-06-08 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105873

--- Comment #4 from Andrew Stubbs  ---
I think unused threads should be given a no-op function to run, not a null
pointer. The GCN implementation cannot tell the difference between a null
pointer and an unset pointer (which is what happens when the master thread
dies).

There's also a potential issue of what happens when barriers occur within an
active thread when there are also inactive threads. GCN barrier instructions
are unconditional, meaning that all the live threads must respond. The inactive
threads can do so, in a harmless way, as long as they are allowed to spin, but we
don't want them spinning forever when the master dies.

I believe the current barrier implementation skips the barrier instruction when
the team's thread count is 1. This is how we avoid issues with nested teams and
tasks. I don't know why that doesn't help here?

[Bug target/95023] Offloading AMD GCN wiki cannot be followed

2021-07-02 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95023

Andrew Stubbs  changed:

   What|Removed |Added

 CC||ams at gcc dot gnu.org
 Resolution|--- |DUPLICATE
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Andrew Stubbs  ---
The second problem in this bug is reported in 97827.

*** This bug has been marked as a duplicate of bug 97827 ***

[Bug target/97827] bootstrap error building the amdgcn-amdhsa offload compiler with LLVM 11

2021-07-02 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97827

Andrew Stubbs  changed:

   What|Removed |Added

 CC||xw111luoye at gmail dot com

--- Comment #17 from Andrew Stubbs  ---
*** Bug 95023 has been marked as a duplicate of this bug. ***

[Bug target/101484] [12 Regression] trunk 20210717 ftbfs for amdgcn-amdhsa (gcn offload)

2021-07-17 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101484

Andrew Stubbs  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-07-17

--- Comment #1 from Andrew Stubbs  ---
A new warning has been added that falsely identifies any access to a hardcoded
constant address as bogus. This has affected a few targets, including GCN
libgomp. See pr101374.

There's some discussion of what to do about it, e.g.
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574880.html

[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"

2021-07-21 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544

--- Comment #3 from Andrew Stubbs  ---
The standalone amdgcn configuration does not support C++. There are a number of
technical reasons why it doesn't Just Work, but basically it comes down to
no-one ever working on it. Our customers were primarily interested in Fortran
with C second.

C++ offloading works fine provided that there are no library calls or
exceptions.

Ignoring unsupported C++ language features, for now, I don't think there's any
reason why libstdc++ would need to be cut down. We already build the full
libgfortran for amdgcn. System calls that make no sense on the GPU were
implemented as stubs in Newlib (mostly returning some reasonable errno value),
and it would be straight-forward to implement more the same way.
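
The stubs have this general shape (an illustration of the approach, not the
exact newlib code):

  #include <errno.h>

  int
  fork (void)
  {
    /* There are no processes on the GPU, so report "not supported".  */
    errno = ENOSYS;
    return -1;
  }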

I believe static constructors work (libgfortran uses some), but exception
handling does not. I'm not sure what other exotica C++ might need?

As for exceptions, set-jump-long-jump is not implemented because there was no
call for it and I didn't know how to handle the GCN register files properly.
Not only are they variable-sized, they're also potentially very large: ranging
from ~6KB up to ~65KB, I think (102 32-bit scalar, and 256 2048-bit vector
registers, for single-threaded mode, but only 80 scalar and 24 vector registers
in maximum occupancy mode, in which case per-thread stack space is also quite
limited). I'm not sure how the other exception implementations work.

[Bug target/100208] amdgcn fails to build with llvm-mc from llvm12

2021-07-21 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100208

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Andrew Stubbs  ---
I think this issue should be resolved now.

(Other reasons why GCC fails with LLVM 12 still exist).

[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"

2021-07-21 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544

--- Comment #5 from Andrew Stubbs  ---
[Note: all of my comments refer to the amdgcn case. nvptx has somewhat
different support in this area.]

(In reply to Jonathan Wakely from comment #4)
> But it's a waste of space in the .so to build lots of symbols that use the
> stubs.

DSOs are not supported. This is strictly for static linking only.

> There are other reasons it might be nice to be able to configure libstdc++
> for something in between a full hosted environment and a minimal
> freestanding one.

If it isn't a horrible hack, like libgfortran minimal mode, then fine.

> > I believe static constructors work (libgfortran uses some), but exception
> > handling does not. I'm not sure what other exotica C++ might need?
> 
> Ideally, __cxa_atexit and __cxa_thread_atexit for static and thread-local
> destructors, but we can survive without them (and have not-fully-conforming
> destruction ordering).

Offload kernels are just fragments of programs, so this is tricky in those
cases. Libgomp explicitly calls _init_array and _fini_array as single-threaded
kernel launches. Actually, it's not clear that destruction is in any way
interesting, given that code running on the GPU has no external access and the
resources are all released when the host program exits.

Similarly, C++ threads are not interesting in the GPU-offload case. There are a
fixed number of threads launched on entry and they are managed by libgomp. In
theory it would be possible to code gthreads/libstdc++ to use them in
standalone mode, but really that mode only exists to facilitate compiler
testing.

[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails

2024-02-08 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085

--- Comment #6 from Andrew Stubbs  ---
(In reply to seurer from comment #5)
> I should note that pinned-2 also fails on powerpc64 LE.
> 
> make  -k check-target-libgomp RUNTESTFLAGS="c.exp=libgomp.c/alloc-pinned-*"
> FAIL: libgomp.c/alloc-pinned-1.c execution test
> FAIL: libgomp.c/alloc-pinned-2.c execution test
> 
> 
> On powerpc64 BE pinned-3 and -4 fail (but not -1 and -2):
> 
> make  -k check-target-libgomp RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'
> c.exp=libgomp.c/alloc-pinned-*"
> FAIL: libgomp.c/alloc-pinned-3.c execution test
> FAIL: libgomp.c/alloc-pinned-4.c execution test

Please show any messages in the libgomp.log file, and find out what the page
sizes and locked memory limits are on both machines.

[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails

2024-02-12 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085

--- Comment #8 from Andrew Stubbs  ---
(In reply to seurer from comment #7)
> On the BE machine:
> 
> seurer@nilram:~/gcc/git/build/gcc-test$ ulimit -a
> real-time non-blocking time  (microseconds, -R) unlimited
> ...
> max locked memory   (kbytes, -l) 529679232
> ...

That's a suspiciously large number, but OK.

> seurer@nilram:~/gcc/git/build/gcc-test$ getconf PAGESIZE
> 65536
> 
> 
> There were no messages.  Running it in gdb I get:
> 
> (gdb) where
> #0  0x0fce3340 in ?? () from /lib32/libc.so.6
> #1  0x0fc851e4 in raise () from /lib32/libc.so.6
> #2  0x0fc6a128 in abort () from /lib32/libc.so.6
> #3  0x1ae4 in set_pin_limit (size=size@entry=131072) at
> /home/seurer/gcc/git/gcc-test/libgomp/testsuite/libgomp.c/alloc-pinned-4.c:44
> #4  0x1754 in main () at
> /home/seurer/gcc/git/gcc-test/libgomp/testsuite/libgomp.c/alloc-pinned-4.c:
> 106
> 
> 
>   if (getrlimit (RLIMIT_MEMLOCK, &limit))
> abort ();   // line 44 in alloc-pinned-4.c

Why would that fail? Perhaps you can investigate the errno. You're probably
best placed to submit a patch for whatever this issue is.

> 
> This is a Debian Trixie machine and it too is using whatever the defaults
> are.

Good to know.
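
A minimal sketch of the errno check I mean (hypothetical; adapt it to the
testcase's existing variables):

  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/resource.h>

  static void
  check_memlock_limit (void)
  {
    struct rlimit limit;
    if (getrlimit (RLIMIT_MEMLOCK, &limit))
      {
        perror ("getrlimit (RLIMIT_MEMLOCK)");  /* report why it failed */
        abort ();
      }
    fprintf (stderr, "max locked memory: soft %llu, hard %llu\n",
             (unsigned long long) limit.rlim_cur,
             (unsigned long long) limit.rlim_max);
  }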

[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-03-27 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302

--- Comment #2 from Andrew Stubbs  ---
The execution test checks that each of the libgcc routines work correctly, and
the scan assembler tests make sure that we're getting coverage of all of them.

In this case, the failure indicates that we're not testing the routine we were
aiming for (but I think it does execute correctly and give a good result).

[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-03-27 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302

--- Comment #4 from Andrew Stubbs  ---
Yes, that's what the simd-math-3* tests do.

The simd-math-5* tests are explicitly supposed to be doing this in the context
of the autovectorizer.

If these tests are being compiled as (newly) intended then we should change the
expected results.

So, questions:

1. Are the new results actually correct? (So far I only know that being
different is expected.)

2. Is there some other testcase form that would exercise the previously
intended routines?

3. Is the new behaviour configurable? I don't think the 16-bit shift bug ever
existed on GCN (in which "short" vectors actually have excess bits in each
lane, much like scalar registers do).

[Bug driver/114717] '-fcf-protection' vs. offloading compilation

2024-04-15 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114717

--- Comment #3 from Andrew Stubbs  ---
Can this be filtered (safely) in mkoffload? That tool is
offload-target-specific, so no problem with "if offload target were to support
it".

[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs

2024-06-03 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304

--- Comment #9 from Andrew Stubbs  ---
(In reply to Richard Biener from comment #6)
> The best strathegy for GCN would be to gather V4QImode aka SImode into the
> V64QImode (or V16SImode) vector.  For pix2 we have a gap of 28 elements,
> doing consecutive loads isn't a good strategy here.

I don't fully understand what you're trying to say here, so apologies if you
knew all this already and I missed the point.

In general, on GCN V4QImode is not in any way equivalent to SImode (when the
values are in registers). The vector registers are not one single string of
re-interpretable bits.

For the same reason, you can't load a value as V64QImode and then try to
interpret it as V16SImode. GCN vector registers just don't work like
SSE/Neon/etc.

When you load a V64QImode vector, each lane is extended to 32 bits, so what you
actually get in hardware is a V64SImode vector.

Likewise, when you load a V4QImode vector the hardware representation is
actually V4SImode (which in itself is just V64SImode with undefined values in
the unused lanes).

[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs

2024-06-03 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304

--- Comment #11 from Andrew Stubbs  ---
(In reply to rguent...@suse.de from comment #10)
> On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
> > 
> > --- Comment #9 from Andrew Stubbs  ---
> > (In reply to Richard Biener from comment #6)
> > > The best strathegy for GCN would be to gather V4QImode aka SImode into the
> > > V64QImode (or V16SImode) vector.  For pix2 we have a gap of 28 elements,
> > > doing consecutive loads isn't a good strategy here.
> > 
> > I don't fully understand what you're trying to say here, so apologies if you
> > knew all this already and I missed the point.
> > 
> > In general, on GCN V4QImode is not in any way equivalent to SImode (when the
> > values are in registers). The vector registers are not one single string of
> > re-interpretable bits.
> > 
> > For the same reason, you can't load a value as V64QImode and then try to
> > interpret it as V16SImode. GCN vector registers just don't work like
> > SSE/Neon/etc.
> > 
> > When you load a V64QImode vector, each lane is extended to 32 bits, so what 
> > you
> > actually get in hardware is a V64SImode vector.
> > 
> > Likewise, when you load a V4QImode vector the hardware representation is
> > actually V4SImode (which in itself is just V64SImode with undefined values 
> > in
> > the unused lanes).
> 
> I see.  I wonder if there's not one or two latent wrong-code because of
> this and the vectorizers assumptions ;)  I suppose modes_tieable_p
> will tell us whether a VIEW_CONVERT_EXPR will do the right thing?
> Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw?
> And V64QImode really V64PSImode?

The mode size says how big it will be when written to memory, so no they're not
the same. I believe this matches the scalar QImode behaviour.

We don't use any PSI modes. There are (some) machine instructions for V64QImode
(and V64HImode) so we don't want to lose that information.

There may well be some bugs, but we have handling for conversions in a number
of places. There are truncate and extend patterns that operate lane-wise, and
vec_extract can take a subset of a vector, IIRC.

> Still for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33], 
> c[34], c[35], ... } it's probably best to use a single V64QImode gather 
> with GCN then rather than four "consecutive" V64QImode loads and then
> element swizzling.

Fewer loads are always better, and permutations are expensive operations (and
don't work with 64-lane vectors on RDNA devices because they're actually two
32-lane vectors stuck together) so it can certainly make sense to use gather
with a vector of permuted offsets (although it can be expensive to generate
that vector in the first place).

[Bug target/112088] [14 Regression] GCN target testing broken by "amdgcn: add -march=gfx1030 EXPERIMENTAL"

2023-10-27 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112088

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2023-10-27
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

--- Comment #1 from Andrew Stubbs  ---
I'm testing a fix for this.

[Bug target/112088] [14 Regression] GCN target testing broken by "amdgcn: add -march=gfx1030 EXPERIMENTAL"

2023-10-27 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112088

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #3 from Andrew Stubbs  ---
The patch should fix the bug.

[Bug target/112308] [14 Regression] GCN: 'error: literal operands are not supported' for 'v_add_co_u32'

2023-11-09 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112308

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2023-11-09
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

[Bug target/112313] [14 Regression] GCN target 'gcc.dg/pr111082.c' ICE, 'during RTL pass: vregs': 'error: unrecognizable insn'

2023-11-10 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112313

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

--- Comment #2 from Andrew Stubbs  ---
This is now fixed.

[Bug target/112308] [14 Regression] GCN: 'error: literal operands are not supported' for 'v_add_co_u32'

2023-11-10 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112308

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #2 from Andrew Stubbs  ---
This should be fixed now.

[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c

2023-11-13 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2023-11-13
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

--- Comment #4 from Andrew Stubbs  ---
It fails because optab_handler fails to find an instruction for "and_optab" in
SImode.  I didn't consider handling that case; it seemed so unlikely.

I guess architectures that can't "and" masks don't get to have safe masks? ...
I'll work on a fix.

[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c

2023-11-14 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

--- Comment #7 from Andrew Stubbs  ---
Simply changing to OPTAB_WIDEN solves the ICE, but I don't know if it does so
in a sensible way for RISC-V.

@@ -7489,7 +7489,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size,
     if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
       tmp = expand_binop (mode, and_optab, tmp,
                           GEN_INT ((1 << nunits) - 1), target,
-                          true, OPTAB_DIRECT);
+                          true, OPTAB_WIDEN);
     if (tmp != target)
       emit_move_insn (target, tmp);
     break;

Here are the instructions it generates:

(set (reg:DI 165)
(and:DI (subreg:DI (reg:SI 164) 0)
(const_int 1 [0x1])))
(set (reg:SI 154)
(subreg:SI (reg:DI 165) 0))

Should I use that patch? I think it's harmless on targets where OPTAB_DIRECT
would work.

[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c

2023-11-14 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

Andrew Stubbs  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #13 from Andrew Stubbs  ---
This should be fixed now.

[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'

2023-06-20 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313

--- Comment #1 from Andrew Stubbs  ---
This ICE also causes the following standalone test failures (raw amdgcn, no
offloading):

gfortran.dg/assumed_rank_21.f90
gfortran.dg/finalize_38.f90
gfortran.dg/finalize_38a.f90

[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'

2023-06-20 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313

--- Comment #3 from Andrew Stubbs  ---
It's curious that this affects the Fiji target only, and not the newer targets
at all.

There are some additional register options for multiply instructions and some
differences in atomics, but mostly the difference is that Fiji's "flat" load
and store instructions can't have offsets.

[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'

2023-06-20 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313

--- Comment #5 from Andrew Stubbs  ---
One thing that is unusual about the GCN stack pointer is that it's actually two
registers. Could this be breaking some cprop assumptions?

GCN can't fit an address in one (SImode) register so all (DImode) pointers
require a pair of registers. We had to rework the dwarf stack representation
code for this architecture, so I'm pretty sure no other port does this.

[Bug target/112937] [14 Regression] GCN: FAILs due to unconditional 'f->use_flat_addressing = true;'

2023-12-11 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112937

--- Comment #2 from Andrew Stubbs  ---
Flat addressing *should* be the safe option that always works (although using
"global" address space permits slightly more efficient offset options).

[Bug target/113022] GCN offloading bricked by "amdgcn: Work around XNACK register allocation problem"

2023-12-15 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113022

--- Comment #1 from Andrew Stubbs  ---
This is what I get for trying to get this done before vacation. :(

Yes, there's probably something in mkoffload that has to match the default
change from -mxnack=any to -mxnack=off on the older ISAs.

[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails

2023-12-20 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085

--- Comment #1 from Andrew Stubbs  ---
That is a typo.

I don't want to make it pass on machines that have insufficient memory
configured because it will mask the case where it fails for another reason.

However, the testcase was originally supposed to fit in 64kB. Is your page size
larger than 4kB?

[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails

2023-12-27 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085

--- Comment #4 from Andrew Stubbs  ---
It's going to be difficult to make this test work when only one page of locked
memory is available. :-(

I will look at making it "unsupported".

[Bug middle-end/113163] [14 Regression][GCN] ICE in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9420

2024-01-02 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113163

Andrew Stubbs  changed:

   What|Removed |Added

 CC||ams at gcc dot gnu.org

--- Comment #11 from Andrew Stubbs  ---
(In reply to Tamar Christina from comment #7)
> This seems to happen because the vectorizer decides to use partial vectors
> to vectorize the loop and the target picks a nonlinear induction step which
> we can't support for early breaks.

In which hook is this selected?

I'm not aware of this being a deliberate choice we made...

[Bug middle-end/113199] [14 Regression][GCN] ICE (segfault) due to invalid 'loop_mask_46 = VEC_PERM_EXPR' when compiling Newlib's wcsftime.c

2024-01-09 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113199

--- Comment #5 from Andrew Stubbs  ---
I can confirm that I can now build the amdgcn toolchain once more. :-)

Thanks.

[Bug target/113615] internal compiler error: in extract_insn, at recog.cc:2812

2024-01-29 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113615

--- Comment #3 from Andrew Stubbs  ---
I did see these, but I hadn't had time to chase them up.

The proposed patch is exactly the sort of solution I was expecting to find,
short term. Have you confirmed that it fixes all the cases?

A proper solution is to find out how to implement reductions with the RDNA ISA,
of course, but that's probably non-trivial (as in, I'm pretty sure it's more
than renaming a few mnemonics), and low-priority as GCC does a reasonably good 
job without them.

[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test

2024-06-25 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631

--- Comment #1 from Andrew Stubbs  ---
It was writing 0 to s12 (scalar register) and then moving the zero to lane zero
of v0 (vector register).

Now it's writing the 0 directly to v0, of which all but lane zero is masked.

These should be identical (unless s12 was also live).

The problem must be elsewhere.

[Bug target/115640] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-25 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #3 from Andrew Stubbs  ---
(In reply to Richard Biener from comment #2)
> If you force GCN to use fixed length vectors (how?), does it work?  How's
> it behaving on aarch64 with SVE?  (the CI was happy, but maybe doesn't
> enable SVE)

I believe "--param vect-partial-vector-usage=0" will disable the use of
WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked
the value is inherited from the host in the offload toolchain; the default for
x86_64 was "1", meaning approximately "only use partial vectors in epilogue
loops".

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-26 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #10 from Andrew Stubbs  ---
On 26/06/2024 12:05, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #8 from Richard Biener  ---
> (In reply to Richard Biener from comment #7)
>> I will have a look (and for run validation try to reproduce with gfx1036).
> 
> OK, so with gfx1036 we end up using 16 byte vectors and the testcase
> passes.  The difference with gfx908 is
> 
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   ==> examining statement: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   vect_model_load_cost: aligned.
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   vect_model_load_cost: inside_cost = 2, prologue_cost = 0 .
> 
> vs.
> 
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   ==> examining statement: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:   unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 
> 10
> 10 11 11 12 12 13 13 14 14 15 15 }
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:   unsupported load permutation
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72:
> missed:   not vectorized: relevant stmt not supported: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   removing SLP instance operations starting from: REALPART_EXPR
> <(*hadcur_24(D))[_2]> = _86;
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:  unsupported SLP instances
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:  re-trying with SLP disabled
> 
> so gfx1036 cannot do such permutes but gfx908 can?

GFX10 has more limited permutation capabilities than GFX9 because it 
only has 32-lane vectors natively, even though we're using the 64-lane 
"compatibility" mode.

However, in theory, the permutation capabilities on V32 and below should 
be the same, and some permutations on V64 are allowed, so I don't know 
why it doesn't use it. It's possible I broke the logic in 
gcn_vectorize_vec_perm_const:

   /* RDNA devices can only do permutations within each group of 32-lanes.
  Reject permutations that cross the boundary.  */
   if (TARGET_RDNA2_PLUS)
 for (unsigned int i = 0; i < nelt; i++)
   if (i < 31 ? perm[i] > 31 : perm[i] < 32)
 return false;

It looks right to me though?

The vec_extract patterns that also use permutations are likewise 
supposedly still enabled for V32 and below.

Andrew

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-26 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #14 from Andrew Stubbs  ---
On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #13 from Richard Biener  ---
> (In reply to Richard Biener from comment #12)
>> (In reply to Andrew Stubbs from comment #10)
>>> GFX10 has more limited permutation capabilities than GFX9 because it
>>> only has 32-lane vectors natively, even though we're using the 64-lane
>>> "compatibility" mode.
>>>
>>> However, in theory, the permutation capabilities on V32 and below should
>>> be the same, and some permutations on V64 are allowed, so I don't know
>>> why it doesn't use it. It's possible I broke the logic in
>>> gcn_vectorize_vec_perm_const:
>>>
>>> /* RDNA devices can only do permutations within each group of 32-lanes.
>>>Reject permutations that cross the boundary.  */
>>> if (TARGET_RDNA2_PLUS)
>>>   for (unsigned int i = 0; i < nelt; i++)
>>> if (i < 31 ? perm[i] > 31 : perm[i] < 32)
>>>   return false;
>>>
>>> It looks right to me though?
>>
>> nelt == 32 so I think the last element has the wrong check applied?
>>
>> It should be
>>
>>> if (i < 32 ? perm[i] > 31 : perm[i] < 32)
>>
>> I think.  With that the vectorization happens in a similar way but the
>> failure still doesn't reproduce (without the patch, of course).

Oops, I think you're right.

> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
> two vectors src0 and src1 into one 32 element dst vector (it's no longer
> required that src0 and src1 line up with the dst vector size btw, they
> might have different nelt).  So the loop would reject interleaving
> the low parts of two 32 element vectors, a permute that would look like
> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
> mean you can never mix the two vector inputs?  Or does GCN not have
> a two-to-one vector permute instruction?

GCN does not have two-to-one vector permute in hardware, so we do two 
permutes and a vec_merge to get the same effect.

GFX9 can permute all the elements within a 64 lane vector arbitrarily.

GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but 
no value may cross the boundary. AFAIK there's no way to do that via any 
vector instruction (i.e. without writing to memory, or extracting values 
element-wise).

In theory, we could implement permutes with different sized inputs and 
outputs, but right now those are rejected early. The interleave example 
wouldn't work in hardware, for GFX10, but we could have it for GFX9.

However, I think you might be right about the numbering of the "perm" 
array; we probably need to be testing "(perm[i] % nelt) > 31" if we are 
to support two-to-one permutations.
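
Something along these lines, perhaps (untested sketch, treating indices >= nelt
as lanes of the second input):

    /* RDNA devices can only do permutations within each group of 32-lanes.
       Reject permutations that cross the boundary.  */
    if (TARGET_RDNA2_PLUS)
      for (unsigned int i = 0; i < nelt; i++)
        if (i < 32 ? (perm[i] % nelt) > 31 : (perm[i] % nelt) < 32)
          return false;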

Thanks for looking at this.

Andrew

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-26 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #16 from Andrew Stubbs  ---
On 26/06/2024 14:41, rguenther at suse dot de wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #15 from rguenther at suse dot de  ---
>>> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
>>> two vectors src0 and src1 into one 32 element dst vector (it's no longer
>>> required that src0 and src1 line up with the dst vector size btw, they
>>> might have different nelt).  So the loop would reject interleaving
>>> the low parts of two 32 element vectors, a permute that would look like
>>> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
>>> mean you can never mix the two vector inputs?  Or does GCN not have
>>> a two-to-one vector permute instruction?
>>
>> GCN does not have two-to-one vector permute in hardware, so we do two
>> permutes and a vec_merge to get the same effect.
>>
>> GFX9 can permute all the elements within a 64 lane vector arbitrarily.
>>
>> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but
>> no value may cross the boundary. AFAIK there's no way to do that via any
>> vector instruction (i.e. without writing to memory, or extracting values
>> element-wise).
> 
> I see - so it cannot even swap low-32 and high-32?  I'm thinking of
> what sub-part of permutes would be possible by extending the two-to-one
> vec_merge trick.

No(?)

The 64-lane compatibility mode works, under the hood, by allocating 
double the number of 32-lane registers and then executing each 
instruction twice. Mostly this is invisible, but it gets exposed for 
permutations and the like. Logically, the microarchitecture could do a 
vec_merge to DTRT, but I've not found a way to express that.

It's possible I missed something when RTFM.

> OTOH we restrict GFX10/11 to 32 lane vectors so in practice this
> restriction should be fine.

Yes, with the "31" fixed it should work.

Andrew

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-28 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #18 from Andrew Stubbs  ---
That should fix the broken validation check. All V32 permutations should work
now on RDNA GPUs, I think. V16 and smaller were already working fine.

[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"

2024-07-29 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104

--- Comment #3 from Andrew Stubbs  ---
(In reply to Jeffrey A. Law from comment #1)
> So, how am I supposed to reproduce this?  I don't have an assembler/binutils
> for amdgcn and thus libgcc won't configure.  Thus I can't extract a testcase.
> 
> Alternately, if you could just attach a .i file, it'd be helpful.

In fact, you probably do have an assembler and linker already installed!

Just copy "llvm-mc" to amdgcn-amdhsa/bin/as and llvm's "lld" to
amdgcn-amdhsa/bin/ld. You might want ar, ranlib, and nm too. Full details here:
https://gcc.gnu.org/wiki/Offloading#For_AMD_GCN

[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"

2024-07-29 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104

--- Comment #4 from Andrew Stubbs  ---
The problem insn is this:

(insn 31 30 32 2 (set (reg:V2SI 711)
(ashift:V2SI (reg:V2SI 161 v1)
(const_vector:V2SI [
(const_int 3 [0x3]) repeated x2
])))
"/srv/data/ams/upB/gcn/src/gcc-mainline/libgcc/libgcc2.c":70:11 2043
{vashlv2si3}
 (expr_list:REG_DEAD (reg:V2SI 161 v1)
(nil)))

That is, it shifts each element in a 2-element vector by 3 bits. This looks
like valid GCN code to me.

ext_dce does not seem to be handling the vector case.

[Bug target/116103] [15 Regression] GCN vs. "Internal-fn: Only allow modes describe types for internal fn[PR115961]"

2024-07-29 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116103

--- Comment #8 from Andrew Stubbs  ---
(In reply to Thomas Schwinge from comment #4)
> (In reply to Richard Biener from comment #2)
> >   if (VECTOR_BOOLEAN_TYPE_P (type)
> >   && SCALAR_INT_MODE_P (TYPE_MODE (type)))
> > return true;
> 
> >   && TYPE_PRECISION (TREE_TYPE (type)) == 1
> 
> > Thomas, does that resolve the issue?
> 
> Thanks, it does: restores the original '*.s' exactly.  (Assuming that's the
> desired outcome, Andrew?)

It looks like while_ult is being rejected ... we really do want that!

[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"

2024-07-30 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #9 from Andrew Stubbs  ---
The patch fixed ASHIFT, but we now have the exact same failure in LSHIFTRT. 

Looking at the code, there are probably a few other cases in that switch
statement; at least ASHIFTRT, and probably all the other uses of CONSTANT_P.

[Bug target/116955] [15 Regression] GCN '-march=gfx1100': [-PASS:-]{+FAIL:+} gcc.dg/vect/pr81740-2.c execution test

2024-10-04 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116955

--- Comment #2 from Andrew Stubbs  ---
Compared to gfx908, gfx1100 lacks 64-lane permute and vector reductions.
Permute works with 32 lanes or fewer, but reductions are unimplemented in the
backend. Otherwise it should vectorize the same.

That might explain some additional failures, but probably doesn't explain a
regression.

[Bug target/116571] [15 Regression] GCN vs. "lower SLP load permutation to interleaving"

2024-09-23 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116571

--- Comment #6 from Andrew Stubbs  ---
(In reply to Richard Biener from comment #5)
> (In reply to Thomas Schwinge from comment #4)
> > The GCN target FAILs that I originally had reported here:
> > 
> > > [-PASS:-]{+FAIL:+} gcc.dg/vect/slp-11a.c scan-tree-dump-times vect 
> > > "vectorizing stmts using SLP" [-0-]{+1+}
> > 
> > > [-PASS:-]{+FAIL:+} gcc.dg/vect/slp-12a.c scan-tree-dump-times vect 
> > > "vectorizing stmts using SLP" [-0-]{+1+}
> > 
> > ... are back to PASS as of recently; should we close this PR?
> 
> I'd say so.
> 
> > 
> > Andrew, anything to be done anyway, regarding the following?
> > 
> > (In reply to Richard Biener from comment #3)
> > > Possibly for GCN the issue is the vect_strided8 which is implemented as
> > > 
> > > foreach N {2 3 4 5 6 7 8} {
> > > eval [string map [list N $N] {
> > > # Return 1 if the target supports 2-vector interleaving
> > > proc check_effective_target_vect_stridedN { } {
> > > return [check_cached_effective_target_indexed vect_stridedN {
> > > if { (N & -N) == N
> > >  && [check_effective_target_vect_interleave]
> > >  && [check_effective_target_vect_extract_even_odd] } {
> > > return 1
> > > }
> > > if { ([istarget arm*-*-*]
> > >   || [istarget aarch64*-*-*]) && N >= 2 && N <= 4 } {
> > > return 1
> > > }
> > > if { ([istarget riscv*-*-*]) && N >= 2 && N <= 8 } {
> > > return 1
> > > }
> > > if [check_effective_target_vect_fully_masked] {
> > > return 1
> > > }
> > > 
> > > not sure if gcn really supports a load/store-lane with 8 elements.
> 
> Note I might have misunderstood the vect_stridedN effective target, but
> the last line looks odd (x86 also can do fully masked loops with AVX512
> but definitely cannot do arbitrary interleaving schemes just because of
> that).

OK, but unless I'm missing something x86 is not one of the targets where
check_effective_target_vect_fully_masked is true.

> I'd remove that last line, it was added by you, Andrew, in
> r9-5484-g674931d2b7bd88 ...

I will switch it to test for GCN specifically.

As far as I know, GCN is fine with 8-element vectors and can load/store them
using any arbitrary pattern you can generate.

[Bug target/117657] [15 Regression][gcn] ICE during in-tree newlib build: error: unrecognizable insn

2024-11-19 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117657

--- Comment #6 from Andrew Stubbs  ---
The patch changed the wrong operand on the gen_gather_insn_1offset_exec
call. It sets one of the offsets undefined instead of setting the else value
undefined.

I'm testing a fix.

[Bug target/117657] [15 Regression][gcn] ICE during in-tree newlib build: error: unrecognizable insn

2024-11-18 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117657

--- Comment #1 from Andrew Stubbs  ---
This appears to have been caused by the recent maskload patches, which is weird
because I thought I already tested the patches that were posted.

[Bug target/117657] [15 Regression][gcn] ICE during in-tree newlib build: error: unrecognizable insn

2024-11-20 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117657

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Andrew Stubbs  ---
Marking as fixed. The new test failure has been reported as pr117709.

[Bug target/117709] [15 regression] maskload else case generating wrong code

2024-11-20 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117709

--- Comment #6 from Andrew Stubbs  ---
Yes, that fixes the issue, thanks.

The only diff in the assembly now, compared to before the "else" patch, is that
the zero-initialization is gone. This is good; the mysterious extra code seemed
like a step backwards. :)

[Bug target/117709] New: [15 regression] maskload else case generating wrong code

2024-11-20 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117709

Bug ID: 117709
   Summary: [15 regression] maskload else case generating wrong
code
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ams at gcc dot gnu.org
CC: rdapp at gcc dot gnu.org
  Target Milestone: ---

The following testcase aborts on amdgcn since the maskload else patches were
added (https://patchwork.sourceware.org/project/gcc/list/?series=40395, plus
the patch for pr117657).

---8<---
int i, j;   
int k[11][101]; 

__attribute__((noipa)) void 
doit (void) {   
  int err = 0;  
  for (j = -11; j >= -41; j -= 15) {
k[0][-j] = 1;   
  } 
#pragma omp simd collapse(2) lastprivate (i, j) reduction(|:err)
  for (i = 4; i < 8; i += 12)   
for (j = -8 * i - 9; j < i * -3 + 6; j += 15)   
  err |= (k[0][-j] != 1);   

  if (err)  
__builtin_abort (); 
}   
int main () {   
  doit ();  
}   
---8<---

The testcase is simplified from gcc.dg/vect/vect-simd-17.c.

If I revert most of the gcn patch, but keep the new operand with the
"maskload_else_operand" predicate, so that the backend generates the
zero-initializer, but the middle-end assumes the lanes are undefined, then the
testcase still fails.

I think this shows that the problem is in the additional code generated by the
middle end, not that the vectors are actually corrupt, but I've not identified
exactly how.

[Bug target/117709] [15 regression] maskload else case generating wrong code

2024-11-20 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117709

--- Comment #4 from Andrew Stubbs  ---
The mask is a 64-bit integer value in the "exec" register.

I agree that I cannot see the problem staring at it. Like I said, I changed the
backend so that it generated the zero-initializers anyway, and the snippets you
show above reverted to what they were before. However, there's some new code
further up the assembly that I don't understand yet.

[Bug target/117657] [15 Regression][gcn] ICE during in-tree newlib build: error: unrecognizable insn

2024-11-19 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117657

--- Comment #9 from Andrew Stubbs  ---
That commit should fix the build failure.

However, I'm now seeing a wrong-code regression in gcc.dg/vect/vect-simd-17.c
that I can't prove isn't related. The testcase now aborts, at least on gfx90a,
where it used to pass. There's rather a lot going on in there, so I'll have to
minimize the testcase.

I'm not sure why I didn't see this issue before, but I did miss that the last
version of the patch posted had extra new stuff in it, so that's probably why.

[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"

2025-01-16 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544

--- Comment #15 from Andrew Stubbs  ---
BTW, if you're calling "new" in the offload kernel then you're probably "doing
it wrong", even when we do implement full C++ support.  Offload kernels are for
hot code, executed many times, and memory allocation is inherently slow.

On AMDGCN, "malloc" uses Newlib's heap support and gets serialized via a global
lock.  Likewise for "free".  On NVPTX, the implementation is provided by the
PTX finalizer, so may be better optimized, but I still don't recommend it.
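
For illustration, the usual alternative is to hoist the allocation out of the
offload region entirely, so the hot loop never touches malloc/new on the
device. A minimal hypothetical sketch (the function name, array size, and
mapping clause are mine, not taken from any testcase here):

---8<---
#include <stdlib.h>

void
compute (int n)
{
  /* Allocate once on the host, outside the hot region.  */
  double *buf = (double *) malloc (n * sizeof (double));

#pragma omp target teams distribute parallel for map(tofrom: buf[0:n])
  for (int i = 0; i < n; i++)
    buf[i] = i * 2.0;   /* no allocation inside the kernel */

  free (buf);
}
---8<---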

I'm assuming you're using printf for debug and testing only, so that's fine,
but it definitely has no place in hot code either.

[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"

2025-01-16 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544

--- Comment #14 from Andrew Stubbs  ---
"printf" exists and has been working on both AMDGCN and NVPTX devices since
forever. "fputs", "puts", and "write", etc. should all work too.

If the FORTIFY_SOURCE trick doesn't get rid of __printf_chk, or has other
side-effects, then you can probably use __builtin_printf in these instances to
bypass the preprocessor "magic".
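
For instance (a hypothetical sketch, not from this PR), calling the builtin
directly sidesteps whatever <stdio.h> fortification macros would otherwise
rewrite printf into __printf_chk:

---8<---
#include <stdio.h>

int
main (void)
{
#pragma omp target
  __builtin_printf ("hello from the device: %d\n", 42);  /* no __printf_chk */

  return 0;
}
---8<---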

[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7257-g54bdeca3c62144

2025-03-18 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325

--- Comment #10 from Andrew Stubbs  ---
The libm vector routines are pretty much just the scalar routines translated
into vector-extension statements, wrapped in preprocessor macros.

They should be unaffected by the vectorizer (and most of the optimization
passes, for that matter).
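
To show the shape of that code, here is a hypothetical sketch using GCC's
generic vector extensions; the names and the 64-lane types are mine, not the
actual libm sources:

---8<---
/* Scale each lane by 2^n, written directly on whole vectors so no
   vectorizer pass is involved (denormals/overflow ignored).  */
typedef float v64sf __attribute__ ((vector_size (64 * sizeof (float))));
typedef int   v64si __attribute__ ((vector_size (64 * sizeof (int))));

v64sf
vec_scalbnf (v64sf x, v64si n)
{
  v64si bits;
  __builtin_memcpy (&bits, &x, sizeof (bits));  /* reinterpret the lanes */
  bits += n << 23;                              /* adjust the exponent fields */
  __builtin_memcpy (&x, &bits, sizeof (x));
  return x;
}
---8<---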

[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7257-g54bdeca3c62144

2025-03-18 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325

--- Comment #17 from Andrew Stubbs  ---
Oops, that has __to and __from backwards ... you get the idea.

[Bug testsuite/119286] [15 Regression] GCN vs. "middle-end: delay checking for alignment to load [PR118464]"

2025-03-18 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119286

--- Comment #3 from Andrew Stubbs  ---
The RDNA consumer devices, such as gfx1100, support permute for V32 and
smaller, but not V64. Gather/scatter should be able to load from arbitrary
addresses, but synthesising a vector with those addresses may run into the
permutation issue. If that's not the issue then I don't know why it wouldn't
work.

The GFX9/CDNA devices support V64 properly.

As for the scalar loads, we never actually implemented any costs. In general,
loading a vector elementwise isn't actually worse than just doing the scalar
algorithm (which is very slow on a GPU), so we have focused on making the
vector stuff work, except that RDNA has the hardware limitations described
above.
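
As a concrete, hypothetical illustration using the generic vector extensions
(the types and function are mine, not from the testcase): a 32-lane
single-vector shuffle like the one below is within what RDNA can permute
natively, while the same code written with 64-lane types runs into the
limitation described above.

---8<---
typedef float v32sf __attribute__ ((vector_size (32 * sizeof (float))));
typedef int   v32si __attribute__ ((vector_size (32 * sizeof (int))));

v32sf
permute32 (v32sf x, v32si idx)
{
  /* Lane i receives x[idx[i] mod 32].  */
  return __builtin_shuffle (x, idx);
}
---8<---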

[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7257-g54bdeca3c62144

2025-03-18 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325

--- Comment #16 from Andrew Stubbs  ---
Perhaps:

asm ("mov %0, %1" : "=v"(__from), "v"(__to));

or maybe

asm ("; no-op cast %0" : "=v"(__from), "0"(__to));


Is there a downside to that in the optimizer(s)?

[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7284-g6b56e645a7b481

2025-03-17 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325

--- Comment #3 from Andrew Stubbs  ---
Supposedly, the non-openmp equivalent test is gcc.target/gcn/simd-math-1.c, but
that seems to be passing still.

[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7257-g54bdeca3c62144

2025-03-19 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325

--- Comment #20 from Andrew Stubbs  ---
I tried the memcpy solution with the following testcase:

v2sf
smaller (v64sf in)  
{   
  v2sf out = RESIZE_VECTOR (v2sf, in);  
  return out;   
}   

v64sf   
bigger (v2sf in)
{   
  v64sf out = RESIZE_VECTOR (v64sf, in);
  return out;   
}   

It doesn't look great, compiled with -O3 -fomit-frame-pointer, as it's routing
the conversion through expensive stack memory:

smaller:
        s_addc_u32      s17, s17, 0
        s_mov_b64       exec, -1
        v_lshlrev_b32   v4, 2, v1
        s_mov_b32       s22, scc
        s_add_u32       s12, s16, -256
        s_addc_u32      s13, s17, -1
        s_cmpk_lg_u32   s22, 0
        v_add_co_u32    v4, s[22:23], s12, v4
        v_mov_b32       v5, s13
        v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
        flat_store_dword  v[4:5], v8 offset:0          < Store big vector
        s_mov_b64       exec, 3
        v_lshlrev_b32   v8, 2, v1
        v_add_co_u32    v8, s[22:23], s12, v8
        v_mov_b32       v9, s13
        v_addc_co_u32   v9, s[22:23], 0, v9, s[22:23]
        flat_load_dword v8, v[8:9] offset:0            < Load smaller vector
        s_waitcnt       0
        s_sub_u32       s16, s16, 256
        s_subb_u32      s17, s17, 0
        s_setpc_b64     s[18:19]

bigger:
        s_addc_u32      s17, s17, 0
        s_mov_b64       exec, -1
        v_mov_b32       v0, 0
        s_mov_b32       s22, scc
        s_add_u32       s12, s16, -256
        s_addc_u32      s13, s17, -1
        s_cmpk_lg_u32   s22, 0
        v_lshlrev_b32   v4, 2, v1
        v_mov_b32       v5, s13
        v_add_co_u32    v4, s[22:23], s12, v4
        v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
        flat_store_dword  v[4:5], v0 offset:0          < Initialize zeroed big vector
        s_mov_b64       exec, 3
        v_lshlrev_b32   v4, 2, v1
        v_mov_b32       v5, s13
        v_add_co_u32    v4, s[22:23], s12, v4
        v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
        flat_store_dword  v[4:5], v8 offset:0          < Store small vector over zeros
        s_mov_b64       exec, -1
        v_lshlrev_b32   v4, 2, v1
        v_mov_b32       v5, s13
        v_add_co_u32    v4, s[22:23], s12, v4
        v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
        flat_load_dword v8, v[4:5] offset:0            < Load combined big vector
        s_waitcnt       0
        s_sub_u32       s16, s16, 256
        s_subb_u32      s17, s17, 0
        s_setpc_b64     s[18:19]

Here's my alternative in-register solution:

#define RESIZE_VECTOR(to_t, from) \
({ \   
  to_t __to; \ 
  if (VECTOR_WIDTH (to_t) < VECTOR_WIDTH (__typeof (from))) \  
asm ("; no-op cast %0" : "=v"(__to) : "0"(from)); \
  else \   
{ \
  unsigned long __mask = -1L; \
  int lanes = VECTOR_WIDTH (__typeof (from)); \
  __mask <<= lanes; \  
  __builtin_choose_expr ( \
V_SF_SI_P (to_t), \
({asm ("v_mov_b32 %0, 0" : "=v"(__to) : "0"(from), "e"(__mask));}), \  
({asm ("v_mov_b32 %H0, 0\n\t" \
   "v_mov_b32 %L0, 0" : "=v"(__to) : "0"(from), "e"(__mask));})); \
    } \
  __to; \
})

[Bug target/119474] GCN 'libgomp.oacc-c++/pr96835-1.C' ICE 'during GIMPLE pass: ivopts'

2025-03-26 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119474

--- Comment #1 from Andrew Stubbs  ---
In the -O1 case, the problem seems to be that the "ivopts" pass has identified
an item-in-an-array-in-a-struct as the IV, and that struct is in a different
address space:

  Type: REFERENCE ADDRESS   
  Use 0.0:  
At stmt:_9 = v1.values_[i_38];  
At pos: v1.values_[i_38]
IV struct:  
  Type: int *   
  Base: (int *) (unsigned long) &v1 
  Step: 4   
  Object:   (void *) (&v1)  
  Biv:  N   
  Overflowness wrto loop niter: Overflow

When it tries to calculate the base address-of operator, it observes the type
of  "values_[i_38]", which doesn't have anything like an address space.

In the -O2 case, the problem seems to be the same, except that it's happening
in the "vect" pass: the compiler attempts to create a "vectp" to match &v1, but
doesn't propagate the address space correctly.

Question: are we just missing address-space handling in a bunch of places, or
is the OpenACC lowering producing non-canonical IVs somehow?

[Bug target/119474] GCN 'libgomp.oacc-c++/pr96835-1.C' ICE 'during GIMPLE pass: ivopts'

2025-03-27 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119474

--- Comment #6 from Andrew Stubbs  ---
The address space has to be introduced "late" because it's done in the
accelerator compiler, so post-IPA.  The pass is "oaccdevlow" (currently
no.103).

The address space is selected via the TARGET_GOACC_ADJUST_PRIVATE_DECL hook.

[Bug target/119474] GCN 'libgomp.oacc-c++/pr96835-1.C' ICE 'during GIMPLE pass: ivopts'

2025-03-31 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119474

--- Comment #9 from Andrew Stubbs  ---
This patch fixes the -O1 failure, for *this* testcase:

diff --git a/gcc/tree.cc b/gcc/tree.cc
index eccfcc89da40..4bfdb7a938e7 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -7085,11 +7085,8 @@ build_pointer_type_for_mode (tree to_type, machine_mode mode,
   if (to_type == error_mark_node)
 return error_mark_node;

-  if (mode == VOIDmode)
-{
-  addr_space_t as = TYPE_ADDR_SPACE (to_type);
-  mode = targetm.addr_space.pointer_mode (as);
-}
+  addr_space_t as = TYPE_ADDR_SPACE (to_type);
+  mode = targetm.addr_space.pointer_mode (as);

   /* If the pointed-to type has the may_alias attribute set, force
  a TYPE_REF_CAN_ALIAS_ALL pointer to be generated.  */
@@ -7121,6 +7118,7 @@ build_pointer_type_for_mode (tree to_type, machine_mode mode,
   TYPE_REF_CAN_ALIAS_ALL (t) = can_alias_all;
   TYPE_NEXT_PTR_TO (t) = TYPE_POINTER_TO (to_type);
   TYPE_POINTER_TO (to_type) = t;
+  TYPE_ADDR_SPACE (t) = TYPE_ADDR_SPACE (to_type);

   /* During LTO we do not set TYPE_CANONICAL of pointers and references.  */
   if (TYPE_STRUCTURAL_EQUALITY_P (to_type) || in_lto_p)


This is clearly wrong though, because the address space that the pointer is
*in* doesn't have to be the same as the one it *points to*, so I need a better
solution.

Any suggestions where to start?

[Bug target/119369] GCN: weak undefined symbols -> execution test FAIL, 'HSA_STATUS_ERROR_VARIABLE_UNDEFINED'

2025-03-31 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119369

--- Comment #2 from Andrew Stubbs  ---
We used to have work-arounds for ROCm runtime linker deficiencies, but these
were removed in 2020, as they were no longer necessary when we moved to
HSACOv3:

https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=f64f0899ba14058534e19ee46cfbf6f921e4644c

Something like this can probably be put back, if necessary.

[Bug target/119369] GCN: weak undefined symbols -> execution test FAIL, 'HSA_STATUS_ERROR_VARIABLE_UNDEFINED'

2025-03-31 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119369

--- Comment #5 from Andrew Stubbs  ---
A post-linker could be included as part of the mkoffload process (or maybe we
could fix up the weak directives in the assembler as part of the pre-assembler
step we already have).

Either way, there's no mkoffload in the pipeline for the standalone tests.
However, given that weak symbols are known to be half-broken then maybe
-fno-weak is the correct solution in that configuration?

The way the old workaround worked was it would hide *all* the relocations from
the runtime loader by editing the ELF file in memory, and then fix up the
relocations manually, in GPU memory, after the loader had chosen the load
address. This was suboptimal because device memcpy is slow (adding to the
initial setup overhead only), but means we have full control over runtime
symbol resolution.

It's frustrating, though, because, as you can see from the old code, processing
relocations is not hard and adding weak symbol support would be fairly
straightforward, so you'd hope the driver could already handle it. :(
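
A minimal sketch of what that would look like (illustrative only: plain
<elf.h> types, a made-up function name, and no handling of load addresses or
specific relocation types):

---8<---
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Resolve relocations against weak undefined symbols to zero, leaving
   everything else for the runtime loader.  */
static void
zero_weak_undef (char *image, const Elf64_Rela *rela, size_t count,
                 const Elf64_Sym *symtab)
{
  for (size_t i = 0; i < count; i++)
    {
      const Elf64_Sym *sym = &symtab[ELF64_R_SYM (rela[i].r_info)];
      if (ELF64_ST_BIND (sym->st_info) == STB_WEAK
          && sym->st_shndx == SHN_UNDEF)
        {
          Elf64_Addr zero = 0;
          memcpy (image + rela[i].r_offset, &zero, sizeof (zero));
        }
    }
}
---8<---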

IIUC, the reason the weak symbols are not resolved already is because we use
--pie with --export-dynamic.  The latter ought not be necessary, but it is used
because that's what LLVM uses (or used to use during the v3 era) and is
therefore the path of least resistance within the driver code (as in, this is
the reason we could stop resolving our own relocations).  It's possible that we
don't have to do this any more?

[Bug target/119474] GCN 'libgomp.oacc-c++/pr96835-1.C' ICE 'during GIMPLE pass: ivopts'

2025-03-28 Thread ams at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119474

--- Comment #8 from Andrew Stubbs  ---
This patch fixes the ICE and produces working code at -O2 and -O3:

diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index da2b54b76485..1778a70bf755 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -1907,7 +1907,7 @@ oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data)
   /* Adjust the type of the component ref itself.  */
   tree comp_type = TREE_TYPE (*tp);
   int comp_quals = TYPE_QUALS (comp_type);
-  if (TREE_CODE (*tp) == COMPONENT_REF && comp_quals != base_quals)
+  if (/*TREE_CODE (*tp) == COMPONENT_REF &&*/ comp_quals != base_quals)
{
  comp_quals |= base_quals;
  TREE_TYPE (*tp)

However, we now get a new ICE at -O1 in emit-rtl.cc:663

  /* Allow truncation but not extension since we do not know if the
 number is signed or unsigned.  */ 
  gcc_assert (prec <= v.get_precision ()); 

I think this is probably caused by address space 4 using 32-bit pointers
(normal pointers are 64-bit), but I've not confirmed this yet.