[Bug target/100208] amdgcn fails to build with llvm-mc from llvm12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100208 --- Comment #1 from Andrew Stubbs --- LLVM changed the default parameters, so we either have to change the expectations in the ".amdgcn_target" string (which is basically an assert), or set the attributes we want explicitly on the assembler command line. (Or port binutils to amdgcn, but there's no plan for that.)
[Bug tree-optimization/84958] int loads not eliminated against larger stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84958 --- Comment #6 from Andrew Stubbs --- (In reply to Tom de Vries from comment #5) > I've removed the xfail for nvptx. > > The only remaining xfail is for gcn. Is that one still necessary? The test still fails for gcn.
[Bug target/97521] [11 Regression] wrong code with -mno-sse2 since r11-3394
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97521 --- Comment #21 from Andrew Stubbs --- (In reply to Richard Biener from comment #19) > GCN also uses MODE_INT for the mask mode and thus may be similarly affected. > Andrew - are the bits in the mask dense? Thus for a V4SImode compare > would the mask occupy only the lowest 4 bits of the DImode mask? Yes, that's correct.
[Bug target/97521] [11 Regression] wrong code with -mno-sse2 since r11-3394
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97521 --- Comment #22 from Andrew Stubbs --- (In reply to Andrew Stubbs from comment #21)
> (In reply to Richard Biener from comment #19)
> > GCN also uses MODE_INT for the mask mode and thus may be similarly affected.
> > Andrew - are the bits in the mask dense? Thus for a V4SImode compare
> > would the mask occupy only the lowest 4 bits of the DImode mask?
>
> Yes, that's correct.

Or rather, I should say that that *will* be the case when I add partial vector support; right now it can only be done via masking V64SImode. I have a patch set, but the last problem is that while_ult doesn't operate on partial integer masks, leading to wrong code. AArch64 doesn't have a problem with this because it uses VBI masks of the right size. I have a patch that adds the vector size as an operand to while_ult; this seems to fix the problems on GCN, but I need to make corresponding changes for AArch64 also before I can submit those patches, and time is tight.
[Bug libgomp/97332] [gcn] GCN_NUM_GANGS/GCN_NUM_WORKERS override compile-time constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97332 Andrew Stubbs changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2020-10-08 Status|UNCONFIRMED |NEW --- Comment #1 from Andrew Stubbs --- At the point the overrides are applied (run_kernel) the code only knows what dimensions were selected at runtime, not how those figures were arrived at. It then prints (with GCN_DEBUG set) the "launch attributes" and "launch actuals". To fix this the overrides will have to be applied much earlier, and independently for OpenACC (gcn_exec) and OpenMP (parse_target_attributes). Alternatively, the automatic balancing could be applied later, or the original attributes could be stored for later inspection (but GOMP_kernel_launch_attributes is defined by libgomp). The "attributes" and "actuals" will need to be overhauled. Probably get_group_size can be removed. It ought to be doable though.
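A minimal sketch of the "apply the overrides earlier" option for the OpenACC side only; the helper, the dims[] layout, and the exact call site in gcn_exec are assumptions for illustration, not the real plugin code:

  #include <stdlib.h>

  /* Hypothetical: read the environment overrides before the automatic
     gang/worker balancing runs, so both the balancing and the GCN_DEBUG
     output see the user's values.  */
  static void
  apply_env_overrides (int dims[3])   /* assumed: dims[0]=gangs, dims[1]=workers */
  {
    const char *gangs = getenv ("GCN_NUM_GANGS");
    const char *workers = getenv ("GCN_NUM_WORKERS");
    if (gangs)
      dims[0] = atoi (gangs);
    if (workers)
      dims[1] = atoi (workers);
  }

The OpenMP path (parse_target_attributes) would need the equivalent change separately.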
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418 --- Comment #4 from Andrew Stubbs --- Alexandre's patch has this: emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize)); Is that generally a valid thing to do? It seems like other places do similar things...
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418 Andrew Stubbs changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2021-05-05 Status|UNCONFIRMED |NEW --- Comment #6 from Andrew Stubbs --- Using force_operand does fix Tobias's reduced testcase. I'll test it further and let you know.
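For reference, a minimal sketch of the force_operand approach being tested here, applied to the line quoted in comment #4; exactly where it slots into Alexandre's patch is an assumption:

  /* Legitimize the address arithmetic first, so the move source is an
     operand the target can actually recognize, rather than a raw PLUS.  */
  rtx val = force_operand (plus_constant (ptr_mode, rem, -blksize),
                           NULL_RTX);
  emit_move_insn (rem, val);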
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418 --- Comment #9 from Andrew Stubbs --- I found a couple of other places to put force_operand and the full case works now. Running more tests.
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418 --- Comment #13 from Andrew Stubbs --- I found a lot more ICEs when testing my patch. They look to be unrelated (TImode comes back to haunt us), but it makes it hard to be sure.
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418 Andrew Stubbs changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #17 from Andrew Stubbs --- This issue should be fixed now.
[Bug testsuite/108898] [13 Regression] Test introduced by r13-6278-g3da77f217c8b2089ecba3eb201e727c3fcdcd19d failed on i386
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108898 --- Comment #1 from Andrew Stubbs --- I tested it on i686-pc-linux-gnu before I posted the patch, and it was working then. Can you be more specific about which configuration you were testing, please?
[Bug testsuite/108898] [13 Regression] Test introduced by r13-6278-g3da77f217c8b2089ecba3eb201e727c3fcdcd19d failed on i386
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108898 --- Comment #4 from Andrew Stubbs --- I did not know there was a way to do that! I'll add this to my to-do list.
[Bug target/105246] [amdgcn] Use library call for SQRT with -ffast-math + provide additional option to use single-precsion opcode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105246 --- Comment #2 from Andrew Stubbs --- When we first coded this we only had the GCN3 ISA manual, which says nothing about the accuracy. Now that I look in the Vega manual (GCN5), I see: "Square root with perhaps not the accuracy you were hoping for -- (2**29)ULP accuracy. On the upside, denormals are supported." The most recent CDNA2 manual is a bit less verbose: "Square root. Precision is (2**29) ULP, and supports denormals." The compiler already emits Newton-Raphson iterations for division with -ffast-math, so I'm sure it can be done, but I'm not too clear on the mathematics myself.
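For context, the textbook Newton-Raphson step such an expansion would be built around; this is only the mathematical idea, not the actual GCN expansion, and how many iterations a (2**29) ULP estimate would need is left open:

  /* One Newton-Raphson (Heron) step for sqrt: given an estimate y of
     sqrt(x), return a refined estimate.  Each step roughly doubles the
     number of correct bits.  */
  static float
  refine_sqrtf (float x, float y)
  {
    return 0.5f * (y + x / y);
  }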
[Bug tree-optimization/106476] New: ICE generating FOLD_EXTRACT_LAST
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106476
Bug ID: 106476
Summary: ICE generating FOLD_EXTRACT_LAST
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: major
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ams at gcc dot gnu.org
CC: rguenther at suse dot de
Target Milestone: ---
Target: amdgcn-amdhsa

Commit 8f4d9c1deda "amdgcn: 64-bit not" exposed an ICE in tree-vect-stmts.cc when compiling gcc.dg/torture/pr67470.c at -O2 for amdgcn. The newly implemented op is not the problem, but it allows this optimization (and many others) to proceed, and the error is no longer hidden.

amdgcn has masked vectors and fold_extract_last, which leads to a code path through tree-vect-stmts.cc that has

  vec_then_clause = vec_oprnds2[i];
  if (reduction_type != EXTRACT_LAST_REDUCTION)
    vec_else_clause = vec_oprnds3[i];

and then

  /* Instead of doing ~x ? y : z do x ? z : y. */
  vec_compare = new_temp;
  std::swap (vec_then_clause, vec_else_clause);

and finally

  new_stmt = gimple_build_call_internal (IFN_FOLD_EXTRACT_LAST, 3,
                                         else_clause, vec_compare,
                                         vec_then_clause);

in which vec_then_clause remains set to NULL_TREE. The dump shows

  e_lsm.16_32 = .FOLD_EXTRACT_LAST (e_lsm.16_8, _70, );

(note the last field is missing.)

I can fix the ICE if I add "else vec_else_clause = integer_zero_node", but I'm not sure that is the correct logical solution.

(CC Richi who touched this code last)
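The shape of that workaround, shown in context; whether zero is the logically correct else value is exactly the open question, so this is illustrative only:

  vec_then_clause = vec_oprnds2[i];
  if (reduction_type != EXTRACT_LAST_REDUCTION)
    vec_else_clause = vec_oprnds3[i];
  else
    /* Sketch of the workaround from the report: any defined value here
       stops the later std::swap from feeding NULL_TREE into the
       IFN_FOLD_EXTRACT_LAST call.  Possibly not the right logical fix.  */
    vec_else_clause = integer_zero_node;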
[Bug middle-end/107088] [13 Regression] cselib ICE building __trunctfxf2 on ia64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107088 Andrew Stubbs changed: What|Removed |Added Target|ia64-*-*|ia64-*-*, amdgcn-*-* CC||ams at gcc dot gnu.org --- Comment #7 from Andrew Stubbs --- I get the same failure on amdgcn building newlib/libm/math/kf_rem_pio2.c
[Bug middle-end/107088] [13 Regression] cselib ICE building __trunctfxf2 on ia64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107088 --- Comment #9 from Andrew Stubbs --- I can confirm that the patch fixes the amdgcn build.
[Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096 --- Comment #4 from Andrew Stubbs --- I don't understand rgroups, but I can say that GCN masks are very simply one-bit-one-lane. There are always 64 lanes, regardless of the type, so V64QImode has fewer bytes and bits than V64DImode (when written to memory). This is different to most other architectures, where the bit-size remains the same and the number of lanes varies with the inner type, and it has caused us some issues with invalid assumptions in GCC (e.g. "there's no need for sign-extends in vector registers" is not true for GCN). However, I think it's the same as you're describing for AVX512, at least in this respect. Incidentally I'm on the cusp of adding multiple "virtual" vector sizes in the GCN backend (in lieu of implementing full mask support everywhere in the middle-end and fixing all the cost assumptions), so these VIEW_CONVERT_EXPR issues are getting worse. I have a bunch of vec_extract patterns that fix up some of it. Within the backend, the V32, V16, V8, V4 and V2 vectors are all really just 64-lane vectors with the mask preset, so the mask has to remain DImode or register allocation becomes tricky.
[Bug middle-end/104026] [12 Regression] ICE in wide_int_to_tree_1, at tree.c:1755 via tree-vect-loop-manip.c:673
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104026 Andrew Stubbs changed: What|Removed |Added CC||ams at gcc dot gnu.org --- Comment #6 from Andrew Stubbs --- amdgcn always uses 64-lane vectors, regardless of type, and relies on masking to support anything smaller. The len_store pattern seems to have been introduced in July 2020 which is more recent than the last major work in the amdgcn backend.
[Bug target/100181] hot-cold partitioned code doesn't assemble
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100181 --- Comment #13 from Andrew Stubbs --- I've updated the LLVM version documentation at https://gcc.gnu.org/wiki/Offloading#For_AMD_GCN: It's LLVM 9 or 13.0.1 now (nothing in between), and will be 13.0.1+ for the next release (dropping LLVM 9 because we'll want to add newer device support to GCC soonish).
[Bug target/103201] [12 Regression] trunk 20211111 ftbfs for amdgcn – libgomp/teams.c:49:6: error: 'struct gomp_thread' has no member named 'num_teams'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103201 --- Comment #3 from Andrew Stubbs --- I did some preliminary testing on your patch: the libgomp.c/target-teams-1.c testcase runs fine on amdgcn. I presume that that covers most of the existing features of those runtime calls?
[Bug target/103396] [12 Regression][GCN][BUILD] ICE RTL check: access of elt 4 of vector with last elt 3 in move_callee_saved_registers, at config/gcn/gcn.c:2821
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103396 Andrew Stubbs changed: What|Removed |Added Last reconfirmed||2021-11-24 Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #4 from Andrew Stubbs --- I think I have a fix for this. It happens when the link register has to be saved because it is used implicitly by a function call, but the register is never explicitly mentioned anywhere else in the function. I don't know why this hasn't been a problem before now?
[Bug target/103396] [12 Regression][GCN][BUILD] ICE RTL check: access of elt 4 of vector with last elt 3 in move_callee_saved_registers, at config/gcn/gcn.c:2821
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103396 Andrew Stubbs changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #6 from Andrew Stubbs --- This problem should be fixed now.
[Bug target/102260] amdgcn offload compiler fails to configure, not matching target directive's target id
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102260 Andrew Stubbs changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2021-09-09 Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org --- Comment #1 from Andrew Stubbs --- In addition to changing the amdgcn_target syntax in LLVM 13, the LLVM GCN guys have also renamed the "sram-ecc" attribute to "sramecc" on the CLI, and have not provided any backwards compatibility for either change. These are not helpful decisions and will require configure tests in GCC to support all the variations. :-( I'm working on it.
[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544 --- Comment #1 from Andrew Stubbs --- Please set "export GCN_DEBUG=1", try it again, and post the output.
[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544 --- Comment #3 from Andrew Stubbs --- That output shows that we have the correct libgomp and ROCm is installed and working. Libgomp initialized the GCN plugin, but did not attempt to initialize the device (the next message in the output should have been "Selected kernel arguments memory region", or at least a GCN error message). Instead we have a target-independent libgomp error. Presumably the kernel metadata is malformed, somehow? I think we need a testcase to debug this further, preferably reduced to be as simple as possible. Perhaps it would be a good idea to start with a minimal toy example and see if that works on the device.

#include <stdio.h>
#include <openacc.h>

int
main ()
{
  int v = 1;

#pragma acc parallel copy(v)
  {
    if (acc_on_device (acc_device_host))
      v = -1;  // error
    else
      {
        v = 2;  // success
      }
  }

  printf ("v is %d\n", v);
  return v;
}
[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544 --- Comment #5 from Andrew Stubbs --- Sorry, I should have said to compile with -fopenacc. If you did do that, please post the GCN_DEBUG output.
[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544 --- Comment #8 from Andrew Stubbs --- Did you get the C version to return anything other than "-1"? (The expected result is "2".) I'm still trying to determine if the device is compatible, but the mapping problem looks like a different issue. Your code works fine on my device using a somewhat more recent GCC build. (I can't install that exact toolchain right now.)
[Bug target/107510] gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510 Andrew Stubbs changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #2 from Andrew Stubbs --- Oops, I thought I fixed that. :(
[Bug target/107510] gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510 Andrew Stubbs changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #4 from Andrew Stubbs --- Fixed.
[Bug other/89863] [meta-bug] Issues in gcc that other static analyzers (cppcheck, clang-static-analyzer, PVS-studio) find that gcc misses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89863 Bug 89863 depends on bug 107510, which changed state. Bug 107510 Summary: gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug target/105873] [amdgcn][OpenMP] task reductions fail with "team master not responding; slave thread aborting"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105873 --- Comment #4 from Andrew Stubbs --- I think unused threads should be given a no-op function to run, not a null pointer. The GCN implementation cannot tell the difference between a null pointer and an unset pointer (which is what happens when the master thread dies). There's also a potential issue of what happens when barriers occur within an active thread when there are also inactive threads. GCN barrier instructions are unconditional, meaning that all the live threads must respond. The inactive threads can do so, in a harmless way, as long as they are allowed to spin, but we don't want them spinning forever when the master dies. I believe the current barrier implementation skips the barrier instruction when the team's thread count is 1. This is how we avoid issues with nested teams and tasks. I don't know why that doesn't help here.
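A minimal sketch of the "no-op function" suggestion; the names here are hypothetical and this is not the actual libgomp or GCN plugin code:

  /* Hypothetical do-nothing task for idle threads, so the GCN side can
     distinguish "deliberately idle" from "the master died before it ever
     set the function pointer".  */
  static void
  gomp_gcn_idle_fn (void *data __attribute__ ((unused)))
  {
    /* Nothing to do: returning lets the thread reach the normal exit
       path instead of jumping through a null pointer.  */
  }

  /* At launch time (hypothetical field name):
       start->fn = fn ? fn : gomp_gcn_idle_fn;  */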
[Bug target/95023] Offloading AMD GCN wiki cannot be followed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95023 Andrew Stubbs changed: What|Removed |Added CC||ams at gcc dot gnu.org Resolution|--- |DUPLICATE Status|UNCONFIRMED |RESOLVED --- Comment #3 from Andrew Stubbs --- The second problem in this bug is reported in 97827. *** This bug has been marked as a duplicate of bug 97827 ***
[Bug target/97827] bootstrap error building the amdgcn-amdhsa offload compiler with LLVM 11
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97827 Andrew Stubbs changed: What|Removed |Added CC||xw111luoye at gmail dot com --- Comment #17 from Andrew Stubbs --- *** Bug 95023 has been marked as a duplicate of this bug. ***
[Bug target/101484] [12 Regression] trunk 20210717 ftbfs for amdgcn-amdhsa (gcn offload)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101484 Andrew Stubbs changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed||2021-07-17 --- Comment #1 from Andrew Stubbs --- A new warning has been added that falsely identifies any access to a hardcoded constant address as bogus. This has affected a few targets, including GCN libgomp. See pr101374. There's some discussion about what to do about it. E.g. https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574880.html
[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544 --- Comment #3 from Andrew Stubbs --- The standalone amdgcn configuration does not support C++. There are a number of technical reasons why it doesn't Just Work, but basically it comes down to no-one ever working on it. Our customers were primarily interested in Fortran, with C second. C++ offloading works fine provided that there are no library calls or exceptions.

Ignoring unsupported C++ language features, for now, I don't think there's any reason why libstdc++ would need to be cut down. We already build the full libgfortran for amdgcn. System calls that make no sense on the GPU were implemented as stubs in Newlib (mostly returning some reasonable errno value), and it would be straightforward to implement more the same way.

I believe static constructors work (libgfortran uses some), but exception handling does not. I'm not sure what other exotica C++ might need. As for exceptions, setjmp/longjmp is not implemented because there was no call for it and I didn't know how to handle the GCN register files properly. Not only are they variable-sized, they're also potentially very large: ranging from ~6KB up to ~65KB, I think (102 32-bit scalar, and 256 2048-bit vector registers, for single-threaded mode, but only 80 scalar and 24 vector registers in maximum occupancy mode, in which case per-thread stack space is also quite limited). I'm not sure how the other exception implementations work.
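For illustration, the kind of stub the Newlib port uses for syscalls that make no sense on the GPU; this particular function and errno choice are an example in that spirit, not a quote of the real amdgcn Newlib code:

  #include <errno.h>

  /* Hypothetical stub: there is no filesystem on the device, so just
     report the operation as unsupported.  */
  int
  unlink (const char *pathname)
  {
    (void) pathname;
    errno = ENOSYS;
    return -1;
  }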
[Bug target/100208] amdgcn fails to build with llvm-mc from llvm12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100208 Andrew Stubbs changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #3 from Andrew Stubbs --- I think this issue should be resolved now. (Other reasons why GCC fails with LLVM 12 still exist).
[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544 --- Comment #5 from Andrew Stubbs --- [Note: all of my comments refer to the amdgcn case. nvptx has somewhat different support in this area.]

(In reply to Jonathan Wakely from comment #4)
> But it's a waste of space in the .so to build lots of symbols that use the
> stubs.

DSOs are not supported. This is strictly for static linking only.

> There are other reasons it might be nice to be able to configure libstdc++
> for something in between a full hosted environment and a minimal
> freestanding one.

If it isn't a horrible hack, like libgfortran minimal mode, then fine.

> > I believe static constructors work (libgfortran uses some), but exception
> > handling does not. I'm not sure what other exotica C++ might need?
>
> Ideally, __cxa_atexit and __cxa_thread_atexit for static and thread-local
> destructors, but we can survive without them (and have not-fully-conforming
> destruction ordering).

Offload kernels are just fragments of programs, so this is tricky in those cases. Libgomp explicitly calls _init_array and _fini_array as single-threaded kernel launches. Actually, it's not clear that destruction is in any way interesting, given that code running on the GPU has no external access and the resources are all released when the host program exits.

Similarly, C++ threads are not interesting in the GPU-offload case. There are a fixed number of threads launched on entry and they are managed by libgomp. In theory it would be possible to code gthreads/libstdc++ to use them in standalone mode, but really that mode only exists to facilitate compiler testing.
[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085 --- Comment #6 from Andrew Stubbs --- (In reply to seurer from comment #5)
> I should note that pinned-2 also fails on powerpc64 LE.
>
> make -k check-target-libgomp RUNTESTFLAGS="c.exp=libgomp.c/alloc-pinned-*"
> FAIL: libgomp.c/alloc-pinned-1.c execution test
> FAIL: libgomp.c/alloc-pinned-2.c execution test
>
> On powerpc64 BE pinned-3 and -4 fail (but not -1 and -2):
>
> make -k check-target-libgomp RUNTESTFLAGS="--target_board=unix'{-m32,-m64}' c.exp=libgomp.c/alloc-pinned-*"
> FAIL: libgomp.c/alloc-pinned-3.c execution test
> FAIL: libgomp.c/alloc-pinned-4.c execution test

Please show any messages in the libgomp.log file, and find out what the page sizes and locked memory limits are on both machines.
[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085 --- Comment #8 from Andrew Stubbs --- (In reply to seurer from comment #7)
> On the BE machine:
>
> seurer@nilram:~/gcc/git/build/gcc-test$ ulimit -a
> real-time non-blocking time (microseconds, -R) unlimited
> ...
> max locked memory (kbytes, -l) 529679232
> ...

That's a suspiciously large number, but OK.

> seurer@nilram:~/gcc/git/build/gcc-test$ getconf PAGESIZE
> 65536
>
> There were no messages. Running it in gdb I get:
>
> (gdb) where
> #0 0x0fce3340 in ?? () from /lib32/libc.so.6
> #1 0x0fc851e4 in raise () from /lib32/libc.so.6
> #2 0x0fc6a128 in abort () from /lib32/libc.so.6
> #3 0x1ae4 in set_pin_limit (size=size@entry=131072) at /home/seurer/gcc/git/gcc-test/libgomp/testsuite/libgomp.c/alloc-pinned-4.c:44
> #4 0x1754 in main () at /home/seurer/gcc/git/gcc-test/libgomp/testsuite/libgomp.c/alloc-pinned-4.c:106
>
>   if (getrlimit (RLIMIT_MEMLOCK, &limit))
>     abort (); // line 44 in alloc-pinned-4.c

Why would that fail? Perhaps you can investigate the errno. You're probably best placed to submit a patch for whatever this issue is.

> This is a Debian Trixie machine and it too is using whatever the defaults
> are.

Good to know.
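A small standalone program for investigating the errno and the current limits outside the testsuite; this is just a debugging aid, not part of the testcase:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>

int
main (void)
{
  struct rlimit limit;
  if (getrlimit (RLIMIT_MEMLOCK, &limit))
    {
      /* Report why the call failed instead of aborting silently.  */
      fprintf (stderr, "getrlimit (RLIMIT_MEMLOCK): %s (errno %d)\n",
               strerror (errno), errno);
      return 1;
    }
  printf ("RLIMIT_MEMLOCK: cur %llu, max %llu\n",
          (unsigned long long) limit.rlim_cur,
          (unsigned long long) limit.rlim_max);
  return 0;
}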
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #2 from Andrew Stubbs --- The execution test checks that each of the libgcc routines work correctly, and the scan assembler tests make sure that we're getting coverage of all of them. In this case, the failure indicates that we're not testing the routine we were aiming for (but I think it does execute correctly and give a good result).
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #4 from Andrew Stubbs --- Yes, that's what the simd-math-3* tests do. The simd-math-5* tests are explicitly supposed to be doing this in the context of the autovectorizer. If these tests are being compiled as (newly) intended then we should change the expected results. So, questions:

1. Are the new results actually correct? (So far I only know that being different is expected.)
2. Is there some other testcase form that would exercise the previously intended routines?
3. Is the new behaviour configurable?

I don't think the 16-bit shift bug ever existed on GCN (in which "short" vectors actually have excess bits in each lane, much like scalar registers do).
[Bug driver/114717] '-fcf-protection' vs. offloading compilation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114717 --- Comment #3 from Andrew Stubbs --- Can this be filtered (safely) in mkoffload? That tool is offload-target-specific, so no problem with "if offload target were to support it".
[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304 --- Comment #9 from Andrew Stubbs --- (In reply to Richard Biener from comment #6)
> The best strategy for GCN would be to gather V4QImode aka SImode into the
> V64QImode (or V16SImode) vector. For pix2 we have a gap of 28 elements,
> doing consecutive loads isn't a good strategy here.

I don't fully understand what you're trying to say here, so apologies if you knew all this already and I missed the point.

In general, on GCN V4QImode is not in any way equivalent to SImode (when the values are in registers). The vector registers are not one single string of re-interpretable bits.

For the same reason, you can't load a value as V64QImode and then try to interpret it as V16SImode. GCN vector registers just don't work like SSE/Neon/etc.

When you load a V64QImode vector, each lane is extended to 32 bits, so what you actually get in hardware is a V64SImode vector.

Likewise, when you load a V4QImode vector the hardware representation is actually V4SImode (which in itself is just V64SImode with undefined values in the unused lanes).
[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304 --- Comment #11 from Andrew Stubbs --- (In reply to rguent...@suse.de from comment #10)
> On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
> >
> > --- Comment #9 from Andrew Stubbs ---
> > (In reply to Richard Biener from comment #6)
> > > The best strategy for GCN would be to gather V4QImode aka SImode into
> > > the V64QImode (or V16SImode) vector. For pix2 we have a gap of 28
> > > elements, doing consecutive loads isn't a good strategy here.
> >
> > In general, on GCN V4QImode is not in any way equivalent to SImode (when
> > the values are in registers). The vector registers are not one single
> > string of re-interpretable bits.
> >
> > For the same reason, you can't load a value as V64QImode and then try to
> > interpret it as V16SImode. GCN vector registers just don't work like
> > SSE/Neon/etc.
> >
> > When you load a V64QImode vector, each lane is extended to 32 bits, so
> > what you actually get in hardware is a V64SImode vector.
> >
> > Likewise, when you load a V4QImode vector the hardware representation is
> > actually V4SImode (which in itself is just V64SImode with undefined
> > values in the unused lanes).
>
> I see. I wonder if there's not one or two latent wrong-code because of
> this and the vectorizer's assumptions ;) I suppose modes_tieable_p
> will tell us whether a VIEW_CONVERT_EXPR will do the right thing?
> Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw?
> And V64QImode really V64PSImode?

The mode size says how big it will be when written to memory, so no, they're not the same. I believe this matches the scalar QImode behaviour. We don't use any PSI modes. There are (some) machine instructions for V64QImode (and V64HImode) so we don't want to lose that information.

There may well be some bugs, but we have handling for conversions in a number of places. There are truncate and extend patterns that operate lane-wise, and vec_extract can take a subset of a vector, IIRC.

> Still for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33],
> c[34], c[35], ... } it's probably best to use a single V64QImode gather
> with GCN then rather than four "consecutive" V64QImode loads and then
> element swizzling.

Fewer loads are always better, and permutations are expensive operations (and don't work with 64-lane vectors on RDNA devices because they're actually two 32-lane vectors stuck together), so it can certainly make sense to use gather with a vector of permuted offsets (although it can be expensive to generate that vector in the first place).
[Bug target/112088] [14 Regression] GCN target testing broken by "amdgcn: add -march=gfx1030 EXPERIMENTAL"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112088 Andrew Stubbs changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Last reconfirmed||2023-10-27 Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org --- Comment #1 from Andrew Stubbs --- I'm testing a fix for this.
[Bug target/112088] [14 Regression] GCN target testing broken by "amdgcn: add -march=gfx1030 EXPERIMENTAL"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112088 Andrew Stubbs changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #3 from Andrew Stubbs --- The patch should fix the bug.
[Bug target/112308] [14 Regression] GCN: 'error: literal operands are not supported' for 'v_add_co_u32'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112308 Andrew Stubbs changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2023-11-09 Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org
[Bug target/112313] [14 Regression] GCN target 'gcc.dg/pr111082.c' ICE, 'during RTL pass: vregs': 'error: unrecognizable insn'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112313 Andrew Stubbs changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org --- Comment #2 from Andrew Stubbs --- This is now fixed.
[Bug target/112308] [14 Regression] GCN: 'error: literal operands are not supported' for 'v_add_co_u32'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112308 Andrew Stubbs changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #2 from Andrew Stubbs --- This should be fixed now.
[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481 Andrew Stubbs changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2023-11-13 Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org --- Comment #4 from Andrew Stubbs --- It fails because optab_handler fails to find an instruction for "and_optab" in SImode. I didn't consider handling that case; it seemed so unlikely. I guess architectures that can't "and" masks don't get to have safe masks? ... I'll work on a fix.
[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481 --- Comment #7 from Andrew Stubbs --- Simply changing to OPTAB_WIDEN solves the ICE, but I don't know if it does so in a sensible way, for RISC-V.

@@ -7489,7 +7489,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size,
           if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
             tmp = expand_binop (mode, and_optab, tmp,
                                 GEN_INT ((1 << nunits) - 1), target,
-                                true, OPTAB_DIRECT);
+                                true, OPTAB_WIDEN);
           if (tmp != target)
             emit_move_insn (target, tmp);
           break;

Here are the instructions it generates:

(set (reg:DI 165)
     (and:DI (subreg:DI (reg:SI 164) 0)
             (const_int 1 [0x1])))
(set (reg:SI 154)
     (subreg:SI (reg:DI 165) 0))

Should I use that patch? I think it's harmless on targets where OPTAB_DIRECT would work.
[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481 Andrew Stubbs changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #13 from Andrew Stubbs --- This should be fixed now.
[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313 --- Comment #1 from Andrew Stubbs --- This ICE also affects the following standalone test failures (raw amdgcn, no offloading):

gfortran.dg/assumed_rank_21.f90
gfortran.dg/finalize_38.f90
gfortran.dg/finalize_38a.f90
[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313 --- Comment #3 from Andrew Stubbs --- It's curious that this affects the Fiji target only, and not the newer targets at all. There are some additional register options for multiply instructions, some differences to atomics, but mostly the difference is that Fiji's "flat" load and store instructions can't have offsets.
[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313 --- Comment #5 from Andrew Stubbs --- One thing that is unusual about the GCN stack pointer is that it's actually two registers. Could this be breaking some cprop assumptions? GCN can't fit an address in one (SImode) register so all (DImode) pointers require a pair of registers. We had to rework the dwarf stack representation code for this architecture, so I'm pretty sure no other port does this.
[Bug target/112937] [14 Regression] GCN: FAILs due to unconditional 'f->use_flat_addressing = true;'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112937 --- Comment #2 from Andrew Stubbs --- Flat addressing *should* be the safe option that always works (although using "global" address space permits slightly more efficient offset options).
[Bug target/113022] GCN offloading bricked by "amdgcn: Work around XNACK register allocation problem"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113022 --- Comment #1 from Andrew Stubbs --- This is what I get for trying to get this done before vacation. :( Yes, there's probably something in mkoffload that has to match the default change from -mxnack=any to -mxnack=off on the older ISAs.
[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085 --- Comment #1 from Andrew Stubbs --- That is a typo. I don't want to make it pass on machines that have insufficient memory configured because it will mask the case where it fails for another reason. However, the testcase was originally supposed to fit in 64kB. Is your page size larger than 4kB?
[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085 --- Comment #4 from Andrew Stubbs --- It's going to be difficult to make this test work when only one page of locked memory is available. :-( I will look at making it "unsupported".
[Bug middle-end/113163] [14 Regression][GCN] ICE in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9420
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113163 Andrew Stubbs changed: What|Removed |Added CC||ams at gcc dot gnu.org --- Comment #11 from Andrew Stubbs --- (In reply to Tamar Christina from comment #7) > This seems to happen because the vectorizer decides to use partial vectors > to vectorize the loop and the target picks a nonlinear induction step which > we can't support for early breaks. In which hook is this selected? I'm not aware of this being a deliberate choice we made...
[Bug middle-end/113199] [14 Regression][GCN] ICE (segfault) due to invalid 'loop_mask_46 = VEC_PERM_EXPR' when compiling Newlib's wcsftime.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113199 --- Comment #5 from Andrew Stubbs --- I can confirm that I can now build the amdgcn toolchain once more. :-) Thanks.
[Bug target/113615] internal compiler error: in extract_insn, at recog.cc:2812
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113615 --- Comment #3 from Andrew Stubbs --- I did see these, but I hadn't had time to chase them up. The proposed patch is exactly the sort of solution I was expecting to find, short term. Have you confirmed that it fixes all the cases? A proper solution is to find out how to implement reductions with the RDNA ISA, of course, but that's probably non-trivial (as in, I'm pretty sure it's more than renaming a few mnemonics), and low-priority as GCC does a reasonably good job without them.
[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631 --- Comment #1 from Andrew Stubbs --- It was writing 0 to s12 (scalar register) and then moving the zero to lane zero of v0 (vector register). Now it's writing the 0 directly to v0, of which all but lane zero is masked. These should be identical (unless s12 was also live). The problem must be elsewhere.
[Bug target/115640] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #3 from Andrew Stubbs --- (In reply to Richard Biener from comment #2) > If you force GCN to use fixed length vectors (how?), does it work? How's > it behaving on aarch64 with SVE? (the CI was happy, but maybe doesn't > enable SVE) I believe "--param vect-partial-vector-usage=0" will disable the use of WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked the value is inherited from the host in the offload toolchain; the default for x86_64 was "1", meaning approximately "only use partial vectors in epilogue loops".
[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #10 from Andrew Stubbs --- On 26/06/2024 12:05, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
>
> --- Comment #8 from Richard Biener ---
> (In reply to Richard Biener from comment #7)
>> I will have a look (and for run validation try to reproduce with gfx1036).
>
> OK, so with gfx1036 we end up using 16 byte vectors and the testcase
> passes. The difference with gfx908 is
>
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: ==> examining statement: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: vect_model_load_cost: aligned.
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: vect_model_load_cost: inside_cost = 2, prologue_cost = 0 .
>
> vs.
>
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: ==> examining statement: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 }
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported load permutation
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72: missed: not vectorized: relevant stmt not supported: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: removing SLP instance operations starting from: REALPART_EXPR <(*hadcur_24(D))[_2]> = _86;
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: missed: unsupported SLP instances
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: note: re-trying with SLP disabled
>
> so gfx1036 cannot do such permutes but gfx908 can?

GFX10 has more limited permutation capabilities than GFX9 because it only has 32-lane vectors natively, even though we're using the 64-lane "compatibility" mode.

However, in theory, the permutation capabilities on V32 and below should be the same, and some permutations on V64 are allowed, so I don't know why it doesn't use it. It's possible I broke the logic in gcn_vectorize_vec_perm_const:

  /* RDNA devices can only do permutations within each group of 32-lanes.
     Reject permutations that cross the boundary. */
  if (TARGET_RDNA2_PLUS)
    for (unsigned int i = 0; i < nelt; i++)
      if (i < 31 ? perm[i] > 31 : perm[i] < 32)
        return false;

It looks right to me though?

The vec_extract patterns that also use permutations are likewise supposedly still enabled for V32 and below.

Andrew
[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #14 from Andrew Stubbs --- On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
>
> --- Comment #13 from Richard Biener ---
> (In reply to Richard Biener from comment #12)
>> (In reply to Andrew Stubbs from comment #10)
>>> GFX10 has more limited permutation capabilities than GFX9 because it
>>> only has 32-lane vectors natively, even though we're using the 64-lane
>>> "compatibility" mode.
>>>
>>> However, in theory, the permutation capabilities on V32 and below should
>>> be the same, and some permutations on V64 are allowed, so I don't know
>>> why it doesn't use it. It's possible I broke the logic in
>>> gcn_vectorize_vec_perm_const:
>>>
>>>   /* RDNA devices can only do permutations within each group of 32-lanes.
>>>      Reject permutations that cross the boundary. */
>>>   if (TARGET_RDNA2_PLUS)
>>>     for (unsigned int i = 0; i < nelt; i++)
>>>       if (i < 31 ? perm[i] > 31 : perm[i] < 32)
>>>         return false;
>>>
>>> It looks right to me though?
>>
>> nelt == 32 so I think the last element has the wrong check applied?
>>
>> It should be
>>
>>>   if (i < 32 ? perm[i] > 31 : perm[i] < 32)
>>
>> I think. With that the vectorization happens in a similar way but the
>> failure still doesn't reproduce (without the patch, of course).

Oops, I think you're right.

> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
> two vectors src0 and src1 into one 32 element dst vector (it's no longer
> required that src0 and src1 line up with the dst vector size btw, they
> might have different nelt). So the loop would reject interleaving
> the low parts of two 32 element vectors, a permute that would look like
> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
> mean you can never mix the two vector inputs? Or does GCN not have
> a two-to-one vector permute instruction?

GCN does not have two-to-one vector permute in hardware, so we do two permutes and a vec_merge to get the same effect.

GFX9 can permute all the elements within a 64 lane vector arbitrarily.

GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but no value may cross the boundary. AFAIK there's no way to do that via any vector instruction (i.e. without writing to memory, or extracting values element-wise).

In theory, we could implement permutes with different sized inputs and outputs, but right now those are rejected early. The interleave example wouldn't work in hardware, for GFX10, but we could have it for GFX9.

However, I think you might be right about the numbering of the "perm" array; we probably need to be testing "(perm[i] % nelt) > 31" if we are to support two-to-one permutations.

Thanks for looking at this.

Andrew
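Putting the two observations together, the validity check might end up looking like the sketch below; the modulo form is inferred from the discussion above, not confirmed against what was eventually committed:

  /* RDNA devices can only do permutations within each group of 32 lanes.
     Reject permutations that cross the boundary, treating indices into
     the second input (perm[i] >= nelt) by their lane position.  */
  if (TARGET_RDNA2_PLUS)
    for (unsigned int i = 0; i < nelt; i++)
      if (i < 32 ? (perm[i] % nelt) > 31 : (perm[i] % nelt) < 32)
        return false;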
[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #16 from Andrew Stubbs --- On 26/06/2024 14:41, rguenther at suse dot de wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
>
> --- Comment #15 from rguenther at suse dot de ---
>>> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
>>> two vectors src0 and src1 into one 32 element dst vector (it's no longer
>>> required that src0 and src1 line up with the dst vector size btw, they
>>> might have different nelt). So the loop would reject interleaving
>>> the low parts of two 32 element vectors, a permute that would look like
>>> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
>>> mean you can never mix the two vector inputs? Or does GCN not have
>>> a two-to-one vector permute instruction?
>>
>> GCN does not have two-to-one vector permute in hardware, so we do two
>> permutes and a vec_merge to get the same effect.
>>
>> GFX9 can permute all the elements within a 64 lane vector arbitrarily.
>>
>> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but
>> no value may cross the boundary. AFAIK there's no way to do that via any
>> vector instruction (i.e. without writing to memory, or extracting values
>> element-wise).
>
> I see - so it cannot even swap low-32 and high-32? I'm thinking of
> what sub-part of permutes would be possible by extending the two-to-one
> vec_merge trick.

No(?) The 64-lane compatibility mode works, under the hood, by allocating double the number of 32-lane registers and then executing each instruction twice. Mostly this is invisible, but it gets exposed for permutations and the like.

Logically, the microarchitecture could do a vec_merge to DTRT, but I've not found a way to express that. It's possible I missed something when RTFM.

> OTOH we restrict GFX10/11 to 32 lane vectors so in practice this
> restriction should be fine.

Yes, with the "31" fixed it should work.

Andrew
[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #18 from Andrew Stubbs --- That should fix the broken validation check. All V32 permutations should work now on RDNA GPUs, I think. V16 and smaller were already working fine.
[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104 --- Comment #3 from Andrew Stubbs --- (In reply to Jeffrey A. Law from comment #1) > So, how am I supposed to reproduce this? I don't have an assembler/binutils > for amdgcn and thus libgcc won't configure. Thus I can't extract a testcase. > > Alternately, if you could just attach a .i file, it'd be helpful. In fact, you probably do have an assembler and linker already installed! Just copy "llvm-mc" to amdgcn-amdhsa/bin/as and llvm's "lld" to amdgcn-amdhsa/bin/ld. You might want ar, ranlib, and nm too. Full details here: https://gcc.gnu.org/wiki/Offloading#For_AMD_GCN
[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104 --- Comment #4 from Andrew Stubbs --- The problem insn is this:

(insn 31 30 32 2 (set (reg:V2SI 711)
        (ashift:V2SI (reg:V2SI 161 v1)
            (const_vector:V2SI [
                    (const_int 3 [0x3]) repeated x2
                ]))) "/srv/data/ams/upB/gcn/src/gcc-mainline/libgcc/libgcc2.c":70:11 2043 {vashlv2si3}
     (expr_list:REG_DEAD (reg:V2SI 161 v1)
        (nil)))

That is, it shifts each element in a 2-element vector by 3 bits. This looks like valid GCN code to me. ext_dce does not seem to be handling the vector case.
[Bug target/116103] [15 Regression] GCN vs. "Internal-fn: Only allow modes describe types for internal fn[PR115961]"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116103 --- Comment #8 from Andrew Stubbs --- (In reply to Thomas Schwinge from comment #4)
> (In reply to Richard Biener from comment #2)
> >   if (VECTOR_BOOLEAN_TYPE_P (type)
> >       && SCALAR_INT_MODE_P (TYPE_MODE (type)))
> >     return true;
>
> >       && TYPE_PRECISION (TREE_TYPE (type)) == 1
>
> > Thomas, does that resolve the issue?
>
> Thanks, it does: restores the original '*.s' exactly. (Assuming that's the
> desired outcome, Andrew?)

It looks like while_ult is being rejected ... we really do want that!
[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104 Andrew Stubbs changed: What|Removed |Added Resolution|FIXED |--- Status|RESOLVED|REOPENED --- Comment #9 from Andrew Stubbs --- The patch fixed ASHIFT, but we now have the exact same failure in LSHIFTRT. Looking at the code, there are probably a few other cases in that switch statement: at least ASHIFTRT, and probably all the other uses of CONSTANT_P.
[Bug target/116955] [15 Regression] GCN '-march=gfx1100': [-PASS:-]{+FAIL:+} gcc.dg/vect/pr81740-2.c execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116955 --- Comment #2 from Andrew Stubbs --- Compared to gfx908, gfx1100 lacks 64-lane permute and vector reductions. Permute works with 32 lanes or fewer, but reductions are unimplemented in the backend. Otherwise it should vectorize the same. That might explain some additional failures, but probably doesn't explain a regression.
[Bug target/116571] [15 Regression] GCN vs. "lower SLP load permutation to interleaving"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116571 --- Comment #6 from Andrew Stubbs --- (In reply to Richard Biener from comment #5)
> (In reply to Thomas Schwinge from comment #4)
> > The GCN target FAILs that I originally had reported here:
> >
> > > [-PASS:-]{+FAIL:+} gcc.dg/vect/slp-11a.c scan-tree-dump-times vect "vectorizing stmts using SLP" [-0-]{+1+}
> > > [-PASS:-]{+FAIL:+} gcc.dg/vect/slp-12a.c scan-tree-dump-times vect "vectorizing stmts using SLP" [-0-]{+1+}
> >
> > ... are back to PASS as of recently; should we close this PR?
>
> I'd say so.
>
> > Andrew, anything to be done anyway, regarding the following?
> >
> > (In reply to Richard Biener from comment #3)
> > > Possibly for GCN the issue is the vect_strided8 which is implemented as
> > >
> > > foreach N {2 3 4 5 6 7 8} {
> > >   eval [string map [list N $N] {
> > >     # Return 1 if the target supports 2-vector interleaving
> > >     proc check_effective_target_vect_stridedN { } {
> > >       return [check_cached_effective_target_indexed vect_stridedN {
> > >         if { (N & -N) == N
> > >              && [check_effective_target_vect_interleave]
> > >              && [check_effective_target_vect_extract_even_odd] } {
> > >           return 1
> > >         }
> > >         if { ([istarget arm*-*-*]
> > >               || [istarget aarch64*-*-*]) && N >= 2 && N <= 4 } {
> > >           return 1
> > >         }
> > >         if { ([istarget riscv*-*-*]) && N >= 2 && N <= 8 } {
> > >           return 1
> > >         }
> > >         if [check_effective_target_vect_fully_masked] {
> > >           return 1
> > >         }
> > >
> > > not sure if gcn really supports a load/store-lane with 8 elements.
>
> Note I might have misunderstood the vect_stridedN effective target, but
> the last line looks odd (x86 also can do fully masked loops with AVX512
> but definitely cannot do arbitrary interleaving schemes just because of
> that).

OK, but unless I'm missing something x86 is not one of the targets where check_effective_target_vect_fully_masked is true.

> I'd remove that last line, it was added by you, Andrew, in
> r9-5484-g674931d2b7bd88 ...

I will switch it to test for GCN specifically.

As far as I know, GCN is fine with 8-element vectors and can load/store them from any arbitrary pattern you can generate.
[Bug target/117657] [15 Regression][gcn] ICE during in-tree newlib build: error: unrecognizable insn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117657 --- Comment #6 from Andrew Stubbs --- The patch changed the wrong operand on the gen_gather_insn_1offset_exec call. It sets one of the offsets undefined instead of setting the else value undefined. I'm testing a fix.
[Bug target/117657] [15 Regression][gcn] ICE during in-tree newlib build: error: unrecognizable insn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117657 --- Comment #1 from Andrew Stubbs --- This appears to have been caused by the recent maskload patches, which is weird because I thought I already tested the patches that were posted.
[Bug target/117657] [15 Regression][gcn] ICE during in-tree newlib build: error: unrecognizable insn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117657 Andrew Stubbs changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #12 from Andrew Stubbs --- Marking as fixed. The new test failure has been reported as pr117709.
[Bug target/117709] [15 regression] maskload else case generating wrong code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117709 --- Comment #6 from Andrew Stubbs --- Yes, that fixes the issue, thanks. The only diff in the assembly now, compared to before the "else" patch, is the zero-initialization is gone. This is good; the mysterious extra code seemed like a step backwards. :)
[Bug target/117709] New: [15 regression] maskload else case generating wrong code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117709
Bug ID: 117709
Summary: [15 regression] maskload else case generating wrong code
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ams at gcc dot gnu.org
CC: rdapp at gcc dot gnu.org
Target Milestone: ---

The following testcase aborts on amdgcn since the maskload else patches were added (https://patchwork.sourceware.org/project/gcc/list/?series=40395, plus the patch for pr117657).

---8<---
int i, j;
int k[11][101];

__attribute__((noipa))
void doit (void)
{
  int err = 0;

  for (j = -11; j >= -41; j -= 15)
    {
      k[0][-j] = 1;
    }

#pragma omp simd collapse(2) lastprivate (i, j) reduction(|:err)
  for (i = 4; i < 8; i += 12)
    for (j = -8 * i - 9; j < i * -3 + 6; j += 15)
      err |= (k[0][-j] != 1);

  if (err)
    __builtin_abort ();
}

int main ()
{
  doit ();
}
---8<---

The testcase is simplified from gcc.dg/vect/vect-simd-17.c.

If I revert most of the gcn patch, but keep the new operand with the "maskload_else_operand" predicate, so that the backend generates the zero-initializer, but the middle-end assumes the lanes are undefined, then the testcase still fails. I think this shows that the problem is in the additional code generated by the middle end, not that the vectors are actually corrupt, but I've not identified exactly how.
[Bug target/117709] [15 regression] maskload else case generating wrong code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117709 --- Comment #4 from Andrew Stubbs --- The mask is a 64-bit integer value in the "exec" register. I agree that I cannot see the problem staring at it. Like I said, I changed the backend so that it generated the zero-initializers anyway, and the snippets you show above reverted to what they were before. However, there's some new code further up the assembly that I don't understand yet.
[Bug target/117657] [15 Regression][gcn] ICE during in-tree newlib build: error: unrecognizable insn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117657 --- Comment #9 from Andrew Stubbs --- That commit should fix the build failure. However, I'm now seeing a wrong-code regression in gcc.dg/vect/vect-simd-17.c that I can't prove isn't related. The testcase now aborts, at least on gfx90a, where it used to pass. There's rather a lot going on in there, so I'll have to minimize the testcase. I'm not sure why I didn't see this issue before, but I did miss that the last version of the patch posted had extra new stuff in it, so that's probably why.
[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544 --- Comment #15 from Andrew Stubbs --- BTW, if you're calling "new" in the offload kernel then you're probably "doing it wrong", even when we do implement full C++ support. Offload kernels are for hot code, executed many times, and memory allocation is inherently slow. On AMDGCN, "malloc" uses Newlib's heap support and gets serialized via a global lock. Likewise for "free". On NVPTX, the implementation is provided by the PTX finalizer, so may be better optimized, but I still don't recommend it. I'm assuming you're using printf for debug and testing only, so that's fine, but it definitely has no place in hot code either.
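For illustration only (none of this code is from the PR), here is a minimal sketch of the usual way to avoid in-kernel allocation with OpenMP offloading: allocate the buffer once on the host, map it, and let the kernel only fill it in. The size and names are invented.

#include <stdlib.h>

#define N 1024  /* arbitrary size for the sketch */

void
example (void)
{
  /* Allocate once on the host instead of calling new/malloc inside
     the target region (where it would serialize on the heap lock).  */
  double *buf = (double *) malloc (N * sizeof (double));

  /* Map the pre-allocated buffer; the kernel only writes into it.  */
  #pragma omp target teams distribute parallel for map(from: buf[0:N])
  for (int i = 0; i < N; i++)
    buf[i] = i * 0.5;

  /* ... use buf on the host, then free it there ... */
  free (buf);
}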
[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544 --- Comment #14 from Andrew Stubbs --- "printf" exists and has been working on both AMDGCN and NVPTX devices since forever. "fputs", "puts", and "write", etc. should all work too. If the FORTIFY_SOURCE trick doesn't get rid of __printf_chk, or has other side-effects, then you can probably use __builtin_printf in these instances to bypass the preprocessor "magic".
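As a minimal sketch of that last suggestion (the surrounding function is invented for illustration; only the __builtin_printf call reflects the advice above):

#pragma omp declare target
static void
debug_dump (int lane, double value)
{
  /* __builtin_printf bypasses the _FORTIFY_SOURCE preprocessor "magic"
     that would otherwise rewrite printf into __printf_chk.  */
  __builtin_printf ("lane %d: %f\n", lane, value);
}
#pragma omp end declare target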
[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7257-g54bdeca3c62144
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325 --- Comment #10 from Andrew Stubbs --- The libm vector routines are pretty much just the scalar routine translated into vector extension statements wrapped in preprocessor macros. They should be unaffected by the vectorizer (and most of the optimization passes, for that matter).
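As a rough illustration only (this is not the actual newlib/libm source; the type and function names are invented), a routine written in that style is straight-line GNU vector-extension code, so the vectorizer has nothing to do with it:

typedef float v64sf __attribute__ ((vector_size (64 * sizeof (float))));
typedef int   v64si __attribute__ ((vector_size (64 * sizeof (int))));

static v64sf
v64sf_example (v64sf x, v64sf y)
{
  v64si gt   = x > y;     /* per-lane compare: -1 (true) or 0 (false) */
  v64sf diff = x - y;
  /* Lane-wise select via bitwise masking; every statement already
     operates on whole vectors, so no loop is left to vectorize.  */
  return (v64sf) (((v64si) diff & gt) | ((v64si) x & ~gt));
}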
[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7257-g54bdeca3c62144
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325 --- Comment #17 from Andrew Stubbs --- Oops, that has __to and __from backwards ... you get the idea.
[Bug testsuite/119286] [15 Regression] GCN vs. "middle-end: delay checking for alignment to load [PR118464]"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119286 --- Comment #3 from Andrew Stubbs --- The RDNA consumer devices, such as gfx1100, support permute for V32 and smaller, but not V64. Gather/scatter should be able to load from arbitrary addresses, but synthesising a vector with those addresses may run into the permutation issue. If that's not the issue then I don't know why it wouldn't work. The GFX9/CDNA devices support V64 properly. As for the scalar loads, we never actually implemented any costs. In general, loading a vector elementwise isn't any worse than just doing the scalar algorithm (which is very slow on a GPU), so we have focused on making the vector code work; the exception is RDNA, with its hardware limitations.
[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7257-g54bdeca3c62144
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325 --- Comment #16 from Andrew Stubbs --- Perhaps:

  asm ("mov %0, %1" : "=v"(__from) : "v"(__to));

or maybe

  asm ("; no-op cast %0" : "=v"(__from) : "0"(__to));

Is there a downside to that in the optimizer(s)?
[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7284-g6b56e645a7b481
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325 --- Comment #3 from Andrew Stubbs --- Supposedly, the non-OpenMP equivalent test is gcc.target/gcn/simd-math-1.c, but that one still seems to pass.
[Bug middle-end/119325] [15 Regression] libgomp.c/simd-math-1.c (gcn offloading): timeout (for fmodf, remainderf) since r15-7257-g54bdeca3c62144
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119325 --- Comment #20 from Andrew Stubbs --- I tried the memcpy solution with the following testcase:

v2sf smaller (v64sf in)
{
  v2sf out = RESIZE_VECTOR (v2sf, in);
  return out;
}

v64sf bigger (v2sf in)
{
  v64sf out = RESIZE_VECTOR (v64sf, in);
  return out;
}

It doesn't look great, compiled with -O3 -fomit-frame-pointer, as it's routing the conversion through expensive stack memory:

smaller:
        s_addc_u32      s17, s17, 0
        s_mov_b64       exec, -1
        v_lshlrev_b32   v4, 2, v1
        s_mov_b32       s22, scc
        s_add_u32       s12, s16, -256
        s_addc_u32      s13, s17, -1
        s_cmpk_lg_u32   s22, 0
        v_add_co_u32    v4, s[22:23], s12, v4
        v_mov_b32       v5, s13
        v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
        flat_store_dword        v[4:5], v8 offset:0      < Store big vector
        s_mov_b64       exec, 3
        v_lshlrev_b32   v8, 2, v1
        v_add_co_u32    v8, s[22:23], s12, v8
        v_mov_b32       v9, s13
        v_addc_co_u32   v9, s[22:23], 0, v9, s[22:23]
        flat_load_dword v8, v[8:9] offset:0              < Load smaller vector
        s_waitcnt       0
        s_sub_u32       s16, s16, 256
        s_subb_u32      s17, s17, 0
        s_setpc_b64     s[18:19]

bigger:
        s_addc_u32      s17, s17, 0
        s_mov_b64       exec, -1
        v_mov_b32       v0, 0
        s_mov_b32       s22, scc
        s_add_u32       s12, s16, -256
        s_addc_u32      s13, s17, -1
        s_cmpk_lg_u32   s22, 0
        v_lshlrev_b32   v4, 2, v1
        v_mov_b32       v5, s13
        v_add_co_u32    v4, s[22:23], s12, v4
        v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
        flat_store_dword        v[4:5], v0 offset:0      < Initialize zeroed big vector
        s_mov_b64       exec, 3
        v_lshlrev_b32   v4, 2, v1
        v_mov_b32       v5, s13
        v_add_co_u32    v4, s[22:23], s12, v4
        v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
        flat_store_dword        v[4:5], v8 offset:0      < Store small vector over zeros
        s_mov_b64       exec, -1
        v_lshlrev_b32   v4, 2, v1
        v_mov_b32       v5, s13
        v_add_co_u32    v4, s[22:23], s12, v4
        v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
        flat_load_dword v8, v[4:5] offset:0              < Load combined big vector
        s_waitcnt       0
        s_sub_u32       s16, s16, 256
        s_subb_u32      s17, s17, 0
        s_setpc_b64     s[18:19]

Here's my alternative in-register solution:

#define RESIZE_VECTOR(to_t, from) \
  ({ \
    to_t __to; \
    if (VECTOR_WIDTH (to_t) < VECTOR_WIDTH (__typeof (from))) \
      asm ("; no-op cast %0" : "=v"(__to) : "0"(from)); \
    else \
      { \
        unsigned long __mask = -1L; \
        int lanes = VECTOR_WIDTH (__typeof (from)); \
        __mask <<= lanes; \
        __builtin_choose_expr ( \
          V_SF_SI_P (to_t), \
          ({asm ("v_mov_b32 %0, 0" : "=v"(__to) : "0"(from), "e"(__mask));}), \
          ({asm ("v_mov_b32 %H0, 0\n\t" \
                 "v_mov_b32 %L0, 0" : "=v"(__to) : "0"(from), "e"(__mask));})); \
      } \
[Bug target/119474] GCN 'libgomp.oacc-c++/pr96835-1.C' ICE 'during GIMPLE pass: ivopts'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119474 --- Comment #1 from Andrew Stubbs --- In the -O1 case, the problem seems to be that the "ivopts" pass has identified an item-in-an-array-in-a-struct as the IV, and that struct is in a different address space:

Type:   REFERENCE ADDRESS
Use 0.0:
  At stmt:      _9 = v1.values_[i_38];
  At pos:       v1.values_[i_38]
  IV struct:
    Type:       int *
    Base:       (int *) (unsigned long) &v1
    Step:       4
    Object:     (void *) (&v1)
    Biv:        N
    Overflowness wrto loop niter:       Overflow

When it tries to calculate the base address-of operator, it observes the type of "values_[i_38]", which doesn't carry anything like an address space.

In the -O2 case, the problem seems to be the same, except that it's happening in the "vect" pass: the compiler attempts to create a "vectp" to match &v1, but doesn't propagate the address space correctly.

Question: are we just missing address-space handling in a bunch of places, or is the OpenACC lowering producing non-canonical IVs somehow?
[Bug target/119474] GCN 'libgomp.oacc-c++/pr96835-1.C' ICE 'during GIMPLE pass: ivopts'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119474 --- Comment #6 from Andrew Stubbs --- The address space has to be introduced "late" because it's done in the accelerator compiler, so post-IPA. The pass is "oaccdevlow" (currently no.103). The address space is selected via the TARGET_GOACC_ADJUST_PRIVATE_DECL hook.
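For context, a hedged sketch of what an implementation of that hook can look like (this is not the exact GCN code; the gang-level test and the address-space number are placeholders): the backend requalifies the private decl's type with a target address space, which is why the qualifier only exists after oaccdevlow has run.

/* Sketch of a TARGET_GOACC_ADJUST_PRIVATE_DECL implementation.  */
static tree
example_goacc_adjust_private_decl (location_t loc ATTRIBUTE_UNUSED,
                                   tree var, int level)
{
  if (level == GOMP_DIM_GANG)
    {
      /* Requalify the variable's type with the backend's private
         address space (address space 1 is just a placeholder).  */
      tree type = TREE_TYPE (var);
      int quals = TYPE_QUALS_NO_ADDR_SPACE (type)
                  | ENCODE_QUAL_ADDR_SPACE (1);
      TREE_TYPE (var) = build_qualified_type (type, quals);
    }
  return var;
}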
[Bug target/119474] GCN 'libgomp.oacc-c++/pr96835-1.C' ICE 'during GIMPLE pass: ivopts'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119474 --- Comment #9 from Andrew Stubbs --- This patch fixes the -O1 failure, for *this* testcase:

diff --git a/gcc/tree.cc b/gcc/tree.cc
index eccfcc89da40..4bfdb7a938e7 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -7085,11 +7085,8 @@ build_pointer_type_for_mode (tree to_type, machine_mode mode,
   if (to_type == error_mark_node)
     return error_mark_node;
 
-  if (mode == VOIDmode)
-    {
-      addr_space_t as = TYPE_ADDR_SPACE (to_type);
-      mode = targetm.addr_space.pointer_mode (as);
-    }
+  addr_space_t as = TYPE_ADDR_SPACE (to_type);
+  mode = targetm.addr_space.pointer_mode (as);
 
   /* If the pointed-to type has the may_alias attribute set, force
      a TYPE_REF_CAN_ALIAS_ALL pointer to be generated.  */
@@ -7121,6 +7118,7 @@ build_pointer_type_for_mode (tree to_type, machine_mode mode,
   TYPE_REF_CAN_ALIAS_ALL (t) = can_alias_all;
   TYPE_NEXT_PTR_TO (t) = TYPE_POINTER_TO (to_type);
   TYPE_POINTER_TO (to_type) = t;
+  TYPE_ADDR_SPACE (t) = TYPE_ADDR_SPACE (to_type);
 
   /* During LTO we do not set TYPE_CANONICAL of pointers and references.  */
   if (TYPE_STRUCTURAL_EQUALITY_P (to_type) || in_lto_p)

This is clearly wrong though, because the address space that the pointer is *in* doesn't have to be the same as the one it *points to*, so I need a better solution. Any suggestions where to start?
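To make the in/points-to distinction concrete, a small illustration using the x86 named address spaces GCC already exposes (the GCN spaces have no user-level keyword, so this is only an analogy, not code from the PR):

/* p itself is an ordinary object in the generic address space, but it
   points into __seg_gs, so its pointer mode has to come from the
   pointed-to space.  Stamping TYPE_ADDR_SPACE of the pointer type with
   the pointee's space, as the hunk above does, conflates those two
   properties.  (x86_64 GCC only.)  */
int __seg_gs *p;          /* pointer in generic space, pointee in __seg_gs */
int __seg_gs **pp = &p;   /* &p is a generic-space address of p itself */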
[Bug target/119369] GCN: weak undefined symbols -> execution test FAIL, 'HSA_STATUS_ERROR_VARIABLE_UNDEFINED'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119369 --- Comment #2 from Andrew Stubbs --- We used to have work-arounds for ROCm runtime linker deficiencies, but these were removed in 2020, as they were no longer necessary when we moved to HSACOv3: https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=f64f0899ba14058534e19ee46cfbf6f921e4644c Something like this can probably be put back, if necessary.
[Bug target/119369] GCN: weak undefined symbols -> execution test FAIL, 'HSA_STATUS_ERROR_VARIABLE_UNDEFINED'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119369 --- Comment #5 from Andrew Stubbs --- A post-linker could be included as part of the mkoffload process (or maybe we could fix up the weak directives in the assembler code as part of the pre-assembler step we already have). Either way, there's no mkoffload in the pipeline for the standalone tests. However, given that weak symbols are known to be half-broken, maybe -fno-weak is the correct solution in that configuration?

The way the old workaround worked was that it hid *all* the relocations from the runtime loader by editing the ELF file in memory, and then fixed up the relocations manually, in GPU memory, after the loader had chosen the load address. This was suboptimal because device memcpy is slow (adding only to the initial setup overhead), but it meant we had full control over runtime symbol resolution. It's frustrating, though, because, as you can see from the old code, processing relocations is not hard, and adding weak symbol support would be fairly straightforward, so you'd hope the driver could already handle it. :(

IIUC, the reason the weak symbols are not resolved already is that we use --pie with --export-dynamic. The latter ought not to be necessary, but it is used because that's what LLVM uses (or used to use during the v3 era) and is therefore the path of least resistance within the driver code (as in, this is the reason we could stop resolving our own relocations). It's possible that we don't have to do this any more?
[Bug target/119474] GCN 'libgomp.oacc-c++/pr96835-1.C' ICE 'during GIMPLE pass: ivopts'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119474 --- Comment #8 from Andrew Stubbs --- This patch fixes the ICE and produces working code at -O2 and -O3:

diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index da2b54b76485..1778a70bf755 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -1907,7 +1907,7 @@ oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data)
       /* Adjust the type of the component ref itself.  */
       tree comp_type = TREE_TYPE (*tp);
       int comp_quals = TYPE_QUALS (comp_type);
-      if (TREE_CODE (*tp) == COMPONENT_REF && comp_quals != base_quals)
+      if (/*TREE_CODE (*tp) == COMPONENT_REF &&*/ comp_quals != base_quals)
        {
          comp_quals |= base_quals;
          TREE_TYPE (*tp)

However, we now get a new ICE at -O1 in emit-rtl.cc:663:

  /* Allow truncation but not extension since we do not
     know if the number is signed or unsigned.  */
  gcc_assert (prec <= v.get_precision ());

I think this is probably caused by address space 4 using 32-bit pointers (normal pointers are 64-bit), but I've not confirmed this yet.