[Bug other/99288] xgettext does not get HOST_WIDE_INT_PRINT_UNSIGNED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99288

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from Richard Biener ---
Fixed.
[Bug translation/40883] [meta-bug] Translation breakage with trivial fixes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40883

Bug 40883 depends on bug 99288, which changed state.

Bug 99288 Summary: xgettext does not get HOST_WIDE_INT_PRINT_UNSIGNED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99288

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
[Bug testsuite/99292] FAIL: gcc.c-torture/compile/pr98096.c -O0 (test for excess errors)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99292

--- Comment #1 from Richard Biener ---
IIRC it requires LRA; maybe add a dg target selector for LRA (or for reload -
that set of targets is likely smaller now)?
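A minimal sketch of the suggested guard, assuming the testsuite's 'lra'
effective-target keyword applies here (the exact selector spelling would need
checking against target-supports.exp):

/* { dg-do compile { target lra } } */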
[Bug c/99295] [11 Regression] documentation on __attribute__((malloc)) is wrong
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99295

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
[Bug middle-end/99299] Need a recoverable version of __builtin_trap()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99299

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener ---
'enhancement' Importance is the magic we use; in the end it's a missed
optimization since you refer to sub-optimal code generation.

I'm not sure what your proposed not-noreturn trap() would do in terms of IL
semantics compared to a general call without special annotation?
"Recoverable" likely means resuming after the trap, not on an exception path
(so it'll not be a throw())?

The only thing that might be useful to the middle-end would be marking the
function as not altering the memory state.  But I suppose it should still
serve as a barrier for code motion of both loads and stores, even if those
loads/stores are known not to trap.  The only magic we'd have for this would
be __attribute__((const,returns_twice)), which will likely be more
detrimental to general optimization.

So - what's the "sub-optimal code generation" you refer to from the
(presumably) volatile asm() you use for the trap?  [yeah, asm() on GIMPLE is
less optimized than a call]
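For reference, a minimal sketch of the kind of volatile-asm trap presumably
being compared against __builtin_trap() here (the ud2 opcode is an x86
assumption; any trapping instruction would do):

static inline void recoverable_trap (void)
{
  /* volatile: must not be elided or reordered; unlike __builtin_trap()
     the compiler does not treat this as noreturn, so execution can
     conceptually continue past it.  */
  __asm__ volatile ("ud2");
}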
[Bug rtl-optimization/99305] [11 Regression] range condition simplification after inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99305

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization,
                   |                            |needs-bisection
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Target Milestone|---                         |11.0
   Last reconfirmed|                            |2021-03-01

--- Comment #1 from Richard Biener ---
Confirmed.  Some forwprop/match.pd change prevents phiopt from triggering:

GCC 10 (forwprop->phiopt):

    [local count: 1073741824]:
   _7 = (unsigned char) c_2(D);
   _8 = _7 + 208;
-  if (_8 <= 9)
-    goto ; [50.00%]
-  else
-    goto ; [50.00%]
-
-   [local count: 536870913]:
-
-   [local count: 1073741824]:
-  # iftmp.1_1 = PHI <1(3), 0(2)>
-  return iftmp.1_1;
+  _9 = _8 <= 9;
+  return _9;

forwprop difference GCC 10/11:

-  Replaced '_9 != 0' with '_8 <= 9'

-bar (char c)
+bool bar (char c)
 {
   bool iftmp.1_1;
-  unsigned char _7;
-  unsigned char _8;
+  unsigned char c.0_4;
+  unsigned char _5;
+  bool _6;
+  bool _7;

    [local count: 1073741824]:
-  _7 = (unsigned char) c_2(D);
-  _8 = _7 + 208;
-  if (_8 <= 9)
+  if (c_2(D) != 0)
     goto ; [50.00%]
   else
     goto ; [50.00%]

    [local count: 536870913]:
+  c.0_4 = (unsigned char) c_2(D);
+  _5 = c.0_4 + 208;
+  _6 = _5 <= 9;
+  _7 = -_6;

    [local count: 1073741824]:
-  # iftmp.1_1 = PHI <1(3), 0(2)>
+  # iftmp.1_1 = PHI <_7(3), 0(2)>
   return iftmp.1_1;
[Bug libstdc++/99306] cross compiler bootstrap failure on msdosdjgpp: error: alignment of 'm' is greater than maximum object file alignment 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99306

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-01
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Richard Biener ---
    __gnu_cxx::__mutex& get_mutex(unsigned char i)
    {
      // increase alignment to put each lock on a separate cache line
      struct alignas(64) M : __gnu_cxx::__mutex { };
      static M m[mask + 1];
      return m[i];

There's __BIGGEST_ALIGNMENT__ one could use as a bound, but that will usually
be lower than the maximum object file alignment and on most targets likely
less than 64.  That value (64) looks like it should be target dependent
anyway (configury?).
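A sketch of the suggested bound, clamping the alignment via the predefined
__BIGGEST_ALIGNMENT__ macro (whether the clamped value still gives cache-line
separation is exactly the configury question raised above):

    struct alignas(64 < __BIGGEST_ALIGNMENT__
                   ? 64 : __BIGGEST_ALIGNMENT__) M : __gnu_cxx::__mutex { };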
[Bug c++/99309] [10/11 Regression] Segmentation fault with __builtin_constant_p usage at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99309

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to fail|                            |11.0
   Target Milestone|---                         |10.3
            Summary|Segmentation fault with     |[10/11 Regression]
                   |__builtin_constant_p usage  |Segmentation fault with
                   |at -O2                      |__builtin_constant_p usage
                   |                            |at -O2
      Known to work|                            |9.3.1
           Priority|P3                          |P2
   Last reconfirmed|                            |2021-03-01
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
           Keywords|                            |wrong-code

--- Comment #1 from Richard Biener ---
Confirmed.
[Bug c++/99310] [11 Regression] ICE: canonical types differ for identical types 'void (A::)(void*)' and 'void (A::)(void*)'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99310

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P4
   Target Milestone|---                         |11.0
           Keywords|                            |error-recovery,
                   |                            |ice-checking
[Bug preprocessor/99313] ICE while changing global target options via pragma
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99313

--- Comment #3 from Richard Biener ---
But this results in unexpected behavior when there are functions with
arch=z13 vs. arch=z9, and depending on "luck" we then inherit the wrong
params where we should not?  That said, when unifying target/optimize
options these should be handled and stored once, right?
[Bug c++/99318] [10/11 Regression] -Wdeprecated-declarations where non-should be?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99318

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |10.3
           Keywords|                            |diagnostic, rejects-valid
[Bug c/99323] [9/10/11 Regression] ICE in add_hint, at diagnostic-show-locus.c:2234
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99323

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |9.4
                 CC|                            |dmalcolm at gcc dot gnu.org
[Bug c/99324] ICE in mark_addressable, at gimple-expr.c:918
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99324

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-02
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener ---
Confirmed.

914       /* Also mark the artificial SSA_NAME that points to the partition of X.  */
915       if (TREE_CODE (x) == VAR_DECL
916           && !DECL_EXTERNAL (x)
917           && !TREE_STATIC (x)
918           && cfun->gimple_df != NULL
919           && cfun->gimple_df->decls_to_pointers != NULL)
920         {

(gdb) p cfun
$1 = (function *) 0x0

I suppose this could be made more robust by checking for cfun being non-NULL
or checking currently_expanding_to_rtl.
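A sketch of the suggested hardening (one possible shape, not the committed
fix):

      /* Also mark the artificial SSA_NAME that points to the partition of X.  */
      if (cfun != NULL   /* can be NULL when not expanding a function */
          && TREE_CODE (x) == VAR_DECL
          && !DECL_EXTERNAL (x)
          && !TREE_STATIC (x)
          && cfun->gimple_df != NULL
          && cfun->gimple_df->decls_to_pointers != NULL)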
[Bug c/99325] [11 Regression] ICE in maybe_print_line_1, at c-family/c-ppoutput.c:454
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99325

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-02
             Status|UNCONFIRMED                 |NEW
   Target Milestone|---                         |11.0

--- Comment #1 from Richard Biener ---
Confirmed.
[Bug fortran/99326] [9/10/11 Regression] ICE in gfc_build_dummy_array_decl, at fortran/trans-decl.c:1299
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99326

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P4
   Target Milestone|---                         |9.4
[Bug debug/99334] Generated DWARF unwind table issue while on instructions where rbp is pointing to callers stack frame
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99334

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-02
             Status|UNCONFIRMED                 |WAITING
             Target|                            |x86_64-linux
[Bug c/99324] [8/9/10/11 Regression] ICE in mark_addressable, at gimple-expr.c:918 since r6-314
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99324

--- Comment #4 from Richard Biener ---
(In reply to Jakub Jelinek from comment #3)
> Wouldn't it be better to remove the mark_addressable call from build_va_arg
> and call {c,cxx}_mark_addressable in the callers instead.

Sure, or make it a langhook so c-common code can call the "correct"
mark_addressable (there's also c_common_mark_addressable_vec, which might
suggest that splitting out a common c_common_mark_addressable from
{c,cxx}_mark_addressable should be viable, and using that).

> That way we'd also e.g. diagnose invalid (on i686-linux):
> register __builtin_va_list ap __asm ("%ebx");
>
> void
> foo (int a, ...)
> {
>   __builtin_va_arg (ap, int);
> }
[Bug c/99340] -Werror=maybe-uninitialized warning with -fPIE, but not -fPIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99340

--- Comment #2 from Richard Biener ---
PIC allows interposing ags_midi_buffer_util_get_varlength and thus possibly
initializing the argument.  PIE does not allow this, so we see it is not
initialized.

I suppose the change on the branch is for some unreduced testcase where
different optimization might trigger the new warning (correctly, I think).
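A reduced sketch of the interposition point (illustrative names, not the
reporter's code):

/* With -fPIC this definition may be interposed at runtime, so callers
   cannot conclude from this body whether *len gets initialized.  */
void get_varlength (unsigned *len, int ok)
{
  if (ok)
    *len = 1;              /* leaves *len untouched when !ok */
}

unsigned use (int ok)
{
  unsigned len;
  get_varlength (&len, ok);  /* -fPIE: this body is known to be the one
                                called, so the maybe-uninitialized path
                                is visible and warned about */
  return len;
}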
[Bug middle-end/99339] Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
          Component|c                           |middle-end
                 CC|                            |matz at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-03-02

--- Comment #1 from Richard Biener ---
The stack space is not eliminated because we lower __builtin_va_start only
after RTL expansion, and that reserves stack space necessary for accessing
some of the meta data (including the passed value itself) as memory.  So
it's unavoidable short of somebody designing sth smarter around varargs and
GIMPLE.  Arguably the not-lowered variant would be easier to expand
optimally:

int test_va (int x)
{
  struct  va[1];
  int i;
  int _7;

   [local count: 1073741824]:
  __builtin_va_start (&va, 0);
  i_4 = .VA_ARG (&va, 0B, 0B);
  __builtin_va_end (&va);
  _7 = i_4 + x_6(D);
  va ={v} {CLOBBER};
  return _7;

I'm not fully sure why we lower at all.  Part of the lowering determines
whether there are any FP arguments referenced and optimizes based on that,
but IIRC that's all.
[Bug middle-end/99339] Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #2 from Richard Biener ---
Btw, clang manages to produce the following, which shows the situation could
be worse ;)

test_va:                                # @test_va
        .cfi_startproc
# %bb.0:
        subq    $88, %rsp
        .cfi_def_cfa_offset 96
        movl    %eax, %r10d
        movl    %edi, %eax
        testb   %r10b, %r10b
        je      .LBB0_2
# %bb.1:
        movaps  %xmm0, -48(%rsp)
        movaps  %xmm1, -32(%rsp)
        movaps  %xmm2, -16(%rsp)
        movaps  %xmm3, (%rsp)
        movaps  %xmm4, 16(%rsp)
        movaps  %xmm5, 32(%rsp)
        movaps  %xmm6, 48(%rsp)
        movaps  %xmm7, 64(%rsp)
.LBB0_2:
        movq    %rsi, -88(%rsp)
        movq    %rdx, -80(%rsp)
        movq    %rcx, -72(%rsp)
        movq    %r8, -64(%rsp)
        movq    %r9, -56(%rsp)
        leaq    -96(%rsp), %rcx
        movq    %rcx, -112(%rsp)
        leaq    96(%rsp), %rcx
        movq    %rcx, -120(%rsp)
        movabsq $206158430216, %rcx     # imm = 0x38
        movq    %rcx, -128(%rsp)
        movl    $8, %edx
        cmpq    $40, %rdx
        ja      .LBB0_4
# %bb.3:
        movl    $8, %ecx
        addq    -112(%rsp), %rcx
        addl    $8, %edx
        movl    %edx, -128(%rsp)
        jmp     .LBB0_5
.LBB0_4:
        movq    -120(%rsp), %rcx
        leaq    8(%rcx), %rdx
        movq    %rdx, -120(%rsp)
.LBB0_5:
        addl    (%rcx), %eax
        addq    $88, %rsp
        .cfi_def_cfa_offset 8
        retq
[Bug middle-end/99339] Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #3 from Richard Biener ---
So we could try to lower even va_start/end to expose the va_list meta data
fully to the middle-end early, which should eventually allow eliding it.
That would require introducing other builtins/internal fns to allow
referencing the frame or the incoming arg registers by number.
[Bug c/99340] -Werror=maybe-uninitialized warning with -fPIE, but not -fPIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99340

--- Comment #6 from Richard Biener ---
GCC 9 warns as well.  I think this was a false negative which is now fixed.
Note GCC 10.1.0 and GCC 10.2.0 warn for me as well, so something must have
regressed this between 10.2.0 and g:eddcb627ccfbd97e025cf366

I'm inclined to mark as INVALID.
[Bug middle-end/99339] Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org

--- Comment #7 from Richard Biener ---
For simple cases some IPA pass (IPA-CP or IPA-SRA?) could also 'clone'
varargs functions based on callers, eliding varargs and thus also allowing
inlining (or, like the early IPA-SRA did, modify a function in place if all
callers are simple).  Directly supporting inlining might also be possible.

What's required for all this is some local analysis of the varargs function
on whether it's possible to replace the .VA_ARG calls with direct parameter
references (no .VA_ARG in loops for example, no passing of the va_list to
other functions, etc.), as sketched below.
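As a hedged illustration of what such local analysis would accept (my
example, not taken from the PR): a varargs function whose va_arg uses are
straight-line and whose va_list never escapes, so a clone taking an explicit
second parameter would be semantically equivalent for a call like
sum1 (1, 42):

#include <stdarg.h>

static int sum1 (int n, ...)
{
  va_list ap;
  va_start (ap, n);
  int v = va_arg (ap, int);   /* not in a loop, ap not passed elsewhere */
  va_end (ap);
  return n + v;
}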
[Bug inline-asm/99342] Clobbered register used for input operand (aarch64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID
             Target|                            |aarch64

--- Comment #5 from Richard Biener ---
(In reply to Stewart Hildebrand from comment #4)
> Created attachment 50287 [details]
> Simplified test case
>
> I simplified the test case - hopefully this should make it clearer. This:
>
> asm volatile("\n"
>              "ldr x0, %0 \n"
>              "ldr x1, %1 \n"
>              "ldr x2, %2 \n"
>              : // No output operands
>              : // Inputs:
>                "Q"(s_current->_state.fp), "Ump"(s_current->_state.sp),
>                "Ump"(this->_state.fp)
>              : // Clobbers:
>                // Registers we use here
>                "x0", "x1", "x2",
>                // Callee-saved registers (general purpose)
>                "x19", "x20", "x21", "x22", "x23", "x24",
>                "x25", "x26", "x27", "x28",
>                // Memory access
>                "memory");
>
> Results in:
>
> 118:   f9400080        ldr     x0, [x4]
> 11c:   f9401461        ldr     x1, [x3, #40]
> 120:   f9400c02        ldr     x2, [x0, #24]

You are clobbering x{0,1,2} before the asm has finished using its input
operands, so you have to use earlyclobbers.
[Bug preprocessor/99343] Suggest: -H option support output to file
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99343

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-03
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener ---
Sounds reasonable.  Patches should be sent to gcc-patc...@gcc.gnu.org; see
also https://gcc.gnu.org/contribute.html
[Bug fortran/99345] [11 Regression] ICE in doloop_contained_procedure_code, at fortran/frontend-passes.c:2464
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99345

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P4
   Target Milestone|---                         |11.0
[Bug rtl-optimization/99347] [9/10/11 Regression] ICE in create_block_for_bookkeeping, at sel-sched.c:4549 since r9-6859-g25eafae67f186cfa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99347

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |9.4
           Priority|P3                          |P2
                 CC|                            |amonakov at gcc dot gnu.org
[Bug fortran/99350] [9/10/11 Regression] ICE in gfc_get_symbol_decl, at fortran/trans-decl.c:1869
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99350

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P4
   Target Milestone|---                         |9.4
[Bug fortran/99355] -freal-X-real-Y -freal-Z-real-X promotes Z to Y
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99355

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unknown                     |10.2.0

--- Comment #2 from Richard Biener ---
So you say -freal-8-real-16 -freal-4-real-8 promotes real(4) to real(16),
i.e. the promotions chain.  Which indeed sounds less than useful but could
be a valid reading of the intended semantics.
[Bug ipa/99357] Missed Dead Code Elimination Opportunity
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99357

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-03
             Status|UNCONFIRMED                 |NEW
          Component|tree-optimization           |ipa
     Ever confirmed|0                           |1
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |marxin at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener ---
We have no "flow-sensitive" analysis of global variable values.  When you
remove the 'a = 0' assignment we figure 'a' is never written to and promote
it constant, which then allows constant folding of the read.

Now we could eventually enhance that analysis to ignore writes that store
the same value as the initializer (and also make sure to remove those
later).  But consider

static int a = 0;
extern void bar(void);
int main()
{
  if (a) bar();
  a = 1;
  return 0;
}

which would still be valid to optimize to just

int main() { return 0; }

eliding the call and the variable 'a' completely (since it's unused).  Thus
it's also a missed dead store elimination (for which we'd need to know
whether there are any finalizers referencing 'a', for example).
[Bug tree-optimization/97897] ICE tree check: expected ssa_name, have integer_cst in compute_optimized_partition_bases, at tree-ssa-coalesce.c:1638
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97897

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |10.2.1
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #5 from Richard Biener ---
Fixed.  Not planning to backport further.
[Bug tree-optimization/98526] [10 Regression] Double-counting of reduction cost
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98526

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
           Priority|P3                          |P2
      Known to fail|                            |10.2.0
         Resolution|---                         |FIXED
      Known to work|                            |10.2.1

--- Comment #7 from Richard Biener ---
Fixed.
[Bug tree-optimization/98640] [10 Regression] GCC produces incorrect code with -O1 and higher since r10-2711-g3ed01d5408045d80
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98640

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #5 from Richard Biener ---
Fixed.
[Bug tree-optimization/99101] optimization bug with -ffinite-loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #21 from Richard Biener ---
So I'm somewhat lost in pointing to the actual error.  And I'm not sure
there is any error, just unfortunate optimization behavior in the face of
the testcase being undefined with -ffinite-loops (or in C++).

That said, a more "sensible" optimization of the undefined behavior would
have been to exit the loop, not preserving the if (xx) test.  With
preserving it we either end up with infinite puts() or no puts() calls, both
of which have the "wrong" number of invocations of the side-effect in the
loop.

There's still the intuitively missing control dependence on the if (at_eof)
check (which is also missing without -ffinite-loops but doesn't cause any
wrong DCE there).  But as said, my gut feeling is that control dependence
doesn't capture the number of invocations but only whether something is
invoked.  That's likely why we manually add control dependences on the latch
of loops for possibly infinite loops.

CCing Honza who added the control-dependence stuff and who may remember some
extra details.
[Bug tree-optimization/99101] optimization bug with -ffinite-loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

--- Comment #23 from Richard Biener ---
Just for the record, we had the idea to apply the "bolt" of marking the
latch control dependence (as done for possibly infinite loops) for loops
containing stmts with side-effects:

diff --git a/gcc/tree-ssa-dce.c b/gcc/tree-ssa-dce.c
index c027230acdc..c07b60bf25c 100644
--- a/gcc/tree-ssa-dce.c
+++ b/gcc/tree-ssa-dce.c
@@ -695,6 +695,12 @@ propagate_necessity (bool aggressive)
          if (bb != ENTRY_BLOCK_PTR_FOR_FN (cfun)
              && !bitmap_bit_p (visited_control_parents, bb->index))
            mark_control_dependent_edges_necessary (bb, false);
+         /* If the stmt has side-effects the number of invocations matter.
+            In this case mark the containing loop control.  */
+         if (gimple_has_side_effects (stmt)
+             && bb->loop_father->num != 0)
+           mark_control_dependent_edges_necessary (bb->loop_father->latch,
+                                                   false);
        }

       if (gimple_code (stmt) == GIMPLE_PHI

But while that works for CDDCE1, CDDCE2 is presented a slightly altered CFG
that somehow prevents it from working.  Which also means that both loops
need to be considered infinite for the present bolting to work.
[Bug c/99363] [11 regression] gcc.dg/attr-flatten-1.c fails starting with r11-7469
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99363

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-04
     Ever confirmed|0                           |1
           Keywords|                            |diagnostic, needs-bisection
             Target|powerpc64*-linux-gnu,       |powerpc64*-linux-gnu,
                   |cris-elf                    |cris-elf, x86_64-*-*
   Target Milestone|---                         |11.0
             Status|UNCONFIRMED                 |NEW
          Component|other                       |c

--- Comment #2 from Richard Biener ---
Likely fails everywhere.  Possibly a testsuite issue.
[Bug fortran/99369] [10/11 Regression] ICE in gfc_resolve_expr, at fortran/resolve.c:7167
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99369

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P4
   Target Milestone|---                         |10.3
[Bug target/99372] gimplefe-28.c ICEs when sqrt insn is not available
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99372

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unknown                     |11.0
          Component|tree-optimization           |target
             Target|                            |powerpc

--- Comment #1 from Richard Biener ---
It does check for that:

/* { dg-do compile { target sqrt_insn } } */

The error is with the powerpc target not implementing the sqrt_insn
effective-target check properly.
[Bug ipa/99373] unused static function not being removed in some cases after optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99373

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-03-04
                 CC|                            |hubicka at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener ---
The issue is that only IPA reference promotes 'd' constant, and thus only
late optimization elides the call to 'j'.  That's too late to eliminate the
function.  Note we process 'j' first during late opts (to make the late
local IPA pure-const useful).

We'd need another IPA phase before RTL expansion to collect unreachable
functions again (IIRC the original parallel compilation GSoC project added
one).  I'm also quite sure we have a duplicate of this PR.
[Bug tree-optimization/99383] No tree-switch-conversion under PIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99383

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marxin at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2021-03-04
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener ---
Same for -fPIE.  The reason is:

  Bailing out - value from a case would need runtime relocations.

      reloc = initializer_constant_valid_p (val, TREE_TYPE (val));
      if ((flag_pic && reloc != null_pointer_node)
          || (!flag_pic && reloc == NULL_TREE))
        {
          if (reloc)
            reason = "value from a case would need runtime relocations";

reloc is a STRING_CST here.  Not sure why it says 'runtime relocation' or
what that should be.  It's a reloc in .rodata to sth in .string.
[Bug tree-optimization/99383] No tree-switch-conversion under PIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99383

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #2 from Richard Biener ---
-fPIC/-fPIE refers to _code_, so I'm not sure why we restrict _data_ in any
way here?  Using those flags, at least?  Jakub added this code for PR36881
in g:f6e6e9904cd32cc78873a33f0a3839812b0d0f57
[Bug tree-optimization/99383] No tree-switch-conversion under PIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99383

--- Comment #3 from Richard Biener ---
For the specific case of strings, switch-conversion could also generate a
combined string (with intermediate '\0's) and use a table of offsets into
said string, thus doing a single relocation to the combined string in .text
(or GOT) plus offsetting that with the offset from the table (at the cost of
less string merging and thus a larger .string section).  A sketch follows
below.

I guess relocs to .string aren't any better than relocs to .{,ro}data.
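A sketch of that proposed layout (illustrative code, not what
switch-conversion emits today): one combined string plus a table of plain
integer offsets, leaving a single relocation to the pool itself:

static const char pool[] = "first\0second\0third";  /* '\0'-separated */
static const unsigned char off[] = { 0, 6, 13 };    /* offsets into pool */

const char *lookup (unsigned i)
{
  return pool + off[i];   /* one reloc (to 'pool'); offsets are pure data */
}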
[Bug middle-end/97855] [11 regression] Bogus warning locations during lto-bootstrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97855

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #4 from Richard Biener ---
Fixed on trunk (but surely latent elsewhere as well).
[Bug gcov-profile/99385] [11 regression] gcc.dg/tree-prof/indir-call-prof-malloc.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99385

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org
           Keywords|                            |ra

--- Comment #17 from Richard Biener ---
So coming back here.  We're presenting RA with a quite hard problem given we
have

(insn 7 4 8 2 (set (reg:TI 84 [ _9 ])
        (mem:TI (reg:DI 101) [0 MEM <__int128 unsigned> [(char *
{ref-all})in_8(D)]+0 S16 A8])) 73 {*movti_internal}
     (expr_list:REG_DEAD (reg:DI 101)
        (nil)))
(insn 8 7 9 2 (parallel [
            (set (reg:DI 95)
                (lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 8)
                    (const_int 63 [0x3f])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":7:26 703 {*lshrdi3_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
...
(insn 10 9 11 2 (parallel [
            (set (reg:DI 97)
                (lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 0)
                    (const_int 63 [0x3f])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":8:30 703 {*lshrdi3_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
...
(insn 12 11 13 2 (set (reg:V2DI 98 [ vect__5.3 ])
        (ashift:V2DI (subreg:V2DI (reg:TI 84 [ _9 ]) 0)
            (const_int 1 [0x1]))) "t.c":9:16 3611 {ashlv2di3}
     (expr_list:REG_DEAD (reg:TI 84 [ _9 ])
        (nil)))

where I wonder why we keep the (subreg:DI (reg:TI 84 ...) 8) around for so
long.  Probably the subreg pass gives up because of the V2DImode subreg of
that reg.

That said, RA chooses xmm for reg 84 but then spills it immediately to
fulfil the subregs, even though there are mov and pextrd that could be used,
or the reload could use the original mem.  That we reload even the xmm use
is another odd thing.

Vlad, I'm not sure about the possibilities LRA has here but maybe you can
have a look at the testcase in comment#6 (use -O3 -march=znver2 or
-march=core-avx2).  For one I expected

        vmovdqu (%rsi), %xmm2
        vmovdqa %xmm2, -24(%rsp)
        movq    -16(%rsp), %rax    (2a)
        vmovdqa -24(%rsp), %xmm4   (1)
...
        movq    -24(%rsp), %rdx    (2b)

(1) to be not there (not sure how that even survives postreload
optimizations...), and (2a/b) to be 'inherited' by instead loading from
(%rsi) and 8(%rsi), which is maybe too much to ask because it requires
aliasing considerations.

That is, even if we don't consider using

        movq    %xmm2, %rax        (2a)
        pextrd  %xmm2, %rdx, 1     (2b)

I expected us to not spill.
[Bug c++/99386] std::variant overhead much larger compared to clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99386

--- Comment #1 from Richard Biener ---
Is that clang++ using libstdc++ from GCC or libc++?  In the end the
difference might boil down to inlining decision differences.
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #18 from Richard Biener ---
There's another thing - we end up with

        vmovq   %rax, %xmm3
        vpinsrq $1, %rdx, %xmm3, %xmm0

but that has way worse latency than the alternative you'd get w/o SSE 4.1:

        vmovq   %rax, %xmm3
        vmovq   %rdx, %xmm7
        punpcklqdq      %xmm7, %xmm3

For example on Zen3 vmovq and vpinsrq have latencies of 3 while punpck has a
latency of only one.  So the second variant should have 2 cycles less
latency.  Testcase:

typedef long v2di __attribute__((vector_size(16)));

v2di foo (long a, long b)
{
  return (v2di){a, b};
}

Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3.  Not sure
if we should do this late somehow (peephole or splitter) since it requires
one more %xmm register.
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #19 from Richard Biener ---
So to recover performance we need both: avoiding the latency on the vector
side plus avoiding the spilling.  This variant is fast:

.L56:
        .cfi_restore_state
        vmovdqu (%rsi), %xmm4
        movq    8(%rsi), %rdx
        shrq    $63, %rdx
        imulq   $135, %rdx, %rdi
        movq    (%rsi), %rdx
        vmovq   %rdi, %xmm0
        vpsllq  $1, %xmm4, %xmm1
        shrq    $63, %rdx
        vmovq   %rdx, %xmm5
        vpunpcklqdq     %xmm5, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rax)
        jmp     .L53

compared to the original:

.L56:
        .cfi_restore_state
        vmovdqu (%rsi), %xmm4
        vmovdqa %xmm4, 16(%rsp)
        movq    24(%rsp), %rdx
        vmovdqa 16(%rsp), %xmm5
        shrq    $63, %rdx
        imulq   $135, %rdx, %rdi
        movq    16(%rsp), %rdx
        vmovq   %rdi, %xmm0
        vpsllq  $1, %xmm5, %xmm1
        shrq    $63, %rdx
        vpinsrq $1, %rdx, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rax)
        jmp     .L53
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #22 from Richard Biener ---
(In reply to Uroš Bizjak from comment #21)
> (In reply to Uroš Bizjak from comment #20)
> > (In reply to Richard Biener from comment #18)
> > > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3.  Not
> > > sure if we should somehow do this late somehow (peephole or splitter)
> > > since it requires one more %xmm register.
> > What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
> Please try this:
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index db5be59f5b7..edf7b1a3074 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -16043,7 +16043,12 @@
>         (const_string "maybe_evex")
>        ]
>        (const_string "orig")))
> -   (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
> +   (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")
> +   (set (attr "preferred_for_speed")
> +     (cond [(eq_attr "alternative" "0,1,2,3")
> +              (symbol_ref "false")
> +           ]
> +           (symbol_ref "true")))])
>
> (define_insn "*vec_concatv2di_0"

That works to avoid the vpinsrq.  I guess the case of a mem operand behaves
similarly to a gpr (plus the load uop), at least I don't have any contrary
evidence (but I didn't do any microbenchmarks either).

I'm not sure IRA/LRA will optimally handle the situation with register
pressure causing spilling in case it needs to reload both gpr operands.
At least for

typedef long v2di __attribute__((vector_size(16)));

v2di foo (long a, long b)
{
  return (v2di){a, b};
}

with -msse4.1 -O3 -ffixed-xmm1 -ffixed-xmm2 -ffixed-xmm3 -ffixed-xmm4
-ffixed-xmm5 -ffixed-xmm6 -ffixed-xmm7 -ffixed-xmm8 -ffixed-xmm9
-ffixed-xmm10 -ffixed-xmm11 -ffixed-xmm12 -ffixed-xmm13 -ffixed-xmm14
-ffixed-xmm15 I get with the patch

foo:
.LFB0:
        .cfi_startproc
        movq    %rsi, -16(%rsp)
        movq    %rdi, %xmm0
        pinsrq  $1, -16(%rsp), %xmm0
        ret

while without it's

        movq    %rdi, %xmm0
        pinsrq  $1, %rsi, %xmm0

As far as I understand the LRA dumps, the new attribute is a hard one,
applying even when other alternatives are worse.  In this case we choose
alt 7.  Covering also alts 7 and 8 with the optimize-for-speed attribute
causes reload failures - which is expected if there's no way for LRA to
choose alt 1.

The following seems to work for the small testcase above but not for the
important case in the benchmark (meh).

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..e393a0d823b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15992,7 +15992,7 @@
          (match_operand:DI 1 "register_operand"
          "  0, 0,x ,Yv,0,Yv,0,0,v")
          (match_operand:DI 2 "nonimmediate_operand"
-         " rm,rm,rm,rm,x,Yv,x,m,m")))]
+         " !rm,!rm,!rm,!rm,x,Yv,x,!m,!m")))]
   "TARGET_SSE"
   "@
    pinsrq\t{$1, %2, %0|%0, %2, 1}

I guess the idea of this insn setup was exactly to get IRA/LRA to choose the
optimal instruction sequence - otherwise exposing the reload so late is
probably suboptimal.
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #23 from Richard Biener ---
Created attachment 50300
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50300&action=edit
preprocessed source of the important Botan TU

This is the full preprocessed source of the TU.  When compiled with -Ofast
-march=znver2, look for poly_double_n_le in the assembly; in the prologue
the function jumps based on kernel size - size 16 is the important one:

        cmpq    $16, %rdx
        je      .L54
...
.L54:
        .cfi_restore_state
        vmovdqu (%rsi), %xmm4
        vmovdqa %xmm4, 16(%rsp)
        movq    24(%rsp), %rdx
        vmovdqa 16(%rsp), %xmm5
        shrq    $63, %rdx
        imulq   $135, %rdx, %rcx
        movq    16(%rsp), %rdx
        vmovq   %rcx, %xmm0
        vpsllq  $1, %xmm5, %xmm1
        shrq    $63, %rdx
        vpinsrq $1, %rdx, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdi)
        leaq    -16(%rbp), %rsp
        popq    %r12
        popq    %r13
        popq    %rbp
        .cfi_remember_state
        .cfi_def_cfa 7, 8
        ret
[Bug tree-optimization/95401] [10 Regression] GCC produces incorrect instruction with -O3 for AVX2 since r10-2257-g868363d4f52df19d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95401

--- Comment #8 from Richard Biener ---
(In reply to Alexandre Oliva from comment #7)
> How important is it that the test added for this PR be split into two
> separate source files?
>
> I ask because, on targets that support vectors, but the vector unit is not
> enabled in the default configuration, vect.exp makes compile the default
> action, instead of run, and with additional sources, compile fails because
> one can't compile multiple sources into a single asm output.

Hmm, but that sounds like a mistake in the dg setup?  Anyway, if you can
make the testcase fail when combined (and some noipa attributes sprinkled
around) it's certainly fine to merge it into a single TU.
[Bug middle-end/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

--- Comment #2 from Richard Biener ---
This is a loop-carried data dependence which we can't handle (we avoid
creating those from PRE, but here it appears in the source itself).  I
wonder how LLVM handles this (pre/post-vectorization IL).  Specifically the
'carry around variable' is something we don't handle.  Can you somehow
extract a compilable testcase (with just this kernel)?

Looking at the source, peeling a single iteration (to get rid of the initial
value) and then undoing the PRE, vectorizing

  for (int i = 1; i < LEN_1D; i++)
    {
      a[i] = (b[i] + b[i-1]) * (real_t).5;
    }

would likely result in optimal code.

The assembly from clang doesn't look optimal to me - LLVM likely
materializes 'x' as a temporary array, vectorizing

  x[0] = b[LEN_1D-1];
  for (int i = 0; i < LEN_1D; i++)
    {
      a[i] = (b[i] + x[i]) * (real_t).5;
      x[i+1] = b[i];
    }

and then somehow (like we handle OMP simd lane arrays?) uses two vectors as
a sliding window over x[].  At least the standard strategy for these kinds
of dependences is to get "rid" of them by making them data dependences and
then hope for the best.
[Bug tree-optimization/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-05
                 CC|                            |rguenth at gcc dot gnu.org
          Component|middle-end                  |tree-optimization
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-05
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
          Component|middle-end                  |tree-optimization

--- Comment #2 from Richard Biener ---
Please provide compilable testcases ...  Reduced testcase:

double a[1024];
void foo ()
{
  for (int i = 0; i < 1022; i += 2)
    {
      a[i] = a[i+1] * a[i];
      a[i+1] = a[i+2] * a[i+1];
    }
}
[Bug tree-optimization/99397] s152 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99397

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |tree-optimization
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-03-05

--- Comment #1 from Richard Biener ---
That's the long-standing issue of dependence analysis not handling mixed
array and pointer access forms, which means we miss the distance-zero
computation and handling here.  There's a duplicate for this.  The
mitigation is to "try again" with the array access demoted to a
pointer-based access (thus, analyze some alternative DR and see if
dependence analysis can handle that).
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #26 from Richard Biener ---
(In reply to rguent...@suse.de from comment #25)
> On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
> >
> > --- Comment #24 from Uroš Bizjak ---
> > (In reply to Richard Biener from comment #22)
> > > I guess the idea of this insn setup was exactly to get IRA/LRA choose
> > > the optimal instruction sequence - otherwise exposing the reload so
> > > late is probably suboptimal.
> >
> > THere is one more tool in the toolbox.  A peephole2 pattern can be
> > conditionalized on an available XMM register.  So, if an XMM reg is
> > available, the GPR->XMM move can be emitted in front of the insn.  So, if
> > there is XMM register pressure, pinsrd will be used, but if an XMM
> > register is available, it will be reused to emit punpcklqdq.
> >
> > The peephole2 pattern can also be conditionalized for targets where
> > GPR->XMM moves are fast.
>
> Note the trick is esp. important when GPR->XMM moves are _slow_.  But only
> in the case we originally combine two GPR operands.  Doing two
> GPR->XMM moves and then one punpcklqdq hides half of the latency of the
> slow moves since they have no data dependence on each other.  So for the
> peephole we should try to match this - a reloaded operand and a GPR
> operand.  When the %xmm operand results from a SSE computation there's
> no point in splitting out a GPR->XMM move.
>
> So in the end a peephole2 sounds like it could better match the condition
> the transform is profitable on.

I tried

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..8d0d3077cf8 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1419,6 +1419,23 @@
   DONE;
 })

+(define_peephole2
+  [(set (match_operand:DI 0 "sse_reg_operand")
+        (match_operand:DI 1 "general_gr_operand"))
+   (match_scratch:DI 2 "sse_reg_operand")
+   (set (match_operand:V2DI 2 "sse_reg_operand")
+        (vec_concat:V2DI (match_dup:DI 0)
+                         (match_operand:DI 3 "general_gr_operand")))]
+  "reload_completed"
+  [(set (match_dup 0)
+        (match_dup 1))
+   (set (match_dup 2)
+        (match_dup 3))
+   (set (match_dup 2)
+        (vec_concat:V2DI (match_dup 0)
+                         (match_dup 2)))]
+  "")
+
 ;; Merge movsd/movhpd to movupd for TARGET_SSE_UNALIGNED_LOAD_OPTIMAL targets.
 (define_peephole2
   [(set (match_operand:V2DF 0 "sse_reg_operand")

but that doesn't seem to match for some unknown reason.
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #29 from Richard Biener ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
>
> Try this:
>
> (define_peephole2
>   [(match_scratch:DI 5 "Yv")
>    (set (match_operand:DI 0 "sse_reg_operand")
>         (match_operand:DI 1 "general_reg_operand"))
>    (set (match_operand:V2DI 2 "sse_reg_operand")
>         (vec_concat:V2DI (match_operand:DI 3 "sse_reg_operand")
>                          (match_operand:DI 4 "nonimmediate_gr_operand")))]
>   ""
>   [(set (match_dup 0)
>         (match_dup 1))
>    (set (match_dup 5)
>         (match_dup 4))
>    (set (match_dup 2)
>         (vec_concat:V2DI (match_dup 3)
>                          (match_dup 5)))])

Ah, I messed up the operands.  The following works (the above position of
match_scratch happily chooses an operand matching operand 0):

;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transition.
(define_peephole2
  [(set (match_operand:DI 0 "sse_reg_operand")
        (match_operand:DI 1 "general_reg_operand"))
   (match_scratch:DI 2 "Yv")
   (set (match_operand:V2DI 3 "sse_reg_operand")
        (vec_concat:V2DI (match_dup 0)
                         (match_operand:DI 4 "nonimmediate_gr_operand")))]
  "reload_completed && optimize_insn_for_speed_p ()"
  [(set (match_dup 0)
        (match_dup 1))
   (set (match_dup 2)
        (match_dup 4))
   (set (match_dup 3)
        (vec_concat:V2DI (match_dup 0)
                         (match_dup 2)))])

but for some reason it again doesn't work for the important loop.  There we
have

  389: xmm0:DI=cx:DI
      REG_DEAD cx:DI
  390: dx:DI=[sp:DI+0x10]
   56: {dx:DI=dx:DI 0>>0x3f;clobber flags:CC;}
      REG_UNUSED flags:CC
   57: xmm0:V2DI=vec_concat(xmm0:DI,dx:DI)

I suppose the reason is that there are two unrelated insns between the
xmm0 = cx:DI move and the vec_concat.  Which would hint that we somehow need
to not match this GPR->XMM move in the peephole pattern but instead check it
somehow in the condition (can we use DF there?).

The simplified variant below works, but IMHO it matches cases we do not want
to transform.  I can't find any example of how to achieve that though.

;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transition.
(define_peephole2
  [(match_scratch:DI 3 "Yv")
   (set (match_operand:V2DI 0 "sse_reg_operand")
        (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
                         (match_operand:DI 2 "nonimmediate_gr_operand")))]
  "reload_completed && optimize_insn_for_speed_p ()"
  [(set (match_dup 3)
        (match_dup 2))
   (set (match_dup 0)
        (vec_concat:V2DI (match_dup 1)
                         (match_dup 3)))])
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #33 from Richard Biener ---
Created attachment 50308
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50308&action=edit
patch

I am testing the following.
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #35 from Richard Biener ---
(In reply to Richard Biener from comment #33)
> Created attachment 50308 [details]
> patch
>
> I am testing the following.

It FAILs

FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
vpinsrq[^\\n\\r]*\\ \\\$1[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
vpinsrq[^\\n\\r]*\$1[^\\n\\r]*%rsi[^\\n\\r]*%xmm16[^\\n\\r]*%xmm17
FAIL: gcc.target/i386/avx512vl-concatv2di-1.c scan-assembler
vmovhps[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19

I'll see how to update those next week.
[Bug tree-optimization/99407] s243 benchmark of TSVC is vectorized by clang and not by gcc, missed DSE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99407

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-08
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization
             Blocks|                            |53947
            Summary|s243 benchmark of TSVC is   |s243 benchmark of TSVC is
                   |vectorized by clang and not |vectorized by clang and not
                   |by gcc                      |by gcc, missed DSE
          Component|middle-end                  |tree-optimization
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Richard Biener ---
Hmm, I wonder why DSE didn't remove the first a[i] store.  Ah, because DSE
doesn't use data-ref analysis and thus cannot disambiguate the variable
offset.  Manually applying DSE produces

.L4:
        vmovaps c(%rax), %ymm1
        vaddps  e(%rax), %ymm1, %ymm0
        addq    $32, %rax
        vmovups a-28(%rax), %ymm1
        vmulps  d-32(%rax), %ymm1, %ymm1
        vmulps  d-32(%rax), %ymm0, %ymm0
        vaddps  b-32(%rax), %ymm0, %ymm0
        vmovaps %ymm0, b-32(%rax)
        vaddps  %ymm0, %ymm1, %ymm0
        vmovaps %ymm0, a-32(%rax)
        cmpq    $127968, %rax
        jne     .L4

manually DSEd loop:

  for (int nl = 0; nl < iterations; nl++)
    {
      for (int i = 0; i < LEN_1D-1; i++)
        {
          real_t tem = b[i] + c[i] * d[i];
          b[i] = tem + d[i] * e[i];
          a[i] = b[i] + a[i+1] * d[i];
        }
    }

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug middle-end/99408] s3251 benchmark of TSVC vectorized by clang runs about 7 times faster compared to gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99408

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener ---
Hum, GCC's code _looks_ faster.  Maybe it's our tendency to duplicate memory
accesses in vector instructions (there's a PR about this somewhere).  A load
uop on every stmt is likely the bottleneck here.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug tree-optimization/99409] s252 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Blocks|                            |53947
          Component|middle-end                  |tree-optimization

--- Comment #1 from Richard Biener ---
Yes, we can't do 'scalar expansion'.  We'd need some pre-pass to turn PHIs
into data accesses.  Here we want

  t[0] = (real_t) 0.;
  for (int i = 0; i < LEN_1D; i++)
    {
      s = b[i] * c[i];
      a[i] = s + t[i];
      t[i+1] = s;
    }

and then of course the trick is to elide the actual array and instead do
clever shuffling of vector registers.  IIRC one of the other TSVC examples
was similar.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug tree-optimization/99411] s311, s312, s31111, s31111, s3110, vsumr benchmark of TSVC is vectorized by clang better than by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
           Keywords|                            |missed-optimization
          Component|middle-end                  |tree-optimization

--- Comment #6 from Richard Biener ---
So clang uses a larger VF (unroll of the vectorized loop) here.  I think we
have another PR about this.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |tree-optimization
           Assignee|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
             Blocks|                            |53947
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
         Depends on|                            |97832
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2021-03-08

--- Comment #1 from Richard Biener ---
With -fno-tree-reassoc we detect the reduction chain and produce

.L3:
        vmovaps b(%rax), %ymm5
        vmovaps b+32(%rax), %ymm6
        addq    $160, %rax
        vfmadd231ps     a-160(%rax), %ymm5, %ymm1
        vmovaps b-96(%rax), %ymm7
        vfmadd231ps     a-128(%rax), %ymm6, %ymm0
        vmovaps b-64(%rax), %ymm5
        vmovaps b-32(%rax), %ymm6
        vfmadd231ps     a-96(%rax), %ymm7, %ymm2
        vfmadd231ps     a-64(%rax), %ymm5, %ymm3
        vfmadd231ps     a-32(%rax), %ymm6, %ymm4
        cmpq    $128000, %rax
        jne     .L3
        vaddps  %ymm1, %ymm0, %ymm0
        vaddps  %ymm2, %ymm0, %ymm0
        vaddps  %ymm3, %ymm0, %ymm0
        vaddps  %ymm4, %ymm0, %ymm0
        vextractf128    $0x1, %ymm0, %xmm1
        vaddps  %xmm0, %xmm1, %xmm1
        vmovhlps        %xmm1, %xmm1, %xmm0
        vaddps  %xmm1, %xmm0, %xmm0
        vshufps $85, %xmm0, %xmm0, %xmm1
        vaddps  %xmm0, %xmm1, %xmm0
        decl    %edx
        jne     .L2

We're not re-rolling and thus are forced to use a VF of 4 here.  Note that
LLVM doesn't seem to vectorize the loop but instead vectorizes the basic
block, which isn't what TSVC looks for (but that would work for
non-fast-math).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
[Bug 97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower
than -O3
[Bug tree-optimization/99414] s235 benchmark of TSVC is vectorized better by icc than gcc (loop interchange)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99414

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-03-08
     Ever confirmed|0                           |1
          Component|middle-end                  |tree-optimization
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener ---
linterchange says:

Consider loop interchange for loop_nest<1 - 3>

Access Strides for DRs:
  a[i_33]: <0, 4, 0>
  b[i_33]: <0, 4, 0>
  c[i_33]: <0, 4, 0>
  a[i_33]: <0, 4, 0>
  aa[_6][i_33]: <0, 4, 1024>
  bb[j_34][i_33]: <0, 4, 1024>
  aa[j_34][i_33]: <0, 4, 1024>

Loop(3) carried vars:
  Induction: j_34 = {1, 1}_3
  Induction: ivtmp_53 = {255, 4294967295}_3
Loop(2) carried vars:
  Induction: i_33 = {0, 1}_2
  Induction: ivtmp_51 = {256, 4294967295}_2

and then doesn't do anything.  I suppose the best thing to do here is to
first distribute the loop nest, but our cost modeling fuses the two obvious
candidates:

Fuse partitions because they have shared memory refs:
  Part 1: 0, 1, 2, 3, 4, 5, 6, 7, 19, 20, 21
  Part 2: 0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
          19, 20, 21

so this is a case that asks for better cost modeling there.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug tree-optimization/99415] s115 benchmark of TSVC is vectorized by icc and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99415

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |tree-optimization
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
             Blocks|                            |53947
   Last reconfirmed|                            |2021-03-08
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener ---
The benchmark is written badly, in a way that confuses our loop header
copying, it seems.  Writing

  for (int j = 0; j < LEN_2D-1; j++)
    {
      for (int i = j+1; i < LEN_2D; i++)
        {
          a[i] -= aa[j][i] * a[j];
        }
    }

fixes the vectorizing.  Possibly a mistake users make, so probably worth
investigating further.  Not sure how to most easily address this - we'd
like to peel the last iteration of the outer loop, noting it does nothing.
Maybe loop-splitting can figure this out?  Alternatively loop header copying
should just do its job...

Hmm, actually loop header copying does do its job, but then jump threading
messes this up again (the loop header check is redundant for all but the
last iteration of the outer loop).  So -fno-tree-dominator-opts fixes this
as well.  And for some reason ch_vect thinks the loops are all do-while
loops.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug tree-optimization/99416] s211 benchmark of TSVC is vectorized by icc and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99416

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
   Last reconfirmed|                            |2021-03-08
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
          Component|middle-end                  |tree-optimization
           Keywords|                            |missed-optimization
                 CC|                            |amker at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener ---
Confirmed.  ICC applies loop distribution, but again our cost modeling
doesn't want that to happen.  I suspect we want to detect extra incentives
there (make dependences "good", allow interchange, etc.).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug ipa/99419] possible missed optimization for dead code elimination
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99419

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
            Version|unknown                     |11.0
         Depends on|                            |80603
   Last reconfirmed|                            |2021-03-08
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener ---
A dup of, or at least depends on, PR80603.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80603
[Bug 80603] Optimize loads from constant arrays or aggregates with arrays
[Bug ipa/99428] possible missed optimization for dead code elimination
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99428

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |ipa
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2021-03-08
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |marxin at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
            Version|unknown                     |11.0

--- Comment #1 from Richard Biener ---
An IPA/GIMPLE "phase ordering" issue.  Alternatively, when 'b' is discovered
read-only its analysis would need to consider the initializer propagated and
thus eventually not address-taken, to make 'a' readonly as well ... (or
apply modref to tell that 'a' is not written to?)
[Bug c++/99445] [11 Regression] ICE in hashtab_chk_error, at hash-table.c:137 since r11-7011-g6e0a231a4aa2407b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99445

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
   Target Milestone|---                         |11.0
[Bug preprocessor/99446] [11 Regression] ICE in linemap_position_for_loc_and_offset, at libcpp/line-map.c:1005
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99446

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |11.0
[Bug lto/99447] [11 Regression] ICE (segfault) in lookup_page_table_entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99447

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |11.0

--- Comment #4 from Richard Biener ---
I also wonder when the GC was triggered, thus whether it's another case of a
live stmt / SSA name where we now forcefully free the CFG.
[Bug c++/99451] [plugin] cannot enable specific dump for plugin passes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99451

--- Comment #1 from Richard Biener ---
Yeah.
[Bug c++/99456] [11 regression] ABI breakage with some static initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99456

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Priority|P3                          |P1
[Bug debug/99457] gcc/gdb -gstabs+ is buggy.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99457

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |WORKSFORME

--- Comment #5 from Richard Biener ---
Works for me.
[Bug c++/99459] [11 Regression] Many coroutines regressions on armv7hl-linux-gnueabi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99459

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |arm
           Priority|P3                          |P1
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #36 from Richard Biener ---
(In reply to Richard Biener from comment #35)
> (In reply to Richard Biener from comment #33)
> > Created attachment 50308 [details]
> > patch
> >
> > I am testing the following.
>
> It FAILs
>
> FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
> vpinsrq[^\\n\\r]*\\ \\\$1[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19

That's exactly the case we're looking after: a V2DI concat from two GPRs.

> FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
> vpinsrq[^\\n\\r]*\$1[^\\n\\r]*%rsi[^\\n\\r]*%xmm16[^\\n\\r]*%xmm17

This is, like below, a MEM case.

> FAIL: gcc.target/i386/avx512vl-concatv2di-1.c scan-assembler
> vmovhps[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19

This one is because nonimmediate_gr_operand also matches a MEM; in this case
we apply the peephole to

(insn 12 11 13 2 (set (reg/v:V2DI 55 xmm19 [ c ])
        (vec_concat:V2DI (reg:DI 54 xmm18 [91])
            (mem:DI (reg/v/f:DI 4 si [orig:86 y ] [86]) [1 *y_8(D)+0 S8
A64])))

Latency-wise memory isn't any better than a GPR, so the decision to split is
reasonable.

> I'll see how to update those next week.

So I updated the above to check for vpunpcklqdq instead.
[Bug rtl-optimization/99462] New: Enhance scheduling to split instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462

            Bug ID: 99462
           Summary: Enhance scheduling to split instructions
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

Maybe the scheduler(s) can already do this (I have zero knowledge here).

For example the x86 vec_concatv2di insn has alternatives that cause the
instruction to be split into multiple uops (vpinsrq, movhpd) when the
'insert' operand is not XMM (but GPR or MEM).  We now have a peephole2 to
split such cases:

+;; Further split pinsrq variants of vec_concatv2di to hide the latency
+;; of the GPR->XMM transition(s).
+(define_peephole2
+  [(match_scratch:DI 3 "Yv")
+   (set (match_operand:V2DI 0 "sse_reg_operand")
+        (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
+                         (match_operand:DI 2 "nonimmediate_gr_operand")))]
+  "TARGET_64BIT && TARGET_SSE4_1
+   && !optimize_insn_for_size_p ()"
+  [(set (match_dup 3)
+        (match_dup 2))
+   (set (match_dup 0)
+        (vec_concat:V2DI (match_dup 1)
+                         (match_dup 3)))])

but in reality this is only profitable when we either can execute two "bad"
move uops in parallel (thus when originally composing two GPRs or two MEMs)
or when we can schedule one "bad" move much earlier.

Thus, can the scheduler already "split" an instruction - say, split away a
load uop and issue it early when a scratch register is available?  (The
reverse alternative is to not expose multi-uop insns before scheduling and
only merge them later - during scheduling?)  How does GCC deal with
situations like this?
[Bug rtl-optimization/99462] Enhance scheduling to split instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462 Richard Biener changed: What|Removed |Added Keywords||missed-optimization CC||amonakov at gcc dot gnu.org, ||law at gcc dot gnu.org --- Comment #1 from Richard Biener --- CCing scheduler maintainers
[Bug target/99461] [11 Regression] ICE in extract_constrain_insn, at recog.c:2670 since r11-7526-g9105757a59b89019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99461 Richard Biener changed: What|Removed |Added Priority|P3 |P1 --- Comment #1 from Richard Biener --- Looks like a dup.
[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #37 from Richard Biener ---
So my analysis was partly wrong: the vpinsrq isn't an issue for the
benchmark, only the spilling is.

Note that the other idea, disparaging vector CTORs more, as with

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 260f87b..f8caf8e7dff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21821,8 +21821,15 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case vec_construct:
         {
-          /* N element inserts into SSE vectors.  */
+          /* N-element inserts into SSE vectors.  */
           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+          /* We cannot insert from GPRs directly but there's always a
+             GPR->XMM uop involved.  Account for that.
+             ???  Note that loads are already costed separately so this
+             eventually double-counts them.  */
+          if (!fp)
+            cost += (TYPE_VECTOR_SUBPARTS (vectype)
+                     * ix86_cost->hard_register.integer_to_sse);
           /* One vinserti128 for combining two SSE vectors for AVX256.  */
           if (GET_MODE_BITSIZE (mode) == 256)
             cost += ix86_vec_cost (mode, ix86_cost->addss);

helps for generic and core-avx2 tuning:

t.c:10:3: note: Cost model analysis:
0x3858cd0 _6 1 times scalar_store costs 12 in body
0x3858cd0 _4 1 times scalar_store costs 12 in body
0x3858cd0 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3858cd0 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3858cd0 _15 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 _14 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3858cd0 0 times vec_perm costs 0 in body
0x3858cd0 _15 << 1 1 times vector_stmt costs 4 in body
0x3858cd0 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3858cd0 1 times vec_construct costs 20 in prologue
0x3858cd0 _6 1 times unaligned_store (misalign -1) costs 12 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3858cd0 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 44
t.c:10:3: missed: not vectorized: vectorization is not profitable.

but not for znver2:

t.c:10:3: note: Cost model analysis:
0x3703790 _6 1 times scalar_store costs 16 in body
0x3703790 _4 1 times scalar_store costs 16 in body
0x3703790 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3703790 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3703790 _15 << 1 1 times scalar_stmt costs 4 in body
0x3703790 _14 << 1 1 times scalar_stmt costs 4 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3703790 0 times vec_perm costs 0 in body
0x3703790 _15 << 1 1 times vector_stmt costs 4 in body
0x3703790 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3703790 1 times vec_construct costs 20 in prologue
0x3703790 _6 1 times unaligned_store (misalign -1) costs 16 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3703790 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 52
  Scalar cost: 52
t.c:10:3: note: Basic block will be vectorized using SLP

Apparently for znver{1,2,3} we choose a slightly higher load/store cost.
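The t.c kernel being costed in the dumps above is presumably of roughly the
following shape (a hypothetical reconstruction matching the dumped stmts,
not the actual file from the PR):

  /* XTS-style 128-bit shift-left-by-one with carry: one __int128 load
     decomposed into two 64-bit lanes (the BIT_FIELD_REFs), two shifts,
     two xors, and two 64-bit stores.  */
  void
  tweak (unsigned char *out, const unsigned char *in,
         unsigned long long carry)
  {
    unsigned __int128 x;
    __builtin_memcpy (&x, in, 16);
    unsigned long long lo = x, hi = x >> 64;
    unsigned long long r0 = (lo << 1) ^ carry;
    unsigned long long r1 = (hi << 1) ^ (lo >> 63);
    __builtin_memcpy (out, &r0, 8);
    __builtin_memcpy (out + 8, &r1, 8);
  }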
We could also try mitigating vectorization by decomposing the __int128 load
in forwprop, where we have

      else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
               && TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
               && gimple_assign_load_p (stmt)
               && !gimple_has_volatile_ops (stmt)
               && (TREE_CODE (gimple_assign_rhs1 (stmt))
                   != TARGET_MEM_REF)
               && !stmt_can_throw_internal (cfun, stmt))
        {
          /* Rewrite loads used only in BIT_FIELD_REF extractions to
             component-wise loads.  */

This was tailored to decompose, early, GCC vector-extension loads that are
not supported on the HW.  Here we have

  _9 = MEM <__int128 unsigned> [(char * {ref-all})in_8(D)];
  _14 = BIT_FIELD_REF <_9, 64, 64>;
  _15 = BIT_FIELD_REF <_9, 64, 0>;

where the HW doesn't have any __int128 GPRs.  If we do not vectorize then
the RTL pipeline will eventually split the load.  If vectorization is
profitable then the vectorizer should be able to vectorize the resulting
split loads as well.  In this case this would cause actual costing of the
load (the re-use of the __int128 to-be-in-SSE reg is instead free) and also
cost the live-lane extract for the retained integer code.  But that moves
the cost even more towards vectorizing, since now a vector load (cost 12)
plus two live-lane extracts (when fixed to cost sse_to_integer, that's
2 * 6) is used in place of two scalar loads (cost
[Bug tree-optimization/99473] redundant conditional zero-initialization not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99473 Richard Biener changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #2 from Richard Biener ---
g3 and g1 behave differently also because sinking happens in a way that the
pass's store-commoning code doesn't trigger on the sunk store (there's a dup
PR for this I can't find right now).  cselim doesn't trigger because of

  if ((TREE_CODE (lhs) != MEM_REF
       && TREE_CODE (lhs) != ARRAY_REF
       && TREE_CODE (lhs) != COMPONENT_REF)
      || !is_gimple_reg_type (TREE_TYPE (lhs)))
    return false;

lhs is a VAR_DECL, and 'nontrap' only tracks pointers.  There's code to
actually handle auto-vars now, but the above still disallows bare decls.
Because the address is taken, the transform will also require
-fallow-store-data-races.

diff --git a/gcc/tree-ssa-phiopt.c b/gcc/tree-ssa-phiopt.c
index ddd9d531b13..6f7efa29a1b 100644
--- a/gcc/tree-ssa-phiopt.c
+++ b/gcc/tree-ssa-phiopt.c
@@ -2490,9 +2490,8 @@ cond_store_replacement (basic_block middle_bb, basic_block join_bb,
   locus = gimple_location (assign);
   lhs = gimple_assign_lhs (assign);
   rhs = gimple_assign_rhs1 (assign);
-  if ((TREE_CODE (lhs) != MEM_REF
-       && TREE_CODE (lhs) != ARRAY_REF
-       && TREE_CODE (lhs) != COMPONENT_REF)
+  if ((!REFERENCE_CLASS_P (lhs)
+       && !DECL_P (lhs))
       || !is_gimple_reg_type (TREE_TYPE (lhs)))
     return false;

fixes g3 (with -fallow-store-data-races).  Queued for GCC 12.

g2 needs sinking/commoning of f (&x), for which there's yet another PR,
I think.
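For reference, the shape that the patched check accepts is roughly the
following (an assumed reconstruction of the PR's g3; the helper name is
made up):

  extern void f (int *);

  void
  g3 (int c)
  {
    int x;
    f (&x);     /* address taken, hence the data-race concern */
    if (c)
      x = 0;    /* conditional store to a bare VAR_DECL lhs   */
    f (&x);
  }

With -fallow-store-data-races, cond_store_replacement can then turn the
guarded store into an unconditional store of a PHI of the old and new value.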
[Bug tree-optimization/99475] [10/11 Regression] bogus -Warray-bounds accessing an array element of empty structs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99475 Richard Biener changed: What|Removed |Added Priority|P3 |P2 Target Milestone|--- |10.3
[Bug target/99487] [10 Regression] ICE during RTL pass: final in expand_function_start on hppa-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99487 Richard Biener changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |WAITING Last reconfirmed||2021-03-09 Keywords||build Target Milestone|--- |10.3 --- Comment #1 from Richard Biener --- That's during bootstrap, it seems? The backtrace is not very useful - it misses the caller. Are there more pieces scattered somewhere in the log?
[Bug tree-optimization/97104] [11 Regression] aarch64, SVE: ICE in vect_get_loop_mask since r11-3070-g783dc66f9cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97104 Richard Biener changed: What|Removed |Added CC||marxin at gcc dot gnu.org, ||rsandifo at gcc dot gnu.org Keywords||needs-bisection --- Comment #4 from Richard Biener --- Also gone latent (or fixed) now. I also know nothing about the loop masks code. Extracting a GIMPLE testcase from one of the broken revs might carry this over. We're vectorizing the last testcase with SVE now and not ICEing. So maybe it was really fixed - who knows (but I don't see any live stmts). Can somebody bisect what fixed this?
[Bug target/97513] [11 regression] aarch64 SVE regressions since r11-3822
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97513 Richard Biener changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2021-03-09 Status|UNCONFIRMED |WAITING --- Comment #5 from Richard Biener --- What's the current state of affairs?
[Bug target/99492] double and double _Complex not aligned consistently on AIX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99492 Richard Biener changed: What|Removed |Added Keywords||ABI --- Comment #1 from Richard Biener --- I assume GCC puts it at offset 8?
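The inconsistency presumably shows up in layouts of this shape (a sketch;
the PR's actual testcase isn't quoted here):

  /* Under AIX "power" alignment rules: does the double member land at
     the same offset after the int (4 vs. 8) as the double _Complex
     member does?  */
  struct S1 { int i; double d; };
  struct S2 { int i; double _Complex c; };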
[Bug other/99496] [11 regression] g++.dg/modules/xtreme-header-3_c.C ICEs after r11-7557
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99496 Richard Biener changed: What|Removed |Added Target Milestone|--- |11.0 Priority|P3 |P1
[Bug target/99497] _mm_min_ss/_mm_max_ss incorrect results when values known at compile time
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99497 --- Comment #3 from Richard Biener --- (In reply to Jakub Jelinek from comment #2) > And another question is if we without -ffast-math ever create > MIN_EXPR/MAX_EXPR and what exactly are the rules for those, if it is safe to > expand those into SMAX etc., or if those need to use UNSPECs too. We don't create them w/o -ffinite-math-only -fno-signed-zeros. We could of course do so eventually, if there's a compare sequence matching the relaxed requirements. But MIN/MAX_EXPR should always be safe to expand to smin/smax, given their semantics match.
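To illustrate why the raw SSE instructions need care here: minss/minsd
compute a < b ? a : b, so the result depends on operand order for NaNs and
signed zeros (a sketch of the instruction semantics, not of the intrinsic):

  static double
  sse_min (double a, double b)
  {
    return a < b ? a : b;   /* returns b whenever the compare is false */
  }

  /* sse_min (NAN, 1.0) == 1.0   but  sse_min (1.0, NAN) is NaN;
     sse_min (+0.0, -0.0) == -0.0 but  sse_min (-0.0, +0.0) == +0.0.  */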
[Bug testsuite/99498] [11 regression] new test case g++.dg/opt/pr99305.C in r11-7587 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99498 Richard Biener changed: What|Removed |Added Target Milestone|--- |11.0
[Bug c++/99500] [11 Regression] ICE: tree check: expected tree that contains 'decl minimal' structure, have 'error_mark' in cp_parser_requirement_parameter_list, at cp/parser.c:28828
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99500 Richard Biener changed: What|Removed |Added Priority|P3 |P4
[Bug tree-optimization/99504] Missing memmove detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99504 Richard Biener changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Keywords||missed-optimization Last reconfirmed||2021-03-10 --- Comment #1 from Richard Biener ---
The issue is that in the pixel case we have an aggregate assignment:

  [local count: 955630225]:
  # p_17 = PHI
  # q_18 = PHI
  # i_19 = PHI
  q_9 = q_18 + 4;
  p_10 = p_17 + 4;
  *p_17 = *q_18;
  i_12 = i_19 + 1;
  if (n_8(D) != i_12)
    goto ; [89.00%]
  else
    goto ; [11.00%]

and that's not handled by vectorization or dependence analysis.  We might
want to consider applying the same folding to this as we do for memcpy
folding and turn it into

  _42 = MEM [q_18, (pixel *)0];
  MEM [p_17, (pixel *)0] = _42;
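The GIMPLE above corresponds to source of roughly this shape (a hypothetical
reconstruction; the 4-byte element size matches the pointer increments):

  typedef struct { unsigned char r, g, b, a; } pixel;

  void
  copy (pixel *p, pixel *q, unsigned n)
  {
    for (unsigned i = 0; i < n; ++i)
      *p++ = *q++;   /* aggregate copy; p and q may overlap, so this is
                        a memmove that currently goes undetected */
  }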
[Bug fortran/99506] internal compiler error: in record_reference, at cgraphbuild.c:64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99506 --- Comment #4 from Richard Biener ---
This is a frontend issue: the FE produces an invalid static initializer for
'latt' (DECL_INITIAL):

{(real(kind=8)) latt100[(integer(kind=8)) i + -1] / 1.0e+2, (real(kind=8))
latt100[(integer(kind=8)) i + -1] / 1.0e+2, ... }

If this should be dynamic initialization, FEs are responsible for lowering
it.  I don't know Fortran well enough to say what 'parameter' means in this
context:

  real(double), parameter:: latt(jmax) = [(latt100(i)/100.d0, j=1,jmax)]

but the middle-end sees a readonly global (TREE_STATIC) variable.
[Bug c++/99508] [11 Regression] Asm labels declared inside a function are ignored
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99508 Richard Biener changed: What|Removed |Added Priority|P3 |P1 Known to work||10.2.1 Summary|Asm labels declared inside |[11 Regression] Asm labels |a function are ignored |declared inside a function ||are ignored Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Target Milestone|--- |11.0 Keywords||wrong-code Last reconfirmed||2021-03-10 Known to fail||11.0 --- Comment #1 from Richard Biener --- Confirmed. Works with the C frontend.
[Bug gcov-profile/99512] New: Add counter annotation to allow store-data-races to be introduced with -fprofile-update=single
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99512 Bug ID: 99512 Summary: Add counter annotation to allow store-data-races to be introduced with -fprofile-update=single Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: gcov-profile Assignee: unassigned at gcc dot gnu.org Reporter: rguenth at gcc dot gnu.org CC: marxin at gcc dot gnu.org Target Milestone: ---
PR64928 shows that LIM, when applying store-motion to profile counters, adds
code to avoid introducing store data races.  But with -fprofile-update=single
that's not required.

The proposal is to add a means to control -fallow-store-data-races at the
decl level (also for users) by adding an attribute, to abstract the
middle-end's global flag check into a decl-specific check that can look up
the attribute, and to have coverage.c mark the profile counters this way for
-fprofile-update=single.
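A sketch of what the annotation could look like; the attribute name and the
counter decl are made up here - the report only asks for "an attribute":

  /* Hypothetical attribute: a decl marked like this would get
     -fallow-store-data-races semantics regardless of the global flag,
     letting LIM store-motion drop the per-counter guard.  */
  long long __gcov0_foo[4]
    __attribute__ ((allow_store_data_races));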
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #36 from Richard Biener ---
So the issue is still the same - one thing I noticed is that store-motion
also adds a flag for each counter update to avoid introducing store data
races.  -fallow-store-data-races mitigates that part and speeds up the
compilation quite a bit.  In case there are threads involved you'd want
-fprofile-update=atomic, which causes store-motion to give up, and the
compile time is then fine overall.

The original trigger of the regression is likely the marking of the profile
counters as not aliased - we might want to introduce another flag to say
that store data races for a particular decl are not a concern (maybe even
have some user-visible attribute for this).

Otherwise re-confirmed (I stripped options down to -O -fPIC -fprofile-arcs
-ftest-coverage):

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-2.o1-fib-2.i
1.84user 0.05system 0:01.90elapsed 99%CPU (0avgtext+0avgdata 160764maxresident)k
0inputs+0outputs (0major+58129minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-3.o1-fib-3.i
10.15user 0.17system 0:10.32elapsed 99%CPU (0avgtext+0avgdata 726688maxresident)k
0inputs+0outputs (0major+265008minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-4.o1-fib-4.i
43.60user 1.06system 0:44.68elapsed 99%CPU (0avgtext+0avgdata 6107260maxresident)k
0inputs+0outputs (0major+1765217minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i
gcc: fatal error: Killed signal terminated program cc1
compilation terminated.
Command exited with non-zero status 1
143.09user 3.93system 2:28.29elapsed 99%CPU (0avgtext+0avgdata 24636148maxresident)k
37504inputs+0outputs (31major+6133278minor)pagefaults 0swaps

On the last, which runs out of memory, adding -fallow-store-data-races gives

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fallow-store-data-races
123.06user 0.45system 2:03.59elapsed 99%CPU (0avgtext+0avgdata 100maxresident)k
57304inputs+0outputs (68major+535127minor)pagefaults 0swaps

and -fprofile-update=atomic

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fprofile-update=atomic
0.61user 0.02system 0:00.63elapsed 100%CPU (0avgtext+0avgdata 73236maxresident)k
72inputs+0outputs (0major+18284minor)pagefaults 0swaps

and -fno-tree-loop-im

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fno-tree-loop-im
1.06user 0.01system 0:01.07elapsed 99%CPU (0avgtext+0avgdata 90672maxresident)k
0inputs+0outputs (0major+24331minor)pagefaults 0swaps

I still wonder whether you can produce an even smaller testcase where
visualizing the CFG is possible.  Unfortunately the source is mechanically
generated and following it is hard.  Like a testcase that retains the basic
structure but ends up with just a few (2, certainly fewer than 10) computed
gotos?
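For reference, the per-counter flag mentioned above comes from store-motion
transforming a conditional counter update roughly like this (a sketch, not
the exact GIMPLE):

  long long counter;   /* stands in for a profile counter */

  void
  f (int n, int cond)
  {
    for (int i = 0; i < n; ++i)
      if (cond)
        counter++;
  }

  /* Store motion rewrites the loop to approximately:

       tmp = counter; stored = false;
       for (i = 0; i < n; ++i)
         if (cond) { tmp++; stored = true; }
       if (stored) counter = tmp;

     With -fallow-store-data-races the 'stored' guard is dropped and
     the store-back is done unconditionally.  */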
[Bug gcov-profile/99512] Add counter annotation to allow store-data-races to be introduced with -fprofile-update=single
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99512 Richard Biener changed: What|Removed |Added Keywords||missed-optimization Blocks||64928 Severity|normal |enhancement Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 [Bug 64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
[Bug tree-optimization/99510] [11 Regression] Compile time hog in build_aligned_type since r11-7123-g63538886d1f7fc7c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99510 Richard Biener changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #1 from Richard Biener ---
Confirmed.

tree slp vectorization : 32.83 ( 84%) 0.05 ( 20%) 32.90 ( 83%) 62M ( 24%)

I suspect it's latent from before and caused by excessive vector-size
iteration.  I'll see what the real cause is here.
[Bug tree-optimization/99510] [11 Regression] Compile time hog in build_aligned_type since r11-7123-g63538886d1f7fc7c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99510 --- Comment #2 from Richard Biener ---
Ah, OK.  We're having a lot of vector CTORs we "vectorize" with load
permutations like { 484 506 }, and that runs into the pre-existing issue
(there's a PR about this ...) that we emit dead vector loads for all of the
elements in the group, including gaps.  Costing says they're even, which
possibly makes sense.  We do a build_aligned_type for each emitted stmt, and
for some reason it's quite costly here (well, there's the awkward linear
type-variant list to walk ...).  Caching should be possible, but the load
vectorization loop is already quite awkward.  Meh.

The rev. likely triggered this because we didn't cost the scalar root stmt
before (the CTOR itself that we replace).  Doing that made the costing
profitable.  Having equal scalar and vector load cost makes fixing this on
the costing side difficult - the vector load should be an epsilon more
expensive to avoid these issues.

Note for some reason we have a gazillion type variants here.  Huh.  ~36070
variants per type.  Ah.  And _that's_ because build_aligned_type does

  for (t = TYPE_MAIN_VARIANT (type); t; t = TYPE_NEXT_VARIANT (t))
    if (check_aligned_type (t, type, align))
      return t;

  t = build_variant_type_copy (type);
  SET_TYPE_ALIGN (t, align);
  TYPE_USER_ALIGN (t) = 1;

and check_aligned_type checks for an exact TYPE_USER_ALIGN match, but of
course if 'type' wasn't user-aligned originally it will never find the
created aligned type ...

Fixing that fixes the compile-time issue.
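The fix as described amounts to relaxing the variant lookup along these
lines (a sketch, not the exact committed patch):

  /* Accept an existing variant that carries the requested user
     alignment instead of requiring an exact TYPE_USER_ALIGN match with
     the (possibly non-user-aligned) input type, so repeated
     build_aligned_type calls find the variant created earlier.  */
  for (t = TYPE_MAIN_VARIANT (type); t; t = TYPE_NEXT_VARIANT (t))
    if (TYPE_USER_ALIGN (t)
        && TYPE_ALIGN (t) == align
        && check_base_type (t, type)
        && TYPE_QUALS (t) == TYPE_QUALS (type))
      return t;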
[Bug tree-optimization/99510] [11 Regression] Compile time hog in build_aligned_type since r11-7123-g63538886d1f7fc7c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99510 Richard Biener changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from Richard Biener --- Fixed.