[Bug sanitizer/97868] New: warn about using fences with TSAN
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97868 Bug ID: 97868 Summary: warn about using fences with TSAN Product: gcc Version: 10.2.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: sanitizer Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org, jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at gcc dot gnu.org Target Milestone: --- The thread sanitizer (-fsanitize=thread) does not handle C++ atomic_thread_fence. This is barely acknowledged in the documentation, but causes a number of users to waste a lot of time trying to understand how the reported race could occur. Ideally, it would be supported, but that seems hard. What does seem doable is adding a warning for programs using fences with tsan. I don't really care if it is a compile-time warning, or a runtime warning, and in the second case whether it appears as soon as a fence is executed or only as a note after reported races, as long as I get a hint about it.
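For illustration, here is a minimal sketch (not taken from the PR) of the kind of program that confuses users: the release-fence/relaxed-store plus relaxed-load/acquire-fence pairing is valid synchronization per the C++ memory model, yet tsan ignores the fences and reports a race on 'data'.

    #include <atomic>
    #include <thread>

    int data;
    std::atomic<bool> ready{false};

    int main() {
      std::thread t([] {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
      });
      while (!ready.load(std::memory_order_relaxed)) { }
      std::atomic_thread_fence(std::memory_order_acquire);
      int x = data;  // tsan reports a data race here despite the fences
      t.join();
      return x;
    }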
[Bug target/98167] [x86] Failure to optimize operation on identically shuffled operands into a shuffle of the result of the operation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167 --- Comment #8 from Marc Glisse --- (In reply to Richard Biener from comment #4) > We already handle IX86_BUILTIN_SHUFPD there but not IX86_BUILTIN_SHUFPS for > some reason. https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html I was checking with just one builtin if this was the right approach, and never got to extend it to others, sorry. Handling shufps in a similar way seems good to me, if anyone has time to do it.
[Bug tree-optimization/97085] [11 Regression] aarch64, SVE: ICE in gimple_expand_vec_cond_expr since r11-2610-ga1ee6d507b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97085 --- Comment #6 from Marc Glisse --- (In reply to Richard Biener from comment #5) > (In reply to Marc Glisse from comment #4) > > I would be happy with a revert of that patch, if the ARM backend gets fixed, > > but indeed a missed optimization should not cause an ICE. > > Not sure what the ARM backend issue is. PR 96528 > Well, VEC_COND_EXPR (as some other VEC_ tree codes) are special in that > we are (try to...) be careful to only synthesize ones supported "directly" > by the target. After vector lowering, yes. But before that, the front-end can produce vec_cond_expr for vector types that are not supported. Ah, you probably meant synthesize them from optimization passes, ok. > For the mask vectors (VECTOR_BOOLEAN_TYPE_P, in the > AVX512/SVE case) I don't think the targets support ?: natively but they > have bitwise instructions for this case. That means we could 'simply' > expand mask x ? y : z as (y & x) | (z & ~x) I guess [requires x and y,z > to be of the same type of course]. I wondered whether we ever > need to translate between, say, vector and vector > where lowering ?: this way would require '[un]packing' one of the vectors. I still need to go back to the introduction of those types to understand why vector exists at all... > True, unless you go to bitwise ops. For scalar flag ? bool_a : bool_b > ?: isn't the natural representation either - at least I'm not aware > of any pattern transforming (a & b) | (c & ~b) to b ? a : c for > precision one integer types ;) There are PRs asking for this transformation (and for transformations that this one would enable).
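A sketch of the "(y & x) | (z & ~x)" expansion mentioned above, written with generic vector extensions; it assumes x already holds all-zeros/all-ones lane masks of the same width as y and z (only an illustration, not the proposed implementation):

    typedef int v4si __attribute__((vector_size(16)));

    /* Select y where the mask lane is -1 and z where it is 0. */
    v4si mask_select(v4si x, v4si y, v4si z)
    {
      return (y & x) | (z & ~x);
    }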
[Bug c++/98556] [8/9/10/11 Regression] ICE: 'verify_gimple' failed since r8-4821-g1af4ebf5985ef2aa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98556 --- Comment #4 from Marc Glisse --- The result of the subtraction is supposed to be an integer type, and is instead an enum based on that underlying type? Maybe the verification code needs tweaking to allow that.
[Bug target/98698] New: atomic load to FPU registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98698 Bug ID: 98698 Summary: atomic load to FPU registers Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* #include <atomic> std::atomic<double> a; double f(){ return a.load(std::memory_order_relaxed); } is compiled by g++ to movq a(%rip), %rax movq %rax, %xmm0 ret As far as I understand, a direct movsd to xmm0 would still be atomic, and that's indeed what llvm outputs.
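For reference, the expected output (roughly what llvm emits for this testcase) would be something like:

    movsd a(%rip), %xmm0
    ret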
[Bug d/98607] GDC merging computations but rounding mode has changed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98607 --- Comment #9 from Marc Glisse --- Since I doubt gdc handles rounding modes correctly for scalars, I think you can ignore this issue in the implementation of the vector intrinsics for now (same as we do in C and C++). Note that gcc isn't alone here, llvm doesn't implement pragma fenv_access either, and even visual studio, which does implement it for scalars, fails for vectors. I did not test with Intel's compiler.
[Bug middle-end/98709] gcc optimizes bitwise operations, but doesn't optimize logical ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98709 --- Comment #1 from Marc Glisse --- At the end of gimple, we have _6 = a_3(D) ^ b_4(D); _1 = ~_6; _2 = a_3(D) == b_4(D); _7 = _1 & _2; I guess we are missing a simplification of ~(a^b) to a==b for bool (similar to ~(a!=b), but we canonicalize != to ^).
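A guess (not necessarily the exact testcase of this PR) at the kind of source that produces the GIMPLE above:

    bool f(bool a, bool b)
    {
      return !(a ^ b) && (a == b);  // both subexpressions are equivalent to a == b
    }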
[Bug tree-optimization/60770] disappearing clobbers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60770 --- Comment #14 from Marc Glisse --- (In reply to Orgad Shaneh from comment #13) > The case described in comment 1 doesn't issue a warning with GCC 10. It does for me with -Wall -O (you need at least some optimization). If there is still a problem, you need to open a new issue.
[Bug target/98962] New: Perform bitops on floats directly with SSE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98962 Bug ID: 98962 Summary: Perform bitops on floats directly with SSE Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* (from https://stackoverflow.com/q/66023408/1918193 ) float f(float a){ unsigned ai; __builtin_memcpy(&ai, &a, 4); unsigned ri = ai ^ (1U << 31); float r; __builtin_memcpy(&r, &ri, 4); return r; } results in movd %xmm0, %eax addl $-2147483648, %eax movd %eax, %xmm0 while llvm simplifies it to xorps .LCPI0_0(%rip), %xmm0
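For comparison, plain negation of a float flips the same sign bit and is already expanded through the SSE xor path, which suggests the desired pattern exists in the backend (illustration only):

    float g(float a)
    {
      return -a;  /* becomes an xorps with a sign-mask constant, e.g. xorps .LC0(%rip), %xmm0 */
    }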
[Bug tree-optimization/99046] New: [[gnu::const]] function needs noexcept to be recognized as loop invariant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99046 Bug ID: 99046 Summary: [[gnu::const]] function needs noexcept to be recognized as loop invariant Product: gcc Version: 10.2.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- (from https://stackoverflow.com/q/66100945/1918193) double x[1000] = {}; [[gnu::const]] double* g(double* var); void f() { for (int i = 1; i < 1000; i++) { g(x)[i] = (g(x)[i-1] + 1.0) * 1.001; } } g++ -O3 eliminates half of the calls to g, but fails to move it to a single call before the loop, while llvm does just that. Gcc does manage it if I mark f as noexcept or nothrow. Whether const functions may throw seems debatable, but if they do throw, I expect them to do so consistently, and since the loop has at least one iteration and starts with this call, the transformation seems safe to me.
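The transformation the report asks for corresponds to hoisting the call by hand, which is effectively what llvm produces:

    void f_hoisted() {  // hand-written equivalent of the expected optimization
      double* p = g(x);
      for (int i = 1; i < 1000; i++)
        p[i] = (p[i-1] + 1.0) * 1.001;
    }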
[Bug c++/100322] Switching from std=c++17 to std=c++20 causes performance regression in relationals
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100322 --- Comment #7 from Marc Glisse --- PR94589 then.
[Bug tree-optimization/94589] Optimize (i<=>0)>0 to i>0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94589 --- Comment #7 from Marc Glisse --- Some key steps in the optimization: PRE turns PHI<-1,0,1> > 0 into PHI<0,0,1>; reassoc then combines the operations (it didn't in gcc-10); forwprop+phiopt cleans up (i>0)!=0?1:0 into just i>0. Having to wait until phiopt4 to get the simplified form is still very long, and most likely causes missed optimizations in earlier passes. But nice progress!
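For context, the kind of testcase behind this PR, per the summary (the exact reduced testcase may differ slightly; needs -std=c++20):

    #include <compare>
    bool f(int i)
    {
      return (i <=> 0) > 0;  // should simplify to i > 0
    }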
[Bug tree-optimization/94589] Optimize (i<=>0)>0 to i>0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94589 --- Comment #8 from Marc Glisse --- PR96480 would be my guess.
[Bug tree-optimization/94589] Optimize (i<=>0)>0 to i>0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94589 --- Comment #12 from Marc Glisse --- (In reply to rguent...@suse.de from comment #11) > For PR7 > I have prototyped a forwprop patch to try constant folding > stmts with all-constant PHIs, thus in this case c$_M_value_2 > 0, > when there's only a single use of it Maybe we could handle any case where trying to fold the single use (counting x*x as a single use of x) with each possible value satisfies is_gimple_val (or whatever the condition is to be allowed in a PHI, and without introducing a use of a ssa_name before it is defined), so that things like PHI & X would simplify. But the constant case is indeed the most important, and should allow the optimization in this PR before the vectorizer using reassoc1.
[Bug tree-optimization/100366] spurious warning - std::vector::clear followed by std::vector::insert(vec.end(), ...) with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100366 Marc Glisse changed: What|Removed |Added Last reconfirmed||2021-05-02 Ever confirmed|0 |1 Component|c++ |tree-optimization Keywords||diagnostic, ||missed-optimization Status|UNCONFIRMED |NEW --- Comment #1 from Marc Glisse --- Assuming the warning happens during the strlen pass, we are still missing a lot of optimizations at that point if (_6 != _7) goto ; [70.00%] else goto ; [30.00%] [local count: 322122544]: _158 = _7 - _6; once VRP2 (2 passes after strlen) replaces _158 with 0 and propagates it, maybe the code becomes nice enough to avoid confusing this fragile warning (I didn't check). Before FRE3, we have _6 = vec_2(D)->D.33506._M_impl.D.32819._M_start; _7 = vec_2(D)->D.33506._M_impl.D.32819._M_finish; if (_6 != _7) goto ; [70.00%] else goto ; [30.00%] [local count: 1073741824]: _5 = MEM[(char * const &)vec_2(D) + 8]; MEM[(struct __normal_iterator *)&D.33862] ={v} {CLOBBER}; MEM[(struct __normal_iterator *)&D.33862]._M_current = _5; __position = D.33862; _12 = MEM[(const char * const &)vec_2(D)]; _13 = MEM[(const char * const &)&__position]; _14 = _13 - _12; and after FRE3 [local count: 1073741824]: _5 = MEM[(char * const &)vec_2(D) + 8]; MEM[(struct __normal_iterator *)&D.33862] ={v} {CLOBBER}; MEM[(struct __normal_iterator *)&D.33862]._M_current = _5; __position = D.33862; _14 = _5 - _6; Only PRE manages to notice that _5 is the same as _7, which is already late. And it then takes until VRP2 to realize that _7 - _6 must be 0 in the else branch of _6 != _7. * I am not sure why FRE manages to optimize _12 and not _5, that seems like the first thing to check (maybe the +8 means it is obviously "partial") * I don't know if some other pass than VRP could learn that b-a is 0 if not a!=b.
[Bug tree-optimization/100366] spurious warning - std::vector::clear followed by std::vector::insert(vec.end(), ...) with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100366 --- Comment #5 from Marc Glisse --- (In reply to Martin Sebor from comment #2) > The IL looks like the warning is justified: The memcpy call is dead code, we just fail to notice it. >[local count: 230225493]: > # prephitmp_42 = PHI <_6(4), _7(3)> This is always _6, because in bb 3 we have _6 == _7. > pretmp_67 = vec_2(D)->D.33449._M_impl.D.32762._M_start; > _69 = prephitmp_42 - pretmp_67; Always 0. >[local count: 220460391]: > MEM [(char * {ref-all})_155] = pretmp_72; > _50 = vec_2(D)->D.33449._M_impl.D.32762._M_finish; > _Num_51 = _50 - prephitmp_42; Always 0, in bb 4 we copy _M_start in _M_finish if they are not already equal. (sorry for the wrong FRE comment earlier) Note that if I replace operator new/delete with malloc/free inline void* operator new(std::size_t n){return __builtin_malloc(n);} inline void operator delete(void*p)noexcept{__builtin_free(p);} inline void operator delete(void*p,std::size_t)noexcept{__builtin_free(p);} we optimize quite a bit more and the warning disappears.
[Bug tree-optimization/100366] spurious warning - std::vector::clear followed by std::vector::insert(vec.end(), ...) with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100366 --- Comment #6 from Marc Glisse --- So, apart from the small missed PHI optimization, this is probably the common issue that since operator new is replaceable, we can't really assume that it does not clobber anything, and that hurts optimizations :-( Not sure if there would be any convenient workaround for this specific case.
[Bug tree-optimization/100366] spurious warning - std::vector::clear followed by std::vector::insert(vec.end(), ...) with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100366 --- Comment #7 from Marc Glisse --- It seems to help if we save the values before the allocation in vector.tcc, although I cannot promise it won't pessimize something else... And that's just a workaround, not a solution. @@ -766,13 +766,16 @@ { const size_type __len = _M_check_len(__n, "vector::_M_range_insert"); + pointer __old_start(this->_M_impl._M_start); + pointer __old_finish(this->_M_impl._M_finish); + pointer __old_end_of_storage(this->_M_impl._M_end_of_storage); pointer __new_start(this->_M_allocate(__len)); pointer __new_finish(__new_start); __try { __new_finish = std::__uninitialized_move_if_noexcept_a - (this->_M_impl._M_start, __position.base(), + (__old_start, __position.base(), __new_start, _M_get_Tp_allocator()); __new_finish = std::__uninitialized_copy_a(__first, __last, @@ -780,7 +783,7 @@ _M_get_Tp_allocator()); __new_finish = std::__uninitialized_move_if_noexcept_a - (__position.base(), this->_M_impl._M_finish, + (__position.base(), __old_finish, __new_finish, _M_get_Tp_allocator()); } __catch(...) @@ -790,12 +793,12 @@ _M_deallocate(__new_start, __len); __throw_exception_again; } - std::_Destroy(this->_M_impl._M_start, this->_M_impl._M_finish, + std::_Destroy(__old_start, __old_finish, _M_get_Tp_allocator()); _GLIBCXX_ASAN_ANNOTATE_REINIT; - _M_deallocate(this->_M_impl._M_start, - this->_M_impl._M_end_of_storage - - this->_M_impl._M_start); + _M_deallocate(__old_start, + __old_end_of_storage + - __old_start); this->_M_impl._M_start = __new_start; this->_M_impl._M_finish = __new_finish; this->_M_impl._M_end_of_storage = __new_start + __len;
[Bug c++/100746] NRVO should not introduce aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100746 --- Comment #1 from Marc Glisse --- PR 80740 ?
[Bug c++/63164] unnecessary calls to __dynamic_cast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63164 Marc Glisse changed: What|Removed |Added Last reconfirmed||2021-05-26 Status|UNCONFIRMED |NEW Ever confirmed|0 |1 --- Comment #2 from Marc Glisse --- I was going to file exactly the same RFE for dynamic_cast and final types (preferably it should also work if 'final' is only detected by LTO, but that shouldn't block an easier front-end patch), so confirmed.
[Bug target/100784] ICE: Segmentation fault, contains_struct_check(tree_node*, tree_node_structure_enum, char const*, int, char const*)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100784 --- Comment #2 from Marc Glisse --- Do we need to punt if there is no lhs? (with optimization the call should be removed as pure) I probably won't have time to try it for a while.
[Bug c++/100929] gcc fails to optimize less to min for SIMD code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100929 --- Comment #1 from Marc Glisse --- Please attach your testcases to the bug report. godbolt links are nice complements, but not considered sufficient here. We don't lower the comparison or the blend in GIMPLE (yet). I think Hongtao Liu is doing blends right now. I don't know if there would be issues for comparisons (with -ftrapping-math for instance?). If you write (x
[Bug target/100929] gcc fails to optimize less to min for SIMD code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100929 Marc Glisse changed: What|Removed |Added Version|og10 (devel/omp/gcc-10) |11.1.0 Keywords||missed-optimization Component|c++ |target Severity|normal |enhancement Target||x86_64-*-*
[Bug rtl-optimization/95405] Unnecessary stores with std::optional
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95405 --- Comment #3 from Marc Glisse --- For a self-contained version, see below. Notice how the extra constructor in _Optional_payload_base changes the generated code, as does storing a _Optional_payload_base directly instead of _Optional_payload in optional: struct _Optional_payload_base { long _M_value; bool _M_engaged = false; _Optional_payload_base() = default; ~_Optional_payload_base() = default; _Optional_payload_base(const _Optional_payload_base&) = default; _Optional_payload_base(_Optional_payload_base&&) = default; _Optional_payload_base(double,float); }; struct _Optional_payload : _Optional_payload_base { }; struct optional { _Optional_payload _M_payload; }; optional foo(); long bar() { auto r = foo(); if (r._M_payload._M_engaged) return r._M_payload._M_value; else return 0L; }
[Bug rtl-optimization/95405] Unnecessary stores with std::optional
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95405 --- Comment #5 from Marc Glisse --- GIMPLE doesn't know about calling conventions, that's something that only "appears" during expansion to RTL. Still, I don't claim to understand what is going on here.
[Bug target/100929] gcc fails to optimize less to min for SIMD code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100929 --- Comment #4 from Marc Glisse --- (In reply to Denis Yaroshevskiy from comment #3) > Is what @Andrew Pinski copied enough? I think so (it is missing the command line), although one example with an integer type could also help in case floats turn out to have a different issue. > -ftrapping-math causes clang to stop doing this optimisation. Note that -ftrapping-math is on by default with gcc (PR 54192), but -fno-trapping-math wouldn't solve your problem, we are missing other things.
[Bug middle-end/54400] recognize vector reductions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54400 --- Comment #8 from Marc Glisse --- (In reply to Richard Biener from comment #7) > (note avoiding hadd in the reduc pattern was intended). Indeed. Except with -Os, or if a processor with a fast hadd appears, vectorising this doesn't bring anything. It doesn't hurt either though.
[Bug middle-end/101063] #pragma STDC FENV_ACCESS ON: wrong code generation: instructions leading to side effects may not be generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101063 --- Comment #1 from Marc Glisse --- > Note 1: Under -Wall gcc generates warning: > :5: warning: ignoring '#pragma STDC FENV_ACCESS' [-Wunknown-pragmas] That seems like a huge hint, this is not implemented in gcc. You can find several existing PR in this bugzilla. There is a branch refs/users/glisse/heads/fenv that was kind of functional last time I tried, but I'll never have time to polish it.
[Bug target/102783] [powerpc] FPSCR manipulations cannot be relied upon
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102783 --- Comment #12 from Marc Glisse --- (In reply to Marc Glisse from comment #11) > Since I had forgotten where it was, let me write here that it is git branch > /users/glisse/fenv Since it became impossible (hooks) to push to that branch a while ago, I should post somewhere the FIXME file I couldn't push last year: Looking at LLVM, I notice that my design in the gcc fenv branch seems to be missing a fundamental piece: it has nothing preventing "normal" operations from outside from migrating towards the protected region, where they may end up using an unexpected rounding mode (unprotected doesn't mean any rounding mode, it means the default one), or setting flags that we will observe. One idea to prevent this would be to make sure that there are no normal FP operations in functions that have protected operations (does that mean we should mark functions? Just checking if there is a protected FP op doesn't work if we call a function that does the op). This means that we should turn all FP operations of the function into protected ones (possibly with more relaxed flags if they are not in the protected region), and we should also do that whenever inlining mixed functions. And cross my fingers that the compiler doesn't start using FP ops out of thin air. Would that be sufficient?
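An illustration (hypothetical user code, not from the branch) of the hazard described above: nothing currently marks the second addition as different from the first, so a "normal" FP operation can be CSEd or hoisted into a region running under a non-default rounding mode.

    #include <fenv.h>

    double f(double a, double b)
    {
      double x = a + b;            /* default rounding mode */
      fesetround(FE_UPWARD);
      double y = a + b;            /* must not be reused from x */
      fesetround(FE_TONEAREST);
      return y - x;                /* >= 0, and not always 0 */
    }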
[Bug tree-optimization/104675] [9/10/11/12 Regression] ICE: in expand_expr_real_2, at expr.cc:9773 at -O with __real__ + __imag__ extraction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104675 --- Comment #6 from Marc Glisse --- I am only learning now that bit ops don't exist for complex numbers :-/ I don't really see why not, but that's a different question. Thanks for fixing this. Looking to see if I could quickly find other similar issues, I only noticed 2 ICEs typedef _Complex unsigned T; T f(T x){ return (x/2)*2; } T g(T x){ return (x*2)/2; }
[Bug tree-optimization/105053] New: Wrong loop count for scalar code from vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105053 Bug ID: 105053 Summary: Wrong loop count for scalar code from vectorizer Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- #include #include #include #include #include int main(){ const long n = 1; std::vector> vec; vec.reserve(n); std::random_device rd; std::default_random_engine re(rd()); std::uniform_int_distribution rand_int; std::uniform_real_distribution rand_dbl; for(int i=0;i(vec[i]),std::get<1>(vec[i]))); std::cout << sup << '\n'; } { int sup = 0; for(int i=0;i(vec[i])),std::get<1>(vec[i])); std::cout << sup << '\n'; } } Can output for instance 2147483645 2147483637 compiled with -O3, whereas the 2 numbers should be the same. If I compare what I get from the first loop with -O3 -fno-tree-loop-vectorize to the second loop with just -O3, the code is almost identical, except that the (scalar) code only iterates on 1/4 of the array, as if it was using a bound meant for a vector. -fno-tree-loop-vectorize seems to be ok.
[Bug tree-optimization/105053] Wrong loop count for scalar code from vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105053 --- Comment #8 from Marc Glisse --- Thank you. I originally noticed the problem with 11.2.0-18 (Debian), so I believe this will be needed on that branch as well. 10.3.0 looked ok...
[Bug tree-optimization/105062] New: Suboptimal vectorization for reduction with several elements
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105062 Bug ID: 105062 Summary: Suboptimal vectorization for reduction with several elements Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- The testcase is essentially the same as in PR105053, but here this is about performance, not correctness. #include #include #include #include #include #include int main(){ const long n = 1; std::vector> vec; vec.reserve(n); std::random_device rd; std::default_random_engine re(rd()); std::uniform_int_distribution rand_int; std::uniform_real_distribution rand_dbl; for(int i=0;i(vec[i]),std::get<1>(vec[i]))); volatile int noopt0 = sup; } #else { int sup = 0; for(int i=0;i(vec[i])),std::get<1>(vec[i])); volatile int noopt1 = sup; } #endif auto finish = std::chrono::system_clock::now(); std::cout << std::chrono::duration_cast(finish - start).count() << '\n'; } I compile with -O3 -march=skylake (originally noticed with -march=native on a i7-10875H CPU). The second loop runs in about 60ms, while the first (compiling with -DSLOW) runs in 80ms. The generated asm also looks very different. For the fast code, the core loop is .L64: vmovdqu (%rax), %ymm3 addq$64, %rax vpunpckldq -32(%rax), %ymm3, %ymm0 vpermd %ymm0, %ymm2, %ymm0 vpmaxsd %ymm0, %ymm1, %ymm1 cmpq%rdx, %rax jne .L64 which looks nice and compact (well, I think we could do without the vpermd, but it is already great). Now for the slow code, we have .L64: vmovdqu (%rax), %ymm0 vmovdqu 32(%rax), %ymm10 vmovdqu 64(%rax), %ymm2 vmovdqu 96(%rax), %ymm9 vpermd %ymm10, %ymm6, %ymm8 vpermd %ymm0, %ymm7, %ymm1 vpblendd$240, %ymm8, %ymm1, %ymm1 vpermd %ymm9, %ymm6, %ymm11 vpermd %ymm2, %ymm7, %ymm8 vpermd %ymm0, %ymm4, %ymm0 vpermd %ymm10, %ymm3, %ymm10 vpermd %ymm2, %ymm4, %ymm2 vpermd %ymm9, %ymm3, %ymm9 vpblendd$240, %ymm11, %ymm8, %ymm8 vpblendd$240, %ymm10, %ymm0, %ymm0 vpblendd$240, %ymm9, %ymm2, %ymm2 vpermd %ymm1, %ymm4, %ymm1 vpermd %ymm8, %ymm3, %ymm8 vpermd %ymm0, %ymm4, %ymm0 vpermd %ymm2, %ymm3, %ymm2 vpblendd$240, %ymm8, %ymm1, %ymm1 vpblendd$240, %ymm2, %ymm0, %ymm0 vpmaxsd %ymm0, %ymm1, %ymm1 subq$-128, %rax vpmaxsd %ymm1, %ymm5, %ymm5 cmpq%rdx, %rax jne .L64 It is unrolled once more than the fast code and contains an excessive amount of shuffling. If I understand correctly, it vectorizes a reduction with MAX_EXPR on "sup" but does not consider the operation max(get<0>,get<1>) as being part of this reduction, so it generates code that would make sense if I used 2 different operations like sup=std::max(sup,std::get<0>(vec[i])+std::get<1>(vec[i])) instead of both being the same MAX_EXPR. Maybe, when we discover a reduction, we could check if the elements are themselves computed with the same operation as the reduction and in that case try to make it a "bigger" reduction?
[Bug tree-optimization/105062] Suboptimal vectorization for reduction with several elements
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105062 --- Comment #2 from Marc Glisse --- (In reply to Richard Biener from comment #1) > But since not all of the std::max are recognized as > MAX_EXPR but some only after loop if-conversion Ah, I hadn't noticed. I tried replacing std::max with a simpler by-value version so we get MAX_EXPR already in early inline, but that didn't help. Actually, it made things worse: #include #include #include #include #include #include int my_max(int a, int b){ return (a> vec; vec.reserve(n); std::random_device rd; std::default_random_engine re(rd()); std::uniform_int_distribution rand_int; std::uniform_real_distribution rand_dbl; for(int i=0;i(vec[i]),std::get<1>(vec[i]))); volatile int noopt0 = sup; } #else { int sup = 0; for(int i=0;i(vec[i])),std::get<1>(vec[i])); volatile int noopt1 = sup; } #endif auto finish = std::chrono::system_clock::now(); std::cout << std::chrono::duration_cast(finish - start).count() << '\n'; } Now reassoc1 turns the fast code into the slow code before the vectorizer can detect the reduction chain :-(
[Bug target/100929] gcc fails to optimize less to min for SIMD code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100929 --- Comment #6 from Marc Glisse --- (blend is now lowered in gimple) For the integer case, the mix of vector(int) and vector(char) obfuscates things a bit, we have __m256i if_else_int (__m256i x, __m256i y) { vector(32) char _4; vector(32) char _5; vector(32) char _6; vector(32) _7; vector(32) char _8; vector(4) long long int _9; vector(8) int _10; vector(8) int _11; vector(8) _12; vector(8) int _13; [local count: 1073741824]: _10 = VIEW_CONVERT_EXPR(x_2(D)); _11 = VIEW_CONVERT_EXPR(y_3(D)); _12 = _10 > _11; _13 = VEC_COND_EXPR <_12, { -1, -1, -1, -1, -1, -1, -1, -1 }, { 0, 0, 0, 0, 0, 0, 0, 0 }>; _5 = VIEW_CONVERT_EXPR(_13); _4 = VIEW_CONVERT_EXPR(y_3(D)); _6 = VIEW_CONVERT_EXPR(x_2(D)); _7 = _5 < { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; _8 = VEC_COND_EXPR <_7, _4, _6>; _9 = VIEW_CONVERT_EXPR<__m256i>(_8); return _9; } A first step would be to teach gcc that it can do a VEC_COND_EXPR<_12, _11, _10> with fewer VIEW_CONVERT_EXPR (maybe follow the definition chain of the condition through trivial ops like <0, view_convert or ?-1:0 until we find a real comparison _10 > _11, to determine the right size?). Other steps: * Move (or at least partially copy) fold_cond_expr_with_comparison to match.pd so we can recognize min/max. * Lower __builtin_ia32_cmpps256 (y_2(D), x_3(D), 17) to GIMPLE for the float case, if that's a valid thing to do (NaN, etc).
[Bug libstdc++/105308] New: Specialize for_each
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105308 Bug ID: 105308 Summary: Specialize for_each Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Hello, with a balanced binary tree, as used for instance in std::set or std::map, it is relatively easy to perform an operation in parallel on all elements (like for_each): recurse on the 2 subtrees in parallel (and probably assign the top node to one of the subtrees arbitrarily). Of course there are technical details: we don't store the size of subtrees, so we may want to decide in advance how deep to switch to sequential, etc. Doing this requires accessing details of the tree implementation and cannot be done by a user (plus, for_each doesn't seem to be a customization point...). I am still confused that we have the traditional for_each, the new for_each with execution policy, the new range for_each, but no mixed range + execution policy. This specialization would be easier to implement for a whole tree than for an arbitrary subrange. It is still possible there, but likely less balanced, and we may need a first pass to find the common ancestor and possibly other relevant information (or check if the range is the whole container if that's possible and only optimize that case). Possibly some other containers could specialize for_each, although it isn't as obvious. Actually, even the sequential for_each could benefit from a specialization for various containers. Recursing on subtrees is a bit cheaper than having the iterator move up and down, forward_list could avoid pointing to the previous element, deque could try to split at block boundaries, etc. Other algorithms that iterate through a range like reduce, all_of, etc. could also benefit; hopefully most are simple wrappers around others so few would need a specialization.
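A rough sketch of the idea for a tree, using a made-up node type and a fixed depth cut-off (the real implementation would use _Rb_tree internals and a proper executor rather than raw std::thread):

    #include <thread>

    struct node { int value; node* left; node* right; };  // hypothetical layout

    template <class F>
    void parallel_for_each(node* n, F f, int depth = 0)
    {
      if (!n) return;
      if (depth >= 4) {                 // arbitrary cut-off: finish sequentially
        parallel_for_each(n->left, f, depth);
        f(n->value);
        parallel_for_each(n->right, f, depth);
        return;
      }
      std::thread t([&] { parallel_for_each(n->left, f, depth + 1); });
      f(n->value);                      // the top node goes with one of the subtrees
      parallel_for_each(n->right, f, depth + 1);
      t.join();
    }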
[Bug libstdc++/105308] Specialize for_each
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105308 --- Comment #2 from Marc Glisse --- (In reply to Jonathan Wakely from comment #1) > I'm unclear what the request is. The list isn't super clear to me either: any sensible specialization of a standard algorithm for a standard container. Even simply ranges::for_each(std::set,*) looks like it could be a bit faster with a specialization instead of using iterators. > Are you proposing this for the parallel > std::for_each with an execution policy? Yes, that's the first motivation. > That code comes from the PSTL project which is part of LLVM, > and maintained by Intel, so enhancements to it should ideally be done > upstream. But the code would need to use private interfaces of libstdc++'s _Rb_tree. Does PSTL contain a lot of special code, with one variant for libstdc++ / libc++ / other, that uses internals of the data structures?
[Bug tree-optimization/106677] New: Abstraction overhead with std::views::join
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106677 Bug ID: 106677 Summary: Abstraction overhead with std::views::join Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- (from https://stackoverflow.com/q/73407636/1918193 ) #include #include #include struct Foo { auto join() const { return m_array | std::views::join; } auto direct() const { return std::views::all(m_array[0]); } std::array, 1> m_array; }; __attribute__((noinline)) int sum_array(const Foo& foo) { int result = 0; for (int* val : foo.join()) result += *val; return result; } __attribute__((noinline)) int sum_vec(const Foo& foo) { int result = 0; for (int* val : foo.direct()) result += *val; return result; } I am using a snapshot from 20220719 with -std=gnu++2b -O3 and looking at .optimized dumps. sum_vec gets relatively nice, short code. sum_array gets something uglier. _18 = &foo_5(D)->m_array; _6 = foo_5(D) + 24; if (_6 != _18) Err, x != x+24 should be folded to false? Let's add if(foo.m_array.begin()==foo.m_array.end())__builtin_unreachable(); to move forward. _16 = MEM[(int * const * const &)foo_4(D)]; _17 = MEM[(int * const * const &)foo_4(D) + 8]; if (_16 != _17) goto ; [5.50%] else goto ; [94.50%] why are we guessing that the vector is probably empty? Let's look at more code [local count: 853673669]: _10 = &MEM[(const struct array *)foo_4(D)]._M_elems; _7 = foo_4(D) + 24; _16 = MEM[(int * const * const &)foo_4(D)]; _17 = MEM[(int * const * const &)foo_4(D) + 8]; if (_16 != _17) goto ; [5.50%] else goto ; [94.50%] [local count: 806721618]: _18 = foo_4(D) + 24; [local count: 96636762]: # SR.89_28 = PHI <_10(2), _18(3)> # SR.90_41 = PHI <_16(2), 0B(3)> goto ; [100.00%] [local count: 923031551]: # result_2 = PHI <0(4), result_12(8)> # SR.89_13 = PHI # SR.90_30 = PHI if (_7 == SR.89_13) goto ; [30.00%] else goto ; [70.00%] [local count: 276909463]: if (SR.90_30 == 0B) goto ; [16.34%] else goto ; [83.66%] [local count: 96636764]: # result_31 = PHI return result_31; (why not _18 = _7 towards the beginning?) It would be nice if threading could isolate the case of an empty vector: 2 -> 3 -> 4 -> 9 -> 10 -> 11: just return 0, and the rest of the code may become easier to optimize. Let me add if(foo.m_array[0].begin()==foo.m_array[0].end())__builtin_unreachable(); to avoid the empty vector case as well. This looks better, at least the inner loop looks normal, but we are still iterating on the elements of m_array, when we should be able to tell that it has exactly 1 element.
[Bug tree-optimization/106247] GCC12 warning in Eigen: array subscript is partly outside array bounds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106247 Marc Glisse changed: What|Removed |Added Status|WAITING |NEW --- Comment #6 from Marc Glisse --- (In reply to Andrew Pinski from comment #2) > the warning is correct in the sense the load is there in IR, though it looks > like it is dead (but only because b and a are unused): #include Eigen::Array a; Eigen::Array b; void f(){ b.col(0).tail(2) = a.col(1); } still warns about some 256 bit code which is still very dead (we later optimize the whole function to just _64 = MEM [(char * {ref-all})&a + 16B]; MEM [(char * {ref-all})&b + 8B] = _64; ) so the fact that a and b are unused in the original testcase is not important. I assume there were good reasons to put the warning this early (VRP1), but it means that some dead code that will be removed later is still around. (@Daniel: it can be easier to track things with separate issues, if you have a different testcase to provide)
[Bug target/102783] [powerpc] FPSCR manipulations cannot be relied upon
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102783 --- Comment #11 from Marc Glisse --- (In reply to Segher Boessenkool from comment #8) > Thanks for the pointer, I'll find Marc's work. Since I had forgotten where it was, let me write here that it is git branch /users/glisse/fenv
[Bug middle-end/106805] Undue optimisation of floating-point comparisons
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106805 --- Comment #2 from Marc Glisse --- A problematic optimization pointed in the discussion: (simplify (cmp @0 REAL_CST@1) [...] (if (REAL_VALUE_ISNAN (TREE_REAL_CST (@1)) && !tree_expr_signaling_nan_p (@1) && !tree_expr_maybe_signaling_nan_p (@0)) { constant_boolean_node (cmp == NE_EXPR, type); })
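To see why this is problematic under strict FP-exception semantics, a hypothetical example (not from the PR) where the fold silently drops a side effect: relational operators are required to raise FE_INVALID when an operand is a NaN, even a quiet one.

    int f(double x)
    {
      /* Folding this to 0 at compile time loses the FE_INVALID exception
         that the '<' comparison must raise at run time. */
      return x < __builtin_nan("");
    }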
[Bug c++/107065] GCC treats rvalue as an lvalue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107065 --- Comment #8 from Marc Glisse --- (simplify (bit_not (bit_not @0)) @0) while in another place we have (simplify (bit_and @0 integer_all_onesp) (non_lvalue @0))
[Bug c++/107065] GCC treats rvalue as an lvalue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107065 --- Comment #11 from Marc Glisse --- Did you try adding "non_lvalue" in match.pd? It looks less intrusive. Although in the long term your approach seems better and the failures should be fixable.
[Bug c++/107065] GCC treats rvalue as an lvalue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107065 --- Comment #13 from Marc Glisse --- (In reply to Jakub Jelinek from comment #12) > Doing it on the match.pd side doesn't look right, there could be many other > optimizations that result in something similar. $ grep -c non_lvalue match.pd 12 probably they should be removed and those that were useful should be fixed by similar techniques as you are considering... To add one more option to your list, maybe the generic-simplify machinery could add non_lvalue automatically in some cases? I still prefer your first option though, tweaking the warning code, which probably expected x!=0 and now gets !(x==0) or something similar.
[Bug tree-optimization/107184] New: Copy warnings in dump files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107184 Bug ID: 107184 Summary: Copy warnings in dump files Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: diagnostic Severity: enhancement Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- One irritation with warnings like -Wuse-after-free and all the other optimization-based warnings is how hard they are to track. Yes, it tells me where the call is in my code, but that's far from enough. With -fdump-tree-waccess, I can have some idea of what the code looks like, after various optimizations, that makes the compiler warn. However, identifying the relevant statements in the dump file can take a long time, and it remains faster to break out the debugger on the compiler :-( It seems that a small thing that could help a bit would be to print a copy of the warnings and notes in the dump file, next to the relevant statements. Or at least some easy to find marker. I most certainly don't claim that this will solve anything, I just see it as a low (?) hanging fruit.
[Bug tree-optimization/107184] Copy warnings in dump files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107184 --- Comment #3 from Marc Glisse --- (In reply to Richard Biener from comment #2) > Confirmed - for array-bounds I added some "array-bound warning for %E" > printing the SSA name/stmt in the dump file. Sounds good, I'll try that next time the warning is of the array-bound type. > What I find useful in tracking down things is to -fdump-tree-FOO-lineno which > at least gets you the locations in the dump. Ah, I didn't know that one (-lineno isn't part of -all). It is nice, but with inlining and all the corresponding source line actually appears hundreds of times in the dump, and this does not tell me which of those causes the warning.
[Bug tree-optimization/54346] combine permutations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54346 --- Comment #6 from Marc Glisse --- The log says that this breaks tree-ssa/forwprop-19.c, but I don't see any xfail or anything. Does it only fail because gimple-simplify leaves some dead code around, so you could update the test to scan the next DCE pass dump instead of forwprop1? Or are we missing a transformation that just detects a VEC_PERM_EXPR with an identity permutation?
[Bug c++/104184] New: [11/12 Regression] ICE Error reporting routines re-entered. xref_basetypes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104184 Bug ID: 104184 Summary: [11/12 Regression] ICE Error reporting routines re-entered. xref_basetypes Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- This is reduced from valid code (I think) and it still compiles with "g++ -c -w" or "clang++ -c", although the undefined inline function seems to play a strong role, so this may not be exactly the same as the original ICE. The call stack starts the same (xref_basetypes) but in the original it went through "unify" and only failed with gcc-12 (not 11) with -Wall and -Wextra. template using mp_size_t = int; template struct mp_list; template struct mp_identity { using type = int; }; template struct mp_inherit : T... {}; template using mpmf_wrap = mp_identity; template using mpmf_unwrap = typename T::type; template struct mp_map_find_impl; template class M, class... T, class K> struct mp_map_find_impl, K> { using U = mp_inherit...>; static mp_identity f(mp_identity *); using type = mpmf_unwrap; }; template using mp_map_find = typename mp_map_find_impl::type; template using mp_second = int; template struct mp_at_c_impl { using _map = mp_list, int>; using type = mp_second>>; }; template using make_arg_list = mp_identity; template struct argument_pack { using type = typename mp_at_c_impl< typename make_arg_list::type, 0>::type; }; struct parameters { typedef mp_list<> parameter_spec; }; template using boost_param_result_39refine_mesh_3 = mp_identity; template inline typename boost_param_result_39refine_mesh_3< typename argument_pack::type>::type refine_mesh_3(); int main() { refine_mesh_3(); } Internal compiler error: Error reporting routines re-entered. 0xec0348 xref_basetypes(tree_node*, tree_node*) ../../src/gcc/cp/decl.cc:15783 0x101d194 instantiate_class_template_1 ../../src/gcc/cp/pt.cc:11953 0x101ec31 instantiate_class_template(tree_node*) ../../src/gcc/cp/pt.cc:12311 0x10714d8 complete_type(tree_node*) ../../src/gcc/cp/typeck.cc:143 0x102d168 lookup_base(tree_node*, tree_node*, int, base_kind*, int) ../../src/gcc/cp/search.cc:229 0xe0ea26 standard_conversion ../../src/gcc/cp/call.cc:1403 0xe12484 implicit_conversion_1 ../../src/gcc/cp/call.cc:2031 0xe12484 implicit_conversion ../../src/gcc/cp/call.cc:2131 0xe13d7e add_function_candidate ../../src/gcc/cp/call.cc:2465 0xe15349 add_candidates ../../src/gcc/cp/call.cc:6182 0xe1c362 add_candidates ../../src/gcc/cp/call.cc:6051 0xe1c362 build_new_method_call(tree_node*, tree_node*, vec**, tree_node*, int, tree_node**, int) ../../src/gcc/cp/call.cc:11012 0x1039e3d finish_call_expr(tree_node*, vec**, bool, bool, int) ../../src/gcc/cp/semantics.cc:2788 0xfe96d4 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool) ../../src/gcc/cp/pt.cc:20780 0xff5e8c tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:16162 0x100494e tsubst_template_args(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:13423 0xff635e tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:15461 0x1005a77 tsubst_decl ../../src/gcc/cp/pt.cc:14815 0x101d842 instantiate_class_template_1 ../../src/gcc/cp/pt.cc:12076 0x101ec31 instantiate_class_template(tree_node*) ../../src/gcc/cp/pt.cc:12311
[Bug c++/104184] [11/12 Regression] ICE Error reporting routines re-entered. xref_basetypes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104184 --- Comment #1 from Marc Glisse --- A different reduction from the same code. This one does not compile with clang anymore. ICEs with -Wall -W, but not if I remove one of them. using mp_false = struct mp_identity { using type = void; }; template using mp_if_c = typename T ::type; template using mp_at_c = typename mp_if_c::type; template using make_arg_list = List; template using make_parameter_spec_items = SpecSeq; template struct argument_pack { using type = mp_at_c::type, typename Parameters::deduced_listboosttag_keyword_arg, mp_false>::type, 0>; }; void no_exude(); template using boost_param_result_465refine_mesh_3 = mp_identity; template typename boost_param_result_465refine_mesh_3< typename argument_pack::type>::type refine_mesh_3(ParameterArgumentType0, ParameterArgumentType1, ParameterArgumentType2, ParameterArgumentType3, ParameterArgumentType4, ParameterArgumentType5 a5) {} int verify___trans_tmp_1, image_domain; struct Tester { template void verify(C3t3 c3t3, Domain domain, Criteria criteria, Domain_type_tag) { refine_mesh_3(c3t3, domain, criteria, no_exude, verify___trans_tmp_1, verify___trans_tmp_1); } } image_c3t3; struct Image_tester : Tester { void image() { void criteria(); verify(image_c3t3, image_domain, criteria, int()); } };
[Bug c++/104184] [11/12 Regression] ICE Error reporting routines re-entered. xref_basetypes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104184 --- Comment #2 from Marc Glisse --- And the stack trace for comment #1 Internal compiler error: Error reporting routines re-entered. 0xff6b0d tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:16068 0xff5f6d tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:16055 0x100494e tsubst_template_args(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:13423 0xff635e tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:15461 0xff5f6d tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:16055 0x100494e tsubst_template_args(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:13423 0xff635e tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:15461 0x1005a77 tsubst_decl ../../src/gcc/cp/pt.cc:14815 0x101d842 instantiate_class_template_1 ../../src/gcc/cp/pt.cc:12076 0x101ec31 instantiate_class_template(tree_node*) ../../src/gcc/cp/pt.cc:12311 0x10714d8 complete_type(tree_node*) ../../src/gcc/cp/typeck.cc:143 0x107163d complete_type_or_maybe_complain(tree_node*, tree_node*, int) ../../src/gcc/cp/typeck.cc:156 0xff73d0 tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:16081 0x100494e tsubst_template_args(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:13423 0xff635e tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:15461 0xee6e31 dump_template_bindings ../../src/gcc/cp/error.cc:486 0xee0619 dump_function_decl ../../src/gcc/cp/error.cc:1805 0xee8602 decl_to_string ../../src/gcc/cp/error.cc:3225 0xee8602 cp_printer ../../src/gcc/cp/error.cc:4396 0x281b82f pp_format(pretty_printer*, text_info*) ../../src/gcc/pretty-print.cc:1475
[Bug c++/104184] [11/12 Regression] ICE Error reporting routines re-entered. xref_basetypes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104184 --- Comment #3 from Marc Glisse --- comment #1 actually reduces to struct voider { using type = void; }; template struct rename : P {}; template using ignore = voider; template typename ignore::type>::type g(T a) {} void f() { g(1); } (still questionable and rejected by clang, I think I'll also attach the compressed initial preprocessed file, in case the reductions hit different bugs)
[Bug c++/104184] [11/12 Regression] ICE Error reporting routines re-entered. xref_basetypes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104184 --- Comment #4 from Marc Glisse --- https://geometrica.saclay.inria.fr/team/Marc.Glisse/tmp/mybug.cc.xz (1.7M after compression still exceeds the limit) With -Wall -Wextra Internal compiler error: Error reporting routines re-entered. 0xec0348 xref_basetypes(tree_node*, tree_node*) ../../src/gcc/cp/decl.cc:15783 0x101d194 instantiate_class_template_1 ../../src/gcc/cp/pt.cc:11953 0x101ec31 instantiate_class_template(tree_node*) ../../src/gcc/cp/pt.cc:12311 0x10714d8 complete_type(tree_node*) ../../src/gcc/cp/typeck.cc:143 0xff0ad6 get_template_base ../../src/gcc/cp/pt.cc:23282 0xff2720 unify ../../src/gcc/cp/pt.cc:24348 0xff10d4 unify ../../src/gcc/cp/pt.cc:24499 0xfee75b unify_one_argument ../../src/gcc/cp/pt.cc:22472 0xfffd65 type_unification_real ../../src/gcc/cp/pt.cc:22595 0x1019da9 fn_type_unification(tree_node*, tree_node*, tree_node*, tree_node* const*, unsigned int, tree_node*, unification_kind_t, int, conversion**, bool, bool) ../../src/gcc/cp/pt.cc:21923 0xe146d9 add_template_candidate_real ../../src/gcc/cp/call.cc:3544 0xe15633 add_template_candidate ../../src/gcc/cp/call.cc:3632 0xe15633 add_candidates ../../src/gcc/cp/call.cc:6165 0xe1c362 add_candidates ../../src/gcc/cp/call.cc:6051 0xe1c362 build_new_method_call(tree_node*, tree_node*, vec**, tree_node*, int, tree_node**, int) ../../src/gcc/cp/call.cc:11012 0x1039e3d finish_call_expr(tree_node*, vec**, bool, bool, int) ../../src/gcc/cp/semantics.cc:2788 0xfe96d4 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool) ../../src/gcc/cp/pt.cc:20780 0xff5e8c tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:16162 0x100494e tsubst_template_args(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:13423 0xff635e tsubst(tree_node*, tree_node*, int, tree_node*) ../../src/gcc/cp/pt.cc:15461
[Bug c++/104235] New: [12 Regression] ICE: in cp_parser_template_id, at cp/parser.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104235 Bug ID: 104235 Summary: [12 Regression] ICE: in cp_parser_template_id, at cp/parser.cc Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- template struct L: M { using M::a; void p() { this->template a<>; } }; (accepted by g++-11 and clang++-13) bug.cc: In member function 'void L::p()': bug.cc:4:31: internal compiler error: in cp_parser_template_id, at cp/parser.cc:18367 4 | void p() { this->template a<>; } | ^ 0x719e9c cp_parser_template_id ../../src/gcc/cp/parser.cc:18367 0xfa5beb cp_parser_class_name ../../src/gcc/cp/parser.cc:25694 0xf9bddb cp_parser_qualifying_entity ../../src/gcc/cp/parser.cc:7118 0xf9bddb cp_parser_nested_name_specifier_opt ../../src/gcc/cp/parser.cc:6800 0xf9da5a cp_parser_id_expression ../../src/gcc/cp/parser.cc:6148 0xfa63cf cp_parser_postfix_dot_deref_expression ../../src/gcc/cp/parser.cc:8305 0xf9a103 cp_parser_postfix_expression ../../src/gcc/cp/parser.cc:7904 0xf81eea cp_parser_binary_expression ../../src/gcc/cp/parser.cc:10041 0xf82a4e cp_parser_assignment_expression ../../src/gcc/cp/parser.cc:10345 0xf84579 cp_parser_expression ../../src/gcc/cp/parser.cc:10515 0xf87b97 cp_parser_expression_statement ../../src/gcc/cp/parser.cc:12711 0xf950b7 cp_parser_statement ../../src/gcc/cp/parser.cc:12507 0xf9619d cp_parser_statement_seq_opt ../../src/gcc/cp/parser.cc:12856 0xf96277 cp_parser_compound_statement ../../src/gcc/cp/parser.cc:12808 0xfb6565 cp_parser_function_body ../../src/gcc/cp/parser.cc:25052 0xfb6565 cp_parser_ctor_initializer_opt_and_function_body ../../src/gcc/cp/parser.cc:25103 0xfb746e cp_parser_function_definition_after_declarator ../../src/gcc/cp/parser.cc:31229 0xfb791c cp_parser_late_parsing_for_member ../../src/gcc/cp/parser.cc:32150 0xf8fb2a cp_parser_class_specifier_1 ../../src/gcc/cp/parser.cc:26170 0xf90b72 cp_parser_class_specifier ../../src/gcc/cp/parser.cc:26194
[Bug target/104239] [12 Regression] immintrin.h or x86gprintrin.h headers can't be included
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104239 --- Comment #2 from Marc Glisse --- Thanks for fixing that bug, but don't you still have issues with NO_WARN_X86_INTRINSICS if you rely on __has_include for immintrin.h?
[Bug libstdc++/104361] Biased Reference Counting for the standard library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104361 --- Comment #2 from Marc Glisse --- I looked at this paper for a different project a while ago, and it doesn't seem like such a good match for C++ in general. While the basic idea looks simple (use 2 counters, one for the thread that created the object, one for the others), making it work in all cases is actually a lot of work. In particular the paper requires a runtime that periodically checks a queue in each thread.
[Bug tree-optimization/104389] [10/11/12 Regression] HUGE_VAL * 0.0 is no longer a NaN
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104389 --- Comment #6 from Marc Glisse --- Not this bug, but note that the comment and the code don't match in this transformation: "a negative value" becomes !tree_expr_maybe_real_minus_zero_p (@0) which is quite different. I am not sure the path with a negative @0 for which tree_expr_maybe_real_minus_zero_p returns false can be reached though.
[Bug tree-optimization/104420] New: [12 Regression] Inconsistent checks for X * 0.0 optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104420 Bug ID: 104420 Summary: [12 Regression] Inconsistent checks for X * 0.0 optimization Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- (from a comment in PR 104389) /* Maybe fold x * 0 to 0. The expressions aren't the same when x is NaN, since x * 0 is also NaN. Nor are they the same in modes with signed zeros, since multiplying a negative value by 0 gives -0, not +0. Nor when x is +-Inf, since x * 0 is NaN. */ (simplify (mult @0 real_zerop@1) (if (!tree_expr_maybe_nan_p (@0) && (!HONOR_NANS (type) || !tree_expr_maybe_infinite_p (@0)) && !tree_expr_maybe_real_minus_zero_p (@0) && !tree_expr_maybe_real_minus_zero_p (@1)) @1)) Notice how the comment talks about @0 being a "negative value" while the code says "!tree_expr_maybe_real_minus_zero_p (@0)", which is not at all the same thing. Because tree_expr_maybe_real_minus_zero_p is rather weak, it does not trigger so often, but still: double f(int a){ return a*0.; } is optimized to "return 0.;" whereas f(-42) should return -0.
[Bug libstdc++/103453] New: ASAN detection with clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103453 Bug ID: 103453 Summary: ASAN detection with clang Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Libstdc++ uses __SANITIZE_ADDRESS__ to detect if ASAN is enabled, but with clang that should be __has_feature(address_sanitizer). This means that _GLIBCXX_SANITIZE_STD_ALLOCATOR is not automatically defined, and thus defining _GLIBCXX_SANITIZE_VECTOR has no effect. (noticed in https://stackoverflow.com/q/70117470/1918193 )
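A sketch of the usual detection dance (an assumption about how it could look, not the actual libstdc++ patch):

    #ifndef __has_feature
    # define __has_feature(x) 0   /* compilers without __has_feature */
    #endif
    #if defined(__SANITIZE_ADDRESS__) || __has_feature(address_sanitizer)
    # define _GLIBCXX_SANITIZE_STD_ALLOCATOR 1
    #endif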
[Bug libstdc++/51653] More compact std::tuple
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51653 --- Comment #6 from Marc Glisse --- (In reply to Andrew Pinski from comment #5) > Is there anything more to do for this? Yes. This PR is about having the library reorder the elements of a tuple to minimize the size, and the current code does not do anything like that. Now this would be an ABI break, and even if it wasn't we might not want to do that, so it is ok if a libstdc++ maintainer decides to close it as wontfix.
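To make the request concrete, a small example of the potential gain (sizes are typical for a 64-bit target with the current, unreordered layout; they are not guaranteed):

    #include <cstdio>
    #include <tuple>

    int main()
    {
      // Typically 12 vs 8 bytes: reordering members by decreasing alignment
      // would let the first tuple shrink to 8 bytes as well.
      std::printf("%zu %zu\n", sizeof(std::tuple<char, int, char>),
                               sizeof(std::tuple<int, char, char>));
    }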
[Bug tree-optimization/90433] POINTER_DIFF_EXPR in vectorizer prologue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90433 --- Comment #3 from Marc Glisse --- (In reply to Andrew Pinski from comment #2) > The trunk we don't vectorize the code any more . I thought it might be because we found a way to use memcpy instead, which would have been good, but no, the vect dump shows an extremely common gcc issue missed: not vectorized: more than one data ref in stmt: MEM[(struct _Tuple_impl *)__cur_14].D.36092 = MEM[(struct _Tuple_impl &)__first_19].D.36092;
[Bug libstdc++/58338] Add noexcept to functions with a narrow contract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58338 Marc Glisse changed: What|Removed |Added Resolution|--- |FIXED Status|WAITING |RESOLVED --- Comment #16 from Marc Glisse --- No idea if there are low hanging fruits. I think the original idea was to get consensus on the idea to add noexcept in various places, and this seems well accepted now. At some point (back when I thought I would have enough free time) my plan was to implement some form of noexcept(auto) as an extension; I think most of the remaining places where we may want to add noexcept would benefit from that. The effort and risk in working around the lack of this feature (writing 10+ lines of noexcept(...), is_nothrow_*, etc) make it not worth it to me.
[Bug sanitizer/97868] warn about using fences with TSAN
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97868 --- Comment #6 from Marc Glisse --- (In reply to pavlick from comment #5) > Why is there false positive and no warning about the unsupported feature > (atomic_thread_fence)? You are probably using an old version of gcc. With a recent one, this prints In function 'void std::atomic_thread_fence(std::memory_order)', inlined from 'void Test::add()' at 3.cc:14:22: /usr/lib/gcc-snapshot/include/c++/12/bits/atomic_base.h:126:26: warning: 'atomic_thread_fence' is not supported with '-fsanitize=thread' [-Wtsan] 126 | { __atomic_thread_fence(int(__m)); } | ~^~
[Bug testsuite/53155] Not parallel: test for -j fails with new make
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53155 --- Comment #6 from Marc Glisse --- (In reply to Andrew Pinski from comment #5) > Hmm, Did something change in make? It looks like make now splits -j from other flags in MFLAGS, -wkj becomes -kw -j, so the old filters probably work now...
[Bug c/102760] ICE: in decompose, at wide-int.h:984
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102760 --- Comment #3 from Marc Glisse --- (In reply to Martin Liška from comment #2) > Likely triggered with r7-821-gc7986356a1ca8e8e. From Andrew's comment, it looks like the bug is before that transformation, since it receives a bit_and_expr of type int with an argument of type char, no?
[Bug tree-optimization/107520] New: Optimize std::lerp(d, d, 0.5)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107520 Bug ID: 107520 Summary: Optimize std::lerp(d, d, 0.5) Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- In some C++ code I have, it would be convenient if the compiler, possibly with the help of the standard library, could make the following function cheap, ideally just the identity. I'll probably end up wrapping lerp with a function that first checks with __builtin_constant_p if the 2 bounds are equal, but I'll post this in case people have ideas how to improve things. #include double f(double d){ return std::lerp(d, d, .5); } Currently, with -O3, we generate movapd %xmm0, %xmm1 pxor%xmm0, %xmm0 comisd %xmm1, %xmm0 jnb .L7 comisd %xmm0, %xmm1 jb .L6 .L7: pxor%xmm0, %xmm0 ucomisd %xmm0, %xmm1 jp .L6 je .L11 .L6: movapd %xmm1, %xmm0 subsd %xmm1, %xmm0 mulsd .LC1(%rip), %xmm0 addsd %xmm1, %xmm0 maxsd %xmm1, %xmm0 ret .p2align 4,,10 .p2align 3 .L11: mulsd .LC1(%rip), %xmm1 movapd %xmm1, %xmm0 addsd %xmm1, %xmm0 ret (clang is better at avoiding the redundant comparison) With -fno-trapping-math to help a bit, I see at the beginning if (d_2(D) == 0.0) goto ; [34.00%] else goto ; [66.00%] [local count: 475287355]: _7 = d_2(D) * 5.0e-1; _10 = _7 * 2.0e+0; I think that even with the default -fsigned-zeros, simplifying to _10 = d_2(D) is valid. Adding -fno-signed-zeros [local count: 1073741824]: if (d_2(D) == 0.0) goto ; [34.00%] else goto ; [66.00%] [local count: 598454470]: _13 = d_2(D) - d_2(D); _14 = _13 * 5.0e-1; __x_15 = d_2(D) + _14; if (d_2(D) u>= __x_15) goto ; [50.00%] else goto ; [50.00%] [local count: 299227235]: [local count: 1073741825]: # _12 = PHI return _12; _13 is 0 or NaN, which doesn't change for _14, and __x_15 is just d_2, so we always return d_2.
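A hedged sketch of the wrapper mentioned above (the name is mine): __builtin_constant_p lets the compiler take the identity path whenever it can prove the two bounds equal, and otherwise falls back to std::lerp.
#include <cmath>
inline double lerp_or_identity(double a, double b, double t)
{
    if (__builtin_constant_p(a == b) && a == b)
        return a;                  // both bounds known equal at compile time: lerp is the identity
    return std::lerp(a, b, t);
}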
[Bug target/107546] [10/11/12/13 Regression] simd, redundant pcmpeqb and pxor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107546 --- Comment #5 from Marc Glisse --- typedef signed char v16qs __attribute__((vector_size(16))); auto bar(v16qs x) { return x < 48; } clang does expand it as 48 gt x. Gcc however does its usual change to x <= 47, which it then tries to expand as ~(x > 47). I guess the expansion for x <= y could be tweaked in the case where one argument is constant to undo what was done earlier in the pipeline and expand as 48 > x.
[Bug c++/107622] Missing optimization of switch-statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107622 --- Comment #7 from Marc Glisse --- (Wilhelm, when you post testcases, please post the full file including the #include lines) (In reply to Richard Biener from comment #5) > Did you try -fstrict-enums? IIUC, even if optimizations using -fstrict-enums were implemented, they would only help with the first testcase if the number of enum values was a power of 2. For {A,B,C}, -fstrict-enums still considers 3 a valid value. I have long wanted an attribute to specify that a particular enum is only allowed to take the values explicitly listed, though I cannot find a relevant issue in bugzilla at the moment. Comment #4 is an independent issue where gcc fails to notice that since the static variable does not escape, it can be replaced with a local constant.
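To spell out the power-of-2 point (an illustration, not the PR's testcase): {A, B, C} needs two value bits, so -fstrict-enums still allows the value 3 and the fall-through path below cannot be removed.
enum E { A, B, C };
int f(E e)
{
    switch (e) {
    case A: return 1;
    case B: return 2;
    case C: return 3;
    }
    return 0;   // still reachable for (E)3 as far as -fstrict-enums is concerned
}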
[Bug tree-optimization/107663] -Wmaybe-uninitialized does not catch an uninitialized variable if its address was passed at -O0 and there was a call before hand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107663 --- Comment #1 from Marc Glisse --- Gcc often ignores the control flow for alias/escape analysis. v escapes at some point, and that's enough to prevent gcc from noticing that nothing can have written to v *before* the use. The same thing also hinders some optimizations, I am sure there are duplicates in bugzilla.
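An illustration of the shape of the problem just described (my own reduction, not the PR's testcase): the only escape of v happens after the read, but a flow-insensitive escape analysis cannot exploit that ordering.
void opaque(int *);   // assumed external; might write through the pointer
int f()
{
    int v;
    int r = v;     // read of uninitialized v; nothing can have written it yet
    opaque(&v);    // v only escapes here, after the read
    return r;
}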
[Bug tree-optimization/89317] Ineffective code from std::copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89317 --- Comment #11 from Marc Glisse --- (In reply to Richard Biener from comment #10) > Should be fixed in GCC 13. If I compile the original testcase with -O3, I get for test2: _1 = this_6(D) + 16; _2 = &this_6(D)->data1; if (_1 != _2) so we should probably also handle comparisons and not just subtractions. For this particular testcase, the relevant optimizations still happen and RTL cleans up the comparison, so it is ok, but the pattern appears in other PRs like PR 106677.
[Bug testsuite/108190] g++.target/i386/*pr54700*.C testcases fail on x86_64-mingw
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108190 --- Comment #6 from Marc Glisse --- Indeed, this looks like a common issue (at least with the x86 backend): the memory load is combined with the comparison before we try combining the comparison with the blend, and this last combination is then rejected because it expects a register, not memory. So either we are too eager in merging loads with instructions, or we reject instructions too early when we could still fix the operands with an extra load.
[Bug libstdc++/80331] unused const std::string not optimized away
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80331 --- Comment #10 from Marc Glisse --- (In reply to AK from comment #9) > can't repro this with gcc 12.1 Seems like this is fixed? No. As stated in other comments, it still reproduces with a longer string (or with -D_GLIBCXX_USE_CXX11_ABI=0).
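The kind of reproducer the other comments refer to (hedged: the exact threshold is the SSO buffer size of the active ABI): with a string long enough to force a heap allocation, the unused constant is not optimized away.
#include <string>
void f()
{
    const std::string s = "a string long enough to defeat the small-string optimization";
}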
[Bug tree-optimization/101501] [11/12 Regression] wrong code at -O3 on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101501 --- Comment #2 from Marc Glisse --- unsigned char a = 55; int main() { unsigned char c; d: c = a-- * 52; if (c) goto d; __builtin_printf("%d\n", a); } outputs 40 at -O3 instead of 255, and already fails with gcc-8. Cunroll seems confused about the number of iterations of this loop.
[Bug bootstrap/49908] -lm missing after -lmpc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49908 --- Comment #5 from Marc Glisse --- (In reply to Andrew Pinski from comment #4) > GCC builds now with the c++ which means this won't show up. Just because g++ has an implicit -lm doesn't mean that any random 3rd-party C++ compiler does too. (I don't really care about this PR though, I don't mind it being closed)
[Bug libstdc++/58909] C++11's condition variables fail with static linking
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58909 --- Comment #25 from Marc Glisse --- Note that this also affects dynamic linking with -Wl,--as-needed (which some platforms use by default). #include <mutex> int main(){ std::once_flag o; std::call_once(o, [](){}); } $ g++ b.cc -lpthread && ldd ./a.out linux-vdso.so.1 (0x7ffca7b6) libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x7f9c809ac000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f9c807e7000) libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f9c806a3000) /lib64/ld-linux-x86-64.so.2 (0x7f9c80bd4000) libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7f9c80689000) No libpthread there :-( (using -pthread instead of -lpthread works, but some build systems like cmake use -lpthread by default)
[Bug libstdc++/58909] C++11's condition variables fail with static linking
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58909 Marc Glisse changed: What|Removed |Added CC||glisse at gcc dot gnu.org --- Comment #27 from Marc Glisse --- (In reply to Jonathan Wakely from comment #26) > If you create a new thread of execution then you'll get a non-weak reference > to pthread_create, which should cause libpthread.so to be linked even with > -Wl,--as-needed (and for static linking it will work if libpthread.a has a > single .o with all symbols). > > If you don't actually have multiple threads in your program, then things > like condition_variable and once_flag can end up using the stubs in > libc.so.6 which are no-ops. But since you don't have multiple threads, it's > probably not a major problem. For call_once, it throws an exception whether there are other threads or not, it isn't a no-op. (as you might guess, this code is in a library, I don't control if threads are used elsewhere) > Most uses of std::once_flag would be better > done with a local static variable anyway (the exception being non-static > data members of classes). I build trees with a once_flag in each node, there is no way I can do that with static variables. > With glibc 2.34 the problem goes away, so I'm not sure it's worth investing > much effort in libstdc++ trying to work around the problems with weak > symbols. Ok. I just wanted to advertise that the issue is not limited to static linking. (too bad you had to revert the new call_once implementation)
[Bug target/101611] New: AVX2 vector arithmetic shift lowered to scalar unnecessarily
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101611 Bug ID: 101611 Summary: AVX2 vector arithmetic shift lowered to scalar unnecessarily Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* Stealing the example from PR 56873 #define SIZE 32 typedef long long veci __attribute__((vector_size(SIZE))); veci f(veci a, veci b){ return a>>b; } but compiling with -O3 -mavx2 this time, gcc produces scalar code vmovq %xmm1, %rcx vmovq %xmm0, %rax vpextrq $1, %xmm0, %rsi sarq%cl, %rax vextracti128$0x1, %ymm0, %xmm0 vpextrq $1, %xmm1, %rcx vextracti128$0x1, %ymm1, %xmm1 movq%rax, %rdx sarq%cl, %rsi vmovq %xmm0, %rax vmovq %xmm1, %rcx vmovq %rdx, %xmm5 sarq%cl, %rax vpextrq $1, %xmm1, %rcx movq%rax, %rdi vpextrq $1, %xmm0, %rax vpinsrq $1, %rsi, %xmm5, %xmm0 sarq%cl, %rax vmovq %rdi, %xmm4 vpinsrq $1, %rax, %xmm4, %xmm1 vinserti128 $0x1, %xmm1, %ymm0, %ymm0 ret while clang outputs much shorter vector code vpbroadcastq.LCPI0_0(%rip), %ymm2 # ymm2 = [9223372036854775808,9223372036854775808,9223372036854775808,9223372036854775808] vpsrlvq %ymm1, %ymm2, %ymm2 vpsrlvq %ymm1, %ymm0, %ymm0 vpxor %ymm2, %ymm0, %ymm0 vpsubq %ymm2, %ymm0, %ymm0 retq
[Bug middle-end/56873] vector shift lowered to scalars
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56873 Marc Glisse changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #3 from Marc Glisse --- Indeed, I now get sensible code with -mxop. Not so with -mavx2, but that seems independent and I filed it as PR 101611.
[Bug target/101611] AVX2 vector arithmetic shift lowered to scalar unnecessarily
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101611 --- Comment #5 from Marc Glisse --- (In reply to Jakub Jelinek from comment #2) > for arithmetic V[24]DImode >> V[24]DImode logical > ((x >> y) ^ (0x8000000000000000ULL >> y)) - (0x8000000000000000ULL >> y) > can be used. I guess it would be complicated to try and implement this fallback strategy in a generic way so other modes/targets could benefit.
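A scalar illustration of the quoted identity (my own spelling, not the proposed patch): an arithmetic right shift rebuilt from a logical one by sign-extending with the shifted sign-bit mask.
#include <cstdint>
std::uint64_t ashr_via_lshr(std::uint64_t x, unsigned y)   // y in [0, 63]
{
    std::uint64_t m = 0x8000000000000000ULL >> y;   // logical shift of the sign bit
    return ((x >> y) ^ m) - m;                      // equals (std::int64_t) x >> y
}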
[Bug target/101611] AVX2 vector arithmetic shift lowered to scalar unnecessarily
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101611 --- Comment #7 from Marc Glisse --- The same strategy to implement arithmetic shift in terms of logical shift works not just for vector>>vector but also vector>>scalar and scalar>>scalar. But it is probably not worth the trouble indeed, especially since your target patch is ready :-)
[Bug c++/91099] constexpr vs -frounding-math
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91099 Marc Glisse changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #2 from Marc Glisse --- Interestingly, Jakub's patch has the same issue my patch here had: "defeated by constexpr-caching if we change flag_rounding_math in the middle of a translation unit". constexpr double f(){return 1./3.;} #if BUG __attribute__((optimize("no-rounding-math"))) double g(){ double d=f(); return d; } #endif __attribute__((optimize("rounding-math"))) double h(){ double d=f(); return d; } The presence of g changes the code we generate for h. At least we don't seem to reuse the cache from a different value of manifestly_const_eval, so maybe changing rounding_math is just not supported and this goes in the list of issues with attribute optimize.
[Bug tree-optimization/101639] New: vectorization with bool reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101639 Bug ID: 101639 Summary: vectorization with bool reduction Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* bool f(char* p, long n) { bool r = true; for(long i = 0; i < n; ++i) r &= (p[i] != 0); return r; } is not vectorized, while if I simply declare r as char instead of bool, it is (not quite optimal since it fails to pull &1 out of the loop, but that's a separate issue).
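For comparison, the variant described above that does get vectorized: the same reduction with r declared as char instead of bool.
char g(char* p, long n)
{
    char r = 1;
    for (long i = 0; i < n; ++i)
        r &= (p[i] != 0);
    return r;
}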
[Bug c++/101651] New: constexpr write to simd vector element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101651 Bug ID: 101651 Summary: constexpr write to simd vector element Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- (adapted from https://stackoverflow.com/q/68517921/1918193) #ifdef WORK #include <array> typedef std::array<char, 16> vec; #else typedef char vec __attribute__((vector_size(16))); #endif constexpr auto gen () { vec ret{}; for (int i = 0; i < sizeof(vec); ++i) { ret[i] = 2; } return ret; }; constexpr auto m = gen(); c.cc:9:23: in 'constexpr' expansion of 'gen()' c.cc:9:24: error: modification of '(char [16])ret' is not a constant expression 9 | constexpr auto m = gen(); |^ However, with -DWORK to use std::array instead of the vector extension, it compiles just fine, so there shouldn't be any strong obstacle to implement this.
[Bug libstdc++/101659] _GLIBCXX_DEBUG mode for std::optional ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101659 --- Comment #1 from Marc Glisse --- I already see some "__glibcxx_assert(this->_M_is_engaged());" in the code, which IIUC should be enabled by _GLIBCXX_ASSERTIONS (and a fortiori by _GLIBCXX_DEBUG). Did you actually try it?
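A sketch of how to exercise that assertion (behavior depends on the libstdc++ version, and the check is only active when building with -D_GLIBCXX_ASSERTIONS or -D_GLIBCXX_DEBUG):
#include <optional>
int main()
{
    std::optional<int> o;   // disengaged
    return *o;              // __glibcxx_assert(this->_M_is_engaged()) should fire here
}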
[Bug tree-optimization/94356] Missed optimisation: useless multiplication generated for pointer comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94356 --- Comment #6 from Marc Glisse --- (In reply to Andrew Pinski from comment #5) > Hmm, the following is worse: That looks like a separate issue. We have fold_comparison for GENERIC, and match.pd has related patterns for integers, or for pointers with ==, but not for pointers with <. Strange, I thought I had added those, possibly together with pointer_diff since the behavior is similar.
[Bug c++/101795] (x > QNaNf) is not a constant expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101795 --- Comment #1 from Marc Glisse --- Hint: -fno-trapping-math lets it compile. It should probably be accepted in a manifestly_const_eval context, although some in the committee wanted to prevent the use of NaN (and sometimes even infinity!) in constant expressions...
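A hedged guess at the shape of the testcase (the report itself is not quoted here): an ordered comparison against a quiet NaN inside a constant expression, which can raise FE_INVALID and is therefore rejected under the default -ftrapping-math.
#include <limits>
constexpr bool b = (1.0f > std::numeric_limits<float>::quiet_NaN());
// accepted with -fno-trapping-math, per the hint above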
[Bug rtl-optimization/43147] SSE shuffle merge
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147 --- Comment #17 from Marc Glisse --- (In reply to Hongtao.liu from comment #15) > The issue can also be solved by folding __builtin_ia32_shufps to gimple > VEC_PERM_EXPR, Didn't you post a patch to do that last year? What happened to it?
[Bug c++/58055] [meta-bug] RVO / NRVO improvements
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58055 Bug 58055 depends on bug 57176, which changed state. Bug 57176 Summary: copy elision with function arguments passed by value https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57176 What|Removed |Added Status|WAITING |RESOLVED Resolution|--- |INVALID
[Bug c++/57176] copy elision with function arguments passed by value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57176 Marc Glisse changed: What|Removed |Added Status|WAITING |RESOLVED Resolution|--- |INVALID --- Comment #6 from Marc Glisse --- (In reply to Jonathan Wakely from comment #5) > Is it worth keeping this open if we're not allowed to make this change? Probably not since wg21 explicitly added text to forbid this optimization. It belongs in some non-existent wg21 feature request list...
[Bug ipa/111643] __attribute__((flatten)) with -O1 runs out of memory (killed cc1)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111643 Marc Glisse changed: What|Removed |Added CC||glisse at gcc dot gnu.org --- Comment #2 from Marc Glisse --- (In reply to Andrew Pinski from comment #1) > I am 99% sure this is falls under don't do this as flatten inlines > everything it can that the function calls ... Maybe people end up abusing flatten because we are missing a convenient way for a caller to ask that a call be inlined? From the callee, we can use always_inline (couldn't this be used on name_original in this testcase?), but from the caller... Here even a non-recursive version of flatten would have helped.
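For contrast, the existing per-callee mechanism mentioned above; what is missing is a per-call-site counterpart that could be spelled at the call itself (no such attribute exists today).
__attribute__((always_inline)) inline int callee(int x) { return x + 1; }
int caller(int x) { return callee(x); }   // inlined wherever it is called, without flattening the caller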
[Bug middle-end/116510] [15 Regression] ice in decompose, at wide-int.h:1049
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116510 --- Comment #7 from Marc Glisse --- (In reply to Richard Biener from comment #3) > the issue is that with_possible_nonzero_bits2 also supports SSA_NAMEs, so > @1 cannot be used like this. It has only 2 cases, and both imply INTEGER_CST, if I interpret the match syntax correctly (it's been a while) (match (with_certain_nonzero_bits2 @0) INTEGER_CST@0) (match (with_certain_nonzero_bits2 @0) (bit_ior @1 INTEGER_CST@0))
[Bug middle-end/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 --- Comment #26 from Marc Glisse --- (In reply to CVS Commits from comment #22) > While the design of these builtins in clang is questionable, > rather than being say > unsigned __builtin_addc (unsigned, unsigned, bool, bool *) > so that it is clear they add two [0, 0xffffffff] range numbers > plus one [0, 1] range carry in and give [0, 0xffffffff] range > return plus [0, 1] range carry out, they actually instead > add 3 [0, 0xffffffff] values together but the carry out > isn't then the expected [0, 2] value because > 0xffffffffULL + 0xffffffff + 0xffffffff is 0x2fffffffd, > but just [0, 1] whether there was any overflow at all. That is very strange. I always thought that the original intent was for __builtin_addc to assume that its third argument was in [0, 1] and generate a single addc instruction on hardware that has it, and the type only ended up being the same as the others for convenience (also C used not to have a bool type). The final overflow never being 2 confirms this. It may be worth discussing with clang developers if they would be willing to document such a [0, 1] restriction, and maybe have ubsan check it.
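Typical usage of the builtin under discussion (this is clang's signature, which recent GCC also provides): a two-limb addition where each carry-in is, in practice, only ever 0 or 1 — exactly the restriction the comment suggests documenting.
unsigned add2(const unsigned a[2], const unsigned b[2], unsigned r[2])
{
    unsigned c;
    r[0] = __builtin_addc(a[0], b[0], 0, &c);   // carry in 0, carry out in c
    r[1] = __builtin_addc(a[1], b[1], c, &c);   // propagate the carry
    return c;                                   // final carry out
}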
[Bug tree-optimization/119690] [12/13/14/15 regression] wrong code at -O{s,1,2,3} on x86_64-linux-gnu since r12-4871
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119690 --- Comment #3 from Marc Glisse --- > Originally we compute { 0, +, 1 } here, now in the first iteration > we do 0 - -2147483648. Looking only at what you posted, I'd say _3 is -2147483647 _4 is 0 _7 is computed either as 0 + 1 or -2147483647 - -2147483648, both of which give 1 I don't see any 0 - -2147483648 ?