[Bug middle-end/54299] New: Array parameter does not allow for iterator syntax
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54299

Bug #: 54299
Summary: Array parameter does not allow for iterator syntax
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

Compile the following code:

~~
int aa[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

int f(int arr[10])
{
  int s = 0;
  for (auto i : arr)
    s += i;
  return s;
}

int main()
{
  return f(aa);
}
~~

This fails with

u.cc: In function ‘int f(int*)’:
u.cc:18:17: error: ‘begin’ was not declared in this scope
u.cc:18:17: error: ‘end’ was not declared in this scope
u.cc:18:17: error: unable to deduce ‘auto’ from ‘’

This indicates that the problem is that the parameter is seen as 'int *' instead of as 'int [10]'.  According to Andrew, this is another problem caused by the too-early decay of array arguments to pointers (bug 24666).

Changing the code as follows makes it compile:

~~
int aa[1][10] = { { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 } };

int f(int arr[1][10])
{
  int s = 0;
  for (auto i : arr[0])
    s += i;
  return s;
}

int main()
{
  return f(aa);
}
~~
[Bug target/54087] __atomic_fetch_add does not use xadd instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

--- Comment #8 from Ulrich Drepper 2012-08-23 15:41:49 UTC ---
(In reply to comment #7)
> Check to see if it solves the problem as well.

I tested it.  Seems to work in all cases and does not disturb other optimizations like comparisons with zero.
[Bug c++/54376] New: incorrect complaint about redefinition
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54376

Bug #: 54376
Summary: incorrect complaint about redefinition
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

At least I think this is a compiler problem.  I cannot see anything wrong in the libstdc++ code or in my test code.

Compiling the following code with 4.7.0 or even the current trunk version of the compiler produces an error like the one below.  There is no double-inclusion problem, and still the line with the "redefinition" is exactly the same as the line with the definition.  If you take out one of the two variable definitions and its uses, the error disappears, which indicates that the compiler doesn't distinguish the instantiations correctly.  The other pairs of instantiations produce the same type of error.

In file included from /usr/lib/gcc/x86_64-redhat-linux/4.7.0/../../../../include/c++/4.7.0/random:50:0,
                 from r3.cc:3:
/usr/lib/gcc/x86_64-redhat-linux/4.7.0/../../../../include/c++/4.7.0/bits/random.h: In instantiation of ‘class std::lognormal_distribution’:
r3.cc:31:39:   required from here
/usr/lib/gcc/x86_64-redhat-linux/4.7.0/../../../../include/c++/4.7.0/bits/random.h:2279:9: error: redefinition of ‘template bool std::operator==(const std::lognormal_distribution<_RealType>&, const std::lognormal_distribution<_RealType>&)’
/usr/lib/gcc/x86_64-redhat-linux/4.7.0/../../../../include/c++/4.7.0/bits/random.h:2279:9: error: ‘template bool std::operator==(const std::lognormal_distribution<_RealType>&, const std::lognormal_distribution<_RealType>&)’ previously defined here

Source:

#include
#include
#include

template <typename E, typename D>
void measure(const char *name, size_t n, E &e, D &d)
{
  typename D::result_type arr[n];
  typename D::result_type s = 0;
  for (int tries = 0; tries < 100; ++tries) {
    e.seed(1234);
    for (size_t i = 0; i < n; ++i)
      arr[i] = d(e);
    for (size_t i = 0; i < n; ++i)
      s += arr[i];
  }
  std::cout << name << " " << n << " = " << " " << s << std::endl;
}

int main(void)
{
  std::mt19937 e2;

  std::lognormal_distribution d7;
  std::lognormal_distribution d8;
  //std::gamma_distribution d9;
  //std::gamma_distribution d10;
  //std::chi_squared_distribution d11;
  //std::chi_squared_distribution d12;
  //std::fisher_f_distribution d15;
  //std::fisher_f_distribution d16;
  //std::student_t_distribution d17;
  //std::student_t_distribution d18;
  //std::binomial_distribution d20;
  //std::binomial_distribution d21;
  //std::negative_binomial_distribution d24;
  //std::negative_binomial_distribution d25;
  //std::poisson_distribution d26;
  //std::poisson_distribution d27;

  for (size_t n = 10; n < 10; n *= 1.1) {
    measure("lognormal:32", n, e2, d7);
    measure("lognormal:64", n, e2, d8);
    //measure("gamma:32", n, e2, d9);
    //measure("gamma:64", n, e2, d10);
    //measure("chi_squared:32", n, e2, d11);
    //measure("chi_squared:64", n, e2, d12);
    //measure("fisher_f:32", n, e2, d15);
    //measure("fisher_f:64", n, e2, d16);
    //measure("student_t:32", n, e2, d17);
    //measure("student_t:64", n, e2, d18);
    //measure("binomial:32", n, e2, d20);
    //measure("binomial:64", n, e2, d21);
    //measure("negative_binomial:32", n, e2, d24);
    //measure("negative_binomial:64", n, e2, d25);
    //measure("poisson:32", n, e2, d26);
    //measure("poisson:64", n, e2, d27);
  }
}
[Bug c++/54376] incorrect complaint about redefinition
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54376

--- Comment #10 from Ulrich Drepper 2012-08-25 22:54:02 UTC ---
Created attachment 28085
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28085
Avoid nested inlined friend functions

This patch fixes the issue for me.  It also cleans up the code: there is currently a lot of inconsistency as to where the operator== functions are defined, depending on whether they are friends or not.  With this patch all operator== functions are defined after the class, and friend declarations are used.
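For illustration, a minimal sketch of the shape of this change (not the actual libstdc++ code; the class, members, and parameter names here are made up):

template<typename _RealType = double>
  class lognormal_distribution
  {
  public:
    explicit lognormal_distribution(_RealType __m = 0.0, _RealType __s = 1.0)
    : _M_m(__m), _M_s(__s) { }

    // Friend *declaration* only; no inline friend definition is nested
    // inside the class template.
    template<typename _Rt>
      friend bool
      operator==(const lognormal_distribution<_Rt>& __d1,
                 const lognormal_distribution<_Rt>& __d2);

  private:
    _RealType _M_m, _M_s;
  };

// Single namespace-scope definition after the class.
template<typename _RealType>
  bool
  operator==(const lognormal_distribution<_RealType>& __d1,
             const lognormal_distribution<_RealType>& __d2)
  { return __d1._M_m == __d2._M_m && __d1._M_s == __d2._M_s; }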
[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #2 from Ulrich Drepper 2012-08-30 20:19:35 UTC ---
The instruction is generated by the compiler.  If you try to compile a new compiler you have to make sure the tools used are recent enough to understand the output of the compiler.
[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #9 from Ulrich Drepper 2012-08-31 17:46:41 UTC ---
(In reply to comment #8)
> Is it clear which are the specific requirements for the various x86* targets?
> I'm wondering if after all it's just matter of updating:
> http://gcc.gnu.org/install/specific.html

Indeed.  You cannot use old binutils if any of the code generated by the compiler requires something newer.  If these dependencies are not wanted, then the compiler has to emit .byte sequences when the new builtins are used.

> Since rdrand is only supported on Ivy Bridge processors, shouldn't
> src/c++11/random.cc have a fall through using rdtsc in case the processor
> doesn't support rdrand?

Read the code.  There is of course a fall-back for older processors.  This is about the code generated by the compiler, not about what is used at runtime.
[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #15 from Ulrich Drepper 2012-09-02 20:04:57 UTC ---
(In reply to comment #14)
> libstdc++ should check if rdrand is supported by assembler
> before using __builtin_ia32_rdrand32_step.

Every gcc feature should have a test.  That should have happened when you added the built-in.  The unavailability of a recent-enough assembler would therefore have shown up long ago.  It's just wrong to expect a compiler to work with binutils versions which cannot handle all the output the compiler produces.
[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #20 from Ulrich Drepper 2012-09-04 01:06:33 UTC ---
Created attachment 28127
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28127
Check for rdrand availability

How about this patch?  Not sure whether this handles cross-compiling.  It seems to work for me.  I still think it's wrong to bother with obsolete assemblers...
[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #36 from Ulrich Drepper 2012-09-05 13:25:21 UTC ---
(In reply to comment #35)
> What will happen if the assembler accepts rdrand, but not the CPU?

The code checks for the feature bit at runtime, so there will be no problem.  This is *exclusively* a problem with obsolete assemblers.
[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #37 from Ulrich Drepper 2012-09-05 13:57:27 UTC ---
(In reply to comment #23)
> (though,
> apparently insufficient for i?86 - it should use either __get_cpuid, or
> __get_cpuid_max before __cpuid).

I fixed that.  The code should now, in theory, also work on those systems, although the sheer size of all the code together will prevent these systems from being used...
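For reference, a minimal sketch of the kind of runtime check discussed here (not the actual libstdc++ code): __get_cpuid from <cpuid.h> verifies that CPUID and the requested leaf are available before issuing the instruction, and the RDRND capability is reported in ECX bit 30 of leaf 1 (the bit_RDRND macro from the same header).

#include <cpuid.h>
#include <cstdio>

static bool have_rdrand()
{
  unsigned int eax, ebx, ecx, edx;
  // __get_cpuid returns 0 if CPUID or leaf 1 is unavailable, so no raw
  // __cpuid is executed on processors that cannot handle it.
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    return false;
  return (ecx & bit_RDRND) != 0;
}

int main()
{
  std::printf("rdrand available: %d\n", have_rdrand() ? 1 : 0);
  return 0;
}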
[Bug c++/54825] New: ICE with vector extension
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54825

Bug #: 54825
Summary: ICE with vector extension
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
Host: x86_64-linux

While trying to convert some of libstdc++ to use gcc's vector extensions I ran into this ICE.  The code /should/ be valid.
[Bug c++/54825] ICE with vector extension
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54825

--- Comment #1 from Ulrich Drepper 2012-10-05 13:58:21 UTC ---
In case the version number isn't making this clear: I tested this with the current mainline code.  4.7 probably won't work at all since some of the features used have been added to the C++ frontend after 4.7.
[Bug c++/54825] ICE with vector extension
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54825

--- Comment #2 from Ulrich Drepper 2012-10-05 13:59:26 UTC ---
Created attachment 28363
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28363
Reproducer

Why didn't BZ add the file?...
[Bug tree-optimization/54825] ICE with vector extension
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54825

--- Comment #11 from Ulrich Drepper 2012-10-05 15:12:18 UTC ---
(In reply to comment #7)
> Created attachment 28364 [details]
> patch
>
> patch I am testing.

This seems to fix the problem for me, even with the original code and not just the reduced test case.
[Bug tree-optimization/54855] New: Unnecessary duplication when performing scalar operation on vector element
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54855

Bug #: 54855
Summary: Unnecessary duplication when performing scalar operation on vector element
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

Take the following code:

#include <stdio.h>

typedef double v2df __attribute__((vector_size(16)));

int main(int argc, char *argv[])
{
  v2df v = { 2.0, 2.0 };
  v2df v2 = { 2.0, 2.0 };
  while (argc-- > 1) {
    v[0] -= 1.0;
    v *= v2;
  }
  printf("%g\n", v[0] + v[1]);
  return 0;
}

It compiles as C and as C++; both compilers behave the same.  When compiling on x86-64 (therefore with SSE enabled) the loop is compiled to this code:

  4003f0:       66 0f 28 c1             movapd %xmm1,%xmm0
  4003f4:       83 e8 01                sub    $0x1,%eax
  4003f7:       f2 0f 5c c2             subsd  %xmm2,%xmm0
  4003fb:       f2 0f 10 c8             movsd  %xmm0,%xmm1
  4003ff:       66 0f 58 c9             addpd  %xmm1,%xmm1
  400403:       75 eb                   jne    4003f0

I.e., the value is pulled out of the vector, the subtraction is performed, and then the scalar value is put back into the vector.  Instead, the following sequence would have been completely sufficient:

        sub    $0x1,%eax
        subsd  %xmm2,%xmm1
        addpd  %xmm1,%xmm1
        jne    ...back

The subsd instruction doesn't touch the high part of the register.

I know this is a special case; it only works if the scalar operation is on element zero of the vector.  But code can be designed like that, and I have some code which would work nicely this way.  I don't know whether this translates to other architectures as well.
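For comparison, a hedged hand-written sketch of the same loop using SSE2 intrinsics (the variable names are illustrative): _mm_sub_sd maps directly to subsd, which modifies only element 0 and leaves the high element of the destination untouched, so no extract/re-insert of the scalar is needed.

#include <emmintrin.h>
#include <cstdio>

int main(int argc, char **)
{
  __m128d v  = _mm_set1_pd(2.0);
  __m128d v2 = _mm_set1_pd(2.0);
  const __m128d one = _mm_set_sd(1.0);   // { 1.0, 0.0 }

  while (argc-- > 1) {
    v = _mm_sub_sd(v, one);              // subsd: only v[0] -= 1.0
    v = _mm_mul_pd(v, v2);               // mulpd: whole vector
  }

  double lo = _mm_cvtsd_f64(v);
  double hi = _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
  std::printf("%g\n", lo + hi);
  return 0;
}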
[Bug libstdc++/54869] ext/random/simd_fast_mersenne_twister_engine/cons/default.cc FAILs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54869

--- Comment #4 from Ulrich Drepper 2012-10-09 11:23:41 UTC ---
(In reply to comment #0)
> The new ext/random/simd_fast_mersenne_twister_engine/cons/default.cc testcase
> FAILs on Solaris/SPARC (both 32 and 64-bit):

That's expected.  I mentioned when I posted the patches that the implementation is for little-endian machines.  I don't have access to any big-endian machine and therefore didn't even try to make it work.

It might be sufficient to swap, at the end of _M_gen_rand, the order of the four 32-bit words in each 128-bit word.  I never tested this; someone else will have to do it.
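A hedged sketch of that word swap, exactly as untested as the comment says (the function name and the flat uint32_t view of the state are made up for illustration; they are not the actual SFMT engine members):

#include <algorithm>
#include <cstddef>
#include <cstdint>

// Reverse the order of the four 32-bit words inside every 128-bit word.
void swap_words_per_u128(std::uint32_t *state, std::size_t n_u32)
{
  // n_u32 is the total number of 32-bit words; it must be a multiple of 4.
  for (std::size_t i = 0; i + 4 <= n_u32; i += 4) {
    std::swap(state[i],     state[i + 3]);
    std::swap(state[i + 1], state[i + 2]);
  }
}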
[Bug c/47043] New: allow deprecating enum values
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47043

Summary: allow deprecating enum values
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

The deprecated attribute is nice and its use should be expanded.  Sometimes enum values have to be deprecated, and it would be useful if one could write this:

enum {
  newval,
  oldval __attribute__ ((deprecated))
};

Any use of 'oldval' should provoke the usual warning.
[Bug c++/50734] New: const and pure attributes don't have the effect as in C
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50734

Bug #: 50734
Summary: const and pure attributes don't have the effect as in C
Classification: Unclassified
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

When the const and pure function attributes are used, the compiler doesn't generate the same code for C++ as for C.  Take this code:

extern int *f() __attribute__((pure));

int g(int *a)
{
  int s = 0;
  for (int n = 0; n < 100; ++n)
    s += f()[a[n]];
  return s;
}

When compiled as C code the call to 'f' is hoisted out of the loop; the same happens when const is used instead of pure.  When the very same code is compiled with the C++ compiler the call stays in the loop, and defining 'f' as extern "C" does not change the result:

<_Z1gPi>:
   0:   41 54                   push   %r12
   2:   49 89 fc                mov    %rdi,%r12
   5:   55                      push   %rbp
   6:   31 ed                   xor    %ebp,%ebp
   8:   53                      push   %rbx
   9:   31 db                   xor    %ebx,%ebx
   b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  10:   e8 00 00 00 00          callq  15 <_Z1gPi+0x15>
                        11: R_X86_64_PC32       f-0x4
  15:   49 63 14 1c             movslq (%r12,%rbx,1),%rdx
  19:   48 83 c3 04             add    $0x4,%rbx
  1d:   03 2c 90                add    (%rax,%rdx,4),%ebp
  20:   48 81 fb 90 01 00 00    cmp    $0x190,%rbx
  27:   75 e7                   jne    10 <_Z1gPi+0x10>
  29:   5b                      pop    %rbx
  2a:   89 e8                   mov    %ebp,%eax
  2c:   5d                      pop    %rbp
  2d:   41 5c                   pop    %r12
  2f:   c3                      retq

Versus the C code:

<g>:
   0:   53                      push   %rbx
   1:   31 c0                   xor    %eax,%eax
   3:   48 89 fb                mov    %rdi,%rbx
   6:   e8 00 00 00 00          callq  b
                        7: R_X86_64_PC32        f-0x4
   b:   31 d2                   xor    %edx,%edx
   d:   31 c9                   xor    %ecx,%ecx
   f:   90                      nop
  10:   48 63 34 13             movslq (%rbx,%rdx,1),%rsi
  14:   48 83 c2 04             add    $0x4,%rdx
  18:   03 0c b0                add    (%rax,%rsi,4),%ecx
  1b:   48 81 fa 90 01 00 00    cmp    $0x190,%rdx
  22:   75 ec                   jne    10
  24:   89 c8                   mov    %ecx,%eax
  26:   5b                      pop    %rbx
  27:   c3                      retq

Should there be a reason for this (which I cannot see, since these are extensions and gcc is not limited by a standard), the compiler should issue a warning, and there should be a way to get the behavior we get with the C compiler.

I checked this with the current Fedora x86-64 compiler: gcc version 4.6.1 20110908 (Red Hat 4.6.1-9) (GCC).  This is most probably architecture-independent.
[Bug middle-end/50963] New: TLS incompatible with -mcmodel=large & PIC
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50963

Bug #: 50963
Summary: TLS incompatible with -mcmodel=large & PIC
Classification: Unclassified
Product: gcc
Version: 4.6.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

Due to some build problems with the default models a program was compiled with -mcmodel=large.  But that seems to be incompatible with TLS in PIC.  This tiny code sequence blows up gcc as recently as 4.6.2 (from Fedora rawhide):

__thread int a;

int f(int b)
{
  return a;
}

The ICE message when compiled with 'g++ -c -mcmodel=large t.c -fpic' is:

t.c: In function ‘int f(int)’:
t.c:6:1: error: unrecognizable insn:
(call_insn/u 6 5 7 3 (parallel [
            (set (reg:DI 0 ax)
                (call:DI (mem:QI (symbol_ref:DI ("__tls_get_addr")) [0 S1 A8])
                    (const_int 0 [0])))
            (unspec:DI [
                    (symbol_ref:DI ("a") [flags 0x10] )
                ] UNSPEC_TLS_GD)
        ]) t.c:5 -1
     (expr_list:REG_EH_REGION (const_int -2147483648 [0x8000])
        (nil))
    (nil))
t.c:6:1: internal compiler error: in extract_insn, at recog.c:2109
[Bug tree-optimization/50984] New: Boolean return value expression clears register too often
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50984

Bug #: 50984
Summary: Boolean return value expression clears register too often
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
Target: x86_64-linux

Compile this code with the current HEAD gcc (or 4.5, I tried that as well) and you see less than optimal code:

int f(int a, int b)
{
  return a & 8 && b & 4;
}

For x86-64 I see this asm code:

        xorl    %eax, %eax
        andl    $8, %edi
        je      .L2
        xorl    %eax, %eax      <- Unnecessary !!!
        andl    $4, %esi
        setne   %al
.L2:
        rep ret

The compiler should realize that the second xor is unnecessary.
[Bug tree-optimization/53243] New: Use vector comparisons for if cascades
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53243

Bug #: 53243
Summary: Use vector comparisons for if cascades
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
Target: x86_64-linux

Created attachment 27312
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27312
Test program (compile with and without -DOLD)

The vector units can perform multiple comparisons concurrently, but gcc does not use this automatically in situations where it can lead to better performance.  Assume a function like this:

void f(float a)
{
  if (a < 1.0) cb(1);
  else if (a < 2.0) cb(2);
  else if (a < 3.0) cb(3);
  else if (a < 4.0) cb(4);
  else if (a < 5.0) cb(5);
  else if (a < 6.0) cb(6);
  else if (a < 7.0) cb(7);
  else if (a < 8.0) cb(8);
  else ++o;
}

Here neither the first nor the second if is marked as likely with __builtin_expect; otherwise the following *might* not apply.  The routine can be rewritten for AVX machines like this:

void f(float a)
{
  const __m256 fv = _mm256_set_ps(8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0);
  __m256 r = _mm256_cmp_ps(fv, _mm256_set1_ps(a), _CMP_LT_OS);
  int i = _mm256_movemask_ps(r);
  asm goto ("bsr %0, %0; jz %l[less1]; .pushsection .rodata; 1: .quad %l2, %l3, %l4, %l5, %l6, %l7, %l8, %l9; .popsection; jmp *1b(,%0,8)"
            : : "r" (i) : : less1, less2, less3, less4, less5, less6, less7, less8, gt8);
  __builtin_unreachable ();
 less1: cb(1); return;
 less2: cb(2); return;
 less3: cb(3); return;
 less4: cb(4); return;
 less5: cb(5); return;
 less6: cb(6); return;
 less7: cb(7); return;
 less8: cb(8); return;
 gt8: ++o;
}

This might not generate the absolute best code, but it runs 20% faster for the attached test program.  The same technique can be applied to integer comparisons.

More complex if cascades can also be simplified a lot by masking the integer bsr result accordingly.  This should still be faster.
[Bug target/54087] New: __atomic_fetch_add does not use xadd instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

Bug #: 54087
Summary: __atomic_fetch_add does not use xadd instruction
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
Target: x86_64-redhat-linux

Compiling this code

int a;

int f1(int p)
{
  return __atomic_sub_fetch(&a, p, __ATOMIC_SEQ_CST) == 0;
}

int f2(int p)
{
  return __atomic_fetch_sub(&a, p, __ATOMIC_SEQ_CST) - p == 0;
}

you'll see that neither function uses the xadd instruction with the lock prefix.  Instead an expensive emulation using cmpxchg is used:

<f1>:
   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6
                        2: R_X86_64_PC32        a-0x4
   6:   89 c2                   mov    %eax,%edx
   8:   29 fa                   sub    %edi,%edx
   a:   f0 0f b1 15 00 00 00    lock cmpxchg %edx,0x0(%rip)        # 12
  11:   00
                        e: R_X86_64_PC32        a-0x4
  12:   75 f2                   jne    6
  14:   31 c0                   xor    %eax,%eax
  16:   85 d2                   test   %edx,%edx
  18:   0f 94 c0                sete   %al
  1b:   c3                      retq

This implementation is not only larger, it has possibly (if unlikely) unbounded cost, and even if the cmpxchg succeeds right away it is costlier.  The last point is especially true if the cache line for the variable in question is not in the core's cache.  In this case the initial load causes an I->S transition for the cache line and the cmpxchg an additional and possibly also very expensive S->E transition.  Using xadd would cause a single I->E transition.

The config/i386/sync.md file in the current tree contains a pattern for atomic_fetch_add which does use xadd, but it seems not to be used, even if an immediate value is used instead of the function parameter:

;; For operand 2 nonmemory_operand predicate is used instead of
;; register_operand to allow combiner to better optimize atomic
;; additions of constants.
(define_insn "atomic_fetch_add<mode>"
  [(set (match_operand:SWI 0 "register_operand" "=<r>")
        (unspec_volatile:SWI
          [(match_operand:SWI 1 "memory_operand" "+m")
           (match_operand:SI 3 "const_int_operand")]            ;; model
          UNSPECV_XCHG))
   (set (match_dup 1)
        (plus:SWI (match_dup 1)
                  (match_operand:SWI 2 "nonmemory_operand" "0")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_XADD"
  "lock{%;} %K3xadd{<imodesuffix>}\t{%0, %1|%1, %0}")

X86_ARCH_XADD should be defined for every architecture but i386.
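As a source-level workaround until the sub forms are handled, a hedged sketch (the variable and function names below are illustrative, not part of the report): the add forms do map to lock xadd, so the subtraction can be expressed as adding the negated value.

int a_workaround;

int f1_workaround(int p)
{
  /* Equivalent to __atomic_sub_fetch(&a_workaround, p, ...) == 0,
     ignoring the p == INT_MIN corner case of the negation.  */
  return __atomic_add_fetch(&a_workaround, -p, __ATOMIC_SEQ_CST) == 0;
}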
[Bug target/54087] __atomic_fetch_add does not use xadd instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

--- Comment #3 from Ulrich Drepper 2012-08-01 16:06:33 UTC ---
(In reply to comment #2)
> (In reply to comment #1)
> > Use __atomic_add_fetch and __atomic_fetch_sub in the testcase, and you will
>
> Eh, __atomic_fetch_add.

Yes, but the compiler should do this automatically.  The extreme case is this:

int v;

int a(void)
{
  return __sync_sub_and_fetch(&v, 5);
}

int b(void)
{
  return __sync_add_and_fetch(&v, -5);
}

The second function compiles as expected.  The first doesn't; it uses cmpxchg.

Shouldn't this be easy enough to fix by adding patterns for atomic_fetch_sub and atomic_sub_fetch which match if the second parameter is a constant?  If it's not a constant a bit more code is needed, but that should be no problem either.
[Bug target/54087] __atomic_fetch_add does not use xadd instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

--- Comment #4 from Ulrich Drepper 2012-08-02 14:33:19 UTC ---
One more data point.  In a micro-benchmark which uses realistic code from production, changing __sync_sub_and_fetch(var, constant) to __sync_add_and_fetch(var, -constant) led to a 10% to 27% improvement in performance.  The cmpxchg use, with the necessary initial load and I->S cache transition, really kills performance when memory is highly contended.
[Bug target/54087] __atomic_fetch_add does not use xadd instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

--- Comment #6 from Ulrich Drepper 2012-08-03 02:16:57 UTC ---
(In reply to comment #5)
> This patch introduces atomic_fetch_sub:

Seems to work nicely.
[Bug middle-end/54167] New: excessive alignment
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54167

Bug #: 54167
Summary: excessive alignment
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

Compile the following code:

struct c
{
  int a, b;
  /*constexpr*/ c() : a(1), b(2) { }
};

c v;

The variable v will be defined with:

        .bss
        .align 16
        .type   v, @object
        .size   v, 8
v:
        .zero   8

The variable has alignment 16!  If you uncomment the constexpr and compile with -std=gnu++11 it can be seen that the compiler does know what the correct alignment is:

        .globl  v
        .data
        .align 4
        .type   v, @object
        .size   v, 8
v:
        .long   1
        .long   2

This happens with the current svn version as well as with 4.7.0.
[Bug tree-optimization/52070] New: missing integer comparison optimization
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52070

Bug #: 52070
Summary: missing integer comparison optimization
Classification: Unclassified
Product: gcc
Version: 4.6.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

Compile this code with gcc 4.6.2:

#include

size_t b;

int f(size_t a)
{
  return b == 0 || a < b;
}

For x86-64 I see this result:

f:
        movq    b(%rip), %rdx
        movl    $1, %eax
        testq   %rdx, %rdx
        je      .L2
        xorl    %eax, %eax
        cmpq    %rdi, %rdx
        seta    %al
.L2:
        rep ret

This can be done without a conditional jump:

f:
        movq    b(%rip), %rdx
        xorl    %eax, %eax
        subq    $1, %rdx
        cmpq    %rdi, %rdx
        setae   %al
        rep ret

Unless the b == 0 test is marked as likely, I'd say this code performs better on all architectures.
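A hedged source-level sketch of the transformation the suggested asm performs (b is passed as a parameter here only to keep the example self-contained): with unsigned wrap-around, b == 0 turns b - 1 into the maximum value, so a <= b - 1 holds exactly when b == 0 || a < b.

#include <stddef.h>

int f_branchless(size_t a, size_t b)
{
  return a <= b - 1;
}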
[Bug middle-end/59521] New: __builtin_expect not effective in switch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59521

Bug ID: 59521
Summary: __builtin_expect not effective in switch
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: drepper.fsp at gmail dot com

When used in a switch, __builtin_expect should reorder the comparisons appropriately.  Take this code:

#include <stdio.h>

void f(int ch)
{
  switch (__builtin_expect(ch, 333)) {
  case 3:
    puts("a");
    break;
  case 42:
    puts("e");
    break;
  case 333:
    puts("i");
    break;
  }
}

Current mainline (and also prior versions; I tested 4.8.2) produces with -O3 code like this:

<f>:
   0:   83 ff 2a                cmp    $0x2a,%edi
   3:   74 33                   je     38
   5:   81 ff 4d 01 00 00       cmp    $0x14d,%edi
   b:   74 1b                   je     28
   d:   83 ff 03                cmp    $0x3,%edi
  10:   74 06                   je     18
  12:   c3                      retq
  13:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  18:   bf 00 00 00 00          mov    $0x0,%edi
  1d:   e9 00 00 00 00          jmpq   22
  22:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  28:   bf 00 00 00 00          mov    $0x0,%edi
  2d:   e9 00 00 00 00          jmpq   32
  32:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  38:   bf 00 00 00 00          mov    $0x0,%edi
  3d:   e9 00 00 00 00          jmpq   42

Instead the test for 333/$0x14d should have been moved to the front.
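For comparison, a hedged hand-written sketch of the comparison order the report expects (illustrative source, not compiler output): the value named in __builtin_expect is tested first.

#include <stdio.h>

void f_expected_order(int ch)
{
  if (ch == 333)
    puts("i");
  else if (ch == 3)
    puts("a");
  else if (ch == 42)
    puts("e");
}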
[Bug tree-optimization/51492] New: vectorizer generates unnecessary code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

Bug #: 51492
Summary: vectorizer generates unnecessary code
Classification: Unclassified
Product: gcc
Version: 4.6.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
Build: x86_64-linux

Compile this code with 4.6.2 on an x86-64 machine with -O3:

#define SIZE 65536
#define WSIZE 64

unsigned short head[SIZE] __attribute__((aligned(64)));

void f(void)
{
  for (unsigned n = 0; n < SIZE; ++n) {
    unsigned short m = head[n];
    head[n] = (unsigned short)(m >= WSIZE ? m - WSIZE : 0);
  }
}

The result I see is this:

<f>:
   0:   66 0f ef d2             pxor   %xmm2,%xmm2
   4:   b8 00 00 00 00          mov    $0x0,%eax
                        5: R_X86_64_32  head
   9:   66 0f 6f 25 00 00 00    movdqa 0x0(%rip),%xmm4        # 11
  10:   00
                        d: R_X86_64_PC32        .LC0-0x4
  11:   66 0f 6f 1d 00 00 00    movdqa 0x0(%rip),%xmm3        # 19
  18:   00
                        15: R_X86_64_PC32       .LC1-0x4
  19:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  20:   66 0f 6f 00             movdqa (%rax),%xmm0
  24:   66 0f 6f c8             movdqa %xmm0,%xmm1
  28:   66 0f d9 c4             psubusw %xmm4,%xmm0
  2c:   66 0f 75 c2             pcmpeqw %xmm2,%xmm0
  30:   66 0f fd cb             paddw  %xmm3,%xmm1
  34:   66 0f df c1             pandn  %xmm1,%xmm0
  38:   66 0f 7f 00             movdqa %xmm0,(%rax)
  3c:   48 83 c0 10             add    $0x10,%rax
  40:   48 3d 00 00 00 00       cmp    $0x0,%rax
                        42: R_X86_64_32S        head+0x2
  46:   75 d8                   jne    20
  48:   f3 c3                   repz retq

There is a lot of unnecessary code.  The psubusw instruction alone is sufficient; its purpose is to implement saturated subtraction.  Why does gcc create all this extra code?  The loop body should just be

        movdqa  (%rax), %xmm0
        psubusw %xmm1, %xmm0
        movdqa  %xmm0, (%rax)

where %xmm1 has WSIZE in the 16-bit values.
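For reference, a hedged intrinsics sketch of the pattern the vectorizer should recognize (the function name and the explicit 8-lane step are illustrative): the whole loop body is an unsigned saturated subtraction, which SSE2 provides directly as psubusw via _mm_subs_epu16.

#include <emmintrin.h>

#define SIZE 65536
#define WSIZE 64

unsigned short head[SIZE] __attribute__((aligned(64)));

void f_intrin(void)
{
  const __m128i wsize = _mm_set1_epi16(WSIZE);
  for (unsigned n = 0; n < SIZE; n += 8) {        // 8 x 16-bit lanes per vector
    __m128i m = _mm_load_si128((const __m128i *)&head[n]);
    m = _mm_subs_epu16(m, wsize);                 // m >= WSIZE ? m - WSIZE : 0
    _mm_store_si128((__m128i *)&head[n], m);
  }
}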
[Bug c++/51785] New: gets not anymore declared
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51785

Bug #: 51785
Summary: gets not anymore declared
Classification: Unclassified
Product: gcc
Version: 4.6.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

glibc 2.15 and later don't declare gets anymore for ISO C11 mode and if _GNU_SOURCE is defined.  This causes problems with the cstdio header, which unconditionally uses

using ::gets;

Something has to be done about this.  If you want glibc to define a macro to signal that gets is not declared, let me know.  Otherwise recognize __USE_GNU.

The problem still applies to the trunk.
[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #2 from Ulrich Drepper 2012-01-08 18:56:48 UTC ---
Note, this code appears in gzip and therefore IIRC in SPEC CPU (in deflate.c:fill_window).  Although when compiling gzip myself, with that code embedded in a larger function, I cannot get the optimization to apply at all.  If this bug is fixed and the optimization is applied, the SPEC numbers could go up, if SPEC CPU is testing unzipping...
[Bug tree-optimization/52034] New: __builtin_copysign optimization suboptimal
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52034

Bug #: 52034
Summary: __builtin_copysign optimization suboptimal
Classification: Unclassified
Product: gcc
Version: 4.6.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com

Even the most trivial use of __builtin_copysign is not compiled optimally:

double f(double a, double b)
{
  return __builtin_copysign(a, b);
}

With gcc 4.6.2 this gets compiled to

        movapd  %xmm1, %xmm2
        andpd   .LC0(%rip), %xmm0
        andpd   .LC1(%rip), %xmm2
        orpd    %xmm2, %xmm0
        ret

There is no reason for %xmm1 to be duplicated into %xmm2.  This is sufficient:

        andpd   .LC0(%rip), %xmm0
        andpd   .LC1(%rip), %xmm1
        orpd    %xmm1, %xmm0
        ret

The same happens with more complicated code sequences.
[Bug middle-end/59521] __builtin_expect not effective in switch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59521

--- Comment #12 from Ulrich Drepper ---
On Fri, Jul 14, 2017 at 2:17 PM, marxin at gcc dot gnu.org wrote:
> Maybe I miss something, but I would expect to sort all branches in
> emit_case_decision_tree as either predictors can sort branches, or one have a
> profile feedback. Having a chain of equal comparisons, that should be always
> beneficial, or?

I agree; there seems to be no negative effect.  If you use a stable sort algorithm the programmer can still have influence when needed, since the program's order is preserved.  If the compiler has probability information it should use it.

Note, I just mean the order of the tests.  Deciding about placing code in cold sections is a different story, but that isn't what we're talking about here, right?
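A small illustration of why a *stable* sort keeps the programmer in control (standalone example code, not GCC internals; the probabilities are made up): cases with equal probability keep their source order, so only predictor or profile information reorders the comparison chain.

#include <algorithm>
#include <cstdio>
#include <vector>

struct CaseProb { int value; double prob; };   // hypothetical per-case record

int main()
{
  std::vector<CaseProb> cases = {
    { 3, 0.1 }, { 42, 0.1 }, { 333, 0.8 }      // probabilities from a predictor
  };
  // Sort by decreasing probability; stability keeps case 3 before case 42
  // (equal probability) in their original source order.
  std::stable_sort(cases.begin(), cases.end(),
                   [](const CaseProb &a, const CaseProb &b)
                   { return a.prob > b.prob; });
  for (const auto &c : cases)
    std::printf("compare against %d (p=%.1f)\n", c.value, c.prob);
  return 0;
}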