[Bug c/60847] New: x86 BMI intrinsics not recognized
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

Bug ID: 60847
Summary: x86 BMI intrinsics not recognized
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: spatel at rotateright dot com

With gcc 4.9.0 (version details below), the x86 bit manipulation instruction (BMI) C intrinsics are not being recognized. This appears to be a regression from gcc 4.8.2.

$ cat bmi.c
#include <x86intrin.h>
int foo(int a) { return _blsmsk_u32(a); }
int foo2(int a) { return _blsr_u32(a); }

$ gcc -O1 bmi.c -mbmi -S -o -
    .text
    .globl _foo
_foo:
LFB2449:
    subq    $8, %rsp
LCFI0:
    movl    $0, %eax
    call    __blsmsk_u32    <--- this should be a 'blsmsk' instruction
    addq    $8, %rsp
LCFI1:
    ret
LFE2449:
    .globl _foo2
_foo2:
LFB2450:
    subq    $8, %rsp
LCFI2:
    movl    $0, %eax
    call    __blsr_u32    <--- this should be a 'blsr' instruction
    addq    $8, %rsp

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/local/libexec/gcc/x86_64-apple-darwin13/4.9.0/lto-wrapper
Target: x86_64-apple-darwin13
Configured with: /opt/local/var/macports/build/_opt_mports_dports_lang_gcc49/gcc49/work/gcc-4.9-20140406/configure --prefix=/opt/local --build=x86_64-apple-darwin13 --enable-languages=c,c++,objc,obj-c++,fortran,java --libdir=/opt/local/lib/gcc49 --includedir=/opt/local/include/gcc49 --infodir=/opt/local/share/info --mandir=/opt/local/share/man --datarootdir=/opt/local/share/gcc-4.9 --with-local-prefix=/opt/local --with-system-zlib --disable-nls --program-suffix=-mp-4.9 --with-gxx-include-dir=/opt/local/include/gcc49/c++/ --with-gmp=/opt/local --with-mpfr=/opt/local --with-mpc=/opt/local --with-cloog=/opt/local --enable-cloog-backend=isl --disable-cloog-version-check --enable-stage1-checking --disable-multilib --enable-lto --enable-libstdcxx-time --with-as=/opt/local/bin/as --with-ld=/opt/local/bin/ld --with-ar=/opt/local/bin/ar --with-bugurl=https://trac.macports.org/newticket --with-pkgversion='MacPorts gcc49 4.9-20140406_0'
Thread model: posix
gcc version 4.9.0 20140406 (experimental) (MacPorts gcc49 4.9-20140406_0)
[Bug c/60847] x86 BMI intrinsics not recognized
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

--- Comment #1 from Sanjay Patel ---
It looks like an extra leading underscore is required to recognize the BMI intrinsics. This is not happening with other (BMI2, SSE4) intrinsics. According to the Intel reference docs and previous versions of gcc, a single underscore is the correct usage.
[Bug c/60847] x86 BMI intrinsics not recognized
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

Sanjay Patel changed:

What      |Removed |Added
----------------------------
Component |target  |c

--- Comment #3 from Sanjay Patel ---
Here's the evidence of the extra leading underscore being the cause of the bug:

$ cat bmi.c
#include <x86intrin.h>
int foo(int a) { return __blsmsk_u32(a); }
int foo2(int a) { return __blsr_u32(a); }

$ gcc -O1 bmi.c -mbmi -S -o -
    .text
    .globl _foo
_foo:
LFB2449:
    blsmsk  %edi, %eax
    ret
LFE2449:
    .globl _foo2
_foo2:
LFB2450:
    blsr    %edi, %eax
    ret

Thanks to Craig Topper for noticing the underscore problem. The corresponding bug in LLVM, where this was first noted, is here:
http://llvm.org/bugs/show_bug.cgi?id=19431
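
For readers unfamiliar with the two instructions in this report, here is a minimal portable sketch of what they compute. It is not GCC's header code: the names `ref_blsr_u32` and `ref_blsmsk_u32` are hypothetical, and the bit formulas follow the usual descriptions of BLSR (reset lowest set bit) and BLSMSK (mask up to and including the lowest set bit).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference helpers; not part of any GCC header. */

/* BLSR: clear the lowest set bit of x. */
static uint32_t ref_blsr_u32(uint32_t x) { return x & (x - 1u); }

/* BLSMSK: set every bit up to and including the lowest set bit of x. */
static uint32_t ref_blsmsk_u32(uint32_t x) { return x ^ (x - 1u); }

/* Spot checks on a value whose lowest set bit is bit 3, and on zero. */
static void check_bmi_reference(void) {
    assert(ref_blsr_u32(0x18u) == 0x10u);
    assert(ref_blsmsk_u32(0x18u) == 0x0Fu);
    assert(ref_blsr_u32(0u) == 0u);
    assert(ref_blsmsk_u32(0u) == 0xFFFFFFFFu);
}
```

Calling `check_bmi_reference()` from a test driver exercises both helpers; on a BMI-capable build the intrinsics should return the same values.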
[Bug target/60847] [4.9/4.10 Regression] x86 BMI intrinsics not recognized
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

--- Comment #8 from Sanjay Patel ---
Thanks, Jakub. I see that the fix duplicates all of the intrinsics with a double-leading-underscore variant. Why do we need that? AFAIK, no other x86 intrinsics have this kind of duplication.
[Bug target/60847] [4.9/4.10 Regression] x86 BMI intrinsics not recognized
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

--- Comment #10 from Sanjay Patel ---
Ah - thank you for the explanation! I found the original checkin from AMD:
http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01356.html

Strangely, I can't find any documentation from AMD for those double-underscore names though.
[Bug c++/64677] New: incorrect result with complex division?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

Bug ID: 64677
Summary: incorrect result with complex division?
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: spatel at rotateright dot com

I'm not sure if this is a bug at -O0, at -O1 (in MPFR because all math is folded out in this case?), or neither:

#include <complex>
#include <iomanip>
#include <iostream>
int main() {
    std::complex<double> c(-61.887073591767951,-60.052083270252012);
    double a = (1.0 / c).real();
    std::cout << std::setprecision(17) << " " << a << std::endl;
}

$ g++ complex_div.cpp -O0 -std=c++11
$ ./a.out
 -0.0083223357032193145

$ g++ complex_div.cpp -O1 -std=c++11
$ ./a.out
 -0.0083223357032193128

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/lto-wrapper
Target: x86_64-apple-darwin14.0.0
Configured with: ../gcc-4.9-20141029/configure --enable-languages=c++,fortran
Thread model: posix
gcc version 4.9.2 20141029 (prerelease) (GCC)
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.10.0' '-O1' '-std=c++11' '-v' '-shared-libgcc' '-mtune=core2'
 /usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/cc1plus -quiet -v -D__DYNAMIC__ 22241.cpp -fPIC -quiet -dumpbase 22241.cpp -mmacosx-version-min=10.10.0 -mtune=core2 -auxbase 22241 -O1 -std=c++11 -version -o /var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccaWfRcp.s
GNU C++ (GCC) version 4.9.2 20141029 (prerelease) (x86_64-apple-darwin14.0.0)
 compiled by GNU C version 4.9.2 20141029 (prerelease), GMP version 6.0.0, MPFR version 3.1.2-p10, MPC version 1.0.2
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../../x86_64-apple-darwin14.0.0/include"
ignoring nonexistent directory "/usr/include"
#include "..." search starts here:
#include <...> search starts here:
 /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/usr/include/
 /usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../../include/c++/4.9.2
 /usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../../include/c++/4.9.2/x86_64-apple-darwin14.0.0
 /usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../../include/c++/4.9.2/backward
 /usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/include
 /usr/local/include
 /usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/include-fixed
 /System/Library/Frameworks
 /Library/Frameworks
End of search list.
GNU C++ (GCC) version 4.9.2 20141029 (prerelease) (x86_64-apple-darwin14.0.0)
 compiled by GNU C version 4.9.2 20141029 (prerelease), GMP version 6.0.0, MPFR version 3.1.2-p10, MPC version 1.0.2
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: c0a1d89bdb8ef292bcb2f0d5b923240f
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.10.0' '-O1' '-std=c++11' '-v' '-shared-libgcc' '-mtune=core2'
 as -arch x86_64 -force_cpusubtype_ALL -o /var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccOTSBAQ.o /var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccaWfRcp.s
COMPILER_PATH=/usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/:/usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/:/usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/:/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/:/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/
LIBRARY_PATH=/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/:/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../:/usr/lib/
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.10.0' '-O1' '-std=c++11' '-v' '-shared-libgcc' '-mtune=core2'
 /usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/collect2 -dynamic -arch x86_64 -macosx_version_min 10.10.0 -weak_reference_mismatches non-weak -o a.out -L/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2 -L/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../.. /var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccOTSBAQ.o -lstdc++ -no_compact_unwind -lSystem -lgcc_ext.10.5 -lgcc -lSystem -v
collect2 version 4.9.2 20141029 (prerelease)
/usr/bin/ld -dynamic -arch x86_64 -macosx_version_min 10.10.0 -weak_reference_mismatches non-weak -o a.out -L/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2 -L/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../.. /var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccOTSBAQ.o -lstdc++ -no_compact_unwind -lSystem -lgcc_ext.10.5 -lgcc -lSystem -v
@(#)PROGRAM:ld PROJECT:ld64-241.9
configured to support archs: armv6 armv7 armv7s arm64 i386 x86_64 x86_64h armv6m armv7m armv7em
Library search paths:
 /usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2
 /usr/local/lib
 /usr/lib
 /usr/local/lib
Framework search paths:
 /Library/Frameworks/
 /System/Library/Frameworks/
[Bug c++/64677] incorrect result with complex division?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #2 from Sanjay Patel ---
This is on plain x86-64 with SSE (before the addition of any FMA instructions), so lack of FMA must be accounted for? The answers differ in the last digit / ULP. Is there some standard or golden implementation that will answer which answer is correct?
[Bug libgcc/64677] incorrect result with complex division?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #5 from Sanjay Patel ---
(In reply to Mikhail Maltsev from comment #3)
> So, compile-time result is more precise. BTW, what does the disassembly look
> like?

In the -O0 case, it looks like all of the math is handled in:
    call    ___ieee_divdc3

In the -O1 case, the result is precomputed and loaded from constant pool:
LC1:
    .long   4250898262
    .long   -1082062004

Is Wolfram Alpha considered the authoritative answer? Or is there some IEEE reference implementation that we can consult for this kind of question?
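
To see which value the LC1 constant encodes, the two `.long` words can be reassembled into a double. This is only a sketch: it assumes IEEE binary64 doubles and that the first word is the low half (little-endian x86-64), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Rebuild a double from two 32-bit words, low word first,
   assuming IEEE binary64 as on x86-64. */
static double double_from_words(uint32_t lo, uint32_t hi) {
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Decode the LC1 constant from the -O1 assembly and print it. */
static double decode_lc1(void) {
    /* ".long -1082062004" has the same 32-bit pattern as this cast. */
    uint32_t hi = (uint32_t)-1082062004;
    double d = double_from_words(4250898262u, hi);
    assert(d < 0.0);            /* sign bit is set in the high word */
    printf("%.17g\n", d);
    return d;
}
```

Printing with 17 significant digits makes the result directly comparable with the two program outputs in the original report.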
[Bug libgcc/64677] incorrect result with complex division?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #8 from Sanjay Patel ---
(In reply to Andrew Pinski from comment #7)
> Can you try this under Linux too, just to double check there?

Wow, that other bug shows that there are a lot of variables here. I don't know what to make of this:

First, I'm just trying g++ 4.8.2 on Ubuntu 14.04 because that's what I have available at the moment. It seems I don't need the -std=c++11 flag as I do on OS X? But using that flag changes the result at -O1!

$ g++ -O0 complex_div.cpp ; ./a.out
 -0.0083223357032193145    (does not match Wolfram)

$ g++ -O1 complex_div.cpp ; ./a.out
 -0.0083223357032193145    (does not match Wolfram)

$ g++ -O0 complex_div.cpp -std=c++11 ; ./a.out
 -0.0083223357032193145    (does not match Wolfram)

$ g++ -O1 complex_div.cpp -std=c++11 ; ./a.out
 -0.0083223357032193128    (matches Wolfram...but why is this different?!)
[Bug libgcc/64677] incorrect result with complex division?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #9 from Sanjay Patel ---
(In reply to Sanjay Patel from comment #8)
> It seems I don't need the -std=c++11 flag as I do on OS X?

Actually, I screwed that up. We don't need that flag on OS X either...and thankfully, the behavior matches Linux. This is on the same OS X (10.10) that I was testing on when filing the bug:

$ g++ -O0 complex_div.cpp ; ./a.out
 -0.0083223357032193145

$ g++ -O1 complex_div.cpp ; ./a.out
 -0.0083223357032193145

$ g++ -O0 complex_div.cpp -std=c++11 ; ./a.out
 -0.0083223357032193145

$ g++ -O1 complex_div.cpp -std=c++11 ; ./a.out
 -0.0083223357032193128

So there are (at least) 2 questions:
1. Why does the answer change based on the -std?
2. What is the correct answer?
[Bug libgcc/64677] incorrect result with complex division?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #11 from Sanjay Patel ---
(In reply to Mikhail Maltsev from comment #10)
> C++11 supports constexpr (and std::complex has constexpr constructor).

Ah, that makes sense. Yes, we're only generating the answer using MPFR with c++11 and optimization. So I think this comes down to an implementation difference between libgcc and MPFR.

> By the way, according to C++ standard, precision of floating point numbers
> is implementation-defined.

Hmmm...so we still don't know which answer is correct or if both answers are acceptable?
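
As an illustration of how two mathematically equal formulas for the real part of 1/(a + bi) can round differently, here is a small sketch. It is not the code of libgcc's division routine or of MPFR: both helper names are hypothetical, and the scaled variant (Smith-style scaling by the larger component) is only an assumed stand-in for a runtime implementation.

```c
#include <assert.h>
#include <stdio.h>

/* Real part of 1/(a + b*i), textbook form: a / (a^2 + b^2). */
static double recip_real_textbook(double a, double b) {
    return a / (a * a + b * b);
}

/* Same quantity with Smith-style scaling: divide through by the
   larger-magnitude component first. Assumes a and b are not both zero. */
static double recip_real_scaled(double a, double b) {
    double abs_a = a < 0 ? -a : a;
    double abs_b = b < 0 ? -b : b;
    if (abs_a >= abs_b) {
        double r = b / a;
        return 1.0 / (a + b * r);
    } else {
        double r = a / b;
        return r / (a * r + b);
    }
}

/* Print both results for the operand from the bug report. */
static void print_both(void) {
    double a = -61.887073591767951, b = -60.052083270252012;
    double t = recip_real_textbook(a, b);
    double s = recip_real_scaled(a, b);
    assert(t < 0 && s < 0);
    printf("textbook: %.17g\n", t);
    printf("scaled:   %.17g\n", s);
}
```

Each formula performs a different sequence of rounded operations, so the two results may or may not agree in the last digit; neither is exact unless it happens to match the correctly rounded quotient.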
[Bug target/62041] New: vector fneg codegen uses a subtract instead of an xor (x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62041

Bug ID: 62041
Summary: vector fneg codegen uses a subtract instead of an xor (x86-64)
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spatel at rotateright dot com

$ cat fneg.c
#include <xmmintrin.h>
__m128 fneg4(__m128 x) {
    return _mm_sub_ps(_mm_set1_ps(-0.0), x);
}

$ ~gcc49/local/bin/gcc -march=core-avx2 -O2 -S fneg.c -o -
...
_fneg4:
LFB513:
    vmovaps LC0(%rip), %xmm1
    vsubps  %xmm0, %xmm1, %xmm0
    ret
...
LC0:
    .long   2147483648
    .long   2147483648
    .long   2147483648
    .long   2147483648

Instead of generating 'vsubps' here, it would be better to generate 'vxorps' because we know we're just flipping the sign bit of each element. This is what gcc does for the scalar version of this code.

Note that there is no difference if I use -ffast-math with this testcase. With -ffast-math enabled, we should generate the same 'xorps' code even if the "-0.0" is "+0.0". Again, that's what the scalar codegen does, so I think this is just a deficiency when generating vector code.

I can file the -ffast-math case as a separate bug if that would be better.
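
The equivalence this report relies on can be checked one lane at a time in plain C. The sketch below uses hypothetical helper names, assumes IEEE binary32 floats with default rounding and no flush-to-zero, and leaves NaN inputs out of the comparison.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float bits_float(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* Negate by subtracting from -0.0, as the testcase does per element. */
static float fneg_sub(float x) { return -0.0f - x; }

/* Negate by flipping only the sign bit, as an xor-based sequence would. */
static float fneg_xor(float x) { return bits_float(float_bits(x) ^ 0x80000000u); }

/* Compare the resulting bit patterns for a few non-NaN inputs. */
static void check_fneg(void) {
    const float samples[] = { 0.0f, -0.0f, 1.5f, -2.25f, 3.0e38f, 1.0e-40f };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(float_bits(fneg_sub(samples[i])) == float_bits(fneg_xor(samples[i])));
}
```

If the constant were +0.0 instead, the subtract form would differ only for x = +0.0, where 0.0 - 0.0 gives +0.0 rather than -0.0; that is why that variant depends on -ffast-math ignoring the sign of zero.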
[Bug target/62054] New: fabsf uses constant pool and andps (x86-64) - use pabsd instead?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62054

Bug ID: 62054
Summary: fabsf uses constant pool and andps (x86-64) - use pabsd instead?
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spatel at rotateright dot com

$ cat fabs.c
#include <math.h>
float foo(float a) {
    return fabsf(a);
}

$ gcc49 -O1 fabs.c -S -o -
    .text
    .globl _foo
_foo:
LFB19:
    movss   LC0(%rip), %xmm1
    andps   %xmm1, %xmm0
    ret
LFE19:
    .literal16
    .align 4
LC0:
    .long   2147483647
    .long   0
    .long   0
    .long   0

I think we can save 16 bytes of constant pool data and a load instruction by generating:
    pabsd   %xmm0, %xmm0

If this was part of a larger floating point chain of ops and depending on CPU, there may be some speed penalty for intermingling integer and FP ops on data in an xmm reg, but the size savings should outweigh that?
[Bug target/62055] New: missed optimization: recognize fnabs (FP negative absolute value) (x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62055

Bug ID: 62055
Summary: missed optimization: recognize fnabs (FP negative absolute value) (x86-64)
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spatel at rotateright dot com

$ cat fnabs.c
#include <math.h>
float foo(float a) {
    return -fabsf(a);
}

$ gcc49 -O1 fnabs.c -S -o -
    .text
    .globl _foo
_foo:
LFB19:
    movss   LC0(%rip), %xmm1
    andps   %xmm1, %xmm0
    movss   LC1(%rip), %xmm1
    xorps   %xmm1, %xmm0
    ret
LFE19:
    .literal16
    .align 4
LC0:
    .long   2147483647
    .long   0
    .long   0
    .long   0
    .align 4
LC1:
    .long   2147483648
    .long   0
    .long   0
    .long   0

---

That's a lot of constant pool data and instructions to turn on a single bit.

I think there are 2 steps to improving this. First, recognize that -(fabs(a)) can be transformed into an 'or' op:

    movss   LC0(%rip), %xmm1
    orps    %xmm1, %xmm0
LC0:
    .long   2147483648

Second, I don't think we need the extra 0 longs here; movss only loads 4 bytes. This may require understanding that the upper vector elements for the 'orps' are don't cares?
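
The first step proposed above (replace the and-then-xor pair with a single or) can be sanity-checked on bit patterns in plain C. This is a sketch with hypothetical helper names, assuming IEEE binary32 floats; it models the bitwise effect only, not the generated instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float bits_float(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* Current codegen: clear the sign bit (andps), then flip it (xorps). */
static float fnabs_and_xor(float x) {
    return bits_float((float_bits(x) & 0x7FFFFFFFu) ^ 0x80000000u);
}

/* Proposed codegen: just set the sign bit (orps). */
static float fnabs_or(float x) {
    return bits_float(float_bits(x) | 0x80000000u);
}

/* Both forms should produce identical bit patterns for every input. */
static void check_fnabs(void) {
    const float samples[] = { 0.0f, -0.0f, 3.0f, -3.0f, 1.0e-40f, -3.0e38f };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(float_bits(fnabs_and_xor(samples[i])) == float_bits(fnabs_or(samples[i])));
}
```

Because both versions only touch bit 31, the identity holds for all 32-bit patterns, including NaNs and infinities.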
[Bug target/62054] fabsf uses constant pool and andps (x86-64) - use pabsd instead?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62054

Sanjay Patel changed:

What       |Removed     |Added
----------------------------------
Status     |UNCONFIRMED |RESOLVED
Resolution |---         |INVALID

--- Comment #2 from Sanjay Patel ---
Ah, sorry for the noise. I misunderstood pabsd.
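
A short sketch of why an integer absolute value cannot stand in for fabsf: pabsd treats each 32-bit lane as a signed integer, so a negative float's bit pattern gets two's-complement negated instead of having its sign bit cleared. The helper names are hypothetical, and the code models one lane in plain C, assuming IEEE binary32 floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float bits_float(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* fabsf as a mask: clear the sign bit, like andps with 0x7fffffff. */
static float fabs_and(float x) {
    return bits_float(float_bits(x) & 0x7FFFFFFFu);
}

/* One lane of a signed 32-bit integer abs applied to the float's bits:
   negate in two's complement when the top bit is set. */
static float fabs_int_abs(float x) {
    uint32_t u = float_bits(x);
    if (u & 0x80000000u)
        u = 0u - u;
    return bits_float(u);
}

/* -1.0f is 0xBF800000; its two's-complement negation is 0x40800000 (4.0f). */
static void check_pabsd_model(void) {
    assert(fabs_and(-1.0f) == 1.0f);
    assert(fabs_int_abs(-1.0f) == 4.0f);
}
```

Non-negative inputs pass through both helpers unchanged, which is why the difference only shows up for negative values.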
[Bug target/62054] fabsf uses constant pool and andps (x86-64) - use pabsd instead?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62054

--- Comment #3 from Sanjay Patel ---
I think there's still an optimization possible here regarding the constant pool data - see bug 62055. Hopefully, I didn't mess that one up. :)
[Bug target/62191] New: extra shift generated for vector integer division by constant 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62191

Bug ID: 62191
Summary: extra shift generated for vector integer division by constant 2
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spatel at rotateright dot com

Using gcc 4.9:

$ cat sdiv.c
typedef int vecint __attribute__((vector_size(16)));
vecint f(vecint x) {
    return x/2;
}

$ gcc -O2 sdiv.c -S -o -
...
    movdqa  %xmm0, %xmm1
    psrad   $31, %xmm1      <--- splat the sign bit
    psrld   $31, %xmm1      <--- then shift sign bit down to LSB
    paddd   %xmm1, %xmm0    <--- add sign bit to quotient
    psrad   $1, %xmm0       <--- div via alg shift right
    ret

--

I don't think the first shift right algebraic is necessary. We splat the sign bit and then shift that right logically, so the upper bits are all zeroed anyway.

This is a special case for signed integer division by 2. You need that first 'psrad' for any other power of 2 because the subsequent logical shift would not also be a shift of 31.
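
The argument above can be modeled one lane at a time in scalar C. This is a sketch, not GCC's expansion code: the function names are hypothetical, and it assumes that `>>` on a negative `int32_t` is an arithmetic shift, which GCC provides but ISO C leaves implementation-defined.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Generated sequence: psrad $31, psrld $31, paddd, psrad $1. */
static int32_t sdiv2_four_ops(int32_t x) {
    int32_t splat = x >> 31;                          /* 0 or -1 */
    int32_t bias = (int32_t)((uint32_t)splat >> 31);  /* 0 or 1 */
    return (x + bias) >> 1;
}

/* Proposed sequence: the logical shift alone already isolates the sign bit. */
static int32_t sdiv2_three_ops(int32_t x) {
    int32_t bias = (int32_t)((uint32_t)x >> 31);      /* 0 or 1 */
    return (x + bias) >> 1;
}

/* Division by 2^k for 1 <= k <= 30: here the splat matters, because the
   logical shift by 32 - k must leave k one-bits (bias 2^k - 1) for negatives. */
static int32_t sdiv_pow2(int32_t x, int k) {
    int32_t splat = x >> 31;
    int32_t bias = (int32_t)((uint32_t)splat >> (32 - k));
    return (x + bias) >> k;
}

/* Both division-by-2 forms should match C's truncating x / 2. */
static void check_sdiv2(void) {
    const int32_t samples[] = { 0, 1, -1, 7, -7, INT32_MAX, INT32_MIN };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(sdiv2_four_ops(samples[i]) == samples[i] / 2);
        assert(sdiv2_three_ops(samples[i]) == samples[i] / 2);
    }
}
```

For k = 1 the shift amount 32 - k is 31, which is exactly the case where the splat is redundant; for larger k the splat supplies the extra low bits of the bias.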