[Bug c/60847] New: x86 BMI intrinsics not recognized

2014-04-15 Thread spatel at rotateright dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

Bug ID: 60847
   Summary: x86 BMI intrinsics not recognized
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: spatel at rotateright dot com

With gcc 4.9.0 (version details below), the x86 bit manipulation instruction
(BMI) C intrinsics are not being recognized. This appears to be a regression
from gcc 4.8.2.

$ cat bmi.c
#include <x86intrin.h>
int foo(int a) { return _blsmsk_u32(a); }
int foo2(int a) { return _blsr_u32(a); }

$ gcc -O1 bmi.c -mbmi -S -o -
    .text
    .globl _foo
_foo:
LFB2449:
    subq    $8, %rsp
LCFI0:
    movl    $0, %eax
    call    __blsmsk_u32    <--- this should be a 'blsmsk' instruction
    addq    $8, %rsp
LCFI1:
    ret
LFE2449:
    .globl _foo2
_foo2:
LFB2450:
    subq    $8, %rsp
LCFI2:
    movl    $0, %eax
    call    __blsr_u32      <--- this should be a 'blsr' instruction
    addq    $8, %rsp


$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/local/libexec/gcc/x86_64-apple-darwin13/4.9.0/lto-wrapper
Target: x86_64-apple-darwin13
Configured with:
/opt/local/var/macports/build/_opt_mports_dports_lang_gcc49/gcc49/work/gcc-4.9-20140406/configure
--prefix=/opt/local --build=x86_64-apple-darwin13
--enable-languages=c,c++,objc,obj-c++,fortran,java
--libdir=/opt/local/lib/gcc49 --includedir=/opt/local/include/gcc49
--infodir=/opt/local/share/info --mandir=/opt/local/share/man
--datarootdir=/opt/local/share/gcc-4.9 --with-local-prefix=/opt/local
--with-system-zlib --disable-nls --program-suffix=-mp-4.9
--with-gxx-include-dir=/opt/local/include/gcc49/c++/ --with-gmp=/opt/local
--with-mpfr=/opt/local --with-mpc=/opt/local --with-cloog=/opt/local
--enable-cloog-backend=isl --disable-cloog-version-check
--enable-stage1-checking --disable-multilib --enable-lto
--enable-libstdcxx-time --with-as=/opt/local/bin/as --with-ld=/opt/local/bin/ld
--with-ar=/opt/local/bin/ar --with-bugurl=https://trac.macports.org/newticket
--with-pkgversion='MacPorts gcc49 4.9-20140406_0'
Thread model: posix
gcc version 4.9.0 20140406 (experimental) (MacPorts gcc49 4.9-20140406_0)


[Bug c/60847] x86 BMI intrinsics not recognized

2014-04-15 Thread spatel at rotateright dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

--- Comment #1 from Sanjay Patel  ---
It looks like an extra leading underscore is required to recognize the BMI
intrinsics. This is not happening with other (BMI2, SSE4) intrinsics. 

According to the Intel reference docs and previous versions of gcc, a single
underscore is the correct usage.


[Bug c/60847] x86 BMI intrinsics not recognized

2014-04-15 Thread spatel at rotateright dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

Sanjay Patel  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |c

--- Comment #3 from Sanjay Patel  ---
Here's the evidence of the extra leading underscore being the cause of the bug:

$ cat bmi.c
#include <x86intrin.h>
int foo(int a) { return __blsmsk_u32(a); }
int foo2(int a) { return __blsr_u32(a); }

$ gcc -O1 bmi.c -mbmi -S -o -
    .text
    .globl _foo
_foo:
LFB2449:
    blsmsk  %edi, %eax
    ret
LFE2449:
    .globl _foo2
_foo2:
LFB2450:
    blsr    %edi, %eax
    ret


Thanks to Craig Topper for noticing the underscore problem. Corresponding bug
in LLVM where this was first noted is here:
http://llvm.org/bugs/show_bug.cgi?id=19431
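
For reference, the two operations in the testcase can be written in portable C. This is only a sketch of what the 'blsmsk' and 'blsr' instructions compute, not the contents of gcc's header; the function names blsmsk_u32_ref and blsr_u32_ref are made up for illustration.

```c
#include <stdint.h>

/* Reference for BLSMSK: set all bits up to and including the lowest set bit.
   Hypothetical helper name, not part of any gcc header. */
uint32_t blsmsk_u32_ref(uint32_t x) { return x ^ (x - 1u); }

/* Reference for BLSR: reset (clear) the lowest set bit.
   Hypothetical helper name, not part of any gcc header. */
uint32_t blsr_u32_ref(uint32_t x) { return x & (x - 1u); }
```

With -mbmi, a compiler that recognizes the intrinsic is expected to emit one instruction for each of these, as in the output above.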


[Bug target/60847] [4.9/4.10 Regression] x86 BMI intrinsics not recognized

2014-04-30 Thread spatel at rotateright dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

--- Comment #8 from Sanjay Patel  ---
Thanks, Jakub. 

I see that the fix duplicates all of the intrinsics with a
double-leading-underscore variant. Why do we need that? AFAIK, no other x86
intrinsics have this kind of duplication.


[Bug target/60847] [4.9/4.10 Regression] x86 BMI intrinsics not recognized

2014-04-30 Thread spatel at rotateright dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60847

--- Comment #10 from Sanjay Patel  ---
Ah - thank you for the explanation! I found the original checkin from AMD:
http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01356.html

Strangely, I can't find any documentation for those double-underscores from AMD
though.


[Bug c++/64677] New: incorrect result with complex division?

2015-01-19 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

Bug ID: 64677
   Summary: incorrect result with complex division?
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: spatel at rotateright dot com

I'm not sure if this is a bug at -O0, at -O1 (in MPFR because all math is
folded out in this case?), or neither:

#include <complex>
#include <iostream>
#include <iomanip>

int main()
{
    std::complex<double> c(-61.887073591767951,-60.052083270252012);
    double a = (1.0 / c).real();

    std::cout << std::setprecision(17) << " " << a << std::endl;
}

$ g++ complex_div.cpp -O0 -std=c++11
$ ./a.out 
 -0.0083223357032193145
$ g++ complex_div.cpp -O1 -std=c++11
$ ./a.out 
 -0.0083223357032193128


Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/lto-wrapper
Target: x86_64-apple-darwin14.0.0
Configured with: ../gcc-4.9-20141029/configure --enable-languages=c++,fortran
Thread model: posix
gcc version 4.9.2 20141029 (prerelease) (GCC) 
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.10.0' '-O1' '-std=c++11' '-v'
'-shared-libgcc' '-mtune=core2'
 /usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/cc1plus -quiet -v
-D__DYNAMIC__ 22241.cpp -fPIC -quiet -dumpbase 22241.cpp
-mmacosx-version-min=10.10.0 -mtune=core2 -auxbase 22241 -O1 -std=c++11
-version -o /var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccaWfRcp.s
GNU C++ (GCC) version 4.9.2 20141029 (prerelease) (x86_64-apple-darwin14.0.0)
compiled by GNU C version 4.9.2 20141029 (prerelease), GMP version 6.0.0,
MPFR version 3.1.2-p10, MPC version 1.0.2
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory
"/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../../x86_64-apple-darwin14.0.0/include"
ignoring nonexistent directory "/usr/include"
#include "..." search starts here:
#include <...> search starts here:

/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/usr/include/

/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../../include/c++/4.9.2

/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../../include/c++/4.9.2/x86_64-apple-darwin14.0.0

/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../../include/c++/4.9.2/backward
 /usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/include
 /usr/local/include
 /usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/include-fixed
 /System/Library/Frameworks
 /Library/Frameworks
End of search list.
GNU C++ (GCC) version 4.9.2 20141029 (prerelease) (x86_64-apple-darwin14.0.0)
compiled by GNU C version 4.9.2 20141029 (prerelease), GMP version 6.0.0,
MPFR version 3.1.2-p10, MPC version 1.0.2
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: c0a1d89bdb8ef292bcb2f0d5b923240f
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.10.0' '-O1' '-std=c++11' '-v'
'-shared-libgcc' '-mtune=core2'
 as -arch x86_64 -force_cpusubtype_ALL -o
/var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccOTSBAQ.o
/var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccaWfRcp.s
COMPILER_PATH=/usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/:/usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/:/usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/:/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/:/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/
LIBRARY_PATH=/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/:/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../../:/usr/lib/
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.10.0' '-O1' '-std=c++11' '-v'
'-shared-libgcc' '-mtune=core2'
 /usr/local/libexec/gcc/x86_64-apple-darwin14.0.0/4.9.2/collect2 -dynamic -arch
x86_64 -macosx_version_min 10.10.0 -weak_reference_mismatches non-weak -o a.out
-L/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2
-L/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../..
/var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccOTSBAQ.o -lstdc++
-no_compact_unwind -lSystem -lgcc_ext.10.5 -lgcc -lSystem -v
collect2 version 4.9.2 20141029 (prerelease)
/usr/bin/ld -dynamic -arch x86_64 -macosx_version_min 10.10.0
-weak_reference_mismatches non-weak -o a.out
-L/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2
-L/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2/../../..
/var/folders/k1/5fqvbm0n1zj6kjp0s9p18rm4gn/T//ccOTSBAQ.o -lstdc++
-no_compact_unwind -lSystem -lgcc_ext.10.5 -lgcc -lSystem -v
@(#)PROGRAM:ld  PROJECT:ld64-241.9
configured to support archs: armv6 armv7 armv7s arm64 i386 x86_64 x86_64h
armv6m armv7m armv7em
Library search paths:
/usr/local/lib/gcc/x86_64-apple-darwin14.0.0/4.9.2
/usr/local/lib
/usr/lib
/usr/local/lib
Framework search paths:
/Library/Frameworks/
/System/Library/Frameworks/


[Bug c++/64677] incorrect result with complex division?

2015-01-19 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #2 from Sanjay Patel  ---
This is on plain x86-64 with SSE (before the addition of any FMA instructions),
so lack of FMA must be accounted for?

The answers differ in the last digit / ULP. Is there some standard or golden
implementation that will answer which answer is correct?


[Bug libgcc/64677] incorrect result with complex division?

2015-01-20 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #5 from Sanjay Patel  ---
(In reply to Mikhail Maltsev from comment #3)
> So, compile-time result is more precise. BTW, what does the disassembly look
> like?

In the -O0 case, it looks like all of the math is handled in:
    call    ___ieee_divdc3

In the -O1 case, the result is precomputed and loaded from constant pool:
LC1:
.long   4250898262
.long   -1082062004


Is Wolfram Alpha considered the authoritative answer? Or is there some IEEE
reference implementation that we can consult for this kind of question?
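
The two .long values in the constant pool can be turned back into a double to see which of the two printed results the -O1 build loads. A minimal sketch, assuming IEEE-754 binary64 and that the first .long is the low 32 bits (little-endian x86-64); double_from_longs is a name made up here.

```c
#include <stdint.h>
#include <string.h>

/* Rebuild a double from the low and high 32-bit words of its bit pattern.
   Assumes IEEE-754 binary64; the helper name is hypothetical. */
double double_from_longs(uint32_t lo, uint32_t hi)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double value;
    memcpy(&value, &bits, sizeof value);  /* type-pun without aliasing issues */
    return value;
}
```

Calling double_from_longs(4250898262u, (uint32_t)-1082062004) and printing the result with 17 significant digits should give the precomputed value.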


[Bug libgcc/64677] incorrect result with complex division?

2015-01-20 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #8 from Sanjay Patel  ---
(In reply to Andrew Pinski from comment #7) 
> Can you try this under Linux too, just to double check there?

Wow, that other bug shows that there are a lot of variables here. 

I don't know what to make of this: First, I'm just trying g++ 4.8.2 on Ubuntu
14.04 because that's what I have available at the moment. 

It seems I don't need the -std=c++11 flag as I do on OS X? But using that flag
changes the result at -O1!

$ g++ -O0 complex_div.cpp ; ./a.out
 -0.0083223357032193145  

(does not match Wolfram)

$ g++ -O1 complex_div.cpp ; ./a.out
 -0.0083223357032193145  

(does not match Wolfram)

$ g++ -O0 complex_div.cpp -std=c++11 ; ./a.out
 -0.0083223357032193145

(does not match Wolfram)

$ g++ -O1 complex_div.cpp -std=c++11 ; ./a.out
 -0.0083223357032193128

(matches Wolfram...but why is this different?!)


[Bug libgcc/64677] incorrect result with complex division?

2015-01-20 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #9 from Sanjay Patel  ---
(In reply to Sanjay Patel from comment #8)
> It seems I don't need the -std=c++11 flag as I do on OS X?

Actually, I screwed that up. We don't need that flag on OS X either...and
thankfully, the behavior matches Linux. This is on the same OS X (10.10) that I
was testing on when filing the bug:

$ g++ -O0 complex_div.cpp ; ./a.out 
 -0.0083223357032193145

$ g++ -O1 complex_div.cpp ; ./a.out 
 -0.0083223357032193145

$ g++ -O0 complex_div.cpp -std=c++11 ; ./a.out 
 -0.0083223357032193145

$ g++ -O1 complex_div.cpp -std=c++11 ; ./a.out 
 -0.0083223357032193128

So there are (at least) 2 questions:
1. Why does the answer change based on the -std?
2. What is the correct answer?


[Bug libgcc/64677] incorrect result with complex division?

2015-01-21 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64677

--- Comment #11 from Sanjay Patel  ---
(In reply to Mikhail Maltsev from comment #10)
> C++11 supports constexpr (and std::complex has constexpr constructor).

Ah, that makes sense. Yes, we're only generating the answer using MPFR with
c++11 and optimization. So I think this comes down to an implementation
difference between libgcc and MPFR.


> By the way, according to C++ standard, precision of floating point numbers
> is implementation-defined.

Hmmm...so we still don't know which answer is correct or if both answers are
acceptable?
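
To illustrate how two implementations can disagree in the last digit, here is a sketch of two algebraically equal ways to get the real part of 1/(re + i*im) in plain doubles. These are illustrations only; I am not claiming either one is what libgcc or MPFR/MPC actually does, and the function names are made up.

```c
/* Absolute value without pulling in libm. */
double absd(double v) { return v < 0 ? -v : v; }

/* Textbook formula: re / (re^2 + im^2). */
double recip_real_textbook(double re, double im)
{
    return re / (re * re + im * im);
}

/* Same quantity, scaled by the larger component first (Smith-style).
   Different intermediate roundings, so the last bit may differ.
   Not meaningful for re == im == 0. */
double recip_real_scaled(double re, double im)
{
    if (absd(re) >= absd(im)) {
        double r = im / re;            /* |r| <= 1 */
        return 1.0 / (re + im * r);
    } else {
        double r = re / im;            /* |r| < 1 */
        return r / (re * r + im);
    }
}
```

Both functions round at different intermediate steps, so a difference of an ULP or so between them would not by itself show that either is wrong. A correctly rounded answer needs a higher-precision evaluation to compare against.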


[Bug target/62041] New: vector fneg codegen uses a subtract instead of an xor (x86-64)

2014-08-06 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62041

Bug ID: 62041
   Summary: vector fneg codegen uses a subtract instead of an xor
(x86-64)
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: spatel at rotateright dot com

$ cat fneg.c
#include <xmmintrin.h>

__m128 fneg4(__m128 x) {
    return _mm_sub_ps(_mm_set1_ps(-0.0), x);
}

$ ~gcc49/local/bin/gcc -march=core-avx2 -O2 -S fneg.c -o - 
...
_fneg4:
LFB513:
    vmovaps LC0(%rip), %xmm1
    vsubps  %xmm0, %xmm1, %xmm0
    ret
...
LC0:
    .long   2147483648
    .long   2147483648
    .long   2147483648
    .long   2147483648



Instead of generating 'vsubps' here, it would be better to generate 'vxorps'
because we know we're just flipping the sign bit of each element. This is what
gcc does for the scalar version of this code.

Note that there is no difference if I use -ffast-math with this testcase. With
-ffast-math enabled, we should generate the same 'xorps' code even if the
"-0.0" is "+0.0". Again, that's what the scalar codegen does, so I think this
is just a deficiency when generating vector code.

I can file the -ffast-math case as a separate bug if that would be better.
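
A scalar sketch of the equivalence I'm relying on: for non-NaN inputs, subtracting from -0.0 gives the same bits as flipping the sign bit, while subtracting from +0.0 does not (it gets +0.0 wrong). This assumes a 32-bit IEEE-754 float and the default rounding mode; the helper names are made up.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a float (assumes 32-bit IEEE-754). */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Negate by flipping the sign bit, i.e. what an 'xorps' with 0x80000000 does. */
float fneg_by_xor(float x)
{
    uint32_t bits = float_bits(x) ^ 0x80000000u;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Negate by subtracting from -0.0, as in the testcase. */
float fneg_by_sub(float x) { return -0.0f - x; }

/* Subtracting from +0.0 instead: differs for x == +0.0 (gives +0.0, not -0.0). */
float fneg_by_sub_poszero(float x) { return 0.0f - x; }
```

The +0.0 variant only matches when signed zeros can be ignored, which is why it needs -ffast-math.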


[Bug target/62054] New: fabsf uses constant pool and andps (x86-64) - use pabsd instead?

2014-08-07 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62054

Bug ID: 62054
   Summary: fabsf uses constant pool and andps (x86-64) - use
pabsd instead?
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: spatel at rotateright dot com

$ cat fabs.c 
#include <math.h>
float foo(float a) {
    return fabsf(a);
}

$ gcc49 -O1 fabs.c -S -o -
    .text
    .globl _foo
_foo:
LFB19:
    movss   LC0(%rip), %xmm1
    andps   %xmm1, %xmm0
    ret
LFE19:
    .literal16
    .align 4
LC0:
    .long   2147483647
    .long   0
    .long   0
    .long   0



I think we can save 16-bytes of constant pool data and a load instruction by
generating:

   pabsd %xmm0, %xmm0


If this was part of a larger floating point chain of ops and depending on CPU,
there may be some speed penalty for intermingling integer and FP ops on data in
an xmm reg, but the size savings should outweigh that?


[Bug target/62055] New: missed optimization: recognize fnabs (FP negative absolute value) (x86-64)

2014-08-07 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62055

Bug ID: 62055
   Summary: missed optimization: recognize fnabs (FP negative
absolute value) (x86-64)
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: spatel at rotateright dot com

$ cat fnabs.c
#include <math.h>
float foo(float a) {
    return -fabsf(a);
}
$ gcc49 -O1 fnabs.c -S -o -
    .text
    .globl _foo
_foo:
LFB19:
    movss   LC0(%rip), %xmm1
    andps   %xmm1, %xmm0
    movss   LC1(%rip), %xmm1
    xorps   %xmm1, %xmm0
    ret
LFE19:
    .literal16
    .align 4
LC0:
    .long   2147483647
    .long   0
    .long   0
    .long   0
    .align 4
LC1:
    .long   2147483648
    .long   0
    .long   0
    .long   0

---

That's a lot of constant pool data and instructions to turn on a single bit.

I think there are 2 steps to improving this. First, recognize that -(fabs(a))
can be transformed into an 'or' op:

    movss   LC0(%rip), %xmm1
    orps    %xmm1, %xmm0

LC0:
    .long   2147483648

Second, I don't think we need the extra 0 longs here; movss only loads 4 bytes.
This may require understanding that the upper vector elements for the 'orps'
are don't cares?
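
The bit-level identity behind the first step can be sketched in scalar C: clearing the sign bit and then flipping it gives the same result as setting it with a single 'or'. This assumes a 32-bit IEEE-754 float; the function names are made up.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a float (assumes 32-bit IEEE-754). */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Build a float from a raw bit pattern. */
float float_from_bits(uint32_t bits)
{
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* What the current codegen does: 'andps' with 0x7FFFFFFF, then 'xorps' with 0x80000000. */
float fnabs_and_xor(float a)
{
    return float_from_bits((float_bits(a) & 0x7FFFFFFFu) ^ 0x80000000u);
}

/* The proposed form: a single 'orps' with 0x80000000. */
float fnabs_or(float a)
{
    return float_from_bits(float_bits(a) | 0x80000000u);
}
```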


[Bug target/62054] fabsf uses constant pool and andps (x86-64) - use pabsd instead?

2014-08-07 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62054

Sanjay Patel  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID

--- Comment #2 from Sanjay Patel  ---
Ah, sorry for the noise. I misunderstood pabsd.
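
For anyone else reading: 'pabsd' takes the absolute value of each lane as a two's-complement integer, which is not the same as clearing the sign bit of a float. A scalar sketch of the difference, assuming a 32-bit IEEE-754 float (function names are made up):

```c
#include <stdint.h>
#include <string.h>

/* fabsf the way 'andps' does it: clear the sign bit. */
float fabs_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFu;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* What an integer absolute value does to the same bits:
   a pattern with the top bit set is negated in two's complement. */
float fabs_by_int_abs(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    if (bits & 0x80000000u)
        bits = 0u - bits;   /* two's-complement negate, wraps modulo 2^32 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, -1.0f has the bit pattern 0xBF800000; negating that as an integer gives 0x40800000, which is 4.0f rather than 1.0f.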


[Bug target/62054] fabsf uses constant pool and andps (x86-64) - use pabsd instead?

2014-08-07 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62054

--- Comment #3 from Sanjay Patel  ---
I think there's still an optimization possible here regarding the constant pool
data - see bug 62055. Hopefully, I didn't mess that one up. :)


[Bug target/62191] New: extra shift generated for vector integer division by constant 2

2014-08-19 Thread spatel at rotateright dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62191

Bug ID: 62191
   Summary: extra shift generated for vector integer division by
constant 2
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: spatel at rotateright dot com

Using gcc 4.9:

$ cat sdiv.c
typedef int vecint __attribute__((vector_size(16))); 
vecint f(vecint x) { 
return x/2;
} 

$ gcc -O2 sdiv.c -S -o -
...
    movdqa  %xmm0, %xmm1
    psrad   $31, %xmm1      <--- splat the sign bit
    psrld   $31, %xmm1      <--- then shift sign bit down to LSB
    paddd   %xmm1, %xmm0    <--- add sign bit to quotient
    psrad   $1, %xmm0       <--- div via alg shift right
    ret
ret

--

I don't think the first arithmetic right shift is necessary. We splat the sign
bit and then shift that right logically by 31, so the upper bits are all zeroed
anyway. A single 'psrld $31' on the original value leaves the same 0 or 1 in
each element.

This is a special case for signed integer division by 2. You need that first
'psrad' for any other power of 2 because the subsequent logical shift would not
also be a shift of 31.
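
A scalar sketch of one vector lane, to show the shorter sequence gives the same quotient. It assumes that '>>' on a negative int32_t is an arithmetic shift (true for gcc, but implementation-defined in C); the function names are made up.

```c
#include <stdint.h>

/* Current codegen for x/2: splat the sign, shift it down logically, add, shift. */
int32_t div2_with_splat(int32_t x)
{
    int32_t splat = x >> 31;                          /* psrad $31: 0 or -1 */
    int32_t bias = (int32_t)((uint32_t)splat >> 31);  /* psrld $31: 0 or 1 */
    return (x + bias) >> 1;                           /* paddd, psrad $1 */
}

/* Proposed: skip the splat, since a logical shift by 31 already yields 0 or 1. */
int32_t div2_without_splat(int32_t x)
{
    int32_t bias = (int32_t)((uint32_t)x >> 31);      /* psrld $31 only */
    return (x + bias) >> 1;
}

/* General signed division by 2^k (1 <= k <= 31): here the splat is needed,
   because the logical shift is by 32-k, not 31. */
int32_t div_pow2(int32_t x, int k)
{
    int32_t bias = (int32_t)((uint32_t)(x >> 31) >> (32 - k)); /* 0 or 2^k - 1 */
    return (x + bias) >> k;
}
```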