[Bug c/95661] New: Code built with -m32 uses SSE2 instructions

2020-06-12 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95661

            Bug ID: 95661
           Summary: Code built with -m32 uses SSE2 instructions
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: memmerto at ca dot ibm.com
  Target Milestone: ---

When building 32-bit code with -m32, SSE2 instructions are generated.  This is
in contrast to the docs.

https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
"The -m32 option sets int, long, and pointer types to 32 bits, and generates
code that runs on any i386 system."

In particular, the code that performs floating-point/integer conversions
appears to use SSE2 instructions.

Compiler:

$ /opt/rh/devtoolset-8/root/usr/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/opt/rh/devtoolset-8/root/usr/bin/gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap
--enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr
--mandir=/opt/rh/devtoolset-8/root/usr/share/man
--infodir=/opt/rh/devtoolset-8/root/usr/share/info
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared
--enable-threads=posix --enable-checking=release --enable-multilib
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions
--enable-gnu-unique-object --enable-linker-build-id
--with-gcc-major-version-only --with-linker-hash-style=gnu
--with-default-libstdcxx-abi=gcc4-compatible --enable-plugin
--enable-initfini-array
--with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install
--disable-libmpx --enable-gnu-indirect-function --with-tune=generic
--with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)

Compile Command:

gcc -c -m32 t.c -save-temps -fverbose-asm -o t.o

Testcase:

#include <stdio.h>
/* second #include: header name lost in the archive */

int main(void)
{
  double d = 100.0;
  int i = (int)d;
  printf("%d\n",i);
}

Assembly Fragment:

# t.c:7:   int i = (int)d;
        movsd       -16(%ebp), %xmm0    # d, tmp90
        cvttsd2si   %xmm0, %eax         # tmp90, tmp91
        movl        %eax, -20(%ebp)     # tmp91, i

I would expect 387 instructions to be generated (since -mfpmath=387 is the
default for 32-bit targets).

[Bug target/95661] Code built with -m32 uses SSE2 instructions

2020-06-13 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95661

--- Comment #4 from Matt Emmerton ---
Thank you everyone.  This fully explains why we were still getting SSE in
32-bit mode.

[Bug target/93177] New: PPC: Missing many useful platform intrinsics

2020-01-06 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

            Bug ID: 93177
           Summary: PPC: Missing many useful platform intrinsics
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: memmerto at ca dot ibm.com
  Target Milestone: ---

The file gcc/config/rs6000/ppu_intrinsics.h defines a lot of useful intrinsics
for PPC, but the heading of this file indicates that it is specific to the
"Cell BEA", which is a PPC derivative.

The #define guards at the top of the file suggest that the file was intended
for both ppu (cell) and ppc/ppc64 (PowerPC/POWER) configurations.

It would be very useful if this file could be installed on all powerpc targets,
or perhaps cloned to ppc_intrinsics.h and have that installed on powerpc.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-08 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #2 from Matt Emmerton ---
This appears to have packaging complications by vendors as well :(

On powerpc-ibm-aix7.1.0.0 this doesn't get installed.
On ppc64le-redhat-linux it does.

However, both of these cases would benefit from something targeted specifically
to PPC, rather than PPU/Cell.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-10 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #4 from Matt Emmerton ---
The intrinsics that we would find useful, having used them as provided by the
IBM XL C/C++ compiler, are the following:

__sync()
__isync()
__lwsync()

__dcbt()
__dcbtst()

__lwarx()
__ldarx()
__stwcx()
__stdcx()

__protected_stream_set()
__protected_stream_count()
__protected_stream_count_depth() // currently not implemented in gcc
__protected_stream_go()

The implementations of __stwcx() and __stdcx() need revision on PPC.
As I understand it, there is no need for the mfocrf instruction nor for the
mask-and-shift on the result.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-13 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #8 from Matt Emmerton ---
(In reply to Andrew Pinski from comment #5)
> > __lwarx()
> > __ldarx()
> > __stwcx()
> > __stdcx()
> 
> Is there a reason why the __atomic_* builtins don't work?

There are places in our code where we do manipulations of the lockword that
cannot be emulated by the __atomic_* builtins, and thus require us to emit
discrete larx/stcx instructions (with other goodness in between.)

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-13 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #9 from Matt Emmerton ---
(In reply to Segher Boessenkool from comment #6)
> (In reply to Matt Emmerton from comment #4)
> > The intrinsics that we would find useful, having used them as provided by
> > the IBM XL C/C++ compiler, are the following:
> > 
> > __sync()
> > __isync()
> > __lwsync()
> 
> The sync intrinsics need to be tied to some other code.  A volatile asm with
> a "memory" clobber is not good enough, in many cases.

We use these in our internal mutex and atomic implementations, and the
resulting sequences are carefully scrutinized.

> > __lwarx()
> > __ldarx()
> > __stwcx()
> > __stdcx()
> 
> The compiler can always insert memory accesses in between those two, if you
> have them as separate intrinsics (and it will, simply stack accesses for
> temporaries will do, already).  If those accesses hit the same reservation
> granule as the larx/stcx. uses, you lose.
> 
> You need to write the whole sequence in one piece of assembler code.

I would argue that the compiler should be smart enough to realize that these
are part of a decomposed atomic operation, and avoid arbitrary instruction
injection.

As per my previous update, we use these primitives to implement things that the
builtin __atomic_* functions do not implement.

> > __protected_stream_set()
> > __protected_stream_count()
> > __protected_stream_count_depth() // currently not implemented in gcc
> > __protected_stream_go()
> 
> Those are pretty specific to CBE I think?

No.  They are implemented on POWER5 and above (ISA 2.02), and are useful in
managing cache prefetch behaviour.

> > The implementation of stwcx() and stdcx() need revision on PPC.
> > As I understand it, there is no need the mfocrf instruction nor the
> > mask-and-shift on result.
> 
> How else would you output the CR0.EQ bit?

There is no need to copy CR0 to a GPR - branch instructions such as BNE can
operate on CR0 directly.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-23 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #11 from Matt Emmerton ---
> > > > The implementation of stwcx() and stdcx() need revision on PPC.
> > > > As I understand it, there is no need the mfocrf instruction nor the
> > > > mask-and-shift on result.
> > > 
> > > How else would you output the CR0.EQ bit?
> > 
> > There is no need to copy CR0 to a GPR - branch instructions such as BNE can
> > operate on CR0 directly.
> 
> You cannot write anything that maps to a CR field directly.

No need to access it directly - just use a BNE instruction (to branch for
retry/success) which operates implicitly on CR0.EQ.

There is plenty of material out there that implements atomic operations on
POWER like this:

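loop:
    lwarx       // load word and reserve
    // do something
    stwcx.      // store conditional; sets CR0.EQ on success
    bne loop    // retry if the reservation was lost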

gcc does an unnecessary mfocrf + cmp to achieve the same result.

Is there an assumption in gcc that the "result" of any intrinsic is reported in
a GPR, which disallows this implicit use of CR0?

[Bug target/93408] New: PPC: Provide intrinsics for cache prefetch instructions

2020-01-23 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93408

            Bug ID: 93408
           Summary: PPC: Provide intrinsics for cache prefetch
                    instructions
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: memmerto at ca dot ibm.com
  Target Milestone: ---

From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

> > > > __protected_stream_set()
> > > > __protected_stream_count()
> > > > __protected_stream_count_depth() // currently not implemented in gcc
> > > > __protected_stream_go()
> > > 
> > > Those are pretty specific to CBE I think?
> > 
> > No.  They are implemented on POWER5 and above (ISA 2.02), and are useful in
> > managing cache prefetch behaviour.
> 
> Open a separate feature request for these then, please.

This is that request.

[Bug target/93417] New: PPC: Support the "Flag Output Operands" so inline-asm can avoid having to copy CRx to GPR

2020-01-24 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93417

            Bug ID: 93417
           Summary: PPC: Support the "Flag Output Operands" so inline-asm
                    can avoid having to copy CRx to GPR
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: memmerto at ca dot ibm.com
  Target Milestone: ---

From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

> If PowerPC back-end supported the "Flag Output Operands" part
> if GCC's inline-asm, you could use that to do the correct thing.
> But sadly PowerPC does not currently.
> 
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Flag-Output-Operands

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-24 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #14 from Matt Emmerton ---
I'd like to thank everyone for the great discussion so far.
Here's a summary of where we are at this point.

1) sync intrinsics

Useful, but with caveats.

2) cache prefetch intrinsics

Implemented via __builtin_prefetch()

3) larx/stcx intrinsics

Useful, but with caveats.

Improvements to stcx CR handling, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93417

4) streaming cache prefetch intrinsics

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93408