[Bug c/95661] New: Code built with -m32 uses SSE2 instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95661

            Bug ID: 95661
           Summary: Code built with -m32 uses SSE2 instructions
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: memmerto at ca dot ibm.com
  Target Milestone: ---

When building 32-bit code with -m32, SSE2 instructions are generated. This contradicts the docs at https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html:

"The -m32 option sets int, long, and pointer types to 32 bits, and generates code that runs on any i386 system."

In particular, the code for floating-point/integer conversions appears to use SSE2 instructions.

Compiler:

$ /opt/rh/devtoolset-8/root/usr/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/opt/rh/devtoolset-8/root/usr/bin/gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)

Compile Command:

gcc -c -m32 t.c -save-temps -fverbose-asm -o t.o

Testcase:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double d = 100.0;
    int i = (int)d;
    printf("%d\n", i);
}

Assembly Fragment:

# t.c:7: int i = (int)d;
        movsd   -16(%ebp), %xmm0    # d, tmp90
        cvttsd2si %xmm0, %eax       # tmp90, tmp91
        movl    %eax, -20(%ebp)     # tmp91, i

I would expect 387 instructions to be generated (since -mfpmath=387 is the default for 32-bit targets).
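For comparison, a sketch of the 387 sequence the reporter expected here, roughly what GCC emits for the same conversion when SSE2 is not used (e.g. with -m32 -mfpmath=387 on a pre-SSE2 -march); the stack offsets and exact control-word handling are illustrative, not actual compiler output:

```
# t.c:7: int i = (int)d;  (illustrative x87 code)
        fldl    -16(%ebp)         # load d onto the x87 stack
        fnstcw  -26(%ebp)         # save the FP control word
        movzwl  -26(%ebp), %eax
        orw     $0x0c00, %ax      # set rounding mode to truncate
        movw    %ax, -28(%ebp)
        fldcw   -28(%ebp)
        fistpl  -20(%ebp)         # convert and store into i
        fldcw   -26(%ebp)         # restore the control word
```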
[Bug target/95661] Code built with -m32 uses SSE2 instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95661 --- Comment #4 from Matt Emmerton --- Thank you everyone. This fully explains why we were still getting SSE in 32-bit mode.
[Bug target/93177] New: PPC: Missing many useful platform intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

            Bug ID: 93177
           Summary: PPC: Missing many useful platform intrinsics
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: memmerto at ca dot ibm.com
  Target Milestone: ---

File gcc/config/rs6000/ppu_intrinsics.h defines a lot of useful intrinsics for PPC, but the heading on this file indicates that it is specific to the "Cell BEA", which is a PPC derivative. The #define guards at the top of the file suggest that the file was intended for both ppu (Cell) and ppc/ppc64 (PowerPC/POWER) configurations.

It would be very useful if this file could be installed on all powerpc targets, or perhaps cloned to ppc_intrinsics.h and have that installed on powerpc.
[Bug target/93177] PPC: Missing many useful platform intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #2 from Matt Emmerton ---

This has vendor packaging complications as well :( On powerpc-ibm-aix7.1.0.0 the header doesn't get installed; on ppc64le-redhat-linux it does. However, both of these cases would benefit from something targeted specifically at PPC, rather than PPU/Cell.
[Bug target/93177] PPC: Missing many useful platform intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #4 from Matt Emmerton ---

The intrinsics that we would find useful, having used them as provided by the IBM XL C/C++ compiler, are the following:

__sync()
__isync()
__lwsync()
__dcbt()
__dcbtst()
__lwarx()
__ldarx()
__stwcx()
__stdcx()
__protected_stream_set()
__protected_stream_count()
__protected_stream_count_depth() // currently not implemented in gcc
__protected_stream_go()

The implementation of stwcx() and stdcx() needs revision on PPC. As I understand it, there is no need for the mfocrf instruction nor for the mask-and-shift on the result.
[Bug target/93177] PPC: Missing many useful platform intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #8 from Matt Emmerton ---

(In reply to Andrew Pinski from comment #5)
> > __lwarx()
> > __ldarx()
> > __stwcx()
> > __stdcx()
>
> Is there a reason why the __atomic_* builtins don't work?

There are places in our code where we do manipulations of the lockword that cannot be emulated by the __atomic_* builtins, and thus require us to emit discrete larx/stcx instructions (with other goodness in between).
[Bug target/93177] PPC: Missing many useful platform intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #9 from Matt Emmerton ---

(In reply to Segher Boessenkool from comment #6)
> (In reply to Matt Emmerton from comment #4)
> > The intrinsics that we would find useful, having used them as provided by
> > the IBM XL C/C++ compiler, are the following:
> >
> > __sync()
> > __isync()
> > __lwsync()
>
> The sync intrinsics need to be tied to some other code. A volatile asm with
> a "memory" clobber is not good enough, in many cases.

We use these in our internal mutex and atomic implementations, and the resulting sequences are carefully scrutinized.

> > __lwarx()
> > __ldarx()
> > __stwcx()
> > __stdcx()
>
> The compiler can always insert memory accesses in between those two, if you
> have them as separate intrinsics (and it will; simple stack accesses for
> temporaries will do, already). If those accesses hit the same reservation
> granule as the larx/stcx. uses, you lose.
>
> You need to write the whole sequence in one piece of assembler code.

I would argue that the compiler should be smart enough to realize that these are part of a decomposed atomic operation, and avoid arbitrary instruction injection. As per my previous update, we use these primitives to implement things that the builtin __atomic_* functions do not implement.

> > __protected_stream_set()
> > __protected_stream_count()
> > __protected_stream_count_depth() // currently not implemented in gcc
> > __protected_stream_go()
>
> Those are pretty specific to CBE I think?

No. They are implemented on POWER5 and above (ISA 2.02), and are useful in managing cache prefetch behaviour.

> > The implementation of stwcx() and stdcx() need revision on PPC.
> > As I understand it, there is no need the mfocrf instruction nor the
> > mask-and-shift on result.
>
> How else would you output the CR0.EQ bit?

There is no need to copy CR0 to a GPR - branch instructions such as BNE can operate on CR0 directly.
[Bug target/93177] PPC: Missing many useful platform intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #11 from Matt Emmerton ---

> > > > The implementation of stwcx() and stdcx() need revision on PPC.
> > > > As I understand it, there is no need the mfocrf instruction nor the
> > > > mask-and-shift on result.
> > >
> > > How else would you output the CR0.EQ bit?
> >
> > There is no need to copy CR0 to a GPR - branch instructions such as BNE can
> > operate on CR0 directly.
>
> You cannot write anything that maps to a CR field directly.

No need to access it directly - just use a BNE instruction (to branch for retry/success), which operates implicitly on CR0.EQ. There is plenty of material out there that implements atomic operations on POWER like this:

loop:
    lwarx
    // do something
    stwcx.
    bne loop

gcc does an unnecessary mfocrf + cmp to achieve the same result.

Is there an assumption in gcc that the "result" of any intrinsic is reported in a GPR, which disallows this implicit use of CR0?
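For reference, the kind of sequence under discussion, written as one block so nothing can be scheduled into the reservation window (illustrative PowerPC assembly for an atomic increment; the register numbers are arbitrary):

```
loop:   lwarx   r9,0,r3       # load word and set the reservation
        addi    r10,r9,1      # arbitrary update, kept inside the sequence
        stwcx.  r10,0,r3      # store-conditional; sets CR0.EQ on success
        bne-    loop          # retry straight off CR0: no mfocrf, no cmp
```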
[Bug target/93408] New: PPC: Provide intrinsics for cache prefetch instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93408

            Bug ID: 93408
           Summary: PPC: Provide intrinsics for cache prefetch instructions
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: memmerto at ca dot ibm.com
  Target Milestone: ---

From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

> > > > __protected_stream_set()
> > > > __protected_stream_count()
> > > > __protected_stream_count_depth() // currently not implemented in gcc
> > > > __protected_stream_go()
> > >
> > > Those are pretty specific to CBE I think?
> >
> > No. They are implemented on POWER5 and above (ISA 2.02), and are useful in
> > managing cache prefetch behaviour.
>
> Open a separate feature request for these then, please.

This is that request.
[Bug target/93417] New: PPC: Support the "Flag Output Operands" so inline-asm can avoid having to copy CRx to GPR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93417

            Bug ID: 93417
           Summary: PPC: Support the "Flag Output Operands" so inline-asm
                    can avoid having to copy CRx to GPR
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: memmerto at ca dot ibm.com
  Target Milestone: ---

From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

> If the PowerPC back-end supported the "Flag Output Operands" part
> of GCC's inline-asm, you could use that to do the correct thing.
> But sadly PowerPC does not currently.
>
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Flag-Output-Operands
[Bug target/93177] PPC: Missing many useful platform intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #14 from Matt Emmerton ---

I'd like to thank everyone for the great discussion so far. Here's a summary of where we are at this point.

1) sync intrinsics
   Useful, but with caveats.

2) cache prefetch intrinsics
   Implemented via __builtin_prefetch().

3) larx/stcx intrinsics
   Useful, but with caveats. Improvements to stcx CR handling:
   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93417

4) streaming cache prefetch intrinsics
   See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93408