https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78543

Michael Meissner <meissner at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #41035|0                           |1
        is obsolete|                            |

--- Comment #20 from Michael Meissner <meissner at gcc dot gnu.org> ---
Created attachment 41050
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41050&action=edit
Proposed patch to fix the problem (rework)

This patch reworks the original patch I submitted to make it less hacky.  It
splits the bswap insns, where there is hardware support, into separate load,
store, and register swap insns.  Without the split, the register allocators
will try to push a bswap value that is in a register to the stack and then do
the swap with a byte-reversed load.  Reload fumbles in certain conditions.
LRA generates working code, but a store followed by a byte-reversed load from
the same location can be slower than doing the operation on registers.

I only did this optimization where we had the hardware support (i.e. bswap for
HImode all of the time, bswap for SImode all of the time, and bswap for DImode
if we are executing 64-bit instructions and the machine has LDBRX/STDBRX --
power7 and newer/cell ppc).
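
To illustrate the three cases, here is a minimal C sketch (not taken from the
patch or its test suite; the function names are made up for illustration).  It
uses the GCC builtins __builtin_bswap16/32/64.  Which instructions the PowerPC
back end actually picks for each function depends on the target options and
the optimization level, so the comments only say which form each function is
a candidate for.

```c
#include <assert.h>
#include <stdint.h>

/* Load form: the value comes from memory and is swapped on the way in.
   Candidate for a byte-reversed load.  */
uint16_t load_swapped16(const uint16_t *p) { return __builtin_bswap16(*p); }
uint32_t load_swapped32(const uint32_t *p) { return __builtin_bswap32(*p); }
uint64_t load_swapped64(const uint64_t *p) { return __builtin_bswap64(*p); }

/* Store form: the value is swapped on the way out to memory.
   Candidate for a byte-reversed store.  */
void store_swapped32(uint32_t *p, uint32_t v) { *p = __builtin_bswap32(v); }

/* Register form: both the input and the result live in registers, so
   ideally no trip through the stack is needed.  */
uint32_t swap_in_register32(uint32_t a, uint32_t b)
{
  return __builtin_bswap32(a + b);
}

/* Hypothetical self-check that the three forms agree on the result.  */
void bswap_selfcheck(void)
{
  uint32_t in = 0x11223344u, out = 0;
  store_swapped32(&out, in);
  assert(load_swapped32(&in) == out);
  assert(swap_in_register32(in, 0) == out);
}
```

Compiling this with -O2 -S and comparing the assembly before and after a
change is one way to see which form was chosen for each function.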

I have done bootstrap builds on a little endian power8 system, on a big endian
power8 system, and on a big endian power7 system (with both 32-bit and 64-bit
support on this last system).  There were no regressions.

I am building the patches applied to gcc 6 right now.  The patches apply
cleanly to gcc 6.  I suspect they will also build on gcc 5.

I built the SPEC 2006 benchmarks with the compiler.  There are 12 benchmarks
that generate one or more load/store with byte swap instructions (perlbench,
gcc, gamess, milc, zeusmp, calculix, h264ref, tonto, omnetpp, wrf, sphinx3,
xalancbmk).

I compared the instructions generated.  10 of the benchmarks generated the same
instructions.

Milc generated 1 fewer load with byte swap instruction and 1 more store with
byte swap instruction.

Sphinx3 generated 6 fewer load with byte swap instructions and 6 more store
with byte swap instructions.

So I count this as generating the same level of byte swapping.
