On 14/11/16 14:25, Kyrill Tkachov wrote:

On 11/11/16 15:31, Kyrill Tkachov wrote:

On 11/11/16 10:17, Kyrill Tkachov wrote:

On 10/11/16 23:39, Segher Boessenkool wrote:
On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
some interesting swings.
458.sjeng     +1.45%
471.omnetpp   +2.19%
445.gobmk     -2.01%

On SPECFP:
453.povray    +7.00%

Wow, this looks really good.  Thank you for implementing this.  If I
get some time I am going to try it out on processors other than the
A72, but I doubt I'll have time any time soon.
I'd love to hear what causes the slowdown for gobmk as well, btw.

I haven't yet gotten a direct answer for that (through performance
analysis tools), but I have noticed that load/store pairs are not
generated as aggressively as I hoped.  They are being merged by the
sched fusion pass and the peepholes (which run after this pass), but
they still miss cases.  I've hacked the SWS hooks to generate pairs
explicitly, which increases the number of pairs and helps code size to
boot.  It complicates the logic of the hooks a bit, but not too much.
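
To give a flavour of the hack (a simplified sketch rather than the
actual hunk; it assumes regno and regno2 are two consecutive wrapped
components whose save slots are adjacent, and it elides the
frame-pointer offset adjustment and the CFA notes), the epilogue
components hook can reuse the existing aarch64_gen_load_pair helper:

  /* Sketch only: regno and regno2 are assumed adjacent components;
     offsets are taken relative to sp, CFA notes omitted.  */
  machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
  rtx reg1 = gen_rtx_REG (mode, regno);
  rtx reg2 = gen_rtx_REG (mode, regno2);
  HOST_WIDE_INT off = cfun->machine->frame.reg_offset[regno];
  rtx mem1 = gen_frame_mem (mode,
                            plus_constant (Pmode, stack_pointer_rtx,
                                           off));
  rtx mem2 = gen_frame_mem (mode,
                            plus_constant (Pmode, stack_pointer_rtx,
                                           off + GET_MODE_SIZE (mode)));
  /* The two slots are adjacent, so a single LDP restores both
     registers; the store case in the prologue hook is symmetric,
     using aarch64_gen_store_pair.  */
  emit_insn (aarch64_gen_load_pair (mode, reg1, mem1, reg2, mem2));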

I'll make those changes and re-benchmark; hopefully that will help
performance.


And here's a version that explicitly emits pairs.  I've looked at the
assembly codegen on SPEC2006 and it generates quite a few more LDP/STP
pairs than the original version.  I kicked off benchmarks over the
weekend to see the effect.  Andrew, if you want to try it out (more
benchmarking and testing always welcome), this is the one to try.


And I discovered over the weekend that gamess and wrf have validation
errors.  This version runs correctly.  SPECINT results were fine,
though, and there is even a small overall gain due to sjeng and
omnetpp.  However, gobmk still has the regression.  I'll rerun SPECFP
with this patch (it's really just a small bugfix over the previous
version) and get on with analysing gobmk.


After looking at the gobmk numbers with performance counters, the
regression looks like increased icache pressure: I see a rise in icache
misses.  That looks to me like an effect of the code size increase,
though the increase is not that large (0.4% with SWS).  Branch
mispredicts also go up a bit, but not as much as the icache misses.
I don't think there's anything we, or at least this patch, can do
about it.  Overall, there's a slight improvement in SPECINT even with
the gobmk regression, and a slightly larger improvement on SPECFP due
to povray.

Segher, one curious artifact I spotted while looking at codegen
differences in gobmk was a case where we fail to emit load-pairs as
effectively in the epilogue and its preceding basic block.
So before we had this epilogue:
.L43:
    ldp    x21, x22, [sp, 16]
    ldp    x23, x24, [sp, 32]
    ldp    x25, x26, [sp, 48]
    ldp    x27, x28, [sp, 64]
    ldr    x30, [sp, 80]
    ldp    x19, x20, [sp], 112
    ret

and I see this becoming (among numerous other changes in the function):

.L69:
    ldp    x21, x22, [sp, 16]
    ldr    x24, [sp, 40]
.L43:
    ldp    x25, x26, [sp, 48]
    ldp    x27, x28, [sp, 64]
    ldr    x23, [sp, 32]
    ldr    x30, [sp, 80]
    ldp    x19, x20, [sp], 112
    ret

So this is better in the cases where we jump straight to .L43, because
we load fewer registers, but worse when we jump or fall through to
.L69, because x23 and x24 are now restored using two loads rather than
a single load-pair.  This hunk isn't critical to performance in gobmk,
though.

Given that there is an overall gain, is this ok for trunk?
https://gcc.gnu.org/ml/gcc-patches/2016-11/msg01352.html

Thanks,
Kyrill


2016-11-11  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

    * config/aarch64/aarch64.h (machine_function): Add
    reg_is_wrapped_separately field.
    * config/aarch64/aarch64.c (emit_set_insn): Change return type to
    rtx_insn *.
    (aarch64_save_callee_saves): Don't save registers that are wrapped
    separately.
    (aarch64_restore_callee_saves): Don't restore registers that are
    wrapped separately.
    (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
    aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
    (aarch64_get_separate_components): New function.
    (aarch64_get_next_set_bit): Likewise.
    (aarch64_components_for_bb): Likewise.
    (aarch64_disqualify_components): Likewise.
    (aarch64_emit_prologue_components): Likewise.
    (aarch64_emit_epilogue_components): Likewise.
    (aarch64_set_handled_components): Likewise.
    (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
    TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
    TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
    TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
    TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
    TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
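
Of the hooks listed above, the heart of the mechanism is
aarch64_get_separate_components.  Condensed, and with some details
elided (the patch at the link above is authoritative; the real version
also has to avoid the register used for the stack adjustment), it
looks roughly like this:

  /* Condensed sketch of the hook; details elided.  */
  static sbitmap
  aarch64_get_separate_components (void)
  {
    sbitmap components = sbitmap_alloc (LAST_SAVED_REGNUM + 1);
    bitmap_clear (components);

    /* Offer as a separately-wrapped component every callee-save whose
       slot is addressable with a single scaled-offset load/store.  */
    for (unsigned regno = R0_REGNUM; regno <= LAST_SAVED_REGNUM; regno++)
      if (aarch64_register_saved_on_entry (regno))
        {
          HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
          if (offset_12bit_unsigned_scaled_p (DImode, offset))
            bitmap_set_bit (components, regno);
        }

    /* Don't mess with the hard frame pointer.  */
    if (frame_pointer_needed)
      bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);

    return components;
  }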

