On Thu, May 15, 2014 at 12:26 AM, bin.cheng <[email protected]> wrote:
> Hi,
> Targets like ARM and AARCH64 support double-word load/store instructions,
> and these instructions are generally faster than the corresponding two
> single load/stores. GCC currently uses peephole2 to merge paired load/stores
> into a single instruction, which has a disadvantage: it can only handle
> simple cases in which the two instructions appear back to back in the
> instruction stream, and it is too weak to handle cases in which the two
> load/stores are separated by other, unrelated instructions.
>
> This patch adds a new GCC pass that looks through each basic block and
> merges paired load/stores even if they are not adjacent to each other. The
> algorithm is pretty simple:
> 1) In an initialization pass over the instruction stream, it collects
> the relevant memory access information for each instruction.
> 2) It iterates over each basic block and tries to find a possible partner
> instruction for each memory access instruction. While doing so, it checks
> dependencies between the two candidate instructions and also records the
> information indicating how to pair them. To avoid quadratic behavior, it
> introduces a new parameter, max-merge-paired-loadstore-distance, and sets
> its default value to 4, which is large enough to catch the major part of
> the opportunities on ARM/cortex-a15.
> 3) For each candidate pair, it calls the back-end's hook to do a
> target-dependent check and merges the two instructions if possible.
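
To make the windowed search in step 2) concrete, here is a minimal
sketch. It is not code from the patch: the Insn struct, the function name
find_candidate_pairs and the "adjacent offsets off the same base register"
test are all assumptions made for illustration, and the dependence checks
and the target hook from steps 2) and 3) are left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified model of one instruction in a basic block.
struct Insn {
    bool is_mem;    // true if this is a load or store we may pair
    bool is_store;  // store vs. load (only meaningful when is_mem)
    int base_reg;   // base register of the address
    long offset;    // byte offset from the base register
    int size;       // access size in bytes
};

// Return index pairs (i, j), i < j, of accesses that look pairable:
// same kind, same base register, same size, adjacent addresses, and at
// most max_distance instructions apart. Bounding the inner loop by
// max_distance keeps the search linear in the block size.
std::vector<std::pair<std::size_t, std::size_t>>
find_candidate_pairs(const std::vector<Insn>& bb, std::size_t max_distance) {
    assert(max_distance >= 1 && "window must cover at least the next insn");
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    std::vector<bool> used(bb.size(), false);

    for (std::size_t i = 0; i < bb.size(); ++i) {
        if (!bb[i].is_mem || used[i])
            continue;
        // Only look at the next max_distance instructions.
        const std::size_t limit = std::min(bb.size(), i + max_distance + 1);
        for (std::size_t j = i + 1; j < limit; ++j) {
            if (!bb[j].is_mem || used[j])
                continue;
            const bool same_shape = bb[i].is_store == bb[j].is_store &&
                                    bb[i].base_reg == bb[j].base_reg &&
                                    bb[i].size == bb[j].size;
            const bool adjacent =
                bb[j].offset == bb[i].offset + bb[i].size ||
                bb[i].offset == bb[j].offset + bb[j].size;
            if (same_shape && adjacent) {
                pairs.emplace_back(i, j);
                used[i] = used[j] = true;
                break;  // each access joins at most one pair
            }
        }
    }
    return pairs;
}
```

With a window of 4, two stores separated by one unrelated instruction are
still found; with a window of 1 only back-to-back accesses are, which is
roughly what peephole2 sees today.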
>
> Even with the parameter set to 4, on miscellaneous benchmarks this pass
> merges numerous opportunities beyond the ones already merged by peephole2
> (roughly as many again as peephole2 itself catches). A GCC bootstrap
> confirms this finding.
>
> There is still an open issue about when we should run this new pass. Though
> register renaming is disabled by default now, I put this pass after it,
> because renaming can resolve some false dependencies and thus benefit this
> pass. Another finding is that the pass captures a lot more opportunities if
> it runs after sched2, but I am not sure whether it would mess up the
> scheduling results that way.
>
> So, any comments about this?
>
> Thanks,
> bin
>
>
> 2014-05-15 Bin Cheng <[email protected]>
> * common.opt (flag_merge_paired_loadstore): New option.
> * merge-paired-loadstore.c: New file.
> * Makefile.in: Support new file.
> * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
> (load_latency_expanded_p, arm_merge_paired_loadstore): New functions.
> * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
> * doc/invoke.texi (-fmerge-paired-loadstore): New.
> (max-merge-paired-loadstore-distance): New.
> * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
> * doc/tm.texi: Regenerated.
> * target.def (merge_paired_loadstore): New.
> * tree-pass.h (make_pass_merge_paired_loadstore): New decl.
> * passes.def (pass_merge_paired_loadstore): New pass.
> * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.
>
> gcc/testsuite/ChangeLog
> 2014-05-15 Bin Cheng <[email protected]>
>
> * gcc.target/arm/merge-paired-loadstore.c: New test.
>
Here is a testcase on x86-64:
---
struct Foo
{
  Foo (double x0, double x1, double x2)
  {
    data[0] = x0;
    data[1] = x1;
    data[2] = x2;
  }
  double data[3];
};
const Foo f1 (0.0, 0.0, 1.0);
const Foo f2 (1.0, 0.0, 0.0);
struct Bar
{
  Bar (float x0, float x1, float x2, float x3, float x4)
  {
    data[0] = x0;
    data[1] = x1;
    data[2] = x2;
    data[3] = x3;
    data[4] = x4;
  }
  float data[5];
};
const Bar b1 (0.0, 0.0, 0.0, 0.0, 1.0);
const Bar b2 (1.0, 0.0, 0.0, 0.0, 0.0);
---
We generate
	xorpd	%xmm0, %xmm0
	movsd	.LC1(%rip), %xmm1
	movsd	%xmm0, _ZL2f1(%rip)
	movsd	%xmm0, _ZL2f1+8(%rip)
	movsd	%xmm0, _ZL2f2+8(%rip)
	movsd	%xmm0, _ZL2f2+16(%rip)
	xorps	%xmm0, %xmm0
	movsd	%xmm1, _ZL2f1+16(%rip)
	movsd	%xmm1, _ZL2f2(%rip)
	movss	.LC3(%rip), %xmm1
	movss	%xmm0, _ZL2b1(%rip)
	movss	%xmm0, _ZL2b1+4(%rip)
	movss	%xmm0, _ZL2b1+8(%rip)
	movss	%xmm0, _ZL2b1+12(%rip)
	movss	%xmm1, _ZL2b1+16(%rip)
	movss	%xmm1, _ZL2b2(%rip)
	movss	%xmm0, _ZL2b2+4(%rip)
	movss	%xmm0, _ZL2b2+8(%rip)
	movss	%xmm0, _ZL2b2+12(%rip)
	movss	%xmm0, _ZL2b2+16(%rip)
There are pairs of movsd and sets of 4 movss that store the same register
to adjacent addresses. The pass should be able to handle more than 2
load/store insns.
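
As a rough illustration of what "more than 2" could mean, the sketch below
groups stores into maximal runs instead of pairs. It is only a sketch under
assumptions: the Store and Run types and find_store_runs are made-up names,
the stores are assumed to be already sorted by symbol and offset (the
instruction stream above is not), and whether a target can really emit one
wide store for a run (alignment, available insns) is not checked.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one store of a constant to a static object.
struct Store {
    std::string symbol;  // e.g. "f1" or "b1"
    long offset;         // byte offset within the object
    int size;            // store size in bytes
    double value;        // constant being stored
};

// A run of `count` consecutive stores starting at index `first`.
struct Run {
    std::size_t first;
    std::size_t count;
};

// Find maximal runs of two or more stores that write the same value to
// back-to-back addresses in the same object. Input is assumed sorted by
// (symbol, offset).
std::vector<Run> find_store_runs(const std::vector<Store>& stores) {
    std::vector<Run> runs;
    std::size_t i = 0;
    while (i < stores.size()) {
        assert(stores[i].size > 0);
        std::size_t j = i + 1;
        while (j < stores.size() &&
               stores[j].symbol == stores[j - 1].symbol &&
               stores[j].size == stores[j - 1].size &&
               stores[j].offset == stores[j - 1].offset + stores[j - 1].size &&
               stores[j].value == stores[j - 1].value) {
            ++j;
        }
        if (j - i >= 2)
            runs.push_back({i, j - i});
        i = j;  // continue after the run (or after the single store)
    }
    return runs;
}
```

Fed with the stores for f1 and b1 from the testcase, this reports one run
of 2 (the two zero doubles in f1) and one run of 4 (the four zero floats in
b1), matching the groups visible in the assembly.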
--
H.J.