On Tue, Jun 20, 2017 at 12:36 PM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 11:20 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, Jun 16, 2017 at 6:15 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Fri, Jun 16, 2017 at 11:21 AM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Mon, Jun 12, 2017 at 7:03 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> For now, loop distribution handles variables used outside of the loop
>>>>> as reductions.  This is inaccurate because all partitions contain the
>>>>> statements defining induction variables.
>>>>
>>>> But final induction values are usually not used outside of the loop...
>>> This is actually for an induction variable which is used outside of the
>>> loop.
>>>>
>>>> What is missing is loop distribution trying to change partition order.
>>>> In fact we somehow assume we can move a reduction across a detected
>>>> builtin (I don't remember if we ever check for validity of that...).
>>> Hmm, I am not sure when we can't.  If there is any dependence between
>>> builtin/reduction partitions, it should be captured by RDG or PG,
>>> otherwise the partitions are independent and can be freely ordered as
>>> long as reduction partition is scheduled last?
>>>>
>>>>> Ideally we should factor out scev-propagation as a standalone interface
>>>>> which can be called when necessary.  Before that, this patch simply
>>>>> works around the reduction issue by checking if the statement belongs
>>>>> to all partitions.  If yes, the reduction must be computed in the last
>>>>> partition no matter how the loop is distributed.
>>>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>>>
>>>> stmt_in_all_partitions is not kept up-to-date during partition merging
>>>> and if merging makes the reduction partition(s) pass the
>>>> stmt_in_all_partitions test your simple workaround doesn't work ...
>>> I think it doesn't matter because:
>>>   A) it's really a workaround for induction variables.  In general,
>>> induction variables are included by all partitions.
>>>   B) After classifying partitions, we immediately fuse all reduction
>>> partitions.  More stmt_in_all_partitions means we are fusing
>>> non-reduction partition with reduction partition, so the newly
>>> generated (stmt_in_all_partitions) are actually not reduction
>>> statements.  The workaround won't work anyway even if the bitmap is
>>> maintained.
>>>>
>>>> As written it's a valid optimization but can you please note its
>>>> limitation in some comment please?
>>> Yeah, I will add a comment explaining it.
>> Comment added in new version patch.  It also computes bitmap outside
>> now, is it OK?
>
> Ok.  Can you add a testcase for this as well please?  I think the
> series up to this is now fully reviewed, I deferred 1/n (the new IFN)
> to the last one containing the runtime versioning.  Can you re-post
> that (you can merge with the IFN patch) to apply after the series has
> been applied up to this?
Test case added.

Thanks,
bin

2017-06-20  Bin Cheng  <bin.ch...@arm.com>

	* tree-loop-distribution.c (classify_partition): New parameter and
	better handle reduction statement.
	(rdg_build_partitions): Revise comment.
	(distribute_loop): Compute statements in all partitions and pass it
	to classify_partition.

gcc/testsuite/ChangeLog
2017-06-20  Bin Cheng  <bin.ch...@arm.com>

	* gcc.dg/tree-ssa/ldist-26.c: New test.
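
For readers following the thread, the classification rule discussed above can be sketched outside of GCC. This is only a minimal illustration, not the patch's code: struct toy_partition, stmts_in_all_partitions and toy_classify are hypothetical names, and a 64-bit mask stands in for GCC's bitmap type. It assumes at most 64 statements and at least one partition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a GCC partition: bit N of STMTS is set when
   statement N belongs to the partition.  Not GCC's real data structure.  */
struct toy_partition
{
  uint64_t stmts;
  bool reduction_p;
};

/* Intersect the statement sets of all N partitions, in the spirit of the
   bitmap_copy/bitmap_and_into loop the patch adds to distribute_loop.  */
static uint64_t
stmts_in_all_partitions (const struct toy_partition *parts, size_t n)
{
  assert (n > 0);
  uint64_t common = parts[0].stmts;
  for (size_t i = 1; i < n; ++i)
    common &= parts[i].stmts;
  return common;
}

/* Mark PART as a reduction only if it contains a statement that is used
   outside of the loop (USED_OUTSIDE) and is not shared by all partitions
   (COMMON).  A statement present in every partition, such as an induction
   variable update, no longer forces the reduction flag.  */
static void
toy_classify (struct toy_partition *part, uint64_t used_outside,
              uint64_t common)
{
  part->reduction_p = (part->stmts & used_outside & ~common) != 0;
}
```

With two partitions whose statement sets are 0x7 and 0x19, the only shared statement is bit 0. If bit 0 is the one used after the loop, neither partition is marked as a reduction; if bit 4 is used after the loop instead, the second partition is.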

From b16a4839f3211737dccc3ff92ab2c4f325907cd3 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Thu, 22 Jun 2017 17:16:58 +0100
Subject: [PATCH 11/13] reduction-workaround-20170607.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c | 36 ++++++++++++++++++++++++++
 gcc/tree-loop-distribution.c             | 43 ++++++++++++++++++++++++--------
 2 files changed, 68 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c
new file mode 100644
index 0000000..3a69884
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+
+extern void abort (void);
+
+int a[130], b[128], c[128];
+
+int __attribute__((noinline,noclone))
+foo (int len, int x)
+{
+  int i;
+  for (i = 1; i <= len; ++i)
+    {
+      a[i] = a[i + 2] + 1;
+      b[i] = 0;
+      a[i + 1] = a[i] - 3;
+      if (i < x)
+        c[i] = a[i];
+    }
+  return i;
+}
+
+int main()
+{
+  int i;
+  for (i = 0; i < 130; ++i)
+    a[i] = i;
+  foo (127, 67);
+  if (a[0] != 0 || a[1] != 4 || a[127] != 130)
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "distributed: split to 2 loops and 0 library calls" "ldist" } } */
+/* { dg-final { scan-tree-dump "distributed: split to 1 loops and 1 library calls" "ldist" } } */
+/* { dg-final { scan-tree-dump "generated memset zero" "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 87fdc15..b15ec04 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1254,17 +1254,18 @@ build_rdg_partition_for_vertex (struct graph *rdg, int v)
 }
 
 /* Classifies the builtin kind we can generate for PARTITION of RDG and LOOP.
-   For the moment we detect only the memset zero pattern.  */
+   For the moment we detect memset, memcpy and memmove patterns.  Bitmap
+   STMT_IN_ALL_PARTITIONS contains statements belonging to all partitions.  */
 
 static void
-classify_partition (loop_p loop, struct graph *rdg, partition *partition)
+classify_partition (loop_p loop, struct graph *rdg, partition *partition,
+                    bitmap stmt_in_all_partitions)
 {
   bitmap_iterator bi;
   unsigned i;
   tree nb_iter;
   data_reference_p single_load, single_store;
-  bool volatiles_p = false;
-  bool plus_one = false;
+  bool volatiles_p = false, plus_one = false, has_reduction = false;
 
   partition->kind = PKIND_NORMAL;
   partition->main_dr = NULL;
@@ -1279,16 +1280,31 @@ classify_partition (loop_p loop, struct graph *rdg, partition *partition)
       if (gimple_has_volatile_ops (stmt))
         volatiles_p = true;
 
-      /* If the stmt has uses outside of the loop mark it as reduction.  */
+      /* If the stmt is not included by all partitions and there is uses
+         outside of the loop, then mark the partition as reduction.  */
       if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
         {
-          partition->reduction_p = true;
-          return;
+          /* Due to limitation in the transform phase we have to fuse all
+             reduction partitions.  As a result, this could cancel valid
+             loop distribution especially for loop that induction variable
+             is used outside of loop.  To workaround this issue, we skip
+             marking partition as reduction if the reduction stmt belongs
+             to all partitions.  In such case, reduction will be computed
+             correctly no matter how partitions are fused/distributed.  */
+          if (!bitmap_bit_p (stmt_in_all_partitions, i))
+            {
+              partition->reduction_p = true;
+              return;
+            }
+          has_reduction = true;
         }
     }
 
   /* Perform general partition disqualification for builtins.  */
   if (volatiles_p
+      /* Simple workaround to prevent classifying the partition as builtin
+         if it contains any use outside of loop.  */
+      || has_reduction
       || !flag_tree_loop_distribute_patterns)
     return;
 
@@ -1461,9 +1477,9 @@ share_memory_accesses (struct graph *rdg,
   return false;
 }
 
-/* Aggregate several components into a useful partition that is
-   registered in the PARTITIONS vector.  Partitions will be
-   distributed in different loops.  */
+/* For each seed statement in STARTING_STMTS, this function builds
+   partition for it by adding depended statements according to RDG.
+   All partitions are recorded in PARTITIONS.  */
 
 static void
 rdg_build_partitions (struct graph *rdg,
@@ -1731,10 +1747,15 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
   auto_vec<struct partition *, 3> partitions;
   rdg_build_partitions (rdg, stmts, &partitions);
 
+  auto_bitmap stmt_in_all_partitions;
+  bitmap_copy (stmt_in_all_partitions, partitions[0]->stmts);
+  for (i = 1; partitions.iterate (i, &partition); ++i)
+    bitmap_and_into (stmt_in_all_partitions, partitions[i]->stmts);
+
   any_builtin = false;
   FOR_EACH_VEC_ELT (partitions, i, partition)
     {
-      classify_partition (loop, rdg, partition);
+      classify_partition (loop, rdg, partition, stmt_in_all_partitions);
       any_builtin |= partition_builtin_p (partition);
     }
 
-- 
1.9.1
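
To make the test case more concrete, here is a hand-written sketch of what distributing the loop in foo could look like at the source level. It is an assumption about the general shape of the transformation, not actual output of the ldist pass: foo_fused and foo_distributed are hypothetical names, and the arrays are passed as parameters (a with N + 2 elements, b and c with N) so the two versions can be compared. The point is that the value of i returned after the loop is the same whether or not the stores to b are split off into a memset.

```c
#include <assert.h>
#include <string.h>

enum { N = 128 };

/* The loop from ldist-26.c, parameterized on the arrays.  */
static int
foo_fused (int *a, int *b, int *c, int len, int x)
{
  assert (len >= 0 && len < N);
  int i;
  for (i = 1; i <= len; ++i)
    {
      a[i] = a[i + 2] + 1;
      b[i] = 0;
      a[i + 1] = a[i] - 3;
      if (i < x)
        c[i] = a[i];
    }
  return i;
}

/* Hand-distributed variant: the stores to B do not depend on A or C, so
   they are replaced by one memset, and the remaining statements stay in
   a single loop that still computes the final value of I.  */
static int
foo_distributed (int *a, int *b, int *c, int len, int x)
{
  assert (len >= 0 && len < N);
  int i;
  memset (&b[1], 0, (size_t) len * sizeof (int));
  for (i = 1; i <= len; ++i)
    {
      a[i] = a[i + 2] + 1;
      a[i + 1] = a[i] - 3;
      if (i < x)
        c[i] = a[i];
    }
  return i;
}
```

Calling both functions with len = 127 and x = 67 on identically initialized arrays should leave a, b and c identical and return the same i, which is the property the workaround relies on when the induction variable is used after the loop.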