Hi All,
When the patch for PR114074 was applied we saw a good boost in exchange2.
This boost was partially caused by a simplification of the addressing modes.
With the patch applied, IVopts saw the following form for the base address:
Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
324) + 36)
vs what we normally get:
Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
* 81) + 9) * 4
This is because that patch promoted multiplies where one operand is a constant
from a signed multiply to an unsigned one, in an attempt to fold away the
constant.
This patch attempts the same, but due to the various problems with SCEV and
niters not being able to analyze the resulting forms (e.g. PR114322) we can't
do it during SCEV, nor in the general form that extract_muldiv in fold-const
attempts.
Instead this applies the simplification during IVopts initialization when we
create the IV.  Essentially, when we know the IV won't overflow with regard to
niters, we perform an affine fold, which simplifies the internal computation
even though it is signed, because we know that for IVopts' uses the IV won't
ever overflow.  This allows IVopts to see the simplified form without
influencing the rest of the compiler.
As mentioned in PR114074, it would be good to fix the missed optimization in
the other passes so we can perform this in general.
The reason this has a big impact on Fortran code is that Fortran doesn't seem
to have unsigned integer types.  As such, all its addressing is created with
signed types, and folding does not happen on them due to the possible overflow.
Concretely, on AArch64 this changes the generated code from:
mov x27, -108
mov x24, -72
mov x23, -36
add x21, x1, x0, lsl 2
add x19, x20, x22
.L5:
add x0, x22, x19
add x19, x19, 324
ldr d1, [x0, x27]
add v1.2s, v1.2s, v15.2s
str d1, [x20, 216]
ldr d0, [x0, x24]
add v0.2s, v0.2s, v15.2s
str d0, [x20, 252]
ldr d31, [x0, x23]
add v31.2s, v31.2s, v15.2s
str d31, [x20, 288]
bl digits_20_
cmp x21, x19
bne .L5
into:
.L5:
ldr d1, [x19, -108]
add v1.2s, v1.2s, v15.2s
str d1, [x20, 216]
ldr d0, [x19, -72]
add v0.2s, v0.2s, v15.2s
str d0, [x20, 252]
ldr d31, [x19, -36]
add x19, x19, 324
add v31.2s, v31.2s, v15.2s
str d31, [x20, 288]
bl digits_20_
cmp x21, x19
bne .L5
The two patches together result in a 10% performance increase in exchange2 in
SPECCPU 2017, a 4% reduction in binary size, and a 5% improvement in compile
time.  There's also a 5% performance improvement in fotonik3d and a similar
reduction in binary size.
Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
Ok for master?
Thanks,
Tamar
gcc/ChangeLog:
PR tree-optimization/114932
	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
gcc/testsuite/ChangeLog:
PR tree-optimization/114932
* gfortran.dg/addressing-modes_1.f90: New test.
---
diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
new file mode 100644
index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
@@ -0,0 +1,37 @@
+! { dg-do compile { target aarch64-*-* } }
+! { dg-additional-options "-w -Ofast" }
+
+ module brute_force
+ integer, parameter :: r=9
+ integer block(r, r, 0)
+ contains
+ subroutine brute
+ do
+ do
+ do
+ do
+ do
+ do
+ do i7 = l0, 1
+ select case(1 )
+ case(1)
+ block(:2, 7:, 1) = block(:2, 7:, i7) - 1
+ end select
+ do i8 = 1, 1
+ do i9 = 1, 1
+ if(1 == 1) then
+ call digits_20
+ end if
+ end do
+ end do
+ end do
+ end do
+ end do
+ end do
+ end do
+ end do
+ end do
+ end
+ end
+
+! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e35f3d3a0d52470811 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
}
+ /* If we know the IV won't overflow wrt niters and the type is an unsigned
+ type then fold using affine unsigned arithmetic to allow more folding of
+ constants. */
+ if (no_overflow
+ && TYPE_UNSIGNED (TREE_TYPE (expr)))
+ {
+ aff_tree comb;
+ tree utype = unsigned_type_for (TREE_TYPE (expr));
+ tree_to_aff_combination (expr, utype, &comb);
+ base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
+ }
+
iv->base = base;
iv->base_object = determine_base_object (data, base);
iv->step = step;
--