Hi, As mentioned in PR77445, the improvements to the jump threading cost model this year have caused substantial regressions in the amount of jump threading we do and the performance of workloads which rely on that threading.
This patch represents the low-bar in fixing the performance issues reported in PR77445 - by weakening the cost model enough that we thread in a way much closer to GCC 6. I don't think this patch is likely to be acceptable for trunk, but I'm posting it for consideration regardless. Under the new cost model, if the edge doesn't pass optimize_edge_for_speed_p, then we don't thread. The problem in late threading is bad edge profile data makes the edge look cold, and thus it fails optimize_edge_for_speed_p and is no longer considered a candidate for threading. As an aside, I think this is the wrong cost model for jump threading, where you get the most impact if you can resolve unpredictable switch statements - which by their nature may have multiple cold edges in need of threading. Early threading should avoid these issues, as there is no edge profile info yet. optimize_edge_for_speed_p is therefore more likely to hold, but the condition for threading is: if (speed_p && optimize_edge_for_speed_p (taken_edge)) { if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS)) { [...reject threading...] } } else if (n_insns > 1) { [...reject threading...] } With speed_p is hardwired to false for the early threader ( pass_early_thread_jumps::execute ): find_jump_threads_backwards (bb, false); So we always fall to the n_insns > 1 case and thus only rarely get to thread. In this patch I change that call in pass_early_thread_jumps::execute to instead look at optimize_bb_for_speed_p (bb) . That allows the speed_p check to pass in the main threading cost model, and then the optimize_edge_for_speed_p can also pass. That gets the first stage of jump-threading back working in a proprietary benchmark which is sensitive to this optimisation. To get the rest of the required jump threading, I also have to weaken the cost model - and this is obviously a hack! The easy hack is to special case when the taken edge has frequency zero, and permit jump threading there. I know this patch is likely not the preferred way to fix this. For me that would be a change to the cost model, which as I mentioned above I think misses the point about which edges we want to thread. By far the best fix would be to the junk edge profiling data we create during threading. However, this patch does fix the performance issues identified in PR77445, and does highlight a fundamental issue with the early threader (which doesn't seem to me like it will be effective while it sets speed_p to false), so I'd like it to be considered for trunk if no better fix appears before stage 4. Bootstrapped on x86_64 with no issues. The testsuite changes just reshuffle which passes spot the threading opportunities. OK? Thanks, James --- gcc/ 2016-12-15 James Greenhalgh <james.greenha...@arm.com> PR tree-optimization/77445 * tree-ssa-threadbackward.c (profitable_jump_thread_path) Work around sometimes corrupt edge frequency data. (pass_early_thread_jumps::execute): Pass optimize_bb_for_speed_p as the speed_p parameter to find_jump_threads_backwards to enable threading in more cases. gcc/testsuite/ 2016-12-15 James Greenhalgh <james.greenha...@arm.com> PR tree-optimization/77445 * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust options and dump passes. * gcc.dg/tree-ssa/pr66752-3.c: Likewise.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c index 896c8bf..39ec3d6 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */ +/* { dg-options "-O2 -fdump-tree-ethread-details -fdump-tree-dce2" } */ extern int status, pt; extern int count; @@ -34,7 +34,7 @@ foo (int N, int c, int b, int *a) /* There are 4 FSM jump threading opportunities, all of which will be realized, which will eliminate testing of FLAG, completely. */ -/* { dg-final { scan-tree-dump-times "Registering FSM" 4 "thread1"} } */ +/* { dg-final { scan-tree-dump-times "Registering FSM" 4 "ethread"} } */ /* There should be no assignments or references to FLAG, verify they're eliminated as early as possible. */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c index 9a9d1cb..5b087fb 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c @@ -1,8 +1,9 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 16" "thread1" } } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread2" } } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 3" "thread3" } } */ +/* { dg-options "-O2 -fdump-tree-ethread-stats -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 16" "ethread" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 6" "thread1" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 4" "thread2" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 2" "thread3" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom3" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "vrp2" } } */ diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c index 203e20e..0d29ab5 100644 --- a/gcc/tree-ssa-threadbackward.c +++ b/gcc/tree-ssa-threadbackward.c @@ -311,7 +311,20 @@ profitable_jump_thread_path (vec<basic_block, va_gc> *&path, return NULL; } - if (speed_p && optimize_edge_for_speed_p (taken_edge)) + + /* FIXME: Edge frequency can get badly out shape as a result of + the jump threading passes. In those cases, + EDGE_FREQUENCY (taken_edge) == 0 , and so trivially fails the + test for optimize_edge_for_speed_p. The correct fix would + be to ensure that profiling information coming out of jump threading + is meaningful, but in lieu of that add a hack check to this cost model + which permits jump threading in the case EDGE_FREQUENCY has been + corrupted. Only do this if the profile info is present and corrupt, + not if it is absent. */ + if (speed_p + && (optimize_edge_for_speed_p (taken_edge) + || (profile_status_for_fn (cfun) != PROFILE_ABSENT + && EDGE_FREQUENCY (taken_edge) == 0))) { if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS)) { @@ -870,7 +883,7 @@ pass_early_thread_jumps::execute (function *fun) FOR_EACH_BB_FN (bb, fun) { if (EDGE_COUNT (bb->succs) > 1) - find_jump_threads_backwards (bb, false); + find_jump_threads_backwards (bb, optimize_bb_for_speed_p (bb)); } thread_through_all_blocks (true); return 0;