https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67213
--- Comment #4 from Fredrik Hederstierna <fredrik.hederstie...@securitas-direct.com> ---

I've investigated this issue further, and I believe the problem might be that we allow too many iterations when doing complete peeling of loops on ARM.

The heuristic in "tree-ssa-loop-ivcanon.c" for estimating the unrolled cost/size, "estimated_unrolled_size()", is quite rough: it seems to simply assume that later passes will reduce the unrolled code to 2/3 of its size. This is not always true, and I think it can lead to larger code after complete peeling of loops (as in the example in this issue). It seems very difficult to estimate the final size after complete peeling, especially across all architectures. I've experimented with 3/4 when optimizing for size, but the results became worse.

One solution that works for me is to lower the limit on the number of iterations that complete peeling may unroll. I made the patch below and it worked. (The same thing is done in "spu.c" for the SPU architecture when small code size is wanted.)

In function "arm_option_override (void)":

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index c868490..2ba8244 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
+
+  /* Small loops might be completely unpeeled even at -Os.
+     Try to keep code small.  */
+  if (optimize_function_for_size_p (cfun)
+      && !flag_unroll_loops && !flag_peel_loops)
+    maybe_set_param_value (PARAM_MAX_COMPLETELY_PEEL_TIMES, 4,
+                           global_options.x_param_values,
+                           global_options_set.x_param_values);

I simply override max-completely-peel-times to be 4 instead of the default 16, and this seems to work well. I tested it with the CSiBE benchmark on arm/thumb1/thumb2 and got shorter code on all tests, with no negative results on any function.

What do you think, is this an okay solution for this issue? The best long-term solution would be to estimate the cost/size of unrolling better, but that seems like a much more difficult problem to solve.

/Fredrik
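
For illustration only, the interaction between the 2/3 size assumption and the peel-times limit discussed above can be sketched as a toy model. This is not the GCC implementation: the function names, the `max_insns` budget and the simple "body size times iterations" formula are all assumptions made for this sketch. Only the 2/3 factor and the limits of 16 and 4 come from the comment.

```c
/* Toy model (hypothetical names, NOT the code in tree-ssa-loop-ivcanon.c).
   Assumption: the fully peeled loop costs body_insns * n_iterations,
   and later passes are expected to shrink that to 2/3.  */
int toy_estimated_unrolled_size(int body_insns, int n_iterations)
{
    int unrolled = body_insns * n_iterations;
    int estimate = unrolled * 2 / 3;      /* the rough 2/3 assumption */
    return estimate > 0 ? estimate : 1;   /* never report a zero size */
}

/* Returns 1 if the toy model would allow complete peeling, 0 otherwise.
   max_peel_times plays the role of max-completely-peel-times;
   max_insns is an assumed size budget for the peeled code.  */
int toy_may_peel_completely(int body_insns, int n_iterations,
                            int max_peel_times, int max_insns)
{
    if (n_iterations > max_peel_times)
        return 0;                         /* too many iterations to peel */
    return toy_estimated_unrolled_size(body_insns, n_iterations) <= max_insns;
}
```

In this model a 6-instruction body with 16 iterations is estimated at 64 instructions and is peeled under a limit of 16, but rejected under a limit of 4. If the 2/3 reduction does not materialize, the real result is closer to 96 instructions, which is the kind of growth described in this report. The limit can also be tried without patching the compiler, by passing `--param max-completely-peel-times=4` together with `-Os`.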