https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-04-16
     Ever confirmed|0                           |1

--- Comment #5 from Martin Liška <marxin at gcc dot gnu.org> ---
I would like to work on this problem. I've read Peter's very detailed
analysis on Stack Overflow and I have a first couple of questions and
observations:

1) I would suggest removing the use of 'rep scasb' altogether; even for
-Os the price paid is quite high.

2) I've made strlen instrumentation for -fprofile-{generate,use} and
collected SPEC2006 statistics for train runs:

Benchmark        strlen calls  executed strlen calls  total executions  avg. strlen
400.perlbench             102                     39           2358804        10.97
401.bzip2                   0                      0                 0
403.gcc                   144                     21              4081          9.3
410.bwaves                  0                      0                 0
416.gamess                  0                      0                 0
429.mcf                     0                      0                 0
433.milc                    3                      0                 0
434.zeusmp                  0                      0                 0
435.gromacs                86                      7                92        12.46
436.cactusADM             110                     46             61788        10.61
437.leslie3d                0                      0                 0
444.namd                    0                      0                 0
445.gobmk                  41                      7             75196         2.01
447.dealII                  3                      0                 0
450.soplex                  8                      6           1161517        25.59
453.povray                 67                     25             54584        33.25
454.calculix               54                      0                 0
456.hmmer                  93                     10                52         15.1
458.sjeng                   0                      0                 0
459.GemsFDTD                0                      0                 0
462.libquantum              0                      0                 0
464.h264ref                12                      1                 1      14274.0
465.tonto                   0                      0                 0
470.lbm                     0                      0                 0
471.omnetpp                50                     15          24291732         9.79
473.astar                   0                      0                 0
481.wrf                    42                     15             20490         9.41
482.sphinx3                23                     11            402963         1.61
483.xalancbmk              27                      3               160        13.04

Columns: benchmark name, # of strlen calls in the benchmark, # of strlen
calls that were executed during the train run, total number of strlen
executions, average strlen.

Note: the 14274.0 value for 464.h264ref is correct:
  content_76 = GetConfigFileContent (filename_53);
  _7 = strlen (content_76);

Based on the numbers, the average string on which strlen is called is
quite short (<32B).

3) The assumption that most strlen arguments have a known 16B alignment
is quite optimistic.
As mentioned, {m,c}alloc returns memory aligned to that, but strlen is
most commonly called on a generic character pointer for which we can't
prove the alignment.

4) Peter's suggested ASM expansion assumes such alignment. I'd expect
somewhat more complex code for the general-alignment case?

5) A strlen call has the advantage that, even when the code is compiled
with -O2 -march=x86-64 (typical distribution options), glibc can use
ifunc to dispatch to an optimized implementation.

6) The decision code in ix86_expand_strlen looks strange to me:

bool
ix86_expand_strlen (rtx out, rtx src, rtx eoschar, rtx align)
{
  rtx addr, scratch1, scratch2, scratch3, scratch4;

  /* The generic case of strlen expander is long.  Avoid it's
     expanding unless TARGET_INLINE_ALL_STRINGOPS.  */

  if (TARGET_UNROLL_STRLEN
      && eoschar == const0_rtx
      && optimize > 1
      && !TARGET_INLINE_ALL_STRINGOPS
      && !optimize_insn_for_size_p ()
      && (!CONST_INT_P (align) || INTVAL (align) < 4))
    return false;

That explains why we generate 'rep scasb' for -O1 (with optimize == 1
the early return above is not taken).

My suggestions:
- I would use a strlen call in all situations.
- Maybe I would instrument strlen calls in -fprofile-generate/use, and
  if there's a strlen call with a very small average size and a known 4B
  alignment, I would generate a 4B loop.

Thoughts?
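
For illustration only, here is a rough C sketch of what a "4B loop" for a known 4-byte-aligned pointer could look like, plus the byte-wise prologue that a general-alignment version would need (point 4). This is not the code GCC emits and not Peter's proposed expansion; the function names strlen_aligned4 and strlen_any are hypothetical. It uses the well-known zero-byte bit trick on 32-bit words. Note the assumption that the aligned word load may read up to 3 bytes past the terminator: an aligned 4-byte load cannot cross a page boundary, but in portable C this is only valid when the buffer is padded to a multiple of 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not GCC code): strlen for a pointer that is
   known to be 4-byte aligned.  Assumes the buffer is readable up to
   the next 4-byte boundary after the terminating NUL.  */
static size_t
strlen_aligned4 (const char *s)
{
  const char *p = s;

  assert (((uintptr_t) s & 3) == 0);

  for (;;)
    {
      uint32_t w;
      /* One aligned 4-byte load; memcpy avoids aliasing problems.  */
      memcpy (&w, p, sizeof w);
      /* Nonzero if and only if at least one byte of w is zero.  */
      if ((w - 0x01010101u) & ~w & 0x80808080u)
        break;
      p += 4;
    }

  /* The terminator is somewhere in the current word; find it.  */
  while (*p != '\0')
    p++;

  return (size_t) (p - s);
}

/* Hypothetical general-alignment variant: step byte by byte until the
   pointer is 4-byte aligned, then switch to the word loop.  */
static size_t
strlen_any (const char *s)
{
  const char *p = s;

  while (((uintptr_t) p & 3) != 0)
    {
      if (*p == '\0')
        return (size_t) (p - s);
      p++;
    }

  return (size_t) (p - s) + strlen_aligned4 (p);
}
```

For the short strings seen in the statistics above, the prologue alone can be up to three byte compares and branches before the word loop starts, which is part of why the general-alignment inline expansion is less attractive than the aligned one.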