https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479
jreiser at bitwagon dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jreiser at bitwagon dot com

--- Comment #12 from jreiser at bitwagon dot com ---
Working well with valgrind(memcheck) might be worth more than a slight increase in speed. How much faster [measured] is the inline version than calling the strcmp() closed subroutine, and over what distributions of inputs?

I see the inline version uses 10 registers (3,4,7,8,9,10,26,28,30,31) and at least 332 bytes of instructions, assuming at least one instruction at .L10 [not shown] and 7 repetitions of the block at .L22 (for bytes 9 through 64 in 8-byte chunks). At first glance that seems expensive.

Could much the same speed be obtained by re-coding the strcmp() closed subroutine to use the technique of the inlining? Then valgrind(memcheck) could intercept and re-direct the whole subroutine easily [by name], avoiding tedious analysis.

"addi 31,31,1" at .L11+8 is dead. The opcode 'xor.' might use less energy (no carry chain) than 'subf.'
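
For illustration only, here is a minimal C sketch of what "the technique of the inlining" might look like inside a closed subroutine. It is not the code GCC emits and not glibc's strcmp(). The names `chunked_strcmp` and `has_zero_byte` and the `readable` parameter are hypothetical. The sketch assumes the caller guarantees that both buffers are readable for at least `readable` bytes, which sidesteps the question of how an 8-byte load is kept from running past the end of accessible memory. Equality of two chunks is tested with XOR rather than subtraction, mirroring the 'xor.' versus 'subf.' remark.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero iff at least one byte of v is 0x00. */
static inline uint64_t has_zero_byte(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/*
 * Hypothetical closed subroutine with strcmp() semantics.
 * Assumption: the caller guarantees that a[0..readable-1] and
 * b[0..readable-1] are readable, so 8-byte loads inside that
 * range are safe even past the terminating NUL.
 */
int chunked_strcmp(const char *a, const char *b, size_t readable)
{
    size_t i = 0;

    /* Compare 8 bytes at a time while a whole chunk fits in the range. */
    while (i + 8 <= readable) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, 8);   /* alignment-safe 8-byte load */
        memcpy(&wb, b + i, 8);
        /* XOR is nonzero iff the chunks differ; no borrow chain needed. */
        if ((wa ^ wb) != 0 || has_zero_byte(wa))
            break;               /* difference or terminator is in this chunk */
        i += 8;
    }

    /* Resolve the result byte by byte, exactly as a plain strcmp() would. */
    for (;; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ca != cb)
            return (ca > cb) - (ca < cb);
        if (ca == 0)
            return 0;
    }
}
```

Because the wide loads live in one named function rather than being expanded at every call site, a tool could in principle redirect the whole routine by name, which is the point raised in the comment.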