https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550
--- Comment #7 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #5)
> (In reply to cqwrteur from comment #4)
> > (In reply to Andrew Pinski from comment #2)
> > > Looks like it is a register allocation/scheduling issue. The extra
> > > instructions are mov.
> >
> > Are there good algos that can allocate registers optimal?
>
> note the move instructions might be "free" on most modern x86 machine, it
> just takes up icache space and decode time.
> having so little registers and having a 2 operand instruction set makes
> register allocation a hard problem really. Yes LLVM might get it right in
> this testcase but there are others where GCC might do a better job.

https://github.com/openssl/openssl/blob/38288f424faa0cf61bd705c497bb1a1657611da1/crypto/sha/asm/sha512-x86_64.pl#L18

OpenSSL's comments:

# 40% improvement over compiler-generated code on Opteron. On EM64T
# sha256 was observed to run >80% faster and sha512 - >40%. No magical
# tricks, just straight implementation... I really wonder why gcc
# [being armed with inline assembler] fails to generate as fast code.
# The only thing which is cool about this module is that it's very
# same instruction sequence used for both SHA-256 and SHA-512. In
# former case the instructions operate on 32-bit operands, while in
# latter - on 64-bit ones. All I had to do is to get one flavor right,
# the other one passed the test right away:-)
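
To illustrate the last point of the OpenSSL comment (the same instruction sequence serving both SHA-256 and SHA-512), here is a minimal C++17 sketch. It is not OpenSSL's code and not the testcase from this bug; the names rotate_right, ch, maj and big_sigma0 are made up for illustration. The rotation amounts are the ones from the SHA-2 specification (FIPS 180-4). Only the word type and the rotation constants differ between the two flavors.

```cpp
#include <cstdint>

// Hypothetical helper: rotate right by n bits, valid for 0 < n < width of T.
template <typename T>
constexpr T rotate_right(T x, unsigned n) {
    constexpr unsigned bits = sizeof(T) * 8;
    return static_cast<T>((x >> n) | (x << (bits - n)));
}

// Ch and Maj have the same shape for 32-bit and 64-bit words.
template <typename T>
constexpr T ch(T x, T y, T z) {
    return static_cast<T>((x & y) ^ (~x & z));
}

template <typename T>
constexpr T maj(T x, T y, T z) {
    return static_cast<T>((x & y) ^ (x & z) ^ (y & z));
}

// Big Sigma0: same three-rotate-and-xor sequence, different constants.
// SHA-256 (32-bit words): 2, 13, 22.  SHA-512 (64-bit words): 28, 34, 39.
template <typename T>
constexpr T big_sigma0(T x) {
    if constexpr (sizeof(T) == 4) {
        return static_cast<T>(rotate_right(x, 2) ^ rotate_right(x, 13) ^
                              rotate_right(x, 22));
    } else {
        static_assert(sizeof(T) == 8, "only 32-bit and 64-bit words");
        return static_cast<T>(rotate_right(x, 28) ^ rotate_right(x, 34) ^
                              rotate_right(x, 39));
    }
}
```

Each of these helpers has several inputs that stay live across the expression (x is used three times in big_sigma0). On a 2-operand instruction set the destination overwrites one source, so the compiler has to copy x before rotating it. That is presumably where extra mov instructions like the ones discussed in comment #2 tend to come from, though I have not checked this sketch against the generated code.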