https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71321
            Bug ID: 71321
           Summary: [6 regression] x86: worse code for uint8_t % 10 and / 10
           Product: gcc
           Version: 6.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: i386-linux-gnu, x86_64-linux-gnu

If we have an integer in the range 0..99, we can modulo and divide by 10 to
get two decimal digits, then convert to a pair of ASCII bytes with a newline
by adding `00\n`.  When replacing the div and mod with a multiplicative
inverse, gcc 6.1 uses more instructions than gcc 5.3, due to poor instruction
choices.

See also https://godbolt.org/g/vvS5J6

#include <stdint.h>

// assuming little-endian
__attribute__((always_inline))
unsigned cvt_to_2digit(uint8_t i, uint8_t base) {
    return ((i / base) | (uint32_t)(i % base) << 8);
}
// movzbl  %dil, %eax    # 5.3 and 6.1, with -O3 -march=haswell
// div     %sil
// movzwl  %ax, %eax
// At -Os, gcc uses a useless AND eax, 0xFFF instead of a movzx eax, ax.
// I think that's to avoid partial-register stalls?

unsigned cvt_to_2digit_ascii(uint8_t i) {
    return cvt_to_2digit(i, 10) + 0x0a3030;   // + "00\n" converts to ASCII
}

Compiling with -O3 -march=haswell:

## gcc 5.3                        ## gcc 6.1
movzbl  %dil, %edx                movzbl  %dil, %eax
leal    (%rdx,%rdx,4), %ecx       leal    0(,%rax,4), %edx    # requires a 4B zero displacement
leal    (%rdx,%rcx,8), %edx       movl    %eax, %ecx          # lea should let us avoid mov
leal    (%rdx,%rdx,4), %edx       addl    %eax, %edx
                                  leal    (%rcx,%rdx,8), %edx
                                  leal    0(,%rdx,4), %eax    # requires a 4B zero displacement
                                  addl    %eax, %edx
shrw    $11, %dx                  shrw    $11, %dx
leal    (%rdx,%rdx,4), %eax       leal    0(,%rdx,4), %eax    # requires a 4B zero displacement.  gcc 5.3 didn't use any of these
                                  addl    %edx, %eax
movzbl  %dl, %edx                 movzbl  %dl, %edx           # same after this
addl    %eax, %eax                addl    %eax, %eax
subl    %eax, %edi                subl    %eax, %edi
movzbl  %dil, %eax                movzbl  %dil, %eax
sall    $8, %eax                  sall    $8, %eax
orl     %eax, %edx                orl     %eax, %edx
leal    667696(%rdx), %eax        leal    667696(%rdx), %eax

With -mtune=haswell, it's probably best to merge the two bytes with
mov ah, dil or something, rather than movzx/shift/or.  Haswell has no penalty
for partial registers, but still has partial-register renaming to avoid false
dependencies: the best of both worlds.

BTW, with -Os, both gcc versions compile it to:

movb    $10, %dl
movzbl  %dil, %eax
divb    %dl
andl    $4095, %eax     # partial-reg stall.  gcc does this even with -march=core2, where it matters
addl    $667696, %eax

The AND appears to be totally useless, because the upper bytes of eax are
already zero (from the movzbl %dil, %eax before the div).  I thought the
movzwl %ax, %eax in the unknown-divisor version was to avoid partial-register
slowdowns, but maybe it's just based on the possible range of the result.

Off-topic, but I noticed this while writing FizzBuzz in asm:
http://stackoverflow.com/a/37494090/224132
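For reference, here is a minimal C sketch (not from the report; the function
name cvt_to_2digit_ascii_mul and the test harness are mine) of the
multiplicative-inverse identity both compilers are implementing: for any
uint8_t i, i/10 == (i*205) >> 11, because 205/2048 is slightly above 1/10 and
the rounding error never reaches 1 for i <= 255.

#include <stdint.h>
#include <stdio.h>

// Sketch of the strength-reduction gcc performs: divide by 10 via the
// fixed-point inverse 205/2048, then recover the remainder from the
// quotient.  (cvt_to_2digit_ascii_mul is a hypothetical name, not gcc's.)
static unsigned cvt_to_2digit_ascii_mul(uint8_t i)
{
    unsigned q = ((unsigned)i * 205) >> 11;  // i / 10 for all i in 0..255 (identity holds up to 1028)
    unsigned r = i - q * 10;                 // i % 10, one mul+sub instead of a second div
    return (q | r << 8) + 0x0a3030;          // pack the digits, add "00\n" (little-endian)
}

int main(void)
{
    // exhaustive check against real division over the whole uint8_t range
    for (unsigned i = 0; i < 256; i++) {
        unsigned expect = ((i / 10) | (i % 10) << 8) + 0x0a3030;
        if (cvt_to_2digit_ascii_mul((uint8_t)i) != expect) {
            printf("mismatch at %u\n", i);
            return 1;
        }
    }
    puts("ok");
    return 0;
}

This multiply is where the LEA chains in both listings come from: gcc 5.3
builds 205*i as 5*(i + 8*(5*i)) in three LEAs, while gcc 6.1 reaches the same
product through six instructions, including the LEAs that need a 4-byte zero
displacement.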