https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71321
Bug ID: 71321
Summary: [6 regression] x86: worse code for uint8_t % 10 and / 10
Product: gcc
Version: 6.1.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu, x86_64-linux-gnu
If we have an integer (0..99), we can modulo and divide by 10 to get two
decimal digits, then convert them to a pair of ASCII digits plus a newline by
adding the constant `00\n` (0x0a3030 on a little-endian target).  When
replacing the div and mod with a multiplicative inverse, gcc 6.1 uses more
instructions than gcc 5.3, due to poor choices.
See also https://godbolt.org/g/vvS5J6
#include <stdint.h>
// assuming little-endian
__attribute__((always_inline))
unsigned cvt_to_2digit(uint8_t i, uint8_t base) {
    return ((i / base) | (uint32_t)(i % base)<<8);
}
// movzbl  %dil,%eax      # 5.3 and 6.1, with -O3 -march=haswell
// div     %sil
// movzwl  %ax,%eax
// at -Os, gcc uses a useless AND eax, 0xFFF instead of a movzx eax,ax.
// I think to avoid partial-register stalls?

unsigned cvt_to_2digit_ascii(uint8_t i) {
    return cvt_to_2digit(i, 10) + 0x0a3030;   // + "00\n" converts to ASCII
}
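To make the intended result concrete, here is a quick standalone check
(little-endian assumed, as noted above; the main() is only for illustration,
not part of the testcase):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint8_t i = 42;
    // same computation as cvt_to_2digit_ascii(i) above
    unsigned r = ((i / 10u) | (uint32_t)(i % 10u) << 8) + 0x0a3030;   // = 0x000a3234
    char buf[5] = {0};
    memcpy(buf, &r, 4);                 // little-endian bytes: '4' '2' '\n' '\0'
    fputs(buf, stdout);                 // prints "42\n"
    return 0;
}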
Compiling with -O3 -march=haswell
   ## gcc 5.3                          ## gcc 6.1
    movzbl  %dil, %edx                  movzbl  %dil, %eax
    leal    (%rdx,%rdx,4), %ecx         leal    0(,%rax,4), %edx    # requires a 4B zero displacement
    leal    (%rdx,%rcx,8), %edx         movl    %eax, %ecx          # lea should let us avoid mov
    leal    (%rdx,%rdx,4), %edx         addl    %eax, %edx
                                        leal    (%rcx,%rdx,8), %edx
                                        leal    0(,%rdx,4), %eax    # requires a 4B zero displacement
                                        addl    %eax, %edx
    shrw    $11, %dx                    shrw    $11, %dx
    leal    (%rdx,%rdx,4), %eax         leal    0(,%rdx,4), %eax    # requires a 4B zero displacement.  gcc5.3 didn't use any of these
                                        addl    %edx, %eax
    movzbl  %dl, %edx                   movzbl  %dl, %edx           # same after this
    addl    %eax, %eax                  addl    %eax, %eax
    subl    %eax, %edi                  subl    %eax, %edi
    movzbl  %dil, %eax                  movzbl  %dil, %eax
    sall    $8, %eax                    sall    $8, %eax
    orl     %eax, %edx                  orl     %eax, %edx
    leal    667696(%rdx), %eax          leal    667696(%rdx), %eax
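Both columns compute the same strength-reduced division, just with different
instruction selection.  Reading the multiplier and shift count back out of the
asm (205 = 5*41, 41 = 1 + 8*5, then shrw $11), the transform is roughly this
(my reconstruction; the helper name is just for illustration):

#include <stdint.h>

// (i * 205) >> 11 == i / 10 for all i in 0..255, since 205/2048 is a close
// enough over-approximation of 1/10 for that input range.
static unsigned udiv10_u8(uint8_t i, uint8_t *rem)
{
    unsigned q = ((unsigned)i * 205) >> 11;   // quotient via multiplicative inverse
    *rem = (uint8_t)(i - q * 10);             // remainder recovered as i - 10*(i/10)
    return q;
}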
with -mtune=haswell, it's prob. best to merge with mov ah, dil or
something, rather than movzx/shift/or. Haswell has no penalty for
partial-registers, but still has partial-reg renaming to avoid false
dependencies: the best of both worlds.
BTW, with -Os, both gcc versions compile it to
    movb    $10, %dl
    movzbl  %dil, %eax
    divb    %dl
    andl    $4095, %eax       # partial reg stall.  gcc does this even with -march=core2 where it matters
    addl    $667696, %eax
The AND appears to be totally useless, because the upper bytes of eax are
already zero (from the movzbl %dil, %eax before the div).  I thought the
movzwl %ax, %eax in the unknown-divisor version was to avoid partial-register
slowdowns, but maybe it's just based on the possible range of the result.
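Modeling the -Os sequence step by step in C (my annotation of what each
instruction leaves in eax, assuming i <= 255) makes the redundancy explicit:

#include <stdint.h>

static unsigned cvt_os_model(uint8_t i)     // hypothetical model of the -Os asm above
{
    unsigned eax  = i;                      // movzbl %dil,%eax : bits 8..31 are zero
    unsigned quot = eax / 10;               // divb %dl : AL = quotient, at most 25
    unsigned rem  = eax % 10;               //            AH = remainder, at most 9
    eax = (rem << 8) | quot;                // eax <= 0x0919, so bits 12..31 are all zero
    eax &= 4095;                            // andl $4095,%eax : can't clear anything
    return eax + 0x0a3030;                  // addl $667696,%eax
}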
Off-topic, but I noticed this while writing FizzBuzz in asm.
http://stackoverflow.com/a/37494090/224132