https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71321

            Bug ID: 71321
           Summary: [6 regression] x86: worse code for uint8_t % 10 and /
                    10
           Product: gcc
           Version: 6.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: i386-linux-gnu, x86_64-linux-gnu

If we have an integer in the range 0..99, we can divide and modulo by 10 to
get the two decimal digits, then convert them to a pair of ASCII bytes plus a
newline by adding `00\n` (0x0a3030, little-endian).  When replacing the div
and mod with a multiplicative inverse, gcc 6.1 uses more instructions than
gcc 5.3, due to poor instruction choices.

See also https://godbolt.org/g/vvS5J6

#include <stdint.h>
// assuming little-endian
__attribute__((always_inline)) 
unsigned cvt_to_2digit(uint8_t i, uint8_t base) {
  return ((i / base) | (uint32_t)(i % base)<<8);
}
  // movzbl %dil,%eax    # 5.3 and 6.1, with -O3 -march=haswell
  // div    %sil
  // movzwl %ax,%eax

// At -Os, gcc uses a useless  and eax, 0xFFF  instead of a  movzx eax, ax.
// I think to avoid partial-register stalls?
unsigned cvt_to_2digit_ascii(uint8_t i) {
  return cvt_to_2digit(i, 10) + 0x0a3030;    // + "00\n" converts to ASCII
}
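
As a quick sanity check (not part of the report, and assuming a little-endian
target as the comment above says), a throwaway main() that exercises it:

#include <stdio.h>
#include <string.h>
int main(void) {
  for (unsigned i = 0; i < 100; i++) {
    unsigned v = cvt_to_2digit_ascii((uint8_t)i);
    char buf[4] = {0};
    memcpy(buf, &v, 3);     // byte 0 = tens digit, byte 1 = ones digit, byte 2 = '\n'
    fputs(buf, stdout);     // e.g. prints "42\n" for i == 42
  }
  return 0;
}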

Compiling with -O3 -march=haswell
        ## gcc 5.3                         ## gcc 6.1
        movzbl  %dil, %edx                 movzbl  %dil, %eax
        leal    (%rdx,%rdx,4), %ecx        leal    0(,%rax,4), %edx    # requires a 4B zero displacement
        leal    (%rdx,%rcx,8), %edx        movl    %eax, %ecx          # lea should let us avoid mov
        leal    (%rdx,%rdx,4), %edx        addl    %eax, %edx
                                           leal    (%rcx,%rdx,8), %edx
                                           leal    0(,%rdx,4), %eax    # requires a 4B zero displacement
                                           addl    %eax, %edx
        shrw    $11, %dx                   shrw    $11, %dx
        leal    (%rdx,%rdx,4), %eax        leal    0(,%rdx,4), %eax    # requires a 4B zero displacement.  gcc5.3 didn't use any of these
                                           addl    %edx, %eax
        movzbl  %dl, %edx                  movzbl  %dl, %edx           # same after this
        addl    %eax, %eax                 addl    %eax, %eax
        subl    %eax, %edi                 subl    %eax, %edi
        movzbl  %dil, %eax                 movzbl  %dil, %eax
        sall    $8, %eax                   sall    $8, %eax
        orl     %eax, %edx                 orl     %eax, %edx
        leal    667696(%rdx), %eax         leal    667696(%rdx), %eax
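
Both sequences implement the same fixed-point reciprocal; roughly, in C (this
is my reading of the asm above, not gcc's internal form):

  unsigned tens = (i * 205u) >> 11;   // 205/2048 ~= 1/10; exact for i in 0..255
  unsigned ones = i - tens * 10;      // remainder via multiply-and-subtract
  unsigned res  = (ones << 8 | tens) + 0x0a3030;

gcc 5.3 builds the *205 out of three LEAs (x*5, then x + 8*(5x) = 41x, then
41x*5); gcc 6.1 computes the same products but splits some of them into
lea+add pairs and an extra mov, which is where the extra instructions come
from.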

With -mtune=haswell, it's probably best to merge with something like
mov ah, dil, rather than movzx/shift/or.  Haswell has no penalty for partial
registers, but still renames partial registers to avoid false dependencies:
the best of both worlds.
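
For illustration only (not output from either compiler, and assuming the
remainder has been copied into a legacy byte register such as %cl first,
since AH can't be encoded in the same instruction as %dil, which needs a
REX prefix), the tail could look something like:

        movb    %cl, %ah            # merge remainder into byte 1 of %eax
        addl    $667696, %eax       # + 0x0a3030; Haswell handles the partial-reg merge without a stall

with the quotient already zero-extended into %eax.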



BTW, with -Os, both gcc versions compile it to

        movb    $10, %dl
        movzbl  %dil, %eax
        divb    %dl
        andl    $4095, %eax      # partial-reg stall.  gcc does this even with -march=core2, where it matters
        addl    $667696, %eax

The AND appears to be totally useless: the upper two bytes of eax are already
zero (from the movzbl %dil, %eax before the div), and the remainder div
leaves in AH is always less than the divisor (10), so the 0xFFF mask can't
clear anything.  I thought the movzwl %ax, %eax in the unknown-divisor
version was there to avoid partial-register slowdowns, but maybe it's just
based on the possible range of the result.
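
If the AND really is redundant, the -Os sequence could presumably drop it and
become just:

        movb    $10, %dl
        movzbl  %dil, %eax       # zeroes bits 8..31 of %eax
        divb    %dl              # AL = quotient, AH = remainder (< 10)
        addl    $667696, %eax    # + 0x0a3030 ("00\n")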

Off-topic, but I noticed this while writing FizzBuzz in asm. 
http://stackoverflow.com/a/37494090/224132
