http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57236

            Bug ID: 57236
           Summary: Missed optimization: weird pointer update after the
                    loop
           Product: gcc
           Version: 4.8.1
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: petschy at gmail dot com

In short:
In a loop, I write to and increment a pointer in each iteration. Then, after
the loop, I write to the pointer once more. I noticed that after the loop,
instead of using the pointer and writing to it, the generated code calculates
the resulting pointer from the loop count and the value of the pointer before
the loop. This is clearly unneeded work.

In detail:
The attached code was reduced from a variable-length integer I/O lib, using
binary streams. Bytes go into a buffer in the stream class, and get flushed
when the buffer is full. Stream::WriteU8() writes a single byte: checks whether
there is place in the buffer, flushes if needed and stores the byte. Calling
this fn from loops is ineffective, as the flush check is performed on each
call. To work around this, there is the Txn class which allows the flush check
to be amortized: checks only once in the ctor, and the stream buffer ptr is
updated in the dtor with the number of bytes written.

write1() uses Stream::WriteU8() directly, the buffer pointer gets loaded/stored
on each iteration. I tried to Ensure() the needed number of bytes before the
loop, but that didn't eliminate the loads/stores, I guess that this is too hard
to track in the optimizer, though I'd be interested to read comments about
this.

write2() is the same, except Stream::Txn is used. The pointer is loaded once,
written to and incremented, so far so good. But when the loop exits, comes the
weird part. The generated code looks like:

   0x000000000040078d <+141>:   mov    %rsi,%rdi
   0x0000000000400790 <+144>:   mov    %r12d,%ecx
   0x0000000000400793 <+147>:   mov    %ebp,%r8d
   0x0000000000400796 <+150>:   add    $0x1,%rdi
   0x000000000040079a <+154>:   shr    %cl,%r8d
   0x000000000040079d <+157>:   and    $0x7f,%r8d
   0x00000000004007a1 <+161>:   mov    %r8b,-0x1(%rdi)
   0x00000000004007a5 <+165>:   sub    $0x7,%cl
   0x00000000004007a8 <+168>:   jne    0x400793 <_Z6write2R6Streamj+147>
loop ends here, when cl is zero, then
   0x00000000004007aa <+170>:   mov    $0xffffffb7,%ecx
   0x00000000004007af <+175>:   mov    %r12d,%eax
   0x00000000004007b2 <+178>:   imul   %ecx,%eax
   0x00000000004007b5 <+181>:   sub    $0x1,%eax
   0x00000000004007b8 <+184>:   movzbl %al,%eax
   0x00000000004007bb <+187>:   lea    0x1(%rsi,%rax,1),%rsi
jump to the last WriteU8() after the loop
   0x00000000004007c0 <+192>:   jmpq   0x40072e <_Z6write2R6Streamj+46>

The source is:
        while (UNLIKELY(b)) {
                txn.WriteU8((v >> b) & 0x7f);
                b -= 7;
        }
        txn.WriteU8(v | 0x80);

If I undertand correctly, that imul calculates the pointer increment from the
loop count, which is added to rsi (the pointer value before the loop). However,
the pointer is readily available in rdi.

clang 3.4 generates the expected code: after the loop, just jumps and uses rax
for the final write:
   0x0000000000400723 <+179>:   mov    %rdx,%rax
   0x0000000000400726 <+182>:   mov    %bl,%cl
   0x0000000000400728 <+184>:   mov    %ebp,%esi
   0x000000000040072a <+186>:   shr    %cl,%esi
   0x000000000040072c <+188>:   and    $0x7f,%sil
   0x0000000000400730 <+192>:   mov    %sil,(%rax)
   0x0000000000400733 <+195>:   inc    %rax
   0x0000000000400736 <+198>:   add    $0xf9,%bl
   0x0000000000400739 <+201>:   jne    0x400726 <write2(Stream&, unsigned
int)+182>
   0x000000000040073b <+203>:   jmpq   0x40069f <write2(Stream&, unsigned
int)+47>

For the full disasm dumps, see the attached files.

I'm just guessing that this is a tree-optimization issue, please change the
component if needed.

On a side note: are there any benefits of pre-incrementing the pointer and then
writing to offset -1? The insn is 4 bytes instead of 3, does the better
scheduling (?) justify the code size increment?

Versions tested:
g++-4.8.1 -v
Using built-in specs.
COLLECT_GCC=g++-4.8.1
COLLECT_LTO_WRAPPER=/home/usr-local/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.8.1/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --enable-languages=c,c++ --program-suffix=-4.8.1
Thread model: posix
gcc version 4.8.1 20130427 (prerelease) (GCC) 

g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5'
--with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs
--enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.7 --enable-shared --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object
--enable-plugin --enable-objc-gc --with-arch-32=i586 --with-tune=generic
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5) 

 g++-4.9.0 -v
Using built-in specs.
COLLECT_GCC=g++-4.9.0
COLLECT_LTO_WRAPPER=/home/usr-local/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --enable-languages=c,c++ --program-suffix=-4.9.0
Thread model: posix
gcc version 4.9.0 20130510 (experimental) (GCC) 

Regards, Peter

Reply via email to