http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57236
Bug ID: 57236 Summary: Missed optimization: weird pointer update after the loop Product: gcc Version: 4.8.1 Status: UNCONFIRMED Severity: minor Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: petschy at gmail dot com In short: In a loop, I write to and increment a pointer in each iteration. Then, after the loop, I write to the pointer once more. I noticed that after the loop, instead of using the pointer and writing to it, the generated code calculates the resulting pointer from the loop count and the value of the pointer before the loop. This is clearly unneeded work. In detail: The attached code was reduced from a variable-length integer I/O lib, using binary streams. Bytes go into a buffer in the stream class, and get flushed when the buffer is full. Stream::WriteU8() writes a single byte: checks whether there is place in the buffer, flushes if needed and stores the byte. Calling this fn from loops is ineffective, as the flush check is performed on each call. To work around this, there is the Txn class which allows the flush check to be amortized: checks only once in the ctor, and the stream buffer ptr is updated in the dtor with the number of bytes written. write1() uses Stream::WriteU8() directly, the buffer pointer gets loaded/stored on each iteration. I tried to Ensure() the needed number of bytes before the loop, but that didn't eliminate the loads/stores, I guess that this is too hard to track in the optimizer, though I'd be interested to read comments about this. write2() is the same, except Stream::Txn is used. The pointer is loaded once, written to and incremented, so far so good. But when the loop exits, comes the weird part. The generated code looks like: 0x000000000040078d <+141>: mov %rsi,%rdi 0x0000000000400790 <+144>: mov %r12d,%ecx 0x0000000000400793 <+147>: mov %ebp,%r8d 0x0000000000400796 <+150>: add $0x1,%rdi 0x000000000040079a <+154>: shr %cl,%r8d 0x000000000040079d <+157>: and $0x7f,%r8d 0x00000000004007a1 <+161>: mov %r8b,-0x1(%rdi) 0x00000000004007a5 <+165>: sub $0x7,%cl 0x00000000004007a8 <+168>: jne 0x400793 <_Z6write2R6Streamj+147> loop ends here, when cl is zero, then 0x00000000004007aa <+170>: mov $0xffffffb7,%ecx 0x00000000004007af <+175>: mov %r12d,%eax 0x00000000004007b2 <+178>: imul %ecx,%eax 0x00000000004007b5 <+181>: sub $0x1,%eax 0x00000000004007b8 <+184>: movzbl %al,%eax 0x00000000004007bb <+187>: lea 0x1(%rsi,%rax,1),%rsi jump to the last WriteU8() after the loop 0x00000000004007c0 <+192>: jmpq 0x40072e <_Z6write2R6Streamj+46> The source is: while (UNLIKELY(b)) { txn.WriteU8((v >> b) & 0x7f); b -= 7; } txn.WriteU8(v | 0x80); If I undertand correctly, that imul calculates the pointer increment from the loop count, which is added to rsi (the pointer value before the loop). However, the pointer is readily available in rdi. clang 3.4 generates the expected code: after the loop, just jumps and uses rax for the final write: 0x0000000000400723 <+179>: mov %rdx,%rax 0x0000000000400726 <+182>: mov %bl,%cl 0x0000000000400728 <+184>: mov %ebp,%esi 0x000000000040072a <+186>: shr %cl,%esi 0x000000000040072c <+188>: and $0x7f,%sil 0x0000000000400730 <+192>: mov %sil,(%rax) 0x0000000000400733 <+195>: inc %rax 0x0000000000400736 <+198>: add $0xf9,%bl 0x0000000000400739 <+201>: jne 0x400726 <write2(Stream&, unsigned int)+182> 0x000000000040073b <+203>: jmpq 0x40069f <write2(Stream&, unsigned int)+47> For the full disasm dumps, see the attached files. I'm just guessing that this is a tree-optimization issue, please change the component if needed. On a side note: are there any benefits of pre-incrementing the pointer and then writing to offset -1? The insn is 4 bytes instead of 3, does the better scheduling (?) justify the code size increment? Versions tested: g++-4.8.1 -v Using built-in specs. COLLECT_GCC=g++-4.8.1 COLLECT_LTO_WRAPPER=/home/usr-local/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.8.1/lto-wrapper Target: x86_64-unknown-linux-gnu Configured with: ./configure --enable-languages=c,c++ --program-suffix=-4.8.1 Thread model: posix gcc version 4.8.1 20130427 (prerelease) (GCC) g++ -v Using built-in specs. COLLECT_GCC=g++ COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.7.2 (Debian 4.7.2-5) g++-4.9.0 -v Using built-in specs. COLLECT_GCC=g++-4.9.0 COLLECT_LTO_WRAPPER=/home/usr-local/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/lto-wrapper Target: x86_64-unknown-linux-gnu Configured with: ./configure --enable-languages=c,c++ --program-suffix=-4.9.0 Thread model: posix gcc version 4.9.0 20130510 (experimental) (GCC) Regards, Peter