Hello all,

In the work I'm doing on my new book, I'm trying to show how modern compiler optimizations can eliminate a good deal of the overhead introduced by a modular/unit-testable design. In verifying some of my text, I found that GCC 4.4 and 4.5 (20091018, Ubuntu 9.10 package) aren't doing an optimization that I expected them to do:

#include <stdio.h>

class Calculable
{
public:
        virtual unsigned char calculate() = 0;
};

class X : public Calculable
{
public:
        unsigned char calculate() { return 1; }
};

class Y : public Calculable
{
public:
        unsigned char calculate() { return 2; }
};

static void print(Calculable& c)
{
        printf("%d\n", c.calculate());
        printf("+1: %d\n", c.calculate() + 1);
}

int main()
{
        X x;
        Y y;

        print(x);
        print(y);

        return 0;
}

GCC 4.5 (and 4.4.1) generates approximately this code:

~/src $ /usr/lib/gcc-snapshot/bin/g++ -O3 -ftree-loop-ivcanon -fivopts -ftree-loop-im -fwhole-program -fipa-struct-reorg -fipa-matrix-reorg -fgcse-sm -fgcse-las -fgcse-after-reload --param max-gcse-memory=100000000 --param max-pending-list-length=100000 folding-test-interface.cpp -o folding-test-interface_gcc450_20091018-O3-kitchen-sink

~/src$ objdump -Mintel -S folding-test-interface_gcc450_20091018-O3-kitchen-sink | less -p \<main

0000000000400310 <main>:
  400310:       53                      push   rbx
  400311:       48 83 ec 20             sub    rsp,0x20
  400315:       48 8d 5c 24 10          lea    rbx,[rsp+0x10]
  40031a:       48 c7 44 24 10 c0 04    mov    QWORD PTR [rsp+0x10],0x4004c0
  400321:       40 00
  400323:       48 c7 04 24 00 05 40    mov    QWORD PTR [rsp],0x400500
  40032a:       00
  40032b:       48 89 df                mov    rdi,rbx
  40032e:       ff 15 8c 01 00 00       call   QWORD PTR [rip+0x18c]        # 4004c0 <_ZTV1X+0x10>
  400334:       bf ac 04 40 00          mov    edi,0x4004ac
  400339:       0f b6 f0                movzx  esi,al
  40033c:       31 c0                   xor    eax,eax
  40033e:       e8 a5 03 00 00          call   4006e8 <pri...@plt>
  400343:       48 8b 44 24 10          mov    rax,QWORD PTR [rsp+0x10]
  400348:       48 89 df                mov    rdi,rbx
  40034b:       ff 10                   call   QWORD PTR [rax]
  40034d:       0f b6 f0                movzx  esi,al
  400350:       bf a4 04 40 00          mov    edi,0x4004a4
  400355:       31 c0                   xor    eax,eax
  400357:       83 c6 01                add    esi,0x1
  40035a:       e8 89 03 00 00          call   4006e8 <pri...@plt>
[...]

As seen here, GCC isn't folding/inlining the constants returned across the virtual function boundary, even though they are visible in the compilation unit and -O3 -fwhole-program is being used. (Note that I started with just those two options, and added the others in an attempt to induce the optimization I was hoping for.)

I was able to induce the optimization by removing a level of indirection in two ways: 1) by having two print() functions, one overload accepting X& and a second accepting Y&; and 2) by replacing the classes with single-level-indirection function pointers:
--
#include <stdio.h>

typedef unsigned char(*Calculable)(void);

unsigned char one() { return 1; }
unsigned char two() { return 2; }

static void print(Calculable calculate)
{
        printf("%d\n", calculate());
        printf("+1: %d\n", calculate() + 1);
}

int main()
{
        print(one);
        print(two);

        return 0;
}
--
For completeness, the code generated from the function-pointer example is optimized in the way I expect:
0000000000400390 <main>:
  400390:       48 83 ec 08             sub    rsp,0x8
  400394:       ba 01 00 00 00          mov    edx,0x1
  400399:       be e4 04 40 00          mov    esi,0x4004e4
  40039e:       bf 01 00 00 00          mov    edi,0x1
  4003a3:       31 c0                   xor    eax,eax
  4003a5:       e8 c6 02 00 00          call   400670 <__printf_...@plt>
  4003aa:       ba 02 00 00 00          mov    edx,0x2
  4003af:       be dc 04 40 00          mov    esi,0x4004dc
  4003b4:       bf 01 00 00 00          mov    edi,0x1
  4003b9:       31 c0                   xor    eax,eax
  4003bb:       e8 b0 02 00 00          call   400670 <__printf_...@plt>



Modifying this last example to use two levels of function-pointer indirection once again causes the optimization to be missed.

So, my questions are:
0) Am I missing some existing command-line option that would induce the optimization? (i.e. is there a bad connection between my chair and keyboard?)
1) Is this a missed optimization bug, or is this a missing feature?
2) Either way, what are the steps to correct the issue?

Thanks in advance for insights and/or help!



PS: I would test with a newer 4.5.0 build, but I'm having trouble bootstrapping. Any help with that email (sent yesterday) would be appreciated as well.

--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt
