I have a Pentium-4 HT 521 running in 64-bit mode here, which seems to have a
branch prediction or prefetch misfeature. Here is an example:
// stall.c
#include <stdlib.h>   // abort()

typedef int (*fn)(void*);

// Dispatch to the next table entry; ip points at the current entry.
int nop(void* ip)
{
    fn *next=((fn*)ip)+1;
#ifdef HEIMLICH
    // choked?
    if (ip==0) abort();
#endif
#ifdef NOSC
    // not in tail position, so no sibling call
    return 1+(*next)(next);
#else
    return (*next)(next);
#endif
}

// Terminates the chain.
int ret(void* ip)
{
    return 0;
}

int main()
{
    int i;
    fn prog[]={&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&ret};
    for (i=0;i<1;i++)
    {
        (*prog)(prog);
    }
    return 0;
}
// eof
(gcc is 4.0.3; gcc-4.3 from svn behaves the same)
gcc -march=nocona -fomit-frame-pointer -O3 stall.c
./a.out runtime: 5.75 sec
gcc -march=nocona -fomit-frame-pointer -O3 -DHEIMLICH stall.c
./a.out runtime: 1.92 sec
gcc -m32 -march=prescott -fomit-frame-pointer -O3 stall.c
./a.out runtime: 7.06 sec
gcc -m32 -march=prescott -fomit-frame-pointer -DHEIMLICH -O3 stall.c
./a.out runtime: 2.67 sec
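
For anyone who prefers to time the chain from inside the program instead of from the shell, a minimal sketch could look like the following. The names chain_nop, chain_ret and time_chain, and the iterations parameter, are made up for this illustration and are not part of stall.c; the iteration count is left to the caller.

```c
#include <time.h>

typedef int (*fn)(void*);

/* Same shape as nop() in stall.c: tail-call the next table entry. */
static int chain_nop(void *ip)
{
    fn *next = ((fn*)ip) + 1;
    return (*next)(next);
}

/* Same shape as ret() in stall.c: end of the chain. */
static int chain_ret(void *ip)
{
    (void)ip;
    return 0;
}

/* Hypothetical helper: runs the nine-nop chain `iterations` times and
   returns the elapsed CPU time in seconds as reported by clock(). */
double time_chain(long iterations)
{
    fn prog[] = {&chain_nop, &chain_nop, &chain_nop, &chain_nop, &chain_nop,
                 &chain_nop, &chain_nop, &chain_nop, &chain_nop, &chain_ret};
    volatile int sink = 0;   /* keeps the calls from being optimized away */
    long i;
    clock_t start = clock();

    for (i = 0; i < iterations; i++)
        sink += (*prog)(prog);

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_chain() with a large count and printing the result should give figures comparable to timing ./a.out externally. Note that clock() measures CPU time, not wall-clock time.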
It looks like the extra branch involved in the "if (ip==0) abort();" line
shakes something up in a healthy way, bringing performance back to roughly
the level of a Core Duo CPU. In fact, a simple "jz 0" somewhere before the
generated sibling call has the same effect. A similar result can be obtained
with -DNOSC (which results in an indirect call instead of a sibling call).
Since this behaviour affects all kinds of dispatching code (switch, goto
label, interpreter), I would like to know whether this is specific to my
stepping or a more general problem of the Prescott core. That is, I'd like to
ask whether you can reproduce this with other models/steppings, in order to
find out if it's widespread enough to file an enhancement report for the
optimizer.
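
For comparison, a switch-based dispatcher of the kind meant above might look like the following minimal sketch. The names op, OP_NOP, OP_RET and run_switch are made up for this illustration and are not part of stall.c, and no timings were taken for this variant. With only two opcodes the compiler will probably emit a compare-and-branch rather than a jump table, so a real interpreter with many opcodes would be a closer match.

```c
/* Hypothetical opcodes mirroring nop() and ret() from stall.c. */
enum op { OP_NOP, OP_RET };

/* Walks the program one opcode at a time, dispatching through a switch.
   Returns 0 when OP_RET is reached, -1 on an unknown opcode. */
int run_switch(const enum op *ip)
{
    for (;;) {
        switch (*ip) {
        case OP_NOP:
            ip++;        /* like nop(): advance to the next entry */
            break;
        case OP_RET:
            return 0;    /* like ret(): end of program */
        default:
            return -1;   /* not a valid opcode */
        }
    }
}
```

A program equivalent to the prog[] table above would be nine OP_NOP entries followed by one OP_RET.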
Here's my relevant data (using http://www.etallen.com/cpuid.html):
version information (1/eax):
   processor type  = primary processor (0)
   family          = Intel Pentium 4/Pentium D/Pentium Extreme
                     Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon
                     XP-M/Opteron/Sempron/Turion (15)
   model           = 0x4 (4)
   stepping id     = 0x1 (1)
   extended family = 0x0 (0)
   extended model  = 0x0 (0)
   (simple synth)  = Intel Pentium 4 (Prescott E0) / Xeon (Nocona E0) /
                     Xeon MP (Cranford A0 / Potomac C0) / Celeron D (Prescott E0) /
                     Mobile Pentium 4 (Prescott E0), 90nm