Strange timings on nocona/prescott with indirect jumps/calls

2007-05-02 Thread Marco Manfredini
I have a Pentium-4 HT 521 running in 64 bit mode here, which seems to have a 
branch prediction or prefetch misfeature. Here is an example: 

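// stall.c

#include <stdlib.h> /* abort() */

typedef int (*fn)(void*);

/* Advance to the next table slot and dispatch to it. */
int nop(void* ip)
{
    fn *next=((fn*)ip)+1;
#ifdef HEIMLICH
    // choked?
    if (ip==0) abort();
#endif
#ifdef NOSC
    /* not a sibling call: compiled as an indirect call */
    return 1+(*next)(next);
#else
    /* sibling call: compiled as an indirect jump */
    return (*next)(next);
#endif
}

/* Terminates the dispatch chain. */
int ret(void* ip)
{
    return 0;
}

int main()
{
    int i;
    fn prog[]={&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&ret};
    for (i=0;i<1;i++)
    {
        (*prog)(prog);
    }
    return 0;
}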

// eof

(gcc is 4.0.3; gcc-4.3 from svn behaves the same)

gcc -march=nocona -fomit-frame-pointer -O3 stall.c
./a.out runtime: 5.75 sec

gcc -march=nocona -fomit-frame-pointer -O3 -DHEIMLICH stall.c
./a.out runtime: 1.92 sec

gcc -m32 -march=prescott -fomit-frame-pointer -O3 stall.c
./a.out runtime: 7.06 sec

gcc -m32 -march=prescott -fomit-frame-pointer -DHEIMLICH -O3 stall.c
./a.out runtime: 2.67 sec

It looks like the extra branch involved in the "if (ip==0) abort();" line 
shakes something up in a healthy way, bringing performance back to the 
level of a Core Duo CPU. In fact, a simple "jz 0" somewhere before the 
generated sibling call has the same effect. A similar result can be obtained 
with -DNOSC (which results in an indirect call instead of a sibling call). 

Since this behaviour affects all kinds of dispatching code (switch, goto 
label, interpreter), I would like to know if this is specific to my stepping 
or a more general problem of the Prescott core. That is, I'd like to ask if you 
people can reproduce this with other models/steppings, in order to find out 
if it's significant enough to file an enhancement report for the optimizer.
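
For anyone trying to reproduce this on another model or stepping, here is a minimal sketch of a self-timing variant of the same dispatch chain. The names step, stop and time_chain are hypothetical and not part of stall.c. The sketch assumes that clock() from <time.h> is precise enough for the run lengths involved. It has no HEIMLICH or NOSC variants, so it only covers the plain sibling-call case.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef int (*fn)(void *);

/* Same shape as nop() in stall.c: move to the next slot and dispatch to it. */
static int step(void *ip)
{
    fn *next = ((fn *)ip) + 1;
    return (*next)(next);
}

/* Ends the chain, like ret() in stall.c. */
static int stop(void *ip)
{
    (void)ip;
    return 0;
}

/*
 * Hypothetical helper: run the ten-entry chain `iterations` times and
 * return the elapsed CPU time in seconds. The accumulated return values
 * are written to *sink so the loop has an observable result.
 */
static double time_chain(long iterations, int *sink)
{
    fn prog[] = { &step, &step, &step, &step, &step,
                  &step, &step, &step, &step, &stop };
    clock_t start, end;
    long i;
    int sum = 0;

    assert(sink != NULL);

    start = clock();
    for (i = 0; i < iterations; i++)
        sum += (*prog)(prog);
    end = clock();

    *sink = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_chain from main with an iteration count large enough to run for a few seconds, and printing the result, gives a figure that can be compared across machines. Whether the compiler keeps the indirect jumps at a given optimization level is worth checking in the generated assembly.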

Here's my relevant data (using http://www.etallen.com/cpuid.html): 

   version information (1/eax):
  processor type  = primary processor (0)
  family  = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
  model   = 0x4 (4)
  stepping id = 0x1 (1)
  extended family = 0x0 (0)
  extended model  = 0x0 (0)
  (simple synth)  = Intel Pentium 4 (Prescott E0) / Xeon (Nocona E0) / Xeon MP (Cranford A0 / Potomac C0) / Celeron D (Prescott E0) / Mobile Pentium 4 (Prescott E0), 90nm



Re: What to do with hardware exception (unaligned access) ? ARM920T processor

2008-10-01 Thread Marco Manfredini
On Wednesday 01 October 2008, Martin Guy wrote:
> If you don't want to make the code portable and you are running a
> recent Linux, a fast fix is to
>   echo 2 > /proc/cpu/alignment
> which should make the kernel trap misaligned accesses and fix them up
> for you, with a loss in performance of course. The real answer is to
> fix the code...

...and this is where -Wcast-align should help. The OP should also have a look 
at -Wpadded and -Wpacked, because these may expose similar pitfalls.
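
To make the pitfall concrete, here is a minimal sketch of the kind of cast -Wcast-align is meant to flag on strict-alignment targets, together with a portable alternative that copies bytes instead of dereferencing a cast pointer. The helper names load_u32 and store_u32 and the struct record are hypothetical and are not taken from the OP's code. The struct illustrates the sort of layout -Wpadded reports. How much padding is inserted depends on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * The problematic pattern (shown only as a comment, never executed here):
 *
 *     uint32_t v = *(const uint32_t *)(buf + 1);
 *
 * The cast raises the required alignment from 1 to that of uint32_t,
 * which is the situation -Wcast-align warns about.
 */

/* Hypothetical helper: read a 32-bit value from any address, aligned or not. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* byte copy, no alignment requirement */
    return v;
}

/* Hypothetical helper: write a 32-bit value to any address, aligned or not. */
static void store_u32(unsigned char *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}

/*
 * Illustrative layout: on most ABIs the compiler inserts padding after
 * `tag` so that `value` is suitably aligned. -Wpadded reports such padding.
 */
struct record {
    char     tag;
    uint32_t value;
};
```

Replacing casts with helpers like these keeps the code portable, so it no longer depends on the kernel fixing up misaligned accesses.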
 
This writeup looks like a good start for the OP:
http://lecs.cs.ucla.edu/wiki/index.php/XScale_alignment