Hello All,
I have a question about performance on Power8 (little-endian, GCC 4.9.1),
especially regarding load/store.
I am evaluating the different ways of evaluating a polynomial; to keep the
thread simple, here is a basic example of metaprogramming for polynomial
evaluation with Horner's method. I have:
template<int n>
struct coeff{ };

template<>
struct coeff<0>{
    const static inline double coefficient() {return 1.00000;}
};
template<>
struct coeff<1>{
    const static inline double coefficient() {return 9.99999;}
};
template<>
struct coeff<2>{
    const static inline double coefficient() {return 5.00000;}
};
template<>
struct coeff<3>{
    const static inline double coefficient() {return 1.66666;}
};
template<>
struct coeff<4>{
    const static inline double coefficient() {return 4.16667;}
};

template< template<int n> class C, int n>
struct horner_helper{
    static const inline double h(double x){
        return C<4-n>::coefficient() + horner_helper<C,n-1>::h(x)*x;
    }
};
template<template<int n> class C>
struct horner_helper<C,0>{
    static const inline double h(double){
        return C<4>::coefficient();
    }
};

inline double horner(double x){
    return horner_helper<coeff,4>::h(x);
}

double poly(double x){
    double y = horner(x);
    return y;
}
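For completeness, here is a minimal driver of the kind one could use to
exercise poly() against the two archives; the iteration count and the
clock()-based timing are only an illustration, not the external measurement
program I actually use:

#include <cstdio>
#include <ctime>

double poly(double x);  // defined in horner.cpp (libhorner_gcc.a / libhorner_xlc.a)

int main(){
    const long iterations = 100000000L;  // illustrative count only
    double x = 0.5, acc = 0.0;
    std::clock_t t0 = std::clock();
    for(long i = 0; i < iterations; ++i){
        acc += poly(x);                  // accumulate so the result is actually used
        x += 1e-9;
    }
    std::clock_t t1 = std::clock();
    std::printf("acc = %f, cpu time = %f s\n",
                acc, double(t1 - t0)/CLOCKS_PER_SEC);
    return 0;
}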
If I look at the assembly generated by GCC compared to XLC, I get:
MAKEFILE
all:
	make gcc xlc
gcc:
	gcc -O3 -c horner.cpp -o horner.o
	ar rcs libhorner_gcc.a horner.o
	rm horner.o
xlc:
	xlc -O3 -c horner.cpp -o horner.o
	ar rcs libhorner_xlc.a horner.o
	rm horner.o
clean:
	rm -f horner.o libhorner_gcc.a libhorner_xlc.a
GCC
0000000000000000 <_Z4polyd>:
0: 00 00 4c 3c addis r2,r12,0
4: 00 00 42 38 addi r2,r2,0
8: 00 00 22 3d addis r9,r2,0
c: 00 00 89 c9 lfd f12,0(r9)
10: 00 00 22 3d addis r9,r2,0
14: 00 00 29 c9 lfd f9,0(r9)
18: 00 00 22 3d addis r9,r2,0
1c: 00 00 09 c0 lfs f0,0(r9)
20: 00 00 22 3d addis r9,r2,0
24: 00 00 49 c9 lfd f10,0(r9)
28: 00 00 22 3d addis r9,r2,0
2c: 3a 4b 81 fd fmadd f12,f1,f12,f9
30: 00 00 69 c1 lfs f11,0(r9)
34: 3a 03 01 fc fmadd f0,f1,f12,f0
38: 3a 50 01 fc fmadd f0,f1,f0,f10
3c: 3a 58 21 fc fmadd f1,f1,f0,f11
40: 20 00 80 4e blr
44: 00 00 00 00 .long 0x0
48: 00 09 00 00 .long 0x900
4c: 00 00 00 00 .long 0x0
XLC
0: 00 00 4c 3c addis r2,r12,0
4: 00 00 42 38 addi r2,r2,0
8: 00 00 62 3c addis r3,r2,0
c: 00 00 63 e8 ld r3,0(r3)
10: 8c 03 05 10 vspltisw v0,5
14: 8c 03 21 10 vspltisw v1,1
18: 00 00 03 c8 lfd f0,0(r3)
1c: 08 00 43 c8 lfd f2,8(r3)
20: e2 03 60 f0 xvcvsxwdp vs3,vs32
24: 08 09 02 f0 xsmaddadp vs0,vs2,vs1
28: 10 00 43 c8 lfd f2,16(r3)
2c: 08 01 61 f0 xsmaddadp vs3,vs1,vs0
30: e2 0b 00 f0 xvcvsxwdp vs0,vs33
34: 08 19 41 f0 xsmaddadp vs2,vs1,vs3
38: 48 01 22 f0 xsmaddmdp vs1,vs2,vs0
3c: 20 00 80 4e blr
40: 00 00 00 00 .long 0x0
44: 00 09 22 00 .long 0x220900
48: 00 00 00 00 .long 0x0
4c: 40 00 00 00 .long 0x40
The XLC code is more compact thanks to the direct loads (base register plus
immediate offset), whereas GCC computes each address with a separate addis.
Moreover, XLC prefers VMX/VSX instructions for the computation, but I think
that should not change anything. For this kind of computation (I measure the
latency with an external program) XLC is roughly 30% faster, and I see
similar results on other, similar test cases.
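As a side note, one variant that might be worth comparing (this is only a
sketch I have not measured; the names horner_array/poly_array and the table
layout are mine) keeps the coefficients in a single static table, in the
hope that GCC then materializes one base address and uses plain base+offset
lfd loads, closer to what XLC emits:

// Sketch only: same polynomial, coefficients gathered in one table so a
// single base address can serve all the loads.
static const double coeffs[5] = {1.00000, 9.99999, 5.00000, 1.66666, 4.16667};

inline double horner_array(double x){
    double r = coeffs[4];
    for(int i = 3; i >= 0; --i)
        r = r*x + coeffs[i];   // Horner step: r = r*x + c_i
    return r;
}

double poly_array(double x){
    return horner_array(x);
}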
Does this address computation cost extra time (I would say yes, since it
introduces a data hazard in the pipeline), or does the instruction-fusion
mechanism described in « IBM POWER8 processor core microarchitecture » kick
in at run time and merge addis + ld into a single ld X(r3)?
Best,
Tim