On Mon, Nov 02, 2015 at 02:59:37PM +0000, Ewart Timothée wrote:
> Hello All,
>
> I have a question about performance on Power8 (little-endian, GCC 4.9.1),
> especially with loads/stores.
> I am evaluating the different ways to evaluate a polynomial. To simplify the
> thread, I provide a basic meta-programming example of polynomial evaluation
> with Horner's method. I have:
...
> The XLC code is more compact because it loads directly (using an offset from
> the address), whereas GCC computes the address with addis. Moreover, XLC
> favors VMX computation, but I think that should change nothing. For this kind
> of computation (I measure the latency with an external program), XLC is
> about 30% faster on other, similar test cases.
>
> Does this address computation cost extra time (I would say yes, and it
> introduces a data hazard in the pipeline), or does the processor use the
> instruction fusion process described in « IBM POWER8 processor core
> microarchitecture » at run time and so merge addis + ld into ld X(r3)?

The power8 machine fusion does not cover fusing addis with the lfd/lfs
instructions.  Presently, it has 2 cases where instructions are fused:

    1) If you have an addis instruction that sets a register, followed by a
       zero-extending load to the same general purpose register;

    2) If you have an or immediate instruction that loads up an integer
       constant followed by a vector load instruction that uses the constant
       as an index register.

Future machines may expand upon the list of fusable instructions.
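
To make the two cases concrete, here is a minimal C++ sketch of the rule as
a predicate over a pair of adjacent instructions.  Everything in it (the Kind
categories, the Insn fields, the name fuses_on_power8) is hypothetical and
exists only for illustration; it is not a real decoder and not how the
hardware or GCC implements fusion.  It also assumes that in case 1 the load
uses the addis result as its base register, which is my reading of "to the
same general purpose register".

```cpp
#include <cassert>

// Hypothetical, simplified instruction categories (illustration only).
enum class Kind {
    Addis,           // addis: sets a GPR to base + (immediate << 16)
    ConstantLoad,    // immediate instruction that loads an integer constant
    ZeroExtGprLoad,  // zero-extending load into a GPR
    FloatLoad,       // lfd / lfs
    VectorLoad,      // indexed vector load
    Other
};

// Hypothetical instruction record; -1 means "field not used".
struct Insn {
    Kind kind;
    int dest;   // register written by the instruction
    int base;   // base register of a load
    int index;  // index register of an indexed load
};

// Returns true if the adjacent pair (first, second) matches one of the two
// power8 fusion cases listed above.
bool fuses_on_power8(const Insn& first, const Insn& second) {
    // Case 1: addis sets a GPR, then a zero-extending load targets that same
    // GPR.  Assumption: the load also uses that GPR as its base register.
    if (first.kind == Kind::Addis && second.kind == Kind::ZeroExtGprLoad) {
        return second.dest == first.dest && second.base == first.dest;
    }
    // Case 2: a constant is loaded into a register, then a vector load uses
    // that register as its index register.
    if (first.kind == Kind::ConstantLoad && second.kind == Kind::VectorLoad) {
        return second.index == first.dest;
    }
    // Anything else, including addis followed by lfd/lfs, is not fused.
    return false;
}
```

With this model, an addis into r9 followed by a zero-extending load of r9
from an offset off r9 is reported as fused, while the same addis followed by
an lfd through r9 (the pattern in the GCC output discussed above) is not.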
--
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: [email protected], phone: +1 (978) 899-4797