Re: indirect load store on POWER8 and extra address computation

Michael Meissner Mon, 02 Nov 2015 11:22:58 -0800

On Mon, Nov 02, 2015 at 02:59:37PM +0000, Ewart Timothée wrote:
> Hello All,
> 
> I have a question about performance on POWER8 (little-endian, GCC 4.9.1),
> especially with load/store.
> I am comparing several ways to evaluate a polynomial. To keep the thread
> simple, I provide a basic example of meta-programming for polynomial
> evaluation with Horner's method. I have:


...
 
> The XLC code is more compact because it uses a direct load (with an offset
> from the address), whereas GCC computes the address with addis. Moreover,
> XLC favors VMX computation, but I think that should change nothing. For this
> kind of computation (I measure the latency with an external program), XLC is
> about 30% faster on other, similar test cases.
> 
> Does this address computation cost extra time (I would say yes, and that it
> introduces a data hazard in the pipeline), or does the processor use the
> instruction fusion process described in « IBM POWER8 processor core
> microarchitecture » at run time and so merge addis + ld into ld X(r3)?

POWER8 instruction fusion does not cover fusing addis with the lfd/lfs
instructions.  Presently, there are 2 cases where instructions are fused:

    1)  If you have an addis instruction that sets a register, followed by a
        zero-extending load to the same general purpose register;

    2)  If you have an or immediate instruction that loads up an integer
        constant followed by a vector load instruction that uses the constant
        as an index register.

Future machines may expand upon the list of fusable instructions.
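
For reference, here is a minimal sketch of the kind of code under discussion,
since the original example is elided above.  It is not the poster's code: the
coefficient table `coeff`, the `horner` template and the `poly` wrapper are
hypothetical names chosen for illustration.  Each step loads a double
coefficient, and depending on compiler version and flags that load may be
emitted as addis followed by lfd, which is the pair that is not fused.

```cpp
#include <cstddef>

// Hypothetical coefficient table (not from the original post):
// p(x) = 1 + 2x + 3x^2 + 4x^3.
static const double coeff[] = {1.0, 2.0, 3.0, 4.0};

// Horner recursion unrolled at compile time:
// p_I(x) = coeff[I] + x * p_{I+1}(x).
template <std::size_t I, std::size_t N>
struct horner {
    static double eval(double x) {
        // Each step needs one floating-point load of coeff[I].
        return coeff[I] + x * horner<I + 1, N>::eval(x);
    }
};

// Terminal case: the highest-order coefficient.
template <std::size_t N>
struct horner<N, N> {
    static double eval(double) { return coeff[N]; }
};

// Degree-3 polynomial built from the table above.
double poly(double x) { return horner<0, 3>::eval(x); }
```

With this table, poly(2.0) evaluates 1 + 4 + 12 + 32 = 49.  Inspecting the
assembly (for example with -S) is the way to check which load sequence a
given compiler actually produces.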

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: [email protected], phone: +1 (978) 899-4797
