See instruction tables here: http://www.agner.org/optimize/instruction_tables.pdf
My brief reading of the tables for core2 and corei7 suggests the following:

1. On core2
   movdqu: both the load and store forms take up to 8 cycles to complete;
   the store form produces 8 uops while the load produces 4 uops
   movsd load: 1 uop, 2 cycle latency
   movsd store: 1 uop, 3 cycle latency
   movhpd, movlpd load: 2 uops, 3 cycle latency
   movhpd store: 2 uops, 5 cycle latency
   movlpd store: 1 uop, 3 cycle latency

2. On core i7
   movdqu load: 1 uop, 2 cycle latency
   movdqu store: 1 uop, 3 cycle latency
   movsd load: 1 uop, 2 cycle latency
   movsd store: 1 uop, 3 cycle latency
   movhpd, movlpd load: 2 uops, 3 cycle latency
   movhpd, movlpd store: 2 uops, 5 cycle latency

From the above, it looks like Sri's original simple heuristic should work fine:

1) for corei7, if the loads and stores cannot be proven to be 128-bit
   aligned, always use movdqu;
2) for core2, an experiment can be done to determine whether to look only
   at unaligned stores, or at both unaligned loads and stores, when
   deciding to disable vectorization.

Yes, for the longer term a more precise cost model is probably needed --
but that requires lots of work and may not work a lot better in practice.
What is more important is to beef up the gcc infrastructure to allow more
aggressive alignment (info) propagation. In 4.4, gcc does alignment
(output array) based versioning -- Sri's patch has the effect of doing
the same thing, but only for selected targets.

thanks,

David

On Tue, Dec 13, 2011 at 10:56 AM, Richard Henderson <r...@redhat.com> wrote:
> On 12/13/2011 10:26 AM, Sriraman Tallam wrote:
>> Cool, this works for stores! It generates the movlps + movhps. I have
>> to also make a similar change to another call to gen_sse2_movdqu for
>> loads. Would it be ok to not do this when tune=core2?
>
> We can work something out.
>
> I'd like you to do the benchmarking to know if unaligned loads are really as
> expensive as unaligned stores, and whether there are reformatting penalties
> that make the movlps+movhps option for either load or store less attractive.
>
>
> r~
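
P.S. For anyone who wants to play with the two code sequences being
discussed, here is a rough, hand-written intrinsics sketch (not taken from
Sri's patch; the function names are made up for illustration) contrasting
the single unaligned 128-bit access with the movlps/movhps split. It should
build with "gcc -O2 -msse2 -S", though the exact asm you get will depend on
the compiler version and tuning:

  #include <emmintrin.h>

  /* Copy 16 possibly-unaligned bytes with a single unaligned 128-bit
     load and store; this is the movdqu sequence.  */
  void copy16_movdqu (char *dst, const char *src)
  {
    __m128i v = _mm_loadu_si128 ((const __m128i *) src);
    _mm_storeu_si128 ((__m128i *) dst, v);
  }

  /* Same copy split into two 64-bit halves; this is intended to map to
     movlps/movhps (movlpd/movhpd for double data), the kind of split
     sequence discussed in the quoted thread.  */
  void copy16_split (char *dst, const char *src)
  {
    __m128 v = _mm_setzero_ps ();
    v = _mm_loadl_pi (v, (const __m64 *) src);        /* low 8 bytes  */
    v = _mm_loadh_pi (v, (const __m64 *) (src + 8));  /* high 8 bytes */
    _mm_storel_pi ((__m64 *) dst, v);                 /* low 8 bytes  */
    _mm_storeh_pi ((__m64 *) (dst + 8), v);           /* high 8 bytes */
  }

The split form trades one expensive unaligned 128-bit access for two cheap
64-bit ones, which is the tradeoff the core2 numbers above favor; on corei7
the single movdqu is already cheap, so the split buys nothing.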