I updated the patch to add the checks in vectorizable_load and vectorizable_store themselves.
Thanks,
-Sri.

On Tue, Dec 13, 2011 at 12:16 PM, Xinliang David Li <[email protected]> wrote:
> See instruction tables here:
> http://www.agner.org/optimize/instruction_tables.pdf
>
> My brief reading of the tables for core2 and corei7 suggests the following:
>
> 1. On core2
>
> movdqu -- both load and store forms take up to 8 cycles to complete;
> the store form produces 8 uops while the load form produces 4 uops
>
> movsd load: 1 uop, 2 cycle latency
> movsd store: 1 uop, 3 cycle latency
>
> movhpd, movlpd load: 2 uops, 3 cycle latency
> movhpd store: 2 uops, 5 cycle latency
> movlpd store: 1 uop, 3 cycle latency
>
>
> 2. Core i7
>
> movdqu load: 1 uop, 2 cycle latency
> movdqu store: 1 uop, 3 cycle latency
>
> movsd load: 1 uop, 2 cycle latency
> movsd store: 1 uop, 3 cycle latency
>
> movhpd, movlpd load: 2 uops, 3 cycle latency
> movhpd, movlpd store: 2 uops, 5 cycle latency
>
>
> From the above, it looks like Sri's original simple heuristic should work fine:
>
> 1) for corei7, if the loads and stores cannot be proved to be 128-bit
> aligned, always use movdqu
>
> 2) for core2, experiments can be done to determine whether to look at
> unaligned stores alone, or at both unaligned loads and stores, when
> deciding to disable vectorization.
>
> Yes, for the longer term a more precise cost model is probably needed --
> but that requires lots of work and may not do much better in practice.
>
> What is more important is to beef up the gcc infrastructure to allow more
> aggressive alignment (info) propagation.
>
> In 4.4, gcc does alignment (of the output array) based versioning -- Sri's
> patch has the effect of doing the same thing, but only for selected
> targets.
>
> thanks,
>
> David
>
> On Tue, Dec 13, 2011 at 10:56 AM, Richard Henderson <[email protected]> wrote:
>> On 12/13/2011 10:26 AM, Sriraman Tallam wrote:
>>> Cool, this works for stores! It generates the movlps + movhps. I have
>>> to also make a similar change to another call to gen_sse2_movdqu for
>>> loads. Would it be ok to not do this when tune=core2?
>>
>> We can work something out.
>>
>> I'd like you to do the benchmarking to know if unaligned loads are really as
>> expensive as unaligned stores, and whether there are reformatting penalties
>> that make the movlps+movhps option for either load or store less attractive.
>>
>>
>> r~
