See instruction tables here: http://www.agner.org/optimize/instruction_tables.pdf
My brief reading of the tables for core2 and corei7 suggests the following:

1. On core2
   movdqu: both the load and store forms take up to 8 cycles to complete;
   the store form produces 8 uops while the load produces 4 uops
   movsd load: 1 uop, 2 cycle latency
   movsd store: 1 uop, 3 cycle latency
   movhpd, movlpd load: 2 uops, 3 cycle latency
   movhpd store: 2 uops, 5 cycle latency
   movlpd store: 1 uop, 3 cycle latency

2. On core i7
   movdqu load: 1 uop, 2 cycle latency
   movdqu store: 1 uop, 3 cycle latency
   movsd load: 1 uop, 2 cycle latency
   movsd store: 1 uop, 3 cycle latency
   movhpd, movlpd load: 2 uops, 3 cycle latency
   movhpd, movlpd store: 2 uops, 5 cycle latency

From the above, it looks like Sri's original simple heuristic should work fine:

1) for corei7, if the loads and stores cannot be proven to be 128-bit
   aligned, always use movdqu;
2) for core2, an experiment can be done to determine whether to look only
   at unaligned stores, or at both unaligned loads and stores, when
   deciding to disable vectorization.

Yes, for the longer term a more precise cost model is probably needed --
but that requires lots of work and may not work a lot better in practice.
What is more important is to beef up the gcc infrastructure to allow more
aggressive alignment (info) propagation. In 4.4, gcc does alignment
(output array) based versioning -- Sri's patch has the effect of doing
the same thing, but only for selected targets.

thanks,

David

On Tue, Dec 13, 2011 at 10:56 AM, Richard Henderson <r...@redhat.com> wrote:
> On 12/13/2011 10:26 AM, Sriraman Tallam wrote:
>> Cool, this works for stores! It generates the movlps + movhps. I have
>> to also make a similar change to another call to gen_sse2_movdqu for
>> loads. Would it be ok to not do this when tune=core2?
>
> We can work something out.
>
> I'd like you to do the benchmarking to know if unaligned loads are really as
> expensive as unaligned stores, and whether there are reformatting penalties
> that make the movlps+movhps option for either load or store less attractive.
>
>
> r~
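
P.S. For anyone who wants to play with the two code sequences being
discussed, here is a rough, hand-written intrinsics sketch (not taken from
Sri's patch; the function names are made up for illustration) contrasting
the single unaligned 128-bit access with the movlps/movhps split. It should
build with "gcc -O2 -msse2 -S", though the exact asm you get will depend on
the compiler version and tuning:

  #include <emmintrin.h>

  /* Copy 16 possibly-unaligned bytes with a single unaligned 128-bit
     load and store; this is the movdqu sequence.  */
  void copy16_movdqu (char *dst, const char *src)
  {
    __m128i v = _mm_loadu_si128 ((const __m128i *) src);
    _mm_storeu_si128 ((__m128i *) dst, v);
  }

  /* Same copy split into two 64-bit halves; this is intended to map to
     movlps/movhps (movlpd/movhpd for double data), the kind of split
     sequence discussed in the quoted thread.  */
  void copy16_split (char *dst, const char *src)
  {
    __m128 v = _mm_setzero_ps ();
    v = _mm_loadl_pi (v, (const __m64 *) src);        /* low 8 bytes  */
    v = _mm_loadh_pi (v, (const __m64 *) (src + 8));  /* high 8 bytes */
    _mm_storel_pi ((__m64 *) dst, v);                 /* low 8 bytes  */
    _mm_storeh_pi ((__m64 *) (dst + 8), v);           /* high 8 bytes */
  }

The split form trades one expensive unaligned 128-bit access for two cheap
64-bit ones, which is the tradeoff the core2 numbers above favor; on corei7
the single movdqu is already cheap, so the split buys nothing.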