------- Additional Comments From tbptbp at gmail dot com 2005-01-30 18:40 -------

Yes, that's not a win per se, but even with those "unrolled" address
computations its encodings generally end up tighter, i.e.:

gcc:
  40114d: c1 e1 04          shl    $0x4,%ecx
  401150: 8d 41 30          lea    0x30(%ecx),%eax
  ...
  40115a: 0f 58 0c 07       addps  (%edi,%eax,1),%xmm1
  ...
  40116f: 0f 58 04 0f       addps  (%edi,%ecx,1),%xmm0
icc:
  4236e4: 03 ed             add    %ebp,%ebp
  4236e6: 0f 28 64 ef 30    movaps 0x30(%edi,%ebp,8),%xmm4
  4236eb: 0f 28 0c ef       movaps (%edi,%ebp,8),%xmm1

It's a small win (and it's hard to follow, as they schedule things very
differently and gcc touches the stack a lot more), but it could be even better
if shifting was allowed. And in such a lengthy loop, decoding bandwidth is
scarce. If only gcc weren't so greedily trying to precompute indexes and
offsets...

Could you tell me why gcc feels obliged to make a local copy of the hit_t
structure on the stack and then update both the copy and the original?
Ideally I'd like it to not try to outsmart me :) (or maybe I'm missing
something obvious).

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19680
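
For readers following the two listings quoted in the comment above, here is a minimal C sketch of the address arithmetic each sequence performs. It is an illustration only, not code from the bug report. It assumes both compilers are indexing the same array of 16-byte elements (the listings do not confirm that), and the function names are hypothetical. Registers are modelled as 32-bit values, so all arithmetic wraps the way it would on ia32. An x86 addressing mode can scale the index by 1, 2, 4 or 8 only, which is why a 16-byte stride needs an extra instruction in both versions.

```c
#include <assert.h>
#include <stdint.h>

/* gcc-style sequence: the index is shifted once and then used with
   scale 1, so the 0x30 displacement needs a separate lea. */
uint32_t addr_gcc_style(uint32_t edi, uint32_t i, uint32_t disp)
{
    uint32_t ecx = i << 4;       /* shl  $0x4,%ecx        */
    uint32_t eax = ecx + disp;   /* lea  disp(%ecx),%eax  */
    return edi + eax;            /* (%edi,%eax,1)         */
}

/* icc-style sequence: the index is doubled once, then scale 8 and the
   displacement are both folded into the addressing mode. */
uint32_t addr_icc_style(uint32_t edi, uint32_t i, uint32_t disp)
{
    uint32_t ebp = i + i;        /* add  %ebp,%ebp        */
    return edi + ebp * 8 + disp; /* disp(%edi,%ebp,8)     */
}

/* Hypothetical helper: checks that both forms yield the same address
   for the first n indexes. */
void check_addr_styles(uint32_t edi, uint32_t n, uint32_t disp)
{
    for (uint32_t i = 0; i < n; ++i)
        assert(addr_gcc_style(edi, i, disp) == addr_icc_style(edi, i, disp));
}
```

Under that assumption the two forms compute the same address. The difference is where the work lands: the second form keeps the displacement inside the load itself instead of spending a separate instruction and register on it.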