------- Comment #3 from changpeng dot fang at amd dot com 2010-08-24 00:22 -------
I checked with open64 and did not find any regression. For the above testcase, open64 generated 3 non-temporal prefetches (prefetchnta), whereas gcc generates one prefetcht0. As a result, I am guessing that we are just unlucky: with gcc, the prefetch kicks useful data out of the cache for such streaming accesses. The open64 loop body:

.Lt_0_6402:
 #<loop> Loop body line 8, nesting depth: 1, estimated iterations: 1000
        .loc 1 7 0
        movss 0(%r10),%xmm0           # [0] id:67
        subss 0(%r9),%xmm0            # [3]
        .loc 1 8 0
        mulss %xmm0,%xmm0             # [9]
        mulss 0(%rax),%xmm0           # [13]
        .loc 1 7 0
        prefetchnta 128(%r10)         # [17] L1
        prefetchnta 128(%r9)          # [17] L1
        .loc 1 8 0
        addq $4,%rax                  # [17]
        addq $4,%r10                  # [18]
        addq $4,%r9                   # [18]
        cmpq %r11,%rax                # [18]
        prefetchnta 124(%rax)         # [19] L1
        subss %xmm0,%xmm1             # [19]
        jle .Lt_0_6402                # [19]

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45391
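
For anyone who wants to experiment with the two prefetch flavours by hand, here is a minimal C sketch of a loop with the same shape as the listing (three float streams, one accumulator). It is not the testcase from this bug: the function name `stream_nta`, the array names, and the `PREFETCH_AHEAD` distance are assumptions chosen to mirror the 128-byte offsets seen in the assembly. It relies on GCC's `__builtin_prefetch(addr, rw, locality)`; with locality 0, GCC on x86 typically emits prefetchnta, and with locality 3 it typically emits prefetcht0.

```c
#include <stddef.h>

/* Assumed prefetch distance: 32 floats = 128 bytes, mirroring the
   128(%r10) offsets in the listing. Not taken from the testcase. */
#define PREFETCH_AHEAD 32

/* Hypothetical loop with the same shape as the listing:
   acc -= (a[i] - b[i])^2 * c[i], with one hint per stream. */
float stream_nta(const float *a, const float *b, const float *c, size_t n)
{
    float acc = 0.0f;

    for (size_t i = 0; i < n; i++) {
        /* Only form addresses that lie inside the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0 (read), locality = 0 (no temporal locality).
               Change the last argument to 3 to request the
               keep-in-all-cache-levels variant instead. */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 0);
            __builtin_prefetch(&c[i + PREFETCH_AHEAD], 0, 0);
        }

        float d = a[i] - b[i];
        acc -= d * d * c[i];
    }
    return acc;
}
```

The bounds check is there only to keep the address expressions valid; a tuned version would likely peel the last iterations instead. Inspecting the output of `gcc -O2 -S` should show which prefetch instruction was actually chosen on a given target.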