------- Comment #32 from eyal at geomage dot com 2008-02-12 11:28 -------
(In reply to comment #31)
> > I would appreciate, however, a further explanation about this issue.
>
> The explanation has to do with CPU architecture and is not related to
> compilers. On a cache miss, a memory load or store takes tens of CPU cycles
> instead of the few cycles it takes on a cache hit.
>
> When we run:
>   time ./mvec 400000 1 29720 1000
> the program performs 400000 iterations of the outer loop and 29720 iterations
> of the inner loop. The inner loop performs 3 load accesses and one store
> access per iteration. Starting from the second iteration of the outer loop,
> all 29720 elements of the arrays pSum, pSum1 and pVec1 will already be in the
> cache, and from that point on all accesses will be cache hits (assuming the
> data cache is big enough to hold all 29720*3 elements).
>
> Let's look at the slow run:
>   % time ./TestVec 92200 8 89720 1000
> Here the program performs (89720-8) iterations of the inner loop, so in order
> to get cache hits most of the time, the cache needs to hold at least 89712*3
> elements. Consider what happens if the cache is only half that size. After
> the first iteration of the outer loop completes, the cache holds the second
> half of the arrays' data. At the start of the second iteration of the outer
> loop, all elements from the first half will already have been evicted,
> because most caches use an LRU policy to choose victims. Given that the
> PPC970 is an out-of-order, multiple-issue architecture, we can guess why the
> CPU has enough time to perform the arithmetic even in scalar form without
> adding any overhead relative to the vectorized version of the inner loop.
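(For reference, a minimal sketch of the kind of loop being discussed. The real
benchmark source is not quoted in this comment, so the function signature, the
float element type, and the mapping of command-line arguments to loop bounds
are assumptions taken from the description above, not the actual code.)

  /* Sketch of the described access pattern: 3 loads (pSum[i], pSum1[i],
     pVec1[i]) and 1 store (pSum[i]) per inner-loop iteration.
     Element type float and loop-bound names are assumptions. */
  void mvec(float *pSum, const float *pSum1, const float *pVec1,
            long outer, long start, long end)
  {
      for (long j = 0; j < outer; j++)        /* e.g. 400000 iterations */
          for (long i = start; i < end; i++)  /* e.g. 29720 or 89720-8  */
              pSum[i] += pSum1[i] * pVec1[i];
  }

  /* Working-set estimate, assuming 4-byte float elements:
       fast run : 29720 * 3 * 4 bytes ~= 348 KB
       slow run : 89712 * 3 * 4 bytes ~= 1.03 MB
     The slow run's working set exceeds a 512 KB L2 (the PPC970's L2 size,
     assumed here), so each outer iteration streams the arrays through the
     cache again and most inner-loop accesses miss. */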
Thanks a lot for the detailed explanation, Victor. I'll try to see if I can
break up the real code to be more memory friendly. Again, thanks a lot, guys.

eyal


-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117