Hello,

> 2. Right now I am inserting a __builtin_prefetch(...) call immediately
> before the actual read, getting something like:
>   D.1117_12 = &A[D.1101_14];
>   __builtin_prefetch (D.1117_12, 0, 1);
>   D.1102_16 = A[D.1101_14];
>
> However, if I enable the instruction scheduler pass, it doesn't realize
> there's a dependency between the prefetch and the load, and it actually
> moves the prefetch after the load, rendering it useless. How can I
> instruct the scheduler of this dependence?
>
> My thinking is to also specify a latency for prefetch, so that the
> scheduler will hopefully place the prefetch somewhere earlier in the
> code to partially hide this latency. Do you see anything wrong with this
> approach?

Well, it assumes that the scheduler works with a long enough lookahead to actually move the prefetch far enough; i.e., if the architecture you work with is relatively slow in comparison with the memory access times, this might be a feasible approach. However, on modern machines, a miss in the L2 cache may take hundreds of cycles, and it is not clear to me that the scheduler will be able to move the prefetch that far, or indeed that it would even be possible (I think often you do not know the address far enough in advance).

Also, prefetching outside of loops in general does not appear to be all that profitable, since usually most of the time is spent within loops. So I would recommend first doing some analysis and measurements (say, by introducing the prefetches by hand) to check whether this project really has the potential to lead to significant speedups.

Zdenek
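
For the hand-inserted measurement suggested above, a minimal sketch in C might look like the following. It is only an illustration, not code from the project under discussion: the function name sum_indirect, the index array idx, and the tunable distance dist are all hypothetical. It assumes GCC or Clang (for __builtin_prefetch) and an indirect access pattern where the address is known several iterations ahead, which is exactly the condition questioned above.

```c
#include <stddef.h>

/* Sum a[idx[i]] for i in [0, n), prefetching the element that will be
 * needed `dist` iterations later. `dist` is a hypothetical tuning knob:
 * it has to be large enough to cover the miss latency, which is the
 * distance a scheduler would have to move the prefetch on its own. */
double sum_indirect(const double *a, const size_t *idx, size_t n, size_t dist)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Only prefetch while i + dist is still a valid position in idx
         * (written as dist < n - i to avoid overflow in i + dist). */
        if (dist < n - i)
            __builtin_prefetch(&a[idx[i + dist]], 0, 1); /* read, low temporal locality */
        sum += a[idx[i]];
    }
    return sum;
}
```

Timing this with several values of dist (including 0, which reproduces the "prefetch immediately before the load" case from the quoted text) against a version without the prefetch would give a rough idea of whether any distance helps on the target machine.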