On 10/24/2016 04:25 AM, Paolo Bonzini wrote:
>> > for (; p + 8 <= e; p += 8) {
>> > - __builtin_prefetch(p + 8, 0, 0);
>> > + __builtin_prefetch(p +
>> > + (8 * cache_line_factor * prefetch_line_dist), 0, 0);
> You should precompute cache_line_bytes * prefetch_line_dist /
> sizeof(uint64_t) in a single variable, prefetch_distance. This saves
> the effort of loading global variables repeatedly. Then you can do
>
> __builtin_prefetch(p + prefetch_distance, 0, 0);
>
Let's not complicate things by dividing by sizeof(uint64_t).
Keeping prefetch_distance in bytes avoids both that division and the
multiply implied by uint64_t pointer arithmetic:

  __builtin_prefetch((char *)p + prefetch_distance, 0, 0);
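
For context, here is a minimal sketch of how the byte-based form might
sit in the loop. The function name buffer_is_zero_sketch, the initial
value of prefetch_distance, and the requirement that len be a multiple
of 64 are assumptions made for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: distance in BYTES, computed once at startup, e.g.
 * cache_line_bytes * prefetch_line_dist (the value here is made up). */
static size_t prefetch_distance = 64 * 4;

/* Returns true if the first len bytes of buf are all zero.
 * Assumes buf is 8-byte aligned and len is a multiple of 64. */
static bool buffer_is_zero_sketch(const void *buf, size_t len)
{
    const uint64_t *p = buf;
    const uint64_t *e = (const uint64_t *)((const char *)buf + len);

    assert(len % 64 == 0);

    for (; p + 8 <= e; p += 8) {
        /* Byte arithmetic through char *: no divide by sizeof(uint64_t)
         * when computing the distance, and no implied multiply by the
         * element size here.  Near the end of the buffer the hint
         * address may lie past it, as in the original loop. */
        __builtin_prefetch((const char *)p + prefetch_distance, 0, 0);

        if (p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7]) {
            return false;
        }
    }
    return true;
}
```

The prefetch is only a hint, so a different prefetch_distance should
change performance but not the result.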
r~