On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <[email protected]>
wrote:
> * Richard Henderson ([email protected]) wrote:
> > On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> > > Hi Richard,
> > > I think you were the last person to fiddle with the prefetching
> > > in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> > > prefetching still made sense on modern CPUs, and that their hardware
> > > generally figures stuff out better on simple increments.
> > >
> > > What was your thinking on this, and did you actually measure
> > > any improvement?
> >
> > Ah, well, that was 5 years ago so I have no particular memory of this.
> > It wouldn't surprise me if you can't measure any improvement on modern
> > hardware.
> >
> > Do you now measure an improvement with the prefetches gone?
>
> Not tried, it just came from Joe's suggestion that it was generally a
> bad idea these days; I do remember that the behaviour of those functions
> is quite tricky because their performance is VERY data dependent - many
> VMs actually have pages that are quite dirty so you never iterate the
> loop, but then you hit others with big zero pages and you spend your
> entire life in the loop.
>
>
Dave, Richard:
My curiosity got the best of me, so I created a small test program that
uses the buffer_zero_avx2() routine from qemu's bufferiszero.c.
When I run it on an Intel Cascade Lake processor, the cost of calling
"__builtin_prefetch(p)" is in the noise range. It's always "just
slightly" slower. I doubt it could ever be measured in qemu.
Ironically, when I disabled the hardware prefetchers, the program slowed
down by over 33%. And the call to "__builtin_prefetch(p)" actually hurt
performance by over 3%.
My results are below (only with the hardware prefetchers enabled). The
program is attached.
Joe
# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
./a.out; done
TSC 356144 Kcycles.
TSC 356714 Kcycles.
TSC 356707 Kcycles.
TSC 356565 Kcycles.
TSC 356853 Kcycles.
# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
TSC 355520 Kcycles.
TSC 355961 Kcycles.
TSC 355872 Kcycles.
TSC 355948 Kcycles.
TSC 355918 Kcycles.
> Dave
> >
> > r~
> >
> --
> Dr. David Alan Gilbert / [email protected] / Manchester, UK
>
>
/*
* Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
*
* Compile with either:
* gcc -mavx buffer_zero_avx.c -O
* or
* gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH
*/
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>     /* exit() */
#include <stddef.h>
#include <sys/mman.h>
#include <string.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

static __inline__ uint64_t start_clock(void);
static __inline__ uint64_t stop_clock(void);
static int buffer_zero_avx2(const void *buf, size_t len);
/*
* Allocate a large chunk of anon memory, touch/zero it,
* and then time the call to buffer_zero_avx2().
*/
int main(void)
{
    size_t mmap_len = 2UL * 1024 * 1024 * 1024;
    char *ptr = mmap(NULL, mmap_len,
                     PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE,
                     -1, 0L);
    if (ptr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    // Touch the pages (they're already cleared)
    memset(ptr, 0x0, mmap_len);

    uint64_t start_rdtsc = start_clock();
    buffer_zero_avx2(ptr, mmap_len);
    uint64_t stop_rdtsc = stop_clock();

    uint64_t diff = stop_rdtsc - start_rdtsc;
    printf("TSC %lu Kcycles.\n", (unsigned long)(diff / 1000));
    return 0;
}
static int
buffer_zero_avx2(const void *buf, size_t len)
{
    /* Begin with an unaligned head of 32 bytes. */
    __m256i t = _mm256_loadu_si256(buf);
    __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
    __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);

    if (likely(p <= e)) {
        /* Loop over 32-byte aligned blocks of 128. */
        do {
#ifdef DO_PREFETCH
            __builtin_prefetch(p);
#endif
            if (unlikely(!_mm256_testz_si256(t, t))) {
                printf("In unlikely buffer_zero, p:%p\n", (void *)p);
                return 0;
            }
            t = p[-4] | p[-3] | p[-2] | p[-1];
            p += 4;
        } while (p <= e);
    } else {
        t |= _mm256_loadu_si256(buf + 32);
        if (len <= 128) {
            goto last2;
        }
    }

    /* Finish the last block of 128 unaligned. */
    t |= _mm256_loadu_si256(buf + len - 4 * 32);
    t |= _mm256_loadu_si256(buf + len - 3 * 32);
last2:
    t |= _mm256_loadu_si256(buf + len - 2 * 32);
    t |= _mm256_loadu_si256(buf + len - 1 * 32);
    return _mm256_testz_si256(t, t);
}
static __inline__ uint64_t
start_clock(void)
{
    // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel..."
    uint32_t hi, lo;
    __asm__ __volatile__ (
        "CPUID\n\t"          // serialize: wait for prior instructions to retire
        "RDTSC\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r" (hi), "=r" (lo)
        :
        : "%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)lo) | (((uint64_t)hi) << 32);
}
static __inline__ uint64_t
stop_clock(void)
{
    // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel..."
    uint32_t hi, lo;
    __asm__ __volatile__ (
        "RDTSCP\n\t"         // read TSC after prior instructions complete
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "CPUID\n\t"          // serialize: keep later code from reordering in
        : "=r" (hi), "=r" (lo)
        :
        : "%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)lo) | (((uint64_t)hi) << 32);
}