TL,
Like peterGo, I was unable to reproduce your findings:
uname -a
Linux 4.8.0-30-generic #32-Ubuntu SMP Fri Dec 2 03:43:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
go version
go version go1.7.4 linux/amd64
cat /proc/cpuinfo
CPU Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
go test -bench=.
[...]
BenchmarkMemclr_2000000-4 3000 421532 ns/op
BenchmarkLoop_2000000-4 2000 791318 ns/op
So memclr is ~2x faster on my machine.
To see what actually happens, let's use the pprof tool:
go test -bench=. -cpuprofile test.prof
Then `go tool pprof test.prof`, and `top 5` (sanity check):
      flat  flat%   sum%        cum   cum%
     1.69s 57.88% 57.88%      1.69s 57.88%  _/tmp/goperf.memsetLoop
     1.22s 41.78% 99.66%      1.22s 41.78%  runtime.memclr
So far so good: memsetLoop and the _runtime_ memclr are both being called.
Going down the rabbit hole, let's look at the assembly:
(pprof) disasm memsetLoop
Total: 2.92s
ROUTINE ======================== _/tmp/goperf.memsetLoop
1.69s 1.69s (flat, cum) 57.88% of Total
. . 46d770: MOVQ 0x10(SP), AX
. . 46d775: MOVQ 0x8(SP), CX
. . 46d77a: MOVL 0x20(SP), DX
. . 46d77e: XORL BX, BX
. . 46d780: CMPQ AX, BX
. . 46d783: JGE 0x46d790
400ms 400ms 46d785: MOVL DX, 0(CX)(BX*4)
1.14s 1.14s 46d788: INCQ BX
150ms 150ms 46d78b: CMPQ AX, BX
. . 46d78e: JL 0x46d785
This is a standard scalar loop, one 4-byte store per iteration, and definitely
not using vectorized instructions (which explains the difference on my CPU).
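
In source form, the two variants being compared presumably look something
like the sketch below. This is my reconstruction, not the benchmark code from
this thread: the int32 element type is inferred from the 4-byte MOVL store
above, memclrIdiom is a name I made up, and treating 2000000 as an element
count is a guess based on the benchmark names. The range loop that stores a
constant zero is the pattern the compiler replaces with a call to the
runtime's memclr; storing a value only known at run time keeps the scalar
loop.

```go
package main

import (
	"fmt"
	"testing"
)

// memsetLoop stores v into every element of a. Because v is only known at
// run time, this stays an element-by-element store loop.
// (Signature is an assumption inferred from the disassembly.)
func memsetLoop(a []int32, v int32) {
	for i := range a {
		a[i] = v
	}
}

// memclrIdiom zeroes a using the range-and-assign-zero pattern, which the
// compiler is expected to lower to the runtime's memory-clearing routine.
// (Hypothetical name, for illustration only.)
func memclrIdiom(a []int32) {
	for i := range a {
		a[i] = 0
	}
}

func main() {
	// Assumption: 2000000 elements, matching the _2000000 benchmark suffix.
	const n = 2000000
	buf := make([]int32, n)

	loop := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			memsetLoop(buf, 1)
		}
	})
	clr := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			memclrIdiom(buf)
		}
	})

	fmt.Println("loop:  ", loop)
	fmt.Println("memclr:", clr)
}
```

Running `disasm` on both functions in pprof should show whether the second
one really ends up in the runtime routine on a given toolchain.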
For comparison, the finely hand-tuned memclr implementation is at
https://golang.org/src/runtime/memclr_amd64.s (my computer being fairly
recent, it takes full advantage of the large registers available).
Can you try the same exercise on your hardware? It will likely
shed some light on the peculiar results you are seeing.
Regards
RD