Here is a more definitive reply than my original reply.
I got as far as this
func BenchmarkPopCountSimple(b *testing.B) {
sum := 0 // Avoid dead code elimination.
for i := 0; i < b.N; i++ {
sum += PopCount(0x1234567890abcdef)
}
}
As you can see from the objdump, BenchmarkPopCountSimple dead code is
eliminated.
func BenchmarkPopCountSimple(b *testing.B) {
0x4e38c0 31c9 XORL CX, CX
for i := 0; i < b.N; i++ {
0x4e38c2 eb03 JMP 0x4e38c7
0x4e38c4 48ffc1 INCQ CX
0x4e38c7 48398890010000 CMPQ CX, 0x190(AX)
0x4e38ce 7ff4 JG 0x4e38c4
}
0x4e38d0 c3 RET
I added an additional BenchmarkPopCountAlive benchmark.
var sum = 0 // Avoid dead code elimination.
func BenchmarkPopCountAlive(b *testing.B) {
sum = 0
for i := 0; i < b.N; i++ {
sum += PopCount(0x1234567890abcdef)
}
}
As you can see from the objdump, BenchmarkPopCountAlive code is not
eliminated.
For details, see the popcount.txt attachment.
Peter
On Monday, May 31, 2021 at 6:29:27 PM UTC-4 Paul S. R. Chisholm wrote:
> This is not a serious problem, but it surprised me. (By the way, how can I
> post a message with code formatting?)
>
> I'd like to create table-driven benchmarks:
> https://blog.golang.org/subtests.
>
> To that end, I started with code from *The Go Programming Language*:
>
> // PopCount is based on an example from chapter 2 of THE GO PROGRAMMING
> LANGUAGE.
> // Copyright © 2016 Alan A. A. Donovan & Brian W. Kernighan.
> // License: https://creativecommons.org/licenses/by-nc-sa/4.0/
>
> package popcount
>
> // pc[i] is the population count of i.
> var pc [256]byte
>
> func init() {
> for i := range pc {
> pc[i] = pc[i/2] + byte(i&1)
> }
> }
>
> // PopCount returns the population count (number of set bits) of x.
> func PopCount(x uint64) int {
> return int(pc[byte(x>>(0*8))] +
> pc[byte(x>>(1*8))] +
> pc[byte(x>>(2*8))] +
> pc[byte(x>>(3*8))] +
> pc[byte(x>>(4*8))] +
> pc[byte(x>>(5*8))] +
> pc[byte(x>>(6*8))] +
> pc[byte(x>>(7*8))])
> }
>
> I then wrote a simple test:
>
> // popcount_simple_test.go
> package popcount
>
> import "testing"
>
> func BenchmarkPopCountSimple(b *testing.B) {
> sum := 0 // Avoid dead code elimination.
> for i := 0; i < b.N; i++ {
> sum += PopCount(0x1234567890abcdef)
> }
> }
>
> Because I was paranoid about dead code elimination, I wrote an identical
> test that calls the function many times (inside the loop to call it b.N
> times, which should make no difference if the hypothetically-still-dead
> code is being eliminated):
>
> // popcount_slow_simple_test.go
> package popcount
>
> import "testing"
>
> func BenchmarkPopCountSlowSimple(b *testing.B) {
> sum := 0 // Avoid dead code elimination.
> for i := 0; i < b.N; i++ {
> // Exactly as before, but call the function many times.
> for j:= 0; j < 1_000_000; j++ {
> sum += PopCount(0x1234567890abcdef)
> }
> }
> }
>
> Then I wrote an "almost simple" test that uses testing.B.Run() but is
> hardwired to call the function being benchmarked:
>
> // popcount_almost_simple_test.go
> package popcount
>
> import "testing"
>
> func BenchmarkPopCountAlmostSimple(b *testing.B) {
> b.Run("BenchmarkPopCountAlmostSimple", func(b *testing.B) {
> sum := 0 // Avoid dead code elimination.
> for i := 0; i < b.N; i++ {
> sum += PopCount(0x1234567890abcdef)
> }
> })
> }
>
> Finally, I wrote a "nearly-simple" test that passes arguments to Run()
> that could have come from a table:
>
> // popcount_nearly_simple.go
> package popcount
>
> import "testing"
>
> func BenchmarkPopCountNearlySimple(b *testing.B) {
> f := PopCount
> name := "PopCountNearlySimple"
> b.Run(name, func(b *testing.B) {
> sum := 0 // Avoid dead code elimination.
> for i := 0; i < b.N; i++ {
> sum += f(0x1234567890abcdef)
> }
> })
> }
>
> The simple and almost-simple results are nearly identical (happily, the
> slow results are not), but the nearly-simple results are an order of
> magnitude slower:
>
> $ go version
> go version go1.16.4 windows/amd64
> $ go test -cpu=1 -bench=.
> goos: windows
> goarch: amd64
> pkg: gopl.io/popcount
> cpu: Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
> BenchmarkPopCountAlmostSimple/BenchmarkPopCountAlmostSimple
> 1000000000 0.6102 ns/op
> BenchmarkPopCountNearlySimple/PopCountNearlySimple
> 194925662 6.197 ns/op
> BenchmarkPopCountSimple
> 1000000000 0.5994 ns/op
> BenchmarkPopCountSlowSimple
> 1953 606194 ns/op
> PASS
> ok gopl.io/popcount 4.534s
>
> After reading this article:
>
>
> https://medium.com/a-journey-with-go/go-inlining-strategy-limitation-6b6d7fc3b1be
>
> I ran:
>
> $ go test -cpu=1 -bench=. -gcflags=-m 2>&1 | egrep 'inlining call to
> PopCount'
> .\popcount_almost_simple_test.go:9:19: inlining call to PopCount
> .\popcount_simple_test.go:8:18: inlining call to PopCount
> .\popcount_slow_simple_test.go:10:19: inlining call to PopCount
>
> That makes sense. 0.6 ns is less than a function call. The simple and
> almost-simple benchmarks can inline the call to the hardwired function, but
> the nearly-simple benchmark can't.
>
> In practice, this isn't much of a problem. Any function small enough to be
> inlined is unlikely to be a performance bottleneck. If it ever is, a
> non-table-driven benchmark can still measure it.
>
> Hope this helps. --PSRC
>
> P.S.: Out of curiosity, how can I post a message with fancy code examples
> like this one?
> https://groups.google.com/g/golang-nuts/c/5DgtH2Alt_I/m/hlsqdRSGAgAJ
>
>
--
You received this message because you are subscribed to the Google Groups
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/golang-nuts/30650fbf-c298-425b-b7ea-3c38708b33e9n%40googlegroups.com.
$ ls
alive_test.go popcount.go popcount_simple_test.go popcount.test popcount.txt
$ cat alive_test.go
package popcount
import "testing"
var sum = 0 // Avoid dead code elimination.
func BenchmarkPopCountAlive(b *testing.B) {
sum = 0
for i := 0; i < b.N; i++ {
sum += PopCount(0x1234567890abcdef)
}
}
$ go test -bench=.
goos: linux
goarch: amd64
pkg: local.mod/local/popcount
cpu: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
BenchmarkPopCountAlive-4 531541530 2.152 ns/op
BenchmarkPopCountSimple-4 1000000000 0.3122 ns/op
PASS
ok local.mod/local/popcount 1.721s
$ go tool objdump -S -s=BenchmarkPopCountSimple popcount.test
TEXT local.mod/local/popcount.BenchmarkPopCountSimple(SB)
/home/peter/Sync/gopath/mod/nuts/popcount/popcount_simple_test.go
func BenchmarkPopCountSimple(b *testing.B) {
0x4e38c0 31c9 XORL CX, CX
for i := 0; i < b.N; i++ {
0x4e38c2 eb03 JMP 0x4e38c7
0x4e38c4 48ffc1 INCQ CX
0x4e38c7 48398890010000 CMPQ CX, 0x190(AX)
0x4e38ce 7ff4 JG 0x4e38c4
}
0x4e38d0 c3 RET
$ go tool objdump -S -s=BenchmarkPopCountAlive popcount.test
TEXT local.mod/local/popcount.BenchmarkPopCountAlive(SB)
/home/peter/Sync/gopath/mod/nuts/popcount/alive_test.go
sum = 0
0x4e3840 48c705455a130000000000 MOVQ $0x0,
local.mod/local/popcount.sum(SB)
0x4e384b 31c9 XORL CX, CX
for i := 0; i < b.N; i++ {
0x4e384d eb53 JMP 0x4e38a2
return int(pc[byte(x>>(0*8))] +
0x4e384f 0fb615f9631300 MOVZX
local.mod/local/popcount.pc+239(SB), DX
pc[byte(x>>(1*8))] +
0x4e3856 0fb61dd0631300 MOVZX
local.mod/local/popcount.pc+205(SB), BX
return int(pc[byte(x>>(0*8))] +
0x4e385d 01da ADDL BX, DX
pc[byte(x>>(2*8))] +
0x4e385f 0fb61da5631300 MOVZX
local.mod/local/popcount.pc+171(SB), BX
pc[byte(x>>(1*8))] +
0x4e3866 01da ADDL BX, DX
pc[byte(x>>(3*8))] +
0x4e3868 0fb61d81631300 MOVZX
local.mod/local/popcount.pc+144(SB), BX
pc[byte(x>>(2*8))] +
0x4e386f 01da ADDL BX, DX
pc[byte(x>>(4*8))] +
0x4e3871 0fb61d60631300 MOVZX
local.mod/local/popcount.pc+120(SB), BX
pc[byte(x>>(3*8))] +
0x4e3878 01da ADDL BX, DX
pc[byte(x>>(5*8))] +
0x4e387a 0fb61d35631300 MOVZX
local.mod/local/popcount.pc+86(SB), BX
pc[byte(x>>(4*8))] +
0x4e3881 01da ADDL BX, DX
pc[byte(x>>(6*8))] +
0x4e3883 0fb61d0a631300 MOVZX
local.mod/local/popcount.pc+52(SB), BX
pc[byte(x>>(5*8))] +
0x4e388a 01da ADDL BX, DX
pc[byte(x>>(7*8))])
0x4e388c 0fb61ddf621300 MOVZX
local.mod/local/popcount.pc+18(SB), BX
pc[byte(x>>(6*8))] +
0x4e3893 01da ADDL BX, DX
return int(pc[byte(x>>(0*8))] +
0x4e3895 0fb6d2 MOVZX DL, DX
sum += PopCount(0x1234567890abcdef)
0x4e3898 480115f1591300 ADDQ DX,
local.mod/local/popcount.sum(SB)
for i := 0; i < b.N; i++ {
0x4e389f 48ffc1 INCQ CX
0x4e38a2 48398890010000 CMPQ CX, 0x190(AX)
0x4e38a9 7fa4 JG 0x4e384f
}
0x4e38ab c3 RET
$