I just benchmarked fastmap with integer keys. It is exactly the same map as
the one I published, but with integer keys and an xxh3 hash function for
integer values.
goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                      │ cacheint6/stats_arm64.txt │ puremapint/stats_arm64.txt   │
                      │          sec/op           │    sec/op      vs base       │
Cache2Hit/_______1-8          3.496n ± 0%            2.065n ± 0%   -40.93% (p=0.002 n=6)
Cache2Hit/______10-8          4.327n ± 2%            5.052n ± 0%   +16.74% (p=0.002 n=6)
Cache2Hit/_____100-8          4.364n ± 4%            5.731n ± 12%  +31.32% (p=0.002 n=6)
Cache2Hit/____1000-8          4.248n ± 5%            5.935n ± 3%   +39.73% (p=0.002 n=6)
Cache2Hit/___10000-8          5.167n ± 1%            7.553n ± 1%   +46.18% (p=0.002 n=6)
Cache2Hit/__100000-8          6.763n ± 1%            9.300n ± 3%   +37.52% (p=0.002 n=6)
Cache2Hit/_1000000-8          18.25n ± 1%            26.32n ± 1%   +44.26% (p=0.002 n=6)
Cache2Hit/10000000-8          28.11n ± 0%            37.95n ± 0%   +35.05% (p=0.002 n=6)
geomean                       6.881n                 8.405n        +22.15%
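
For reference, here is what a dedicated integer hasher boils down to. This is
not the xxh3 variant used above, just an illustrative multiply/xorshift mixer
(the constants here are splitmix64's):

func hashUint64(key uint64) uint64 {
        // Two multiply/xorshift rounds spread the key bits over the whole
        // 64-bit result; any similar finalizer works for integer keys.
        key ^= key >> 30
        key *= 0xbf58476d1ce4e5b9
        key ^= key >> 27
        key *= 0x94d049bb133111eb
        key ^= key >> 31
        return key
}
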
On Thursday, June 26, 2025 at 09:37:45 UTC+2, [email protected] wrote:
> Hello,
>
> the code is published here https://github.com/chmike/fastmap.
>
> On Wednesday, June 25, 2025 at 12:16:54 UTC+2, [email protected] wrote:
>
>> By swap pop I mean that a deleted item is replaced by the last item of
>> the group or the chain. This will indeed move items.
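>>
>> For illustration, a minimal swap-pop removal could look like this (the
>> names are made up; this is a sketch, not the published code):
>>
>> // removeAt deletes entry i by overwriting it with the last entry and
>> // shrinking the slices, so no tombstone is needed; the moved item
>> // simply changes slot within its group or chain.
>> func removeAt(keys, vals []uint64, i int) ([]uint64, []uint64) {
>>         last := len(keys) - 1
>>         keys[i], vals[i] = keys[last], vals[last]
>>         return keys[:last], vals[:last]
>> }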
>>
>> I have implemented a new version using 8-bit top hashes, tombstones, a
>> perfect match, quadratic probing and an extensible hash table. It leaves
>> items in place and might thus be compatible with the iteration
>> requirements of Go maps.
>>
>> I plan to publish it on GitHub but still need to verify item deletion, on
>> which I haven't spent much time so far. The following benchmark is with
>> tables of 256 groups. A table is split when it reaches 90%. The filling
>> rate is then far better than with my three-table architecture. Using
>> quadratic probing would then be a better choice.
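>>
>> As a sketch, quadratic probing over groups with a triangular stride looks
>> like this (assuming a power-of-two group count; this is the textbook
>> form, not necessarily my exact code):
>>
>> // probeGroups visits group indexes h, h+1, h+3, h+6, ... modulo the
>> // group count. With a power-of-two count this visits every group once.
>> func probeGroups(h1, numGroups uint64, visit func(g uint64) bool) {
>>         mask := numGroups - 1
>>         g := h1 & mask
>>         for stride := uint64(1); ; stride++ {
>>                 if visit(g) { // true when the key or a free slot is found
>>                         return
>>                 }
>>                 g = (g + stride) & mask
>>         }
>> }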
>>
>> goos: darwin
>> goarch: arm64
>> pkg: fastmap/map
>> cpu: Apple M2
>>                        │ altmap/stats_arm64.txt │ stdmap/stats_arm64.txt      │
>>                        │         sec/op         │    sec/op      vs base      │
>> Cache2Hit/_______1-8          6.386n ± 0%           7.081n ± 0%   +10.88% (p=0.000 n=10)
>> Cache2Hit/______10-8          8.813n ± 0%           8.114n ± 30%        ~ (p=0.481 n=10)
>> Cache2Hit/_____100-8          8.473n ± 7%          15.720n ± 8%   +85.52% (p=0.000 n=10)
>> Cache2Hit/____1000-8          8.735n ± 2%          13.715n ± 23%  +57.01% (p=0.000 n=10)
>> Cache2Hit/___10000-8          11.04n ± 1%           20.38n ± 6%   +84.56% (p=0.000 n=10)
>> Cache2Hit/__100000-8          13.19n ± 1%           17.62n ± 17%  +33.59% (p=0.000 n=10)
>> Cache2Hit/_1000000-8          52.02n ± 1%           55.45n ± 0%    +6.59% (p=0.001 n=10)
>> Cache2Hit/10000000-8          64.91n ± 1%           68.12n ± 3%    +4.96% (p=0.002 n=10)
>> Cache2Miss/_______1-8         5.003n ± 0%           8.243n ± 0%   +64.73% (p=0.000 n=10)
>> Cache2Miss/______10-8         5.878n ± 3%           9.238n ± 11%  +57.18% (p=0.000 n=10)
>> Cache2Miss/_____100-8         6.037n ± 4%          12.600n ± 9%  +108.70% (p=0.000 n=10)
>> Cache2Miss/____1000-8         6.992n ± 7%          12.015n ± 17%  +71.83% (p=0.000 n=10)
>> Cache2Miss/___10000-8         10.49n ± 2%           17.66n ± 9%   +68.40% (p=0.000 n=10)
>> Cache2Miss/__100000-8         20.73n ± 1%           26.56n ± 2%   +28.09% (p=0.000 n=10)
>> Cache2Miss/_1000000-8         33.52n ± 1%           36.83n ± 1%    +9.89% (p=0.000 n=10)
>> Cache2Miss/10000000-8         48.62n ± 1%           50.84n ± 1%    +4.56% (p=0.000 n=10)
>> geomean                       13.25n                18.38n        +38.75%
>>
>> To be honest I can't fully explain these numbers. Reducing false
>> positives by using an exact match and 8-bit top hashes might be part of
>> it. It could also be the hash function. xxh3 on arm64 is pure Go only; it
>> is fast for short strings but bad for long strings, as it doesn't use
>> SIMD (yet). The 8-byte string keys are fmt.Sprintf("%7d ", i) for hits
>> and fmt.Sprintf("%7d-", i) for misses. I hope these numbers don't result
>> from a bug in my code.
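>>
>> Concretely, the benchmark keys are built like this (a sketch of the setup
>> only, not the full benchmark):
>>
>> import "fmt"
>>
>> // makeKeys builds n hit keys (inserted into the map) and n miss keys
>> // (same shape, never inserted).
>> func makeKeys(n int) (hit, miss []string) {
>>         hit, miss = make([]string, n), make([]string, n)
>>         for i := range hit {
>>                 hit[i] = fmt.Sprintf("%7d ", i)
>>                 miss[i] = fmt.Sprintf("%7d-", i)
>>         }
>>         return hit, miss
>> }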
>>
>> For the top hash I use the H1, H2 terminology of Swiss tables. H2 is the
>> 8 least significant bits of the hash. 0x00 is used for free slots and
>> 0x80 for the tombstone. Computing H2 thus implies a conditional branch,
>> but the CPU should predict its outcome correctly most of the time. I
>> benchmarked 7-bit top hashes, but they ended up being slower; I assume
>> this was due to the increased number of false positives. Using 0x80 for
>> the tombstone saves one operation when looking for free slots in a group.
>> If the speedup is due to the 8-bit top hashes, I suspect the C++ Swiss
>> table might also benefit from it.
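>>
>> To make that branch concrete, the tag computation could look like the
>> sketch below. Only the 0x00/0x80 sentinels are as described above; the
>> remapping of colliding values is an illustrative assumption.
>>
>> const (
>>         ctrlFree      = 0x00 // empty slot
>>         ctrlTombstone = 0x80 // deleted slot
>> )
>>
>> // topHash returns the 8-bit tag stored in a group's control byte.
>> // The two sentinel values are remapped (illustrative choice); this is
>> // the conditional branch mentioned above, and it is almost never taken.
>> func topHash(hash uint64) uint8 {
>>         h2 := uint8(hash)
>>         if h2 == ctrlFree || h2 == ctrlTombstone {
>>                 h2 |= 0x01
>>         }
>>         return h2
>> }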
>>
>> On amd64, the performance is on par with the Go map, but I don't use SIMD
>> instructions.
>>
>> On Monday, June 23, 2025 at 22:41:46 UTC+2, Keith Randall wrote:
>>
>>> > As key comparison is very fast (no secondary memory access) a simple
>>> sequential key comparison is fast.
>>>
>>> We've looked at that, see CL 626699
>>> <https://go-review.googlesource.com/c/go/+/626699>. Maybe helps in some
>>> cases, seems to hurt for miss lookups. We decided not to move forward with
>>> it, but it is definitely something to think about.
>>>
>>> > I use 8 bit top hashes instead of 7
>>>
>>> How do you record a presence bit?
>>>
>>> > We can use a swap pop strategy.
>>>
>>> Not sure exactly what that is, but it sounds incompatible with
>>> iteration. Our required iteration semantics mean it is very hard to move
>>> keys.
>>>
>>> > There is a small space usage penalty but when a table is split at 80%
>>> for instance, it becomes 40%. It is then very close to need a grow which I
>>> considered a waste of effort.
>>>
>>> I think when we split a table we use the max-sized table for each part?
>>> I'm not sure what effect you're seeing here then.
>>>
>>> > I also saw that function calls are expensive. I then used manual
>>> inlining for the Get method. It is nice to avoid for code readability and
>>> maintenance though.
>>>
>>> We could theoretically have the compiler do this for builtin maps.
>>> On Sunday, June 22, 2025 at 1:28:26 AM UTC-7 [email protected]
>>> wrote:
>>>
>>>> Thank you very much for your response. I'll summarize here the main
>>>> differences with the Go map.
>>>>
>>>> I very easily got faster than the Go map with integers and the like by
>>>> dropping the Swiss groups and reducing the group size to fit a cache
>>>> line. As key comparison is very fast (no secondary memory access), a
>>>> simple sequential key comparison is fast. The speed gain might be due to
>>>> the hash function, as I rewrote the xxh3 hasher for uint64 values. I
>>>> didn't spend much time exploring this further as it wasn't my use case.
>>>>
>>>> String keys definitely benefit from top hashes (Swiss-table groups). I
>>>> use 8-bit top hashes instead of 7, which reduces the number of false
>>>> positives, and I use a perfect match, which also reduces false positives
>>>> on arm64. amd64 uses a perfect match as it uses the SIMD instructions.
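>>>>
>>>> For reference, a perfect (exact) 8-bit tag match against the eight
>>>> control bytes of a group can be done without SIMD. This SWAR sketch is
>>>> illustrative, not necessarily the exact arm64 code:
>>>>
>>>> const (
>>>>         lsb  uint64 = 0x0101010101010101 // 0x01 in every byte
>>>>         low7 uint64 = 0x7f7f7f7f7f7f7f7f // 0x7f in every byte
>>>> )
>>>>
>>>> // matchTag sets the top bit of byte i in the result exactly when byte
>>>> // i of ctrl equals tag. Unlike the cheaper subtract-based trick, this
>>>> // form has no false positives.
>>>> func matchTag(ctrl uint64, tag uint8) uint64 {
>>>>         x := ctrl ^ (lsb * uint64(tag)) // matching bytes become 0x00
>>>>         return ^(((x & low7) + low7) | x | low7)
>>>> }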
>>>>
>>>> Another difference is that the Go map uses quadratic probing and my map
>>>> doesn't. The popular wisdom is that probing is faster because the
>>>> probability that the data is in cache is slightly higher and the table
>>>> filling is better. Unfortunately this increases the number of required
>>>> tests, where false positives increase the penalty. The numbers I showed
>>>> are for a different kind of table. It is composed of three arrays: T1,
>>>> T2 and T3. T1 is a conventional table of 8-item Swiss groups. When a
>>>> group is full, the item is stored in the overflow hash table T2. When
>>>> the group in T2 is full, the item is appended to T3, which is treated as
>>>> a sequence of groups; I do a linear search in this table containing the
>>>> overflows of overflows. I also implemented a table using quadratic
>>>> probing, and it was faster than the Go map on arm64. I assume this is
>>>> due to the 8-bit top hashes reducing false positives, but I'm not sure.
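>>>>
>>>> To make the layout concrete, a rough sketch of the three arrays (the
>>>> field names and types are illustrative, not the published code):
>>>>
>>>> // group holds up to 8 entries; the 8 control bytes fit in one word.
>>>> type group struct {
>>>>         ctrl [8]uint8  // 8-bit top hashes, 0x00 = free
>>>>         keys [8]string
>>>>         vals [8]uint64
>>>> }
>>>>
>>>> // table is the three-array layout: T1 is addressed by the hash, T2
>>>> // takes the overflow of a full T1 group, and T3 is scanned linearly
>>>> // for the overflows of overflows.
>>>> type table struct {
>>>>         t1 []group // primary groups, indexed by H1
>>>>         t2 []group // overflow hash table (much smaller than t1)
>>>>         t3 []group // append-only groups, searched sequentially
>>>> }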
>>>>
>>>> I tested independent hashes for T1 and T2, which then require the use of
>>>> tombstones, and related hashes. By related hash I mean that the index in
>>>> T2 is, for instance, the Fibonacci hash of the index in T1. The best
>>>> filling rates of T2 were obtained with a random permutation mapping. The
>>>> size ratio between T1 and T2 should in theory be 2:1, but the best
>>>> performance was with 16:1. I assume this is because the T2 data is
>>>> hotter (in cache) than with a bigger table. The benefit of such a
>>>> deterministic relation between the T1 and T2 indexes is that we don't
>>>> need tombstones; we can use a swap-pop strategy. In summary, the idea
>>>> was to minimize the number of probes required to find a key.
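>>>>
>>>> As an illustration of such a related hash, a Fibonacci-style mapping
>>>> from a T1 index to a T2 index could look like this (the parameter names
>>>> are assumptions):
>>>>
>>>> // t2Index derives the T2 group index deterministically from the T1
>>>> // index, so a key's possible locations are fixed and no tombstones are
>>>> // needed. The constant is 2^64 divided by the golden ratio.
>>>> func t2Index(t1Index uint64, t2Bits uint) uint64 {
>>>>         return (t1Index * 0x9e3779b97f4a7c15) >> (64 - t2Bits)
>>>> }
>>>>
>>>> With a 16:1 size ratio, t2Bits is simply four less than the number of
>>>> bits used to index T1.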
>>>>
>>>> The table also uses an extensible hash, for which I have simpler code
>>>> than the Go map, though it isn't faster. This directory indirection
>>>> introduces a significant performance penalty. In a later benchmark,
>>>> after the one I published here, I saw a significant performance increase
>>>> by removing the directory. When the hash function is good, most tables
>>>> have the same depth.
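>>>>
>>>> A minimal sketch of the directory indirection of an extensible hash
>>>> (simplified, with illustrative names; not the published code):
>>>>
>>>> // extTable is a fixed-size table; its groups are omitted here.
>>>> type extTable struct {
>>>>         localDepth uint // hash bits this table actually distinguishes
>>>> }
>>>>
>>>> func (t *extTable) get(hash, key uint64) (uint64, bool) {
>>>>         return 0, false // probe the fixed-size table; omitted
>>>> }
>>>>
>>>> // extMap routes a lookup through the directory: the globalDepth top
>>>> // bits of the hash select a table. When all tables end up at the same
>>>> // depth, this indirection is pure overhead, which is why dropping the
>>>> // directory helps.
>>>> type extMap struct {
>>>>         globalDepth uint
>>>>         dir         []*extTable // len(dir) == 1 << globalDepth
>>>> }
>>>>
>>>> func (m *extMap) get(hash, key uint64) (uint64, bool) {
>>>>         return m.dir[hash>>(64-m.globalDepth)].get(hash, key)
>>>> }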
>>>>
>>>> Another optimization compared to the Go map is that my tables in the
>>>> directory don't grow. Using fixed-size arrays gives some performance
>>>> advantage, as the boundaries and hash mask are constants. There is a
>>>> small space usage penalty, but when a table is split at 80% for
>>>> instance, each half becomes 40% full. It is then very close to needing a
>>>> grow, which I considered a waste of effort. For a small number of items
>>>> fitting in one table, it may be worth using a growable table, which I
>>>> didn't bother to implement.
>>>>
>>>> I also saw that function calls are expensive, so I used manual inlining
>>>> for the Get method. It would be nicer to avoid that for code readability
>>>> and maintenance, though.
>>>>
>>>> I tried assembly on arm64, where I could experiment with SIMD
>>>> instructions, but due to the assembly function call overhead it was
>>>> slower than pure Go. Passing arguments by register wouldn't help. We
>>>> would need assembly inlining, but I understand that this would open a
>>>> can of worms and probably create more problems than it would solve.
>>>>
>>>> If you are still interested and want some specific numbers I would be
>>>> glad to give them.
>>>>
>>>> Here is the cache miss benchmark that shows the benefit of reducing the
>>>> number of probes:
>>>>
>>>> goos: darwin
>>>> goarch: arm64
>>>> pkg: fastCache/map
>>>> cpu: Apple M2
>>>>                         │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt   │
>>>>                         │          sec/op          │    sec/op      vs base       │
>>>> Cache2Miss/_______1-8          4.721n ± 0%             8.298n ± 0%   +75.75% (p=0.002 n=6)
>>>> Cache2Miss/______10-8          5.452n ± 0%             9.094n ± 25%  +66.81% (p=0.002 n=6)
>>>> Cache2Miss/_____100-8          5.726n ± 6%            11.550n ± 5%  +101.71% (p=0.002 n=6)
>>>> Cache2Miss/____1000-8          6.572n ± 4%            12.550n ± 21%  +90.96% (p=0.002 n=6)
>>>> Cache2Miss/___10000-8          9.428n ± 4%            18.400n ± 7%   +95.17% (p=0.002 n=6)
>>>> Cache2Miss/__100000-8          14.79n ± 1%             26.60n ± 2%   +79.79% (p=0.002 n=6)
>>>> Cache2Miss/_1000000-8          29.11n ± 1%             36.83n ± 2%   +26.54% (p=0.002 n=6)
>>>> Cache2Miss/10000000-8          40.44n ± 5%             51.12n ± 1%   +26.41% (p=0.002 n=6)
>>>> geomean                        10.60n                  17.80n        +67.98%
>>>>
>>>> Here is the cache hit benchmark comparing dirstr12 with flat8, which has
>>>> no directory.
>>>>
>>>> goos: darwin
>>>> goarch: arm64
>>>> pkg: fastCache/map
>>>> cpu: Apple M2
>>>>                        │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt   │
>>>>                        │        sec/op         │    sec/op      vs base     │
>>>> Cache2Hit/_______1-8         6.047n ± 0%            6.246n ± 5%   +3.28% (p=0.002 n=6)
>>>> Cache2Hit/______10-8         8.172n ± 0%            8.499n ± 0%   +4.00% (p=0.002 n=6)
>>>> Cache2Hit/_____100-8         7.893n ± 0%            8.256n ± 5%   +4.61% (p=0.002 n=6)
>>>> Cache2Hit/____1000-8         7.585n ± 6%            8.228n ± 2%   +8.48% (p=0.002 n=6)
>>>> Cache2Hit/___10000-8         9.191n ± 0%           10.375n ± 1%  +12.88% (p=0.002 n=6)
>>>> Cache2Hit/__100000-8         10.03n ± 4%            12.14n ± 1%  +21.15% (p=0.002 n=6)
>>>> Cache2Hit/_1000000-8         40.79n ± 1%            42.27n ± 4%   +3.64% (p=0.002 n=6)
>>>> Cache2Hit/10000000-8         47.29n ± 1%            56.49n ± 9%  +19.47% (p=0.002 n=6)
>>>> geomean                      12.31n                 13.47n        +9.48%
>>>>
>>>> I currently don't have access to an x86 computer. Thank you for your
>>>> mail.
>>>>
>>>>
>>>> On Sunday, June 22, 2025 at 00:12:25 UTC+2, Keith Randall wrote:
>>>>
>>>>> Certainly interesting. We'd be interested in incorporating some ideas
>>>>> from your experiments if they translate over.
>>>>>
>>>>> One thing to caution, you showed no numbers for space utilization, map
>>>>> miss speed, whether your map can do iteration, etc. It is possible to
>>>>> trade
>>>>> off one of these for other ones, so it is not immediately obvious that
>>>>> your
>>>>> techniques would apply to the builtin map. For instance, if you're using
>>>>> 10x the space to get 10% speedup, we probably wouldn't want to do that
>>>>> for
>>>>> the general map. (Not saying that's what you did, just pointing out raw
>>>>> speed is not the only consideration.)
>>>>>
>>>>> On Thursday, June 19, 2025 at 4:11:28 AM UTC-7 [email protected]
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> trying to implement a fast cache, I inadvertently entered a rabbit
>>>>>> hole that led me to implement my own map. In the process I tried to
>>>>>> make it faster than the Go map, just to see if it is possible. I
>>>>>> worked on it for weeks, trying out various architectures and methods.
>>>>>>
>>>>>> On my MacBook Air M2, I get the following benchmarks for a Get
>>>>>> operation. The numbers in the benchmark names are the number of items
>>>>>> inserted in the table. My keys are 8-byte strings.
>>>>>>
>>>>>> goos: darwin
>>>>>> goarch: arm64
>>>>>> pkg: fastCache/map
>>>>>> cpu: Apple M2
>>>>>>                        │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt   │
>>>>>>                        │          sec/op          │    sec/op      vs base       │
>>>>>> Cache2Hit/_______1-8         6.151n ± 6%              7.087n ± 1%   +15.22% (p=0.002 n=6)
>>>>>> Cache2Hit/______10-8         8.491n ± 0%              8.156n ± 29%        ~ (p=0.394 n=6)
>>>>>> Cache2Hit/_____100-8         8.141n ± 7%             14.185n ± 13%  +74.24% (p=0.002 n=6)
>>>>>> Cache2Hit/____1000-8         8.252n ± 3%             10.635n ± 39%  +28.89% (p=0.002 n=6)
>>>>>> Cache2Hit/___10000-8         10.45n ± 2%              20.99n ± 4%  +100.81% (p=0.002 n=6)
>>>>>> Cache2Hit/__100000-8         12.16n ± 1%              19.11n ± 10%  +57.05% (p=0.002 n=6)
>>>>>> Cache2Hit/_1000000-8         42.28n ± 2%              47.90n ± 2%   +13.29% (p=0.002 n=6)
>>>>>> Cache2Hit/10000000-8         56.38n ± 12%             61.82n ± 6%         ~ (p=0.065 n=6)
>>>>>> geomean                      13.44n                   17.86n        +32.91%
>>>>>>
>>>>>> On my amd64 11th-gen i5, I get the following benchmarks.
>>>>>>
>>>>>> goos: linux
>>>>>> goarch: amd64
>>>>>> pkg: fastCache/map
>>>>>> cpu: 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
>>>>>>                         │ dirstr12/stats_amd64.txt │ puremapstr/stats_amd64.txt   │
>>>>>>                         │          sec/op          │    sec/op      vs base       │
>>>>>> Cache2Hit/_______1-12          9.207n ± 1%             7.506n ± 3%   -18.48% (p=0.002 n=6)
>>>>>> Cache2Hit/______10-12          9.223n ± 0%             8.806n ± 6%         ~ (p=0.058 n=6)
>>>>>> Cache2Hit/_____100-12          9.279n ± 2%            10.175n ± 3%    +9.66% (p=0.002 n=6)
>>>>>> Cache2Hit/____1000-12          10.45n ± 2%             11.29n ± 3%    +8.04% (p=0.002 n=6)
>>>>>> Cache2Hit/___10000-12          16.00n ± 2%             17.21n ± 5%    +7.59% (p=0.002 n=6)
>>>>>> Cache2Hit/__100000-12          22.20n ± 17%            24.73n ± 22%  +11.42% (p=0.026 n=6)
>>>>>> Cache2Hit/_1000000-12          87.75n ± 2%             91.05n ± 5%    +3.76% (p=0.009 n=6)
>>>>>> Cache2Hit/10000000-12          104.2n ± 2%             105.6n ± 5%         ~ (p=0.558 n=6)
>>>>>> geomean                        20.11n                  20.49n         +1.90%
>>>>>>
>>>>>> On amd64 the performance is on par with the Go map, but the Go map
>>>>>> uses inlined SIMD instructions, which I don't use because I don't have
>>>>>> access to them in pure Go. I use xxh3 right out of the box for the
>>>>>> hash function. The differences are not due to the hash function.
>>>>>>
>>>>>> If there is interest in this, please let me know.
>>>>>>
>>>>>>
>>>>>>
>>>>>>