I just benchmarked fastmap with integer keys. It is exactly the same map as
the one I published, but with integer keys and an xxh3 hash function for
integer values.
goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                      │ cacheint6/stats_arm64.txt │ puremapint/stats_arm64.txt   │
                      │          sec/op           │    sec/op      vs base       │
Cache2Hit/_______1-8          3.496n ± 0%            2.065n ± 0%   -40.93% (p=0.002 n=6)
Cache2Hit/______10-8          4.327n ± 2%            5.052n ± 0%   +16.74% (p=0.002 n=6)
Cache2Hit/_____100-8          4.364n ± 4%            5.731n ± 12%  +31.32% (p=0.002 n=6)
Cache2Hit/____1000-8          4.248n ± 5%            5.935n ± 3%   +39.73% (p=0.002 n=6)
Cache2Hit/___10000-8          5.167n ± 1%            7.553n ± 1%   +46.18% (p=0.002 n=6)
Cache2Hit/__100000-8          6.763n ± 1%            9.300n ± 3%   +37.52% (p=0.002 n=6)
Cache2Hit/_1000000-8          18.25n ± 1%            26.32n ± 1%   +44.26% (p=0.002 n=6)
Cache2Hit/10000000-8          28.11n ± 0%            37.95n ± 0%   +35.05% (p=0.002 n=6)
geomean                       6.881n                 8.405n        +22.15%
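
For reference, here is what a dedicated integer hasher boils down to. This is
not the xxh3 variant used above, just an illustrative multiply/xorshift mixer
(the constants here are splitmix64's):

func hashUint64(key uint64) uint64 {
        // Two multiply/xorshift rounds spread the key bits over the whole
        // 64-bit result; any similar finalizer works for integer keys.
        key ^= key >> 30
        key *= 0xbf58476d1ce4e5b9
        key ^= key >> 27
        key *= 0x94d049bb133111eb
        key ^= key >> 31
        return key
}
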
On Thursday, June 26, 2025 at 09:37:45 UTC+2, [email protected] wrote:
> Hello,
>
> the code is published here https://github.com/chmike/fastmap.
>
> On Wednesday, June 25, 2025 at 12:16:54 UTC+2, [email protected] wrote:
>
>> By swap pop I mean that a deleted item is replaced by the last item of
>> the group or the chain. This will indeed move items.
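>>
>> For illustration, a minimal swap-pop removal could look like this (the
>> names are made up; this is a sketch, not the published code):
>>
>> // removeAt deletes entry i by overwriting it with the last entry and
>> // shrinking the slices, so no tombstone is needed; the moved item
>> // simply changes slot within its group or chain.
>> func removeAt(keys, vals []uint64, i int) ([]uint64, []uint64) {
>>         last := len(keys) - 1
>>         keys[i], vals[i] = keys[last], vals[last]
>>         return keys[:last], vals[:last]
>> }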
>>
>> I have implemented a new version using 8-bit top hashes, tombstones, a
>> perfect match, quadratic probing and an extensible hash table. It leaves
>> items in place and might thus be compatible with the iteration
>> requirements of Go maps.
>>
>> I plan to publish it on GitHub but still need to verify item deletion, on
>> which I haven't spent much time so far. The following benchmark is with
>> tables of 256 groups. A table is split when it reaches 90%. The filling
>> rate is then far better than with my three-table architecture. Using
>> quadratic probing would then be a better choice.
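>>
>> As a sketch, quadratic probing over groups with a triangular stride looks
>> like this (assuming a power-of-two group count; this is the textbook
>> form, not necessarily my exact code):
>>
>> // probeGroups visits group indexes h, h+1, h+3, h+6, ... modulo the
>> // group count. With a power-of-two count this visits every group once.
>> func probeGroups(h1, numGroups uint64, visit func(g uint64) bool) {
>>         mask := numGroups - 1
>>         g := h1 & mask
>>         for stride := uint64(1); ; stride++ {
>>                 if visit(g) { // true when the key or a free slot is found
>>                         return
>>                 }
>>                 g = (g + stride) & mask
>>         }
>> }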
>>
>> goos: darwin
>> goarch: arm64
>> pkg: fastmap/map
>> cpu: Apple M2
>>                        │ altmap/stats_arm64.txt │ stdmap/stats_arm64.txt      │
>>                        │         sec/op         │    sec/op      vs base      │
>> Cache2Hit/_______1-8          6.386n ± 0%           7.081n ± 0%   +10.88% (p=0.000 n=10)
>> Cache2Hit/______10-8          8.813n ± 0%           8.114n ± 30%        ~ (p=0.481 n=10)
>> Cache2Hit/_____100-8          8.473n ± 7%          15.720n ± 8%   +85.52% (p=0.000 n=10)
>> Cache2Hit/____1000-8          8.735n ± 2%          13.715n ± 23%  +57.01% (p=0.000 n=10)
>> Cache2Hit/___10000-8          11.04n ± 1%           20.38n ± 6%   +84.56% (p=0.000 n=10)
>> Cache2Hit/__100000-8          13.19n ± 1%           17.62n ± 17%  +33.59% (p=0.000 n=10)
>> Cache2Hit/_1000000-8          52.02n ± 1%           55.45n ± 0%    +6.59% (p=0.001 n=10)
>> Cache2Hit/10000000-8          64.91n ± 1%           68.12n ± 3%    +4.96% (p=0.002 n=10)
>> Cache2Miss/_______1-8         5.003n ± 0%           8.243n ± 0%   +64.73% (p=0.000 n=10)
>> Cache2Miss/______10-8         5.878n ± 3%           9.238n ± 11%  +57.18% (p=0.000 n=10)
>> Cache2Miss/_____100-8         6.037n ± 4%          12.600n ± 9%  +108.70% (p=0.000 n=10)
>> Cache2Miss/____1000-8         6.992n ± 7%          12.015n ± 17%  +71.83% (p=0.000 n=10)
>> Cache2Miss/___10000-8         10.49n ± 2%           17.66n ± 9%   +68.40% (p=0.000 n=10)
>> Cache2Miss/__100000-8         20.73n ± 1%           26.56n ± 2%   +28.09% (p=0.000 n=10)
>> Cache2Miss/_1000000-8         33.52n ± 1%           36.83n ± 1%    +9.89% (p=0.000 n=10)
>> Cache2Miss/10000000-8         48.62n ± 1%           50.84n ± 1%    +4.56% (p=0.000 n=10)
>> geomean                       13.25n                18.38n        +38.75%
>>
>> To be honest I can't fully explain these numbers. Reducing false
>> positives by using an exact match and 8-bit top hashes might be part of
>> it. It could also be the hash function. xxh3 on arm64 is pure Go only; it
>> is fast for short strings but bad for long strings, as it doesn't use
>> SIMD (yet). The 8-byte string keys are fmt.Sprintf("%7d ", i) for hits
>> and fmt.Sprintf("%7d-", i) for misses. I hope these numbers don't result
>> from a bug in my code.
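>>
>> Concretely, the benchmark keys are built like this (a sketch of the setup
>> only, not the full benchmark):
>>
>> import "fmt"
>>
>> // makeKeys builds n hit keys (inserted into the map) and n miss keys
>> // (same shape, never inserted).
>> func makeKeys(n int) (hit, miss []string) {
>>         hit, miss = make([]string, n), make([]string, n)
>>         for i := range hit {
>>                 hit[i] = fmt.Sprintf("%7d ", i)
>>                 miss[i] = fmt.Sprintf("%7d-", i)
>>         }
>>         return hit, miss
>> }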
>>
>> For the top hash I use the H1, H2 terminology of Swiss tables. H2 is the
>> 8 least significant bits of the hash. 0x00 is used for free slots and
>> 0x80 for the tombstone. Computing H2 thus implies a conditional branch,
>> but the CPU should predict its outcome correctly most of the time. I
>> benchmarked 7-bit top hashes, but they ended up being slower; I assume
>> this was due to the increased number of false positives. Using 0x80 for
>> the tombstone saves one operation when looking for free slots in a group.
>> If the speedup is due to the 8-bit top hashes, I suspect the C++ Swiss
>> table might also benefit from it.
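>>
>> To make that branch concrete, the tag computation could look like the
>> sketch below. Only the 0x00/0x80 sentinels are as described above; the
>> remapping of colliding values is an illustrative assumption.
>>
>> const (
>>         ctrlFree      = 0x00 // empty slot
>>         ctrlTombstone = 0x80 // deleted slot
>> )
>>
>> // topHash returns the 8-bit tag stored in a group's control byte.
>> // The two sentinel values are remapped (illustrative choice); this is
>> // the conditional branch mentioned above, and it is almost never taken.
>> func topHash(hash uint64) uint8 {
>>         h2 := uint8(hash)
>>         if h2 == ctrlFree || h2 == ctrlTombstone {
>>                 h2 |= 0x01
>>         }
>>         return h2
>> }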
>>
>> On amd64, the performance is on par with the Go map, but I don't use SIMD
>> instructions.
>>
>> On Monday, June 23, 2025 at 22:41:46 UTC+2, Keith Randall wrote:
>>
>>> > As key comparison is very fast (no secondary memory access) a simple
>>> sequential key comparison is fast.
>>>
>>> We've looked at that, see CL 626699
>>> <https://go-review.googlesource.com/c/go/+/626699>. Maybe helps in some
>>> cases, seems to hurt for miss lookups. We decided not to move forward with
>>> it, but it is definitely something to think about.
>>>
>>> > I use 8 bit top hashes instead of 7
>>>
>>> How do you record a presence bit?
>>>
>>> > We can use a swap pop strategy.
>>>
>>> Not sure exactly what that is, but it sounds incompatible with
>>> iteration. Our required iteration semantics mean it is very hard to move
>>> keys.
>>>
>>> > There is a small space usage penalty but when a table is split at 80%
>>> for instance, it becomes 40%. It is then very close to need a grow which I
>>> considered a waste of effort.
>>>
>>> I think when we split a table we use the max-sized table for each part?
>>> I'm not sure what effect you're seeing here then.
>>>
>>> > I also saw that function calls are expensive. I then used manual
>>> inlining for the Get method. It is nice to avoid for code readability and
>>> maintenance though.
>>>
>>> We could theoretically have the compiler do this for builtin maps.
>>> On Sunday, June 22, 2025 at 1:28:26 AM UTC-7 [email protected]
>>> wrote:
>>>
>>>> Thank you very much for your response. I'll summarize here the main
>>>> differences with the Go map.
>>>>
>>>> I very easily got faster than the Go map with integers and the like by
>>>> dropping the Swiss groups and reducing the group size to fit a cache
>>>> line. As key comparison is very fast (no secondary memory access), a
>>>> simple sequential key comparison is fast. The speed gain might be due to
>>>> the hash function, as I rewrote the xxh3 hasher for uint64 values. I
>>>> didn't spend much time exploring this further as it wasn't my use case.
>>>>
>>>> String keys definitely benefit from top hashes (Swiss-table groups). I
>>>> use 8-bit top hashes instead of 7, which reduces the number of false
>>>> positives, and I use a perfect match, which also reduces false positives
>>>> on arm64. amd64 uses a perfect match as it uses the SIMD instructions.
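>>>>
>>>> For reference, a perfect (exact) 8-bit tag match against the eight
>>>> control bytes of a group can be done without SIMD. This SWAR sketch is
>>>> illustrative, not necessarily the exact arm64 code:
>>>>
>>>> const (
>>>>         lsb  uint64 = 0x0101010101010101 // 0x01 in every byte
>>>>         low7 uint64 = 0x7f7f7f7f7f7f7f7f // 0x7f in every byte
>>>> )
>>>>
>>>> // matchTag sets the top bit of byte i in the result exactly when byte
>>>> // i of ctrl equals tag. Unlike the cheaper subtract-based trick, this
>>>> // form has no false positives.
>>>> func matchTag(ctrl uint64, tag uint8) uint64 {
>>>>         x := ctrl ^ (lsb * uint64(tag)) // matching bytes become 0x00
>>>>         return ^(((x & low7) + low7) | x | low7)
>>>> }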
>>>>
>>>> Another difference is that the Go map uses quadratic probing and my map
>>>> doesn't. The popular wisdom is that probing is faster because the
>>>> probability that the data is in cache is slightly higher and the table
>>>> filling is better. Unfortunately this increases the number of required
>>>> tests, where false positives increase the penalty. The numbers I showed
>>>> are for a different kind of table. It is composed of three arrays: T1,
>>>> T2 and T3. T1 is a conventional table of 8-item Swiss groups. When a
>>>> group is full, the item is stored in the overflow hash table T2. When
>>>> the group in T2 is full, the item is appended to T3, which is treated as
>>>> a sequence of groups; I do a linear search in this table containing the
>>>> overflows of overflows. I also implemented a table using quadratic
>>>> probing, and it was faster than the Go map on arm64. I assume this is
>>>> due to the 8-bit top hashes reducing false positives, but I'm not sure.
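>>>>
>>>> To make the layout concrete, a rough sketch of the three arrays (the
>>>> field names and types are illustrative, not the published code):
>>>>
>>>> // group holds up to 8 entries; the 8 control bytes fit in one word.
>>>> type group struct {
>>>>         ctrl [8]uint8  // 8-bit top hashes, 0x00 = free
>>>>         keys [8]string
>>>>         vals [8]uint64
>>>> }
>>>>
>>>> // table is the three-array layout: T1 is addressed by the hash, T2
>>>> // takes the overflow of a full T1 group, and T3 is scanned linearly
>>>> // for the overflows of overflows.
>>>> type table struct {
>>>>         t1 []group // primary groups, indexed by H1
>>>>         t2 []group // overflow hash table (much smaller than t1)
>>>>         t3 []group // append-only groups, searched sequentially
>>>> }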
>>>>
>>>> I tested independent hashes for T1 and T2, which then require the use of
>>>> tombstones, and related hashes. By related hash I mean that the index in
>>>> T2 is, for instance, the Fibonacci hash of the index in T1. The best
>>>> filling rates of T2 were obtained with a random permutation mapping. The
>>>> size ratio between T1 and T2 should in theory be 2:1, but the best
>>>> performance was with 16:1. I assume this is because the T2 data is
>>>> hotter (in cache) than with a bigger table. The benefit of such a
>>>> deterministic relation between the T1 and T2 indexes is that we don't
>>>> need tombstones; we can use a swap-pop strategy. In summary, the idea
>>>> was to minimize the number of probes required to find a key.
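>>>>
>>>> As an illustration of such a related hash, a Fibonacci-style mapping
>>>> from a T1 index to a T2 index could look like this (the parameter names
>>>> are assumptions):
>>>>
>>>> // t2Index derives the T2 group index deterministically from the T1
>>>> // index, so a key's possible locations are fixed and no tombstones are
>>>> // needed. The constant is 2^64 divided by the golden ratio.
>>>> func t2Index(t1Index uint64, t2Bits uint) uint64 {
>>>>         return (t1Index * 0x9e3779b97f4a7c15) >> (64 - t2Bits)
>>>> }
>>>>
>>>> With a 16:1 size ratio, t2Bits is simply four less than the number of
>>>> bits used to index T1.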
>>>>
>>>> The table also uses an extensible hash, for which I have simpler code
>>>> than the Go map, though it isn't faster. This directory indirection
>>>> introduces a significant performance penalty. In a later benchmark,
>>>> after the one I published here, I saw a significant performance increase
>>>> by removing the directory. When the hash function is good, most tables
>>>> have the same depth.
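>>>>
>>>> A minimal sketch of the directory indirection of an extensible hash
>>>> (simplified, with illustrative names; not the published code):
>>>>
>>>> // extTable is a fixed-size table; its groups are omitted here.
>>>> type extTable struct {
>>>>         localDepth uint // hash bits this table actually distinguishes
>>>> }
>>>>
>>>> func (t *extTable) get(hash, key uint64) (uint64, bool) {
>>>>         return 0, false // probe the fixed-size table; omitted
>>>> }
>>>>
>>>> // extMap routes a lookup through the directory: the globalDepth top
>>>> // bits of the hash select a table. When all tables end up at the same
>>>> // depth, this indirection is pure overhead, which is why dropping the
>>>> // directory helps.
>>>> type extMap struct {
>>>>         globalDepth uint
>>>>         dir         []*extTable // len(dir) == 1 << globalDepth
>>>> }
>>>>
>>>> func (m *extMap) get(hash, key uint64) (uint64, bool) {
>>>>         return m.dir[hash>>(64-m.globalDepth)].get(hash, key)
>>>> }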
>>>>
>>>> Another optimization compared to the Go map is that my tables in the
>>>> directory don't grow. Using fixed-size arrays gives some performance
>>>> advantage, as the boundaries and hash mask are constants. There is a
>>>> small space usage penalty, but when a table is split at 80% for
>>>> instance, each half becomes 40% full. It is then very close to needing a
>>>> grow, which I considered a waste of effort. For a small number of items
>>>> fitting in one table, it may be worth using a growable table, which I
>>>> didn't bother to implement.
>>>>
>>>> I also saw that function calls are expensive, so I used manual inlining
>>>> for the Get method. It would be nicer to avoid that for code readability
>>>> and maintenance, though.
>>>>
>>>> I tried assembly on arm64, where I could experiment with SIMD
>>>> instructions, but due to the assembly function call overhead it was
>>>> slower than pure Go. Passing arguments by register wouldn't help. We
>>>> would need assembly inlining, but I understand that this would open a
>>>> can of worms and probably create more problems than it would solve.
>>>>
>>>> If you are still interested and want some specific numbers I would be
>>>> glad to give them.
>>>>
>>>> Here is the cache miss benchmark that shows the benefit of reducing the
>>>> number of probes:
>>>>
>>>> goos: darwin
>>>> goarch: arm64
>>>> pkg: fastCache/map
>>>> cpu: Apple M2
>>>>                         │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt   │
>>>>                         │          sec/op          │    sec/op      vs base       │
>>>> Cache2Miss/_______1-8          4.721n ± 0%             8.298n ± 0%   +75.75% (p=0.002 n=6)
>>>> Cache2Miss/______10-8          5.452n ± 0%             9.094n ± 25%  +66.81% (p=0.002 n=6)
>>>> Cache2Miss/_____100-8          5.726n ± 6%            11.550n ± 5%  +101.71% (p=0.002 n=6)
>>>> Cache2Miss/____1000-8          6.572n ± 4%            12.550n ± 21%  +90.96% (p=0.002 n=6)
>>>> Cache2Miss/___10000-8          9.428n ± 4%            18.400n ± 7%   +95.17% (p=0.002 n=6)
>>>> Cache2Miss/__100000-8          14.79n ± 1%             26.60n ± 2%   +79.79% (p=0.002 n=6)
>>>> Cache2Miss/_1000000-8          29.11n ± 1%             36.83n ± 2%   +26.54% (p=0.002 n=6)
>>>> Cache2Miss/10000000-8          40.44n ± 5%             51.12n ± 1%   +26.41% (p=0.002 n=6)
>>>> geomean                        10.60n                  17.80n        +67.98%
>>>>
>>>> Here is the cache hit benchmark comparing dirstr12 with flat8, which has
>>>> no directory.
>>>>
>>>> goos: darwin
>>>> goarch: arm64
>>>> pkg: fastCache/map
>>>> cpu: Apple M2
>>>>                        │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt   │
>>>>                        │        sec/op         │    sec/op      vs base     │
>>>> Cache2Hit/_______1-8         6.047n ± 0%            6.246n ± 5%   +3.28% (p=0.002 n=6)
>>>> Cache2Hit/______10-8         8.172n ± 0%            8.499n ± 0%   +4.00% (p=0.002 n=6)
>>>> Cache2Hit/_____100-8         7.893n ± 0%            8.256n ± 5%   +4.61% (p=0.002 n=6)
>>>> Cache2Hit/____1000-8         7.585n ± 6%            8.228n ± 2%   +8.48% (p=0.002 n=6)
>>>> Cache2Hit/___10000-8         9.191n ± 0%           10.375n ± 1%  +12.88% (p=0.002 n=6)
>>>> Cache2Hit/__100000-8         10.03n ± 4%            12.14n ± 1%  +21.15% (p=0.002 n=6)
>>>> Cache2Hit/_1000000-8         40.79n ± 1%            42.27n ± 4%   +3.64% (p=0.002 n=6)
>>>> Cache2Hit/10000000-8         47.29n ± 1%            56.49n ± 9%  +19.47% (p=0.002 n=6)
>>>> geomean                      12.31n                 13.47n        +9.48%
>>>>
>>>> I currently don't have access to an x86 computer. Thank you for your
>>>> mail.
>>>>
>>>>
>>>> On Sunday, June 22, 2025 at 00:12:25 UTC+2, Keith Randall wrote:
>>>>
>>>>> Certainly interesting. We'd be interested in incorporating some ideas
>>>>> from your experiments if they translate over.
>>>>>
>>>>> One thing to caution, you showed no numbers for space utilization, map
>>>>> miss speed, whether your map can do iteration, etc. It is possible to
>>>>> trade
>>>>> off one of these for other ones, so it is not immediately obvious that
>>>>> your
>>>>> techniques would apply to the builtin map. For instance, if you're using
>>>>> 10x the space to get 10% speedup, we probably wouldn't want to do that
>>>>> for
>>>>> the general map. (Not saying that's what you did, just pointing out raw
>>>>> speed is not the only consideration.)
>>>>>
>>>>> On Thursday, June 19, 2025 at 4:11:28 AM UTC-7 [email protected]
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> trying to implement a fast cache, I inadvertently entered a rabbit
>>>>>> hole that led me to implement my own map. In the process I tried to
>>>>>> make it faster than the Go map, just to see if it is possible. I
>>>>>> worked on it for weeks, trying out various architectures and methods.
>>>>>>
>>>>>> On my MacBook Air M2, I get the following benchmarks for a Get
>>>>>> operation. The numbers in the benchmark names are the number of items
>>>>>> inserted in the table. My keys are 8-byte strings.
>>>>>>
>>>>>> goos: darwin
>>>>>> goarch: arm64
>>>>>> pkg: fastCache/map
>>>>>> cpu: Apple M2
>>>>>>                        │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt   │
>>>>>>                        │          sec/op          │    sec/op      vs base       │
>>>>>> Cache2Hit/_______1-8         6.151n ± 6%              7.087n ± 1%   +15.22% (p=0.002 n=6)
>>>>>> Cache2Hit/______10-8         8.491n ± 0%              8.156n ± 29%        ~ (p=0.394 n=6)
>>>>>> Cache2Hit/_____100-8         8.141n ± 7%             14.185n ± 13%  +74.24% (p=0.002 n=6)
>>>>>> Cache2Hit/____1000-8         8.252n ± 3%             10.635n ± 39%  +28.89% (p=0.002 n=6)
>>>>>> Cache2Hit/___10000-8         10.45n ± 2%              20.99n ± 4%  +100.81% (p=0.002 n=6)
>>>>>> Cache2Hit/__100000-8         12.16n ± 1%              19.11n ± 10%  +57.05% (p=0.002 n=6)
>>>>>> Cache2Hit/_1000000-8         42.28n ± 2%              47.90n ± 2%   +13.29% (p=0.002 n=6)
>>>>>> Cache2Hit/10000000-8         56.38n ± 12%             61.82n ± 6%         ~ (p=0.065 n=6)
>>>>>> geomean                      13.44n                   17.86n        +32.91%
>>>>>>
>>>>>> On my amd64 11th-gen i5, I get the following benchmarks.
>>>>>>
>>>>>> goos: linux
>>>>>> goarch: amd64
>>>>>> pkg: fastCache/map
>>>>>> cpu: 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
>>>>>>                         │ dirstr12/stats_amd64.txt │ puremapstr/stats_amd64.txt   │
>>>>>>                         │          sec/op          │    sec/op      vs base       │
>>>>>> Cache2Hit/_______1-12          9.207n ± 1%             7.506n ± 3%   -18.48% (p=0.002 n=6)
>>>>>> Cache2Hit/______10-12          9.223n ± 0%             8.806n ± 6%         ~ (p=0.058 n=6)
>>>>>> Cache2Hit/_____100-12          9.279n ± 2%            10.175n ± 3%    +9.66% (p=0.002 n=6)
>>>>>> Cache2Hit/____1000-12          10.45n ± 2%             11.29n ± 3%    +8.04% (p=0.002 n=6)
>>>>>> Cache2Hit/___10000-12          16.00n ± 2%             17.21n ± 5%    +7.59% (p=0.002 n=6)
>>>>>> Cache2Hit/__100000-12          22.20n ± 17%            24.73n ± 22%  +11.42% (p=0.026 n=6)
>>>>>> Cache2Hit/_1000000-12          87.75n ± 2%             91.05n ± 5%    +3.76% (p=0.009 n=6)
>>>>>> Cache2Hit/10000000-12          104.2n ± 2%             105.6n ± 5%         ~ (p=0.558 n=6)
>>>>>> geomean                        20.11n                  20.49n         +1.90%
>>>>>>
>>>>>> On amd64 the performance is on par with the Go map, but the Go map
>>>>>> uses inlined SIMD instructions, which I don't use because I don't have
>>>>>> access to them in pure Go. I use xxh3 right out of the box for the
>>>>>> hash function. The differences are not due to the hash function.
>>>>>>
>>>>>> If there is interest in this, please let me know.
>>>>>>
>>>>>>
>>>>>>
>>>>>>