Hi, I tried this change on PostgreSQL (19devel) and ran a few simple tests using pgbench to see how it behaves in practice.
I used 10 clients and 4 threads and ran each test for 60 seconds. >From my runs, the original version was giving around ~663 TPS with ~15.07 ms latency. With the patched version, I observed TPS in the range of ~638–657, averaging around ~648 TPS, with latency slightly higher (~15.2–15.6 ms). So in my tests, I didn’t see a clear improvement. If anything, the patched version appears slightly slower, though the difference is small. It might be that this change doesn’t help much in common cases where the number of keys is small, but it would be interesting to see results from other workloads as well. Thanks for the discussion—it was helpful to look into this. Regards, Lakshmi G On Fri, Mar 20, 2026 at 9:57 AM Bryan Green <[email protected]> wrote: > I meant to add that the micro-benchmark went through each loop 100,000,000 > iterations for each run... > for 20 runs to come up with those numbers. > > --bg > > On Mon, Mar 9, 2026 at 11:02 AM Bryan Green <[email protected]> > wrote: > >> I performed a micro-benchmark on my dual epyc (zen 2) server and version >> 1 wins for small values of n. >> >> 20 runs: >> >> n version min median mean max stddev noise% >> ----------------------------------------------------------------------- >> n=1 version1 2.440 2.440 2.450 2.550 0.024 4.5% >> n=1 version2 4.260 4.280 4.277 4.290 0.007 0.7% >> >> n=2 version1 2.740 2.750 2.757 2.880 0.029 5.1% >> n=2 version2 3.970 3.980 3.980 4.020 0.010 1.3% >> >> n=4 version1 4.580 4.595 4.649 4.910 0.094 7.2% >> n=4 version2 5.780 5.815 5.809 5.820 0.013 0.7% >> >> But, micro-benchmarks always make me nervous, so I looked at the actual >> instruction cost for my >> platform given the version 1 and version 2 code. >> >> If we count cpu cycles using the AMD Zen 2 instruction latency/throughput >> tables: version 1 (loop body) >> has a critical path of ~5-6 cycles per iteration. version 2 (loop body) >> has ~3-4 cycles per iteration. >> >> The problem for version 2 is that the call to memcpy is ~24-30 cycles due >> to the stub + function call + return >> and branch predictor pressure on first call. This probably results in >> ~2.5 ns per iteration cost for version 2. >> >> So, no I wouldn't call it an optimization. But, it will be interesting >> to hear other opinions on this. >> >> --bg >> >> >> On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <[email protected]> >> wrote: >> >>> >>> >>> Em seg., 9 de mar. de 2026 às 11:47, Bryan Green <[email protected]> >>> escreveu: >>> >>>> I created an example that is a little bit closer to the actual code and >>>> changed the compiler from C++ to C. >>>> >>>> It is interesting the optimization that the compiler has chosen for >>>> version 1 versus version 2. One calls >>>> memcpy and one doesn't. There is a good chance the inlining of memcpy >>>> as SSE+scalar per iteration >>>> will be faster for syscache scans-- which I believe are usually small >>>> (1-4 keys?). >>>> >>> I doubt the inline version is better. >>> Clang is supported too and the code generated is much better with memcpy >>> one call outside of the loop. >>> >>> >>>> >>>> Probably the only reason to do this patch would be if N is normally >>>> large or if this is considered an >>>> improvement in code clarity without a detrimental impact on small N >>>> syscache scans. >>>> I realize you only said "possible small optimization". It might be >>>> worthwhile to benchmark the code for >>>> different values of n to determine if there is a tipping point either >>>> way? >>>> >>> In your opinion, shouldn't this be considered an optimization, even a >>> small one? >>> >>> best regards, >>> Ranier Vilela >>> >>
