On Tue, 10 Mar 2026 at 02:16, Ranier Vilela <[email protected]> wrote:
> In the functions *systable_beginscan* and *systable_beginscan_ordered*,
> is possible a small optimization.
> The array *idxkey* can be constructed in one go with a single call to mempcy.
> The excess might not make much of a difference, but I think it's worth the 
> effort.

This whole thread reeks of you running a static analyser to find
optimisation opportunities. If you want to find these, then please be
aware that static analysis of code is pretty much useless for this
purpose. It has no idea how hot the code in question is.

If you want to find places that actually matter, try using perf. Even
perf top --pid=N will allow you to see which functions matter enough
to warrant further analysis. By all means, if you find something
interesting and see a way to make a meaningful improvement, then post
about it and show a benchmark script and results which demonstrate the
performance gain.

For the patch and the analysis that's been done on it so far, please
be aware that the struct size that you use will be critical to both
the code emitted by the compiler and to the performance. On my
machine, ScanKeyData is 72 bytes. None of the godbolt.org links I
looked at got this right. One had 8 bytes, and another had 24 bytes.
With the default -march flags, you're probably only going to get SSE
instructions which will move 16 bytes at a time. Moving 72 bytes is
going to need 4 of those SSE moves and a MOV for the 8 byte remainder.
That's 5 instructions. 8 bytes is only 1 instruction, and 24 is 2
instructions (16+8).  I've forgotten the exact thresholds, but both
gcc and clang will inline fixed-sized memcpys below something like
256-512 bytes. I assume the compiler authors did a reasonable amount
of analysis to discover those thresholds. 4 ScanKeyData is 288 bytes,
which is pretty close to that threshold. Flicking through our code, I
see most systable_beginscan() calls are only using 1-2 scan keys. I
did see some using 3. I didn't look at all of them, but the ones I saw
are not even near the threshold for not inlining the memcpy. But, by
all means, prove me wrong and do something like add logging in
systable_beginscan and run the regression tests and use the logs to
build a histogram of the calls by nkeys.

For me, I think the unmodified code is better as the memcpy is a fixed
size and can be inlined by the compiler. I might feel different about
it if the outer loop could disappear, but that's not possible since
sk_attno needs to be set. I also doubt that it matters very much as
most of the time scans to catalogue tables will be cached in
sys/rel/cat caches. A few tables don't, like pg_inherits, possibly
others. But by all means, prove me wrong with perf results and the
query that shows systable_beginscan using a meaningful amount of CPU.

On a final note, please please stop thinking your static analyser is
going to help you improve the performance of PostgreSQL. Go and learn
a new tool that's actually useful and fit for the purpose. There's
likely years of work doing small things like this in places that
actually matter. You seem to have lots of energy for PostgreSQL, so I
highly recommend that you take the time and go and learn to use a tool
that finds those for you. You can then focus on the places it
highlights for you. Otherwise, you're just wasting your time and all
of ours.

David


Reply via email to