kaivalnp commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3063496698
Thanks for the deep dive here @msokolov! I tried running a set of benchmarks (Cohere, 768d, byte vectors) with `niter=100,500,1000,5000,10000,50000` (only searching an existing index, no reindexing) to test how the compiler optimizes over longer runs.

`main`:

```
recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  num_segments
0.968   2.790        2.710   0.971        100000  100   50      64       250        no         1
0.964   2.692        2.674   0.993        100000  100   50      64       250        no         1
0.963   2.685        2.677   0.997        100000  100   50      64       250        no         1
0.963   2.599        2.597   0.999        100000  100   50      64       250        no         1
0.962   2.552        2.550   0.999        100000  100   50      64       250        no         1
0.962   2.590        2.589   1.000        100000  100   50      64       250        no         1
```

This PR:

```
recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  num_segments
0.968   2.050        1.960   0.956        100000  100   50      64       250        no         1
0.964   1.868        1.850   0.990        100000  100   50      64       250        no         1
0.963   1.894        1.885   0.995        100000  100   50      64       250        no         1
0.963   1.770        1.769   0.999        100000  100   50      64       250        no         1
0.962   1.762        1.761   1.000        100000  100   50      64       250        no         1
0.962   1.742        1.741   1.000        100000  100   50      64       250        no         1
```

There's a possibility that the candidate (i.e. this PR) has an inherent advantage from being run later in time (so vectors are _more likely_ to already be loaded into RAM), so I ran the baseline (i.e. `main`) again immediately afterwards:

```
recall  latency(ms)  netCPU  avgCpuCount  nDoc    topK  fanout  maxConn  beamWidth  quantized  num_segments
0.968   2.850        2.770   0.972        100000  100   50      64       250        no         1
0.964   2.712        2.694   0.993        100000  100   50      64       250        no         1
0.963   2.640        2.632   0.997        100000  100   50      64       250        no         1
0.963   2.588        2.586   0.999        100000  100   50      64       250        no         1
0.962   2.561        2.559   0.999        100000  100   50      64       250        no         1
0.962   2.550        2.549   1.000        100000  100   50      64       250        no         1
```

For some reason, the changes in this PR are still better on my machine :/

> I think we should understand the hotspot hack a little better before we push that, because it's really kind of gross and feels like voodoo to me

+1, I'm not looking to merge this until we find out why we're seeing a difference in performance (which seems counterintuitive, since we're doing _more_ work but seeing better latency!) -- and whether it holds when we (1) create a fresh index, (2) reindex, (3) search an existing index, (4) use different parameters, (5) run on different machines.

Performance seems tied to the HotSpot compiler -- is there a way to make its optimizations more deterministic? (or at least, explicit)

On a related note, benchmark runs have been fluctuating wildly -- I wonder if we should set larger defaults to get more reliable numbers.
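For the "explicit" part, a few standard HotSpot flags can at least surface what the JIT is doing between runs -- a sketch, not a full recipe (the `java ... <benchmark>` placeholder stands in for however the benchmark harness is actually launched):

```sh
# Log each method as HotSpot JIT-compiles it, so two runs can be diffed:
java -XX:+PrintCompilation ... <benchmark>

# Show inlining decisions (diagnostic flag, must be unlocked first):
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining ... <benchmark>

# Compile in the foreground instead of on background threads;
# slower warmup, but less run-to-run variance:
java -Xbatch ... <benchmark>

# Skip the tiered C1 levels so hot methods go straight to C2:
java -XX:-TieredCompilation ... <benchmark>
```

None of these make compilation fully deterministic (profile-guided decisions still depend on the data seen during warmup), but diffing `-XX:+PrintCompilation` output between baseline and candidate might show which methods end up compiled or inlined differently.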