mikemccand opened a new pull request, #15950: URL: https://github.com/apache/lucene/pull/15950
[Spinoff from @zihanx's recent PR (#15803) adding `ReaderUtil.partitionByLeaf` helper to collate any `ScoreDoc[]` into per-leaf sorted groups.] TL;DR: see [this cool benchmark results UI](http://githubsearch.mikemccandless.com/jmh-sort-scoredoc-benchy.html) testing various algorithms to sort `ScoreDoc[]` (smaller is better, the numbers are microseconds per sort call). The hits from a Lucene query (`ScoreDoc[]`) come out sorted by something important to the user (e.g. relevance), but in order to retrieve stored fields or doc values, you instead need to partition by leaf and sort by `docid` within each leaf, retrieve your values via Lucene's usual iterators, and somehow then invert that step so you can stick/associate the retrieved values with original `ScoreDoc[]` (sorted for the user). We currently use `Arrays.sort(int[])` after extracting just the `docid` from each hit, but that loses association with the original `ScoreDoc[]`. #15905 is about NOT losing that association by also keeping track of the `int ordinal` position (in the original array). And then @gsmiller made the cool JMH bench to find a faster partition step (after sorting) -- #15938. This all made me curious about the absolute fastest way to sort these hits. So, I worked with Claude Code (and maybe Gemini for some prompts too) to create this JMH benchmark (`ScoreDocSortBenchmark.java`) plus simple HTML UI to understand the results. It tests 12 different approaches across array sizes from 10 to 10,000: * The usual JDK options (`Arrays.sort` with lambdas, static comparators, and `parallelSort`). * Lucene's internal sorters (`IntroSorter`, `TimSorter`, `InPlaceMergeSorter`, and `LSBRadixSorter`). * A few primitive extraction techniques (packing `docid` and the original array `index` into `int[]` or `long[]` for cache-friendly primitive sorting). * A manual 2-pass 8-bit radix sort. The benchmark does upfront validation during the `@Setup` phase to ensure every sort is actually sorting correctly and not losing/duplicating any `ScoreDoc` instances, all without polluting the actual JMH measurement time. Claude and I also added a companion Python script (`jmh-table.py`) to convert the raw JMH JSON output into an interactive HTML heatmap. I tried to preserve all local commits, and tell the ai to make a local commit after each prompt+code iteration. Here is a screenshot of the output: <img width="1797" height="1161" alt="sort-benchy" src="https://github.com/user-attachments/assets/2b212dac-bbf8-4603-8a54-8b3a6f9cf5d3" /> Here's a [hosted version that should work for you](http://githubsearch.mikemccandless.com/jmh-sort-scoredoc-benchy.html) (click column headers to sort all rows!). **NOTES/Learnings** * **AI Collaboration:** This code was written mostly by an AI (Gemini CLI) with me (a human) prompting, reviewing, and iterating. * Measurements for very small array sizes (like size=10) include JMH's `Level.Invocation` overhead because we have to do a shallow array copy before every invocation. It inflates the absolute numbers a bit for tiny arrays, but the overhead is consistent across all algorithms. * Each cell is greener if it was more competitive vs the other algos, and includes a sparkle histogram of latencies. * It is disturbing how much the latency varies, and is sometimes strongly bimodal (likely hotspot structural noise e.g. mis-compilation) even when it's just one thread in a fresh JVM on an idle box * Click on column header to sort all rows by it (Excel-like). Click on a cell to see the source code for that algo, and under the table you'll see histogram of latencies of all runs for that cell. * It only tests random arrays. Ideally I would test on real hits popping out of a useful search engine. When you test on random data you draw random conclusions, sigh. * Claude invented those two primitive solutions with no specific prompting from me, which is super cool. They are nearly fastest (so is the radix sort!) across the board except `parallelSort` on large (10K) arrays, but that's using multiple CPU cores. This, despite that they must make first pass to encode to `long[]` and ending pass to get back to `ScoreDoc[]`. * It's surprising how much slower / faster the various algorithms are -- `TimSorter` (is this still used for Python's `listobject` sort?) was surprisingly slow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
