gf2121 commented on PR #12784: URL: https://github.com/apache/lucene/pull/12784#issuecomment-1808443275
Thanks for tracking in ! @mikemccand > Did we see any bump in nightly benchmarks? I would expect this change more likely bring some improvements for flushing high cardinality `StringField`, or other places taking advantages of the `BytesRefHash#sort` like [`DeletedTerms`](https://github.com/apache/lucene/blob/be27303e3ace53b8c5be9a480d5ad7d2c609a28f/lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java#L180). However, I had not expected this change would bring a bump in nightly benchmarks - If you look at the [CPU profile of the nightly indexing](https://blunders.io/jfr-demo/indexing-4kb-2023.11.12.18.03.28/cpu-samples-drill-down), you may find that most of CPU was used to tokenize / deduplicate terms before flushing. After deduplication the terms count will get reduced and the sort is not the bottleneck of the indexing speed (sort only use ~1% CPU). Small difference may be found in the flame graph that `reorder` is not there any more. But the proportion of sorting overhead is too low to bring a noticeable E2E difference for nightly indexing. **Before this patch**  **After this patch**  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org