gf2121 commented on PR #12784:
URL: https://github.com/apache/lucene/pull/12784#issuecomment-1808443275

   Thanks for tracking in ! @mikemccand 
   
   > Did we see any bump in nightly benchmarks?
   
   I would expect this change more likely bring some improvements for flushing 
high cardinality `StringField`, or other places taking advantages of the 
`BytesRefHash#sort` like 
[`DeletedTerms`](https://github.com/apache/lucene/blob/be27303e3ace53b8c5be9a480d5ad7d2c609a28f/lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java#L180).
 
   
   However, I had not expected this change would bring a bump in nightly 
benchmarks - If you look at the [CPU profile of the nightly 
indexing](https://blunders.io/jfr-demo/indexing-4kb-2023.11.12.18.03.28/cpu-samples-drill-down),
 you may find that most of CPU was used to tokenize / deduplicate terms before 
flushing. After deduplication the terms count will get reduced and the sort is 
not the bottleneck of the indexing speed (sort only use ~1% CPU).
   
   Small difference may be found in the flame graph that `reorder` is not there 
any more. But the proportion of sorting overhead is too low to bring a 
noticeable E2E difference for nightly indexing.
   
   **Before this patch**
   
![image](https://github.com/apache/lucene/assets/52390227/b4e21371-9760-465b-9593-f7066c9bc7fd)
   
   **After this patch**
   
![image](https://github.com/apache/lucene/assets/52390227/23f48ac2-45b2-4275-bd96-3fd8569a2707)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to