[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536905#comment-17536905 ]
Uwe Schindler commented on LUCENE-10572: ---------------------------------------- bq. ok, i like your suggestion actually. it solves my issue, it would make this testable. Then native order could be used freely without this concern. We can do this in the same way like the initialization of StringUtils. There we check the system property tests.seed and then initialize some randomness. In BitUtil we could have a similar code (or maybe share that in Constants to get the random seed and save as a Long value, or null if not given - this would prevent us from doing the lookup multiple times). In BitUtil we could have varhandles like {{BitUtil.VH_NATIVE_SHORT}} that are native in production environments, but randomized in test environments. I can make a PR as starting point tomorrow, it's too late now. We can then improve from there - [~mikemccand] other ideas included. > Can we optimize BytesRefHash? > ----------------------------- > > Key: LUCENE-10572 > URL: https://issues.apache.org/jira/browse/LUCENE-10572 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael McCandless > Priority: Major > > I was poking around in our nightly benchmarks > ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR > profiling that the hottest method is this: > {noformat} > PERCENT CPU SAMPLES STACK > 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals() > at > org.apache.lucene.util.BytesRefHash#findHash() > at org.apache.lucene.util.BytesRefHash#add() > at > org.apache.lucene.index.TermsHashPerField#add() > at > org.apache.lucene.index.IndexingChain$PerField#invert() > at > org.apache.lucene.index.IndexingChain#processField() > at > org.apache.lucene.index.IndexingChain#processDocument() > at > org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat} > This is kinda crazy – comparing if the term to be inserted into the inverted > index hash equals the term already added to {{BytesRefHash}} is the hottest > method during nightly benchmarks. > Discussing offline with [~rcmuir] and [~jpountz] they noticed a few > questionable things about our current implementation: > * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the > inserted term into the hash? Let's just use two bytes always, since IW > limits term length to 32 K (< 64K that an unsigned short can cover) > * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} > (BitUtil.VH_BE_SHORT.get) > * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not > aggressive enough? Or the initial sizing of the hash is too small? > * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too > many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible > "upgrades"? > * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version > ({{{}murmurhash3_x86_32{}}})? > * Are we using the JVM's intrinsics to compare multiple bytes in a single > SIMD instruction ([~rcmuir] is quite sure we are indeed)? > * [~jpountz] suggested maybe the hash insert is simply memory bound > * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total > CPU cost) > I pulled these observations from a recent (5/6/22) profiler output: > [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html] > Maybe we can improve our performance on this crazy hotspot? > Or maybe this is a "healthy" hotspot and we should leave it be! -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org