[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537009#comment-17537009 ]
Uwe Schindler commented on LUCENE-10572: ---------------------------------------- Hi, actually the reason why BE encoding was used in ByteBlockPool and BytesRefHash (same for PagedBytes) was to emulate those vInt encodinf. For the data structure it makes otherwise no difference. So I am fine with any decission. The only thing that needs to be done is to remove the vInt encoding, because it relies on BE for it to work (see the code that I removed). It could be fixed to also work LE, but its not universal. If we do not run into space problem during indexing (if you have many short terms which is the default for most texts, the special case with 1-byte encoding for lengths < 128 is a good idea). It just brings problems when you all the time encode/decode it (e.g. when searching the hash table with equals). >From looking at the code: I doubt that the problem really comes from the vInt >encoding in BE format. Most terms are shorter than 128 bytes in normal >indexes. I am quite sure the problem with the hotspot at equals is simple that >you need to do a full comparison using Arrays.equals(). This happens mostly in >the case that the term is already there (hash collisions also happen on same >term). Simpy said: If you insert a BytesRef into the hash and the term already >exists (a common case during indexing text), it will get a hashcode that >already exists in the table. To verify that the term is really the same it has >to compare the bytes. This is why we see the hotspot on equals. There does not >even help a better hashing algorithm, as the case "term already in table" >always needs a verification no matter how good the hashing algorithm is. > Can we optimize BytesRefHash? > ----------------------------- > > Key: LUCENE-10572 > URL: https://issues.apache.org/jira/browse/LUCENE-10572 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael McCandless > Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > I was poking around in our nightly benchmarks > ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR > profiling that the hottest method is this: > {noformat} > PERCENT CPU SAMPLES STACK > 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals() > at > org.apache.lucene.util.BytesRefHash#findHash() > at org.apache.lucene.util.BytesRefHash#add() > at > org.apache.lucene.index.TermsHashPerField#add() > at > org.apache.lucene.index.IndexingChain$PerField#invert() > at > org.apache.lucene.index.IndexingChain#processField() > at > org.apache.lucene.index.IndexingChain#processDocument() > at > org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat} > This is kinda crazy – comparing if the term to be inserted into the inverted > index hash equals the term already added to {{BytesRefHash}} is the hottest > method during nightly benchmarks. > Discussing offline with [~rcmuir] and [~jpountz] they noticed a few > questionable things about our current implementation: > * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the > inserted term into the hash? Let's just use two bytes always, since IW > limits term length to 32 K (< 64K that an unsigned short can cover) > * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} > (BitUtil.VH_BE_SHORT.get) > * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not > aggressive enough? Or the initial sizing of the hash is too small? > * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too > many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible > "upgrades"? > * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version > ({{{}murmurhash3_x86_32{}}})? > * Are we using the JVM's intrinsics to compare multiple bytes in a single > SIMD instruction ([~rcmuir] is quite sure we are indeed)? > * [~jpountz] suggested maybe the hash insert is simply memory bound > * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total > CPU cost) > I pulled these observations from a recent (5/6/22) profiler output: > [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html] > Maybe we can improve our performance on this crazy hotspot? > Or maybe this is a "healthy" hotspot and we should leave it be! -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org