[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537009#comment-17537009 ] Uwe Schindler commented on LUCENE-10572: Hi, actually the reason why BE encoding was used in ByteBlockPool and BytesRefHash (same for PagedBytes) was to emulate the vInt encoding. For the data structure itself it makes no difference, so I am fine with any decision. The only thing that needs to be done is to remove the vInt encoding, because it relies on BE to work (see the code that I removed). It could be fixed to also work with LE, but it's not universal. If we do not run into space problems during indexing, the special case with 1-byte encoding for lengths < 128 is a good idea (many short terms is the default for most texts). It just causes problems when you encode/decode it all the time (e.g. when searching the hash table with equals). From looking at the code: I doubt that the problem really comes from the vInt encoding in BE format. Most terms are shorter than 128 bytes in normal indexes. I am quite sure the hotspot at equals is simply that you need to do a full comparison using Arrays.equals(). This happens mostly in the case that the term is already there (hash collisions also happen on the same term). Simply said: if you insert a BytesRef into the hash and the term already exists (a common case during indexing text), it will get a hashcode that already exists in the table. To verify that the term is really the same, it has to compare the bytes. This is why we see the hotspot on equals. A better hashing algorithm does not even help here, as the case "term already in table" always needs a verification no matter how good the hashing algorithm is. > Can we optimize BytesRefHash? 
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR profiling that the hottest method is this:
> {noformat}
> PERCENT  CPU SAMPLES  STACK
> 9.28%    53848        org.apache.lucene.util.BytesRefHash#equals()
>                         at org.apache.lucene.util.BytesRefHash#findHash()
>                         at org.apache.lucene.util.BytesRefHash#add()
>                         at org.apache.lucene.index.TermsHashPerField#add()
>                         at org.apache.lucene.index.IndexingChain$PerField#invert()
>                         at org.apache.lucene.index.IndexingChain#processField()
>                         at org.apache.lucene.index.IndexingChain#processDocument()
>                         at org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
> {noformat}
> This is kinda crazy – comparing whether the term to be inserted into the inverted index hash equals the term already added to {{BytesRefHash}} is the hottest method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz], they noticed a few questionable things about our current implementation:
> * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted term into the hash? Let's just use two bytes always, since IW limits term length to 32 K (< 64 K that an unsigned short can cover)
> * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} (BitUtil.VH_BE_SHORT.get)?
> * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not aggressive enough? Or the initial sizing of the hash is too small?
> * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many {{equals}} calls as a result)? {{Fnv}} and {{xxhash}} are possible "upgrades"?
> * If we stick with {{MurmurHash}}, why are we using the 32 bit version ({{murmurhash3_x86_32}})?
> * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD instruction ([~rcmuir] is quite sure we are indeed)?
> * [~jpountz] suggested maybe the hash insert is simply memory bound
> * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot? Or maybe this is a "healthy" hotspot and we should leave it be!
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-
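The 1-or-2-byte {{vInt}} length encoding questioned in the first bullet can be sketched as follows. This is an illustrative stand-alone version under my own assumptions (class and method names are mine, not Lucene's; the exact byte layout Lucene used may differ): one byte for lengths < 128, two bytes otherwise, with the high bit of the first byte marking the 2-byte case.

```java
// Hypothetical sketch of a 1-or-2-byte vInt length encoding, NOT Lucene's
// actual BytesRefHash code. Shows the branch that every decode must take.
public class VIntLengthDemo {
  // Encode a term length into buffer at offset; returns bytes written (1 or 2).
  static int writeLength(byte[] buffer, int offset, int length) {
    if (length < 128) {
      buffer[offset] = (byte) length;              // single byte, high bit clear
      return 1;
    } else {
      // two bytes: high bit of the first byte marks the 2-byte case
      buffer[offset] = (byte) (0x80 | (length & 0x7f));
      buffer[offset + 1] = (byte) (length >>> 7);
      return 2;
    }
  }

  // Decode: this branch runs on every hash-table lookup, which is part of
  // the per-equals cost the comments above discuss.
  static int readLength(byte[] buffer, int offset) {
    byte b = buffer[offset];
    if ((b & 0x80) == 0) {
      return b;
    }
    return (b & 0x7f) | ((buffer[offset + 1] & 0xff) << 7);
  }
}
```

The alternative proposed in the issue (always two bytes, since IndexWriter caps term length below 64 K) trades one byte of pool space per short term for a branch-free fixed-width read.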
[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537013#comment-17537013 ] Uwe Schindler commented on LUCENE-10572: Mike, could you test how much memory increases with the PR and whether there is a speed improvement at all? If memory increases due to the 2-byte length (this can add up very fast, given that most terms in your index are < 128 bytes) and there is no speed improvement, let's throw away my PR. This would be the confirmation that equals is memory bound (for the reasons I told you before). Arrays.equals() uses the intrinsic "ArraysSupport#vectorizedMismatch()" (using SIMD) in JDK 17 -> this should answer your question:
- Our code: [BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraysSupport#mismatch: [Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraysSupport#mismatch calling the vectorizedMismatch method: [ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at the end, which is {{@IntrinsicCandidate}}: [ArraysSupport.java#L111-L135|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L135]
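Uwe's explanation above (a term that already exists still forces a full byte comparison after the hash lookup resolves to an occupied slot) can be illustrated with a minimal open-addressing sketch. This is emphatically not Lucene's BytesRefHash: the class, the linear-probing scheme, the fixed capacity, and the placeholder hash function are all simplifications for illustration.

```java
import java.util.Arrays;

// Minimal sketch of why equals() is hot during indexing: in an open-addressing
// hash of byte[] terms, re-adding an existing term always requires a full byte
// comparison to confirm the occupied slot really holds that term.
public class TermHashSketch {
  private final byte[][] slots = new byte[1024][]; // power-of-two capacity

  public int add(byte[] term) {
    int mask = slots.length - 1;
    int slot = hash(term) & mask;
    while (slots[slot] != null) {
      if (Arrays.equals(slots[slot], term)) {
        return slot;              // term already present: the common indexing case
      }
      slot = (slot + 1) & mask;   // linear probe on a genuine collision
    }
    slots[slot] = term.clone();   // first sighting: store a copy
    return slot;
  }

  private static int hash(byte[] term) {
    int h = 0;
    for (byte b : term) h = 31 * h + b; // placeholder hash, not MurmurHash3
    return h;
  }
}
```

Note that the Arrays.equals call in the duplicate path runs no matter how good the hash function is, which is Uwe's point: a "better" hash only reduces collisions between *different* terms, not the verification cost for repeated ones.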
[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537013#comment-17537013 ] Uwe Schindler edited comment on LUCENE-10572 at 5/14/22 10:29 AM: -- Mike, could you test how much memory increases with the PR and whether there is a speed improvement at all? If memory increases due to the 2-byte length (this can add up very fast, given that most terms in your index are < 128 bytes) and there is no speed improvement, let's throw away my PR. This would be the confirmation that equals is memory bound (for the reasons I told you before). Arrays.equals() uses the intrinsic "ArraysSupport#vectorizedMismatch()" (using SIMD) in JDK 17 when the array is long enough (>= 8 bytes; if smaller, it does not do any vectorization) -> this should answer your question:
- Our code: [BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraysSupport#mismatch: [Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraysSupport#mismatch calling the vectorizedMismatch method: [ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at the end, which is {{@IntrinsicCandidate}}: [ArraysSupport.java#L111-L161|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L161]
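The comparison referenced at BytesRefHash.java#L178 boils down to a ranged Arrays.equals against a term stored in a pooled byte[] behind a length prefix; the ranged overloads delegate to the same ArraysSupport.mismatch path linked above. A hedged sketch of that shape (the single-byte prefix, class, and method names here are illustrative, not Lucene's actual pool layout):

```java
import java.util.Arrays;

// Sketch of a pooled-term equality check: terms live in one big byte[] with a
// length prefix, and equality against a candidate uses the ranged
// Arrays.equals, which (per the comment above) only benefits from
// vectorizedMismatch when the compared region is long enough.
public class PooledEqualsSketch {
  // hypothetical pool layout: [len (1 byte)][term bytes...]
  public static boolean termEquals(byte[] pool, int start, byte[] candidate) {
    int len = pool[start] & 0xff;
    return len == candidate.length
        && Arrays.equals(pool, start + 1, start + 1 + len,
                         candidate, 0, candidate.length);
  }
}
```

The length check up front is the cheap early-out; only terms of equal length reach the byte-by-byte (or SIMD) comparison.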
[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537018#comment-17537018 ] Robert Muir commented on LUCENE-10572: -- {quote}Most terms are shorter than 128 bytes in normal indexes.{quote} Actually, most analyzers have a default limit of 255, so practically nothing prevents hitting the 2-byte vInt case today. I think having to use 2 bytes is ridiculous, and even more so a term of 30,000 bytes. Ever tried to type a single word that is 128 chars long? Anyway, I just don't think we need the code complexity, byte-swapping, branching, etc. for all this. It seems like the wrong tradeoff for IndexWriter, where we could just use another byte and maybe be faster. Maybe we can figure out a benchmark to see if that's the case. Thanks for making it 2 bytes instead. To me the only question is whether we should add these NATIVE constants to BitUtil, or simply do everything LE. I honestly think the only BE stuff out there is old IBM iron (AIX and s390x). It seems risky to optimize for this; I don't think many people are running Lucene on it...
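The BitUtil.VH_BE_SHORT handle mentioned in the issue description is a byte-array view VarHandle; the LE/NATIVE alternatives Robert discusses differ only in the ByteOrder argument. A small stand-alone demo of both orders (Lucene's BitUtil itself is not used here):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Byte-array view VarHandles reading a short in big-endian vs little-endian
// order. On an LE CPU the BE read implies a byte swap; the LE read does not.
public class ShortViewDemo {
  public static final VarHandle BE =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.BIG_ENDIAN);
  public static final VarHandle LE =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

  public static void main(String[] args) {
    byte[] buf = {0x01, 0x02};
    // same two bytes, opposite interpretations
    System.out.println((short) BE.get(buf, 0)); // 0x0102 = 258
    System.out.println((short) LE.get(buf, 0)); // 0x0201 = 513
  }
}
```

ByteOrder.nativeOrder() would pick whichever matches the host CPU, which is the "NATIVE constants" option Robert weighs against simply standardizing on LE.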
[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537023#comment-17537023 ] Michael McCandless commented on LUCENE-10572: - {quote}Mike, could you make a test on how much memory increases by the PR and if there's a speed improvement at all?{quote} +1, I will try to benchmark the PR! Thank you for the fast iterations here! Exciting :)
[GitHub] [lucene] uschindler opened a new pull request, #890: Detect CI builds and enable errorprone by default for those CI builds
uschindler opened a new pull request, #890: URL: https://github.com/apache/lucene/pull/890 This detects CI builds (GitHub and Jenkins) by the existence of some environment variables. There is a new ext property `project.ext.isCIBuild`, which is used to enable errorprone on those CI builds. You can still enable errorprone on local builds with `-Perrorprone.enabled=true`. We can use the new project property to enable other expensive checks by default for CI builds. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #890: Detect CI builds and enable errorprone by default for those CI builds
uschindler commented on code in PR #890: URL: https://github.com/apache/lucene/pull/890#discussion_r872968994

## gradle/globals.gradle:
@@ -147,5 +147,10 @@ allprojects {
     }
     return taskList
   }
+
+  // detect if we run in a CI environment by looking at the existence of env vars:
+  // "CI": GitHub (https://docs.github.com/en/actions/learn-github-actions/environment-variables)
+  // anything starting with "JENKINS_" or "HUDSON_": Jenkins/Hudson (https://jenkins.thetaphi.de/env-vars.html/)
+  isCIBuild = System.getenv().keySet().find { it ==~ /(?i)(JENKINS_?.*|HUDSON_?.*|CI)/ } != null

Review Comment: The regex is a bit stupid. The `?` makes no sense; it should be a bit different. Will fix in the evening.
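For illustration, here is one possible tightening of the regex flagged in the review comment: dropping the stray `_?` so that only `CI` itself or variables actually starting with `JENKINS_`/`HUDSON_` match. This is a hypothetical fix sketched with java.util.regex, not necessarily the change Uwe later committed to globals.gradle.

```java
import java.util.regex.Pattern;

// Hypothetical tightened CI-detection regex: the original "JENKINS_?.*" made
// the underscore optional, so an unrelated var like "JENKINSFOO" would also
// count as a CI marker. This variant requires the underscore.
public class CiDetectDemo {
  static final Pattern CI_VAR = Pattern.compile("(?i)(JENKINS_.*|HUDSON_.*|CI)");

  public static boolean looksLikeCiVar(String name) {
    return CI_VAR.matcher(name).matches(); // matches() anchors the whole name
  }
}
```

Since `matches()` anchors the whole string, the bare `CI` alternative only matches a variable named exactly `CI` (case-insensitively), not names that merely start with it.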
[GitHub] [lucene] uschindler commented on pull request #890: Detect CI builds and enable errorprone by default for those CI builds
uschindler commented on PR #890: URL: https://github.com/apache/lucene/pull/890#issuecomment-1126790330 I renamed the project property to just "errorprone".
[GitHub] [lucene] uschindler commented on pull request #890: Detect CI builds and enable errorprone by default for those CI builds
uschindler commented on PR #890: URL: https://github.com/apache/lucene/pull/890#issuecomment-1126793622 The owasp check currently fails because of vulns in Jetty. We should upgrade (that's a separate task).
[GitHub] [lucene] uschindler merged pull request #890: Detect CI builds and enable errorprone by default for those CI builds
uschindler merged PR #890: URL: https://github.com/apache/lucene/pull/890
[GitHub] [lucene] mocobeta commented on a diff in pull request #890: Detect CI builds and enable errorprone by default for those CI builds
mocobeta commented on code in PR #890: URL: https://github.com/apache/lucene/pull/890#discussion_r873099330 ## lucene/CHANGES.txt: ## @@ -211,7 +211,9 @@ Build * Upgrade forbiddenapis to version 3.3. (Uwe Schindler) -* LUCENE-10532: Remove LuceneTestCase.Slow annotation. ALl tests can be fast. (Robert Muir) +* Detect CI builds on Github or Jenkins and enable errorprone. (Uwe Schindler, Dawid Weiss) Review Comment: @uschindler Can you please add `GITHUB#890` as the prefix of this entry? https://github.com/apache/lucene/pull/854
[GitHub] [lucene] mocobeta commented on a diff in pull request #890: Detect CI builds and enable errorprone by default for those CI builds
mocobeta commented on code in PR #890: URL: https://github.com/apache/lucene/pull/890#discussion_r873099427 ## lucene/CHANGES.txt: ## @@ -211,7 +211,9 @@ Build * Upgrade forbiddenapis to version 3.3. (Uwe Schindler) -* LUCENE-10532: Remove LuceneTestCase.Slow annotation. ALl tests can be fast. (Robert Muir) +* Detect CI builds on Github or Jenkins and enable errorprone. (Uwe Schindler, Dawid Weiss) Review Comment: Also, I think "Upgrade forbiddenapis to version 3.3." can refer to https://github.com/apache/lucene/pull/768.
[GitHub] [lucene] mocobeta commented on a diff in pull request #890: Detect CI builds and enable errorprone by default for those CI builds
mocobeta commented on code in PR #890: URL: https://github.com/apache/lucene/pull/890#discussion_r873102060 ## lucene/CHANGES.txt: ## @@ -211,7 +211,9 @@ Build * Upgrade forbiddenapis to version 3.3. (Uwe Schindler) -* LUCENE-10532: Remove LuceneTestCase.Slow annotation. ALl tests can be fast. (Robert Muir) +* Detect CI builds on Github or Jenkins and enable errorprone. (Uwe Schindler, Dawid Weiss) Review Comment: FYI, we have guidance for adding CHANGES entries, but having more concrete examples that refer to GitHub PRs instead of Jira issues would be helpful to let others know that extra Jira accounts/issues are no longer required when contributing. https://github.com/apache/lucene/blob/main/CONTRIBUTING.md#add-a-changes-entry
[GitHub] [lucene] LuXugang closed pull request #511: LUCENE-10281: Error condition used to judge whether hits are sparse in StringValueFacetCounts
LuXugang closed pull request #511: LUCENE-10281: Error condition used to judge whether hits are sparse in StringValueFacetCounts URL: https://github.com/apache/lucene/pull/511