Re: [PR] Add javadoc note to LeafCollector#finish [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on code in PR #12643: URL: https://github.com/apache/lucene/pull/12643#discussion_r1351629111 ## lucene/core/src/java/org/apache/lucene/search/LeafCollector.java: ## @@ -125,6 +125,8 @@ default DocIdSetIterator competitiveIterator() throws IOException { * i

Re: [PR] Avoid duplicate array fill in BPIndexReorderer [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on code in PR #12645: URL: https://github.com/apache/lucene/pull/12645#discussion_r1351615624 ## lucene/CHANGES.txt: ## @@ -178,7 +178,7 @@ Optimizations * GITHUB#12623: Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter. (Guo Feng)

[PR] Avoid duplicate array fill in BPIndexReorderer [lucene]

2023-10-09 Thread via GitHub
gf2121 opened a new pull request, #12645: URL: https://github.com/apache/lucene/pull/12645 No need to fill zero as `computeDocFreqs` will do. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the s

Re: [I] Write VLong in opposite order for better outputs sharing in the FST [lucene]

2023-10-09 Thread via GitHub
gf2121 closed issue #12620: Write VLong in opposite order for better outputs sharing in the FST URL: https://github.com/apache/lucene/issues/12620

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-09 Thread via GitHub
gf2121 merged PR #12631: URL: https://github.com/apache/lucene/pull/12631

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12631: URL: https://github.com/apache/lucene/pull/12631#issuecomment-1754442681 @jpountz @mikemccand Thanks a lot for the great suggestions and benchmark!

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
gf2121 merged PR #12630: URL: https://github.com/apache/lucene/pull/12630

Re: [I] Optimize counts on 2-clauses disjunctions [lucene]

2023-10-09 Thread via GitHub
jpountz commented on issue #12644: URL: https://github.com/apache/lucene/issues/12644#issuecomment-1754413611 You are right, I added the condition on deleted docs and term queries so that `count(clause)` can be computed as the doc freq of the term.

Re: [PR] Use radix sort to speed up the sorting of terms in TermInSetQuery [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on code in PR #12587: URL: https://github.com/apache/lucene/pull/12587#discussion_r1351349192 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -112,7 +113,23 @@ private static PrefixCodedTerms packTerms(String field, Collection ter

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-09 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1350439623 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; import

Re: [PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-09 Thread via GitHub
gsmiller commented on PR #12638: URL: https://github.com/apache/lucene/pull/12638#issuecomment-1753905354 I'd also be curious to better understand the need here. Is it really about making `#docFreq` and `#totalTermFreq` calls safer/easier for callers somehow? It looks like you'll get `Illeg

Re: [I] Optimize counts on 2-clauses disjunctions [lucene]

2023-10-09 Thread via GitHub
gsmiller commented on issue #12644: URL: https://github.com/apache/lucene/issues/12644#issuecomment-1753879508 Oh, +1. Interesting idea to try out!

Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-10-09 Thread via GitHub
jzonthemtn commented on PR #448: URL: https://github.com/apache/lucene/pull/448#issuecomment-1753825042 > @jzonthemtn not sure I have the knowledge or chops to do this upgrade... I'll push an update!

Re: [PR] Use radix sort to speed up the sorting of terms in TermInSetQuery [lucene]

2023-10-09 Thread via GitHub
gsmiller commented on code in PR #12587: URL: https://github.com/apache/lucene/pull/12587#discussion_r1350774457 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -112,7 +113,23 @@ private static PrefixCodedTerms packTerms(String field, Collection ter

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1753705229 Translating/merging the above two tables into a graph: ![image](https://github.com/apache/lucene/assets/796508/6259f97c-a065-4a98-a1fc-1e4984e2386e) Some observations:

Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-10-09 Thread via GitHub
epugh commented on PR #448: URL: https://github.com/apache/lucene/pull/448#issuecomment-1753691080 @jzonthemtn not sure I have the knowledge or chops to do this upgrade...

[I] Optimize counts on 2-clauses disjunctions [lucene]

2023-10-09 Thread via GitHub
jpountz opened a new issue, #12644: URL: https://github.com/apache/lucene/issues/12644 ### Description Counts on disjunctions could be optimized in the following case: - 2 clauses - both clauses are term queries - there are no deletes Then we could compute the count
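
The description above is cut off before it states the formula. One plausible reading, consistent with the later comment that `count(clause)` can be taken from the term's doc freq, is inclusion-exclusion: the count of `A OR B` equals `count(A) + count(B) - count(A AND B)`. The sketch below only illustrates that identity; the class and method names are hypothetical and it is not code from the issue or from Lucene.

```java
import java.util.BitSet;

public class DisjunctionCountSketch {
    // Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
    // countA and countB stand in for per-term doc freqs (valid only without deletes);
    // countBoth stands in for the count of the conjunction.
    static int unionCount(int countA, int countB, int countBoth) {
        return countA + countB - countBoth;
    }

    public static void main(String[] args) {
        // Toy postings for two terms, modelled as bit sets of doc ids.
        BitSet a = new BitSet();
        a.set(1); a.set(3); a.set(5);
        BitSet b = new BitSet();
        b.set(3); b.set(4);

        BitSet both = (BitSet) a.clone();
        both.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);

        int viaFormula = unionCount(a.cardinality(), b.cardinality(), both.cardinality());
        // The formula should agree with counting the union directly.
        System.out.println(viaFormula == either.cardinality());
    }
}
```

The "no deletes" condition matters because a doc freq counts deleted documents too, so it would otherwise overstate `count(clause)`.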

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on PR #12625: URL: https://github.com/apache/lucene/pull/12625#issuecomment-1753617337 > I ran luceneutil for wikimedium10m and I don't think it shows any slowdown (I find the output hard to understand): Hmm, surprisingly noisy, especially for the biggest regress

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-09 Thread via GitHub
clayburn commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1753607350 @dsmiley - Here is the PR we were discussing at Community Over Code

Re: [I] ability to run JMH benchmarks from gradle [lucene]

2023-10-09 Thread via GitHub
dweiss commented on issue #12641: URL: https://github.com/apache/lucene/issues/12641#issuecomment-1753495875 JMH is fairly self-contained, I don't think it should be a big deal to wrap it up into a separate module, without external plugins (which are problematic to debug, in case of problem

Re: [PR] Refactor Lucene95 to allow off heap vector reader reuse [lucene]

2023-10-09 Thread via GitHub
benwtrent commented on PR #12629: URL: https://github.com/apache/lucene/pull/12629#issuecomment-1753485937 I am going to merge this unless there is prevailing negative sentiment. This change should significantly reduce code churn for vector codecs that require reading/writing vectors in a f

[PR] Add javadoc note to LeafCollector#finish [lucene]

2023-10-09 Thread via GitHub
gsmiller opened a new pull request, #12643: URL: https://github.com/apache/lucene/pull/12643 (no comment)

[PR] Ensure LeafCollector#finish is only called once on the main collector during drill-sideways [lucene]

2023-10-09 Thread via GitHub
gsmiller opened a new pull request, #12642: URL: https://github.com/apache/lucene/pull/12642 Small bug fix where `#finish` can be called multiple times on the base collector during drill-sideways

Re: [PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-09 Thread via GitHub
yugushihuang commented on PR #12638: URL: https://github.com/apache/lucene/pull/12638#issuecomment-1753389852 Because TermStates can be built with or without `needStats`. If an application builds the TermStates and passes them around, it is worthwhile for the application to check if the

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753372410 > There is no change to square distance in this PR!? It only optimizes cosine and dotProduct. See the [first commit of this PR](132bf28ecf86f06f6a015f5797139d7dcf3d2fb0) and [the corresp

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753361544 > I reran on Java 21; `squareDistanceNewNew` looks faster: There is no change to square distance in this PR!? It only optimizes cosine and dotProduct.

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-09 Thread via GitHub
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1753202064 @benwtrent I think a big source of confusion is that while the data might be `byte`, the related functions return 4-byte `int` and 4-byte `float` so from a vector api perspective, the

[I] ability to run JMH benchmarks from gradle [lucene]

2023-10-09 Thread via GitHub
rmuir opened a new issue, #12641: URL: https://github.com/apache/lucene/issues/12641 ### Description Background: I'm having a hard time keeping https://github.com/rmuir/vectorbench up to date, the code has differences with what the integrated vector code in lucene is, I have to copy/

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-09 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1350425577 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; import

Re: [I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on issue #12639: URL: https://github.com/apache/lucene/issues/12639#issuecomment-1753084518 The scalar impl in JDK21 looks better ``` Benchmark (size) Mode Cnt Score Error Units BitcountBenchmark.bitCountNew 1024 thrpt5

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753077559 I reran on Java 21; `squareDistanceNewNew` looks faster: ``` openjdk version "21" 2023-09-19 OpenJDK Runtime Environment (build 21+35-2513) OpenJDK 64-Bit Server VM (build 21+35

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
iverase commented on PR #12625: URL: https://github.com/apache/lucene/pull/12625#issuecomment-1753069277 I ran luceneutil for wikimedium10m and I don't think it shows any slowdown (I find the output hard to understand): ``` TaskQPS baseline StdDevQ

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
iverase commented on code in PR #12625: URL: https://github.com/apache/lucene/pull/12625#discussion_r1350345121 ## lucene/core/src/java/org/apache/lucene/util/BytesRefBlockPool.java: ## @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
rmuir commented on issue #12639: URL: https://github.com/apache/lucene/issues/12639#issuecomment-1753033439 This is confusing since IMO compiler should be doing this already? I remember seeing it relatively recently but you are testing with JDK20... https://bugs.openjdk.org/browse/JDK

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-09 Thread via GitHub
benwtrent commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1753021709 Thank you @rmuir && @ChrisHegarty for digging into this! The current Panama Vector API makes doing this kind of thing frustrating. Thank y'all for wrestling with it to make

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753006621 > Especially clang already makes a reasonable choice that's only sub-optimal because of CPU quirks (32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753003897 > Oh! I didn't look back as far as the original commit on this PR, sorry. I see now that @rmuir tried exactly the same thing. > > @gf2121 Strange that we see different results. Could

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on code in PR #12630: URL: https://github.com/apache/lucene/pull/12630#discussion_r1350277816 ## lucene/core/src/test/org/apache/lucene/index/TestBufferedUpdates.java: ## @@ -61,10 +61,10 @@ public void testRamBytesUsed() { public void testDeletedTerms() {

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752963144 Oh! I didn't look back as far as the original commit on this PR, sorry. I see now that @rmuir tried exactly the same thing. @gf2121 Strange that we see different results. C

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12630: URL: https://github.com/apache/lucene/pull/12630#issuecomment-1752936864 > Just to confirm: the previous PR was not released/included in 9.8.0 right? So users are not hitting this memory leak when using the 9.8.0 release. Yes, the previous PR is not incl

Re: [I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on issue #12639: URL: https://github.com/apache/lucene/issues/12639#issuecomment-1752926917 > is it the number of longs or the number of bits? It is the number of longs. Here is the whole class: ``` @BenchmarkMode(Mode.Throughput) @OutputTimeUnit(TimeUnit.MICR

Re: [I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
jpountz commented on issue #12639: URL: https://github.com/apache/lucene/issues/12639#issuecomment-1752922284 This looks appealing. What is the `size` parameter in your micro benchmark, is it the number of longs or the number of bits?

[PR] Ensure DrillSidewaysScorer calls LeafCollector#finish on all sideways-dim FacetsCollectors [lucene]

2023-10-09 Thread via GitHub
gsmiller opened a new pull request, #12640: URL: https://github.com/apache/lucene/pull/12640 As DrillSidewaysScorer is currently written, if any leaf collectors throw CollectionTerminatedException then `LeafCollector#finish` won't properly get called. This patch makes sure we always call `#
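
The pattern described above (an early-termination exception must not skip the `finish` hook) can be illustrated with a small stand-alone sketch. The `Leaf` interface and `TerminatedException` class below are hypothetical stand-ins, not Lucene's `LeafCollector` or `CollectionTerminatedException`, and this is not the code from the patch.

```java
public class FinishAlwaysSketch {
    // Hypothetical stand-in for an early-termination signal.
    static class TerminatedException extends RuntimeException {}

    // Hypothetical stand-in for a per-leaf collector.
    interface Leaf {
        void collect(int doc);
        void finish();
    }

    // Collects docs until done or until the leaf asks to stop,
    // then calls finish() exactly once in either case.
    static void collectAll(Leaf leaf, int[] docs) {
        try {
            for (int doc : docs) {
                leaf.collect(doc);
            }
        } catch (TerminatedException e) {
            // Early termination is an expected outcome, not a failure.
        }
        // Reached after normal completion and after early termination;
        // any other exception propagates without calling finish().
        leaf.finish();
    }
}
```

Calling `finish()` after the `try`/`catch` rather than in a `finally` block is a deliberate choice in this sketch: the hook then runs only when collection ended successfully or was terminated early, not when it failed.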

[I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
gf2121 opened a new issue, #12639: URL: https://github.com/apache/lucene/issues/12639 ### Description I played with vector API to sum up bit count. This pattern can be used in [bitset cardinality](https://github.com/apache/lucene/blob/dfff1e635805ffc61dd6029a8060e2635bfcbdb9/lucene/c
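
For context, the scalar pattern that such a vectorized version would replace can be sketched as a plain loop over `Long.bitCount`. This is only the scalar baseline with a hypothetical class name; it does not use the Panama Vector API and is not the code benchmarked in the issue.

```java
public class BitCountSketch {
    // Sums the number of set bits across all words, as a bitset
    // cardinality computation would.
    static long cardinality(long[] words) {
        long sum = 0;
        for (long word : words) {
            sum += Long.bitCount(word);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] words = {0b1011L, -1L, 0L};
        System.out.println(cardinality(words));
    }
}
```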

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
jpountz commented on PR #12630: URL: https://github.com/apache/lucene/pull/12630#issuecomment-1752884136 wow good catch. Out of curiosity, how did you catch it? Are you running snapshot Lucene builds in production?

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-09 Thread via GitHub
jpountz commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1350212761 ## lucene/core/src/test/org/apache/lucene/codecs/lucene90/blocktree/TestMSBVLong.java: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on code in PR #12625: URL: https://github.com/apache/lucene/pull/12625#discussion_r1350196379 ## lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java: ## @@ -312,14 +261,14 @@ private int findHash(BytesRef bytes) { // final position int ha

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1350191783 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; impor

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on PR #12625: URL: https://github.com/apache/lucene/pull/12625#issuecomment-1752818499 > While working in the code base I stumbled on this [TODO](https://github.com/apache/lucene/blob/2474940bffe6118ed31ceb717fd49705d819e1fc/lucene/core/src/java/org/apache/lucene/util/P

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on code in PR #12630: URL: https://github.com/apache/lucene/pull/12630#discussion_r1350170058 ## lucene/core/src/test/org/apache/lucene/index/TestBufferedUpdates.java: ## @@ -61,10 +61,10 @@ public void testRamBytesUsed() { public void testDeletedTerms()

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1350166040 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java: ## @@ -99,6 +102,26 @@ public final class FieldReader extends Terms { */

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752758194 > Benchmark (size) Mode Cnt Score Error Units BinaryDotProductBenchmark.dotProductNew 1024 thrpt5 20.675 ± 0.051 ops/us Binar

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752723993 @rmuir Building on your idea, and focusing again on the x64 case, I get a bit of a boost by just converting directly to int (rather than the short dance). On my Rocket