Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1751999625 Here are the results from running `test_all_sizes.py` then `results_to_md.py`: |NodeHash size|FST (mb)|RAM (mb)|FST build time (sec)|

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752003494 Thanks for looking into this @rmuir, I've been thinking something similar myself (just didn't get around to anything other than the thinking!) On my Mac M2. JDK 20.0.2. ```

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752024575 ``` // sum into accumulators Vector prod16 = prod16_1.add(prod16_2); acc = acc.add(prod16.convert(VectorOperators.S2I, 0)); acc = acc.add(prod16.convert(VectorOper

Re: [PR] Enable rank-unsafe optimization of top-k hit computations by quantizing scores. [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12628: URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752028823 Very cool, surprisingly impactful! > I ran the Tantivy benchmark with TOP_10 and TOP_100 commands This is the Tantivy benchmark tooling, but you are comparing Lucene (mai

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752029230 And of course, `ZERO_EXTEND_S2I`, will work in the maximum boundary case, but not in others. So the question is then just about the maximum value of the bytes in these input arrays

Re: [I] Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1752030874 Talking to @sokolovm at Community Over Code 2023, he suggested another idea here: instead of a (RAM hungry) hash table, couldn't we use the growing FST itself to look up suffixes?

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12631: URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752031210 > sum | 31606784 | 27188690 | -13.98% WHOA, wow! This is a massive gain for such a tiny change :) I'll try to review soon! Nice to revisit ancient `TODO`s in the source code
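As background for the PR title, here is a minimal sketch of what an MSB-first variable-length long could look like. The class `MsbVLongSketch` and its byte layout (7-bit groups, most significant group first, high bit set on every byte except the last) are assumptions made for illustration; they are not taken from PR #12631. The point it tries to show is that nearby values keep identical leading bytes when the most significant bits are written first, which is presumably what helps output sharing.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only: not the encoder from the PR.
final class MsbVLongSketch {

  // Encodes a non-negative long as 7-bit groups, most significant group first.
  // Every byte except the last has its high bit set as a continuation flag.
  static byte[] encode(long value) {
    if (value < 0) {
      throw new IllegalArgumentException("value must be non-negative");
    }
    int bits = Long.SIZE - Long.numberOfLeadingZeros(value); // 0 when value == 0
    int groups = Math.max(1, (bits + 6) / 7);
    ByteArrayOutputStream out = new ByteArrayOutputStream(groups);
    for (int i = groups - 1; i >= 0; i--) {
      int group = (int) ((value >>> (7 * i)) & 0x7F);
      out.write(i > 0 ? (group | 0x80) : group);
    }
    return out.toByteArray();
  }

  // Decodes the format produced by encode().
  static long decode(byte[] bytes) {
    long value = 0;
    for (byte b : bytes) {
      value = (value << 7) | (b & 0x7F);
      if ((b & 0x80) == 0) {
        break; // last byte of the value
      }
    }
    return value;
  }

  public static void main(String[] args) {
    // 0x7FFF and 0x7FFE differ only in their final byte under this layout.
    for (long v : new long[] {0x7FFFL, 0x7FFEL}) {
      StringBuilder sb = new StringBuilder();
      for (byte b : encode(v)) {
        sb.append(String.format("%02X ", b & 0xFF));
      }
      System.out.println(Long.toHexString(v) + " -> " + sb.toString().trim());
    }
  }
}
```

With an LSB-first encoding, the byte that differs between two nearby values comes first, so they share no prefix.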

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752033176 > What is the maximum value that we can see in the input bytes? All possible values is how i test > Can they ever hold `-128`? Yes! > Do we need to handle "ove

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752035773 Ok, cool. If there is not already one, we should add a test to the Panama / scalar unit test for the boundary values. -- This is an automated message from the Apache Git Service.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752036396 yeah agreed: we should test the boundaries for all 3 functions.

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1349699402 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java: ## @@ -99,6 +102,26 @@ public final class FieldReader extends Terms { */

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752039360 yeah, you are right, i am wrong. the trick only works in the unsigned case, Byte.MIN_VALUE is a problem :(
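The boundary problem discussed in this thread can be sketched in scalar Java. The sketch assumes, based on the `prod16_1.add(prod16_2)` snippet quoted earlier, that two byte-by-byte products are summed in a 16-bit lane before being widened to `int`. The class and method names are hypothetical, and plain casts stand in for the Vector API conversions `S2I` and `ZERO_EXTEND_S2I`.

```java
// Hypothetical scalar model of a 16-bit lane; not code from the PR.
final class ShortLaneOverflowSketch {

  // Sum of two byte*byte products, truncated to 16 bits like a short lane.
  static short sumOfTwoProducts16(byte a1, byte b1, byte a2, byte b2) {
    return (short) (a1 * b1 + a2 * b2);
  }

  // Exact result computed in 32 bits, for comparison.
  static int exact(byte a1, byte b1, byte a2, byte b2) {
    return a1 * b1 + a2 * b2;
  }

  // Sign-extending widening, standing in for S2I.
  static int signExtend(short s) {
    return s;
  }

  // Zero-extending widening, standing in for ZERO_EXTEND_S2I.
  static int zeroExtend(short s) {
    return s & 0xFFFF;
  }

  public static void main(String[] args) {
    byte min = Byte.MIN_VALUE;
    // Boundary case: (-128 * -128) * 2 = 32768 does not fit in a signed short.
    short lane = sumOfTwoProducts16(min, min, min, min);
    System.out.println("exact=" + exact(min, min, min, min)
        + " signExtend=" + signExtend(lane)
        + " zeroExtend=" + zeroExtend(lane));

    // Negative sum: zero extension now gives the wrong answer instead.
    byte one = 1;
    byte minusOne = -1;
    byte zero = 0;
    short lane2 = sumOfTwoProducts16(one, minusOne, zero, zero);
    System.out.println("exact=" + exact(one, minusOne, zero, zero)
        + " signExtend=" + signExtend(lane2)
        + " zeroExtend=" + zeroExtend(lane2));
  }
}
```

In this model, sign extension is wrong only when all four inputs are `Byte.MIN_VALUE`, while zero extension fixes that case but breaks every negative sum. That matches the comments above that zero extension works "in the maximum boundary case, but not in others".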

[PR] add tests for vectorutils integer boundaries [lucene]

2023-10-08 Thread via GitHub
rmuir opened a new pull request, #12634: URL: https://github.com/apache/lucene/pull/12634 Let's improve the testing for the boundary cases and check them explicitly.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752041404 at least we can improve the testing out of this: https://github.com/apache/lucene/pull/12634

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
gf2121 commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1349705693 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java: ## @@ -81,8 +81,11 @@ public final class Lucene90BlockTreeTermsRe

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752049654 don't worry, i have a plan B. it is just frustrating due to the nightmare of operating on the mac, combined with the fact this benchmark and lucene source is a separate repo. it makes the

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752050233 see latest commit for the idea. on my mac it gives a decent boost. it uses "32-bit" vector by loading 64-bit vector from array but only processing half of it. The tests should fail as i n

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1349711457 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java: ## @@ -81,8 +81,11 @@ public final class Lucene90BlockTreeTer

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12631: URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752050479 I kicked off a `luceneutil` run ... I'll post results here soonish.

Re: [PR] add tests for vectorutils integer boundaries [lucene]

2023-10-08 Thread via GitHub
rmuir merged PR #12634: URL: https://github.com/apache/lucene/pull/12634

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752063622 ok on my mac i see: ``` Benchmark (size) Mode Cnt Score Error Units BinaryCosineBenchmark.cosineDistanceNew 1024 thrpt 5 2.

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12631: URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752064474 `luceneutil` results on `wikimediumall` look good -- looks like all noise (even for `PKLookup`), or, any signal (change) is very low, making the ~15% reduction very much worth it.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752098666 I get similar bench results, the new impl is faster. ``` Benchmark (size) Mode Cnt Score Error Units BinaryDotProductBenchmark.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752099845 My sense here is that accessing a `part` other than `0` is less performant than just reloading the data, which seems a little off.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752100681 > My sense here is that accessing a `part` other than `0` is less performant than just reloading the data, which seems a little off. It seems to have a heavy cost no matter how i do

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752101786 btw, another crazy avenue to possibly explore here another day, since we seem bottlenecked on integer multiply. We could try it on arm too. It is faster than the current binary code on my

[I] Should we have an interface VectorValues which would be implemented by [Byte/Float]VectorValues classes [lucene]

2023-10-08 Thread via GitHub
shubhamvishu opened a new issue, #12635: URL: https://github.com/apache/lucene/issues/12635 ### Description Currently, there is a lot of code duplication due to [ByteVectorValues](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.ja

[PR] Add interface VectorValues to be implemented by [Float/Byte]VectorValues [lucene]

2023-10-08 Thread via GitHub
shubhamvishu opened a new pull request, #12636: URL: https://github.com/apache/lucene/pull/12636 ### Description The classes [ByteVectorValues](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java) and [FloatVectorValues](http

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752107370 The other thought I had around conversion costs would be to look into reinterpret+shuffle/shift/mask crap ourselves, which seems really crazy but i'm running low on ideas.

Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-10-08 Thread via GitHub
epugh commented on PR #448: URL: https://github.com/apache/lucene/pull/448#issuecomment-1752112078 It would be nice if this was updated to the awesome new OpenNLP 2.x line!

Re: [PR] Enable rank-unsafe optimization of top-k hit computations by quantizing scores. [lucene]

2023-10-08 Thread via GitHub
jpountz commented on PR #12628: URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752152301 I'll try to give a bit more context on how I ended up here. With recent work on vector search and excitement around it, I can't prevent myself from thinking that all users who are happy to

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1752165322 For comparison, this is how the curve (RAM required during construction vs final FST size) looks on trunk, using the god-like parameters as best I could. I sorted the results in reve

Re: [PR] Add interface VectorValues to be implemented by [Float/Byte]VectorValues [lucene]

2023-10-08 Thread via GitHub
benwtrent commented on PR #12636: URL: https://github.com/apache/lucene/pull/12636#issuecomment-1752194821 It was sort of this way before but we decided to switch it as a common interface required either: - having to use generics - an API where things weren't fully implemented or r

[I] segmentInfos.replace() doesn't set userData [lucene]

2023-10-08 Thread via GitHub
Shibi-bala opened a new issue, #12637: URL: https://github.com/apache/lucene/issues/12637 ### Description Found that the [replace method](https://github.com/qcri/solr-6/blob/master/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L875-L878) doesn't set `userData` with t

Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub
pzygielo commented on PR #12611: URL: https://github.com/apache/lucene/pull/12611#issuecomment-1752377046 Thanks for checking.

[PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-08 Thread via GitHub
yugushihuang opened a new pull request, #12638: URL: https://github.com/apache/lucene/pull/12638 ### Description A simple API in TermStates to expose the `needStats` flag. Addresses #12617

Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub
dweiss merged PR #12611: URL: https://github.com/apache/lucene/pull/12611

Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub
dweiss commented on PR #12611: URL: https://github.com/apache/lucene/pull/12611#issuecomment-1752397871 I've applied this to main and branch_9x (9.9). Thank you.

Re: [PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-08 Thread via GitHub
jpountz commented on PR #12638: URL: https://github.com/apache/lucene/pull/12638#issuecomment-1752414836 Can you explain how/when you plan to use this new API?

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub
dweiss commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1752416032 I didn't get into all the details but I think this looks good. Your questions are indeed intriguing - I can't provide any explanation off the top of my head, really.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752723993 @rmuir Building on your idea, and focusing again on the x64 case, I get a bit of a boost by just converting directly to int (rather than the short dance). On my Rocket

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752758194 > Benchmark (size) Mode Cnt Score Error Units BinaryDotProductBenchmark.dotProductNew 1024 thrpt 5 20.675 ± 0.051 ops/us Binar

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1350166040 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java: ## @@ -99,6 +102,26 @@ public final class FieldReader extends Terms { */

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on code in PR #12630: URL: https://github.com/apache/lucene/pull/12630#discussion_r1350170058 ## lucene/core/src/test/org/apache/lucene/index/TestBufferedUpdates.java: ## @@ -61,10 +61,10 @@ public void testRamBytesUsed() { public void testDeletedTerms()

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on PR #12625: URL: https://github.com/apache/lucene/pull/12625#issuecomment-1752818499 > While working in the code base I stumbled on this [TODO](https://github.com/apache/lucene/blob/2474940bffe6118ed31ceb717fd49705d819e1fc/lucene/core/src/java/org/apache/lucene/util/P

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1350191783 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; impor

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on code in PR #12625: URL: https://github.com/apache/lucene/pull/12625#discussion_r1350196379 ## lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java: ## @@ -312,14 +261,14 @@ private int findHash(BytesRef bytes) { // final position int ha

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-09 Thread via GitHub
jpountz commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1350212761 ## lucene/core/src/test/org/apache/lucene/codecs/lucene90/blocktree/TestMSBVLong.java: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
jpountz commented on PR #12630: URL: https://github.com/apache/lucene/pull/12630#issuecomment-1752884136 wow good catch. Out of curiosity, how did you catch it? Are you running snapshot Lucene builds in production?

[I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
gf2121 opened a new issue, #12639: URL: https://github.com/apache/lucene/issues/12639 ### Description I played with vector API to sum up bit count. This pattern can be used in [bitset cardinality](https://github.com/apache/lucene/blob/dfff1e635805ffc61dd6029a8060e2635bfcbdb9/lucene/c
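For reference, the pattern under discussion (summing the bit count over an array of longs, as in a bitset cardinality computation) can be sketched in scalar form. This is only an assumed baseline, not the benchmark code from the issue. The class name `BitCountSumSketch` is hypothetical, and the Vector API variant is omitted because it depends on the incubating `jdk.incubator.vector` module.

```java
// Hypothetical scalar baseline for "sum up bit count"; not code from the issue.
final class BitCountSumSketch {

  // Returns the total number of set bits across all words.
  static long bitCountSum(long[] words) {
    long sum = 0;
    for (long word : words) {
      sum += Long.bitCount(word);
    }
    return sum;
  }

  public static void main(String[] args) {
    long[] words = {0L, -1L, 0b1011L};
    System.out.println(bitCountSum(words));
  }
}
```

A loop of this shape is the kind of code a JIT compiler may auto-vectorize, which is relevant to the later question in the thread about whether the compiler should already be doing this.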

[PR] Ensure DrillSidewaysScorer calls LeafCollector#finish on all sideways-dim FacetsCollectors [lucene]

2023-10-09 Thread via GitHub
gsmiller opened a new pull request, #12640: URL: https://github.com/apache/lucene/pull/12640 As DrillSidewaysScorer is currently written, if any leaf collectors throw CollectionTerminatedException then `LeafCollector#finish` won't properly get called. This patch makes sure we always call `#

Re: [I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
jpountz commented on issue #12639: URL: https://github.com/apache/lucene/issues/12639#issuecomment-1752922284 This looks appealing. What is the `size` parameter in your micro benchmark, is it the number of longs or the number of bits?

Re: [I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on issue #12639: URL: https://github.com/apache/lucene/issues/12639#issuecomment-1752926917 > is it the number of longs or the number of bits? It is the number of longs. Here is the whole class: ``` @BenchmarkMode(Mode.Throughput) @OutputTimeUnit(TimeUnit.MICR

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12630: URL: https://github.com/apache/lucene/pull/12630#issuecomment-1752936864 > Just to confirm: the previous PR was not released/included in 9.8.0 right? So users are not hitting this memory leak when using the 9.8.0 release. Yes, the previous PR is not incl

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752963144 Oh! I didn't look back as far as the original commit on this PR, sorry. I see now that @rmuir tried exactly the same thing. @gf2121 Strange that we see different results. C

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on code in PR #12630: URL: https://github.com/apache/lucene/pull/12630#discussion_r1350277816 ## lucene/core/src/test/org/apache/lucene/index/TestBufferedUpdates.java: ## @@ -61,10 +61,10 @@ public void testRamBytesUsed() { public void testDeletedTerms() {

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753003897 > Oh! I didn't look back as far as the original commit on this PR, sorry. I see now that @rmuir tried exactly the same thing. > > @gf2121 Strange that we see different results. Could

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753006621 > Especially clang already makes a reasonable choice that's only sub-optimal because of CPU quirks (32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-09 Thread via GitHub
benwtrent commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1753021709 Thank you @rmuir && @ChrisHegarty for digging into this! The current Panama Vector API makes doing this kind of thing frustrating. Thank y'all for wrestling with it to make

Re: [I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
rmuir commented on issue #12639: URL: https://github.com/apache/lucene/issues/12639#issuecomment-1753033439 This is confusing since IMO the compiler should be doing this already? I remember seeing it relatively recently but you are testing with JDK20... https://bugs.openjdk.org/browse/JDK

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
iverase commented on code in PR #12625: URL: https://github.com/apache/lucene/pull/12625#discussion_r1350345121 ## lucene/core/src/java/org/apache/lucene/util/BytesRefBlockPool.java: ## @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
iverase commented on PR #12625: URL: https://github.com/apache/lucene/pull/12625#issuecomment-1753069277 I ran luceneutil for wikimedium10m and I don't think it shows any slowdown (I find the output hard to understand): ``` Task QPS baseline StdDev Q

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753077559 I reran on Java 21, `squareDistanceNewNew` looks faster: ``` openjdk version "21" 2023-09-19 OpenJDK Runtime Environment (build 21+35-2513) OpenJDK 64-Bit Server VM (build 21+35

Re: [I] Sum up bit count with vector API [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on issue #12639: URL: https://github.com/apache/lucene/issues/12639#issuecomment-1753084518 The scalar impl in JDK21 looks better ``` Benchmark (size) Mode Cnt Score Error Units BitcountBenchmark.bitCountNew 1024 thrpt 5

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-09 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1350425577 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; import

[I] ability to run JMH benchmarks from gradle [lucene]

2023-10-09 Thread via GitHub
rmuir opened a new issue, #12641: URL: https://github.com/apache/lucene/issues/12641 ### Description Background: I'm having a hard time keeping https://github.com/rmuir/vectorbench up to date, the code differs from the integrated vector code in Lucene, I have to copy/

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-09 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1350439623 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; import

Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-09 Thread via GitHub
rmuir commented on issue #12621: URL: https://github.com/apache/lucene/issues/12621#issuecomment-1753202064 @benwtrent I think a big source of confusion is that while the data might be `byte`, the related functions return 4-byte `int` and 4-byte `float` so from a vector api perspective, the

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753361544 > I reran on Java 21, `squareDistanceNewNew` looks faster: There is no change in square distance in this PR!? It only optimizes cosine and dotProduct.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753372410 > There is no change in square distance in this PR!? It only optimizes cosine and dotProduct. See the [first commit of this PR](132bf28ecf86f06f6a015f5797139d7dcf3d2fb0) and [the corresp

Re: [PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-09 Thread via GitHub
yugushihuang commented on PR #12638: URL: https://github.com/apache/lucene/pull/12638#issuecomment-1753389852 Because TermStates can be built with or without `needStats`. If an application builds the TermStates and passes them around, it is worthwhile for the application to check if the

[PR] Ensure LeafCollector#finish is only called once on the main collector during drill-sideways [lucene]

2023-10-09 Thread via GitHub
gsmiller opened a new pull request, #12642: URL: https://github.com/apache/lucene/pull/12642 Small bug fix where `#finish` can be called multiple times on the base collector during drill-sideways

[PR] Add javadoc note to LeafCollector#finish [lucene]

2023-10-09 Thread via GitHub
gsmiller opened a new pull request, #12643: URL: https://github.com/apache/lucene/pull/12643 (no comment)

Re: [PR] Refactor Lucene95 to allow off heap vector reader reuse [lucene]

2023-10-09 Thread via GitHub
benwtrent commented on PR #12629: URL: https://github.com/apache/lucene/pull/12629#issuecomment-1753485937 I am going to merge this unless there is prevailing negative sentiment. This change should significantly reduce code churn for vector codecs that require reading/writing vectors in a f

Re: [I] ability to run JMH benchmarks from gradle [lucene]

2023-10-09 Thread via GitHub
dweiss commented on issue #12641: URL: https://github.com/apache/lucene/issues/12641#issuecomment-1753495875 JMH is fairly self-contained, I don't think it should be a big deal to wrap it up into a separate module, without external plugins (which are problematic to debug, in case of problem

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-09 Thread via GitHub
clayburn commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1753607350 @dsmiley - Here is the PR we were discussing at Community Over Code

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on PR #12625: URL: https://github.com/apache/lucene/pull/12625#issuecomment-1753617337 > I ran luceneutil for wikimedium10m and I don't think it shows any slowdown (I find the output hard to understand): Hmm, surprisingly noisy, especially for the biggest regress

[I] Optimize counts on 2-clauses disjunctions [lucene]

2023-10-09 Thread via GitHub
jpountz opened a new issue, #12644: URL: https://github.com/apache/lucene/issues/12644 ### Description Counts on disjunctions could be optimized in the following case: - 2 clauses - both clauses are term queries - there are no deletes Then we could compute the count

Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-10-09 Thread via GitHub
epugh commented on PR #448: URL: https://github.com/apache/lucene/pull/448#issuecomment-1753691080 @jzonthemtn not sure I have the knowledge or chops to do this upgrade...

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-09 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1753705229 Translating/merging the above two tables into a graph: ![image](https://github.com/apache/lucene/assets/796508/6259f97c-a065-4a98-a1fc-1e4984e2386e) Some observations:

Re: [PR] Use radix sort to speed up the sorting of terms in TermInSetQuery [lucene]

2023-10-09 Thread via GitHub
gsmiller commented on code in PR #12587: URL: https://github.com/apache/lucene/pull/12587#discussion_r1350774457 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -112,7 +113,23 @@ private static PrefixCodedTerms packTerms(String field, Collection ter

Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-10-09 Thread via GitHub
jzonthemtn commented on PR #448: URL: https://github.com/apache/lucene/pull/448#issuecomment-1753825042 > @jzonthemtn not sure I have the knowledge or chops to do this upgrade... I'll push an update!

Re: [I] Optimize counts on 2-clauses disjunctions [lucene]

2023-10-09 Thread via GitHub
gsmiller commented on issue #12644: URL: https://github.com/apache/lucene/issues/12644#issuecomment-1753879508 Oh, +1. Interesting idea to try out!

Re: [PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-09 Thread via GitHub
gsmiller commented on PR #12638: URL: https://github.com/apache/lucene/pull/12638#issuecomment-1753905354 I'd also be curious to better understand the need here. Is it really about making `#docFreq` and `#totalTermFreq` calls safer/easier for callers somehow? It looks like you'll get `Illeg

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-09 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1350439623 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; import

Re: [PR] Use radix sort to speed up the sorting of terms in TermInSetQuery [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on code in PR #12587: URL: https://github.com/apache/lucene/pull/12587#discussion_r1351349192 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -112,7 +113,23 @@ private static PrefixCodedTerms packTerms(String field, Collection ter

Re: [I] Optimize counts on 2-clauses disjunctions [lucene]

2023-10-09 Thread via GitHub
jpountz commented on issue #12644: URL: https://github.com/apache/lucene/issues/12644#issuecomment-1754413611 You are right, I added the condition on deleted docs and term queries so that `count(clause)` can be computed as the doc freq of the term.
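The issue text is truncated in this archive, so the following is only a sketch of one way such a count could work, assuming the idea is inclusion-exclusion: `count(A OR B) = count(A) + count(B) - count(A AND B)`. With no deletes, `count(clause)` for a term query is just the term's doc freq, as the comment above says. The class and method names are hypothetical, and plain sorted `int[]` arrays stand in for postings lists; none of this is Lucene API.

```java
// Hypothetical sketch, not Lucene code: sorted, duplicate-free int[] arrays
// stand in for the postings (matching doc IDs) of two term queries.
class DisjunctionCountSketch {

  /** Counts doc IDs present in both postings arrays by walking them in lockstep. */
  public static int intersectionCount(int[] a, int[] b) {
    int i = 0, j = 0, count = 0;
    while (i < a.length && j < b.length) {
      if (a[i] < b[j]) {
        i++;
      } else if (a[i] > b[j]) {
        j++;
      } else {
        count++;
        i++;
        j++;
      }
    }
    return count;
  }

  /**
   * Assumed formula: count(A OR B) = count(A) + count(B) - count(A AND B).
   * With no deleted docs, count(A) and count(B) equal the doc freqs,
   * modeled here as the array lengths.
   */
  public static int disjunctionCount(int[] a, int[] b) {
    return a.length + b.length - intersectionCount(a, b);
  }

  public static void main(String[] args) {
    int[] termA = {1, 3, 5, 8};
    int[] termB = {3, 4, 8, 9, 12};
    System.out.println(disjunctionCount(termA, termB)); // prints 7
  }
}
```

The point of the sketch is that only the intersection has to be iterated; the two per-clause counts come for free from index statistics when the stated conditions (term queries, no deletes) hold.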

Re: [PR] DeletedTerms#clear should reset ByteBlockPool [lucene]

2023-10-09 Thread via GitHub
gf2121 merged PR #12630: URL: https://github.com/apache/lucene/pull/12630

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12631: URL: https://github.com/apache/lucene/pull/12631#issuecomment-1754442681 @jpountz @mikemccand Thanks a lot for the great suggestions and benchmark!

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-09 Thread via GitHub
gf2121 merged PR #12631: URL: https://github.com/apache/lucene/pull/12631

Re: [I] Write VLong in opposite order for better outputs sharing in the FST [lucene]

2023-10-09 Thread via GitHub
gf2121 closed issue #12620: Write VLong in opposite order for better outputs sharing in the FST URL: https://github.com/apache/lucene/issues/12620

[PR] Avoid duplicate array fill in BPIndexReorderer [lucene]

2023-10-09 Thread via GitHub
gf2121 opened a new pull request, #12645: URL: https://github.com/apache/lucene/pull/12645 No need to zero-fill the array here, as `computeDocFreqs` already does so.

Re: [PR] Avoid duplicate array fill in BPIndexReorderer [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on code in PR #12645: URL: https://github.com/apache/lucene/pull/12645#discussion_r1351615624 ## lucene/CHANGES.txt: ## @@ -178,7 +178,7 @@ Optimizations * GITHUB#12623: Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter. (Guo Feng)

Re: [PR] Add javadoc note to LeafCollector#finish [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on code in PR #12643: URL: https://github.com/apache/lucene/pull/12643#discussion_r1351629111 ## lucene/core/src/java/org/apache/lucene/search/LeafCollector.java: ## @@ -125,6 +125,8 @@ default DocIdSetIterator competitiveIterator() throws IOException { * i

[PR] Move addNode to FSTCompiler [lucene]

2023-10-10 Thread via GitHub
dungba88 opened a new pull request, #12646: URL: https://github.com/apache/lucene/pull/12646 ### Description Currently FSTCompiler and FST have a circular dependency on each other. FSTCompiler creates an instance of FST, and on adding a node, it delegates to `FST.addNode()` and passin

Re: [PR] Avoid duplicate array fill in BPIndexReorderer [lucene]

2023-10-10 Thread via GitHub
gf2121 merged PR #12645: URL: https://github.com/apache/lucene/pull/12645

Re: [PR] Ensure LeafCollector#finish is only called once on the main collector during drill-sideways [lucene]

2023-10-10 Thread via GitHub
gf2121 commented on code in PR #12642: URL: https://github.com/apache/lucene/pull/12642#discussion_r1351726366 ## lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java: ## @@ -1490,7 +1542,22 @@ public List reduce(Collection collectors) { .collect(Coll

Re: [PR] Move addNode to FSTCompiler [lucene]

2023-10-10 Thread via GitHub
romseygeek commented on PR #12646: URL: https://github.com/apache/lucene/pull/12646#issuecomment-1754625497 Thanks for opening @dungba88! This FST building code is very hairy and this is a nice start at cleaning it up. Given how expert this code is and that the relevant methods are al

Re: [PR] Ensure LeafCollector#finish is only called once on the main collector during drill-sideways [lucene]

2023-10-10 Thread via GitHub
jpountz commented on code in PR #12642: URL: https://github.com/apache/lucene/pull/12642#discussion_r1351794343 ## lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java: ## @@ -316,6 +316,58 @@ public void testBasic() throws Exception { IOUtils.close(searcher
