Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-15 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1395069004 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -337,11 +349,23 @@ public long size() { return getPosition(); } + /** Similar to

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-15 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1393462969 ## lucene/core/src/java/org/apache/lucene/util/fst/OnHeapFSTStore.java: ## @@ -64,22 +66,13 @@ public FSTStore init(DataInput in, long numBytes) throws IOException {

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-15 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1813856323 Seems like this PR is getting long, so I spawned 2 PRs out of it: - https://github.com/apache/lucene/pull/12814: Simplify `BytesStore` operations (which was changed to GrowableByteArra

Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-11-15 Thread via GitHub
MarcusSorealheis commented on PR #12626: URL: https://github.com/apache/lucene/pull/12626#issuecomment-1813833061 Looks good now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific com

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-15 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1393547261 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,12 +21,13 @@ import java.util.List; import org.apache.lucene.store.DataInput; import

Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-11-15 Thread via GitHub
Shibi-bala commented on PR #12626: URL: https://github.com/apache/lucene/pull/12626#issuecomment-1813343695 @uschindler Ah I needed to re-sync my forked repo 😅

Re: [I] Port PR management bot from Apache Beam [lucene]

2023-11-15 Thread via GitHub
stefanvodita commented on issue #12796: URL: https://github.com/apache/lucene/issues/12796#issuecomment-1813265420 +1 to starting super simple. I tried to hack a workflow for marking stale PRs (#12813). Fortunately, GitHub provides good [support](https://github.com/actions/stale) for this t

[PR] Introduce workflow for stale PRs [lucene]

2023-11-15 Thread via GitHub
stefanvodita opened a new pull request, #12813: URL: https://github.com/apache/lucene/pull/12813 PRs get stale and we miss out on good contributions. This workflow will mark PRs that are becoming stale. Addresses #12796

Re: [PR] Simple rename of unreleased quantization parameter [lucene]

2023-11-15 Thread via GitHub
benwtrent merged PR #12811: URL: https://github.com/apache/lucene/pull/12811

[PR] Simple rename of unreleased quantization parameter [lucene]

2023-11-15 Thread via GitHub
benwtrent opened a new pull request, #12811: URL: https://github.com/apache/lucene/pull/12811 The `quantile` parameter is actually a `confidence_interval`. This is a simple rename of this parameter for the HNSW scalar quantized format.

Re: [I] USearch integration and potential Vector Search performance improvements [lucene]

2023-11-15 Thread via GitHub
chadbrewbaker commented on issue #12502: URL: https://github.com/apache/lucene/issues/12502#issuecomment-1813112623 > Yes: > > * no external libraries for Lucene Core > * no native code Put it in an "examples" directory to show how to extend Lucene with JNI. If you have a $1

Re: [PR] Improve vector search speed by using FixedBitSet [lucene]

2023-11-15 Thread via GitHub
jpountz commented on PR #12789: URL: https://github.com/apache/lucene/pull/12789#issuecomment-1813030726 ++ This feels similar to `IndexOrDocValuesQuery`: we probably can't guess the absolute best threshold, but we can probably figure out something that is right more often than wrong. Hopef

Re: [PR] Utilize exact kNN search when gathering k > numVectors in a segment [lucene]

2023-11-15 Thread via GitHub
benwtrent merged PR #12806: URL: https://github.com/apache/lucene/pull/12806

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-11-15 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1812956627 > could you test on cohere with Max-inner product? Thanks, the gist was really helpful and gave some files including normalized and un-normalized vectors. I assume that since you

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-11-15 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1812941899 > You still need to score the vectors to realize that they are in the iteration set or not Right, I meant that we need not score all *other* vectors to determine if the vector it

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-15 Thread via GitHub
benwtrent commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1812901631 @nitirajrathore could you add something to [KnnGraphTester](https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java) that is a test for connectedness?
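
A connectedness test of the kind requested here could be sketched as a breadth-first traversal from the entry point of one graph level. This is only an illustration: `GraphConnectivityCheck`, `countReachable`, `isFullyReachable`, and the `int[][]` adjacency layout are hypothetical names and assumptions, not part of Lucene or of `KnnGraphTester`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper for illustration; not part of Lucene or luceneutil. */
final class GraphConnectivityCheck {

  /**
   * Counts the nodes reachable from entryPoint by following neighbor links.
   * Assumption: neighbors[i] holds the outgoing neighbor ids of node i on a
   * single graph level, and entryPoint is a valid node id.
   */
  public static int countReachable(int[][] neighbors, int entryPoint) {
    boolean[] visited = new boolean[neighbors.length];
    Deque<Integer> queue = new ArrayDeque<>();
    visited[entryPoint] = true;
    queue.add(entryPoint);
    int reached = 0;
    while (!queue.isEmpty()) {
      int node = queue.poll();
      reached++;
      for (int next : neighbors[node]) {
        if (!visited[next]) {
          visited[next] = true; // mark on enqueue so each node is queued once
          queue.add(next);
        }
      }
    }
    return reached;
  }

  /** True when every node on the level can be reached from entryPoint. */
  public static boolean isFullyReachable(int[][] neighbors, int entryPoint) {
    return neighbors.length == 0
        || countReachable(neighbors, entryPoint) == neighbors.length;
  }
}
```

Comparing the reachable count with the node count flags both single isolated nodes and larger disconnected subgraphs, since either one leaves some nodes unvisited.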

Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-15 Thread via GitHub
msokolov commented on PR #12810: URL: https://github.com/apache/lucene/pull/12810#issuecomment-1812678173 this sounds reasonable to me, and the code does seem simpler, but I'm not able to give a thorough review. +1 to rationalize / simplify even if it doesn't show significant performance imp

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-15 Thread via GitHub
msokolov commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1812647124 My memory of the way this diversity criterion has evolved is kind of hazy, but I believe in the very first implementation we would not impose any diversity check until the neighbor

Re: [PR] Utilize exact kNN search when gathering k > numVectors in a segment [lucene]

2023-11-15 Thread via GitHub
jpountz commented on code in PR #12806: URL: https://github.com/apache/lucene/pull/12806#discussion_r1394256922 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java: ## @@ -238,11 +238,23 @@ public void search(String field, float[] target, Kn

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-15 Thread via GitHub
benwtrent commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1812605864 @msokolov good point. It seems to me we would fully disconnect a sub-graph only if it's very clustered. Is there a way to detect this in the diversity selection?

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-15 Thread via GitHub
msokolov commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1812559330 Is the problem primarily to do with single isolated nodes or do we also see disconnected subgraphs containing multiple nodes? I think this idea would prevent the isolated nodes, bu

Re: [PR] Minor change to IndexOrDocValuesQuery#toString [lucene]

2023-11-15 Thread via GitHub
mikemccand commented on PR #12791: URL: https://github.com/apache/lucene/pull/12791#issuecomment-1812469386 Nightly sparse (NYC taxis) benchy was a bit unhappy with this change because it (weirdly) relies on `Query.toString` (I tried to fix the benchy [here](https://github.com/mikemccand/lu

Re: [PR] Utilize exact kNN search when gathering k > numVectors in a segment [lucene]

2023-11-15 Thread via GitHub
benwtrent commented on PR #12806: URL: https://github.com/apache/lucene/pull/12806#issuecomment-1812421692 > The idea makes sense to me, what is less clear to me is whether this logic belongs to the Query or to the vector reader: should searchNearestNeighbors implicitly do a linear scan whe

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-15 Thread via GitHub
benwtrent commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1812408858 @nitirajrathore @msokolov I had an idea around this, and it will cost an extra 4 bytes per node on each layer it's a member of (maybe we only need this on the bottom layer...) W

Re: [PR] Utilize exact kNN search when gathering k > numVectors in a segment [lucene]

2023-11-15 Thread via GitHub
jpountz commented on code in PR #12806: URL: https://github.com/apache/lucene/pull/12806#discussion_r1394090752 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -110,6 +110,12 @@ private TopDocs getLeafResults(LeafReaderContext ctx, Weight fil

Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-15 Thread via GitHub
jpountz commented on PR #12810: URL: https://github.com/apache/lucene/pull/12810#issuecomment-1812099057 For reference, starting postings and skip lists at -1 changes file formats, so I'm keen on getting this change in 9.9 since we had to change the file format anyway because of the move fr

Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-15 Thread via GitHub
jpountz commented on PR #12810: URL: https://github.com/apache/lucene/pull/12810#issuecomment-1812097551 This change seems to be neutral on wikibigall. No speedup, but no slowdown either. ``` TaskQPS baseline StdDevQPS my_modified_version Std

[PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-15 Thread via GitHub
jpountz opened a new pull request, #12810: URL: https://github.com/apache/lucene/pull/12810 Currently `advance(int target)` needs to perform two checks: - is there a need to use skip lists? - is there a need for decoding a new block? Ideally we would track the last doc ID in a

[I] Simplifying TextAreaPrintStream in Luke [lucene]

2023-11-15 Thread via GitHub
picimako opened a new issue, #12809: URL: https://github.com/apache/lucene/issues/12809 ### Description Hi, I've been looking into how [`org.apache.lucene.luke.app.desktop.util.TextAreaPrintStream`](https://github.com/apache/lucene/blob/main/lucene/luke/src/java/org/apache/luce

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-15 Thread via GitHub
zacharymorn commented on PR #240: URL: https://github.com/apache/lucene/pull/240#issuecomment-1811972923 > Hi @mikemccand @jpountz @javanna @gsmiller , I have updated this PR to pick up the latest from `main`, as well as revert some changes to save them for follow-up PRs that address other