Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-16 Thread via GitHub
s1monw commented on code in PR #12685: URL: https://github.com/apache/lucene/pull/12685#discussion_r1360837043 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -3368,9 +3368,15 @@ public void addIndexesReaderMerge(MergePolicy.OneMerge merge) throws IOExce

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-16 Thread via GitHub
dungba88 commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1360866889 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -99,31 +87,23 @@ public class FSTCompiler { * tuning and tweaking, see {@link Builder}.

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-16 Thread via GitHub
dungba88 commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1360875178 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -17,50 +17,80 @@ package org.apache.lucene.util.fst; import java.io.IOException; -import o

Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-10-16 Thread via GitHub
msfroh commented on code in PR #12626: URL: https://github.com/apache/lucene/pull/12626#discussion_r1360878505 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -1996,6 +1996,41 @@ public void testGetCommitData() throws Exception { dir.close();

Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-10-16 Thread via GitHub
msfroh commented on code in PR #12626: URL: https://github.com/apache/lucene/pull/12626#discussion_r1360880697 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -1996,6 +1996,41 @@ public void testGetCommitData() throws Exception { dir.close();

Re: [PR] read MSB VLong in new way [lucene]

2023-10-16 Thread via GitHub
gf2121 commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1764814636 I made some effort to speed up the `add` operation for `BytesRef`, getting a tiny improvement: > Baseline: after https://github.com/apache/lucene/pull/12631; Candidate: this patch;

[I] `IndexOrDocValuesQuery` does not support query highlighting [lucene]

2023-10-16 Thread via GitHub
harshavamsi opened a new issue, #12686: URL: https://github.com/apache/lucene/issues/12686 ### Description While working with the `IndexOrDocValuesQuery`, I noticed that highlighting was broken. This is potentially caused by the extract function that does not check if the query is in

[PR] Add timeouts to github jobs. [lucene]

2023-10-16 Thread via GitHub
dweiss opened a new pull request, #12687: URL: https://github.com/apache/lucene/pull/12687 Estimates taken from empirical run times (actions history), with a generous buffer added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to Git

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r136233 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r1361112199 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] Fix SynonymQuery equals implementation [lucene]

2023-10-16 Thread via GitHub
mingshl commented on PR #12260: URL: https://github.com/apache/lucene/pull/12260#issuecomment-1765091198 @romseygeek @mkhludnev, this bug was introduced in version 9.4; can this PR be back-ported to 9.4.2 to fix the issue?

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
jmazanec15 commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765145453 Hey @benwtrent, sorry for the delay, still looking through the change. But a 4x space improvement with minimal recall loss is awesome.

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1361186978 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
jmazanec15 commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1357440025 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-16 Thread via GitHub
jpountz commented on code in PR #12685: URL: https://github.com/apache/lucene/pull/12685#discussion_r1361188738 ## lucene/core/src/java/org/apache/lucene/index/SegmentInfo.java: ## @@ -153,6 +157,16 @@ public boolean getUseCompoundFile() { return isCompoundFile; } + /

Re: [PR] read MSB VLong in new way [lucene]

2023-10-16 Thread via GitHub
jpountz commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1765186395 If we're specializing the format anyway, I wonder if we could try different layouts. E.g. another option could be to encode the number of supplementary bytes using unary coding (like UTF
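
To make the unary-coding suggestion above concrete, here is a minimal sketch of a UTF-8-style variable-length layout: the number of leading one-bits in the first byte gives the number of supplementary bytes, and the remaining bits carry the value MSB-first. The class name `UnaryPrefixVLongSketch` and the exact bit layout are assumptions made for illustration; this is not the format used by the PR or by Lucene.

```java
// Hypothetical sketch: UTF-8-like length prefix for an unsigned 64-bit value.
// First byte = n one-bits, a zero bit (when n < 8), then the top payload bits.
// n supplementary bytes follow, most significant first.
final class UnaryPrefixVLongSketch {

    static byte[] encode(long value) {
        // Smallest n in 0..8 such that the value fits in 7 + 7n bits (n = 8 holds 64 bits).
        int n = 0;
        while (n < 8 && (value >>> (7 + 7 * n)) != 0) {
            n++;
        }
        byte[] out = new byte[n + 1];
        long rest = value;
        // Fill supplementary bytes from the end so the output is MSB-first.
        for (int i = n; i >= 1; i--) {
            out[i] = (byte) rest;
            rest >>>= 8;
        }
        int prefix = (0xFF << (8 - n)) & 0xFF; // n leading one-bits
        out[0] = (byte) (prefix | (int) rest);
        return out;
    }

    static long decode(byte[] bytes) {
        int first = bytes[0] & 0xFF;
        // Count the leading one-bits of the first byte.
        int n = Integer.numberOfLeadingZeros(~first & 0xFF) - 24;
        long value = first & (0xFF >>> (n + 1)); // payload bits of the first byte
        for (int i = 1; i <= n; i++) {
            value = (value << 8) | (bytes[i] & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 127L, 128L, 16384L, Long.MAX_VALUE};
        for (long v : samples) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

A reader of this layout learns the total length from the first byte alone, which is the property the comment points at; whether that helps in the FST output case is a separate question discussed in the thread.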

Re: [PR] Optimize OnHeapHnswGraph's data structure [lucene]

2023-10-16 Thread via GitHub
zhaih merged PR #12651: URL: https://github.com/apache/lucene/pull/12651

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765287960 Hi, why do we need a new Codec? The Lucene main file format does not change, only the HNSW format was exchanged. Because like postingsformats and docvaluesformats, the SPI can detect

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765316000 @uschindler so I should just add a new format? It would be a new Lucene99 HNSW format, but keep the default Lucene95 HNSW format? Or can we change the default vector form

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
jmazanec15 commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1361261168 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,782 @@ +/* + * Licensed to the Apache Software Fo

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
jimczi commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765330759 > why do we need a new top-level Codec? The Lucene main file format does not change, only the HNSW format was exchanged. Because like postingsformats and docvaluesformats, the SPI can d

[PR] Random access term dictionary [lucene]

2023-10-16 Thread via GitHub
Tony-X opened a new pull request, #12688: URL: https://github.com/apache/lucene/pull/12688 ### Description Related issue https://github.com/apache/lucene/issues/12513 Opening this PR early to avoid massive diffs in one-shot - [x] Encode (term type, local ord) in FST T

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765355363 > > why do we need a new top-level Codec? The Lucene main file format does not change, only the HNSW format was exchanged. Because like postingsformats and docvaluesformats, the SPI

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765362790 > @uschindler so I should just add a new format? > > It would be a new Lucene99 HNSW format, but keep the default Lucene95 HNSW format? > > Or can we change the default v

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765382016 I just checked the code, the 9.5 top-level codec addition was useless. Just code duplication. We can't revert it anymore, but we should not repeat that. The only required top-level Fo

Re: [PR] Create a task executor when executor is not provided [lucene]

2023-10-16 Thread via GitHub
sohami commented on code in PR #12606: URL: https://github.com/apache/lucene/pull/12606#discussion_r1361333154 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -420,13 +418,12 @@ public int count(Query query) throws IOException { } /** - * Ret

Re: [PR] [BROKEN, for reference only] concurrent hnsw [lucene]

2023-10-16 Thread via GitHub
zhaih commented on code in PR #12683: URL: https://github.com/apache/lucene/pull/12683#discussion_r1361334258 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java: ## @@ -59,11 +60,26 @@ protected HnswGraph() {} * * @param level level of the graph * @pa

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765386547 The simplest change is: - Remove Lucene99Codec - In Lucene95Codec just change this: `this.defaultKnnVectorsFormat = new Lucene95HnswVectorsFormat();` to the new format. Do

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-16 Thread via GitHub
zhaih commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r1361341835 ## lucene/core/src/java/org/apache/lucene/util/hnsw/IncrementalHnswGraphMerger.java: ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [I] Exception rising while using QueryTimeout [lucene]

2023-10-16 Thread via GitHub
msfroh commented on issue #12032: URL: https://github.com/apache/lucene/issues/12032#issuecomment-1765587096 I started to work on making DrillSidewaysScorer work on windows of doc IDs, when I noticed the following comment added in TestDrillSideways as part of https://github.com/apache/lucen

Re: [PR] Add timeouts to github jobs. [lucene]

2023-10-16 Thread via GitHub
dweiss merged PR #12687: URL: https://github.com/apache/lucene/pull/12687

Re: [PR] Fix SynonymQuery equals implementation [lucene]

2023-10-16 Thread via GitHub
mkhludnev commented on PR #12260: URL: https://github.com/apache/lucene/pull/12260#issuecomment-1765741713 Hi @mingshl, I'm able to cherry-pick this fix into branch_9_4, but I'm not sure there will ever be a 9.4.2 release.

Re: [PR] read MSB VLong in new way [lucene]

2023-10-17 Thread via GitHub
gf2121 commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1765808689 Hi @jpountz , Thanks a lot for the suggestion! > another option could be to encode the number of supplementary bytes using unary coding (like UTF8). This is a great idea that

Re: [PR] Scorer should sum up scores into a double [lucene]

2023-10-17 Thread via GitHub
jpountz commented on code in PR #12682: URL: https://github.com/apache/lucene/pull/12682#discussion_r1361646368 ## lucene/core/src/java/org/apache/lucene/search/ReqOptSumScorer.java: ## @@ -266,7 +265,7 @@ public float score() throws IOException { score += optScorer.score
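
As background for why this PR accumulates scores in a `double`, the following minimal sketch shows how a `float` accumulator can drop a small addend that a `double` accumulator keeps. The class and method names are hypothetical and the input values are chosen only to make the effect visible; this is not the code under review.

```java
// Hypothetical sketch: same inputs, float accumulator vs. double accumulator.
final class ScoreSumSketch {

    // Accumulates in float: each partial sum is rounded to 24 bits of precision.
    static float sumInFloat(float[] scores) {
        float sum = 0f;
        for (float s : scores) {
            sum += s;
        }
        return sum;
    }

    // Accumulates in double and rounds to float once, at the end.
    static float sumInDouble(float[] scores) {
        double sum = 0d;
        for (float s : scores) {
            sum += s;
        }
        return (float) sum;
    }

    public static void main(String[] args) {
        // 1e8f is exactly representable as a float, but 1e8 + 1 is not.
        float[] scores = {1e8f, 1f, -1e8f};
        System.out.println("float accumulator:  " + sumInFloat(scores));
        System.out.println("double accumulator: " + sumInDouble(scores));
    }
}
```

Real scores rarely differ by eight orders of magnitude, so the practical effect is usually much smaller than in this contrived input; the sketch only illustrates the rounding mechanism.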

Re: [PR] read MSB VLong in new way [lucene]

2023-10-17 Thread via GitHub
jpountz commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1765890646 Oh your explanation makes sense, and I agree with you that a more efficient encoding would be unlikely to counterbalance the fact that more arcs need to be read per output. So this loo

Re: [PR] Scorer should sum up scores into a double [lucene]

2023-10-17 Thread via GitHub
shubhamvishu commented on code in PR #12682: URL: https://github.com/apache/lucene/pull/12682#discussion_r1361736510 ## lucene/core/src/java/org/apache/lucene/search/ReqOptSumScorer.java: ## @@ -266,7 +265,7 @@ public float score() throws IOException { score += optScorer.

Re: [PR] Scorer should sum up scores into a double [lucene]

2023-10-17 Thread via GitHub
shubhamvishu commented on code in PR #12682: URL: https://github.com/apache/lucene/pull/12682#discussion_r1361737240 ## lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java: ## @@ -504,9 +504,9 @@ public TFIDFScorer(float boost, Explanation idf, float[

Re: [PR] Scorer should sum up scores into a double [lucene]

2023-10-17 Thread via GitHub
jpountz commented on code in PR #12682: URL: https://github.com/apache/lucene/pull/12682#discussion_r1361739665 ## lucene/core/src/java/org/apache/lucene/search/ReqOptSumScorer.java: ## @@ -266,7 +265,7 @@ public float score() throws IOException { score += optScorer.score

Re: [PR] read MSB VLong in new way [lucene]

2023-10-17 Thread via GitHub
gf2121 commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1765964640 > I wonder if extending the Outputs class directly would help, instead of storing data in an opaque byte[]? Yes, the reuse is exactly what `Outputs` wants to do! (see this [todo](

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
jpountz commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1765975335 If I read correctly, this query ends up calling `LeafReader#searchNearestNeighbors` with k=Integer.MAX_VALUE, which will not only run in O(maxDoc) time but also use O(maxDoc) memory. I d

Re: [PR] Scorer should sum up scores into a double [lucene]

2023-10-17 Thread via GitHub
shubhamvishu commented on PR #12682: URL: https://github.com/apache/lucene/pull/12682#issuecomment-1765985160 Thanks @jpountz for the review! I have addressed the comments in the new revision.

Re: [PR] Scorer should sum up scores into a double [lucene]

2023-10-17 Thread via GitHub
shubhamvishu commented on code in PR #12682: URL: https://github.com/apache/lucene/pull/12682#discussion_r1361773569 ## lucene/core/src/java/org/apache/lucene/search/ReqOptSumScorer.java: ## @@ -266,7 +265,7 @@ public float score() throws IOException { score += optScorer.

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-17 Thread via GitHub
jpountz commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1361783707 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -5144,20 +5145,71 @@ public int length() { } mergeReaders.add(wrappedReader);

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-17 Thread via GitHub
jpountz commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1361793823 ## lucene/core/src/java/org/apache/lucene/index/SortingCodecReader.java: ## @@ -468,7 +468,11 @@ public void checkIntegrity() throws IOException { @Override

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-17 Thread via GitHub
jpountz commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1361798124 ## lucene/core/src/java/org/apache/lucene/index/SlowCompositeCodecReaderWrapper.java: ## @@ -0,0 +1,998 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-17 Thread via GitHub
jpountz commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1361799042 ## lucene/core/src/java/org/apache/lucene/index/SlowCompositeCodecReaderWrapper.java: ## @@ -0,0 +1,998 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-17 Thread via GitHub
jpountz commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1361802385 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -5144,20 +5145,71 @@ public int length() { } mergeReaders.add(wrappedReader);

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-17 Thread via GitHub
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1361812236 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -17,50 +17,80 @@ package org.apache.lucene.util.fst; import java.io.IOException; -import

Re: [PR] Use radix sort to speed up the sorting of terms in TermInSetQuery [lucene]

2023-10-17 Thread via GitHub
gf2121 merged PR #12587: URL: https://github.com/apache/lucene/pull/12587

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-17 Thread via GitHub
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1361814551 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -99,31 +87,23 @@ public class FSTCompiler { * tuning and tweaking, see {@link Builder

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-17 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1766041651 > > With the PR, you unfortunately cannot easily say "give me a minimal FST at all costs", like you can with main today. You'd have to keep trying larger and larger NodeHash sizes unt

Re: [PR] Optimize outputs accumulating as MSB VLong outputs sharing more output prefix [lucene]

2023-10-17 Thread via GitHub
gf2121 commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1766048357 Hi @mikemccand, it would be great if you could take a look too :)

Re: [PR] Use MergeSorter in StableStringSorter [lucene]

2023-10-17 Thread via GitHub
gf2121 merged PR #12652: URL: https://github.com/apache/lucene/pull/12652

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-17 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1766082689 Thanks for the suggestions @dungba88! I took the approach you suggested, with a few more pushed commits just now. Despite the increase in `nocommit`s I think this is actually close!

Re: [PR] Remove over-counting of deleted terms [lucene]

2023-10-17 Thread via GitHub
gf2121 merged PR #12586: URL: https://github.com/apache/lucene/pull/12586

Re: [PR] Speed up TestIndexOrDocValuesQuery. [lucene]

2023-10-17 Thread via GitHub
jpountz merged PR #12672: URL: https://github.com/apache/lucene/pull/12672

Re: [PR] Specialize `BlockImpactsDocsEnum#nextDoc()`. [lucene]

2023-10-17 Thread via GitHub
jpountz merged PR #12670: URL: https://github.com/apache/lucene/pull/12670

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362016333 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,316 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Fix lazy decoding of frequencies in `BlockImpactsDocsEnum`. [lucene]

2023-10-17 Thread via GitHub
jpountz commented on PR #12668: URL: https://github.com/apache/lucene/pull/12668#issuecomment-1766342540 Even though the speedup is less pronounced than in the above luceneutil run, there seems to be an actual speedup in nightly benchmarks for boolean queries. E.g. the last 3 data points of

Re: [I] analysis-stempel incorrect tokens generation for numbers [LUCENE-10290] [lucene]

2023-10-17 Thread via GitHub
tomsquest commented on issue #11326: URL: https://github.com/apache/lucene/issues/11326#issuecomment-1766389365 We hit this issue too, and not only for numbers. In fact, any token ending in `1` will be stemmed! ``` GET _analyze { "tokenizer": "standard", "filt

Re: [PR] Reduce collection operations when minShouldMatch == 0. [lucene]

2023-10-17 Thread via GitHub
jpountz commented on PR #12602: URL: https://github.com/apache/lucene/pull/12602#issuecomment-1766428044 I would be surprised if this change yielded a noticeable speedup. Does it?

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-17 Thread via GitHub
mayya-sharipova commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362208743 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java: ## @@ -0,0 +1,1149 @@ +/* + * Licensed to the Apache Software Foundat

Re: [PR] [BROKEN, for reference only] concurrent hnsw [lucene]

2023-10-17 Thread via GitHub
msokolov commented on code in PR #12683: URL: https://github.com/apache/lucene/pull/12683#discussion_r1362245604 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java: ## @@ -59,11 +60,26 @@ protected HnswGraph() {} * * @param level level of the graph *

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766741617 > If I read correctly, this query ends up calling LeafReader#searchNearestNeighbors with k=Integer.MAX_VALUE No, we're calling the [new API](https://github.com/apache/lucene/blob

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362432970 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,316 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362437535 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,316 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r1362456237 ## lucene/core/src/java/org/apache/lucene/util/hnsw/InitializedHnswGraphBuilder.java: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r1362457390 ## lucene/core/src/java/org/apache/lucene/util/hnsw/IncrementalHnswGraphMerger.java: ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
jpountz commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766795903 Thanks for explaining, I had overlooked how the `Integer.MAX_VALUE` was used indeed. I'm still interested in figuring out if we can have stronger guarantees on the worst-case memory usag

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on code in PR #12679: URL: https://github.com/apache/lucene/pull/12679#discussion_r1362472568 ## lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on code in PR #12679: URL: https://github.com/apache/lucene/pull/12679#discussion_r1362475149 ## lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on code in PR #12679: URL: https://github.com/apache/lucene/pull/12679#discussion_r1362474206 ## lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on code in PR #12679: URL: https://github.com/apache/lucene/pull/12679#discussion_r1362476143 ## lucene/core/src/java/org/apache/lucene/search/RnnCollector.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + *

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766834111 Thanks for the review @shubhamvishu! Addressed some of the comments above > Is it right to call it a radius-based search here? I think of it as finding all results within a

Re: [PR] Fix SynonymQuery equals implementation [lucene]

2023-10-17 Thread via GitHub
mingshl commented on PR #12260: URL: https://github.com/apache/lucene/pull/12260#issuecomment-1766881156 Thank you! @mkhludnev

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-17 Thread via GitHub
benwtrent merged PR #12657: URL: https://github.com/apache/lucene/pull/12657

Re: [PR] Optimize outputs accumulating as MSB VLong outputs sharing more output prefix [lucene]

2023-10-17 Thread via GitHub
gf2121 commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1766958252 An idea comes to me that maybe we do not really need to combine all these `BytesRef`s into a single `BytesRef`; we can just build a `DataInput` over these `BytesRef`s to read. Luckily, o
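
The idea of reading across several byte slices without first concatenating them can be sketched as below. This is a plain-Java illustration using `byte[]` chunks; `ChunkedByteReaderSketch` is a hypothetical name, and the sketch does not use or mirror Lucene's `DataInput` or `BytesRef` classes.

```java
import java.util.List;

// Hypothetical sketch: sequential reads over a list of chunks, no copy into one array.
final class ChunkedByteReaderSketch {
    private final List<byte[]> chunks;
    private int chunkIndex = 0;
    private int offset = 0;

    ChunkedByteReaderSketch(List<byte[]> chunks) {
        this.chunks = chunks;
    }

    boolean hasNext() {
        skipExhaustedChunks();
        return chunkIndex < chunks.size();
    }

    byte readByte() {
        skipExhaustedChunks();
        if (chunkIndex >= chunks.size()) {
            throw new IllegalStateException("read past end of input");
        }
        return chunks.get(chunkIndex)[offset++];
    }

    // Moves to the next chunk that still has unread bytes (also skips empty chunks).
    private void skipExhaustedChunks() {
        while (chunkIndex < chunks.size() && offset == chunks.get(chunkIndex).length) {
            chunkIndex++;
            offset = 0;
        }
    }

    public static void main(String[] args) {
        ChunkedByteReaderSketch in =
            new ChunkedByteReaderSketch(List.of(new byte[] {1, 2}, new byte[0], new byte[] {3}));
        while (in.hasNext()) {
            System.out.println(in.readByte());
        }
    }
}
```

The trade-off is one extra bounds check per read in exchange for skipping the allocation and copy that building a single combined array would need.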

[PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-17 Thread via GitHub
javanna opened a new pull request, #12689: URL: https://github.com/apache/lucene/pull/12689 When operations are parallelized, like query rewrite, or search, or createWeight, one of the tasks may throw an exception. In that case we wait for all tasks to be completed before re-throwing th
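
The behaviour described above (stop waiting and cancel the remaining tasks once one of them fails) can be sketched with the standard `java.util.concurrent` classes. This is a minimal illustration under assumptions: `CancelOnFailureSketch` and its `invokeAll` method are hypothetical names, and the approach shown (an `ExecutorCompletionService` plus `Future.cancel`) is not taken from the PR's `TaskExecutor` implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: run tasks in parallel, cancel the rest on the first failure.
final class CancelOnFailureSketch {

    static <T> List<T> invokeAll(ExecutorService executor, List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        CompletionService<T> completion = new ExecutorCompletionService<>(executor);
        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(completion.submit(task));
        }
        try {
            // Consume futures in completion order so a failure is seen as soon as it happens.
            for (int i = 0; i < tasks.size(); i++) {
                completion.take().get();
            }
        } catch (ExecutionException | InterruptedException e) {
            // Cancel everything still queued or running, then surface the failure.
            for (Future<T> future : futures) {
                future.cancel(true);
            }
            throw e;
        }
        // All tasks succeeded: collect results in submission order.
        List<T> results = new ArrayList<>();
        for (Future<T> future : futures) {
            results.add(future.get());
        }
        return results;
    }

    // Returns true if a failing task caused a task that would block forever to be cancelled.
    static boolean demo() {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        CountDownLatch neverReleased = new CountDownLatch(1);
        Callable<Integer> failing = () -> { throw new IllegalStateException("boom"); };
        Callable<Integer> blocked = () -> { neverReleased.await(); return 1; };
        boolean failedAsExpected = false;
        try {
            invokeAll(executor, List.of(failing, blocked));
        } catch (ExecutionException e) {
            failedAsExpected = e.getCause() instanceof IllegalStateException;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        executor.shutdown();
        try {
            // Terminates only if the blocked task was interrupted or never started.
            return failedAsExpected && executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("cancelled remaining work after failure: " + demo());
    }
}
```

Note that `cancel(true)` only interrupts a running task; a task that ignores interruption keeps running, so cancellation here is best-effort.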

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766983182 > I think of it as finding all results within a high-dimensional circle / sphere / equivalent, dot-product, cosine, etc. don't really follow that same idea as you point out. I w
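
To illustrate the naming point, the sketch below expresses the search as "return every vector whose similarity to the query is at least a threshold", which is well defined for dot product or cosine even though "radius" is not a natural term for them. It is an exhaustive scan written for clarity; `ThresholdSearchSketch` is a hypothetical name, and this is not the HNSW-based traversal the PR implements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: brute-force threshold search using dot-product similarity.
final class ThresholdSearchSketch {

    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Returns the indices of all vectors with similarity >= minSimilarity, in index order.
    static List<Integer> search(float[][] vectors, float[] query, float minSimilarity) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < vectors.length; doc++) {
            if (dotProduct(vectors[doc], query) >= minSimilarity) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        float[][] vectors = {{1f, 0f}, {0f, 1f}, {0.6f, 0.8f}};
        float[] query = {1f, 0f};
        System.out.println(search(vectors, query, 0.5f));
    }
}
```

Unlike a top-k search, the result size here is unbounded: a low threshold can match every document, which is the worst-case memory concern raised elsewhere in the thread.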

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-17 Thread via GitHub
javanna commented on code in PR #12689: URL: https://github.com/apache/lucene/pull/12689#discussion_r1362620375 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -64,64 +67,124 @@ public final class TaskExecutor { * @param the return type of the task

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-17 Thread via GitHub
javanna commented on code in PR #12689: URL: https://github.com/apache/lucene/pull/12689#discussion_r1362621063 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -64,64 +67,124 @@ public final class TaskExecutor { * @param the return type of the task

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-17 Thread via GitHub
javanna commented on code in PR #12689: URL: https://github.com/apache/lucene/pull/12689#discussion_r1362621950 ## lucene/core/src/test/org/apache/lucene/search/TestTaskExecutor.java: ## @@ -43,7 +47,8 @@ public class TestTaskExecutor extends LuceneTestCase { public static vo

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766995337 ### Benchmarks Using the vector file from https://home.apache.org/~sokolov/enwiki-20120502-lines-1k-100d.vec (enwiki dataset, unit vectors, 100 dimensions) The setup was 1

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1767022898 > stronger guarantees on the worst-case memory usage Totally agreed @jpountz! It is very easy to go wrong in the new API, especially if the user passes a low threshold (high radius

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362661760 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java: ## @@ -0,0 +1,1149 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362664506 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,782 @@ +/* + * Licensed to the Apache Software Fou

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362665321 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,782 @@ +/* + * Licensed to the Apache Software Fou

Re: [I] [DISCUSS] Should there be a threshold-based vector search API? [lucene]

2023-10-17 Thread via GitHub
kaivalnp commented on issue #12579: URL: https://github.com/apache/lucene/issues/12579#issuecomment-1767112899 > one other thing to think about is https://weaviate.io/blog/weaviate-1-20-release#autocut Interesting! They [seem to](https://github.com/weaviate/weaviate/blob/c382dcbe6ff0

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-17 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362725464 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Move private static classes or functions out of DoubleValuesSource [lucene]

2023-10-17 Thread via GitHub
gsmiller commented on PR #12671: URL: https://github.com/apache/lucene/pull/12671#issuecomment-1767291109 Thanks for your further thoughts @shubhamvishu. Getting more opinions is always good, and like I said, I don't feel strongly enough about this change to block moving forward with it or

[PR] Remove direct dependency of NodeHash to FST [lucene]

2023-10-17 Thread via GitHub
dungba88 opened a new pull request, #12690: URL: https://github.com/apache/lucene/pull/12690 ### Description Follow-up of https://github.com/apache/lucene/pull/12646. NodeHash still depends on both FSTCompiler and FST. With the current method signature, one can create the NodeHash wi

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-17 Thread via GitHub
dungba88 commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1363098628 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -17,79 +17,177 @@ package org.apache.lucene.util.fst; import java.io.IOException; -import

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-10-17 Thread via GitHub
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1767662289 I was able to run tests on the wiki dataset using the luceneutils package. The [results show](https://github.com/mikemccand/luceneutil/pull/236) that even with a single segment

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-17 Thread via GitHub
iverase merged PR #12625: URL: https://github.com/apache/lucene/pull/12625 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

Re: [PR] Optimize outputs accumulating as MSB VLong outputs sharing more output prefix [lucene]

2023-10-17 Thread via GitHub
gf2121 commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1767756956 > So this looks like a hard search/space trade-off: we either get fast reads or good compression but we can't get both? IMO theoretically yes. We ignored some potential optimization

Re: [PR] Optimize outputs accumulating as MSB VLong outputs sharing more output prefix [lucene]

2023-10-17 Thread via GitHub
gf2121 commented on code in PR #12661: URL: https://github.com/apache/lucene/pull/12661#discussion_r1363317643 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java: ## @@ -118,13 +118,11 @@ long readVLongOutput(DataInput in) throws IOException {

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-18 Thread via GitHub
gf2121 commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1363431399 ## lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java: ## @@ -113,6 +113,7 @@ public void testEmptyChildFilter() throws Exception { final Direct

Re: [PR] Sometimes intersect the essential clause and the best non-essential clause. [lucene]

2023-10-18 Thread via GitHub
jpountz commented on PR #12589: URL: https://github.com/apache/lucene/pull/12589#issuecomment-1767952830 I moved the optimization as part of the partitioning logic so that it's easier to test. It's ready for review.
