[GitHub] [lucene] Deepika0510 opened a new pull request, #927: Adding Timeout Support to IndexSearcher
Deepika0510 opened a new pull request, #927: URL: https://github.com/apache/lucene/pull/927 ### Problem Statement Adding optional "timeout" capabilities to `IndexSearcher`. This would enable users to (optionally) specify a maximum time budget for search execution. If the search "times out", partial results would be available. Original Jira [issue](https://issues.apache.org/jira/browse/LUCENE-10151). ### Approach Created `TimeLimitingBulkscorer` extended from `BulkScorer` class to wrap methods with timeout and delegates the calls to underlying `BulkScorer`. Also created `TestTimeLimitingBulkScorer`. Link for the [Jira](https://issues.apache.org/jira/browse/LUCENE-10544) where all the approaches have been discussed. ### Note This implementation is in progress and the current PR consists of basic search functionality with timeout value as additional argument. ### Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability. - [x] I have raised the PR against the main branch. - [x] I have run ./gradlew check. - [x] I have added tests for my changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
msokolov commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883626055 ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -47,63 +42,30 @@ public float convertToScore(float similarity) { DOT_PRODUCT { @Override public float compare(float[] v1, float[] v2) { - return dotProduct(v1, v2); -} - -@Override -public float convertToScore(float similarity) { - return (1 + similarity) / 2; + return (1 + dotProduct(v1, v2)) / 2; } }, /** * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You * should only use this function if you need to preserve the original vectors and cannot normalize - * them in advance. + * them in advance. The similarity score is normalised to assure it is positive. */ COSINE { @Override public float compare(float[] v1, float[] v2) { - return cosine(v1, v2); -} - -@Override -public float convertToScore(float similarity) { - return (1 + similarity) / 2; + return (1 + cosine(v1, v2)) / 2; } }; /** - * If true, the scores associated with vector comparisons are nonnegative and in reverse order; - * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores - * represent more similar vectors, and scores may be negative or positive. - */ - public final boolean reversed; - - VectorSimilarityFunction(boolean reversed) { -this.reversed = reversed; - } - - VectorSimilarityFunction() { -reversed = false; - } - - /** - * Calculates a similarity score between the two vectors with a specified function. + * Calculates a similarity score between the two vectors with a specified function. Highest + * similarity score, means always closer vectors. Review Comment: grammar nit: "Higher similarity scores correspond to closer vectors" seems more idiomatic to me ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java: ## @@ -56,9 +56,9 @@ long apply(long v) { // Whether the search stopped early because it reached the visited nodes limit private boolean incomplete; - public NeighborQueue(int initialSize, boolean reversed) { + public NeighborQueue(int initialSize, boolean descOrder) { Review Comment: for consistency, perhaps call the parameter `maxHeap`? ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java: ## @@ -109,12 +117,15 @@ public int[] nodes() { /** Returns the top element's node id. */ public int topNode() { -return (int) order.apply(heap.top()); +return decodeNodeId(heap.top()); } - /** Returns the top element's node score. */ - public float topScore() { -return NumericUtils.sortableIntToFloat((int) (order.apply(heap.top()) >> 32)); + /** + * Returns the top element's node score. For the min heap this is the minimum score. For the max Review Comment: Is there a reason for renaming this method? It seems unneccessary. We could just change the javadoc to say "returns the top element's score"? ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) throws IOException { for (int i = neighbors.size() - 1; i > 0; i--) { int cNode = neighbors.node[i]; float[] cVector = vectorValues.vectorValue(cNode); - bound.set(neighbors.score[i]); + minAcceptedSimilarity = neighbors.score[i]; // check the candidate against its better-scoring neighbors for (int j = i - 1; j >= 0; j--) { -float diversityCheck = +float buildVectoreSimilarity = Review Comment: typo - "Vectore" should read "Vector" ## lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java: ## @@ -193,25 +204,36 @@ public void testAdvanceShallow() throws IOException { } try (IndexReader reader = DirectoryReader.open(d)) { IndexSearcher searcher = new IndexSearcher(reader); -KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3); +KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 1}, 3); Query dasq = query.rewrite(reader); Scorer scorer = dasq.createWeight(searcher, ScoreMode.COMPLETE, 1).scorer(reader.leaves().get(0)); // before advancing the iterator -assertEquals(1, scorer.advanceShallow(0)); +assertEquals(0, scorer.advanceShallow(0)); assertEquals(1, scorer.advanceShallow(1)); assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10)); // after advancing the iterator scorer.iterator().advance(2); assertEquals(2, scorer.advanceShallow(0)); +
[GitHub] [lucene] mocobeta closed pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta closed pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial URL: https://github.com/apache/lucene/pull/923 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883687821 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java: ## @@ -56,9 +56,9 @@ long apply(long v) { // Whether the search stopped early because it reached the visited nodes limit private boolean incomplete; - public NeighborQueue(int initialSize, boolean reversed) { + public NeighborQueue(int initialSize, boolean descOrder) { Review Comment: I originally called it "maxHeap" in a previous commit and then moved to "descOrder" to be in line with the NeighborArray constructor. A feedback I had from a colleague of mine was: "externally I may be interested only in building it specifying descOrder or ascOrder, then internally and per debug purposes this is mapped to MIN_HEAP or MAX_HEAP" But anyway, if you/community believes it is better "maxHeap" I'll revert that bit! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883689413 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java: ## @@ -109,12 +117,15 @@ public int[] nodes() { /** Returns the top element's node id. */ public int topNode() { -return (int) order.apply(heap.top()); +return decodeNodeId(heap.top()); } - /** Returns the top element's node score. */ - public float topScore() { -return NumericUtils.sortableIntToFloat((int) (order.apply(heap.top()) >> 32)); + /** + * Returns the top element's node score. For the min heap this is the minimum score. For the max Review Comment: maybe, but when i read topScore, I always think "maxScore", but actually it is the topNode score, which means min or max depending on the heap. My preference is to rename it to topNodeScore, but it's not a super-strong blocking opinion :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883695653 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) throws IOException { for (int i = neighbors.size() - 1; i > 0; i--) { int cNode = neighbors.node[i]; float[] cVector = vectorValues.vectorValue(cNode); - bound.set(neighbors.score[i]); + minAcceptedSimilarity = neighbors.score[i]; // check the candidate against its better-scoring neighbors for (int j = i - 1; j >= 0; j--) { -float diversityCheck = +float buildVectoreSimilarity = Review Comment: fixed! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883696044 ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -47,63 +42,30 @@ public float convertToScore(float similarity) { DOT_PRODUCT { @Override public float compare(float[] v1, float[] v2) { - return dotProduct(v1, v2); -} - -@Override -public float convertToScore(float similarity) { - return (1 + similarity) / 2; + return (1 + dotProduct(v1, v2)) / 2; } }, /** * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You * should only use this function if you need to preserve the original vectors and cannot normalize - * them in advance. + * them in advance. The similarity score is normalised to assure it is positive. */ COSINE { @Override public float compare(float[] v1, float[] v2) { - return cosine(v1, v2); -} - -@Override -public float convertToScore(float similarity) { - return (1 + similarity) / 2; + return (1 + cosine(v1, v2)) / 2; } }; /** - * If true, the scores associated with vector comparisons are nonnegative and in reverse order; - * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores - * represent more similar vectors, and scores may be negative or positive. - */ - public final boolean reversed; - - VectorSimilarityFunction(boolean reversed) { -this.reversed = reversed; - } - - VectorSimilarityFunction() { -reversed = false; - } - - /** - * Calculates a similarity score between the two vectors with a specified function. + * Calculates a similarity score between the two vectors with a specified function. Highest + * similarity score, means always closer vectors. Review Comment: fixed! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883696690 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java: ## @@ -109,12 +117,15 @@ public int[] nodes() { /** Returns the top element's node id. */ public int topNode() { -return (int) order.apply(heap.top()); +return decodeNodeId(heap.top()); } - /** Returns the top element's node score. */ - public float topScore() { -return NumericUtils.sortableIntToFloat((int) (order.apply(heap.top()) >> 32)); + /** + * Returns the top element's node score. For the min heap this is the minimum score. For the max Review Comment: In relation to the test I'll take a look and let you know! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883697041 ## lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java: ## @@ -193,25 +204,36 @@ public void testAdvanceShallow() throws IOException { } try (IndexReader reader = DirectoryReader.open(d)) { IndexSearcher searcher = new IndexSearcher(reader); -KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3); +KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 1}, 3); Query dasq = query.rewrite(reader); Scorer scorer = dasq.createWeight(searcher, ScoreMode.COMPLETE, 1).scorer(reader.leaves().get(0)); // before advancing the iterator -assertEquals(1, scorer.advanceShallow(0)); +assertEquals(0, scorer.advanceShallow(0)); assertEquals(1, scorer.advanceShallow(1)); assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10)); // after advancing the iterator scorer.iterator().advance(2); assertEquals(2, scorer.advanceShallow(0)); +assertEquals(2, scorer.advanceShallow(1)); assertEquals(2, scorer.advanceShallow(2)); -assertEquals(3, scorer.advanceShallow(3)); assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10)); } } } + /** + * Query = (0.5, 1) + * Doc0 = (0, 0) 1 / (l2distance + 1) from query = 0.444 + * Doc1 = (1, 1) 1 / (l2distance + 1) from query = 0.8 + * Doc2 = (2, 2) 1 / (l2distance + 1) from query = 0.235 + * Doc3 = (3, 3) 1 / (l2distance + 1) from query = 0.089 + * Doc4 = (4, 4) 1 / (l2distance + 1) from query = 0.045 + * + * The expected TOP 3 = [Doc1, Doc0, Doc2] + * @throws IOException + */ Review Comment: I'll check it and let you know! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542982#comment-17542982 ] Tomoko Uchida commented on LUCENE-10562: Hi, I would also recommend you not to use wildcard query, and reconsider the analysis though... if you can't or don't want to reindex the whole data and the "filter" part returns a really small result (~100, maybe?), you could consider a two-stage search? For example, run the first stage search to get the filtered results with 'status':'open', then index the results into an on-the-fly, ephemeral index backed by ByteBufferesDirectory, then perform the second search to get the final result you want. I'm not sure it works well for your data/application and there is apparently performance overhead, just remember it's occasionally useful to implement such a two-phase strategy (for me). > Large system: Wildcard search leads to full index scan despite filter query > --- > > Key: LUCENE-10562 > URL: https://issues.apache.org/jira/browse/LUCENE-10562 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 8.11.1 >Reporter: Henrik Hertel >Priority: Major > Labels: performance > > I use Solr and have a large system with 1TB in one core and about 5 million > documents. The textual content of large PDF files is indexed there. My query > is extremely slow (more than 30 seconds) as soon as I use wildcards e.g. > {code:java} > *searchvalue* > {code} > , even though I put a filter query in front of it that reduces to less than > 20 documents. > searchvalue -> less than 1 second > searchvalue* -> less than 1 second > My query: > {code:java} > select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0 > {code} > I've tried everything imaginable. It doesn't make sense to me why a search > over a small subset should take so long. If I omit the filter query > metadataitemids_is:20950, so search the entire inventory, then it also takes > the same amount of time. Therefore, I suspect that despite the filter query, > the main query runs over the entire index. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
msokolov commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1139740650 I don't understand what happened here - I'm truly sorry if I somehow (bikeshedding?) discouraged you from improving the instructions here; if it wasn't clear: I thought what you had proposed was an improvement. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1139813850 I was just unsure where to go for improving it. There will be a better way that does not overwhelm the readers of this tutorial with unfalimiar concepts or too wordy descriptions, I haven't found it yet though. Maybe, I'll open another one sometime. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
mayya-sharipova commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883911186 ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -47,63 +42,30 @@ public float convertToScore(float similarity) { DOT_PRODUCT { @Override public float compare(float[] v1, float[] v2) { - return dotProduct(v1, v2); -} - -@Override -public float convertToScore(float similarity) { - return (1 + similarity) / 2; + return (1 + dotProduct(v1, v2)) / 2; } }, /** * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You * should only use this function if you need to preserve the original vectors and cannot normalize - * them in advance. + * them in advance. The similarity score is normalised to assure it is positive. */ COSINE { @Override public float compare(float[] v1, float[] v2) { - return cosine(v1, v2); -} - -@Override -public float convertToScore(float similarity) { - return (1 + similarity) / 2; + return (1 + cosine(v1, v2)) / 2; } }; /** - * If true, the scores associated with vector comparisons are nonnegative and in reverse order; - * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores - * represent more similar vectors, and scores may be negative or positive. - */ - public final boolean reversed; - - VectorSimilarityFunction(boolean reversed) { -this.reversed = reversed; - } - - VectorSimilarityFunction() { -reversed = false; - } - - /** - * Calculates a similarity score between the two vectors with a specified function. + * Calculates a similarity score between the two vectors with a specified function. Higher + * similarity scores correspond to closer vectors. * * @param v1 a vector * @param v2 another vector, of the same dimension * @return the value of the similarity function applied to the two vectors */ public abstract float compare(float[] v1, float[] v2); Review Comment: may be instead of `compare` we should call this function `score`? ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) throws IOException { for (int i = neighbors.size() - 1; i > 0; i--) { int cNode = neighbors.node[i]; float[] cVector = vectorValues.vectorValue(cNode); - bound.set(neighbors.score[i]); + minAcceptedSimilarity = neighbors.score[i]; Review Comment: Similarly here, we don't need to use the global variable of `minAcceptedSimilarity`, it should be local for this function. ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java: ## @@ -56,9 +56,9 @@ long apply(long v) { // Whether the search stopped early because it reached the visited nodes limit private boolean incomplete; - public NeighborQueue(int initialSize, boolean reversed) { + public NeighborQueue(int initialSize, boolean descOrder) { Review Comment: I am also in favour of calling this parameter "maxHeap", first it consistent with other variables in the class, and for me personally it is easier to think if I want max_heap or min_heap for the current case. ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -155,14 +155,14 @@ private NeighborQueue searchLevel( // A bound that holds the minimum similarity to the query vector that a candidate vector must // have to be considered. -BoundsChecker bound = BoundsChecker.create(similarityFunction.reversed); +float minAcceptedSimilarity = Float.NEGATIVE_INFINITY; if (results.size() >= topK) { - bound.set(results.topScore()); + minAcceptedSimilarity = results.topNodeScore(); } while (candidates.size() > 0 && results.incomplete() == false) { // get the best candidate (closest or best scoring) - float topCandidateScore = candidates.topScore(); - if (bound.check(topCandidateScore)) { + float topCandidateSimilarity = candidates.topNodeScore(); Review Comment: Now with your changes, these are really scores, so I think it is probably worth to keep old names such as `topCandidateScore` ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswGraphBuilder.java: ## @@ -51,7 +50,7 @@ public final class Lucene90HnswGraphBuilder { private final VectorSimilarityFunction similarityFunction; private final RandomAccessVectorValues vectorValues; private final SplittableRandom random; - private final BoundsChecker bound; + private float minAcceptedSimilarity = Float.POSITIVE_INFINITY; Review Comment: I think our idea for old codec
[GitHub] [lucene] dweiss commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
dweiss commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-113873 Umm.. I agree with Mike, Tomoko - the improvement was there and evident. Can we commit it in and maybe reiterate if we find a better wording in the future? What you did was way better than it had been before. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883955103 ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -47,63 +42,30 @@ public float convertToScore(float similarity) { DOT_PRODUCT { @Override public float compare(float[] v1, float[] v2) { - return dotProduct(v1, v2); -} - -@Override -public float convertToScore(float similarity) { - return (1 + similarity) / 2; + return (1 + dotProduct(v1, v2)) / 2; } }, /** * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You * should only use this function if you need to preserve the original vectors and cannot normalize - * them in advance. + * them in advance. The similarity score is normalised to assure it is positive. */ COSINE { @Override public float compare(float[] v1, float[] v2) { - return cosine(v1, v2); -} - -@Override -public float convertToScore(float similarity) { - return (1 + similarity) / 2; + return (1 + cosine(v1, v2)) / 2; } }; /** - * If true, the scores associated with vector comparisons are nonnegative and in reverse order; - * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores - * represent more similar vectors, and scores may be negative or positive. - */ - public final boolean reversed; - - VectorSimilarityFunction(boolean reversed) { -this.reversed = reversed; - } - - VectorSimilarityFunction() { -reversed = false; - } - - /** - * Calculates a similarity score between the two vectors with a specified function. + * Calculates a similarity score between the two vectors with a specified function. Higher + * similarity scores correspond to closer vectors. * * @param v1 a vector * @param v2 another vector, of the same dimension * @return the value of the similarity function applied to the two vectors */ public abstract float compare(float[] v1, float[] v2); Review Comment: should we call it 'similarity'? if we want to name the method with a verb maybe: getSimilarity? scoreSimilarity? computeSimilarity? I think they are almost synonyms here but in general, I think we are using the similarity as a score. Not a strong opinion anyway, happy to change it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883956265 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -155,14 +155,14 @@ private NeighborQueue searchLevel( // A bound that holds the minimum similarity to the query vector that a candidate vector must // have to be considered. -BoundsChecker bound = BoundsChecker.create(similarityFunction.reversed); +float minAcceptedSimilarity = Float.NEGATIVE_INFINITY; if (results.size() >= topK) { - bound.set(results.topScore()); + minAcceptedSimilarity = results.topNodeScore(); } while (candidates.size() > 0 && results.incomplete() == false) { // get the best candidate (closest or best scoring) - float topCandidateScore = candidates.topScore(); - if (bound.check(topCandidateScore)) { + float topCandidateSimilarity = candidates.topNodeScore(); Review Comment: it is a similarity score rather than a document score(even if they pretty much align). Let's decide if naming all of them 'similarity" (as it should be right now) or all "scores" I am more in favor of 'similarity' as it makes the code a bit more readable, not a strong opinion anyway. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883956772 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -245,11 +244,11 @@ private boolean diversityCheck( NeighborArray neighbors, RandomAccessVectorValues vectorValues) throws IOException { -bound.set(score); +minAcceptedSimilarity = score; Review Comment: I was initially using it, but I noticed the BoundChecker was global, so I thought using the min check locally could cause bugs, if you are sure local is just fine I'll do that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883957300 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswGraphBuilder.java: ## @@ -51,7 +50,7 @@ public final class Lucene90HnswGraphBuilder { private final VectorSimilarityFunction similarityFunction; private final RandomAccessVectorValues vectorValues; private final SplittableRandom random; - private final BoundsChecker bound; + private float minAcceptedSimilarity = Float.POSITIVE_INFINITY; Review Comment: Ok, didn't know! waiting for @msokolov and others' confirmation then, happy to do the change once we all agree! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883958913 ## lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java: ## @@ -193,25 +208,40 @@ public void testAdvanceShallow() throws IOException { } try (IndexReader reader = DirectoryReader.open(d)) { IndexSearcher searcher = new IndexSearcher(reader); -KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3); +KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 1}, 3); Review Comment: Will double check, @msokolov confirmed it was done on purpose so I may just revert it and investigate it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on PR #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-1140109855 @mikemccand Thanks for beasting it and uncovering this issue! I tried to do the same on my box but haven't had any luck so far. Any ideas on how I can make it more likely to happen? How many cores (and threads) did you run this on? Side Note: it might be helpful to have a gradle parameter for thread count for concurrency tests. Maybe something that goes with the [Concurrent](https://issues.apache.org/jira/browse/LUCENE-10566) test group? I'm not sure if there's one for this already (the test seems hardcoded to `int numThreads = TEST_NIGHTLY ? 5 : 2;`. I tried reading through code to figure this out. I suspect this is because the `synchronized AddIndexesMergeSource::getNextMerge()` function is really locking on the inner class object, and not on IndexWriter. This works fine for `pendingAddIndexesMerges` because that is an inner class object, but probably hits race conditions for `runningMerges` which belongs to `IndexWriter`. I could either wrap the runningMerges update with a `synchronized (IndexWriter.this) {}`, or make `runningMerges` a `synchronizedSet`. I like the second approach as it automatically fixes this at all other places. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1140176391 I'm unsure if the change improves it. Thinking calmly there are two questions in the change here. 1. The start commands for the demo app are no longer platform agnostic. - The old one works on any platform since it does not include any file paths in it - it picks CLASSPATH environment variable. 2. Put all jars in a directory into classpath with `*` is not great. - Old one says "Put all four of these files in your Java CLASSPATH." - yes it's the correct way to encourage people to find necessary jars and construct classpath on their own, except for the list of jars is obsoleted. As we saw things seem not so obvious as I first thought, I would postpone it for now; anyway it does not a big deal I think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1140180971 Also thank you both for your comments - at least I learned the subtle difficulties related to launching commands (again). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org