[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542440#comment-17542440 ] Uwe Schindler commented on LUCENE-10562: Hi [~zhuming], this is a question better asked on the user mailing list. The short answer: if you use {{TopTermsScoringBooleanQueryRewrite}}, you have to live with the consequences. As said several times in this issue: if you need wildcard queries, think about changing your analysis so you can do the same queries in a performant way (e.g., by using ngrams in the analysis). It is impossible to implement wildcard queries efficiently in an inverted index, as the term expansion is always done before the query executes and cannot use any other query clauses: there is no way to select only those terms in the first query that would also produce a hit for the second query (your filter), as there is no relationship between them at all. In addition: scoring wildcard queries like that is not the right way to solve your problem. > Large system: Wildcard search leads to full index scan despite filter query > --- > > Key: LUCENE-10562 > URL: https://issues.apache.org/jira/browse/LUCENE-10562 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 8.11.1 >Reporter: Henrik Hertel >Priority: Major > Labels: performance > > I use Solr and have a large system with 1TB in one core and about 5 million > documents. The textual content of large PDF files is indexed there. My query > is extremely slow (more than 30 seconds) as soon as I use wildcards e.g. > {code:java} > *searchvalue* > {code} > , even though I put a filter query in front of it that reduces to less than > 20 documents. > searchvalue -> less than 1 second > searchvalue* -> less than 1 second > My query: > {code:java} > select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0 > {code} > I've tried everything imaginable. 
It doesn't make sense to me why a search > over a small subset should take so long. If I omit the filter query > metadataitemids_is:20950, so search the entire inventory, then it also takes > the same amount of time. Therefore, I suspect that despite the filter query, > the main query runs over the entire index. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
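As a concrete illustration of the ngram approach Uwe suggests above: if you index character trigrams of each token as ordinary terms, an infix match like *archval* becomes a conjunction of exact term lookups, which the inverted index can intersect cheaply with the filter query. A standalone sketch of the idea (this is not Lucene's actual NGramTokenFilter; the class and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    /** Emit all character n-grams of the given length from a token. */
    static List<String> ngrams(String token, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            out.add(token.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Index time: store trigrams of each token as ordinary terms.
        List<String> indexed = ngrams("searchvalue", 3);
        // Query time: the infix "archval" becomes a conjunction of trigram
        // term queries that the index can intersect with other clauses.
        List<String> query = ngrams("archval", 3);
        System.out.println(indexed.containsAll(query)); // prints "true": candidate match
    }
}
```

A document is a candidate when it contains every query trigram; depending on precision needs, a verification step against the stored field can then rule out false positives.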
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542440#comment-17542440 ] Uwe Schindler edited comment on LUCENE-10562 at 5/26/22 10:58 AM: -- Hi [~zhuming], this is a question better asked on the user mailing list. The short answer: if you use {{TopTermsScoringBooleanQueryRewrite}}, you have to live with the consequences. As said several times in this issue: if you need wildcard queries, think about changing your analysis so you can do the same queries in a performant way (e.g., by using ngrams in the analysis). It is impossible to implement wildcard queries efficiently in an inverted index, as the term expansion is always done before the query executes and cannot use any other query clauses: there is no way to select only those terms in the first query that would also produce a hit for the second query (your filter), as there is no relationship between them at all. In addition: scoring wildcard queries like that - "hoping for something" - does not look like the right way to solve your problem. was (Author: thetaphi): Hi [~zhuming], this is a question better asked on the user mailing list. The short answer: if you use {{TopTermsScoringBooleanQueryRewrite}}, you have to live with the consequences. As said several times in this issue: if you need wildcard queries, think about changing your analysis so you can do the same queries in a performant way (e.g., by using ngrams in the analysis). It is impossible to implement wildcard queries efficiently in an inverted index, as the term expansion is always done before the query executes and cannot use any other query clauses: there is no way to select only those terms in the first query that would also produce a hit for the second query (your filter), as there is no relationship between them at all. In addition: scoring wildcard queries like that is not the right way to solve your problem. 
> Large system: Wildcard search leads to full index scan despite filter query > --- > > Key: LUCENE-10562 > URL: https://issues.apache.org/jira/browse/LUCENE-10562 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 8.11.1 >Reporter: Henrik Hertel >Priority: Major > Labels: performance > > I use Solr and have a large system with 1TB in one core and about 5 million > documents. The textual content of large PDF files is indexed there. My query > is extremely slow (more than 30 seconds) as soon as I use wildcards e.g. > {code:java} > *searchvalue* > {code} > , even though I put a filter query in front of it that reduces to less than > 20 documents. > searchvalue -> less than 1 second > searchvalue* -> less than 1 second > My query: > {code:java} > select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0 > {code} > I've tried everything imaginable. It doesn't make sense to me why a search > over a small subset should take so long. If I omit the filter query > metadataitemids_is:20950, so search the entire inventory, then it also takes > the same amount of time. Therefore, I suspect that despite the filter query, > the main query runs over the entire index.
[GitHub] [lucene] msokolov commented on a diff in pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs
msokolov commented on code in PR #924: URL: https://github.com/apache/lucene/pull/924#discussion_r882698450 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/Lucene92RWHnswVectorsFormat.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.backward_codecs.lucene92; + +import java.io.IOException; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.index.SegmentReadState; +import org.apache.lucene.index.SegmentWriteState; +import org.apache.lucene.util.hnsw.HnswGraph; + +public final class Lucene92RWHnswVectorsFormat extends Lucene92HnswVectorsFormat { + + /** Default number of maximum connections per node */ + public static final int DEFAULT_MAX_CONN = 16; + + /** + * Default number of the size of the queue maintained while searching during a graph construction. + */ + public static final int DEFAULT_BEAM_WIDTH = 100; + + static final int DIRECT_MONOTONIC_BLOCK_SHIFT = 16; + + /** + * Controls how many of the nearest neighbor candidates are connected to the new node. Defaults to + * {@link #DEFAULT_MAX_CONN}. See {@link HnswGraph} for more details. 
+ */ + private final int maxConn; + + /** + * The number of candidate neighbors to track while searching the graph for each newly inserted + * node. Defaults to {@link #DEFAULT_BEAM_WIDTH}. See {@link HnswGraph} for details. + */ + private final int beamWidth; + + /** Constructs a format using default graph construction parameters. */ + public Lucene92RWHnswVectorsFormat() { Review Comment: I moved the support for these from the read-only format to the read-write format because they are only used for writing. But I see the benefit of consistency too. I can move them back. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
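For context on what `beamWidth` bounds in the javadoc above: during graph construction and search, only the best `beamWidth` candidates seen so far are retained, typically in a size-bounded heap. A toy sketch of such a bounded candidate queue (illustrative only, not Lucene's NeighborQueue):

```java
import java.util.PriorityQueue;

public class BoundedCandidates {
    /** Min-heap of scores: the worst retained candidate sits on top. */
    static PriorityQueue<Float> newQueue() {
        return new PriorityQueue<>();
    }

    /** Keep at most beamWidth candidates, evicting the worst when full. */
    static void offer(PriorityQueue<Float> q, int beamWidth, float score) {
        if (q.size() < beamWidth) {
            q.add(score);
        } else if (score > q.peek()) { // better than the current worst
            q.poll();
            q.add(score);
        }
    }

    public static void main(String[] args) {
        PriorityQueue<Float> q = newQueue();
        int beamWidth = 3;
        for (float s : new float[] {0.1f, 0.9f, 0.4f, 0.7f, 0.2f}) {
            offer(q, beamWidth, s);
        }
        System.out.println(q.size() + " " + q.peek()); // prints "3 0.4"
    }
}
```

A larger `beamWidth` explores more of the graph (better recall, slower build/search); the default of 100 quoted above is the trade-off the format ships with.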
[GitHub] [lucene] msokolov commented on a diff in pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs
msokolov commented on code in PR #924: URL: https://github.com/apache/lucene/pull/924#discussion_r882700613 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/TestLucene92HnswVectorsFormat.java: ## @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.backward_codecs.lucene92; + +import org.apache.lucene.codecs.Codec; +import org.apache.lucene.codecs.KnnVectorsFormat; +import org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase; + +public class TestLucene92HnswVectorsFormat extends BaseKnnVectorsFormatTestCase { + @Override + protected Codec getCodec() { +return new Lucene92RWCodec(); + } + + public void testToString() { +Codec customCodec = +new Lucene92RWCodec() { + @Override + public KnnVectorsFormat getKnnVectorsFormatForField(String field) { +return new Lucene92RWHnswVectorsFormat(); + } +}; +String expectedString = "Lucene92RWHnswVectorsFormat"; Review Comment: Well, I guess it's merely that there is no possibility of any other values than the default ones. Indeed we can remove the local variables as you mentioned above since they always have the same value, and replace them with the constants. 
[GitHub] [lucene] msokolov commented on pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs
msokolov commented on PR #924: URL: https://github.com/apache/lucene/pull/924#issuecomment-1138630677 Hmm, I got confused about the capitalization of the format name. I saw that the names were initial-lower-case, which I thought was a mistake introduced during this refactoring, but now I see it's what we had in the Lucene92 format (but not in Lucene91, where the name was initial-capitalized). So I'll go back to the lower-case version, I guess...
[GitHub] [lucene-solr] thelabdude merged pull request #2662: SOLR-16215 Escape query characters in Solr SQL Array UDF functions
thelabdude merged PR #2662: URL: https://github.com/apache/lucene-solr/pull/2662
[GitHub] [lucene] Yuti-G commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
Yuti-G commented on PR #915: URL: https://github.com/apache/lucene/pull/915#issuecomment-1138850060 > But: the `getAllDims` time for SSDV seems to have gotten much faster with this PR, which is great! Was that expected? Or is this some horrible noise? Is it repeatable? I think it's just noise. I just re-ran the dedicated luceneutil faceting benchmark against the main branch: 1st run: https://user-images.githubusercontent.com/4710/170536870-929c6d0d-1d47-4fb3-bd80-c30e62c8d51e.png 2nd run: https://user-images.githubusercontent.com/4710/170536986-492cf393-c3f7-4248-bdb7-60ceec0aa1e7.png 3rd run: https://user-images.githubusercontent.com/4710/170546269-34be4600-e384-4e18-bd0f-cc7ac3936a6e.png
[GitHub] [lucene] gsmiller commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
gsmiller commented on PR #915: URL: https://github.com/apache/lucene/pull/915#issuecomment-1138931375 @Yuti-G I just updated the PR with some additional comments/javadoc and a very minor optimization in the SSDV#getTopDims case. Could you have a look at the latest changes when you get a chance?
[GitHub] [lucene] gsmiller commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
gsmiller commented on PR #915: URL: https://github.com/apache/lucene/pull/915#issuecomment-1138932259 Since this change is purely meant to remove some code duplication and make some very minor optimizations, and doesn't modify the API or expose any additional API surface area, I plan to merge in the next couple of days unless anyone objects. If anyone wants more time to review or has feedback, I'm more than happy to wait. Thanks!
[GitHub] [lucene] Yuti-G commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
Yuti-G commented on PR #915: URL: https://github.com/apache/lucene/pull/915#issuecomment-1138982348 Looks good to me! I will rebase my current work at https://github.com/apache/lucene/pull/914 - `getAllChildren` - after this PR is merged. Thank you so much for making the code so clean!
[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks
[ https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542666#comment-17542666 ] Alessandro Benedetti commented on LUCENE-10510: --- I spent roughly one hour fighting with Gradle: I was trying to run ./gradlew tidy before the ./gradlew check. I have JDK 17, and all I get is always a vague: "> Certain gradle tasks and plugins require access to jdk.compiler internals, your gradle.properties might have just been generated or could be out of sync (see help/localSettings.txt)" I explored the code that generates the exception:
{code:java}
task checkJdkInternalsExportedToGradle() {
  doFirst {
    def jdkCompilerModule = ModuleLayer.boot().findModule("jdk.compiler").orElseThrow()
    def gradleModule = getClass().module
    def internalsExported = [
        "com.sun.tools.javac.api",
        "com.sun.tools.javac.file",
        "com.sun.tools.javac.parser",
        "com.sun.tools.javac.tree",
        "com.sun.tools.javac.util"
    ].stream()
        .allMatch(pkg -> jdkCompilerModule.isExported(pkg, gradleModule))
    if (!internalsExported) {
      throw new GradleException(
          "Certain gradle tasks and plugins require access to jdk.compiler" +
          " internals, your gradle.properties might have just been generated or could be" +
          " out of sync (see help/localSettings.txt)")
    }
  }
}
{code}
And I also read "help/localSettings.txt" with no success. Maybe I am tired tonight; am I missing something? I couldn't find any recommendation for how to fix the problem.
If I am not missing anything, we should do something here, as I assume a random new contributor would be lost. > Check module access prior to running gjf/spotless/errorprone tasks > -- > > Key: LUCENE-10510 > URL: https://issues.apache.org/jira/browse/LUCENE-10510 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Trivial > Fix For: 9.2 > > Time Spent: 0.5h > Remaining Estimate: 0h > > PR at: [https://github.com/apache/lucene/pull/802]
[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks
[ https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542668#comment-17542668 ] Dawid Weiss commented on LUCENE-10510: -- Delete your gradle.properties and allow it to regenerate from scratch. This is explained in localSettings.txt: {code} The first invocation of any task in Lucene's gradle build will generate and save a project-local 'gradle.properties' file. This file contains the defaults you may (but don't have to) tweak for your particular hardware (or taste). Note there are certain settings in that file that may be _required_ at runtime for certain plugins (an example is the spotless/ google java format plugin, which requires adding custom exports to JVM modules). Gradle build only generates this file if it's not already present (it never overwrites the defaults) -- occasionally you may have to manually delete (or move) this file and regenerate from scratch. {code} > Check module access prior to running gjf/spotless/errorprone tasks > -- > > Key: LUCENE-10510 > URL: https://issues.apache.org/jira/browse/LUCENE-10510 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Trivial > Fix For: 9.2 > > Time Spent: 0.5h > Remaining Estimate: 0h > > PR at: [https://github.com/apache/lucene/pull/802]
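Dawid's fix above boils down to regenerating the project-local gradle.properties. A sketch of the commands (paths assume the root of a Lucene checkout; as reported later in this thread, the first run after regeneration may itself fail, so run tidy twice):

```shell
# Back up (or just delete) the stale project-local settings file
mv gradle.properties gradle.properties.bak

# The next invocation regenerates gradle.properties, including the custom
# jdk.compiler exports spotless/google-java-format needs; it may still fail
# on this first run because the JVM was launched before the file existed
./gradlew tidy

# The second run picks up the regenerated settings and should succeed
./gradlew tidy
```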
[jira] [Created] (LUCENE-10593) VectorSimilarityFunction reverse removal
Alessandro Benedetti created LUCENE-10593: - Summary: VectorSimilarityFunction reverse removal Key: LUCENE-10593 URL: https://issues.apache.org/jira/browse/LUCENE-10593 Project: Lucene - Core Issue Type: Improvement Reporter: Alessandro Benedetti org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves in the opposite way compared to the other similarities: a higher similarity value means a higher distance. For this reason it has been marked as "reversed", and a function is present to map from the similarity to a score (where higher means closer, as in all other similarities). Having this counterintuitive behavior, with no apparent explanation I could find (please correct me if I am wrong), brings a lot of nasty side effects for code readability, especially when combined with the NeighborQueue, which has a "reversed" flag itself. In addition, it also complicates the usage of the pattern: Result Queue -> MIN HEAP Candidate Queue -> MAX HEAP in HNSW searchers. The proposal in my Pull Request aims to: 1) make the Euclidean similarity just return the score, in line with the other similarities, using the formula currently used to move from distance to score 2) simplify the code, removing the bound checker that is no longer necessary 3) refactor here and there to be in line with the simplification 4) refactor NeighborQueue to clearly state when it is a MIN_HEAP or MAX_HEAP; debugging is now much easier and understanding the HNSW code is much more intuitive
[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks
[ https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542669#comment-17542669 ] Alessandro Benedetti commented on LUCENE-10510: --- [~dweiss] Your help has been pure gold, thank you very much!! I had to delete the gradle.properties and run ./gradlew tidy twice: the first time I got the error again, and the second time it went OK. Should we document that more clearly? Do you know why this happens? The "occasionally you may have to manually delete (or move) this file and regenerate from scratch." didn't catch my attention. > Check module access prior to running gjf/spotless/errorprone tasks > -- > > Key: LUCENE-10510 > URL: https://issues.apache.org/jira/browse/LUCENE-10510 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Trivial > Fix For: 9.2 > > Time Spent: 0.5h > Remaining Estimate: 0h > > PR at: [https://github.com/apache/lucene/pull/802]
[GitHub] [lucene] alessandrobenedetti opened a new pull request, #926: Neigbour queue reversed
alessandrobenedetti opened a new pull request, #926: URL: https://github.com/apache/lucene/pull/926 (https://issues.apache.org/jira/browse/LUCENE-10593) org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves in the opposite way compared to the other similarities: a higher similarity value means a higher distance. For this reason it has been marked as "reversed", and a function is present to map from the similarity to a score (where higher means closer, as in all other similarities). Having this counterintuitive behavior, with no apparent explanation I could find (please correct me if I am wrong), brings a lot of nasty side effects for code readability, especially when combined with the NeighborQueue, which has a "reversed" flag itself. In addition, it also complicates the usage of the pattern: Result Queue -> MIN HEAP Candidate Queue -> MAX HEAP in HNSW searchers. The proposal in my Pull Request aims to: 1) make the Euclidean similarity just return the score, in line with the other similarities, using the formula currently used to move from distance to score 2) simplify the code, removing the bound checker that is no longer necessary 3) refactor here and there to be in line with the simplification 4) refactor NeighborQueue to clearly state when it is a MIN_HEAP or MAX_HEAP; debugging is now much easier and understanding the HNSW code is much more intuitive
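The MIN HEAP / MAX HEAP pattern described above maps directly onto `java.util.PriorityQueue` with opposite comparators. A standalone sketch of why each queue is oriented the way it is (illustrative only, not Lucene's NeighborQueue):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class HeapOrientation {
    /** Result queue: MIN-heap, so the worst kept score sits on top (the eviction point). */
    static PriorityQueue<Float> resultQueue() {
        return new PriorityQueue<>();
    }

    /** Candidate queue: MAX-heap, so the most promising score sits on top (expanded next). */
    static PriorityQueue<Float> candidateQueue() {
        return new PriorityQueue<>(Comparator.reverseOrder());
    }

    public static void main(String[] args) {
        PriorityQueue<Float> results = resultQueue();
        PriorityQueue<Float> candidates = candidateQueue();
        for (float s : new float[] {0.3f, 0.9f, 0.5f}) {
            results.add(s);
            candidates.add(s);
        }
        System.out.println(results.peek());    // prints "0.3": worst result, evicted first
        System.out.println(candidates.peek()); // prints "0.9": best candidate, expanded first
    }
}
```

With a "reversed" flag the two orientations are hidden behind one class; making the heap direction explicit, as the PR proposes, removes that indirection.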
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal
alessandrobenedetti commented on code in PR #926: URL: https://github.com/apache/lucene/pull/926#discussion_r883088833 ## lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java: ## @@ -193,25 +204,36 @@ public void testAdvanceShallow() throws IOException { } try (IndexReader reader = DirectoryReader.open(d)) { IndexSearcher searcher = new IndexSearcher(reader); -KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3); +KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 1}, 3); Query dasq = query.rewrite(reader); Scorer scorer = dasq.createWeight(searcher, ScoreMode.COMPLETE, 1).scorer(reader.leaves().get(0)); // before advancing the iterator -assertEquals(1, scorer.advanceShallow(0)); +assertEquals(0, scorer.advanceShallow(0)); assertEquals(1, scorer.advanceShallow(1)); assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10)); // after advancing the iterator scorer.iterator().advance(2); assertEquals(2, scorer.advanceShallow(0)); +assertEquals(2, scorer.advanceShallow(1)); assertEquals(2, scorer.advanceShallow(2)); -assertEquals(3, scorer.advanceShallow(3)); assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10)); } } } + /** + * Query = (0.5, 1) + * Doc0 = (0, 0) 1 / (l2distance + 1) from query = 0.444 + * Doc1 = (1, 1) 1 / (l2distance + 1) from query = 0.8 + * Doc2 = (2, 2) 1 / (l2distance + 1) from query = 0.235 + * Doc3 = (3, 3) 1 / (l2distance + 1) from query = 0.089 + * Doc4 = (4, 4) 1 / (l2distance + 1) from query = 0.045 + * + * The expected TOP 3 = [Doc1, Doc0, Doc2] + * @throws IOException + */ Review Comment: The original test created multiple documents with the same distance from the query vector. I saw inconsistencies and non-deterministic behavior (probably caused by the graph construction and search). I added a clear example with well-defined, distinct distances, and all looks good. But let me know if you want me to investigate it more.
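The scores in the javadoc table above follow the formula the comment names, score = 1 / (l2distance + 1), where l2distance is the squared L2 distance between the document and query vectors. A standalone check of the table's numbers (this sketch does not use Lucene's VectorSimilarityFunction):

```java
public class EuclideanScore {
    /** Score for EUCLIDEAN-style similarity: 1 / (1 + squared L2 distance); higher means closer. */
    static float score(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d; // accumulate the squared L2 distance
        }
        return 1f / (1f + sum);
    }

    public static void main(String[] args) {
        float[] query = {0.5f, 1f};
        System.out.println(score(query, new float[] {1, 1})); // 0.8   -> Doc1, the top hit
        System.out.println(score(query, new float[] {0, 0})); // ~0.444 -> Doc0
        System.out.println(score(query, new float[] {2, 2})); // ~0.235 -> Doc2
    }
}
```

Doc1 at (1, 1) has squared distance 0.25 from the query, giving 1 / 1.25 = 0.8, which matches the table and the expected top-3 ordering [Doc1, Doc0, Doc2].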
[jira] [Commented] (LUCENE-10593) VectorSimilarityFunction reverse removal
[ https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542677#comment-17542677 ] Alessandro Benedetti commented on LUCENE-10593: --- https://github.com/apache/lucene/pull/926 has been opened, [~sokolov], [~mayya], [~julietibs] [~jpountz] feel free to review > VectorSimilarityFunction reverse removal > > > Key: LUCENE-10593 > URL: https://issues.apache.org/jira/browse/LUCENE-10593 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Alessandro Benedetti >Priority: Major > Labels: vector-based-search > > org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves > in an opposite way in comparison to the other similarities: > A higher similarity score means higher distance, for this reason, has been > marked with "reversed" and a function is present to map from the similarity > to a score (where higher means closer, like in all other similarities.) > Having this counterintuitive behavior with no apparent explanation I could > find(please correct me if I am wrong) brings a lot of nasty side effects for > the code readability, especially when combined with the NeighbourQueue that > has a "reversed" itself. > In addition, it complicates also the usage of the pattern: > Result Queue -> MIN HEAP > Candidate Queue -> MAX HEAP > In HNSW searchers. 
> The proposal in my Pull Request aims to: > 1) the Euclidean similarity just returns the score, in line with the other > similarities, with the formula currently used to move from distance to score > 2) simplify the code, removing the bound checker that's not necessary anymore > 3) refactor here and there to be in line with the simplification > 4) refactor of NeighborQueue to clearly state when it's a MIN_HEAP or > MAX_HEAP, now debugging is much easier and understanding the HNSW code is > much more intuitive
[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator
[ https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542717#comment-17542717 ] Denilson Amorim commented on LUCENE-8806: - I was curious about the status of this issue. Do the benchmarks posted above already consider merging of impacts for phrase queries? That is, is two-phase iteration still not a gain for WAND at this time? > WANDScorer should support two-phase iterator > > > Key: LUCENE-8806 > URL: https://issues.apache.org/jira/browse/LUCENE-8806 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Major > Attachments: LUCENE-8806.patch, LUCENE-8806.patch > > > Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer > should leverage two-phase iterators in order to be faster when used in > conjunctions.
[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks
[ https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542778#comment-17542778 ] Dawid Weiss commented on LUCENE-10510: -- This is caused by the google formatter accessing JVM internals. The first tidy failure tries to explain why it failed - this is the message you were getting: {code} * What went wrong: Execution failed for task ':checkJdkInternalsExportedToGradle'. > Certain gradle tasks and plugins require access to jdk.compiler internals, > your gradle.properties might have just been generated or could be out of sync > (see help/localSettings.txt) {code} I'm not sure what can be improved here, but feel free to suggest something to your liking! > Check module access prior to running gjf/spotless/errorprone tasks > -- > > Key: LUCENE-10510 > URL: https://issues.apache.org/jira/browse/LUCENE-10510 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Trivial > Fix For: 9.2 > > Time Spent: 0.5h > Remaining Estimate: 0h > > PR at: [https://github.com/apache/lucene/pull/802]