[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-26 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542440#comment-17542440
 ] 

Uwe Schindler commented on LUCENE-10562:


Hi [~zhuming],
this question is better asked on the user mailing list.

The short answer: if you use {{TopTermsScoringBooleanQueryRewrite}}, you have to
live with the consequences. As said several times in this issue: if you need
wildcard queries, think about changing your analysis (e.g., by using ngrams) so
that you can run the same queries in a performant way. It is impossible to
implement wildcard queries efficiently in inverted indexes, because the
expansion is always done before the query executes and cannot use any other
query clauses: there is no way to select only those terms in the first query
that would also produce a hit for the second query (your filter), as there is no
relationship between them at all.

In addition: scoring wildcard queries like that is not the right way to
solve your problem.
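
To make the point above concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class name, the method names and the n-gram size of 3 are all assumptions made for illustration. It contrasts expanding an infix wildcard, which has to test every term in the dictionary no matter what filter is applied later, with an n-gram lookup prepared at index time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of a term dictionary; hypothetical names, not Lucene's API. */
final class WildcardVsNgramSketch {
  static final int GRAM = 3; // assumed n-gram size

  /** Infix wildcard (*needle*): every term in the dictionary has to be tested. */
  static List<String> expandInfixWildcard(TreeSet<String> terms, String needle) {
    List<String> matches = new ArrayList<>();
    for (String term : terms) { // full scan, independent of any filter clause
      if (term.contains(needle)) {
        matches.add(term);
      }
    }
    return matches;
  }

  /** Index time: map each n-gram to the set of terms that contain it. */
  static Map<String, Set<String>> buildNgramIndex(TreeSet<String> terms) {
    Map<String, Set<String>> index = new HashMap<>();
    for (String term : terms) {
      for (int i = 0; i + GRAM <= term.length(); i++) {
        index.computeIfAbsent(term.substring(i, i + GRAM), k -> new HashSet<>()).add(term);
      }
    }
    return index;
  }

  /** Query time: intersect the postings of the needle's n-grams, then verify. */
  static Set<String> lookupByNgrams(Map<String, Set<String>> index, String needle) {
    Set<String> candidates = null;
    for (int i = 0; i + GRAM <= needle.length(); i++) {
      Set<String> postings = index.getOrDefault(needle.substring(i, i + GRAM), Set.of());
      if (candidates == null) {
        candidates = new HashSet<>(postings);
      } else {
        candidates.retainAll(postings);
      }
    }
    if (candidates == null) {
      return Set.of(); // needle shorter than GRAM: not handled by this sketch
    }
    candidates.removeIf(term -> !term.contains(needle)); // drop false positives
    return candidates;
  }
}
```

In this model the wildcard expansion costs time proportional to the number of distinct terms, while the n-gram lookup only touches the postings of the grams in the search string. That is the trade-off the ngram suggestion refers to: a larger index in exchange for ordinary term lookups at query time.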

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-26 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542440#comment-17542440
 ] 

Uwe Schindler edited comment on LUCENE-10562 at 5/26/22 10:58 AM:
--

Hi [~zhuming],
this question is better asked on the user mailing list.

The short answer: if you use {{TopTermsScoringBooleanQueryRewrite}}, you have to
live with the consequences. As said several times in this issue: if you need
wildcard queries, think about changing your analysis (e.g., by using ngrams) so
that you can run the same queries in a performant way. It is impossible to
implement wildcard queries efficiently in inverted indexes, because the
expansion is always done before the query executes and cannot use any other
query clauses: there is no way to select only those terms in the first query
that would also produce a hit for the second query (your filter), as there is no
relationship between them at all.

In addition: scoring wildcard queries like that - "hoping for something" -
does not look like the right way to solve your problem.









[GitHub] [lucene] msokolov commented on a diff in pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs

2022-05-26 Thread GitBox


msokolov commented on code in PR #924:
URL: https://github.com/apache/lucene/pull/924#discussion_r882698450


##
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/Lucene92RWHnswVectorsFormat.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.backward_codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+public final class Lucene92RWHnswVectorsFormat extends Lucene92HnswVectorsFormat {
+
+  /** Default number of maximum connections per node */
+  public static final int DEFAULT_MAX_CONN = 16;
+
+  /**
+   * Default number of the size of the queue maintained while searching during a graph construction.
+   */
+  public static final int DEFAULT_BEAM_WIDTH = 100;
+
+  static final int DIRECT_MONOTONIC_BLOCK_SHIFT = 16;
+
+  /**
+   * Controls how many of the nearest neighbor candidates are connected to the new node. Defaults to
+   * {@link #DEFAULT_MAX_CONN}. See {@link HnswGraph} for more details.
+   */
+  private final int maxConn;
+
+  /**
+   * The number of candidate neighbors to track while searching the graph for each newly inserted
+   * node. Defaults to {@link #DEFAULT_BEAM_WIDTH}. See {@link HnswGraph} for details.
+   */
+  private final int beamWidth;
+
+  /** Constructs a format using default graph construction parameters. */
+  public Lucene92RWHnswVectorsFormat() {

Review Comment:
   I moved the support for these from the read-only format to the read-write
format because they are only used for writing. But I see the benefit of
consistency too, so I can move them back.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs

2022-05-26 Thread GitBox


msokolov commented on code in PR #924:
URL: https://github.com/apache/lucene/pull/924#discussion_r882700613


##
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/TestLucene92HnswVectorsFormat.java:
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.backward_codecs.lucene92;
+
+import org.apache.lucene.codecs.Codec;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase;
+
+public class TestLucene92HnswVectorsFormat extends BaseKnnVectorsFormatTestCase {
+  @Override
+  protected Codec getCodec() {
+    return new Lucene92RWCodec();
+  }
+
+  public void testToString() {
+    Codec customCodec =
+        new Lucene92RWCodec() {
+          @Override
+          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
+            return new Lucene92RWHnswVectorsFormat();
+          }
+        };
+    String expectedString = "Lucene92RWHnswVectorsFormat";

Review Comment:
   Well, I guess it's merely that there is no possibility of any other values 
than the default ones. Indeed we can remove the local variables as you 
mentioned above since they always have the same value, and replace them with 
the constants.






[GitHub] [lucene] msokolov commented on pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs

2022-05-26 Thread GitBox


msokolov commented on PR #924:
URL: https://github.com/apache/lucene/pull/924#issuecomment-1138630677

   Hmm, I got confused about the capitalization of the format name. I saw that
the names were initial-lower-case, which I thought was a mistake introduced
during this refactoring, but now I see it's what we had in the Lucene92 format
(but not in Lucene91, where the name was initial-capitalized). So I'll go back
to the lower-case version, I guess...





[GitHub] [lucene-solr] thelabdude merged pull request #2662: SOLR-16215 Escape query characters in Solr SQL Array UDF functions

2022-05-26 Thread GitBox


thelabdude merged PR #2662:
URL: https://github.com/apache/lucene-solr/pull/2662





[GitHub] [lucene] Yuti-G commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit

2022-05-26 Thread GitBox


Yuti-G commented on PR #915:
URL: https://github.com/apache/lucene/pull/915#issuecomment-1138850060

   > But: the `getAllDims` time for SSDV seems to have gotten much faster with 
this PR, which is great! Was that expected? Or is this some horrible noise? Is 
it repeatable?
   
   I think it's just noise. I just re-ran the dedicated luceneutil faceting
benchmark against the main branch:
   
   1st run:
   https://user-images.githubusercontent.com/4710/170536870-929c6d0d-1d47-4fb3-bd80-c30e62c8d51e.png
   
   2nd run:
   https://user-images.githubusercontent.com/4710/170536986-492cf393-c3f7-4248-bdb7-60ceec0aa1e7.png
   
   3rd run:
   https://user-images.githubusercontent.com/4710/170546269-34be4600-e384-4e18-bd0f-cc7ac3936a6e.png
   
   





[GitHub] [lucene] gsmiller commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit

2022-05-26 Thread GitBox


gsmiller commented on PR #915:
URL: https://github.com/apache/lucene/pull/915#issuecomment-1138931375

   @Yuti-G I just updated the PR with some additional comments/javadoc and a 
very minor optimization in the SSDV#getTopDims case. Could you have a look at 
the latest changes when you get a chance?
   





[GitHub] [lucene] gsmiller commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit

2022-05-26 Thread GitBox


gsmiller commented on PR #915:
URL: https://github.com/apache/lucene/pull/915#issuecomment-1138932259

   Since this change is purely meant to remove some code duplication and make 
some very minor optimizations, and doesn't modify the API or expose any 
additional API surface area, I plan to merge in the next couple of days unless 
anyone objects. If anyone wants more time to review or has feedback, I'm more 
than happy to wait. Thanks!





[GitHub] [lucene] Yuti-G commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit

2022-05-26 Thread GitBox


Yuti-G commented on PR #915:
URL: https://github.com/apache/lucene/pull/915#issuecomment-1138982348

   Looks good to me! I will rebase my current work at 
https://github.com/apache/lucene/pull/914 - `getAllChildren`  after this PR is 
merged. Thank you so much for making the code so clean!





[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks

2022-05-26 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542666#comment-17542666
 ] 

Alessandro Benedetti commented on LUCENE-10510:
---

I spent roughly one hour fighting with Gradle. I was trying to run ./gradlew
tidy before ./gradlew check.
I have JDK 17, and all I ever get is a vague:
"> Certain gradle tasks and plugins require access to jdk.compiler internals, 
your gradle.properties might have just been generated or could be out of sync 
(see help/localSettings.txt)"

I explored the code that generates the exception:

{code:java}
task checkJdkInternalsExportedToGradle() {
  doFirst {
    def jdkCompilerModule = ModuleLayer.boot().findModule("jdk.compiler").orElseThrow()
    def gradleModule = getClass().module
    def internalsExported = [
        "com.sun.tools.javac.api",
        "com.sun.tools.javac.file",
        "com.sun.tools.javac.parser",
        "com.sun.tools.javac.tree",
        "com.sun.tools.javac.util"
    ].stream()
        .allMatch(pkg -> jdkCompilerModule.isExported(pkg, gradleModule))

    if (!internalsExported) {
      throw new GradleException(
          "Certain gradle tasks and plugins require access to jdk.compiler" +
              " internals, your gradle.properties might have just been generated or could be" +
              " out of sync (see help/localSettings.txt)")
    }
  }
}
{code}

I also read "help/localSettings.txt", with no success.
Maybe I am tired tonight; am I missing something?
I couldn't find any recommendation for how to fix the problem.
If I am not missing anything, we should do something, as I assume a random new
contributor would be lost.
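
For anyone who wants to reproduce the check outside of Gradle, the same test can be sketched in plain Java (11 or later). The class and method names below are made up for illustration; only `ModuleLayer.boot()`, `findModule` and `Module.isExported` are real JDK API. The sketch returns false instead of throwing when the module is absent, which is a deliberate simplification compared with the Gradle task.

```java
import java.util.List;
import java.util.Optional;

/** Stand-alone sketch of the export check; hypothetical class, not part of the Lucene build. */
final class ModuleExportCheck {

  /** True only if the boot-layer module exists and exports every package to {@code target}. */
  static boolean exportsAll(String moduleName, List<String> packages, Module target) {
    Optional<Module> module = ModuleLayer.boot().findModule(moduleName);
    if (module.isEmpty()) {
      return false; // e.g. a runtime image that does not ship the module
    }
    return packages.stream().allMatch(pkg -> module.get().isExported(pkg, target));
  }

  public static void main(String[] args) {
    List<String> javacInternals =
        List.of(
            "com.sun.tools.javac.api",
            "com.sun.tools.javac.file",
            "com.sun.tools.javac.parser",
            "com.sun.tools.javac.tree",
            "com.sun.tools.javac.util");
    Module self = ModuleExportCheck.class.getModule();
    System.out.println(
        "javac internals exported to this code: "
            + exportsAll("jdk.compiler", javacInternals, self));
  }
}
```

Running it once as is and once with a JVM flag such as `--add-exports jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED` for each listed package should show the result change, which is what the check in the build is looking for.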


> Check module access prior to running gjf/spotless/errorprone tasks
> --
>
> Key: LUCENE-10510
> URL: https://issues.apache.org/jira/browse/LUCENE-10510
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: 9.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> PR at: [https://github.com/apache/lucene/pull/802]






[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks

2022-05-26 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542668#comment-17542668
 ] 

Dawid Weiss commented on LUCENE-10510:
--

Delete your gradle.properties and allow it to regenerate from scratch. This is 
explained in localSettings.txt:
{code}
The first invocation of any task in Lucene's gradle build will generate
and save a project-local 'gradle.properties' file. This file contains
the defaults you may (but don't have to) tweak for your particular hardware
(or taste). Note there are certain settings in that file that may
be _required_ at runtime for certain plugins (an example is the spotless/
google java format plugin, which requires adding custom exports to JVM modules). Gradle
build only generates this file if it's not already present (it never overwrites
the defaults) -- occasionally you may have to manually delete (or move) this
file and regenerate from scratch. 
{code}







[jira] [Created] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-05-26 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti created LUCENE-10593:
-

 Summary: VectorSimilarityFunction reverse removal
 Key: LUCENE-10593
 URL: https://issues.apache.org/jira/browse/LUCENE-10593
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Alessandro Benedetti


The org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity
behaves in the opposite way to the other similarities:
a higher similarity value means a greater distance. For this reason it has been
marked as "reversed", and a function is present to map from the similarity to
a score (where higher means closer, as in all other similarities).

This counterintuitive behavior, for which I could find no apparent explanation
(please correct me if I am wrong), has a lot of nasty side effects on code
readability, especially when combined with the NeighborQueue, which has a
"reversed" setting of its own.
In addition, it complicates the usage of the following pattern in the HNSW
searchers:
Result Queue -> MIN HEAP
Candidate Queue -> MAX HEAP

The proposal in my Pull Request aims to:

1) make the Euclidean similarity return the score directly, in line with the
other similarities, using the formula currently used to move from distance to score

2) simplify the code, removing the bound checker that's not necessary anymore

3) refactor here and there to be in line with the simplification

4) refactor NeighborQueue to clearly state when it's a MIN_HEAP or MAX_HEAP, so
that debugging is much easier and understanding the HNSW code is much more
intuitive
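
As a rough illustration of the proposal, the sketch below shows a Euclidean similarity that already returns a "higher is closer" score, combined with the result-queue half of the pattern (a min-heap that evicts the weakest hit). It is plain Java written for this message, not the code in the Pull Request: the class and method names are hypothetical, and the formula 1 / (1 + squared distance) is assumed to be the distance-to-score mapping referred to above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only; names are hypothetical and this is not Lucene's implementation. */
final class EuclideanScoreSketch {

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Maps the squared distance to a score in (0, 1]; higher means closer. */
  static float score(float[] a, float[] b) {
    return 1f / (1f + squaredDistance(a, b));
  }

  /** Returns the ids of the k best-scoring docs, best first, using a min-heap of results. */
  static List<Integer> topK(float[] query, float[][] docs, int k) {
    float[] scores = new float[docs.length];
    // Min-heap on score: the head is always the weakest result collected so far.
    PriorityQueue<Integer> results =
        new PriorityQueue<>((x, y) -> Float.compare(scores[x], scores[y]));
    for (int id = 0; id < docs.length; id++) {
      scores[id] = score(query, docs[id]);
      results.add(id);
      if (results.size() > k) {
        results.poll(); // evict the lowest-scoring doc
      }
    }
    List<Integer> best = new ArrayList<>();
    while (!results.isEmpty()) {
      best.add(0, results.poll()); // heap yields worst first, so prepend
    }
    return best;
  }
}
```

With every similarity following the same convention, the queue code needs no "reversed" special case: the result queue is always a min-heap on score, and a candidate queue would always be a max-heap.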







[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks

2022-05-26 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542669#comment-17542669
 ] 

Alessandro Benedetti commented on LUCENE-10510:
---

[~dweiss] Your help has been pure gold, thank you very much!!

I had to delete the gradle.properties and run ./gradlew tidy twice.
The first time I got the error again, and the second time it went OK.

Should we document that more clearly?
Do you know why this happens?
The sentence "occasionally you may have to manually delete (or move) this
file and regenerate from scratch." didn't catch my attention.







[GitHub] [lucene] alessandrobenedetti opened a new pull request, #926: Neigbour queue reversed

2022-05-26 Thread GitBox


alessandrobenedetti opened a new pull request, #926:
URL: https://github.com/apache/lucene/pull/926

   (https://issues.apache.org/jira/browse/LUCENE-10593)
   
   The org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity
behaves in the opposite way to the other similarities:
   a higher similarity value means a greater distance. For this reason it has
been marked as "reversed", and a function is present to map from the similarity
to a score (where higher means closer, as in all other similarities).
   This counterintuitive behavior, for which I could find no apparent
explanation (please correct me if I am wrong), has a lot of nasty side effects
on code readability, especially when combined with the NeighborQueue, which has
a "reversed" setting of its own.
   In addition, it complicates the usage of the following pattern in the HNSW
searchers:
   Result Queue -> MIN HEAP
   Candidate Queue -> MAX HEAP
   The proposal in my Pull Request aims to:
   1) make the Euclidean similarity return the score directly, in line with the
other similarities, using the formula currently used to move from distance to score
   2) simplify the code, removing the bound checker that's not necessary anymore
   3) refactor here and there to be in line with the simplification
   4) refactor NeighborQueue to clearly state when it's a MIN_HEAP or
MAX_HEAP, so that debugging is much easier and understanding the HNSW code is
much more intuitive





[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-26 Thread GitBox


alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883088833


##
lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java:
##
@@ -193,25 +204,36 @@ public void testAdvanceShallow() throws IOException {
       }
       try (IndexReader reader = DirectoryReader.open(d)) {
         IndexSearcher searcher = new IndexSearcher(reader);
-        KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3);
+        KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 1}, 3);
         Query dasq = query.rewrite(reader);
         Scorer scorer =
             dasq.createWeight(searcher, ScoreMode.COMPLETE, 1).scorer(reader.leaves().get(0));
         // before advancing the iterator
-        assertEquals(1, scorer.advanceShallow(0));
+        assertEquals(0, scorer.advanceShallow(0));
         assertEquals(1, scorer.advanceShallow(1));
         assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10));
 
         // after advancing the iterator
         scorer.iterator().advance(2);
         assertEquals(2, scorer.advanceShallow(0));
+        assertEquals(2, scorer.advanceShallow(1));
         assertEquals(2, scorer.advanceShallow(2));
-        assertEquals(3, scorer.advanceShallow(3));
         assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10));
       }
     }
   }
 
+  /**
+   * Query = (0.5, 1)
+   * Doc0 = (0, 0) 1 / (l2distance + 1) from query = 0.444
+   * Doc1 = (1, 1) 1 / (l2distance + 1) from query = 0.8
+   * Doc2 = (2, 2) 1 / (l2distance + 1) from query = 0.235
+   * Doc3 = (3, 3) 1 / (l2distance + 1) from query = 0.089
+   * Doc4 = (4, 4) 1 / (l2distance + 1) from query = 0.045
+   * 
+   * The expected TOP 3 = [Doc1, Doc0, Doc2]
+   * @throws IOException
+   */

Review Comment:
   The original test created multiple documents at the same distance from the
query vector. I saw inconsistent, non-deterministic behavior (probably caused
by the graph construction and search).
   I added a clear example with well-defined, distinct distances, and all looks
good.
   But let me know if you want me to investigate it more.
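
   The ties can be checked by hand. The arithmetic below assumes, as the Javadoc in the diff suggests, that document j holds the vector (j, j) and that the score is 1 / (1 + squared Euclidean distance), which is the reading that reproduces the values listed there.

```latex
\[
  s(q, d) = \frac{1}{1 + \lVert q - d \rVert_2^2}
\]

% Old query q = (2, 3): two pairs of documents tie.
\begin{align*}
  d_2 = (2,2) &: \quad \lVert q - d_2 \rVert_2^2 = 0^2 + 1^2 = 1, & s &= 1/2 = 0.5 \\
  d_3 = (3,3) &: \quad \lVert q - d_3 \rVert_2^2 = 1^2 + 0^2 = 1, & s &= 1/2 = 0.5 \\
  d_1 = (1,1) &: \quad \lVert q - d_1 \rVert_2^2 = 1^2 + 2^2 = 5, & s &= 1/6 \approx 0.167 \\
  d_4 = (4,4) &: \quad \lVert q - d_4 \rVert_2^2 = 2^2 + 1^2 = 5, & s &= 1/6 \approx 0.167
\end{align*}

% New query q = (0.5, 1): all scores are distinct.
\begin{align*}
  d_1 &: \quad 0.5^2 + 0^2 = 0.25, & s &= 1/1.25 = 0.8 \\
  d_0 &: \quad 0.5^2 + 1^2 = 1.25, & s &= 1/2.25 \approx 0.444 \\
  d_2 &: \quad 1.5^2 + 1^2 = 3.25, & s &= 1/4.25 \approx 0.235
\end{align*}
```

   With the old query, Doc2 and Doc3 (and likewise Doc1 and Doc4) score the same, so the order of the hits depends on how ties happen to be broken; with the new query the top 3 are unambiguous.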






[jira] [Commented] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-05-26 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542677#comment-17542677
 ] 

Alessandro Benedetti commented on LUCENE-10593:
---

https://github.com/apache/lucene/pull/926 has been opened, [~sokolov], 
[~mayya], [~julietibs] [~jpountz] feel free to review

> VectorSimilarityFunction reverse removal
> 
>
> Key: LUCENE-10593
> URL: https://issues.apache.org/jira/browse/LUCENE-10593
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
>  Labels: vector-based-search
>






[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator

2022-05-26 Thread Denilson Amorim (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542717#comment-17542717
 ] 

Denilson Amorim commented on LUCENE-8806:
-

I was curious about the status of this issue. Do the benchmarks posted above
already take merging of impacts for phrase queries into account? That is, is
two-phase iteration still not a gain for WAND at this time?

> WANDScorer should support two-phase iterator
> 
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
> should leverage two-phase iterators in order to be faster when used in 
> conjunctions.






[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks

2022-05-26 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542778#comment-17542778
 ] 

Dawid Weiss commented on LUCENE-10510:
--

This is caused by google formatter accessing JVM internals. The first tidy 
failure tries to actually explain why it's failed - this is the message you 
were getting:
{code}
* What went wrong:
Execution failed for task ':checkJdkInternalsExportedToGradle'.
> Certain gradle tasks and plugins require access to jdk.compiler internals, 
> your gradle.properties might have just been generated or could be out of sync 
> (see help/localSettings.txt)
{code}

I'm not sure what can be improved here but feel free to suggest something to 
your liking!



