date:20220527

[GitHub] [lucene] Deepika0510 opened a new pull request, #927: Adding Timeout Support to IndexSearcher

2022-05-27 Thread GitBox



Deepika0510 opened a new pull request, #927:
URL: https://github.com/apache/lucene/pull/927

   ### Problem Statement
   Adding optional "timeout" capabilities to `IndexSearcher`. This would enable 
users to (optionally) specify a maximum time budget for search execution. If 
the search "times out", partial results would be available. Original Jira 
[issue](https://issues.apache.org/jira/browse/LUCENE-10151).
   ### Approach
   Created `TimeLimitingBulkscorer` extended from `BulkScorer` class to wrap 
methods with timeout and delegates the calls to underlying `BulkScorer`. Also 
created `TestTimeLimitingBulkScorer`. Link for the 
[Jira](https://issues.apache.org/jira/browse/LUCENE-10544) where all the 
approaches have been discussed.
   ### Note
   This implementation is in progress and the current PR consists of basic 
search functionality with timeout value as additional argument.
   ### Checklist
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   
   - [x] I have raised the PR against the main branch.
   
   - [x] I have run ./gradlew check.
   
   - [x] I have added tests for my changes.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] msokolov commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



msokolov commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883626055


##
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##
@@ -47,63 +42,30 @@ public float convertToScore(float similarity) {
   DOT_PRODUCT {
 @Override
 public float compare(float[] v1, float[] v2) {
-  return dotProduct(v1, v2);
-}
-
-@Override
-public float convertToScore(float similarity) {
-  return (1 + similarity) / 2;
+  return (1 + dotProduct(v1, v2)) / 2;
 }
   },
 
   /**
* Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all
* vectors to unit length, and instead use {@link 
VectorSimilarityFunction#DOT_PRODUCT}. You
* should only use this function if you need to preserve the original 
vectors and cannot normalize
-   * them in advance.
+   * them in advance. The similarity score is normalised to assure it is 
positive.
*/
   COSINE {
 @Override
 public float compare(float[] v1, float[] v2) {
-  return cosine(v1, v2);
-}
-
-@Override
-public float convertToScore(float similarity) {
-  return (1 + similarity) / 2;
+  return (1 + cosine(v1, v2)) / 2;
 }
   };
 
   /**
-   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
-   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
-   * represent more similar vectors, and scores may be negative or positive.
-   */
-  public final boolean reversed;
-
-  VectorSimilarityFunction(boolean reversed) {
-this.reversed = reversed;
-  }
-
-  VectorSimilarityFunction() {
-reversed = false;
-  }
-
-  /**
-   * Calculates a similarity score between the two vectors with a specified 
function.
+   * Calculates a similarity score between the two vectors with a specified 
function. Highest
+   * similarity score, means always closer vectors.

Review Comment:
   grammar nit: "Higher similarity scores correspond to closer vectors" seems 
more idiomatic to me



##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -56,9 +56,9 @@ long apply(long v) {
   // Whether the search stopped early because it reached the visited nodes 
limit
   private boolean incomplete;
 
-  public NeighborQueue(int initialSize, boolean reversed) {
+  public NeighborQueue(int initialSize, boolean descOrder) {

Review Comment:
   for consistency, perhaps call the parameter `maxHeap`?



##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -109,12 +117,15 @@ public int[] nodes() {
 
   /** Returns the top element's node id. */
   public int topNode() {
-return (int) order.apply(heap.top());
+return decodeNodeId(heap.top());
   }
 
-  /** Returns the top element's node score. */
-  public float topScore() {
-return NumericUtils.sortableIntToFloat((int) (order.apply(heap.top()) >> 
32));
+  /**
+   * Returns the top element's node score. For the min heap this is the 
minimum score. For the max

Review Comment:
   Is there a reason for renaming this method? It seems unneccessary. We could 
just change the javadoc to say "returns the top element's score"?



##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) 
throws IOException {
 for (int i = neighbors.size() - 1; i > 0; i--) {
   int cNode = neighbors.node[i];
   float[] cVector = vectorValues.vectorValue(cNode);
-  bound.set(neighbors.score[i]);
+  minAcceptedSimilarity = neighbors.score[i];
   // check the candidate against its better-scoring neighbors
   for (int j = i - 1; j >= 0; j--) {
-float diversityCheck =
+float buildVectoreSimilarity =

Review Comment:
   typo - "Vectore" should read "Vector"



##
lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java:
##
@@ -193,25 +204,36 @@ public void testAdvanceShallow() throws IOException {
   }
   try (IndexReader reader = DirectoryReader.open(d)) {
 IndexSearcher searcher = new IndexSearcher(reader);
-KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 
3);
+KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 
1}, 3);
 Query dasq = query.rewrite(reader);
 Scorer scorer =
 dasq.createWeight(searcher, ScoreMode.COMPLETE, 
1).scorer(reader.leaves().get(0));
 // before advancing the iterator
-assertEquals(1, scorer.advanceShallow(0));
+assertEquals(0, scorer.advanceShallow(0));
 assertEquals(1, scorer.advanceShallow(1));
 assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10));
 
 // after advancing the iterator
 scorer.iterator().advance(2);
 assertEquals(2, scorer.advanceShallow(0));
+

[GitHub] [lucene] mocobeta closed pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

2022-05-27 Thread GitBox



mocobeta closed pull request #923: LUCENE-10200: Correct outdated instruction 
in the demo tutorial
URL: https://github.com/apache/lucene/pull/923


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883687821


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -56,9 +56,9 @@ long apply(long v) {
   // Whether the search stopped early because it reached the visited nodes 
limit
   private boolean incomplete;
 
-  public NeighborQueue(int initialSize, boolean reversed) {
+  public NeighborQueue(int initialSize, boolean descOrder) {

Review Comment:
   I originally called it "maxHeap" in a previous commit and then moved to 
"descOrder" to be in line with the NeighborArray constructor.
   A feedback I had from a colleague of mine was: 
   "externally I may be interested only in building it specifying descOrder or 
ascOrder, then internally and per debug purposes this is mapped to MIN_HEAP or 
MAX_HEAP"
   
   But anyway, if you/community believes it is better "maxHeap" I'll revert 
that bit!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883689413


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -109,12 +117,15 @@ public int[] nodes() {
 
   /** Returns the top element's node id. */
   public int topNode() {
-return (int) order.apply(heap.top());
+return decodeNodeId(heap.top());
   }
 
-  /** Returns the top element's node score. */
-  public float topScore() {
-return NumericUtils.sortableIntToFloat((int) (order.apply(heap.top()) >> 
32));
+  /**
+   * Returns the top element's node score. For the min heap this is the 
minimum score. For the max

Review Comment:
   maybe, but when i read topScore, I always think "maxScore", but actually it 
is the topNode score, which means min or max depending on the heap.
   My preference is to rename it to topNodeScore, but it's not a super-strong 
blocking opinion :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883695653


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) 
throws IOException {
 for (int i = neighbors.size() - 1; i > 0; i--) {
   int cNode = neighbors.node[i];
   float[] cVector = vectorValues.vectorValue(cNode);
-  bound.set(neighbors.score[i]);
+  minAcceptedSimilarity = neighbors.score[i];
   // check the candidate against its better-scoring neighbors
   for (int j = i - 1; j >= 0; j--) {
-float diversityCheck =
+float buildVectoreSimilarity =

Review Comment:
   fixed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883696044


##
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##
@@ -47,63 +42,30 @@ public float convertToScore(float similarity) {
   DOT_PRODUCT {
 @Override
 public float compare(float[] v1, float[] v2) {
-  return dotProduct(v1, v2);
-}
-
-@Override
-public float convertToScore(float similarity) {
-  return (1 + similarity) / 2;
+  return (1 + dotProduct(v1, v2)) / 2;
 }
   },
 
   /**
* Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all
* vectors to unit length, and instead use {@link 
VectorSimilarityFunction#DOT_PRODUCT}. You
* should only use this function if you need to preserve the original 
vectors and cannot normalize
-   * them in advance.
+   * them in advance. The similarity score is normalised to assure it is 
positive.
*/
   COSINE {
 @Override
 public float compare(float[] v1, float[] v2) {
-  return cosine(v1, v2);
-}
-
-@Override
-public float convertToScore(float similarity) {
-  return (1 + similarity) / 2;
+  return (1 + cosine(v1, v2)) / 2;
 }
   };
 
   /**
-   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
-   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
-   * represent more similar vectors, and scores may be negative or positive.
-   */
-  public final boolean reversed;
-
-  VectorSimilarityFunction(boolean reversed) {
-this.reversed = reversed;
-  }
-
-  VectorSimilarityFunction() {
-reversed = false;
-  }
-
-  /**
-   * Calculates a similarity score between the two vectors with a specified 
function.
+   * Calculates a similarity score between the two vectors with a specified 
function. Highest
+   * similarity score, means always closer vectors.

Review Comment:
   fixed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883696690


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -109,12 +117,15 @@ public int[] nodes() {
 
   /** Returns the top element's node id. */
   public int topNode() {
-return (int) order.apply(heap.top());
+return decodeNodeId(heap.top());
   }
 
-  /** Returns the top element's node score. */
-  public float topScore() {
-return NumericUtils.sortableIntToFloat((int) (order.apply(heap.top()) >> 
32));
+  /**
+   * Returns the top element's node score. For the min heap this is the 
minimum score. For the max

Review Comment:
   In relation to the test I'll take a look and let you know!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883697041


##
lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java:
##
@@ -193,25 +204,36 @@ public void testAdvanceShallow() throws IOException {
   }
   try (IndexReader reader = DirectoryReader.open(d)) {
 IndexSearcher searcher = new IndexSearcher(reader);
-KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 
3);
+KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 
1}, 3);
 Query dasq = query.rewrite(reader);
 Scorer scorer =
 dasq.createWeight(searcher, ScoreMode.COMPLETE, 
1).scorer(reader.leaves().get(0));
 // before advancing the iterator
-assertEquals(1, scorer.advanceShallow(0));
+assertEquals(0, scorer.advanceShallow(0));
 assertEquals(1, scorer.advanceShallow(1));
 assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10));
 
 // after advancing the iterator
 scorer.iterator().advance(2);
 assertEquals(2, scorer.advanceShallow(0));
+assertEquals(2, scorer.advanceShallow(1));
 assertEquals(2, scorer.advanceShallow(2));
-assertEquals(3, scorer.advanceShallow(3));
 assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10));
   }
 }
   }
 
+  /**
+   * Query = (0.5, 1)
+   * Doc0 = (0, 0) 1 / (l2distance + 1) from query = 0.444
+   * Doc1 = (1, 1) 1 / (l2distance + 1) from query = 0.8
+   * Doc2 = (2, 2) 1 / (l2distance + 1) from query = 0.235
+   * Doc3 = (3, 3) 1 / (l2distance + 1) from query = 0.089
+   * Doc4 = (4, 4) 1 / (l2distance + 1) from query = 0.045
+   * 
+   * The expected TOP 3 = [Doc1, Doc0, Doc2]
+   * @throws IOException
+   */

Review Comment:
   I'll check it and let you know!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-27 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542982#comment-17542982
 ] 

Tomoko Uchida commented on LUCENE-10562:


Hi,
I would also recommend you not to use wildcard query, and reconsider the 
analysis though... if you can't or don't want to reindex the whole data and the 
"filter" part returns a really small result (~100, maybe?), you could consider 
a two-stage search? For example, run the first stage search to get the filtered 
results with 'status':'open', then index the results into an on-the-fly, 
ephemeral index backed by ByteBufferesDirectory, then perform the second search 
to get the final result you want. I'm not sure it works well for your 
data/application and there is apparently performance overhead, just remember 
it's occasionally useful to implement such a two-phase strategy (for me).

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

2022-05-27 Thread GitBox



msokolov commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1139740650

   I don't understand what happened here - I'm truly sorry if I somehow 
(bikeshedding?) discouraged you from improving the instructions here; if it 
wasn't clear: I thought what you had proposed was an improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

2022-05-27 Thread GitBox



mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1139813850

   I was just unsure where to go for improving it. There will be a better way 
that does not overwhelm the readers of this tutorial with unfalimiar concepts 
or too wordy descriptions, I haven't found it yet though. Maybe, I'll open 
another one sometime.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



mayya-sharipova commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883911186


##
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##
@@ -47,63 +42,30 @@ public float convertToScore(float similarity) {
   DOT_PRODUCT {
 @Override
 public float compare(float[] v1, float[] v2) {
-  return dotProduct(v1, v2);
-}
-
-@Override
-public float convertToScore(float similarity) {
-  return (1 + similarity) / 2;
+  return (1 + dotProduct(v1, v2)) / 2;
 }
   },
 
   /**
* Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all
* vectors to unit length, and instead use {@link 
VectorSimilarityFunction#DOT_PRODUCT}. You
* should only use this function if you need to preserve the original 
vectors and cannot normalize
-   * them in advance.
+   * them in advance. The similarity score is normalised to assure it is 
positive.
*/
   COSINE {
 @Override
 public float compare(float[] v1, float[] v2) {
-  return cosine(v1, v2);
-}
-
-@Override
-public float convertToScore(float similarity) {
-  return (1 + similarity) / 2;
+  return (1 + cosine(v1, v2)) / 2;
 }
   };
 
   /**
-   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
-   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
-   * represent more similar vectors, and scores may be negative or positive.
-   */
-  public final boolean reversed;
-
-  VectorSimilarityFunction(boolean reversed) {
-this.reversed = reversed;
-  }
-
-  VectorSimilarityFunction() {
-reversed = false;
-  }
-
-  /**
-   * Calculates a similarity score between the two vectors with a specified 
function.
+   * Calculates a similarity score between the two vectors with a specified 
function. Higher
+   * similarity scores correspond to closer vectors.
*
* @param v1 a vector
* @param v2 another vector, of the same dimension
* @return the value of the similarity function applied to the two vectors
*/
   public abstract float compare(float[] v1, float[] v2);

Review Comment:
   may be instead of `compare` we should call this function `score`?



##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) 
throws IOException {
 for (int i = neighbors.size() - 1; i > 0; i--) {
   int cNode = neighbors.node[i];
   float[] cVector = vectorValues.vectorValue(cNode);
-  bound.set(neighbors.score[i]);
+  minAcceptedSimilarity = neighbors.score[i];

Review Comment:
   Similarly here, we don't need to use the global variable of 
`minAcceptedSimilarity`, it should be local for this function.



##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -56,9 +56,9 @@ long apply(long v) {
   // Whether the search stopped early because it reached the visited nodes 
limit
   private boolean incomplete;
 
-  public NeighborQueue(int initialSize, boolean reversed) {
+  public NeighborQueue(int initialSize, boolean descOrder) {

Review Comment:
   I am also in favour of calling this parameter "maxHeap", first it consistent 
with other variables in the class, and for me personally it is easier to think 
if I want max_heap or min_heap for the current case.



##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -155,14 +155,14 @@ private NeighborQueue searchLevel(
 
 // A bound that holds the minimum similarity to the query vector that a 
candidate vector must
 // have to be considered.
-BoundsChecker bound = BoundsChecker.create(similarityFunction.reversed);
+float minAcceptedSimilarity = Float.NEGATIVE_INFINITY;
 if (results.size() >= topK) {
-  bound.set(results.topScore());
+  minAcceptedSimilarity = results.topNodeScore();
 }
 while (candidates.size() > 0 && results.incomplete() == false) {
   // get the best candidate (closest or best scoring)
-  float topCandidateScore = candidates.topScore();
-  if (bound.check(topCandidateScore)) {
+  float topCandidateSimilarity = candidates.topNodeScore();

Review Comment:
   Now with your changes, these are really scores, so I think it is probably 
worth to keep old names such as `topCandidateScore`



##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswGraphBuilder.java:
##
@@ -51,7 +50,7 @@ public final class Lucene90HnswGraphBuilder {
   private final VectorSimilarityFunction similarityFunction;
   private final RandomAccessVectorValues vectorValues;
   private final SplittableRandom random;
-  private final BoundsChecker bound;
+  private float minAcceptedSimilarity = Float.POSITIVE_INFINITY;

Review Comment:
   I think our idea for old codec

[GitHub] [lucene] dweiss commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

2022-05-27 Thread GitBox



dweiss commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-113873

   Umm.. I agree with Mike, Tomoko - the improvement was there and evident. Can 
we commit it in and maybe reiterate if we find a better wording in the future? 
What you did was way better than it had been before.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883955103


##
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##
@@ -47,63 +42,30 @@ public float convertToScore(float similarity) {
   DOT_PRODUCT {
 @Override
 public float compare(float[] v1, float[] v2) {
-  return dotProduct(v1, v2);
-}
-
-@Override
-public float convertToScore(float similarity) {
-  return (1 + similarity) / 2;
+  return (1 + dotProduct(v1, v2)) / 2;
 }
   },
 
   /**
* Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all
* vectors to unit length, and instead use {@link 
VectorSimilarityFunction#DOT_PRODUCT}. You
* should only use this function if you need to preserve the original 
vectors and cannot normalize
-   * them in advance.
+   * them in advance. The similarity score is normalised to assure it is 
positive.
*/
   COSINE {
 @Override
 public float compare(float[] v1, float[] v2) {
-  return cosine(v1, v2);
-}
-
-@Override
-public float convertToScore(float similarity) {
-  return (1 + similarity) / 2;
+  return (1 + cosine(v1, v2)) / 2;
 }
   };
 
   /**
-   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
-   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
-   * represent more similar vectors, and scores may be negative or positive.
-   */
-  public final boolean reversed;
-
-  VectorSimilarityFunction(boolean reversed) {
-this.reversed = reversed;
-  }
-
-  VectorSimilarityFunction() {
-reversed = false;
-  }
-
-  /**
-   * Calculates a similarity score between the two vectors with a specified 
function.
+   * Calculates a similarity score between the two vectors with a specified 
function. Higher
+   * similarity scores correspond to closer vectors.
*
* @param v1 a vector
* @param v2 another vector, of the same dimension
* @return the value of the similarity function applied to the two vectors
*/
   public abstract float compare(float[] v1, float[] v2);

Review Comment:
   should we call it 'similarity'?
   if we want to name the method with a verb maybe:
   getSimilarity?
   scoreSimilarity?
   computeSimilarity?
   I think they are almost synonyms here but in general, I think we are using 
the similarity as a score.
   Not a strong opinion anyway, happy to change it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883956265


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -155,14 +155,14 @@ private NeighborQueue searchLevel(
 
 // A bound that holds the minimum similarity to the query vector that a 
candidate vector must
 // have to be considered.
-BoundsChecker bound = BoundsChecker.create(similarityFunction.reversed);
+float minAcceptedSimilarity = Float.NEGATIVE_INFINITY;
 if (results.size() >= topK) {
-  bound.set(results.topScore());
+  minAcceptedSimilarity = results.topNodeScore();
 }
 while (candidates.size() > 0 && results.incomplete() == false) {
   // get the best candidate (closest or best scoring)
-  float topCandidateScore = candidates.topScore();
-  if (bound.check(topCandidateScore)) {
+  float topCandidateSimilarity = candidates.topNodeScore();

Review Comment:
   it is a similarity score rather than a document score(even if they pretty 
much align).
   Let's decide if naming all of them 'similarity" (as it should be right now) 
or all "scores"
   I am more in favor of 'similarity' as it makes the code a bit more readable, 
not a strong opinion anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883956772


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -245,11 +244,11 @@ private boolean diversityCheck(
   NeighborArray neighbors,
   RandomAccessVectorValues vectorValues)
   throws IOException {
-bound.set(score);
+minAcceptedSimilarity = score;

Review Comment:
   I was initially using it, but I noticed the BoundChecker was global, so I 
thought using the min check locally could cause bugs, if you are sure local is 
just fine I'll do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883957300


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswGraphBuilder.java:
##
@@ -51,7 +50,7 @@ public final class Lucene90HnswGraphBuilder {
   private final VectorSimilarityFunction similarityFunction;
   private final RandomAccessVectorValues vectorValues;
   private final SplittableRandom random;
-  private final BoundsChecker bound;
+  private float minAcceptedSimilarity = Float.POSITIVE_INFINITY;

Review Comment:
   Ok, didn't know!
   waiting for @msokolov  and others' confirmation then, happy to do the change 
once we all agree!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-05-27 Thread GitBox



alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883958913


##
lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java:
##
@@ -193,25 +208,40 @@ public void testAdvanceShallow() throws IOException {
   }
   try (IndexReader reader = DirectoryReader.open(d)) {
 IndexSearcher searcher = new IndexSearcher(reader);
-KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 
3);
+KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 
1}, 3);

Review Comment:
   Will double check, @msokolov confirmed it was done on purpose so I may just 
revert it and investigate it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] vigyasharma commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-27 Thread GitBox



vigyasharma commented on PR #633:
URL: https://github.com/apache/lucene/pull/633#issuecomment-1140109855

   @mikemccand Thanks for beasting it and uncovering this issue! 
   I tried to do the same on my box but haven't had any luck so far. Any ideas 
on how I can make it more likely to happen? How many cores (and threads) did 
you run this on?
   
   Side Note: it might be helpful to have a gradle parameter for thread count 
for concurrency tests. Maybe something that goes with the 
[Concurrent](https://issues.apache.org/jira/browse/LUCENE-10566) test group? 
I'm not sure if there's one for this already (the test seems hardcoded to `int 
numThreads = TEST_NIGHTLY ? 5 : 2;`.
   
   I tried reading through code to figure this out. I suspect this is because 
the `synchronized AddIndexesMergeSource::getNextMerge()` function is really 
locking on the inner class object, and not on IndexWriter. This works fine for 
`pendingAddIndexesMerges` because that is an inner class object, but probably 
hits race conditions for `runningMerges` which belongs to `IndexWriter`.
   
   I could either wrap the runningMerges update with a `synchronized 
(IndexWriter.this) {}`, or make `runningMerges` a `synchronizedSet`. I like the 
second approach as it automatically fixes this at all other places. What do you 
think?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

2022-05-27 Thread GitBox



mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1140176391

   I'm unsure if the change improves it. Thinking calmly there are two 
questions in the change here.
   1. The start commands for the demo app are no longer platform agnostic.
   - The old one works on any platform since it does not include any file 
paths in it - it picks CLASSPATH environment variable.
   2. Put all jars in a directory into classpath with `*` is not great.
   - Old one says "Put all four of these files in your Java CLASSPATH." - 
yes it's the correct way to encourage people to find necessary jars and 
construct classpath on their own, except for the list of jars is obsoleted.
   
   As we saw things seem not so obvious as I first thought, I would postpone it 
for now; anyway it does not a big deal I think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

2022-05-27 Thread GitBox



mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1140180971

   Also thank you both for your comments - at least I learned the subtle 
difficulties related to launching commands (again).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] Deepika0510 opened a new pull request, #927: Adding Timeout Support to IndexSearcher

[GitHub] [lucene] msokolov commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] mocobeta closed pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

[GitHub] [lucene] msokolov commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] dweiss commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

[GitHub] [lucene] vigyasharma commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial

22 matches

Site Navigation

Mail list logo

Footer information