[GitHub] [lucene] mayya-sharipova commented on issue #11769: TestKnnVectorQuery.testScoreEuclidean fails

2022-09-19 Thread GitBox


mayya-sharipova commented on issue #11769:
URL: https://github.com/apache/lucene/issues/11769#issuecomment-1251024321

   Thanks @msokolov.
   
   I ran the above test on main, and it doesn't fail anymore.
   Closing the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova closed issue #11769: TestKnnVectorQuery.testScoreEuclidean fails

2022-09-19 Thread GitBox


mayya-sharipova closed issue #11769: TestKnnVectorQuery.testScoreEuclidean fails
URL: https://github.com/apache/lucene/issues/11769





[GitHub] [lucene] mayya-sharipova commented on pull request #11784: NeighborArray is now fixed size

2022-09-19 Thread GitBox


mayya-sharipova commented on PR #11784:
URL: https://github.com/apache/lucene/pull/11784#issuecomment-1251045793

   @msokolov Thanks for tackling this. I was also thinking of removing 
resizing from `NeighborArray`, which makes the logic simpler.
   
   I was thinking a better approach would be to leave it to `NeighborArray` 
users to define `maxSize`, and not add +1  in the `NeighborArray` class itself 
as this PR suggests. For example, 
[OnHeapHnswGraph](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L62-L66)
 already adds +1 when creating `NeighborArray`.
   
   What do you think?
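
For illustration, a minimal sketch of the suggestion above. `FixedNeighborArray` is a hypothetical, simplified stand-in and not Lucene's `NeighborArray`; it assumes the array never resizes and that the caller alone decides whether to reserve an extra slot.

```java
/** Hypothetical, simplified stand-in for a fixed-size neighbor array. */
final class FixedNeighborArray {
  private final int[] node;
  private final float[] score;
  private int size;

  /** The caller chooses maxSize; no implicit +1 is added here. */
  FixedNeighborArray(int maxSize) {
    node = new int[maxSize];
    score = new float[maxSize];
  }

  /** Appends a neighbor; fails instead of resizing when the array is full. */
  void add(int newNode, float newScore) {
    if (size == node.length) {
      throw new IllegalStateException("neighbor array is full: " + size);
    }
    node[size] = newNode;
    score[size] = newScore;
    size++;
  }

  int size() {
    return size;
  }

  int capacity() {
    return node.length;
  }
}
```

A caller that needs room for one temporary extra neighbor would then write something like `new FixedNeighborArray(maxConn + 1)`, keeping the sizing logic where the required count is known.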
   





[GitHub] [lucene] msokolov commented on issue #11769: TestKnnVectorQuery.testScoreEuclidean fails

2022-09-19 Thread GitBox


msokolov commented on issue #11769:
URL: https://github.com/apache/lucene/issues/11769#issuecomment-1251087019

   I just backported to 9.x and 9_4 branches





[GitHub] [lucene] msokolov commented on pull request #11784: NeighborArray is now fixed size

2022-09-19 Thread GitBox


msokolov commented on PR #11784:
URL: https://github.com/apache/lucene/pull/11784#issuecomment-1251132612

   > I was thinking a better approach would be to leave it to uses of 
NeighborArray to define maxSize, and not add +1 in the NeighborArray class 
itself as this PR suggests
   
   I guess I was thinking that since this class only has a single use, it 
wouldn't matter? But it definitely is better encapsulation to move the sizing 
logic to the place where we know how many we need. +1 to have consumers do it, 
especially since at least in one place they already do :) I'll follow up with a 
patch





[GitHub] [lucene] mayya-sharipova commented on pull request #11781: Diversity check bugfix

2022-09-19 Thread GitBox


mayya-sharipova commented on PR #11781:
URL: https://github.com/apache/lucene/pull/11781#issuecomment-1251139127

   @msokolov Thanks for tackling this.
   
   I ran ANN benchmarks with this change, and I'm happy to confirm that in my test 
recall with this PR is the same as on the 9.3 branch. QPS is lower, but we 
can investigate that later.
   
   
   
   
   **glove-100-angular M:16 efConstruction:100**
   
   | | 9.3 recall |  9.3 QPS | this PR recall | this PR QPS |
   | --- | -: | ---: | -: | --: |
   | n_cands=10  |  0.620 | 2745.933 |  0.620 |1675.500 |
   | n_cands=20  |  0.680 | 2288.665 |  0.680 |1512.744 |
   | n_cands=40  |  0.746 | 1770.243 |  0.746 |1040.240 |
   | n_cands=80  |  0.809 | 1226.738 |  0.809 | 695.236 |
   | n_cands=120 |  0.843 |  948.908 |  0.843 | 525.914 |
   | n_cands=200 |  0.878 |  671.781 |  0.878 | 351.529 |
   | n_cands=400 |  0.918 |  392.265 |  0.918 | 207.854 |
   | n_cands=600 |  0.937 |  282.403 |  0.937 | 144.311 |
   | n_cands=800 |  0.949 |  214.620 |  0.949 | 116.875 |





[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11781: Diversity check bugfix

2022-09-19 Thread GitBox


mayya-sharipova commented on code in PR #11781:
URL: https://github.com/apache/lucene/pull/11781#discussion_r974364476


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -316,49 +316,49 @@ private boolean isDiverse(BytesRef candidate, 
NeighborArray neighbors, float sco
*/
   private int findWorstNonDiverse(NeighborArray neighbors) throws IOException {
 for (int i = neighbors.size() - 1; i > 0; i--) {
-  if (isWorstNonDiverse(i, neighbors, neighbors.score[i])) {
+  if (isWorstNonDiverse(i, neighbors)) {
 return i;
   }
 }
 return neighbors.size() - 1;
   }
 
-  private boolean isWorstNonDiverse(
-  int candidate, NeighborArray neighbors, float minAcceptedSimilarity) 
throws IOException {
+  private boolean isWorstNonDiverse(int candidateIndex, NeighborArray 
neighbors)
+  throws IOException {
+int candidateNode = neighbors.node[candidateIndex];
 return switch (vectorEncoding) {
-  case BYTE -> isWorstNonDiverse(
-  candidate, vectors.binaryValue(candidate), neighbors, 
minAcceptedSimilarity);
+  case BYTE -> isWorstNonDiverse(candidateIndex, 
vectors.binaryValue(candidateNode), neighbors);
   case FLOAT32 -> isWorstNonDiverse(
-  candidate, vectors.vectorValue(candidate), neighbors, 
minAcceptedSimilarity);
+  candidateIndex, vectors.vectorValue(candidateNode), neighbors);
 };
   }
 
   private boolean isWorstNonDiverse(
-  int candidateIndex, float[] candidate, NeighborArray neighbors, float 
minAcceptedSimilarity)
-  throws IOException {
-for (int i = candidateIndex - 1; i > -0; i--) {
+  int candidateIndex, float[] candidateVector, NeighborArray neighbors) 
throws IOException {
+float minAcceptedSimilarity = neighbors.score[candidateIndex];
+for (int i = candidateIndex - 1; i >= 0; i--) {
   float neighborSimilarity =
-  similarityFunction.compare(candidate, 
vectorsCopy.vectorValue(neighbors.node[i]));
-  // node i is too similar to node j given its score relative to the base 
node
+  similarityFunction.compare(candidateVector, 
vectorsCopy.vectorValue(neighbors.node[i]));
+  // candidate node is too similar to node i given its score relative to 
the base node
   if (neighborSimilarity >= minAcceptedSimilarity) {
-return false;
+return true;
   }
 }
-return true;
+return false;
   }
 
   private boolean isWorstNonDiverse(
-  int candidateIndex, BytesRef candidate, NeighborArray neighbors, float 
minAcceptedSimilarity)
-  throws IOException {
-for (int i = candidateIndex - 1; i > -0; i--) {
+  int candidateIndex, BytesRef candidateVector, NeighborArray neighbors) 
throws IOException {

Review Comment:
   I am surprised that with this big change, we had only a small reduction in 
recall. I guess the reason could be that in our tests the diversity check was 
really relevant only for a small number of nodes; in the majority of cases the 
algorithm just eliminated the most distant node.
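
To make the corrected logic in the diff easier to follow, here is a self-contained sketch on plain arrays. It is not the `HnswGraphBuilder` code: the class name, the array-based signature, and the similarity formula `1 / (1 + squaredDistance)` are assumptions chosen for illustration.

```java
/** Hypothetical sketch of the corrected diversity check on plain arrays. */
final class DiversitySketch {
  /** Similarity derived from squared Euclidean distance; higher means closer. */
  static float similarity(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return 1f / (1f + sum);
  }

  /**
   * neighbors are ordered from most to least similar to the base node, and
   * scores[i] is the similarity of neighbors[i] to the base node.
   * Returns the index of the neighbor to drop.
   */
  static int findWorstNonDiverse(float[][] neighbors, float[] scores) {
    for (int i = neighbors.length - 1; i > 0; i--) {
      if (isWorstNonDiverse(i, neighbors, scores)) {
        return i;
      }
    }
    // Every neighbor is diverse: fall back to dropping the farthest one.
    return neighbors.length - 1;
  }

  /** True if the candidate is at least as similar to a better-ranked neighbor as to the base. */
  static boolean isWorstNonDiverse(int candidateIndex, float[][] neighbors, float[] scores) {
    float minAcceptedSimilarity = scores[candidateIndex];
    for (int i = candidateIndex - 1; i >= 0; i--) {
      if (similarity(neighbors[candidateIndex], neighbors[i]) >= minAcceptedSimilarity) {
        return true;
      }
    }
    return false;
  }
}
```

When no neighbor fails the check, the sketch simply drops the last (most distant) entry, which matches the observation that in most cases the most distant node is the one eliminated.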






[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11781: Diversity check bugfix

2022-09-19 Thread GitBox


mayya-sharipova commented on code in PR #11781:
URL: https://github.com/apache/lucene/pull/11781#discussion_r974364724


##
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##
@@ -555,6 +556,78 @@ public void testDiversity() throws IOException {
 assertLevel0Neighbors(builder.hnsw, 5, 1, 4);
   }
 
+  public void testDiversityFallback() throws IOException {
+vectorEncoding = randomVectorEncoding();
+similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
+// Some test cases can't be exercised in two dimensions;
+// in particular if a new neighbor displaces an existing neighbor
+// by being closer to the target, yet none of the existing neighbors is 
closer to the new vector
+// than to the target -- ie they all remain diverse, so we simply drop the 
farthest one.
+float[][] values = {
+  {0, 0, 0},
+  {0, 1, 0},
+  {0, 0, 2},
+  {1, 0, 0},
+  {0, 0.4f, 0}
+};
+MockVectorValues vectors = new MockVectorValues(values);
+// First add nodes until everybody gets a full neighbor list
+HnswGraphBuilder builder =
+HnswGraphBuilder.create(
+vectors, vectorEncoding, similarityFunction, 1, 10, 
random().nextInt());
+// node 0 is added by the builder constructor
+// builder.addGraphNode(vectors.vectorValue(0));
+RandomAccessVectorValues vectorsCopy = vectors.copy();
+builder.addGraphNode(1, vectorsCopy);
+builder.addGraphNode(2, vectorsCopy);
+assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
+// 2 is closer to 0 than 1, so it is excluded as non-diverse
+assertLevel0Neighbors(builder.hnsw, 1, 0);
+// 1 is closer to 0 than 2, so it is excluded as non-diverse
+assertLevel0Neighbors(builder.hnsw, 2, 0);
+
+builder.addGraphNode(3, vectorsCopy);
+// this is one case we are testing; 2 has been displaced by 3

Review Comment:
   nice test!






[GitHub] [lucene] msokolov commented on pull request #11784: NeighborArray is now fixed size

2022-09-19 Thread GitBox


msokolov commented on PR #11784:
URL: https://github.com/apache/lucene/pull/11784#issuecomment-1251146315

   Also -- now that I see this I realize that most likely we are never 
exercising this resize capability, so removing it won't really help performance 
/ memory usage as I was hoping. But it still seems like a good cleanup?





[GitHub] [lucene] iverase opened a new issue, #11785: Improve tessellator performance by delaying calls of`isIntersectingPolygon`

2022-09-19 Thread GitBox


iverase opened a new issue, #11785:
URL: https://github.com/apache/lucene/issues/11785

   ### Description
   
   This method iterates over all the remaining edges of a polygon to check whether 
a given edge intersects any of them. Currently the method is called when curing 
local intersections or splitting the polygon, both of which already iterate over the 
polygon edges, so it is potentially O(n^2) in the number of polygon edges. 
   
   The calls are made inside a big conditional, but currently they are not 
in the last position, so the expensive check runs even when a cheaper condition 
would already decide the outcome. Just moving the call to the last position brings 
a very nice performance improvement. For example, for the polygons shared on 
https://github.com/apache/lucene/issues/11777:
   
   
   [FE-2456.txt](https://github.com/apache/lucene/files/9577391/FE-2456.txt):
   - without change: 542.682 seconds
   - with change: 229.524 seconds
   
   [ORG-24132378.txt](https://github.com/apache/lucene/files/9577398/ORG-24132378.txt):
   - without change: too long; I did not have the patience to let it finish.
   - with change: 1416.57 seconds
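
The effect of moving an expensive call to the end of a conditional can be sketched as follows. This is a generic illustration of short-circuit evaluation, not the tessellator code; `expensiveCheck` and the two cheap flags are hypothetical stand-ins.

```java
/** Hypothetical illustration of reordering a conditional; not the tessellator code. */
final class ShortCircuitSketch {
  /** Counts how often the expensive check actually runs. */
  static int expensiveCalls = 0;

  /** Stand-in for an O(n) scan such as an edge-intersection test. */
  static boolean expensiveCheck(int[] edges) {
    expensiveCalls++;
    for (int e : edges) {
      if (e < 0) {
        return false;
      }
    }
    return true;
  }

  /** Expensive call first: it runs for every input. */
  static boolean expensiveFirst(boolean cheapA, boolean cheapB, int[] edges) {
    return expensiveCheck(edges) && cheapA && cheapB;
  }

  /** Expensive call last: it runs only when both cheap conditions hold. */
  static boolean expensiveLast(boolean cheapA, boolean cheapB, int[] edges) {
    return cheapA && cheapB && expensiveCheck(edges);
  }
}
```

Both variants return the same result as long as the checks have no side effects that matter; only the number of expensive evaluations differs.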





[GitHub] [lucene] iverase opened a new pull request, #11786: Improve tessellator performance by delaying calls to the method #isIntersectingPolygon

2022-09-19 Thread GitBox


iverase opened a new pull request, #11786:
URL: https://github.com/apache/lucene/pull/11786

   See https://github.com/apache/lucene/issues/11785
   
   With this change, the bottleneck of the tessellator moves to the algorithm 
that eliminates holes from the polygon. 
   





[GitHub] [lucene] jpountz commented on issue #11773: Could `PointRangeQuery`'s boundary values used for `NumericComparator` to calculate `estimatedNumberOfMatches`

2022-09-19 Thread GitBox


jpountz commented on issue #11773:
URL: https://github.com/apache/lucene/issues/11773#issuecomment-1251190154

   The `estimatedNumberOfMatches` should still be very close to the actual 
number, so I'm not expecting that a more precise value would change when we 
rebuild the `DocIdSet` of top-k candidates, would it?





[GitHub] [lucene] msokolov commented on a diff in pull request #11781: Diversity check bugfix

2022-09-19 Thread GitBox


msokolov commented on code in PR #11781:
URL: https://github.com/apache/lucene/pull/11781#discussion_r974411586


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -316,49 +316,49 @@ private boolean isDiverse(BytesRef candidate, 
NeighborArray neighbors, float sco
*/
   private int findWorstNonDiverse(NeighborArray neighbors) throws IOException {
 for (int i = neighbors.size() - 1; i > 0; i--) {
-  if (isWorstNonDiverse(i, neighbors, neighbors.score[i])) {
+  if (isWorstNonDiverse(i, neighbors)) {
 return i;
   }
 }
 return neighbors.size() - 1;
   }
 
-  private boolean isWorstNonDiverse(
-  int candidate, NeighborArray neighbors, float minAcceptedSimilarity) 
throws IOException {
+  private boolean isWorstNonDiverse(int candidateIndex, NeighborArray 
neighbors)
+  throws IOException {
+int candidateNode = neighbors.node[candidateIndex];
 return switch (vectorEncoding) {
-  case BYTE -> isWorstNonDiverse(
-  candidate, vectors.binaryValue(candidate), neighbors, 
minAcceptedSimilarity);
+  case BYTE -> isWorstNonDiverse(candidateIndex, 
vectors.binaryValue(candidateNode), neighbors);
   case FLOAT32 -> isWorstNonDiverse(
-  candidate, vectors.vectorValue(candidate), neighbors, 
minAcceptedSimilarity);
+  candidateIndex, vectors.vectorValue(candidateNode), neighbors);
 };
   }
 
   private boolean isWorstNonDiverse(
-  int candidateIndex, float[] candidate, NeighborArray neighbors, float 
minAcceptedSimilarity)
-  throws IOException {
-for (int i = candidateIndex - 1; i > -0; i--) {
+  int candidateIndex, float[] candidateVector, NeighborArray neighbors) 
throws IOException {
+float minAcceptedSimilarity = neighbors.score[candidateIndex];
+for (int i = candidateIndex - 1; i >= 0; i--) {
   float neighborSimilarity =
-  similarityFunction.compare(candidate, 
vectorsCopy.vectorValue(neighbors.node[i]));
-  // node i is too similar to node j given its score relative to the base 
node
+  similarityFunction.compare(candidateVector, 
vectorsCopy.vectorValue(neighbors.node[i]));
+  // candidate node is too similar to node i given its score relative to 
the base node
   if (neighborSimilarity >= minAcceptedSimilarity) {
-return false;
+return true;
   }
 }
-return true;
+return false;
   }
 
   private boolean isWorstNonDiverse(
-  int candidateIndex, BytesRef candidate, NeighborArray neighbors, float 
minAcceptedSimilarity)
-  throws IOException {
-for (int i = candidateIndex - 1; i > -0; i--) {
+  int candidateIndex, BytesRef candidateVector, NeighborArray neighbors) 
throws IOException {

Review Comment:
   I know - how did this garbage even work at all! :frowning_face: It's kind of 
astonishing how insensitive this whole process is to the diversity checking. 
Initially we didn't have it at all though (just always pick the closest 
neighbors), and things still kind of work. Then I had the wonky implementation 
that did not sort the neighbors while indexing, but did some best effort kind 
of thing, and still it mostly worked. So we need good tests here to ensure we 
are doing the right thing! Because bugs here can lead to small degradation.






[GitHub] [lucene] msokolov merged pull request #11781: Diversity check bugfix

2022-09-19 Thread GitBox


msokolov merged PR #11781:
URL: https://github.com/apache/lucene/pull/11781





[GitHub] [lucene] msokolov closed issue #11782: Fix bugs in HNSW diversity check introduced in LUCENE-10577

2022-09-19 Thread GitBox


msokolov closed issue #11782: Fix bugs in HNSW diversity check introduced in 
LUCENE-10577
URL: https://github.com/apache/lucene/issues/11782





[GitHub] [lucene] msokolov commented on issue #11782: Fix bugs in HNSW diversity check introduced in LUCENE-10577

2022-09-19 Thread GitBox


msokolov commented on issue #11782:
URL: https://github.com/apache/lucene/issues/11782#issuecomment-1251216990

   merged #11781 and cherry-picked to `branch_9x` and `branch_9_4`





[GitHub] [lucene] jpountz commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

2022-09-19 Thread GitBox


jpountz commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1251264246

   I got some numbers for write amplification for the case tested in 
`TestTieredMergePolicy#testSimulateUpdates`:
   
   | Allowed percentage of deletes | Write amplification |
   | - | - |
   | 50 (max) | 4.34 |
   | 33 (default) | 4.34 |
   | 20 (min) | 4.68 |
   | 10 | 6.13 |
   | 5 | 8.76 |
   | 4 | 10.31 |
   | 3 | 12.97 |
   | 2 | 18.76 |
   | 1 | 37.89 |
   | 0 | 10779.78 |
   
   Assuming these numbers are representative, maybe we could allow users to 
configure 5% as the allowed percentage of deletes that their indexes may have, 
which translates to ~2x more write amplification compared to the default of 33% 
according to the above numbers.
   
   For reference, the algorithm that `TieredMergePolicy` uses to keep the 
number of deletes under the threshold consists of running the most balanced 
merge (with a small bias towards merges that reclaim more deletes) until the 
number of deletes of the index is under the threshold.
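
The "~2x" figure can be rechecked from the table with a small sketch. The class and method names are hypothetical; the numbers are copied from the table above and are not re-measured.

```java
/** Recomputes relative write amplification from the table above. */
final class WriteAmplificationSketch {
  // Values copied from the table: {allowed percentage of deletes, write amplification}.
  static final double[][] MEASURED = {
    {50, 4.34}, {33, 4.34}, {20, 4.68}, {10, 6.13}, {5, 8.76},
    {4, 10.31}, {3, 12.97}, {2, 18.76}, {1, 37.89}, {0, 10779.78}
  };

  /** Looks up the measured write amplification for a percentage listed in the table. */
  static double amplificationFor(int pct) {
    for (double[] row : MEASURED) {
      if ((int) row[0] == pct) {
        return row[1];
      }
    }
    throw new IllegalArgumentException("no measurement for " + pct + "%");
  }

  /** Write amplification relative to the default of 33% allowed deletes. */
  static double relativeToDefault(int pct) {
    return amplificationFor(pct) / amplificationFor(33);
  }
}
```

With these numbers, `relativeToDefault(5)` is 8.76 / 4.34, roughly 2.02, while `relativeToDefault(1)` is already above 8.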





[GitHub] [lucene] msokolov opened a new issue, #11787: Handle degenerate case where all HNSW search candidates are filtered

2022-09-19 Thread GitBox


msokolov opened a new issue, #11787:
URL: https://github.com/apache/lucene/issues/11787

   ### Description
   
   This test failure reproduces every time. What seems to happen is that we 
search with a filter that retains > 50% of documents yet we hit an unlucky 
condition where the graph is not fully connected and every candidate node we 
visit gets filtered, so we end up with 0 results. It's kind of a degenerate 
case that is pretty unlikely to arise in a real graph, yet it seems we ought to 
have some kind of fallback to exact search for this case.
   
   
   ./gradlew :lucene:core:test --tests 
"org.apache.lucene.search.TestKnnVectorQuery.testFilterWithSameScore" 
-Ptests.jvms=1 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 
-Ptests.seed=C5E04AD69C13E006 -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1
   
   ### Version and environment details
   
   _No response_





[GitHub] [lucene] msokolov merged pull request #11747: update DOAP and releaseWizard to reflect migration to github

2022-09-19 Thread GitBox


msokolov merged PR #11747:
URL: https://github.com/apache/lucene/pull/11747





[GitHub] [lucene] reta opened a new issue, #11788: Upgrade ANTLR to version 4.11.1

2022-09-19 Thread GitBox


reta opened a new issue, #11788:
URL: https://github.com/apache/lucene/issues/11788

   ### Description
   
   Apache Lucene is using a quite old version of ANTLR, 4.5.1-1. By itself, this 
is not a showstopper, but a more profound issue is that some ANTLR 3.x bits are 
used [1]. Since ANTLR 4.10.x (or even earlier), the compatibility layer with the 
`3.x` release line has been dropped in `4.x` (please see [2]), which makes 
Apache Lucene impossible to use with recent ANTLR 4.10.x+ releases [3]. A 
sample exception is below. 
   
   ```
  > java.lang.UnsupportedOperationException: 
java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not 
deserialize ATN with version 3 (expected 4).
  > at 
org.antlr.antlr4.runtime@4.11.1/org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:56)
  > at 
org.antlr.antlr4.runtime@4.11.1/org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:48)
  > at 
org.apache.lucene.expressions@10.0.0-SNAPSHOT/org.apache.lucene.expressions.js.JavascriptLexer.(JavascriptLexer.java:279)
   
   ```
   
   [1] 
https://github.com/apache/lucene/blob/main/lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java#L189
   [2] 
https://github.com/antlr/antlr4/commit/c68e127a7cf14470565d6e6ae1eff06db3e56ea7
   [3] https://github.com/opensearch-project/OpenSearch/pull/4546
   
   @uschindler @jpountz any objections to migrating to ANTLR `4.11.1`? I would 
be happy to offer my help here, thank you.





[GitHub] [lucene] sashashura opened a new pull request, #11789: GitHub Workflows security hardening

2022-09-19 Thread GitBox


sashashura opened a new pull request, #11789:
URL: https://github.com/apache/lucene/pull/11789

   This PR adds explicit [permissions 
section](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions)
 to workflows. This is a security best practice because by default workflows 
run with [extended set of 
permissions](https://docs.github.com/en/actions/security-guides/automatic-token-authentication#permissions-for-the-github_token)
 (except for `on: pull_request` [from external 
forks](https://securitylab.github.com/research/github-actions-preventing-pwn-requests/)).
 By specifying any permission explicitly all others are set to none. By using 
the principle of least privilege the damage a compromised workflow can do 
(because of an 
[injection](https://securitylab.github.com/research/github-actions-untrusted-input/)
 or compromised third party tool or action) is restricted.
   It is recommended to have [most strict permissions on the top 
level](https://github.com/ossf/scorecard/blob/main/docs/checks.md#token-permissions)
 and grant write permissions on [job 
level](https://docs.github.com/en/actions/using-jobs/assigning-permissions-to-jobs)
 case by case.





[GitHub] [lucene] jtibshirani commented on issue #11782: Fix bugs in HNSW diversity check introduced in LUCENE-10577

2022-09-19 Thread GitBox


jtibshirani commented on issue #11782:
URL: https://github.com/apache/lucene/issues/11782#issuecomment-1251626019

   @msokolov a test case started failing regularly after you merged the change. 
Here's an example repro line:
   
   ```
   ./gradlew test --tests TestKnnVectorQuery.testFilterWithSameScore 
-Dtests.seed=1951CEB96E0899ED -Dtests.locale=en-PR 
-Dtests.timezone=Antarctica/South_Pole -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   ```





[GitHub] [lucene] msokolov commented on issue #11782: Fix bugs in HNSW diversity check introduced in LUCENE-10577

2022-09-19 Thread GitBox


msokolov commented on issue #11782:
URL: https://github.com/apache/lucene/issues/11782#issuecomment-1251640822

   Thanks, I had opened https://github.com/apache/lucene/issues/11787. I'm not 
entirely sure this is unexpected? But maybe the graphs have become sparser 
somehow??





[GitHub] [lucene] msokolov commented on issue #11787: Handle degenerate case where all HNSW search candidates are filtered

2022-09-19 Thread GitBox


msokolov commented on issue #11787:
URL: https://github.com/apache/lucene/issues/11787#issuecomment-1251652637

   This test is really testing a pathological case ... when the vectors are all 
the same everything is equidistant from everything else and "nearest neighbor" 
ceases to really even mean anything. I'm not sure we should actually have this 
test other than to verify that there is no crash. Maybe I'm misunderstanding, 
but what is the test really asserting?





[GitHub] [lucene] msokolov opened a new pull request, #11790: Mark HNSW search results incomplete when fewer than topK are found

2022-09-19 Thread GitBox


msokolov opened a new pull request, #11790:
URL: https://github.com/apache/lucene/pull/11790

   This addresses a random test failure that came up recently due to another 
fix. I think this failure exposed a hole in our logic; when a search returns 
fewer results than requested *and we have not explored the entire graph*, we 
should fall back to exhaustive search. This can happen in degenerate cases such 
as the one this test creates.
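
   The fallback rule described above can be sketched roughly as follows. This 
is only an illustration under stated assumptions: `KnnFallbackSketch`, `Hit`, 
`ApproxResult`, `exhaustiveSearch` and `search` are hypothetical names and not 
Lucene APIs, the score is a plain negated squared Euclidean distance, and the 
actual change in the PR may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch; none of these types or methods are Lucene APIs.
class KnnFallbackSketch {

  /** A scored hit: document ordinal plus a score where higher means closer. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /** Output of an approximate (graph) search plus a completeness flag. */
  static final class ApproxResult {
    final List<Hit> hits;
    final boolean exploredEntireGraph;

    ApproxResult(List<Hit> hits, boolean exploredEntireGraph) {
      this.hits = hits;
      this.exploredEntireGraph = exploredEntireGraph;
    }
  }

  /** Negated squared Euclidean distance, so identical vectors score highest. */
  static float score(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return -sum;
  }

  /** Brute-force scan over every accepted document. */
  static List<Hit> exhaustiveSearch(
      float[][] vectors, float[] query, int k, IntPredicate accept) {
    List<Hit> hits = new ArrayList<>();
    for (int doc = 0; doc < vectors.length; doc++) {
      if (accept.test(doc)) {
        hits.add(new Hit(doc, score(vectors[doc], query)));
      }
    }
    // Best score first; ties are broken by the lower document ordinal.
    hits.sort((x, y) -> {
      int c = Float.compare(y.score, x.score);
      return c != 0 ? c : Integer.compare(x.doc, y.doc);
    });
    return new ArrayList<>(hits.subList(0, Math.min(k, hits.size())));
  }

  /**
   * Treats the approximate result as incomplete when it holds fewer than k
   * hits and the graph was not fully explored, and only then rescans.
   */
  static List<Hit> search(
      ApproxResult approx, float[][] vectors, float[] query, int k, IntPredicate accept) {
    boolean incomplete = approx.hits.size() < k && !approx.exploredEntireGraph;
    return incomplete ? exhaustiveSearch(vectors, query, k, accept) : approx.hits;
  }
}
```

   With all-identical vectors and a filter, an approximate pass that stops 
early with a single hit would be flagged incomplete and replaced by the 
exhaustive scan, while a short result from a fully explored graph is returned 
as is.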
   





[GitHub] [lucene] jtibshirani commented on issue #11787: Handle degenerate case where all HNSW search candidates are filtered

2022-09-19 Thread GitBox


jtibshirani commented on issue #11787:
URL: https://github.com/apache/lucene/issues/11787#issuecomment-1251681230

   Thanks for digging into this! I added this test to exercise the tie-breaking 
logic. But now I think it wasn't a good idea -- HNSW is known to exhibit very 
poor performance when vectors are duplicated. And this test takes it to an 
extreme! It's not really a scenario we support well.
   
   Maybe we could just remove this test. It wasn't critical, and I could always 
follow up with a better way to test tie-breaking.





[GitHub] [lucene] jtibshirani commented on issue #11782: Fix bugs in HNSW diversity check introduced in LUCENE-10577

2022-09-19 Thread GitBox


jtibshirani commented on issue #11782:
URL: https://github.com/apache/lucene/issues/11782#issuecomment-1251691399

   Oh oops, I had missed that. I made a comment on the issue.





[GitHub] [lucene] patelprateek opened a new issue, #11791: cardinality estimation for query filters

2022-09-19 Thread GitBox


patelprateek opened a new issue, #11791:
URL: https://github.com/apache/lucene/issues/11791

   ### Description
   
   For large-scale data, query filters can take a long time to execute and can 
return a large result set, on the order of millions of documents. Is there any 
functionality for getting a quick approximate estimate of a filter's 
cardinality that could be used to decide whether to run the query at all? 
   If not, I would like to hear any recommendations or ideas on how we could 
implement or build that functionality.





[GitHub] [lucene] msokolov commented on issue #11787: Handle degenerate case where all HNSW search candidates are filtered

2022-09-19 Thread GitBox


msokolov commented on issue #11787:
URL: https://github.com/apache/lucene/issues/11787#issuecomment-1251697346

   We could keep the test if we did this: 
https://github.com/apache/lucene/pull/11790 which would cause fallback to a 
full scan in this kind of case. It seems like a reasonable fallback to me





[GitHub] [lucene] iverase merged pull request #11786: Improve tessellator performance by delaying calls to the method #isIntersectingPolygon

2022-09-19 Thread GitBox


iverase merged PR #11786:
URL: https://github.com/apache/lucene/pull/11786





[GitHub] [lucene] dweiss commented on pull request #11789: GitHub Workflows security hardening

2022-09-19 Thread GitBox


dweiss commented on PR #11789:
URL: https://github.com/apache/lucene/pull/11789#issuecomment-1251873349

   LGTM.





[GitHub] [lucene] iverase closed issue #11785: Improve tessellator performance by delaying calls of`isIntersectingPolygon`

2022-09-19 Thread GitBox


iverase closed issue #11785: Improve tessellator performance by delaying calls 
of`isIntersectingPolygon`
URL: https://github.com/apache/lucene/issues/11785





[GitHub] [lucene] iverase commented on issue #11785: Improve tessellator performance by delaying calls of`isIntersectingPolygon`

2022-09-19 Thread GitBox


iverase commented on issue #11785:
URL: https://github.com/apache/lucene/issues/11785#issuecomment-1251887364

   closed in https://github.com/apache/lucene/pull/11786

