[PR] Fix TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults [lucene]

2024-05-12 Thread via GitHub


timgrein opened a new pull request, #13361:
URL: https://github.com/apache/lucene/pull/13361

   ## Closes https://github.com/apache/lucene/issues/13210
   
   ### Description
   
   The following test failed because it produced two different lists of ids for a 
sorted and an unsorted HNSW byte vector graph:
   `gradlew test --tests 
TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults 
-Dtests.seed=B41BEC5619361A16 -Dtests.locale=hi-IN 
-Dtests.timezone=Atlantic/Stanley -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8`
   
   Because the HNSW graphs of the two indices are organized differently, the 
search needs to explore more candidates (`k`) before both return the same result 
set. Increasing `k` from `60` to `80` fixes the test; a sketch of the idea 
follows below.
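   
   Below is a minimal illustration of why a larger `k` makes the sorted and 
unsorted indices agree. This is not the test's actual code; the field names, the 
two searchers and the stored `id` field are assumptions:
   
   ```java
   import java.io.IOException;
   import java.util.HashSet;
   import java.util.Set;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.KnnByteVectorQuery;
   import org.apache.lucene.search.ScoreDoc;
   import org.apache.lucene.search.TopDocs;
   
   // Sketch only: run the same byte-vector kNN query against the sorted and the
   // unsorted index and compare hits via a stored "id" field (internal doc ids
   // differ between the two indices). With a small k the differently organized
   // HNSW graphs may disagree on the tail of the result list; a larger k gives
   // both searches enough candidates to converge on the same set.
   static Set<String> topIds(IndexSearcher searcher, byte[] target, int k) throws IOException {
     TopDocs topDocs = searcher.search(new KnnByteVectorQuery("vector", target, k), k);
     Set<String> ids = new HashSet<>();
     for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
       ids.add(searcher.storedFields().document(scoreDoc.doc).get("id"));
     }
     return ids;
   }
   
   // e.g. assertEquals(topIds(sortedSearcher, target, 80), topIds(unsortedSearcher, target, 80));
   ```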





Re: [I] HnwsGraph creates disconnected components [lucene]

2024-05-12 Thread via GitHub


xwt1 commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-2106247607

   > Hi @benwtrent,
   > 
   > I left Amazon, but before leaving I was able to run some tests with an open
   > dataset and also with an Amazon dataset. I cannot share a whole lot of detail
   > about the Amazon dataset, but some aggregated results are shown below. Rest
   > assured, I am picking this issue up as a priority now.
   > 
   > I ran tests trying out different heuristics. First, the Lucene `default`
   > diversity check, which produces disconnected nodes even with 50K nodes of the
   > minilm dataset. I then tried different variations as suggested in the HNSW
   > paper. The `extended candidates` approach reduced the disconnectedness but
   > did not eliminate it completely, and it increased the indexing time many
   > times over. Next, the `keepPruned candidates` approach, keeping up to
   > max-conn connections, increased the number of disconnected nodes, so I tried
   > `keepPruned candidates` keeping pruned connections only up to max-conn/2;
   > this also showed no improvement. Next, I tried the newly proposed heuristic
   > but without the component that removes the two-way connections, i.e. the new
   > heuristic with just one side of the duplex connection removed. Interestingly,
   > this also did not change the disconnectedness, which surprised me a bit. Then
   > I tried the `new heuristic with remove-otherhalf`, i.e. removing the two-way
   > connections completely. This removed the disconnectedness entirely: the
   > number of disconnected nodes at all levels was zero. Unfortunately, it also
   > means that the edges at some nodes can grow beyond max-conn. I did not get a
   > chance to measure the counts and distribution of total connections at the
   > nodes that exceed max-conn, but I assume this is something we may not want in
   > Lucene. So I thought that removing duplex edges (i.e. the remove-otherhalf
   > step) might be the key to decreasing disconnectedness, and tried both the
   > `default` algo and `keep pruned` with `remove-otherhalf`. Interestingly,
   > those two experiments also did not decrease the number of disconnected nodes.
   > That left the `new heuristic with remove-otherhalf` as the best option. To
   > make sure that the total connections per node do not grow beyond max-conn, I
   > modified the algo to remove some random node when it cannot find a best node
   > to remove (`new heuristic with remove otherhalf and honour max-conn`). This
   > kept the overall disconnectedness at zero, but it still showed some
   > disconnected nodes at level1 (see comments) for the minilm dataset. I tried
   > max-conn=8, max-conn=16, num_docs=50K and num_docs=100K. All gave zero
   > overall disconnectedness and zero disconnected nodes at level0, but there
   > were some disconnected nodes in the level1 graph.
   > 
   > Anyway, I also ran the experiment on the Amazon dataset for the `new
   > heuristic with remove otherhalf and honour max-conn`. I had to make code
   > changes again to adapt it to Lucene 9.7. I saw similar results: here too the
   > `default` algorithm gave a lot of disconnected nodes, while the new heuristic
   > gave zero disconnected nodes at level0 and zero overall disconnected nodes,
   > but at level1, and sometimes at level2, there were some disconnected nodes.
   > I am more confident in the new heuristic now, but I won't mind running more
   > tests and perf tests.
   > 
   > PR with all heuristics: https://github.com/apache/lucene/pull/12783/commits
   > 
   > max-conn=16 & num_docs = 50K (unless specified)
   > 
   > | Algo | No. of disconnected nodes at zeroth level | %age overall disconnected (nodes) | Comments | Index time |
   > | --- | --- | --- | --- | --- |
   > | Baseline Equivalent | 90 | 0.1740 (87 nodes) | Same as baseline but with some extra parameters to allow more experiments | 33s |
   > | Baseline | 90 | 0.1740 (87) | Exactly the code as in production | |
   > | Extend Candidates | 39 | 0.0760 (38) | | 280s |
   > | keep-pruned till max-conn | 154 | 0.2880 (144) | disconnected at level1=4 and at level2=1 | 36s |
   > | keep-pruned till half of max-conn | 97 | 0.1860 (93) | | 34s |
   > | new heuristic without remove-otherhalf | 119 | 0.2240 (112) | | 35s |
   > | new heuristic with remove-otherhalf | 0 | 0 | fully connected at both 50K docs and 100K docs, but there were errors at max-conn=8 as the size of the neighbour array cannot grow and this algo allows more than max-conn connections | 33s |
   > | baseline with remove-otherhalf | 91 | 0.1720 (86) | remove-otherhalf does not give zero disconnectedness with the mainline algo | |
   > | keep-half-pruned with remove-otherhalf | 90 | 0.1740 (87) | no effect of remove-otherhalf with keep-half-pruned | 33s |
   > | new heuristic with remove otherhalf and honour max-conn | 0 | 0 | but for max-conn=8 and docs=100k I saw 35 disconnected nodes at level1; there were no disconnected nodes at level0 even with max-conn=8 | 36s |
   > 
   > Amazon data
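
For reference, here is a minimal sketch of the "remove-otherhalf" idea described
above (dropping the reverse edge whenever a connection is pruned). It only
illustrates the concept; it is not the code in
https://github.com/apache/lucene/pull/12783, and the `NeighborList` type and its
methods are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical adjacency list, for illustration only.
final class NeighborList {
  private final List<Integer> nodes = new ArrayList<>();

  void add(int node) {
    nodes.add(node);
  }

  void remove(int node) {
    nodes.remove(Integer.valueOf(node));
  }
}

final class RemoveOtherHalfSketch {
  // graph[i] holds the neighbours of node i on one HNSW level.
  private final NeighborList[] graph;

  RemoveOtherHalfSketch(NeighborList[] graph) {
    this.graph = graph;
  }

  // When the diversity heuristic prunes the edge node -> neighbour, also drop
  // the reverse edge neighbour -> node ("remove-otherhalf"), so no one-way
  // half of the duplex connection is left behind in the graph.
  void prune(int node, int neighbour) {
    graph[node].remove(neighbour);
    graph[neighbour].remove(node); // the "other half"
  }
}
```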

Re: [PR] Add a MemorySegment Vector scorer - for scoring without copying on-heap [lucene]

2024-05-12 Thread via GitHub


ChrisHegarty commented on code in PR #13339:
URL: https://github.com/apache/lucene/pull/13339#discussion_r1597648528


##
lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentByteVectorScorerSupplier.java:
##
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import java.lang.foreign.MemorySegment;
+import java.util.Optional;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.store.FilterIndexInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.MemorySegmentAccessInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/**
+ * A scorer of vectors whose element size is byte.
+ *
+ * This class is both a scorer supplier and a scorer.
+ */
+public abstract sealed class MemorySegmentByteVectorScorerSupplier

Review Comment:
   Ok. Done.
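
   As background for the PR title (scoring without copying on-heap), a minimal
sketch of reading byte-vector components straight out of a `MemorySegment`; the
offsets and dimension are assumptions for illustration, and the PR's actual
scorer is considerably more involved:

   ```java
   import java.lang.foreign.MemorySegment;
   import java.lang.foreign.ValueLayout;

   // Sketch only: a byte-vector dot product that reads both operands directly
   // from MemorySegments (e.g. backed by memory-mapped index files), so no
   // byte[] copy onto the heap is needed before scoring.
   final class MemorySegmentDotProductSketch {
     static int dotProduct(MemorySegment a, long aOffset, MemorySegment b, long bOffset, int dims) {
       int sum = 0;
       for (int i = 0; i < dims; i++) {
         sum += a.get(ValueLayout.JAVA_BYTE, aOffset + i) * b.get(ValueLayout.JAVA_BYTE, bOffset + i);
       }
       return sum;
     }
   }
   ```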






[PR] Always add sub query explanations to DisjunctionMaxQuery [lucene]

2024-05-12 Thread via GitHub


timgrein opened a new pull request, #13362:
URL: https://github.com/apache/lucene/pull/13362

   ## Closes https://github.com/apache/lucene/issues/13357
   
   ### Description
   
   https://github.com/apache/lucene/issues/13357 states that it's useful to 
have the explanations of the sub queries of `DisjunctionMaxQuery` present even 
if the document didn't match. Considering that other queries like 
`CoveringQuery` also include [explanations if the document didn't 
match](https://github.com/apache/lucene/blob/25f1efd8eb34c2eb135a3c0850723ca5b03deec7/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CoveringQuery.java#L218-L223),
 I've adjusted the logic according to the issue's proposal; a sketch of the 
pattern follows below.
   
   Also added tests for `explain` (match and no-match cases).
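   
   For illustration, a minimal sketch of the pattern (modelled on the linked 
`CoveringQuery` behaviour, not necessarily the exact code in this PR), written 
as the `explain` method of a `DisjunctionMaxWeight`-style `Weight`; `weights` 
(one per disjunct) and `tieBreakerMultiplier` are the usual 
`DisjunctionMaxQuery` ingredients:
   
   ```java
   @Override
   public Explanation explain(LeafReaderContext context, int doc) throws IOException {
     boolean match = false;
     double max = 0;
     double otherSum = 0;
     List<Explanation> subs = new ArrayList<>();
     for (Weight w : weights) {
       Explanation e = w.explain(context, doc);
       subs.add(e); // collected whether or not this clause matched
       if (e.isMatch()) {
         match = true;
         double score = e.getValue().doubleValue();
         otherSum += score;
         max = Math.max(max, score);
       }
     }
     if (match) {
       float score = (float) (max + (otherSum - max) * tieBreakerMultiplier);
       return Explanation.match(score, "max plus " + tieBreakerMultiplier + " times others of:", subs);
     }
     // The point of the change: keep the sub-query explanations even on no-match.
     return Explanation.noMatch("No matching clause", subs);
   }
   ```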





Re: [PR] Add 'passageSortComparator' option in FieldHighlighter [lucene]

2024-05-12 Thread via GitHub


github-actions[bot] commented on PR #13276:
URL: https://github.com/apache/lucene/pull/13276#issuecomment-2106427848

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Reuse BitSet when there are deleted documents in the index instead of creating new BitSet [lucene]

2024-05-12 Thread via GitHub


Pulkitg64 commented on PR #12857:
URL: https://github.com/apache/lucene/pull/12857#issuecomment-2106667534

   Thanks @mikemccand for the pointers. Will try to run benchmarks on this 
change.

