Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-11 Thread via GitHub


jpountz commented on PR #14204:
URL: https://github.com/apache/lucene/pull/14204#issuecomment-2650473311

   I moved the code to the sandbox facet framework and applied the suggestions; I hope I didn't miss any.





Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub


jimczi commented on code in PR #14215:
URL: https://github.com/apache/lucene/pull/14215#discussion_r1950877363


##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) {
 
   @Override
   public float minCompetitiveSimilarity() {
-return queue.size() >= k() ? queue.topScore() : Float.NEGATIVE_INFINITY;
+return queue.size() >= k() ? Math.nextUp(queue.topScore()) : Float.NEGATIVE_INFINITY;

Review Comment:
   It was more about the high-level contract of minCompetitiveSimilarity. It's 
not a big deal, but it does place the responsibility on the various 
implementers.
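
   For context, a small illustration of what `Math.nextUp` changes here (hypothetical values, plain JDK behavior):

```java
float topScore = 0.42f;
// Math.nextUp returns the smallest float strictly greater than its argument,
// so using it as the minimum competitive similarity effectively turns the
// inclusive ">=" bound into an exclusive ">" bound: a candidate scoring
// exactly topScore is no longer competitive.
float threshold = Math.nextUp(topScore);
assert threshold > topScore;
assert Math.nextDown(threshold) == topScore;
```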






Re: [PR] Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves [lucene]

2025-02-11 Thread via GitHub


iverase commented on PR #14176:
URL: https://github.com/apache/lucene/pull/14176#issuecomment-2650518843

   >The [nightly geo 
benchy](https://benchmarks.mikemccandless.com/geobench.html) didn't seem 
impacted either way; maybe the tasks it runs are not exercising the optimized 
path here.





Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-11 Thread via GitHub


epotyom commented on code in PR #14204:
URL: https://github.com/apache/lucene/pull/14204#discussion_r1950866681


##
lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
##
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {

Review Comment:
   > the API is designed around the idea of hierarchical facets
   
   I don't think hierarchy is a central piece of the API; it is more that hierarchical facets are also supported if needed. For many facet implementations, such as ranges or distinct long values, it is not needed.
   
   I briefly looked at the code and I think it fits into the sandbox module API 
use case. I believe all you need is to implement FacetCutter 
(`#createLeafCutter`) and LeafFacetCutter (`#advanceExact` and `#nextOrd`) 
interfaces, which return bucket indexes as facet ordinals.
   
   What you would get for free is counting, or computing other aggregations based on numeric fields for the doc (FacetRecorder implementations), and sorting/top-N buckets by count or aggregation (`OrdinalIterator`s). You could also implement the `OrdToLabel` interface to do the math explained in HistogramCollectorManager's javadoc and return String representations of bucket ranges.
   
   That being said, if you don't think that this additional functionality is useful, then there is probably no reason to use the new API, which is more flexible at the cost of creating more abstractions.
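
   For illustration, a rough sketch of what that could look like (interface shapes and package locations are assumed from the method names above, `NO_MORE_ORDS` is assumed to be the exhaustion sentinel from the ordinal-iterator contract; this is not code from the PR):

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
// Sandbox facet interfaces; package locations assumed:
import org.apache.lucene.sandbox.facet.cutters.FacetCutter;
import org.apache.lucene.sandbox.facet.cutters.LeafFacetCutter;
import org.apache.lucene.sandbox.facet.iterators.OrdinalIterator;

// Hypothetical cutter: each doc's facet ordinal is its bucket index, i.e. the
// value's quotient by the interval, offset so ordinals are non-negative.
final class HistogramFacetCutter implements FacetCutter {
  private final String field;
  private final long interval;
  private final long minQuotient; // assumed known up front, e.g. from index metadata

  HistogramFacetCutter(String field, long interval, long minQuotient) {
    this.field = field;
    this.interval = interval;
    this.minQuotient = minQuotient;
  }

  @Override
  public LeafFacetCutter createLeafCutter(LeafReaderContext context) throws IOException {
    NumericDocValues values = DocValues.getNumeric(context.reader(), field);
    return new LeafFacetCutter() {
      private boolean consumed = true;

      @Override
      public boolean advanceExact(int doc) throws IOException {
        consumed = false;
        return values.advanceExact(doc);
      }

      @Override
      public int nextOrd() throws IOException {
        if (consumed) {
          return OrdinalIterator.NO_MORE_ORDS; // sentinel name assumed
        }
        consumed = true; // single-valued case: one ordinal per doc
        return (int) (Math.floorDiv(values.longValue(), interval) - minQuotient);
      }
    };
  }
}
```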






Re: [PR] Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves [lucene]

2025-02-11 Thread via GitHub


mikemccand commented on PR #14176:
URL: https://github.com/apache/lucene/pull/14176#issuecomment-2650500796

   > Incredible speeds ups here 
https://benchmarks.mikemccandless.com/FilteredIntNRQ.html and here 
https://benchmarks.mikemccandless.com/IntNRQ.html
   
   Yeah, wow!
   
   > These numbers look great! I want to run this change on the geo benchmarks 
but I expect similar speed ups. I am planning to do it early next week.
   
   The [nightly geo 
benchy](https://benchmarks.mikemccandless.com/geobench.html) didn't seem 
impacted either way; maybe the tasks it runs are not exercising the optimized 
path here.





Re: [PR] Add UnwrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-02-11 Thread via GitHub


mayya-sharipova commented on PR #14154:
URL: https://github.com/apache/lucene/pull/14154#issuecomment-2651354310

   @jpountz Thank you for your feedback. I am happy to discuss alternatives. 
   
   Isn't the whole idea of ReuseStrategy to decide if an analyzer's components 
can be reused or should be recreated/swapped?  
   
   > I worry that this is just one place where the assumption that analysis 
components are immutable is violated, and that there are other ones that 
haven't been found yet. I'm curious if other alternatives were considered, e.g. 
atomically swapping the analyzer or updating the behavior of analysis 
components without needing to replace them.
   
   Yes, indeed: in this PR analysis components are immutable; they are recreated/swapped in an analyzer.
   
   
   
   For context: this PR came from a need to have an updateable-synonyms analyzer keep working when it is wrapped in another analyzer, for example CompletionAnalyzer. In Elasticsearch, an analyzer with updateable synonyms has a custom ReuseStrategy that creates new components when synonyms are updated. We now want CompletionAnalyzer to also update its components when its wrapped synonyms analyzer has updated its synonyms.
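
   For reference, a minimal sketch of the unwrapping idea (my illustration, not the PR's actual code; it assumes the strategy is handed the wrapped analyzer):

```java
import org.apache.lucene.analysis.Analyzer;

// Delegate the reuse decision to the wrapped analyzer's own strategy, so that
// when the wrapped (synonyms) analyzer invalidates its components, the
// wrapper's cached components get rebuilt as well.
final class UnwrappingReuseStrategy extends Analyzer.ReuseStrategy {
  private final Analyzer wrapped; // e.g. the updateable-synonyms analyzer

  UnwrappingReuseStrategy(Analyzer wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public Analyzer.TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
    // null from the wrapped strategy means "rebuild"; propagate that decision.
    if (wrapped.getReuseStrategy().getReusableComponents(wrapped, fieldName) == null) {
      return null;
    }
    return (Analyzer.TokenStreamComponents) getStoredValue(analyzer);
  }

  @Override
  public void setReusableComponents(
      Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components) {
    setStoredValue(analyzer, components);
  }
}
```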





Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub


benwtrent commented on issue #14208:
URL: https://github.com/apache/lucene/issues/14208#issuecomment-2650772079

   @Vikasht34 Thank you for the helpful suggestions, but these just seem like rehashes of already-made suggestions or things that are completely unrelated. Were these generated via some LLM or something?
   
   
   
   One idea I had was to eagerly start delta-encoding connections as vints once they reach a certain size (a sketch follows below). Maybe we get lucky: as the graph grows, we become less likely to update a graph's connections and thus have to re-calculate and extend them. This will complicate the diversity checks for connections, as they rely on vectors being sorted by nearness :/

   Maybe we delta-encode their scores instead of the neighbor ordinals (crazy idea, I don't know how to do this)? It's conceivable that the scores for a given NSW will be generally close to each other.
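
   To make the first idea concrete, a minimal sketch of delta + vint encoding for one node's neighbor list (my own illustration using `ByteBuffersDataOutput`, not a worked-out design):

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.store.ByteBuffersDataOutput;

final class NeighborEncoding {
  // Delta-encode sorted neighbor ordinals; small gaps typically cost 1-2 bytes
  // each as vints. Note that sorting by ordinal is exactly what breaks the
  // sorted-by-nearness assumption the diversity checks rely on.
  static ByteBuffersDataOutput encodeNeighbors(int[] neighbors, int count) throws IOException {
    int[] sorted = Arrays.copyOf(neighbors, count);
    Arrays.sort(sorted);
    ByteBuffersDataOutput out = new ByteBuffersDataOutput();
    out.writeVInt(count);
    for (int i = 0, prev = 0; i < count; i++) {
      out.writeVInt(sorted[i] - prev); // gap encoding
      prev = sorted[i];
    }
    return out;
  }
}
```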





Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub


benwtrent commented on code in PR #14215:
URL: https://github.com/apache/lucene/pull/14215#discussion_r1950821830


##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) {
 
   @Override
   public float minCompetitiveSimilarity() {
-return queue.size() >= k() ? queue.topScore() : Float.NEGATIVE_INFINITY;
+return queue.size() >= k() ? Math.nextUp(queue.topScore()) : Float.NEGATIVE_INFINITY;

Review Comment:
   I am fine with that as well. I suppose the only thing that actually uses 
this is the graph searcher :/






Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub


Vikasht34 commented on issue #14208:
URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651710710

   @benwtrent Thanks ...





Re: [PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub


msokolov commented on PR #14221:
URL: https://github.com/apache/lucene/pull/14221#issuecomment-2652083206

   LGTM, thanks!





Re: [PR] Add UnwrappingReuseStrategy for AnalyzerWrapper [lucene]

2025-02-11 Thread via GitHub


jpountz commented on PR #14154:
URL: https://github.com/apache/lucene/pull/14154#issuecomment-2652082591

   I suspect that `ReuseStrategy` hasn't been built with the idea that it could be used to invalidate components, but only as a way to describe an efficient caching strategy for analysis components: typically, whether analysis components can be cached globally (`GLOBAL_REUSE_STRATEGY`) or need a different cache entry for each field (`PER_FIELD_REUSE_STRATEGY`).

   @uschindler I believe that you are the one who designed this logic; I'd be interested in your thoughts.





Re: [PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub


risdenk commented on code in PR #14221:
URL: https://github.com/apache/lucene/pull/14221#discussion_r1951627505


##
lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java:
##
@@ -34,45 +34,44 @@
 public class AnalyzerProfile {
 
   /** Global indicating the configured analysis data directory */
-  public static String ANALYSIS_DATA_DIR = "";
+  public static final String ANALYSIS_DATA_DIR;
 
   static {
-init();
+ANALYSIS_DATA_DIR = init();

Review Comment:
   Does this need to be in the static block? Can't the line above just be

   `public static final String ANALYSIS_DATA_DIR = init();`






Re: [PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub


risdenk commented on PR #14221:
URL: https://github.com/apache/lucene/pull/14221#issuecomment-2652103797

   https://github.com/apache/lucene/pull/14216 may help with this, specifically `NonFinalStaticField`, which was off since it was super noisy: https://errorprone.info/bugpattern/NonFinalStaticField





Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-11 Thread via GitHub


benwtrent commented on code in PR #14191:
URL: https://github.com/apache/lucene/pull/14191#discussion_r1939664144


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -19,11 +19,7 @@
 import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
 
 import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Comparator;
-import java.util.List;
-import java.util.Objects;
+import java.util.*;

Review Comment:
   let's not use `*`



##
lucene/core/src/java/org/apache/lucene/search/knn/TopKnnCollectorManager.java:
##
@@ -55,8 +57,13 @@ public KnnCollector newCollector(
     if (globalScoreQueue == null) {
       return new TopKnnCollector(k, visitedLimit, searchStrategy);
     } else {
-      return new MultiLeafKnnCollector(
-          k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, searchStrategy));
+      if (freeze.getAndSet(false)) {
+        return new MultiLeafKnnCollector(
+            k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, searchStrategy), false);
+      } else {

Review Comment:
   I don't think you need to do this "freeze" thing. The only issue was sharing 
information between threads. A single executor continuing to share information 
is completely ok.



##
lucene/core/src/java/org/apache/lucene/search/knn/TopKnnCollectorManager.java:
##
@@ -55,8 +57,13 @@ public KnnCollector newCollector(
     if (globalScoreQueue == null) {
       return new TopKnnCollector(k, visitedLimit, searchStrategy);
     } else {
-      return new MultiLeafKnnCollector(
-          k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, searchStrategy));
+      if (freeze.getAndSet(false)) {
+        return new MultiLeafKnnCollector(
+            k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, searchStrategy), false);
+      } else {

Review Comment:
   For this change, we simply make `MultiLeafKnnCollector` not thread-safe at all. We can continue to have this `globalScoreQueue` that is unique per thread and shared between all the collectors, but all the code assumes it's within a single thread.
   
   This will likely speed things up significantly.



##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -88,11 +91,64 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
             getKnnCollectorManager(k, indexSearcher), indexSearcher.getTimeout());
     TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
     List<LeafReaderContext> leafReaderContexts = reader.leaves();
+
     List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
-    for (LeafReaderContext context : leafReaderContexts) {
-      tasks.add(() -> searchLeaf(context, filterWeight, knnCollectorManager));
+
+    TopDocs[] perLeafResults;
+    if (leafReaderContexts.size() > 1) {
+      if (true) {
+        /* sort LRCs by segment size */
+        List<LeafReaderContext> sortedLeafReaderContexts = leafReaderContexts.stream()
+            .sorted(Comparator.comparingInt(o -> o.reader().numDocs())).toList();
+        int noLRCs = sortedLeafReaderContexts.size();

Review Comment:
   I think this is OK for a POC, but we can easily sort by the actual number of vectors for real data (a hedged sketch below).
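
   A hedged sketch of that suggestion (`reader` and `field` are assumed to be in scope, as in the diff above; uses `LeafReader#getFloatVectorValues`):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;

// Sort leaves by the actual number of indexed vectors rather than numDocs().
List<LeafReaderContext> sortedByVectorCount = reader.leaves().stream()
    .sorted(Comparator.comparingInt((LeafReaderContext ctx) -> {
      try {
        FloatVectorValues values = ctx.reader().getFloatVectorValues(field);
        return values == null ? 0 : values.size();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }))
    .toList();
```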






Re: [PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub


msfroh commented on code in PR #14221:
URL: https://github.com/apache/lucene/pull/14221#discussion_r1951677461


##
lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java:
##
@@ -34,45 +34,44 @@
 public class AnalyzerProfile {
 
   /** Global indicating the configured analysis data directory */
-  public static String ANALYSIS_DATA_DIR = "";
+  public static final String ANALYSIS_DATA_DIR;
 
   static {
-init();
+ANALYSIS_DATA_DIR = init();

Review Comment:
   You are absolutely right! Thanks!






Re: [I] org.apache.lucene.search.TestKnnFloatVectorQuery.testFindFewer ComparisonFailure: expected: but was: [lucene]

2025-02-11 Thread via GitHub


navneet1v commented on issue #14175:
URL: https://github.com/apache/lucene/issues/14175#issuecomment-2650199394

   @ChrisHegarty I did some debugging on this failed test, and what I found is that for the particular seed, the Lucene99ScalarQuantizedVectorsFormat gets picked from a list of formats which gets created here (https://github.com/apache/lucene/blob/6b0112cdee284cae9143402d7dd80fdaf4d22f93/lucene/test-framework/src/java/org/apache/lucene/tests/index/RandomCodec.java#L267). Due to this quantized format, the scores of the 0th and 2nd docs become the same, causing the failure, as it is not guaranteed which doc will come first.

   Code ref: https://github.com/apache/lucene/blob/6b0112cdee284cae9143402d7dd80fdaf4d22f93/lucene/test-framework/src/java/org/apache/lucene/tests/index/RandomCodec.java#L210-L219

   One way I think we can fix this is, rather than checking the order of docs, to compare whether all the docs we expect are present (see the sketch below).
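
   For instance, a sketch of that assertion (hypothetical doc IDs and variable names, not the actual test code):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Compare the set of returned doc IDs rather than their order, since equal
// quantized scores make the ordering non-deterministic.
Set<Integer> expectedDocs = Set.of(0, 1, 2); // hypothetical expectation
Set<Integer> actualDocs =
    Arrays.stream(topDocs.scoreDocs).map(sd -> sd.doc).collect(Collectors.toSet());
assertEquals(expectedDocs, actualDocs);
```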






Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

2025-02-11 Thread via GitHub


john-wagster commented on PR #14192:
URL: https://github.com/apache/lucene/pull/14192#issuecomment-2651490994

   @rmuir I made another pass based on your feedback, and I agree with keeping this simple for a first pass. To that end I've done the following:

   * CaseFolding is no longer externally exposed (we can revisit this later)
   * CaseFolding has one method to get the set of alternates, renamed from `fold`, which was an inaccurate name previously; the idea is to introduce a real `fold` method in the future if it's valuable
   * CaseFolding only includes the set of edge cases that are not handled by `Character.toLowerCase` and `Character.toUpperCase`, and is now a switch/case statement with hex codepoints and comments on every character (see the sketch after this list)
   * this is similar to ASCIIFoldingFilter
   * it was built using https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, filtered down to the set of things not already covered by `Character.toLowerCase` and `Character.toUpperCase`
   * the utility class CaseFoldingUtil for generating the switch/case statement code in the CaseFolding class has been removed
   * the ASCII_CASE_INSENSITIVE flag has been deprecated and now has the same behavior as the CASE_INSENSITIVE flag
   * otherwise I tried to minimize changes, with less concern for performance during compilation of the regex

   Thoughts at this point, and whether I've adequately incorporated your feedback, would be much appreciated.
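
   To illustrate the switch/case shape described above (example codepoints only; the real table in the PR is generated from CaseFolding.txt and covers many more entries):

```java
// Returns alternate codepoints that Character.toLowerCase/Character.toUpperCase
// alone would miss, or null if there are none. Example entries only.
static int[] getAlternates(int codepoint) {
  return switch (codepoint) {
      // GREEK SMALL LETTER SIGMA also case-folds with final sigma
    case 0x03C3 -> new int[] {0x03C2};
      // GREEK SMALL LETTER FINAL SIGMA also case-folds with sigma
    case 0x03C2 -> new int[] {0x03C3};
    default -> null;
  };
}
```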





Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-11 Thread via GitHub


gsmiller commented on code in PR #14204:
URL: https://github.com/apache/lucene/pull/14204#discussion_r1951323035


##
lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
##
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {
+
+  private final String field;
+  private final long interval;
+  private final LongIntHashMap counts;
+
+  HistogramCollector(String field, long interval) {
+this.field = field;
+this.interval = interval;
+this.counts = new LongIntHashMap();
+  }
+
+  @Override
+  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
+FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+if (fi == null) {
+  // The segment has no values, nothing to do.
+  throw new CollectionTerminatedException();
+}
+if (fi.getDocValuesType() != DocValuesType.NUMERIC
+&& fi.getDocValuesType() != DocValuesType.SORTED_NUMERIC) {
+  throw new IllegalStateException(
+  "Expected numeric field, but got doc-value type: " + fi.getDocValuesType());
+}
+SortedNumericDocValues values = DocValues.getSortedNumeric(context.reader(), field);
+NumericDocValues singleton = DocValues.unwrapSingleton(values);
+if (singleton == null) {
+  return new HistogramNaiveLeafCollector(values, interval, counts);
+} else {
+  DocValuesSkipper skipper = context.reader().getDocValuesSkipper(field);
+  if (skipper != null) {
+long leafMinQuotient = Math.floorDiv(skipper.minValue(), interval);
+long leafMaxQuotient = Math.floorDiv(skipper.maxValue(), interval);
+if (leafMaxQuotient - leafMinQuotient <= 1024) {
+  // Only use the optimized implementation if there is a small number of unique quotients,
+  // so that we can count them using a dense array instead of a hash table.
+  return new HistogramLeafCollector(singleton, skipper, interval, counts);
+}
+  }
+  return new HistogramNaiveSingleValuedLeafCollector(singleton, interval, counts);
+}
+  }
+
+  @Override
+  public ScoreMode scoreMode() {
+return ScoreMode.COMPLETE_NO_SCORES;
+  }
+
+  LongIntHashMap getCounts() {
+return counts;
+  }
+
+  /**
+   * Naive implementation of a histogram {@link LeafCollector}, which iterates all matches
+   * and looks up the value to determine the corresponding bucket.
+   */
+  private static class HistogramNaiveLeafCollector implements LeafCollector {
+
+private final SortedNumericDocValues values;
+private final long interval;
+private final LongIntHashMap counts;
+
+HistogramNaiveLeafCollector(
+SortedNumericDocValues values, long interval, LongIntHashMap counts) {
+  this.values = values;
+  this.interval = interval;
+  this.counts = counts;
+}
+
+@Override
+public void setScorer(Scorable scorer) throws IOException {}
+
+@Override
+public void collect(int doc) throws IOException {
+  if (values.advanceExact(doc)) {
+int valueCount = values.docValueCount();
+long prevQuotient = Long.MIN_VALUE;
+for (int i = 0; i < valueCount; ++i) {
+  final long value = values.nextValue();
+  final long quotient = Math.floorDiv(value, interval);
+  // We must not double-count values that divide to the same quotient since this returns
+  // doc counts as opposed to value counts.

[I] NRT replication should make it possible/easy to use bite-sized commits [lucene]

2025-02-11 Thread via GitHub


mikemccand opened a new issue, #14219:
URL: https://github.com/apache/lucene/issues/14219

   ### Description
   
   At Amazon (product search) we use Lucene's [awesome near-real-time segment 
replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html)
 to efficiently distribute index changes (through S3) to searchers.
   
   This "index once search many" approach works very well for applications 
needing high QPS scale.  It enables physical isolation of indexing and 
searching (so e.g. heavy merges cannot affect searching).  It enables rapid 
failover or proactive cluster load-balancing if an indexer crashes or its node 
is too hot: just promote one of the replicas.  And it means you should 
frontload lots of indexing cost if it makes searching even a bit faster.
   
   But recently we've been struggling with too-large checkpoints.  We ask the 
indexer to write a new checkpoint (an `IndexWriter` commit call) every 60 
seconds, and searchers copy down the checkpoint and light the new segments.  
During periods of heavy index updates ("update storms"), combined with our very 
aggressive `TieredMergePolicy` configuration to reclaim deletes, we see a big 
write amplification (bytes copied to replicas vs bytes written for newly 
indexed documents), sometimes sending many 10s of GB new segments in a single 
checkpoint.
   
   When replicas copy these large checkpoints, it can induce heavy page faults on the hot query path for in-flight queries, since copying in the large checkpoint evicts hot searching pages before any of the new segments are lit, putting RAM pressure on the OS (we suspect the [`MADV_RANDOM` hint for KNN files](https://github.com/apache/lucene/pull/13267) also exacerbates things for us -- this is good for cold indices, but maybe not for mostly-hot ones?). We could maybe tune the OS to more aggressively move dirty pages to disk? Or maybe try `O_DIRECT` when copying the new checkpoint files. But still, when we then light the new segments, we'll hit page faults then too.
   
   We had an a-ha moment on how to fix this, using APIs Lucene already exposes! We just need to decouple committing from checkpointing/replicating. Instead of committing/replicating every 60 seconds, ask Lucene to commit much more frequently (say once per second, like the OpenSearch/Elasticsearch default I think, or maybe "whenever > N GB of segments turn over", though this is harder). Configure a time-based `IndexDeletionPolicy` so these commit points all live for a long time (say an hour). Then, every 60 seconds (or whatever your replication interval is), replicate all new commit points (and any segment files referenced by these commit points) out to searchers; a sketch of such a deletion policy follows.
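
   Concretely, a hedged sketch of such a time-based deletion policy (the `"timestamp"` commit user-data key is an assumption; the indexer would set it at commit time, e.g. via `IndexWriter#setLiveCommitData`):

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Keep every commit point younger than maxAgeMillis, always keeping the
// newest one (the commits list is sorted oldest-first).
final class KeepRecentCommitsPolicy extends IndexDeletionPolicy {
  private final long maxAgeMillis;

  KeepRecentCommitsPolicy(long maxAgeMillis) {
    this.maxAgeMillis = maxAgeMillis;
  }

  @Override
  public void onInit(List<? extends IndexCommit> commits) throws IOException {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) throws IOException {
    long now = System.currentTimeMillis();
    for (int i = 0; i < commits.size() - 1; i++) { // never touch the newest commit
      IndexCommit commit = commits.get(i);
      String timestamp = commit.getUserData().get("timestamp"); // assumed set by the indexer
      if (timestamp != null && now - Long.parseLong(timestamp) > maxAgeMillis) {
        commit.delete();
      }
    }
  }
}
```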
   
   The searchers can then carefully pick and choose which commit points they want to switch to, in a bite-sized / stepping-stone manner, ensuring that each commit point they light has < N GB of turnover in its segments, meaning the OS will only ever need "hot-pages plus N" GB of working RAM. This leans nicely on [Lucene's strongly transactional APIs](https://blog.mikemccandless.com/2012/03/transactional-lucene.html), and I think it's largely sugar / utility classes in the NRT replicator that we'd need to add to demonstrate this approach, maybe.





Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-11 Thread via GitHub


gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2651208564

   Thanks for the feedback! I implemented the fixed-size inner loop and printed out the assembly for all variants: [perf_asm.log](https://github.com/user-attachments/files/18752147/perf_asm.log)
   
   * When profiling is enabled, `#current countVariable=true` and `#current countVariable=false` have the same (slow) speed. It seems like profiling prevented some optimization.

   * According to the assembly, `#current bpv=16` does not get auto-vectorized. `#current bpv=24` gets vectorized on the shift loop, but not on the remainder loop.

   * According to the assembly, the inner loop gets auto-vectorized, but is slower than the Vector API (see the sketch below).
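
   For reference, a minimal sketch of the fixed-size inner loop pattern (my own simplified layout of 3 bytes per doc ID; Lucene's actual BKD bpv24 encoding differs):

```java
// A constant inner-loop bound gives the JIT a better chance to auto-vectorize.
static void decode24(byte[] in, int[] out, int count) {
  int i = 0;
  for (; i + 8 <= count; i += 8) {
    for (int j = 0; j < 8; j++) { // fixed bound of 8, known at compile time
      int p = (i + j) * 3;
      out[i + j] = ((in[p] & 0xFF) << 16) | ((in[p + 1] & 0xFF) << 8) | (in[p + 2] & 0xFF);
    }
  }
  for (; i < count; i++) { // variable-length remainder, typically left scalar
    int p = i * 3;
    out[i] = ((in[p] & 0xFF) << 16) | ((in[p + 1] & 0xFF) << 8) | (in[p + 2] & 0xFF);
  }
}
```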
   
   **MAC M2**
```
Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5  103.490 ±  6.785  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  212.488 ±  5.383  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   91.203 ±  1.023  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  149.742 ±  1.953  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  213.162 ±  1.598  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  216.529 ±  2.518  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  153.970 ±  1.101  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  140.103 ±  3.001  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5  129.281 ±  0.471  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5  131.083 ±  8.775  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   99.597 ±  2.850  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   96.235 ± 14.875  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5  104.314 ±  0.557  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  202.175 ± 10.863  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   86.016 ±  1.315  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   85.609 ±  5.733  ops/ms
```
   
   **Linux X86 AVX512 profiling disabled**
```
Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   41.138 ±  1.770  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  142.277 ±  0.943  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   43.104 ±  0.066  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   42.760 ±  0.496  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   86.565 ±  0.904  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5   86.624 ±  0.395  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   80.064 ±  2.604  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   76.638 ± 18.692  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5   43.810 ±  1.096  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5   42.485 ±  0.073  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   37.255 ±  0.994  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   37.243 ±  0.593  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   41.415 ±  0.079  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  145.713 ±  0.381  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   27.758 ±  4.210  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   28.519 ±  1.839  ops/ms
```
   
   **Linux X86 AVX512 profiling enabled**
```
Benchmark                        (bpv)  (countVariable)   Mode  Cnt   Score   Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5  29.878 ± 0.130  ops/ms
BKDCodecBenchmark.current:asm       16             true  thrpt         NaN            ---
BKDCodecBenchmark.current           16            false  thrpt    5  29.314 ± 0.229  ops/ms
BKDCodecBenchmark.current:asm       16            false  thrpt         NaN            ---
BKDCodecBenchmark.current           24             true  thrpt    5  34.874 ± 0.320  ops/ms
BKDCodecBenchmark.current:asm       24             true  thrpt         NaN            ---
BKDCodecBenchmark.current           24            false  thrpt    5  33.987 ± 0.055
```

Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub


tteofili commented on code in PR #14215:
URL: https://github.com/apache/lucene/pull/14215#discussion_r1951068471


##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) {
 
   @Override
   public float minCompetitiveSimilarity() {
-return queue.size() >= k() ? queue.topScore() : Float.NEGATIVE_INFINITY;
+return queue.size() >= k() ? Math.nextUp(queue.topScore()) : Float.NEGATIVE_INFINITY;

Review Comment:
   > I am fine with that as well. I suppose the only thing that actually uses 
this is the graph searcher :/
   
   we now possibly have multiple classes for graph searchers (extensions of `AbstractHnswGraphSearcher`), but that's much better than going over all of the `KnnCollector`s






Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub


tteofili commented on PR #14215:
URL: https://github.com/apache/lucene/pull/14215#issuecomment-2651170577

   Only, perhaps it'd be nice if we could add an @Monster / nightly test to make sure we don't run into this again in the future.





Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub


Vikasht34 commented on issue #14208:
URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651658228

   @benwtrent haha, it does look like it came from an LLM. Originally it came from this paper:
   https://www.arxiv.org/pdf/1908.00814v5
   
   ```
   The upper layer graph is merged into the approximate
   k-NN graph of the next layer after each round of J-Merge.
   During the construction process, one is allowed to save an
   arbitrary number of layers to form the hierarchy.
   ```
   





Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2025-02-11 Thread via GitHub


github-actions[bot] commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2652354185

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-11 Thread via GitHub


tveasey commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2652753715

   This pull request relates only to OSQ, so the proper scope of discussion is the concerns raised around its attribution.
   
   We have pursued multiple conversations and discussions in order to resolve various concerns amicably, and we continue to desire collaboration and open discourse with respect to each party's innovations in this problem space. We would like to reiterate that we highly rate the research in both RaBitQ and its extension. However, as we have maintained in private communications, OSQ is not a derivative of extended RaBitQ.
   
   > In addition, Elastic shared a doc with us explaining the differences 
between OSQ and extended RaBitQ, but our understanding is that the core ideas 
of two methods, i.e., trying different parameters for scalar quantization on a 
per-vector basis, are similar.
   
   This goes to the crux of the disagreement around attribution. In the following we describe the two schemes, extended RaBitQ [1] and OSQ [2], and state again the points we have made privately regarding their differences.
   
   Both methods centre the data, which was originally proposed in LVQ [3] to 
the best of our knowledge.
   
   Extended RaBitQ proposes a method for constructing a codebook for which one 
can compute the dot product efficiently using integer arithmetic.
   
   In a nutshell it proceeds as follows:
   1. Apply a random orthogonal transformation of the input vectors.
   2. Normalise the resulting vectors and perform an efficient search for the nearest codebook centroid.
   3. Snap the floating point vector to this codebook centroid and then undo 
the normalization procedure to obtain an estimate of the original dot product.
   
   At its heart it is a variant of product quantization which allows for 
efficient similarity calculations and was positioned as such in the original 
paper.
   
   The key ingredients of OSQ are as follows:
   1. Construct per vector quantization intervals, inherited from LVQ.
   2. Formulate quantization as a hyperparameter optimization problem for the interval [a, b] used to construct the grid of possible quantized vectors (see the generic sketch after this list), building on our earlier work which we published in April [4].
   3. Perform an analysis to show how to choose [a, b] to minimize the expected 
square error in the dot product when the data are normally distributed using 
sufficient statistics of the vector component distribution. (We perform an 
empirical study of the distribution of vector components in embedding models 
and show they are typically normally distributed.)
   4. Perform an iterative procedure based on coordinate descent which 
optimizes an anisotropic loss based on the ideas in [5] w.r.t. the interval [a, 
b] initialising with the output of 3.
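
   For background, a generic sketch of scalar quantization onto an interval [a, b] (textbook SQ, to anchor the terminology; this is neither OSQ nor RaBitQ code):

```java
// Snap each component to one of 2^bits evenly spaced points in [a, b].
// OSQ's contribution, per the description above, is choosing [a, b] per
// vector by explicitly optimizing the expected (anisotropic) error.
static int[] quantize(float[] v, float a, float b, int bits) {
  int levels = (1 << bits) - 1;
  float scale = levels / (b - a);
  int[] q = new int[v.length];
  for (int i = 0; i < v.length; i++) {
    float clamped = Math.min(b, Math.max(a, v[i]));
    q[i] = Math.round((clamped - a) * scale); // quantized code in [0, levels]
  }
  return q;
}
```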
   
   At an algorithmic level there is no similarity between the methods other 
than they both centre the data. At a conceptual level RaBitQ was developed as a 
variant of product quantization and OSQ was developed as an approach to 
hyperparameter optimization for per vector scalar quantization. For the record, 
we were inspired to revisit scalar quantization in light of the success of 
binary RaBitQ and we were exploring ideas prior to any communication between 
our teams regarding extended RaBitQ. The actual inspiration for the approach 
was our previous work on hyperparameter optimisation for quantization intervals 
and LVQ and we attribute this PR accordingly.
   
   Retrospectively, and in parallel to our work, the authors of RaBitQ explored the relationship between extended RaBitQ and scalar quantization directly ([6], published 12/12/2024). They show that the act of normalising the vectors, finding the nearest centroid, then undoing the normalisation can be thought of as a process of rescaling the quantization grid so that the nearest quantized vector winds up closer to the original floating point vector. Whilst this bears some relation to optimising b - a, they never formulate this process as an optimisation problem, and at no point show exactly what quantity they in fact optimise. In our view, this misses the key point: the important idea is the formulation. As such it is very clear that they do not and could not introduce any different optimisation objective, such as an anisotropic loss in the dot product. Even if they did, OSQ optimises both a and b directly, which gives it an extra degree of freedom, and develops an effective, non-trivial
 and completely distinct solution for this problem.
   
   As we have communicated privately, we dispute that OSQ is a variant of 
extended RaBitQ. We laid out in broadly similar terms exactly why we disagree 
at a technical level (along the lines of the discussion in this comment). We 
have received no further technical engagement with the points we have raised 
and have therefore reached an impasse.
   
   Regarding this point

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-11 Thread via GitHub


navneet1v commented on code in PR #14178:
URL: https://github.com/apache/lucene/pull/14178#discussion_r1952041592


##
lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java:
##
@@ -0,0 +1,457 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.faiss;
+
+import static java.lang.foreign.ValueLayout.ADDRESS;
+import static java.lang.foreign.ValueLayout.JAVA_FLOAT;
+import static java.lang.foreign.ValueLayout.JAVA_INT;
+import static java.lang.foreign.ValueLayout.JAVA_LONG;
+
+import java.io.IOException;
+import java.lang.foreign.Arena;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemoryLayout;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.invoke.MethodHandle;
+import java.lang.invoke.MethodHandles;
+import java.lang.invoke.MethodType;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.LongBuffer;
+import java.util.Locale;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.KnnVectorValues;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.KnnCollector;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.hnsw.IntToIntFunction;
+
+public final class LibFaissC {
+  /*
+   * TODO: Requires some changes to Faiss
+   *  - https://github.com/facebookresearch/faiss/pull/4158 (merged in main, to be released in v1.11.0)
+   *  - https://github.com/facebookresearch/faiss/pull/4167 (merged in main, to be released in v1.11.0)
+   *  - https://github.com/facebookresearch/faiss/pull/4180 (in progress)
+   */
+
+  public static final String LIBRARY_NAME = "faiss_c";
+  public static final String LIBRARY_VERSION = "1.10.0";
+
+  static {
+try {
+  System.loadLibrary(LIBRARY_NAME);
+} catch (UnsatisfiedLinkError e) {
+  throw new RuntimeException(
+  "Shared library not found, build the Faiss C_API from https://github.com/facebookresearch/faiss/blob/main/c_api/INSTALL.md "
+  + "and link it (along with all dependencies) to the library path "
+  + "(-Djava.library.path JVM argument or $LD_LIBRARY_PATH environment variable)",
+  e);
+}
+checkLibraryVersion();
+  }
+
+  private LibFaissC() {}
+
+  @SuppressWarnings("SameParameterValue")
+  private static MemorySegment getUpcallStub(
+  Arena arena, MethodHandle target, MemoryLayout resLayout, MemoryLayout... argLayouts) {
+return Linker.nativeLinker()
+.upcallStub(target, FunctionDescriptor.of(resLayout, argLayouts), arena);
+  }
+
+  private static MethodHandle getDowncallHandle(
+  String functionName, MemoryLayout resLayout, MemoryLayout... argLayouts) {
+return Linker.nativeLinker()
+.downcallHandle(
+SymbolLookup.loaderLookup().find(functionName).orElseThrow(),
+FunctionDescriptor.of(resLayout, argLayouts));
+  }
+
+  private static void checkLibraryVersion() {
+MethodHandle getVersion = getDowncallHandle("faiss_get_version", ADDRESS);
+String actualVersion = callAndGetString(getVersion);
+if (LIBRARY_VERSION.equals(actualVersion) == false) {
+  throw new UnsupportedOperationException(
+  String.format(
+  Locale.ROOT,
+  "Expected Faiss library version %s, found %s",
+  LIBRARY_VERSION,
+  actualVersion));
+}
+  }
+
+  private static final MethodHandle FREE_INDEX =
+  getDowncallHandle("faiss_Index_free", JAVA_INT, ADDRESS);
+
+  public static void freeIndex(MemorySegment indexPointer) {
+callAndHandleError(FREE_INDEX, indexPointer);
+  }
+
+  private static final MethodHandle FREE_CUSTOM_IO_WRITER =
+  getDowncallHandle("faiss_CustomIOWriter_free", JAVA_INT, ADDRESS);
+
+  public static void freeCustomIOWriter(MemorySegment customIOWriterPointer) {
+callAndHandleError(FREE_CUSTOM_IO_WRITER, customIOWriterPointer);
+  }
+
+  private static final MethodHandle FREE_CUST

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-11 Thread via GitHub


navneet1v commented on code in PR #14178:
URL: https://github.com/apache/lucene/pull/14178#discussion_r1952038874


##
lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java:
##

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-11 Thread via GitHub


navneet1v commented on code in PR #14178:
URL: https://github.com/apache/lucene/pull/14178#discussion_r1952063404


##
lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java:
##

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-02-11 Thread via GitHub


kaivalnp commented on code in PR #14178:
URL: https://github.com/apache/lucene/pull/14178#discussion_r1952061362


##
lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java:
##

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-11 Thread via GitHub


gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2652803442

   > applied the 0xFF mask to scratch in the shift loop
   
   This helps generate `vpand` in the assembly, but does not help performance much.
   
   > Sorry for pushing
   
   Not at all, it's interesting & exciting to play with how loops get auto-vectorized!
   
   >  if we could get auto-vectorization to do the right thing, then this would automatically benefit all users, not only those who enable the vector module.
   
   +1
   
   I tried minimizing the code to see when the loop gets vectorized, using a simple `ShiftMaskBenchmark` (added to this patch temporarily to show the code). It tells me that the JIT cannot vectorize the loop when the offset is not predictable. This might explain why the remainder decode / bpv16 decoding cannot be vectorized.
   
   asm: 
   
[perf_asm.log](https://github.com/user-attachments/files/18763358/perf_asm.log)
   
   ```
   java -version
   
   openjdk version "23.0.2" 2025-01-21
   OpenJDK Runtime Environment (build 23.0.2+7-58)
   OpenJDK 64-Bit Server VM (build 23.0.2+7-58, mixed mode, sharing)
   
   Mac M2
   
   Benchmark                          Mode  Cnt     Score     Error   Units
   ShiftMaskBenchmark.fixOffset      thrpt    5  9131.716 ±  49.914  ops/ms
   ShiftMaskBenchmark.varOffset      thrpt    5  3019.323 ± 120.261  ops/ms
   
   Linux AVX512
   
   Benchmark                          Mode  Cnt     Score    Error   Units
   ShiftMaskBenchmark.fixOffset      thrpt    5  5021.907 ± 13.400  ops/ms
   ShiftMaskBenchmark.fixOffset:asm  thrpt          NaN               ---
   ShiftMaskBenchmark.varOffset      thrpt    5  1115.164 ±  0.958  ops/ms
   ShiftMaskBenchmark.varOffset:asm  thrpt          NaN               ---
   ```
   
   InnerLoop seems like a good option if this conclusion is true. I'll see if I can improve it a bit.
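   
   To make the comparison concrete, here is a minimal sketch of the two loop shapes being measured (hypothetical code, not the actual `ShiftMaskBenchmark` from the patch); the only difference is whether the read offset is a compile-time constant:
   
   ```
   // Hypothetical reduction of the benchmark: identical shift+mask body; only
   // the predictability of the read offset differs. Assumes src is at least
   // dst.length + offset elements long.
   static void fixOffset(int[] src, int[] dst) {
     for (int i = 0; i < dst.length; i++) {
       dst[i] = (src[i + 16] >>> 8) & 0xFF; // constant offset: auto-vectorized
     }
   }

   static void varOffset(int[] src, int[] dst, int offset) {
     for (int i = 0; i < dst.length; i++) {
       dst[i] = (src[i + offset] >>> 8) & 0xFF; // runtime offset: stays scalar
     }
   }
   ```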
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-11 Thread via GitHub


jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2652267868

   > current bpv=24 gets vectorized on the shift loop, but not for the remainder loop.
   
   This is an interesting observation. I wonder if a small refactoring could 
help it get auto-vectorized? E.g. what if we applied the `0xFF` mask to 
`scratch` in the shift loop rather than the remainder loop? Or if we split the 
remainder loop into 3 loops, one for each 8 bits that get contributed to the 
value?
   
   Sorry for pushing, but if we could get auto-vectorization to do the right 
thing, then this would automatically benefit all users, not only those who 
enable the vector module.
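   
   For illustration, a hedged sketch of the second idea (the byte-plane layout and names below are assumptions for this sketch, not the actual BKD encoding): each of the three passes has a fixed stride and a constant shift, i.e. the same shape as the shift loop that already auto-vectorizes.
   
   ```
   // Hypothetical sketch: build each 24-bit value from three byte planes in
   // three fixed-stride passes instead of one loop with unpredictable offsets.
   static void decodeRemainder24(byte[] hi, byte[] mid, byte[] lo, int[] out, int count) {
     for (int i = 0; i < count; i++) {
       out[i] = (hi[i] & 0xFF) << 16; // top 8 bits
     }
     for (int i = 0; i < count; i++) {
       out[i] |= (mid[i] & 0xFF) << 8; // middle 8 bits
     }
     for (int i = 0; i < count; i++) {
       out[i] |= lo[i] & 0xFF; // low 8 bits
     }
   }
   ```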


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-11 Thread via GitHub


gsmiller commented on PR #14204:
URL: https://github.com/apache/lucene/pull/14204#issuecomment-2651798767

   This looks good to me @jpountz. I think it makes sense to put this in sandbox, but I'd personally also be fine with leaving it where you initially had it. (I think this also highlights that it would be nice to get the sandbox faceting module out of sandbox and reconciled with the traditional faceting module sooner rather than later.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] TestForTooMuchCloning.test fails [lucene]

2025-02-11 Thread via GitHub


benwtrent opened a new issue, #14220:
URL: https://github.com/apache/lucene/issues/14220

   ### Description
   
   I have seen this fail in periodic CI builds and on some PR builds.
   
   ```
   TestForTooMuchCloning > test FAILED
    java.lang.AssertionError: too many calls to IndexInput.clone during merging: 519
    at __randomizedtesting.SeedInfo.seed([274680B90E738EFA:AF12BF63A08FE302]:0)
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.assertTrue(Assert.java:42)
    at org.apache.lucene.index.TestForTooMuchCloning.test(TestForTooMuchCloning.java:63)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
   
   ```
   
   Or:
   
   ```
   TestForTooMuchCloning > test FAILED
    java.lang.AssertionError: too many calls to IndexInput.clone during merging: 503
    at __randomizedtesting.SeedInfo.seed([FB5D79942E80A9F0:7309464E807CC408]:0)
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.assertTrue(Assert.java:42)
    at org.apache.lucene.index.TestForTooMuchCloning.test(TestForTooMuchCloning.java:63)
   ```
   
   ### Gradle command to reproduce
   
   ```
    ./gradlew test --tests TestForTooMuchCloning.test -Dtests.seed=FB5D79942E80A9F0 -Dtests.locale=sl-Latn-SI -Dtests.timezone=America/Mexico_City -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Correct hashCode of SynonymQuery [lucene]

2025-02-11 Thread via GitHub


mayya-sharipova merged PR #14217:
URL: https://github.com/apache/lucene/pull/14217


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub


benwtrent commented on issue #14208:
URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651681430

   Ah, ok :) no offense meant.
   
   It just struck me that, while they are ideas that might be worth exploring, they were a little too general (in line with what I have seen from LLMs in the past).
   
   If we could merge portions of the graph together at a time (e.g. 
bootstrapping from clusters), we could reduce memory and improve performance. 
It's just tricky to get right and nobody has tackled it yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub


Vikasht34 commented on issue #14208:
URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651696435

   @benwtrent Should we create a separate issue? I can take a stab at implementing it with this paper: https://www.arxiv.org/pdf/1908.00814v5


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Clean up public non-final statics [lucene]

2025-02-11 Thread via GitHub


msfroh opened a new pull request, #14221:
URL: https://github.com/apache/lucene/pull/14221

   ### Description
   
   Following up on https://github.com/apache/lucene/issues/14151 and 
https://github.com/apache/lucene/issues/14152, I decided to grep for other 
`public static` non-`final` variables in the Lucene codebase.
   
   This change makes `static`s `final` wherever it is both possible and appropriate. (That is, the variables really do seem to be constants.) I made a couple `private` instead, since they are modified, but only from within their own class.
   
   I also updated some old-school code where IntelliJ flagged opportunities to 
use newer Java idioms.
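   
   As a hypothetical example of the pattern being cleaned up (the name below is illustrative, not taken from the actual diff):
   
   ```
   class Buffers {
     // Before: a mutable public static that any caller could reassign at runtime.
     // public static int DEFAULT_BUFFER_SIZE = 1024;

     // After: final makes it a true constant; statics that really are mutated
     // become private (with accessors) instead.
     public static final int DEFAULT_BUFFER_SIZE = 1024;
   }
   ```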


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Bugfix/fix hnsw search termination check [lucene]

2025-02-11 Thread via GitHub


benwtrent merged PR #14215:
URL: https://github.com/apache/lucene/pull/14215


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]

2025-02-11 Thread via GitHub


benwtrent commented on issue #14208:
URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651707316

   @Vikasht34 there is already a general "make hnsw merges faster" issue, so I 
think you can just refer to it.
   
   I look forward to the results!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestForTooMuchCloning.test fails [lucene]

2025-02-11 Thread via GitHub


benwtrent commented on issue #14220:
URL: https://github.com/apache/lucene/issues/14220#issuecomment-2651975052

   Took a bit to find the first commit where this fails. It sometimes succeeds, so git bisect was tricky.
   
   The first commit I could find is this one (it fails pretty reliably with this seed on it):
   
   https://github.com/apache/lucene/commit/d2c69c1472c7af87d7e619dfcabe207a4a614c20
   
   On the commit before this one, I ran 1000s of tests and it never failed with this seed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Consistent KNN query results with multiple leafs [lucene]

2025-02-11 Thread via GitHub


benwtrent commented on PR #14191:
URL: https://github.com/apache/lucene/pull/14191#issuecomment-2652285114

   OK, I ran a slightly modified version of this: 
https://github.com/apache/lucene/compare/main...benwtrent:lucene:feature/consistent-sharing-knn?expand=1
   
   I indexed 8M docs with Elasticsearch's default merging. This resulted in 
multiple different tiers and many segments. 
   
   The results show it's better than no sharing at all (about 2x fewer nodes visited), but not nearly as good (from a visited standpoint) as the current sharing model (sharing between all segments).
   
   ## No sharing:
   ```
    recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
     1.000      213.400  8000000   100     100   412280           128        1.000
   ```
   
   ## Current Sharing:
   
   ### Single threaded:
   
   ```
    recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
     0.949       59.900  8000000   100     100    87299           128        1.000
   ```
   
   
   ### 8 Threads (multiple runs)
   ```
    recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
     0.967       12.000  8000000   100     100    84477           128        1.000
     0.967       10.100  8000000   100     100    84568           128        1.000
     0.949       12.300  8000000   100     100    85034           128        1.000
     0.949       13.300  8000000   100     100    84267           128        1.000
     0.967       11.700  8000000   100     100    86035           128        1.000
     0.949       12.000  8000000   100     100    84993           128        1.000
     0.949       12.700  8000000   100     100    85139           128        1.000
     0.967       10.000  8000000   100     100    85590           128        1.000
     0.949       10.200  8000000   100     100    85106           128        1.000
     0.967       12.600  8000000   100     100    85817           128        1.000
   ```
   
   ## New Sharing:
   
   ### Single thread:
   
   ```
    recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
     0.957      119.500  8000000   100     100   193402           128        1.000
   ```
   
   ### 8 Threads (multiple runs)
   
   ```
    recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
     0.957       16.400  8000000   100     100   193402           128        1.000
     0.957       14.600  8000000   100     100   193402           128        1.000
     0.957       16.800  8000000   100     100   193402           128        1.000
     0.957       16.400  8000000   100     100   193402           128        1.000
     0.957       16.700  8000000   100     100   193402           128        1.000
     0.957       16.000  8000000   100     100   193402           128        1.000
     0.957       16.200  8000000   100     100   193402           128        1.000
     0.957       14.400  8000000   100     100   193402           128        1.000
     0.957       19.500  8000000   100     100   193402           128        1.000
     0.957       18.600  8000000   100     100   193402           128        1.000
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Refactor QueryCache to improve concurrency and performance [lucene]

2025-02-11 Thread via GitHub


sgup432 opened a new issue, #14222:
URL: https://github.com/apache/lucene/issues/14222

   ### Description
   
   Given the significant role of LRUQueryCache in Lucene, I see opportunities to enhance its performance. Although there have been many discussions on this topic, like [here](https://github.com/apache/lucene/issues/10081), they haven't led anywhere.
   
   Currently, QueryCache uses a `Map<CacheKey, LeafCache>`, where:
   
   - **CacheKey** is tied to a specific segment.
   - **LeafCache** is per-segment and maintains a `Map<Query, CacheAndCount>`, where CacheAndCount is essentially a docID set.
   
   ### Current Issue:
   We use a global read/write lock to put items into the cache. However, since CacheKey is our primary key, this limits concurrency.
   
   For example, if there are only 10 segments but 10,000 unique queries, we end up with just 10 unique cache keys. This means a large number of requests compete for the same write lock, significantly degrading performance by either waiting or skipping the cache entirely when the lock cannot be obtained.
   
   ### One Simple Solution:
   
   - **Introduce a composite key**
     - Instead of CacheKey alone, use (CacheKey, Query) as the key.
     - The new structure would be `Map<CompositeKey, CacheAndCount>`.
   
   - **Partition the cache for better concurrency** (see the sketch after this list)
     - Create multiple LRU partitions (e.g., 8, 16, or 64; configurable).
     - Each partition has its own read/write lock to reduce contention.
     - Assign incoming requests to a specific partition using, for example, `hash(CacheKey, Query) % partition_count`.
   
   - I also see we use uniqueQueries/mostRecentlyUsedQueries [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L94-L98) to maintain an LRU list for the queries. I think we can remove it and instead maintain a separate LRU list per partition, keyed by CompositeKey.
   
   - For eviction & cleanup, we can collect stale `CacheKey` entries in a `Set` when a segment is closed due to a merge. A background thread later iterates over cache entries and clears them, similar to how the RequestCache in OpenSearch works.
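   
   A minimal sketch of the partitioned structure described above (all names and choices are illustrations of this proposal, not Lucene's actual LRUQueryCache; an access-ordered `LinkedHashMap` stands in for the per-partition LRU list):
   
   ```
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Hypothetical sketch, not a drop-in replacement for LRUQueryCache.
   final class PartitionedQueryCache<K, Q, V> {

     private record CompositeKey<K, Q>(K segmentKey, Q query) {}

     private final Map<CompositeKey<K, Q>, V>[] partitions; // one LRU map per partition
     private final ReentrantReadWriteLock[] locks; // one lock per partition

     @SuppressWarnings("unchecked")
     PartitionedQueryCache(int partitionCount, int maxEntriesPerPartition) {
       partitions = new Map[partitionCount];
       locks = new ReentrantReadWriteLock[partitionCount];
       for (int i = 0; i < partitionCount; i++) {
         // accessOrder=true makes LinkedHashMap evict in LRU order.
         partitions[i] =
             new LinkedHashMap<>(16, 0.75f, true) {
               @Override
               protected boolean removeEldestEntry(Map.Entry<CompositeKey<K, Q>, V> eldest) {
                 return size() > maxEntriesPerPartition;
               }
             };
         locks[i] = new ReentrantReadWriteLock();
       }
     }

     private int partitionFor(CompositeKey<K, Q> key) {
       return Math.floorMod(key.hashCode(), partitions.length);
     }

     V get(K segmentKey, Q query) {
       CompositeKey<K, Q> key = new CompositeKey<>(segmentKey, query);
       int p = partitionFor(key);
       // An access-ordered LinkedHashMap mutates on get(), so reads take the
       // write lock too; the win is that contention is per-partition, not global.
       locks[p].writeLock().lock();
       try {
         return partitions[p].get(key);
       } finally {
         locks[p].writeLock().unlock();
       }
     }

     void put(K segmentKey, Q query, V value) {
       CompositeKey<K, Q> key = new CompositeKey<>(segmentKey, query);
       int p = partitionFor(key);
       locks[p].writeLock().lock();
       try {
         partitions[p].put(key, value);
       } finally {
         locks[p].writeLock().unlock();
       }
     }
   }
   ```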
   
   
   Let me know what you think. I think it might work pretty well (and is simple enough), but I am open to other ideas.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org