Re: [PR] Lucene99HnswVectorsReader[.readFields] readability tweaks [lucene]

2024-07-12 Thread via GitHub


cpoerschke merged PR #13532:
URL: https://github.com/apache/lucene/pull/13532


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] in KnnVectorsWriter reduce code duplication w.r.t. MergedVectorValues.merge(Float|Byte)VectorValues [lucene]

2024-07-12 Thread via GitHub


cpoerschke commented on code in PR #13539:
URL: https://github.com/apache/lucene/pull/13539#discussion_r1675576429


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -22,6 +22,7 @@
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
+import java.util.function.BiFunction;
 import java.util.Objects;

Review Comment:
   ```suggestion
   import java.util.Objects;
   import java.util.function.BiFunction;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] in KnnVectorsWriter reduce code duplication w.r.t. MergedVectorValues.merge(Float|Byte)VectorValues [lucene]

2024-07-12 Thread via GitHub


cpoerschke merged PR #13539:
URL: https://github.com/apache/lucene/pull/13539


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] "gradlew clean check" results in internal gradle error "Unable to make progress running work." [lucene]

2024-07-12 Thread via GitHub


dweiss opened a new issue, #13567:
URL: https://github.com/apache/lucene/issues/13567

   ### Description
   
   As reported by Chris, running "gradlew clean check" on main results in this 
nasty (or its variations):
   ```
   Unable to make progress running work. The following items are queued for 
execution but none of them can be started:
 - Build ':build-infra':
 - Waiting for nodes:
 - :build-infra:clean (state=SHOULD_RUN, 
dependencies=COMPLETE_AND_SUCCESSFUL, group=task group 0, dependencies=[Resolve 
mutations for :build-infra:clean (EXECUTED)], has-failed-dependency=false )
 - :build-infra:forbiddenApis (state=SHOULD_RUN, 
dependencies=NOT_COMPLETE, group=task group 1, 
dependencies=[:build-infra:forbiddenApisMain (EXECUTED), 
:build-infra:forbiddenApisTest (SHOULD_RUN)], 
waiting-for=[:build-infra:forbiddenApisTest (SHOULD_RUN)], 
has-failed-dependency=false )
 - Nodes ready to start:
 - :build-infra:clean
 - :build-infra:compileTestJava
 - :build-infra:processTestResources
 - Reachable nodes:
 - :build-infra:compileTestJava (state=SHOULD_RUN, 
dependencies=COMPLETE_AND_SUCCESSFUL, group=task group 1, dependencies=[Resolve 
mutations for :build-infra:compileTestJava (EXECUTED), destroyer locations for 
task group 0 (EXECUTED), :build-infra:classes (EXECUTED), 
:build-infra:compileJava (EXECUTED)], has-failed-dependency=false )
 - :build-infra:processTestResources (state=SHOULD_RUN, 
dependencies=COMPLETE_AND_SUCCESSFUL, group=task group 1, dependencies=[Resolve 
mutations for :build-infra:processTestResources (EXECUTED), destroyer locations 
for task group 0 (EXECUTED)], has-failed-dependency=false )
 - :build-infra:testClasses (state=SHOULD_RUN, 
dependencies=NOT_COMPLETE, group=task group 1, 
dependencies=[:build-infra:compileTestJava (SHOULD_RUN), 
:build-infra:processTestResources (SHOULD_RUN)], 
waiting-for=[:build-infra:processTestResources (SHOULD_RUN), 
:build-infra:compileTestJava (SHOULD_RUN)], has-failed-dependency=false )
 - :build-infra:forbiddenApisTest (state=SHOULD_RUN, 
dependencies=NOT_COMPLETE, group=task group 1, 
dependencies=[:build-infra:classes (EXECUTED), :build-infra:compileJava 
(EXECUTED), :build-infra:compileTestJava (SHOULD_RUN), :build-infra:testClasses 
(SHOULD_RUN)], waiting-for=[:build-infra:compileTestJava (SHOULD_RUN), 
:build-infra:testClasses (SHOULD_RUN)], has-failed-dependency=false )
 - Scheduling events:
 - node added to plan: destroyer locations for task group 0, when: 
scheduled, state: SHOULD_RUN, dependencies: 1, is ready node? false
 - Ordinal groups:
 - group 0 entry nodes: [:build-infra:clean (SHOULD_RUN)]
 - group 1 entry nodes: [:build-infra:forbiddenApis (SHOULD_RUN)]
 - Workers waiting for work: 12
 - Stopped workers: 3
   ```
   
   I've dumped task execution ordering using this plugin:
   ```
 id 'org.barfuin.gradle.taskinfo' version '2.2.0'
   ```
   and I can see composite project tasks are intertwined with top level project 
tasks - this may lead to insanity.
   
   **The only workaround for now is to not run 'clean' together with other 
tasks.**
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] "gradlew clean check" results in internal gradle error "Unable to make progress running work." [lucene]

2024-07-12 Thread via GitHub


dweiss commented on issue #13567:
URL: https://github.com/apache/lucene/issues/13567#issuecomment-2225232211

   I think this comment provides an explanation of what is happening:
   https://github.com/gradle/gradle/issues/23585#issuecomment-1403862031
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Minor cleanup in some Facet tests [lucene]

2024-07-12 Thread via GitHub


mikemccand commented on PR #13489:
URL: https://github.com/apache/lucene/pull/13489#issuecomment-2225379585

   Thank you @slow-J and @stefanvodita!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-12 Thread via GitHub


cpoerschke commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1675724584


##
lucene/core/src/java/org/apache/lucene/index/FieldInfo.java:
##
@@ -92,6 +97,8 @@ public FieldInfo(
   int vectorDimension,
   VectorEncoding vectorEncoding,
   VectorSimilarityFunction vectorSimilarityFunction,
+  boolean isTensor,
+  TensorSimilarityFunction.Aggregation tensorAggregate,

Review Comment:
   added 
https://github.com/apache/lucene/pull/13525/commits/b2b95cdb31c7deb73c8ab5a70e002bef4414
 to give that a go



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-12 Thread via GitHub


cpoerschke commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1675735652


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsWriter.java:
##
@@ -0,0 +1,824 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene99;
+
+import static 
org.apache.lucene.codecs.lucene99.Lucene99FlatMultiVectorsFormat.DIRECT_MONOTONIC_BLOCK_SHIFT;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatVectorsScorer;
+import org.apache.lucene.codecs.hnsw.FlatVectorsWriter;
+import org.apache.lucene.codecs.lucene95.OffHeapByteVectorValues;
+import org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues;
+import org.apache.lucene.codecs.lucene95.OrdToDocDISIReaderConfiguration;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.index.DocsWithFieldSet;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.ReadAdvice;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.ByteMultiVectorValue;
+import org.apache.lucene.util.FloatMultiVectorValue;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.LongValues;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.hnsw.CloseableRandomVectorScorerSupplier;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/**
+ * Writes vector values to index segments.
+ *
+ * @lucene.experimental
+ */
+// noCommit - pending tests
+public final class Lucene99FlatMultiVectorsWriter extends FlatVectorsWriter {
+
+  private static final long SHALLLOW_RAM_BYTES_USED =
+  
RamUsageEstimator.shallowSizeOfInstance(Lucene99FlatMultiVectorsWriter.class);
+
+  private final SegmentWriteState segmentWriteState;
+  private final IndexOutput meta, vectorData;
+
+  private final List> fields = new ArrayList<>();
+  private boolean finished;
+
+  public Lucene99FlatMultiVectorsWriter(SegmentWriteState state, 
FlatVectorsScorer scorer)
+  throws IOException {
+super(scorer);
+segmentWriteState = state;
+String metaFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name,
+state.segmentSuffix,
+Lucene99FlatMultiVectorsFormat.META_EXTENSION);
+
+String vectorDataFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name,
+state.segmentSuffix,
+Lucene99FlatMultiVectorsFormat.VECTOR_DATA_EXTENSION);
+
+boolean success = false;
+try {
+  meta = state.directory.createOutput(metaFileName, state.context);
+  vectorData = state.directory.createOutput(vectorDataFileName, 
state.context);
+
+  CodecUtil.writeIndexHeader(
+  meta,
+  Lucene99FlatMultiVectorsFormat.META_CODEC_NAME,
+  Lucene99FlatMultiVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.writeIndexHeader(
+  vectorData,
+  Lucene99FlatMultiVectorsFormat.VECTOR_DATA_CODEC_NAME,
+  Lucene99FlatMultiVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  success = true;
+} finally {
+  if (success == false) {
+IOUtils.closeWhileHandlingEx

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-12 Thread via GitHub


cpoerschke commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1675737097


##
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##
@@ -1527,15 +1549,20 @@ void setPoints(int dimensionCount, int 
indexDimensionCount, int numBytes) {
 }
 
 void setVectors(
-VectorEncoding encoding, VectorSimilarityFunction similarityFunction, 
int dimension) {
+VectorEncoding encoding,
+VectorSimilarityFunction similarityFunction,
+int dimension,
+MultiVectorSimilarityFunction.Aggregation multiVectorAggregate) {

Review Comment:
   added 
https://github.com/apache/lucene/pull/13525/commits/eacc63cb86f12a4087996c310a68ed3882d6f0f6
 to fold `setMultiVectors` into `setVectors` here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-12 Thread via GitHub


cpoerschke commented on code in PR #13525:
URL: https://github.com/apache/lucene/pull/13525#discussion_r1675739599


##
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##
@@ -452,7 +465,8 @@ synchronized int addOrGet(FieldInfo fi) {
 new FieldVectorProperties(
 fi.getVectorDimension(),
 fi.getVectorEncoding(),
-fi.getVectorSimilarityFunction()));
+fi.getVectorSimilarityFunction()),
+new 
FieldMultiVectorProperties(fi.getMultiVectorSimilarityFunction()));

Review Comment:
   Wondering if a separate `FieldMultiVectorProperties` is needed or if we 
could tack onto the `FieldVectorProperties` that is already there?
   ```suggestion
   fi.getVectorSimilarityFunction(),
   fi.getMultiVectorSimilarityFunction()));
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Compute facets while collecting [lucene]

2024-07-12 Thread via GitHub


epotyom opened a new pull request, #13568:
URL: https://github.com/apache/lucene/pull/13568

   @Shradha26 and I are working on a new faceting implementation in the sandbox 
module. With this, we are proposing the following new features to Lucene’s 
faceting and aggregation capabilities - 
   
    Compute multiple associated values for all field types
   
   For an index with documents that have facet field `color` and numeric field 
`popularity`, today, we can associate the MAX or SUM of `popularity` with the 
facets of color using the `TaxonomyFloatFacetAssociations` implementations. We 
are limited to computing only one associated value in one iteration per 
Faceting implementation. But sometimes, we also want to compute multiple values 
 e.g. SUM(popularity) and MAX(bm25_score). With the new sandbox faceting, we 
have native support for multiple associated values with each facet without 
requiring to iterate through the documents multiple times.
   
   Current faceting supports computing associated values for Taxonomy fields 
only. With the new implementation, users will be able to compute associated 
values for any faceting implementation. 
   
    Computing facets during collection
   
   Today `FacetsCollector#collect` only collects docIDs, and facet computations 
happens in each `Facets` instance in its own doc ID loop. Some numeric field 
values may be expensive to compute (e.g. some relevance scores).  In order to 
re-use the values from these fields of documents to compute the different 
associated values for different facets, we have to cache values for all docs, 
which might also be expensive.
   
   The new implementation computes facets as it collects the docIDs. I allows 
us to do the expensive computation just once for a document, and then use this 
value for all the facets the document belongs to.
   
    Decouple aggregations computation from the logic where we decide which 
facets a document belongs to
   
   The current faceting implementation is responsible for retrieving the 
different facets associated with a document and storing aggregated values or 
counts across different facets. Due to this, we have different facet 
implementations to handle each type of aggregated values - e.g 
IntTaxonomyFacets, FloatTaxonomyFacets etc. 
   
   In the proposed implementation, we have two new interfaces that help 
decouple the “faceting” and the “aggregation” logic - 
   
   1. FacetCutter - “cuts” a document into facets, yields facet IDs/ordinals 
(int).
   2. FacetRecorder - for given document ID and facet ID, record some data 
(count, long aggregations, double aggregations or anything else depending on 
selected FacetRecorder implementation)
   
   As a result, any facet type (taxonomy, numeric ranges, etc) can aggregate 
any type of values (count only, or any type or number of associated value 
aggregations)
   
    Unified, flexible facet results filtering and sorting
   
   Today, `Facets#getTopChildren`, `getAllChildren`, `getTopDims` etc method 
can be used to get facet results. This API has some limitations. One limitation 
is that each Facets implementation has slightly different logic. For example, 
most implementation sort results by count to get topN, but the ones that 
compute associations sort by aggregated value (in descending order). They also 
often have different tie-break algorithms. Some implementations tie break by 
facet ordinal (e.g. `TaxonomyFacets`), some by facet value (e.g. 
`LongValueFacetCounts`), etc. To implement different sort order, clients have 
to extend these classes and override these methods, which is not always 
convenient, e.g. some of these classes are package private.
   
   We have introduced an `OrdinalIterator` interface, which has implementations 
to perform “get children”, “get top N”, “get specific values” operations. 
Ordinals can be sorted by values, collected by FacetRecorder, by labels, etc. 
Clients can extend `OrdinalsIterator` interface to implement custom filtering 
or sort order.
   
    DrillSideways supports any collectors
   
   For the sandbox module to work, we had to change DrillSideways to support 
any type of Collector, not just FacetCollector. The challenge there is that 
different drill sideways dimension might want to use different Collector types 
and return different result types (`CollectorManager`), and Java generics 
don't make handling of unknown number of generic types very easy. In the PR it 
is solved by adding `CollectorOwner` class which keeps collectors that 
CollectorManager creates. As a result, DrillSideways doesn't have to manage 
Collectors and results itself, which allows us to use wildcard type `?` and let 
client handle collector and result types.
   
    Latency Improvement
   
   Computing facets during collection is more expensive, because we need to 
collect in each searcher slice, and then merge results in 
`CollectorManager#reduce`. At the same time, it reduces latency, becau

Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on PR #13566:
URL: https://github.com/apache/lucene/pull/13566#issuecomment-2225606218

   I tested using KnnGraphTester and in the process changed this to handle 
multiple levels (it just seemed to make sense form a consistency perspective). 
I also found there were a couple of bugs:
   * the `M`/max-connections check was off-by-one
   * added a call to `finish()` so it would actually do something!
   
   Testing w/1M 256-dim documents I got results that are within noise for 
recall, latency, and index time:
   
   |test|recall|latency|index time|
   |-|-|-|-|
   | baseline| 0.927 | 2.32 | 164755 |
   | baseline| 0.926 | 2.26 | 167982 |
   | candidate| 0.924 | 2.16 | 159109 |
   | candidate| 0.927 | 2.25 | 169839 |
   
   I printed out the graph connectivity and observed that there were many 
singleton nodes produced (nodes not connected to any other node) in the 
baseline case and these were largely eliminated in the candidate, although I 
did still see some on level 1 on one of my tests in some segments.
   
   I also printed out the time taken to do the connection - it was about 3s in 
a 3m indexing run. Overall I think this is worth doing in order to improve 
sanity.
   
   It would still be better if we could *guarantee* that the graphs are fully 
connected. It is challenging though since we assume nodes will also comply with 
max-connection limit and in theory every node in a component could be maximally 
connected which would not allow for any further connection, so I think I will 
put this off for some future and opt for P not P. 
   
   Still need to address the concurrent graph builder.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


benwtrent commented on code in PR #13566:
URL: https://github.com/apache/lucene/pull/13566#discussion_r1676046779


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -408,7 +410,23 @@ private void finish() throws IOException {
   }
 
   private void connectComponents() throws IOException {
-List components = HnswUtil.components(hnsw);
+long start = System.nanoTime();
+for (int level = 0; level < hnsw.numLevels(); level++) {
+  if (connectComponents(level) == false) {

Review Comment:
   I like the logging, we should do this for sure. But I also think this should 
be an `assert` as it should never happen in tests.



##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:
##
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
* Returns true iff level 0 of the graph is fully connected - that is every 
node is reachable from
* any entry point.
*/
   public static boolean isFullyConnected(HnswGraph knnValues) throws 
IOException {
-return componentSizes(knnValues).size() < 2;
+for (int level = 0; level < knnValues.numLevels(); level++) {
+  if (components(knnValues, level).size() > 1) {
+return false;
+  }
+}
+return true;
   }
 
   /**
* Returns the sizes of the distinct graph components on level 0. If the 
graph is fully-connected
* there will only be a single component. If the graph is empty, the 
returned list will be empty.
*/
   public static List componentSizes(HnswGraph hnsw) throws 
IOException {
-List sizes = new ArrayList<>();
+return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If 
the graph is
+   * fully-connected there will only be a single component. If the graph is 
empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws 
IOException {
+return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws 
IOException {

Review Comment:
   Do you mind providing a bit more color to this java doc. It took me a minute 
to understand that a component is really "A selection of nodes that are 
traversable to each other".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on code in PR #13566:
URL: https://github.com/apache/lucene/pull/13566#discussion_r1676077428


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -408,7 +410,23 @@ private void finish() throws IOException {
   }
 
   private void connectComponents() throws IOException {
-List components = HnswUtil.components(hnsw);
+long start = System.nanoTime();
+for (int level = 0; level < hnsw.numLevels(); level++) {
+  if (connectComponents(level) == false) {

Review Comment:
   I don't think we can make this an `assert` without some more major change 
for the reason (mentioned above) that one of the components may already be 
maximally connected (ie its nodes all have the max number of neighbors).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on code in PR #13566:
URL: https://github.com/apache/lucene/pull/13566#discussion_r1676077871


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:
##
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
* Returns true iff level 0 of the graph is fully connected - that is every 
node is reachable from
* any entry point.
*/
   public static boolean isFullyConnected(HnswGraph knnValues) throws 
IOException {
-return componentSizes(knnValues).size() < 2;
+for (int level = 0; level < knnValues.numLevels(); level++) {
+  if (components(knnValues, level).size() > 1) {
+return false;
+  }
+}
+return true;
   }
 
   /**
* Returns the sizes of the distinct graph components on level 0. If the 
graph is fully-connected
* there will only be a single component. If the graph is empty, the 
returned list will be empty.
*/
   public static List componentSizes(HnswGraph hnsw) throws 
IOException {
-List sizes = new ArrayList<>();
+return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If 
the graph is
+   * fully-connected there will only be a single component. If the graph is 
empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws 
IOException {
+return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws 
IOException {

Review Comment:
   sure - I'll explain the terminology -- it comes from graph theory 
https://en.wikipedia.org/wiki/Component_(graph_theory)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on code in PR #13566:
URL: https://github.com/apache/lucene/pull/13566#discussion_r1676091586


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -408,7 +410,23 @@ private void finish() throws IOException {
   }
 
   private void connectComponents() throws IOException {
-List components = HnswUtil.components(hnsw);
+long start = System.nanoTime();
+for (int level = 0; level < hnsw.numLevels(); level++) {
+  if (connectComponents(level) == false) {

Review Comment:
   A degenerate case is `M=1` in this case every component will have 2 nodes 
connected to each other and we cannot connect them further, but this will arise 
for M>1 too. We can try to use pruning again, but then we run the risk of 
orphaning a new component and would have to iterate and as far as I can tell 
there is no guarantee we would converge even for higher M. I think in order to 
guarantee connectedness we have to relax the number of connections limit, but 
we assume elsewhere (not only in NeighborArray) that this limit is obeyed (eg 
when searching)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


benwtrent commented on code in PR #13566:
URL: https://github.com/apache/lucene/pull/13566#discussion_r1676111453


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -408,7 +410,23 @@ private void finish() throws IOException {
   }
 
   private void connectComponents() throws IOException {
-List components = HnswUtil.components(hnsw);
+long start = System.nanoTime();
+for (int level = 0; level < hnsw.numLevels(); level++) {
+  if (connectComponents(level) == false) {

Review Comment:
   Yeah, ok, the degenerate case is indeed problematic. I think it is OK to 
make this best effort and log. 
   
   If there are still many disconnected components, the better way is for the 
number of connections to be bumped higher.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on code in PR #13566:
URL: https://github.com/apache/lucene/pull/13566#discussion_r1676154393


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:
##
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
* Returns true iff level 0 of the graph is fully connected - that is every 
node is reachable from
* any entry point.
*/
   public static boolean isFullyConnected(HnswGraph knnValues) throws 
IOException {
-return componentSizes(knnValues).size() < 2;
+for (int level = 0; level < knnValues.numLevels(); level++) {
+  if (components(knnValues, level).size() > 1) {
+return false;
+  }
+}
+return true;
   }
 
   /**
* Returns the sizes of the distinct graph components on level 0. If the 
graph is fully-connected
* there will only be a single component. If the graph is empty, the 
returned list will be empty.
*/
   public static List componentSizes(HnswGraph hnsw) throws 
IOException {
-List sizes = new ArrayList<>();
+return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If 
the graph is
+   * fully-connected there will only be a single component. If the graph is 
empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws 
IOException {
+return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws 
IOException {

Review Comment:
   After reading that I realized that this whole notion of components really is 
only well-defined for undirected graphs. Ours are directed, so this isn't 
really the same thing after all. Actually it raises a question about this whole 
business because we can make it so every node is reachable from node 0, but 
that still doesn't guarantee that every node is reachable from every other 
node. I still think this is progress, but we might want to think about how to 
measure this more global property?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on code in PR #13566:
URL: https://github.com/apache/lucene/pull/13566#discussion_r1676160815


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:
##
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
* Returns true iff level 0 of the graph is fully connected - that is every 
node is reachable from
* any entry point.
*/
   public static boolean isFullyConnected(HnswGraph knnValues) throws 
IOException {
-return componentSizes(knnValues).size() < 2;
+for (int level = 0; level < knnValues.numLevels(); level++) {
+  if (components(knnValues, level).size() > 1) {
+return false;
+  }
+}
+return true;
   }
 
   /**
* Returns the sizes of the distinct graph components on level 0. If the 
graph is fully-connected
* there will only be a single component. If the graph is empty, the 
returned list will be empty.
*/
   public static List componentSizes(HnswGraph hnsw) throws 
IOException {
-List sizes = new ArrayList<>();
+return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If 
the graph is
+   * fully-connected there will only be a single component. If the graph is 
empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws 
IOException {
+return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws 
IOException {

Review Comment:
   https://en.wikipedia.org/wiki/Directed_graph#Directed_graph_connectivity 
defines the "strongly connected" property, which I think is what we want



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on code in PR #13566:
URL: https://github.com/apache/lucene/pull/13566#discussion_r1676192442


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:
##
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
* Returns true iff level 0 of the graph is fully connected - that is every 
node is reachable from
* any entry point.
*/
   public static boolean isFullyConnected(HnswGraph knnValues) throws 
IOException {
-return componentSizes(knnValues).size() < 2;
+for (int level = 0; level < knnValues.numLevels(); level++) {
+  if (components(knnValues, level).size() > 1) {
+return false;
+  }
+}
+return true;
   }
 
   /**
* Returns the sizes of the distinct graph components on level 0. If the 
graph is fully-connected
* there will only be a single component. If the graph is empty, the 
returned list will be empty.
*/
   public static List componentSizes(HnswGraph hnsw) throws 
IOException {
-List sizes = new ArrayList<>();
+return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If 
the graph is
+   * fully-connected there will only be a single component. If the graph is 
empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws 
IOException {
+return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws 
IOException {

Review Comment:
   OK, I have an idea how we can maybe compute strongly-connected efficiently 
b/c our graphs are "mostly" undirected -- I think we can check back-links while 
walking the graph? I'll noodle on it and see if I can come up with a stronger 
test 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on PR #13566:
URL: https://github.com/apache/lucene/pull/13566#issuecomment-2225970694

   This version also applies the connectedness-checking and patching to 
concurrent graph build. It was a trivial addition. I tested and it seems to 
maybe help a bit:
   
   |test   | recall | latency | index time |
   |---||-||
   | baseline  | 0.925  | 2.29| 172225 |
   | baseline  | 0.925  | 2.39| 168809 |
   | candidate | 0.928  | 2.27| 163188 |
   | candidate | 0.928  | 2.25| 168005 |
   
   I still plan to try applying the strongly-connected graph check, so might 
want to hold off further testing until we have that. In theory it could suggest 
additional connections to add to the graph, although I suspect in practice its 
impact will be limited since what we seem to see is one big component (probably 
strongly connected?) plus a bunch of single disconnected nodes, but we'll see.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


benwtrent commented on PR #13566:
URL: https://github.com/apache/lucene/pull/13566#issuecomment-2226232962

   I have benchmarked 2 data sets with 2 scenarios (int4 and int7 quantization) 
and have found no significant difference in runtime between this branch and the 
main branch. 
   
   || e5Small build | e5Small recall | e5small vectors visited 
| CohereV2 Build | CohereV2 recall | CohereV2 vectors visited |
   
||---||-||-|--|
   | candidate int7 | 582387| 0.974  | 3090
| 1360523| 0.825   | 4219 |
   | baseline int7  | 574839| 0.974  | 3090
| 1360658| 0.825   | 4218 |
   | candidate int4 | 571491| 0.866  | 3118
| 1367738| 0.516   | 4513 |
   | baseline int4  | 586543| 0.866  | 3118
| 1381992| 0.516   | 4512 |
   
   
   We should test a previously bad case to see how much longer indexing takes & 
how much recall is improved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-12 Thread via GitHub


msokolov commented on PR #13566:
URL: https://github.com/apache/lucene/pull/13566#issuecomment-2226299595

   To make things look bad, I think we need to reduce M and/or beamWidth? In my 
test I saw some impact with M=16 and beamWidth=50


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Merge on Commit: No merges if new data is flushed (but not committed) [lucene]

2024-07-12 Thread via GitHub


ameyakarve commented on issue #13537:
URL: https://github.com/apache/lucene/issues/13537#issuecomment-2226555771

   Yeah this works on 9.9
   
   Thanks for pointing it out. I'll close this issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Merge on Commit: No merges if new data is flushed (but not committed) [lucene]

2024-07-12 Thread via GitHub


ameyakarve closed issue #13537: Merge on Commit: No merges if new data is 
flushed (but not committed)
URL: https://github.com/apache/lucene/issues/13537


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Issue #13482] Fix english grammar error in Field that stores a per-document long values for scoring [lucene]

2024-07-12 Thread via GitHub


github-actions[bot] commented on PR #13490:
URL: https://github.com/apache/lucene/pull/13490#issuecomment-2226563782

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org