Re: [PR] Lucene99HnswVectorsReader[.readFields] readability tweaks [lucene]
cpoerschke merged PR #13532: URL: https://github.com/apache/lucene/pull/13532
Re: [PR] in KnnVectorsWriter reduce code duplication w.r.t. MergedVectorValues.merge(Float|Byte)VectorValues [lucene]
cpoerschke commented on code in PR #13539: URL: https://github.com/apache/lucene/pull/13539#discussion_r1675576429

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:

```
@@ -22,6 +22,7 @@
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
+import java.util.function.BiFunction;
 import java.util.Objects;
```

Review Comment:

```suggestion
import java.util.Objects;
import java.util.function.BiFunction;
```
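The comment above only concerns import ordering; the substantive change, per the PR title, is collapsing two nearly identical merge helpers into one by passing the per-type difference as a `BiFunction`. The PR's actual diff is not reproduced in this message, so the following is only a rough, hypothetical sketch of that pattern - the record types and method names below are illustrative placeholders, not Lucene's API:

```
import java.util.List;
import java.util.function.BiFunction;

// Placeholder types standing in for the two arguments the real merge helpers take;
// these are NOT Lucene's FieldInfo/MergeState, just illustrative records.
record FieldMeta(String name) {}

record MergeContext(List<float[]> floatVectors, List<byte[]> byteVectors) {}

public class MergeDedupSketch {

  // One shared code path: the only per-type difference (how to obtain the merged
  // values for a field) is injected as a BiFunction instead of being duplicated
  // in a float-specific and a byte-specific method.
  static <V> V mergeField(
      FieldMeta field, MergeContext ctx, BiFunction<FieldMeta, MergeContext, V> mergedValues) {
    // ... shared bookkeeping (validation, logging, etc.) would live here ...
    return mergedValues.apply(field, ctx);
  }

  public static void main(String[] args) {
    MergeContext ctx =
        new MergeContext(List.of(new float[] {1f, 2f}), List.of(new byte[] {3, 4}));
    FieldMeta field = new FieldMeta("vec");

    List<float[]> floats = mergeField(field, ctx, (f, c) -> c.floatVectors());
    List<byte[]> bytes = mergeField(field, ctx, (f, c) -> c.byteVectors());
    System.out.println(floats.size() + " float and " + bytes.size() + " byte vectors merged");
  }
}
```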
Re: [PR] in KnnVectorsWriter reduce code duplication w.r.t. MergedVectorValues.merge(Float|Byte)VectorValues [lucene]
cpoerschke merged PR #13539: URL: https://github.com/apache/lucene/pull/13539
[I] "gradlew clean check" results in internal gradle error "Unable to make progress running work." [lucene]
dweiss opened a new issue, #13567: URL: https://github.com/apache/lucene/issues/13567

### Description

As reported by Chris, running "gradlew clean check" on main results in this nasty error (or variations of it):

```
Unable to make progress running work. The following items are queued for execution but none of them can be started:
- Build ':build-infra':
  - Waiting for nodes:
    - :build-infra:clean (state=SHOULD_RUN, dependencies=COMPLETE_AND_SUCCESSFUL, group=task group 0, dependencies=[Resolve mutations for :build-infra:clean (EXECUTED)], has-failed-dependency=false )
    - :build-infra:forbiddenApis (state=SHOULD_RUN, dependencies=NOT_COMPLETE, group=task group 1, dependencies=[:build-infra:forbiddenApisMain (EXECUTED), :build-infra:forbiddenApisTest (SHOULD_RUN)], waiting-for=[:build-infra:forbiddenApisTest (SHOULD_RUN)], has-failed-dependency=false )
  - Nodes ready to start:
    - :build-infra:clean
    - :build-infra:compileTestJava
    - :build-infra:processTestResources
  - Reachable nodes:
    - :build-infra:compileTestJava (state=SHOULD_RUN, dependencies=COMPLETE_AND_SUCCESSFUL, group=task group 1, dependencies=[Resolve mutations for :build-infra:compileTestJava (EXECUTED), destroyer locations for task group 0 (EXECUTED), :build-infra:classes (EXECUTED), :build-infra:compileJava (EXECUTED)], has-failed-dependency=false )
    - :build-infra:processTestResources (state=SHOULD_RUN, dependencies=COMPLETE_AND_SUCCESSFUL, group=task group 1, dependencies=[Resolve mutations for :build-infra:processTestResources (EXECUTED), destroyer locations for task group 0 (EXECUTED)], has-failed-dependency=false )
    - :build-infra:testClasses (state=SHOULD_RUN, dependencies=NOT_COMPLETE, group=task group 1, dependencies=[:build-infra:compileTestJava (SHOULD_RUN), :build-infra:processTestResources (SHOULD_RUN)], waiting-for=[:build-infra:processTestResources (SHOULD_RUN), :build-infra:compileTestJava (SHOULD_RUN)], has-failed-dependency=false )
    - :build-infra:forbiddenApisTest (state=SHOULD_RUN, dependencies=NOT_COMPLETE, group=task group 1, dependencies=[:build-infra:classes (EXECUTED), :build-infra:compileJava (EXECUTED), :build-infra:compileTestJava (SHOULD_RUN), :build-infra:testClasses (SHOULD_RUN)], waiting-for=[:build-infra:compileTestJava (SHOULD_RUN), :build-infra:testClasses (SHOULD_RUN)], has-failed-dependency=false )
  - Scheduling events:
    - node added to plan: destroyer locations for task group 0, when: scheduled, state: SHOULD_RUN, dependencies: 1, is ready node? false
- Ordinal groups:
  - group 0 entry nodes: [:build-infra:clean (SHOULD_RUN)]
  - group 1 entry nodes: [:build-infra:forbiddenApis (SHOULD_RUN)]
- Workers waiting for work: 12
- Stopped workers: 3
```

I've dumped task execution ordering using this plugin:

```
id 'org.barfuin.gradle.taskinfo' version '2.2.0'
```

and I can see that composite-project tasks are intertwined with top-level project tasks - this may lead to insanity.

**The only workaround for now is to not run 'clean' together with other tasks.**

### Version and environment details

_No response_
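In practice, the workaround described above amounts to splitting the single `gradlew clean check` invocation into two separate runs, e.g.:

```
./gradlew clean
./gradlew check
```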
Re: [I] "gradlew clean check" results in internal gradle error "Unable to make progress running work." [lucene]
dweiss commented on issue #13567: URL: https://github.com/apache/lucene/issues/13567#issuecomment-2225232211

I think this comment provides an explanation of what is happening: https://github.com/gradle/gradle/issues/23585#issuecomment-1403862031
Re: [PR] Minor cleanup in some Facet tests [lucene]
mikemccand commented on PR #13489: URL: https://github.com/apache/lucene/pull/13489#issuecomment-2225379585

Thank you @slow-J and @stefanvodita!
Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
cpoerschke commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1675724584

## lucene/core/src/java/org/apache/lucene/index/FieldInfo.java:

```
@@ -92,6 +97,8 @@ public FieldInfo(
       int vectorDimension,
       VectorEncoding vectorEncoding,
       VectorSimilarityFunction vectorSimilarityFunction,
+      boolean isTensor,
+      TensorSimilarityFunction.Aggregation tensorAggregate,
```

Review Comment:
added https://github.com/apache/lucene/pull/13525/commits/b2b95cdb31c7deb73c8ab5a70e002bef4414 to give that a go
Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
cpoerschke commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1675735652

## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsWriter.java:

```
@@ -0,0 +1,824 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene99;
+
+import static org.apache.lucene.codecs.lucene99.Lucene99FlatMultiVectorsFormat.DIRECT_MONOTONIC_BLOCK_SHIFT;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatVectorsScorer;
+import org.apache.lucene.codecs.hnsw.FlatVectorsWriter;
+import org.apache.lucene.codecs.lucene95.OffHeapByteVectorValues;
+import org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues;
+import org.apache.lucene.codecs.lucene95.OrdToDocDISIReaderConfiguration;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.index.DocsWithFieldSet;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.ReadAdvice;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.ByteMultiVectorValue;
+import org.apache.lucene.util.FloatMultiVectorValue;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.LongValues;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.hnsw.CloseableRandomVectorScorerSupplier;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/**
+ * Writes vector values to index segments.
+ *
+ * @lucene.experimental
+ */
+// noCommit - pending tests
+public final class Lucene99FlatMultiVectorsWriter extends FlatVectorsWriter {
+
+  private static final long SHALLLOW_RAM_BYTES_USED =
+      RamUsageEstimator.shallowSizeOfInstance(Lucene99FlatMultiVectorsWriter.class);
+
+  private final SegmentWriteState segmentWriteState;
+  private final IndexOutput meta, vectorData;
+
+  private final List> fields = new ArrayList<>();
+  private boolean finished;
+
+  public Lucene99FlatMultiVectorsWriter(SegmentWriteState state, FlatVectorsScorer scorer)
+      throws IOException {
+    super(scorer);
+    segmentWriteState = state;
+    String metaFileName =
+        IndexFileNames.segmentFileName(
+            state.segmentInfo.name,
+            state.segmentSuffix,
+            Lucene99FlatMultiVectorsFormat.META_EXTENSION);
+
+    String vectorDataFileName =
+        IndexFileNames.segmentFileName(
+            state.segmentInfo.name,
+            state.segmentSuffix,
+            Lucene99FlatMultiVectorsFormat.VECTOR_DATA_EXTENSION);
+
+    boolean success = false;
+    try {
+      meta = state.directory.createOutput(metaFileName, state.context);
+      vectorData = state.directory.createOutput(vectorDataFileName, state.context);
+
+      CodecUtil.writeIndexHeader(
+          meta,
+          Lucene99FlatMultiVectorsFormat.META_CODEC_NAME,
+          Lucene99FlatMultiVectorsFormat.VERSION_CURRENT,
+          state.segmentInfo.getId(),
+          state.segmentSuffix);
+      CodecUtil.writeIndexHeader(
+          vectorData,
+          Lucene99FlatMultiVectorsFormat.VECTOR_DATA_CODEC_NAME,
+          Lucene99FlatMultiVectorsFormat.VERSION_CURRENT,
+          state.segmentInfo.getId(),
+          state.segmentSuffix);
+      success = true;
+    } finally {
+      if (success == false) {
+        IOUtils.closeWhileHandlingEx
```
Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
cpoerschke commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1675737097

## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:

```
@@ -1527,15 +1549,20 @@ void setPoints(int dimensionCount, int indexDimensionCount, int numBytes) {
     }
 
     void setVectors(
-        VectorEncoding encoding, VectorSimilarityFunction similarityFunction, int dimension) {
+        VectorEncoding encoding,
+        VectorSimilarityFunction similarityFunction,
+        int dimension,
+        MultiVectorSimilarityFunction.Aggregation multiVectorAggregate) {
```

Review Comment:
added https://github.com/apache/lucene/pull/13525/commits/eacc63cb86f12a4087996c310a68ed3882d6f0f6 to fold `setMultiVectors` into `setVectors` here
Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
cpoerschke commented on code in PR #13525: URL: https://github.com/apache/lucene/pull/13525#discussion_r1675739599

## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:

```
@@ -452,7 +465,8 @@ synchronized int addOrGet(FieldInfo fi) {
             new FieldVectorProperties(
                 fi.getVectorDimension(),
                 fi.getVectorEncoding(),
-                fi.getVectorSimilarityFunction()));
+                fi.getVectorSimilarityFunction()),
+            new FieldMultiVectorProperties(fi.getMultiVectorSimilarityFunction()));
```

Review Comment:
Wondering if a separate `FieldMultiVectorProperties` is needed or if we could tack onto the `FieldVectorProperties` that is already there?

```suggestion
                fi.getVectorSimilarityFunction(),
                fi.getMultiVectorSimilarityFunction()));
```
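To make the trade-off in that question concrete, here is a purely illustrative sketch of the two options - the records below are simplified placeholders, not the actual `FieldInfos` internals (which use enum-typed fields):

```
// Option A (current state of the PR): a separate holder just for the multi-vector attribute.
record FieldVectorProperties(int dimension, String encoding, String similarity) {}

record FieldMultiVectorProperties(String multiVectorSimilarity) {}

// Option B (what the suggestion proposes): fold the extra attribute into the existing
// vector-properties holder, so all per-field vector metadata lives in one place and
// consistency checks between old and new fields only need to compare one record.
record FieldVectorPropertiesFolded(
    int dimension, String encoding, String similarity, String multiVectorSimilarity) {}
```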
[PR] Compute facets while collecting [lucene]
epotyom opened a new pull request, #13568: URL: https://github.com/apache/lucene/pull/13568

@Shradha26 and I are working on a new faceting implementation in the sandbox module. With this, we are proposing the following new features to Lucene’s faceting and aggregation capabilities:

### Compute multiple associated values for all field types

For an index with documents that have facet field `color` and numeric field `popularity`, today we can associate the MAX or SUM of `popularity` with the facets of `color` using the `TaxonomyFloatFacetAssociations` implementations. We are limited to computing only one associated value in one iteration per faceting implementation. But sometimes we also want to compute multiple values, e.g. SUM(popularity) and MAX(bm25_score). With the new sandbox faceting, we have native support for multiple associated values with each facet without requiring multiple iterations through the documents. Current faceting supports computing associated values for taxonomy fields only; with the new implementation, users will be able to compute associated values for any faceting implementation.

### Computing facets during collection

Today `FacetsCollector#collect` only collects docIDs, and facet computation happens in each `Facets` instance in its own doc ID loop. Some numeric field values may be expensive to compute (e.g. some relevance scores). In order to re-use the values from these fields to compute the different associated values for different facets, we have to cache values for all docs, which might also be expensive. The new implementation computes facets as it collects the docIDs. It allows us to do the expensive computation just once for a document, and then use this value for all the facets the document belongs to.

### Decouple aggregation computation from the logic that decides which facets a document belongs to

The current faceting implementation is responsible for retrieving the different facets associated with a document and storing aggregated values or counts across different facets. Due to this, we have different facet implementations to handle each type of aggregated value - e.g. IntTaxonomyFacets, FloatTaxonomyFacets, etc. In the proposed implementation, we have two new interfaces that help decouple the "faceting" and the "aggregation" logic:

1. FacetCutter - "cuts" a document into facets, yields facet IDs/ordinals (int).
2. FacetRecorder - for a given document ID and facet ID, records some data (count, long aggregations, double aggregations, or anything else depending on the selected FacetRecorder implementation).

As a result, any facet type (taxonomy, numeric ranges, etc.) can aggregate any type of values (count only, or any type or number of associated value aggregations).

### Unified, flexible facet results filtering and sorting

Today, the `Facets#getTopChildren`, `getAllChildren`, `getTopDims` etc. methods can be used to get facet results. This API has some limitations. One limitation is that each Facets implementation has slightly different logic. For example, most implementations sort results by count to get topN, but the ones that compute associations sort by aggregated value (in descending order). They also often have different tie-break algorithms. Some implementations tie-break by facet ordinal (e.g. `TaxonomyFacets`), some by facet value (e.g. `LongValueFacetCounts`), etc. To implement a different sort order, clients have to extend these classes and override these methods, which is not always convenient, e.g. some of these classes are package private.

We have introduced an `OrdinalIterator` interface, which has implementations to perform "get children", "get top N", and "get specific values" operations. Ordinals can be sorted by values collected by a FacetRecorder, by labels, etc. Clients can extend the `OrdinalIterator` interface to implement custom filtering or sort order.

### DrillSideways supports any collectors

For the sandbox module to work, we had to change DrillSideways to support any type of Collector, not just FacetsCollector. The challenge there is that different drill sideways dimensions might want to use different Collector types and return different result types (`CollectorManager`), and Java generics don't make handling an unknown number of generic types very easy. In the PR it is solved by adding a `CollectorOwner` class which keeps the collectors that a CollectorManager creates. As a result, DrillSideways doesn't have to manage Collectors and results itself, which allows us to use the wildcard type `?` and let the client handle collector and result types.

### Latency improvement

Computing facets during collection is more expensive, because we need to collect in each searcher slice and then merge results in `CollectorManager#reduce`. At the same time, it reduces latency, becau
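The PR's actual sandbox interfaces are not reproduced in this message, but the FacetCutter/FacetRecorder split described above can be illustrated with a minimal, hypothetical sketch - all names and signatures below are assumptions, not the real API:

```
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative shapes only - the real sandbox interfaces in PR #13568 may differ.
interface FacetCutter {
  // Returns the facet ordinals the given document belongs to.
  int[] cut(int docId) throws IOException;
}

interface FacetRecorder {
  // Records some per-(doc, ordinal) data: a count, a long/double aggregation, etc.
  void record(int docId, int ordinal) throws IOException;
}

// The simplest possible recorder: a count per ordinal.
class CountRecorder implements FacetRecorder {
  final Map<Integer, Integer> counts = new HashMap<>();

  @Override
  public void record(int docId, int ordinal) {
    counts.merge(ordinal, 1, Integer::sum);
  }
}

// During collection, a leaf collector would wire the two roles together per hit,
// so any cutter (taxonomy, ranges, ...) can feed any recorder (count, sum, ...).
class FacetCollectionSketch {
  static void collectHit(int docId, FacetCutter cutter, FacetRecorder recorder)
      throws IOException {
    for (int ordinal : cutter.cut(docId)) {
      recorder.record(docId, ordinal);
    }
  }
}
```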
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on PR #13566: URL: https://github.com/apache/lucene/pull/13566#issuecomment-2225606218

I tested using KnnGraphTester and in the process changed this to handle multiple levels (it just seemed to make sense from a consistency perspective). I also found there were a couple of bugs:

* the `M`/max-connections check was off-by-one
* added a call to `finish()` so it would actually do something!

Testing w/1M 256-dim documents I got results that are within noise for recall, latency, and index time:

|test|recall|latency|index time|
|-|-|-|-|
| baseline | 0.927 | 2.32 | 164755 |
| baseline | 0.926 | 2.26 | 167982 |
| candidate | 0.924 | 2.16 | 159109 |
| candidate | 0.927 | 2.25 | 169839 |

I printed out the graph connectivity and observed that there were many singleton nodes produced (nodes not connected to any other node) in the baseline case, and these were largely eliminated in the candidate, although I did still see some on level 1 in one of my tests in some segments. I also printed out the time taken to do the connection - it was about 3s in a 3m indexing run.

Overall I think this is worth doing in order to improve sanity. It would still be better if we could *guarantee* that the graphs are fully connected. It is challenging though, since we assume nodes will also comply with the max-connection limit, and in theory every node in a component could be maximally connected, which would not allow for any further connection, so I think I will put this off for some future and opt for P not P. Still need to address the concurrent graph builder.
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
benwtrent commented on code in PR #13566: URL: https://github.com/apache/lucene/pull/13566#discussion_r1676046779

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:

```
@@ -408,7 +410,23 @@ private void finish() throws IOException {
   }
 
   private void connectComponents() throws IOException {
-    List components = HnswUtil.components(hnsw);
+    long start = System.nanoTime();
+    for (int level = 0; level < hnsw.numLevels(); level++) {
+      if (connectComponents(level) == false) {
```

Review Comment:
I like the logging; we should do this for sure. But I also think this should be an `assert`, as it should never happen in tests.

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:

```
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
    * Returns true iff level 0 of the graph is fully connected - that is every node is reachable from
    * any entry point.
    */
   public static boolean isFullyConnected(HnswGraph knnValues) throws IOException {
-    return componentSizes(knnValues).size() < 2;
+    for (int level = 0; level < knnValues.numLevels(); level++) {
+      if (components(knnValues, level).size() > 1) {
+        return false;
+      }
+    }
+    return true;
   }
 
   /**
    * Returns the sizes of the distinct graph components on level 0. If the graph is fully-connected
    * there will only be a single component. If the graph is empty, the returned list will be empty.
    */
   public static List componentSizes(HnswGraph hnsw) throws IOException {
-    List sizes = new ArrayList<>();
+    return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If the graph is
+   * fully-connected there will only be a single component. If the graph is empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws IOException {
+    return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws IOException {
```

Review Comment:
Do you mind providing a bit more color in this javadoc? It took me a minute to understand that a component is really "a selection of nodes that are traversable to each other".
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on code in PR #13566: URL: https://github.com/apache/lucene/pull/13566#discussion_r1676077428

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:

```
@@ -408,7 +410,23 @@ private void finish() throws IOException {
   }
 
   private void connectComponents() throws IOException {
-    List components = HnswUtil.components(hnsw);
+    long start = System.nanoTime();
+    for (int level = 0; level < hnsw.numLevels(); level++) {
+      if (connectComponents(level) == false) {
```

Review Comment:
I don't think we can make this an `assert` without some more major change, for the reason (mentioned above) that one of the components may already be maximally connected (i.e. its nodes all have the max number of neighbors).
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on code in PR #13566: URL: https://github.com/apache/lucene/pull/13566#discussion_r1676077871

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:

```
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
    * Returns true iff level 0 of the graph is fully connected - that is every node is reachable from
    * any entry point.
    */
   public static boolean isFullyConnected(HnswGraph knnValues) throws IOException {
-    return componentSizes(knnValues).size() < 2;
+    for (int level = 0; level < knnValues.numLevels(); level++) {
+      if (components(knnValues, level).size() > 1) {
+        return false;
+      }
+    }
+    return true;
   }
 
   /**
    * Returns the sizes of the distinct graph components on level 0. If the graph is fully-connected
    * there will only be a single component. If the graph is empty, the returned list will be empty.
    */
   public static List componentSizes(HnswGraph hnsw) throws IOException {
-    List sizes = new ArrayList<>();
+    return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If the graph is
+   * fully-connected there will only be a single component. If the graph is empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws IOException {
+    return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws IOException {
```

Review Comment:
sure - I'll explain the terminology -- it comes from graph theory: https://en.wikipedia.org/wiki/Component_(graph_theory)
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on code in PR #13566: URL: https://github.com/apache/lucene/pull/13566#discussion_r1676091586

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:

```
@@ -408,7 +410,23 @@ private void finish() throws IOException {
   }
 
   private void connectComponents() throws IOException {
-    List components = HnswUtil.components(hnsw);
+    long start = System.nanoTime();
+    for (int level = 0; level < hnsw.numLevels(); level++) {
+      if (connectComponents(level) == false) {
```

Review Comment:
A degenerate case is `M=1`: in this case every component will have 2 nodes connected to each other and we cannot connect them further, but this will arise for `M>1` too. We can try to use pruning again, but then we run the risk of orphaning a new component and would have to iterate, and as far as I can tell there is no guarantee we would converge, even for higher M. I think in order to guarantee connectedness we have to relax the number-of-connections limit, but we assume elsewhere (not only in NeighborArray) that this limit is obeyed (e.g. when searching).
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
benwtrent commented on code in PR #13566: URL: https://github.com/apache/lucene/pull/13566#discussion_r1676111453

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:

```
@@ -408,7 +410,23 @@ private void finish() throws IOException {
   }
 
   private void connectComponents() throws IOException {
-    List components = HnswUtil.components(hnsw);
+    long start = System.nanoTime();
+    for (int level = 0; level < hnsw.numLevels(); level++) {
+      if (connectComponents(level) == false) {
```

Review Comment:
Yeah, ok, the degenerate case is indeed problematic. I think it is OK to make this best effort and log. If there are still many disconnected components, the better way is for the number of connections to be bumped higher.
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on code in PR #13566: URL: https://github.com/apache/lucene/pull/13566#discussion_r1676154393

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:

```
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
    * Returns true iff level 0 of the graph is fully connected - that is every node is reachable from
    * any entry point.
    */
   public static boolean isFullyConnected(HnswGraph knnValues) throws IOException {
-    return componentSizes(knnValues).size() < 2;
+    for (int level = 0; level < knnValues.numLevels(); level++) {
+      if (components(knnValues, level).size() > 1) {
+        return false;
+      }
+    }
+    return true;
   }
 
   /**
    * Returns the sizes of the distinct graph components on level 0. If the graph is fully-connected
    * there will only be a single component. If the graph is empty, the returned list will be empty.
    */
   public static List componentSizes(HnswGraph hnsw) throws IOException {
-    List sizes = new ArrayList<>();
+    return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If the graph is
+   * fully-connected there will only be a single component. If the graph is empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws IOException {
+    return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws IOException {
```

Review Comment:
After reading that I realized that this whole notion of components really is only well-defined for undirected graphs. Ours are directed, so this isn't really the same thing after all. Actually it raises a question about this whole business because we can make it so every node is reachable from node 0, but that still doesn't guarantee that every node is reachable from every other node. I still think this is progress, but we might want to think about how to measure this more global property?
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on code in PR #13566: URL: https://github.com/apache/lucene/pull/13566#discussion_r1676160815

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:

```
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
    * Returns true iff level 0 of the graph is fully connected - that is every node is reachable from
    * any entry point.
    */
   public static boolean isFullyConnected(HnswGraph knnValues) throws IOException {
-    return componentSizes(knnValues).size() < 2;
+    for (int level = 0; level < knnValues.numLevels(); level++) {
+      if (components(knnValues, level).size() > 1) {
+        return false;
+      }
+    }
+    return true;
   }
 
   /**
    * Returns the sizes of the distinct graph components on level 0. If the graph is fully-connected
    * there will only be a single component. If the graph is empty, the returned list will be empty.
    */
   public static List componentSizes(HnswGraph hnsw) throws IOException {
-    List sizes = new ArrayList<>();
+    return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If the graph is
+   * fully-connected there will only be a single component. If the graph is empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws IOException {
+    return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws IOException {
```

Review Comment:
https://en.wikipedia.org/wiki/Directed_graph#Directed_graph_connectivity defines the "strongly connected" property, which I think is what we want
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on code in PR #13566: URL: https://github.com/apache/lucene/pull/13566#discussion_r1676192442

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswUtil.java:

```
@@ -30,46 +30,83 @@
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.util.FixedBitSet;
-import org.apache.lucene.util.hnsw.HnswGraph;
 
 /** Utilities for use in tests involving HNSW graphs */
-public class HnswTestUtil {
+public class HnswUtil {
 
   /**
    * Returns true iff level 0 of the graph is fully connected - that is every node is reachable from
    * any entry point.
    */
   public static boolean isFullyConnected(HnswGraph knnValues) throws IOException {
-    return componentSizes(knnValues).size() < 2;
+    for (int level = 0; level < knnValues.numLevels(); level++) {
+      if (components(knnValues, level).size() > 1) {
+        return false;
+      }
+    }
+    return true;
   }
 
   /**
    * Returns the sizes of the distinct graph components on level 0. If the graph is fully-connected
    * there will only be a single component. If the graph is empty, the returned list will be empty.
    */
   public static List componentSizes(HnswGraph hnsw) throws IOException {
-    List sizes = new ArrayList<>();
+    return componentSizes(hnsw, 0);
+  }
+
+  /**
+   * Returns the sizes of the distinct graph components on the given level. If the graph is
+   * fully-connected there will only be a single component. If the graph is empty, the returned list
+   * will be empty.
+   */
+  public static List componentSizes(HnswGraph hnsw, int level) throws IOException {
+    return components(hnsw, level).stream().map(Component::size).toList();
+  }
+
+  // Finds all connected components of the graph
+  static List components(HnswGraph hnsw, int level) throws IOException {
```

Review Comment:
OK, I have an idea how we can maybe compute strongly-connected efficiently b/c our graphs are "mostly" undirected -- I think we can check back-links while walking the graph? I'll noodle on it and see if I can come up with a stronger test
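For reference, the "strongly connected" property discussed in this thread can be checked with two reachability passes from a single start node: one over the graph and one over its reverse. The sketch below is a generic, self-contained illustration on a plain adjacency-list digraph; it deliberately does not use Lucene's HnswGraph API, and it is not the back-link optimization msokolov is hinting at.

```
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Generic illustration of a strong-connectivity check on a directed graph
// represented as adjacency lists; node ids are 0..n-1.
public class StrongConnectivitySketch {

  // A directed graph is strongly connected iff every node is reachable from node 0
  // AND node 0 is reachable from every node (i.e. every node reaches 0 in the
  // reversed graph). Two BFS passes give the answer in O(V + E).
  static boolean isStronglyConnected(List<List<Integer>> adj) {
    int n = adj.size();
    if (n == 0) {
      return true;
    }
    List<List<Integer>> reversed = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      reversed.add(new ArrayList<>());
    }
    for (int u = 0; u < n; u++) {
      for (int v : adj.get(u)) {
        reversed.get(v).add(u);
      }
    }
    return reachableCount(adj, 0) == n && reachableCount(reversed, 0) == n;
  }

  // Counts how many nodes a BFS from `start` can reach.
  static int reachableCount(List<List<Integer>> adj, int start) {
    boolean[] seen = new boolean[adj.size()];
    Deque<Integer> queue = new ArrayDeque<>();
    seen[start] = true;
    queue.add(start);
    int count = 0;
    while (!queue.isEmpty()) {
      int u = queue.poll();
      count++;
      for (int v : adj.get(u)) {
        if (!seen[v]) {
          seen[v] = true;
          queue.add(v);
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // 0 -> 1 -> 2 -> 0 is strongly connected; dropping the 2 -> 0 edge breaks it.
    List<List<Integer>> cycle = List.of(List.of(1), List.of(2), List.of(0));
    List<List<Integer>> chain = List.of(List.of(1), List.of(2), List.of());
    System.out.println(isStronglyConnected(cycle)); // true
    System.out.println(isStronglyConnected(chain)); // false
  }
}
```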
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on PR #13566: URL: https://github.com/apache/lucene/pull/13566#issuecomment-2225970694

This version also applies the connectedness-checking and patching to concurrent graph build. It was a trivial addition. I tested and it seems to maybe help a bit:

|test | recall | latency | index time |
|---|---|---|---|
| baseline | 0.925 | 2.29 | 172225 |
| baseline | 0.925 | 2.39 | 168809 |
| candidate | 0.928 | 2.27 | 163188 |
| candidate | 0.928 | 2.25 | 168005 |

I still plan to try applying the strongly-connected graph check, so might want to hold off further testing until we have that. In theory it could suggest additional connections to add to the graph, although I suspect in practice its impact will be limited since what we seem to see is one big component (probably strongly connected?) plus a bunch of single disconnected nodes, but we'll see.
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
benwtrent commented on PR #13566: URL: https://github.com/apache/lucene/pull/13566#issuecomment-2226232962

I have benchmarked 2 data sets with 2 scenarios (int4 and int7 quantization) and have found no significant difference in runtime between this branch and the main branch.

| | e5Small build | e5Small recall | e5Small vectors visited | CohereV2 build | CohereV2 recall | CohereV2 vectors visited |
|---|---|---|---|---|---|---|
| candidate int7 | 582387 | 0.974 | 3090 | 1360523 | 0.825 | 4219 |
| baseline int7 | 574839 | 0.974 | 3090 | 1360658 | 0.825 | 4218 |
| candidate int4 | 571491 | 0.866 | 3118 | 1367738 | 0.516 | 4513 |
| baseline int4 | 586543 | 0.866 | 3118 | 1381992 | 0.516 | 4512 |

We should test a previously bad case to see how much longer indexing takes & how much recall is improved.
Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]
msokolov commented on PR #13566: URL: https://github.com/apache/lucene/pull/13566#issuecomment-2226299595

To make things look bad, I think we need to reduce M and/or beamWidth? In my test I saw some impact with M=16 and beamWidth=50
Re: [I] Merge on Commit: No merges if new data is flushed (but not committed) [lucene]
ameyakarve commented on issue #13537: URL: https://github.com/apache/lucene/issues/13537#issuecomment-2226555771

Yeah, this works on 9.9. Thanks for pointing it out. I'll close this issue.
Re: [I] Merge on Commit: No merges if new data is flushed (but not committed) [lucene]
ameyakarve closed issue #13537: Merge on Commit: No merges if new data is flushed (but not committed) URL: https://github.com/apache/lucene/issues/13537
Re: [PR] [Issue #13482] Fix english grammar error in Field that stores a per-document long values for scoring [lucene]
github-actions[bot] commented on PR #13490: URL: https://github.com/apache/lucene/pull/13490#issuecomment-2226563782

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!