Re: [PR] Convert BKDConfig to a record [lucene]
iverase merged PR #13668: URL: https://github.com/apache/lucene/pull/13668 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Convert BKDConfig to a record [lucene]
uschindler commented on PR #13668: URL: https://github.com/apache/lucene/pull/13668#issuecomment-2298573177 Thanks. All fine now. 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
expani commented on PR #13521: URL: https://github.com/apache/lucene/pull/13521#issuecomment-2298718372 While trying out 2 different ways ( one a subset of other ) I found that `Bit21With3StepsEncoder` is better than `Bit21With2StepsEncoder` in aarch64 platforms with OpenJDK 21/22 whereas they are similar in x86. x86 EC2 instance type `c5.9xlarge` ``` Benchmark(encoderName) Mode Cnt ScoreError Units DocIdEncodingBenchmark.performEncodeDecode Bit21With3StepsEncoder avgt 5 6905.920 ± 33.626 ms/op DocIdEncodingBenchmark.performEncodeDecode Bit21With2StepsEncoder avgt 5 6983.592 ± 8.864 ms/op DocIdEncodingBenchmark.performEncodeDecodeBit24Encoder avgt 5 7795.338 ± 78.580 ms/op ``` aarch64 - EC2 instance type `r6g.large` ``` Benchmark(encoderName) Mode Cnt Score Error Units DocIdEncodingBenchmark.performEncodeDecode Bit21With3StepsEncoder avgt 5 7736.603 ± 82.581 ms/op DocIdEncodingBenchmark.performEncodeDecode Bit21With2StepsEncoder avgt 5 8705.315 ± 123.062 ms/op DocIdEncodingBenchmark.performEncodeDecodeBit24Encoder avgt 5 8767.359 ± 179.656 ms/op ``` JDK ``` openjdk version "22.0.2" 2024-07-16 OpenJDK Runtime Environment Corretto-22.0.2.9.1 (build 22.0.2+9-FR) OpenJDK 64-Bit Server VM Corretto-22.0.2.9.1 (build 22.0.2+9-FR, mixed mode, sharing) ``` @jpountz IMO We should use `Bit21With3StepsEncoder` in DocIdsWriter as using `Bit21With2StepsEncoder` might cause regression for workloads in aarch64 platforms. We can replace it with `Bit21With2StepsEncoder` in future when the performance is comparable to x86. Let me know your thoughts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]
mikemccand commented on issue #13565: URL: https://github.com/apache/lucene/issues/13565#issuecomment-2298787050 > update: I found we set this `DEFAULT_MAX_CONN` in two different places! Once in `Lucene99HnswVectorsWriter` and also in `HnswGraphBuilder`. In my case it's the latter one that seems to govern, but this is pretty unintuitive. Could we at least factor this out so they share a single `static final int` constant? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Link and Code ANN algorithm from FaceBook in Lucene? [lucene]
tanyaroosta opened a new issue, #13673: URL: https://github.com/apache/lucene/issues/13673 ### Description Hi, I was wondering if anyone has looked into implementing the link and code algorithm from FaceBook in Lucene. It seems the performance is really good based on the benchmarking results reported in the paper. There is an implementation in faiss, though for an older version of the library. https://github.com/facebookresearch/faiss/blob/main/benchs/link_and_code/README.md -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] import definition of default parameter values from HnswGraphBuilder [lucene]
msokolov opened a new pull request, #13674: URL: https://github.com/apache/lucene/pull/13674 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]
msokolov commented on issue #13565: URL: https://github.com/apache/lucene/issues/13565#issuecomment-2299587209 Yes, that seems incontrovertibly correct. I was planning to include it along with some other changes, but let's fix the easy stuff first. https://github.com/apache/lucene/pull/13674 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]
msokolov commented on issue #13565: URL: https://github.com/apache/lucene/issues/13565#issuecomment-2299601054 Another thing I stumbled on that surprised me is that SortingCodecReader now requires loading vectors into RAM in order to do its work. It used to rely on random access to avoid this, but we stopped doing that in https://github.com/apache/lucene/pull/11964. I guess I was fully aware but now I think we need to find a way to provide random access when possible. In that issue it was stated we can't rely on readers supporting random access, but in fact I think they always do. Certainly in the main path random access is supported. Perhaps we can handle with a class cast in SortingCodecReader so it can use RA when available. But I still am not sure why we don't just make it always available. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]
msfroh commented on PR #13543: URL: https://github.com/apache/lucene/pull/13543#issuecomment-2299646044 I was just looking at a flame graph for an indexing performance regression with JDK21 and noticed a lot of time spent in `BufferedOutputStream#growIfNeeded`. Without virtual threads, it looks like `growIfNeeded` (uselessly) does an addition and two comparisons, which doesn't sound like much, but may add up when `BufferedOutputStream#write(int)` is called a bajillion times. This change should also help with that by cutting the number of calls to `BufferedOutputStream#write(int)` by a factor of 8192, which cuts the number of calls to `growIfNeeded` by the same factor. Of course, if it followed the same pattern as `writeShort`/`writeInt`/`writeLong`, we could eliminate calls to `growIfNeeded` altogether. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Eclipse - one or more cycles were detected [lucene]
plutext opened a new issue, #13676: URL: https://github.com/apache/lucene/issues/13676 ### Description I followed the instructions at https://github.com/apache/lucene/blob/main/help/IDEs.txt#L7 to set up Eclipse project files: ``` ./gradlew eclipse ``` and then imported the project into Eclipse. Following this, I have 77 errors: - mostly "one or more cycles were detected" in the build path of projects - 3 non-existing libraries, as to which see below - various "The project cannot be built until build path errors are resolved" (resulting from the above?) `The container 'Project and External Dependencies' references non existing library '/home/jharrop/git/lucene/lucene/analysis.tests/build/libs/lucene-analysis.tests-10.0.0-SNAPSHOT-test.jar'` Regarding missing lucene-analysis-morfologik.tests-10.0.0-SNAPSHOT-test.jar, following `./gradlew assemble` I have lucene-analysis-morfologik.tests-10.0.0-SNAPSHOT.jar but not SNAPSHOT-test.jar. I'm not sure how /lucene-analysis.tests-10.0.0-SNAPSHOT-test.jar gets to be on the build path. ``` The container 'Project and External Dependencies' references non existing library 'lucene/analysis.tests/build/libs/lucene-analysis.tests-10.0.0-SNAPSHOT-test.jar' The container 'Project and External Dependencies' references non existing library 'lucene/core.tests/build/libs/lucene-core.tests-10.0.0-SNAPSHOT-test.jar' ``` Similar comments to above Eclipse 2024-06 (4.32.0) Java 21 ### Version and environment details _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]
jpountz commented on issue #13565: URL: https://github.com/apache/lucene/issues/13565#issuecomment-2299807259 Sorry that this change is being annoying for BP, the goal of this change was to simplify the API, which I found a bit messy at the time with both a sequential and random-access API. Since the random-access API was only used internally by the file format, it felt cleaner to only expose the sequential access API to users, and also consistent with doc values. If this is being limiting for more interesting things we want to do, we could look into replacing the sequential-access API of vectors with a random-access API in Lucene 10. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Display frame types when analyzing top frames. [lucene]
jpountz merged PR #13670: URL: https://github.com/apache/lucene/pull/13670 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Take advantage of the doc value skipper when it is primary sort [lucene]
gsmiller commented on code in PR #13592: URL: https://github.com/apache/lucene/pull/13592#discussion_r1723543404 ## lucene/core/src/java/org/apache/lucene/index/DocValuesSkipper.java: ## @@ -98,4 +98,29 @@ public abstract class DocValuesSkipper { /** Return the global number of documents with a value for the field. */ public abstract int docCount(); + + /** + * Advance this skipper so that all levels intersects the range given by {@code minValue} and + * {@code maxValue}. If there are no intersecting levels, the skipper is exhausted. + * + * NOTE: The behavior is undefined if this method is called and {@link #advance(int)} + * has not been called yet. + */ + public final void advance(long minValue, long maxValue) throws IOException { +while (true) { Review Comment: If it's required that `#advance` has already been called at least once before this is used, should we add something like `assert minDocID(0) > -1`? ## lucene/core/src/java/org/apache/lucene/index/DocValuesSkipper.java: ## @@ -98,4 +98,29 @@ public abstract class DocValuesSkipper { /** Return the global number of documents with a value for the field. */ public abstract int docCount(); + + /** + * Advance this skipper so that all levels intersects the range given by {@code minValue} and + * {@code maxValue}. If there are no intersecting levels, the skipper is exhausted. + * + * NOTE: The behavior is undefined if this method is called and {@link #advance(int)} + * has not been called yet. + */ + public final void advance(long minValue, long maxValue) throws IOException { +while (true) { + if (minDocID(0) == DocIdSetIterator.NO_MORE_DOCS + || (minValue(0) <= maxValue && maxValue(0) >= minValue)) { +break; + } else { +int maxDocID = maxDocID(0); Review Comment: This might just be me, but I got really confused for a while until I realized that `maxDocID` does not necessarily return the docID associated with `maxValue` (in the standard ascending sort case it would, but it's reversed in the reverse sort case... assuming I'm getting this right). I had the same confusing back in the calling code in the reverse-sort case. Would you mind adding a couple comments to this code to make that a little more clear in case someone else suffers from my same confusion? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Modernize list get first element [lucene]
mrhbj opened a new pull request, #13677: URL: https://github.com/apache/lucene/pull/13677 ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Compute facets while collecting [lucene]
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1724471801 ## lucene/core/src/java/org/apache/lucene/search/CollectorOwner.java: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.Collections; +import java.util.List; + +/** + * Like {@link CollectorManager}, but it owns the collectors its manager creates. It is convenient + * that clients of the class don't have to worry about keeping the list of collectors, as well as + * about making the collectors type (C) compatible when reduce is called. Instance of this class + * also caches results of {@link CollectorManager#reduce(Collection)}. + * + * Note that instance of this class ignores any {@link Collector} created by {@link + * CollectorManager#newCollector()} directly, not through {@link #newCollector()} + * + * @lucene.experimental + */ +public final class CollectorOwner { Review Comment: The issue to follow up on that thread https://github.com/apache/lucene/issues/13671 - thank you @gsmiller for creating! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org