Re: [PR] Convert BKDConfig to a record [lucene]

2024-08-20 Thread via GitHub


iverase merged PR #13668:
URL: https://github.com/apache/lucene/pull/13668


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Convert BKDConfig to a record [lucene]

2024-08-20 Thread via GitHub


uschindler commented on PR #13668:
URL: https://github.com/apache/lucene/pull/13668#issuecomment-2298573177

   Thanks. All fine now. 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2024-08-20 Thread via GitHub


expani commented on PR #13521:
URL: https://github.com/apache/lucene/pull/13521#issuecomment-2298718372

   While trying out 2 different ways ( one a subset of other ) I found that 
`Bit21With3StepsEncoder` is better than `Bit21With2StepsEncoder` in aarch64 
platforms with OpenJDK 21/22 whereas they are similar in x86.
   
   x86 EC2 instance type `c5.9xlarge`
   ```
   Benchmark(encoderName)  Mode  
Cnt ScoreError  Units
   DocIdEncodingBenchmark.performEncodeDecode  Bit21With3StepsEncoder  avgt
5  6905.920 ± 33.626  ms/op
   DocIdEncodingBenchmark.performEncodeDecode  Bit21With2StepsEncoder  avgt
5  6983.592 ±  8.864  ms/op
   DocIdEncodingBenchmark.performEncodeDecodeBit24Encoder  avgt
5  7795.338 ± 78.580  ms/op
   ```
   
   aarch64 - EC2 instance type `r6g.large`
   ```
   Benchmark(encoderName)  Mode  
Cnt Score Error  Units
   DocIdEncodingBenchmark.performEncodeDecode  Bit21With3StepsEncoder  avgt
5  7736.603 ±  82.581  ms/op
   DocIdEncodingBenchmark.performEncodeDecode  Bit21With2StepsEncoder  avgt
5  8705.315 ± 123.062  ms/op
   DocIdEncodingBenchmark.performEncodeDecodeBit24Encoder  avgt
5  8767.359 ± 179.656  ms/op
   ```
   
   JDK 
   
   ```
   openjdk version "22.0.2" 2024-07-16
   OpenJDK Runtime Environment Corretto-22.0.2.9.1 (build 22.0.2+9-FR)
   OpenJDK 64-Bit Server VM Corretto-22.0.2.9.1 (build 22.0.2+9-FR, mixed mode, 
sharing)
   ```
   @jpountz IMO We should use `Bit21With3StepsEncoder` in DocIdsWriter as using 
`Bit21With2StepsEncoder` might cause regression for workloads in aarch64 
platforms. 
   We can replace it with `Bit21With2StepsEncoder` in future when the 
performance is comparable to x86. 
   
   Let me know your thoughts. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]

2024-08-20 Thread via GitHub


mikemccand commented on issue #13565:
URL: https://github.com/apache/lucene/issues/13565#issuecomment-2298787050

   > update: I found we set this `DEFAULT_MAX_CONN` in two different places! 
Once in `Lucene99HnswVectorsWriter` and also in `HnswGraphBuilder`. In my case 
it's the latter one that seems to govern, but this is pretty unintuitive.
   
   Could we at least factor this out so they share a single `static final int` 
constant?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Link and Code ANN algorithm from FaceBook in Lucene? [lucene]

2024-08-20 Thread via GitHub


tanyaroosta opened a new issue, #13673:
URL: https://github.com/apache/lucene/issues/13673

   ### Description
   
   Hi,
   I was wondering if anyone has looked into implementing the link and code 
algorithm from FaceBook in Lucene.  It seems the performance is really good 
based on the benchmarking results reported in the paper.  There is an 
implementation in faiss, though for an older version of the library.
   
   
https://github.com/facebookresearch/faiss/blob/main/benchs/link_and_code/README.md


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] import definition of default parameter values from HnswGraphBuilder [lucene]

2024-08-20 Thread via GitHub


msokolov opened a new pull request, #13674:
URL: https://github.com/apache/lucene/pull/13674

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]

2024-08-20 Thread via GitHub


msokolov commented on issue #13565:
URL: https://github.com/apache/lucene/issues/13565#issuecomment-2299587209

   Yes, that seems incontrovertibly correct. I was planning to include it along 
with some other changes, but let's fix the easy stuff first. 
https://github.com/apache/lucene/pull/13674


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]

2024-08-20 Thread via GitHub


msokolov commented on issue #13565:
URL: https://github.com/apache/lucene/issues/13565#issuecomment-2299601054

   Another thing I stumbled on that surprised me is that SortingCodecReader now 
requires loading vectors into RAM in order to do its work. It used to rely on 
random access to avoid this, but we stopped doing that in 
https://github.com/apache/lucene/pull/11964. I guess I was fully aware but now 
I think we need to find a way to provide random access when possible. In that 
issue it was stated we can't rely on readers supporting random access, but in 
fact I think they always do. Certainly in the main path random access is 
supported. Perhaps we can handle with a class cast in SortingCodecReader so it 
can use RA when available. But I still am not sure why we don't just make it 
always available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]

2024-08-20 Thread via GitHub


msfroh commented on PR #13543:
URL: https://github.com/apache/lucene/pull/13543#issuecomment-2299646044

   I was just looking at a flame graph for an indexing performance regression 
with JDK21 and noticed a lot of time spent in 
`BufferedOutputStream#growIfNeeded`. Without virtual threads, it looks like 
`growIfNeeded` (uselessly) does an addition and two comparisons, which doesn't 
sound like much, but may add up when `BufferedOutputStream#write(int)` is 
called a bajillion times.
   
   This change should also help with that by cutting the number of calls to 
`BufferedOutputStream#write(int)` by a factor of 8192, which cuts the number of 
calls to `growIfNeeded` by the same factor.
   
   Of course, if it followed the same pattern as 
`writeShort`/`writeInt`/`writeLong`, we could eliminate calls to `growIfNeeded` 
altogether.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Eclipse - one or more cycles were detected [lucene]

2024-08-20 Thread via GitHub


plutext opened a new issue, #13676:
URL: https://github.com/apache/lucene/issues/13676

   ### Description
   
   I followed the instructions at 
https://github.com/apache/lucene/blob/main/help/IDEs.txt#L7 to set up Eclipse 
project files:
   
   ```
   ./gradlew eclipse
   ```
   and then imported the project into Eclipse.
   
   Following this, I have 77 errors:
   
   - mostly "one or more cycles were detected" in the build path of projects
   - 3 non-existing libraries, as to which see below
   - various "The project cannot be built until build path errors are resolved" 
(resulting from the above?)
   
   `The container 'Project and External Dependencies' references non existing 
library 
'/home/jharrop/git/lucene/lucene/analysis.tests/build/libs/lucene-analysis.tests-10.0.0-SNAPSHOT-test.jar'`
   
   Regarding missing lucene-analysis-morfologik.tests-10.0.0-SNAPSHOT-test.jar, 
following `./gradlew assemble` I have 
lucene-analysis-morfologik.tests-10.0.0-SNAPSHOT.jar but not SNAPSHOT-test.jar.
   
   I'm not sure how /lucene-analysis.tests-10.0.0-SNAPSHOT-test.jar gets to be 
on the build path.
   
   ```
   The container 'Project and External Dependencies' references non existing 
library 
'lucene/analysis.tests/build/libs/lucene-analysis.tests-10.0.0-SNAPSHOT-test.jar'
   
   The container 'Project and External Dependencies' references non existing 
library 
'lucene/core.tests/build/libs/lucene-core.tests-10.0.0-SNAPSHOT-test.jar'
   ```
   Similar comments to above
   
   
   Eclipse 2024-06 (4.32.0)
   Java 21
   
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]

2024-08-20 Thread via GitHub


jpountz commented on issue #13565:
URL: https://github.com/apache/lucene/issues/13565#issuecomment-2299807259

   Sorry that this change is being annoying for BP, the goal of this change was 
to simplify the API, which I found a bit messy at the time with both a 
sequential and random-access API. Since the random-access API was only used 
internally by the file format, it felt cleaner to only expose the sequential 
access API to users, and also consistent with doc values. If this is being 
limiting for more interesting things we want to do, we could look into 
replacing the sequential-access API of vectors with a random-access API in 
Lucene 10.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Display frame types when analyzing top frames. [lucene]

2024-08-20 Thread via GitHub


jpountz merged PR #13670:
URL: https://github.com/apache/lucene/pull/13670


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Take advantage of the doc value skipper when it is primary sort [lucene]

2024-08-20 Thread via GitHub


gsmiller commented on code in PR #13592:
URL: https://github.com/apache/lucene/pull/13592#discussion_r1723543404


##
lucene/core/src/java/org/apache/lucene/index/DocValuesSkipper.java:
##
@@ -98,4 +98,29 @@ public abstract class DocValuesSkipper {
 
   /** Return the global number of documents with a value for the field. */
   public abstract int docCount();
+
+  /**
+   * Advance this skipper so that all levels intersects the range given by 
{@code minValue} and
+   * {@code maxValue}. If there are no intersecting levels, the skipper is 
exhausted.
+   *
+   * NOTE: The behavior is undefined if this method is called and 
{@link #advance(int)}
+   * has not been called yet.
+   */
+  public final void advance(long minValue, long maxValue) throws IOException {
+while (true) {

Review Comment:
   If it's required that `#advance` has already been called at least once 
before this is used, should we add something like `assert minDocID(0) > -1`?



##
lucene/core/src/java/org/apache/lucene/index/DocValuesSkipper.java:
##
@@ -98,4 +98,29 @@ public abstract class DocValuesSkipper {
 
   /** Return the global number of documents with a value for the field. */
   public abstract int docCount();
+
+  /**
+   * Advance this skipper so that all levels intersects the range given by 
{@code minValue} and
+   * {@code maxValue}. If there are no intersecting levels, the skipper is 
exhausted.
+   *
+   * NOTE: The behavior is undefined if this method is called and 
{@link #advance(int)}
+   * has not been called yet.
+   */
+  public final void advance(long minValue, long maxValue) throws IOException {
+while (true) {
+  if (minDocID(0) == DocIdSetIterator.NO_MORE_DOCS
+  || (minValue(0) <= maxValue && maxValue(0) >= minValue)) {
+break;
+  } else {
+int maxDocID = maxDocID(0);

Review Comment:
   This might just be me, but I got really confused for a while until I 
realized that `maxDocID` does not necessarily return the docID associated with 
`maxValue` (in the standard ascending sort case it would, but it's reversed in 
the reverse sort case... assuming I'm getting this right). I had the same 
confusing back in the calling code in the reverse-sort case. Would you mind 
adding a couple comments to this code to make that a little more clear in case 
someone else suffers from my same confusion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Modernize list get first element [lucene]

2024-08-20 Thread via GitHub


mrhbj opened a new pull request, #13677:
URL: https://github.com/apache/lucene/pull/13677

   ### Description
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Compute facets while collecting [lucene]

2024-08-20 Thread via GitHub


epotyom commented on code in PR #13568:
URL: https://github.com/apache/lucene/pull/13568#discussion_r1724471801


##
lucene/core/src/java/org/apache/lucene/search/CollectorOwner.java:
##
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Like {@link CollectorManager}, but it owns the collectors its manager 
creates. It is convenient
+ * that clients of the class don't have to worry about keeping the list of 
collectors, as well as
+ * about making the collectors type (C) compatible when reduce is called. 
Instance of this class
+ * also caches results of {@link CollectorManager#reduce(Collection)}.
+ *
+ * Note that instance of this class ignores any {@link Collector} created 
by {@link
+ * CollectorManager#newCollector()} directly, not through {@link 
#newCollector()}
+ *
+ * @lucene.experimental
+ */
+public final class CollectorOwner {

Review Comment:
   The issue to follow up on that thread 
https://github.com/apache/lucene/issues/13671 - thank you @gsmiller for 
creating!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org