Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-09-14 Thread via GitHub


uschindler commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1759686028


##
gradle/java/javac.gradle:
##
@@ -24,7 +24,11 @@ allprojects { project ->
 
 // Use 'release' flag instead of 'source' and 'target'
 tasks.withType(JavaCompile) {
-  options.compilerArgs += ["--release", 
rootProject.minJavaVersion.toString()]
+  options.compilerArgs += ["--release", 
rootProject.minJavaVersion.toString(), "--enable-preview"]

Review Comment:
   No need for this when everything is moved to java21 folder



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-09-14 Thread via GitHub


uschindler commented on PR #13572:
URL: https://github.com/apache/lucene/pull/13572#issuecomment-2350900783

   > > Anyways: At moment we do not want to have native code in Lucene Core.
   > 
   > +1, we don't put native code in Lucene's `core`.
   > 
   > But @uschindler is there maybe a way forward not using core? I.e. add 
native code to `misc` or `sandbox` or so, to enable iterating on direct access 
to SIMD for early adopters (opt in)?
   > 
   > Or is the `src/javaXX` approach core-only for some reason?
   
   In the current setup it is. One could generalize this, but this should be 
fine step by step in separate PRs. There are several problems to solve, also 
with regards to modularization and where the stub jars are:
   
   - compilation stubs (apijars) need to move to a common folder across Gradle 
projects (we don't have dich common module yet as core itself has no 
dependencies).
   - the bootstrapping of Panama and memory segment code is based on factories 
which are not yet pluggable. This may need SPI. So one could simply drop a 
separate JAR into class path/module path and then it enabled native code.
   - I would like to prevent duplicate and hairy initialization code, so one 
single place that loads the service providers and fies appropriate checks 
before.
   
   In general it's doable but I really doubt if this is healthy for this 
project's infrastructure.
   
   So I am still -1 to Gabe native code before we have refactored major parts. 
One condition is to have java 22 as minimum version, so we can at least 
introduce MemorySegment as top level citizen into our own APIs. I was arguing 
in the java community to have the foreign API stabilized and released with 21, 
but unfortunately it was done in 22. If we can get rid of the shit and simply 
tell our downstream users to accept java 22 as minimum version, and only the 
Panama vector stuff needs to be me jarred it would simplify a lot.
   
   Uwe


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Only apply deletion one time for unique term update in FrozenBufferedUpdates.applyTermDeletes [lucene]

2024-09-14 Thread via GitHub


vsop-479 commented on PR #13486:
URL: https://github.com/apache/lucene/pull/13486#issuecomment-2350906740

   > BTW, since we applyDeletes before writeTerms, should we skip deleted docs 
when writing postings(maybe skip storedFields too).
   
   I see this has been discussed in 
https://issues.apache.org/jira/browse/LUCENE-5693, and  
`IDVersionPostingsWriter#startDoc` implemented it (although it was markd as 
`Won't Fix`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2024-09-14 Thread via GitHub


gf2121 commented on code in PR #13521:
URL: https://github.com/apache/lucene/pull/13521#discussion_r1759694410


##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/DocIdEncodingBenchmark.java:
##
@@ -0,0 +1,404 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.benchmark.jmh;
+
+import java.io.IOException;
+import java.nio.charset.Charset;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Scanner;
+import java.util.concurrent.TimeUnit;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.NIOFSDirectory;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Warmup;
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@State(Scope.Benchmark)
+@Warmup(iterations = 3, time = 5)
+@Measurement(iterations = 5, time = 8)
+@Fork(value = 1)
+public class DocIdEncodingBenchmark {
+
+  private static final List DOC_ID_SEQUENCES = new ArrayList<>();
+
+  private static int INPUT_SCALE_FACTOR;
+
+  static {
+parseInput();
+  }
+
+  @Param({"Bit21With3StepsEncoder", "Bit21With2StepsEncoder", "Bit24Encoder", 
"Bit32Encoder"})
+  String encoderName;
+
+  @Param({"encode", "decode"})
+  String methodName;
+
+  private DocIdEncoder docIdEncoder;
+
+  private Path tmpDir;
+
+  private IndexInput in;
+
+  private IndexOutput out;
+
+  private final int[] scratch = new int[512];
+
+  @Setup(Level.Trial)
+  public void init() throws IOException {
+tmpDir = Files.createTempDirectory("docIdJmh");
+docIdEncoder = DocIdEncoder.SingletonFactory.fromName(encoderName);
+// Create file once for decoders to read from in every iteration
+if (methodName.equalsIgnoreCase("decode")) {
+  String dataFile =
+  String.join("_", "docIdJmhData", 
docIdEncoder.getClass().getSimpleName(), "DecoderInput");
+  try (Directory dir = new NIOFSDirectory(tmpDir)) {
+out = dir.createOutput(dataFile, IOContext.DEFAULT);
+encode();
+  } finally {
+out.close();
+  }
+}
+  }
+
+  @TearDown(Level.Trial)
+  public void finish() throws IOException {
+if (methodName.equalsIgnoreCase("decode")) {
+  String dataFile =
+  String.join("_", "docIdJmhData", 
docIdEncoder.getClass().getSimpleName(), "DecoderInput");
+  Files.delete(tmpDir.resolve(dataFile));
+}
+Files.delete(tmpDir);
+  }
+
+  @Benchmark
+  public void executeEncodeOrDecode() throws IOException {
+if (methodName.equalsIgnoreCase("encode")) {
+  String dataFile =
+  String.join(
+  "_",
+  "docIdJmhData",
+  docIdEncoder.getClass().getSimpleName(),
+  String.valueOf(System.nanoTime()));
+  try (Directory dir = new NIOFSDirectory(tmpDir)) {
+out = dir.createOutput(dataFile, IOContext.DEFAULT);
+encode();
+  } finally {
+Files.delete(tmpDir.resolve(dataFile));
+out.close();
+  }
+} else if (methodName.equalsIgnoreCase("decode")) {
+  String inputFile =
+  String.join("_", "docIdJmhData", 
docIdEncoder.getClass().getSimpleName(), "DecoderInput");
+  try (Directory dir = new NIOFSDirectory(tmpDir)) {

Review Comment:
   Can you use `FSDirectory.open(tmpDir)` and rerun the benchmark? It helps to 
get a chance to take advantage of `MMapDirectory`, which is more often used in 
luc

Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-14 Thread via GitHub


stefanvodita commented on PR #13723:
URL: https://github.com/apache/lucene/pull/13723#issuecomment-2350917639

   Thank you Mike and Uwe! I opened a separate issue to replace various 
epsilon-based equality checks (#13789), since that could be a large enough 
task, and I'll merge this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Reproducible test failure in TestTaxonomyFacetAssociations.testFloatSumAssociation -- ULP float issue? [lucene]

2024-09-14 Thread via GitHub


stefanvodita closed issue #13720: Reproducible test failure in 
TestTaxonomyFacetAssociations.testFloatSumAssociation -- ULP float issue?
URL: https://github.com/apache/lucene/issues/13720


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-14 Thread via GitHub


stefanvodita merged PR #13723:
URL: https://github.com/apache/lucene/pull/13723


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove 8 bit quantization for HNSW/KNN vector indexing [lucene]

2024-09-14 Thread via GitHub


mikemccand commented on PR #13767:
URL: https://github.com/apache/lucene/pull/13767#issuecomment-2350934386

   I plan to resolve the conflict (just `MIGRATE.md`) and merge this in the 
next day or so!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-14 Thread via GitHub


uschindler commented on PR #13723:
URL: https://github.com/apache/lucene/pull/13723#issuecomment-2350956051

   👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-14 Thread via GitHub


javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1759746984


##
lucene/core/src/java/org/apache/lucene/search/BulkScorer.java:
##
@@ -27,18 +27,6 @@
  */
 public abstract class BulkScorer {
 
-  /**
-   * Scores and collects all matching documents.
-   *
-   * @param collector The collector to which all matching documents are passed.
-   * @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
-   * if they are all allowed to match.
-   */
-  public void score(LeafCollector collector, Bits acceptDocs) throws 
IOException {

Review Comment:
   @jpountz I've been going back and forth on this a few times. Ideally we 
would deprecate the method in 9x and give users some notice. It is rather easy 
to remove usages and deprecate, I only worry about possible custom 
implementations of the method, because it is not final and subclasses can 
entirely override its behaviour. Such overrides would be a no-op if we replace 
all the usages of the deprecated method in 9x, which feels like it breaks stuff 
potentially. Users would need to immediately switch to overriding the other 
method instead. If we leave the main usage (IndexSearcher#searchLeaf) behind 
though, it feels odd that we use a deprecated method in such a core part of 
search.
   
   I am a bit torn on how to proceed, do you have an opinion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce number of calculations in FSTCompiler [lucene]

2024-09-14 Thread via GitHub


mikemccand merged PR #13788:
URL: https://github.com/apache/lucene/pull/13788


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#Search(Query, Collector) from monitor package [lucene]

2024-09-14 Thread via GitHub


javanna commented on PR #13735:
URL: https://github.com/apache/lucene/pull/13735#issuecomment-2351020586

   I've been changing my mind a few times on the topic, but I am concluding 
that it probably was not a good idea to expose the sequential collector 
manager. I think we are ok using it internally at our own risk, for situations 
where we know for sure that we can't possibly have a searcher with an executor 
set. But exposing it publicly makes it too easy for users to get unexpected 
failures, and the disclaimer "don't use this with a searcher that has an 
executor set" is kind of odd. What if those users do have other queries that 
make already use of concurrency, and use the same searcher for these others 
that get converted to leveraging the sequential collector manager? It is also 
trappy that they get errors only once there's more than one slice, so the 
behaviour may be hard to follow for users.
   
   I would propose that we make the new collector manager private with a bug 
disclaimer on when it should be used, and make a plan to remove it perhaps in 
the medium to long term. We should really design all new features with 
concurrency in mind, and migrate old ones to support concurrency.
   
   There is some urgency around this especially for the 9.12 release, I am 
happy to open a PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce number of calculations in FSTCompiler [lucene]

2024-09-14 Thread via GitHub


mrhbj commented on PR #13788:
URL: https://github.com/apache/lucene/pull/13788#issuecomment-2351022335

   @mikemccand You are welcome. I am a ream human. no an AI. -_-


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Remove CollectorManager#forSequentialExecution [lucene]

2024-09-14 Thread via GitHub


javanna opened a new pull request, #13790:
URL: https://github.com/apache/lucene/pull/13790

   We have recently (see #13735) introduced this utility method that creates a 
collector manager which only works when a searcher does not have an executor 
set, otherwise it throws exception once we attempt to create a new collector 
for more than one slice.
   
   While we discussed it should be safe to use in some specific scenarios like 
the monitor module, we should be careful exposing this utility publicly, 
because while we'd like to ease migration from the search(Query, Collector) 
method, we may end up making users like even worse, in that it exposes them to 
failures whenever an executor is set and there are more than one slice created, 
which is hard to follow and does not provide a good user experience.
   
   My proposal is that we use a similar collector manager locally, where safe 
and required, but we don't expose it to users. In most places, we should rather 
expose collector managers that do support search concurrency, rather than 
working around the lack of those.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager) [LUCENE-10002] [lucene]

2024-09-14 Thread via GitHub


javanna closed issue #11041: Remove IndexSearcher#search(Query,Collector) in 
favor of IndexSearcher#search(Query,CollectorManager) [LUCENE-10002]
URL: https://github.com/apache/lucene/issues/11041


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager) [LUCENE-10002] [lucene]

2024-09-14 Thread via GitHub


javanna commented on issue #11041:
URL: https://github.com/apache/lucene/issues/11041#issuecomment-2351046379

   Only the actual removal of the deprecated search(Query, Collector) is left 
to do. That is tracked by #12892 hence this issue can be closed/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Should we auto-adjust top score doc and top field collector manager based on slices? [lucene]

2024-09-14 Thread via GitHub


javanna opened a new issue, #13791:
URL: https://github.com/apache/lucene/issues/13791

   We recently made a change to `TotalHitCountCollectorManager`, to make it 
take the leaf slices in its constructor, so that we can automatically decide 
what type of collectors are needed. That is because there is additional 
handling needed in case the search targets segment partitions, which causes a 
bit of overhead and we want to be able to avoid that overhead when not 
necessary.
   
   There is a somewhat similar mechanism in the `TopScoreDocCollectorManager` 
and `TopFieldCollectorManager` constructors that take a `boolean` as last 
argument. That flag tells the manager if it has to create collectors that 
support concurrency or not. That does the job, but it can lead to errors in 
that users may not know exactly what to pass in and use the wrong value, which 
lead them to getting non deterministic results if they provide `false` and 
their searcher has an executor set to it. The other way around, there is a bit 
of overhead caused by the need for concurrency that perhaps can be 
automatically avoided when the search targets a single slice. As far as I can 
see the concurrent path is currently the default so far, and we mostly use the 
flag to disable concurrency when we are sure there is not executor set.
   
   I wonder if we should replace that boolean argument with the array of leaf 
slices instead. That is not perfect but better than a boolean, I think. It is 
consistent with the current solution for `TotalHitCountCollectorManager`, and 
it allows the manager to make an accurate decision with lower likelihood of 
making a mistake (you may take slices from the wrong searcher instance perhaps 
if you use multiple searchers with different settings, but that's about it?). 
There are two main implied question in this:
   1. should we use LeafSlice[] as an argument to automatically detect if 
concurrency needs to be supported or not?
   2. while we are at it, should we also make that argument mandatory so that 
we may even avoid overhead caused by concurrency when the search targets a 
single slice, despite an executor has been set to the searcher (more adaptive 
to what we do today)?
   
   I would like to collect opinions on this, I think it could be an additional 
small breaking change to make in Lucene 10 if there is appeal for it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org