Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
uschindler commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1759686028

## gradle/java/javac.gradle:
## @@ -24,7 +24,11 @@ allprojects { project ->
   // Use 'release' flag instead of 'source' and 'target'
   tasks.withType(JavaCompile) {
-    options.compilerArgs += ["--release", rootProject.minJavaVersion.toString()]
+    options.compilerArgs += ["--release", rootProject.minJavaVersion.toString(), "--enable-preview"]

Review Comment:
No need for this when everything is moved to the java21 folder.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
uschindler commented on PR #13572: URL: https://github.com/apache/lucene/pull/13572#issuecomment-2350900783

> > Anyways: At the moment we do not want to have native code in Lucene Core.
>
> +1, we don't put native code in Lucene's `core`.
>
> But @uschindler is there maybe a way forward not using core? I.e. add native code to `misc` or `sandbox` or so, to enable iterating on direct access to SIMD for early adopters (opt in)?
>
> Or is the `src/javaXX` approach core-only for some reason?

In the current setup it is. One could generalize this, but that should be done step by step in separate PRs. There are several problems to solve, also with regards to modularization and where the stub jars are:

- The compilation stubs (apijars) need to move to a common folder across Gradle projects (we don't have such a common module yet, as core itself has no dependencies).
- The bootstrapping of the Panama and memory segment code is based on factories which are not yet pluggable. This may need SPI, so one could simply drop a separate JAR into the class path/module path and that would enable native code.
- I would like to prevent duplicate and hairy initialization code, so there should be one single place that loads the service providers and performs the appropriate checks first.

In general it's doable, but I really doubt that this is healthy for this project's infrastructure. So I am still -1 to have native code before we have refactored major parts.

One condition is to have Java 22 as minimum version, so we can at least introduce MemorySegment as a top-level citizen in our own APIs. I was arguing in the Java community to have the foreign API stabilized and released with 21, but unfortunately it was done in 22. If we can get rid of that and simply tell our downstream users to accept Java 22 as minimum version, so that only the Panama vector stuff needs to be MR-JARed, it would simplify a lot.

Uwe
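The SPI-based bootstrapping Uwe describes (drop a JAR on the class path/module path, and one central place loads and checks providers) can be sketched with plain `java.util.ServiceLoader`. The `NativeCodeProvider` interface and all names below are hypothetical illustrations, not Lucene's actual factory API:

```java
import java.util.ServiceLoader;

public class ProviderBootstrap {

  // Hypothetical provider interface; an external JAR would register an
  // implementation under META-INF/services to plug in native code.
  public interface NativeCodeProvider {
    String name();
  }

  // Pure-Java fallback used when no provider JAR is present.
  public static final class DefaultProvider implements NativeCodeProvider {
    @Override
    public String name() {
      return "default";
    }
  }

  // The single initialization point: consult the class/module path once and
  // fall back to the default implementation if nothing is registered.
  public static NativeCodeProvider lookup() {
    for (NativeCodeProvider p : ServiceLoader.load(NativeCodeProvider.class)) {
      return p; // first registered provider wins in this sketch
    }
    return new DefaultProvider();
  }
}
```

Because no provider is registered here, `lookup()` returns the fallback; swapping behaviour would only require adding a JAR with a service registration, with no code change at the call site.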
Re: [PR] Only apply deletion one time for unique term update in FrozenBufferedUpdates.applyTermDeletes [lucene]
vsop-479 commented on PR #13486: URL: https://github.com/apache/lucene/pull/13486#issuecomment-2350906740

> BTW, since we applyDeletes before writeTerms, should we skip deleted docs when writing postings (maybe skip storedFields too)?

I see this has been discussed in https://issues.apache.org/jira/browse/LUCENE-5693, and `IDVersionPostingsWriter#startDoc` implemented it (although it was marked as `Won't Fix`).
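The idea above — once deletes have been applied, a postings writer could drop deleted doc IDs instead of writing and later filtering them — reduces to a simple filter over a deletion bitset. This is a minimal standalone illustration (using `java.util.BitSet` as a stand-in for Lucene's live-docs structures), not the actual writer code:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SkipDeletedDocs {

  // Given already-applied deletes (a bitset where a set bit means "deleted"),
  // keep only live doc IDs when writing postings for a term.
  public static List<Integer> liveDocs(int[] postings, BitSet deleted) {
    List<Integer> live = new ArrayList<>();
    for (int doc : postings) {
      if (!deleted.get(doc)) {
        live.add(doc);
      }
    }
    return live;
  }
}
```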
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
gf2121 commented on code in PR #13521: URL: https://github.com/apache/lucene/pull/13521#discussion_r1759694410

## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/DocIdEncodingBenchmark.java:
## @@ -0,0 +1,404 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.benchmark.jmh;
+
+import java.io.IOException;
+import java.nio.charset.Charset;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Scanner;
+import java.util.concurrent.TimeUnit;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.NIOFSDirectory;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Warmup;
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@State(Scope.Benchmark)
+@Warmup(iterations = 3, time = 5)
+@Measurement(iterations = 5, time = 8)
+@Fork(value = 1)
+public class DocIdEncodingBenchmark {
+
+  private static final List DOC_ID_SEQUENCES = new ArrayList<>();
+
+  private static int INPUT_SCALE_FACTOR;
+
+  static {
+    parseInput();
+  }
+
+  @Param({"Bit21With3StepsEncoder", "Bit21With2StepsEncoder", "Bit24Encoder", "Bit32Encoder"})
+  String encoderName;
+
+  @Param({"encode", "decode"})
+  String methodName;
+
+  private DocIdEncoder docIdEncoder;
+
+  private Path tmpDir;
+
+  private IndexInput in;
+
+  private IndexOutput out;
+
+  private final int[] scratch = new int[512];
+
+  @Setup(Level.Trial)
+  public void init() throws IOException {
+    tmpDir = Files.createTempDirectory("docIdJmh");
+    docIdEncoder = DocIdEncoder.SingletonFactory.fromName(encoderName);
+    // Create file once for decoders to read from in every iteration
+    if (methodName.equalsIgnoreCase("decode")) {
+      String dataFile =
+          String.join("_", "docIdJmhData", docIdEncoder.getClass().getSimpleName(), "DecoderInput");
+      try (Directory dir = new NIOFSDirectory(tmpDir)) {
+        out = dir.createOutput(dataFile, IOContext.DEFAULT);
+        encode();
+      } finally {
+        out.close();
+      }
+    }
+  }
+
+  @TearDown(Level.Trial)
+  public void finish() throws IOException {
+    if (methodName.equalsIgnoreCase("decode")) {
+      String dataFile =
+          String.join("_", "docIdJmhData", docIdEncoder.getClass().getSimpleName(), "DecoderInput");
+      Files.delete(tmpDir.resolve(dataFile));
+    }
+    Files.delete(tmpDir);
+  }
+
+  @Benchmark
+  public void executeEncodeOrDecode() throws IOException {
+    if (methodName.equalsIgnoreCase("encode")) {
+      String dataFile =
+          String.join(
+              "_",
+              "docIdJmhData",
+              docIdEncoder.getClass().getSimpleName(),
+              String.valueOf(System.nanoTime()));
+      try (Directory dir = new NIOFSDirectory(tmpDir)) {
+        out = dir.createOutput(dataFile, IOContext.DEFAULT);
+        encode();
+      } finally {
+        Files.delete(tmpDir.resolve(dataFile));
+        out.close();
+      }
+    } else if (methodName.equalsIgnoreCase("decode")) {
+      String inputFile =
+          String.join("_", "docIdJmhData", docIdEncoder.getClass().getSimpleName(), "DecoderInput");
+      try (Directory dir = new NIOFSDirectory(tmpDir)) {

Review Comment:
Can you use `FSDirectory.open(tmpDir)` and rerun the benchmark? That gives it a chance to take advantage of `MMapDirectory`, which is more often used in Lucene.
Re: [PR] Add unit-of-least-precision float comparison [lucene]
stefanvodita commented on PR #13723: URL: https://github.com/apache/lucene/pull/13723#issuecomment-2350917639 Thank you Mike and Uwe! I opened a separate issue to replace various epsilon-based equality checks (#13789), since that could be a large enough task, and I'll merge this one.
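The unit-of-least-precision (ULP) comparison this PR introduces is generally preferable to epsilon-based checks because the tolerance scales with the magnitude of the operands. The following is a sketch of the general technique via `Float.floatToIntBits` — the standard bit-pattern trick, not necessarily the exact helper merged in #13723 (NaN handling omitted):

```java
public class UlpCompare {

  // Returns true if a and b are within maxUlps units-in-the-last-place of
  // each other. Sketch of the general technique; NaN inputs are not handled.
  public static boolean equalWithinUlps(float a, float b, int maxUlps) {
    int ia = Float.floatToIntBits(a);
    int ib = Float.floatToIntBits(b);
    // Map the IEEE 754 bit patterns onto a lexicographically ordered integer
    // scale, so that adjacent floats differ by exactly 1 (and -0.0f maps next
    // to +0.0f).
    if (ia < 0) ia = Integer.MIN_VALUE - ia;
    if (ib < 0) ib = Integer.MIN_VALUE - ib;
    // Subtract as longs to avoid int overflow for distant values.
    return Math.abs((long) ia - ib) <= maxUlps;
  }
}
```

With this mapping, `1.0f` and `Math.nextUp(1.0f)` are exactly 1 ULP apart regardless of scale, which is what a fixed epsilon cannot express.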
Re: [I] Reproducible test failure in TestTaxonomyFacetAssociations.testFloatSumAssociation -- ULP float issue? [lucene]
stefanvodita closed issue #13720: Reproducible test failure in TestTaxonomyFacetAssociations.testFloatSumAssociation -- ULP float issue? URL: https://github.com/apache/lucene/issues/13720
Re: [PR] Add unit-of-least-precision float comparison [lucene]
stefanvodita merged PR #13723: URL: https://github.com/apache/lucene/pull/13723
Re: [PR] Remove 8 bit quantization for HNSW/KNN vector indexing [lucene]
mikemccand commented on PR #13767: URL: https://github.com/apache/lucene/pull/13767#issuecomment-2350934386 I plan to resolve the conflict (just `MIGRATE.md`) and merge this in the next day or so!
Re: [PR] Add unit-of-least-precision float comparison [lucene]
uschindler commented on PR #13723: URL: https://github.com/apache/lucene/pull/13723#issuecomment-2350956051 👍
Re: [PR] Add support for intra-segment search concurrency [lucene]
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1759746984

## lucene/core/src/java/org/apache/lucene/search/BulkScorer.java:
## @@ -27,18 +27,6 @@
  */
 public abstract class BulkScorer {

-  /**
-   * Scores and collects all matching documents.
-   *
-   * @param collector The collector to which all matching documents are passed.
-   * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
-   *     if they are all allowed to match.
-   */
-  public void score(LeafCollector collector, Bits acceptDocs) throws IOException {

Review Comment:
@jpountz I've been going back and forth on this a few times. Ideally we would deprecate the method in 9x and give users some notice. It is rather easy to remove usages and deprecate; I only worry about possible custom implementations of the method, because it is not final and subclasses can entirely override its behaviour. Such overrides would become no-ops if we replace all the usages of the deprecated method in 9x, which could potentially break things. Users would need to immediately switch to overriding the other method instead. If we leave the main usage (IndexSearcher#searchLeaf) behind though, it feels odd to use a deprecated method in such a core part of search. I am a bit torn on how to proceed; do you have an opinion?
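For context on why overriding the convenience overload is a trap: it is typically a whole-range delegate to a ranged scoring method, so once callers switch to the ranged variant, an override of the convenience method is silently bypassed. A simplified, non-Lucene sketch of that delegation pattern:

```java
public abstract class RangeScorerSketch {

  // Sentinel meaning "score to the end of the doc ID space".
  public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // Convenience overload: scores everything by delegating to the ranged
  // method over the whole range. If callers invoke score(min, max) directly,
  // an override of this method never runs - the trap discussed above.
  public int scoreAll() {
    return score(0, NO_MORE_DOCS);
  }

  // Subclasses implement scoring for [min, max); returns the number of
  // documents scored (simplified for the sketch).
  public abstract int score(int min, int max);
}
```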
Re: [PR] Reduce number of calculations in FSTCompiler [lucene]
mikemccand merged PR #13788: URL: https://github.com/apache/lucene/pull/13788
Re: [PR] Remove usage of IndexSearcher#Search(Query, Collector) from monitor package [lucene]
javanna commented on PR #13735: URL: https://github.com/apache/lucene/pull/13735#issuecomment-2351020586

I've been changing my mind a few times on the topic, but I am concluding that it probably was not a good idea to expose the sequential collector manager. I think we are ok using it internally at our own risk, for situations where we know for sure that we can't possibly have a searcher with an executor set. But exposing it publicly makes it too easy for users to get unexpected failures, and the disclaimer "don't use this with a searcher that has an executor set" is kind of odd. What if those users already have other queries that make use of concurrency, and use the same searcher for the ones that get converted to the sequential collector manager? It is also trappy that they get errors only once there's more than one slice, so the behaviour may be hard to follow for users.

I would propose that we make the new collector manager private, with a big disclaimer on when it should be used, and make a plan to remove it in the medium to long term. We should really design all new features with concurrency in mind, and migrate old ones to support concurrency. There is some urgency around this, especially for the 9.12 release; I am happy to open a PR.
Re: [PR] Reduce number of calculations in FSTCompiler [lucene]
mrhbj commented on PR #13788: URL: https://github.com/apache/lucene/pull/13788#issuecomment-2351022335

@mikemccand You are welcome. I am a real human, not an AI. -_-
[PR] Remove CollectorManager#forSequentialExecution [lucene]
javanna opened a new pull request, #13790: URL: https://github.com/apache/lucene/pull/13790

We have recently (see #13735) introduced this utility method that creates a collector manager which only works when a searcher does not have an executor set; otherwise it throws an exception once we attempt to create a new collector for more than one slice. While we discussed that it should be safe to use in some specific scenarios like the monitor module, we should be careful exposing this utility publicly: while we'd like to ease migration from the search(Query, Collector) method, we may end up making users' lives even worse, in that it exposes them to failures whenever an executor is set and more than one slice is created, which is hard to follow and does not provide a good user experience.

My proposal is that we use a similar collector manager locally, where safe and required, but we don't expose it to users. In most places, we should rather expose collector managers that do support search concurrency, rather than working around the lack of them.
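The failure mode described above — a manager that hands out exactly one collector and fails fast when a second slice asks for another — can be sketched with a minimal, hypothetical manager interface (illustrative only, not Lucene's actual `CollectorManager` API):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SequentialManagerSketch {

  // Minimal stand-in for a collector-manager contract: one collector per slice.
  public interface Manager<C> {
    C newCollector();
  }

  // Wraps a single collector. A second newCollector() call - i.e. a second
  // slice, which happens whenever an executor produces multiple slices -
  // fails fast, mirroring the trappy behaviour discussed in the PR.
  public static <C> Manager<C> forSequentialExecution(C collector) {
    AtomicBoolean used = new AtomicBoolean();
    return () -> {
      if (!used.compareAndSet(false, true)) {
        throw new IllegalStateException(
            "sequential manager supports a single slice only");
      }
      return collector;
    };
  }
}
```

The sketch makes the trap visible: nothing goes wrong until the searcher actually creates a second slice, which is exactly why the error is hard for users to anticipate.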
Re: [I] Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager) [LUCENE-10002] [lucene]
javanna closed issue #11041: Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager) [LUCENE-10002] URL: https://github.com/apache/lucene/issues/11041
Re: [I] Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager) [LUCENE-10002] [lucene]
javanna commented on issue #11041: URL: https://github.com/apache/lucene/issues/11041#issuecomment-2351046379

Only the actual removal of the deprecated search(Query, Collector) is left to do. That is tracked by #12892, hence this issue can be closed.
[I] Should we auto-adjust top score doc and top field collector manager based on slices? [lucene]
javanna opened a new issue, #13791: URL: https://github.com/apache/lucene/issues/13791

We recently made a change to `TotalHitCountCollectorManager` to make it take the leaf slices in its constructor, so that we can automatically decide what type of collectors are needed. That is because additional handling is needed when the search targets segment partitions, which causes a bit of overhead, and we want to be able to avoid that overhead when it is not necessary.

There is a somewhat similar mechanism in the `TopScoreDocCollectorManager` and `TopFieldCollectorManager` constructors, which take a `boolean` as last argument. That flag tells the manager whether it has to create collectors that support concurrency or not. That does the job, but it can lead to errors, in that users may not know exactly what to pass in and use the wrong value, which leads them to get non-deterministic results if they provide `false` and their searcher has an executor set. The other way around, there is a bit of overhead caused by the need for concurrency that perhaps can be automatically avoided when the search targets a single slice. As far as I can see, the concurrent path is currently the default, and we mostly use the flag to disable concurrency when we are sure no executor is set.

I wonder if we should replace that boolean argument with the array of leaf slices instead. That is not perfect, but better than a boolean, I think. It is consistent with the current solution for `TotalHitCountCollectorManager`, and it allows the manager to make an accurate decision with lower likelihood of a mistake (you may take slices from the wrong searcher instance if you use multiple searchers with different settings, but that's about it?).

There are two main implied questions in this:

1. Should we use LeafSlice[] as an argument to automatically detect whether concurrency needs to be supported?
2. While we are at it, should we also make that argument mandatory, so that we may even avoid the overhead caused by concurrency when the search targets a single slice, even though an executor has been set on the searcher (more adaptive than what we do today)?

I would like to collect opinions on this; I think it could be an additional small breaking change to make in Lucene 10 if there is appeal for it.
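Question 1 above — deriving the concurrency requirement from the slices rather than a caller-supplied boolean — could look roughly like this. All names are hypothetical stand-ins (the `LeafSlice` record here is not Lucene's `IndexSearcher.LeafSlice`); the point is only that the manager decides for itself:

```java
public class SlicesAwareManagerSketch {

  // Stand-in for a searcher's leaf slice; only the count matters here.
  public record LeafSlice(int[] leafOrdinals) {}

  private final boolean supportsConcurrency;

  // Deriving the decision from the slices removes the error-prone boolean
  // flag: zero or one slice means the cheaper sequential collector path is
  // safe, even if the searcher happens to have an executor set.
  public SlicesAwareManagerSketch(LeafSlice[] slices) {
    this.supportsConcurrency = slices != null && slices.length > 1;
  }

  public boolean supportsConcurrency() {
    return supportsConcurrency;
  }
}
```

This mirrors the `TotalHitCountCollectorManager` approach the issue cites: the caller passes what it already has (the slices), and the manager picks the collector flavour, so a mismatch between flag and executor can no longer happen.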