Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1533418454 ## lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java: ## @@ -445,8 +456,15 @@ private static String rateToString(double mbPerSec) { } @Override - public void close() { -sync(); + public void close() throws IOException { +super.close(); Review Comment: nit: should we try to close as best as possible in the event of an exception, e.g. doing something like below ```java IOUtils.close( this::sync, super::close, intraMergeExecutor == null ? null : intraMergeExecutor::shutdown); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1533418454 ## lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java: ## @@ -445,8 +456,15 @@ private static String rateToString(double mbPerSec) { } @Override - public void close() { -sync(); + public void close() throws IOException { +super.close(); Review Comment: nit: should we try to close as cleanly as possible in the event of an exception, e.g. doing something like below ```java IOUtils.close( this::sync, super::close, intraMergeExecutor == null ? null : intraMergeExecutor::shutdown); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]
ChrisHegarty commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2011662051 > @ChrisHegarty: For MMapDirectory there's another way for madvise to work (at least for reading index files): When the user has enabled native Panama access, we can also mmap our MemorySegments with our own code without using FileChannel, because we have to deal with file descriptor anyways and native access was already enabled. In fact we close the FileChannel after mapping anyways, so the FileChannel is completely useless for MMapDirectory code. We would need code for Posix and Windows. Yes, that is exactly what I've been thinking too. We're leaning more and more into FFI with Elasticsearch and it is proving to both behave and perform very well. Abstracting the functionality at a sightly higher-level may allow to have a Posix implementation and a fallback to an implementation based FileChannel. The latter would cover Windows - possibly ignoring the "hints" for random access. This could be a start, allowing to add more platform support as needed. > For writing files it is more complicated, so setting fadvise there also involves replicating the whole (File-)OutputStream used for writing index files. Ugh!! not great. We need either the memory address or the file descriptor. Not to mention that we map in 16GB chunks, so may require a little more plumbing (but nothing too hard). Is the mapping in chunks still needed? Did you @uschindler encounter issues when mapping larger chunks? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2011672091 > Not to mention that we map in 16GB chunks, so may require a little more plumbing (but nothing too hard). Is the mapping in chunks still needed? Did you @uschindler encounter issues when mapping larger chunks? Yes, tests on Windows showed havoc with large files. But this is not a problem form MemorySegmentIndexInputProvider, the mmap call is easy to do, the splicing is already there. We just have two implementations for the filename -> MemorySegment[]. Easy, no changes needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
mikemccand commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2011765808 > Thanks for pushing on this change, I like it. The fact that the extra merge concurrency may not starve merging from threads is good. I'm curious how the nightly benchmarks will react to it, given the high number of CPU cores of beast3. Nightly benchy is still forcing its own thread pool into the HNSW indexing Codec component (`Lucene99Hnsw*VectorsFormat`) -- once this change is merged, I'll remove that so we switch to this awesome default approach. But maybe we won't see too much change in the nightly indexing throughput since it's already doing intra-merge concurrency "itself". > One question on my mind is whether this change should make us update the default number of merging threads that `ConcurrentMergeScheduler` configures (in a follow-up issue/PR). +1 to explore this. Nightly benchy hardwires the maxMergeCount=16, maxThreadCount=12. Maybe they should be higher :) But also note that the nightly benchy does not wait for merges on `IndexWriter.close`, so it's not really a fair test of merge performance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2011813580 > Nightly benchy hardwires the maxMergeCount=16, maxThreadCount=12 Do you remember how you came up with these numbers? Is there some reasoning behind these numbers, or do they come from experimentation? I'm also curious if you know where the current `maxThreadCount = max(1, min(4, coreCount / 2))` is coming from. Is the `4` number trying to approximate the number of tiers of `TieredMergePolicy` (max seg size = 5GB, 500MB, 50MB, 5MB ~= 2MB = min seg size) in order to allow one merge on each tier to run concurrently? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2012002493 I just noticed, for madvise no file descriptor is needed. We can implement that straight away using the memory segments. So fadvise is the bigger problem needed for writing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]
ChrisHegarty commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2012042036 > I just noticed, for madvise no file descriptor is needed. We can implement that straight away using the memory segments. ++ I think we have all we need enable this for reading - through a "hints" style API in IOContext. I can take a look at it, but probably not until next week. I'm sure you already know this, but here's how we've been abstracting FFI in Elasticsearch, https://github.com/elastic/elasticsearch/tree/main/libs/native/src/main/java/org/elasticsearch/nativeaccess . While the use case is not the same here, the necessary abstractions and framework are not unalike. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2012136360 > > I just noticed, for madvise no file descriptor is needed. We can implement that straight away using the memory segments. > > ++ I think we have all we need enable this for reading - through a "hints" style API in IOContext. > > I can take a look at it, but probably not until next week. I'm sure you already know this, but here's how we've been abstracting FFI in Elasticsearch, https://github.com/elastic/elasticsearch/tree/main/libs/native/src/main/java/org/elasticsearch/nativeaccess . While the use case is not the same here, the necessary abstractions and framework are not unalike. Thanks, this is great. For the use-case here, I think abstractions are not needed, so we can implement it in our MR-JAR: - we only need `defaultLinker()` so no loadLibrary - looking up function addresses will be guarded with tryCatch and when we get a MethodHandle for `madvise()` or `posix_madvise()` we store it in the provider. If we can't find it or native access was denied by JVM (not allowed to module or by command line) we make it a noop MethodHandle or set it null - after when FileChannel returns a memorySegment, we just call madvise depending on IOContext with the MemorySegment as first parameter and the slice (0...length of segment): https://github.com/apache/lucene/blob/d5f8853fda8f488d73aed09d376e1c1c3c9ea85f/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java#L96 I will hack a draft PR with the basic setup later (only in main branch with Java 21), exposing madvise() and mlock2() ase examples and then we can do tests on Linux (maybe also macos) by passing different options depending on IOContext. P.S.: The `IOContext` is already available to the provider class (I did this in the original API design for exactly that use case; currently the parameter is unused): https://github.com/apache/lucene/blob/d5f8853fda8f488d73aed09d376e1c1c3c9ea85f/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java#L33 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Create a read-only index that drops index files not needed for searching [lucene]
mikemccand commented on issue #13158: URL: https://github.com/apache/lucene/issues/13158#issuecomment-2012195070 This is a cool idea @msokolov. It is wasteful to lug around those `float32` precision vectors out to the searchers in an NRT segment replication architecture. In practice, they would consume disk space on the searchers, and waste time copying them out, but since the OS would never load them at search time, their bytes would remain cold on disk and not put much pressure on OS free RAM? The OS would only cache the disk pages in the index that are actually needed at search time. It would be nice not to copy all that deadweight around ... Probably the solution would have to be something like segment to segment? I.e. for each segment in the index, we would make a corresponding "read only" segment (stripped of the `float32` vectors). This way, as the normal index changes (gets new flushed/merged segments), we could also incrementally/NRT maintain the shadow read-only index. I wonder if there are other things in a Lucene index today that are needed only during indexing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Create a read-only index that drops index files not needed for searching [lucene]
benwtrent commented on issue #13158: URL: https://github.com/apache/lucene/issues/13158#issuecomment-2012237415 Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately require to be merged, which then is a "write"? For a quicker win in scalar quantization, it could be cheaper or easier to have a configured threshold where we throw away the floating point as we know the distribution of the values won't change significantly from that point on. Then on merges, if adjustments are required, we assume the cost of de-quantizing and re-quantizing. This can help relevancy more than you would expect, but obviously not as much as having access to the raw values. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent merged PR #13190: URL: https://github.com/apache/lucene/pull/13190 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
uschindler opened a new pull request, #13196: URL: https://github.com/apache/lucene/pull/13196 This is a first idea how we can use Panama Foreign to pass `madvise()` hints to the kernel when mapping memory segments. The code looks up the function pointer from stdlib (libc) on Linux and Macos (untested, but should work) and then invokes `madvise()` for all MemorySegments we have mmapped when the following is true: - IOContext#readOnce is set - The chunk size is large enough (at least 8192 means `chunkSizePower>=13`) - this prevents TestMultiMMap from failing because for very small mappings (as done by this test), the `FileChannel#map` call will produce unaligned memory segments (it uses some tricks and maps larger segments and returns slices - which are no longer pageSize aligned) - There is a noop implementation doing nothing which is choosen if you disable native access Interestingly it works without any extra parameters to command line (at least in Java 21). This is a draft only to do some performance tests and extend the IOContext interpretation to try out more possibilities. The current "readOnce => MADV_SEQUENTIAL" is just an example as this is the main issue: We merge segments and don't want the soon to be trashed segments be sticky in RAM. MADV_SEQUENTIAL instructs kernel to forget about the mappings and also do readahead which helps during merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2012529188 I created a draft PR for IOContext#readOnce: #13196 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Create a read-only index that drops index files not needed for searching [lucene]
msokolov commented on issue #13158: URL: https://github.com/apache/lucene/issues/13158#issuecomment-2012935594 > Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately require to be merged, which then is a "write"? The goal of segment-replication is to completely separate searching from writing, so in that world, no merging is done by searchers -- it happens upstream in a writer/indexer, or perhaps in a dedicated merger (we have both setups going on). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
jpountz commented on PR #13196: URL: https://github.com/apache/lucene/pull/13196#issuecomment-2012967809 I remember this kind of things being discussed more than 10 years ago, it's extremely exciting to see it close to being included in the default `Directory`! > current "readOnce => MADV_SEQUENTIAL" is just an example as this is the main issue: We merge segments and don't want the soon to be trashed segments be sticky in I believe that we don't use `readOnce` for merging, but for tiny metadata files that we load fully in memory. That said, I don't think that this should block this change, we can later look into the proper way to detect whether to pass `MADV_SEQUENTIAL` based on the `IOContext`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
ChrisHegarty commented on code in PR #13196: URL: https://github.com/apache/lucene/pull/13196#discussion_r1534254114 ## lucene/core/src/java21/org/apache/lucene/store/NativeAccess.java: ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.lang.foreign.MemorySegment; +import java.util.logging.Logger; +import org.apache.lucene.util.Constants; + +@SuppressWarnings("preview") +abstract class NativeAccess { + private static final Logger LOG = Logger.getLogger(NativeAccess.class.getName()); + + // these constants were extracted form glibc and macos header files - luckily they are the same: + + /** No further special treatment. */ + public static final int POSIX_MADV_NORMAL = 0; + + /** Expect random page references. */ + public static final int POSIX_MADV_RANDOM = 1; + + /** Expect sequential page references. */ + public static final int POSIX_MADV_SEQUENTIAL = 2; + + /** Will need these pages. */ + public static final int POSIX_MADV_WILLNEED = 3; + + /** Don't need these pages. */ + public static final int POSIX_MADV_DONTNEED = 4; + + /** Invoke the {@code madvise} call for the given {@link MemorySegment}. */ + public abstract void madvise(MemorySegment segment, int advise) throws IOException; + + /** + * Return the NativeAccess instance for this platform. At moment we only support Linux and MacOS + */ + public static NativeAccess getImplementation() { +if (Constants.LINUX || Constants.MAC_OS_X) { Review Comment: For me, this is the right level of abstraction. Support can be added for other platforms if and when desired - no need to block anything on Windows! ;-) ## lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java: ## @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.lang.foreign.FunctionDescriptor; +import java.lang.foreign.Linker; +import java.lang.foreign.MemorySegment; +import java.lang.foreign.SymbolLookup; +import java.lang.foreign.ValueLayout; +import java.lang.invoke.MethodHandle; +import java.util.Locale; + +@SuppressWarnings("preview") +final class PosixNativeAccess extends NativeAccess { + + private final MethodHandle mh$posix_madvise; + + PosixNativeAccess() { +final Linker linker = Linker.nativeLinker(); +final SymbolLookup stdlib = linker.defaultLookup(); +this.mh$posix_madvise = +findFunction( +linker, +stdlib, +"posix_madvise", +FunctionDescriptor.of( +ValueLayout.JAVA_INT, +ValueLayout.ADDRESS, +ValueLayout.JAVA_LONG, +ValueLayout.JAVA_INT)); + } + + private static MethodHandle findFunction( + Linker linker, SymbolLookup lookup, String name, FunctionDescriptor desc) { +final MemorySegment symbol = +lookup +.find(name) +.orElseThrow( +() -> +new UnsupportedOperationException( +"Platform has no symbol for '" + name + "' in libc.")); +return linker.downcallHandle(symbol, desc); Review Comment: we can use captureCallState to capture the errno if this fails. https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/foreign/Linker.Option.html#captureCall
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
uschindler commented on code in PR #13196: URL: https://github.com/apache/lucene/pull/13196#discussion_r1534481239 ## lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java: ## @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.lang.foreign.FunctionDescriptor; +import java.lang.foreign.Linker; +import java.lang.foreign.MemorySegment; +import java.lang.foreign.SymbolLookup; +import java.lang.foreign.ValueLayout; +import java.lang.invoke.MethodHandle; +import java.util.Locale; + +@SuppressWarnings("preview") +final class PosixNativeAccess extends NativeAccess { + + private final MethodHandle mh$posix_madvise; + + PosixNativeAccess() { +final Linker linker = Linker.nativeLinker(); +final SymbolLookup stdlib = linker.defaultLookup(); +this.mh$posix_madvise = +findFunction( +linker, +stdlib, +"posix_madvise", +FunctionDescriptor.of( +ValueLayout.JAVA_INT, +ValueLayout.ADDRESS, +ValueLayout.JAVA_LONG, +ValueLayout.JAVA_INT)); + } + + private static MethodHandle findFunction( + Linker linker, SymbolLookup lookup, String name, FunctionDescriptor desc) { +final MemorySegment symbol = +lookup +.find(name) +.orElseThrow( +() -> +new UnsupportedOperationException( +"Platform has no symbol for '" + name + "' in libc.")); +return linker.downcallHandle(symbol, desc); Review Comment: That's not needed for `posix_madvise()`; it returns the errno code directly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
uschindler commented on PR #13196: URL: https://github.com/apache/lucene/pull/13196#issuecomment-2013348340 I refactored the code a bit: - @jpountz I added the merge context (I checked DirectIODirectory). - I retrieve the pageSize to check if a segment is correctly aligned. Unfortunately this is mega-stupid. I tend to remove it again (the constant `_SC_PAGESIZE` is part of a C `enum` and its value is different in various platforms as no fixed value is assigned) and maybe keep a hardcoded page size to handle the case. Maybe just check that it is a multiple of 64K (native page size is 8K, but may be different), because it is normally only an issue in this test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
uschindler commented on PR #13196: URL: https://github.com/apache/lucene/pull/13196#issuecomment-2013453556 I removed the page size retrieval again. No need for it, we just be safe and only apply the advice, when the chunk size is large enough. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
uschindler commented on PR #13196: URL: https://github.com/apache/lucene/pull/13196#issuecomment-2013499830 Unfortunately after the problems with other constants I tend to hide the advice constants in the class any use some abstraction, so we can call We should maybe also simply remove the NoopNativeAccess and do a null check. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] New int4 scalar quantization [lucene]
benwtrent opened a new pull request, #13197: URL: https://github.com/apache/lucene/pull/13197 So, there are a handful of new and interesting things that this PR adds: - Confidence interval optimizations, unlocked even smaller quantization bytes - New int4 codec which is 8x smaller than float32 with exceptionally good recall & faster search But, its not complete yet. While I was finishing up the code here, it came to my mind, why is this its own codec? Why shouldn't we add a new option to `Lucene99ScalarQuantizedVectorsFormat` (really it would be `Lucene911ScalarQuantizedVectorsFormat`) that allows the quantization `bits` to be set. I know we don't want a ton of configuration for each codec, but to me this seemed reasonable. What do you think @jpountz? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
ChrisHegarty commented on code in PR #13196: URL: https://github.com/apache/lucene/pull/13196#discussion_r1534598790 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java: ## @@ -92,10 +110,19 @@ private final MemorySegment[] map( } if (preload) { segment.load(); + } else if (segSize > 0L && advice.isPresent()) { // not when preloading! +nativeAccess.madvise(segment, advice.getAsInt()); } segments[segNr] = segment; startOffset += segSize; } return segments; } + + private OptionalInt mapContextToMadvise(IOContext context) { +if (context.readOnce || context.context == Context.MERGE) { + return OptionalInt.of(NativeAccess.POSIX_MADV_SEQUENTIAL); +} Review Comment: I'd like to plumb `POSIX_MADV_RANDOM`, for use when reading from flat vector data. Any objection do doing it in this PR, or you wanna wait to get this in first? ## lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java: ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.lang.foreign.FunctionDescriptor; +import java.lang.foreign.Linker; +import java.lang.foreign.MemorySegment; +import java.lang.foreign.SymbolLookup; +import java.lang.foreign.ValueLayout; +import java.lang.invoke.MethodHandle; +import java.util.Locale; +import java.util.logging.Logger; + +@SuppressWarnings("preview") +final class PosixNativeAccess extends NativeAccess { + + private static final Logger LOG = Logger.getLogger(PosixNativeAccess.class.getName()); + + private final MethodHandle mh$posix_madvise; Review Comment: we should probably make this MethodHandle `static final`. In which case the whole instance construction can be cleaned up a bit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
ChrisHegarty commented on code in PR #13196: URL: https://github.com/apache/lucene/pull/13196#discussion_r1534636363 ## lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java: ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.lang.foreign.FunctionDescriptor; +import java.lang.foreign.Linker; +import java.lang.foreign.MemorySegment; +import java.lang.foreign.SymbolLookup; +import java.lang.foreign.ValueLayout; +import java.lang.invoke.MethodHandle; +import java.util.Locale; +import java.util.logging.Logger; + +@SuppressWarnings("preview") +final class PosixNativeAccess extends NativeAccess { + + private static final Logger LOG = Logger.getLogger(PosixNativeAccess.class.getName()); + + private final MethodHandle mh$posix_madvise; Review Comment: I pushed a small change to refactor this. Revert or refactor as you like, but the MH is now static final. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
ChrisHegarty commented on code in PR #13196: URL: https://github.com/apache/lucene/pull/13196#discussion_r1534678873 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java: ## @@ -92,10 +110,19 @@ private final MemorySegment[] map( } if (preload) { segment.load(); + } else if (segSize > 0L && advice.isPresent()) { // not when preloading! +nativeAccess.madvise(segment, advice.getAsInt()); } segments[segNr] = segment; startOffset += segSize; } return segments; } + + private OptionalInt mapContextToMadvise(IOContext context) { +if (context.readOnce || context.context == Context.MERGE) { + return OptionalInt.of(NativeAccess.POSIX_MADV_SEQUENTIAL); +} Review Comment: https://github.com/uschindler/lucene/pull/3 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
uschindler commented on code in PR #13196: URL: https://github.com/apache/lucene/pull/13196#discussion_r1534727218 ## lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java: ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.lang.foreign.FunctionDescriptor; +import java.lang.foreign.Linker; +import java.lang.foreign.MemorySegment; +import java.lang.foreign.SymbolLookup; +import java.lang.foreign.ValueLayout; +import java.lang.invoke.MethodHandle; +import java.util.Locale; +import java.util.logging.Logger; + +@SuppressWarnings("preview") +final class PosixNativeAccess extends NativeAccess { + + private static final Logger LOG = Logger.getLogger(PosixNativeAccess.class.getName()); + + private final MethodHandle mh$posix_madvise; Review Comment: I don't like this, because now the Unsupported operation exception is triggered by static initializer and is causing NoClassDefFoundError. If you look at the caller of ctor you will see the try catch. I added this because for some reason it could fail to find the symbol (e.g. new libc version,...). So I want it bullet proof. The non static method handle is not a problem, it's called not at perf critical places. The constructor is only called once. Can you revert? If you really want it static we would need another logic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]
ChrisHegarty commented on code in PR #13196: URL: https://github.com/apache/lucene/pull/13196#discussion_r1534775594 ## lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java: ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.lang.foreign.FunctionDescriptor; +import java.lang.foreign.Linker; +import java.lang.foreign.MemorySegment; +import java.lang.foreign.SymbolLookup; +import java.lang.foreign.ValueLayout; +import java.lang.invoke.MethodHandle; +import java.util.Locale; +import java.util.logging.Logger; + +@SuppressWarnings("preview") +final class PosixNativeAccess extends NativeAccess { + + private static final Logger LOG = Logger.getLogger(PosixNativeAccess.class.getName()); + + private final MethodHandle mh$posix_madvise; Review Comment: I’ll take another look or revert tomorrow. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation [lucene]
msfroh commented on issue #13029: URL: https://github.com/apache/lucene/issues/13029#issuecomment-2014023729 I think the `else` clause for the cost estimate is also not great. I came across this same problem where a user was essentially running a single-term `TermInSetQuery` (that actually matched a single doc) AND a numeric range query that matches millions of docs. I was shocked when profiler output showed a lot of time spent in the BKReader -- the IndexOrDocValues query over the range query should have clearly taken the doc values path. Let's say you have a string field with 10 million distinct values (so 10 million terms), and they match 20 million documents (with individual terms matching 1-3 docs, say). My read is that this `estimateCost()` function will say that a `TermInSetQuery` over a single term has cost 10,000,001 (i.e. 20 million `sumDocFreq` minus 10 million for the terms `size`, plus 1 for the query term count). I get that the absolute worst case is that 9,999,999 terms each have doc freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as a cost estimate for a query that is just going to rewrite to a single `TermQuery` with cost <= 3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation [lucene]
msfroh commented on issue #13029: URL: https://github.com/apache/lucene/issues/13029#issuecomment-2014084615 I get that part of the point of this cost estimate is to avoid the (potentially-expensive) rewrite if, e.g. we can do a doc-value rewrite instead, but I'm thinking we could do something a little bit more term-aware. How expensive is `MultiTermQuery#getTermsEnum()`? I think it's cheap, but I'm not positive. Maybe we could get it, try walking through it, summing doc freqs, giving up if the number of terms gets big. We could go down that path only if the cost estimate from the existing logic is very high. I can try sketching out a PR with a test for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org