Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


jpountz commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1533418454


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -445,8 +456,15 @@ private static String rateToString(double mbPerSec) {
   }
 
   @Override
-  public void close() {
-sync();
+  public void close() throws IOException {
+super.close();

Review Comment:
   nit: should we try to close as best as possible in the event of an 
exception, e.g. doing something like below
   
   ```java
   IOUtils.close(
   this::sync,
   super::close,
   intraMergeExecutor == null ? null : intraMergeExecutor::shutdown);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


jpountz commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1533418454


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -445,8 +456,15 @@ private static String rateToString(double mbPerSec) {
   }
 
   @Override
-  public void close() {
-sync();
+  public void close() throws IOException {
+super.close();

Review Comment:
   nit: should we try to close as cleanly as possible in the event of an 
exception, e.g. doing something like below
   
   ```java
   IOUtils.close(
   this::sync,
   super::close,
   intraMergeExecutor == null ? null : intraMergeExecutor::shutdown);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-21 Thread via GitHub


ChrisHegarty commented on issue #13194:
URL: https://github.com/apache/lucene/issues/13194#issuecomment-2011662051

   > @ChrisHegarty: For MMapDirectory there's another way for madvise to work 
(at least for reading index files): When the user has enabled native Panama 
access, we can also mmap our MemorySegments with our own code without using 
FileChannel, because we have to deal with file descriptor anyways and native 
access was already enabled. In fact we close the FileChannel after mapping 
anyways, so the FileChannel is completely useless for MMapDirectory code. We 
would need code for Posix and Windows.
   
   Yes, that is exactly what I've been thinking too. We're leaning more and 
more into FFI with Elasticsearch and it is proving to both behave and perform 
very well. Abstracting the functionality at a sightly higher-level may allow to 
have a Posix implementation and a fallback to an implementation based 
FileChannel. The latter would cover Windows - possibly ignoring the "hints" for 
random access. This could be a start, allowing to add more platform support as 
needed.
   
   > For writing files it is more complicated, so setting fadvise there also 
involves replicating the whole (File-)OutputStream used for writing index files.
   
   Ugh!! not great. We need either the memory address or the file descriptor.
   
   Not to mention that we map in 16GB chunks, so may require a little more 
plumbing (but nothing too hard).  Is the mapping in chunks still needed? Did 
you @uschindler encounter issues when mapping larger chunks?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-21 Thread via GitHub


uschindler commented on issue #13194:
URL: https://github.com/apache/lucene/issues/13194#issuecomment-2011672091

   > Not to mention that we map in 16GB chunks, so may require a little more 
plumbing (but nothing too hard). Is the mapping in chunks still needed? Did you 
@uschindler encounter issues when mapping larger chunks?
   
   Yes, tests on Windows showed havoc with large files.
   
   But this is not a problem form MemorySegmentIndexInputProvider, the mmap 
call is easy to do, the splicing is already there. We just have two 
implementations for the filename -> MemorySegment[]. Easy, no changes needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


mikemccand commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2011765808

   > Thanks for pushing on this change, I like it. The fact that the extra 
merge concurrency may not starve merging from threads is good. I'm curious how 
the nightly benchmarks will react to it, given the high number of CPU cores of 
beast3.
   
   Nightly benchy is still forcing its own thread pool into the HNSW indexing 
Codec component (`Lucene99Hnsw*VectorsFormat`) -- once this change is merged, 
I'll remove that so we switch to this awesome default approach.  But maybe we 
won't see too much change in the nightly indexing throughput since it's already 
doing intra-merge concurrency "itself".

   > One question on my mind is whether this change should make us update the 
default number of merging threads that `ConcurrentMergeScheduler` configures 
(in a follow-up issue/PR).
   
   +1 to explore this.  Nightly benchy hardwires the maxMergeCount=16, 
maxThreadCount=12.  Maybe they should be higher :)
   
   But also note that the nightly benchy does not wait for merges on 
`IndexWriter.close`, so it's not really a fair test of merge performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


jpountz commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2011813580

   > Nightly benchy hardwires the maxMergeCount=16, maxThreadCount=12
   
   Do you remember how you came up with these numbers? Is there some reasoning 
behind these numbers, or do they come from experimentation?
   
   I'm also curious if you know where the current `maxThreadCount = max(1, 
min(4, coreCount / 2))` is coming from. Is the `4` number trying to approximate 
the number of tiers of `TieredMergePolicy` (max seg size = 5GB, 500MB, 50MB, 
5MB ~= 2MB = min seg size) in order to allow one merge on each tier to run 
concurrently?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-21 Thread via GitHub


uschindler commented on issue #13194:
URL: https://github.com/apache/lucene/issues/13194#issuecomment-2012002493

   I just noticed, for madvise no file descriptor is needed. We can implement 
that straight away using the memory segments.
   
   So fadvise is the bigger problem needed for writing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-21 Thread via GitHub


ChrisHegarty commented on issue #13194:
URL: https://github.com/apache/lucene/issues/13194#issuecomment-2012042036

   > I just noticed, for madvise no file descriptor is needed. We can implement 
that straight away using the memory segments.
   
   ++ I think we have all we need enable this for reading - through a "hints" 
style API in IOContext.  
   
   I can take a look at it, but probably not until next week. I'm sure you 
already know this, but here's how we've been abstracting FFI in Elasticsearch,  
 
https://github.com/elastic/elasticsearch/tree/main/libs/native/src/main/java/org/elasticsearch/nativeaccess
 . While the use case is not the same here, the necessary abstractions and 
framework are not unalike.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-21 Thread via GitHub


uschindler commented on issue #13194:
URL: https://github.com/apache/lucene/issues/13194#issuecomment-2012136360

   > > I just noticed, for madvise no file descriptor is needed. We can 
implement that straight away using the memory segments.
   > 
   > ++ I think we have all we need enable this for reading - through a "hints" 
style API in IOContext.
   > 
   > I can take a look at it, but probably not until next week. I'm sure you 
already know this, but here's how we've been abstracting FFI in Elasticsearch, 
https://github.com/elastic/elasticsearch/tree/main/libs/native/src/main/java/org/elasticsearch/nativeaccess
 . While the use case is not the same here, the necessary abstractions and 
framework are not unalike.
   
   Thanks, this is great. For the use-case here, I think abstractions are not 
needed, so we can implement it in our MR-JAR:
   - we only need `defaultLinker()` so no loadLibrary
   - looking up function addresses will be guarded with tryCatch and when we 
get a MethodHandle for `madvise()` or `posix_madvise()` we store it in the 
provider. If we can't find it or native access was denied by JVM (not allowed 
to module or by command line) we make it a noop MethodHandle or set it null
   - after when FileChannel returns a memorySegment, we just call madvise 
depending on IOContext with the MemorySegment as first parameter and the slice 
(0...length of segment): 
https://github.com/apache/lucene/blob/d5f8853fda8f488d73aed09d376e1c1c3c9ea85f/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java#L96
   
   I will hack a draft PR with the basic setup later (only in main branch with 
Java 21), exposing madvise() and mlock2() ase examples and then we can do tests 
on Linux (maybe also macos) by passing different options depending on IOContext.
   
   P.S.: The `IOContext` is already available to the provider class (I did this 
in the original API design for exactly that use case; currently the parameter 
is unused): 
https://github.com/apache/lucene/blob/d5f8853fda8f488d73aed09d376e1c1c3c9ea85f/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java#L33


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Create a read-only index that drops index files not needed for searching [lucene]

2024-03-21 Thread via GitHub


mikemccand commented on issue #13158:
URL: https://github.com/apache/lucene/issues/13158#issuecomment-2012195070

   This is a cool idea @msokolov.  It is wasteful to lug around those `float32` 
precision vectors out to the searchers in an NRT segment replication 
architecture.  In practice, they would consume disk space on the searchers, and 
waste time copying them out, but since the OS would never load them at search 
time, their bytes would remain cold on disk and not put much pressure on OS 
free RAM?  The OS would only cache the disk pages in the index that are 
actually needed at search time.  It would be nice not to copy all that 
deadweight around ...
   
   Probably the solution would have to be something like segment to segment?  
I.e. for each segment in the index, we would make a corresponding "read only" 
segment (stripped of the `float32` vectors).  This way, as the normal index 
changes (gets new flushed/merged segments), we could also incrementally/NRT 
maintain the shadow read-only index.
   
   I wonder if there are other things in a Lucene index today that are needed 
only during indexing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Create a read-only index that drops index files not needed for searching [lucene]

2024-03-21 Thread via GitHub


benwtrent commented on issue #13158:
URL: https://github.com/apache/lucene/issues/13158#issuecomment-2012237415

   Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a 
segment, it will ultimately require to be merged, which then is a "write"?
   
   For a quicker win in scalar quantization, it could be cheaper or easier to 
have a configured threshold where we throw away the floating point as we know 
the distribution of the values won't change significantly from that point on. 
Then on merges, if adjustments are required, we assume the cost of 
de-quantizing and re-quantizing. This can help relevancy more than you would 
expect, but obviously not as much as having access to the raw values.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


benwtrent merged PR #13190:
URL: https://github.com/apache/lucene/pull/13190


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


uschindler opened a new pull request, #13196:
URL: https://github.com/apache/lucene/pull/13196

   This is a first idea how we can use Panama Foreign to pass `madvise()` hints 
to the kernel when mapping memory segments.
   
   The code looks up the function pointer from stdlib (libc) on Linux and Macos 
(untested, but should work) and then invokes `madvise()` for all MemorySegments 
we have mmapped when the following is true:
   - IOContext#readOnce is set
   - The chunk size is large enough (at least 8192 means `chunkSizePower>=13`) 
- this prevents TestMultiMMap from failing because for very small mappings (as 
done by this test), the `FileChannel#map` call will produce unaligned memory 
segments (it uses some tricks and maps larger segments and returns slices - 
which are no longer pageSize aligned)
   - There is a noop implementation doing nothing which is choosen if you 
disable native access
   
   Interestingly it works without any extra parameters to command line (at 
least in Java 21).
   
   This is a draft only to do some performance tests and extend the IOContext 
interpretation to try out more possibilities. The current "readOnce => 
MADV_SEQUENTIAL" is just an example as this is the main issue: We merge 
segments and don't want the soon to be trashed segments be sticky in RAM. 
MADV_SEQUENTIAL instructs kernel to forget about the mappings and also do 
readahead which helps during merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-21 Thread via GitHub


uschindler commented on issue #13194:
URL: https://github.com/apache/lucene/issues/13194#issuecomment-2012529188

   I created a draft PR for IOContext#readOnce: #13196


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Create a read-only index that drops index files not needed for searching [lucene]

2024-03-21 Thread via GitHub


msokolov commented on issue #13158:
URL: https://github.com/apache/lucene/issues/13158#issuecomment-2012935594

   > Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a 
segment, it will ultimately require to be merged, which then is a "write"?
   
   The goal of segment-replication is to completely separate searching from 
writing, so in that world, no merging is done by searchers -- it happens 
upstream in a writer/indexer, or perhaps in a dedicated merger (we have both 
setups going on).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


jpountz commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2012967809

   I remember this kind of things being discussed more than 10 years ago, it's 
extremely exciting to see it close to being included in the default `Directory`!
   
   > current "readOnce => MADV_SEQUENTIAL" is just an example as this is the 
main issue: We merge segments and don't want the soon to be trashed segments be 
sticky in
   
   I believe that we don't use `readOnce` for merging, but for tiny metadata 
files that we load fully in memory. That said, I don't think that this should 
block this change, we can later look into the proper way to detect whether to 
pass `MADV_SEQUENTIAL` based on the `IOContext`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


ChrisHegarty commented on code in PR #13196:
URL: https://github.com/apache/lucene/pull/13196#discussion_r1534254114


##
lucene/core/src/java21/org/apache/lucene/store/NativeAccess.java:
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.lang.foreign.MemorySegment;
+import java.util.logging.Logger;
+import org.apache.lucene.util.Constants;
+
+@SuppressWarnings("preview")
+abstract class NativeAccess {
+  private static final Logger LOG = 
Logger.getLogger(NativeAccess.class.getName());
+
+  // these constants were extracted form glibc and macos header files - 
luckily they are the same:
+
+  /** No further special treatment. */
+  public static final int POSIX_MADV_NORMAL = 0;
+
+  /** Expect random page references. */
+  public static final int POSIX_MADV_RANDOM = 1;
+
+  /** Expect sequential page references. */
+  public static final int POSIX_MADV_SEQUENTIAL = 2;
+
+  /** Will need these pages. */
+  public static final int POSIX_MADV_WILLNEED = 3;
+
+  /** Don't need these pages. */
+  public static final int POSIX_MADV_DONTNEED = 4;
+
+  /** Invoke the {@code madvise} call for the given {@link MemorySegment}. */
+  public abstract void madvise(MemorySegment segment, int advise) throws 
IOException;
+
+  /**
+   * Return the NativeAccess instance for this platform. At moment we only 
support Linux and MacOS
+   */
+  public static NativeAccess getImplementation() {
+if (Constants.LINUX || Constants.MAC_OS_X) {

Review Comment:
   For me, this is the right level of abstraction. Support can be added for 
other platforms if and when desired - no need to block anything on Windows! ;-) 



##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.foreign.ValueLayout;
+import java.lang.invoke.MethodHandle;
+import java.util.Locale;
+
+@SuppressWarnings("preview")
+final class PosixNativeAccess extends NativeAccess {
+
+  private final MethodHandle mh$posix_madvise;
+
+  PosixNativeAccess() {
+final Linker linker = Linker.nativeLinker();
+final SymbolLookup stdlib = linker.defaultLookup();
+this.mh$posix_madvise =
+findFunction(
+linker,
+stdlib,
+"posix_madvise",
+FunctionDescriptor.of(
+ValueLayout.JAVA_INT,
+ValueLayout.ADDRESS,
+ValueLayout.JAVA_LONG,
+ValueLayout.JAVA_INT));
+  }
+
+  private static MethodHandle findFunction(
+  Linker linker, SymbolLookup lookup, String name, FunctionDescriptor 
desc) {
+final MemorySegment symbol =
+lookup
+.find(name)
+.orElseThrow(
+() ->
+new UnsupportedOperationException(
+"Platform has no symbol for '" + name + "' in libc."));
+return linker.downcallHandle(symbol, desc);

Review Comment:
   we can use captureCallState to capture the errno if this fails.
   
   
https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/foreign/Linker.Option.html#captureCall

Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


uschindler commented on code in PR #13196:
URL: https://github.com/apache/lucene/pull/13196#discussion_r1534481239


##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.foreign.ValueLayout;
+import java.lang.invoke.MethodHandle;
+import java.util.Locale;
+
+@SuppressWarnings("preview")
+final class PosixNativeAccess extends NativeAccess {
+
+  private final MethodHandle mh$posix_madvise;
+
+  PosixNativeAccess() {
+final Linker linker = Linker.nativeLinker();
+final SymbolLookup stdlib = linker.defaultLookup();
+this.mh$posix_madvise =
+findFunction(
+linker,
+stdlib,
+"posix_madvise",
+FunctionDescriptor.of(
+ValueLayout.JAVA_INT,
+ValueLayout.ADDRESS,
+ValueLayout.JAVA_LONG,
+ValueLayout.JAVA_INT));
+  }
+
+  private static MethodHandle findFunction(
+  Linker linker, SymbolLookup lookup, String name, FunctionDescriptor 
desc) {
+final MemorySegment symbol =
+lookup
+.find(name)
+.orElseThrow(
+() ->
+new UnsupportedOperationException(
+"Platform has no symbol for '" + name + "' in libc."));
+return linker.downcallHandle(symbol, desc);

Review Comment:
   That's not needed for `posix_madvise()`; it returns the errno code directly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


uschindler commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2013348340

   I refactored the code a bit:
   - @jpountz I added the merge context (I checked DirectIODirectory).
   - I retrieve the pageSize to check if a segment is correctly aligned. 
Unfortunately this is mega-stupid. I tend to remove it again (the constant 
`_SC_PAGESIZE` is part of a C `enum` and its value is different in various 
platforms as no fixed value is assigned) and maybe keep a hardcoded page size 
to handle the case. Maybe just check that it is a multiple of 64K (native page 
size is 8K, but may be different), because it is normally only an issue in this 
test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


uschindler commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2013453556

   I removed the page size retrieval again. No need for it, we just be safe and 
only apply the advice, when the chunk size is large enough.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


uschindler commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2013499830

   Unfortunately after the problems with other constants I tend to hide the 
advice constants in the class any use some abstraction, so we can call
   
   We should maybe also simply remove the NoopNativeAccess and do a null check.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] New int4 scalar quantization [lucene]

2024-03-21 Thread via GitHub


benwtrent opened a new pull request, #13197:
URL: https://github.com/apache/lucene/pull/13197

   So, there are a handful of new and interesting things that this PR adds:
   
- Confidence interval optimizations, unlocked even smaller quantization 
bytes
- New int4 codec which is 8x smaller than float32 with exceptionally good 
recall & faster search
   
   
   But, its not complete yet. While I was finishing up the code here, it came 
to my mind, why is this its own codec? Why shouldn't we add a new option to 
`Lucene99ScalarQuantizedVectorsFormat` (really it would be 
`Lucene911ScalarQuantizedVectorsFormat`) that allows the quantization `bits` to 
be set. I know we don't want a ton of configuration for each codec, but to me 
this seemed reasonable.
   
   What do you think @jpountz?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


ChrisHegarty commented on code in PR #13196:
URL: https://github.com/apache/lucene/pull/13196#discussion_r1534598790


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java:
##
@@ -92,10 +110,19 @@ private final MemorySegment[] map(
   }
   if (preload) {
 segment.load();
+  } else if (segSize > 0L && advice.isPresent()) { // not when preloading!
+nativeAccess.madvise(segment, advice.getAsInt());
   }
   segments[segNr] = segment;
   startOffset += segSize;
 }
 return segments;
   }
+
+  private OptionalInt mapContextToMadvise(IOContext context) {
+if (context.readOnce || context.context == Context.MERGE) {
+  return OptionalInt.of(NativeAccess.POSIX_MADV_SEQUENTIAL);
+}

Review Comment:
   I'd like to plumb `POSIX_MADV_RANDOM`, for use when reading from flat vector 
data. Any objection do doing it in this PR, or you wanna wait to get this in 
first?



##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.foreign.ValueLayout;
+import java.lang.invoke.MethodHandle;
+import java.util.Locale;
+import java.util.logging.Logger;
+
+@SuppressWarnings("preview")
+final class PosixNativeAccess extends NativeAccess {
+
+  private static final Logger LOG = 
Logger.getLogger(PosixNativeAccess.class.getName());
+
+  private final MethodHandle mh$posix_madvise;

Review Comment:
   we should probably make this MethodHandle `static final`. In which case the 
whole instance construction can be cleaned up a bit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


ChrisHegarty commented on code in PR #13196:
URL: https://github.com/apache/lucene/pull/13196#discussion_r1534636363


##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.foreign.ValueLayout;
+import java.lang.invoke.MethodHandle;
+import java.util.Locale;
+import java.util.logging.Logger;
+
+@SuppressWarnings("preview")
+final class PosixNativeAccess extends NativeAccess {
+
+  private static final Logger LOG = 
Logger.getLogger(PosixNativeAccess.class.getName());
+
+  private final MethodHandle mh$posix_madvise;

Review Comment:
   I pushed a small change to refactor this. Revert or refactor as you like, 
but the MH is now static final.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


ChrisHegarty commented on code in PR #13196:
URL: https://github.com/apache/lucene/pull/13196#discussion_r1534678873


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java:
##
@@ -92,10 +110,19 @@ private final MemorySegment[] map(
   }
   if (preload) {
 segment.load();
+  } else if (segSize > 0L && advice.isPresent()) { // not when preloading!
+nativeAccess.madvise(segment, advice.getAsInt());
   }
   segments[segNr] = segment;
   startOffset += segSize;
 }
 return segments;
   }
+
+  private OptionalInt mapContextToMadvise(IOContext context) {
+if (context.readOnce || context.context == Context.MERGE) {
+  return OptionalInt.of(NativeAccess.POSIX_MADV_SEQUENTIAL);
+}

Review Comment:
   https://github.com/uschindler/lucene/pull/3



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


uschindler commented on code in PR #13196:
URL: https://github.com/apache/lucene/pull/13196#discussion_r1534727218


##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.foreign.ValueLayout;
+import java.lang.invoke.MethodHandle;
+import java.util.Locale;
+import java.util.logging.Logger;
+
+@SuppressWarnings("preview")
+final class PosixNativeAccess extends NativeAccess {
+
+  private static final Logger LOG = 
Logger.getLogger(PosixNativeAccess.class.getName());
+
+  private final MethodHandle mh$posix_madvise;

Review Comment:
   I don't like this, because now the Unsupported operation exception is 
triggered by static initializer and is causing NoClassDefFoundError.
   
   If you look at the caller of ctor you will see the try catch.
   
   I added this because for some reason it could fail to find the symbol (e.g. 
new libc version,...). So I want it bullet proof.
   
   The non static method handle is not a problem, it's called not at perf 
critical places. The constructor is only called once.
   
   Can you revert? If you really want it static we would need another logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] DRAFT: Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-21 Thread via GitHub


ChrisHegarty commented on code in PR #13196:
URL: https://github.com/apache/lucene/pull/13196#discussion_r1534775594


##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.foreign.ValueLayout;
+import java.lang.invoke.MethodHandle;
+import java.util.Locale;
+import java.util.logging.Logger;
+
+@SuppressWarnings("preview")
+final class PosixNativeAccess extends NativeAccess {
+
+  private static final Logger LOG = 
Logger.getLogger(PosixNativeAccess.class.getName());
+
+  private final MethodHandle mh$posix_madvise;

Review Comment:
   I’ll take another look or revert tomorrow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation [lucene]

2024-03-21 Thread via GitHub


msfroh commented on issue #13029:
URL: https://github.com/apache/lucene/issues/13029#issuecomment-2014023729

   I think the `else` clause for the cost estimate is also not great. 
   
   I came across this same problem where a user was essentially running a 
single-term `TermInSetQuery` (that actually matched a single doc) AND a numeric 
range query that matches millions of docs. I was shocked when profiler output 
showed a lot of time spent in the BKReader -- the IndexOrDocValues query over 
the range query should have clearly taken the doc values path.
   
   Let's say you have a string field with 10 million distinct values (so 10 
million terms), and they match 20 million documents (with individual terms 
matching 1-3 docs, say). My read is that this `estimateCost()` function will 
say that a `TermInSetQuery` over a single term has cost 10,000,001 (i.e. 20 
million `sumDocFreq` minus 10 million for the terms `size`, plus 1 for the 
query term count). 
   
   I get that the absolute worst case is that 9,999,999 terms each have doc 
freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as 
a cost estimate for a query that is just going to rewrite to a single 
`TermQuery` with cost <= 3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation [lucene]

2024-03-21 Thread via GitHub


msfroh commented on issue #13029:
URL: https://github.com/apache/lucene/issues/13029#issuecomment-2014084615

   I get that part of the point of this cost estimate is to avoid the 
(potentially-expensive) rewrite if, e.g. we can do a doc-value rewrite instead, 
but I'm thinking we could do something a little bit more term-aware.
   
   How expensive is `MultiTermQuery#getTermsEnum()`? I think it's cheap, but 
I'm not positive. Maybe we could get it, try walking through it, summing doc 
freqs, giving up if the number of terms gets big.
   
   We could go down that path only if the cost estimate from the existing logic 
is very high. I can try sketching out a PR with a test for it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org