gf2121 opened a new issue, #16044:
URL: https://github.com/apache/lucene/issues/16044

   ### Description
   
   > The following was mostly written by an LLM, but I ran the benchmark myself :) 
   
   ## 1. Motivation: `NIOFSDirectory` is still relevant
   
   In recent memory-constrained deployments (cgroup-limited containers with 
large indices), `MMapDirectory` triggered severe page-fault storms: 
`pgmajfault` rates spiked by an order of magnitude once the working set 
exceeded the cgroup limit, and query latency degraded sharply. Switching to 
`NIOFSDirectory` resolved the problem for us.
   
   ## 2. Problem: a JDK monitor caps `NIOFSDirectory` at ~4 threads
   
   After moving more workloads onto `NIOFSDirectory`, we hit a hard scaling 
ceiling. The bottleneck is **not** the kernel — it's a synchronized block in 
`sun.nio.ch.FileChannelImpl`. Every positioned read registers the calling 
thread into a `NativeThreadSet` (so a concurrent `close()` can interrupt it via 
`pthread_kill`), and that registration takes a global monitor on every read.
   
   ```java
   // sun.nio.ch.FileChannelImpl
   private int readInternal(ByteBuffer dst, long position) throws IOException {
       int n = 0;
       int ti = -1;
       try {
           beginBlocking();
           // ↓↓↓ contention point — monitor-protected, on every single read ↓↓↓
           ti = threads.add();
           if (!isOpen()) return -1;
           do {
               // ... Blocker.begin / IOUtil.read(fd, dst, position, ...) / Blocker.end ...
           } while ((n == IOStatus.INTERRUPTED) && isOpen());
           return IOStatus.normalize(n);
       } finally {
           threads.remove(ti);   // takes the same monitor again
           endBlocking(n > 0);
       }
   }
   ```
   
   ```java
   // sun.nio.ch.NativeThreadSet — the monitor every reader fights for
   int add() {
       long th = NativeThread.current();
       synchronized (this) {                              // ← global monitor per channel
           // ... grow array, find free slot, write thread handle ...
       }
   }
   ```
   
   Past ~4 threads, this monitor's cache-line bouncing dominates the cost of 
`pread64` itself, and throughput stops scaling. This is structurally tied to 
the `Channel.close()` interruption contract and unlikely to be removed from the 
JDK in the near term.
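
For reference, the access pattern that runs into this monitor can be sketched as follows: several threads issue positioned reads against one shared `FileChannel`, so every `read(dst, position)` call passes through `readInternal` as shown above. This is a minimal sketch, not the benchmark code; the names `SharedChannelReads`, `readFully`, and `randomReads` are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: many threads, one shared FileChannel, positioned reads only.
final class SharedChannelReads {

    /** Fills dst starting at file offset pos; stops early only at end of file. */
    static int readFully(FileChannel ch, ByteBuffer dst, long pos) throws IOException {
        int total = 0;
        while (dst.hasRemaining()) {
            int n = ch.read(dst, pos + total); // positioned read: no shared file pointer
            if (n < 0) {
                break; // end of file
            }
            total += n;
        }
        return total;
    }

    /** Runs random block reads from `threads` threads and returns the total bytes read. */
    static long randomReads(Path file, int threads, int readsPerThread, int blockSize)
            throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long maxPos = ch.size() - blockSize;
            if (maxPos < 0) {
                throw new IllegalArgumentException("file is smaller than one block");
            }
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                List<Future<Long>> futures = new ArrayList<>();
                for (int t = 0; t < threads; t++) {
                    futures.add(pool.submit(() -> {
                        // One buffer per thread; only the channel is shared.
                        ByteBuffer buf = ByteBuffer.allocateDirect(blockSize);
                        long bytes = 0;
                        for (int i = 0; i < readsPerThread; i++) {
                            buf.clear();
                            long pos = ThreadLocalRandom.current().nextLong(maxPos + 1);
                            bytes += readFully(ch, buf, pos);
                        }
                        return bytes;
                    }));
                }
                long total = 0;
                for (Future<Long> f : futures) {
                    total += f.get();
                }
                return total;
            } finally {
                pool.shutdown();
            }
        }
    }
}
```

The reads themselves are independent (separate buffers, explicit offsets), so any serialization between threads comes from the channel's internal bookkeeping rather than from this code.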
   
   ## 3. Benchmark: native `pread(2)` via Panama FFI scales 4× higher
   
   JMH on Java 25, Linux x86_64, NVMe; 1 GiB file, 16 KiB random reads, 16 
reads/op. Throughput in **ops/ms** (higher is better):
   
   | Benchmark | 1 thr | 2 thr | 4 thr | 8 thr | 16 thr | 32 thr |
   |---|---:|---:|---:|---:|---:|---:|
   | `ffiPread` | 371.8 | 633.8 | 1104.5 | **1854.5** | **2838.1** | **2862.5** |
   | `fileChannelReadDirect` | 358.9 | 428.1 | 683.4 | 637.3 | 737.0 | 737.4 |
   | `fileChannelReadHeap` | 318.1 | 495.4 | 668.2 | 596.0 | 757.4 | 712.8 |
   
   - 1 thread: FFI is ~4% faster than `fileChannelReadDirect` and ~17% faster 
than `fileChannelReadHeap`; same syscall, less Java overhead.
   - `FileChannel` plateaus at ~700 ops/ms from 4 threads onward; profiling 
shows time inside `NativeThreadSet`'s monitor.
   - FFI keeps scaling through 16 threads (about 7.6× its 1-thread rate), then 
flattens at 32, which suggests a hardware ceiling.
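
To make the `ffiPread` variant concrete, here is a minimal sketch of calling `pread(2)` through the Foreign Function & Memory API. It is not the benchmark code. It assumes JDK 22 or later, a 64-bit POSIX platform where `ssize_t`, `size_t`, and `off_t` are 8 bytes wide, and `O_RDONLY == 0` (true on Linux and macOS). The class name `FfiPread` and its methods are hypothetical, and error reporting is deliberately crude (no `errno` capture).

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.file.Path;

// Hypothetical wrapper around a raw file descriptor; reads go straight to pread(2).
final class FfiPread implements AutoCloseable {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final SymbolLookup LIBC = LINKER.defaultLookup();

    // int open(const char *path, int flags, ...): variadic in C, called with no variadic args.
    private static final MethodHandle OPEN = LINKER.downcallHandle(
            LIBC.find("open").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
            Linker.Option.firstVariadicArg(2));

    // ssize_t pread(int fd, void *buf, size_t count, off_t offset)
    private static final MethodHandle PREAD = LINKER.downcallHandle(
            LIBC.find("pread").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT,
                    ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG));

    // int close(int fd)
    private static final MethodHandle CLOSE = LINKER.downcallHandle(
            LIBC.find("close").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

    private static final int O_RDONLY = 0; // assumption: Linux/macOS value

    private final int fd;

    private FfiPread(int fd) {
        this.fd = fd;
    }

    static FfiPread open(Path path) throws IOException {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cPath = arena.allocateFrom(path.toString()); // NUL-terminated UTF-8
            int fd = (int) OPEN.invokeExact(cPath, O_RDONLY);
            if (fd < 0) {
                throw new IOException("open failed: " + path);
            }
            return new FfiPread(fd);
        } catch (IOException | RuntimeException | Error e) {
            throw e;
        } catch (Throwable t) {
            throw new IOException(t);
        }
    }

    /**
     * Reads up to dst.byteSize() bytes at the given file offset into a native segment.
     * May return fewer bytes than requested; callers that need a full block must loop.
     */
    int read(MemorySegment dst, long position) throws IOException {
        try {
            long n = (long) PREAD.invokeExact(fd, dst, dst.byteSize(), position);
            if (n < 0) {
                throw new IOException("pread failed at offset " + position);
            }
            return (int) n;
        } catch (IOException | RuntimeException | Error e) {
            throw e;
        } catch (Throwable t) {
            throw new IOException(t);
        }
    }

    @Override
    public void close() throws IOException {
        try {
            int rc = (int) CLOSE.invokeExact(fd);
            if (rc != 0) {
                throw new IOException("close failed");
            }
        } catch (IOException | RuntimeException | Error e) {
            throw e;
        } catch (Throwable t) {
            throw new IOException(t);
        }
    }
}
```

Nothing on the read path takes a Java lock: each call passes the descriptor, a caller-owned native buffer, and an explicit offset. Recent JDKs warn about restricted native access unless the program is started with `--enable-native-access`.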
   
   ## 4. Proposal: `PreadDirectory`
   
   A new `Directory` that performs random reads via `pread(2)` through Panama 
FFI:
   
   - **POSIX** → FFI `pread`. No `NativeThreadSet`, no monitor, stateless 
syscall.
   - **Non-POSIX** → fallback to `NIOFSDirectory`. Behavior is never worse than 
today.
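
The two-way split above could be decided once at startup. The issue does not say how POSIX would be detected, so the following is only one possible sketch: it treats a default file system that supports the `posix` attribute view as POSIX, and takes native-access availability as an input. `ReadPathSelector`, `ReadPath`, `choose`, and `detect` are hypothetical names.

```java
import java.nio.file.FileSystems;
import java.util.Set;

// Hypothetical selector for the read path; not part of Lucene.
final class ReadPathSelector {

    enum ReadPath { FFI_PREAD, NIOFS_FALLBACK }

    /**
     * Pure decision function: use FFI pread only when the platform looks POSIX
     * and native access is usable; otherwise keep today's NIOFSDirectory behavior.
     */
    static ReadPath choose(Set<String> attributeViews, boolean nativeAccessAvailable) {
        boolean posix = attributeViews.contains("posix"); // assumption: heuristic for POSIX
        return (posix && nativeAccessAvailable) ? ReadPath.FFI_PREAD : ReadPath.NIOFS_FALLBACK;
    }

    /** Applies the same decision to the running JVM's default file system. */
    static ReadPath detect(boolean nativeAccessAvailable) {
        return choose(FileSystems.getDefault().supportedFileAttributeViews(),
                nativeAccessAvailable);
    }
}
```

Keeping the decision in a pure function makes the fallback rule easy to unit-test on any platform.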
   



