amalatlas opened a new issue, #16213:
URL: https://github.com/apache/lucene/issues/16213

   ### Description
   
   ### Summary
   
   We are seeing a significant increase in read-side block I/O during
   merge-related stored-fields reads after moving from Lucene `9.4.0`
   (OpenSearch 2.19) to Lucene `10.4.0` (OpenSearch 3.6). Although the issue
   was observed in OpenSearch, I am posting it here because I believe the
   regression may originate in Lucene.
   
   ## Benchmark Test
   
   We ran a single-node instance of OpenSearch 2.19 and of OpenSearch 3.6, and
   ran the
   [OpenSearch Benchmark](https://docs.opensearch.org/latest/benchmark/user-guide/install-and-configure/installing-benchmark/)
   `geonames` workload (indexing only) against each. We captured a `perf` trace
   to identify the source of `block_rq_issue` kernel events, which indicate
   what is causing the high read IOPS.
   
   ## Results & Observations
   
   The results show that OpenSearch 3.6 (Lucene 10) issues more than four times
   as many `block_rq_issue` events as OpenSearch 2.19 (Lucene 9).
   
   | Metric                       | `2.19` (Lucene 9) | `3.6` (Lucene 10) |
   | ---------------------------- | ----------------: | ----------------: |
   | `block:block_rq_issue` count |          `42,768` |         `186,344` |
   
   In addition, the number of `block_rq_issue` events originating from
   `Lucene90CompressingStoredFieldsWriter.merge` has increased significantly.
   
   | Metric                       | `2.19` (Lucene 9) | `3.6` (Lucene 10) |
   | ---------------------------- | ----------------: | ----------------: |
   | `block:block_rq_issue` count |           `5,076` |         `173,757` |
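   
   To put the two tables side by side, the following small sketch computes the
   growth factors and the share of events attributed to the merge path. It uses
   only the four counts reported above; the class and method names are made up
   for illustration.
   
   ```java
   public class BlockIoRatios {
       // Event counts copied from the two tables above.
       static final long TOTAL_LUCENE9 = 42_768;
       static final long TOTAL_LUCENE10 = 186_344;
       static final long MERGE_LUCENE9 = 5_076;
       static final long MERGE_LUCENE10 = 173_757;
   
       /** How many times larger {@code after} is than {@code before}. */
       static double growth(long after, long before) {
           return (double) after / before;
       }
   
       /** {@code part} as a percentage of {@code total}. */
       static double sharePercent(long part, long total) {
           return 100.0 * part / total;
       }
   
       public static void main(String[] args) {
           System.out.printf("all events grew by a factor of %.2f%n",
               growth(TOTAL_LUCENE10, TOTAL_LUCENE9));
           System.out.printf("merge-path events grew by a factor of %.2f%n",
               growth(MERGE_LUCENE10, MERGE_LUCENE9));
           System.out.printf("merge-path share: %.1f%% (Lucene 9), %.1f%% (Lucene 10)%n",
               sharePercent(MERGE_LUCENE9, TOTAL_LUCENE9),
               sharePercent(MERGE_LUCENE10, TOTAL_LUCENE10));
           System.out.printf("events outside the merge path: %d (Lucene 9), %d (Lucene 10)%n",
               TOTAL_LUCENE9 - MERGE_LUCENE9, TOTAL_LUCENE10 - MERGE_LUCENE10);
       }
   }
   ```
   
   The last line is useful as a sanity check: if the events outside the merge
   path did not grow, the increase is concentrated in the stored-fields merge.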
   
   The flame graphs for 2.19 (block-rq-issue-opensearch-2.19-lucene-9.svg) and
   3.6 (block-rq-issue-opensearch-3.6-lucene-10.svg) are attached.
   
   ### Why this looks like a Lucene regression
   
   The behavior aligns with how the stored-fields reader changed between Lucene 
`9.4.0` and `10.4.0`.
   
   #### Suspected root cause
   
   Our analysis shows that the `.fdt` file is opened with `RANDOM` read advice,
   which prevents the kernel from performing read-ahead and leads to additional
   IOPS during a merge operation.
   
   In `9.4.0`, the `.fdt` stream is opened with the caller's `context` as-is:
   
   ```java
   fieldsStream = d.openInput(fieldsStreamFN, context);
   ```
   
   In `10.4.0`, the `.fdt` stream is opened with a forced `RANDOM` hint:
   
   ```java
   fieldsStream =
       d.openInput(fieldsStreamFN, context.withHints(FileTypeHint.DATA, 
DataAccessHint.RANDOM));
   ```
   
   Based on the current Lucene code:
   
   - merge calls `Lucene90CompressingStoredFieldsWriter.merge`,
   - that calls `checkIntegrity()` on source readers,
   - `checkIntegrity()` calls `CodecUtil.checksumEntireFile()`,
   - `checksumEntireFile()` scans the entire file through 
`ChecksumIndexInput.seek()` / `skipByReading()`,
   - and with `MemorySegmentIndexInput` plus `RANDOM` advice, the scan appears 
to fault pages in a way that results in many more block I/Os.
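   
   To illustrate the access pattern that the call chain above ends in, here is
   a minimal, JDK-only sketch of checksumming an entire file front to back
   through a memory mapping. It is a stand-in, not Lucene code: the class and
   method names are hypothetical, and it uses `java.util.zip.CRC32` over a
   `MappedByteBuffer` rather than Lucene's `ChecksumIndexInput`. The point is
   only that such a scan touches every page once, in order, so each first touch
   of an uncached page becomes a page fault that the kernel must satisfy.
   
   ```java
   import java.io.IOException;
   import java.nio.MappedByteBuffer;
   import java.nio.channels.FileChannel;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;
   import java.util.zip.CRC32;
   
   public class MappedChecksumScan {
       // Map the file in chunks because a single mapping is limited to 2 GiB.
       private static final long CHUNK_BYTES = 1L << 30;
   
       /** Reads every byte of the file in order through a mapping and returns its CRC32. */
       static long checksumEntireFile(Path file) throws IOException {
           try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
               long size = channel.size();
               CRC32 crc = new CRC32();
               for (long pos = 0; pos < size; pos += CHUNK_BYTES) {
                   long len = Math.min(size - pos, CHUNK_BYTES);
                   MappedByteBuffer chunk = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                   // Consumes the chunk sequentially; uncached pages are faulted in on first touch.
                   crc.update(chunk);
               }
               return crc.getValue();
           }
       }
   
       public static void main(String[] args) throws IOException {
           System.out.printf("crc32=%08x%n", checksumEntireFile(Path.of(args[0])));
       }
   }
   ```
   
   How many block requests such a scan generates depends on how much data the
   kernel brings in per fault, which is what the read advice influences.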
   
   The regression appears to be the combination of:
   
   1. `MemorySegmentIndexInput` making read advice effective at the kernel 
level, and
   2. `Lucene90CompressingStoredFieldsReader` forcing `DataAccessHint.RANDOM` 
even when the caller is merge code that would otherwise use sequential access.
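   
   As a back-of-the-envelope illustration of why this combination could
   multiply block requests, the sketch below counts the requests needed to scan
   a file when each request brings in a full read-ahead window versus a single
   page. All numbers are assumptions chosen for illustration (4 KiB pages, a
   128 KiB read-ahead window, a 256 MiB file), not measurements from this
   benchmark, and the model ignores page-cache hits and request merging in the
   block layer.
   
   ```java
   public class ReadAheadModel {
       static final long PAGE_BYTES = 4096; // assumed page size
   
       /** Requests needed to scan fileBytes in order when each request fetches bytesPerRequest. */
       static long requestsForScan(long fileBytes, long bytesPerRequest) {
           return (fileBytes + bytesPerRequest - 1) / bytesPerRequest; // ceiling division
       }
   
       public static void main(String[] args) {
           long fileBytes = 256L * 1024 * 1024; // hypothetical .fdt size
           long readAheadBytes = 128L * 1024;   // assumed read-ahead window
   
           long withReadAhead = requestsForScan(fileBytes, readAheadBytes);
           long pageAtATime = requestsForScan(fileBytes, PAGE_BYTES);
   
           System.out.println("requests with read-ahead: " + withReadAhead);
           System.out.println("requests one page at a time: " + pageAtATime);
           System.out.println("ratio: " + (double) pageAtATime / withReadAhead);
       }
   }
   ```
   
   Under these assumptions the ratio equals the window size in pages, so the
   gap grows with whatever read-ahead window the device is configured with.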
   
   **Note:** The above is based on my own reading of the code, and I am not
   familiar with the Lucene codebase, so it may not be accurate. I would
   appreciate confirmation from someone who knows the code.
   
   ## Expected behavior
   
   Lucene 10 should not cause higher IOPS during segment merges than Lucene 9,
   unless that is a deliberate trade-off for a gain elsewhere. During
   merge-time integrity scans and stored-fields copying, reads of stored-fields
   data should remain sequential so that the kernel can use read-ahead
   effectively.
   
   ### Version and environment details
   
   ## Test Setup Details
   
   Below are the details of the benchmarking and profiling setup.
   
   ### OpenSearch setup
   
   Docker images were built locally from the `2.19` and `3.6` branches of the
   OpenSearch repository:
   
   - OpenSearch `2.19` / Lucene `9.4.0`
   - OpenSearch `3.6` / Lucene `10.4.0`
   
   ```
   docker run --name opensearch \
        --rm \
        -p 9200:9200 \
        -p 9600:9600 \
        --ulimit memlock=-1:-1 \
        --ulimit nofile=65536:65536 \
        --cap-add IPC_LOCK \
        -e "discovery.type=single-node" \
        -e "bootstrap.memory_lock=true" \
        -e "OPENSEARCH_JAVA_OPTS=-Xms3g -Xmx3g -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" \
        -e "DISABLE_INSTALL_DEMO_CONFIG=true" \
        -e "DISABLE_SECURITY_PLUGIN=true" \
        -v /var/lib/opensearch/data-$VERSION:/usr/share/opensearch/data \
        docker.opensearch.org/opensearch:$VERSION-SNAPSHOT
   ```
   
   ### Benchmark setup
   
   ```
   docker run --rm --network host \
       -e HOME=/work \
       -v "$OSB_HOME:/work" \
       -v "$RESULTS_DIR:/results" \
       opensearchproject/opensearch-benchmark:latest \
       run \
         --workload=geonames \
         --target-hosts=127.0.0.1:9200 \
         --pipeline=benchmark-only \
         --include-tasks="index-append" \
         --workload-params='{"number_of_shards":3,"number_of_replicas":0,"bulk_size":5000}' \
         --client-options="use_ssl:false,verify_certs:false,basic_auth_user:'xxxx',basic_auth_password:'xxxx'" \
         --results-format=markdown \
         --results-file="/results/geonames-${TASKS}-$(date +%Y%m%d-%H%M%S).md" \
         --show-in-results=all || true
   
   ```
   
   ### Tracing setup
   
   ```
   ...
   perf record \
       -o "/tmp/block-rq-issue.data" \
       -p "$PID" \
       -e "block:block_rq_issue" \
       --call-graph fp
   ...
   ```
   
   <img width="1200" height="1318" alt="Image" src="https://github.com/user-attachments/assets/a45efe20-6d88-4182-a737-808d8daf56ba" />
   <img width="1200" height="1142" alt="Image" src="https://github.com/user-attachments/assets/699d22c2-c88e-4ede-945d-f68f1718e652" />

