[PR] Add opensearch-java module using java client [stormcrawler]

via GitHub Fri, 03 Apr 2026 02:58:45 -0700


dpol1 opened a new pull request, #1868:
URL: https://github.com/apache/stormcrawler/pull/1868


   ## Summary
   
   This PR introduces the `external/opensearch-java` module, migrating 
StormCrawler from the deprecated `RestHighLevelClient` to the official 
[`opensearch-java`](https://opensearch.org/docs/latest/clients/java/) client.
   
   Following this 
[suggestion](https://github.com/apache/stormcrawler/issues/1515#issuecomment-4128937174),
 it is built as a **separate module** that acts as a **drop-in replacement**. 
Users can migrate by changing the artifactId in their `pom.xml`; no changes 
are required to Flux topologies, imports, or YAML configuration keys.
   
   The legacy `external/opensearch` module remains untouched in this PR to 
allow a gradual phase-out.
   
   ## Architectural Decisions & Engineering
   
   Since the new `opensearch-java` client introduces a completely different 
paradigm (fluent builders, strict JSON mappers) and removes several legacy 
utility classes, the following architectural decisions were made:
   
   ### 1. `AsyncBulkProcessor` & Backpressure
   The legacy `BulkProcessor` was removed in the new client. To prevent 
`OutOfMemoryError`s and preserve Storm's backpressure, I implemented a custom 
`AsyncBulkProcessor`:
   * Buffers `BulkOperation`s and flushes based on action count or a 
`ScheduledExecutorService` timer.
   * Uses a `Semaphore` to limit concurrent in-flight HTTP requests.
   * Uses a dedicated `ThreadPoolExecutor` with `CallerRunsPolicy` to process 
async callbacks without starving the JVM's `ForkJoinPool.commonPool()`.
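
   The three mechanisms listed above can be sketched with the JDK alone. This 
is not the PR's `AsyncBulkProcessor`: the class name `BoundedBatcher`, the 
`sender` function, and all pool and queue sizes are hypothetical, and the 
OpenSearch bulk call is replaced by any function that returns a 
`CompletableFuture`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Illustrative only: a generic batcher, not the PR's AsyncBulkProcessor. */
final class BoundedBatcher<T> implements AutoCloseable {
    private final int maxActions;
    private final int maxConcurrent;
    private final Function<List<T>, CompletableFuture<Void>> sender;
    private final Semaphore inFlight;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    // Dedicated pool for completion callbacks, so they never run on
    // ForkJoinPool.commonPool(). When its queue is full, CallerRunsPolicy
    // makes the submitting thread run the callback itself.
    private final ThreadPoolExecutor callbacks = new ThreadPoolExecutor(
            1, 2, 30, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(100),
            new ThreadPoolExecutor.CallerRunsPolicy());
    private List<T> buffer = new ArrayList<>();

    BoundedBatcher(int maxActions, int maxConcurrent, long flushMillis,
                   Function<List<T>, CompletableFuture<Void>> sender) {
        this.maxActions = maxActions;
        this.maxConcurrent = maxConcurrent;
        this.sender = sender;
        this.inFlight = new Semaphore(maxConcurrent);
        // Time-based flush for buffers that never reach maxActions.
        timer.scheduleWithFixedDelay(
                this::flush, flushMillis, flushMillis, TimeUnit.MILLISECONDS);
    }

    /** Buffers one operation and flushes when the action count is reached. */
    void add(T op) {
        List<T> batch = null;
        synchronized (this) {
            buffer.add(op);
            if (buffer.size() >= maxActions) {
                batch = swap();
            }
        }
        if (batch != null) {
            send(batch);
        }
    }

    /** Flushes whatever is buffered, if anything. */
    void flush() {
        List<T> batch;
        synchronized (this) {
            batch = buffer.isEmpty() ? null : swap();
        }
        if (batch != null) {
            send(batch);
        }
    }

    // Must be called while holding the monitor on this.
    private List<T> swap() {
        List<T> full = buffer;
        buffer = new ArrayList<>();
        return full;
    }

    private void send(List<T> batch) {
        // Blocks the caller while maxConcurrent requests are in flight;
        // this is what propagates backpressure upstream.
        inFlight.acquireUninterruptibly();
        try {
            sender.apply(batch)
                  .whenCompleteAsync((ok, err) -> inFlight.release(), callbacks);
        } catch (RuntimeException e) {
            inFlight.release();
            throw e;
        }
    }

    @Override
    public void close() {
        timer.shutdown();
        flush();
        // Wait until every in-flight request has released its permit.
        inFlight.acquireUninterruptibly(maxConcurrent);
        callbacks.shutdown();
    }
}
```

   Blocking in `send` rather than queueing unbounded work is the design choice 
that keeps memory flat: a caller that produces faster than the backend can 
absorb is simply slowed down.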
   
   ### 2. Transport, 100MB Buffer Limit & Sniffer Support
   In the original issue discussion, I mentioned I would likely use the new 
`ApacheHttpClient5Transport` and `ApacheHttpClient5Options` to configure the 
response buffer and bypass the `ContentTooLongException` (100MB ceiling). 
   
   However, during implementation, I realized that switching to HC5 would break 
the **Sniffer** feature, as the official `opensearch-rest-client-sniffer` 
heavily relies on the low-level HC4 `RestClient`.
   
   To maintain feature parity as a drop-in replacement, I intentionally 
kept `RestClientTransport` (which wraps the HC4 `RestClient`). This solves 
both problems:
   1. The `Sniffer` works out of the box using the underlying `RestClient`.
   2. We bypass the 100MB buffer limit by injecting 
`HeapBufferedResponseConsumerFactory` via the classic `RequestOptions` (which 
`RestClientTransport` fully supports).
   
   ### 3. Concurrency & Race Condition Fixes
   During the migration, I identified and fixed a race condition in 
`IndexerBolt` and `DeletionBolt`: tuples were added to the processor *before* 
being registered in the `waitAck` map. The order is now inverted, so each tuple 
is tracked in `waitAck` before the processor sees it, which prevents tuple loss 
during high-throughput flushes.
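
   The ordering problem can be reproduced in a few lines. The sketch below is 
a simplified stand-in, not the bolts' actual code: `AckTracker`, 
`submitSafely`, `submitRacy`, and the string-typed tuples are hypothetical, and 
the processor is modelled as a callback that may complete before `accept` 
returns.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Illustrative stand-in for a bolt that acks tuples once they are flushed. */
final class AckTracker {
    private final Map<String, String> waitAck = new ConcurrentHashMap<>();
    private final AtomicInteger acked = new AtomicInteger();
    private final AtomicInteger orphaned = new AtomicInteger();

    /** Flush callback; in a real bolt this may run on another thread. */
    void onFlushed(String docId) {
        if (waitAck.remove(docId) != null) {
            acked.incrementAndGet();
        } else {
            // The response arrived before the tuple was registered.
            orphaned.incrementAndGet();
        }
    }

    /** Register first, then hand over: the callback always finds the tuple. */
    void submitSafely(String docId, String tuple, Consumer<String> processor) {
        waitAck.put(docId, tuple);
        processor.accept(docId);
    }

    /** Hand over first, then register: a fast callback misses the tuple. */
    void submitRacy(String docId, String tuple, Consumer<String> processor) {
        processor.accept(docId);
        waitAck.put(docId, tuple);
    }

    int acked() { return acked.get(); }
    int orphaned() { return orphaned.get(); }
    int pending() { return waitAck.size(); }
}
```

   Passing `tracker::onFlushed` as the processor simulates the worst case, a 
flush that completes immediately. With `submitRacy` the tuple is left in 
`waitAck` and never acked, while `submitSafely` acks it.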
   
   ### 4. Upstream Bugfixes Sync
   This module is in sync with `main`. It incorporates the recent 
bugfixes applied to the legacy module, adapted for the new asynchronous 
paradigm:
   * **#1864**: Guarded double-close race conditions in `AbstractSpout`.
   * **#1865**: Fixed resource leaks in `OpenSearchConnection` during `Sniffer` 
initialization failures.
   * **#1866 & #1867**: Resolved `Timer` and `OpenSearchClient` memory leaks in 
`JSONResourceWrapper` and `JSONURLFilterWrapper` by properly implementing 
`cleanup()`.
   
   ## Test plan
   
   - [x] **End-to-End Integration:** Verified the correct behavior of all 
Spouts and Bolts (`IndexerBolt`, `StatusUpdaterBolt`, `DeletionBolt`, etc.) 
against a real OpenSearch instance using Testcontainers.
   - [x] **Concurrency & Backpressure:** Validated the custom 
`AsyncBulkProcessor` under load to ensure it correctly flushes based on 
size/time thresholds and strictly respects the `Semaphore` concurrency limits 
without dropping tuples.
   - [x] **Data Serialization:** Confirmed correct JSON serialization for 
complex types, specifically ensuring that `nextFetchDate` and `timestamp` 
fields conform to ISO-8601 format to prevent OpenSearch mapping errors.
   - [x] **Archetype Validation:** Verified that the updated Maven archetype 
successfully generates a working StormCrawler project, correctly wired with the 
new `opensearch-java` dependency and its associated configurations.
   - [x] **Project Compliance:** Ensured all static analysis checks pass, 
including Apache RAT for license headers (with explicit safe exclusions for 
NDJSON dashboards) and code formatting rules.
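
   For the data serialization check in the list above, the following is a 
minimal sketch of producing ISO-8601 strings with `java.time`. The helper class 
`IsoDates` and the document layout are assumptions for illustration; only the 
field name `nextFetchDate` comes from this PR.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper, not part of the module. */
final class IsoDates {
    private IsoDates() {}

    /** Formats an instant as an ISO-8601 UTC string. */
    static String toIso(Instant instant) {
        return DateTimeFormatter.ISO_INSTANT.format(instant);
    }

    /** Builds a status-like document with the date stored as a string. */
    static Map<String, Object> statusDoc(String url, Instant nextFetch) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("nextFetchDate", toIso(nextFetch));
        return doc;
    }
}
```

   Storing the date as a pre-formatted string avoids depending on how a JSON 
mapper chooses to serialize `Instant` or `Date` objects.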
   
   Closes #1515


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

