dpol1 opened a new pull request, #1868: URL: https://github.com/apache/stormcrawler/pull/1868
## Summary

This PR introduces the `external/opensearch-java` module, migrating StormCrawler from the deprecated `RestHighLevelClient` to the official [`opensearch-java`](https://opensearch.org/docs/latest/clients/java/) client.

Following [suggestion](https://github.com/apache/stormcrawler/issues/1515#issuecomment-4128937174), this is built as a **separate module** to act as a **drop-in replacement**. Users can migrate seamlessly by simply updating their `pom.xml` artifactId, with zero changes required for Flux topologies, imports, or YAML configuration keys. The legacy `external/opensearch` module remains untouched in this PR to allow a gradual phase-out.

## Architectural Decisions & Engineering

Since the new `opensearch-java` client introduces a completely different paradigm (fluent builders, strict JSON mappers) and removes several legacy utility classes, the following architectural decisions were made:

### 1. `AsyncBulkProcessor` & Backpressure

The legacy `BulkProcessor` was removed in the new client. To prevent `OutOfMemoryError`s and preserve Storm's backpressure, I implemented a custom `AsyncBulkProcessor`:

* Buffers `BulkOperation`s and flushes based on action count or a `ScheduledExecutorService` timer.
* Uses a `Semaphore` to limit concurrent in-flight HTTP requests.
* Uses a dedicated `ThreadPoolExecutor` with `CallerRunsPolicy` to process async callbacks without starving the JVM's `ForkJoinPool.commonPool()`.

### 2. Transport, 100MB Buffer Limit & Sniffer Support

In the original issue discussion, I mentioned I would likely use the new `ApacheHttpClient5Transport` and `ApacheHttpClient5Options` to configure the response buffer and bypass the `ContentTooLongException` (100MB ceiling). However, during implementation, I realized that switching to HC5 would break the **Sniffer** feature, as the official `opensearch-rest-client-sniffer` heavily relies on the low-level HC4 `RestClient`.

To maintain 100% feature parity as a drop-in replacement, I intentionally kept the `RestClientTransport` (which wraps the HC4 `RestClient`). This brilliantly solves both problems:

1. The `Sniffer` works out of the box using the underlying `RestClient`.
2. We bypass the 100MB buffer limit by injecting `HeapBufferedResponseConsumerFactory` via the classic `RequestOptions` (which `RestClientTransport` fully supports).

### 3. Concurrency & Race Condition Fixes

During the migration, I identified and fixed a race condition in `IndexerBolt` and `DeletionBolt` where tuples were added to the processor *before* being safely locked in the `waitAck` map. The locking order has been inverted to guarantee zero tuple loss during high-throughput flushes.

### 4. Upstream Bugfixes Sync

This module is perfectly aligned with `main`. It incorporates the recent bugfixes applied to the legacy module, adapted for the new asynchronous paradigm:

* **#1864**: Guarded double-close race conditions in `AbstractSpout`.
* **#1865**: Fixed resource leaks in `OpenSearchConnection` during `Sniffer` initialization failures.
* **#1866 & #1867**: Resolved `Timer` and `OpenSearchClient` memory leaks in `JSONResourceWrapper` and `JSONURLFilterWrapper` by properly implementing `cleanup()`.

## Test plan

- [x] **End-to-End Integration:** Verified the correct behavior of all Spouts and Bolts (`IndexerBolt`, `StatusUpdaterBolt`, `DeletionBolt`, etc.) against a real OpenSearch instance using Testcontainers.
- [x] **Concurrency & Backpressure:** Validated the custom `AsyncBulkProcessor` under load to ensure it correctly flushes based on size/time thresholds and strictly respects the `Semaphore` concurrency limits without dropping tuples.
- [x] **Data Serialization:** Confirmed correct JSON serialization for complex types, specifically ensuring that `nextFetchDate` and `timestamp` fields conform to ISO-8601 format to prevent OpenSearch mapping errors.
- [x] **Archetype Validation:** Verified that the updated Maven archetype successfully generates a working StormCrawler project, correctly wired with the new `opensearch-java` dependency and its associated configurations.
- [x] **Project Compliance:** Ensured all static analysis checks pass, including Apache RAT for license headers (with explicit safe exclusions for NDJSON dashboards) and code formatting rules.

Closes #1515
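
For reviewers, the buffering and backpressure scheme described under "`AsyncBulkProcessor` & Backpressure" can be sketched roughly as below. This is a simplified illustration using only JDK classes, not the actual `AsyncBulkProcessor` code from this PR: the class name `BulkBufferSketch`, the `sender` callback, and all parameter names are hypothetical, and error handling and retries are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch of a size/time-flushed buffer with bounded in-flight requests. */
class BulkBufferSketch<T> implements AutoCloseable {
    private final int maxActions;
    private final int maxConcurrent;
    private final Semaphore inFlight;
    // Assumed stand-in for the async bulk call of the real client.
    private final Function<List<T>, CompletableFuture<Void>> sender;
    private final List<T> buffer = new ArrayList<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    // Dedicated callback pool; when its queue is full, CallerRunsPolicy makes the
    // submitting thread run the callback instead of dropping it.
    private final ExecutorService callbacks = new ThreadPoolExecutor(
            1, 2, 30, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(100), new ThreadPoolExecutor.CallerRunsPolicy());

    BulkBufferSketch(int maxActions, int maxConcurrent, long flushMillis,
                     Function<List<T>, CompletableFuture<Void>> sender) {
        this.maxActions = maxActions;
        this.maxConcurrent = maxConcurrent;
        this.inFlight = new Semaphore(maxConcurrent);
        this.sender = sender;
        // Time-based flush.
        timer.scheduleWithFixedDelay(this::flush, flushMillis, flushMillis,
                TimeUnit.MILLISECONDS);
    }

    /** Buffers one operation and flushes once the action-count threshold is reached. */
    void add(T op) {
        List<T> batch = null;
        synchronized (buffer) {
            buffer.add(op);
            if (buffer.size() >= maxActions) {
                batch = drain();
            }
        }
        if (batch != null) {
            send(batch);
        }
    }

    /** Sends whatever is currently buffered, if anything. */
    void flush() {
        List<T> batch;
        synchronized (buffer) {
            batch = drain();
        }
        if (!batch.isEmpty()) {
            send(batch);
        }
    }

    // Must be called while holding the buffer lock.
    private List<T> drain() {
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        return batch;
    }

    private void send(List<T> batch) {
        // Blocks the caller when too many requests are in flight: this is the backpressure.
        inFlight.acquireUninterruptibly();
        CompletableFuture<Void> future;
        try {
            future = sender.apply(batch);
        } catch (RuntimeException e) {
            inFlight.release();
            throw e;
        }
        // Release the permit on the dedicated pool, not on ForkJoinPool.commonPool().
        future.whenCompleteAsync((ok, err) -> inFlight.release(), callbacks);
    }

    @Override
    public void close() {
        timer.shutdown();
        try {
            timer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        flush();
        // Wait until every in-flight request has released its permit.
        inFlight.acquireUninterruptibly(maxConcurrent);
        callbacks.shutdown();
    }
}
```

With `maxActions = 2`, adding three operations and then closing the sketch hands the sender two batches: the first two operations together, then the remaining one on close. Because `send` blocks while no permit is available, a slow cluster slows down the calling bolt instead of letting the buffer grow without bound.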
