turboFei opened a new pull request, #16693:
URL: https://github.com/apache/iceberg/pull/16693

   ## Problem
   
   `RESTMetricsReporter.report()` uses 
`Tasks.range(1).executeWith(METRICS_EXECUTOR).run()` to submit the HTTP metrics 
call to a single-background-thread executor. However, after submitting the 
task, the **calling thread is blocked** inside `Tasks.waitFor()` — a 
`sleep(10ms)` poll loop that spins until the submitted future is done:
   
   ```java
   // Tasks.waitFor() — called by runParallel() after submitting to 
METRICS_EXECUTOR
   while (true) {
       if (allDone(futures)) { ... return; }
       Thread.sleep(10);   // ← calling thread is blocked here
   }
   ```
   
   This makes `report()` **synchronous** from the caller's perspective, despite 
appearing to use an executor.
   
   ### Thread pile-up in practice
   
   The intended use of this API is async, best-effort metrics reporting. Any 
caller that wraps `report()` in its own thread pool to make it non-blocking 
will encounter a severe thread leak:
   
   ```
   caller pool thread
     → report()
       → Tasks.range(1).executeWith(METRICS_EXECUTOR).run()
         → submits HTTP call to METRICS_EXECUTOR (1 thread)
         → blocks in Tasks.waitFor(), waiting for METRICS_EXECUTOR
                            ↑
            caller thread is STUCK here until HTTP completes
   ```
   
   If `report()` is called faster than `METRICS_EXECUTOR` can process (e.g. 
many concurrent Iceberg scans/commits), the caller pool keeps growing while its 
threads pile up in `waitFor()`. In production we observed **9,351 
`Iceberg-CompactionReport-Thread` threads** accumulating at ~2.7 threads/second 
over several hours, consuming several GB of thread stack memory.
   
   ## Fix
   
   Replace `Tasks.range(1).executeWith(METRICS_EXECUTOR).run()` with a direct 
`METRICS_EXECUTOR.execute(lambda)`:
   
   ```java
   METRICS_EXECUTOR.execute(() -> {
       try {
           client.post(...);
       } catch (Exception e) {
           LOG.warn("Failed to report metrics ...", e);
       }
   });
   ```
   
   `report()` now enqueues the HTTP call and **returns immediately**. The 
background thread processes reports serially; callers are never blocked.
   
   Switch `METRICS_EXECUTOR` from `ThreadPools.newExitingWorkerPool` (which 
uses `LinkedBlockingQueue` internally but registers a JVM shutdown hook) to a 
plain `ThreadPoolExecutor` with daemon threads and an unbounded 
`LinkedBlockingQueue`. Daemon threads exit automatically when the JVM shuts 
down, making the shutdown hook unnecessary.
   
   ## Testing
   
   `TestRESTMetricsReporter` verifies three things:
   1. **`reportIsNonBlocking`** — `report()` returns in < 2 s even when the 
HTTP client blocks for 5 s
   2. **`reportSendsHttpPost`** — the HTTP POST is eventually delivered to the 
client
   3. **`reportNullIsIgnored`** — a null report does not trigger any HTTP call


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to