[PR] Add segment-level failure details capture to reload status tracking [pinot]

via GitHub Tue, 18 Nov 2025 13:56:12 -0800


suvodeep-pyne opened a new pull request, #17234:
URL: https://github.com/apache/pinot/pull/17234


   ## Summary
   
   This PR enhances the reload status tracking system to capture detailed 
failure information for individual segments during table reload operations. 
Building on the existing in-memory reload job status cache infrastructure, this 
change provides operators with actionable debugging information directly via 
the reload status API.
   
   ## Key Changes
   
   ### 1. New DTO: `SegmentReloadFailureResponse`
   - **Location**: 
`pinot-common/response/server/SegmentReloadFailureResponse.java`
   - Captures segment name, server name, error message, full stack trace, and 
failure timestamp
   - Shared between server (serialization) and controller (deserialization)
   - Follows existing patterns for response DTOs in pinot-common
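   
   A minimal sketch of what such a DTO could look like, assuming the field 
names shown in the example API response in this description. The class name 
`SegmentReloadFailureSketch`, the `from(...)` factory, and the error message 
format are hypothetical and are not taken from the PR's code.
   
   ```java
   import java.io.PrintWriter;
   import java.io.StringWriter;

   // Hypothetical illustration only; not the actual SegmentReloadFailureResponse class.
   final class SegmentReloadFailureSketch {
     final String segmentName;
     final String serverName;
     final String errorMessage;
     final String stackTrace;
     final long failedAtMs;

     SegmentReloadFailureSketch(String segmentName, String serverName, String errorMessage,
         String stackTrace, long failedAtMs) {
       this.segmentName = segmentName;
       this.serverName = serverName;
       this.errorMessage = errorMessage;
       this.stackTrace = stackTrace;
       this.failedAtMs = failedAtMs;
     }

     // Builds the DTO from a Throwable, rendering the full stack trace to a String.
     // The "SimpleClassName: message" format is an assumption based on the example response.
     static SegmentReloadFailureSketch from(String segmentName, String serverName, Throwable t,
         long failedAtMs) {
       StringWriter buffer = new StringWriter();
       PrintWriter writer = new PrintWriter(buffer);
       t.printStackTrace(writer);
       writer.flush();
       String message = t.getClass().getSimpleName() + ": " + t.getMessage();
       return new SegmentReloadFailureSketch(segmentName, serverName, message,
           buffer.toString(), failedAtMs);
     }
   }
   ```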
   
   ### 2. Server-Side Failure Detail Capture
   - **`ServerReloadJobStatusCache`**: Enhanced with `recordFailure()` method 
that:
     - Always increments failure count (exact counting)
     - Stores detailed failure information for the first N failures 
(configurable, default: 5)
     - Populates server name using instance ID for server context
     - Thread-safe with synchronized access to failure details list
     
   - **`ReloadJobStatus`**: Added `_failedSegmentDetails` list to track failed 
segment information
   
   - **`ServerReloadJobStatusCacheConfig`**: Added `segmentFailureDetailsCount` 
configuration (default: 5)
     - ZK config key: 
`pinot.server.table.reload.status.cache.segment.failure.details.count`
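   
   The recording behavior described above (exact failure count, details kept 
only for the first N failures, synchronized access to the details list) can be 
sketched roughly as follows. This is an illustration under stated assumptions, 
not the PR's implementation: the class names, the `ConcurrentHashMap` backing 
store, and the String-based detail entries are all hypothetical.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.atomic.AtomicInteger;

   // Hypothetical stand-in for ReloadJobStatus: a plain data holder.
   final class JobStatusSketch {
     final AtomicInteger failureCount = new AtomicInteger();
     final List<String> failedSegmentDetails = new ArrayList<>();
   }

   // Hypothetical stand-in for the cache; the real cache may use a different store.
   final class ReloadStatusCacheSketch {
     private final ConcurrentHashMap<String, JobStatusSketch> jobs = new ConcurrentHashMap<>();
     private final String instanceId;
     private final int maxDetails;

     ReloadStatusCacheSketch(String instanceId, int maxDetails) {
       this.instanceId = instanceId;
       this.maxDetails = maxDetails;
     }

     void recordFailure(String jobId, String segmentName, Throwable t) {
       JobStatusSketch status = jobs.computeIfAbsent(jobId, k -> new JobStatusSketch());
       // The counter is always incremented, so the count stays exact.
       status.failureCount.incrementAndGet();
       // Details are stored only while fewer than maxDetails entries exist.
       synchronized (status.failedSegmentDetails) {
         if (status.failedSegmentDetails.size() < maxDetails) {
           status.failedSegmentDetails.add(instanceId + "/" + segmentName + ": " + t);
         }
       }
     }

     int failureCount(String jobId) {
       JobStatusSketch status = jobs.get(jobId);
       return status == null ? 0 : status.failureCount.get();
     }

     // Returns a snapshot copy so callers never iterate the shared list.
     List<String> details(String jobId) {
       JobStatusSketch status = jobs.get(jobId);
       if (status == null) {
         return new ArrayList<>();
       }
       synchronized (status.failedSegmentDetails) {
         return new ArrayList<>(status.failedSegmentDetails);
       }
     }
   }
   ```
   
   With a limit of 5, recording 8 failures for one job in this sketch leaves 
the failure count at 8 while only the first 5 detail entries are retained.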
   
   ### 3. Integration Point
   - **`BaseTableDataManager`** (line 804): Changed from simple counter 
increment to full failure recording
     - Before: 
`_reloadJobStatusCache.getOrCreate(reloadJobId).incrementAndGetFailureCount()`
     - After: `_reloadJobStatusCache.recordFailure(reloadJobId, segmentName, t)`
   
   ### 4. API Response Enhancement
   - **Server API**: `ServerReloadStatusResponse` (formerly 
`SegmentReloadStatusValue`)
     - Moved to `pinot-common` for sharing between modules
     - Added `sampleSegmentReloadFailures` field with fluent setters
     - Returns failed segment details (with server name populated)
     
   - **Controller API**: `PinotTableReloadStatusResponse`
     - Added `sampleSegmentReloadFailures` field
     - Aggregates failures from all servers with NO deduplication
     - Preserves server context: failures of the same segment on different 
servers are kept as separate entries
     - Limited to 500 failures max to prevent huge responses
   
   ### 5. Controller Aggregation Logic
   - **`PinotTableReloadStatusReporter`**: Enhanced to:
     - Collect failed segment details from all server responses
     - **NO deduplication**: Preserves server-specific context for debugging
     - Apply 500-segment limit across all servers
     - Enables pattern detection (e.g., "Server A failing many segments due to 
OOM")
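   
   The aggregation rules listed above (concatenate per-server failure lists, 
no deduplication, one global cap) could be sketched as below. The class and 
method names are hypothetical, and the order in which servers are visited is 
an assumption; only the 500 limit and the no-deduplication rule come from this 
description.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical illustration of the controller-side aggregation.
   final class FailureAggregationSketch {
     static final int MAX_FAILURES = 500;

     // Concatenates per-server failure lists in order, without deduplication,
     // and stops once the limit is reached.
     static <T> List<T> aggregate(List<List<T>> perServerFailures, int limit) {
       List<T> aggregated = new ArrayList<>();
       for (List<T> serverFailures : perServerFailures) {
         if (serverFailures == null) {
           continue; // the failures field is nullable, so a server may report none
         }
         for (T failure : serverFailures) {
           if (aggregated.size() >= limit) {
             return aggregated;
           }
           aggregated.add(failure);
         }
       }
       return aggregated;
     }
   }
   ```
   
   Because entries are never merged, a segment that fails on two servers 
appears twice in the result, once per server.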
   
   ## Design Rationale
   
   ### Why NO Deduplication?
   The same segment can fail on Server A (disk full) but succeed on Server B. 
Keeping each failure as a separate entry:
   - Enables root cause analysis (infrastructure vs. data corruption)
   - Preserves server context for targeted troubleshooting
   - Allows pattern detection across servers
   
   ### Memory Impact
   - **Per segment failure**: ~2KB (stack trace + metadata)
   - **Per job** (default 5 failures): ~10.4KB
   - **Cache-wide** (worst case 10,000 jobs): ~108MB
   - **Percentage of heap**: ~0.67% of a 16GB heap, ~0.34% of a 32GB heap
   
   ### Thread Safety
   - Cache layer handles all business logic and synchronization
   - Data classes remain simple POJOs
   - Synchronized access to failure details list per job status
   
   ## Testing
   
   - **18 comprehensive unit tests** in `ServerReloadJobStatusCacheTest`
     - Failure recording under/over limit
     - Concurrent failure recording (thread safety)
     - Config changes and cache rebuilds
     - Server name population
     - All tests passing ✅
   
   ## Backward Compatibility
   
   - New `sampleSegmentReloadFailures` field is nullable
   - Servers without reload job ID continue working (null handling)
   - Old API clients ignore new fields (JSON serialization)
   - No breaking changes to existing functionality
   
   ## Configuration
   
   New configuration property (dynamic via ZooKeeper):
   ```
   pinot.server.table.reload.status.cache.segment.failure.details.count = 5
   ```
   
   ## Example API Response
   
   ```json
   {
     "totalSegmentCount": 300,
     "successCount": 285,
     "failureCount": 15,
     "sampleSegmentReloadFailures": [
       {
         "segmentName": "myTable__0__123__20240101T0000Z",
         "serverName": "Server_192.168.1.10_8098",
         "errorMessage": "IOException: Disk full",
         "stackTrace": "java.io.IOException: Disk full\n  at ...",
         "failedAtMs": 1704067200000
       }
     ]
   }
   ```
   
   ## Related Work
   
   - Built on Phase 1 reload status cache infrastructure
   - Part of enhanced segment reload status tracking initiative
   - Addresses need for actionable debugging information during reload 
operations
   
   ## Next Steps
   
   Future enhancements may include:
   - Success/in-progress tracking with aggregate statistics (Phase 2)
   - Query parameters for detail level control
   - Filtering and pagination for large failure lists


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

