suvodeep-pyne opened a new issue, #17123:
URL: https://github.com/apache/pinot/issues/17123

   ## Problem Statement
   
   ### Current Limitation
   The existing segment reload status API (`GET 
/segments/segmentReloadStatus/{jobId}`) uses timestamp-based heuristics to 
determine reload success by comparing `segmentLoadTimeMs >= jobSubmissionTime`. 
This approach has several limitations:
   
   1. **Cannot distinguish between states**: Pending, in-progress, and failed 
segments all appear as "not successful"
   2. **No failure visibility**: When reloads fail, errors are logged but not 
queryable via API
   3. **No error details**: Users must access pod logs to understand why 
segments failed to reload
   4. **False positives**: Segments reloaded for unrelated reasons may be 
incorrectly counted as successful
   
   ### Impact
   - **Debugging difficulty**: Users cannot determine which segments failed or 
why without log access
   - **Monitoring gaps**: No programmatic way to alert on reload failures
   - **Incomplete observability**: Controllers and clients lack visibility into 
reload operation health
   
   ---
   
   ## Proposed Solution
   
   Implement an in-memory status cache on Pinot servers to track per-segment 
reload status with failure details, enabling comprehensive reload observability 
through the existing API.
   
   ### Design Principles
   1. **Memory-efficient**: Bounded cache size with LRU eviction and 
configurable TTL
   2. **Thread-safe**: Handle concurrent reload operations safely
   3. **Backward compatible**: Enhance existing APIs without breaking changes
   4. **Fail-fast**: The cache is required, not optional; the server fails at startup if it is misconfigured
   5. **Operationally friendly**: Configurable limits and auto-cleanup
   
   ---
   
   ## Implementation Plan
   
   ### Phase 1: Basic Failure Tracking ✅ (PR #17099)
   **Status**: In review
   
   **Scope**:
   - In-memory cache tracking failure counts per reload job
   - Job ID preservation in Helix messages
   - Enhanced server API returning failure counts
   - Controller aggregation of failure counts
   
   **Deliverables**:
   - `ServerReloadJobStatusCache` with configurable size/TTL
   - Server API: `GET /controllerJob/reloadStatus?reloadJobId={id}` returns 
`failureCount`
   - Controller API: `GET /segments/segmentReloadStatus/{jobId}` includes 
aggregated failures
   - Full backward compatibility maintained
   
   **Current PR**: #17099
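
   To make the Phase 1 shape concrete, here is a minimal, self-contained sketch of a bounded failure-count cache keyed by reload job ID, using Guava's `CacheBuilder` for the size/TTL limits from the configuration section below. The class and method names are illustrative only; PR #17099 contains the actual implementation.

   ```java
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicInteger;

   import com.google.common.cache.Cache;
   import com.google.common.cache.CacheBuilder;

   // Hypothetical sketch of the Phase 1 failure-count cache (names are illustrative).
   public class ServerReloadJobStatusCacheSketch {
     // jobId -> number of segments that failed to reload for that job.
     private final Cache<String, AtomicInteger> _failureCounts = CacheBuilder.newBuilder()
         .maximumSize(10_000)                  // pinot.server.reload.status.cache.max.size
         .expireAfterWrite(30, TimeUnit.DAYS)  // pinot.server.reload.status.cache.ttl.days
         .build();

     /** Called from the reload code path whenever a segment fails to reload. */
     public void recordFailure(String jobId) {
       _failureCounts.asMap().computeIfAbsent(jobId, k -> new AtomicInteger()).incrementAndGet();
     }

     /** Backs GET /controllerJob/reloadStatus?reloadJobId={id}; returns 0 if the job is unknown or evicted. */
     public int getFailureCount(String jobId) {
       AtomicInteger count = _failureCounts.getIfPresent(jobId);
       return count == null ? 0 : count.get();
     }
   }
   ```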
   
   ---
   
   ### Phase 2: Per-Segment Status Tracking (Planned)
   **Scope**:
   - Track individual segment status: `PENDING`, `IN_PROGRESS`, `SUCCESS`, 
`FAILED`
   - Store full error details including:
     - Exception class and message
     - Complete stack traces
     - Log correlation IDs
     - Failure timestamps
   - Enhanced response with per-segment breakdown
   
   **API Enhancement**:
   ```json
   {
     "successCount": 7,
     "failedCount": 2,
     "inProgressCount": 1,
     "segmentStatuses": [
       {
         "segmentName": "seg1",
         "status": "SUCCESS",
         "startTimeMs": 1234567890,
         "endTimeMs": 1234567900
       },
       {
         "segmentName": "seg2",
         "status": "FAILED",
         "errorSummary": "IOException: Connection timeout",
         "stackTrace": "java.io.IOException: ...",
         "logCorrelationId": "2024-01-15T10:23:45.123Z"
       }
     ]
   }
   ```
   
   **Memory Budget**:
   - Typical case: ~230 MB for 10,000 jobs with 100 segments each at a 1% failure rate
   - Worst case: ~1.1 GB at a 10% failure rate (failed entries are larger because each retains a full stack trace)
   - Impact: roughly 0.7% (230 MB on a 32 GB heap) to 6.9% (1.1 GB on a 16 GB heap) of a typical 16-32 GB server heap, which is acceptable
   
   ---
   
   ### Phase 3: Complete Failure Path Coverage (Planned)
   **Scope**:
   - Fix gaps in failure tracking coverage:
     - Single segment reload path (currently untracked)
     - Config fetch failures (occurs before instrumented try-catch)
     - Semaphore acquire failures (partial coverage)
   - Add a safety net at the message handler level to catch all failure types (a rough sketch follows this list)
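
   As a rough illustration of that safety net (reusing the hypothetical `ServerReloadJobStatusCacheSketch` from the Phase 1 section above; the handler class and method names here are likewise illustrative, not the actual Pinot/Helix message handler API), a handler-level catch would record any failure that slips past the instrumented reload paths:

   ```java
   // Hypothetical handler-level safety net; not the actual Pinot message handler classes.
   public class ReloadMessageHandlerSketch {
     private final ServerReloadJobStatusCacheSketch _statusCache;

     public ReloadMessageHandlerSketch(ServerReloadJobStatusCacheSketch statusCache) {
       _statusCache = statusCache;
     }

     public void handleReloadMessage(String jobId, Runnable reloadAction) {
       try {
         // Config fetch, semaphore acquisition, download and index building all run inside this
         // call, so an exception thrown on any of those paths is recorded against the job.
         reloadAction.run();
       } catch (RuntimeException e) {
         _statusCache.recordFailure(jobId);
         throw e;  // preserve the existing Helix error handling
       }
     }
   }
   ```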
   
   **Coverage Matrix** (Current State):
   
   | Failure Type | Single Segment | Batch Reload | Coverage |
   |--------------|----------------|--------------|----------|
   | Config fetch | ❌ Not tracked | ❌ Not tracked | 0% |
   | Semaphore acquire | ❌ Not tracked | ✅ Tracked | Partial |
   | Download failures | ❌ Not tracked | ✅ Tracked | Partial |
   | Index building | ❌ Not tracked | ✅ Tracked | Partial |
   
   ---
   
   ## Design Details
   
   ### Cache Structure
   ```java
   Cache<String, ConcurrentHashMap<String, ReloadSegmentStatus>>
   // Key: jobId (UUID)
   // Value: Map of segmentName -> status
   
   class ReloadSegmentStatus {
     ReloadStatus status;           // PENDING, IN_PROGRESS, SUCCESS, FAILED
     long startTimeMs;
     long endTimeMs;
     String errorSummary;           // Exception class + message (200 chars max)
     String stackTrace;             // Full stack trace (null if success)
     String logCorrelationId;       // ISO timestamp for log lookup
   }
   ```
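
   A minimal sketch of how a FAILED entry could be populated under this structure, assuming a `ReloadStatus` enum with the four values above and the configured 200-character summary limit; the tracker class and method names are illustrative, not existing code:

   ```java
   import java.time.Instant;
   import java.util.concurrent.ConcurrentHashMap;

   import com.google.common.base.Throwables;
   import com.google.common.cache.Cache;

   // Illustrative tracker showing how one FAILED entry would be recorded.
   public class ReloadSegmentStatusTrackerSketch {
     private static final int MAX_SUMMARY_CHARS = 200;  // pinot.server.reload.status.error.summary.max.chars

     // The bounded Guava cache sketched above (jobId -> segmentName -> status).
     private final Cache<String, ConcurrentHashMap<String, ReloadSegmentStatus>> _jobs;

     public ReloadSegmentStatusTrackerSketch(
         Cache<String, ConcurrentHashMap<String, ReloadSegmentStatus>> jobs) {
       _jobs = jobs;
     }

     public void recordFailure(String jobId, String segmentName, long startTimeMs, Throwable t) {
       ReloadSegmentStatus status = new ReloadSegmentStatus();
       status.status = ReloadStatus.FAILED;
       status.startTimeMs = startTimeMs;
       status.endTimeMs = System.currentTimeMillis();
       String summary = t.getClass().getSimpleName() + ": " + t.getMessage();
       status.errorSummary =
           summary.length() > MAX_SUMMARY_CHARS ? summary.substring(0, MAX_SUMMARY_CHARS) : summary;
       status.stackTrace = Throwables.getStackTraceAsString(t);
       status.logCorrelationId = Instant.now().toString();  // ISO timestamp for log lookup
       _jobs.asMap().computeIfAbsent(jobId, k -> new ConcurrentHashMap<>()).put(segmentName, status);
     }
   }
   ```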
   
   ### Configuration
   ```properties
   # Server configuration
   pinot.server.reload.status.cache.max.size=10000
   pinot.server.reload.status.cache.ttl.days=30
   pinot.server.reload.status.error.summary.max.chars=200
   ```
   
   ### Thread Safety
   - **Guava Cache** provides thread-safe access, bounded size, and TTL eviction for the outer map (jobId → segment map)
   - **ConcurrentHashMap** provides thread-safe per-segment updates in the inner map (segment → status)
   - State-transition validation prevents invalid updates (see the sketch below)
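
   For example, a minimal guard on the inner map, assuming the enum order `PENDING < IN_PROGRESS < SUCCESS/FAILED`; the transition rule and class name are assumptions for illustration, not the actual implementation:

   ```java
   import java.util.concurrent.ConcurrentHashMap;

   // Illustrative state-transition guard; ReloadStatus/ReloadSegmentStatus are the types sketched above.
   public final class ReloadStatusTransitions {
     private ReloadStatusTransitions() {
     }

     /** Atomically applies an update only if it moves the segment "forward". */
     public static void update(ConcurrentHashMap<String, ReloadSegmentStatus> segmentMap,
         String segmentName, ReloadSegmentStatus update) {
       segmentMap.compute(segmentName, (name, existing) ->
           existing == null || isValidTransition(existing.status, update.status) ? update : existing);
     }

     // Assumed ordering PENDING -> IN_PROGRESS -> SUCCESS/FAILED; terminal states are never overwritten.
     private static boolean isValidTransition(ReloadStatus from, ReloadStatus to) {
       if (from == ReloadStatus.SUCCESS || from == ReloadStatus.FAILED) {
         return false;
       }
       return to.ordinal() > from.ordinal();
     }
   }
   ```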
   
   ---
   
   ## Benefits
   
   1. **Improved Debugging**: Users can query failure details via API instead 
of searching logs
   2. **Better Monitoring**: Programmatic access to failure counts and error 
types
   3. **Operational Visibility**: Clear insight into reload job health across 
the cluster
   4. **Reduced MTTR**: Faster incident diagnosis with accessible error details
   5. **API Consistency**: Job-based tracking aligns with the job ID already returned by the reload APIs
   
   ---
   
   ## Alternatives Considered
   
   1. **ZooKeeper-based persistence**: Rejected due to write overhead and ZK 
load concerns
   2. **Compressed stack traces**: Rejected for simplicity; full traces fit within the acceptable memory budget
   3. **Optional cache**: Rejected; making the cache required simplifies the code and improves reliability
   
   ---
   
   ## Open Questions
   
   1. Should failed entries have longer TTL than successful ones for debugging?
   2. Should we expose cache admin APIs (clear, stats) for operational 
management?
   3. What metrics/alerts should be added for reload failure monitoring?
   
   ---
   
   ## References
   
   - Related PR: #17099 (Phase 1 implementation)
   - Design Principle: Memory-first with bounded growth
   - Inspiration: Similar tracking in other distributed systems (Kubernetes Job 
status, Spark task tracking)
   
   ---
   
   ## Community Feedback Welcome
   
   We welcome feedback on:
   - Phase priorities and scope
   - API response format
   - Memory budget concerns
   - Alternative approaches
   - Additional use cases
   
   ---
   
   **Note**: This is a multi-phase enhancement designed to incrementally 
improve reload observability while maintaining backward compatibility and 
operational stability. Phase 1 provides immediate value with minimal risk, 
while subsequent phases build toward comprehensive failure tracking.

