suvodeep-pyne opened a new issue, #17123:
URL: https://github.com/apache/pinot/issues/17123
## Problem Statement
### Current Limitation
The existing segment reload status API (`GET
/segments/segmentReloadStatus/{jobId}`) uses timestamp-based heuristics to
determine reload success by comparing `segmentLoadTimeMs >= jobSubmissionTime`.
This approach has several limitations:
1. **Cannot distinguish between states**: Pending, in-progress, and failed
segments all appear as "not successful"
2. **No failure visibility**: When reloads fail, errors are logged but not
queryable via API
3. **No error details**: Users must access pod logs to understand why
segments failed to reload
4. **False positives**: Segments reloaded for unrelated reasons may be
incorrectly counted as successful
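The false-positive problem can be illustrated with a minimal sketch (class and method names below are ours, not Pinot's): any segment whose load time happens to fall after job submission passes the timestamp check, while pending, in-progress, and failed segments are all indistinguishable.

```java
// Minimal illustration of the timestamp heuristic (hypothetical names, not Pinot's code).
public class ReloadHeuristicDemo {
  // The existing check: a segment counts as "reloaded" if its load time
  // is at or after the job submission time.
  static boolean looksReloaded(long segmentLoadTimeMs, long jobSubmissionTimeMs) {
    return segmentLoadTimeMs >= jobSubmissionTimeMs;
  }

  public static void main(String[] args) {
    long jobSubmittedAt = 1_700_000_000_000L;
    // False positive: a segment reloaded 5s later by an unrelated operation
    // still passes the check for this job.
    System.out.println(looksReloaded(jobSubmittedAt + 5_000, jobSubmittedAt)); // true
    // Pending, in-progress, and failed segments all look the same: "not reloaded".
    System.out.println(looksReloaded(jobSubmittedAt - 5_000, jobSubmittedAt)); // false
  }
}
```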
### Impact
- **Debugging difficulty**: Users cannot determine which segments failed or
why without log access
- **Monitoring gaps**: No programmatic way to alert on reload failures
- **Incomplete observability**: Controllers and clients lack visibility into
reload operation health
---
## Proposed Solution
Implement an in-memory status cache on Pinot servers to track per-segment
reload status with failure details, enabling comprehensive reload observability
through the existing API.
### Design Principles
1. **Memory-efficient**: Bounded cache size with LRU eviction and
configurable TTL
2. **Thread-safe**: Handle concurrent reload operations safely
3. **Backward compatible**: Enhance existing APIs without breaking changes
4. **Fail-fast**: The cache is required, not optional; the server fails at
startup if it is misconfigured
5. **Operationally friendly**: Configurable limits and auto-cleanup
---
## Implementation Plan
### Phase 1: Basic Failure Tracking ✅ (PR #17099)
**Status**: In review
**Scope**:
- In-memory cache tracking failure counts per reload job
- Job ID preservation in Helix messages
- Enhanced server API returning failure counts
- Controller aggregation of failure counts
**Deliverables**:
- `ServerReloadJobStatusCache` with configurable size/TTL
- Server API: `GET /controllerJob/reloadStatus?reloadJobId={id}` returns
`failureCount`
- Controller API: `GET /segments/segmentReloadStatus/{jobId}` includes
aggregated failures
- Full backward compatibility maintained
**Current PR**: #17099
---
### Phase 2: Per-Segment Status Tracking (Planned)
**Scope**:
- Track individual segment status: `PENDING`, `IN_PROGRESS`, `SUCCESS`,
`FAILED`
- Store full error details including:
- Exception class and message
- Complete stack traces
- Log correlation IDs
- Failure timestamps
- Enhanced response with per-segment breakdown
**API Enhancement**:
```json
{
"successCount": 7,
"failedCount": 2,
"inProgressCount": 1,
"segmentStatuses": [
{
"segmentName": "seg1",
"status": "SUCCESS",
"startTimeMs": 1234567890,
"endTimeMs": 1234567900
},
{
"segmentName": "seg2",
"status": "FAILED",
"errorSummary": "IOException: Connection timeout",
"stackTrace": "java.io.IOException: ...",
"logCorrelationId": "2024-01-15T10:23:45.123Z"
}
]
}
```
**Memory Budget**:
- Typical case: ~230 MB for 10,000 jobs with 100 segments each (1% failure
rate)
- Worst case: ~1.1 GB for 10% failure rate
- Impact: 0.7-6.9% of a typical 16-32 GB server heap (acceptable)
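A back-of-envelope check of these figures. The per-entry sizes below (~130 B for a success entry, ~9.8 KB for a failed entry carrying a stack trace) are our assumptions, chosen to be consistent with the totals above rather than measured values:

```java
// Back-of-envelope memory budget (assumed per-entry sizes, not measured values).
public class ReloadCacheBudget {
  static final long SUCCESS_ENTRY_BYTES = 130;    // status + timestamps + map overhead
  static final long FAILED_ENTRY_BYTES = 9_800;   // adds error summary + stack trace

  static double budgetMb(long jobs, long segmentsPerJob, double failureRate) {
    long total = jobs * segmentsPerJob;
    long failed = Math.round(total * failureRate);
    long ok = total - failed;
    return (ok * SUCCESS_ENTRY_BYTES + failed * FAILED_ENTRY_BYTES) / 1e6;
  }

  public static void main(String[] args) {
    // 10,000 jobs x 100 segments each
    System.out.printf("1%% failures:  ~%.0f MB%n", budgetMb(10_000, 100, 0.01)); // ~227 MB
    System.out.printf("10%% failures: ~%.0f MB%n", budgetMb(10_000, 100, 0.10)); // ~1097 MB
  }
}
```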
---
### Phase 3: Complete Failure Path Coverage (Planned)
**Scope**:
- Fix gaps in failure tracking coverage:
- Single segment reload path (currently untracked)
- Config fetch failures (occurs before instrumented try-catch)
- Semaphore acquire failures (partial coverage)
- Add safety net at message handler level to catch all failure types
**Coverage Matrix** (Current State):
| Failure Type | Single Segment | Batch Reload | Coverage |
|--------------|----------------|--------------|----------|
| Config fetch | ❌ Not tracked | ❌ Not tracked | 0% |
| Semaphore acquire | ❌ Not tracked | ✅ Tracked | Partial |
| Download failures | ❌ Not tracked | ✅ Tracked | Partial |
| Index building | ❌ Not tracked | ✅ Tracked | Partial |
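One way to implement the message-handler safety net is to wrap the entire reload path in a single catch-all, so failures that occur before the currently instrumented try-catch (config fetch, semaphore acquire) are still recorded. A sketch with hypothetical names, not Pinot's actual handler classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a handler-level safety net (hypothetical names, not Pinot's code).
public class ReloadMessageHandlerSketch {
  static final Map<String, Integer> FAILURE_COUNTS = new ConcurrentHashMap<>();

  interface ReloadAction { void run() throws Exception; }

  // Wraps the whole reload path so config-fetch, semaphore, download, and
  // index-build failures are all counted, regardless of where they occur.
  static void handleReloadMessage(String jobId, ReloadAction action) {
    try {
      action.run();
    } catch (Exception e) {
      FAILURE_COUNTS.merge(jobId, 1, Integer::sum); // record before rethrowing
      throw new RuntimeException("Reload failed for job " + jobId, e);
    }
  }
}
```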
---
## Design Details
### Cache Structure
```java
// Outer cache: jobId (UUID) -> map of segmentName -> status.
// Bounded by the Guava Cache size/TTL settings configured below.
Cache<String, ConcurrentHashMap<String, ReloadSegmentStatus>> _jobStatusCache;

enum ReloadStatus { PENDING, IN_PROGRESS, SUCCESS, FAILED }

class ReloadSegmentStatus {
  ReloadStatus status;
  long startTimeMs;
  long endTimeMs;
  String errorSummary;      // Exception class + message (200 chars max)
  String stackTrace;        // Full stack trace (null on success)
  String logCorrelationId;  // ISO timestamp for log lookup
}
```
### Configuration
```properties
# Server configuration
pinot.server.reload.status.cache.max.size=10000
pinot.server.reload.status.cache.ttl.days=30
pinot.server.reload.status.error.summary.max.chars=200
```
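Per the fail-fast design principle, these settings can be validated once at startup so a misconfigured server refuses to start. A minimal sketch using `java.util.Properties` (the key names and defaults come from above; the validation logic is ours):

```java
import java.util.Properties;

// Sketch: parse and validate reload-status cache settings at startup (fail-fast).
public class ReloadCacheConfig {
  final long maxSize;
  final long ttlDays;
  final int errorSummaryMaxChars;

  ReloadCacheConfig(Properties props) {
    maxSize = Long.parseLong(
        props.getProperty("pinot.server.reload.status.cache.max.size", "10000"));
    ttlDays = Long.parseLong(
        props.getProperty("pinot.server.reload.status.cache.ttl.days", "30"));
    errorSummaryMaxChars = Integer.parseInt(
        props.getProperty("pinot.server.reload.status.error.summary.max.chars", "200"));
    // The cache is required, not optional: refuse to start with unusable settings.
    if (maxSize <= 0 || ttlDays <= 0 || errorSummaryMaxChars <= 0) {
      throw new IllegalStateException("Invalid reload status cache configuration");
    }
  }
}
```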
### Thread Safety
- **Guava Cache** protects outer map (jobId → segment map)
- **ConcurrentHashMap** protects inner map (segment → status)
- State transition validation prevents invalid updates
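The two-level scheme can be sketched as follows. To stay dependency-free, this sketch substitutes a plain `ConcurrentHashMap` for the bounded Guava Cache at the outer level, and the method names are ours; it shows the atomic-creation and per-segment-serialization points, plus one example of state transition validation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the two-level thread-safe update (outer map shown as a plain
// ConcurrentHashMap; the actual design bounds it with a Guava Cache).
public class ReloadStatusTracker {
  enum ReloadStatus { PENDING, IN_PROGRESS, SUCCESS, FAILED }

  static class SegmentStatus {
    volatile ReloadStatus status = ReloadStatus.PENDING;
    volatile String errorSummary;
  }

  final Map<String, ConcurrentHashMap<String, SegmentStatus>> jobs = new ConcurrentHashMap<>();

  void markFailed(String jobId, String segment, Exception e) {
    // computeIfAbsent creates the inner map atomically, once per jobId.
    ConcurrentHashMap<String, SegmentStatus> segs =
        jobs.computeIfAbsent(jobId, id -> new ConcurrentHashMap<>());
    // compute() serializes concurrent updates to the same segment entry.
    segs.compute(segment, (name, prev) -> {
      SegmentStatus s = (prev != null) ? prev : new SegmentStatus();
      // State transition validation: a terminal SUCCESS is never overwritten.
      if (s.status != ReloadStatus.SUCCESS) {
        s.status = ReloadStatus.FAILED;
        s.errorSummary = e.getClass().getSimpleName() + ": " + e.getMessage();
      }
      return s;
    });
  }
}
```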
---
## Benefits
1. **Improved Debugging**: Users can query failure details via API instead
of searching logs
2. **Better Monitoring**: Programmatic access to failure counts and error
types
3. **Operational Visibility**: Clear insight into reload job health across
the cluster
4. **Reduced MTTR**: Faster incident diagnosis with accessible error details
5. **API Consistency**: Job-based tracking aligns with job ID already
returned by reload APIs
---
## Alternatives Considered
1. **ZooKeeper-based persistence**: Rejected due to write overhead and ZK
load concerns
2. **Compressed stack traces**: Rejected for simplicity; full traces fit
within the memory budget above
3. **Optional cache**: Rejected; making cache required simplifies code and
improves reliability
---
## Open Questions
1. Should failed entries have longer TTL than successful ones for debugging?
2. Should we expose cache admin APIs (clear, stats) for operational
management?
3. What metrics/alerts should be added for reload failure monitoring?
---
## References
- Related PR: #17099 (Phase 1 implementation)
- Design Principle: Memory-first with bounded growth
- Inspiration: Similar tracking in other distributed systems (Kubernetes Job
status, Spark task tracking)
---
## Community Feedback Welcome
We welcome feedback on:
- Phase priorities and scope
- API response format
- Memory budget concerns
- Alternative approaches
- Additional use cases
---
**Note**: This is a multi-phase enhancement designed to incrementally
improve reload observability while maintaining backward compatibility and
operational stability. Phase 1 provides immediate value with minimal risk,
while subsequent phases build toward comprehensive failure tracking.