hongkunxu opened a new issue, #16889:
URL: https://github.com/apache/pinot/issues/16889
### Description
Currently, Pinot’s DataIngestionJob has a limitation when performing
backfill ingestion. The job assumes that a backfill run will generate at
least as many segments as the original ingestion.
When the backfill input directory contains fewer files than the original
run, the segment generation job produces fewer segments. Because a pushed
segment only replaces an existing segment with the same name, only some of
the existing segments are replaced. The remaining old segments stay in the
table and keep serving stale data.
### Example
- Suppose table airlineStats has 2 segments for 2014-01-01:
  - airlineStats_2014-01-01_2014-01-01_0
  - airlineStats_2014-01-01_2014-01-01_1
- The backfill input directory only contains 1 input file for the same date.
- The segment generation job produces just 1 segment:
  - airlineStats_2014-01-01_2014-01-01_0
- After pushing, only _0 gets replaced, while _1 from the original ingestion
  is still present, leading to incorrect/stale data.
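
The scenario above can be reproduced with a small simulation. This is only a sketch: it models the table as a dictionary keyed by segment name, and `push_segments` is a hypothetical helper, not a Pinot API. It assumes a push overwrites a segment with the same name and leaves every other segment untouched.

```python
def push_segments(table_segments, new_segments):
    """Simulate a name-keyed push (hypothetical helper, not a Pinot API).

    A pushed segment overwrites an existing segment with the same name;
    segments with other names are left as they are.
    """
    updated = dict(table_segments)
    updated.update(new_segments)
    return updated


# Two segments from the original ingestion for 2014-01-01.
existing = {
    "airlineStats_2014-01-01_2014-01-01_0": "original",
    "airlineStats_2014-01-01_2014-01-01_1": "original",
}

# The backfill run only produced one segment for the same date.
backfill = {"airlineStats_2014-01-01_2014-01-01_0": "backfill"}

after_push = push_segments(existing, backfill)

# Segments that still carry data from the original ingestion.
stale = sorted(name for name, origin in after_push.items() if origin == "original")
print(stale)  # ['airlineStats_2014-01-01_2014-01-01_1']
```

The `_1` segment survives the push because nothing in the backfill output shares its name.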
### Impact
If raw data changes such that a given time bucket has fewer input files than
in the first ingestion run, backfill fails to fully replace the existing
segments. This makes backfill unreliable for correcting historical data.
### Proposal
Introduce a new job, tentatively named BackfillIngestionJob, which is
designed to correctly handle these edge cases. This job should:
1. Ensure that all original segments in the target time range are
replaced/removed.
2. Guarantee that stale data from older segments does not persist after
backfill.
3. Provide a consistent and reliable workflow for batch backfill ingestion.
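
One way to think about requirements 1 and 2 is as a planning step that runs before the push. The sketch below is an illustration only, not the proposed BackfillIngestionJob: `SegmentInfo` and `plan_backfill` are hypothetical names, and it assumes each existing segment exposes a start and end date and that only segments fully contained in the backfill window are in scope.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SegmentInfo:
    """Hypothetical view of a segment: its name and the dates it covers."""
    name: str
    start: date
    end: date


def plan_backfill(existing, new_names, window_start, window_end):
    """Split existing in-window segments into (to_overwrite, to_delete).

    to_overwrite: names the backfill output will replace by name.
    to_delete: in-window names with no replacement, which would otherwise
    remain in the table as stale data.
    """
    new_names = set(new_names)
    in_window = [
        s for s in existing
        if window_start <= s.start and s.end <= window_end
    ]
    to_overwrite = sorted(s.name for s in in_window if s.name in new_names)
    to_delete = sorted(s.name for s in in_window if s.name not in new_names)
    return to_overwrite, to_delete


day = date(2014, 1, 1)
next_day = date(2014, 1, 2)
existing = [
    SegmentInfo("airlineStats_2014-01-01_2014-01-01_0", day, day),
    SegmentInfo("airlineStats_2014-01-01_2014-01-01_1", day, day),
    SegmentInfo("airlineStats_2014-01-02_2014-01-02_0", next_day, next_day),
]

to_overwrite, to_delete = plan_backfill(
    existing, ["airlineStats_2014-01-01_2014-01-01_0"], day, day
)
print(to_overwrite)  # ['airlineStats_2014-01-01_2014-01-01_0']
print(to_delete)     # ['airlineStats_2014-01-01_2014-01-01_1']
```

Segments outside the window, such as the 2014-01-02 segment here, appear in neither list and are left alone. How the delete and the push are made atomic from the query side is left open in this sketch.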