tarun11Mavani opened a new issue, #15845: URL: https://github.com/apache/pinot/issues/15845
UpsertCompactMergeTask was introduced in [#14477](https://github.com/apache/pinot/pull/14477). I am creating this parent issue to track the work required to make this feature production ready.

- Fix the data inconsistency issue across segment replicas
- SegmentRefresh task compatibility with UpsertCompactMerge task [#14633](https://github.com/apache/pinot/issues/14633)
- Add documentation for UpsertCompactMergeTask

### Data inconsistency issue across segment replicas due to different segment creation times

**Description:**
We've identified an issue where differing segment creation times across replicas cause merge compaction to behave differently on each server, resulting in data inconsistencies across servers.

After several runs of UpsertCompactMergeTask, we observed data inconsistencies across segment replicas. A `COUNT(*)` query began returning different total row counts for a table where consumption had been paused. Additionally, a query on a specific primary key, which should return exactly one record regardless of the server handling it, returned one record from some servers and none from others.

**Root Cause Analysis:**
During merge compaction, Pinot determines the creation time of the new segment (creationTimeNewSegment) using the maximum creation time among the old segments [here](https://github.com/apache/pinot/blob/c9f0c47d0ad96607760b706a79802d1598222ef3/pinot-plugins/pinot-minion-tasks/pinot-minion-builtin-tasks/src/main/java/org/apache/pinot/plugin/minion/tasks/upsertcompactmerge/UpsertCompactMergeTaskExecutor.java#L103):

`creationTimeNewSegment >= max(creationTime(oldSegments))`

However, since replicas of the same segment can have different creation times across servers, this approach can lead to inconsistencies.

**Scenario:**
- Record R1 is indexed in segment S1, which is committed at time T on Server1 and T+10 on Server2.
- S1 is selected for merge compaction along with segment S0.
- The new compacted segment, compact_S3, is assigned a creation time of T (based on the minimum creation time among the replicas of S1).

When compact_S3 is added or replaced:
- Server1: R1's comparison value is the same in S1 and compact_S3, and the two segments' creation times are also the same (T). Therefore, shouldReplaceOnComparisonTie returns true, and R1 in S1 is replaced with R1 in compact_S3.
- Server2: R1's comparison value is the same, but S1's creation time is T+10, while compact_S3's is T. Thus, shouldReplaceOnComparisonTie returns false, and R1 in S1 is retained.

In the next compaction task, validDocIds of the older segments are fetched. If Server1's validDocIds for S1 indicate that all records have been replaced, S1 is marked for deletion. Consequently, S1 is deleted from both Server1 and Server2.

Post-deletion:
- Server1: No impact; PK metadata points to R1 in compact_S3.
- Server2: PK metadata still points to R1 in the now-deleted S1, leading to data inconsistency.

**Proposed Solution:**
To ensure consistency across replicas during merge compaction, we propose determining creationTimeNewSegment from the maximum creation time across all replicas of the old segments. With this change, shouldReplaceOnComparisonTie would behave consistently across all servers.

**Next Steps:**
I plan to raise a PR implementing this change.

cc: @klsince @Jackie-Jiang @ankitsultana @rohityadav1993 @tibrewalpratik17
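The scenario and the proposed fix can be illustrated with a minimal Java sketch. It is a simplified model, not Pinot's code: the class `CreationTimeTieSketch`, the method names, and the rule "the incoming segment wins a tie only if its creation time is not older" are assumptions chosen to match the behavior described in this issue, and the creation times used for S0 are invented for illustration.

```java
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the scenario in this issue; not Pinot's actual classes.
final class CreationTimeTieSketch {
  private CreationTimeTieSketch() {
  }

  // Assumed stand-in for the tie-break: when comparison values are equal, the
  // incoming segment replaces the current one only if it is not older.
  static boolean replacesOnTie(long currentSegmentCreationTime, long newSegmentCreationTime) {
    return newSegmentCreationTime >= currentSegmentCreationTime;
  }

  // Proposed rule: take the maximum creation time over every replica of every old segment.
  static long maxCreationTimeAcrossReplicas(Map<String, List<Long>> replicaCreationTimesBySegment) {
    return replicaCreationTimesBySegment.values().stream()
        .flatMap(List::stream)
        .mapToLong(Long::longValue)
        .max()
        .orElseThrow(() -> new IllegalArgumentException("no old segments given"));
  }

  public static void main(String[] args) {
    long t = 1_000L;                 // T
    long s1OnServer1 = t;            // S1 committed at T on Server1
    long s1OnServer2 = t + 10;       // S1 committed at T+10 on Server2

    // Current behavior in the scenario: compact_S3 gets creation time T.
    long compactS3 = t;
    System.out.println("current, Server1 replaces: " + replacesOnTie(s1OnServer1, compactS3));
    System.out.println("current, Server2 replaces: " + replacesOnTie(s1OnServer2, compactS3));

    // Proposed behavior: use the max across all replicas of S0 and S1.
    // The S0 creation times are made up for this example.
    long fixedCompactS3 = maxCreationTimeAcrossReplicas(Map.of(
        "S0", List.of(t - 100, t - 95),
        "S1", List.of(s1OnServer1, s1OnServer2)));
    System.out.println("proposed, Server1 replaces: " + replacesOnTie(s1OnServer1, fixedCompactS3));
    System.out.println("proposed, Server2 replaces: " + replacesOnTie(s1OnServer2, fixedCompactS3));
  }
}
```

Under these assumptions, the two servers disagree when compact_S3 carries creation time T, and agree once it carries the maximum across all replicas (T+10 here), because no replica of an old segment can then be newer than the compacted segment.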