snleee commented on a change in pull request #8242:
URL: https://github.com/apache/pinot/pull/8242#discussion_r816382003



##########
File path: 
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java
##########
@@ -2963,6 +2961,25 @@ public String startReplaceSegments(String 
tableNameWithType, List<String> segmen
     return segmentLineageEntryId;
   }
 
+  // TODO: Add more conflict checks over segmentsTo later. For example, for 
APPEND table,
+  //       if the new segments from 2 batch jobs are overlapping, we reject 
one of job.
+  private static boolean isConflicted(List<String> segmentsFrom, LineageEntry 
lineageEntry, TableConfig tableConfig) {
+    if (!segmentsFrom.isEmpty()) {
+      // It's conflicted if there is any overlap between segmentsFrom.
+      return !Collections.disjoint(segmentsFrom, 
lineageEntry.getSegmentsFrom());
+    }
+    // For REFRESH table, it's conflicted if both segmentsFrom are empty.
+    if (isRefreshTable(tableConfig)) {

Review comment:
       @Jackie-Jiang 
   
   My reasoning was based on the following assumption:
   
   ```
   For REFRESH table, each offline ingestion flow would push the full data 
snapshot.
   ```
   
   If this is the case, the correct behavior should be blocking if 2 flows with 
`segmentsFrom=empty` are running at the same time because this can result in 
the REFRESH table having 2 snapshots of data. (Even if the input files are the 
same, we now put `timestamp` to the segment name so 2 flows will be uploaded)
   
   If we can make the assumption that we have 1 flow per REFRESH table, this 
can only happen if some user runs 2 push flow at the same time for an empty 
table, which will not likely happen. Then, I see your point that it's cleaner 
if we don't handle special logic for different push types. I'm fine with 
removing special handling for REFRESH.
   
   On the other hand, if you have REFRESH tables with multiple flows set up, I 
think that this model will be hard to support the consistent push feature 
because you will have the hard time differentiating the segments from one flow 
to another as @jackjlli mentioned above. IMO, we should not support such model 
(having 2+ flows pushing to a single REFRESH table).
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to