swaminathanmanish commented on PR #11650:
URL: https://github.com/apache/pinot/pull/11650#issuecomment-1740308268

   > Related to #11649
   > 
   > We use the `normalizedDate` segment name generator for an append table w/ 
a time column. The generator adds min/max time values to the segment name (ex. 
`example_table_2023-09-22_2023-10-02_46`). We had a user backfill their batch 
jobs, overwriting existing data. Some of the upstream data had changed as part 
of this backfill, which caused the min/max time values in the segment name to 
also change. This caused inconsistent data since some new segments didn't 
replace their old ones.
   > 
   > This PR adds param `omit.timestamps.in.segment.name` to the 
SimpleSegmentNameGenerator to omit time values from the segment name. I've only 
added this to the simple generator since it doesn't seem intuitive to have the 
`normalizedDate` generator omit timestamps. We plan to create segments w/ this 
generator by including the execution date for the batch run (ex. 
`example_table_20230928_46`) since it will give us a consistent set of segment 
file names as long as the # of input files is the same.
   > 
   > I've updated the unit tests. It seems like the best place to write an 
end-to-end test might be in 
https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-standalone/src/test/java/org/apache/pinot/plugin/ingestion/batch/standalone/SegmentGenerationJobRunnerTest.java
 Let me know if that's desired.
   > 
   > Long-term, it seems ideal that we'd use 
https://docs.pinot.apache.org/operators/operating-pinot/consistent-push-and-rollback
 for re-running segment creation on a given day so users aren't constrained by 
how the input data is structured.
   > 
   > cc @Jackie-Jiang
   
   Thanks for adding detailed notes. Did you consider using 
**FixedSegmentNameGenerator**, where you can specify the exact segment name 
that you want?
   
   ```
   case BatchConfigProperties.SegmentNameGeneratorType.FIXED:
     return new FixedSegmentNameGenerator(_segmentName);
   ```
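
To make the naming concern concrete, here is a minimal, self-contained sketch of why a name that embeds min/max time values can fail to replace an old segment after a backfill, while a name without time values stays stable. It does not use Pinot's classes: `SegmentNameSketch` and its helper methods are hypothetical, the backfilled min time `2023-09-23` is an invented illustrative value, and modelling the segment store as a set keyed by segment name is a simplifying assumption.

```java
import java.util.HashSet;
import java.util.Set;

public class SegmentNameSketch {
    // Hypothetical helper: a name that embeds the data's min/max time values.
    static String nameWithTimes(String table, String minTime, String maxTime, int sequenceId) {
        return table + "_" + minTime + "_" + maxTime + "_" + sequenceId;
    }

    // Hypothetical helper: a name built only from a caller-chosen postfix
    // (for example the batch run's execution date) and a sequence id.
    static String nameWithoutTimes(String table, String postfix, int sequenceId) {
        return table + "_" + postfix + "_" + sequenceId;
    }

    // Assumption: a push under an existing name replaces that segment,
    // while a push under a new name adds another segment next to the old one.
    static int segmentCountAfterPushes(String... names) {
        Set<String> store = new HashSet<>();
        for (String name : names) {
            store.add(name);
        }
        return store.size();
    }

    public static void main(String[] args) {
        // Original run, then a backfill whose upstream data shifted the min time.
        String original = nameWithTimes("example_table", "2023-09-22", "2023-10-02", 46);
        String backfill = nameWithTimes("example_table", "2023-09-23", "2023-10-02", 46);
        System.out.println("with times:    " + segmentCountAfterPushes(original, backfill));

        // Same two runs, named by execution date only: the names are identical.
        String stableOriginal = nameWithoutTimes("example_table", "20230928", 46);
        String stableBackfill = nameWithoutTimes("example_table", "20230928", 46);
        System.out.println("without times: " + segmentCountAfterPushes(stableOriginal, stableBackfill));
    }
}
```

In this model the time-based names leave two segments behind (old and new), whereas the date-postfix names collapse to one. A fixed, fully specified segment name behaves like the second case, provided each input file maps to a distinct name.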


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

