swaminathanmanish commented on PR #11650: URL: https://github.com/apache/pinot/pull/11650#issuecomment-1740308268
> Related to #11649
>
> We use the `normalizedDate` segment name generator for an append table w/ a time column. The generator adds the min/max time values to the segment name (ex. `example_table_2023-09-22_2023-10-02_46`). We had a user backfill their batch jobs, overwriting existing data. Some of the upstream data had changed as part of this backfill, which caused the min/max time values in the segment name to change as well. This caused inconsistent data since some new segments didn't replace their old ones.
>
> This PR adds the param `omit.timestamps.in.segment.name` to the SimpleSegmentNameGenerator to omit time values from the segment name. I've only added this to the simple generator since it doesn't seem intuitive to have the `normalizedDate` generator omit timestamps. We plan to create segments w/ this generator by including the execution date for the batch run (ex. `example_table_20230928_46`), since it will give us a consistent set of segment file names as long as the # of input files is the same.
>
> I've updated the unit tests. It seems like the best place to write an end-to-end test might be in https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-standalone/src/test/java/org/apache/pinot/plugin/ingestion/batch/standalone/SegmentGenerationJobRunnerTest.java
> Let me know if that's desired.
>
> Long-term, it seems ideal that we'd use https://docs.pinot.apache.org/operators/operating-pinot/consistent-push-and-rollback for re-running segment creation on a given day so users aren't constrained on how the input data is structured.
>
> cc @Jackie-Jiang

Thanks for adding detailed notes. Did you consider using **FixedSegmentNameGenerator**, where you can specify exactly the segment name that you want?

```
BatchConfigProperties.SegmentNameGeneratorType.FIXED:
  return new FixedSegmentNameGenerator(_segmentName);
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: commits-h...@pinot.apache.org
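
To make the naming issue from the quoted PR description concrete, here is a minimal standalone Java sketch. It is not Pinot's implementation: `SegmentNameSketch`, `withTimeRange`, and `withoutTimeRange` are hypothetical helpers invented for illustration, and the underscore-joined layout is assumed only from the example names in the description. It shows that a name containing min/max time values changes when a backfill shifts the data's time range, while a name built from a fixed prefix (such as the batch execution date) and a sequence id stays the same.

```java
import java.util.StringJoiner;

// Illustrative only: NOT Pinot's SimpleSegmentNameGenerator or
// FixedSegmentNameGenerator. All names here are hypothetical.
final class SegmentNameSketch {
    private SegmentNameSketch() {}

    // Name that embeds the min/max time values of the segment's data.
    static String withTimeRange(String prefix, String minTime, String maxTime, int sequenceId) {
        return join(prefix, minTime, maxTime, String.valueOf(sequenceId));
    }

    // Name that omits time values and depends only on the prefix and sequence id.
    static String withoutTimeRange(String prefix, int sequenceId) {
        return join(prefix, String.valueOf(sequenceId));
    }

    private static String join(String... parts) {
        StringJoiner joiner = new StringJoiner("_");
        for (String part : parts) {
            joiner.add(part);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Original run vs. a backfill whose upstream data now starts one day earlier.
        String original = withTimeRange("example_table", "2023-09-22", "2023-10-02", 46);
        String backfill = withTimeRange("example_table", "2023-09-21", "2023-10-02", 46);
        System.out.println(original.equals(backfill)); // false: the old segment is not replaced

        // With the execution date as part of a fixed prefix, both runs agree.
        String stableOriginal = withoutTimeRange("example_table_20230928", 46);
        String stableBackfill = withoutTimeRange("example_table_20230928", 46);
        System.out.println(stableOriginal.equals(stableBackfill)); // true: same name, so it overwrites
    }
}
```

The stable variant only helps if the number and order of input files is the same across runs, as the PR description notes, because the sequence id is the only part that distinguishes segments within one run.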