rohityadav1993 commented on issue #14083: URL: https://github.com/apache/pinot/issues/14083#issuecomment-2449690082

@pengding-stripe, I believe the reason the two exist is that `batchConfigMap` can also come from the tableConfig, and jobs should consider both (the flink-connector does this). I looked at the segmentGenerationJob: it reads configs from the spec only, and currently it does not read the configs needed for generating uploadedRealtime segments. To support this, the following is needed:

- **Creation time**: Usually the segment creation time can be `currentTimeMs()`, but some use cases may want a more deterministic time, e.g. an upload time.
- **Prefix**: Can be anything.
- **Suffix**: Generally good to keep as the sequence id; the current Spark jobs already index the files within a directory, and this can be reused.
- **PartitionId**: For an append-only table it does not matter, but we should try to spread the partition ids out as much as possible to avoid data skew in a partition. For upsert tables it must be provided consistent with the partitioning of the stream based on primary keys (see the second sketch below). I think the Spark jobs are not currently implemented in a way that generates partitioned segments, so upserts would not work when those segments are uploaded. (I'll cover this as part of #12987.)

For non-upsert realtime use cases, I'll raise a PR to support generating uploadedRealtime segments conforming to UploadedRealtimeSegmentName (a rough sketch of the name fields follows).
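For illustration, here is a rough Java sketch of how a batch job could assemble the fields discussed above (prefix, partition id, creation time, sequence-id suffix) into a segment name. The `"__"` separator and the field order are assumptions made for readability only; a real implementation should construct the name via Pinot's `UploadedRealtimeSegmentName` class so it round-trips through the server-side parser.

```java
import java.util.StringJoiner;

public class UploadedRealtimeSegmentNameSketch {

  /**
   * Assembles a segment name from the fields discussed above. The separator and field
   * order here are illustrative assumptions; the real job should delegate to Pinot's
   * UploadedRealtimeSegmentName so the generated name parses correctly on the server.
   */
  public static String buildSegmentName(String prefix, String tableName, int partitionId,
      int sequenceId, long creationTimeMs) {
    return new StringJoiner("__")
        .add(prefix)                          // can be anything, e.g. "uploaded"
        .add(tableName)                       // raw table name
        .add(Integer.toString(partitionId))   // stream partition the rows map to
        .add(Long.toString(creationTimeMs))   // currentTimeMillis() or a deterministic upload time
        .add(Integer.toString(sequenceId))    // e.g. index of the input file in the directory
        .toString();
  }

  public static void main(String[] args) {
    // Hypothetical values, only to show the shape of the resulting name.
    System.out.println(buildSegmentName("uploaded", "myTable", 3, 0, System.currentTimeMillis()));
  }
}
```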
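On the upsert point: the partition id stamped on the segment (and the partition of every row in it) has to match how the stream partitions records by primary key. As a minimal sketch, assuming the stream is Kafka with its default partitioner (murmur2 over the serialized key), the batch job could compute the partition id per primary key as below; the key serialization and the partition count are assumptions and must match the actual topic and producer setup.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class UpsertPartitionIdSketch {

  /**
   * Computes the partition id for a primary key the same way Kafka's default partitioner
   * does (murmur2 over the serialized key, mod partition count). This only holds if the
   * producers feeding the realtime table also use the default partitioner and the same
   * key serialization -- both of which are assumptions in this sketch.
   */
  public static int partitionIdForKey(String primaryKey, int numStreamPartitions) {
    byte[] keyBytes = primaryKey.getBytes(StandardCharsets.UTF_8);
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numStreamPartitions;
  }

  public static void main(String[] args) {
    // Example: route rows for primary key "user-42" to the partition Kafka would pick
    // for a topic with 8 partitions (hypothetical values).
    System.out.println(partitionIdForKey("user-42", 8));
  }
}
```

All rows that hash to the same partition id would then need to land in segments named with that partition id, otherwise the server-side upsert metadata manager cannot resolve the primary keys consistently.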
@pengding-stripe , I believe the reason for the two to exist is batchConfigMap can come from tableConfig and jobs should consider both(flink-connector does it). Looked at the segmentGenerationJob and it gets the configs from spec only and currently it does not read configs needed for generating uploadedRealtime segments. To support , following is needed: - **Creation time**: Usually, the segment creation time can be currentTimeMs() but some usecases can also put a more deterministic time i.e. an upload time. - **Prefix** can be anything. - **Suffix** is generally good to keep as sequence id and current spark jobs do have indexing based on files in a directory. This can be reused. - **PartitionId**: For append only table it does not matter but we should try to generate partition id as spread out as possible to avoid data skew in a partition. For upsert tables it must be provided consistent with the partitioning of the stream based on primary keys. I think the spark jobs are not implemented in way to generate partitioned segments for upserts to work when uploaded. (I'll cover this as part of #12987) For non-upsert realtime usecases, I'll raise a PR to support uploadedRealtime segment conforming to UploadedRealtimeSegmentName -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org