rohityadav1993 commented on issue #14083:
URL: https://github.com/apache/pinot/issues/14083#issuecomment-2449690082

   @pengding-stripe , I believe the reason both exist is that the batchConfigMap can also come from the tableConfig, and jobs should consider both sources (the flink-connector already does this). Looking at the segmentGenerationJob, it reads the configs from the spec only, and it currently does not read the configs needed for generating uploadedRealtime segments.
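
   As a rough sketch of what "consider both" could look like (the class and method names below are hypothetical, not the actual segmentGenerationJob API): overlay the batchConfigMap entries from the job spec on top of the ones from the tableConfig, so the spec can override table-level defaults.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolve the effective batch config by overlaying the
// job-spec entries on top of the tableConfig entries.
public final class BatchConfigResolver {
  private BatchConfigResolver() {
  }

  public static Map<String, String> resolve(Map<String, String> fromTableConfig,
      Map<String, String> fromJobSpec) {
    Map<String, String> merged = new HashMap<>();
    if (fromTableConfig != null) {
      merged.putAll(fromTableConfig);
    }
    if (fromJobSpec != null) {
      // Job-spec values win on conflicts; this precedence order is itself an assumption.
      merged.putAll(fromJobSpec);
    }
    return merged;
  }
}
```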
   
   To support this, the following is needed:
   
   - **Creation time**: Usually this can be currentTimeMs(), but some use cases may want a more deterministic value, e.g. the upload time.
   
   - **Prefix**: Can be anything.
   - **Suffix**: Generally best kept as the sequence id; the current Spark jobs already assign an index based on the files in a directory, and that can be reused.
   - **PartitionId**: For an append-only table it does not matter, but we should spread partition ids out as much as possible to avoid data skew within a partition. For upsert tables it must be consistent with the stream's partitioning of the primary keys. I don't think the Spark jobs are currently implemented to generate partitioned segments in a way that lets upserts work on uploaded segments; I'll cover this as part of #12987. See the naming/partitioning sketch after this list.
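
   A minimal sketch of how those four pieces could be stitched together; the `__` delimiter, the field order, and the key-hash partitioner below are assumptions for illustration only. The authoritative layout is defined by Pinot's UploadedRealtimeSegmentName, and for upsert tables the partition function must match whatever the stream producer actually uses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: composes an uploaded realtime segment name from the
// components discussed above. The "__" delimiter and field order are assumptions;
// check UploadedRealtimeSegmentName in Pinot for the authoritative format.
public final class UploadedSegmentNameSketch {
  private static final String DELIMITER = "__";

  private UploadedSegmentNameSketch() {
  }

  static String buildName(String prefix, String tableName, int partitionId, long creationTimeMs,
      String suffix) {
    // e.g. uploaded__myTable__3__1700000000000__0
    return String.join(DELIMITER, prefix, tableName, Integer.toString(partitionId),
        Long.toString(creationTimeMs), suffix);
  }

  // Placeholder partitioning: for upsert tables this MUST reproduce the stream's
  // partition function over the primary key (e.g. the producer's partitioner),
  // otherwise uploaded segments will not line up with the realtime partitions.
  static int partitionIdForKey(String primaryKey, int numPartitions) {
    byte[] keyBytes = primaryKey.getBytes(StandardCharsets.UTF_8);
    int hash = 0;
    for (byte b : keyBytes) {
      hash = 31 * hash + b;
    }
    return Math.floorMod(hash, numPartitions);
  }

  public static void main(String[] args) {
    int partitionId = partitionIdForKey("user-42", 8);
    System.out.println(buildName("uploaded", "myTable", partitionId, System.currentTimeMillis(), "0"));
  }
}
```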
   
   
   For non-upsert realtime use cases, I'll raise a PR to support generating uploadedRealtime segments conforming to UploadedRealtimeSegmentName.
   
   
   

