jackjlli opened a new pull request #6479:
URL: https://github.com/apache/incubator-pinot/pull/6479


   ## Description
   Currently, generating an offline segment requires traversing the raw data
   twice: once to gather stats in `RecordReaderSegmentCreationDataSource`, and
   once more to ingest the actual data.
   
   This PR introduces a new way of ingesting offline data that traverses the
   raw data only once. Similar to the mutable realtime segment, a simplified
   class called `IntermediateSegment` is initialized to gather all the data
   into a dictionary and a forward index. After all the records are ingested,
   an `IntermediateSegmentRecordReader` is passed into
   `SegmentIndexCreationDriver` along with the `SegmentGeneratorConfig`, and
   the final offline segment is built like this:
   ```
       // Build the segment from intermediate segment.
       SegmentIndexCreationDriverImpl driver = new SegmentIndexCreationDriverImpl();
       driver.init(segmentGeneratorConfig, intermediateSegmentRecordReader);
       driver.build();
   ```
   
   This mechanism greatly reduces memory pressure in some environments. For
   example, if Pinot segments need to be generated directly from Spark
   executors, where the data is loaded as an iterator, the raw data does not
   have to be loaded into memory twice.
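
To make the single-pass idea concrete, here is a minimal, self-contained sketch. It is not the code in this PR: `SinglePassSketch`, `ToyColumn`, `ingestOnce`, and `replay` are hypothetical names invented for illustration, and only a single string column is modeled. It assumes the input is a one-shot iterator, builds a dictionary and a forward index while consuming it, and then replays the records from those structures instead of reading the raw data again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT Pinot's IntermediateSegment.
final class SinglePassSketch {

  /** One string column stored as a dictionary plus a forward index of dictionary ids. */
  static final class ToyColumn {
    private final Map<String, Integer> valueToId = new HashMap<>();
    private final List<String> idToValue = new ArrayList<>();
    private final List<Integer> forwardIndex = new ArrayList<>();

    /** Adds one record: assigns a dictionary id on first sight, then appends the id. */
    void add(String value) {
      Integer id = valueToId.get(value);
      if (id == null) {
        id = idToValue.size();
        valueToId.put(value, id);
        idToValue.add(value);
      }
      forwardIndex.add(id);
    }

    int numDocs() {
      return forwardIndex.size();
    }

    int cardinality() {
      return idToValue.size();
    }

    /** Replays the ingested values in order, without touching the raw input again. */
    Iterator<String> replay() {
      Iterator<Integer> ids = forwardIndex.iterator();
      return new Iterator<String>() {
        @Override
        public boolean hasNext() {
          return ids.hasNext();
        }

        @Override
        public String next() {
          return idToValue.get(ids.next());
        }
      };
    }
  }

  /** Consumes the raw iterator exactly once. */
  static ToyColumn ingestOnce(Iterator<String> rawValues) {
    ToyColumn column = new ToyColumn();
    while (rawValues.hasNext()) {
      column.add(rawValues.next());
    }
    return column;
  }

  public static void main(String[] args) {
    // A one-shot iterator stands in for data streamed to an executor.
    Iterator<String> raw = List.of("us", "de", "us", "fr", "de").iterator();
    ToyColumn column = ingestOnce(raw);

    // Stats are available after the single pass: docs=5 cardinality=3
    System.out.println("docs=" + column.numDocs() + " cardinality=" + column.cardinality());

    // The second "pass" reads from the in-memory structures, not from the raw data.
    column.replay().forEachRemaining(System.out::println);
  }
}
```

In this toy version the statistics that would otherwise need a separate pass (document count, cardinality) fall out of the structures built during ingestion, which is the property the description relies on.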
   
   
   ## Upgrade Notes
   Does this PR prevent a zero down-time upgrade? (Assume upgrade order: 
Controller, Broker, Server, Minion)
   * [ ] Yes (Please label as **<code>backward-incompat</code>**, and complete 
the section below on Release Notes)
   
   Does this PR fix a zero-downtime upgrade introduced earlier?
   * [ ] Yes (Please label this as **<code>backward-incompat</code>**, and 
complete the section below on Release Notes)
   
   Does this PR otherwise need attention when creating release notes? Things to 
consider:
   - New configuration options
   - Deprecation of configurations
   - Signature changes to public methods/interfaces
   - New plugins added or old plugins removed
   * [ ] Yes (Please label this PR as **<code>release-notes</code>** and 
complete the section on Release Notes)
   ## Release Notes
   If you have tagged this as either backward-incompat or release-notes,
   you MUST add text here that you would like to see appear in release notes of 
the
   next release.
   
   If you have a series of commits adding or enabling a feature, then
   add this section only in final commit that marks the feature completed.
   Refer to earlier release notes to see examples of text
   
   ## Documentation
   If you have introduced a new feature or configuration, please add it to the 
documentation as well.
   See 
https://docs.pinot.apache.org/developers/developers-and-contributors/update-document
   





