jackjlli edited a comment on pull request #6479:
URL: https://github.com/apache/incubator-pinot/pull/6479#issuecomment-767180093


   > I think this is still two passes, right? The second pass is on IntermediateSegment.
   > Could you elaborate a bit on where the memory pressure comes from in the existing implementation, and how this approach solves it (IntermediateSegment does need its own storage)?
   
   This is still two passes, but only one pass over the raw data.
   E.g. in a Spark job where the data is row-based (like Dataset\<Row\>, RDD\<Row\>, Dataset\<T\>, RDD\<T\>), the records have to be traversed by calling the API forEachPartition(iterator). Since the iterator can only provide the data once, the raw data has to be cached in the same executor in order to traverse it twice if we stick to the existing segment generation code.
   
   Below are the possible solutions:
   1) Cache all the raw data in executor memory in order to reuse the same data (using the existing segment generation code).
   2) Traverse the raw data once to gather the intermediate stats, and create the offline segment based on the intermediate results (this PR; see the sketch after this list).
   3) Call forEachPartition(iterator) twice: gather stats into an extra DF/RDD in the 1st iteration, and ingest the actual raw data based on that DF/RDD in the 2nd round (a code refactor is still needed, and one extra DF/RDD is involved).
   
   Approach 1) needs to load all the raw data into a single executor before initializing the `SegmentIndexCreationDriver` (the raw data traversal is encapsulated in that class). Approaches 2) and 3) share the same basic idea of splitting the current code into two parts; the difference is that 2) works on a stats collector while 3) works on the raw data (one extra data traversal).
   Approach 2) works well if the cardinality of the columns is not very high (just like the mutable segment for realtime segments), while it can consume a similar amount of memory if the cardinality is high (see the rough estimate below).
   Approach 3) works well if the data is cached in the same executor, but the cached raw data itself can still be costly since the deserialized data is stored in memory.
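   To make the cardinality point concrete, here is a rough back-of-the-envelope estimate; the row count, value size and cardinalities are illustrative assumptions, not measurements:
   
   ```java
   public class DictionaryMemoryEstimate {
     public static void main(String[] args) {
       // Illustrative assumptions: one string column with ~20-byte average values.
       long numRows = 10_000_000L;
       int avgValueBytes = 20;
   
       // Low cardinality: the dictionary stays small regardless of the row count.
       long lowCardinality = 1_000L;
       System.out.println("low-cardinality dictionary ~ "
           + lowCardinality * avgValueBytes / 1024 + " KB");
   
       // Cardinality close to the row count: the dictionary approaches the size of
       // the raw column itself, so holding the intermediate segment in memory costs
       // roughly as much as caching the raw data would.
       long highCardinality = numRows;
       System.out.println("high-cardinality dictionary ~ "
           + highCardinality * avgValueBytes / (1024 * 1024) + " MB");
     }
   }
   ```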
   
   One further optimization for this PR is to reuse the dictionaries and forward indexes stored in the IntermediateSegment and pass them to the immutable segment.
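   A hedged sketch of what that reuse could look like is below; the interface and method names are hypothetical, not the actual API introduced by this PR:
   
   ```java
   // Hypothetical view of the intermediate segment, for illustration only.
   interface IntermediateSegmentView {
     Object getDictionary(String column);     // already-built dictionary for the column
     Object getForwardIndex(String column);   // already-built forward index for the column
     Iterable<String> getColumns();
   }
   
   class ImmutableSegmentConverterSketch {
     void convert(IntermediateSegmentView intermediate, String outputDir) {
       // Instead of re-reading the raw data, hand the existing dictionaries and
       // forward indexes over to the immutable segment writer, column by column.
       for (String column : intermediate.getColumns()) {
         Object dictionary = intermediate.getDictionary(column);
         Object forwardIndex = intermediate.getForwardIndex(column);
         // write the dictionary and forward index into the offline segment format
       }
     }
   }
   ```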

