jackjlli commented on pull request #6479: URL: https://github.com/apache/incubator-pinot/pull/6479#issuecomment-767180093
> I think this is still two pass, right? The second pass is on IntermediateSegment.

> Could you elaborate a bit on where's the memory pressure coming from in the existing implementation? And how does this approach solve it (IntermediateSegment) does need its own storage.

This is still two passes, but only one pass over the raw data. For example, in a Spark job where the data is row-based (such as `Dataset<Row>`, `RDD<Row>`, `Dataset<T>`, or `RDD<T>`), traversing the records requires calling `foreachPartition(iterator)`. The iterator can provide the data only once, so if we stick to the existing segment generation code, the raw data has to be cached in the same executor in order to traverse it twice.

The possible solutions are:

1) Cache all the raw data in executor memory so the same data can be reused (using the existing segment generation code).
2) Traverse the raw data once to gather the intermediate stats, then create the offline segment from the intermediate results (current PR).
3) Call `foreachPartition(iterator)` twice: gather stats into an extra DF/RDD in the first iteration, then ingest the actual raw data based on that DF/RDD in the second (a code refactor is still needed, and one extra DF/RDD is involved).

Approach 1) needs to load all the raw data into a single executor before initializing the `SegmentIndexCreationDriver` (the raw data traversal is encapsulated in that class). Approaches 2) and 3) share the same idea of splitting the current code into two parts; the difference is that 2) works on a stats collector, while 3) works on the raw data (one extra data traversal).
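To make approach 2) more concrete, here is a minimal Java sketch of the idea, not the code in this PR. It consumes a one-shot iterator exactly once, gathers per-column stats while buffering the rows, and then runs the second pass over the buffered result instead of the raw input. All names (`SinglePassSketch`, `ColumnStats`, `Intermediate`, `collect`, `toDictIds`) are hypothetical and are not Pinot or Spark APIs; the in-memory list is only a stand-in for whatever storage the intermediate segment actually uses, and every row is assumed to contain the requested column.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch; none of these names are Pinot or Spark APIs. */
final class SinglePassSketch {

  /** Per-column stats gathered during the single pass over the raw data. */
  static final class ColumnStats {
    final TreeSet<String> distinctValues = new TreeSet<>();
    int totalDocs;

    void collect(Object value) {
      distinctValues.add(String.valueOf(value));
      totalDocs++;
    }

    int cardinality() {
      return distinctValues.size();
    }
  }

  /** Stand-in for the intermediate result: stats plus the buffered rows. */
  static final class Intermediate {
    final Map<String, ColumnStats> stats = new HashMap<>();
    final List<Map<String, Object>> rows = new ArrayList<>();
  }

  /** Pass 1: consume the one-shot iterator exactly once. */
  static Intermediate collect(Iterator<Map<String, Object>> rawRows) {
    Intermediate result = new Intermediate();
    while (rawRows.hasNext()) {
      Map<String, Object> row = rawRows.next();
      for (Map.Entry<String, Object> entry : row.entrySet()) {
        result.stats
            .computeIfAbsent(entry.getKey(), k -> new ColumnStats())
            .collect(entry.getValue());
      }
      result.rows.add(row);
    }
    return result;
  }

  /** Pass 2: runs over the intermediate result, never over the raw iterator. */
  static List<Integer> toDictIds(Intermediate intermediate, String column) {
    // The sorted distinct values act as a simple dictionary for the column.
    List<String> dictionary =
        new ArrayList<>(intermediate.stats.get(column).distinctValues);
    List<Integer> dictIds = new ArrayList<>();
    for (Map<String, Object> row : intermediate.rows) {
      dictIds.add(dictionary.indexOf(String.valueOf(row.get(column))));
    }
    return dictIds;
  }
}
```

In a Spark job, `collect` would be called with the iterator handed to the partition function, so the raw partition is read only once; the stats needed up front (cardinality, sorted distinct values) are already available when the second pass starts.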