swaminathanmanish opened a new pull request, #10874:
URL: https://github.com/apache/pinot/pull/10874

   **Problem**:
   RecordReaders iterate over the source/input files in order to ingest data 
and create segments. Although we read one row at a time from a file, some 
readers (such as ParquetRecordReader) allocate a row group (a collection of 
rows) for better read throughput when reading Parquet files. This uses up 
heap memory. The SegmentProcessorFramework takes in N RecordReaders. Users of 
this framework allocate the N RecordReaders through the getRecordReader 
factory, which also initializes each reader. Depending on how many readers 
are created, this eager allocation/initialization can exhaust the heap.
   
   **Solution**:
   Provide the flexibility to pass the information required to initialize and 
clean up a record reader to the mapper, where the reader is actually used. 
This ensures that a reader holds memory only while the mapper is iterating 
over it, so memory is no longer allocated eagerly.
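
   To illustrate the idea, here is a minimal, self-contained sketch of 
deferred reader initialization. It is not the code in this PR and does not 
use Pinot's actual classes: `BufferingReader`, `OpenTracker`, and `map` are 
hypothetical names, and the in-memory list only stands in for a row-group 
buffer. The mapper receives suppliers instead of initialized readers, so at 
most one reader holds its buffer at any time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; none of these types are Pinot APIs.
class LazyReaderSketch {

  /** Counts how many readers hold a buffer right now, and the peak seen. */
  static final class OpenTracker {
    int open;
    int peak;

    void opened() {
      open++;
      peak = Math.max(peak, open);
    }

    void closed() {
      open--;
    }
  }

  /** Stand-in for a reader that allocates a buffer when it is initialized. */
  static final class BufferingReader implements AutoCloseable {
    private final Iterator<String> rows;
    private final OpenTracker tracker;

    BufferingReader(List<String> source, OpenTracker tracker) {
      // Copying the rows simulates allocating a row-group buffer on init.
      this.rows = new ArrayList<>(source).iterator();
      this.tracker = tracker;
      tracker.opened();
    }

    boolean hasNext() {
      return rows.hasNext();
    }

    String next() {
      return rows.next();
    }

    @Override
    public void close() {
      tracker.closed();
    }
  }

  /** Mapper: initializes each reader only when iterating it, then cleans up. */
  static List<String> map(List<Supplier<BufferingReader>> readerSuppliers) {
    List<String> out = new ArrayList<>();
    for (Supplier<BufferingReader> supplier : readerSuppliers) {
      try (BufferingReader reader = supplier.get()) {
        while (reader.hasNext()) {
          out.add(reader.next());
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    OpenTracker tracker = new OpenTracker();
    List<Supplier<BufferingReader>> suppliers = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      final int fileId = i;
      // Nothing is allocated here; the lambda runs only inside map().
      suppliers.add(() -> new BufferingReader(List.of("row-" + fileId), tracker));
    }
    List<String> rows = map(suppliers);
    System.out.println("rows=" + rows + " peakOpenReaders=" + tracker.peak);
  }
}
```

   With eager creation, all N readers would hold their buffers before the 
first row is mapped; with suppliers, the peak in this sketch stays at one 
reader regardless of N.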
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

