Re: [PR] Exclude reading _pos column if it's not in the scan list [iceberg]

via GitHub Thu, 31 Oct 2024 21:08:07 -0700


huaxingao commented on PR #11390:
URL: https://github.com/apache/iceberg/pull/11390#issuecomment-2451241576


   @szehon-ho Thanks for the comment.
   
   We actually also use the 
[requiredSchema](https://github.com/apache/iceberg/blob/fda2b3a5706fd580b0371e8a7c4b31d536eac0a3/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java#L90),
 which is the schema that includes the `_pos` column. In 
ReadConf#[generateOffsetToStartPos](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L185),
 we need to know whether position deletes exist. 
   
   We can pass a flag to `SparkDeleteFilter` so that it does not add the `_pos` 
column, but then I think we need to add another flag to pass the hasPosDelete 
info to the Parquet `ReaderBuilder`, and then pass it on to `ReadConf`. 
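
   A rough sketch of the plumbing described above, for illustration only. Every 
name in it (`ReaderBuilderSketch`, `ReadConfSketch`, `hasPosDeletes`, 
`needsRowPositions`) is a hypothetical stand-in and not the actual Iceberg API. 
It only shows that once `_pos` is no longer in the schema, the 
has-position-deletes information has to travel through the builder separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosDeleteFlagSketch {

  /** Hypothetical stand-in for the Parquet read configuration. */
  static final class ReadConfSketch {
    private final boolean needsRowPositions;

    ReadConfSketch(List<String> projectedColumns, boolean hasPosDeletes) {
      // Row positions are required if _pos is projected explicitly,
      // or if position deletes have to be applied to the rows read.
      this.needsRowPositions = projectedColumns.contains("_pos") || hasPosDeletes;
    }

    boolean needsRowPositions() {
      return needsRowPositions;
    }
  }

  /** Hypothetical stand-in for the Parquet reader builder. */
  static final class ReaderBuilderSketch {
    private final List<String> columns = new ArrayList<>();
    private boolean hasPosDeletes = false;

    ReaderBuilderSketch project(String... names) {
      columns.addAll(Arrays.asList(names));
      return this;
    }

    // The extra flag: carries the delete info that the schema no longer implies.
    ReaderBuilderSketch hasPosDeletes(boolean value) {
      this.hasPosDeletes = value;
      return this;
    }

    ReadConfSketch build() {
      return new ReadConfSketch(new ArrayList<>(columns), hasPosDeletes);
    }
  }

  public static void main(String[] args) {
    // _pos is not in the scan list, but position deletes exist for the file.
    ReadConfSketch conf =
        new ReaderBuilderSketch().project("id", "data").hasPosDeletes(true).build();
    System.out.println("needs row positions: " + conf.needsRowPositions());
  }
}
```

   Without the second flag, a scan that does not project `_pos` would give the 
read configuration no way to tell that row positions are still needed.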
   
   ORC uses 
[expectedSchema()](https://github.com/apache/iceberg/blob/fda2b3a5706fd580b0371e8a7c4b31d536eac0a3/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java#L125),
 the schema without the `_pos` column, to build vectorized readers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


