amogh-jahagirdar commented on issue #14043: URL: https://github.com/apache/iceberg/issues/14043#issuecomment-3315144693
I poked into this a bit more, and I take back what I said about the produced projection being incorrect. I do think it's expected that after `PruneColumns` is invoked, it only returns the `id` column for this file.

I think the real issue is a mismatch between the internal page readers and the [model created](https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java#L79) for reading into Spark's internal row in this case, where the column is a `map<some_key, some_value_struct>` and a projected field does not exist in the file. The internal page readers will just project the `id` column, based on pruning for that particular file. The Spark model produces an InternalRowReader for the `id` column and then a map reader, where the underlying struct field reader in the value of the map is a null reader. Then, when setting the page source on the model, the page reader for `id` expectedly cannot be set on the reader that is trying to read the whole map.

I think if a nested struct field in a map is projected and does not exist in the file, we should ideally create the null or default value reader instead of creating the whole map reader, but I need to check this further.
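
To make the suspected mismatch concrete, here is a toy sketch in plain Java. It does not use any Iceberg or Parquet classes: `Reader`, `canBind`, the two `modelWith...` helpers, and the column paths `id` and `m.key` are all hypothetical names made up for illustration. It only models the idea that each reader in the model needs certain file columns, and that binding fails when the pruned file projection does not supply them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types are Iceberg or Parquet classes.
class ReaderModelSketch {

  // A reader declares which file columns it must be bound to.
  interface Reader {
    List<String> requiredColumns();
  }

  // Reads a single column that is expected to exist in the file.
  static Reader column(String path) {
    return () -> Collections.singletonList(path);
  }

  // Produces nulls and needs nothing from the file.
  static Reader nullReader() {
    return () -> Collections.<String>emptyList();
  }

  // A map reader needs its key column plus whatever the value reader needs.
  static Reader map(String keyPath, Reader value) {
    return () -> {
      List<String> required = new ArrayList<>();
      required.add(keyPath);
      required.addAll(value.requiredColumns());
      return required;
    };
  }

  // Binding succeeds only if the file projection supplies every required column.
  static boolean canBind(List<Reader> model, Set<String> fileColumns) {
    for (Reader reader : model) {
      if (!fileColumns.containsAll(reader.requiredColumns())) {
        return false;
      }
    }
    return true;
  }

  // Model as described above: id reader plus a whole map reader with a null value reader.
  static List<Reader> modelWithMapReader() {
    return Arrays.asList(column("id"), map("m.key", nullReader()));
  }

  // Possible alternative: replace the whole map reader with a null reader.
  static List<Reader> modelWithNullReader() {
    return Arrays.asList(column("id"), nullReader());
  }

  public static void main(String[] args) {
    // After pruning, the file projection only contains the id column.
    Set<String> prunedFileColumns = new HashSet<>(Collections.singletonList("id"));
    System.out.println(canBind(modelWithMapReader(), prunedFileColumns));  // false
    System.out.println(canBind(modelWithNullReader(), prunedFileColumns)); // true
  }
}
```

In this toy version the first model fails to bind because the map reader still asks for a map column that the pruned projection no longer contains, while the second binds because the null reader asks for nothing. Whether the real readers can be swapped this way is exactly the part I still need to check.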
