chenjunjiedada commented on PR #6026: URL: https://github.com/apache/iceberg/pull/6026#issuecomment-1287561773
> > I'm a bit confused by this behavior: `ReadConf.startRowPositions` is valid only if the `_pos` column exists in the `expectedSchema`, due to #1716. Are there use cases where `_pos` is absent and we still need `ReadConf.startRowPositions`? Looking at `VectorizedParquetReader` and `ParquetReader`, which consume `ReadConf.startRowPositions`, it seems likely the schema doesn't have `_pos`. cc @chenjunjiedada @aokolnychyi

The row group start positions are always computed, but right now they are only correct when `_pos` is projected. That's intended, because we don't want to read the Parquet footer one more time. But since the footer must be read at least once, we should be able to cache some of its content during the first access. That would let us drop the current optimization logic and thus simplify the logic that checks for the `_pos` column.
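
To make the caching idea concrete, here is a minimal sketch of how the start row position of each row group could be derived from values that are already available once the footer has been read. This is not Iceberg's actual code: the class `RowGroupPositions`, the method `startPositionsByOffset`, and the two input arrays are hypothetical stand-ins for the per-row-group file offset and row count found in the footer metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Iceberg or Parquet.
final class RowGroupPositions {

  // Maps each row group's starting file offset to the position of its first row
  // in the file. The start position of a row group is the sum of the row counts
  // of all row groups that precede it.
  static Map<Long, Long> startPositionsByOffset(long[] rowGroupOffsets, long[] rowGroupRowCounts) {
    if (rowGroupOffsets.length != rowGroupRowCounts.length) {
      throw new IllegalArgumentException("offsets and row counts must have the same length");
    }
    Map<Long, Long> startPositions = new LinkedHashMap<>();
    long nextStart = 0L;
    for (int i = 0; i < rowGroupOffsets.length; i++) {
      startPositions.put(rowGroupOffsets[i], nextStart);
      nextStart += rowGroupRowCounts[i];
    }
    return startPositions;
  }

  public static void main(String[] args) {
    // Assumed example values: three row groups with 100, 250, and 75 rows.
    long[] offsets = {4L, 1000L, 2500L};
    long[] rowCounts = {100L, 250L, 75L};
    System.out.println(startPositionsByOffset(offsets, rowCounts));
    // prints {4=0, 1000=100, 2500=350}
  }
}
```

If a map like this were built during the first footer read and kept with the read configuration, the positions would be correct whether or not `_pos` is projected, and no second footer read would be needed.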