steveloughran commented on PR #14297: URL: https://github.com/apache/iceberg/pull/14297#issuecomment-4157395184
FYI, I've been benchmarking Parquet row scans with variants; the latest results are up at https://github.com/apache/parquet-java/pull/3452#issuecomment-4157307880

As Copilot says:

* Lean schema + shredded = 45% faster than reading all columns. Skipping the `idstr`, `varid`, and `col4` typed columns saves ~590ms per 1M rows.
* Lean schema + unshredded = 93% slower. The lean schema requests `typed_value.varcategory`, which does not exist in the unshredded file. Parquet handles the missing columns at every row, which is more expensive than reading the single binary blob directly.
* Schema detection in `ReadSupport.init()` is essential. Applying `containsField("typed_value")` to choose between lean and full schema prevents the unshredded penalty while preserving the shredded speedup.

-----

What that means is that automatic shredding is critical and, equally critical, the query engine mustn't try to read files with a shredded schema unless there's shredded data. That's possibly what's been leading to the perf numbers I'm seeing.
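A minimal sketch of the detection step described above, not the code from either PR. To stay self-contained it models the file schema with a hypothetical `Field` class rather than parquet-java's own schema types, and the names `VariantSchemaChooser`, `Projection`, and `choose` are invented for illustration. The only idea taken from the comment is the check itself: request the lean (shredded) projection only when the variant group in the file actually contains a `typed_value` field.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: Field is a hypothetical stand-in for a Parquet group type,
// not a parquet-java class.
final class VariantSchemaChooser {

    /** A named schema node with zero or more child fields. */
    static final class Field {
        final String name;
        final List<Field> children;

        private Field(String name, List<Field> children) {
            this.name = name;
            this.children = children;
        }

        static Field leaf(String name) {
            return new Field(name, Arrays.asList());
        }

        static Field group(String name, Field... children) {
            return new Field(name, Arrays.asList(children));
        }

        /** True if a direct child with this name exists. */
        boolean containsField(String fieldName) {
            for (Field child : children) {
                if (child.name.equals(fieldName)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Which read schema to request for a given file. */
    enum Projection { LEAN, FULL }

    /**
     * Pick the lean projection only when the file's variant group is shredded,
     * i.e. it has a typed_value child. Unshredded files get the full schema,
     * so the reader never asks for columns that are missing from the file.
     */
    static Projection choose(Field variantGroupInFile) {
        return variantGroupInFile.containsField("typed_value")
                ? Projection.LEAN
                : Projection.FULL;
    }
}
```

In a real reader this decision would sit where the comment places it, in `ReadSupport.init()`, using the schema of the file being opened. The point of the sketch is that the choice is made once per file, not per row.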
