steveloughran commented on PR #14297:
URL: https://github.com/apache/iceberg/pull/14297#issuecomment-4157395184

   FYI, I've been benchmarking Parquet row scans with variants; the latest
   results are up at
   https://github.com/apache/parquet-java/pull/3452#issuecomment-4157307880
   
   As Copilot summarizes:
   
   * Lean schema + shredded = 45% faster than reading all columns. Skipping
     idstr, varid, and col4 typed columns saves ~590ms per 1M rows.
   * Lean schema + unshredded = 93% slower. The lean schema requests
     typed_value.varcategory which does not exist in the unshredded file.
     Parquet handles the missing columns at every row, which is more expensive
     than reading the single binary blob directly.
   * Schema detection in ReadSupport.init() is essential. Applying
     containsField("typed_value") to choose between lean and full schema
     prevents the unshredded penalty while preserving the shredded speedup.
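
To make the third point concrete, here is a minimal sketch of the decision itself, written against plain Java collections rather than the parquet-java types so that it runs standalone. `VariantSchemaChooser`, `Projection`, and `choose` are hypothetical names for illustration only; in a real ReadSupport.init() the check would presumably be made on the variant group taken from the file schema, which is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VariantSchemaChooser {

    /** Which projection to request from the reader (hypothetical names). */
    public enum Projection { LEAN, FULL }

    /**
     * Decide the projection from the child field names of the variant group
     * as found in the file being read. A shredded file carries a
     * "typed_value" child; an unshredded one does not.
     */
    public static Projection choose(Set<String> variantGroupFields) {
        // Only ask for typed columns when the file actually has them;
        // otherwise fall back to reading the full (binary) variant.
        return variantGroupFields.contains("typed_value")
                ? Projection.LEAN
                : Projection.FULL;
    }

    public static void main(String[] args) {
        Set<String> shredded =
                new HashSet<>(Arrays.asList("metadata", "value", "typed_value"));
        Set<String> unshredded =
                new HashSet<>(Arrays.asList("metadata", "value"));

        System.out.println("shredded   -> " + choose(shredded));
        System.out.println("unshredded -> " + choose(unshredded));
    }
}
```

The point of the sketch is that the choice is made once per file, from the file's own schema, so an unshredded file never gets asked for typed_value columns it does not have.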
   
   -----
   
   What this means is that automatic shredding is critical and, equally
   critical, the query engine mustn't try to read files with a shredded schema
   unless they actually contain shredded data. That's possibly what's behind
   the perf numbers I'm seeing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

