rustyconover commented on PR #8625:
URL: https://github.com/apache/iceberg/pull/8625#issuecomment-1858921063

   Hello @aokolnychyi and @Fokko,
   
   >> Question. Aren't we using DataFileWriter from Avro in our 
AvroFileAppender? If so, how is this PR affecting it? Won't we still use direct 
encoders there?
   
   > This is a good question. The goal of this PR is to write the block sizes 
for the manifests. @rustyconover any thoughts on this?
   
   I am particularly interested in the explicit inclusion of block sizes in 
manifest files. Currently, PyIceberg must deserialize three maps (upper bounds, 
lower bounds, and null value counts) even when the query planner may not need 
that information. If block sizes are explicitly written, I plan to modify the 
PyIceberg Avro decoder to take a "lazy" approach: copy arrays and maps as raw 
byte buffers and defer decoding until client code actually accesses them. If 
the code never accesses a map, the deserialization code never runs.
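   
   For illustration, here is a minimal sketch of what that lazy path could look 
like (pure Python; `read_long`, `buffer_map`, and `LazyMap` are hypothetical 
names for this sketch, not PyIceberg's actual decoder API). It relies on the 
Avro spec's rule that a negative block count is followed by the block's size in 
bytes:
   
   ```python
   from typing import Callable, Optional, Tuple

   def read_long(buf: bytes, pos: int) -> Tuple[int, int]:
       """Decode one Avro zig-zag varint; return (value, new position)."""
       n = shift = 0
       while True:
           b = buf[pos]
           pos += 1
           n |= (b & 0x7F) << shift
           if not b & 0x80:
               break
           shift += 7
       return (n >> 1) ^ -(n & 1), pos

   def buffer_map(buf: bytes, pos: int) -> Tuple[bytes, int]:
       """Slice out a map's raw bytes without decoding its entries.

       Works only when the writer emitted block sizes: per the Avro spec,
       a negative block count means the next long is the block's size in
       bytes, so the reader can jump over the block without decoding it.
       """
       start = pos
       while True:
           count, pos = read_long(buf, pos)
           if count == 0:                       # end-of-map marker
               return buf[start:pos], pos
           if count < 0:                        # size follows; skip the block
               size, pos = read_long(buf, pos)
               pos += size
           else:                                # no size written: cannot skip
               raise ValueError("block size missing; must decode eagerly")

   class LazyMap:
       """Defers decoding until client code actually asks for the map."""
       def __init__(self, raw: bytes, decode: Callable[[bytes], dict]):
           self._raw = raw
           self._decode = decode
           self._value: Optional[dict] = None

       def get(self) -> dict:
           if self._value is None:              # decode at most once, on demand
               self._value = self._decode(self._raw)
           return self._value
   ```
   
   If the planner never touches, say, the bounds map, `LazyMap.get()` is never 
called and the decode cost is never paid.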
   
   This optimization could significantly reduce scan planning time, especially 
for tables with many files and many columns. For instance, in a table with 
1,250,000 files, copying bytes in memory for later decoding is much faster than 
eagerly deserializing the Avro structures.
   
   Currently the PyIceberg Avro code does not perform lazy decoding; it simply 
decodes everything. That is a consequence of how the Java code serializes Avro: 
it encodes directly, without buffering, so the byte length cannot be included. 
I can prepare a PR to do this.
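   
   On the writer side, the change amounts to buffering each block before 
emitting it, so its byte length is known up front. Roughly (a sketch of the 
Avro map encoding only, not the actual Java change in this PR; `encode_entry` 
is a hypothetical per-entry encoder):
   
   ```python
   def write_long(n: int) -> bytes:
       """Encode a long as an Avro zig-zag varint."""
       n = (n << 1) ^ (n >> 63)                 # zig-zag (arithmetic shift)
       out = bytearray()
       while n & ~0x7F:
           out.append((n & 0x7F) | 0x80)
           n >>= 7
       out.append(n)
       return bytes(out)

   def encode_map_with_block_sizes(entries: dict, encode_entry) -> bytes:
       """Buffer the block first, then emit a NEGATIVE count followed by
       the block's byte size (per the Avro spec), so readers can skip it."""
       block = bytearray()
       for key, value in entries.items():
           block += encode_entry(key, value)    # hypothetical per-entry encoder
       out = bytearray()
       if entries:
           out += write_long(-len(entries))     # negative count: size follows
           out += write_long(len(block))        # block size in bytes
           out += block
       out += write_long(0)                     # end-of-map marker
       return bytes(out)
   ```
   
   A direct (streaming) encoder writes the count before the entries, at which 
point the encoded byte length is not yet known; buffering is what makes the 
size field possible.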
   
   Rusty



