rustyconover opened a new pull request, #8625:
URL: https://github.com/apache/iceberg/pull/8625

   When writing Avro files, Iceberg frequently writes arrays and maps. The 
current use of `binaryEncoder()` and `directBinaryEncoder()` from 
`org.apache.avro.io.EncoderFactory` does not write the byte lengths of arrays 
or maps, because those encoders do not buffer their output and therefore 
cannot calculate a length.
   
   Knowing the byte length of an array or map is useful to clients decoding the 
Avro file, since they can skip over an entire array or map that is not needed 
while reading.  This PR changes all Avro writers to use 
`blockingBinaryEncoder()`. Despite its name, this encoder does not "block" in 
the concurrency sense; it buffers the encoded output of each object so that 
the byte lengths of arrays and maps can be calculated and written.
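   As a minimal standalone sketch of the difference (not Iceberg's actual writer code; only Avro's public `EncoderFactory` and `Encoder` APIs are used), swapping the encoder is all that is needed for array and map blocks to carry a byte length:
   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import org.apache.avro.io.BinaryEncoder;
   import org.apache.avro.io.EncoderFactory;
   
   public class BlockingEncoderSketch {
     public static void main(String[] args) throws IOException {
       ByteArrayOutputStream out = new ByteArrayOutputStream();
   
       // blockingBinaryEncoder() buffers each array/map block so it can prefix
       // the block with its byte size (a negative item count followed by the
       // size, per the Avro spec). binaryEncoder()/directBinaryEncoder() write
       // only the item count.
       BinaryEncoder encoder = EncoderFactory.get().blockingBinaryEncoder(out, null);
   
       // Write a single-block array of three longs.
       encoder.writeArrayStart();
       encoder.setItemCount(3);
       for (long v : new long[] {1L, 2L, 3L}) {
         encoder.startItem();
         encoder.writeLong(v);
       }
       encoder.writeArrayEnd();
       encoder.flush();
   
       System.out.println("encoded bytes: " + out.size());
     }
   }
   ```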
   
   Having the byte lengths of maps and arrays written will significantly speed 
up Python decoding of Avro files for tables with many columns.
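   On the read side, a decoder can use those block byte sizes to seek past an 
unneeded array instead of decoding every item. A hedged Java sketch of that 
mechanism (the PR's benefit is measured for the Python reader, but the same 
idea applies; only Avro's public `Decoder` API is used):
   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import org.apache.avro.io.BinaryDecoder;
   import org.apache.avro.io.BinaryEncoder;
   import org.apache.avro.io.DecoderFactory;
   import org.apache.avro.io.EncoderFactory;
   
   public class SkipArraySketch {
     public static void main(String[] args) throws IOException {
       // Encode an array of 1000 longs followed by one more long, using the
       // blocking encoder so each array block carries its byte size.
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       BinaryEncoder enc = EncoderFactory.get().blockingBinaryEncoder(out, null);
       enc.writeArrayStart();
       enc.setItemCount(1000);
       for (long i = 0; i < 1000; i++) {
         enc.startItem();
         enc.writeLong(i);
       }
       enc.writeArrayEnd();
       enc.writeLong(42L); // a value stored after the array
       enc.flush();
   
       // skipArray() seeks past whole blocks when byte sizes are present and
       // returns 0; without sizes it returns an item count and the caller must
       // decode each item (the inner loop) just to get past the array.
       BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
       for (long i = dec.skipArray(); i != 0; i = dec.skipArray()) {
         for (long j = 0; j < i; j++) {
           dec.readLong(); // fallback: decode items one by one to skip them
         }
       }
       System.out.println("value after the skipped array: " + dec.readLong()); // 42
     }
   }
   ```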
   
   See the `EncoderFactory` Javadoc for details on the differences between the 
Avro encoders:
   
   
https://avro.apache.org/docs/1.5.1/api/java/org/apache/avro/io/EncoderFactory.html#blockingBinaryEncoder(java.io.OutputStream,%20org.apache.avro.io.BinaryEncoder)

