rustyconover opened a new pull request, #8625: URL: https://github.com/apache/iceberg/pull/8625
When writing Avro files, Iceberg frequently writes arrays and maps. The current use of `binaryEncoder()` and `directBinaryEncoder()` from `org.apache.avro.io.EncoderFactory` does not write the byte length of arrays or maps, because those encoders do not buffer their output to compute it. Knowing the byte length of an array or map is useful to clients decoding the Avro file, since they can skip over an entire array or map that is not needed while reading. This PR changes all Avro writers to use `blockingBinaryEncoder()`. Despite the name, this encoder does not "block" in the concurrency sense; it buffers the encoded output of objects such as arrays and maps so their byte lengths can be written. Writing these byte lengths significantly speeds up Python decoding of Avro files for tables with many columns.

See https://avro.apache.org/docs/1.5.1/api/java/org/apache/avro/io/EncoderFactory.html#blockingBinaryEncoder(java.io.OutputStream,%20org.apache.avro.io.BinaryEncoder) for details on the differences between the Avro encoders.
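As a minimal sketch of the difference (not Iceberg code; the `Example` schema, class name, and standalone `main` are made up for illustration, only the `EncoderFactory` methods come from the Avro API linked above):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class BlockingEncoderSketch {
  public static void main(String[] args) throws IOException {
    // Hypothetical record schema with a single array field.
    Schema schema = SchemaBuilder.record("Example").fields()
        .name("ids").type().array().items().longType().noDefault()
        .endRecord();

    GenericRecord record = new GenericData.Record(schema);
    record.put("ids", Arrays.asList(1L, 2L, 3L));

    ByteArrayOutputStream out = new ByteArrayOutputStream();

    // blockingBinaryEncoder buffers each array/map block so it can prefix
    // the block with its byte length; a reader can then skip the whole
    // block without decoding every element.
    BinaryEncoder encoder = EncoderFactory.get().blockingBinaryEncoder(out, null);

    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();

    // Using directBinaryEncoder(out, null) here instead would encode the
    // same data without per-block byte counts, since nothing is buffered
    // before the bytes are emitted.
    System.out.println("encoded bytes: " + out.size());
  }
}
```

The extra buffering is per array/map block, so the trade-off is a small amount of additional memory and copying at write time in exchange for skippable blocks at read time.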