jbewing opened a new pull request, #14800:
URL: https://github.com/apache/iceberg/pull/14800

   ### What
   
   This PR adds vectorized read support to Iceberg for the Apache Parquet v2 specification (see https://github.com/apache/iceberg/issues/7162). It builds on the existing support for reading [DELTA_BINARY_PACKED](https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-encoding-delta_binary_packed--5) implemented by @eric-maynard in https://github.com/apache/iceberg/pull/13391 with:
   - Building on @eric-maynard's support for [DELTA_LENGTH_BYTE_ARRAY encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6) from https://github.com/apache/iceberg/pull/13709
     * I made a couple of changes to that implementation to use temporary on-heap buffers rather than off-heap buffers. While temporary off-heap buffers are possible, the plumbing required is significant, and I wasn't able to get that path fully working: a subtle bug around leaked off-heap buffers and correctness remained.
   - Implementing vectorized read support for [DELTA_BYTE_ARRAY encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-strings-delta_byte_array--7)
   - Implementing vectorized read support for [BYTE_STREAM_SPLIT encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/#byte-stream-split-byte_stream_split--9) (a decoding sketch follows this list)
   - Implementing vectorized read support for [RLE-encoded data pages](https://parquet.apache.org/docs/file-format/data-pages/encodings/#run-length-encoding--bit-packing-hybrid-rle--3)
   - Bolstering golden-file test coverage to cover each of the paths above. I also added golden-file tests that include rows with null values for each data type, to ensure our handling of nulls is correct.
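
   To make one of these concrete, here is a minimal, scalar sketch of BYTE_STREAM_SPLIT decoding for 4-byte floats. This is illustrative only and is not how the vectorized reader in this PR is structured (the real code decodes into Arrow vectors):

```java
public final class ByteStreamSplitSketch {
  // Decode a BYTE_STREAM_SPLIT-encoded page of 4-byte floats. The page stores
  // the 1st byte of every value, then the 2nd byte of every value, and so on;
  // bytes within a value are little-endian, as in PLAIN encoding.
  static float[] decodeFloats(byte[] encoded, int valueCount) {
    float[] out = new float[valueCount];
    for (int i = 0; i < valueCount; i++) {
      int bits = (encoded[i] & 0xFF)
          | (encoded[valueCount + i] & 0xFF) << 8
          | (encoded[2 * valueCount + i] & 0xFF) << 16
          | (encoded[3 * valueCount + i] & 0xFF) << 24;
      out[i] = Float.intBitsToFloat(bits);
    }
    return out;
  }
}
```

   Grouping equal byte positions into contiguous streams makes floating-point data much more compressible, which is part of why v2 files tend to be smaller.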
   
   ### Background
   
   This solves a longstanding issue: the reference Apache Iceberg Spark implementation with default settings (e.g. `spark.sql.iceberg.vectorization.enabled` = `true`) cannot read Iceberg tables written by other compute engines that use the (no longer so new) Apache Parquet v2 writer specification. The widely known workaround is to disable the vectorized reader in Spark when interoperating with other compute engines, or to configure every compute engine to write Parquet files with the v1 writer specification. Disabling the vectorization flag means clients take a performance hit that we've anecdotally measured to be _quite_ large in some cases/workloads. Forcing all writers of an Iceberg table to use Parquet v1 incurs additional performance and storage penalties, since files written with the v2 spec tend to be smaller (the newer encodings tend to save space) and are often faster to read and write. So the current setup is a lose-lose for performance and data size, with the additional papercut that Apache Iceberg is not very portable across engines in its default configuration. This PR seeks to solve that by finishing the swing on implementing vectorized Parquet read support for the v2 format. In the future, we may also consider allowing clients to write Apache Parquet v2 files natively, gated by a setting in Apache Iceberg. Even further down that road, we might consider making that the default.
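
   For concreteness, the read-side workaround described above looks something like the following (a minimal sketch; the property name is taken from this description, and the app and table names are made up):

```java
import org.apache.spark.sql.SparkSession;

public final class VectorizationWorkaround {
  public static void main(String[] args) {
    // Disable Iceberg's vectorized Parquet reader so Spark can read tables that
    // contain Parquet v2 data pages, trading away vectorized-read performance.
    SparkSession spark = SparkSession.builder()
        .appName("parquet-v2-read-workaround") // illustrative name
        .config("spark.sql.iceberg.vectorization.enabled", "false")
        .getOrCreate();
    spark.sql("SELECT COUNT(*) FROM db.v2_table").show(); // hypothetical table
  }
}
```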
   
   ### Previous Work / Thanks
   
   This PR is a revival and extension of the work @eric-maynard was doing in https://github.com/apache/iceberg/pull/13709. That PR had been inactive for a little while, so I picked up from where Eric left off. Thank you for the great work here, @eric-maynard; you made implementing the rest of the changes required for vectorized read support _way_ easier!
   
   
   ### Note to Reviewers
   
   I debated splitting the implementation up by encoding type, but since the combined diff _isn't_ unwieldy, I decided it may be better to keep everything together, especially because internal APIs added while implementing one encoding are sometimes reused by the implementations of subsequent encodings. I can split this into multiple PRs if we think that would be substantially easier to review, or if reviewing this as a single atomic unit proves difficult.
   
   ### Testing
   I've tested this on a fork of Spark 3.5 & Iceberg 1.10.0 and verified that a Spark job can read a table written with the Parquet v2 writer without issues.
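
   For anyone who wants to reproduce this locally, one way to generate Parquet v2 data for experimentation is parquet-java's example writer (a sketch under assumed setup; the path and schema are made up, and this is not the test harness used in this PR):

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public final class WriteV2Sample {
  public static void main(String[] args) throws Exception {
    // Made-up schema and output path, purely for illustration.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message test { required int64 id; required binary name (UTF8); }");
    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/v2-sample.parquet"))
            .withType(schema)
            // Ask parquet-java for v2 pages (DELTA_* encodings, etc.).
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
      SimpleGroupFactory groups = new SimpleGroupFactory(schema);
      for (long i = 0; i < 1000; i++) {
        writer.write(groups.newGroup().append("id", i).append("name", "row-" + i));
      }
    }
  }
}
```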
   
   Successor to: https://github.com/apache/iceberg/pull/13290, 
https://github.com/apache/iceberg/pull/13391, 
https://github.com/apache/iceberg/pull/13709
   Issue: https://github.com/apache/iceberg/issues/7162

