rahulsmahadev opened a new pull request, #714: URL: https://github.com/apache/iceberg-cpp/pull/714
## Summary

Implements the remaining `LargeListArray` gaps from #502, following the design suggested by @wgtmac in #513:

1. `ValidateParquetSchemaEvolution` now accepts `LARGE_LIST` wherever `LIST` is accepted, so schema projection works when the Arrow reader presents 64-bit offset list types.
2. A new reader property `read.arrow.use-large-list` (default: `false`) configures the Parquet reader via `ArrowReaderProperties::set_list_type(::arrow::Type::LARGE_LIST)` to decode list columns as `large_list`. Since `ToArrowSchema` builds the reader's output schema with 32-bit lists (and `ProjectRecordBatch` dispatches on the output schema type), enabling the property also rewrites list fields in the output Arrow schema to `large_list` so the projection layer takes the `ProjectLargeListArray` path added in #502. The rewrite is local to the Parquet reader to avoid changing the `ToArrowSchema` signature used across writers and manifest readers.

Closes #513

## Changes

- `src/iceberg/parquet/parquet_schema_util.cc`: accept `LARGE_LIST` for `TypeId::kList` in schema evolution validation.
- `src/iceberg/file_reader.h`: add `ReaderProperties::kArrowUseLargeList` (`read.arrow.use-large-list`, default `false`), following the `kBatchSize` pattern.
- `src/iceberg/parquet/parquet_reader.cc`: set `ArrowReaderProperties::set_list_type` when the property is enabled, and align the output Arrow schema (lists nested in structs/maps included) with the large_list arrays produced by the reader.

## Test plan

- `ParquetSchemaProjectionTest.ValidateSchemaEvolutionAllowsLargeList`: `large_list` Arrow type validates against an Iceberg `ListType`.
- `ParquetSchemaProjectionTest.ProjectLargeListType`: projection over a `SchemaManifest` built with `set_list_type(LARGE_LIST)` (the same path `BuildProjection` uses in the reader).
- `ParquetReaderTest.ReadListType`: default behavior is unchanged; list columns are read as 32-bit offset `list`.
- `ParquetReaderTest.ReadListAsLargeList`: with `read.arrow.use-large-list=true`, the output schema exposes `large_list` and values round-trip correctly (verified via array slices since JSON parsing creates regular `ListArray`).

Note: my local environment lacks a C++23 toolchain (cmake 3.16/gcc 10), so I could not build locally; relying on CI to verify. All Arrow APIs used (`set_list_type`, `large_list(field)`, `MapType(key_field, item_field, keys_sorted)`, `Field::WithType`) were checked against the pinned Arrow 24.0.0 headers.
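
For reviewers, here is a minimal standalone sketch of the idea behind the output-schema alignment described under Changes: walk the type tree and turn every list into a large list, including lists nested in structs and maps. It is only an illustration under stated assumptions. `TypeNode`, `Kind`, `MakeType`, and `UseLargeLists` are hypothetical names modeling a type tree, not Arrow's `DataType` API and not the code in `parquet_reader.cc`.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for an Arrow-like type tree.
enum class Kind { kInt32, kString, kList, kLargeList, kStruct, kMap };

struct TypeNode {
  Kind kind;
  // Children: the element for lists, the fields for structs,
  // and {key, value} for maps. Primitive types have none.
  std::vector<std::shared_ptr<TypeNode>> children;
};

using TypePtr = std::shared_ptr<TypeNode>;

// Convenience constructor for the illustrative type tree.
TypePtr MakeType(Kind kind, std::vector<TypePtr> children = {}) {
  return std::make_shared<TypeNode>(TypeNode{kind, std::move(children)});
}

// Returns a new tree in which every kList node, at any nesting depth,
// is replaced by kLargeList. The input tree is left unmodified.
TypePtr UseLargeLists(const TypePtr& type) {
  assert(type != nullptr);
  std::vector<TypePtr> children;
  children.reserve(type->children.size());
  for (const auto& child : type->children) {
    // Recurse first so nested lists (inside structs, maps, or other lists)
    // are rewritten as well.
    children.push_back(UseLargeLists(child));
  }
  const Kind kind = (type->kind == Kind::kList) ? Kind::kLargeList : type->kind;
  return MakeType(kind, std::move(children));
}
```

Building a new tree instead of mutating in place mirrors how immutable schema objects are usually handled, and it keeps the default (32-bit list) schema available to callers that do not enable the property.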
