rahulsmahadev opened a new pull request, #714:
URL: https://github.com/apache/iceberg-cpp/pull/714

   ## Summary
   
   Implements the remaining `LargeListArray` gaps from #502, following the 
design suggested by @wgtmac in #513:
   
   1. `ValidateParquetSchemaEvolution` now accepts `LARGE_LIST` wherever `LIST` 
is accepted, so schema projection works when the Arrow reader presents 64-bit 
offset list types.
   2. A new reader property `read.arrow.use-large-list` (default: `false`) 
configures the Parquet reader via 
`ArrowReaderProperties::set_list_type(::arrow::Type::LARGE_LIST)` to decode 
list columns as `large_list`.
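
The relaxed check in item 1 can be pictured with a small standalone model. This is only a sketch: `ArrowTypeId`, `IcebergTypeId` and `IsCompatible` are hypothetical stand-ins, not the real `::arrow::Type::type`, `iceberg::TypeId` or `ValidateParquetSchemaEvolution`.

```cpp
// Hypothetical stand-ins for the Arrow and Iceberg type ids (assumption:
// only the handful of kinds needed for the illustration are modeled).
enum class ArrowTypeId { kInt32, kString, kStruct, kMap, kList, kLargeList };
enum class IcebergTypeId { kInt, kString, kStruct, kMap, kList };

// Returns true when the Arrow-side type is an acceptable physical
// representation of the expected Iceberg-side type.
bool IsCompatible(IcebergTypeId expected, ArrowTypeId actual) {
  switch (expected) {
    case IcebergTypeId::kInt:
      return actual == ArrowTypeId::kInt32;
    case IcebergTypeId::kString:
      return actual == ArrowTypeId::kString;
    case IcebergTypeId::kStruct:
      return actual == ArrowTypeId::kStruct;
    case IcebergTypeId::kMap:
      return actual == ArrowTypeId::kMap;
    case IcebergTypeId::kList:
      // Both 32-bit and 64-bit offset lists satisfy an Iceberg list.
      return actual == ArrowTypeId::kList || actual == ArrowTypeId::kLargeList;
  }
  return false;
}
```

Only the list case widens; every other type keeps its one-to-one mapping in this model.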
   
   Since `ToArrowSchema` builds the reader's output schema with 32-bit lists 
(and `ProjectRecordBatch` dispatches on the output schema type), enabling the 
property also rewrites list fields in the output Arrow schema to `large_list` 
so the projection layer takes the `ProjectLargeListArray` path added in #502. 
The rewrite is local to the Parquet reader to avoid changing the 
`ToArrowSchema` signature used across writers and manifest readers.
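
A minimal sketch of the recursive rewrite described above, using a toy type tree rather than Arrow's `DataType`/`Field` classes. `Kind`, `Type`, `Field`, `Make` and `ToLargeList` are all hypothetical names for illustration; the actual change operates on Arrow schema objects.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy type tree (assumption: not Arrow's real classes).
enum class Kind { kPrimitive, kStruct, kMap, kList, kLargeList };

struct Type;
struct Field {
  std::string name;
  std::shared_ptr<const Type> type;
};
struct Type {
  Kind kind;
  // Struct fields, map key/value fields, or the single list element field.
  std::vector<Field> children;
};

// Convenience constructor for the toy tree.
std::shared_ptr<const Type> Make(Kind kind, std::vector<Field> children = {}) {
  auto type = std::make_shared<Type>();
  type->kind = kind;
  type->children = std::move(children);
  return type;
}

// Returns a copy of `type` in which every list, at any nesting depth
// (inside structs, maps, or other lists), becomes a large list.
std::shared_ptr<const Type> ToLargeList(const std::shared_ptr<const Type>& type) {
  if (type->kind == Kind::kPrimitive) return type;  // nothing to rewrite
  std::vector<Field> children;
  children.reserve(type->children.size());
  for (const Field& child : type->children) {
    children.push_back({child.name, ToLargeList(child.type)});
  }
  Kind kind = type->kind == Kind::kList ? Kind::kLargeList : type->kind;
  return Make(kind, std::move(children));
}
```

The input tree is left untouched and a new tree is returned, which mirrors keeping the rewrite local to one caller instead of changing a shared schema builder.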
   
   Closes #513
   
   ## Changes
   
   - `src/iceberg/parquet/parquet_schema_util.cc`: accept `LARGE_LIST` for 
`TypeId::kList` in schema evolution validation.
   - `src/iceberg/file_reader.h`: add `ReaderProperties::kArrowUseLargeList` 
(`read.arrow.use-large-list`, default `false`), following the `kBatchSize` 
pattern.
   - `src/iceberg/parquet/parquet_reader.cc`: set 
`ArrowReaderProperties::set_list_type` when the property is enabled, and align 
the output Arrow schema (lists nested in structs/maps included) with the 
`large_list` arrays produced by the reader.
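
As a rough illustration of an opt-in property with a `false` default, here is a standalone sketch. `Entry`, `GetBool` and the string-map representation are assumptions made for the example and do not reproduce the actual `ReaderProperties` API in `file_reader.h`; only the key name and default come from this PR.

```cpp
#include <map>
#include <string>

// Hypothetical typed property entry: a key plus the value used when unset.
template <typename T>
struct Entry {
  std::string key;
  T default_value;
};

// Key and default taken from the PR description.
const Entry<bool> kArrowUseLargeList{"read.arrow.use-large-list", false};

// Looks up a boolean property, falling back to the entry's default when the
// key is absent. Only the literal string "true" enables the property here.
bool GetBool(const std::map<std::string, std::string>& props,
             const Entry<bool>& entry) {
  auto it = props.find(entry.key);
  if (it == props.end()) return entry.default_value;
  return it->second == "true";
}
```

With this shape, readers that never set the key keep the 32-bit `list` behavior, and the large-list path is taken only on explicit opt-in.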
   
   ## Test plan
   
   - `ParquetSchemaProjectionTest.ValidateSchemaEvolutionAllowsLargeList`: 
`large_list` Arrow type validates against an Iceberg `ListType`.
   - `ParquetSchemaProjectionTest.ProjectLargeListType`: projection over a 
`SchemaManifest` built with `set_list_type(LARGE_LIST)` (the same path 
`BuildProjection` uses in the reader).
   - `ParquetReaderTest.ReadListType`: default behavior is unchanged; list 
columns are still read as 32-bit offset `list`.
   - `ParquetReaderTest.ReadListAsLargeList`: with 
`read.arrow.use-large-list=true`, the output schema exposes `large_list` and 
values round-trip correctly (verified via array slices since JSON parsing 
creates regular `ListArray`).
   
   Note: my local environment lacks a C++23 toolchain (cmake 3.16/gcc 10), so I 
could not build locally and am relying on CI to verify the change. All Arrow 
APIs used (`set_list_type`, `large_list(field)`, `MapType(key_field, 
item_field, keys_sorted)`, `Field::WithType`) were checked against the pinned 
Arrow 24.0.0 headers.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]