[I] `to_arrow_batch_reader` returns a different schema than `to_arrow` [iceberg-python]

via GitHub Thu, 24 Jul 2025 22:07:01 -0700


enkidulan opened a new issue, #2250:
URL: https://github.com/apache/iceberg-python/issues/2250


   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   In the development version, I noticed that the `to_arrow_batch_reader` 
method casts all string types to `large_string`, whereas the `to_arrow` method 
returns the schema as defined in the Parquet file. At first glance, this looks 
like a bug, likely a regression from 
https://github.com/apache/iceberg-python/pull/1669/
   
   Here is a script you can use to reproduce the issue:
   
   ```py
   import pyarrow as pa
   from pyiceberg.catalog import load_catalog
   from uuid import uuid4
   from pyiceberg.schema import Schema
   from pyiceberg.types import NestedField, StringType, DoubleType
   
   catalog = load_catalog("default")
   
   df = pa.Table.from_pylist(
       [
           {"city": "Amsterdam", "lat": 52.371807, "long": 4.896029},
           {"city": "San Francisco", "lat": 37.773972, "long": -122.431297},
           {"city": "Drachten", "lat": 53.11254, "long": 6.0989},
           {"city": "Paris", "lat": 48.864716, "long": 2.349014},
       ],
   )
   
   schema = Schema(
       NestedField(1, "city", StringType(), required=False),
       NestedField(2, "lat", DoubleType(), required=False),
       NestedField(3, "long", DoubleType(), required=False),
   )
   
   tbl = catalog.create_table(f"default.cities-{uuid4()}", schema=schema)
   
   tbl.overwrite(df)
   
   
   schema_to_arrow = tbl.scan().to_arrow().schema
   
   schema_to_arrow_batch_reader = tbl.scan().to_arrow_batch_reader().schema
   
   print(
       "schema_to_arrow == schema_to_arrow_batch_reader",
       schema_to_arrow == schema_to_arrow_batch_reader,
   )
   print("\nschema_to_arrow")
   print(schema_to_arrow)
   print("\nschema_to_arrow_batch_reader")
   print(schema_to_arrow_batch_reader)
   ```
   output:
   ```
   schema_to_arrow == schema_to_arrow_batch_reader False
   
   schema_to_arrow:
   city: string
   lat: double
   long: double
   
   schema_to_arrow_batch_reader:
   city: large_string
     -- field metadata --
     PARQUET:field_id: '1'
   lat: double
     -- field metadata --
     PARQUET:field_id: '2'
   long: double
     -- field metadata --
     PARQUET:field_id: '3'
   ```
   Notice that the `to_arrow` schema says `city: string`, while the 
`to_arrow_batch_reader` schema says `city: large_string`.
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
