enkidulan commented on issue #2250:
URL: https://github.com/apache/iceberg-python/issues/2250#issuecomment-3161919353

   Thanks for getting back to me.
   
   As for the expectation: I'd expect `to_arrow_batch_reader` to return the types defined in the table schema. Why would any of the parquet files in the table not be compatible with the table schema?
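   
   To make that expectation concrete, here is a rough sketch of the check I have in mind (hypothetical; it assumes `table` is the Iceberg table from the repro further down and uses `schema_to_pyarrow` to convert the table schema):
   
   ```py
   # Rough sketch of the expectation (assumes `table` is the Iceberg table
   # from the repro below): the batch reader's Arrow schema should line up
   # with the table schema converted to Arrow.
   from pyiceberg.io.pyarrow import schema_to_pyarrow
   
   expected = schema_to_pyarrow(table.schema())
   actual = table.scan().to_arrow_batch_reader().schema
   print(expected)
   print(actual)  # on our tables the string column comes back dictionary-encoded
   ```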
   
   ----- 
   
   However, there is another aspect to this. After upgrading to the upstream version, the `table.scan().to_arrow()` call started to fail on existing tables with the following traceback:
   
   ```
   *** pyarrow.lib.ArrowTypeError: Unable to merge: Field XXXXX has 
incompatible types: string vs dictionary<values=string, indices=int32, 
ordered=0>
   Traceback (most recent call last):
     File ".../pyiceberg/table/__init__.py", line 1982, in to_arrow
       ).to_table(self.plan_files())
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File ".../pyiceberg/io/pyarrow.py", line 1623, in to_table
       result = pa.concat_tables(
                ^^^^^^^^^^^^^^^^^
     File "pyarrow/table.pxi", line 6256, in pyarrow.lib.concat_tables
     File "pyarrow/error.pxi", line 155, in 
pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   ```
   
   We re-encode data using the `to_arrow_batch_reader` function and then write the results back to the table with `table.append`. From `table.append`'s point of view, the type that `to_arrow_batch_reader` returns is backward compatible; however, pyarrow fails when trying to read it, because it ends up with two parquet files whose schemas are incompatible (from its point of view). To be fair, we specify both the pyiceberg and pyarrow schemas explicitly, but I didn't expect `table.append` to succeed if the appended data is not compatible with the table's other parquet files. So this is the point of confusion for me: `to_arrow_batch_reader` ~and `to_arrow`~ were returning different types than specified in the table schema.
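   
   For illustration, this is the kind of workaround I'm considering (a rough sketch, assuming it is acceptable to cast the dictionary-encoded column back to plain strings before appending):
   
   ```py
   # Rough workaround sketch (not what I'd expect to need): cast each batch
   # back to the table's Arrow schema before appending, so dictionary-encoded
   # string columns become plain string columns again.
   import pyarrow as pa
   from pyiceberg.io.pyarrow import schema_to_pyarrow
   
   target_schema = schema_to_pyarrow(table.schema())
   for batch in table.scan().to_arrow_batch_reader():
       arrow_table = pa.Table.from_batches([batch]).cast(target_schema)
       table.append(arrow_table)
   ```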
   
   Here is code that reproduces the issue:
   
   ```py
   from pyiceberg.catalog import load_catalog
   from pyiceberg.schema import Schema
   from pyiceberg.types import NestedField, StringType
   import pyarrow as pa
   
   # Create schema
   pa_schema = pa.schema(
       [
           pa.field("name", pa.dictionary(pa.uint8(), pa.string())),
       ])
   
   schema = Schema(
       NestedField(field_id=1, name="name", field_type=StringType(), 
required=False)
   )
   
    # Load catalog (a local REST catalog with MinIO for object storage)
   catalog = load_catalog(
       "example",
       **{
           "uri": "http://127.0.0.1:8181";,
           "s3.endpoint": "http://127.0.0.1:9000";,
           "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
           "s3.access-key-id": "admin",
           "s3.secret-access-key": "password",
       }
   )
   
   # Create table
    table = catalog.create_table("test.example", schema=schema)
   
   # Create sample data in table
   table.append(pa.table({"name": ["Alice", "Bob", "Charlie"]}, 
schema=pa_schema))
   
   # this works fine
   print("\n\nReading data from table")
   print(table.scan().to_arrow())
   
   # Read data using to_arrow_batch_reader and write to table
   print("\n\nWriting data to table")
   batch_reader = table.scan().to_arrow_batch_reader()
   for batch in batch_reader:
       arrow_table = pa.Table.from_batches([batch])
       table.append(arrow_table)
   
   # this fails
   print("\n\nReading data from table")
   print(table.scan().to_arrow())
   ```

