enkidulan commented on issue #2250:
URL: https://github.com/apache/iceberg-python/issues/2250#issuecomment-3161919353
Thanks for getting back to me.
As for the expectation: I'd expect `to_arrow_batch_reader` to return data in the types defined in the table schema. Why would any of the Parquet files in the table not be compatible with the table schema?
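Concretely, the expectation could be stated like this (a rough sketch only, assuming a loaded `table` and using `schema_to_pyarrow` from `pyiceberg.io.pyarrow` as the schema conversion helper):

```py
from pyiceberg.io.pyarrow import schema_to_pyarrow

# Expectation (sketch): the batch reader's Arrow schema matches the Arrow
# schema derived from the Iceberg table schema, so e.g. a string column comes
# back as string, not as dictionary<values=string, ...>.
expected = schema_to_pyarrow(table.schema())
reader = table.scan().to_arrow_batch_reader()
assert reader.schema.equals(expected, check_metadata=False)
```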
-----
However, there is another aspect to this. After upgrading to the upstream version, the `table.scan().to_arrow()` call started to fail on existing tables with the following traceback:
```
*** pyarrow.lib.ArrowTypeError: Unable to merge: Field XXXXX has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>
Traceback (most recent call last):
  File ".../pyiceberg/table/__init__.py", line 1982, in to_arrow
    ).to_table(self.plan_files())
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../pyiceberg/io/pyarrow.py", line 1623, in to_table
    result = pa.concat_tables(
             ^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 6256, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
```
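For what it's worth, the pyarrow-level failure can be reproduced without Iceberg at all; a minimal sketch of what `concat_tables` runs into when two files disagree on the encoding of the same column:

```py
import pyarrow as pa

# Two tables that differ only in the physical Arrow type of "name":
# plain string vs dictionary-encoded string.
plain = pa.table({"name": pa.array(["a", "b"], type=pa.string())})
encoded = pa.table({"name": pa.array(["a", "b"]).dictionary_encode()})

# Schema unification refuses to merge string with dictionary<values=string, ...>
# and raises ArrowTypeError ("Unable to merge: ..."), matching the traceback above.
pa.concat_tables([plain, encoded], promote_options="default")
```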
We re-encode data using `to_arrow_batch_reader` and then write the results back to the table with `table.append`. From `table.append`'s point of view, the type that `to_arrow_batch_reader` returns is backward compatible; however, pyarrow fails when trying to read it, because it gets two Parquet files with schemas that are incompatible from its point of view. To be fair, we specify both the pyiceberg and pyarrow schemas explicitly, but I didn't expect `table.append` to succeed if the appended data's schema is not compatible with the table's other Parquet files. So this is the point of confusion for me: `to_arrow_batch_reader` ~~and `to_arrow`~~ were returning different types than specified in the table schema.

Here is the code to reproduce the issue:
```py
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType
import pyarrow as pa
# Arrow schema: dictionary-encoded string column
pa_schema = pa.schema(
    [
        pa.field("name", pa.dictionary(pa.uint8(), pa.string())),
    ]
)

# Iceberg schema: plain string column
schema = Schema(
    NestedField(field_id=1, name="name", field_type=StringType(), required=False)
)

# Load catalog (local REST catalog + MinIO)
catalog = load_catalog(
    "example",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)

# Create table
table = catalog.create_table("test.example", schema=schema)

# Create sample data in the table
table.append(pa.table({"name": ["Alice", "Bob", "Charlie"]}, schema=pa_schema))

# This works fine
print("\n\nReading data from table")
print(table.scan().to_arrow())

# Read data using to_arrow_batch_reader and write it back to the table
print("\n\nWriting data to table")
batch_reader = table.scan().to_arrow_batch_reader()
for batch in batch_reader:
    arrow_table = pa.Table.from_batches([batch])
    table.append(arrow_table)

# This fails
print("\n\nReading data from table")
print(table.scan().to_arrow())
```
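For completeness, the workaround we are evaluating is to cast each batch back to the Arrow schema derived from the table schema before appending. This is a sketch only, continuing from the script above; `schema_to_pyarrow` is the conversion helper in `pyiceberg.io.pyarrow`, and nullability/field-id metadata may need extra care:

```py
from pyiceberg.io.pyarrow import schema_to_pyarrow

# Cast-based variant of the loop above: dictionary-encoded columns are cast
# back to their plain value types before the data is appended, so all data
# files keep the same physical schema.
target_schema = schema_to_pyarrow(table.schema())
for batch in table.scan().to_arrow_batch_reader():
    arrow_table = pa.Table.from_batches([batch]).cast(target_schema)
    table.append(arrow_table)
```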