syun64 commented on code in PR #890: URL: https://github.com/apache/iceberg-python/pull/890#discussion_r1670709039
##########
pyiceberg/table/__init__.py:
##########
@@ -1866,7 +1866,7 @@ def plan_files(self) -> Iterable[FileScanTask]:
             for data_entry in data_entries
         ]
 
-    def to_arrow(self) -> pa.Table:
+    def to_arrow(self, with_large_types: bool = True) -> pa.Table:

Review Comment:
> I don't think we should expose this in the public API. Do people want to control this? In an ideal world:
>
> * When writing you want to take the type that's being handed to PyIceberg from the user
> * When reading you want to take this information from what comes out of the Parquet files

I agree with this in the ideal world. However, the PyArrow API cannot handle both large_* and normal types without us manually casting to one or the other. For example, the RecordBatchReader will fail to produce the next RecordBatch if the schema doesn't align completely, which forces us to choose one representation and always cast to it.

If the concern is about exposing this option in the public API, I think we can walk back this change and remove it from:
- `to_arrow_batch_reader()`
- `to_arrow_table()`
- `to_requested_schema()`

But we may still need it in `schema_to_pyarrow`, because that is where we make an opinionated decision about which Arrow type to use to represent the data, both when we write and when we read.

> My first assumption was to go with the large one since that seems what most libraries seem to be using. But unfortunately, that doesn't seem to be the case.

I think this is the case for Daft and Polars:

Daft:
```
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import daft
>>> daft.read_parquet("strings.parquet").to_arrow()
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]
>>> daft.read_parquet("strings.parquet").to_arrow().cast(pa.schema([("strings", pa.string())])).write_parquet("small-strings.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.Table' object has no attribute 'write_parquet'
>>> daft.from_arrow(daft.read_parquet("strings.parquet").to_arrow().cast(pa.schema([("strings", pa.string())]))).write_parquet("small-strings.parquet")
╭────────────────────────────────╮
│ path                           │
│ ---                            │
│ Utf8                           │
╞════════════════════════════════╡
│ small-strings.parquet/74515f6… │
╰────────────────────────────────╯
(Showing first 1 of 1 rows)
>>> daft.read_parquet("small-strings.parquet").to_arrow()
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]
>>> pq.read_table("small-strings.parquet")
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]
```
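For context on the casting point above, here is a minimal PyArrow sketch (not from the PR; the `strings` field name and sample values simply mirror the transcript) showing that `string` and `large_string` are distinct Arrow types, so any code that hands batches or tables to a consumer declared against a single schema has to commit to one representation and cast the other into it:

```python
import pyarrow as pa

# string and large_string are distinct Arrow types: a schema containing one
# does not compare equal to a schema containing the other.
small_tbl = pa.table({"strings": pa.array(["a", "b"], type=pa.string())})
large_schema = pa.schema([("strings", pa.large_string())])

print(small_tbl.schema == large_schema)        # False: the schemas don't align

# Reconciling the two requires an explicit cast to one representation.
large_tbl = small_tbl.cast(large_schema)
print(large_tbl.schema.field("strings").type)  # large_string
```

A `pa.RecordBatchReader` built via `RecordBatchReader.from_batches(schema, batches)` is likewise declared against a single schema, which is the situation the comment describes: PyIceberg has to pick either the small or the large variant up front and cast every batch to that choice before yielding it.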