syun64 commented on code in PR #890: URL: https://github.com/apache/iceberg-python/pull/890#discussion_r1670709039
##########
pyiceberg/table/__init__.py:
##########
@@ -1866,7 +1866,7 @@ def plan_files(self) -> Iterable[FileScanTask]:
             for data_entry in data_entries
         ]
 
-    def to_arrow(self) -> pa.Table:
+    def to_arrow(self, with_large_types: bool = True) -> pa.Table:

Review Comment:
> I don't think we should expose this in the public API. Do people want to control this? In an ideal world:
>
> * When writing you want to take the type that's being handed to PyIceberg from the user
> * When reading you want to take this information from what comes out of the Parquet files

I agree with this in the ideal world. However, the PyArrow API cannot handle both large_* and normal types without us manually casting to one or the other. For example, the RecordBatchReader will fail to produce the next RecordBatch if the schema doesn't align completely, which forces us to choose one representation and always cast to it.

If the concern is about exposing this option in the public API, I think we can walk back this change and remove it from:
- `to_arrow_batch_reader()`
- `to_arrow_table()`
- `to_requested_schema()`

But we may still need it in `schema_to_pyarrow`, because that is where we make an opinionated decision about which Arrow type to use to represent the data, both when we write and when we read.

> My first assumption was to go with the large one since that seems what most libraries seem to be using. But unfortunately, that doesn't seem to be the case.

I think this is the case for Daft and Polars:

Daft:
```
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import daft
>>> daft.read_parquet("strings.parquet").to_arrow()
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]
>>> daft.read_parquet("strings.parquet").to_arrow().cast(pa.schema([("strings", pa.string())])).write_parquet("small-strings.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.Table' object has no attribute 'write_parquet'
>>> daft.from_arrow(daft.read_parquet("strings.parquet").to_arrow().cast(pa.schema([("strings", pa.string())]))).write_parquet("small-strings.parquet")
╭────────────────────────────────╮
│ path                           │
│ ---                            │
│ Utf8                           │
╞════════════════════════════════╡
│ small-strings.parquet/74515f6… │
╰────────────────────────────────╯
(Showing first 1 of 1 rows)
>>> daft.read_parquet("small-strings.parquet").to_arrow()
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]
>>> pq.read_table("small-strings.parquet")
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]
```
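For context on the casting point above, here is a minimal PyArrow sketch (not from the PR; the `strings` field name and sample values simply mirror the transcript) showing that `string` and `large_string` are distinct Arrow types, so any code that hands batches or tables to a consumer declared against a single schema has to commit to one representation and cast the other into it:

```python
import pyarrow as pa

# string and large_string are distinct Arrow types: a schema containing one
# does not compare equal to a schema containing the other.
small_tbl = pa.table({"strings": pa.array(["a", "b"], type=pa.string())})
large_schema = pa.schema([("strings", pa.large_string())])

print(small_tbl.schema == large_schema)        # False: the schemas don't align

# Reconciling the two requires an explicit cast to one representation.
large_tbl = small_tbl.cast(large_schema)
print(large_tbl.schema.field("strings").type)  # large_string
```

A `pa.RecordBatchReader` built via `RecordBatchReader.from_batches(schema, batches)` is likewise declared against a single schema, which is the situation the comment describes: PyIceberg has to pick either the small or the large variant up front and cast every batch to that choice before yielding it.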