Re: [I] Upcasting and Downcasting inconsistencies with PyArrow Schema [iceberg-python]

2024-06-14 Thread via GitHub
Fokko closed issue #791: Upcasting and Downcasting inconsistencies with PyArrow Schema URL: https://github.com/apache/iceberg-python/issues/791

Re: [I] Upcasting and Downcasting inconsistencies with PyArrow Schema [iceberg-python]

2024-06-10 Thread via GitHub
syun64 commented on issue #791: URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2159307590 Gotcha - thank you for the explanation @Fokko. I didn't think of how using a large_binary could actually improve performance because the data is grouped together into large buffers.
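A minimal sketch (not from the thread) of the buffer layout behind that point: `pa.binary()` indexes one contiguous data buffer with 32-bit offsets, so a single buffer tops out at 2 GiB and larger columns get split into many chunks, whereas `pa.large_binary()` uses 64-bit offsets and can keep the values grouped in fewer, larger buffers.

```python
import pyarrow as pa

# pa.binary() indexes its data buffer with 32-bit offsets, so each buffer is
# capped at 2 GiB; columns bigger than that end up split into many chunks.
binary_arr = pa.array([b"foo", b"bar"], type=pa.binary())
print(binary_arr.type)   # binary

# pa.large_binary() switches to 64-bit offsets, so the same data can live in
# one large buffer instead of many small chunks.
large_arr = binary_arr.cast(pa.large_binary())
print(large_arr.type)    # large_binary
```

Fewer chunks generally means less per-chunk bookkeeping when scanning or concatenating, which is presumably the performance benefit referred to above.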

Re: [I] Upcasting and Downcasting inconsistencies with PyArrow Schema [iceberg-python]

2024-06-10 Thread via GitHub
Fokko commented on issue #791: URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2159294140 I agree that you cannot write a single field of 2GB+ to a Parquet file. In that case, Parquet is probably not the best way of storing such a big blob. The difference between how …

Re: [I] Upcasting and Downcasting inconsistencies with PyArrow Schema [iceberg-python]

2024-06-09 Thread via GitHub
syun64 commented on issue #791: URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2156782618 > For Arrow, the `binary` cannot store more than 2GB in a single buffer, not a single field. See [Arrow docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) …

Re: [I] Upcasting and Downcasting inconsistencies with PyArrow Schema [iceberg-python]

2024-06-09 Thread via GitHub
Fokko commented on issue #791: URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2156736837 For Arrow, the `binary` cannot store more than 2GB in a single buffer, not a single field. See [Arrow docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) …
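An illustrative sketch (not from the thread) of what "per buffer, not per field" means in practice: a table column is a `ChunkedArray`, each chunk carries its own buffers, and the 2 GiB cap applies to each chunk's data buffer individually, so a plain `binary` column can still hold far more than 2 GiB in total.

```python
import pyarrow as pa

# Each chunk is its own Array with its own 32-bit-offset data buffer, and the
# 2 GiB cap applies to that buffer, not to a single field or to the column.
chunk_a = pa.array([b"a" * 10, b"b" * 10], type=pa.binary())
chunk_b = pa.array([b"c" * 10], type=pa.binary())

# A table column is a ChunkedArray, so its total size can exceed 2 GiB even
# with plain binary(), as long as no single chunk's buffer crosses the cap.
column = pa.chunked_array([chunk_a, chunk_b])
table = pa.table({"blob": column})
print(table.column("blob").num_chunks)   # 2
```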

Re: [I] Upcasting and Downcasting inconsistencies with PyArrow Schema [iceberg-python]

2024-06-09 Thread via GitHub
Fokko commented on issue #791: URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2156729320 This is interesting, why would Polars go with `large_binary` by default? See https://github.com/apache/iceberg-python/pull/409

Re: [I] Upcasting and Downcasting inconsistencies with PyArrow Schema [iceberg-python]

2024-06-06 Thread via GitHub
syun64 commented on issue #791: URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2153441786 I'm seeing the same restriction when using Polars write_parquet, so it looks like a Parquet limitation rather than an Arrow restriction:
```
ComputeError: parquet: File …
```
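For context, a hedged sketch of the kind of reproduction being described (the exact error text is cut off in the archive, and the output file name here is made up): a single field larger than the limit written through Polars' `write_parquet`.

```python
import polars as pl

# Illustrative only: materialising a value this large needs several GiB of RAM.
huge_value = b"\x00" * (2 * 1024**3 + 1)   # just over 2 GiB in one field

df = pl.DataFrame({"blob": [huge_value]})  # inferred as a Binary column

# This write is the call reported above to fail with a ComputeError.
df.write_parquet("huge_blob.parquet")
```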

[I] Upcasting and Downcasting inconsistencies with PyArrow Schema [iceberg-python]

2024-06-03 Thread via GitHub
syun64 opened a new issue, #791: URL: https://github.com/apache/iceberg-python/issues/791

### Apache Iceberg version
0.6.0 (latest release)

### Please describe the bug 🐞
`schema_to_pyarrow` converts BinaryType to the `pa.large_binary()` type. This creates inconsistencies with …
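A minimal sketch of the conversion being reported, assuming pyiceberg 0.6.0's `schema_to_pyarrow` helper and a single optional binary field:

```python
import pyarrow as pa
from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.schema import Schema
from pyiceberg.types import BinaryType, NestedField

# One optional binary column in an Iceberg schema.
iceberg_schema = Schema(
    NestedField(field_id=1, name="blob", field_type=BinaryType(), required=False)
)

arrow_schema = schema_to_pyarrow(iceberg_schema)

# The Iceberg binary field comes back as large_binary, which will not match an
# Arrow table that was built with plain pa.binary().
print(arrow_schema.field("blob").type)                 # large_binary
print(arrow_schema.field("blob").type == pa.binary())  # False
```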