syun64 commented on PR #929:
URL: https://github.com/apache/iceberg-python/pull/929#issuecomment-2239084293

   > @syun64 @HonahX @kevinjqliu This provides a nice cleanup of the types (and 
probably also a speed-up), the downside is that we have to raise the lower 
bound to PyArrow 17. PTAL
   
   Great question @Fokko ... after thinking a lot about this over the past week, here's my long answer, organized by topic:
   
   #### Benefits of 17.0.0
   - The Table Scan API `to_arrow()` will now be able to infer the type correctly based on the encoding in each Parquet file (instead of always reading as large types). If the encodings differ across the files, this will result in some of the table's data being cast to larger types. For `to_arrow_batches()` this means that we will read the batches with types matching the encoding in each Parquet file, and then always cast to large types **
   - We will be able to conform to the Iceberg spec's physical type mapping for DecimalType (sketched below): https://github.com/apache/iceberg-python/issues/936
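   For context on that mapping, here's a minimal sketch of the spec's Parquet physical types for `decimal(P, S)` (the helper function name is illustrative, not a PyIceberg API):

   ```python
   # Iceberg spec's Parquet physical-type mapping for decimal(P, S).
   # Function name is illustrative only, not part of the PyIceberg API.
   def parquet_physical_type_for_decimal(precision: int) -> str:
       if precision <= 9:
           return "INT32"  # precision 1-9 fits in a 4-byte integer
       if precision <= 18:
           return "INT64"  # precision 10-18 fits in an 8-byte integer
       # Larger precisions use a fixed-length byte array sized to the precision.
       return "FIXED_LEN_BYTE_ARRAY"
   ```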
   
   #### Users' ability to use PyIceberg in applications
   - pyarrow 8.0.0 through 17.0.0 all share the same single runtime dependency, numpy, with the same version constraint, so it shouldn't be all that difficult for users to switch to a higher version of pyarrow in existing applications
   - However, switching versions of a core dependency carries the risk of changing the data underneath. Although the Arrow representation under the hood won't change, there may still be subtle differences in API arguments that need to be considered carefully in a version update. I'd imagine it would take a lot of effort for owners of existing production applications to update their PyArrow version and QC their output artifacts as a prerequisite for adding a PyIceberg dependency to their application (if they want to use pyarrow for scans and writes)
   - This would also mean that only one version of PyArrow is usable with PyIceberg, and there's some element of risk in supporting just a single version of a package. For example, what if 17.0.0 has a really bad issue that affects a specific use case?
   
   ** -> My impression is that while this change makes sense from the perspective of preserving type and encoding correctness, it will actually result in a performance regression, because we will read most batches as small types and then have to cast them to large types (infrequently for pa.Table, but on every batch for pa.RecordBatchReader). Another option is to always cast to a small type instead in `to_arrow_batch_reader`.
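   To make that trade-off concrete, here's a minimal sketch of the per-batch upcast (the schemas and helper names here are hypothetical, not PyIceberg internals):

   ```python
   import pyarrow as pa

   # Hypothetical schemas; in PyIceberg these would come from the table metadata.
   small = pa.schema([pa.field("name", pa.string())])
   large = pa.schema([pa.field("name", pa.large_string())])

   def upcast(batch: pa.RecordBatch) -> pa.RecordBatch:
       # Widening string -> large_string rewrites the int32 offsets as int64:
       # an O(n) copy that the batch-reader path would pay on every batch.
       arrays = [col.cast(field.type) for col, field in zip(batch.columns, large)]
       return pa.RecordBatch.from_arrays(arrays, schema=large)

   def to_large(reader: pa.RecordBatchReader) -> pa.RecordBatchReader:
       return pa.RecordBatchReader.from_batches(large, (upcast(b) for b in reader))
   ```

   Going the other way (always casting down to small types) would just flip the target schema, though I believe a `large_string` column can only be downcast while its data fits int32 offsets, so that option has its own edge cases.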
   
   Based on these points, I'm leaning towards not aggressively increasing the 
lower bound to 17.0.0, at least for this minor release, but I'm very excited to 
hear what others think as well!

