[I] Getting Original Schema of a DataFile in a FileScanTask? [iceberg-python]

via GitHub Thu, 08 Feb 2024 19:47:02 -0800


srilman opened a new issue, #401:
URL: https://github.com/apache/iceberg-python/issues/401


   ### Question
   
   Is there a recommended way to get the base / original schema or 
schema-id of a data file in a FileScanTask returned by 
`FileTableScan.plan_files`? This is useful for determining what kind of schema 
evolution occurred across the subset of files we are reading, and for grouping 
files with the same schema together for reads.
   
   I had a hard time accomplishing this in the Java library, but found it much 
easier to do in Python. In `plan_files`, we can get the ID of the snapshot that 
created a data file by looking at the `snapshot_id` of the associated manifest 
entry (or the `added_snapshot_id` in the manifest list if the former is null). 
From there, we can look up the schema associated with that snapshot.
   
   Is this a logical approach, or is there a better way to get the original 
schema? Happy to open a PR to integrate this into `FileTableScan` if it would 
be useful!
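
   For illustration, here is a minimal sketch of the lookup described above. It 
does not use the pyiceberg API: `ManifestEntry`, `ManifestFile`, and 
`group_files_by_schema_id` are hypothetical stand-ins, and the mapping from 
snapshot ID to schema ID is assumed to have been built from the table metadata 
beforehand.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ManifestEntry:
    """Hypothetical stand-in for a manifest entry (not the pyiceberg class)."""
    file_path: str
    snapshot_id: Optional[int]  # may be null, in which case it is inherited


@dataclass
class ManifestFile:
    """Hypothetical stand-in for one manifest referenced by the manifest list."""
    added_snapshot_id: int
    entries: List[ManifestEntry]


def group_files_by_schema_id(
    manifests: List[ManifestFile],
    snapshot_schema_ids: Dict[int, Optional[int]],
) -> Dict[Optional[int], List[str]]:
    """Group data file paths by the schema ID of the snapshot that added them."""
    groups: Dict[Optional[int], List[str]] = defaultdict(list)
    for manifest in manifests:
        for entry in manifest.entries:
            # Prefer the entry's own snapshot_id; fall back to the manifest's
            # added_snapshot_id when the entry's value is null.
            snapshot_id = (
                entry.snapshot_id
                if entry.snapshot_id is not None
                else manifest.added_snapshot_id
            )
            # Snapshots missing from the mapping are grouped under None.
            schema_id = snapshot_schema_ids.get(snapshot_id)
            groups[schema_id].append(entry.file_path)
    return dict(groups)


manifests = [
    ManifestFile(10, [ManifestEntry("a.parquet", None), ManifestEntry("b.parquet", 20)]),
    ManifestFile(20, [ManifestEntry("c.parquet", None)]),
]
grouped = group_files_by_schema_id(manifests, {10: 0, 20: 1})
```

   Files whose snapshot is absent from the mapping end up under the `None` key, 
so a caller can decide how to handle them separately.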
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
