syun64 commented on PR #219: URL: https://github.com/apache/iceberg-python/pull/219#issuecomment-1862055260
Hi folks, I've rebased the branch from main to reflect the changes in [Add name-mapping](https://github.com/apache/iceberg-python/pull/212), and added some negative test cases. I have some questions: 1. The current proposed approach assumes that if any of the fields present in the pyarrow schema does not have a field ID, it has missing field IDs, and hence would require the table name mapping property to be set. Conversely, the version of ConvertToIceberg visitor on main [simply ignores](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L744) any fields that do not have field_ids in its metadata, and hence is able to read files with partially missing field_ids. Does the proposed behavior sound correct? 2. In order to achieve above, I am creating NestedFields with field_id=-1 (since there is an int_type constraint on NestedField. Although this feels arbitrary, we could argue that this is reasonable since '-1' is also Parquet's choice of arbitrary int when representing missing field_ids. [Example](https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing): ``` <pyarrow._parquet.ParquetSchema object at 0x7f72b317fec0> required group field_id=-1 schema { optional double field_id=-1 one; optional binary field_id=-1 two (String); optional boolean field_id=-1 three; optional binary field_id=-1 __index_level_0__ (String); } ``` Does this sound reasonable? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org