sungwy opened a new pull request, #994:
URL: https://github.com/apache/iceberg-python/pull/994

   Today in PyIceberg, we have support for identifier parsing in public APIs
   belonging to two different classes:
   
   
   - Catalog class: load_table, purge_table, drop_table
   - Table class: scan
   
   
   These APIs currently have optional support for the identifier that the
   instance itself belongs to.
   
   For example, the catalog class APIs support:
   
   
   ```
   catalog = load_catalog(“animals”, **properties)
   catalog.load_table(“cats.whiskers”)
   ```
   
   But it also supports:
   
   `catalog.load_table(“animals.cats.whiskers”)`
   
   Which is redundant, because the catalog.name is already “animals”.
   
   Similarly, row_filter in the Table scan API supports:
   
   
   ```
   table = catalog.load_table(“cats.whiskers”)
   table.scan(row_filter=”n_legs== 4”)
   ```
   
   But we also support
   
   
   `table.scan(row_filter=”whiskers.n_legs == 4”)`
   
   Which is also redundant, because the table name is already “whiskers” (or
   cats.whiskers)
   
   The benefits of this change are as follows:
   
   - As observed above, specifying instance-level identifier in these APIs
   is redundant
   - This optional support adds a lot of complexity to the code base and
   leads to issues like: #742
   <https://github.com/apache/iceberg-python/issues/742> It would be really
   great to clean this up before as we prepare for a 1.0.0 later this year
   - The optional support opens up the possibility of resulting in
   correctness issues if there exists a name in the level below as the
   instance-level identifier.
   - For example, if in the above catalog, we have a table namespace
   named “animals.lower” catalog.load(“animals.lower.cats”) can be construed
   as table name “cats” in the namespace “animals.lower” but it will be
   interpreted as table name “cats” in the namespace “lower” which is
   erroneous.
   - We would see a similar issue with tables and field names as well.
   Field name parsing is already complicated because we have to represent
   nested fields as flat representations. So it would be great to remove one
   unnecessary level of complication here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to