c-thiel opened a new issue, #742:
URL: https://github.com/apache/iceberg-python/issues/742

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   This is a harder one:
   
   I am currently unhappy with the way pyiceberg handles the RestCatalogs 
`name` property.
   There is one probably undisputed bug here:
   
https://github.com/apache/iceberg-python/blob/20b7b5339a1a0ab59d884c7d042c4bc96a166b11/pyiceberg/catalog/rest.py#L356
   
   In the previous line we cut the first segment of the namespace (presumably 
because this is the `name`) - even if the warehouse name is `None`. If there 
warehouse name is `None`, it isn't even added to the identifier namespace:
   
https://github.com/apache/iceberg-python/blob/20b7b5339a1a0ab59d884c7d042c4bc96a166b11/pyiceberg/catalog/rest.py#L464
   (notice the `if self.name`)
   
   Thus, when name is None or "", the first bit of the namespace identifier is 
removed in L356 (first snippet) wrongly.
   If the catalog has a name, we could argue that removing it is correct.
   
   However there is a second Problem:
   I would argue that the RestCatalogs name should NEVER be prepended to a 
Tables namespace identifier in the first place.
   And pyiceberg has this opinion too for 50% ;) - unfortunately the namespace 
provided in the url and in some requests diverge.
   With the script at the bottom of this issue I receive the following request 
to the endpoint POST 
`http://localhost:8080/catalog/v1/2cb6e8f0-0b7e-11ef-a4fb-c3ca318d8520/namespaces/my_namespace/tables/my_table`
 (so namespace "my_namespace"):
   ```json
   {
       "identifier": {
           "namespace": [
               "my_catalog_name",
               "my_namespace"
           ],
           "name": "my_table"
       },
       "requirements": [
           ...
       ],
       "updates": [
           ...
       ]
   }
   ```
   
   Notice how the namespace provided in the body is different from the 
namespace in the URL. The URL is not prefixed (due to the [:1] that we path 
magic function does as seen above), while the body contains a prefixed 
namespace.
   
   
   My solution would be to get rid of the name as part of the namespace 
altogether, thus removing the name prefix here:
   
https://github.com/apache/iceberg-python/blob/20b7b5339a1a0ab59d884c7d042c4bc96a166b11/pyiceberg/catalog/rest.py#L464
   
   And consequently also the removal of the name for urls here:
   
https://github.com/apache/iceberg-python/blob/20b7b5339a1a0ab59d884c7d042c4bc96a166b11/pyiceberg/catalog/rest.py#L464
   
   Any thoughts welcome!
   My baseline is that at least the URL and the body should match.
   
   My code:
   ```python
   import pandas as pd
   import pyarrow as pa
   from pyiceberg.catalog.rest import RestCatalog
   
   catalog = RestCatalog(
       name="my_catalog_name",
       uri="http://localhost:8080/catalog/";,
       warehouse="test",
       # For now we need a dummy token even for unauthenticated catalogs.
       # This is a requirement of pyiceberg.
       token="dummy",
   )
   
   # Lets work with the catalog
   namespace = ("my_namespace",)
   if namespace not in catalog.list_namespaces():
       catalog.create_namespace(namespace)
   
   df = pd.DataFrame(
       [[1, 1.2, "foo"], [2, 2.2, "bar"]], columns=["my_ints", "my_floats", 
"strings"]
   )
   tbl = pa.Table.from_pandas(df)
   table = catalog.create_table((*namespace, "my_table"), schema=tbl.schema)
   table.overwrite(tbl) # <-- This is the important part
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to