[GitHub] [iceberg] arminnajafi commented on a diff in pull request #6646: Implement Support for DynamoDB Catalog

via GitHub Mon, 30 Jan 2023 22:57:37 -0800


arminnajafi commented on code in PR #6646:
URL: https://github.com/apache/iceberg/pull/6646#discussion_r1091520849



##########
python/pyiceberg/catalog/__init__.py:
##########
@@ -431,3 +440,114 @@ def namespace_from(identifier: Union[str, Identifier]) -> 
Identifier:
             Identifier: Namespace identifier
         """
         return Catalog.identifier_to_tuple(identifier)[:-1]
+
+    @staticmethod
+    def _check_for_overlap(removals: Optional[Set[str]], updates: Properties) 
-> None:
+        if updates and removals:
+            overlap = set(removals) & set(updates.keys())
+            if overlap:
+                raise ValueError(f"Updates and deletes have an overlap: 
{overlap}")
+
+    def _resolve_table_location(self, location: Optional[str], database_name: 
str, table_name: str) -> str:
+        if not location:
+            return self._get_default_warehouse_location(database_name, 
table_name)
+        return location
+
+    def _get_default_warehouse_location(self, database_name: str, table_name: 
str) -> str:
+        database_properties = self.load_namespace_properties(database_name)
+        if database_location := database_properties.get(LOCATION):
+            database_location = database_location.rstrip("/")
+            return f"{database_location}/{table_name}"
+
+        if warehouse_path := self.properties.get(WAREHOUSE_LOCATION):
+            warehouse_path = warehouse_path.rstrip("/")
+            return f"{warehouse_path}/{database_name}.db/{table_name}"
+
+        raise ValueError("No default path is set, please specify a location 
when creating a table")
+
+    @staticmethod
+    def identifier_to_database(
+        identifier: Union[str, Identifier], err: Union[Type[ValueError], 
Type[NoSuchNamespaceError]] = ValueError
+    ) -> str:
+        tuple_identifier = Catalog.identifier_to_tuple(identifier)
+        if len(tuple_identifier) != 1:
+            raise err(f"Invalid database, hierarchical namespaces are not 
supported: {identifier}")
+
+        return tuple_identifier[0]
+
+    @staticmethod
+    def identifier_to_database_and_table(
+        identifier: Union[str, Identifier],
+        err: Union[Type[ValueError], Type[NoSuchTableError], 
Type[NoSuchNamespaceError]] = ValueError,
+    ) -> Tuple[str, str]:
+        tuple_identifier = Catalog.identifier_to_tuple(identifier)
+        if len(tuple_identifier) != 2:
+            raise err(f"Invalid path, hierarchical namespaces are not 
supported: {identifier}")
+
+        return tuple_identifier[0], tuple_identifier[1]
+
+    def purge_table(self, identifier: Union[str, Identifier]) -> None:
+        """Drop a table and purge all data and metadata files.
+
+        Note: This method only logs warning rather than raise exception when 
encountering file deletion failure
+
+        Args:
+            identifier (str | Identifier): Table identifier.
+
+        Raises:
+            NoSuchTableError: If a table with the name does not exist, or the 
identifier is invalid
+        """
+        table = self.load_table(identifier)
+        self.drop_table(identifier)
+        io = load_file_io(self.properties, table.metadata_location)
+        metadata = table.metadata
+        manifest_lists_to_delete = set()
+        manifests_to_delete = []
+        for snapshot in metadata.snapshots:
+            manifests_to_delete += snapshot.manifests(io)
+            if snapshot.manifest_list is not None:
+                manifest_lists_to_delete.add(snapshot.manifest_list)
+
+        manifest_paths_to_delete = {manifest.manifest_path for manifest in 
manifests_to_delete}
+        prev_metadata_files = {log.metadata_file for log in 
metadata.metadata_log}
+
+        delete_data_files(io, manifests_to_delete)
+        delete_files(io, manifest_paths_to_delete, MANIFEST)
+        delete_files(io, manifest_lists_to_delete, MANIFEST_LIST)
+        delete_files(io, prev_metadata_files, PREVIOUS_METADATA)
+        delete_files(io, {table.metadata_location}, METADATA)
+
+    @staticmethod
+    def _write_metadata(metadata: TableMetadata, io: FileIO, metadata_path: 
str) -> None:
+        ToOutputFile.table_metadata(metadata, io.new_output(metadata_path))
+
+    @staticmethod
+    def _get_metadata_location(location: str) -> str:
+        return f"{location}/metadata/00000-{uuid.uuid4()}.metadata.json"
+
+    def _get_updated_props_and_update_summary(

Review Comment:
   Thanks for seeing this. Yeah that was my plan too. 
   
   But hive is doing this slightly different:
   
   `hive.py`:
   ```
   if removals:
       for key in removals:
           if key in parameters:
               parameters[key] = None
               removed.add(key)
   ```
   
   `glue.py` and `dynamodb.py`:
   ```
   if removals:
       for key in removals:
           if key in updated_properties:
               updated_properties.pop(key)
               removed.add(key)
   ```
   
   The difference is in `parameters[key] = None` vs 
`updated_properties.pop(key)` I wasn't sure if this difference was intended or 
we could consolidate the behavior and refactor hive too.
   
   Hopping @Fokko can clarify that. 
    
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] arminnajafi commented on a diff in pull request #6646: Implement Support for DynamoDB Catalog

Reply via email to