ForeverAngry commented on PR #2205:
URL: https://github.com/apache/iceberg-python/pull/2205#issuecomment-3344012166
> While I agree with the need to retry for the `add_files` function, I don't
think we should add specific retry logic for this function only. These are all
commit operations, so it would be more maintainable to address commit retries
in general.
>
> The use case you described regarding collecting file stats is useful. I do
think we need a retry mechanism that is better than "starting everything from
scratch". However, I think we should have separation of concerns; `add_files`
shouldn't need to worry about caching file stats.
>
> As a library, pyiceberg aims to provide all the building blocks. We let
users implement their own specific solution for these types of optimizations.

Fair enough. In the spirit of providing building blocks, wdyt of doing
something like the following? This would expose the functions users need to
address the issue, rather than implementing the fix for them:
```python
import uuid
from typing import Dict, List, Union
from urllib.parse import urlparse

import pyarrow as pa

from pyiceberg.manifest import DataFile
from pyiceberg.table import TableProperties
from pyiceberg.typedef import EMPTY_DICT


def _is_valid_object_path(self, path: str) -> bool:
    """Simple check to validate whether a string is a potentially valid object path."""
    if not isinstance(path, str) or not path.strip():
        return False
    # A URL-like path (e.g. s3://, gs://, file://) carries a scheme
    return bool(urlparse(path).scheme)


def _does_file_exist(self, path: str) -> bool:
    """Check if a file exists at the given path using the table's IO."""
    if not self._is_valid_object_path(path):
        return False
    try:
        with self._table.io.new_input(path).open():
            return True
    except Exception:
        return False


def get_data_files_from_objects(self, list_of_objects: List[Union[str, pa.Table]]) -> List[DataFile]:
    """Convert a list of objects (URIs or dataframes) to DataFile objects.

    Args:
        list_of_objects: List of object paths (strings) or PyArrow tables.

    Returns:
        List of DataFile objects.
    """
    from pyiceberg.io.pyarrow import _dataframe_to_data_files, _parquet_files_to_data_files

    data_files = []
    for obj in list_of_objects:
        if isinstance(obj, str) and self._is_valid_object_path(obj):
            data_files.extend(
                _parquet_files_to_data_files(
                    table_metadata=self.table_metadata,
                    file_paths=[obj],
                    io=self._table.io,
                )
            )
        elif isinstance(obj, pa.Table):
            # This branch lets users pass DataFrames directly, skipping file I/O,
            # for cases where data is already in memory. However, it may cause
            # confusion: some users might expect files to be auto-written, while
            # others assume pre-written files are being referenced. Clear docs or
            # explicit parameters (e.g. write_files=) can help set expectations.
            data_files.extend(
                _dataframe_to_data_files(
                    table_metadata=self.table_metadata,
                    write_uuid=uuid.uuid4(),
                    df=obj,
                    io=self._table.io,
                )
            )
        else:
            raise ValueError(f"Unsupported object type: {type(obj)}. Must be a valid path string or PyArrow Table.")
    return data_files


def add_data_files(self, data_files: List[DataFile], snapshot_properties: Dict[str, str] = EMPTY_DICT) -> None:
    """Shorthand API for adding DataFile objects to the table transaction."""
    if not data_files:
        return  # No files to add
    # Set a name mapping if one is not already set
    if self.table_metadata.name_mapping() is None:
        self.set_properties(
            **{TableProperties.DEFAULT_NAME_MAPPING: self.table_metadata.schema().name_mapping.model_dump_json()}
        )
    with self.update_snapshot(snapshot_properties=snapshot_properties).fast_append() as update_snapshot:
        for data_file in data_files:
            update_snapshot.append_data_file(data_file)
```
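For context, here's a rough sketch of how a caller could combine these building blocks to retry only the commit step. The catalog/table names, paths, and retry count are purely illustrative, and it assumes the proposed methods land on `Transaction`:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import CommitFailedException

catalog = load_catalog("default")          # illustrative catalog/table
table = catalog.load_table("db.events")

# Collect file stats once, up front, so the expensive part is never repeated.
tx = table.transaction()
data_files = tx.get_data_files_from_objects(
    ["s3://bucket/data/part-0.parquet", "s3://bucket/data/part-1.parquet"]
)

# Retry only the cheap commit on conflicts, reusing the cached DataFile objects.
for attempt in range(3):
    try:
        with table.transaction() as tx:
            tx.add_data_files(data_files)
        break
    except CommitFailedException:
        table.refresh()                    # pick up the latest metadata, then retry
```

That keeps `add_files` free of any caching concerns while still letting users avoid restarting from scratch on a failed commit.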