ForeverAngry commented on PR #2205:
URL: https://github.com/apache/iceberg-python/pull/2205#issuecomment-3344012166
> While I agree with the need to retry for the `add_files` function, I don't
think we should add specific retry logic for this function only. These are all
commit operations, so it would be more maintainable to address commit retries
in general.
>
> The use case you described regarding collecting file stats is useful. I do
think we need a retry mechanism that is better than "starting everything from
scratch". However, I think we should have separation of concerns; `add_files`
shouldn't need to worry about caching file stats.
>
> As a library, pyiceberg aims to provide all the building blocks. We let
users implement their own specific solution for these types of optimizations.

Fair enough. In the spirit of providing building blocks, wdyt of doing
something like the following? This would expose the functions users need to
address the issue, rather than implementing the fix for them:
```python
import uuid
from typing import Dict, List, Union
from urllib.parse import urlparse

import pyarrow as pa

from pyiceberg.manifest import DataFile
from pyiceberg.table import TableProperties
from pyiceberg.typedef import EMPTY_DICT


def _is_valid_object_path(self, path: str) -> bool:
    """Simple check to validate whether a string is a potentially valid object path."""
    if not isinstance(path, str) or not path.strip():
        return False
    # A URL-like path (e.g. s3://, gs://, file://) carries a scheme
    return bool(urlparse(path).scheme)


def _does_file_exist(self, path: str) -> bool:
    """Check if a file exists at the given path using the table's IO."""
    if not self._is_valid_object_path(path):
        return False
    try:
        with self._table.io.new_input(path).open():
            return True
    except Exception:
        return False


def get_data_files_from_objects(self, list_of_objects: List[Union[str, pa.Table]]) -> List[DataFile]:
    """Convert a list of objects (URIs or dataframes) to DataFile objects.

    Args:
        list_of_objects: List of object paths (strings) or PyArrow tables.

    Returns:
        List of DataFile objects.
    """
    from pyiceberg.io.pyarrow import _dataframe_to_data_files, _parquet_files_to_data_files

    data_files = []
    for obj in list_of_objects:
        if isinstance(obj, str) and self._is_valid_object_path(obj):
            data_files.extend(
                _parquet_files_to_data_files(
                    table_metadata=self.table_metadata,
                    file_paths=[obj],
                    io=self._table.io,
                )
            )
        elif isinstance(obj, pa.Table):
            # This branch lets users pass DataFrames directly, skipping file I/O,
            # for cases where data is already in memory. However, it may cause
            # confusion: some users might expect files to be auto-written, while
            # others assume pre-written files are being referenced. Clear docs or
            # explicit parameters (e.g. write_files=) can help set expectations.
            data_files.extend(
                _dataframe_to_data_files(
                    table_metadata=self.table_metadata,
                    write_uuid=uuid.uuid4(),
                    df=obj,
                    io=self._table.io,
                )
            )
        else:
            raise ValueError(f"Unsupported object type: {type(obj)}. Must be a valid path string or PyArrow Table.")
    return data_files


def add_data_files(self, data_files: List[DataFile], snapshot_properties: Dict[str, str] = EMPTY_DICT) -> None:
    """Shorthand API for adding DataFile objects to the table transaction."""
    if not data_files:
        return  # No files to add
    # Set a name mapping if one is not already set
    if self.table_metadata.name_mapping() is None:
        self.set_properties(
            **{TableProperties.DEFAULT_NAME_MAPPING: self.table_metadata.schema().name_mapping.model_dump_json()}
        )
    with self.update_snapshot(snapshot_properties=snapshot_properties).fast_append() as update_snapshot:
        for data_file in data_files:
            update_snapshot.append_data_file(data_file)
```
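For context, here's a rough sketch of how a caller could combine these building blocks to retry only the commit step. The catalog/table names, paths, and retry count are purely illustrative, and it assumes the proposed methods land on `Transaction`:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import CommitFailedException

catalog = load_catalog("default")          # illustrative catalog/table
table = catalog.load_table("db.events")

# Collect file stats once, up front, so the expensive part is never repeated.
tx = table.transaction()
data_files = tx.get_data_files_from_objects(
    ["s3://bucket/data/part-0.parquet", "s3://bucket/data/part-1.parquet"]
)

# Retry only the cheap commit on conflicts, reusing the cached DataFile objects.
for attempt in range(3):
    try:
        with table.transaction() as tx:
            tx.add_data_files(data_files)
        break
    except CommitFailedException:
        table.refresh()                    # pick up the latest metadata, then retry
```

That keeps `add_files` free of any caching concerns while still letting users avoid restarting from scratch on a failed commit.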