rynewang opened a new issue, #2556:
URL: https://github.com/apache/iceberg-python/issues/2556

   ### Apache Iceberg version
   
   0.9.1
   
   ### Please describe the bug 🐞
   
     Printing a PyIceberg DataFile object, or calling repr() on it, produces an empty
     `DataFile[]` instead of the object's attributes. This makes debugging and logging
     difficult because the contents of a DataFile cannot be inspected.
   
   ## Example Code to Reproduce
   ```
     from pyiceberg.manifest import DataFile, DataFileContent, FileFormat
     from pyiceberg.typedef import Record
   
     # Create a DataFile with data
     data_file = DataFile(
         content=DataFileContent.DATA,
         file_path="s3://my-bucket/data/part-00000.parquet",
         file_format=FileFormat.PARQUET,
         partition=Record(),
         record_count=50000,
         file_size_in_bytes=1048576,
         spec_id=0
     )
   
     # Print the DataFile
     print(data_file)
     # Output: DataFile[]
   
     print(repr(data_file))
     # Output: DataFile[]
   
     # But the data is there:
     print(data_file.file_path)
     # Output: s3://my-bucket/data/part-00000.parquet
   ```
   
   ## Root Cause
   
     The issue occurs because:
     1. The DataFile class uses `__slots__` for memory efficiency (defined in manifest.py).
     2. DataFile inherits from the Record class (defined in typedef.py).
     3. Record's `__repr__` method iterates over `self.__dict__.items()` to build the string representation.
     4. Attributes declared in `__slots__` are stored in per-instance slots reached through descriptors on the class, not in the instance `__dict__`.
     5. Therefore, `__dict__` is empty and the repr shows `DataFile[]`.
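
The mechanism in the list above can be reproduced without PyIceberg. The following is a minimal sketch with hypothetical stand-in classes (`Base` and `Slotted` are illustrative names, not PyIceberg code); it assumes only that the base class builds its repr from `__dict__` and that the subclass declares `__slots__`.

```python
class Base:
    """Stand-in for a Record-like base class whose repr reads only __dict__."""

    def __repr__(self) -> str:
        fields = ", ".join(
            f"{key}={value!r}"
            for key, value in self.__dict__.items()
            if not key.startswith("_")
        )
        return f"{self.__class__.__name__}[{fields}]"


class Slotted(Base):
    """Stand-in for a DataFile-like subclass that declares __slots__."""

    __slots__ = ("file_path", "record_count")

    def __init__(self, file_path: str, record_count: int) -> None:
        # These assignments go to the slot descriptors, not to __dict__.
        self.file_path = file_path
        self.record_count = record_count


item = Slotted("s3://my-bucket/data/part-00000.parquet", 50000)
print(repr(item))      # Slotted[]
print(item.__dict__)   # {}
print(item.file_path)  # s3://my-bucket/data/part-00000.parquet
```

Because `Base` does not declare `__slots__`, instances still have a `__dict__`, so the repr does not fail; it simply finds nothing to print.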
   
   ## Relevant Code
   
     In pyiceberg/typedef.py:
   ```
     class Record(StructProtocol):
         # ...
         def __repr__(self) -> str:
             """Return the string representation of the Record class."""
             return f"{self.__class__.__name__}[{', '.join(f'{key}={repr(value)}' for key, value in self.__dict__.items() if not key.startswith('_'))}]"
   ```
     In pyiceberg/manifest.py:
   ```
     class DataFile(Record):
         __slots__ = (
             "content",
             "file_path",
             "file_format",
             "partition",
             "record_count",
             "file_size_in_bytes",
             # ... many more fields
         )
   ```
   
   ## Proposed Solution
   
     The `Record.__repr__` method should check whether the class uses `__slots__` and
     iterate over those attributes in addition to `__dict__`. Here's a potential fix:
   ```
     def __repr__(self) -> str:
         """Return the string representation of the Record class."""
         attrs = []
   
         # Check if the class uses __slots__
         if hasattr(self.__class__, '__slots__'):
             for slot in self.__class__.__slots__:
                 if hasattr(self, slot) and not slot.startswith('_'):
                     value = getattr(self, slot)
                     attrs.append(f'{slot}={repr(value)}')
   
         # Also include __dict__ items for non-slotted attributes
         for key, value in self.__dict__.items():
             if not key.startswith('_'):
                 attrs.append(f'{key}={repr(value)}')
   
         return f"{self.__class__.__name__}[{', '.join(attrs)}]"
   ```
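
One caveat with reading `self.__class__.__slots__` directly: in Python, that attribute lists only the slots declared by that one class, not those declared by its base classes, and `__dict__` does not exist on instances whose whole hierarchy is slotted. Whether either case applies to PyIceberg's Record hierarchy is not confirmed here. The sketch below is a hedged variant that walks the MRO; `iter_public_fields`, `slot_aware_repr`, `Base`, and `Derived` are hypothetical names used only for illustration.

```python
from typing import Any, Iterator, Tuple


def iter_public_fields(obj: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (name, value) for public slotted and __dict__ attributes of obj."""
    seen = set()
    # Walk the MRO from base to most-derived so inherited slots come first.
    for klass in reversed(type(obj).__mro__):
        slots = klass.__dict__.get("__slots__", ())
        if isinstance(slots, str):  # __slots__ = "name" declares a single slot
            slots = (slots,)
        for name in slots:
            # Skip private names, duplicates, and slots that were never assigned.
            if name.startswith("_") or name in seen or not hasattr(obj, name):
                continue
            seen.add(name)
            yield name, getattr(obj, name)
    # The default covers fully slotted hierarchies, which have no __dict__.
    for name, value in getattr(obj, "__dict__", {}).items():
        if not name.startswith("_") and name not in seen:
            yield name, value


def slot_aware_repr(obj: Any) -> str:
    """Build a Record-style repr from slots and __dict__."""
    fields = ", ".join(f"{name}={value!r}" for name, value in iter_public_fields(obj))
    return f"{type(obj).__name__}[{fields}]"


class Base:
    __slots__ = ("spec_id",)
    __repr__ = slot_aware_repr


class Derived(Base):
    __slots__ = ("file_path", "record_count")


item = Derived()
item.spec_id = 0
item.file_path = "a.parquet"  # record_count is deliberately left unset
print(repr(item))  # Derived[spec_id=0, file_path='a.parquet']
```

The unset `record_count` slot is omitted because `hasattr` returns False for a slot that has never been assigned, matching the `hasattr` guard in the proposed fix.
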
   ## Expected Behavior
   ```
     print(data_file)
     # Should output (wrapped here for readability):
     # DataFile[content=DataFileContent.DATA,
     # file_path='s3://my-bucket/data/part-00000.parquet',
     # file_format=FileFormat.PARQUET, partition=Record[], record_count=50000,
     # file_size_in_bytes=1048576, spec_id=0]
   ```
   ## Environment
   
     - PyIceberg version: 0.9.1
     - Python version: 3.11
   
   ## Impact
   
     This affects debugging and logging when working with Iceberg manifests. Developers
     cannot easily inspect DataFile objects during development or when troubleshooting
     issues.
   
   ## Workaround
   
     Until this is fixed, users can define a helper function to display DataFile contents:
   ```
     def format_datafile(datafile):
         """Format a DataFile object for display."""
         from pyiceberg.manifest import DataFile
   
         if not isinstance(datafile, DataFile):
             return str(datafile)
   
         attrs = []
         for slot in DataFile.__slots__:
             if hasattr(datafile, slot):
                 value = getattr(datafile, slot)
                 attrs.append(f"{slot}={value!r}")
   
         return f"DataFile[{', '.join(attrs)}]"
   ```
   
   ### Willingness to contribute
   
   - [x] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

