Re: [I] Incremental Changelog Scan [iceberg-python]

via GitHub Wed, 24 Jul 2024 08:22:51 -0700


glesperance commented on issue #240:
URL: https://github.com/apache/iceberg-python/issues/240#issuecomment-2248294331


   This would be great. In the meantime I naively hacked this together to get
   newly appended rows; it seems to work for my use case.
   Looking at the code, wouldn't this feature be easier to implement if
   `plan_files` accepted an optional `snapshot_id` argument?
   
   
https://github.com/apache/iceberg-python/blob/861c5631587f0d54e2550733d0f8557d57f5060a/pyiceberg/table/__init__.py#L1929-L1937
   
   ```
   from typing import Iterable, Optional, Tuple, Union
   from pyiceberg.table import (
       DataScan, FileScanTask, Table, Properties, ALWAYS_TRUE, EMPTY_DICT,
       BooleanExpression,
   )
   
   class AppendScan(DataScan):
       # Snapshot to diff against; None means behave like a regular DataScan
       start_snapshot_id: Optional[int] = None
   
       @classmethod
       def from_table(cls, table: Table,
           row_filter: Union[str, BooleanExpression] = ALWAYS_TRUE,
           selected_fields: Tuple[str, ...] = ("*",),
           case_sensitive: bool = True,
           start_snapshot_id: Optional[int] = None,
           snapshot_id: Optional[int] = None,
           options: Properties = EMPTY_DICT,
           limit: Optional[int] = None,
       ) -> DataScan:
           instance = cls(
               table_metadata=table.metadata,
               io=table.io,
               row_filter=row_filter,
               selected_fields=selected_fields,
               case_sensitive=case_sensitive,
               snapshot_id=snapshot_id,
               options=options,
               limit=limit,
           )
   
           instance.start_snapshot_id = start_snapshot_id
   
           return instance
   
       def plan_files(self) -> Iterable[FileScanTask]:
           current_plan = super().plan_files()
           
           if self.start_snapshot_id is None:
               return current_plan
           
           # We need to filter out the files that were already in the old snapshot
           orig_snapshot_id = self.snapshot_id
           try:
               # Temporarily point the scan at the starting snapshot and plan it
               self.snapshot_id = self.start_snapshot_id
               prev_plan = super().plan_files()
               
               return [task for task in current_plan if task not in prev_plan]
           finally:
               # Restore the snapshot id
               self.snapshot_id = orig_snapshot_id
   
   append_scan = AppendScan.from_table(
       product, start_snapshot_id=product.history()[-2].snapshot_id
   )
   append_scan.to_pandas()
   ```
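
The filtering step in `plan_files` above compares whole task objects with `not in`, which is a linear search per task. Below is a minimal sketch of the same diff keyed on data file paths instead. `StubTask` and `newly_added_tasks` are hypothetical stand-ins invented for illustration, not pyiceberg classes; with a real `FileScanTask` the path would presumably come from `task.file.file_path`. Like the hack above, it only covers appended files and ignores deletes or rewritten files.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class StubTask:
    """Hypothetical stand-in for a scan task; only the data file path is modeled."""

    file_path: str


def newly_added_tasks(
    current: Iterable[StubTask], previous: Iterable[StubTask]
) -> List[StubTask]:
    """Return the tasks in `current` whose file path does not appear in `previous`."""
    # Build the set once so each membership check is O(1)
    previous_paths = {task.file_path for task in previous}
    return [task for task in current if task.file_path not in previous_paths]


# Example: the newer plan contains one extra file compared with the older one
old_plan = [StubTask("s3://bucket/data/a.parquet")]
new_plan = old_plan + [StubTask("s3://bucket/data/b.parquet")]
added = newly_added_tasks(new_plan, old_plan)
print(added)
```

Keying on the path also avoids depending on how task objects implement equality.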
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

