Re: [PR] Refactor: extract `BaseScan` and `ManifestGroupPlanner` [iceberg-python]

via GitHub Mon, 15 Jun 2026 18:56:48 -0700


smaheshwar-pltr commented on code in PR #3511:
URL: https://github.com/apache/iceberg-python/pull/3511#discussion_r3417663587



##########
pyiceberg/table/__init__.py:
##########
@@ -1685,20 +1685,19 @@ def _parse_row_filter(expr: str | BooleanExpression) -> 
BooleanExpression:
     return parser.parse(expr) if isinstance(expr, str) else expr
 
 
-S = TypeVar("S", bound="TableScan", covariant=True)
+A = TypeVar("A", bound="BaseScan", covariant=True)
+
 
+class BaseScan(ABC):

Review Comment:
   [AI reviewer aid] New base class holding the snapshot-independent scan 
surface (row filter, options, limit, chaining helpers, and the format-converter 
sinks built on `to_arrow()`). Split out rather than collapsing the hierarchy so 
`TableScan` can keep `snapshot_id` / `use_ref()` / etc. and avoid the 
user-facing break that 
[#533](https://github.com/apache/iceberg-python/pull/533) caused by dropping 
them. This is the shared base the incremental append scan builds on in the 
follow-up PR.



##########
pyiceberg/table/__init__.py:
##########
@@ -1707,20 +1706,127 @@ def __init__(
         row_filter: str | BooleanExpression = ALWAYS_TRUE,
         selected_fields: tuple[str, ...] = ("*",),
         case_sensitive: bool = True,
-        snapshot_id: int | None = None,
         options: Properties = EMPTY_DICT,
         limit: int | None = None,
-        catalog: Catalog | None = None,
-        table_identifier: Identifier | None = None,
     ):
         self.table_metadata = table_metadata
         self.io = io
         self.row_filter = _parse_row_filter(row_filter)
         self.selected_fields = selected_fields
         self.case_sensitive = case_sensitive
-        self.snapshot_id = snapshot_id
         self.options = options
         self.limit = limit
+
+    @abstractmethod
+    def projection(self) -> Schema: ...
+
+    @abstractmethod
+    def plan_files(self) -> Iterable[ScanTask]: ...
+
+    @abstractmethod
+    def to_arrow(self) -> pa.Table: ...
+
+    def update(self: A, **overrides: Any) -> A:
+        """Create a copy of this table scan with updated fields."""
+        from inspect import signature
+
+        # Extract those attributes that are constructor parameters. We don't 
use self.__dict__ as the kwargs to the
+        # constructors because it may contain additional attributes that are 
not part of the constructor signature.
+        params = signature(type(self).__init__).parameters.keys() - {"self"}  
# Skip "self" parameter
+        kwargs = {param: getattr(self, param) for param in params}  # Assume 
parameters are attributes
+
+        return type(self)(**{**kwargs, **overrides})
+
+    def select(self: A, *field_names: str) -> A:
+        if "*" in self.selected_fields:
+            return self.update(selected_fields=field_names)
+        return 
self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
+
+    def filter(self: A, expr: str | BooleanExpression) -> A:
+        return self.update(row_filter=And(self.row_filter, 
_parse_row_filter(expr)))
+
+    def with_case_sensitive(self: A, case_sensitive: bool = True) -> A:
+        return self.update(case_sensitive=case_sensitive)
+
+    def to_pandas(self, **kwargs: Any) -> pd.DataFrame:

Review Comment:
   [AI reviewer aid] `to_pandas` / `to_polars` were abstract on `TableScan`; 
they're now concrete defaults on `BaseScan` (built on `to_arrow()`), and 
`to_duckdb` / `to_ray` move up here too. This loosens the abstract contract — 
additive for existing `TableScan` subclasses, not breaking.



##########
pyiceberg/table/__init__.py:
##########
@@ -1707,20 +1706,127 @@ def __init__(
         row_filter: str | BooleanExpression = ALWAYS_TRUE,
         selected_fields: tuple[str, ...] = ("*",),
         case_sensitive: bool = True,
-        snapshot_id: int | None = None,
         options: Properties = EMPTY_DICT,
         limit: int | None = None,
-        catalog: Catalog | None = None,
-        table_identifier: Identifier | None = None,
     ):
         self.table_metadata = table_metadata
         self.io = io
         self.row_filter = _parse_row_filter(row_filter)
         self.selected_fields = selected_fields
         self.case_sensitive = case_sensitive
-        self.snapshot_id = snapshot_id
         self.options = options
         self.limit = limit
+
+    @abstractmethod
+    def projection(self) -> Schema: ...
+
+    @abstractmethod
+    def plan_files(self) -> Iterable[ScanTask]: ...
+
+    @abstractmethod
+    def to_arrow(self) -> pa.Table: ...
+
+    def update(self: A, **overrides: Any) -> A:
+        """Create a copy of this table scan with updated fields."""
+        from inspect import signature
+
+        # Extract those attributes that are constructor parameters. We don't 
use self.__dict__ as the kwargs to the
+        # constructors because it may contain additional attributes that are 
not part of the constructor signature.
+        params = signature(type(self).__init__).parameters.keys() - {"self"}  
# Skip "self" parameter
+        kwargs = {param: getattr(self, param) for param in params}  # Assume 
parameters are attributes
+
+        return type(self)(**{**kwargs, **overrides})
+
+    def select(self: A, *field_names: str) -> A:
+        if "*" in self.selected_fields:
+            return self.update(selected_fields=field_names)
+        return 
self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
+
+    def filter(self: A, expr: str | BooleanExpression) -> A:
+        return self.update(row_filter=And(self.row_filter, 
_parse_row_filter(expr)))
+
+    def with_case_sensitive(self: A, case_sensitive: bool = True) -> A:
+        return self.update(case_sensitive=case_sensitive)
+
+    def to_pandas(self, **kwargs: Any) -> pd.DataFrame:
+        """Read a Pandas DataFrame eagerly from this Iceberg table.
+
+        Returns:
+            pd.DataFrame: Materialized Pandas Dataframe from the Iceberg table
+        """
+        return self.to_arrow().to_pandas(**kwargs)
+
+    def to_duckdb(self, table_name: str, connection: DuckDBPyConnection | None 
= None) -> DuckDBPyConnection:
+        """Shorthand for loading the Iceberg Table in DuckDB.
+
+        Returns:
+            DuckDBPyConnection: In memory DuckDB connection with the Iceberg 
table.
+        """
+        import duckdb
+
+        con = connection or duckdb.connect(database=":memory:")
+        con.register(table_name, self.to_arrow())
+
+        return con
+
+    def to_ray(self) -> ray.data.dataset.Dataset:
+        """Read a Ray Dataset eagerly from this Iceberg table.
+
+        Returns:
+            ray.data.dataset.Dataset: Materialized Ray Dataset from the 
Iceberg table
+        """
+        import ray
+
+        return ray.data.from_arrow(self.to_arrow())
+
+    def to_polars(self) -> pl.DataFrame:
+        """Read a Polars DataFrame from this Iceberg table.
+
+        Returns:
+            pl.DataFrame: Materialized Polars Dataframe from the Iceberg table
+        """
+        import polars as pl
+
+        result = pl.from_arrow(self.to_arrow())
+        if isinstance(result, pl.Series):
+            result = result.to_frame()
+
+        return result
+
+
+S = TypeVar("S", bound="TableScan", covariant=True)
+
+
+class TableScan(BaseScan, ABC):

Review Comment:
   [AI reviewer aid] `TableScan`'s public surface is unchanged: it now 
subclasses the new `BaseScan` but keeps `snapshot_id`, `catalog`, 
`table_identifier`, `snapshot()`, `projection()`, `use_ref()`, and abstract 
`count()`. The large diff is `TableScan` being moved below `BaseScan`, not 
renamed or removed.



##########
pyiceberg/table/__init__.py:
##########
@@ -2215,6 +2128,172 @@ def count(self) -> int:
         return res
 
 
+class ManifestGroupPlanner:

Review Comment:
   [AI reviewer aid] Extracted from `DataScan`. The `_build_*` evaluators and 
`_check_sequence_number` are **moved** here, not new — so `DataScan` (and the 
incremental append scan in the follow-up) share one planning pipeline. 
`DataScan.scan_plan_helper` and `_plan_files_local` now delegate to it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Refactor: extract `BaseScan` and `ManifestGroupPlanner` [iceberg-python]

Reply via email to