Copilot commented on code in PR #1474:
URL: https://github.com/apache/datafusion-python/pull/1474#discussion_r3034076330
##########
python/datafusion/context.py:
##########
@@ -1328,6 +1364,39 @@ def read_avro(
             self.ctx.read_avro(str(path), schema, file_partition_cols, file_extension)
         )
+    def read_arrow(
+        self,
+        path: str | pathlib.Path,
+        schema: pa.Schema | None = None,
+        file_extension: str = ".arrow",
+        file_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+    ) -> DataFrame:
+        """Create a :py:class:`DataFrame` for reading an Arrow IPC data source.
+
+        Args:
+            path: Path to the Arrow IPC file.
+            schema: The data source schema.
+            file_extension: File extension to select.
+            file_partition_cols: Partition columns.
+
+        Returns:
+            DataFrame representation of the read Arrow IPC file.
+        """
+        if file_partition_cols is None:
+            file_partition_cols = []
+        file_partition_cols = _convert_table_partition_cols(file_partition_cols)
+        return DataFrame(
+            self.ctx.read_arrow(str(path), schema, file_extension, file_partition_cols)
+        )
+
+    def read_empty(self) -> DataFrame:
+        """Create an empty :py:class:`DataFrame` with no columns or rows.
+
+        Returns:
+            An empty DataFrame.
+        """
+        return DataFrame(self.ctx.read_empty())
Review Comment:
`read_empty()` duplicates the existing `empty_table()` API (already exposed
in Python and backed by the Rust `empty_table` method). To avoid two parallel
ways to create an empty DataFrame (and an extra Rust binding), consider
implementing `SessionContext.read_empty()` as a simple alias that calls
`self.empty_table()` (or `self.ctx.empty_table()`), and keep only one low-level
binding. If both names are intentionally supported, consider documenting one as
an alias of the other, or deprecating one of them, for clarity.
```suggestion
        This method is an alias for :meth:`empty_table`.

        Returns:
            An empty DataFrame.
        """
        return self.empty_table()
```
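For reference, the alias-plus-deprecation pattern the comment describes can be sketched in plain Python (the `_Ctx` class and its string return value are toy stand-ins for the real `SessionContext`/`DataFrame`, not the actual bindings):

```python
import warnings


class _Ctx:
    """Toy stand-in for SessionContext; illustrates the alias pattern only."""

    def empty_table(self):
        # In the real binding this would wrap the single low-level
        # self.ctx.empty_table() call.
        return "empty-dataframe"

    def read_empty(self):
        # Alias: warn, then delegate to the one low-level binding.
        warnings.warn(
            "read_empty() is an alias for empty_table()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.empty_table()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = _Ctx().read_empty()
```

Keeping the low-level call in one place means the alias can later be removed without touching `empty_table` itself.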
##########
crates/core/src/context.rs:
##########
@@ -1184,6 +1217,34 @@ impl PySessionContext {
         Ok(PyDataFrame::new(df))
     }
+    pub fn read_empty(&self) -> PyDataFusionResult<PyDataFrame> {
+        let df = self.ctx.read_empty()?;
+        Ok(PyDataFrame::new(df))
Review Comment:
This PR adds a new `read_empty` binding, but the Rust layer already exposes
`empty_table()` (which calls `self.ctx.read_empty()?` at ~context.rs:1078).
Having both `empty_table` and `read_empty` at the Rust/PyO3 layer is redundant
and increases API surface. Consider removing the new `read_empty` method and
having the Python `SessionContext.read_empty()` wrapper call the existing
`empty_table()` binding instead, or otherwise clearly define one as an
alias/deprecated.
```suggestion
    /// Deprecated alias for `empty_table()`.
    #[deprecated(note = "use empty_table() instead")]
    pub fn read_empty(&self) -> PyDataFusionResult<PyDataFrame> {
        self.empty_table()
```
##########
python/tests/test_context.py:
##########
@@ -668,6 +668,45 @@ def test_read_avro(ctx):
assert avro_df is not None
+def test_read_arrow(ctx, tmp_path):
+    # Write an Arrow IPC file, then read it back
+    table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
+    arrow_path = tmp_path / "test.arrow"
+    with pa.ipc.new_file(str(arrow_path), table.schema) as writer:
+        writer.write_table(table)
+
+    df = ctx.read_arrow(str(arrow_path))
+    result = df.collect()
+    assert result[0].column(0) == pa.array([1, 2, 3])
+    assert result[0].column(1) == pa.array(["x", "y", "z"])
+
+
+def test_read_empty(ctx):
+    df = ctx.read_empty()
+    result = df.collect()
+    assert result[0].num_columns == 0
Review Comment:
`ctx.read_empty().collect()` may legally return an empty list of
`RecordBatch`es (depending on how DataFusion plans/executes empty
relations/partitions). Indexing `result[0]` makes this test potentially flaky.
Consider asserting on `len(result)` first and/or validating emptiness via
`df.schema()` and `df.count() == 0`, or accept both `[]` and `[empty_batch]`
forms.
```suggestion
    assert len(result) in (0, 1)
    if len(result) == 1:
        assert result[0].num_columns == 0
```
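As a standalone illustration of the accept-both-forms check (using `types.SimpleNamespace` as a stand-in for `pyarrow.RecordBatch`, since only the `num_columns` attribute matters here):

```python
from types import SimpleNamespace


def assert_empty_result(batches):
    """Pass for both [] and [zero-column batch] result shapes."""
    assert len(batches) in (0, 1)
    if batches:
        assert batches[0].num_columns == 0


empty_batch = SimpleNamespace(num_columns=0)  # stand-in for a RecordBatch

assert_empty_result([])             # engine returned no batches
assert_empty_result([empty_batch])  # engine returned one empty batch
```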
##########
python/datafusion/context.py:
##########
@@ -1328,6 +1364,39 @@ def read_avro(
             self.ctx.read_avro(str(path), schema, file_partition_cols, file_extension)
         )
+    def read_arrow(
+        self,
+        path: str | pathlib.Path,
+        schema: pa.Schema | None = None,
+        file_extension: str = ".arrow",
+        file_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+    ) -> DataFrame:
+        """Create a :py:class:`DataFrame` for reading an Arrow IPC data source.
+
+        Args:
+            path: Path to the Arrow IPC file.
+            schema: The data source schema.
+            file_extension: File extension to select.
+            file_partition_cols: Partition columns.
+
+        Returns:
+            DataFrame representation of the read Arrow IPC file.
+        """
+        if file_partition_cols is None:
+            file_partition_cols = []
+        file_partition_cols = _convert_table_partition_cols(file_partition_cols)
+        return DataFrame(
+            self.ctx.read_arrow(str(path), schema, file_extension, file_partition_cols)
+        )
Review Comment:
The PR metadata/description indicates this closes #1458, but the issue lists
additional missing `SessionContext` APIs (`read_batch`, `read_batches`, and
context-level `write_csv`/`write_json`/`write_parquet`) that are still not
exposed (no definitions found in the Python/Rust bindings). If the intent is to
fully close #1458, these methods still need implementation; otherwise, consider
adjusting the PR description/issue closure reference to reflect the partial
coverage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]