Copilot commented on code in PR #1474:
URL: https://github.com/apache/datafusion-python/pull/1474#discussion_r3034076330
##########
python/datafusion/context.py:
##########
@@ -1328,6 +1364,39 @@ def read_avro(
             self.ctx.read_avro(str(path), schema, file_partition_cols, file_extension)
         )
+    def read_arrow(
+        self,
+        path: str | pathlib.Path,
+        schema: pa.Schema | None = None,
+        file_extension: str = ".arrow",
+        file_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+    ) -> DataFrame:
+        """Create a :py:class:`DataFrame` for reading an Arrow IPC data source.
+
+        Args:
+            path: Path to the Arrow IPC file.
+            schema: The data source schema.
+            file_extension: File extension to select.
+            file_partition_cols: Partition columns.
+
+        Returns:
+            DataFrame representation of the read Arrow IPC file.
+        """
+        if file_partition_cols is None:
+            file_partition_cols = []
+        file_partition_cols = _convert_table_partition_cols(file_partition_cols)
+        return DataFrame(
+            self.ctx.read_arrow(str(path), schema, file_extension, file_partition_cols)
+        )
+
+    def read_empty(self) -> DataFrame:
+        """Create an empty :py:class:`DataFrame` with no columns or rows.
+
+        Returns:
+            An empty DataFrame.
+        """
+        return DataFrame(self.ctx.read_empty())
Review Comment:
`read_empty()` duplicates the existing `empty_table()` API (already exposed
in Python and backed by the Rust `empty_table` method). To avoid two parallel
ways to create an empty DataFrame (and an extra Rust binding), consider
implementing `SessionContext.read_empty()` as a simple alias that calls
`self.empty_table()` (or `self.ctx.empty_table()`), and keep only one low-level
binding. If both names are intentionally supported, consider documenting one as
an alias of the other, or deprecating one of them, for clarity.
```suggestion
        This method is an alias for :meth:`empty_table`.

        Returns:
            An empty DataFrame.
        """
        return self.empty_table()
```
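For reference, the alias-plus-deprecation pattern the comment describes can be sketched in plain Python (the `_Ctx` class and its string return value are toy stand-ins for the real `SessionContext`/`DataFrame`, not the actual bindings):

```python
import warnings


class _Ctx:
    """Toy stand-in for SessionContext; illustrates the alias pattern only."""

    def empty_table(self):
        # In the real binding this would wrap the single low-level
        # self.ctx.empty_table() call.
        return "empty-dataframe"

    def read_empty(self):
        # Alias: warn, then delegate to the one low-level binding.
        warnings.warn(
            "read_empty() is an alias for empty_table()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.empty_table()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = _Ctx().read_empty()
```

Keeping the low-level call in one place means the alias can later be removed without touching `empty_table` itself.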
##########
crates/core/src/context.rs:
##########
@@ -1184,6 +1217,34 @@ impl PySessionContext {
         Ok(PyDataFrame::new(df))
     }
+    pub fn read_empty(&self) -> PyDataFusionResult<PyDataFrame> {
+        let df = self.ctx.read_empty()?;
+        Ok(PyDataFrame::new(df))
Review Comment:
This PR adds a new `read_empty` binding, but the Rust layer already exposes
`empty_table()` (which calls `self.ctx.read_empty()?` at ~context.rs:1078).
Having both `empty_table` and `read_empty` at the Rust/PyO3 layer is redundant
and increases API surface. Consider removing the new `read_empty` method and
having the Python `SessionContext.read_empty()` wrapper call the existing
`empty_table()` binding instead, or otherwise clearly define one as an
alias/deprecated.
```suggestion
    /// Deprecated alias for `empty_table()`.
    #[deprecated(note = "use empty_table() instead")]
    pub fn read_empty(&self) -> PyDataFusionResult<PyDataFrame> {
        self.empty_table()
```
##########
python/tests/test_context.py:
##########
@@ -668,6 +668,45 @@ def test_read_avro(ctx):
assert avro_df is not None
+def test_read_arrow(ctx, tmp_path):
+    # Write an Arrow IPC file, then read it back
+    table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
+    arrow_path = tmp_path / "test.arrow"
+    with pa.ipc.new_file(str(arrow_path), table.schema) as writer:
+        writer.write_table(table)
+
+    df = ctx.read_arrow(str(arrow_path))
+    result = df.collect()
+    assert result[0].column(0) == pa.array([1, 2, 3])
+    assert result[0].column(1) == pa.array(["x", "y", "z"])
+
+
+def test_read_empty(ctx):
+    df = ctx.read_empty()
+    result = df.collect()
+    assert result[0].num_columns == 0
Review Comment:
`ctx.read_empty().collect()` may legally return an empty list of
`RecordBatch`es (depending on how DataFusion plans/executes empty
relations/partitions). Indexing `result[0]` makes this test potentially flaky.
Consider asserting on `len(result)` first and/or validating emptiness via
`df.schema()` and `df.count() == 0`, or accept both `[]` and `[empty_batch]`
forms.
```suggestion
    assert len(result) in (0, 1)
    if len(result) == 1:
        assert result[0].num_columns == 0
```
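As a standalone illustration of the accept-both-forms check (using `types.SimpleNamespace` as a stand-in for `pyarrow.RecordBatch`, since only the `num_columns` attribute matters here):

```python
from types import SimpleNamespace


def assert_empty_result(batches):
    """Pass for both [] and [zero-column batch] result shapes."""
    assert len(batches) in (0, 1)
    if batches:
        assert batches[0].num_columns == 0


empty_batch = SimpleNamespace(num_columns=0)  # stand-in for a RecordBatch

assert_empty_result([])             # engine returned no batches
assert_empty_result([empty_batch])  # engine returned one empty batch
```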
##########
python/datafusion/context.py:
##########
@@ -1328,6 +1364,39 @@ def read_avro(
             self.ctx.read_avro(str(path), schema, file_partition_cols, file_extension)
         )
+    def read_arrow(
+        self,
+        path: str | pathlib.Path,
+        schema: pa.Schema | None = None,
+        file_extension: str = ".arrow",
+        file_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+    ) -> DataFrame:
+        """Create a :py:class:`DataFrame` for reading an Arrow IPC data source.
+
+        Args:
+            path: Path to the Arrow IPC file.
+            schema: The data source schema.
+            file_extension: File extension to select.
+            file_partition_cols: Partition columns.
+
+        Returns:
+            DataFrame representation of the read Arrow IPC file.
+        """
+        if file_partition_cols is None:
+            file_partition_cols = []
+        file_partition_cols = _convert_table_partition_cols(file_partition_cols)
+        return DataFrame(
+            self.ctx.read_arrow(str(path), schema, file_extension, file_partition_cols)
+        )
Review Comment:
The PR metadata/description indicates this closes #1458, but the issue lists
additional missing `SessionContext` APIs (`read_batch`, `read_batches`, and
context-level `write_csv`/`write_json`/`write_parquet`) that are still not
exposed (no definitions found in the Python/Rust bindings). If the intent is to
fully close #1458, these methods still need implementation; otherwise, consider
adjusting the PR description/issue closure reference to reflect the partial
coverage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]