andygrove opened a new pull request, #1614:
URL: https://github.com/apache/datafusion-ballista/pull/1614

   # Which issue does this PR close?
   
   Closes #.
   
   # Rationale for this change
   
   Ballista's Python bindings extend `datafusion-python` heavily through 
subclassing and metaclass introspection (see 
`python/python/ballista/extension.py`):
   
   - `RedefiningDataFrameMeta` walks the parent `DataFrame.__dict__` and 
re-wraps every method whose return annotation is the literal string 
`"DataFrame"` so it returns `DistributedDataFrame` instead.
   - `RedefiningSessionContextMeta` does the same for `SessionContext`.
   - A hardcoded `EXECUTION_METHODS = ["collect", "collect_partitioned", 
"show", "count", "to_arrow_table", "to_pandas", "to_polars", "write_json"]` is 
wrapped to route execution through the Ballista cluster.
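
   The re-wrapping described above can be sketched roughly as follows. This is 
a minimal illustration, not the code in `extension.py`: the `DataFrame` 
stand-in, `RewrappingMeta`, and `DistributedFrame` are hypothetical names, and 
the real metaclasses may differ in detail.

```python
import functools


class DataFrame:
    """Hypothetical stand-in for datafusion.DataFrame."""

    def __init__(self, rows):
        self.rows = rows

    def filter(self, pred) -> "DataFrame":
        return DataFrame([r for r in self.rows if pred(r)])

    def count(self) -> int:
        return len(self.rows)


class RewrappingMeta(type):
    """Re-wrap parent methods whose return annotation is the string "DataFrame"."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        for attr, fn in DataFrame.__dict__.items():
            if attr.startswith("__") or not callable(fn):
                continue
            # The match is on the literal string, not on the class object.
            if getattr(fn, "__annotations__", {}).get("return") != "DataFrame":
                continue
            setattr(cls, attr, mcs._rewrap(cls, fn))
        return cls

    @staticmethod
    def _rewrap(cls, fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            result = fn(self, *args, **kwargs)
            # Convert the parent's result into the subclass type.
            return cls(result.rows)

        return wrapper


class DistributedFrame(DataFrame, metaclass=RewrappingMeta):
    pass
```

   In this sketch `DistributedFrame([1, 2, 3]).filter(...)` returns a 
`DistributedFrame`, while `count` is inherited untouched because its 
annotation is not the string `"DataFrame"`.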
   
   If a future `datafusion-python` release changes annotation style (e.g. 
switches from forward-reference strings to real class objects, or to PEP 604 
unions) or renames any of those methods, the wrapping silently stops happening. 
Queries quietly fall back to local DataFusion execution while every existing 
test still passes — the failure mode is invisible until users notice their 
cluster doing nothing.
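
   To make the failure mode concrete, the sketch below shows how a string 
comparison on the return annotation stops matching once the annotation style 
changes. The `DataFrame` class and the `matches_string_annotation` helper are 
hypothetical stand-ins, not the actual check in `extension.py`.

```python
from typing import Optional


class DataFrame:
    """Hypothetical stand-in for datafusion.DataFrame."""

    def select(self) -> "DataFrame":  # forward-reference string
        return self


def select_class_annotated(self) -> DataFrame:  # real class object
    return self


def select_optional(self) -> Optional[DataFrame]:  # a union-style annotation
    return self


def matches_string_annotation(fn) -> bool:
    # Mirrors the kind of check described above: compare against the literal string.
    return fn.__annotations__.get("return") == "DataFrame"


for fn in (DataFrame.select, select_class_annotated, select_optional):
    print(fn.__name__, matches_string_annotation(fn))
```

   Only the first function matches. The other two would be skipped by a 
wrapper that relies on this comparison, and nothing would raise.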
   
   Today only `collect()` is exercised under Ballista in `test_context.py`. 
Nothing asserts that wrapping actually occurred, and the other seven 
`EXECUTION_METHODS` are entirely uncovered.
   
   # What changes are included in this PR?
   
   New file `python/python/tests/test_datafusion_compat.py` with 11 tests in 
two groups, plus a dev dependency change:
   
   **Metaclass smoke tests (3)** — fail loudly if introspection no longer 
matches:
   - `test_distributed_dataframe_wraps_dataframe_returning_methods` — confirms 
representative `DataFrame` methods (`select`, `filter`, `with_column`, 
`aggregate`) carry the string `"DataFrame"` return annotation and are 
re-wrapped on `DistributedDataFrame`.
   - `test_ballista_session_context_wraps_dataframe_returning_methods` — same 
check for `sql` / `read_csv` / `read_parquet` on `BallistaSessionContext`.
   - `test_execution_methods_are_present_on_dataframe` — every name in 
`EXECUTION_METHODS` still exists on `datafusion.DataFrame`.
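
   The shape of these smoke checks can be sketched as below. The helper names 
`assert_rewrapped` and `assert_methods_present` and the `Parent` / `Child` 
classes are hypothetical and only illustrate the idea; the tests in this PR 
run against the real `datafusion` and `ballista` classes and may be structured 
differently.

```python
def assert_rewrapped(parent, child, method_names):
    """Fail loudly if a method lost its string annotation or was not re-wrapped."""
    for name in method_names:
        parent_fn = parent.__dict__[name]
        annotation = parent_fn.__annotations__.get("return")
        assert annotation == "DataFrame", (
            f"{name}: return annotation is now {annotation!r}"
        )
        assert getattr(child, name) is not parent_fn, f"{name} was not re-wrapped"


def assert_methods_present(cls, method_names):
    """Fail loudly if any expected method name has been renamed or removed."""
    missing = [n for n in method_names if not hasattr(cls, n)]
    assert not missing, f"missing methods: {missing}"


# Hypothetical stand-ins used only to exercise the helpers.
class Parent:
    def select(self) -> "DataFrame":
        return self

    def collect(self) -> list:
        return []


class Child(Parent):
    def select(self):  # plays the role of a re-wrapped method
        return super().select()


assert_rewrapped(Parent, Child, ["select"])
assert_methods_present(Parent, ["select", "collect"])
```

   Passing `["collect"]` to `assert_rewrapped` here raises `AssertionError`, 
which is the loud failure the smoke tests aim for.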
   
   **Per-method round-trip tests (8)** — one per name in `EXECUTION_METHODS`. 
Each test builds a small `DistributedDataFrame`, calls its method (`collect`, 
`collect_partitioned`, `show`, `count`, `to_arrow_table`, `to_pandas`, 
`to_polars`, or `write_json`), and asserts the return shape and content. This 
catches both renames (loud `AttributeError`) and silent fallback (the return 
type would be wrong).
   
   **Dev dependency additions** — `pandas>=2.0.0` and `polars>=1.0.0` added to 
`[dependency-groups].dev` in `python/pyproject.toml` so the `to_pandas` / 
`to_polars` tests run unconditionally in CI rather than skipping when those 
libraries are absent. `uv.lock` is regenerated accordingly.
   
   Note: `write_json` requires its `write_options` argument to be passed 
explicitly, even though datafusion's signature declares it optional with a 
`None` default. A comment in the test records this.
   
   # Are there any user-facing changes?
   
   No. New tests and dev dependencies only.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

