Re: [PR] Add missing Dataframe functions [datafusion-python]

via GitHub Fri, 03 Apr 2026 12:24:39 -0700


Copilot commented on code in PR #1472:
URL: https://github.com/apache/datafusion-python/pull/1472#discussion_r3034080932



##########
python/datafusion/dataframe.py:
##########
@@ -1036,6 +1078,109 @@ def except_all(self, other: DataFrame) -> DataFrame:
         """
         return DataFrame(self.df.except_all(other.df))
 
+    def except_distinct(self, other: DataFrame) -> DataFrame:
+        """Calculate the set difference with deduplication.
+
+        Returns rows that are in this DataFrame but not in ``other``,
+        removing any duplicates. This is the complement of :py:meth:`except_all`
+        which preserves duplicates.
+
+        The two :py:class:`DataFrame` must have exactly the same schema.
+
+        Args:
+            other: DataFrame to calculate exception with.
+
+        Returns:
+            DataFrame after set difference with deduplication.
+        """
+        return DataFrame(self.df.except_distinct(other.df))
+
+    def intersect_distinct(self, other: DataFrame) -> DataFrame:
+        """Calculate the intersection with deduplication.
+
+        Returns distinct rows that appear in both DataFrames. This is the
+        complement of :py:meth:`intersect` which preserves duplicates.
+
+        The two :py:class:`DataFrame` must have exactly the same schema.
+
+        Args:
+            other: DataFrame to intersect with.
+
+        Returns:
+            DataFrame after intersection with deduplication.
+        """
+        return DataFrame(self.df.intersect_distinct(other.df))
+
+    def union_by_name(self, other: DataFrame) -> DataFrame:
+        """Union two :py:class:`DataFrame` matching columns by name.
+
+        Unlike :py:meth:`union` which matches columns by position, this method
+        matches columns by their names, allowing DataFrames with different
+        column orders to be combined.
+
+        Args:
+            other: DataFrame to union with.
+
+        Returns:
+            DataFrame after union by name.
+        """
+        return DataFrame(self.df.union_by_name(other.df))
+
+    def union_by_name_distinct(self, other: DataFrame) -> DataFrame:
+        """Union two :py:class:`DataFrame` by name with deduplication.
+
+        Combines :py:meth:`union_by_name` with deduplication of rows.
+
+        Args:
+            other: DataFrame to union with.
+
+        Returns:
+            DataFrame after union by name with deduplication.
+        """
+        return DataFrame(self.df.union_by_name_distinct(other.df))
+
+    def distinct_on(
+        self,
+        on_expr: list[Expr],
+        select_expr: list[Expr],
+        sort_expr: list[SortKey] | None = None,
+    ) -> DataFrame:
+        """Deduplicate rows based on specific columns.
+
+        Returns a new DataFrame with one row per unique combination of the
+        ``on_expr`` columns, keeping the first row per group as determined by
+        ``sort_expr``.
+
+        Args:
+            on_expr: Expressions that determine uniqueness.
+            select_expr: Expressions to include in the output.
+            sort_expr: Optional sort expressions to determine which row to keep.
+
+        Returns:
+            DataFrame after deduplication.
+        """
+        on_raw = expr_list_to_raw_expr_list(on_expr)
+        select_raw = expr_list_to_raw_expr_list(select_expr)
+        sort_raw = sort_list_to_raw_sort_list(sort_expr) if sort_expr else None
+        return DataFrame(self.df.distinct_on(on_raw, select_raw, sort_raw))
+
+    def sort_by(self, *exprs: Expr | str) -> DataFrame:
+        """Sort the DataFrame by column expressions in ascending order.
+
+        This is a convenience method that sorts all columns in ascending order
+        with nulls last. For more control over sort direction and null ordering,
+        use :py:meth:`sort` instead.

Review Comment:
   The `sort_by` docstring says this method "sorts all columns". The 
implementation actually sorts by the provided `exprs` only; it does not default 
to sorting by every column in the DataFrame. Please adjust the wording to avoid 
implying it sorts the entire schema by default (e.g., "sorts the DataFrame by 
the given expressions in ascending order with nulls last").
   ```suggestion
           This is a convenience method that sorts the DataFrame by the given
           expressions in ascending order with nulls last. For more control over
           sort direction and null ordering, use :py:meth:`sort` instead.
   ```
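
To make the distinction concrete, here is a plain-Python sketch of the behavior the corrected docstring describes: ordering rows by the supplied keys only, ascending, with nulls last. It does not use the datafusion API; `sort_by_sketch` and the list-of-dicts row representation are hypothetical names chosen for illustration.

```python
from typing import Any


def sort_by_sketch(rows: list[dict[str, Any]], *keys: str) -> list[dict[str, Any]]:
    """Sort rows by the given keys only, ascending, with None values last.

    Hypothetical helper for illustration; not part of datafusion-python.
    """

    def sort_key(row: dict[str, Any]) -> tuple:
        # Each key contributes an (is_null, value) pair. False sorts before
        # True, so rows with None for a key come after rows with a value.
        # The placeholder 0 is only ever compared against another placeholder.
        return tuple(
            (row[k] is None, 0 if row[k] is None else row[k]) for k in keys
        )

    # sorted() is stable, so rows that tie on the given keys keep their
    # original relative order; columns not listed in `keys` are never compared.
    return sorted(rows, key=sort_key)


rows = [
    {"a": 2, "b": "x"},
    {"a": None, "b": "y"},
    {"a": 1, "b": "z"},
]

by_a = sort_by_sketch(rows, "a")
unsorted = sort_by_sketch(rows)
```

In this sketch, calling the helper with no keys leaves the row order unchanged, which is the point of the wording fix: nothing is sorted unless an expression is passed.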



##########
python/datafusion/dataframe.py:
##########
@@ -1036,6 +1078,109 @@ def except_all(self, other: DataFrame) -> DataFrame:
         """
         return DataFrame(self.df.except_all(other.df))
 
+    def except_distinct(self, other: DataFrame) -> DataFrame:
+        """Calculate the set difference with deduplication.
+
+        Returns rows that are in this DataFrame but not in ``other``,
+        removing any duplicates. This is the complement of :py:meth:`except_all`
+        which preserves duplicates.
+

Review Comment:
   This PR description indicates it closes #1455/#1456 (which include exposing 
`DataFrame.with_param_values`), but `with_param_values` still isn't present on 
the Python `DataFrame` API (and isn't bound in `crates/core/src/dataframe.rs`). 
Either expose `with_param_values` on `DataFrame` as well, or update the PR 
metadata/issues being closed so they accurately reflect what's implemented here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

