timsaucer commented on code in PR #1579:
URL:
https://github.com/apache/datafusion-python/pull/1579#discussion_r3413131897
##########
docs/source/user-guide/common-operations/aggregations.md:
##########
@@ -0,0 +1,433 @@
+---
+jupytext:
+ text_representation:
+ extension: .md
+ format_name: myst
+kernelspec:
+ name: python3
+ display_name: Python 3
+---
Review Comment:
These lines indicate to the build system that we are running the jupyter
extension to build the examples.
##########
docs/source/conf.py:
##########
@@ -48,16 +48,40 @@
extensions = [
"sphinx.ext.mathjax",
"sphinx.ext.napoleon",
- "myst_parser",
- "IPython.sphinxext.ipython_directive",
+ # myst_nb is a superset of myst_parser: it provides the MyST markdown
+ # parser plus executable `{code-cell}` notebook directives. Do NOT also
+ # list "myst_parser" — myst_nb activates it internally and listing both
+ # raises an extension conflict.
+ "myst_nb",
"autoapi.extension",
]
+# NOTE: .rst stays alongside .md because sphinx-autoapi generates RST
+# under autoapi/ and Sphinx needs the suffix to parse it. The human-
+# authored docs are all MyST .md now. ".md" is routed through myst-nb so
+# pages carrying jupytext/kernelspec front matter execute their
+# `{code-cell}` blocks; pages without that front matter render as plain
+# MyST markdown. The ".rst" entry is only for the autoapi build artifacts.
source_suffix = {
".rst": "restructuredtext",
- ".md": "markdown",
+ ".md": "myst-nb",
}
+# Execute notebook code cells at build time and fail the build if any cell
+# raises — this replaces the old IPython sphinx directive, whose executed
+# examples are now `{code-cell}` blocks. "force" re-executes every build so
+# stale cached output can never ship.
+nb_execution_mode = "force"
+nb_execution_timeout = 120
+nb_execution_raise_on_error = True
Review Comment:
These lines will force failing examples to fail the documentation build.
##########
docs/source/conf.py:
##########
@@ -48,16 +48,40 @@
extensions = [
"sphinx.ext.mathjax",
"sphinx.ext.napoleon",
- "myst_parser",
- "IPython.sphinxext.ipython_directive",
+ # myst_nb is a superset of myst_parser: it provides the MyST markdown
+ # parser plus executable `{code-cell}` notebook directives. Do NOT also
+ # list "myst_parser" — myst_nb activates it internally and listing both
+ # raises an extension conflict.
+ "myst_nb",
"autoapi.extension",
]
+# NOTE: .rst stays alongside .md because sphinx-autoapi generates RST
+# under autoapi/ and Sphinx needs the suffix to parse it. The human-
+# authored docs are all MyST .md now. ".md" is routed through myst-nb so
+# pages carrying jupytext/kernelspec front matter execute their
+# `{code-cell}` blocks; pages without that front matter render as plain
+# MyST markdown. The ".rst" entry is only for the autoapi build artifacts.
source_suffix = {
".rst": "restructuredtext",
- ".md": "markdown",
+ ".md": "myst-nb",
}
+# Execute notebook code cells at build time and fail the build if any cell
+# raises — this replaces the old IPython sphinx directive, whose executed
+# examples are now `{code-cell}` blocks. "force" re-executes every build so
+# stale cached output can never ship.
+nb_execution_mode = "force"
+nb_execution_timeout = 120
+nb_execution_raise_on_error = True
+
+# Prefer the plain-text repr of a cell's last expression over its rich
+# `_repr_html_`. A DataFrame's HTML repr is a self-contained widget (inline
+# styles + an injected <script>) built for Jupyter; in the docs theme it
+# renders at the wrong width. The text repr is the readable table the old
+# IPython directive showed and is stable across datafusion versions.
+nb_mime_priority_overrides = [("html", "text/plain", 0)]
Review Comment:
This was necessary to enforce text output of the dataframes instead of
rendered html. It gives a more consistent experience IMO, especially as the
html rendering code has had some changes in the past few releases.
##########
.github/workflows/build.yml:
##########
@@ -552,9 +552,10 @@ jobs:
run: |
set -x
cd docs
- curl -O
https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv
- curl -O
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
- uv run --no-project make html
+ # build.sh downloads the example data, registers the Jupyter kernel
+ # myst-nb needs, symlinks the data next to each executed page, and
+ # runs sphinx. Using it here keeps CI identical to a local build.
+ uv run --no-project bash ./build.sh
Review Comment:
Use a single build path, both for local development and CI.
##########
docs/source/user-guide/common-operations/aggregations.md:
##########
@@ -0,0 +1,433 @@
+---
+jupytext:
+ text_representation:
+ extension: .md
+ format_name: myst
+kernelspec:
+ name: python3
+ display_name: Python 3
+---
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+(aggregation)=
+
+# Aggregation
+
+An aggregate or aggregation is a function where the values of multiple rows
are processed together
+to form a single summary value. For performing an aggregation, DataFusion
provides the
+{py:func}`~datafusion.dataframe.DataFrame.aggregate`
+
+```{code-cell} ipython3
+from datafusion import SessionContext, col, lit, functions as f
+
+ctx = SessionContext()
+df = ctx.read_csv("pokemon.csv")
+
+col_type_1 = col('"Type 1"')
+col_type_2 = col('"Type 2"')
+col_speed = col('"Speed"')
+col_attack = col('"Attack"')
+
+df.aggregate([col_type_1], [
+ f.approx_distinct(col_speed).alias("Count"),
+ f.approx_median(col_speed).alias("Median Speed"),
+ f.approx_percentile_cont(col_speed, 0.9).alias("90% Speed")])
+```
+
+When {code}`group_by` is {code}`None` or an empty list, the aggregation is
done over the whole
+{class}`.DataFrame`. For grouping the {code}`group_by` list must contain at
least one column.
+
+```{code-cell} ipython3
+df.aggregate([col_type_1], [
+ f.max(col_speed).alias("Max Speed"),
+ f.avg(col_speed).alias("Avg Speed"),
+ f.min(col_speed).alias("Min Speed")])
+```
+
+More than one column can be used for grouping
+
+```{code-cell} ipython3
+df.aggregate([col_type_1, col_type_2], [
+ f.max(col_speed).alias("Max Speed"),
+ f.avg(col_speed).alias("Avg Speed"),
+ f.min(col_speed).alias("Min Speed")])
+
+
+```
+
+## Setting Parameters
+
+Each of the built in aggregate functions provides arguments for the parameters
that affect their
+operation. These can also be overridden using the builder approach to setting
any of the following
+parameters. When you use the builder, you must call `build()` to finish. For
example, these two
+expressions are equivalent.
+
+```{code-cell} ipython3
+first_1 = f.first_value(col("a"), order_by=[col("a")])
+first_2 = f.first_value(col("a")).order_by(col("a")).build()
+```
+
+### Ordering
+
+You can control the order in which rows are processed by window functions by
providing
+a list of `order_by` functions for the `order_by` parameter. In the following
example, we
+sort the Pokemon by their attack in increasing order and take the first value,
which gives us the
+Pokemon with the smallest attack value in each `Type 1`.
+
+```{code-cell} ipython3
+df.aggregate(
+ [col('"Type 1"')],
+ [f.first_value(
+ col('"Name"'),
+ order_by=[col('"Attack"').sort(ascending=True)]
+ ).alias("Smallest Attack")
+ ])
+```
+
+### Distinct
+
+When you set the parameter `distinct` to `True`, then unique values will only
be evaluated one
+time each. Suppose we want to create an array of all of the `Type 2` for each
`Type 1` of our
+Pokemon set. Since there will be many entries of `Type 2` we only one each
distinct value.
+
+```{code-cell} ipython3
+df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type
2 List")])
+```
+
+In the output of the above we can see that there are some `Type 1` for which
the `Type 2` entry
+is `null`. In reality, we probably want to filter those out. We can do this in
two ways. First,
+we can filter DataFrame rows that have no `Type 2`. If we do this, we might
have some `Type 1`
+entries entirely removed. The second is we can use the `filter` argument
described below.
+
+```{code-cell} ipython3
+df.filter(col_type_2.is_not_null()).aggregate([col_type_1],
[f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])
+
+df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True,
filter=col_type_2.is_not_null()).alias("Type 2 List")])
+```
+
+Which approach you take should depend on your use case.
+
+### Null Treatment
+
+This option allows you to either respect or ignore null values.
+
+One common usage for handling nulls is the case where you want to find the
first value within a
+partition. By setting the null treatment to ignore nulls, we can find the
first non-null value
+in our partition.
+
+```{code-cell} ipython3
+from datafusion.common import NullTreatment
+
+df.aggregate([col_type_1], [
+ f.first_value(
+ col_type_2,
+ order_by=[col_attack],
+ null_treatment=NullTreatment.RESPECT_NULLS
+ ).alias("Lowest Attack Type 2")])
+
+df.aggregate([col_type_1], [
+ f.first_value(
+ col_type_2,
+ order_by=[col_attack],
+ null_treatment=NullTreatment.IGNORE_NULLS
+ ).alias("Lowest Attack Type 2")])
+```
+
+### Filter
+
+Using the filter option is useful for filtering results to include in the
aggregate function. It can
+be seen in the example above on how this can be useful to only filter rows
evaluated by the
+aggregate function without filtering rows from the entire DataFrame.
+
+Filter takes a single expression.
+
+Suppose we want to find the speed values for only Pokemon that have low Attack
values.
+
+```{code-cell} ipython3
+df.aggregate([col_type_1], [
+ f.avg(col_speed).alias("Avg Speed All"),
+ f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low
Attack")])
+
+```
+
+### Comparing subsets within a group
+
+Sometimes you need to compare the full membership of a group against a
+subset that meets some condition — for example, "which groups have at least
+one failure, but not every member failed?". The `filter` argument on an
+aggregate restricts the rows that contribute to *that* aggregate without
+dropping the group, so a single pass can produce both the full set and the
+filtered subset side by side. Pairing
+{py:func}`~datafusion.functions.array_agg` with `distinct=True` and
+`filter=` is a compact way to express this: collect the distinct values
+of the group, collect the distinct values that satisfy the condition, then
+compare the two arrays.
+
+Suppose each row records a line item with the supplier that fulfilled it and
+a flag for whether that supplier met the commit date. We want to identify
+*partially failed* orders — orders where at least one supplier failed but
+not every supplier failed:
+
+```{code-cell} ipython3
+orders_df = ctx.from_pydict(
+ {
+ "order_id": [1, 1, 1, 2, 2, 3, 4, 4],
+ "supplier_id": [100, 101, 102, 200, 201, 300, 400, 401],
+ "failed": [False, True, False, False, False, True, True, True],
+ },
+)
+
+grouped = orders_df.aggregate(
+ [col("order_id")],
+ [
+ f.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"),
+ f.array_agg(
+ col("supplier_id"),
+ filter=col("failed"),
+ distinct=True,
+ ).alias("failed_suppliers"),
+ ],
+)
+
+grouped.filter(
+ (f.array_length(col("failed_suppliers")) > lit(0))
+ & (f.array_length(col("failed_suppliers")) <
f.array_length(col("all_suppliers")))
+).select(col("order_id"), col("failed_suppliers"))
+```
+
+Order 1 is partial (one of three suppliers failed). Order 2 is excluded
+because no supplier failed, order 3 because its only supplier failed, and
+order 4 because both of its suppliers failed.
+
+## Grouping Sets
+
+The default style of aggregation produces one row per group. Sometimes you
want a single query to
+produce rows at multiple levels of detail — for example, totals per type *and*
an overall grand
+total, or subtotals for every combination of two columns plus the individual
column totals. Writing
+separate queries and concatenating them is tedious and runs the data multiple
times. Grouping sets
+solve this by letting you specify several grouping levels in one pass.
+
+DataFusion supports three grouping set styles through the
+{py:class}`~datafusion.expr.GroupingSet` class:
+
+- {py:meth}`~datafusion.expr.GroupingSet.rollup` — hierarchical subtotals,
like a drill-down report
+- {py:meth}`~datafusion.expr.GroupingSet.cube` — every possible subtotal
combination, like a pivot table
+- {py:meth}`~datafusion.expr.GroupingSet.grouping_sets` — explicitly list
exactly which grouping levels you want
+
+Because result rows come from different grouping levels, a column that is
*not* part of a
+particular level will be `null` in that row. Use
{py:func}`~datafusion.functions.grouping` to
+distinguish a real `null` in the data from one that means "this column was
aggregated across."
+It returns `0` when the column is a grouping key for that row, and `1` when it
is not.
+
+### Rollup
+
+{py:meth}`~datafusion.expr.GroupingSet.rollup` creates a hierarchy. `rollup(a,
b)` produces
+grouping sets `(a, b)`, `(a)`, and `()` — like nested subtotals in a report.
This is useful
+when your columns have a natural hierarchy, such as region → city or type →
subtype.
+
+Suppose we want to summarize Pokemon stats by `Type 1` with subtotals and a
grand total. With
+the default aggregation style we would need two separate queries. With
`rollup` we get it all at
+once:
+
+```{code-cell} ipython3
+from datafusion.expr import GroupingSet
+
+df.aggregate(
+ [GroupingSet.rollup(col_type_1)],
+ [f.count(col_speed).alias("Count"),
+ f.avg(col_speed).alias("Avg Speed"),
+ f.max(col_speed).alias("Max Speed")]
+).sort(col_type_1.sort(ascending=True, nulls_first=True))
+```
+
+The first row — where `Type 1` is `null` — is the grand total across all
types. But how do you
+tell a grand-total `null` apart from a Pokemon that genuinely has no type? The
+{py:func}`~datafusion.functions.grouping` function returns `0` when the column
is a grouping key
+for that row and `1` when it is aggregated across.
+
+Apply `.alias()` to the `grouping()` expression to give the column a readable
name:
+
+```{code-cell} ipython3
+result = df.aggregate(
+ [GroupingSet.rollup(col_type_1)],
+ [f.count(col_speed).alias("Count"),
+ f.avg(col_speed).alias("Avg Speed"),
+ f.grouping(col_type_1).alias("Is Total")]
+)
+result.sort(col_type_1.sort(ascending=True, nulls_first=True))
+```
Review Comment:
This *is* a substantive difference from the prior work. The issue in
https://github.com/apache/datafusion/issues/21411 has been resolved and
verified in 54.0.0 so I removed the old warning about grouping sets with
aliases. You can see the old text in this section:
https://datafusion.apache.org/python/user-guide/common-operations/aggregations.html#rollup
##########
docs/build.sh:
##########
@@ -36,6 +36,21 @@ rm -rf build 2> /dev/null
rm -rf temp 2> /dev/null
mkdir temp
cp -rf source/* temp/
+
+# myst-nb executes each page as a notebook from the directory that page
+# lives in, so the example data files must sit alongside every page that
+# loads them by relative name (e.g. `ctx.read_csv("pokemon.csv")`). Symlink
+# them into each directory that has such a page rather than copying the
+# 20 MB parquet repeatedly.
+for d in temp temp/user-guide temp/user-guide/common-operations; do
+ ln -sf "$script_dir/pokemon.csv" "$d/pokemon.csv"
+ ln -sf "$script_dir/yellow_tripdata_2021-01.parquet"
"$d/yellow_tripdata_2021-01.parquet"
+done
Review Comment:
Description above explains why these changes were added to build steps.
##########
docs/source/_static/theme_overrides.css:
##########
@@ -63,3 +63,48 @@ html[data-theme="dark"] .table tbody tr:nth-of-type(odd) {
white-space: normal !important;
}
}
+
+
+/* Hideable right-hand "On this page" sidebar.
+ * toc-toggle.js adds the button and toggles `pst-secondary-hidden` on <body>;
+ * hiding the sidebar lets the flex article container reclaim the width. */
+
+body.pst-secondary-hidden .bd-sidebar-secondary {
+ display: none;
+}
Review Comment:
The code added to this section allows you to hide away the right hand table
of contents so you get a bigger view of the site content. There is also the
corresponding work in `toc-toggle.js`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]