timsaucer opened a new pull request, #1504:
URL: https://github.com/apache/datafusion-python/pull/1504
## Which issue does this PR close?
Relates to #1394.
## Rationale for this change
The TPC-H examples under `examples/tpch/` serve as the canonical hands-on
reference for how to write DataFusion Python DataFrame code. Before this PR:
- Q20 had a bug where a filter was computed and discarded (`df.filter(...)`
without assignment).
- Several queries used non-idiomatic constructs (switched CASE on boolean
subjects, `array_position(make_array(...))` in place of `in_list`,
0-based substring tricks, a pyarrow UDF re-implementing a disjunctive
predicate, `aggregate([col], [])` in place of `distinct()`, etc.).
- The reference SQL was not embedded in the files, so readers had to
cross-reference `benchmarks/tpch/queries/` to see the intended query.
- Where reference SQL was embedded, it used different TPC-H substitution
parameters than the DataFrame code, so the two expressions described
different queries.
## What changes are included in this PR?
Four commits, grouped by concern:
1. **`tpch examples: add reference SQL to each query, fix Q20`** — append
the canonical TPC-H reference SQL to each `q01..q22` module docstring;
fix the missing assignment on Q20's excess-quantity filter.
2. **`tpch examples: rewrite non-idiomatic queries in idiomatic DataFrame
form`** — rewrite Q04, Q07, Q08, Q12, Q19, Q20, Q21 using the
DataFrame-native pattern (semi/anti joins for EXISTS/NOT EXISTS,
searched `F.when` for `CASE WHEN`, `F.in_list` for `IN`, compound
predicates in place of a pyarrow UDF, etc.).
3. **`tpch examples: align reference SQL constants with DataFrame
queries`** — update the embedded SQL in 21 of 22 docstrings so the
substitution parameters match the DataFrame code (which is validated
at scale factor 1 against `answers_sf1/`). Interval units (month,
year) are preserved where the problem-statement text reads "quarter",
"year", or "month".
4. **`tpch examples: apply SKILL.md idioms across all 22 queries`** —
sweep all 22 queries for `SKILL.md` idioms: auto-wrap on comparison
RHS, plain-name group/sort keys, drop `how="inner"`, collapse chained
`.filter()` calls, `F.count_star()` for SQL `count(*)`,
`F.starts_with` / `F.in_list` / searched `F.when`. Q16 also picks up
the secondary sort keys (`p_brand`, `p_type`, `p_size`) that the
TPC-H spec requires but the original DataFrame omitted.
All 22 answer-file comparisons under `examples/tpch/_tests.py` pass.
## Are there any user-facing changes?
No public API changes. The `examples/tpch/` directory is a teaching aid
shipped in the source tree, not in the wheel, so the visible effect is
limited to readers of the examples.