timsaucer opened a new pull request, #1504:
URL: https://github.com/apache/datafusion-python/pull/1504

   ## Which issue does this PR close?
   
   Relates to #1394.
   
   ## Rationale for this change
   
   The TPC-H examples under `examples/tpch/` serve as the canonical hands-on
   reference for how to write DataFusion Python DataFrame code. Before this PR:
   
   - Q20 had a bug where a filter was computed and discarded (`df.filter(...)`
     without assignment).
   - Several queries used non-idiomatic constructs (switched CASE on boolean
     subjects, `array_position(make_array(...))` in place of `in_list`,
     0-based substring tricks, a pyarrow UDF re-implementing a disjunctive
     predicate, `aggregate([col], [])` in place of `distinct()`, etc.).
   - The reference SQL was not embedded in the files, so readers had to
     cross-reference `benchmarks/tpch/queries/` to see the intended query.
   - Where reference SQL was embedded, it used different TPC-H substitution
     parameters than the DataFrame code, so the two expressions described
     different queries.
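The Q20 bug in the first bullet follows from DataFrames being immutable: `filter` returns a new frame and leaves the receiver untouched, so a call whose result is not assigned has no effect. A minimal sketch of that failure mode, using a hypothetical pure-Python `Frame` class as a stand-in (this is not the DataFusion API, and the quantities are made up):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical immutable stand-in for a DataFrame (not the DataFusion API)."""

    rows: tuple

    def filter(self, predicate):
        # Build and return a new Frame; `self` is never modified.
        return Frame(tuple(r for r in self.rows if predicate(r)))


df = Frame(rows=(10, 60, 75))

# Buggy pattern: the filtered frame is computed and then thrown away.
df.filter(lambda qty: qty > 50)
buggy_rows = df.rows  # still all three rows

# Fixed pattern: rebind the name to the filtered frame.
df = df.filter(lambda qty: qty > 50)
fixed_rows = df.rows  # only the rows that pass the predicate
```

Because the buggy call raises no error, the mistake only shows up when the query result is compared against a known answer.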
   
   ## What changes are included in this PR?
   
   Four commits, grouped by concern:
   
   1. **`tpch examples: add reference SQL to each query, fix Q20`** — append
      the canonical TPC-H reference SQL to each `q01..q22` module docstring;
      fix the missing assignment on Q20's excess-quantity filter.
   2. **`tpch examples: rewrite non-idiomatic queries in idiomatic DataFrame
      form`** — rewrite Q04, Q07, Q08, Q12, Q19, Q20, Q21 using the
      DataFrame-native patterns (semi/anti joins for EXISTS/NOT EXISTS,
      searched `F.when` for `CASE WHEN`, `F.in_list` for `IN`, compound
      predicates in place of a pyarrow UDF, etc.).
   3. **`tpch examples: align reference SQL constants with DataFrame
      queries`** — update the embedded SQL in 21 of 22 docstrings so the
      substitution parameters match the DataFrame code (which is validated
      at scale factor 1 against `answers_sf1/`). Interval units (month,
      year) are preserved where the problem-statement text reads "quarter",
      "year", or "month".
   4. **`tpch examples: apply SKILL.md idioms across all 22 queries`** —
      sweep all 22 queries for `SKILL.md` idioms: auto-wrap on comparison
      RHS, plain-name group/sort keys, drop `how="inner"`, collapse chained
      `.filter()` calls, `F.count_star()` for SQL `count(*)`,
      `F.starts_with` / `F.in_list` / searched `F.when`. Q16 also picks up
      the secondary sort keys (`p_brand`, `p_type`, `p_size`) that the
      TPC-H spec requires but the original DataFrame omitted.
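The second commit maps `EXISTS` and `NOT EXISTS` subqueries onto semi and anti joins. The semantics can be sketched in plain Python. The helper names `semi_join` and `anti_join` and the sample rows are hypothetical, the sketch ignores NULL handling, and it says nothing about how DataFusion executes these joins:

```python
# Illustrative rows only; column names follow TPC-H conventions.
orders = [{"o_orderkey": 1}, {"o_orderkey": 2}, {"o_orderkey": 3}]
lineitem = [{"l_orderkey": 1}, {"l_orderkey": 1}, {"l_orderkey": 3}]


def semi_join(left, right, left_key, right_key):
    """Keep each left row that has at least one match in right (EXISTS)."""
    keys = {r[right_key] for r in right}
    # Each left row appears at most once, however many right rows match.
    return [row for row in left if row[left_key] in keys]


def anti_join(left, right, left_key, right_key):
    """Keep each left row that has no match in right (NOT EXISTS)."""
    keys = {r[right_key] for r in right}
    return [row for row in left if row[left_key] not in keys]


with_items = semi_join(orders, lineitem, "o_orderkey", "l_orderkey")
without_items = anti_join(orders, lineitem, "o_orderkey", "l_orderkey")

semi_keys = [row["o_orderkey"] for row in with_items]
anti_keys = [row["o_orderkey"] for row in without_items]
```

Unlike an inner join, the semi join does not duplicate order 1 even though two line items reference it, and neither join adds columns from the right side. That is why they line up with `EXISTS` and `NOT EXISTS`.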
   
   All 22 answer-file comparisons under `examples/tpch/_tests.py` pass.
   
   ## Are there any user-facing changes?
   
   No public API changes. The `examples/tpch/` directory is a teaching aid
   shipped in the source tree, not in the wheel, so the visible effect is
   limited to readers of the examples.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

