Copilot commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093205802


##########
benchmarks/bench.sh:
##########
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
     debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted 
--iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path 
"${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} 
${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+    INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+    if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 
2>/dev/null)" ]; then
+        echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+        return
+    fi
+
+    echo "Generating sort pushdown Inexact benchmark data (single file, 
shuffled RGs)..."
+
+    # Re-use the sort_pushdown data as the source (generate if missing)
+    data_sort_pushdown
+
+    mkdir -p "${INEXACT_DIR}"
+    SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+    # Use datafusion-cli to bucket rows into 64 groups by a deterministic
+    # scrambler, then sort within each bucket by orderkey. This produces
+    # ~64 RG-sized segments where each has a tight orderkey range but the
+    # segments appear in scrambled (non-sorted) order in the file.
+    (cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "

Review Comment:
   This uses `cargo run --release -p datafusion-cli` directly, which bypasses 
the script’s `CARGO_COMMAND` override (often used to add 
flags/features/profiles) and also skips the `debug_run` wrapper used elsewhere 
for reproducibility. Consider invoking datafusion-cli via the same configurable 
mechanism (e.g., reusing `CARGO_COMMAND` or building the CLI once and executing 
the binary) so benchmark data generation behaves consistently across 
environments.
   ```suggestion
       (cd "${SCRIPT_DIR}/.." && debug_run ${CARGO_COMMAND} -p datafusion-cli 
-- -c "
   ```



##########
benchmarks/bench.sh:
##########
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
     debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted 
--iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path 
"${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} 
${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+    INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+    if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 
2>/dev/null)" ]; then
+        echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+        return
+    fi
+
+    echo "Generating sort pushdown Inexact benchmark data (single file, 
shuffled RGs)..."
+
+    # Re-use the sort_pushdown data as the source (generate if missing)
+    data_sort_pushdown
+
+    mkdir -p "${INEXACT_DIR}"
+    SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+    # Use datafusion-cli to bucket rows into 64 groups by a deterministic
+    # scrambler, then sort within each bucket by orderkey. This produces
+    # ~64 RG-sized segments where each has a tight orderkey range but the
+    # segments appear in scrambled (non-sorted) order in the file.
+    (cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+        CREATE EXTERNAL TABLE src
+        STORED AS PARQUET
+        LOCATION '${SRC_DIR}';
+
+        COPY (
+            SELECT * FROM src
+            ORDER BY
+                (l_orderkey * 1664525 + 1013904223) % 64,
+                l_orderkey
+        )
+        TO '${INEXACT_DIR}/shuffled.parquet'
+        STORED AS PARQUET
+        OPTIONS ('format.max_row_group_size' '100000');
+    ")
+
+    echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+    ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+    INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+    RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+    echo "Running sort pushdown Inexact benchmark (row group reorder by 
statistics)..."
+    debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted 
--iterations 5 --path "${INEXACT_DIR}" --queries-path 
"${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} 
${LATENCY_ARG}

Review Comment:
   `run_sort_pushdown_inexact` passes `--sorted`, which tells DataFusion the 
file is ordered by `l_orderkey` (via WITH ORDER). However, the generated 
`shuffled.parquet` is intentionally not globally sorted by `l_orderkey` (row 
groups are arranged by the bucket key). This benchmark therefore relies on the 
planner staying on an Inexact path (keeping TopK/Sort for correctness). Please 
document this assumption explicitly and/or add a guard to ensure the benchmark 
can’t silently become incorrect if future optimizations start treating the 
reversed scan as Exact and eliminate the ordering operator.



##########
benchmarks/bench.sh:
##########
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
     debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted 
--iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path 
"${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} 
${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.

Review Comment:
   The data generator comment says row groups have “NON-OVERLAPPING but 
OUT-OF-ORDER l_orderkey ranges” and that bucketing produces “tight orderkey 
range” segments. With the current ORDER BY `(l_orderkey * 1664525 + 1013904223) 
% 64, l_orderkey`, the bucket expression reduces to an affine permutation of 
`l_orderkey % 64` (1664525 is odd, hence coprime with 64), so each bucket 
collects keys from every part of the full key range. Row-group min/max 
statistics will therefore heavily overlap, rather than being 
non-overlapping/tight. Please update the comment to describe the intended 
property accurately (e.g., overlapping/out-of-order row-group statistics that 
force the Inexact path).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to