adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093307815
##########
benchmarks/bench.sh:
##########
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
     debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+ INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+    if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+ echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+ return
+ fi
+
+    echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+ # Re-use the sort_pushdown data as the source (generate if missing)
+ data_sort_pushdown
+
+ mkdir -p "${INEXACT_DIR}"
+ SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+    # Use datafusion-cli to split the keyspace into 64 contiguous l_orderkey
+    # blocks (NTILE over the sorted key), scramble the block order with a
+    # deterministic affine map (odd multiplier, hence a bijection mod 64),
+    # and sort within each block by orderkey. This produces ~64 RG-sized
+    # segments where each has a tight orderkey range but the segments appear
+    # in scrambled (non-sorted) order in the file. Note: scrambling raw
+    # orderkeys instead of block indices would put every 64th key in each
+    # segment, making every segment span the full key range.
+    (cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+    CREATE EXTERNAL TABLE src
+    STORED AS PARQUET
+    LOCATION '${SRC_DIR}';
+
+    COPY (
+      SELECT * EXCEPT (bucket) FROM (
+        SELECT *,
+          (NTILE(64) OVER (ORDER BY l_orderkey) * 1664525 + 1013904223) % 64
+            AS bucket
+        FROM src
+      )
+      ORDER BY bucket, l_orderkey
+    )
+    TO '${INEXACT_DIR}/shuffled.parquet'
+    STORED AS PARQUET
+    OPTIONS ('format.max_row_group_size' '100000');
+    ")
+
+ echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+ INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+ RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+    echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+    debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
Review Comment:
   I'm not sure what the `--sorted` flag is supposed to do here, but it's worth checking with @zhuqi-lucas
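   As a side note, the intended file layout (contiguous key blocks, internally sorted, shuffled block order) is easy to sanity-check offline. A minimal sketch in Python, using a dense integer key as a hypothetical stand-in for `l_orderkey` (the key range and block size below are illustrative, not the actual lineitem data):
   ```python
   # Simulate the intended layout: 64 contiguous key blocks, block order
   # scrambled by an affine map (odd multiplier => bijection mod 64),
   # rows within each block kept in key order.
   N_BUCKETS = 64
   keys = list(range(64_000))          # hypothetical stand-in for l_orderkey
   BLOCK = len(keys) // N_BUCKETS      # 1000 keys per block

   def bucket(k):
       block = k // BLOCK              # contiguous block the key falls in
       return (block * 1664525 + 1013904223) % N_BUCKETS

   rows = sorted(keys, key=lambda k: (bucket(k), k))   # file order

   # One segment per future row group; (min, max) stats per segment
   segments = [rows[i * BLOCK:(i + 1) * BLOCK] for i in range(N_BUCKETS)]
   stats = [(seg[0], seg[-1]) for seg in segments]

   # Each segment covers a tight, non-overlapping key range...
   assert all(hi - lo == BLOCK - 1 for lo, hi in stats)
   assert len({lo for lo, _ in stats}) == N_BUCKETS
   # ...but the segments appear out of key order in the file.
   assert stats != sorted(stats)
   ```
   The same check fails if the affine map is applied to raw keys instead of block indices, since buckets then become residue classes mod 64 and every segment spans the whole key range.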
##########
benchmarks/bench.sh:
##########
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
     debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+ INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+    if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+ echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+ return
+ fi
+
+    echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+ # Re-use the sort_pushdown data as the source (generate if missing)
+ data_sort_pushdown
+
+ mkdir -p "${INEXACT_DIR}"
+ SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+    # Use datafusion-cli to split the keyspace into 64 contiguous l_orderkey
+    # blocks (NTILE over the sorted key), scramble the block order with a
+    # deterministic affine map, and sort within each block by orderkey. This
+    # produces ~64 RG-sized segments where each has a tight orderkey range
+    # but the segments appear in scrambled (non-sorted) order in the file.
Review Comment:
   I think another interesting benchmark, which matches our use case at least, is when there is overlap between the row groups but still a general ordering. E.g.:
   ```
   rg,min,max
   0,0,5
   1,4,7
   2,5,32
   ```
   In our case this happens because there is a stream of timestamped data coming in, and each record should arrive at around the same time it was created, but there are always some network delays, clock skew, etc. that mean the ordering is not perfect. However, data that arrived 30 minutes later is guaranteed to have timestamps in a different range than data that arrived 30 minutes before it.
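   That overlapping-but-generally-ordered pattern can also be generated deterministically for a benchmark. A minimal sketch, where a fixed jitter function stands in for random network delay (the sizes and jitter shape are hypothetical):
   ```python
   # Model a timestamp stream written in arrival order, where bounded jitter
   # (network delay, clock skew) makes adjacent row groups overlap slightly,
   # while row groups further apart never overlap.
   RG_SIZE = 100

   def timestamp(i):
       jitter = ((i * 7) % 10) * 0.5   # deterministic "delay", 0.0 .. 4.5
       return i + jitter               # creation time i, skewed recorded value

   ts = [timestamp(i) for i in range(1000)]            # file order = arrival order
   row_groups = [ts[i:i + RG_SIZE] for i in range(0, len(ts), RG_SIZE)]
   stats = [(min(rg), max(rg)) for rg in row_groups]   # per-RG (min, max)

   # The file is not globally sorted...
   assert ts != sorted(ts)
   # ...adjacent row groups overlap at the edges...
   assert all(stats[i][1] > stats[i + 1][0] for i in range(len(stats) - 1))
   # ...but row groups two apart never overlap, since jitter << RG time span.
   assert all(stats[i][1] < stats[i + 2][0] for i in range(len(stats) - 2))
   ```
   A benchmark over this layout would exercise the case where row-group statistics give a useful but inexact ordering, rather than either perfectly disjoint or fully scrambled ranges.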
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]