kosiew commented on code in PR #21401:
URL: https://github.com/apache/datafusion/pull/21401#discussion_r3136337825
##########
benchmarks/bench.sh:
##########
Review Comment:
The skip check only looks for `q1.out`, but the new implementation downloads
the 22 answer files one at a time. If the loop fails after writing `q1.out`,
for example due to a network hiccup, rate limit, or interrupted run, the next
invocation will treat the answers as complete and skip the remaining files.
The previous Docker copy was effectively all or nothing, so this introduces
a partial-download correctness issue. Could we either validate all expected
files before skipping, or download into a temporary directory and move it into
place only after everything succeeds?
##########
benchmarks/bench.sh:
##########
@@ -612,7 +612,11 @@ data_tpch() {
else
echo " Copying answers to ${TPCH_DIR}/answers"
mkdir -p "${TPCH_DIR}/answers"
- docker run -v "${TPCH_DIR}":/data -it --entrypoint /bin/bash --rm
ghcr.io/scalytics/tpch-docker:main -c "cp -f
/opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"
+
BASE_GH_URL="https://raw.githubusercontent.com/scalytics/TPCH-Docker/refs/heads/main/data/tpch/2.18.0_rc2/dbgen/answers"
Review Comment:
Small maintainability thought: `bench.sh` now has several download call
sites with slightly different retry and tool-selection behavior. It might be
worth extracting a small helper for downloading a URL to a file with retries,
so this stays consistent with the other benchmark data fetch paths and future
source changes are easier.
##########
benchmarks/bench.sh:
##########
@@ -612,7 +612,11 @@ data_tpch() {
else
echo " Copying answers to ${TPCH_DIR}/answers"
mkdir -p "${TPCH_DIR}/answers"
- docker run -v "${TPCH_DIR}":/data -it --entrypoint /bin/bash --rm
ghcr.io/scalytics/tpch-docker:main -c "cp -f
/opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"
+
BASE_GH_URL="https://raw.githubusercontent.com/scalytics/TPCH-Docker/refs/heads/main/data/tpch/2.18.0_rc2/dbgen/answers"
+ for i in $(seq 1 22); do
+ OUT_FILE="q${i}.out"
+ wget -q --timeout=30 --tries=3 -O
"${TPCH_DIR}/answers/${OUT_FILE}" "${BASE_GH_URL}/${OUT_FILE}"
Review Comment:
This swaps the Docker dependency for a hard `wget` dependency, but `wget` is
not available by default on macOS. Since the goal here is to make TPC-H
downloads easier to use without extra tooling, this would still fail on a
common developer setup.
Could we use a `curl` or `wget` fallback here, similar to
`benchmarks/queries/clickbench/update_queries.sh`, so this path works out of
the box for more contributors?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]