kosiew commented on code in PR #21401:
URL: https://github.com/apache/datafusion/pull/21401#discussion_r3136337825


##########
benchmarks/bench.sh:
##########


Review Comment:
   The skip check only looks for `q1.out`, but the new implementation downloads 
the 22 answer files one at a time. If the loop fails after writing `q1.out`, 
for example due to a network hiccup, rate limit, or interrupted run, the next 
invocation will treat the answers as complete and skip the remaining files.
   
   The previous Docker copy was effectively all or nothing, so this introduces 
a partial-download correctness issue. Could we either validate all expected 
files before skipping, or download into a temporary directory and move it into 
place only after everything succeeds?



##########
benchmarks/bench.sh:
##########
@@ -612,7 +612,11 @@ data_tpch() {
     else
         echo " Copying answers to ${TPCH_DIR}/answers"
         mkdir -p "${TPCH_DIR}/answers"
-        docker run -v "${TPCH_DIR}":/data -it --entrypoint /bin/bash --rm 
ghcr.io/scalytics/tpch-docker:main  -c "cp -f 
/opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"
+        
BASE_GH_URL="https://raw.githubusercontent.com/scalytics/TPCH-Docker/refs/heads/main/data/tpch/2.18.0_rc2/dbgen/answers";

Review Comment:
   Small maintainability thought: `bench.sh` now has several download call 
sites with slightly different retry and tool-selection behavior. It might be 
worth extracting a small helper for downloading a URL to a file with retries, 
so this stays consistent with the other benchmark data fetch paths and future 
source changes are easier.



##########
benchmarks/bench.sh:
##########
@@ -612,7 +612,11 @@ data_tpch() {
     else
         echo " Copying answers to ${TPCH_DIR}/answers"
         mkdir -p "${TPCH_DIR}/answers"
-        docker run -v "${TPCH_DIR}":/data -it --entrypoint /bin/bash --rm 
ghcr.io/scalytics/tpch-docker:main  -c "cp -f 
/opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"
+        
BASE_GH_URL="https://raw.githubusercontent.com/scalytics/TPCH-Docker/refs/heads/main/data/tpch/2.18.0_rc2/dbgen/answers";
+        for i in $(seq 1 22); do
+            OUT_FILE="q${i}.out"
+            wget -q --timeout=30 --tries=3 -O 
"${TPCH_DIR}/answers/${OUT_FILE}" "${BASE_GH_URL}/${OUT_FILE}"

Review Comment:
   This swaps the Docker dependency for a hard `wget` dependency, but `wget` is 
not available by default on macOS. Since the goal here is to make TPC-H 
downloads easier to use without extra tooling, this would still fail on a 
common developer setup.
   
   Could we use a `curl` or `wget` fallback here, similar to 
`benchmarks/queries/clickbench/update_queries.sh`, so this path works out of 
the box for more contributors?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to