zhuqi-lucas commented on PR #21711:
URL: https://github.com/apache/datafusion/pull/21711#issuecomment-4293415231

   Addressing review comments:
   
   **@alamb re: documentation** — Will update the benchmarks README to document 
the new q5-q8 DESC LIMIT queries.
   
   **@alamb re: datafusion-cli vs pyarrow** — I tried using pure datafusion-cli 
initially, but DataFusion's COPY writes rows sequentially, so row-group (RG) 
boundaries do not line up with chunk boundaries. An RG that straddles two 
adjacent chunks with different l_orderkey ranges holds rows from both, which 
widens its min/max range to ~6M instead of ~100K and defeats 
`reorder_by_statistics`. Calling pyarrow's `ParquetWriter.write_table()` once 
per RG is the only way I found to get narrow-range RGs in scrambled order. 
Happy to add a small Rust helper instead if the Python dependency is a concern.
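
   To illustrate the widening effect, here is a toy simulation in plain Python. 
It is only a sketch, not the benchmark script: the chunk count, chunk size, and 
row-group size are made-up values, and it assumes the sequential writer cuts 
row groups at a fixed row count that does not align with chunk boundaries.

```python
import random


def row_group_ranges(keys, rows_per_group):
    """Return the (min, max) key of each consecutive slice of rows_per_group rows."""
    return [
        (min(keys[i:i + rows_per_group]), max(keys[i:i + rows_per_group]))
        for i in range(0, len(keys), rows_per_group)
    ]


def widths(ranges):
    """Width (max - min) of each (min, max) range."""
    return [hi - lo for lo, hi in ranges]


# Hypothetical toy sizes; the real data set is far larger.
NUM_CHUNKS = 8
CHUNK_ROWS = 100
ROWS_PER_GROUP = 150  # assumed not to align with CHUNK_ROWS

# Each chunk covers a narrow, disjoint key range; chunks are then scrambled.
chunks = [list(range(c * CHUNK_ROWS, (c + 1) * CHUNK_ROWS)) for c in range(NUM_CHUNKS)]
random.Random(0).shuffle(chunks)

# Sequential writer: rows are concatenated and cut every ROWS_PER_GROUP rows,
# so most row groups contain rows from two different chunks.
flat = [key for chunk in chunks for key in chunk]
sequential = row_group_ranges(flat, ROWS_PER_GROUP)

# Per-chunk writer: exactly one row group per chunk.
per_chunk = [(min(chunk), max(chunk)) for chunk in chunks]

print("sequential widths:", widths(sequential))
print("per-chunk widths: ", widths(per_chunk))
```

   Every per-chunk row group keeps a width of `CHUNK_ROWS - 1`, while any 
sequential row group that spans two chunks is strictly wider, and much wider 
when the two chunks are far apart in key space.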
   
   **@alamb re: pyarrow error** — Will add a check before the python block:
   ```bash
   if ! python3 -c "import pyarrow" 2>/dev/null; then
       echo "Error: pyarrow is required. Install with: pip install pyarrow" >&2
       exit 1
   fi
   ```
   
   **Copilot comments** — Both were about the old ORDER BY + jitter approach, 
which has been completely replaced by the pyarrow split approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

