zhuqi-lucas commented on PR #21711:
URL: https://github.com/apache/datafusion/pull/21711#issuecomment-4293415231
Addressing review comments:
**@alamb re: documentation** — Will update the benchmarks README to document
the new q5-q8 DESC LIMIT queries.
**@alamb re: datafusion-cli vs pyarrow** — I tried using pure datafusion-cli
initially, but DataFusion's `COPY` writes rows sequentially. When two adjacent
chunks cover different `l_orderkey` ranges, a row group (RG) that spans their
boundary contains rows from both, which widens its min/max range to ~6M instead
of ~100K. That defeats `reorder_by_statistics`. Calling pyarrow's
`ParquetWriter.write_table()` once per RG is the only way I found to get
narrow-range RGs in scrambled order. Happy to add a small Rust helper instead
if the Python dependency is a concern.
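To illustrate the range-widening effect, here is a toy model in plain Python (no Parquet involved). The chunk count, chunk size, and row group size are made-up numbers for illustration, and the helper names are hypothetical; it only shows why a row group that spans a chunk boundary ends up with a wider key range than a row group written per chunk.

```python
import random


def make_scrambled_chunks(num_chunks, chunk_rows, seed=0):
    """Split sorted keys into contiguous chunks, then shuffle the chunk order."""
    chunks = [
        list(range(i * chunk_rows, (i + 1) * chunk_rows))
        for i in range(num_chunks)
    ]
    random.Random(seed).shuffle(chunks)
    return chunks


def ranges_fixed_size(chunks, rg_rows):
    """Model a sequential writer: cut a row group every rg_rows rows."""
    rows = [key for chunk in chunks for key in chunk]
    groups = [rows[i:i + rg_rows] for i in range(0, len(rows), rg_rows)]
    return [max(g) - min(g) for g in groups]


def ranges_per_chunk(chunks):
    """Model one write call per chunk: each chunk becomes its own row group."""
    return [max(chunk) - min(chunk) for chunk in chunks]


# Illustrative sizes only: 60 chunks of 100 keys, row groups of 150 rows,
# so every fixed-size row group contains rows from two different chunks.
chunks = make_scrambled_chunks(num_chunks=60, chunk_rows=100)
fixed = ranges_fixed_size(chunks, rg_rows=150)
per_chunk = ranges_per_chunk(chunks)

print("widest per-chunk RG range:", max(per_chunk))
print("widest fixed-size RG range:", max(fixed))
```

In this model every per-chunk row group keeps the narrow range of its chunk, while every fixed-size row group mixes two chunks and gets a wider range, which is the situation the per-RG `write_table()` calls are meant to avoid.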
**@alamb re: pyarrow error** — Will add a check before the python block:
```bash
if ! python3 -c "import pyarrow" 2>/dev/null; then
    # Fail early with an actionable message on stderr
    echo "Error: pyarrow is required. Install with: pip install pyarrow" >&2
    exit 1
fi
```
**Copilot comments** — Both were about the old ORDER BY + jitter approach
that has since been completely replaced by the pyarrow split approach.