alamb commented on PR #20823:
URL: https://github.com/apache/datafusion/pull/20823#issuecomment-4186890554

   > No, I'm having trouble coming up with a realistic benchmark.
   > 
   > The previous benchmark
   > https://github.com/apache/datafusion/pull/19687/changes#diff-5358b38b6265d769b66b614f7ba88ed9320f7a9fce5197330b7c01c2a8a3ed3b
   > incorrectly assumes that all the requested bytes (via get_opts) will be read,
   > while you can actually request a 10GiB stream of bytes and read only 16KiB from it.
   > 
   > As a result, the benchmark of the previous PR for reducing the read
   > amplification shows impressive improvements, but it hides the fact that it
   > breaks the parallelization between data fetching and json decoding (by doing
   > all the data fetching in the JsonOpener instead of allowing FileStream to do
   > its magic).
   
   Yes, this is the kind of thing I am worried about -- that there is some code 
churn but it does not 
   
   > 
   > * I'm increasing performance (because there are no more read requests in the JsonOpener)
   > * This optimization is relevant for real-world object store implementations (where network latency matters, network speed matters, data computation can happen while waiting for bytes to be read, read-ahead is a relevant optimization etc.)
   
   One thing we could potentially use is this change from @Dandandan that can 
simulate high-latency stores:
   - https://github.com/apache/datafusion/pull/20954
   
   Another thing we could potentially do is to use the ClickHouse benchmark 
dataset:
   https://datasets.clickhouse.com/hits_compatible/hits.json.gz
   
   And put it somewhere on an object store 🤔 and show the performance improvement
   
   ```shell
   wget https://datasets.clickhouse.com/hits_compatible/hits.json.gz
   gunzip hits.json.gz
   ```
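
   The full file is large, so a smaller slice may be handier while iterating on a benchmark. A minimal sketch, assuming the decompressed `hits.json` is newline-delimited JSON (one object per line); `sample_ndjson` and the output name `hits_1m.json` are hypothetical and not part of DataFusion:

   ```shell
   #!/bin/sh
   # Hypothetical helper (not part of DataFusion): copy the first N lines of a
   # newline-delimited JSON file into a smaller file and report its size.
   # Assumes one JSON object per line, so a line prefix is still valid input.
   sample_ndjson() {
       src="$1"    # input NDJSON file
       rows="$2"   # number of lines (rows) to keep
       dst="$3"    # output file
       head -n "$rows" "$src" > "$dst"
       # tr strips the padding that some wc implementations add
       printf '%s rows, %s bytes\n' \
           "$(wc -l < "$dst" | tr -d ' ')" \
           "$(wc -c < "$dst" | tr -d ' ')"
   }

   # Only runs when the dataset has already been downloaded and decompressed.
   if [ -f hits.json ]; then
       sample_ndjson hits.json 1000000 hits_1m.json
   fi
   ```

   The sample could then be uploaded to whatever object store is used for the comparison; the row count here is arbitrary.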
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

