alamb commented on PR #20823: URL: https://github.com/apache/datafusion/pull/20823#issuecomment-4186890554
> No, I'm having troubles coming up with a realistic benchmark.
>
> The previous benchmark https://github.com/apache/datafusion/pull/19687/changes#diff-5358b38b6265d769b66b614f7ba88ed9320f7a9fce5197330b7c01c2a8a3ed3b incorrectly assumes that all the requested bytes (via get_opts) will be read, while you can actually request a 10GiB stream of bytes and read only 16KiB from it.
>
> As a result, the benchmark of the previous PR for reducing the read amplification shows impressive improvements, but it hides the fact that it breaks the parallelization between data fetching and json decoding (by doing all the data fetching in the JsonOpener instead of allowing FileStream to do its magic).

Yes, this is the kind of thing I am worried about -- that there is some code churn but it does not

> * I'm increasing performance (because there are no more read requests in the JsonOpener)
> * This optimization is relevant for real-world object store implementations (where network latency matters, network speed matters, data computation can happen while waiting for bytes to be read, read-ahead is a relevant optimization etc.)

One thing we could potentially use is this change from @Dandandan that can simulate high latency stores - https://github.com/apache/datafusion/pull/20954

Another thing we could potentially do is to use the ClickHouse benchmark dataset: https://datasets.clickhouse.com/hits_compatible/hits.json.gz

And put it somewhere on object store 🤔 and show the performance improvement

```shell
wget https://datasets.clickhouse.com/hits_compatible/hits.json.gz
gunzip hits.json.gz
```
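To make the quoted point about requested versus consumed bytes concrete, here is a minimal std-only sketch. It does not use the `object_store` API: `LazyRange` and `requested_vs_consumed` are hypothetical names invented for illustration, standing in for a lazily streamed range response. A benchmark that counts only the requested range size would report 10GiB here, even though the reader pulls just 16KiB.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a lazily streamed range response: it yields
/// zero bytes on demand and records how many bytes were actually pulled.
struct LazyRange {
    requested: u64, // size of the range that was asked for
    consumed: u64,  // bytes actually handed to the reader so far
}

impl LazyRange {
    fn new(requested: u64) -> Self {
        Self { requested, consumed: 0 }
    }
}

impl Read for LazyRange {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Never hand out more than what is left of the requested range.
        let remaining = self.requested - self.consumed;
        let n = (buf.len() as u64).min(remaining) as usize;
        buf[..n].fill(0);
        self.consumed += n as u64;
        Ok(n)
    }
}

/// Request a large range but read only `want` bytes from it.
/// Returns (bytes requested, bytes actually consumed).
fn requested_vs_consumed(requested: u64, want: usize) -> io::Result<(u64, u64)> {
    let mut stream = LazyRange::new(requested);
    let mut buf = vec![0u8; want];
    stream.read_exact(&mut buf)?;
    Ok((stream.requested, stream.consumed))
}

fn main() -> io::Result<()> {
    let ten_gib = 10 * 1024 * 1024 * 1024u64;
    let (requested, consumed) = requested_vs_consumed(ten_gib, 16 * 1024)?;
    println!("requested = {requested} bytes, consumed = {consumed} bytes");
    Ok(())
}
```

A benchmark built along these lines would measure the `consumed` counter (or wall-clock time against a store with simulated latency) rather than the size of the requested range.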
