Dandandan commented on PR #21351:
URL: https://github.com/apache/datafusion/pull/21351#issuecomment-4238357134

   > > Prefetching IO / combining small IO requests (reducing spawn_blocking / 
thread switching overhead)
   > 
   > Should we think about the interaction with `ObjectStore`? The GCS/S3/etc. 
implementations do range coalescing. When we do combining of small IO requests 
should we take into account the layout of that data so that we combine requests 
that would be adjacent in the data, thus maximizing block hit rates for these 
providers?
   
   I did some more tinkering with object-store recently and think that issuing many more requests (e.g. thousands) in parallel (ideally over HTTP/2 or 3) would work better than heavy coalescing, because:
   
   1. For highly concurrent queries there isn't much coalescing to be done, as the data lives in different files, row groups, or columns of wide tables (so effective coalescing would mean downloading a large part of the file).
   2. Multiple requests finish more quickly than one coalesced request spanning lots of dead space between ranges (which also wastes bandwidth); for large requests, splitting them into smaller chunks and downloading them in parallel also helps.
   3. If you can process the data as it arrives, you can hide latency much better than waiting (and idling) between requests.
    
   HTTP/2 is disabled by default in object-store, I think mainly because forcing everything through a single TCP connection causes head-of-line-blocking issues even with HTTP/2 (and perhaps in the h2 implementation), but this can be worked around with a larger connection pool (combining multiplexing with multiple connections). Using HTTP/2 avoids creating lots of connections (which is itself somewhat slow) when issuing requests in parallel, and helps speed up new connections and small requests.
   HTTP/3 would probably solve the head-of-line-blocking issue entirely, as it doesn't use TCP.
   
   Splitting large requests into smaller ones helps increase throughput (I'm not sure whether the standard object-store implementation does this).
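   A back-of-the-envelope model shows why splitting helps: if each request pays a fixed round-trip latency plus transfer time at some per-connection bandwidth, N parallel chunks finish roughly when the slowest one does. The latency and bandwidth numbers below are assumptions for illustration, and the model deliberately ignores total-bandwidth saturation:

   ```rust
   /// Assumed cost model: fixed per-request latency plus
   /// size divided by per-connection bandwidth.
   fn transfer_ms(size_mb: f64, latency_ms: f64, bandwidth_mb_per_s: f64) -> f64 {
       latency_ms + size_mb / bandwidth_mb_per_s * 1000.0
   }

   fn main() {
       // 128 MB object, 50 ms latency, 100 MB/s per connection (assumed).
       let single = transfer_ms(128.0, 50.0, 100.0); // 1330 ms
       // Eight parallel 16 MB requests all finish together.
       let parallel = transfer_ms(16.0, 50.0, 100.0); // 210 ms
       assert!(parallel < single);
   }
   ```

   The win shrinks once the aggregate of the parallel connections saturates the NIC or the store's egress limits, which is why chunk size is worth tuning rather than minimizing.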
   
   There is a per-shard/prefix request limit of 5,500/s (S3) or 5,000/s (GCS), so when issuing many concurrent reads you need to make sure to spread the load over the key space that the store partitions as load increases, so that's something else to consider.
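   One common way to spread load, sketched here under the assumption that the layout of keys is under your control, is to prepend a short hash-derived shard prefix so requests fan out across the store's partitions (the prefix scheme and shard count are illustrative, not a recommendation):

   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   /// Prepend a hash-derived prefix so object keys spread across
   /// `shards` prefixes, keeping per-prefix request rates under the
   /// store's limit (scheme and shard count are assumptions).
   fn sharded_key(key: &str, shards: u64) -> String {
       let mut h = DefaultHasher::new();
       key.hash(&mut h);
       format!("{:02x}/{key}", h.finish() % shards)
   }

   fn main() {
       let k = sharded_key("data/part-00042.parquet", 16);
       // e.g. "07/data/part-00042.parquet"; the same key always
       // maps to the same prefix within one process.
       assert!(k.ends_with("data/part-00042.parquet"));
   }
   ```

   With 16 prefixes and the S3 figure above, the aggregate ceiling would be roughly 16 × 5,500 GET/s, though the stores also repartition automatically as sustained load grows.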


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
