gene-bordegaray commented on PR #21832:
URL: https://github.com/apache/datafusion/pull/21832#issuecomment-4315534346

   I would also want to highlight how important supporting partitioned data is 
for large, scalable systems. 
   
   We have this feature in use and have seen amazing results by eliminating 
repartitions with pre-partitioned data and pushing dynamic filters down to the 
correct partition of that unshuffled data:
   
   Here are some metrics on this:
   
   ```text
   | Metric                | No Dyn Filter | With Dyn Filter | Reduction        
    |
   
|----------------------|--------------|-----------------|----------------------|
   | Bytes per Worker     | ~14–16 GB    | ~75–170 MB      | ~100–130×          
  |
   | Rows per Worker      | ~400–550M    | ~0.5–1.5K       | ~400,000–500,000×  
  |
   | HashJoin Compute     | ~32–39s      | ~10–50ms        | ~1,000×            
  |
   | Aggregation Compute  | ~1s          | ~10–50µs        | ~10,000–50,000×    
  |
   | Leaf Fetch Time      | ~110–120s    | ~47–60s         | ~2×                
  |
   ```
   
   I am sure others that use partitioned data will appreciate such results and 
the contributors at Datadog plan to continue to strengthen DataFusion's support 
of pre-partitioned data 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to