Re: [PR] Fix bug that `COUNT(DISTINCT)` on StringView panics [datafusion]

via GitHub Thu, 01 Aug 2024 11:19:06 -0700


alamb commented on PR #11768:
URL: https://github.com/apache/datafusion/pull/11768#issuecomment-2263684224


   > Do you @alamb have any thoughts on why ClickBench query 5 did not panic previously?
   
   Yes, I think it is because this query
   ```
   SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;
   ```
   
   is rewritten to a `GROUP BY` without `DISTINCT` here:
https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/single_distinct_to_groupby.rs
   
   ```
   andrewlamb@Andrews-MacBook-Pro-2:~/Downloads/benchmarking$ datafusion-cli -c 'EXPLAIN SELECT COUNT(DISTINCT "SearchPhrase") FROM "hits.parquet"'
   DataFusion CLI v40.0.0
   +---------------+--------------------------------------------------------------------------------------+
   | plan_type     | plan
   +---------------+--------------------------------------------------------------------------------------+
   | logical_plan  | Projection: count(alias1) AS count(DISTINCT hits.parquet.SearchPhrase)
   |               |   Aggregate: groupBy=[[]], aggr=[[count(alias1)]]
   |               |     Aggregate: groupBy=[[hits.parquet.SearchPhrase AS alias1]], aggr=[[]]
   |               |       TableScan: hits.parquet projection=[SearchPhrase]
   | physical_plan | ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT hits.parquet.SearchPhrase)]
   |               |   AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]
   |               |     CoalescePartitionsExec
   |               |       AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]
   |               |         AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]
   |               |           CoalesceBatchesExec: target_batch_size=8192
   |               |             RepartitionExec: partitioning=Hash([alias1@0], 16), input_partitions=16
   |               |               AggregateExec: mode=Partial, gby=[SearchPhrase@0 as alias1], aggr=[]
   |               |                 ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/benchmarking/hits.parquet:0..923748528], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:923748528..1847497056], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:1847497056..2771245584], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:2771245584..3694994112], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:3694994112..4618742640], ...]}, projection=[SearchPhrase]
   +---------------+--------------------------------------------------------------------------------------+
   2 row(s) fetched.
   Elapsed 0.046 seconds.
   ```
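
To illustrate why the two-level plan above gives the same answer as a direct distinct count, here is a minimal std-only Rust sketch. It is not DataFusion code: the function names `count_distinct_direct` and `count_distinct_via_group_by` are hypothetical, and `Option<&str>` stands in for a nullable string column.

```rust
use std::collections::HashSet;

/// Direct approach: one accumulator remembers every distinct non-null value.
/// (Illustrative only; not how DataFusion's accumulators are implemented.)
fn count_distinct_direct(values: &[Option<&str>]) -> usize {
    let mut seen = HashSet::new();
    // `flatten` skips `None`, mirroring how COUNT(DISTINCT x) ignores NULLs.
    for v in values.iter().flatten() {
        seen.insert(*v);
    }
    seen.len()
}

/// Rewritten approach: an inner GROUP BY followed by an outer plain count.
fn count_distinct_via_group_by(values: &[Option<&str>]) -> usize {
    // Inner aggregate: groupBy=[x AS alias1], aggr=[] yields one row per
    // distinct key, including a NULL group if any NULL is present.
    let groups: HashSet<Option<&str>> = values.iter().copied().collect();
    // Outer aggregate: count(alias1) counts only the non-null group keys.
    groups.iter().filter(|key| key.is_some()).count()
}
```

For an input such as `[Some("a"), None, Some("b"), Some("a")]`, both functions count the same set of distinct non-null values, so the rewritten query never needs a distinct accumulator.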
   
   This rewrite was largely needed before we had implemented native distinct accumulators. But I wonder if we should re-evaluate it now that we have fast string accumulators 🤔
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

