zhuqi-lucas commented on PR #16804: URL: https://github.com/apache/datafusion/pull/16804#issuecomment-3082332726
Updated: the parquet join data generation fails with the error below; group-by data generation works:
```text
./bench.sh data h2o_medium_join_parquet
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: h2o_medium_join_parquet
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Found Python version 3.13, which is suitable.
Using Python command: /opt/homebrew/bin/python3
Installing falsa...
Generating h2o test data in
/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o with size=MEDIUM and
format=PARQUET
100 rows will be saved into:
/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.parquet
100000 rows will be saved into:
/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.parquet
100000000 rows will be saved into:
/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.parquet
An SMALL data schema is the following:
id1: int64 not null
id4: string
v2: double
An output format is PARQUET
Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
Traceback (most recent call last):
  File "/Users/zhuqi/arrow-datafusion/benchmarks/venv/lib/python3.13/site-packages/falsa/app.py", line 144, in join
    writer_small.write_batch(batch)
  File "/Users/zhuqi/arrow-datafusion/benchmarks/venv/lib/python3.13/site-packages/pyarrow/parquet/core.py", line 1089, in write_batch
    self.write_table(table, row_group_size)
  File "/Users/zhuqi/arrow-datafusion/benchmarks/venv/lib/python3.13/site-packages/pyarrow/parquet/core.py", line 1113, in write_table
    raise ValueError(msg)
Relevant locals at the failing frame:
  batch         = pyarrow.RecordBatch (id1: int64 not null, id4: string not null, v2: double not null)
  batch_size    = 5000000
  size          = <Size.MEDIUM: 'MEDIUM'>
  data_format   = <Format.PARQUET: 'PARQUET'>
  writer_small  = <pyarrow.parquet.core.ParquetWriter object at 0x1067d4050>
ValueError: Table schema does not match schema used to create file:
table:
id1: int64 not null
id4: string not null
v2: double not null vs.
file:
id1: int64 not null
id4: string
v2: double
```