[I] [Bug] Different batch_size values return different row counts [doris]

via GitHub Tue, 10 Dec 2024 01:05:33 -0800


DachuanXUAN opened a new issue, #45247:
URL: https://github.com/apache/doris/issues/45247


   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   2.1.5
   
   ### What's Wrong?
   
   The SQL is a simple SELECT ... INTO OUTFILE export to S3:
   
   SELECT file_id, cast(data_start_time as String) as data_start_time,
          cast(data_end_time as String) as data_end_time, device_type...
   FROM t_table_name
   WHERE data_start_time >= '2024-11-28 00:00:00.000'
     AND data_start_time < '2024-11-28 01:00:00.000'
   ORDER BY file_id, data_start_time
   INTO OUTFILE "s3://xxx/xxx/2024_11_28_00/part_v2_"
   FORMAT AS PARQUET
   PROPERTIES(
       "s3.endpoint" = "http://xxx.com/",
       "s3.access_key" = "xxx",
       "s3.secret_key" = "xxx",
       "s3.region" = "xxx",
       "max_file_size" = "120MB"
   );
   
   With batch_size set to 100,000:
   set batch_size=100000;
   
   +------------+-----------+-----------+--------------------------------------------------------------------+
   | FileNumber | TotalRows | FileSize  | URL                                                                |
   +------------+-----------+-----------+--------------------------------------------------------------------+
   | 5          | 14602938  | 671738439 | s3://xxx/2024_11_28_00/part_v2_66d16409bc2a4b37-9905759434b51248_* |
   +------------+-----------+-----------+--------------------------------------------------------------------+
   
   With batch_size set to the default value:
   set batch_size=4096;
   
   +------------+-----------+------------+--------------------------------------------------------------------+
   | FileNumber | TotalRows | FileSize   | URL                                                                |
   +------------+-----------+------------+--------------------------------------------------------------------+
   | 10         | 29803106  | 1316012670 | s3://xxx/2024_11_28_00/part_v2_2cef1e9dba2a4749-a9a688a90843fa53_* |
   +------------+-----------+------------+--------------------------------------------------------------------+
   
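   The row count with the default batch_size should be the correct one; with a larger batch_size, rows go missing. If batch_size is set even larger, the SQL hangs with no apparent cause.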
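
To put numbers on the gap between the two runs, here is a minimal sketch that only does arithmetic on the figures from the two result tables. It assumes the default-batch_size count is the reference, which this report expects but has not verified; the variable names are illustrative only.

```python
# Figures copied from the two result tables in this report.
runs = {
    100000: {"files": 5, "rows": 14602938, "bytes": 671738439},
    4096: {"files": 10, "rows": 29803106, "bytes": 1316012670},
}

large, default = runs[100000], runs[4096]

# Rows exported by the default run but not by the batch_size=100000 run.
# Assumption: the default run's count is the correct one.
missing_rows = default["rows"] - large["rows"]

# Share of the default run's rows that the large-batch run exported.
exported_fraction = large["rows"] / default["rows"]

print(f"missing rows: {missing_rows}")
print(f"exported fraction: {exported_fraction:.3f}")
for batch_size, r in runs.items():
    print(f"batch_size={batch_size}: {r['bytes'] / r['rows']:.1f} bytes/row")
```

With these figures, the batch_size=100000 run exported about 49% of the rows in half as many files, while the average bytes per row is similar in both runs (about 46.0 versus 44.2).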
   
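   The reason for setting batch_size is that, when exporting to Parquet, I want to control the number of rows per block so that the Parquet files contain fewer blocks. Apart from setting batch_size, there is no other way to control this number.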
   
   ### What You Expected?
   
   skip
   
   ### How to Reproduce?
   
   _No response_
   
   ### Anything Else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


