suxiaogang223 commented on PR #52228:
URL: https://github.com/apache/doris/pull/52228#issuecomment-3021667188
@Z-SWEI I used pyarrow to generate a minimal reproducible Parquet file;
it may help you add a test case.
```python
import pyarrow as pa
import pyarrow.parquet as pq
import os
# Build small test data
data1 = pa.table(
    {
        "a": pa.array([i for i in range(1, 1500)], type=pa.int32()),
    }
)
data2 = pa.table(
    {
        "a": pa.array([i for i in range(1500, 3000)], type=pa.int32()),
    }
)
# Writer settings
output_file = "small_2rowgroup.parquet"
writer = pq.ParquetWriter(
    output_file,
    data1.schema,
    use_dictionary=False,   # disable dictionary encoding to reduce complexity
    compression="NONE",     # no compression
    data_page_size=512,     # small page size forces multiple pages, which produces a page index
    write_statistics=True,  # enable statistics, which helps index generation
    write_page_index=True,  # enable the page index
)
# Write two row groups
writer.write_table(data1)  # first row group
writer.write_table(data2)  # second row group
writer.close()
# Print the size of the generated file
print(f"Parquet file size: {os.path.getsize(output_file)} bytes")
```
Then use the `local` TVF to query this Parquet file:
```sql
select * from local(
    "file_path"="small_2rowgroup.parquet",
    "backend_id"="1751341622620",
    "format"="parquet"
) where a > 1000 and a < 2000;
```
<img width="1901" alt="image"
src="https://github.com/user-attachments/assets/1f654e78-a5dd-45e6-a683-2c5d10d2fb19"
/>
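For the test case, the expected result of this query can be worked out without Doris. The following is a minimal sketch, assuming a correct reader returns exactly the rows that match the predicate; the variable and function names are only for illustration. It shows that the matching rows span both row groups, so the filter has to be evaluated correctly in each of them.

```python
# Value ranges written by the generator script: one range per row group.
row_group_1 = range(1, 1500)     # a = 1 .. 1499
row_group_2 = range(1500, 3000)  # a = 1500 .. 2999


def matches(a):
    # Same predicate as the SQL query: a > 1000 and a < 2000
    return 1000 < a < 2000


# Rows each row group should contribute to the result.
expected_rg1 = [a for a in row_group_1 if matches(a)]
expected_rg2 = [a for a in row_group_2 if matches(a)]
expected = expected_rg1 + expected_rg2

print(len(expected_rg1), len(expected_rg2), len(expected))
print(expected[0], expected[-1])
```

A result with a different row count, or one missing rows from either row group, would point at the filtering problem this file is meant to reproduce.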