icexelloss opened a new issue, #45287:
URL: https://github.com/apache/arrow/issues/45287
### Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I have observed what looks like a memory leak when loading a Parquet dataset,
which I think is related to the metadata.
I ran with PyArrow 19.0.0. Here is the code to reproduce it:
```
import pyarrow.parquet as pq
t = pq.read_table(
    "bamboo-streaming-parquet-test-data/10000col_2_short_name",
    columns=['time', 'id'],
)
print(t)
```
Here is a description of the dataset:
* It is a daily-partitioned Parquet dataset with 260 Parquet files, one per
partition. Each file is about 3.7M, for a total size of about 1G.
* Each partition has a single row and 10k double columns.
The dataset roughly looks like this:
```
time id md_0 md_1 md_2 md_3
md_4 md_5 md_6 md_7 md_8 md_9 md_10 md_11
md_12 md_13 md_14 ... md_9986 md_9987 md_9988 md_9989
md_9990 md_9991 md_9992 md_9993 md_9994 md_9995 md_9996 md_9997
md_9998 md_9999 year month day
0 2023-01-02 09:00:00+00:00 0 0.345584 0.821618 0.330437 -1.303157
0.905356 0.446375 -0.536953 0.581118 0.364572 0.294132 0.028422 0.546713
-0.736454 -0.162910 -0.482119 ... -0.559077 0.422268 -0.694504 -0.024630
-1.142861 2.203289 -0.293591 -1.076218 -2.264640 1.424887 1.601123 0.301252
-0.771280 0.185484 2023 1 2
1 2023-01-03 09:00:00+00:00 0 -0.581676 -0.889318 0.487676 0.678370
-0.834241 0.990142 -0.502560 -3.089640 -1.354553 0.669394 0.173036 0.904321
0.528163 1.386469 -1.018272 ... 2.348579 0.682227 -0.212912 0.404263
-1.527967 -0.636490 -1.094308 -0.049889 0.290552 -0.428462 -0.688299 1.856678
1.714070 0.228840 2023 1 3
2 2023-01-04 09:00:00+00:00 0 -0.436375 1.554100 1.583000 -0.427829
-0.105547 -1.210442 -1.995322 -0.676878 0.957899 -1.569809 0.411940 0.190030
-1.502412 -0.006992 0.086427 ... -0.039152 -0.325682 -3.200570 0.415924
-1.892018 -0.324783 -0.397570 1.310791 1.284943 0.148449 0.844266 -0.045938
0.745099 1.037851 2023 1 4
3 2023-01-05 09:00:00+00:00 0 -0.158549 -1.239811 -4.030404 1.357348
0.323645 -1.222858 -0.285377 0.963126 -0.531556 -0.652767 0.161818 -0.727889
-0.845209 2.557909 0.192841 ... 0.349263 1.362306 0.993748 -0.198351
-0.270906 0.667339 0.265590 -0.344429 -0.025954 -0.751611 -0.614933 0.629236
-0.765841 1.214225 2023 1 5
4 2023-01-06 09:00:00+00:00 0 0.165239 1.645823 1.345670 -0.966753
-1.149769 0.245695 0.731457 -0.902745 1.270495 2.031029 0.312967 -1.554449
1.177362 -0.843873 -0.216501 ... -0.070219 1.582911 0.146530 -2.169505
-0.474960 0.896453 -1.591739 0.560348 -1.130101 1.137671 1.327553 -0.383506
-0.346886 -0.189187 2023 1 6
.. ... .. ... ... ... ...
... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
... ... ... ... ...
255 2023-12-25 09:00:00+00:00 0 -1.346927 0.359054 0.539482 0.367916
-1.574514 0.986346 -0.695192 0.658779 1.335143 1.846663 -0.341364 0.817412
-0.797522 0.073098 0.821410 ... -1.186771 0.887036 1.411563 -0.292395
0.430151 1.141385 0.496770 -0.644220 -0.799314 -1.696699 0.862889 2.979495
0.630375 1.303667 2023 12 25
256 2023-12-26 09:00:00+00:00 0 0.227574 -1.466949 -0.333808 -1.710143
1.314850 -0.322474 0.048659 0.470558 -0.045580 1.193444 -1.826998 -1.368194
0.489085 0.947896 0.640531 ... 0.914886 0.261353 -0.691675 -0.399880
2.045703 -2.356994 1.374474 0.398776 -1.112503 -0.821812 1.238957 -0.940858
-0.912673 -0.784034 2023 12 26
257 2023-12-27 09:00:00+00:00 0 0.054617 -1.524966 0.890249 0.360648
2.271556 -0.964410 1.819533 -0.050139 1.859295 -0.590993 0.306090 0.354523
0.094928 0.191593 -0.225309 ... -0.488067 -0.309505 0.544273 -0.408513
-0.111164 0.974175 -0.441507 2.331777 0.726422 -0.165301 -1.163866 0.077637
0.404457 1.498559 2023 12 27
258 2023-12-28 09:00:00+00:00 0 0.827725 1.090989 0.273126 0.586210
0.753180 -1.544673 0.180036 -1.136032 0.919575 -0.733295 -0.661449 0.194519
0.228403 -0.531628 -0.226339 ... -0.986043 0.099540 -0.729874 0.692716
-0.506130 -0.122421 0.321638 -2.592867 0.083722 0.418742 -0.076682 1.067173
-0.331503 0.617221 2023 12 28
259 2023-12-29 09:00:00+00:00 0 0.527097 0.358271 -0.659745 1.500467
-0.977564 1.198143 0.650929 0.876694 -0.144450 1.175169 0.749327 -0.475795
-0.978405 -0.888626 0.041753 ... -0.090532 -2.414195 1.619769 -0.005002
-0.672586 0.638271 1.819008 -0.446535 -0.629320 -1.241598 0.926157 -0.304448
-0.129029 0.750146 2023 12 29
[260 rows x 10005 columns]
```
When I run the code above under "time -v", the reported memory usage is about
6G, which is significantly larger than the data loaded, so I think there is a
metadata-related memory leak.
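One way to narrow this down is to compare the size of the loaded table with what the Arrow memory pool reports and with the peak resident set size of the process. The following is a minimal sketch, not part of the original report: the helper names `peak_rss_bytes` and `report` are made up for illustration, and the `resource` module it relies on is only available on Unix-like systems.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of the current process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def report(path, columns):
    """Load the dataset and print a few memory figures (hypothetical helper)."""
    # Imported lazily so that peak_rss_bytes() can be used without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pq.read_table(path, columns=columns)
    print("table.nbytes:        ", table.nbytes)
    print("arrow pool allocated:", pa.total_allocated_bytes())
    print("arrow pool peak:     ", pa.default_memory_pool().max_memory())
    print("process peak RSS:    ", peak_rss_bytes())
    return table
```

Calling `report("bamboo-streaming-parquet-test-data/10000col_2_short_name", ["time", "id"])` prints the four figures. A large gap between the pool numbers and the peak RSS would suggest that the memory is being allocated outside the Arrow memory pool.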
This issue probably has the same root cause as
https://github.com/apache/arrow/issues/37630
There is a script that can be used to generate the dataset for the repro. It
has permissioned access (due to company policy), but I am happy to give
permission to whoever is looking into this:
https://github.com/twosigma/bamboo-streaming/blob/master/notebooks/generate_parquet_test_data.ipynb
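For anyone without access to that notebook, a dataset of the same shape can be approximated from the description above. The sketch below is a rough stand-in, not the original script. It assumes a Hive-style `year=/month=/day=` directory layout, one file per weekday of 2023, random normal values, and a hypothetical file name `data.parquet`; none of these details are confirmed by the notebook.

```python
import datetime as dt
import os


def weekdays(year):
    """All Monday-to-Friday dates in the given year."""
    day = dt.date(year, 1, 1)
    out = []
    while day.year == year:
        if day.weekday() < 5:
            out.append(day)
        day += dt.timedelta(days=1)
    return out


def generate(root, n_cols=10_000, year=2023):
    """Write one single-row Parquet file per weekday (hypothetical layout)."""
    # Imported lazily so that weekdays() can be used without these packages.
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    rng = np.random.default_rng(0)
    for day in weekdays(year):
        ts = dt.datetime(day.year, day.month, day.day, 9, tzinfo=dt.timezone.utc)
        data = {
            "time": pa.array([ts], type=pa.timestamp("ns", tz="UTC")),
            "id": pa.array([0], type=pa.int64()),
        }
        # One float64 column per md_<i>, each holding a single value.
        values = rng.standard_normal(n_cols)
        for i in range(n_cols):
            data[f"md_{i}"] = pa.array([float(values[i])], type=pa.float64())
        # Assumed Hive-style partition directories: year=2023/month=1/day=2
        part = os.path.join(
            root, f"year={day.year}", f"month={day.month}", f"day={day.day}"
        )
        os.makedirs(part, exist_ok=True)
        pq.write_table(pa.table(data), os.path.join(part, "data.parquet"))
```

`weekdays(2023)` yields 260 dates from 2023-01-02 to 2023-12-29, which matches the row count and date range shown in the dataset preview. Calling `generate` with a lower `n_cols` gives a quicker way to check whether memory usage scales with the number of columns.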
### Component(s)
Parquet, C++