icexelloss opened a new issue, #45287: URL: https://github.com/apache/arrow/issues/45287
### Describe the bug, including details regarding any error messages, version, and platform.

Hi,

I have observed what looks like a memory leak when loading a Parquet dataset, which I think is related to the file metadata.

I ran with PyArrow 19.0.0.

Here is the code to reproduce:

```
import pyarrow.parquet as pq

t = pq.read_table("bamboo-streaming-parquet-test-data/10000col_2_short_name", columns=['time', 'id'])
print(t)
```

Here is the description of the dataset:

* It is a daily partitioned Parquet dataset. The total size is 1G, each Parquet file / partition is 3.7M, and there are 260 Parquet files in total.
* Each partition has a single row and 10k double columns.

The dataset roughly looks like this:

```
time id md_0 md_1 md_2 md_3 md_4 md_5 md_6 md_7 md_8 md_9 md_10 md_11 md_12 md_13 md_14 ... md_9986 md_9987 md_9988 md_9989 md_9990 md_9991 md_9992 md_9993 md_9994 md_9995 md_9996 md_9997 md_9998 md_9999 year month day
0 2023-01-02 09:00:00+00:00 0 0.345584 0.821618 0.330437 -1.303157 0.905356 0.446375 -0.536953 0.581118 0.364572 0.294132 0.028422 0.546713 -0.736454 -0.162910 -0.482119 ... -0.559077 0.422268 -0.694504 -0.024630 -1.142861 2.203289 -0.293591 -1.076218 -2.264640 1.424887 1.601123 0.301252 -0.771280 0.185484 2023 1 2
1 2023-01-03 09:00:00+00:00 0 -0.581676 -0.889318 0.487676 0.678370 -0.834241 0.990142 -0.502560 -3.089640 -1.354553 0.669394 0.173036 0.904321 0.528163 1.386469 -1.018272 ... 2.348579 0.682227 -0.212912 0.404263 -1.527967 -0.636490 -1.094308 -0.049889 0.290552 -0.428462 -0.688299 1.856678 1.714070 0.228840 2023 1 3
2 2023-01-04 09:00:00+00:00 0 -0.436375 1.554100 1.583000 -0.427829 -0.105547 -1.210442 -1.995322 -0.676878 0.957899 -1.569809 0.411940 0.190030 -1.502412 -0.006992 0.086427 ... -0.039152 -0.325682 -3.200570 0.415924 -1.892018 -0.324783 -0.397570 1.310791 1.284943 0.148449 0.844266 -0.045938 0.745099 1.037851 2023 1 4
3 2023-01-05 09:00:00+00:00 0 -0.158549 -1.239811 -4.030404 1.357348 0.323645 -1.222858 -0.285377 0.963126 -0.531556 -0.652767 0.161818 -0.727889 -0.845209 2.557909 0.192841 ... 0.349263 1.362306 0.993748 -0.198351 -0.270906 0.667339 0.265590 -0.344429 -0.025954 -0.751611 -0.614933 0.629236 -0.765841 1.214225 2023 1 5
4 2023-01-06 09:00:00+00:00 0 0.165239 1.645823 1.345670 -0.966753 -1.149769 0.245695 0.731457 -0.902745 1.270495 2.031029 0.312967 -1.554449 1.177362 -0.843873 -0.216501 ... -0.070219 1.582911 0.146530 -2.169505 -0.474960 0.896453 -1.591739 0.560348 -1.130101 1.137671 1.327553 -0.383506 -0.346886 -0.189187 2023 1 6
.. ... .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
255 2023-12-25 09:00:00+00:00 0 -1.346927 0.359054 0.539482 0.367916 -1.574514 0.986346 -0.695192 0.658779 1.335143 1.846663 -0.341364 0.817412 -0.797522 0.073098 0.821410 ... -1.186771 0.887036 1.411563 -0.292395 0.430151 1.141385 0.496770 -0.644220 -0.799314 -1.696699 0.862889 2.979495 0.630375 1.303667 2023 12 25
256 2023-12-26 09:00:00+00:00 0 0.227574 -1.466949 -0.333808 -1.710143 1.314850 -0.322474 0.048659 0.470558 -0.045580 1.193444 -1.826998 -1.368194 0.489085 0.947896 0.640531 ... 0.914886 0.261353 -0.691675 -0.399880 2.045703 -2.356994 1.374474 0.398776 -1.112503 -0.821812 1.238957 -0.940858 -0.912673 -0.784034 2023 12 26
257 2023-12-27 09:00:00+00:00 0 0.054617 -1.524966 0.890249 0.360648 2.271556 -0.964410 1.819533 -0.050139 1.859295 -0.590993 0.306090 0.354523 0.094928 0.191593 -0.225309 ... -0.488067 -0.309505 0.544273 -0.408513 -0.111164 0.974175 -0.441507 2.331777 0.726422 -0.165301 -1.163866 0.077637 0.404457 1.498559 2023 12 27
258 2023-12-28 09:00:00+00:00 0 0.827725 1.090989 0.273126 0.586210 0.753180 -1.544673 0.180036 -1.136032 0.919575 -0.733295 -0.661449 0.194519 0.228403 -0.531628 -0.226339 ... -0.986043 0.099540 -0.729874 0.692716 -0.506130 -0.122421 0.321638 -2.592867 0.083722 0.418742 -0.076682 1.067173 -0.331503 0.617221 2023 12 28
259 2023-12-29 09:00:00+00:00 0 0.527097 0.358271 -0.659745 1.500467 -0.977564 1.198143 0.650929 0.876694 -0.144450 1.175169 0.749327 -0.475795 -0.978405 -0.888626 0.041753 ... -0.090532 -2.414195 1.619769 -0.005002 -0.672586 0.638271 1.819008 -0.446535 -0.629320 -1.241598 0.926157 -0.304448 -0.129029 0.750146 2023 12 29

[260 rows x 10005 columns]
```

When I run the code above under `time -v`, it reports memory usage of about 6G. That is significantly larger than the data loaded (only the `time` and `id` columns are requested), so I think there is a metadata-related memory leak.

This issue probably has the same root cause as https://github.com/apache/arrow/issues/37630

There is a script that can be used to generate the dataset for the repro. It has permissioned access (due to company policy), but I am happy to give access to whoever is looking into this:
https://github.com/twosigma/bamboo-streaming/blob/master/notebooks/generate_parquet_test_data.ipynb

### Component(s)

Parquet, C++