icexelloss opened a new issue, #45287:
URL: https://github.com/apache/arrow/issues/45287

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi,
   
   I have observed what looks like a memory leak when loading a Parquet dataset, 
which I think is related to the metadata file.
   
   I ran with PyArrow 19.0.0. Here is the code to reproduce:
   
   ```
   import pyarrow.parquet as pq
   t = pq.read_table(
       "bamboo-streaming-parquet-test-data/10000col_2_short_name",
       columns=['time', 'id'],
   )
   print(t)
   ```
   
   Here is the description of the dataset:
   * It is a daily partitioned Parquet dataset: 260 Parquet files (one per 
partition) of about 3.7M each, roughly 1G in total.
   * Each partition has a single row and 10k double columns.
   
   The dataset roughly looks like this:
   ```
                            time  id      md_0      md_1      md_2      md_3    
  md_4      md_5      md_6      md_7      md_8      md_9     md_10     md_11    
 md_12     md_13     md_14  ...   md_9986   md_9987   md_9988   md_9989   
md_9990   md_9991   md_9992   md_9993   md_9994   md_9995   md_9996   md_9997   
md_9998   md_9999  year  month  day
   0   2023-01-02 09:00:00+00:00   0  0.345584  0.821618  0.330437 -1.303157  
0.905356  0.446375 -0.536953  0.581118  0.364572  0.294132  0.028422  0.546713 
-0.736454 -0.162910 -0.482119  ... -0.559077  0.422268 -0.694504 -0.024630 
-1.142861  2.203289 -0.293591 -1.076218 -2.264640  1.424887  1.601123  0.301252 
-0.771280  0.185484  2023      1    2
   1   2023-01-03 09:00:00+00:00   0 -0.581676 -0.889318  0.487676  0.678370 
-0.834241  0.990142 -0.502560 -3.089640 -1.354553  0.669394  0.173036  0.904321 
 0.528163  1.386469 -1.018272  ...  2.348579  0.682227 -0.212912  0.404263 
-1.527967 -0.636490 -1.094308 -0.049889  0.290552 -0.428462 -0.688299  1.856678 
 1.714070  0.228840  2023      1    3
   2   2023-01-04 09:00:00+00:00   0 -0.436375  1.554100  1.583000 -0.427829 
-0.105547 -1.210442 -1.995322 -0.676878  0.957899 -1.569809  0.411940  0.190030 
-1.502412 -0.006992  0.086427  ... -0.039152 -0.325682 -3.200570  0.415924 
-1.892018 -0.324783 -0.397570  1.310791  1.284943  0.148449  0.844266 -0.045938 
 0.745099  1.037851  2023      1    4
   3   2023-01-05 09:00:00+00:00   0 -0.158549 -1.239811 -4.030404  1.357348  
0.323645 -1.222858 -0.285377  0.963126 -0.531556 -0.652767  0.161818 -0.727889 
-0.845209  2.557909  0.192841  ...  0.349263  1.362306  0.993748 -0.198351 
-0.270906  0.667339  0.265590 -0.344429 -0.025954 -0.751611 -0.614933  0.629236 
-0.765841  1.214225  2023      1    5
   4   2023-01-06 09:00:00+00:00   0  0.165239  1.645823  1.345670 -0.966753 
-1.149769  0.245695  0.731457 -0.902745  1.270495  2.031029  0.312967 -1.554449 
 1.177362 -0.843873 -0.216501  ... -0.070219  1.582911  0.146530 -2.169505 
-0.474960  0.896453 -1.591739  0.560348 -1.130101  1.137671  1.327553 -0.383506 
-0.346886 -0.189187  2023      1    6
   ..                        ...  ..       ...       ...       ...       ...    
   ...       ...       ...       ...       ...       ...       ...       ...    
   ...       ...       ...  ...       ...       ...       ...       ...       
...       ...       ...       ...       ...       ...       ...       ...       
...       ...   ...    ...  ...
   255 2023-12-25 09:00:00+00:00   0 -1.346927  0.359054  0.539482  0.367916 
-1.574514  0.986346 -0.695192  0.658779  1.335143  1.846663 -0.341364  0.817412 
-0.797522  0.073098  0.821410  ... -1.186771  0.887036  1.411563 -0.292395  
0.430151  1.141385  0.496770 -0.644220 -0.799314 -1.696699  0.862889  2.979495  
0.630375  1.303667  2023     12   25
   256 2023-12-26 09:00:00+00:00   0  0.227574 -1.466949 -0.333808 -1.710143  
1.314850 -0.322474  0.048659  0.470558 -0.045580  1.193444 -1.826998 -1.368194  
0.489085  0.947896  0.640531  ...  0.914886  0.261353 -0.691675 -0.399880  
2.045703 -2.356994  1.374474  0.398776 -1.112503 -0.821812  1.238957 -0.940858 
-0.912673 -0.784034  2023     12   26
   257 2023-12-27 09:00:00+00:00   0  0.054617 -1.524966  0.890249  0.360648  
2.271556 -0.964410  1.819533 -0.050139  1.859295 -0.590993  0.306090  0.354523  
0.094928  0.191593 -0.225309  ... -0.488067 -0.309505  0.544273 -0.408513 
-0.111164  0.974175 -0.441507  2.331777  0.726422 -0.165301 -1.163866  0.077637 
 0.404457  1.498559  2023     12   27
   258 2023-12-28 09:00:00+00:00   0  0.827725  1.090989  0.273126  0.586210  
0.753180 -1.544673  0.180036 -1.136032  0.919575 -0.733295 -0.661449  0.194519  
0.228403 -0.531628 -0.226339  ... -0.986043  0.099540 -0.729874  0.692716 
-0.506130 -0.122421  0.321638 -2.592867  0.083722  0.418742 -0.076682  1.067173 
-0.331503  0.617221  2023     12   28
   259 2023-12-29 09:00:00+00:00   0  0.527097  0.358271 -0.659745  1.500467 
-0.977564  1.198143  0.650929  0.876694 -0.144450  1.175169  0.749327 -0.475795 
-0.978405 -0.888626  0.041753  ... -0.090532 -2.414195  1.619769 -0.005002 
-0.672586  0.638271  1.819008 -0.446535 -0.629320 -1.241598  0.926157 -0.304448 
-0.129029  0.750146  2023     12   29
   
   [260 rows x 10005 columns]
   
   ```
   
   Running the code above under "time -v" shows memory usage of about 6G, which 
is far larger than the data loaded (two columns, 260 rows), so I think there is 
a metadata-related memory leak.
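
For anyone reproducing this without "time -v", a rough in-process check could look like the sketch below. It compares the size of the returned table with what Arrow's default memory pool still holds and with the peak resident set size of the process. The helper names `peak_rss_bytes` and `compare_load` are made up for illustration and are not part of PyArrow; the `resource` module is Unix-only.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of this process so far, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def compare_load(path, columns):
    """Load `columns` from the dataset at `path` and return three memory figures."""
    # Imported lazily so peak_rss_bytes() is usable without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pq.read_table(path, columns=columns)
    return {
        # Bytes referenced by the returned table's buffers.
        "table_nbytes": table.nbytes,
        # Bytes currently allocated from Arrow's default memory pool.
        "arrow_pool_bytes": pa.total_allocated_bytes(),
        # High-water mark of the whole process, including non-Arrow memory.
        "peak_rss_bytes": peak_rss_bytes(),
    }
```

Calling `compare_load("bamboo-streaming-parquet-test-data/10000col_2_short_name", ["time", "id"])` would give the three numbers side by side. A large gap between peak RSS and the other two would be consistent with memory used outside the returned table, though on its own it does not distinguish a leak from a transient peak.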
   
   This issue probably has the same root cause as 
https://github.com/apache/arrow/issues/37630
   
   There is a script that can be used to generate the dataset for the repro. 
Access is restricted (due to company policy), but I am happy to grant permission 
to whoever is looking into this:
   
https://github.com/twosigma/bamboo-streaming/blob/master/notebooks/generate_parquet_test_data.ipynb
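
For readers without access to that notebook, a dataset of similar shape could probably be generated with something like the sketch below. This is not the script linked above. It assumes, based only on the sample output, that partitions are weekdays starting 2023-01-02, that the layout is hive-style `year=/month=/day=` directories, and that each file holds one row with `time`, `id`, and `md_0` ... `md_9999`. The function names and the `part-0.parquet` file name are hypothetical, and the resulting file sizes may differ from the original data.

```python
import datetime as dt
import os
import random


def business_days(start, count):
    """Return `count` consecutive weekdays (Mon-Fri) starting at `start`."""
    days = []
    day = start
    while len(days) < count:
        if day.weekday() < 5:
            days.append(day)
        day += dt.timedelta(days=1)
    return days


def partition_dir(root, day):
    """Hive-style partition directory for one day (assumed layout)."""
    return os.path.join(
        root, f"year={day.year}", f"month={day.month}", f"day={day.day}"
    )


def write_dataset(root, n_partitions=260, n_columns=10_000):
    """Write one single-row Parquet file per partition."""
    # Imported lazily so the pure-Python helpers above work without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    rng = random.Random(0)  # fixed seed so runs are repeatable
    for day in business_days(dt.date(2023, 1, 2), n_partitions):
        ts = dt.datetime(day.year, day.month, day.day, 9, tzinfo=dt.timezone.utc)
        columns = {
            "time": pa.array([ts], type=pa.timestamp("ns", tz="UTC")),
            "id": pa.array([0], type=pa.int64()),
        }
        for i in range(n_columns):
            columns[f"md_{i}"] = pa.array([rng.gauss(0.0, 1.0)], type=pa.float64())
        out_dir = partition_dir(root, day)
        os.makedirs(out_dir, exist_ok=True)
        pq.write_table(pa.table(columns), os.path.join(out_dir, "part-0.parquet"))
```

Running `write_dataset("some/output/dir")` and then pointing the `pq.read_table` call from the repro at that directory should exercise the same "many wide files, few columns read" pattern, under the assumptions stated above.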
   
   ### Component(s)
   
   Parquet, C++


