[I] [CI][C++] Use a separate Docker image for Emscripten [arrow]
kou opened a new issue, #44471: URL: https://github.com/apache/arrow/issues/44471

### Describe the enhancement requested

emsdk isn't small, and `ubuntu-22.04-cpp.dockerfile` is shared by several other images, so that Dockerfile should be kept as small as possible. Moving emsdk into its own image would keep the shared base lean; a sketch of what that could look like follows.

### Component(s)

C++, Continuous Integration
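As a rough illustration of the direction (not Arrow's actual CI configuration; the base image, package list, and emsdk version below are all assumptions), the emsdk installation could move into a dedicated Dockerfile so the shared C++ image stays small:

```dockerfile
# Hypothetical dedicated Emscripten image. The base image, packages,
# and emsdk version are assumptions, not Arrow's real setup.
FROM ubuntu:22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        ca-certificates git python3 xz-utils && \
    rm -rf /var/lib/apt/lists/*

# emsdk lives only in this image, instead of being installed in the
# shared ubuntu-22.04-cpp.dockerfile used by several other images.
RUN git clone https://github.com/emscripten-core/emsdk.git /opt/emsdk && \
    /opt/emsdk/emsdk install latest && \
    /opt/emsdk/emsdk activate latest

ENV PATH="/opt/emsdk:/opt/emsdk/upstream/emscripten:${PATH}"
```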
[I] Memory Leak when Importing Parquet File with PyArrow Engine in Pandas [arrow]
Voltagabbana opened a new issue, #44472: URL: https://github.com/apache/arrow/issues/44472

### Describe the bug, including details regarding any error messages, version, and platform.

**Description**

We've identified a memory leak when importing Parquet files into Pandas DataFrames using the PyArrow engine. The issue occurs specifically during the conversion from Arrow to Pandas objects: memory is not released even after deleting the DataFrame and invoking garbage collection.

**Key findings:**

- **No leak with PyArrow alone:** When using PyArrow to read Parquet without converting to Pandas (i.e., no `.to_pandas()`), the memory leak does not occur.
- **Leak with `.to_pandas()`:** The memory leak appears during the conversion from Arrow to Pandas, suggesting the problem is tied to this process.
- **No issue with Fastparquet or Polars:** Fastparquet and Polars (even with PyArrow) do not exhibit this memory issue, reinforcing that the problem is in Pandas' handling of Arrow data.

**Reproduction Code**

```python
# dataset_creation.py -- builds a synthetic dataset
import pandas as pd
import numpy as np
import random
import string

np.random.seed(42)
random.seed(42)

def random_string(length):
    letters = string.ascii_letters
    return ''.join(random.choice(letters) for _ in range(length))

num_rows = 10**6
col_types = {
    'col1': lambda: random_string(10),
    'col2': lambda: np.random.randint(0, 1000),
    'col3': lambda: np.random.random(),
    'col4': lambda: random_string(5),
    'col5': lambda: np.random.randint(1, 1000),  # low must be less than high
    'col6': lambda: np.random.uniform(0, 100),
    'col7': lambda: random_string(8),
    'col8': lambda: np.random.random() * 1000,
    'col9': lambda: np.random.randint(0, 2),
    'col10': lambda: random_string(1000),
}

data = {col: [func() for _ in range(num_rows)] for col, func in col_types.items()}
df = pd.DataFrame(data)
df.to_parquet('random_dataset.parquet', index=True)

import os
file_size = os.path.getsize('random_dataset.parquet') / (1024**3)
print(f"File size: {file_size:.2f} GB")
```

```python
# memory_test.py
import ctypes
import gc

import pandas as pd
import polars as pl
import psutil  # needed for the memory statistics printed below
import pyarrow.parquet

data_path = 'random_dataset.parquet'

# To manually trigger memory release
malloc_trim = ctypes.CDLL("libc.so.6").malloc_trim

for i in range(10):  # named `i` so the iteration number can be printed
    df = pd.read_parquet(data_path, engine="pyarrow")
    # Also tested with:
    # df = pyarrow.parquet.read_pandas(data_path).to_pandas()
    # df = pl.read_parquet(data_path, use_pyarrow=True)

    del df  # Explicitly delete DataFrame
    for _ in range(3):  # Force garbage collection multiple times
        gc.collect()

    memory_info = psutil.virtual_memory()
    print(f"\n\nIteration number: {i}")
    print(f"Total Memory: {memory_info.total / (1024 ** 3):.2f} GB")
    print(f"Memory at disposal: {memory_info.available / (1024 ** 3):.2f} GB")
    print(f"Memory Used: {memory_info.used / (1024 ** 3):.2f} GB")
    print(f"Percentage of memory used: {memory_info.percent}%")

# Calling malloc_trim(0) is the only way we found to release the memory
malloc_trim(0)
```



**Observations:**

- **Garbage Collection:** Despite invoking the garbage collector multiple times, memory allocated to the Python process keeps increasing when `.to_pandas()` is used, indicating improper memory release during the conversion.
- **Direct Use of PyArrow:** When we import the data directly using PyArrow (without converting to Pandas), memory usage remains stable, showing that the problem originates in the Arrow-to-Pandas conversion process.
- **Manual Memory Release (ctypes):** The only reliable way we have found to release the memory is by manually calling `malloc_trim(0)` via ctypes. However, we believe this is not a proper solution; memory management should be handled internally by Pandas.

**OS environment**

```
Icon name: computer-vm
Chassis: vm
Virtualization: microsoft
Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
Kernel: Linux 4.18.0-553.16.1.el8_10.x86_64
Architecture: x86-64
```

**Conclusion**

The issue seems to occur during the conversion from Arrow to Pandas rather than within PyArrow's Parquet reading itself. Given that memory is only released by manually invoking `malloc_trim(0)`, we suspect a problem with how PyArrow handles memory when converting the data to Pandas. This issue does not arise when using the Fastparquet engine or Polars instead.
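A useful first check here (not from the original report; a sketch assuming a reasonably recent PyArrow) is to ask PyArrow which allocator backend it is using and how many bytes Arrow itself still holds after `del df`. If `total_allocated_bytes()` drops to near zero while the process RSS stays high, the memory is being cached by the allocator (e.g. glibc malloc arenas, which `malloc_trim(0)` releases) rather than leaked by Arrow:

```python
# Diagnostic sketch: distinguish "Arrow still holds the memory" from
# "the allocator is caching freed memory". Assumes a recent PyArrow.
import gc

import pandas as pd
import pyarrow as pa

df = pd.read_parquet("random_dataset.parquet", engine="pyarrow")
del df
gc.collect()

pool = pa.default_memory_pool()
print(pool.backend_name)           # e.g. "jemalloc", "mimalloc", or "system"
print(pa.total_allocated_bytes())  # bytes still held by Arrow's pool

# If the jemalloc backend is active, telling it to release dirty pages
# immediately often has the same visible effect as malloc_trim(0):
if pool.backend_name == "jemalloc":
    pa.jemalloc_set_decay_ms(0)
```

The backend can also be switched without code changes via the `ARROW_DEFAULT_MEMORY_POOL` environment variable (e.g. `system`), which can help confirm whether the growth is allocator caching rather than a true leak.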
Re: [I] [C#] Security issue in JSON dependency [arrow]
raulcd closed issue #44463: [C#] Security issue in JSON dependency URL: https://github.com/apache/arrow/issues/44463
Re: [I] [R] Can't install `adbcflightsql` from CRAN [arrow-adbc]
paleolimbot closed issue #1647: [R] Can't install `adbcflightsql` from CRAN URL: https://github.com/apache/arrow-adbc/issues/1647
[I] [Website] Improve project description [arrow]
ianmcook opened a new issue, #44474: URL: https://github.com/apache/arrow/issues/44474

### Describe the enhancement requested

Currently the Apache Arrow project descriptions that appear prominently at the top of the website and GitHub repo do not match and have not been updated in quite some time. The description on the website is:

> A cross-language development platform for in-memory analytics

and the description on GitHub is:

> A multi-language toolbox for accelerated data interchange and in-memory processing

Given the immense growth in the adoption of Arrow since we last updated these descriptions, and the current status of the Arrow format as a de facto standard with no directly comparable alternatives, I think it would be appropriate for us to be somewhat bolder in how we introduce the project. I also think the description should mention that Arrow is a _format_ in addition to a toolbox, and that we should prefer simpler words ("fast" over "accelerated"; "toolbox" over "development platform"). Following this rationale, I propose that we change the description on both the website and GitHub to:

> The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

Thoughts?

### Component(s)

Website
[I] `join`ing tables with ExtensionArrays [arrow]
NellyWhads opened a new issue, #44473: URL: https://github.com/apache/arrow/issues/44473

### Describe the enhancement requested

I'm looking for documentation on how to implement an ExtensionArray which supports `join` functionality. In particular, I'd like to join a table that includes a `FixedShapeTensorArray` column with another table. Here's a simple example which does not work:

```python
import numpy as np
import pyarrow as pa

# First dim is the batch dim
tensors = np.arange(3 * 10 * 10).reshape((3, 10, 10)).astype(np.uint8)
tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(tensors)

ids = pa.array([1, 2, 3], type=pa.uint8())
table = pa.Table.from_arrays([ids, tensor_array], names=["id", "tensor"])
print(table.schema)

classes = pa.array(["one", "two", "three"], type=pa.string())
table_2 = pa.Table.from_arrays([ids, classes], names=["id", "name"])
print(table_2.schema)

table.join(table_2, keys=["id"], join_type="full outer")
```

This raises the error:

```
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[42], line 1
----> 1 table.join(table_2, keys=["id"], join_type="full outer")

File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/table.pxi:5570, in pyarrow.lib.Table.join()

File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/acero.py:247, in _perform_join(join_type, left_operand, left_keys, right_operand, right_keys, left_suffix, right_suffix, use_threads, coalesce_keys, output_type)
    242 projection = Declaration(
    243     "project", ProjectNodeOptions(projections, projected_col_names)
    244 )
    245 decl = Declaration.from_sequence([decl, projection])
--> 247 result_table = decl.to_table(use_threads=use_threads)
    249 if output_type == Table:
    250     return result_table

File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/_acero.pyx:590, in pyarrow._acero.Declaration.to_table()

File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Data type extension is not supported in join non-key field tensor
```

How can I make this work? The individual tensors I want to store are rather small (single-digit dimensions), but the join may lead to list aggregation of a few hundred rows. I've tagged this as a Python question because I don't know what level of the API needs to be adjusted to add this functionality.

### Component(s)

Python
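Until Acero supports extension types in non-key join fields, one possible workaround (a sketch, not an official API; it assumes the `table` and `table_2` from the example above and a PyArrow version that allows casting an extension type to its storage type) is to join on the plain storage representation and re-wrap the extension type afterwards:

```python
# Workaround sketch: Acero rejects extension-typed payload columns, so
# cast the tensor column to its storage type (a fixed-size list) before
# the join, then restore the extension type on the result.
import pyarrow as pa

tensor_type = table.schema.field("tensor").type   # the FixedShapeTensorType
idx = table.schema.get_field_index("tensor")

# 1. Replace the extension column with its storage representation.
plain = table.set_column(
    idx, "tensor", table.column("tensor").cast(tensor_type.storage_type)
)

# 2. The join now only sees storage types.
joined = plain.join(table_2, keys=["id"], join_type="full outer")

# 3. Re-wrap each chunk's storage in the original extension type.
out_idx = joined.schema.get_field_index("tensor")
rewrapped = pa.chunked_array(
    [pa.ExtensionArray.from_storage(tensor_type, chunk)
     for chunk in joined.column("tensor").chunks]
)
joined = joined.set_column(out_idx, "tensor", rewrapped)
print(joined.schema)
```

Rows introduced by the full outer join carry nulls in the tensor column; they should survive the re-wrap, since the validity bitmap lives on the fixed-size-list storage array.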
[I] [Java][FlightSQL] Column Duplication When Selecting from an Empty Table in Arrow Flight SQL JDBC Driver [arrow]
mingnuj opened a new issue, #44467: URL: https://github.com/apache/arrow/issues/44467

### Describe the bug, including details regarding any error messages, version, and platform.

I encountered an issue when working with Arrow Flight SQL, and I would appreciate your help. I created a test table with a single column:

```sql
CREATE TABLE default.public.test_table (col1 int);
```

When I run a `SELECT *` query without inserting any data into the table, the columns appear duplicated. I get the following output:



I am developing the database myself, using Rust and connecting through Flight SQL. When a SELECT query targets the empty table, I handle it by returning a FlightInfo with an empty endpoint from the `get_flight_info` method. This did not occur in Arrow Flight SQL JDBC Driver versions prior to 15.0.0, but it started happening with version 15.0.0.

### Component(s)

Java
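For reference, here is a minimal sketch of the "empty result" pattern described above, written with `pyarrow.flight` purely for illustration (the reporter's server is in Rust; the command bytes and the Arrow type chosen for `int` are assumptions): a `FlightInfo` whose endpoint list is empty, so the JDBC driver should derive its result-set columns from the schema alone.

```python
# Illustrative sketch only: construct a FlightInfo with no endpoints,
# the shape of response the reporter returns from get_flight_info.
import pyarrow as pa
import pyarrow.flight as flight

schema = pa.schema([pa.field("col1", pa.int32())])  # assumed type for SQL `int`
descriptor = flight.FlightDescriptor.for_command(
    b"SELECT * FROM default.public.test_table"
)

# An empty endpoint list means there is nothing to fetch; the driver
# should report exactly one "col1" column from this schema, so a
# duplicated column suggests the schema is being merged twice somewhere.
info = flight.FlightInfo(schema, descriptor, [], total_records=0, total_bytes=0)
print(info.schema)  # one field: col1
```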