[I] [CI][C++] Use a separated Docker image for Emscripten [arrow]

2024-10-18 Thread via GitHub


kou opened a new issue, #44471:
URL: https://github.com/apache/arrow/issues/44471

   ### Describe the enhancement requested
   
   emsdk is not a small addition to `ubuntu-22.04-cpp.dockerfile`. 
`ubuntu-22.04-cpp.dockerfile` is shared by several images, so it should be kept 
as small as possible.
   
   ### Component(s)
   
   C++, Continuous Integration


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Memory Leak when Importing Parquet File with PyArrow Engine in Pandas [arrow]

2024-10-18 Thread via GitHub


Voltagabbana opened a new issue, #44472:
URL: https://github.com/apache/arrow/issues/44472

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   **Description**
   
   We've identified a memory leak when importing Parquet files into Pandas 
DataFrames using the PyArrow engine. The issue occurs specifically during the 
conversion from Arrow to Pandas objects, as memory is not released even after 
deleting the DataFrame and invoking garbage collection.
   
   **Key findings:**
   
   - **No leak with PyArrow alone:** When using PyArrow to read Parquet without 
converting to Pandas (i.e., no _.to_pandas()_), the memory leak does not occur.
   - **Leak with _.to_pandas()_:** The memory leak appears during the 
conversion from Arrow to Pandas, suggesting the problem is tied to this process.
   - **No issue with Fastparquet or Polars:** Fastparquet and Polars (even with 
PyArrow) do not exhibit this memory issue, reinforcing that the problem is in 
Pandas’ handling of Arrow data.
   
   **Reproduction Code**
   
   ```python
   # dataset_creation.py
   
   # just a fake dataset
   
   import os
   import random
   import string
   
   import numpy as np
   import pandas as pd
   
   np.random.seed(42)
   random.seed(42)
   
   def random_string(length):
       letters = string.ascii_letters
       return ''.join(random.choice(letters) for _ in range(length))
   
   num_rows = 10**6
   col_types = {
       'col1': lambda: random_string(10),
       'col2': lambda: np.random.randint(0, 1000),
       'col3': lambda: np.random.random(),
       'col4': lambda: random_string(5),
       'col5': lambda: np.random.randint(1, 1000),
       'col6': lambda: np.random.uniform(0, 100),
       'col7': lambda: random_string(8),
       'col8': lambda: np.random.random() * 1000,
       'col9': lambda: np.random.randint(0, 2),
       'col10': lambda: random_string(1000),
   }
   
   data = {col: [func() for _ in range(num_rows)] for col, func in col_types.items()}
   df = pd.DataFrame(data)
   df.to_parquet('random_dataset.parquet', index=True)
   
   file_size = os.path.getsize('random_dataset.parquet') / (1024**3)
   print(f"File size: {file_size:.2f} GB")
   ```
   
   
   ```python
   # memory_test.py
   
   import ctypes
   import gc
   
   import pandas as pd
   import polars as pl
   import psutil
   import pyarrow.parquet
   
   data_path = 'random_dataset.parquet'
   
   # To manually trigger memory release (glibc-specific)
   malloc_trim = ctypes.CDLL("libc.so.6").malloc_trim
   
   for i in range(10):
       df = pd.read_parquet(data_path, engine="pyarrow")
       # Also tested with:
       # df = pyarrow.parquet.read_pandas(data_path).to_pandas()
       # df = pl.read_parquet(data_path, use_pyarrow=True)
   
       del df  # Explicitly delete DataFrame
   
       for _ in range(3):  # Force garbage collection multiple times
           gc.collect()
   
       memory_info = psutil.virtual_memory()
       print(f"\n\nIteration number: {i}")
       print(f"Total Memory: {memory_info.total / (1024 ** 3):.2f} GB")
       print(f"Memory at disposal: {memory_info.available / (1024 ** 3):.2f} GB")
       print(f"Memory Used: {memory_info.used / (1024 ** 3):.2f} GB")
       print(f"Percentage of memory used: {memory_info.percent}%")
   
       # Calling malloc_trim(0) is the only way we found to release the memory
       malloc_trim(0)
   ```
   
   
![image](https://github.com/user-attachments/assets/f05bf547-4e4c-41cb-9f49-8f6e164d4cbd)
   
   
   **Observations:**
   
   - **Garbage Collection:** Despite invoking the garbage collector multiple 
times, memory allocated to the Python process keeps increasing when 
_.to_pandas()_ is used, indicating improper memory release during the 
conversion.
   - **Direct Use of PyArrow:** When we import the data directly using PyArrow 
(without converting to Pandas), the memory usage remains stable, showing that 
the problem originates in the Arrow-to-Pandas conversion process.
   - **Manual Memory Release (ctypes):** The only reliable way we have found to 
release the memory is by manually calling _malloc_trim(0)_ via ctypes. However, 
we believe this is not a proper solution and that memory management should be 
handled internally by Pandas.
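   
   A minimal diagnostic sketch along these lines (assuming Linux/glibc and the 
`random_dataset.parquet` file from above) can help tell an Arrow-side leak 
apart from allocator retention, by comparing Arrow's own memory pool against 
the process RSS:
   
   ```python
   # Hedged diagnostic sketch: if Arrow's pool drops back near zero while the
   # process RSS stays high, the pages are being retained by the C allocator
   # rather than leaked by Arrow itself.
   import ctypes
   import gc
   
   import pandas as pd
   import psutil
   import pyarrow as pa
   
   df = pd.read_parquet("random_dataset.parquet", engine="pyarrow")
   del df
   gc.collect()
   
   print(f"Arrow pool bytes: {pa.total_allocated_bytes()}")
   print(f"Process RSS: {psutil.Process().memory_info().rss / 1024**3:.2f} GB")
   
   # malloc_trim(0) asks glibc to return free heap pages to the OS.
   ctypes.CDLL("libc.so.6").malloc_trim(0)
   print(f"RSS after trim: {psutil.Process().memory_info().rss / 1024**3:.2f} GB")
   ```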
   
   **OS environment**
   
   _Icon name: computer-vm
   Chassis: vm
   Virtualization: microsoft
   Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
   CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
   Kernel: Linux 4.18.0-553.16.1.el8_10.x86_64
   Architecture: x86-64_
   
   **Conclusion**
   
   The issue seems to occur during the conversion from Arrow to Pandas, rather 
than being a problem within PyArrow itself. Given that memory is only released 
by manually invoking _malloc_trim(0)_, we suspect there is a problem with how 
PyArrow handles memory management when converting the data to Pandas. This issue 
does not arise when using the Fastparquet engine or Polars instead.

Re: [I] [C#] Security issue in JSON dependency [arrow]

2024-10-18 Thread via GitHub


raulcd closed issue #44463: [C#] Security issue in JSON dependency
URL: https://github.com/apache/arrow/issues/44463


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [R] Can't install `adbcflightsql` from CRAN [arrow-adbc]

2024-10-18 Thread via GitHub


paleolimbot closed issue #1647: [R] Can't install `adbcflightsql` from CRAN
URL: https://github.com/apache/arrow-adbc/issues/1647


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Website] Improve project description [arrow]

2024-10-18 Thread via GitHub


ianmcook opened a new issue, #44474:
URL: https://github.com/apache/arrow/issues/44474

   ### Describe the enhancement requested
   
   The Apache Arrow project descriptions that appear prominently at the top of 
the website and the GitHub repo do not match and have not been updated in quite 
some time. Currently the description on the website is:
   
   > A cross-language development platform for in-memory analytics
   
   and the description on GitHub is:
   
   > A multi-language toolbox for accelerated data interchange and in-memory 
processing
   
   Given the immense growth in the adoption of Arrow that has occurred since we 
last updated these descriptions, and the current status of the Arrow format as 
a de facto standard with no directly comparable alternatives, I think it would 
be appropriate for us to be somewhat bolder in how we introduce the project. I 
also think that the description should include some mention of the fact that 
Arrow is a _format_ in addition to a toolbox. And I think we should prefer 
simpler words ("fast" over "accelerated"; "toolbox" over "development platform").
   
   Following this rationale, I propose that we change the description on both 
the website and GitHub to:
   
   > The universal columnar format and multi-language toolbox for fast data 
interchange and in-memory analytics
   
   Thoughts?
   
   ### Component(s)
   
   Website


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] `join`ing tables with ExtensionArrays [arrow]

2024-10-18 Thread via GitHub


NellyWhads opened a new issue, #44473:
URL: https://github.com/apache/arrow/issues/44473

   ### Describe the enhancement requested
   
   I'm looking for documentation on how to implement an ExtensionArray which 
supports `join` functionality.
   
   In particular, I'd like to join a table that includes a 
`FixedShapeTensorArray` column with another table.
   
   Here's a simple example which does not work.
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   # First dim is the batch dim
   tensors = np.arange(3 * 10 * 10).reshape((3, 10, 10)).astype(np.uint8)
   tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(tensors)
   ids = pa.array([1,2,3], type=pa.uint8())
   table = pa.Table.from_arrays([ids, tensor_array], names=["id", "tensor"])
   print(table.schema)
   
   classes = pa.array(["one", "two", "three"], type=pa.string())
   table_2 = pa.Table.from_arrays([ids, classes], names=["id", "name"])
   print(table_2.schema)
   
   table.join(table_2, keys=["id"], join_type="full outer")
   ```
   
   This raises the error
   ```
   ---
   ArrowInvalid  Traceback (most recent call last)
   Cell In[42], line 1
   ----> 1 table.join(table_2, keys=["id"], join_type="full outer")
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/table.pxi:5570, in pyarrow.lib.Table.join()
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/acero.py:247, in _perform_join(join_type, left_operand, left_keys, right_operand, right_keys, left_suffix, right_suffix, use_threads, coalesce_keys, output_type)
       242 projection = Declaration(
       243     "project", ProjectNodeOptions(projections, projected_col_names)
       244 )
       245 decl = Declaration.from_sequence([decl, projection])
   --> 247 result_table = decl.to_table(use_threads=use_threads)
       249 if output_type == Table:
       250     return result_table
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/_acero.pyx:590, in pyarrow._acero.Declaration.to_table()
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
   
   ArrowInvalid: Data type extension is not supported in join non-key field tensor
   ```
   
   How can I make this work? The individual tensors I want to store are rather 
small (single-digit-dimensions), but the join may lead to list aggregation of a 
few hundred rows.
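   
   One workaround sketch I've considered (my own idea, not a confirmed API 
path: it assumes Acero only rejects the extension *type*, so dropping to the 
storage type before the join and casting back afterwards is acceptable):
   
   ```python
   import pyarrow as pa
   
   def join_with_extension_columns(left: pa.Table, right: pa.Table, keys, join_type="full outer"):
       # Replace every extension-typed column with its plain storage array
       # so Acero only sees types it supports in non-key fields.
       ext_types = {}
       for i, field in enumerate(left.schema):
           if isinstance(field.type, pa.BaseExtensionType):
               ext_types[field.name] = field.type
               storage = pa.chunked_array(
                   [chunk.storage for chunk in left.column(i).chunks],
                   type=field.type.storage_type,
               )
               left = left.set_column(i, field.name, storage)
   
       joined = left.join(right, keys=keys, join_type=join_type)
   
       # Re-wrap each storage column in its original extension type.
       for name, ext_type in ext_types.items():
           idx = joined.schema.get_field_index(name)
           joined = joined.set_column(idx, name, joined.column(idx).cast(ext_type))
       return joined
   
   joined = join_with_extension_columns(table, table_2, keys=["id"])
   print(joined.schema)
   ```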
   
   I've tagged this as a python question because I don't know what level of API 
needs to be adjusted to add this functionality.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Java][FlightSQL] Column Duplication When Selecting from no result record in Arrow Flight SQL JDBC Driver [arrow]

2024-10-18 Thread via GitHub


mingnuj opened a new issue, #44467:
URL: https://github.com/apache/arrow/issues/44467

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I encountered an issue when working with Arrow Flight SQL, and I would 
appreciate your help.
   
   I created a test_table with the following columns:
   ```sql
   CREATE TABLE default.public.test_table (col1 int);
   ```
   When I run a `SELECT *` query without inserting any data into the table, the 
columns appear duplicated.
   I get the following output:
   
![image](https://github.com/user-attachments/assets/efc99718-1cb2-4e63-92f7-cc74f1dee9d0)
   
   I am developing the database myself, using Rust and connecting through 
FlightSQL.
   When a `SELECT` query runs against the empty table, I handle it by returning 
a FlightInfo with an empty endpoint list from the `get_flight_info` method.
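   
   For reference, here is a small Python sketch (using `pyarrow.flight` to 
stand in for my Rust server; the command and ticket bytes are made-up 
placeholders) of two ways a server might describe an empty result:
   
   ```python
   import pyarrow as pa
   import pyarrow.flight as flight
   
   schema = pa.schema([pa.field("col1", pa.int32())])
   descriptor = flight.FlightDescriptor.for_command(b"SELECT * FROM test_table")
   
   # Variant A: no endpoints at all; the client learns the schema only from
   # the FlightInfo itself. This matches what my server currently returns.
   info_without_endpoints = flight.FlightInfo(schema, descriptor, [], 0, 0)
   
   # Variant B: a single endpoint whose stream carries the schema and no rows.
   ticket = flight.Ticket(b"empty-result")
   info_with_empty_stream = flight.FlightInfo(
       schema, descriptor, [flight.FlightEndpoint(ticket, [])], 0, 0
   )
   ```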
   
   This did not occur in Arrow Flight SQL JDBC Driver versions prior to 15.0.0, 
but it started happening with version 15.0.0.
   
   
   ### Component(s)
   
   Java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org