[I] RuntimeWarning: Python binding for CumulativeOptions not exposed [arrow]
nbro10 opened a new issue, #39169:
URL: https://github.com/apache/arrow/issues/39169

### Describe the bug, including details regarding any error messages, version, and platform.

Settings
- Python: 3.7.17
- OS: macOS Sonoma [14.1.2 (23B92)]
- I did `brew install cmake`
- I did `brew install apache-arrow`
- pyarrow version: 12.0.1
- pandas version: 1.3.5

Logs
```
...
import pandas as pd
/Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pandas/__init__.py:50: in
    from pandas.core.api import (
/Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pandas/core/api.py:29: in
    from pandas.core.arrays import Categorical
/Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pandas/core/arrays/__init__.py:20: in
    from pandas.core.arrays.string_arrow import ArrowStringArray
/Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pandas/core/arrays/string_arrow.py:65: in
    import pyarrow.compute as pc
/Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pyarrow/compute.py:331: in
    _make_global_functions()
/Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pyarrow/compute.py:328: in _make_global_functions
    g[cpp_name] = g[name] = _wrap_function(name, func)
/Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pyarrow/compute.py:287: in _wrap_function
    options_class = _get_options_class(func)
/Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pyarrow/compute.py:207: in _get_options_class
    .format(class_name), RuntimeWarning)
E   RuntimeWarning: Python binding for CumulativeOptions not exposed
```

### Component(s)

Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] Improve error message when Flight Core TestTls fails [arrow]
jbonofre opened a new issue, #39170:
URL: https://github.com/apache/arrow/issues/39170

### Describe the enhancement requested

As discussed on the mailing list, when we forget `git submodule update --init --recursive`, `cert0.pem` and the other files are not present in the `testing/data` folder. This means `TestTls` fails, but it's not obvious why; this relates to https://arrow.apache.org/docs/dev/developers/java/building.html#building. I propose to display a "dev friendly" message ;)

### Component(s)

Java
Re: [I] [C++] Make Buffer::device_type_ non-optional [arrow]
felipecrv closed issue #39159: [C++] Make Buffer::device_type_ non-optional
URL: https://github.com/apache/arrow/issues/39159
[I] ArrowNotImplementedError('Unsupported cast from list> to utf8 using function cast_string') [arrow]
anemohan opened a new issue, #39172:
URL: https://github.com/apache/arrow/issues/39172

### Describe the usage question you have. Please include as many useful details as possible.

Hello, I use a schema file defining all the column types. I have a column with a list of dicts, and when I try to read the dataset with this schema I get an error stating:

```
The original error is below:

ArrowNotImplementedError('Unsupported cast from list> to utf8 using function cast_string')

Traceback:
File "/opt/conda/lib/python3.10/site-packages/dask/dataframe/utils.py", line 193, in raise_on_meta_error
    yield
File "/opt/conda/lib/python3.10/site-packages/dask/dataframe/core.py", line 6897, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "/opt/conda/lib/python3.10/site-packages/intakewrapper/MHParquetSource.py", line 67, in __call__
    df = pd.read_parquet(
File "/opt/conda/lib/python3.10/site-packages/pandas/io/parquet.py", line 509, in read_parquet
    return impl.read(
File "/opt/conda/lib/python3.10/site-packages/pandas/io/parquet.py", line 227, in read
    pa_table = self.api.parquet.read_table(
File "/opt/conda/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
File "/opt/conda/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2443, in read
    table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 304, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2549, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
```

Can someone help me with this?

### Component(s)

Python
[I] reflect.SliceHeader is deprecated as of go-1.20, unsafe.SliceData is recommended instead. [arrow]
dr2chase opened a new issue, #39181:
URL: https://github.com/apache/arrow/issues/39181

### Describe the bug, including details regarding any error messages, version, and platform.

`bitutils.go` and `type_traits_*.go` cast a slice address to a `*reflect.SliceHeader` to extract the pointer data, but use of `reflect.SliceHeader` is deprecated as of go-1.20. No user-visible behavior depends on this, although the new idioms tend to result in slightly better code. This is not an urgent bug, but the fix is also not complex.

### Component(s)

Go
Re: [I] [Python] Function 'dictionary_encode' fails with DictionaryArray input (for compute kernel / ChunkedArray method) [arrow]
bkietz closed issue #34890: [Python] Function 'dictionary_encode' fails with DictionaryArray input (for compute kernel / ChunkedArray method)
URL: https://github.com/apache/arrow/issues/34890
[I] Hex decoding strings/allow casting strings to UUIDs and vice-versa [arrow]
shenker opened a new issue, #39183:
URL: https://github.com/apache/arrow/issues/39183

### Describe the enhancement requested

Continuing discussion from https://github.com/apache/arrow/issues/15058. Relevant after @rok's PR https://github.com/apache/arrow/pull/37298 lands. Two ways of doing this come to mind. Thoughts?

**Option 1: Allow string-UUID casting**

```python
import pyarrow as pa

str_ary = pa.array(
    ["bf1e9e1f-feab-4230-8a37-0240ccbefe8a", "189a6934-99ef-4a70-b10f-cbb5ef178373"],
    pa.string(),
)
uuid_ary = str_ary.cast(pa.uuid())
str_ary_roundtrip = uuid_ary.cast(pa.string())
```

The canonical string representation of UUIDs contains `-`'s, but it's not unusual to see them omitted, so my proposal would be to handle the cases where the string is length 36 (`-`'s included) or length 32 (no `-`'s), and error if the string is in any other format. For the rare cases where strings have whitespace or other delimiters, it should be left up to the user to use string operations to convert them into one of the two accepted formats. For casting UUIDs back to strings, I'm not sure if there's a way (or if it's important enough to bother with) to let the user specify which of those two formats they prefer, so I'd propose UUIDs cast to strings should include the `-`'s. Or a flag could be added to `CastOptions`.

**Option 2: Implement general hex-encoding and -decoding functions**

Here we implement the general operation of casting hex-encoded strings to binary data and vice-versa.
```python
import pyarrow as pa
import pyarrow.compute as pc

str_ary = pa.array(
    ["bf1e9e1f-feab-4230-8a37-0240ccbefe8a", "189a6934-99ef-4a70-b10f-cbb5ef178373"],
    pa.string(),
)
# ignore_chars is a string of characters to silently skip when parsing
# hex-encoded strings; raise an error if we see any unexpected characters
bin_ary = pc.decode_hex(str_ary, ignore_chars="-")  # will be type pa.binary(), variable-length binary
bin_fixed_length_ary = bin_ary.cast(pa.binary(16))  # not sure if this should be required or not
uuid_ary = bin_fixed_length_ary.cast(pa.uuid())
str_ary_nodashes = pc.encode_hex(uuid_ary.cast(pa.binary(16)))  # -> pa.string()
```

To get the final UUID string with `-`'s from `str_ary_nodashes`, you could do that with existing string operations, but it might be better to just have a convenience function `pc.encode_uuid` that does the hex encoding and adds dashes at the same time:

```python
str_ary_roundtrip = pc.encode_uuid(uuid_ary.cast(pa.binary(16)))  # -> pa.string()
```

### Component(s)

C++, Python
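Until something like `pc.decode_hex` exists, the proposed `ignore_chars` semantics can be sketched with the standard library (`decode_hex` below is a plain-Python stand-in for the proposed function, not an existing pyarrow API):

```python
import binascii

def decode_hex(strings, ignore_chars=""):
    # Strip the ignored characters, then hex-decode; binascii.unhexlify
    # raises binascii.Error on any other non-hex character or odd length,
    # matching the "error on unexpected characters" behavior proposed above.
    table = str.maketrans("", "", ignore_chars)
    return [binascii.unhexlify(s.translate(table)) for s in strings]

decoded = decode_hex(
    ["bf1e9e1f-feab-4230-8a37-0240ccbefe8a", "189a6934-99ef-4a70-b10f-cbb5ef178373"],
    ignore_chars="-",
)
assert all(len(b) == 16 for b in decoded)  # each fits pa.binary(16) / a UUID payload
```

Both the 36-character (dashed) and 32-character (dashless) forms decode to the same 16 bytes under this scheme, since the dashes are skipped before decoding.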
Re: [I] [Java] Improve error message when Flight Core TestTls fails [arrow]
lidavidm closed issue #39170: [Java] Improve error message when Flight Core TestTls fails
URL: https://github.com/apache/arrow/issues/39170
Re: [I] ArrowNotImplementedError('Unsupported cast from list> to utf8 using function cast_string') [arrow]
anemohan closed issue #39172: ArrowNotImplementedError('Unsupported cast from list> to utf8 using function cast_string')
URL: https://github.com/apache/arrow/issues/39172
[I] Creating S3FileSystem using env AWS_PROFILE without profile config entry extremely slow [arrow]
glindstr opened a new issue, #39184:
URL: https://github.com/apache/arrow/issues/39184

### Describe the bug, including details regarding any error messages, version, and platform.

There appears to be a bug slowing down the creation of an S3FileSystem object when supplying an AWS_PROFILE via environment variable: if that profile doesn't have a corresponding .aws/config entry, it hangs. I'm using Python, but the problematic code may be in the underlying C/C++ library.

.aws/credentials
```
[profile_name_with_config]
aws_access_key_id = ...
aws_secret_access_key = ...

[profile_name_without_config]
aws_access_key_id = ...
aws_secret_access_key = ...
```

.aws/config
```
[profile profile_name_with_config]
region = us-west-2
output = json
```

fresh python 3.12.1 environment
```
numpy   1.26.2
pip     23.2.1
pyarrow 14.0.1
```

test code
```python
from pyarrow.fs import S3FileSystem
import time as t
import os

os.environ["AWS_PROFILE"] = "profile_name_with_config"
tic = t.perf_counter()
fs = S3FileSystem()
toc = t.perf_counter()
print(f"S3FileSystem where config exists {toc - tic:0.4f} seconds")
# S3FileSystem where config exists 0.0071 seconds

os.environ["AWS_PROFILE"] = "profile_name_without_config"
tic = t.perf_counter()
fs = S3FileSystem()
toc = t.perf_counter()
print(f"S3FileSystem where config does not exist {toc - tic:0.4f} seconds")
# S3FileSystem where config does not exist 24.0123 seconds
```

It takes S3FileSystem 24 additional seconds to complete when no config entry is provided. A pretty surprising result. Thank you for reviewing.

### Component(s)

C++, Python
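As a possible mitigation while this is investigated, one could check up front whether the active profile has a `.aws/config` entry, and pass an explicit `region=` to `S3FileSystem` (an existing parameter) when it doesn't, which may avoid the slow lookup. A stdlib sketch of such a pre-flight check (the helper name and default path are illustrative assumptions):

```python
import configparser
from pathlib import Path

def profile_has_config_entry(profile: str, config_path: str = "~/.aws/config") -> bool:
    # AWS config files use INI syntax; non-default profiles appear as
    # "[profile <name>]" sections, the default profile as "[default]".
    cfg = configparser.ConfigParser()
    cfg.read(Path(config_path).expanduser())
    section = "default" if profile == "default" else f"profile {profile}"
    return cfg.has_section(section)
```

If this returns `False` for the profile in `AWS_PROFILE`, constructing `S3FileSystem(region="us-west-2")` with a known region would sidestep the region-resolution path that appears to be responsible for the delay.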
[I] [C++] Compiler warnings with clang + `-Wconversion -Wno-sign-conversion` in public headers [arrow]
paleolimbot opened a new issue, #39185:
URL: https://github.com/apache/arrow/issues/39185

### Describe the bug, including details regarding any error messages, version, and platform.

There are also a number of R-specific warnings that occur, but a few are in the C++ library in public headers. I am not sure if this is a compiler warning scheme that is meaningful to adhere to for the entire C++ library; however, it seems reasonable to add the requisite static casts in headers included by other projects.

### Component(s)

C++
[I] [Java] Bump com.h2database:h2 from 1.4.196 to 2.2.224 [arrow]
danepitkin opened a new issue, #39189:
URL: https://github.com/apache/arrow/issues/39189

### Describe the enhancement requested

This was raised by dependabot, but requires several code changes before merging.

### Component(s)

Java
[I] Uncaught `std::bad_alloc` exception when `group_by` very big `large_utf8` columns [arrow]
Nathan-Fenner opened a new issue, #39190:
URL: https://github.com/apache/arrow/issues/39190

### Describe the bug, including details regarding any error messages, version, and platform.

When a `pyarrow.Table` contains very large rows, whose size is very close to `2**31 - 1`, a segfault or allocator exception can be raised when performing a `group_by` on very big `large_utf8` columns:

```py
import pyarrow as pa

# MAX_SIZE is the largest value that can fit in a 32-bit signed integer.
MAX_SIZE = int(2**31) - 1

# Create a string whose length is very close to MAX_SIZE:
BIG_STR_LEN = MAX_SIZE - 1
print(f"{BIG_STR_LEN=} = 2**31 - {2**31 - BIG_STR_LEN}")
BIG_STR = "A" * BIG_STR_LEN

# Create a record batch with two rows, both containing the BIG_STR in each of their columns:
record_batch = pa.RecordBatch.from_pydict(
    mapping={
        "id": [BIG_STR, BIG_STR],
        "other": [BIG_STR, BIG_STR],
    },
    schema=pa.schema(
        {
            "id": pa.large_utf8(),
            "other": pa.large_utf8(),
        }
    ),
)

# Create a table containing just the one RecordBatch:
table = pa.Table.from_batches([record_batch])

# Attempt to group by `id`:
ans = table.group_by(["id"]).aggregate([("other", "max")])
print(ans)
```

On my M1 Mac, the output from running this program looks like:

**Pyarrow version: 14.0.1**
```
BIG_STR_LEN=2147483646 = 2**31 - 2
libc++abi: terminating due to uncaught exception of type std::bad_alloc: std::bad_alloc
zsh: abort      python main.py
```

(In the previous version pyarrow==10.0.1, this was a segfault instead of just a bad_alloc exception):
```
BIG_STR_LEN=2147483642 = 2**31 - 2
zsh: segmentation fault  python main.py
```

---

I need to emphasize that there is more than enough memory to satisfy this operation.
The problem is actually caused by integer overflow; I believe in one or both of the following places:

- In [`VarLengthKeyEncoder::AddLength`](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/cpp/src/arrow/compute/kernels/row_encoder_internal.h#L134) there is no check that the size of the offset does not cause the length of the buffer to overflow an `int32_t`
- In [`GrouperImpl::Consume`](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/cpp/src/arrow/compute/row/grouper.cc#L433-L441) there's no check that sums of the `offsets_batch` do not overflow an `int32_t`

Overflow in signed integer arithmetic is undefined behavior in C++, but typically results in "wrap-around". The result is that we're getting a negative `int32_t` value. Then, when we construct

```cpp
std::vector key_bytes_batch(total_length);
```

the `total_length` is converted from `int32_t` to `uint64_t` (since `std::vector`'s length constructor accepts a `size_t`, which is `uint64_t` on most modern computers). The conversion goes like this:

```
int32_t(-1) ==> int64_t(-1) ==> uint64_t(2**64 - 1)
```

But `2**64 - 1` bytes is obviously more memory than is available on my computer. The overflow needs to be detected sooner to prevent this excessively-large number from being used as an impossible allocation request.

### Component(s)

Python
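The wrap-around chain the reporter describes can be reproduced in isolation with `ctypes` (a sketch of the conversions, not Arrow's actual code path):

```python
import ctypes

# Summing two ~2**31 - 1 byte string lengths into a 32-bit signed
# accumulator pushes the total past INT32_MAX and wraps it negative.
total_length = ctypes.c_int32((2**31 - 1) + (2**31 - 1)).value
assert total_length < 0  # wrap-around gives -2

# int32 -> int64 is sign-extending, and int64 -> uint64 reinterprets the
# sign bit, just as when std::vector's size_t constructor receives it:
as_size_t = ctypes.c_uint64(ctypes.c_int64(total_length).value).value
assert as_size_t > 2**63  # an impossible allocation request
```

This is why the failure shows up as `std::bad_alloc` (or a segfault) rather than an overflow error: by the time the value reaches the allocator it is a legitimate-looking, astronomically large unsigned size.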
[I] arrow R package: support for stringr::str_replace_all() incomplete [arrow]
abfleishman opened a new issue, #39191:
URL: https://github.com/apache/arrow/issues/39191

### Describe the bug, including details regarding any error messages, version, and platform.

The [arrow R package claims to support the `stringr::str_replace_all()` function](https://arrow.apache.org/docs/dev/r/reference/acero.html#stringr); however, it does not support using a vector of patterns/replacements as [the function says it should.](https://stringr.tidyverse.org/reference/str_replace.html)

```r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(stringr)

dat <- data.frame(common_name = c("Cobalt-rumped Parrotlet",
                                  "Striped Woodcreeper",
                                  "Allard's Ground Cricket",
                                  "Southern Double-collared Sunbird"))
dat_arrow <- arrow_table(dat)

# if working with a normal data.frame the following code works
dat_df <- dat %>%
  mutate(common_name_clean = str_replace_all(common_name, c(" |-" = "_", "'" = "")))
dat_df
#>                        common_name                common_name_clean
#> 1          Cobalt-rumped Parrotlet          Cobalt_rumped_Parrotlet
#> 2              Striped Woodcreeper              Striped_Woodcreeper
#> 3          Allard's Ground Cricket           Allards_Ground_Cricket
#> 4 Southern Double-collared Sunbird Southern_Double_collared_Sunbird

# but if working with an arrow data.frame it says I have to collect first
dat_arrow <- dat_arrow %>%
  mutate(common_name_clean = str_replace_all(common_name, c(" |-" = "_", "'" = "")))
#> Warning: Expression str_replace_all(common_name, c(` |-` = "_", `'` = "")) not
#> supported in Arrow; pulling data into R
```

Created on 2023-12-11 with reprex v2.0.2

### Component(s)

R