[I] RuntimeWarning: Python binding for CumulativeOptions not exposed [arrow]

2023-12-11 Thread via GitHub


nbro10 opened a new issue, #39169:
URL: https://github.com/apache/arrow/issues/39169

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Settings
   
   - Python: 3.7.17
   - OS: macOS Sonoma [14.1.2 (23B92)]
   - I did `brew install cmake`
   - I did `brew install apache-arrow`
   - pyarrow version: 12.0.1
   - pandas version: 1.3.5
   
   Logs
   
   ```
   ...
   import pandas as pd

   /Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pandas/__init__.py:50: in <module>
       from pandas.core.api import (
   /Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pandas/core/api.py:29: in <module>
       from pandas.core.arrays import Categorical
   /Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pandas/core/arrays/__init__.py:20: in <module>
       from pandas.core.arrays.string_arrow import ArrowStringArray
   /Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pandas/core/arrays/string_arrow.py:65: in <module>
       import pyarrow.compute as pc
   /Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pyarrow/compute.py:331: in <module>
       _make_global_functions()
   /Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pyarrow/compute.py:328: in _make_global_functions
       g[cpp_name] = g[name] = _wrap_function(name, func)
   /Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pyarrow/compute.py:287: in _wrap_function
       options_class = _get_options_class(func)
   /Users/me/Library/Caches/pypoetry/virtualenvs/myproj-kIMGt_mS-py3.7/lib/python3.7/site-packages/pyarrow/compute.py:207: in _get_options_class
       .format(class_name), RuntimeWarning)
   E   RuntimeWarning: Python binding for CumulativeOptions not exposed
   ```
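
   This warning usually points at version skew between the `pyarrow` bindings and the Arrow C++ library they load (here, pyarrow 12.0.1 next to a brew-installed `apache-arrow`): the C++ side declares an options class that the Python side doesn't expose yet. As a stopgap (a sketch of a workaround, not a fix), the warning can be filtered before `pyarrow.compute` is first imported:
   ```python
   import warnings

   # The warning fires while pyarrow.compute builds its function table, so the
   # filter must be in place before the module's first import.
   with warnings.catch_warnings():
       warnings.simplefilter("ignore", RuntimeWarning)
       import pyarrow.compute as pc

   import pyarrow as pa
   print(pa.__version__)  # compare against the Arrow version brew installed
   ```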
   
   ### Component(s)
   
   Python





[I] Improve error message when Flight Core TestTls fails [arrow]

2023-12-11 Thread via GitHub


jbonofre opened a new issue, #39170:
URL: https://github.com/apache/arrow/issues/39170

   ### Describe the enhancement requested
   
   As discussed on the mailing list, when we forget to run `git submodule update --init --recursive`, `cert0.pem` and the other files are not present in the `testing/data` folder.
   As a result `TestTls` fails, but it's not obvious that the failure is related to the missing step from https://arrow.apache.org/docs/dev/developers/java/building.html#building.

   I propose displaying a "dev friendly" message ;)
   
   ### Component(s)
   
   Java





Re: [I] [C++] Make Buffer::device_type_ non-optional [arrow]

2023-12-11 Thread via GitHub


felipecrv closed issue #39159: [C++] Make Buffer::device_type_ non-optional 
URL: https://github.com/apache/arrow/issues/39159





[I] ArrowNotImplementedError('Unsupported cast from list> to utf8 using function cast_string') [arrow]

2023-12-11 Thread via GitHub


anemohan opened a new issue, #39172:
URL: https://github.com/apache/arrow/issues/39172

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hello, I use a schema file that defines all the column types. One of my columns holds a list of dicts, and when I try to read the dataset with this schema I get an error stating:
   
   ```
   The original error is below:

   ArrowNotImplementedError('Unsupported cast from list> to utf8 using function cast_string')

   Traceback:
   -
     File "/opt/conda/lib/python3.10/site-packages/dask/dataframe/utils.py", line 193, in raise_on_meta_error
       yield
     File "/opt/conda/lib/python3.10/site-packages/dask/dataframe/core.py", line 6897, in _emulate
       return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
     File "/opt/conda/lib/python3.10/site-packages/intakewrapper/MHParquetSource.py", line 67, in __call__
       df = pd.read_parquet(
     File "/opt/conda/lib/python3.10/site-packages/pandas/io/parquet.py", line 509, in read_parquet
       return impl.read(
     File "/opt/conda/lib/python3.10/site-packages/pandas/io/parquet.py", line 227, in read
       pa_table = self.api.parquet.read_table(
     File "/opt/conda/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
       return dataset.read(columns=columns, use_threads=use_threads,
     File "/opt/conda/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2443, in read
       table = self._dataset.to_table(
     File "pyarrow/_dataset.pyx", line 304, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 2549, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
   ```
   
   Can someone help me with this?
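
   A minimal sketch that reproduces this class of error, assuming the schema declares the column as `utf8` while the data actually holds a list of structs (the column name and struct field below are made up):
   ```python
   import pyarrow as pa

   # A column that is a list of dicts (-> list<struct<...>>) in the data.
   tbl = pa.table({"col": [[{"k": "a"}], [{"k": "b"}]]})

   try:
       tbl["col"].cast(pa.string())  # what a utf8 schema entry forces
   except pa.ArrowNotImplementedError as e:
       print(e)  # Unsupported cast from list<...> to utf8 using function cast_string

   # Declaring the type the data actually has avoids the cast entirely:
   schema = pa.schema([("col", pa.list_(pa.struct([("k", pa.string())])))])
   ```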
   
   ### Component(s)
   
   Python





[I] reflect.SliceHeader is deprecated as of go-1.20, unsafe.SliceData is recommended instead. [arrow]

2023-12-11 Thread via GitHub


dr2chase opened a new issue, #39181:
URL: https://github.com/apache/arrow/issues/39181

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   `bitutils.go` and `type_traits_*.go` cast a slice address to a `*reflect.SliceHeader` to extract the pointer data, but use of `reflect.SliceHeader` is deprecated as of go-1.20; `unsafe.SliceData` is recommended instead. There is no user-visible behavior that depends on this, although the new idioms tend to result in slightly better code.

   This is not an urgent bug, but the fix is also not complex.
   
   ### Component(s)
   
   Go





Re: [I] [Python] Function 'dictionary_encode' fails with DictionaryArray input (for compute kernel / ChunkedArray method) [arrow]

2023-12-11 Thread via GitHub


bkietz closed issue #34890: [Python] Function 'dictionary_encode' fails with 
DictionaryArray input (for compute kernel / ChunkedArray method)
URL: https://github.com/apache/arrow/issues/34890





[I] Hex decoding strings/allow casting strings to UUIDs and vice-versa [arrow]

2023-12-11 Thread via GitHub


shenker opened a new issue, #39183:
URL: https://github.com/apache/arrow/issues/39183

   ### Describe the enhancement requested
   
   Continuing discussion from https://github.com/apache/arrow/issues/15058. 
Relevant after @rok's PR https://github.com/apache/arrow/pull/37298 lands.
   
   Two ways of doing this come to mind. Thoughts?
   
   **Option 1: Allow string-UUID casting** 
   ```python
   import pyarrow as pa
   str_ary = pa.array(["bf1e9e1f-feab-4230-8a37-0240ccbefe8a",
                       "189a6934-99ef-4a70-b10f-cbb5ef178373"], pa.string())
   uuid_ary = str_ary.cast(pa.uuid())
   str_ary_roundtrip = uuid_ary.cast(pa.string())
   ```
   
   The canonical string representation of UUIDs contains `-`'s, but it's not unusual to see them omitted, so my proposal would be to handle the cases where the string is length 36 (`-`'s included) or length 32 (no `-`'s), and to error if the string is in any other format. For the rare cases where strings have whitespace or other delimiters, it should be left up to the user to use string operations to convert them into one of the two accepted formats.

   For casting UUIDs back to strings, I'm not sure if there's a way (or if it's important enough to bother with) to let the user specify which of those two formats they prefer, so I'd propose that UUIDs cast to strings should include the `-`'s. Or a flag could be added to `CastOptions`.
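
   For context, Python's stdlib `uuid` already behaves this way for parsing: it accepts both the 36-character dashed and the 32-character undashed form, and renders the dashed form canonically (this is just an illustration of the semantics, not pyarrow API):
   ```python
   import uuid

   # Both accepted input formats parse to the same value:
   a = uuid.UUID("bf1e9e1f-feab-4230-8a37-0240ccbefe8a")
   b = uuid.UUID("bf1e9e1ffeab42308a370240ccbefe8a")
   assert a == b

   # The canonical string form includes the dashes:
   print(str(a))  # bf1e9e1f-feab-4230-8a37-0240ccbefe8a
   ```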
   
   **Option 2: Implement general hex-encoding and -decoding functions**
   
   Here we implement the general operation of casting hex-encoded strings to 
binary data and vice-versa.
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   str_ary = pa.array(["bf1e9e1f-feab-4230-8a37-0240ccbefe8a",
                       "189a6934-99ef-4a70-b10f-cbb5ef178373"], pa.string())
   # ignore_chars is a string of characters to silently skip when parsing
   # hex-encoded strings; raise an error if we see any unexpected characters
   bin_ary = pc.decode_hex(str_ary, ignore_chars="-")  # will be type pa.binary(), variable-length binary
   bin_fixed_length_ary = bin_ary.cast(pa.binary(16))  # not sure if this should be required or not
   uuid_ary = bin_fixed_length_ary.cast(pa.uuid())
   str_ary_nodashes = pc.encode_hex(uuid_ary.cast(pa.binary(16)))  # -> pa.string()
   ```
   
   You could get the final UUID string with `-`'s from `str_ary_nodashes` using existing string operations (see the sketch after the next block), but it might be better to just have a convenience function `pc.encode_uuid` that does the hex encoding and adds the dashes at the same time:
   ```python
   str_ary_roundtrip = pc.encode_uuid(uuid_ary.cast(pa.binary(16)))  # -> pa.string()
   ```
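
   For reference, a sketch of the existing-string-operations route using compute functions that exist today (`utf8_slice_codeunits` and `binary_join_element_wise`); the input array is a made-up example:
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   # Hypothetical input: 32-character hex strings without dashes.
   nodash = pa.array(["bf1e9e1ffeab42308a370240ccbefe8a"])

   # Slice out the 8-4-4-4-12 groups and rejoin them with "-".
   parts = [
       pc.utf8_slice_codeunits(nodash, 0, 8),
       pc.utf8_slice_codeunits(nodash, 8, 12),
       pc.utf8_slice_codeunits(nodash, 12, 16),
       pc.utf8_slice_codeunits(nodash, 16, 20),
       pc.utf8_slice_codeunits(nodash, 20, 32),
   ]
   dashed = pc.binary_join_element_wise(*parts, "-")
   print(dashed[0].as_py())  # bf1e9e1f-feab-4230-8a37-0240ccbefe8a
   ```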
   
   
   
   ### Component(s)
   
   C++, Python





Re: [I] [Java] Improve error message when Flight Core TestTls fails [arrow]

2023-12-11 Thread via GitHub


lidavidm closed issue #39170: [Java] Improve error message when Flight Core 
TestTls fails
URL: https://github.com/apache/arrow/issues/39170





Re: [I] ArrowNotImplementedError('Unsupported cast from list> to utf8 using function cast_string') [arrow]

2023-12-11 Thread via GitHub


anemohan closed issue #39172: ArrowNotImplementedError('Unsupported cast from 
list> to utf8 using function cast_string')
URL: https://github.com/apache/arrow/issues/39172





[I] Creating S3FileSystem using env AWS_PROFILE without profile config entry extremely slow [arrow]

2023-12-11 Thread via GitHub


glindstr opened a new issue, #39184:
URL: https://github.com/apache/arrow/issues/39184

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   There appears to be a bug slowing down the creation of an S3FileSystem object when an AWS_PROFILE is supplied via environment variable: if that profile doesn't have a corresponding .aws/config entry, construction hangs. I'm using Python, but the problematic code may be in the underlying C++ library.
   
   .aws/credentials
   ```
   [profile_name_with_config]
   aws_access_key_id = ...
   aws_secret_access_key = ...
   [profile_name_without_config]
   aws_access_key_id = ...
   aws_secret_access_key = ...
   ```
   
   .aws/config
   ```
   [profile profile_name_with_config]
   region = us-west-2
   output = json
   ```
   
   fresh python 3.12.1 environment
   ```
   numpy   1.26.2
   pip 23.2.1
   pyarrow 14.0.1
   ```
   
   test code
   ```python
   from pyarrow.fs import S3FileSystem
   import time as t
   import os
   
   os.environ["AWS_PROFILE"] = "profile_name_with_config"
   
   tic = t.perf_counter()
   fs = S3FileSystem()
   toc = t.perf_counter()
   print(f"S3FileSystem where config exists {toc - tic:0.4f} seconds")
   #S3FileSystem where config exists 0.0071 seconds
   
   os.environ["AWS_PROFILE"] = "profile_name_without_config"
   
   tic = t.perf_counter()
   fs = S3FileSystem()
   toc = t.perf_counter()
   print(f"S3FileSystem where config does not exist {toc - tic:0.4f} seconds")
   #S3FileSystem where config does not exist 24.0123 seconds
   ``` 
   
   It takes S3FileSystem 24 additional seconds to complete when no config entry 
is provided. A pretty surprising result.
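
   A possible mitigation while this is investigated, assuming (unconfirmed) that the delay comes from the AWS SDK falling back to network-based region discovery when the profile has no `region` configured: pass the region explicitly so no lookup is needed.
   ```python
   from pyarrow.fs import S3FileSystem

   # Assumption: the ~24 s hang is region discovery. Supplying the region
   # up front skips that step for profiles without a .aws/config entry.
   fs = S3FileSystem(region="us-west-2")
   ```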
   
   Thank you for reviewing.
   
   ### Component(s)
   
   C++, Python





[I] [C++] Compiler warnings with clang + `-Wconversion -Wno-sign-conversion` in public headers [arrow]

2023-12-11 Thread via GitHub


paleolimbot opened a new issue, #39185:
URL: https://github.com/apache/arrow/issues/39185

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   A number of R-specific warnings also occur, but a few come from the C++ library's public headers. I am not sure whether this compiler warning scheme is meaningful to adhere to for the entire C++ library; however, it seems reasonable to add the requisite static casts in headers that are included by other projects.
   
   ### Component(s)
   
   C++





[I] [Java] Bump com.h2database:h2 from 1.4.196 to 2.2.224 [arrow]

2023-12-11 Thread via GitHub


danepitkin opened a new issue, #39189:
URL: https://github.com/apache/arrow/issues/39189

   ### Describe the enhancement requested
   
   This was raised by dependabot, but requires several code changes before 
merging.
   
   ### Component(s)
   
   Java





[I] Uncaught `std::bad_alloc` exception when `group_by` very big `large_utf8` columns [arrow]

2023-12-11 Thread via GitHub


Nathan-Fenner opened a new issue, #39190:
URL: https://github.com/apache/arrow/issues/39190

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When a `pyarrow.Table` contains very large rows whose size is very close to `2**31 - 1`, a segfault or an allocator exception can be raised when performing a `group_by` on very big `large_utf8` columns:
   
   ```py
   import pyarrow as pa
   
   # MAX_SIZE is the largest value that can fit in a 32-bit signed integer.
   MAX_SIZE = int(2**31) - 1
   
   # Create a string whose length is very close to MAX_SIZE:
   BIG_STR_LEN = MAX_SIZE - 1
   print(f"{BIG_STR_LEN=} = 2**31 - {2**31 - BIG_STR_LEN}")
   BIG_STR = "A" * BIG_STR_LEN
   
   # Create a record batch with two rows, both containing the BIG_STR in each 
of their columns:
   record_batch = pa.RecordBatch.from_pydict(
       mapping={
           "id": [BIG_STR, BIG_STR],
           "other": [BIG_STR, BIG_STR],
       },
       schema=pa.schema(
           {
               "id": pa.large_utf8(),
               "other": pa.large_utf8(),
           }
       ),
   )
   
   # Create a table containing just the one RecordBatch:
   table = pa.Table.from_batches([record_batch])
   
   # Attempt to group by `id`:
   ans = table.group_by(["id"]).aggregate([("other", "max")])
   print(ans)
   ```
   
   On my M1 Mac, the output from running this program looks like:
   
   **Pyarrow version: 14.0.1**
   
   ```
   BIG_STR_LEN=2147483646 = 2**31 - 2
   libc++abi: terminating due to uncaught exception of type std::bad_alloc: std::bad_alloc
   zsh: abort  python main.py
   ```
   
   (In the previous version, pyarrow==10.0.1, this was a segfault instead of just a bad_alloc exception:)
   ```
   BIG_STR_LEN=2147483642 = 2**31 - 2
   zsh: segmentation fault  python main.py
   ```
   
   ---
   
   I need to emphasize that there is more than enough memory to satisfy this 
operation. The problem is actually caused by integer overflow; I believe in 
one/both of the following places:
   
   - In [`VarLengthKeyEncoder::AddLength`](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/cpp/src/arrow/compute/kernels/row_encoder_internal.h#L134) there is no check that the size of the offset does not cause the length of the buffer to overflow an `int32_t`.
   - In [`GrouperImpl::Consume`](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/cpp/src/arrow/compute/row/grouper.cc#L433-L441) there is no check that sums of the `offsets_batch` do not overflow an `int32_t`.
   
   Overflow in signed integer arithmetic is undefined behavior in C++, but 
typically results in "wrap-around". The result is that we're getting a negative 
`int32_t` value.
   
   Then, when we construct

   ```cpp
   std::vector<uint8_t> key_bytes_batch(total_length);
   ```

   the `total_length` is converted from `int32_t` to `uint64_t` (since `std::vector`'s length constructor accepts a `size_t`, which is `uint64_t` on most modern computers). The conversion goes like this:
   
   ```
   int32_t(-1)  ==>  int64_t(-1)  ==>  uint64_t(2**64 - 1)
   ```
   
   But `2**64 - 1` bytes is obviously more memory than is available on my 
computer. The overflow needs to be detected sooner to prevent this 
excessively-large number from being used as an impossible allocation request.
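
   A small sketch of the wrap-around arithmetic described above, using NumPy's fixed-width integers to mimic the C++ `int32_t` behavior (NumPy wraps where C++ is formally undefined, but the bit pattern matches what is observed here):
   ```python
   import numpy as np

   BIG_STR_LEN = np.int32(2**31 - 2)

   # Two rows of this length accumulated into an int32 total wrap negative:
   with np.errstate(over="ignore"):
       total = np.int32(BIG_STR_LEN + BIG_STR_LEN)
   print(total)  # -4

   # Widened to int64 and reinterpreted as an unsigned size_t, the negative
   # total becomes an impossibly large allocation request:
   print(np.uint64(np.int64(total)))  # 18446744073709551612
   ```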
   
   ### Component(s)
   
   Python





[I] arrow R package: support for stringr::str_replace_all() incomplete [arrow]

2023-12-11 Thread via GitHub


abfleishman opened a new issue, #39191:
URL: https://github.com/apache/arrow/issues/39191

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   The [arrow R package claims to support the `stringr::str_replace_all()` function](https://arrow.apache.org/docs/dev/r/reference/acero.html#stringr); however, it does not support using a named vector of patterns and replacements as [the function says it should.](https://stringr.tidyverse.org/reference/str_replace.html)
   ```r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   library(dplyr)
   #> 
   #> Attaching package: 'dplyr'
   #> The following objects are masked from 'package:stats':
   #> 
   #>     filter, lag
   #> The following objects are masked from 'package:base':
   #> 
   #>     intersect, setdiff, setequal, union
   library(stringr)

   dat <- data.frame(common_name = c("Cobalt-rumped Parrotlet",
                                     "Striped Woodcreeper",
                                     "Allard's Ground Cricket",
                                     "Southern Double-collared Sunbird"))
   dat_arrow <- arrow_table(dat)

   # if working with a normal data.frame the following code works
   dat_df <- dat %>%
     mutate(common_name_clean = str_replace_all(common_name, c(" |-" = "_", "'" = "")))

   dat_df
   #>                        common_name                common_name_clean
   #> 1          Cobalt-rumped Parrotlet          Cobalt_rumped_Parrotlet
   #> 2              Striped Woodcreeper              Striped_Woodcreeper
   #> 3          Allard's Ground Cricket           Allards_Ground_Cricket
   #> 4 Southern Double-collared Sunbird Southern_Double_collared_Sunbird

   # but if working with an arrow data.frame it says I have to collect first
   dat_arrow <- dat_arrow %>%
     mutate(common_name_clean = str_replace_all(common_name, c(" |-" = "_", "'" = "")))
   #> Warning: Expression str_replace_all(common_name, c(` |-` = "_", `'` = "")) not
   #> supported in Arrow; pulling data into R

   Created on 2023-12-11 with reprex v2.0.2
   ```
   
   ### Component(s)
   
   R

