stevendavis opened a new issue, #41890:
URL: https://github.com/apache/arrow/issues/41890

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Tested with:
   
   - pyarrow 11, pandas 1.5
   - pyarrow 16, pandas 2.0
   
   This Python code reproduces the problem:
   
   ```python
   import random
   import pandas as pd
   
   NUM_ROWS = 1_000_000
   MAX_WORDS_PER_SENTENCE = 1000
   NUM_PRE_DEFINED_SENTENCES = 1000
   
   WORDS = """
   story where fish and children are alarmed when thing one and thing two
   and the cat in the big red hat wreck the house while mother is away
   """.split()
   
   def sentence(num_words):
       return " ".join(random.choices(WORDS, k=num_words))
   
   PRE_DEFINED_SENTENCES = [
       sentence(random.randint(1, MAX_WORDS_PER_SENTENCE))
       for s in range(NUM_PRE_DEFINED_SENTENCES)
   ]
   
   text_list = random.choices(PRE_DEFINED_SENTENCES, k=NUM_ROWS)
   
   df = pd.DataFrame({"text": text_list})
   df["len"] = df.text.str.len()
   print(df)
   print("\nMemory usage:")
   print(df.memory_usage(deep=True))
   
   df.text = df.text.astype("string[pyarrow]")
   
   contains_fish = df.text.str.contains("fish")
   
   df_fish = df.loc[contains_fish]
   ```
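A rough back-of-envelope (an estimate, not measured) shows why this dataset is large enough to trigger the error: `randint(1, 1000)` averages ~500 words per sentence, and the words in `WORDS` average ~4 characters plus a separator, so the text column carries well over 2 GiB of character data:

```python
# Rough estimate of the string payload produced by the repro script above.
NUM_ROWS = 1_000_000
avg_words = 1000 / 2       # randint(1, 1000) averages ~500 words per sentence
avg_word_bytes = 5         # ~4 chars per word in WORDS, plus one space
approx_bytes = int(NUM_ROWS * avg_words * avg_word_bytes)
print(f"~{approx_bytes / 2**30:.1f} GiB")  # prints "~2.3 GiB"
# This exceeds the 2 GiB (2**31 - 1 byte) range of a signed 32-bit offset.
assert approx_bytes > 2**31 - 1
```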
   
   This is the resulting stack trace:
   
   ```
   Traceback (most recent call last):
     File "test_arrow.py", line 30, in <module>
       df_fish = df.loc[contains_fish]
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
       return self._getitem_axis(maybe_callable, axis=axis)
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1292, in _getitem_axis
       return self._getbool_axis(key, axis=axis)
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1093, in _getbool_axis
       return self.obj._take_with_is_copy(inds, axis=axis)
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/generic.py", line 3902, in _take_with_is_copy
       result = self._take(indices=indices, axis=axis)
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/generic.py", line 3886, in _take
       new_data = self._mgr.take(
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 978, in take
       return self.reindex_indexer(
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 751, in reindex_indexer
       new_blocks = [
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 752, in <listcomp>
       blk.take_nd(
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1775, in take_nd
       new_values = self.values.take(indexer, fill_value=fill_value, allow_fill=True)
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pandas/core/arrays/arrow/array.py", line 725, in take
       return type(self)(self._data.take(indices))
     File "pyarrow/table.pxi", line 1001, in pyarrow.lib.ChunkedArray.take
     File "/app/conda_envs/myvenv/lib/python3.8/site-packages/pyarrow/compute.py", line 473, in take
       return call_function('take', [data, indices], options, memory_pool)
     File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
     File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
   ```
   
   ### Component(s)
   
   Python

