Tom-Newton opened a new issue, #93:
URL: https://github.com/apache/arrow-java/issues/93

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   So far I've only been able to reproduce this with `pyspark`, but I think 
the bug is probably on the Arrow side. The problem was introduced by 
https://github.com/apache/arrow/pull/15210, and reverting that change still 
fixes the problem on the 16.0.0 release.
   
   ### Reproduce
   The smallest reproducer I've found is the following. 
   
[reproduce_pyspark.py.txt](https://github.com/apache/arrow/files/15164159/reproduce_pyspark.py.txt)
  (it has a `.txt` extension because GitHub doesn't let me upload `.py` files)
   Versions:
   ```
   pandas==2.2.2
   pyspark==3.5.1
   pyarrow==16.0.0
   
   python 3.10.14 
   ```
   
   
   Error is:
   ```
   py4j.protocol.Py4JJavaError: An error occurred while calling o63.collectToPython.
   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 32) (192.168.1.222 executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
   
   Current thread 0x00007f3d6621a740 (most recent call first):
     File "/home/tomnewton/segault-venv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 188 in arrow_to_pandas
   ```
   
[full_stdout.txt](https://github.com/apache/arrow/files/15164379/full_stdout.txt)
   
   
   ### A few things I've noticed:
   1. Reproducible with various combinations of nullable nested arrays that may 
contain nulls, where the data is null at some level before the final layer of 
nesting.
   2. Adding a second row that is non-null avoids the problem. 
   3. I was unable to reproduce the problem just by creating a chunked array 
that looks the same and calling `to_pandas` on it. 
   4. I was unable to reproduce with an IPC stream created entirely in Python 
from a pyarrow table. 
   5. I was unable to reproduce with highly nested struct or map types. 
   
   
   ### Component(s)
   
   C++, Python, Java

