sjperkins opened a new issue, #43696:
URL: https://github.com/apache/arrow/issues/43696

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   platform: Ubuntu 22.04, x86_64
   pyarrow: 17.0.0
   
   ```python
   import os
   import numpy as np
   import pyarrow as pa
   import pyarrow.dataset as pad
   import pyarrow.parquet as pq
   import tempfile
   
   id1 = np.arange(40)[:, None, None]
   id2 = np.arange(50)[None, :, None]
   id3 = np.arange(100)[None, None, :]
   cell_shape = (6, 15)
   
   id1, id2, id3 = map(np.ravel, np.broadcast_arrays(id1, id2, id3))
   nrow, = id3.shape
   data = pa.array(np.arange(nrow * np.prod(cell_shape)))
   data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-1])
   data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-2])
   assert len(data) == nrow
   T = pa.Table.from_pydict({"ID1": id1, "ID2": id2, "ID3": id3, "DATA": data})
   print(f"{T.nbytes / (1024.**2)}MB")
   
   with tempfile.TemporaryDirectory() as dir:
       # Succeeds
       pq.write_table(T, dir + os.path.sep + "test.parquet")
       print("Wrote parquet file")
   
   with tempfile.TemporaryDirectory() as dir:
       partition_fields = [T.schema.field(c) for c in ("ID1", "ID2")]
       partition = pad.partitioning(pa.schema(partition_fields), flavor="hive")
       # Segfaults
       pad.write_dataset(T, dir, partitioning=partition,
                       format="parquet")
   ```
   
   produces the following type of core dump:
   
   ```core
   Program terminated with signal SIGSEGV, Segmentation fault.
   #0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=139680878245440)
       at ./nptl/pthread_kill.c:44
   44      ./nptl/pthread_kill.c: No such file or directory.
   [Current thread is 1 (Thread 0x7f09fd213640 (LWP 211153))]
   (gdb) bt
   #0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=139680878245440)
       at ./nptl/pthread_kill.c:44
   #1  __pthread_kill_internal (signo=11, threadid=139680878245440)
       at ./nptl/pthread_kill.c:78
   #2  __GI___pthread_kill (threadid=139680878245440, signo=signo@entry=11)
       at ./nptl/pthread_kill.c:89
   #3  0x00007f0a34442476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
   #4  <signal handler called>
   #5  __memmove_avx_unaligned_erms ()
       at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:513
   #6  0x00007f0a2fd0c0b9 in arrow::compute::internal::FixedWidthTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) ()
      from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
   #7  0x00007f0a2fd20b2e in arrow::compute::internal::FSLTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) ()
      from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
   #8  0x00007f0a2fe0e8a3 in arrow::compute::detail::(anonymous namespace)::VectorExecutor::Exec(arrow::compute::ExecSpan const&, arrow::compute::detail::ExecListener*) ()
      from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
   #9  0x00007f0a2fe0ec42 in arrow::compute::detail::(anonymous namespace)::VectorExecutor::Execute(arrow::compute::ExecBatch const&, arrow::compute::detail::ExecListener*) ()
      from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
   #10 0x00007f0a2fe2597b in arrow::compute::detail::FunctionExecutorImpl::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, long) ()
   ```
   
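   The crash occurs in the `__memmove_avx_unaligned_erms` call (frame #5), which is reached from `FixedWidthTakeExec` via `FSLTakeExec`, so that call looks like the likely trigger. Also note that calling `pyarrow.parquet.write_table` on the same table succeeds.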
   
   `write_dataset` seems to succeed if the `ID*` ranges are made smaller.
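
Since a segfault kills the interpreter, one way to find which `ID*` ranges start to fail is to run each case in a child process and inspect its exit status. The sketch below is only an illustration of that idea: `estimate_nbytes`, `classify`, `run_case` and `CHILD` are hypothetical names, the child script repeats the steps from the report with the ranges taken from the command line, and the size estimate assumes 8-byte integers (the `np.arange` default on this platform) and no validity bitmaps.

```python
import signal
import subprocess
import sys
import textwrap

CELL_SHAPE = (6, 15)  # same cell shape as in the report

# Child script: the steps from the report, with the ID ranges read from argv.
CHILD = textwrap.dedent("""
    import sys, tempfile
    import numpy as np
    import pyarrow as pa
    import pyarrow.dataset as pad

    n1, n2, n3 = map(int, sys.argv[1:4])
    cell_shape = (6, 15)
    ids = np.broadcast_arrays(
        np.arange(n1)[:, None, None],
        np.arange(n2)[None, :, None],
        np.arange(n3)[None, None, :],
    )
    id1, id2, id3 = map(np.ravel, ids)
    nrow = id3.shape[0]
    data = pa.array(np.arange(nrow * int(np.prod(cell_shape))))
    data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-1])
    data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-2])
    table = pa.Table.from_pydict(
        {"ID1": id1, "ID2": id2, "ID3": id3, "DATA": data})
    fields = [table.schema.field(c) for c in ("ID1", "ID2")]
    part = pad.partitioning(pa.schema(fields), flavor="hive")
    with tempfile.TemporaryDirectory() as tmp:
        pad.write_dataset(table, tmp, partitioning=part, format="parquet")
""")


def estimate_nbytes(n1, n2, n3, itemsize=8):
    """Rough table size: three ID columns plus the nested DATA column."""
    nrow = n1 * n2 * n3
    id_bytes = 3 * nrow * itemsize
    data_bytes = nrow * CELL_SHAPE[0] * CELL_SHAPE[1] * itemsize
    return id_bytes + data_bytes


def classify(returncode):
    """Turn a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    # On POSIX, a child killed by signal N reports a return code of -N.
    if returncode == -signal.SIGSEGV:
        return "segfault"
    return f"failed (exit code {returncode})"


def run_case(n1, n2, n3):
    """Run the child script for one set of ID ranges and classify the result."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, str(n1), str(n2), str(n3)],
        capture_output=True,
    )
    return classify(proc.returncode)
```

For example, `run_case(4, 5, 100)` and `run_case(40, 50, 100)` can be compared, with `estimate_nbytes(...)` printed next to each result to see how the outcome relates to table size. A non-segfault failure is reported separately, so an ordinary Python exception in the child is not mistaken for the crash.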
   
   
   ### Component(s)
   
   C++, Python

