sjperkins opened a new issue, #43696:
URL: https://github.com/apache/arrow/issues/43696
### Describe the bug, including details regarding any error messages, version, and platform.

platform: Ubuntu 22.04, x86_64
pyarrow: 17.0.0

```python
import os
import tempfile

import numpy as np
import pyarrow as pa
import pyarrow.dataset as pad
import pyarrow.parquet as pq

id1 = np.arange(40)[:, None, None]
id2 = np.arange(50)[None, :, None]
id3 = np.arange(100)[None, None, :]
cell_shape = (6, 15)

id1, id2, id3 = map(np.ravel, np.broadcast_arrays(id1, id2, id3))
nrow, = id3.shape

data = pa.array(np.arange(nrow * np.prod(cell_shape)))
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-1])
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-2])
assert len(data) == nrow

T = pa.Table.from_pydict({"ID1": id1, "ID2": id2, "ID3": id3, "DATA": data})
print(f"{T.nbytes / (1024.**2)}MB")

with tempfile.TemporaryDirectory() as dir:
    # Succeeds
    pq.write_table(T, dir + os.path.sep + "test.parquet")
    print("Wrote parquet file")

with tempfile.TemporaryDirectory() as dir:
    partition_fields = [T.schema.field(c) for c in ("ID1", "ID2")]
    partition = pad.partitioning(pa.schema(partition_fields), flavor="hive")
    # Segfaults
    pad.write_dataset(T, dir, partitioning=partition, format="parquet")
```

produces the following type of core dump:

```core
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=139680878245440) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f09fd213640 (LWP 211153))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=139680878245440) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=139680878245440) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139680878245440, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x00007f0a34442476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:513
#6  0x00007f0a2fd0c0b9 in arrow::compute::internal::FixedWidthTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#7  0x00007f0a2fd20b2e in arrow::compute::internal::FSLTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#8  0x00007f0a2fe0e8a3 in arrow::compute::detail::(anonymous namespace)::VectorExecutor::Exec(arrow::compute::ExecSpan const&, arrow::compute::detail::ExecListener*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#9  0x00007f0a2fe0ec42 in arrow::compute::detail::(anonymous namespace)::VectorExecutor::Execute(arrow::compute::ExecBatch const&, arrow::compute::detail::ExecListener*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#10 0x00007f0a2fe2597b in arrow::compute::detail::FunctionExecutorImpl::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, long) ()
```

The `__memmove_avx_unaligned_erms` call looks like it could be the trigger. Also note that calling `pyarrow.parquet.write_table` on the same table succeeds, and `write_dataset` seems to succeed if the `ID*` ranges are made smaller.
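For reference, the scale of the reproduction can be confirmed with numpy alone (a minimal sketch of the index construction above, no pyarrow required); the printed figures are derived from the same `broadcast_arrays` call the reproduction uses:

```python
import numpy as np

# Same index construction as the reproduction above, without pyarrow.
id1 = np.arange(40)[:, None, None]
id2 = np.arange(50)[None, :, None]
id3 = np.arange(100)[None, None, :]
cell_shape = (6, 15)

# Broadcasting the three ranges yields a full (40, 50, 100) grid,
# which np.ravel flattens to 1-D index columns.
id1, id2, id3 = map(np.ravel, np.broadcast_arrays(id1, id2, id3))
nrow, = id3.shape

flat_elements = nrow * int(np.prod(cell_shape))
print(nrow)           # 200000 rows
print(flat_elements)  # 18000000 flat int64 DATA elements (~137 MiB)
```

So each of the 40 * 50 * 100 = 200000 rows carries a fixed-size (6, 15) `DATA` cell, which may explain why shrinking the `ID*` ranges avoids the crash: it directly reduces the number of rows the partitioned `take` kernels have to shuffle.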
### Component(s)

C++, Python