coastalwhite opened a new issue, #43584: URL: https://github.com/apache/arrow/issues/43584
### Describe the bug, including details regarding any error messages, version, and platform.

When writing large files with fixed-size binary columns, PyArrow writes an invalid Parquet dictionary.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as pacsv
import random
import uuid
import io

f = io.BytesIO()
for i in range(0, 100000):
    N = random.randint(1, 12)
    arr = pa.array([str(uuid.uuid4())[:N] for _ in range(1_000_000)], type=pa.binary(N))
    table = pa.table({ 'a': arr })
    f.seek(0)
    pq.write_table(table, f)
    f.seek(0)
    roundtrip_pa = pq.read_table(f)
    assert table == roundtrip_pa
```

The error is the following:

```
Traceback (most recent call last):
  File "/home/johndoe/Projects/polars/fsl.py", line 20, in <module>
    roundtrip_pa = pq.read_table(f)
                   ^^^^^^^^^^^^^^^^
  File "/nix/store/lpyxz6g2gjddddivs60aqm97rmbiakha-python3-3.11.9-env/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1811, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/lpyxz6g2gjddddivs60aqm97rmbiakha-python3-3.11.9-env/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1454, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Unexpected end of stream
/build/apache-arrow-16.0.0/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
/build/apache-arrow-16.0.0/cpp/src/parquet/arrow/reader.cc:1252  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
```

Debugging with the Polars Parquet reader suggests that the writer generates a wrong Parquet dictionary index.
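When narrowing down an `Unexpected end of stream` error, it may help to first confirm that the buffer's basic Parquet framing is intact before looking at dictionary pages. The following is a minimal, standard-library-only sketch, not part of PyArrow; `check_parquet_framing` is a hypothetical helper name. It relies only on the Parquet file layout: a leading `PAR1` magic, and a trailer consisting of a 4-byte little-endian footer length followed by `PAR1`.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def check_parquet_framing(buf: bytes) -> dict:
    """Report basic framing facts for an unencrypted Parquet file held in memory.

    Hypothetical debugging helper: it does not parse the footer or any pages,
    it only checks the magic bytes and the declared footer length.
    """
    # Smallest possible framing: 4-byte head magic + 4-byte length + 4-byte tail magic.
    if len(buf) < 12:
        return {"head_magic": False, "tail_magic": False,
                "footer_len": None, "footer_fits": False}

    head_magic = buf[:4] == PARQUET_MAGIC
    tail_magic = buf[-4:] == PARQUET_MAGIC

    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", buf[-8:-4])

    # The footer must fit between the leading magic and the 8-byte trailer.
    footer_fits = footer_len <= len(buf) - 12

    return {"head_magic": head_magic, "tail_magic": tail_magic,
            "footer_len": footer_len, "footer_fits": footer_fits}
```

In the reproduction above this could be called as `check_parquet_framing(f.getvalue())` after `pq.write_table(table, f)`. Note that `io.BytesIO.getvalue()` returns the entire buffer, including any bytes left over from an earlier, longer write, so the result describes the buffer as a whole rather than only the most recent write.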
### Component(s)

Parquet, Python