coastalwhite opened a new issue, #43584:
URL: https://github.com/apache/arrow/issues/43584

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When writing large files with a fixed-size binary column, the writer produces an invalid Parquet dictionary.
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   import random
   import uuid
   import io
   
   f = io.BytesIO()
   
   for i in range(0, 100000):
       N = random.randint(1, 12)
       arr = pa.array([str(uuid.uuid4())[:N] for _ in range(1_000_000)], type=pa.binary(N))
       table = pa.table({ 'a': arr })
   
       # Rewind and truncate so leftover bytes from a longer previous file
       # cannot corrupt the Parquet footer on re-read.
       f.seek(0)
       f.truncate()
       pq.write_table(table, f)
   
       f.seek(0)
       roundtrip_pa = pq.read_table(f)
       assert table == roundtrip_pa
   ```
   
   The error is the following:
   
   ```
   Traceback (most recent call last):
     File "/home/johndoe/Projects/polars/fsl.py", line 20, in <module>
       roundtrip_pa = pq.read_table(f)
                      ^^^^^^^^^^^^^^^^
     File "/nix/store/lpyxz6g2gjddddivs60aqm97rmbiakha-python3-3.11.9-env/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1811, in read_table
       return dataset.read(columns=columns, use_threads=use_threads,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/nix/store/lpyxz6g2gjddddivs60aqm97rmbiakha-python3-3.11.9-env/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1454, in read
       table = self._dataset.to_table(
               ^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3804, in 
pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 154, in 
pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   OSError: Unexpected end of stream
   /build/apache-arrow-16.0.0/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
   /build/apache-arrow-16.0.0/cpp/src/parquet/arrow/reader.cc:1252  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
   ```
   
   Debugging with the Polars Parquet reader suggests that an incorrect Parquet dictionary index is being written.
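   
   For reference, here is a minimal sketch (not part of the original report; the value count and width are only illustrative) of one way to inspect the column-chunk metadata of a file written like the one above and confirm that the column was dictionary-encoded:
   
   ```python
   import io
   import uuid
   
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Write a single fixed-size binary column, as in the repro above
   # (far fewer values here, purely for illustration).
   N = 8
   arr = pa.array([str(uuid.uuid4())[:N] for _ in range(1_000)], type=pa.binary(N))
   f = io.BytesIO()
   pq.write_table(pa.table({'a': arr}), f)
   
   # Inspect the metadata of row group 0, column 0.
   f.seek(0)
   col = pq.ParquetFile(f).metadata.row_group(0).column(0)
   print(col.encodings)               # encodings used, e.g. a dictionary encoding
   print(col.dictionary_page_offset)  # set when a dictionary page was written
   ```
   
   Writing with `pq.write_table(table, f, use_dictionary=False)` may also help narrow the failure down to the dictionary-encoding path.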
   
   ### Component(s)
   
   Parquet, Python

