MikeB2019x opened a new issue, #44942:
URL: https://github.com/apache/arrow/issues/44942
### Describe the bug, including details regarding any error messages, version, and platform.
I am experiencing inconsistent behaviour when reading Parquet files. I have a
series of Parquet files, all from the same source, written with _lz4_
compression. Using pandas and pyarrow in a very simple conda environment, I
have tried reading the files on both macOS and Ubuntu. Most files read
without a problem, but some throw an error.
For example, I have an unexceptional file that looks like the output below. It
is ~24K rows with 11 columns, of which 8 are int64 and three are short
strings.
```
                  rowid              txid  ...          type_hashed        account_hashed
0      1185273534742529  8907143696613377  ...  5639366292526364020   6590043424706028011
1      1185273534742530  8907143696613377  ...  5639366292526364020  14759846602110569298
2      1185273534742531  8907143696613378  ...  5639366292526364020   6590043424706028011
3      1185273534742532  8907143696613378  ...  5639366292526364020  14759846602110569298
4      1185273534742533  8907143696613379  ...  5639366292526364020   6590043424706028011
...                 ...               ...  ...                  ...                   ...
23956  1185273534766485  8907143696624419  ...  1436686925913123874  11079598282867098476
23957  1185273534766486  8907143696624419  ...  1436686925913123874   2681930189654727950
23958  1185273534766487  8907143696624419  ...  1436686925913123874   2903981374529106592
23959  1185273534766488  8907143696624420  ...  1436686925913123874   2377379431753203189
23960  1185273534766489  8907143696624420  ...  1436686925913123874   3493484824210393012
```
In the first case (macOS) I can read the file; in the second case (Ubuntu) I
can't. What is very strange about the latter case is that the error refers to a
different codec, _zstd_, not _lz4_, as shown below.
```
>>> import pandas as pd; import pyarrow as pa
>>> pd.read_parquet('/mnt/xx/journal_6296549a37444051c7cb_clean_.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: ZSTD decompression failed: Data corruption detected
```
To replicate the conda env I'm using:
```
conda create -n my_env python=3.11
conda activate my_env
pip install pandas==2.2.2 pyarrow==17.0.0 numpy==2.0.0
```
The resulting env should be:
```
% pip list
Package Version
--------------- -----------
numpy 2.0.0
pandas 2.2.2
pip 24.3.1
pyarrow 17.0.0
python-dateutil 2.9.0.post0
pytz 2024.2
setuptools 75.6.0
six 1.16.0
tzdata 2024.2
wheel 0.45.1
```
The OSes are: macOS 15.0 (24A335) and Ubuntu 22.04.4 LTS.
The files were created with Polars'
[polars.LazyFrame.sink_parquet()](https://docs.pola.rs/api/python/stable/reference/api/polars.LazyFrame.sink_parquet.html#polars-lazyframe-sink-parquet)
because some files are larger than memory.
I have tried both the 'zstd' and 'lz4' codecs and can confirm that the
problematic files were saved with 'lz4'.
### Component(s)
Parquet, Python