MikeB2019x opened a new issue, #44942: URL: https://github.com/apache/arrow/issues/44942
### Describe the bug, including details regarding any error messages, version, and platform.

I am seeing inconsistent behaviour when reading Parquet files. I have a series of Parquet files, all from the same source and all written with _lz4_ compression. I have tried reading them with pandas and pyarrow in a very simple conda environment on both macOS and Ubuntu. Most files read without problems, but some raise an error.

For example, I have an unexceptional file that looks like this. It has ~24K rows and 11 columns, of which 8 are int64 like those below and 3 are short strings.

```
                  rowid              txid  ...          type_hashed        account_hashed
0      1185273534742529  8907143696613377  ...  5639366292526364020   6590043424706028011
1      1185273534742530  8907143696613377  ...  5639366292526364020  14759846602110569298
2      1185273534742531  8907143696613378  ...  5639366292526364020   6590043424706028011
3      1185273534742532  8907143696613378  ...  5639366292526364020  14759846602110569298
4      1185273534742533  8907143696613379  ...  5639366292526364020   6590043424706028011
...                 ...               ...  ...                  ...                   ...
23956  1185273534766485  8907143696624419  ...  1436686925913123874  11079598282867098476
23957  1185273534766486  8907143696624419  ...  1436686925913123874   2681930189654727950
23958  1185273534766487  8907143696624419  ...  1436686925913123874   2903981374529106592
23959  1185273534766488  8907143696624420  ...  1436686925913123874   2377379431753203189
23960  1185273534766489  8907143696624420  ...  1436686925913123874   3493484824210393012
```

In the first case (macOS) I can read the file; in the second case (Ubuntu) I can't. What is very strange about the latter case is that the error refers to a different codec, _zstd_ rather than _lz4_, as the error below shows.
```
>>> import pandas as pd; import pyarrow as pa
>>> pd.read_parquet('/mnt/xx/journal_6296549a37444051c7cb_clean_.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: ZSTD decompression failed: Data corruption detected
```

To replicate the conda env I'm using:

```
conda create -n my_env python=3.11
conda activate my_env
pip install pandas==2.2.2 pyarrow==17.0.0 numpy==2.0.0
```

The resulting env should be:

```
% pip list
Package         Version
--------------- -----------
numpy           2.0.0
pandas          2.2.2
pip             24.3.1
pyarrow         17.0.0
python-dateutil 2.9.0.post0
pytz            2024.2
setuptools      75.6.0
six             1.16.0
tzdata          2024.2
wheel           0.45.1
```

The operating systems are macOS 15.0 (24A335) and Ubuntu 22.04.4 LTS.

The files were created with polars, using [polars.LazyFrame.sink_parquet()](https://docs.pola.rs/api/python/stable/reference/api/polars.LazyFrame.sink_parquet.html#polars-lazyframe-sink-parquet), because some files are larger than memory. I have used both the 'zstd' and 'lz4' codecs and can confirm that the files were saved with the 'lz4' codec.

### Component(s)

Parquet, Python