MikeB2019x opened a new issue, #44942:
URL: https://github.com/apache/arrow/issues/44942
### Describe the bug, including details regarding any error messages, version, and platform.
I am experiencing inconsistent behaviour when reading Parquet files. I have a
series of Parquet files, all from the same source, written with _lz4_
compression. Using pandas and pyarrow in a very simple conda environment, I
have tried reading the files on both macOS and Ubuntu. Most files read
without a problem, but some throw an error.
For example, I have an unexceptional file that looks like the output below. It
is ~24K rows with 11 columns, of which 8 are int64 and three are short
strings.
```
                  rowid              txid  ...          type_hashed        account_hashed
0      1185273534742529  8907143696613377  ...  5639366292526364020   6590043424706028011
1      1185273534742530  8907143696613377  ...  5639366292526364020  14759846602110569298
2      1185273534742531  8907143696613378  ...  5639366292526364020   6590043424706028011
3      1185273534742532  8907143696613378  ...  5639366292526364020  14759846602110569298
4      1185273534742533  8907143696613379  ...  5639366292526364020   6590043424706028011
...                 ...               ...  ...                  ...                   ...
23956  1185273534766485  8907143696624419  ...  1436686925913123874  11079598282867098476
23957  1185273534766486  8907143696624419  ...  1436686925913123874   2681930189654727950
23958  1185273534766487  8907143696624419  ...  1436686925913123874   2903981374529106592
23959  1185273534766488  8907143696624420  ...  1436686925913123874   2377379431753203189
23960  1185273534766489  8907143696624420  ...  1436686925913123874   3493484824210393012
```
In the first case (macOS) I can read the file; in the second case (Ubuntu) I
can't. What is very strange about the latter case is that the error refers to a
different codec, _zstd_, not _lz4_, as shown below.
```
>>> import pandas as pd; import pyarrow as pa
>>> pd.read_parquet('/mnt/xx/journal_6296549a37444051c7cb_clean_.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/manuelenv_forge/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: ZSTD decompression failed: Data corruption detected
```
To replicate the conda env I'm using:
```
conda create -n my_env python=3.11
conda activate my_env
pip install pandas==2.2.2 pyarrow==17.0.0 numpy==2.0.0
```
The resulting env should be:
```
% pip list
Package Version
--------------- -----------
numpy 2.0.0
pandas 2.2.2
pip 24.3.1
pyarrow 17.0.0
python-dateutil 2.9.0.post0
pytz 2024.2
setuptools 75.6.0
six 1.16.0
tzdata 2024.2
wheel 0.45.1
```
The OSes are: macOS 15.0 (24A335) and Ubuntu 22.04.4 LTS.
The files were created with Polars'
[polars.LazyFrame.sink_parquet()](https://docs.pola.rs/api/python/stable/reference/api/polars.LazyFrame.sink_parquet.html#polars-lazyframe-sink-parquet)
because some files are larger than memory.
I have tried both the 'zstd' and 'lz4' codecs and can confirm that the
problematic files were saved with 'lz4'.
### Component(s)
Parquet, Python