This is an automated email from the ASF dual-hosted git repository.
alenka pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new d31644aa79 GH-41863: [Python][Parquet] Support lz4_raw as a
compression name alias (#49135)
d31644aa79 is described below
commit d31644aa79c9bf351b55252f004014f42f984c4e
Author: Nick Woolmer <[email protected]>
AuthorDate: Thu Feb 5 09:04:50 2026 +0000
GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias
(#49135)
Closes https://github.com/apache/arrow/issues/41863
### Rationale for this change
Other tools in the parquet ecosystem distinguish between `LZ4` and
`LZ4_RAW`, matching the specification:
https://parquet.apache.org/docs/file-format/data-pages/compression/
The `LZ4` codec (with framing) is deprecated in the specification. PyArrow
does not support it and instead simplifies the user-facing API by using `LZ4`
as an alias for the `LZ4_RAW` codec.
However, PyArrow does not accept `LZ4_RAW` as a valid name for the
`LZ4_RAW` codec:
```
ArrowException: Unsupported compression: lz4_raw
```
This causes friction and is confusing for users who are aware of the
difference between the two codecs.
### What changes are included in this PR?
- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test.
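The name handling described in the list above can be sketched in plain Python. This is only an illustration under stated assumptions: the real code lives in Cython (`check_compression_name` and `compression_from_name` in `_parquet.pyx`) and returns `ParquetCompression` enum members. The names `ACCEPTED_NAMES`, `CODEC_ALIASES` and `resolve_compression` are hypothetical, plain strings stand in for the enum values, and `ValueError` stands in for `ArrowException`.

```python
# Hypothetical pure-Python mirror of the name handling touched by this change.
# Strings stand in for the ParquetCompression enum members used by the real code.
ACCEPTED_NAMES = {'NONE', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI', 'LZ4',
                  'LZ4_RAW', 'ZSTD'}

# 'LZ4_RAW' resolves to the same codec that 'LZ4' already selected.
CODEC_ALIASES = {'LZ4_RAW': 'LZ4'}


def resolve_compression(name):
    """Validate a codec name case-insensitively and collapse aliases."""
    upper = name.upper()
    if upper not in ACCEPTED_NAMES:
        # The real code raises ArrowException; ValueError is a stand-in.
        raise ValueError("Unsupported compression: " + name)
    return CODEC_ALIASES.get(upper, upper)


if __name__ == "__main__":
    for candidate in ("lz4", "lz4_raw", "LZ4_RAW", "zstd"):
        print(candidate, "->", resolve_compression(candidate))
```

With this shape, adding another alias later would only need a new entry in the alias table rather than another branch in the lookup.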
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes, an additive change to the accepted codec names.
* GitHub Issue: #41863
Authored-by: Nick Woolmer <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
---
docs/source/python/parquet.rst | 3 +++
python/pyarrow/_parquet.pyx | 4 ++--
python/pyarrow/parquet/core.py | 4 +++-
python/pyarrow/tests/parquet/test_basic.py | 8 ++++++++
4 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/docs/source/python/parquet.rst b/docs/source/python/parquet.rst
index 30a84b3dc6..2c42d97f98 100644
--- a/docs/source/python/parquet.rst
+++ b/docs/source/python/parquet.rst
@@ -437,6 +437,9 @@ also supported:
Snappy generally results in better performance, while Gzip may yield smaller
files.
+``'lz4_raw'`` is also accepted as an alias for ``'lz4'``. Both use the
+LZ4_RAW codec as defined in the Parquet specification.
+
These settings can also be set on a per-column basis:
.. code-block:: python
diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx
index ce1d9fbeb1..fa89b6812e 100644
--- a/python/pyarrow/_parquet.pyx
+++ b/python/pyarrow/_parquet.pyx
@@ -1524,7 +1524,7 @@ cdef compression_name_from_enum(ParquetCompression compression_):
cdef int check_compression_name(name) except -1:
if name.upper() not in {'NONE', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI', 'LZ4',
- 'ZSTD'}:
+ 'LZ4_RAW', 'ZSTD'}:
raise ArrowException("Unsupported compression: " + name)
return 0
@@ -1539,7 +1539,7 @@ cdef ParquetCompression compression_from_name(name):
return ParquetCompression_LZO
elif name == 'BROTLI':
return ParquetCompression_BROTLI
- elif name == 'LZ4':
+ elif name == 'LZ4' or name == 'LZ4_RAW':
return ParquetCompression_LZ4
elif name == 'ZSTD':
return ParquetCompression_ZSTD
diff --git a/python/pyarrow/parquet/core.py b/python/pyarrow/parquet/core.py
index 676bc44523..354f18124b 100644
--- a/python/pyarrow/parquet/core.py
+++ b/python/pyarrow/parquet/core.py
@@ -768,7 +768,9 @@ use_dictionary : bool or list, default True
doesn't support dictionary encoding.
compression : str or dict, default 'snappy'
Specify the compression codec, either on a general basis or per-column.
- Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.
+ Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'LZ4_RAW', 'ZSTD'}.
+ 'LZ4_RAW' is accepted as an alias for 'LZ4' (both use the LZ4_RAW
+ codec as defined in the Parquet specification).
write_statistics : bool or list, default True
Specify if we should write statistics in general (default is True) or only
for some columns.
diff --git a/python/pyarrow/tests/parquet/test_basic.py b/python/pyarrow/tests/parquet/test_basic.py
index 94868741f3..345aee3c4e 100644
--- a/python/pyarrow/tests/parquet/test_basic.py
+++ b/python/pyarrow/tests/parquet/test_basic.py
@@ -612,6 +612,14 @@ def test_compression_level():
compression_level=level)
+def test_lz4_raw_compression_alias():
+ # GH-41863: lz4_raw should be accepted as a compression name alias
+ arr = pa.array(list(map(int, range(1000))))
+ table = pa.Table.from_arrays([arr, arr], names=['a', 'b'])
+ _check_roundtrip(table, expected=table, compression="lz4_raw")
+ _check_roundtrip(table, expected=table, compression="LZ4_RAW")
+
+
def test_sanitized_spark_field_names():
a0 = pa.array([0, 1, 2, 3, 4])
name = 'prohib; ,\t{}'