This is an automated email from the ASF dual-hosted git repository.

alenka pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new d31644aa79 GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)
d31644aa79 is described below

commit d31644aa79c9bf351b55252f004014f42f984c4e
Author: Nick Woolmer <[email protected]>
AuthorDate: Thu Feb 5 09:04:50 2026 +0000

    GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)
    
    Closes https://github.com/apache/arrow/issues/41863
    
    ### Rationale for this change
    
    Other tools in the Parquet ecosystem distinguish between `LZ4` and
    `LZ4_RAW`, matching the specification:
    https://parquet.apache.org/docs/file-format/data-pages/compression/
    
    The framed `LZ4` codec is deprecated. PyArrow does not support it;
    instead, its user-facing API uses the name `LZ4` as an alias for the
    `LZ4_RAW` codec.
    
    However, PyArrow does not accept `LZ4_RAW` as a valid name for the
    `LZ4_RAW` codec:
    
    ```
    ArrowException: Unsupported compression: lz4_raw
    ```
    
    This causes friction and is confusing for users who are aware of the
    difference between the two codecs.
    
    ### What changes are included in this PR?
    
    - Adding `LZ4_RAW` to the acceptable codec names list.
    - Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
    - Adding a round-trip test covering both `lz4_raw` and `LZ4_RAW`.
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    Yes, an additive change to the accepted codec names.
    
    * GitHub Issue: #41863
    
    Authored-by: Nick Woolmer <[email protected]>
    Signed-off-by: AlenkaF <[email protected]>
---
 docs/source/python/parquet.rst             | 3 +++
 python/pyarrow/_parquet.pyx                | 4 ++--
 python/pyarrow/parquet/core.py             | 4 +++-
 python/pyarrow/tests/parquet/test_basic.py | 8 ++++++++
 4 files changed, 16 insertions(+), 3 deletions(-)
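
The name handling changed by this commit can be sketched in plain Python. This is only an illustration under stated assumptions: `ACCEPTED_NAMES`, `ALIASES`, and `resolve_compression_name` are hypothetical names that are not part of PyArrow, and the sketch raises `ValueError` where the Cython code raises `ArrowException`. It mirrors the two steps in the diff below: validate the upper-cased name against the accepted set, then let `LZ4_RAW` take the same branch as `LZ4`.

```python
# Hypothetical pure-Python mirror of the name handling touched by this commit.
# None of these names exist in PyArrow; they are for illustration only.

# Accepted names after the change, as listed in check_compression_name.
ACCEPTED_NAMES = {"NONE", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "LZ4_RAW", "ZSTD"}

# Names that resolve to the same codec as another name.
ALIASES = {"LZ4_RAW": "LZ4"}


def resolve_compression_name(name: str) -> str:
    """Validate a codec name case-insensitively and fold aliases together."""
    upper = name.upper()
    if upper not in ACCEPTED_NAMES:
        # The real code raises ArrowException with the same message text.
        raise ValueError("Unsupported compression: " + name)
    return ALIASES.get(upper, upper)
```

With a PyArrow build that includes this commit, passing `compression="lz4_raw"` to `pyarrow.parquet.write_table` is expected to behave the same as passing `compression="lz4"`.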

diff --git a/docs/source/python/parquet.rst b/docs/source/python/parquet.rst
index 30a84b3dc6..2c42d97f98 100644
--- a/docs/source/python/parquet.rst
+++ b/docs/source/python/parquet.rst
@@ -437,6 +437,9 @@ also supported:
 Snappy generally results in better performance, while Gzip may yield smaller
 files.
 
+``'lz4_raw'`` is also accepted as an alias for ``'lz4'``. Both use the
+LZ4_RAW codec as defined in the Parquet specification.
+
 These settings can also be set on a per-column basis:
 
 .. code-block:: python
diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx
index ce1d9fbeb1..fa89b6812e 100644
--- a/python/pyarrow/_parquet.pyx
+++ b/python/pyarrow/_parquet.pyx
@@ -1524,7 +1524,7 @@ cdef compression_name_from_enum(ParquetCompression compression_):
 
 cdef int check_compression_name(name) except -1:
     if name.upper() not in {'NONE', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI', 'LZ4',
-                            'ZSTD'}:
+                            'LZ4_RAW', 'ZSTD'}:
         raise ArrowException("Unsupported compression: " + name)
     return 0
 
@@ -1539,7 +1539,7 @@ cdef ParquetCompression compression_from_name(name):
         return ParquetCompression_LZO
     elif name == 'BROTLI':
         return ParquetCompression_BROTLI
-    elif name == 'LZ4':
+    elif name == 'LZ4' or name == 'LZ4_RAW':
         return ParquetCompression_LZ4
     elif name == 'ZSTD':
         return ParquetCompression_ZSTD
diff --git a/python/pyarrow/parquet/core.py b/python/pyarrow/parquet/core.py
index 676bc44523..354f18124b 100644
--- a/python/pyarrow/parquet/core.py
+++ b/python/pyarrow/parquet/core.py
@@ -768,7 +768,9 @@ use_dictionary : bool or list, default True
     doesn't support dictionary encoding.
 compression : str or dict, default 'snappy'
     Specify the compression codec, either on a general basis or per-column.
-    Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.
+    Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'LZ4_RAW', 'ZSTD'}.
+    'LZ4_RAW' is accepted as an alias for 'LZ4' (both use the LZ4_RAW
+    codec as defined in the Parquet specification).
 write_statistics : bool or list, default True
     Specify if we should write statistics in general (default is True) or only
     for some columns.
diff --git a/python/pyarrow/tests/parquet/test_basic.py b/python/pyarrow/tests/parquet/test_basic.py
index 94868741f3..345aee3c4e 100644
--- a/python/pyarrow/tests/parquet/test_basic.py
+++ b/python/pyarrow/tests/parquet/test_basic.py
@@ -612,6 +612,14 @@ def test_compression_level():
                          compression_level=level)
 
 
+def test_lz4_raw_compression_alias():
+    # GH-41863: lz4_raw should be accepted as a compression name alias
+    arr = pa.array(list(map(int, range(1000))))
+    table = pa.Table.from_arrays([arr, arr], names=['a', 'b'])
+    _check_roundtrip(table, expected=table, compression="lz4_raw")
+    _check_roundtrip(table, expected=table, compression="LZ4_RAW")
+
+
 def test_sanitized_spark_field_names():
     a0 = pa.array([0, 1, 2, 3, 4])
     name = 'prohib; ,\t{}'
