k8ika0s opened a new issue, #48213:
URL: https://github.com/apache/arrow/issues/48213
### Describe the bug, including details regarding any error messages, version, and platform.
Running Apache Arrow / Parquet 22.0.0 on s390x (big-endian) exposed multiple
endianness-related correctness bugs and some allocator/test fragility:
* Parquet encoders sometimes emit host-order bytes instead of canonical
little-endian.
* Decoders may double-swap values or interpret host-order statistics
incorrectly.
* INT96 timestamp handling mixes host-order and LE limbs, breaking
comparison and decoding.
* Page index min/max expectations are built from host-order bytes.
* Float16 (half) serialization does not always produce canonical bytes.
* BYTE_ARRAY and dictionary 4-byte length prefixes are encoded
inconsistently across paths.
* Bit-packing, VLQ, and ZigZag helpers assume little-endian layout and
misbehave on big-endian.
* Synthetic OOM tests can trigger allocator-level aborts (e.g. in mimalloc) on
large-memory s390x systems before Arrow can raise OutOfMemory.
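
To illustrate the first bullet, here is a minimal sketch of the difference between copying a value's in-memory representation and writing explicit little-endian bytes. The helper names (`StoreHostOrder32`, `StoreLE32`, `LoadLE32`) are hypothetical and are not taken from Arrow or from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; not Arrow/Parquet APIs.

// Fragile pattern: copies the in-memory representation, which is
// little-endian only when the host itself is little-endian.
inline void StoreHostOrder32(uint32_t value, uint8_t* out) {
  assert(out != nullptr);
  std::memcpy(out, &value, sizeof(value));
}

// Portable pattern: shifts operate on the numeric value, so the emitted
// byte order does not depend on the host.
inline void StoreLE32(uint32_t value, uint8_t* out) {
  assert(out != nullptr);
  for (int i = 0; i < 4; ++i) {
    out[i] = static_cast<uint8_t>(value >> (8 * i));
  }
}

// Inverse of StoreLE32: rebuilds the value from 4 little-endian bytes.
inline uint32_t LoadLE32(const uint8_t* in) {
  assert(in != nullptr);
  uint32_t value = 0;
  for (int i = 0; i < 4; ++i) {
    value |= static_cast<uint32_t>(in[i]) << (8 * i);
  }
  return value;
}
```

On s390x, `StoreHostOrder32(0x11223344, buf)` produces `11 22 33 44`, whereas `StoreLE32` produces `44 33 22 11` on every host.
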
Together these can lead to:
* Non-portable or corrupted Parquet pages and statistics on big-endian hosts.
* Incorrect page index min/max and statistics (wrong sort order, incorrect
filtering).
* Mis-decoded INT96 timestamps.
* Crashes in test OOM paths on s390x.
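
For the INT96 case, the sketch below decodes the conventional 12-byte timestamp layout (8 bytes of nanoseconds within the day, then 4 bytes of Julian day, both little-endian) one byte at a time, so the result is the same on any host. The struct and function names are hypothetical and do not reflect Arrow's internal `Int96` representation or the PR's helpers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types and helpers for illustration only.
struct Int96Timestamp {
  int64_t nanos_of_day;  // nanoseconds since midnight
  int32_t julian_day;    // Julian day number
};

// Read `num_bytes` (at most 8) little-endian bytes into an unsigned value.
inline uint64_t LoadLittleEndian(const uint8_t* in, int num_bytes) {
  assert(in != nullptr && num_bytes >= 0 && num_bytes <= 8);
  uint64_t value = 0;
  for (int i = 0; i < num_bytes; ++i) {
    value |= static_cast<uint64_t>(in[i]) << (8 * i);
  }
  return value;
}

// Write the low `num_bytes` (at most 8) bytes of `value`, least significant first.
inline void StoreLittleEndian(uint64_t value, int num_bytes, uint8_t* out) {
  assert(out != nullptr && num_bytes >= 0 && num_bytes <= 8);
  for (int i = 0; i < num_bytes; ++i) {
    out[i] = static_cast<uint8_t>(value >> (8 * i));
  }
}

// Decode 12 bytes: nanoseconds of day first, Julian day last.
inline Int96Timestamp DecodeInt96(const uint8_t* in) {
  Int96Timestamp ts;
  ts.nanos_of_day = static_cast<int64_t>(LoadLittleEndian(in, 8));
  ts.julian_day =
      static_cast<int32_t>(static_cast<uint32_t>(LoadLittleEndian(in + 8, 4)));
  return ts;
}

// Encode back into the same 12-byte layout.
inline void EncodeInt96(const Int96Timestamp& ts, uint8_t* out) {
  StoreLittleEndian(static_cast<uint64_t>(ts.nanos_of_day), 8, out);
  StoreLittleEndian(static_cast<uint32_t>(ts.julian_day), 4, out + 8);
}

// Convert to nanoseconds since the Unix epoch (Julian day 2440588).
inline int64_t Int96ToUnixNanos(const Int96Timestamp& ts) {
  const int64_t kNanosPerDay = 86400LL * 1000000000LL;
  return (static_cast<int64_t>(ts.julian_day) - 2440588) * kNanosPerDay +
         ts.nanos_of_day;
}
```

Because every field is assembled from individual bytes, there is no point at which a host-order limb can leak into the decoded value.
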
### To Reproduce
On an s390x (big-endian) system:
1. Build Arrow C++ (22.0.0) with Parquet enabled.
2. Run relevant gtest suites (e.g. parquet-reader/writer, parquet-arrow
reader/writer, page index tests, chunker/Float16 tests, bit-utility tests, IPC
message tests, and allocator/buffer OOM tests).
3. Observe:
* Parquet tests failing due to mismatched bytes (stats, page index,
INT96, Float16, RLE prefixes, dictionary pages).
* IPC byte-identical test comparing against a macOS/ARM snapshot that
does not match s390x output.
* OOM tests aborting when mimalloc is used on large-memory s390x
environments.
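
When investigating byte mismatches like those in step 3, a per-offset hex diff makes swapped bytes easy to spot. The following is a minimal sketch; `HexDiff` and `HexByteOrMissing` are hypothetical names and are not the helper used in the Arrow test suite.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical diagnostic helpers for illustration only.

// Format bytes[i] as two hex digits, or "--" if i is past the end.
inline std::string HexByteOrMissing(const std::vector<uint8_t>& bytes, size_t i) {
  if (i >= bytes.size()) return "--";
  char text[3];
  std::snprintf(text, sizeof(text), "%02x", static_cast<unsigned>(bytes[i]));
  return text;
}

// Return one line per offset where the two buffers differ
// (including offsets present in only one of them).
inline std::string HexDiff(const std::vector<uint8_t>& expected,
                           const std::vector<uint8_t>& actual) {
  std::string report;
  const size_t n = std::max(expected.size(), actual.size());
  for (size_t i = 0; i < n; ++i) {
    const bool same = i < expected.size() && i < actual.size() &&
                      expected[i] == actual[i];
    if (same) continue;
    report += "offset " + std::to_string(i) + ": expected " +
              HexByteOrMissing(expected, i) + ", actual " +
              HexByteOrMissing(actual, i) + "\n";
  }
  return report;
}
```

An empty report means the buffers are byte-identical; a run of four consecutive differing offsets with mirrored values usually points at a 32-bit field written in host order.
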
### Expected behavior
* Parquet encoders/decoders should consistently read/write **canonical
little-endian bytes** for all primitives and metadata, independently of host
endianness.
* INT96 timestamps, page index min/max, Float16, and binary/dictionary pages
should round-trip correctly on s390x and match the expected canonical layout.
* IPC “byte identical” tests should compare against a canonical snapshot
that is also valid for big-endian builds.
* Synthetic OOM tests should exercise Arrow’s OutOfMemory behavior without
provoking allocator-level hard aborts.
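
As one concrete instance of the canonical layout, a PLAIN-encoded BYTE_ARRAY value is a 4-byte little-endian length followed by the raw bytes. The sketch below round-trips such a value without depending on host byte order; `AppendByteArray` and `ReadByteArray` are hypothetical names, not Parquet C++ APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only.

// Append one value: 4-byte little-endian length, then the payload bytes.
inline void AppendByteArray(const std::string& value, std::vector<uint8_t>* out) {
  assert(out != nullptr);
  const uint32_t length = static_cast<uint32_t>(value.size());
  for (int i = 0; i < 4; ++i) {
    out->push_back(static_cast<uint8_t>(length >> (8 * i)));
  }
  out->insert(out->end(), value.begin(), value.end());
}

// Read one value starting at *offset and advance *offset past it.
// Returns false if the buffer is truncated.
inline bool ReadByteArray(const std::vector<uint8_t>& buf, size_t* offset,
                          std::string* value) {
  assert(offset != nullptr && value != nullptr);
  if (*offset > buf.size() || buf.size() - *offset < 4) return false;
  uint32_t length = 0;
  for (int i = 0; i < 4; ++i) {
    length |= static_cast<uint32_t>(buf[*offset + i]) << (8 * i);
  }
  const size_t start = *offset + 4;
  if (buf.size() - start < length) return false;
  value->assign(reinterpret_cast<const char*>(buf.data() + start), length);
  *offset = start + length;
  return true;
}
```

Encoding `"abc"` this way yields `03 00 00 00 61 62 63` on both little-endian and big-endian hosts.
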
### What this PR changes
The associated PR:
* Introduces an `ARROW_ENSURE_S390X_ENDIANNESS` CMake option (default ON on
s390x) so Parquet I/O always uses explicit little-endian conversions on
big-endian hosts.
* Adds `parquet/endian_internal.h` with portable helpers for
encoding/decoding primitives (integers, floats, INT96, etc.) in little-endian
format regardless of host architecture.
* Updates Parquet encoders/decoders (Plain, ByteStreamSplit, Dictionary,
DeltaBitPack, FLBA) to use these helpers, including Arrow array fast-paths.
* Ensures level headers, RLE prefixes, and BYTE_ARRAY length prefixes are
always written/read as little-endian.
* Canonicalizes INT96 timestamp handling and test fixtures to little-endian
limb layout.
* Fixes page index and Float16 expectations so tests compare against
canonical LE bytes.
* Adjusts geospatial WKB parsing and related tests to respect a unified
endianness shim.
* Updates IPC byte-identical tests to use a canonical s390x snapshot and
emit hex diffs on mismatch.
* Clamps synthetic OOM allocations and skips mimalloc-specific OOM tests to
avoid allocator aborts while still exercising Arrow’s OutOfMemory behavior.
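
The same principle applies to the variable-length integers used by delta encodings: if ZigZag and varint (ULEB128-style) coding are expressed purely as arithmetic on values and emitted one byte at a time, host byte order never enters the picture. This is a minimal sketch with hypothetical function names; it is not the code changed by the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration only.

// Map signed to unsigned so small magnitudes get small codes:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline uint64_t ZigZagEncode64(int64_t v) {
  const uint64_t sign_mask = v < 0 ? ~uint64_t{0} : uint64_t{0};
  return (static_cast<uint64_t>(v) << 1) ^ sign_mask;
}

// Inverse of ZigZagEncode64.
inline int64_t ZigZagDecode64(uint64_t u) {
  return static_cast<int64_t>((u >> 1) ^ (uint64_t{0} - (u & 1)));
}

// Append `v` as a varint: 7 payload bits per byte, least significant group
// first, high bit set on every byte except the last.
inline void AppendVarint(uint64_t v, std::vector<uint8_t>* out) {
  assert(out != nullptr);
  while (v >= 0x80) {
    out->push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<uint8_t>(v));
}

// Read a varint at *offset and advance past it.
// Returns false on truncated or over-long input.
inline bool ReadVarint(const std::vector<uint8_t>& buf, size_t* offset,
                       uint64_t* value) {
  assert(offset != nullptr && value != nullptr);
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && *offset < buf.size(); shift += 7) {
    const uint8_t byte = buf[(*offset)++];
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *value = result;
      return true;
    }
  }
  return false;
}
```

For example, `AppendVarint(300, &buf)` emits `ac 02` regardless of the host, and `ZigZagDecode64(ZigZagEncode64(x))` returns `x` for any `int64_t`.
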
### Environment
* Architecture: s390x (big-endian)
* Component(s): C++, Parquet
* Arrow version: 22.0.0 (and current `main` while developing the fix)
* Allocators: default pool, mimalloc (where OOM tests were unstable)
### Additional context
The goal is to make Arrow/Parquet fully correct and stable on s390x by
enforcing canonical little-endian Parquet I/O and adjusting tests so that all
suites pass on big-endian platforms without behavior differences or crashes.
### Component(s)
C++, Parquet