Hi everyone, I have only recently joined the online community, so please accept my apologies for jumping in with this input so late in the process. I've been catching up on the discussions around the parquet3.fbs FlatBuffer footer proposal, and first, I want to say I completely agree with the core motivation: Parquet desperately needs O(1) random access to column metadata to avoid the linear scanning and heap pressure that currently penalize wide-table workloads.
I write Parquet parsers, and my profiling aligns with what many here have likely seen. As demonstrated in the *arrow-rs* blog post about custom Thrift parsing, simply decoding but discarding (skipping heap allocation for) unneeded columns yields a 3x-9x speedup. This highlights that the historic cost of Thrift parsing has been as much a code problem as a data problem. If we can introduce a way to completely *skip* rather than *decode* the Thrift bytes of unwanted columns, we can make things faster still.

Given the ongoing challenges with file bloat and metadata duplication in the current FlatBuffer proposal, I wanted to float an alternative architectural approach: *What if we keep the Thrift footer intact as the single source of truth, but append a lightweight, O(1) "Footer Index"?*

The Core Proposal: A Lightweight Offset Index

Instead of duplicating ColumnMetaData (statistics, encodings, etc.) into a massive new FlatBuffer structure, we could append a highly compact jump table. For any given row group and column ordinal, this index would simply tell the engine exactly where that column's metadata begins inside the legacy Thrift footer.

A query engine would simply:

1. Consult the new Footer Index to get the exact byte offsets of the target column's metadata across the required row groups.
2. Seek directly to those bytes in the legacy Thrift footer and decode only the requested ColumnMetaData structs.

Because this parsing is highly targeted, the parser can employ a low- or zero-allocation pattern to save those very last CPU cycles.

Optional Add-On: O(1) Access by Name

For engines or use cases that don't rely on external catalogs or ID mappings, we could easily extend this concept to natively support O(1) column resolution by name. We could bake a lightweight hash table into the index that simply maps node names to their ordinals, and struct names to their corresponding groups of ordinals. This provides immediate, zero-scan access to any field by name while keeping the footprint microscopically small compared to duplicating the entire metadata payload.

Potential Benefits

- *Solves the Allocation Bottleneck:* It provides the exact O(1) random access needed to skip unwanted columns, so readers no longer pay for linear scanning, or for the garbage collection overhead of materializing metadata they never use.
- *Drastically Smaller File Sizes:* We avoid the 3x-4x file bloat currently seen when translating Thrift metadata directly to FlatBuffers. We are only storing an index of integers (and optionally a small hash table).
- *Single Source of Truth:* We avoid "dual parser" drift. Writers don't have to serialize two complete metadata payloads, and the format doesn't have to define complex rules about which footer takes precedence if they disagree.

I realize there is already significant momentum and code written for parquet3.fbs, and it’s entirely possible this index-only approach was evaluated and discarded early on. If so, I’d love to understand the technical hurdles it faced (e.g., were there issues safely instantiating a Thrift decoder mid-stream?).

I appreciate all the hard work going into modernizing the format and would love to hear your thoughts on whether a lightweight index could give us the O(1) read performance we want without the file bloat we are trying to avoid.

Thanks,
Will
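
P.S. To make the jump-table idea a bit more concrete, here is a rough Python sketch. Everything in it is an assumption made up for illustration, not a proposed wire format: the 8-byte header, the little-endian (offset, length) entries, and the names `build_footer_index`, `locate`, `column_metadata_bytes`, and `build_demo` are all hypothetical. The "footer" is a stand-in built from placeholder byte strings rather than real Thrift, and a plain dict stands in for the optional name-to-ordinal hash table. I store a length next to each offset only so the demo can slice the placeholder bytes; a real Thrift struct ends with a stop field, so an offset alone may well be enough.

```python
import struct

# Hypothetical layout, for illustration only (not part of any Parquet spec):
#   header: num_row_groups, num_columns            (two little-endian uint32)
#   body:   one (offset, length) pair per column chunk, row-major by row group
HEADER = struct.Struct("<II")
ENTRY = struct.Struct("<II")


def build_footer_index(entries):
    """Serialize entries[row_group][column] = (offset, length) into a flat table."""
    num_row_groups = len(entries)
    num_columns = len(entries[0]) if entries else 0
    out = bytearray(HEADER.pack(num_row_groups, num_columns))
    for row in entries:
        if len(row) != num_columns:
            raise ValueError("every row group must list the same number of columns")
        for offset, length in row:
            out += ENTRY.pack(offset, length)
    return bytes(out)


def locate(index, row_group, column):
    """O(1) lookup: compute the slot position arithmetically and read one entry."""
    num_row_groups, num_columns = HEADER.unpack_from(index, 0)
    if not (0 <= row_group < num_row_groups and 0 <= column < num_columns):
        raise IndexError("row group or column ordinal out of range")
    slot = HEADER.size + (row_group * num_columns + column) * ENTRY.size
    return ENTRY.unpack_from(index, slot)


def column_metadata_bytes(footer, index, row_group, column):
    """Return a zero-copy view of one column's metadata bytes inside the footer."""
    offset, length = locate(index, row_group, column)
    return memoryview(footer)[offset:offset + length]


def build_demo():
    """Build a fake footer (placeholder blobs, NOT real Thrift) plus its index."""
    names = ["id", "name", "price"]
    footer = bytearray(b"<file-level fields>")
    entries = []
    for rg in range(2):
        row = []
        for name in names:
            blob = f"<rg{rg}:{name}>".encode()
            row.append((len(footer), len(blob)))  # record where this blob starts
            footer += blob
        entries.append(row)
    # A dict stands in for the optional on-disk name -> ordinal hash table.
    ordinal_by_name = {name: i for i, name in enumerate(names)}
    return bytes(footer), build_footer_index(entries), ordinal_by_name


if __name__ == "__main__":
    footer, index, ordinal_by_name = build_demo()
    view = column_metadata_bytes(footer, index, 1, ordinal_by_name["price"])
    print(bytes(view))  # b'<rg1:price>'
```

In a real reader, the returned view would be handed to a Thrift decoder positioned at the start of that ColumnMetaData struct, which is exactly the part I'd like feedback on. In this sketch the index costs 8 bytes per column chunk plus the header, so the demo's 2 row groups x 3 columns come to 56 bytes.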
