Yes, thank you. Perhaps something like https://github.com/apache/arrow-rs/pull/8714 would work ;-)
I've been looking for an excuse to revisit this. Ed On 2026/03/26 09:38:21 Andrew Lamb wrote: > Thank you Will, > > I think this idea is very interesting. > > > it’s entirely possible this index-only approach was evaluated and > discarded early on. > > I don't know of any such discussion or evaluation. One challenge of this > approach might be getting thrift-compiler generated parsers to cooperate. > > > would love to hear your thoughts on whether a lightweight index could > give us the O(1) read performance we want without the file bloat we are > trying to avoid. > > I personally think it sounds like a great idea -- and I suggest as a next > step a prototype in one of the open source parquet readers so we can > measure its performance. The arrow-rs implementation (now) has its own > custom thrift decoder, so it might be relatively easy to test out such a > scheme. > > Andrew > > > > > On Thu, Mar 26, 2026 at 5:19 AM Will Edwards via dev <[email protected]> > wrote: > > > Hi everyone, > > > > I have only recently joined the online community, so please accept my > > apologies for jumping in with this input so late in the process. I've been > > catching up on the discussions around the parquet3.fbs FlatBuffer footer > > proposal, and first, I want to say I completely agree with the core > > motivation: Parquet desperately needs O(1) random access to column metadata > > to avoid the linear scanning and heap pressure that currently penalize > > wide-table workloads. > > > > I write Parquet parsers, and my profiling aligns with what many here have > > likely seen. As demonstrated in the *arrow-rs* blog post about custom > > Thrift parsing, simply decoding but discarding (skipping heap allocation > > for) unneeded columns yields a 3x-9x speedup. This highlights that the > > historic cost of Thrift parsing has been as much a code problem as a data > > problem. If we can introduce a way to completely *skip* rather than > > *decode* > > the Thrift bytes of unwanted columns, we can make things faster still. > > > > Given the ongoing challenges with file bloat and metadata duplication in > > the current FlatBuffer proposal, I wanted to float an alternative > > architectural approach: *What if we keep the Thrift footer intact as the > > single source of truth, but append a lightweight, O(1) "Footer Index"?* > > The Core Proposal: A Lightweight Offset Index > > > > Instead of duplicating ColumnMetaData (statistics, encodings, etc.) into a > > massive new FlatBuffer structure, we could append a highly compact jump > > table. For any given row group and column ordinal, this index would simply > > tell the engine exactly where that column's metadata begins inside the > > legacy Thrift footer. > > > > A query engine would simply: > > > > 1. > > > > 2. Consult the new Footer Index to get the exact byte offsets of the > > target column's metadata across the required row groups. > > 2. > > > > Because this parsing is highly targeted, the parser can employ a low- or > > zero-allocation pattern to save those very last CPU cycles. > > 3. > > > > Jump the instruction pointer directly to those bytes in the legacy > > Thrift footer and decode only the requested ColumnMetaData structs. > > Because this parsing is highly targeted, the parser can employ a low- or > > zero-allocation pattern to save those very last CPU cycles. > > > > Optional Add-On: O(1) Access by Name > > > > For engines or use cases that don't rely on external catalogs or ID > > mappings, we could easily extend this concept to natively support O(1) > > column resolution by name. > > > > We could bake a lightweight hash table into the index that simply maps node > > names to their ordinals, and struct names to their corresponding groups of > > ordinals. This provides immediate, zero-scan access to any field by name > > while keeping the footprint microscopically small compared to duplicating > > the entire metadata payload. > > Potential Benefits > > > > - > > > > *Solves the Allocation Bottleneck:* It provides the exact O(1) random > > access needed to skip unwanted columns, entirely eliminating the linear > > scanning and garbage collection overhead. > > - > > > > *Drastically Smaller File Sizes:* We avoid the 3x-4x file bloat > > currently seen when translating Thrift metadata directly to > > FlatBuffers. We > > are only storing an index of integers (and optionally a small hash > > table). > > - > > > > *Single Source of Truth:* We avoid "dual parser" drift. Writers don't > > have to serialize two complete metadata payloads, and the format doesn't > > have to define complex rules about which footer takes precedence if they > > disagree. > > > > I realize there is already significant momentum and code written for > > parquet3.fbs, and it’s entirely possible this index-only approach was > > evaluated and discarded early on. If so, I’d love to understand the > > technical hurdles it faced (e.g., were there issues safely instantiating a > > Thrift decoder mid-stream?). > > > > I appreciate all the hard work going into modernizing the format and would > > love to hear your thoughts on whether a lightweight index could give us the > > O(1) read performance we want without the file bloat we are trying to > > avoid. > > > > Thanks, > > > > Will > > >
