Hi everyone,

I have only recently joined the online community, so please accept my
apologies for jumping in with this input so late in the process. I've been
catching up on the discussions around the parquet3.fbs FlatBuffer footer
proposal, and first, I want to say I completely agree with the core
motivation: Parquet desperately needs O(1) random access to column metadata
to avoid the linear scanning and heap pressure that currently penalize
wide-table workloads.

I write Parquet parsers, and my profiling aligns with what many here have
likely seen. As demonstrated in the *arrow-rs* blog post about custom
Thrift parsing, simply decoding but discarding (skipping heap allocation
for) unneeded columns yields a 3x-9x speedup. This highlights that the
historic cost of Thrift parsing has been as much a code problem as a data
problem. If we can introduce a way to completely *skip* rather than *decode*
the Thrift bytes of unwanted columns, we can make things faster still.

Given the ongoing challenges with file bloat and metadata duplication in
the current FlatBuffer proposal, I wanted to float an alternative
architectural approach: *What if we keep the Thrift footer intact as the
single source of truth, but append a lightweight, O(1) "Footer Index"?*

The Core Proposal: A Lightweight Offset Index

Instead of duplicating ColumnMetaData (statistics, encodings, etc.) into a
massive new FlatBuffer structure, we could append a highly compact jump
table. For any given row group and column ordinal, this index would simply
tell the engine exactly where that column's metadata begins inside the
legacy Thrift footer.

A query engine would simply:

   1. Consult the new Footer Index to get the exact byte offsets of the
   target column's metadata across the required row groups.

   2. Jump the instruction pointer directly to those bytes in the legacy
   Thrift footer and decode only the requested ColumnMetaData structs.

   3. Because this parsing is highly targeted, employ a low- or
   zero-allocation parsing pattern to save those very last CPU cycles.
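To make the idea concrete, here is a minimal sketch of what such a jump table could look like on the read side. The names (`FooterIndex`, `column_meta_offset`) and the layout are purely hypothetical, not a proposed spec; the point is just that the lookup is a single array index, after which the engine can slice the footer bytes directly.

```rust
/// Hypothetical "Footer Index": one u64 byte offset (relative to the
/// start of the Thrift footer) per (row group, column) pair, stored
/// row-major as a flat array of integers.
struct FooterIndex {
    num_columns: usize,
    offsets: Vec<u64>,
}

impl FooterIndex {
    /// O(1) lookup: where does this column's ColumnMetaData begin
    /// inside the legacy Thrift footer?
    fn column_meta_offset(&self, row_group: usize, column: usize) -> u64 {
        self.offsets[row_group * self.num_columns + column]
    }
}

fn main() {
    // Two row groups, three columns; the offsets are made-up positions.
    let index = FooterIndex {
        num_columns: 3,
        offsets: vec![0, 120, 250, 400, 520, 650],
    };
    // Jump straight to row group 1, column 2 without scanning anything.
    let offset = index.column_meta_offset(1, 2);
    assert_eq!(offset, 650);
    println!("decode ColumnMetaData starting at footer byte {offset}");
}
```

The engine would then hand exactly that byte range to its Thrift decoder, leaving every other column's metadata untouched.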

Optional Add-On: O(1) Access by Name

For engines or use cases that don't rely on external catalogs or ID
mappings, we could easily extend this concept to natively support O(1)
column resolution by name.

We could bake a lightweight hash table into the index that simply maps node
names to their ordinals, and struct names to their corresponding groups of
ordinals. This provides immediate, zero-scan access to any field by name
while keeping the footprint microscopically small compared to duplicating
the entire metadata payload.
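A toy sketch of what that name index could resolve, assuming leaf names map to a single ordinal and struct (group) names map to the ordinals of their leaves. The `NameEntry` type and the dotted paths are purely illustrative:

```rust
use std::collections::HashMap;

/// Hypothetical name-index entry: a leaf column resolves to one
/// ordinal; a struct/group resolves to the ordinals of its leaves.
enum NameEntry {
    Leaf(usize),
    Group(Vec<usize>),
}

fn main() {
    let mut name_index: HashMap<&str, NameEntry> = HashMap::new();
    name_index.insert("user.id", NameEntry::Leaf(0));
    name_index.insert("user.name", NameEntry::Leaf(1));
    name_index.insert("user", NameEntry::Group(vec![0, 1]));

    // O(1) resolution by name, no schema scan.
    if let Some(NameEntry::Leaf(ord)) = name_index.get("user.id") {
        assert_eq!(*ord, 0);
    }
    if let Some(NameEntry::Group(ords)) = name_index.get("user") {
        assert_eq!(ords, &vec![0, 1]);
    }
}
```

Combined with the offset table above it, resolving a column by name and jumping to its metadata stays O(1) end to end.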

Potential Benefits

   - *Solves the Allocation Bottleneck:* It provides the exact O(1) random
   access needed to skip unwanted columns, entirely eliminating the linear
   scanning and garbage-collection overhead.

   - *Drastically Smaller File Sizes:* We avoid the 3x-4x file bloat
   currently seen when translating Thrift metadata directly to FlatBuffers;
   we are only storing an index of integers (and optionally a small hash
   table).

   - *Single Source of Truth:* We avoid "dual parser" drift. Writers don't
   have to serialize two complete metadata payloads, and the format doesn't
   have to define complex rules about which footer takes precedence if they
   disagree.

I realize there is already significant momentum and code written for
parquet3.fbs, and it’s entirely possible this index-only approach was
evaluated and discarded early on. If so, I’d love to understand the
technical hurdles it faced (e.g., were there issues safely instantiating a
Thrift decoder mid-stream?).

I appreciate all the hard work going into modernizing the format and would
love to hear your thoughts on whether a lightweight index could give us the
O(1) read performance we want without the file bloat we are trying to avoid.

Thanks,

Will
