Hi folks, I just realized the table did not render very well, apologies for that. Please ignore it, it's just a condensed version of the text.
-- Divjot Arora

On Thu, Apr 9, 2026 at 6:25 PM Divjot Arora <[email protected]> wrote:

> Hi all,
>
> Following up from the previous mailing list thread [1] about alternative
> options to the flatbuffer footer proposal [2].
>
> Goals: Improve performance and stability reading wide-schema Parquet files
> (10K+ columns). This requires (1) faster access to column metadata in the
> footer, and (2) reducing footer bloat. For example, path_in_schema causes
> quadratic size blowup with deeply nested schemas - we've seen production
> files with 300 MB+ footers, almost 60% of which was path_in_schema alone
> (see the linked original flatbuf proposal for an example).
>
> Background: PR #544 proposes a FlatBuffer-based footer written alongside
> the existing Thrift one via the extension framework. A recent mailing list
> thread proposes an alternative: leave the Thrift footer as-is and add an
> optional "jump table" index for O(1) access to individual column chunks.
>
> Note: the jump table is complementary to any FlatBuffer approach; it
> benefits existing files regardless of the path we take for new ones.
>
> Options:
>
> 1. Jump table only. Add an optional index into the existing Thrift footer.
>    - Pros: Simplest approach, no incompatible changes, minimal file size
>      increase, solves faster access.
>    - Cons: Does not address footer bloat. For huge footers, O(1) seek helps
>      but the entire footer must still be fetched.
>
> 2. Jump table + targeted Thrift fixes. Add the jump table and fix the
>    worst bloat sources (e.g., make path_in_schema optional).
>    - Pros: Minimal incompatible changes that address both goals.
>    - Cons: parquet-java cannot parse files with empty path_in_schema. The
>      code change is easy, but this cannot be in effect immediately as
>      open-source Spark and downstream offerings (EMR, Fabric, etc.) would
>      need to upgrade.
>
> 3. Minimal FlatBuffer footer. New footer with just schema + column chunk
>    placement. Statistics, page indexes, etc. added as optional modules over
>    time and don’t necessarily need to live in the footer. For the
>    pathological footer case in the flatbuf proposal, the schema and column
>    chunk placement information account for only 3% of the full footer size.
>    - Pros: Smallest incremental step toward a redesigned footer. Addresses
>      both goals long-term and allows for an incremental redesign of all
>      fields, not just the most obvious ones. Performance-sensitive engines
>      can leverage the new footer immediately.
>    - Cons: Both footers written during transition, increasing file size.
>      Engines that need statistics can't drop the Thrift footer until those
>      modules ship, so near-term benefit is limited.
>
> 4. Full FlatBuffer footer. Finalize the FlatBuffer design with all fields
>    from the Thrift footer. The two evolve in lockstep until a format version
>    bump drops Thrift.
>    - Pros: Addresses both goals and fully redesigns all footer components.
>    - Cons: Largest scope. PR #544 has already generated extended design
>      debate; we risk stalling and preventing any win until the full
>      proposal is agreed upon.
>
> Summary Table:
>
> Option 1: Jump Table Only
>    - What: Optional index for O(1) access into existing Thrift footer
>    - Faster access: Yes
>    - Reduces bloat: No
>    - Incompatible changes: None
>    - File size impact: Minimal increase
>    - Scope / risk: Simplest
>    - Main downside: Entire bloated footer still fetched
>
> Option 2: Jump Table + Thrift Fixes
>    - What: Jump table + make worst bloat sources optional (e.g.
>      path_in_schema)
>    - Faster access: Yes
>    - Reduces bloat: Yes
>    - Incompatible changes: Medium (a breaking format change)
>    - File size impact: Minimal increase
>    - Scope / risk: Small
>    - Main downside: parquet-java can't parse empty path_in_schema; needs
>      upstream upgrades across Spark/EMR/Fabric
>
> Option 3: Minimal FlatBuffer
>    - What: New footer with schema + column placement only; stats/indexes
>      added later
>    - Faster access: Yes
>    - Reduces bloat: Yes (long-term)
>    - Incompatible changes: Dual-write during transition
>    - File size impact: Increases (two footers) until Thrift dropped
>    - Scope / risk: Medium
>    - Main downside: Engines needing stats can't drop Thrift until stat
>      modules ship
>
> Option 4: Full FlatBuffer
>    - What: Complete FlatBuffer replacement for all Thrift footer fields
>    - Faster access: Yes
>    - Reduces bloat: Yes
>    - Incompatible changes: Dual-write, eventual format version bump
>    - File size impact: Increases (two footers) until Thrift dropped
>    - Scope / risk: Largest; risk of stalling on design debate
>    - Main downside: Extended design debate (PR #544) may block any
>      near-term wins
>
> [1] https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
> [2] https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0#heading=h.ccu4zzsy0tm5
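To make the "quadratic size blowup" from the Goals paragraph concrete, here is a rough sketch of why repeating the full path per column grows so quickly. It is a toy model, not taken from the proposal: the schema shape (a single chain of nested structs with one leaf column per level) and the helper names `leaf_paths` and `total_components` are assumptions for illustration, and it counts name components rather than actual Thrift bytes.

```python
# Toy model (an assumption, not from the proposal): a chain of nested
# structs s0.s1.s2..., with exactly one leaf column at each nesting level.


def leaf_paths(depth):
    """Return the full path (list of name components) of every leaf column."""
    paths = []
    prefix = []
    for level in range(depth):
        paths.append(prefix + [f"leaf{level}"])  # leaf living at this level
        prefix = prefix + [f"s{level}"]          # descend one struct deeper
    return paths


def total_components(paths):
    """Name components stored when every column repeats its full path."""
    return sum(len(p) for p in paths)


if __name__ == "__main__":
    for depth in (10, 100, 1000):
        stored = total_components(leaf_paths(depth))
        # Closed form for this toy schema: depth * (depth + 1) / 2
        assert stored == depth * (depth + 1) // 2
        print(depth, stored)
```

In this model, doubling the nesting depth roughly quadruples the number of stored path components, while the schema tree itself only grows linearly because each name appears there once. Real files will differ in shape, so treat this as intuition only.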
