Hi folks,

I just realized the table did not render very well, apologies for that.
Please ignore it, it's just a condensed version of the text.

-- Divjot Arora

On Thu, Apr 9, 2026 at 6:25 PM Divjot Arora <[email protected]>
wrote:

> Hi all,
>
> This follows up on the previous mailing list thread [1] about
> alternatives to the FlatBuffer footer proposal [2].
>
> Goals: Improve performance and stability when reading wide-schema Parquet
> files (10K+ columns). This requires (1) faster access to column metadata
> in the footer, and (2) less footer bloat. For example, path_in_schema
> causes quadratic size blowup with deeply nested schemas - we've seen
> production files with 300 MB+ footers, almost 60% of which was
> path_in_schema alone (see the linked original flatbuf proposal for an
> example).
>
> Background: PR #544 proposes a FlatBuffer-based footer written alongside
> the existing Thrift one via the extension framework. A recent mailing list
> thread proposes an alternative: leave the Thrift footer as-is and add an
> optional "jump table" index for O(1) access to individual column chunks.
>
> Note: the jump table is complementary to any FlatBuffer approach — it
> benefits existing files regardless of the path we take for new ones.
>
> Options:
>
> 1. Jump table only. Add an optional index into the existing Thrift footer.
> - Pros: Simplest approach, no incompatible changes, minimal file size
> increase, provides faster access.
> - Cons: Does not address footer bloat. For huge footers, the O(1) seek
> helps, but the entire footer must still be fetched.
>
> 2. Jump table + targeted Thrift fixes. Add the jump table and fix the
> worst bloat sources (e.g., make path_in_schema optional).
> - Pros: Minimal incompatible changes that address both goals.
> - Cons: parquet-java cannot parse files with empty path_in_schema. The
> code change is easy, but it cannot take effect immediately because
> open-source Spark and downstream offerings (EMR, Fabric, etc.) would
> need to upgrade first.
>
> 3. Minimal FlatBuffer footer. New footer with just schema + column chunk
> placement. Statistics, page indexes, etc. are added as optional modules
> over time and don't necessarily need to live in the footer. For the
> pathological footer case in the flatbuf proposal, the schema and column
> chunk placement information account for only 3% of the full footer size.
> - Pros: Smallest incremental step toward a redesigned footer. Addresses
> both goals long-term and allows for an incremental redesign of all
> fields, not just the most obvious ones. Performance-sensitive engines can
> leverage the new footer immediately.
> - Cons: Both footers are written during the transition, increasing file
> size. Engines that need statistics can't drop the Thrift footer until
> those modules ship, so near-term benefit is limited.
>
> 4. Full FlatBuffer footer. Finalize the FlatBuffer design with all fields
> from the Thrift footer. The two evolve in lockstep until a format version
> bump drops Thrift.
> - Pros: Addresses both goals and fully redesigns all footer components.
> - Cons: Largest scope. PR #544 has already generated extended design
> debate, so we risk stalling and preventing any win until the full
> proposal is agreed upon.
>
> Summary Table:
>
> Option 1: Jump Table Only
>   What: Optional index for O(1) access into existing Thrift footer
>   Faster access: Yes
>   Reduces bloat: No
>   Incompatible changes: None
>   File size impact: Minimal increase
>   Scope / risk: Simplest
>   Main downside: Entire bloated footer still fetched
>
> Option 2: Jump Table + Thrift Fixes
>   What: Jump table + make worst bloat sources optional (e.g.
>     path_in_schema)
>   Faster access: Yes
>   Reduces bloat: Yes
>   Incompatible changes: Medium, is a breaking format change
>   File size impact: Minimal increase
>   Scope / risk: Small
>   Main downside: parquet-java can't parse empty path_in_schema; needs
>     upstream upgrades across Spark/EMR/Fabric
>
> Option 3: Minimal FlatBuffer
>   What: New footer with schema + column placement only; stats/indexes
>     added later
>   Faster access: Yes
>   Reduces bloat: Yes (long-term)
>   Incompatible changes: Dual-write during transition
>   File size impact: Increases (two footers) until Thrift dropped
>   Scope / risk: Medium
>   Main downside: Engines needing stats can't drop Thrift until stat
>     modules ship
>
> Option 4: Full FlatBuffer
>   What: Complete FlatBuffer replacement for all Thrift footer fields
>   Faster access: Yes
>   Reduces bloat: Yes
>   Incompatible changes: Dual-write, eventual format version bump
>   File size impact: Increases (two footers) until Thrift dropped
>   Scope / risk: Largest - risk of stalling on design debate
>   Main downside: Extended design debate (PR #544) may block any
>     near-term wins
>
>
> [1] https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
> [2]
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0#heading=h.ccu4zzsy0tm5
>
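
To make the quadratic path_in_schema growth mentioned under Goals a bit more concrete, here is a toy Python sketch. The schema shape (a single chain of nested structs with one leaf column per level) and the names leaf_paths and total_components are made up for illustration. Real files will look different, and this counts path components rather than serialized bytes.

```python
def leaf_paths(depth: int) -> list[list[str]]:
    """Full paths of every leaf in a hypothetical chain-shaped schema.

    Level k adds a struct "s<k>" nested inside the previous struct, plus
    one leaf column, so the leaf at level k has k + 2 path components.
    """
    paths = []
    prefix: list[str] = []
    for level in range(depth):
        prefix.append(f"s{level}")
        # Each column chunk stores its complete path, not just the last name.
        paths.append(prefix + ["leaf"])
    return paths


def total_components(depth: int) -> int:
    """Number of path strings stored across all leaves at this depth."""
    return sum(len(path) for path in leaf_paths(depth))


if __name__ == "__main__":
    for depth in (100, 200, 400):
        print(depth, total_components(depth))
```

For this toy shape the count works out to d(d+3)/2, so doubling the nesting depth roughly quadruples the stored path components while the number of leaf columns only doubles.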

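
For anyone who wants a picture of the jump table idea from the Background paragraph, below is a minimal sketch. It does not use Thrift or the real Parquet layout: the length-prefixed records, the 4-byte little-endian offsets, and all function names are assumptions standing in for whatever encoding the proposal ends up with.

```python
import struct


def build_footer(records: list[bytes]) -> tuple[bytes, bytes]:
    """Serialize records back to back and return (footer, jump_table).

    Each record is length-prefixed so a reader without the jump table can
    still walk the footer from the start (a stand-in for sequential
    Thrift decoding).
    """
    footer = bytearray()
    offsets = []
    for rec in records:
        offsets.append(len(footer))  # byte offset where this record starts
        footer += struct.pack("<I", len(rec)) + rec
    jump_table = struct.pack(f"<{len(offsets)}I", *offsets)
    return bytes(footer), jump_table


def read_sequential(footer: bytes, index: int) -> bytes:
    """Reach record `index` by skipping every record before it: O(index)."""
    pos = 0
    for _ in range(index):
        (length,) = struct.unpack_from("<I", footer, pos)
        pos += 4 + length
    (length,) = struct.unpack_from("<I", footer, pos)
    return footer[pos + 4 : pos + 4 + length]


def read_with_jump_table(footer: bytes, jump_table: bytes, index: int) -> bytes:
    """Reach record `index` with one offset lookup: O(1)."""
    (pos,) = struct.unpack_from("<I", jump_table, 4 * index)
    (length,) = struct.unpack_from("<I", footer, pos)
    return footer[pos + 4 : pos + 4 + length]


if __name__ == "__main__":
    columns = [f"column-chunk-{i}".encode() for i in range(10_000)]
    footer, jump_table = build_footer(columns)
    print(read_with_jump_table(footer, jump_table, 9_999))
```

Note that the lookup only saves decoding work. Both readers take the complete footer buffer as input, which matches the Option 1 downside that the entire footer must still be fetched.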