Hi all,

Following up on the previous mailing list thread [1] about alternatives to the FlatBuffer footer proposal [2].

Goals:

Improve performance and stability when reading wide-schema Parquet files (10K+ columns). This requires (1) faster access to column metadata in the footer, and (2) reducing footer bloat. For example, path_in_schema causes quadratic size blowup with deeply nested schemas, because every leaf column repeats its full path. We've seen production files with 300 MB+ footers, almost 60% of which was path_in_schema alone (see the linked original flatbuf proposal for an example).

Background:

PR #544 proposes a FlatBuffer-based footer written alongside the existing Thrift one via the extension framework. A recent mailing list thread proposes an alternative: leave the Thrift footer as-is and add an optional "jump table" index for O(1) access to individual column chunks.

Note: the jump table is complementary to any FlatBuffer approach. It benefits existing files regardless of the path we take for new ones.

Options:

1. Jump table only. Add an optional index into the existing Thrift footer.
   - Pros: Simplest approach, no incompatible changes, minimal file size increase, solves faster access.
   - Cons: Does not address footer bloat. For huge footers, O(1) seek helps, but the entire footer must still be fetched.

2. Jump table + targeted Thrift fixes. Add the jump table and fix the worst bloat sources (e.g., make path_in_schema optional).
   - Pros: Minimal incompatible changes that address both goals.
   - Cons: parquet-java cannot parse files with empty path_in_schema. The code change is easy, but it cannot take effect immediately, because open-source Spark and downstream offerings (EMR, Fabric, etc.) would need to upgrade first.

3. Minimal FlatBuffer footer. New footer with just schema + column chunk placement. Statistics, page indexes, etc. are added as optional modules over time and don’t necessarily need to live in the footer. For the pathological footer case in the flatbuf proposal, the schema and column chunk placement information account for only 3% of the full footer size.
   - Pros: Smallest incremental step toward a redesigned footer. Addresses both goals long-term and allows for an incremental redesign of all fields, not just the most obvious ones. Performance-sensitive engines can leverage the new footer immediately.
   - Cons: Both footers are written during the transition, increasing file size. Engines that need statistics can't drop the Thrift footer until those modules ship, so near-term benefit is limited.

4. Full FlatBuffer footer. Finalize the FlatBuffer design with all fields from the Thrift footer. The two evolve in lockstep until a format version bump drops Thrift.
   - Pros: Addresses both goals and fully redesigns all footer components.
   - Cons: Largest scope. PR #544 has already generated extended design debate; we risk stalling and blocking any win until the full proposal is agreed upon.

Summary Table:

| | Option 1: Jump Table Only | Option 2: Jump Table + Thrift Fixes | Option 3: Minimal FlatBuffer | Option 4: Full FlatBuffer |
|---|---|---|---|---|
| What | Optional index for O(1) access into existing Thrift footer | Jump table + make worst bloat sources optional (e.g. path_in_schema) | New footer with schema + column placement only; stats/indexes added later | Complete FlatBuffer replacement for all Thrift footer fields |
| Faster access | Yes | Yes | Yes | Yes |
| Reduces bloat | No | Yes | Yes (long-term) | Yes |
| Incompatible changes | None | Medium (breaking format change) | Dual-write during transition | Dual-write, eventual format version bump |
| File size impact | Minimal increase | Minimal increase | Increases (two footers) until Thrift dropped | Increases (two footers) until Thrift dropped |
| Scope / risk | Simplest | Small | Medium | Largest; risk of stalling on design debate |
| Main downside | Entire bloated footer still fetched | parquet-java can't parse empty path_in_schema; needs upstream upgrades across Spark/EMR/Fabric | Engines needing stats can't drop Thrift until stat modules ship | Extended design debate (PR #544) may block any near-term wins |

[1] https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
[2] https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0#heading=h.ccu4zzsy0tm5
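
For anyone who wants to see the path_in_schema blowup mentioned under Goals in numbers, here is a rough Python sketch. It is only an illustration under simplifying assumptions: the schema shapes and the helper names (path_bytes, flat_schema, chain_schema) are made up for this example, and it counts raw string bytes only, ignoring Thrift encoding overhead. The point is that when every leaf column stores its full path, total path size grows roughly with the square of the nesting depth.

```python
def path_bytes(paths):
    """Total UTF-8 bytes needed to store every leaf's full path (list of name parts)."""
    return sum(len(part.encode("utf-8")) for path in paths for part in path)


def flat_schema(n):
    """Hypothetical flat schema: n top-level leaf columns named c0, c1, ..."""
    return [[f"c{i}"] for i in range(n)]


def chain_schema(depth):
    """Hypothetical worst case: each struct 's' holds one leaf 'v' and the next nested 's'.

    The leaf at nesting level i has the path ['s'] * i + ['v'], so its path
    length grows linearly with i and the sum over all leaves grows quadratically.
    """
    paths = []
    prefix = []
    for _ in range(depth):
        prefix = prefix + ["s"]
        paths.append(prefix + ["v"])
    return paths


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, path_bytes(flat_schema(n)), path_bytes(chain_schema(n)))
```

Doubling the number of leaves roughly doubles the flat total but roughly quadruples the nested one, which is the effect that making path_in_schema optional (Option 2) or moving to a redesigned footer (Options 3 and 4) is meant to remove.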
