Hi all,

This follows up on the previous mailing list thread [1] about alternatives
to the FlatBuffer footer proposal [2].

Goals: Improve performance and stability when reading wide-schema Parquet
files (10K+ columns). This requires (1) faster access to column metadata in
the footer, and (2) reducing footer bloat. For example, path_in_schema causes
quadratic size blowup with deeply nested schemas. We've seen production
files with 300 MB+ footers, almost 60% of which was path_in_schema alone
(see the linked original flatbuf proposal for an example).
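
To make the quadratic growth concrete, here is a rough sketch (not taken
from the proposal) that counts path_in_schema components for a hypothetical
chain-shaped schema, where every nesting level holds one leaf and one nested
struct. The schema shape and the function names are assumptions for
illustration only. Real schemas will look different.

```python
def chain_leaf_paths(depth):
    """Leaf paths for a hypothetical schema: level k has leaf v<k> and struct s<k>."""
    paths = []
    prefix = []
    for level in range(depth):
        paths.append(prefix + [f"v{level}"])  # leaf at this level carries the full prefix
        prefix = prefix + [f"s{level}"]       # descend one struct deeper
    return paths


def path_components(depth, row_groups=1):
    """Total path components stored if every column chunk repeats its full path."""
    per_row_group = sum(len(p) for p in chain_leaf_paths(depth))
    return row_groups * per_row_group


for d in (10, 20, 40):
    print(d, path_components(d))
# 10 55
# 20 210
# 40 820
```

In this shape, doubling the depth roughly quadruples the number of stored
path components (d * (d + 1) / 2 per row group), and the total is multiplied
again by the number of row groups.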

Background: PR #544 proposes a FlatBuffer-based footer written alongside
the existing Thrift one via the extension framework. A recent mailing list
thread proposes an alternative: leave the Thrift footer as-is and add an
optional "jump table" index for O(1) access to individual column chunks.

Note: the jump table is complementary to any FlatBuffer approach — it
benefits existing files regardless of the path we take for new ones.
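
For readers who have not followed the earlier thread, the jump table idea
can be sketched roughly as below. This is a toy layout, not the proposed
on-disk format: the records are opaque byte strings rather than
Thrift-encoded column metadata, and the function names and the
little-endian uint32 offset encoding are assumptions made only for the
example.

```python
import struct


def build_footer(records):
    """Concatenate variable-length records; return (blob, offsets with end sentinel)."""
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))  # start offset of this record
        blob += record
    offsets.append(len(blob))      # sentinel: end of the last record
    return bytes(blob), offsets


def encode_jump_table(offsets):
    """Fixed-width table: one little-endian uint32 per offset."""
    return struct.pack(f"<{len(offsets)}I", *offsets)


def read_record(blob, table, i):
    """Fetch record i using two table entries, without decoding records 0..i-1."""
    start, end = struct.unpack_from("<2I", table, 4 * i)
    return blob[start:end]


records = [b"col-a", b"column-bb", b"c"]
blob, offsets = build_footer(records)
table = encode_jump_table(offsets)
print(read_record(blob, table, 1))  # b'column-bb'
```

Because every table entry has the same width, locating record i costs one
fixed-size read regardless of how many records come before it. Note that
the blob itself still has to be in memory, which is the limitation called
out under Option 1.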

Options:

1. Jump table only. Add an optional index into the existing Thrift footer.
- Pros: Simplest approach, no incompatible changes, minimal file size
increase, delivers faster access.
- Cons: Does not address footer bloat. For huge footers, O(1) seek helps,
but the entire footer must still be fetched.

2. Jump table + targeted Thrift fixes. Add the jump table and fix the worst
bloat sources (e.g., make path_in_schema optional).
- Pros: Minimal incompatible changes that address both goals.
- Cons: parquet-java cannot parse files with empty path_in_schema. The code
change is easy, but it cannot take effect immediately, because open-source
Spark and downstream offerings (EMR, Fabric, etc.) would need to upgrade.

3. Minimal FlatBuffer footer. New footer with just schema + column chunk
placement. Statistics, page indexes, etc. added as optional modules over
time and don’t necessarily need to live in the footer. For the pathological
footer case in the flatbuf proposal, the schema and column chunk placement
information account for only 3% of the full footer size.
- Pros: Smallest incremental step toward a redesigned footer. Addresses both
goals long-term and allows for an incremental redesign of all fields,
not just the most obvious ones. Performance-sensitive engines can leverage
the new footer immediately.
- Cons: Both footers written during transition, increasing file size. Engines
that need statistics can't drop the Thrift footer until those modules ship,
so near-term benefit is limited.

4. Full FlatBuffer footer. Finalize the FlatBuffer design with all fields
from the Thrift footer. The two evolve in lockstep until a format version
bump drops Thrift.
- Pros: Addresses both goals and fully redesigns all footer components.
- Cons: Largest scope. PR #544 has already generated extended design debate;
we risk stalling and preventing any win until the full proposal is agreed
upon.
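
On Option 2: as far as I understand it, path_in_schema can be made optional
because the footer's schema already determines it. The schema is stored as a
flattened, depth-first list of elements, each with a name and a child count.
A minimal sketch of how a reader could rebuild the leaf paths from that list
follows. The (name, num_children) tuples stand in for SchemaElement, the
function name is hypothetical, and this is not how parquet-java does it
today.

```python
def derive_leaf_paths(schema):
    """schema: list of (name, num_children) in depth-first order, root first.

    Returns one path per leaf, excluding the root name. Simplification: any
    element with zero children is treated as a leaf column.
    """
    paths = []
    pos = 1  # index 0 is the root element

    def visit(prefix):
        nonlocal pos
        name, num_children = schema[pos]
        pos += 1
        path = prefix + [name]
        if num_children == 0:
            paths.append(path)
        for _ in range(num_children):
            visit(path)

    _, root_children = schema[0]
    for _ in range(root_children):
        visit([])
    return paths


schema = [
    ("schema", 2),
    ("id", 0),
    ("user", 2),
    ("name", 0),
    ("address", 1),
    ("zip", 0),
]
print(derive_leaf_paths(schema))
# [['id'], ['user', 'name'], ['user', 'address', 'zip']]
```

The i-th derived path would correspond to the i-th column chunk in each row
group, assuming column chunks are written in schema leaf order.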

Summary Table:

Option 1: Jump Table Only
  What: Optional index for O(1) access into existing Thrift footer
  Faster access: Yes
  Reduces bloat: No
  Incompatible changes: None
  File size impact: Minimal increase
  Scope / risk: Simplest
  Main downside: Entire bloated footer still fetched

Option 2: Jump Table + Thrift Fixes
  What: Jump table + make worst bloat sources optional (e.g.
    path_in_schema)
  Faster access: Yes
  Reduces bloat: Yes
  Incompatible changes: Medium, is a breaking format change
  File size impact: Minimal increase
  Scope / risk: Small
  Main downside: parquet-java can't parse empty path_in_schema; needs
    upstream upgrades across Spark/EMR/Fabric

Option 3: Minimal FlatBuffer
  What: New footer with schema + column placement only; stats/indexes
    added later
  Faster access: Yes
  Reduces bloat: Yes (long-term)
  Incompatible changes: Dual-write during transition
  File size impact: Increases (two footers) until Thrift dropped
  Scope / risk: Medium
  Main downside: Engines needing stats can't drop Thrift until stat
    modules ship

Option 4: Full FlatBuffer
  What: Complete FlatBuffer replacement for all Thrift footer fields
  Faster access: Yes
  Reduces bloat: Yes
  Incompatible changes: Dual-write, eventual format version bump
  File size impact: Increases (two footers) until Thrift dropped
  Scope / risk: Largest (risk of stalling on design debate)
  Main downside: Extended design debate (PR #544) may block any
    near-term wins


[1] https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
[2] https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0#heading=h.ccu4zzsy0tm5
