Hi Andrew,

> I wonder if you can clarify what you mean by "footer bloat"

I was using “footer bloat” to mean “any information that gets in the way of
quickly accessing column data”. This includes unnecessarily large fields
like “path_in_schema” and “converted_type” but also other non-placement
fields (i.e. not data/index/dictionary offsets) that are interspersed.

> As Raphael pointed out earlier, switching to use Flatbuffers in the
> footer seems to increase footer bloat, in the sense that files require more
> bytes to store the same data.
> So my point is that maybe we shouldn't convolve a discussion about footer
> bloat and the flatbuffers proposal, as it is already proving challenging to
> get some consensus.

Fair point. Options 3 and 4 could be more generic, as the exact format is
not important. I wanted the thread to focus on how far we can push things
with backwards-compatible changes (option 1) vs. backwards-incompatible
changes (options 2/3/4).

If we agree that option 1 is not moving the needle much and that we want to
make only one incompatible change, then we should focus on finding the
incompatible change with the highest impact.

Best,
Div

On Fri, Apr 10, 2026 at 8:43 PM Andrew Lamb <[email protected]> wrote:

>
> > Goals: Improve performance and stability reading wide-schema Parquet
> > files (10K+ columns). This requires (1) faster access to column metadata
> > in the footer, and (2) reducing footer bloat.
>
> I wonder if you can clarify what you mean by "footer bloat"
>
> As Raphael pointed out earlier, switching to use Flatbuffers in the footer
> seems to increase footer bloat, in the sense that files require more bytes
> to store the same data.
>
> This is true both:
> 1. In the period where two copies of the footer are present (flatbuffers
> and thrift)
> 2. Likely even for files that only use flatbuffers, given that thrift is a
> relatively compact encoding.
>
> So my point is that maybe we shouldn't convolve a discussion about footer
> bloat and the flatbuffers proposal, as it is already proving challenging to
> get some consensus.
>
> Andrew
>
>
> On Fri, Apr 10, 2026 at 10:48 AM Divjot Arora via dev <
> [email protected]> wrote:
>
>> Hi Will,
>>
>> Thanks for the reply. Some thoughts:
>>
>> > Consistency in how we evaluate adoption risk
>> > Format fragmentation and the double-write penalty
>>
>> The flatbuffer proposal uses the extension framework to write both footers
>> during the transitionary time. Given this, I think option 3 carries less
>> adoption risk than option 2: with this framework, readers that lack
>> support just see a Thrift footer and ignore the rest. By contrast, option 2
>> produces files that existing parquet-java readers cannot parse at all. I
>> don’t feel the two-tier ecosystem is a major concern as there would be
>> proper comms and a deprecation period for Thrift before a breaking upgrade
>> to PAR3. There is no preference for “early adopters”; engines get locked
>> out only if they don’t upgrade during this whole period.
>>
>> > How much of the problem is the format vs. the implementations?
>>
>> You’re correct that there is a wide gap in the existing Thrift parsers and
>> there is likely room for improvement in raw parsing throughput for most/all
>> of the implementations. However, the biggest win from the flatbuffer
>> proposal comes from removing fields such as path_in_schema that cause
>> massive blowup in footer size.
>>
>> Expanding on the example I mentioned in the previous message: we observed
>> one footer in our production fleet that was 367 MB. Even with a jump table
>> and a highly optimized Thrift parser, fetching the footer from cloud
>> storage (~50 MB/s) takes ~7 seconds; even assuming 200 MB/s with
>> aggressive prefetching, this is still almost 2 seconds. Assuming the jump
>> table lookup and Thrift parsing are free, this is still a long delay before
>> the engine can read data for the file. The path_in_schema field accounted
>> for ~57% of the footer, so with that removed, the footer is 157 MB and
>> requires 0.8 - 3 seconds to fetch.
>>
>> With option 3 (minimal flatbuf): the schema + column chunk placement
>> information account for ~11 MB of the total footer (~7% of the footer after
>> path_in_schema is removed). This would be appended to the file after the
>> Thrift footer, increasing file size by 3%. Fetching just this piece would
>> take 220 ms at ~50 MB/s, a 3-13x improvement over the Thrift option, even
>> with path_in_schema removed.
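
The fetch-time figures above can be reproduced with a rough sketch like the
one below. It assumes sustained bandwidth and ignores request latency; the
sizes, the ~57% share, and the two bandwidths are the figures quoted in the
message, while `fetch_seconds` and the constant names are hypothetical and
only serve the illustration.

```python
# Back-of-the-envelope fetch times for the footer sizes quoted in the thread.
# Assumption: transfer time = size / bandwidth, with no per-request latency.

def fetch_seconds(size_mb: float, bandwidth_mb_per_s: float) -> float:
    """Seconds needed to transfer size_mb at a sustained bandwidth."""
    return size_mb / bandwidth_mb_per_s

FULL_FOOTER_MB = 367.0           # observed Thrift footer
PATH_IN_SCHEMA_FRACTION = 0.57   # approximate share taken by path_in_schema
MINIMAL_FLATBUF_MB = 11.0        # schema + column chunk placement info
BANDWIDTHS_MB_PER_S = (50.0, 200.0)

# Footer size once path_in_schema is dropped. Because the 57% share is
# approximate, this lands near the 157 MB quoted rather than exactly on it.
trimmed_footer_mb = FULL_FOOTER_MB * (1.0 - PATH_IN_SCHEMA_FRACTION)

cases = [
    ("full Thrift footer", FULL_FOOTER_MB),
    ("Thrift without path_in_schema", trimmed_footer_mb),
    ("minimal flatbuf", MINIMAL_FLATBUF_MB),
]

for label, size_mb in cases:
    times = ", ".join(
        f"{fetch_seconds(size_mb, bw):.2f} s at {bw:.0f} MB/s"
        for bw in BANDWIDTHS_MB_PER_S
    )
    print(f"{label} ({size_mb:.0f} MB): {times}")
```

Since the model is purely linear in size, parsing improvements do not enter
it at all; only the number of bytes fetched changes the result.
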
>>
>> > Looking ahead
>>
>> I fully agree that we should pursue making path_in_schema optional for
>> Thrift and the jump table approach as these will greatly improve the
>> performance for existing workloads.
>> However, when fetch time alone takes seconds, no amount of parsing
>> optimization gets us where we need to be.
>>
>> Best,
>> Div
>>
>
