Hi Andrew,

> I wonder if you can clarify what you mean by "footer bloat"

I was using “footer bloat” to mean “any information that gets in the way of quickly accessing column data”. This includes unnecessarily large fields like “path_in_schema” and “converted_type”, but also other non-placement fields (i.e. anything other than data/index/dictionary offsets) that are interspersed with the placement fields.

> As Raphael pointed out earlier, switching to use Flatbuffers in the footer
> seems to increase footer bloat, in the sense that files require more bytes
> to store the same data.

> So my point is that maybe we shouldn't convolve a discussion about footer
> bloat and the flatbuffers proposal, as it is already proving challenging to
> get some consensus.

Fair point. Options 3 and 4 could be more generic, as the exact format is not important. I wanted the thread to focus on how far we can push things with backwards-compatible changes (option 1) vs. backwards-incompatible changes (options 2/3/4). If we agree that option 1 is not moving the needle much and that we want to make only one incompatible change, then we should focus on finding the incompatible change with the highest impact.

Best,
Div

On Fri, Apr 10, 2026 at 8:43 PM Andrew Lamb <[email protected]> wrote:

> > Goals: Improve performance and stability reading wide-schema Parquet files
> > (10K+ columns). This requires (1) faster access to column metadata in the
> > footer, and (2) reducing footer bloat.
>
> I wonder if you can clarify what you mean by "footer bloat"
>
> As Raphael pointed out earlier, switching to use Flatbuffers in the footer
> seems to increase footer bloat, in the sense that files require more bytes
> to store the same data.
>
> This is true both:
> 1. In the period where two copies of the footer are present (flatbuffers
>    and thrift)
> 2. Likely even for files that only use flatbuffers, given that thrift is a
>    relatively compact encoding.
>
> So my point is that maybe we shouldn't convolve a discussion about footer
> bloat and the flatbuffers proposal, as it is already proving challenging to
> get some consensus.
>
> Andrew
>
>
> On Fri, Apr 10, 2026 at 10:48 AM Divjot Arora via dev <[email protected]> wrote:
>
>> Hi Will,
>>
>> Thanks for the reply. Some thoughts:
>>
>> > Consistency in how we evaluate adoption risk
>> > Format fragmentation and the double-write penalty
>>
>> The flatbuffer proposal uses the extension framework to write both footers
>> during the transitionary time. Given this, I think option 3 carries less
>> adoption risk than option 2: with this framework, readers that lack support
>> just see a Thrift footer and ignore the rest. By contrast, option 2
>> produces files that existing parquet-java readers cannot parse at all. I
>> don’t feel the two-tier ecosystem is a major concern as there would be
>> proper comms and a deprecation period for Thrift before a breaking upgrade
>> to PAR3. There is no preference for “early adopters”; engines get locked
>> out only if they don’t upgrade during this whole period.
>>
>> > How much of the problem is the format vs. the implementations?
>>
>> You’re correct that there is a wide gap in the existing Thrift parsers and
>> there is likely room for improvement in raw parsing throughput for most/all
>> of the implementations. However, the biggest win from the flatbuffer
>> proposal comes from removing fields such as path_in_schema that cause
>> massive blowup in footer size.
>>
>> Expanding on the example I mentioned in the previous message: we observed
>> one footer in our production fleet that was 367 MB. With a jump table +
>> highly optimized Thrift parser: fetching the footer from cloud storage
>> (~50 MB/s) takes ~7 seconds; even assuming 200 MB/s with aggressive
>> prefetching, this is still almost 2 seconds.
>> Assuming the jump table lookup and Thrift
>> parsing are free, this is still a long delay before the engine can read
>> data for the file. The path_in_schema field accounted for ~57% of the
>> footer, so with that removed, the footer is 157 MB and requires 0.8 - 3
>> seconds to fetch.
>>
>> With option 3 (minimal flatbuf): the schema + column chunk placement
>> information account for ~11 MB of the total footer (~7% of the footer after
>> path_in_schema is removed). This would be appended to the file after the
>> Thrift footer, increasing file size by 3%. Fetching just this piece would
>> take 220 ms, a 3-13x improvement over the Thrift option, even with
>> path_in_schema removed.
>>
>> > Looking ahead
>>
>> I fully agree that we should pursue making path_in_schema optional for
>> Thrift and the jump table approach as these will greatly improve the
>> performance for existing workloads.
>> However, when fetch time alone takes seconds, no amount of parsing
>> optimization gets us where we need to be.
>>
>> Best,
>> Div
>>
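
The fetch-time figures in the quoted message (367 MB, 157 MB and 11 MB footers at ~50 MB/s and 200 MB/s) can be checked with a short calculation. This is only a sketch: it assumes fetch time is simply size divided by sustained bandwidth, ignoring request latency, jump table lookup and parsing, as the message does. The names `fetch_seconds`, `fetch_range` and `FOOTERS_MB` are made up for illustration.

```python
# Back-of-the-envelope check of the footer fetch times quoted in the thread.
# Assumption: fetch time = size / bandwidth; latency and parsing are ignored.

SLOW_MB_PER_S = 50.0    # plain cloud storage read, as quoted
FAST_MB_PER_S = 200.0   # with aggressive prefetching, as quoted

# Footer sizes in MB, taken from the message.
FOOTERS_MB = {
    "full Thrift footer": 367.0,
    "Thrift without path_in_schema": 157.0,
    "minimal flatbuf section": 11.0,
}


def fetch_seconds(size_mb, bandwidth_mb_per_s):
    """Transfer time in seconds for size_mb at the given bandwidth."""
    return size_mb / bandwidth_mb_per_s


def fetch_range(size_mb):
    """(best, worst) fetch time in seconds across the two bandwidths."""
    return (
        fetch_seconds(size_mb, FAST_MB_PER_S),
        fetch_seconds(size_mb, SLOW_MB_PER_S),
    )


for name, size_mb in FOOTERS_MB.items():
    best, worst = fetch_range(size_mb)
    print(f"{name}: {best:.2f} s to {worst:.2f} s")
```

Under these assumptions the 11 MB section takes 0.22 s at 50 MB/s, while the 157 MB Thrift footer takes about 0.8 s at 200 MB/s and about 3.1 s at 50 MB/s, which is where the quoted "0.8 - 3 seconds" and "220 ms" come from.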
