Hi Will,

Thanks for the reply. Some thoughts:

> Consistency in how we evaluate adoption risk
> Format fragmentation and the double-write penalty

The flatbuffer proposal uses the extension framework to write both footers during the transition period. Given this, I think option 3 carries less adoption risk than option 2: with this framework, readers that lack support just see a Thrift footer and ignore the rest. By contrast, option 2 produces files that existing parquet-java readers cannot parse at all.

I don’t feel the two-tier ecosystem is a major concern, as there would be proper comms and a deprecation period for Thrift before a breaking upgrade to PAR3. There is no preference for “early adopters”; engines get locked out only if they don’t upgrade at any point during this whole period.

> How much of the problem is the format vs. the implementations?

You’re correct that there is a wide gap between the existing Thrift parsers, and there is likely room for improvement in raw parsing throughput for most or all of the implementations. However, the biggest win from the flatbuffer proposal comes from removing fields such as path_in_schema that cause massive blowup in footer size.

Expanding on the example I mentioned in the previous message: we observed one footer in our production fleet that was 367 MB.

With a jump table + highly optimized Thrift parser: fetching the footer from cloud storage (~50 MB/s) takes ~7 seconds; even assuming 200 MB/s with aggressive prefetching, this is still almost 2 seconds. Even assuming the jump table lookup and Thrift parsing are free, this is still a long delay before the engine can read data for the file. The path_in_schema field accounted for ~57% of the footer, so with that removed, the footer is 157 MB and requires 0.8 to 3 seconds to fetch.

With option 3 (minimal flatbuf): the schema + column chunk placement information account for ~11 MB of the total footer (~7% of the footer after path_in_schema is removed). This would be appended to the file after the Thrift footer, increasing file size by 3%. Fetching just this piece would take 220 ms (at ~50 MB/s), a 3-13x improvement over the Thrift option, even with path_in_schema removed.

> Looking ahead

I fully agree that we should pursue making path_in_schema optional for Thrift and the jump table approach, as these will greatly improve the performance for existing workloads. However, when fetch time alone takes seconds, no amount of parsing optimization gets us where we need to be.

Best,
Div
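
P.S. A minimal sketch of the fetch-time arithmetic above, in case it helps to sanity-check the numbers. It models only size divided by sustained bandwidth and ignores request latency, parsing, and the jump table lookup. The function and constant names are hypothetical, chosen just for this illustration; the sizes, the ~57% share, and the two bandwidth figures are the ones quoted in this message.

```python
# Rough fetch-time model: seconds = size (MB) / sustained bandwidth (MB/s).
# All names here are illustrative assumptions, not part of any Parquet API.

FULL_FOOTER_MB = 367.0          # observed Thrift footer size
PATH_IN_SCHEMA_FRACTION = 0.57  # approximate share taken by path_in_schema
FLATBUF_MB = 11.0               # schema + column chunk placement only
BANDWIDTHS_MB_PER_S = (50.0, 200.0)  # plain fetch vs. aggressive prefetching


def fetch_seconds(size_mb: float, bandwidth_mb_per_s: float) -> float:
    """Time to transfer size_mb at a constant bandwidth, with no latency term."""
    return size_mb / bandwidth_mb_per_s


def footer_sizes_mb() -> dict:
    """Footer size in MB for each of the three cases discussed above."""
    trimmed = FULL_FOOTER_MB * (1.0 - PATH_IN_SCHEMA_FRACTION)
    return {
        "thrift_full": FULL_FOOTER_MB,
        "thrift_without_path_in_schema": trimmed,
        "flatbuf_minimal": FLATBUF_MB,
    }


def fetch_table() -> dict:
    """Map each case to {bandwidth: fetch time in seconds}."""
    return {
        name: {bw: fetch_seconds(size, bw) for bw in BANDWIDTHS_MB_PER_S}
        for name, size in footer_sizes_mb().items()
    }


if __name__ == "__main__":
    for name, times in fetch_table().items():
        row = ", ".join(f"{t:.2f} s at {bw:.0f} MB/s" for bw, t in times.items())
        print(f"{name}: {row}")
```

The point of keeping the model this simple is that it is a lower bound: any real reader adds latency and parse time on top, so the gap between the cases can only be judged from transfer time if everything else is assumed free.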
