Hi Will,

Thanks for the reply. Some thoughts:

> Consistency in how we evaluate adoption risk
> Format fragmentation and the double-write penalty

The flatbuffer proposal uses the extension framework to write both footers
during the transition period. Given this, I think option 3 carries less
adoption risk than option 2: with this framework, readers that lack support
just see a Thrift footer and ignore the rest. By contrast, option 2
produces files that existing parquet-java readers cannot parse at all. I
don’t think the two-tier ecosystem is a major concern, as there would be
proper communication and a deprecation period for Thrift before a breaking
upgrade to PAR3. There is no preference for “early adopters”; engines get
locked out only if they don’t upgrade at any point during this period.

> How much of the problem is the format vs. the implementations?

You’re correct that there is a wide performance gap among the existing
Thrift parsers, and there is likely room to improve raw parsing throughput
in most, if not all, of the implementations. However, the biggest win from
the flatbuffer proposal comes from removing fields such as path_in_schema
that cause massive blowup in footer size.

Expanding on the example I mentioned in the previous message: we observed
one footer in our production fleet that was 367 MB. With a jump table and a
highly optimized Thrift parser, fetching the footer from cloud storage (~50
MB/s) still takes ~7 seconds; at 200 MB/s with aggressive prefetching, it
takes almost 2 seconds. Even if the jump table lookup and Thrift parsing
were free, that is a long delay before the engine can read any data from
the file. The path_in_schema field accounted for ~57% of the footer, so
with that removed, the footer is ~157 MB and takes 0.8 to 3 seconds to
fetch.

With option 3 (minimal flatbuf): the schema + column chunk placement
information account for ~11 MB of the total footer (~7% of the footer after
path_in_schema is removed). This would be appended to the file after the
Thrift footer, increasing file size by 3%. Fetching just this piece would
take about 220 ms at ~50 MB/s, a 3-13x improvement over the Thrift option,
even with path_in_schema removed.
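
For anyone who wants to check the arithmetic, here is a rough Python sketch
of the numbers above. The sizes and the ~57% share are the figures quoted in
this message; the function and constant names are only illustrative, and the
model assumes fetch time is simply size divided by sustained bandwidth (no
request latency, no parallel range reads).

```python
# Back-of-envelope fetch-time model: seconds = size / sustained bandwidth.
# Sizes and the ~57% share are the figures quoted in this thread; the names
# and the model itself (no request latency, no parallel reads) are assumptions.

THRIFT_FOOTER_MB = 367.0          # observed production footer
PATH_IN_SCHEMA_SHARE = 0.57       # approximate share taken by path_in_schema
FLATBUF_FOOTER_MB = 11.0          # schema + column chunk placement only
BANDWIDTHS_MB_PER_S = (50.0, 200.0)


def fetch_seconds(size_mb: float, bandwidth_mb_per_s: float) -> float:
    """Time to fetch size_mb at a sustained bandwidth, ignoring latency."""
    return size_mb / bandwidth_mb_per_s


def footer_sizes_mb() -> dict[str, float]:
    """Footer size in MB for each variant discussed above."""
    trimmed = THRIFT_FOOTER_MB * (1.0 - PATH_IN_SCHEMA_SHARE)
    return {
        "thrift (full)": THRIFT_FOOTER_MB,
        "thrift (no path_in_schema)": trimmed,
        "flatbuf (minimal)": FLATBUF_FOOTER_MB,
    }


if __name__ == "__main__":
    # One row per footer variant, with the fetch time at each bandwidth.
    for name, size_mb in footer_sizes_mb().items():
        times = ", ".join(
            f"{fetch_seconds(size_mb, bw):.2f} s @ {bw:.0f} MB/s"
            for bw in BANDWIDTHS_MB_PER_S
        )
        print(f"{name:28s} {size_mb:6.1f} MB  {times}")
```

Under this model the gain at any fixed bandwidth is just the size ratio,
roughly 14x between the trimmed Thrift footer and the 11 MB flatbuf piece.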

> Looking ahead

I fully agree that we should pursue both making path_in_schema optional for
Thrift and the jump table approach, as these will greatly improve
performance for existing workloads. However, when fetch time alone takes
seconds, no amount of parsing optimization gets us where we need to be.

Best,
Div
