Re: [DISCUSS] Alternative to FlatBuffer Footer: A Lightweight Byte-Offset Index

Ed Seidl Wed, 08 Apr 2026 09:18:38 -0700

Hi Jiayi,

This is getting a little off topic, but I did a quick test to see how much would
be involved in getting some major implementations to support an optional
path_in_schema.

In short, parquet-java required changing a single constructor call to use
a setter, and replacing two accesses of path_in_schema with an already
available array of paths from the schema metadata. arrow-cpp
required no code changes at all beyond regenerating the thrift structures.
And as mentioned previously, arrow-rs has never used the field at all. As
to performance, removing the field from a 10000 column flat schema saved
around 2MB out of 11, so a 17% reduction. Parsing time in arrow-rs improved
only about 3% since the field is simply skipped if encountered anyway, so
no allocations are saved. I haven't tried benchmarking the other 
implementations.
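
To illustrate why the field is redundant, here is a minimal Python sketch of
deriving the leaf column paths from the schema alone. It is not code from any
of the implementations above. It assumes only that the footer schema is a
flattened, depth-first list of elements with a name and a child count;
SchemaElement and leaf_paths are hypothetical names used for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    # Hypothetical stand-in for the Thrift SchemaElement; only the two
    # fields needed for path reconstruction are modeled.
    name: str
    num_children: Optional[int] = None  # None (or 0) marks a leaf column


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Derive every leaf column's path from the flattened schema list."""
    paths: List[List[str]] = []
    pos = 1  # index 0 is the root, which is not part of any path

    def visit(prefix: List[str], count: int) -> None:
        nonlocal pos
        for _ in range(count):
            elem = schema[pos]
            pos += 1
            path = prefix + [elem.name]
            if elem.num_children:
                visit(path, elem.num_children)  # group: descend into children
            else:
                paths.append(path)  # leaf: one column chunk per row group

    visit([], schema[0].num_children or 0)
    return paths


# Example: root { a, b { c, d } }
schema = [
    SchemaElement("root", 2),
    SchemaElement("a"),
    SchemaElement("b", 2),
    SchemaElement("c"),
    SchemaElement("d"),
]
paths = leaf_paths(schema)
```

The i-th entry of paths corresponds to the i-th column chunk in each row
group, which is the information path_in_schema repeats for every chunk.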

So I don't think it's going to take years of effort to deprecate that field.
Of course, the guidelines for forward-incompatible changes [1] will need to be
followed, so it will take some time for the changes to ripple through the
ecosystem, but users would have the ability to save a good bit of space by
turning the unused field off themselves.

If the field is so damaging, I simply don't see why we need to wait any longer
to remove it. Just because the v3 proposal exists doesn't mean all work on the
current format needs to halt. Will we forestall work on new encodings like ALP
until v3 is ready to go? I hope we won't make the perfect the enemy of the good
here.

Cheers,
Ed

[1] https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format

On 2026/04/07 12:41:35 王嘉仪 wrote:
> Hi all,
> 
> We tested whether various Parquet readers depend on path_in_schema (in
> ColumnMetaData) to understand the impact of deprecating it.
> 
> We tested parquet-mr, Databricks, DuckDB, ClickHouse, Snowflake, and
> Fabric. The result is that parquet-mr and Fabric use path_in_schema as a
> hard dependency and cannot read files without it. Databricks supports
> reading files without path_in_schema in newer versions, and the other
> readers support reading them in their latest versions.
> 
> That said, deprecating the field is hard in the current Thrift-based
> Parquet spec and would require years of effort for the ecosystem to adopt
> the change. A reminder that this field is a list of strings and contributes
> heavily to footer bloat. We've seen footers as large as 367 MB in
> production, with over 60% of the size coming from this field alone.
> 
> On the other hand, the FlatBuffer footer proposal gives us the ability to
> not only decode the schema in an efficient way but also remove redundant
> fields completely without breaking compatibility with the existing Thrift
> footer. It doesn't introduce a breaking change, but also doesn't slow down
> our ability to evolve the footer. It buys us time to embrace the change
> across the entire ecosystem.
> 
> Best, Jiayi
>
