For what it is worth, arrow-rs/parquet-rs does not, and as far as I know never has, used path_in_schema beyond propagating it, so there are at least some readers out there for which writing placeholder data wouldn't break anything. I wouldn't be surprised to learn that most readers don't actually use it, although some may validate it. Perhaps it could be an explicit opt-out for users running into issues with wide tables who don't depend on readers that rely on path_in_schema?

That being said, I am a little confused: are we trying to reduce footer bloat, or are we trying to add support for random access? If the goal is to reduce footer bloat, flatbuffers seems an unusual choice, as the encoding is inherently not space-efficient...

My 2 cents is that Thrift is actually a pretty good encoding for the parquet footer: it's a simple, relatively space-efficient encoding with broad ecosystem support and a good story for evolution. Its major deficiency is the lack of random access support, which the byte-offset index would appear to address for the access patterns users are likely to care about, without requiring any breaking changes. With regards to design decisions like path_in_schema, there's nothing about using flatbuffers that makes it immune to these sorts of design oversights. Yes, a clean-sheet design is an opportunity to fix the ones we're aware of now, but in 5-10 years' time I imagine similar issues will have cropped up. Phrasing it differently - if doing a breaking change it makes sense to clean these sorts of things up, but this shouldn't be used as justification for the breaking change in the first place.

Re: the options presented, I don't see them as mutually exclusive. The one thing I would say is that option B does run the risk of getting stuck in a partially migrated state indefinitely, which would IMO be the worst possible outcome.

Kind Regards,

Raphael Taylor-Davies

On 01/04/2026 18:00, Alkis Evlogimenos via dev wrote:
Thank you Will, Ed, and Andrew for the discussion.

path_in_schema being optional. Making it optional doesn't avoid a breaking
change - it is a breaking change. Every existing reader uses path_in_schema
to reconstruct the column→schema mapping. The moment writers stop emitting
it, old readers break. Same ecosystem coordination cost as a new footer
format. If we're paying that cost, we should get more than one fewer field
for it.

Statistics overpopulation. Agreed - writers should be smarter about what
they emit. Andrew's parquet-linter helps here. But writer hygiene and
compact encoding are complementary, not competing.

distinct_count and SizeStatistics. Fair point Ed, I overstated "dead
weight." These have real uses. They belong in a better structure though -
separate from the critical path of "give me offset + stats for columns X,
Y, Z."

Jump table. Will, the benchmarks are compelling for existing files. I'd
support it as a backward-compatible optimization for PAR1.

O(1) name resolution. This doesn't work in general. Name matching is
engine-specific - many engines do case-insensitive lookup, which means
lowercasing strings in a locale-dependent way. A hash table baked into the
file has to pick one canonicalization, and that won't match what half the
engines do. We could restrict to ASCII but I don't think we can impose that
restriction on column names now. And it doesn't matter in practice - at 10K
columns, hashing short column name strings and building a lookup on read
doesn't show up in profiles.
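To make the canonicalization point concrete, here is a hypothetical sketch (not from any engine's actual code) of why the lookup has to be built engine-side: each engine applies its own name-matching policy at read time, and a hash table baked into the file would have to fix one policy for everyone. The `build_lookup` helper and the sample names are illustrative assumptions.

```python
# Hypothetical sketch: each engine builds its own name -> leaf-ordinal lookup
# at read time, applying its own canonicalization policy. A hash table baked
# into the file would have to pick one policy for every engine.

def build_lookup(column_names, case_insensitive=False):
    """Map (possibly canonicalized) column names to leaf ordinals."""
    canon = str.casefold if case_insensitive else (lambda s: s)
    return {canon(name): i for i, name in enumerate(column_names)}

names = ["UserID", "Straße", "value"]

exact = build_lookup(names)
folded = build_lookup(names, case_insensitive=True)

assert exact["UserID"] == 0
assert folded["userid"] == 0
# str.casefold maps the German sharp s to "ss", a locale-sensitive choice:
# a case-insensitive engine resolves "strasse"; an exact-match engine won't.
assert folded["strasse"] == 1
assert "strasse" not in exact
```

Building this dictionary for 10K short names is a trivial one-time cost per file open, which is the "doesn't show up in profiles" point above.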

Moving forward. There is a tradeoff between complexity and scope. I see
three paths:

A. Ship the full FlatBuffer footer as proposed. Move forward with PR #544
as-is, logically compatible with the Thrift footer - schema, column chunk
metadata, statistics, page indexes, encryption, all of it. One transition,
one spec. Risk: the scope keeps generating debate and we
stay stuck.

B. Ship a minimal FlatBuffer core, add modules later. Strip the FlatBuffer
footer to schema + column chunk placement (file offset, compressed size,
uncompressed size) - the minimum a reader needs to plan I/O. Statistics,
size statistics, page indexes, encryption become separate
optional FlatBuffer modules that live before the footer and are referenced
by pointer from the core. Ratify the core now, add modules as independent
work streams. This unblocks the part everyone agrees on and lets us iterate
on the contentious pieces without re-litigating the core.

C. Improve statistics and page indexes within the current format. Hold off
on the FlatBuffer footer. Focus on smarter writer defaults, tooling like
parquet-linter, and Will's jump table for O(1) access to existing files. No
format break, but we accept the structural limitations of Thrift.

My preference is A or B, whichever lands faster.


On Tue, Mar 31, 2026 at 1:18 PM Andrew Lamb <[email protected]> wrote:

I also agree the statistics are a mess. But then, I think a bigger
problem is overpopulation of the statistics. There is very little benefit
to simple min/max statistics on unsorted columns. If writers were a little
more conservative and simply omitted these optional statistics for columns
that have no chance of benefiting from them that would reduce a great deal
of bloat.

This is a great idea about how to take advantage of the (existing) metadata
better.

Something Xiangpeng Hao, Jigao Luo and I have been exploring is a
parquet-linter[1] (still in the early phase) to help users find the best
settings for their data when writing Parquet (without changing the format).
This might be helpful to identify such sources of bloat for existing files.

Andrew

[1]: https://github.com/XiangpengHao/parquet-linter



On Mon, Mar 30, 2026 at 1:59 PM Ed Seidl <[email protected]> wrote:

Thanks for the perspective, Alkis. I'd just like to add a few comments.

On 2026/03/27 13:37:46 Alkis Evlogimenos via dev wrote:
1. Dedup. The Thrift footer repeats path_in_schema (a list of strings) for every column in every row group. For a 10K-column, 4-RG file that's 40K string lists and it's the single biggest source of footer bloat. The FlatBuffer footer drops it entirely — it's derivable from schema + column ordinal. Same for type (already in the schema), the full encodings list, and encoding_stats (replaced by a single bool).

I agree path_in_schema is pretty useless, but we could just make that field optional. Yes, this would break old readers, but then so would adding a new encoding or compression codec. Old readers can't be expected to work forever.
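The "derivable from schema + column ordinal" claim can be sketched in a few lines: a depth-first walk of the schema tree yields every leaf column's full path in schema order, so the leaf ordinal alone identifies the path. The toy tuple-based schema below is a stand-in assumption for the real Parquet SchemaElement list, not actual parquet code.

```python
# Hypothetical sketch: reconstructing path_in_schema from the schema tree.
# A node is (name, None) for a leaf column or (name, [children]) for a group.

def leaf_paths(children, prefix=()):
    """Yield the path (tuple of names) of each leaf column, in schema order."""
    for name, subfields in children:
        path = prefix + (name,)
        if subfields is None:          # leaf column
            yield path
        else:                          # group node: recurse
            yield from leaf_paths(subfields, path)

# Toy schema: one flat column and one nested group.
schema = [
    ("id", None),
    ("address", [("city", None), ("zip", None)]),
]

paths = list(leaf_paths(schema))
# paths[ordinal] is exactly the path_in_schema for that leaf ordinal,
# so storing it per column chunk per row group is pure duplication.
assert paths == [("id",), ("address", "city"), ("address", "zip")]
```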

2. Compact stats. Thrift Statistics stores min/max as variable-length binary with per-field framing. The FlatBuffer footer uses fixed-width integers for numeric types and a prefix+truncated-suffix scheme for byte arrays. Across thousands of columns this adds up.

I also agree the statistics are a mess. But then, I think a bigger problem is overpopulation of the statistics. There is very little benefit to simple min/max statistics on unsorted columns. If writers were a little more conservative and simply omitted these optional statistics for columns that have no chance of benefiting from them, that would reduce a great deal of bloat.
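For readers unfamiliar with byte-array stat truncation: the thread doesn't spell out the exact FlatBuffer scheme, but the standard trick (used by Parquet's page indexes) is that a truncated min is still a valid lower bound as-is, while a truncated max must have its last kept byte incremented to remain an upper bound. A minimal sketch, under that assumption:

```python
# Hypothetical sketch of byte-array min/max truncation (the exact FlatBuffer
# scheme isn't specified in the thread; this follows the standard approach).

def truncate_min(value: bytes, width: int) -> bytes:
    # A prefix compares <= the full value, so it is still a lower bound.
    return value[:width]

def truncate_max(value: bytes, width: int) -> bytes:
    prefix = bytearray(value[:width])
    if len(prefix) >= len(value):
        return value  # nothing was cut off; the exact value is fine
    # Bump the last byte that isn't already 0xFF so the result stays an
    # upper bound for the original value.
    for i in reversed(range(len(prefix))):
        if prefix[i] != 0xFF:
            prefix[i] += 1
            return bytes(prefix[: i + 1])
    return value  # all 0xFF: no shorter prefix can bound the value

assert truncate_min(b"strawberry", 4) == b"stra"
assert truncate_max(b"strawberry", 4) == b"strb"
assert truncate_max(b"strawberry", 4) > b"strawberry"
```

The payoff is fixed, small stat sizes regardless of how long the actual values are, which is where the savings across thousands of columns come from.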

3. Dropped dead weight. ConvertedType, deprecated min/max, distinct_count, SizeStatistics

I'll grant the first two, but already I've seen calls to do something with distinct_count, and I personally use the size statistics, so I do not agree with the "dead weight" label for those. I do agree that their current form is not ideal, but it was a compromise at the time. I think one benefit of the flatbuffers work would be to separate out metadata needed for traversing the file from metadata supporting indexes/other purposes. If we can easily add new specialized structures that are easy to ignore, I think that would be a win.

A jump table into the existing Thrift footer preserves all of this duplication and bloat. You still have to decode the same fat ColumnMetaData structs; you just get to skip to the right one faster.

Given that most of the ColumnMetaData bloat is at the tail end of the struct, the jump table allows for stopping parsing early and skipping to the next column. No need to parse the bloat, but it is still there.

And the index itself adds at least 12 bytes plus framing per column per row group (you need offset+length since Thrift fields are variable-width), so the total footer actually gets bigger.

Not quite. Given that row groups and column chunks are serialized back-to-back, one simply needs N+1 offsets; the lengths can then be derived. Alternatively, if we use 0 offsets for the start of the row groups and the first column chunk in a row group, you could instead just encode N lengths and do an exclusive scan to deduce the offsets. This would allow for using fewer bytes to encode the lengths at the expense of a little more computation when instantiating the table.
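The lengths-plus-exclusive-scan idea is a one-liner at load time. A minimal sketch (the function name and base-offset parameter are illustrative, not from any implementation):

```python
# Hypothetical sketch: since column chunks are laid out back-to-back, a jump
# table can store only N lengths and recover the N absolute offsets with an
# exclusive prefix sum when the table is instantiated.

from itertools import accumulate

def offsets_from_lengths(base: int, lengths: list[int]) -> list[int]:
    """Exclusive scan: offset[i] = base + sum(lengths[:i])."""
    return list(accumulate(lengths, initial=base))[:-1]

# Three chunks of 100, 250, and 50 bytes starting at file offset 4.
assert offsets_from_lengths(4, [100, 250, 50]) == [4, 104, 354]
```

Lengths are typically much smaller numbers than absolute file offsets, so a varint encoding of lengths beats one of offsets, which is the space saving being traded against this small amount of computation.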

Now, if we accept a breaking change is needed to meaningfully shrink the footer, then why not break into a format that also gives us zero-copy access natively?

I do agree that if we are going to completely redo the metadata, then why not change to flatbuffers, so long as we're good with the trade-offs (zero-copy and random access for larger representations).

Cheers,
Ed


