Thank you Will for the proposal and everyone for the discussion. The O(1)
jump table numbers are impressive and the core idea is solid.

We've been working on the parquet3.fbs FlatBuffer footer (PR #544) so I
wanted to share some context on what that work addresses beyond the O(1)
access.

A big chunk of the FlatBuffer footer wins comes from restructuring the
metadata itself:

1. Dedup. The Thrift footer repeats path_in_schema (a list of strings) for
every column in every row group. For a 10K-column, 4-RG file that's 40K
string lists and it's the single biggest source of footer bloat. The
FlatBuffer footer drops it entirely — it's derivable from schema + column
ordinal. Same for type (already in the schema), the full encodings list,
and encoding_stats (replaced by a single bool).

2. Compact stats. Thrift Statistics stores min/max as variable-length
binary with per-field framing. The FlatBuffer footer uses fixed-width
integers for numeric types and a prefix+truncated-suffix scheme for byte
arrays. Across thousands of columns this adds up.

3. Dropped dead weight. ConvertedType, the deprecated min/max fields,
distinct_count, and SizeStatistics are removed outright.
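
To make point 1 concrete, here is a minimal sketch of why path_in_schema is
redundant. The Node class and leaf_paths helper are illustrative only (not
the actual parquet3.fbs or parquet-java API): the schema's leaves, enumerated
depth-first, already determine every column's path, so a reader holding the
schema and a column ordinal can reconstruct the path on demand.

```python
# Hypothetical sketch: path_in_schema falls out of schema + column ordinal.
# The schema's leaves, walked depth-first, give each column's path in order.

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def leaf_paths(root):
    """Return the path (list of names) of each leaf column, in schema order."""
    paths = []
    def walk(node, prefix):
        if not node.children:                    # leaf = primitive column
            paths.append(prefix + [node.name])
        else:
            for child in node.children:
                walk(child, prefix + [node.name])
    for child in root.children:                  # root name is not part of the path
        walk(child, [])
    return paths

schema = Node("root", [
    Node("id"),
    Node("address", [Node("city"), Node("zip")]),
])
print(leaf_paths(schema))     # [['id'], ['address', 'city'], ['address', 'zip']]
print(leaf_paths(schema)[2])  # column ordinal 2 -> ['address', 'zip']
```

Storing this walk once (in the schema) instead of materializing its output
per column chunk per row group is exactly the dedup win.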

A jump table into the existing Thrift footer preserves all of this
duplication and bloat. You still have to decode the same fat ColumnMetaData
structs, you just get to skip to the right one faster. And the index itself
adds at least 12 bytes plus framing per column per row group (you need
offset+length since Thrift fields are variable-width), so the total footer
actually gets bigger.
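
To put a rough number on that overhead (index_bytes is just an illustrative
helper; the 8-byte offset + 4-byte length layout is an assumption, and real
framing would add more):

```python
# Back-of-envelope for the jump-table overhead: one (offset, length) entry
# per column chunk. Assumed layout: 8-byte offset + 4-byte length = 12 bytes,
# before any per-entry framing.
def index_bytes(columns, row_groups, entry_bytes=12):
    return columns * row_groups * entry_bytes

print(index_bytes(10_000, 4))  # 480000 bytes (~469 KiB), on top of a footer
                               # whose Thrift payload is unchanged
```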

The tricky part: realizing 1-3 within Thrift requires a breaking change.
Removing path_in_schema breaks readers that depend on it. Removing
ConvertedType breaks readers that don't understand LogicalType. The
statistics change is the one we could make additively, via a new Statistics2
field, leaving old readers behind.

Now, if we accept a breaking change is needed to meaningfully shrink the
footer, then why not break into a format that also gives us zero-copy
access natively?

That said, I think the jump table idea has independent value as a
backward-compatible optimization for existing files. It could complement
the FlatBuffer work nicely — old files benefit from the index, new files
use the FlatBuffer footer. I just don't think it's a substitute for the
cleanup.

On Fri, Mar 27, 2026 at 12:03 PM Will Edwards via dev <
[email protected]> wrote:

> Thanks Andrew, Ed, and Jan for the great feedback and context! :dance:
>
> As you saw in my previous reply, I created some Java micro-benchmarks to
> test this out, and the numbers align perfectly with our goal.
>
> *@Ed:* Thank you for linking PR #8714! I hadn't seen that POC, but it is
> incredibly validating to see the arrow-rs team actively exploring this
> exact concept. It looks like we are completely aligned on the mechanics of
> why an index is necessary to escape the Thrift framing overhead.
>
> *@Andrew:* You raised a great point about getting standard thrift-compiler
> generated parsers to cooperate.
>
> To keep them happy, the layout of the pointer is key. If we inject the
> pointer to the index as a high-ID fixed-width binary field at the very
> beginning of the FileMetaData struct (e.g., Field ID 32767), standard
> parsers will simply see an unknown field, use the type header to skip those
> 12 bytes, and parse the rest of the legacy Thrift payload normally. They
> don't get the O(1) speedup, but it ensures 100% backward compatibility
> without breaking their deserialization loop.
>
> *@Jan:* Your point about cold-table latency and small, wide Parquet files
> hits the nail on the head. That is exactly why I am so hesitant about the
> current parquet3.fbs proposal. If we duplicate the entire metadata payload
> into FlatBuffers, the file bloat on small, uncompacted files will be
> massive. A byte-offset index gives us the exact point-access speedup you
> are asking for to reduce that latency, but keeps the footer footprint
> incredibly tiny. Of course, if a query does a SELECT *, we still have to
> read all the columns anyway. For that specific case, the index doesn't help
> us skip anything, so performance there really just comes down to whether
> the fastest custom Thrift decoders are fast enough.
>
> Thanks again for the pointers, and I'm looking forward to seeing if the
> Rust/C++ implementations see the same O(1) flatline that the Java prototype
> did.
>
> Cheers,
>
> Will
>
> On Fri, 27 Mar 2026 at 11:32, Will Edwards <[email protected]> wrote:
>
> > Thx for the interest :D
> >
> > I ran some quick micro-benchmarks on my Java footer parser to see if the
> > idea might work.  Sorry I can't share code, but I can share recipes etc.
> >
> > This is running on a weak cloud VM.  The test files were created using an
> > obvious Python pyarrow script.  The test runs a few hundred iterations for
> > warmup then runs a few thousand iterations of each test to get a timing.
> >
> > The *parquet-format* column is the
> > standard org.apache.parquet.format.Util.readFileMetaData(new
> > ByteArrayInputStream(footerBytes)).  This excludes IO (the Hadoop
> > libraries are so slow, that's actually the number one thing to avoid).
> >
> > The *full* column shows my footer parser performing a full footer decode.
> > It spends about 5-10% of the time in the schema making the name tree and
> > so on; the rest of the time is spent on column metadata.  It creates a
> > long[] of num_row_groups * num_columns * num_interesting_fields to decode
> > into.
> > Strings/bytes are stored in the long[] as offsets into the Thrift footer.
> > This is about 8x+ faster than the standard library, and more importantly
> > for my use-cases it generates significantly less garbage.
> >
> > The *by ordinal* column shows my footer parser skipping all columns except
> > one.  The skipping is barely faster than the full decode (you can't early
> > out after the interesting column in the last row-group if you are still
> > waiting for metadata etc) but, importantly for my use-cases, it uses
> > significantly less memory.
> >
> > The *single col* column is O(1) random access using a jump table into the
> > Thrift footer.  This is the speed of the proposal.
> >
> > | Cols | RGs | Footer | parquet-format | full | by ordinal | single col |
> > |------|-----|--------|---------------|------|-----------|------------|
> > | 100 | 1 | 19 KB | 0.27 | 0.040 | 0.017 | 0.001 |
> > | 100 | 2 | 30 KB | 0.35 | 0.040 | 0.031 | 0.001 |
> > | 100 | 4 | 52 KB | 0.57 | 0.077 | 0.061 | 0.001 |
> > | 1K | 1 | 195 KB | 2.58 | 0.21 | 0.17 | 0.000 |
> > | 1K | 2 | 304 KB | 2.88 | 0.35 | 0.32 | 0.001 |
> > | 1K | 4 | 528 KB | 5.39 | 0.69 | 0.63 | 0.001 |
> > | 10K | 1 | 2.0 MB | 16.4 | 1.89 | 1.77 | 0.000 |
> > | 10K | 2 | 3.1 MB | 28.9 | 3.58 | 3.25 | 0.001 |
> > | 10K | 4 | 5.4 MB | 54.2 | 6.99 | 6.33 | 0.001 |
> > | 50K | 1 | 10.3 MB | 85.1 | 9.94 | 9.54 | 0.000 |
> > | 50K | 2 | 15.9 MB | 148.9 | 18.8 | 17.6 | 0.001 |
> > | 50K | 4 | 27.4 MB | 280.7 | 36.7 | 34.1 | 0.001 |
> > | 100K | 1 | 20.6 MB | 174.1 | 20.0 | 19.2 | 0.000 |
> > | 100K | 2 | 31.9 MB | 305.7 | 37.8 | 35.3 | 0.001 |
> > | 100K | 4 | 55.1 MB | 565.3 | 73.6 | 68.5 | 0.002 |
> >
> > Performance is measured in milliseconds.  My fast footer parser is very
> > consistent, but the parquet-format numbers have higher variability because
> > of garbage collections, etc.
> >
> > The O(1) access is very stable: approximately 100 ns to create the arrays
> > etc for the result and then approximately 270 ns per decoded ColumnChunk.
> >
> > The jump table extension might work like this: we have a normal Thrift
> > footer, and immediately after the normal version field, we add a new field
> > that signals the presence of a footer index.  This is a binary field with
> > a new high field-id.  It doesn't have to contain the index; that can be
> > sandwiched between the end of the Thrift footer and the end of the file.
> > It contains an uncompressed long that is the offset to the footer index's
> > start.  Because the field is not a varint, a writer can patch it after
> > writing the footer etc.  Readers who don't know about the index will just
> > skip it.
> >
> > The footer index has a little header indicating its contents: whether it
> > includes a name hash table, a column ordinals jump table etc.  These are
> > all flat, fixed-size arrays.  There is also a list of positions for any
> > fields not in the index, e.g. the position of the metadata, so the reader
> > can reach items the index doesn't cover by skipping past the parts that
> > it does.
> >
> > Would love to know how Java performance compares to rust etc :D
> >
> > Yours hopefully,
> > Will
> >
> > On Thu, 26 Mar 2026 at 10:38, Andrew Lamb <[email protected]> wrote:
> >
> >> Thank you Will,
> >>
> >> I think this idea is very interesting.
> >>
> >> >  it’s entirely possible this index-only approach was evaluated and
> >> discarded early on.
> >>
> >> I don't know of any such discussion or evaluation. One challenge of this
> >> approach might be getting thrift-compiler generated parsers to cooperate.
> >>
> >> >  would love to hear your thoughts on whether a lightweight index could
> >> give us the O(1) read performance we want without the file bloat we are
> >> trying to avoid.
> >>
> >> I personally think it sounds like a great idea -- and I suggest as a next
> >> step a prototype in one of the open source parquet readers so we can
> >> measure its performance. The arrow-rs implementation (now) has its own
> >> custom thrift decoder, so it might be relatively easy to test out such a
> >> scheme.
> >>
> >> Andrew
> >>
> >
>