Hi Alkis,

Thanks for the detailed context on PR #544 — I think it's really useful to
separate what we store from how we read and represent it in memory.

I completely agree that metadata restructuring — deduplicating
path_in_schema, compacting statistics, dropping ConvertedType — is where
the real footer-size wins live. A jump table into the existing Thrift
footer doesn't address any of that; it only solves the random-access
problem.

Regarding parsing speed: I will be the first to concede that Thrift is
fundamentally slow, particularly if you must walk all of it. Even with
aggressive optimization, I haven't been able to break the hardware
bottleneck of the serial dependency chain, which requires about 4 CPU
cycles per touched byte (if anyone has managed to squeeze more out of a
Thrift parser, I'd love to learn how). That translates to about one to
three cycles per byte for your average footer, depending on the number of
string bytes it contains.
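As a quick sanity check on those numbers, here's a back-of-the-envelope
estimate (the 3 GHz clock is purely an illustrative assumption; the
benchmark VM's actual clock isn't specified):

```java
// Back-of-the-envelope: time for the serial dependency chain alone on a
// hypothetical 3 GHz core, for a ~2 MB footer (about the 10K-column,
// 1-row-group case). Illustrative arithmetic, not a measurement.
public class FooterParseEstimate {
    public static void main(String[] args) {
        long footerBytes = 2_000_000L;
        for (double cyclesPerByte : new double[] {1.0, 2.0, 3.0}) {
            double ms = footerBytes * cyclesPerByte / 3e9 * 1e3;
            System.out.printf("%.0f cycle(s)/byte -> ~%.2f ms%n",
                    cyclesPerByte, ms);
        }
    }
}
```

That lands in the same ballpark as the full-decode timings in the
benchmark table further down the thread, which is what makes me think the
dependency chain really is the floor.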

However, in terms of what users actually experience today, that
format-level overhead is often overshadowed by API design. Existing
implementations (like C++ PyArrow or Apache Parquet-MR) naturally evolved
around small column counts and object-tree heap allocations. There is a lot
of room for newer Parquet libraries to offer structurally faster,
zero-allocation APIs to better support ever-wider files. Arrow Rust, for
example, is already much faster precisely because of its array-of-structs
column groups. We can grab a lot of low-hanging fruit here without changing
the storage format at all.

That said, I completely agree that if we do decide to execute a hard format
break, FlatBuffers is a very strong contender, and the work in PR #544 is
valuable for scoping out what a next-generation footer would look like.
Where I see the jump table fitting in is as a complement to that
exploration, not an alternative. The jump table is about newly authored
files that remain fully valid PAR1 — readers that don't understand the
index simply ignore it, while newer engines get O(1) column access. Because
it indexes the existing metadata rather than duplicating it, it avoids the
size and coordination costs of a dual-footer transition, and any single
writer or reader can adopt it independently.

I've been testing an implementation where the footer index entries are
fixed-width to support O(1) lookup. Crucially, these entries don't need to
be 4 or 8 bytes each. Because they only encode the byte offset relative to
the start of the enclosing row group in the Thrift stream, they are
typically just 2 or 3 bytes per column per row group. Thrift structs carry
stop tokens, so a parser can detect where a struct ends without any stored
length.
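To make the entry layout concrete, here is a minimal sketch of the lookup
(the 3-byte little-endian packing and all names are my own illustrative
choices, not a wire-format proposal):

```java
// O(1) column lookup via fixed-width index entries. Each entry is a 3-byte
// unsigned little-endian offset, relative to the start of its row group's
// metadata in the Thrift stream. Layout and names are illustrative only.
public class JumpTableSketch {
    final long[] rowGroupBase; // absolute position of each row group's metadata
    final byte[] entries;      // numRowGroups * numColumns entries, 3 bytes each
    final int numColumns;

    JumpTableSketch(long[] rowGroupBase, byte[] entries, int numColumns) {
        this.rowGroupBase = rowGroupBase;
        this.entries = entries;
        this.numColumns = numColumns;
    }

    /** Absolute byte position of the (rowGroup, column) ColumnChunk struct. */
    long columnChunkPosition(int rowGroup, int column) {
        int i = (rowGroup * numColumns + column) * 3;
        int rel = (entries[i] & 0xFF)
                | (entries[i + 1] & 0xFF) << 8
                | (entries[i + 2] & 0xFF) << 16;
        return rowGroupBase[rowGroup] + rel;
    }
}
```

Three bytes cap a single row group's metadata at 16 MB, which covers every
footer in my benchmarks; a per-file flag could widen the entries for
pathological cases.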

I also bake in a compact name hashtable for fast resolution by name, plus a
few bytes for functional fields like logical type. In my tests, this entire
structure adds only 2–5% to the total Thrift footer size. It's not
shrinking the footer the way PR #544 will, but it's a modest cost for
giving legacy files a capability they currently lack.
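For the name table, something like a flat open-addressed layout works well.
The hash function and slot layout below are illustrative stand-ins, not
what I actually benchmarked, and the String[] stands in for name offsets
into the Thrift footer:

```java
// Compact name -> column-ordinal hash table with linear probing. A real
// layout would store footer byte offsets instead of String references;
// everything here is an illustrative stand-in.
public class NameTableSketch {
    final int[] slots;    // column ordinal + 1; 0 marks an empty slot
    final String[] names; // stand-in for name offsets into the Thrift footer

    NameTableSketch(String[] columnNames) {
        names = columnNames;
        // Power-of-two capacity at <= 50% load so probes stay short.
        int cap = Integer.highestOneBit(columnNames.length * 2 - 1) * 2;
        slots = new int[cap];
        for (int ord = 0; ord < columnNames.length; ord++) {
            int i = hash(columnNames[ord]) & (cap - 1);
            while (slots[i] != 0) i = (i + 1) & (cap - 1); // linear probe
            slots[i] = ord + 1;
        }
    }

    /** Column ordinal for name, or -1 if absent. */
    int ordinal(String name) {
        int mask = slots.length - 1;
        for (int i = hash(name) & mask; slots[i] != 0; i = (i + 1) & mask) {
            int ord = slots[i] - 1;
            if (names[ord].equals(name)) return ord; // verify: hashes collide
        }
        return -1;
    }

    static int hash(String s) { // FNV-1a-style mix, illustrative only
        int h = 0x811C9DC5;
        for (int k = 0; k < s.length(); k++) {
            h = (h ^ s.charAt(k)) * 0x01000193;
        }
        return h;
    }
}
```

Because the slots are fixed-width ints, the whole table serializes as one
flat array, which is what keeps its share of that 2-5% overhead small.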

If the FlatBuffer footer also addressed O(1) resolution by name (rather
than just ordinal), I think that would be a strong functional improvement
worth scoping in — it's one of the sharper pain points for some query
engines dealing with wide schemas.

Cheers,
Will

On Fri, 27 Mar 2026 at 14:38, Alkis Evlogimenos <
[email protected]> wrote:

> Thank you Will for the proposal and everyone for the discussion. The O(1)
> jump table numbers are impressive and the core idea is solid.
>
> We've been working on the parquet3.fbs FlatBuffer footer (PR #544) so I
> wanted to share some context on what that work addresses beyond the O(1)
> access.
>
> A big chunk of the FlatBuffer footer wins come from restructuring the
> metadata itself:
>
> 1. Dedup. The Thrift footer repeats path_in_schema (a list of strings) for
> every column in every row group. For a 10K-column, 4-RG file that's 40K
> string lists and it's the single biggest source of footer bloat. The
> FlatBuffer footer drops it entirely — it's derivable from schema + column
> ordinal. Same for type (already in the schema), the full encodings list,
> and encoding_stats (replaced by a single bool).
>
> 2. Compact stats. Thrift Statistics stores min/max as variable-length
> binary with per-field framing. The FlatBuffer footer uses fixed-width
> integers for numeric types and a prefix+truncated-suffix scheme for byte
> arrays. Across thousands of columns this adds up.
>
> 3. Dropped dead weight. ConvertedType, deprecated min/max, distinct_count,
> SizeStatistics
>
> A jump table into the existing Thrift footer preserves all of this
> duplication and bloat. You still have to decode the same fat ColumnMetaData
> structs, you just get to skip to the right one faster. And the index itself
> adds at least 12 bytes plus framing per column per row group (you need
> offset+length since Thrift fields are variable-width), so the total footer
> actually gets bigger.
>
> The tricky part: to realize 1-3 within Thrift we will need a breaking
> change. Removing path_in_schema breaks readers that depend on it. Removing
> ConvertedType breaks readers that don't understand LogicalType. Statistics
> we could do with Statistics2 and leave old readers behind.
>
> Now, if we accept a breaking change is needed to meaningfully shrink the
> footer, then why not break into a format that also gives us zero-copy
> access natively?
>
> That said, I think the jump table idea has independent value as a
> backward-compatible optimization for existing files. It could complement
> the FlatBuffer work nicely — old files benefit from the index, new files
> use the FlatBuffer footer. I just don't think it's a substitute for the
> cleanup.
>
> On Fri, Mar 27, 2026 at 12:03 PM Will Edwards via dev <
> [email protected]> wrote:
>
>> Thanks Andrew, Ed, and Jan for the great feedback and context! :dance:
>>
>> As you saw in my previous reply, I created some Java micro-benchmarks to
>> test this out, and the numbers align perfectly with our goal.
>>
>> *@Ed:* Thank you for linking PR #8714! I hadn't seen that POC, but it is
>> incredibly validating to see the arrow-rs team actively exploring this
>> exact concept. It looks like we are completely aligned on the mechanics of
>> why an index is necessary to escape the Thrift framing overhead.
>>
>> *@Andrew:* You raised a great point about getting standard thrift-compiler
>> generated parsers to cooperate.
>>
>> To keep them happy, the layout of the pointer is key. If we inject the
>> pointer to the index as a high-ID fixed-width binary field at the very
>> beginning of the FileMetaData struct (e.g., Field ID 32767), standard
>> parsers will simply see an unknown field, use the type header to skip
>> those
>> 12 bytes, and parse the rest of the legacy Thrift payload normally. They
>> don't get the O(1) speedup, but it ensures 100% backward compatibility
>> without breaking their deserialization loop.
>>
>> *@Jan:* Your point about cold-table latency and small, wide Parquet files
>> hits the nail on the head. That is exactly why I am so hesitant about the
>> current parquet3.fbs proposal. If we duplicate the entire metadata payload
>> into FlatBuffers, the file bloat on small, uncompacted files will be
>> massive. A byte-offset index gives us the exact point-access speedup you
>> are asking for to reduce that latency, but keeps the footer footprint
>> incredibly tiny. Of course, if a query does a SELECT *, we still have to
>> read all the columns anyway. For that specific case, the index doesn't
>> help
>> us skip anything, so performance there really just comes down to whether
>> the fastest custom Thrift decoders are fast enough.
>>
>> Thanks again for the pointers, and I'm looking forward to seeing if the
>> Rust/C++ implementations see the same O(1) flatline that the Java
>> prototype
>> did.
>>
>> Cheers,
>>
>> Will
>>
>> On Fri, 27 Mar 2026 at 11:32, Will Edwards <[email protected]> wrote:
>>
>> > Thx for the interest :D
>> >
>> > I ran some quick micro-benchmarks on my Java footer parser to see if the
>> > idea might work.  Sorry I can't share code, but I can share recipes etc.
>> >
>> > This is running on a weak cloud VM.  The test files were created using
>> an
>> > obvious Python pyarrow script.  The test runs a few hundred iterations
>> for
>> > warmup then runs a few thousand iterations of each test to get a timing.
>> >
>> > The *parquet-format* column is the
>> > standard org.apache.parquet.format.Util.readFileMetaData(new
>> > ByteArrayInputStream(footerBytes)).  This excludes IO (the Hadoop
>> libraries
>> > are so slow, that's actually the number one thing to avoid).
>> >
>> > The *full* column shows my footer parser performing a full footer
>> decode.
>> > It spends about 5-10% of the time in the schema making the name tree
>> and so
>> > on; the rest of the time is spent on column metadata.  It creates a
>> long[]
>> > of num_row_groups * num_columns * num_interesting_fields to decode into.
>> > Strings/bytes are stored in the long[] as offsets into the Thrift
>> footer.
>> > This is about 8x+ faster than the standard library, and more importantly
>> > for my use-cases it generates significantly less garbage.
>> >
>> > The *by ordinal* column shows my footer parser skipping all columns
>> except
>> > one.  The skipping is barely faster than the full decode (you can't
>> early
>> > out after the interesting column in the last row-group if you are still
>> > waiting for metadata etc) but, importantly for my use-cases, it uses
>> > significantly less memory.
>> >
>> > The *single col* column is O(1) random access using a jump table into
>> the
>> > Thrift footer.  This is the speed of the proposal.
>> >
>> > | Cols | RGs | Footer | parquet-format | full | by ordinal | single col |
>> > |------|-----|--------|----------------|------|------------|------------|
>> > | 100 | 1 | 19 KB | 0.27 | 0.040 | 0.017 | 0.001 |
>> > | 100 | 2 | 30 KB | 0.35 | 0.040 | 0.031 | 0.001 |
>> > | 100 | 4 | 52 KB | 0.57 | 0.077 | 0.061 | 0.001 |
>> > | 1K | 1 | 195 KB | 2.58 | 0.21 | 0.17 | 0.000 |
>> > | 1K | 2 | 304 KB | 2.88 | 0.35 | 0.32 | 0.001 |
>> > | 1K | 4 | 528 KB | 5.39 | 0.69 | 0.63 | 0.001 |
>> > | 10K | 1 | 2.0 MB | 16.4 | 1.89 | 1.77 | 0.000 |
>> > | 10K | 2 | 3.1 MB | 28.9 | 3.58 | 3.25 | 0.001 |
>> > | 10K | 4 | 5.4 MB | 54.2 | 6.99 | 6.33 | 0.001 |
>> > | 50K | 1 | 10.3 MB | 85.1 | 9.94 | 9.54 | 0.000 |
>> > | 50K | 2 | 15.9 MB | 148.9 | 18.8 | 17.6 | 0.001 |
>> > | 50K | 4 | 27.4 MB | 280.7 | 36.7 | 34.1 | 0.001 |
>> > | 100K | 1 | 20.6 MB | 174.1 | 20.0 | 19.2 | 0.000 |
>> > | 100K | 2 | 31.9 MB | 305.7 | 37.8 | 35.3 | 0.001 |
>> > | 100K | 4 | 55.1 MB | 565.3 | 73.6 | 68.5 | 0.002 |
>> >
>> > Performance is measured in milliseconds.  My fast footer parser is very
>> > consistent, but the parquet-format has higher variability because of
>> > garbage collections, etc.
>> >
>> > The O(1) access is very stable: approximately 100 ns to create the
>> arrays
>> > etc for the result and then approximately 270 ns per decoded
>> ColumnChunk.
>> >
>> > The jump table extension might work like this: we have a normal Thrift
>> > footer, and immediately after the normal version field, we add a new
>> field
>> > that signals the presence of a footer index.  This is a binary field
>> with a
>> > new high field-id.  It doesn't have to contain the index; that can be
>> > sandwiched between the end of the Thrift footer and the end of the file.
>> > It contains an uncompressed long that is the offset to the footer
>> index's
>> > start.  Because the field is not a varint, a writer can patch it after
>> > writing the footer etc.  Readers who don't know about the index will
>> just
>> > skip it.
>> >
>> > The footer index has a little header indicating its contents: whether it
>> > includes a name hash table, a column ordinals jump table etc.  These are
>> > all flat, fixed-size arrays.  It also lists the positions of any fields
>> > not in the index itself, e.g. the position of the metadata and other
>> > items a reader may want to reach by skipping over the parts the index
>> > already locates.
>> >
>> > Would love to know how Java performance compares to rust etc :D
>> >
>> > Yours hopefully,
>> > Will
>> >
>> > On Thu, 26 Mar 2026 at 10:38, Andrew Lamb <[email protected]>
>> wrote:
>> >
>> >> Thank you Will,
>> >>
>> >> I think this idea is very interesting.
>> >>
>> >> >  it’s entirely possible this index-only approach was evaluated and
>> >> discarded early on.
>> >>
>> >> I don't know of any such discussion or evaluation. One challenge of
>> this
>> >> approach might be getting thrift-compiler generated parsers to
>> cooperate.
>> >>
>> >> >  would love to hear your thoughts on whether a lightweight index
>> could
>> >> give us the O(1) read performance we want without the file bloat we are
>> >> trying to avoid.
>> >>
>> >> I personally think it sounds like a great idea -- and I suggest as a
>> next
>> >> step a prototype in one of the open source parquet readers so we can
>> >> measure its performance. The arrow-rs implementation (now) has its own
>> >> custom thrift decoder, so it might be relatively easy to test out such
>> a
>> >> scheme.
>> >>
>> >> Andrew
>> >>
>> >
>>
>
