Thx for the interest :D

I ran some quick micro-benchmarks on my Java footer parser to see if the
idea might work.  Sorry I can't share code, but I can share recipes etc.

This is running on a weak cloud VM.  The test files were created using an
obvious Python pyarrow script.  The test runs a few hundred iterations for
warmup then runs a few thousand iterations of each test to get a timing.
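For the curious, the harness shape is just the usual warmup-then-measure loop; something like this (the counts and helper name are illustrative, not my actual code):

```java
// Minimal warmup-then-measure micro-benchmark loop (illustrative only;
// the real harness's iteration counts and workloads differ).
final class Bench {
    static double millisPerIteration(Runnable work, int warmup, int iters) {
        for (int i = 0; i < warmup; i++) work.run();   // let the JIT settle
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) work.run();
        return (System.nanoTime() - start) / 1e6 / iters; // ms per iteration
    }
}
```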

The *parquet-format* column is the
standard org.apache.parquet.format.Util.readFileMetaData(new
ByteArrayInputStream(footerBytes)).  This excludes IO (the Hadoop
libraries are so slow that avoiding them is actually the number-one
thing to do).

The *full* column shows my footer parser performing a full footer decode.
It spends about 5-10% of the time in the schema making the name tree and so
on; the rest of the time is spent on column metadata.  It creates a long[]
of num_row_groups * num_columns * num_interesting_fields to decode into.
Strings/bytes are stored in the long[] as offsets into the Thrift footer.
This is about 8x+ faster than the standard library, and more importantly
for my use-cases it generates significantly less garbage.
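The flat layout is nothing fancy; it's roughly this shape (the field names below are illustrative, not my actual schema, and the real parser stores strings/bytes as offset/length pairs into the raw footer buffer rather than copies):

```java
// Sketch of a flat long[] metadata layout: one slot per
// (rowGroup, column, field) triple, so a decode is just array writes.
// Field set is hypothetical; the real "interesting" fields differ.
final class FlatFooter {
    static final int F_DATA_PAGE_OFFSET = 0;
    static final int F_TOTAL_COMPRESSED = 1;
    static final int F_NUM_VALUES = 2;
    static final int NUM_FIELDS = 3;

    final int numColumns;
    final long[] values; // numRowGroups * numColumns * NUM_FIELDS longs

    FlatFooter(int numRowGroups, int numColumns) {
        this.numColumns = numColumns;
        this.values = new long[numRowGroups * numColumns * NUM_FIELDS];
    }

    int slot(int rowGroup, int column, int field) {
        return (rowGroup * numColumns + column) * NUM_FIELDS + field;
    }

    void put(int rowGroup, int column, int field, long v) {
        values[slot(rowGroup, column, field)] = v;
    }

    long get(int rowGroup, int column, int field) {
        return values[slot(rowGroup, column, field)];
    }
}
```

One primitive array for the whole footer is what keeps the garbage down: no per-column or per-row-group objects survive the decode.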

The *by ordinal* column shows my footer parser skipping all columns
except one.  Skipping is only slightly faster than the full decode (you
can't bail out early after the interesting column in the last row group
while you are still waiting for trailing metadata, etc.), but,
importantly for my use-cases, it uses significantly less memory.

The *single col* column is O(1) random access using a jump table into the
Thrift footer.  This is the speed of the proposal.

| Cols | RGs | Footer | parquet-format | full | by ordinal | single col |
|------|-----|--------|---------------|------|-----------|------------|
| 100 | 1 | 19 KB | 0.27 | 0.040 | 0.017 | 0.001 |
| 100 | 2 | 30 KB | 0.35 | 0.040 | 0.031 | 0.001 |
| 100 | 4 | 52 KB | 0.57 | 0.077 | 0.061 | 0.001 |
| 1K | 1 | 195 KB | 2.58 | 0.21 | 0.17 | 0.000 |
| 1K | 2 | 304 KB | 2.88 | 0.35 | 0.32 | 0.001 |
| 1K | 4 | 528 KB | 5.39 | 0.69 | 0.63 | 0.001 |
| 10K | 1 | 2.0 MB | 16.4 | 1.89 | 1.77 | 0.000 |
| 10K | 2 | 3.1 MB | 28.9 | 3.58 | 3.25 | 0.001 |
| 10K | 4 | 5.4 MB | 54.2 | 6.99 | 6.33 | 0.001 |
| 50K | 1 | 10.3 MB | 85.1 | 9.94 | 9.54 | 0.000 |
| 50K | 2 | 15.9 MB | 148.9 | 18.8 | 17.6 | 0.001 |
| 50K | 4 | 27.4 MB | 280.7 | 36.7 | 34.1 | 0.001 |
| 100K | 1 | 20.6 MB | 174.1 | 20.0 | 19.2 | 0.000 |
| 100K | 2 | 31.9 MB | 305.7 | 37.8 | 35.3 | 0.001 |
| 100K | 4 | 55.1 MB | 565.3 | 73.6 | 68.5 | 0.002 |

Performance is measured in milliseconds.  My fast footer parser is very
consistent, but parquet-format shows higher variability because of
garbage collections, etc.

The O(1) access is very stable: approximately 100 ns to create the arrays
etc for the result and then approximately 270 ns per decoded ColumnChunk.

The jump table extension might work like this: we keep a normal Thrift
footer and, immediately after the usual version field, add a new field
that signals the presence of a footer index.  This is a binary field
with a new high field-id.  It doesn't have to contain the index itself;
the index can be sandwiched between the end of the Thrift footer and the
end of the file.  The field holds a fixed-width (non-varint) long giving
the offset of the footer index's start.  Because the field is fixed
width, a writer can patch it after writing the footer.  Readers that
don't know about the index will simply skip the field.
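The patching step is the usual reserve-then-backfill trick; something like this (hypothetical helper, and the little-endian fixed-width encoding is just an assumption for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of backfilling a fixed-width offset slot after the footer is
// written.  Because the slot is a fixed 8-byte value rather than a
// varint, the writer can reserve it, finish the footer, then go back
// and fill it in without shifting any bytes.
final class OffsetPatcher {
    static void patch(ByteBuffer file, int slotPos, long indexOffset) {
        file.order(ByteOrder.LITTLE_ENDIAN);
        file.putLong(slotPos, indexOffset); // absolute put; position unchanged
    }
}
```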

The footer index has a small header indicating its contents: whether it
includes a name hash table, a column-ordinals jump table, etc.  These
are all flat, fixed-size arrays.  The header also lists the positions of
anything not covered by the index, e.g. the position of the file-level
metadata and other items the reader may want to reach by skipping over
the parts the index makes skippable.
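The per-ordinal lookup then reduces to a single array read; a sketch (the flag values and names are made up for illustration):

```java
// Sketch of O(1) column lookup via an ordinals jump table.  jump[i] is
// the byte offset of column chunk i within the Thrift footer, so a
// reader can decode exactly one ColumnChunk without touching the rest.
// Header flags and layout here are hypothetical.
final class FooterIndex {
    static final int HAS_NAME_HASH = 1;     // bit flags in the index header
    static final int HAS_ORDINAL_JUMPS = 2;

    final int flags;
    final long[] jump; // one entry per (rowGroup, column) pair

    FooterIndex(int flags, long[] jump) {
        this.flags = flags;
        this.jump = jump;
    }

    // Byte offset of the requested ColumnChunk inside the Thrift footer.
    long columnChunkOffset(int numColumns, int rowGroup, int column) {
        if ((flags & HAS_ORDINAL_JUMPS) == 0)
            throw new IllegalStateException("index has no ordinal jump table");
        return jump[rowGroup * numColumns + column];
    }
}
```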

Would love to know how the Java performance compares to Rust etc :D

Yours hopefully,
Will

On Thu, 26 Mar 2026 at 10:38, Andrew Lamb <[email protected]> wrote:

> Thank you Will,
>
> I think this idea is very interesting.
>
> >  it’s entirely possible this index-only approach was evaluated and
> discarded early on.
>
> I don't know of any such discussion or evaluation. One challenge of this
> approach might be getting thrift-compiler generated parsers to cooperate.
>
> >  would love to hear your thoughts on whether a lightweight index could
> give us the O(1) read performance we want without the file bloat we are
> trying to avoid.
>
> I personally think it sounds like a great idea -- and I suggest as a next
> step a prototype in one of the open source parquet readers so we can
> measure its performance. The arrow-rs implementation (now) has its own
> custom thrift decoder, so it might be relatively easy to test out such a
> scheme.
>
> Andrew
>
