Thx for the interest :D I ran some quick micro-benchmarks on my Java footer parser to see if the idea might work. Sorry I can't share code, but I can share recipes etc.
This is running on a weak cloud VM. The test files were created with a straightforward Python pyarrow script. Each test runs a few hundred warmup iterations, then a few thousand timed iterations. The columns:

- *parquet-format* is the standard org.apache.parquet.format.Util.readFileMetaData(new ByteArrayInputStream(footerBytes)). This excludes IO (the Hadoop IO path is slow enough that avoiding it is actually the number one win, but it's not what I'm measuring here).
- *full* is my footer parser performing a full footer decode. It spends about 5-10% of the time in the schema building the name tree and so on; the rest of the time goes to column metadata. It decodes into a long[] of size num_row_groups * num_columns * num_interesting_fields; strings/bytes are stored in the long[] as offsets into the Thrift footer. This is 8x+ faster than the standard library and, more importantly for my use cases, it generates significantly less garbage.
- *by ordinal* is my footer parser skipping all columns except one. The skipping is barely faster than the full decode (you can't early-out after the interesting column in the last row group while you are still waiting for other metadata etc.) but, importantly for my use cases, it uses significantly less memory.
- *single col* is O(1) random access using a jump table into the Thrift footer. This is the speed of the proposal.
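To make the *full* decode target concrete, here is a minimal sketch of the flat long[] layout described above. All names (FlatFooter, the field slots) are mine, not the actual parser's API, and a real implementation would pick whichever Thrift fields it considers interesting:

```java
// Sketch of a flat decode layout: each (rowGroup, column) pair gets a fixed
// stride of "interesting" fields in one long[]. Strings/bytes are not
// materialized eagerly; they are stored as (offset, length) into the raw
// Thrift footer bytes and decoded only on demand.
final class FlatFooter {
    // Hypothetical field slots per ColumnChunk.
    static final int F_FILE_OFFSET  = 0;
    static final int F_TOTAL_SIZE   = 1;
    static final int F_PATH_OFFSET  = 2; // offset of the path string in footerBytes
    static final int F_PATH_LENGTH  = 3;
    static final int FIELDS_PER_COL = 4;

    final byte[] footerBytes; // raw Thrift footer, kept alive for lazy string access
    final int numColumns;
    final long[] slots;       // numRowGroups * numColumns * FIELDS_PER_COL

    FlatFooter(byte[] footerBytes, int numRowGroups, int numColumns) {
        this.footerBytes = footerBytes;
        this.numColumns = numColumns;
        this.slots = new long[numRowGroups * numColumns * FIELDS_PER_COL];
    }

    private int base(int rowGroup, int column) {
        return (rowGroup * numColumns + column) * FIELDS_PER_COL;
    }

    long totalSize(int rowGroup, int column) {
        return slots[base(rowGroup, column) + F_TOTAL_SIZE];
    }

    // Decoded lazily from the footer bytes, only when actually requested.
    String path(int rowGroup, int column) {
        int b = base(rowGroup, column);
        int off = (int) slots[b + F_PATH_OFFSET];
        int len = (int) slots[b + F_PATH_LENGTH];
        return new String(footerBytes, off, len,
                java.nio.charset.StandardCharsets.UTF_8);
    }
}
```

The point of the single long[] is that the whole decode produces a handful of allocations regardless of column count, instead of one object graph per ColumnChunk.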
| Cols | RGs | Footer | parquet-format | full | by ordinal | single col |
|------|-----|--------|----------------|------|------------|------------|
| 100  | 1   | 19 KB  | 0.27  | 0.040 | 0.017 | 0.001 |
| 100  | 2   | 30 KB  | 0.35  | 0.040 | 0.031 | 0.001 |
| 100  | 4   | 52 KB  | 0.57  | 0.077 | 0.061 | 0.001 |
| 1K   | 1   | 195 KB | 2.58  | 0.21  | 0.17  | 0.000 |
| 1K   | 2   | 304 KB | 2.88  | 0.35  | 0.32  | 0.001 |
| 1K   | 4   | 528 KB | 5.39  | 0.69  | 0.63  | 0.001 |
| 10K  | 1   | 2.0 MB | 16.4  | 1.89  | 1.77  | 0.000 |
| 10K  | 2   | 3.1 MB | 28.9  | 3.58  | 3.25  | 0.001 |
| 10K  | 4   | 5.4 MB | 54.2  | 6.99  | 6.33  | 0.001 |
| 50K  | 1   | 10.3 MB | 85.1  | 9.94  | 9.54  | 0.000 |
| 50K  | 2   | 15.9 MB | 148.9 | 18.8  | 17.6  | 0.001 |
| 50K  | 4   | 27.4 MB | 280.7 | 36.7  | 34.1  | 0.001 |
| 100K | 1   | 20.6 MB | 174.1 | 20.0  | 19.2  | 0.000 |
| 100K | 2   | 31.9 MB | 305.7 | 37.8  | 35.3  | 0.001 |
| 100K | 4   | 55.1 MB | 565.3 | 73.6  | 68.5  | 0.002 |

All times are in milliseconds. My fast footer parser is very consistent, but parquet-format has higher variability because of garbage collections etc. The O(1) access is very stable: approximately 100 ns to create the arrays etc. for the result, and then approximately 270 ns per decoded ColumnChunk.

The jump table extension might work like this: we keep a normal Thrift footer, and immediately after the normal version field we add a new field that signals the presence of a footer index. This is a binary field with a new, high field-id. It doesn't have to contain the index itself; the index can be sandwiched between the end of the Thrift footer and the end of the file. Instead, the field contains an uncompressed long that is the offset of the footer index's start. Because the field is not a varint, a writer can patch it in place after writing the footer etc. Readers that don't know about the index will just skip it. The footer index has a little header indicating its contents: whether it includes a name hash table, a column-ordinals jump table, etc. These are all flat, fixed-size arrays.
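To make the O(1) path concrete, here is a rough reader-side sketch, assuming the footer index stores one fixed-width little-endian long per (row group, column) slot holding the byte offset of that ColumnChunk's Thrift struct inside the footer. The names and exact layout are illustrative only; the proposal just requires flat, fixed-size arrays reachable from the patchable offset field:

```java
// Sketch of O(1) random access via a jump table inside the footer bytes.
// jumpTableStart would be found from the index header, which in turn is
// found via the fixed-width offset field described above (a writer can
// patch that field in place precisely because it is not a varint).
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FooterIndex {
    final ByteBuffer footer;   // the raw footer bytes
    final int numColumns;
    final int jumpTableStart;  // start of the flat long[] of ColumnChunk offsets

    FooterIndex(ByteBuffer footer, int numColumns, int jumpTableStart) {
        this.footer = footer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        this.numColumns = numColumns;
        this.jumpTableStart = jumpTableStart;
    }

    // O(1): jump straight to the one ColumnChunk's Thrift bytes; no
    // sequential scan over the other num_row_groups * num_columns entries.
    long columnChunkOffset(int rowGroup, int column) {
        int slot = rowGroup * numColumns + column;
        return footer.getLong(jumpTableStart + slot * Long.BYTES);
    }
}
```

The reader then decodes just that one ColumnChunk struct with a normal Thrift decoder, which is where the ~270 ns per decoded ColumnChunk above comes from.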
The index also lists the positions of any fields it doesn't cover, e.g. the position of the metadata and other items the reader may want to reach by skipping over the parts the index lets it skip.

Would love to know how Java performance compares to rust etc :D

Yours hopefully,
Will

On Thu, 26 Mar 2026 at 10:38, Andrew Lamb <[email protected]> wrote:

> Thank you Will,
>
> I think this idea is very interesting.
>
> > it’s entirely possible this index-only approach was evaluated and
> > discarded early on.
>
> I don't know of any such discussion or evaluation. One challenge of this
> approach might be getting thrift-compiler generated parsers to cooperate.
>
> > would love to hear your thoughts on whether a lightweight index could
> > give us the O(1) read performance we want without the file bloat we are
> > trying to avoid.
>
> I personally think it sounds like a great idea -- and I suggest as a next
> step a prototype in one of the open source parquet readers so we can
> measure its performance. The arrow-rs implementation (now) has its own
> custom thrift decoder, so it might be relatively easy to test out such a
> scheme.
>
> Andrew
