Thanks Andrew, Ed, and Jan for the great feedback and context! :dance: As you saw in my previous reply, I put together some Java micro-benchmarks to test this out, and the numbers bear the idea out: point access stays flat while full decode cost scales with footer size.
*@Ed:* Thank you for linking PR #8714! I hadn't seen that POC, but it is incredibly validating to see the arrow-rs team actively exploring this exact concept. It looks like we are completely aligned on the mechanics of why an index is necessary to escape the Thrift framing overhead.

*@Andrew:* You raised a great point about getting standard thrift-compiler generated parsers to cooperate. To keep them happy, the layout of the pointer is key. If we inject the pointer to the index as a high-ID fixed-width binary field at the very beginning of the FileMetaData struct (e.g., field ID 32767), standard parsers will simply see an unknown field, use the type header to skip those 12 bytes, and parse the rest of the legacy Thrift payload normally. They don't get the O(1) speedup, but it ensures 100% backward compatibility without breaking their deserialization loop.

*@Jan:* Your point about cold-table latency and small, wide Parquet files hits the nail on the head. That is exactly why I am so hesitant about the current parquet3.fbs proposal. If we duplicate the entire metadata payload into FlatBuffers, the file bloat on small, uncompacted files will be massive. A byte-offset index gives us the exact point-access speedup you are asking for to reduce that latency, but keeps the footer footprint incredibly tiny. Of course, if a query does a SELECT *, we still have to read all the columns anyway. For that specific case, the index doesn't help us skip anything, so performance there really just comes down to whether the fastest custom Thrift decoders are fast enough.

Thanks again for the pointers, and I'm looking forward to seeing if the Rust/C++ implementations see the same O(1) flatline that the Java prototype did.

Cheers,
Will

On Fri, 27 Mar 2026 at 11:32, Will Edwards <[email protected]> wrote:

> Thx for the interest :D
>
> I ran some quick micro-benchmarks on my Java footer parser to see if the idea might work. Sorry I can't share code, but I can share recipes etc.
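To make the backward-compatibility argument concrete, here is a minimal Java sketch of how a Thrift compact-protocol reader that knows nothing about the pointer field would skip it. The wire-format constants follow the Thrift compact protocol spec; the field id (32767) and the 12-byte payload are just the values floated in this thread, and all class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a legacy compact-protocol reader skipping the proposed
// high-ID pointer field. The field header layout and varint encoding follow
// the Thrift compact protocol; everything else here is illustrative.
public class SkipPointerField {
    static final int TYPE_BINARY = 0x08; // compact-protocol type id for binary/string

    static int readVarint(ByteBuffer buf) {
        int result = 0, shift = 0;
        while (true) {
            byte b = buf.get();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // A reader that does not understand the field just uses the type id in the
    // header to skip the value, then keeps parsing the rest of FileMetaData.
    static void skipUnknownField(ByteBuffer buf) {
        byte header = buf.get();
        int type = header & 0x0F;
        if ((header >>> 4) == 0) {
            readVarint(buf);           // long-form header: zigzag field id follows
        }
        if (type == TYPE_BINARY) {
            int len = readVarint(buf); // binary value = varint length + bytes
            buf.position(buf.position() + len);
        } // skipping of the other compact-protocol types is elided in this sketch
    }

    public static void main(String[] args) {
        // Field id 32767 needs the long-form header: type byte 0x08, then
        // zigzag(32767) = 65534 as a varint (0xFE 0xFF 0x03), then the
        // 12-byte pointer payload prefixed by its varint length (0x0C).
        byte[] encoded = {0x08, (byte) 0xFE, (byte) 0xFF, 0x03,
                          0x0C,
                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
        ByteBuffer buf = ByteBuffer.wrap(encoded);
        skipUnknownField(buf);
        System.out.println(buf.position()); // prints 17: the whole field was skipped
    }
}
```

The key point the sketch illustrates: the unknown field costs a legacy reader only a header decode plus a bounded skip, never a parse failure.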
>
> This is running on a weak cloud VM. The test files were created using an obvious Python pyarrow script. The test runs a few hundred iterations for warmup then runs a few thousand iterations of each test to get a timing.
>
> The *parquet-format* column is the standard org.apache.parquet.format.Util.readFileMetaData(new ByteArrayInputStream(footerBytes)). This excludes IO (the Hadoop libraries are so slow, that's actually the number one thing to avoid).
>
> The *full* column shows my footer parser performing a full footer decode. It spends about 5-10% of the time in the schema making the name tree and so on; the rest of the time is spent on column metadata. It creates a long[] of num_row_groups * num_columns * num_interesting_fields to decode into. Strings/bytes are stored in the long[] as offsets into the Thrift footer. This is about 8x+ faster than the standard library, and more importantly for my use-cases it generates significantly less garbage.
>
> The *by ordinal* column shows my footer parser skipping all columns except one. The skipping is barely faster than the full decode (you can't early out after the interesting column in the last row-group if you are still waiting for metadata etc) but, importantly for my use-cases, it uses significantly less memory.
>
> The *single col* column is O(1) random access using a jump table into the Thrift footer. This is the speed of the proposal.
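Conceptually, the *single col* path in the quoted benchmark is just an array index followed by a targeted decode. A rough Java sketch under that assumption — all names here are illustrative, since the actual prototype code was not shared:

```java
// Hypothetical sketch of O(1) "single col" access: a flat long[] jump table
// maps (rowGroup, column) to a byte offset inside the raw Thrift footer, so
// decoding one ColumnChunk never touches its neighbours.
public class JumpTableLookup {
    final byte[] footerBytes;   // raw Thrift footer, kept around undecoded
    final long[] columnOffsets; // numRowGroups * numColumns entries
    final int numColumns;

    JumpTableLookup(byte[] footerBytes, long[] columnOffsets, int numColumns) {
        this.footerBytes = footerBytes;
        this.columnOffsets = columnOffsets;
        this.numColumns = numColumns;
    }

    // O(1): index into the flat array; the caller would then decode just the
    // ColumnChunk that starts at this offset in footerBytes.
    long offsetOf(int rowGroup, int column) {
        return columnOffsets[rowGroup * numColumns + column];
    }
}
```

This is why the *single col* timings in the table below stay flat regardless of footer size: the cost is one array lookup plus one ColumnChunk decode, independent of the number of columns or row groups.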
>
> | Cols | RGs | Footer  | parquet-format (ms) | full (ms) | by ordinal (ms) | single col (ms) |
> |------|-----|---------|---------------------|-----------|-----------------|-----------------|
> | 100  | 1   | 19 KB   | 0.27  | 0.040 | 0.017 | 0.001 |
> | 100  | 2   | 30 KB   | 0.35  | 0.040 | 0.031 | 0.001 |
> | 100  | 4   | 52 KB   | 0.57  | 0.077 | 0.061 | 0.001 |
> | 1K   | 1   | 195 KB  | 2.58  | 0.21  | 0.17  | 0.000 |
> | 1K   | 2   | 304 KB  | 2.88  | 0.35  | 0.32  | 0.001 |
> | 1K   | 4   | 528 KB  | 5.39  | 0.69  | 0.63  | 0.001 |
> | 10K  | 1   | 2.0 MB  | 16.4  | 1.89  | 1.77  | 0.000 |
> | 10K  | 2   | 3.1 MB  | 28.9  | 3.58  | 3.25  | 0.001 |
> | 10K  | 4   | 5.4 MB  | 54.2  | 6.99  | 6.33  | 0.001 |
> | 50K  | 1   | 10.3 MB | 85.1  | 9.94  | 9.54  | 0.000 |
> | 50K  | 2   | 15.9 MB | 148.9 | 18.8  | 17.6  | 0.001 |
> | 50K  | 4   | 27.4 MB | 280.7 | 36.7  | 34.1  | 0.001 |
> | 100K | 1   | 20.6 MB | 174.1 | 20.0  | 19.2  | 0.000 |
> | 100K | 2   | 31.9 MB | 305.7 | 37.8  | 35.3  | 0.001 |
> | 100K | 4   | 55.1 MB | 565.3 | 73.6  | 68.5  | 0.002 |
>
> Performance is measured in milliseconds. My fast footer parser is very consistent, but the parquet-format numbers have higher variability because of garbage collections, etc.
>
> The O(1) access is very stable: approximately 100 ns to create the arrays etc for the result and then approximately 270 ns per decoded ColumnChunk.
>
> The jump table extension might work like this: we have a normal Thrift footer, and immediately after the normal version field, we add a new field that signals the presence of a footer index. This is a binary field with a new high field-id. It doesn't have to contain the index itself; that can be sandwiched between the end of the Thrift footer and the end of the file. The field contains an uncompressed long that is the offset to the footer index's start. Because the field is not a varint, a writer can patch it after writing the footer etc. Readers who don't know about the index will just skip it.
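The "patch it after writing the footer" step in the quoted message works precisely because the pointer is fixed-width rather than a varint. A hypothetical Java sketch of that write path — the byte layouts here are stand-ins, not the real Thrift encoding:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the writer-side trick: reserve a fixed-width 8-byte
// slot for the index offset up front, write the rest of the footer and the
// index, then patch the slot in place once the index's position is known.
public class PatchableFooter {
    public static void main(String[] args) {
        ByteBuffer footer = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);

        footer.put((byte) 1);                 // stand-in for the Thrift version field
        int pointerSlot = footer.position();  // remember where the pointer lives
        footer.putLong(0L);                   // placeholder for the index offset

        footer.put(new byte[]{9, 9, 9});      // stand-in for the rest of the Thrift footer

        long indexOffset = footer.position(); // index is sandwiched after the footer
        footer.put(new byte[]{7, 7});         // stand-in for the index itself

        // Fixed width means an absolute overwrite works; a varint could change
        // length when patched and shift every byte after it.
        footer.putLong(pointerSlot, indexOffset);

        System.out.println(footer.getLong(pointerSlot)); // prints 12
    }
}
```

A varint-encoded offset could not be patched this way, since its byte length depends on its value; that is the design point the quoted proposal is making.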
>
> The footer index has a little header indicating its contents: whether it includes a name hash table, a column ordinals jump table etc. These are all flat, fixed-size arrays. It also lists the positions of any fields not in the index, e.g. the position of the metadata and other items the reader may want to reach by skipping past the parts that the index covers.
>
> Would love to know how Java performance compares to rust etc :D
>
> Yours hopefully,
> Will
>
> On Thu, 26 Mar 2026 at 10:38, Andrew Lamb <[email protected]> wrote:
>
>> Thank you Will,
>>
>> I think this idea is very interesting.
>>
>> > it’s entirely possible this index-only approach was evaluated and discarded early on.
>>
>> I don't know of any such discussion or evaluation. One challenge of this approach might be getting thrift-compiler generated parsers to cooperate.
>>
>> > would love to hear your thoughts on whether a lightweight index could give us the O(1) read performance we want without the file bloat we are trying to avoid.
>>
>> I personally think it sounds like a great idea -- and I suggest as a next step a prototype in one of the open source parquet readers so we can measure its performance. The arrow-rs implementation (now) has its own custom thrift decoder, so it might be relatively easy to test out such a scheme.
>>
>> Andrew
>>
>
