The community sync is starting now. The URL to join us is: https://meet.google.com/gvu-yxxs-jvg?authuser=0
On Wed, Apr 8, 2026 at 6:32 AM Steve Loughran <[email protected]> wrote:

> I do have some (bad) news about parquet variant file read performance, but
> have my own commitments.
>
> I will put up a detailed gist covering this. For now know: shredded
> variant performance is really bad. I had hoped to talk about the
> iceberg-level issues last week; hopefully I will get space on the agenda
> next time.
>
> At the parquet level, here are some benchmarks comparing shredded and
> unshredded files. Ignore the numbers, just look at the line lengths.
>
>    1. graph 1: reading all the data in the variant. shredded is slower
>    2. graph 2: reading some of the columns, using the parquet schema of
>    the file. unshredded is faster
>    3. graph 3: reading that same subset of columns, but now with a "lean"
>    schema that explicitly asks for
>
> [image: Screenshot 2026-04-01 at 16.38.33.png]
>
> Schema for graph 2; the one used to create the file
>
>   public static final String UNSHREDDED_SCHEMA = "message vschema {"
>       + "required int64 id;"
>       + "required int32 category;"
>       + "optional group nested (VARIANT(1)) {"
>       + "  required binary metadata;"
>       + "  required binary value;"
>       + "  }"
>       + "}";
>
> Schema for graph 3, which explicitly expects the shredded values and
> declares the typed_value struct with the single shredded field "varcolumn"
> which we want.
>
>   public static final String SELECT_SCHEMA = "message vschema {"
>       + "required int64 id;"
>       + "required int32 category;"
>       + "optional group nested (VARIANT(1)) {"
>       + "  required binary metadata;"
>       + "  optional binary value;"
>       + "  optional group typed_value {"
>       + "    required group varcategory {"
>       + "      optional binary value;"
>       + "      optional int32 typed_value;"
>       + "      }"
>       + "    }"
>       + "  }"
>       + "}";
>
> Like I said, I'll do a gist.
> I am now doing some profiling and should be able to cut out a
> buffer -> string -> buffer conversion sequence which takes place, simply
> by having VariantBuilder add a package-private operation
>
>   void appendAsString(Binary binary) {
>     onAppend();
>     writeUTF8bytes(binary.getBytesUnsafe());
>   }
>
> The current conversion, spread across two methods, is effectively
>   binary.toStringUsingUTF8().getBytes(StandardCharsets.UTF_8);
> This shows up on the profile flamegraphs because of the memory operations.
> Assuming strings are common in variants, this should help.
>
> It'd be interesting to know
>
>    1. the structure of variants people are currently storing
>    2. any queries which are being made of their contents, both filtering
>    and projection.
>
> On Tue, 7 Apr 2026 at 21:53, Julien Le Dem <[email protected]> wrote:
>
>> Thank you!
>>
>> On Tue, Apr 7, 2026 at 12:23 PM Andrew Lamb <[email protected]> wrote:
>>
>> > I can help facilitate the meeting tomorrow.
>> >
>> > On Tue, Apr 7, 2026 at 3:13 PM Julien Le Dem <[email protected]> wrote:
>> >
>> > > Please reply by end of day to volunteer to facilitate the meeting
>> > > tomorrow. Otherwise, I'll cancel it.
>> > >
>> > > On Mon, Apr 6, 2026 at 8:55 AM Julien Le Dem <[email protected]> wrote:
>> > >
>> > > > Hello all,
>> > > > The next Parquet sync on Wednesday is conflicting with the Iceberg
>> > > > summit. (10am PT - 1pm ET - 7pm CET)
>> > > > I will not be able to facilitate the meeting and I suspect some of
>> > > > the regular attendees will be at the conference.
>> > > > Is there a volunteer to facilitate the meeting? (basically, just
>> > > > some time management and making sure notes are taken)
>> > > > Otherwise, we can also skip this one and reconvene in 2 weeks.
>> > > > Best,
>> > > > Julien
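
For anyone following the buffer -> string -> buffer point in the message above, here is a minimal JDK-only sketch of that round trip. It does not use parquet-java: the class Utf8RoundTrip and its methods viaString and direct are hypothetical stand-ins for the existing toStringUsingUTF8().getBytes(...) path and for passing the bytes straight through, as the proposed appendAsString would. It only illustrates the extra allocations, not the actual VariantBuilder code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of parquet-java.
final class Utf8RoundTrip {

    // Stand-in for the current path: decode the UTF-8 bytes to a String,
    // then encode the String back to UTF-8. Allocates a String and a new array.
    static byte[] viaString(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for the proposed path: hand the original bytes on untouched.
    static byte[] direct(byte[] utf8) {
        return utf8;
    }

    public static void main(String[] args) {
        byte[] valid = "shredded variant \u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println("valid input, same bytes: "
            + Arrays.equals(valid, viaString(valid)));
        System.out.println("valid input, new array allocated: "
            + (viaString(valid) != valid));

        // A lone 0xFF byte is not valid UTF-8, so decoding substitutes U+FFFD.
        byte[] malformed = {(byte) 0xFF};
        System.out.println("malformed input, same bytes: "
            + Arrays.equals(malformed, viaString(malformed)));
    }
}
```

One thing to keep in mind: for valid UTF-8 the two paths yield identical bytes, but the String round trip replaces malformed sequences with U+FFFD while the direct path copies them as-is. Whether that difference matters depends on whether the binary column is guaranteed to hold valid UTF-8.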
