The community sync is starting now. The URL to join us is:
https://meet.google.com/gvu-yxxs-jvg?authuser=0


On Wed, Apr 8, 2026 at 6:32 AM Steve Loughran <[email protected]> wrote:

> I do have some (bad) news about parquet variant file read performance, but
> have my own commitments.
>
> I will put up a detailed gist covering this. For now know: shredded
> variant performance is really bad. I had hoped to talk about the
> iceberg-level issues last week, hopefully I will get space on the agenda
> next time.
>
> At the parquet level, here are some benchmarks comparing shredded and
> unshredded files. Ignore the numbers, just look at the line lengths.
>
>
>    1. graph 1: reading all the data in the variant. shredded is slower
>    2. graph 2: reading some of the columns, using the parquet schema of
>    the file. unshredded is faster
>    3. graph 3: reading that same subset of columns, but now with a "lean"
>    schema that explicitly asks for
>
>
> [image: Screenshot 2026-04-01 at 16.38.33.png]
>
>
> Schema for graph 2; the one used to create the file
>   public static final String UNSHREDDED_SCHEMA = "message vschema {"
>       + "required int64 id;"
>       + "required int32 category;"
>       + "optional group nested (VARIANT(1)) {"
>       + "  required binary metadata;"
>       + "  required binary value;"
>       + "  }"
>       + "}";
>
> Schema for graph 3, which explicitly expects the shredded values and
> declares the typed_value struct with the single shredded field "varcategory"
> which we want.
>
>   public static final String SELECT_SCHEMA = "message vschema {"
>       + "required int64 id;"
>       + "required int32 category;"
>       + "optional group nested (VARIANT(1)) {"
>       + "  required binary metadata;"
>       + "  optional binary value;"
>       + "  optional group typed_value {"
>       + "    required group varcategory {"
>       + "      optional binary value;"
>       + "      optional int32 typed_value;"
>       + "      }"
>       + "    }"
>       + "  }"
>       + "}";
>
>
> Like I said, I'll do a gist. I am now doing some profiling and should be
> able to cut out a buffer -> string -> buffer conversion sequence which
> takes place, simply by having VariantBuilder add a package-private operation:
>
>   void appendAsString(Binary binary) {
>     onAppend();
>     writeUTF8bytes(binary.getBytesUnsafe());
>   }
>
> The current conversion, spread across two methods, is effectively
>   binary.toStringUsingUTF8().getBytes(StandardCharsets.UTF_8);
> This shows up on the profile flamegraphs because of the memory operations.
> Assuming strings are common in variants, this should help.
>
> It'd be interesting to know
>
>    1. the structure of variants people are currently storing
>    2. any queries which are being made of their contents, both filtering
>    and projection.
>
>
>
> On Tue, 7 Apr 2026 at 21:53, Julien Le Dem <[email protected]> wrote:
>
>> Thank you!
>>
>> On Tue, Apr 7, 2026 at 12:23 PM Andrew Lamb <[email protected]>
>> wrote:
>>
>> > I can help facilitate the meeting tomorrow.
>> >
>> > On Tue, Apr 7, 2026 at 3:13 PM Julien Le Dem <[email protected]> wrote:
>> >
>> > > Please reply by end of day to volunteer to facilitate the meeting tomorrow.
>> > > Otherwise, I'll cancel it.
>> > >
>> > > On Mon, Apr 6, 2026 at 8:55 AM Julien Le Dem <[email protected]> wrote:
>> > >
>> > > > Hello all,
>> > > > The next Parquet sync on Wednesday is conflicting with the Iceberg summit.
>> > > > (10am PT - 1pm ET - 7pm CET)
>> > > > I will not be able to facilitate the meeting and I suspect some of the
>> > > > regular attendees will be at the conference.
>> > > > Is there a volunteer to facilitate the meeting? (basically, just some time
>> > > > management and making sure notes are taken)
>> > > > Otherwise, we can also skip this one and reconvene in 2 weeks.
>> > > > Best,
>> > > > Julien
>> > > >
>> > >
>> >
>>
>
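
For anyone following the buffer -> string -> buffer point in the quoted message, here is a minimal JDK-only sketch of why that round trip can be skipped. It does not use Parquet's Binary or VariantBuilder; the class and method names (Utf8RoundTrip, viaString, direct) are hypothetical and exist only for illustration. It assumes the input bytes are already valid UTF-8, in which case decoding to a String and re-encoding yields the same bytes as a plain copy, at the cost of an extra String and byte[] allocation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of parquet-java.
class Utf8RoundTrip {

  // Round trip: decode the buffer to a String, then re-encode it.
  // Allocates an intermediate String and a second byte array.
  static byte[] viaString(ByteBuffer utf8) {
    String decoded = StandardCharsets.UTF_8.decode(utf8.duplicate()).toString();
    return decoded.getBytes(StandardCharsets.UTF_8);
  }

  // Direct path: copy the bytes out without decoding them.
  // Assumes the buffer already holds valid UTF-8.
  static byte[] direct(ByteBuffer utf8) {
    ByteBuffer view = utf8.duplicate(); // leave the caller's position untouched
    byte[] out = new byte[view.remaining()];
    view.get(out);
    return out;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.wrap("caf\u00e9".getBytes(StandardCharsets.UTF_8));
    // Both paths produce identical bytes for valid UTF-8 input.
    System.out.println(Arrays.equals(viaString(buf), direct(buf)));
  }
}
```

One caveat with this sketch: if the input is not valid UTF-8, the decode step substitutes replacement characters, so the two paths would differ. Whether that matters for variant string values is a question for the actual VariantBuilder change, not something this example settles.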
