Thank you all for participating. Here are the notes: https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
Attendees:
- Micah Kornfield - Databricks - Parquet sync Seattle
- Martin Prammer - CMU - Listening in
- Ben Owad - Snowflake - Listening in
- Dewey Dunnington - Wherobots - Geo CRS question on mailing list
- Gunnar Morling - Confluent - happy to talk about Hardwood (https://github.com/hardwood-hq/hardwood), a new Java parser for Parquet (details: https://www.morling.dev/blog/hardwood-new-parser-for-apache-parquet/)
- Julien Le Dem - Datadog - Encodings
- Russell Spitzer - Snowflake - Listening in
- Pakin Paul - Listening in
- Steve Loughran - Cloudera - Topic: benchmarking Variants
- Divjot Arora - Databricks - Listening in, flatbuf
- Dusan Paripovic - SWE @ RTE - Listening in
- Gaurav Miglani - Zepto - Listening in
- Michael Chavinda - DataHaskell
- Arnav Balyan - Encodings
- Jiaying Li - Snowflake - Listening in
- Will Manning - Spiral/Vortex - Encodings
- Will Edwards - Spotify - Listening in

Notes
- Parquet meetup Seattle - May 4th
  - Uber hosting, thank you XinLi!
  - Call for proposals coming!
- CRS question on the mailing list: https://lists.apache.org/thread/r5x0do8f241bpf565rx8s5s3wc9ogp0f
  - Dewey: the currently implemented behavior is slightly different from the spec.
    - => The spec needs a clarification: two recommendations in it do not match the behavior.
    - TODO: clarify the spec and make sure implementations are aligned.
  - The CRS is a string, but there is a convention on how it is used.
  - People are encouraged to chime in on the thread linked above.
- Benchmarking Variants
  - Steve: benchmarking Variant
    - PR up in Parquet.
    - Verifying: can we generate Variant from flat or slightly nested data?
    - Numbers on Variant unshredded/shredded/Avro:
      - Shredded Parquet is slow in Iceberg/Spark.
      - TODO: send numbers to the list.
      - Need to figure out where the slowness is coming from.
    - Russell: there is no pushdown of shredded values yet.
      - For now, values are re-assembled.
      - Work in progress to add the pushdown.
  - Iceberg implementation for Parquet reading / shredding / unshredding:
    - ShreddedObject — core reassembly class: https://github.com/apache/iceberg/blob/e40c2d653793c3ec27fee5be32de38f57c0dd22e/core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java#L41-L117
    - ShreddedObjectReader — Parquet read-time reassembly (partially shredded objects): https://github.com/apache/iceberg/blob/571056929091f1e62412500045f71e5ba6ea00ad/parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantReaders.java#L255-L342
    - ShreddedVariantReader — leaf-level shredded value reassembly (value vs typed_value): https://github.com/apache/iceberg/blob/571056929091f1e62412500045f71e5ba6ea00ad/parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantReaders.java#L189-L247
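The leaf-level reassembly rule those readers implement (the value vs typed_value convention from the Parquet variant shredding spec) can be sketched as follows. This is an illustrative snippet with made-up names, not the Iceberg API:

```java
// Hedged sketch of leaf-level shredded Variant reassembly: each shredded
// leaf carries at most one of a residual binary-encoded variant ("value")
// or a strongly typed shredded column ("typed_value"). Names are
// hypothetical; see ShreddedVariantReader in Iceberg for the real thing.
import java.util.Optional;

public class ShreddedLeafSketch {
    // Toy leaf with a binary residual and one possible typed column (long).
    record Leaf(byte[] value, Long typedValue) {}

    // If typed_value is present, the shredded typed column wins; otherwise
    // fall back to the residual binary variant; if both are absent, the
    // field is missing from this object.
    static Optional<Object> reassemble(Leaf leaf) {
        if (leaf.typedValue() != null) {
            return Optional.of(leaf.typedValue());
        }
        if (leaf.value() != null) {
            return Optional.of(leaf.value());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(reassemble(new Leaf(null, 42L)).get());      // 42
        System.out.println(reassemble(new Leaf(null, null)).isEmpty()); // true
    }
}
```

The pushdown work mentioned above would skip this reassembly entirely when a predicate or projection can be answered from the typed_value column alone.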
    - VariantReaderBuilder — wires the reader tree together: https://github.com/apache/iceberg/blob/22d194f5d685fdf5bec17c6bcc92a69db4ae4957/parquet/src/main/java/org/apache/iceberg/parquet/VariantReaderBuilder.java#L131-L175
    - Variants — factory methods (object(), of(), etc.): https://github.com/apache/iceberg/blob/15351e6ab06cbfa2cbf771fe81512268c22bd75d/core/src/main/java/org/apache/iceberg/variants/Variants.java#L98-L118
- Encodings
  - ALP:
    - Initial review on format proposal: https://github.com/apache/parquet-format/pull/557
    - G-ALP?
    - Comments from Will on the doc:
      - Vectors of 1024 elements are fine.
      - 64000 elements -> binary search gets more expensive => atrocious on GPU.
      - 6 TB/s on FastLanes => much faster than arrow-cpp bit-packing.
      - FastLanes + FoR with the right patching is hard to beat on GPU.
    - Martin:
      - Being able to control for this will be valuable.
      - Parquet-testing for canonical benchmarks.
    - TODO:
      - We'll need to revisit vector size when we work on cascading encodings.
      - Need to catch up with the RAPIDS teams.
    - Thank you Will for chiming in!
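For context on the vector-size discussion: FoR-style (frame-of-reference) encodings store one reference value per vector and bit-pack the offsets from it, so the packed width is set by the per-vector range and random access only needs to locate the right vector. A hypothetical sketch (not the ALP proposal itself), assuming 1024-element vectors:

```java
// Hypothetical sketch of frame-of-reference (FoR) sizing at a fixed
// vector granularity, as in the FastLanes/ALP discussion above. Not the
// proposed Parquet encoding; names and layout are illustrative only.
public class ForSketch {
    static final int VECTOR_SIZE = 1024; // granularity discussed in the sync

    // Bits per value after FoR: the minimum is stored once per vector,
    // then each (v - min) offset is bit-packed at this width.
    static int packedBitsPerValue(long[] vec) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : vec) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return 64 - Long.numberOfLeadingZeros(max - min);
    }

    public static void main(String[] args) {
        long[] vec = new long[VECTOR_SIZE];
        for (int i = 0; i < vec.length; i++) {
            vec[i] = 1_000_000L + i; // tight range: 1024 offsets fit in 10 bits
        }
        System.out.println(packedBitsPerValue(vec)); // prints 10
    }
}
```

Small fixed vectors keep this per-vector metadata cheap to find with simple arithmetic rather than a search over many variable boundaries, which is one reading of why the 64000-element case was reported as painful on GPU.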
- FSST
  - Format proposal: https://docs.google.com/document/d/1Xg2b8HR19QnI3nhtQUDWZJhCLwJzW6y9tU1ziiLFZrM/edit?tab=t.0#heading=h.a9r0tnd6fhtq
  - Needs another round of reviews (Micah didn't get to it since the last sync).
  - Please review!
- Footer work: https://github.com/apache/parquet-format/pull/544
  - Rok has done a pass of reviews (Micah didn't get to it last sync).
  - It is close.
  - Please review!
- Hardwood
  - Gunnar: Confluent, CDC to data lakes.
  - Motivations:
    - No Hadoop dependency.
    - 1 Billion Row Challenge (down to 300 ms).
    - Taking that experience to Parquet parsing.
    - Using AI coding.
    - Thanks to the spec and tests, it is relatively easy to do.
  - Dependencies:
    - No required transitive dependencies.
    - Optional compression libraries.
  - APIs:
    - Row-based API.
    - Column reader API (lower level).
  - Performance: how to get good CPU utilization:
    - Page-level parallelism.
    - Adaptive page prefetching.
    - Cross-file prefetching.
    - Multi-file access (allows pre-fetching).
  - Complete reading implementation:
    - S3/object storage.
    - Projection pushdown.
    - Avro bindings.
  - People are contributing.
  - Read path for now; 1.0 in April.
    - 1.1 release focusing on writes?
  - All the code is reviewed, based on the spec.
  - Compatibility layer on parquet-java?
    - What's the goal? Easy to try.
  - Need an Arrow-based API?

On Wed, Mar 25, 2026 at 7:16 AM Steve Loughran <[email protected]> wrote:

> i have some variant benchmarks to share...
>
> On Wed, 25 Mar 2026 at 05:55, Julien Le Dem <[email protected]> wrote:
>
> > The next Parquet sync is tomorrow Wednesday Mar 24th at 10am PT - 1pm ET -
> > *6pm CET*
> > (because of the daylight saving time change not being on the same date in
> > US and EU, the meeting is 1h earlier than usual in CET TZ)
> >
> > To join the invite, join the group:
> > https://groups.google.com/g/apache-parquet-community-sync
> >
> > Everybody is welcome, bring your topic or just listen in.
> >
> > (Some more details on how the meeting is run:
> > https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
