Thank you all for participating. Here are the notes: https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
Attendees:
- Micah Kornfield - Databricks - Parquet sync Seattle
- Martin Prammer - CMU - Listening in
- Ben Owad - Snowflake - Listening in
- Dewey Dunnington - Wherobots - Geo CRS question on mailing list
- Gunnar Morling - Confluent - happy to talk about Hardwood (https://github.com/hardwood-hq/hardwood), a new Java parser for Parquet (details: https://www.morling.dev/blog/hardwood-new-parser-for-apache-parquet/)
- Julien Le Dem - Datadog - Encodings
- Russell Spitzer - Snowflake - Listening in
- Pakin Paul - Listening in
- Steve Loughran - Cloudera - Topic: benchmarking Variants
- Divjot Arora - Databricks - Listening in, flatbuf
- Dusan Paripovic - SWE @ RTE - Listening in
- Gaurav Miglani - Zepto - Listening in
- Michael Chavinda - DataHaskell
- Arnav Balyan - Encodings
- Jiaying Li - Snowflake - Listening in
- Will Manning - Spiral/Vortex - Encodings
- Will Edwards - Spotify - Listening in

Notes
- Parquet meetup Seattle - May 4th
  - Uber hosting, thank you XinLi!
  - Call for proposals coming!
- CRS question on the mailing list: https://lists.apache.org/thread/r5x0do8f241bpf565rx8s5s3wc9ogp0f
  - Dewey: the currently implemented behavior is slightly different from the spec.
    - => The spec needs a clarification: two recommendations in it do not match the behavior.
    - TODO: clarify the spec and make sure implementations are aligned.
  - The CRS is a string, but there is a convention on how it is used.
  - People are encouraged to chime in on the thread linked above.
- Benchmarking Variants
  - Steve: benchmarking Variant
    - PR up in Parquet.
    - Verifying: can we generate Variant from flat or slightly nested data?
    - Numbers on Variant unshredded/shredded/Avro:
      - Shredded Parquet is slow in Iceberg/Spark.
      - TODO: send numbers to the list.
      - Need to figure out where the slowness is coming from.
    - Russell: there is no pushdown of shredded values yet.
      - For now, values are re-assembled.
      - Work in progress to add the pushdown.
  - Iceberg implementation for Parquet reading / shredding / unshredding:
    - ShreddedObject — core reassembly class: https://github.com/apache/iceberg/blob/e40c2d653793c3ec27fee5be32de38f57c0dd22e/core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java#L41-L117
    - ShreddedObjectReader — Parquet read-time reassembly (partially shredded objects): https://github.com/apache/iceberg/blob/571056929091f1e62412500045f71e5ba6ea00ad/parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantReaders.java#L255-L342
    - ShreddedVariantReader — leaf-level shredded value reassembly (value vs typed_value): https://github.com/apache/iceberg/blob/571056929091f1e62412500045f71e5ba6ea00ad/parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantReaders.java#L189-L247
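The leaf-level reassembly rule those readers implement (the value vs typed_value convention from the Parquet variant shredding spec) can be sketched as follows. This is an illustrative snippet with made-up names, not the Iceberg API:

```java
// Hedged sketch of leaf-level shredded Variant reassembly: each shredded
// leaf carries at most one of a residual binary-encoded variant ("value")
// or a strongly typed shredded column ("typed_value"). Names are
// hypothetical; see ShreddedVariantReader in Iceberg for the real thing.
import java.util.Optional;

public class ShreddedLeafSketch {
    // Toy leaf with a binary residual and one possible typed column (long).
    record Leaf(byte[] value, Long typedValue) {}

    // If typed_value is present, the shredded typed column wins; otherwise
    // fall back to the residual binary variant; if both are absent, the
    // field is missing from this object.
    static Optional<Object> reassemble(Leaf leaf) {
        if (leaf.typedValue() != null) {
            return Optional.of(leaf.typedValue());
        }
        if (leaf.value() != null) {
            return Optional.of(leaf.value());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(reassemble(new Leaf(null, 42L)).get());      // 42
        System.out.println(reassemble(new Leaf(null, null)).isEmpty()); // true
    }
}
```

The pushdown work mentioned above would skip this reassembly entirely when a predicate or projection can be answered from the typed_value column alone.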
    - VariantReaderBuilder — wires the reader tree together: https://github.com/apache/iceberg/blob/22d194f5d685fdf5bec17c6bcc92a69db4ae4957/parquet/src/main/java/org/apache/iceberg/parquet/VariantReaderBuilder.java#L131-L175
    - Variants — factory methods (object(), of(), etc.): https://github.com/apache/iceberg/blob/15351e6ab06cbfa2cbf771fe81512268c22bd75d/core/src/main/java/org/apache/iceberg/variants/Variants.java#L98-L118
- Encodings
  - ALP:
    - Initial review on format proposal: https://github.com/apache/parquet-format/pull/557
    - G-ALP?
    - Comments from Will on the doc:
      - Vectors of 1024 elements are fine.
      - 64000 elements -> binary search gets more expensive => atrocious on GPU.
      - 6 TB/s on FastLanes => much faster than arrow-cpp bit-packing.
      - FastLanes + FoR with the right patching is hard to beat on GPU.
    - Martin:
      - Being able to control for this will be valuable.
      - Parquet-testing for canonical benchmarks.
    - TODO:
      - We'll need to revisit vector size when we work on cascading encodings.
      - Need to catch up with the RAPIDS teams.
    - Thank you Will for chiming in!
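For context on the vector-size discussion: FoR-style (frame-of-reference) encodings store one reference value per vector and bit-pack the offsets from it, so the packed width is set by the per-vector range and random access only needs to locate the right vector. A hypothetical sketch (not the ALP proposal itself), assuming 1024-element vectors:

```java
// Hypothetical sketch of frame-of-reference (FoR) sizing at a fixed
// vector granularity, as in the FastLanes/ALP discussion above. Not the
// proposed Parquet encoding; names and layout are illustrative only.
public class ForSketch {
    static final int VECTOR_SIZE = 1024; // granularity discussed in the sync

    // Bits per value after FoR: the minimum is stored once per vector,
    // then each (v - min) offset is bit-packed at this width.
    static int packedBitsPerValue(long[] vec) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : vec) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return 64 - Long.numberOfLeadingZeros(max - min);
    }

    public static void main(String[] args) {
        long[] vec = new long[VECTOR_SIZE];
        for (int i = 0; i < vec.length; i++) {
            vec[i] = 1_000_000L + i; // tight range: 1024 offsets fit in 10 bits
        }
        System.out.println(packedBitsPerValue(vec)); // prints 10
    }
}
```

Small fixed vectors keep this per-vector metadata cheap to find with simple arithmetic rather than a search over many variable boundaries, which is one reading of why the 64000-element case was reported as painful on GPU.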
- FSST
  - Format proposal: https://docs.google.com/document/d/1Xg2b8HR19QnI3nhtQUDWZJhCLwJzW6y9tU1ziiLFZrM/edit?tab=t.0#heading=h.a9r0tnd6fhtq
  - Needs another round of reviews (Micah didn't get to it since the last sync).
  - Please review!
- Footer work: https://github.com/apache/parquet-format/pull/544
  - Rok has done a pass of reviews (Micah didn't get to it last sync).
  - It is close.
  - Please review!
- Hardwood
  - Gunnar: Confluent, CDC to data lakes.
  - Motivations:
    - No Hadoop dependency.
    - 1 Billion Row Challenge (down to 300 ms).
    - Taking that experience to Parquet parsing.
    - Using AI coding.
    - Thanks to the spec and tests, it is relatively easy to do.
  - Dependencies:
    - No required transitive dependencies.
    - Optional compression libraries.
  - APIs:
    - Row-based API.
    - Column reader API (lower level).
  - Performance: how to get good CPU utilization:
    - Page-level parallelism.
    - Adaptive page prefetching.
    - Cross-file prefetching.
    - Multi-file access (allows pre-fetching).
  - Complete reading implementation:
    - S3/object storage.
    - Projection pushdown.
    - Avro bindings.
  - People are contributing.
  - Read path for now; 1.0 in April.
    - 1.1 release focusing on writes?
  - All the code is reviewed, based on the spec.
  - Compatibility layer on parquet-java?
    - What's the goal? Easy to try.
  - Need an Arrow-based API?

On Wed, Mar 25, 2026 at 7:16 AM Steve Loughran <[email protected]> wrote:

> i have some variant benchmarks to share...
>
> On Wed, 25 Mar 2026 at 05:55, Julien Le Dem <[email protected]> wrote:
>
> > The next Parquet sync is tomorrow Wednesday Mar 24th at 10am PT - 1pm ET -
> > *6pm CET*
> > (because of the daylight saving time change not being on the same date in
> > US and EU, the meeting is 1h earlier than usual in CET TZ)
> >
> > To join the invite, join the group:
> > https://groups.google.com/g/apache-parquet-community-sync
> >
> > Everybody is welcome, bring your topic or just listen in.
> >
> > (Some more details on how the meeting is run:
> > https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
