Hey Gang, thanks for linking the Arrow code! That functionality would be great to have in parquet-java. Would you see it living in the parquet-avro reader code specifically (and therefore picked up by parquet-cli), or added to the core reader functionality in parquet-column?
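
To make the question concrete: wherever it ends up living, the detection itself should only need the schema. Below is a rough sketch of that logic, written in Python for brevity rather than Java. `Field`, `classify_list`, and `detect_list_encoding` are hypothetical names for illustration, not parquet-java APIs, and the rules are my paraphrase of the backward-compatibility rules in LogicalTypes.md, so please check them against the spec. MAP-annotated groups are not handled here and would need their own case.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Field:
    """Hypothetical stand-in for a Parquet schema node (not a parquet-java class)."""
    name: str
    repetition: str                           # "required", "optional", or "repeated"
    children: Optional[List["Field"]] = None  # None means a primitive field
    annotation: Optional[str] = None          # e.g. "LIST"

    @property
    def is_group(self) -> bool:
        return self.children is not None


def classify_list(node: Field) -> Optional[str]:
    """Return the list encoding used by `node`, or None if it is not a list."""
    if node.annotation != "LIST":
        # A bare repeated field outside a LIST group: required list of required elements.
        # Assumption: MAP groups are ignored in this sketch and would be misread here.
        return "one-level" if node.repetition == "repeated" else None
    if (not node.is_group or len(node.children) != 1
            or node.children[0].repetition != "repeated"):
        raise ValueError(f"malformed LIST group: {node.name}")
    repeated = node.children[0]
    if not repeated.is_group:                              # repeated primitive
        return "two-level"
    if len(repeated.children) != 1:                        # repeated multi-field group
        return "two-level"
    if repeated.children[0].repetition == "repeated":      # single repeated child
        return "two-level"
    if repeated.name in ("array", node.name + "_tuple"):   # legacy writer names
        return "two-level"
    return "three-level"                                   # standard list/element layout


def _collect(node: Field, found: Set[str]) -> None:
    kind = classify_list(node)
    if kind is not None:
        found.add(kind)
    if kind in ("two-level", "three-level"):
        # Skip the repeated wrapper so it is not counted as a bare repeated field.
        to_visit = node.children[0].children or []
    else:
        to_visit = node.children or []
    for child in to_visit:
        _collect(child, found)


def detect_list_encoding(schema: Field) -> Optional[str]:
    """Return the single list encoding used in `schema`, or None if it has no lists."""
    found: Set[str] = set()
    _collect(schema, found)
    if len(found) > 1:
        # The mix-and-match corner case: refuse to guess.
        raise ValueError(f"mixed list encodings: {sorted(found)}")
    return next(iter(found), None)


# optional group my_list (LIST) { repeated group list { optional element } }
three_level = Field("msg", "required", [
    Field("my_list", "optional",
          [Field("list", "repeated", [Field("element", "optional")])], "LIST")])
# required group my_list (LIST) { repeated array }
two_level = Field("msg", "required", [
    Field("my_list", "required", [Field("array", "repeated")], "LIST")])
# repeated int32 my_element
one_level = Field("msg", "required", [Field("my_element", "repeated")])

print(detect_list_encoding(three_level))
print(detect_list_encoding(two_level))
print(detect_list_encoding(one_level))
```

A schema that mixes encodings raises instead of returning a guess, which matches the "throw and force the user to be explicit" idea from my original email.
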

- Claire

On Wed, Apr 1, 2026 at 10:22 PM Gang Wu <[email protected]> wrote:

> Hi Claire,
>
> I agree that supporting all "legacy" list encodings is painful and it has
> caused troubles in the past.
>
> It seems that parquet-cli mainly depends on parquet-avro so it also requires
> settings from parquet-avro to resolve list structure. Perhaps we can do
> something similar to what parquet-cpp currently does for list encoding
> resolution [1], which does not require extra information other than the
> MessageType.
>
> [1]
> https://github.com/apache/arrow/blob/976d547fba9b4bff4178e515ca8cdcb8a5db4d46/cpp/src/parquet/arrow/schema.cc#L706-L790
>
> Best,
> Gang
>
> On Wed, Apr 1, 2026 at 2:08 AM Claire McGinty <[email protected]> wrote:
>
> > Hi all,
> >
> > I wanted to bring up the topic of Parquet's supported encodings for List
> > logical types
> > <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
> >
> > Having multiple valid List encodings is becoming a pain point for my org,
> > especially since we read and write Parquet from different engines with
> > different default values (for example, Ray/pyarrow
> > <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
> > writes Parquet lists using the latest 3-level list encoding; writes from
> > Scio <https://spotify.github.io/scio/io/Parquet.html> use the default
> > parquet-avro encoding, which uses an older encoding; we even have a few
> > datasets with primitive required list types that just encode using one
> > level, e.g. `repeated int32 my_element`).
> >
> > Parquet-cli
> > <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
> > also doesn't work out of the box for all these encoding types, unless you
> > manually specify a Configuration file specifying the encoding. Overall,
> > it's frustrating for our users reading these files to have to look up the
> > write schema, then look up the right Configuration key, then figure out how
> > to pass in that Configuration to parquet-cli or parquet-avro.
> >
> > So I'm wondering if there'd be any interest in:
> >
> > - Contributing a public utility method (to parquet-common? Or maybe
> >   there's a better place for it) that accepts either a Parquet `MessageType`
> >   or a `Path` and detects which type of List encoding is being used. (This is
> >   probably easier said than done, but at least the backwards-compatibility
> >   rules
> >   <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
> >   are finite and clear to interpret.)
> > - Integrating that utility method into parquet-cli/parquet-avro, as well
> >   as any other parquet formats that support Lists (i.e. magnolify-parquet
> >   <https://spotify.github.io/magnolify/parquet.html>).
> >
> > One potential corner case I can think of is that I guess if you're manually
> > specifying your Parquet schema (rather than using an established format
> > like parquet-avro), there's nothing preventing you from mixing and matching
> > list encodings. But we could just have the utility method throw an
> > exception in that case and force the user to specify a schema explicitly.
> >
> > Thanks, and let me know what you think,
> > Claire
