Hey Gang, thanks for linking the Arrow code! That functionality would be
great to have in parquet-java. Would you see it living in the parquet-avro
reader code specifically (and therefore picked up by parquet-cli), or added
to the core reader functionality in parquet-column?
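
For concreteness, here is a rough sketch of what schema-only detection
might look like, following the backward-compatibility rules in
LogicalTypes.md as I read them. Everything in it is hypothetical: `Node`,
`ListEncoding`, and `classify` are stand-ins I made up for illustration, not
parquet-java APIs, and a real version would walk `MessageType`/`GroupType`
instead. It may also be missing rules, so treat it as a starting point only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types are parquet-java APIs.
final class ListEncodingSketch {

    enum ListEncoding { THREE_LEVEL, TWO_LEVEL_LEGACY }

    // Minimal stand-in for a schema node; a real version would use
    // parquet-java's own schema types instead.
    static final class Node {
        final String name;
        final boolean isGroup;
        final List<Node> children;

        private Node(String name, boolean isGroup, List<Node> children) {
            this.name = name;
            this.isGroup = isGroup;
            this.children = children;
        }

        static Node primitive(String name) {
            return new Node(name, false, Collections.<Node>emptyList());
        }

        static Node group(String name, Node... children) {
            return new Node(name, true, Arrays.asList(children));
        }
    }

    // Classifies the single repeated child of a LIST-annotated group.
    // listName is the name of the LIST-annotated group itself.
    static ListEncoding classify(String listName, Node repeated) {
        // Repeated field is not a group: it is the element itself.
        if (!repeated.isGroup) {
            return ListEncoding.TWO_LEVEL_LEGACY;
        }
        // Repeated group with multiple fields: the group is the element.
        if (repeated.children.size() > 1) {
            return ListEncoding.TWO_LEVEL_LEGACY;
        }
        // Repeated group with one field, named "array" or "<list>_tuple":
        // the group is the element.
        if (repeated.children.size() == 1
                && (repeated.name.equals("array")
                    || repeated.name.equals(listName + "_tuple"))) {
            return ListEncoding.TWO_LEVEL_LEGACY;
        }
        // Otherwise assume the standard 3-level layout.
        return ListEncoding.THREE_LEVEL;
    }
}
```

This only needs the schema, which matches what Gang describes for
parquet-cpp. It does not cover the unannotated one-level case (e.g.
`repeated int32 my_element` outside any LIST group), which would need a
separate check before this one.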

- Claire

On Wed, Apr 1, 2026 at 10:22 PM Gang Wu <[email protected]> wrote:

> Hi Claire,
>
> I agree that supporting all "legacy" list encodings is painful and it has
> caused trouble in the past.
>
> It seems that parquet-cli mainly depends on parquet-avro, so it also requires
> settings from parquet-avro to resolve the list structure. Perhaps we can do
> something similar to what parquet-cpp currently does for list encoding
> resolution [1], which does not require extra information other than the
> MessageType.
>
> [1] https://github.com/apache/arrow/blob/976d547fba9b4bff4178e515ca8cdcb8a5db4d46/cpp/src/parquet/arrow/schema.cc#L706-L790
>
>
> Best,
> Gang
>
> On Wed, Apr 1, 2026 at 2:08 AM Claire McGinty <[email protected]>
> wrote:
>
> > Hi all,
> >
> > I wanted to bring up the topic of Parquet's supported encodings for List
> > logical types
> > <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
> >
> > Having multiple valid List encodings is becoming a pain point for my org,
> > especially since we read and write Parquet from different engines with
> > different default values (for example, Ray/pyarrow
> > <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
> > writes Parquet lists using the latest 3-level list encoding; writes from
> > Scio <https://spotify.github.io/scio/io/Parquet.html> use the default
> > parquet-avro encoding, which uses an older encoding; we even have a few
> > datasets with primitive required list types that just encode using one
> > level, e.g. `repeated int32 my_element`).
> >
> > Parquet-cli
> > <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
> > also doesn't work out of the box for all these encoding types, unless you
> > manually supply a Configuration file that specifies the encoding. Overall,
> > it's frustrating for our users reading these files to have to look up the
> > write schema, then look up the right Configuration key, then figure out how
> > to pass in that Configuration to parquet-cli or parquet-avro.
> >
> > So I'm wondering if there'd be any interest in:
> >
> >    - Contributing a public utility method (to parquet-common? Or maybe
> >    there's a better place for it) that accepts either a Parquet `MessageType`
> >    or a `Path` and detects which type of List encoding is being used. (This
> >    is probably easier said than done, but at least the backwards-compatibility
> >    rules
> >    <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
> >    are finite and clear to interpret.)
> >    - Integrating that utility method into parquet-cli/parquet-avro, as well
> >    as any other parquet formats that support Lists (e.g. magnolify-parquet
> >    <https://spotify.github.io/magnolify/parquet.html>).
> >
> > One potential corner case I can think of is that if you're manually
> > specifying your Parquet schema (rather than using an established format
> > like parquet-avro), there's nothing preventing you from mixing and matching
> > list encodings. But we could just have the utility method throw an
> > exception in that case and force the user to specify a schema explicitly.
> >
> > Thanks, and let me know what you think,
> > Claire
> >
>
