Hi all,

I wanted to bring up the topic of Parquet's supported encodings for List
logical types
<https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.

Having multiple valid List encodings is becoming a pain point for my org,
especially since we read and write Parquet from different engines with
different default encodings. For example:

   - Ray/pyarrow
   <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
   writes Parquet lists using the latest 3-level list encoding.
   - Writes from Scio <https://spotify.github.io/scio/io/Parquet.html> use
   the default parquet-avro behavior, which produces an older encoding.
   - We even have a few datasets with required lists of primitives that are
   encoded using just one level, e.g. `repeated int32 my_element`.

parquet-cli
<https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
also doesn't work out of the box for all of these encodings unless you
manually supply a Configuration file that specifies the encoding. Overall,
it's frustrating for our users reading these files to have to look up the
write schema, then look up the right Configuration key, then figure out how
to pass that Configuration in to parquet-cli or parquet-avro.

So I'm wondering if there'd be any interest in:

   - Contributing a public utility method (to parquet-common? Or maybe
   there's a better place for it) that accepts either a Parquet `MessageType`
   or a `Path` and detects which type of List encoding is being used. (This is
   probably easier said than done, but at least the backwards-compatibility
   rules
   <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
   are finite and clear to interpret.)
   - Integrating that utility method into parquet-cli/parquet-avro, as well
   as any other Parquet formats that support Lists (e.g. magnolify-parquet
   <https://spotify.github.io/magnolify/parquet.html>).
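
To make the first idea a bit more concrete, here is a rough Python sketch of
what the detection logic could look like. It is only an illustration: `Field`
and `detect_list_encoding` are hypothetical names, the schema is modeled with
a tiny stand-in class rather than parquet-java's `MessageType`, and the checks
cover only a simplified reading of the backward-compatibility rules, so the
spec should be treated as the authority.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a Parquet schema node (not a real library type)."""
    name: str
    repetition: str                                  # "required", "optional" or "repeated"
    children: Optional[Tuple["Field", ...]] = None   # None means a primitive field
    annotation: Optional[str] = None                 # e.g. "LIST"


def detect_list_encoding(field: Field) -> Optional[str]:
    """Classify one field as "one-level", "two-level" or "three-level".

    Returns None when the field does not look like a list. Assumes the caller
    only passes fields that are not already nested inside a LIST or MAP group.
    """
    # A bare repeated field with no annotation, e.g. `repeated int32 my_element`.
    if field.repetition == "repeated" and field.annotation is None:
        return "one-level"

    # Everything else must be a LIST-annotated group with exactly one child.
    if field.annotation != "LIST" or field.children is None:
        return None
    if len(field.children) != 1:
        return None

    repeated = field.children[0]
    if repeated.repetition != "repeated":
        return None

    # Repeated primitive: the repeated field itself is the element.
    if repeated.children is None:
        return "two-level"
    # Repeated group with several fields: the group itself is the element.
    if len(repeated.children) != 1:
        return "two-level"
    # Legacy naming conventions for a single-field element group.
    if repeated.name in ("array", field.name + "_tuple"):
        return "two-level"
    # Otherwise the repeated group wraps a single element field.
    return "three-level"
```

A real implementation would walk the whole schema and call something like this
on every field, including nested ones.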

One potential corner case I can think of: if you're manually specifying
your Parquet schema (rather than using an established format like
parquet-avro), there's nothing preventing you from mixing and matching
list encodings. But we could just have the utility method throw an
exception in that case and force the user to specify a schema explicitly.
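
As a sketch of that fail-fast behavior, assuming the per-column encodings have
already been detected somehow and are passed in as plain strings (the function
name and the column-to-encoding mapping are hypothetical, not an existing
API):

```python
from typing import Dict, Optional


def resolve_list_encoding(encoding_by_column: Dict[str, str]) -> Optional[str]:
    """Return the single list encoding used by a schema, or None if it has no lists.

    Raises ValueError when different columns use different encodings, so the
    caller is forced to configure things explicitly instead of guessing.
    """
    distinct = sorted(set(encoding_by_column.values()))
    if not distinct:
        return None
    if len(distinct) > 1:
        raise ValueError(
            "Schema mixes list encodings ("
            + ", ".join(distinct)
            + "); please specify the schema explicitly"
        )
    return distinct[0]
```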

Thanks, and let me know what you think,
Claire
