Hi all,

I wanted to bring up the topic of Parquet's supported encodings for List logical types <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.

Having multiple valid List encodings is becoming a pain point for my org, especially since we read and write Parquet from different engines with different defaults. For example, Ray/pyarrow <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html> writes Parquet lists using the latest 3-level list encoding; writes from Scio <https://spotify.github.io/scio/io/Parquet.html> use the default parquet-avro encoding, which is an older encoding; and we even have a few datasets with primitive required list types that are encoded using just one level, e.g. `repeated int32 my_element`.

Parquet-cli <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md> also doesn't work out of the box for all of these encodings unless you manually pass a Configuration file that specifies the encoding. Overall, it's frustrating for our users reading these files to have to look up the write schema, then look up the right Configuration key, then figure out how to pass that Configuration to parquet-cli or parquet-avro.

So I'm wondering if there'd be any interest in:

- Contributing a public utility method (to parquet-common? Or maybe there's a better place for it) that accepts either a Parquet `MessageType` or a `Path` and detects which List encoding is being used. (This is probably easier said than done, but at least the backwards-compatibility rules <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules> are finite and clear to interpret.)
- Integrating that utility method into parquet-cli/parquet-avro, as well as any other Parquet formats that support Lists (e.g. magnolify-parquet <https://spotify.github.io/magnolify/parquet.html>).

One potential corner case I can think of: if you're manually specifying your Parquet schema (rather than using an established format like parquet-avro), there's nothing preventing you from mixing and matching list encodings.
But we could just have the utility method throw an exception in that case and force the user to specify a schema explicitly.

Thanks, and let me know what you think,
Claire
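
P.S. To make the detection idea a bit more concrete, here is a rough Python sketch of the kind of logic I have in mind. It is not parquet-java code: `SchemaNode`, `classify_list`, `detect_list_encoding`, and the labels `"one_level"`, `"two_level"`, and `"three_level"` are all hypothetical names made up for illustration, and the rules are my paraphrase of the backward-compatibility rules linked above, so please check them against the current spec (newer revisions may add cases). It also shows the "throw if encodings are mixed" behavior for the corner case.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class SchemaNode:
    """Hypothetical, minimal stand-in for a Parquet schema field."""
    name: str
    repetition: str                               # "required", "optional", "repeated"
    children: Optional[List["SchemaNode"]] = None  # None means a primitive field
    annotation: Optional[str] = None               # e.g. "LIST" or "MAP"

    @property
    def is_group(self) -> bool:
        return self.children is not None


def classify_list(node: SchemaNode) -> str:
    """Classify a LIST-annotated group as "two_level" or "three_level"."""
    if node.annotation != "LIST" or not node.is_group or len(node.children) != 1:
        raise ValueError(f"{node.name}: not a well-formed LIST group")
    repeated = node.children[0]
    if repeated.repetition != "repeated":
        raise ValueError(f"{node.name}: LIST child must be repeated")
    # Rule: a non-group repeated field is itself the element.
    if not repeated.is_group:
        return "two_level"
    # Rule: a repeated group with multiple fields is itself the element.
    if len(repeated.children) > 1:
        return "two_level"
    # Rule: a single-field group named `array` or `<list name>_tuple` is the element.
    if repeated.name in ("array", node.name + "_tuple"):
        return "two_level"
    # Otherwise the repeated group only wraps the element (standard layout).
    return "three_level"


def _walk(node: SchemaNode, in_annotated_parent: bool, found: Set[str]) -> None:
    if node.annotation == "LIST":
        found.add(classify_list(node))
    elif node.repetition == "repeated" and not in_annotated_parent:
        # Bare repeated field outside any LIST/MAP group: one-level list.
        found.add("one_level")
    is_container = node.annotation in ("LIST", "MAP")
    for child in node.children or []:
        _walk(child, is_container, found)


def detect_list_encoding(fields: List[SchemaNode]) -> Optional[str]:
    """Return the single list encoding used in a schema, or None if it has no lists.

    Raises ValueError when encodings are mixed, so the caller has to be explicit.
    """
    found: Set[str] = set()
    for f in fields:
        _walk(f, False, found)
    if len(found) > 1:
        raise ValueError(f"mixed list encodings: {sorted(found)}")
    return next(iter(found), None)


# Example schemas mirroring the three cases mentioned above.
three_level = SchemaNode("my_list", "optional", [
    SchemaNode("list", "repeated", [SchemaNode("element", "optional")])
], "LIST")
two_level = SchemaNode("my_list", "required", [SchemaNode("array", "repeated")], "LIST")
one_level = SchemaNode("my_element", "repeated")
```

Calling `detect_list_encoding([three_level])` here yields `"three_level"`, while `detect_list_encoding([three_level, one_level])` raises. A real version would presumably walk a `MessageType` instead of this toy class, and parquet-cli/parquet-avro could use the result to pick the matching Configuration automatically.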
