[GitHub] [pinot] gortiz opened a new issue, #9081: Ingest parquet: Change behavior when ingesting columns of invalid type

GitBox Thu, 21 Jul 2022 00:15:03 -0700


gortiz opened a new issue, #9081:
URL: https://github.com/apache/pinot/issues/9081

TL;DR: There is an inconsistency in Pinot when a table defines a column as a
string but it is populated reading a parquet file whose column type is binary
without the string annotation.

Introduction:

I'm trying to improve the current Pinot implementation of
[ClickHouse/ClickBench](https://github.com/ClickHouse/ClickBench). In that
benchmark the input data can be read from csvs, tsvs, json or parquet files. As
usual, Pinot requires to split the data before ingesting. The current
implementation reads it from a tsv, but it seems easier and faster to read it
from parquet. Even better, ClickHouse already provides a version that is
already split. But the metadata on the parquet file is not correct. String
columns are marked as binaries without [String
annotation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string).

The problem:

When Pinot tries to populate a String column from a parquet file, it reads
the values from the parquet files, applies some conversions that return an
Object and applies a stringify conversion to that Object (like calling
`Object.toString`). But the value that is read from parquet is depends on the
metadata associated with the parquet column. In parquet String columns are
Binary columns with an annotation that marks them as a String. Other
annotations can be applied to mark the column as an enum or UUID.

Pinot uses these annotations to decide how to read the value, without
knowing the type of the Pinot column where the data will be stored. The code
that does that is
[here](https://github.com/apache/pinot/blob/6903856d399b4d7b09548a75a888382e1659e9e3/pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractor.java).
In case the binary column is not annotated with the String annotation, it is
read as a binary (or some other variants, which are not important here). When
the value is read as a String, the Java type will be String. When the value is
read as a binary, the Java type will be `byte[]`.

Once the value is extracted, Pinot stringifies the value before storing it.
If the parquet column was annotated with String, the value read is a String and
therefore the conversion is a noop. But if the value wasn't annotated with
String, the value read is a byte[] and therefore the conversion returns a
description of the bytes. I didn't find where the conversion is applied, but
the result is that Pinot stores a String that is not the bytes read as UTF-8.

This is problematic because there is no clear indicator that the import is
going to change the data and there is no failure. The user may think that
everything is ok and after ingesting terabytes of data he/she may discover that
the data stored is not the same that the one he/she wanted to store.

It is possible that this problem is replicated with other input formats and
other column types.

Proposed solutions:
1. Add some checks to the import process that verifies that the extracted
Java types matches with the expected column type. In this case this would
produce a failfast error that should be clear to the user, which can take
proper adjustments (like adding the corresponding annotation to parquet).
2. Add a property to
`org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReaderConfig` that
let the user to customize how columns should be read.

IMO the error should be mandatory and the ability to _cast_ the columns
should be optional.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [pinot] gortiz opened a new issue, #9081: Ingest parquet: Change behavior when ingesting columns of invalid type

Reply via email to