Re: [PR] Add nullability to FieldSpec and use it in TypeFactory [pinot]

via GitHub Thu, 02 Nov 2023 03:04:37 -0700


gortiz commented on PR #11824:
URL: https://github.com/apache/pinot/pull/11824#issuecomment-1790424446

> Column level can BE overridden by query-level config if configured

I don't think we can do this without actually applying deep changes to the
code. Column level config (defined in Schema and TableConfig) cannot be changed
by query level config right now, as they are concepts that apply to different
contexts.

## Situation right now

Schema and TableConfig define whether the null vector index buffer present
in a segment is read or not*. At the time the index is being read, there is no
concept of query-level config. In fact there is no query context at all.

When the query arrives to the server, servers prepare a `DataSource` for
each column and segment that are used in the query. These `DataSource`s know
about the Schema but not about the TableConfig. These `DataSource`s are then
wrapped in a `ProjectionBlockValSet` which implements `BlockValSet`. These
`BlockValSet` do not carry type information (neither base type, multivalue or
nullability). Neither `DataSource` or `BlockValSet` do anything fancy with
nulls. They just store the null bitmap and the raw segment data. For example, a
column created from input `[1, null, 2] ` (assuming
`IndexingConfig.isNullHandlingEnabled()` is true) will store:
- A null bitmap that says docId 1 is null
- A buffer/array that stores [1, -2147483648**, 1]

These blocks are what `Operator`s use at runtime. `Operator`s have their own
expected base type (int, long, string, etc) and whether are multivalued or not,
but do not carry nullability information. But they are associated with a query
context, so they decide whether to deal with nulls or not depending on
`QueryContext.isNullHandlingEnabled` (which is configured by a query option).
When they deal with nulls, they get the raw value array and null bitmap and
merge them.

Therefore `Schema` and `TableConfig` define whether the null value index is
read or not and `QueryContext.isNullHandlingEnabled` defines whether the
`Operator`s are going to use that null value index or not.

There is no way, AFAIK, to make Operators aware of nullability defined in
`Schema` and `TableConfig` and also there is no way to make IndexType aware of
the `QueryContext.isNullHandlingEnabled`. So one cannot override the other.

What this PR does is to change how
`NullValueIndexType.ReaderFactory.createIndexReader()` behaves so by setting a
column not nullable in `Schema`, the null value index won't be read. I've added
a table in the PR description indicating the new semantics.

\*: In fact in master right now Schema and TableConfig are ignored for null
value vector. The main change in this PR is precisely to change that to do not
read the null index buffer if Schema or TableConfig disable the index.

\**: -2147483648 is the default null value for ints

## Alternative

There is another way to implement a similar behavior, but requires a bit
more code:
1. Don't change `NullValueIndexType` at all. Bitmaps will be returned even
when `Schema` and `TableConfig` say the column is not nullable. It is weird but
that is how V1 works right now.
2. Define the following query modes to deal with nulls:
1. NEVER_NULLABLE -> Equal to current `SET nullHandlingEnabled=false`.
2. ALWAYS_NULLABLE -> Equal to current `SET nullHandlingEnabled=true`.
3. SCHEMA_BASED -> Explained above.
3. In order to change the null mode, we can either change query option
`nullHandlingEnabled` to support more than true/false or add another option.
4. Override the method in
`org.apache.pinot.segment.local.segment.index.datasource.BaseDataSource.NullValueVectorReader
getNullValueVector()` adding
`org.apache.pinot.segment.local.segment.index.datasource.BaseDataSource.NullValueVectorReader
getNullValueVector(NullMode mode)`
1. In case mode is `NEVER_NULLABLE`, always return null.
2. In case mode is `ALWAYS_NULLABLE`, always return whatever is stored in
the segment (this is the current behavior).
3. In case mode is `SCHEMA_BASED`, the result will depend on the schema.
Given DataSource knows the schema, it can change its behavior.
1. If column is defined as nullable: return whatever is stored in the
segment
2. Else: return null

Same should be done in `BlockValSet`. In `ProjectionBlockValSet` we will
delegate on `DataSource.getNullValueVector` and in the others we behave as
`ALWAYS_NULLABLE`. By doing so other `BlockValSet` like `TransformBlockValSet`
would call `_valueBlock.getNullBitmap(mode)` and deal with the result as they
please.

Lately, all operators would need to call `block.getNullBitmap(NullMode)` and
decide which mode to use using the `QueryContext`.

This solution should work and contrary to this PR, it should not break
backward compatibility in any way. But it would require deeper code
modifications.

I'll create a new PR with that approach so we can compare them.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Re: [PR] Add nullability to FieldSpec and use it in TypeFactory [pinot]

Reply via email to