gortiz commented on PR #11824: URL: https://github.com/apache/pinot/pull/11824#issuecomment-1790424446
> Column level can BE overridden by query-level config if configured I don't think we can do this without actually applying deep changes to the code. Column level config (defined in Schema and TableConfig) cannot be changed by query level config right now, as they are concepts that apply to different contexts. ## Situation right now Schema and TableConfig define whether the null vector index buffer present in a segment is read or not*. At the time the index is being read, there is no concept of query-level config. In fact there is no query context at all. When the query arrives to the server, servers prepare a `DataSource` for each column and segment that are used in the query. These `DataSource`s know about the Schema but not about the TableConfig. These `DataSource`s are then wrapped in a `ProjectionBlockValSet` which implements `BlockValSet`. These `BlockValSet` do not carry type information (neither base type, multivalue or nullability). Neither `DataSource` or `BlockValSet` do anything fancy with nulls. They just store the null bitmap and the raw segment data. For example, a column created from input `[1, null, 2] ` (assuming `IndexingConfig.isNullHandlingEnabled()` is true) will store: - A null bitmap that says docId 1 is null - A buffer/array that stores [1, -2147483648**, 1] These blocks are what `Operator`s use at runtime. `Operator`s have their own expected base type (int, long, string, etc) and whether are multivalued or not, but do not carry nullability information. But they are associated with a query context, so they decide whether to deal with nulls or not depending on `QueryContext.isNullHandlingEnabled` (which is configured by a query option). When they deal with nulls, they get the raw value array and null bitmap and merge them. Therefore `Schema` and `TableConfig` define whether the null value index is read or not and `QueryContext.isNullHandlingEnabled` defines whether the `Operator`s are going to use that null value index or not. There is no way, AFAIK, to make Operators aware of nullability defined in `Schema` and `TableConfig` and also there is no way to make IndexType aware of the `QueryContext.isNullHandlingEnabled`. So one cannot override the other. What this PR does is to change how `NullValueIndexType.ReaderFactory.createIndexReader()` behaves so by setting a column not nullable in `Schema`, the null value index won't be read. I've added a table in the PR description indicating the new semantics. \*: In fact in master right now Schema and TableConfig are ignored for null value vector. The main change in this PR is precisely to change that to do not read the null index buffer if Schema or TableConfig disable the index. \**: -2147483648 is the default null value for ints ## Alternative There is another way to implement a similar behavior, but requires a bit more code: 1. Don't change `NullValueIndexType` at all. Bitmaps will be returned even when `Schema` and `TableConfig` say the column is not nullable. It is weird but that is how V1 works right now. 2. Define the following query modes to deal with nulls: 1. NEVER_NULLABLE -> Equal to current `SET nullHandlingEnabled=false`. 2. ALWAYS_NULLABLE -> Equal to current `SET nullHandlingEnabled=true`. 3. SCHEMA_BASED -> Explained above. 3. In order to change the null mode, we can either change query option `nullHandlingEnabled` to support more than true/false or add another option. 4. Override the method in `org.apache.pinot.segment.local.segment.index.datasource.BaseDataSource.NullValueVectorReader getNullValueVector()` adding `org.apache.pinot.segment.local.segment.index.datasource.BaseDataSource.NullValueVectorReader getNullValueVector(NullMode mode)` 1. In case mode is `NEVER_NULLABLE`, always return null. 2. In case mode is `ALWAYS_NULLABLE`, always return whatever is stored in the segment (this is the current behavior). 3. In case mode is `SCHEMA_BASED`, the result will depend on the schema. Given DataSource knows the schema, it can change its behavior. 1. If column is defined as nullable: return whatever is stored in the segment 2. Else: return null Same should be done in `BlockValSet`. In `ProjectionBlockValSet` we will delegate on `DataSource.getNullValueVector` and in the others we behave as `ALWAYS_NULLABLE`. By doing so other `BlockValSet` like `TransformBlockValSet` would call `_valueBlock.getNullBitmap(mode)` and deal with the result as they please. Lately, all operators would need to call `block.getNullBitmap(NullMode)` and decide which mode to use using the `QueryContext`. This solution should work and contrary to this PR, it should not break backward compatibility in any way. But it would require deeper code modifications. I'll create a new PR with that approach so we can compare them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org