gortiz commented on PR #11824:
URL: https://github.com/apache/pinot/pull/11824#issuecomment-1790424446

   > Column level can BE overridden by query-level config if configured
   
   I don't think we can do this without actually applying deep changes to the 
code. Column level config (defined in Schema and TableConfig) cannot be changed 
by query level config right now, as they are concepts that apply to different 
contexts.
   
   ## Situation right now
   
   Schema and TableConfig define whether the null vector index buffer present 
in a segment is read or not*. At the time the index is being read, there is no 
concept of query-level config. In fact there is no query context at all.
   
   When the query arrives to the server, servers prepare a `DataSource` for 
each column and segment that are used in the query. These `DataSource`s know 
about the Schema but not about the TableConfig. These `DataSource`s are then 
wrapped in a `ProjectionBlockValSet` which implements `BlockValSet`. These 
`BlockValSet` do not carry type information (neither base type, multivalue or 
nullability). Neither `DataSource` or `BlockValSet` do anything fancy with 
nulls. They just store the null bitmap and the raw segment data. For example, a 
column created from input `[1, null, 2] ` (assuming 
`IndexingConfig.isNullHandlingEnabled()` is true) will store:
   - A null bitmap that says docId 1 is null
   - A buffer/array that stores [1, -2147483648**, 1]
   
   These blocks are what `Operator`s use at runtime. `Operator`s have their own 
expected base type (int, long, string, etc) and whether are multivalued or not, 
but do not carry nullability information. But they are associated with a query 
context, so they decide whether to deal with nulls or not depending on 
`QueryContext.isNullHandlingEnabled` (which is configured by a query option). 
When they deal with nulls, they get the raw value array and null bitmap and 
merge them.
   
   Therefore `Schema` and `TableConfig` define whether the null value index is 
read or not and `QueryContext.isNullHandlingEnabled` defines whether the 
`Operator`s are going to use that null value index or not.
   
   There is no way, AFAIK, to make Operators aware of nullability defined in 
`Schema` and `TableConfig` and also there is no way to make IndexType aware of 
the `QueryContext.isNullHandlingEnabled`. So one cannot override the other.
   
   What this PR does is to change how 
`NullValueIndexType.ReaderFactory.createIndexReader()` behaves so by setting a 
column not nullable in `Schema`, the null value index won't be read. I've added 
a table in the PR description indicating the new semantics.
   
   \*: In fact in master right now Schema and TableConfig are ignored for null 
value vector. The main change in this PR is precisely to change that to do not 
read the null index buffer if Schema or TableConfig disable the index.
   
   \**: -2147483648 is the default null value for ints
   
   ## Alternative 
   
   There is another way to implement a similar behavior, but requires a bit 
more code:
   1. Don't change `NullValueIndexType` at all. Bitmaps will be returned even 
when `Schema` and `TableConfig` say the column is not nullable. It is weird but 
that is how V1 works right now.
   2. Define the following query modes to deal with nulls:
      1. NEVER_NULLABLE -> Equal to current `SET nullHandlingEnabled=false`.
      2. ALWAYS_NULLABLE -> Equal to current `SET nullHandlingEnabled=true`.
      3. SCHEMA_BASED -> Explained above.
   3. In order to change the null mode, we can either change query option 
`nullHandlingEnabled` to support more than true/false or add another option.
   4. Override the method in 
`org.apache.pinot.segment.local.segment.index.datasource.BaseDataSource.NullValueVectorReader
 getNullValueVector()` adding 
`org.apache.pinot.segment.local.segment.index.datasource.BaseDataSource.NullValueVectorReader
 getNullValueVector(NullMode mode)`
      1. In case mode is `NEVER_NULLABLE`, always return null.
      2. In case mode is `ALWAYS_NULLABLE`, always return whatever is stored in 
the segment (this is the current behavior).
      3. In case mode is `SCHEMA_BASED`, the result will depend on the schema. 
Given DataSource knows the schema, it can change its behavior.
          1. If column is defined as nullable: return whatever is stored in the 
segment
          2. Else: return null
   
   Same should be done in `BlockValSet`. In `ProjectionBlockValSet` we will 
delegate on `DataSource.getNullValueVector` and in the others we behave as 
`ALWAYS_NULLABLE`. By doing so other `BlockValSet` like `TransformBlockValSet` 
would call `_valueBlock.getNullBitmap(mode)` and deal with the result as they 
please.
   
   Lately, all operators would need to call `block.getNullBitmap(NullMode)` and 
decide which mode to use using the `QueryContext`.
   
   This solution should work and contrary to this PR, it should not break 
backward compatibility in any way. But it would require deeper code 
modifications.
   
   I'll create a new PR with that approach so we can compare them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to