gortiz opened a new issue, #10183:
URL: https://github.com/apache/pinot/issues/10183

   ## Why
   
   The objective of this proposal is to add the ability to include new index 
types at runtime in Apache Pinot. This opens the ability of adding third party 
indexes, including proprietary indexes. When we talk about index types in this 
proposal, we are focused on single field index. Constructions like StarTree 
indexes or future indexes that index more than one column are deliberately 
ignored.
   
   ## The plan
   
   All ideas presented here have been already drafted in this branch [pending 
to be created], where we have been exploring the problem in order to find the 
solution we propose here. Given the large number of changed in that branch, we 
think that it is better to use that branch as a reference, but split the 
changes in different PRs. Right now the idea is to create partial PRs in the 
following order:
   - [ ] Add the `IndexType` without modifying already existing code. Include 
few example implementations.
   - [ ] Add the `IndexService` without modifying already existing code.
   - [ ] Partially implement `IndexType` for each existent index type without 
modifying already existing code. Methods that define how to create 
`IndexCreator`s, `IndexReader`s, `IndexHandler`s, etc are not going to be 
implemented here.
   - [ ] Refactor existing code to use `IndexType` whenever `ColumnIdexType` is 
being used right now.
   - [ ] Transform old configs (including usage of `IndexLoadingConfig` and 
`SegmentGeneratorConfig`) 
   - [ ] Refactor each `IndexCreator` to have a common interface.
   - [ ] Remove `IndexCreatorProvider` and `IndexReaderProvider` interfaces and 
implementations, moving the code to each `IndexType`. Here classes which 
delegate on these providers (like `PhysicalColumnIndexContainer`) will be 
modified to call the equivalent methods on IndexType.
   - [ ] Refactor classes like `PhysicalColumnIndexContainer` or 
`SegmentColumnarIndexCreator` to call just iterate over the `IndexType`s 
returned by `IndexService` instead of explicitly ask for each one
   
   ## Current blockers
   
   1. Index types are treated as enums, which are, by definition, decided at 
compile time. For example, `ColumnIndexType`.
   4. Index configuration is specified in `TableConfig`, specially in 
`FieldConfig` and `IndexingConfig`. These classes are POJOs with one or more 
attributes per index type. Therefore to add a new index it is needed to add 
attributes and recompile the class.
   5. There is at least one IndexCreator per index type, but each IndexCreator 
is different to the other. They do not fulfill a common contract. The same 
happen with IndexReader. Ideally IndexCreators and IndexReaders should be 
closer to IndexHandlers, which do follow the same contract for each index type.
   6. IndexCreators are created using different `XIndexCreatorProvider`, where 
X is the index type (like `ForwardIndexCreatorProvider`) but:
       1. Each provider has its own methods with its own arguments that derive 
from `IndexCreationContext`.
       2. These `XIndexCreatorProvider`s do not have a common interface.
       3. There is a `IndexCreatorProvider` interface, but it extends all other 
`XIndexCreatorProvider`. That pattern goes against the extensibility. For 
example, in case a new index type called `Test` was added, we would need to 
modify `IndexCreatorProvider` to also extend `TextIndexCreatorProvider` and 
implement its method in all implementations, which obviously requires to 
recompile the project.
       4. Same happens with `IndexReaderProvider`
       7. Same happens with `MutableIndexProvider`
   5. Indexes are being used when `PlanNode`s are translated into physical 
operators. Right now this transformation is totally static. In case we want to 
add a new index we need to modify methods like 
`FilterPlanNode.constructPhysicalOperator` or 
`FilterOperatorUtils.getLeafFilterOperator` in order to add new transformation 
rules, which are eagerly executed.
   
   ## The IndexService and the IndexType class
   This proposal is evolves around the idea that indexes should be obtained 
from an `IndexService`. There should be a single instance on every Pinot 
component that uses indexes (like Pinot Servers). Optionally this singleton can 
be substituted by a bunch of static methods in the `IndexService` class.
   
   This `IndexService` will contain a map from a String _id_ to values of 
`IndexType`, which has to be a new interface that completely identifies an 
index. The `IndexType` interface will have methods that define the index id, 
how to configure the index given a TableConfig and how to obtain its 
`IndexCreator`, `IndexReader`, `IndexHandler` and optionally its 
`MutableIndex`. These last methods will require information from the 
environment. Specially they will require some contextual information (like 
statistics for the column) and some configuration. There should be a single 
`IndexType` instance for each index type in Pinot (ie one for forward indexes, 
one for range indexes, etc).
   
   Pinot will have a new configuration to specify which indexes should be 
available and at startup Pinot will look for the corresponding `IndexType` in 
the classpath and add it to the `IndexService` singleton. After that, whenever 
Pinot requires to use an index Pinot will use the `IndexService` to get the 
`IndexType` instance. To look for all indexes in the classpath, the draft 
implementation uses an intermediate `IndexPlugin` interface that is discovered 
using Java Service Providers.
   
   ## The index configuration
   As said above, current way to configure is very static, requiring to add new 
attributes at compile time to the POJOs we have. At the same time, indexes are 
configured in different ways depending on their requirements and when it was 
created. 
   
   Right now, index configuration is spread across different fields in the 
`TableConfig`, which has two key resposibilities:
   1. Being the class that is being used to define how the JSON is written by 
the users
   2. Being the class that devs use to create the readers and creators.
    
   Although it is not impossible to correctly implement both things in the same 
class, we think that it would be better to clearly differentiate between the 
concrete syntax used by the user to write JSONs and the classes with the final 
configuration (with default values resolved, for example). Having both 
responsibilities in the same classes makes more difficult to modify the way 
indexes are configured by the users without impacting on the code and the other 
way around. For example, bloom filters were originally designed to be 
configured as a list of columns that have to be indexed but at some point we 
wanted to let users to customize the bloom filter properties, so `TableConfig` 
contains both `bloomFilterColumns` and `bloomFilterConfigs`. Instead of being 
constraint at `TableConfig` level, this duality permeated to all other parts of 
the code, so a very common pattern was to first read `bloomFilterColumns` and 
then `bloomFilterConfigs` in several places of the code.
   
   Previous `TableConfig` has other problems:
   
   - Documentation is spread between `tableIndexConfig` and `fieldConfigList`. 
[Text search support doc 
page](https://docs.pinot.apache.org/basics/indexing/text-search-support) says 
that `fieldConfigList` is the newest way to configure indexes, but when using 
that advanced configuration has to be done in a generic properties field, which 
was mapped to a Map<String, String>. This is bad for various reasons:
       - It creates a single namespace where properties of different index may 
collide.
       - It restricts the value of the configuration to strings, so in order to 
support advanced configurations devs must parse the string their self and users 
must know which syntax is expected for each value. This is even worse when the 
parsing has to be done in several different places in the code (for example, 
when creating an index and when reading an index).
   - In order to add a new index, `TableConfig` source file has to be changed 
to add a new attribute in order to be known by Jackson. That is strong blocker 
in order to have index types defined at runtime.
   - It is impossible to create instances or modify of `TableConfig` in a 
programmatic type safe way. Therefore we introduced some abstractions like 
`IndexLoadingConfig` or `ColumnIndexCreationInfo` so we can programmatically 
modify the config (mainly, but not only, in tests).
   - In order to create or read some indexes, we need not only the 
configuration provided by the user but also some contextual information that 
has to be pre-processed (like for example the max or min values). Therefore we 
created classes like `IndexCreationContext`, where we merged the contextual 
information with index specific configurations. This is again incompatible with 
having index types defined in runtime. For example, we would need to create a 
method like `IndexCreationContext.Common.forBloomFilter()` for each new index 
type, which would require to modify the source code each time a new index is 
added.
   
   `TableConfig` is not in fact very bad designed. The actual problem is that 
we are using `TableConfig` for everything. What this proposal does is to keep 
`TableConfig` as the concrete syntax specific object that is usually 
instantiated by Jackson when a table is deserialized from JSON. In parallel, 
each `IndexType` has a generic `C` (which stands for _config_) and defines how 
to transform a `TableConfig` into an instance of `C`
   
   Therefore all the dirty code we have to introduce in order to support 
historical ways to configure the index is written only once and in the same 
method for each `IndexType` instead of being spread across the code. This 
should produce code that is easier to read and to maintain. Methods that are 
used to instantiate IndexCreators, IndexHandlers or IndexReaders will be 
explained bellow, but they do not receive the TableConfig (or a wrapper like 
`IndexLoadingConfig`) as an argument. Instead, they receive the already 
prepared generic C object they expect.
   
   With the change explained above we have the ability to specify all 
`IndexType` methods without depending on the specially crafted interfaces like 
`IndexLoadingConfig`, `SegmentGeneratorConfig` or `IndexCreationContext`. But 
it doesn't explain how to open the `TableConfig` to be able to store indexes 
that are unknown at compile time. The proposal here is to modify a little bit 
the idea of configuring all indexes in `FieldConfig`. Instead of list active 
indexes in the field `indexTypes`, here we propose to create a new field called 
`indexes`. That field is going to be either null or a JSON object whose keys 
are the id of the configured indexes and the values are defined by each 
`IndexType`. To implement that, we propose the following:
   - `FieldConfig.indexes` will be of type `JsonNode`, which is the most 
abstract Jackson type. Anything that can be deserialized with Jackson can be 
deserialized into a `JsonNode`. 
   - The class `IndexType` will have a method that returns the unique id of the 
index (for example, `ForwardIndexType.getId()` can return "forward").
   - The class `IndexType` will also have a method that given a `JsonNode`, 
returns an instance of C.
   
   When Pinot requires to transform `FieldConfig.indexes` from `JsonNode` into 
proper configuration objects, it will do something like the following:
   ```
     if (fieldConfig.indexes.isNull) {
       return null;
     }
     if (fieldConfig.indexes.isObject) {
       Iterator<Map.Entry<String, JsonNode>> it = node.fields();
   
       Map<IndexType, Object> builder = new HashMap<>();
       while (it.hasNext()) {
         Map.Entry<String, JsonNode> entry = it.next();
         String indexId = entry.getKey();
   
         IndexType indexType = IndexService.getInstance().getIndexType(indexId);
         if (indexType == null) {
            // decide what to do: continue or fail
         }
         JsonNode indexNode = entry.getValue();
         Object deserialized = indexType.deserialize(indexNode);
         builder.put(indexType, deserialized);
       }
     }
   ```
   
   By doing that, we could be sure that if we do something like 
`map.get(BloomIndexType.INSTACE)` we will receive a `BloomFilterConfig`. The 
draft includes a intermediate class called `FieldIndexConfigs` which is a 
typesafe wrapper on top of a map like the one shown here.
   
   So, for example, this configuration configuration that is currently 
supported by Pinot:
   ```js
   {
     "tableIndexConfig": {
       "rangeIndexColumns": [
          "col1"
       ],
       "bloomFilterConfigs": {
         "col1": {
           "fpp": 0.01,
           "maxSizeInBytes": 1000000,
           "loadOnHeap": true
         }
       },
       "invertedIndexColumns": [
         "col1"
       ]
     },
     "fieldConfigList": [
       {
          "name": “col1”,
          "indexTypes": [
             “text”
          ],
          "properties": {
             "fstType": “lucene”,
             "stopWordsInclude": "a,b,c",
             "enableQueryCacheForTextIndex": “false”,
             "isSegmentPartitioned": “true” // this property is not related to 
an specific index
          }
       }
     ]
   }
   ```
   
   Could be implemented as:
   ```js
   {
     "fieldConfigList": [
       {
          "name": “col1”,
          "indexes": {
            "inverted": true, // or {enabled: true}
            "range": true, // or {enabled: true}
            "text": {
              "type": “lucene”,
              "enableQueryCacheForTextIndex": false,
              "stopWordsInclude": ["a", "b", "c"]
            },
           "bloom": {
             "fpp": 0.01,
             "maxSizeInBytes": 1000000,
             "loadOnHeap": true
            },
          },
          "properties": {
             "isSegmentPartitioned": “true” // this property is not related to 
an specific index
          }
       }
     ]
   }
   ```
   
   In cases like BloomFilter, when `BloomFilterConfig` is already 
serializable/deserializable with Jackson, the method that transform from JSON 
to object could be:
   ```java
     BloomFilterConfig deserialize(JsonNode node)
         throws IOException {
       return JsonUtils.jsonNodeToObject(node, getIndexConfigClass());
     }
   ```
   
   Assuming that Java `null` means that the index is not configured, in case we 
need to customize that, for example on inverted indexes when there is no 
argument, we can do the same:
   ```java
     public Boolean deserialize(JsonNode node)
         throws IOException {
       if (node.isBoolean()) {
         if (node.booleanValue()) {
           return Boolean.TRUE;
         } else {
           return null;
         }
       }
       throw ...
     }
   ```
   
   In case we want to support different ways to configure the index, for 
example RangeIndexType when we would like to either use `true` to use the 
default or an object to be able to indicate the version of the range index, we 
can do something like:
   ```java
     public RangeIndexConfig deserialize(JsonNode node)
         throws IOException {
       if (node.isBoolean()) {
         if (node.booleanValue()) {
           return getDefaultConfig();
         } else {
           return null;
         }
       }
       return JsonUtils.jsonNodeToObject(node, getIndexConfigClass());
     }
   ```
   
   Most of these methods can also be delegated to Jackson by using customized 
deserializers, but we think that having this method here is more explicit and 
requires less knowledge of Jackson, which could make it easier to read.
   
   This proposal makes this new way to configure indexes optional to indexes in 
Apache Pinot repository but mandatory for indexes outside this repo. Therefore 
the proposal must support the  _older_ way to configure indexes. This can 
partially be done by the already mentioned method that transform from 
`TableConfig` to `C`, but we may also need other methods to transform from 
`IndexLoadingConfig` and `SegmentGeneratorConfig` to `C`, as they are usually 
modified by tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to