gortiz opened a new issue, #10183: URL: https://github.com/apache/pinot/issues/10183
## Why The objective of this proposal is to add the ability to include new index types at runtime in Apache Pinot. This opens the ability of adding third party indexes, including proprietary indexes. When we talk about index types in this proposal, we are focused on single field index. Constructions like StarTree indexes or future indexes that index more than one column are deliberately ignored. ## The plan All ideas presented here have been already drafted in this branch [pending to be created], where we have been exploring the problem in order to find the solution we propose here. Given the large number of changed in that branch, we think that it is better to use that branch as a reference, but split the changes in different PRs. Right now the idea is to create partial PRs in the following order: - [ ] Add the `IndexType` without modifying already existing code. Include few example implementations. - [ ] Add the `IndexService` without modifying already existing code. - [ ] Partially implement `IndexType` for each existent index type without modifying already existing code. Methods that define how to create `IndexCreator`s, `IndexReader`s, `IndexHandler`s, etc are not going to be implemented here. - [ ] Refactor existing code to use `IndexType` whenever `ColumnIdexType` is being used right now. - [ ] Transform old configs (including usage of `IndexLoadingConfig` and `SegmentGeneratorConfig`) - [ ] Refactor each `IndexCreator` to have a common interface. - [ ] Remove `IndexCreatorProvider` and `IndexReaderProvider` interfaces and implementations, moving the code to each `IndexType`. Here classes which delegate on these providers (like `PhysicalColumnIndexContainer`) will be modified to call the equivalent methods on IndexType. - [ ] Refactor classes like `PhysicalColumnIndexContainer` or `SegmentColumnarIndexCreator` to call just iterate over the `IndexType`s returned by `IndexService` instead of explicitly ask for each one ## Current blockers 1. Index types are treated as enums, which are, by definition, decided at compile time. For example, `ColumnIndexType`. 4. Index configuration is specified in `TableConfig`, specially in `FieldConfig` and `IndexingConfig`. These classes are POJOs with one or more attributes per index type. Therefore to add a new index it is needed to add attributes and recompile the class. 5. There is at least one IndexCreator per index type, but each IndexCreator is different to the other. They do not fulfill a common contract. The same happen with IndexReader. Ideally IndexCreators and IndexReaders should be closer to IndexHandlers, which do follow the same contract for each index type. 6. IndexCreators are created using different `XIndexCreatorProvider`, where X is the index type (like `ForwardIndexCreatorProvider`) but: 1. Each provider has its own methods with its own arguments that derive from `IndexCreationContext`. 2. These `XIndexCreatorProvider`s do not have a common interface. 3. There is a `IndexCreatorProvider` interface, but it extends all other `XIndexCreatorProvider`. That pattern goes against the extensibility. For example, in case a new index type called `Test` was added, we would need to modify `IndexCreatorProvider` to also extend `TextIndexCreatorProvider` and implement its method in all implementations, which obviously requires to recompile the project. 4. Same happens with `IndexReaderProvider` 7. Same happens with `MutableIndexProvider` 5. Indexes are being used when `PlanNode`s are translated into physical operators. Right now this transformation is totally static. In case we want to add a new index we need to modify methods like `FilterPlanNode.constructPhysicalOperator` or `FilterOperatorUtils.getLeafFilterOperator` in order to add new transformation rules, which are eagerly executed. ## The IndexService and the IndexType class This proposal is evolves around the idea that indexes should be obtained from an `IndexService`. There should be a single instance on every Pinot component that uses indexes (like Pinot Servers). Optionally this singleton can be substituted by a bunch of static methods in the `IndexService` class. This `IndexService` will contain a map from a String _id_ to values of `IndexType`, which has to be a new interface that completely identifies an index. The `IndexType` interface will have methods that define the index id, how to configure the index given a TableConfig and how to obtain its `IndexCreator`, `IndexReader`, `IndexHandler` and optionally its `MutableIndex`. These last methods will require information from the environment. Specially they will require some contextual information (like statistics for the column) and some configuration. There should be a single `IndexType` instance for each index type in Pinot (ie one for forward indexes, one for range indexes, etc). Pinot will have a new configuration to specify which indexes should be available and at startup Pinot will look for the corresponding `IndexType` in the classpath and add it to the `IndexService` singleton. After that, whenever Pinot requires to use an index Pinot will use the `IndexService` to get the `IndexType` instance. To look for all indexes in the classpath, the draft implementation uses an intermediate `IndexPlugin` interface that is discovered using Java Service Providers. ## The index configuration As said above, current way to configure is very static, requiring to add new attributes at compile time to the POJOs we have. At the same time, indexes are configured in different ways depending on their requirements and when it was created. Right now, index configuration is spread across different fields in the `TableConfig`, which has two key resposibilities: 1. Being the class that is being used to define how the JSON is written by the users 2. Being the class that devs use to create the readers and creators. Although it is not impossible to correctly implement both things in the same class, we think that it would be better to clearly differentiate between the concrete syntax used by the user to write JSONs and the classes with the final configuration (with default values resolved, for example). Having both responsibilities in the same classes makes more difficult to modify the way indexes are configured by the users without impacting on the code and the other way around. For example, bloom filters were originally designed to be configured as a list of columns that have to be indexed but at some point we wanted to let users to customize the bloom filter properties, so `TableConfig` contains both `bloomFilterColumns` and `bloomFilterConfigs`. Instead of being constraint at `TableConfig` level, this duality permeated to all other parts of the code, so a very common pattern was to first read `bloomFilterColumns` and then `bloomFilterConfigs` in several places of the code. Previous `TableConfig` has other problems: - Documentation is spread between `tableIndexConfig` and `fieldConfigList`. [Text search support doc page](https://docs.pinot.apache.org/basics/indexing/text-search-support) says that `fieldConfigList` is the newest way to configure indexes, but when using that advanced configuration has to be done in a generic properties field, which was mapped to a Map<String, String>. This is bad for various reasons: - It creates a single namespace where properties of different index may collide. - It restricts the value of the configuration to strings, so in order to support advanced configurations devs must parse the string their self and users must know which syntax is expected for each value. This is even worse when the parsing has to be done in several different places in the code (for example, when creating an index and when reading an index). - In order to add a new index, `TableConfig` source file has to be changed to add a new attribute in order to be known by Jackson. That is strong blocker in order to have index types defined at runtime. - It is impossible to create instances or modify of `TableConfig` in a programmatic type safe way. Therefore we introduced some abstractions like `IndexLoadingConfig` or `ColumnIndexCreationInfo` so we can programmatically modify the config (mainly, but not only, in tests). - In order to create or read some indexes, we need not only the configuration provided by the user but also some contextual information that has to be pre-processed (like for example the max or min values). Therefore we created classes like `IndexCreationContext`, where we merged the contextual information with index specific configurations. This is again incompatible with having index types defined in runtime. For example, we would need to create a method like `IndexCreationContext.Common.forBloomFilter()` for each new index type, which would require to modify the source code each time a new index is added. `TableConfig` is not in fact very bad designed. The actual problem is that we are using `TableConfig` for everything. What this proposal does is to keep `TableConfig` as the concrete syntax specific object that is usually instantiated by Jackson when a table is deserialized from JSON. In parallel, each `IndexType` has a generic `C` (which stands for _config_) and defines how to transform a `TableConfig` into an instance of `C` Therefore all the dirty code we have to introduce in order to support historical ways to configure the index is written only once and in the same method for each `IndexType` instead of being spread across the code. This should produce code that is easier to read and to maintain. Methods that are used to instantiate IndexCreators, IndexHandlers or IndexReaders will be explained bellow, but they do not receive the TableConfig (or a wrapper like `IndexLoadingConfig`) as an argument. Instead, they receive the already prepared generic C object they expect. With the change explained above we have the ability to specify all `IndexType` methods without depending on the specially crafted interfaces like `IndexLoadingConfig`, `SegmentGeneratorConfig` or `IndexCreationContext`. But it doesn't explain how to open the `TableConfig` to be able to store indexes that are unknown at compile time. The proposal here is to modify a little bit the idea of configuring all indexes in `FieldConfig`. Instead of list active indexes in the field `indexTypes`, here we propose to create a new field called `indexes`. That field is going to be either null or a JSON object whose keys are the id of the configured indexes and the values are defined by each `IndexType`. To implement that, we propose the following: - `FieldConfig.indexes` will be of type `JsonNode`, which is the most abstract Jackson type. Anything that can be deserialized with Jackson can be deserialized into a `JsonNode`. - The class `IndexType` will have a method that returns the unique id of the index (for example, `ForwardIndexType.getId()` can return "forward"). - The class `IndexType` will also have a method that given a `JsonNode`, returns an instance of C. When Pinot requires to transform `FieldConfig.indexes` from `JsonNode` into proper configuration objects, it will do something like the following: ``` if (fieldConfig.indexes.isNull) { return null; } if (fieldConfig.indexes.isObject) { Iterator<Map.Entry<String, JsonNode>> it = node.fields(); Map<IndexType, Object> builder = new HashMap<>(); while (it.hasNext()) { Map.Entry<String, JsonNode> entry = it.next(); String indexId = entry.getKey(); IndexType indexType = IndexService.getInstance().getIndexType(indexId); if (indexType == null) { // decide what to do: continue or fail } JsonNode indexNode = entry.getValue(); Object deserialized = indexType.deserialize(indexNode); builder.put(indexType, deserialized); } } ``` By doing that, we could be sure that if we do something like `map.get(BloomIndexType.INSTACE)` we will receive a `BloomFilterConfig`. The draft includes a intermediate class called `FieldIndexConfigs` which is a typesafe wrapper on top of a map like the one shown here. So, for example, this configuration configuration that is currently supported by Pinot: ```js { "tableIndexConfig": { "rangeIndexColumns": [ "col1" ], "bloomFilterConfigs": { "col1": { "fpp": 0.01, "maxSizeInBytes": 1000000, "loadOnHeap": true } }, "invertedIndexColumns": [ "col1" ] }, "fieldConfigList": [ { "name": “col1”, "indexTypes": [ “text” ], "properties": { "fstType": “lucene”, "stopWordsInclude": "a,b,c", "enableQueryCacheForTextIndex": “false”, "isSegmentPartitioned": “true” // this property is not related to an specific index } } ] } ``` Could be implemented as: ```js { "fieldConfigList": [ { "name": “col1”, "indexes": { "inverted": true, // or {enabled: true} "range": true, // or {enabled: true} "text": { "type": “lucene”, "enableQueryCacheForTextIndex": false, "stopWordsInclude": ["a", "b", "c"] }, "bloom": { "fpp": 0.01, "maxSizeInBytes": 1000000, "loadOnHeap": true }, }, "properties": { "isSegmentPartitioned": “true” // this property is not related to an specific index } } ] } ``` In cases like BloomFilter, when `BloomFilterConfig` is already serializable/deserializable with Jackson, the method that transform from JSON to object could be: ```java BloomFilterConfig deserialize(JsonNode node) throws IOException { return JsonUtils.jsonNodeToObject(node, getIndexConfigClass()); } ``` Assuming that Java `null` means that the index is not configured, in case we need to customize that, for example on inverted indexes when there is no argument, we can do the same: ```java public Boolean deserialize(JsonNode node) throws IOException { if (node.isBoolean()) { if (node.booleanValue()) { return Boolean.TRUE; } else { return null; } } throw ... } ``` In case we want to support different ways to configure the index, for example RangeIndexType when we would like to either use `true` to use the default or an object to be able to indicate the version of the range index, we can do something like: ```java public RangeIndexConfig deserialize(JsonNode node) throws IOException { if (node.isBoolean()) { if (node.booleanValue()) { return getDefaultConfig(); } else { return null; } } return JsonUtils.jsonNodeToObject(node, getIndexConfigClass()); } ``` Most of these methods can also be delegated to Jackson by using customized deserializers, but we think that having this method here is more explicit and requires less knowledge of Jackson, which could make it easier to read. This proposal makes this new way to configure indexes optional to indexes in Apache Pinot repository but mandatory for indexes outside this repo. Therefore the proposal must support the _older_ way to configure indexes. This can partially be done by the already mentioned method that transform from `TableConfig` to `C`, but we may also need other methods to transform from `IndexLoadingConfig` and `SegmentGeneratorConfig` to `C`, as they are usually modified by tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org