bharath-techie opened a new issue, #13188: URL: https://github.com/apache/lucene/issues/13188
### Description

We are exploring the use case of building materialized views for certain fields and dimensions using the [Star Tree index](https://github.com/opensearch-project/OpenSearch/issues/12498) while indexing the data. The views are built from the fields (dimensions and metrics) configured during index creation. The approach is inspired by http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot’s Star Tree index.

Star Tree helps enforce an upper bound on aggregation queries, ensuring predictable latency and resource usage. It is also storage-efficient and configurable.

OpenSearch RFC: https://github.com/opensearch-project/OpenSearch/issues/12498

This issue is meant to discuss approaches to supporting Star Tree in Lucene and to get feedback on other approaches or recommendations from the community.

## Quick overview on Star Tree index creation flow

The Star Tree DocValues fields and the Star Tree index are created during the flush / merge flows of indexing.

### Flush / merge flow

1. Create the initial set of Star Tree documents based on the configured dimensions and metrics.
2. Sort the Star Tree documents by their dimensions (fields) and aggregate the metrics (fields).
3. Create the Star Tree index.
4. Create Star Tree DocValues fields for each of the Star Tree dimensions and metrics.

### Challenges

The main challenge is that the ‘StarTree’ index is a multi-field index, unlike the other formats in Lucene / OpenSearch, which are per-field. This makes it infeasible to use the PerField extension defined in Lucene today.

We explored using ‘BinaryDocValues’ to encode dimensions and metrics, but dimensions and metrics have different types (dimensions can be numeric, text, or a combination), so we could not find a way to extend it.
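Steps 1 and 2 of the flush / merge flow above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a Lucene or OpenSearch API: the class `StarTreeSketch`, the type `StarTreeDoc`, and the method `sortAndAggregate` are hypothetical names, dimensions are assumed to be already encoded as `long` values (for example ordinals for text fields), and every metric is assumed to use SUM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of "sort by dimensions, aggregate metrics"; not a Lucene API. */
final class StarTreeSketch {

    /** One Star Tree document: encoded dimension values plus metric values. */
    static final class StarTreeDoc {
        final long[] dimensions; // assumption: text dimensions are already mapped to ordinals
        final double[] metrics;  // assumption: every metric is aggregated with SUM

        StarTreeDoc(long[] dimensions, double[] metrics) {
            this.dimensions = dimensions;
            this.metrics = metrics;
        }
    }

    /** Sorts documents lexicographically by dimensions and merges documents with equal dimensions. */
    static List<StarTreeDoc> sortAndAggregate(List<StarTreeDoc> docs) {
        List<StarTreeDoc> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> Arrays.compare(a.dimensions, b.dimensions));

        List<StarTreeDoc> result = new ArrayList<>();
        StarTreeDoc current = null;
        for (StarTreeDoc doc : sorted) {
            if (current != null && Arrays.equals(current.dimensions, doc.dimensions)) {
                // Same dimension tuple as the previous document: fold the metrics in.
                for (int i = 0; i < current.metrics.length; i++) {
                    current.metrics[i] += doc.metrics[i];
                }
            } else {
                // New dimension tuple: start a new aggregated document (copy to avoid aliasing inputs).
                current = new StarTreeDoc(doc.dimensions.clone(), doc.metrics.clone());
                result.add(current);
            }
        }
        return result;
    }
}
```

Because equal dimension tuples become adjacent after sorting, a single linear pass is enough to aggregate them. A real implementation would also need per-metric aggregation functions and the star-node construction of step 3, which this sketch leaves out.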
## Create Star Tree index

### Approach 1 - Create a new format to build materialized views

We can create a new dedicated file format (similar to the points format or the postings format) for materialized views. It accepts a list of dimensions and metrics, and its default implementation could be the Star Tree index.

Pros
* This can be the standard format for supporting materialized views in Lucene, which developers can use or extend to create custom solutions.
* With this format, users can create materialized views directly at index time without storing the original documents via DocValues indices.

Cons
* A new format adds maintenance overhead.

### Approach 2 - Extend DocValues format

#### Indexing - Extend DocValues to support materialized views

We can extend the DocValues format to support a new field type, ‘AGGREGATED’, which holds the list of dimensions and metrics configured by the user during index creation.

```
import java.util.List;

// Proposed configuration for an AGGREGATED field (illustrative, not an existing Lucene type).
class AggregatedField {
  List<String> dimensionFields;
  List<MetricConfig> metricFields;
}

// A metric is a source field plus the function used to aggregate it.
class MetricConfig {
  String fieldName;
  MetricFunction function;
}

enum MetricFunction {
  SUM, AVG, COUNT // ....
}
```

During flush / merge, the values of the dimensions and metrics will be read from the associated ‘DocValues’ fields using DocValuesProducer, and we will create the Star Tree indices following the steps described above.

#### Search flow

We can extend ‘LeafReader’ and ‘DocValuesProducer’ with a new method, ‘getAggregatedDocValues’, to get the Star Tree index at query time. This method retrieves the root of the Star Tree and the DocValues fields of the dimensions and metrics.

Pros
* If the above extensions are in place, any custom codec implementation of ‘DocValues’ can have a custom implementation for materialized views that creates the relevant indices during flush / merge.
* Less maintenance overhead.

Cons
* Tight coupling between DocValuesFormat and materialized views.
* The original documents are needed to create the derived aggregated Star Tree documents.

### Open questions

Are there any suggestions on how to pack the values of ‘dimensions’ and ‘metrics’ into an ‘AggregatedField’ during indexing, as part of the ‘addDocument’ flow? Also, should we explore this, or can we simply create the derived ‘AggregatedField’ during flush / merge?

## Create Star Tree DocValues fields

The Star Tree index is backed by Star Tree DocValues fields, so we can reuse the existing ‘DocValuesFormat’ to read and write them. Each field is stored as a ‘Numeric’ DocValues field, or as a ‘SortedSet’ DocValues field in the case of text fields.

To accommodate this, we propose extending DocValuesFormat to accept the codec names and file extensions as parameters, so that we can create the StarTree DocValues fields with custom extensions.

```
// Proposed overloads: codec names and file extensions are passed in
// instead of being fixed constants of the format.
@Override
public DocValuesConsumer fieldsConsumer(
    SegmentWriteState state,
    String dataCodec, String dataExtension,
    String metaCodec, String metaExtension) throws IOException {
  return new Lucene90DocValuesConsumer(
      state, dataCodec, dataExtension, metaCodec, metaExtension);
}

// The read path takes a SegmentReadState rather than a SegmentWriteState.
@Override
public DocValuesProducer fieldsProducer(
    SegmentReadState state,
    String dataCodec, String dataExtension,
    String metaCodec, String metaExtension) throws IOException {
  return new Lucene90DocValuesProducer(
      state, dataCodec, dataExtension, metaCodec, metaExtension);
}
```