bharath-techie opened a new issue, #13188:
URL: https://github.com/apache/lucene/issues/13188

   ### Description
   
   We are exploring building materialized views for certain fields and 
dimensions using a [Star Tree 
index](https://github.com/opensearch-project/OpenSearch/issues/12498) while 
indexing the data. The views will be based on the fields (dimensions and 
metrics) configured during index creation. The idea is inspired by 
http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot’s Star 
Tree index. Star Tree helps enforce an upper bound on aggregation queries, 
which ensures predictable latency and resource usage. It is also 
storage-efficient and configurable.
   OpenSearch RFC : 
https://github.com/opensearch-project/OpenSearch/issues/12498
   
   We are creating this issue to discuss approaches for supporting Star Tree 
in Lucene and to get feedback on other approaches or recommendations from the 
community.
   
   ## Quick overview on Star Tree index creation flow
   The Star Tree DocValues fields and the Star Tree index are created during 
the flush / merge flows of indexing.
   
   ### Flush / merge flow 
   
   1. Create initial set of Star Tree documents based on the configured 
dimensions and metrics.
   2. Sort the Star Tree documents based on dimensions (fields) and aggregate 
on the metrics (fields).
   3. Create Star Tree index.
   4. Create Star Tree DocValues fields for each of the Star Tree dimensions 
and metrics.
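
   Steps 1 and 2 can be sketched roughly as below. This is a minimal 
illustration under stated assumptions, not Lucene code: `StarTreeDoc`, 
`sortAndAggregate`, and the single SUM metric are hypothetical names chosen 
for the example, dimension values are modeled as `long` ordinals, and the tree 
construction of step 3 is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StarTreeDocSketch {

  /** Hypothetical Star Tree document: dimension values plus one SUM metric. */
  record StarTreeDoc(long[] dimensions, double sum) {}

  /** Sorts documents by dimensions, then merges documents with equal dimensions. */
  static List<StarTreeDoc> sortAndAggregate(List<StarTreeDoc> docs) {
    List<StarTreeDoc> sorted = new ArrayList<>(docs);
    // Order by dimension values, compared lexicographically in configured order.
    sorted.sort((a, b) -> Arrays.compare(a.dimensions(), b.dimensions()));

    List<StarTreeDoc> result = new ArrayList<>();
    for (StarTreeDoc doc : sorted) {
      int last = result.size() - 1;
      if (last >= 0 && Arrays.equals(result.get(last).dimensions(), doc.dimensions())) {
        // Same dimension tuple as the previous document: fold the metric into it.
        double merged = result.get(last).sum() + doc.sum();
        result.set(last, new StarTreeDoc(doc.dimensions(), merged));
      } else {
        result.add(doc);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<StarTreeDoc> docs = List.of(
        new StarTreeDoc(new long[] {2, 1}, 5.0),
        new StarTreeDoc(new long[] {1, 3}, 2.0),
        new StarTreeDoc(new long[] {2, 1}, 4.0));
    for (StarTreeDoc doc : sortAndAggregate(docs)) {
      System.out.println(Arrays.toString(doc.dimensions()) + " sum=" + doc.sum());
    }
  }
}
```

   Sorting first means documents with identical dimension tuples end up 
adjacent, so aggregation needs only a single pass over the sorted list.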
   
   
![star-lucene](https://github.com/apache/lucene/assets/58062316/13c80059-26f5-4c0f-b9e2-7dee8315c277)
   
   ### Challenges
   The main challenge is that the ‘StarTree’ index is a multi-field index, 
unlike other formats in Lucene / OpenSearch. This makes it infeasible to use 
the PerField extension defined in Lucene today. We explored using 
‘BinaryDocValues’ to encode dimensions and metrics, but dimensions and metrics 
have different types (dimensions can be numeric, text, or a combination), so 
we could not find a way to extend it.
   
   ## Create Star Tree index 
   ### Approach 1 - Create a new format to build materialized views
   We can create a new dedicated file format (similar to the points format or 
the postings format) for materialized views. It would accept a list of 
dimensions and metrics, and its default implementation could be the Star Tree 
index.
   
   Pros
   
   * This can be the standard format for supporting materialized views in 
Lucene which developers can use or extend to create custom solutions.
   * With this format, users can create materialized views directly at index 
time without storing the original documents in DocValues indices.
   
   Cons
   
   * A new format adds maintenance overhead.
   
   ### Approach 2 - Extend DocValues format
   
   #### Indexing - Extend DocValues to support materialized views
   We can extend the DocValues format to support a new field type, 
‘AGGREGATED’, which holds the list of dimensions and metrics that the user 
configured during index creation.
   ```
   AggregatedField {
       List<String>       DimensionFields
       List<MetricConfig> MetricFields
   }
   MetricConfig {
       FieldName      fieldName
       MetricFunction function
   }
   MetricFunction {
       SUM,
       AVG,
       COUNT,
       ....
   }
   ```
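
   For illustration, the configuration above could be modeled in Java roughly 
as follows. This is only a sketch: the type names mirror the pseudocode, 
`FieldName` is replaced by a plain `String`, the enum lists only the three 
functions named above, and the non-empty check and the sample field names 
(`status`, `port`, `size`, `latency`) are assumptions made for the example.

```java
import java.util.List;

final class AggregatedFieldSketch {

  /** Aggregation functions named in the proposal; the real list may be longer. */
  enum MetricFunction { SUM, AVG, COUNT }

  /** One metric: the source field and the function applied to it. */
  record MetricConfig(String fieldName, MetricFunction function) {}

  /** Dimensions and metrics configured for one materialized view. */
  record AggregatedField(List<String> dimensionFields, List<MetricConfig> metricFields) {
    AggregatedField {
      // Assumption for this sketch: a view needs at least one dimension and one metric.
      if (dimensionFields.isEmpty() || metricFields.isEmpty()) {
        throw new IllegalArgumentException("need at least one dimension and one metric");
      }
      // Defensive copies keep the configuration immutable after creation.
      dimensionFields = List.copyOf(dimensionFields);
      metricFields = List.copyOf(metricFields);
    }
  }

  public static void main(String[] args) {
    AggregatedField field = new AggregatedField(
        List.of("status", "port"),
        List.of(
            new MetricConfig("size", MetricFunction.SUM),
            new MetricConfig("latency", MetricFunction.AVG)));
    System.out.println(field);
  }
}
```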
   During flush / merge, the values of the dimensions and metrics will be read 
from the associated ‘DocValues’ fields using DocValuesProducer, and we will 
create the Star Tree indices following the steps described above.
   
   #### Search flow
   
   We can extend ‘LeafReader’ and ‘DocValuesProducer’ with a new method 
‘getAggregatedDocValues’ to get the Star Tree index during query time. This 
retrieves the root of the Star Tree and the dimensions and metrics DocValues 
fields. 
   
   Pros
   
   * With the above extensions in place, any custom codec implementation of 
‘DocValues’ can provide its own implementation for materialized views and 
create the relevant indices during flush/merge.
   * Less maintenance overhead.
   
   Cons
   
   * Tight coupling between DocValuesFormat and materialized views.
   * The original documents are needed to create the derived aggregated Star 
Tree documents.
   
   ### Open questions
   
   Are there any suggestions on how to pack the values of ‘dimensions’ and 
‘metrics’ into the ‘AggregatedField’ during indexing, as part of the 
‘addDocument’ flow? Should we explore this, or can we simply create the 
derived ‘AggregatedField’ during flush/merge?
   
   ## Create Star Tree DocValues fields
   The Star Tree index is backed by Star Tree DocValues fields, so we can 
reuse the existing ‘DocValuesFormat’ to read and write them. Each field is 
stored as a ‘Numeric’ DocValues field, or as a ‘SortedSet’ DocValues field in 
the case of text fields.
   
   To accommodate this, we propose extending DocValuesFormat so that it 
accepts the ‘Codec’ and ‘Extension’ names as parameters. This lets us create 
the StarTree DocValues fields with custom file extensions.
   
   ```
   // Proposed overloads that take the codec names and file extensions.
   @Override
   public DocValuesConsumer fieldsConsumer(
       SegmentWriteState state,
       String dataCodec, String dataExtension,
       String metaCodec, String metaExtension) throws IOException {
     return new Lucene90DocValuesConsumer(
         state, dataCodec, dataExtension, metaCodec, metaExtension);
   }

   // The read side receives a SegmentReadState, not a SegmentWriteState.
   @Override
   public DocValuesProducer fieldsProducer(
       SegmentReadState state,
       String dataCodec, String dataExtension,
       String metaCodec, String metaExtension) throws IOException {
     return new Lucene90DocValuesProducer(
         state, dataCodec, dataExtension, metaCodec, metaExtension);
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

