Re: [I] Support for building materialized views using Lucene formats [lucene]

via GitHub Mon, 22 Apr 2024 02:40:55 -0700


bharath-techie commented on issue #13188:
URL: https://github.com/apache/lucene/issues/13188#issuecomment-2068945005


   There are several advantages to keeping the new index as part of the same 
Lucene segment. It reduces maintenance overhead and enables Near Real-Time 
(NRT) use cases. Specifically, for the star tree index, incrementally building 
the star tree as segments get flushed and merged takes significantly less time, 
as the sort and aggregation operations can be optimized.
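   
   To illustrate why incremental building can be cheaper, here is a minimal 
sketch, assuming each segment already holds rows sorted by dimension values and 
aggregated per dimension tuple. Merging two such segments is then a linear 
merge instead of a full re-sort. 'AggRow' and 'SortedAggMerge' are hypothetical 
names used only for illustration; this is not the actual star tree build code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-aggregated row: dimension values plus aggregated metrics. */
final class AggRow {
    final long[] dims;  // dimension values, in the cube's sort order
    final long sum;     // running sum of one metric
    final long count;   // number of documents folded into this row

    AggRow(long[] dims, long sum, long count) {
        this.dims = dims;
        this.sum = sum;
        this.count = count;
    }
}

final class SortedAggMerge {
    /**
     * Merges two lists that are each sorted by dims and hold at most one row
     * per dims tuple. Rows with equal dims are combined; nothing is re-sorted.
     */
    static List<AggRow> merge(List<AggRow> left, List<AggRow> right) {
        List<AggRow> out = new ArrayList<>(left.size() + right.size());
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            AggRow l = left.get(i);
            AggRow r = right.get(j);
            int cmp = Arrays.compare(l.dims, r.dims); // lexicographic order
            if (cmp < 0) {
                out.add(l);
                i++;
            } else if (cmp > 0) {
                out.add(r);
                j++;
            } else {
                // Same dimension tuple in both segments: fold the metrics.
                out.add(new AggRow(l.dims, l.sum + r.sum, l.count + r.count));
                i++;
                j++;
            }
        }
        while (i < left.size()) out.add(left.get(i++));
        while (j < right.size()) out.add(right.get(j++));
        return out;
    }

    public static void main(String[] args) {
        List<AggRow> seg1 = List.of(new AggRow(new long[] {1, 1}, 10, 1));
        List<AggRow> seg2 = List.of(new AggRow(new long[] {1, 1}, 7, 2));
        for (AggRow row : merge(seg1, seg2)) {
            System.out.println(Arrays.toString(row.dims) + " sum=" + row.sum + " count=" + row.count);
        }
    }
}
```

   The cost is proportional to the number of pre-aggregated rows rather than 
the number of documents, which is the kind of saving meant above.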
   
   Considering these advantages, I'm further exploring the idea of a new format 
to support multi-field indices, which can also be extended to create other 
types of composite indices.
   
   ### DataCubesFormat vs CompositeValuesFormat
   
   Since Lucene is also used for OLAP use cases, we can create a 
'DataCubesFormat' specifically designed to create multi-field indices on a set 
of dimensions and metrics. [Preferred]
   
   Alternatively, if we want a more generic format for creating indices based 
on any set of fields, we could go with 'CompositeValuesFormat'.
   
   While the underlying implementation for both formats would be similar 
(creating indices on a set of Lucene fields), 'DataCubesFormat' is more 
descriptive and tailored to the OLAP use case.
   
   ### Implementation
   
   For clarity, we will focus on 'DataCubesFormat' in the rest of this section.
   
   Broadly, we have two ways to implement the format.
   
   #### IndexWriterConfig / SegmentInfo [Preferred]
   
   - During flush/merge, we create the multi-field indices based on the 
existing fields (DocValues). We can supply a list of 'DataCubeField' 
configurations as part of 'DataCubesConfig' to 'IndexWriterConfig' and save it 
as part of the SegmentInfo.
   - This is quite similar to how 'IndexSort' is implemented in Lucene.
   - The new 'DataCubesFormat' can then consume the existing fields' writers 
during flush/merge to create 'DataCubesIndices'.
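   
   A rough sketch of what such a configuration could look like, assuming 
field names refer to existing DocValues fields. 'DataCubeField' and 
'DataCubesConfig' do not exist in Lucene today; their shape here, the 
'toSegmentAttributes' method, and the 'datacube.' key prefix are all 
assumptions. The intent is only to mirror how an index sort is supplied via 
'IndexWriterConfig.setIndexSort' and then recorded per segment.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical: one data cube defined over existing DocValues fields. */
final class DataCubeField {
    final String name;
    final List<String> dimensions; // names of existing DocValues fields
    final List<String> metrics;    // names of existing numeric DocValues fields

    DataCubeField(String name, List<String> dimensions, List<String> metrics) {
        this.name = name;
        this.dimensions = List.copyOf(dimensions);
        this.metrics = List.copyOf(metrics);
    }
}

/** Hypothetical: the list of cubes a writer would be configured with. */
final class DataCubesConfig {
    final List<DataCubeField> fields;

    DataCubesConfig(List<DataCubeField> fields) {
        this.fields = List.copyOf(fields);
    }

    /**
     * Flattens the config into string key/value pairs. A plain Map stands in
     * for per-segment metadata here; the real storage layout is undecided.
     */
    Map<String, String> toSegmentAttributes() {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (DataCubeField f : fields) {
            attrs.put("datacube." + f.name + ".dimensions", String.join(",", f.dimensions));
            attrs.put("datacube." + f.name + ".metrics", String.join(",", f.metrics));
        }
        return attrs;
    }

    public static void main(String[] args) {
        DataCubesConfig config = new DataCubesConfig(List.of(
            new DataCubeField("sales", List.of("status", "region"), List.of("price"))));
        config.toSegmentAttributes().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

   Keeping the config as plain field names is what lets flush/merge look up 
the existing DocValues for each dimension and metric.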
   
   
   Pros
   - Reuses existing writer implementations for individual fields (dimensions 
and metrics) to create the 'DataCube' indices.
   - Aligns with the overall Lucene architecture, making it consistent with 
other features like 'IndexSort'.
   
   Cons
   - Users cannot create 'DataCube' indices without the associated 'DocValues' 
fields.
   
   #### Add/update doc flow with a new DataCubeField
   
   Users can pass the set of dimensions and metrics as part of a new 
'DataCubeField' during the 'ProcessDocument' flow.
   
   Pros
   - Allows users to create 'DataCube' indices without the associated 
'DocValues' fields.
   
   Cons
   - Potentially complicates the indexing process and introduces additional 
overhead.
   - May not be necessary, as the existing 'DocValues' writer implementations 
should cover most use cases for numeric/text dimensions and numeric metrics. 
   - Hard to fall back to the existing fields when their cardinality is too 
high, because the 'DataCubeField' would be a separate entity. With the 
'IndexWriterConfig / SegmentInfo' approach, we can apply a guardrail that skips 
creating 'DataCubes' for high-cardinality fields while still using the existing 
field data.
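   
   The guardrail mentioned above could be as simple as the following sketch. 
It assumes a per-field distinct-value count is available at flush/merge time 
and that a single threshold applies to every dimension; 'CardinalityGuardrail', 
'shouldBuildCube', and the threshold value are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical guardrail deciding whether a cube is worth building. */
final class CardinalityGuardrail {
    /**
     * Returns true only if every dimension has a known cardinality that is at
     * or below maxCardinality. Unknown fields are treated as not eligible.
     */
    static boolean shouldBuildCube(
            List<String> dimensions, Map<String, Long> cardinality, long maxCardinality) {
        for (String dim : dimensions) {
            Long distinct = cardinality.get(dim);
            if (distinct == null || distinct > maxCardinality) {
                return false; // skip the cube; queries use the plain fields
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> cardinality = Map.of("status", 5L, "user_id", 1_000_000L);
        System.out.println(shouldBuildCube(List.of("status"), cardinality, 10_000L));
        System.out.println(shouldBuildCube(List.of("status", "user_id"), cardinality, 10_000L));
    }
}
```

   When the check fails, nothing is lost: the segment still carries the 
regular DocValues for those fields.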
   
   Overall, the 'IndexWriterConfig' / 'SegmentInfo' approach seems more 
suitable for implementing the 'DataCubesFormat'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

