bharath-techie commented on issue #13188: URL: https://github.com/apache/lucene/issues/13188#issuecomment-2068945005
There are several advantages to keeping the new index as part of the same Lucene segment. It reduces maintenance overhead and enables Near Real-Time (NRT) use cases. Specifically, for the star tree index, incrementally building the star tree as segments get flushed and merged takes significantly less time, as the sort and aggregation operations can be optimized.

Considering these advantages, I'm further exploring the idea of a new format to support multi-field indices, which can also be extended to create other types of composite indices.

### DataCubesFormat vs CompositeValuesFormat

- Since Lucene is also used for OLAP use cases, we can create a 'DataCubesFormat' specifically designed to create multi-field indices on a set of dimensions and metrics. [Preferred]
- Alternatively, if we want a more generic format for creating indices based on any set of fields, we could go with 'CompositeValuesFormat'.

While the underlying implementation for both formats would be similar (creating indices on a set of Lucene fields), 'DataCubesFormat' is more descriptive and tailored to the OLAP use case.

### Implementation

For clarity, we will focus on 'DataCubesFormat' in the rest of this section. Broadly, we have two ways to implement the format.

#### IndexWriterConfig / SegmentInfo [Preferred]

- During flush/merge, we create the multi-field indices based on the existing fields (DocValues). We can supply a list of 'DataCubeField' configurations as part of 'DataCubesConfig' to 'IndexWriterConfig' and save it as part of the SegmentInfo.
- This is quite similar to how 'IndexSort' is implemented in Lucene.
- So the new 'DataCubesFormat' can be used during flush/merge to consume the existing fields' writers and create 'DataCubesIndices'.

Pros
- Reuses existing writer implementations for individual fields (dimensions and metrics) to create the 'DataCube' indices.
- Aligns with the overall Lucene architecture, making it consistent with other features like 'IndexSort'.
Cons
- Users cannot create 'DataCube' indices without the associated 'DocValues' fields.

#### Add/update doc flow with a new DataCubeField

Users can pass the set of dimensions and metrics as part of a new 'DataCubeField' during the 'ProcessDocument' flow.

Pros
- Allows users to create 'DataCube' indices without the associated 'DocValues' fields.

Cons
- Potentially complicates the indexing process and introduces additional overhead.
- May not be necessary, as the existing 'DocValues' writer implementations should cover most use cases for numeric/text dimensions and numeric metrics.
- Difficult to fall back to existing fields if the cardinality of those fields is too high, as the 'DataCubeField' would be a separate entity. With the 'IndexWriterConfig / SegmentInfo' approach, we can apply a guardrail and not create 'DataCubes' for fields with high cardinality, while still leveraging the existing field data.

Overall, the preferred approach of using 'IndexWriterConfig' and 'SegmentInfo' seems more suitable for implementing the 'DataCubesFormat'.
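
To make the 'IndexWriterConfig / SegmentInfo' option more concrete, here is a minimal, self-contained sketch of what the configuration and the cardinality guardrail could look like. Everything in it is hypothetical: `DataCubeField` and `DataCubesConfig` follow the names used above but do not exist in Lucene, and `maxDimensionCardinality`, `cubesToBuild`, and the per-segment cardinality map are assumptions made purely for illustration. The sketch deliberately avoids any Lucene API; in a real implementation the config would presumably be handed to 'IndexWriterConfig' and recorded in the SegmentInfo, much like the index sort is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Lucene today.
final class DataCubesSketch {

    /** One data cube over existing DocValues fields: a set of dimensions and a set of metrics. */
    record DataCubeField(String name, List<String> dimensions, List<String> metrics) {
        DataCubeField {
            if (dimensions.isEmpty() || metrics.isEmpty()) {
                throw new IllegalArgumentException(
                    "a data cube needs at least one dimension and one metric");
            }
            dimensions = List.copyOf(dimensions);
            metrics = List.copyOf(metrics);
        }
    }

    /** Writer-level configuration, set once per writer (assumed to be stored per segment). */
    record DataCubesConfig(List<DataCubeField> fields, long maxDimensionCardinality) {
        DataCubesConfig {
            fields = List.copyOf(fields);
        }
    }

    /**
     * Guardrail evaluated at flush/merge time: keep only the cubes whose dimensions
     * all stay within the cardinality limit for this segment. A skipped cube loses
     * no data, because its underlying DocValues fields are still written as usual.
     */
    static List<DataCubeField> cubesToBuild(
            DataCubesConfig config, Map<String, Long> segmentCardinality) {
        List<DataCubeField> selected = new ArrayList<>();
        for (DataCubeField cube : config.fields()) {
            boolean withinLimit = true;
            for (String dimension : cube.dimensions()) {
                // A dimension with no statistics in this segment is treated as cardinality 0.
                long cardinality = segmentCardinality.getOrDefault(dimension, 0L);
                if (cardinality > config.maxDimensionCardinality()) {
                    withinLimit = false;
                    break;
                }
            }
            if (withinLimit) {
                selected.add(cube);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        DataCubesConfig config = new DataCubesConfig(
            List.of(
                new DataCubeField("status_hour", List.of("status", "hour"), List.of("latency")),
                new DataCubeField("per_user", List.of("user_id"), List.of("bytes"))),
            10_000L);

        // Made-up per-segment cardinalities, as they might be known at flush/merge time.
        Map<String, Long> cardinality =
            Map.of("status", 5L, "hour", 24L, "user_id", 2_000_000L);

        for (DataCubeField cube : cubesToBuild(config, cardinality)) {
            System.out.println(cube.name()); // prints "status_hour"
        }
    }
}
```

The point of the sketch is the fallback behaviour: because the cube is derived from fields that are indexed anyway, skipping `per_user` for a high-cardinality segment costs nothing beyond the missing cube. With a separate 'DataCubeField' in the add/update doc flow, there would be no existing field data to fall back to.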