[I] Support for building materialized views using Lucene formats [lucene]

via GitHub Sun, 17 Mar 2024 23:32:59 -0700


bharath-techie opened a new issue, #13188:
URL: https://github.com/apache/lucene/issues/13188

### Description

We are exploring building materialized views for certain fields and
dimensions using a [Star Tree
index](https://github.com/opensearch-project/OpenSearch/issues/12498) while
indexing the data. The views are built from the fields (dimensions and
metrics) configured during index creation. The idea is inspired by
http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot’s Star
Tree index. A Star Tree enforces an upper bound on the work done by
aggregation queries, which gives predictable latency and resource usage. It is
also storage-efficient and configurable.

OpenSearch RFC:
https://github.com/opensearch-project/OpenSearch/issues/12498

Creating this issue to discuss approaches to support Star Tree in Lucene and
also to get feedback on any other approaches/recommendations from the community.

## Quick overview of the Star Tree index creation flow
The Star Tree DocValues fields and the Star Tree index are created during the
flush / merge flows of indexing.

### Flush / merge flow

1. Create an initial set of Star Tree documents based on the configured
dimensions and metrics.
2. Sort the Star Tree documents by their dimensions (fields) and aggregate
the metrics (fields).
3. Create the Star Tree index.
4. Create Star Tree DocValues fields for each of the Star Tree dimensions
and metrics.
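
Step 2 above can be sketched as follows. This is only an illustration under assumptions: the `StarTreeDoc` and `StarTreeDocSorter` classes are hypothetical, dimensions are assumed to be already encoded as `long` ordinals, and a single SUM metric plus a document count stand in for the configured metrics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory Star Tree document: dimension ordinals plus one SUM metric.
final class StarTreeDoc {
  final long[] dimensions;
  long sum;   // running SUM of the metric
  long count; // number of source documents folded into this one

  StarTreeDoc(long[] dimensions, long metricValue) {
    this.dimensions = dimensions;
    this.sum = metricValue;
    this.count = 1;
  }
}

final class StarTreeDocSorter {
  // Sort by dimensions (lexicographically), then merge neighbours whose
  // dimensions are identical by adding up their metrics.
  static List<StarTreeDoc> sortAndAggregate(List<StarTreeDoc> docs) {
    List<StarTreeDoc> sorted = new ArrayList<>(docs);
    sorted.sort((a, b) -> Arrays.compare(a.dimensions, b.dimensions));

    List<StarTreeDoc> merged = new ArrayList<>();
    for (StarTreeDoc doc : sorted) {
      StarTreeDoc last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      if (last != null && Arrays.equals(last.dimensions, doc.dimensions)) {
        last.sum += doc.sum;
        last.count += doc.count;
      } else {
        // Copy so the caller's documents are left untouched.
        StarTreeDoc copy = new StarTreeDoc(doc.dimensions.clone(), doc.sum);
        copy.count = doc.count;
        merged.add(copy);
      }
    }
    return merged;
  }
}
```

Because equal dimension tuples become adjacent after sorting, a single linear pass is enough to aggregate them.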

![star-lucene](https://github.com/apache/lucene/assets/58062316/13c80059-26f5-4c0f-b9e2-7dee8315c277)

### Challenges
The main challenge is that a ‘StarTree’ index spans multiple fields, unlike
other formats in Lucene / OpenSearch, so the PerField extension defined in
Lucene today cannot be used for it. We explored using ‘BinaryDocValues’ to
encode dimensions and metrics, but dimensions and metrics have different
types (dimensions can be numeric, text, or a combination of both), and we
could not find a way to extend it.

## Create Star Tree index
### Approach 1 - Create a new format to build materialized views
We can create a new dedicated file format (similar to the points format and
the postings format) for materialized views. It accepts a list of dimensions
and metrics, and its default implementation could be the Star Tree index.

Pros

* This can be the standard format for supporting materialized views in
Lucene which developers can use or extend to create custom solutions.
* With this format, users can create materialized views directly at index
time without storing the original documents in DocValues indices.

Cons

* A new format adds maintenance overhead.

### Approach 2 - Extend DocValues format

#### Indexing - Extend DocValues to support materialized views
We can extend the DocValues format to support a new field type, ‘AGGREGATED’,
which holds the list of dimensions and metrics configured by the user during
index creation.
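```
import java.util.List;

// Proposed field type holding the user-configured dimensions and metrics.
record AggregatedField(List<String> dimensionFields, List<MetricConfig> metricFields) {}

// Pairs a source field with the aggregation function applied to it.
record MetricConfig(String fieldName, MetricFunction function) {}

enum MetricFunction {
  SUM,
  AVG,
  COUNT
  // ....
}
```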
During flush / merge, the values of the dimensions and metrics will be read
from the associated ‘DocValues’ fields using DocValuesProducer, and we will
create the Star Tree indices following the steps described above.

#### Search flow

We can extend ‘LeafReader’ and ‘DocValuesProducer’ with a new method
‘getAggregatedDocValues’ to get the Star Tree index during query time. This
retrieves the root of the Star Tree and the dimensions and metrics DocValues
fields.
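
To illustrate why starting from the root of the Star Tree bounds query cost, here is a simplified traversal sketch. Everything in it is an assumption made for illustration: the `StarTreeNode` and `StarTreeLookup` classes, the `STAR` marker value, and the idea that each level of the tree corresponds to one dimension with a pre-aggregated SUM stored at the last level. The actual node layout in the proposal may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Star Tree node: one tree level per dimension.
final class StarTreeNode {
  static final long STAR = -1L; // assumed marker for the "all values" child

  final Map<Long, StarTreeNode> children = new HashMap<>();
  long aggregatedSum; // pre-aggregated SUM metric, set on last-level nodes here

  // Returns the child for a dimension value, creating it if needed.
  StarTreeNode child(long dimensionValue) {
    return children.computeIfAbsent(dimensionValue, v -> new StarTreeNode());
  }
}

final class StarTreeLookup {
  // filters[i] is the required value for dimension i, or null when the query
  // does not filter on that dimension.
  static long sum(StarTreeNode root, Long[] filters) {
    StarTreeNode node = root;
    for (Long filter : filters) {
      // Unfiltered dimensions follow the star child, which already holds the
      // aggregate across all values of that dimension.
      long key = (filter == null) ? StarTreeNode.STAR : filter;
      node = node.children.get(key);
      if (node == null) {
        return 0L; // no documents match this combination
      }
    }
    return node.aggregatedSum;
  }
}
```

In this sketch a lookup visits one node per dimension, so the cost depends on the number of configured dimensions rather than on the number of matching documents.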

Pros

* If the above extensions are in place, any custom codec implementation of
‘DocValues’ can provide its own implementation for materialized views that
creates the relevant indices during flush / merge.
* Less maintenance overhead.

Cons

* Tight coupling between DocValuesFormat and materialized views.
* The original documents are needed to create the derived aggregated Star
Tree documents

### Open questions

Are there any suggestions on how to pack the values of ‘dimensions’ and
‘metrics’ into the ‘AggregatedField’ during indexing, as part of the
‘addDocument’ flow? Also, should we explore this, or can we simply create the
derived ‘AggregatedField’ during flush / merge?

## Create Star Tree DocValues fields
The Star Tree index is backed by Star Tree DocValues fields, so we can reuse
the existing ‘DocValuesFormat’ to read and write them. Each field is stored as
a ‘Numeric’ DocValues field, or as a ‘SortedSet’ DocValues field in the case
of text fields.

To accommodate this, we propose to make DocValuesFormat accept the codec names
and file extensions as parameters, so that we can create the StarTree
DocValues fields with custom extensions.

```
// Proposed overloads: codec names and file extensions are passed in by the caller.
@Override
public DocValuesConsumer fieldsConsumer(
    SegmentWriteState state,
    String dataCodec, String dataExtension,
    String metaCodec, String metaExtension) throws IOException {
  return new Lucene90DocValuesConsumer(
      state, dataCodec, dataExtension, metaCodec, metaExtension);
}

// Read side: producers are opened from a SegmentReadState.
@Override
public DocValuesProducer fieldsProducer(
    SegmentReadState state,
    String dataCodec, String dataExtension,
    String metaCodec, String metaExtension) throws IOException {
  return new Lucene90DocValuesProducer(
      state, dataCodec, dataExtension, metaCodec, metaExtension);
}
```

