[GitHub] [lucene] rishabhmaurya opened a new issue, #11745: Store summarized results in internal nodes of BKD for time series points

GitBox Fri, 02 Sep 2022 07:11:58 -0700


rishabhmaurya opened a new issue, #11745:
URL: https://github.com/apache/lucene/issues/11745


   ### Description
   
   Time series points have a timestamp, measurement and dimensions associated 
with them. The common queries are range queries on timestamp, metric 
aggregation on measurement and grouping on dimensions. Or similar query with 
histogram on timestamp. 
   
   **Proposal:**
   
   Prototype can be found [here](https://github.com/rishabhmaurya/lucene/pull/1)
   
   1. Introduce a new time series point as a field in lucene - `TSPoint` which 
can be added as - 
   
   ```java
   
   Document doc = new Document(); doc.add(new TSIntPoint("tsid1", "cpu", 
timestamp, measurement));
   ```
   
   `tsid1` is the time series ID of the point. It will be the unit of storage 
for time series points and for prototype each of them represents a unique field 
in lucene. 
   
   `timestamp` is the actual point in BKD on which the index is created. 
   
   Full definition can be found 
[here](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/document/TSIntPoint.java).
   
   2. Interface for decomposable aggregate function, which can be defined as 
part of index configuration - 
   
   Sum function - 
   
   ```java
   
   new BKDSummaryWriter.SummaryMergeFunction<Integer>() {
   
         @Override
         public int getSummarySize()
   
   {         return Integer.BYTES;       }
   
         @Override
         public void merge(byte[] a, byte[] b, byte[] c)
   
   {         packBytes(unpackBytes(a) + unpackBytes(b), c);       }
   
         @Override
         public Integer unpackBytes(byte[] val)
   
   {         return NumericUtils.sortableBytesToInt(val, 0);       }
   
         @Override
         public void packBytes(Integer val, byte[] res)
   
   {         NumericUtils.intToSortableBytes(val, res, 0);       }
   
       };
   
   ```
   
   3. New query per `LeafReader` to perform range queries on TSPoint and 
retrieve summarized results - 
   
   
   ```java
   
   LeafReader leafReader; PointValues points = 
leafReader.getPointValues("tsid1");
   
   TSPointQuery tsPointQuery = new TSPointQuery("tsid1", lowerBoundTimestamp, 
upperBoundTimestamp);
   
   byte[] res = tsPointQuery.getSummary((BKDWithSummaryReader.BKDSummaryTree) 
points.getPointTree(), mergeFunction);
   
   ```
   
   Instead of BKDReader and BKDWriter, we will be using 
[BKDSummaryWriter](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/util/bkd/BKDSummaryWriter.java)
 
[BKDSummaryReader](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/util/bkd/BKDWithSummaryReader.java)
 which supports writing summaries in internal nodes of the tree. 
   
   Changes in 
[IntersectVisitor](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/index/PointValues.java#L320-L337)
 interface here.
   
   #### Comparison with DocValues
   Below is the comparison of running unit test for 
[DocValue](https://github.com/rishabhmaurya/lucene/pull/1/commits/157215cd2c4787787748625e10b58001583c6e6e#diff-87c31ac3b1cd1ef45b15c84d9cf1c5bab03feb94f199256f859f14ed4747abd2R129)
 approach vs 
[TSPoint](https://github.com/rishabhmaurya/lucene/pull/1/commits/157215cd2c4787787748625e10b58001583c6e6e#diff-87c31ac3b1cd1ef45b15c84d9cf1c5bab03feb94f199256f859f14ed4747abd2R52)
 approach - 
   
   This test ingests `10000000` docs against a given TSID and performs a range 
query on timestamp 100 times against the same TSID. Merge function used is 
`sum`.
   
   | DocValues approach     |  TSPoint approach|
   | ----------------------- | ------------------ |
   |Indexing took: 42948 ms  | Indexing took: 32985       |
   |Matching docs count:1304624 \| Segments:3 \| DiskAccess: 1304624 | Matching 
docs count:8784032 \| Segments:10 \| DiskAccess: 302       |
   |Search took: 12382 ms     | Search took: 50ms |
   
   This is not apple to apple comparison since number of segments are 3 in 
DocValues approach whereas its 10 in TSPoint approach. 
   
   #### Limitation of this feature
   * Doc deletion currently not supported. We need to evaluate how important is 
it and possibly find a way to support it in future.
   * Only 
[decomposable](https://en.wikipedia.org/wiki/Aggregate_function#Decomposable_aggregate_functions)
 aggregation functions can be supported. E.g. min, max, sum, avg, count. 
   
   #### Todos
   * Implementation for multiple TSIDs. For now we need to create a new field 
with the name same as TSID for a timeseries.
   * Segment merge for BKD with summaries. Currently, the UTs disables merge 
and perform search across multiple segments and cumulate the results.
   * Pluggable merge function to merge 2 `TSPoint`. Currently its hardcoded in 
`FieldInfo.java` which isn't the right place to define them.
   * Measurement compression in BKD. I'm thinking of using delta encoding to 
store measurement values and summaries while packing the summaries associated 
with nodes of the tree. 
   * Persist first and last docID in internal nodes of BKD with summaries in an 
efficient way. This will be useful to use precomputed summaries and skip over 
batches of documents when iterating using DocIDSetIterator. Its a blocker for 
integration with OpenSearch aggregation framework.
   * Integrate with OpenSearch aggregation framework.
   * Benchmark against real timeseries dataset. 
   * * compare against SortedDocValues approach.
   * * compare against other timeseries databases.
   * Evaluate support of deletion of document/timeseries/batch of documents 
(matching a timestamp range).
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rishabhmaurya opened a new issue, #11745: Store summarized results in internal nodes of BKD for time series points

Reply via email to