gashutos opened a new issue, #12448: URL: https://github.com/apache/lucene/issues/12448
### Description

## Problem statement

Currently in `TopFieldCollector`, we use a PriorityQueue (binary min-heap implementation) to find the top `K` elements in `asc` or `desc` order. Whenever a newly iterated document is competitive, we insert it into the PriorityQueue, and if the queue is full, we remove the `min` competitive document and replace it with the new one. The time complexity for this is `O(n log(K))` in the worst case, where `n = current number of competitive documents`. Now, what if `n` is very high and every document we insert into the PriorityQueue takes exactly `O(log(K))`? In that case we end up spending a lot of time inside the heapification process itself. We have skipping logic to reduce the iteration cost `n` whenever we update the PriorityQueue, but that does not work well in some scenarios.

### For example

For time-series logs/metrics data, sort queries are performed on a `timestamp` field. This field is ever-increasing (nearly sorted) and has very high cardinality as well! `desc` order queries do not benefit much from the skipping logic here, and the PriorityQueue will pay `O(log(K))` for almost every insertion because incoming documents arrive in increasing order. Imagine the scenario below, where we insert logical timestamps in increasing order `1, 2, 3, 4, ....`: inserting 16 is a worst case that sifts 16 all the way down to a leaf node, and the same is true for `17, 18, 19, .....`. (Imagine we need the top `15` hits, so the PQ size is 15.)

<img width="1182" alt="image" src="https://github.com/apache/lucene/assets/104654647/dcdd1e36-ebc7-46f3-a847-fdec243b0c26">

In the scenario above, the skipping logic does not help find the top `K` in `desc` order either. But we can do better on the PriorityQueue side: since we know incoming documents arrive in `asc` order (nearly sorted), we can skip heapifying them.
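To make the worst case concrete, here is a minimal stand-alone sketch (using `java.util.PriorityQueue`, not Lucene's internal `PriorityQueue` class) of bounded top-`K` collection with a min-heap. With strictly ascending input, every new value beats the current minimum, so every insertion evicts the min and pays the full `O(log K)` sift cost:

```java
import java.util.PriorityQueue;

// Illustrative worst case: collecting the top K largest values from an
// ascending stream with a bounded min-heap. Every value after the first K
// is competitive, so every step does a poll + add, each O(log K).
public class TopKDemo {
    public static void main(String[] args) {
        final int K = 15;
        PriorityQueue<Long> minHeap = new PriorityQueue<>(K); // natural (min) order
        for (long timestamp = 1; timestamp <= 1_000_000; timestamp++) {
            if (minHeap.size() < K) {
                minHeap.add(timestamp);
            } else if (timestamp > minHeap.peek()) {
                minHeap.poll();          // evict current min: O(log K)
                minHeap.add(timestamp);  // sift new value: O(log K)
            }
        }
        // For ascending input the top K are simply the last K values,
        // yet the heap did ~2 * (n - K) sift operations to discover that.
        System.out.println(minHeap.peek()); // prints 999986
    }
}
```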
## Solution proposed

The crux of the proposed solution is: **we do not need to heapify elements if they arrive in the desired sort order.** The idea is to keep incoming competitive documents in a temporary `circular array` before inserting them into the PriorityQueue and triggering heapification.

### Algorithm

1. Create a circular array of the same size as the priority queue. We should be able to add an element at the `last` position and remove an element from the `first` position in this circular array implementation.
2. If an incoming document qualifies after `TopFieldCollector.checkThreshold()`, keep inserting documents into the circular array until the order breaks, i.e. the next document is smaller during a descending-order traversal (or vice versa for ascending order). Remove the first element from the circular array if it is full while inserting.
3. Once the order breaks, insert the entire circular array into the priority queue and empty the circular array.

With this, millions of hits can skip the heapification process entirely.

POC : [POC code](https://github.com/gashutos/lucene/commit/1822f6a91b8fa11aebe80fe04b124f2c48ae2e6f)

_The POC above is implemented with a LinkedList instead of a circular array. It also needs some additional work to handle cases like duplicate elements where the docid is given preference (but those should be minor changes)._

### Caveat of this implementation

The proposed solution works very well for ever-increasing or ever-decreasing ordered fields, such as time-series data. But it could add overhead if the data is completely random, since each element incurs an extra comparison to enter the circular array before being inserted into the actual heap. There are two ways to implement this:

1. Enable this implementation by default for all BKD-indexed field types.
2. Enable this implementation behind a `boolean` flag that can be set `true/false` at the collector level, similar to what we have for the `PointBasedOptimization` flag.
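The algorithm above can be sketched as follows. This is a simplified, hypothetical illustration (class and method names are invented, and an `ArrayDeque` stands in for the circular array; it is not the POC code or Lucene's API): ascending values are staged in a bounded deque, and only when the incoming order breaks is the staged run flushed into the min-heap. Dropping the deque's first element when it is full is safe because the deque already holds `K` values greater than or equal to it, so it can never be in the top `K`.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical sketch of the buffering idea for top-K-largest collection.
public class BufferedTopK {
    private final int k;
    private final PriorityQueue<Long> heap;  // min-heap holding top-K candidates
    private final ArrayDeque<Long> buffer;   // stand-in for the circular array

    public BufferedTopK(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>(k);
        this.buffer = new ArrayDeque<>(k);
    }

    public void collect(long value) {
        Long last = buffer.peekLast();
        if (last != null && value < last) {
            flush(); // order broke: heapify the staged run now
        }
        if (buffer.size() == k) {
            // Buffer already holds k values >= its first element,
            // so the first element can never be in the top K.
            buffer.pollFirst();
        }
        buffer.addLast(value); // in-order values skip heapification entirely
    }

    private void flush() {
        heap.addAll(buffer);
        buffer.clear();
        while (heap.size() > k) {
            heap.poll(); // trim back down to the top K
        }
    }

    public long[] topK() {
        flush();
        return heap.stream().mapToLong(Long::longValue).sorted().toArray();
    }

    public static void main(String[] args) {
        BufferedTopK topk = new BufferedTopK(3);
        long[] input = {1, 2, 3, 4, 10, 7, 8, 9, 5, 6};
        for (long v : input) {
            topk.collect(v);
        }
        System.out.println(Arrays.toString(topk.topK())); // prints [8, 9, 10]
    }
}
```

For fully sorted input the heap is touched only on the final flush; for random input every order break triggers a flush, which is the extra-comparison overhead described in the caveat above.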
Approach 2 sounds very safe, but it adds configuration overhead for users. Approach 1 can introduce overhead, but with random data the BKD-points-based skipping logic works very well, so insertions into the priority queue are infrequent and we do not see much overhead. I tested this with `13` million documents in random order on a numeric (long) field, and with approach 1 we did not see any latency regression, since the skipping logic had already skipped many hits. We do not have a benchmark for numeric sort in Lucene itself, but I indexed 13 million LongPoint documents in random order and saw similar sort latency: `140 ms` without versus `147 ms` with the additional circular queue, on `r6g.2xlarge` EC2 instances.

## Suggestions from community

I would like to hear suggestions and thoughts from the community on the change proposed above and better ways to implement it. The time-series `desc` order use case in particular is very widely used, so this could cover a large slice of our users and give them more optimized sort queries on such workloads.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org