[jira] [Commented] (LUCENE-10425) count aggregation optimization inside one segment in log scenario

Adrien Grand (Jira) Sat, 05 Mar 2022 10:32:04 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501790#comment-17501790
 ]


Adrien Grand commented on LUCENE-10425:
---------------------------------------

This would require a new API on PostingsEnum, so I'll write down what I think 
the applications of this change are, to make sure I understand the benefits 
correctly. If the index is sorted by a numeric field, Lucene could efficiently 
compute range facets (and special forms of range facets like histograms) for 
this numeric field, as long as there are no deletions and the query has a 
single term. For instance, an index containing logs and sorted by timestamp 
could very efficiently compute a histogram of the timestamp field given any 
term query. To use an example from a different use-case, an index of an 
e-commerce catalog sorted by price could compute a histogram of prices very 
efficiently for any term query.
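
To make the idea concrete, here is a minimal sketch in plain Python rather 
than Lucene code. It assumes the segment's timestamps are available as a list 
in docID order (non-decreasing because of the index sort), that the term's 
postings are a sorted list of docIDs, and that there are no deletions. The 
function name and the use of binary search to stand in for the proposed 
PostingsEnum API are illustrative assumptions, not an existing Lucene API.

```python
from bisect import bisect_left


def histogram_for_term(timestamps, postings, bucket_edges):
    """Count docs matching a term in each [edge[i], edge[i+1]) bucket.

    timestamps:   per-doc values in docID order, non-decreasing (index sort).
    postings:     sorted docIDs matching the term (assumes no deletions).
    bucket_edges: increasing bucket boundaries.
    """
    # First docID whose timestamp is >= each bucket edge.
    doc_bounds = [bisect_left(timestamps, edge) for edge in bucket_edges]
    # Position of each boundary docID within the postings list. This is the
    # information the proposed (hypothetical) PostingsEnum API would expose.
    ranks = [bisect_left(postings, doc) for doc in doc_bounds]
    # The number of matching docs in a bucket is the difference of positions.
    return [ranks[i + 1] - ranks[i] for i in range(len(ranks) - 1)]
```

The cost depends on the number of buckets and the cost of each lookup, not on 
the number of matching documents, which is where the benefit over visiting 
every hit would come from.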

This feels quite powerful. The main thing that annoys me a bit is that it only 
works on the primary sort field, so we'd be adding an API to PostingsEnum for 
something that requires a very careful setup of the index, as there can be only 
a single primary sort field. I wonder if LUCENE-10396 could help make this 
optimization more often applicable, e.g. to logs indices sorted by host then 
timestamp, or to e-commerce indices sorted by category then price. Having this 
optimization more generally applicable would make me feel better about 
increasing the surface area of PostingsEnum. At first sight, it feels like this 
should work? Maybe this use-case would also help figure out what the API should 
be on LUCENE-10396.

> count aggregation optimization inside one segment in log scenario
> -----------------------------------------------------------------
>
>                 Key: LUCENE-10425
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10425
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>            Reporter: jianping weng
>            Priority: Major
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In the log scenario, we usually want to know the doc count for each time 
> interval. One possible optimization is to sort the documents within one 
> segment in ascending order of the @timestamp field. Then we can use 
> this PR [https://github.com/apache/lucene/pull/687] to find the min/max 
> docId in one time interval.
> If there is no other filter query, the doc count of one time interval is (max 
> docId - min docId + 1).
> If there is exactly one additional term filter query, we can use this PR 
> [https://github.com/apache/lucene/pull/688] to get the difference in index 
> between the calls advance(minId) and advance(maxId); this difference is also 
> the doc count of the time interval.
>  
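
The no-filter case from the quoted description can be sketched as follows. 
This is plain Python, not the code from the linked PRs; the function name and 
the half-open interval [start, end) are assumptions made for the example, and 
the timestamps are assumed to be listed in docID order within one segment.

```python
from bisect import bisect_left


def count_in_interval(timestamps, start, end):
    """Count docs with start <= timestamp < end in a timestamp-sorted segment."""
    # Smallest docID whose timestamp is >= start.
    min_doc = bisect_left(timestamps, start)
    # Largest docID whose timestamp is < end.
    max_doc = bisect_left(timestamps, end) - 1
    if max_doc < min_doc:
        return 0  # no document falls inside the interval
    return max_doc - min_doc + 1
```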



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
