[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Jiang updated LUCENE-8980:
-----------------------------------
    Description: 
*Description*

In Elasticsearch, which is based on Lucene, each document has an indexed _id 
field that uniquely identifies it. When Elasticsearch uses the _id field to 
look up a document, Lucene has to check every segment of the index. When the 
_id values are largely sequential, each segment tends to cover a narrow range 
of terms, so most of these per-segment checks can be avoided.
 

*Solution*

Lucene stores minTerm/maxTerm metrics for each segment and field, and we can 
use them to speed up the Lucene lookup API. When SegmentTermsEnum.seekExact() 
is called to look up a term, we can first check whether the term falls within 
the range [minTerm, maxTerm] and, if it does not, skip that segment 
immediately.
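
The range check described above can be sketched in plain Java. This is only an 
illustration under stated assumptions, not the code from the PR: the class 
MinMaxTermFilter, the SegmentStats holder and the sample _id values are 
hypothetical, and the sketch does not touch SegmentTermsEnum itself. In Lucene 
the per-field bounds are available through Terms.getMin() and Terms.getMax(), 
and terms are ordered as unsigned bytes, which Arrays.compareUnsigned mirrors 
here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MinMaxTermFilter {

    /** Hypothetical stand-in for one segment's min/max term of the _id field. */
    static final class SegmentStats {
        final String name;
        final byte[] minTerm;
        final byte[] maxTerm;

        SegmentStats(String name, String minTerm, String maxTerm) {
            this.name = name;
            this.minTerm = minTerm.getBytes(StandardCharsets.UTF_8);
            this.maxTerm = maxTerm.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Returns false only when the term is provably outside [minTerm, maxTerm]. */
    static boolean mightContain(SegmentStats seg, byte[] term) {
        // Unsigned lexicographic byte comparison, matching Lucene's term order.
        return Arrays.compareUnsigned(term, seg.minTerm) >= 0
            && Arrays.compareUnsigned(term, seg.maxTerm) <= 0;
    }

    /** Names of the segments that still need a real term dictionary lookup. */
    static List<String> candidateSegments(List<SegmentStats> segments, String id) {
        byte[] term = id.getBytes(StandardCharsets.UTF_8);
        List<String> candidates = new ArrayList<>();
        for (SegmentStats seg : segments) {
            if (mightContain(seg, term)) {
                candidates.add(seg.name);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Sequential _id values give each segment a nearly disjoint term range.
        List<SegmentStats> segments = List.of(
            new SegmentStats("_0", "doc-000001", "doc-001000"),
            new SegmentStats("_1", "doc-001001", "doc-002000"),
            new SegmentStats("_2", "doc-002001", "doc-003000"));
        System.out.println(candidateSegments(segments, "doc-001500")); // prints [_1]
    }
}
```

With sequential _id values only one of the three sample segments survives the 
check. With random _id values the ranges would overlap heavily and the check 
would rarely exclude a segment, so the benefit depends on how the ids are 
generated.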
 
This PR benefits the ES read/write APIs and the Lucene lookup API.



  was:
*Description*
In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it. The _id field is indexed so that each document can be 
looked up from Lucene. When users write documents with sequential _id values, 
Elasticsearch has to check _id uniqueness through the Lucene API for each 
document, which results in poor write performance. 
 

*Solution*

Lucene stores minTerm/maxTerm metrics for each segment and field, and we can 
use them to speed up the Lucene lookup API. When SegmentTermsEnum.seekExact() 
is called to look up a term in one segment, we can first check whether the 
term falls within the range [minTerm, maxTerm] and, if it does not, skip that 
segment immediately.
 




> Optimise SegmentTermsEnum.seekExact performance
> -----------------------------------------------
>
>                 Key: LUCENE-8980
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8980
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.2
>            Reporter: Guoqiang Jiang
>            Assignee: David Wayne Smiley
>            Priority: Major
>              Labels: performance
>             Fix For: master (9.0)
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an indexed _id 
> field that uniquely identifies it. When Elasticsearch uses the _id field to 
> look up a document, Lucene has to check every segment of the index. When the 
> _id values are largely sequential, each segment tends to cover a narrow 
> range of terms, so most of these per-segment checks can be avoided.
>  
> *Solution*
> Lucene stores minTerm/maxTerm metrics for each segment and field, and we can 
> use them to speed up the Lucene lookup API. When 
> SegmentTermsEnum.seekExact() is called to look up a term, we can first check 
> whether the term falls within the range [minTerm, maxTerm] and, if it does 
> not, skip that segment immediately.
>  
> This PR benefits the ES read/write APIs and the Lucene lookup API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

