[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

Guoqiang Jiang (Jira) Mon, 16 Sep 2019 03:16:06 -0700


     [ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Guoqiang Jiang updated LUCENE-8980:
-----------------------------------
    Description: 
*Description*

In Elasticsearch, each document has an _id field that uniquely identifies it, 
which is indexed so that documents can be looked up from Lucene. When users 
write to Elasticsearch with self-generated _id values, ES has to check _id 
uniqueness through the Lucene API for each document, even if the conflict rate 
is very low, which results in poor write performance. 

 

*Solution*

1. Choose a better _id generator

Different _id formats have a great impact on write performance. We have 
verified this in a production cluster. Users can refer to the following blog 
post to choose a better _id generator.

[http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html]

2. Optimise with min/maxTerm metrics in Lucene

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API.
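
To illustrate the idea in item 2, here is a minimal sketch in plain Java. It 
is not the actual patch and does not use Lucene classes: SegmentSketch, 
IdLookupSketch and the dictionaryLookups counter are hypothetical names made 
up for this example. It assumes the lookup can return early when the target 
_id falls outside a segment's [minTerm, maxTerm] range, so the term dictionary 
is only consulted for segments that could contain the term.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical stand-in for the terms of one field in one segment; not a Lucene class. */
final class SegmentSketch {
    private final TreeSet<String> terms; // sorted terms, standing in for the term dictionary
    final String minTerm;                // smallest term in this segment
    final String maxTerm;                // largest term in this segment
    int dictionaryLookups = 0;           // counts how often the "expensive" lookup ran

    /** Builds a segment from a non-empty list of ids. */
    SegmentSketch(List<String> ids) {
        this.terms = new TreeSet<>(ids);
        this.minTerm = terms.first();
        this.maxTerm = terms.last();
    }

    /**
     * Returns true if the segment contains the target term.
     * Note: String.compareTo orders by UTF-16 code unit, whereas Lucene orders
     * terms by unsigned bytes; the two agree for the ASCII ids used here.
     */
    boolean seekExact(String target) {
        // Cheap range check first: a term outside [minTerm, maxTerm] cannot exist here.
        if (target.compareTo(minTerm) < 0 || target.compareTo(maxTerm) > 0) {
            return false;
        }
        dictionaryLookups++;
        return terms.contains(target);
    }
}

/** Hypothetical uniqueness check across all segments of an index. */
final class IdLookupSketch {
    static boolean exists(List<SegmentSketch> segments, String id) {
        for (SegmentSketch segment : segments) {
            if (segment.seekExact(id)) {
                return true;
            }
        }
        return false;
    }
}
```

With time-ordered ids, a new _id is usually larger than the maxTerm of every 
older segment, so in this sketch most segments are rejected by the range check 
alone.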

 

*Tests*

I ran a write benchmark using _id values in UUID V1 format. The results are 
as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|

As shown above, after 8 hours of continuous writing, write performance improves 
by 18.0%, CPU overhead decreases by 7.7%, and overall performance improves by 
25.7%. The Elasticsearch GET API and ids query would get similar performance 
improvements.
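
For readers who want to re-derive the percentages, the sketch below recomputes 
them from the raw table values. The class and method names (BenchmarkDelta, 
percentChange, round1, overall) are hypothetical. The "overall improvement" 
formula is an inference from the numbers: the reported figure matches the 
write-speed gain plus the relative CPU reduction, but the message does not 
state how it was computed.

```java
/** Hypothetical helper for re-deriving the benchmark percentages. */
final class BenchmarkDelta {
    /** Relative change of {@code after} versus {@code before}, in percent. */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    /** Rounds to one decimal place, as the table does. */
    static double round1(double value) {
        return Math.round(value * 10.0) / 10.0;
    }

    /**
     * Assumed definition of "overall improvement": rounded write-speed gain
     * minus rounded CPU change (a CPU decrease therefore adds to the total).
     */
    static double overall(double speedBefore, double speedAfter,
                          double cpuBefore, double cpuAfter) {
        double speedGain = round1(percentChange(speedBefore, speedAfter));
        double cpuChange = round1(percentChange(cpuBefore, cpuAfter));
        return round1(speedGain - cpuChange);
    }
}
```

Under that assumption, overall(29.9, 34.5, 68.4, 63.8) gives 22.1 for the 4h 
column and overall(26.7, 31.5, 66.6, 61.5) gives 25.7 for the 8h column, which 
agrees with the table.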

Note that the benchmark needs to run continuously for several hours, because 
the performance improvement is not obvious when the data is completely cached 
or the number of segments is too small.


> Optimise SegmentTermsEnum.seekExact performance
> -----------------------------------------------
>
>                 Key: LUCENE-8980
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8980
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.2
>            Reporter: Guoqiang Jiang
>            Priority: Major
>              Labels: performance
>             Fix For: master (9.0)
>
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
