[jira] [Comment Edited] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible

Bruno Roustant (Jira) Fri, 20 Dec 2019 06:35:13 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000917#comment-17000917
 ]


Bruno Roustant edited comment on LUCENE-8836 at 12/20/19 2:34 PM:
------------------------------------------------------------------

I didn't have the opportunity to benchmark this in a real production setting. 
In some early benchmarks we saw a gain, but at the same time we tried another 
approach using a cache on top of it, and the latter worked well for our use 
case, so I didn't pursue this one.

That said, I know [~juan.duran] is currently trying another approach that uses 
an FST-based dict to look up terms, and it will be faster (but will probably 
use a little more memory). We'll update this issue when we have more data.

[~sokolov] are you looking for a solution to make term lookup faster for 
docvalues?

 


was (Author: broustant):
I didn't have the opportunity to benchmark this in a real production setting. 
In some early benchmarks we saw a gain, but at the same time we tried another 
approach using a cache on top of it, and the latter worked well for our use case.

That said, I know [~juan.duran] is currently trying another approach that uses 
an FST-based dict to look up terms, and it will be faster (but will probably 
use a little more memory). We'll update this issue when we have more data.

[~sokolov] are you looking for a solution to make term lookup faster for 
docvalues?

 

> Optimize DocValues TermsDict to continue scanning from the last position when 
> possible
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8836
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Bruno Roustant
>            Priority: Major
>              Labels: docValues, optimization
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Lucene80DocValuesProducer.TermsDict is used to look up either a term or a 
> term ordinal.
> Currently it lacks an optimization that FSTEnum has: the ability to 
> continue a sequential scan from where the last lookup ended in the IndexInput. 
> For sparse lookups (when searching for only a few terms or ordinals) this is 
> not an issue. But for multiple lookups in a row, this optimization could avoid 
> re-scanning all the terms from the block start (since they are delta-encoded).
> This patch proposes the optimization.
> To estimate the gain, we ran 3 Lucene tests while counting the seeks and the 
> term reads in the IndexInput, with and without the optimization:
> TestLucene70DocValuesFormat - the optimization saves 24% of seeks and 15% of 
> term reads.
> TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
> TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% of seeks 
> and 82% of term reads.
> In some cases, when scanning many terms in lexicographical order, the 
> optimization saves a lot. In other cases, when only looking up a few sparse 
> terms, the optimization brings no improvement, but does not penalize 
> either. It seems worthwhile to always have it.
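
To make the idea in the description above concrete, here is a minimal sketch 
of resuming a scan inside a delta-encoded block. It is a toy model, not the 
patch and not Lucene's on-disk format: the class DeltaBlockDemo, its Entry and 
Reader types, and the termReads counter are all hypothetical names invented 
for illustration, and the block is a plain in-memory list rather than an 
IndexInput.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a prefix-compressed (delta-encoded) terms block. Not Lucene code. */
final class DeltaBlockDemo {

    /** One encoded entry: the number of leading chars shared with the previous term, plus the new suffix. */
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes sorted terms so that each entry only stores its difference from the previous term. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> block = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int max = Math.min(previous.length(), term.length());
            while (shared < max && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            block.add(new Entry(shared, term.substring(shared)));
            previous = term;
        }
        return block;
    }

    /** Hypothetical reader that looks up a term by ordinal and counts how many entries it decodes. */
    static final class Reader {
        private final List<Entry> block;
        private final boolean resumeFromLastPosition;
        private int currentOrd = -1;     // ordinal of the last decoded term, -1 before any read
        private String currentTerm = "";
        long termReads = 0;              // number of entries decoded so far

        Reader(List<Entry> block, boolean resumeFromLastPosition) {
            this.block = block;
            this.resumeFromLastPosition = resumeFromLastPosition;
        }

        String termAt(int ord) {
            // Without the optimization, or when the target is behind the current
            // position, the scan has to restart from the block start.
            if (!resumeFromLastPosition || ord < currentOrd) {
                currentOrd = -1;
                currentTerm = "";
            }
            // Each term can only be rebuilt from the one before it, so decode forward.
            while (currentOrd < ord) {
                Entry e = block.get(++currentOrd);
                currentTerm = currentTerm.substring(0, e.prefixLength) + e.suffix;
                termReads++;
            }
            return currentTerm;
        }
    }
}
```

In this toy model, looking up ordinals 0 to 4 in order over a five-term block 
decodes 15 entries when every lookup restarts from the block start, and 5 
entries when the reader resumes from its last position. A lookup that goes 
backwards still restarts from the block start, which matches the observation 
above that sparse lookups gain little from the optimization.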



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

