[ https://issues.apache.org/jira/browse/LUCENE-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000917#comment-17000917 ]
Bruno Roustant edited comment on LUCENE-8836 at 12/20/19 2:34 PM:
------------------------------------------------------------------

I didn't have the opportunity to benchmark with real production data. With some early benchmarks we saw a gain, but at the same time we tried another approach using a cache on top, and the latter worked well for our use case, so I didn't pursue this one.

That said, I know [~juan.duran] is currently trying another approach that uses an FST-based dictionary to look up terms, and it will be faster (but will probably use a little more memory). We'll update this issue when we have more data.

[~sokolov] are you looking for a solution to make term lookup faster for docvalues?

was (Author: broustant):
I didn't have the opportunity to benchmark with real production data. With some early benchmarks we saw a gain, but at the same time we tried another approach using a cache on top, and the latter worked well for our use case.

That said, I know [~juan.duran] is currently trying another approach that uses an FST-based dictionary to look up terms, and it will be faster (but will probably use a little more memory). We'll update this issue when we have more data.

[~sokolov] are you looking for a solution to make term lookup faster for docvalues?

> Optimize DocValues TermsDict to continue scanning from the last position when possible
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8836
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Bruno Roustant
>            Priority: Major
>              Labels: docValues, optimization
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Lucene80DocValuesProducer.TermsDict is used to look up either a term or a term ordinal.
> Currently it does not have the optimization that FSTEnum has: the ability to continue a sequential scan from where the last lookup was in the IndexInput.
> For sparse lookups (when searching only a few terms or ordinals) this is not an issue. But for multiple lookups in a row, this optimization could save re-scanning all the terms from the block start (since they are delta encoded).
> This patch proposes the optimization.
> To estimate the gain, we ran 3 Lucene tests while counting the seeks and the term reads in the IndexInput, with and without the optimization:
> TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term reads.
> TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
> TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 82% term reads.
> In some cases, when scanning many terms in lexicographical order, the optimization saves a lot. In other cases, when only looking up a few sparse terms, the optimization brings no improvement but does not penalize either. It seems worthwhile to always have it.
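
To illustrate the idea described in the issue, here is a minimal sketch in Java. It is a toy model and not the code of Lucene80DocValuesProducer.TermsDict or of the patch: the class name ResumableTermsDict, the in-memory arrays standing in for the IndexInput, the lookupOrd method and the termReads counter are all assumptions made for illustration. Terms are grouped in blocks; the first term of a block is stored in full and each following term stores only the length of the prefix shared with the previous term plus its suffix. A lookup either restarts decoding at the block start, or, when the target is in the same block at or after the last decoded position, continues from there.

```java
import java.util.List;

/** Toy prefix-compressed terms dictionary (hypothetical, for illustration only). */
final class ResumableTermsDict {
    private final int[] prefixLengths; // length of prefix shared with the previous term
    private final String[] suffixes;   // remainder of the term after the shared prefix
    private final int blockSize;       // first term of each block is stored in full
    private final boolean resume;      // continue from the last position when possible

    private int currentOrd = -1;       // ordinal of the last decoded term, -1 if none
    private String currentTerm = "";   // last decoded term
    long termReads = 0;                // number of entries decoded so far

    ResumableTermsDict(List<String> sortedTerms, int blockSize, boolean resume) {
        this.blockSize = blockSize;
        this.resume = resume;
        int n = sortedTerms.size();
        prefixLengths = new int[n];
        suffixes = new String[n];
        for (int i = 0; i < n; i++) {
            String term = sortedTerms.get(i);
            // Block starts share nothing with the previous term, so decoding can restart there.
            int shared = (i % blockSize == 0) ? 0 : commonPrefix(sortedTerms.get(i - 1), term);
            prefixLengths[i] = shared;
            suffixes[i] = term.substring(shared);
        }
    }

    private static int commonPrefix(String a, String b) {
        int max = Math.min(a.length(), b.length());
        int i = 0;
        while (i < max && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Returns the term with the given ordinal, decoding as few entries as possible. */
    String lookupOrd(int ord) {
        int blockStart = (ord / blockSize) * blockSize;
        int next;
        if (resume && currentOrd >= blockStart && currentOrd <= ord) {
            next = currentOrd + 1; // same block, target at or after the last position
        } else {
            next = blockStart;     // otherwise re-scan from the start of the block
        }
        for (int i = next; i <= ord; i++) {
            currentTerm = currentTerm.substring(0, prefixLengths[i]) + suffixes[i];
            termReads++;
        }
        currentOrd = ord;
        return currentTerm;
    }
}
```

With 8 sorted terms and a block size of 4, looking up ordinals 0 to 7 in order decodes 20 entries when every lookup restarts at the block start (1+2+3+4 per block) and 8 entries when the scan resumes from the last position. A backward or cross-block lookup falls back to the block start, which matches the observation above that sparse lookups see little or no change.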