[ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445032#comment-17445032
 ] 

Feng Guo edited comment on LUCENE-10233 at 11/17/21, 12:37 PM:
---------------------------------------------------------------

[~jpountz] Thanks for the guidance! Actually, no case can reach the 
'add extra words' logic so far, since the old usages all have docBase = 0 and 
the new usage (this issue) uses the expert method and does the 'or' with an 
offset. So what you are worried about is the potential risk of using too much 
memory, right? What if we directly throw an AssertionError when users want to 
extract the bitset but docBase != 0? That may remove the potential risk.

Anyway, I agree with you that a SparseFixedBitSet seems more suitable for the 
existing framework and less intrusive, so I implemented it in the newest 
commit. This approach looks good to me too.
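
To make the "'or' with an offset" idea above concrete, here is a minimal sketch in plain Java. It is not the code from the patch and does not use Lucene's bitset classes: the class OffsetOrSketch, its methods, and the long[] word layout (64 bits per word, lowest bit first) are assumptions made only for illustration. Bit i of the block is treated as doc id docBase + i.

```java
/** Hypothetical helper for illustration only; not part of Lucene. */
final class OffsetOrSketch {

  /**
   * ORs every set bit i of {@code block} into bit (docBase + i) of {@code result}.
   * The caller must size {@code result} so that every shifted bit fits;
   * otherwise the JVM throws ArrayIndexOutOfBoundsException.
   */
  static void orWithOffset(long[] result, long[] block, int docBase) {
    if (docBase < 0) {
      throw new IllegalArgumentException("docBase must be >= 0");
    }
    int wordShift = docBase >>> 6; // whole 64-bit words skipped by the offset
    int bitShift = docBase & 63;   // remaining shift inside a word
    for (int i = 0; i < block.length; i++) {
      long w = block[i];
      if (w == 0) {
        continue; // nothing to merge from this word
      }
      int target = wordShift + i;
      long low = w << bitShift; // part that stays in the target word
      if (low != 0) {
        result[target] |= low;
      }
      if (bitShift != 0) {
        long high = w >>> (64 - bitShift); // part that spills into the next word
        if (high != 0) {
          result[target + 1] |= high;
        }
      }
    }
  }

  /** Returns true if bit {@code index} is set; indexes past the end read as unset. */
  static boolean get(long[] bits, int index) {
    int word = index >>> 6;
    return word < bits.length && (bits[word] & (1L << index)) != 0;
  }
}
```

With docBase = 0 the loop degenerates into a plain word-by-word 'or', which matches the case described above where all old usages pass docBase = 0.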


was (Author: gf2121):
[~jpountz] Thanks for the guide! I agree with you that a SparseFixedBitSet is 
more suitable for the existing framework and less intrusive, so I implemented 
it in the newest commit. This approach looks good to me too, so let's go ahead!

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> ------------------------------------------------------------------
>
>                 Key: LUCENE-10233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10233
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In low-cardinality point cases, id blocks will usually store doc ids that 
> have the same point value, and {{intersect}} will go into the {{addAll}} 
> logic. If we store the ids as a bitset and give the IntersectVisitor a 
> bulk-visiting ability, we can speed up addAll because we can just execute 
> the 'or' logic between the result and the block ids.
> The optimization is triggered when all of the following conditions are met:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> the storage too much)
>  # no duplicate doc ids
> I mocked a field that has 10,000,000 docs per value and searched it with a 
> 1-term PointInSetQuery; the build scorer time decreased from 71ms to 8ms.
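
The three trigger conditions in the description can be sketched as a single check. This is only an illustration under assumptions: the class BitSetBlockSketch and the method canStoreAsBitSet are hypothetical names, the doc ids are taken as a plain int[] in arbitrary order, and an empty block is treated as not eligible. How the actual patch performs this check may differ.

```java
/** Hypothetical helper for illustration only; not part of Lucene. */
final class BitSetBlockSketch {

  /**
   * Returns true when a leaf block meets all three conditions:
   * leafCardinality == 1, max(docId) - min(docId) <= 16 * pointCount,
   * and no doc id occurs twice.
   */
  static boolean canStoreAsBitSet(int[] docIds, int leafCardinality) {
    if (leafCardinality != 1 || docIds.length == 0) {
      return false;
    }
    int min = docIds[0];
    int max = docIds[0];
    for (int id : docIds) {
      min = Math.min(min, id);
      max = Math.max(max, id);
    }
    // Density bound: keeps the bitset from growing far beyond the id list.
    if ((long) max - min > 16L * docIds.length) {
      return false;
    }
    // The range is bounded at this point, so a temporary bitset is a cheap
    // way to detect duplicates without sorting.
    long[] seen = new long[((max - min) >>> 6) + 1];
    for (int id : docIds) {
      int bit = id - min;
      long mask = 1L << bit; // Java masks the shift count to the low 6 bits
      if ((seen[bit >>> 6] & mask) != 0) {
        return false; // duplicate doc id
      }
      seen[bit >>> 6] |= mask;
    }
    return true;
  }
}
```

The factor 16 comes straight from the second condition: a block of pointCount ids may span at most 16 * pointCount bits, so the bitset form stays within a bounded multiple of the id count.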



