date:20211114

[GitHub] [lucene] dweiss commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.

2021-11-14 Thread GitBox



dweiss commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r748818786



##
File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
##
@@ -185,6 +204,115 @@ private void shuffle(int from, int to) {
 }
   }
 
+  /** Selects the k-th entry with a bottom-k algorithm, given that k is close 
to {@code from}. */

Review comment:
   Can you give a reference for the algorithm that is actually used here? I 
admit that eyeballing it for five minutes didn't explain it to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] bruno-roustant commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.

2021-11-14 Thread GitBox



bruno-roustant commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r748823600



##
File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
##
@@ -185,6 +204,115 @@ private void shuffle(int from, int to) {
 }
   }
 
+  /** Selects the k-th entry with a bottom-k algorithm, given that k is close 
to {@code from}. */

Review comment:
   Good remark, it means I need to add more doc.
   This algorithm is my own, and the idea is actually quite simple; maybe the 
code is not clear enough.
   When k is close to from, we take an int array of size (k - from + 1) called 
bottom, where each element i points to the corresponding entry (from + i), 
and we determine the max of this bottom array. Then we loop over all the 
remaining entries: we compare each entry e to the max of bottom, and if e < 
max then e evicts max and takes its slot in bottom, and we determine the new 
max of bottom. (The speed comes from the fact that most of the time we only 
compare e to the bottom max.)
   At the end, all slots in bottom point to the (k - from + 1) least entries; 
we just have to swap them into place and finally put the bottom max at index k.
   
   I'll add more doc!





[GitHub] [lucene] bruno-roustant commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.

2021-11-14 Thread GitBox



bruno-roustant commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r748823600



##
File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
##
@@ -185,6 +204,115 @@ private void shuffle(int from, int to) {
 }
   }
 
+  /** Selects the k-th entry with a bottom-k algorithm, given that k is close 
to {@code from}. */

Review comment:
   Good remark, it means I need to add more doc.
   This algorithm is my own, and the idea is actually quite simple; maybe the 
code is not clear enough.
   When k is close to from, we take an int array of size (k - from + 1) called 
bottom, where each element i points to the corresponding entry (from + i), 
and we determine the max of this bottom array. Then we loop over all the 
remaining entries: we compare each entry e to the max of bottom, and if e < 
max then e evicts max and takes its slot in bottom, and we determine the new 
max of bottom. (The speed comes from the fact that most of the time we only 
compare e to the bottom max.)
   At the end, all slots in bottom point to the (k - from + 1) least entries; 
we just have to swap them into place and finally swap the bottom max to index k.
   
   I'll add more doc!





[GitHub] [lucene] bruno-roustant commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.

2021-11-14 Thread GitBox



bruno-roustant commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r748823600



##
File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
##
@@ -185,6 +204,115 @@ private void shuffle(int from, int to) {
 }
   }
 
+  /** Selects the k-th entry with a bottom-k algorithm, given that k is close 
to {@code from}. */

Review comment:
   Good remark, it means I need to add more doc.
   This algorithm is my own, and the idea is actually quite simple; maybe the 
code is not clear enough.
   When k is close to from, we take an int array of size (k - from + 1) called 
bottom, where each element i points to the corresponding entry (from + i), 
and we determine the max of this bottom array. Then we loop over all the 
remaining entries: we compare each entry e to the max of bottom, and if e < 
max then e evicts max and takes its slot in bottom, and we determine the new 
max of bottom. (The speed comes from the fact that most of the time we only 
compare e to the bottom max.)
   At the end, all slots in bottom point to the (k - from + 1) least entries; 
we just have to swap them into place and finally swap the bottom max to index k.
   
   I'll add more doc!
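
   The idea described above can be sketched roughly as follows. This is a 
minimal illustration on a plain int[], not the code from the pull request: 
the class name BottomKSelect, its method names, and the way the final swaps 
are done are all assumptions made for the example, whereas IntroSelector 
itself works through abstract compare/swap methods.

```java
import java.util.Arrays;

/** Hypothetical sketch of a bottom-k selection; not the IntroSelector implementation. */
final class BottomKSelect {

  /**
   * Rearranges a[from..to) so that a[k] holds the value it would have if the range were
   * sorted, with values <= a[k] before index k and values >= a[k] after it.
   */
  static void select(int[] a, int from, int to, int k) {
    if (from < 0 || to > a.length || k < from || k >= to) {
      throw new IllegalArgumentException("need 0 <= from <= k < to <= a.length");
    }
    // bottom[i] is an index into a; initially it points to entry (from + i).
    int size = k - from + 1;
    int[] bottom = new int[size];
    for (int i = 0; i < size; i++) {
      bottom[i] = from + i;
    }
    int maxSlot = slotOfMax(a, bottom);

    // Most remaining entries only cost one comparison against the bottom max.
    for (int e = k + 1; e < to; e++) {
      if (a[e] < a[bottom[maxSlot]]) {
        bottom[maxSlot] = e; // e evicts the max and takes its slot
        maxSlot = slotOfMax(a, bottom);
      }
    }

    // Move the selected entries into [from, k]: positions in that range that are no
    // longer referenced by bottom are "holes" to be filled from indices beyond k.
    boolean[] kept = new boolean[size];
    for (int idx : bottom) {
      if (idx <= k) {
        kept[idx - from] = true;
      }
    }
    int hole = 0;
    for (int idx : bottom) {
      if (idx > k) {
        while (kept[hole]) {
          hole++;
        }
        swap(a, from + hole, idx);
        hole++;
      }
    }

    // Finally put the largest of the selected entries at index k.
    int maxPos = from;
    for (int i = from + 1; i <= k; i++) {
      if (a[i] > a[maxPos]) {
        maxPos = i;
      }
    }
    swap(a, maxPos, k);
  }

  /** Returns the slot of bottom whose referenced entry is the largest. */
  private static int slotOfMax(int[] a, int[] bottom) {
    int best = 0;
    for (int i = 1; i < bottom.length; i++) {
      if (a[bottom[i]] > a[bottom[best]]) {
        best = i;
      }
    }
    return best;
  }

  private static void swap(int[] a, int i, int j) {
    int tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }

  public static void main(String[] args) {
    int[] a = {9, 3, 7, 1, 8, 2, 5};
    select(a, 0, a.length, 2);
    // a[2] is now 3, the third-smallest value of the input.
    System.out.println(a[2] + " " + Arrays.toString(a));
  }
}
```

   Recomputing the max costs O(k - from + 1) on each eviction, which is why 
the approach only pays off when k is close to from.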





[jira] [Created] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)

Feng Guo created LUCENE-10233:
-

 Summary: Store docIds as bit set to speed up addAll
 Key: LUCENE-10233
 URL: https://issues.apache.org/jira/browse/LUCENE-10233
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Feng Guo


In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the bulk visiting ability to IntersectVisitor and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result and the DocIdSetIteratorBuilder in PointRangeQuery or 
PointInSetQuery should get into the bitset case.
Concerns:
1. Not sure how much more space than before the bitset will occupy.
2. MergeReader, which needs docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)
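
To illustrate the 'or' idea, here is a minimal sketch that uses 
java.util.BitSet as a stand-in. It is not the Lucene API: the class and method 
names are hypothetical, and Lucene has its own bit set and visitor types. It 
only contrasts adding a block's doc ids one by one with merging a block that is 
already stored as a bit set.

```java
import java.util.BitSet;

/** Hypothetical sketch; java.util.BitSet stands in for Lucene's own bit set classes. */
final class BitSetAddAllSketch {

  /** Per-doc path: one set() call for every doc id in the block. */
  static void addAllOneByOne(BitSet result, int[] docIds) {
    for (int docId : docIds) {
      result.set(docId);
    }
  }

  /** Bulk path: the block is already a bit set, so merging is a word-wise OR. */
  static void addAllBulk(BitSet result, BitSet blockDocIds) {
    result.or(blockDocIds);
  }

  public static void main(String[] args) {
    int[] docIds = {3, 64, 65, 200};
    BitSet block = new BitSet();
    for (int docId : docIds) {
      block.set(docId);
    }

    BitSet perDoc = new BitSet();
    BitSet bulk = new BitSet();
    perDoc.set(7); // a doc matched earlier by another block
    bulk.set(7);

    addAllOneByOne(perDoc, docIds);
    addAllBulk(bulk, block);
    System.out.println(perDoc.equals(bulk)); // prints true
  }
}
```

The bulk path touches one 64-bit word per 64 possible doc ids, regardless of 
how many of them are set, which is where the expected speed-up for dense 
blocks would come from.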



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the IntersectVisitor bulk visiting ability and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result and the DocIdSetIteratorBuilder in PointRangeQuery or 
PointInSetQuery should get into the bitset case.
Concerns:
1. Not sure how much more space than before the bitset will occupy.
2. MergeReader, which needs docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)

> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the IntersectVisitor bulk visiting ability and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.
Concerns:
1. Not sure how much more space than before the bitset will occupy.
2. MergeReader, which needs docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)





[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the IntersectVisitor bulk visiting ability and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.
Concerns:
1. Not sure how much more disk space than before the bitset will occupy.
2. MergeReader, which needs to iterate docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)





[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the IntersectVisitor bulk visiting ability and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.
Concerns:
1. Not sure how much more disk space than before the bitset will occupy.
2. MergeReader, which needs docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)





[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the IntersectVisitor bulk visiting ability and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. Not sure how much more disk space than before the bitset will occupy.
2. MergeReader, which needs to iterate docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)





[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the IntersectVisitor bulk visiting ability and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. The bitset will occupy more disk space. We store only the min~max part of 
the bit set, and if the ids are strictly increasing, the space for one point 
value across all blocks should be maxDoc bits, so the total space would be 
(maxDoc * cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one, but I think merge is probably less performance-sensitive than queries.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)





[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the IntersectVisitor bulk visiting ability and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. The bitset will occupy more disk space. We store only the min~max part of 
the bit set, and if the ids are strictly increasing, the space for one point 
value across all blocks should be maxDoc bits, so the total space would be 
roughly (maxDoc * cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one, but I think merge is probably less performance-sensitive than queries.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)





[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low-cardinality point cases, id blocks will usually store ids for just a 
single point value, and intersect will get into the addAll logic. If we give 
the IntersectVisitor bulk visiting ability and store ids as a bitset, we can 
speed up addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. The bitset will occupy more disk space. We store only the min~max part of 
the bit set, and if the ids are strictly increasing, the space for one point 
value across all blocks should be maxDoc bits, so the total space would be 
roughly (maxDoc * cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one, but I think merge is probably less performance-sensitive than queries.

I'd like to run some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up 
addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. Bitset will occupy more dis space. We just store the min~max part of the bit 
set and if ids are strictly increasing, the space for one point value across all 
blocks should be maxDoc bits, and the total space would be roughly (maxDoc * 
cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store ids for just a 
> single point value, and intersect will get into addAll logic. If we give the 
> IntersectVisitor bulk visiting ability and store ids as bitset, we can speed 
> up addAll because we can just execute the 'or' logic for the result. This 
> optimization assumes that if we get into the addAll logic, we will have many 
> docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder 
> in PointRangeQuery or PointInSetQuery should get into the bitset case.
> Concerns:
> 1. Bitset will occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing, the space for one point value 
> across all blocks should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)
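
A note on the "more than maxDoc >>> 7" figure quoted above: the shift is just a 
division by 128, so the assumption is that an addAll call implies a result with 
more than roughly 1/128 of all docs. The sketch below only spells out that 
arithmetic. The class and method names are made up for illustration and are not 
Lucene API, and whether the builder really switches to a bitset at exactly this 
point is taken from the description, not verified here.

```java
// Sketch only: DenseThresholdSketch and its methods are hypothetical names.
final class DenseThresholdSketch {

  /** The cut-off quoted in the description: maxDoc >>> 7, i.e. maxDoc / 128. */
  static int threshold(int maxDoc) {
    return maxDoc >>> 7;
  }

  /** True when the number of matching docs exceeds the cut-off. */
  static boolean isDense(long matchingDocs, int maxDoc) {
    return matchingDocs > threshold(maxDoc);
  }

  public static void main(String[] args) {
    int maxDoc = 1_000_000;
    System.out.println("threshold = " + threshold(maxDoc));
    System.out.println("dense at 10000 docs? " + isDense(10_000, maxDoc));
  }
}
```

With a low cardinality field, a single point value typically matches far more 
than maxDoc / 128 docs, which is why the description expects the bitset case.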




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up 
addAll because we can just execute the 'or' logic for the result.

Concerns:
1. Bitset will occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up 
addAll because we can just execute the 'or' logic for the result. This 
optimization assumes that if we get into the addAll logic, we will have many 
docs as the result (more than maxDoc >>> 7) and the DocIdSetIteratorBuilder in 
PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. Bitset will occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store ids for just a 
> single point value, and intersect will get into addAll logic. If we give the 
> IntersectVisitor bulk visiting ability and store ids as bitset, we can speed 
> up addAll because we can just execute the 'or' logic for the result.
> Concerns:
> 1. Bitset will occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing, the space for one point value 
> across all blocks should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up 
addAll because we can just execute the 'or' logic for the result.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up 
addAll because we can just execute the 'or' logic for the result.

Concerns:
1. Bitset will occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store ids for just a 
> single point value, and intersect will get into addAll logic. If we give the 
> IntersectVisitor bulk visiting ability and store ids as bitset, we can speed 
> up addAll because we can just execute the 'or' logic for the result.
> Concerns:
> 1. Bitset could occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing, the space for one point value 
> across all blocks should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)
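
To make the space estimate in concern 1 concrete, here is a small 
back-of-the-envelope sketch. As I read the description, a block that stores 
only its min~max range needs (max - min + 1) bits, and if the blocks of one 
point value cover disjoint, increasing doc id ranges, those ranges add up to at 
most maxDoc bits per value. All names below are hypothetical and nothing here 
reflects an actual on-disk format.

```java
// Sketch only: BitsetSpaceSketch and its methods are hypothetical names.
final class BitsetSpaceSketch {

  /** Bits needed for one block when only the min~max range is stored. */
  static long blockBits(int minDocId, int maxDocId) {
    return (long) maxDocId - minDocId + 1;
  }

  /** Rough total from the description: maxDoc bits per distinct point value. */
  static long totalBits(int maxDoc, int cardinality) {
    return (long) maxDoc * cardinality;
  }

  /** The same total rounded up to whole bytes. */
  static long totalBytes(int maxDoc, int cardinality) {
    return (totalBits(maxDoc, cardinality) + 7) / 8;
  }

  public static void main(String[] args) {
    System.out.println("one block, docs 100..611: " + blockBits(100, 611) + " bits");
    System.out.println("1M docs, 10 values: " + totalBytes(1_000_000, 10) + " bytes");
  }
}
```

The estimate grows linearly with cardinality, so it only looks attractive when 
the number of distinct point values is small.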




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset when all the ids 
in the block have the same point value, we can speed up addAll because we can 
just execute the 'or' logic for the result.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset when all the ids 
in the block have the same size, we can speed up addAll because we can just 
execute the 'or' logic for the result.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store ids for just a 
> single point value, and intersect will get into addAll logic. If we give the 
> IntersectVisitor bulk visiting ability and store ids as bitset when all the 
> ids in the block have the same point value, we can speed up addAll because we 
> can just execute the 'or' logic for the result.
> Concerns:
> 1. Bitset could occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing, the space for one point value 
> across all blocks should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset when all the ids 
in the block have the same size, we can speed up addAll because we can just 
execute the 'or' logic for the result.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up 
addAll because we can just execute the 'or' logic for the result.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store ids for just a 
> single point value, and intersect will get into addAll logic. If we give the 
> IntersectVisitor bulk visiting ability and store ids as bitset when all the 
> ids in the block have the same size, we can speed up addAll because we can 
> just execute the 'or' logic for the result.
> Concerns:
> 1. Bitset could occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing, the space for one point value 
> across all blocks should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset when all the ids 
in the block have the same point value, we can speed up addAll because we can 
just execute the 'or' logic for the result.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store ids for just a 
single point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset when all the ids 
in the block have the same point value, we can speed up addAll because we can 
just execute the 'or' logic for the result.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> give the IntersectVisitor bulk visiting ability and store ids as bitset when 
> all the ids in the block have the same point value, we can speed up addAll 
> because we can just execute the 'or' logic for the result.
> Concerns:
> 1. Bitset could occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing, the space for one point value 
> across all blocks should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic 
for the result and the block ids.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we give the 
IntersectVisitor bulk visiting ability and store ids as bitset when all the ids 
in the block have the same point value, we can speed up addAll because we can 
just execute the 'or' logic for the result.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when all the ids in the block have the same point value, 
> and give the IntersectVisitor bulk visiting ability (something like 
> visit(DocIdSetIterator iterator)), we can speed up addAll because we can just 
> execute the 'or' logic for the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing, the space for one point value 
> across all blocks should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic 
between the result and the block ids.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic 
for the result and the block ids.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when all the ids in the block have the same point value, 
> and give the IntersectVisitor bulk visiting ability (something like 
> visit(DocIdSetIterator iterator)), we can speed up addAll because we can just 
> execute the 'or' logic between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing, the space for one point value 
> across all blocks should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)
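
For readers who want to see what "execute the 'or' logic between the result 
and the block ids" could look like, here is a minimal sketch. It uses plain 
long[] words as a stand-in for a fixed bit set, and the class names 
(BulkOrSketch, BitSetBlock) are hypothetical. It is not the IntersectVisitor 
API and not the proposed patch, only an illustration of why a block stored as 
a bitset can be merged word by word instead of doc by doc.

```java
// Sketch only: BulkOrSketch and BitSetBlock are hypothetical names, not Lucene API.
final class BulkOrSketch {

  /** A block whose doc ids are stored as 64-bit words covering only min~max. */
  static final class BitSetBlock {
    final int firstWord; // index of the first 64-bit word the block covers
    final long[] words;

    /** Assumes docIds is non-empty and holds non-negative ids. */
    BitSetBlock(int[] docIds) {
      int min = Integer.MAX_VALUE;
      int max = 0;
      for (int doc : docIds) {
        min = Math.min(min, doc);
        max = Math.max(max, doc);
      }
      firstWord = min >>> 6;
      words = new long[(max >>> 6) - firstWord + 1];
      for (int doc : docIds) {
        words[(doc >>> 6) - firstWord] |= 1L << (doc & 63);
      }
    }
  }

  /** Allocates a result bit set able to hold doc ids in [0, maxDoc). */
  static long[] newResult(int maxDoc) {
    return new long[(maxDoc + 63) >>> 6];
  }

  /** Baseline: one bit set operation per doc id. */
  static void addAllOneByOne(long[] result, int[] docIds) {
    for (int doc : docIds) {
      result[doc >>> 6] |= 1L << (doc & 63);
    }
  }

  /** Bulk variant: one 'or' per 64-bit word of the block. */
  static void addAllBulk(long[] result, BitSetBlock block) {
    for (int i = 0; i < block.words.length; i++) {
      result[block.firstWord + i] |= block.words[i];
    }
  }

  static boolean get(long[] bits, int doc) {
    return (bits[doc >>> 6] & (1L << (doc & 63))) != 0;
  }
}
```

Both addAll variants leave the result in the same state. The bulk one touches 
each word of the block once, which is where the speed-up would come from when 
blocks are dense.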




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic 
between the result and the block ids.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing across all blocks, the space for one 
point value should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic 
between the result and the block ids.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the 
bit set and if ids are strictly increasing, the space for one point value across 
all blocks should be maxDoc bits, and the total space would be roughly (maxDoc 
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when all the ids in the block have the same point value, 
> and give the IntersectVisitor bulk visiting ability (something like 
> visit(DocIdSetIterator iterator)), we can speed up addAll because we can just 
> execute the 'or' logic between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space. We just store the min~max part of the 
> bit set and if ids are strictly increasing across all blocks, the space for 
> one point value should be maxDoc bits, and the total space would be roughly 
> (maxDoc * cardinality) bits.
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. But I think the performance of merge could be less sensitive than 
> queries.
> I'd like to do some tests for query, merge and space if you think this 
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic 
between the result and the block ids.

Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by 
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this 
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic
between the result and the block ids.

Concerns:
1. Bitset could occupy more disk space. We just store the min~max part of the
bit set, and if ids are strictly increasing across all blocks, the space for one
point value should be maxDoc bits, and the total space would be roughly (maxDoc
* cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when all the ids in the block have the same point value, 
> and give the IntersectVisitor bulk visiting ability (something like 
> visit(DocIdSetIterator iterator)), we can speed up addAll because we can just
> execute the 'or' logic between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space.
> 2. MergeReader will become slower because it needs to iterate docIds one by
> one. But I think the performance of merge could be less sensitive than
> queries.
> I'd like to do some tests for query, merge and space if you think this
> optimization is worth a try :)
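
The 'or' idea in the description can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the proposed patch: the class and method names are hypothetical, plain long[] words stand in for Lucene's bit set classes, and the block is stored starting at the 64-bit word that contains its smallest id so that no bit shifting is needed when merging.

```java
public final class BulkOrSketch {
  private BulkOrSketch() {}

  /**
   * Encodes non-empty, ascending doc ids as 64-bit words. Word 0 of the returned
   * array corresponds to the word that contains ids[0].
   */
  public static long[] encode(int[] ids) {
    int firstWord = ids[0] >>> 6;
    int lastWord = ids[ids.length - 1] >>> 6;
    long[] words = new long[lastWord - firstWord + 1];
    for (int id : ids) {
      // For long shifts Java only uses the low 6 bits of the shift count.
      words[(id >>> 6) - firstWord] |= 1L << id;
    }
    return words;
  }

  /** ORs a whole block into the result: one operation per word instead of one per doc. */
  public static void orInto(long[] result, long[] block, int firstWord) {
    for (int i = 0; i < block.length; i++) {
      result[firstWord + i] |= block[i];
    }
  }
}
```

With a per-document visitor the cost grows with the number of ids in the block, while the loop above grows with the number of words the block spans, which is where the hoped-for addAll speed-up would come from.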




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic
between the result and the block ids.

Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by
one. (But maybe the performance of merge could be less sensitive than queries?)

I'd like to do some tests for query, merge and space if you think this
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic
between the result and the block ids.

Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by
one. But I think the performance of merge could be less sensitive than queries.

I'd like to do some tests for query, merge and space if you think this
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when all the ids in the block have the same point value, 
> and give the IntersectVisitor bulk visiting ability (something like 
> visit(DocIdSetIterator iterator)), we can speed up addAll because we can just
> execute the 'or' logic between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space.
> 2. MergeReader will become slower because it needs to iterate docIds one by
> one. (But maybe the performance of merge could be less sensitive than
> queries?)
> I'd like to do some tests for query, merge and space if you think this
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by
one. (But maybe the performance of merge could be less sensitive than queries?)

I'd like to do some tests for query, merge and space if you think this
optimization is worth a try :)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when all the ids in the block have the same point value, and give the 
IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator 
iterator)), we can speed up addAll because we can just execute the 'or' logic
between the result and the block ids.

Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by
one. (But maybe the performance of merge could be less sensitive than queries?)

I'd like to do some tests for query, merge and space if you think this
optimization is worth a try :)


> Store docIds as bit set to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space.
> 2. MergeReader will become slower because it needs to iterate docIds one by
> one. (But maybe the performance of merge could be less sensitive than
> queries?)
> I'd like to do some tests for query, merge and space if you think this
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Summary: Store docIds as bitset when leafCardinality = 1 to speed up addAll 
 (was: Store docIds as bit set when leafCardinality = 1 to speed up addAll)

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space.
> 2. MergeReader will become slower because it needs to iterate docIds one by
> one. (But maybe the performance of merge could be less sensitive than
> queries?)
> I'd like to do some tests for query, merge and space if you think this
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bit set when leafCardinality = 1 to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Summary: Store docIds as bit set when leafCardinality = 1 to speed up 
addAll  (was: Store docIds as bit set to speed up addAll)

> Store docIds as bit set when leafCardinality = 1 to speed up addAll
> ---
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space.
> 2. MergeReader will become slower because it needs to iterate docIds one by
> one. (But maybe the performance of merge could be less sensitive than
> queries?)
> I'd like to do some tests for query, merge and space if you think this
> optimization is worth a try :)




[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by
one. (But maybe the performance of merge could be less sensitive than queries?)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by
one. (But maybe the performance of merge could be less sensitive than queries?)

I'd like to do some tests for query, merge and space if you think this
optimization is worth a try :)


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space.
> 2. MergeReader will become slower because it needs to iterate docIds one by
> one. (But maybe the performance of merge could be less sensitive than
> queries?)




[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space, though we just store part of the bitset
(with some offset).
2. MergeReader will become slower because it needs to iterate docIds one by
one. (But maybe the performance of merge could be less sensitive than queries?)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by
one. (But maybe the performance of merge could be less sensitive than queries?)


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space, though we just store part of the
> bitset (with some offset).
> 2. MergeReader will become slower because it needs to iterate docIds one by
> one. (But maybe the performance of merge could be less sensitive than
> queries?)




[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can apply this optimization
only when the block's max-min <= 8*count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space, though we just store part of the bitset
(with some offset).
2. MergeReader will become slower because it needs to iterate docIds one by
one. (But maybe the performance of merge could be less sensitive than queries?)


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space. (Maybe we can apply this optimization
> only when the block's max-min <= 8*count?)
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. 
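
The density guard suggested in concern 1 can be sketched as a single comparison. The class name, method name, and parameters below are hypothetical; the factor n is left as an argument, and n = 8 is the value floated in the description.

```java
public final class BitsetThreshold {
  private BitsetThreshold() {}

  /**
   * Returns true when the doc id span of a block is small enough, relative to the
   * number of ids it holds, for a min..max bitset to stay compact.
   */
  public static boolean useBitset(int minDoc, int maxDoc, int count, int n) {
    // Widen to long so that neither the subtraction nor the product can overflow.
    return (long) maxDoc - minDoc <= (long) n * count;
  }
}
```

With n = 8 the span is at most 8 * count, so the bitset costs about 8 bits, roughly one byte, per stored doc id in the worst accepted case.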




[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can apply this optimization
only when the block's max-min <= n*count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can apply this optimization
only when the block's max-min <= 8*count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space. (Maybe we can apply this optimization
> only when the block's max-min <= n*count?)
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. 
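
Concern 2 is about consumers that still need individual doc ids, such as a merge, and therefore have to walk the set bits one by one. A minimal sketch of that walk is shown below, assuming the block is held as long[] words whose first word has index firstWord in the global bit space. The names are hypothetical and this is not Lucene's MergeReader code.

```java
import java.util.function.IntConsumer;

public final class BitsetIteration {
  private BitsetIteration() {}

  /** Calls the visitor once per set bit, in increasing doc id order. */
  public static void forEachDoc(long[] words, int firstWord, IntConsumer visitor) {
    for (int i = 0; i < words.length; i++) {
      long word = words[i];
      int base = (firstWord + i) << 6; // doc id of bit 0 in this word
      while (word != 0) {
        visitor.accept(base + Long.numberOfTrailingZeros(word));
        word &= word - 1; // clear the lowest set bit
      }
    }
  }
}
```

The inner loop runs once per stored id and the outer loop adds one step per word, including empty words, so sparse blocks pay extra work. That is one reason to combine this layout with a density guard.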




[GitHub] [lucene] dweiss commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.

2021-11-14 Thread GitBox



dweiss commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r748913217



##
File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
##
@@ -185,6 +204,115 @@ private void shuffle(int from, int to) {
 }
   }
 
+  /** Selects the k-th entry with a bottom-k algorithm, given that k is close 
to {@code from}. */

Review comment:
   I'll give it a second spin tomorrow, thanks Bruno. I recall looking at
generic streaming top-n algorithms a while ago and they all seemed more
complicated than your code, hence the question.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can apply this optimization
only when the block's (max-min) * n <= count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can apply this optimization
only when the block's max-min <= n*count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space. (Maybe we can apply this optimization
> only when the block's (max-min) * n <= count?)
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. 




[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-14 Thread Feng Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can apply this optimization
only when the block's (max-min) <= n * count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed
up addAll because we can just execute the 'or' logic between the result and the
block ids.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can apply this optimization
only when the block's (max-min) * n <= count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the
> IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator
> iterator)), we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> Concerns:
> 1. Bitset could occupy more disk space. (Maybe we can apply this optimization
> only when the block's (max-min) <= n * count?)
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. 




[GitHub] [lucene] gf2121 opened a new pull request #438: LUCENE-10233: Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-14 Thread GitBox



gf2121 opened a new pull request #438:
URL: https://github.com/apache/lucene/pull/438


   In low cardinality points cases, id blocks will usually store doc ids that
have the same point value, and `intersect` will get into `addAll` logic. If we
store ids as bitset when the leafCardinality = 1, and give the IntersectVisitor
bulk visiting ability, we can speed up addAll because we can just execute the
'or' logic between the result and the block ids.
   
   I mocked a field that has 10,000,000 docs per value and searched it with a
PointInSetQuery with 1 term; the build scorer time decreased from 71ms to 8ms.
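
The "bulk visiting ability" can be pictured as an extra entry point on the visitor that receives a whole block at once. The sketch below is a hypothetical stand-in and not Lucene's IntersectVisitor: the interface name is invented, and a java.util.PrimitiveIterator.OfInt plays the role that DocIdSetIterator has in the description.

```java
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

public final class BulkVisitSketch {
  private BulkVisitSketch() {}

  /** Hypothetical visitor with a per-doc method and a bulk method. */
  public interface DocVisitor {
    void visit(int docID);

    /** Bulk entry point; this default simply falls back to per-doc visiting. */
    default void visit(PrimitiveIterator.OfInt docs) {
      while (docs.hasNext()) {
        visit(docs.nextInt());
      }
    }
  }

  /** Feeds a block of ids through the bulk entry point and counts the visits. */
  public static int countVisited(int[] ids) {
    int[] count = {0};
    DocVisitor visitor = docID -> count[0]++;
    visitor.visit(IntStream.of(ids).iterator());
    return count[0];
  }
}
```

A default method keeps existing visitors working unchanged, while a visitor backed by a bit set could override the bulk method and 'or' the block in directly.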



[GitHub] [lucene-solr] noblepaul merged pull request #2607: SOLR-15794: Switching a PRS collection from true -> false -> true results in INACTIVE replicas

2021-11-14 Thread GitBox



noblepaul merged pull request #2607:
URL: https://github.com/apache/lucene-solr/pull/2607


   



[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-11-14 Thread Arjen (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443572#comment-17443572
 ] 

Arjen commented on LUCENE-9921:
---

I'm not sure if this is the best place (perhaps a new issue would be better?),
but ICU has been at version 70.1 for a few weeks. It'd be nice if the upcoming
Lucene 9 had the latest ICU, whether you integrate it via this proposed
change or not :)

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> ICU 69 was released, so I was playing with the upgrade just to test it out
> and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks 
> were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


