[GitHub] [lucene] dweiss commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.
dweiss commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r748818786

## File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
## @@ -185,6 +204,115 @@ private void shuffle(int from, int to) { } }
+ /** Selects the k-th entry with a bottom-k algorithm, given that k is close to {@code from}. */

Review comment:
Can you give a reference for the algorithm that is actually used here? I admit a 5-minute eyeball didn't explain it to me.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] bruno-roustant commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.
bruno-roustant commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r748823600

## File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
## @@ -185,6 +204,115 @@ private void shuffle(int from, int to) { } }
+ /** Selects the k-th entry with a bottom-k algorithm, given that k is close to {@code from}. */

Review comment:
Good remark, it means I am missing some doc. This algorithm is my own, and the idea is actually quite simple; maybe the code is not clear enough.

When k is close to from, we take an int array of size (k - from + 1) called bottom, where each bottom array element i points to the corresponding (from + i) entry, and we determine the max of this bottom array.

Then we loop over all the remaining entries. We compare each entry e to the max of bottom; if e < max, then e evicts max and takes its slot in bottom, and we determine the new max of bottom. (The speed comes from the fact that most of the time we only compare e to the bottom max.)

At the end, the slots in bottom point to the (k - from + 1) least entries. We just have to swap them into place and finally swap the bottom max to index k.

I'll add more doc!
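The steps described in this comment can be sketched as follows. This is only an illustration under simplifying assumptions: it works on a plain `int[]` and compares values directly, whereas the code in the pull request presumably goes through the selector's own compare and swap operations. The class and method names (`BottomKSketch`, `select`, `slotOfMax`) are made up for this example and are not taken from the patch.

```java
import java.util.Arrays;

/** Illustrative only: bottom-k selection over a plain int[] (not the Lucene code). */
final class BottomKSketch {

  /**
   * Rearranges a[from..to) so that a[k] holds the value it would hold if the range were
   * sorted, with no larger value before it in the range. Assumes from <= k < to.
   */
  static void select(int[] a, int from, int to, int k) {
    int size = k - from + 1;
    int[] bottom = new int[size];
    for (int i = 0; i < size; i++) {
      bottom[i] = from + i; // slot i starts out pointing at entry (from + i)
    }
    int maxSlot = slotOfMax(a, bottom);

    // Most entries cost a single comparison against the current bottom max.
    for (int e = k + 1; e < to; e++) {
      if (a[e] < a[bottom[maxSlot]]) {
        bottom[maxSlot] = e; // e evicts the max and takes its slot
        maxSlot = slotOfMax(a, bottom);
      }
    }

    // Slot i points either at (from + i) itself or at an index beyond k,
    // so each swap touches positions that no other slot uses.
    for (int i = 0; i < size; i++) {
      swap(a, from + i, bottom[i]);
    }
    // The bottom max now sits at (from + maxSlot); move it to index k.
    swap(a, from + maxSlot, k);
  }

  /** Linear scan for the slot whose pointed-to entry is largest. */
  private static int slotOfMax(int[] a, int[] bottom) {
    int maxSlot = 0;
    for (int i = 1; i < bottom.length; i++) {
      if (a[bottom[i]] > a[bottom[maxSlot]]) {
        maxSlot = i;
      }
    }
    return maxSlot;
  }

  private static void swap(int[] a, int i, int j) {
    int tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }

  public static void main(String[] args) {
    int[] a = {5, 1, 9, 3, 7, 2, 8};
    select(a, 0, a.length, 2);
    System.out.println(Arrays.toString(a));
  }
}
```

Recomputing the max is linear in the size of bottom, which is why the approach only makes sense when k is close to from and bottom stays small.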
[jira] [Created] (LUCENE-10233) Store docIds as bit set to speed up addAll
Feng Guo created LUCENE-10233:

Summary: Store docIds as bit set to speed up addAll
Key: LUCENE-10233
URL: https://issues.apache.org/jira/browse/LUCENE-10233
Project: Lucene - Core
Issue Type: Improvement
Components: core/codecs
Reporter: Feng Guo

In low-cardinality point cases, id blocks will usually store ids for just a single point value, and intersect will get into the addAll logic. If we give the bulk visiting ability to IntersectVisitor and store ids as a bitset, we can speed up addAll because we can just execute the 'or' logic for the result.

This optimization assumes that if we get into the addAll logic, we will have many docs in the result, and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. Not sure how much more space than before the bitset will occupy.
2. MergeReader, which needs docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)

--
This message was sent by Atlassian Jira (v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
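To make the 'or' idea concrete, here is a minimal sketch that contrasts adding doc ids one at a time with merging a whole block in one bitwise OR. It uses `java.util.BitSet` as a stand-in for whatever bit set the codec would use, and the class and method names (`BulkAddAllSketch`, `addOneByOne`, `addAllBulk`) are hypothetical; none of this is the actual IntersectVisitor API.

```java
import java.util.BitSet;

/** Illustrative only: per-doc adds versus a bulk 'or' of a block stored as a bit set. */
final class BulkAddAllSketch {

  /** One call per doc id, as a visitor without bulk support would have to do. */
  static void addOneByOne(BitSet result, int[] docIds) {
    for (int docId : docIds) {
      result.set(docId);
    }
  }

  /** Bulk path: the block is already a bit set, so it is merged word by word. */
  static void addAllBulk(BitSet result, BitSet block) {
    result.or(block);
  }

  public static void main(String[] args) {
    int[] docIds = {3, 64, 65, 200};
    BitSet block = new BitSet();
    for (int docId : docIds) {
      block.set(docId);
    }

    BitSet viaLoop = new BitSet();
    addOneByOne(viaLoop, docIds);

    BitSet viaOr = new BitSet();
    addAllBulk(viaOr, block);

    System.out.println(viaLoop.equals(viaOr));
  }
}
```

Both paths produce the same result set; the bulk path just replaces a loop over doc ids with a loop over machine words.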
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
Description:

In low-cardinality point cases, id blocks will usually store ids for just a single point value, and intersect will get into the addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as a bitset, we can speed up addAll because we can just execute the 'or' logic for the result.

This optimization assumes that if we get into the addAll logic, we will have many docs in the result, and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. Not sure how much more space than before the bitset will occupy.
2. MergeReader, which needs docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
Description:

In low-cardinality point cases, id blocks will usually store ids for just a single point value, and intersect will get into the addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as a bitset, we can speed up addAll because we can just execute the 'or' logic for the result.

This optimization assumes that if we get into the addAll logic, we will have many docs in the result (more than maxdoc >>> 7), and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. Not sure how much more space than before the bitset will occupy.
2. MergeReader, which needs docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
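For readers unfamiliar with the shift notation, `maxdoc >>> 7` is maxDoc divided by 128, rounded down. The snippet below only illustrates that arithmetic and the "more than" comparison stated in the description; the class and method names are hypothetical and this is not code from Lucene.

```java
/** Illustrative only: the "more than maxdoc >>> 7" condition from the description. */
final class BitSetThresholdSketch {

  /** maxDoc >>> 7 equals maxDoc / 128 (rounded down) for non-negative maxDoc. */
  static int threshold(int maxDoc) {
    return maxDoc >>> 7;
  }

  /** True when the result has more docs than the threshold. */
  static boolean expectsBitSetCase(int maxDoc, long matchingDocs) {
    return matchingDocs > threshold(maxDoc);
  }

  public static void main(String[] args) {
    int maxDoc = 1_000_000;
    System.out.println(threshold(maxDoc));
    System.out.println(expectsBitSetCase(maxDoc, 10_000));
  }
}
```

In other words, the assumption is that a result matching more than roughly 0.8% of the segment's documents ends up in the bitset case.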
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
Description:

In low-cardinality point cases, id blocks will usually store ids for just a single point value, and intersect will get into the addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as a bitset, we can speed up addAll because we can just execute the 'or' logic for the result.

This optimization assumes that if we get into the addAll logic, we will have many docs in the result (more than maxdoc >>> 7), and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. Not sure how much more disk space than before the bitset will occupy.
2. MergeReader, which needs to iterate docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
Description:

In low-cardinality point cases, id blocks will usually store ids for just a single point value, and intersect will get into the addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as a bitset, we can speed up addAll because we can just execute the 'or' logic for the result.

This optimization assumes that if we get into the addAll logic, we will have many docs in the result (more than maxdoc >>> 7), and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. Not sure how much more disk space than before the bitset will occupy.
2. MergeReader, which needs docIds one by one, may become slower.

I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
Description:

In low-cardinality point cases, id blocks will usually store ids for just a single point value, and intersect will get into the addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as a bitset, we can speed up addAll because we can just execute the 'or' logic for the result.

This optimization assumes that if we get into the addAll logic, we will have many docs in the result (more than maxdoc >>> 7), and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. The bitset will occupy more disk space. We just store the min~max part of the bit set, and if ids are strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total space would be (maxDoc * cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than that of queries.

I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
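As a worked instance of the (maxDoc * cardinality) estimate in concern 1, the sketch below converts the bit count into bytes for an example segment. The input numbers are arbitrary examples chosen for illustration, not measurements, and the class and method names are hypothetical.

```java
/** Illustrative only: the (maxDoc * cardinality) bits estimate from concern 1. */
final class BitSetSpaceSketch {

  /** Estimated total size in bits: one maxDoc-sized bit set per distinct point value. */
  static long estimatedBits(long maxDoc, long cardinality) {
    return Math.multiplyExact(maxDoc, cardinality);
  }

  /** The same estimate in bytes, rounded up to a whole byte. */
  static long estimatedBytes(long maxDoc, long cardinality) {
    return (estimatedBits(maxDoc, cardinality) + 7) / 8;
  }

  public static void main(String[] args) {
    long maxDoc = 1_000_000; // example segment size, not a measurement
    long cardinality = 10;   // example number of distinct point values
    System.out.println(estimatedBits(maxDoc, cardinality) + " bits");
    System.out.println(estimatedBytes(maxDoc, cardinality) + " bytes");
  }
}
```

With these example inputs the estimate is 10,000,000 bits, or 1,250,000 bytes, and it grows linearly with the number of distinct values.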
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
Description:

In low-cardinality point cases, id blocks will usually store ids for just a single point value, and intersect will get into the addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as a bitset, we can speed up addAll because we can just execute the 'or' logic for the result.

This optimization assumes that if we get into the addAll logic, we will have many docs in the result (more than maxdoc >>> 7), and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. The bitset will occupy more disk space. We just store the min~max part of the bit set, and if ids are strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than that of queries.

I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
Description:

In low-cardinality point cases, id blocks will usually store ids for just a single point value, and intersect will get into the addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as a bitset, we can speed up addAll because we can just execute the 'or' logic for the result.

This optimization assumes that if we get into the addAll logic, we will have many docs in the result (more than maxdoc >>> 7), and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case.

Concerns:
1. The bitset will occupy more disk space. We just store the min~max part of the bit set, and if ids are strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits.
2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than that of queries.

I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
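Concern 2 is about consumers that still need doc ids one at a time. A minimal sketch of that path, assuming the block stores only the min~max window of the bit set together with the minimum doc id as an offset: the reader has to walk the set bits and add the offset back. `java.util.BitSet` again stands in for the real structure, and the names (`IterateBitSetSketch`, `docIdsInOrder`, `minDocId`) are hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Illustrative only: recovering individual doc ids from a block stored as a bit set. */
final class IterateBitSetSketch {

  /**
   * Walks the set bits in increasing order. Bit i of the block stands for
   * doc id (minDocId + i), since only the min~max window is stored.
   */
  static int[] docIdsInOrder(BitSet block, int minDocId) {
    int[] docIds = new int[block.cardinality()];
    int n = 0;
    for (int bit = block.nextSetBit(0); bit >= 0; bit = block.nextSetBit(bit + 1)) {
      docIds[n++] = minDocId + bit;
    }
    return docIds;
  }

  public static void main(String[] args) {
    BitSet block = new BitSet();
    block.set(0);
    block.set(3);
    block.set(4);
    System.out.println(Arrays.toString(docIdsInOrder(block, 100)));
  }
}
```

Each doc id costs a bit scan instead of a plain array read, which is the slowdown the concern anticipates for merging.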
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset will occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up addAll because we can just execute the 'or' logic for the result. This optimization assumes that if we get into the addAll logic, we will have many docs as the result (more than maxdoc >>> 7) and the DocIdSetIteratorBuilder in PointRangeQuery or PointInSetQuery should get into the bitset case. Concerns: 1. Bitset will occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset will occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset when all the ids in the block have the same point value, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset when all the ids in the block have the same size, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset when all the ids in the block have the same size, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset when all the ids in the block have the same point value, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store ids for just single point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset when all the ids in the block have the same point value, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids is strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total spase would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But i think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic for the result and the block ids. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids are strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we give the IntersectVisitor bulk visiting ability and store ids as bitset when all the ids in the block have the same point value, we can speed up addAll because we can just execute the 'or' logic for the result. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids are strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids are strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic for the result and the block ids. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids are strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids are strictly increasing across all blocks, the space for one point value should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids are strictly increasing, the space for one point value across all blocks should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Concerns: 1. Bitset could occupy more disk space. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Concerns: 1. Bitset could occupy more disk space. We just store the min~max part of the bit set and if ids are strictly increasing across all blocks, the space for one point value should be maxDoc bits, and the total space would be roughly (maxDoc * cardinality) bits. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Concerns: 1. Bitset could occupy more disk space. 2. MergeReader will become slower because it needs to iterate docIds one by one. (But maybe the performance of merge could be less sensitive than queries?) I'd like to do some test for query, merge and space if you think this optimization is worth a try :) was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Concerns: 1. Bitset could occupy more disk space. 2. MergeReader will become slower because it needs to iterate docIds one by one. But I think the performance of merge could be less sensitive than queries. 
I'd like to do some test for query, merge and space if you think this optimization is worth a try :) > Store docIds as bit set to speed up addAll > -- > > Key: LUCENE-10233 > URL: https://issues.apache.org/jira/browse/LUCENE-10233 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > > In low cardinality points cases, id blocks will usually store doc ids that > have the same point value, and intersect will get into addAll logic. If we > store ids as bitset when all the ids in the block have the same point value, > and give the IntersectVisitor bulk visiting ability (something like > visit(DocIdSetIterator iterator), we can speed up addAll because we can just > execute the 'or' logic between the result and the block ids. > Concerns: > 1. Bitset could occupy more disk space. > 2. MergeReader will become slower because it needs to iterate docIds one by > one. (But maybe the performance of merge could be less sensitive than queries > ?) > I'd like to do some test for query, merge and space if you think this > optimization is worth a try :) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10233) Store docIds as bit set to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Description:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by one. (But maybe merge performance matters less than query performance?)
I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
was:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when all the ids in the block have the same point value, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by one. (But maybe merge performance matters less than query performance?)
I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Summary: Store docIds as bitset when leafCardinality = 1 to speed up addAll (was: Store docIds as bit set when leafCardinality = 1 to speed up addAll)
[jira] [Updated] (LUCENE-10233) Store docIds as bit set when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Summary: Store docIds as bit set when leafCardinality = 1 to speed up addAll (was: Store docIds as bit set to speed up addAll)
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Description:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by one. (But maybe merge performance matters less than query performance?)
was:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by one. (But maybe merge performance matters less than query performance?)
I'd like to run some tests for query, merge and space if you think this optimization is worth a try :)
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Description:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space, though we just store part of the bitset (with some offset).
2. MergeReader will become slower because it needs to iterate docIds one by one. (But maybe merge performance matters less than query performance?)
was:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space.
2. MergeReader will become slower because it needs to iterate docIds one by one. (But maybe merge performance matters less than query performance?)
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Description:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space. (Maybe we can restrict this optimization to blocks where max-min <= 8*count?)
2. MergeReader will become slower because it needs to iterate docIds one by one.
was:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space, though we just store part of the bitset (with some offset).
2. MergeReader will become slower because it needs to iterate docIds one by one. (But maybe merge performance matters less than query performance?)
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Description:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space. (Maybe we can restrict this optimization to blocks where max-min <= n*count?)
2. MergeReader will become slower because it needs to iterate docIds one by one.
was:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space. (Maybe we can restrict this optimization to blocks where max-min <= 8*count?)
2. MergeReader will become slower because it needs to iterate docIds one by one.
[GitHub] [lucene] dweiss commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.
dweiss commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r748913217
## File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
## @@ -185,6 +204,115 @@ private void shuffle(int from, int to) { } } + /** Selects the k-th entry with a bottom-k algorithm, given that k is close to {@code from}. */
Review comment: I'll give it a second spin tomorrow, thanks Bruno. I recall looking at generic streaming top-n algorithms a while ago and they all seemed more complicated than your code, hence the question.
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Description:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space. (Maybe we can restrict this optimization to blocks where (max-min) * n <= count?)
2. MergeReader will become slower because it needs to iterate docIds one by one.
was:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space. (Maybe we can restrict this optimization to blocks where max-min <= n*count?)
2. MergeReader will become slower because it needs to iterate docIds one by one.
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Guo updated LUCENE-10233:
Description:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space. (Maybe we can restrict this optimization to blocks where (max-min) <= n * count?)
2. MergeReader will become slower because it needs to iterate docIds one by one.
was:
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and intersect will enter the addAll logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator)), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
Concerns:
1. Bitset could occupy more disk space. (Maybe we can restrict this optimization to blocks where (max-min) * n <= count?)
2. MergeReader will become slower because it needs to iterate docIds one by one.
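A minimal sketch of the size check floated in concern 1 of this issue, written in plain Java with no Lucene dependency. The class and method names (BitsetThresholdSketch, useBitset, bitsetBytes) are hypothetical, and the issue does not say what n stands for; here it is treated as a plain tunable factor. The example values (512 ids per block, n = 8) are assumptions chosen only for illustration.

```java
public final class BitsetThresholdSketch {

  /**
   * Returns true when the doc id span of a block is at most n times the
   * number of ids in it, i.e. (max - min) <= n * count.
   */
  public static boolean useBitset(int min, int max, int count, int n) {
    // long arithmetic avoids int overflow for large spans or counts
    return (long) max - min <= (long) n * count;
  }

  /** Bytes needed for a bitset covering [min, max], rounded up to whole 64-bit words. */
  public static long bitsetBytes(int min, int max) {
    long bits = (long) max - min + 1;
    return ((bits + 63) / 64) * 8;
  }

  public static void main(String[] args) {
    // 512 ids packed into a span of 1,000 doc ids: dense, so the bitset stays small.
    System.out.println(useBitset(0, 999, 512, 8) + " " + bitsetBytes(0, 999));
    // 512 ids spread over 10,000,000 doc ids: sparse, so the bitset would be large.
    System.out.println(useBitset(0, 9_999_999, 512, 8) + " " + bitsetBytes(0, 9_999_999));
  }
}
```

The point of such a check is that the bitset's size depends on the span of doc ids rather than on how many ids the block holds, so a sparse block can cost far more on disk than a list of ids would.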
[GitHub] [lucene] gf2121 opened a new pull request #438: LUCENE-10233: Store docIds as bitset when leafCardinality = 1 to speed up addAll
gf2121 opened a new pull request #438:
URL: https://github.com/apache/lucene/pull/438
In low-cardinality point cases, id blocks will usually store doc ids that have the same point value, and `intersect` will enter the `addAll` logic. If we store ids as a bitset when leafCardinality = 1, and give the IntersectVisitor bulk visiting ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
I mocked a field with 10,000,000 docs per value and searched it with a PointInSetQuery with 1 term; the time to build the scorer decreased from 71ms to 8ms.
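To make the idea concrete, here is a rough sketch contrasting the per-document path with a word-wise OR. It uses plain long[] words rather than Lucene's own classes; BulkOrSketch and all of its methods are hypothetical names, and the word-aligned block encoding is an assumption, not necessarily what the pull request implements.

```java
import java.util.Arrays;

public final class BulkOrSketch {

  /** Per-document path: sets one bit per doc id, as a visit(int docID) loop would. */
  public static void addOneByOne(long[] result, int[] docIds) {
    for (int doc : docIds) {
      result[doc >>> 6] |= 1L << doc; // Java masks a long shift count to its low 6 bits
    }
  }

  /** Index of the 64-bit word holding the smallest doc id (block must be non-empty). */
  public static int firstWord(int[] docIds) {
    int min = Integer.MAX_VALUE;
    for (int doc : docIds) {
      min = Math.min(min, doc);
    }
    return min >>> 6;
  }

  /** Encodes a non-empty block as bitset words, starting at firstWord(docIds). */
  public static long[] encodeBlock(int[] docIds) {
    int first = firstWord(docIds);
    int max = 0;
    for (int doc : docIds) {
      max = Math.max(max, doc);
    }
    long[] words = new long[(max >>> 6) - first + 1];
    for (int doc : docIds) {
      words[(doc >>> 6) - first] |= 1L << doc;
    }
    return words;
  }

  /** Bulk path: ORs whole words of the block into the result, 64 doc ids at a time. */
  public static void orBlock(long[] result, long[] blockWords, int firstWord) {
    for (int i = 0; i < blockWords.length; i++) {
      result[firstWord + i] |= blockWords[i];
    }
  }

  public static void main(String[] args) {
    int[] block = {130, 131, 135, 200, 201};
    long[] perDoc = new long[8];
    long[] bulk = new long[8];
    addOneByOne(perDoc, block);
    orBlock(bulk, encodeBlock(block), firstWord(block));
    System.out.println(Arrays.equals(perDoc, bulk)); // prints true
  }
}
```

Both paths produce the same result bits; the bulk path touches one word per 64 doc ids of span instead of one word per doc id, which is where a speedup of the kind reported above could plausibly come from.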
[GitHub] [lucene-solr] noblepaul merged pull request #2607: SOLR-15794: Switching a PRS collection from true -> false -> true results in INACTIVE replicas
noblepaul merged pull request #2607: URL: https://github.com/apache/lucene-solr/pull/2607
[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?
[ https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443572#comment-17443572 ]
Arjen commented on LUCENE-9921:
I'm not sure if this is the best place (perhaps a new issue would be better?), but ICU has been at version 70.1 for a few weeks. It'd be nice if the upcoming Lucene 9 shipped with the latest ICU, whether you integrate it via this proposed change or not :)
> Can ICU regeneration tasks treat icu version as input?
> ------------------------------------------------------
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Major
>
> ICU 69 was released, so I was playing with the upgrade just to test it out and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful; regeneration tasks were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these tasks, such that if it changes, tasks know the generated output is stale?