[GitHub] [lucene] gf2121 merged pull request #653: LUCENE-10315: add CHANGES for #541
gf2121 merged pull request #653:
URL: https://github.com/apache/lucene/pull/653
[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487952#comment-17487952 ]

ASF subversion and git services commented on LUCENE-10315:
-----------------------------------------------------------

Commit e93b08f47160a6550cb559bd7e3786195ced88a5 in lucene's branch refs/heads/main from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e93b08f ]

LUCENE-10315: Add CHANGES for #541 (#653)

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-10315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10315
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Feng Guo
>            Priority: Major
>         Attachments: addall.svg
>
>          Time Spent: 6h
>  Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index low-cardinality fields such as gender / age / status, they often use numbers to represent the values, so ES infers these fields as {{long}}, and ES uses the BKD tree as the index for {{long}} fields. When the data volume grows, building the result set for low-cardinality fields makes CPU usage and load very high.
> This is a flame graph we obtained from the production environment: [^addall.svg]
> It can be seen that almost all CPU time is spent in addAll. When we reindexed {{long}} as {{keyword}}, the cluster load and search latency were greatly reduced (we spent weeks reindexing all indices...). I know the ES documentation recommends {{keyword}} for term/terms queries and {{long}} for range queries, but there are always users who don't realize this and keep their SQL-database habits, or dynamic mapping selects the type for them. All in all, users won't realize there can be such a big performance difference between {{long}} and {{keyword}} for low-cardinality fields. So from my point of view it makes sense to make BKD work better for low/medium-cardinality fields.
> As far as I can see, for low-cardinality fields there are two advantages of {{keyword}} over {{long}}:
> 1. The {{ForUtil}} used in {{keyword}} postings is much more efficient than BKD's delta VInt, because of its batch reading (readLongs) and SIMD decoding.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily materialize its result set, so when another small result clause intersects with this low-cardinality condition, the low-cardinality field can avoid reading all docIds into memory.
> This issue targets the first point. The basic idea is to use a 512-int {{ForUtil}} for the BKD ids codec. I benchmarked this optimization by mocking some random {{LongPoint}} values and querying them with {{PointInSetQuery}}.
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |1|32|1|51.44|148.26|188.22%|
> |1|32|2|26.8|101.88|280.15%|
> |1|32|4|14.04|53.52|281.20%|
> |1|32|8|7.04|28.54|305.40%|
> |1|32|16|3.54|14.61|312.71%|
> |1|128|1|110.56|350.26|216.81%|
> |1|128|8|16.6|89.81|441.02%|
> |1|128|16|8.45|48.07|468.88%|
> |1|128|32|4.2|25.35|503.57%|
> |1|128|64|2.13|13.02|511.27%|
> |1|1024|1|536.19|843.88|57.38%|
> |1|1024|8|109.71|251.89|129.60%|
> |1|1024|32|33.24|104.11|213.21%|
> |1|1024|128|8.87|30.47|243.52%|
> |1|1024|512|2.24|8.3|270.54%|
> |1|8192|1|3333.33|5000|50.00%|
> |1|8192|32|139.47|214.59|53.86%|
> |1|8192|128|54.59|109.23|100.09%|
> |1|8192|512|15.61|36.15|131.58%|
> |1|8192|2048|4.11|11.14|171.05%|
> |1|1048576|1|2597.4|3030.3|16.67%|
> |1|1048576|32|314.96|371.75|18.03%|
> |1|1048576|128|99.7|116.28|16.63%|
> |1|1048576|512|30.5|37.15|21.80%|
> |1|1048576|2048|10.38|12.3|18.50%|
> |1|8388608|1|2564.1|3174.6|23.81%|
> |1|8388608|32|196.27|238.95|21.75%|
> |1|8388608|128|55.36|68.03|22.89%|
> |1|8388608|512|15.58|19.24|23.49%|
> |1|8388608|2048|4.56|5.71|25.22%|
> The index size is reduced for low-cardinality fields and flat for high-cardinality fields.
> {code:java}
> 113M index_1_doc_32_cardinality_baseline
> 114M index_1_doc_32_cardinality_candidate
> 140M index_1_doc_128_cardinality_baseline
> 133M index_1_doc_128_cardinality_candidate
> 193M index_1_doc_1024_cardinality_baseline
> 174M index_1_doc_1024_cardinality_candidate
> 241M index_1_doc_8192_cardinality_baseline
> 233M index_1_doc_8192_cardinality_candidate
> 314M index_1_doc_1048576_cardinality_baseline
> 315M index_1_doc_1048576_cardinality_candidate
> 392M index_1_doc_8388608_cardinality_baseline
> 391M index_1_doc_8388608_cardinality_candidate
> {code}
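For readers less familiar with the technique this issue builds on, below is a minimal sketch of the fixed-width bit packing underlying a ForUtil-style codec. The class and method names are hypothetical and this omits Lucene's actual SIMD-oriented specializations; it only shows why block decoding beats a VInt-per-value loop: every value has the same width, so decoding is one tight, branch-light loop.

{code:java}
// Hypothetical sketch of ForUtil-style fixed-width packing (not Lucene's code).
final class SimpleForUtil {

  // Smallest bit width that can represent every delta in the block.
  static int bitsRequired(int[] deltas) {
    int max = 0;
    for (int d : deltas) {
      max |= d;
    }
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
  }

  // Pack each value into bpv consecutive bits of a long[].
  static long[] pack(int[] deltas, int bpv) {
    long[] packed = new long[(deltas.length * bpv + 63) / 64];
    for (int i = 0; i < deltas.length; i++) {
      int bit = i * bpv;
      packed[bit >>> 6] |= ((long) deltas[i]) << (bit & 63);
      if ((bit & 63) + bpv > 64) { // value straddles a word boundary
        packed[(bit >>> 6) + 1] |= ((long) deltas[i]) >>> (64 - (bit & 63));
      }
    }
    return packed;
  }

  // Unpack the whole block; tight loops like this are what the JIT can vectorize.
  static int[] unpack(long[] packed, int count, int bpv) {
    int[] deltas = new int[count];
    long mask = (1L << bpv) - 1;
    for (int i = 0; i < count; i++) {
      int bit = i * bpv;
      long v = packed[bit >>> 6] >>> (bit & 63);
      if ((bit & 63) + bpv > 64) {
        v |= packed[(bit >>> 6) + 1] << (64 - (bit & 63));
      }
      deltas[i] = (int) (v & mask);
    }
    return deltas;
  }
}
{code}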
[jira] [Created] (LUCENE-10409) Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs
Adrien Grand created LUCENE-10409:
-------------------------------------

             Summary: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs
                 Key: LUCENE-10409
                 URL: https://issues.apache.org/jira/browse/LUCENE-10409
             Project: Lucene - Core
          Issue Type: Task
            Reporter: Adrien Grand

[~gf2121] recently improved DocIdsWriter for the case when doc IDs are dense and come in the same order as values, via the CONTINUOUS_IDS and BITSET_IDS encodings. We could do the same for the case when doc IDs come in the opposite order to values. This would be used whenever searching on a field that is used for index sorting in descending order. This would be a frequent case for Elasticsearch users, as we're planning on using index sorting more and more on time-based data, with a descending sort on the timestamp as the last sort field.
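A minimal sketch of the check such a descending encoding could start from, with hypothetical names (the actual DocIdsWriter would still pick among its other encodings): when doc IDs decrease by exactly one, the whole block collapses to a start value plus a count, mirroring what CONTINUOUS_IDS does for ascending runs.

{code:java}
// Hypothetical sketch: detect a dense, strictly descending run of doc IDs.
static boolean isContinuousDescending(int[] docIds, int start, int count) {
  for (int i = 1; i < count; i++) {
    if (docIds[start + i] != docIds[start + i - 1] - 1) {
      return false;
    }
  }
  return true;
}
// The writer side would then emit only a tag byte, the first doc ID and the
// count; the reader rebuilds the run as docIds[i] = first - i for i in [0, count).
{code}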
[GitHub] [lucene] gf2121 merged pull request #652: LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil (backport 9x)
gf2121 merged pull request #652:
URL: https://github.com/apache/lucene/pull/652
[jira] [Assigned] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo reassigned LUCENE-10315:
---------------------------------

    Assignee: Feng Guo

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-10315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10315
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Feng Guo
>            Assignee: Feng Guo
>            Priority: Major
>         Attachments: addall.svg
>
>          Time Spent: 6h 10m
>  Remaining Estimate: 0h
[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487977#comment-17487977 ]

ASF subversion and git services commented on LUCENE-10315:
-----------------------------------------------------------

Commit 28ba89b8c951cc757e99eea22095375ddeb49f70 in lucene's branch refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=28ba89b ]

LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil (#541)

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-10315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10315
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Feng Guo
>            Assignee: Feng Guo
>            Priority: Major
>         Attachments: addall.svg
>
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
[jira] [Resolved] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo resolved LUCENE-10315.
-------------------------------
    Fix Version/s: 9.1
       Resolution: Fixed

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-10315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10315
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Feng Guo
>            Assignee: Feng Guo
>            Priority: Major
>             Fix For: 9.1
>
>         Attachments: addall.svg
>
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
[jira] [Created] (LUCENE-10410) Add some more tests for legacy encoding logic in DocIdsWriter
Feng Guo created LUCENE-10410:
---------------------------------

             Summary: Add some more tests for legacy encoding logic in DocIdsWriter
                 Key: LUCENE-10410
                 URL: https://issues.apache.org/jira/browse/LUCENE-10410
             Project: Lucene - Core
          Issue Type: Test
          Components: core/codecs
            Reporter: Feng Guo

This is a follow-up of LUCENE-10315: add some more tests for the legacy encoding logic in DocIdsWriter.
[GitHub] [lucene] gf2121 opened a new pull request #654: LUCENE-10410: Add more tests for legacy encoding logic in DocIdsWriter
gf2121 opened a new pull request #654:
URL: https://github.com/apache/lucene/pull/654

This is a follow-up of https://issues.apache.org/jira/browse/LUCENE-10315 (#541). It adds some more tests for the legacy encoding logic in DocIdsWriter.
[GitHub] [lucene] mocobeta commented on pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji
mocobeta commented on pull request #643:
URL: https://github.com/apache/lucene/pull/643#issuecomment-1031313230

Thanks for reviewing, I'm going to merge this. I will open another pull request to remove the obsolete methods on main.
[GitHub] [lucene] mocobeta merged pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji
mocobeta merged pull request #643:
URL: https://github.com/apache/lucene/pull/643
[jira] [Commented] (LUCENE-10400) Clean up the constructors' API signature of dictionary classes in kuromoji and nori
[ https://issues.apache.org/jira/browse/LUCENE-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488015#comment-17488015 ]

ASF subversion and git services commented on LUCENE-10400:
-----------------------------------------------------------

Commit e7546c2427e9b82eb4a4632a992866e5436d97a4 in lucene's branch refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e7546c2 ]

LUCENE-10400: revise binary dictionaries' constructor in kuromoji (#643)

> Clean up the constructors' API signature of dictionary classes in kuromoji and nori
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-10400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10400
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>          Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> It was suggested in a few issue/PR comments:
> * do not delegate loading class resources to other classes
> * do not allow switching the location (classpath/file path) of the resource via a constructor parameter
> Before working on LUCENE-8816 or LUCENE-10393, we'd need to sort out the protected constructor APIs.
[jira] [Commented] (LUCENE-10400) Clean up the constructors' API signature of dictionary classes in kuromoji and nori
[ https://issues.apache.org/jira/browse/LUCENE-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488020#comment-17488020 ]

ASF subversion and git services commented on LUCENE-10400:
-----------------------------------------------------------

Commit e4ad3c794fc147c2827ef258e95e8a3fac765d40 in lucene's branch refs/heads/branch_9x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e4ad3c7 ]

LUCENE-10400: revise binary dictionaries' constructor in kuromoji (#643)

> Clean up the constructors' API signature of dictionary classes in kuromoji and nori
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-10400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10400
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>          Time Spent: 7.5h
>  Remaining Estimate: 0h
[GitHub] [lucene-solr] noblepaul closed pull request #2638: 174234: We have observed that number of threads increases on one solr node after rebooting the solr node
noblepaul closed pull request #2638:
URL: https://github.com/apache/lucene-solr/pull/2638
[GitHub] [lucene] mocobeta opened a new pull request #655: LUCENE-10400: cleanup obsolete APIs in kuromoji
mocobeta opened a new pull request #655:
URL: https://github.com/apache/lucene/pull/655

1. Remove deprecated constructors
2. Add / change tests for new constructors
[GitHub] [lucene] mocobeta commented on pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji
mocobeta commented on pull request #643:
URL: https://github.com/apache/lucene/pull/643#issuecomment-1031541054

I opened this https://github.com/apache/lucene/pull/655/files. I think the diff would be obvious - will merge it tomorrow.
[GitHub] [lucene] mocobeta edited a comment on pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji
mocobeta edited a comment on pull request #643:
URL: https://github.com/apache/lucene/pull/643#issuecomment-1031541054

I opened this https://github.com/apache/lucene/pull/655. I think the diff would be obvious - will merge it tomorrow.
[GitHub] [lucene] mocobeta edited a comment on pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji
mocobeta edited a comment on pull request #643:
URL: https://github.com/apache/lucene/pull/643#issuecomment-1031541054

I opened https://github.com/apache/lucene/pull/655. I think the diff would be obvious - will merge it tomorrow.
[GitHub] [lucene] jtibshirani commented on pull request #645: Rename KnnGraphValues -> HnswGraph
jtibshirani commented on pull request #645:
URL: https://github.com/apache/lucene/pull/645#issuecomment-1031710996

Thanks for the review! My understanding is that `git mv` is the same as `git rm` and `git add`. It doesn't give special information to git (instead git automatically detects renames by comparing the file contents). So I am not sure it would help 🤔 I guess an alternative would be to merge it as two separate commits?
[GitHub] [lucene] jpountz commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors
jpountz commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1031754530

Optimizing for the case when all docs have a value makes sense to me.

> for a case when only certain documents have vectors, we do delta encoding of doc Ids.

In the past we rejected changes that would consist of having the data written in a compressed fashion on disk but still uncompressed in memory. I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays into memory? Or possibly switch to something like DirectMonotonicReader, if it doesn't slow down searches.
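For context, here is a minimal sketch of the delta (gap) encoding under discussion. It uses Lucene's real `DataOutput`/`DataInput` VInt helpers but is otherwise an illustration, not the actual vectors format; it shows the trade-off the comment raises: the on-disk form is compact because gaps between sorted doc IDs are small, but decoding materializes the whole array in memory.

```
import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

class DocIdDeltas {
  // Write each doc ID as a gap from the previous one; sorted IDs give small
  // gaps, which VInt stores in few bytes.
  static void writeDeltas(int[] sortedDocIds, DataOutput out) throws IOException {
    int prev = 0;
    for (int doc : sortedDocIds) {
      out.writeVInt(doc - prev);
      prev = doc;
    }
  }

  // Reading requires decoding the whole list up front into an int[] -- the
  // "giant arrays in memory" pattern the comment questions.
  static int[] readDeltas(DataInput in, int count) throws IOException {
    int[] docIds = new int[count];
    int prev = 0;
    for (int i = 0; i < count; i++) {
      prev += in.readVInt();
      docIds[i] = prev;
    }
    return docIds;
  }
}
```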
[GitHub] [lucene-solr] thelabdude opened a new pull request #2639: SOLR-15587: Don't use the UrlScheme singleton on the client-side
thelabdude opened a new pull request #2639:
URL: https://github.com/apache/lucene-solr/pull/2639

Backport of https://github.com/apache/solr/pull/600 to 8_11 branch
[GitHub] [lucene] dweiss commented on pull request #645: Rename KnnGraphValues -> HnswGraph
dweiss commented on pull request #645:
URL: https://github.com/apache/lucene/pull/645#issuecomment-1031849263

> It doesn't give special information to git (instead git automatically detects renames by comparing the file contents).

Correct.
[GitHub] [lucene] jtibshirani merged pull request #645: Rename KnnGraphValues -> HnswGraph
jtibshirani merged pull request #645:
URL: https://github.com/apache/lucene/pull/645
[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API
[ https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488470#comment-17488470 ]

Vigya Sharma commented on LUCENE-10216:
---------------------------------------

Had some thoughts and questions about the transaction model for this API... Currently, it either succeeds and adds all provided readers, or fails and adds none of them. With a merge policy splitting the provided readers into groups of smaller {{OneMerge}} objects, this is slightly harder to implement: a OneMerge on a subset of readers may complete in the background and add itself to the writer's segment infos, while some others running in parallel fail.

One approach could be to expose this failure information to the user - the exception could contain the lists of merged and pending readers. This would simplify the overall implementation. My current thinking, however, is that the present transaction logic is important: it is hard for users to parse the exception message, figure out which readers are pending, and retry them, as opposed to retrying an entire API call (with all the readers), which their upstream system probably understands as a single unit. Still, I wanted to check whether loosening the transaction model for this API is a palatable approach.

To retain the all-or-none, single-transaction model, I am thinking that we can join on all merges at the end of the {{addIndexes()}} API and then write their segment info files under a common lock. Would like to hear more thoughts or suggestions on this.

> Add concurrency to addIndexes(CodecReader…) API
> -----------------------------------------------
>
>                 Key: LUCENE-10216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10216
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Vigya Sharma
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the e-commerce platform. I'm working on a project that involves applying metadata+ETL transforms and indexing documents on n different _indexing_ boxes, combining them into a single index on a separate _reducer_ box, and making it available for queries on m different _search_ boxes (replicas). Segments are asynchronously copied from indexers to reducers to searchers as they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the reducer boxes. Since we also have taxonomy data, we need to remap facet field ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version of this API. The API leverages {{SegmentMerger.merge()}} to create segments with new ordinal values while also merging all provided segments in the process.
> _This is however a blocking call that runs in a single thread._ Until we have written segments with new ordinal values, we cannot copy them to searcher boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, each with only a single reader, creating a concurrently running 1:1 conversion from old segments to new ones (with new ordinal values). We follow this up with non-blocking background merges. This lets us copy the segments to searchers and replicas as soon as they are available, and later replace them with merged segments as background jobs complete. On the Amazon dataset I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. Each call was given about 5 readers to add on average.
> This might be a useful addition to Lucene. We could create another {{addIndexes()}} API with a {{boolean}} flag for concurrency that internally submits multiple merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}} and waits for them to complete before returning.
> While this is doable from outside Lucene by using your own thread pool - starting multiple addIndexes() calls and waiting for them to complete - I felt it needs some understanding of what addIndexes does, why you need to wait on the merge, and why it makes sense to pass a single reader to the addIndexes API.
> Out-of-the-box support in Lucene could simplify this for folks with a similar use case.
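A minimal sketch of the external workaround the issue describes, assuming the IndexWriter and readers are created elsewhere; the method name and pool sizing are illustrative only. Note that, per the transaction discussion above, this fan-out keeps each call individually transactional but not the batch as a whole: some calls can land while others fail.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.IndexWriter;

class ConcurrentAddIndexes {
  // One blocking addIndexes(CodecReader...) call per reader, fanned out over
  // a thread pool, then joined.
  static void addIndexesConcurrently(IndexWriter writer, List<CodecReader> readers)
      throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, readers.size()));
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (CodecReader reader : readers) {
        futures.add(executor.submit(() -> {
          writer.addIndexes(reader); // merges this single reader into the writer
          return null;
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // surfaces the first failure, if any
      }
    } finally {
      executor.shutdown();
    }
  }
}
{code}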
[GitHub] [lucene] jtibshirani opened a new pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery
jtibshirani opened a new pull request #656:
URL: https://github.com/apache/lucene/pull/656

This PR adds support for a query filter in KnnVectorQuery. First, we gather the query results for each leaf as a bit set. Then the HNSW search skips over the non-matching documents (using the same approach as for live docs). To prevent HNSW search from visiting too many documents when the filter is very selective, we short-circuit if HNSW has already visited more than the number of documents that match the filter, and execute an exact search instead. This bounds the number of visited documents at roughly 2x the cost of just running the exact filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein
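As a rough illustration of the exact-search fallback, here is a self-contained brute-force top-k over a filter bit set. This is hypothetical scaffolding over plain arrays, not the PR's code; it only shows that once the filter's cardinality is small, scanning exactly the filtered docs is cheap and gives perfect recall.

```
import java.util.BitSet;
import java.util.PriorityQueue;

class ExactKnn {
  // Brute-force top-k by dot product, scoring only the docs set in `filter`.
  static int[] exactSearch(float[][] vectors, float[] query, BitSet filter, int k) {
    // Min-heap on score: the worst of the current top-k sits on top.
    PriorityQueue<float[]> topK = new PriorityQueue<>((a, b) -> Float.compare(a[1], b[1]));
    for (int doc = filter.nextSetBit(0); doc >= 0; doc = filter.nextSetBit(doc + 1)) {
      float score = dotProduct(vectors[doc], query);
      if (topK.size() < k) {
        topK.offer(new float[] {doc, score});
      } else if (score > topK.peek()[1]) {
        topK.poll();
        topK.offer(new float[] {doc, score});
      }
    }
    int[] result = new int[topK.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) topK.poll()[0]; // best-scoring doc ends up first
    }
    return result;
  }

  static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```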
[GitHub] [lucene] mocobeta merged pull request #655: LUCENE-10400: cleanup obsolete APIs in kuromoji
mocobeta merged pull request #655:
URL: https://github.com/apache/lucene/pull/655
[GitHub] [lucene] mocobeta commented on pull request #655: LUCENE-10400: cleanup obsolete APIs in kuromoji
mocobeta commented on pull request #655:
URL: https://github.com/apache/lucene/pull/655#issuecomment-1032088144

I'll backport the test to 9.x.
[GitHub] [lucene-solr] thelabdude merged pull request #2639: SOLR-15587: Don't use the UrlScheme singleton on the client-side
thelabdude merged pull request #2639:
URL: https://github.com/apache/lucene-solr/pull/2639
[jira] [Commented] (LUCENE-10400) Clean up the constructors' API signature of dictionary classes in kuromoji and nori
[ https://issues.apache.org/jira/browse/LUCENE-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488511#comment-17488511 ]

ASF subversion and git services commented on LUCENE-10400:
-----------------------------------------------------------

Commit 20f7f33c8d21b94c15887912e80d068970fd095f in lucene's branch refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=20f7f33 ]

LUCENE-10400: cleanup obsolete APIs in kuromoji (#655)

> Clean up the constructors' API signature of dictionary classes in kuromoji and nori
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-10400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10400
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>          Time Spent: 8.5h
>  Remaining Estimate: 0h
[GitHub] [lucene] jtibshirani commented on pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery
jtibshirani commented on pull request #656:
URL: https://github.com/apache/lucene/pull/656#issuecomment-1032109021

I tried out the idea around stopping the HNSW search early if it visits too many docs. To test, I modified `KnnGraphTester` to create `acceptDocs` uniformly at random with a certain selectivity, then measured recall and QPS. Here are the results on glove-100-angular (~1.2 million docs) with a filter selectivity of 0.01:

**Baseline**
```
k     Recall   VisitedDocs   QPS
10    0.774    15957         232.083
50    0.930    63429         58.994
80    0.958    95175         42.470
100   0.967    118891        35.203
500   0.997    1176237       8.136
800   0.999    1183514       5.571
```

**PR**
```
k     Recall   VisitedDocs   QPS
10    1.000    22908         190.286
50    1.000    23607         152.224
80    1.000    23608         148.036
100   1.000    23608         145.381
500   1.000    23608         138.903
800   1.000    23608         137.882
```

Since the filter is so selective, HNSW always visits more than 1% of the docs. The adaptive logic in the PR decides to stop the search and switch to an exact search, which bounds the visited docs at 2%. For `k=10` this makes the QPS a little worse, but overall it prevents QPS from degrading (with the side benefit of perfect recall). I also tested with less restrictive filters, and in these cases the fallback just doesn't kick in, so the QPS remains the same as before.

Overall I like this approach because it doesn't require us to fiddle with thresholds or expose new parameters. It could also help make HNSW more robust in "pathological" cases where, even when the filter is not that selective, all the nearest vectors to a query happen to be filtered away.
[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery
jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r801192735

## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
@@ -227,16 +231,36 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs) thro
     // bound k by total number of vectors to prevent oversizing data structures
     k = Math.min(k, fieldEntry.size());
-    OffHeapVectorValues vectorValues = getOffHeapVectorValues(fieldEntry);
+
+    DocIdSetIterator acceptIterator = null;
+    int visitedLimit = Integer.MAX_VALUE;
+
+    if (acceptDocs instanceof BitSet acceptBitSet) {

Review comment: This is a temporary hack since I wasn't sure about the right design. I could see a couple of possibilities:

1. Add a new `BitSet filter` parameter to `searchNearestVectors`, keeping the fallback logic within the HNSW classes.
2. Add a new `int visitedLimit` parameter to `LeafReader#searchNearestVectors`. Pull the "exact search" logic up into `KnnVectorQuery`. (A possible shape is sketched below.)

Which option is better probably depends on how other algorithms would handle filtering (which I am not sure about), and also on whether we think `visitedLimit` is useful in other contexts. I also played around with having `searchNearestVectors` take a `Collector` and using `CollectionTerminatedException`... but couldn't really see how this made sense.
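For concreteness, a hypothetical rendering of what option 2's signature could look like; this is not an existing Lucene API, just a sketch of the shape under discussion.

```
import java.io.IOException;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Bits;

interface VectorSearcher {
  // Hypothetical: the search stops after visiting `visitedLimit` docs, letting
  // the caller (e.g. KnnVectorQuery) detect early termination and fall back to
  // an exact search over the filtered docs.
  TopDocs searchNearestVectors(String field, float[] target, int k, Bits acceptDocs, int visitedLimit)
      throws IOException;
}
```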
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488527#comment-17488527 ]

Julie Tibshirani commented on LUCENE-10382:
-------------------------------------------

I had some time to try out the dynamic check I mentioned, and it seems to work. I opened a PR here that builds off Joel's change: https://github.com/apache/lucene/pull/656. It's a draft because there are still some big open API questions. Looking forward to hearing your feedback!

> Allow KnnVectorQuery to operate over a subset of liveDocs
> ---------------------------------------------------------
>
>                 Key: LUCENE-10382
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10382
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 9.0
>            Reporter: Joel Bernstein
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs. This ticket will change the interface to make it possible for the top K vectors to be selected from a subset of the live docs.
[jira] [Commented] (LUCENE-10409) Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs
[ https://issues.apache.org/jira/browse/LUCENE-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488607#comment-17488607 ]

Feng Guo commented on LUCENE-10409:
-----------------------------------

+1, great idea! I'd like to take this on, if you agree.

> Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-10409
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10409
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
[jira] [Updated] (LUCENE-10367) Use WANDScorer in CoveringQuery Can accelerate scorer time
[ https://issues.apache.org/jira/browse/LUCENE-10367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LuYunCheng updated LUCENE-10367:
--------------------------------
    Status: Open  (was: Patch Available)

> Use WANDScorer in CoveringQuery Can accelerate scorer time
> ----------------------------------------------------------
>
>                 Key: LUCENE-10367
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10367
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/query/scoring, core/search, modules/sandbox
>            Reporter: LuYunCheng
>            Priority: Major
>         Attachments: LUCENE-10367.patch, TestCoveringQueryBench.java
>
> When using CoveringQuery in Elasticsearch with a terms_set query, too much time is spent in CoveringScorer, and the major cost is maintaining the DisiPriorityQueue of subScorers.
> But when minimumNumberMatch is a ConstantLongValuesSource, we can use WANDScorer to optimize it.
> I ran a mini benchmark with 1M docs; the code is in LUCENE-10367.patch (TestCoveringQuery.java, testRandomBench()). It shows:
> TEST: WAND elapsed 67ms
> TEST: NOWAND elapsed 163ms
> My testing environment is a MacBook with an Intel Core i7 and 16 GB of memory.
[jira] [Assigned] (LUCENE-10410) Add some more tests for legacy encoding logic in DocIdsWriter
[ https://issues.apache.org/jira/browse/LUCENE-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo reassigned LUCENE-10410:
---------------------------------

    Assignee: Feng Guo

> Add some more tests for legacy encoding logic in DocIdsWriter
> -------------------------------------------------------------
>
>                 Key: LUCENE-10410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10410
>             Project: Lucene - Core
>          Issue Type: Test
>          Components: core/codecs
>            Reporter: Feng Guo
>            Assignee: Feng Guo
>            Priority: Trivial
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up of LUCENE-10315: add some more tests for the legacy encoding logic in DocIdsWriter.
[jira] [Commented] (LUCENE-10378) Implement Weight#count on PointRangeQuery
[ https://issues.apache.org/jira/browse/LUCENE-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488645#comment-17488645 ]

Adrien Grand commented on LUCENE-10378:
---------------------------------------

[~gworah] Have you had a chance to start working on something? I'm also interested in this change, which I'd like to use for range faceting.

> Implement Weight#count on PointRangeQuery
> -----------------------------------------
>
>                 Key: LUCENE-10378
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10378
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>
> When there are no deletions and the field is single-valued (docCount == size) and has a single dimension, PointRangeQuery could implement {{Weight#count}} by only counting matches on the two leaves that cross the query range.
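A one-dimensional analogue may help build intuition. The sketch below is an illustration over a plain sorted array, not the BKD code: counting matches in [min, max] needs exact work only at the two boundaries (two binary searches), just as a BKD tree would need exact work only on the two crossing leaves while fully-contained leaves contribute their cardinality from metadata.

{code:java}
import java.util.Arrays;

class RangeCount {
  // Count values in [min, max] over sorted data; assumes max < Long.MAX_VALUE.
  static int countInRange(long[] sortedValues, long min, long max) {
    int lo = lowerBound(sortedValues, min);     // first index >= min
    int hi = lowerBound(sortedValues, max + 1); // first index > max
    return hi - lo;
  }

  static int lowerBound(long[] a, long key) {
    int idx = Arrays.binarySearch(a, key);
    if (idx < 0) {
      return -idx - 1; // insertion point
    }
    // binarySearch gives no guarantee which duplicate it finds; walk left.
    while (idx > 0 && a[idx - 1] == key) {
      idx--;
    }
    return idx;
  }
}
{code}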
[GitHub] [lucene] jpountz commented on a change in pull request #654: LUCENE-10410: Add more tests for legacy decoding logic in DocIdsWriter
jpountz commented on a change in pull request #654:
URL: https://github.com/apache/lucene/pull/654#discussion_r801340328

## File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
@@ -110,7 +111,11 @@ private void test(Directory dir, int[] ints) throws Exception {
     final long len;
     DocIdsWriter docIdsWriter = new DocIdsWriter(ints.length);
     try (IndexOutput out = dir.createOutput("tmp", IOContext.DEFAULT)) {
-      docIdsWriter.writeDocIds(ints, 0, ints.length, out);
+      if (rarely()) {
+        legacyWriteDocIds(ints, 0, ints.length, out);
+      } else {
+        docIdsWriter.writeDocIds(ints, 0, ints.length, out);
+      }

Review comment: Can you extract it to a separate test, so that we'd have both a `testLegacy` and `test`? I think it'd make it easier to get a sense of where the bug lies when we see a test failure.