[GitHub] [lucene] gf2121 merged pull request #653: LUCENE-10315: add CHANGES for #541

2022-02-07 Thread GitBox


gf2121 merged pull request #653:
URL: https://github.com/apache/lucene/pull/653


   





[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487952#comment-17487952
 ] 

ASF subversion and git services commented on LUCENE-10315:
--

Commit e93b08f47160a6550cb559bd7e3786195ced88a5 in lucene's branch 
refs/heads/main from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e93b08f ]

LUCENE-10315: Add CHANGES for #541 (#653)



> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Priority: Major
> Attachments: addall.svg
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for 
> users with its dynamic mapping feature. When users index low-cardinality 
> fields such as gender / age / status, they often use numbers to represent 
> the values, so ES infers these fields as {{long}}, and ES uses the BKD tree 
> as the index for {{long}} fields. As the data volume grows, building the 
> result set for low-cardinality fields drives CPU usage and load very high.
> This is a flame graph we obtained from the production environment:
> [^addall.svg]
> It can be seen that almost all CPU time is spent in addAll. When we reindexed 
> {{long}} to {{keyword}}, the cluster load and search latency dropped greatly 
> (we spent weeks reindexing all indices...). I know the ES documentation 
> recommends {{keyword}} for term/terms queries and {{long}} for range queries, 
> but there are always users who don't realize this and keep their SQL-database 
> habits, or dynamic mapping picks the type for them. All in all, users won't 
> expect such a big performance difference between {{long}} and {{keyword}} on 
> low-cardinality fields, so in my view it makes sense to make BKD work better 
> for low/medium-cardinality fields.
> As far as I can see, for low-cardinality fields {{keyword}} has two advantages 
> over {{long}}:
> 1. The {{ForUtil}} used in {{keyword}} postings is much more efficient than 
> BKD's delta-VInt, thanks to batch reads (readLongs) and SIMD decoding.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily 
> materialize its result set, so when another small clause intersects with the 
> low-cardinality condition, the field can avoid reading all docIds into memory.
> This issue targets the first point. The basic idea is to use a 512-int 
> {{ForUtil}} for the BKD ids codec. I benchmarked this optimization by mocking 
> random {{LongPoint}} values and querying them with {{PointInSetQuery}}.
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |1|32|1|51.44|148.26|188.22%|
> |1|32|2|26.8|101.88|280.15%|
> |1|32|4|14.04|53.52|281.20%|
> |1|32|8|7.04|28.54|305.40%|
> |1|32|16|3.54|14.61|312.71%|
> |1|128|1|110.56|350.26|216.81%|
> |1|128|8|16.6|89.81|441.02%|
> |1|128|16|8.45|48.07|468.88%|
> |1|128|32|4.2|25.35|503.57%|
> |1|128|64|2.13|13.02|511.27%|
> |1|1024|1|536.19|843.88|57.38%|
> |1|1024|8|109.71|251.89|129.60%|
> |1|1024|32|33.24|104.11|213.21%|
> |1|1024|128|8.87|30.47|243.52%|
> |1|1024|512|2.24|8.3|270.54%|
> |1|8192|1|3333.33|5000|50.00%|
> |1|8192|32|139.47|214.59|53.86%|
> |1|8192|128|54.59|109.23|100.09%|
> |1|8192|512|15.61|36.15|131.58%|
> |1|8192|2048|4.11|11.14|171.05%|
> |1|1048576|1|2597.4|3030.3|16.67%|
> |1|1048576|32|314.96|371.75|18.03%|
> |1|1048576|128|99.7|116.28|16.63%|
> |1|1048576|512|30.5|37.15|21.80%|
> |1|1048576|2048|10.38|12.3|18.50%|
> |1|8388608|1|2564.1|3174.6|23.81%|
> |1|8388608|32|196.27|238.95|21.75%|
> |1|8388608|128|55.36|68.03|22.89%|
> |1|8388608|512|15.58|19.24|23.49%|
> |1|8388608|2048|4.56|5.71|25.22%|
> Index size is reduced for low-cardinality fields and flat for high-cardinality 
> fields.
> {code:java}
> 113M    index_1_doc_32_cardinality_baseline
> 114M    index_1_doc_32_cardinality_candidate
> 140M    index_1_doc_128_cardinality_baseline
> 133M    index_1_doc_128_cardinality_candidate
> 193M    index_1_doc_1024_cardinality_baseline
> 174M    index_1_doc_1024_cardinality_candidate
> 241M    index_1_doc_8192_cardinality_baseline
> 233M    index_1_doc_8192_cardinality_candidate
> 314M    index_1_doc_1048576_cardinality_baseline
> 315M    index_1_doc_1048576_cardinality_candidate
> 392M    index_1_doc_8388608_cardinality_baseline
> 391M    index_1_doc_8388608_cardinality_candidate
> {code}
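
The gain described above comes from replacing BKD's one-value-at-a-time delta-VInt decode with block-wise, fixed-bit-width packing. A rough, self-contained sketch of the packing idea follows (a hypothetical class, far simpler than Lucene's generated 512-int ForUtil, which also reads whole longs and benefits from SIMD):

{code:java}
// Sketch only: every value in a block shares one bit width, so decoding is a
// tight, branch-free loop instead of a data-dependent varint parse per value.
public class BlockPackingSketch {

  // Pack non-negative ints, each fitting in `bits` bits, into a long[].
  static long[] pack(int[] values, int bits) {
    long[] packed = new long[(values.length * bits + 63) / 64];
    for (int i = 0; i < values.length; i++) {
      int bitPos = i * bits;
      int word = bitPos >>> 6, shift = bitPos & 63;
      packed[word] |= (long) values[i] << shift;
      if (shift + bits > 64) { // value straddles two words
        packed[word + 1] |= (long) values[i] >>> (64 - shift);
      }
    }
    return packed;
  }

  static int unpack(long[] packed, int bits, int i) {
    int bitPos = i * bits;
    int word = bitPos >>> 6, shift = bitPos & 63;
    long v = packed[word] >>> shift;
    if (shift + bits > 64) {
      v |= packed[word + 1] << (64 - shift);
    }
    return (int) (v & ((1L << bits) - 1));
  }

  public static void main(String[] args) {
    int[] docIds = {3, 17, 42, 511, 12, 255};
    int bits = 9; // the max value (511) needs 9 bits; one width for the block
    long[] packed = pack(docIds, bits);
    for (int i = 0; i < docIds.length; i++) {
      System.out.println(unpack(packed, bits, i)); // round-trips each value
    }
  }
}
{code}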

[jira] [Created] (LUCENE-10409) Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

2022-02-07 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10409:
-

 Summary: Improve BKDWriter's DocIdsWriter to better encode 
decreasing sequences of doc IDs
 Key: LUCENE-10409
 URL: https://issues.apache.org/jira/browse/LUCENE-10409
 Project: Lucene - Core
  Issue Type: Task
Reporter: Adrien Grand


[~gf2121] recently improved DocIdsWriter for the case when doc IDs are dense 
and come in the same order as values via the CONTINUOUS_IDS and BITSET_IDS 
encodings.

We could do the same for the case when doc IDs come in the opposite order to 
values. This would be used whenever searching on a field that is used for index 
sorting in the descending order. This would be a frequent case for 
Elasticsearch users as we're planning on using index sorting more and more on 
time-based data with a descending sort on the timestamp as the last sort field.
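
A minimal sketch of one way to do this (hypothetical code, not the actual DocIdsWriter change): a strictly decreasing run of doc IDs becomes an increasing run when read back-to-front, so the delta-style encodings already used for increasing runs apply after reversal.

{code:java}
final class DecreasingDocIdsSketch {

  static boolean isStrictlyDecreasing(int[] docIds) {
    for (int i = 1; i < docIds.length; i++) {
      if (docIds[i] >= docIds[i - 1]) {
        return false;
      }
    }
    return true;
  }

  // First entry is the smallest doc ID; the rest are positive gaps, which
  // are small on average and compress well with vInt or bit packing.
  static int[] reversedDeltas(int[] docIds) {
    int n = docIds.length;
    int[] deltas = new int[n];
    deltas[0] = docIds[n - 1];
    for (int i = 1; i < n; i++) {
      deltas[i] = docIds[n - 1 - i] - docIds[n - i];
    }
    return deltas;
  }
}
{code}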






[GitHub] [lucene] gf2121 merged pull request #652: LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil (backport 9x)

2022-02-07 Thread GitBox


gf2121 merged pull request #652:
URL: https://github.com/apache/lucene/pull/652


   





[jira] [Assigned] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-07 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo reassigned LUCENE-10315:
-

Assignee: Feng Guo

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Attachments: addall.svg
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>

[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487977#comment-17487977
 ] 

ASF subversion and git services commented on LUCENE-10315:
--

Commit 28ba89b8c951cc757e99eea22095375ddeb49f70 in lucene's branch 
refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=28ba89b ]

LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil (#541)



> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Attachments: addall.svg
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>

[jira] [Resolved] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-07 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo resolved LUCENE-10315.
---
Fix Version/s: 9.1
   Resolution: Fixed

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Fix For: 9.1
>
> Attachments: addall.svg
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>

[jira] [Created] (LUCENE-10410) Add some more tests for legacy encoding logic in DocIdsWriter

2022-02-07 Thread Feng Guo (Jira)
Feng Guo created LUCENE-10410:
-

 Summary: Add some more tests for legacy encoding logic in 
DocIdsWriter
 Key: LUCENE-10410
 URL: https://issues.apache.org/jira/browse/LUCENE-10410
 Project: Lucene - Core
  Issue Type: Test
  Components: core/codecs
Reporter: Feng Guo


This is a follow-up of LUCENE-10315: add some more tests for the legacy 
encoding logic in DocIdsWriter.






[GitHub] [lucene] gf2121 opened a new pull request #654: LUCENE-10410: Add more tests for legacy encoding logic in DocIdsWriter

2022-02-07 Thread GitBox


gf2121 opened a new pull request #654:
URL: https://github.com/apache/lucene/pull/654


   This is a follow-up of https://issues.apache.org/jira/browse/LUCENE-10315 
(#541).
   
   Add some more tests for the legacy encoding logic in DocIdsWriter.





[GitHub] [lucene] mocobeta commented on pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji

2022-02-07 Thread GitBox


mocobeta commented on pull request #643:
URL: https://github.com/apache/lucene/pull/643#issuecomment-1031313230


   Thanks for reviewing, I'm going to merge this. I will open another pull 
request to remove the obsolete methods on main.





[GitHub] [lucene] mocobeta merged pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji

2022-02-07 Thread GitBox


mocobeta merged pull request #643:
URL: https://github.com/apache/lucene/pull/643


   





[jira] [Commented] (LUCENE-10400) Clean up the constructors' API signature of dictionary classes in kuromoji and nori

2022-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488015#comment-17488015
 ] 

ASF subversion and git services commented on LUCENE-10400:
--

Commit e7546c2427e9b82eb4a4632a992866e5436d97a4 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e7546c2 ]

LUCENE-10400: revise binary dictionaries' constructor in kuromoji (#643)



> Clean up the constructors' API signature of dictionary classes in kuromoji 
> and nori
> ---
>
> Key: LUCENE-10400
> URL: https://issues.apache.org/jira/browse/LUCENE-10400
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> It was suggested in a few issue/PR comments:
>  * do not delegate loading class resources to other classes
>  * do not allow switching the location (classpath/file path) of the resource 
> via a constructor parameter
> Before working on LUCENE-8816 or LUCENE-10393, we'd need to sort out the 
> protected constructor APIs.






[jira] [Commented] (LUCENE-10400) Clean up the constructors' API signature of dictionary classes in kuromoji and nori

2022-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488020#comment-17488020
 ] 

ASF subversion and git services commented on LUCENE-10400:
--

Commit e4ad3c794fc147c2827ef258e95e8a3fac765d40 in lucene's branch 
refs/heads/branch_9x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e4ad3c7 ]

LUCENE-10400: revise binary dictionaries' constructor in kuromoji (#643)



> Clean up the constructors' API signature of dictionary classes in kuromoji 
> and nori
> ---
>
> Key: LUCENE-10400
> URL: https://issues.apache.org/jira/browse/LUCENE-10400
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>






[GitHub] [lucene-solr] noblepaul closed pull request #2638: 174234: We have observed that number of threads increases on one solr node after rebooting the solr node

2022-02-07 Thread GitBox


noblepaul closed pull request #2638:
URL: https://github.com/apache/lucene-solr/pull/2638


   





[GitHub] [lucene] mocobeta opened a new pull request #655: LUCENE-10400: cleanup obsolete APIs in kuromoji

2022-02-07 Thread GitBox


mocobeta opened a new pull request #655:
URL: https://github.com/apache/lucene/pull/655


   1. Remove deprecated constructors
   2. Add / change tests for new constructors





[GitHub] [lucene] mocobeta commented on pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji

2022-02-07 Thread GitBox


mocobeta commented on pull request #643:
URL: https://github.com/apache/lucene/pull/643#issuecomment-1031541054


   I opened this https://github.com/apache/lucene/pull/655/files. 
   I think the diff would be obvious - will merge it tomorrow.





[GitHub] [lucene] mocobeta edited a comment on pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji

2022-02-07 Thread GitBox


mocobeta edited a comment on pull request #643:
URL: https://github.com/apache/lucene/pull/643#issuecomment-1031541054


   I opened this https://github.com/apache/lucene/pull/655. 
   I think the diff would be obvious - will merge it tomorrow.





[GitHub] [lucene] mocobeta edited a comment on pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji

2022-02-07 Thread GitBox


mocobeta edited a comment on pull request #643:
URL: https://github.com/apache/lucene/pull/643#issuecomment-1031541054


   I opened https://github.com/apache/lucene/pull/655. 
   I think the diff would be obvious - will merge it tomorrow.





[GitHub] [lucene] jtibshirani commented on pull request #645: Rename KnnGraphValues -> HnswGraph

2022-02-07 Thread GitBox


jtibshirani commented on pull request #645:
URL: https://github.com/apache/lucene/pull/645#issuecomment-1031710996


   Thanks for the review! My understanding is that `git mv` is the same as `git 
rm` and `git add`. It doesn't give special information to git (instead git 
automatically detects renames by comparing the file contents). So I am not sure 
it would help 🤔  I guess an alternative would be to merge it as two separate 
commits?





[GitHub] [lucene] jpountz commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

2022-02-07 Thread GitBox


jpountz commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1031754530


   Optimizing for the case when all docs have a value makes sense to me.
   
   > for a case when only certain documents have vectors, we do delta encoding 
of doc Ids.
   
   In the past we rejected changes that would consist of having the data 
written in a compressed fashion on disk but still uncompressed in memory.
   
   I wonder if it would be a better trade-off to keep ints uncompressed, but 
read them from disk directly instead of loading giant arrays in memory? Or 
possibly switch to something like DirectMonotonicReader if it doesn't slow down 
searches.
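
   A tiny sketch of that alternative (a hypothetical reader using plain NIO 
rather than Lucene's IndexInput): leave the ints uncompressed on disk and fetch 
them on demand, so no giant heap array is ever materialized.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class OnDiskDocIds implements AutoCloseable {
  private final FileChannel channel;
  private final ByteBuffer scratch = ByteBuffer.allocate(Integer.BYTES);

  OnDiskDocIds(Path file) throws IOException {
    this.channel = FileChannel.open(file, StandardOpenOption.READ);
  }

  // Random access: one small positioned read per lookup, no heap copy of all ids.
  int docId(long index) throws IOException {
    scratch.clear();
    channel.read(scratch, index * Integer.BYTES);
    scratch.flip();
    return scratch.getInt();
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }
}
```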





[GitHub] [lucene-solr] thelabdude opened a new pull request #2639: SOLR-15587: Don't use the UrlScheme singleton on the client-side

2022-02-07 Thread GitBox


thelabdude opened a new pull request #2639:
URL: https://github.com/apache/lucene-solr/pull/2639


   Backport of https://github.com/apache/solr/pull/600 to 8_11 branch





[GitHub] [lucene] dweiss commented on pull request #645: Rename KnnGraphValues -> HnswGraph

2022-02-07 Thread GitBox


dweiss commented on pull request #645:
URL: https://github.com/apache/lucene/pull/645#issuecomment-1031849263


   > It doesn't give special information to git (instead git automatically 
detects renames by comparing the file contents).
   
   Correct. 





[GitHub] [lucene] jtibshirani merged pull request #645: Rename KnnGraphValues -> HnswGraph

2022-02-07 Thread GitBox


jtibshirani merged pull request #645:
URL: https://github.com/apache/lucene/pull/645


   





[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-02-07 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488470#comment-17488470
 ] 

Vigya Sharma commented on LUCENE-10216:
---

Had some thoughts and questions about the transaction model for this API... 
Currently, it either succeeds and adds all provided readers, or fails and adds 
none of them.

With a merge policy splitting the provided readers into groups of smaller 
{{OneMerge}} objects, this is slightly harder to implement. A OneMerge on a 
subset of readers may complete in the background and add itself to the writer's 
segment infos, while others running in parallel could fail.

One approach could be to expose this failure information to the user - the 
exception can contain the list of readers merged and pending. This could 
simplify the overall implementation.

My current thoughts, however, are that the present transaction logic is 
important. It is hard for users to parse the exception message, figure out 
which readers are pending, and retry them, as opposed to retrying the entire 
API call (with all the readers), which their upstream system probably treats 
as a single unit. However, I wanted to check if loosening the transaction model 
for this API is a palatable approach.

To retain the all-or-none, single-transaction model, I am thinking that we can 
join on all merges at the end of the {{addIndexes()}} API, and then write their 
segment info files under a common lock, as sketched below.

Would like to hear more thoughts or suggestions on this.
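
A minimal sketch of that join-then-publish idea (hypothetical types, not the 
actual IndexWriter internals):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class AllOrNoneMerges {
  // Run per-reader merges concurrently, but only hand results back for
  // publishing once every merge has succeeded; if any failed, publish nothing.
  static <T> List<T> mergeAllOrNone(ExecutorService pool, List<Callable<T>> merges)
      throws InterruptedException, ExecutionException {
    List<Future<T>> futures = pool.invokeAll(merges); // waits for all tasks
    List<T> results = new ArrayList<>();
    for (Future<T> f : futures) {
      results.add(f.get()); // throws if that merge failed -> nothing published
    }
    // Every merge succeeded: the caller can now write all segment info files
    // under a single lock, as one commit point.
    return results;
  }
}
{code}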

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be useful add to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency, that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it needs 
> some understanding of what addIndexes does, why you need to wait on the merge 
> and why it makes sense to pass a single reader in the addIndexes API.
> Out-of-box support in Lucene could simplify this for folks with a similar use 
> case.






[GitHub] [lucene] jtibshirani opened a new pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-07 Thread GitBox


jtibshirani opened a new pull request #656:
URL: https://github.com/apache/lucene/pull/656


   This PR adds support for a query filter in KnnVectorQuery. First, we gather
   the query results for each leaf as a bit set. Then the HNSW search skips over
   the non-matching documents (using the same approach as for live docs). To
   prevent HNSW search from visiting too many documents when the filter is very
   selective, we short-circuit if HNSW has already visited more than the number
   of documents that match the filter, and execute an exact search instead. This
   bounds the number of visited documents at roughly 2x the cost of just running
   the exact filter, while in most cases HNSW completes successfully and does a
   lot better.
   
   Co-authored-by: Joel Bernstein 
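
   A self-contained sketch of that control flow (all types below are stand-ins, 
not Lucene's API; the real logic lives in this PR's KnnVectorQuery/HNSW changes):

```java
import java.util.BitSet;

final class FilteredKnnSketch {
  // Approximate search that gives up (returns null) once it has visited
  // more than `visitedLimit` documents.
  interface Approx { int[] searchUpTo(float[] target, int k, BitSet filter, int visitedLimit); }
  interface Exact { int[] search(float[] target, int k, BitSet filter); }

  static int[] search(Approx hnsw, Exact exact, float[] target, int k, BitSet filter) {
    int budget = filter.cardinality(); // the cost of just scanning the filter
    int[] approximate = hnsw.searchUpTo(target, k, filter, budget);
    if (approximate != null) {
      return approximate; // HNSW finished within budget: keep its result
    }
    // HNSW already visited more docs than an exact scan would: switch over.
    // Total work stays within roughly 2x the exact-scan cost, and the
    // fallback result is exact.
    return exact.search(target, k, filter);
  }
}
```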





[GitHub] [lucene] mocobeta merged pull request #655: LUCENE-10400: cleanup obsolete APIs in kuromoji

2022-02-07 Thread GitBox


mocobeta merged pull request #655:
URL: https://github.com/apache/lucene/pull/655


   





[GitHub] [lucene] mocobeta commented on pull request #655: LUCENE-10400: cleanup obsolete APIs in kuromoji

2022-02-07 Thread GitBox


mocobeta commented on pull request #655:
URL: https://github.com/apache/lucene/pull/655#issuecomment-1032088144


   I'll backport the test to 9.x.





[GitHub] [lucene-solr] thelabdude merged pull request #2639: SOLR-15587: Don't use the UrlScheme singleton on the client-side

2022-02-07 Thread GitBox


thelabdude merged pull request #2639:
URL: https://github.com/apache/lucene-solr/pull/2639


   





[jira] [Commented] (LUCENE-10400) Clean up the constructors' API signature of dictionary classes in kuromoji and nori

2022-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488511#comment-17488511
 ] 

ASF subversion and git services commented on LUCENE-10400:
--

Commit 20f7f33c8d21b94c15887912e80d068970fd095f in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=20f7f33 ]

LUCENE-10400: cleanup obsolete APIs in kuromoji (#655)



> Clean up the constructors' API signature of dictionary classes in kuromoji 
> and nori
> ---
>
> Key: LUCENE-10400
> URL: https://issues.apache.org/jira/browse/LUCENE-10400
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>






[GitHub] [lucene] jtibshirani commented on pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-07 Thread GitBox


jtibshirani commented on pull request #656:
URL: https://github.com/apache/lucene/pull/656#issuecomment-1032109021


   I tried out the approach of stopping the HNSW search early if it visits too 
many docs. To test, I modified `KnnGraphTester` to create `acceptDocs` uniformly 
at random with a certain selectivity, then measured recall and QPS. Here are the 
results on glove-100-angular (~1.2 million docs) with a filter selectivity of 0.01:
   
   **Baseline**
   ```
   k     Recall   Visited Docs   QPS
   10    0.774    15957          232.083
   50    0.930    63429          58.994
   80    0.958    95175          42.470
   100   0.967    118891         35.203
   500   0.997    1176237        8.136
   800   0.999    1183514        5.571
   ```
   
   **PR**
   ```
   k     Recall   Visited Docs   QPS
   10    1.000    22908          190.286
   50    1.000    23607          152.224
   80    1.000    23608          148.036
   100   1.000    23608          145.381
   500   1.000    23608          138.903
   800   1.000    23608          137.882
   ```
   
   Since the filter is so selective, HNSW always visits more than 1% of the 
docs. The adaptive logic in the PR decides to stop the search and switch to an 
exact search, which bounds the visited docs at 2%. For `k=10` this makes the 
QPS a little worse, but overall prevents QPS from degrading (with the side 
benefit of perfect recall). I also tested with less restrictive filters, and in 
these cases the fallback just doesn't kick in, so the QPS remains the same as 
before.
   
   Overall I like this approach because it doesn't require us to fiddle with 
thresholds or expose new parameters. It could also help make HNSW more robust 
in "pathological" cases where even when the filter is not that selective, all 
the nearest vectors to a query happen to be filtered away.





[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-07 Thread GitBox


jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r801192735



##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##
@@ -227,16 +231,36 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs) thro
 
 // bound k by total number of vectors to prevent oversizing data structures
 k = Math.min(k, fieldEntry.size());
-
 OffHeapVectorValues vectorValues = getOffHeapVectorValues(fieldEntry);
+
+DocIdSetIterator acceptIterator = null;
+int visitedLimit = Integer.MAX_VALUE;
+
+if (acceptDocs instanceof BitSet acceptBitSet) {

Review comment:
   This is a temporary hack since I wasn't sure about the right design. I 
could see a couple possibilities:
   1. Add a new `BitSet filter` parameter to `searchNearestVectors`, keeping 
the fallback logic within the HNSW classes. 
   2. Add a new `int visitedLimit` parameter to 
`LeafReader#searchNearestVectors`. Pull the "exact search" logic up into 
`KnnVectorQuery`.
   
   Which option is better probably depends on how other algorithms would handle 
filtering (which I am not sure about), and also if we think `visitedLimit` is 
useful in other contexts.
   
   I also played around with having `searchNearestVectors` take a `Collector` 
and using `CollectionTerminatedException`... but couldn't really see how this 
made sense.







[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-07 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488527#comment-17488527
 ] 

Julie Tibshirani commented on LUCENE-10382:
---

I had some time to try out the dynamic check I mentioned, and it seems to work. 
I opened a PR here that builds off Joel's change: 
https://github.com/apache/lucene/pull/656. It's a draft because there are still 
some big open API questions. Looking forward to hearing your feedback!

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.






[jira] [Commented] (LUCENE-10409) Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

2022-02-07 Thread Feng Guo (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488607#comment-17488607
 ] 

Feng Guo commented on LUCENE-10409:
---

+1, great idea! I'd like to take this on if you agree.

> Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc 
> IDs
> -
>
> Key: LUCENE-10409
> URL: https://issues.apache.org/jira/browse/LUCENE-10409
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>






[jira] [Updated] (LUCENE-10367) Use WANDScorer in CoveringQuery Can accelerate scorer time

2022-02-07 Thread LuYunCheng (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuYunCheng updated LUCENE-10367:

Status: Open  (was: Patch Available)

> Use WANDScorer in CoveringQuery Can accelerate scorer time
> --
>
> Key: LUCENE-10367
> URL: https://issues.apache.org/jira/browse/LUCENE-10367
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring, core/search, modules/sandbox
>Reporter: LuYunCheng
>Priority: Major
> Attachments: LUCENE-10367.patch, TestCoveringQueryBench.java
>
>
> When using CoveringQuery in Elasticsearch with a terms_set query, too much 
> time is spent in CoveringScorer, with the major cost in maintaining the 
> DisiPriorityQueue: subScorers.
> But when minimumNumberMatch is a constant LongValuesSource, we can use 
> WANDScorer to optimize it.
>  
> I ran a mini benchmark with 1M docs; the code is in LUCENE-10367.patch, 
> TestCoveringQuery.java testRandomBench(). It shows:
> TEST: WAND elapsed 67ms
> TEST: NOWAND elapsed 163ms
> My testing environment is a MacBook with an Intel Core i7 and 16G of memory.
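
For reference, a hedged usage sketch of the query under discussion (constructor 
shape as in the lucene-sandbox module; verify against your Lucene version):

{code:java}
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CoveringQuery;
import org.apache.lucene.search.LongValuesSource;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

final class CoveringQueryExample {
  // Matches documents containing at least 2 of the 3 terms. A constant
  // minimumNumberMatch is exactly the case where a WAND-style scorer could
  // skip non-competitive documents.
  static Query atLeastTwoOf() {
    return new CoveringQuery(
        List.<Query>of(
            new TermQuery(new Term("tags", "a")),
            new TermQuery(new Term("tags", "b")),
            new TermQuery(new Term("tags", "c"))),
        LongValuesSource.constant(2));
  }
}
{code}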






[jira] [Assigned] (LUCENE-10410) Add some more tests for legacy encoding logic in DocIdsWriter

2022-02-07 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo reassigned LUCENE-10410:
-

Assignee: Feng Guo

> Add some more tests for legacy encoding logic in DocIdsWriter
> -
>
> Key: LUCENE-10410
> URL: https://issues.apache.org/jira/browse/LUCENE-10410
> Project: Lucene - Core
>  Issue Type: Test
>  Components: core/codecs
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Trivial
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up of LUCENE-10315: add some more tests for the legacy 
> encoding logic in DocIdsWriter.






[jira] [Commented] (LUCENE-10378) Implement Weight#count on PointRangeQuery

2022-02-07 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488645#comment-17488645
 ] 

Adrien Grand commented on LUCENE-10378:
---

[~gworah] Have you had a chance to start working on something? I'm also 
interested in this change which I'd like to use for range faceting.

> Implement Weight#count on PointRangeQuery
> -
>
> Key: LUCENE-10378
> URL: https://issues.apache.org/jira/browse/LUCENE-10378
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> When there are no deletions and the field is single-valued (docCount == size) 
> and single-dimension, PointRangeQuery could implement {{Weight#count}} by 
> counting matches only on the two leaves that cross the query range.
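
A self-contained sketch of why the count becomes cheap in that case (stand-in 
code, not the PointRangeQuery patch): with one value per doc and no deletions, 
the index behaves like a sorted array of values, so only the two boundary 
positions need any per-value work.

{code:java}
import java.util.Arrays;

final class RangeCountSketch {
  // values: sorted 1D points, one per doc (docCount == size, no deletions).
  // Assumes upper < Long.MAX_VALUE to keep the sketch simple.
  static int count(long[] values, long lower, long upper) {
    int from = lowerBound(values, lower);   // first index >= lower
    int to = lowerBound(values, upper + 1); // first index > upper
    return to - from;
  }

  private static int lowerBound(long[] a, long key) {
    int idx = Arrays.binarySearch(a, key);
    if (idx < 0) {
      return -idx - 1; // insertion point
    }
    while (idx > 0 && a[idx - 1] == key) {
      idx--; // step to the leftmost occurrence
    }
    return idx;
  }
}
{code}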






[GitHub] [lucene] jpountz commented on a change in pull request #654: LUCENE-10410: Add more tests for legacy decoding logic in DocIdsWriter

2022-02-07 Thread GitBox


jpountz commented on a change in pull request #654:
URL: https://github.com/apache/lucene/pull/654#discussion_r801340328



##
File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
##
@@ -110,7 +111,11 @@ private void test(Directory dir, int[] ints) throws Exception {
 final long len;
 DocIdsWriter docIdsWriter = new DocIdsWriter(ints.length);
 try (IndexOutput out = dir.createOutput("tmp", IOContext.DEFAULT)) {
-  docIdsWriter.writeDocIds(ints, 0, ints.length, out);
+  if (rarely()) {
+legacyWriteDocIds(ints, 0, ints.length, out);
+  } else {
+docIdsWriter.writeDocIds(ints, 0, ints.length, out);
+  }

Review comment:
   Can you extract it to a separate test, so that we'd have both a 
`testLegacy` and `test`? I think it'd make it easier to get a sense of where 
the bug lies when we see a test failure.



