[GitHub] [lucene] gf2121 opened a new pull request #706: LUCENE-10417: Revert "LUCENE-10315"

2022-02-24 Thread GitBox


gf2121 opened a new pull request #706:
URL: https://github.com/apache/lucene/pull/706


   A SIMD optimization for the BKD `DocIdsWriter` was introduced in 
https://github.com/apache/lucene/pull/652 to speed up decoding of 
docIDs, but it led to a regression in the nightly benchmark.
   
   https://home.apache.org/~mikemccand/lucenebench/IntNRQ.html
   
   I tried to run `wiki10m` locally but cannot reproduce the regression. I'll 
continue to dig, but I think we need to revert it first.





[jira] [Resolved] (LUCENE-10435) Break loop early while checking whether DocValuesFieldExistsQuery can be rewritten to MatchAllDocsQuery

2022-02-24 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10435.
---
Fix Version/s: 9.1
   Resolution: Fixed

> Break loop early while checking whether DocValuesFieldExistsQuery can be 
> rewritten to MatchAllDocsQuery
> -
>
> Key: LUCENE-10435
> URL: https://issues.apache.org/jira/browse/LUCENE-10435
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
> Fix For: 9.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In the implementation of Query#rewrite in DocValuesFieldExistsQuery, when a 
> segment that can't match the condition is found, we should break out of the loop immediately.
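
A minimal sketch of the early-exit idea (allDocsHaveField and field are assumed names for illustration, not the actual patch):

{code:java}
// Hypothetical sketch: one non-matching segment decides the outcome,
// so stop scanning the remaining leaves immediately.
public Query rewrite(IndexReader reader) throws IOException {
  for (LeafReaderContext ctx : reader.leaves()) {
    if (allDocsHaveField(ctx.reader(), field) == false) {
      return this; // rewriting to MatchAllDocsQuery is already impossible
    }
  }
  return new MatchAllDocsQuery();
}
{code}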






[GitHub] [lucene] gf2121 merged pull request #706: LUCENE-10417: Revert "LUCENE-10315"

2022-02-24 Thread GitBox


gf2121 merged pull request #706:
URL: https://github.com/apache/lucene/pull/706


   





[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497260#comment-17497260
 ] 

ASF subversion and git services commented on LUCENE-10315:
--

Commit b0ca227862950a1869b535f31881cdfc2e859176 in lucene's branch 
refs/heads/main from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b0ca227 ]

LUCENE-10417: Revert "LUCENE-10315" (#706)



> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Fix For: 9.1
>
> Attachments: addall.svg
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for 
> users with its dynamic mapping feature. When users index low-cardinality 
> fields such as gender / age / status, they often use numbers to represent 
> the values, so ES infers these fields as {{long}}, and ES uses BKD as the 
> index for {{long}} fields. When the data volume grows, building the result 
> set for low-cardinality fields drives CPU usage and load very high.
> This is a flame graph we obtained from the production environment:
> [^addall.svg]
> It shows that almost all CPU time is spent in addAll. When we reindexed 
> {{long}} to {{keyword}}, the cluster load and search latency dropped greatly 
> (we spent weeks reindexing all indices...). I know the ES documentation 
> recommends {{keyword}} for term/terms queries and {{long}} for range 
> queries, but there are always users who don't realize this and keep their 
> SQL-database habits, or dynamic mapping selects the type for them. All in 
> all, users won't realize there can be such a big performance difference 
> between {{long}} and {{keyword}} for low-cardinality fields. So from my 
> point of view it makes sense to make BKD work better for low/medium 
> cardinality fields.
> As far as I can see, for low-cardinality fields, {{keyword}} has two 
> advantages over {{long}}:
> 1. The {{ForUtil}} used in {{keyword}} postings is much more efficient than 
> BKD's delta VInt, because of its batch reads (readLongs) and SIMD decoding.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily 
> materialize its result set, so when another small result clause intersects 
> with this low-cardinality condition, the low-cardinality field can avoid 
> reading all docIds into memory.
> This issue targets the first point. The basic idea is to use a 512-int 
> {{ForUtil}} for the BKD ids codec. I benchmarked this optimization by 
> mocking random {{LongPoint}} fields and querying them with 
> {{PointInSetQuery}}.
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff 
> percentage|
> |1|32|1|51.44|148.26|188.22%|
> |1|32|2|26.8|101.88|280.15%|
> |1|32|4|14.04|53.52|281.20%|
> |1|32|8|7.04|28.54|305.40%|
> |1|32|16|3.54|14.61|312.71%|
> |1|128|1|110.56|350.26|216.81%|
> |1|128|8|16.6|89.81|441.02%|
> |1|128|16|8.45|48.07|468.88%|
> |1|128|32|4.2|25.35|503.57%|
> |1|128|64|2.13|13.02|511.27%|
> |1|1024|1|536.19|843.88|57.38%|
> |1|1024|8|109.71|251.89|129.60%|
> |1|1024|32|33.24|104.11|213.21%|
> |1|1024|128|8.87|30.47|243.52%|
> |1|1024|512|2.24|8.3|270.54%|
> |1|8192|1|.33|5000|50.00%|
> |1|8192|32|139.47|214.59|53.86%|
> |1|8192|128|54.59|109.23|100.09%|
> |1|8192|512|15.61|36.15|131.58%|
> |1|8192|2048|4.11|11.14|171.05%|
> |1|1048576|1|2597.4|3030.3|16.67%|
> |1|1048576|32|314.96|371.75|18.03%|
> |1|1048576|128|99.7|116.28|16.63%|
> |1|1048576|512|30.5|37.15|21.80%|
> |1|1048576|2048|10.38|12.3|18.50%|
> |1|8388608|1|2564.1|3174.6|23.81%|
> |1|8388608|32|196.27|238.95|21.75%|
> |1|8388608|128|55.36|68.03|22.89%|
> |1|8388608|512|15.58|19.24|23.49%|
> |1|8388608|2048|4.56|5.71|25.22%|
> Index size is reduced for low-cardinality fields and flat for high-cardinality 
> fields.
> {code:java}
> 113M index_1_doc_32_cardinality_baseline
> 114M index_1_doc_32_cardinality_candidate
> 140M index_1_doc_128_cardinality_baseline
> 133M index_1_doc_128_cardinality_candidate
> 193M index_1_doc_1024_cardinality_baseline
> 174M index_1_doc_1024_cardinality_candidate
> 241M index_
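
To make point 1 concrete, here is a hedged sketch contrasting the two decoding styles (illustrative only, not the actual {{DocIdsWriter}} code; the block size and unpacking details are assumptions):

{code:java}
// Baseline: delta-vint decoding. Each value depends on the previous one and on
// a variable-length read, which is branchy and hard to vectorize.
static void readDeltaVInts(DataInput in, int count, int[] docIds) throws IOException {
  int doc = 0;
  for (int i = 0; i < count; i++) {
    doc += in.readVInt();
    docIds[i] = doc;
  }
}

// Candidate: fixed-width blocks read in bulk (readLongs), then unpacked with
// straight-line code the JIT can auto-vectorize, ForUtil-style.
static void readBlock(DataInput in, long[] scratch, int[] docIds) throws IOException {
  in.readLongs(scratch, 0, scratch.length); // one bulk read for a 512-int block
  // ... unpack fixed-width values from scratch into docIds ...
}
{code}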

[jira] [Commented] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497259#comment-17497259
 ] 

ASF subversion and git services commented on LUCENE-10417:
--

Commit b0ca227862950a1869b535f31881cdfc2e859176 in lucene's branch 
refs/heads/main from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b0ca227 ]

LUCENE-10417: Revert "LUCENE-10315" (#706)



> IntNRQ task performance decreased in nightly benchmark
> --
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315; I'll dig.






[jira] [Assigned] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?

2022-02-24 Thread Mayya Sharipova (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova reassigned LUCENE-10194:


Assignee: Mayya Sharipova

> Should IndexWriter buffer KNN vectors on disk?
> --
>
> Key: LUCENE-10194
> URL: https://issues.apache.org/jira/browse/LUCENE-10194
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Mayya Sharipova
>Priority: Minor
>
> VectorValuesWriter buffers data in memory, like we do for all data structures 
> that are computed on flush. But I wonder if this is the right trade-off.
> The use-case I have in mind is someone trying to load a dataset of vectors into 
> Lucene. Given that HNSW graphs are super expensive to create, we'd ideally 
> load that dataset into a single segment rather than many small segments that 
> then need to be merged together, which in turn re-creates the HNSW graph.
> Yet buffering vectors in memory is expensive. For instance, assuming 256 
> dimensions, each vector consumes 1 kB of memory (256 floats × 4 bytes). Should 
> we consider buffering vectors on disk to reduce the chances of having to 
> create new segments only because the RAM buffer is full?






[GitHub] [lucene] gf2121 merged pull request #707: LUCENE-10417: Revert LUCENE-10315 (backport 9x)

2022-02-24 Thread GitBox


gf2121 merged pull request #707:
URL: https://github.com/apache/lucene/pull/707


   





[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497268#comment-17497268
 ] 

ASF subversion and git services commented on LUCENE-10315:
--

Commit ad48203b557d3250f6975d097a5af331db0ee3cd in lucene's branch 
refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad48203 ]

LUCENE-10417: Revert "LUCENE-10315" (#706) (#707)



> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Fix For: 9.1
>
> Attachments: addall.svg
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h

[jira] [Commented] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497267#comment-17497267
 ] 

ASF subversion and git services commented on LUCENE-10417:
--

Commit ad48203b557d3250f6975d097a5af331db0ee3cd in lucene's branch 
refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad48203 ]

LUCENE-10417: Revert "LUCENE-10315" (#706) (#707)



> IntNRQ task performance decreased in nightly benchmark
> --
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315; I'll dig.






[GitHub] [lucene] LuXugang commented on pull request #705: LUCENE-10439: Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-02-24 Thread GitBox


LuXugang commented on pull request #705:
URL: https://github.com/apache/lucene/pull/705#issuecomment-1049639011


   
   
   
   
   > We need two cases:
   > 
   > * Checking whether all documents match and returning values.getDocCount(). 
This works when there are no deletions.
   > * Actually counting the number of matching points. This only works when 
there are no deletions and the field is single-valued (docCount == size), and 
we only want to apply it in the 1D case, since that is the only case where we 
have a guarantee that it will actually run fast: there are at most 2 
crossing leaves.
   
   
   Thanks @jpountz, now I fully understand your idea.
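   
   For reference, a sketch of those two fast paths under the stated assumptions (`matchesAllPoints` and `countMatchingPoints` are hypothetical helpers; the accessors are the real `PointValues` API):
   
   ```
   // Hedged sketch, not the actual patch.
   long count(PointValues values, LeafReaderContext ctx) throws IOException {
     boolean noDeletions = ctx.reader().hasDeletions() == false;
     if (noDeletions && matchesAllPoints(values)) {
       return values.getDocCount(); // case 1: every document with the field matches
     }
     if (noDeletions
         && values.getDocCount() == values.size() // single-valued field
         && values.getNumDimensions() == 1) {     // 1D: at most 2 crossing leaves
       return countMatchingPoints(values);        // case 2: count points directly
     }
     return -1; // no shortcut applies: fall back to regular counting
   }
   ```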
   





[GitHub] [lucene] jpountz merged pull request #705: LUCENE-10439: Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-02-24 Thread GitBox


jpountz merged pull request #705:
URL: https://github.com/apache/lucene/pull/705


   





[jira] [Commented] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497272#comment-17497272
 ] 

ASF subversion and git services commented on LUCENE-10439:
--

Commit 550d1305db71b33f988484fe58de1f754283562d in lucene's branch 
refs/heads/main from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=550d130 ]

LUCENE-10439: Support multi-valued and multiple dimensions for count query in 
PointRangeQuery (#705)



> Support multi-valued and multiple dimensions for count query in 
> PointRangeQuery
> ---
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424, making it also work with fields that have multiple 
> dimensions and/or that are multi-valued.






[jira] [Commented] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497273#comment-17497273
 ] 

ASF subversion and git services commented on LUCENE-10439:
--

Commit 6acf16a2e3427179614f99e159dec16f63b4dfc4 in lucene's branch 
refs/heads/branch_9x from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6acf16a ]

LUCENE-10439: Support multi-valued and multiple dimensions for count query in 
PointRangeQuery (#705)



> Support multi-valued and multiple dimensions for count query in 
> PointRangeQuery
> ---
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424, making it also work with fields that have multiple 
> dimensions and/or that are multi-valued.






[jira] [Commented] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497280#comment-17497280
 ] 

Adrien Grand commented on LUCENE-10417:
---

FYI Elasticsearch was upgraded to a recent Lucene snapshot 2 days ago, and 
we're seeing some ranges that may be slower but also other ranges that seem to 
be faster. See e.g. {{nightly-http_logs-4g-200s-in-range-latency}} at 
https://elasticsearch-benchmarks.elastic.co/#tracks/http-logs/nightly/default/30d.

> IntNRQ task performance decreased in nightly benchmark
> --
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315; I'll dig.






[GitHub] [lucene] jpountz opened a new pull request #708: LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints.

2022-02-24 Thread GitBox


jpountz opened a new pull request #708:
URL: https://github.com/apache/lucene/pull/708


   Since doc IDs with a vector are loaded as an int[] in memory, this changes 
the
   on-disk format of vectors to align with the in-memory representation by using
   ints instead of vints to represent doc IDs. This might make vectors a bit
   larger on disk, but also a bit faster to open.
   
   I made the same change to how we encode nodes on levels for the same reason.
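   
   A hedged sketch of the encoding change (not the actual patch; `writeInt`/`writeVInt` are the real `DataOutput` primitives):
   
   ```
   // vints take 1-5 bytes per value; fixed-width ints always take 4 bytes but
   // can be read back with bulk, branch-free reads that match the in-memory int[].
   void writeDocIds(IndexOutput out, int[] docIds) throws IOException {
     for (int doc : docIds) {
       out.writeInt(doc); // before this change: out.writeVInt(doc);
     }
   }
   ```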





[GitHub] [lucene] jpountz merged pull request #708: LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints.

2022-02-24 Thread GitBox


jpountz merged pull request #708:
URL: https://github.com/apache/lucene/pull/708


   





[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497367#comment-17497367
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit 44d7d962ae42cfca7070a8e2c84ab059fec21e10 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=44d7d96 ]

LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints. (#708)

Since doc IDs with a vector are loaded as an int[] in memory, this changes the
on-disk format of vectors to align with the in-memory representation by using
ints instead of vints to represent doc IDs. This might make vectors a bit
larger on disk, but also a bit faster to open.

I made the same change to how we encode nodes on levels for the same reason.

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Currently we write the doc IDs of all documents that have vectors as-is. We 
> should improve their encoding using either delta encoding or a bitset.






[GitHub] [lucene] jpountz merged pull request #702: LUCENE-10382: Use `IndexReaderContext#id` to check reader identity.

2022-02-24 Thread GitBox


jpountz merged pull request #702:
URL: https://github.com/apache/lucene/pull/702


   





[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497369#comment-17497369
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit d47ff38d703c6b5da1ef9c774ccda201fd682b8d in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d47ff38 ]

LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)

`KnnVectorQuery` currently uses the index reader's hash code to make sure that
the query it builds runs on the right reader. We had added
`IndexReaderContext#id` a while back for a similar purpose with `TermStates`,
let's reuse it?
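
A hedged sketch of the identity check (the {{id}} field comes from the commit message; the surrounding KnnVectorQuery code is assumed, not quoted):

{code:java}
// Compare an identity token captured when the query was built against the
// context of the searcher that is about to run it.
void checkSameReader(IndexSearcher searcher, Object contextIdCapturedAtRewrite) {
  if (searcher.getTopReaderContext().id != contextIdCapturedAtRewrite) {
    throw new IllegalStateException("query was built against a different reader");
  }
}
{code}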

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.






[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497373#comment-17497373
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit d952b3a58114ce5a929211bca7a9b0e822658f35 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d952b3a ]

LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)

`KnnVectorQuery` currently uses the index reader's hash code to make sure that
the query it builds runs on the right reader. We had added
`IndexReaderContext#id` a while back for a similar purpose with `TermStates`,
let's reuse it?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.






[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497372#comment-17497372
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit d4cb6d0a307be42b8d3498d4363a68eec5947f15 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d4cb6d0 ]

LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints. (#708)

Since doc IDs with a vector are loaded as an int[] in memory, this changes the
on-disk format of vectors to align with the in-memory representation by using
ints instead of vints to represent doc IDs. This might make vectors a bit
larger on disk, but also a bit faster to open.

I made the same change to how we encode nodes on levels for the same reason.

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently we write the doc IDs of all documents that have vectors as-is. We 
> should improve their encoding using either delta encoding or a bitset.






[jira] [Resolved] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-02-24 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10439.
---
Fix Version/s: 9.1
   Resolution: Fixed

> Support multi-valued and multiple dimensions for count query in 
> PointRangeQuery
> ---
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
> Fix For: 9.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424, making it also work with fields that have multiple 
> dimensions and/or that are multi-valued.






[jira] [Reopened] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-24 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reopened LUCENE-10315:
---

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Fix For: 9.1
>
> Attachments: addall.svg
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h

[jira] [Updated] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-24 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10315:
--
Fix Version/s: (was: 9.1)

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Attachments: addall.svg
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h

[GitHub] [lucene] jpountz commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

2022-02-24 Thread GitBox


jpountz commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049857125


   @rmuir We can remove the cost estimation, but it will not address the 
problem. I'll try to explain the problem differently in case it helps.
   
   DocIdSetBuilder takes doc IDs in random order with potential duplicates and 
creates a DocIdSet that can iterate over doc IDs in order without any 
duplicates. If you index a multi-valued field with points, a very large segment 
that has 2^30 docs might have 2^32 points matching a range query, which 
translates into 2^29 documents matching the query. So `DocIdSetBuilder#add` would 
be called 2^32 times and `DocIdSetBuilder#build` would result in a `DocIdSet` 
that has 2^29 documents. The `long` measures the number of calls to 
`DocIdSetBuilder#add`, hence the wider type.
   
   The naming may be wrong here, as the `grow` name probably suggests a number 
of docs rather than a number of calls to `add`, similarly to how 
`ArrayUtil#grow` is about the number of items in the array - not the number of 
times you set an index. Renaming it to `prepareAdd(long 
numCallsToAdd)` or something along these lines would hopefully clarify that.
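   
   For readers following along, a minimal sketch of that contract (public `DocIdSetBuilder` API; `expectedAddCalls` and `nextMatchingDoc` are illustrative placeholders):
   
   ```
   DocIdSet collect(int maxDoc, int expectedAddCalls) {
     DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
     DocIdSetBuilder.BulkAdder adder = builder.grow(expectedAddCalls); // add() calls, not docs
     for (int i = 0; i < expectedAddCalls; i++) {
       adder.add(nextMatchingDoc(i)); // random order, duplicates allowed
     }
     return builder.build(); // iterates doc IDs in order, without duplicates
   }
   ```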





[jira] [Commented] (LUCENE-10438) Leverage Weight#count in lucene/facets

2022-02-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497394#comment-17497394
 ] 

Adrien Grand commented on LUCENE-10438:
---

Solr indeed has a version of faceting that does this. I haven't looked at the 
details for a long time, but I remember that it would run facets on 
low-cardinality fields by intersecting postings with the bitset produced by the 
query.
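
A hedged sketch of that idea (not Solr's actual code; {{TermsEnum}}/{{PostingsEnum}} are the real APIs):

{code:java}
// For each term of a low-cardinality field, count how many of its postings
// fall inside the bitset produced by the query.
static Map<BytesRef, Integer> countFacets(TermsEnum termsEnum, Bits queryBits) throws IOException {
  Map<BytesRef, Integer> facetCounts = new HashMap<>();
  PostingsEnum postings = null;
  for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
    postings = termsEnum.postings(postings, PostingsEnum.NONE);
    int count = 0;
    for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
      if (queryBits.get(doc)) {
        count++;
      }
    }
    facetCounts.put(BytesRef.deepCopyOf(term), count);
  }
  return facetCounts;
}
{code}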

> Leverage Weight#count in lucene/facets
> --
>
> Key: LUCENE-10438
> URL: https://issues.apache.org/jira/browse/LUCENE-10438
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/facet
>Reporter: Adrien Grand
>Assignee: Greg Miller
>Priority: Minor
>
> The facet module could leverage Weight#count in order to give fast counts for 
> the browsing use-case?






[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

2022-02-24 Thread GitBox


rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049869937


   > @rmuir We can remove the cost estimation, but it will not address the 
problem. I'll try to explain the problem differently in case it helps.
   
   I really think it will address the problem. I understand what is happening, 
but adding 32 more bits that merely get discarded also will not help anything. 
That's what is being discussed here.
   
   It really is all about cost estimation, as that is the ONLY thing in this PR 
actually using the 32 extra bits. That's why I propose to simply use a 
different cost estimation instead. The current cost estimation explodes the 
complexity of this class: that's why we are tracking:
   * `boolean multiValued`
   * `double numValuesPerDoc`
   * `long counter`
   
   There's no need (from an allocation perspective, which is all we should be 
concerned about here) to know about any numbers bigger than 
`Integer.MAX_VALUE`; if we get anywhere near numbers that big, we should be 
switching over to the `FixedBitSet` representation.
   





[jira] [Commented] (LUCENE-10432) Add optional 'name' property to org.apache.lucene.search.Explanation

2022-02-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497399#comment-17497399
 ] 

Adrien Grand commented on LUCENE-10432:
---

[~reta] I wonder if you have thought about how queries would know what name 
they should return in their explanations. My expectation is that we'd be 
introducing some form of query wrapper whose only point would be to set a name 
or tags in the produced explanations. Then I worry that it would make some 
things more complicated for Lucene, like query rewriting, which relies on 
instanceof checks, or query caching, which would consider the same queries 
with different names as different. Overall it looks to me like the benefits 
this brings would not be worth the problems it would introduce.

> Add optional 'name' property to org.apache.lucene.search.Explanation 
> -
>
> Key: LUCENE-10432
> URL: https://issues.apache.org/jira/browse/LUCENE-10432
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0, 8.10.1
>Reporter: Andriy Redko
>Priority: Minor
>
> Right now, the `Explanation` class has the `description` property, which is 
> used pretty much as a placeholder for a free-style, human-readable summary of 
> what is happening. This is totally fine, but it would be great to have a more 
> formal way to link the explanation with the corresponding function / query / 
> filter if supported by the underlying engine.
> Example: OpenSearch / Elasticsearch has the concept of named queries / filters 
> [1]. This is not supported by Apache Lucene at the moment, but it would be 
> helpful to propagate this information back as part of the Explanation tree, 
> for example by introducing an optional 'name' property:
>  
> {noformat}
> {
> "value": 0.0,
> "description": "script score function, computed with script: ...",
>  
> "name": "script1",
> "details": [
>  {
>  "value": 1.0,
>  "description": "_score: ",
>  "details": [
>   {
>   "value": 1.0,
>   "description": "*:*",
>   "details": []
>}
>   ]
>   }
> ]
> }{noformat}
>  
> On the other hand, the `name` property may look like it doesn't belong here; 
> the alternative suggestion would be to support a `properties` / 
> `parameters` / `tags` key/value bag, for example:
>  
> {noformat}
> {
> "value": 0.0,
> "description": "script score function, computed with script: ...",
>  
> "tags": [
>{  "name": "script1" }
> ],
> "details": [
>  {
>  "value": 1.0,
>  "description": "_score: ",
>  "details": [
>   {
>   "value": 1.0,
>   "description": "*:*",
>   "details": []
>}
>   ]
>   }
> ]
> }{noformat}
> The change should be non-breaking but quite useful for engines to enrich the 
> `Explanation` with additional context.
> [1] 
> https://www.elastic.co/guide/en/elasticsearch/reference/7.16/query-dsl-bool-query.html#named-queries
>  






[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()

2022-02-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497402#comment-17497402
 ] 

Adrien Grand commented on LUCENE-10431:
---

I've been staring at the code and at this stack trace for the past 15 minutes, 
but I cannot see how hashCode() could be called before BooleanQuery 
is fully constructed. Can you share more information about how this query gets 
constructed?

> AssertionError in BooleanQuery.hashCode()
> -
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Michael Bien
>Priority: Major
>
> Hello devs,
> the constructor of BooleanQuery can, under some circumstances, trigger a hash 
> code computation before "clauseSets" is fully filled. Because BooleanClause 
> uses its query field for its hash code too, the "wrong" hash code can end up 
> being stored: adding the clause to the set triggers its 
> hashCode().
> If assertions are enabled, the check in BooleanQuery that recomputes the 
> hash code will notice this and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>     at java.base/java.util.Objects.hashCode(Objects.java:103)
>     at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>     at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>     at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>     at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>     at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>     at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>     at java.base/java.util.Objects.hash(Objects.java:133)
>     at 
> org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>     at java.base/java.util.HashMap.hash(HashMap.java:340)
>     at java.base/java.util.HashMap.put(HashMap.java:612)
>     at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>     at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:154)
>     at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:42)
>     at 
> org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules 
> from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558
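
A minimal standalone illustration of the underlying hazard (hypothetical classes, not the NetBeans or Lucene code): hash-based collections capture an element's hash at insertion time, so mutating state that feeds hashCode() afterwards leaves a stale hash behind.

{code:java}
import java.util.HashSet;
import java.util.Set;

class Clause {
  final StringBuilder query = new StringBuilder("a");

  @Override
  public int hashCode() {
    return query.toString().hashCode(); // depends on mutable state
  }
}

public class StaleHashDemo {
  public static void main(String[] args) {
    Set<Clause> clauses = new HashSet<>();
    Clause c = new Clause();
    clauses.add(c);      // hashCode() captured here
    c.query.append("b"); // state mutated after insertion
    System.out.println(clauses.contains(c)); // false: the stored hash is stale
  }
}
{code}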






[GitHub] [lucene] iverase commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

2022-02-24 Thread GitBox


iverase commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049887302


   The 32 bits will need to be discarded anyway; the issue is where. 
   
   You either do it at the PointValues level by calling grow like:
   
   ```
   visitor.grow((int) Math.min(getDocCount(), pointTree.size()));
   ```
   
   Or you discard them in the DocIdSetBuilder and allow grow to be called just 
like:
   
   ```
   visitor.grow(pointTree.size());
   ```
   
   





[jira] [Commented] (LUCENE-10432) Add optional 'name' property to org.apache.lucene.search.Explanation

2022-02-24 Thread Andriy Redko (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497405#comment-17497405
 ] 

Andriy Redko commented on LUCENE-10432:
---

Thanks [~jpountz] 

> I wonder if you have thought about how queries would know what name they 
> should return in their explanations. My expectation is that we'd be 
> introducing some form of query wrapper whose point would only be to be able 
> to set a name or tags in the produced explanations.

That could be an option, but I would suggest not changing queries. For example, 
in most cases OpenSearch / Elasticsearch queries / filters / functions are 
wrapped into composites, and the names and other attributes are stored there.

> Then I worry that it would make some things more complicated for Lucene like 
> query rewriting, which relies on instanceof checks, or query caching, which 
> would consider the same queries with different names as different.

Those problems would certainly arise if we changed queries, but we don't need 
to: the ability to pass structured (key/value) details into the Explanation 
would help the engines propagate the internal context back.

 

Maybe a more illustrative example of the end-to-end flow. The request:
{noformat}
{
    "explain": true,
    "query": {
      "function_score": {
        "query": {
          "match_all": {
          "_name": "q1"
          }
        },
        "functions": [
          {
            "filter": {
              "terms": {
               "_name": "terms_filter",
                "abc": [
                  "1"
                ]
              }
            },
            "weight": 35
          }
        ],
        "boost_mode": "replace",
        "score_mode": "sum",
        "min_score": 0
      }
    }
}{noformat}
 

And this is how we return it inside the explanation description 
({*}"description": "match filter(_name: terms_filter): abc:\{1}"{*}):
{noformat}
 {
    "value": 35.0,
    "description": "function score, product of:",
    "details": [
        {
            "value": 1.0,
            "description": "match filter(_name: terms_filter): abc:{1}",
            "details": []
        },
        {
            "value": 35.0,
            "description": "product of:",
            "details": [
                {
                    "value": 1.0,
                    "description": "constant score 1.0 - no function provided",
                    "details": []
                },
                {
                    "value": 35.0,
                    "description": "weight",
                    "details": []
                }
            ]
        }
    ]
}{noformat}
Does it address your concerns? Thanks a lot for taking a look!

> Add optional 'name' property to org.apache.lucene.search.Explanation 
> -
>
> Key: LUCENE-10432
> URL: https://issues.apache.org/jira/browse/LUCENE-10432
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0, 8.10.1
>Reporter: Andriy Redko
>Priority: Minor
>
> Right now, the `Explanation` class has the `description` property which is 
> used pretty much as placeholder for free-style, human readable summary of 
> what is happening. This is totally fine but it would be great to have a bit 
> more formal way to link the explanation with corresponding function / query / 
> filter if supported by the underlying engine.
> Example: OpenSearch / Elasticsearch has the concept of named queries / filters 
> [1]. This is not supported by Apache Lucene at the moment, but it would be 
> helpful to propagate this information back as part of the Explanation tree, for 
> example by introducing an optional 'name' property:
>  
> {noformat}
> {
> "value": 0.0,
> "description": "script score function, computed with script: ...",
>  
> "name": "script1",
> "details": [
>  {
>  "value": 1.0,
>  "description": "_score: ",
>  "details": [
>   {
>   "value": 1.0,
>   "description": "*:*",
>   "details": []
>}
>   ]
>   }
> ]
> }{noformat}
>  
> On the other hand, the `name` property may look like it does not belong here; 
> the alternative suggestion would be to add support for a `properties` /  
> `parameters` / `tags` key/value bag, for example:
>  
> {noformat}
> {
> "value": 0.0,
> "description": "script score function, computed with script: ...",
>  
> "tags": [
>{  "name": "script1" }
> ],
> "details": [
>  {
>  "value": 1.0,
>  "description": "_score: ",   

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

2022-02-24 Thread GitBox


rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049894634


   If this is literally all about a "style" issue, then let's be open and honest 
about that. I am fine with:
   ```
   /** sugar: to just make code look pretty, nothing else */
   public BulkAdder grow(long numDocs) {
     return grow((int) Math.min(Integer.MAX_VALUE, numDocs));
   }
   ```
   
   But I think it is wrong to have constructors taking `Terms` and 
`PointValues` already: it is just more useless complexity and "leaky 
abstraction" from the terrible cost estimation.
   
   And I definitely think having two separate classes just for the cost 
estimation is way too much.





[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

2022-02-24 Thread GitBox


rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049905030


   To try to be more helpful, here's what i'd propose. I can try to hack up a 
draft PR later if we want, if it is helpful.
   
   DocIdSetBuilder, remove complex cost estimation:
   * remove `DocIdSetBuilder(int, Terms)` constructor
   * remove `DocIdSetBuilder(int, PointValues)` constructor
   * remove `DocIdSetBuilder.counter` member
   * remove `DocIdSetBuilder.multiValued` member
   * remove `DocIdSetBuilder.numValuesPerDoc` member
   
   DocIdSetBuilder: add sugar `grow(long)` for style purposes:
   ```
   /** sugar: to just make code look pretty, nothing else */
   public BulkAdder grow(long numDocs) {
     return grow((int) Math.min(Integer.MAX_VALUE, numDocs));
   }
   ```
   
   FixedBitSet: implement `approximateCardinality()` and simply use it when 
estimating cost() here.





[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop

2022-02-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497422#comment-17497422
 ] 

Adrien Grand commented on LUCENE-10428:
---

Ouch this is bad.

Note that in your code snippet, `minScoreSum` should be a float - not a double 
- to replicate what MaxScoreSumPropagator does. By any chance, were you able to 
see how many clauses this query had?

> getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge 
> leading to busy threads in infinite loop
> -
>
> Key: LUCENE-10428
> URL: https://issues.apache.org/jira/browse/LUCENE-10428
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/search
>Reporter: Ankit Jain
>Priority: Major
> Attachments: Flame_graph.png
>
>
> Customers complained about high CPU for an Elasticsearch cluster in production. 
> We noticed that a few search requests were stuck for a long time
> {code:java}
> % curl -s localhost:9200/_cat/tasks?v   
> indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205  
> AmMLzDQ4RrOJievRDeGFZw:569204  direct1645195007282 14:36:47  6.2h
> indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075  
> emjWc5bUTG6lgnCGLulq-Q:502074  direct1645195037259 14:37:17  6.2h
> indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270  
> emjWc5bUTG6lgnCGLulq-Q:583269  direct1645201316981 16:21:56  4.5h
> {code}
> Flame graphs indicated that CPU time is mostly going into 
> *getMinCompetitiveScore method in MaxScoreSumPropagator*. After doing some 
> live JVM debugging, we found that the 
> org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had 
> around 4 million invocations every second
> Figured out the values of some parameters from live debugging:
> {code:java}
> minScoreSum = 3.5541441
> minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = 
> 3.554144322872162
> returnObj scoreSumUpperBound = 3.5541444
> Math.ulp(minScoreSum) = 2.3841858E-7
> {code}
> Example code snippet:
> {code:java}
> double sumOfOtherMaxScores = 3.554144322872162;
> double minScoreSum = 3.5541441;
> float minScore = (float) (minScoreSum - sumOfOtherMaxScores);
> while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
> minScore -= Math.ulp(minScoreSum);
> System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum));
> }
> {code}






[GitHub] [lucene] rmuir opened a new pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


rmuir opened a new pull request #709:
URL: https://github.com/apache/lucene/pull/709


   Cost estimation drives the API complexity out of control; we don't need it. 
Hopefully I've cleared up all the API damage from this explosive leak.
   
   Instead, FixedBitSet.approximateCardinality() is used for cost estimation. 
TODO: let's optimize that!
   





[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1049948027


   Here's a first stab at what I proposed on 
https://github.com/apache/lucene/pull/692
   
   You can see how damaging the current cost() implementation is.
   
   As followup commits we can add the `grow(long)` sugar that simply truncates. 
And we should optimize `FixedBitSet.approximateCardinality()`. After doing 
that, we should look around and see if there is any other similar damage to our 
APIs related to the fact that FixedBitSet had a slow `approximateCardinality` 
and fix those, too.





[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

2022-02-24 Thread GitBox


rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049948208


   prototype: https://github.com/apache/lucene/pull/709





[GitHub] [lucene] jpountz commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


jpountz commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1049959940


   That change makes sense to me. FWIW my recollection from profiling 
DocIdSetBuilder is that the deduplication logic is cheap and most of the time 
is spent in `LSBRadixSorter#reorder` so it's ok to always deduplicate.
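   
   For reference, the deduplication in question is just a linear scan over the 
sorted buffer; a minimal sketch (not the literal private helper) looks like:
   ```
   // Dedup a sorted int[] prefix in place and return the new length.
   static int dedup(int[] docs, int length) {
     if (length == 0) {
       return 0;
     }
     int out = 1;
     for (int i = 1; i < length; i++) {
       if (docs[i] != docs[out - 1]) {
         docs[out++] = docs[i];
       }
     }
     return out;
   }
   ```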





[jira] [Commented] (LUCENE-10432) Add optional 'name' property to org.apache.lucene.search.Explanation

2022-02-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497468#comment-17497468
 ] 

Adrien Grand commented on LUCENE-10432:
---

The bit I'm missing is how you would let Lucene know about the query name when 
calling {{Weight#explain}}?




[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1049967927


   If we want to add the `grow(long)` sugar method that simply truncates to 
`Integer.MAX_VALUE` and clean up all the points callsites, or write a cool 
FixedBitSet.approximateCardinality, just feel free to push commits here. 
Otherwise I will get to these two things later and remove draft status on the 
PR.
   
   Adding the sugar method is easy; it is just work.
   Implementing the approximateCardinality requires some thought and prolly 
some benchmarking. I had in mind to just "sample" some "chunks" of the long[] 
and sum up `Long.bitCount` across the ranges. In an upcoming JDK this method 
will get vectorized; let's take advantage of that, so then both `cardinality()` 
and `approximateCardinality` would get faster: 
https://github.com/openjdk/jdk/pull/6857
   





[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r813988648



##
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
 try {
   if (bitSet != null) {
-assert counter >= 0;
-final long cost = Math.round(counter / numValuesPerDoc);
-return new BitDocIdSet(bitSet, cost);
+return new BitDocIdSet(bitSet);
   } else {
 Buffer concatenated = concat(buffers);
 LSBRadixSorter sorter = new LSBRadixSorter();
 sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, 
concatenated.length);
-final int l;
-if (multivalued) {
-  l = dedup(concatenated.array, concatenated.length);

Review comment:
   Do we really want to throw away this optimisation? We normally know whether 
our data is single- or multi-valued, so it seems wasteful not to exploit it.







[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r813994000



##
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
 try {
   if (bitSet != null) {
-assert counter >= 0;
-final long cost = Math.round(counter / numValuesPerDoc);
-return new BitDocIdSet(bitSet, cost);
+return new BitDocIdSet(bitSet);

Review comment:
   We still need to implement the method approximateCardinality, which is the 
hard bit.










[GitHub] [lucene] iverase edited a comment on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


iverase edited a comment on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-104552


   I don't think the grow(long) is necessary; we can always add it to the 
IntersectVisitor instead. Maybe it would be worthwhile to adjust how we call 
grow() in BKDReader#addAll, as it does not need the dance it is currently 
doing:
   
   
https://github.com/apache/lucene/blob/8c67a3816b9060fa983b494886cd4f789be1d868/lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java#L562
   
   The same applies to SimpleTextBKDReader#addAll.
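   
   For context, the "dance" is roughly the following guard (paraphrased from 
memory, names approximate; see the linked source for the real thing), which can 
only call grow() once the total point count is known to fit in an int:
   ```
   // paraphrased sketch of the guard in BKDReader#addAll, not the literal source
   if (grown == false) {
     long maxPointCount = (long) config.maxPointsInLeafNode * getNumLeaves();
     if (maxPointCount <= Integer.MAX_VALUE) {
       visitor.grow((int) maxPointCount);
       grown = true;
     }
   }
   ```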





[jira] [Commented] (LUCENE-10432) Add optional 'name' property to org.apache.lucene.search.Explanation

2022-02-24 Thread Andriy Redko (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497499#comment-17497499
 ] 

Andriy Redko commented on LUCENE-10432:
---

Yeah, it may not cover everything 100% (like {{Weight#explain}}), but it is 
also not needed in every place. Probably a generic bag for contextual 
properties would be a less intrusive and more extensible way to propagate the 
name and other things (which are baked into the description now). Just another 
example, for a random scoring function explanation:
{noformat}
{
    "value": 0.38554674,
    "description": "random score function (seed: 738562412, field: null, _name: 
func2)",
    "details": []
}{noformat}




[GitHub] [lucene] rmuir commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


rmuir commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r814039139



##
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
 try {
   if (bitSet != null) {
-assert counter >= 0;
-final long cost = Math.round(counter / numValuesPerDoc);
-return new BitDocIdSet(bitSet, cost);
+return new BitDocIdSet(bitSet);
   } else {
 Buffer concatenated = concat(buffers);
 LSBRadixSorter sorter = new LSBRadixSorter();
 sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, 
concatenated.length);
-final int l;
-if (multivalued) {
-  l = dedup(concatenated.array, concatenated.length);

Review comment:
   This optimization doesn't make sense to me. Buffers should only be used 
for tiny sets (they are very memory expensive).







[GitHub] [lucene] rmuir commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


rmuir commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r814040808



##
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
 try {
   if (bitSet != null) {
-assert counter >= 0;
-final long cost = Math.round(counter / numValuesPerDoc);
-return new BitDocIdSet(bitSet, cost);
+return new BitDocIdSet(bitSet);

Review comment:
   I don't think it is difficult, it just requires a little work. I can get 
to it soon, seems like it should be fun. Ultimately I think it will give us 
better estimations than what we have today, without all the tangled APIs and 
abstraction leakage.







[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r814045946



##
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
 try {
   if (bitSet != null) {
-assert counter >= 0;
-final long cost = Math.round(counter / numValuesPerDoc);
-return new BitDocIdSet(bitSet, cost);
+return new BitDocIdSet(bitSet);
   } else {
 Buffer concatenated = concat(buffers);
 LSBRadixSorter sorter = new LSBRadixSorter();
 sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, 
concatenated.length);
-final int l;
-if (multivalued) {
-  l = dedup(concatenated.array, concatenated.length);

Review comment:
   Ok, I am convinced. Thanks!







[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r814047234



##
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
 try {
   if (bitSet != null) {
-assert counter >= 0;
-final long cost = Math.round(counter / numValuesPerDoc);
-return new BitDocIdSet(bitSet, cost);
+return new BitDocIdSet(bitSet);

Review comment:
   I like the idea of sampling, thanks







[GitHub] [lucene] jpountz opened a new pull request #710: LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate).

2022-02-24 Thread GitBox


jpountz opened a new pull request #710:
URL: https://github.com/apache/lucene/pull/710


   This computes a pop count on a sample of the longs that back the bitset.
   
   Quick benchmarks suggest that this runs 5x-10x faster than
   `FixedBitSet#cardinality` depending on the length of the bitset.





[jira] [Commented] (LUCENE-10427) OLAP likewise rollup during segment merge process

2022-02-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497526#comment-17497526
 ] 

Adrien Grand commented on LUCENE-10427:
---

I know that the Elasticsearch team is looking into doing things like that, but 
on top of Lucene: creating another index with a different granularity, instead 
of having different granularities within the same index and relying on 
background merges for rollups.

At first sight, doing it within the same index feels a bit scary to me:
 - different segments would have different granularities,
 - merges would no longer combine segments but also perform lossy compression,
 - all file formats would need to be aware of rollups?
 - numeric doc values would need to be able to store multiple fields under the 
hood (min, max, etc.)

What would you think about doing it on top of Lucene instead, e.g. similarly to 
how the faceting module maintains a side-car taxonomy index, maybe one could 
maintain a side-car rollup index to speed up aggregations?
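
To make the side-car idea concrete, here is a minimal sketch (field names 
purely illustrative, borrowing the temperature example below) of the kind of 
pre-aggregated document such a rollup index would hold, one per (event_hour, 
device_id) bucket:
{code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoubleDocValuesField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;

static Document rollupDoc(long eventHour, String deviceId, double minTemp, double maxTemp) {
  Document doc = new Document();
  doc.add(new LongPoint("event_hour", eventHour));                  // hour bucket
  doc.add(new StringField("device_id", deviceId, Field.Store.NO));  // dimension
  doc.add(new DoubleDocValuesField("temp_min", minTemp));           // pre-aggregated metric
  doc.add(new DoubleDocValuesField("temp_max", maxTemp));           // pre-aggregated metric
  return doc;
}
{code}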

> OLAP likewise rollup during segment merge process
> -
>
> Key: LUCENE-10427
> URL: https://issues.apache.org/jira/browse/LUCENE-10427
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Suhan Mao
>Priority: Major
>
> Currently, many OLAP engines support a rollup feature, like 
> ClickHouse (AggregateMergeTree) / Druid. 
> Rollup definition: [https://athena.ecs.csus.edu/~mei/olap/OLAPoperations.php]
> One of the ways to do rollup is to merge the same dimension buckets into one 
> and do a sum()/min()/max() operation on metric fields during the segment 
> compact/merge process. This can significantly reduce the size of the data and 
> speed up the query a lot.
>  
> *Abstraction of how to do*
>  # Define the rollup logic: which fields are dimensions and which are metrics.
>  # Rollup definition for each metric field: max/min/sum ...
>  # Index sorting should be the same as the dimension fields.
>  # We will do the rollup calculation during segment merge, just like other 
> OLAP engines do.
>  
> *Assume the scenario*
> We use ES to ingest realtime raw temperature data every minutes of each 
> sensor device along with many dimension information. User may want to query 
> the data like "what is the max temperature of some device within some/latest 
> hour" or "what is the max temperature of some city within some/latest hour"
> In that way, we can define such fields and rollup definition:
>  # event_hour(round to hour granularity)
>  # device_id(dimension)
>  # city_id(dimension)
>  # temperature(metrics, max/min rollup logic)
> The raw data will periodically be rolled up to the hour granularity during 
> segment merge process, which should save 60x storage ideally in the end.
>  
> *How we do rollup in segment merge*
> bucket: docs should belong to the same bucket if the dimension values are all 
> the same.
>  # For docvalues merge, we send the normal mappedDocId if we encounter a new 
> bucket in DocIDMerger.
>  # Since the index sorting fields are the same as the dimension fields, if we 
> encounter more docs in the same bucket, we emit a special mappedDocId from 
> DocIDMerger.
>  # In DocValuesConsumer.mergeNumericField, if we meet a special mappedDocId, we 
> do a rollup calculation on the metric fields and fold the result value into the 
> first doc in the bucket. The calculation is just like a streaming merge-sort 
> rollup.
>  # We discard all the special mappedDocId docs because the metrics are already 
> folded into the first doc in the bucket.
>  # In BKD/posting structure, we discard all the special mappedDocId docs and 
> only place the first doc id within a bucket in the BKD/posting data. It 
> should be simple.
>  
> *How to define the logic*
>  
> {code:java}
> public class RollupMergeConfig {
>   private List<String> dimensionNames;
>   private List<RollupMergeAggregateField> aggregateFields;
> } 
> public class RollupMergeAggregateField {
>   private String name;
>   private RollupMergeAggregateType aggregateType;
> }
> public enum RollupMergeAggregateType {
>   COUNT,
>   SUM,
>   MIN,
>   MAX,
>   CARDINALITY // if data sketch is stored in binary doc values, we can do a 
> union logic 
> }{code}
>  
>  
> I have written the initial code at a basic level. I can submit the complete 
> PR if you think this feature is good to try.
>  
>  
>  
>  






[GitHub] [lucene] rmuir commented on pull request #710: LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate).

2022-02-24 Thread GitBox


rmuir commented on pull request #710:
URL: https://github.com/apache/lucene/pull/710#issuecomment-1050050397


   Since we made the method `abstract`, let's just have it forward to 
exact-cardinality for the `JavaUtilBitSet` used in the unit tests? It should 
fix the test issues.
   
   I agree with making the method abstract too. I think it is a better choice 
for performance-sensitive, lower-level classes like this.
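   
   Concretely, something like this for the test-only bitset:
   ```
   // JavaUtilBitSet (tests): the exact count is a valid "approximation".
   @Override
   public int approximateCardinality() {
     return cardinality();
   }
   ```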





[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop

2022-02-24 Thread Ankit Jain (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497553#comment-17497553
 ] 

Ankit Jain commented on LUCENE-10428:
-

{quote}By any chance, were you able to see what is the number of clauses of 
this query?{quote}

[~jpountz] - I did check the invocation of sumRelativeErrorBound and it 
probably showed 4.







[jira] [Comment Edited] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop

2022-02-24 Thread Ankit Jain (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497553#comment-17497553
 ] 

Ankit Jain edited comment on LUCENE-10428 at 2/24/22, 5:01 PM:
---

{quote}By any chance, were you able to see what is the number of clauses of 
this query?
{quote}
[~jpountz] - I did check the invocation of sumRelativeErrorBound and it 
probably showed 4.

Interestingly, even when I run the same query, it does not necessarily get into 
this convergence issue. So, I could not find an easy way to reproduce this at 
the query level


was (Author: akjain):
{quote}By any chance, were you able to see what is the number of clauses of 
this query?
{quote}
[~jpountz] - I did check the invocation of sumRelativeErrorBound and it 
probably showed 4.

Interestingly, even when I run the same query, it does not necessarily get into 
this convergence issue. So, could not find easy way to reproduce this from 
query level




[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop

2022-02-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497597#comment-17497597
 ] 

Adrien Grand commented on LUCENE-10428:
---

This is interesting indeed, since query execution should be quite deterministic. 
One way I can think of that this logic could enter an infinite loop is if some 
scorers manage to produce negative scores somehow. I'm mentioning this in case 
it rings a bell to you, but it may not be the only way to get into an infinite 
loop.

I opened a pull request that doesn't fix the bug but at least makes it an error 
instead of an infinite loop.
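
As a sketch, the shape of that defensive change (the iteration cap and the 
message are illustrative, not the actual patch):
{code:java}
int iters = 0;
while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
  if (++iters > 64) { // expected to converge in a handful of steps
    throw new IllegalStateException(
        "Failed to converge: minScore=" + minScore + ", minScoreSum=" + minScoreSum);
  }
  minScore -= Math.ulp(minScoreSum);
}
{code}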




[GitHub] [lucene] rmuir commented on a change in pull request #710: LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate).

2022-02-24 Thread GitBox


rmuir commented on a change in pull request #710:
URL: https://github.com/apache/lucene/pull/710#discussion_r814137771



##
File path: lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java
##
@@ -176,6 +176,30 @@ public int cardinality() {
 return (int) BitUtil.pop_array(bits, 0, numWords);
   }
 
+  @Override
+  public int approximateCardinality() {
+// Naive sampling: compute the number of bits that are set on the first 16 
longs every 1024
+// longs and scale the result by 1024/16.
+// This computes the pop count on ranges instead of single longs in order 
to take advantage of
+// vectorization.
+
+final int rangeLength = 16;
+final int interval = 1024;
+
+if (numWords < interval) {
+  return cardinality();
+}
+
+long popCount = 0;
+int maxWord;
+for (maxWord = 0; maxWord + interval < numWords; maxWord += interval) {
+  popCount += BitUtil.pop_array(bits, maxWord, rangeLength);

Review comment:
   This isn't really a review comment; just saying I would be in favor of 
removing these `BitUtil` methods, as I think they are outdated and provide no 
value. I think it would be easier on our eyes to just see loops with 
Long.bitCount.
   
   The other constants/methods in the `BitUtil` class actually provide value. 
But let's not wrap what the JDK provides efficiently for no reason?
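   
   i.e. instead of the wrapper, the call site above would just read:
   ```
   // plain JDK Long.bitCount, which hotspot already intrinsifies
   long popCount = 0;
   for (int i = maxWord; i < maxWord + rangeLength; i++) {
     popCount += Long.bitCount(bits[i]);
   }
   ```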







[jira] [Updated] (LUCENE-10391) Reuse data structures across HnswGraph invocations

2022-02-24 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10391:
--
Attachment: Screen Shot 2022-02-24 at 10.18.42 AM.png

> Reuse data structures across HnswGraph invocations
> --
>
> Key: LUCENE-10391
> URL: https://issues.apache.org/jira/browse/LUCENE-10391
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Minor
> Attachments: Screen Shot 2022-02-24 at 10.18.42 AM.png
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. 
> Profiles from nightly benchmarks suggest that allocating data-structures 
> incurs both lots of heap allocations 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap)]
>  and CPU usage 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu).]
>  It looks like reusing data structures across invocations would be a 
> low-hanging fruit that could help save significant CPU?






[jira] [Commented] (LUCENE-10438) Leverage Weight#count in lucene/facets

2022-02-24 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497616#comment-17497616
 ] 

Greg Miller commented on LUCENE-10438:
--

I experimented with this a bit for taxo- and ssdv-faceting but didn't get 
particularly far. I quickly discovered that {{luceneutil}} doesn't seem to 
exercise the {{Facets#getSpecificValue}} code path, which is where I think the 
optimization opportunity might be. To do this though, I had to defer counting 
to an "on demand" approach instead of counting during initialization. The good 
news is that this change doesn't seem to have regressed the existing benchmark 
tasks (see below).

I think the next steps here are to augment {{luceneutil}} to exercise 
{{getSpecificValue}} so we can measure impact. I'll see if I can find some time 
to poke into that, but if anyone else is interested in getting involved, feel 
free to jump in!
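
For reference, the code path such a task needs to hit is just the following 
(assuming an open searcher, taxonomy reader, and facets config; the dim/path 
values are illustrative):
{code:java}
FacetsCollector fc = new FacetsCollector();
FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
Number value = facets.getSpecificValue("Month", "January"); // the uncovered path
{code}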

 
{code:java}
                            TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value
           BrowseMonthSSDVFacets       16.42     (27.7%)       15.16     (24.5%)   -7.7% ( -46% -   61%) 0.354
          OrHighMedDayTaxoFacets        6.38      (6.7%)        6.28      (6.4%)   -1.5% ( -13% -   12%) 0.463
                      TermDTSort       93.64     (12.5%)       92.45     (11.9%)   -1.3% ( -22% -   26%) 0.742
            HighTermTitleBDVSort      142.12     (14.2%)      140.36     (13.0%)   -1.2% ( -24% -   30%) 0.773
            MedTermDayTaxoFacets       38.39      (4.2%)       37.92      (4.1%)   -1.2% (  -9% -    7%) 0.356
                      OrHighHigh       42.40      (4.6%)       42.04      (3.5%)   -0.9% (  -8% -    7%) 0.510
               HighTermMonthSort      104.42     (18.0%)      103.57     (17.0%)   -0.8% ( -30% -   41%) 0.882
                         Prefix3      270.23      (7.9%)      268.54     (11.0%)   -0.6% ( -18% -   19%) 0.837
                       OrHighMed       79.38      (4.5%)       79.00      (3.6%)   -0.5% (  -8% -    7%) 0.709
                    HighSpanNear       18.50      (2.4%)       18.43      (2.4%)   -0.4% (  -5% -    4%) 0.586
                          IntNRQ      135.21      (0.5%)      134.77      (1.6%)   -0.3% (  -2% -    1%) 0.371
                    OrNotHighLow     1056.43      (2.7%)     1055.39      (3.2%)   -0.1% (  -5% -    5%) 0.916
                        PKLookup      169.34      (3.5%)      169.19      (3.6%)   -0.1% (  -6% -    7%) 0.937
         AndHighMedDayTaxoFacets       34.87      (1.8%)       34.85      (1.9%)   -0.0% (  -3% -    3%) 0.939
                    OrNotHighMed      930.52      (3.9%)      930.70      (4.0%)    0.0% (  -7% -    8%) 0.988
                        Wildcard       93.02      (4.9%)       93.05      (6.7%)    0.0% ( -10% -   12%) 0.984
                         LowTerm     1992.53      (5.1%)     1993.41      (4.3%)    0.0% (  -8% -    9%) 0.976
                     AndHighHigh       52.14      (4.9%)       52.17      (4.0%)    0.1% (  -8% -    9%) 0.969
                HighSloppyPhrase       27.70      (4.0%)       27.72      (3.8%)    0.1% (  -7% -    8%) 0.933
           HighTermDayOfYearSort       82.23     (13.3%)       82.35     (14.7%)    0.2% ( -24% -   32%) 0.973
                   OrNotHighHigh      923.35      (3.6%)      925.08      (4.8%)    0.2% (  -7% -    8%) 0.889
        AndHighHighDayTaxoFacets       19.09      (2.3%)       19.16      (1.9%)    0.3% (  -3% -    4%) 0.622
                 LowSloppyPhrase       28.20      (2.4%)       28.31      (2.6%)    0.4% (  -4% -    5%) 0.624
                     LowSpanNear       11.96      (3.9%)       12.01      (2.5%)    0.4% (  -5% -    7%) 0.666
                       LowPhrase      241.84      (4.3%)      242.98      (4.0%)    0.5% (  -7% -    9%) 0.721
                     MedSpanNear       22.00      (3.3%)       22.11      (2.0%)    0.5% (  -4% -    6%) 0.568
       BrowseDayOfYearSSDVFacets       12.00     (15.6%)       12.06     (14.4%)    0.5% ( -25% -   36%) 0.909
                       MedPhrase       20.64      (4.9%)       20.75      (4.4%)    0.6% (  -8% -   10%) 0.709
                          Fuzzy2       60.95      (1.7%)       61.29      (1.8%)    0.6% (  -2% -    4%) 0.304
                      HighPhrase       19.65      (4.8%)       19.77      (4.3%)    0.6% (  -8% -   10%) 0.678
                 MedSloppyPhrase       30.43      (2.3%)       30.63      (2.3%)    0.7% (  -3% -    5%) 0.354
                          Fuzzy1       67.61      (1.6%)       68.07      (2.0%)    0.7% (  -2% -    4%) 0.246
                    OrHighNotMed     1150.70      (3.7%)     1159.51      (3.7%)    0.8% (  -6% -    8%) 0.516
                       OrHighLow      745.90      (2.9%)      751.76      (1.7%)    0.8% (  -3% -    5%) 0.292
                   OrHighNotHigh      898.58      (4.1%)      906.
{code}

[jira] [Commented] (LUCENE-10391) Reuse data structures across HnswGraph invocations

2022-02-24 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497648#comment-17497648
 ] 

Julie Tibshirani commented on LUCENE-10391:
---

Now that the benchmarks are running again, we can see an improvement in index 
throughput. It might be a combined effect between this change and LUCENE-10408.

!Screen Shot 2022-02-24 at 10.18.42 AM.png|width=444,height=277!

In the profiles, we are still seeing some NeighborQueue allocations. These are 
likely from the results queue, which is still not shared. It is not 
straightforward to share it though, since its size changes across the graph 
levels (it's sometimes 1, sometimes topK). I'm inclined to close this out for 
now without making more changes; let me know what you think.
{code:java}
PERCENT   HEAP SAMPLES   STACK
26.77%    145900M        org.apache.lucene.util.fst.BytesStore#writeByte()
  at org.apache.lucene.util.fst.FST#<init>()
8.22%     44814M         org.apache.lucene.util.LongHeap#<init>()
  at org.apache.lucene.util.hnsw.NeighborQueue#<init>()
{code}
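
(As an illustration of the reuse pattern only, not Lucene code: the queue could be made resettable, so a single instance serves every level with the bound swapped per call. The {{ScratchQueue}} type and its method below are invented for this sketch.)
{code:java}
/** Invented type, sketching how a results queue could be reused across levels. */
final class ScratchQueue {
  private long[] heap = new long[16];
  private int size;
  private int bound;

  /** Re-initialize instead of allocating: bound is 1 on the upper levels, topK on level 0. */
  void reset(int newBound) {
    bound = newBound;
    size = 0;
    if (heap.length < newBound) {
      heap = new long[newBound]; // grow once; the capacity sticks around for later calls
    }
  }
}
{code}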
 

> Reuse data structures across HnswGraph invocations
> --
>
> Key: LUCENE-10391
> URL: https://issues.apache.org/jira/browse/LUCENE-10391
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Minor
> Attachments: Screen Shot 2022-02-24 at 10.18.42 AM.png
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. 
> Profiles from nightly benchmarks suggest that allocating data-structures 
> incurs both lots of heap allocations 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap)]
>  and CPU usage 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu).]
>  It looks like reusing data structures across invocations would be a 
> low-hanging fruit that could help save significant CPU?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #710: LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate).

2022-02-24 Thread GitBox


rmuir commented on pull request #710:
URL: https://github.com/apache/lucene/pull/710#issuecomment-1050162660


   also, another random suggestion for another day. I think it would be fine to 
have some logic like this at some point:
   
   ```
   if (length < N) {
 return cardinality(); // for small bitsets, don't be fancy
   }
   ```
   
   But I'm not concerned either way. Just thought that if we need to iterate and 
introduce benchmarks, then ignoring tiny cases is an easy way to really zero in 
on good perf.
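   
   For concreteness, a sketch of how that could slot into `FixedBitSet` (the `SMALL` threshold and the sampling stride are made-up tuning constants, not the PR's actual code; `bits` and `numBits` are the existing fields):
   
   ```java
   public int approximateCardinality() {
     final int SMALL = 1 << 14; // made-up threshold: below this, just count exactly
     if (numBits < SMALL) {
       return cardinality(); // for small bitsets, don't be fancy
     }
     // sample every 16th word and scale the popcount back up
     long sampled = 0;
     int samples = 0;
     for (int i = 0; i < bits.length; i += 16) {
       sampled += Long.bitCount(bits[i]);
       samples++;
     }
     return (int) (sampled * (long) bits.length / samples);
   }
   ```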


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop

2022-02-24 Thread Ankit Jain (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497658#comment-17497658
 ] 

Ankit Jain commented on LUCENE-10428:
-

{quote}I opened a pull request that doesn't fix the bug but at least makes it 
an error instead of an infinite loop.
{quote}

Can you share a link to this PR? Also, we should capture all the debug 
information as part of that error so we can understand this further.

> getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge 
> leading to busy threads in infinite loop
> -
>
> Key: LUCENE-10428
> URL: https://issues.apache.org/jira/browse/LUCENE-10428
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/search
>Reporter: Ankit Jain
>Priority: Major
> Attachments: Flame_graph.png
>
>
> Customers complained about high CPU for Elasticsearch cluster in production. 
> We noticed that few search requests were stuck for long time
> {code:java}
> % curl -s localhost:9200/_cat/tasks?v   
> indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205  
> AmMLzDQ4RrOJievRDeGFZw:569204  direct1645195007282 14:36:47  6.2h
> indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075  
> emjWc5bUTG6lgnCGLulq-Q:502074  direct1645195037259 14:37:17  6.2h
> indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270  
> emjWc5bUTG6lgnCGLulq-Q:583269  direct1645201316981 16:21:56  4.5h
> {code}
> Flame graphs indicated that CPU time is mostly going into 
> *getMinCompetitiveScore method in MaxScoreSumPropagator*. After doing some 
> live JVM debugging found that 
> org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had 
> around 4 million invocations every second
> Figured out the values of some parameters from live debugging:
> {code:java}
> minScoreSum = 3.5541441
> minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = 
> 3.554144322872162
> returnObj scoreSumUpperBound = 3.5541444
> Math.ulp(minScoreSum) = 2.3841858E-7
> {code}
> Example code snippet:
> {code:java}
> double sumOfOtherMaxScores = 3.554144322872162;
> double minScoreSum = 3.5541441;
> float minScore = (float) (minScoreSum - sumOfOtherMaxScores);
> while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
> minScore -= Math.ulp(minScoreSum);
> System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum));
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on pull request #686: LUCENE-10421: use Constant instead of relying upon timestamp

2022-02-24 Thread GitBox


jtibshirani commented on pull request #686:
URL: https://github.com/apache/lucene/pull/686#issuecomment-1050175765


   Thanks @rmuir ! Are you okay to merge this? I got confused recently over a 
sometimes-reproducible test failure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-10440) Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets

2022-02-24 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller reassigned LUCENE-10440:


Assignee: Greg Miller

> Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets
> ---
>
> Key: LUCENE-10440
> URL: https://issues.apache.org/jira/browse/LUCENE-10440
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>
> Similar to what we did in LUCENE-10379, let's reduce the {{public}} 
> visibility of {{TaxonomyFacets}} and {{FloatTaxonomyFacets}} to pkg-private 
> since they're really implementation details housing common logic and not 
> really intended as extension points for user faceting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10440) Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets

2022-02-24 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10440:


 Summary: Reduce visibility of TaxonomyFacets and 
FloatTaxonomyFacets
 Key: LUCENE-10440
 URL: https://issues.apache.org/jira/browse/LUCENE-10440
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Greg Miller


Similar to what we did in LUCENE-10379, let's reduce the {{public}} visibility 
of {{TaxonomyFacets}} and {{FloatTaxonomyFacets}} to pkg-private since they're 
really implementation details housing common logic and not really intended as 
extension points for user faceting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-24 Thread GitBox


rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1050188471


   > I don't think the grow(long) is necessary, we can always add to the 
IntersectVisitor instead. Maybe it would be worthwhile to adjust how we call grow() in 
BKDReader#addAll as it does not need the dance it is currently doing
   
   Sorry, I'm not so familiar with the code in question. Does this mean we can 
remove the `grown` parameter here and the split logic around it for the 
`addAll()` method? If so, that sounds great!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller opened a new pull request #712: LUCENE-10440: Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets

2022-02-24 Thread GitBox


gsmiller opened a new pull request #712:
URL: https://github.com/apache/lucene/pull/712


   # Description
   
   These two classes are really implementation details, meant to hold common 
logic for our faceting implementations, but they are `public` and could be 
extended by users. It would be nice to reduce visibility to shrink our API 
surface area.
   
   # Solution
   
   Make these classes pkg-private. Also reduce the visibility of `protected` 
methods/fields to pkg-private as a little extra cleanup. Note that I will mark 
these as `@Deprecated` on the 9x branch to provide advance notice to any users 
that might be extending these.
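   
   For illustration, the intended end state on the two branches might look roughly like this (a sketch of the visibility change, not the exact diff):
   
   ```java
   // branch_9x: still public, but carrying an early deprecation warning
   @Deprecated
   public abstract class TaxonomyFacets extends Facets { /* common taxonomy logic */ }
   
   // main: pkg-private, no longer an extension point outside the facet package
   abstract class TaxonomyFacets extends Facets { /* common taxonomy logic */ }
   ```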
   
   # Tests
   
   No new testing needed.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller opened a new pull request #713: LUCENE-10440: Mark TaxonomyFacets and FloatTaxonomyFacets as deprecated

2022-02-24 Thread GitBox


gsmiller opened a new pull request #713:
URL: https://github.com/apache/lucene/pull/713


   This is a "backport" of #712, providing early `@Deprecation` notice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10440) Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets

2022-02-24 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497692#comment-17497692
 ] 

Greg Miller commented on LUCENE-10440:
--

PRs posted for this. The only point maybe worth calling out here for discussion 
is that the visibility reduction of {{TaxonomyFacets}} means there is no longer 
a common type that refers to just the taxonomy-faceting implementations. The 
only reason I could see this _maybe_ mattering is that {{TaxonomyFacets}} 
defines the public methods {{childrenLoaded()}} and {{siblingsLoaded()}}. So 
it's possible some user wants to refer to taxonomy facets generally, but less 
generally than just referencing {{Facets}}, because they want to rely on one of 
these methods. This seems unlikely to me. The only code we have that references 
these methods is in testing, but I suppose users might want to know whether 
these things were loaded for the purpose of metrics/logging/etc.

> Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets
> ---
>
> Key: LUCENE-10440
> URL: https://issues.apache.org/jira/browse/LUCENE-10440
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Similar to what we did in LUCENE-10379, let's reduce the {{public}} 
> visibility of {{TaxonomyFacets}} and {{FloatTaxonomyFacets}} to pkg-private 
> since they're really implementation details housing common logic and not 
> really intended as extension points for user faceting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

2022-02-24 Thread GitBox


magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-1050227228


   This patch applies cleanly and all tests pass. I plan to commit this within 
the next few days, because I think it does improve things (targeting the 9.1 
release).
   
   But I want to go on the record mentioning that there are a number of things 
I've noticed in the process of looking at this code (completely orthogonal to 
this PR) that make me a bit uncomfortable:
   
   1. All the objects retrieved through OpenNLPOpsFactory are cached in [static 
maps](https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java#L41-L48)
 that are [never 
cleared](https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java#L191-L199).
   2. These maps are populated in a way where there's a race condition (could 
end up with multiple copies of these objects being held external to this class).
   
   wrt this PR in particular, I'm nervous about the fact that everything _but_ 
the DictionaryLemmatizers has long been cached as objects, which makes me 
wonder whether there was a reason that, from this class's inception, the 
dictionaryLemmatizers have been cached as String versions of the raw 
dictionary. But I've tried to chase down this line of reasoning, and still 
can't find any obvious problem with this change.
   
   Analogous to not "letting the perfect be the enemy of the good", I'm going 
to not "let the 'not-so-good' be the enemy of the 'probably worse'". I hope 
this is the correct decision.
   
   @spyk thanks for your patience, and for keeping the PR up-to-date. If you 
can add a CHANGES.txt entry, that's probably warranted here (if only because of 
the change to the public API).
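   
   (Tangentially, the check-then-put race could be closed with `ConcurrentHashMap#computeIfAbsent`, along the lines of this generic sketch; the class and names here are illustrative, not the actual OpenNLPOpsFactory code.)
   
   ```java
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ConcurrentMap;
   import java.util.function.Supplier;
   
   final class ModelCache {
     // one atomic publish point per key: concurrent first accesses observe the same instance
     private static final ConcurrentMap<String, Object> CACHE = new ConcurrentHashMap<>();
   
     static Object getOrLoad(String name, Supplier<Object> loader) {
       return CACHE.computeIfAbsent(name, k -> loader.get());
     }
   }
   ```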


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()

2022-02-24 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497716#comment-17497716
 ] 

Uwe Schindler commented on LUCENE-10431:


What was the exact query? Is it the only boolean one in the NetBeans pull 
request, with just 4 MUST clauses?

I would recommend adding the clauses using the method that takes 2 parameters 
rather than instantiating the clauses yourself. But this should not cause an issue.

Otherwise: are there other queries, e.g. ones which may be recursive (a boolean 
query that refers to another query wrapping itself)?

> AssertionError in BooleanQuery.hashCode()
> -
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Michael Bien
>Priority: Major
>
> Hello devs,
> the constructor of BooleanQuery can under some circumstances trigger a hash 
> code computation before "clauseSets" is fully filled. Since BooleanClause is 
> using its query field for the hash code too, it can happen that the "wrong" 
> hash code is stored, since adding the clause to the set triggers its 
> hashCode().
> If assertions are enabled the check in BooleanQuery, which recomputes the 
> hash code, will notice it and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>     at java.base/java.util.Objects.hashCode(Objects.java:103)
>     at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>     at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>     at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>     at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>     at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>     at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>     at java.base/java.util.Objects.hash(Objects.java:133)
>     at 
> org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>     at java.base/java.util.HashMap.hash(HashMap.java:340)
>     at java.base/java.util.HashMap.put(HashMap.java:612)
>     at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>     at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:154)
>     at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:42)
>     at 
> org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules 
> from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()

2022-02-24 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497718#comment-17497718
 ] 

Uwe Schindler commented on LUCENE-10431:


Sorry, with the builder pattern you can't create a BQ referring to itself as a clause. 😂

> AssertionError in BooleanQuery.hashCode()
> -
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Michael Bien
>Priority: Major
>
> Hello devs,
> the constructor of BooleanQuery can under some circumstances trigger a hash 
> code computation before "clauseSets" is fully filled. Since BooleanClause is 
> using its query field for the hash code too, it can happen that the "wrong" 
> hash code is stored, since adding the clause to the set triggers its 
> hashCode().
> If assertions are enabled the check in BooleanQuery, which recomputes the 
> hash code, will notice it and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>     at java.base/java.util.Objects.hashCode(Objects.java:103)
>     at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>     at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>     at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>     at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>     at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>     at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>     at java.base/java.util.Objects.hash(Objects.java:133)
>     at 
> org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>     at java.base/java.util.HashMap.hash(HashMap.java:340)
>     at java.base/java.util.HashMap.put(HashMap.java:612)
>     at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>     at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:154)
>     at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:42)
>     at 
> org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules 
> from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9952) FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497749#comment-17497749
 ] 

ASF subversion and git services commented on LUCENE-9952:
-

Commit 4af516a1491e55022ca81a909b7c78d54d8272c0 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4af516a ]

Remove TODO for LUCENE-9952 since that issue was fixed


> FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts
> ---
>
> Key: LUCENE-9952
> URL: https://issues.apache.org/jira/browse/LUCENE-9952
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Affects Versions: 9.0
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> As described in a dev@ list 
> [thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCANJ0CDo-9zt0U_pxWNOBkfiJpaAXZGGwOEJPnENAP6JzWz_t9Q%40mail.gmail.com%3E],
>  the value of {{FacetResult#value}} can be incorrect in SSDV faceting when 
> docs are multi-valued (affects both {{SortedSetDocValueFacetCounts}} and 
> {{ConcurrentSortedSetDocValueFacetCounts}}). If a doc has multiple values in 
> the same dimension, it will be counted multiple times when populating the 
> counts of {{FacetResult#value}}.
> We should either provide an accurate count, or provide {{-1}} if we don't 
> have an accurate count (like we do in taxonomy faceting). I _think_ this 
> change will be a bit involved though as SSDV facet counting likely needs to 
> be made aware of {{FacetConfig}}.
> NOTE: I've updated this description to describe only the SSDV case after 
> spinning off LUCENE-9953 to track the LongValueFacetCounts case.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9952) FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497757#comment-17497757
 ] 

ASF subversion and git services commented on LUCENE-9952:
-

Commit 81ab1d6ab6fa3aee69153b256a01a4b984f88b59 in lucene's branch 
refs/heads/branch_9x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=81ab1d6 ]

Remove TODO for LUCENE-9952 since that issue was fixed


> FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts
> ---
>
> Key: LUCENE-9952
> URL: https://issues.apache.org/jira/browse/LUCENE-9952
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Affects Versions: 9.0
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> As described in a dev@ list 
> [thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCANJ0CDo-9zt0U_pxWNOBkfiJpaAXZGGwOEJPnENAP6JzWz_t9Q%40mail.gmail.com%3E],
>  the value of {{FacetResult#value}} can be incorrect in SSDV faceting when 
> docs are multi-valued (affects both {{SortedSetDocValueFacetCounts}} and 
> {{ConcurrentSortedSetDocValueFacetCounts}}). If a doc has multiple values in 
> the same dimension, it will be counted multiple times when populating the 
> counts of {{FacetResult#value}}.
> We should either provide an accurate count, or provide {{-1}} if we don't 
> have an accurate count (like we do in taxonomy faceting). I _think_ this 
> change will be a bit involved though as SSDV facet counting likely needs to 
> be made aware of {{FacetConfig}}.
> NOTE: I've updated this description to describe only the SSDV case after 
> spinning off LUCENE-9953 to track the LongValueFacetCounts case.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10394) Explore moving ByteBuffer(sData|Index)Input to absolute bulk gets

2022-02-24 Thread Gautam Worah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497873#comment-17497873
 ] 

Gautam Worah commented on LUCENE-10394:
---

I'll try to work on this soon. Looking into the ByteBuffer API in the meantime.
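
For illustration, the duplicate-based workaround versus the absolute bulk get looks roughly like this (a sketch with invented names, not the actual ByteBuffersDataInput code):
{code:java}
import java.nio.ByteBuffer;

class AbsoluteGetSketch {
  // relative-read workaround: duplicate() yields a private position we can mutate
  static void readWithDuplicate(ByteBuffer buffer, int pos, byte[] dst, int off, int len) {
    ByteBuffer dup = buffer.duplicate();
    dup.position(pos);
    dup.get(dst, off, len);
  }

  // absolute bulk get (Java 13+): no duplicate allocation, no position mutation
  static void readAbsolute(ByteBuffer buffer, int pos, byte[] dst, int off, int len) {
    buffer.get(pos, dst, off, len);
  }
}
{code}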

> Explore moving ByteBuffer(sData|Index)Input to absolute bulk gets
> -
>
> Key: LUCENE-10394
> URL: https://issues.apache.org/jira/browse/LUCENE-10394
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> With the move to Java 17, we now have access to absolute bulk gets on 
> ByteBuffers: 
> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/ByteBuffer.html#get(int,byte%5B%5D,int,int).
>  We should look into whether this helps with our more random-access workloads 
> like binary doc values, conjunctive queries and building HNSW graphs.
> ByteBuffersDataInput already tries to access the underlying buffers in a 
> random-access fashion and works around the lack of absolute bulk gets by 
> doing {{ByteBuffer#duplicate()}}. It looks like a low hanging fruit to stop 
> duplicating the buffer and just do an absolute bulk get instead. 
> ByteBuffersIndexInput would require a bit more work since it's performing 
> relative reads whenever possible.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] (LUCENE-10394) Explore moving ByteBuffer(sData|Index)Input to absolute bulk gets

2022-02-24 Thread Gautam Worah (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10394 ]


Gautam Worah deleted comment on LUCENE-10394:
---

was (Author: gworah):
I'll try to work on this soon. Looking into the ByteBuffer API in the meantime.

> Explore moving ByteBuffer(sData|Index)Input to absolute bulk gets
> -
>
> Key: LUCENE-10394
> URL: https://issues.apache.org/jira/browse/LUCENE-10394
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> With the move to Java 17, we now have access to absolute bulk gets on 
> ByteBuffers: 
> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/ByteBuffer.html#get(int,byte%5B%5D,int,int).
>  We should look into whether this helps with our more random-access workloads 
> like binary doc values, conjunctive queries and building HNSW graphs.
> ByteBuffersDataInput already tries to access the underlying buffers in a 
> random-access fashion and works around the lack of absolute bulk gets by 
> doing {{ByteBuffer#duplicate()}}. It looks like a low hanging fruit to stop 
> duplicating the buffer and just do an absolute bulk get instead. 
> ByteBuffersIndexInput would require a bit more work since it's performing 
> relative reads whenever possible.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10441) ArrayIndexOutOfBoundsException during indexing

2022-02-24 Thread Peixin Li (Jira)
Peixin Li created LUCENE-10441:
--

 Summary: ArrayIndexOutOfBoundsException during indexing
 Key: LUCENE-10441
 URL: https://issues.apache.org/jira/browse/LUCENE-10441
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Peixin Li


Hi experts! I am facing an ArrayIndexOutOfBoundsException during indexing and 
committing documents. This exception gives me no clue about what happened, so I 
have little information for debugging. Can I have some suggestions about what 
the cause could be and how to fix this error?
{code:java}
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.util.BytesRefHash$1.get(BytesRefHash.java:179)
    at 
org.apache.lucene.util.StringMSBRadixSorter$1.get(StringMSBRadixSorter.java:42)
    at 
org.apache.lucene.util.StringMSBRadixSorter$1.setPivot(StringMSBRadixSorter.java:63)
    at org.apache.lucene.util.Sorter.binarySort(Sorter.java:192)
    at org.apache.lucene.util.Sorter.binarySort(Sorter.java:187)
    at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:41)
    at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:83)
    at org.apache.lucene.util.IntroSorter.sort(IntroSorter.java:36)
    at org.apache.lucene.util.MSBRadixSorter.introSort(MSBRadixSorter.java:133)
    at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:126)
    at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:121)
    at org.apache.lucene.util.BytesRefHash.sort(BytesRefHash.java:183)
    at 
org.apache.lucene.index.SortedSetDocValuesWriter.flush(SortedSetDocValuesWriter.java:171)
    at 
org.apache.lucene.index.DefaultIndexingChain.writeDocValues(DefaultIndexingChain.java:348)
    at 
org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:228)
    at 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
    at 
org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
    at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10441) ArrayIndexOutOfBoundsException during indexing

2022-02-24 Thread Peixin Li (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peixin Li updated LUCENE-10441:
---
Description: 
Hi experts! I am facing an ArrayIndexOutOfBoundsException during indexing and 
committing documents. This exception gives me no clue about what happened, so I 
have little information for debugging. Can I have some suggestions about what 
the cause could be and how to fix this error? I'm using Lucene 8.10.0.
{code:java}
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.util.BytesRefHash$1.get(BytesRefHash.java:179)
    at 
org.apache.lucene.util.StringMSBRadixSorter$1.get(StringMSBRadixSorter.java:42)
    at 
org.apache.lucene.util.StringMSBRadixSorter$1.setPivot(StringMSBRadixSorter.java:63)
    at org.apache.lucene.util.Sorter.binarySort(Sorter.java:192)
    at org.apache.lucene.util.Sorter.binarySort(Sorter.java:187)
    at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:41)
    at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:83)
    at org.apache.lucene.util.IntroSorter.sort(IntroSorter.java:36)
    at org.apache.lucene.util.MSBRadixSorter.introSort(MSBRadixSorter.java:133)
    at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:126)
    at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:121)
    at org.apache.lucene.util.BytesRefHash.sort(BytesRefHash.java:183)
    at 
org.apache.lucene.index.SortedSetDocValuesWriter.flush(SortedSetDocValuesWriter.java:171)
    at 
org.apache.lucene.index.DefaultIndexingChain.writeDocValues(DefaultIndexingChain.java:348)
    at 
org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:228)
    at 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
    at 
org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
    at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) {code}

  was:
Hi experts! I am facing an ArrayIndexOutOfBoundsException during indexing and 
committing documents. This exception gives me no clue about what happened, so I 
have little information for debugging. Can I have some suggestions about what 
the cause could be and how to fix this error?
{code:java}
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.util.BytesRefHash$1.get(BytesRefHash.java:179)
    at 
org.apache.lucene.util.StringMSBRadixSorter$1.get(StringMSBRadixSorter.java:42)
    at 
org.apache.lucene.util.StringMSBRadixSorter$1.setPivot(StringMSBRadixSorter.java:63)
    at org.apache.lucene.util.Sorter.binarySort(Sorter.java:192)
    at org.apache.lucene.util.Sorter.binarySort(Sorter.java:187)
    at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:41)
    at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:83)
    at org.apache.lucene.util.IntroSorter.sort(IntroSorter.java:36)
    at org.apache.lucene.util.MSBRadixSorter.introSort(MSBRadixSorter.java:133)
    at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:126)
    at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:121)
    at org.apache.lucene.util.BytesRefHash.sort(BytesRefHash.java:183)
    at 
org.apache.lucene.index.SortedSetDocValuesWriter.flush(SortedSetDocValuesWriter.java:171)
    at 
org.apache.lucene.index.DefaultIndexingChain.writeDocValues(DefaultIndexingChain.java:348)
    at 
org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:228)
    at 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
    at 
org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
    at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) {code}


> ArrayIndexOutOfBoundsException during indexing
> --
>
> Key: LUCENE-10441
> URL: https://issues.apache.org/jira/browse/LUCENE-10441
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Peixin Li
>Priority: Major
>
> Hi experts! I am facing an ArrayIndexOutOfBoundsException during indexing and 
> committing documents. This exception gives me no clue about what happened, so 
> I have little information for debugging. Can I have some suggestions about 
> what the cause could be and how to fix this error? I'm using Lucene 8.10.0.
> {code:java}
> java.lang.ArrayIndexOutOfBoundsEx

[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()

2022-02-24 Thread Michael Bien (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497915#comment-17497915
 ] 

Michael Bien commented on LUCENE-10431:
---

thanks for the tips. I might have found the cause. The BooleanQuery in question 
is constructed as follows:
{code:java}
add g:org?eclipse?jetty (groupId:org?eclipse?jetty*) with SHOULD
add a:org?eclipse?jetty (artifactId:org?eclipse?jetty*) with SHOULD
add v:org.eclipse.jetty* with SHOULD
add n:org.eclipse.jetty* with SHOULD
add d:org.eclipse.jetty* with SHOULD
add +classnames:org +classnames:eclipse +classnames:jetty with SHOULD
build: org.apache.lucene.search.BooleanQuery$Builder@4c2e43fc
result: (g:org?eclipse?jetty (groupId:org?eclipse?jetty*)) (a:org?eclipse?jetty 
(artifactId:org?eclipse?jetty*)) v:org.eclipse.jetty* n:org.eclipse.jetty* 
d:org.eclipse.jetty* (+classnames:org +classnames:eclipse 
+classnames:jetty){code}
(user searching "org.eclipse.jetty" in the maven repository index)

However, *before* each query is added to the builder, its rewrite method is 
changed to "CONSTANT_SCORE_BOOLEAN_REWRITE" recursively 
(builder.add(setBooleanRewrite(q), occur)).
{code:java}
    private static Query setBooleanRewrite (final Query q) {
        if (q instanceof MultiTermQuery) {
            
((MultiTermQuery)q).setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE);
        } else if (q instanceof BooleanQuery) {
            for (BooleanClause c : ((BooleanQuery)q).clauses()) {
                setBooleanRewrite(c.getQuery());
            }
        }
        return q;
    }{code}
If I remove this, I don't see any assertion errors. Is this not a legal way of 
changing the rewrite method? I am a little bit worried that this is only hiding 
the issue and that it might appear somewhere else.

(btw using the two-arg add didn't help, but I am going to change it anyway 
since it's shorter :))
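
If the staleness indeed comes from mutating queries that are already clauses of a built BooleanQuery, one way to sidestep it is to set the rewrite method on each MultiTermQuery when it is created, before it is added to any builder; a minimal sketch, reusing the field/term from above:
{code:java}
WildcardQuery g = new WildcardQuery(new Term("groupId", "org?eclipse?jetty*"));
// mutate BEFORE the query becomes reachable from any BooleanQuery
g.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE);

BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(g, BooleanClause.Occur.SHOULD);
BooleanQuery query = builder.build(); // clauses are final by the time hash codes are computed
{code}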

> AssertionError in BooleanQuery.hashCode()
> -
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Michael Bien
>Priority: Major
>
> Hello devs,
> the constructor of BooleanQuery can under some circumstances trigger a hash 
> code computation before "clauseSets" is fully filled. Since BooleanClause is 
> using its query field for the hash code too, it can happen that the "wrong" 
> hash code is stored, since adding the clause to the set triggers its 
> hashCode().
> If assertions are enabled the check in BooleanQuery, which recomputes the 
> hash code, will notice it and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>     at java.base/java.util.Objects.hashCode(Objects.java:103)
>     at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>     at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>     at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>     at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>     at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>     at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>     at java.base/java.util.Objects.hash(Objects.java:133)
>     at 
> org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>     at java.base/java.util.HashMap.hash(HashMap.java:340)
>     at java.base/java.util.HashMap.put(HashMap.java:612)
>     at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>     at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:154)
>     at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:42)
>     at 
> org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules 
> from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] LuXugang opened a new pull request #714: LUCENE-10439: update CHANGES.txt

2022-02-24 Thread GitBox


LuXugang opened a new pull request #714:
URL: https://github.com/apache/lucene/pull/714


   update CHANGES.txt for 
[LUCENE-10424](https://issues.apache.org/jira/browse/LUCENE-10424) and 
[LUCENE-10439](https://issues.apache.org/jira/browse/LUCENE-10439) .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir merged pull request #686: LUCENE-10421: use Constant instead of relying upon timestamp

2022-02-24 Thread GitBox


rmuir merged pull request #686:
URL: https://github.com/apache/lucene/pull/686


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497920#comment-17497920
 ] 

ASF subversion and git services commented on LUCENE-10421:
--

Commit 466278e14921572ceb54a0a52a8a262476ee24b7 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=466278e ]

LUCENE-10421: use Constant instead of relying upon timestamp (#686)
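
(Presumably this amounts to replacing the time-based seed with a fixed constant, roughly as sketched below; the exact value is an assumption for illustration, not copied from the commit.)
{noformat}
public final class HnswGraphBuilder {
  /** Fixed seed: identical input now always yields an identical graph. */
  private static final long DEFAULT_RAND_SEED = 42L; // hypothetical value
}
{noformat}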



> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in <module>
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497932#comment-17497932
 ] 

ASF subversion and git services commented on LUCENE-10421:
--

Commit 5972b495ba6f5145492077dfb2a5d28717f71533 in lucene's branch 
refs/heads/branch_9x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5972b49 ]

LUCENE-10421: use Constant instead of relying upon timestamp (#686)



> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in <module>
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-24 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10421.
--
Fix Version/s: 9.1
   Resolution: Fixed

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in <module>
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org