[GitHub] [lucene] gf2121 opened a new pull request #706: LUCENE-10417: Revert "LUCENE-10315"
gf2121 opened a new pull request #706: URL: https://github.com/apache/lucene/pull/706

The SIMD optimization for the BKD `DocIdsWriter` was introduced in https://github.com/apache/lucene/pull/652 to speed up decoding of docIDs, but it caused a regression in the nightly benchmark: https://home.apache.org/~mikemccand/lucenebench/IntNRQ.html

I tried to run `wiki10m` locally but could not reproduce the regression. I'll continue to dig, but I think we need to revert it first.
[jira] [Resolved] (LUCENE-10435) Break loop early while checking whether DocValuesFieldExistsQuery can be rewritten to MatchAllDocsQuery
[ https://issues.apache.org/jira/browse/LUCENE-10435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-10435.
---
Fix Version/s: 9.1
Resolution: Fixed

> Break loop early while checking whether DocValuesFieldExistsQuery can be rewritten to MatchAllDocsQuery
> --------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-10435
> URL: https://issues.apache.org/jira/browse/LUCENE-10435
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Lu Xugang
> Priority: Trivial
> Fix For: 9.1
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> In the implementation of Query#rewrite in DocValuesFieldExistsQuery, as soon as one segment that can't match the condition is found, we should break out of the loop immediately.
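For illustration, a minimal sketch of the early-exit rewrite described in LUCENE-10435; `allDocsHaveField` is a hypothetical stand-in for the real per-segment doc-values check, not the actual Lucene code:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;

abstract class EarlyExitRewrite extends Query {
  // Hypothetical per-segment check: true if every live doc has the field.
  abstract boolean allDocsHaveField(LeafReader leaf) throws IOException;

  @Override
  public Query rewrite(IndexReader reader) throws IOException {
    for (LeafReaderContext context : reader.leaves()) {
      if (allDocsHaveField(context.reader()) == false) {
        return this; // early exit: one non-matching segment settles it
      }
    }
    return new MatchAllDocsQuery(); // every segment matched
  }
}
```

The point of the change is only the `return this` inside the loop: there is no reason to keep scanning segments once the rewrite is known to be impossible.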
[GitHub] [lucene] gf2121 merged pull request #706: LUCENE-10417: Revert "LUCENE-10315"
gf2121 merged pull request #706: URL: https://github.com/apache/lucene/pull/706
[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497260#comment-17497260 ]

ASF subversion and git services commented on LUCENE-10315:
--

Commit b0ca227862950a1869b535f31881cdfc2e859176 in lucene's branch refs/heads/main from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b0ca227 ]

LUCENE-10417: Revert "LUCENE-10315" (#706)

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> --------------------------------------------------------
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Feng Guo
> Assignee: Feng Guo
> Priority: Major
> Fix For: 9.1
>
> Attachments: addall.svg
>
> Time Spent: 6h 20m
> Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index low-cardinality fields such as gender / age / status..., they often use numbers to represent the values, so ES infers these fields as {{long}}, and ES uses BKD as the index for {{long}} fields. When the data volume grows, building the result set for low-cardinality fields makes the CPU usage and load very high.
> This is a flame graph we obtained from the production environment: [^addall.svg]
> It can be seen that almost all CPU time is spent in addAll. When we reindexed {{long}} to {{keyword}}, the cluster load and search latency were greatly reduced (we spent weeks reindexing all indices...). I know that the ES documentation recommends {{keyword}} for term/terms queries and {{long}} for range queries, but there are always users who don't realize this and keep the habits they formed with SQL databases, or dynamic mapping automatically selects the type for them. All in all, users won't realize that there can be such a big performance difference between {{long}} and {{keyword}} fields at low cardinality. So from my point of view it makes sense to make BKD work better for low/medium-cardinality fields.
> As far as I can see, for low-cardinality fields there are two advantages of {{keyword}} over {{long}}:
> 1. The {{ForUtil}} used in {{keyword}} postings is much more efficient than BKD's delta VInt, because of its batch reading (readLongs) and SIMD decoding.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily materialize its result set, so when another small result clause intersects with this low-cardinality condition, the low-cardinality field can avoid reading all docIds into memory.
> This issue targets the first point. The basic idea is to use a 512-int {{ForUtil}} for the BKD ids codec. I benchmarked this optimization by mocking some random {{LongPoint}} values and querying them with {{PointInSetQuery}}.
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |1|32|1|51.44|148.26|188.22%|
> |1|32|2|26.8|101.88|280.15%|
> |1|32|4|14.04|53.52|281.20%|
> |1|32|8|7.04|28.54|305.40%|
> |1|32|16|3.54|14.61|312.71%|
> |1|128|1|110.56|350.26|216.81%|
> |1|128|8|16.6|89.81|441.02%|
> |1|128|16|8.45|48.07|468.88%|
> |1|128|32|4.2|25.35|503.57%|
> |1|128|64|2.13|13.02|511.27%|
> |1|1024|1|536.19|843.88|57.38%|
> |1|1024|8|109.71|251.89|129.60%|
> |1|1024|32|33.24|104.11|213.21%|
> |1|1024|128|8.87|30.47|243.52%|
> |1|1024|512|2.24|8.3|270.54%|
> |1|8192|1|.33|5000|50.00%|
> |1|8192|32|139.47|214.59|53.86%|
> |1|8192|128|54.59|109.23|100.09%|
> |1|8192|512|15.61|36.15|131.58%|
> |1|8192|2048|4.11|11.14|171.05%|
> |1|1048576|1|2597.4|3030.3|16.67%|
> |1|1048576|32|314.96|371.75|18.03%|
> |1|1048576|128|99.7|116.28|16.63%|
> |1|1048576|512|30.5|37.15|21.80%|
> |1|1048576|2048|10.38|12.3|18.50%|
> |1|8388608|1|2564.1|3174.6|23.81%|
> |1|8388608|32|196.27|238.95|21.75%|
> |1|8388608|128|55.36|68.03|22.89%|
> |1|8388608|512|15.58|19.24|23.49%|
> |1|8388608|2048|4.56|5.71|25.22%|
> The index size is reduced for low-cardinality fields and flat for high-cardinality fields.
> {code:java}
> 113M index_1_doc_32_cardinality_baseline
> 114M index_1_doc_32_cardinality_candidate
> 140M index_1_doc_128_cardinality_baseline
> 133M index_1_doc_128_cardinality_candidate
> 193M index_1_doc_1024_cardinality_baseline
> 174M index_1_doc_1024_cardinality_candidate
> 241M index_1_doc_8192_cardinality_baseline
> 233M index_1_doc_8192_cardinality_candidate
> 314M index_1_doc_1048576_cardinality_baseline
> 315M index_1_doc_1048576_cardinality_candidate
> 392M index_1_doc_8388608_cardinality_baseline
> 391M index_1_doc_8388608_cardinality_candidate
> {code}
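A rough illustration of the codec difference described in point 1 of LUCENE-10315 above. This is a sketch of the baseline's shape, not the actual `DocIdsWriter` code: each delta-VInt value needs a data-dependent, variable-length read, which defeats batching and SIMD, whereas a ForUtil-style codec reads fixed-width packed blocks in bulk via `readLongs`:

```java
import java.io.IOException;
import org.apache.lucene.store.DataInput;

final class DeltaVIntDocIds {
  // Baseline shape: one variable-length, branchy decode per docID.
  static void readDeltaVInts(DataInput in, int count, int[] docIds) throws IOException {
    int doc = 0;
    for (int i = 0; i < count; i++) {
      doc += in.readVInt(); // data-dependent decode, hard to pipeline or vectorize
      docIds[i] = doc;
    }
  }
}
```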
[jira] [Commented] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark
[ https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497259#comment-17497259 ]

ASF subversion and git services commented on LUCENE-10417:
--

Commit b0ca227862950a1869b535f31881cdfc2e859176 in lucene's branch refs/heads/main from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b0ca227 ]

LUCENE-10417: Revert "LUCENE-10315" (#706)

> IntNRQ task performance decreased in nightly benchmark
> --------------------------------------------------------
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/codecs
> Reporter: Feng Guo
> Assignee: Feng Guo
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315, I'll dig.
[jira] [Assigned] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?
[ https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayya Sharipova reassigned LUCENE-10194:
Assignee: Mayya Sharipova

> Should IndexWriter buffer KNN vectors on disk?
> ------------------------------------------------
>
> Key: LUCENE-10194
> URL: https://issues.apache.org/jira/browse/LUCENE-10194
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Mayya Sharipova
> Priority: Minor
>
> VectorValuesWriter buffers data in memory, like we do for all data structures that are computed on flush. But I wonder if this is the right trade-off.
> The use-case I have in mind is someone trying to load a dataset of vectors into Lucene. Given that HNSW graphs are super expensive to create, we'd ideally load that dataset into a single segment rather than into many small segments that then need to be merged together, which in turn re-creates the HNSW graph.
> Yet buffering vectors in memory is expensive. For instance, assuming 256 dimensions, each vector consumes 1kB of memory. Should we consider buffering vectors on disk to reduce the chances of having to create new segments only because the RAM buffer is full?
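A quick back-of-the-envelope check of the memory figure cited in the issue; the 10M-vector dataset size is an assumed example, not from the issue:

```java
public class VectorRamEstimate {
  public static void main(String[] args) {
    int dims = 256;                                  // from the issue
    long bytesPerVector = (long) dims * Float.BYTES; // 256 * 4 = 1024 bytes, i.e. ~1 kB
    long numVectors = 10_000_000L;                   // assumed dataset size
    // ~9765 MB of RAM buffer just for the raw floats, before HNSW overhead.
    System.out.println(numVectors * bytesPerVector / (1 << 20) + " MB");
  }
}
```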
[GitHub] [lucene] gf2121 merged pull request #707: LUCENE-10417: Revert LUCENE-10315 (backport 9x)
gf2121 merged pull request #707: URL: https://github.com/apache/lucene/pull/707
[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497268#comment-17497268 ]

ASF subversion and git services commented on LUCENE-10315:
--

Commit ad48203b557d3250f6975d097a5af331db0ee3cd in lucene's branch refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad48203 ]

LUCENE-10417: Revert "LUCENE-10315" (#706) (#707)

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> --------------------------------------------------------
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
[jira] [Commented] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark
[ https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497267#comment-17497267 ]

ASF subversion and git services commented on LUCENE-10417:
--

Commit ad48203b557d3250f6975d097a5af331db0ee3cd in lucene's branch refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad48203 ]

LUCENE-10417: Revert "LUCENE-10315" (#706) (#707)

> IntNRQ task performance decreased in nightly benchmark
> --------------------------------------------------------
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/codecs
> Reporter: Feng Guo
> Assignee: Feng Guo
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315, I'll dig.
[GitHub] [lucene] LuXugang commented on pull request #705: LUCENE-10439: Support multi-valued and multiple dimensions for count query in PointRangeQuery
LuXugang commented on pull request #705: URL: https://github.com/apache/lucene/pull/705#issuecomment-1049639011

> We need two cases:
>
> * Checking whether all documents match and returning values.getDocCount(). This works when there are no deletions.
> * Actually counting the number of matching points. This only works when there are no deletions and the field is single-valued (docCount == size), plus we only want to apply it in the 1D case, since that is the only case where we have the guarantee that it will actually run fast, as there are at most 2 crossing leaves.

Thanks @jpountz, now I fully understand your thought.
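A hedged sketch of the two fast-count cases quoted above; `matchesEverything` and `countMatchingPoints` are hypothetical stand-ins for the real range comparisons, and this mirrors the shape of the idea rather than the committed PointRangeQuery code:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;

abstract class RangeCountSketch {
  final String field;
  final int numDims;

  RangeCountSketch(String field, int numDims) {
    this.field = field;
    this.numDims = numDims;
  }

  // Hypothetical helpers standing in for the real min/max packed-value checks.
  abstract boolean matchesEverything(PointValues values) throws IOException;
  abstract long countMatchingPoints(PointValues values) throws IOException;

  int count(LeafReaderContext context) throws IOException {
    LeafReader reader = context.reader();
    PointValues values = reader.getPointValues(field);
    if (values != null && reader.hasDeletions() == false) {
      // Case 1: every document in the segment falls inside the range.
      if (matchesEverything(values)) {
        return values.getDocCount();
      }
      // Case 2: single-valued (docCount == size) and 1D, so counting points
      // counts docs, and at most 2 leaves cross the range boundaries.
      if (values.getDocCount() == values.size() && numDims == 1) {
        return Math.toIntExact(countMatchingPoints(values));
      }
    }
    return -1; // unknown: caller falls back to normal evaluation
  }
}
```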
[GitHub] [lucene] jpountz merged pull request #705: LUCENE-10439: Support multi-valued and multiple dimensions for count query in PointRangeQuery
jpountz merged pull request #705: URL: https://github.com/apache/lucene/pull/705
[jira] [Commented] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery
[ https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497272#comment-17497272 ]

ASF subversion and git services commented on LUCENE-10439:
--

Commit 550d1305db71b33f988484fe58de1f754283562d in lucene's branch refs/heads/main from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=550d130 ]

LUCENE-10439: Support multi-valued and multiple dimensions for count query in PointRangeQuery (#705)

> Support multi-valued and multiple dimensions for count query in PointRangeQuery
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Lu Xugang
> Priority: Trivial
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424: make the count query also work with fields that have multiple dimensions and/or that are multi-valued.
[jira] [Commented] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery
[ https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497273#comment-17497273 ]

ASF subversion and git services commented on LUCENE-10439:
--

Commit 6acf16a2e3427179614f99e159dec16f63b4dfc4 in lucene's branch refs/heads/branch_9x from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6acf16a ]

LUCENE-10439: Support multi-valued and multiple dimensions for count query in PointRangeQuery (#705)

> Support multi-valued and multiple dimensions for count query in PointRangeQuery
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Lu Xugang
> Priority: Trivial
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424: make the count query also work with fields that have multiple dimensions and/or that are multi-valued.
[jira] [Commented] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark
[ https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497280#comment-17497280 ]

Adrien Grand commented on LUCENE-10417:
---

FYI Elasticsearch was upgraded to a recent Lucene snapshot 2 days ago, and we're seeing some ranges that may be slower but also other ranges that seem to be faster. See e.g. {{nightly-http_logs-4g-200s-in-range-latency}} at https://elasticsearch-benchmarks.elastic.co/#tracks/http-logs/nightly/default/30d.

> IntNRQ task performance decreased in nightly benchmark
> --------------------------------------------------------
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/codecs
> Reporter: Feng Guo
> Assignee: Feng Guo
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315, I'll dig.
[GitHub] [lucene] jpountz opened a new pull request #708: LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints.
jpountz opened a new pull request #708: URL: https://github.com/apache/lucene/pull/708

Since doc IDs with a vector are loaded as an int[] in memory, this changes the on-disk format of vectors to align with the in-memory representation by using ints instead of vints to represent doc IDs. This might make vectors a bit larger on disk, but also a bit faster to open.

I made the same change to how we encode nodes on levels for the same reason.
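A minimal sketch of the encoding change this PR describes, assuming doc IDs arrive as a plain int[]; the class and method names here are illustrative, not the actual Lucene91HnswVectorsWriter code:

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

final class IntDocIdWriter {
  // Fixed-width encoding: exactly 4 bytes per docID, so the on-disk layout
  // matches the int[] the reader wants to load.
  static void writeDocIds(IndexOutput out, int[] docIds, int count) throws IOException {
    for (int i = 0; i < count; i++) {
      out.writeInt(docIds[i]); // previously a vint: 1-5 bytes per value
    }
  }
}
```

The trade-off stated in the PR follows directly: small doc IDs that fit in one or two vint bytes now always take four bytes, but the reader no longer has to decode anything.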
[GitHub] [lucene] jpountz merged pull request #708: LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints.
jpountz merged pull request #708: URL: https://github.com/apache/lucene/pull/708
[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
[ https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497367#comment-17497367 ]

ASF subversion and git services commented on LUCENE-10408:
--

Commit 44d7d962ae42cfca7070a8e2c84ab059fec21e10 in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=44d7d96 ]

LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints. (#708)

Since doc IDs with a vector are loaded as an int[] in memory, this changes the on-disk format of vectors to align with the in-memory representation by using ints instead of vints to represent doc IDs. This might make vectors a bit larger on disk, but also a bit faster to open. I made the same change to how we encode nodes on levels for the same reason.

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> ---------------------------------------------------------------
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mayya Sharipova
> Assignee: Mayya Sharipova
> Priority: Minor
> Fix For: 9.1
>
> Time Spent: 5.5h
> Remaining Estimate: 0h
>
> Currently we write doc Ids of all documents that have vectors as is. We should improve their encoding either using delta encoding or bitset.
[GitHub] [lucene] jpountz merged pull request #702: LUCENE-10382: Use `IndexReaderContext#id` to check reader identity.
jpountz merged pull request #702: URL: https://github.com/apache/lucene/pull/702
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497369#comment-17497369 ]

ASF subversion and git services commented on LUCENE-10382:
--

Commit d47ff38d703c6b5da1ef9c774ccda201fd682b8d in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d47ff38 ]

LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)

`KnnVectorQuery` currently uses the index reader's hashcode to make sure that the query it builds runs on the right reader. We had added `IndexReaderContext#id` a while back for a similar purpose with `TermStates`, let's reuse it?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -----------------------------------------------------------
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.0
> Reporter: Joel Bernstein
> Priority: Major
> Time Spent: 7h 40m
> Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs. This ticket will change the interface to make it possible for the top K vectors to be selected from a subset of the live docs.
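A sketch of the identity check described in the commit message, with assumed field names and a hypothetical `doRewrite` helper; `IndexReaderContext#id()` returns an opaque object that is compared by reference, unlike a hashCode, which may collide across distinct readers:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;

abstract class ReaderBoundRewrite {
  private Object lastContextId; // illustrative field names
  private Query lastRewritten;

  // Hypothetical expensive per-reader rewrite (e.g. running the KNN search).
  abstract Query doRewrite(IndexReader reader) throws IOException;

  Query rewriteOnce(IndexReader reader) throws IOException {
    Object id = reader.getContext().id(); // opaque identity token
    if (lastRewritten == null || lastContextId != id) { // reference comparison
      lastRewritten = doRewrite(reader);
      lastContextId = id;
    }
    return lastRewritten;
  }
}
```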
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497373#comment-17497373 ]

ASF subversion and git services commented on LUCENE-10382:
--

Commit d952b3a58114ce5a929211bca7a9b0e822658f35 in lucene's branch refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d952b3a ]

LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)

`KnnVectorQuery` currently uses the index reader's hashcode to make sure that the query it builds runs on the right reader. We had added `IndexReaderContext#id` a while back for a similar purpose with `TermStates`, let's reuse it?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -----------------------------------------------------------
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.0
> Reporter: Joel Bernstein
> Priority: Major
> Time Spent: 7h 50m
> Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs. This ticket will change the interface to make it possible for the top K vectors to be selected from a subset of the live docs.
[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
[ https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497372#comment-17497372 ]

ASF subversion and git services commented on LUCENE-10408:
--

Commit d4cb6d0a307be42b8d3498d4363a68eec5947f15 in lucene's branch refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d4cb6d0 ]

LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints. (#708)

Since doc IDs with a vector are loaded as an int[] in memory, this changes the on-disk format of vectors to align with the in-memory representation by using ints instead of vints to represent doc IDs. This might make vectors a bit larger on disk, but also a bit faster to open. I made the same change to how we encode nodes on levels for the same reason.

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> ---------------------------------------------------------------
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mayya Sharipova
> Assignee: Mayya Sharipova
> Priority: Minor
> Fix For: 9.1
>
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> Currently we write doc Ids of all documents that have vectors as is. We should improve their encoding either using delta encoding or bitset.
[jira] [Resolved] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery
[ https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-10439.
---
Fix Version/s: 9.1
Resolution: Fixed

> Support multi-valued and multiple dimensions for count query in PointRangeQuery
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Lu Xugang
> Priority: Trivial
> Fix For: 9.1
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424: make the count query also work with fields that have multiple dimensions and/or that are multi-valued.
[jira] [Reopened] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand reopened LUCENE-10315:
---

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> --------------------------------------------------------
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
[jira] [Updated] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil
[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-10315:
--
Fix Version/s: (was: 9.1)

> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> --------------------------------------------------------
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
[GitHub] [lucene] jpountz commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms
jpountz commented on pull request #692: URL: https://github.com/apache/lucene/pull/692#issuecomment-1049857125

@rmuir We can remove the cost estimation, but it will not address the problem. I'll try to explain the problem differently in case it helps.

DocIdSetBuilder takes doc IDs in random order, with potential duplicates, and creates a DocIdSet that iterates over doc IDs in order without any duplicates. If you index a multi-valued field with points, a very large segment that has 2^30 docs might have 2^32 points matching a range query, which translates into 2^29 documents matching the query. So `DocIdSetBuilder#add` would be called 2^32 times, and `DocIdSetBuilder#build` would produce a `DocIdSet` that has 2^29 documents. This `long` measures the number of calls to `DocIdSetBuilder#add`, hence the `long`.

The naming may be wrong here, as the `grow` name probably suggests a number of docs rather than a number of calls to `add`, similarly to how `ArrayUtil#grow` is about the number of items in the array - not the number of times you set an index. Hopefully renaming it to `prepareAdd(long numCallsToAdd)` or something along these lines would help clarify.
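A small usage sketch of the semantics jpountz describes; variable names are illustrative, and note that `grow` takes an `int` in the released API, which is exactly what this thread is debating:

```java
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.DocIdSetBuilder;

final class BuilderUsage {
  // add() may be called more times than there are distinct docs; build() dedups.
  static DocIdSet collect(int maxDoc, int expectedAddCalls, int[] hitsWithDuplicates) {
    DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
    // Reserves capacity for add() calls, not for distinct documents.
    DocIdSetBuilder.BulkAdder adder = builder.grow(expectedAddCalls);
    for (int doc : hitsWithDuplicates) { // unordered, possibly duplicated docIDs
      adder.add(doc);
    }
    return builder.build(); // in-order iteration, duplicates removed
  }
}
```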
[jira] [Commented] (LUCENE-10438) Leverage Weight#count in lucene/facets
[ https://issues.apache.org/jira/browse/LUCENE-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497394#comment-17497394 ]

Adrien Grand commented on LUCENE-10438:
---

Solr indeed has a version of faceting that does this. I haven't looked at the details for a long time, but I remember that it would run facets on low-cardinality fields by intersecting postings with the bitset produced by the query.

> Leverage Weight#count in lucene/facets
> ----------------------------------------
>
> Key: LUCENE-10438
> URL: https://issues.apache.org/jira/browse/LUCENE-10438
> Project: Lucene - Core
> Issue Type: Task
> Components: modules/facet
> Reporter: Adrien Grand
> Assignee: Greg Miller
> Priority: Minor
>
> The facet module could leverage Weight#count in order to give fast counts for the browsing use-case?
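A hedged sketch of how the facet module might leverage Weight#count for browse-style counting; the names here are illustrative, and a return of -1 from `Weight#count` means the segment cannot be counted cheaply:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Weight;

final class FacetCountSketch {
  // Sum per-segment exact counts for "base AND dim:value"; -1 signals that
  // some segment could not be counted cheaply, so collect normally instead.
  static long countFacetValue(IndexSearcher searcher, Query base, String dim, String value)
      throws IOException {
    Query q = new BooleanQuery.Builder()
        .add(base, BooleanClause.Occur.MUST)
        .add(new TermQuery(new Term(dim, value)), BooleanClause.Occur.FILTER)
        .build();
    Weight w = searcher.createWeight(searcher.rewrite(q), ScoreMode.COMPLETE_NO_SCORES, 1f);
    long total = 0;
    for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
      int c = w.count(ctx);
      if (c == -1) {
        return -1; // fall back to collector-based faceting
      }
      total += c;
    }
    return total;
  }
}
```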
[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms
rmuir commented on pull request #692: URL: https://github.com/apache/lucene/pull/692#issuecomment-1049869937

> @rmuir We can remove the cost estimation, but it will not address the problem. I'll try to explain the problem differently in case it helps.

I really think it will address the problem. I understand what is happening, but adding 32 more bits that merely get discarded will not help anything, and that's what is being discussed here. It really is all about cost estimation, as that is the ONLY thing in this PR actually using the 32 extra bits. That's why I propose to simply use a different cost estimation instead. The current cost estimation explodes the complexity of this class: that's why we are tracking:

* `boolean multiValued`
* `double numValuesPerDoc`
* `long counter`

There's no need (from the allocation perspective, which is all we should be concerned about here) to know about any numbers bigger than `Integer.MAX_VALUE`; if we get anywhere near numbers that big, we should be switching over to the `FixedBitSet` representation.
[jira] [Commented] (LUCENE-10432) Add optional 'name' property to org.apache.lucene.search.Explanation
[ https://issues.apache.org/jira/browse/LUCENE-10432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497399#comment-17497399 ]

Adrien Grand commented on LUCENE-10432:
---

[~reta] I wonder if you have thought about how queries would know what name they should return in their explanations. My expectation is that we'd be introducing some form of query wrapper whose only point would be to set a name or tags in the produced explanations. Then I worry that it would make some things more complicated for Lucene, like query rewriting, which relies on instanceof checks, or query caching, which would consider the same queries with different names as different. Overall it looks to me like the benefits this brings would not be worth the problems it would introduce.

> Add optional 'name' property to org.apache.lucene.search.Explanation
> ----------------------------------------------------------------------
>
> Key: LUCENE-10432
> URL: https://issues.apache.org/jira/browse/LUCENE-10432
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.0, 8.10.1
> Reporter: Andriy Redko
> Priority: Minor
>
> Right now, the `Explanation` class has the `description` property, which is used pretty much as a placeholder for a free-style, human-readable summary of what is happening. This is totally fine, but it would be great to have a more formal way to link the explanation with the corresponding function / query / filter if supported by the underlying engine.
> Example: Opensearch / Elasticsearch has the concept of named queries / filters [1]. This is not supported by Apache Lucene at the moment, but it would be helpful to propagate this information back as part of the Explanation tree, for example by introducing an optional 'name' property:
>
> {noformat}
> {
>   "value": 0.0,
>   "description": "script score function, computed with script: ...",
>   "name": "script1",
>   "details": [
>     {
>       "value": 1.0,
>       "description": "_score: ",
>       "details": [
>         {
>           "value": 1.0,
>           "description": "*:*",
>           "details": []
>         }
>       ]
>     }
>   ]
> }{noformat}
>
> From the other side, the `name` property may look like it doesn't belong here; the alternative suggestion would be to add support for a `properties` / `parameters` / `tags` key/value bag, for example:
>
> {noformat}
> {
>   "value": 0.0,
>   "description": "script score function, computed with script: ...",
>   "tags": [
>     { "name": "script1" }
>   ],
>   "details": [
>     {
>       "value": 1.0,
>       "description": "_score: ",
>       "details": [
>         {
>           "value": 1.0,
>           "description": "*:*",
>           "details": []
>         }
>       ]
>     }
>   ]
> }{noformat}
> The change should be non-breaking but quite useful for engines to enrich the `Explanation` with additional context.
> [1] https://www.elastic.co/guide/en/elasticsearch/reference/7.16/query-dsl-bool-query.html#named-queries
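For concreteness, a hypothetical wrapper of the sort Adrien describes (not a real Lucene class); it shows why naming-by-wrapping is awkward: equal inner queries with different names compare unequal, which defeats the query cache, and instanceof-based rewrites no longer see the inner query's type:

```java
import java.io.IOException;
import java.util.Objects;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

final class NamedQuery extends Query {
  private final Query in;
  private final String name;

  NamedQuery(Query in, String name) {
    this.in = Objects.requireNonNull(in);
    this.name = Objects.requireNonNull(name);
  }

  @Override
  public Query rewrite(IndexReader reader) throws IOException {
    Query rewritten = in.rewrite(reader);
    // Re-wrapping hides the rewritten query's concrete type from callers
    // that rely on instanceof checks.
    return rewritten == in ? this : new NamedQuery(rewritten, name);
  }

  @Override
  public void visit(QueryVisitor visitor) {
    in.visit(visitor);
  }

  @Override
  public String toString(String field) {
    return name + ":(" + in.toString(field) + ")";
  }

  @Override
  public boolean equals(Object other) {
    return sameClassAs(other)
        && in.equals(((NamedQuery) other).in)
        // Same inner query, different name => unequal => separate cache entries.
        && name.equals(((NamedQuery) other).name);
  }

  @Override
  public int hashCode() {
    return Objects.hash(classHash(), in, name);
  }
}
```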
[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()
[ https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497402#comment-17497402 ]

Adrien Grand commented on LUCENE-10431:
---

I've been staring at the code and at this stack trace for the past 15 minutes, but I cannot think of a way hashCode() could be called before BooleanQuery is fully constructed. Can you share more information about how this query gets constructed?

> AssertionError in BooleanQuery.hashCode()
> -------------------------------------------
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 8.11.1
> Reporter: Michael Bien
> Priority: Major
>
> Hello devs,
> under some circumstances the constructor of BooleanQuery can trigger a hash code computation before "clauseSets" is fully filled. Since BooleanClause uses its query field for the hash code too, it can happen that the "wrong" hash code is stored, since adding the clause to the set triggers its hashCode().
> If assertions are enabled, the check in BooleanQuery, which recomputes the hash code, will notice this and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>   at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>   at java.base/java.util.Objects.hashCode(Objects.java:103)
>   at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>   at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>   at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>   at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>   at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>   at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>   at java.base/java.util.Objects.hash(Objects.java:133)
>   at org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>   at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>   at java.base/java.util.HashMap.hash(HashMap.java:340)
>   at java.base/java.util.HashMap.put(HashMap.java:612)
>   at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>   at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:154)
>   at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:42)
>   at org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules from lucene 5.x to 8.x: https://github.com/apache/netbeans/pull/3558
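A generic Java illustration (not Lucene code) of the hazard the report describes: if an element's hashCode depends on state that settles only after the element has been inserted into a hash-based collection, the recorded bucket goes stale:

```java
import java.util.HashSet;
import java.util.Set;

public class StaleHash {
  static class Box {
    int value;

    @Override
    public int hashCode() {
      return value; // hash depends on mutable state
    }
  }

  public static void main(String[] args) {
    Set<Box> set = new HashSet<>();
    Box b = new Box();
    set.add(b);       // the set records the bucket for hashCode() == 0 here
    b.value = 42;     // mutation after insertion changes the hash
    System.out.println(set.contains(b)); // false: lookup probes the wrong bucket
  }
}
```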
[GitHub] [lucene] iverase commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms
iverase commented on pull request #692: URL: https://github.com/apache/lucene/pull/692#issuecomment-1049887302

The 32 bits will need to be discarded anyway; the question is where. You either do it at the PointValues level, by calling grow like:

```
visitor.grow((int) Math.min(getDocCount(), pointTree.size()));
```

Or you discard them in the DocIdSetBuilder and allow grow to be called just like:

```
visitor.grow(pointTree.size());
```
[jira] [Commented] (LUCENE-10432) Add optional 'name' property to org.apache.lucene.search.Explanation
[ https://issues.apache.org/jira/browse/LUCENE-10432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497405#comment-17497405 ]

Andriy Redko commented on LUCENE-10432:
---

Thanks [~jpountz]

> I wonder if you have thought about how queries would know what name they should return in their explanations. My expectation is that we'd be introducing some form of query wrapper whose point would only be to be able to set a name or tags in the produced explanations.

That could be an option, but I would suggest not changing the queries. For example, in most cases Opensearch / Elasticsearch queries / filters / functions are wrapped into composites, and the names and other attributes are stored there.

> Then I worry that it would make some things more complicated for Lucene like query rewriting, which relies on instanceof checks, or query caching, which would consider the same queries with different names as different.

Certainly, that would apply if we changed the queries, but we don't need to: the ability to pass structured (key/value) details into the Explanation would help the engines propagate the internal context back.

Maybe a more illustrative example of the end-to-end flow. The request:

{noformat}
{
  "explain": true,
  "query": {
    "function_score": {
      "query": {
        "match_all": { "_name": "q1" }
      },
      "functions": [
        {
          "filter": { "terms": { "_name": "terms_filter", "abc": [ "1" ] } },
          "weight": 35
        }
      ],
      "boost_mode": "replace",
      "score_mode": "sum",
      "min_score": 0
    }
  }
}{noformat}

And this is how we return that back inside the explanation description ({*}"description" : "match filter(_name: terms_filter): abc:\{1}"{*}):

{noformat}
{
  "value": 35.0,
  "description": "function score, product of:",
  "details": [
    {
      "value": 1.0,
      "description": "match filter(_name: terms_filter): abc:{1}",
      "details": []
    },
    {
      "value": 35.0,
      "description": "product of:",
      "details": [
        {
          "value": 1.0,
          "description": "constant score 1.0 - no function provided",
          "details": []
        },
        {
          "value": 35.0,
          "description": "weight",
          "details": []
        }
      ]
    }
  ]
}{noformat}

Does it address your concerns? Thanks a lot for taking a look!

> Add optional 'name' property to org.apache.lucene.search.Explanation
> ----------------------------------------------------------------------
>
> Key: LUCENE-10432
> URL: https://issues.apache.org/jira/browse/LUCENE-10432
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Andriy Redko
> Priority: Minor
[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms
rmuir commented on pull request #692: URL: https://github.com/apache/lucene/pull/692#issuecomment-1049894634

If this is literally all about a "style" issue, then let's be open and honest about that. I am fine with:

```
/** sugar: to just make code look pretty, nothing else */
public BulkAdder grow(long numDocs) {
  return grow((int) Math.min(Integer.MAX_VALUE, numDocs));
}
```

But I think it is wrong to have constructors taking `Terms` and `PointValues` already: it is just more useless complexity and "leaky abstraction" from the terrible cost estimation. And I definitely think having two separate classes just for the cost estimation is way too much.
[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms
rmuir commented on pull request #692: URL: https://github.com/apache/lucene/pull/692#issuecomment-1049905030

To try to be more helpful, here's what I'd propose. I can try to hack up a draft PR later if it is helpful.

DocIdSetBuilder, remove complex cost estimation:
* remove `DocIdSetBuilder(int, Terms)` constructor
* remove `DocIdSetBuilder(int, PointValues)` constructor
* remove `DocIdSetBuilder.counter` member
* remove `DocIdSetBuilder.multiValued` member
* remove `DocIdSetBuilder.numValuesPerDoc` member

DocIdSetBuilder: add sugar `grow(long)` for style purposes:

```
/** sugar: to just make code look pretty, nothing else */
public BulkAdder grow(long numDocs) {
  return grow((int) Math.min(Integer.MAX_VALUE, numDocs));
}
```

FixedBitSet: implement `approximateCardinality()` and simply use it when estimating cost() here.
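One plausible shape for that `FixedBitSet#approximateCardinality()` (an assumption, not a committed implementation): sample a subset of the backing 64-bit words and extrapolate the popcount:

```java
final class ApproxCardinality {
  // Estimate the bitset's popcount by sampling every step-th backing word
  // and scaling up; exact whenever there are <= 1024 words.
  static long approximateCardinality(long[] bits, int numWords) {
    if (numWords == 0) {
      return 0;
    }
    final int step = Math.max(1, numWords / 1024);
    long sampledBits = 0;
    int sampledWords = 0;
    for (int i = 0; i < numWords; i += step) {
      sampledBits += Long.bitCount(bits[i]);
      sampledWords++;
    }
    return sampledBits * numWords / sampledWords;
  }
}
```

The appeal for cost() is that it is O(numWords / step) rather than O(numWords), which matters when cost may be queried on very large segments.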
[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop
[ https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497422#comment-17497422 ] Adrien Grand commented on LUCENE-10428: --- Ouch this is bad. Note that in your code snippet, `minScoreSum` should be a float - not a double - to replicate what MaxScoreSumPropagator does. By any chance, were you able to see what is the number of clauses of this query? > getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge > leading to busy threads in infinite loop > - > > Key: LUCENE-10428 > URL: https://issues.apache.org/jira/browse/LUCENE-10428 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring, core/search >Reporter: Ankit Jain >Priority: Major > Attachments: Flame_graph.png > > > Customers complained about high CPU for Elasticsearch cluster in production. > We noticed that few search requests were stuck for long time > {code:java} > % curl -s localhost:9200/_cat/tasks?v > indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 > AmMLzDQ4RrOJievRDeGFZw:569204 direct1645195007282 14:36:47 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 > emjWc5bUTG6lgnCGLulq-Q:502074 direct1645195037259 14:37:17 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 > emjWc5bUTG6lgnCGLulq-Q:583269 direct1645201316981 16:21:56 4.5h > {code} > Flame graphs indicated that CPU time is mostly going into > *getMinCompetitiveScore method in MaxScoreSumPropagator*. After doing some > live JVM debugging found that > org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had > around 4 million invocations every second > Figured out the values of some parameters from live debugging: > {code:java} > minScoreSum = 3.5541441 > minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = > 3.554144322872162 > returnObj scoreSumUpperBound = 3.5541444 > Math.ulp(minScoreSum) = 2.3841858E-7 > {code} > Example code snippet: > {code:java} > double sumOfOtherMaxScores = 3.554144322872162; > double minScoreSum = 3.5541441; > float minScore = (float) (minScoreSum - sumOfOtherMaxScores); > while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) { > minScore -= Math.ulp(minScoreSum); > System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum)); > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
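For illustration, here is the snippet adjusted so that `minScoreSum` is a float, per the note above. `scoreSumUpperBound` is private to MaxScoreSumPropagator, so a hypothetical stand-in is used here; this sketch shows the shape of the loop but may not reproduce the exact non-convergence:
```java
// Sketch only: scoreSumUpperBound is a hypothetical stand-in, not the real
// MaxScoreSumPropagator formula (which bounds accumulated float rounding error).
class ReproSketch {
  static float scoreSumUpperBound(double sum) {
    return Math.nextUp((float) sum); // illustrative upper bound
  }

  public static void main(String[] args) {
    double sumOfOtherMaxScores = 3.554144322872162;
    float minScoreSum = 3.5541441f; // float, matching MaxScoreSumPropagator
    float minScore = (float) (minScoreSum - sumOfOtherMaxScores);
    while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
      minScore -= Math.ulp(minScoreSum);
      System.out.printf("%.20f, %.20f%n", minScore, Math.ulp(minScoreSum));
    }
  }
}
```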
[GitHub] [lucene] rmuir opened a new pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
rmuir opened a new pull request #709: URL: https://github.com/apache/lucene/pull/709 Cost estimation drives the API complexity out of control; we don't need it. Hopefully I've cleared up all the API damage from this explosive leak. Instead, FixedBitSet.approximateCardinality() is used for cost estimation. TODO: let's optimize that! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
rmuir commented on pull request #709: URL: https://github.com/apache/lucene/pull/709#issuecomment-1049948027 Here's a first stab at what I proposed on https://github.com/apache/lucene/pull/692 You can see how damaging the current cost() implementation is. As followup commits we can add the `grow(long)` sugar that simply truncates. And we should optimize `FixedBitSet.approximateCardinality()`. After doing that, we should look around and see if there is any other similar damage to our APIs related to the fact that FixedBitSet had a slow `approximateCardinality`, and fix those too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms
rmuir commented on pull request #692: URL: https://github.com/apache/lucene/pull/692#issuecomment-1049948208 prototype: https://github.com/apache/lucene/pull/709 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
jpountz commented on pull request #709: URL: https://github.com/apache/lucene/pull/709#issuecomment-1049959940 That change makes sense to me. FWIW my recollection from profiling DocIdSetBuilder is that the deduplication logic is cheap and most of the time is spent in `LSBRadixSorter#reorder` so it's ok to always deduplicate. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10432) Add optional 'name' property to org.apache.lucene.search.Explanation
[ https://issues.apache.org/jira/browse/LUCENE-10432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497468#comment-17497468 ] Adrien Grand commented on LUCENE-10432: --- The bit I'm missing is how you would let Lucene know about the query name when calling {{Weight#explain}}? > Add optional 'name' property to org.apache.lucene.search.Explanation > - > > Key: LUCENE-10432 > URL: https://issues.apache.org/jira/browse/LUCENE-10432 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0, 8.10.1 >Reporter: Andriy Redko >Priority: Minor > > Right now, the `Explanation` class has the `description` property which is > used pretty much as placeholder for free-style, human readable summary of > what is happening. This is totally fine but it would be great to have a bit > more formal way to link the explanation with corresponding function / query / > filter if supported by the underlying engine. > Example: Opensearch / Elasticseach has the concept of named queries / filters > [1]. This is not supported by Apache Lucene at the moment but it would be > helpful to propagate this information back as part of Explanation tree, for > example by introducing optional 'name' property: > > {noformat} > { > "value": 0.0, > "description": "script score function, computed with script: ...", > > "name": "script1", > "details": [ > { > "value": 1.0, > "description": "_score: ", > "details": [ > { > "value": 1.0, > "description": "*:*", > "details": [] >} > ] > } > ] > }{noformat} > > From the other side, the `name` property may look like not belonging here, > the alternative suggestion would be to add support of `properties` / > `parameters` / `tags` key/value bag, for example: > > {noformat} > { > "value": 0.0, > "description": "script score function, computed with script: ...", > > "tags": [ >{ "name": "script1" } > ], > "details": [ > { > "value": 1.0, > "description": "_score: ", > "details": [ > { > "value": 1.0, > "description": "*:*", > "details": [] >} > ] > } > ] > }{noformat} > The change should be non-breaking but quite useful for engines to enrich the > `Explanation` with additional context. > [1] > https://www.elastic.co/guide/en/elasticsearch/reference/7.16/query-dsl-bool-query.html#named-queries > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
rmuir commented on pull request #709: URL: https://github.com/apache/lucene/pull/709#issuecomment-1049967927 If we want to add the `grow(long)` sugar method that simply truncates to `Integer.MAX_VALUE` and clean up all the points callsites, or write a cool FixedBitSet.approximateCardinality, just feel free to push commits here. Otherwise I will get to these two things later and remove the draft status on the PR. Adding the sugar method is easy; it is just work. Implementing the approximateCardinality requires some thought and probably some benchmarking. I had in mind to just "sample" some "chunks" of the long[] and sum up `Long.bitCount` across the ranges. In an upcoming JDK this method will get vectorized; let's take advantage of that, so that both `cardinality()` and `approximateCardinality` get faster: https://github.com/openjdk/jdk/pull/6857 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
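A rough sketch of that sampling idea, with made-up chunk and stride sizes (an illustration, not the eventual implementation):
```java
// Estimate cardinality by pop-counting a fixed-size chunk of words at a fixed
// stride and scaling the result up. The chunk/stride values are illustrative.
static long approximateCardinality(long[] bits) {
  final int chunk = 16, stride = 1024;
  if (bits.length < stride) { // small bitsets: just count exactly
    long exact = 0;
    for (long word : bits) {
      exact += Long.bitCount(word);
    }
    return exact;
  }
  long sampled = 0;
  int sampledWords = 0;
  for (int i = 0; i + chunk <= bits.length; i += stride) {
    for (int j = i; j < i + chunk; j++) {
      sampled += Long.bitCount(bits[j]);
    }
    sampledWords += chunk;
  }
  return sampled * bits.length / sampledWords; // scale the sample back up
}
```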
[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
iverase commented on a change in pull request #709: URL: https://github.com/apache/lucene/pull/709#discussion_r813988648 ## File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java ## @@ -266,20 +224,12 @@ private void upgradeToBitSet() { public DocIdSet build() { try { if (bitSet != null) { -assert counter >= 0; -final long cost = Math.round(counter / numValuesPerDoc); -return new BitDocIdSet(bitSet, cost); +return new BitDocIdSet(bitSet); } else { Buffer concatenated = concat(buffers); LSBRadixSorter sorter = new LSBRadixSorter(); sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length); -final int l; -if (multivalued) { - l = dedup(concatenated.array, concatenated.length); Review comment: Do we really want to throw away this optimisation? We normally know whether our data is single- or multi-valued, so it seems wasteful not to exploit it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
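For context, the dedup pass being removed over a sorted int[] is a simple linear scan, along these lines (a sketch of the idea, not the actual Lucene helper):
```java
// Collapse adjacent duplicates in the sorted prefix docs[0..length) in place
// and return the new logical length.
static int dedup(int[] docs, int length) {
  if (length == 0) {
    return 0;
  }
  int out = 1;
  for (int i = 1; i < length; i++) {
    if (docs[i] != docs[out - 1]) {
      docs[out++] = docs[i];
    }
  }
  return out;
}
```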
[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
iverase commented on a change in pull request #709: URL: https://github.com/apache/lucene/pull/709#discussion_r813994000 ## File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java ## @@ -266,20 +224,12 @@ private void upgradeToBitSet() { public DocIdSet build() { try { if (bitSet != null) { -assert counter >= 0; -final long cost = Math.round(counter / numValuesPerDoc); -return new BitDocIdSet(bitSet, cost); +return new BitDocIdSet(bitSet); Review comment: We still need to implement the approximateCardinality method, which is the hard bit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
iverase commented on pull request #709: URL: https://github.com/apache/lucene/pull/709#issuecomment-104552 I don't think this is necessary; we can always add it to the IntersectVisitor instead. Maybe it would be worth adjusting how we call grow() in BKDReader#addAll, as it does not need the dance it is currently doing: https://github.com/apache/lucene/blob/8c67a3816b9060fa983b494886cd4f789be1d868/lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java#L562 The same for SimpleTextBKDReader#addAll -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase edited a comment on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
iverase edited a comment on pull request #709: URL: https://github.com/apache/lucene/pull/709#issuecomment-104552 I don't think the grow(long) is necessary; we can always add it to the IntersectVisitor instead. Maybe it would be worth adjusting how we call grow() in BKDReader#addAll, as it does not need the dance it is currently doing: https://github.com/apache/lucene/blob/8c67a3816b9060fa983b494886cd4f789be1d868/lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java#L562 The same for SimpleTextBKDReader#addAll -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10432) Add optional 'name' property to org.apache.lucene.search.Explanation
[ https://issues.apache.org/jira/browse/LUCENE-10432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497499#comment-17497499 ] Andriy Redko commented on LUCENE-10432: --- Yeah, it may not cover everything 100% (like {{Weight#explain}}), but it is also not needed in every place. Probably a generic bag of contextual properties would be a less intrusive and more extensible way to propagate the name and other things (which are baked into the description now); just another example, for a random scoring function explanation:
{noformat}
{
  "value": 0.38554674,
  "description": "random score function (seed: 738562412, field: null, _name: func2)",
  "details": []
}{noformat}
> Add optional 'name' property to org.apache.lucene.search.Explanation > - > > Key: LUCENE-10432 > URL: https://issues.apache.org/jira/browse/LUCENE-10432 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0, 8.10.1 >Reporter: Andriy Redko >Priority: Minor > > Right now, the `Explanation` class has the `description` property which is > used pretty much as placeholder for free-style, human readable summary of > what is happening. This is totally fine but it would be great to have a bit > more formal way to link the explanation with corresponding function / query / > filter if supported by the underlying engine. > Example: Opensearch / Elasticseach has the concept of named queries / filters > [1]. This is not supported by Apache Lucene at the moment but it would be > helpful to propagate this information back as part of Explanation tree, for > example by introducing optional 'name' property: > > {noformat} > { > "value": 0.0, > "description": "script score function, computed with script: ...", > > "name": "script1", > "details": [ > { > "value": 1.0, > "description": "_score: ", > "details": [ > { > "value": 1.0, > "description": "*:*", > "details": [] >} > ] > } > ] > }{noformat} > > From the other side, the `name` property may look like not belonging here, > the alternative suggestion would be to add support of `properties` / > `parameters` / `tags` key/value bag, for example: > > {noformat} > { > "value": 0.0, > "description": "script score function, computed with script: ...", > > "tags": [ >{ "name": "script1" } > ], > "details": [ > { > "value": 1.0, > "description": "_score: ", > "details": [ > { > "value": 1.0, > "description": "*:*", > "details": [] >} > ] > } > ] > }{noformat} > The change should be non-breaking but quite useful for engines to enrich the > `Explanation` with additional context. > [1] > https://www.elastic.co/guide/en/elasticsearch/reference/7.16/query-dsl-bool-query.html#named-queries > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
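The "key/value bag" alternative might look like this on the Lucene side (a sketch of the idea only; this is not the actual Explanation API, and all names are hypothetical):
```java
import java.util.Collections;
import java.util.Map;

// An Explanation-like value object carrying an optional attribute bag next to
// the free-form description, so engines can attach e.g. {"name": "script1"}.
final class ExplanationSketch {
  private final float value;
  private final String description;
  private final Map<String, String> attributes;

  ExplanationSketch(float value, String description, Map<String, String> attributes) {
    this.value = value;
    this.description = description;
    this.attributes = Map.copyOf(attributes); // immutable defensive copy
  }

  ExplanationSketch(float value, String description) {
    this(value, description, Collections.emptyMap()); // name/tags are optional
  }

  Map<String, String> attributes() {
    return attributes;
  }
}
```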
[GitHub] [lucene] rmuir commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
rmuir commented on a change in pull request #709: URL: https://github.com/apache/lucene/pull/709#discussion_r814039139 ## File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java ## @@ -266,20 +224,12 @@ private void upgradeToBitSet() { public DocIdSet build() { try { if (bitSet != null) { -assert counter >= 0; -final long cost = Math.round(counter / numValuesPerDoc); -return new BitDocIdSet(bitSet, cost); +return new BitDocIdSet(bitSet); } else { Buffer concatenated = concat(buffers); LSBRadixSorter sorter = new LSBRadixSorter(); sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length); -final int l; -if (multivalued) { - l = dedup(concatenated.array, concatenated.length); Review comment: This optimization doesn't make sense to me. Buffers should only be used for tiny sets (they are very memory expensive). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
rmuir commented on a change in pull request #709: URL: https://github.com/apache/lucene/pull/709#discussion_r814040808 ## File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java ## @@ -266,20 +224,12 @@ private void upgradeToBitSet() { public DocIdSet build() { try { if (bitSet != null) { -assert counter >= 0; -final long cost = Math.round(counter / numValuesPerDoc); -return new BitDocIdSet(bitSet, cost); +return new BitDocIdSet(bitSet); Review comment: I don't think it is difficult, it just requires a little work. I can get to it soon, seems like it should be fun. Ultimately I think it will give us better estimations than what we have today, without all the tangled APIs and abstraction leakage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
iverase commented on a change in pull request #709: URL: https://github.com/apache/lucene/pull/709#discussion_r814045946 ## File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java ## @@ -266,20 +224,12 @@ private void upgradeToBitSet() { public DocIdSet build() { try { if (bitSet != null) { -assert counter >= 0; -final long cost = Math.round(counter / numValuesPerDoc); -return new BitDocIdSet(bitSet, cost); +return new BitDocIdSet(bitSet); } else { Buffer concatenated = concat(buffers); LSBRadixSorter sorter = new LSBRadixSorter(); sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length); -final int l; -if (multivalued) { - l = dedup(concatenated.array, concatenated.length); Review comment: Ok, I am convinced. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
iverase commented on a change in pull request #709: URL: https://github.com/apache/lucene/pull/709#discussion_r814047234 ## File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java ## @@ -266,20 +224,12 @@ private void upgradeToBitSet() { public DocIdSet build() { try { if (bitSet != null) { -assert counter >= 0; -final long cost = Math.round(counter / numValuesPerDoc); -return new BitDocIdSet(bitSet, cost); +return new BitDocIdSet(bitSet); Review comment: I like the idea of sampling, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new pull request #710: LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate).
jpountz opened a new pull request #710: URL: https://github.com/apache/lucene/pull/710 This computes a pop count on a sample of the longs that back the bitset. Quick benchmarks suggest that this runs 5x-10x faster than `FixedBitSet#cardinality` depending on the length of the bitset. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10427) OLAP likewise rollup during segment merge process
[ https://issues.apache.org/jira/browse/LUCENE-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497526#comment-17497526 ] Adrien Grand commented on LUCENE-10427: --- I know that the Elasticsearch team is looking into doing things like that, but on top of Lucene, by creating another index that has a different granularity instead of having different granularities within the same index and relying on background merges for rollups. At first sight, doing it within the same index feels a bit scary to me:
- different segments would have different granularities,
- merges would not only combine segments but also perform lossy compression,
- all file formats would need to be aware of rollups?
- numeric doc values would need to be able to store multiple fields under the hood (min, max, etc.)

What would you think about doing it on top of Lucene instead? E.g. similarly to how the faceting module maintains a side-car taxonomy index, maybe one could maintain a side-car rollup index to speed up aggregations? > OLAP likewise rollup during segment merge process > - > > Key: LUCENE-10427 > URL: https://issues.apache.org/jira/browse/LUCENE-10427 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Suhan Mao >Priority: Major > > Currently, many OLAP engines support a rollup feature, like > clickhouse(AggregateMergeTree)/druid. > Rollup definition: [https://athena.ecs.csus.edu/~mei/olap/OLAPoperations.php] > One of the ways to do rollup is to merge the same dimension buckets into one > and do sum()/min()/max() operations on metric fields during the segment > compact/merge process. This can significantly reduce the size of the data and > speed up the query a lot. > > *Abstraction of how to do it* > # Define the rollup logic: which fields are dimensions and which are metrics. > # Rollup definition for each metric field: max/min/sum ... > # Index sorting should be the same as the dimension fields. > # We will do the rollup calculation during segment merge just like other OLAP > engines do. > > *Assume the scenario* > We use ES to ingest realtime raw temperature data every minute from each > sensor device along with a lot of dimension information. User may want to query > the data like "what is the max temperature of some device within some/latest > hour" or "what is the max temperature of some city within some/latest hour" > In that way, we can define such fields and a rollup definition: > # event_hour(round to hour granularity) > # device_id(dimension) > # city_id(dimension) > # temperature(metrics, max/min rollup logic) > The raw data will periodically be rolled up to the hour granularity during > the segment merge process, which should ideally save 60x storage in the end. > > *How we do rollup in segment merge* > bucket: docs should belong to the same bucket if the dimension values are all > the same. > # For docvalues merge, we send the normal mappedDocId if we encounter a new > bucket in DocIDMerger. > # Since the index sorting fields are the same as the dimension fields, if we > encounter more docs in the same bucket, we emit a special mappedDocId from > DocIDMerger. > # In DocValuesConsumer.mergeNumericField, if we meet a special mappedDocId, we > do a rollup calculation on metric fields and fold the result value into the > first doc in the bucket. The calculation is just like a streaming merge-sort > rollup. > # We discard all the special mappedDocId docs because the metrics are already > folded into the first doc in the bucket. 
> # In the BKD/posting structure, we discard all the special mappedDocId docs and > only place the first doc id within a bucket in the BKD/posting data. It > should be simple. > > *How to define the logic* > > {code:java} > public class RollupMergeConfig { > private List<String> dimensionNames; > private List<RollupMergeAggregateField> aggregateFields; > } > public class RollupMergeAggregateField { > private String name; > private RollupMergeAggregateType aggregateType; > } > public enum RollupMergeAggregateType { > COUNT, > SUM, > MIN, > MAX, > CARDINALITY // if a data sketch is stored in binary doc values, we can do a > union logic > }{code} > > > I have written the initial code at a basic level. I can submit the complete > PR if you think this feature is good to try. > > > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
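A toy sketch of the streaming rollup described above: with index sorting on the dimension fields, docs arrive bucket by bucket, and consecutive docs in the same bucket fold their metric into the bucket's first doc. The Doc record and the MAX aggregate are simplified stand-ins, not the proposed Lucene API:
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RollupSketch {
  // Simplified doc: dimension values plus a single metric, pre-sorted by dims.
  record Doc(long[] dims, double metric) {}

  // Streaming pass: keep the first doc of each bucket, folding later metrics in.
  static List<Doc> rollupMax(List<Doc> sortedDocs) {
    List<Doc> out = new ArrayList<>();
    for (Doc doc : sortedDocs) {
      int last = out.size() - 1;
      if (last >= 0 && Arrays.equals(out.get(last).dims(), doc.dims())) {
        // Same bucket: fold with the MAX rollup logic.
        out.set(last, new Doc(doc.dims(), Math.max(out.get(last).metric(), doc.metric())));
      } else {
        out.add(doc); // new bucket starts here
      }
    }
    return out;
  }
}
```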
[GitHub] [lucene] rmuir commented on pull request #710: LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate).
rmuir commented on pull request #710: URL: https://github.com/apache/lucene/pull/710#issuecomment-1050050397 Since we made the method `abstract`, let's just have it forward to exact-cardinality for the `JavaUtilBitSet` used in the unit tests? It should fix the test issues. I agree with making the method abstract too. I think it is a better choice for performance-sensitive, lower-level classes like this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop
[ https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497553#comment-17497553 ] Ankit Jain commented on LUCENE-10428: - {quote}By any chance, were you able to see what is the number of clauses of this query?{quote} [~jpountz] - I did check the invocation of sumRelativeErrorBound and it probably showed 4. > getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge > leading to busy threads in infinite loop > - > > Key: LUCENE-10428 > URL: https://issues.apache.org/jira/browse/LUCENE-10428 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring, core/search >Reporter: Ankit Jain >Priority: Major > Attachments: Flame_graph.png > > > Customers complained about high CPU for Elasticsearch cluster in production. > We noticed that few search requests were stuck for long time > {code:java} > % curl -s localhost:9200/_cat/tasks?v > indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 > AmMLzDQ4RrOJievRDeGFZw:569204 direct1645195007282 14:36:47 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 > emjWc5bUTG6lgnCGLulq-Q:502074 direct1645195037259 14:37:17 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 > emjWc5bUTG6lgnCGLulq-Q:583269 direct1645201316981 16:21:56 4.5h > {code} > Flame graphs indicated that CPU time is mostly going into > *getMinCompetitiveScore method in MaxScoreSumPropagator*. After doing some > live JVM debugging found that > org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had > around 4 million invocations every second > Figured out the values of some parameters from live debugging: > {code:java} > minScoreSum = 3.5541441 > minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = > 3.554144322872162 > returnObj scoreSumUpperBound = 3.5541444 > Math.ulp(minScoreSum) = 2.3841858E-7 > {code} > Example code snippet: > {code:java} > double sumOfOtherMaxScores = 3.554144322872162; > double minScoreSum = 3.5541441; > float minScore = (float) (minScoreSum - sumOfOtherMaxScores); > while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) { > minScore -= Math.ulp(minScoreSum); > System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum)); > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop
[ https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497553#comment-17497553 ] Ankit Jain edited comment on LUCENE-10428 at 2/24/22, 5:00 PM: --- {quote}By any chance, were you able to see what is the number of clauses of this query? {quote} [~jpountz] - I did check the invocation of sumRelativeErrorBound and it probably showed 4. Interestingly, even when I run the same query, it does not necessarily get into this convergence issue. So, could not find easy way to reproduce this from query level was (Author: akjain): {quote}By any chance, were you able to see what is the number of clauses of this query?{quote} [~jpountz] - I did check the invocation of sumRelativeErrorBound and it probably showed 4. > getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge > leading to busy threads in infinite loop > - > > Key: LUCENE-10428 > URL: https://issues.apache.org/jira/browse/LUCENE-10428 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring, core/search >Reporter: Ankit Jain >Priority: Major > Attachments: Flame_graph.png > > > Customers complained about high CPU for Elasticsearch cluster in production. > We noticed that few search requests were stuck for long time > {code:java} > % curl -s localhost:9200/_cat/tasks?v > indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 > AmMLzDQ4RrOJievRDeGFZw:569204 direct1645195007282 14:36:47 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 > emjWc5bUTG6lgnCGLulq-Q:502074 direct1645195037259 14:37:17 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 > emjWc5bUTG6lgnCGLulq-Q:583269 direct1645201316981 16:21:56 4.5h > {code} > Flame graphs indicated that CPU time is mostly going into > *getMinCompetitiveScore method in MaxScoreSumPropagator*. After doing some > live JVM debugging found that > org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had > around 4 million invocations every second > Figured out the values of some parameters from live debugging: > {code:java} > minScoreSum = 3.5541441 > minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = > 3.554144322872162 > returnObj scoreSumUpperBound = 3.5541444 > Math.ulp(minScoreSum) = 2.3841858E-7 > {code} > Example code snippet: > {code:java} > double sumOfOtherMaxScores = 3.554144322872162; > double minScoreSum = 3.5541441; > float minScore = (float) (minScoreSum - sumOfOtherMaxScores); > while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) { > minScore -= Math.ulp(minScoreSum); > System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum)); > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop
[ https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497553#comment-17497553 ] Ankit Jain edited comment on LUCENE-10428 at 2/24/22, 5:01 PM: --- {quote}By any chance, were you able to see what is the number of clauses of this query? {quote} [~jpountz] - I did check the invocation of sumRelativeErrorBound and it probably showed 4. Interestingly, even when I run the same query, it does not necessarily get into this convergence issue. So, could not find easy way to reproduce this at query level was (Author: akjain): {quote}By any chance, were you able to see what is the number of clauses of this query? {quote} [~jpountz] - I did check the invocation of sumRelativeErrorBound and it probably showed 4. Interestingly, even when I run the same query, it does not necessarily get into this convergence issue. So, could not find easy way to reproduce this from query level > getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge > leading to busy threads in infinite loop > - > > Key: LUCENE-10428 > URL: https://issues.apache.org/jira/browse/LUCENE-10428 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring, core/search >Reporter: Ankit Jain >Priority: Major > Attachments: Flame_graph.png > > > Customers complained about high CPU for Elasticsearch cluster in production. > We noticed that few search requests were stuck for long time > {code:java} > % curl -s localhost:9200/_cat/tasks?v > indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 > AmMLzDQ4RrOJievRDeGFZw:569204 direct1645195007282 14:36:47 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 > emjWc5bUTG6lgnCGLulq-Q:502074 direct1645195037259 14:37:17 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 > emjWc5bUTG6lgnCGLulq-Q:583269 direct1645201316981 16:21:56 4.5h > {code} > Flame graphs indicated that CPU time is mostly going into > *getMinCompetitiveScore method in MaxScoreSumPropagator*. After doing some > live JVM debugging found that > org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had > around 4 million invocations every second > Figured out the values of some parameters from live debugging: > {code:java} > minScoreSum = 3.5541441 > minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = > 3.554144322872162 > returnObj scoreSumUpperBound = 3.5541444 > Math.ulp(minScoreSum) = 2.3841858E-7 > {code} > Example code snippet: > {code:java} > double sumOfOtherMaxScores = 3.554144322872162; > double minScoreSum = 3.5541441; > float minScore = (float) (minScoreSum - sumOfOtherMaxScores); > while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) { > minScore -= Math.ulp(minScoreSum); > System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum)); > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop
[ https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497597#comment-17497597 ] Adrien Grand commented on LUCENE-10428: --- This is interesting indeed, since query execution should be quite deterministic. One way I can think of that this logic could enter an infinite loop is if some scorers manage to produce negative scores somehow. I'm mentioning this in case it rings a bell to you, but it may not be the only way to get into an infinite loop. I opened a pull request that doesn't fix the bug but at least makes it an error instead of an infinite loop. > getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge > leading to busy threads in infinite loop > - > > Key: LUCENE-10428 > URL: https://issues.apache.org/jira/browse/LUCENE-10428 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring, core/search >Reporter: Ankit Jain >Priority: Major > Attachments: Flame_graph.png > > > Customers complained about high CPU for Elasticsearch cluster in production. > We noticed that few search requests were stuck for long time > {code:java} > % curl -s localhost:9200/_cat/tasks?v > indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 > AmMLzDQ4RrOJievRDeGFZw:569204 direct1645195007282 14:36:47 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 > emjWc5bUTG6lgnCGLulq-Q:502074 direct1645195037259 14:37:17 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 > emjWc5bUTG6lgnCGLulq-Q:583269 direct1645201316981 16:21:56 4.5h > {code} > Flame graphs indicated that CPU time is mostly going into > *getMinCompetitiveScore method in MaxScoreSumPropagator*. After doing some > live JVM debugging found that > org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had > around 4 million invocations every second > Figured out the values of some parameters from live debugging: > {code:java} > minScoreSum = 3.5541441 > minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = > 3.554144322872162 > returnObj scoreSumUpperBound = 3.5541444 > Math.ulp(minScoreSum) = 2.3841858E-7 > {code} > Example code snippet: > {code:java} > double sumOfOtherMaxScores = 3.554144322872162; > double minScoreSum = 3.5541441; > float minScore = (float) (minScoreSum - sumOfOtherMaxScores); > while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) { > minScore -= Math.ulp(minScoreSum); > System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum)); > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
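The "error instead of infinite loop" idea could look roughly like this, reusing the variables and the hypothetical scoreSumUpperBound stand-in from the repro sketch earlier; the bound and the message are illustrative and not taken from the actual pull request:
```java
// Cap the ULP-decrement loop and fail loudly rather than spinning forever.
int iterations = 0;
while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
  if (++iterations > 128) { // arbitrary bound, for the sketch only
    throw new IllegalStateException(
        "Failed to converge: minScore=" + minScore + ", minScoreSum=" + minScoreSum);
  }
  minScore -= Math.ulp(minScoreSum);
}
```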
[GitHub] [lucene] rmuir commented on a change in pull request #710: LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate).
rmuir commented on a change in pull request #710: URL: https://github.com/apache/lucene/pull/710#discussion_r814137771 ## File path: lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java ## @@ -176,6 +176,30 @@ public int cardinality() { return (int) BitUtil.pop_array(bits, 0, numWords); } + @Override + public int approximateCardinality() { +// Naive sampling: compute the number of bits that are set on the first 16 longs every 1024 +// longs and scale the result by 1024/16. +// This computes the pop count on ranges instead of single longs in order to take advantage of +// vectorization. + +final int rangeLength = 16; +final int interval = 1024; + +if (numWords < interval) { + return cardinality(); +} + +long popCount = 0; +int maxWord; +for (maxWord = 0; maxWord + interval < numWords; maxWord += interval) { + popCount += BitUtil.pop_array(bits, maxWord, rangeLength); Review comment: This isn't related to the review; just saying I would be in favor of removing these `BitUtil` methods, as I think they are outdated and provide no value. I think it would be easier on our eyes to just see loops with Long.bitCount? The other constants/methods in the `BitUtil` class actually provide value. But let's not wrap what the JDK provides efficiently for no reason? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
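What that suggestion would look like in place of `BitUtil.pop_array` (a sketch only):
```java
// Plain loop over Long.bitCount; the JIT intrinsifies bitCount, and recent
// JDKs vectorize loops like this well.
static long popCount(long[] bits, int from, int length) {
  long count = 0;
  for (int i = from; i < from + length; i++) {
    count += Long.bitCount(bits[i]);
  }
  return count;
}
```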
[jira] [Updated] (LUCENE-10391) Reuse data structures across HnswGraph invocations
[ https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-10391: -- Attachment: Screen Shot 2022-02-24 at 10.18.42 AM.png > Reuse data structures across HnswGraph invocations > -- > > Key: LUCENE-10391 > URL: https://issues.apache.org/jira/browse/LUCENE-10391 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Assignee: Julie Tibshirani >Priority: Minor > Attachments: Screen Shot 2022-02-24 at 10.18.42 AM.png > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. > Profiles from nightly benchmarks suggest that allocating data-structures > incurs both lots of heap allocations > ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap)] > and CPU usage > ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu).] > It looks like reusing data structures across invocations would be a > low-hanging fruit that could help save significant CPU? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10438) Leverage Weight#count in lucene/facets
[ https://issues.apache.org/jira/browse/LUCENE-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497616#comment-17497616 ] Greg Miller commented on LUCENE-10438: -- I experimented with this a bit for taxo- and ssdv-faceting but didn't get particularly far. I quickly discovered that {{luceneutil}} doesn't seem to exercise the {{Facets#getSpecificValue}} code path, which is where I think the optimization opportunity might be. To do this though, I had to defer counting to an "on demand" approach instead of counting during initialization. The good news is that this change doesn't seem to have regressed the existing benchmark tasks (see below). I think the next steps here are to augment {{luceneutil}} to exercise {{getSpecificValue}} so we can measure impact. I'll see if I can find some time to poke into that, but if anyone else is interested in getting involved, feel free to jump in!
{code:java}
Task                        QPS baseline  StdDev   QPS candidate  StdDev   Pct diff            p-value
BrowseMonthSSDVFacets          16.42 (27.7%)    15.16 (24.5%)   -7.7% ( -46% -   61%) 0.354
OrHighMedDayTaxoFacets          6.38  (6.7%)     6.28  (6.4%)   -1.5% ( -13% -   12%) 0.463
TermDTSort                     93.64 (12.5%)    92.45 (11.9%)   -1.3% ( -22% -   26%) 0.742
HighTermTitleBDVSort          142.12 (14.2%)   140.36 (13.0%)   -1.2% ( -24% -   30%) 0.773
MedTermDayTaxoFacets           38.39  (4.2%)    37.92  (4.1%)   -1.2% (  -9% -    7%) 0.356
OrHighHigh                     42.40  (4.6%)    42.04  (3.5%)   -0.9% (  -8% -    7%) 0.510
HighTermMonthSort             104.42 (18.0%)   103.57 (17.0%)   -0.8% ( -30% -   41%) 0.882
Prefix3                       270.23  (7.9%)   268.54 (11.0%)   -0.6% ( -18% -   19%) 0.837
OrHighMed                      79.38  (4.5%)    79.00  (3.6%)   -0.5% (  -8% -    7%) 0.709
HighSpanNear                   18.50  (2.4%)    18.43  (2.4%)   -0.4% (  -5% -    4%) 0.586
IntNRQ                        135.21  (0.5%)   134.77  (1.6%)   -0.3% (  -2% -    1%) 0.371
OrNotHighLow                 1056.43  (2.7%)  1055.39  (3.2%)   -0.1% (  -5% -    5%) 0.916
PKLookup                      169.34  (3.5%)   169.19  (3.6%)   -0.1% (  -6% -    7%) 0.937
AndHighMedDayTaxoFacets        34.87  (1.8%)    34.85  (1.9%)   -0.0% (  -3% -    3%) 0.939
OrNotHighMed                  930.52  (3.9%)   930.70  (4.0%)    0.0% (  -7% -    8%) 0.988
Wildcard                       93.02  (4.9%)    93.05  (6.7%)    0.0% ( -10% -   12%) 0.984
LowTerm                      1992.53  (5.1%)  1993.41  (4.3%)    0.0% (  -8% -    9%) 0.976
AndHighHigh                    52.14  (4.9%)    52.17  (4.0%)    0.1% (  -8% -    9%) 0.969
HighSloppyPhrase               27.70  (4.0%)    27.72  (3.8%)    0.1% (  -7% -    8%) 0.933
HighTermDayOfYearSort          82.23 (13.3%)    82.35 (14.7%)    0.2% ( -24% -   32%) 0.973
OrNotHighHigh                 923.35  (3.6%)   925.08  (4.8%)    0.2% (  -7% -    8%) 0.889
AndHighHighDayTaxoFacets       19.09  (2.3%)    19.16  (1.9%)    0.3% (  -3% -    4%) 0.622
LowSloppyPhrase                28.20  (2.4%)    28.31  (2.6%)    0.4% (  -4% -    5%) 0.624
LowSpanNear                    11.96  (3.9%)    12.01  (2.5%)    0.4% (  -5% -    7%) 0.666
LowPhrase                     241.84  (4.3%)   242.98  (4.0%)    0.5% (  -7% -    9%) 0.721
MedSpanNear                    22.00  (3.3%)    22.11  (2.0%)    0.5% (  -4% -    6%) 0.568
BrowseDayOfYearSSDVFacets      12.00 (15.6%)    12.06 (14.4%)    0.5% ( -25% -   36%) 0.909
MedPhrase                      20.64  (4.9%)    20.75  (4.4%)    0.6% (  -8% -   10%) 0.709
Fuzzy2                         60.95  (1.7%)    61.29  (1.8%)    0.6% (  -2% -    4%) 0.304
HighPhrase                     19.65  (4.8%)    19.77  (4.3%)    0.6% (  -8% -   10%) 0.678
MedSloppyPhrase                30.43  (2.3%)    30.63  (2.3%)    0.7% (  -3% -    5%) 0.354
Fuzzy1                         67.61  (1.6%)    68.07  (2.0%)    0.7% (  -2% -    4%) 0.246
OrHighNotMed                 1150.70  (3.7%)  1159.51  (3.7%)    0.8% (  -6% -    8%) 0.516
OrHighLow                     745.90  (2.9%)   751.76  (1.7%)    0.8% (  -3% -    5%) 0.292
OrHighNotHigh                 898.58  (4.1%)   906.
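The "count on demand" shape being described might look like this (purely illustrative; names are hypothetical and this is not the actual facets code):
```java
// Defer counting from construction to first access, so callers that only ask
// for a few specific values don't pay for the full counting pass up front.
class LazyCounts {
  private int[] counts; // null until first requested

  int getSpecificValue(int ord) {
    if (counts == null) {
      counts = countAll(); // deferred initialization on first use
    }
    return counts[ord];
  }

  private int[] countAll() {
    // Placeholder: a real implementation would walk matching docs here.
    return new int[1024];
  }
}
```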
[jira] [Commented] (LUCENE-10391) Reuse data structures across HnswGraph invocations
[ https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497648#comment-17497648 ] Julie Tibshirani commented on LUCENE-10391: --- Now that the benchmarks are running again, we can see an improvement in index throughput. It might be a combined effect between this change and LUCENE-10408. !Screen Shot 2022-02-24 at 10.18.42 AM.png|width=444,height=277! In the profiles, we are still seeing some NeighborQueue allocations. These are likely from the results queue, which is still not shared. It is not straightforward to share it though, since its size changes across the graph levels (it's sometimes 1, sometimes topK). I'm inclined to close this out for now without making more changes; let me know what you think.
{code:java}
PERCENT  HEAP SAMPLES  STACK
26.77%   145900M       org.apache.lucene.util.fst.BytesStore#writeByte() at org.apache.lucene.util.fst.FST#<init>()
8.22%    44814M        org.apache.lucene.util.LongHeap#<init>() at org.apache.lucene.util.hnsw.NeighborQueue#<init>()
{code}
> Reuse data structures across HnswGraph invocations > -- > > Key: LUCENE-10391 > URL: https://issues.apache.org/jira/browse/LUCENE-10391 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Assignee: Julie Tibshirani >Priority: Minor > Attachments: Screen Shot 2022-02-24 at 10.18.42 AM.png > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. > Profiles from nightly benchmarks suggest that allocating data-structures > incurs both lots of heap allocations > ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap)] > and CPU usage > ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu).] > It looks like reusing data structures across invocations would be a > low-hanging fruit that could help save significant CPU? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
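The reuse pattern in question, schematically (names are illustrative, not the actual HnswGraph code):
```java
import java.util.BitSet;

// Keep per-search scratch state in a reusable object and clear it on each
// call, instead of allocating fresh structures for every graph search.
class GraphSearchScratch {
  private final BitSet visited = new BitSet();

  void beginSearch() {
    visited.clear(); // reuse the same backing storage across invocations
  }

  boolean markVisited(int node) {
    if (visited.get(node)) {
      return false; // already seen during this search
    }
    visited.set(node);
    return true;
  }
}
```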
[GitHub] [lucene] rmuir commented on pull request #710: LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate).
rmuir commented on pull request #710: URL: https://github.com/apache/lucene/pull/710#issuecomment-1050162660 Also, another random suggestion for another day. I think it would be fine to have some logic like this at some point:
```
if (length < N) {
  return cardinality(); // for small bitsets, don't be fancy
}
```
But I'm not concerned either way. Just thought that if we need to iterate and introduce benchmarks, ignoring tiny cases is an easy way to really zero in on good perf. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop
[ https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497658#comment-17497658 ] Ankit Jain commented on LUCENE-10428: - {quote}I opened a pull request that doesn't fix the bug but at least makes it an error instead of an infinite loop. {quote} Can you share link to this PR? Also, we should capture all the debug information as part of that error to understand this further. > getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge > leading to busy threads in infinite loop > - > > Key: LUCENE-10428 > URL: https://issues.apache.org/jira/browse/LUCENE-10428 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring, core/search >Reporter: Ankit Jain >Priority: Major > Attachments: Flame_graph.png > > > Customers complained about high CPU for Elasticsearch cluster in production. > We noticed that few search requests were stuck for long time > {code:java} > % curl -s localhost:9200/_cat/tasks?v > indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 > AmMLzDQ4RrOJievRDeGFZw:569204 direct1645195007282 14:36:47 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 > emjWc5bUTG6lgnCGLulq-Q:502074 direct1645195037259 14:37:17 6.2h > indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 > emjWc5bUTG6lgnCGLulq-Q:583269 direct1645201316981 16:21:56 4.5h > {code} > Flame graphs indicated that CPU time is mostly going into > *getMinCompetitiveScore method in MaxScoreSumPropagator*. After doing some > live JVM debugging found that > org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had > around 4 million invocations every second > Figured out the values of some parameters from live debugging: > {code:java} > minScoreSum = 3.5541441 > minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = > 3.554144322872162 > returnObj scoreSumUpperBound = 3.5541444 > Math.ulp(minScoreSum) = 2.3841858E-7 > {code} > Example code snippet: > {code:java} > double sumOfOtherMaxScores = 3.554144322872162; > double minScoreSum = 3.5541441; > float minScore = (float) (minScoreSum - sumOfOtherMaxScores); > while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) { > minScore -= Math.ulp(minScoreSum); > System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum)); > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on pull request #686: LUCENE-10421: use Constant instead of relying upon timestamp
jtibshirani commented on pull request #686: URL: https://github.com/apache/lucene/pull/686#issuecomment-1050175765 Thanks @rmuir ! Are you okay to merge this? I got confused recently over a sometimes-reproducible test failure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-10440) Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets
[ https://issues.apache.org/jira/browse/LUCENE-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller reassigned LUCENE-10440: Assignee: Greg Miller > Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets > --- > > Key: LUCENE-10440 > URL: https://issues.apache.org/jira/browse/LUCENE-10440 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > > Similar to what we did in LUCENE-10379, let's reduce the {{public}} > visibility of {{TaxonomyFacets}} and {{FloatTaxonomyFacets}} to pkg-private > since they're really implementation details housing common logic and not > really intended as extension points for user faceting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10440) Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets
Greg Miller created LUCENE-10440: Summary: Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets Key: LUCENE-10440 URL: https://issues.apache.org/jira/browse/LUCENE-10440 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Greg Miller Similar to what we did in LUCENE-10379, let's reduce the {{public}} visibility of {{TaxonomyFacets}} and {{FloatTaxonomyFacets}} to pkg-private since they're really implementation details housing common logic and not really intended as extension points for user faceting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
rmuir commented on pull request #709: URL: https://github.com/apache/lucene/pull/709#issuecomment-1050188471 > I don't think the grow(long) is necessary, we can always add it to the IntersectVisitor instead. Maybe it would be worth adjusting how we call grow() in BKDReader#addAll as it does not need the dance it is currently doing Sorry, I'm not so familiar with the code in question. Does that mean we can remove the `grown` parameter here and the split logic around it for the `addAll()` method? If so, that sounds great! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller opened a new pull request #712: LUCENE-10440: Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets
gsmiller opened a new pull request #712: URL: https://github.com/apache/lucene/pull/712 # Description These two classes are really implementation details, meant to hold common logic for our faceting implementations, but they are `public` and could be extended by users. It would be nice to reduce visibility to shrink our API surface area. # Solution Make pkg-private. Also reduce visibility of `protected` methods/fields to pkg-private as a little extra cleanup. Note that I will mark these as `@Deprecated` on the 9x branch to provide advance notice to any users that might be extending these. # Tests No new testing needed. # Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [x] I have created a Jira issue and added the issue ID to my pull request title. - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [x] I have developed this patch against the `main` branch. - [x] I have run `./gradlew check`. - [ ] I have added tests for my changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller opened a new pull request #713: LUCENE-10440: Mark TaxonomyFacets and FloatTaxonomyFacets as deprecated
gsmiller opened a new pull request #713: URL: https://github.com/apache/lucene/pull/713 This is a "backport" of #712, providing early `@Deprecation` notice. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10440) Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets
[ https://issues.apache.org/jira/browse/LUCENE-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497692#comment-17497692 ] Greg Miller commented on LUCENE-10440: -- PRs posted for this. The only point maybe worth calling out here for discussion is that the visibility reduction of {{TaxonomyFacets}} means there is no common type to refer to just taxonomy-faceting implementations. The only reason I could see this _maybe_ mattering is that {{TaxonomyFacets}} defines public methods {{childrenLoaded()}} and {{{}siblingsLoaded(){}}}. So it's possible some user wants to refer to taxonomy facets generally, but not as general as just referencing {{Facets}} because they want to rely on one of these methods. This seems unlikely to me. The only code we have that references these methods is in testing, but I suppose users might want to know if these things were loaded for the purpose of metrics/logging/etc. > Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets > --- > > Key: LUCENE-10440 > URL: https://issues.apache.org/jira/browse/LUCENE-10440 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > Similar to what we did in LUCENE-10379, let's reduce the {{public}} > visibility of {{TaxonomyFacets}} and {{FloatTaxonomyFacets}} to pkg-private > since they're really implementation details housing common logic and not > really intended as extension points for user faceting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
magibney commented on pull request #380: URL: https://github.com/apache/lucene/pull/380#issuecomment-1050227228 This patch applies cleanly and all tests pass. I plan to commit this within the next few days, because I think it does improve things (targeting 9.1 release). But I want to go on the record mentioning that there are a number of things I've noticed in the process of looking at this code (completely orthogonal to this PR) that make me a bit uncomfortable: 1. All the objects retrieved through OpenNLPOpsFactory are cached in [static maps](https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java#L41-L48) that are [never cleared](https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java#L191-L199). 2. These maps are populated in a way where there's a race condition (could end up with multiple copies of these objects being held external to this class; see the sketch after this message). wrt this PR in particular, I'm nervous about the fact that everything _but_ the DictionaryLemmatizers has long been cached as objects, which makes me wonder whether there was a reason that, from this class's inception, the dictionaryLemmatizers have been cached as String versions of the raw dictionary. But I've tried to chase down this line of reasoning, and still can't find any obvious problem with this change. Analogous to not "letting the perfect be the enemy of the good", I'm going to not "let the 'not-so-good' be the enemy of the 'probably worse'". I hope this is the correct decision. @spyk thanks for your patience, and for keeping the PR up-to-date. If you can add a CHANGES.txt entry, that's probably warranted here (if only because of the change to the public API). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
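The race described in item 2 above is the classic check-then-put on a shared map; a minimal sketch of one way to close it (the class and method names here are hypothetical, not OpenNLPOpsFactory's actual API), using ConcurrentHashMap#computeIfAbsent so the lookup and insertion happen atomically:
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical cache sketch: computeIfAbsent does the lookup and the insert
// atomically, so two threads racing on the same key cannot end up holding
// two different copies of the loaded object.
final class ModelCache<T> {
  private final ConcurrentMap<String, T> models = new ConcurrentHashMap<>();

  T getOrLoad(String name, Function<String, T> loader) {
    // loader is invoked at most once per key, even under concurrent access
    return models.computeIfAbsent(name, loader);
  }
}
{code}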
[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()
[ https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497716#comment-17497716 ] Uwe Schindler commented on LUCENE-10431: What was the exact query? Is it the only boolean one in the NetBeans pull request, with just 4 MUST clauses? I would recommend adding the clauses using the method taking 2 parameters rather than instantiating clauses (see the sketch after this message). But this should not cause an issue. Otherwise: are there other queries, like ones which may be recursive (a boolean query that refers to another query wrapping itself)? > AssertionError in BooleanQuery.hashCode() > - > > Key: LUCENE-10431 > URL: https://issues.apache.org/jira/browse/LUCENE-10431 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.11.1 >Reporter: Michael Bien >Priority: Major > > Hello devs, > the constructor of BooleanQuery can under some circumstances trigger a hash > code computation before "clauseSets" is fully filled. Since BooleanClause is > using its query field for the hash code too, it can happen that the "wrong" > hash code is stored, since adding the clause to the set triggers its > hashCode(). > If assertions are enabled the check in BooleanQuery, which recomputes the > hash code, will notice it and throw an error. > exception: > {code:java} > java.lang.AssertionError > at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614) > at java.base/java.util.Objects.hashCode(Objects.java:103) > at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298) > at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527) > at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119) > at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717) > at java.base/java.util.EnumMap.hashCode(EnumMap.java:709) > at java.base/java.util.Arrays.hashCode(Arrays.java:4498) > at java.base/java.util.Objects.hash(Objects.java:133) > at > org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597) > at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611) > at java.base/java.util.HashMap.hash(HashMap.java:340) > at java.base/java.util.HashMap.put(HashMap.java:612) > at org.apache.lucene.search.Multiset.add(Multiset.java:82) > at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:154) > at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:42) > at > org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133) > {code} > I noticed this while trying to upgrade the NetBeans maven indexer modules > from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
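For reference, the two styles of adding a clause that the comment above contrasts look like this; a minimal sketch with a hypothetical field and term:
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery.Builder builder = new BooleanQuery.Builder();
// two-parameter form: the builder wraps the query in a BooleanClause itself
builder.add(new TermQuery(new Term("g", "org.eclipse.jetty")), BooleanClause.Occur.MUST);
// equivalent clause-object form that the comment advises against
builder.add(new BooleanClause(new TermQuery(new Term("a", "org.eclipse.jetty")), BooleanClause.Occur.MUST));
BooleanQuery query = builder.build();
{code}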
[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()
[ https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497718#comment-17497718 ] Uwe Schindler commented on LUCENE-10431: Sorry, with the builder pattern you can't create a BQ referring to itself as a clause. 😂 > AssertionError in BooleanQuery.hashCode() > - > > Key: LUCENE-10431 > URL: https://issues.apache.org/jira/browse/LUCENE-10431 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.11.1 >Reporter: Michael Bien >Priority: Major > > Hello devs, > the constructor of BooleanQuery can under some circumstances trigger a hash > code computation before "clauseSets" is fully filled. Since BooleanClause is > using its query field for the hash code too, it can happen that the "wrong" > hash code is stored, since adding the clause to the set triggers its > hashCode(). > If assertions are enabled the check in BooleanQuery, which recomputes the > hash code, will notice it and throw an error. > exception: > {code:java} > java.lang.AssertionError > at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614) > at java.base/java.util.Objects.hashCode(Objects.java:103) > at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298) > at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527) > at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119) > at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717) > at java.base/java.util.EnumMap.hashCode(EnumMap.java:709) > at java.base/java.util.Arrays.hashCode(Arrays.java:4498) > at java.base/java.util.Objects.hash(Objects.java:133) > at > org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597) > at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611) > at java.base/java.util.HashMap.hash(HashMap.java:340) > at java.base/java.util.HashMap.put(HashMap.java:612) > at org.apache.lucene.search.Multiset.add(Multiset.java:82) > at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:154) > at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:42) > at > org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133) > {code} > I noticed this while trying to upgrade the NetBeans maven indexer modules > from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9952) FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497749#comment-17497749 ] ASF subversion and git services commented on LUCENE-9952: - Commit 4af516a1491e55022ca81a909b7c78d54d8272c0 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4af516a ] Remove TODO for LUCENE-9952 since that issue was fixed > FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts > --- > > Key: LUCENE-9952 > URL: https://issues.apache.org/jira/browse/LUCENE-9952 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 9.0 >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Fix For: 9.1 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > As described in a dev@ list > [thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCANJ0CDo-9zt0U_pxWNOBkfiJpaAXZGGwOEJPnENAP6JzWz_t9Q%40mail.gmail.com%3E], > the value of {{FacetResult#value}} can be incorrect in SSDV faceting when > docs are multi-valued (affects both {{SortedSetDocValueFacetCounts}} and > {{ConcurrentSortedSetDocValueFacetCounts}}). If a doc has multiple values in > the same dimension, it will be counted multiple times when populating the > counts of {{FacetResult#value}}. > We should either provide an accurate count, or provide {{-1}} if we don't > have an accurate count (like we do in taxonomy faceting). I _think_ this > change will be a bit involved though as SSDV facet counting likely needs to > be made aware of {{FacetConfig}}. > NOTE: I've updated this description to describe only the SSDV case after > spinning off LUCENE-9953 to track the LongValueFacetCounts case. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9952) FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497757#comment-17497757 ] ASF subversion and git services commented on LUCENE-9952: - Commit 81ab1d6ab6fa3aee69153b256a01a4b984f88b59 in lucene's branch refs/heads/branch_9x from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=81ab1d6 ] Remove TODO for LUCENE-9952 since that issue was fixed > FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts > --- > > Key: LUCENE-9952 > URL: https://issues.apache.org/jira/browse/LUCENE-9952 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 9.0 >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Fix For: 9.1 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > As described in a dev@ list > [thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCANJ0CDo-9zt0U_pxWNOBkfiJpaAXZGGwOEJPnENAP6JzWz_t9Q%40mail.gmail.com%3E], > the value of {{FacetResult#value}} can be incorrect in SSDV faceting when > docs are multi-valued (affects both {{SortedSetDocValueFacetCounts}} and > {{ConcurrentSortedSetDocValueFacetCounts}}). If a doc has multiple values in > the same dimension, it will be counted multiple times when populating the > counts of {{FacetResult#value}}. > We should either provide an accurate count, or provide {{-1}} if we don't > have an accurate count (like we do in taxonomy faceting). I _think_ this > change will be a bit involved though as SSDV facet counting likely needs to > be made aware of {{FacetConfig}}. > NOTE: I've updated this description to describe only the SSDV case after > spinning off LUCENE-9953 to track the LongValueFacetCounts case. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10394) Explore moving ByteBuffer(sData|Index)Input to absolute bulk gets
[ https://issues.apache.org/jira/browse/LUCENE-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497873#comment-17497873 ] Gautam Worah commented on LUCENE-10394: --- I'll try to work on this soon. Looking into the ByteBuffer API in the meantime. > Explore moving ByteBuffer(sData|Index)Input to absolute bulk gets > - > > Key: LUCENE-10394 > URL: https://issues.apache.org/jira/browse/LUCENE-10394 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > > With the move to Java 17, we now have access to absolute bulk gets on > ByteBuffers: > https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/ByteBuffer.html#get(int,byte%5B%5D,int,int). > We should look into whether this helps with our more random-access workloads > like binary doc values, conjunctive queries and building HNSW graphs. > ByteBuffersDataInput already tries to access the underlying buffers in a > random-access fashion and works around the lack of absolute bulk gets by > doing {{ByteBuffer#duplicate()}}. It looks like a low hanging fruit to stop > duplicating the buffer and just do an absolute bulk get instead. > ByteBuffersIndexInput would require a bit more work since it's performing > relative reads whenever possible. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
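A minimal sketch of the two access patterns the description contrasts (the buffer size and offsets here are hypothetical); the absolute bulk-get overload exists since Java 13, so it is available on the Java 17 baseline:
{code:java}
import java.nio.ByteBuffer;

byte[] dst = new byte[16];
ByteBuffer buffer = ByteBuffer.allocate(1024);

// workaround style: duplicate() so the shared buffer's position is untouched
ByteBuffer dup = buffer.duplicate();
dup.position(128);
dup.get(dst, 0, dst.length);

// absolute bulk get: no duplicate, no position bookkeeping
buffer.get(128, dst, 0, dst.length);
{code}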
[jira] (LUCENE-10394) Explore moving ByteBuffer(sData|Index)Input to absolute bulk gets
[ https://issues.apache.org/jira/browse/LUCENE-10394 ] Gautam Worah deleted comment on LUCENE-10394: --- was (Author: gworah): I'll try to work on this soon. Looking into the ByteBuffer API in the meantime. > Explore moving ByteBuffer(sData|Index)Input to absolute bulk gets > - > > Key: LUCENE-10394 > URL: https://issues.apache.org/jira/browse/LUCENE-10394 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > > With the move to Java 17, we now have access to absolute bulk gets on > ByteBuffers: > https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/ByteBuffer.html#get(int,byte%5B%5D,int,int). > We should look into whether this helps with our more random-access workloads > like binary doc values, conjunctive queries and building HNSW graphs. > ByteBuffersDataInput already tries to access the underlying buffers in a > random-access fashion and works around the lack of absolute bulk gets by > doing {{ByteBuffer#duplicate()}}. It looks like a low hanging fruit to stop > duplicating the buffer and just do an absolute bulk get instead. > ByteBuffersIndexInput would require a bit more work since it's performing > relative reads whenever possible. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10441) ArrayIndexOutOfBoundsException during indexing
Peixin Li created LUCENE-10441: -- Summary: ArrayIndexOutOfBoundsException during indexing Key: LUCENE-10441 URL: https://issues.apache.org/jira/browse/LUCENE-10441 Project: Lucene - Core Issue Type: Bug Reporter: Peixin Li Hi experts! I am facing an ArrayIndexOutOfBoundsException while indexing and committing documents. This exception gives me no clue about what happened, so I have little information for debugging. Can I have some suggestions about what the cause could be and how to fix this error? {code:java} java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.lucene.util.BytesRefHash$1.get(BytesRefHash.java:179) at org.apache.lucene.util.StringMSBRadixSorter$1.get(StringMSBRadixSorter.java:42) at org.apache.lucene.util.StringMSBRadixSorter$1.setPivot(StringMSBRadixSorter.java:63) at org.apache.lucene.util.Sorter.binarySort(Sorter.java:192) at org.apache.lucene.util.Sorter.binarySort(Sorter.java:187) at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:41) at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:83) at org.apache.lucene.util.IntroSorter.sort(IntroSorter.java:36) at org.apache.lucene.util.MSBRadixSorter.introSort(MSBRadixSorter.java:133) at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:126) at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:121) at org.apache.lucene.util.BytesRefHash.sort(BytesRefHash.java:183) at org.apache.lucene.index.SortedSetDocValuesWriter.flush(SortedSetDocValuesWriter.java:171) at org.apache.lucene.index.DefaultIndexingChain.writeDocValues(DefaultIndexingChain.java:348) at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:228) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476) at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10441) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peixin Li updated LUCENE-10441: --- Description: Hi experts! I am facing an ArrayIndexOutOfBoundsException while indexing and committing documents. This exception gives me no clue about what happened, so I have little information for debugging. Can I have some suggestions about what the cause could be and how to fix this error? I'm using Lucene 8.10.0. {code:java} java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.lucene.util.BytesRefHash$1.get(BytesRefHash.java:179) at org.apache.lucene.util.StringMSBRadixSorter$1.get(StringMSBRadixSorter.java:42) at org.apache.lucene.util.StringMSBRadixSorter$1.setPivot(StringMSBRadixSorter.java:63) at org.apache.lucene.util.Sorter.binarySort(Sorter.java:192) at org.apache.lucene.util.Sorter.binarySort(Sorter.java:187) at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:41) at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:83) at org.apache.lucene.util.IntroSorter.sort(IntroSorter.java:36) at org.apache.lucene.util.MSBRadixSorter.introSort(MSBRadixSorter.java:133) at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:126) at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:121) at org.apache.lucene.util.BytesRefHash.sort(BytesRefHash.java:183) at org.apache.lucene.index.SortedSetDocValuesWriter.flush(SortedSetDocValuesWriter.java:171) at org.apache.lucene.index.DefaultIndexingChain.writeDocValues(DefaultIndexingChain.java:348) at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:228) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476) at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) {code} was: Hi experts!, i have facing ArrayIndexOutOfBoundsException during indexing and committing documents, this exception gives me no clue about what happened so i have little information for debugging, can i have some suggest about what could be and how to fix this error? 
{code:java} java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.lucene.util.BytesRefHash$1.get(BytesRefHash.java:179) at org.apache.lucene.util.StringMSBRadixSorter$1.get(StringMSBRadixSorter.java:42) at org.apache.lucene.util.StringMSBRadixSorter$1.setPivot(StringMSBRadixSorter.java:63) at org.apache.lucene.util.Sorter.binarySort(Sorter.java:192) at org.apache.lucene.util.Sorter.binarySort(Sorter.java:187) at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:41) at org.apache.lucene.util.IntroSorter.quicksort(IntroSorter.java:83) at org.apache.lucene.util.IntroSorter.sort(IntroSorter.java:36) at org.apache.lucene.util.MSBRadixSorter.introSort(MSBRadixSorter.java:133) at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:126) at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:121) at org.apache.lucene.util.BytesRefHash.sort(BytesRefHash.java:183) at org.apache.lucene.index.SortedSetDocValuesWriter.flush(SortedSetDocValuesWriter.java:171) at org.apache.lucene.index.DefaultIndexingChain.writeDocValues(DefaultIndexingChain.java:348) at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:228) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476) at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) {code} > ArrayIndexOutOfBoundsException during indexing > -- > > Key: LUCENE-10441 > URL: https://issues.apache.org/jira/browse/LUCENE-10441 > Project: Lucene - Core > Issue Type: Bug >Reporter: Peixin Li >Priority: Major > > Hi experts! I am facing an ArrayIndexOutOfBoundsException while indexing and > committing documents. This exception gives me no clue about what happened, so > I have little information for debugging. Can I have some suggestions about > what the cause could be and how to fix this error? I'm using Lucene 8.10.0. > {code:java} > java.lang.ArrayIndexOutOfBoundsEx
[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()
[ https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497915#comment-17497915 ] Michael Bien commented on LUCENE-10431: --- thanks for the tips. I might have found the cause. The BooleanQuery in question is constructed as follows: {code:java} add g:org?eclipse?jetty (groupId:org?eclipse?jetty*) with SHOULD add a:org?eclipse?jetty (artifactId:org?eclipse?jetty*) with SHOULD add v:org.eclipse.jetty* with SHOULD add n:org.eclipse.jetty* with SHOULD add d:org.eclipse.jetty* with SHOULD add +classnames:org +classnames:eclipse +classnames:jetty with SHOULD build: org.apache.lucene.search.BooleanQuery$Builder@4c2e43fc result: (g:org?eclipse?jetty (groupId:org?eclipse?jetty*)) (a:org?eclipse?jetty (artifactId:org?eclipse?jetty*)) v:org.eclipse.jetty* n:org.eclipse.jetty* d:org.eclipse.jetty* (+classnames:org +classnames:eclipse +classnames:jetty){code} (user searching "org.eclipse.jetty" in the maven repository index) However, *before* each query is added to the builder, its rewrite method is changed to "CONSTANT_SCORE_BOOLEAN_REWRITE" recursively (builder.add(setBooleanRewrite(q), occur)). {code:java} private static Query setBooleanRewrite (final Query q) { if (q instanceof MultiTermQuery) { ((MultiTermQuery)q).setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE); } else if (q instanceof BooleanQuery) { for (BooleanClause c : ((BooleanQuery)q).clauses()) { setBooleanRewrite(c.getQuery()); } } return q; }{code} If I remove this, I don't see any assertion errors. Is this not a legal way of changing the rewrite method? I am a little bit worried that this is only hiding the issue and it might appear somewhere else. (btw using the two-arg add didn't help, but I am going to change it anyway since it's shorter :)) > AssertionError in BooleanQuery.hashCode() > - > > Key: LUCENE-10431 > URL: https://issues.apache.org/jira/browse/LUCENE-10431 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.11.1 >Reporter: Michael Bien >Priority: Major > > Hello devs, > the constructor of BooleanQuery can under some circumstances trigger a hash > code computation before "clauseSets" is fully filled. Since BooleanClause is > using its query field for the hash code too, it can happen that the "wrong" > hash code is stored, since adding the clause to the set triggers its > hashCode(). > If assertions are enabled the check in BooleanQuery, which recomputes the > hash code, will notice it and throw an error. 
> exception: > {code:java} > java.lang.AssertionError > at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614) > at java.base/java.util.Objects.hashCode(Objects.java:103) > at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298) > at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527) > at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119) > at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717) > at java.base/java.util.EnumMap.hashCode(EnumMap.java:709) > at java.base/java.util.Arrays.hashCode(Arrays.java:4498) > at java.base/java.util.Objects.hash(Objects.java:133) > at > org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597) > at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611) > at java.base/java.util.HashMap.hash(HashMap.java:340) > at java.base/java.util.HashMap.put(HashMap.java:612) > at org.apache.lucene.search.Multiset.add(Multiset.java:82) > at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:154) > at org.apache.lucene.search.BooleanQuery.<init>(BooleanQuery.java:42) > at > org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133) > {code} > I noticed this while trying to upgrade the NetBeans maven indexer modules > from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
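One mutation-free ordering that sidesteps the problem described above is to configure the rewrite method on each leaf MultiTermQuery before it is wrapped in any BooleanQuery, instead of recursing into already-built queries whose hash codes may already be cached; a minimal sketch (not the NetBeans fix; the field and term are hypothetical):
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;

// configure the leaf query *before* it participates in any hashCode()
PrefixQuery prefix = new PrefixQuery(new Term("v", "org.eclipse.jetty"));
prefix.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE);

BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(prefix, BooleanClause.Occur.SHOULD);
BooleanQuery query = builder.build(); // no clause is mutated after this point
{code}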
[GitHub] [lucene] LuXugang opened a new pull request #714: LUCENE-10439: update CHANGES.txt
LuXugang opened a new pull request #714: URL: https://github.com/apache/lucene/pull/714 update CHANGES.txt for [LUCENE-10424](https://issues.apache.org/jira/browse/LUCENE-10424) and [LUCENE-10439](https://issues.apache.org/jira/browse/LUCENE-10439). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir merged pull request #686: LUCENE-10421: use Constant instead of relying upon timestamp
rmuir merged pull request #686: URL: https://github.com/apache/lucene/pull/686 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?
[ https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497920#comment-17497920 ] ASF subversion and git services commented on LUCENE-10421: -- Commit 466278e14921572ceb54a0a52a8a262476ee24b7 in lucene's branch refs/heads/main from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=466278e ] LUCENE-10421: use Constant instead of relying upon timestamp (#686) > Non-deterministic results from KnnVectorQuery? > -- > > Key: LUCENE-10421 > URL: https://issues.apache.org/jira/browse/LUCENE-10421 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have > been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} > is giving slightly different results on every run, even on an identical > (deterministically constructed – single thread indexing, flush by doc count, > {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each > night. It produces failures like this, which then abort the benchmark to > help us catch any recent accidental bug that alters our precise top N search > hits and scores: > {noformat} > Traceback (most recent call last): > File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in > run() > File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run > raise RuntimeError(‘search result differences: %s’ % str(errors)) > RuntimeError: search result differences: > [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None > groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], > ‘0.92060816’) vs ([254438\ > 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] > filter=None sort=None groupField=None hitCount=10: hit 7 has wrong > field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, > “qu\ > ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None > groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], > ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\ > [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit > 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], > ‘0.8378446’)“]{noformat} > At first I thought this might be expected because of the recent (awesome!!) > improvements to HNSW, so I tried to simply "regold". But the regold did not > "take", so it indeed looks like there is some non-determinism here. > I pinged [~msoko...@gmail.com] and he found this random seeding that is most > likely the cause? > {noformat} > public final class HnswGraphBuilder { > /** Default random seed for level generation * */ > private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); > {noformat} > Can we somehow make this deterministic instead? Or maybe the nightly > benchmarks could somehow pass something in to make results deterministic for > benchmarking? Or ... we could also relax the benchmarks to accept > non-determinism for {{KnnVectorQuery}} task? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
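The shape of the merged change is simply replacing the timestamp seed with a compile-time constant; a sketch of the idea (the particular constant shown here is illustrative, not necessarily the value in the commit):
{code:java}
public final class HnswGraphBuilder {
  /** Default random seed for level generation, fixed so identical builds produce identical graphs */
  private static final long DEFAULT_RAND_SEED = 42L;
  // callers that want varied graphs can still supply an explicit seed
}
{code}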
[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?
[ https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497932#comment-17497932 ] ASF subversion and git services commented on LUCENE-10421: -- Commit 5972b495ba6f5145492077dfb2a5d28717f71533 in lucene's branch refs/heads/branch_9x from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5972b49 ] LUCENE-10421: use Constant instead of relying upon timestamp (#686) > Non-deterministic results from KnnVectorQuery? > -- > > Key: LUCENE-10421 > URL: https://issues.apache.org/jira/browse/LUCENE-10421 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have > been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} > is giving slightly different results on every run, even on an identical > (deterministically constructed – single thread indexing, flush by doc count, > {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each > night. It produces failures like this, which then abort the benchmark to > help us catch any recent accidental bug that alters our precise top N search > hits and scores: > {noformat} > Traceback (most recent call last): > File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in > run() > File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run > raise RuntimeError(‘search result differences: %s’ % str(errors)) > RuntimeError: search result differences: > [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None > groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], > ‘0.92060816’) vs ([254438\ > 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] > filter=None sort=None groupField=None hitCount=10: hit 7 has wrong > field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, > “qu\ > ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None > groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], > ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\ > [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit > 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], > ‘0.8378446’)“]{noformat} > At first I thought this might be expected because of the recent (awesome!!) > improvements to HNSW, so I tried to simply "regold". But the regold did not > "take", so it indeed looks like there is some non-determinism here. > I pinged [~msoko...@gmail.com] and he found this random seeding that is most > likely the cause? > {noformat} > public final class HnswGraphBuilder { > /** Default random seed for level generation * */ > private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); > {noformat} > Can we somehow make this deterministic instead? Or maybe the nightly > benchmarks could somehow pass something in to make results deterministic for > benchmarking? Or ... we could also relax the benchmarks to accept > non-determinism for {{KnnVectorQuery}} task? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?
[ https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-10421. -- Fix Version/s: 9.1 Resolution: Fixed > Non-deterministic results from KnnVectorQuery? > -- > > Key: LUCENE-10421 > URL: https://issues.apache.org/jira/browse/LUCENE-10421 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Fix For: 9.1 > > Time Spent: 40m > Remaining Estimate: 0h > > [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have > been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} > is giving slightly different results on every run, even on an identical > (deterministically constructed – single thread indexing, flush by doc count, > {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each > night. It produces failures like this, which then abort the benchmark to > help us catch any recent accidental bug that alters our precise top N search > hits and scores: > {noformat} > Traceback (most recent call last): > File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in > run() > File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run > raise RuntimeError(‘search result differences: %s’ % str(errors)) > RuntimeError: search result differences: > [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None > groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], > ‘0.92060816’) vs ([254438\ > 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] > filter=None sort=None groupField=None hitCount=10: hit 7 has wrong > field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, > “qu\ > ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None > groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], > ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\ > [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit > 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], > ‘0.8378446’)“]{noformat} > At first I thought this might be expected because of the recent (awesome!!) > improvements to HNSW, so I tried to simply "regold". But the regold did not > "take", so it indeed looks like there is some non-determinism here. > I pinged [~msoko...@gmail.com] and he found this random seeding that is most > likely the cause? > {noformat} > public final class HnswGraphBuilder { > /** Default random seed for level generation * */ > private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); > {noformat} > Can we somehow make this deterministic instead? Or maybe the nightly > benchmarks could somehow pass something in to make results deterministic for > benchmarking? Or ... we could also relax the benchmarks to accept > non-determinism for {{KnnVectorQuery}} task? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org