[GitHub] [lucene-solr] janhoy commented on pull request #1364: SOLR-14335: Lock Solr's memory to prevent swapping
janhoy commented on pull request #1364: URL: https://github.com/apache/lucene-solr/pull/1364#issuecomment-988681951

Lucene and Solr development has moved to separate git repositories and this PR is being bulk-closed. Please open a new PR against https://github.com/apache/solr or https://github.com/apache/lucene if your contribution is still relevant to the project.
[GitHub] [lucene-solr] janhoy closed pull request #1364: SOLR-14335: Lock Solr's memory to prevent swapping
janhoy closed pull request #1364: URL: https://github.com/apache/lucene-solr/pull/1364
[jira] [Created] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
Feng Guo created LUCENE-10297:

Summary: Speed up medium cardinality fields with readLELongs and SIMD
Key: LUCENE-10297
URL: https://issues.apache.org/jira/browse/LUCENE-10297
Project: Lucene - Core
Issue Type: Improvement
Components: core/codecs
Reporter: Feng Guo

Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields. Maybe this is because the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds? I think medium cardinality fields are tempting targets for this optimization.

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields stayed about even.

*BaseLine*
{code:java}
task: index_1_doc_32_cardinality_baseline, term count: 1, took: 29
task: index_1_doc_32_cardinality_baseline, term count: 2, took: 40
task: index_1_doc_32_cardinality_baseline, term count: 4, took: 74
task: index_1_doc_32_cardinality_baseline, term count: 8, took: 144
task: index_1_doc_32_cardinality_baseline, term count: 16, took: 284
task: index_1_doc_128_cardinality_baseline, term count: 1, took: 20
task: index_1_doc_128_cardinality_baseline, term count: 8, took: 70
task: index_1_doc_128_cardinality_baseline, term count: 16, took: 127
task: index_1_doc_128_cardinality_baseline, term count: 32, took: 251
task: index_1_doc_128_cardinality_baseline, term count: 64, took: 576
task: index_1_doc_8192_cardinality_baseline, term count: 1, took: 2
task: index_1_doc_8192_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_8192_cardinality_baseline, term count: 64, took: 18
task: index_1_doc_8192_cardinality_baseline, term count: 512, took: 88
task: index_1_doc_8192_cardinality_baseline, term count: 2048, took: 266
task: index_1_doc_1048576_cardinality_baseline, term count: 1, took: 3
task: index_1_doc_1048576_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_1048576_cardinality_baseline, term count: 64, took: 8
task: index_1_doc_1048576_cardinality_baseline, term count: 512, took: 33
task: index_1_doc_1048576_cardinality_baseline, term count: 2048, took: 97
task: index_1_doc_8388608_cardinality_baseline, term count: 1, took: 4
task: index_1_doc_8388608_cardinality_baseline, term count: 16, took: 20
task: index_1_doc_8388608_cardinality_baseline, term count: 64, took: 31
task: index_1_doc_8388608_cardinality_baseline, term count: 512, took: 70
task: index_1_doc_8388608_cardinality_baseline, term count: 2048, took: 209
{code}

*candidate*
{code:java}
task: index_1_doc_32_cardinality_candidate, term count: 1, took: 18
task: index_1_doc_32_cardinality_candidate, term count: 2, took: 16
task: index_1_doc_32_cardinality_candidate, term count: 4, took: 26
task: index_1_doc_32_cardinality_candidate, term count: 8, took: 46
task: index_1_doc_32_cardinality_candidate, term count: 16, took: 88
task: index_1_doc_128_cardinality_candidate, term count: 1, took: 12
task: index_1_doc_128_cardinality_candidate, term count: 8, took: 22
task: index_1_doc_128_cardinality_candidate, term count: 16, took: 29
task: index_1_doc_128_cardinality_candidate, term count: 32, took: 50
task: index_1_doc_128_cardinality_candidate, term count: 64, took: 93
task: index_1_doc_8192_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8192_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_8192_cardinality_candidate, term count: 64, took: 13
task: index_1_doc_8192_cardinality_candidate, term count: 512, took: 42
task: index_1_doc_8192_cardinality_candidate, term count: 2048, took: 129
task: index_1_doc_1048576_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_1048576_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 64, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 512, took: 32
task: index_1_doc_1048576_cardinality_candidate, term count: 2048, took: 93
task: index_1_doc_8388608_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8388608_cardinality_candidate, term count: 16, took: 21
task: index_1_doc_8388608_cardinality_candidate, term count: 64, took: 38
task: index_1_doc_8388608_cardinality_candidate, term count: 512, took: 73
task: index_1_doc_8388608_cardinality_candidate, term count: 2048, took: 204
{code}
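The ticket describes the benchmark only in prose; a minimal sketch of that methodology, indexing random LongPoint values at a controlled cardinality and timing a PointInSetQuery built via LongPoint.newSetQuery, could look like the following. The field name, doc count, cardinality, term count, and the crude timing loop are illustrative assumptions, not the actual benchmark code behind the numbers above.

{code:java}
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class PointInSetBench {
  public static void main(String[] args) throws Exception {
    int numDocs = 1_000_000; // hypothetical corpus size
    int cardinality = 128;   // distinct values in the field, as in the runs above
    int termCount = 16;      // values per PointInSetQuery
    Random random = new Random(42);
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < numDocs; i++) {
        Document doc = new Document();
        // drawing values from a fixed-size set controls the field cardinality
        doc.add(new LongPoint("field", random.nextInt(cardinality)));
        writer.addDocument(doc);
      }
      writer.forceMerge(1);
      try (DirectoryReader reader = DirectoryReader.open(writer)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        long[] terms = new long[termCount];
        for (int i = 0; i < termCount; i++) {
          terms[i] = random.nextInt(cardinality);
        }
        Query query = LongPoint.newSetQuery("field", terms); // a PointInSetQuery
        long start = System.nanoTime();
        int hits = searcher.count(query);
        System.out.printf("term count: %d, hits: %d, took: %d ms%n",
            termCount, hits, (System.nanoTime() - start) / 1_000_000);
      }
    }
  }
}
{code}

Sweeping the cardinality and term count over the values in the results above reproduces the shape of the experiment; a serious run would also warm the JVM and repeat each measurement.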
[GitHub] [lucene] gf2121 opened a new pull request #530: LUCENE-10297: Speed up medium cardinality fields with readLELongs and SIMD
gf2121 opened a new pull request #530: URL: https://github.com/apache/lucene/pull/530

see https://issues.apache.org/jira/browse/LUCENE-10297
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields. Maybe this is because the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds? I think medium cardinality fields are tempting targets for this optimization. The basic idea is that we take the deltas of the sorted ids and encode them compactly.

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields stayed about even.
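The one-line statement of the basic idea above, taking deltas of the sorted ids, can be made concrete with a minimal sketch under stated assumptions: delta-encode a sorted doc ID block, store the deltas little-endian, and decode a whole 64-bit long per iteration in a simple loop the JIT can auto-vectorize, which is where SIMD enters. The class and method names are hypothetical and do not reflect Lucene's actual DocIdsWriter.

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative delta + little-endian packing for a sorted block of doc IDs. */
public final class SortedDocIdCodec {

  /** Encodes sorted doc IDs as 32-bit deltas, packed little-endian. */
  static byte[] encode(int[] sortedDocIds) {
    ByteBuffer buf =
        ByteBuffer.allocate(sortedDocIds.length * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    int prev = 0;
    for (int docId : sortedDocIds) {
      buf.putInt(docId - prev); // deltas stay small for medium cardinality runs
      prev = docId;
    }
    return buf.array();
  }

  /** Decodes two deltas per little-endian long read; the loop is auto-vectorization friendly. */
  static int[] decode(byte[] block, int count) {
    ByteBuffer buf = ByteBuffer.wrap(block).order(ByteOrder.LITTLE_ENDIAN);
    int[] docIds = new int[count];
    int prev = 0;
    for (int i = 0; i + 1 < count; i += 2) {
      long packed = buf.getLong(); // one long carries two 32-bit deltas
      prev += (int) packed;
      docIds[i] = prev;
      prev += (int) (packed >>> 32);
      docIds[i + 1] = prev;
    }
    if ((count & 1) == 1) {
      docIds[count - 1] = prev + buf.getInt(); // odd-length tail
    }
    return docIds;
  }
}
{code}

A real codec would bit-pack the deltas to the minimum width rather than a fixed 32 bits; the fixed width is used here only to keep the decode loop obvious.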
[GitHub] [lucene-solr] janhoy opened a new pull request #2625: Added bulkclose feature to the githubPRs script
janhoy opened a new pull request #2625: URL: https://github.com/apache/lucene-solr/pull/2625

Example use:

```bash
./githubPRs.py \
  --bulkclose "Lucene and Solr development has moved to separate git repositories and this PR is being bulk-closed. Please open a new PR against https://github.com/apache/solr or https://github.com/apache/lucene if your contribution is still relevant to the project." \
  --token X
```

The result of such an action can be seen in #1364, which I used for testing. You can then easily query GitHub for a list of the `stale-closed` PRs: https://github.com/apache/lucene-solr/pulls?q=label%3Astale-closed+is%3Aclosed
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields. Maybe this is because the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds? I think medium cardinality fields are tempting targets for this optimization.

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields stayed about even.
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds. Medium cardinality fields may be tempting targets for this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got about even results.
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds. Medium cardinality fields may be tempting targets for this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got about even results.
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds. Medium cardinality fields may be tempting targets for this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got about even results.

|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1.00|32.00|1.00|29.00|18.00|-37.93%|
|1.00|32.00|2.00|40.00|16.00|-60.00%|
|1.00|32.00|4.00|74.00|26.00|-64.86%|
|1.00|32.00|8.00|144.00|46.00|-68.06%|
|1.00|32.00|16.00|284.00|88.00|-69.01%|
|1.00|128.00|1.00|20.00|12.00|-40.00%|
|1.00|128.00|8.00|70.00|22.00|-68.57%|
|1.00|128.00|16.00|127.00|29.00|-77.17%|
|1.00|128.00|32.00|251.00|50.00|-80.08%|
|1.00|128.00|64.00|576.00|93.00|-83.85%|
|1.00|8192.00|1.00|2.00|2.00|0.00%|
|1.00|8192.00|16.00|11.00|9.00|-18.18%|
|1.00|8192.00|64.00|18.00|13.00|-27.78%|
|1.00|8192.00|512.00|88.00|42.00|-52.27%|
|1.00|8192.00|2048.00|266.00|129.00|-51.50%|
|1.00|1048576.00|1.00|3.00|2.00|-33.33%|
|1.00|1048576.00|16.00|11.00|9.00|-18.18%|
|1.00|1048576.00|64.00|8.00|9.00|12.50%|
|1.00|1048576.00|512.00|33.00|32.00|-3.03%|
|1.00|1048576.00|2048.00|97.00|93.00|-4.12%|
|1.00|8388608.00|1.00|4.00|2.00|-50.00%|
|1.00|8388608.00|16.00|20.00|21.00|5.00%|
|1.00|8388608.00|64.00|31.00|38.00|22.58%|
|1.00|8388608.00|512.00|70.00|73.00|4.29%|
|1.00|8388608.00|2048.00|209.00|204.00|-2.39%|
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds. Medium cardinality fields may be tempting targets for this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got about even results.

|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|29|18|-37.93%|
|1|32|2|40|16|-60.00%|
|1|32|4|74|26|-64.86%|
|1|32|8|144|46|-68.06%|
|1|32|16|284|88|-69.01%|
|1|128|1|20|12|-40.00%|
|1|128|8|70|22|-68.57%|
|1|128|16|127|29|-77.17%|
|1|128|32|251|50|-80.08%|
|1|128|64|576|93|-83.85%|
|1|8192|1|2|2|0.00%|
|1|8192|16|11|9|-18.18%|
|1|8192|64|18|13|-27.78%|
|1|8192|512|88|42|-52.27%|
|1|8192|2048|266|129|-51.50%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|11|9|-18.18%|
|1|1048576|64|8|9|12.50%|
|1|1048576|512|33|32|-3.03%|
|1|1048576|2048|97|93|-4.12%|
|1|8388608|1|4|2|-50.00%|
|1|8388608|16|20|21|5.00%|
|1|8388608|64|31|38|22.58%|
|1|8388608|512|70|73|4.29%|
|1|8388608|2048|209|204|-2.39%|
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds. Medium cardinality fields may be tempting targets for this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got about even results.

|doc|cardinality|query term|baseline (ms)|candidate (ms)|diff|
|1|32|1|29|18|-37.93%|
|1|32|2|40|16|-60.00%|
|1|32|4|74|26|-64.86%|
|1|32|8|144|46|-68.06%|
|1|32|16|284|88|-69.01%|
|1|128|1|20|12|-40.00%|
|1|128|8|70|22|-68.57%|
|1|128|16|127|29|-77.17%|
|1|128|32|251|50|-80.08%|
|1|128|64|576|93|-83.85%|
|1|8192|1|2|2|0.00%|
|1|8192|16|11|9|-18.18%|
|1|8192|64|18|13|-27.78%|
|1|8192|512|88|42|-52.27%|
|1|8192|2048|266|129|-51.50%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|11|9|-18.18%|
|1|1048576|64|8|9|12.50%|
|1|1048576|512|33|32|-3.03%|
|1|1048576|2048|97|93|-4.12%|
|1|8388608|1|4|2|-50.00%|
|1|8388608|16|20|21|5.00%|
|1|8388608|64|31|38|22.58%|
|1|8388608|512|70|73|4.29%|
|1|8388608|2048|209|204|-2.39%|
[GitHub] [lucene-solr] rmuir commented on pull request #2625: Added bulkclose feature to the githubPRs script
rmuir commented on pull request #2625: URL: https://github.com/apache/lucene-solr/pull/2625#issuecomment-988751134

-1 to adding bulk close functionality
[GitHub] [lucene-solr] rmuir commented on pull request #2625: Added bulkclose feature to the githubPRs script
rmuir commented on pull request #2625: URL: https://github.com/apache/lucene-solr/pull/2625#issuecomment-988760471

That's an actual veto. Justification: read the fucking mailing list thread, and see how @janhoy tried to "slip this in" under the pretense of a +1. Several of us are against bulk-closing on the thread; that is more people than are for it. Consensus is not with you. Passive-aggressive shit like this doesn't help.
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
Though we already have a bitset optimization for low cardinality fields, the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds. Medium cardinality fields may be tempting targets for this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got about even results.

|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
We already have a bitset optimization for low cardinality fields, but the optimization usually only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds. Medium cardinality fields may be tempting targets for this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got about even results.
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:

Description:
We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (cardinality < 16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we were trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues rather than readDocIds. Medium cardinality fields may be tempting targets for this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got about even results.
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremly low cardinality fields (doc count > 1/16 total doc), for medium cardinality case like 30/100 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get a obvious gain on this approach. I think this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues but not readDocIds. Medium cardinality fields may be tempted for this optimization :) I benchmarked the optimization by mocking some random longPoint and querying them with PointInSetQuery. As expected, the medium cardinality fields got spped up and high cardinality fields get even results. |doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff| |1|32|1|19|16|-15.79%| |1|32|2|34|14|-58.82%| |1|32|4|76|22|-71.05%| |1|32|8|139|42|-69.78%| |1|32|16|279|82|-70.61%| |1|128|1|17|11|-35.29%| |1|128|8|75|23|-69.33%| |1|128|16|126|25|-80.16%| |1|128|32|245|50|-79.59%| |1|128|64|528|97|-81.63%| |1|8192|1|3|3|0.00%| |1|8192|16|18|15|-16.67%| |1|8192|64|19|14|-26.32%| |1|8192|512|69|43|-37.68%| |1|8192|2048|236|134|-43.22%| |1|1048576|1|3|2|-33.33%| |1|1048576|16|18|19|5.56%| |1|1048576|64|17|17|0.00%| |1|1048576|512|34|32|-5.88%| |1|1048576|2048|89|93|4.49%| |1|8388608|1|4|3|-25.00%| |1|8388608|16|24|21|-12.50%| |1|8388608|64|46|45|-2.17%| |1|8388608|512|121|127|4.96%| |1|8388608|2048|193|207|7.25%| was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremly low cardinality fields (cardinality < 16), for medium cardinality case like 30, 100 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get a obvious gain on this approach. I think this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues but not readDocIds. Medium cardinality fields may be tempted for this optimization :) I benchmarked the optimization by mocking some random longPoint and querying them with PointInSetQuery. As expected, the medium cardinality fields got spped up and high cardinality fields get even results. 
[GitHub] [lucene-solr] janhoy closed pull request #182: SOLR-10415 - improve debug logging to use parameterized logging
janhoy closed pull request #182:
URL: https://github.com/apache/lucene-solr/pull/182
[GitHub] [lucene-solr] janhoy closed pull request #185: SOLR-10487: Support to specify connection and socket read timeout in DataImportHandler for SolrEntityProcessor.
janhoy closed pull request #185:
URL: https://github.com/apache/lucene-solr/pull/185
[GitHub] [lucene-solr] janhoy closed pull request #690: SOLR-13517: [ UX improvement ] Dashboard will now store query and filter parameters on page change a…
janhoy closed pull request #690:
URL: https://github.com/apache/lucene-solr/pull/690
[GitHub] [lucene-solr] janhoy closed pull request #740: SOLR-12550 - distribUpdateSoTimeout for configuring socket timeouts in solrcloud doesn't take effect for updates.
janhoy closed pull request #740:
URL: https://github.com/apache/lucene-solr/pull/740
[GitHub] [lucene] dweiss commented on a change in pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries
dweiss commented on a change in pull request #521:
URL: https://github.com/apache/lucene/pull/521#discussion_r764938074

File path: lucene/queries/src/java/org/apache/lucene/queries/intervals/Intervals.java

@@ -275,7 +275,10 @@ public static IntervalsSource ordered(IntervalsSource... subSources) {
   }

   /**
-   * Create an unordered {@link IntervalsSource}
+   * Create an unordered {@link IntervalsSource}. Note that if there are multiple intervals ends at

Review comment: There is no overlap indeed, one interval is 'a b', the other 'c d' (the smallest possible variant). This is tricky, I agree, but it does not negate the utility of the entire concept.
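For readers following along, here is a minimal sketch of the case under discussion: two non-overlapping sub-intervals ('a b' and 'c d') combined with Intervals.unordered. The field name "body" is an illustrative assumption, not from the PR.

{code:java}
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.search.Query;

final class UnorderedIntervalsExample {
  // The smallest variant from the discussion: "a b" and "c d" as two
  // phrase sub-sources that can match side by side without overlapping.
  static Query build() {
    IntervalsSource ab = Intervals.phrase("a", "b");
    IntervalsSource cd = Intervals.phrase("c", "d");
    // Unordered: both sub-intervals must occur, in any order.
    return new IntervalQuery("body", Intervals.unordered(ab, cd));
  }
}
{code}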
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:
------------------------------
Description:

We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (doc count > 1/16 of total docs); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues but not readDocIds. Medium cardinality fields may be tempting for this optimization :)

I benchmarked the optimization by mocking some random longPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.

|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-988928174

Thanks for the nudge, @fmmoret. I think if introducing this change, we should really avoid [needlessly building and throwing away](https://github.com/apache/lucene/pull/380#discussion_r750515187) the stringified dictionary. @spyk, is this something you'd be interested in pursuing (i.e., pushing a new commit to your PR branch)? Let me know if not and I'll try (or Alessandro, per his earlier comment?) to move it along.

> Ideally, opennlp would have a DictionaryLemmatizer ctor that accepts a Reader directly -- I can't imagine that would be a controversial upstream PR?

I don't think concerns over the default character encoding issue should hold things up. We're not making anything worse wrt the default encoding assumption; a simple `TODO` comment should suffice. I think we should circle back (I should be able to find the time for this if nobody else steps forward) to actually address such a `TODO` as a separate issue/PR, following something like the `InputStreamReader` approach I mentioned above (trusting someone will contradict me if they disagree with this proposed approach!).
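For reference, the kind of change being proposed would read the dictionary through a Reader with an explicit charset instead of the platform default. A minimal sketch, assuming UTF-8 and a hypothetical loadDictionary helper (opennlp's DictionaryLemmatizer itself currently takes an InputStream, not a Reader):

{code:java}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class LemmaDictionaryReaders {
  // Hypothetical helper for illustration: wrap the raw dictionary stream in
  // a Reader with an explicit charset, so parsing no longer depends on the
  // JVM's default file.encoding.
  static BufferedReader loadDictionary(InputStream dictionary) {
    return new BufferedReader(new InputStreamReader(dictionary, StandardCharsets.UTF_8));
  }
}
{code}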
[GitHub] [lucene-solr] mikemccand commented on pull request #1064: LUCENE-9084: circular synchronization wait (potential deadlock) in AnalyzingInfixSuggester
mikemccand commented on pull request #1064:
URL: https://github.com/apache/lucene-solr/pull/1064#issuecomment-988973336

It looks like this one was indeed merged -- closing the PR. Thank you @paulward24!
[GitHub] [lucene-solr] mikemccand closed pull request #1064: LUCENE-9084: circular synchronization wait (potential deadlock) in AnalyzingInfixSuggester
mikemccand closed pull request #1064:
URL: https://github.com/apache/lucene-solr/pull/1064
[GitHub] [lucene-solr] mikemccand commented on pull request #906: LUCENE-8996: maxScore is sometimes missing from distributed responses
mikemccand commented on pull request #906:
URL: https://github.com/apache/lucene-solr/pull/906#issuecomment-988978401

Hmm, I see this [src fix was committed, but the new unit test was not committed](https://github.com/apache/lucene/commit/49631ace9f1ee110d52a207377e4926baef74929) -- was that intentional?
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10297:
------------------------------
Description:

We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (doc count > 1/16 of total docs); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually visitDocValues but not readDocIds. Maybe medium cardinality fields are tempting for this optimization. The basic idea is to compute the deltas of the sorted ids and encode/decode them like what we do in StoredFieldsInts.

I benchmarked the optimization by mocking some random longPoint fields and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
[jira] [Updated] (LUCENE-10259) Luke does not start with whitespace in unzipped directory.
[ https://issues.apache.org/jira/browse/LUCENE-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-10259:
----------------------------------
Fix Version/s: (was: 9.x)

> Luke does not start with whitespace in unzipped directory.
> ----------------------------------------------------------
>
> Key: LUCENE-10259
> URL: https://issues.apache.org/jira/browse/LUCENE-10259
> Project: Lucene - Core
> Issue Type: Bug
> Components: luke
> Affects Versions: 9.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Priority: Blocker
> Fix For: 9.0, 10.0 (main)
>
> Attachments: screenshot-1.png
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> When you start Luke on Windows, nothing happens. No error message, nothing. This happens for users that have whitespace in their username ("Uwe Schindler") when they unzip the tgz file to the desktop.
> This also affects the Linux shell script, though it is more unlikely there.
> The fix is easy: add quotes around the module-path in both shell scripts.
> I think we should respin.
[jira] [Updated] (LUCENE-10287) Add jdk.unsupported module to Luke startup script
[ https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-10287:
----------------------------------
Fix Version/s: (was: 9.x)

> Add jdk.unsupported module to Luke startup script
> -------------------------------------------------
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
> Issue Type: Bug
> Components: luke
> Affects Versions: 9.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Priority: Major
> Fix For: 9.1, 10.0 (main)
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> See my note on the JDK 9.0 release: When you start Luke (in module mode, as done by default), it won't use MMapDirectory when opening indexes. The reason is simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.
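Concretely, the change described here adds the module to the java invocation in the launch script. A sketch along these lines, where the module-path variable and Luke module name are placeholders rather than the real script contents:

{code}
java --add-modules jdk.unsupported --module-path "$MODULE_PATH" --module $LUKE_MODULE
{code}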
[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455386#comment-17455386 ]

Julie Tibshirani commented on LUCENE-10040:
-------------------------------------------
Thanks for posting, I found Weaviate's blog helpful as I was thinking through this issue!

> Handle deletions in nearest vector search
> -----------------------------------------
>
> Key: LUCENE-10040
> URL: https://issues.apache.org/jira/browse/LUCENE-10040
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Assignee: Julie Tibshirani
> Priority: Major
> Time Spent: 5.5h
> Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even if a document is not in {{LeafReader#getLiveDocs}}, it could still be returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be surprising and difficult for users, since other search APIs account for deleted docs. We've discussed extending the search logic to take a parameter like {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. This behavior seems hard to work with as a user: fewer than {{k}} docs might come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs while assembling its candidate list. It would traverse further into the graph, visiting more nodes to ensure it gathers the required candidates. (Note deleted docs would still be visited/traversed.) The [hnswlib library|https://github.com/nmslib/hnswlib] contains an implementation like this, where you can mark documents as deleted and they're skipped during search.
> This approach seems reasonable to me, but there are some challenges:
> * Performance can be unpredictable. If deletions are random, it shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs, and they all happen to be near the query vector. HNSW would need to traverse through around half the entire graph to collect neighbors.
> * As far as I know, there hasn't been academic research or any testing into how well this performs in terms of recall. I have a vague intuition it could be harder to achieve high recall as the algorithm traverses areas further from the "natural" entry points. The HNSW paper doesn't mention deletions/filtering, and I haven't seen community benchmarks around it.
> Background links:
> * Thoughts on deletions from the author of the HNSW paper: [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
> * Blog from the Vespa team which mentions combining KNN and search filters (very similar to applying deleted docs): [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. The "Exact vs Approximate" section shows good performance even when a large percentage of documents are filtered out. The team mentioned to me they didn't have the chance to measure recall, only latency.
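A toy sketch of the first option the issue describes (post-filtering), using real Lucene types but a deliberately simplified flow; as the issue points out, this can leave you with fewer than k hits:

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Bits;

final class PostFilterKnn {
  // Drop deleted docs from already-computed kNN hits. Because filtering
  // happens after the search, the result may contain fewer than k docs,
  // which is exactly the usability problem described above.
  static List<ScoreDoc> withoutDeleted(LeafReader reader, TopDocs hits) {
    Bits liveDocs = reader.getLiveDocs(); // null means the segment has no deletions
    List<ScoreDoc> kept = new ArrayList<>();
    for (ScoreDoc hit : hits.scoreDocs) {
      if (liveDocs == null || liveDocs.get(hit.doc)) {
        kept.add(hit);
      }
    }
    return kept;
  }
}
{code}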
[jira] [Resolved] (LUCENE-10114) Remove unused byte order mark in Lucene90PostingsWriter
[ https://issues.apache.org/jira/browse/LUCENE-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-10114.
-----------------------------------
Resolution: Fixed

> Remove unused byte order mark in Lucene90PostingsWriter
> -------------------------------------------------------
>
> Key: LUCENE-10114
> URL: https://issues.apache.org/jira/browse/LUCENE-10114
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/index
> Affects Versions: 9.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Priority: Major
> Fix For: 9.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> While reviewing the byte order in the Lucene index, I found the following code in {{Lucene90PostingsWriter}}:
> {code:java}
> ByteOrder byteOrder = ByteOrder.nativeOrder();
> if (byteOrder == ByteOrder.BIG_ENDIAN) {
>   docOut.writeByte((byte) 'B');
> } else if (byteOrder == ByteOrder.LITTLE_ENDIAN) {
>   docOut.writeByte((byte) 'L');
> } else {
>   throw new Error();
> }
> {code}
> Actually this byte is consumed nowhere, as the file is only used via seeking and the offsets are just 1 larger. We should remove this code.
> Why was this added?
[jira] [Closed] (LUCENE-10114) Remove unused byte order mark in Lucene90PostingsWriter
[ https://issues.apache.org/jira/browse/LUCENE-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand closed LUCENE-10114.
---------------------------------
[jira] [Closed] (LUCENE-9484) Allow index sorting to happen after the fact
[ https://issues.apache.org/jira/browse/LUCENE-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand closed LUCENE-9484.
--------------------------------

> Allow index sorting to happen after the fact
> ---------------------------------------------
>
> Key: LUCENE-9484
> URL: https://issues.apache.org/jira/browse/LUCENE-9484
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Simon Willnauer
> Priority: Major
> Fix For: 9.0
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> I did look into sorting an index after it was created and found that with some smallish modifications we can actually allow that by piggybacking on SortingLeafReader and addIndexes in a pretty straightforward and simple way. With some smallish modifications/fixes to SortingLeafReader we can just merge an unsorted index into a sorted index using a fresh index writer.
[jira] [Resolved] (LUCENE-9484) Allow index sorting to happen after the fact
[ https://issues.apache.org/jira/browse/LUCENE-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-9484.
----------------------------------
Resolution: Fixed
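A minimal sketch, under assumptions, of the approach this issue describes: wrap each leaf of an existing unsorted index in a sorting view and feed it to addIndexes on a writer configured with the target index sort. It assumes SortingCodecReader, the CodecReader-based relative of the SortingLeafReader mentioned above; the directories and the "timestamp" sort field are illustrative, not from the issue.

{code:java}
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SlowCodecReaderWrapper;
import org.apache.lucene.index.SortingCodecReader;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;

final class SortIndexAfterTheFact {
  static void resort(Directory src, Directory dest) throws Exception {
    Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG));
    IndexWriterConfig cfg = new IndexWriterConfig().setIndexSort(sort);
    try (DirectoryReader reader = DirectoryReader.open(src);
        IndexWriter writer = new IndexWriter(dest, cfg)) {
      for (LeafReaderContext ctx : reader.leaves()) {
        // Present each unsorted leaf as a sorted CodecReader and merge it in.
        writer.addIndexes(
            SortingCodecReader.wrap(SlowCodecReaderWrapper.wrap(ctx.reader()), sort));
      }
      writer.forceMerge(1); // optional: collapse into a single sorted segment
    }
  }
}
{code}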
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremly low cardinality fields (doc count > 1/16 total doc), for medium cardinality case like 32/128 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get a obvious gain on this approach. Maybe this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually {{visitDocValues}} but not {{readDocIds}} ? Maybe medium cardinality fields are tempted for this optimization, The basic idea is that compute the delta of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random longPoint and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got spped up and high cardinality fields get even results. *Benchmark Result* |doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage| |1|32|1|19|16|-15.79%| |1|32|2|34|14|-58.82%| |1|32|4|76|22|-71.05%| |1|32|8|139|42|-69.78%| |1|32|16|279|82|-70.61%| |1|128|1|17|11|-35.29%| |1|128|8|75|23|-69.33%| |1|128|16|126|25|-80.16%| |1|128|32|245|50|-79.59%| |1|128|64|528|97|-81.63%| |1|1024|1|3|2|-33.33%| |1|1024|8|13|8|-38.46%| |1|1024|32|31|19|-38.71%| |1|1024|128|120|67|-44.17%| |1|1024|512|480|133|-72.29%| |1|8192|1|3|3|0.00%| |1|8192|16|18|15|-16.67%| |1|8192|64|19|14|-26.32%| |1|8192|512|69|43|-37.68%| |1|8192|2048|236|134|-43.22%| |1|1048576|1|3|2|-33.33%| |1|1048576|16|18|19|5.56%| |1|1048576|64|17|17|0.00%| |1|1048576|512|34|32|-5.88%| |1|1048576|2048|89|93|4.49%| was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremly low cardinality fields (doc count > 1/16 total doc), for medium cardinality case like 32/128 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get a obvious gain on this approach. I think this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually {{visitDocValues}} but not {{readDocIds}}. Maybe medium cardinality fields are tempted for this optimization, The basic idea is that compute the delta of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random longPoint and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got spped up and high cardinality fields get even results. 
*Benchmark Result* |doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage| |1|32|1|19|16|-15.79%| |1|32|2|34|14|-58.82%| |1|32|4|76|22|-71.05%| |1|32|8|139|42|-69.78%| |1|32|16|279|82|-70.61%| |1|128|1|17|11|-35.29%| |1|128|8|75|23|-69.33%| |1|128|16|126|25|-80.16%| |1|128|32|245|50|-79.59%| |1|128|64|528|97|-81.63%| |1|1024|1|3|2|-33.33%| |1|1024|8|13|8|-38.46%| |1|1024|32|31|19|-38.71%| |1|1024|128|120|67|-44.17%| |1|1024|512|480|133|-72.29%| |1|8192|1|3|3|0.00%| |1|8192|16|18|15|-16.67%| |1|8192|64|19|14|-26.32%| |1|8192|512|69|43|-37.68%| |1|8192|2048|236|134|-43.22%| |1|1048576|1|3|2|-33.33%| |1|1048576|16|18|19|5.56%| |1|1048576|64|17|17|0.00%| |1|1048576|512|34|32|-5.88%| |1|1048576|2048|89|93|4.49%| > Speed up medium cardinality fields with readLELongs and SIMD > > > Key: LUCENE-10297 > URL: https://issues.apache.org/jira/browse/LUCENE-10297 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > We already have a bitset optimization for low cardinality fields, but the > optimization only works on extremly low cardinality fields (doc count > 1/16 > total doc), for medium cardinality case like 32/128 can rarely get this > optimization. > In [https://github.com/apache/lucen
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremly low cardinality fields (doc count > 1/16 total doc), for medium cardinality case like 32/128 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get a obvious gain on this approach. Maybe this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is {{visitDocValues}} but not {{readDocIds}} ? But i think medium cardinality fields may be tempted for this optimization. The basic idea is that we can compute the delta of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random longPoint and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got spped up and high cardinality fields get even results. *Benchmark Result* |doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage| |1|32|1|19|16|-15.79%| |1|32|2|34|14|-58.82%| |1|32|4|76|22|-71.05%| |1|32|8|139|42|-69.78%| |1|32|16|279|82|-70.61%| |1|128|1|17|11|-35.29%| |1|128|8|75|23|-69.33%| |1|128|16|126|25|-80.16%| |1|128|32|245|50|-79.59%| |1|128|64|528|97|-81.63%| |1|1024|1|3|2|-33.33%| |1|1024|8|13|8|-38.46%| |1|1024|32|31|19|-38.71%| |1|1024|128|120|67|-44.17%| |1|1024|512|480|133|-72.29%| |1|8192|1|3|3|0.00%| |1|8192|16|18|15|-16.67%| |1|8192|64|19|14|-26.32%| |1|8192|512|69|43|-37.68%| |1|8192|2048|236|134|-43.22%| |1|1048576|1|3|2|-33.33%| |1|1048576|16|18|19|5.56%| |1|1048576|64|17|17|0.00%| |1|1048576|512|34|32|-5.88%| |1|1048576|2048|89|93|4.49%| was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremly low cardinality fields (doc count > 1/16 total doc), for medium cardinality case like 32/128 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get a obvious gain on this approach. Maybe this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is usually {{visitDocValues}} but not {{readDocIds}} ? Maybe medium cardinality fields are tempted for this optimization, The basic idea is that compute the delta of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random longPoint and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got spped up and high cardinality fields get even results. 
*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but it only works on extremely low cardinality fields (doc count > 1/16 of total docs); medium cardinality cases like 32/128 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields), and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? IMO medium cardinality fields may be good candidates for this optimization because they need to read lots of ids. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
was: We already have a bitset optimization for low cardinality fields, but it only works on extremely low cardinality fields (doc count > 1/16 of total docs); medium cardinality cases like 32/128 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation, which typically happens for high cardinality fields, and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? But I think medium cardinality fields may be good candidates for this optimization. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but it only works on extremely low cardinality fields (doc count > 1/16 of total docs); medium cardinality cases like 32/128 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields), and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? IMO medium cardinality fields may be good candidates for this optimization because they need to read lots of ids for one term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
was: We already have a bitset optimization for low cardinality fields, but it only works on extremely low cardinality fields (doc count > 1/16 of total docs); medium cardinality cases like 32/128 can rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields), and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? IMO medium cardinality fields may be good candidates for this optimization because they need to read lots of ids. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
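The delta-encoding idea in the description above is compact enough to sketch. The following is a minimal, self-contained illustration of delta-encoding sorted doc ids and restoring them with a prefix sum; it is not Lucene's actual {{StoredFieldsInts}} code, and the class and method names are made up.

{code:java}
import java.util.Arrays;

// Illustrative sketch of the LUCENE-10297 idea: sorted doc ids become small,
// non-negative deltas, which compress well and decode with a prefix sum.
public class SortedDocIdDeltas {

  // Encode: keep the first id, then store gaps between consecutive ids.
  static int[] encode(int[] sortedIds) {
    int[] deltas = new int[sortedIds.length];
    deltas[0] = sortedIds[0];
    for (int i = 1; i < sortedIds.length; i++) {
      deltas[i] = sortedIds[i] - sortedIds[i - 1];
    }
    return deltas;
  }

  // Decode: a prefix sum restores the original ids. This tight int loop is
  // the kind of code that auto-vectorizes, which is the SIMD angle above.
  static int[] decode(int[] deltas) {
    int[] ids = new int[deltas.length];
    int acc = 0;
    for (int i = 0; i < deltas.length; i++) {
      acc += deltas[i];
      ids[i] = acc;
    }
    return ids;
  }

  public static void main(String[] args) {
    int[] ids = {3, 7, 8, 15, 16, 42};
    // prints [3, 7, 8, 15, 16, 42]: encode/decode round-trips
    System.out.println(Arrays.toString(decode(encode(ids))));
  }
}
{code}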
[jira] [Created] (LUCENE-10298) dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless
Adrien Grand created LUCENE-10298: - Summary: dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless Key: LUCENE-10298 URL: https://issues.apache.org/jira/browse/LUCENE-10298 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand addBackcompatIndexes.py expects that lists of index names have one entry per line, e.g.
{code}
static final String[] oldNames = {
  ""
}
{code}
However, when the array is small, Spotless forces the array to be written on a single line, and addBackcompatIndexes.py no longer recognizes the structure of the file. It's probably fixable, but my Python skills are not good enough. Or maybe this file should be one of the rare ones that we exclude from Spotless? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gf2121 commented on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently
gf2121 commented on pull request #510: URL: https://github.com/apache/lucene/pull/510#issuecomment-989104054 @iverase Thanks for your explanation! > I worked on the PR about using #readLELongs but never get a meaningful speed up that justify the added complexity. I find that we were trying to use #readLELongs to speed up the 24/32 bit situation in the `DocIdsWriter`, which means the ids in the block are unsorted, typically happening in high cardinality fields. I think queries on high cardinality fields spend most of their time on `visitDocValues` rather than `readDocIds`, so maybe this is the reason why we cannot see an obvious gain on the E2E side? My current thoughts are about using readLELongs to speed up the **sorted** ids situation (meaning low or medium cardinality fields), whose bottleneck is reading docIds. For sorted arrays, we can compute the deltas of the sorted ids and encode/decode them like what we do in `StoredFieldsInts`. I raised an [ISSUE](https://issues.apache.org/jira/browse/LUCENE-10297) based on this idea. The benchmark result I posted in the issue looks promising. Would you like to help take a look when you are free? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10298) dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless
[ https://issues.apache.org/jira/browse/LUCENE-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455935#comment-17455935 ] Dawid Weiss commented on LUCENE-10298: -- I wouldn't make such exceptions. They're hard to maintain... A better solution would be to read this list from a resource instead. A hacky way would be to force the line break with a // comment after the bracket:
{code}
static final String[] oldNames = { // auto-updated list starts here
  ""
  // list ends here.
}
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10040) Handle deletions in nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-10040. --- Fix Version/s: 9.0 Resolution: Fixed > Handle deletions in nearest vector search > - > > Key: LUCENE-10040 > URL: https://issues.apache.org/jira/browse/LUCENE-10040 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Assignee: Julie Tibshirani >Priority: Major > Fix For: 9.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > Currently nearest vector search doesn't account for deleted documents. Even > if a document is not in {{LeafReader#getLiveDocs}}, it could still be > returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be > surprising + difficult for users, since other search APIs account for deleted > docs. We've discussed extending the search logic to take a parameter like > {{Bits liveDocs}}. This issue discusses options around adding support. > One approach is to just filter out deleted docs after running the KNN search. > This behavior seems hard to work with as a user: fewer than {{k}} docs might > come back from your KNN search! > Alternatively, {{LeafReader#searchNearestVectors}} could always return the > {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs > while assembling its candidate list. It would traverse further into the > graph, visiting more nodes to ensure it gathers the required candidates. > (Note deleted docs would still be visited/ traversed). The [hnswlib > library|https://github.com/nmslib/hnswlib] contains an implementation like > this, where you can mark documents as deleted and they're skipped during > search. > This approach seems reasonable to me, but there are some challenges: > * Performance can be unpredictable. If deletions are random, it shouldn't > have a huge effect. But in the worst case, a segment could have 50% deleted > docs, and they all happen to be near the query vector. HNSW would need to > traverse through around half the entire graph to collect neighbors. > * As far as I know, there hasn't been academic research or any testing into > how well this performs in terms of recall. I have a vague intuition it could > be harder to achieve high recall as the algorithm traverses areas further > from the "natural" entry points. The HNSW paper doesn't mention deletions/ > filtering, and I haven't seen community benchmarks around it. > Background links: > * Thoughts on deletions from the author of the HNSW paper: > [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892] > * Blog from Vespa team which mentions combining KNN and search filters (very > similar to applying deleted docs): > [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. > The "Exact vs Approximate" section shows good performance even when a large > percentage of documents are filtered out. The team mentioned to me they > didn't have the chance to measure recall, only latency. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
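The second option described above (return the k nearest undeleted docs while still traversing deleted nodes) can be sketched in a few lines. Only {{org.apache.lucene.util.Bits}} below is a real Lucene type; {{LiveTopK}}, {{Candidate}}, and {{offer()}} are hypothetical names, not the actual HNSW searcher.

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.lucene.util.Bits;

// Sketch: deleted docs are still visited (their neighbors get expanded), but
// only live docs enter the result heap, so the search keeps going until k
// live neighbors are gathered.
final class LiveTopK {
  record Candidate(int doc, float score) {}

  private final int k;
  private final PriorityQueue<Candidate> heap =
      new PriorityQueue<>(Comparator.comparingDouble(Candidate::score)); // min-heap by score

  LiveTopK(int k) {
    this.k = k;
  }

  /** Offer a visited node; liveDocs == null means the segment has no deletions. */
  void offer(int doc, float score, Bits liveDocs) {
    if (liveDocs != null && !liveDocs.get(doc)) {
      return; // deleted: traversed for connectivity, never returned
    }
    heap.offer(new Candidate(doc, score));
    if (heap.size() > k) {
      heap.poll(); // evict the current worst live candidate
    }
  }

  boolean needsMore() {
    return heap.size() < k;
  }
}
{code}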
[GitHub] [lucene] jtibshirani merged pull request #527: LUCENE-10040: Add test for vector search with skewed deletions
jtibshirani merged pull request #527: URL: https://github.com/apache/lucene/pull/527 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455944#comment-17455944 ] ASF subversion and git services commented on LUCENE-10040: -- Commit 5d39bca87a44f51e5d556bb0a7e8c28df3f539fa in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5d39bca ] LUCENE-10040: Add test for vector search with skewed deletions (#527) This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455949#comment-17455949 ] ASF subversion and git services commented on LUCENE-10040: -- Commit 394472d4b8e40504f0521df340df446089a7afff in lucene's branch refs/heads/branch_9x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=394472d ] LUCENE-10040: Add test for vector search with skewed deletions (#527) This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10298) dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless
[ https://issues.apache.org/jira/browse/LUCENE-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455952#comment-17455952 ] Uwe Schindler commented on LUCENE-10298: Or maybe write the index names to a simple properties file that can be updated with plain stupid Java or Python, and load it as a resource? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
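For illustration, that suggestion could look roughly like the sketch below; the file name old-index-names.properties, the names key, and the OldIndexNames class are all hypothetical, and the Python script would then only ever append plain key=value text.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch: back-compat index names live in a properties file on
// the classpath, so Spotless never reformats them and the Python script can
// edit plain text instead of Java source.
final class OldIndexNames {
  static String[] load() throws IOException {
    Properties props = new Properties();
    // assumes old-index-names.properties sits next to this class on the classpath
    try (InputStream in = OldIndexNames.class.getResourceAsStream("old-index-names.properties")) {
      props.load(in);
    }
    // e.g. names=8.0.0-cfs,8.0.0-nocfs,8.1.0-cfs
    return props.getProperty("names", "").split(",");
  }
}
{code}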
[GitHub] [lucene-solr] thelabdude opened a new pull request #2626: SOLR-15832: Clean-up after publish action in Schema Designer shouldn't fail if .system collection doesn't exist
thelabdude opened a new pull request #2626: URL: https://github.com/apache/lucene-solr/pull/2626 backport of https://github.com/apache/solr/pull/451 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10274) Implement "hyperrectangle" faceting
[ https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455990#comment-17455990 ] Greg Miller commented on LUCENE-10274: -- I was thinking that this would work over the same doc values indexed when creating "Point" fields (e.g., LongPoint), which is a binary field encoding all N dimensions into a single byte entry. So the faceting logic would inspect a single binary field encoding the N dimensions, testing whether or not it's contained in each hyperrectangle of interest. > Implement "hyperrectangle" faceting > --- > > Key: LUCENE-10274 > URL: https://issues.apache.org/jira/browse/LUCENE-10274 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > I'd be interested in expanding Lucene's faceting capabilities to aggregate a > point field against a set of user-provided n-dimensional > [hyperrectangles|https://en.wikipedia.org/wiki/Hyperrectangle]. This would be > a generalization of {{LongRangeFacets}} / {{DoubleRangeFacets}} from a single > dimension to n-dimensions, and would complement {{PointRangeQuery}} well, > providing the ability to facet ahead of "drilling down" on such a query. > As a motivating use-case, imagine searching against movie documents that > contain a 2-dimensional point storing "awards" the movie has received. One > dimension encodes the year the award was won, while the other encodes the > type of award as an ordinal. For example, the film "Nomadland" won the > "Academy Awards Best Picture" award in 2021. Imagine providing a > two-dimensional refinement to users allowing them to filter by the > combination of award + year in a single action (e.g., using > {{PointRangeQuery}}) and needing to get facet counts for these > combinations ahead of time. > Curious if the community thinks this functionality would be useful. Any > thoughts? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks
Robert Muir created LUCENE-10299: Summary: investigate prefix/wildcard perf drop in nightly benchmarks Key: LUCENE-10299 URL: https://issues.apache.org/jira/browse/LUCENE-10299 Project: Lucene - Core Issue Type: Task Environment: Recently the prefix/wildcard dropped. As these are super simple and not impacted by cleanups being done around RegExp, I think instead the perf-difference is in the guts of MultiTermQuery where it uses DocIdSetBuilder? *note that I haven't confirmed this and it is just a suspicion* So I think it may be LUCENE-10289 changes? e.g. doing loops with {{long}} instead of {{int}} like before, we know these are slower in java. I will admit, I'm a bit confused why we made this change since lucene docids can only be {{int}}. Maybe we get the performance back for free, with JDK18/19 which are optimizing loops on {{long}} better? So I'm not arguing that we burn a bunch of time to fix this, but just opening the issue. Reporter: Robert Muir -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10274) Implement "hyperrectangle" faceting
[ https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455990#comment-17455990 ] Greg Miller edited comment on LUCENE-10274 at 12/8/21, 9:05 PM: I was thinking that this would work over the same doc values indexed when creating "Point" fields (e.g., LongPoint), which is a binary field encoding all N dimensions into a single byte entry. So the faceting logic would inspect a single binary field encoding the N dimensions, testing whether or not it's contained in each hyperrectangle of interest. UPDATE: Actually, I think I was confusing the current Point field impl with something else. I just glanced at the code and there isn't a current dv field of course (just the inverted points index). So yeah, will need some thought as to how to encode these as dvs. was (Author: gsmiller): I was thinking that this would work over the same doc values indexed when creating "Point" fields (e.g., LongPoint), which is a binary field encoding all N dimensions into a single byte entry. So the faceting logic would inspect a single binary field encoding the N dimensions, testing whether or not it's contained in each hyperrectangle of interest. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-10299: - Description: Recently the prefix/wildcard dropped. As these are super simple and not impacted by cleanups being done around RegExp, I think instead the perf-difference is in the guts of MultiTermQuery where it uses DocIdSetBuilder? *note that I haven't confirmed this and it is just a suspicion* So I think it may be LUCENE-10289 changes? e.g. doing loops with {{long}} instead of {{int}} like before, we know these are slower in java. I will admit, I'm a bit confused why we made this change since lucene docids can only be {{int}}. Maybe we get the performance back for free, with JDK18/19 which are optimizing loops on {{long}} better? So I'm not arguing that we burn a bunch of time to fix this, but just opening the issue. cc [~ivera] Environment: (was: Recently the prefix/wildcard dropped. As these are super simple and not impacted by cleanups being done around RegExp, I think instead the perf-difference is in the guts of MultiTermQuery where it uses DocIdSetBuilder? *note that I haven't confirmed this and it is just a suspicion* So I think it may be LUCENE-10289 changes? e.g. doing loops with {{long}} instead of {{int}} like before, we know these are slower in java. I will admit, I'm a bit confused why we made this change since lucene docids can only be {{int}}. Maybe we get the performance back for free, with JDK18/19 which are optimizing loops on {{long}} better? So I'm not arguing that we burn a bunch of time to fix this, but just opening the issue.) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
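To make the int-vs-long loop concern concrete, here are the two loop shapes in question; whether the long-counter version is measurably slower on a given JDK would need a JMH run, so treat this as an illustration rather than evidence.

{code:java}
// The same summing loop with an int induction variable vs a long one. The
// long variant has historically optimized worse in HotSpot, and since Lucene
// docids are ints, it also forces a cast on every access.
public class LoopCounters {
  static long sumInt(int[] a) {
    long sum = 0;
    for (int i = 0; i < a.length; i++) { // int counter
      sum += a[i];
    }
    return sum;
  }

  static long sumLong(int[] a) {
    long sum = 0;
    for (long i = 0; i < a.length; i++) { // long counter
      sum += a[(int) i]; // cast forced: array indices and docids are ints
    }
    return sum;
  }
}
{code}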
[jira] [Commented] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455991#comment-17455991 ] Robert Muir commented on LUCENE-10299: -- Here is the list of commits in between the benchmark runs where the perf dropped: https://github.com/apache/lucene/compare/ec57641ea5940270ff7eb08536c9050a050adf1f...68e94c959729dee6f32b1c6fca1a5e4902a9fa51 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10274) Implement "hyperrectangle" faceting
[ https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455994#comment-17455994 ] Greg Miller commented on LUCENE-10274: -- {quote}I would also suggest to start with the simple separate-numeric-docvalues-fields case and use similar logic as the {{org.apache.lucene.facet.range}} package, just on 2-D, or maybe 3-D, N-D, etc {quote} We could also pack the N dimensions into a single binary dv field using the {{encodeDimension}} / {{decodeDimension}} paradigm in {{LongPoint}} / {{DoublePoint}} for this. That seems simpler for a user to manage as opposed to managing separate fields for every dimension, but maybe there are performance limitations of such an approach. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
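As a sketch of that packing idea: {{LongPoint.encodeDimension}} and {{decodeDimension}} are real Lucene helpers, while the {{PackedPoint}} wrapper below is hypothetical; the doc values field wiring and the per-dimension containment test against a hyperrectangle's bounds are left out.

{code:java}
import org.apache.lucene.document.LongPoint;

// Sketch: encode N long dimensions into one byte[] (8 bytes per dimension,
// in LongPoint's sortable encoding) so a single binary doc values field can
// carry the whole point.
final class PackedPoint {
  static byte[] pack(long... dims) {
    byte[] packed = new byte[dims.length * Long.BYTES];
    for (int d = 0; d < dims.length; d++) {
      LongPoint.encodeDimension(dims[d], packed, d * Long.BYTES);
    }
    return packed;
  }

  static long[] unpack(byte[] packed) {
    long[] dims = new long[packed.length / Long.BYTES];
    for (int d = 0; d < dims.length; d++) {
      dims[d] = LongPoint.decodeDimension(packed, d * Long.BYTES);
    }
    return dims;
  }
}
{code}

A containment test would then decode each dimension and compare it against that dimension's lower and upper bounds.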
[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system
dweiss commented on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-989200591 I've pushed another bit of exploration and I think it shows we're close. Many things can be cleaned up nicely later (modular configurations generated from sourcesets, including compilation task configuration) but we already have a nice (I think!) way to express modular vs. classpath dependencies, working compilation and a test subproject that uses a module descriptor and module path to run the tests. The only bit I didn't get to was reconfiguring the actual test task (classpath + module path). Hopefully tomorrow I will figure out the remaining bits and start cleanups and polishing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jnorthrup commented on pull request #310: LUCENE-10112: Improve LZ4 Compression performance with direct primitive read/writes
jnorthrup commented on pull request #310: URL: https://github.com/apache/lucene/pull/310#issuecomment-989246505 hi @uschindler, are there analogs for LZO and zstd for which this benchmark can be used to address the ever-present IO budget and cache-line costs of the different libs? (benchmark in #308) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #310: LUCENE-10112: Improve LZ4 Compression performance with direct primitive read/writes
uschindler commented on pull request #310: URL: https://github.com/apache/lucene/pull/310#issuecomment-989287677 Hi @jnorthrup, I do not fully understand what you are intending to do. If you want to compare LZ4 to LZO/ZSTD, just run Mike's benchmarks. My main reason for this PR is to not build an int manually from 4 bytes and instead read it as one atomic value (which should be faster). The other compression algorithms you mention use native code, so that's a different story. This is all pure Java. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
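A minimal sketch of the difference described above, independent of Lucene's internals: the first method assembles the int from four byte reads, the second performs a single little-endian read through a VarHandle, one portable way to get that one-shot read on a byte[].

{code:java}
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Two ways to read a little-endian int out of a byte[]: manual assembly from
// four byte loads vs a single view-var-handle access.
final class ReadIntDemo {
  private static final VarHandle LE_INT =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

  static int fourByteReads(byte[] b, int off) { // the "manual" way
    return (b[off] & 0xFF)
        | ((b[off + 1] & 0xFF) << 8)
        | ((b[off + 2] & 0xFF) << 16)
        | ((b[off + 3] & 0xFF) << 24);
  }

  static int oneRead(byte[] b, int off) { // one little-endian access
    return (int) LE_INT.get(b, off);
  }
}
{code}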
[jira] [Commented] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456048#comment-17456048 ] Greg Miller commented on LUCENE-10281: -- Yeah, +1 to not considering this a bug (but I'm a little biased I suppose since I wrote this). As you point out, the heuristic would be better if we knew how many of the hits actually had values in the SSDV field, but it's expensive to determine that up-front. So the current heuristic (which is just a heuristic and could be flawed in a number of ways) assumes all the hits have a value. > Error condition used to judge whether hits are sparse in > StringValueFacetCounts > --- > > Key: LUCENE-10281 > URL: https://issues.apache.org/jira/browse/LUCENE-10281 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 8.11 >Reporter: Lu Xugang >Priority: Minor > Attachments: 1.jpg > > Time Spent: 10m > Remaining Estimate: 0h > > Description: > In the constructor StringValueFacetCounts(StringDocValuesReaderState state, FacetsCollector facetsCollector), if a facetsCollector is provided, the condition *(totalHits < totalDocs / 10)* is used to judge whether to use an IntIntHashMap, i.e. a sparse structure, to store term ords and counts. > But totalHits does not mean every hit contains the SSDV field, and the same goes for totalDocs. So the right calculation should be *(totalHits that have SSDV) / (totalDocs that have SSDV)*. (totalDocs that have SSDV) is easy to get via SortedSetDocValues#getValueCount(); (totalHits that have SSDV) is hard to get, because we can only read the index by the docIds provided by the FacetsCollector, and that way of computing it is slow and redundant. > Solution: > If we don't want to break the old logic (dense counts while cardinality < 1024, IntIntHashMap under the 10% threshold, and dense counts in the rest of the cases), then we could still use dense counts if cardinality < 1024; if not, use an IntIntHashMap, and change to dense counts once 10% of the unique terms have been collected. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
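For readers following along, the heuristic under discussion reduces to the sketch below, under the very assumption the issue questions (that every hit has a value). Lucene's facet module uses hppc's IntIntHashMap; a plain HashMap stands in here, and CountsStorage is a made-up name.

{code:java}
import java.util.HashMap;

// Sketch of the storage choice: dense counts for small ordinal spaces, a
// sparse map when few hits are expected, dense counts otherwise.
final class CountsStorage {
  static Object choose(int totalHits, int totalDocs, int cardinality) {
    if (cardinality < 1024) {
      return new int[cardinality]; // small ord space: dense is always cheap
    }
    if (totalHits < totalDocs / 10) { // the heuristic being debated
      return new HashMap<Integer, Integer>(); // sparse ord -> count
    }
    return new int[cardinality]; // dense counts
  }
}
{code}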
[GitHub] [lucene] gsmiller commented on pull request #498: LUCENE-10275: Add interval tree to MultiRangeQuery
gsmiller commented on pull request #498: URL: https://github.com/apache/lucene/pull/498#issuecomment-989310001 Nice change! (Just now catching up on some Lucene issues and saw this) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude merged pull request #2626: SOLR-15832: Clean-up after publish action in Schema Designer shouldn't fail if .system collection doesn't exist
thelabdude merged pull request #2626: URL: https://github.com/apache/lucene-solr/pull/2626 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #528: LUCENE-10296: Stop minimizing regexps
rmuir commented on pull request #528: URL: https://github.com/apache/lucene/pull/528#issuecomment-989427375 I'm waiting a bit on https://issues.apache.org/jira/browse/LUCENE-10299 . I don't expect any regression, but I don't want to confuse it with the `DocIdSetBuilder` stuff -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #528: LUCENE-10296: Stop minimizing regexps
rmuir commented on pull request #528: URL: https://github.com/apache/lucene/pull/528#issuecomment-989454186 LUCENE-10296: Stop minimizing regexps In current trunk, we let the caller (e.g. `RegExpQuery`) try to "reduce" the expression. Neither the parser nor the low-level executors implicitly call exponential-time algorithms anymore. But now that we have cleaned this up, we can see that what is happening is even worse than just calling `determinize()`: we still call `minimize()`, which is much crazier and much more expensive. We stopped doing this for all other `AutomatonQuery` subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than in the early days where we found trouble: the representation got a lot better. Today when you `finishState()` we do a lot of practical sorting/coalescing on-the-fly, the practical parts of minimization for runtime perf. Also we added our fancy UTF32-to-UTF8 automata convertor, which makes the worst-case space per state significantly lower than with the UTF-16 representation. So why `minimize()`? Let's just replace `minimize()` calls with `determinize()` calls. I've already swapped them out for all of `src/test`, to get jenkins looking for issues ahead of time. This change moves Hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step; any tests that truly need minimization should be fine with Brzozowski's algorithm: that's a 2-liner. I think the problem is understood (longs are insane for docids); I don't wish to hold changes up on stupid stuff -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
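At a call site, the swap looks roughly like this, assuming the current automaton API ({{RegExp}}, {{Operations.determinize}}); the name of the work-limit constant has varied across versions.

{code:java}
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

class DeterminizeOnly {
  static Automaton build(String pattern) {
    Automaton a = new RegExp(pattern).toAutomaton();
    // Previously followed by MinimizeOperations.minimize(...) (Hopcroft),
    // which this change moves to src/test; determinizing is enough at runtime.
    return Operations.determinize(a, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
  }
}
{code}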
[GitHub] [lucene] rmuir merged pull request #528: LUCENE-10296: Stop minimizing regexps
rmuir merged pull request #528: URL: https://github.com/apache/lucene/pull/528 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10296) Stop minimizing regexps
[ https://issues.apache.org/jira/browse/LUCENE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456106#comment-17456106 ] ASF subversion and git services commented on LUCENE-10296: -- Commit 7a872c7a5c00d846314d44a445f8b0e83acb6a86 in lucene's branch refs/heads/main from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=7a872c7 ] LUCENE-10296: Stop minimizing regepx (#528) In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore. But now that we have cleaned this up, we can see it is even worse than just calling determinize(). We still call minimize() which is much crazier and much more. We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState we do a lot of practical sorting/coalescing on-the-fly. Also we added this fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than it was before? So why minimize() ? Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time. This change moves hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm. > Stop minimizing regexps > --- > > Key: LUCENE-10296 > URL: https://issues.apache.org/jira/browse/LUCENE-10296 > Project: Lucene - Core > Issue Type: Task >Affects Versions: 10.0 (main) >Reporter: Robert Muir >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the > expression. The parser nor the low-level executors don't implicitly call > exponential-time algorithms anymore. > But now that we have cleaned this up, we can see it is even worse than just > calling {{{}determinize(){}}}. We still call {{minimize()}} which is much > crazier and much more. > We stopped doing this for all other AutomatonQuery subclasses a long time > ago, as we determined that it didn't help performance. Additionally, > minimization vs. determinization is even less important than early days where > we found trouble: the representation got a lot better. Today when you > {{finishState}} we do a lot of practical sorting/coalescing on-the-fly. Also > we added this fancy UTF32-to-UTF8 automata convertor, that makes the > worst-case-space-per-state significantly lower than it was before? So why > {{minimize()}} ? > Let's just replace {{minimize()}} calls with {{determinize()}} calls? I've > already swapped them out for all of {{{}src/test{}}}, to get jenkins looking > for issues ahead of time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456112#comment-17456112 ] Robert Muir commented on LUCENE-10299: -- fairly heavy cost on the MTQ-rewrite happening in the nightly bench. I assume similar cost for filter construction, etc. Honestly I suggest reverting this commit. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int
[ https://issues.apache.org/jira/browse/LUCENE-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reopened LUCENE-10289: -- I reopened the issue due to the perf regression reported in LUCENE-10299. > DocIdSetBuilder#grow() should take a long instead of int > - > > Key: LUCENE-10289 > URL: https://issues.apache.org/jira/browse/LUCENE-10289 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.1 > > Time Spent: 1h > Remaining Estimate: 0h > > DocIdSetBuilder accepts adding duplicates and therefore it can potentially > accept more than Integer.MAX_VALUE docs. Indeed, it already holds a > counter internally that is a long. It probably makes sense to be able to grow > using a long instead of an int. > > This will allow us to change PointValues.IntersectVisitor#grow() from int to > long and remove some unnecessary dance when we need to bulk add more than > Integer.MAX_VALUE points. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
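A hedged sketch of the "dance" the long overload removes; this is a hypothetical call site for illustration, the exact call sites in the BKD code differ:

{code:java}
import org.apache.lucene.util.DocIdSetBuilder;

public class GrowDance {
  // Before grow(long): reserve space in int-sized chunks, since duplicates
  // allow totalPoints to exceed Integer.MAX_VALUE.
  static void reserveChunked(DocIdSetBuilder builder, long totalPoints) {
    long remaining = totalPoints;
    while (remaining > 0) {
      int chunk = (int) Math.min(remaining, Integer.MAX_VALUE);
      builder.grow(chunk);
      remaining -= chunk;
    }
    // After (9.1, long overload): builder.grow(totalPoints) in a single call.
  }
}
{code}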
[jira] [Resolved] (LUCENE-10296) Stop minimizing regexps
[ https://issues.apache.org/jira/browse/LUCENE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-10296. -- Fix Version/s: 10.0 (main) Resolution: Fixed > Stop minimizing regexps > --- > > Key: LUCENE-10296 > URL: https://issues.apache.org/jira/browse/LUCENE-10296 > Project: Lucene - Core > Issue Type: Task >Affects Versions: 10.0 (main) >Reporter: Robert Muir >Priority: Major > Fix For: 10.0 (main) > > Time Spent: 40m > Remaining Estimate: 0h > > In current trunk, we let the caller (e.g. RegExpQuery) try to "reduce" the > expression. Neither the parser nor the low-level executors implicitly call > exponential-time algorithms anymore. > But now that we have cleaned this up, we can see it is even worse than just > calling {{determinize()}}: we still call {{minimize()}}, which is much crazier > and much more costly. > We stopped doing this for all other AutomatonQuery subclasses a long time > ago, as we determined that it didn't help performance. Additionally, > minimization vs. determinization is even less important than in the early days > when we found trouble: the representation got a lot better. Today when you > call {{finishState}} we do a lot of practical sorting/coalescing on the fly. Also > we added the fancy UTF32-to-UTF8 automata converter, which makes the > worst-case space per state significantly lower than it was before. So why > {{minimize()}}? > Let's just replace {{minimize()}} calls with {{determinize()}} calls. I've > already swapped them out for all of {{src/test}}, to get Jenkins looking > for issues ahead of time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gf2121 edited a comment on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently
gf2121 edited a comment on pull request #510: URL: https://github.com/apache/lucene/pull/510#issuecomment-989104054 @iverase Thanks for your explanation! > I worked on the PR about using #readLELongs but never got a meaningful speedup that justified the added complexity. I found that we were trying to use #readLELongs to speed up the 24/32-bit situation in the `DocIdsWriter`, which means the ids in the block are unsorted, as typically happens for high cardinality fields. I think queries on high cardinality fields spend most of their time on `visitDocValues` rather than `readDocIds`, so maybe this is the reason we cannot see an obvious gain in end-to-end time? My current thoughts are about using readLELongs to speed up the **sorted** ids situation (meaning low or medium cardinality fields), whose bottleneck is reading docIds. For sorted arrays, we can compute the deltas of the sorted ids and encode/decode them like what we do in `StoredFieldsInts`. I raised an [ISSUE](https://issues.apache.org/jira/browse/LUCENE-10297) based on this idea. The benchmark results I posted in the issue look promising. Would you like to help take a look when you are free? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
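The delta idea above is simple enough to sketch. This is a rough illustration of the encode/decode step only, not Lucene's actual `DocIdsWriter`/`StoredFieldsInts` code; the real implementation would additionally bit-pack the deltas and read them back in bulk (e.g. via `DataInput#readLongs`), which is where a SIMD-friendly speedup could come from:

{code:java}
public class SortedDocIdDeltas {
  static int[] encode(int[] sortedDocIds) {
    int[] deltas = new int[sortedDocIds.length];
    int prev = 0;
    for (int i = 0; i < sortedDocIds.length; i++) {
      deltas[i] = sortedDocIds[i] - prev; // >= 0 because input is sorted
      prev = sortedDocIds[i];
    }
    return deltas;
  }

  static int[] decode(int[] deltas) {
    int[] docIds = new int[deltas.length];
    int doc = 0;
    for (int i = 0; i < deltas.length; i++) {
      doc += deltas[i]; // prefix sum restores the original IDs
      docIds[i] = doc;
    }
    return docIds;
  }

  public static void main(String[] args) {
    int[] ids = {3, 7, 8, 42, 100};
    // Round-trips back to the original sorted IDs.
    System.out.println(java.util.Arrays.toString(decode(encode(ids))));
  }
}
{code}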
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? IMO medium cardinality fields may be tempting for this optimization because they need to read lots of ids for one term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
> Speed up medium cardinality fields with readLELongs and SIMD > > > Key: LUCENE-10297 > URL: https://issues.apache.org/jira/browse/LUCENE-10297 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > We already have a bitset optimization for low cardinality fields, but the > optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardina
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
> Speed up medium cardinality fields with readLELongs and SIMD > -- > > Key: LUCENE-10297 > URL: https://issues.apache.org/jira/browse/LUCENE-10297 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > We already have a bitset optimization for low cardinality fields, but the > optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium c
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Summary: Speed up medium cardinality fields with readLongs and SIMD (was: Speed up medium cardinality fields with readLELongs and SIMD) > Speed up medium cardinality fields with readLongs and SIMD > -- > > Key: LUCENE-10297 > URL: https://issues.apache.org/jira/browse/LUCENE-10297 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > We already have a bitset optimization for low cardinality fields, but the > optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this > optimization. > In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to > use readLELongs to speed up BKD id blocks, but did not get an obvious gain from > this approach. Maybe this is because we are trying to optimize the unsorted > situation (which typically happens for high cardinality fields) and the bottleneck > of queries on high cardinality fields is {{visitDocValues}} rather than > {{readDocIds}}? > However, medium cardinality fields may be tempting for this optimization > because they need to read lots of ids for each term. The basic idea is that > we can compute the deltas of the sorted ids and encode/decode them like what > we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some > random LongPoints and querying them with {{PointInSetQuery}}. As expected, the > medium cardinality fields got sped up and the high cardinality fields got even > results.
> *Benchmark Result*
> |doc count|field cardinality|query term count|baseline(ms)|candidate(ms)|diff percentage|
> |1|32|1|19|16|-15.79%|
> |1|32|2|34|14|-58.82%|
> |1|32|4|76|22|-71.05%|
> |1|32|8|139|42|-69.78%|
> |1|32|16|279|82|-70.61%|
> |1|128|1|17|11|-35.29%|
> |1|128|8|75|23|-69.33%|
> |1|128|16|126|25|-80.16%|
> |1|128|32|245|50|-79.59%|
> |1|128|64|528|97|-81.63%|
> |1|1024|1|3|2|-33.33%|
> |1|1024|8|13|8|-38.46%|
> |1|1024|32|31|19|-38.71%|
> |1|1024|128|120|67|-44.17%|
> |1|1024|512|480|133|-72.29%|
> |1|8192|1|3|3|0.00%|
> |1|8192|16|18|15|-16.67%|
> |1|8192|64|19|14|-26.32%|
> |1|8192|512|69|43|-37.68%|
> |1|8192|2048|236|134|-43.22%|
> |1|1048576|1|3|2|-33.33%|
> |1|1048576|16|18|19|5.56%|
> |1|1048576|64|17|17|0.00%|
> |1|1048576|512|34|32|-5.88%|
> |1|1048576|2048|89|93|4.49%|
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
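For context on the bitset optimization the description refers to, here is an assumption-laden toy sketch of the density choice, using java.util.BitSet rather than Lucene's internal bitset classes; it is not the actual DocIdsWriter logic:

{code:java}
import java.util.BitSet;

public class DenseOrSparse {
  // "per-term doc count > 1/16 of the total doc count" is the threshold
  // cited in the description above; dense blocks go to a bitset, sparse
  // blocks to delta-coded ints.
  static boolean useBitSet(int docCount, int maxDoc) {
    return docCount > maxDoc / 16;
  }

  static BitSet toBitSet(int[] sortedDocIds, int maxDoc) {
    BitSet bits = new BitSet(maxDoc);
    for (int doc : sortedDocIds) {
      bits.set(doc); // one bit per matching doc
    }
    return bits;
  }
}
{code}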
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|
was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
> Speed up medium cardinality fields with readLongs and SIMD > --
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think the reason could be that we were trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}. However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|
was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. Maybe this is because we are trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}? However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-
[GitHub] [lucene] gf2121 edited a comment on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently
gf2121 edited a comment on pull request #510: URL: https://github.com/apache/lucene/pull/510#issuecomment-989104054 @iverase Thanks for your explanation! > I worked on the PR about using #readLELongs but never got a meaningful speedup that justified the added complexity. I found that we were trying to use #readLELongs to speed up the 24/32-bit situation in the `DocIdsWriter`, which means the ids in the block are unsorted, as typically happens for high cardinality fields. I think queries on high cardinality fields spend most of their time on `visitDocValues` rather than `readDocIds`, so maybe this is the reason we cannot see an obvious gain in end-to-end time? My current thoughts are about using readLELongs to speed up the **sorted** ids situation (meaning low or medium cardinality fields), whose bottleneck is reading docIds. For sorted arrays, we can compute the deltas of the sorted ids and encode/decode them like what we do in `StoredFieldsInts`. I raised an [ISSUE](https://issues.apache.org/jira/browse/LUCENE-10297) based on this idea. The benchmark results I posted in the issue look promising. Would you like to help take a look when you have free time? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think the reason could be that we were trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}. (Not sure, I'm doing some more benchmarking on this.) However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|
was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think the reason could be that we were trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}. However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|819
[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD
[ https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10297: -- Description: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think the reason could be that we were trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}. _(Not sure, I'm doing some more benchmarking on this.)_ However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|
was: We already have a bitset optimization for low cardinality fields, but the optimization only works on extremely low cardinality fields (per-term doc count > 1/16 of the total doc count); medium cardinality cases like 32/128 rarely get this optimization. In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to use readLELongs to speed up BKD id blocks, but did not get an obvious gain from this approach. I think the reason could be that we were trying to optimize the unsorted situation (which typically happens for high cardinality fields) and the bottleneck of queries on high cardinality fields is {{visitDocValues}} rather than {{readDocIds}}. (Not sure, I'm doing some more benchmarking on this.) However, medium cardinality fields may be tempting for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random LongPoints and querying them with {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and the high cardinality fields got even results.
*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|10
[GitHub] [lucene] spyk commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
spyk commented on pull request #380: URL: https://github.com/apache/lucene/pull/380#issuecomment-989601844 Thanks @magibney! Yes, makes sense, I'll add a commit for the redundant parsing code removal and add a TODO comment as suggested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org