[GitHub] [lucene-solr] janhoy commented on pull request #1364: SOLR-14335: Lock Solr's memory to prevent swapping

2021-12-08 Thread GitBox


janhoy commented on pull request #1364:
URL: https://github.com/apache/lucene-solr/pull/1364#issuecomment-988681951


   Lucene and Solr development has moved to separate git repositories and this
PR is being bulk-closed. Please open a new PR against
https://github.com/apache/solr or https://github.com/apache/lucene if your
contribution is still relevant to the project.





[GitHub] [lucene-solr] janhoy closed pull request #1364: SOLR-14335: Lock Solr's memory to prevent swapping

2021-12-08 Thread GitBox


janhoy closed pull request #1364:
URL: https://github.com/apache/lucene-solr/pull/1364


   





[jira] [Created] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)
Feng Guo created LUCENE-10297:
-

 Summary: Speed up medium cardinality fields with readLELongs and SIMD
 Key: LUCENE-10297
 URL: https://issues.apache.org/jira/browse/LUCENE-10297
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Feng Guo


Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.
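
To make the dense case concrete, here is a plain-JDK sketch of the idea (not 
Lucene's actual DocIdsWriter code; the 1/16 density threshold, class, and 
method names are placeholders):

{code:java}
import java.util.BitSet;
import java.util.function.IntConsumer;

public class DenseBlockVisit {
  // Sketch of the dense ("low cardinality") path: when a leaf block covers a
  // large fraction of the doc-id space, funneling the ids through a bitset
  // lets the visitor see them in order with good cache behavior.
  static void visit(int[] ids, int maxDoc, IntConsumer visitor) {
    if ((long) ids.length * 16 > maxDoc) { // dense case, placeholder threshold
      BitSet bits = new BitSet(maxDoc);
      for (int id : ids) {
        bits.set(id);
      }
      for (int id = bits.nextSetBit(0); id >= 0; id = bits.nextSetBit(id + 1)) {
        visitor.accept(id);
      }
    } else { // sparse case: just visit the ids in stored order
      for (int id : ids) {
        visitor.accept(id);
      }
    }
  }
}
{code}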

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields. Maybe 
that is because the bottleneck of queries on high cardinality fields is 
usually visitDocValues rather than readDocIds? I think medium cardinality 
fields are tempting targets for this optimization.
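
For reference, a readLELongs-style bulk read boils down to decoding 8 bytes at 
a time as a little-endian long. A minimal plain-JDK sketch of that idea (not 
Lucene's actual DataInput API):

{code:java}
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class LittleEndianReads {
  // VarHandle view that reads 8 bytes of a byte[] as one little-endian long.
  private static final VarHandle LE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  // Bulk-read `count` little-endian longs starting at byte offset `off`.
  static void readLELongs(byte[] src, int off, long[] dst, int count) {
    for (int i = 0; i < count; i++) {
      dst[i] = (long) LE_LONG.get(src, off + i * Long.BYTES); // one 64-bit load per value
    }
  }
}
{code}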

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields stayed even.
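
A minimal sketch of what such a benchmark setup could look like (assumed shape 
only: the field name, sizes, and timing code are placeholders, not the actual 
benchmark that produced the numbers below):

{code:java}
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;

public class PointInSetBench {
  public static void main(String[] args) throws Exception {
    int docCount = 1_000_000, cardinality = 128, termCount = 16;
    Random random = new Random(42);
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < docCount; i++) {
        Document doc = new Document();
        // mock a random value in [0, cardinality)
        doc.add(new LongPoint("field", (long) random.nextInt(cardinality)));
        writer.addDocument(doc);
      }
      writer.forceMerge(1);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      long[] terms = new long[termCount];
      for (int i = 0; i < termCount; i++) {
        terms[i] = random.nextInt(cardinality);
      }
      Query query = LongPoint.newSetQuery("field", terms); // a PointInSetQuery under the hood
      long start = System.nanoTime();
      int hits = searcher.count(query);
      System.out.println("hits: " + hits + ", took: "
          + (System.nanoTime() - start) / 1_000_000 + "ms");
    }
  }
}
{code}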

*BaseLine*
{code:java}
task: index_1_doc_32_cardinality_baseline, term count: 1, took: 29
task: index_1_doc_32_cardinality_baseline, term count: 2, took: 40
task: index_1_doc_32_cardinality_baseline, term count: 4, took: 74
task: index_1_doc_32_cardinality_baseline, term count: 8, took: 144
task: index_1_doc_32_cardinality_baseline, term count: 16, took: 284
task: index_1_doc_128_cardinality_baseline, term count: 1, took: 20
task: index_1_doc_128_cardinality_baseline, term count: 8, took: 70
task: index_1_doc_128_cardinality_baseline, term count: 16, took: 127
task: index_1_doc_128_cardinality_baseline, term count: 32, took: 251
task: index_1_doc_128_cardinality_baseline, term count: 64, took: 576
task: index_1_doc_8192_cardinality_baseline, term count: 1, took: 2
task: index_1_doc_8192_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_8192_cardinality_baseline, term count: 64, took: 18
task: index_1_doc_8192_cardinality_baseline, term count: 512, took: 88
task: index_1_doc_8192_cardinality_baseline, term count: 2048, took: 266
task: index_1_doc_1048576_cardinality_baseline, term count: 1, took: 3
task: index_1_doc_1048576_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_1048576_cardinality_baseline, term count: 64, took: 8
task: index_1_doc_1048576_cardinality_baseline, term count: 512, took: 33
task: index_1_doc_1048576_cardinality_baseline, term count: 2048, took: 97
task: index_1_doc_8388608_cardinality_baseline, term count: 1, took: 4
task: index_1_doc_8388608_cardinality_baseline, term count: 16, took: 20
task: index_1_doc_8388608_cardinality_baseline, term count: 64, took: 31
task: index_1_doc_8388608_cardinality_baseline, term count: 512, took: 70
task: index_1_doc_8388608_cardinality_baseline, term count: 2048, took: 209
{code}

*candidate*

{code:java}
task: index_1_doc_32_cardinality_candidate, term count: 1, took: 18
task: index_1_doc_32_cardinality_candidate, term count: 2, took: 16
task: index_1_doc_32_cardinality_candidate, term count: 4, took: 26
task: index_1_doc_32_cardinality_candidate, term count: 8, took: 46
task: index_1_doc_32_cardinality_candidate, term count: 16, took: 88
task: index_1_doc_128_cardinality_candidate, term count: 1, took: 12
task: index_1_doc_128_cardinality_candidate, term count: 8, took: 22
task: index_1_doc_128_cardinality_candidate, term count: 16, took: 29
task: index_1_doc_128_cardinality_candidate, term count: 32, took: 50
task: index_1_doc_128_cardinality_candidate, term count: 64, took: 93
task: index_1_doc_8192_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8192_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_8192_cardinality_candidate, term count: 64, took: 13
task: index_1_doc_8192_cardinality_candidate, term count: 512, took: 42
task: index_1_doc_8192_cardinality_candidate, term count: 2048, took: 129
task: index_1_doc_1048576_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_1048576_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 64, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 512, took: 32
task: index_1_doc_1048576_cardinality_candidate, term count: 2048, took: 93
task: index_1_doc_8388608_cardinality_candidate, ter

[GitHub] [lucene] gf2121 opened a new pull request #530: LUCENE-10297: Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread GitBox


gf2121 opened a new pull request #530:
URL: https://github.com/apache/lucene/pull/530


   see https://issues.apache.org/jira/browse/LUCENE-10297





[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields. Maybe 
that is because the bottleneck of queries on high cardinality fields is 
usually visitDocValues rather than readDocIds? I think medium cardinality 
fields are tempting targets for this optimization.

The basic idea is that we take deltas of the sorted ids and encode them.
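
A toy sketch of that encode/decode round trip (illustration only; the real 
encoding packs the gaps far more compactly than fixed 4-byte little-endian 
ints, and all names here are placeholders):

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class SortedDeltaCodec {
  // Encode sorted doc ids as gaps, written as little-endian ints.
  static byte[] encode(int[] sortedIds) {
    ByteBuffer buf =
        ByteBuffer.allocate(sortedIds.length * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    int prev = 0;
    for (int id : sortedIds) {
      buf.putInt(id - prev); // gaps stay small for dense blocks: fewer significant bits
      prev = id;
    }
    return buf.array();
  }

  // Decode by prefix-summing the gaps to restore the absolute ids.
  static int[] decode(byte[] data, int count) {
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    int[] ids = new int[count];
    int prev = 0;
    for (int i = 0; i < count; i++) {
      prev += buf.getInt();
      ids[i] = prev;
    }
    return ids;
  }

  public static void main(String[] args) {
    int[] ids = {3, 7, 8, 15, 42};
    System.out.println(Arrays.toString(decode(encode(ids), ids.length))); // [3, 7, 8, 15, 42]
  }
}
{code}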

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields stayed even.

*BaseLine*
{code:java}
task: index_1_doc_32_cardinality_baseline, term count: 1, took: 29
task: index_1_doc_32_cardinality_baseline, term count: 2, took: 40
task: index_1_doc_32_cardinality_baseline, term count: 4, took: 74
task: index_1_doc_32_cardinality_baseline, term count: 8, took: 144
task: index_1_doc_32_cardinality_baseline, term count: 16, took: 284
task: index_1_doc_128_cardinality_baseline, term count: 1, took: 20
task: index_1_doc_128_cardinality_baseline, term count: 8, took: 70
task: index_1_doc_128_cardinality_baseline, term count: 16, took: 127
task: index_1_doc_128_cardinality_baseline, term count: 32, took: 251
task: index_1_doc_128_cardinality_baseline, term count: 64, took: 576
task: index_1_doc_8192_cardinality_baseline, term count: 1, took: 2
task: index_1_doc_8192_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_8192_cardinality_baseline, term count: 64, took: 18
task: index_1_doc_8192_cardinality_baseline, term count: 512, took: 88
task: index_1_doc_8192_cardinality_baseline, term count: 2048, took: 266
task: index_1_doc_1048576_cardinality_baseline, term count: 1, took: 3
task: index_1_doc_1048576_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_1048576_cardinality_baseline, term count: 64, took: 8
task: index_1_doc_1048576_cardinality_baseline, term count: 512, took: 33
task: index_1_doc_1048576_cardinality_baseline, term count: 2048, took: 97
task: index_1_doc_8388608_cardinality_baseline, term count: 1, took: 4
task: index_1_doc_8388608_cardinality_baseline, term count: 16, took: 20
task: index_1_doc_8388608_cardinality_baseline, term count: 64, took: 31
task: index_1_doc_8388608_cardinality_baseline, term count: 512, took: 70
task: index_1_doc_8388608_cardinality_baseline, term count: 2048, took: 209
{code}
*candidate*
{code:java}
task: index_1_doc_32_cardinality_candidate, term count: 1, took: 18
task: index_1_doc_32_cardinality_candidate, term count: 2, took: 16
task: index_1_doc_32_cardinality_candidate, term count: 4, took: 26
task: index_1_doc_32_cardinality_candidate, term count: 8, took: 46
task: index_1_doc_32_cardinality_candidate, term count: 16, took: 88
task: index_1_doc_128_cardinality_candidate, term count: 1, took: 12
task: index_1_doc_128_cardinality_candidate, term count: 8, took: 22
task: index_1_doc_128_cardinality_candidate, term count: 16, took: 29
task: index_1_doc_128_cardinality_candidate, term count: 32, took: 50
task: index_1_doc_128_cardinality_candidate, term count: 64, took: 93
task: index_1_doc_8192_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8192_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_8192_cardinality_candidate, term count: 64, took: 13
task: index_1_doc_8192_cardinality_candidate, term count: 512, took: 42
task: index_1_doc_8192_cardinality_candidate, term count: 2048, took: 129
task: index_1_doc_1048576_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_1048576_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 64, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 512, took: 32
task: index_1_doc_1048576_cardinality_candidate, term count: 2048, took: 93
task: index_1_doc_8388608_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8388608_cardinality_candidate, term count: 16, took: 21
ta

[GitHub] [lucene-solr] janhoy opened a new pull request #2625: Added bulkclose feature to the githubPRs script

2021-12-08 Thread GitBox


janhoy opened a new pull request #2625:
URL: https://github.com/apache/lucene-solr/pull/2625


   Example use:
   
   ```bash
   ./githubPRs.py \
     --bulkclose "Lucene and Solr development has moved to separate git repositories and this PR is being bulk-closed. Please open a new PR against https://github.com/apache/solr or https://github.com/apache/lucene if your contribution is still relevant to the project." \
     --token X
   ```
   
   The result of such an action can be seen in #1364, which I used for testing.
You can then easily query GitHub for a list of the `stale-closed` PRs:
https://github.com/apache/lucene-solr/pulls?q=label%3Astale-closed+is%3Aclosed





[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields. Maybe 
that is because the bottleneck of queries on high cardinality fields is 
usually visitDocValues rather than readDocIds? I think medium cardinality 
fields are tempting targets for this optimization.

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields stayed even.

*BaseLine*
{code:java}
task: index_1_doc_32_cardinality_baseline, term count: 1, took: 29
task: index_1_doc_32_cardinality_baseline, term count: 2, took: 40
task: index_1_doc_32_cardinality_baseline, term count: 4, took: 74
task: index_1_doc_32_cardinality_baseline, term count: 8, took: 144
task: index_1_doc_32_cardinality_baseline, term count: 16, took: 284
task: index_1_doc_128_cardinality_baseline, term count: 1, took: 20
task: index_1_doc_128_cardinality_baseline, term count: 8, took: 70
task: index_1_doc_128_cardinality_baseline, term count: 16, took: 127
task: index_1_doc_128_cardinality_baseline, term count: 32, took: 251
task: index_1_doc_128_cardinality_baseline, term count: 64, took: 576
task: index_1_doc_8192_cardinality_baseline, term count: 1, took: 2
task: index_1_doc_8192_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_8192_cardinality_baseline, term count: 64, took: 18
task: index_1_doc_8192_cardinality_baseline, term count: 512, took: 88
task: index_1_doc_8192_cardinality_baseline, term count: 2048, took: 266
task: index_1_doc_1048576_cardinality_baseline, term count: 1, took: 3
task: index_1_doc_1048576_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_1048576_cardinality_baseline, term count: 64, took: 8
task: index_1_doc_1048576_cardinality_baseline, term count: 512, took: 33
task: index_1_doc_1048576_cardinality_baseline, term count: 2048, took: 97
task: index_1_doc_8388608_cardinality_baseline, term count: 1, took: 4
task: index_1_doc_8388608_cardinality_baseline, term count: 16, took: 20
task: index_1_doc_8388608_cardinality_baseline, term count: 64, took: 31
task: index_1_doc_8388608_cardinality_baseline, term count: 512, took: 70
task: index_1_doc_8388608_cardinality_baseline, term count: 2048, took: 209
{code}
*candidate*
{code:java}
task: index_1_doc_32_cardinality_candidate, term count: 1, took: 18
task: index_1_doc_32_cardinality_candidate, term count: 2, took: 16
task: index_1_doc_32_cardinality_candidate, term count: 4, took: 26
task: index_1_doc_32_cardinality_candidate, term count: 8, took: 46
task: index_1_doc_32_cardinality_candidate, term count: 16, took: 88
task: index_1_doc_128_cardinality_candidate, term count: 1, took: 12
task: index_1_doc_128_cardinality_candidate, term count: 8, took: 22
task: index_1_doc_128_cardinality_candidate, term count: 16, took: 29
task: index_1_doc_128_cardinality_candidate, term count: 32, took: 50
task: index_1_doc_128_cardinality_candidate, term count: 64, took: 93
task: index_1_doc_8192_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8192_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_8192_cardinality_candidate, term count: 64, took: 13
task: index_1_doc_8192_cardinality_candidate, term count: 512, took: 42
task: index_1_doc_8192_cardinality_candidate, term count: 2048, took: 129
task: index_1_doc_1048576_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_1048576_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 64, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 512, took: 32
task: index_1_doc_1048576_cardinality_candidate, term count: 2048, took: 93
task: index_1_doc_8388608_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8388608_cardinality_candidate, term count: 16, took: 21
task: index_1_doc_8388608_cardinality_candidate, term count: 64, took: 38

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.

*BaseLine*
{code:java}
task: index_1_doc_32_cardinality_baseline, term count: 1, took: 29
task: index_1_doc_32_cardinality_baseline, term count: 2, took: 40
task: index_1_doc_32_cardinality_baseline, term count: 4, took: 74
task: index_1_doc_32_cardinality_baseline, term count: 8, took: 144
task: index_1_doc_32_cardinality_baseline, term count: 16, took: 284
task: index_1_doc_128_cardinality_baseline, term count: 1, took: 20
task: index_1_doc_128_cardinality_baseline, term count: 8, took: 70
task: index_1_doc_128_cardinality_baseline, term count: 16, took: 127
task: index_1_doc_128_cardinality_baseline, term count: 32, took: 251
task: index_1_doc_128_cardinality_baseline, term count: 64, took: 576
task: index_1_doc_8192_cardinality_baseline, term count: 1, took: 2
task: index_1_doc_8192_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_8192_cardinality_baseline, term count: 64, took: 18
task: index_1_doc_8192_cardinality_baseline, term count: 512, took: 88
task: index_1_doc_8192_cardinality_baseline, term count: 2048, took: 266
task: index_1_doc_1048576_cardinality_baseline, term count: 1, took: 3
task: index_1_doc_1048576_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_1048576_cardinality_baseline, term count: 64, took: 8
task: index_1_doc_1048576_cardinality_baseline, term count: 512, took: 33
task: index_1_doc_1048576_cardinality_baseline, term count: 2048, took: 97
task: index_1_doc_8388608_cardinality_baseline, term count: 1, took: 4
task: index_1_doc_8388608_cardinality_baseline, term count: 16, took: 20
task: index_1_doc_8388608_cardinality_baseline, term count: 64, took: 31
task: index_1_doc_8388608_cardinality_baseline, term count: 512, took: 70
task: index_1_doc_8388608_cardinality_baseline, term count: 2048, took: 209
{code}
*candidate*
{code:java}
task: index_1_doc_32_cardinality_candidate, term count: 1, took: 18
task: index_1_doc_32_cardinality_candidate, term count: 2, took: 16
task: index_1_doc_32_cardinality_candidate, term count: 4, took: 26
task: index_1_doc_32_cardinality_candidate, term count: 8, took: 46
task: index_1_doc_32_cardinality_candidate, term count: 16, took: 88
task: index_1_doc_128_cardinality_candidate, term count: 1, took: 12
task: index_1_doc_128_cardinality_candidate, term count: 8, took: 22
task: index_1_doc_128_cardinality_candidate, term count: 16, took: 29
task: index_1_doc_128_cardinality_candidate, term count: 32, took: 50
task: index_1_doc_128_cardinality_candidate, term count: 64, took: 93
task: index_1_doc_8192_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8192_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_8192_cardinality_candidate, term count: 64, took: 13
task: index_1_doc_8192_cardinality_candidate, term count: 512, took: 42
task: index_1_doc_8192_cardinality_candidate, term count: 2048, took: 129
task: index_1_doc_1048576_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_1048576_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 64, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 512, took: 32
task: index_1_doc_1048576_cardinality_candidate, term count: 2048, took: 93
task: index_1_doc_8388608_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8388608_cardinality_candidate, term count: 16, took: 21
task: index_1_doc_8388608_cardinality_candidate, term count: 64, took: 38
task: index

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.

*BaseLine*
{code:java}
task: index_1_doc_32_cardinality_baseline, term count: 1, took: 29
task: index_1_doc_32_cardinality_baseline, term count: 2, took: 40
task: index_1_doc_32_cardinality_baseline, term count: 4, took: 74
task: index_1_doc_32_cardinality_baseline, term count: 8, took: 144
task: index_1_doc_32_cardinality_baseline, term count: 16, took: 284
task: index_1_doc_128_cardinality_baseline, term count: 1, took: 20
task: index_1_doc_128_cardinality_baseline, term count: 8, took: 70
task: index_1_doc_128_cardinality_baseline, term count: 16, took: 127
task: index_1_doc_128_cardinality_baseline, term count: 32, took: 251
task: index_1_doc_128_cardinality_baseline, term count: 64, took: 576
task: index_1_doc_8192_cardinality_baseline, term count: 1, took: 2
task: index_1_doc_8192_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_8192_cardinality_baseline, term count: 64, took: 18
task: index_1_doc_8192_cardinality_baseline, term count: 512, took: 88
task: index_1_doc_8192_cardinality_baseline, term count: 2048, took: 266
task: index_1_doc_1048576_cardinality_baseline, term count: 1, took: 3
task: index_1_doc_1048576_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_1048576_cardinality_baseline, term count: 64, took: 8
task: index_1_doc_1048576_cardinality_baseline, term count: 512, took: 33
task: index_1_doc_1048576_cardinality_baseline, term count: 2048, took: 97
task: index_1_doc_8388608_cardinality_baseline, term count: 1, took: 4
task: index_1_doc_8388608_cardinality_baseline, term count: 16, took: 20
task: index_1_doc_8388608_cardinality_baseline, term count: 64, took: 31
task: index_1_doc_8388608_cardinality_baseline, term count: 512, took: 70
task: index_1_doc_8388608_cardinality_baseline, term count: 2048, took: 209
{code}
*Candidate*
{code:java}
task: index_1_doc_32_cardinality_candidate, term count: 1, took: 18
task: index_1_doc_32_cardinality_candidate, term count: 2, took: 16
task: index_1_doc_32_cardinality_candidate, term count: 4, took: 26
task: index_1_doc_32_cardinality_candidate, term count: 8, took: 46
task: index_1_doc_32_cardinality_candidate, term count: 16, took: 88
task: index_1_doc_128_cardinality_candidate, term count: 1, took: 12
task: index_1_doc_128_cardinality_candidate, term count: 8, took: 22
task: index_1_doc_128_cardinality_candidate, term count: 16, took: 29
task: index_1_doc_128_cardinality_candidate, term count: 32, took: 50
task: index_1_doc_128_cardinality_candidate, term count: 64, took: 93
task: index_1_doc_8192_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8192_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_8192_cardinality_candidate, term count: 64, took: 13
task: index_1_doc_8192_cardinality_candidate, term count: 512, took: 42
task: index_1_doc_8192_cardinality_candidate, term count: 2048, took: 129
task: index_1_doc_1048576_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_1048576_cardinality_candidate, term count: 16, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 64, took: 9
task: index_1_doc_1048576_cardinality_candidate, term count: 512, took: 32
task: index_1_doc_1048576_cardinality_candidate, term count: 2048, took: 93
task: index_1_doc_8388608_cardinality_candidate, term count: 1, took: 2
task: index_1_doc_8388608_cardinality_candidate, term count: 16, took: 21
task: index_1_doc_8388608_cardinality_candidate, term count: 64, took: 38
task: index

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.


|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1.00|32.00|1.00|29.00|18.00|-37.93%|
|1.00|32.00|2.00|40.00|16.00|-60.00%|
|1.00|32.00|4.00|74.00|26.00|-64.86%|
|1.00|32.00|8.00|144.00|46.00|-68.06%|
|1.00|32.00|16.00|284.00|88.00|-69.01%|
|1.00|128.00|1.00|20.00|12.00|-40.00%|
|1.00|128.00|8.00|70.00|22.00|-68.57%|
|1.00|128.00|16.00|127.00|29.00|-77.17%|
|1.00|128.00|32.00|251.00|50.00|-80.08%|
|1.00|128.00|64.00|576.00|93.00|-83.85%|
|1.00|8192.00|1.00|2.00|2.00|0.00%|
|1.00|8192.00|16.00|11.00|9.00|-18.18%|
|1.00|8192.00|64.00|18.00|13.00|-27.78%|
|1.00|8192.00|512.00|88.00|42.00|-52.27%|
|1.00|8192.00|2048.00|266.00|129.00|-51.50%|
|1.00|1048576.00|1.00|3.00|2.00|-33.33%|
|1.00|1048576.00|16.00|11.00|9.00|-18.18%|
|1.00|1048576.00|64.00|8.00|9.00|12.50%|
|1.00|1048576.00|512.00|33.00|32.00|-3.03%|
|1.00|1048576.00|2048.00|97.00|93.00|-4.12%|
|1.00|8388608.00|1.00|4.00|2.00|-50.00%|
|1.00|8388608.00|16.00|20.00|21.00|5.00%|
|1.00|8388608.00|64.00|31.00|38.00|22.58%|
|1.00|8388608.00|512.00|70.00|73.00|4.29%|
|1.00|8388608.00|2048.00|209.00|204.00|-2.39%|

  was:
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.

*BaseLine*
{code:java}
task: index_1_doc_32_cardinality_baseline, term count: 1, took: 29
task: index_1_doc_32_cardinality_baseline, term count: 2, took: 40
task: index_1_doc_32_cardinality_baseline, term count: 4, took: 74
task: index_1_doc_32_cardinality_baseline, term count: 8, took: 144
task: index_1_doc_32_cardinality_baseline, term count: 16, took: 284
task: index_1_doc_128_cardinality_baseline, term count: 1, took: 20
task: index_1_doc_128_cardinality_baseline, term count: 8, took: 70
task: index_1_doc_128_cardinality_baseline, term count: 16, took: 127
task: index_1_doc_128_cardinality_baseline, term count: 32, took: 251
task: index_1_doc_128_cardinality_baseline, term count: 64, took: 576
task: index_1_doc_8192_cardinality_baseline, term count: 1, took: 2
task: index_1_doc_8192_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_8192_cardinality_baseline, term count: 64, took: 18
task: index_1_doc_8192_cardinality_baseline, term count: 512, took: 88
task: index_1_doc_8192_cardinality_baseline, term count: 2048, took: 266
task: index_1_doc_1048576_cardinality_baseline, term count: 1, took: 3
task: index_1_doc_1048576_cardinality_baseline, term count: 16, took: 11
task: index_1_doc_1048576_cardinality_baseline, term count: 64, took: 8
task: index_1_doc_1048576_cardinality_baseline, term count: 512, took: 33
task: index_1_doc_1048576_cardinality_baseline, term count: 2048, took: 97
task: index_1_doc_8388

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|29|18|-37.93%|
|1|32|2|40|16|-60.00%|
|1|32|4|74|26|-64.86%|
|1|32|8|144|46|-68.06%|
|1|32|16|284|88|-69.01%|
|1|128|1|20|12|-40.00%|
|1|128|8|70|22|-68.57%|
|1|128|16|127|29|-77.17%|
|1|128|32|251|50|-80.08%|
|1|128|64|576|93|-83.85%|
|1|8192|1|2|2|0.00%|
|1|8192|16|11|9|-18.18%|
|1|8192|64|18|13|-27.78%|
|1|8192|512|88|42|-52.27%|
|1|8192|2048|266|129|-51.50%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|11|9|-18.18%|
|1|1048576|64|8|9|12.50%|
|1|1048576|512|33|32|-3.03%|
|1|1048576|2048|97|93|-4.12%|
|1|8388608|1|4|2|-50.00%|
|1|8388608|16|20|21|5.00%|
|1|8388608|64|31|38|22.58%|
|1|8388608|512|70|73|4.29%|
|1|8388608|2048|209|204|-2.39%|

  was:
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.


|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1.00|32.00|1.00|29.00|18.00|-37.93%|
|1.00|32.00|2.00|40.00|16.00|-60.00%|
|1.00|32.00|4.00|74.00|26.00|-64.86%|
|1.00|32.00|8.00|144.00|46.00|-68.06%|
|1.00|32.00|16.00|284.00|88.00|-69.01%|
|1.00|128.00|1.00|20.00|12.00|-40.00%|
|1.00|128.00|8.00|70.00|22.00|-68.57%|
|1.00|128.00|16.00|127.00|29.00|-77.17%|
|1.00|128.00|32.00|251.00|50.00|-80.08%|
|1.00|128.00|64.00|576.00|93.00|-83.85%|
|1.00|8192.00|1.00|2.00|2.00|0.00%|
|1.00|8192.00|16.00|11.00|9.00|-18.18%|
|1.00|8192.00|64.00|18.00|13.00|-27.78%|
|1.00|8192.00|512.00|88.00|42.00|-52.27%|
|1.00|8192.00|2048.00|266.00|129.00|-51.50%|
|1.00|1048576.00|1.00|3.00|2.00|-33.33%|
|1.00|1048576.00|16.00|11.00|9.00|-18.18%|
|1.00|1048576.00|64.00|8.00|9.00|12.50%|
|1.00|1048576.00|512.00|33.00|32.00|-3.03%|
|1.00|1048576.00|2048.00|97.00|93.00|-4.12%|
|1.00|8388608.00|1.00|4.00|2.00|-50.00%|
|1.00|8388608.00|16.00|20.00|21.00|5.00%|
|1.00|8388608.00|64.00|31.00|38.00|22.58%|
|1.00|8388608.00|512.00|70.00|73.00|4.29%|
|1.00|8388608.00|2048.00|209.00|204.00|-2.39%|


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Though we already have a bitset optimization for low cardinality fields, the 
> optimization usually only works on extremely low cardinality fields 
> (cardinality < 16); medium cardinality cases like 30 or 100 can

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc|cardinality|query term|baseline (ms)|candidate (ms)|diff|
|1|32|1|29|18|-37.93%|
|1|32|2|40|16|-60.00%|
|1|32|4|74|26|-64.86%|
|1|32|8|144|46|-68.06%|
|1|32|16|284|88|-69.01%|
|1|128|1|20|12|-40.00%|
|1|128|8|70|22|-68.57%|
|1|128|16|127|29|-77.17%|
|1|128|32|251|50|-80.08%|
|1|128|64|576|93|-83.85%|
|1|8192|1|2|2|0.00%|
|1|8192|16|11|9|-18.18%|
|1|8192|64|18|13|-27.78%|
|1|8192|512|88|42|-52.27%|
|1|8192|2048|266|129|-51.50%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|11|9|-18.18%|
|1|1048576|64|8|9|12.50%|
|1|1048576|512|33|32|-3.03%|
|1|1048576|2048|97|93|-4.12%|
|1|8388608|1|4|2|-50.00%|
|1|8388608|16|20|21|5.00%|
|1|8388608|64|31|38|22.58%|
|1|8388608|512|70|73|4.29%|
|1|8388608|2048|209|204|-2.39%|

  was:
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|29|18|-37.93%|
|1|32|2|40|16|-60.00%|
|1|32|4|74|26|-64.86%|
|1|32|8|144|46|-68.06%|
|1|32|16|284|88|-69.01%|
|1|128|1|20|12|-40.00%|
|1|128|8|70|22|-68.57%|
|1|128|16|127|29|-77.17%|
|1|128|32|251|50|-80.08%|
|1|128|64|576|93|-83.85%|
|1|8192|1|2|2|0.00%|
|1|8192|16|11|9|-18.18%|
|1|8192|64|18|13|-27.78%|
|1|8192|512|88|42|-52.27%|
|1|8192|2048|266|129|-51.50%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|11|9|-18.18%|
|1|1048576|64|8|9|12.50%|
|1|1048576|512|33|32|-3.03%|
|1|1048576|2048|97|93|-4.12%|
|1|8388608|1|4|2|-50.00%|
|1|8388608|16|20|21|5.00%|
|1|8388608|64|31|38|22.58%|
|1|8388608|512|70|73|4.29%|
|1|8388608|2048|209|204|-2.39%|


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Though we already have a bitset optimization for low cardinality fields, the 
> optimization usually only works on extremely low cardinality fields 
> (cardinality < 16); medium cardinality cases like 30 or 100 rarely get 
> this optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we were trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high cardinali

[GitHub] [lucene-solr] rmuir commented on pull request #2625: Added bulkclose feature to the githubPRs script

2021-12-08 Thread GitBox


rmuir commented on pull request #2625:
URL: https://github.com/apache/lucene-solr/pull/2625#issuecomment-988751134


   -1 to adding bulk close functionality





[GitHub] [lucene-solr] rmuir commented on pull request #2625: Added bulkclose feature to the githubPRs script

2021-12-08 Thread GitBox


rmuir commented on pull request #2625:
URL: https://github.com/apache/lucene-solr/pull/2625#issuecomment-988760471


   That's an actual veto. justification: read the fucking mailing list thread, 
see how @janhoy tried to "slip this in" under the pretense of a +1. Several of 
us are against bulk-closing on the thread. It is more people, than are for it. 
Consensus is not with you. Passive-aggressive shit like this doesn't help.





[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|

  was:
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc|cardinality|query term|baseline (ms)|candidate (ms)|diff|
|1|32|1|29|18|-37.93%|
|1|32|2|40|16|-60.00%|
|1|32|4|74|26|-64.86%|
|1|32|8|144|46|-68.06%|
|1|32|16|284|88|-69.01%|
|1|128|1|20|12|-40.00%|
|1|128|8|70|22|-68.57%|
|1|128|16|127|29|-77.17%|
|1|128|32|251|50|-80.08%|
|1|128|64|576|93|-83.85%|
|1|8192|1|2|2|0.00%|
|1|8192|16|11|9|-18.18%|
|1|8192|64|18|13|-27.78%|
|1|8192|512|88|42|-52.27%|
|1|8192|2048|266|129|-51.50%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|11|9|-18.18%|
|1|1048576|64|8|9|12.50%|
|1|1048576|512|33|32|-3.03%|
|1|1048576|2048|97|93|-4.12%|
|1|8388608|1|4|2|-50.00%|
|1|8388608|16|20|21|5.00%|
|1|8388608|64|31|38|22.58%|
|1|8388608|512|70|73|4.29%|
|1|8388608|2048|209|204|-2.39%|


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Though we already have a bitset optimization for low cardinality fields, the 
> optimization usually only works on extremely low cardinality fields 
> (cardinality < 16); medium cardinality cases like 30 or 100 rarely get 
> this optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we were trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high cardin

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|

  was:
Though we already have a bitset optimization for low cardinality fields, the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization usually only works on extremely low cardinality fields 
> (cardinality < 16); medium cardinality cases like 30 or 100 rarely get 
> this optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we were trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high c

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (cardinality < 
16); medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization usually only works on extremely low cardinality fields 
(cardinality < 16); medium cardinality cases like 30 or 100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
from this approach. I think this is because we were trying to optimize the 
unsorted situation, which typically happens for high cardinality fields, and 
the bottleneck of queries on high cardinality fields is usually visitDocValues 
rather than readDocIds. Medium cardinality fields may be tempting targets for 
this optimization :)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
sped up and the high cardinality fields got even results.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (cardinality < 
> 16); medium cardinality cases like 30 or 100 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we were trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high cardinality fields is usu

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 30/100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get a obvious gain on 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempted for this optimization 
:)

I benchmarked the optimization by mocking some random longPoint and querying 
them with PointInSetQuery. As expected, the medium cardinality fields got spped 
up and high cardinality fields get even results.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (cardinality < 16); 
medium cardinality cases like 30 or 100 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (doc count > 
> 1/16 of total docs); medium cardinality cases like 30/100 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we are trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high cardinality fields is usually 
> visitDocValues but not readDocIds.

[GitHub] [lucene-solr] janhoy closed pull request #182: SOLR-10415 - improve debug logging to use parameterized logging

2021-12-08 Thread GitBox


janhoy closed pull request #182:
URL: https://github.com/apache/lucene-solr/pull/182


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] janhoy closed pull request #185: SOLR-10487: Support to specify connection and socket read timeout in DataImportHandler for SolrEntityProcessor.

2021-12-08 Thread GitBox


janhoy closed pull request #185:
URL: https://github.com/apache/lucene-solr/pull/185


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] janhoy closed pull request #690: SOLR-13517: [ UX improvement ] Dashboard will now store query and filter parameters on page change a…

2021-12-08 Thread GitBox


janhoy closed pull request #690:
URL: https://github.com/apache/lucene-solr/pull/690


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] janhoy closed pull request #740: SOLR-12550 - distribUpdateSoTimeout for configuring socket timeouts in solrcloud doesn't take effect for updates.

2021-12-08 Thread GitBox


janhoy closed pull request #740:
URL: https://github.com/apache/lucene-solr/pull/740


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a change in pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries

2021-12-08 Thread GitBox


dweiss commented on a change in pull request #521:
URL: https://github.com/apache/lucene/pull/521#discussion_r764938074



##
File path: 
lucene/queries/src/java/org/apache/lucene/queries/intervals/Intervals.java
##
@@ -275,7 +275,10 @@ public static IntervalsSource ordered(IntervalsSource... 
subSources) {
   }
 
   /**
-   * Create an unordered {@link IntervalsSource}
+   * Create an unordered {@link IntervalsSource}. Note that if there are 
multiple intervals ends at

Review comment:
   There is no overlap indeed - one interval is 'a b', the other 'c d' (the 
smallest possible variant). This is tricky - I agree -- but it does not negate 
the utility of the entire concept.
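   A minimal sketch of the case being discussed, using the public Intervals 
API (the terms 'a b' / 'c d' are illustrative):

{code:java}
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;

// Two phrase sub-sources that can match side by side without overlapping.
IntervalsSource ab = Intervals.phrase(Intervals.term("a"), Intervals.term("b"));
IntervalsSource cd = Intervals.phrase(Intervals.term("c"), Intervals.term("d"));

// The unordered combination matches text like "a b c d" or "c d a b",
// where the two sub-intervals are disjoint.
IntervalsSource unordered = Intervals.unordered(ab, cd);
{code}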




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 30/100 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (doc count > 
> 1/16 of total docs); medium cardinality cases like 32/128 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we are trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high cardinality fields is usually 
> visitDocValues but not readDocIds.

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.


|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|


 

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.
 
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
|1|8388608|1|4|3|-25.00%|
|1|8388608|16|24|21|-12.50%|
|1|8388608|64|46|45|-2.17%|
|1|8388608|512|121|127|4.96%|
|1|8388608|2048|193|207|7.25%|


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (doc count > 
> 1/16 of total docs); medium cardinality cases like 32/128 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we are trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high cardinality fields is usually 
> visitDocValues but not readDocIds.

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.


|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|


 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (doc count > 
> 1/16 of total docs); medium cardinality cases like 32/128 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we are trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high cardinality fields is usually 
> visitDocValues but not readDocIds.

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.



|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (doc count > 
> 1/16 of total docs); medium cardinality cases like 32/128 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we are trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields, and 
> the bottleneck of queries on high cardinality fields is usually 
> visitDocValues but not readDocIds.

[GitHub] [lucene] magibney commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

2021-12-08 Thread GitBox


magibney commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-988928174


   Thanks for the nudge, @fmmoret.
   
   I think if we introduce this change, we should really avoid [needlessly 
building and throwing 
away](https://github.com/apache/lucene/pull/380#discussion_r750515187) the 
stringified dictionary. @spyk is this something you'd be interested in pursuing 
(i.e., pushing a new commit to your PR branch)? Lmk if not and I'll try (or 
Alessandro, per his earlier comment?) to move it along.
   
   >Ideally, opennlp would have a DictionaryLemmatizer ctor that accepts a 
Reader directly -- I can't imagine that would be a controversial upstream PR?
   
   I don't think concerns over the default character encoding issue should hold 
things up. We're not making anything worse wrt the default encoding assumption. 
A simple `TODO` comment should suffice. I think we should circle back (I should 
be able to find the time for this if nobody else steps forward) to actually 
address such a `TODO` as a separate issue/PR, following something like the 
`InputStreamReader` approach I mentioned above (trusting someone will 
contradict me if they disagree with this proposed approach!).
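   As a rough sketch of that approach (the Reader-accepting constructor is the 
proposed upstream addition, not an existing opennlp API, and the 
resource-loading call is an assumption):

{code:java}
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import opennlp.tools.lemmatizer.DictionaryLemmatizer;

// Hypothetical usage once DictionaryLemmatizer gains a Reader constructor:
// the charset is then stated explicitly instead of relying on the platform
// default when the dictionary stream is decoded.
InputStream dictStream = loader.openResource(dictionaryFile); // assumed loader/field
Reader dictReader = new InputStreamReader(dictStream, StandardCharsets.UTF_8);
DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictReader); // proposed ctor
{code}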


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on pull request #1064: LUCENE-9084: circular synchronization wait (potential deadlock) in AnalyzingInfixSuggester

2021-12-08 Thread GitBox


mikemccand commented on pull request #1064:
URL: https://github.com/apache/lucene-solr/pull/1064#issuecomment-988973336


   It looks like this one was indeed merged -- closing the PR.  Thank you 
@paulward24!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand closed pull request #1064: LUCENE-9084: circular synchronization wait (potential deadlock) in AnalyzingInfixSuggester

2021-12-08 Thread GitBox


mikemccand closed pull request #1064:
URL: https://github.com/apache/lucene-solr/pull/1064


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on pull request #906: LUCENE-8996: maxScore is sometimes missing from distributed responses

2021-12-08 Thread GitBox


mikemccand commented on pull request #906:
URL: https://github.com/apache/lucene-solr/pull/906#issuecomment-988978401


   Hmm, I see this [src fix was committed, but the new unit test was not 
committed](https://github.com/apache/lucene/commit/49631ace9f1ee110d52a207377e4926baef74929)
 -- was that intentional?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. 

Maybe medium cardinality fields are tempting for this optimization. The basic 
idea is to compute the deltas of the sorted ids and encode/decode them like 
what we do in StoredFieldsInts (an illustrative sketch follows the benchmark 
table below). I benchmarked the optimization by mocking some random LongPoint 
fields and querying them with PointInSetQuery. As expected, the medium 
cardinality fields got sped up while the high cardinality fields stayed about 
even.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|
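
A minimal sketch of the delta idea (illustrative only: the class and method 
names are made up, and it uses vInts for brevity; the real candidate would 
pack the deltas into fixed-width words so they can be bulk-decoded with 
readLELongs):

{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

class SortedDocIdsCodec {
  // Sort the block's doc ids and store the first id plus deltas. Sorted
  // deltas are small non-negative numbers, so they compress well and can
  // be decoded sequentially without branching on sign.
  static void writeSortedDocIds(int[] docIds, int count, DataOutput out) throws IOException {
    Arrays.sort(docIds, 0, count);
    int prev = docIds[0];
    out.writeVInt(prev);
    for (int i = 1; i < count; i++) {
      out.writeVInt(docIds[i] - prev); // delta >= 0 because the ids are sorted
      prev = docIds[i];
    }
  }

  // Decode by accumulating the deltas back into absolute doc ids.
  static void readSortedDocIds(DataInput in, int count, int[] docIds) throws IOException {
    int doc = in.readVInt();
    docIds[0] = doc;
    for (int i = 1; i < count; i++) {
      doc += in.readVInt();
      docIds[i] = doc;
    }
  }
}
{code}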

 

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. Medium cardinality fields may be tempting for this optimization 
:)

I benchmarked the optimization by mocking some random LongPoint fields and 
querying them with PointInSetQuery. As expected, the medium cardinality fields 
got sped up while the high cardinality fields stayed about even.



|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (doc count > 
> 1/16 of total docs); medium cardinality cases like 32/128 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. I think this is because we are trying to optimize the 
> unsorted situation, which typically happens for high cardinality fields.

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually {{visitDocValues}} 
but not {{readDocIds}}. 

Maybe medium cardinality fields are tempting for this optimization. The basic 
idea is to compute the deltas of the sorted ids and encode/decode them like 
what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking 
some random LongPoint fields and querying them with {{PointInSetQuery}}. As 
expected, the medium cardinality fields got sped up while the high cardinality 
fields stayed about even.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually {{visitDocValues}} 
but not {{readDocIds}}. 

Maybe medium cardinality fields are tempting for this optimization. The basic 
idea is to compute the deltas of the sorted ids and encode/decode them like 
what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking 
some random LongPoint fields and querying them with PointInSetQuery. As 
expected, the medium cardinality fields got sped up while the high cardinality 
fields stayed about even.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (doc count > 
> 1/16 of total docs); medium cardinality cases like 32/128 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach.

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually {{visitDocValues}} 
but not {{readDocIds}}. 

Maybe medium cardinality fields are tempting for this optimization. The basic 
idea is to compute the deltas of the sorted ids and encode/decode them like 
what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking 
some random LongPoint fields and querying them with PointInSetQuery. As 
expected, the medium cardinality fields got sped up while the high cardinality 
fields stayed about even.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually visitDocValues but 
not readDocIds. 

Maybe medium cardinality fields are tempting for this optimization. The basic 
idea is to compute the deltas of the sorted ids and encode/decode them like 
what we do in StoredFieldsInts. I benchmarked the optimization by mocking some 
random LongPoint fields and querying them with PointInSetQuery. As expected, 
the medium cardinality fields got sped up while the high cardinality fields 
stayed about even.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but the 
> optimization only works on extremely low cardinality fields (doc count > 
> 1/16 of total docs); medium cardinality cases like 32/128 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks.

[jira] [Updated] (LUCENE-10259) Luke does not start with whitespace in unzipped directory.

2021-12-08 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10259:
--
Fix Version/s: (was: 9.x)

> Luke does not start with whitespace in unzipped directory.
> --
>
> Key: LUCENE-10259
> URL: https://issues.apache.org/jira/browse/LUCENE-10259
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
> Fix For: 9.0, 10.0 (main)
>
> Attachments: screenshot-1.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> When you start Luke on Windows, nothing happens. No error message, nothing. 
> This happens for users that have whitespace in their username ("Uwe 
> Schindler") and unzip the tgz file to the desktop.
> This also affects the Linux shell script, but it is less likely to be hit 
> there.
> The fix is easy: add quotes around the module-path in both shell scripts.
> I think we should respin.
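
For illustration, the kind of change involved (the variable and module/class 
names here are assumptions, not the exact script contents):

{code:sh}
# Unquoted, a module path such as "C:\Users\Uwe Schindler\luke\..." splits
# on the space and the JVM never starts:
java -p $LUKE_MODULES -m org.apache.lucene.luke/org.apache.lucene.luke.app.desktop.LukeMain

# Quoting keeps the module path a single argument:
java -p "$LUKE_MODULES" -m org.apache.lucene.luke/org.apache.lucene.luke.app.desktop.LukeMain
{code}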



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10287) Add jdk.unsupported module to Luke startup script

2021-12-08 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10287:
--
Fix Version/s: (was: 9.x)

> Add jdk.unsupported module to Luke startup script
> -
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main)
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See my note on the JDK 9.0 release: when you start Luke (in module mode, as 
> done by default), it won't use MMapDirectory when opening indexes. The reason 
> is simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte 
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.
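
For illustration, the kind of launcher change this implies (script variables 
and module/class names are assumptions):

{code:sh}
# Adding jdk.unsupported makes sun.misc.Unsafe visible, so MMapDirectory can
# unmap byte buffers instead of silently disabling itself:
java --add-modules jdk.unsupported -p "$LUKE_MODULES" \
  -m org.apache.lucene.luke/org.apache.lucene.luke.app.desktop.LukeMain
{code}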



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search

2021-12-08 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455386#comment-17455386
 ] 

Julie Tibshirani commented on LUCENE-10040:
---

Thanks for posting, I found Weaviate's blog helpful as I was thinking through 
this issue!

> Handle deletions in nearest vector search
> -
>
> Key: LUCENE-10040
> URL: https://issues.apache.org/jira/browse/LUCENE-10040
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Assignee: Julie Tibshirani
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even 
> if a document is not in {{LeafReader#getLiveDocs}}, it could still be 
> returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be 
> surprising + difficult for users, since other search APIs account for deleted 
> docs. We've discussed extending the search logic to take a parameter like 
> {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. 
> This behavior seems hard to work with as a user: fewer than {{k}} docs might 
> come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the 
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs 
> while assembling its candidate list. It would traverse further into the 
> graph, visiting more nodes to ensure it gathers the required candidates. 
> (Note deleted docs would still be visited/ traversed). The [hnswlib 
> library|https://github.com/nmslib/hnswlib] contains an implementation like 
> this, where you can mark documents as deleted and they're skipped during 
> search.
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't 
> have a huge effect. But in the worst case, a segment could have 50% deleted 
> docs, and they all happen to be near the query vector. HNSW would need to 
> traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into 
> how well this performs in terms of recall. I have a vague intuition it could 
> be harder to achieve high recall as the algorithm traverses areas further 
> from the "natural" entry points. The HNSW paper doesn't mention deletions/ 
> filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: 
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very 
> similar to applying deleted docs): 
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. 
> The "Exact vs Approximate" section shows good performance even when a large 
> percentage of documents are filtered out. The team mentioned to me they 
> didn't have the chance to measure recall, only latency.
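
A rough sketch of the second option with a {{Bits liveDocs}} parameter 
(illustrative pseudocode of the candidate-gathering loop; the graph, vector, 
and queue helpers are assumptions, not Lucene's actual HNSW code):

{code:java}
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

// Deleted docs are still traversed (their neighbors may be live), but they
// are never added to the top-k result queue, so the k nearest *undeleted*
// docs come back. 'candidates' and 'results' are assumed priority queues.
void expand(int candidate, float[] query, Bits liveDocs, FixedBitSet visited) {
  for (int friend : neighborsOf(candidate)) { // assumed graph accessor
    if (visited.getAndSet(friend)) {
      continue; // already seen
    }
    float score = compare(query, vectorOf(friend)); // assumed similarity helper
    candidates.add(friend, score); // keep traversing through deleted docs too
    if (liveDocs == null || liveDocs.get(friend)) {
      results.insertWithOverflow(friend, score); // only live docs are returned
    }
  }
}
{code}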



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10114) Remove unused byte order mark in Lucene90PostingsWriter

2021-12-08 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10114.
---
Resolution: Fixed

> Remove unused byte order mark in Lucene90PostingsWriter
> ---
>
> Key: LUCENE-10114
> URL: https://issues.apache.org/jira/browse/LUCENE-10114
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/index
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the byte order in lucene index, I found the following code in 
> {{Lucene90PostingsWriter}}:
> {code:java}
> ByteOrder byteOrder = ByteOrder.nativeOrder();
> if (byteOrder == ByteOrder.BIG_ENDIAN) {
>   docOut.writeByte((byte) 'B');
> } else if (byteOrder == ByteOrder.LITTLE_ENDIAN) {
>   docOut.writeByte((byte) 'L');
> } else {
>   throw new Error();
> }
> {code}
> Actually this byte is consumed nowhere, as the file is only used via seeking 
> and the offsets are just 1 larger. We should remove this code.
> Why was this added?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Closed] (LUCENE-10114) Remove unused byte order mark in Lucene90PostingsWriter

2021-12-08 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand closed LUCENE-10114.
-

> Remove unused byte order mark in Lucene90PostingsWriter
> ---
>
> Key: LUCENE-10114
> URL: https://issues.apache.org/jira/browse/LUCENE-10114
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/index
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the byte order in lucene index, I found the following code in 
> {{Lucene90PostingsWriter}}:
> {code:java}
> ByteOrder byteOrder = ByteOrder.nativeOrder();
> if (byteOrder == ByteOrder.BIG_ENDIAN) {
>   docOut.writeByte((byte) 'B');
> } else if (byteOrder == ByteOrder.LITTLE_ENDIAN) {
>   docOut.writeByte((byte) 'L');
> } else {
>   throw new Error();
> }
> {code}
> Actually this byte is consumed nowhere, as the file is only used via seeking 
> and the offsets are just 1 larger. We should remove this code.
> Why was this added?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Closed] (LUCENE-9484) Allow index sorting to happen after the fact

2021-12-08 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand closed LUCENE-9484.


> Allow index sorting to happen after the fact
> 
>
> Key: LUCENE-9484
> URL: https://issues.apache.org/jira/browse/LUCENE-9484
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I did look into sorting an index after it was created and found that with 
> some smallish modifications we can actually allow that by piggybacking on 
> SortingLeafReader and addIndices in a pretty straightforward and simple way. 
> With some smallish modifications / fixes to SortingLeafReader we can just 
> merge an unsorted index into a sorted index using a fresh index writer.
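
A condensed sketch of that approach using public Lucene APIs (error handling 
and most configuration trimmed; the class and method names are made up):

{code:java}
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SlowCodecReaderWrapper;
import org.apache.lucene.search.Sort;
import org.apache.lucene.store.Directory;

class SortIndexAfterTheFact {
  // Rewrite an existing (unsorted) index into a sorted one by feeding its
  // leaves to a fresh writer whose config declares the index sort.
  static void sortIndex(Directory unsorted, Directory target, Sort indexSort) throws Exception {
    IndexWriterConfig cfg = new IndexWriterConfig().setIndexSort(indexSort);
    try (IndexWriter writer = new IndexWriter(target, cfg);
        DirectoryReader reader = DirectoryReader.open(unsorted)) {
      CodecReader[] leaves = new CodecReader[reader.leaves().size()];
      for (int i = 0; i < leaves.length; i++) {
        // SlowCodecReaderWrapper lets addIndexes consume arbitrary leaf
        // readers; the writer sorts the documents while merging them in.
        leaves[i] = SlowCodecReaderWrapper.wrap(reader.leaves().get(i).reader());
      }
      writer.addIndexes(leaves);
      writer.forceMerge(1); // optional: end up with a single sorted segment
    }
  }
}
{code}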



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9484) Allow index sorting to happen after the fact

2021-12-08 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9484.
--
Resolution: Fixed

> Allow index sorting to happen after the fact
> 
>
> Key: LUCENE-9484
> URL: https://issues.apache.org/jira/browse/LUCENE-9484
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I did look into sorting an index after it was created and found that with 
> some smallish modifications we can actually allow that by piggybacking on 
> SortingLeafReader and addIndices in a pretty straightforward and simple way. 
> With some smallish modifications / fixes to SortingLeafReader we can just 
> merge an unsorted index into a sorted index using a fresh index writer.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but the 
optimization only works on extremely low cardinality fields (doc count > 1/16 
of total docs); medium cardinality cases like 32/128 rarely get this 
optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we are trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually {{visitDocValues}} 
but not {{readDocIds}}?

Maybe medium cardinality fields are tempting for this optimization. The basic 
idea is to compute the deltas of the sorted ids and encode/decode them like 
what we do in {{StoredFieldsInts}}. I benchmarked the optimization by mocking 
some random LongPoint fields and querying them with {{PointInSetQuery}}. As 
expected, the medium cardinality fields got sped up while the high cardinality 
fields stayed about even.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think this is because we were trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually {{visitDocValues}} 
rather than {{readDocIds}}.

Maybe medium cardinality fields are tempting targets for this optimization. The 
basic idea is to compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields showed even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but it only 
> works on extremely low cardinality fields (doc count > 1/16 of the total doc 
> count); medium cardinality cases like 32/128 rarely get this optimization.
> In [https://github.com/apache/lucen

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we were trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}} rather 
than {{readDocIds}}?

But I think medium cardinality fields may be tempting for this optimization. 
The basic idea is that we can compute the deltas of the sorted ids and 
encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the 
optimization by mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields showed even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we were trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is usually {{visitDocValues}} 
rather than {{readDocIds}}?

Maybe medium cardinality fields are tempting targets for this optimization. The 
basic idea is to compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields showed even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but it only 
> works on extremely low cardinality fields (doc count > 1/16 of the total doc 
> count); medium cardinality cases like 32/128 rarely get this optimization.
> In [https://github.com/apach

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we were trying to optimize the unsorted 
situation (which typically happens for high cardinality fields) and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}} rather 
than {{readDocIds}}?

IMO medium cardinality fields may be tempting for this optimization because 
they need to read lots of ids. The basic idea is that we can compute the deltas 
of the sorted ids and encode/decode them like what we do in 
{{StoredFieldsInts}}. I benchmarked the optimization by mocking some random 
LongPoint fields and querying them with {{PointInSetQuery}}. As expected, the 
medium cardinality fields got sped up and the high cardinality fields showed 
even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we were trying to optimize the unsorted 
situation, which typically happens for high cardinality fields, and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}} rather 
than {{readDocIds}}?

But I think medium cardinality fields may be tempting for this optimization. 
The basic idea is that we can compute the deltas of the sorted ids and 
encode/decode them like what we do in {{StoredFieldsInts}}. I benchmarked the 
optimization by mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields showed even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but it only 
> works on extremely low cardinality fields (doc count > 1/16 of the total doc 
> count); medium cardinality cases like 32/128 rarely get this 
> optimizatio

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we were trying to optimize the unsorted 
situation (which typically happens for high cardinality fields) and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}} rather 
than {{readDocIds}}?

IMO medium cardinality fields may be tempting for this optimization because 
they need to read lots of ids for one term. The basic idea is that we can 
compute the deltas of the sorted ids and encode/decode them like what we do in 
{{StoredFieldsInts}}. I benchmarked the optimization by mocking some random 
LongPoint fields and querying them with {{PointInSetQuery}}. As expected, the 
medium cardinality fields got sped up and the high cardinality fields showed 
even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count); medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we were trying to optimize the unsorted 
situation (which typically happens for high cardinality fields) and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}} rather 
than {{readDocIds}}?

IMO medium cardinality fields may be tempting for this optimization because 
they need to read lots of ids. The basic idea is that we can compute the deltas 
of the sorted ids and encode/decode them like what we do in 
{{StoredFieldsInts}}. I benchmarked the optimization by mocking some random 
LongPoint fields and querying them with {{PointInSetQuery}}. As expected, the 
medium cardinality fields got sped up and the high cardinality fields showed 
even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but it only 
> works on extremely low cardinality fields (doc count > 1/16 of the total doc 
> count); medium cardinality cases like 32/1

[jira] [Created] (LUCENE-10298) dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless

2021-12-08 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10298:
-

 Summary: dev-tools/scripts/addBackcompatIndexes.py doesn't work 
well with spotless
 Key: LUCENE-10298
 URL: https://issues.apache.org/jira/browse/LUCENE-10298
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand


addBackcompatIndexes.py expects that lists of index names have one entry per 
line, e.g.

{code}
static final String[] oldNames = {
  ""
}
{code}

However, when the array is small, Spotless forces the array to be written on a 
single line, and addBackcompatIndexes.py no longer recognizes the structure of 
the file.

It's probably fixable, but my Python skills are not good enough. Or maybe this 
file should be one of the rare ones that we exclude from Spotless?






[GitHub] [lucene] gf2121 commented on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-08 Thread GitBox


gf2121 commented on pull request #510:
URL: https://github.com/apache/lucene/pull/510#issuecomment-989104054


   @iverase Thanks for your explanation!
   
   > I worked on the PR about using #readLELongs but never got a meaningful 
speed up that justified the added complexity.
   
   I find that we were trying to use #readLELongs to speed up the 24/32 bit 
situation in the `DocIdsWriter`, which means the ids in the block are unsorted, 
as typically happens for high cardinality fields. I think queries on high 
cardinality fields spend most of their time in `visitDocValues` rather than 
`readDocIds`, so maybe that is why we cannot see an obvious gain on the E2E 
side?
   
   My current thought is to use readLELongs to speed up the **sorted** ids 
situation (i.e. low or medium cardinality fields), whose bottleneck is reading 
docIds. For sorted arrays, we can compute the deltas of the sorted ids and 
encode/decode them like what we do in `StoredFieldsInts`.
   
   I raised an [ISSUE](https://issues.apache.org/jira/browse/LUCENE-10297) 
based on this idea. The benchmark results I posted in the issue look promising. 
Would you like to take a look when you are free? Thanks!
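   To make the encode/decode idea concrete, here is a hedged sketch (my own 
illustration, not code from any patch; it assumes Lucene's DataOutput and 
PackedInts utilities and a sorted id block):

{code}
// Hedged sketch: delta-encode a sorted block of doc ids and pack the deltas
// at a fixed bit width into longs, so a reader could later bulk-decode them
// with the readLELongs-style API discussed above.
static void writeSortedDocIds(int[] ids, int count, DataOutput out) throws IOException {
  long maxDelta = 0;
  for (int i = 1; i < count; i++) {
    maxDelta = Math.max(maxDelta, ids[i] - ids[i - 1]); // sorted => non-negative deltas
  }
  int bpv = Math.max(1, PackedInts.bitsRequired(maxDelta));
  out.writeVInt(ids[0]);      // first id stored as-is
  out.writeByte((byte) bpv);  // fixed bit width shared by all deltas
  long buffer = 0;
  int bitsUsed = 0;
  for (int i = 1; i < count; i++) {
    long delta = ids[i] - ids[i - 1];
    if (bitsUsed + bpv > 64) { // simplified: never split a value across longs
      out.writeLong(buffer);
      buffer = 0;
      bitsUsed = 0;
    }
    buffer |= delta << bitsUsed;
    bitsUsed += bpv;
  }
  if (bitsUsed > 0) {
    out.writeLong(buffer);
  }
}
{code}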





[jira] [Commented] (LUCENE-10298) dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless

2021-12-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455935#comment-17455935
 ] 

Dawid Weiss commented on LUCENE-10298:
--

I wouldn't make such exceptions. They're hard to maintain... A better solution 
would be to read this list from a resource instead. A hacky way would be to 
force the line break with a // comment after the bracket, e.g.:
{code}
static final String[] oldNames = {
  // auto-updated list starts here
  ""
  // list ends here.
}
{code}

> dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless
> -
>
> Key: LUCENE-10298
> URL: https://issues.apache.org/jira/browse/LUCENE-10298
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> addBackcompatIndexes.py expects that lists of index names have one entry per 
> line, e.g.
> {code}
> static final String[] oldNames = {
>   ""
> }
> {code}
> However, when the array is small, Spotless forces the array to be written on 
> a single line, and addBackcompatIndexes.py no longer recognizes the structure 
> of the file.
> It's probably fixable, but my Python skills are not good enough. Or maybe 
> this file should be one of the rare ones that we exclude from Spotless?






[jira] [Resolved] (LUCENE-10040) Handle deletions in nearest vector search

2021-12-08 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-10040.
---
Fix Version/s: 9.0
   Resolution: Fixed

> Handle deletions in nearest vector search
> -
>
> Key: LUCENE-10040
> URL: https://issues.apache.org/jira/browse/LUCENE-10040
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Assignee: Julie Tibshirani
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even 
> if a document is not in {{LeafReader#getLiveDocs}}, it could still be 
> returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be 
> surprising + difficult for users, since other search APIs account for deleted 
> docs. We've discussed extending the search logic to take a parameter like 
> {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. 
> This behavior seems hard to work with as a user: fewer than {{k}} docs might 
> come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the 
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs 
> while assembling its candidate list. It would traverse further into the 
> graph, visiting more nodes to ensure it gathers the required candidates. 
> (Note deleted docs would still be visited/traversed). The [hnswlib 
> library|https://github.com/nmslib/hnswlib] contains an implementation like 
> this, where you can mark documents as deleted and they're skipped during 
> search.
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't 
> have a huge effect. But in the worst case, a segment could have 50% deleted 
> docs, and they all happen to be near the query vector. HNSW would need to 
> traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into 
> how well this performs in terms of recall. I have a vague intuition it could 
> be harder to achieve high recall as the algorithm traverses areas further 
> from the "natural" entry points. The HNSW paper doesn't mention deletions/ 
> filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: 
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very 
> similar to applying deleted docs): 
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. 
> The "Exact vs Approximate" section shows good performance even when a large 
> percentage of documents are filtered out. The team mentioned to me they 
> didn't have the chance to measure recall, only latency.
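For illustration, a rough sketch of the second option (hedged: helper names 
such as candidates, results, graph, vectors and distance are hypothetical; this 
shows the traversal idea, not the committed code):

{code}
// Hedged sketch: deleted docs still act as graph waypoints, but only live
// docs are collected, so the search keeps widening until k live results
// have been gathered.
while (candidates.isEmpty() == false && results.size() < k) {
  int node = candidates.popNearest();
  if (liveDocs == null || liveDocs.get(node)) {
    results.add(node); // only live docs become results
  }
  for (int neighbor : graph.neighborsOf(node)) {
    if (visited.getAndSet(neighbor) == false) {
      candidates.add(neighbor, distance(query, vectors.get(neighbor)));
    }
  }
}
{code}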






[GitHub] [lucene] jtibshirani merged pull request #527: LUCENE-10040: Add test for vector search with skewed deletions

2021-12-08 Thread GitBox


jtibshirani merged pull request #527:
URL: https://github.com/apache/lucene/pull/527


   





[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search

2021-12-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455944#comment-17455944
 ] 

ASF subversion and git services commented on LUCENE-10040:
--

Commit 5d39bca87a44f51e5d556bb0a7e8c28df3f539fa in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5d39bca ]

LUCENE-10040: Add test for vector search with skewed deletions (#527)

This exercises a challenging case where the documents to skip all happen to
be closest to the query vector. In many cases, HNSW appears to be robust to this
case and maintains good recall.

> Handle deletions in nearest vector search
> -
>
> Key: LUCENE-10040
> URL: https://issues.apache.org/jira/browse/LUCENE-10040
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Assignee: Julie Tibshirani
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even 
> if a document is not in {{LeafReader#getLiveDocs}}, it could still be 
> returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be 
> surprising + difficult for users, since other search APIs account for deleted 
> docs. We've discussed extending the search logic to take a parameter like 
> {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. 
> This behavior seems hard to work with as a user: fewer than {{k}} docs might 
> come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the 
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs 
> while assembling its candidate list. It would traverse further into the 
> graph, visiting more nodes to ensure it gathers the required candidates. 
> (Note deleted docs would still be visited/traversed). The [hnswlib 
> library|https://github.com/nmslib/hnswlib] contains an implementation like 
> this, where you can mark documents as deleted and they're skipped during 
> search.
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't 
> have a huge effect. But in the worst case, a segment could have 50% deleted 
> docs, and they all happen to be near the query vector. HNSW would need to 
> traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into 
> how well this performs in terms of recall. I have a vague intuition it could 
> be harder to achieve high recall as the algorithm traverses areas further 
> from the "natural" entry points. The HNSW paper doesn't mention deletions/ 
> filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: 
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very 
> similar to applying deleted docs): 
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. 
> The "Exact vs Approximate" section shows good performance even when a large 
> percentage of documents are filtered out. The team mentioned to me they 
> didn't have the chance to measure recall, only latency.






[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search

2021-12-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455949#comment-17455949
 ] 

ASF subversion and git services commented on LUCENE-10040:
--

Commit 394472d4b8e40504f0521df340df446089a7afff in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=394472d ]

LUCENE-10040: Add test for vector search with skewed deletions (#527)

This exercises a challenging case where the documents to skip all happen to
be closest to the query vector. In many cases, HNSW appears to be robust to this
case and maintains good recall.

> Handle deletions in nearest vector search
> -
>
> Key: LUCENE-10040
> URL: https://issues.apache.org/jira/browse/LUCENE-10040
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Assignee: Julie Tibshirani
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even 
> if a document is not in {{LeafReader#getLiveDocs}}, it could still be 
> returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be 
> surprising + difficult for users, since other search APIs account for deleted 
> docs. We've discussed extending the search logic to take a parameter like 
> {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. 
> This behavior seems hard to work with as a user: fewer than {{k}} docs might 
> come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the 
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs 
> while assembling its candidate list. It would traverse further into the 
> graph, visiting more nodes to ensure it gathers the required candidates. 
> (Note deleted docs would still be visited/traversed). The [hnswlib 
> library|https://github.com/nmslib/hnswlib] contains an implementation like 
> this, where you can mark documents as deleted and they're skipped during 
> search.
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't 
> have a huge effect. But in the worst case, a segment could have 50% deleted 
> docs, and they all happen to be near the query vector. HNSW would need to 
> traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into 
> how well this performs in terms of recall. I have a vague intuition it could 
> be harder to achieve high recall as the algorithm traverses areas further 
> from the "natural" entry points. The HNSW paper doesn't mention deletions/ 
> filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: 
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very 
> similar to applying deleted docs): 
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. 
> The "Exact vs Approximate" section shows good performance even when a large 
> percentage of documents are filtered out. The team mentioned to me they 
> didn't have the chance to measure recall, only latency.






[jira] [Commented] (LUCENE-10298) dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless

2021-12-08 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455952#comment-17455952
 ] 

Uwe Schindler commented on LUCENE-10298:


Or maybe write the index names to a simple properties file that can be updated 
with plain stupid Java or Python and load it as a resource?
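A minimal sketch of that approach (hedged: the resource name 
old-index-names.txt is hypothetical; TestBackwardsCompatibility stands in for 
whichever class owns the list):

{code}
// Hedged sketch: keep the list in a plain text resource, one index name per
// line, so neither Spotless nor the Python script has to agree on Java
// source formatting.
static List<String> loadOldNames() throws IOException {
  try (InputStream in = TestBackwardsCompatibility.class.getResourceAsStream("old-index-names.txt");
      BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
    List<String> names = new ArrayList<>();
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
      if (line.isBlank() == false) {
        names.add(line.trim());
      }
    }
    return names;
  }
}
{code}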

> dev-tools/scripts/addBackcompatIndexes.py doesn't work well with spotless
> -
>
> Key: LUCENE-10298
> URL: https://issues.apache.org/jira/browse/LUCENE-10298
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> addBackcompatIndexes.py expects that lists of index names have one entry per 
> line, e.g.
> {code}
> static final String[] oldNames = {
>   ""
> }
> {code}
> However, when the array is small, Spotless forces the array to be written on 
> a single line, and addBackcompatIndexes.py no longer recognizes the structure 
> of the file.
> It's probably fixable, but my Python skills are not good enough. Or maybe 
> this file should be one of the rare ones that we exclude from Spotless?






[GitHub] [lucene-solr] thelabdude opened a new pull request #2626: SOLR-15832: Clean-up after publish action in Schema Designer shouldn't fail if .system collection doesn't exist

2021-12-08 Thread GitBox


thelabdude opened a new pull request #2626:
URL: https://github.com/apache/lucene-solr/pull/2626


   backport of https://github.com/apache/solr/pull/451





[jira] [Commented] (LUCENE-10274) Implement "hyperrectangle" faceting

2021-12-08 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455990#comment-17455990
 ] 

Greg Miller commented on LUCENE-10274:
--

I was thinking that this would work over the same doc values indexed when 
creating "Point" fields (e.g., LongPoint): a binary field encoding all N 
dimensions into a single byte[] entry. So the faceting logic would inspect a 
single binary field encoding the N dimensions, testing whether or not each 
document's point is contained in each hyperrectangle of interest.

> Implement "hyperrectangle" faceting
> ---
>
> Key: LUCENE-10274
> URL: https://issues.apache.org/jira/browse/LUCENE-10274
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> I'd be interested in expanding Lucene's faceting capabilities to aggregate a 
> point field against a set of user-provided n-dimensional 
> [hyperrectangles|https://en.wikipedia.org/wiki/Hyperrectangle]. This would be 
> a generalization of {{LongRangeFacets}} / {{DoubleRangeFacets}} from a single 
> dimension to n-dimensions, and would complement {{PointRangeQuery}} well, 
> providing the ability to facet ahead of "drilling down" on such a query.
> As a motivating use-case, imagine searching against movie documents that 
> contain a 2-dimensional point storing "awards" the movie has received. One 
> dimension encodes the year the award was won, while the other encodes the 
> type of award as an ordinal. For example, the film "Nomadland" won the 
> "Academy Awards Best Picture" award in 2021. Imagine providing a 
> two-dimensional refinement to users allowing them to filter by the 
> combination of award + year in a single action (e.g., using 
> {{{}PointRangeQuery{}}}) and needing to get facet counts for these 
> combinations ahead of time.
> Curious if the community thinks this functionality would be useful. Any 
> thoughts? 






[jira] [Created] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks

2021-12-08 Thread Robert Muir (Jira)
Robert Muir created LUCENE-10299:


 Summary: investigate prefix/wildcard perf drop in nightly 
benchmarks
 Key: LUCENE-10299
 URL: https://issues.apache.org/jira/browse/LUCENE-10299
 Project: Lucene - Core
  Issue Type: Task
 Environment: Recently the prefix/wildcard query performance dropped. As 
these queries are super simple and not impacted by the cleanups being done 
around RegExp, I think instead the perf difference is in the guts of 
MultiTermQuery where it uses DocIdSetBuilder?

*note that I haven't confirmed this and it is just a suspicion*

So I think it may be the LUCENE-10289 changes? e.g. doing loops with {{long}} 
instead of {{int}} like before; we know these are slower in Java.

I will admit, I'm a bit confused why we made this change since Lucene docids 
can only be {{int}}.

Maybe we get the performance back for free with JDK 18/19, which optimize 
loops on {{long}} better? So I'm not arguing that we burn a bunch of time to 
fix this, but just opening the issue.
Reporter: Robert Muir









[jira] [Comment Edited] (LUCENE-10274) Implement "hyperrectangle" faceting

2021-12-08 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455990#comment-17455990
 ] 

Greg Miller edited comment on LUCENE-10274 at 12/8/21, 9:05 PM:


I was thinking that this would work over the same doc values indexed when 
creating "Point" fields (e.g., LongPoint): a binary field encoding all N 
dimensions into a single byte[] entry. So the faceting logic would inspect a 
single binary field encoding the N dimensions, testing whether or not each 
document's point is contained in each hyperrectangle of interest.

 

UPDATE: Actually, I think I was confusing the current Point field impl with 
something else. I just glanced at the code and there isn't a current dv field 
of course (just the inverted points index). So yeah, will need some thought as 
to how to encode these as dvs.


was (Author: gsmiller):
I was thinking that this would work over the same doc values indexed when 
creating "Point" fields (e.g., LongPoint): a binary field encoding all N 
dimensions into a single byte[] entry. So the faceting logic would inspect a 
single binary field encoding the N dimensions, testing whether or not each 
document's point is contained in each hyperrectangle of interest.

> Implement "hyperrectangle" faceting
> ---
>
> Key: LUCENE-10274
> URL: https://issues.apache.org/jira/browse/LUCENE-10274
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> I'd be interested in expanding Lucene's faceting capabilities to aggregate a 
> point field against a set of user-provided n-dimensional 
> [hyperrectangles|https://en.wikipedia.org/wiki/Hyperrectangle]. This would be 
> a generalization of {{LongRangeFacets}} / {{DoubleRangeFacets}} from a single 
> dimension to n-dimensions, and would complement {{PointRangeQuery}} well, 
> providing the ability to facet ahead of "drilling down" on such a query.
> As a motivating use-case, imagine searching against movie documents that 
> contain a 2-dimensional point storing "awards" the movie has received. One 
> dimension encodes the year the award was won, while the other encodes the 
> type of award as an ordinal. For example, the film "Nomadland" won the 
> "Academy Awards Best Picture" award in 2021. Imagine providing a 
> two-dimensional refinement to users allowing them to filter by the 
> combination of award + year in a single action (e.g., using 
> {{{}PointRangeQuery{}}}) and needing to get facet counts for these 
> combinations ahead of time.
> Curious if the community thinks this functionality would be useful. Any 
> thoughts? 






[jira] [Updated] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks

2021-12-08 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-10299:
-
Description: 
Recently the prefix/wildcard query performance dropped. As these queries are 
super simple and not impacted by the cleanups being done around RegExp, I think 
instead the perf difference is in the guts of MultiTermQuery where it uses 
DocIdSetBuilder?

*note that I haven't confirmed this and it is just a suspicion*

So I think it may be the LUCENE-10289 changes? e.g. doing loops with {{long}} 
instead of {{int}} like before; we know these are slower in Java.

I will admit, I'm a bit confused why we made this change since Lucene docids 
can only be {{int}}.

Maybe we get the performance back for free with JDK 18/19, which optimize 
loops on {{long}} better? So I'm not arguing that we burn a bunch of time to 
fix this, but just opening the issue.

cc [~ivera]
 Environment: (was: Recently the prefix/wildcard query performance dropped. 
As these queries are super simple and not impacted by the cleanups being done 
around RegExp, I think instead the perf difference is in the guts of 
MultiTermQuery where it uses DocIdSetBuilder?

*note that I haven't confirmed this and it is just a suspicion*

So I think it may be the LUCENE-10289 changes? e.g. doing loops with {{long}} 
instead of {{int}} like before; we know these are slower in Java.

I will admit, I'm a bit confused why we made this change since Lucene docids 
can only be {{int}}.

Maybe we get the performance back for free with JDK 18/19, which optimize 
loops on {{long}} better? So I'm not arguing that we burn a bunch of time to 
fix this, but just opening the issue.)

> investigate prefix/wildcard perf drop in nightly benchmarks
> ---
>
> Key: LUCENE-10299
> URL: https://issues.apache.org/jira/browse/LUCENE-10299
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Recently the prefix/wildcard query performance dropped. As these queries are 
> super simple and not impacted by the cleanups being done around RegExp, I 
> think instead the perf difference is in the guts of MultiTermQuery where it 
> uses DocIdSetBuilder?
> *note that I haven't confirmed this and it is just a suspicion*
> So I think it may be the LUCENE-10289 changes? e.g. doing loops with {{long}} 
> instead of {{int}} like before; we know these are slower in Java.
> I will admit, I'm a bit confused why we made this change since Lucene docids 
> can only be {{int}}.
> Maybe we get the performance back for free with JDK 18/19, which optimize 
> loops on {{long}} better? So I'm not arguing that we burn a bunch of time to 
> fix this, but just opening the issue.
> cc [~ivera]
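For context, the kind of pattern the suspicion is about (an illustration with 
made-up variables docs, count and sum, not code from DocIdSetBuilder):

{code}
// Both loops do the same work, but the long induction variable can defeat
// JIT loop optimizations (e.g. range-check elimination) that current JDKs
// apply to int-indexed loops.
for (int i = 0; i < count; i++) {
  sum += docs[i]; // typical int-indexed docid loop
}
for (long i = 0; i < count; i++) {
  sum += docs[(int) i]; // same logic with a long counter; often slower
}
{code}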






[jira] [Commented] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks

2021-12-08 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455991#comment-17455991
 ] 

Robert Muir commented on LUCENE-10299:
--

Here is the list of commits between the benchmark runs where the perf 
dropped: 
https://github.com/apache/lucene/compare/ec57641ea5940270ff7eb08536c9050a050adf1f...68e94c959729dee6f32b1c6fca1a5e4902a9fa51

> investigate prefix/wildcard perf drop in nightly benchmarks
> ---
>
> Key: LUCENE-10299
> URL: https://issues.apache.org/jira/browse/LUCENE-10299
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Recently the prefix/wildcard query performance dropped. As these queries are 
> super simple and not impacted by the cleanups being done around RegExp, I 
> think instead the perf difference is in the guts of MultiTermQuery where it 
> uses DocIdSetBuilder?
> *note that I haven't confirmed this and it is just a suspicion*
> So I think it may be the LUCENE-10289 changes? e.g. doing loops with {{long}} 
> instead of {{int}} like before; we know these are slower in Java.
> I will admit, I'm a bit confused why we made this change since Lucene docids 
> can only be {{int}}.
> Maybe we get the performance back for free with JDK 18/19, which optimize 
> loops on {{long}} better? So I'm not arguing that we burn a bunch of time to 
> fix this, but just opening the issue.
> cc [~ivera]






[jira] [Commented] (LUCENE-10274) Implement "hyperrectangle" faceting

2021-12-08 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455994#comment-17455994
 ] 

Greg Miller commented on LUCENE-10274:
--

{quote}I would also suggest to start with the simple 
separate-numeric-docvalues-fields case and use similar logic as the 
{{org.apache.lucene.facet.range}} package, just on 2-D, or maybe 3-D, N-D, etc
{quote}
We could also pack the N dimensions into a single binary dv field using the 
{{encodeDimension}} / {{decodeDimension}} paradigm in {{LongPoint}} / 
{{DoublePoint}} for this. That seems simpler for a user to manage as opposed to 
managing separate fields for every dimension, but maybe there are performance 
limitations of such an approach.
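A quick sketch of that packing idea (hedged: the field names and values are 
made up; LongPoint.pack, LongPoint.decodeDimension and BinaryDocValuesField are 
existing Lucene APIs):

{code}
// Index the same N dimensions twice: once as a point for drill-down
// queries, once as a packed binary doc value for faceting.
long year = 2021;   // dimension 1: award year
long awardOrd = 17; // dimension 2: award type ordinal (made-up value)
Document doc = new Document();
doc.add(new LongPoint("award", year, awardOrd)); // for PointRangeQuery
doc.add(new BinaryDocValuesField("award-dv", LongPoint.pack(year, awardOrd))); // for faceting
// At facet-count time, each dimension can be recovered per document with
// LongPoint.decodeDimension(bytes.bytes, bytes.offset + dim * Long.BYTES).
{code}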

> Implement "hyperrectangle" faceting
> ---
>
> Key: LUCENE-10274
> URL: https://issues.apache.org/jira/browse/LUCENE-10274
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> I'd be interested in expanding Lucene's faceting capabilities to aggregate a 
> point field against a set of user-provided n-dimensional 
> [hyperrectangles|https://en.wikipedia.org/wiki/Hyperrectangle]. This would be 
> a generalization of {{LongRangeFacets}} / {{DoubleRangeFacets}} from a single 
> dimension to n-dimensions, and would complement {{PointRangeQuery}} well, 
> providing the ability to facet ahead of "drilling down" on such a query.
> As a motivating use-case, imagine searching against movie documents that 
> contain a 2-dimensional point storing "awards" the movie has received. One 
> dimension encodes the year the award was won, while the other encodes the 
> type of award as an ordinal. For example, the film "Nomadland" won the 
> "Academy Awards Best Picture" award in 2021. Imagine providing a 
> two-dimensional refinement to users allowing them to filter by the 
> combination of award + year in a single action (e.g., using 
> {{{}PointRangeQuery{}}}) and needing to get facet counts for these 
> combinations ahead of time.
> Curious if the community thinks this functionality would be useful. Any 
> thoughts? 






[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-08 Thread GitBox


dweiss commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-989200591


   I've pushed another bit of exploration and I think it shows we're close. 
Many things can be cleaned up nicely later (modular configurations generated 
from sourcesets, including compilation task configuration) but we already have 
a nice (I think!) way to express modular vs. classpath dependencies, working 
compilation and a test subproject that uses module descriptor and module path 
to run the tests. The only bit I didn't get to was reconfiguring the actual 
test task (classpath + module path). Hopefully tomorrow I will figure out the 
remaining bits and start cleanups and polishing.
   





[GitHub] [lucene] jnorthrup commented on pull request #310: LUCENE-10112: Improve LZ4 Compression performance with direct primitive read/writes

2021-12-08 Thread GitBox


jnorthrup commented on pull request #310:
URL: https://github.com/apache/lucene/pull/310#issuecomment-989246505


   hi @uschindler, are there analogs for LZO and ZSTD for which this benchmark 
can be used to address the ever-present IO budget and cache-line costs of the 
different libs? (benchmark in #308)





[GitHub] [lucene] uschindler commented on pull request #310: LUCENE-10112: Improve LZ4 Compression performance with direct primitive read/writes

2021-12-08 Thread GitBox


uschindler commented on pull request #310:
URL: https://github.com/apache/lucene/pull/310#issuecomment-989287677


   Hi @jnorthrup, I do not fully understand what you are intending to do. If 
you want to compare LZ4 to LZO/ZSTD, just run Mike's benchmarks.
   My main reason for this PR is to not build an int manually from 4 bytes but 
instead read it as one atomic value (which should be faster). The other 
compression algorithms you mention use native code, so that's a different 
story. This is all pure Java.
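   Roughly the kind of change being described (a hedged illustration, not the 
PR itself; the VarHandle constant is my own):

{code}
// Illustration: reading a little-endian int from a byte[] in one call
// instead of assembling it from four byte reads.
class ReadIntExample {
  static final VarHandle INT_LE =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

  static int manual(byte[] buf, int p) { // four reads plus shifts
    return (buf[p] & 0xFF) | ((buf[p + 1] & 0xFF) << 8)
        | ((buf[p + 2] & 0xFF) << 16) | ((buf[p + 3] & 0xFF) << 24);
  }

  static int atomic(byte[] buf, int p) { // one little-endian read
    return (int) INT_LE.get(buf, p);
  }
}
{code}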





[jira] [Commented] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-08 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456048#comment-17456048
 ] 

Greg Miller commented on LUCENE-10281:
--

Yeah, +1 to not considering this a bug (but I'm a little biased I suppose since 
I wrote this). As you point out, the heuristic would be better if we knew how 
many of the hits actually had values in the SSDV field, but it's expensive to 
determine that up-front. So the current heuristic (which is just a heuristic 
and could be flawed in a number of ways) assumes all the hits have a value.
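For reference, a simplified sketch of the dense-vs-sparse decision being 
discussed (hedged; not the exact code in StringValueFacetCounts):

{code}
// Count into a dense int[] when the dictionary is small or hits are
// plentiful, otherwise into a sparse hash map keyed by term ordinal.
if (cardinality < 1024 || totalHits >= totalDocs / 10) {
  denseCounts = new int[cardinality]; // dense counting path
} else {
  sparseCounts = new IntIntHashMap(); // sparse counting path (hppc)
}
{code}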

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.jpg
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState 
> state, FacetsCollector facetsCollector), if a facetsCollector is provided, 
> the condition *(totalHits < totalDocs / 10)* is used to judge whether to use 
> an IntIntHashMap (i.e. sparse storage) for term ords and counts.
> But totalHits doesn't mean every hit contains the SSDV field, and the same 
> goes for totalDocs. So the right calculation should be *(totalHits that have 
> SSDV) / (totalDocs that have SSDV)*. *(totalDocs that have SSDV)* is easy to 
> get via SortedSetDocValues#getValueCount(), but *totalHits that have SSDV* is 
> hard to get because we can only read the index by the docIds provided by 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (use denseCounts while cardinality < 
> 1024, use IntIntHashMap under the 10% threshold, and use denseCounts in the 
> rest of the cases), then we could still use denseCounts if cardinality < 
> 1024 and otherwise start with an IntIntHashMap, switching to denseCounts 
> once 10% of the unique terms have been collected.






[GitHub] [lucene] gsmiller commented on pull request #498: LUCENE-10275: Add interval tree to MultiRangeQuery

2021-12-08 Thread GitBox


gsmiller commented on pull request #498:
URL: https://github.com/apache/lucene/pull/498#issuecomment-989310001


   Nice change! (Just now catching up on some Lucene issues and saw this)





[GitHub] [lucene-solr] thelabdude merged pull request #2626: SOLR-15832: Clean-up after publish action in Schema Designer shouldn't fail if .system collection doesn't exist

2021-12-08 Thread GitBox


thelabdude merged pull request #2626:
URL: https://github.com/apache/lucene-solr/pull/2626


   





[GitHub] [lucene] rmuir commented on pull request #528: LUCENE-10296: Stop minimizing regexps

2021-12-08 Thread GitBox


rmuir commented on pull request #528:
URL: https://github.com/apache/lucene/pull/528#issuecomment-989427375


   I'm waiting a bit on https://issues.apache.org/jira/browse/LUCENE-10299. I 
don't expect any regression, but I don't want to confuse it with the 
`DocIdSetBuilder` stuff.





[GitHub] [lucene] rmuir commented on pull request #528: LUCENE-10296: Stop minimizing regexps

2021-12-08 Thread GitBox


rmuir commented on pull request #528:
URL: https://github.com/apache/lucene/pull/528#issuecomment-989454186


   LUCENE-10296: Stop minimizing regexps
   
   In current trunk, we let the caller (e.g. `RegExpQuery`) try to "reduce" the 
expression. Neither the parser nor the low-level executors implicitly call 
exponential-time algorithms anymore.
   
   But now that we have cleaned this up, we can see that what is happening is 
even worse than just calling `determinize()`. We still call `minimize()`, which 
is much crazier and much more costly.
   
   We stopped doing this for all other `AutomatonQuery` subclasses a long time 
ago, as we determined that it didn't help performance. Additionally, 
minimization vs. determinization matters even less than in the early days when 
we found trouble: the representation got a lot better. Today when you call 
`finishState()` we do a lot of practical sorting/coalescing on-the-fly: the 
practical parts of minimization for runtime performance. Also, we added our 
fancy UTF32-to-UTF8 automata convertor, which makes the worst-case space per 
state significantly lower than with the UTF-16 representation. So why 
`minimize()`?
   
   Let's just replace `minimize()` calls with `determinize()` calls. I've 
already swapped them out for all of `src/test`, to get jenkins looking for 
issues ahead of time.
   
   This change moves Hopcroft minimization (MinimizeOperations) to src/test for 
now. I'd like to explore nuking it from there as a next step; any tests that 
truly need minimization should be fine with Brzozowski's algorithm: that's a 
2-liner.
   
   I think the problem is understood; longs are insane for docids. I don't wish 
to hold changes up on stupid stuff.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir merged pull request #528: LUCENE-10296: Stop minimizing regexps

2021-12-08 Thread GitBox


rmuir merged pull request #528:
URL: https://github.com/apache/lucene/pull/528


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10296) Stop minimizing regexps

2021-12-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456106#comment-17456106
 ] 

ASF subversion and git services commented on LUCENE-10296:
--

Commit 7a872c7a5c00d846314d44a445f8b0e83acb6a86 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=7a872c7 ]

LUCENE-10296: Stop minimizing regexps (#528)

In current trunk, we let the caller (e.g. RegExpQuery) try to "reduce" the 
expression. Neither the parser nor the low-level executors implicitly call 
exponential-time algorithms anymore.

But now that we have cleaned this up, we can see it is even worse than just 
calling determinize(). We still call minimize(), which is much crazier and much 
more costly.

We stopped doing this for all other AutomatonQuery subclasses a long time ago, 
as we determined that it didn't help performance. Additionally, minimization 
vs. determinization matters even less than in the early days when we found 
trouble: the representation got a lot better. Today when you finishState we do 
a lot of practical sorting/coalescing on-the-fly. Also we added this fancy 
UTF32-to-UTF8 automata convertor, which makes the worst-case space per state 
significantly lower than it was before. So why minimize()?

Let's just replace minimize() calls with determinize() calls. I've already 
swapped them out for all of src/test, to get jenkins looking for issues ahead 
of time.

This change moves Hopcroft minimization (MinimizeOperations) to src/test for 
now. I'd like to explore nuking it from there as a next step; any tests that 
truly need minimization should be fine with Brzozowski's
algorithm.

> Stop minimizing regexps
> ---
>
> Key: LUCENE-10296
> URL: https://issues.apache.org/jira/browse/LUCENE-10296
> Project: Lucene - Core
>  Issue Type: Task
>Affects Versions: 10.0 (main)
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In current trunk, we let the caller (e.g. RegExpQuery) try to "reduce" the 
> expression. Neither the parser nor the low-level executors implicitly call 
> exponential-time algorithms anymore.
> But now that we have cleaned this up, we can see it is even worse than just 
> calling {{determinize()}}. We still call {{minimize()}}, which is much 
> crazier and much more costly.
> We stopped doing this for all other AutomatonQuery subclasses a long time 
> ago, as we determined that it didn't help performance. Additionally, 
> minimization vs. determinization matters even less than in the early days 
> when we found trouble: the representation got a lot better. Today when you 
> {{finishState}} we do a lot of practical sorting/coalescing on-the-fly. Also 
> we added this fancy UTF32-to-UTF8 automata convertor, which makes the 
> worst-case space per state significantly lower than it was before. So why 
> {{minimize()}}?
> Let's just replace {{minimize()}} calls with {{determinize()}} calls. I've 
> already swapped them out for all of {{src/test}}, to get jenkins looking for 
> issues ahead of time. A sketch of the swap follows below.
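For readers following along, a minimal sketch of the swap under discussion, 
assuming Lucene's automaton API on main (Operations.determinize and 
DEFAULT_DETERMINIZE_WORK_LIMIT; MinimizeOperations is the class moved to 
src/test):

{code:java}
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

class DeterminizeOnly {
  static Automaton build(String regex) {
    Automaton a = new RegExp(regex).toAutomaton();
    // before: a = MinimizeOperations.minimize(a, limit);  // Hopcroft, now test-only
    return Operations.determinize(a, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
  }
}
{code}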



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10299) investigate prefix/wildcard perf drop in nightly benchmarks

2021-12-08 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456112#comment-17456112
 ] 

Robert Muir commented on LUCENE-10299:
--

There is a fairly heavy cost on the MTQ rewrite happening in the nightly bench. 
I assume a similar cost for filter construction, etc. Honestly, I suggest 
reverting this commit.

> investigate prefix/wildcard perf drop in nightly benchmarks
> ---
>
> Key: LUCENE-10299
> URL: https://issues.apache.org/jira/browse/LUCENE-10299
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Recently the prefix/wildcard numbers dropped. As these queries are super 
> simple and not impacted by the cleanups being done around RegExp, I think the 
> perf difference is instead in the guts of MultiTermQuery where it uses 
> DocIdSetBuilder?
> *note that I haven't confirmed this and it is just a suspicion*
> So I think it may be the LUCENE-10289 changes? e.g. doing loops with {{long}} 
> instead of {{int}} like before; we know these are slower in Java.
> I will admit I'm a bit confused why we made this change, since lucene docids 
> can only be {{int}}.
> Maybe we get the performance back for free with JDK 18/19, which are 
> optimizing loops on {{long}} better? So I'm not arguing that we burn a bunch 
> of time to fix this, just opening the issue. A sketch of the loop shapes in 
> question follows below.
> cc [~ivera]
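A minimal sketch of the loop shapes in question; this is illustrative only, not 
Lucene code. The suspicion is that the JIT optimizes the int-indexed form 
better:

{code:java}
class LoopShapes {
  static long sumWithIntCounter(int[] docs) {
    long sum = 0;
    for (int i = 0; i < docs.length; i++) {   // int induction variable
      sum += docs[i];
    }
    return sum;
  }

  static long sumWithLongCounter(int[] docs) {
    long sum = 0;
    for (long i = 0; i < docs.length; i++) {  // long induction variable
      sum += docs[(int) i];
    }
    return sum;
  }
}
{code}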



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Reopened] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int

2021-12-08 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reopened LUCENE-10289:
--

I reopened the issue due to perf regressions: LUCENE-10299.

> DocIdSetBuilder#grow() should take a long instead of int 
> -
>
> Key: LUCENE-10289
> URL: https://issues.apache.org/jira/browse/LUCENE-10289
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder accepts adding duplicates and therefore can potentially 
> accept more than Integer.MAX_VALUE docs. For example, it already holds an 
> internal counter that is a long. It probably makes sense to be able to grow 
> using a long instead of an int.
>  
> This will allow us to change PointValues.IntersectVisitor#grow() from int to 
> long and remove some unnecessary dancing when we need to bulk add more than 
> Integer.MAX_VALUE points. A usage sketch follows below.
>  
>  
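A usage sketch of the API in question, assuming a caller that already knows how 
many docs it will add; DocIdSetBuilder and its BulkAdder are the real Lucene 
classes, and this issue widened grow() to accept a long:

{code:java}
import org.apache.lucene.search.DocIdSetBuilder;

class GrowExample {
  static void addAll(DocIdSetBuilder builder, int[] docs) {
    // reserve room up front; after this change the argument is a long
    DocIdSetBuilder.BulkAdder adder = builder.grow(docs.length);
    for (int doc : docs) {
      adder.add(doc);
    }
  }
}
{code}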



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10296) Stop minimizing regexps

2021-12-08 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10296.
--
Fix Version/s: 10.0 (main)
   Resolution: Fixed

> Stop minimizing regexps
> ---
>
> Key: LUCENE-10296
> URL: https://issues.apache.org/jira/browse/LUCENE-10296
> Project: Lucene - Core
>  Issue Type: Task
>Affects Versions: 10.0 (main)
>Reporter: Robert Muir
>Priority: Major
> Fix For: 10.0 (main)
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In current trunk, we let the caller (e.g. RegExpQuery) try to "reduce" the 
> expression. Neither the parser nor the low-level executors implicitly call 
> exponential-time algorithms anymore.
> But now that we have cleaned this up, we can see it is even worse than just 
> calling {{determinize()}}. We still call {{minimize()}}, which is much 
> crazier and much more costly.
> We stopped doing this for all other AutomatonQuery subclasses a long time 
> ago, as we determined that it didn't help performance. Additionally, 
> minimization vs. determinization matters even less than in the early days 
> when we found trouble: the representation got a lot better. Today when you 
> {{finishState}} we do a lot of practical sorting/coalescing on-the-fly. Also 
> we added this fancy UTF32-to-UTF8 automata convertor, which makes the 
> worst-case space per state significantly lower than it was before. So why 
> {{minimize()}}?
> Let's just replace {{minimize()}} calls with {{determinize()}} calls. I've 
> already swapped them out for all of {{src/test}}, to get jenkins looking 
> for issues ahead of time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 edited a comment on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-08 Thread GitBox


gf2121 edited a comment on pull request #510:
URL: https://github.com/apache/lucene/pull/510#issuecomment-989104054


   @iverase Thanks for your explanation!
   
   > I worked on the PR about using #readLELongs but never get a meaningful 
speed up that justify the added complexity.
   
   I find that we were trying to use #readLELongs to speed up the 24/32 bit 
cases in the `DocIdsWriter`, which means the ids in the block are unsorted, as 
typically happens for high cardinality fields. I think queries on high 
cardinality fields spend most of their time in `visitDocValues`, not 
`readDocIds`, so maybe that is why we could not see an obvious gain in 
end-to-end time?
   
   My current thoughts are about using readLELongs to speed up the **sorted** 
ids situation (i.e. low or medium cardinality fields), whose bottleneck is 
reading docIds. For sorted arrays, we can compute the deltas of the sorted ids 
and encode/decode them like what we do in `StoredFieldsInts`. 
   
   I raised an [ISSUE](https://issues.apache.org/jira/browse/LUCENE-10297) 
based on this idea. The benchmark result I posted in the issue looks promising. 
Would you like to take a look when you are free? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we are trying to optimize the unsorted 
situation (which typically happens for high cardinality fields), and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}?

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff 
percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we are trying to optimize the unsorted 
situation (which typically happens for high cardinality fields), and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}?

IMO, medium cardinality fields may be tempting candidates for this optimization 
because they need to read lots of ids for one term. The basic idea is that we 
can compute the deltas of the sorted ids and encode/decode them like what we do 
in {{StoredFieldsInts}}. I benchmarked the optimization by mocking some random 
LongPoint fields and querying them with {{PointInSetQuery}}. As expected, the 
medium cardinality fields got sped up and the high cardinality fields got even 
results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff 
percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but it 
> only works on extremely low cardinality fields (doc count > 1/16 of the 
> total doc count), so medium cardinality cases like 32/128 rarely get this 
> optimization.

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLELongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we are trying to optimize the unsorted 
situation (which typically happens for high cardinality fields), and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}?

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.


*Benchmark Result*
|doc count|field cardinality|query term count|baseline(ms)|candidate(ms)|diff 
percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we are trying to optimize the unsorted 
situation (which typically happens for high cardinality fields), and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}?

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.


*Benchmark Result*
|doc count|field cardinality|field term count|baseline(ms)|candidate(ms)|diff 
percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLELongs and SIMD
> 
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but it 
> only works on extremely low cardinality fields (doc count > 1/16 of the 
> total doc count), so medium cardinality cases like 32/128 rarely get this 
> optimization.

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Summary: Speed up medium cardinality fields with readLongs and SIMD  (was: 
Speed up medium cardinality fields with readLELongs and SIMD)

> Speed up medium cardinality fields with readLongs and SIMD
> --
>
> Key: LUCENE-10297
> URL: https://issues.apache.org/jira/browse/LUCENE-10297
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have a bitset optimization for low cardinality fields, but it 
> only works on extremely low cardinality fields (doc count > 1/16 of the 
> total doc count), so medium cardinality cases like 32/128 rarely get this 
> optimization.
> In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
> use readLELongs to speed up BKD id blocks, but did not get an obvious gain 
> from this approach. Maybe this is because we are trying to optimize the 
> unsorted situation (which typically happens for high cardinality fields), 
> and the bottleneck of queries on high cardinality fields is 
> {{visitDocValues}}, not {{readDocIds}}?
> However, medium cardinality fields may be tempting candidates for this 
> optimization because they need to read lots of ids for each term. The basic 
> idea is that we can compute the deltas of the sorted ids and encode/decode 
> them like what we do in {{StoredFieldsInts}}. I benchmarked the optimization 
> by mocking some random LongPoint fields and querying them with 
> {{PointInSetQuery}}. As expected, the medium cardinality fields got sped up 
> and the high cardinality fields got even results. (A sketch of such a 
> benchmark query follows after this message.)
> *Benchmark Result*
> |doc count|field cardinality|query term count|baseline(ms)|candidate(ms)|diff 
> percentage|
> |1|32|1|19|16|-15.79%|
> |1|32|2|34|14|-58.82%|
> |1|32|4|76|22|-71.05%|
> |1|32|8|139|42|-69.78%|
> |1|32|16|279|82|-70.61%|
> |1|128|1|17|11|-35.29%|
> |1|128|8|75|23|-69.33%|
> |1|128|16|126|25|-80.16%|
> |1|128|32|245|50|-79.59%|
> |1|128|64|528|97|-81.63%|
> |1|1024|1|3|2|-33.33%|
> |1|1024|8|13|8|-38.46%|
> |1|1024|32|31|19|-38.71%|
> |1|1024|128|120|67|-44.17%|
> |1|1024|512|480|133|-72.29%|
> |1|8192|1|3|3|0.00%|
> |1|8192|16|18|15|-16.67%|
> |1|8192|64|19|14|-26.32%|
> |1|8192|512|69|43|-37.68%|
> |1|8192|2048|236|134|-43.22%|
> |1|1048576|1|3|2|-33.33%|
> |1|1048576|16|18|19|5.56%|
> |1|1048576|64|17|17|0.00%|
> |1|1048576|512|34|32|-5.88%|
> |1|1048576|2048|89|93|4.49%|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
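A minimal sketch of the kind of benchmark query described in the issue above, 
assuming an index with a LongPoint field named "f"; LongPoint.newSetQuery is 
Lucene's factory for a PointInSetQuery over longs:

{code:java}
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

class PointInSetBench {
  static long countMatches(Directory dir, long... terms) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query q = LongPoint.newSetQuery("f", terms);  // one "query point" per term
      return searcher.count(q);
    }
  }
}
{code}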



[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we are trying to optimize the unsorted 
situation (which typically happens for high cardinality fields), and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}?

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.

*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff 
percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we are trying to optimize the unsorted 
situation (which typically happens for high cardinality fields), and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}?

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.


*Benchmark Result*
|doc count|field cardinality|query term count|baseline(ms)|candidate(ms)|diff 
percentage|
|1|32|1|19|16|-15.79%|
|1|32|2|34|14|-58.82%|
|1|32|4|76|22|-71.05%|
|1|32|8|139|42|-69.78%|
|1|32|16|279|82|-70.61%|
|1|128|1|17|11|-35.29%|
|1|128|8|75|23|-69.33%|
|1|128|16|126|25|-80.16%|
|1|128|32|245|50|-79.59%|
|1|128|64|528|97|-81.63%|
|1|1024|1|3|2|-33.33%|
|1|1024|8|13|8|-38.46%|
|1|1024|32|31|19|-38.71%|
|1|1024|128|120|67|-44.17%|
|1|1024|512|480|133|-72.29%|
|1|8192|1|3|3|0.00%|
|1|8192|16|18|15|-16.67%|
|1|8192|64|19|14|-26.32%|
|1|8192|512|69|43|-37.68%|
|1|8192|2048|236|134|-43.22%|
|1|1048576|1|3|2|-33.33%|
|1|1048576|16|18|19|5.56%|
|1|1048576|64|17|17|0.00%|
|1|1048576|512|34|32|-5.88%|
|1|1048576|2048|89|93|4.49%|

 


> Speed up medium cardinality fields with readLongs and SIMD
> --

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think the reason could be that we were trying to optimize the 
unsorted situation (which typically happens for high cardinality fields), and 
the bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}.

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.

*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff 
percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. Maybe this is because we are trying to optimize the unsorted 
situation (which typically happens for high cardinality fields), and the 
bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}?

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.

*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff 
percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|

[GitHub] [lucene] gf2121 edited a comment on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-08 Thread GitBox


gf2121 edited a comment on pull request #510:
URL: https://github.com/apache/lucene/pull/510#issuecomment-989104054


   @iverase Thanks for your explanation!
   
   > I worked on the PR about using #readLELongs but never get a meaningful 
speed up that justify the added complexity.
   
   I find that we were trying to use #readLELongs to speed up the 24/32 bit 
cases in the `DocIdsWriter`, which means the ids in the block are unsorted, as 
typically happens for high cardinality fields. I think queries on high 
cardinality fields spend most of their time in `visitDocValues`, not 
`readDocIds`, so maybe that is why we could not see an obvious gain in 
end-to-end time?
   
   My current thoughts are about using readLELongs to speed up the **sorted** 
ids situation (i.e. low or medium cardinality fields), whose bottleneck is 
reading docIds. For sorted arrays, we can compute the deltas of the sorted ids 
and encode/decode them like what we do in `StoredFieldsInts` (a sketch follows 
after this message). 
   
   I raised an [ISSUE](https://issues.apache.org/jira/browse/LUCENE-10297) 
based on this idea. The benchmark result I posted in the issue looks promising. 
Would you like to take a look when you have free time? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
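A minimal sketch of the delta idea from the comment above, under stated 
assumptions: ids arrive sorted ascending, and vInt coding stands in for the 
fixed-width block packing the real proposal would need so that readLongs/SIMD 
can decode whole blocks at once:

{code:java}
import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

class SortedIdCodec {
  static void writeSorted(int[] ids, int count, DataOutput out) throws IOException {
    int prev = 0;
    for (int i = 0; i < count; i++) {
      out.writeVInt(ids[i] - prev);  // deltas of sorted ids are small
      prev = ids[i];
    }
  }

  static void readSorted(int count, DataInput in, int[] dest) throws IOException {
    int prev = 0;
    for (int i = 0; i < count; i++) {
      prev += in.readVInt();         // prefix-sum restores absolute ids
      dest[i] = prev;
    }
  }
}
{code}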



[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think the reason could be that we were trying to optimize the 
unsorted situation (which typically happens for high cardinality fields), and 
the bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}. (Not sure; I'm doing some more benchmarking on this.)

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.

*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff 
percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think the reason could be that we were trying to optimize the 
unsorted situation (which typically happens for high cardinality fields), and 
the bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}.

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.

*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff 
percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|

[jira] [Updated] (LUCENE-10297) Speed up medium cardinality fields with readLongs and SIMD

2021-12-08 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10297:
--
Description: 
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think the reason could be that we were trying to optimize the 
unsorted situation (which typically happens for high cardinality fields), and 
the bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}. _(Not sure; I'm doing some more benchmarking on this.)_

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.

*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff 
percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|

  was:
We already have a bitset optimization for low cardinality fields, but it only 
works on extremely low cardinality fields (doc count > 1/16 of the total doc 
count), so medium cardinality cases like 32/128 rarely get this optimization.

In [https://github.com/apache/lucene-solr/pull/1538], we made some effort to 
use readLELongs to speed up BKD id blocks, but did not get an obvious gain from 
this approach. I think the reason could be that we were trying to optimize the 
unsorted situation (which typically happens for high cardinality fields), and 
the bottleneck of queries on high cardinality fields is {{visitDocValues}}, not 
{{readDocIds}}. (Not sure; I'm doing some more benchmarking on this.)

However, medium cardinality fields may be tempting candidates for this 
optimization because they need to read lots of ids for each term. The basic 
idea is that we can compute the deltas of the sorted ids and encode/decode them 
like what we do in {{StoredFieldsInts}}. I benchmarked the optimization by 
mocking some random LongPoint fields and querying them with 
{{PointInSetQuery}}. As expected, the medium cardinality fields got sped up and 
the high cardinality fields got even results.

*Benchmark Result*
|doc count|field cardinality|query point|baseline(ms)|candidate(ms)|diff 
percentage|baseline(QPS)|candidate(QPS)|diff percentage|
|1|32|1|19|16|-15.79%|52.63|62.50|18.75%|
|1|32|2|34|14|-58.82%|29.41|71.43|142.86%|
|1|32|4|76|22|-71.05%|13.16|45.45|245.45%|
|1|32|8|139|42|-69.78%|7.19|23.81|230.95%|
|1|32|16|279|82|-70.61%|3.58|12.20|240.24%|
|1|128|1|17|11|-35.29%|58.82|90.91|54.55%|
|1|128|8|75|23|-69.33%|13.33|43.48|226.09%|
|1|128|16|126|25|-80.16%|7.94|40.00|404.00%|
|1|128|32|245|50|-79.59%|4.08|20.00|390.00%|
|1|128|64|528|97|-81.63%|1.89|10.31|444.33%|
|1|1024|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1024|8|13|8|-38.46%|76.92|125.00|62.50%|
|1|1024|32|31|19|-38.71%|32.26|52.63|63.16%|
|1|1024|128|120|67|-44.17%|8.33|14.93|79.10%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|1024|512|480|133|-72.29%|2.08|7.52|260.90%|
|1|8192|1|3|3|0.00%|333.33|333.33|0.00%|
|1|8192|16|18|15|-16.67%|55.56|66.67|20.00%|
|1|8192|64|19|14|-26.32%|52.63|71.43|35.71%|
|1|8192|512|69|43|-37.68%|14.49|23.26|60.47%|
|1|8192|2048|236|134|-43.22%|4.24|7.46|76.12%|
|1|1048576|1|3|2|-33.33%|333.33|500.00|50.00%|
|1|1048576|16|18|19|5.56%|55.56|52.63|-5.26%|
|1|1048576|64|17|17|0.00%|58.82|58.82|0.00%|
|1|1048576|512|34|32|-5.88%|29.41|31.25|6.25%|
|1|1048576|2048|89|93|4.49%|11.24|10.75|-4.30%|

[GitHub] [lucene] spyk commented on pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

2021-12-08 Thread GitBox


spyk commented on pull request #380:
URL: https://github.com/apache/lucene/pull/380#issuecomment-989601844


   Thanks @magibney! Yes, that makes sense. I'll add a commit removing the 
redundant parsing code and add a TODO comment as suggested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org