[GitHub] [lucene] jpountz commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-23 Thread GitBox


jpountz commented on PR #972:
URL: https://github.com/apache/lucene/pull/972#issuecomment-1164220705

   The fact that queries perform slower in general in your first benchmark run 
makes me wonder if this could be due to insufficient warmup time. The default 
task repeat count of 20 might be too low for these queries, which are very good 
at skipping irrelevant documents. Maybe try passing `taskRepeatCount=100` in 
the ctor of your `Competition` object? Does that bring query performance closer 
to your second benchmark run?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-23 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558029#comment-17558029
 ] 

Michael McCandless commented on LUCENE-10557:
-

Finally catching up over here!

*Thank you* [~tomoko] for tackling this!

I agree testing what is realistic/possible will enable us to make an informed 
decision.  I really hope we are not stuck asking all future developers to 
fall back to Jira and use two search engines.

To make Jira effectively read-only post-migration, Robert suggested we could 
use Jira's workflow controls to make a "degraded" workflow that does not allow 
any writes to the issues (creating new issues, adding comments, changing 
milestones, etc.).  We can add that to the (draft) migration steps.

For committers, [https://id.apache.org|https://id.apache.org/] has the mapping 
of apache userid to GitHub id, though I'm not sure if that is publicly 
queryable.  And as [~msoko...@gmail.com] pointed out on the dev list thread, 
the [GitHub Apache org might also have it|https://github.com/apache].

Did you see/start from [the Lucene.Net migration 
tool|https://github.com/bongohrtech/jira-issues-importer/tree/lucenenet]? This 
is what [~nightowl888] pointed to (up above).

Those few migrated issues look like a great start!
{quote}{*}There is no way to upload files to GitHub with REST APIs{*}; it is 
only allowed via the Web Interface.
{quote}
Wow, that is indeed disappointing.  I wonder whether GitHub's issue search also 
searches attachments?  Does Jira's?
 

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * Choose issues that should be moved to GitHub
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses. 
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
>  * Build the convention for issue label/milestone management
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)
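A prototype migration script along the lines sketched above might build each GitHub issue payload like this. This is a hypothetical illustration, not the actual migration tool: the class and helper names are made up; only the GitHub REST endpoint `POST /repos/{owner}/{repo}/issues` is a real API. It shows the "prefix the title with the original LUCENE-XYZ key" and "prepend a link to the source Jira issue" ideas from the task list:

```java
import java.util.Map;

// Hypothetical payload builder for migrating one Jira issue to GitHub.
// Prefixes the title with the Jira key and prepends a back-link to the
// source Jira issue, as proposed in the task list above.
public class JiraToGitHub {
    static Map<String, String> issuePayload(String jiraKey, String title, String body) {
        String ghTitle = jiraKey + ": " + title;
        String ghBody = "Migrated from https://issues.apache.org/jira/browse/" + jiraKey
                + "\n\n" + body;
        // This map would be serialized to JSON and sent to
        // POST https://api.github.com/repos/apache/lucene/issues
        return Map.of("title", ghTitle, "body", ghBody);
    }

    public static void main(String[] args) {
        Map<String, String> p = issuePayload("LUCENE-10557",
                "Migrate to GitHub issue from Jira", "A few Apache projects...");
        System.out.println(p.get("title")); // LUCENE-10557: Migrate to GitHub issue from Jira
    }
}
```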



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [lucene] gsmiller commented on pull request #964: LUCENE-10620: Pass the Weight to Collectors.

2022-06-23 Thread GitBox


gsmiller commented on PR #964:
URL: https://github.com/apache/lucene/pull/964#issuecomment-1164336774

   @jpountz ah right. No, I don’t think it makes sense for users to have to 
deal with creating weights on their own (and having to consider query rewriting 
as well before doing so). Your approach makes total sense to me in light of 
this. +1 to making this change. Thanks!





[jira] [Commented] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558093#comment-17558093
 ] 

Alessandro Benedetti commented on LUCENE-10593:
---

Hi @msokolov @mayya-sharipova and @jtibshirani, I have finally finished my 
performance tests.
Initially the results were worse on this branch, which I found suspicious, as I 
expected the removal of the BoundChecker and of the reverse mechanism to 
outweigh the additional division in the distance measure during graph building 
and searching.

After a deep investigation I found the culprit (you can see it in the latest 
commit).


{code:java}
// before:
if (neighborSimilarity >= score) {
// after: this version improves the performance dramatically in both indexing/searching
if ((neighborSimilarity < score) == false) {
{code}


After that fix, the results are very encouraging.
There are strong speedups for both angular and euclidean distances, for both 
indexing and searching.
*If this is validated, we get a great cleanup of the code and also a nice 
performance boost.*
I'll have my colleague @eliaporciani repeat the tests on an Apple M1.

The following tests were executed in IntelliJ running 
org.apache.lucene.util.hnsw.KnnGraphTester on a 
2.4 GHz 8-Core Intel Core i9 with 32 GB 2667 MHz DDR4.


{noformat}
INDEXING EUCLIDEAN

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors
IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors
IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors
IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors
IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors

INDEXING ANGULAR

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors
IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors
IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors
IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors
IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors

SEARCH EUCLIDEAN

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms
completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms
completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms

THIS BRANCH
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 47 ms: 10638 QPS CPU time=46ms

SEARCH ANGULAR

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
completed 500 searches in 154 ms: 3246 QPS CPU time=153ms
completed 500 searches in 162 ms: 3086 QPS CPU time=162ms
completed 500 searches in 166 ms: 3012 QPS CPU time=166ms

THIS BRANCH
completed 500 searches in 62 ms: 8064 QPS CPU time=62ms
completed 500 searches in 65 ms: 7692 QPS CPU time=65ms
completed 500 searches in 63 ms: 7936 QPS CPU time=62ms
{noformat}



Please correct me if I did anything wrong; it's the first time I have used 
org.apache.lucene.util.hnsw.KnnGraphTester.

> VectorSimilarityFunction reverse removal
> 
>
> Key: LUCENE-10593
> URL: https://issues.apache.org/jira/browse/LUCENE-10593
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
>  Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves 
> in an opposite way in comparison to the other similarities:
> A higher similarity score means higher distance, for this reason, has been 
> marked with "reversed" and a function is present to map from the similarity 
> to a score (where higher means closer, like in all other similarities.)
> Having this counterintuitive behavior with no apparent explanation I could 
> find(please correct me if I am wrong) brings a lot 

[jira] [Comment Edited] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558093#comment-17558093
 ] 

Alessandro Benedetti edited comment on LUCENE-10593 at 6/23/22 1:34 PM:



I am proceeding with additional performance tests on different datasets.
Functional tests are all green.



[GitHub] [lucene] msokolov commented on pull request #926: VectorSimilarityFunction reverse removal

2022-06-23 Thread GitBox


msokolov commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1164418508

   Hi Alessandro, thank you for running the tests. I'm suspicious of the 
results though -- they just look too good to be true! I know from profiling 
that we spend most of the time in similarity computations, yet this change 
doesn't affect how many of those we do, nor how costly they are.
   
   One thing I see is that you are using an `hdf5` file as input, but this 
tester was not designed to accept that format. Here is a script I have used to 
extract raw floating-point data (what KnnGraphTester expects) from hdf5. It 
also takes care of normalizing to unit vectors, which you should do for angular 
data, but not for euclidean.
   
   ```
   import h5py
   import numpy as np
   import sys
   
   with h5py.File(sys.argv[1], 'r') as f:
       for key in f.keys():
           print(f"{key}: {f[key].shape}")
           ds = f[key]
           print(f"copying {ds.shape} from {key}")
           arr = np.zeros(ds.shape, dtype='float32')
           ds.read_direct(arr)
   
           # normalize all vectors (along dim 1) to unit length
           norm = np.linalg.norm(arr, 2, 1)
           norm[norm == 0] = 1
           arr = arr / np.expand_dims(norm, 1)
   
           arr.tofile(sys.argv[1] + "-" + key)
   ```






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558097#comment-17558097
 ] 

Tomoko Uchida commented on LUCENE-10557:


I'm still not fully sure whether we can/should make Jira completely read-only; 
maybe we'll have a discussion on the mailing list later.
{quote}Did you see/start from [the Lucene.Net migration 
tool|https://github.com/bongohrtech/jira-issues-importer/tree/lucenenet]?
{quote}
No - Lucene.Net and Lucene have different requirements, and data 
migration/conversion scripts like this are usually not reusable.  I think it'd 
be easier to write a tool that fits our needs from scratch than to tweak 
others' work that is optimized for their needs. (It's not technically difficult 
- a set of tiny scripts is sufficient; there are just many uncertainties.)







[GitHub] [lucene] msokolov commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-06-23 Thread GitBox


msokolov commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r905035144


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -246,7 +246,7 @@ private boolean diversityCheck(
 for (int i = 0; i < neighbors.size(); i++) {
   float neighborSimilarity =
   similarityFunction.compare(candidate, 
vectorValues.vectorValue(neighbors.node[i]));
-  if (neighborSimilarity >= score) {

Review Comment:
   I don't think this change should make any difference at all - it's logically 
equivalent, isn't it?






[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-06-23 Thread GitBox


alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r905042674


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -246,7 +246,7 @@ private boolean diversityCheck(
 for (int i = 0; i < neighbors.size(); i++) {
   float neighborSimilarity =
   similarityFunction.compare(candidate, 
vectorValues.vectorValue(neighbors.node[i]));
-  if (neighborSimilarity >= score) {

Review Comment:
   Yesterday two of our engineers and I spent literally three hours discussing 
and testing this, isolating the instruction that was causing this branch to be 
slower than upstream/main.
   We ended up with this single change bringing a lot of improvement (and it's 
in line with what the BoundChecker was doing).
   
   We isolated the problem and the solution there, but we couldn't find a 
specific Java reason.
   We were wondering whether the equality check on top of the "greater than" may 
incur additional expense when close vectors are present?
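For what it's worth, the two conditions are not strictly equivalent under Java floating-point semantics: they disagree when one operand is NaN, because every ordered comparison involving NaN evaluates to false. Whether that explains the measured difference here is unclear (a JIT/branch-prediction artifact is also plausible), but a minimal self-contained demo of the semantic gap:

```java
public class NanCompareDemo {
    public static void main(String[] args) {
        float score = 0.5f;
        float neighborSimilarity = Float.NaN;

        // Any ordered comparison involving NaN is false, so the two
        // "equivalent" conditions take opposite branches for NaN input.
        boolean original = neighborSimilarity >= score;            // false
        boolean rewritten = (neighborSimilarity < score) == false; // true

        System.out.println(original + " " + rewritten); // false true
    }
}
```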






[GitHub] [lucene] jpountz commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-23 Thread GitBox


jpountz commented on PR #972:
URL: https://github.com/apache/lucene/pull/972#issuecomment-1164436120

   @zacharymorn FYI I played with a slightly different approach that implements 
BMM as a bulk scorer instead of a scorer, which I was hoping would help make 
the bookkeeping more lightweight: 
https://github.com/jpountz/lucene/tree/maxscore. It could be interesting to 
compare it with your implementation.
   
   One optimization it has that seemed to help, and that your scorer doesn't 
have, is to check for every non-essential scorer whether the score obtained so 
far, plus the sum of max scores of the non-essential scorers that haven't been 
checked yet, is still competitive.
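For illustration, the pruning check described above could look roughly like this. It is a self-contained sketch with made-up names, not the actual bulk scorer code: after the essential clauses have been scored, each non-essential clause is only evaluated while the best still-achievable score remains competitive.

```java
// Sketch of the non-essential-scorer pruning idea: give up on a candidate doc
// as soon as the score so far plus the max-score bounds of the unchecked
// non-essential clauses can no longer reach the competitive threshold.
public class NonEssentialPruning {
    static double scoreIfCompetitive(double essentialScore, double[] nonEssentialScores,
                                     double[] maxScores, double minCompetitiveScore) {
        double remainingMax = 0;
        for (double s : maxScores) remainingMax += s;
        double score = essentialScore;
        for (int i = 0; i < maxScores.length; i++) {
            // Best score still achievable: what we have plus the bounds of
            // the non-essential scorers we haven't checked yet.
            if (score + remainingMax < minCompetitiveScore) {
                return Double.NEGATIVE_INFINITY; // cannot become competitive
            }
            remainingMax -= maxScores[i];
            score += nonEssentialScores[i]; // actual contribution of clause i
        }
        return score >= minCompetitiveScore ? score : Double.NEGATIVE_INFINITY;
    }

    public static void main(String[] args) {
        // Essential clauses contributed 1.0; two non-essential clauses with max
        // scores 0.3 and 0.2 cannot lift the doc to the threshold of 2.0, so we
        // bail out before scoring either of them.
        System.out.println(NonEssentialPruning.scoreIfCompetitive(
                1.0, new double[] {0.1, 0.1}, new double[] {0.3, 0.2}, 2.0)); // -Infinity
    }
}
```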
   
   I got the following results on one run on wikimedium10m:
   
   ```
   Task                          QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff          p-value
   OrHighNotLow                       1493.13  (6.5%)                  1445.29  (5.1%)    -3.2% ( -13% -  8%)  0.083
   OrNotHighMed                       1410.19  (3.8%)                  1373.37  (3.1%)    -2.6% (  -9% -  4%)  0.017
   OrNotHighHigh                      1057.88  (5.1%)                  1031.19  (4.4%)    -2.5% ( -11% -  7%)  0.096
   OrHighNotMed                       1525.10  (5.2%)                  1486.80  (4.4%)    -2.5% ( -11% -  7%)  0.098
   OrHighNotHigh                      1250.31  (4.3%)                  1221.99  (3.4%)    -2.3% (  -9% -  5%)  0.062
   IntNRQ                              531.54  (2.9%)                   522.49  (2.7%)    -1.7% (  -7% -  3%)  0.053
   Fuzzy1                              111.13  (2.1%)                   109.80  (2.6%)    -1.2% (  -5% -  3%)  0.107
   AndHighMed                          386.29  (4.1%)                   381.84  (3.3%)    -1.2% (  -8% -  6%)  0.329
   AndHighHigh                          78.96  (5.6%)                    78.18  (4.7%)    -1.0% ( -10% -  9%)  0.548
   BrowseDateSSDVFacets                  4.51 (12.6%)                     4.47 (12.4%)    -0.8% ( -22% - 27%)  0.836
   OrNotHighLow                       1316.24  (3.8%)                  1305.93  (3.1%)    -0.8% (  -7% -  6%)  0.476
   OrHighMedDayTaxoFacets               20.87  (5.1%)                    20.71  (4.2%)    -0.8% (  -9% -  9%)  0.609
   BrowseMonthSSDVFacets                23.54  (6.4%)                    23.42  (7.4%)    -0.5% ( -13% - 14%)  0.817
   BrowseRandomLabelTaxoFacets          37.54  (1.7%)                    37.37  (1.9%)    -0.5% (  -4% -  3%)  0.432
   MedSpanNear                          68.68  (1.7%)                    68.37  (2.2%)    -0.4% (  -4% -  3%)  0.474
   AndHighHighDayTaxoFacets             10.78  (5.9%)                    10.73  (4.7%)    -0.4% ( -10% - 10%)  0.794
   BrowseMonthTaxoFacets                28.39 (10.0%)                    28.29  (9.1%)    -0.3% ( -17% - 20%)  0.910
   HighTermDayOfYearSort               171.78 (13.7%)                   171.22 (13.2%)    -0.3% ( -23% - 30%)  0.939
   PKLookup                            245.27  (2.2%)                   244.52  (1.9%)    -0.3% (  -4% -  3%)  0.635
   HighSloppyPhrase                     39.08  (2.9%)                    38.96  (4.3%)    -0.3% (  -7% -  7%)  0.795
   HighTermMonthSort                   167.47 (15.1%)                   167.06 (14.7%)    -0.2% ( -26% - 34%)  0.959
   HighPhrase                          250.14  (2.8%)                   249.53  (2.3%)    -0.2% (  -5% -  5%)  0.767
   TermDTSort                          138.22 (14.0%)                   137.97 (13.4%)    -0.2% ( -24% - 31%)  0.967
   Fuzzy2                               55.22  (1.6%)                    55.17  (1.5%)    -0.1% (  -3% -  3%)  0.837
   MedTerm                            1844.25  (6.4%)                  1843.10  (4.9%)    -0.1% ( -10% - 11%)  0.972
   MedSloppyPhrase                      15.34  (2.2%)                    15.33  (3.9%)    -0.1% (  -5% -  6%)  0.954
   Prefix3                             110.03  (2.6%)                   110.07  (1.8%)     0.0% (  -4% -  4%)  0.962
   HighSpanNear                          7.95  (1.7%)                     7.97  (1.7%)     0.2% (  -3% -  3%)  0.772
   BrowseDayOfYearTaxoFacets            46.78  (1.9%)                    46.86  (2.1%)     0.2% (  -3% -  4%)  0.788
   AndHighLow                         1291.99  (2.6%)                  1294.28  (3.4%)     0.2% (  -5% -  6%)  0.854
   LowSpanNear                          47.55  (1.5%)                    47.64  (1.4%)     0.2% (  -2% -  3%)  0.697
   Wildcard                            157.83  (1.5%)                   158.14  (1.3%)     0.2% (  -2% -  3%)  0.661
   LowPhrase                            83.20  (2.3%)                    83.37  (2.1%)     0.2% (  -4% -  4%)  0.773
   Respell                              95.18  (1.4%)                    95.47  (1.3%)     0.3% (  -2% -  3%)  0.492
   AndHighMedDayTaxoFacets              51.97  (1.8%)                    52.16  (2.1%)     0.4% (  -3% -  4%)  0.553
   BrowseDateTaxoFacets                 45.77  (2.0%)                    45.98  (1.9%)     0.5% (

[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-23 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558103#comment-17558103
 ] 

Ignacio Vera commented on LUCENE-10396:
---

I have been thinking about the ability to visit a single document per field 
value, as explained by Adrien above, and I think we can implement it in a 
non-intrusive way for SortedDocValues that have low cardinality. The idea is to 
add the following method to the SortedDocValues API with a default 
implementation:
{code}
  /**
   * Advances the iterator to the next document whose ordinal is distinct from the current
   * ordinal and returns the document number itself. Exhausts the iterator and returns {@link
   * #NO_MORE_DOCS} if there are no more ordinals distinct from the current one.
   *
   * The behaviour of this method is undefined when called before the iterator is positioned
   * or after it has been exhausted. Both cases may result in unpredicted behaviour.
   *
   * The default implementation just iterates over the documents and manually checks whether
   * the ordinal has changed, but some implementations are considerably more efficient than that.
   */
  public int advanceOrd() throws IOException {
    int doc = docID();
    if (doc == DocIdSetIterator.NO_MORE_DOCS) {
      return doc;
    }
    final long ord = ordValue();
    do {
      doc = nextDoc();
    } while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord);
    assert doc == docID();
    return doc;
  }
{code}

When consuming the doc values, if the field is the primary sort of the index 
and the cardinality is low (average number of documents per field value > 64?), 
then we will use a DirectMonotonicWriter to write the offset for each ord, 
which should not use too much disk space. When producing the doc values, we 
will override this method with a faster DirectMonotonicReader implementation.
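To illustrate the proposed contract, here is a toy model of the default implementation over a plain int[] of per-document ordinals. This is illustrative only, not the real SortedDocValues API; with an index-sorted field, equal ordinals form contiguous runs and advanceOrd() jumps to the first document of the next run:

```java
// Toy model of the proposed advanceOrd(): walk doc IDs, skipping documents
// that share the current ordinal, mirroring the default implementation above.
public class AdvanceOrdDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    final int[] ords; // ords[doc] = ordinal of doc; index-sorted, so runs are contiguous
    int doc = -1;

    AdvanceOrdDemo(int[] ords) { this.ords = ords; }

    int nextDoc() { return doc = (doc + 1 < ords.length) ? doc + 1 : NO_MORE_DOCS; }
    long ordValue() { return ords[doc]; }

    int advanceOrd() {
        if (doc == NO_MORE_DOCS) return doc;
        long ord = ordValue();
        int d;
        do {
            d = nextDoc();
        } while (d != NO_MORE_DOCS && ordValue() == ord);
        return d;
    }

    public static void main(String[] args) {
        AdvanceOrdDemo dv = new AdvanceOrdDemo(new int[] {0, 0, 0, 1, 1, 2});
        int d = dv.nextDoc(); // position on the first document
        while (d != NO_MORE_DOCS) {
            System.out.println("first doc of ord " + dv.ordValue() + ": " + d);
            d = dv.advanceOrd(); // jump to the first doc of the next ordinal
        }
    }
}
```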




> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

2022-06-23 Thread GitBox


kaivalnp commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r905065952


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, 
Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
 TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
 
-BitSetCollector filterCollector = null;
+Weight filterWeight = null;
 if (filter != null) {
-  filterCollector = new BitSetCollector(reader.leaves().size());
   IndexSearcher indexSearcher = new IndexSearcher(reader);
   BooleanQuery booleanQuery =
   new BooleanQuery.Builder()
   .add(filter, BooleanClause.Occur.FILTER)
   .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
   .build();
-  indexSearcher.search(booleanQuery, filterCollector);
+  Query rewritten = indexSearcher.rewrite(booleanQuery);
+  filterWeight = indexSearcher.createWeight(rewritten, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
 }
 
 for (LeafReaderContext ctx : reader.leaves()) {
-  TopDocs results = searchLeaf(ctx, filterCollector);
+  Bits acceptDocs;
+  int cost;
+  if (filterWeight != null) {
+Scorer scorer = filterWeight.scorer(ctx);
+if (scorer != null) {
+  DocIdSetIterator iterator = scorer.iterator();
+  if (iterator instanceof BitSetIterator) {
+acceptDocs = ((BitSetIterator) iterator).getBitSet();
+  } else {
+acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+  }
+  cost = (int) iterator.cost();

Review Comment:
   The approach you mentioned makes sense, and I have added it into the latest 
commit.
   
   But I feel we are missing out on some performance because of inaccurate 
estimation for `approximateSearch`. We could try @jpountz's suggestion of 
pro-rating the cost of the `BitSet` (since we only want some estimation), and 
performing `approximateSearch` on that. Something like:
   
   ```java
   Bits acceptDocs, liveDocs = ctx.reader().getLiveDocs();
   int maxDoc = ctx.reader().maxDoc(), cost;
   DocIdSetIterator iterator = scorer.iterator();
   if (iterator instanceof BitSetIterator bitSetIterator) {
 BitSet bitSet = bitSetIterator.getBitSet();
 acceptDocs = new Bits() {
   @Override
   public boolean get(int index) {
 return bitSet.get(index) && (liveDocs == null || liveDocs.get(index));
   }
   
   @Override
   public int length() {
 return maxDoc;
   }
 };
 cost = bitSet.cardinality() * ctx.reader().numDocs() / maxDoc;
 TopDocs results = approximateSearch(ctx, acceptDocs, cost);
 if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
   return results;
 } else {
   FilteredDocIdSetIterator filterIterator = new 
FilteredDocIdSetIterator(iterator) {
 @Override
 protected boolean match(int doc) {
   return liveDocs == null || liveDocs.get(doc);
 }
   };
   return exactSearch(ctx, filterIterator);
 }
   }
   ```
   
   This way we won't traverse the `BitSet` even for deletions. If the limit is 
reached, we anyway have to iterate over the `iterator` to build a `BitSet`, so 
instead we can pass the filtered `iterator` to `exactSearch`.
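For reference, the pro-rating arithmetic in the snippet above can be isolated as a tiny helper (an illustrative sketch only, not part of the PR):

```java
/**
 * Illustrative helper: estimate how many filter matches survive deletions by
 * scaling the filter's match count by the live-doc ratio numDocs / maxDoc.
 * Long math avoids intermediate int overflow on large segments.
 */
class CostEstimate {
  static int prorate(int filterMatches, int numDocs, int maxDoc) {
    return (int) ((long) filterMatches * numDocs / maxDoc);
  }
}
```

For example, 1000 filter matches in a segment with maxDoc = 100000 and numDocs = 90000 (10% deletions) gives an estimated cost of 900.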



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp commented on pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

2022-06-23 Thread GitBox


kaivalnp commented on PR #951:
URL: https://github.com/apache/lucene/pull/951#issuecomment-1164455602

   Thank you! I have added this approach to the latest commit, and a suggestion 
to incorporate deletes above





[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

2022-06-23 Thread GitBox


alessandrobenedetti commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r905082182


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -246,7 +246,7 @@ private boolean diversityCheck(
 for (int i = 0; i < neighbors.size(); i++) {
   float neighborSimilarity =
   similarityFunction.compare(candidate, 
vectorValues.vectorValue(neighbors.node[i]));
-  if (neighborSimilarity >= score) {

Review Comment:
   Forget about this; apparently with the new vector file format there's no 
difference anymore.
   Quite peculiar though; when I have time I will investigate.






[jira] [Comment Edited] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-23 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558103#comment-17558103
 ] 

Ignacio Vera edited comment on LUCENE-10396 at 6/23/22 3:01 PM:


I have been thinking about the ability to visit a single document per field 
value, as explained by Adrien above, and I think we can implement it in a 
non-intrusive way for SortedDocValues that have low cardinality. The idea is to 
add the following method to the SortedDocValues API with a default implementation:

{code}
  /**
   * Advances the iterator to the next document whose ordinal is distinct from the current ordinal
   * and returns the document number itself. It returns {@link
   * #NO_MORE_DOCS} if there are no more ordinals distinct from the current one left.
   *
   * The behaviour of this method is undefined when called after the iterator has been exhausted
   * and may result in unpredictable behaviour.
   *
   * The default implementation just iterates over the documents and manually checks whether the
   * ordinal has changed, but some implementations are considerably more efficient than that.
   */
   */
  public int advanceOrd() throws IOException {
int doc = docID();
if (doc == DocIdSetIterator.NO_MORE_DOCS) {
  return doc;
}
final long ord = ordValue();
do  {
  doc = nextDoc();
} while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord);
assert doc == docID();
return doc;
  }
{code}

When consuming the doc values, if the field is the primary sort of the index and 
the cardinality is low (average number of documents per field value > 64?), then 
we will use a DirectMonotonicWriter to write the offset for each ord, which 
should not use too much disk space. When producing the doc values, we will 
override this method with a faster DirectMonotonicReader implementation.





was (Author: ivera):
I have been thinking on the ability if visiting a single documents per field 
value as explained for Adrien above and I think we can implement it in a not 
intrusive way for SortedDocValues that have low cardinality. The idea is to add 
the following method to the SortedDocValues api with a default implementation:

{code}
  /**
   * Advances the iterator the the next document whose ordinal is distinct to 
the current ordinal
   * and returns the document number itself. Exhausts the iterator and returns 
{@link
   * #NO_MORE_DOCS} if there is no more ordinals distinct to the current one.
   *
   * The behaviour of this method is undefined when called with 
 target ≤ current
   * , or after the iterator has exhausted. Both cases may result in 
unpredicted behaviour.
   *  
   * The default implementation just iterates over the documents and manually 
checks if the ordinal has changed but
   *  but some implementations are  considerably more efficient than that.
   *
   */
  public int advanceOrd() throws IOException {
int doc = docID();
if (doc == DocIdSetIterator.NO_MORE_DOCS) {
  return doc;
}
final long ord = ordValue();
do  {
  doc = nextDoc();
} while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord);
assert doc == docID();
return doc;
  }
{code}

When consuming the doc values, if the field is the primary sort of he index and 
the cardinality is low (average of documents per field >64?), then we will use 
a DirectMonotonicWriter to write the offset for each ord which should not use 
too much disk space. When producing the doc value, we will override this method 
with a faster DirectMonotonicReader implementation.




> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that 

[jira] [Commented] (LUCENE-9580) Tessellator failure for a certain polygon

2022-06-23 Thread Hugo Mercier (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558127#comment-17558127
 ] 

Hugo Mercier commented on LUCENE-9580:
--

I've encountered the same issue on Elasticsearch 7.17 (Lucene 8.11) with a 
polygon having 4 holes and the patch correctly fixes the problem.

Will there be a backport of the fix onto 8.11, so that it could be included in 
ES 7.17.x?

> Tessellator failure for a certain polygon
> -
>
> Key: LUCENE-9580
> URL: https://issues.apache.org/jira/browse/LUCENE-9580
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5, 8.6
>Reporter: Iurii Vyshnevskyi
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This bug was discovered while using ElasticSearch (checked with versions 
> 7.6.2 and 7.9.2).
> But I've created an isolated test case just for Lucene: 
> [https://github.com/apache/lucene-solr/pull/2006/files]
>  
> The unit test fails with "java.lang.IllegalArgumentException: Unable to 
> Tessellate shape".
>  
> The polygon contains two holes that share the same vertex and one more 
> standalone hole.
> Removing any of them makes the unit test pass. 
>  
> Changing the least significant digit in any coordinate of the "common vertex" 
> in any of two first holes, so that these vertices become different in each 
> hole - also makes unit test pass.






[GitHub] [lucene] alessandrobenedetti commented on pull request #926: VectorSimilarityFunction reverse removal

2022-06-23 Thread GitBox


alessandrobenedetti commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1164571753

   @msokolov your input has been invaluable!
   I ran the tests on the same machine, with the preprocessed files, and now the 
results are different.
   The main branch and this branch show basically the same performance, which is 
still good enough to go ahead with this cleanup.
   I'll have my colleague @eliaporciani repeat the tests on Apple M1.
   
   The following tests were executed in IntelliJ running 
org.apache.lucene.util.hnsw.KnnGraphTester.
   2.4 GHz 8-Core Intel Core i9 - 32 GB 2667 MHz DDR4
   
   'INDEXING EUCLIDEAN SIFT
   
   -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5-train
 -metric euclidean
   
   ORIGINAL
   IW 0 [2022-06-23T13:54:27.747641Z; main]: 63533 msec to write vectors
   IW 0 [2022-06-23T13:55:48.843182Z; main]: 64045 msec to write vectors
   IW 0 [2022-06-23T13:57:49.295840Z; main]: 61186 msec to write vectors
   
   
   
   THIS BRANCH
   IW 0 [2022-06-23T14:03:37.728524Z; main]: 62374 msec to write vectors
   IW 0 [2022-06-23T14:05:08.103138Z; main]: 59842 msec to write vectors
   IW 0 [2022-06-23T14:06:29.304945Z; main]: 60854 msec to write vectors
   
   
   
   INDEXING EUCLIDEAN FASHION
   
   -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/fashion-mnist-784-euclidean.hdf5-train
 -metric euclidean
   
   ORIGINAL
   IW 0 [2022-06-23T13:59:09.597175Z; main]: 31619 msec to write vectors
   IW 0 [2022-06-23T13:59:53.456692Z; main]: 31519 msec to write vectors
   IW 0 [2022-06-23T14:01:08.290438Z; main]: 32137 msec to write vectors
   
   
   THIS BRANCH
   IW 0 [2022-06-23T14:14:35.732944Z; main]: 31096 msec to write vectors
   IW 0 [2022-06-23T14:15:32.317792Z; main]: 30997 msec to write vectors
   IW 0 [2022-06-23T14:16:20.496127Z; main]: 31145 msec to write vectors
   
   
   
   INDEXING ANGULAR LASTFM
   
   -beamWidthIndex 100 -maxConn 16 -ndoc 5 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5-train
 -metric angular 
   
   ORIGINAL
   IW 0 [2022-06-23T14:40:01.946477Z; main]: 18302 msec to write vectors
   IW 0 [2022-06-23T14:41:08.918642Z; main]: 18159 msec to write vectors
   IW 0 [2022-06-23T14:41:45.277353Z; main]: 18434 msec to write vectors
   
   
   THIS BRANCH
   IW 0 [2022-06-23T14:37:00.629288Z; main]: 19394 msec to write vectors
   IW 0 [2022-06-23T14:37:36.090859Z; main]: 19088 msec to write vectors
   IW 0 [2022-06-23T14:38:12.640373Z; main]: 18438 msec to write vectors
   
   
   INDEXING ANGULAR NY
   -beamWidthIndex 100 -maxConn 16 -ndoc 1 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/nytimes-256-angular.hdf5-train
 -metric angular 
   
   ORIGINAL
   IW 0 [2022-06-23T14:42:46.883227Z; main]: 37087 msec to write vectors
   IW 0 [2022-06-23T14:43:49.080938Z; main]: 37791 msec to write vectors
   IW 0 [2022-06-23T14:46:02.353413Z; main]: 37156 msec to write vectors
   
   
   THIS BRANCH
   IW 0 [2022-06-23T14:47:30.493401Z; main]: 36602 msec to write vectors
   IW 0 [2022-06-23T14:48:34.029209Z; main]: 36707 msec to write vectors
   IW 0 [2022-06-23T14:49:29.004465Z; main]: 38374 msec to write vectors
   
   
   
   ——
   
   
   SEARCH EUCLIDEAN SIFT
   
   -niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5-train
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5-test
 -metric euclidean
   
   ORIGINAL
   completed 500 searches in 762 ms: 656 QPS CPU time=761ms
   completed 500 searches in 760 ms: 657 QPS CPU time=759ms
   completed 500 searches in 770 ms: 649 QPS CPU time=769ms
   
   
   THIS BRANCH
   completed 500 searches in 745 ms: 671 QPS CPU time=745ms
   completed 500 searches in 756 ms: 661 QPS CPU time=755ms
   completed 500 searches in 771 ms: 648 QPS CPU time=769ms
   
   
   SEARCH EUCLIDEAN FASHION
   
   -niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/fashion-mnist-784-euclidean.hdf5-train
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/fashion-mnist-784-euclidean.hdf5-test
 -metric euclidean
   
   ORIGINAL
   completed 500 searches in 340 ms: 1470 QPS CPU time=339ms
   completed 500 searches in 342 ms: 1461 QPS CPU time=341ms
   completed 500 searches in 350 ms: 1428 QPS CPU time=349ms
   
   
   THIS BRANCH
   completed 500 searches in 339 ms: 1474 QPS CPU time=339ms
   completed 500 searches in 347 ms: 1440 QPS CPU time=346ms
   completed 500 searches in 346 ms: 1445 QPS CPU time=345ms
   
   
   
   SEARCH ANGULAR LASTFM
   
   -niter 500 -beamWidthIndex 100

[jira] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-23 Thread Alessandro Benedetti (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10593 ]


Alessandro Benedetti deleted comment on LUCENE-10593:
---

was (Author: alessandro.benedetti):
Hi @msokolov, @mayya-sharipova and @jtibshirani, I have finally finished my 
performance tests.
Initially the results were worse on this branch; I found that suspicious, as I 
expected the removal of the BoundChecker and of the reverse mechanism to 
outweigh the additional division in the distance measure during graph building 
and searching.

After a deep investigation I found the culprit (you see it in the latest 
commit).


{code:java}
if (neighborSimilarity >= score) {
if ((neighborSimilarity < score) == false) { // this version improves the 
performance dramatically in both indexing/searching
{code}


After that fix, the results are very encouraging.
There are strong speedups for both angular and euclidean distances, both for 
indexing and searching.
*If this is validated, we are getting a great cleanup of the code and also a 
nice performance boost.*
I'll have my colleague @eliaporciani repeat the tests on Apple M1.

The following tests were executed in IntelliJ running 
org.apache.lucene.util.hnsw.KnnGraphTester.
2.4 GHz 8-Core Intel Core i9 - 32 GB 2667 MHz DDR4


{noformat}
`INDEXING EUCLIDEAN

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors
IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors
IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors
IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors
IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors

INDEXING ANGULAR

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors
IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors
IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors
IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors
IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors`

`SEARCH EUCLIDEAN

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms
completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms
completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms

THIS BRANCH
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 47 ms: 10638 QPS CPU time=46ms

SEARCH ANGULAR

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
completed 500 searches in 154 ms: 3246 QPS CPU time=153ms
completed 500 searches in 162 ms: 3086 QPS CPU time=162ms
completed 500 searches in 166 ms: 3012 QPS CPU time=166ms

THIS BRANCH
completed 500 searches in 62 ms: 8064 QPS CPU time=62ms
completed 500 searches in 65 ms: 7692 QPS CPU time=65ms
completed 500 searches in 63 ms: 7936 QPS CPU time=62ms
`
{noformat}



Please correct me in case I did anything wrong; it's the first time I have used 
org.apache.lucene.util.hnsw.KnnGraphTester.

I am proceeding with additional performance tests on different datasets.
Functional tests are all green.

> VectorSimilarityFunction reverse removal
> 
>
> Key: LUCENE-10593
> URL: https://issues.apache.org/jira/browse/LUCENE-10593
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
>  Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves 
> in an opposite way in comparison to the other similarities:
> A higher similarity score means a higher distance; for this reason, it has been 
> marked with "reversed" and a function is present to map from the similarity 
> to a score (where higher means closer, like in all other similarities).
> Having this counterintuitive behavior with no apparent explanation I could 
> find

[jira] [Commented] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558141#comment-17558141
 ] 

Alessandro Benedetti commented on LUCENE-10593:
---

Recent performance tests are in the pull request.
There's no evidence of a slowdown, so this refactor seems good to go to me.
Functional tests are all green.

Planning to continue discussions and merge next week.

> VectorSimilarityFunction reverse removal
> 
>
> Key: LUCENE-10593
> URL: https://issues.apache.org/jira/browse/LUCENE-10593
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
>  Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves 
> in an opposite way in comparison to the other similarities:
> A higher similarity score means a higher distance; for this reason, it has been 
> marked with "reversed" and a function is present to map from the similarity 
> to a score (where higher means closer, like in all other similarities).
> Having this counterintuitive behavior, with no apparent explanation that I could 
> find (please correct me if I am wrong), brings a lot of nasty side effects for 
> code readability, especially when combined with the NeighborQueue, which has a 
> "reversed" itself.
> In addition, it complicates also the usage of the pattern:
> Result Queue -> MIN HEAP
> Candidate Queue -> MAX HEAP
> In HNSW searchers.
> The proposal in my Pull Request aims to:
> 1) the Euclidean similarity just returns the score, in line with the other 
> similarities, with the formula currently used to move from distance to score
> 2) simplify the code, removing the bound checker that's not necessary anymore
> 3) refactor here and there to be in line with the simplification
> 4) refactor of NeighborQueue to clearly state when it's a MIN_HEAP or 
> MAX_HEAP, now debugging is much easier and understanding the HNSW code is 
> much more intuitive
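As a self-contained illustration of the Result Queue = MIN heap / Candidate Queue = MAX heap pattern described above (using java.util.PriorityQueue in place of Lucene's NeighborQueue; this is purely a sketch, not the actual HNSW code):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/**
 * Sketch of the two-heap pattern used in HNSW-style searchers: the results
 * queue is a MIN heap so its head is the worst hit kept so far (cheap to evict
 * when a better one arrives), while the candidates queue is a MAX heap so its
 * head is the most promising node to expand next.
 */
class HeapPattern {
  /** Results: natural ordering of scores, i.e. a min-heap. */
  static PriorityQueue<Float> newResultsQueue() {
    return new PriorityQueue<>();
  }

  /** Candidates: reversed ordering of scores, i.e. a max-heap. */
  static PriorityQueue<Float> newCandidatesQueue() {
    return new PriorityQueue<>(Comparator.reverseOrder());
  }
}
```

With scores 0.3, 0.9 and 0.5 inserted into both queues, the results queue exposes 0.3 at its head while the candidates queue exposes 0.9, which is exactly the behavior the refactor makes explicit.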






[GitHub] [lucene] jpountz commented on pull request #964: LUCENE-10620: Pass the Weight to Collectors.

2022-06-23 Thread GitBox


jpountz commented on PR #964:
URL: https://github.com/apache/lucene/pull/964#issuecomment-1164586245

   Thanks for taking the time to think about it @gsmiller, appreciated!





[GitHub] [lucene] jpountz merged pull request #964: LUCENE-10620: Pass the Weight to Collectors.

2022-06-23 Thread GitBox


jpountz merged PR #964:
URL: https://github.com/apache/lucene/pull/964





[jira] [Commented] (LUCENE-10620) Can we pass the Weight to Collector?

2022-06-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558151#comment-17558151
 ] 

ASF subversion and git services commented on LUCENE-10620:
--

Commit 4c1ae2a332cb85878310142ebb9fd5beba0345f2 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4c1ae2a332c ]

LUCENE-10620: Pass the Weight to Collectors. (#964)

This allows `Collector`s to use `Weight#count` when appropriate.

> Can we pass the Weight to Collector?
> 
>
> Key: LUCENE-10620
> URL: https://issues.apache.org/jira/browse/LUCENE-10620
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Today collectors cannot know about the Weight, and thus they cannot leverage 
> {{Weight#count}}. {{IndexSearcher#count}} works around it by extending 
> {{TotalHitCountCollector}} in order to shortcut counting the number of hits 
> on a segment via {{Weight#count}} whenever possible.
> It works, but I would prefer this shortcut to work for all users of 
> TotalHitCountCollector. For instance the faceting module creates a 
> MultiCollector over a TotalHitCountCollector and a FacetCollector, and today 
> it doesn't benefit from quick counts, which would enable it to only collect 
> matches into a FacetCollector.
> I'm considering adding a new {{Collector#setWeight}} API to allow collectors 
> to leverage {{Weight#count}}. I gave {{TotalHitCountCollector}} as an example 
> above, but this could have applications for our top-docs collectors too, 
> which could skip counting hits at all if the weight can provide them with the 
> hit count up-front.
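A minimal sketch of the shortcut this enables, with stand-in types instead of Lucene's Weight and Collector (the interface and method names here are assumptions for illustration only):

```java
/**
 * Illustrative sketch: when the weight can report the per-segment hit count
 * up-front (count() >= 0), a total-hit-count collector can skip per-document
 * collection entirely; otherwise it falls back to iterating the matches.
 */
class HitCounter {
  interface SegmentWeight { // stand-in for Lucene's per-segment Weight
    int count();            // -1 when the count cannot be computed cheaply
    int[] matchingDocs();   // stand-in for iterating the scorer's matches
  }

  static int countHits(SegmentWeight weight) {
    int shortcut = weight.count();
    if (shortcut >= 0) {
      return shortcut; // quick count: no per-doc collection needed
    }
    int total = 0;
    for (int ignoredDoc : weight.matchingDocs()) {
      total++; // slow path: count each match individually
    }
    return total;
  }
}
```

In the fast case the matches are never iterated at all, which is the benefit the faceting module's MultiCollector would pick up from a setWeight-style API.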






[jira] [Resolved] (LUCENE-10620) Can we pass the Weight to Collector?

2022-06-23 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10620.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Can we pass the Weight to Collector?
> 
>
> Key: LUCENE-10620
> URL: https://issues.apache.org/jira/browse/LUCENE-10620
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Today collectors cannot know about the Weight, and thus they cannot leverage 
> {{Weight#count}}. {{IndexSearcher#count}} works around it by extending 
> {{TotalHitCountCollector}} in order to shortcut counting the number of hits 
> on a segment via {{Weight#count}} whenever possible.
> It works, but I would prefer this shortcut to work for all users of 
> TotalHitCountCollector. For instance the faceting module creates a 
> MultiCollector over a TotalHitCountCollector and a FacetCollector, and today 
> it doesn't benefit from quick counts, which would enable it to only collect 
> matches into a FacetCollector.
> I'm considering adding a new {{Collector#setWeight}} API to allow collectors 
> to leverage {{Weight#count}}. I gave {{TotalHitCountCollector}} as an example 
> above, but this could have applications for our top-docs collectors too, 
> which could skip counting hits at all if the weight can provide them with the 
> hit count up-front.






[GitHub] [lucene] jtibshirani commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

2022-06-23 Thread GitBox


jtibshirani commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r905236130


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, 
Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
 TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
 
-BitSetCollector filterCollector = null;
+Weight filterWeight = null;
 if (filter != null) {
-  filterCollector = new BitSetCollector(reader.leaves().size());
   IndexSearcher indexSearcher = new IndexSearcher(reader);
   BooleanQuery booleanQuery =
   new BooleanQuery.Builder()
   .add(filter, BooleanClause.Occur.FILTER)
   .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
   .build();
-  indexSearcher.search(booleanQuery, filterCollector);
+  Query rewritten = indexSearcher.rewrite(booleanQuery);
+  filterWeight = indexSearcher.createWeight(rewritten, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
 }
 
 for (LeafReaderContext ctx : reader.leaves()) {
-  TopDocs results = searchLeaf(ctx, filterCollector);
+  Bits acceptDocs;
+  int cost;
+  if (filterWeight != null) {
+Scorer scorer = filterWeight.scorer(ctx);
+if (scorer != null) {
+  DocIdSetIterator iterator = scorer.iterator();
+  if (iterator instanceof BitSetIterator) {
+acceptDocs = ((BitSetIterator) iterator).getBitSet();
+  } else {
+acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+  }
+  cost = (int) iterator.cost();

Review Comment:
   Personally I'd prefer to keep `visitedLimit` always an accurate 
representation. I think it makes the algorithm easier to reason about. We can 
always revisit this in the future with more ideas for optimizations.






[GitHub] [lucene] jtibshirani commented on pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

2022-06-23 Thread GitBox


jtibshirani commented on PR #951:
URL: https://github.com/apache/lucene/pull/951#issuecomment-1164628757

   The latest approach looks good to me. Are you still seeing a significant 
latency improvement in some cases?





[GitHub] [lucene] mdmarshmallow commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-06-23 Thread GitBox


mdmarshmallow commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1164640793

   Yeah, I think this change should be completely compatible with 9.3. Most of 
our changes are isolated to the new `facetset` package, and the rest just add 
functions in a few places, which should not affect any existing functionality.





[jira] [Created] (LUCENE-10626) Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion

2022-06-23 Thread Peter Gromov (Jira)
Peter Gromov created LUCENE-10626:
-

 Summary: Hunspell: add tools to aid dictionary editing: analysis 
introspection, stem expansion and stem/flag suggestion
 Key: LUCENE-10626
 URL: https://issues.apache.org/jira/browse/LUCENE-10626
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Peter Gromov


The following tools would be nice to have when editing and extending an 
existing dictionary:
1. See how Hunspell analyzes a given word, with all the involved affix flags: 
`Hunspell.analyzeSimpleWord`
2. See all forms that the given stem can produce with the given flags: 
`Hunspell.expandRoot`, `WordFormGenerator.expandRoot`
3. Given a number of word forms, suggest a stem and a set of flags that produce 
these word forms: `Hunspell.compress`, `WordFormGenerator.compress`.






[GitHub] [lucene] donnerpeter opened a new pull request, #975: LUCENE-10626 Hunspell: add tools to aid dictionary editing

2022-06-23 Thread GitBox


donnerpeter opened a new pull request, #975:
URL: https://github.com/apache/lucene/pull/975

   https://issues.apache.org/jira/browse/LUCENE-10626





[GitHub] [lucene] donnerpeter commented on pull request #975: LUCENE-10626 Hunspell: add tools to aid dictionary editing

2022-06-23 Thread GitBox


donnerpeter commented on PR #975:
URL: https://github.com/apache/lucene/pull/975#issuecomment-1164800989

   Reviewing commits separately might be easier





[GitHub] [lucene] kaivalnp commented on pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

2022-06-23 Thread GitBox


kaivalnp commented on PR #951:
URL: https://github.com/apache/lucene/pull/951#issuecomment-1164860104

   Yes, I saw similar improvements for `BitSet`-backed queries to the numbers 
[here](https://github.com/apache/lucene/pull/932)





[GitHub] [lucene] shahrs87 commented on a diff in pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-06-23 Thread GitBox


shahrs87 commented on code in PR #907:
URL: https://github.com/apache/lucene/pull/907#discussion_r905458119


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -1378,7 +1378,7 @@ private static Status.TermIndexStatus checkFields(
   computedFieldCount++;
 
   final Terms terms = fields.terms(field);
-  if (terms == null) {
+  if (terms == Terms.EMPTY) {

Review Comment:
   I tried to remove this `if (terms == Terms.EMPTY)` statement, but many tests 
failed.
   Example of failing test: 
`org.apache.lucene.codecs.lucene90.TestLucene90NormsFormat#testUndeadNorms`
   Stack trace:
   ```
   > Task :lucene:core:test FAILED
   WARNING: A command line option has enabled the Security Manager
   WARNING: The Security Manager is deprecated and will be removed in a future 
release
   
   org.apache.lucene.codecs.lucene90.TestLucene90NormsFormat > testUndeadNorms 
FAILED
   org.apache.lucene.index.CheckIndex$CheckIndexException: field "content" 
should have hasFreqs=true but got false
   at 
__randomizedtesting.SeedInfo.seed([2A7308C15B316422:27A64CAC8EDF759E]:0)
   at 
app//org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1437)
   at 
app//org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:2428)
   at 
app//org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:999)
   at 
app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:714)
   at 
app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:552)
   at 
app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343)
   at 
app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
   at 
app//org.apache.lucene.tests.index.BaseNormsFormatTestCase.testUndeadNorms(BaseNormsFormatTestCase.java:698)
   at 
java.base@17.0.2/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
   at 
java.base@17.0.2/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   at 
java.base@17.0.2/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.base@17.0.2/java.lang.reflect.Method.invoke(Method.java:568)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
   at 
app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
   at 
app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   at 
app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   at 
app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   at 
app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
   ```
   
   I think the problem is:
   Even though the `content` field is deleted, it is still present in the fields 
returned by `reader#getPostingsReader` 
[here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L2417).
   Since the returned terms object is `Terms.EMPTY`, its `hasFreqs` is `false`, 
but because the field still exists in the fields and its indexOptions are set to 
`DOCS_AND_FREQS_AND_POSITIONS`, CheckIndex expects `hasFreqs` to be `true`. 






[GitHub] [lucene] shahrs87 commented on a diff in pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-06-23 Thread GitBox


shahrs87 commented on code in PR #907:
URL: https://github.com/apache/lucene/pull/907#discussion_r905561688


##
lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java:
##
@@ -595,7 +595,7 @@ private void setField(String field) throws IOException {
 
 DocIdSetIterator nextTerm(String field, BytesRef term) throws IOException {
   setField(field);
-  if (termsEnum != null) {
+  if (termsEnum != null && termsEnum != TermsEnum.EMPTY) {

Review Comment:
   The following test failed 
`TestPointQueries#testAllPointDocsWereDeletedAndThenMergedAgain`
   ```
   Cannot read field "bytes" because "other" is null
   java.lang.NullPointerException: Cannot read field "bytes" because "other" is 
null
at 
__randomizedtesting.SeedInfo.seed([16B9FF96CE2E2AF5:59A84BA7E3282B90]:0)
at org.apache.lucene.util.BytesRef.compareTo(BytesRef.java:159)
at 
org.apache.lucene.index.FrozenBufferedUpdates$TermDocsIterator.nextTerm(FrozenBufferedUpdates.java:604)
at 
org.apache.lucene.index.FrozenBufferedUpdates.applyTermDeletes(FrozenBufferedUpdates.java:473)
at 
org.apache.lucene.index.FrozenBufferedUpdates.apply(FrozenBufferedUpdates.java:175)
at org.apache.lucene.index.IndexWriter.forceApply(IndexWriter.java:5965)
at org.apache.lucene.index.IndexWriter.tryApply(IndexWriter.java:5865)
at 
org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$10(IndexWriter.java:2771)
at 
org.apache.lucene.index.IndexWriter$EventQueue.processEventsInternal(IndexWriter.java:328)
at 
org.apache.lucene.index.IndexWriter$EventQueue.processEvents(IndexWriter.java:317)
at 
org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5708)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4038)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3995)
at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2099)
at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2080)
at 
org.apache.lucene.search.TestPointQueries.testAllPointDocsWereDeletedAndThenMergedAgain(TestPointQueries.java:1221)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
   ```
   






[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-23 Thread Weiming Wu (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558247#comment-17558247
 ] 

Weiming Wu commented on LUCENE-10624:
-

Hi Adrien. Thanks for your comments!

Regarding the reason for the speedup: I investigated this sparse-doc test case. 
It retrieves all field values of the hit docs at the end of the test, and the 
change speeds up that operation: 
[https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/SearchTaxis.java#L152-L155]
{code:java}
for (ScoreDoc hit : hits.scoreDocs) {
  Document doc = searcher.doc(hit.doc);
  results.add("  " + hit.doc + " " + hit.score + ": " + doc.getFields().size() 
+ " fields");
} {code}
I also found the blog post about this performance test; it seems designed to 
test sparse doc-value retrieval (not an expert, feel free to correct me): 
[https://www.elastic.co/blog/sparse-versus-dense-document-values-with-apache-lucene]

For exponential search, I ran the performance test again. Compared to pure 
binary search, some cases speed up and some slow down. I also tested it in our 
search system, where latency increased slightly because our docs are very 
sparse. I feel I need to investigate more, so I plan to open a new issue for 
exponential search. Does that make sense?
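For context, the two lookup strategies under discussion can be sketched on a plain sorted doc-ID array (hypothetical method names; the actual `IndexedDISI` code operates on packed block data): binary search costs O(log n) per advance regardless of distance, while exponential search gallops outward from the current position first, so short advances touch only a few entries.

```java
public class BlockSearchSketch {
    // Binary search: index of the first element >= target in docs[from..to),
    // or `to` if no such element exists.
    static int binarySearchCeil(int[] docs, int from, int to, int target) {
        int lo = from, hi = to - 1, result = to;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] >= target) {
                result = mid;
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return result;
    }

    // Exponential (galloping) search: double a window from `from`, then
    // binary-search inside it. Cheap when the target is near the cursor.
    static int exponentialSearchCeil(int[] docs, int from, int to, int target) {
        int bound = 1;
        while (from + bound < to && docs[from + bound] < target) {
            bound <<= 1;
        }
        int hi = Math.min(from + bound + 1, to);
        return binarySearchCeil(docs, from + (bound >> 1), hi, target);
    }

    public static void main(String[] args) {
        int[] docs = {3, 8, 15, 21, 40, 41, 57, 90};
        System.out.println(binarySearchCeil(docs, 0, docs.length, 41));      // 5
        System.out.println(exponentialSearchCeil(docs, 0, docs.length, 41)); // 5
        System.out.println(exponentialSearchCeil(docs, 2, docs.length, 100)); // 8 (not found)
    }
}
```

Which one wins depends on how far each advance typically jumps, which is consistent with the mixed benchmark results reported above.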

> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found method "advanceWithinBlock" and 
> "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to 
> their O(N) doc lookup algorithm.
> h3. Changes
> Used binary search algorithm to replace current O(N) lookup algorithm in 
> Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because 
> docs are in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the 
> reports of baseline and candidates in attachments section.{color}
> {color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color}
> {color:#1d1c1d}2. Some highlights (>20%):{color}
>  * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] 
> yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}*
>  ** {color:#1d1c1d}*Baseline:* 10973978+ hits in *726.81967 msec*{color}
>  ** {color:#1d1c1d}*Candidate:* 10973978+ hits in *484.544594 msec*{color}
>  * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}*
>  ** {color:#1d1c1d}*Baseline:* 2300174+ hits in *95.698324 msec*{color}
>  ** {color:#1d1c1d}*Candidate:* 2300174+ hits in *78.336193 msec*{color}
>  * {color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color}
>  ** {color:#1d1c1d}*Baseline:* 2300174+ hits in *391.565239 msec*{color}
>  ** {color:#1d1c1d}*Candidate:* 300174+ hits in *227.592885 msec*{color}
>  * {color:#1d1c1d}*...*{color}






[jira] [Comment Edited] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-23 Thread Weiming Wu (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558247#comment-17558247
 ] 

Weiming Wu edited comment on LUCENE-10624 at 6/23/22 10:46 PM:
---

Hi Adrien. Thanks for your comments!

Regarding the reason for the speedup: I investigated this sparse-doc test case. 
It retrieves all field values of the hit docs at the end of the test, and the 
change speeds up that operation: 
[https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/SearchTaxis.java#L152-L155]
{code:java}
for (ScoreDoc hit : hits.scoreDocs) {
  Document doc = searcher.doc(hit.doc);
  results.add("  " + hit.doc + " " + hit.score + ": " + doc.getFields().size() 
+ " fields");
} {code}
I also found the blog post about this performance test; it seems designed to 
test sparse doc-value retrieval (not an expert, feel free to correct me): 
[https://www.elastic.co/blog/sparse-versus-dense-document-values-with-apache-lucene]

For exponential search, I ran the performance test again. Compared to pure 
binary search, some cases speed up and some slow down. I also tested it in our 
search system, where latency increased slightly because our docs are very 
sparse. I feel I need to investigate more, so I plan to open a new issue for 
exponential search. Does that make sense? [~jpountz] 




[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-23 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558263#comment-17558263
 ] 

Michael McCandless commented on LUCENE-10557:
-

{quote}I'm still not fully sure if we can/should make Jira completely 
read-only; maybe we'll have a discussion on the mailing list later.
{quote}
OK, that's fair – I just think having two writable issue trackers at the same 
time is asking for disaster.  It really should be an atomic switch from Jira to 
GitHub issues to close that risk.  But we can defer that discussion until we 
agree the migration is even the right choice.  Maybe we must decide to live 
with Jira forever instead of hard-switching to GitHub issues.
{quote}Did you see/start from [the Lucene.Net migration 
tool|https://github.com/bongohrtech/jira-issues-importer/tree/lucenenet]?
{quote}
{quote}No - Lucene.Net and Lucene have different requirements, and data 
migration/conversion scripts like this are usually not reusable.  I think it'd 
be easier to write a tool that fits our needs from scratch than to tweak 
others' work that is optimized for their needs. (It's not technically difficult 
- a set of tiny scripts is sufficient; there are just many uncertainties.)
{quote}
OK, that's fair.

I just wanted to make sure you were aware of how Lucene.Net accomplished their 
Jira -> GitHub Issues migration so we could build on that / improve for our 
specific requirements.  We are not the first Apache project that feels the need 
to 1) migrate from Jira -> GitHub issues, and 2) preserve the history.  So 
let's learn from past projects like Lucene.Net and others.

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * Choose issues that should be moved to GitHub
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses. 
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
>  * Build the convention for issue label/milestone management
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)




[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558318#comment-17558318
 ] 

Tomoko Uchida commented on LUCENE-10557:


Seems converting Jira "table" markup to Markdown is error-prone, e.g. 
[https://github.com/mocobeta/sandbox-lucene-10557/issues/188]

I'm not sure of the cause - maybe the data dumped via the Jira API doesn't 
preserve the original text, maybe there were some breaking 
changes/incompatibilities in the Jira markup language, or maybe there is a bug 
in the converter library ([https://github.com/catcombo/jira2markdown]) that the 
script relies on. 

We have lots of tables in Jira comments, and important information is in them; 
it's a minor blocker for me if we can't figure out a way to correctly convert 
tables to Markdown.
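To illustrate the kind of transformation involved, here is a minimal hand-rolled sketch (my own illustration, not the jira2markdown library's implementation) that converts only the basic Jira table form - a `||h1||h2||` header row followed by `|a|b|` body rows - to a GitHub Markdown table. Real comments are messier (nested markup, multi-line cells), which is presumably where conversion breaks down.

```java
public class JiraTableSketch {
    // Convert one Jira-markup table to GitHub Markdown. Handles only the
    // basic form: a "||h1||h2||" header row followed by "|a|b|" body rows.
    static String convertTable(String jira) {
        StringBuilder out = new StringBuilder();
        for (String line : jira.split("\n")) {
            line = line.trim();
            if (line.startsWith("||")) {
                // Header row: "||" separators become "|", plus a divider row.
                String[] cells = line.substring(2, line.length() - 2).split("\\|\\|");
                out.append("| ").append(String.join(" | ", cells)).append(" |\n");
                out.append("|");
                for (int i = 0; i < cells.length; i++) out.append(" --- |");
                out.append("\n");
            } else if (line.startsWith("|")) {
                // Body row: trim the outer pipes and rejoin the cells.
                String[] cells = line.substring(1, line.length() - 1).split("\\|");
                out.append("| ").append(String.join(" | ", cells)).append(" |\n");
            } else {
                out.append(line).append("\n"); // pass non-table lines through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String jira = "||name||hits||\n|query A|42|\n|query B|7|";
        System.out.print(convertTable(jira));
    }
}
```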

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * Choose issues that should be moved to GitHub
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses. 
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
>  * Build the convention for issue label/milestone management
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)
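One of the trickier items in the list above is converting cross-issue links in comments and descriptions, which could be driven by the proposed old-issue/new-issue mapping file. A rough sketch of that step is below; the key-to-issue-number map and the `apache/lucene` issues URL scheme are illustrative assumptions, not a settled convention:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rewrite LUCENE-XYZ keys in a comment body into links
// to the migrated GitHub issues, using an old-key -> new-issue-number map.
final class IssueLinkRewriter {
  private static final Pattern JIRA_KEY = Pattern.compile("\\bLUCENE-\\d+\\b");
  private final Map<String, Integer> jiraToGithub;

  IssueLinkRewriter(Map<String, Integer> jiraToGithub) {
    this.jiraToGithub = jiraToGithub;
  }

  String rewrite(String body) {
    Matcher m = JIRA_KEY.matcher(body);
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
      Integer gh = jiraToGithub.get(m.group());
      // Keep the original key visible so old references stay searchable;
      // leave unmapped keys untouched.
      String replacement = gh == null
          ? m.group()
          : "[" + m.group() + "](https://github.com/apache/lucene/issues/" + gh + ")";
      m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
  }
}
```

Keeping the original `LUCENE-XYZ` text inside the link also helps the related item about making issues easy to find by their old key after migration.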



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-23 Thread Weiming Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiming Wu updated LUCENE-10624:

Attachment: candiate-exponential-searchsparse-sorted.0.log

> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candiate-exponential-searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed a DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9: our latency increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found that the methods "advanceWithinBlock" 
> and "advanceExactWithinBlock" of the sparse IndexedDISI are slow in Lucene 9 
> due to their O(N) doc lookup algorithm.
> h3. Changes
> Replaced the current O(N) lookup in the sparse IndexedDISI 
> "advanceWithinBlock" and "advanceExactWithinBlock" with binary search, which 
> is possible because docs are stored in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> Ran the sparseTaxis test cases from luceneutil. Attached the reports of 
> baseline and candidate in the attachments section.
> 1. Most cases show a 5-10% search latency reduction.
> 2. Some highlights (>20%):
>  * *T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 
> 40.9] sort=null*
>  ** *Baseline:* 10973978+ hits in *726.81967 msec*
>  ** *Candidate:* 10973978+ hits in *484.544594 msec*
>  * *T0 cab_color:y cab_color:g sort=null*
>  ** *Baseline:* 2300174+ hits in *95.698324 msec*
>  ** *Candidate:* 2300174+ hits in *78.336193 msec*
>  * *T1 cab_color:y cab_color:g sort=null*
>  ** *Baseline:* 2300174+ hits in *391.565239 msec*
>  ** *Candidate:* 2300174+ hits in *227.592885 msec*
>  * *...*
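The change described in the Changes section, replacing the linear scan with a binary search over the block's sorted doc IDs, can be sketched as follows. This is an illustrative model assuming the block's docs are available as a sorted `int[]`; the real IndexedDISI reads packed values from an `IndexInput`, and the names here only mirror the actual API.

```java
// Hypothetical sketch of binary search within one sparse block. Because doc
// IDs in a block are stored in ascending order, the first doc >= target can
// be found in O(log N) instead of scanning forward one doc at a time.
final class SparseBlock {
  private final int[] docs;  // sorted, ascending doc IDs within the block
  private int index = -1;    // position of the current doc

  SparseBlock(int[] docs) {
    this.docs = docs;
  }

  /** Advances to the first doc >= target; returns Integer.MAX_VALUE if the block is exhausted. */
  int advanceWithinBlock(int target) {
    int lo = index + 1, hi = docs.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;  // unsigned shift avoids int overflow
      if (docs[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    if (lo == docs.length) {
      return Integer.MAX_VALUE;   // stand-in for NO_MORE_DOCS in this block
    }
    index = lo;
    return docs[lo];
  }
}
```

The loop is a standard lower-bound search restricted to positions after the current one, which preserves the forward-only iteration contract of a DISI.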






[GitHub] [lucene] zacharymorn commented on pull request #968: [LUCENE-10624] Binary Search for Sparse IndexedDISI advanceWithinBloc…

2022-06-23 Thread GitBox


zacharymorn commented on PR #968:
URL: https://github.com/apache/lucene/pull/968#issuecomment-1165214328

   Hmm, I see. I'm actually also wondering whether it would be possible to have 
one of them simply delegate to the other (potentially indirectly via some 
helper method) and then check the returned value (e.g. have 
`advanceExactWithinBlock` delegate somehow to `advanceWithinBlock` and then 
check whether the positioned doc is actually equal to `target`)?
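A minimal sketch of that delegation idea, with illustrative names and an in-memory `int[]` standing in for the real IndexedDISI internals:

```java
// Hypothetical sketch: advanceExactWithinBlock delegates to
// advanceWithinBlock and compares the positioned doc against the target.
final class DelegatingDisi {
  private final int[] docs;  // sorted doc IDs of the current sparse block
  private int index = -1;

  DelegatingDisi(int[] docs) {
    this.docs = docs;
  }

  int advanceWithinBlock(int target) {
    // A linear scan is kept here for brevity; the search strategy is
    // orthogonal to the delegation question.
    for (int i = index + 1; i < docs.length; i++) {
      if (docs[i] >= target) {
        index = i;
        return docs[i];
      }
    }
    index = docs.length;
    return Integer.MAX_VALUE;
  }

  boolean advanceExactWithinBlock(int target) {
    // Delegate, then check whether we landed exactly on the target. Note
    // that on a miss this leaves the iterator positioned past `target`,
    // which the real advanceExactWithinBlock may not be allowed to do, so
    // a shared helper might need to restore or track state differently.
    return advanceWithinBlock(target) == target;
  }
}
```

The positioning caveat in the comment is one reason a naive delegation may not drop in directly, which seems to be the crux of the discussion above.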





[GitHub] [lucene] zacharymorn commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-23 Thread GitBox


zacharymorn commented on PR #972:
URL: https://github.com/apache/lucene/pull/972#issuecomment-1165218655

   Thanks @jpountz for the suggestion and also for providing the bulk scorer 
implementation! The results look pretty impressive as well! 
   
   I just tried `taskRepeatCount=200` with my implementation, and although it 
did make the results more stable across runs, the speedup with the full task 
set was still nowhere near that from running just the three disjunction tasks. 
I also pulled your version and ran it through the same set of tests above, and 
the speedups were both good and stable (which rules out an issue on the util 
side, I guess). I will study your implementation further and see if mine can 
be improved accordingly.





[GitHub] [lucene] dweiss commented on pull request #975: LUCENE-10626 Hunspell: add tools to aid dictionary editing

2022-06-23 Thread GitBox


dweiss commented on PR #975:
URL: https://github.com/apache/lucene/pull/975#issuecomment-1165237503

   Hi Peter! I'll take a look later today - it's end-of-school in Poland today 
and it's a bit hectic.

