[GitHub] [lucene] romseygeek commented on a diff in pull request #860: LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty.

2022-05-03 Thread GitBox


romseygeek commented on code in PR #860:
URL: https://github.com/apache/lucene/pull/860#discussion_r863538587


##
lucene/core/src/java/org/apache/lucene/search/WANDScorer.java:
##
@@ -86,7 +86,6 @@ static int scalingFactor(float f) {
* sure we do not miss any matches.
*/
   static long scaleMaxScore(float maxScore, int scalingFactor) {
-assert scalingFactor(maxScore) >= scalingFactor;

Review Comment:
   Does this assertion no longer hold? I'm a bit nervous about removing it, as its tripping was what alerted us to the bug in the first place.
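   A hedged illustration in plain Java (not WANDScorer's actual code) of why 0 and +Infty are awkward here: their float exponents sit outside the normal [-126, 127] range, so an exponent-based scaling invariant can break at the extremes.
   ```java
   public class ExponentEdgeCases {
     public static void main(String[] args) {
       // Normal floats have exponents in [-126, 127]:
       System.out.println(Math.getExponent(1.5f)); // 0
       // Zero and infinity fall outside that range:
       System.out.println(Math.getExponent(0.0f)); // -127 (Float.MIN_EXPONENT - 1)
       System.out.println(Math.getExponent(Float.POSITIVE_INFINITY)); // 128 (Float.MAX_EXPONENT + 1)
     }
   }
   ```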






[GitHub] [lucene] mayya-sharipova commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


mayya-sharipova commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1116030632

   @LuXugang I've created a feature branch: 
https://github.com/apache/lucene/tree/vectors-disi-direct
   
   From my side, we are good to merge this PR into it, but I wonder if 
@jtibshirani has any other comments.





[jira] [Commented] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?

2022-05-03 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531180#comment-17531180
 ] 

Alan Woodward commented on LUCENE-10436:


The backport of this has inadvertently made the previously public 
DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator() method 
package-private.  I think it's still a useful method (or at least, we're using 
it in elasticsearch) so it should probably stay public.  Perhaps we should 
instead move it to the public DocValues utility class?
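A hedged usage sketch of the method in question, with the signature it had while public (the field name and helper are illustrative):
{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.DocValuesFieldExistsQuery;

class DocsWithFieldSketch {
  // Iterate exactly the documents that have doc values for the given field
  static DocIdSetIterator docsWithField(LeafReader reader, String field) throws IOException {
    return DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator(field, reader);
  }
}
{code}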

> Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery into a single FieldExistsQuery?
> --
>
> Key: LUCENE-10436
> URL: https://issues.apache.org/jira/browse/LUCENE-10436
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Zach Chen
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Now that we require consistency across data structures, we could merge 
> DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require 
> that the field indexes either norms, doc values or vectors?






[jira] [Created] (LUCENE-10555) avoid repeated NumericLeafComparator setScorer called

2022-05-03 Thread jianping weng (Jira)
jianping weng created LUCENE-10555:
--

 Summary: avoid repeated NumericLeafComparator setScorer called 
 Key: LUCENE-10555
 URL: https://issues.apache.org/jira/browse/LUCENE-10555
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: jianping weng


Elasticsearch uses CancellableBulkScorer to quickly cancel long-running query execution by dividing a segment's docs into many small splits. For each split, collector.setScorer(scorer) is called, which in turn calls NumericLeafComparator#setScorer. As a result, NumericLeafComparator#setScorer is called many times for a single segment.

Every time NumericLeafComparator#setScorer is called, NumericLeafComparator#iteratorCost is reset to Scorer.cost, which triggers many unnecessary PointValues#intersect calls to compute the competitive docs.






[jira] [Updated] (LUCENE-10555) avoid repeated NumericLeafComparator setScorer calls

2022-05-03 Thread jianping weng (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jianping weng updated LUCENE-10555:
---
Summary: avoid repeated NumericLeafComparator setScorer calls  (was: avoid 
repeated NumericLeafComparator setScorer called )

> avoid repeated NumericLeafComparator setScorer calls
> 
>
> Key: LUCENE-10555
> URL: https://issues.apache.org/jira/browse/LUCENE-10555
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: jianping weng
>Priority: Major
>
> Elasticsearch uses CancellableBulkScorer to quickly cancel long-running query 
> execution by dividing a segment's docs into many small splits. For each split, 
> collector.setScorer(scorer) is called, which in turn calls 
> NumericLeafComparator#setScorer. As a result, NumericLeafComparator#setScorer 
> is called many times for a single segment.
> Every time NumericLeafComparator#setScorer is called, 
> NumericLeafComparator#iteratorCost is reset to Scorer.cost, which triggers 
> many unnecessary PointValues#intersect calls to compute the competitive docs.






[GitHub] [lucene] wjp719 opened a new pull request, #864: LUCENE-10555: avoid repeated NumericLeafComparator setScorer calls

2022-05-03 Thread GitBox


wjp719 opened a new pull request, #864:
URL: https://github.com/apache/lucene/pull/864

   Elasticsearch uses `CancellableBulkScorer` to quickly cancel long-running query execution by splitting a segment's docs into many smaller doc sets. For each doc set, `collector.setScorer(scorer)` is called, which in turn calls `NumericLeafComparator#setScorer`. As a result, `NumericLeafComparator#setScorer` is called many times for a single segment.
   
   Every time `NumericLeafComparator#setScorer` is called, `NumericLeafComparator#iteratorCost` is reset to `Scorer.cost`, which triggers many unnecessary `PointValues#intersect` calls to compute the competitive docIds. This results in performance degradation.
   
   This PR ensures `NumericLeafComparator#setScorer` does its initialization work only once per segment.
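   A hypothetical sketch of the guard (illustrative names, not the actual patch):
   ```java
   import java.io.IOException;
   import org.apache.lucene.search.Scorable;
   import org.apache.lucene.search.Scorer;
   
   class GuardedLeafComparatorSketch {
     private long iteratorCost = -1; // -1 means "not yet initialized for this leaf"
   
     void setScorer(Scorable scorer) throws IOException {
       if (iteratorCost >= 0) {
         return; // already initialized; skip the expensive re-initialization
       }
       if (scorer instanceof Scorer) {
         iteratorCost = ((Scorer) scorer).iterator().cost();
       }
       updateCompetitiveIterator(); // in Lucene this can trigger PointValues#intersect
     }
   
     private void updateCompetitiveIterator() {
       // stand-in for the competitive-iterator refresh described above
     }
   }
   ```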
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [ ] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[GitHub] [lucene] mocobeta commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-03 Thread GitBox


mocobeta commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1116116020

   Hi, I am very new to this issue and interested in adding a benchmark for ExitableDirectoryReader to luceneutil. It wouldn't quite be a "benchmark", since it won't measure search performance; rather, it would check that the search returns within the expected timeout with incomplete results. This should work as an assertion that the reader behaves as expected with real data/queries and some realistic timeouts. It may be helpful to catch any degradations?
   It'll take some time for me, so I don't think we should hold this PR for it.





[GitHub] [lucene] mocobeta commented on a diff in pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-03 Thread GitBox


mocobeta commented on code in PR #833:
URL: https://github.com/apache/lucene/pull/833#discussion_r863835107


##
lucene/core/src/test/org/apache/lucene/index/TestExitableDirectoryReader.java:
##
@@ -428,6 +430,107 @@ public void testDocValues() throws IOException {
 directory.close();
   }
 
+  public void testVectorValues() throws IOException {
+Directory directory = newDirectory();
+IndexWriter writer =
+new IndexWriter(directory, newIndexWriterConfig(new MockAnalyzer(random())));
+
+int numDoc = atLeast(20);
+int deletedDoc = atMost(5);
+int dimension = atLeast(3);
+boolean vectorIndexed = false;
+
+for (int i = 0; i < numDoc; i++) {
+  Document doc = new Document();
+
+  if (random().nextBoolean()) {
+vectorIndexed = true;
+float[] value = new float[dimension];
+for (int j = 0; j < dimension; j++) {
+  value[j] = random().nextFloat();
+}
+FieldType fieldType =
+KnnVectorField.createFieldType(dimension, 
VectorSimilarityFunction.COSINE);
+doc.add(new KnnVectorField("vector", value, fieldType));
+  }
+
+  doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
+  writer.addDocument(doc);
+}
+
+writer.forceMerge(1);
+writer.commit();
+
+for (int i = 0; i < deletedDoc; i++) {
+  writer.deleteDocuments(new Term("id", Integer.toString(i)));
+}
+
+writer.close();
+
+QueryTimeout randomQueryTimeout;

Review Comment:
   nit - at first impression, I thought this would represent some random time spans. Actually it is either immediate, infinite, or disabled. I wonder if we could simply name it `queryTimeout`?






[GitHub] [lucene] rmuir commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-03 Thread GitBox


rmuir commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r863840319


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   > Thanks for reviewing this @rmuir, @uschindler. I understand your concern. 
Failures on concurrent code are already tough to reproduce; it gets much harder 
to do this from a randomized test seed.
   > 
   > It is worth noting that the `findMerges(CodecReader[])` API modified in 
`MockRandomMergePolicy`, is only invoked within the 
`addIndexes(CodecReaders[])` API, which should contain its scope of impact. But 
I definitely don't want to add noise with random failures on distantly related 
tests. I'll remove this new policy from `MockRandomMergePolicy`, which is 
frequently used for `IndexWriterConfig` in multiple tests.
   > 
   
   This is true, but also part of the risk: `addIndexes(CodecReaders[])` is 
already pretty scary, correctness-wise. This is typically used to expertly 
"transform" an index, otherwise you wouldn't use this API. It is probably 
undertested today, I just don't want to mix concurrency randomly into the few 
tests it has.
   
   > For my awareness, what is the best practice in Lucene for testing non-default, concurrency related code paths? Randomized tests help validate infrequently triggered code, but concurrency poses reproducibility challenges. For such cases, do we try to rely entirely on change-specific concurrency unit tests?
   
   Well, we need to think about it. I actually hate that 
ConcurrentMergeScheduler gets used the way it does in the tests, for this same 
reason. And that's the actual problem here... I think, not your logic in the 
mergepolicy? So really my concerns are around mergescheduler in tests 
preventing reproducibility. CMS is the default if someone tests with just `new 
IndexWriterConfig` and it is almost always the default if they use 
`LuceneTestCase.newIndexWriterConfig`.
   
   Maybe we can improve our tests to use `SerialMergeScheduler` and 
`ConcurrentMergeScheduler(n=1)` more often, so that reproducibility is improved 
across the board. We can have separate specific tests using 
`ConcurrentMergeScheduler(n>1)`? 
   
   It really isn't anything new from your change here; I think I was just a bit frustrated the last time I tried to reproduce a seed recently (https://issues.apache.org/jira/browse/LUCENE-10529), so I want us to be careful about blanket application of concurrency to a ton of existing tests.
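   A hedged sketch of those reproducibility-friendly setups (assuming an existing `analyzer`; not a prescription):
   ```java
   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.index.ConcurrentMergeScheduler;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.SerialMergeScheduler;
   
   class ReproducibleMergeConfigs {
     static IndexWriterConfig serial(Analyzer analyzer) {
       // merges run inline on the indexing thread, fully deterministic
       return new IndexWriterConfig(analyzer).setMergeScheduler(new SerialMergeScheduler());
     }
   
     static IndexWriterConfig singleThreadCms(Analyzer analyzer) {
       ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
       cms.setMaxMergesAndThreads(1, 1); // CMS, but with a single merge thread
       return new IndexWriterConfig(analyzer).setMergeScheduler(cms);
     }
   }
   ```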






[GitHub] [lucene] mocobeta commented on a diff in pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-03 Thread GitBox


mocobeta commented on code in PR #833:
URL: https://github.com/apache/lucene/pull/833#discussion_r863840972


##
lucene/core/src/test/org/apache/lucene/index/TestExitableDirectoryReader.java:
##
@@ -428,6 +430,107 @@ public void testDocValues() throws IOException {
 directory.close();
   }
 
+  public void testVectorValues() throws IOException {
+Directory directory = newDirectory();
+IndexWriter writer =
+new IndexWriter(directory, newIndexWriterConfig(new MockAnalyzer(random())));
+
+int numDoc = atLeast(20);
+int deletedDoc = atMost(5);
+int dimension = atLeast(3);
+boolean vectorIndexed = false;
+
+for (int i = 0; i < numDoc; i++) {
+  Document doc = new Document();
+
+  if (random().nextBoolean()) {
+vectorIndexed = true;
+float[] value = new float[dimension];
+for (int j = 0; j < dimension; j++) {
+  value[j] = random().nextFloat();
+}
+FieldType fieldType =
+KnnVectorField.createFieldType(dimension, 
VectorSimilarityFunction.COSINE);
+doc.add(new KnnVectorField("vector", value, fieldType));
+  }
+
+  doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
+  writer.addDocument(doc);
+}
+
+writer.forceMerge(1);
+writer.commit();
+
+for (int i = 0; i < deletedDoc; i++) {
+  writer.deleteDocuments(new Term("id", Integer.toString(i)));
+}
+
+writer.close();
+
+QueryTimeout randomQueryTimeout;

Review Comment:
   Or, I guess we can test all three timeout patterns in this test without 
randomness, like `testExitableFilterTermsIndexReader()`? Sorry if I'm missing 
any points.






[GitHub] [lucene] mocobeta commented on a diff in pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-03 Thread GitBox


mocobeta commented on code in PR #833:
URL: https://github.com/apache/lucene/pull/833#discussion_r863855038


##
lucene/core/src/test/org/apache/lucene/index/TestExitableDirectoryReader.java:
##
@@ -428,6 +430,107 @@ public void testDocValues() throws IOException {
 directory.close();
   }
 
+  public void testVectorValues() throws IOException {
+Directory directory = newDirectory();
+IndexWriter writer =
+new IndexWriter(directory, newIndexWriterConfig(new MockAnalyzer(random())));
+
+int numDoc = atLeast(20);
+int deletedDoc = atMost(5);
+int dimension = atLeast(3);
+boolean vectorIndexed = false;
+
+for (int i = 0; i < numDoc; i++) {
+  Document doc = new Document();
+
+  if (random().nextBoolean()) {

Review Comment:
   I'm just curious - what's the point of this randomness? Wouldn't it be okay to add the vector values to all test docs?






[GitHub] [lucene] LuXugang opened a new pull request, #865: Lucene-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


LuXugang opened a new pull request, #865:
URL: https://github.com/apache/lucene/pull/865

   follow-up of https://github.com/apache/lucene/pull/792 
   





[GitHub] [lucene] LuXugang commented on pull request #865: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


LuXugang commented on PR #865:
URL: https://github.com/apache/lucene/pull/865#issuecomment-1116261200

   Hi @mayya-sharipova, the full PR is now presented.





[GitHub] [lucene] jpountz commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-03 Thread GitBox


jpountz commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1116331660

   Actually I was thinking of search performance as something that we would 
like to measure, since these wrappers can induce some performance overhead. 
E.g. the baseline could run on the raw reader and the contender would wrap the 
reader with ExitableDirectoryReader and a very large timeout that's almost 
certainly not going to be hit, so that we could measure how much search 
performance we're trading by introducing these wrappers and timeout checks. 
Maybe luceneutil could have an option to do this.
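   A hedged sketch of that contender setup (the one-hour timeout is just "large enough to never trip"):
   ```java
   import java.io.IOException;
   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.index.ExitableDirectoryReader;
   import org.apache.lucene.index.QueryTimeoutImpl;
   import org.apache.lucene.store.Directory;
   
   class WrapperOverheadSketch {
     // Baseline: search the raw reader. Contender: the same reader wrapped with a
     // timeout that should never trip, so any slowdown is the wrapper's overhead.
     static DirectoryReader contender(Directory dir) throws IOException {
       DirectoryReader raw = DirectoryReader.open(dir);
       return new ExitableDirectoryReader(raw, new QueryTimeoutImpl(60 * 60 * 1000L));
     }
   }
   ```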





[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-05-03 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10527:
--
Description: 
Recently I was rereading the HNSW paper ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using a different maxConn for the upper layers vs. the bottom one (which contains the full neighborhood graph). Specifically, they suggest using maxConn=M for upper layers and maxConn=2*M for the bottom. This differs from what we do, which is to use maxConn=M for all layers.
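A hedged sketch of the per-layer choice the paper suggests (illustrative helper, not Lucene's API):
{code:java}
static int maxConnOnLevel(int level, int M) {
  // the bottom layer (level 0) holds the full neighborhood graph, so it gets 2*M
  return level == 0 ? 2 * M : M;
}
{code}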

I tried updating our logic using a hacky patch, and noticed an improvement in 
latency for higher recall values (which is consistent with the paper's 
observation):

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
!image-2022-04-20-14-53-58-484.png|width=400,height=367!

As we'd expect, indexing becomes a bit slower:
{code:java}
Baseline: Indexed 1183514 documents in 733s 
Candidate: Indexed 1183514 documents in 948s{code}
When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
big difference in recall for the same settings of M and efConstruction. (Even 
adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
change, the recall is now very similar:

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
{code:java}
k     Approach                                             Recall        QPS
10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.563   4410.499
50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.798   1956.280
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.862   1209.734
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.958    341.428
800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.974    230.396
1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.980    188.757

10    hnswlib ({'M': 32, 'efConstruction': 100})            0.552  16745.433
50    hnswlib ({'M': 32, 'efConstruction': 100})            0.794   5738.468
100   hnswlib ({'M': 32, 'efConstruction': 100})            0.860   3336.386
500   hnswlib ({'M': 32, 'efConstruction': 100})            0.956    832.982
800   hnswlib ({'M': 32, 'efConstruction': 100})            0.973    541.097
1000  hnswlib ({'M': 32, 'efConstruction': 100})            0.979    442.163
{code}
I think it'd be a nice update to maxConn so that we faithfully implement the paper's algorithm. This is probably least surprising for users, and I don't see a strong reason to take a different approach from the paper? Let me know what you think!

  was:
Recently I was rereading the HNSW paper ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using a different maxConn for the upper layers vs. the bottom one (which contains the full neighborhood graph). Specifically, they suggest using maxConn=M for upper layers and maxConn=2*M for the bottom. This differs from what we do, which is to use maxConn=M for all layers.

I tried updating our logic using a hacky patch, and noticed an improvement in 
latency for higher recall values (which is consistent with the paper's 
observation):

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
!image-2022-04-20-14-53-58-484.png|width=400,height=367!

As we'd expect, indexing becomes a bit slower:
{code:java}
Baseline: Indexed 1183514 documents in 733s 
Candidate: Indexed 1183514 documents in 948s{code}
When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
big difference in recall for the same settings of M and efConstruction. (Even 
adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
change, the recall is now very similar:

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
{code:java}
num_cands  Approach                                        Recall        QPS
10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.563   4410.499
50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.798   1956.280
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.862   1209.734
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.958    341.428
800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.974    230.396
1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.980    188.757

10    hnswlib ({'M': 32, 'efConstruction': 100})            0.552  16745.433
50    hnswlib ({'M': 32, 'efConstruction': 100})            0.794   5738.468
100   hnswlib ({'M': 32, 'efConstruction': 100})            0.860   3336.386
500   hnswlib ({'M': 32, '

[GitHub] [lucene] jtibshirani commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


jtibshirani commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1116736228

   Thanks for running those benchmarks! I'm working on a quick review before you merge it.
   
   One thing I was surprised about with the benchmarks is that you find searches take 60ms or more. In https://issues.apache.org/jira/browse/LUCENE-10527 we see that on a similarly sized dataset, the latency is always below 10ms. (It's reported in QPS or queries-per-second, but it's sequential, so latency is just `1000 / QPS`.) Maybe you could give some information about your machine and benchmark set-up (was there a warmup?).
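   (For example, the 1209.734 QPS reported there for Lucene works out to roughly 1000 / 1209.734 ≈ 0.83 ms per query.)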





[GitHub] [lucene] mocobeta commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-03 Thread GitBox


mocobeta commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1116850417

   > the baseline could run on the raw reader and the contender would wrap the 
reader with ExitableDirectoryReader and a very large timeout that's almost 
certainly not going to be hit, so that we could measure how much search 
performance we're trading by introducing these wrappers and timeout checks
   
   This sounds good to me (with a benchmark, we could tune 
`DOCS_BETWEEN_TIMEOUT_CHECK`?). I'll take a look - I think it's feasible but 
could take some time for me :)





[GitHub] [lucene] mayya-sharipova commented on pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

2022-05-03 Thread GitBox


mayya-sharipova commented on PR #862:
URL: https://github.com/apache/lucene/pull/862#issuecomment-1116854352

   @msokolov Thanks so much for your feedback. I've addressed it in 
160904904c94ffd4d194fc2509124c0e2eb9c44a.
   
   I've also done another round of benchmarking with the new changes, this time with higher `M` and `efConstruction` values:
   
   glove-100-angular M:32 efConstruction:500
   
   Happy to report that indexing took just slightly longer (3%), while searches showed slight gains in both recall and QPS:
   
   **Indexing Speed**
   baseline: Indexed 1183514 documents in 3697s
   candidate: Indexed 1183514 documents in 3814s
   
   
   **Search**
   
   | | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | --- | ---: | ---: | ---: | ---: |
   | n_cands=10  | 0.634 | 2628.879 | 0.659 | 2659.589 |
   | n_cands=20  | 0.696 | 2034.672 | 0.719 | 2186.015 |
   | n_cands=40  | 0.762 | 1611.403 | 0.784 | 1666.735 |
   | n_cands=80  | 0.824 | 1150.786 | 0.843 | 1161.111 |
   | n_cands=120 | 0.855 |  878.114 | 0.874 |  890.783 |
   | n_cands=200 | 0.892 |  610.891 | 0.907 |  621.173 |
   | n_cands=400 | 0.931 |  343.649 | 0.944 |  359.924 |
   | n_cands=600 | 0.949 |  239.217 | 0.960 |  256.863 |
   | n_cands=800 | 0.961 |  189.523 | 0.970 |  199.867 |





[GitHub] [lucene] jtibshirani commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


jtibshirani commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864338089


##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -320,13 +323,19 @@ private static class FieldEntry {
 final int numLevels;
 final int dimension;
 private final int size;
-final int[] ordToDoc;
-private final IntUnaryOperator ordToDocOperator;
 final int[][] nodesByLevel;
 // for each level the start offsets in vectorIndex file from where to read 
neighbours
 final long[] graphOffsetsByLevel;
-
-FieldEntry(DataInput input, VectorSimilarityFunction similarityFunction) 
throws IOException {
+final long docsWithFieldOffset;

Review Comment:
   It'd be great to add short comments here explaining what these different 
values are used for.



##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -507,8 +515,90 @@ public BytesRef binaryValue(int targetOrd) throws 
IOException {
 }
 
 private void readValue(int targetOrd) throws IOException {
-  dataIn.seek((long) targetOrd * byteSize);
-  dataIn.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+  slice.seek((long) targetOrd * byteSize);
+  slice.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+}
+
+public IndexedDISI initDISI(IndexInput vectorData) throws IOException {
+  // dense
+  if (fieldEntry == null || fieldEntry.docsWithFieldOffset == -1) {
+return null;
+  }
+  assert fieldEntry.docsWithFieldOffset != -2;
+  // sparse
+  return new IndexedDISI(
+  vectorData,
+  fieldEntry.docsWithFieldOffset,
+  fieldEntry.docsWithFieldLength,
+  fieldEntry.jumpTableEntryCount,
+  fieldEntry.denseRankPower,
+  fieldEntry.size);
+}
+
+public static final OffHeapVectorValues emptyOffHeapVectorValues(int 
dimension)

Review Comment:
   I think we should avoid creating this new empty `OffHeapVectorValues` class 
to keep things simple. Instead we can just assert in `getOffHeapVectorValues` 
that the field is not empty and make sure to always return early if the field 
is empty.



##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -400,27 +400,42 @@ static class OffHeapVectorValues extends VectorValues
 
 private final int dimension;
 private final int size;
-private final int[] ordToDoc;
-private final IntUnaryOperator ordToDocOperator;
+// dataIn was used to init a new IndexedDIS for #randomAccess()
 private final IndexInput dataIn;
+private final IndexInput slice;
 private final BytesRef binaryValue;
 private final ByteBuffer byteBuffer;
 private final int byteSize;
 private final float[] value;
+private final IndexedDISI disi;
+private final FieldEntry fieldEntry;
+final DirectMonotonicReader ordToDoc;
 
 private int ord = -1;
 private int doc = -1;
 
-OffHeapVectorValues(int dimension, int size, int[] ordToDoc, IndexInput 
dataIn) {
+OffHeapVectorValues(
+int dimension, int size, FieldEntry fieldEntry, IndexInput dataIn, 
IndexInput slice)
+throws IOException {
   this.dimension = dimension;
   this.size = size;
-  this.ordToDoc = ordToDoc;
-  ordToDocOperator = ordToDoc == null ? IntUnaryOperator.identity() : 
(ord) -> ordToDoc[ord];
+  this.fieldEntry = fieldEntry;
   this.dataIn = dataIn;
+  this.slice = slice;
+  this.disi = initDISI(dataIn);

Review Comment:
   I think it'd be clearer if we separated out the sparse and dense cases into two different subclasses. Currently it feels pretty fragile/confusing; for example, in `Lucene91HnswVectorsWriter` line 148 we create an `OffHeapVectorValues` that has null for `dataIn` but not for `slice`. Instead we could just have this be its own class like `OffHeapVectorValues.Dense` that does not require the `dataIn` at all.
   
   I also think we can simplify the dense logic. We don't need different variables `ord` and `doc` because these are always equal in the dense case.
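   A minimal, self-contained sketch of that split (hypothetical names, not the Lucene API): dense values map ord directly to doc, while sparse values go through a docs-with-field mapping.
   ```java
   abstract class OffHeapValuesSketch {
     abstract int ordToDoc(int ord);
   
     static final class Dense extends OffHeapValuesSketch {
       @Override
       int ordToDoc(int ord) {
         return ord; // every doc has a value, so ord and doc are identical
       }
     }
   
     static final class Sparse extends OffHeapValuesSketch {
       private final int[] docsWithField; // stands in for the IndexedDISI lookup
   
       Sparse(int[] docsWithField) {
         this.docsWithField = docsWithField;
       }
   
       @Override
       int ordToDoc(int ord) {
         return docsWithField[ord];
       }
     }
   }
   ```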
   
   



##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -335,23 +344,23 @@ private static class FieldEntry {
   dimension = input.readInt();
   size = input.readInt();
 
-  int denseSparseMarker = input.readByte();
-  if (denseSparseMarker == -1) {
-ordToDoc = null; // each document has a vector value
+  docsWithFieldOffset = input.readLong();
+  docsWithFieldLength = input.readLong();
+  jumpTableEntryCount = input.readShort();
+  denseRankPower = input.readByte();
+
+  // sparse
+  if (docsWithFieldOffset != -1 && docsWithFieldOffset != -2) {

Review Comment:
   I think it'd be clearer to just say `docsWithFieldOffset == 0` here.

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

2022-05-03 Thread GitBox


mayya-sharipova commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r864407239


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -21,32 +21,64 @@
 
 /**
  * NeighborArray encodes the neighbors of a node and their mutual scores in 
the HNSW graph as a pair
- * of growable arrays.
+ * of growable arrays. Nodes are arranged in the sorted order of their scores 
in descending order
+ * (if scoresDescOrder is true), or in the ascending order of their scores (if 
scoresDescOrder is
+ * false)
  *
  * @lucene.internal
  */
 public class NeighborArray {
-
+  private final boolean scoresDescOrder;
   private int size;
 
   float[] score;
   int[] node;
 
-  public NeighborArray(int maxSize) {
+  public NeighborArray(int maxSize, boolean descOrder) {
 node = new int[maxSize];
 score = new float[maxSize];
+this.scoresDescOrder = descOrder;
   }
 
+  /**
+   * Add a new node to the NeighborArray. The new node must be worse than all 
previously stored
+   * nodes.
+   */
   public void add(int newNode, float newScore) {
 if (size == node.length - 1) {
   node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
   score = ArrayUtil.growExact(score, node.length);
 }
+if (size > 0) {
+  float previousScore = score[size - 1];
+  assert ((scoresDescOrder && (previousScore >= newScore))
+  || (scoresDescOrder == false && (previousScore <= newScore)))
+  : "Nodes are added in the incorrect order!";
+}
 node[size] = newNode;
 score[size] = newScore;
 ++size;
   }
 
+  /** Add a new node to the NeighborArray into a correct sort position 
according to its score. */
+  public void addAndSort(int newNode, float newScore) {
+if (size == node.length - 1) {
+  node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
+  score = ArrayUtil.growExact(score, node.length);
+}
+int insertionPoint =
+scoresDescOrder
+? descSortFindRightMostInsertionPoint(newScore)
+: ascSortFindRightMostInsertionPoint(newScore);
+for (int i = size; i > insertionPoint; i--) {

Review Comment:
   TIL that we can use System.arraycopy for the reverse case as well. Addressed in: 160904904c94ffd4d194fc2509124c0e2eb9c44a
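   For reference, a hedged sketch of the arraycopy variant (shifting the tail right by one to open a slot at `insertionPoint`; `System.arraycopy` handles the overlapping ranges correctly):
   ```java
   System.arraycopy(node, insertionPoint, node, insertionPoint + 1, size - insertionPoint);
   System.arraycopy(score, insertionPoint, score, insertionPoint + 1, size - insertionPoint);
   node[insertionPoint] = newNode;
   score[insertionPoint] = newScore;
   ++size;
   ```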



##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -21,32 +21,64 @@
 
 /**
  * NeighborArray encodes the neighbors of a node and their mutual scores in 
the HNSW graph as a pair
- * of growable arrays.
+ * of growable arrays. Nodes are arranged in the sorted order of their scores 
in descending order
+ * (if scoresDescOrder is true), or in the ascending order of their scores (if 
scoresDescOrder is
+ * false)
  *
  * @lucene.internal
  */
 public class NeighborArray {
-
+  private final boolean scoresDescOrder;
   private int size;
 
   float[] score;
   int[] node;
 
-  public NeighborArray(int maxSize) {
+  public NeighborArray(int maxSize, boolean descOrder) {
 node = new int[maxSize];
 score = new float[maxSize];
+this.scoresDescOrder = descOrder;
   }
 
+  /**
+   * Add a new node to the NeighborArray. The new node must be worse than all 
previously stored
+   * nodes.
+   */
   public void add(int newNode, float newScore) {
 if (size == node.length - 1) {
   node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
   score = ArrayUtil.growExact(score, node.length);
 }
+if (size > 0) {
+  float previousScore = score[size - 1];
+  assert ((scoresDescOrder && (previousScore >= newScore))
+  || (scoresDescOrder == false && (previousScore <= newScore)))
+  : "Nodes are added in the incorrect order!";
+}
 node[size] = newNode;
 score[size] = newScore;
 ++size;
   }
 
+  /** Add a new node to the NeighborArray into a correct sort position 
according to its score. */
+  public void addAndSort(int newNode, float newScore) {

Review Comment:
   Addressed in: 160904904c94ffd4d194fc2509124c0e2eb9c44a






[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

2022-05-03 Thread GitBox


mayya-sharipova commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r864407369


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -72,8 +104,38 @@ public void removeLast() {
 size--;
   }
 
+  public void removeIndex(int idx) {
+for (int i = idx; i < (size - 1); i++) {
+  node[i] = node[i + 1];
+  score[i] = score[i + 1];
+}
+size--;
+  }
+
   @Override
   public String toString() {
 return "NeighborArray[" + size + "]";
   }
+
+  private int ascSortFindRightMostInsertionPoint(float newScore) {
+int start = 0;

Review Comment:
   Great comment, addressed in: 160904904c94ffd4d194fc2509124c0e2eb9c44a
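   For reference, a hedged sketch of a right-most insertion point search over ascending scores (names follow the diff above; the committed version may differ):
   ```java
   private int ascSortFindRightMostInsertionPoint(float newScore) {
     int start = 0;
     int end = size - 1;
     while (start <= end) {
       int mid = (start + end) >>> 1; // unsigned shift avoids overflow
       if (score[mid] > newScore) {
         end = mid - 1; // insertion point is to the left
       } else {
         start = mid + 1; // on ties keep moving right for the right-most slot
       }
     }
     return start;
   }
   ```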



##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -72,8 +104,38 @@ public void removeLast() {
 size--;
   }
 
+  public void removeIndex(int idx) {
+for (int i = idx; i < (size - 1); i++) {
+  node[i] = node[i + 1];

Review Comment:
   Great comment, addressed in: 160904904c94ffd4d194fc2509124c0e2eb9c44a
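   The same trick applies here: a single left-shifting `System.arraycopy` per array can replace the loop (a hedged equivalent of the `removeIndex` shown above):
   ```java
   public void removeIndex(int idx) {
     System.arraycopy(node, idx + 1, node, idx, size - idx - 1);
     System.arraycopy(score, idx + 1, score, idx, size - idx - 1);
     size--;
   }
   ```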






[GitHub] [lucene] LuXugang commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


LuXugang commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864437136


##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -335,23 +344,23 @@ private static class FieldEntry {
   dimension = input.readInt();
   size = input.readInt();
 
-  int denseSparseMarker = input.readByte();
-  if (denseSparseMarker == -1) {
-ordToDoc = null; // each document has a vector value
+  docsWithFieldOffset = input.readLong();
+  docsWithFieldLength = input.readLong();
+  jumpTableEntryCount = input.readShort();
+  denseRankPower = input.readByte();
+
+  // sparse
+  if (docsWithFieldOffset != -1 && docsWithFieldOffset != -2) {

Review Comment:
   In the sparse case, `docsWithFieldOffset` is always greater than 0, so adding an assertion like `docsWithFieldOffset > 0` seems like it could make the code clearer.






[GitHub] [lucene] LuXugang commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


LuXugang commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864438991


##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -335,23 +344,23 @@ private static class FieldEntry {
   dimension = input.readInt();
   size = input.readInt();
 
-  int denseSparseMarker = input.readByte();
-  if (denseSparseMarker == -1) {
-ordToDoc = null; // each document has a vector value
+  docsWithFieldOffset = input.readLong();
+  docsWithFieldLength = input.readLong();
+  jumpTableEntryCount = input.readShort();
+  denseRankPower = input.readByte();
+
+  // sparse
+  if (docsWithFieldOffset != -1 && docsWithFieldOffset != -2) {

Review Comment:
   Thanks @jtibshirani. In the sparse case, `docsWithFieldOffset` is always greater than 0, so the code could be written more clearly like:
   
   ```
 // dense or empty
 if (docsWithFieldOffset == -1 || docsWithFieldOffset == -2) {
   ...
 } else { // sparse
   ...
 }
   ```

   









[GitHub] [lucene] mocobeta opened a new pull request, #866: Make CONTRIBUTING.md a bit more succinct

2022-05-03 Thread GitBox


mocobeta opened a new pull request, #866:
URL: https://github.com/apache/lucene/pull/866

   I think it'd be good to try to keep the contributing guide as succinct as possible so that it is helpful for experienced developers (the main target of the document, I think).
   
   - remove specific Gradle commands - instead, we can link to the corresponding help documents.
   - remove the instruction for making a patch - I assume contributors who prefer patches to pull requests are already familiar with it.
   - remove the explanation about Yetus - I don't think it is still working on the Jira Lucene project.





[GitHub] [lucene] mocobeta commented on pull request #866: Make CONTRIBUTING.md a bit more succinct

2022-05-03 Thread GitBox


mocobeta commented on PR #866:
URL: https://github.com/apache/lucene/pull/866#issuecomment-1116946180

   It's a trivial change in documentation. I'll wait until tomorrow then merge 
it to main.





[GitHub] [lucene] LuXugang commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


LuXugang commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864475729


##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -507,8 +515,90 @@ public BytesRef binaryValue(int targetOrd) throws 
IOException {
 }
 
 private void readValue(int targetOrd) throws IOException {
-  dataIn.seek((long) targetOrd * byteSize);
-  dataIn.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+  slice.seek((long) targetOrd * byteSize);
+  slice.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+}
+
+public IndexedDISI initDISI(IndexInput vectorData) throws IOException {
+  // dense
+  if (fieldEntry == null || fieldEntry.docsWithFieldOffset == -1) {
+return null;
+  }
+  assert fieldEntry.docsWithFieldOffset != -2;
+  // sparse
+  return new IndexedDISI(
+  vectorData,
+  fieldEntry.docsWithFieldOffset,
+  fieldEntry.docsWithFieldLength,
+  fieldEntry.jumpTableEntryCount,
+  fieldEntry.denseRankPower,
+  fieldEntry.size);
+}
+
+public static final OffHeapVectorValues emptyOffHeapVectorValues(int 
dimension)

Review Comment:
   > I think it'd be clearer if we separated out the sparse and dense cases 
into two different subclasses
   
   
   Totally agreed! It makes the code much more readable. But I still want to keep `emptyOffHeapVectorValues`, which has been renamed to `EmptyOffHeapVectorValues` in the latest PR; I would like to keep the same pattern as `Lucene90NormsProducer#getNorms` or `Lucene90DocValuesProducer#getXXX`.
   
   Addressed in https://github.com/apache/lucene/pull/792/commits/49f67a57e8e294fbed2a63d4887169103f05073b.
   
   
   
   
   






[GitHub] [lucene] LuXugang commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


LuXugang commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r864476448


##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -335,23 +344,23 @@ private static class FieldEntry {
   dimension = input.readInt();
   size = input.readInt();
 
-  int denseSparseMarker = input.readByte();
-  if (denseSparseMarker == -1) {
-ordToDoc = null; // each document has a vector value
+  docsWithFieldOffset = input.readLong();
+  docsWithFieldLength = input.readLong();
+  jumpTableEntryCount = input.readShort();
+  denseRankPower = input.readByte();
+
+  // sparse
+  if (docsWithFieldOffset != -1 && docsWithFieldOffset != -2) {

Review Comment:
   Addressed in https://github.com/apache/lucene/pull/792/commits/e2931d90126375c488fa0060ff85619452081af7.






[GitHub] [lucene] LuXugang commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-05-03 Thread GitBox


LuXugang commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1116957549

   > Maybe you could give some information about your machine and benchmark 
set-up (was there a warmup?)
   
   Here is my benchmark test demo: https://github.com/LuXugang/Lucene-7.5.0/commit/b69ae6c70665878f95115a6a49715c84c760b4c6. The dataset in the demo is `glove.6B` from https://github.com/mikemccand/luceneutil.
   
   

