[jira] [Created] (LUCENE-10396) Automatically create sparse indexes for sort fields
Adrien Grand created LUCENE-10396: - Summary: Automatically create sparse indexes for sort fields Key: LUCENE-10396 URL: https://issues.apache.org/jira/browse/LUCENE-10396 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Attachments: sorted_conjunction.png

On Elasticsearch we're more and more leveraging index sorting not as a way to early-terminate sorted queries, but as a way to cluster doc IDs that share similar properties so that queries can take advantage of it. For instance, imagine you're maintaining a catalog of cars for sale, sorted by car type, then fuel type, then price. All cars with the same type, fuel type and similar prices will then be stored in a contiguous range of doc IDs.

Without index sorting, conjunctions across these 3 fields would be almost a worst-case scenario, as every clause might match lots of documents while their intersection might not. With index sorting enabled, however, only a very small number of calls to advance() lead to doc IDs that do not match, because those advance() calls always jump over a large number of doc IDs. I created an example for ApacheCon last year that demonstrates the benefits of index sorting on conjunctions. In both cases the index stores the same data; it just gets a different doc ID ordering thanks to index sorting:

!sorted_conjunction.png!

While index sorting can help improve query efficiency out of the box, there is a lot more we can do by taking advantage of the index sort explicitly. For instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range queries on fields that are primary sort fields by performing a binary search to identify the first and last documents that match the range query.
I would like to introduce [sparse indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields that are used for index sorting, with the goal of improving the runtime of {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O-intensive, and of making it easier and more efficient to leverage index sorting to filter on subsequent sort fields. A simple form of a sparse index could consist of storing every N-th value of the fields that are used for index sorting.

In terms of implementation, sparse indexing should be cheap enough that we wouldn't need to make it configurable and could enable it automatically as soon as index sorting is enabled. It would also get its own file format abstraction.

-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
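The "store every N-th value" idea can be sketched in plain Java, with arrays standing in for doc values and all names hypothetical: sample the index-sorted field at a fixed interval, then binary-search the samples so a lookup only has to scan a window of at most N doc IDs.

```java
import java.util.Arrays;

// Hypothetical sketch of a sparse index over an index-sorted field:
// only every N-th value is stored, so a lookup is narrowed to a small
// window of doc IDs that is then scanned exactly.
public class SparseIndexSketch {
  private final long[] samples; // samples[i] == value of doc (i * interval)
  private final int interval;

  public SparseIndexSketch(long[] sortedValues, int interval) {
    this.interval = interval;
    int n = (sortedValues.length + interval - 1) / interval;
    samples = new long[n];
    for (int i = 0; i < n; i++) {
      samples[i] = sortedValues[i * interval];
    }
  }

  // First doc ID that may hold a value >= target; off by at most `interval` docs.
  public int firstCandidate(long target) {
    int idx = Arrays.binarySearch(samples, target);
    if (idx < 0) {
      idx = -idx - 1; // insertion point
    }
    // The true first match can only lie in the sample block before the insertion point.
    return Math.max(0, (idx - 1) * interval);
  }

  public static void main(String[] args) {
    long[] values = {1, 3, 3, 7, 9, 12, 15, 15, 20, 25}; // doc values in index-sort order
    SparseIndexSketch index = new SparseIndexSketch(values, 4); // samples: 1, 9, 20
    System.out.println(index.firstCandidate(12)); // 4: scan from doc 4; first real match is doc 5
  }
}
```

Because only every N-th value is kept, the structure stays small and cheap to write, which is what would make it reasonable to enable automatically whenever index sorting is on.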
[GitHub] [lucene] javanna commented on pull request #622: LUCENE-10395: Introduce TotalHitCountCollectorManager
javanna commented on pull request #622: URL: https://github.com/apache/lucene/pull/622#issuecomment-1025703099 This should be ready now ;) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10385) Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery.
[ https://issues.apache.org/jira/browse/LUCENE-10385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484648#comment-17484648 ] Luca Cavanna commented on LUCENE-10385: --- Leaving a note only to say that I would like to take this issue. I will open a PR soon. > Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery. > > > Key: LUCENE-10385 > URL: https://issues.apache.org/jira/browse/LUCENE-10385 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > > This query can count matches by computing the first and last matching doc IDs > using binary search.
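The counting trick quoted in this issue — derive the match count from the first and last matching doc IDs found by binary search — can be sketched in plain Java. Arrays stand in for the index-sorted doc values; the names are hypothetical, not Lucene's API.

```java
public class RangeCountSketch {
  // First index i with values[i] >= target (classic lower bound on a sorted array).
  public static int lowerBound(long[] values, long target) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Count of docs whose value lies in [min, max], in O(log n) rather than O(n).
  public static int count(long[] sortedValues, long min, long max) {
    return lowerBound(sortedValues, max + 1) - lowerBound(sortedValues, min);
  }

  public static void main(String[] args) {
    long[] values = {1, 3, 3, 7, 9, 12, 15, 15, 20, 25}; // index-sorted field values
    System.out.println(count(values, 3, 15)); // 7: docs with values 3,3,7,9,12,15,15
  }
}
```

This is why a Weight#count implementation based on it runs in sub-linear time rather than visiting every match.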
[GitHub] [lucene] jpountz merged pull request #622: LUCENE-10395: Introduce TotalHitCountCollectorManager
jpountz merged pull request #622: URL: https://github.com/apache/lucene/pull/622
[jira] [Commented] (LUCENE-10395) Add support for TotalHitCountCollectorManager
[ https://issues.apache.org/jira/browse/LUCENE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484687#comment-17484687 ] ASF subversion and git services commented on LUCENE-10395: -- Commit df12e2b195b576481d8588ae67c4155bbac189fe in lucene's branch refs/heads/main from Luca Cavanna [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=df12e2b ] LUCENE-10395: Introduce TotalHitCountCollectorManager (#622) > Add support for TotalHitCountCollectorManager > - > > Key: LUCENE-10395 > URL: https://issues.apache.org/jira/browse/LUCENE-10395 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Luca Cavanna >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > With LUCENE-10002 we are looking at replacing usages of > IndexSearcher#search(Query, Collector) with the corresponding method that > takes a CollectorManager instead of a Collector. As part of this effort we > would like to add support for a collector manager that encapsulates > TotalHitCountCollector which is commonly used in our tests and possibly by > users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
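The Collector-to-CollectorManager move behind this change can be illustrated with a plain-Java stand-in (hypothetical names, not Lucene's actual interfaces): the manager creates one hit-counting collector per index slice and merges the per-slice totals in reduce(), which is what lets searches run slices concurrently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Collector/CollectorManager split:
// one HitCounter per index slice, merged into a total at the end.
public class HitCountManagerSketch {
  public static class HitCounter {
    public int hits;

    public void collect(int doc) {
      hits++;
    }
  }

  // Called once per slice, possibly from different threads.
  public HitCounter newCollector() {
    return new HitCounter();
  }

  // Merges the per-slice results into the final count.
  public int reduce(List<HitCounter> collectors) {
    int total = 0;
    for (HitCounter c : collectors) {
      total += c.hits;
    }
    return total;
  }

  public static void main(String[] args) {
    HitCountManagerSketch manager = new HitCountManagerSketch();
    List<HitCounter> perSlice = new ArrayList<>();
    int[][] slices = {{0, 1, 2}, {5, 8}, {9}}; // matching doc IDs per slice
    for (int[] slice : slices) {
      HitCounter c = manager.newCollector();
      for (int doc : slice) {
        c.collect(doc);
      }
      perSlice.add(c);
    }
    System.out.println(manager.reduce(perSlice)); // 6
  }
}
```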
[GitHub] [lucene] jpountz commented on a change in pull request #541: LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil
jpountz commented on a change in pull request #541: URL: https://github.com/apache/lucene/pull/541#discussion_r795686801

## File path: lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java

@@ -24,89 +24,73 @@
 import org.apache.lucene.util.DocBaseBitSetIterator;
 import org.apache.lucene.util.FixedBitSet;
 
-class DocIdsWriter {
+final class DocIdsWriter {
 
-  private DocIdsWriter() {}
+  private static final byte CONTINUOUS_IDS = (byte) -2;
+  private static final byte BITSET_IDS = (byte) -1;
+  private static final byte DELTA_FOR_UTIL = (byte) 32 + 16;
+  private static final byte BPV_24 = (byte) 32 + 24;
+  private static final byte BPV_32 = (byte) 32;
+  // These signs are legacy, should no longer be used in the writing side.
+  private static final byte LEGACY_DELTA_VINT = (byte) 0;
+  private static final byte LEGACY_BPV_24 = (byte) 24;
 
-  static void writeDocIds(int[] docIds, int start, int count, DataOutput out) throws IOException {
+  private final BKDForUtil forUtil;
+  private final int[] scratch;
+
+  DocIdsWriter(int maxPointsInLeaf) {
+    scratch = new int[maxPointsInLeaf];
+    forUtil = new BKDForUtil(maxPointsInLeaf);
+  }
+
+  void writeDocIds(int[] docIds, int start, int count, DataOutput out) throws IOException {
     // docs can be sorted either when all docs in a block have the same value
     // or when a segment is sorted
-    boolean sorted = true;
     boolean strictlySorted = true;
+    int min = docIds[0];
+    int max = docIds[0];
     for (int i = 1; i < count; ++i) {
       int last = docIds[start + i - 1];
       int current = docIds[start + i];
-      if (last > current) {
-        sorted = strictlySorted = false;
-        break;
-      } else if (last == current) {
+      if (last >= current) {
         strictlySorted = false;
       }
+      min = Math.min(min, current);
+      max = Math.max(max, current);
     }
-    int min2max = docIds[start + count - 1] - docIds[start] + 1;
+    int min2max = max - min + 1;
     if (strictlySorted) {
       if (min2max == count) {
         // continuous ids, typically happens when segment is sorted
-        out.writeByte((byte) -2);
+        out.writeByte(CONTINUOUS_IDS);
         out.writeVInt(docIds[start]);
         return;
       } else if (min2max <= (count << 4)) {
         assert min2max > count : "min2max: " + min2max + ", count: " + count;
         // Only trigger bitset optimization when max - min + 1 <= 16 * count in order to avoid
         // expanding too much storage.
         // A field with lower cardinality will have higher probability to trigger this optimization.
-        out.writeByte((byte) -1);
+        out.writeByte(BITSET_IDS);
         writeIdsAsBitSet(docIds, start, count, out);
         return;
       }
     }
-    if (sorted) {
-      out.writeByte((byte) 0);
-      int previous = 0;
-      for (int i = 0; i < count; ++i) {
-        int doc = docIds[start + i];
-        out.writeVInt(doc - previous);
-        previous = doc;
+
+    if (min2max <= 0xL) {

Review comment:
```suggestion
    if (min2max <= 0x) {
```
[jira] [Commented] (LUCENE-10395) Add support for TotalHitCountCollectorManager
[ https://issues.apache.org/jira/browse/LUCENE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484693#comment-17484693 ] ASF subversion and git services commented on LUCENE-10395: -- Commit 0c97682cdc1564a0eba9602de138a7750c561e42 in lucene's branch refs/heads/branch_9x from Luca Cavanna [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0c97682 ] LUCENE-10395: Introduce TotalHitCountCollectorManager (#622) > Add support for TotalHitCountCollectorManager > - > > Key: LUCENE-10395 > URL: https://issues.apache.org/jira/browse/LUCENE-10395 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Luca Cavanna >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > With LUCENE-10002 we are looking at replacing usages of > IndexSearcher#search(Query, Collector) with the corresponding method that > takes a CollectorManager instead of a Collector. As part of this effort we > would like to add support for a collector manager that encapsulates > TotalHitCountCollector which is commonly used in our tests and possibly by > users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10395) Add support for TotalHitCountCollectorManager
[ https://issues.apache.org/jira/browse/LUCENE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10395. --- Fix Version/s: 9.1 Resolution: Fixed > Add support for TotalHitCountCollectorManager > - > > Key: LUCENE-10395 > URL: https://issues.apache.org/jira/browse/LUCENE-10395 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Luca Cavanna >Priority: Minor > Fix For: 9.1 > > Time Spent: 20m > Remaining Estimate: 0h > > With LUCENE-10002 we are looking at replacing usages of > IndexSearcher#search(Query, Collector) with the corresponding method that > takes a CollectorManager instead of a Collector. As part of this effort we > would like to add support for a collector manager that encapsulates > TotalHitCountCollector which is commonly used in our tests and possibly by > users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10390) Improve Weight#count javadocs
[ https://issues.apache.org/jira/browse/LUCENE-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484703#comment-17484703 ] Luca Cavanna commented on LUCENE-10390: --- [~jpountz] I believe this one can be closed as the corresponding PR was merged. > Improve Weight#count javadocs > - > > Key: LUCENE-10390 > URL: https://issues.apache.org/jira/browse/LUCENE-10390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Luca Cavanna >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > I started working on LUCENE-10385 and chatting with [~jpountz] we said that > maybe the javadocs around Weight#count could be clarified around when it > should be implemented. It currently mentions that the method should only be > overridden by sub-classes that are able to compute/retrieve the count in O(1) > time. The intention is that computing the count should be fast and have a > much smaller overhead compared to collecting all matches. Possibly the > requirement could be relaxed to sub-linear time instead of strictly linear > time? In fact, in LUCENE-10385 we would use binary search to compute the > count which does not execute in linear time but rather O(log N) which though > seems to be acceptable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10390) Improve Weight#count javadocs
[ https://issues.apache.org/jira/browse/LUCENE-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10390. --- Fix Version/s: 9.1 Resolution: Fixed Thanks [~lucacavanna] ! > Improve Weight#count javadocs > - > > Key: LUCENE-10390 > URL: https://issues.apache.org/jira/browse/LUCENE-10390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Luca Cavanna >Priority: Minor > Fix For: 9.1 > > Time Spent: 20m > Remaining Estimate: 0h > > I started working on LUCENE-10385 and chatting with [~jpountz] we said that > maybe the javadocs around Weight#count could be clarified around when it > should be implemented. It currently mentions that the method should only be > overridden by sub-classes that are able to compute/retrieve the count in O(1) > time. The intention is that computing the count should be fast and have a > much smaller overhead compared to collecting all matches. Possibly the > requirement could be relaxed to sub-linear time instead of strictly linear > time? In fact, in LUCENE-10385 we would use binary search to compute the > count which does not execute in linear time but rather O(log N) which though > seems to be acceptable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova merged pull request #616: LUCENE-9573 Add Vectors to TestBackwardsCompatibility
mayya-sharipova merged pull request #616: URL: https://github.com/apache/lucene/pull/616
[jira] [Commented] (LUCENE-9573) Add back compat tests for VectorFormat to TestBackwardsCompatibility
[ https://issues.apache.org/jira/browse/LUCENE-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484711#comment-17484711 ] ASF subversion and git services commented on LUCENE-9573: - Commit 8dfdb261e76db9db3d6e56f2e8bb0aac5b6871c8 in lucene's branch refs/heads/main from Mayya Sharipova [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8dfdb26 ] LUCENE-9573 Add Vectors to TestBackwardsCompatibility (#616) This patch adds KNN vectors for testing backward compatible indices - Add a KnnVectorField to documents when creating a new backward compatible index - Add knn vectors search and check for vector values to the testing of search of backward compatible indices - Add tests for knn vector search when changing backward compatible indices (merging them and adding new documents to them) > Add back compat tests for VectorFormat to TestBackwardsCompatibility > > > Key: LUCENE-9573 > URL: https://issues.apache.org/jira/browse/LUCENE-9573 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael Sokolov >Assignee: Mayya Sharipova >Priority: Blocker > Fix For: 9.1 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In LUCENE-9322 we add a new VectorFormat to the index. This issue is about > adding backwards compatibility tests for it once the index format has > crystallized into its 9.0 form -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID
Adrien Grand created LUCENE-10397: - Summary: KnnVectorQuery doesn't tie break by doc ID Key: LUCENE-10397 URL: https://issues.apache.org/jira/browse/LUCENE-10397 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand

I was expecting KnnVectorQuery to tie-break by doc ID, so that if multiple documents get the same score then the ones that have the lowest doc IDs would get returned first, similarly to how SortField.SCORE also tie-breaks by doc ID. However the following test fails, suggesting that this is not the case.

{code:java}
public void testTieBreak() throws IOException {
  try (Directory d = newDirectory()) {
    try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
      for (int j = 0; j < 5; j++) {
        Document doc = new Document();
        doc.add(
            new KnnVectorField("field", new float[] {0, 1}, VectorSimilarityFunction.DOT_PRODUCT));
        w.addDocument(doc);
      }
    }
    try (IndexReader reader = DirectoryReader.open(d)) {
      assertEquals(1, reader.leaves().size());
      IndexSearcher searcher = new IndexSearcher(reader);
      KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3);
      TopDocs topHits = searcher.search(query, 3);
      assertEquals(0, topHits.scoreDocs[0].doc);
      assertEquals(1, topHits.scoreDocs[1].doc);
      assertEquals(2, topHits.scoreDocs[2].doc);
    }
  }
}
{code}
[jira] [Updated] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID
[ https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-10397: -- Issue Type: Bug (was: Task) > KnnVectorQuery doesn't tie break by doc ID > -- > > Key: LUCENE-10397 > URL: https://issues.apache.org/jira/browse/LUCENE-10397 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > > I was expecting KnnVectorQUery to tie-break by doc ID so that if multiple > documents get the same score then the ones that have the lowest doc ID would > get returned first, similarly to how SortField.SCORE also tie-breaks by doc > ID. > However the following test fails, suggesting that it is not the case. > {code:java} > public void testTieBreak() throws IOException { > try (Directory d = newDirectory()) { > try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) { > for (int j = 0; j < 5; j++) { > Document doc = new Document(); > doc.add( > new KnnVectorField("field", new float[] {0, 1}, > VectorSimilarityFunction.DOT_PRODUCT)); > w.addDocument(doc); > } > } > try (IndexReader reader = DirectoryReader.open(d)) { > assertEquals(1, reader.leaves().size()); > IndexSearcher searcher = new IndexSearcher(reader); > KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, > 3}, 3); > TopDocs topHits = searcher.search(query, 3); > assertEquals(0, topHits.scoreDocs[0].doc); > assertEquals(1, topHits.scoreDocs[1].doc); > assertEquals(2, topHits.scoreDocs[2].doc); > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
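The expected behavior — equal scores resolved in favor of the lower doc ID, as SortField.SCORE does — comes down to the comparator used by the top-k heap. A plain-Java sketch (not KnnVectorQuery's actual internals; names are hypothetical): order the min-heap so that among equal scores the higher doc ID sits at the head and is evicted first.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class TieBreakSketch {
  // Select the top-k docs by score; equal scores tie-break toward lower doc IDs.
  public static int[] topKDocs(float[] scoreByDoc, int k) {
    // Min-heap whose head is the current *worst* hit: lower score first,
    // and among equal scores the higher doc ID first, so it gets evicted first.
    PriorityQueue<int[]> heap =
        new PriorityQueue<>(
            (a, b) ->
                scoreByDoc[a[0]] != scoreByDoc[b[0]]
                    ? Float.compare(scoreByDoc[a[0]], scoreByDoc[b[0]])
                    : Integer.compare(b[0], a[0]));
    for (int doc = 0; doc < scoreByDoc.length; doc++) {
      heap.offer(new int[] {doc});
      if (heap.size() > k) {
        heap.poll(); // evict the worst hit
      }
    }
    return heap.stream().mapToInt(e -> e[0]).sorted().toArray();
  }

  public static void main(String[] args) {
    // Five docs with identical scores, as in the failing test: expect docs 0, 1, 2.
    System.out.println(Arrays.toString(topKDocs(new float[] {1f, 1f, 1f, 1f, 1f}, 3)));
  }
}
```

Without the doc-ID clause in the comparator, which of the tied docs survive depends on heap insertion order, which is the unstable behavior the test exposes.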
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484759#comment-17484759 ] Joel Bernstein commented on LUCENE-10382: - I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the BitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. > Allow KnnVectorQuery to operate over a subset of liveDocs > - > > Key: LUCENE-10382 > URL: https://issues.apache.org/jira/browse/LUCENE-10382 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0 >Reporter: Joel Bernstein >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently the KnnVectorQuery selects the top K vectors from all live docs. > This ticket will change the interface to make it possible for the top K > vectors to be selected from a subset of the live docs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484759#comment-17484759 ] Joel Bernstein edited comment on LUCENE-10382 at 1/31/22, 4:19 PM: --- I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the bitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. was (Author: joel.bernstein): I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the BitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. > Allow KnnVectorQuery to operate over a subset of liveDocs > - > > Key: LUCENE-10382 > URL: https://issues.apache.org/jira/browse/LUCENE-10382 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0 >Reporter: Joel Bernstein >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently the KnnVectorQuery selects the top K vectors from all live docs. > This ticket will change the interface to make it possible for the top K > vectors to be selected from a subset of the live docs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484759#comment-17484759 ] Joel Bernstein edited comment on LUCENE-10382 at 1/31/22, 4:20 PM: --- I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the bitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. Adding a brute force flag to KnnVectorQuery might be a good idea as well. was (Author: joel.bernstein): I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the bitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know. > Allow KnnVectorQuery to operate over a subset of liveDocs > - > > Key: LUCENE-10382 > URL: https://issues.apache.org/jira/browse/LUCENE-10382 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0 >Reporter: Joel Bernstein >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently the KnnVectorQuery selects the top K vectors from all live docs. > This ticket will change the interface to make it possible for the top K > vectors to be selected from a subset of the live docs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
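The brute-force plan described in this thread — advance through the vector values, skip docs excluded by the filter bits, and score each candidate with the field's similarity function — could look roughly like this in plain Java. Dot product stands in for KnnVectorField's similarity function, a boolean[] stands in for the filter bits, and all names are hypothetical.

```java
public class BruteForceKnnSketch {
  // Stand-in for VectorSimilarityFunction.DOT_PRODUCT.
  public static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Return the best-scoring doc among those whose filter bit is set;
  // ties go to the lower doc ID because later equal scores don't replace the best.
  public static int bestFilteredDoc(float[][] vectors, boolean[] filterBits, float[] query) {
    int best = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    for (int doc = 0; doc < vectors.length; doc++) {
      if (!filterBits[doc]) {
        continue; // skip docs outside the filter
      }
      float score = dotProduct(vectors[doc], query);
      if (score > bestScore) {
        bestScore = score;
        best = doc;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1, 0}, {0, 1}, {1, 1}};
    boolean[] filter = {true, false, true}; // doc 1 excluded by the filter
    System.out.println(bestFilteredDoc(vectors, filter, new float[] {0, 2})); // 2
  }
}
```

A real implementation would keep a top-k heap rather than a single best doc, but the advance-filter-score loop is the core of the brute-force path.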
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484830#comment-17484830 ] Adrien Grand commented on LUCENE-10396: --- One interesting question about this will be the read API. What I have in mind for now looks something like this:

{code:java}
class SparseIndexProducer {
  DocIdRangeIterator getIterator();
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {

  /**
   * Return a range of doc IDs that may contain {@code value} as the value of the sortPos-th sort
   * field. All documents between target included and {@code range.first} excluded are guaranteed
   * to not contain {@code value}. {@code range.first} equal to NO_MORE_DOCS is used to signal that
   * no more documents may contain {@code value}.
   */
  DocIdRange nextRangeNumeric(int target, int sortPos, long value);

  /** Likewise for SORTED doc values. */
  DocIdRange nextRangeSorted(int target, int sortPos, long ord);

  /** Likewise for BINARY doc values. */
  DocIdRange nextRangeBinary(int target, int sortPos, BytesRef binaryValue);
}
{code}

> Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction.
[GitHub] [lucene] gsmiller commented on pull request #632: LUCENE-10050 Remove DrillSideways#search(DrillDownQuery,Collector) in favor of DrillSideways#search(DrillDownQuery,CollectorManager)
gsmiller commented on pull request #632: URL: https://github.com/apache/lucene/pull/632#issuecomment-1026319472 Thanks for tackling this @gautamworah96! I'm hoping to dig into this PR tomorrow. Just letting you know I saw it and am planning to have a look. Thanks again!
[jira] [Created] (LUCENE-10398) Add static method for getting Terms from LeafReader
Marc D'Mello created LUCENE-10398: - Summary: Add static method for getting Terms from LeafReader Key: LUCENE-10398 URL: https://issues.apache.org/jira/browse/LUCENE-10398 Project: Lucene - Core Issue Type: Improvement Reporter: Marc D'Mello Hi all, {{LeafReader}} has methods like {{getBinaryDocValues(String field)}} that return {{null}} values if the field is not indexed. These methods also have equivalent {{DocValues}} static methods, such as {{DocValues.getBinary()}}, which return an {{emptyBinary()}} rather than a {{null}} if there is no field. I noticed that {{Terms}} does not have an equivalent static method for {{LeafReader.terms()}} like {{Terms.terms()}} or something similar. I was wondering if there was a reason for this, or if a method like this could be useful. Thanks!
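The DocValues-style convention being asked about — a static accessor that substitutes a shared empty instance for null, so callers never null-check — looks like this in a plain-Java sketch. A Map stands in for the LeafReader and a List of terms for the Terms object; the names are hypothetical, not Lucene's API.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class EmptyInsteadOfNullSketch {
  // Shared immutable empty instance, analogous to an "empty Terms" singleton.
  public static final List<String> EMPTY_TERMS = Collections.emptyList();

  // Like DocValues.getBinary(): never returns null, so callers can skip the null check.
  public static List<String> getTerms(Map<String, List<String>> reader, String field) {
    List<String> terms = reader.get(field); // may be null if the field was not indexed
    return terms == null ? EMPTY_TERMS : terms;
  }

  public static void main(String[] args) {
    Map<String, List<String>> reader = Map.of("title", List.of("lucene", "search"));
    System.out.println(getTerms(reader, "missing").size()); // 0, not a NullPointerException
  }
}
```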
[GitHub] [lucene] zacharymorn commented on pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery
zacharymorn commented on pull request #418: URL: https://github.com/apache/lucene/pull/418#issuecomment-1026512709

Hi @jpountz, just want to check again to see if I have addressed your concern with the utility approach? To better illustrate the potential issue I'm considering with the higher-level abstraction approach, I've copied over some key code snippets here. If we were to extract out `SynonymImpactsSource` (from `SynonymQuery`), it may look something like this:

```java
public class SynonymImpactsSource implements ImpactsSource {
  private final ImpactsEnum[] impactsEnums;
  private final Impacts[] impacts;
  private final float[] boosts;
  private Impacts lead;

  public SynonymImpactsSource(ImpactsEnum[] impactsEnums, float[] boosts) {
    ...
  }

  @Override
  public Impacts getImpacts() throws IOException {
    // Use the impacts that have the lower next boundary as a lead.
    // It will decide on the number of levels and the block boundaries.
    if (lead == null) {
      Impacts tmpLead = null;
      for (int i = 0; i < impactsEnums.length; ++i) {
        impacts[i] = impactsEnums[i].getImpacts();
        if (tmpLead == null || impacts[i].getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
          tmpLead = impacts[i];
        }
      }
      lead = tmpLead;
    }
    return new Impacts() {
      @Override
      public int numLevels() {
        // Delegate to the lead
        return lead.numLevels();
      }

      @Override
      public int getDocIdUpTo(int level) {
        // Delegate to the lead
        return lead.getDocIdUpTo(level);
      }

      @Override
      public List<Impact> getImpacts(int level) {
        final int docIdUpTo = getDocIdUpTo(level);
        return ImpactsMergingUtils.mergeImpactsPerField(
            impactsEnums, impacts, boosts, docIdUpTo, false);
      }
    };
  }

  @Override
  public void advanceShallow(int target) throws IOException {
    for (ImpactsEnum impactsEnum : impactsEnums) {
      if (impactsEnum.docID() < target) {
        impactsEnum.advanceShallow(target);
      }
    }
  }

  public Impacts[] impacts() {
    return impacts;
  }
}
```

However, the above `SynonymImpactsSource#getImpacts(level)` may not be directly usable for `CombinedFieldsQuery`, as illustrated below:

```java
Impacts CombinedFieldsQuery$ImpactsSource#getImpacts() {
  for (entry : Map) {
    // This is potentially problematic: this method only takes in level info, and it
    // internally looks up a field-specific docIdUpTo to get valid impacts (see the
    // definition above). For CombinedFieldsQuery, however, docIdUpTo may differ
    // across fields at the same level.
    combine SynonymImpactsSource.getImpacts(level)
  }
  return combined Impacts
}
```

Hence we may actually need to change some APIs if we were to make `SynonymImpactsSource` work for the `CombinedFieldsQuery` use case?
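The docIdUpTo mismatch described in the second snippet can be shown with a toy example. This is a self-contained illustration, not Lucene code: `boundaries` stands in for one field's per-level impact block boundaries.

```java
public class ImpactsLevelMismatch {
  /** Smallest level whose block boundary covers the target doc ID, or -1 if none does. */
  static int levelFor(int[] boundaries, int target) {
    for (int level = 0; level < boundaries.length; level++) {
      if (boundaries[level] >= target) {
        return level;
      }
    }
    return -1; // no level reaches this far
  }
}
```

With boundaries {128, 4096, 65536} for one field and {100, 1000, 100000} for another, the same target doc ID maps to different levels per field, so a shared `level` argument cannot select consistent docIdUpTo cut-offs across fields; an explicit docIdUpTo would.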
[jira] [Comment Edited] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484830#comment-17484830 ] Adrien Grand edited comment on LUCENE-10396 at 2/1/22, 7:31 AM:

One interesting question about this will be the read API. What I have in mind for now looks something like this:

{code:java}
class SparseIndexProducer {
  /**
   * Get an iterator over ranges of doc IDs that may contain values that fall
   * in the given range of values.
   */
  DocIdRangeIterator getNumeric(FieldInfo field, long rangeMin, long rangeMax);

  DocIdRangeIterator getSorted(FieldInfo field, long rangeMinOrd, long rangeMaxOrd);

  DocIdRangeIterator getBinary(FieldInfo field, BytesRef rangeMin, BytesRef rangeMax);
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {
  /**
   * Return the next range of doc IDs so that range.first is greater than or
   * equal to the target.
   */
  DocIdRange advance(int target);
}
{code}

was (Author: jpountz):
One interesting question about this will be the read API. What I have in mind for now looks something like this:

{code:java}
class SparseIndexProducer {
  DocIdRangeIterator getIterator();
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {
  /**
   * Return a range of doc IDs that may contain {@code value} as the value of
   * the sortPos-th sort field. All documents between target included and
   * {@code range.first} excluded are guaranteed to not contain {@code value}.
   * {@code range.first} equal to NO_MORE_DOCS is used to signal that no more
   * documents may contain {@code value}.
   */
  DocIdRange nextRangeNumeric(int target, int sortPos, long value);

  /** Likewise for SORTED doc values. */
  DocIdRange nextRangeSorted(int target, int sortPos, long ord);

  /** Likewise for BINARY doc values. */
  DocIdRange nextRangeBinary(int target, int sortPos, BytesRef binaryValue);
}
{code}
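A minimal sketch of the "store every N-th value" idea, in plain Java rather than against Lucene's APIs: values are assumed to be in doc-ID order (i.e. the field is the primary sort field), and `candidateRange` returns a doc-ID range that may contain matches for [min, max], analogous to what a `DocIdRangeIterator` would produce. All names here are hypothetical.

```java
public class SparseIndexSketch {
  private final long[] samples; // value of every N-th document, in doc-ID order
  private final int interval;   // N
  private final int maxDoc;

  /** sortedValues[i] is the sort-field value of doc i, ascending by doc ID. */
  public SparseIndexSketch(long[] sortedValues, int interval) {
    this.interval = interval;
    this.maxDoc = sortedValues.length;
    int numSamples = (sortedValues.length + interval - 1) / interval;
    this.samples = new long[numSamples];
    for (int i = 0; i < numSamples; i++) {
      samples[i] = sortedValues[i * interval];
    }
  }

  /** Returns {firstDoc, lastDocExclusive} that may contain values in [min, max]. */
  public int[] candidateRange(long min, long max) {
    int first = 0;
    int last = maxDoc;
    // Skip leading blocks: if samples[i + 1] < min, every doc in block i has a
    // value below min, so matches cannot start before block i + 1.
    for (int i = 0; i + 1 < samples.length && samples[i + 1] < min; i++) {
      first = (i + 1) * interval;
    }
    // Truncate at the first block whose leading sample already exceeds max.
    for (int i = 0; i < samples.length; i++) {
      if (samples[i] > max) {
        last = i * interval;
        break;
      }
    }
    return new int[] {first, last};
  }
}
```

Only the samples array (maxDoc/N entries) is consulted, which is what would make the range query far less I/O intensive than walking per-document doc values.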
[GitHub] [lucene] vigyasharma opened a new pull request #633: [WIP] LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma opened a new pull request #633: URL: https://github.com/apache/lucene/pull/633

# Description

> _Work in Progress PR to share the approach and get early feedback on changes_

The addIndexes(CodecReader...) API today merges all provided readers in a single merge, in one large blocking call. We want to add concurrency here by invoking multiple parallel merges on subsets of readers, in a way that is configurable by users. The merged segments created can later be merged further in the regular, non-blocking background merges triggered by Lucene. Currently, users are responsible for managing their API run times by invoking it multiple times with subsets of readers.

JIRA - https://issues.apache.org/jira/browse/LUCENE-10216

# Solution

In this change, we leverage the configured MergeScheduler to invoke all merges required by the addIndexes API. MergePolicy also exposes a `findMerges(List<CodecReader> readers)` API that users can override to configure how their readers should be clustered into merges. We add a default implementation that retains the current behavior of adding all readers into a single merge.

# Tests

- Existing tests passing
- Pending: New tests to be added

# Checklist

Please review the following and check all that apply:

- [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
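The proposed flow can be sketched in plain Java under stated assumptions. This is not the Lucene API: `findMerges` stands in for the proposed MergePolicy hook, `mergeGroup` for the per-group merge work, and a fixed thread pool for the configured MergeScheduler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ConcurrentAddIndexesSketch {
  /** Default clustering: all readers go into a single merge (current behavior). */
  static <T> List<List<T>> findMerges(List<T> readers) {
    return List.of(new ArrayList<>(readers));
  }

  /** Submits one merge task per group and blocks until all of them complete. */
  static <T, R> List<R> addIndexes(List<T> readers, Function<List<T>, R> mergeGroup) {
    ExecutorService scheduler = Executors.newFixedThreadPool(4);
    try {
      List<Future<R>> futures = new ArrayList<>();
      for (List<T> group : findMerges(readers)) {
        futures.add(scheduler.submit(() -> mergeGroup.apply(group)));
      }
      List<R> merged = new ArrayList<>();
      for (Future<R> future : futures) {
        merged.add(future.get()); // the API stays blocking overall
      }
      return merged;
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      scheduler.shutdown();
    }
  }
}
```

Overriding `findMerges` to return one group per reader would give the concurrent 1:1 segment conversion described in LUCENE-10216, while the default keeps today's single-merge behavior.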
[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API
[ https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485083#comment-17485083 ] Vigya Sharma commented on LUCENE-10216: --- I took a shot at the MergePolicy + MergeScheduler idea. Have raised a draft, work-in-progress PR - https://github.com/apache/lucene/pull/633 - to get some high-level feedback on the approach. It has existing tests passing, and is pending new tests specific to this change. > Add concurrency to addIndexes(CodecReader…) API > --- > > Key: LUCENE-10216 > URL: https://issues.apache.org/jira/browse/LUCENE-10216 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Vigya Sharma >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > I work at Amazon Product Search, and we use Lucene to power search for the > e-commerce platform. I'm working on a project that involves applying > metadata+ETL transforms and indexing documents on n different _indexing_ > boxes, combining them into a single index on a separate _reducer_ box, and > making it available for queries on m different _search_ boxes (replicas). > Segments are asynchronously copied from indexers to reducers to searchers as > they become available for the next layer to consume. > I am using the addIndexes API to combine multiple indexes into one on the > reducer boxes. Since we also have taxonomy data, we need to remap facet field > ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version > of this API. The API leverages {{SegmentMerger.merge()}} to create segments > with new ordinal values while also merging all provided segments in the > process. > _This is however a blocking call that runs in a single thread._ Until we have > written segments with new ordinal values, we cannot copy them to searcher > boxes, which increases the time to make documents available for search. > I was playing around with the API by creating multiple concurrent merges, > each with only a single reader, creating a concurrently running 1:1 > conversion from old segments to new ones (with new ordinal values). We follow > this up with non-blocking background merges. This lets us copy the segments > to searchers and replicas as soon as they are available, and later replace > them with merged segments as background jobs complete. On the Amazon dataset > I profiled, this gave us around a 2.5 to 3x improvement in addIndexes() time. > Each call was given about 5 readers to add on average. > This might be a useful addition to Lucene. We could create another {{addIndexes()}} > API with a {{boolean}} flag for concurrency, that internally submits multiple > merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, > and waits for them to complete before returning. > While this is doable from outside Lucene by using your own thread pool, starting > multiple addIndexes() calls and waiting for them to complete, I felt it needs > some understanding of what addIndexes does, why you need to wait on the merge, > and why it makes sense to pass a single reader in the addIndexes API. > Out-of-the-box support in Lucene could simplify this for folks with a similar use case.
[jira] [Commented] (LUCENE-10054) Handle hierarchy in HNSW graph
[ https://issues.apache.org/jira/browse/LUCENE-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485084#comment-17485084 ] Adrien Grand commented on LUCENE-10054: --- It looks like this change made both indexing (+5%) and searching (+7%) vectors slightly faster. http://people.apache.org/~mikemccand/lucenebench/VectorSearch.html > Handle hierarchy in HNSW graph > -- > > Key: LUCENE-10054 > URL: https://issues.apache.org/jira/browse/LUCENE-10054 > Project: Lucene - Core > Issue Type: Task >Reporter: Mayya Sharipova >Priority: Major > Time Spent: 20h > Remaining Estimate: 0h > > Currently HNSW graph is represented as a single layer graph. > We would like to extend it to handle hierarchy as per > [discussion|https://issues.apache.org/jira/browse/LUCENE-9004?focusedCommentId=17393216&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393216]. > > > TODO tasks: > - add multiple layers in the HnswGraph class > - modify the format in Lucene90HnswVectorsWriter and > Lucene90HnswVectorsReader to handle multiple layers > - modify graph construction and search algorithm to handle hierarchy > - run benchmarks
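The hierarchical search in the TODO list can be sketched generically. This is an illustrative toy, not Lucene's HnswGraph code (real HNSW uses a beam search with an ef parameter at the bottom layer, which is omitted here): each layer is an adjacency map, and search greedily descends from the top layer, reusing each layer's best node as the entry point for the layer below.

```java
import java.util.List;
import java.util.Map;

public class HnswDescentSketch {
  /** Greedy search on one layer: hop to a closer neighbor until none exists. */
  static int greedyStep(Map<Integer, int[]> layer, double[][] vectors, double[] query, int entry) {
    int cur = entry;
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int nb : layer.getOrDefault(cur, new int[0])) {
        if (dist(vectors[nb], query) < dist(vectors[cur], query)) {
          cur = nb;
          improved = true;
        }
      }
    }
    return cur;
  }

  /** Descend from the top layer to layer 0, refining the entry point per level. */
  static int search(List<Map<Integer, int[]>> layers, double[][] vectors, double[] query, int entry) {
    int cur = entry;
    for (int level = layers.size() - 1; level >= 0; level--) {
      cur = greedyStep(layers.get(level), vectors, query, cur);
    }
    return cur;
  }

  static double dist(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum; // squared Euclidean distance
  }
}
```

The upper layers contain few, long-range links, so the descent cheaply lands near the query before the dense bottom layer is searched, which is the intuition behind the reported speedups.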