[GitHub] [lucene] uschindler commented on pull request #12066: Retire/deprecate per-instance MMapDirectory#setUseUnmap

2023-01-04 Thread GitBox


uschindler commented on PR #12066:
URL: https://github.com/apache/lucene/pull/12066#issuecomment-1370980047

   I cleaned up the code more:
   - removed the useless doPrivileged block around the actual unmapping. We just call a MethodHandle, and the Java 9+ version has no security checks.
   - put the whole `MMapDirectory` clinit into one doPrivileged block
   - added more info to log messages, especially when permissions are missing, and log how to disable MemorySegments
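For context, here is a minimal standalone sketch of the Java 9+ unmap mechanism referred to above (illustrative only, not Lucene's actual `MMapDirectory` code): cleanup goes through a `MethodHandle` bound to `sun.misc.Unsafe#invokeCleaner`, and invoking that handle performs no security checks, which is why no doPrivileged block is needed around the invocation itself.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

public class UnmapSketch {
  public static void main(String[] args) throws Throwable {
    // Reflective setup: this part may require privileges, the invocation below does not.
    Field f = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Object unsafe = f.get(null);
    // Bind a MethodHandle to Unsafe#invokeCleaner(ByteBuffer).
    MethodHandle invokeCleaner =
        MethodHandles.lookup()
            .findVirtual(
                unsafe.getClass(),
                "invokeCleaner",
                MethodType.methodType(void.class, ByteBuffer.class))
            .bindTo(unsafe);
    ByteBuffer buf = ByteBuffer.allocateDirect(16); // stand-in for a mapped buffer
    invokeCleaner.invokeExact(buf); // frees the off-heap memory immediately, no security check
  }
}
```

Only the reflective lookup of `theUnsafe` may need privileges, which is one reason grouping all of the setup into a single doPrivileged block in the class initializer is attractive.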


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] ErikPelli commented on a diff in pull request #12034: Avoid possible NullPointerException in BlendedTermQuery

2023-01-04 Thread GitBox


ErikPelli commented on code in PR #12034:
URL: https://github.com/apache/lucene/pull/12034#discussion_r1061649853


##
lucene/core/src/java/org/apache/lucene/search/BlendedTermQuery.java:
##
@@ -314,19 +314,15 @@ private static TermStates adjustFrequencies(
       IndexReaderContext readerContext, TermStates ctx, int artificialDf, long artificialTtf)
       throws IOException {
     List<LeafReaderContext> leaves = readerContext.leaves();
-    final int len;
-    if (leaves == null) {
-      len = 1;
-    } else {
-      len = leaves.size();
-    }
     TermStates newCtx = new TermStates(readerContext);
-    for (int i = 0; i < len; ++i) {
-      TermState termState = ctx.get(leaves.get(i));
-      if (termState == null) {
-        continue;
+    if (leaves != null) {

Review Comment:
   Ok, I missed that. In fact `IndexReaderContext` is an abstract class, and the classes that extend it (`CompositeReaderContext` and `LeafReaderContext`) either return a valid List or throw an exception.
   So modifying the doc is totally fine, but we also need to remove the null check on the result of the `IndexReaderContext#leaves()` call, because a null return can never happen.
   






[GitHub] [lucene] ErikPelli commented on a diff in pull request #12034: Remove null check in IndexReaderContext#leaves() usages

2023-01-04 Thread GitBox


ErikPelli commented on code in PR #12034:
URL: https://github.com/apache/lucene/pull/12034#discussion_r1061673383


##
lucene/core/src/java/org/apache/lucene/search/BlendedTermQuery.java:
##
@@ -314,19 +314,15 @@ private static TermStates adjustFrequencies(
       IndexReaderContext readerContext, TermStates ctx, int artificialDf, long artificialTtf)
       throws IOException {
     List<LeafReaderContext> leaves = readerContext.leaves();
-    final int len;
-    if (leaves == null) {
-      len = 1;
-    } else {
-      len = leaves.size();
-    }
     TermStates newCtx = new TermStates(readerContext);
-    for (int i = 0; i < len; ++i) {
-      TermState termState = ctx.get(leaves.get(i));
-      if (termState == null) {
-        continue;
+    if (leaves != null) {

Review Comment:
   Should I still add an entry in the CHANGES.txt file after this change?






[GitHub] [lucene] jpountz commented on pull request #12054: Introduce a new `KeywordField`.

2023-01-04 Thread GitBox


jpountz commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1371192608

   Thanks for looking @gsmiller. The test fails because this PR relies on behavior introduced by https://github.com/apache/lucene/pull/12053. I'll rebase once that other PR is merged.





[GitHub] [lucene] jpountz commented on pull request #12053: Allow reusing indexed binary fields.

2023-01-04 Thread GitBox


jpountz commented on PR #12053:
URL: https://github.com/apache/lucene/pull/12053#issuecomment-1371197803

   I pushed a new commit that also disallows term vector offsets on binary 
fields.





[GitHub] [lucene] vstrout opened a new issue, #12067: Getting exception on search after upgrading to Lucene 9.4

2023-01-04 Thread GitBox


vstrout opened a new issue, #12067:
URL: https://github.com/apache/lucene/issues/12067

   ### Description
   
   After upgrading from Lucene 9.3.0 to Lucene 9.4.2, index searches that sort by the description field throw the following exception:
   
   Caused by: java.lang.IllegalStateException: Term [77 73 64 66 6a 66 73 67 73 20 61 64 6b 66 64 6a 68 74 67 64 67 20 61 64 6b 66 64 6a 68 74 67 64 67 20 72 65 74 72 65 72 74 65 20] exists in doc values but not in the terms index
       at org.apache.lucene.search.comparators.TermOrdValComparator$CompetitiveIterator.init(TermOrdValComparator.java:582)
       at org.apache.lucene.search.comparators.TermOrdValComparator$CompetitiveIterator.update(TermOrdValComparator.java:553)
       at org.apache.lucene.search.comparators.TermOrdValComparator$TermOrdValLeafComparator.updateCompetitiveIterator(TermOrdValComparator.java:457)
       at org.apache.lucene.search.comparators.TermOrdValComparator$TermOrdValLeafComparator.setHitsThresholdReached(TermOrdValComparator.java:284)
       at org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector.countHit(TopFieldCollector.java:86)
       at org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector$1.collect(TopFieldCollector.java:202)
       at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:305)
       at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:247)
       at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
       at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:744)
       at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:662)
       at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:656)
       at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:636)
       at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:553)
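As a side note, decoding the hex bytes in that exception message (a throwaway snippet, not part of the reproduction) shows the term is a lowercased doc-values value, which hints at the mismatch: in the reproduction below, the `SortedDocValuesField` stores the lowercased string while the `StringField` in the same field name indexes the original case, so a doc-values term need not exist in the terms index.

```java
public class DecodeTerm {
  public static void main(String[] args) {
    // Hex bytes copied verbatim from the exception message above.
    String hex =
        "77 73 64 66 6a 66 73 67 73 20 61 64 6b 66 64 6a 68 74 67 64 67 20"
            + " 61 64 6b 66 64 6a 68 74 67 64 67 20 72 65 74 72 65 72 74 65 20";
    StringBuilder sb = new StringBuilder();
    for (String b : hex.split(" ")) {
      sb.append((char) Integer.parseInt(b, 16)); // each token is one ASCII byte
    }
    // Prints the lowercased concatenation of values from the test data below.
    System.out.println("[" + sb + "]"); // [wsdfjfsgs adkfdjhtgdg adkfdjhtgdg retrerte ]
  }
}
```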
   
   I wrote a small Java function that reproduces the issue. Increasing the maxRows value there to above 200 somehow resolves the problem.
   
    public static void test() throws Exception {
      String name = "description";
      String content = "content";
      File dir = new File("C:\\test");
      MMapDirectory directory = new MMapDirectory(dir.toPath());
      Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);

      boolean createIndex = true; // must be changed to false after calling this function the first time
      if (createIndex) {
        String[] values = {
          "anshduvfv ",
          "dhisdefihfhg ",
          "Afasdfasdf ",
          "Retrerte ",
          "bnbssdfgfg ",
          "wrgfhhjg ",
          "jfhtvg ",
          "fhdhfdsads ",
          "Wsdfjfsgs ",
          "adkfdjhtgdg ",
        };

        Random random = new Random();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(directory, config);
        for (int i = 0; i < 110; i++) {
          for (int j = 0; j < 10; j++) {
            String value =
                values[j] + values[random.nextInt(10)] + values[random.nextInt(10)] + values[random.nextInt(10)];

            Document doc = new Document();
            doc.add(new TextField(content, value, Field.Store.NO));
            doc.add(new StringField(name, value, Field.Store.YES));
            doc.add(new SortedDocValuesField(name, new BytesRef(value.toLowerCase()))); // case-insensitive sorting
            writer.addDocument(doc);
          }
        }
        writer.close();
      }

      int maxRows = 100;
      String request = "*:*";
      DirectoryReader reader = DirectoryReader.open(directory);
      IndexSearcher searcher = new IndexSearcher(reader);
      QueryParser parser = new QueryParser(content, analyzer);
      parser.setSplitOnWhitespace(true);
      parser.setAllowLeadingWildcard(false);
      Query query = parser.parse(request);
      Sort sort = new Sort(new SortField(name, Type.STRING, true));
      TopDocs docs = s

[GitHub] [lucene] jpountz commented on a diff in pull request #12064: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)

2023-01-04 Thread GitBox


jpountz commented on code in PR #12064:
URL: https://github.com/apache/lucene/pull/12064#discussion_r1061739690


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene94/Lucene94HnswVectorsReader.java:
##
@@ -233,12 +234,19 @@ public void checkIntegrity() throws IOException {
   @Override
   public VectorValues getVectorValues(String field) throws IOException {
     FieldEntry fieldEntry = fields.get(field);
-    VectorValues values = OffHeapVectorValues.load(fieldEntry, vectorData);
-    if (fieldEntry.vectorEncoding == VectorEncoding.BYTE) {
-      return new ExpandingVectorValues(values);
-    } else {
-      return values;
+    if (fieldEntry.vectorEncoding != VectorEncoding.FLOAT32) {
+      return null;
     }
+    return OffHeapVectorValues.load(fieldEntry, vectorData);
+  }
+
+  @Override
+  public ByteVectorValues getByteVectorValues(String field) throws IOException {
+    FieldEntry fieldEntry = fields.get(field);
+    if (fieldEntry.vectorEncoding != VectorEncoding.BYTE) {
+      return null;
+    }

Review Comment:
   this `if` statement should be unnecessary since CodecReader already checks 
field infos? Can we fail hard if the encoding is not right instead of returning 
null?






[GitHub] [lucene] uschindler merged pull request #12066: Retire/deprecate per-instance MMapDirectory#setUseUnmap

2023-01-04 Thread GitBox


uschindler merged PR #12066:
URL: https://github.com/apache/lucene/pull/12066





[GitHub] [lucene] benwtrent commented on pull request #12064: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)

2023-01-04 Thread GitBox


benwtrent commented on PR #12064:
URL: https://github.com/apache/lucene/pull/12064#issuecomment-1371336800

   @jpountz 
   
   > One thing I'd like to see is whether we can avoid making 
AbstractVectorValues public and duplicate logic for byte and float vector 
values in e.g. VectorScorer and CheckIndex.
   
   I can see about that. The downside is that internally, when writing the 
vectors, having a known `binaryValue` really simplifies things. I will see if I 
can get the same simplifications in another way.





[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902120


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -461,6 +467,126 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState 
mergeState) throws IOE
 }
   }
 
+  private void maybeInitializeFromGraph(
+  HnswGraphBuilder hnswGraphBuilder, MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+int initializerIndex = selectGraphForInitialization(mergeState, fieldInfo);
+if (initializerIndex == -1) {
+  return;
+}
+
+    HnswGraph initializerGraph =
+        getHnswGraphFromReader(fieldInfo.name, mergeState.knnVectorsReaders[initializerIndex]);
+    Map<Integer, Integer> ordinalMapper =
+        getOldToNewOrdinalMap(mergeState, fieldInfo, initializerIndex);
+    hnswGraphBuilder.initializeFromGraph(initializerGraph, ordinalMapper);
+  }
+
+  private int selectGraphForInitialization(MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+// Find the KnnVectorReader with the most docs that meets the following 
criteria:
+//  1. Does not contain any deleted docs
+//  2. Is a Lucene95HnswVectorsReader/PerFieldKnnVectorReader
+// If no readers exist that meet this criteria, return -1. If they do, 
return their index in
+// merge state
+int maxCandidateVectorCount = 0;
+int initializerIndex = -1;
+
+for (int i = 0; i < mergeState.liveDocs.length; i++) {
+  KnnVectorsReader currKnnVectorsReader = mergeState.knnVectorsReaders[i];
+  if (mergeState.knnVectorsReaders[i]
+  instanceof PerFieldKnnVectorsFormat.FieldsReader candidateReader) {
+currKnnVectorsReader = candidateReader.getFieldReader(fieldInfo.name);
+  }
+
+  if (!allMatch(mergeState.liveDocs[i])
+  || !(currKnnVectorsReader instanceof Lucene95HnswVectorsReader 
candidateReader)) {
+continue;
+  }
+
+  VectorValues vectorValues = 
candidateReader.getVectorValues(fieldInfo.name);
+  if (vectorValues == null) {
+continue;
+  }
+
+  int candidateVectorCount = vectorValues.size();
+  if (candidateVectorCount > maxCandidateVectorCount) {
+maxCandidateVectorCount = candidateVectorCount;
+initializerIndex = i;
+  }
+}
+return initializerIndex;
+  }
+
+  private HnswGraph getHnswGraphFromReader(String fieldName, KnnVectorsReader 
knnVectorsReader)
+  throws IOException {
+if (knnVectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader 
perFieldReader
+&& perFieldReader.getFieldReader(fieldName)
+instanceof Lucene95HnswVectorsReader fieldReader) {
+  return fieldReader.getGraph(fieldName);
+}
+
+if (knnVectorsReader instanceof Lucene95HnswVectorsReader) {
+  return ((Lucene95HnswVectorsReader) 
knnVectorsReader).getGraph(fieldName);
+}
+
+throw new IllegalArgumentException(
+"Invalid KnnVectorsReader. Must be of type 
PerFieldKnnVectorsFormat.FieldsReader or Lucene94HnswVectorsReader");

Review Comment:
   Makes sense. Will update.






[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902422


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -461,6 +467,126 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState 
mergeState) throws IOE
 }
   }
 
+  private void maybeInitializeFromGraph(
+  HnswGraphBuilder hnswGraphBuilder, MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+int initializerIndex = selectGraphForInitialization(mergeState, fieldInfo);
+if (initializerIndex == -1) {
+  return;
+}
+
+    HnswGraph initializerGraph =
+        getHnswGraphFromReader(fieldInfo.name, mergeState.knnVectorsReaders[initializerIndex]);
+    Map<Integer, Integer> ordinalMapper =
+        getOldToNewOrdinalMap(mergeState, fieldInfo, initializerIndex);
+    hnswGraphBuilder.initializeFromGraph(initializerGraph, ordinalMapper);
+  }
+
+  private int selectGraphForInitialization(MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+// Find the KnnVectorReader with the most docs that meets the following 
criteria:
+//  1. Does not contain any deleted docs
+//  2. Is a Lucene95HnswVectorsReader/PerFieldKnnVectorReader
+// If no readers exist that meet this criteria, return -1. If they do, 
return their index in
+// merge state
+int maxCandidateVectorCount = 0;
+int initializerIndex = -1;
+
+for (int i = 0; i < mergeState.liveDocs.length; i++) {
+  KnnVectorsReader currKnnVectorsReader = mergeState.knnVectorsReaders[i];
+  if (mergeState.knnVectorsReaders[i]
+  instanceof PerFieldKnnVectorsFormat.FieldsReader candidateReader) {
+currKnnVectorsReader = candidateReader.getFieldReader(fieldInfo.name);
+  }
+
+  if (!allMatch(mergeState.liveDocs[i])
+  || !(currKnnVectorsReader instanceof Lucene95HnswVectorsReader 
candidateReader)) {
+continue;
+  }
+
+  VectorValues vectorValues = 
candidateReader.getVectorValues(fieldInfo.name);
+  if (vectorValues == null) {
+continue;
+  }
+
+  int candidateVectorCount = vectorValues.size();
+  if (candidateVectorCount > maxCandidateVectorCount) {
+maxCandidateVectorCount = candidateVectorCount;
+initializerIndex = i;
+  }
+}
+return initializerIndex;
+  }
+
+  private HnswGraph getHnswGraphFromReader(String fieldName, KnnVectorsReader 
knnVectorsReader)
+  throws IOException {
+if (knnVectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader 
perFieldReader
+&& perFieldReader.getFieldReader(fieldName)
+instanceof Lucene95HnswVectorsReader fieldReader) {
+  return fieldReader.getGraph(fieldName);
+}
+
+if (knnVectorsReader instanceof Lucene95HnswVectorsReader) {
+  return ((Lucene95HnswVectorsReader) 
knnVectorsReader).getGraph(fieldName);
+}
+

Review Comment:
   Good idea, I will add this comment here






[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902971


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -461,6 +467,126 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState 
mergeState) throws IOE
 }
   }
 
+  private void maybeInitializeFromGraph(
+  HnswGraphBuilder hnswGraphBuilder, MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+int initializerIndex = selectGraphForInitialization(mergeState, fieldInfo);
+if (initializerIndex == -1) {
+  return;
+}
+
+    HnswGraph initializerGraph =
+        getHnswGraphFromReader(fieldInfo.name, mergeState.knnVectorsReaders[initializerIndex]);
+    Map<Integer, Integer> ordinalMapper =
+        getOldToNewOrdinalMap(mergeState, fieldInfo, initializerIndex);
+    hnswGraphBuilder.initializeFromGraph(initializerGraph, ordinalMapper);
+  }
+
+  private int selectGraphForInitialization(MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+// Find the KnnVectorReader with the most docs that meets the following 
criteria:
+//  1. Does not contain any deleted docs
+//  2. Is a Lucene95HnswVectorsReader/PerFieldKnnVectorReader
+// If no readers exist that meet this criteria, return -1. If they do, 
return their index in
+// merge state
+int maxCandidateVectorCount = 0;
+int initializerIndex = -1;
+
+for (int i = 0; i < mergeState.liveDocs.length; i++) {
+  KnnVectorsReader currKnnVectorsReader = mergeState.knnVectorsReaders[i];
+  if (mergeState.knnVectorsReaders[i]
+  instanceof PerFieldKnnVectorsFormat.FieldsReader candidateReader) {
+currKnnVectorsReader = candidateReader.getFieldReader(fieldInfo.name);
+  }
+
+  if (!allMatch(mergeState.liveDocs[i])
+  || !(currKnnVectorsReader instanceof Lucene95HnswVectorsReader 
candidateReader)) {
+continue;
+  }
+
+  VectorValues vectorValues = 
candidateReader.getVectorValues(fieldInfo.name);
+  if (vectorValues == null) {
+continue;
+  }
+
+  int candidateVectorCount = vectorValues.size();
+  if (candidateVectorCount > maxCandidateVectorCount) {
+maxCandidateVectorCount = candidateVectorCount;
+initializerIndex = i;
+  }
+}
+return initializerIndex;
+  }
+
+  private HnswGraph getHnswGraphFromReader(String fieldName, KnnVectorsReader 
knnVectorsReader)
+  throws IOException {
+if (knnVectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader 
perFieldReader
+&& perFieldReader.getFieldReader(fieldName)
+instanceof Lucene95HnswVectorsReader fieldReader) {
+  return fieldReader.getGraph(fieldName);
+}
+
+if (knnVectorsReader instanceof Lucene95HnswVectorsReader) {
+  return ((Lucene95HnswVectorsReader) 
knnVectorsReader).getGraph(fieldName);
+}
+
+throw new IllegalArgumentException(
+"Invalid KnnVectorsReader. Must be of type 
PerFieldKnnVectorsFormat.FieldsReader or Lucene94HnswVectorsReader");
+  }
+
+  private Map<Integer, Integer> getOldToNewOrdinalMap(
+      MergeState mergeState, FieldInfo fieldInfo, int initializerIndex) throws IOException {
+    VectorValues initializerVectorValues =
+        mergeState.knnVectorsReaders[initializerIndex].getVectorValues(fieldInfo.name);
+    MergeState.DocMap initializerDocMap = mergeState.docMaps[initializerIndex];
+
+    Map<Integer, Integer> newIdToOldOrdinal = new HashMap<>();
+int oldOrd = 0;
+for (int oldId = initializerVectorValues.nextDoc();
+oldId != NO_MORE_DOCS;
+oldId = initializerVectorValues.nextDoc()) {
+  if (initializerVectorValues.vectorValue() == null) {
+continue;
+  }
+  int newId = initializerDocMap.get(oldId);
+  newIdToOldOrdinal.put(newId, oldOrd);
+  oldOrd++;
+}
+
+    Map<Integer, Integer> oldToNewOrdinalMap = new HashMap<>();
+int newOrd = 0;
+int maxNewDocID = Collections.max(newIdToOldOrdinal.keySet());

Review Comment:
   Good idea, I will update this.






[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061905607


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -143,10 +148,64 @@ public OnHeapHnswGraph build(RandomAccessVectorValues 
vectorsToAdd) throws IOExc
 return hnsw;
   }
 
+  /**
+   * Initializes the graph of this builder. Transfers the nodes and their 
neighbors from the
+   * initializer graph into the graph being produced by this builder, mapping 
ordinals from the
+   * initializer graph to their new ordinals in this builder's graph. The 
builder's graph must be
+   * empty before calling this method.
+   *
+   * @param initializerGraph graph used for initialization
+   * @param oldToNewOrdinalMap map for converting from ordinals in the 
initializerGraph to this
+   * builder's graph
+   */
+  public void initializeFromGraph(
+      HnswGraph initializerGraph, Map<Integer, Integer> oldToNewOrdinalMap) throws IOException {
+assert hnsw.size() == 0;
+float[] vectorValue = null;
+BytesRef binaryValue = null;
+for (int level = 0; level < initializerGraph.numLevels(); level++) {
+  HnswGraph.NodesIterator it = initializerGraph.getNodesOnLevel(level);
+
+  while (it.hasNext()) {
+int oldOrd = it.nextInt();
+int newOrd = oldToNewOrdinalMap.get(oldOrd);
+
+hnsw.addNode(level, newOrd);
+
+if (level == 0) {
+  initializedNodes.add(newOrd);
+}
+
+switch (this.vectorEncoding) {
+  case FLOAT32 -> vectorValue = vectors.vectorValue(newOrd);
+  case BYTE -> binaryValue = vectors.binaryValue(newOrd);
+}
+
+NeighborArray newNeighbors = this.hnsw.getNeighbors(level, newOrd);
+initializerGraph.seek(level, oldOrd);
+for (int oldNeighbor = initializerGraph.nextNeighbor();
+oldNeighbor != NO_MORE_DOCS;
+oldNeighbor = initializerGraph.nextNeighbor()) {
+  int newNeighbor = oldToNewOrdinalMap.get(oldNeighbor);
+  float score =
+  switch (this.vectorEncoding) {
+case FLOAT32 -> this.similarityFunction.compare(

Review Comment:
   Ah, I thought about this, but `HnswGraph` does not guarantee the ordering of neighbors. For example, during writing, the neighbors get sorted to improve compression: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java#L486-L493.
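As an illustration of why sorted neighbor lists compress better (a sketch of the general idea, not Lucene's actual encoder): once ids are sorted, they can be stored as gaps between consecutive ids, and small gaps take fewer variable-length-int bytes than absolute ids.

```java
public class NeighborGaps {
  // Number of bytes a vInt (7 payload bits per byte) encoding of v would take.
  static int vIntBytes(int v) {
    int n = 1;
    while ((v & ~0x7F) != 0) {
      n++;
      v >>>= 7;
    }
    return n;
  }

  public static void main(String[] args) {
    int[] neighbors = {100000, 100003, 100010, 100042}; // already sorted ids
    int raw = 0, delta = 0, prev = 0;
    for (int id : neighbors) {
      raw += vIntBytes(id);           // encoding absolute ids
      delta += vIntBytes(id - prev);  // encoding gaps between consecutive ids
      prev = id;
    }
    System.out.println(raw + " vs " + delta); // 12 vs 6
  }
}
```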






[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061916872


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   I'm not sure about this. For random insertion into the graph, I think a BST would be better.
   
   However, the insertion pattern during merge typically won't be random. It will be more like: first, nodes [15-73] are inserted, then nodes [0-14], and then nodes [74-100]. This assumes that the MergedVectorValues are a concatenation of the vectors of the segments being merged. For this case, I added the small optimization of the "lastAddedPosInLayer" list, which skips the binary search during locally ordered insertion.
   
   That being said, I am not sure the "concatenation" property is guaranteed, specifically when mergeState.needsIndexSort == true. In that case, it seems like the insertion pattern might be more random.
   
   All that said, do you think it would be better to build for the concatenation pattern or for a more random insertion pattern?
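The trade-off described above can be sketched in isolation (illustrative names, not the PR's actual code): a sorted array plus a remembered last insertion position, so that runs of locally ordered inserts, as in the concatenation pattern, skip the binary search, while out-of-order inserts fall back to it and pay the cost of copying the tail of the array.

```java
import java.util.Arrays;

public class SortedNodes {
  private int[] nodes = new int[0];
  private int lastInsertPos = 0; // candidate position for the next insert

  public void add(int node) {
    int pos;
    // Fast path: the node belongs exactly where the previous insert left off.
    boolean ordered =
        (lastInsertPos == 0 || nodes[lastInsertPos - 1] < node)
            && (lastInsertPos == nodes.length || nodes[lastInsertPos] > node);
    if (ordered) {
      pos = lastInsertPos;
    } else {
      int idx = Arrays.binarySearch(nodes, node);
      pos = idx >= 0 ? idx : -idx - 1; // insertion point for absent keys
    }
    // Non-appending inserts copy the tail of the array, the cost discussed above.
    int[] grown = new int[nodes.length + 1];
    System.arraycopy(nodes, 0, grown, 0, pos);
    grown[pos] = node;
    System.arraycopy(nodes, pos, grown, pos + 1, nodes.length - pos);
    nodes = grown;
    lastInsertPos = pos + 1;
  }

  public int[] toArray() {
    return nodes.clone();
  }

  public static void main(String[] args) {
    SortedNodes s = new SortedNodes();
    // Concatenation-like pattern: [15-17], then [0-2], then [18-19].
    for (int n : new int[] {15, 16, 17, 0, 1, 2, 18, 19}) {
      s.add(n);
    }
    System.out.println(Arrays.toString(s.toArray())); // [0, 1, 2, 15, 16, 17, 18, 19]
  }
}
```

A balanced tree would make the out-of-order case cheaper, at the cost of this append-like fast path.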






[GitHub] [lucene] zhaih commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


zhaih commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061955216


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   Good point. For the concatenation pattern it might be better not to use a naive BST, but because in L156 we need to copy the rest of the array again and again whenever the insert is non-appending, I still think some tree-like structure would be better, maybe an RB tree or another balanced tree?
   
   But we can also leave that as a separate issue for now, since it seems like another mid-to-big change, and this one doesn't need to be blocked by it, I guess?






[GitHub] [lucene] zhaih commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


zhaih commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061976538


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -143,10 +148,64 @@ public OnHeapHnswGraph build(RandomAccessVectorValues 
vectorsToAdd) throws IOExc
 return hnsw;
   }
 
+  /**
+   * Initializes the graph of this builder. Transfers the nodes and their 
neighbors from the
+   * initializer graph into the graph being produced by this builder, mapping 
ordinals from the
+   * initializer graph to their new ordinals in this builder's graph. The 
builder's graph must be
+   * empty before calling this method.
+   *
+   * @param initializerGraph graph used for initialization
+   * @param oldToNewOrdinalMap map for converting from ordinals in the 
initializerGraph to this
+   * builder's graph
+   */
+  public void initializeFromGraph(
+      HnswGraph initializerGraph, Map<Integer, Integer> oldToNewOrdinalMap) throws IOException {
+assert hnsw.size() == 0;
+float[] vectorValue = null;
+BytesRef binaryValue = null;
+for (int level = 0; level < initializerGraph.numLevels(); level++) {
+  HnswGraph.NodesIterator it = initializerGraph.getNodesOnLevel(level);
+
+  while (it.hasNext()) {
+int oldOrd = it.nextInt();
+int newOrd = oldToNewOrdinalMap.get(oldOrd);
+
+hnsw.addNode(level, newOrd);
+
+if (level == 0) {
+  initializedNodes.add(newOrd);
+}
+
+switch (this.vectorEncoding) {
+  case FLOAT32 -> vectorValue = vectors.vectorValue(newOrd);
+  case BYTE -> binaryValue = vectors.binaryValue(newOrd);
+}
+
+NeighborArray newNeighbors = this.hnsw.getNeighbors(level, newOrd);
+initializerGraph.seek(level, oldOrd);
+for (int oldNeighbor = initializerGraph.nextNeighbor();
+oldNeighbor != NO_MORE_DOCS;
+oldNeighbor = initializerGraph.nextNeighbor()) {
+  int newNeighbor = oldToNewOrdinalMap.get(oldNeighbor);
+  float score =
+  switch (this.vectorEncoding) {
+case FLOAT32 -> this.similarityFunction.compare(

Review Comment:
   Oh ok, I see. Initially I thought the neighbor ordering was for searching, but it seems that is not the case. Is this sorted order only used to make the diversity calculation easier?
   Anyway, that can be a later topic, and I think we can live with the existing logic for now.
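To make the remapping step discussed in this thread concrete, here is a small standalone sketch (hypothetical class and method names, not Lucene's actual `HnswGraphBuilder` API): each node's ordinal in the initializer graph is translated through an old-to-new map, and its neighbor list is rewritten through the same map.

```java
import java.util.Map;

// Hypothetical illustration of rewriting a graph's adjacency lists through an
// old-ordinal -> new-ordinal map, as initializeFromGraph does level by level.
// oldNeighbors[oldOrd] holds the neighbor ordinals of node oldOrd.
class OrdinalRemap {
    static int[][] remapNeighbors(int[][] oldNeighbors, Map<Integer, Integer> oldToNew) {
        int[][] result = new int[oldNeighbors.length][];
        for (int oldOrd = 0; oldOrd < oldNeighbors.length; oldOrd++) {
            int newOrd = oldToNew.get(oldOrd); // the node keeps its edges under a new ordinal
            int[] mapped = new int[oldNeighbors[oldOrd].length];
            for (int i = 0; i < mapped.length; i++) {
                // every edge endpoint must be remapped as well
                mapped[i] = oldToNew.get(oldNeighbors[oldOrd][i]);
            }
            result[newOrd] = mapped;
        }
        return result;
    }
}
```

The map is total over the initializer graph's ordinals, which is why the real method can assert the target graph is empty before copying.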






[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061991840


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -143,10 +148,64 @@ public OnHeapHnswGraph build(RandomAccessVectorValues vectorsToAdd) throws IOExc
     return hnsw;
   }
 
+  /**
+   * Initializes the graph of this builder. Transfers the nodes and their neighbors from the
+   * initializer graph into the graph being produced by this builder, mapping ordinals from the
+   * initializer graph to their new ordinals in this builder's graph. The builder's graph must be
+   * empty before calling this method.
+   *
+   * @param initializerGraph graph used for initialization
+   * @param oldToNewOrdinalMap map for converting from ordinals in the initializerGraph to this
+   *     builder's graph
+   */
+  public void initializeFromGraph(
+      HnswGraph initializerGraph, Map<Integer, Integer> oldToNewOrdinalMap) throws IOException {
+    assert hnsw.size() == 0;
+    float[] vectorValue = null;
+    BytesRef binaryValue = null;
+    for (int level = 0; level < initializerGraph.numLevels(); level++) {
+      HnswGraph.NodesIterator it = initializerGraph.getNodesOnLevel(level);
+
+      while (it.hasNext()) {
+        int oldOrd = it.nextInt();
+        int newOrd = oldToNewOrdinalMap.get(oldOrd);
+
+        hnsw.addNode(level, newOrd);
+
+        if (level == 0) {
+          initializedNodes.add(newOrd);
+        }
+
+        switch (this.vectorEncoding) {
+          case FLOAT32 -> vectorValue = vectors.vectorValue(newOrd);
+          case BYTE -> binaryValue = vectors.binaryValue(newOrd);
+        }
+
+        NeighborArray newNeighbors = this.hnsw.getNeighbors(level, newOrd);
+        initializerGraph.seek(level, oldOrd);
+        for (int oldNeighbor = initializerGraph.nextNeighbor();
+            oldNeighbor != NO_MORE_DOCS;
+            oldNeighbor = initializerGraph.nextNeighbor()) {
+          int newNeighbor = oldToNewOrdinalMap.get(oldNeighbor);
+          float score =
+              switch (this.vectorEncoding) {
+                case FLOAT32 -> this.similarityFunction.compare(

Review Comment:
   > Is this sorted order only used to make the diversity calculation easier?
   
   Yes, I think you are correct. The reason for sorting neighbors by distance during construction is that the neighbor arrays of an inserted node continue to get updated as more nodes are inserted. So keeping them sorted allows the worst node or nodes to be identified more easily.
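A toy sketch of that invariant (a hypothetical class, not Lucene's actual `NeighborArray`): if a node's neighbors are kept in descending score order, the worst-scoring neighbor is always at the tail, so the eviction candidate is found without scanning the whole array when the list is full.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a score-sorted neighbor list. Entries are kept in
// descending score order, so the worst neighbor is always the last element.
class SortedNeighbors {
    private final int maxSize;
    private final List<float[]> entries = new ArrayList<>(); // each entry: {node, score}

    SortedNeighbors(int maxSize) {
        this.maxSize = maxSize;
    }

    void insert(int node, float score) {
        // Binary search for the insertion point that preserves descending score order.
        int lo = 0, hi = entries.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (entries.get(mid)[1] >= score) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        entries.add(lo, new float[] {node, score});
        if (entries.size() > maxSize) {
            entries.remove(entries.size() - 1); // evict the worst-scoring neighbor
        }
    }

    int worstNode() {
        return (int) entries.get(entries.size() - 1)[0];
    }

    int size() {
        return entries.size();
    }
}
```

With an unsorted array, each update would need a linear scan to find the worst neighbor; the sorted invariant trades a shifting insert for cheap identification of eviction candidates.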


