[GitHub] [lucene] uschindler commented on pull request #12066: Retire/deprecate per-instance MMapDirectory#setUseUnmap
uschindler commented on PR #12066: URL: https://github.com/apache/lucene/pull/12066#issuecomment-1370980047

I cleaned up the code more:
- removed the useless doPrivileged block around the actual unmapping; we just call a MethodHandle, and the Java 9+ version has no security checks
- put the whole `MMapDirectory` clinit into one doPrivileged block
- added more info to log messages, especially when permissions are missing, and logged how to disable MemorySegments
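For readers following along, here is a rough sketch of the pattern described above. It assumes the well-known sun.misc.Unsafe#invokeCleaner approach for Java 9+; the class and method names are illustrative, not Lucene's actual code:

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.security.AccessController;
import java.security.PrivilegedAction;

final class UnmapSketch {
  // The whole lookup (the "clinit" part) runs inside a single doPrivileged
  // block; invoking the resulting handle later needs no privileges on Java 9+.
  private static final MethodHandle UNMAPPER =
      AccessController.doPrivileged((PrivilegedAction<MethodHandle>) UnmapSketch::lookupUnmapper);

  private static MethodHandle lookupUnmapper() {
    try {
      Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
      // public method on Unsafe since JDK 9: void invokeCleaner(ByteBuffer)
      MethodHandle invokeCleaner =
          MethodHandles.lookup()
              .findVirtual(unsafeClass, "invokeCleaner",
                  MethodType.methodType(void.class, ByteBuffer.class));
      Field f = unsafeClass.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return invokeCleaner.bindTo(f.get(null));
    } catch (ReflectiveOperationException | RuntimeException e) {
      throw new AssertionError("unmapping not supported on this JVM", e);
    }
  }

  static void unmap(ByteBuffer mapped) throws Throwable {
    // a plain MethodHandle call: no doPrivileged wrapper around the unmapping itself
    UNMAPPER.invokeExact(mapped);
  }
}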
[GitHub] [lucene] ErikPelli commented on a diff in pull request #12034: Avoid possible NullPointerException in BlendedTermQuery
ErikPelli commented on code in PR #12034: URL: https://github.com/apache/lucene/pull/12034#discussion_r1061649853

## lucene/core/src/java/org/apache/lucene/search/BlendedTermQuery.java ##

@@ -314,19 +314,15 @@ private static TermStates adjustFrequencies(
       IndexReaderContext readerContext, TermStates ctx, int artificialDf, long artificialTtf)
       throws IOException {
     List<LeafReaderContext> leaves = readerContext.leaves();
-    final int len;
-    if (leaves == null) {
-      len = 1;
-    } else {
-      len = leaves.size();
-    }
     TermStates newCtx = new TermStates(readerContext);
-    for (int i = 0; i < len; ++i) {
-      TermState termState = ctx.get(leaves.get(i));
-      if (termState == null) {
-        continue;
+    if (leaves != null) {

Review Comment: OK, I missed that. In fact `IndexReaderContext` is an abstract class, and the classes that extend it (`CompositeReaderContext` and `LeafReaderContext`) return either a valid List or throw an exception. So modifying the doc is totally fine, but the null check on the result of the `IndexReaderContext#leaves()` call also needs to be removed, because that case can never happen.
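For clarity, a sketch of what adjustFrequencies might look like once the null check is dropped, based on the hunk quoted above. The register call reflects the TermStates API as the editor understands it, so treat it as illustrative:

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexReaderContext;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.TermState;
import org.apache.lucene.index.TermStates;

// Sketch only: the surrounding BlendedTermQuery class is assumed.
private static TermStates adjustFrequencies(
    IndexReaderContext readerContext, TermStates ctx, int artificialDf, long artificialTtf)
    throws IOException {
  // leaves() never returns null (see the discussion above), so no null check is needed
  List<LeafReaderContext> leaves = readerContext.leaves();
  TermStates newCtx = new TermStates(readerContext);
  for (int i = 0; i < leaves.size(); ++i) {
    TermState termState = ctx.get(leaves.get(i));
    if (termState == null) {
      continue; // this term does not occur in this segment
    }
    newCtx.register(termState, i, artificialDf, artificialTtf);
  }
  return newCtx;
}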
[GitHub] [lucene] ErikPelli commented on a diff in pull request #12034: Remove null check in IndexReaderContext#leaves() usages
ErikPelli commented on code in PR #12034: URL: https://github.com/apache/lucene/pull/12034#discussion_r1061673383

## lucene/core/src/java/org/apache/lucene/search/BlendedTermQuery.java ##
(quotes the same adjustFrequencies hunk as the previous comment)

Review Comment: Should I still add an entry in the CHANGES.txt file after this change?
[GitHub] [lucene] jpountz commented on pull request #12054: Introduce a new `KeywordField`.
jpountz commented on PR #12054: URL: https://github.com/apache/lucene/pull/12054#issuecomment-1371192608

Thanks for taking a look, @gsmiller. The failing test is due to this PR relying on behavior introduced by https://github.com/apache/lucene/pull/12053. I'll rebase when that other PR is merged.
[GitHub] [lucene] jpountz commented on pull request #12053: Allow reusing indexed binary fields.
jpountz commented on PR #12053: URL: https://github.com/apache/lucene/pull/12053#issuecomment-1371197803

I pushed a new commit that also disallows term vector offsets on binary fields.
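The commit itself is not quoted here. As a hypothetical illustration only, the kind of validation being described might look like the fragment below; the field and variable names are invented, not the PR's code:

// Hypothetical sketch: reject term vector offsets for binary field values.
if (fieldType.storeTermVectorOffsets() && binaryValue != null) {
  throw new IllegalArgumentException(
      "field \"" + name + "\": cannot index term vector offsets on binary fields");
}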
[GitHub] [lucene] vstrout opened a new issue, #12067: Getting exception on search after upgrading to Lucene 9.4
vstrout opened a new issue, #12067: URL: https://github.com/apache/lucene/issues/12067

### Description

After upgrading from Lucene 9.3.0 to Lucene 9.4.2, searching the index with sorting by description throws the following exception:

Caused by: java.lang.IllegalStateException: Term [77 73 64 66 6a 66 73 67 73 20 61 64 6b 66 64 6a 68 74 67 64 67 20 61 64 6b 66 64 6a 68 74 67 64 67 20 72 65 74 72 65 72 74 65 20] exists in doc values but not in the terms index
    at org.apache.lucene.search.comparators.TermOrdValComparator$CompetitiveIterator.init(TermOrdValComparator.java:582)
    at org.apache.lucene.search.comparators.TermOrdValComparator$CompetitiveIterator.update(TermOrdValComparator.java:553)
    at org.apache.lucene.search.comparators.TermOrdValComparator$TermOrdValLeafComparator.updateCompetitiveIterator(TermOrdValComparator.java:457)
    at org.apache.lucene.search.comparators.TermOrdValComparator$TermOrdValLeafComparator.setHitsThresholdReached(TermOrdValComparator.java:284)
    at org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector.countHit(TopFieldCollector.java:86)
    at org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector$1.collect(TopFieldCollector.java:202)
    at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:305)
    at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:247)
    at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:744)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:662)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:656)
    at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:636)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:553)

I wrote a small Java function that reproduces the issue. Increasing the maxRows value there to above 200 somehow resolves the problem.

// Imports added here for completeness; the original report starts at test().
import java.io.File;
import java.util.Random;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortField.Type;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.BytesRef;

public static void test() throws Exception {
  String name = "description";
  String content = "content";
  File dir = new File("C:\\test");
  MMapDirectory directory = new MMapDirectory(dir.toPath());
  Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
  boolean createIndex = true; // must be changed to false after calling this function the first time
  if (createIndex) {
    String[] values = {
      "anshduvfv ", "dhisdefihfhg ", "Afasdfasdf ", "Retrerte ", "bnbssdfgfg ",
      "wrgfhhjg ", "jfhtvg ", "fhdhfdsads ", "Wsdfjfsgs ", "adkfdjhtgdg ",
    };
    Random random = new Random();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    config.setOpenMode(OpenMode.CREATE_OR_APPEND);
    IndexWriter writer = new IndexWriter(directory, config);
    for (int i = 0; i < 110; i++) {
      for (int j = 0; j < 10; j++) {
        String value =
            values[j] + values[random.nextInt(10)] + values[random.nextInt(10)] + values[random.nextInt(10)];
        Document doc = new Document();
        doc.add(new TextField(content, value, Field.Store.NO));
        doc.add(new StringField(name, value, Field.Store.YES));
        doc.add(new SortedDocValuesField(name, new BytesRef(value.toLowerCase()))); // case-insensitive sorting
        writer.addDocument(doc);
      }
    }
    writer.close();
  }
  int maxRows = 100;
  String request = "*:*";
  DirectoryReader reader = DirectoryReader.open(directory);
  IndexSearcher searcher = new IndexSearcher(reader);
  QueryParser parser = new QueryParser(content, analyzer);
  parser.setSplitOnWhitespace(true);
  parser.setAllowLeadingWildcard(false);
  Query query = parser.parse(request);
  Sort sort = new Sort(new SortField(name, Type.STRING, true));
  TopDocs docs = searcher.search(query, maxRows, sort);
  // (the report is truncated here in the archive; the search call above is a
  // reconstruction from the variables defined and the stack trace)
}
[GitHub] [lucene] jpountz commented on a diff in pull request #12064: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)
jpountz commented on code in PR #12064: URL: https://github.com/apache/lucene/pull/12064#discussion_r1061739690

## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene94/Lucene94HnswVectorsReader.java ##

@@ -233,12 +234,19 @@ public void checkIntegrity() throws IOException {
   @Override
   public VectorValues getVectorValues(String field) throws IOException {
     FieldEntry fieldEntry = fields.get(field);
-    VectorValues values = OffHeapVectorValues.load(fieldEntry, vectorData);
-    if (fieldEntry.vectorEncoding == VectorEncoding.BYTE) {
-      return new ExpandingVectorValues(values);
-    } else {
-      return values;
+    if (fieldEntry.vectorEncoding != VectorEncoding.FLOAT32) {
+      return null;
     }
+    return OffHeapVectorValues.load(fieldEntry, vectorData);
   }
+
+  @Override
+  public ByteVectorValues getByteVectorValues(String field) throws IOException {
+    FieldEntry fieldEntry = fields.get(field);
+    if (fieldEntry.vectorEncoding != VectorEncoding.BYTE) {
+      return null;
+    }

Review Comment: this `if` statement should be unnecessary since CodecReader already checks field infos? Can we fail hard if the encoding is not right instead of returning null?
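A sketch of the fail-hard variant the reviewer is suggesting, reusing the names from the quoted diff; this is an illustration, not the PR's final code:

// Sketch of the suggested change inside getByteVectorValues:
FieldEntry fieldEntry = fields.get(field);
if (fieldEntry.vectorEncoding != VectorEncoding.BYTE) {
  // fail hard instead of returning null, so misuse surfaces immediately
  throw new IllegalArgumentException(
      "field=\"" + field + "\" is encoded as " + fieldEntry.vectorEncoding
          + "; expected " + VectorEncoding.BYTE);
}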
[GitHub] [lucene] uschindler merged pull request #12066: Retire/deprecate per-instance MMapDirectory#setUseUnmap
uschindler merged PR #12066: URL: https://github.com/apache/lucene/pull/12066
[GitHub] [lucene] benwtrent commented on pull request #12064: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)
benwtrent commented on PR #12064: URL: https://github.com/apache/lucene/pull/12064#issuecomment-1371336800

@jpountz

> One thing I'd like to see is whether we can avoid making AbstractVectorValues public and duplicate logic for byte and float vector values in e.g. VectorScorer and CheckIndex.

I can look into that. The downside is that internally, when writing the vectors, having a known `binaryValue` really simplifies things. I will see if I can get the same simplifications in another way.
[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
jmazanec15 commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902120

## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java ##

@@ -461,6 +467,126 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOE
     }
   }

+  private void maybeInitializeFromGraph(
+      HnswGraphBuilder hnswGraphBuilder, MergeState mergeState, FieldInfo fieldInfo)
+      throws IOException {
+    int initializerIndex = selectGraphForInitialization(mergeState, fieldInfo);
+    if (initializerIndex == -1) {
+      return;
+    }
+
+    HnswGraph initializerGraph =
+        getHnswGraphFromReader(fieldInfo.name, mergeState.knnVectorsReaders[initializerIndex]);
+    Map<Integer, Integer> ordinalMapper =
+        getOldToNewOrdinalMap(mergeState, fieldInfo, initializerIndex);
+    hnswGraphBuilder.initializeFromGraph(initializerGraph, ordinalMapper);
+  }
+
+  private int selectGraphForInitialization(MergeState mergeState, FieldInfo fieldInfo)
+      throws IOException {
+    // Find the KnnVectorReader with the most docs that meets the following criteria:
+    //  1. Does not contain any deleted docs
+    //  2. Is a Lucene95HnswVectorsReader/PerFieldKnnVectorReader
+    // If no readers exist that meet this criteria, return -1. If they do, return their index in
+    // merge state
+    int maxCandidateVectorCount = 0;
+    int initializerIndex = -1;
+
+    for (int i = 0; i < mergeState.liveDocs.length; i++) {
+      KnnVectorsReader currKnnVectorsReader = mergeState.knnVectorsReaders[i];
+      if (mergeState.knnVectorsReaders[i]
+          instanceof PerFieldKnnVectorsFormat.FieldsReader candidateReader) {
+        currKnnVectorsReader = candidateReader.getFieldReader(fieldInfo.name);
+      }
+
+      if (!allMatch(mergeState.liveDocs[i])
+          || !(currKnnVectorsReader instanceof Lucene95HnswVectorsReader candidateReader)) {
+        continue;
+      }
+
+      VectorValues vectorValues = candidateReader.getVectorValues(fieldInfo.name);
+      if (vectorValues == null) {
+        continue;
+      }
+
+      int candidateVectorCount = vectorValues.size();
+      if (candidateVectorCount > maxCandidateVectorCount) {
+        maxCandidateVectorCount = candidateVectorCount;
+        initializerIndex = i;
+      }
+    }
+    return initializerIndex;
+  }
+
+  private HnswGraph getHnswGraphFromReader(String fieldName, KnnVectorsReader knnVectorsReader)
+      throws IOException {
+    if (knnVectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader perFieldReader
+        && perFieldReader.getFieldReader(fieldName)
+            instanceof Lucene95HnswVectorsReader fieldReader) {
+      return fieldReader.getGraph(fieldName);
+    }
+
+    if (knnVectorsReader instanceof Lucene95HnswVectorsReader) {
+      return ((Lucene95HnswVectorsReader) knnVectorsReader).getGraph(fieldName);
+    }
+
+    throw new IllegalArgumentException(
+        "Invalid KnnVectorsReader. Must be of type PerFieldKnnVectorsFormat.FieldsReader or Lucene94HnswVectorsReader");

Review Comment: Makes sense. Will update.
[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
jmazanec15 commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902422

## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java ##
(quotes the same maybeInitializeFromGraph/selectGraphForInitialization/getHnswGraphFromReader hunk as the previous comment)

Review Comment: Good idea, I will add this comment here.
[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
jmazanec15 commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902971

## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java ##
(quotes the same hunk as the previous comments, continuing into getOldToNewOrdinalMap:)

+  private Map<Integer, Integer> getOldToNewOrdinalMap(
+      MergeState mergeState, FieldInfo fieldInfo, int initializerIndex) throws IOException {
+    VectorValues initializerVectorValues =
+        mergeState.knnVectorsReaders[initializerIndex].getVectorValues(fieldInfo.name);
+    MergeState.DocMap initializerDocMap = mergeState.docMaps[initializerIndex];
+
+    Map<Integer, Integer> newIdToOldOrdinal = new HashMap<>();
+    int oldOrd = 0;
+    for (int oldId = initializerVectorValues.nextDoc();
+        oldId != NO_MORE_DOCS;
+        oldId = initializerVectorValues.nextDoc()) {
+      if (initializerVectorValues.vectorValue() == null) {
+        continue;
+      }
+      int newId = initializerDocMap.get(oldId);
+      newIdToOldOrdinal.put(newId, oldOrd);
+      oldOrd++;
+    }
+
+    Map<Integer, Integer> oldToNewOrdinalMap = new HashMap<>();
+    int newOrd = 0;
+    int maxNewDocID = Collections.max(newIdToOldOrdinal.keySet());

Review Comment: Good idea, I will update this.
[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
jmazanec15 commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1061905607

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java ##

@@ -143,10 +148,64 @@ public OnHeapHnswGraph build(RandomAccessVectorValues vectorsToAdd) throws IOExc
     return hnsw;
   }

+  /**
+   * Initializes the graph of this builder. Transfers the nodes and their neighbors from the
+   * initializer graph into the graph being produced by this builder, mapping ordinals from the
+   * initializer graph to their new ordinals in this builder's graph. The builder's graph must be
+   * empty before calling this method.
+   *
+   * @param initializerGraph graph used for initialization
+   * @param oldToNewOrdinalMap map for converting from ordinals in the initializerGraph to this
+   *     builder's graph
+   */
+  public void initializeFromGraph(
+      HnswGraph initializerGraph, Map<Integer, Integer> oldToNewOrdinalMap) throws IOException {
+    assert hnsw.size() == 0;
+    float[] vectorValue = null;
+    BytesRef binaryValue = null;
+    for (int level = 0; level < initializerGraph.numLevels(); level++) {
+      HnswGraph.NodesIterator it = initializerGraph.getNodesOnLevel(level);
+
+      while (it.hasNext()) {
+        int oldOrd = it.nextInt();
+        int newOrd = oldToNewOrdinalMap.get(oldOrd);
+
+        hnsw.addNode(level, newOrd);
+
+        if (level == 0) {
+          initializedNodes.add(newOrd);
+        }
+
+        switch (this.vectorEncoding) {
+          case FLOAT32 -> vectorValue = vectors.vectorValue(newOrd);
+          case BYTE -> binaryValue = vectors.binaryValue(newOrd);
+        }
+
+        NeighborArray newNeighbors = this.hnsw.getNeighbors(level, newOrd);
+        initializerGraph.seek(level, oldOrd);
+        for (int oldNeighbor = initializerGraph.nextNeighbor();
+            oldNeighbor != NO_MORE_DOCS;
+            oldNeighbor = initializerGraph.nextNeighbor()) {
+          int newNeighbor = oldToNewOrdinalMap.get(oldNeighbor);
+          float score =
+              switch (this.vectorEncoding) {
+                case FLOAT32 -> this.similarityFunction.compare(

Review Comment: Ah, I thought about this, but HnswGraph does not guarantee the ordering of neighbors. For example, during writing the neighbors get sorted to improve compression: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java#L486-L493
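To illustrate the compression point, a toy example rather than the linked writer code: once a node's neighbor ordinals are sorted, the gaps between consecutive values are small and encode compactly as vints.

import java.util.Arrays;

public class DeltaSketch {
  public static void main(String[] args) {
    int[] neighbors = {90, 12, 57, 13};
    Arrays.sort(neighbors);                   // -> 12, 13, 57, 90
    int previous = 0;
    for (int n : neighbors) {
      System.out.print((n - previous) + " "); // 12 1 44 33: small, vint-friendly
      previous = n;
    }
  }
}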
[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
jmazanec15 commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1061916872

## lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java ##

@@ -94,36 +93,83 @@ public int size() {
   }

   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it requires that the nodes

Review Comment: I'm not sure about this. For random insertion into the graph, I think a BST would be better. However, the insertion pattern for a merge typically won't be random. It will be more like: first, nodes [15-73] are inserted, then nodes [0-14], and then nodes [74-100]. This assumes that the MergedVectorValues are a concatenation of the vectors of the segments being merged. For this, I added the small optimization of the "lastAddedPosInLayer" list, which skips the binary search during locally ordered insertion (see the sketch below). That being said, I am not sure the "concatenation" property is guaranteed, specifically when mergeState.needsIndexSort == true. In that case, the insertion pattern might be more random. All that said, do you think it would be better to build for the concatenation pattern or for a more random insertion pattern?
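A sketch of the mostly-ordered insertion path described above; the lastAddedPosInLayer bookkeeping is elided, and the array and size parameters are assumptions mirroring the on-heap graph rather than its actual fields:

import java.util.Arrays;

// Insert `node` into the sorted prefix nodes[0..size), appending in the
// common locally-ordered case and binary-searching otherwise.
static int insertSorted(int[] nodes, int size, int node) {
  int insertAt;
  if (size == 0 || node > nodes[size - 1]) {
    insertAt = size; // fast path: merge segments arrive mostly in ascending order
  } else {
    int idx = Arrays.binarySearch(nodes, 0, size, node);
    assert idx < 0 : "node already present";
    insertAt = -idx - 1; // binarySearch returns -(insertion point) - 1 when absent
  }
  // shift the tail right by one; this repeated copying is what motivates the
  // tree-structure discussion in the next comment
  System.arraycopy(nodes, insertAt, nodes, insertAt + 1, size - insertAt);
  nodes[insertAt] = node;
  return size + 1;
}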
[GitHub] [lucene] zhaih commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
zhaih commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1061955216

## lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java ##
(quotes the same addNode hunk as the previous comment)

Review Comment: Good point; for the concatenation pattern it might be better not to use a naive BST. But because in L156 we need to copy the rest of the array again and again whenever the insert is non-appending, I think some tree-like structure would still be better, maybe an RB tree or another balanced tree (a toy comparison follows below). We can also leave it as a separate issue for now, since it seems like another mid-to-big change and this one need not be blocked by it, I guess.
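For illustration, a toy look at the balanced-tree alternative; Java's TreeSet is a red-black tree, and whether its constant factors pay off for HNSW levels is exactly the open question here. This is not a drop-in replacement for the graph's arrays:

import java.util.TreeSet;

public class TreeLevelSketch {
  public static void main(String[] args) {
    TreeSet<Integer> nodesOnLevel = new TreeSet<>();
    nodesOnLevel.add(15);
    nodesOnLevel.add(74);
    nodesOnLevel.add(0); // non-appending insert: O(log n), no array tail to copy
    System.out.println(nodesOnLevel); // [0, 15, 74] -- iteration stays ascending
  }
}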
[GitHub] [lucene] zhaih commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
zhaih commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1061976538

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java ##
(quotes the same initializeFromGraph hunk as the earlier comment on this file)

Review Comment: Oh OK, I see. Initially I thought the neighbor ordering was for searching, but that seems not to be the case. Is this sorted order only used to make the diversity calculation easier? Anyway, that can be a later topic, and I think we can live with the existing logic for now.
[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for initialization during merge
jmazanec15 commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1061991840

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java ##
(quotes the same initializeFromGraph hunk as the earlier comment on this file)

Review Comment:

> Is this sorted order only used for calculating diversity easier?

Yes, I think you are correct. The reason for sorting by distance during construction is that the neighbor arrays of an inserted node continue to get updated as more nodes are inserted. Keeping the array sorted allows the worst node or nodes to be identified more easily (see the sketch below).
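A toy sketch of that payoff; this mirrors the idea, not Lucene's NeighborArray. With (score, node) pairs kept sorted from best to worst score, the connection to drop when the list overflows is always the last element:

final class SortedNeighbors {
  private final float[] scores;
  private final int[] nodes;
  private int size;

  SortedNeighbors(int maxConn) {
    // one slot of headroom so an insert can complete before eviction
    scores = new float[maxConn + 1];
    nodes = new int[maxConn + 1];
  }

  void add(float score, int node) {
    int i = size - 1;
    // shift worse-scoring entries right to keep descending score order
    while (i >= 0 && scores[i] < score) {
      scores[i + 1] = scores[i];
      nodes[i + 1] = nodes[i];
      i--;
    }
    scores[i + 1] = score;
    nodes[i + 1] = node;
    size++;
    if (size > scores.length - 1) {
      size--; // evicting the worst neighbor is O(1): it is the last element
    }
  }
}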