Re: [I] Potential resource leakage in WordDictionary#loadMainDataFromFile [lucene]
xcx1r3 commented on issue #14719: URL: https://github.com/apache/lucene/issues/14719#issuecomment-2911474550

If an exception occurs, the close() statement will not be executed, leading to a potential resource leak.

```
private int loadMainDataFromFile(String dctFilePath) throws IOException {
  int i, cnt, length, total = 0;
  // The file only counted 6763 Chinese characters plus 5 reserved slots 3756~3760.
  // The 3756th is used (as a header) to store information.
  int[] buffer = new int[3];
  byte[] intBuffer = new byte[4];
  String tmpword;
  DataInputStream dctFile = new DataInputStream(Files.newInputStream(Paths.get(dctFilePath)));
  dctFile.close();
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use read advice consistently in the knn vector formats [lucene]
jimczi closed pull request #14076: Use read advice consistently in the knn vector formats URL: https://github.com/apache/lucene/pull/14076
Re: [PR] Update ruff rule PATH103 to enforce modern os.makedirs usage [lucene]
rmuir merged PR #14710: URL: https://github.com/apache/lucene/pull/14710
Re: [PR] Cache high-order bits of hashcode to speed up BytesRefHash [lucene]
github-actions[bot] commented on PR #14720: URL: https://github.com/apache/lucene/pull/14720#issuecomment-2912485390 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
[PR] Cache high-order bits of hashcode to speed up BytesRefHash [lucene]
bugmakerr opened a new pull request, #14720: URL: https://github.com/apache/lucene/pull/14720 ### Description This PR tries to utilize the unused part of the id to cache the high-order bits of the hashcode to speed up `BytesRefHash`. I used 1 million 16-byte UUIDs to [benchmark this change](https://github.com/bugmakerr/lucene/commit/43d2945be75acb2464c36ca1eac6067445687fe2), and the results are as follows. The `baselineXXX` version is the current implementation, the `cachedXXX` version uses a separate array of ints to cache hash codes, and the candidate version is the implementation of this PR.
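The idea in this description can be illustrated with a minimal sketch. This is not the actual `BytesRefHash` code or the PR's bit layout: the class name `PackedIdTable`, the 20-bit id width, linear probing, and the use of `Arrays.hashCode` are all assumptions made for illustration. Each `int` slot stores the entry id in its low bits and otherwise unused high-order hash bits in the remaining bits, so most non-matching probes are rejected by an integer comparison before any key bytes are read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of caching hash bits in unused id bits; not Lucene's BytesRefHash. */
final class PackedIdTable {
  private static final int ID_BITS = 20; // assumption: ids fit in the low 20 bits
  private static final int ID_MASK = (1 << ID_BITS) - 1;
  private static final int HASH_MASK = ~ID_MASK; // the 12 high bits hold cached hash bits
  private static final int EMPTY = -1;

  private final int[] slots;
  private final List<byte[]> keys = new ArrayList<>();

  /** Number of full byte-by-byte key comparisons performed so far. */
  int keyComparisons;

  PackedIdTable(int powerOfTwoCapacity) {
    if (powerOfTwoCapacity <= 0 || Integer.bitCount(powerOfTwoCapacity) != 1) {
      throw new IllegalArgumentException("capacity must be a positive power of two");
    }
    slots = new int[powerOfTwoCapacity];
    Arrays.fill(slots, EMPTY);
  }

  int size() {
    return keys.size();
  }

  /** Returns the id of {@code key}, inserting it first if it is not present. */
  int add(byte[] key) {
    int hash = Arrays.hashCode(key);
    int cached = hash & HASH_MASK; // high-order bits of the hash code
    int mask = slots.length - 1;
    int pos = hash & mask; // the slot index uses the low-order bits
    while (slots[pos] != EMPTY) {
      int slot = slots[pos];
      // Cheap int comparison first; read the key bytes only when the cached bits agree.
      if ((slot & HASH_MASK) == cached) {
        keyComparisons++;
        int id = slot & ID_MASK;
        if (Arrays.equals(keys.get(id), key)) {
          return id;
        }
      }
      pos = (pos + 1) & mask; // linear probing
    }
    int id = keys.size();
    // Keep one slot free so probing terminates, and keep ids below ID_MASK so
    // that a packed slot can never equal EMPTY (-1).
    if (id >= slots.length - 1 || id >= ID_MASK) {
      throw new IllegalStateException("table full; this sketch does not resize");
    }
    keys.add(key.clone());
    slots[pos] = cached | id;
    return id;
  }
}
```

In this sketch `keyComparisons` only increases when the cached bits of a probed slot match, which is the kind of saving the PR is after without the separate hash-code array of the `cachedXXX` variant.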
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2109429583

## gradle/testing/defaults-tests.gradle: ##
@@ -145,6 +145,7 @@ allprojects {
   ':lucene:core',
   ':lucene:codecs',
   ":lucene:distribution.tests",
+  ':lucene:sandbox',

Review Comment:
This line allows the sandbox module to call native libraries from tests (i.e. `--enable-native-access`), but tests were still being run earlier..
Re: [PR] Reduce NeighborArray heap memory [lucene]
benwtrent merged PR #14527: URL: https://github.com/apache/lucene/pull/14527
Re: [I] Nightly benchmark regression on 2025.05.01 [lucene]
jpountz commented on issue #14630: URL: https://github.com/apache/lucene/issues/14630#issuecomment-2913813621 It looks like nightly benchmarks only run every 2 days since May 13th, vs. every day before that. Is this because it now takes longer to run the benchmark?
Re: [I] Potential resource leakage in WordDictionary#loadMainDataFromFile [lucene]
jpountz commented on issue #14719: URL: https://github.com/apache/lucene/issues/14719#issuecomment-2913817299 Good catch, would you like to submit a PR?
Re: [PR] Only run the labeller on the main branch of the lucene repository [lucene]
github-actions[bot] commented on PR #14721: URL: https://github.com/apache/lucene/pull/14721#issuecomment-2913824556 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
Re: [PR] Fix comment above OnHeapHnswGraph#getNeighbors. [lucene]
msokolov commented on PR #14713: URL: https://github.com/apache/lucene/pull/14713#issuecomment-2913825952 Thanks @vsop-479 !
Re: [PR] Fix comment above OnHeapHnswGraph#getNeighbors. [lucene]
msokolov merged PR #14713: URL: https://github.com/apache/lucene/pull/14713
[PR] Only run the labeller on the main branch of the lucene repository [lucene]
dweiss opened a new pull request, #14721: URL: https://github.com/apache/lucene/pull/14721 This prevents this action from running on PRs against forks, which I couldn't get to work (missing permissions for some reason).
Re: [PR] Reduce NeighborArray heap memory [lucene]
benwtrent commented on code in PR #14527: URL: https://github.com/apache/lucene/pull/14527#discussion_r2109471013

## .gitignore: ##
@@ -32,3 +32,10 @@ __pycache__
 # SDKMAN
 .sdkmanrc
+
+# Java class files
+*.class
+
+# Ignore bin directories
+bin/
+**/bin/

Review Comment:
If you think these need to be updated, could you do it in a separate PR? I would like to keep this change restricted to HNSW.
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2109488907

## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsReader.java: ##
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.faiss;
+
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_CODEC_NAME;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_EXTENSION;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_CODEC_NAME;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_EXTENSION;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_CURRENT;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_START;
+import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexRead;
+import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexSearch;
+
+import java.io.IOException;
+import java.lang.foreign.Arena;
+import java.lang.foreign.MemorySegment;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.hnsw.FlatVectorsReader;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.KnnCollector;
+import org.apache.lucene.store.DataAccessHint;
+import org.apache.lucene.store.FileTypeHint;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Read per-segment Faiss indexes and associated metadata.
+ *
+ * @lucene.experimental
+ */
+final class FaissKnnVectorsReader extends KnnVectorsReader {
+  private final FlatVectorsReader rawVectorsReader;
+  private final IndexInput meta, data;
+  private final Map indexMap;
+  private final Arena arena;
+  private boolean closed;
+
+  public FaissKnnVectorsReader(SegmentReadState state, FlatVectorsReader rawVectorsReader)
+      throws IOException {
+    this.rawVectorsReader = rawVectorsReader;
+    this.indexMap = new HashMap<>();
+    this.arena = Arena.ofShared();
+    this.closed = false;
+
+    boolean failure = true;
+    try {
+      meta =
+          openInput(
+              state,
+              META_EXTENSION,
+              META_CODEC_NAME,
+              VERSION_START,
+              VERSION_CURRENT,
+              state.context);
+      data =
+          openInput(
+              state,
+              DATA_EXTENSION,
+              DATA_CODEC_NAME,
+              VERSION_START,
+              VERSION_CURRENT,
+              state.context.withHints(FileTypeHint.DATA, DataAccessHint.RANDOM));
+
+      Map.Entry entry;
+      while ((entry = parseNextField(state)) != null) {
+        this.indexMap.put(entry.getKey(), entry.getValue());
+      }
+
+      failure = false;
+    } finally {
+      if (failure) {
+        IOUtils.closeWhileHandlingException(this);
+      }
+    }
+  }
+
+  @SuppressWarnings("SameParameterValue")
+  private IndexInput openInput(
+      SegmentReadState state,
+      String extension,
+      String codecName,
+      int versionStart,
+      int versionEnd,
+      IOContext context)
+      throws IOException {
+
+    String fileName =
+        IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, extension);
+    IndexInput input = state.directory.openInput(fileName, context);
+    CodecUtil.checkIndexHeader(
+        input, codecName, versionStart, versionEnd, state.segmentInfo.getId(), state.segmentSuffix);
+    return input;
+  }
+
+  private Map.Entry parseNextField(SegmentReadState state) throws IOException {
+    int fieldNumber = meta.readInt();
+    if (fieldNumber == -1) {
+      return null;
+    }
+
+    FieldInfo fieldInfo = state.
Re: [PR] Fix resource leak in loadMainDataFromFile [lucene]
github-actions[bot] commented on PR #14726: URL: https://github.com/apache/lucene/pull/14726#issuecomment-2914833524 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
[PR] Fix resource leak in loadMainDataFromFile [lucene]
xcx1r3 opened a new pull request, #14726: URL: https://github.com/apache/lucene/pull/14726

Use try-with-resources to auto-close DataInputStream

```
try (DataInputStream dctFile = new DataInputStream(Files.newInputStream(Paths.get(dctFilePath)))) {
  ...
}
```
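A self-contained sketch of the try-with-resources pattern named in this PR, under stated assumptions: `DictionaryHeaderReader` and `readFirstInt` are hypothetical names, and reading a single `int` stands in for the real dictionary parsing, which the PR description elides. The point it shows is that the stream is closed whether `readInt()` returns normally or throws.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the dictionary loader; not the actual WordDictionary code. */
final class DictionaryHeaderReader {
  /** Reads the first big-endian int of the file. */
  static int readFirstInt(Path file) throws IOException {
    // The resource declared in the parentheses is closed automatically when the
    // block exits, including when readInt() throws (for example EOFException).
    try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
      return in.readInt();
    }
  }
}
```

Closing the outer `DataInputStream` also closes the wrapped file stream, so a single resource declaration is enough here.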
Re: [I] Potential resource leakage in WordDictionary#loadMainDataFromFile [lucene]
xcx1r3 commented on issue #14719: URL: https://github.com/apache/lucene/issues/14719#issuecomment-2914834339 #14726 sure
Re: [PR] Fix resource leak in loadMainDataFromFile [lucene]
xcx1r3 closed pull request #14726: Fix resource leak in loadMainDataFromFile URL: https://github.com/apache/lucene/pull/14726
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2109735507 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsReader.java: ##
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2109760361

## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java: ##
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.faiss;
+
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_CODEC_NAME;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_EXTENSION;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_CODEC_NAME;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_EXTENSION;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_CURRENT;
+import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.createIndex;
+import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexWrite;
+
+import java.io.IOException;
+import java.lang.foreign.Arena;
+import java.lang.foreign.MemorySegment;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatVectorsWriter;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.DocIdSet;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.hnsw.IntToIntFunction;
+
+/**
+ * Write per-segment Faiss indexes and associated metadata.
+ *
+ * @lucene.experimental
+ */
+final class FaissKnnVectorsWriter extends KnnVectorsWriter {
+  private final String description, indexParams;
+  private final FlatVectorsWriter rawVectorsWriter;
+  private final IndexOutput meta, data;
+  private final Map> rawFields;
+  private boolean closed, finished;
+
+  public FaissKnnVectorsWriter(
+      String description,
+      String indexParams,
+      SegmentWriteState state,
+      FlatVectorsWriter rawVectorsWriter)
+      throws IOException {
+
+    this.description = description;
+    this.indexParams = indexParams;
+    this.rawVectorsWriter = rawVectorsWriter;
+    this.rawFields = new HashMap<>();
+    this.closed = false;
+    this.finished = false;
+
+    boolean failure = true;
+    try {
+      this.meta = openOutput(state, META_EXTENSION, META_CODEC_NAME);
+      this.data = openOutput(state, DATA_EXTENSION, DATA_CODEC_NAME);
+      failure = false;
+    } finally {
+      if (failure) {
+        IOUtils.closeWhileHandlingException(this);
+      }
+    }
+  }
+
+  private IndexOutput openOutput(SegmentWriteState state, String extension, String codecName)
+      throws IOException {
+    String fileName =
+        IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, extension);
+    IndexOutput output = state.directory.createOutput(fileName, state.context);
+    CodecUtil.writeIndexHeader(
+        output, codecName, VERSION_CURRENT, state.segmentInfo.getId(), state.segmentSuffix);
+    return output;
+  }
+
+  @Override
+  public void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException {
+    rawVectorsWriter.mergeOneField(fieldInfo, mergeState);
+    switch (fieldInfo.getVectorEncoding()) {
+      case BYTE ->
+          // TODO: Support using SQ8 quantization, see:
+          // - https://github.com/opensearch-project/k-NN/pull/2425
+          throw new UnsupportedOperationException("Byte vectors not supported");
+      case FLOAT32 -> {
+        FloatVectorValues merged =
+            KnnVectorsWriter.MergedVectorValues.mergeFloatVectorValues(fieldInfo, mergeState);
+        writeFloatField(fieldInfo, merged, doc -> doc);
+      }
+    }
+  }
+
+  @Override
+  public
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2109774033

## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java: ##
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.faiss;
+
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_CODEC_NAME;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_EXTENSION;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_CODEC_NAME;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_EXTENSION;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_CURRENT;
+import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.createIndex;
+import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexWrite;
+
+import java.io.IOException;
+import java.lang.foreign.Arena;
+import java.lang.foreign.MemorySegment;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatVectorsWriter;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.DocIdSet;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.hnsw.IntToIntFunction;
+
+/**
+ * Write per-segment Faiss indexes and associated metadata.
+ *
+ * @lucene.experimental
+ */
+final class FaissKnnVectorsWriter extends KnnVectorsWriter {
+  private final String description, indexParams;
+  private final FlatVectorsWriter rawVectorsWriter;
+  private final IndexOutput meta, data;
+  private final Map> rawFields;
+  private boolean closed, finished;
+
+  public FaissKnnVectorsWriter(
+      String description,
+      String indexParams,
+      SegmentWriteState state,
+      FlatVectorsWriter rawVectorsWriter)
+      throws IOException {
+
+    this.description = description;
+    this.indexParams = indexParams;
+    this.rawVectorsWriter = rawVectorsWriter;
+    this.rawFields = new HashMap<>();
+    this.closed = false;
+    this.finished = false;
+
+    boolean failure = true;

Review Comment:
Ah found it (#14633) -- will follow this..
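The constructors quoted in this review use a `failure` flag with `try`/`finally` so that a file opened before an exception does not stay open. A minimal sketch of that idiom in plain Java, without Lucene dependencies: `TwoFileHolder`, `Opener`, and `closeQuietly` are hypothetical names, and `closeQuietly` only approximates what `IOUtils.closeWhileHandlingException` does in the quoted code.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical holder of two resources; illustrates closing on constructor failure. */
final class TwoFileHolder implements Closeable {
  /** Hypothetical factory so the sketch can simulate an open call that fails. */
  interface Opener {
    Closeable open() throws IOException;
  }

  private Closeable meta;
  private Closeable data;

  TwoFileHolder(Opener metaOpener, Opener dataOpener) throws IOException {
    boolean failure = true;
    try {
      meta = metaOpener.open();
      data = dataOpener.open(); // if this throws, meta is already open
      failure = false;
    } finally {
      if (failure) {
        // Release whatever was opened; the original exception keeps propagating.
        closeQuietly(meta);
        closeQuietly(data);
      }
    }
  }

  @Override
  public void close() throws IOException {
    try {
      meta.close();
    } finally {
      data.close();
    }
  }

  /** Closes the resource if it was opened, ignoring any IOException from close(). */
  private static void closeQuietly(Closeable c) {
    if (c == null) {
      return;
    }
    try {
      c.close();
    } catch (IOException ignored) {
      // Intentionally ignored so the exception from the constructor is not masked.
    }
  }
}
```

The flag is used instead of a `catch` block so that any exception type, checked or unchecked, triggers the cleanup without being caught and rethrown.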
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2109779193 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java: ## @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.sandbox.codecs.faiss; + +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_CODEC_NAME; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_EXTENSION; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_CODEC_NAME; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_EXTENSION; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_CURRENT; +import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.createIndex; +import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexWrite; + +import java.io.IOException; +import java.lang.foreign.Arena; +import java.lang.foreign.MemorySegment; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnFieldVectorsWriter; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter; +import org.apache.lucene.codecs.hnsw.FlatVectorsWriter; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.FloatVectorValues; +import org.apache.lucene.index.IndexFileNames; +import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.SegmentWriteState; +import org.apache.lucene.index.Sorter; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.search.DocIdSet; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.hnsw.IntToIntFunction; + +/** + * Write per-segment Faiss indexes and associated metadata. 
+ * + * @lucene.experimental + */ +final class FaissKnnVectorsWriter extends KnnVectorsWriter { + private final String description, indexParams; + private final FlatVectorsWriter rawVectorsWriter; + private final IndexOutput meta, data; + private final Map> rawFields; + private boolean closed, finished; + + public FaissKnnVectorsWriter( + String description, + String indexParams, + SegmentWriteState state, + FlatVectorsWriter rawVectorsWriter) + throws IOException { + +this.description = description; +this.indexParams = indexParams; +this.rawVectorsWriter = rawVectorsWriter; +this.rawFields = new HashMap<>(); +this.closed = false; +this.finished = false; + +boolean failure = true; +try { + this.meta = openOutput(state, META_EXTENSION, META_CODEC_NAME); + this.data = openOutput(state, DATA_EXTENSION, DATA_CODEC_NAME); + failure = false; +} finally { + if (failure) { +IOUtils.closeWhileHandlingException(this); + } +} + } + + private IndexOutput openOutput(SegmentWriteState state, String extension, String codecName) + throws IOException { +String fileName = +IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, extension); +IndexOutput output = state.directory.createOutput(fileName, state.context); +CodecUtil.writeIndexHeader( +output, codecName, VERSION_CURRENT, state.segmentInfo.getId(), state.segmentSuffix); +return output; + } + + @Override + public void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException { +rawVectorsWriter.mergeOneField(fieldInfo, mergeState); +switch (fieldInfo.getVectorEncoding()) { + case BYTE -> + // TODO: Support using SQ8 quantization, see: + // - https://github.com/opensearch-project/k-NN/pull/2425 + throw new UnsupportedOperationException("Byte vectors not supported"); + case FLOAT32 -> { +FloatVectorValues merged = + KnnVectorsWriter.MergedVectorValues.mergeFloatVectorValues(fieldInfo, mergeState); +writeFloatField(fieldInfo, merged, doc -> doc); + } +} + } + + @Override + public
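The constructor quoted in this message guards its two `openOutput` calls with a `failure` flag, so that a half-built writer is closed if the second open throws. Below is a minimal, self-contained sketch of that pattern in plain Java. `TrackedResource` and `PairWriter` are hypothetical stand-ins for `IndexOutput` and the writer, and the open counter exists only to make the cleanup observable; this is not the PR's code.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical stand-in for an output file: counts how many instances are open. */
final class TrackedResource implements Closeable {
  static int openCount = 0;
  private boolean closed;

  TrackedResource(String name, boolean failOnOpen) throws IOException {
    if (failOnOpen) {
      throw new IOException("cannot open " + name);
    }
    openCount++;
  }

  @Override
  public void close() {
    if (!closed) { // idempotent: closing twice must not decrement twice
      closed = true;
      openCount--;
    }
  }
}

/** Hypothetical writer that owns two resources and must not leak the first. */
final class PairWriter implements Closeable {
  private TrackedResource meta, data;

  PairWriter(boolean failOnSecond) throws IOException {
    boolean failure = true;
    try {
      meta = new TrackedResource("meta", false);
      data = new TrackedResource("data", failOnSecond);
      failure = false; // reached only if both opens succeeded
    } finally {
      if (failure) {
        close(); // release whatever was opened before the exception
      }
    }
  }

  @Override
  public void close() {
    // Either field may still be null if the constructor failed part-way.
    if (meta != null) meta.close();
    if (data != null) data.close();
  }
}
```

Because `close()` tolerates unset fields, the same method serves both the normal shutdown path and the constructor's failure path.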
Re: [PR] Cache high-order bits of hashcode to speed up BytesRefHash [lucene]
jpountz commented on code in PR #14720: URL: https://github.com/apache/lucene/pull/14720#discussion_r2110084706 ## lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java: ## @@ -71,9 +72,13 @@ public BytesRefHash(ByteBlockPool pool) { /** Creates a new {@link BytesRefHash} */ public BytesRefHash(ByteBlockPool pool, int capacity, BytesStartArray bytesStartArray) { +if ((capacity & (capacity - 1)) != 0) { Review Comment: Can you use `BitUtil#isZeroOrPowerOfTwo`?
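For readers unfamiliar with the bit trick in the diff: `x & (x - 1)` clears the lowest set bit, so the result is zero exactly when `x` has at most one bit set. The sketch below reimplements that check locally rather than calling Lucene. `CapacityCheck` and `requirePowerOfTwo` are hypothetical names, and the explicit positivity guard is an assumption added for illustration, not something stated in the PR.

```java
final class CapacityCheck {
  /** Same bit trick as the diff: true for 0 and for values with exactly one bit set. */
  static boolean isZeroOrPowerOfTwo(int x) {
    return (x & (x - 1)) == 0;
  }

  /** Rejects capacities that are not a positive power of two (assumed policy). */
  static int requirePowerOfTwo(int capacity) {
    // The bit trick alone accepts 0 (and Integer.MIN_VALUE), hence the extra guard.
    if (capacity <= 0 || !isZeroOrPowerOfTwo(capacity)) {
      throw new IllegalArgumentException("capacity must be a power of two, got " + capacity);
    }
    return capacity;
  }
}
```

Note that the condition in the diff, like the helper named in the review comment, lets a capacity of 0 through; whether that matters depends on how the constructor uses the value.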
Re: [PR] Move HitQueue in TopScoreDocCollector to a LongHeap [lucene]
jpountz commented on PR #14714: URL: https://github.com/apache/lucene/pull/14714#issuecomment-2913896479 I wasn't aware of this indeed. OK for passing null then, I agree that there may be subclasses that rely on this API in the wild.
Re: [PR] Arg001 - no violations found [lucene]
github-actions[bot] commented on PR #14724: URL: https://github.com/apache/lucene/pull/14724#issuecomment-2914476284 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
Re: [PR] Arg001 - no violations found [lucene]
Mariah33 commented on PR #14724: URL: https://github.com/apache/lucene/pull/14724#issuecomment-2914477522 on wrong branch
Re: [PR] Clarify filter fields usage in javadocs [lucene]
github-actions[bot] commented on PR #14660: URL: https://github.com/apache/lucene/pull/14660#issuecomment-2914507503 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] No ruff violation [lucene]
github-actions[bot] commented on PR #14725: URL: https://github.com/apache/lucene/pull/14725#issuecomment-2914529256 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
[PR] No ruff violation [lucene]
Mariah33 opened a new pull request, #14725: URL: https://github.com/apache/lucene/pull/14725 ### Description Didn't find these ruff rules in the code
[PR] Arg001 - no violations found [lucene]
Mariah33 opened a new pull request, #14724: URL: https://github.com/apache/lucene/pull/14724 ### Description This rule was not found in the code.
Re: [PR] Arg001 - no violations found [lucene]
Mariah33 closed pull request #14724: Arg001 - no violations found URL: https://github.com/apache/lucene/pull/14724
Re: [PR] Apply minimal fix for ruff rule PATH103 using Path.resolve [lucene]
rmuir merged PR #14711: URL: https://github.com/apache/lucene/pull/14711
[PR] deps(java): bump org.apache.groovy:groovy-all from 4.0.26 to 4.0.27 [lucene]
dependabot[bot] opened a new pull request, #14722: URL: https://github.com/apache/lucene/pull/14722 Bumps [org.apache.groovy:groovy-all](https://github.com/apache/groovy) from 4.0.26 to 4.0.27. Commits: see full diff in [compare view](https://github.com/apache/groovy/commits). Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. Dependabot commands and options: you can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
[PR] deps(java): bump com.diffplug.spotless from 7.0.3 to 7.0.4 [lucene]
dependabot[bot] opened a new pull request, #14723: URL: https://github.com/apache/lucene/pull/14723 Bumps com.diffplug.spotless from 7.0.3 to 7.0.4. Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Re: [PR] deps(java): bump org.apache.groovy:groovy-all from 4.0.26 to 4.0.27 [lucene]
github-actions[bot] commented on PR #14722: URL: https://github.com/apache/lucene/pull/14722#issuecomment-2914414329 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
Re: [PR] deps(java): bump com.diffplug.spotless from 7.0.3 to 7.0.4 [lucene]
github-actions[bot] commented on PR #14723: URL: https://github.com/apache/lucene/pull/14723#issuecomment-2914414463 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
Re: [PR] Apply minimal fix for ruff rule PATH103 using Path.resolve [lucene]
github-actions[bot] commented on PR #14711: URL: https://github.com/apache/lucene/pull/14711#issuecomment-2914446223 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2109479695 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsFormat.java: ## @@ -0,0 +1,93 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.sandbox.codecs.faiss; + +import java.io.IOException; +import java.util.Locale; +import org.apache.lucene.codecs.KnnVectorsFormat; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.codecs.hnsw.FlatVectorScorerUtil; +import org.apache.lucene.codecs.hnsw.FlatVectorsFormat; +import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat; +import org.apache.lucene.index.SegmentReadState; +import org.apache.lucene.index.SegmentWriteState; + +/** + * A format which uses https://github.com/facebookresearch/faiss";>Faiss to create and + * search vector indexes, using {@link LibFaissC} to interact with the native library. + * + * A separate Faiss index is created per-segment, and uses the following files: + * + * + * .faissm (metadata file): stores field number, offset and length of actual + * Faiss index in data file. 
+ * .faissd (data file): stores concatenated Faiss indexes for all fields. + * All files required by {@link Lucene99FlatVectorsFormat} for storing raw vectors. + * + * + * Note: Set the {@code $OMP_NUM_THREADS} environment variable to control internal threading. Review Comment: Makes sense, I'll add it ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsReader.java: ## @@ -0,0 +1,195 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.sandbox.codecs.faiss; + +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_CODEC_NAME; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_EXTENSION; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_CODEC_NAME; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_EXTENSION; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_CURRENT; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_START; +import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexRead; +import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexSearch; + +import java.io.IOException; +import java.lang.foreign.Arena; +import java.lang.foreign.MemorySegment; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.codecs.hnsw.FlatVectorsReader; +import org.apache.lucene.index.ByteVectorValues; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.FloatVectorValues; +import org.apache.lucene.index.IndexFileNames; +import org.apache.lucene.index.SegmentReadState; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.search.KnnCollector; +import org.apache.lucene.store.DataAccessHint; +import org.apache.lucene.store.FileTypeHint; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.IOUtils; + +/** + * Read per-segment Faiss indexes and ass
Re: [PR] Move HitQueue in TopScoreDocCollector to a LongHeap [lucene]
gf2121 commented on PR #14714: URL: https://github.com/apache/lucene/pull/14714#issuecomment-2913036055 Thanks for the suggestion! > It's a bit ugly to pass null as a HitQueue in the constructor of TopScoreDocCollector. Can we only keep method signatures on TopDocsCollector and move the current impls to some other class? FWIW, passing a null PQ is mentioned in `TopDocsCollector`'s javadoc: https://github.com/apache/lucene/blob/6b3c3e4803dfe3edba75569e289fe492d8cc5cd2/lucene/core/src/java/org/apache/lucene/search/TopDocsCollector.java#L25-L28. I agree it is ugly to copy the large `topDocs(int start, int howMany)`, so I was looking to extract the PQ logic into a protected method, but I'm not sure we should touch this public API class, as this does not seem to break the original intention of the design. In case you did not notice the javadoc, I'd like to ask for your suggestion again :)
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2913552519 I submitted a talk on this topic (`Profiling Concurrent Search in Lucene: A Deep Dive into Parallel Execution`) for the ASF conference (https://communityovercode.org/schedule/) and it was selected. Would love to iterate and get this PR merged before that!
Re: [PR] Reduce NeighborArray heap memory [lucene]
weizijun commented on code in PR #14527: URL: https://github.com/apache/lucene/pull/14527#discussion_r2109476565 ## .gitignore: ## @@ -32,3 +32,10 @@ __pycache__ # SDKMAN .sdkmanrc + +# Java class files +*.class + +# Ignore bin directories +bin/ +**/bin/ Review Comment: Oh, sorry, that was a mistake, I'll delete it.
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2109729290 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsReader.java: ## @@ -0,0 +1,195 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.sandbox.codecs.faiss; + +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_CODEC_NAME; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_EXTENSION; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_CODEC_NAME; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_EXTENSION; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_CURRENT; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_START; +import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexRead; +import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexSearch; + +import java.io.IOException; +import java.lang.foreign.Arena; +import java.lang.foreign.MemorySegment; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.codecs.hnsw.FlatVectorsReader; +import org.apache.lucene.index.ByteVectorValues; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.FloatVectorValues; +import org.apache.lucene.index.IndexFileNames; +import org.apache.lucene.index.SegmentReadState; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.search.KnnCollector; +import org.apache.lucene.store.DataAccessHint; +import org.apache.lucene.store.FileTypeHint; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.IOUtils; + +/** + * Read per-segment Faiss indexes and associated metadata. 
+ * + * @lucene.experimental + */ +final class FaissKnnVectorsReader extends KnnVectorsReader { + private final FlatVectorsReader rawVectorsReader; + private final IndexInput meta, data; + private final Map indexMap; + private final Arena arena; + private boolean closed; + + public FaissKnnVectorsReader(SegmentReadState state, FlatVectorsReader rawVectorsReader) + throws IOException { +this.rawVectorsReader = rawVectorsReader; +this.indexMap = new HashMap<>(); +this.arena = Arena.ofShared(); +this.closed = false; + +boolean failure = true; +try { + meta = + openInput( + state, + META_EXTENSION, + META_CODEC_NAME, + VERSION_START, + VERSION_CURRENT, + state.context); + data = + openInput( + state, + DATA_EXTENSION, + DATA_CODEC_NAME, + VERSION_START, + VERSION_CURRENT, + state.context.withHints(FileTypeHint.DATA, DataAccessHint.RANDOM)); + + Map.Entry entry; + while ((entry = parseNextField(state)) != null) { +this.indexMap.put(entry.getKey(), entry.getValue()); + } + + failure = false; +} finally { + if (failure) { +IOUtils.closeWhileHandlingException(this); + } +} + } + + @SuppressWarnings("SameParameterValue") + private IndexInput openInput( + SegmentReadState state, + String extension, + String codecName, + int versionStart, + int versionEnd, + IOContext context) + throws IOException { + +String fileName = +IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, extension); +IndexInput input = state.directory.openInput(fileName, context); +CodecUtil.checkIndexHeader( +input, codecName, versionStart, versionEnd, state.segmentInfo.getId(), state.segmentSuffix); +return input; + } + + private Map.Entry parseNextField(SegmentReadState state) throws IOException { +int fieldNumber = meta.readInt(); +if (fieldNumber == -1) { + return null; +} + +FieldInfo fieldInfo = state.
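The reader quoted in this message loops over `parseNextField` until the metadata yields a field number of `-1`, and the format's Javadoc says the metadata file stores the field number, offset and length of each Faiss index in the data file. The sketch below illustrates such a sentinel-terminated metadata stream using only `java.io`. The exact encoding (one `int` followed by two `long`s) and the names `MetaSketch` and `Slice` are assumptions for illustration; the real codec writes through Lucene's `IndexOutput` with codec headers that are omitted here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class MetaSketch {
  /** Where one field's index lives inside the data file (hypothetical layout). */
  record Slice(long offset, long length) {}

  /** Writes (fieldNumber, offset, length) entries followed by a -1 sentinel. */
  static byte[] write(Map<Integer, Slice> fields) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (Map.Entry<Integer, Slice> e : fields.entrySet()) {
        out.writeInt(e.getKey());
        out.writeLong(e.getValue().offset());
        out.writeLong(e.getValue().length());
      }
      out.writeInt(-1); // end-of-fields marker, mirroring the reader's check
    }
    return bytes.toByteArray();
  }

  /** Reads entries until the -1 sentinel, like the reader's parse loop. */
  static Map<Integer, Slice> read(byte[] meta) throws IOException {
    Map<Integer, Slice> fields = new LinkedHashMap<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(meta))) {
      int fieldNumber;
      while ((fieldNumber = in.readInt()) != -1) {
        long offset = in.readLong();
        long length = in.readLong();
        fields.put(fieldNumber, new Slice(offset, length));
      }
    }
    return fields;
  }
}
```

Keeping offsets and lengths in a small metadata file lets a reader locate each field's index in the concatenated data file without scanning it.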
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178:
URL: https://github.com/apache/lucene/pull/14178#discussion_r2109499282

## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsFormat.java: ##
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.faiss;
+
+import java.io.IOException;
+import java.util.Locale;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatVectorScorerUtil;
+import org.apache.lucene.codecs.hnsw.FlatVectorsFormat;
+import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+
+/**
+ * A format which uses <a href="https://github.com/facebookresearch/faiss">Faiss</a> to create and
+ * search vector indexes, using {@link LibFaissC} to interact with the native library.
+ *
+ * A separate Faiss index is created per-segment, and uses the following files:
+ *
+ *
+ * .faissm (metadata file): stores field number, offset and length of actual
+ * Faiss index in data file.
+ * .faissd (data file): stores concatenated Faiss indexes for all fields.
+ * All files required by {@link Lucene99FlatVectorsFormat} for storing raw vectors.
+ *
+ *
+ * Note: Set the {@code $OMP_NUM_THREADS} environment variable to control internal threading.
+ *
+ * @lucene.experimental

Review Comment:
I do see some [references](https://github.com/search?q=repo%3Afacebookresearch%2Ffaiss%20compatibility&type=code) to backwards compatibility, and an [old comment](https://github.com/facebookresearch/faiss/issues/2373#issuecomment-1175895577) which says that newer versions of Faiss can read older indexes, but I couldn't find documentation for it.

Further, we may change some internals of the codec, making it incompatible with earlier versions. I'll add a comment saying there's no guarantee today, and a TODO to figure that out.
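The metadata layout described in the Javadoc above (field number, offset and length per field) combined with the `-1` end marker that `parseNextField` checks for can be illustrated with a small standalone sketch. This is not the PR's implementation: the class name `MetaSketch`, the `Entry` record, and the choice of an `int` field number followed by two `long` values are assumptions made for illustration, and plain `java.io` streams stand in for Lucene's `IndexInput`/`IndexOutput`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a sentinel-terminated metadata stream; not the PR's code. */
final class MetaSketch {
  /** Assumed record layout: field number, then offset and length into the data file. */
  record Entry(int fieldNumber, long offset, long length) {}

  /** End marker, mirroring the {@code fieldNumber == -1} check in the reader. */
  static final int END_OF_FIELDS = -1;

  /** Serializes all entries followed by the end marker. */
  static byte[] write(List<Entry> entries) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (Entry e : entries) {
        out.writeInt(e.fieldNumber());
        out.writeLong(e.offset());
        out.writeLong(e.length());
      }
      out.writeInt(END_OF_FIELDS);
    }
    return bytes.toByteArray();
  }

  /** Reads entries until the end marker is seen. */
  static List<Entry> read(byte[] meta) throws IOException {
    List<Entry> entries = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(meta))) {
      int fieldNumber;
      while ((fieldNumber = in.readInt()) != END_OF_FIELDS) {
        // Arguments are evaluated left to right: offset first, then length.
        entries.add(new Entry(fieldNumber, in.readLong(), in.readLong()));
      }
    }
    return entries;
  }

  public static void main(String[] args) throws IOException {
    byte[] meta = write(List.of(new Entry(0, 0L, 1024L), new Entry(3, 1024L, 2048L)));
    System.out.println(read(meta));
  }
}
```

With a layout like this, the reader needs no up-front field count, and each entry tells it which slice of the concatenated data file belongs to which field.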
Re: [I] Create a bot to add milestones to new PRs [lucene]
stefanvodita commented on issue #14190:
URL: https://github.com/apache/lucene/issues/14190#issuecomment-2913105749

#14697 is a nice example of the bot modifying the milestone after we moved the CHANGES entry to a different section!
Re: [PR] Reduce NeighborArray heap memory [lucene]
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2914720940

Here are the statistics for HNSW graphs with 100w (1,000,000) nodes, with m = 16 and ef = 100:

Level count = 5:
```
level: 0, node count: 1000000
level: 1, node count: 62835
level: 2, node count: 3926
level: 3, node count: 235
level: 4, node count: 12
```

Average number of neighbors per level:
```
level: 0, avg neighbor count: 8.026909
level: 1, avg neighbor count: 7.539985676772499
level: 2, avg neighbor count: 8.596535914416709
level: 3, avg neighbor count: 8.353191489361702
level: 4, avg neighbor count: 3.8333333333333335
```

The detail of neighbor count:

level: 0
```
level: 0, neighbor count: 1, node count: 141664
level: 0, neighbor count: 2, node count: 130484
level: 0, neighbor count: 3, node count: 111485
level: 0, neighbor count: 4, node count: 91141
level: 0, neighbor count: 5, node count: 72929
level: 0, neighbor count: 6, node count: 59030
level: 0, neighbor count: 7, node count: 47796
level: 0, neighbor count: 8, node count: 39864
level: 0, neighbor count: 9, node count: 33320
level: 0, neighbor count: 10, node count: 27923
level: 0, neighbor count: 11, node count: 23972
level: 0, neighbor count: 12, node count: 20777
level: 0, neighbor count: 13, node count: 17986
level: 0, neighbor count: 14, node count: 15510
level: 0, neighbor count: 15, node count: 13725
level: 0, neighbor count: 16, node count: 12296
level: 0, neighbor count: 17, node count: 10947
level: 0, neighbor count: 18, node count: 9826
level: 0, neighbor count: 19, node count: 8765
level: 0, neighbor count: 20, node count: 7947
level: 0, neighbor count: 21, node count: 7348
level: 0, neighbor count: 22, node count: 6639
level: 0, neighbor count: 23, node count: 6045
level: 0, neighbor count: 24, node count: 5413
level: 0, neighbor count: 25, node count: 5101
level: 0, neighbor count: 26, node count: 4569
level: 0, neighbor count: 27, node count: 4105
level: 0, neighbor count: 28, node count: 3965
level: 0, neighbor count: 29, node count: 3564
level: 0, neighbor count: 30, node count: 3330
level: 0, neighbor count: 31, node count: 3019
level: 0, neighbor count: 32, node count: 49515
```

level: 1
```
level: 1, neighbor count: 1, node count: 6760
level: 1, neighbor count: 2, node count: 6707
level: 1, neighbor count: 3, node count: 6127
level: 1, neighbor count: 4, node count: 5277
level: 1, neighbor count: 5, node count: 4420
level: 1, neighbor count: 6, node count: 3805
level: 1, neighbor count: 7, node count: 3321
level: 1, neighbor count: 8, node count: 2827
level: 1, neighbor count: 9, node count: 2502
level: 1, neighbor count: 10, node count: 2093
level: 1, neighbor count: 11, node count: 1849
level: 1, neighbor count: 12, node count: 1645
level: 1, neighbor count: 13, node count: 1521
level: 1, neighbor count: 14, node count: 1257
level: 1, neighbor count: 15, node count: 1163
level: 1, neighbor count: 16, node count: 11561
```

level: 2
```
level: 2, neighbor count: 1, node count: 298
level: 2, neighbor count: 2, node count: 302
level: 2, neighbor count: 3, node count: 309
level: 2, neighbor count: 4, node count: 278
level: 2, neighbor count: 5, node count: 267
level: 2, neighbor count: 6, node count: 251
level: 2, neighbor count: 7, node count: 196
level: 2, neighbor count: 8, node count: 209
level: 2, neighbor count: 9, node count: 178
level: 2, neighbor count: 10, node count: 159
level: 2, neighbor count: 11, node count: 153
level: 2, neighbor count: 12, node count: 134
level: 2, neighbor count: 13, node count: 125
level: 2, neighbor count: 14, node count: 75
level: 2, neighbor count: 15, node count: 106
level: 2, neighbor count: 16, node count: 886
```

level: 3
```
level: 3, neighbor count: 1, node count: 18
level: 3, neighbor count: 2, node count: 14
level: 3, neighbor count: 3, node count: 11
level: 3, neighbor count: 4, node count: 14
level: 3, neighbor count: 5, node count: 17
level: 3, neighbor count: 6, node count: 20
level: 3, neighbor count: 7, node count: 19
level: 3, neighbor count: 8, node count: 11
level: 3, neighbor count: 9, node count: 23
level: 3, neighbor count: 10, node count: 12
level: 3, neighbor count: 11, node count: 12
level: 3, neighbor count: 12, node count: 9
level: 3, neighbor count: 13, node count: 7
level: 3, neighbor count: 14, node count: 10
level: 3, neighbor count: 15, node count: 4
level: 3, neighbor count: 16, node count: 34
```

level: 4
```
level: 4, neighbor count: 1, node count: 1
level: 4, neighbor count: 2, node count: 2
level: 4, neighbor count: 3, node count: 5
level: 4, neighbor count: 5, node count: 2
level: 4, neighbor count: 8, node count: 2