[GitHub] [lucene] henryrneh commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz
henryrneh commented on issue #12165: URL: https://github.com/apache/lucene/issues/12165#issuecomment-1674365549

We have now started triaging the bugs found by OSS-Fuzz. The fuzzer has discovered multiple issues, for example OutOfMemoryError or StackOverflowError, which we can disclose one by one or by giving you access to the OSS-Fuzz platform via email. Should we disclose them here through public issues, or do you prefer the secur...@apache.org mailing list?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [lucene] uschindler commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz
uschindler commented on issue #12165: URL: https://github.com/apache/lucene/issues/12165#issuecomment-1674532419

Just open public issues. Note that not all of those errors would be fixed, because Apache Lucene does not always do all possible checks, as performance is more important than an OOM (caused by "wrong usage").
[GitHub] [lucene] benwtrent merged pull request #12500: Fix flaky testToString method for Knn Vector queries
benwtrent merged PR #12500: URL: https://github.com/apache/lucene/pull/12500
[GitHub] [lucene] sabi0 opened a new issue, #12501: Default PostingsFormat lost the SPI extension point in the Codec class
sabi0 opened a new issue, #12501: URL: https://github.com/apache/lucene/issues/12501

### Description

`Lucene70Codec` had:

```
private final PostingsFormat defaultFormat = PostingsFormat.forName("Lucene50");
```

In `Lucene80Codec` the PostingsFormat instantiation was moved to the constructor, presumably to pass the additional `fstLoadMode` parameter: https://github.com/apache/lucene/commit/28e8a30b536a39e5539ac6e8b9407d31724c8857#diff-3a74c1b72ab52e54dfcdc9de142b4331b372c11bcca87842b001c30f89ce58ebR98

In a subsequent commit the code was reverted to the default PostingsFormat constructor: https://github.com/apache/lucene/commit/651f41e21bd3df98f70d2673295db29506e3d2e6#diff-3a74c1b72ab52e54dfcdc9de142b4331b372c11bcca87842b001c30f89ce58ebR96

But the SPI call `PostingsFormat.forName()` was not restored, and it is still missing in `Lucene95Codec`: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95Codec.java#L121

Could the SPI extension point be restored, to allow overriding the PostingsFormat without having to override the Codec?
[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
uschindler commented on issue #12501: URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674943197

Hi, the SPI should only be used when READING indexes. When you create a codec for IndexWriter, the codec version hardcodes its postings format and other subtypes. As you can see, it is the same for the docValuesFormat and other parts. This allows reading any index, but when you write an index it will use the exact codec as specified. Basically, we made the decision to hardcode the correct classes when writing indexes, but to load the codecs dynamically when opening an existing index.
[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
sabi0 commented on issue #12501: URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674950075

I see. Thank you for the explanation. The commits that changed this behavior did not mention it, so it looked like the loss of this SPI extension point was an unwitting side-effect of a sequence of refactorings.
[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
uschindler commented on issue #12501: URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674955262

> So it looked like this SPI extension point loss was an unwitting side-effect of a sequence of refactorings.

No, the SPI `forName` does not allow you to change the implementation, as there can only be one "Lucene50" implementation on the classpath. If you want to have another codec, it must have a new name and must therefore, for (new) indexes, be passed via IndexWriterConfig as a new codec.

> private final PostingsFormat defaultFormat = PostingsFormat.forName("Lucene50");

This does not allow you to overwrite the format; you still need to subclass the codec, as the name "Lucene50" is part of Lucene core and can't be replaced. So it will always load the hardcoded 5.0 codec.
[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
uschindler commented on issue #12501: URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674957426

For the FST loading mode mentioned above, the codec does not need to be changed; you can tell DirectoryReader to use FST load modes using attributes.
[GitHub] [lucene] uschindler closed issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
uschindler closed issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class URL: https://github.com/apache/lucene/issues/12501
[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
uschindler commented on issue #12501: URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674970444

The general rule is:

- If you want to change the index postings format (but nothing else, like the codec itself) when writing a new index, you need to subclass the default codec. That way it keeps its name, and code reading the index will look it up.
- If you add a completely new postings format, subclass the abstract base class, give it a new name, and register it in SPI.
- If you just want to change settings, you can reuse the postings format by instantiating it in the codec with different settings.
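For the "register it in SPI" step above, Lucene discovers postings formats through the standard JDK `ServiceLoader` mechanism, so registration is a one-line provider-configuration file in your jar (the class name below is a placeholder for your own format):

```
# File: META-INF/services/org.apache.lucene.codecs.PostingsFormat
com.example.MyPostingsFormat
```

The name your format returns from `getName()` is what gets written into the index, and what `PostingsFormat.forName()` later resolves when the index is read.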
[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
sabi0 commented on issue #12501: URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674983089

The codec classes are `final`. Besides, having two implementations (`Lucene84PostingsFormat` in lucene-core and `MyLucene84PostingsFormat`) with the same "Lucene84" name would likely result in a lookup error?
[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
uschindler commented on issue #12501: URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674996502

> The postings format classes are `final`. Besides having two implementations (`Lucene84PostingsFormat` in lucene-core and `MyLucene84PostingsFormat`) with the same "Lucene84" name will likely result in a lookup error?

Exactly, and because of that it's final. Your postings format needs a new name. If you want to use it as the default, you can subclass the `Codec`, as this is the main entry point: https://github.com/apache/lucene/blob/df8745e59ee65f276ccaefa87480e1fd85facb56/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95Codec.java#L55

I don't understand what your problem is:

- Write your own postings format with its own name. You can't subclass the final one, but you can use a FilterPostingsFormat to wrap the default postings format.
- Subclass https://github.com/apache/lucene/blob/df8745e59ee65f276ccaefa87480e1fd85facb56/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95Codec.java#L55 and return/inject your codec there. An alternative that does not depend so statically on the exact version is to use `new FilterCodec(Codec.getDefault()) { ... override method returning postingsFormat() ... }`.
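The `FilterCodec` alternative from the last bullet might look roughly like this. A sketch only, untested, assuming lucene-core on the classpath; `MyPostingsFormat` is a placeholder for a format you wrote and registered under its own SPI name:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;

// Sketch: wrap the current default codec and swap only the postings format.
// "MyPostingsFormat" is a hypothetical format registered under its own SPI name.
public final class MyCodec extends FilterCodec {
  private final PostingsFormat postings = PostingsFormat.forName("MyPostingsFormat");

  public MyCodec() {
    // the codec itself also needs its own name, so readers can look it up
    super("MyCodec", Codec.getDefault());
  }

  @Override
  public PostingsFormat postingsFormat() {
    return postings;
  }
}
```

An instance would then be passed via `IndexWriterConfig.setCodec(new MyCodec())` for new indexes; existing segments keep being read by whatever codec name they were written with.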
[GitHub] [lucene] ashvardanian opened a new issue, #12502: USearch integration and potential Vector Search performance improvements
ashvardanian opened a new issue, #12502: URL: https://github.com/apache/lucene/issues/12502

### Description

I was recently approached by Lucene and Elastic users facing low performance and high memory consumption when running vector search tasks on the JVM. Some have also been using native libraries, like our [USearch](https://github.com/unum-cloud/usearch), and were curious whether those systems can be combined. Hence, here I am, excited to open a discussion 🤗

cc @jbellis, @benwtrent, @alessandrobenedetti, @msokolov

---

I have looked into the existing HNSW implementation and the related PR, #10047. The integration should be simple, assuming [we already have a JNI that passes CI and is hosted on GitHub](https://github.com/unum-cloud/usearch/packages/1867475). The upsides would be:

- performance not just on par with FAISS, but potentially higher.
- cross-platform `f16` support and optional automatic `i8` downcasting.
- indexes can be memory-mapped from disk without loading into RAM, and are about to receive many `io_uring`-based kernel-bypass tricks, similar to what we have in [UCall](https://github.com/unum-cloud/ucall).

---

This may automatically resolve the following issues (in reverse chronological order):

- [x] half-precision support: #12403
- [x] multi-key support: #12313
- [x] pluggable metrics, similar to our JIT support in Python: #12219
- [x] 2K+ dimensional vectors: #11507
- [x] compact offsets with `uint40_t`: #10884
- [x] memory consumption: #10177

---

As far as I understand, it is not common to integrate Lucene with native libraries, but it seems like it can be justified in such computationally intensive workloads.
| | FAISS, `f32` | USearch, `f32` | USearch, `f16` | USearch, `i8` |
| :--- | ---: | ---: | ---: | ---: |
| Batch Insert | 16 K/s | 73 K/s | 100 K/s | 104 K/s **+550%** |
| Batch Search | 82 K/s | 103 K/s | 113 K/s | 134 K/s **+63%** |
| Bulk Insert | 76 K/s | 105 K/s | 115 K/s | 202 K/s **+165%** |
| Bulk Search | 118 K/s | 174 K/s | 173 K/s | 304 K/s **+157%** |
| Recall @ 10 | 99% | 99.2% | 99.1% | 99.2% |

> Dataset: 1M vectors sample of the Deep1B dataset. Hardware: `c7g.metal` AWS instance with 64 cores and DDR5 memory. HNSW was configured with identical hyper-parameters: connectivity `M=16`, expansion @ construction `efConstruction=128`, and expansion @ search `ef=64`. Batch size is 256. Both libraries were compiled for the target architecture.

I am happy to contribute, and looking forward to your comments 🤗
[GitHub] [lucene] henryrneh opened a new issue, #12503: OutOfMemoryError found by OSS-Fuzz (issue 60248)
henryrneh opened a new issue, #12503: URL: https://github.com/apache/lucene/issues/12503

### Description

Dear Apache Lucene maintainers,

The OutOfMemoryError is triggered in this [line](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java#L400) by the parse() function of QueryParser when it processes a crafted untrusted input. We have reviewed the finding, and it might be security-related due to the potential for a denial of service. We would appreciate it if you could take a look at the finding. Do you see a risk that this might be exploited by untrusted input?

Part of the stack trace:

```
== Java Exception: com.code_intelligence.jazzer.api.FuzzerSecurityIssueLow: Out of memory (use '-Xmx1710m' to reproduce)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.util.ArrayUtil.growExact(ArrayUtil.java:400)
	at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:412)
	at org.apache.lucene.util.BytesRefBuilder.grow(BytesRefBuilder.java:60)
	at org.apache.lucene.util.BytesRefBuilder.append(BytesRefBuilder.java:71)
	at org.apache.lucene.util.BytesRefBuilder.append(BytesRefBuilder.java:78)
	at org.apache.lucene.util.BytesRefBuilder.append(BytesRefBuilder.java:83)
	at org.apache.lucene.util.BytesRefBuilder.copyBytes(BytesRefBuilder.java:115)
	at org.apache.lucene.analysis.miscellaneous.ConcatenateGraphFilter$BytesRefBuilderTermAttributeImpl.copyTo(ConcatenateGraphFilter.java:380)
	at org.apache.lucene.analysis.miscellaneous.ConcatenateGraphFilter$BytesRefBuilderTermAttributeImpl.clone(ConcatenateGraphFilter.java:386)
	at org.apache.lucene.util.AttributeSource$State.clone(AttributeSource.java:52)
	at org.apache.lucene.util.AttributeSource.captureState(AttributeSource.java:302)
	at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:92)
	at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:70)
	at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:318)
	at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:257)
	at org.apache.lucene.queryparser.classic.QueryParserBase.newFieldQuery(QueryParserBase.java:468)
	at org.apache.lucene.queryparser.classic.QueryParserBase.getFieldQuery(QueryParserBase.java:457)
	at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:824)
	at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:494)
	at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:366)
	at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:251)
	at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:223)
	at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:136)
...
```

We have added a reproducer zip which contains a README that describes how to reproduce the issue.

Reproducer Zip: https://drive.google.com/file/d/1wIbOOZcuEW1uOoTosAtJWxREVwt9imaw/view?usp=sharing
Fuzz target: https://github.com/google/oss-fuzz/blob/master/projects/lucene/QueryParserFuzzer.java

Note: We have updated the fuzz test in the zip file to simplify the debugging process.

OSS-Fuzz issue link: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=60248
Hint: The provided OSS-Fuzz issue link is only accessible if the issue is fixed or you are a maintainer of the OSS-Fuzz project.

### Version and environment details

_No response_
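The shape of the stack trace, a growable buffer fed by attacker-controlled input until the heap is exhausted, is the classic form of this class of DoS. A minimal, hypothetical sketch of a caller-side mitigation (this is not Lucene code; `BoundedBuffer` and its cap are purely illustrative): enforce an input-length cap before growing, so a crafted query fails fast instead of allocating until OOM.

```java
import java.util.Arrays;

// Illustrative only: an exponentially growing byte buffer, like ArrayUtil.grow,
// but with a hard cap so untrusted input cannot drive allocation to an OOM.
class BoundedBuffer {
  private byte[] bytes = new byte[16];
  private int length;
  private final int maxLength;

  BoundedBuffer(int maxLength) {
    this.maxLength = maxLength;
  }

  void append(byte[] src) {
    if (length + src.length > maxLength) {
      // fail fast instead of allocating toward an OutOfMemoryError
      throw new IllegalArgumentException("input exceeds cap of " + maxLength + " bytes");
    }
    if (length + src.length > bytes.length) {
      // grow exponentially for amortized O(1) appends, but never past the cap
      int newSize = Math.min(maxLength, Math.max(bytes.length * 2, length + src.length));
      bytes = Arrays.copyOf(bytes, newSize);
    }
    System.arraycopy(src, 0, bytes, length, src.length);
    length += src.length;
  }

  int length() {
    return length;
  }
}
```

Whether a cap like this belongs in Lucene itself or in the calling application is exactly the policy question raised in issue #12165 above.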
[GitHub] [lucene] benwtrent commented on issue #12502: USearch integration and potential Vector Search performance improvements
benwtrent commented on issue #12502: URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675079917

I don't think we need a native implementation; JNI stuff can be dangerous. I honestly don't know the history around Lucene here, or whether there have ever been considerations in this area before. I think we should work on making vector search better in Java. We have yet to hit the ceiling of vector search and index performance in Java and Lucene. @uschindler what do you think?
[GitHub] [lucene] jbellis commented on issue #12502: USearch integration and potential Vector Search performance improvements
jbellis commented on issue #12502: URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675081860

Hi Ash,

(1) Have you compared USearch directly with Lucene? This could be a useful starting point: https://github.com/jbellis/hnswrecall

(2) My understanding is that it is a design goal for Lucene to have zero external dependencies, but I'm not a committer, so hopefully others will chime in.
[GitHub] [lucene] uschindler commented on issue #12502: USearch integration and potential Vector Search performance improvements
uschindler commented on issue #12502: URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675084211

Yes:
- no external libraries for Lucene Core
- no native code
[GitHub] [lucene] benwtrent opened a new pull request, #12504: ToParentBlockJoin[Byte|Float]KnnVectorQuery needs to handle the case when parents are missing
benwtrent opened a new pull request, #12504: URL: https://github.com/apache/lucene/pull/12504

This is a follow-up to: https://github.com/apache/lucene/pull/12434

Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE. This doesn't really justify a CHANGES update, as it's fixing an unreleased bug introduced by that previous change.
[GitHub] [lucene] benwtrent opened a new issue, #12505: Re-explore the logic around when Vector search should be Exact
benwtrent opened a new issue, #12505: URL: https://github.com/apache/lucene/issues/12505

### Description

Lucene always does an approximate nearest-neighbors search when no filter is provided. This seems like unnecessary work in some cases. Some benchmarks would have to be done, but some ideas I had around options to explore:

- Why not always do exact search when `maxDoc < k`?
- Should the "when to do exact" calculation consider `byte` vs `float` vectors?

It seems weird to go through all the work of going to the graph if there are only 10 documents in a segment.
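The first idea above can be sketched as a tiny decision helper. Names and thresholds are purely illustrative, not Lucene's actual API, and the `k * 10` filter cutoff is an arbitrary placeholder that the proposed benchmarks would have to tune:

```java
// Hypothetical sketch of an "exact vs. approximate" decision for kNN search.
// filteredCount is the number of docs surviving a pre-filter, or -1 if no filter.
final class KnnStrategy {
  private KnnStrategy() {}

  /** Returns true when a brute-force (exact) scan is the better choice. */
  static boolean useExact(int maxDoc, int k, int filteredCount) {
    if (maxDoc <= k) {
      return true; // graph search cannot beat simply scanning <= k docs
    }
    // With a very restrictive filter, scanning the surviving docs directly
    // avoids graph expansions whose results the filter then mostly discards.
    // The factor of 10 is an illustrative placeholder, not a benchmarked value.
    return filteredCount >= 0 && filteredCount <= k * 10;
  }
}
```

A `byte` vs `float` distinction, the second bullet, would presumably enter as different per-comparison costs in the same calculation.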
[GitHub] [lucene] jbellis commented on a diff in pull request #12421: Concurrent hnsw graph and builder, take two
jbellis commented on code in PR #12421: URL: https://github.com/apache/lucene/pull/12421#discussion_r1291603910

## lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentNeighborSet.java:
## @@ -0,0 +1,292 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.PrimitiveIterator;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Function;
+import org.apache.lucene.util.BitSet;
+import org.apache.lucene.util.FixedBitSet;
+
+/** A concurrent set of neighbors. */
+public class ConcurrentNeighborSet {
+  /** the node id whose neighbors we are storing */
+  private final int nodeId;
+
+  /**
+   * We use a copy-on-write NeighborArray to store the neighbors. Even though updating this is
+   * expensive, it is still faster than using a concurrent Collection because "iterate through a
+   * node's neighbors" is a hot loop in adding to the graph, and NeighborArray can do that much
+   * faster: no boxing/unboxing, all the data is stored sequentially instead of having to follow
+   * references, and no fancy encoding necessary for node/score.
+   */
+  private final AtomicReference<ConcurrentNeighborArray> neighborsRef;
+
+  private final NeighborSimilarity similarity;
+
+  /** the maximum number of neighbors we can store */
+  private final int maxConnections;
+
+  public ConcurrentNeighborSet(int nodeId, int maxConnections, NeighborSimilarity similarity) {
+    this.nodeId = nodeId;
+    this.maxConnections = maxConnections;
+    this.similarity = similarity;
+    neighborsRef = new AtomicReference<>(new ConcurrentNeighborArray(maxConnections, true));
+  }
+
+  public PrimitiveIterator.OfInt nodeIterator() {
+    // don't use a stream here. stream's implementation of iterator buffers
+    // very aggressively, which is a big waste for a lot of searches.
+    return new NeighborIterator(neighborsRef.get());
+  }
+
+  public void backlink(Function<Integer, ConcurrentNeighborSet> neighborhoodOf) throws IOException {
+    NeighborArray neighbors = neighborsRef.get();
+    for (int i = 0; i < neighbors.size(); i++) {
+      int nbr = neighbors.node[i];
+      float nbrScore = neighbors.score[i];
+      ConcurrentNeighborSet nbrNbr = neighborhoodOf.apply(nbr);
+      nbrNbr.insert(nodeId, nbrScore);
+    }
+  }
+
+  private static class NeighborIterator implements PrimitiveIterator.OfInt {
+    private final NeighborArray neighbors;
+    private int i;
+
+    private NeighborIterator(NeighborArray neighbors) {
+      this.neighbors = neighbors;
+      i = 0;
+    }
+
+    @Override
+    public boolean hasNext() {
+      return i < neighbors.size();
+    }
+
+    @Override
+    public int nextInt() {
+      return neighbors.node[i++];
+    }
+  }
+
+  public int size() {
+    return neighborsRef.get().size();
+  }
+
+  public int arrayLength() {
+    return neighborsRef.get().node.length;
+  }
+
+  /**
+   * For each candidate (going from best to worst), select it only if it is closer to target than it
+   * is to any of the already-selected candidates. This is maintained whether those other neighbors
+   * were selected by this method, or were added as a "backlink" to a node inserted concurrently
+   * that chose this one as a neighbor.
+   */
+  public void insertDiverse(NeighborArray candidates) {
+    BitSet selected = new FixedBitSet(candidates.size());
+    for (int i = candidates.size() - 1; i >= 0; i--) {
+      int cNode = candidates.node[i];
+      float cScore = candidates.score[i];
+      if (isDiverse(cNode, cScore, candidates, selected)) {
+        selected.set(i);
+      }
+    }
+    insertMultiple(candidates, selected);
+    // This leaves the paper's keepPrunedConnection option out; we might want to add that
+    // as an option in the future.
+  }
+
+  private void insertMultiple(NeighborArray others, BitSet selected) {
+    neighborsRef.getAndUpdate(
+        current -> {
+          ConcurrentNeighborArray next = current.copy();

Review Comment:
Looked at another profile this morning. 99.75% of insertMultiple is score comparisons, for vectors of dimension 256.
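The copy-on-write-via-`AtomicReference` idea under review can be reduced to a tiny stand-alone sketch (simplified, int-only, not the PR's actual code): writers copy the current array inside `getAndUpdate` and compare-and-swap the replacement in, so concurrent readers always iterate an immutable snapshot without locking.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

// Minimal copy-on-write set of ints. Writers never mutate a published array;
// they copy it and CAS the new version in via getAndUpdate. Readers take a
// snapshot with a single volatile read and iterate it lock-free.
final class CowIntSet {
  private final AtomicReference<int[]> ref = new AtomicReference<>(new int[0]);

  void add(int value) {
    ref.getAndUpdate(
        cur -> {
          // copy-on-write: the published array is never modified in place
          int[] next = Arrays.copyOf(cur, cur.length + 1);
          next[cur.length] = value;
          return next;
        });
  }

  int[] snapshot() {
    // immutable by convention: callers must not write into the returned array
    return ref.get();
  }
}
```

Note that `getAndUpdate` may invoke the lambda more than once under contention, which is why the update function must be side-effect free, as it is here and in the PR's `insertMultiple`.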
[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class
sabi0 commented on issue #12501: URL: https://github.com/apache/lucene/issues/12501#issuecomment-1675193472

> > Besides having two implementations ... with the same "Lucene84" name will likely result in a lookup error?
>
> Exactly and because of that its final.

I just do not understand, then, how your suggestion to "subclass default codec. By that it keeps its name and code reading the index will look it up" would work? Having the "official" Lucene95Codec and MyCustomCodec share the same "Lucene95" name will result in a lookup error, won't it? So I have to give my custom codec a new name. And then "plug" it in using some configuration property, I suppose? Or find a way to ensure "my" `META-INF/services` appears on the classpath before lucene-core.jar.
[GitHub] [lucene] ashvardanian commented on issue #12502: USearch integration and potential Vector Search performance improvements
ashvardanian commented on issue #12502: URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675201748

Thank you, @benwtrent, @jbellis, and @uschindler! It's very insightful! [Nmslib.java](https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/util/Nmslib.java) seems like the right place to start.
[GitHub] [lucene] jpountz merged pull request #12415: Optimize disjunction counts.
jpountz merged PR #12415: URL: https://github.com/apache/lucene/pull/12415
[GitHub] [lucene] reta commented on issue #12498: Simplify task executor for concurrent operations
reta commented on issue #12498: URL: https://github.com/apache/lucene/issues/12498#issuecomment-1675403043

> It makes sense to me to push the responsibility of figuring out how to execute tasks to the executor. Also pinging @reta.

Thanks @jpountz, I second that.

> Additionally, I think that we should unconditionally offload execution to the executor when available, even when we have a single slice. It may seem counter intuitive but it's again to be able to determine what type of workload each thread pool performs.

That is one of the difficulties we are dealing with as well; specifically, the exception branching logic has to account for wrapped / unwrapped exceptions.
[GitHub] [lucene-solr] squirmy closed pull request #1681: SOLR-10804: Allow same version updates in DocBasedVersionConstraintsProcessor
squirmy closed pull request #1681: SOLR-10804: Allow same version updates in DocBasedVersionConstraintsProcessor URL: https://github.com/apache/lucene-solr/pull/1681
[GitHub] [lucene] searchivarius commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores
searchivarius commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1675721505

Looking great, many thanks! Could you remind me what "ordered" and "reversed" are? Is this something related to insertion order?