date:20230811

[GitHub] [lucene] henryrneh commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

2023-08-11 Thread via GitHub



henryrneh commented on issue #12165:
URL: https://github.com/apache/lucene/issues/12165#issuecomment-1674365549

   We have now started triaging the bugs found by OSS-Fuzz. The fuzzer has 
discovered multiple issues, for example OutOfMemory or StackOverflow errors, 
that we can disclose one by one or by giving you access via email to the 
OSS-Fuzz platform. Should we disclose them here through public issues, or do 
you prefer the secur...@apache.org mailing list?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

2023-08-11 Thread via GitHub



uschindler commented on issue #12165:
URL: https://github.com/apache/lucene/issues/12165#issuecomment-1674532419

   Just open public issues. 
   
   Actually not all of those errors would be fixed, because Apache Lucene does 
not always do all possible checks, as performance is more important than an OOM 
(caused by "wrong usage").



[GitHub] [lucene] benwtrent merged pull request #12500: Fix flaky testToString method for Knn Vector queries

2023-08-11 Thread via GitHub



benwtrent merged PR #12500:
URL: https://github.com/apache/lucene/pull/12500



[GitHub] [lucene] sabi0 opened a new issue, #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



sabi0 opened a new issue, #12501:
URL: https://github.com/apache/lucene/issues/12501

   ### Description
   
   `Lucene70Codec` had:
   ```
   private final PostingsFormat defaultFormat = 
PostingsFormat.forName("Lucene50");
   ```
   
   In the `Lucene80Codec` PostingsFormat instantiation was moved to the 
constructor. Presumably to pass the additional `fstLoadMode` parameter?
   
https://github.com/apache/lucene/commit/28e8a30b536a39e5539ac6e8b9407d31724c8857#diff-3a74c1b72ab52e54dfcdc9de142b4331b372c11bcca87842b001c30f89ce58ebR98
   
   In a subsequent commit the code was reverted to the default PostingsFormat 
constructor:
   
https://github.com/apache/lucene/commit/651f41e21bd3df98f70d2673295db29506e3d2e6#diff-3a74c1b72ab52e54dfcdc9de142b4331b372c11bcca87842b001c30f89ce58ebR96
   
   But the SPI call `PostingsFormat.forName()` was not restored, and it is still 
missing in `Lucene95Codec`:
   
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95Codec.java#L121
   
   Could the SPI extension point be restored, to allow overriding the 
PostingsFormat without having to override the Codec?



[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



uschindler commented on issue #12501:
URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674943197

   Hi,
   the SPI should only be used when READING indexes. When you create a codec 
for IndexWriter, the codec version hardcodes its postings format and other 
subtypes. As you can see, it is the same for docValuesFormat and other parts.
   
   This allows reading any index, but when you write an index it will use the 
exact codec as specified.
   
   Basically, we made the decision to hardcode the correct classes when writing 
indexes, but to load the codecs dynamically when opening an existing index.
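
The split described above (fixed classes on the write path, name-based lookup on the read path) can be modeled in a few lines. This is a toy sketch in plain Java, not Lucene code: `ToyFormat`, `ToyFormatRegistry`, and `writeSegment` are hypothetical names, and the map only stands in for the SPI lookup.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a postings format; NOT Lucene's PostingsFormat class. */
final class ToyFormat {
  final String name;

  ToyFormat(String name) {
    this.name = name;
  }
}

/** Toy model of "hardcode on write, look up by name on read". */
final class ToyFormatRegistry {
  // Plays the role of the SPI registry: a name resolves to one implementation.
  private static final Map<String, ToyFormat> BY_NAME = new HashMap<>();

  static void register(ToyFormat format) {
    BY_NAME.put(format.name, format);
  }

  /** Write path: the codec chooses its format directly; only the name is recorded. */
  static String writeSegment(ToyFormat hardcoded) {
    return hardcoded.name; // stands in for the name stored with the segment
  }

  /** Read path: resolve whatever name the segment recorded. */
  static ToyFormat forName(String name) {
    ToyFormat format = BY_NAME.get(name);
    if (format == null) {
      throw new IllegalArgumentException("no format registered as: " + name);
    }
    return format;
  }
}
```

In this model the writer never consults the registry, so registering something else cannot change what gets written; the registry matters only when an existing segment is opened.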



[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



sabi0 commented on issue #12501:
URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674950075

   I see. Thank you for the explanation.
   The commits that change this behavior did not say anything about this.
   So it looked like this SPI extension point loss was an unwitting side-effect 
of a sequence of refactorings.



[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



uschindler commented on issue #12501:
URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674955262

   > So it looked like this SPI extension point loss was an unwitting 
side-effect of a sequence of refactorings.
   
   No, the SPI `forName` does not allow you to change the implementation, as 
there can only be one "Lucene50" implementation on the classpath. If you want to 
have another codec, it must have a new name and must therefore be passed via 
IndexWriterConfig for (new) indexes.
   
   > private final PostingsFormat defaultFormat = 
PostingsFormat.forName("Lucene50");
   
   This does not allow you to override the format; you still need to subclass 
the codec, as the name "Lucene50" is part of Lucene core and can't be replaced. So 
it will always load the hardcoded 5.0 codec.



[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



uschindler commented on issue #12501:
URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674957426

   For the FST loading mode mentioned above, the codec does not need to be 
changed; you can tell DirectoryReader which FST load mode to use via attributes.



[GitHub] [lucene] uschindler closed issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



uschindler closed issue #12501: Default PostingsFormat lost the SPI extension 
point in the Codec class
URL: https://github.com/apache/lucene/issues/12501



[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



uschindler commented on issue #12501:
URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674970444

   The general rule is: if you want to change the index postings format (but 
nothing else, like the codec itself) when writing a new index, you need to subclass 
the default codec. That way it keeps its name, and code reading the index will look 
it up. If you add a completely new postings format, subclass the abstract base 
class, give it a new name, and register it in SPI. If you just want to change 
settings, you can reuse the postings format by instantiating it in the codec 
with different settings.



[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



sabi0 commented on issue #12501:
URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674983089

   The codec classes are `final`.
   Besides having two implementations (`Lucene84PostingsFormat` in lucene-core 
and `MyLucene84PostingsFormat`) with the same "Lucene84" name will likely 
result in a lookup error?



[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub

uschindler commented on issue #12501:
URL: https://github.com/apache/lucene/issues/12501#issuecomment-1674996502

> The postings format classes are `final`. Besides having two
implementations (`Lucene84PostingsFormat` in lucene-core and
`MyLucene84PostingsFormat`) with the same "Lucene84" name will likely result in
a lookup error?

Exactly, and because of that it's final. Your postings format needs a new name.

If you want to use it as default you can subclass the `Codec` as this is the
main entry point:
https://github.com/apache/lucene/blob/df8745e59ee65f276ccaefa87480e1fd85facb56/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95Codec.java#L55

I don't understand what your problem is:
- Write your own postings format with its own name. You can't subclass, but
you can use a FilterPostingsFormat to wrap the default postings format.
- Subclass
https://github.com/apache/lucene/blob/df8745e59ee65f276ccaefa87480e1fd85facb56/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95Codec.java#L55
and return/inject your codec there. An alternative that does not depend so
statically on the exact version is to use `new FilterCodec(Codec.getDefault()) {
... override method returning postingsFormat()... }`.
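
The wrapper approach suggested above is ordinary delegation: forward every call to the default and override only the one you want to change. The following is a minimal sketch of that pattern in plain Java, not Lucene's API; `ToyCodec`, `DefaultToyCodec`, `FilterToyCodec`, and `MyToyCodec` are hypothetical names, and the toy wrapper takes its own name on the assumption that a codec written into an index has to be resolvable under a name distinct from the default's.

```java
/** Toy codec interface; NOT Lucene's Codec class. */
interface ToyCodec {
  String name();

  String postingsFormat();

  String docValuesFormat();
}

/** Stand-in for a library's final default codec. */
final class DefaultToyCodec implements ToyCodec {
  public String name() {
    return "Default";
  }

  public String postingsFormat() {
    return "DefaultPostings";
  }

  public String docValuesFormat() {
    return "DefaultDocValues";
  }
}

/** Forwards everything to a delegate; subclasses override only what they change. */
abstract class FilterToyCodec implements ToyCodec {
  private final String name;
  private final ToyCodec delegate;

  FilterToyCodec(String name, ToyCodec delegate) {
    this.name = name;
    this.delegate = delegate;
  }

  public String name() {
    return name; // the wrapper's own name, not the delegate's
  }

  public String postingsFormat() {
    return delegate.postingsFormat();
  }

  public String docValuesFormat() {
    return delegate.docValuesFormat();
  }
}

/** Custom codec: new name, custom postings format, everything else inherited. */
final class MyToyCodec extends FilterToyCodec {
  MyToyCodec() {
    super("MyCodec", new DefaultToyCodec());
  }

  @Override
  public String postingsFormat() {
    return "MyPostings";
  }
}
```

Because the wrapper holds the default as a delegate rather than extending a specific versioned class, it keeps working when the default implementation is swapped out, which is the version independence the comment refers to.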


[GitHub] [lucene] ashvardanian opened a new issue, #12502: USearch integration and potential Vector Search performance improvements

2023-08-11 Thread via GitHub



ashvardanian opened a new issue, #12502:
URL: https://github.com/apache/lucene/issues/12502

   ### Description
   
   I was recently approached by Lucene and Elastic users, facing low 
performance and high memory consumption issues, running Vector Search tasks on 
JVM. Some have also been using native libraries, like our 
[USearch](https://github.com/unum-cloud/usearch), and were curious if those 
systems can be combined. Hence, here I am, excited to open a discussion 🤗 
   
   cc @jbellis, @benwtrent, @alessandrobenedetti, @msokolov
   
   ---
   
   I have looked into the existing HNSW implementation and related PR - #10047. 
The integration should be simple, assuming [we already have a JNI, that passes 
CI and is hosted on 
GitHub](https://github.com/unum-cloud/usearch/packages/1867475). The upside 
would be:
   
   - the performance won't be just on par with FAISS but can be higher.
   - cross-platform `f16` support and `i8` optional automatic downcasting.
   - indexes can be memory-mapped from disk without loading into RAM and are 
about to receive many `io_uring`-based kernel-bypass tricks, similar to what we 
have in [UCall](https://github.com/unum-cloud/ucall).
   
   ---
   
   This may automatically resolve the following issues (in reverse 
chronological order):
   
   - [x] half-precision support: #12403
   - [x] multi-key support: #12313 
   - [x] pluggable metrics, similar to our JIT support in Python: #12219
   - [x] 2K+ dimensional vectors: #11507
   - [x] compact offsets with `uint40_t`: #10884
   - [x] memory consumption: #10177
   
   ---
   
   As far as I understand, it is not common to integrate Lucene with native 
libraries, but it seems like it can be justified in such 
computationally-intensive workloads. 
   
   |              | FAISS, `f32` | USearch, `f32` | USearch, `f16` | USearch, `i8`     |
   | :----------- | -----------: | -------------: | -------------: | ----------------: |
   | Batch Insert |       16 K/s |         73 K/s |        100 K/s | 104 K/s **+550%** |
   | Batch Search |       82 K/s |        103 K/s |        113 K/s | 134 K/s **+63%**  |
   | Bulk Insert  |       76 K/s |        105 K/s |        115 K/s | 202 K/s **+165%** |
   | Bulk Search  |      118 K/s |        174 K/s |        173 K/s | 304 K/s **+157%** |
   | Recall @ 10  |          99% |          99.2% |          99.1% |             99.2% |
   
   > Dataset: 1M vectors sample of the Deep1B dataset. Hardware: `c7g.metal` 
AWS instance with 64 cores and DDR5 memory. HNSW was configured with identical 
hyper-parameters: connectivity `M=16`, expansion @ construction 
`efConstruction=128`, and expansion @ search `ef=64`. Batch size is 256. Both 
libraries were compiled for the target architecture.
   
   I am happy to contribute, and looking forward to your comments 🤗
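
The bold percentages in the benchmark table above appear to compare the USearch `i8` column against the FAISS `f32` column. As a quick check under that assumption, the relative gain, truncated to a whole percent, reproduces all four figures. `Speedup` and `percentGain` are illustrative names, not part of either library.

```java
/** Helper for sanity-checking the table's relative gains. */
final class Speedup {
  /** Gain of value over baseline as a percentage, truncated toward zero. */
  static long percentGain(long baseline, long value) {
    return (value - baseline) * 100 / baseline;
  }
}
```

For example, Batch Insert goes from 16 K/s to 104 K/s, so `percentGain(16, 104)` is `8800 / 16 = 550`.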



[GitHub] [lucene] henryrneh opened a new issue, #12503: OutOfMemoryError found by OSS-Fuzz (issue 60248)

2023-08-11 Thread via GitHub



henryrneh opened a new issue, #12503:
URL: https://github.com/apache/lucene/issues/12503

   ### Description
   
   Dear Apache Lucene maintainers,
   
   The OutOfMemoryError is triggered at this 
[line](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java#L400)
 by the parse() function of QueryParser when it processes crafted untrusted 
input.
   
   We have reviewed the finding and it might be security-related due to the 
potential of a denial of service. We would appreciate it if you could take a 
look at the finding. Do you see a risk that this might be exploited by 
untrusted input?
   
   Part of the stack trace:
   == Java Exception: com.code_intelligence.jazzer.api.FuzzerSecurityIssueLow: Out of memory (use '-Xmx1710m' to reproduce)
   Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.ArrayUtil.growExact(ArrayUtil.java:400)
    at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:412)
    at org.apache.lucene.util.BytesRefBuilder.grow(BytesRefBuilder.java:60)
    at org.apache.lucene.util.BytesRefBuilder.append(BytesRefBuilder.java:71)
    at org.apache.lucene.util.BytesRefBuilder.append(BytesRefBuilder.java:78)
    at org.apache.lucene.util.BytesRefBuilder.append(BytesRefBuilder.java:83)
    at org.apache.lucene.util.BytesRefBuilder.copyBytes(BytesRefBuilder.java:115)
    at org.apache.lucene.analysis.miscellaneous.ConcatenateGraphFilter$BytesRefBuilderTermAttributeImpl.copyTo(ConcatenateGraphFilter.java:380)
    at org.apache.lucene.analysis.miscellaneous.ConcatenateGraphFilter$BytesRefBuilderTermAttributeImpl.clone(ConcatenateGraphFilter.java:386)
    at org.apache.lucene.util.AttributeSource$State.clone(AttributeSource.java:52)
    at org.apache.lucene.util.AttributeSource.captureState(AttributeSource.java:302)
    at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:92)
    at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:70)
    at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:318)
    at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:257)
    at org.apache.lucene.queryparser.classic.QueryParserBase.newFieldQuery(QueryParserBase.java:468)
    at org.apache.lucene.queryparser.classic.QueryParserBase.getFieldQuery(QueryParserBase.java:457)
    at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:824)
    at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:494)
    at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:366)
    at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:251)
    at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:223)
    at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:136)
   ...
   
   We have added a reproducer zip which contains a README that describes how to 
reproduce the issue.
   Reproducer Zip: 
https://drive.google.com/file/d/1wIbOOZcuEW1uOoTosAtJWxREVwt9imaw/view?usp=sharing
   
   Fuzz target: 
https://github.com/google/oss-fuzz/blob/master/projects/lucene/QueryParserFuzzer.java
   Note: We have updated the fuzz test in the zip file to simplify the 
debugging process.
   
   OSS-Fuzz issue link: 
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=60248
   
   Hint: The provided OSS-Fuzz Issue link is only accessible if the issue is 
fixed or you are the maintainer of the OSS-Fuzz project.
   
   ### Version and environment details
   
   _No response_



[GitHub] [lucene] benwtrent commented on issue #12502: USearch integration and potential Vector Search performance improvements

2023-08-11 Thread via GitHub



benwtrent commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675079917

   I don't think we need a native implementation. JNI stuff can be dangerous. I 
honestly don't know the history around Lucene and if there have ever been 
considerations in the area before. 
   
   I think we should work on making vector search better in Java. We have yet 
to hit the ceiling here in vector search & index performance in Java and Lucene.
   
   @uschindler what do you think?



[GitHub] [lucene] jbellis commented on issue #12502: USearch integration and potential Vector Search performance improvements

2023-08-11 Thread via GitHub



jbellis commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675081860

   Hi Ash,
   
   (1) Have you compared usearch directly with Lucene?  This could be a useful 
starting point: https://github.com/jbellis/hnswrecall
   
   (2) My understanding is that it is a design goal for Lucene to have zero 
external dependencies at all, but I'm not a committer so hopefully others will 
chime in.



[GitHub] [lucene] uschindler commented on issue #12502: USearch integration and potential Vector Search performance improvements

2023-08-11 Thread via GitHub



uschindler commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675084211

   Yes:
   - no external libraries for Lucene Core
   - no native code



[GitHub] [lucene] benwtrent opened a new pull request, #12504: ToParentBlockJoin[Byte|Float]KnnVectorQuery needs to handle the case when parents are missing

2023-08-11 Thread via GitHub



benwtrent opened a new pull request, #12504:
URL: https://github.com/apache/lucene/pull/12504

   This is a follow up to: https://github.com/apache/lucene/pull/12434
   
   Adds a test for when parents are missing in the index and verifies we return 
no hits. Previously this would have thrown an NPE.
   
   Doesn't really justify a CHANGES update, as it's fixing an unreleased bug due 
to this previous change.



[GitHub] [lucene] benwtrent opened a new issue, #12505: Re-explore the logic around when Vector search should be Exact

2023-08-11 Thread via GitHub



benwtrent opened a new issue, #12505:
URL: https://github.com/apache/lucene/issues/12505

   ### Description
   
   Lucene always does an approximate nearest neighbors search when no filter is 
provided. 
   
   This seems like unnecessary work. Some benchmarks would have to be done, but 
some ideas I had around options to explore:
   
   - Why not always do exact when `maxDoc < k`?
   - Should the "when to do exact" calculation consider `byte` vs `float` 
vectors?
   
   It seems weird to go through all the work of going to the graph if there are 
only 10 documents in a segment.
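
To make the trade-off concrete, here is a rough sketch of what "exact" means for a tiny segment: score every vector and sort, with no graph involved. It is plain Java rather than Lucene code; `ExactKnn`, `topK`, and `preferExact` are hypothetical names, dot product is assumed as the similarity, and the `maxDoc < k` check simply restates the first idea in the list above rather than a tuned threshold.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Brute-force nearest-neighbor search over a small in-memory set of vectors. */
final class ExactKnn {
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Returns the ids of the k highest-scoring vectors, best first. */
  static int[] topK(float[][] vectors, float[] query, int k) {
    Integer[] ids = new Integer[vectors.length];
    float[] scores = new float[vectors.length];
    for (int i = 0; i < vectors.length; i++) {
      ids[i] = i;
      scores[i] = dot(vectors[i], query);
    }
    // Sort ids by descending score; the sort is stable, so ties keep id order.
    Arrays.sort(ids, Comparator.comparingDouble((Integer id) -> -scores[id]));
    int n = Math.min(k, ids.length);
    int[] result = new int[n];
    for (int i = 0; i < n; i++) {
      result[i] = ids[i];
    }
    return result;
  }

  /** Hypothetical heuristic: with fewer docs than k, every doc is a hit anyway. */
  static boolean preferExact(int maxDoc, int k) {
    return maxDoc < k;
  }
}
```

For a segment of 10 documents this is 10 similarity computations and one small sort, which is the baseline any benchmark of the graph path would have to beat.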



[GitHub] [lucene] jbellis commented on a diff in pull request #12421: Concurrent hnsw graph and builder, take two

2023-08-11 Thread via GitHub



jbellis commented on code in PR #12421:
URL: https://github.com/apache/lucene/pull/12421#discussion_r1291603910


##
lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentNeighborSet.java:
##
@@ -0,0 +1,292 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.PrimitiveIterator;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Function;
+import org.apache.lucene.util.BitSet;
+import org.apache.lucene.util.FixedBitSet;
+
+/** A concurrent set of neighbors. */
+public class ConcurrentNeighborSet {
+  /** the node id whose neighbors we are storing */
+  private final int nodeId;
+
+  /**
+   * We use a copy-on-write NeighborArray to store the neighbors. Even though updating this is
+   * expensive, it is still faster than using a concurrent Collection because "iterate through a
+   * node's neighbors" is a hot loop in adding to the graph, and NeighborArray can do that much
+   * faster: no boxing/unboxing, all the data is stored sequentially instead of having to follow
+   * references, and no fancy encoding necessary for node/score.
+   */
+  private final AtomicReference<ConcurrentNeighborArray> neighborsRef;
+
+  private final NeighborSimilarity similarity;
+
+  /** the maximum number of neighbors we can store */
+  private final int maxConnections;
+
+  public ConcurrentNeighborSet(int nodeId, int maxConnections, NeighborSimilarity similarity) {
+    this.nodeId = nodeId;
+    this.maxConnections = maxConnections;
+    this.similarity = similarity;
+    neighborsRef = new AtomicReference<>(new ConcurrentNeighborArray(maxConnections, true));
+  }
+
+  public PrimitiveIterator.OfInt nodeIterator() {
+    // don't use a stream here. stream's implementation of iterator buffers
+    // very aggressively, which is a big waste for a lot of searches.
+    return new NeighborIterator(neighborsRef.get());
+  }
+
+  public void backlink(Function<Integer, ConcurrentNeighborSet> neighborhoodOf) throws IOException {
+    NeighborArray neighbors = neighborsRef.get();
+    for (int i = 0; i < neighbors.size(); i++) {
+      int nbr = neighbors.node[i];
+      float nbrScore = neighbors.score[i];
+      ConcurrentNeighborSet nbrNbr = neighborhoodOf.apply(nbr);
+      nbrNbr.insert(nodeId, nbrScore);
+    }
+  }
+
+  private static class NeighborIterator implements PrimitiveIterator.OfInt {
+    private final NeighborArray neighbors;
+    private int i;
+
+    private NeighborIterator(NeighborArray neighbors) {
+      this.neighbors = neighbors;
+      i = 0;
+    }
+
+    @Override
+    public boolean hasNext() {
+      return i < neighbors.size();
+    }
+
+    @Override
+    public int nextInt() {
+      return neighbors.node[i++];
+    }
+  }
+
+  public int size() {
+    return neighborsRef.get().size();
+  }
+
+  public int arrayLength() {
+    return neighborsRef.get().node.length;
+  }
+
+  /**
+   * For each candidate (going from best to worst), select it only if it is closer to target than it
+   * is to any of the already-selected candidates. This is maintained whether those other neighbors
+   * were selected by this method, or were added as a "backlink" to a node inserted concurrently
+   * that chose this one as a neighbor.
+   */
+  public void insertDiverse(NeighborArray candidates) {
+    BitSet selected = new FixedBitSet(candidates.size());
+    for (int i = candidates.size() - 1; i >= 0; i--) {
+      int cNode = candidates.node[i];
+      float cScore = candidates.score[i];
+      if (isDiverse(cNode, cScore, candidates, selected)) {
+        selected.set(i);
+      }
+    }
+    insertMultiple(candidates, selected);
+    // This leaves the paper's keepPrunedConnection option out; we might want to add that
+    // as an option in the future.
+  }
+
+  private void insertMultiple(NeighborArray others, BitSet selected) {
+    neighborsRef.getAndUpdate(
+        current -> {
+          ConcurrentNeighborArray next = current.copy();

Review Comment:
   Looked at another profile this morning.  99.75% of insertMultiple is score 
comparisons, for vectors of dimension 256.
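
For readers unfamiliar with the pattern under review, the copy-on-write update via `AtomicReference.getAndUpdate` can be sketched in isolation. This is a simplified stand-in, not the PR's code: `CowIntSet` is a hypothetical class that keeps a sorted `int[]` instead of a `ConcurrentNeighborArray`, and it does no scoring at all.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

/** Sorted set of ints with lock-free, copy-on-write inserts. */
final class CowIntSet {
  private final AtomicReference<int[]> ref = new AtomicReference<>(new int[0]);

  /** Copies the current array, inserts into the copy, and publishes it atomically. */
  void insert(int value) {
    ref.getAndUpdate(
        current -> {
          int pos = Arrays.binarySearch(current, value);
          if (pos >= 0) {
            return current; // already present, nothing to publish
          }
          int at = -pos - 1; // insertion point that keeps the array sorted
          int[] next = new int[current.length + 1];
          System.arraycopy(current, 0, next, 0, at);
          next[at] = value;
          System.arraycopy(current, at, next, at + 1, current.length - at);
          return next;
        });
  }

  /** Readers get an immutable-by-convention snapshot; callers must not modify it. */
  int[] snapshot() {
    return ref.get();
  }
}
```

The update function may run more than once if another thread publishes first, so it must be free of side effects. That retry cost is also why it matters what dominates inside the lambda: if nearly all the time goes to score comparisons, as the profile suggests, a lost race repeats that work.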




[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

2023-08-11 Thread via GitHub



sabi0 commented on issue #12501:
URL: https://github.com/apache/lucene/issues/12501#issuecomment-1675193472

   > > Besides having two implementations ... with the same "Lucene84" name 
will likely result in a lookup error?
   > Exactly and because of that its final.
   
   I still do not understand how your suggestion to "subclass default 
codec. By that it keeps its name and code reading the index will look it up" 
would work.
   Having the "official" Lucene95Codec and MyCustomCodec share the same 
"Lucene95" name will result in a lookup error, won't it?
   
   So I have to give my custom codec a new name, and then "plug" it in using 
some configuration property, I suppose?
   Or find a way to ensure "my" `META-INF/services` entry appears on the 
classpath before lucene-core.jar?
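
   The naming concern can be made concrete with a small, hypothetical sketch. `NamedRegistrySketch` is not Lucene's `Codec` or its SPI loader; it only models a name-keyed lookup under the assumption that duplicate names are rejected, which is the behavior the question expects. A real loader may follow a different policy, for example keeping the first entry it finds on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical name-keyed registry. NOT Lucene's Codec lookup or SPI loader. */
final class NamedRegistrySketch {
  private final Map<String, String> implByName = new LinkedHashMap<>();

  /** Registers an implementation class name under a codec name; duplicate names are rejected. */
  void register(String name, String implClass) {
    if (implByName.containsKey(name)) {
      throw new IllegalArgumentException("duplicate name: " + name);
    }
    implByName.put(name, implClass);
  }

  /** Resolves the name stored with an index back to an implementation class name. */
  String lookup(String name) {
    String impl = implByName.get(name);
    if (impl == null) {
      throw new IllegalArgumentException("no implementation named " + name);
    }
    return impl;
  }
}
```

   Under this assumed policy, registering a second implementation as "Lucene95" fails, while registering it under a new name such as "MyCustom" works, and anything written with that codec is then resolved through the new name.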



[GitHub] [lucene] ashvardanian commented on issue #12502: USearch integration and potential Vector Search performance improvements

2023-08-11 Thread via GitHub



ashvardanian commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675201748

   Thank you, @benwtrent, @jbellis, and @uschindler! It's very insightful! 
[Nmslib.java](https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/util/Nmslib.java) seems like the right place to start.



[GitHub] [lucene] jpountz merged pull request #12415: Optimize disjunction counts.

2023-08-11 Thread via GitHub



jpountz merged PR #12415:
URL: https://github.com/apache/lucene/pull/12415



[GitHub] [lucene] reta commented on issue #12498: Simplify task executor for concurrent operations

2023-08-11 Thread via GitHub



reta commented on issue #12498:
URL: https://github.com/apache/lucene/issues/12498#issuecomment-1675403043

   > It makes sense to me to push the responsibility of figuring out how to 
execute tasks to the executor. Also pinging @reta.
   
   Thanks @jpountz , I second that
   
   > Additionally, I think that we should unconditionally offload execution to 
the executor when available, even when we have a single slice. It may seem 
counter intuitive but it's again to be able to determine what type of workload 
each thread pool performs.
   
   That is one of the difficulties we are dealing with as well; specifically, the 
exception branching logic has to account for wrapped / unwrapped exceptions.
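
   For readers unfamiliar with the wrapped / unwrapped distinction, here is a minimal sketch using only `java.util.concurrent`. It is not Lucene's task executor; `ExecutorWrappingSketch` and its methods are hypothetical names. The same failing task surfaces its exception directly when run on the caller thread, but arrives wrapped in an `ExecutionException` when it goes through an `ExecutorService`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch contrasting direct execution with execution on an executor. */
final class ExecutorWrappingSketch {

  /** Runs the task on the caller thread: the original exception is seen unwrapped. */
  static String runDirect(Callable<String> task) {
    try {
      return task.call();
    } catch (Exception e) {
      return "direct:" + e.getClass().getSimpleName();
    }
  }

  /** Runs the task on the executor: Future.get() wraps the failure in ExecutionException. */
  static String runOnExecutor(ExecutorService executor, Callable<String> task) {
    try {
      return executor.submit(task).get();
    } catch (ExecutionException e) {
      return "executor:" + e.getClass().getSimpleName()
          + "(" + e.getCause().getClass().getSimpleName() + ")";
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
      return "interrupted";
    }
  }

  /** Runs the same failing task both ways and returns the two observations. */
  static String[] demo() {
    Callable<String> failing = () -> {
      throw new IllegalStateException("boom");
    };
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      return new String[] {runDirect(failing), runOnExecutor(executor, failing)};
    } finally {
      executor.shutdown();
    }
  }
}
```

   If execution always went through the executor, callers would only need the `ExecutionException` branch and could unwrap `getCause()` in one place.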



[GitHub] [lucene-solr] squirmy closed pull request #1681: SOLR-10804: Allow same version updates in DocBasedVersionConstraintsProcessor

2023-08-11 Thread via GitHub



squirmy closed pull request #1681: SOLR-10804: Allow same version updates in 
DocBasedVersionConstraintsProcessor
URL: https://github.com/apache/lucene-solr/pull/1681



[GitHub] [lucene] searchivarius commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

2023-08-11 Thread via GitHub



searchivarius commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1675721505

   Looking great, many thanks! Could you remind me what "ordered" and 
"reversed" mean? Is this something related to insertion order?



[GitHub] [lucene] henryrneh commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

[GitHub] [lucene] uschindler commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

[GitHub] [lucene] benwtrent merged pull request #12500: Fix flaky testToString method for Knn Vector queries

[GitHub] [lucene] sabi0 opened a new issue, #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] uschindler closed issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] uschindler commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] ashvardanian opened a new issue, #12502: USearch integration and potential Vector Search performance improvements

[GitHub] [lucene] henryrneh opened a new issue, #12503: OutOfMemoryrror found by OSS-Fuzz (issue 60248)

[GitHub] [lucene] benwtrent commented on issue #12502: USearch integration and potential Vector Search performance improvements

[GitHub] [lucene] jbellis commented on issue #12502: USearch integration and potential Vector Search performance improvements

[GitHub] [lucene] uschindler commented on issue #12502: USearch integration and potential Vector Search performance improvements

[GitHub] [lucene] benwtrent opened a new pull request, #12504: ToParentBlockJoin[Byte|Float]KnnVectorQuery needs to handle the case when parents are missing

[GitHub] [lucene] benwtrent opened a new issue, #12505: Re-explore the logic around when Vector search should be Exact

[GitHub] [lucene] jbellis commented on a diff in pull request #12421: Concurrent hnsw graph and builder, take two

[GitHub] [lucene] sabi0 commented on issue #12501: Default PostingsFormat lost the SPI extension point in the Codec class

[GitHub] [lucene] ashvardanian commented on issue #12502: USearch integration and potential Vector Search performance improvements

[GitHub] [lucene] jpountz merged pull request #12415: Optimize disjunction counts.

[GitHub] [lucene] reta commented on issue #12498: Simplify task executor for concurrent operations

[GitHub] [lucene-solr] squirmy closed pull request #1681: SOLR-10804: Allow same version updates in DocBasedVersionConstraintsProcessor

[GitHub] [lucene] searchivarius commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

26 matches
