Re: [I] HnwsGraph creates disconnected components [lucene]

2023-10-20 Thread via GitHub
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1772314503 Thanks @msokolov : These are really good suggestions. I will try to incorporate these ideas in solutions. I think in the end there can be multiple ways to allow more connecti

Re: [I] Optimize FST suffix sharing for block tree index [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on issue #12702: URL: https://github.com/apache/lucene/issues/12702#issuecomment-1772457667 > The floor data is guaranteed to be stored within single arc (never be prefix shared) in FST because fp is encoded before it. But won't the leading bytes of `fp` be shared

[PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty opened a new pull request, #12703: URL: https://github.com/apache/lucene/pull/12703 [ This PR is draft - not ready to be merged. It is intended to help facilitate a discussion ] This PR enhances the vector similarity functions so that they can access the underlying memor

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772530717 Some benchmark results. Mac M2, 128 bit ``` INFO: Java vector incubator API enabled; uses preferredBitSize=128 ... Benchmark (si

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772535368 Thanks @rmuir @gf2121 I need to spend a bit more time evaluating this. But it looks like no action is needed here? -- This is an automated message from the Apache Git Service. To res

Re: [PR] Fix index out of bounds when writing FST to different metaOut (#12697) [lucene]

2023-10-20 Thread via GitHub
mikemccand merged PR #12698: URL: https://github.com/apache/lucene/pull/12698

Re: [I] ArrayIndexOutOfBoundsException when writing the FSTStore-backed FST with different DataOutput for meta [lucene]

2023-10-20 Thread via GitHub
mikemccand closed issue #12697: ArrayIndexOutOfBoundsException when writing the FSTStore-backed FST with different DataOutput for meta URL: https://github.com/apache/lucene/issues/12697

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub
bruno-roustant commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366758770 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -20,76 +20,161 @@ import org.apache.lucene.util.packed.PackedInts; import org.apache.l

Re: [PR] Fix index out of bounds when writing FST to different metaOut (#12697) [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on PR #12698: URL: https://github.com/apache/lucene/pull/12698#issuecomment-1772566523 Thanks @dungba88 -- I merged to `main` and `branch_9x`!

Re: [PR] Random access term dictionary [lucene]

2023-10-20 Thread via GitHub
bruno-roustant commented on PR #12688: URL: https://github.com/apache/lucene/pull/12688#issuecomment-1772589968 I'll also try to review! On the bit packing subject, I have some handy generic code (not in Lucene yet) to write and read variable size bits. Tell me if you are interested.

Re: [PR] [DRAFT] Concurrent HNSW Merge [lucene]

2023-10-20 Thread via GitHub
benwtrent commented on PR #12660: URL: https://github.com/apache/lucene/pull/12660#issuecomment-1772603178 This is awesome. I am so happy it's a clean change without tons of complexity and we still get 4x speed up with additional threads. I will give it a review this weekend or early

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366900931 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -135,32 +123,28 @@ public class FSTCompiler { * Instantiates an FST/FSA builder with

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366903022 ## lucene/core/src/java/org/apache/lucene/util/packed/AbstractPagedMutable.java: ## @@ -110,8 +110,10 @@ protected long baseRamBytesUsed() { public long ramBytes

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366910269 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode node) { h += 17;

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772656211 @ChrisHegarty there are plenty of actions we could take... but I implemented this specific same optimization in question safely in #12681 See https://en.wikipedia.org/wiki/Advanced_

[I] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-20 Thread via GitHub
mikemccand opened a new issue, #12704: URL: https://github.com/apache/lucene/issues/12704 ### Description Spinoff from [this cool comment](https://github.com/apache/lucene/pull/12633#discussion_r1366847986), thanks to hashing guru @bruno-roustant: ``` Instead, we should mul

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366913164 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode node) { h += 17;

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772661255 to have any decent performance, we really need information on the CPU in question and its vector capabilities. And the idea you can write "one loop" that "runs anywhere" is an obvious pipe

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772669239 also i think JMH is bad news when it comes to downclocking. It does not show the true performance impact of this. It slows down other things on the machine as well: the user might have oth

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1772672994 I'll also confirm `Test2BFST` still passes ... soon this test will no longer require a 35 GB heap to run!

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772673981 > to have any decent performance, we really need information on the CPU in question and its vector capabilities. And the idea you can write "one loop" that "runs anywhere" is an obvio

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772679957 > Unfortunately this approach is slightly suboptimal for your Rocket Lake which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe. We

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772695285 Vector API should also fix its bugs. It is totally senseless to have `IntVector.SPECIES_PREFERRED` and `FloatVector.SPECIES_PREFERRED` and then always set them to '512' on every avx-512 ma

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772702122 I would really just fix the api: instead of `IntVector.SPECIES_PREFERRED` constant which is meaningless, it should be a method taking `VectorOperation...` about how you plan to use it. it

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772704049 Responding top to bottom, > I wonder how much the speed difference is due to (1) Vectors being out of memory (and if they used PQ for diskann, if they did, we should test PQ w

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772706786 such a method would solve 95% of my problems, if it would throw UnsupportedOperationException or return `null` if the hardware/hotspot doesn't support all the requested VectorOperators.

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772711571 > DiskANN is known to be slower at indexing than HNSW I don't remember the numbers here, maybe 10% slower? It wasn't material enough to make me worry about it. (This is wit

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772722737 > It is possible that the candidate postings (gathered via HNSW) don't contain ANY filtered docs. This would require gathering more candidate postings. This was a big problem

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772724758 > Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that wraps jvector? (https://github.com/jbellis/jvector) I'm happy to support anyone who wants to try t

Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-20 Thread via GitHub
easyice commented on PR #12658: URL: https://github.com/apache/lucene/pull/12658#issuecomment-1772756782 I think we can only use this optimization without deleted docs for merges, because we can't use the cardinality of `liveDocs` as docCount, the `liveDocs` is set to 1 when initialized.

Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-20 Thread via GitHub
easyice commented on code in PR #12658: URL: https://github.com/apache/lucene/pull/12658#discussion_r1366998517 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java: ## @@ -519,9 +526,8 @@ private Runnable writeFieldNDims( // compute the min/max for this slice

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772787526 Thanks for investigating this! Can we just fix vector code to take MemorySegment and wrap array code? I don't think we should add yet another factor to multiply the number of vector

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772792213 as far as performance in practice, what kind of alignment is necessary such that it is reasonable for mmap'd files? Please, let it not be 64 bytes alignment for avx-512, that's too wastefu

Re: [I] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-20 Thread via GitHub
bruno-roustant commented on issue #12704: URL: https://github.com/apache/lucene/issues/12704#issuecomment-1772810122 @dweiss will probably say more than me about the awesome BitMixer#PHI_C64 constant!

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on PR #12653: URL: https://github.com/apache/lucene/pull/12653#issuecomment-1772824555 Thanks @shubhamvishu -- looks great! I plan to merge later today.

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on code in PR #12653: URL: https://github.com/apache/lucene/pull/12653#discussion_r1367036009 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java: ## @@ -63,24 +63,23 @@ public abstract class MultiLevelSkipListWriter { /** for e

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-20 Thread via GitHub
shubhamvishu commented on code in PR #12679: URL: https://github.com/apache/lucene/pull/12679#discussion_r1367106152 ## lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Scorer should sum up scores into a double [lucene]

2023-10-20 Thread via GitHub
shubhamvishu commented on PR #12682: URL: https://github.com/apache/lucene/pull/12682#issuecomment-1772913419 Thanks for the approval @jpountz !

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772915097 > > Unfortunately this approach is slightly suboptimal for your Rocket Lake which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe. > > W

Re: [PR] SOLR-15055 Re-implement 'withCollection' and 'maxShardsPerNode' [lucene-solr]

2023-10-20 Thread via GitHub
ljak commented on PR #2179: URL: https://github.com/apache/lucene-solr/pull/2179#issuecomment-1772924915 Hi, I know it's an old thread but I have a question. As far as I can tell (after searching), the `maxShardsPerNode` function wasn't re-implemented right (in the new autoscal

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772952661 Hi, > The jvm already has these. For example a user can set max vector width and avx instruction level already. I assume that avx 512 users who are running on downclock-suscept

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772963134 @rmuir If I understand your comment correctly. I unaligned the vector data in the mmap file, in the benchmark. The results are similar enough to the aligned, maybe a little less wh

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772974346 > Thanks for investigating this! Can we just fix vector code to take MemorySegment and wrap array code? Yes, that is a good idea. I'll do it and see how poorly it performs. I

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772982177 Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level.

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1772988443 `Test2BFST` passed! ``` The slowest tests (exceeding 500 ms) during this run:

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub
mikemccand merged PR #12633: URL: https://github.com/apache/lucene/pull/12633

Re: [I] Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality [lucene]

2023-10-20 Thread via GitHub
mikemccand commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1772993863 I've merged the change into `main`! I'll let it bake for some time (week or two?) and if all looks good, backport to 9.x.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1773038045 > Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level. Let's write a proposal together i

Re: [PR] Random access term dictionary [lucene]

2023-10-20 Thread via GitHub
Tony-X commented on PR #12688: URL: https://github.com/apache/lucene/pull/12688#issuecomment-1773113866 Thanks @bruno-roustant ! If you're okay to share it feel free to share it here. I'm in the process of baking my own specific implementation (which internally uses a single long as

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1773365569 Well... a simple wrapping of float[] into MemorySegment is not going to work out, the Vector API does not like it due to alignment constraints (which seems overly pedantic since it

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-20 Thread via GitHub
dsmiley commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1773459815 I'm eager to see the kind of build insights Gradle Enterprise offers us. If there are no further concerns, I'll merge Tuesday.

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1773612369 > Well... a simple wrapping of float[] into MemorySegment is not going to work out, the Vector API does not like it due to alignment constraints (which seems overly pedantic since it can

Re: [PR] Remove direct dependency of NodeHash to FST [lucene]

2023-10-20 Thread via GitHub
dungba88 commented on PR #12690: URL: https://github.com/apache/lucene/pull/12690#issuecomment-1773630051 As the other PR has been merged, I have rebased and resolved the conflict