Re: [I] HnwsGraph creates disconnected components [lucene]
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1772314503

Thanks @msokolov: these are really good suggestions. I will try to incorporate these ideas into the solution. I think in the end there can be multiple ways to allow more connectivity.

I was also worried about how the connections are made and pruned. I checked [these lines](https://github.com/nitirajrathore/lucene/blob/f64bb19697708bfd91e05ff4314976c991f60cbc/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L293-L293) in our algorithm. Here we connect a node to the neighbour list only if it is not closer to any of the current neighbours. Think of a situation where, for some node (say `a`), the first neighbour connected (say `b`) has a good score, but the rest of the candidate nodes have a higher score with `b` than with `a`. Then the rest of the candidates will NOT get connected to `a` AT ALL. Later, if because of some other connection of `b` the non-diverse connection `b -> a` is removed, then `a` will get completely disconnected from the graph. Basically we should have more nodes connected in the beginning itself, so that if some connections are removed we do not run the danger of a disconnected graph.

To check this I added a lot of code to collect events (like `add`, `connect`, `disconnect`) and was able to reproduce the problem with just a 10K graph and max-conn 16; example below. The same is not reproduced with max-conn 32 or max-conn 64. Although increasing max-conn helped in this case, the basic algorithm for connections does not seem very good to me. To fix this, I found that the original paper proposes the concept of [keepPrunedConnections](https://arxiv.org/pdf/1603.09320.pdf). I implemented that locally and ran tests, but something is not right; I will keep checking whether I did something wrong in the implementation. Will submit a "draft" PR soon for others' comments.

I think there is another issue in the current implementation.
Look at [this code](https://github.com/nitirajrathore/lucene/blob/f64bb19697708bfd91e05ff4314976c991f60cbc/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L278-L278) where we finally connect the edge `b ---> a`, and if `b` now has more connections than max-conn, we remove one of the non-diverse connections of `b`, say `b ---> c`. So we remove this half of the undirected connection, but I don't think we remove the other half `c ---> b` anywhere. This leaves the graph inconsistent. I tried fixing this as well, but I am getting some errors; I will try to fix it and submit the draft PR.

Note that adding the concept of `keepPrunedConnections` will not just affect these disconnected nodes; in general the number of connections of all nodes will increase and become close to max-conn. I think this is a good property without much additional cost at graph-creation time. The search time may increase a bit, since there will be more connections to explore for the same max-conn, but I think this will also increase the recall against exact KNN. I will run the performance tests once the code is complete to find those numbers.

---

Real example of disconnectedness: in one of the 10K-vector runs I found 3 disconnected nodes, namely 297, 298 and 5601. I checked that 297 was initially connected to 293. After that it was compared to a lot of candidates but was not connected to any of them, because their score with 293 was higher than their score with 297. For example, here 297 was compared with 43 and 197 but was NOT connected to either, because their score with 293 was higher. In this case 297 was left with just one connection (to 293). Later the connection 293 ---> 297 also got deleted, because the diversity check was triggered for other nodes' connections with 293.
```
level=0,node=297,type=CONNECT,source=297,dest=293,otherdata=score=0.9578
level=0,node=297,type=COMPARE,source=297,dest=43,otherdata=score=0.9541
level=0,node=297,type=COMPARE,source=43,dest=293,otherdata=neighborSimilarity=0.9816, score=0.9541
level=0,node=297,type=COMPARE,source=297,dest=197,otherdata=score=0.9539
level=0,node=297,type=COMPARE,source=197,dest=293,otherdata=neighborSimilarity=0.9868, score=0.9539
..
level=0,node=298,type=CONNECT,source=298,dest=297,otherdata=score=0.9854
```

The final connections were a full two-way connection 297 <---> 298 and a one-sided, left-over connection 297 ---> 293. But there was no path to 297 and 298 from any other node, as the reverse connection 293 ---> 297 got deleted.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
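For reference, the selection heuristic with `keepPrunedConnections` from the HNSW paper can be sketched roughly as below. This is a hypothetical standalone illustration (the class, record and `Similarity` interface are made up; `Similarity` stands in for the real vector comparison), not Lucene's actual `HnswGraphBuilder` code:

```java
import java.util.ArrayList;
import java.util.List;

/** Rough sketch of the paper's neighbor-selection heuristic with keepPrunedConnections. */
public class KeepPrunedSketch {
  /** A candidate neighbor: its id and its score relative to the base node. */
  record Candidate(int id, double scoreToBase) {}

  /** Hypothetical similarity oracle between two node ids. */
  interface Similarity { double score(int a, int b); }

  /**
   * Select up to m neighbors from candidates sorted by descending score to the base node.
   * A candidate is "diverse" only if it is closer to the base node than to any already
   * selected neighbor; with keepPruned, the best pruned candidates backfill unused slots.
   */
  static List<Candidate> selectNeighbors(List<Candidate> sorted, int m, Similarity sim, boolean keepPruned) {
    List<Candidate> selected = new ArrayList<>();
    List<Candidate> pruned = new ArrayList<>();
    for (Candidate c : sorted) {
      if (selected.size() >= m) break;
      boolean diverse = true;
      for (Candidate s : selected) {
        // prune c if it scores higher with an already-selected neighbor than with the base node
        if (sim.score(c.id(), s.id()) > c.scoreToBase()) { diverse = false; break; }
      }
      if (diverse) selected.add(c); else pruned.add(c);
    }
    if (keepPruned) {
      // backfill with the best pruned candidates so nodes keep close to m connections
      for (Candidate c : pruned) {
        if (selected.size() >= m) break;
        selected.add(c);
      }
    }
    return selected;
  }
}
```

In the 297/293 example above, candidates 43 and 197 would be pruned by the diversity check but then restored by the backfill loop, leaving 297 with more than one connection.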
Re: [I] Optimize FST suffix sharing for block tree index [lucene]
mikemccand commented on issue #12702: URL: https://github.com/apache/lucene/issues/12702#issuecomment-1772457667

> The floor data is guaranteed to be stored within single arc (never be prefix shared) in FST because fp is encoded before it.

But won't the leading bytes of `fp` be shared in that prefix (since we switched to MSB vLong encoding)?

> Out of curiosity, i tried to completely disable suffix sharing in block tree index, result in only 1.47% total .tip size increased for wikimediumall.

This is impressive! I would have expected a worse impact. This is likely because `BlockTree` is essentially storing a prefix-ish trie in RAM, not the full terms dictionary. So the suffixes are mostly dropped from the FST index and left in the term blocks stored separately.

> I wonder if we can avoid adding floor data outputs into NodeHash some way?

I'm curious: are there any `floorData` outputs in `NodeHash` (shared suffixes) at all today in the BlockTree terms index?

On the ["limit how much RAM FST Compiler is allowed to use to share suffixes" PR](https://github.com/apache/lucene/pull/12633) I also tested fully disabling `NodeHash` (no prefix sharing) when storing all `wikimediumall` index `body` terms in an FST but found a much bigger increase (65% increase: 350.2 MB -> 577.4 MB), because the suffixes are stored. Similarly, if we explore [experimental codecs that hold all terms in an FST](https://github.com/apache/lucene/pull/12688), now possible / reasonable since the FST is off-heap, sharing the suffixes will be important.
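As a toy illustration of the MSB-first point (a simplified sketch, not Lucene's actual FST byte encoding): when the most-significant 7-bit groups of a vLong are written first, two nearby file pointers produce encodings that agree on their leading bytes, which is exactly what makes those bytes candidates for prefix sharing:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy MSB-first variable-length long encoding: most-significant 7-bit groups first. */
public class MsbVLong {
  static List<Integer> encode(long v) {
    // collect 7-bit groups least-significant first, then emit them in reverse (MSB-first)
    List<Integer> groups = new ArrayList<>();
    do {
      groups.add((int) (v & 0x7F));
      v >>>= 7;
    } while (v != 0);
    List<Integer> bytes = new ArrayList<>();
    for (int i = groups.size() - 1; i >= 0; i--) {
      int b = groups.get(i);
      // set a continuation bit on all but the final byte
      bytes.add(i == 0 ? b : (b | 0x80));
    }
    return bytes;
  }

  /** Length of the common prefix of two encoded byte sequences. */
  static int commonPrefix(List<Integer> a, List<Integer> b) {
    int n = Math.min(a.size(), b.size()), i = 0;
    while (i < n && a.get(i).equals(b.get(i))) i++;
    return i;
  }
}
```

For example, the encodings of the nearby file pointers 1,000,000 and 1,000,001 share their first two of three bytes; an LSB-first encoding would put the differing byte first instead.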
[PR] [DRAFT] Load vector data directly from the memory segment [lucene]
ChrisHegarty opened a new pull request, #12703: URL: https://github.com/apache/lucene/pull/12703

[ This PR is a draft - not ready to be merged. It is intended to help facilitate a discussion ]

This PR enhances the vector similarity functions so that they can access the underlying memory directly, rather than first copying to a primitive array within the Java heap.

I added a number of overloads to `VectorSimilarityFunction`, to allow the retrieval of the vector data to be pushed down. This way, the actual retrieval can be pushed into the provider implementation. This feels right, to me.

`RandomAccessVectorValues` encapsulates and provides access to the underlying vector data. I added the ability to retrieve the backing IndexInput here, so that it's possible to bypass the accessor that does the copy. This is not great, but maybe ok, especially if we could restrict access?

I updated `MemorySegmentIndexInput` to support retrieval of the backing segment for a given position. That way the vector provider can access this directly. This kinda feels ok; it brings the vector provider and memory segment index input closer in nature, without imposing incubating APIs on the currently-previewing implementation.

There are now a couple of extra variants of the distance calculation functions, but they are largely a cut'n'paste of sections of each other. We might not need them all, but that kinda depends on which are more performance sensitive than others. Currently only float dot product has been updated, so as to first determine the potential performance benefits as well as the approach.

Outstanding work and areas for improvement:
1. Figure out something better than exposing IndexInput in RandomAccessVectorValues
2. Binary and other distance calculations
3. Evaluate perf impact with macro benchmarks ( only micro done so far )
4. There are a couple of hacks to get the benchmark running - remove these
5. ..
relates #12482
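A minimal sketch of the core idea, assuming Java 21's final `java.lang.foreign` API (the class and method names here are hypothetical, not the PR's actual code): the dot product reads floats straight out of a `MemorySegment`, skipping the intermediate copy into a `float[]` on the Java heap:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

/** Sketch: float dot product computed directly against off-heap MemorySegments. */
public class SegmentDot {
  static float dot(MemorySegment a, MemorySegment b, int dims) {
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      // read each float straight from the segment; no float[] copy is made
      sum += a.getAtIndex(ValueLayout.JAVA_FLOAT, i) * b.getAtIndex(ValueLayout.JAVA_FLOAT, i);
    }
    return sum;
  }
}
```

The PR's actual implementation additionally vectorizes this loop with the Panama Vector API; this scalar form only shows the zero-copy access pattern.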
Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772530717

Some benchmark results. Mac M2, 128 bit:

```
INFO: Java vector incubator API enabled; uses preferredBitSize=128
...
Benchmark                                     (size)   Mode  Cnt   Score   Error  Units
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   6.568 ± 0.189  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5   8.823 ± 0.157  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  12.561 ± 0.298  ops/us
```

Rocket Lake, AVX 512:

```
INFO: Java vector incubator API enabled; uses preferredBitSize=512
...
Benchmark                                     (size)   Mode  Cnt   Score   Error  Units
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   9.911 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5  10.081 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  14.162 ± 0.020  ops/us
```
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772535368

Thanks @rmuir @gf2121. I need to spend a bit more time evaluating this, but it looks like no action is needed here?
Re: [PR] Fix index out of bounds when writing FST to different metaOut (#12697) [lucene]
mikemccand merged PR #12698: URL: https://github.com/apache/lucene/pull/12698
Re: [I] ArrayIndexOutOfBoundsException when writing the FSTStore-backed FST with different DataOutput for meta [lucene]
mikemccand closed issue #12697: ArrayIndexOutOfBoundsException when writing the FSTStore-backed FST with different DataOutput for meta URL: https://github.com/apache/lucene/issues/12697
Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]
bruno-roustant commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366758770

## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
@@ -20,76 +20,161 @@
```
 import org.apache.lucene.util.packed.PackedInts;
 import org.apache.lucene.util.packed.PagedGrowableWriter;
+// TODO: any way to make a reverse suffix lookup (msokolov's idea) instead of more costly hash?
+// hmmm, though, hash is not so wasteful
+// since it does not have to store value of each entry: the value is the node pointer in the FST.
+// actually, there is much to save
+// there -- we would not need any long per entry -- we'd be able to start at the FST end node and
+// work backwards from the transitions
+
+// TODO: couldn't we prune natrually babck until we see a transition with an output? it's highly
```
Review Comment: "natrually babck"

## lucene/core/src/java/org/apache/lucene/util/packed/AbstractPagedMutable.java:
@@ -110,8 +110,10 @@ protected long baseRamBytesUsed() {
```
 public long ramBytesUsed() {
   long bytesUsed = RamUsageEstimator.alignObjectSize(baseRamBytesUsed());
   bytesUsed += RamUsageEstimator.alignObjectSize(RamUsageEstimator.shallowSizeOf(subMutables));
+// System.out.println("abstract.ramBytesUsed:");
```
Review Comment: Do we keep these commented prints?

## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode node) {
```
     h += 17;
   }
 }
-// System.out.println(" ret " + (h&Integer.MAX_VALUE));
+
```
Review Comment: Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the table mask? Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it public). This really makes a difference in the evenness of the value distribution. This is one of the secrets of the HPPC hashing. By applying this, we get multiple advantages:
- lookup should be improved (less hash collision)
- we can try to rehash at 3/4 occupancy because the performance should not be impacted until this point.
- in case of hash collision, we can lookup linearly with a pos = pos + 1 instead of quadratic probe (lines 95 and 327); this may avoid some mem cache miss.
- (same for the other hash method)

## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
@@ -135,32 +123,28 @@ public class FSTCompiler {
```
  * Instantiates an FST/FSA builder with default settings and pruning options turned off. For more
  * tuning and tweaking, see {@link Builder}.
  */
+ // TODO: remove this? Builder API should be the only entry point?
 public FSTCompiler(FST.INPUT_TYPE inputType, Outputs outputs) {
-  this(inputType, 0, 0, true, true, Integer.MAX_VALUE, outputs, true, 15, 1f);
+  this(inputType, 32.0, outputs, true, 15, 1f);
 }

 private FSTCompiler(
     FST.INPUT_TYPE inputType,
-    int minSuffixCount1,
-    int minSuffixCount2,
-    boolean doShareSuffix,
-    boolean doShareNonSingletonNodes,
-    int shareMaxTailLength,
+    double suffixRAMLimitMB, // pass 0 to disable suffix compression/trie; larger values create
```
Review Comment: `@param` in method javadoc instead? This is not easy to read with spotless truncation.

## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
@@ -207,59 +187,26 @@ public Builder(FST.INPUT_TYPE inputType, Outputs outputs) {
```
 /**
- * If pruning the input graph during construction, this threshold is used for telling if a node
- * is kept or pruned. If transition_count(node) >= minSuffixCount1, the node is kept.
+ * The approximate maximum amount of RAM (in MB) to use holding the suffix cache, which enables
+ * the FST to share common suffixes. Pass {@link Double#POSITIVE_INFINITY} to keep all suffixes
+ * and create an exactly minimal FST. In this case, the amount of RAM actually used will be
+ * bounded by the number of unique suffixes. If you pass a value smaller than the builder would
+ * use, the least recently used suffixes will be discarded, thus reducing suffix sharing and
+ * creating a non-minimal FST. In this case, the larger the limit, the closer the FST will be to
+ * its true minimal size, with diminishing returns as you increasea the limit. Pass {@code 0} to
```
Review Comment: "increasea"
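The golden-constant mixing suggested above can be illustrated with a small sketch (hypothetical code; `PHI_C64` is the 64-bit golden-ratio constant used by HPPC's `BitMixer`, here inlined): multiplying by the constant spreads every input bit into the high bits, and a final xor-shift folds them back down before the power-of-two table mask is applied:

```java
/** Sketch of golden-ratio hash finalization, in the style of HPPC's BitMixer. */
public class GoldenMix {
  // 2^64 divided by the golden ratio, an odd constant with well-spread bits
  static final long PHI_C64 = 0x9E3779B97F4A7C15L;

  static long mix(long h) {
    h *= PHI_C64;          // diffuse all input bits toward the high bits
    return h ^ (h >>> 32); // fold the high bits back into the low bits
  }

  /** Table slot for a hash; tableSize must be a power of two. */
  static int slot(long h, int tableSize) {
    // after mixing, the low bits retained by the mask depend on every input bit,
    // which is what makes linear probing and 3/4 occupancy viable
    return (int) (mix(h) & (tableSize - 1));
  }
}
```

With weak mixing, nearby keys (like sequential FST node hashes) cluster into adjacent slots; the multiply breaks that clustering, which is why linear probing becomes safe.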
Re: [PR] Fix index out of bounds when writing FST to different metaOut (#12697) [lucene]
mikemccand commented on PR #12698: URL: https://github.com/apache/lucene/pull/12698#issuecomment-1772566523

Thanks @dungba88 -- I merged to `main` and `branch_9x`!
Re: [PR] Random access term dictionary [lucene]
bruno-roustant commented on PR #12688: URL: https://github.com/apache/lucene/pull/12688#issuecomment-1772589968

I'll also try to review! On the bit packing subject, I have some handy generic code (not in Lucene yet) to write and read variable size bits. Tell me if you are interested.
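For readers curious what generic variable-size bit I/O looks like, here is a toy sketch (hypothetical code, not the code referred to above): values are appended at an arbitrary bit width and read back from an absolute bit offset:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy variable-width bit packer: append values at any bit width, read by bit offset. */
public class BitPack {
  private final List<Long> words = new ArrayList<>();
  private int bitPos = 0; // total bits written so far

  /** Append the low {@code bits} bits of {@code value}, least-significant bit first. */
  void write(long value, int bits) {
    for (int i = 0; i < bits; i++) {
      int word = bitPos >> 6, bit = bitPos & 63;
      if (word == words.size()) words.add(0L);
      if (((value >>> i) & 1L) != 0) words.set(word, words.get(word) | (1L << bit));
      bitPos++;
    }
  }

  /** Read {@code bits} bits starting at absolute bit offset {@code start}. */
  long read(int start, int bits) {
    long v = 0;
    for (int i = 0; i < bits; i++) {
      int word = (start + i) >> 6, bit = (start + i) & 63;
      if (((words.get(word) >>> bit) & 1L) != 0) v |= 1L << i;
    }
    return v;
  }
}
```

A production version would read whole words at a time rather than bit by bit; this form just shows the addressing scheme.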
Re: [PR] [DRAFT] Concurrent HNSW Merge [lucene]
benwtrent commented on PR #12660: URL: https://github.com/apache/lucene/pull/12660#issuecomment-1772603178

This is awesome. I am so happy it's a clean change without tons of complexity, and we still get a 4x speedup with additional threads. I will give it a review this weekend or early next week.
Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366900931

## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
@@ -135,32 +123,28 @@ public class FSTCompiler {
```
  * Instantiates an FST/FSA builder with default settings and pruning options turned off. For more
  * tuning and tweaking, see {@link Builder}.
  */
+ // TODO: remove this? Builder API should be the only entry point?
 public FSTCompiler(FST.INPUT_TYPE inputType, Outputs outputs) {
-  this(inputType, 0, 0, true, true, Integer.MAX_VALUE, outputs, true, 15, 1f);
+  this(inputType, 32.0, outputs, true, 15, 1f);
 }

 private FSTCompiler(
     FST.INPUT_TYPE inputType,
-    int minSuffixCount1,
-    int minSuffixCount2,
-    boolean doShareSuffix,
-    boolean doShareNonSingletonNodes,
-    int shareMaxTailLength,
+    double suffixRAMLimitMB, // pass 0 to disable suffix compression/trie; larger values create
```
Review Comment: Thanks @bruno-roustant -- I agree it looks awful :) -- but I think I'll just remove this wimpy comment. These private ctor parameters are better documented on the Builder setters.
Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366903022

## lucene/core/src/java/org/apache/lucene/util/packed/AbstractPagedMutable.java:
@@ -110,8 +110,10 @@ protected long baseRamBytesUsed() {
```
 public long ramBytesUsed() {
   long bytesUsed = RamUsageEstimator.alignObjectSize(baseRamBytesUsed());
   bytesUsed += RamUsageEstimator.alignObjectSize(RamUsageEstimator.shallowSizeOf(subMutables));
+// System.out.println("abstract.ramBytesUsed:");
```
Review Comment: I don't think we have a general rule :) I'll remove these ones.
Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366910269

## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode node) {
```
     h += 17;
   }
 }
-// System.out.println(" ret " + (h&Integer.MAX_VALUE));
+
```
Review Comment:

> Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the table mask?

You're right, this is pointless -- I'll remove it from both hash functions -- then we preserve that top (sign) bit for this following change:

> Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it public).

Whoa, this sounds awesome! I was wondering if we could improve the simplistic hashing here ... I'll open a spinoff issue with this idea. Sounds like low hanging hashing fruit!
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772656211

@ChrisHegarty there are plenty of actions we could take, but I implemented this same optimization safely in #12681. See https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking for a brief introduction.
[I] Improve hash mixing in FST's double-barrel LRU hash [lucene]
mikemccand opened a new issue, #12704: URL: https://github.com/apache/lucene/issues/12704

### Description

Spinoff from [this cool comment](https://github.com/apache/lucene/pull/12633#discussion_r1366847986), thanks to hashing guru @bruno-roustant:

```
Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it public). This really
makes a difference in the evenness of the value distribution. This is one of the secrets of the
HPPC hashing.

By applying this, we get multiple advantages:

* lookup should be improved (less hash collision)
* we can try to rehash at 3/4 occupancy because the performance should not be impacted until this point.
* in case of hash collision, we can lookup linearly with a pos = pos + 1 instead of quadratic probe (lines 95 and 327); this may avoid some mem cache miss.
* (same for the other hash method)
```

This is a simple change; we just need to test on some real FST building cases to confirm good mixing "in practice". The new `IndexToFST` tool in `luceneutil` is helpful for this.
Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1366913164

## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode node) {
```
     h += 17;
   }
 }
-// System.out.println(" ret " + (h&Integer.MAX_VALUE));
+
```
Review Comment:

> I'll open a spinoff issue with this idea. Sounds like a low hanging hashing fruit!

I opened https://github.com/apache/lucene/issues/12704. Thanks @bruno-roustant!
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772661255

To have any decent performance, we really need information on the CPU in question and its vector capabilities. And the idea that you can write "one loop" that "runs anywhere" is an obvious pipe dream on the part of openjdk; we have to deal with special-case BS like this for every CPU. My advice on the openjdk side: give up on the pipe dream and give us some cpuinfo, or at least tell us what vector operations are supported. I'm almost to the point of writing a /proc/cpuinfo parser.
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772669239

Also, I think JMH is bad news when it comes to downclocking: it does not show the true performance impact, because downclocking slows down other things on the machine as well; the user might have other work going on too. This is why the avx-512 variants of the functions avoid 512-bit multiply and just use 256-bit vectors for that part; it is the only safe approach. Unfortunately this approach is slightly suboptimal for your Rocket Lake, which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe.
Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1772672994

I'll also confirm `Test2BFST` still passes ... soon this test will no longer require a 35 GB heap to run!
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772673981

> to have any decent performance, we really need information on the CPU in question and its vector capabilities. And the idea you can write "one loop" that "runs anywhere" is an obvious pipe dream on the part of openjdk... we have to deal with special-case BS like this for every CPU. my advice on openjdk side: give up on the pipe dream and give us some cpuinfo or at least tell us what vector operations are supported.

I think we should really open an issue with the JDK. We have discussed this before. In my opinion, the system should have an API to ask, for each operator, whether it is supported in the current configuration (CPU, species size, and Hotspot abilities). In addition, a shortcut to figure out if C2 is enabled at all; if C2 is disabled, each of the specific queries for operator and species combinations should answer "no". Everything else makes the Vector API a huge trap. It is a specialist API, so please add detailed information to allow specialist use!

> I'm almost to the point of writing /proc/cpuinfo parser.

Doesn't work on Windows. 😱
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772679957 > Unfortunately this approach is slightly suboptimal for your Rocket Lake which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe. We could add a sysprop to enable advanced avx512 support. It could then be a static boolean on the impl. If somebody enables it, they must be aware that it may slow things down. I have seen stuff like this in native vector databases at Buzzwords (they told me).
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772695285 The Vector API should also fix its bugs. It is totally senseless to have `IntVector.SPECIES_PREFERRED` and `FloatVector.SPECIES_PREFERRED` and then always set them to '512' on every avx-512 machine. It is not about the data type but the operation you are doing on it. Openjdk has the same heuristics in its intrinsics/asm that I allude to, to avoid these problems. Hacks in every intrinsic / default checking specific families of CPUs and shit. go look for yourself. But it doesn't want to expose this stuff via the vector api to developers.
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772702122 I would really just fix the api: instead of the `IntVector.SPECIES_PREFERRED` constant, which is meaningless, it should be a method taking `VectorOperation...` describing how you plan to use it. It should be something like this in our code:
```
import static jdk.incubator.vector.VectorOperators.ADD;
import static jdk.incubator.vector.VectorOperators.MUL;

// get preferred species on the platform for doing multiplication and addition
static final VectorSpecies<Integer> DOT_PRODUCT_SPECIES = IntVector.preferredFor(MUL, ADD);
```
Then openjdk can keep its hacks to itself. But if they won't fix this, they should expose them.
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772704049 Responding top to bottom, > I wonder how much the speed difference is due to (1) Vectors being out of memory (and if they used PQ for diskann, if they did, we should test PQ with HNSW). (2) Not merging to a single segment and searching multiple segments serially. (1) 90% of it is the PQ, yes. I assume that storing the vector inline w/ the graph also helps some but I did not test that separately. You could definitely get a big speed up just using PQ on top of HNSW. (2) Single segment in both cases. (JVector leaves segment management as an exercise for the user.)
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772706786 such a method would solve 95% of my problems, if it would throw UnsupportedOperationException or return `null` if the hardware/hotspot doesn't support all the requested VectorOperators.
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772711571 > DiskANN is known to be slower at indexing than HNSW I don't remember the numbers here, maybe 10% slower? It wasn't material enough to make me worry about it. (This is with using an HNSW-style incremental build, not the two-pass build described in the paper.) > and the blog post does not compare single threaded index times with Lucene. It was about 30% worse several months ago with my concurrent hnsw patch, should be significantly improved now since JVector (1) doesn't need the CompletionTracker to keep the layers consistent, b/c it's single layer, (2) added a DenseIntMap concurrent collection to replace ConcurrentHashMap for the nodes.
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772722737 > It is possible that the candidate postings (gathered via HNSW) don't contain ANY filtered docs. This would require gathering more candidate postings. This was a big problem for our initial deployment: we'd filter down to a few valid items and then spend forever searching the graph for them. We had to add a fairly accurate estimate of how many comparisons the index would need, and use that to decide whether to brute-force the comparison instead. (This is in Cassandra, not JVector, specifically VectorMemtableIndex.expectedNodesVisited.) I don't remember seeing code for this in Lucene, but I mostly only looked at the HNSW code, so I could have missed it.
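The brute-force-vs-graph decision described here can be sketched with a toy cost model. The formula below is invented purely for illustration and is not Cassandra's actual `expectedNodesVisited` estimate:

```java
public class FilteredSearchPlan {
    // Toy estimate: if a fraction f of docs pass the filter, graph search
    // visits roughly limit/f nodes before finding `limit` valid hits,
    // capped at the graph size. (Invented model, not Cassandra's.)
    static int expectedNodesVisited(int limit, int filterMatches, int graphSize) {
        return (int) Math.min((long) limit * graphSize / Math.max(1, filterMatches), graphSize);
    }

    // Brute force costs one comparison per filtered doc; graph search costs
    // roughly one comparison per visited node. Pick the cheaper side.
    static boolean useBruteForce(int limit, int filterMatches, int graphSize) {
        return filterMatches <= expectedNodesVisited(limit, filterMatches, graphSize);
    }

    public static void main(String[] args) {
        // 10 valid docs in a 1M-node graph: searching the graph "forever" loses
        System.out.println(useBruteForce(10, 10, 1_000_000));      // true
        // half the index passes the filter: graph search wins easily
        System.out.println(useBruteForce(10, 500_000, 1_000_000)); // false
    }
}
```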
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772724758 > Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that wraps jvector? (https://github.com/jbellis/jvector) I'm happy to support anyone who wants to try this, including modifying JVector to make it a better fit if necessary. I won't have time to tackle this myself in the medium-term, however.
Re: [PR] Avoid use docsSeen in BKDWriter [lucene]
easyice commented on PR #12658: URL: https://github.com/apache/lucene/pull/12658#issuecomment-1772756782 I think we can only use this optimization for merges without deleted docs, because we can't use the cardinality of `liveDocs` as the docCount: `liveDocs` has every bit set to 1 when initialized. We would need to iterate through all the docs to know the docCount.
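To illustrate the point in plain Java (using `java.util.BitSet` in place of Lucene's `FixedBitSet`, and an invented `distinctDocCount` helper): the cardinality of a fully-set `liveDocs` only tells you `maxDoc`, so the distinct-doc count has to come from iterating the points themselves, which is what `docsSeen` effectively does:

```java
import java.util.BitSet;

public class DocCountSketch {
    // Count distinct docIDs among the points. Several points may share a
    // doc, so the answer differs from both the point count and maxDoc;
    // a pre-set liveDocs bitset cannot provide it without this iteration.
    static int distinctDocCount(int[] pointDocIds) {
        BitSet seen = new BitSet();
        for (int docId : pointDocIds) {
            seen.set(docId);
        }
        return seen.cardinality();
    }

    public static void main(String[] args) {
        int[] pointDocIds = {0, 0, 1, 3, 3, 3}; // 6 points across 3 docs
        System.out.println(distinctDocCount(pointDocIds)); // 3
    }
}
```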
Re: [PR] Avoid use docsSeen in BKDWriter [lucene]
easyice commented on code in PR #12658: URL: https://github.com/apache/lucene/pull/12658#discussion_r1366998517 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java: ## @@ -519,9 +526,8 @@ private Runnable writeFieldNDims( // compute the min/max for this slice computePackedValueBounds( values, 0, Math.toIntExact(pointCount), minPackedValue, maxPackedValue, scratchBytesRef1); -for (int i = 0; i < Math.toIntExact(pointCount); ++i) { - docsSeen.set(values.getDocID(i)); -} +// docCount has already been set by {@link BKDWriter#writeField} +assert docCount != -1; Review Comment: I moved the `docCount` parameter from `writeField()` to the constructor; it looks cleaner.
Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]
rmuir commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772787526 Thanks for investigating this! Can we just fix vector code to take MemorySegment and wrap array code? I don't think we should add yet another factor to multiply the number of vector impls, there are already too many (encoding * function * cpu). Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO we should deprecate IndexInput for lucene 10.
Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]
rmuir commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772792213 as far as performance in practice, what kind of alignment is necessary such that it is reasonable for mmap'd files? Please, let it not be 64 bytes alignment for avx-512, that's too wasteful.
Re: [I] Improve hash mixing in FST's double-barrel LRU hash [lucene]
bruno-roustant commented on issue #12704: URL: https://github.com/apache/lucene/issues/12704#issuecomment-1772810122 @dweiss will probably say more than me about the awesome BitMixer#PHI_C64 constant!
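For context: `BitMixer#PHI_C64` is `0x9E3779B97F4A7C15L`, i.e. 2^64 divided by the golden ratio, the classic Fibonacci-hashing multiplier. A sketch of the general multiply-then-fold mixing pattern built on it (illustrative of the idea only, not necessarily the exact HPPC/Lucene code):

```java
public class PhiMix {
    // 2^64 / golden ratio: an odd constant with well-distributed bits,
    // so successive keys spread far apart after multiplication
    static final long PHI_C64 = 0x9E3779B97F4A7C15L;

    // Multiply by the golden-ratio constant, then fold the high bits down
    // so the low bits (used for table indexing) depend on all input bits
    static long mixPhi(long k) {
        long h = k * PHI_C64;
        return h ^ (h >>> 32);
    }

    public static void main(String[] args) {
        // nearby keys land far apart in the hash space
        System.out.println(Long.toHexString(mixPhi(1)));
        System.out.println(Long.toHexString(mixPhi(2)));
    }
}
```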
Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]
mikemccand commented on PR #12653: URL: https://github.com/apache/lucene/pull/12653#issuecomment-1772824555 Thanks @shubhamvishu -- looks great! I plan to merge later today.
Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]
mikemccand commented on code in PR #12653: URL: https://github.com/apache/lucene/pull/12653#discussion_r1367036009 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java: ## @@ -63,24 +63,23 @@ public abstract class MultiLevelSkipListWriter { /** for every skip level a different buffer is used */ private ByteBuffersDataOutput[] skipBuffer; + /** Length of the window at which the skips are placed on skip level 1 */ + private final long windowLength; + /** Creates a {@code MultiLevelSkipListWriter}. */ protected MultiLevelSkipListWriter( int skipInterval, int skipMultiplier, int maxSkipLevels, int df) { this.skipInterval = skipInterval; this.skipMultiplier = skipMultiplier; -int numberOfSkipLevels; +int numberOfSkipLevels = 1; // calculate the maximum number of skip levels for this document frequency -if (df <= skipInterval) { - numberOfSkipLevels = 1; -} else { - numberOfSkipLevels = 1 + MathUtil.log(df / skipInterval, skipMultiplier); -} - -// make sure it does not exceed maxSkipLevels -if (numberOfSkipLevels > maxSkipLevels) { - numberOfSkipLevels = maxSkipLevels; +if (df > skipInterval) { + // also make sure it does not exceed maxSkipLevels + numberOfSkipLevels = + Math.min(1 + MathUtil.log(df / skipInterval, skipMultiplier), maxSkipLevels); } Review Comment: > Actually it was not the instance variable that you are thinking of and rather was a local variable with duplicate name(i.e. `numberOfSkipLevels`) which was totally unnecessary here so I cleaned that up as well. Aha! That explains my confusion. Such shadowing (`x` and `this.x` being different) is so confusing/dangerous. Thanks for cleaning it up!
Re: [PR] Add support for similarity-based vector searches [lucene]
shubhamvishu commented on code in PR #12679: URL: https://github.com/apache/lucene/pull/12679#discussion_r1367106152 ## lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.Objects; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.LeafReaderContext; + +/** + * Search for all (approximate) vectors within a radius using the {@link RnnCollector}. + * + * @lucene.experimental + */ +abstract class AbstractRnnVectorQuery extends AbstractKnnVectorQuery { + private static final TopDocs NO_RESULTS = TopDocsCollector.EMPTY_TOPDOCS; + + protected final float traversalThreshold, resultThreshold; + + public AbstractRnnVectorQuery( + String field, float traversalThreshold, float resultThreshold, Query filter) { +super(field, Integer.MAX_VALUE, filter); +assert traversalThreshold <= resultThreshold; Review Comment: Thanks!
Re: [PR] Scorer should sum up scores into a double [lucene]
shubhamvishu commented on PR #12682: URL: https://github.com/apache/lucene/pull/12682#issuecomment-1772913419 Thanks for the approval @jpountz !
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772915097 > > Unfortunately this approach is slightly suboptimal for your Rocket Lake which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe. > > We could add a sysprop to enable advanced avx512 support. Could be a static boolean then on the impl. If somebody enables it, they must be aware that it may slow down. I have seen stuff like this in native vector databases at Buzzwords (they told me). The jvm already has these. For example, a user can already set the max vector width and avx instruction level. I assume that avx-512 users who are running on downclock-susceptible CPUs would already set flags to use only 256-bit vectors. So I am afraid of adding our own flags that conflict with those.
Re: [PR] SOLR-15055 Re-implement 'withCollection' and 'maxShardsPerNode' [lucene-solr]
ljak commented on PR #2179: URL: https://github.com/apache/lucene-solr/pull/2179#issuecomment-1772924915 Hi, I know it's an old thread but I have a question. As far as I can tell (after searching), the `maxShardsPerNode` function wasn't re-implemented (in the new autoscaling framework), right? Thanks!
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772952661 Hi, > The jvm already has these. For example a user can set max vector width and avx instruction level already. I assume that avx 512 users who are running on downclock-susceptible CPUs would already set flags to use only 256bit vectors. So I am afraid of adding our own flags that conflict with those. The problem is that by default the JVM enables those flags, and then those machines get slow when they use Lucene and support cases are opened at Elasticsearch/Opensearch. So my wish would be to have that opt-in. My idea was to add a sysprop, like the MMap ones that enable/disable things manually or disable unsafe unmapping. In that case the sysprop would enable the AVX 512-bit case. It won't conflict with a manual AVX=2 JVM override setting (because then preferredBitSize=256 and our flag is a noop).
Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772963134 @rmuir If I understand your comment correctly: I unaligned the vector data in the mmap file in the benchmark. The results are similar enough to the aligned run, maybe a little less when the alignment is terrible! see [b02b79b](https://github.com/apache/lucene/pull/12703/commits/b02b79b0b3aafdb889a5314dc07a235c0eac45db) Rocket Lake, AVX 512
```
INFO: Java vector incubator API enabled; uses preferredBitSize=512
...
VectorUtilBenchmark.floatDotProductVector     1024  thrpt  5   9.994 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1  1024  thrpt  5  11.171 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2  1024  thrpt  5  13.159 ± 0.017  ops/us
```
Mac M2, 128 bit
```
INFO: Java vector incubator API enabled; uses preferredBitSize=128
...
VectorUtilBenchmark.floatDotProductVector     1024  thrpt  5   5.242 ± 0.025  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1  1024  thrpt  5   7.928 ± 0.155  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2  1024  thrpt  5  12.546 ± 0.077  ops/us
```
Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772974346 > Thanks for investigating this! Can we just fix vector code to take MemorySegment and wrap array code? Yes, that is a good idea. I'll do it and see how poorly it performs. If we do this, then we still need to handle the case where a vector spans across more than one segment (well, it should only ever span at most two). We can just copy the bytes into a float[] for this and get a memory segment view over it - this will happen so infrequently. > I don't think we should add yet another factor to multiply the number of vector impls, there are already too many (encoding * function * cpu). Agreed. > Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO we should deprecate IndexInput for lucene 10. I'm not sure how this would work, but I accept the sentiment. And I think I see/agree where you are going longer term with this (and Java 22).
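The "copy the bytes into a float[]" fallback for a vector spanning two segments can be sketched with plain byte arrays standing in for mmap chunks. The names and chunk layout here are invented for illustration; Lucene's MemorySegmentIndexInput handles this differently:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class SpanningVector {
    // When a vector's bytes straddle a chunk boundary, copy the tail of the
    // first chunk and the head of the second into one scratch buffer, then
    // decode it as floats and fall back to the array-based dot product.
    static float[] readSpanningVector(byte[] chunkA, int offsetInA, byte[] chunkB, int dims) {
        byte[] scratch = new byte[dims * Float.BYTES];
        int fromA = chunkA.length - offsetInA; // bytes available in the first chunk
        System.arraycopy(chunkA, offsetInA, scratch, 0, fromA);
        System.arraycopy(chunkB, 0, scratch, fromA, scratch.length - fromA);
        float[] vec = new float[dims];
        ByteBuffer.wrap(scratch).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vec);
        return vec;
    }

    public static void main(String[] args) {
        // write floats {1f, 2f} across a chunk boundary: 5 bytes in A, 3 in B
        byte[] all = new byte[8];
        ByteBuffer.wrap(all).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(new float[] {1f, 2f});
        byte[] a = Arrays.copyOfRange(all, 0, 5);
        byte[] b = Arrays.copyOfRange(all, 5, 8);
        float[] vec = readSpanningVector(a, 0, b, 2);
        System.out.println(vec[0] + " " + vec[1]); // 1.0 2.0
    }
}
```

Since a vector spans at most two chunks and boundaries are rare, the extra copy happens infrequently enough not to matter.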
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772982177 Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level.
Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1772988443 `Test2BFST` passed!
```
The slowest tests (exceeding 500 ms) during this run:
  2993.94s Test2BFST.test (:lucene:core)
The slowest suites (exceeding 1s) during this run:
  2994.06s Test2BFST (:lucene:core)

BUILD SUCCESSFUL in 49m 58s
246 actionable tasks: 78 executed, 168 up-to-date
```
That last check failed because of formatting -- I re-tidy'd and I think it's ready -- I'll push shortly.
Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]
mikemccand merged PR #12633: URL: https://github.com/apache/lucene/pull/12633
Re: [I] Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality [lucene]
mikemccand commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1772993863 I've merged the change into `main`! I'll let it bake for some time (week or two?) and if all looks good, backport to 9.x.
Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1773038045 > Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level. Let's write a proposal together in a Google doc (or similar) and we could later open an issue or JEP. As we are both openjdk members, it would be helpful for me to participate in that process (I want to learn something new).
Re: [PR] Random access term dictionary [lucene]
Tony-X commented on PR #12688: URL: https://github.com/apache/lucene/pull/12688#issuecomment-1773113866 Thanks @bruno-roustant ! If you're okay with sharing it, feel free to post it here. I'm in the process of baking my own specific implementation (which internally uses a single long as a bit buffer), but I might absorb some interesting ideas from your impl.
Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1773365569 Well... a simple wrapping of float[] into MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!)
```
FloatVector fromMemorySegment(...) throws IllegalArgumentException
  - if the memory segment is a heap segment that is not backed by a byte[] array.
```
This approach would of course work fine for the binary computations, since they are byte[] backed. The current approach has three variants: 1. `dotProduct(float[] a, float[] b)` 2. `dotProduct(float[] a, RandomAccessVectorValues b, int bOffset)` 3. `dotProduct(RandomAccessVectorValues a, int aOffset, RandomAccessVectorValues b, int bOffset)` No. 1 is used as a fallback when we span segments. I suspect that no. 1 and no. 2 are not as perf-sensitive as no. 3? In fact, we could try to trace the float[] back to where it originates, maybe read and store it as bytes but access it as a float. Maybe there are other ways to invert things? I'll need to sleep on it!!
Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]
dsmiley commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1773459815 I'm eager to see the kind of build insights Gradle Enterprise offers us. If there are no further concerns, I'll merge Tuesday.
Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]
rmuir commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1773612369 > Well... a simple wrapping of float[] into a MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!)

But all `float[]` on the heap are just like any other heap object and aligned to 8 bytes or something?
```
$ java -XX:+PrintFlagsFinal -version | grep -i align
     bool AlignVector              = false   {C2 product} {default}
     intx InteriorEntryAlignment   = 16      {C2 pd product} {default}
     intx NumberOfLoopInstrToAlign = 4       {C2 product} {default}
      int ObjectAlignmentInBytes   = 8       {product lp64_product} {default}
     intx OptoLoopAlignment        = 16      {pd product} {default}
     bool UseUnalignedLoadStores   = true    {ARCH product} {default}
```
Re: [PR] Remove direct dependency of NodeHash to FST [lucene]
dungba88 commented on PR #12690: URL: https://github.com/apache/lucene/pull/12690#issuecomment-1773630051 As the other PR has been merged, I have rebased and resolved the conflict.