Re: [I] HnwsGraph creates disconnected components [lucene]

2023-10-20 Thread via GitHub


nitirajrathore commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1772314503

   Thanks @msokolov: These are really good suggestions. I will try to incorporate these ideas into the solution. I think that in the end there can be multiple ways to allow more connectivity.
   
   I was also worried about how the connections are made and pruned. I checked [these lines](https://github.com/nitirajrathore/lucene/blob/f64bb19697708bfd91e05ff4314976c991f60cbc/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L293-L293) in our algorithm. Here we connect a node to the neighbour list only if it is not closer to any of the current neighbours. Think of a situation where, for some node (say `a`), the first neighbour connected (say `b`) has a good score, but the rest of the candidate nodes score higher with `b` than with `a`. Then the rest of the candidates will NOT get connected to `a` AT ALL. Later, if because of some other connection of `b` the non-diverse connection b -> a is removed, then `a` will get completely disconnected from the graph. Basically, we should have more nodes connected at the beginning itself, so that if some connections are removed we do not run the danger of a disconnected graph.
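   The pruning rule described above can be sketched roughly as follows (a minimal sketch with hypothetical helper names; the real logic lives in `HnswGraphBuilder`):

   ```java
   // Simplified sketch of the HNSW neighbor diversity check (not Lucene's
   // actual code). A candidate is connected to the node only if it is closer
   // to the node than to every already-selected neighbor; otherwise it is
   // considered "non diverse" and skipped entirely.
   class DiversityCheckSketch {
     // scoreToNode: similarity of the candidate to the node being wired up.
     // scoresToNeighbors: similarity of the candidate to each existing neighbor.
     static boolean isDiverse(float scoreToNode, float[] scoresToNeighbors) {
       for (float scoreToNeighbor : scoresToNeighbors) {
         if (scoreToNeighbor > scoreToNode) {
           return false; // closer to an existing neighbor: candidate is pruned
         }
       }
       return true;
     }
   }
   ```

   With this rule, a single early neighbour with high similarity to all later candidates (like 293 in the example further down) can veto every subsequent connection.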
   
   To check this I added a lot of code to collect events (like `add`, `connect`, `disconnect`) and was able to reproduce the problem with a graph of just 10K nodes with max-conn 16; example below. The same is not reproduced with max-conn 32 or max-conn 64. Although increasing max-conn helped in this case, the basic algorithm for connections does not seem very good to me.
   
   To fix this I checked that the original paper proposes the concept of [keepPrunedConnections](https://arxiv.org/pdf/1603.09320.pdf). I implemented that locally and ran tests, but something is not right; I will keep checking whether I did something wrong in the implementation. I will submit a "draft" PR soon for others' comments.
   
   I think there is another issue in the current implementation. Look at [this code](https://github.com/nitirajrathore/lucene/blob/f64bb19697708bfd91e05ff4314976c991f60cbc/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L278-L278) where we finally connect the edge `b ---> a`, and if `b` now has more connections than max-conn we remove one of the non-diverse connections of `b`, say `b ---> c`. So we are removing this half of the undirected connection, but I don't think we remove the other half, `c ---> b`, anywhere. This leaves an inconsistent graph. I tried fixing this as well, but I am getting some errors; I will try to fix it and submit the draft PR.
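   The fix amounts to always removing both half-edges together, roughly like this (hypothetical helper on a plain adjacency-set graph, not Lucene's data structures):

   ```java
   import java.util.Map;
   import java.util.Set;

   // Sketch of removing *both* halves of an undirected edge. If only b -> c is
   // dropped when b exceeds max-conn, the leftover c -> b half leaves the graph
   // in an inconsistent state.
   class SymmetricRemovalSketch {
     static void removeUndirectedEdge(Map<Integer, Set<Integer>> neighbors, int b, int c) {
       neighbors.get(b).remove(c); // the half the pruning already removes
       neighbors.get(c).remove(b); // the missing half: keeps the graph consistent
     }
   }
   ```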
   
   Note that adding the concept of `keepPrunedConnections` will not just affect these disconnected nodes; in general the number of connections of all nodes will increase and approach max-conn. I think this is a good property without much additional cost at graph-creation time. Search time may increase a bit, since there will be more connections to explore for the same max-conn, but I think this will also increase Recall against ExactKNN. I will run the performance tests once the code is complete to find those numbers.
   
   ---
   
   - Real example of disconnectedness
   In one of the 10K-vector runs, I found 3 disconnected nodes: namely 297, 298 and 5601.
   I checked that 297 was initially connected to 293. After that it was compared with a lot of candidates but was not connected to any, because their score with 293 was higher than their score with 297. For example, here 297 was compared with 43 and 197 but was NOT connected to either, because their scores with 293 were higher. This left 297 with just one connection (to 293). Later on, the connection 293 ---> 297 also got deleted, because the diversity check was triggered for other nodes' connections with 293.
   ```
   level=0,node=297,type=CONNECT,source=297,dest=293,otherdata=score=0.9578
   level=0,node=297,type=COMPARE,source=297,dest=43,otherdata=score=0.9541
   level=0,node=297,type=COMPARE,source=43,dest=293,otherdata=neighborSimilarity=0.9816,score=0.9541
   level=0,node=297,type=COMPARE,source=297,dest=197,otherdata=score=0.9539
   level=0,node=297,type=COMPARE,source=197,dest=293,otherdata=neighborSimilarity=0.9868,score=0.9539
   ..
   level=0,node=298,type=CONNECT,source=298,dest=297,otherdata=score=0.9854
   
   ```
   The final connections were: a full two-way connection 297 <---> 298 and a one-sided leftover connection 297 ---> 293.
   But there was no path to 297 or 298 from any other node, as the reverse connection 293 ---> 297 had been deleted.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [I] Optimize FST suffix sharing for block tree index [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on issue #12702:
URL: https://github.com/apache/lucene/issues/12702#issuecomment-1772457667

   > The floor data is guaranteed to be stored within single arc (never be 
prefix shared) in FST because fp is encoded before it.
   
   But won't the leading bytes of `fp` be shared in that prefix (since we 
switched to MSB vLong encoding)?
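   To illustrate the point: with an MSB-first variable-length encoding, two nearby file pointers share their leading bytes, so an FST can prefix-share them, whereas an LSB-first vLong puts the differing low bits first. A toy encoder (not Lucene's actual MSB vLong, which also sets continuation bits):

   ```java
   // Toy MSB-first variable-length encoder: 7 value bits per byte, the
   // most-significant group first, so nearby values share leading bytes.
   class MsbVLongSketch {
     static byte[] encodeMsb(long v) {
       int n = Math.max(1, (64 - Long.numberOfLeadingZeros(v) + 6) / 7);
       byte[] out = new byte[n];
       for (int i = n - 1; i >= 0; i--) {
         out[i] = (byte) (v & 0x7F); // lowest remaining 7 bits go last
         v >>>= 7;
       }
       return out;
     }
   }
   ```

   For example, 16384 and 16385 encode to three bytes each and differ only in the last byte, so the first two bytes are shareable as a common prefix.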
   
   > Out of curiosity, i tried to completely disable suffix sharing in block 
tree index, result in only 1.47% total .tip size increased for wikimediumall.
   
   This is impressive!  I would have expected a worse impact.
   
   This is likely because `BlockTree` is essentially storing a prefix-ish trie 
in RAM, not the full terms dictionary.  So the suffixes are mostly dropped from 
the FST index and left in the term blocks stored separately.
   
   > I wonder if we can avoid adding floor data outputs into NodeHash some way?
   
   I'm curious: are there any `floorData` outputs in `NodeHash` (shared 
suffixes) at all today in the BlockTree terms index?
   
   On the ["limit how much RAM FST Compiler is allowed to use to share suffixes" PR](https://github.com/apache/lucene/pull/12633) I also tested fully disabling `NodeHash` (no suffix sharing) when storing all `wikimediumall` index `body` terms in an FST, but found a much bigger increase (65%: 350.2 MB -> 577.4 MB), because there the suffixes are actually stored in the FST.
   
   Similarly, if we explore [experimental codecs that hold all terms in an 
FST](https://github.com/apache/lucene/pull/12688), now possible / reasonable 
since the FST is off-heap, sharing the suffixes will be important. 





[PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub


ChrisHegarty opened a new pull request, #12703:
URL: https://github.com/apache/lucene/pull/12703

   [ This PR is a draft - not ready to be merged. It is intended to help facilitate a discussion. ]
   
   This PR enhances the vector similarity functions so that they can access the 
underlying memory directly, rather than first copying to a primitive array 
within the Java heap.
   
   I added a number of overloads to `VectorSimilarityFunction`, to allow the 
retrieval of the vector data to be pushed down. This way, the actual retrieval 
can be pushed into the provider implementation. This feels right, to me.
   
   `RandomAccessVectorValues` encapsulates and provides access to the 
underlying vector data. I added the ability to retrieve the backing IndexInput 
here, so that it's possible to bypass the accessor that does the copy. This is 
not great, but maybe ok, especially if we could restrict access?
   
   I updated `MemorySegmentIndexInput` to support retrieval of the backing segment for a given position. That way the vector provider can access it directly. This kinda feels ok; it brings the vector provider and memory segment index input closer in nature, without imposing incubating APIs onto the currently-previewing implementation.
   
   There are now a couple of extra variants of the distance calculation 
functions, but they are largely a cut'n'paste of sections of each other. We 
might not need them all, but that kinda depends on which are more performance 
sensitive than others.
   
   Currently only float dot product has been updated, so as to first determine 
the potential performance benefits as well as the approach.
   
   Outstanding work and areas for improvement:
   1. Figure out something better than exposing IndexInput in 
RandomAccessVectorValues
   2. Binary and other distance calculations
   3. Evaluate perf impact with macro benchmarks ( only micro done so far )
   4. There are a couple of hacks to get the benchmark running - remove these
   5.  ..
   
   relates #12482





Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub


ChrisHegarty commented on PR #12703:
URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772530717

   Some benchmark results.
   
   Mac M2, 128 bit
   ```
   INFO: Java vector incubator API enabled; uses preferredBitSize=128
   ...
   Benchmark                                    (size)   Mode  Cnt   Score    Error   Units
   VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5   6.568 ±  0.189  ops/us
   VectorUtilBenchmark.floatDotProductVectorMS1   1024  thrpt    5   8.823 ±  0.157  ops/us
   VectorUtilBenchmark.floatDotProductVectorMS2   1024  thrpt    5  12.561 ±  0.298  ops/us
   ```
   
   Rocket Lake, AVX 512
   ```
   INFO: Java vector incubator API enabled; uses preferredBitSize=512
   ...
   Benchmark                                    (size)   Mode  Cnt   Score    Error   Units
   VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5   9.911 ±  0.004  ops/us
   VectorUtilBenchmark.floatDotProductVectorMS1   1024  thrpt    5  10.081 ±  0.004  ops/us
   VectorUtilBenchmark.floatDotProductVectorMS2   1024  thrpt    5  14.162 ±  0.020  ops/us
   ```





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


ChrisHegarty commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772535368

   Thanks @rmuir @gf2121. I need to spend a bit more time evaluating this, but it looks like no action is needed here?





Re: [PR] Fix index out of bounds when writing FST to different metaOut (#12697) [lucene]

2023-10-20 Thread via GitHub


mikemccand merged PR #12698:
URL: https://github.com/apache/lucene/pull/12698





Re: [I] ArrayIndexOutOfBoundsException when writing the FSTStore-backed FST with different DataOutput for meta [lucene]

2023-10-20 Thread via GitHub


mikemccand closed issue #12697: ArrayIndexOutOfBoundsException when writing the 
FSTStore-backed FST with different DataOutput for meta
URL: https://github.com/apache/lucene/issues/12697





Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub


bruno-roustant commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1366758770


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -20,76 +20,161 @@
 import org.apache.lucene.util.packed.PackedInts;
 import org.apache.lucene.util.packed.PagedGrowableWriter;
 
+// TODO: any way to make a reverse suffix lookup (msokolov's idea) instead of 
more costly hash?
+// hmmm, though, hash is not so wasteful
+// since it does not have to store value of each entry: the value is the node 
pointer in the FST.
+// actually, there is much to save
+// there -- we would not need any long per entry -- we'd be able to start at 
the FST end node and
+// work backwards from the transitions
+
+// TODO: couldn't we prune natrually babck until we see a transition with an 
output?  it's highly

Review Comment:
   "natrually babck"



##
lucene/core/src/java/org/apache/lucene/util/packed/AbstractPagedMutable.java:
##
@@ -110,8 +110,10 @@ protected long baseRamBytesUsed() {
   public long ramBytesUsed() {
 long bytesUsed = RamUsageEstimator.alignObjectSize(baseRamBytesUsed());
 bytesUsed += 
RamUsageEstimator.alignObjectSize(RamUsageEstimator.shallowSizeOf(subMutables));
+// System.out.println("abstract.ramBytesUsed:");

Review Comment:
   Do we keep these commented prints?



##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode node) {
 h += 17;
   }
 }
-// System.out.println("  ret " + (h&Integer.MAX_VALUE));
+

Review Comment:
   Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the 
table mask?
   
   Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it 
public).
   This really makes a difference in the evenness of the value distribution. 
This is one of the secrets of the HPPC hashing. By applying this, we get 
multiple advantages:
   - lookup should be improved (less hash collision)
   - we can try to rehash at 3/4 occupancy because the performance should not 
be impacted until this point.
   - in case of hash collision, we can lookup linearly with a pos = pos + 1 
instead of quadratic probe (lines 95 and 327); this may avoid some mem cache 
miss.
   
   (same for the other hash method)
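   The suggested golden-ratio multiplicative mix (Fibonacci hashing) might look like this sketch. The constant is 2^64 divided by the golden ratio; I am assuming it matches HPPC's `BitMixer#PHI_C64` (check `BitMixer` for the authoritative value):

   ```java
   // Sketch of golden-ratio multiplicative mixing before masking with the
   // table mask. Multiplying by this odd, well-spread constant diffuses
   // low-bit patterns into the high bits, giving a much more even slot
   // distribution than masking the raw hash directly.
   class GoldenMixSketch {
     static final long PHI_C64 = 0x9E3779B97F4A7C15L; // assumed == BitMixer#PHI_C64

     static int slot(long rawHash, int tableMask) {
       // take the high 32 bits, where the mixing spreads the entropy
       return (int) ((rawHash * PHI_C64) >>> 32) & tableMask;
     }
   }
   ```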



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -135,32 +123,28 @@ public class FSTCompiler {
* Instantiates an FST/FSA builder with default settings and pruning options 
turned off. For more
* tuning and tweaking, see {@link Builder}.
*/
+  // TODO: remove this?  Builder API should be the only entry point?
   public FSTCompiler(FST.INPUT_TYPE inputType, Outputs outputs) {
-this(inputType, 0, 0, true, true, Integer.MAX_VALUE, outputs, true, 15, 
1f);
+this(inputType, 32.0, outputs, true, 15, 1f);
   }
 
   private FSTCompiler(
   FST.INPUT_TYPE inputType,
-  int minSuffixCount1,
-  int minSuffixCount2,
-  boolean doShareSuffix,
-  boolean doShareNonSingletonNodes,
-  int shareMaxTailLength,
+  double suffixRAMLimitMB, // pass 0 to disable suffix compression/trie; 
larger values create

Review Comment:
   `@param`in method javadoc instead? This is not easy to read with spotless 
truncation.



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -207,59 +187,26 @@ public Builder(FST.INPUT_TYPE inputType, Outputs 
outputs) {
 }
 
 /**
- * If pruning the input graph during construction, this threshold is used 
for telling if a node
- * is kept or pruned. If transition_count(node) >= minSuffixCount1, the 
node is kept.
+ * The approximate maximum amount of RAM (in MB) to use holding the suffix 
cache, which enables
+ * the FST to share common suffixes. Pass {@link Double#POSITIVE_INFINITY} 
to keep all suffixes
+ * and create an exactly minimal FST. In this case, the amount of RAM 
actually used will be
+ * bounded by the number of unique suffixes. If you pass a value smaller 
than the builder would
+ * use, the least recently used suffixes will be discarded, thus reducing 
suffix sharing and
+ * creating a non-minimal FST. In this case, the larger the limit, the 
closer the FST will be to
+ * its true minimal size, with diminishing returns as you increasea the 
limit. Pass {@code 0} to

Review Comment:
   "increasea"




Re: [PR] Fix index out of bounds when writing FST to different metaOut (#12697) [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on PR #12698:
URL: https://github.com/apache/lucene/pull/12698#issuecomment-1772566523

   Thanks @dungba88 -- I merged to `main` and `branch_9x`!





Re: [PR] Random access term dictionary [lucene]

2023-10-20 Thread via GitHub


bruno-roustant commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1772589968

   I'll also try to review!
   On the bit packing subject, I have some handy generic code (not in Lucene 
yet) to write and read variable size bits. Tell me if you are interested.





Re: [PR] [DRAFT] Concurrent HNSW Merge [lucene]

2023-10-20 Thread via GitHub


benwtrent commented on PR #12660:
URL: https://github.com/apache/lucene/pull/12660#issuecomment-1772603178

   This is awesome. I am so happy it's a clean change without tons of 
complexity and we still get 4x speed up with additional threads.
   
   I will give it a review this weekend or early next week.





Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1366900931


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -135,32 +123,28 @@ public class FSTCompiler {
* Instantiates an FST/FSA builder with default settings and pruning options 
turned off. For more
* tuning and tweaking, see {@link Builder}.
*/
+  // TODO: remove this?  Builder API should be the only entry point?
   public FSTCompiler(FST.INPUT_TYPE inputType, Outputs outputs) {
-this(inputType, 0, 0, true, true, Integer.MAX_VALUE, outputs, true, 15, 
1f);
+this(inputType, 32.0, outputs, true, 15, 1f);
   }
 
   private FSTCompiler(
   FST.INPUT_TYPE inputType,
-  int minSuffixCount1,
-  int minSuffixCount2,
-  boolean doShareSuffix,
-  boolean doShareNonSingletonNodes,
-  int shareMaxTailLength,
+  double suffixRAMLimitMB, // pass 0 to disable suffix compression/trie; 
larger values create

Review Comment:
   Thanks @bruno-roustant -- I agree it looks awful :) -- but I think I'll just 
remove this wimpy comment.  These private ctor parameters are better documented 
on the Builder setters.






Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1366903022


##
lucene/core/src/java/org/apache/lucene/util/packed/AbstractPagedMutable.java:
##
@@ -110,8 +110,10 @@ protected long baseRamBytesUsed() {
   public long ramBytesUsed() {
 long bytesUsed = RamUsageEstimator.alignObjectSize(baseRamBytesUsed());
 bytesUsed += 
RamUsageEstimator.alignObjectSize(RamUsageEstimator.shallowSizeOf(subMutables));
+// System.out.println("abstract.ramBytesUsed:");

Review Comment:
   I don't think we have a general rule :)  I'll remove these ones.






Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1366910269


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode node) {
 h += 17;
   }
 }
-// System.out.println("  ret " + (h&Integer.MAX_VALUE));
+

Review Comment:
   > Do we need to mask with Long.MAX_VALUE below, since we mask anyway with 
the table mask?
   
   You're right, this is pointless -- I'll remove from both hash functions -- 
then we preserve that top (sign) bit for this following change:
   
   > Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make 
it public).
   
   Whoa, this sounds awesome!  I was wondering if we could improve the 
simplistic hashing here ... I'll open a spinoff issue with this idea.  Sounds 
like a low hanging hashing fruit!






Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772656211

   @ChrisHegarty there are plenty of actions we could take... but I implemented 
this specific same optimization in question safely in #12681 
   
   See https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking 
for a brief introduction. 





[I] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-20 Thread via GitHub


mikemccand opened a new issue, #12704:
URL: https://github.com/apache/lucene/issues/12704

   ### Description
   
   Spinoff from [this cool 
comment](https://github.com/apache/lucene/pull/12633#discussion_r1366847986), 
thanks to hashing guru @bruno-roustant:
   
   ```
   Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it 
public).
   This really makes a difference in the evenness of the value distribution. 
This is one of the secrets of the HPPC hashing. By applying this, we get 
multiple advantages:
   
 * lookup should be improved (less hash collision)
 * we can try to rehash at 3/4 occupancy because the performance should not 
be impacted until this point.
 * in case of hash collision, we can lookup linearly with a pos = pos + 1 
instead of quadratic probe (lines 95 and 327); this may avoid some mem cache 
miss.
   (same for the other hash method)
   ```
   
   This is a simple change, we just need to test on some real FST building 
cases to confirm good mixing "in practice".  The new `IndexToFST` tool in 
`luceneutil` is helpful for this.





Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1366913164


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode node) {
 h += 17;
   }
 }
-// System.out.println("  ret " + (h&Integer.MAX_VALUE));
+

Review Comment:
   > I'll open a spinoff issue with this idea. Sounds like a low hanging 
hashing fruit!
   
   I opened https://github.com/apache/lucene/issues/12704.  Thanks 
@bruno-roustant!






Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772661255

   to have any decent performance, we really need information on the CPU in 
question and its vector capabilities. And the idea you can write "one loop" 
that "runs anywhere" is an obvious pipe dream on the part of openjdk... we have 
to deal with special-case BS like this for every CPU. my advice on openjdk 
side: give up on the pipe dream and give us some cpuinfo or at least tell us 
what vector operations are supported. I'm almost to the point of writing 
/proc/cpuinfo parser.





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772669239

   also i think JMH is bad news when it comes to downclocking. It does not show 
the true performance impact of this. It slows down other things on the machine 
as well: the user might have other shit going on too.
   
   This is why the avx-512 variants of the functions avoid 512-bit multiply and 
just use 256-bit vectors for that part, it is the only safe approach.
   
   Unfortunately this approach is slightly suboptimal for your Rocket Lake 
which doesn't suffer from downclocking, but it is a disaster elsewhere, so we 
have to play it safe.





Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1772672994

   I'll also confirm `Test2BFST` still passes ... soon this test will no longer 
require a 35 GB heap to run!





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


uschindler commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772673981

   > to have any decent performance, we really need information on the CPU in 
question and its vector capabilities. And the idea you can write "one loop" 
that "runs anywhere" is an obvious pipe dream on the part of openjdk... we have 
to deal with special-case BS like this for every CPU. my advice on openjdk 
side: give up on the pipe dream and give us some cpuinfo or at least tell us 
what vector operations are supported.
   
   I think we should really open an issue with the JDK. We discussed this before. In my opinion, the system should have an API to ask, for each operator, whether it is supported in the current configuration (CPU, species size, and Hotspot capabilities). In addition, there should be a shortcut to figure out whether C2 is enabled at all; if C2 is disabled, each of the specific queries for operator and species combinations should return "no".
   
   Everything else makes the Vector API a huge trap. It is a specialist API, so please add detailed information to allow specialist use!
   
   > I'm almost to the point of writing /proc/cpuinfo parser.
   
   Doesn't work on Windows. 😱





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


uschindler commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772679957

   > Unfortunately this approach is slightly suboptimal for your Rocket Lake 
which doesn't suffer from downclocking, but it is a disaster elsewhere, so we 
have to play it safe.
   
   We could add a sysprop to enable advanced AVX-512 support. It could then be a static boolean on the implementation. If somebody enables it, they must be aware that it may slow things down. I have seen stuff like this in native vector databases at Buzzwords (they told me).





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772695285

   The Vector API should also fix its bugs. It is totally senseless to have `IntVector.SPECIES_PREFERRED` and `FloatVector.SPECIES_PREFERRED` and then always set them to '512' on every AVX-512 machine. It is not about the data type but about the operations you perform on it.
   
   OpenJDK has the same heuristics in its intrinsics/asm that I allude to, to avoid these problems: hacks in every intrinsic, defaults checking specific families of CPUs, and so on. Go look for yourself. But it doesn't want to expose this stuff to developers via the Vector API.





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772702122

   I would really just fix the api: instead of the `IntVector.SPECIES_PREFERRED` constant, which is meaningless, it should be a method taking `VectorOperators.Operator...` that describes how you plan to use it. It should be something like this in our code:
   
   ```
   import static VectorOperators.ADD;
   import static VectorOperators.MUL;
   
   // get the preferred species on the platform for doing multiplication and addition
   static final VectorSpecies<Integer> DOT_PRODUCT_SPECIES = IntVector.preferredFor(MUL, ADD);
   ```
   
   Then openjdk can keep its hacks to itself. But if they won't fix this, they should expose them.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772704049

   Responding top to bottom,
   
   > I wonder how much the speed difference is due to (1) Vectors being out of 
memory (and if they used PQ for diskann, if they did, we should test PQ with 
HNSW). (2) Not merging to a single segment and searching multiple segments 
serially.
   
   (1) 90% of it is the PQ, yes.  I assume that storing the vector inline w/ 
the graph also helps some but I did not test that separately.  You could 
definitely get a big speed up just using PQ on top of HNSW.  
   
   (2) Single segment in both cases.  (JVector leaves segment management as an 
exercise for the user.)
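   To make the "PQ on top of HNSW" idea concrete, here is a toy sketch of product-quantization distance estimation. Real systems train the codebooks with k-means over the corpus; the tiny hard-coded codebooks, the 4-dimensional vectors, and all names below are purely illustrative, not JVector's or Lucene's code.

   ```java
   // Toy product quantization: split each vector into subspaces, store one
   // centroid id per subspace, and estimate distances from the query to the
   // centroids instead of the full stored vectors.
   public class ToyPQ {
     // 2 subspaces, each 2-dimensional, each with 4 centroids (illustrative).
     static final float[][][] CODEBOOKS = {
       {{0, 0}, {0, 1}, {1, 0}, {1, 1}},   // centroids for dims 0..1
       {{0, 0}, {0, 1}, {1, 0}, {1, 1}},   // centroids for dims 2..3
     };

     /** Encode a 4-dim vector as one centroid id per subspace (2 bytes total). */
     static byte[] encode(float[] v) {
       byte[] codes = new byte[CODEBOOKS.length];
       for (int s = 0; s < CODEBOOKS.length; s++) {
         float best = Float.MAX_VALUE;
         for (int c = 0; c < CODEBOOKS[s].length; c++) {
           float d = sq(v[2 * s] - CODEBOOKS[s][c][0]) + sq(v[2 * s + 1] - CODEBOOKS[s][c][1]);
           if (d < best) { best = d; codes[s] = (byte) c; }
         }
       }
       return codes;
     }

     /** Approximate squared distance from a full query to an encoded vector. */
     static float approxDistance(float[] query, byte[] codes) {
       float sum = 0;
       for (int s = 0; s < codes.length; s++) {
         float[] centroid = CODEBOOKS[s][codes[s]];
         sum += sq(query[2 * s] - centroid[0]) + sq(query[2 * s + 1] - centroid[1]);
       }
       return sum;
     }

     static float sq(float x) { return x * x; }

     public static void main(String[] args) {
       float[] stored = {0.9f, 0.1f, 0.1f, 0.9f};
       byte[] codes = encode(stored);             // 2 bytes instead of 16
       float[] query = {1f, 0f, 0f, 1f};
       System.out.println(approxDistance(query, codes)); // 0.0 -- query hits the same centroids
     }
   }
   ```

   The distances are approximate (quantization error), which is why PQ trades recall for speed and memory.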





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772706786

   Such a method would solve 95% of my problems, if it would throw UnsupportedOperationException or return `null` when the hardware/hotspot doesn't support all the requested VectorOperators.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772711571

   > DiskANN is known to be slower at indexing than HNSW 
   
   I don't remember the numbers here, maybe 10% slower?  It wasn't material 
enough to make me worry about it.  (This is with using an HNSW-style 
incremental build, not the two-pass build described in the paper.)
   
   > and the blog post does not compare single threaded index times with Lucene.
   
   It was about 30% worse several months ago with my concurrent hnsw patch, 
should be significantly improved now since JVector (1) doesn't need the 
CompletionTracker to keep the layers consistent, b/c it's single layer, (2) 
added a DenseIntMap concurrent collection to replace ConcurrentHashMap for the 
nodes.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772722737

   > It is possible that the candidate postings (gathered via HNSW) don't 
contain ANY filtered docs. This would require gathering more candidate postings.
   
   This was a big problem for our initial deployment: we'd filter down to a few valid items and then spend forever searching the graph for them. We had to add a fairly accurate estimate of how many comparisons the index would need, and use that to decide whether to brute-force the comparison instead. (This is in Cassandra, not JVector; specifically VectorMemtableIndex.expectedNodesVisited.) I don't remember seeing code for this in Lucene, but I mostly only looked at the HNSW code, so I could have missed it.
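   The decision described above can be sketched roughly as follows. This is not the actual `VectorMemtableIndex.expectedNodesVisited` logic; the estimation formula and all names are assumptions for illustration only.

   ```java
   // Sketch: choose between filtered graph search and brute force based on an
   // estimated comparison count. The estimate here is deliberately crude.
   public class SearchPlanner {

     /**
      * Rough estimate of similarity comparisons for a filtered graph search:
      * to collect topK results passing a filter that matches a fraction
      * `selectivity` of docs, the search visits roughly topK / selectivity nodes.
      */
     static long estimateGraphComparisons(int topK, double selectivity) {
       if (selectivity <= 0) {
         return Long.MAX_VALUE; // nothing matches; the graph search would wander forever
       }
       return (long) Math.ceil(topK / selectivity);
     }

     /** Brute force compares the query against every doc that passes the filter. */
     static long bruteForceComparisons(long matchingDocs) {
       return matchingDocs;
     }

     /** Prefer brute force when it is expected to do fewer comparisons. */
     static boolean useBruteForce(int topK, long totalDocs, long matchingDocs) {
       double selectivity = (double) matchingDocs / totalDocs;
       return bruteForceComparisons(matchingDocs)
           <= estimateGraphComparisons(topK, selectivity);
     }

     public static void main(String[] args) {
       // Highly selective filter: only 100 of 1M docs match -> brute force wins.
       System.out.println(useBruteForce(10, 1_000_000, 100));      // true
       // Unselective filter: half the docs match -> graph search wins.
       System.out.println(useBruteForce(10, 1_000_000, 500_000));  // false
     }
   }
   ```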





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772724758

   > Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that 
wraps jvector? (https://github.com/jbellis/jvector)
   
   I'm happy to support anyone who wants to try this, including modifying 
JVector to make it a better fit if necessary.  I won't have time to tackle this 
myself in the medium-term, however.





Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-20 Thread via GitHub


easyice commented on PR #12658:
URL: https://github.com/apache/lucene/pull/12658#issuecomment-1772756782

   I think we can only use this optimization for merges without deleted docs, because we can't use the cardinality of `liveDocs` as the docCount: the `liveDocs` bits are all set to 1 when initialized, so we would need to iterate through all the docs to know the docCount.
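   The distinction can be sketched with plain `java.util.BitSet` standing in for Lucene's bitset classes (the example data is made up): the cardinality of `liveDocs` counts every live doc in the segment, while the per-field docCount must count only the distinct docs that actually have points, which requires visiting the points.

   ```java
   import java.util.BitSet;

   // Why liveDocs cardinality is not the per-field docCount: liveDocs starts
   // with every bit set (all docs live), so its cardinality reflects deletions
   // only, not which docs have points for this field.
   public class DocCountDemo {

     static int docCountFromPoints(int[] pointDocIds) {
       BitSet seen = new BitSet();
       for (int docId : pointDocIds) {
         seen.set(docId); // a doc with many points is still counted once
       }
       return seen.cardinality();
     }

     public static void main(String[] args) {
       int maxDoc = 10;
       BitSet liveDocs = new BitSet(maxDoc);
       liveDocs.set(0, maxDoc);          // initialized to all ones
       liveDocs.clear(7);                // one deletion

       // Only three distinct docs actually have points (doc 2 has two points).
       int[] pointDocIds = {2, 2, 4, 5};

       System.out.println(liveDocs.cardinality());          // 9  (live docs)
       System.out.println(docCountFromPoints(pointDocIds)); // 3  (docs with points)
     }
   }
   ```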





Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-20 Thread via GitHub


easyice commented on code in PR #12658:
URL: https://github.com/apache/lucene/pull/12658#discussion_r1366998517


##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java:
##
@@ -519,9 +526,8 @@ private Runnable writeFieldNDims(
 // compute the min/max for this slice
 computePackedValueBounds(
 values, 0, Math.toIntExact(pointCount), minPackedValue, 
maxPackedValue, scratchBytesRef1);
-for (int i = 0; i < Math.toIntExact(pointCount); ++i) {
-  docsSeen.set(values.getDocID(i));
-}
+// docCount has already been set by {@link BKDWriter#writeField}
+assert docCount != -1;

Review Comment:
   I moved the `docCount` parameter from `writeField()` to the constructor; it looks cleaner.






Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12703:
URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772787526

   Thanks for investigating this! Can we just fix vector code to take 
MemorySegment and wrap array code?
   
   I don't think we should add yet another factor to multiply the number of 
vector impls, there are already too many (encoding * function * cpu). 
   
   Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO 
we should deprecate IndexInput for lucene 10.





Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12703:
URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772792213

   As far as performance in practice goes, what kind of alignment is necessary such that it is reasonable for mmap'd files? Please, let it not be 64-byte alignment for AVX-512; that's too wasteful.





Re: [I] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-20 Thread via GitHub


bruno-roustant commented on issue #12704:
URL: https://github.com/apache/lucene/issues/12704#issuecomment-1772810122

   @dweiss will probably say more than me about the awesome BitMixer#PHI_C64 
constant!
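   For context, `PHI_C64` is the 64-bit golden-ratio constant used for multiplicative (Fibonacci) hash mixing. A minimal sketch of the technique follows; the constant value `0x9E3779B97F4A7C15L` is assumed here from the splitmix64/Fibonacci-hashing literature, not read from HPPC's `BitMixer` source.

   ```java
   // Golden-ratio hash mixing: multiply by 2^64/phi (rounded to odd), then
   // fold the well-mixed high bits down into the low bits that a
   // power-of-two hash table actually uses.
   public class PhiMix {
     static final long PHI_C64 = 0x9E3779B97F4A7C15L; // assumed value of BitMixer#PHI_C64

     static long mix(long h) {
       h *= PHI_C64;
       return h ^ (h >>> 32); // fold high bits into low bits
     }

     public static void main(String[] args) {
       // Sequential keys, which collide badly in a power-of-two table when
       // used raw, spread across buckets after mixing.
       for (long key = 0; key < 4; key++) {
         System.out.println(mix(key) & 63); // bucket index in a 64-slot table
       }
     }
   }
   ```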





Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on PR #12653:
URL: https://github.com/apache/lucene/pull/12653#issuecomment-1772824555

   Thanks @shubhamvishu -- looks great!  I plan to merge later today.





Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on code in PR #12653:
URL: https://github.com/apache/lucene/pull/12653#discussion_r1367036009


##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java:
##
@@ -63,24 +63,23 @@ public abstract class MultiLevelSkipListWriter {
   /** for every skip level a different buffer is used */
   private ByteBuffersDataOutput[] skipBuffer;
 
+  /** Length of the window at which the skips are placed on skip level 1 */
+  private final long windowLength;
+
   /** Creates a {@code MultiLevelSkipListWriter}. */
   protected MultiLevelSkipListWriter(
   int skipInterval, int skipMultiplier, int maxSkipLevels, int df) {
 this.skipInterval = skipInterval;
 this.skipMultiplier = skipMultiplier;
 
-int numberOfSkipLevels;
+int numberOfSkipLevels = 1;
 // calculate the maximum number of skip levels for this document frequency
-if (df <= skipInterval) {
-  numberOfSkipLevels = 1;
-} else {
-  numberOfSkipLevels = 1 + MathUtil.log(df / skipInterval, skipMultiplier);
-}
-
-// make sure it does not exceed maxSkipLevels
-if (numberOfSkipLevels > maxSkipLevels) {
-  numberOfSkipLevels = maxSkipLevels;
+if (df > skipInterval) {
+  // also make sure it does not exceed maxSkipLevels
+  numberOfSkipLevels =
+  Math.min(1 + MathUtil.log(df / skipInterval, skipMultiplier), 
maxSkipLevels);
 }

Review Comment:
   > Actually it was not the instance variable that you are thinking of and 
rather was a local variable with duplicate name(i.e. `numberOfSkipLevels`) 
which was totally unnecessary here so I cleaned that up as well.
   
   Aha!  That explains my confusion.  Such shadowing (`x` and `this.x` being 
different) is so confusing/dangerous.  Thanks for cleaning it up!
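   The computation in the patched constructor can be sketched standalone, with `MathUtil.log` inlined as an integer (floor) logarithm. The parameter values in `main` are for illustration only.

   ```java
   // Standalone sketch of the skip-level computation from the diff above.
   public class SkipLevels {

     /** Integer logarithm: floor(log_base(x)), mirroring MathUtil.log. */
     static int intLog(int x, int base) {
       int ret = 0;
       while (x >= base) {
         x /= base;
         ret++;
       }
       return ret;
     }

     static int numberOfSkipLevels(int df, int skipInterval, int skipMultiplier, int maxSkipLevels) {
       int levels = 1;
       if (df > skipInterval) {
         // one base level, plus one level per skipMultiplier factor, capped
         levels = Math.min(1 + intLog(df / skipInterval, skipMultiplier), maxSkipLevels);
       }
       return levels;
     }

     public static void main(String[] args) {
       // skipInterval=128, skipMultiplier=8 chosen for illustration.
       System.out.println(numberOfSkipLevels(100, 128, 8, 10));     // 1 (df <= skipInterval)
       System.out.println(numberOfSkipLevels(1024, 128, 8, 10));    // 2
       System.out.println(numberOfSkipLevels(1 << 20, 128, 8, 10)); // 5
     }
   }
   ```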






Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-20 Thread via GitHub


shubhamvishu commented on code in PR #12679:
URL: https://github.com/apache/lucene/pull/12679#discussion_r1367106152


##
lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Objects;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+
+/**
+ * Search for all (approximate) vectors within a radius using the {@link 
RnnCollector}.
+ *
+ * @lucene.experimental
+ */
+abstract class AbstractRnnVectorQuery extends AbstractKnnVectorQuery {
+  private static final TopDocs NO_RESULTS = TopDocsCollector.EMPTY_TOPDOCS;
+
+  protected final float traversalThreshold, resultThreshold;
+
+  public AbstractRnnVectorQuery(
+  String field, float traversalThreshold, float resultThreshold, Query 
filter) {
+super(field, Integer.MAX_VALUE, filter);
+assert traversalThreshold <= resultThreshold;

Review Comment:
   Thanks!






Re: [PR] Scorer should sum up scores into a double [lucene]

2023-10-20 Thread via GitHub


shubhamvishu commented on PR #12682:
URL: https://github.com/apache/lucene/pull/12682#issuecomment-1772913419

   Thanks for the approval @jpountz !





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772915097

   > > Unfortunately this approach is slightly suboptimal for your Rocket Lake 
which doesn't suffer from downclocking, but it is a disaster elsewhere, so we 
have to play it safe.
   > 
   > We could add a sysprops to enable advanced avx512 support. Could be a 
static boolean then on the impl. If somebody enables it, they must be aware 
that it may slow down. I have seen stuff like this in native vector databases 
on buzzwords (they told me).
   
   The JVM already has these. For example, a user can already set the max vector width and the AVX instruction level. I assume that AVX-512 users who are running on downclock-susceptible CPUs would already set flags to use only 256-bit vectors, so I am afraid of adding our own flags that conflict with those.





Re: [PR] SOLR-15055 Re-implement 'withCollection' and 'maxShardsPerNode' [lucene-solr]

2023-10-20 Thread via GitHub


ljak commented on PR #2179:
URL: https://github.com/apache/lucene-solr/pull/2179#issuecomment-1772924915

   Hi,
   
   I know it's an old thread but I have a question.
   
   As far as I can tell (after searching), the `maxShardsPerNode` function was never re-implemented (in the new autoscaling framework), right?
   
   Thanks!





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


uschindler commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772952661

   Hi,
   
   > The jvm already has these. For example a user can set max vector width and 
avx instructiom level already. I assume that avx 512 users who are running on 
downclock-susceptible CPUs would already set flags to use only 256bit vectors. 
So I am afraid of adding our own flags that conflict with those.
   
   The problem is that by default the JVM enables those flags, and then those machines get slow when they use Lucene, and support cases get opened at Elasticsearch/OpenSearch. So my wish would be to have that opt-in.
   
   My idea was to add a sysprop, like the MMap ones used to enable/disable things manually or to disable unsafe unmapping. In this case the sysprop would enable the AVX-512 path. It won't conflict with a manual AVX=2 JVM override setting (because then preferredBitSize=256 and our flag is a no-op).





Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub


ChrisHegarty commented on PR #12703:
URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772963134

   @rmuir If I understand your comment correctly: I unaligned the vector data in the mmap file in the benchmark. The results are similar enough to the aligned case, maybe a little lower when the alignment is terrible! See [b02b79b](https://github.com/apache/lucene/pull/12703/commits/b02b79b0b3aafdb889a5314dc07a235c0eac45db)
   
   Rocket Lake, AVX 512
   ```
   INFO: Java vector incubator API enabled; uses preferredBitSize=512
   ...
   VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5   9.994 ± 0.002  ops/us
   VectorUtilBenchmark.floatDotProductVectorMS1   1024  thrpt    5  11.171 ± 0.007  ops/us
   VectorUtilBenchmark.floatDotProductVectorMS2   1024  thrpt    5  13.159 ± 0.017  ops/us
   ```
   
   Mac M2, 128 bit
   ```
   INFO: Java vector incubator API enabled; uses preferredBitSize=128
   ...
   VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5   5.242 ± 0.025  ops/us
   VectorUtilBenchmark.floatDotProductVectorMS1   1024  thrpt    5   7.928 ± 0.155  ops/us
   VectorUtilBenchmark.floatDotProductVectorMS2   1024  thrpt    5  12.546 ± 0.077  ops/us
   ```





Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub


ChrisHegarty commented on PR #12703:
URL: https://github.com/apache/lucene/pull/12703#issuecomment-1772974346

   > Thanks for investigating this! Can we just fix vector code to take 
MemorySegment and wrap array code?
   
   Yes, that is a good idea. I'll do it and see how poorly it performs. If we do this, then we still need to handle the case where a vector spans across more than one segment (well, it should only ever span at most two). We can just copy the bytes into a float[] for that case and get a memory segment view over it; this will happen very infrequently.
   
   > I don't think we should add yet another factor to multiply the number of 
vector impls, there are already too many (encoding * function * cpu).
   
   Agreed.
   
   > Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO 
we should deprecate IndexInput for lucene 10.
   
   I'm not sure how this would work, but I accept the sentiment. And I think I see/agree with where you are going longer term with this (and Java 22).
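   The spanning-vector fallback described above can be sketched with `ByteBuffer` standing in for `MemorySegment` (purely for illustration; all names are made up): when a vector's bytes straddle two mapped regions, copy the two pieces into one contiguous buffer and run the array code over that.

   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import java.util.Arrays;

   // Rare-path fallback: a float vector whose bytes span two regions is
   // reassembled into a single heap float[].
   public class SpanningVector {

     static float[] readSpanning(ByteBuffer first, ByteBuffer second, int dims) {
       ByteBuffer joined = ByteBuffer.allocate(dims * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
       joined.put(first);   // copy the leading bytes from region 1
       joined.put(second);  // copy the trailing bytes from region 2
       joined.flip();
       float[] vector = new float[dims];
       joined.asFloatBuffer().get(vector);
       return vector;
     }

     public static void main(String[] args) {
       float[] original = {1f, 2f, 3f, 4f};
       byte[] raw = new byte[original.length * Float.BYTES];
       ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(original);

       // Split the 16 vector bytes unevenly across two "regions".
       ByteBuffer first = ByteBuffer.wrap(Arrays.copyOfRange(raw, 0, 6));
       ByteBuffer second = ByteBuffer.wrap(Arrays.copyOfRange(raw, 6, raw.length));

       float[] copy = readSpanning(first, second, original.length);
       System.out.println(Arrays.equals(original, copy)); // true
     }
   }
   ```

   Since at most one vector per region boundary needs this path, the extra copy is negligible in practice.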
   





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


ChrisHegarty commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772982177

   Just dropping an ACK here, for now. I do get the issues, and I agree that 
there could be better ways to frame things at the vector API level.





Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1772988443

   `Test2BFST` passed!
   
   ```
   The slowest tests (exceeding 500 ms) during this run:
     2993.94s Test2BFST.test (:lucene:core)
   
   The slowest suites (exceeding 1s) during this run:
     2994.06s Test2BFST (:lucene:core)
   
   BUILD SUCCESSFUL in 49m 58s
   
   246 actionable tasks: 78 executed, 168 up-to-date
   ```
   
   That last check failed because of formatting -- I re-tidy'd and I think it's 
ready -- I'll push shortly.





Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-20 Thread via GitHub


mikemccand merged PR #12633:
URL: https://github.com/apache/lucene/pull/12633





Re: [I] Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality [lucene]

2023-10-20 Thread via GitHub


mikemccand commented on issue #12542:
URL: https://github.com/apache/lucene/issues/12542#issuecomment-1772993863

   I've merged the change into `main`!  I'll let it bake for some time (week or 
two?) and if all looks good, backport to 9.x.





Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub


uschindler commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1773038045

   > Just dropping an ACK here, for now. I do get the issues, and I agree that 
there could be better ways to frame things at the vector API level.
   
   Let's write a proposal together in a Google doc (or similar), and we could
later open an issue or JEP. As we are both OpenJDK members, it would be
helpful for me to participate in that process (I want to learn something new).





Re: [PR] Random access term dictionary [lucene]

2023-10-20 Thread via GitHub


Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1773113866

   Thanks @bruno-roustant! If you're okay with sharing it, feel free to post it
here.
   
   I'm in the process of baking my own specific implementation (which 
internally uses a single long as a bit buffer), but I might absorb some 
interesting ideas from your impl. 
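   For context, a "single long as bit buffer" along the lines described could
look like this minimal sketch (the class name and details are illustrative,
not the PR's actual code): values are OR'd into a 64-bit accumulator and
complete bytes are flushed out as they fill up.

   ```java
   import java.io.ByteArrayOutputStream;

   public class LongBitBuffer {
       private long buffer = 0;   // bits accumulate here, LSB-first
       private int bitCount = 0;  // number of valid bits currently in buffer
       private final ByteArrayOutputStream out = new ByteArrayOutputStream();

       /** Append the low {@code width} bits of {@code value} (width <= 57). */
       public void write(long value, int width) {
           buffer |= (value & ((1L << width) - 1)) << bitCount;
           bitCount += width;
           while (bitCount >= 8) {  // flush complete bytes
               out.write((int) (buffer & 0xFF));
               buffer >>>= 8;
               bitCount -= 8;
           }
       }

       /** Flush any trailing partial byte, zero-padded. */
       public byte[] finish() {
           if (bitCount > 0) {
               out.write((int) (buffer & 0xFF));
               buffer = 0;
               bitCount = 0;
           }
           return out.toByteArray();
       }

       public static void main(String[] args) {
           LongBitBuffer b = new LongBitBuffer();
           b.write(0b101, 3);
           b.write(0b11, 2);
           b.write(0xABC, 12);
           System.out.println(b.finish().length);  // 17 bits -> 3 bytes
       }
   }
   ```

   The width limit keeps {@code bitCount + width <= 64}, so the accumulator
never overflows between flushes.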





Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub


ChrisHegarty commented on PR #12703:
URL: https://github.com/apache/lucene/pull/12703#issuecomment-1773365569

   Well... simple wrapping of a float[] into a MemorySegment is not going to 
work out; the Vector API does not like it due to alignment constraints (which 
seems overly pedantic, since it can operate fine when accessing a float[] 
directly!)
   
   ```
   FloatVector fromMemorySegment(...)
   
   throws IllegalArgumentException - if the memory segment is a heap segment 
that is not backed by a byte[] array.
   ```
   
   This approach would of course work fine for the binary computations, since 
they are byte[] backed.
   
   The current approach has three variants:
1. `dotProduct(float[] a, float[] b)`
2. `dotProduct(float[] a, RandomAccessVectorValues b, int bOffset)`
3. `dotProduct(RandomAccessVectorValues a, int aOffset, 
RandomAccessVectorValues b, int bOffset)`
   
   .. no.1 is used as a fallback when we span segments.
   
   I suspect that no.1 and no.2 are not as perf sensitive as no.3?  In fact, we 
could try to trace the float[] back to where it originates, maybe read and 
store it as bytes but access it as a float.
   
   Maybe there are other ways to invert things? I'll need to sleep on it!! 
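   The "store it as bytes but access it as a float" idea could be sketched
roughly like this (a hedged illustration assuming JDK 22+ java.lang.foreign;
the names are mine, not the PR's). A byte[]-backed heap segment satisfies the
`fromMemorySegment` constraint quoted above, whereas `MemorySegment.ofArray`
over the float[] itself would not:

   ```java
   import java.lang.foreign.MemorySegment;
   import java.lang.foreign.ValueLayout;

   public class ByteBackedFloats {
       public static void main(String[] args) {
           float[] src = {1.0f, 2.5f, -3.75f};
           // Store the vector data in a byte[]-backed heap segment...
           byte[] backing = new byte[src.length * Float.BYTES];
           MemorySegment seg = MemorySegment.ofArray(backing);
           for (int i = 0; i < src.length; i++) {
               // byte[] segments guarantee only 1-byte alignment, so use the
               // unaligned float layout for the accesses.
               seg.setAtIndex(ValueLayout.JAVA_FLOAT_UNALIGNED, i, src[i]);
           }
           // ...and read the same bytes back as floats.
           for (int i = 0; i < src.length; i++) {
               System.out.println(seg.getAtIndex(ValueLayout.JAVA_FLOAT_UNALIGNED, i));
           }
       }
   }
   ```

   Whether tracing the float[] back to its origin to store it this way is
practical is exactly the open question in the comment above.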





Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-20 Thread via GitHub


dsmiley commented on PR #12293:
URL: https://github.com/apache/lucene/pull/12293#issuecomment-1773459815

   I'm eager to see the kind of build insights Gradle Enterprise offers us.  If 
there are no further concerns, I'll merge Tuesday.





Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-20 Thread via GitHub


rmuir commented on PR #12703:
URL: https://github.com/apache/lucene/pull/12703#issuecomment-1773612369

   > Well... simple wrapping of float[] into MemorySegment is not going to 
work out, the Vector API does not like it due to alignment constraints (which 
seems overly pedantic since it can operate fine when accessing a float[] 
directly! )
   
   But all `float[]` on the heap are just like any other heap object and 
aligned to 8 bytes or something?
   
   ```
   $ java -XX:+PrintFlagsFinal -version | grep -i align
   bool AlignVector              = false    {C2 product} {default}
   intx InteriorEntryAlignment   = 16       {C2 pd product} {default}
   intx NumberOfLoopInstrToAlign = 4        {C2 product} {default}
    int ObjectAlignmentInBytes   = 8        {product lp64_product} {default}
   intx OptoLoopAlignment        = 16       {pd product} {default}
   bool UseUnalignedLoadStores   = true     {ARCH product} {default}
   ```





Re: [PR] Remove direct dependency of NodeHash to FST [lucene]

2023-10-20 Thread via GitHub


dungba88 commented on PR #12690:
URL: https://github.com/apache/lucene/pull/12690#issuecomment-1773630051

   As the other PR has been merged, I have rebased and resolved the conflicts.

