[GitHub] [lucene] jpountz commented on pull request #12415: Optimize disjunction counts.

2023-07-06 Thread via GitHub


jpountz commented on PR #12415:
URL: https://github.com/apache/lucene/pull/12415#issuecomment-1623101984

   To me a big question with this API is whether we should consider methods on 
the `DocIdStream` terminal or not. If we do, this may enable more 
optimizations later on, e.g. it would be legal to create such objects that are 
backed by an iterator. But on the other hand, you wouldn't be able to propagate 
this optimization in a `MultiLeafCollector`, since it would be illegal for each 
sub `LeafCollector` to consume the `DocIdStream` independently.
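   To illustrate the trade-off, here is a minimal, hypothetical sketch (not 
Lucene's actual `DocIdStream` API) of a stream whose consuming methods are 
terminal: it can be lazily backed by a plain iterator, but a second consumer 
fails, which is exactly why independent consumption by sub-collectors would be 
illegal.

```java
import java.util.Iterator;
import java.util.function.IntConsumer;

// Hypothetical one-shot doc-id stream: consuming methods are terminal,
// so the stream may be lazily backed by an iterator instead of a bitset.
final class OneShotDocIdStream {
  private final Iterator<Integer> docs;
  private boolean consumed = false;

  OneShotDocIdStream(Iterator<Integer> docs) {
    this.docs = docs;
  }

  /** Terminal: counts matching docs; the stream cannot be reused afterwards. */
  int count() {
    markConsumed();
    int count = 0;
    while (docs.hasNext()) {
      docs.next();
      count++;
    }
    return count;
  }

  /** Terminal: pushes each doc id to the collector exactly once. */
  void forEach(IntConsumer collector) {
    markConsumed();
    while (docs.hasNext()) {
      collector.accept(docs.next());
    }
  }

  private void markConsumed() {
    if (consumed) {
      // This is why a MultiLeafCollector could not hand the same stream
      // to several sub-collectors: only one terminal call is legal.
      throw new IllegalStateException("DocIdStream already consumed");
    }
    consumed = true;
  }
}
```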


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tang-hi commented on issue #12419: IndexWriter and ConcurrentMergeScheduler and SegmentReader can cause static initialization deadlock

2023-07-06 Thread via GitHub


tang-hi commented on issue #12419:
URL: https://github.com/apache/lucene/issues/12419#issuecomment-1623127458

   In my opinion, this deadlock can only occur at the beginning of program 
execution. As long as you ensure that the IndexWriter is constructed before the 
ConcurrentMergeScheduler, the deadlock should not occur.
   
   Do you have a specific scenario where you need to construct the 
IndexWriter and ConcurrentMergeScheduler concurrently at program startup?
```Java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class DeadlockRepro {

  public static void main(String[] args) throws Exception {
    CountDownLatch startLatch = new CountDownLatch(1);

    // Construct the IndexWriter first, before any ConcurrentMergeScheduler.
    Thread t1 = new Thread(() -> {
      try {
        new IndexWriter(new ByteBuffersDirectory(), new IndexWriterConfig());
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });

    Thread t2 = new Thread(() -> {
      try {
        startLatch.await();
        new ConcurrentMergeScheduler();
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    });

    t1.start();
    t2.start();

    t1.join();
    System.out.println("Start!");
    startLatch.countDown();

    t2.join();
    System.out.println("Done");

    // Once the classes are initialized, the race can no longer deadlock.
    for (int i = 0; i < 1000; i++) {
      deadlock();
    }
  }

  // Releasing both constructors at once can trigger the static
  // initialization deadlock at the very start of the program.
  static void deadlock() throws InterruptedException {
    CountDownLatch startLatch = new CountDownLatch(1);

    Thread t1 = new Thread(() -> {
      try {
        startLatch.await();
        new IndexWriter(new ByteBuffersDirectory(), new IndexWriterConfig());
      } catch (IOException | InterruptedException e) {
        throw new RuntimeException(e);
      }
    });

    Thread t2 = new Thread(() -> {
      try {
        startLatch.await();
        new ConcurrentMergeScheduler();
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    });

    t1.start();
    t2.start();

    System.out.println("Start!");
    startLatch.countDown();

    t2.join();
    t1.join();

    System.out.println("Done");
  }
}
```





[GitHub] [lucene] hydrogen666 commented on issue #12419: IndexWriter and ConcurrentMergeScheduler and SegmentReader can cause static initialization deadlock

2023-07-06 Thread via GitHub


hydrogen666 commented on issue #12419:
URL: https://github.com/apache/lucene/issues/12419#issuecomment-1623167682

   @tang-hi Our Elasticsearch has migrated to a segment replication mechanism, 
so the code running on the primary shard and the replica shard is different. The 
replica shard no longer constructs an `IndexWriter`.
   
   Thread A starts a primary shard with the procedure below:
   1. Construct `ConcurrentMergeScheduler`
   2. Construct `IndexWriter`
   
   Thread B starts a replica shard with the procedure below:
   1. Deserialize `SegmentInfos` from disk
   2. Start a read-only SearcherManager
   
   In the first step, we call `SegmentInfos#readCommit` to deserialize 
`SegmentInfos`, with the call stack below:
   ```
   at org.apache.lucene.index.IndexWriter.<clinit>(IndexWriter.java:6485)
   at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:499)
   at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:364)
   at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:311)
   ```





[GitHub] [lucene] almogtavor commented on issue #12406: Register nested queries (ToParentBlockJoinQuery) to Lucene Monitor

2023-07-06 Thread via GitHub


almogtavor commented on issue #12406:
URL: https://github.com/apache/lucene/issues/12406#issuecomment-1623435023

   @romseygeek @dweiss I'd love to get feedback from you on the subject





[GitHub] [lucene] mikemccand commented on a diff in pull request #12345: LUCENE-10641: IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches

2023-07-06 Thread via GitHub


mikemccand commented on code in PR #12345:
URL: https://github.com/apache/lucene/pull/12345#discussion_r1254254429


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -763,6 +763,11 @@ public Query rewrite(Query original) throws IOException {
 for (Query rewrittenQuery = query.rewrite(this);
 rewrittenQuery != query;
 rewrittenQuery = query.rewrite(this)) {
+  if (queryTimeout != null) {

Review Comment:
   @Deepika0510 I think what @jpountz meant is we should somehow intercept the 
I/O Lucene is inevitably doing in a `rewrite` implementation to check for 
timeout.  Similar to how `ExitableDirectoryReader` wraps the low level Lucene 
APIs and checks for timeout.
   
   But I don't see how we could cleanly do this (wrap `Directory`) in the 
context of `rewrite`.
   
   Maybe instead we just use `ExitableDirectoryReader`, but make a new 
generalized version `ExitableIndexReader` that takes any `IndexReader` (not 
just the `DirectoryReader` implementation)?
   
   Then, if a timeout is set on `IndexSearcher`, we use this wrapped timeout 
`IndexReader` in rewrite somehow?  We should maybe create that wrapped reader 
once on `IndexSearcher.setTimeout` calls and reuse it across all concurrent 
rewrites?
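   The `ExitableDirectoryReader` pattern under discussion boils down to 
wrapping a low-level access path so every read re-checks a timeout. A minimal, 
self-contained sketch of that wrapping idea (plain Java with illustrative 
names, not Lucene's actual `ExitableDirectoryReader` or `QueryTimeout` API):

```java
import java.util.Iterator;
import java.util.function.BooleanSupplier;

// Sketch of the ExitableDirectoryReader idea: wrap an iterator so each
// advance re-checks a timeout supplier and aborts once it fires.
final class ExitableIterator<T> implements Iterator<T> {
  private final Iterator<T> in;
  private final BooleanSupplier timedOut; // stands in for a shouldExit() check

  ExitableIterator(Iterator<T> in, BooleanSupplier timedOut) {
    this.in = in;
    this.timedOut = timedOut;
  }

  @Override
  public boolean hasNext() {
    checkTimeout();
    return in.hasNext();
  }

  @Override
  public T next() {
    checkTimeout();
    return in.next();
  }

  private void checkTimeout() {
    if (timedOut.getAsBoolean()) {
      throw new RuntimeException("query timed out while iterating");
    }
  }
}
```

   A generalized `ExitableIndexReader` would apply the same wrapping to terms 
and points access instead of a plain iterator.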






[GitHub] [lucene] easyice opened a new pull request, #12420: Optimize DocIdsWriter for BKD in reverse case with index sorting

2023-07-06 Thread via GitHub


easyice opened a new pull request, #12420:
URL: https://github.com/apache/lucene/pull/12420

   ### Description
   
   In https://github.com/apache/lucene/pull/438 and 
https://github.com/apache/lucene/pull/510 we optimized how doc ids are stored 
and decoded; this PR allows those optimizations to also be applied in the 
reverse case.
   
   For the bitset store type, we write the bitset in the same data format as 
ascending order; when reading, we build the same FixedBitSet as in the 
ascending case and simply iterate over it in reverse.
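   The reverse traversal described above can be sketched with 
`java.util.BitSet` standing in for Lucene's `FixedBitSet`: the stored format is 
unchanged, only the iteration direction differs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch: the bitset is written and read exactly as in the ascending case;
// only the iteration direction changes for descending index sorts.
final class ReverseBitSetIteration {
  static List<Integer> docsInReverse(BitSet docs) {
    List<Integer> out = new ArrayList<>();
    // previousSetBit walks the set bits from high doc ids down to low ones.
    for (int doc = docs.previousSetBit(docs.length() - 1);
         doc >= 0;
         doc = docs.previousSetBit(doc - 1)) {
      out.add(doc);
    }
    return out;
  }
}
```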





[GitHub] [lucene] jpountz commented on a diff in pull request #12345: LUCENE-10641: IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches

2023-07-06 Thread via GitHub


jpountz commented on code in PR #12345:
URL: https://github.com/apache/lucene/pull/12345#discussion_r1254334257


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -763,6 +763,11 @@ public Query rewrite(Query original) throws IOException {
 for (Query rewrittenQuery = query.rewrite(this);
 rewrittenQuery != query;
 rewrittenQuery = query.rewrite(this)) {
+  if (queryTimeout != null) {

Review Comment:
   Woops, sorry for missing your question. @mikemccand 's understanding is 
correct, something like `ExitableDirectoryReader`.  +1 to @mikemccand 's 
suggestion:
- Implement a package-private `ExitableIndexReader`, which is really the 
same as `ExitableDirectoryReader` but for any `IndexReader`. It should wrap 
terms and points, but not postings and doc values (which are already covered 
via the bulk scorer).
- Update `IndexSearcher#getIndexReader` to return the wrapped index reader. 
This will automatically make sure expensive rewrites are intercepted since 
`IndexSearcher#getIndexReader` is their only way to get access to the reader.






[GitHub] [lucene] uschindler commented on a diff in pull request #12417: forutil add vectorized and scalar code

2023-07-06 Thread via GitHub


uschindler commented on code in PR #12417:
URL: https://github.com/apache/lucene/pull/12417#discussion_r1254354197


##
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultForUtil90.java:
##
@@ -0,0 +1,118 @@
+// This file has been automatically generated, DO NOT EDIT
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+
+final class DefaultForUtil90 implements ForUtil90 {
+  private static final int[] tmp = new int[BLOCK_SIZE];

Review Comment:
   You can't make those static, as this would break thread safety. To keep 
those temporary buffers isolated, each caller gets its own instance of the 
ForUtil90; see `Provider#newForUtil90()`.
   
   Please check the original code.
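   The thread-safety point can be illustrated with a small, self-contained 
sketch (hypothetical names, not the actual `ForUtil90` code): a `static` 
scratch buffer would be shared by every thread, while an instance field gives 
each caller its own.

```java
// Sketch: why the scratch buffer must be per-instance, not static.
// A static int[] would be shared by every thread using the encoder, so
// concurrent encode() calls would overwrite each other's scratch data.
final class PerInstanceScratch {
  private static final int BLOCK_SIZE = 128;

  // Per-instance scratch buffer: safe as long as each thread works on
  // its own instance (e.g. obtained from a newForUtil90()-style factory).
  private final int[] tmp = new int[BLOCK_SIZE];

  /** Fills this instance's scratch buffer and returns it. */
  int[] fill(int value) {
    java.util.Arrays.fill(tmp, value);
    return tmp;
  }
}
```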



##
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultForUtil90.java:
##
@@ -0,0 +1,118 @@
+// This file has been automatically generated, DO NOT EDIT
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+
+final class DefaultForUtil90 implements ForUtil90 {
+  private static final int[] tmp = new int[BLOCK_SIZE];

Review Comment:
   The same applies for the field below and those in Panama.






[GitHub] [lucene] uschindler commented on a diff in pull request #12417: forutil add vectorized and scalar code

2023-07-06 Thread via GitHub


uschindler commented on code in PR #12417:
URL: https://github.com/apache/lucene/pull/12417#discussion_r1254355675


##
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultForUtil90.java:
##
@@ -0,0 +1,118 @@
+// This file has been automatically generated, DO NOT EDIT
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+
+final class DefaultForUtil90 implements ForUtil90 {
+  private static final int[] tmp = new int[BLOCK_SIZE];

Review Comment:
   Ah you fixed already.






[GitHub] [lucene] easyice commented on pull request #12420: Optimize DocIdsWriter for BKD in reverse case with index sorting

2023-07-06 Thread via GitHub


easyice commented on PR #12420:
URL: https://github.com/apache/lucene/pull/12420#issuecomment-1623592791

   This duplicates https://github.com/apache/lucene/pull/666, so I am closing it.





[GitHub] [lucene] easyice closed pull request #12420: Optimize DocIdsWriter for BKD in reverse case with index sorting

2023-07-06 Thread via GitHub


easyice closed pull request #12420: Optimize DocIdsWriter for BKD in reverse 
case with index sorting
URL: https://github.com/apache/lucene/pull/12420





[GitHub] [lucene] benwtrent commented on pull request #12413: Fix HNSW graph visitation limit bug

2023-07-06 Thread via GitHub


benwtrent commented on PR #12413:
URL: https://github.com/apache/lucene/pull/12413#issuecomment-1623859719

   @msokolov found my bug 🤦 in the simplified version. I updated, removed the 
need for tracking candidates & results since we only care about the best found 
entry point.





[GitHub] [lucene] jbellis opened a new pull request, #12421: Concurrent hnsw graph and builder, take two

2023-07-06 Thread via GitHub


jbellis opened a new pull request, #12421:
URL: https://github.com/apache/lucene/pull/12421

   ### Motivation
   I need to support concurrent reads and writes to an HNSW index for 
Cassandra.  One option would be to partition the document range and assign each 
range to a single thread with the existing implementation.  However,
   * It is much faster to query a single on-heap HNSW index than to query 
multiple such indexes and combine the results (log(N) vs. M log(N/M), where N 
is the number of documents and M is the number of partitions).
   * I need to ultimately write the combined index to disk, so even with some 
contention necessarily occurring while building the index, we still come out 
way ahead in total efficiency vs. creating per-thread indexes and combining 
them: combining the per-thread indexes boils down to "pick the largest and 
then add all the other nodes normally," so you don't really benefit from 
having computed the others previously and you end up doing close to 2x the 
work.
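   Plugging illustrative numbers into that cost comparison (N = 1M documents, 
M = 4 partitions; these figures are assumptions for illustration, not 
measurements from this PR) shows why the single combined index wins on 
comparison count:

```java
// Rough comparison-count model: searching one HNSW index costs on the order
// of log2(N) distance comparisons, vs. M * log2(N/M) when querying M
// partitioned indexes and merging their results.
final class PartitionCostModel {
  static double singleIndexCost(long n) {
    return Math.log(n) / Math.log(2);
  }

  static double partitionedCost(long n, int m) {
    return m * (Math.log((double) n / m) / Math.log(2));
  }
}
```

   For N = 1,000,000 this gives roughly 20 vs. 72 comparisons, i.e. the 
partitioned layout does several times more work per query.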
   
   ### Performance
   Numbers are from my SIFT test harness at 
https://github.com/jbellis/hnswdemo/tree/lucene-bench.  Numbers are averaged 
across 5 test runs on my i9-12900 (8 performance cores and 8 efficiency)
   
   Serial construction of 1M nodes: 292.3s
   Parallel construction, 1 thread: 379.0s = 129.7% of serial
   Parallel construction, 2 threads: 191.3s = 50.5% of parallel/1
   Parallel construction, 4 threads: 96.1s = 25.4% of parallel/1
   Parallel construction, 8 threads: 52.6s = 13.8% of parallel/1
   
   Serial queries of 100k vectors / top 100: 38.3s
   Parallel queries, 1 thread: 41.6s = 108.6% of serial
   Parallel queries, 2 threads: 21.0s = 50.5% of parallel/1
   Parallel queries, 4 threads: 10.3s = 24.7% of parallel/1
   Parallel queries, 8 threads: 5.3s = 12.7% of parallel/1
   
   To summarize, there is about a 30% overhead during construction and 10% 
overhead at query time from using the concurrent class.  The concurrent class 
parallelizes construction nearly perfectly through 4 threads, with some 
additional inefficiency becoming visible at 8.  (Probably this is the effect of 
having to do more vector comparisons across the concurrently added nodes – I 
would expect this to remain relatively small and not exploding as thread counts 
increase.)  Uncontended queries scale close to perfectly to at least 8 threads.
   
   ### Design and implementation
   ConcurrentOnHeapHnswGraph is very similar to OnHeapHnswGraph with concurrent 
backing Collections.  The main addition is a getView operation to provide a 
threadsafe snapshot of the graph for searches.  The View uses a 
CompletionTracker class to provide a kind of snapshot isolation – otherwise it 
is impossible to prevent partially added nodes from causing very difficult to 
debug race conditions.  This is used during construction as well as for 
user-invoked searches.
   
   (The initial CompletionTracker implementation was just an AtomicInteger 
clock and a Map, but the constant box/unbox 
introduced significant CPU and GC overhead.  The current implementation is a 
result of optimizing that.)
   
   ConcurrentHnswGraphBuilder adds an incremental ram used estimate as a return 
value to addGraphNode, and a buildAsync method that takes an ExecutorService 
for fine-grained control over parallelism.  Otherwise, it follows the API of 
HnswGraphBuilder closely.  (Closely enough that my original PR extracted a 
common interface and added factories so that they could be plugged in 
interchangeably, but this is now removed after Michael’s feedback.)
   
   The key to achieving good concurrency while maintaining correctness without 
synchronization is, we track in-progress node additions across all threads in a 
ConcurrentSkipListSet. After adding ourselves in addGraphNode, we take a 
snapshot of this set (this is O(1) for CSLS), and consider all other 
in-progress updates as neighbor candidates (subject to normal level 
constraints).
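   A stripped-down, self-contained sketch of that in-progress tracking idea 
(hypothetical names; the real builder does much more, and the real snapshot 
also respects level constraints):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

// Sketch: track nodes currently being inserted so concurrent additions
// can consider each other as neighbor candidates.
final class InProgressTracker {
  private final ConcurrentSkipListSet<Integer> inProgress = new ConcurrentSkipListSet<>();

  /** Register ourselves, then snapshot the other in-flight insertions. */
  List<Integer> beginInsert(int node) {
    inProgress.add(node);
    List<Integer> others = new ArrayList<>();
    for (int other : inProgress) {
      if (other != node) {
        others.add(other); // candidate neighbors being added concurrently
      }
    }
    return others;
  }

  /** Unregister once the node is fully linked into the graph. */
  void endInsert(int node) {
    inProgress.remove(node);
  }
}
```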
   
   In general, the main concern with the Concurrent Builder compared to the 
serial is to make sure that partially complete operations never “leak” to other 
threads.  In particular,
   * Neighbor manipulation has been encapsulated in ConcurrentNeighborSet.  CNS 
implements a copy-on-write NeighborArray – my initial implementation used a 
ConcurrentSkipListSet but this was significantly slower since even during 
construction, there are many more “iterate through the neighbors” operations 
than “change the neighbors.”  We have to subclass NeighborArray here to be able 
to efficiently handle the case of concurrently inserting the same node (as a 
forward-link and a back-link).
   * Entry point updating is not done until the new node has been fully added.
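   The copy-on-write behavior described for ConcurrentNeighborSet can be 
sketched as follows (illustrative, not the PR's actual class): reads iterate a 
stable array snapshot, while each write copies the backing array and swaps it 
in atomically.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: copy-on-write neighbor list. Iteration is cheap and lock-free;
// each mutation copies the backing array and publishes it atomically.
final class CowNeighbors {
  private final AtomicReference<int[]> neighbors = new AtomicReference<>(new int[0]);

  void add(int node) {
    neighbors.updateAndGet(old -> {
      for (int existing : old) {
        if (existing == node) {
          return old; // concurrent forward-link/back-link insert of the same node
        }
      }
      int[] next = java.util.Arrays.copyOf(old, old.length + 1);
      next[old.length] = node;
      return next;
    });
  }

  int[] snapshot() {
    return neighbors.get(); // stable view for "iterate the neighbors"
  }
}
```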
   
   One more point is a little subtle – 
   * When a new node is added, it considers both existing nodes in the graph as 
candidates, as well as other nodes concurrently in the process of being added. 
It does this by taking a snapshot of

[GitHub] [lucene] jbellis commented on pull request #12254: add ConcurrentOnHeapHnswGraph and Builder

2023-07-06 Thread via GitHub


jbellis commented on PR #12254:
URL: https://github.com/apache/lucene/pull/12254#issuecomment-1623876231

   Closing in favor of #12421.





[GitHub] [lucene] jbellis closed pull request #12254: add ConcurrentOnHeapHnswGraph and Builder

2023-07-06 Thread via GitHub


jbellis closed pull request #12254: add ConcurrentOnHeapHnswGraph and Builder
URL: https://github.com/apache/lucene/pull/12254





[GitHub] [lucene] msokolov commented on a diff in pull request #12413: Fix HNSW graph visitation limit bug

2023-07-06 Thread via GitHub


msokolov commented on code in PR #12413:
URL: https://github.com/apache/lucene/pull/12413#discussion_r1254834917


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -256,6 +256,56 @@ public NeighborQueue searchLevel(
 return results;
   }
 
+  /**
+   * Function to find the best entry point from which to search the zeroth 
graph layer.
+   *
+   * @param query vector query with which to search
+   * @param vectors random access vector values
+   * @param graph the HNSWGraph
+   * @param visitLimit How many vectors are allowed to be visited
+   * @return An integer array whose first element is the best entry point, and 
second is the number
+   * of candidates visited. Entry point of `-1` indicates visitation limit 
exceed
+   * @throws IOException When accessing the vector fails
+   */
+  private int[] findBestEntryPoint(
+  T query, RandomAccessVectorValues vectors, HnswGraph graph, int 
visitLimit)
+  throws IOException {
+int size = graph.size();
+int visitedCount = 1;
+prepareScratchState(vectors.size());
+int currentEp = graph.entryNode();
+float currentScore = compare(query, vectors, currentEp);
+boolean foundBetter;
+for (int level = graph.numLevels() - 1; level >= 1; level--) {
+  foundBetter = true;
+  visited.set(currentEp);
+  // Keep searching the given level until we stop finding a better 
candidate entry point
+  while (foundBetter) {

Review Comment:
   nit: with do/while you could avoid setting foundBetter = true above
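   The suggested refactor, as a generic sketch (a greedy descent over an 
illustrative neighbor function; these names are hypothetical, not the actual 
`HnswGraphSearcher` code): the do/while body always runs once, so the 
`foundBetter = true` priming assignment before the loop goes away.

```java
import java.util.function.IntUnaryOperator;

// Sketch of the do/while shape msokolov suggests: no priming assignment
// is needed because the body is guaranteed to execute at least once.
final class GreedyDescent {
  /** Follows next.applyAsInt until it stops improving (returns the same value). */
  static int climb(int start, IntUnaryOperator next) {
    int current = start;
    boolean foundBetter;
    do {
      int candidate = next.applyAsInt(current);
      foundBetter = candidate != current;
      current = candidate;
    } while (foundBetter);
    return current;
  }
}
```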






[GitHub] [lucene] benwtrent commented on a diff in pull request #12413: Fix HNSW graph visitation limit bug

2023-07-06 Thread via GitHub


benwtrent commented on code in PR #12413:
URL: https://github.com/apache/lucene/pull/12413#discussion_r1254849673


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -256,6 +256,56 @@ public NeighborQueue searchLevel(
 return results;
   }
 
+  /**
+   * Function to find the best entry point from which to search the zeroth 
graph layer.
+   *
+   * @param query vector query with which to search
+   * @param vectors random access vector values
+   * @param graph the HNSWGraph
+   * @param visitLimit How many vectors are allowed to be visited
+   * @return An integer array whose first element is the best entry point, and 
second is the number
+   * of candidates visited. Entry point of `-1` indicates visitation limit 
exceed
+   * @throws IOException When accessing the vector fails
+   */
+  private int[] findBestEntryPoint(
+  T query, RandomAccessVectorValues vectors, HnswGraph graph, int 
visitLimit)
+  throws IOException {
+int size = graph.size();
+int visitedCount = 1;
+prepareScratchState(vectors.size());
+int currentEp = graph.entryNode();
+float currentScore = compare(query, vectors, currentEp);
+boolean foundBetter;
+for (int level = graph.numLevels() - 1; level >= 1; level--) {
+  foundBetter = true;
+  visited.set(currentEp);
+  // Keep searching the given level until we stop finding a better 
candidate entry point
+  while (foundBetter) {

Review Comment:
   do/while is indeed valid. But daggum, it confuses the crap out of me, wish 
Java didn't have it. If you don't mind, I will keep the `while` loop. I don't 
particularly mind if somebody changes it later.






[GitHub] [lucene] benwtrent commented on pull request #12413: Fix HNSW graph visitation limit bug

2023-07-06 Thread via GitHub


benwtrent commented on PR #12413:
URL: https://github.com/apache/lucene/pull/12413#issuecomment-1624222651

   > I don't know what bug I found, but this LGTM
   
   Commas are important! Just meant to tell you I found my own bug. Not that 
YOU found my bug.





[GitHub] [lucene] benwtrent merged pull request #12413: Fix HNSW graph visitation limit bug

2023-07-06 Thread via GitHub


benwtrent merged PR #12413:
URL: https://github.com/apache/lucene/pull/12413





[GitHub] [lucene] ChrisHegarty commented on a diff in pull request #12417: forutil add vectorized and scalar code

2023-07-06 Thread via GitHub


ChrisHegarty commented on code in PR #12417:
URL: https://github.com/apache/lucene/pull/12417#discussion_r1254852005


##
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaForUtil90.java:
##
@@ -0,0 +1,8994 @@
+// This file has been automatically generated, DO NOT EDIT
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorSpecies;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+
+final class PanamaForUtil90 implements ForUtil90 {
+  private final int[] tmp = new int[BLOCK_SIZE];
+
+  private final int[] decoded = new int[BLOCK_SIZE];
+
+  private static final int totalBits = 32;
+  private static final VectorSpecies<Integer> SPECIES_128 = IntVector.SPECIES_128;
+
+  /** Encode 128 integers from {@code input} into {@code out}. */
+  @Override
+  public void encode(int[] input, int bitsPerValue, DataOutput out) throws IOException {
+    if (bitsPerValue == 32) {

Review Comment:
   If I'm not mistaken, we can never get here with bitsPerValue == 32. If so, this branch can be removed.
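A quick size check illustrates why the 32-bit case is degenerate anyway (a hypothetical sketch, not code from this PR; the class and method names are invented): FOR-style packing of a 128-int block at `b` bits per value emits `128 * b / 32 = 4 * b` ints, so at `b == 32` the "packed" block is just the original 128 ints and packing buys nothing over the raw `writeInt` loop.

```java
// Hypothetical helper (not from the PR): output size of FOR-packing a
// 128-value block at a given bit width, measured in 32-bit words.
final class PackedSizeSketch {
  static int packedWords(int bitsPerValue) {
    // 128 values * b bits each, divided into 32-bit words.
    return 128 * bitsPerValue / 32;
  }

  public static void main(String[] args) {
    // At 32 bits per value the block is not compressed at all.
    System.out.println(packedWords(32)); // prints 128
    System.out.println(packedWords(5));  // prints 20
  }
}
```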



##
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaForUtil90.java:
##
@@ -0,0 +1,8994 @@
+// This file has been automatically generated, DO NOT EDIT
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorSpecies;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+
+final class PanamaForUtil90 implements ForUtil90 {
+  private final int[] tmp = new int[BLOCK_SIZE];
+
+  private final int[] decoded = new int[BLOCK_SIZE];
+
+  private static final int totalBits = 32;
+  private static final VectorSpecies<Integer> SPECIES_128 = IntVector.SPECIES_128;
+
+  /** Encode 128 integers from {@code input} into {@code out}. */
+  @Override
+  public void encode(int[] input, int bitsPerValue, DataOutput out) throws IOException {
+    if (bitsPerValue == 32) {
+      for (int i = 0; i < 128; i++) {
+        out.writeInt(input[i]);
+      }
+      return;
+    }
+    switch (bitsPerValue) {
+      case 1:
+        fastPack1(input, tmp);
+        break;
+      case 2:
+        fastPack2(input, tmp);
+        break;
+      case 3:
+        fastPack3(input, tmp);
+        break;
+      case 4:
+        fastPack4(input, tmp);
+        break;
+      case 5:
+        fastPack5(input, tmp);
+        break;
+      case 6:
+        fastPack6(input, tmp);
+        break;
+      case 7:
+        fastPack7(input, tmp);
+        break;
+      case 8:
+        fastPack8(input, tmp);
+        break;
+      case 9:
+        fastPack9(input, tmp);
+        break;
+      case 10:
+        fastPack10(input, tmp);
+        break;
+      case 11:
+        fastPack11(input, tmp);
+        break;
+      case 12:
+        fastPack12(input, tmp);
+        break;
+      case 13:
+        fastPack13(input, tmp);
+        break;
+      case 14:
+        fastPack14(input, tmp);
+        break;
+      case 15:
+        fastPack15(input, tmp);
+        break;
+      case 16:
+        fastPack16(input, tmp);
+ 

[GitHub] [lucene] tang-hi commented on a diff in pull request #12417: forutil add vectorized and scalar code

2023-07-06 Thread via GitHub


tang-hi commented on code in PR #12417:
URL: https://github.com/apache/lucene/pull/12417#discussion_r1255241555


##
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaForUtil90.java:
##
@@ -0,0 +1,8994 @@
+// This file has been automatically generated, DO NOT EDIT
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorSpecies;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+
+final class PanamaForUtil90 implements ForUtil90 {
+  private final int[] tmp = new int[BLOCK_SIZE];
+
+  private final int[] decoded = new int[BLOCK_SIZE];
+
+  private static final int totalBits = 32;
+  private static final VectorSpecies<Integer> SPECIES_128 = IntVector.SPECIES_128;
+
+  /** Encode 128 integers from {@code input} into {@code out}. */
+  @Override
+  public void encode(int[] input, int bitsPerValue, DataOutput out) throws IOException {
+    if (bitsPerValue == 32) {

Review Comment:
   Yes, you are correct. I mistakenly thought that we needed to handle compression for every bitsPerValue from 1 to 32.
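For readers following the exchange above, here is a rough scalar sketch of what the generated `fastPackN` routines compute (an illustration only, not the code in this PR; the class and method names are invented): a block of 128 values at `b` bits per value is packed contiguously into `4 * b` ints, and unpacking reverses the process. Input values are assumed to be non-negative and to fit in `b` bits.

```java
// Hypothetical scalar bit-packer (not the generated Lucene code).
final class ScalarPackSketch {
  // Pack 128 ints at bitsPerValue bits each into 4 * bitsPerValue ints.
  static int[] pack(int[] input, int bitsPerValue) {
    int[] out = new int[128 * bitsPerValue / 32];
    long bitPos = 0;
    for (int v : input) {
      int word = (int) (bitPos >>> 5);  // which 32-bit word
      int shift = (int) (bitPos & 31);  // bit offset within that word
      out[word] |= v << shift;
      if (shift + bitsPerValue > 32) {
        // Value straddles a word boundary: spill the high bits.
        out[word + 1] |= v >>> (32 - shift);
      }
      bitPos += bitsPerValue;
    }
    return out;
  }

  // Inverse of pack: recover the 128 original values.
  static int[] unpack(int[] packed, int bitsPerValue) {
    int[] out = new int[128];
    long mask = (1L << bitsPerValue) - 1;
    long bitPos = 0;
    for (int i = 0; i < 128; i++) {
      int word = (int) (bitPos >>> 5);
      int shift = (int) (bitPos & 31);
      long bits = (packed[word] & 0xFFFFFFFFL) >>> shift;
      if (shift + bitsPerValue > 32) {
        bits |= (packed[word + 1] & 0xFFFFFFFFL) << (32 - shift);
      }
      out[i] = (int) (bits & mask);
      bitPos += bitsPerValue;
    }
    return out;
  }

  public static void main(String[] args) {
    int[] input = new int[128];
    for (int i = 0; i < 128; i++) {
      input[i] = i & 0x1F; // values fit in 5 bits
    }
    int[] packed = pack(input, 5);
    System.out.println(packed.length); // prints 20
    System.out.println(java.util.Arrays.equals(unpack(packed, 5), input)); // prints true
  }
}
```

The generated `fastPackN` methods unroll and vectorize this loop per bit width; the sketch above only shows the arithmetic they implement.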


