Re: [PR] TaskExecutor should not fork unnecessarily [lucene]

2024-07-08 Thread via GitHub


jpountz commented on PR #13472:
URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213200920

   @original-brownbear  Would you like to work on a PR?





Re: [PR] TaskExecutor should not fork unnecessarily [lucene]

2024-07-08 Thread via GitHub


jpountz commented on PR #13472:
URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213213760

   I just pushed an annotation for this change: 
https://github.com/mikemccand/luceneutil/commit/a64ac17a9d1a935649837990f2accbace0b93262.
   
   Several queries got a bit faster with a low p-value: 
https://people.apache.org/~mikemccand/lucenebench/2024.07.05.14.34.52.html.





Re: [PR] Binary search all terms. [lucene]

2024-07-08 Thread via GitHub


vsop-479 commented on PR #13192:
URL: https://github.com/apache/lucene/pull/13192#issuecomment-2213224505

   > This many new allocations
   
   Maybe we can share these allocations (`suffixes`, `positions`) from `searchers`, since they are just immutable and non-stateful data.





Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub


javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1668225258


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
+
     // Make a copy so we can sort:
     List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);
 
     // Sort by maxDoc, descending:
-    Collections.sort(
-        sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
+    sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
 
-    final List<List<LeafReaderContext>> groupedLeaves = new ArrayList<>();
-    long docSum = 0;
-    List<LeafReaderContext> group = null;
+    final List<List<LeafReaderContextPartition>> groupedLeafPartitions = new ArrayList<>();
+    int currentSliceNumDocs = 0;
+    List<LeafReaderContextPartition> group = null;
     for (LeafReaderContext ctx : sortedLeaves) {
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of

Review Comment:
   Thanks for your feedback @mikemccand! You are right to be confused; this is a POC and the details of the abstractions need refining.
   
   Here is how I got to this place: we have slices that target multiple leaf reader contexts, and I wanted a slice to still be able to target multiple segments while at the same time targeting a partition of a segment. The easiest way to do that was to wrap a leaf reader context in a new abstraction that holds the leaf reader context and also defines a range of doc ids. A partition can still point to the entire segment, though.
   
   Technically, a slice is a set of segments, yet also a partition of the index, while a leaf reader context partition is a partition of a segment; hence the confusion :) One natural next step could perhaps be to fold the range of doc ids into the leaf reader context itself, but that is a substantial API change. Otherwise, we keep the structure I came up with but rename things to clarify the concepts and reduce confusion. Do you have other options in mind?
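   For readers following along, here is a minimal sketch of the wrapper described above. This is a hypothetical shape, not the PR's actual API; only `LeafReaderContext` is an existing Lucene type:
   
   ```java
   import org.apache.lucene.index.LeafReaderContext;
   
   /** A partition of a segment: the wrapped context plus a range of doc ids. */
   record LeafReaderContextPartition(LeafReaderContext ctx, int minDocId, int maxDocId) {
   
     /** A partition covering the whole segment, matching today's behaviour. */
     static LeafReaderContextPartition wholeSegment(LeafReaderContext ctx) {
       return new LeafReaderContextPartition(ctx, 0, ctx.reader().maxDoc());
     }
   }
   ```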
   
   






Re: [PR] Replace AtomicLong with LongAdder in HitsThresholdChecker [lucene]

2024-07-08 Thread via GitHub


jpountz merged PR #13546:
URL: https://github.com/apache/lucene/pull/13546





Re: [PR] Optimize MaxScoreBulkScorer [lucene]

2024-07-08 Thread via GitHub


jpountz merged PR #13544:
URL: https://github.com/apache/lucene/pull/13544





Re: [PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]

2024-07-08 Thread via GitHub


jpountz commented on code in PR #13543:
URL: https://github.com/apache/lucene/pull/13543#discussion_r1668261240


##
lucene/core/src/java/org/apache/lucene/store/OutputStreamIndexOutput.java:
##
@@ -135,5 +135,19 @@ void writeLong(long i) throws IOException {
       BitUtil.VH_LE_LONG.set(buf, count, i);
       count += Long.BYTES;
     }
+
+    @Override
+    public void write(int b) throws IOException {
+      // override single byte write to avoid synchronization overhead now that JEP374 removed biased
+      // locking
+      byte[] buffer = buf;
+      int count = this.count;
+      if (count >= buffer.length) {
+        super.write(b);
+      } else {
+        buffer[count] = (byte) b;
+        this.count = count + 1;
+      }
+    }

Review Comment:
   Ok.






Re: [PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]

2024-07-08 Thread via GitHub


jpountz merged PR #13543:
URL: https://github.com/apache/lucene/pull/13543





Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub


javanna commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2213602344

   Bulk reply to some of the feedback I got:
   
   hi @shubhamvishu , 
   
   > I know it might be too early to ask (as the changes are not yet consolidated), but curious if we have any early benchmarking numbers when intra-segment concurrency is enabled?
   
   I have run no benchmarks so far. I focused more on making tests happy, which felt like a good initial step. We do know that there will be a substantial gain when we search force-merged indices.
   
   
   @mikemccand 
   
   > I think queries that do substantial work in Weight/ScorerSupplier.get are 
also tricky? E.g. points queries (PointRangeQuery) walks the BKD tree to build 
a bitset. Ideally, if a segment has multiple LeafReaderContextPartitions, we'd 
only do the BKD intersect once, and then the N threads working on those 
partitions can share that bitset?
   
   Yes, definitely; I mentioned this above in the description. Removing duplicated effort would be great, but I focused on making things work to start with. What you mention is a natural next step. It may end up influencing the design, though, depending on where we decide to de-duplicate work: does the notion of a segment partition need to be visible to consumers, or can it be an implementation detail?
   
   @jpountz 
   
   > I wonder if it should be on IndexSearcher to not call 
Collector#getLeafCollector multiple times if the first call throws a 
CollectionTerminatedException (what you implemented) or if it should be on the 
CollectorManager to no longer assume that Collector#getLeafCollector gets 
called exactly once per leaf. The latter feels cleaner to me conceptually, and 
I believe that it would also avoid the issue that you had with LRUQueryCache? 
The downside is that we'll need to fix all collectors that make this assumption.
   
   Agreed, that adjustment is a hack. The change in expectation should be reflected in the `Collector` API semantics though (rather than in `CollectorManager`?); is that what you meant? For instance, taking the total hit count example, `TotalHitCountCollector` would need to be adapted to handle the case where `getLeafCollector` gets called with the same leaf multiple times? It would need to be made synchronized, and it would need to track which leaves were early terminated?
   





Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]

2024-07-08 Thread via GitHub


ChrisHegarty commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2213644712

   Thanks for the comments so far. I updated the PR to only check same-thread semantics for MSII clone and slice, and also added some basic thread checks to MockIndexInputWrapper.  I think this is a reasonable improvement and a good place to stop in this PR. As suggested, further improvements can be done as a follow-up.





Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]

2024-07-08 Thread via GitHub


uschindler commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2213707284

   There are some test failures due to strict thread checking. I think the mock input should only do this when it's in confined mode.





Re: [PR] TaskExecutor should not fork unnecessarily [lucene]

2024-07-08 Thread via GitHub


original-brownbear commented on PR #13472:
URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213989662

   Sure thing, on it! :) Sorry, I could've done that right away; tired me just didn't realise it this morning.





[I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub


mikemccand opened a new issue, #13551:
URL: https://github.com/apache/lucene/issues/13551

   ### Description
   
   This is really a "discussion" issue.  I'm not sure at all that the idea is 
feasible:
   
   I've been testing `luceneutil` with heavy KNN indexing (Cohere wikipedia 
`en` 768 dimension vectors) and one dismal part of the experience is the swap 
storm caused by HNSW's random access to the raw vectors stored in the `.vec` 
files for each segment on "cold start" searching.
   
   Even when the box has plenty of RAM to hold all `.vec` files, the swap storm 
takes minutes on the nightly benchy box, even with a fast NVMe SSD holding the 
index.  Even if the index is freshly built, the OS doesn't seem to cache the 
`.vec` files since they appear to be "write once", until the searching starts 
up, and then the swap storm begins.  This was with `FLOAT32` vectors ... I 
suspect the problem is less severe with `int4` or `int8` compressed vectors 
(haven't tested).
   
   At Amazon (customer facing product search) we also see this swap storm when 
cold starting the searching process even after [NRT segment 
replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html)
 has just copied the files locally: they don't stay "hot" in the OS in that 
case either (looks like "write once" to the OS).
   
   Lucene already has an awesome feature to "fix" this: `MMapDirectory.setPreload`.  It will pre-touch all pages associated with that file on open, so the OS caches them "once" on startup, much more efficiently than HNSW's random access.  But this only makes sense for applications/users that know they have enough RAM (we will test this at Amazon to see if it helps our cold start problems).  For my `luceneutil` tests, simply `cat /path/to/index/*.vec >> /dev/null` (basically the same as `.setPreload` I think) fixed/sidestepped the swap storm.
   
   (The Linux kernel's less aggressive default readahead for memory-mapped IO 
vs traditional NIO is likely not helping either?  Is this really a thing?  I 
have not tested `NIOFSDirectory` to see if the swap storm is lessened).
   
   Longer term we have discussed using AKNN algorithms that are more disk-friendly (e.g. [DiskANN](https://github.com/apache/lucene/issues/12615)), but shorter term, I'm wondering if we could somehow help users with a default `Directory` (from `FSDirectory.open`) that somehow / sometimes preloads `.vec` files?  It's not easy to do -- you wouldn't know up front that the application will do KNN searching at all.  And maybe only certain vectors in the `.vec` file will ever be accessed, so you need to pay that long random-access price to find and cache just those ones.





Re: [PR] Introduce a SkipperCodec for testing docvalues skipper index [lucene]

2024-07-08 Thread via GitHub


jpountz commented on code in PR #13550:
URL: https://github.com/apache/lucene/pull/13550#discussion_r1668890679


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseDocValuesFormatTestCase.java:
##
@@ -157,6 +158,7 @@ public void testNumberMergeAwayAllValuesWithSkipper() throws IOException {
     Analyzer analyzer = new MockAnalyzer(random());
     IndexWriterConfig iwconfig = newIndexWriterConfig(analyzer);
     iwconfig.setMergePolicy(newLogMergePolicy());
+    iwconfig.setCodec(SkipperCodec.randomInstance(random()));

Review Comment:
   This doesn't look right, as this overrides the codec that is configured on 
the test case, e.g. `SimpleTextDocValuesFormat` would no longer be exercised? 
Should we create a new `TestLucene90DocValuesFormatVariableSkipInterval` class 
or something along these lines instead?



##
lucene/test-framework/src/java/org/apache/lucene/tests/codecs/skipper/SkipperCodec.java:
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.tests.codecs.skipper;
+
+import java.util.Random;
+import org.apache.lucene.codecs.DocValuesFormat;
+import org.apache.lucene.codecs.FilterCodec;
+import org.apache.lucene.codecs.lucene90.Lucene90DocValuesFormat;
+import org.apache.lucene.tests.util.TestUtil;
+
+/**
+ * A codec that uses {@link Lucene90DocValuesFormat} with a random skip index size for its doc
+ * values and delegates to the default codec for everything else.
+ */
+public final class SkipperCodec extends FilterCodec {

Review Comment:
   Maybe rename to `Skipper90Codec` to reflect that it maps to 
`Lucene90DocValuesFormat`? I assume we'd create a new one when we release a new 
DV format?
   
   Also, does it need to be a new Codec, or could it extend `Lucene911Codec` 
and override `getDocValuesFormatForField`?
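   A rough sketch of that second option, for concreteness. The `Lucene911Codec` base class is taken from the comment above, and the `Lucene90DocValuesFormat` skip-interval constructor is an assumption based on this PR, not verified API:
   
   ```java
   import java.util.Random;
   import org.apache.lucene.codecs.DocValuesFormat;
   import org.apache.lucene.codecs.lucene90.Lucene90DocValuesFormat;
   import org.apache.lucene.tests.util.TestUtil;
   // import of Lucene911Codec omitted: its package depends on the release this lands in
   
   public final class Skipper90Codec extends Lucene911Codec {
     private final DocValuesFormat docValuesFormat;
   
     public Skipper90Codec(Random random) {
       // assumption: a Lucene90DocValuesFormat constructor taking a skip index interval size
       this.docValuesFormat = new Lucene90DocValuesFormat(TestUtil.nextInt(random, 8, 4096));
     }
   
     @Override
     public DocValuesFormat getDocValuesFormatForField(String field) {
       return docValuesFormat;
     }
   }
   ```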






Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub


rmuir commented on issue #13551:
URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214496576

   > It's not easy to do -- you wouldn't know up front that the application 
will do KNN searching at all. And, maybe only certain vectors in the `.vec` 
will ever be accessed and so you need to pay that long random access price to 
find and cache just those ones.
   
   I'm trying to understand this sentence:
   * it should be a one-liner using `setPreload` to preload "*.vec" if we wanted to do it, either from FSDirectory.open or by default in MMapDirectory
   * if the application isn't doing KNN searching then they won't have .vec? I struggle to imagine someone indexing a ton of "extra" vectors that isn't "using" them and hasn't noticed the big performance impact





Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub


jpountz commented on issue #13551:
URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214541782

   It wouldn't solve the issue, only mitigate it, but hopefully cold start 
performance gets better when we start leveraging `IndexInput#prefetch` to load 
multiple vectors from disk concurrently 
(https://github.com/apache/lucene/issues/13179).
   
   > The Linux kernel's less aggressive default readahead for memory-mapped IO 
vs traditional NIO
   
   FWIW, it's not that the read-ahead is less aggressive for mmap than traditional I/O; it's that, since recently, `MMapDirectory` explicitly tells the OS to not perform readahead for vectors by passing `MADV_RANDOM` to `madvise`.
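   To make the prefetch idea concrete, a hedged sketch; it assumes `IndexInput#prefetch(long offset, long length)` from the work linked above and a flat, fixed-size vector layout, and all other names are illustrative:
   
   ```java
   // Queue up page-ins for all candidate vectors first, so the reads below
   // overlap with each other instead of paying one page fault at a time.
   void readVectors(IndexInput vectorData, int[] ords, int vectorByteSize, float[] scratch, int dims)
       throws IOException {
     for (int ord : ords) {
       vectorData.prefetch((long) ord * vectorByteSize, vectorByteSize);
     }
     for (int ord : ords) {
       vectorData.seek((long) ord * vectorByteSize);
       vectorData.readFloats(scratch, 0, dims);
       // ... score the vector in scratch ...
     }
   }
   ```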





Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]

2024-07-08 Thread via GitHub


benwtrent commented on issue #13519:
URL: https://github.com/apache/lucene/issues/13519#issuecomment-2214591883

   @jbhateja could you unpack how this would actually work when using 
dot-product with linear scale corrections?
   
   I would imagine we could switch to an "unsigned byte" comparison and thus 
handle the bit flipping directly in SIMD, but this would introduce a new vector 
comparison that needs to be added. 
   
   I am not sure how we can just "flip the bit" while maintaining distance comparisons for euclidean & dotProduct without introducing error or slowing things down significantly.
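   (For context, the "bit flip" being discussed is the usual signed-to-unsigned remapping; a one-line sketch:)
   
   ```java
   // XOR-ing the sign bit maps the signed byte range -128..127 onto 0..255
   // monotonically, so comparisons on the flipped values agree with signed order.
   static int flipSignBit(byte b) {
     return (b ^ 0x80) & 0xFF;
   }
   ```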





Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub


ChrisHegarty merged PR #13545:
URL: https://github.com/apache/lucene/pull/13545





Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub


jpountz commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2214643611

   > The change in expectation should be reflected in the Collector API 
semantics though (rather that CollectorManager?), is that what you meant? 
   
   I was referring to `CollectorManager` because we could have two partitions 
of the same leaf that are collected by two different `Collector`s of the same 
`CollectorManager`. So both collectors would see this leaf only once, but we 
still need to avoid double-counting this leaf.
   
   > It would need to be made synchronized and it would need to track which 
leaves were early terminated?
   
   Something like that indeed. And only try to apply the `Weight#count` optimization the first time that `getLeafCollector` is called on a leaf, to avoid hitting the problem you had with `LRUQueryCache`.
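   A hedged sketch of that manager-level bookkeeping; only the Lucene types are real, the class and method names are illustrative:
   
   ```java
   import java.util.Collection;
   import java.util.Set;
   import java.util.concurrent.ConcurrentHashMap;
   import org.apache.lucene.index.LeafReaderContext;
   import org.apache.lucene.search.CollectorManager;
   import org.apache.lucene.search.TotalHitCountCollector;
   
   class PartitionAwareTotalHitCountCollectorManager
       implements CollectorManager<TotalHitCountCollector, Integer> {
   
     // leaves that some collector of this manager has already seen;
     // LeafReaderContext has identity equality, which is what we want here
     private final Set<LeafReaderContext> seenLeaves = ConcurrentHashMap.newKeySet();
   
     /** Only the first partition of a leaf may apply the Weight#count shortcut. */
     boolean firstPartitionOf(LeafReaderContext ctx) {
       return seenLeaves.add(ctx);
     }
   
     @Override
     public TotalHitCountCollector newCollector() {
       return new TotalHitCountCollector();
     }
   
     @Override
     public Integer reduce(Collection<TotalHitCountCollector> collectors) {
       int total = 0;
       for (TotalHitCountCollector c : collectors) {
         total += c.getTotalHits();
       }
       return total;
     }
   }
   ```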





Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub


msokolov commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2214811101

   I wonder if we should tackle the issue with caching / cloning scorers? We have scorers/scorerSuppliers that do a lot of up-front work when created, and we don't want to duplicate that work when two threads search the same segment. OTOH two threads cannot share the same Scorer instance since it has internal state, namely the iterator it advances, and maybe more. This seems like the biggest problem to tackle; maybe it would make sense to start there? OTOH we need some kind of execution framework (this PR) that we can use to exercise such changes. I guess it's a kind of leapfrogging problem: gotta start somewhere. Given that, maybe we create a long-running branch we can use to iterate on this in multiple steps before merging to main?
   
   As for the Scorer-cloning, I poked at it and it seems tricky. I think that Weight can cache Scorer and/or ScorerSupplier, and I guess BulkScorer. Then, when asked for a scorer-type object that it already has, it can create a shallowCopy(). Shallow-copying should be implemented so as to do the least amount of work to create a new instance that can be iterated independently. This will be up to each Scorer/ScorerSupplier/BulkScorer to figure out; maybe some of them just return null and let Weight create a new one as it does today.
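   A minimal sketch of that shape, assuming a hypothetical `shallowCopy` hook (not an existing Lucene API):
   
   ```java
   import org.apache.lucene.search.ScorerSupplier;
   
   abstract class ReusableScorerSupplier extends ScorerSupplier {
   
     /**
      * Returns a supplier that shares this one's expensive immutable setup (e.g. a
      * bitset built by a BKD intersect) but iterates independently, or null if
      * shallow-copying is unsupported, in which case the Weight creates a fresh
      * supplier as it does today.
      */
     abstract ReusableScorerSupplier shallowCopy();
   }
   ```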





Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub


msokolov commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1668890994


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
+
     // Make a copy so we can sort:
     List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);
 
     // Sort by maxDoc, descending:
-    Collections.sort(
-        sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
+    sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
 
-    final List<List<LeafReaderContext>> groupedLeaves = new ArrayList<>();
-    long docSum = 0;
-    List<LeafReaderContext> group = null;
+    final List<List<LeafReaderContextPartition>> groupedLeafPartitions = new ArrayList<>();
+    int currentSliceNumDocs = 0;
+    List<LeafReaderContextPartition> group = null;
     for (LeafReaderContext ctx : sortedLeaves) {
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of

Review Comment:
   I had worked up a version of this where I modified LeafReaderContext/IndexReaderContext to create a new kind of context that models a range within a segment. I had added interval start/end to LRC, but I suspect a cleaner way would be to make a new thing (IntervalReaderContext or so) and then change APIs to expect IndexReaderContext instead of CompositeReaderContext? If we do it this way it might make it easier to handle some cases, like the single-threaded execution you mentioned. But this is more about cleaning up the APIs than making it work, and we can argue endlessly about what is neater, so I think your approach of delaying such questions makes sense.



##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
+
     // Make a copy so we can sort:
     List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);
 
     // Sort by maxDoc, descending:
-    Collections.sort(
-        sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
+    sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
 
-    final List<List<LeafReaderContext>> groupedLeaves = new ArrayList<>();
-    long docSum = 0;
-    List<LeafReaderContext> group = null;
+    final List<List<LeafReaderContextPartition>> groupedLeafPartitions = new ArrayList<>();
+    int currentSliceNumDocs = 0;
+    List<LeafReaderContextPartition> group = null;
     for (LeafReaderContext ctx : sortedLeaves) {
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of
+        // equal size
+        int numSlices = Math.ceilDiv(ctx.reader().maxDoc(), maxDocsPerSlice);

Review Comment:
   My mental model of the whole slice/partition/segment/interval concept is: existing physical segments (leaves) divide the index into arbitrary sizes; existing slices (what we have today, not what is called slices in this PR) group segments together; partitions or intervals (in my view) are a logical division of the index into roughly equal-sized contiguous (in docid space) portions, and they overlay the segments arbitrarily. Then it is the job of IndexSearcher to map this logical division of work onto the underlying physical segments. The main comment here is: let's not confuse ourselves by re-using the word "slice", which already means something else!






Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]

2024-07-08 Thread via GitHub


benwtrent commented on code in PR #13548:
URL: https://github.com/apache/lucene/pull/13548#discussion_r1669043121


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -139,6 +142,54 @@ public int nextDoc() throws IOException {
     }
   }
 
+  /**
+   * Given old doc ids and an id mapping, maps old ordinal to new ordinal. Note: this method return
+   * nothing and output are written to parameters
+   *

Review Comment:
   ```suggestion
   * @param oldDocIds the old or current document ordinals. Must not be null.
   * @param sortMap, the document sorting map for how to make the new ordinals. Must not be null.
   ```



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -139,6 +142,54 @@ public int nextDoc() throws IOException {
     }
   }
 
+  /**
+   * Given old doc ids and an id mapping, maps old ordinal to new ordinal. Note: this method return
+   * nothing and output are written to parameters
+   *
+   * @param old2NewOrd int[] maps from old ord to new ord
+   * @param new2OldOrd int[] maps from new ord to old ord
+   * @param newDocsWithField set of new doc ids which has the value
+   */
+  public static void mapOldOrdToNewOrd(
+      DocsWithFieldSet oldDocIds,
+      Sorter.DocMap sortMap,
+      int[] old2NewOrd,
+      int[] new2OldOrd,
+      DocsWithFieldSet newDocsWithField)
+      throws IOException {
+    // TODO: a similar function exists in IncrementalHnswGraphMerger#getNewOrdMapping
+    //   maybe we can do a further refactoring
+    assert (old2NewOrd != null || new2OldOrd != null || newDocsWithField != null);

Review Comment:
   ```suggestion
       Objects.requireNonNull(oldDocIds);
       Objects.requireNonNull(sortMap);
       assert (old2NewOrd != null || new2OldOrd != null || newDocsWithField != null);
   ```



##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsFormat.java:
##
@@ -56,8 +56,8 @@
  *   [vlong] length of this field's vectors, in bytes
  *   [vint] dimension of this field's vectors
  *   [int] the number of documents having values for this field
- *   [int8] if equals to -1, dense – all documents have values for a field. If equals to
- *   0, sparse – some documents missing values.
+ *   [int8] if equals to -2, empty - no vectory values. If equals to -1, dense – all

Review Comment:
   ```suggestion
 *   [int8] if equals to -2, empty - no vector values. If equals to -1, dense – all
   ```



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -139,6 +142,54 @@ public int nextDoc() throws IOException {
     }
   }
 
+  /**
+   * Given old doc ids and an id mapping, maps old ordinal to new ordinal. Note: this method return
+   * nothing and output are written to parameters
+   *
+   * @param old2NewOrd int[] maps from old ord to new ord
+   * @param new2OldOrd int[] maps from new ord to old ord
+   * @param newDocsWithField set of new doc ids which has the value
+   */
+  public static void mapOldOrdToNewOrd(
+      DocsWithFieldSet oldDocIds,
+      Sorter.DocMap sortMap,
+      int[] old2NewOrd,
+      int[] new2OldOrd,
+      DocsWithFieldSet newDocsWithField)
+      throws IOException {
+    // TODO: a similar function exists in IncrementalHnswGraphMerger#getNewOrdMapping
+    //   maybe we can do a further refactoring
+    assert (old2NewOrd != null || new2OldOrd != null || newDocsWithField != null);
+    IntIntHashMap newIdToOldOrd = new IntIntHashMap();

Review Comment:
   Since this method isn't in charge of creating `old2NewOrd` or `new2OldOrd`, can we assert those array sizes when they are provided?



##
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
##
@@ -356,7 +356,7 @@ BufferedUpdate next() throws IOException {
       }
     }
 
-    BytesRef nextTerm() throws IOException {
+    private BytesRef nextTerm() throws IOException {

Review Comment:
   Seems like an extra thing out of nowhere? It's probably ok in this commit, just struck me as a strange inclusion :)






Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub


mikemccand commented on issue #13551:
URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214828022

   Oh sorry, I used the wrong term (thank you @rmuir for clarifying!): it's not a swap storm I'm seeing, it's a page storm.  The OS has plenty of free RAM (reported by `free`), and that goes down and `buff/cache` goes up as the OS pulls in and caches pages for the `.vec` file.  I don't think I'm running too many RAM-hogging crapplications ;)
   
   > * it should be a one-liner using `setPreload` to preload "*.vec" if we 
wanted to do it either from FSDirectory.open or by default in MMapDirectory
   
   +1 -- it would be a simple change.  But I worry if it would do more harm 
than good in some cases, e.g. if there are truly cold HNSW cases where the 
application plans to suffer through paging to identify the subset of hot 
vectors?  I don't know if that is even a thing -- a bunch of dark matter 
vectors that never get visited?  I guess we do know that [HNSW can and does 
sometimes produce disconnected 
graphs](https://github.com/apache/lucene/issues/12627), but I think those dark 
matter islands are "typically" smallish.
   
   > * if the application isn't doing KNN searching then they won't have .vec? 
I struggle to imagine someone indexing a ton of "extra" vectors that isnt 
"using" them and hasn't noticed big performance impact
   
   +1!  Especially because indexing them is quite costly!
   
   > It wouldn't solve the issue, only mitigate it, but hopefully cold start 
performance gets better when we start leveraging `IndexInput#prefetch` to load 
multiple vectors from disk concurrently (#13179).
   
   +1 -- that would be a big help especially when paging in from fast SSDs 
since these devices have high IO concurrency.
   
   > > The Linux kernel's less aggressive default readahead for memory-mapped 
IO vs traditional NIO
   > 
   > FWIW, it's not that the read-ahead is less aggressive for mmap than traditional I/O; it's that, since recently, `MMapDirectory` explicitly tells the OS to not perform readahead for vectors by passing `MADV_RANDOM` to `madvise`.
   
   Ahhh, that makes sense!  And I think it is still correct to `madvise` in 
this way for our `.vec` files: it ~~really~~ likely (is there any locality to 
HNSW's access patterns?) is a random access pattern from HNSW.  It does make me 
wonder if the scalar compression will actually help or not.  I guess it might 
still help if that compression means multiple vectors fit into a single IO page.





Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]

2024-07-08 Thread via GitHub


zhaih commented on code in PR #13548:
URL: https://github.com/apache/lucene/pull/13548#discussion_r1669054287


##
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
##
@@ -356,7 +356,7 @@ BufferedUpdate next() throws IOException {
   }
 }
 
-BytesRef nextTerm() throws IOException {
+private BytesRef nextTerm() throws IOException {

Review Comment:
   Yeah, I just edited it when I was going over the code; thought it should be fine leaving it here :)






Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]

2024-07-08 Thread via GitHub


zhaih commented on code in PR #13548:
URL: https://github.com/apache/lucene/pull/13548#discussion_r1669056420


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -139,6 +142,54 @@ public int nextDoc() throws IOException {
     }
   }
 
+  /**
+   * Given old doc ids and an id mapping, maps old ordinal to new ordinal. Note: this method return
+   * nothing and output are written to parameters
+   *
+   * @param old2NewOrd int[] maps from old ord to new ord
+   * @param new2OldOrd int[] maps from new ord to old ord
+   * @param newDocsWithField set of new doc ids which has the value
+   */
+  public static void mapOldOrdToNewOrd(
+      DocsWithFieldSet oldDocIds,
+      Sorter.DocMap sortMap,
+      int[] old2NewOrd,
+      int[] new2OldOrd,
+      DocsWithFieldSet newDocsWithField)
+      throws IOException {
+    // TODO: a similar function exists in IncrementalHnswGraphMerger#getNewOrdMapping
+    //   maybe we can do a further refactoring
+    assert (old2NewOrd != null || new2OldOrd != null || newDocsWithField != null);
+    IntIntHashMap newIdToOldOrd = new IntIntHashMap();

Review Comment:
   good idea






[I] Test TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce Failed [lucene]

2024-07-08 Thread via GitHub


aoli-al opened a new issue, #13552:
URL: https://github.com/apache/lucene/issues/13552

   ### Description
   
   I recently saw a flaky test, `TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce`, caused by concurrency issues:
   
   ```
   MockDirectoryWrapper: cannot close: there are still 30 open files: 
{_1.tvd=1, _0_Lucene99_0.pos=1, _1.nvd=1, _0.nvd=1, _2.fdt=1, _2.tvx=1, 
_0_Lucene99_0.tim=1, _0_Lucene99_0.tip=1, _2.fdx=1, _2_Lucene99_0.doc=1, 
_2.nvd=1, _2.tvd=1, _0_Lucene90_0.dvd=1, _1.fdx=1, _1_Lucene99_0.doc=1, 
_1.fdt=1, _1.tvx=1, _2_Lucene99_0.pos=1, _1_Lucene90_0.dvd=1, 
_0_Lucene99_0.doc=1, _0.fdx=1, _2_Lucene99_0.tip=1, _0.fdt=1, _0.tvx=1, 
_2_Lucene99_0.tim=1, _0.tvd=1, _1_Lucene99_0.pos=1, _2_Lucene90_0.dvd=1, 
_1_Lucene99_0.tip=1, _1_Lucene99_0.tim=1}
   java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are 
still 30 open files: {_1.tvd=1, _0_Lucene99_0.pos=1, _1.nvd=1, _0.nvd=1, 
_2.fdt=1, _2.tvx=1, _0_Lucene99_0.tim=1, _0_Lucene99_0.tip=1, _2.fdx=1, 
_2_Lucene99_0.doc=1, _2.nvd=1, _2.tvd=1, _0_Lucene90_0.dvd=1, _1.fdx=1, 
_1_Lucene99_0.doc=1, _1.fdt=1, _1.tvx=1, _2_Lucene99_0.pos=1, 
_1_Lucene90_0.dvd=1, _0_Lucene99_0.doc=1, _0.fdx=1, _2_Lucene99_0.tip=1, 
_0.fdt=1, _0.tvx=1, _2_Lucene99_0.tim=1, _0.tvd=1, _1_Lucene99_0.pos=1, 
_2_Lucene90_0.dvd=1, _1_Lucene99_0.tip=1, _1_Lucene99_0.tim=1}
    at __randomizedtesting.SeedInfo.seed([4D88B1144F0B21A3:1E330D289AB0246]:0)
    at org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:876)
    at org.apache.lucene.index.TestIndexWriterWithThreads._testMultipleThreadsFailure(TestIndexWriterWithThreads.java:349)
    at org.apache.lucene.index.TestIndexWriterWithThreads.testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce(TestIndexWriterWithThreads.java:502)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.
   ```

Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub


uschindler commented on PR #13545:
URL: https://github.com/apache/lucene/pull/13545#issuecomment-2214845553

   Hi,
   in the backport to 9.x the benchmark file was wrongly merged. It landed in 
the test directory. In 9.x we have no benchmark-jmh module in Gradle, so the 
file should have been left out while cherry-picking.
   Can you remove the file in a followup commit?





Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub


uschindler commented on PR #13545:
URL: https://github.com/apache/lucene/pull/13545#issuecomment-2214846241

   See: 
https://github.com/apache/lucene/commit/c8b4a76ecc93a98c779364b18f62c9b67552c192#diff-dd8d7417893f9b2fecaef29491b94d5daeaae6d496c4b21bb9633b4f7b060e59





Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub


uschindler commented on code in PR #13545:
URL: https://github.com/apache/lucene/pull/13545#discussion_r1669064660


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -212,6 +212,14 @@ public static int int4DotProductPacked(byte[] unpacked, byte[] packed) {
     return IMPL.int4DotProduct(unpacked, false, packed, true);
   }
 
+  /**
+   * For xorBitCount we stride over the values as either 64-bits (long) or 32-bits (int) at a time.
+   * On ARM Long::bitCount is not vectorized, and therefore produces less than optimal code, when
+   * compared to Integer::bitCount. While Long::bitCount is optimal on x64. TODO: include the
+   * OpenJDK JIRA url
+   */

Review Comment:
   Do you have the JIRA issue number already?






Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub


uschindler commented on issue #13551:
URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214868089

   I don't think we should change anything here in MMapDirectory. It is all available and easy to do for anyone who wants to do this. Elasticsearch is doing this for some files, but we have no default on preloading anything.
   
   For preloading you can call MMapDirectory#setPreload and give it a lambda expression to select which file extensions you want preloaded: https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/store/MMapDirectory.html#setPreload(java.util.function.BiPredicate)
   
   The default is: https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/store/MMapDirectory.html#NO_FILES





Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-07-08 Thread via GitHub


uschindler commented on issue #13551:
URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214871854

   > * it should be a one-liner using `setPreload` to preload "*.vec" if we wanted to do it either from FSDirectory.open or by default in MMapDirectory
   
   It is trivial (note that `setPreload` takes a `BiPredicate` of the file name and the `IOContext`, per the javadoc linked above):
   ```java
   mmapDir.setPreload((name, context) -> name.endsWith(".vec"));
   ```
   
   So no change is needed in the API or behaviour of Lucene.
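   For completeness, wiring that into an application that opens the default directory might look like this (a sketch; `FSDirectory.open` can return a non-mmap implementation on some platforms, hence the instanceof check):
   
   ```java
   Directory dir = FSDirectory.open(Path.of("/path/to/index"));
   if (dir instanceof MMapDirectory mmapDir) {
     mmapDir.setPreload((name, context) -> name.endsWith(".vec"));
   }
   ```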





Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-08 Thread via GitHub


stefanvodita commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1669100332


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
+
     // Make a copy so we can sort:
     List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);
 
     // Sort by maxDoc, descending:
-    Collections.sort(
-        sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
+    sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
 
-    final List<List<LeafReaderContext>> groupedLeaves = new ArrayList<>();
-    long docSum = 0;
-    List<LeafReaderContext> group = null;
+    final List<List<LeafReaderContextPartition>> groupedLeafPartitions = new ArrayList<>();
+    int currentSliceNumDocs = 0;
+    List<LeafReaderContextPartition> group = null;
     for (LeafReaderContext ctx : sortedLeaves) {
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of
+        // equal size
+        int numSlices = Math.ceilDiv(ctx.reader().maxDoc(), maxDocsPerSlice);

Review Comment:
   > partitions or intervals (in my view) are a logical division of the index 
into roughly equal-sized contiguous (in docid space) portions and they overlay 
the segments arbitrarily
   
   I like the idea of calling them virtual segments in contrast to physical 
segments. Interval (and less so partition) seems to imply the part about 
contiguity in docid space, but is that a necessary assumption? It's efficient 
for it to work that way, but maybe we can allow more flexibility in how we 
determine these partitions / intervals (virtual segments).






Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]

2024-07-08 Thread via GitHub


zhaih merged PR #13548:
URL: https://github.com/apache/lucene/pull/13548





Re: [I] Make HNSW merges faster [lucene]

2024-07-08 Thread via GitHub


benwtrent commented on issue #12440:
URL: https://github.com/apache/lucene/issues/12440#issuecomment-2215292611

   I had another idea: I wonder if we can initialize HNSW via coarse-grained clusters. Depending on the clustering algorithm used, we could use clusters built from the various segments to bootstrap the HNSW graph building.
   
   This obviously would be "new research".
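   
   Purely as an illustration of that idea (none of this is an existing Lucene API), the bootstrapping could start by deriving an insertion order from coarse cluster assignments, so graph construction begins inside locally dense neighborhoods:
   
       import java.util.Arrays;
       import java.util.Comparator;
       
       // Hypothetical sketch: assign each vector to its nearest coarse centroid and
       // insert cluster by cluster. Centroids are assumed to be carried over from
       // the source segments' clustering.
       class ClusterBootstrap {
       
         static int[] insertionOrder(float[][] vectors, float[][] centroids) {
           int[] assignment = new int[vectors.length];
           Integer[] order = new Integer[vectors.length];
           for (int i = 0; i < vectors.length; i++) {
             order[i] = i;
             int best = 0;
             float bestDist = Float.POSITIVE_INFINITY;
             for (int c = 0; c < centroids.length; c++) {
               float d = squaredDistance(vectors[i], centroids[c]);
               if (d < bestDist) {
                 bestDist = d;
                 best = c;
               }
             }
             assignment[i] = best;
           }
           // Arrays.sort on an object array is stable, so ties keep the original order.
           Arrays.sort(order, Comparator.comparingInt(i -> assignment[i]));
           return Arrays.stream(order).mapToInt(Integer::intValue).toArray();
         }
       
         static float squaredDistance(float[] a, float[] b) {
           float sum = 0;
           for (int i = 0; i < a.length; i++) {
             float diff = a[i] - b[i];
             sum += diff * diff;
           }
           return sum;
         }
       }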


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]

2024-07-08 Thread via GitHub


MilindShyani commented on issue #13519:
URL: https://github.com/apache/lucene/issues/13519#issuecomment-2215327156

   @benwtrent Apologies for the late response! I am traveling (and marveling at 2000-year-old pyramids) right now. The transformation you wrote indeed matches mine. I am thinking about the bucketing correction in the meantime and will post shortly. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]

2024-07-08 Thread via GitHub


uschindler commented on PR #13545:
URL: https://github.com/apache/lucene/pull/13545#issuecomment-2215482308

   I reverted the addition of the file to the 9.x branch: 86d080a4e0b4e53e0c9a3f2e2b120bff204c7276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix quantized vector writer ram estimates [lucene]

2024-07-08 Thread via GitHub


benwtrent opened a new pull request, #13553:
URL: https://github.com/apache/lucene/pull/13553

   I still need to write a test, but wanted to open this PR early.
   
   The scalar quantized vector writer's ram usage estimate completely ignores the raw float vectors. This means that if you have flushing based on ram usage configured, you can easily overshoot that estimate and cause an OOM.
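   
   For a sense of scale, a rough sketch of the missing term, assuming a field writer that buffers raw float[] vectors on heap (names here are illustrative, not the exact Lucene classes):
   
       import org.apache.lucene.util.RamUsageEstimator;
       
       class RawVectorHeapCost {
         // Illustrative only: the dominant missing cost is the buffered raw vectors.
         static long rawVectorsBytes(long numVectors, int dim) {
           long perVector =
               RamUsageEstimator.NUM_BYTES_OBJECT_REF // slot in the buffering list
                   + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER // float[] object header
                   + (long) dim * Float.BYTES; // the vector payload itself
           return numVectors * perVector;
         }
       }
   
   At dim=768, one million buffered vectors is on the order of 3 GB of float[] data that an estimate skipping them never sees.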


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix quantized vector writer ram estimates [lucene]

2024-07-08 Thread via GitHub


benwtrent commented on code in PR #13553:
URL: https://github.com/apache/lucene/pull/13553#discussion_r1669464616


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java:
##
@@ -172,9 +172,6 @@ public void finish() throws IOException {
   public long ramBytesUsed() {
     long total = SHALLOW_RAM_BYTES_USED;
     total += flatVectorWriter.ramBytesUsed();
-    for (FieldWriter field : fields) {
-      total += field.ramBytesUsed();
-    }

Review Comment:
   the delegate writer will account for these higher-level indexing writers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix quantized vector writer ram estimates [lucene]

2024-07-08 Thread via GitHub


benwtrent commented on code in PR #13553:
URL: https://github.com/apache/lucene/pull/13553#discussion_r1669465330


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -299,9 +299,7 @@ public void finish() throws IOException {
   @Override
   public long ramBytesUsed() {
     long total = SHALLOW_RAM_BYTES_USED;
-    for (FieldWriter field : fields) {
-      total += field.ramBytesUsed();
-    }
+    total += rawVectorDelegate.ramBytesUsed();

Review Comment:
   rawVectorDelegate accounts for its field writer estimates, which in turn include the scalar quantized writer estimates.
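   
   A condensed, hypothetical view of the delegation chain this relies on (class and field names abbreviated, not the exact Lucene signatures):
   
       // Each layer adds only its own shallow footprint and defers to its delegate,
       // so per-field buffers are counted exactly once, at the bottom of the chain.
       class HnswWriter {
         QuantizedWriter flatVectorWriter;
         long shallowBytes;
         long ramBytesUsed() { return shallowBytes + flatVectorWriter.ramBytesUsed(); }
       }
       
       class QuantizedWriter {
         RawWriter rawVectorDelegate; // owns the buffered raw float[] vectors
         long shallowBytes;
         long ramBytesUsed() { return shallowBytes + rawVectorDelegate.ramBytesUsed(); }
       }
       
       class RawWriter {
         long shallowBytes;
         long fieldWriterBytes; // per-field buffers, including quantized state
         long ramBytesUsed() { return shallowBytes + fieldWriterBytes; }
       }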



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce heap usage for knn index writers [lucene]

2024-07-08 Thread via GitHub


benwtrent commented on PR #13538:
URL: https://github.com/apache/lucene/pull/13538#issuecomment-2215637537

   Related: https://github.com/apache/lucene/pull/13553


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix quantized vector writer ram estimates [lucene]

2024-07-08 Thread via GitHub


benwtrent commented on PR #13553:
URL: https://github.com/apache/lucene/pull/13553#issuecomment-2215645063

   @gautamworah96 @msokolov this might be part of the reason for the OOMs: the estimates were completely ignoring the float[] vector sizes for field writers 🤦. I plan on iterating on this fix first, then proceeding with https://github.com/apache/lucene/pull/13538.
   
   The other PR helps reduce heap usage slightly, but not by much. Honestly, the issue is that we just weren't tracking the usage in the estimate correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use `IndexInput#prefetch` for terms dictionary lookups. [lucene]

2024-07-08 Thread via GitHub


github-actions[bot] commented on PR #13359:
URL: https://github.com/apache/lucene/pull/13359#issuecomment-2215694348

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] WordBreakSpellChecker.generateBreakUpSuggestions() should do breadth first search [lucene]

2024-07-08 Thread via GitHub


hossman commented on issue #12100:
URL: https://github.com/apache/lucene/issues/12100#issuecomment-2215985880

   I realized today that I had been working on branch_9x, so I've updated the patch to apply cleanly to main.
   
   
[WordBreakSpellChecker.breadthfirst.GH-12100.patch.txt](https://github.com/user-attachments/files/16135858/WordBreakSpellChecker.breadthfirst.GH-12100.patch.txt)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Binary search all terms. [lucene]

2024-07-08 Thread via GitHub


vsop-479 commented on PR #13192:
URL: https://github.com/apache/lucene/pull/13192#issuecomment-2216245652

   Or, we could just supply this (maybe only for `non-allEqual` leaf blocks) as an option, so users can use it when their applications are not busy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipIntervalfor testing docvalues skipper index [lucene]

2024-07-08 Thread via GitHub


iverase commented on code in PR #13550:
URL: https://github.com/apache/lucene/pull/13550#discussion_r1669846499


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseDocValuesFormatTestCase.java:
##
@@ -157,6 +158,7 @@ public void testNumberMergeAwayAllValuesWithSkipper() throws IOException {
     Analyzer analyzer = new MockAnalyzer(random());
     IndexWriterConfig iwconfig = newIndexWriterConfig(analyzer);
     iwconfig.setMergePolicy(newLogMergePolicy());
+    iwconfig.setCodec(SkipperCodec.randomInstance(random()));

Review Comment:
   sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipIntervalfor testing docvalues skipper index [lucene]

2024-07-08 Thread via GitHub


iverase commented on code in PR #13550:
URL: https://github.com/apache/lucene/pull/13550#discussion_r1669846808


##
lucene/test-framework/src/java/org/apache/lucene/tests/codecs/skipper/SkipperCodec.java:
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.tests.codecs.skipper;
+
+import java.util.Random;
+import org.apache.lucene.codecs.DocValuesFormat;
+import org.apache.lucene.codecs.FilterCodec;
+import org.apache.lucene.codecs.lucene90.Lucene90DocValuesFormat;
+import org.apache.lucene.tests.util.TestUtil;
+
+/**
+ * A codec that uses {@link Lucene90DocValuesFormat} with a random skip index size for its doc
+ * values and delegates to the default codec for everything else.
+ */
+public final class SkipperCodec extends FilterCodec {

Review Comment:
   right, I made it too complicated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipIntervalfor testing docvalues skipper index [lucene]

2024-07-08 Thread via GitHub


iverase commented on PR #13550:
URL: https://github.com/apache/lucene/pull/13550#issuecomment-2216740642

   @jpountz I added a new test and changed the title and description of the issue, as we don't need to add a new Codec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org