Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
jpountz commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213200920

@original-brownbear Would you like to work on a PR?
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
jpountz commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213213760

I just pushed an annotation for this change: https://github.com/mikemccand/luceneutil/commit/a64ac17a9d1a935649837990f2accbace0b93262. Several queries got a bit faster with a low p-value: https://people.apache.org/~mikemccand/lucenebench/2024.07.05.14.34.52.html.
Re: [PR] Binary search all terms. [lucene]
vsop-479 commented on PR #13192: URL: https://github.com/apache/lucene/pull/13192#issuecomment-2213224505

> This many new allocations

Maybe we can share these allocations (`suffixes`, `positions`) across `searchers`, since they are just immutable, non-stateful data.
Re: [PR] WIP: draft of intra segment concurrency [lucene]
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1668225258

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:

```
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
+
     // Make a copy so we can sort:
     List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);
     // Sort by maxDoc, descending:
-    Collections.sort(
-        sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
+    sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));

-    final List<List<LeafReaderContext>> groupedLeaves = new ArrayList<>();
-    long docSum = 0;
-    List<LeafReaderContext> group = null;
+    final List<List<LeafReaderContextPartition>> groupedLeafPartitions = new ArrayList<>();
+    int currentSliceNumDocs = 0;
+    List<LeafReaderContextPartition> group = null;
     for (LeafReaderContext ctx : sortedLeaves) {
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of
```

Review Comment: Thanks for your feedback @mikemccand! You are right to be confused; this is a POC and details about the abstractions need refining. Here is how I got to this place: we have slices that target multiple leaf reader contexts, and I wanted a slice to still be able to target multiple segments while at the same time targeting a partition of a segment. The easiest way to do that was to wrap a leaf reader context into a new abstraction that holds a leaf reader context and defines a range of doc ids as well. A partition can still point to the entire segment though. Technically, a slice is a set of segments, yet a partition of the index, while a leaf reader context partition is a partition of a segment, hence the confusion :) One natural next step could be to fold the range of doc ids into the leaf reader context perhaps, but that is a substantial API change. Otherwise we keep the structure I came up with, but rename things to clarify the concepts and reduce confusion. Do you have other options in mind?
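[Editor's note] For readers following the thread, here is a minimal sketch of the wrapper abstraction javanna describes. The fields and factory methods are illustrative assumptions, not the PR's actual API:

```java
import org.apache.lucene.index.LeafReaderContext;

// Illustrative sketch only: names and fields are assumptions, not the PR's actual API.
final class LeafReaderContextPartition {
  final LeafReaderContext ctx;
  final int minDocId; // inclusive lower bound of this partition's doc id range
  final int maxDocId; // exclusive upper bound of this partition's doc id range

  private LeafReaderContextPartition(LeafReaderContext ctx, int minDocId, int maxDocId) {
    this.ctx = ctx;
    this.minDocId = minDocId;
    this.maxDocId = maxDocId;
  }

  /** A partition that covers the entire segment. */
  static LeafReaderContextPartition createForEntireSegment(LeafReaderContext ctx) {
    return new LeafReaderContextPartition(ctx, 0, ctx.reader().maxDoc());
  }

  /** A partition that covers a contiguous sub-range of the segment's doc id space. */
  static LeafReaderContextPartition createFromAndTo(LeafReaderContext ctx, int from, int to) {
    return new LeafReaderContextPartition(ctx, from, to);
  }
}
```

A slice then becomes a set of such partitions rather than a set of leaves, which is how one slice can mix whole segments with fractions of a large segment.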
Re: [PR] Replace AtomicLong with LongAdder in HitsThresholdChecker [lucene]
jpountz merged PR #13546: URL: https://github.com/apache/lucene/pull/13546
Re: [PR] Optimize MaxScoreBulkScorer [lucene]
jpountz merged PR #13544: URL: https://github.com/apache/lucene/pull/13544
Re: [PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]
jpountz commented on code in PR #13543: URL: https://github.com/apache/lucene/pull/13543#discussion_r1668261240

lucene/core/src/java/org/apache/lucene/store/OutputStreamIndexOutput.java:

```
@@ -135,5 +135,19 @@ void writeLong(long i) throws IOException {
     BitUtil.VH_LE_LONG.set(buf, count, i);
     count += Long.BYTES;
   }
+
+  @Override
+  public void write(int b) throws IOException {
+    // override single byte write to avoid synchronization overhead now that JEP374 removed biased
+    // locking
+    byte[] buffer = buf;
+    int count = this.count;
+    if (count >= buffer.length) {
+      super.write(b);
+    } else {
+      buffer[count] = (byte) b;
+      this.count = count + 1;
+    }
+  }
```

Review Comment: Ok.
Re: [PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]
jpountz merged PR #13543: URL: https://github.com/apache/lucene/pull/13543
Re: [PR] WIP: draft of intra segment concurrency [lucene]
javanna commented on PR #13542: URL: https://github.com/apache/lucene/pull/13542#issuecomment-2213602344

Bulk reply to some of the feedback I got:

Hi @shubhamvishu,

> I know it might be too early to ask (as changes are not yet consolidated), but curious if we have any early benchmarking numbers when intra concurrency is enabled?

I have run no benchmarks so far. I focused more on making tests happy, which felt like a good initial step. We do know that there will be a substantial gain when we search force merged indices.

@mikemccand

> I think queries that do substantial work in Weight/ScorerSupplier.get are also tricky? E.g. points queries (PointRangeQuery) walks the BKD tree to build a bitset. Ideally, if a segment has multiple LeafReaderContextPartitions, we'd only do the BKD intersect once, and then the N threads working on those partitions can share that bitset?

Yes, definitely, I mentioned this above in the description. Removing duplicated effort would be great, but I focused on making things work to start with. What you mention is a natural next step. It may though end up influencing the design depending on where we decide to de-duplicate work: does the notion of segment partition need to be visible to consumers, or can it be an implementation detail?

@jpountz

> I wonder if it should be on IndexSearcher to not call Collector#getLeafCollector multiple times if the first call throws a CollectionTerminatedException (what you implemented) or if it should be on the CollectorManager to no longer assume that Collector#getLeafCollector gets called exactly once per leaf. The latter feels cleaner to me conceptually, and I believe that it would also avoid the issue that you had with LRUQueryCache? The downside is that we'll need to fix all collectors that make this assumption.

Agreed, that adjustment is a hack. The change in expectation should be reflected in the `Collector` API semantics though (rather than `CollectorManager`?), is that what you meant? For instance, taking the total hit count example, `TotalHitCountCollector` would need to be adapted to handle the case where `getLeafCollector` gets called providing the same leaf multiple times? It would need to be made synchronized and it would need to track which leaves were early terminated?
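[Editor's note] To make the discussion concrete, here is a hedged sketch of what such an adaptation could look like, assuming partitions of the same leaf share the `LeafReaderContext` instance and that per-collector tracking suffices (jpountz points out below that deduplication across collectors would instead have to live in the `CollectorManager`). All names are illustrative, not the PR's code:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Weight;

// Sketch only: getLeafCollector may now be called once per *partition*, so the
// Weight#count shortcut must be claimed at most once per leaf, and once claimed,
// every other partition of that leaf must terminate without collecting (or the
// leaf would be double counted).
class PartitionAwareTotalHitCountCollector implements Collector {
  private final Map<LeafReaderContext, Boolean> leafFullyCounted = new HashMap<>();
  private final LongAdder totalHits = new LongAdder();
  private Weight weight;

  @Override
  public void setWeight(Weight weight) {
    this.weight = weight;
  }

  public long getTotalHits() {
    return totalHits.sum();
  }

  @Override
  public synchronized LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    Boolean counted = leafFullyCounted.get(context);
    if (counted == null) {
      // first partition of this leaf to arrive: try the count shortcut exactly once
      int leafCount = weight == null ? -1 : weight.count(context);
      counted = leafCount != -1;
      leafFullyCounted.put(context, counted);
      if (counted) {
        totalHits.add(leafCount);
      }
    }
    if (counted) {
      // this leaf was already fully counted (possibly by another partition)
      throw new CollectionTerminatedException();
    }
    return new LeafCollector() {
      @Override
      public void setScorer(Scorable scorer) {}

      @Override
      public void collect(int doc) {
        totalHits.increment();
      }
    };
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }
}
```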
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
ChrisHegarty commented on PR #13535: URL: https://github.com/apache/lucene/pull/13535#issuecomment-2213644712

Thanks for the comments so far. I updated the PR to only check same-thread semantics for MSII clone and slice. And also added some basic thread checks to MockIndexInputWrapper. I think this is a reasonable improvement and a good place to stop in this PR. As suggested, further improvements can be done as a follow up.
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on PR #13535: URL: https://github.com/apache/lucene/pull/13535#issuecomment-2213707284

There are some test failures due to strict thread checking. I think the mock input should only do this when it's in confined mode.
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
original-brownbear commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2213989662

Sure thing, on it! :) Sorry, could've done that right away; tired me just didn't realise it this morning.
[I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]
mikemccand opened a new issue, #13551: URL: https://github.com/apache/lucene/issues/13551

### Description

This is really a "discussion" issue. I'm not sure at all that the idea is feasible.

I've been testing `luceneutil` with heavy KNN indexing (Cohere wikipedia `en` 768 dimension vectors) and one dismal part of the experience is the swap storm caused by HNSW's random access to the raw vectors stored in the `.vec` files for each segment on "cold start" searching. Even when the box has plenty of RAM to hold all `.vec` files, the swap storm takes minutes on the nightly benchy box, even with a fast NVMe SSD holding the index. Even if the index is freshly built, the OS doesn't seem to cache the `.vec` files since they appear to be "write once", until the searching starts up, and then the swap storm begins.

This was with `FLOAT32` vectors ... I suspect the problem is less severe with `int4` or `int8` compressed vectors (haven't tested).

At Amazon (customer facing product search) we also see this swap storm when cold starting the searching process even after [NRT segment replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html) has just copied the files locally: they don't stay "hot" in the OS in that case either (looks like "write once" to the OS).

Lucene already has an awesome feature to "fix" this: `MMapDirectory.setPreload`. It will pre-touch all pages associated with that file on open, so the OS caches them "once" on startup, much more efficiently than the random access HNSW. But this only makes sense for applications/users that know they have enough RAM (we will test this at Amazon to see if it helps our cold start problems). For my `luceneutil` tests, simply `cat /path/to/index/*.vec >> /dev/null` (basically the same as `.setPreload` I think) fixed/sidestepped the swap storm.

(The Linux kernel's less aggressive default readahead for memory-mapped IO vs traditional NIO is likely not helping either? Is this really a thing? I have not tested `NIOFSDirectory` to see if the swap storm is lessened.)

Longer term we have discussed using AKNN algorithms that are more disk-friendly (e.g. [DiskANN](https://github.com/apache/lucene/issues/12615)), but shorter term, I'm wondering if we could somehow help users with a default `Directory` (from `FSDirectory.open`) that somehow / sometimes preloads `.vec` files? It's not easy to do -- you wouldn't know up front that the application will do KNN searching at all. And, maybe only certain vectors in the `.vec` will ever be accessed and so you need to pay that long random access price to find and cache just those ones.
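[Editor's note] For reference, opting in with the existing API is a couple of lines; a sketch against the 9.x `MMapDirectory` API, where the index path is illustrative:

```java
import java.nio.file.Path;
import org.apache.lucene.store.MMapDirectory;

public class PreloadVecFiles {
  public static void main(String[] args) throws Exception {
    // the path is illustrative
    MMapDirectory dir = new MMapDirectory(Path.of("/path/to/index"));
    // pre-touch pages of .vec files when they are opened, so the OS caches them up front
    dir.setPreload((fileName, ioContext) -> fileName.endsWith(".vec"));
  }
}
```

uschindler gives the same one-liner further down the thread.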
Re: [PR] Introduce a SkipperCodec for testing docvalues skipper index [lucene]
jpountz commented on code in PR #13550: URL: https://github.com/apache/lucene/pull/13550#discussion_r1668890679

lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseDocValuesFormatTestCase.java:

```
@@ -157,6 +158,7 @@ public void testNumberMergeAwayAllValuesWithSkipper() throws IOException {
     Analyzer analyzer = new MockAnalyzer(random());
     IndexWriterConfig iwconfig = newIndexWriterConfig(analyzer);
     iwconfig.setMergePolicy(newLogMergePolicy());
+    iwconfig.setCodec(SkipperCodec.randomInstance(random()));
```

Review Comment: This doesn't look right, as this overrides the codec that is configured on the test case, e.g. `SimpleTextDocValuesFormat` would no longer be exercised? Should we create a new `TestLucene90DocValuesFormatVariableSkipInterval` class or something along these lines instead?

lucene/test-framework/src/java/org/apache/lucene/tests/codecs/skipper/SkipperCodec.java:

```
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.tests.codecs.skipper;
+
+import java.util.Random;
+import org.apache.lucene.codecs.DocValuesFormat;
+import org.apache.lucene.codecs.FilterCodec;
+import org.apache.lucene.codecs.lucene90.Lucene90DocValuesFormat;
+import org.apache.lucene.tests.util.TestUtil;
+
+/**
+ * A codec that uses {@link Lucene90DocValuesFormat} with a random skip index size for its doc
+ * values and delegates to the default codec for everything else.
+ */
+public final class SkipperCodec extends FilterCodec {
```

Review Comment: Maybe rename to `Skipper90Codec` to reflect that it maps to `Lucene90DocValuesFormat`? I assume we'd create a new one when we release a new DV format? Also, does it need to be a new Codec, or could it extend `Lucene911Codec` and override `getDocValuesFormatForField`?
Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]
rmuir commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214496576

> It's not easy to do -- you wouldn't know up front that the application will do KNN searching at all. And, maybe only certain vectors in the `.vec` will ever be accessed and so you need to pay that long random access price to find and cache just those ones.

I'm trying to understand this sentence:

* it should be a one-liner using `setPreload` to preload "*.vec" if we wanted to do it, either from FSDirectory.open or by default in MMapDirectory
* if the application isn't doing KNN searching then they won't have .vec? I struggle to imagine someone indexing a ton of "extra" vectors that isn't "using" them and hasn't noticed a big performance impact
Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]
jpountz commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214541782

It wouldn't solve the issue, only mitigate it, but hopefully cold start performance gets better when we start leveraging `IndexInput#prefetch` to load multiple vectors from disk concurrently (https://github.com/apache/lucene/issues/13179).

> The Linux kernel's less aggressive default readahead for memory-mapped IO vs traditional NIO

FWIW, it's not that the read-ahead is less aggressive for mmap than traditional I/O, it's that since recently `MMapDirectory` explicitly tells the OS to not perform readahead for vectors by passing `MADV_RANDOM` to `madvise`.
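[Editor's note] To illustrate the direction, a sketch only: it assumes fixed-size vectors laid out contiguously by ordinal in the `.vec` file, and uses the `IndexInput#prefetch(offset, length)` API that #13179 tracks; the helper itself is hypothetical:

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Hypothetical helper: kick off page loads for several candidate vectors at once,
// so the scoring loop overlaps I/O instead of faulting in one vector at a time.
final class VectorPrefetcher {
  static void prefetch(IndexInput vecInput, long vectorByteSize, int[] candidateOrds)
      throws IOException {
    for (int ord : candidateOrds) {
      vecInput.prefetch(ord * vectorByteSize, vectorByteSize);
    }
  }
}
```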
Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]
benwtrent commented on issue #13519: URL: https://github.com/apache/lucene/issues/13519#issuecomment-2214591883

@jbhateja could you unpack how this would actually work when using dot-product with linear scale corrections?

I would imagine we could switch to an "unsigned byte" comparison and thus handle the bit flipping directly in SIMD, but this would introduce a new vector comparison that needs to be added.

I am not sure how we can just "flip the bit" while maintaining distance comparisons for euclidean & dotProduct without introducing error or slowing things down significantly.
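[Editor's note] For intuition, the scalar identity behind the sign-bit flip looks like this (a toy sketch, not Lucene code and not a SIMD design): flipping the high bit maps a signed byte b to the unsigned value b + 128, which preserves ordering, and the unsigned dot product relates to the signed one only through per-vector correction terms.

```java
// Toy sketch: sum((a_i + 128) * (b_i + 128))
//   = a.b + 128 * sum(a) + 128 * sum(b) + 128^2 * dim
// so the signed dot product is recoverable exactly, but only with extra sum terms.
final class SignBitFlipDemo {
  static int toUnsigned(byte b) {
    return (b ^ 0x80) & 0xFF; // == b + 128 for signed b in [-128, 127]
  }

  static int signedDotViaUnsigned(byte[] a, byte[] b) {
    int unsignedDot = 0, sumA = 0, sumB = 0;
    for (int i = 0; i < a.length; i++) {
      unsignedDot += toUnsigned(a[i]) * toUnsigned(b[i]);
      sumA += a[i];
      sumB += b[i];
    }
    // invert the expansion above
    return unsignedDot - 128 * (sumA + sumB) - 128 * 128 * a.length;
  }
}
```

Whether those extra sum terms can be folded into the existing linear-scale corrections without error, and whether the flipped comparison can be done in SIMD without a new vector kernel, is exactly the open question here.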
Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]
ChrisHegarty merged PR #13545: URL: https://github.com/apache/lucene/pull/13545
Re: [PR] WIP: draft of intra segment concurrency [lucene]
jpountz commented on PR #13542: URL: https://github.com/apache/lucene/pull/13542#issuecomment-2214643611

> The change in expectation should be reflected in the Collector API semantics though (rather than CollectorManager?), is that what you meant?

I was referring to `CollectorManager` because we could have two partitions of the same leaf that are collected by two different `Collector`s of the same `CollectorManager`. So both collectors would see this leaf only once, but we still need to avoid double-counting this leaf.

> It would need to be made synchronized and it would need to track which leaves were early terminated?

Something like that indeed. And only try to apply the `Weight#count` optimization on the first time that `getLeafCollector` is called on a leaf, to avoid hitting the problem you had with `LRUQueryCache`.
Re: [PR] WIP: draft of intra segment concurrency [lucene]
msokolov commented on PR #13542: URL: https://github.com/apache/lucene/pull/13542#issuecomment-2214811101

I wonder if we should tackle the issue with caching / cloning scorers? We have scorers/scorerSuppliers that do a lot of up-front work when created, and we don't want to duplicate that work when two threads search the same segment. OTOH two threads cannot share the same Scorer instance since it has internal state, namely the iterator it advances, and maybe more. This seems like the biggest problem to tackle - maybe it would make sense to start there? OTOH we need some kind of execution framework (this PR) that we can use to exercise such changes - I guess it's a kind of leapfrogging problem, gotta start somewhere. Given that, maybe we create a long-running branch we can use to iterate on this in multiple steps before merging to main?

As for the Scorer-cloning, I poked at it and it seems tricky. I think that Weight can cache Scorer and/or ScorerSupplier, and I guess BulkScorer. Then when asked for a scorer-type object that it already has, it can create a shallowCopy(). Shallow-copying should be implemented so as to do the least amount of work to create a new instance that can be iterated independently. This will be up to each Scorer/ScorerSupplier/BulkScorer to figure out -- maybe some of them just return null and let Weight create a new one as it does today.
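[Editor's note] As a strawman for that idea, a per-leaf cache inside a Weight might look like the sketch below; nothing here is an existing Lucene API — `ShallowCopyable`, the factory interface, and the cache are all imagined:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScorerSupplier;

// Imagined contract: hand out a cheap, independently iterable copy, or null if unsupported.
interface ShallowCopyable {
  ScorerSupplier shallowCopy();
}

// Imagined per-Weight cache, consulted once per partition of a leaf.
final class ScorerSupplierCache {
  interface Factory {
    ScorerSupplier create() throws IOException; // today's expensive ScorerSupplier construction
  }

  private final Map<LeafReaderContext, ScorerSupplier> cache = new HashMap<>();

  synchronized ScorerSupplier get(LeafReaderContext ctx, Factory factory) throws IOException {
    ScorerSupplier cached = cache.get(ctx);
    if (cached == null) {
      cached = factory.create(); // up-front work (e.g. a BKD intersect) happens once
      cache.put(ctx, cached);
      return cached;
    }
    if (cached instanceof ShallowCopyable copyable) {
      ScorerSupplier copy = copyable.shallowCopy();
      if (copy != null) {
        return copy; // least-work copy that can be iterated independently
      }
    }
    return factory.create(); // scorer can't be shared: rebuild, as Weight does today
  }
}
```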
Re: [PR] WIP: draft of intra segment concurrency [lucene]
msokolov commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1668890994

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:

```
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
...
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of
```

Review Comment: I had worked up a version of this where I modified LeafReaderContext/IndexReaderContext to create a new kind of context that models the range within a segment. I had added interval start/end to LRC, but I suspect a cleaner way would be to make a new thing (IntervalReaderContext or so) and then change APIs to expect IndexReaderContext instead of CompositeReaderContext? If we do it this way it might make it easier to handle some cases like the single-threaded execution you mentioned. But this is more about cleaning up the APIs than making it work, and we can argue endlessly about what is neater, so I think your approach to delay such questions makes sense.

(Same hunk, continuing:)

```
+        // if the segment does not fit in a single slice, we split it in multiple partitions of
+        // equal size
+        int numSlices = Math.ceilDiv(ctx.reader().maxDoc(), maxDocsPerSlice);
```

Review Comment: My mental model of the whole slice/partition/segment/interval concept is: existing physical segments (leaves) divide the index into arbitrary sizes. Existing slices (what we have today, not what is called slices in this PR) group segments together. Partitions or intervals (in my view) are a logical division of the index into roughly equal-sized contiguous (in docid space) portions, and they overlay the segments arbitrarily. Then it is the job of IndexSearcher to map this logical division of work onto the underlying physical segments. The main comment here is - let's not confuse ourselves by re-using the word "slice", which already means something else!
Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]
benwtrent commented on code in PR #13548: URL: https://github.com/apache/lucene/pull/13548#discussion_r1669043121

lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:

```
@@ -139,6 +142,54 @@ public int nextDoc() throws IOException {
     }
   }

+  /**
+   * Given old doc ids and an id mapping, maps old ordinal to new ordinal. Note: this method return
+   * nothing and output are written to parameters
+   *
+   * @param old2NewOrd int[] maps from old ord to new ord
+   * @param new2OldOrd int[] maps from new ord to old ord
+   * @param newDocsWithField set of new doc ids which has the value
+   */
+  public static void mapOldOrdToNewOrd(
+      DocsWithFieldSet oldDocIds,
+      Sorter.DocMap sortMap,
+      int[] old2NewOrd,
+      int[] new2OldOrd,
+      DocsWithFieldSet newDocsWithField)
+      throws IOException {
+    // TODO: a similar function exists in IncrementalHnswGraphMerger#getNewOrdMapping
+    // maybe we can do a further refactoring
+    assert (old2NewOrd != null || new2OldOrd != null || newDocsWithField != null);
+    IntIntHashMap newIdToOldOrd = new IntIntHashMap();
```

Review Comment (suggestion on the javadoc):

```
   * @param oldDocIds the old or current document ordinals. Must not be null.
   * @param sortMap the document sorting map for how to make the new ordinals. Must not be null.
```

Review Comment (suggestion on the null checks):

```
    Objects.requireNonNull(oldDocIds);
    Objects.requireNonNull(sortMap);
    assert (old2NewOrd != null || new2OldOrd != null || newDocsWithField != null);
```

Review Comment (on `IntIntHashMap newIdToOldOrd`): Since this method isn't in charge of creating `old2NewOrd` or `new2OldOrd`, can we assert the sizes of those arrays as they are provided?

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsFormat.java:

```
@@ -56,8 +56,8 @@
  * [vlong] length of this field's vectors, in bytes
  * [vint] dimension of this field's vectors
  * [int] the number of documents having values for this field
- * [int8] if equals to -1, dense – all documents have values for a field. If equals to
- *     0, sparse – some documents missing values.
+ * [int8] if equals to -2, empty - no vectory values. If equals to -1, dense – all
```

Review Comment (suggestion, fixing a typo):

```
 * [int8] if equals to -2, empty - no vector values. If equals to -1, dense – all
```

lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:

```
-    BytesRef nextTerm() throws IOException {
+    private BytesRef nextTerm() throws IOException {
```

Review Comment: seems like an extra thing out of nowhere? It's probably ok in this commit, just struck me as a strange inclusion :)
Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]
mikemccand commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214828022

Oh sorry, I used the wrong term (thank you @rmuir for clarifying!): it's not a swap storm I'm seeing, it's a page storm. The OS has plenty of free ram (reported by `free`), and that goes down and `buff/cache` goes up as the OS pulls and caches pages in for the `.vec` file. I don't think I'm running too many ram hogging crapplications ;)

> * it should be a one-liner using `setPreload` to preload "*.vec" if we wanted to do it either from FSDirectory.open or by default in MMapDirectory

+1 -- it would be a simple change. But I worry if it would do more harm than good in some cases, e.g. if there are truly cold HNSW cases where the application plans to suffer through paging to identify the subset of hot vectors? I don't know if that is even a thing -- a bunch of dark matter vectors that never get visited? I guess we do know that [HNSW can and does sometimes produce disconnected graphs](https://github.com/apache/lucene/issues/12627), but I think those dark matter islands are "typically" smallish.

> * if the application isn't doing KNN searching then they won't have .vec? I struggle to imagine someone indexing a ton of "extra" vectors that isn't "using" them and hasn't noticed a big performance impact

+1! Especially because indexing them is quite costly!

> It wouldn't solve the issue, only mitigate it, but hopefully cold start performance gets better when we start leveraging `IndexInput#prefetch` to load multiple vectors from disk concurrently (#13179).

+1 -- that would be a big help, especially when paging in from fast SSDs since these devices have high IO concurrency.

> FWIW, it's not that the read-ahead is less aggressive for mmap than traditional I/O, it's that since recently `MMapDirectory` explicitly tells the OS to not perform readahead for vectors by passing `MADV_RANDOM` to `madvise`.

Ahhh, that makes sense! And I think it is still correct to `madvise` in this way for our `.vec` files: it ~~really~~ likely (is there any locality to HNSW's access patterns?) is a random access pattern from HNSW. It does make me wonder if the scalar compression will actually help or not. I guess it might still help if that compression means multiple vectors fit into a single IO page.
Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]
zhaih commented on code in PR #13548: URL: https://github.com/apache/lucene/pull/13548#discussion_r1669054287

lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:

```
-    BytesRef nextTerm() throws IOException {
+    private BytesRef nextTerm() throws IOException {
```

Review Comment: Yeah I just edited it when I was going over the code, thought it should be fine leaving it here :)
Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]
zhaih commented on code in PR #13548: URL: https://github.com/apache/lucene/pull/13548#discussion_r1669056420

lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java (on the `IntIntHashMap newIdToOldOrd` comment quoted above):

Review Comment: good idea
[I] Test TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce Failed [lucene]
aoli-al opened a new issue, #13552: URL: https://github.com/apache/lucene/issues/13552

### Description

I saw a flaky test, `TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce`, caused by concurrency issues recently:

```
MockDirectoryWrapper: cannot close: there are still 30 open files: {_1.tvd=1, _0_Lucene99_0.pos=1, _1.nvd=1, _0.nvd=1, _2.fdt=1, _2.tvx=1, _0_Lucene99_0.tim=1, _0_Lucene99_0.tip=1, _2.fdx=1, _2_Lucene99_0.doc=1, _2.nvd=1, _2.tvd=1, _0_Lucene90_0.dvd=1, _1.fdx=1, _1_Lucene99_0.doc=1, _1.fdt=1, _1.tvx=1, _2_Lucene99_0.pos=1, _1_Lucene90_0.dvd=1, _0_Lucene99_0.doc=1, _0.fdx=1, _2_Lucene99_0.tip=1, _0.fdt=1, _0.tvx=1, _2_Lucene99_0.tim=1, _0.tvd=1, _1_Lucene99_0.pos=1, _2_Lucene90_0.dvd=1, _1_Lucene99_0.tip=1, _1_Lucene99_0.tim=1}

java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 30 open files: {_1.tvd=1, _0_Lucene99_0.pos=1, _1.nvd=1, _0.nvd=1, _2.fdt=1, _2.tvx=1, _0_Lucene99_0.tim=1, _0_Lucene99_0.tip=1, _2.fdx=1, _2_Lucene99_0.doc=1, _2.nvd=1, _2.tvd=1, _0_Lucene90_0.dvd=1, _1.fdx=1, _1_Lucene99_0.doc=1, _1.fdt=1, _1.tvx=1, _2_Lucene99_0.pos=1, _1_Lucene90_0.dvd=1, _0_Lucene99_0.doc=1, _0.fdx=1, _2_Lucene99_0.tip=1, _0.fdt=1, _0.tvx=1, _2_Lucene99_0.tim=1, _0.tvd=1, _1_Lucene99_0.pos=1, _2_Lucene90_0.dvd=1, _1_Lucene99_0.tip=1, _1_Lucene99_0.tim=1}
    at __randomizedtesting.SeedInfo.seed([4D88B1144F0B21A3:1E330D289AB0246]:0)
    at org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:876)
    at org.apache.lucene.index.TestIndexWriterWithThreads._testMultipleThreadsFailure(TestIndexWriterWithThreads.java:349)
    at org.apache.lucene.index.TestIndexWriterWithThreads.testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce(TestIndexWriterWithThreads.java:502)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.
```
Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]
uschindler commented on PR #13545: URL: https://github.com/apache/lucene/pull/13545#issuecomment-2214845553

Hi, in the backport to 9.x the benchmark file was wrongly merged. It landed in the test directory. In 9.x we have no benchmark-jmh module in Gradle, so the file should have been left out while cherry-picking. Can you remove the file in a followup commit?
Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]
uschindler commented on PR #13545: URL: https://github.com/apache/lucene/pull/13545#issuecomment-2214846241

See: https://github.com/apache/lucene/commit/c8b4a76ecc93a98c779364b18f62c9b67552c192#diff-dd8d7417893f9b2fecaef29491b94d5daeaae6d496c4b21bb9633b4f7b060e59
Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]
uschindler commented on code in PR #13545: URL: https://github.com/apache/lucene/pull/13545#discussion_r1669064660

lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:

```
@@ -212,6 +212,14 @@ public static int int4DotProductPacked(byte[] unpacked, byte[] packed) {
     return IMPL.int4DotProduct(unpacked, false, packed, true);
   }

+  /**
+   * For xorBitCount we stride over the values as either 64-bits (long) or 32-bits (int) at a time.
+   * On ARM Long::bitCount is not vectorized, and therefore produces less than optimal code, when
+   * compared to Integer::bitCount. While Long::bitCount is optimal on x64. TODO: include the
+   * OpenJDK JIRA url
+   */
```

Review Comment: Do you have the JIRA issue number already?
Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]
uschindler commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214868089

I don't think we should change anything here in MMapDirectory. It is all available and easy to do for anyone who wants to do this. Elasticsearch is doing this for some files, but we have no default on preloading anything.

For preloading you can call MMapDirectory#setPreload and give it a lambda expression to select which file extensions you want to preload: https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/store/MMapDirectory.html#setPreload(java.util.function.BiPredicate)

The default is: https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/store/MMapDirectory.html#NO_FILES
Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]
uschindler commented on issue #13551: URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214871854

> * it should be a one-liner using `setPreload` to preload "*.vec" if we wanted to do it either from FSDirectory.open or by default in MMapDirectory

It is trivial (`setPreload` takes a `BiPredicate` of file name and `IOContext`, per the javadocs linked above):

```java
mmapDir.setPreload((name, context) -> name.endsWith(".vec"));
```

So no change is needed in the API or behaviour of Lucene.
Re: [PR] WIP: draft of intra segment concurrency [lucene]
stefanvodita commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1669100332

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java (on the same `numSlices` hunk quoted above):

Review Comment:

> partitions or intervals (in my view) are a logical division of the index into roughly equal-sized contiguous (in docid space) portions and they overlay the segments arbitrarily

I like the idea of calling them virtual segments in contrast to physical segments. Interval (and less so partition) seems to imply the part about contiguity in docid space, but is that a necessary assumption? It's efficient for it to work that way, but maybe we can allow more flexibility in how we determine these partitions / intervals (virtual segments).
Re: [PR] Refactor and javadoc update for KNN vector writer classes [lucene]
zhaih merged PR #13548: URL: https://github.com/apache/lucene/pull/13548
Re: [I] Make HNSW merges faster [lucene]
benwtrent commented on issue #12440: URL: https://github.com/apache/lucene/issues/12440#issuecomment-2215292611

I had another idea: I wonder if we can initialize HNSW via coarse-grained clusters. Depending on the clustering algorithm used, we could use clusters built from various segments to bootstrap the HNSW graph building. This obviously would be "new research".
Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]
MilindShyani commented on issue #13519: URL: https://github.com/apache/lucene/issues/13519#issuecomment-2215327156

@benwtrent Apologies for the late response! I am traveling (and marveling at 2000-year-old pyramids) right now. The transformation you wrote indeed matches mine. Thinking about the bucketing correction meanwhile, will post shortly. Thanks!
Re: [PR] Improve VectorUtil::xorBitCount perf on ARM [lucene]
uschindler commented on PR #13545: URL: https://github.com/apache/lucene/pull/13545#issuecomment-2215482308

I reverted the addition of the file to the 9.x branch: 86d080a4e0b4e53e0c9a3f2e2b120bff204c7276
[PR] Fix quantized vector writer ram estimates [lucene]
benwtrent opened a new pull request, #13553: URL: https://github.com/apache/lucene/pull/13553

I still need to write a test, but wanted to open this PR early.

The scalar quantized vector writer's ram usage estimate completely ignores the raw float vectors. Meaning, if you have flush based on ram usage configured, you can easily overshoot that estimate and cause an OOM.
Re: [PR] Fix quantized vector writer ram estimates [lucene]
benwtrent commented on code in PR #13553: URL: https://github.com/apache/lucene/pull/13553#discussion_r1669464616

## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java:
## @@ -172,9 +172,6 @@ public void finish() throws IOException {
   public long ramBytesUsed() {
     long total = SHALLOW_RAM_BYTES_USED;
     total += flatVectorWriter.ramBytesUsed();
-    for (FieldWriter field : fields) {
-      total += field.ramBytesUsed();
-    }

Review Comment:
   The delegate writer takes these higher-level indexing writers into account.
Re: [PR] Fix quantized vector writer ram estimates [lucene]
benwtrent commented on code in PR #13553: URL: https://github.com/apache/lucene/pull/13553#discussion_r1669465330

## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
## @@ -299,9 +299,7 @@ public void finish() throws IOException {
   @Override
   public long ramBytesUsed() {
     long total = SHALLOW_RAM_BYTES_USED;
-    for (FieldWriter field : fields) {
-      total += field.ramBytesUsed();
-    }
+    total += rawVectorDelegate.ramBytesUsed();

Review Comment:
   The rawVectorDelegate accounts for its field writer estimates, which in turn take the scalar quantized writer's estimates into account.
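A minimal sketch of the accounting rule these two review comments describe, with hypothetical class names (not the real Lucene classes): each writer reports its own shallow footprint plus its delegate, and only the delegate owns the per-field buffers, so looping over the fields again on top of it would double count.

```java
// Hypothetical sketch of delegated RAM accounting; names are invented.
class RawVectorsWriter {
  static final long SHALLOW_RAM_BYTES_USED = 32; // placeholder shallow estimate
  private long bufferedVectorBytes;              // grows as raw float[] vectors arrive

  long ramBytesUsed() {
    // The delegate is the single owner of the per-field buffer accounting.
    return SHALLOW_RAM_BYTES_USED + bufferedVectorBytes;
  }
}

class QuantizedVectorsWriter {
  static final long SHALLOW_RAM_BYTES_USED = 32; // placeholder shallow estimate
  private final RawVectorsWriter rawVectorDelegate = new RawVectorsWriter();

  long ramBytesUsed() {
    // No extra loop over per-field writers here: adding them again on top of the
    // delegate's figure is exactly the double counting the diffs above remove.
    return SHALLOW_RAM_BYTES_USED + rawVectorDelegate.ramBytesUsed();
  }
}
```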
Re: [PR] Reduce heap usage for knn index writers [lucene]
benwtrent commented on PR #13538: URL: https://github.com/apache/lucene/pull/13538#issuecomment-2215637537 Related: https://github.com/apache/lucene/pull/13553
Re: [PR] Fix quantized vector writer ram estimates [lucene]
benwtrent commented on PR #13553: URL: https://github.com/apache/lucene/pull/13553#issuecomment-2215645063 @gautamworah96 @msokolov this might be part of the reason for the OOMs: the estimates were completely ignoring the float[] vector sizes for the field writers 🤦. I plan on iterating on this fix first, then proceeding with https://github.com/apache/lucene/pull/13538. The other PR helps reduce heap usage slightly, but not by much. Honestly, the issue is that we just weren't tracking the usage correctly in the estimate.
Re: [PR] Use `IndexInput#prefetch` for terms dictionary lookups. [lucene]
github-actions[bot] commented on PR #13359: URL: https://github.com/apache/lucene/pull/13359#issuecomment-2215694348 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [I] WordBreakSpellChecker.generateBreakUpSuggestions() should do breadth first search [lucene]
hossman commented on issue #12100: URL: https://github.com/apache/lucene/issues/12100#issuecomment-2215985880 I realized today that I had been working on branch_9x, so I've updated the patch to apply cleanly to main: [WordBreakSpellChecker.breadthfirst.GH-12100.patch.txt](https://github.com/user-attachments/files/16135858/WordBreakSpellChecker.breadthfirst.GH-12100.patch.txt)
Re: [PR] Binary search all terms. [lucene]
vsop-479 commented on PR #13192: URL: https://github.com/apache/lucene/pull/13192#issuecomment-2216245652 Or we could supply this (maybe only for `non-allEqual` leaf blocks) as an option, so users can enable it when their applications are not busy.
Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index [lucene]
iverase commented on code in PR #13550: URL: https://github.com/apache/lucene/pull/13550#discussion_r1669846499

## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseDocValuesFormatTestCase.java:
## @@ -157,6 +158,7 @@ public void testNumberMergeAwayAllValuesWithSkipper() throws IOException {
   Analyzer analyzer = new MockAnalyzer(random());
   IndexWriterConfig iwconfig = newIndexWriterConfig(analyzer);
   iwconfig.setMergePolicy(newLogMergePolicy());
+  iwconfig.setCodec(SkipperCodec.randomInstance(random()));

Review Comment:
   sure
Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index [lucene]
iverase commented on code in PR #13550: URL: https://github.com/apache/lucene/pull/13550#discussion_r1669846808

## lucene/test-framework/src/java/org/apache/lucene/tests/codecs/skipper/SkipperCodec.java:
## @@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.tests.codecs.skipper;
+
+import java.util.Random;
+import org.apache.lucene.codecs.DocValuesFormat;
+import org.apache.lucene.codecs.FilterCodec;
+import org.apache.lucene.codecs.lucene90.Lucene90DocValuesFormat;
+import org.apache.lucene.tests.util.TestUtil;
+
+/**
+ * A codec that uses {@link Lucene90DocValuesFormat} with a random skip index size for its doc
+ * values and delegates to the default codec for everything else.
+ */
+public final class SkipperCodec extends FilterCodec {

Review Comment:
   right, I was overcomplicating it
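Since the quoted diff cuts off at the class declaration, here is a self-contained sketch of how the FilterCodec pattern it uses typically continues (the class name is invented and the real SkipperCodec's random skip-interval wiring is not shown in this thread): delegate everything to the default codec and override only the doc values format.

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.lucene90.Lucene90DocValuesFormat;

/**
 * Minimal sketch of the FilterCodec pattern: everything delegates to the default
 * codec except the doc values format, which is pinned to a specific instance.
 */
public final class DocValuesOverrideCodec extends FilterCodec {
  private final DocValuesFormat dvFormat = new Lucene90DocValuesFormat();

  public DocValuesOverrideCodec() {
    // FilterCodec takes a name and the delegate that serves all other formats.
    super("DocValuesOverrideCodec", Codec.getDefault());
  }

  @Override
  public DocValuesFormat docValuesFormat() {
    return dvFormat;
  }
}
```

A test would then install such a codec via IndexWriterConfig#setCodec, as the diff in the earlier review comment shows.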
Re: [PR] Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index [lucene]
iverase commented on PR #13550: URL: https://github.com/apache/lucene/pull/13550#issuecomment-2216740642 @jpountz I added a new test and changed the title and description of the issue, as we don't need to add a new Codec.