Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
javanna commented on code in PR #13472:
URL: https://github.com/apache/lucene/pull/13472#discussion_r1665263601

## lucene/core/src/test/org/apache/lucene/search/TestTaskExecutor.java ##
@@ -251,14 +271,15 @@ public void testInvokeAllDoesNotLeaveTasksBehind() {
     for (int i = 0; i < tasksWithNormalExit; i++) {
       callables.add(
           () -> {
-            tasksExecuted.incrementAndGet();
-            return null;
+            throw new AssertionError(
+                "must not be called since the first task failing cancels all subsequent tasks");
           });
     }
     expectThrows(RuntimeException.class, () -> taskExecutor.invokeAll(callables));
     assertEquals(1, tasksExecuted.get());
     // the callables are technically all run, but the cancelled ones will be no-op
-    assertEquals(100, tasksStarted.get());
+    // add one for the task that gets executed on the current thread

Review Comment:
Right, the assertion error, thanks. That makes sense.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
javanna commented on code in PR #13472:
URL: https://github.com/apache/lucene/pull/13472#discussion_r1665266009

## lucene/CHANGES.txt ##
@@ -277,6 +277,15 @@ Optimizations
 * GITHUB#12941: Don't preserve auxiliary buffer contents in LSBRadixSorter if it grows. (Stefan Vodita)

+Changes in runtime behavior
+---------------------------
+
+* GITHUB#13472: When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the
+  thread that invoked a search as well as its configured executor. Users should reduce the executor's thread-count by 1
+  to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor
+  that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no
+  longer required.

Review Comment:
Can you add your name at the end please? You should totally take credit for your contributions!
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
original-brownbear commented on code in PR #13472:
URL: https://github.com/apache/lucene/pull/13472#discussion_r1665341009

## lucene/CHANGES.txt ##
@@ -277,6 +277,15 @@ Optimizations
 * GITHUB#12941: Don't preserve auxiliary buffer contents in LSBRadixSorter if it grows. (Stefan Vodita)

+Changes in runtime behavior
+---------------------------
+
+* GITHUB#13472: When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the
+  thread that invoked a search as well as its configured executor. Users should reduce the executor's thread-count by 1
+  to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor
+  that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no
+  longer required.

Review Comment:
Thanks Luca, name added! :)
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
javanna merged PR #13472:
URL: https://github.com/apache/lucene/pull/13472
[PR] Add XNOR in FixedBitSet. [lucene]
vsop-479 opened a new pull request, #13540:
URL: https://github.com/apache/lucene/pull/13540

### Description

Trying to get `softLiveDocs` from `liveDocs` and `hardLiveDocs` by using XNOR.
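For illustration, the XNOR identity behind this proposal can be sketched on raw 64-bit words (a hedged sketch, not the proposed FixedBitSet code): since `liveDocs = hardLiveDocs AND softLiveDocs`, computing `~(live ^ hardLive)` recovers the soft-live bit for every hard-live doc, while hard-deleted docs come out as 1 and must be treated as "don't care" by callers.

```java
public class XnorSoftLiveDocs {
  // Recover soft-live bits from liveDocs and hardLiveDocs, one word at a time.
  // For hard-live docs this is exact; hard-deleted docs yield 1 ("don't care").
  static long softLiveWord(long liveWord, long hardLiveWord) {
    return ~(liveWord ^ hardLiveWord); // XNOR
  }

  public static void main(String[] args) {
    long hardLive = 0b1101L; // docs 0, 2, 3 hard-live; doc 1 hard-deleted
    long softLive = 0b0111L; // docs 0, 1, 2 soft-live; doc 3 soft-deleted
    long live = hardLive & softLive; // 0b0101: docs 0 and 2 fully live
    long derived = softLiveWord(live, hardLive) & 0b1111L;
    // docs 0 and 2 correctly soft-live; doc 1 reported 1 only because it is
    // hard-deleted; doc 3 correctly soft-deleted
    System.out.println(Long.toBinaryString(derived)); // prints "111"
  }
}
```

The upside of the word-wise XNOR is that it vectorizes trivially, which is presumably why it is attractive compared to wrapping live docs per document.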
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
javanna commented on PR #13472:
URL: https://github.com/apache/lucene/pull/13472#issuecomment-2208500325

Thanks @original-brownbear !
[PR] [9.x] TaskExecutor should not fork unnecessarily (#13472) [lucene]
javanna opened a new pull request, #13541:
URL: https://github.com/apache/lucene/pull/13541

When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the thread that invoked a search as well as its configured executor. Users should reduce the executor's thread-count by 1 to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no longer required.

Previously, a separate executor was required to prevent deadlock, and all the tasks were offloaded to it unconditionally, wasting resources in some scenarios due to unnecessary forking, and the caller thread having to wait for all tasks to be completed anyway. It can now actively contribute to the execution as well.
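As a rough illustration of the deadlock-free "caller contributes" idea (a simplified sketch, not Lucene's actual TaskExecutor code): every task is wrapped in a FutureTask and offered to the executor, then the calling thread invokes run() on each one itself. FutureTask.run() is a no-op once a task has been claimed, so each task executes exactly once and the caller can never block forever on a saturated executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

public class CallerContributes {
  // Sketch: offload every task, then let the caller race the pool for any
  // task that has not started yet. FutureTask guarantees exactly-once runs.
  static <T> List<T> invokeAll(ExecutorService executor, List<Callable<T>> callables)
      throws Exception {
    List<FutureTask<T>> tasks = new ArrayList<>();
    for (Callable<T> c : callables) tasks.add(new FutureTask<>(c));
    for (FutureTask<T> t : tasks) executor.execute(t);
    for (FutureTask<T> t : tasks) t.run(); // caller thread contributes; no-op if claimed
    List<T> results = new ArrayList<>();
    for (FutureTask<T> t : tasks) results.add(t.get()); // never deadlocks
    return results;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    List<Callable<Integer>> work = new ArrayList<>();
    for (int i = 0; i < 4; i++) {
      int n = i;
      work.add(() -> n * n);
    }
    System.out.println(invokeAll(pool, work)); // prints [0, 1, 4, 9]
    pool.shutdown();
  }
}
```

This also shows why callers can now share the searcher's executor: even if the pool is fully busy, the calling thread drains the remaining tasks itself.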
Re: [PR] Only search soft deleted in SoftDeletesRetentionMergePolicy.applyRetentionQuery [lucene]
vsop-479 commented on PR #13536:
URL: https://github.com/apache/lucene/pull/13536#issuecomment-2208519800

I think we can get `softLiveDocs` from `liveDocs` and `hardLiveDocs` by using XNOR (https://github.com/apache/lucene/pull/13540), and use it to replace `wrapLiveDocs`. Or we can remove `wrapLiveDocs` if it is useless.
Re: [PR] [9.x] TaskExecutor should not fork unnecessarily (#13472) [lucene]
javanna merged PR #13541:
URL: https://github.com/apache/lucene/pull/13541
Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]
benwtrent commented on issue #13519:
URL: https://github.com/apache/lucene/issues/13519#issuecomment-2208862938

@naveentatikonda this is a WIP :).

> Why are we setting SIGNED_CORRECTION as 127 instead of 128?

My logic here was scaling by what we will round by. But thinking about it, I think we should scale down by 128.

> ...the min should be -128

I disagree. We want `-127, 127`. Otherwise we have to handle the exotic `-128*-128` case when doing vector comparisons, and this disallows some nice SIMD optimizations down the road.

> In this new code we are not adding this...

This is a WIP... I didn't figure out yet how to incorporate this rounding error, as it assumes unsigned operations and has a knock-on effect on the buckets. So, simply adding the naive rounding error would reduce recall significantly. I am wondering how to apply it. Will report back if I can figure it out :).
Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]
benwtrent commented on issue #13519:
URL: https://github.com/apache/lucene/issues/13519#issuecomment-2208933244

I pushed a change to my branch.

- Scales by 128
- Rounds to between -127, 127
- Applies a modification of the rounding error compensation.

The idea behind how I applied the different compensation now is that we need to handle the linear scaling, as this can cause us to snap to different buckets. I am still not 100% sure my transformation is correct there.

Recall over the same data is now: `0.877`.
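A minimal sketch of the clamping idea under discussion (hypothetical code, not the actual Lucene ScalarQuantizer): map floats linearly onto the signed byte range while deliberately excluding -128, so that vector comparisons never have to handle the exotic `-128*-128` case.

```java
public class SignedQuantizeSketch {
  // Map a float in [minQuantile, maxQuantile] onto [-127, 127].
  // Using 254 buckets (not 255) keeps -128 out of the output range.
  static byte quantize(float v, float minQuantile, float maxQuantile) {
    float clamped = Math.max(minQuantile, Math.min(maxQuantile, v));
    float unit = (clamped - minQuantile) / (maxQuantile - minQuantile); // in [0, 1]
    return (byte) (Math.round(unit * 254f) - 127); // in [-127, 127]
  }

  public static void main(String[] args) {
    System.out.println(quantize(-1f, -1f, 1f)); // prints -127
    System.out.println(quantize(0f, -1f, 1f));  // prints 0
    System.out.println(quantize(1f, -1f, 1f));  // prints 127
  }
}
```

Note this sketch omits the rounding-error compensation discussed above, which is exactly the part still being worked out on the branch.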
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
ChrisHegarty commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2208974840

I updated clone and slice. It finds some issues. These look like incorrect usage of READONCE.
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
ChrisHegarty commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2208977180

... there are still some intermittent failures because of this. I'll track them down so that we can discuss.
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on code in PR #13535:
URL: https://github.com/apache/lucene/pull/13535#discussion_r1665740237

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java ##
@@ -45,7 +45,8 @@ public IndexInput openInput(Path path, IOContext context, int chunkSizePower, bo
     path = Unwrappable.unwrapAll(path);

     boolean success = false;
-    final Arena arena = Arena.ofShared();
+    final boolean confined = context == IOContext.READONCE;
+    final Arena arena = context == IOContext.READONCE ? Arena.ofConfined() : Arena.ofShared();

Review Comment:
Use the confined boolean here, too.
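For context, a small demo of what a confined Arena buys here (an illustration, not Lucene code; assumes Java 22+, where the foreign memory API is final): segments allocated from `Arena.ofConfined()` can only be accessed and closed by the owning thread, which is exactly the single-thread discipline a READONCE input is expected to follow.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ConfinedArenaDemo {
  public static void main(String[] args) throws InterruptedException {
    // A confined arena's segments are owned by the creating thread;
    // any access from another thread fails with WrongThreadException.
    try (Arena arena = Arena.ofConfined()) {
      MemorySegment seg = arena.allocate(16);
      seg.set(ValueLayout.JAVA_INT, 0, 42);
      Thread other = new Thread(() -> {
        try {
          seg.get(ValueLayout.JAVA_INT, 0);
          System.out.println("unexpected: cross-thread read succeeded");
        } catch (WrongThreadException expected) {
          System.out.println("cross-thread access rejected");
        }
      });
      other.start();
      other.join();
    } // close() must also happen on the owning thread
  }
}
```

This is why clones (used across threads) break under a confined arena while same-thread slices can remain legal.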
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2209054786

What intermittent failures did you see?
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2209086060

To me those problems where a CFS file was opened with READONCE look like a bug. It is also unlikely that the CFS file is just opened once and then closed again.

If we want to keep the CFS file support, we should at least allow slices of IndexInput. Whenever you create clones it's a bug, because those are used in different threads, but slices may be acceptable (e.g. for CFS files).
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2209087721

> To me those problems where a CFS file was opened with READONCE look like a bug. It is also unlikely that the CFS file is just opened once and then closed again.
>
> If we want to keep the CFS file support, we should at least allow slices of IndexInput. Whenever you create clones it's a bug, because those are used in different threads, but slices may be acceptable (e.g. for CFS files).

Maybe the pending deletes stuff really only wants to open the CFS file to read deleted docs, but this still looks strange to me.
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2209152707

Hi, thinking about this some more: maybe we should for now just disallow clones and not slices. Slices are needed for CFS files.

But nevertheless, the code affected by the change seems to have some shortcomings. Why does it need to open a CFS file only to read deletes from it? Shouldn't the CFS file already be open, given that the IndexSearcher is open already? Memory-mapping a huge amount of data just to read a few kilobytes looks wrong to me.
[PR] WIP: draft of intra segment concurrency [lucene]
javanna opened a new pull request, #13542:
URL: https://github.com/apache/lucene/pull/13542

I experimented with introducing intra-segment concurrency in Lucene, by leveraging the existing `Scorer#score` method that takes a range of doc ids as argument, and switching the searcher to call that to search a partition of a leaf. I introduced a `LeafReaderContextPartition` abstraction that wraps `LeafReaderContext` and the range of doc ids that the partition targets. A slice now points to specific partitions of segments, identified by the range of doc ids.

I have found a couple of challenging problems (thanks to test failures) that I have solved / worked around the best ways I could at the time, but I am sure the current solutions are not yet good enough to merge. They work but need refining. They are good to highlight the problems I encountered:

1) IndexSearcher#count / TotalHitCountCollector rely on `Weight#count(LeafReaderContext)`, which now gets called multiple times against the same leaf and leads to excessive counting of hits. Resolved by synchronizing the call to `getLeafCollector`, to make sure we don't pull a leaf collector for the same segment multiple times in parallel, and marking a leaf early terminated when `getLeafCollector` throws a `CollectionTerminatedException`, so that we skip that same leaf in subsequent calls that refer to it.

2) LRUQueryCache caches the return value of `Weight#count`. When we execute the same query against the same segment multiple times (as part of a single search call), the first time we do the actual counting for the docs that the first partition holds, and subsequent times we should do the same: count hits in each partition of the same segment instead of retrieving the count from the cache. The right thing to do is to only get from the cache the first time we execute against a certain segment as part of a certain search operation. Subsequent executions within the same search that target the same segment should not go through the cache. If we did not go through the cache for the first occurrence of a certain segment, it was not cached. This is dealt with in LRUQueryCache.CachingWrapperWeight#count for now, keeping track of which leaf contexts are seen.

3) CheckHits verifies matches and may go outside of the bounds of the doc id range of the current slice.

These are more or less the changes I made step by step:

- Added LeafReaderContextPartition abstraction that holds LeafReaderContext + the range of doc ids it targets. A slice points now to specific subsets of segments, identified by their corresponding range of doc ids.
- Introduced an additional IndexSearcher#search method that takes LeafReaderContextPartition[] instead of LeafReaderContext[], which calls `scorer.score(leafCollector, ctx.reader().getLiveDocs(), slice.minDocId, slice.maxDocId);`, providing the range of doc ids in place of `scorer.score(leafCollector, ctx.reader().getLiveDocs());`, which would score all documents.
- Added overrides for the new protected IndexSearcher#search to subclasses that require it.
- Introduced wrapping of `LeafReaderContext` at the beginning of each search, to share state between slices that target the same segment (to solve the early termination issues described above).
- Hacked IndexSearcher#getSlices and LuceneTestCase to generate as many slices as possible, as well as set an executor as often as possible in tests, in order to get test coverage and find issues.

There are still a couple of rough edges: some test failures that need to be investigated to see if they are test problems or actual bugs. I hacked the way we generate slices to produce single-document slices, but that produces way too many slices in some situations, which causes test failures due to OOM, or too much cloning happening.

I am looking for early feedback: does the technical approach make sense? Would you do it entirely differently? How do we solve the problems I encountered described above? Can I get help on tests that need investigation?

Relates to #9721
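The partitioning idea can be illustrated with a toy sketch (hypothetical names, not the PR's actual LeafReaderContextPartition class): split a segment's doc id space `[0, maxDoc)` into contiguous ranges that can then be scored independently via `Scorer#score(collector, liveDocs, minDocId, maxDocId)`.

```java
import java.util.ArrayList;
import java.util.List;

public class LeafPartitionSketch {
  // A half-open doc id range [minDocId, maxDocId) within one segment.
  record Partition(int minDocId, int maxDocId) {}

  // Split [0, maxDoc) into contiguous partitions of at most maxDocsPerPartition docs.
  static List<Partition> partition(int maxDoc, int maxDocsPerPartition) {
    List<Partition> partitions = new ArrayList<>();
    for (int min = 0; min < maxDoc; min += maxDocsPerPartition) {
      partitions.add(new Partition(min, Math.min(maxDoc, min + maxDocsPerPartition)));
    }
    return partitions;
  }

  public static void main(String[] args) {
    // A 10-doc segment split into partitions of at most 4 docs each.
    System.out.println(partition(10, 4));
    // prints [Partition[minDocId=0, maxDocId=4], Partition[minDocId=4, maxDocId=8], Partition[minDocId=8, maxDocId=10]]
  }
}
```

Each partition would then become part of a slice and be searched concurrently, which is what makes per-segment state (counts, caching, early termination) the tricky part described above.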
Re: [I] Decouple within-query concurrency from the index's segment geometry [LUCENE-8675] [lucene]
javanna commented on issue #9721:
URL: https://github.com/apache/lucene/issues/9721#issuecomment-2209168739

I opened an initial draft of my take on intra-segment concurrency (#13542). It needs quite a bit of work and discussion, but I hope it helps as a start, hopefully getting intra-segment concurrency into Lucene 10; one can always hope :)
Re: [PR] WIP: draft of intra segment concurrency [lucene]
javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1665816216

## lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java ##
@@ -362,6 +362,9 @@ public long cost() {
       final IntersectVisitor visitor = getIntersectVisitor(result);
       long cost = -1;
+      // maybe allow calling get multiple times, or clone the scorer to avoid duplication
+      // across multiple slices that point to the same segment

Review Comment:
This is not a requirement, I am thinking it could be a follow-up.
Re: [PR] WIP: draft of intra segment concurrency [lucene]
javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1665818254

## lucene/core/src/test/org/apache/lucene/index/TestSegmentToThreadMapping.java ##
@@ -160,83 +155,133 @@ public CacheHelper getReaderCacheHelper() {
     };
   }

-  public void testSingleSlice() {
-    LeafReader largeSegmentReader = dummyIndexReader(50_000);
-    LeafReader firstMediumSegmentReader = dummyIndexReader(30_000);
-    LeafReader secondMediumSegmentReader = dummyIndexReader(30_000);
-    LeafReader thirdMediumSegmentReader = dummyIndexReader(30_000);
+  private static List<LeafReaderContext> createLeafReaderContexts(int... maxDocs) {
     List<LeafReaderContext> leafReaderContexts = new ArrayList<>();
+    for (int maxDoc : maxDocs) {
+      leafReaderContexts.add(new LeafReaderContext(dummyIndexReader(maxDoc)));
+    }
+    Collections.shuffle(leafReaderContexts, random());
+    return leafReaderContexts;
+  }

-    leafReaderContexts.add(new LeafReaderContext(largeSegmentReader));
-    leafReaderContexts.add(new LeafReaderContext(firstMediumSegmentReader));
-    leafReaderContexts.add(new LeafReaderContext(secondMediumSegmentReader));
-    leafReaderContexts.add(new LeafReaderContext(thirdMediumSegmentReader));
-
-    IndexSearcher.LeafSlice[] resultSlices = IndexSearcher.slices(leafReaderContexts, 250_000, 5);
-
-    assertTrue(resultSlices.length == 1);
-
-    final LeafReaderContext[] leaves = resultSlices[0].leaves;
+  @AwaitsFix(bugUrl = "restore!")

Review Comment:
These tests are disabled only due to the hack that forces slices with a single document to increase test coverage. Otherwise they are fine and up-to-date.
Re: [PR] WIP: draft of intra segment concurrency [lucene]
javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1665818950

## lucene/core/src/test/org/apache/lucene/search/TestSortRandom.java ##
@@ -119,7 +119,8 @@ private void testRandomStringSort(SortField.Type type) throws Exception {
       System.out.println("  reader=" + r);
     }

-    final IndexSearcher s = newSearcher(r, false);
+    // TODO this needs investigation, it fails with concurrency enabled (seed: 393E6798D70D742D)
+    final IndexSearcher s = newSearcher(r, false, random.nextBoolean(), false);

Review Comment:
This needs attention, I was not yet able to figure out the reason for the failure.
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2209186177

Thanks, looks fine now. I think we can revert the changes to the SegmentReader and look into this in another issue. It still looks strange to me, but I had no time to look closely into the code of the parts which used READONCE on CFS files, that you now changed to DEFAULT.
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on code in PR #13535:
URL: https://github.com/apache/lucene/pull/13535#discussion_r1665828693

## lucene/core/src/java/org/apache/lucene/index/ReadersAndUpdates.java ##
@@ -578,7 +578,8 @@ public synchronized boolean writeFieldUpdates(
       // IndexWriter.commitMergedDeletes).
       final SegmentReader reader;
       if (this.reader == null) {
-        reader = new SegmentReader(info, indexCreatedVersionMajor, IOContext.READONCE);
+        IOContext context = info.info.getUseCompoundFile() ? IOContext.DEFAULT : IOContext.READONCE;

Review Comment:
This looks correct with READONCE, as we close the segment reader down there, so it is only open for a short time and we never look into it from different threads. So I think we can revert that change and use READONCE. It is still a bit expensive to mmap a huge CFS file just to read a few bytes.

## lucene/core/src/java/org/apache/lucene/index/PendingSoftDeletes.java ##
@@ -232,7 +232,7 @@ private FieldInfos readFieldInfos() throws IOException {
         segInfo
             .getCodec()
             .compoundFormat()
-            .getCompoundReader(segInfo.dir, segInfo, IOContext.READONCE);
+            .getCompoundReader(segInfo.dir, segInfo, IOContext.DEFAULT);

Review Comment:
READONCE is also fine here, because we only open the CFS file for a very short time and close it down there. So, as said before, this is a bit of overhead to mmap a large CFS file just to read field infos. We should think about this a bit in a separate issue. It is good you caught it.

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java ##
@@ -53,6 +53,7 @@ abstract class MemorySegmentIndexInput extends IndexInput
   final long length;
   final long chunkSizeMask;
   final int chunkSizePower;
+  final boolean confined;

Review Comment:
We could also pass down the IOContext for more flexibility in the future. For now a single boolean is fine!
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on PR #13535:
URL: https://github.com/apache/lucene/pull/13535#issuecomment-2209201436

> Thanks, looks fine now. I think we can revert the changes to the SegmentReader and look into this in another issue. It still looks strange to me, but I had no time to look closely into the code of the parts which used READONCE on CFS files, that you now changed to DEFAULT.
>
> If the code opens the CFS file and closes it later, using READONCE is fine, but not in general to open such a file.

I reviewed this. We can undo the IOContext#DEFAULT changes for the CFS files.
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on code in PR #13535:
URL: https://github.com/apache/lucene/pull/13535#discussion_r1665836026

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java ##
@@ -591,15 +611,17 @@ MemorySegmentIndexInput buildSlice(String sliceDescription, long offset, long le
           null, // clones don't have an Arena, as they can't close
           slices[0].asSlice(offset, length),
           length,
-          chunkSizePower);
+          chunkSizePower,
+          confined);

Review Comment:
Passing `confined` here for the slice looks correct, because this would disallow clones of possible slices. +1
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
ChrisHegarty commented on code in PR #13535: URL: https://github.com/apache/lucene/pull/13535#discussion_r1665858393 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -591,15 +611,17 @@ MemorySegmentIndexInput buildSlice(String sliceDescription, long offset, long le null, // clones don't have an Arena, as they can't close) slices[0].asSlice(offset, length), length, - chunkSizePower); + chunkSizePower, + confined); Review Comment: I added a test for this.
Re: [PR] WIP: draft of intra segment concurrency [lucene]
stefanvodita commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1665869563 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -328,42 +336,65 @@ protected LeafSlice[] slices(List leaves) { /** Static method to segregate LeafReaderContexts amongst multiple slices */ public static LeafSlice[] slices( List leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) { + +// TODO this is a temporary hack to force testing against multiple leaf reader context slices. +// It must be reverted before merging. +maxDocsPerSlice = 1; +maxSegmentsPerSlice = 1; +// end hack + // Make a copy so we can sort: List sortedLeaves = new ArrayList<>(leaves); // Sort by maxDoc, descending: -Collections.sort( -sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc(; +sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc(; -final List> groupedLeaves = new ArrayList<>(); -long docSum = 0; -List group = null; +final List> groupedLeafPartitions = new ArrayList<>(); +int currentSliceNumDocs = 0; +List group = null; for (LeafReaderContext ctx : sortedLeaves) { if (ctx.reader().maxDoc() > maxDocsPerSlice) { assert group == null; -groupedLeaves.add(Collections.singletonList(ctx)); +// if the segment does not fit in a single slice, we split it in multiple partitions of Review Comment: Thank you for moving forward on this issue @javanna! I had a different strategy in mind for slicing the index. With the current implementation, we deduce the number of slices based on a given per-slice doc count. What if the number of slices was given instead? Each segment would not be divided into n slices, but a slice could straddle over a few segments. I think it gives us better control over the concurrency and leaves us with fewer slices per segment while achieving the same level of concurrency, which is better because it reduces the fixed cost of search that we pay per slice per segment. 
Are there challenges that make that hard to implement?
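The alternative strategy proposed in the comment above — fixing the number of slices up front and letting a slice straddle several segments — can be sketched as a greedy assignment of segments to the currently smallest slice. This is a toy illustration only, not Lucene's `IndexSearcher.slices` implementation; `FixedCountSlicer`, `targetSlices`, and `selfTest` are hypothetical names.

```java
import java.util.*;

/** Toy sketch: distribute segments across a fixed number of slices. */
public class FixedCountSlicer {

  /** Returns slice -> list of segment ids; sizes are per-segment maxDoc counts. */
  public static List<List<Integer>> slices(int[] maxDocs, int targetSlices) {
    Integer[] order = new Integer[maxDocs.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    // Largest segments first keeps the greedy assignment balanced.
    Arrays.sort(order, (a, b) -> Integer.compare(maxDocs[b], maxDocs[a]));

    // Min-heap of {sliceId, accumulated doc count}.
    PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt(s -> s[1]));
    List<List<Integer>> result = new ArrayList<>();
    for (int s = 0; s < targetSlices; s++) {
      heap.add(new int[] {s, 0});
      result.add(new ArrayList<>());
    }
    for (int seg : order) {
      int[] smallest = heap.poll();      // slice with fewest docs so far
      result.get(smallest[0]).add(seg);  // a slice may straddle several segments
      smallest[1] += maxDocs[seg];
      heap.add(smallest);
    }
    return result;
  }

  public static boolean selfTest() {
    List<List<Integer>> s = slices(new int[] {100, 90, 10, 10}, 2);
    // Two slices, and all four segments assigned exactly once.
    return s.size() == 2 && s.get(0).size() + s.get(1).size() == 4;
  }

  public static void main(String[] args) {
    if (!selfTest()) throw new AssertionError("selfTest failed");
  }
}
```

Unlike the per-slice doc budget, the slice count here directly bounds the achievable parallelism, which is the control the comment asks about.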
Re: [PR] WIP: draft of intra segment concurrency [lucene]
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1665874144 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -328,42 +336,65 @@ protected LeafSlice[] slices(List leaves) { /** Static method to segregate LeafReaderContexts amongst multiple slices */ public static LeafSlice[] slices( List leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) { + +// TODO this is a temporary hack to force testing against multiple leaf reader context slices. +// It must be reverted before merging. +maxDocsPerSlice = 1; +maxSegmentsPerSlice = 1; +// end hack + // Make a copy so we can sort: List sortedLeaves = new ArrayList<>(leaves); // Sort by maxDoc, descending: -Collections.sort( -sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc(; +sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc(; -final List> groupedLeaves = new ArrayList<>(); -long docSum = 0; -List group = null; +final List> groupedLeafPartitions = new ArrayList<>(); +int currentSliceNumDocs = 0; +List group = null; for (LeafReaderContext ctx : sortedLeaves) { if (ctx.reader().maxDoc() > maxDocsPerSlice) { assert group == null; -groupedLeaves.add(Collections.singletonList(ctx)); +// if the segment does not fit in a single slice, we split it in multiple partitions of Review Comment: I honestly don't even remember how I divided segments into multiple slices. That's subject to change, and I'd expect it to be the last detail to look at. First, make search across segment partitions work.
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
ChrisHegarty commented on code in PR #13535: URL: https://github.com/apache/lucene/pull/13535#discussion_r1665875714 ## lucene/core/src/java/org/apache/lucene/index/ReadersAndUpdates.java: ## @@ -578,7 +578,8 @@ public synchronized boolean writeFieldUpdates( // IndexWriter.commitMergedDeletes). final SegmentReader reader; if (this.reader == null) { -reader = new SegmentReader(info, indexCreatedVersionMajor, IOContext.READONCE); +IOContext context = info.info.getUseCompoundFile() ? IOContext.DEFAULT : IOContext.READONCE; Review Comment: Reverting this results in the following failure:
```
java.lang.IllegalStateException: confined
  at __randomizedtesting.SeedInfo.seed([E35616674660C0C9:EB8233FC2A66214D]:0)
  at org.apache.lucene.store.MemorySegmentIndexInput.ensureNotConfined(MemorySegmentIndexInput.java:108)
  at org.apache.lucene.store.MemorySegmentIndexInput.clone(MemorySegmentIndexInput.java:539)
  at org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl.clone(MemorySegmentIndexInput.java:661)
  at org.apache.lucene.tests.store.MockIndexInputWrapper.clone(MockIndexInputWrapper.java:94)
  at org.apache.lucene.tests.store.MockIndexInputWrapper.clone(MockIndexInputWrapper.java:30)
  at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:607)
  at org.apache.lucene.codecs.blocktreeords.OrdsBlockTreeTermsReader.<init>(OrdsBlockTreeTermsReader.java:100)
  at org.apache.lucene.tests.codecs.mockrandom.MockRandomPostingsFormat.fieldsProducer(MockRandomPostingsFormat.java:379)
  at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:328)
  at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:390)
  at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:98)
  at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
  at org.apache.lucene.index.ReadersAndUpdates.writeFieldUpdates(ReadersAndUpdates.java:581)
  at org.apache.lucene.index.ReaderPool.commit(ReaderPool.java:353)
  at org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3990)
  at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3692)
  at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4122)
  at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1331)
  at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1369)
  at org.apache.lucene.util.IOUtils.close(IOUtils.java:85)
  at org.apache.lucene.util.IOUtils.close(IOUtils.java:72)
  at org.apache.lucene.index.TestNumericDocValuesUpdates.testMultipleUpdatesSameDoc(TestNumericDocValuesUpdates.java:116)
  ...
```
`CodecUtil.checksumEntireFile` is cloning. It could just slice the whole input.
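The distinction behind the stack trace above can be illustrated with a toy model (not the Lucene classes themselves): a READONCE input is confined to its opening thread, so `clone()` is rejected outright — as `MemorySegmentIndexInput.ensureNotConfined` does — while a slice over the full range stays tied to the owner and remains legal. `ConfinedInputModel` and its methods are hypothetical names; this is a hedged reading of the thread, not Lucene's implementation.

```java
/** Toy model of a "confined" input: clone() is rejected, a full-range slice() is not. */
public class ConfinedInputModel {
  final boolean confined;
  final long length;

  ConfinedInputModel(boolean confined, long length) {
    this.confined = confined;
    this.length = length;
  }

  ConfinedInputModel cloneInput() {
    // A clone of a confined input could escape the owning thread, so it is
    // disallowed outright (the IllegalStateException seen in the stack trace).
    if (confined) throw new IllegalStateException("confined");
    return new ConfinedInputModel(false, length);
  }

  ConfinedInputModel slice(long offset, long len) {
    // A slice inherits the confinement and stays with the owner, so it is legal.
    return new ConfinedInputModel(confined, len);
  }

  /** Checksum-style traversal: prefer slice(0, length) over cloneInput(). */
  static long checksumEntireFile(ConfinedInputModel in) {
    ConfinedInputModel whole = in.slice(0, in.length); // works for READONCE inputs
    return whole.length; // stand-in for the real checksum loop
  }

  public static boolean selfTest() {
    ConfinedInputModel readOnce = new ConfinedInputModel(true, 42);
    boolean cloneRejected;
    try {
      readOnce.cloneInput();
      cloneRejected = false;
    } catch (IllegalStateException e) {
      cloneRejected = true;
    }
    return cloneRejected && checksumEntireFile(readOnce) == 42;
  }

  public static void main(String[] args) {
    if (!selfTest()) throw new AssertionError();
  }
}
```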
Re: [PR] Use a confined Arena for IOContext.READONCE [lucene]
uschindler commented on code in PR #13535: URL: https://github.com/apache/lucene/pull/13535#discussion_r1665893332 ## lucene/core/src/java/org/apache/lucene/index/ReadersAndUpdates.java: ## @@ -578,7 +578,8 @@ public synchronized boolean writeFieldUpdates( // IndexWriter.commitMergedDeletes). final SegmentReader reader; if (this.reader == null) { -reader = new SegmentReader(info, indexCreatedVersionMajor, IOContext.READONCE); +IOContext context = info.info.getUseCompoundFile() ? IOContext.DEFAULT : IOContext.READONCE; Review Comment: Hm damn. At least the other one only reading field Infos looks fine. But this should also fail if it is not a compound file...
[PR] Override single byte writes to OutputStreamIndexOutput to remove locking [lucene]
original-brownbear opened a new pull request, #13543: URL: https://github.com/apache/lucene/pull/13543 Single-byte writes to BufferedOutputStream show up pretty hot in indexing benchmarks. We can save the locking overhead introduced by JEP 374 by overriding and providing a no-lock fast path. I didn't benchmark this with the Lucene benchmarks, but there's this https://gist.github.com/shipilev/71fd881f7cbf3f87028bf87385e84889 and the issue is somewhat obvious looking at the profiling for the nightlies, I believe.
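The idea in the PR description can be sketched with only the JDK: buffer single-byte writes in a plain array so the hot path never takes the monitor that `BufferedOutputStream#write(int)` acquires on every call (more costly since JEP 374 removed biased locking). This is an illustrative sketch, not the PR's actual `OutputStreamIndexOutput` change; class and field names are made up, and the no-lock path is safe only under the assumption that a single thread writes at a time.

```java
import java.io.*;

/** Sketch: unsynchronized single-byte buffering in front of an OutputStream. */
public class NoLockSingleByteOutput extends FilterOutputStream {
  private final byte[] buf;
  private int count;

  public NoLockSingleByteOutput(OutputStream out, int size) {
    super(out);
    buf = new byte[size];
  }

  @Override
  public void write(int b) throws IOException {
    // Fast path: plain array store, no synchronization. Assumes the output is
    // only ever used by one thread at a time.
    if (count == buf.length) {
      flushBuffer();
    }
    buf[count++] = (byte) b;
  }

  private void flushBuffer() throws IOException {
    if (count > 0) {
      out.write(buf, 0, count);
      count = 0;
    }
  }

  @Override
  public void flush() throws IOException {
    flushBuffer();
    out.flush();
  }

  public static boolean selfTest() throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (NoLockSingleByteOutput o = new NoLockSingleByteOutput(sink, 4)) {
      for (int i = 0; i < 10; i++) o.write(i); // forces two intermediate flushes
    }
    byte[] got = sink.toByteArray();
    return got.length == 10 && got[9] == 9;
  }

  public static void main(String[] args) throws IOException {
    if (!selfTest()) throw new AssertionError();
  }
}
```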
Re: [PR] Convert more classes to record classes [lucene]
shubhamvishu commented on PR #13328: URL: https://github.com/apache/lucene/pull/13328#issuecomment-2209345556 Hi @uschindler, Could you please review the current code changes once you get some time and if it looks good maybe we can move this forward?
Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]
naveentatikonda commented on issue #13519: URL: https://github.com/apache/lucene/issues/13519#issuecomment-2209645562 > I disagree. We want `-127, 127`. Otherwise we have to handle the exotic `-128*-128` case when doing vector comparisons and this disallows some nice SIMD optimizations down the road. > @benwtrent Sorry, I didn't get it. Can you please share more details or links related to this? > * Applies a modification of the rounding error compensation. The idea behind how I applied the different compensation now is that we need to handle the linear scaling as this can cause us to snap to different buckets. I am still not 100% sure my transformation is correct there. @MilindShyani Can you check and comment on this? `(dx - dxq) * (dxq - SIGNED_CORRECTION * alpha)` is the rounding error compensation added by Ben.
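On the `-127, 127` question: a plausible reading is that keeping the quantized range symmetric matters for SIMD dot products — two products of values in [-127, 127] sum to at most 2 × 127² = 32258, which still fits a signed 16-bit lane, while 2 × (-128)² = 32768 does not. A minimal sketch of such a symmetric clamp follows; this is an illustration under assumed parameter names (`alpha`, `minQuantile`, `maxQuantile`), not Lucene's `ScalarQuantizer`.

```java
/** Sketch: quantize a float to a signed byte in the symmetric range [-127, 127]. */
public class SymmetricQuantizer {
  public static byte quantize(float x, float minQuantile, float maxQuantile) {
    float alpha = (maxQuantile - minQuantile) / 254f; // 254 steps across [-127, 127]
    float scaled = (x - minQuantile) / alpha - 127f;  // map [min, max] -> [-127, 127]
    int q = Math.round(scaled);
    return (byte) Math.max(-127, Math.min(127, q));   // clamp: -128 is never produced
  }

  public static boolean selfTest() {
    byte lo = quantize(-1f, -1f, 1f);
    byte hi = quantize(1f, -1f, 1f);
    byte mid = quantize(0f, -1f, 1f);
    // Extremes land on -127/127 (never -128), midpoint lands near 0.
    return lo == -127 && hi == 127 && Math.abs(mid) <= 1;
  }

  public static void main(String[] args) {
    if (!selfTest()) throw new AssertionError();
  }
}
```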
Re: [I] WordBreakSpellChecker.generateBreakUpSuggestions() should do breadth first search [lucene]
hossman commented on issue #12100: URL: https://github.com/apache/lucene/issues/12100#issuecomment-2209648298 I was reminded of this issue recently, and worked up a patch with the improved algorithm and a new test case that shows how, even with a lot of candidate terms in the index and a lot of possible suggestions to return, it can find the suggestions with the minimal numChanges more quickly than the old depth-first approach (as demonstrated by the very small maxEvaluations used, which is not enough for the old approach to find the expected suggestion). [WordBreakSpellChecker.breadthfirst.GH-12100.patch.txt](https://github.com/user-attachments/files/16103465/WordBreakSpellChecker.breadthfirst.GH-12100.patch.txt)
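The breadth-first idea described above can be sketched as a FIFO queue over break candidates: every candidate with k breaks is evaluated before any candidate with k+1 breaks, so the minimal-numChanges suggestion is found within a small evaluation budget. This is a toy sketch, not the attached patch; a `Set` lookup stands in for the real index term check, and all names are illustrative.

```java
import java.util.*;

/** Toy breadth-first word-break search with a bounded evaluation budget. */
public class BreadthFirstWordBreak {
  public static List<String> suggest(String term, Set<String> dict, int maxEvaluations) {
    Deque<List<String>> queue = new ArrayDeque<>();
    queue.add(List.of(term));
    int evaluations = 0;
    while (!queue.isEmpty() && evaluations < maxEvaluations) {
      List<String> candidate = queue.poll();
      evaluations++;
      // Find the first fragment not in the dictionary; if none, we are done.
      int bad = -1;
      for (int i = 0; i < candidate.size(); i++) {
        if (!dict.contains(candidate.get(i))) { bad = i; break; }
      }
      if (bad == -1) return candidate;
      // Enqueue children with exactly one extra break in the first bad fragment.
      // FIFO order means all k-break candidates run before any (k+1)-break one.
      String frag = candidate.get(bad);
      for (int split = 1; split < frag.length(); split++) {
        if (!dict.contains(frag.substring(0, split))) continue;
        List<String> child = new ArrayList<>(candidate.subList(0, bad));
        child.add(frag.substring(0, split));
        child.add(frag.substring(split));
        child.addAll(candidate.subList(bad + 1, candidate.size()));
        queue.add(child);
      }
    }
    return List.of(); // budget exhausted without a suggestion
  }

  public static boolean selfTest() {
    Set<String> dict = Set.of("hol", "iday", "ho", "li", "day");
    // Both ["hol","iday"] and ["ho","li","day"] are valid; BFS finds the
    // one-break suggestion first, within a small budget.
    return suggest("holiday", dict, 50).equals(List.of("hol", "iday"));
  }

  public static void main(String[] args) {
    if (!selfTest()) throw new AssertionError();
  }
}
```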
Re: [PR] Inter-segment I/O concurrency. [lucene]
github-actions[bot] commented on PR #13509: URL: https://github.com/apache/lucene/pull/13509#issuecomment-2209661389 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Only check liveDocs is null one time in FreqProxTermsWriter.applyDeletes [lucene]
github-actions[bot] commented on PR #13506: URL: https://github.com/apache/lucene/pull/13506#issuecomment-2209661406 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Minor cleanup in some Facet tests [lucene]
github-actions[bot] commented on PR #13489: URL: https://github.com/apache/lucene/pull/13489#issuecomment-2209661435 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!