[GitHub] [lucene] LuXugang merged pull request #12153: Unrelated code in TestIndexSortSortedNumericDocValuesRangeQuery
LuXugang merged PR #12153: URL: https://github.com/apache/lucene/pull/12153 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] s1monw commented on a diff in pull request #12198: Reduce contention in DocumentsWriterFlushControl.
s1monw commented on code in PR #12198: URL: https://github.com/apache/lucene/pull/12198#discussion_r1136693360 ## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java: ## @@ -185,8 +185,12 @@ private boolean updatePeaks(long delta) { return true; } - DocumentsWriterPerThread doAfterDocument(DocumentsWriterPerThread perThread, boolean isUpdate) { + DocumentsWriterPerThread doAfterDocument(DocumentsWriterPerThread perThread) { final long delta = perThread.getCommitLastBytesUsedDelta(); +if (flushPolicy.flushOnDocCount() == false && delta < flushPolicy.flushOnRAMGranularity()) { Review Comment: ```suggestion // in order to prevent contention in the case of many threads indexing small documents // we skip ram accounting unless the DWPT accumulated enough ram to be worthwhile if (flushPolicy.flushOnDocCount() == false && delta < flushPolicy.flushOnRAMGranularity()) { ``` ## lucene/core/src/java/org/apache/lucene/index/FlushByRamOrCountsPolicy.java: ## @@ -95,14 +89,25 @@ protected void markLargestWriterPending( } } - /** - * Returns true if this {@link FlushPolicy} flushes on {@link - * IndexWriterConfig#getMaxBufferedDocs()}, otherwise false. - */ - protected boolean flushOnDocCount() { + @Override + public boolean flushOnDocCount() { return indexWriterConfig.getMaxBufferedDocs() != IndexWriterConfig.DISABLE_AUTO_FLUSH; } + @Override + public long flushOnRAMGranularity() { Review Comment: I don't think this needs to be an abstract method. I am not sure anyone users this interface at all except of tests and unless we override this method in a test I think it can be an impl detail of the only caller. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] s1monw commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
s1monw commented on code in PR #12199: URL: https://github.com/apache/lucene/pull/12199#discussion_r1136706886 ## lucene/core/src/java/org/apache/lucene/index/ConcurrentApproximatePriorityQueue.java: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; +import java.util.function.Predicate; + +/** + * Concurrent version of {@link ApproximatePriorityQueue}, which trades a bit more of ordering for + * better concurrency by maintaining 8 sub {@link ApproximatePriorityQueue}s that are locked + * independently. + */ +final class ConcurrentApproximatePriorityQueue { + + /** Keeping 8 queues should already help a lot compared to a single one. */ + static final int CONCURRENCY = 8; + + private static final int MASK = 0x07; + + final Lock[] locks; + final ApproximatePriorityQueue[] queues; + + ConcurrentApproximatePriorityQueue() { +locks = new Lock[CONCURRENCY]; +@SuppressWarnings({"rawtypes", "unchecked"}) +ApproximatePriorityQueue[] queues = new ApproximatePriorityQueue[CONCURRENCY]; +this.queues = queues; +for (int i = 0; i < CONCURRENCY; ++i) { + locks[i] = new ReentrantLock(); + queues[i] = new ApproximatePriorityQueue<>(); +} + } + + void add(T entry, long weight) { +// Seed the order in which to look at entries based on the current thread. This helps distribute +// entries across queues and gives a bit of thread affinity between entries and threads, which +// can't hurt. +final int threadHash = Thread.currentThread().hashCode(); +for (int i = 0; i < CONCURRENCY; ++i) { + final int index = (threadHash + i) & MASK; + final Lock lock = locks[index]; + final ApproximatePriorityQueue queue = queues[index]; + if (lock.tryLock()) { +try { + queue.add(entry, weight); + return; +} finally { + lock.unlock(); +} + } +} +final int index = threadHash & MASK; +final Lock lock = locks[index]; +final ApproximatePriorityQueue queue = queues[index]; +lock.lock(); +try { + queue.add(entry, weight); +} finally { + lock.unlock(); +} + } + + T poll(Predicate predicate) { +final int threadHash = Thread.currentThread().hashCode(); +for (int i = 0; i < CONCURRENCY; ++i) { + final int index = (threadHash + i) & MASK; + final Lock lock = locks[index]; + final ApproximatePriorityQueue queue = queues[index]; + if (lock.tryLock()) { +try { + T entry = queue.poll(predicate); + if (entry != null) { +return entry; + } +} finally { + lock.unlock(); +} + } +} +for (int i = 0; i < CONCURRENCY; ++i) { + final int index = (threadHash + i) & MASK; + final Lock lock = locks[index]; + final ApproximatePriorityQueue queue = queues[index]; + lock.lock(); + try { +T entry = queue.poll(predicate); +if (entry != null) { + return entry; +} + } finally { +lock.unlock(); + } +} +return null; + } + + // Only used for assertions + boolean contains(Object o) { +for (int i = 0; i < CONCURRENCY; ++i) { Review Comment: nitpick can you add a check that assertions are enabled? ## lucene/core/src/java/org/apache/lucene/index/ConcurrentApproximatePriorityQueue.java: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF AN
[GitHub] [lucene] jpountz commented on pull request #12114: Use radix sort to sort postings when index sorting is enabled.
jpountz commented on PR #12114: URL: https://github.com/apache/lucene/pull/12114#issuecomment-1469650532 I purposedly introduced a bug to see what would fail, and only high-level tests that check early query termination or dynamic pruning failed, so I introduced lower-level tests that make sure that postings get reordered correctly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12198: Reduce contention in DocumentsWriterFlushControl.
jpountz merged PR #12198: URL: https://github.com/apache/lucene/pull/12198 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12114: Use radix sort to sort postings when index sorting is enabled.
jpountz merged PR #12114: URL: https://github.com/apache/lucene/pull/12114 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on issue #12185: Using DirectIODirectory results in BufferOverflowException
mikemccand commented on issue #12185: URL: https://github.com/apache/lucene/issues/12185#issuecomment-1469863743 Hmm, I think the root cause exception is essentially [thrown from here](https://github.com/openjdk/jdk17/blob/master/src/java.base/share/classes/java/nio/Buffer.java#L722), which means it looks like `DirectIOIndexOutput` failed to flush its direct `ByteBuffer` when it was full. Not sure how that could happen on quickly staring at the code. Does this consistently reproduce? Looks like you are using `SerialMergeScheduler` so maybe it does (no thread scheduling "randomness")? Which JDK are you running? And which Linux flavor/version? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
jpountz merged PR #12199: URL: https://github.com/apache/lucene/pull/12199 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new pull request, #12205: Remove remaining sources of contention on indexing.
jpountz opened a new pull request, #12205: URL: https://github.com/apache/lucene/pull/12205 With this change, running `IndexGeoNames` with 20 threads goes from 16-17 seconds to 15-16 seconds. If I disable the 3 text fields - which are the main bottleneck for indexing - out of 19 fields, then indexing geonames goes from 12-13 seconds to 9-10 seconds with this change. After this change, running `IndexGeoNames` with 20 threads doesn't suffer from contention anymore. I'm only seeing a small bit of waiting due to stall control, which is expected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new pull request, #12206: Fully reuse postings enums when flushing sorted indexes.
jpountz opened a new pull request, #12206: URL: https://github.com/apache/lucene/pull/12206 Currently we're only half reusing postings enums when flushing sorted indexes as we still create new wrapper instances every time, which can be costly with fields that have many terms. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12206: Fully reuse postings enums when flushing sorted indexes.
jpountz commented on code in PR #12206: URL: https://github.com/apache/lucene/pull/12206#discussion_r1137412879 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java: ## @@ -71,7 +71,11 @@ public ByteBuffersDataInput(List buffers) { this.blockMask = (1 << blockBits) - 1; } -this.size = Arrays.stream(blocks).mapToLong(block -> block.remaining()).sum(); +long size = 0; +for (ByteBuffer block : blocks) { + size += block.remaining(); +} +this.size = size; Review Comment: When running some workloads with index sorting enabled, these two streams showed up on the profile and changing the logic to be more procedural indeed made things a bit faster. For reference, `toDataInput` is called on every unique term when index sorting is enabled, and ofter the postings list is short. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
mdmarshmallow commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1470400045 I tested with wikimedium10m. Looks like my change caused the `Prefix3` test to slow down.. not sure why. ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value Prefix3 116.79 (1.7%) 101.89 (1.2%) -12.8% ( -15% - -10%) 0.000 OrNotHighMed 270.50 (3.5%) 254.28 (2.8%) -6.0% ( -11% -0%) 0.000 OrHighNotHigh 173.22 (3.7%) 162.94 (3.3%) -5.9% ( -12% -1%) 0.000 OrNotHighHigh 304.62 (4.0%) 287.18 (2.4%) -5.7% ( -11% -0%) 0.000 OrHighNotMed 269.23 (4.3%) 257.83 (3.6%) -4.2% ( -11% -3%) 0.001 Wildcard 411.95 (2.3%) 395.75 (1.9%) -3.9% ( -7% -0%) 0.000 OrHighNotLow 391.63 (4.5%) 378.58 (3.8%) -3.3% ( -11% -5%) 0.012 HighTermTitleBDVSort 14.56 (7.8%) 14.14 (6.1%) -2.9% ( -15% - 12%) 0.199 OrHighHigh 25.14 (3.3%) 24.68 (4.3%) -1.8% ( -9% -5%) 0.131 OrHighMed 104.03 (2.8%) 102.70 (4.3%) -1.3% ( -8% -6%) 0.267 MedIntervalsOrdered 11.84 (4.4%) 11.72 (3.8%) -1.0% ( -8% -7%) 0.417 OrHighLow 395.96 (1.9%) 391.83 (2.0%) -1.0% ( -4% -2%) 0.097 HighIntervalsOrdered 19.10 (3.9%) 18.95 (4.1%) -0.8% ( -8% -7%) 0.541 AndHighMedDayTaxoFacets 112.90 (1.8%) 112.13 (2.2%) -0.7% ( -4% -3%) 0.275 AndHighHigh 62.08 (3.5%) 61.72 (4.8%) -0.6% ( -8% -8%) 0.666 OrNotHighLow 875.73 (3.2%) 871.01 (3.3%) -0.5% ( -6% -6%) 0.599 LowSloppyPhrase5.77 (2.3%)5.74 (3.0%) -0.5% ( -5% -4%) 0.536 LowIntervalsOrdered 46.88 (2.6%) 46.63 (2.7%) -0.5% ( -5% -4%) 0.540 MedSloppyPhrase 10.85 (2.1%) 10.80 (2.7%) -0.5% ( -5% -4%) 0.514 MedSpanNear 13.61 (3.4%) 13.55 (3.5%) -0.4% ( -7% -6%) 0.681 HighSloppyPhrase 22.71 (3.6%) 22.64 (4.4%) -0.3% ( -8% -7%) 0.789 HighTerm 279.23 (5.2%) 278.37 (3.9%) -0.3% ( -8% -9%) 0.832 OrHighMedDayTaxoFacets4.67 (3.9%)4.65 (3.6%) -0.3% ( -7% -7%) 0.809 AndHighLow 826.71 (2.5%) 824.54 (2.5%) -0.3% ( -5% -4%) 0.740 PKLookup 143.32 (3.6%) 143.00 (3.2%) -0.2% ( -6% -6%) 0.837 LowPhrase 138.92 (1.7%) 138.73 (2.2%) -0.1% ( -4% -3%) 0.828 LowTerm 436.79 (2.7%) 436.39 (3.7%) -0.1% ( -6% -6%) 0.929 AndHighMed 165.94 (3.4%) 165.82 (4.3%) -0.1% ( -7% -7%) 0.953 HighPhrase 75.52 (2.1%) 75.47 (2.1%) -0.1% ( -4% -4%) 0.921 MedTermDayTaxoFacets 23.58 (4.0%) 23.57 (3.2%) -0.0% ( -7% -7%) 0.979 HighSpanNear 23.96 (2.5%) 23.96 (2.8%)0.0% ( -5% -5%) 0.998 Fuzzy2 38.68 (1.7%) 38.69 (2.5%)0.0% ( -4% -4%) 0.992 AndHighHighDayTaxoFacets 13.62 (1.8%) 13.63 (1.7%)0.1% ( -3% -3%) 0.903 BrowseMonthSSDVFacets 10.96 (1.2%) 10.97 (1.0%)0.1% ( -2% -2%) 0.766 Respell 49.97 (1.4%) 50.04 (1.6%)0.1% ( -2% -3%) 0.769 MedTerm 368.29 (4.8%) 369.09 (4.6%)0.2% ( -8% - 10%) 0.884 MedPhrase 54.77 (3.6%) 54.94 (3.5%)0.3% ( -6% -7%) 0.778 HighTermDayOfYearSort 299.85 (2.2%) 300.91 (1.9%)0.4% ( -3% -4%) 0.590 BrowseRandomLabelSSDVFacets6.81 (5.8%)6.83 (6.9%)0.4% ( -11% - 13%) 0.862 IntNRQ 357.58 (6.6%) 358.90 (7.3%)0.4% ( -12% - 15%) 0.867 LowSpanNear 1
[GitHub] [lucene] jpountz commented on a diff in pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
jpountz commented on code in PR #12194: URL: https://github.com/apache/lucene/pull/12194#discussion_r1137546969 ## lucene/core/src/java/org/apache/lucene/search/DocIdSetIterator.java: ## @@ -211,4 +216,22 @@ protected final int slowAdvance(int target) throws IOException { * may be a rough heuristic, hardcoded value, or otherwise completely inaccurate. */ public abstract long cost(); + + /** + * Returns the next doc ID that may not be a match. Note that this API will essentially provide + * the following two guarantees: Review Comment: > By making it illegal to call this method once the iterator is exhausted, do you mean we need to throw an exception here? No I was thinking of just documenting the behavior as undefined. And throw assertions in AssertingScorer. > Is it ok we always return the last non matching doc (or maxDoc) under this scenario, and warn callers to check if the iterator's current doc is NO_MORE_DOCS to detect? Yes, exactly. This is the same for `nextDoc` or `advance`, it is illegal to call these methods when the current doc is NO_MORE_DOCS. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
jpountz commented on code in PR #12194: URL: https://github.com/apache/lucene/pull/12194#discussion_r1137550527 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java: ## @@ -479,6 +481,31 @@ private void refillDocs() throws IOException { assert docBuffer[BLOCK_SIZE] == NO_MORE_DOCS; } +@Override +public int peekNextNonMatchingDocID() throws IOException { Review Comment: Skip data theoretically has information we can leverage here too which could allow skipping multiple blocks at once. +1 to not read ahead in the postings readers, it feel like it would make things too complicated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
jpountz commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1470514750 > Do you have any pointer which benchmark task I could potentially use? If there isn't one available, I could try to add some next. Maybe we could try to leverage the geonames dataset (there's a few benchmarks for it in lucene-util), which has a few low-cardinality fields like the time zone or country. Then enable index sorting on these fields. And make sure we're getting performance when using the time zone in required or excluded clauses? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih merged pull request #12126: Refactor part of IndexFileDeleter and ReplicaFileDeleter into a common utility class
zhaih merged PR #12126: URL: https://github.com/apache/lucene/pull/12126 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih closed issue #11885: Refactor and generalize file deleter
zhaih closed issue #11885: Refactor and generalize file deleter URL: https://github.com/apache/lucene/issues/11885 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org