[GitHub] [lucene] jpountz commented on a diff in pull request #12206: Fully reuse postings enums when flushing sorted indexes.
jpountz commented on code in PR #12206: URL: https://github.com/apache/lucene/pull/12206#discussion_r1137412879

## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java:

```diff
@@ -71,7 +71,11 @@ public ByteBuffersDataInput(List<ByteBuffer> buffers) {
       this.blockMask = (1 << blockBits) - 1;
     }
-    this.size = Arrays.stream(blocks).mapToLong(block -> block.remaining()).sum();
+    long size = 0;
+    for (ByteBuffer block : blocks) {
+      size += block.remaining();
+    }
+    this.size = size;
```

Review Comment: When running some workloads with index sorting enabled, these two streams showed up in the profile, and changing the logic to be more procedural indeed made things a bit faster. For reference, `toDataInput` is called on every unique term when index sorting is enabled, and the postings list is often short.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
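The comment above notes that replacing the stream with a plain loop measurably helped on this hot constructor path. A minimal standalone sketch of the two equivalent computations (the class name `SizeSum` is made up for illustration; this is not the Lucene class itself):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Standalone sketch contrasting the two equivalent size computations from the
// diff above; SizeSum is a made-up name, not part of Lucene.
public class SizeSum {

  // Stream-based variant: concise, but allocates a stream pipeline per call.
  static long sizeViaStream(ByteBuffer[] blocks) {
    return Arrays.stream(blocks).mapToLong(ByteBuffer::remaining).sum();
  }

  // Procedural variant from the patch: same result, no per-call stream objects.
  static long sizeViaLoop(ByteBuffer[] blocks) {
    long size = 0;
    for (ByteBuffer block : blocks) {
      size += block.remaining();
    }
    return size;
  }

  public static void main(String[] args) {
    ByteBuffer[] blocks = {ByteBuffer.allocate(16), ByteBuffer.allocate(32)};
    System.out.println(sizeViaStream(blocks)); // prints 48
    System.out.println(sizeViaLoop(blocks)); // prints 48
  }
}
```

Both variants compute the same total; the loop simply avoids the stream-pipeline setup cost, which matters here because the constructor runs once per unique term when index sorting is enabled.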
[GitHub] [lucene] mikemccand commented on a diff in pull request #12206: Fully reuse postings enums when flushing sorted indexes.
mikemccand commented on code in PR #12206: URL: https://github.com/apache/lucene/pull/12206#discussion_r1138515062

## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java:

```diff
@@ -71,7 +71,11 @@ public ByteBuffersDataInput(List<ByteBuffer> buffers) {
       this.blockMask = (1 << blockBits) - 1;
     }
-    this.size = Arrays.stream(blocks).mapToLong(block -> block.remaining()).sum();
+    long size = 0;
+    for (ByteBuffer block : blocks) {
+      size += block.remaining();
+    }
+    this.size = size;
```

Review Comment: This also helps greatly when looking at stack traces that traverse streamed code!
[GitHub] [lucene] jpountz merged pull request #12206: Fully reuse postings enums when flushing sorted indexes.
jpountz merged PR #12206: URL: https://github.com/apache/lucene/pull/12206
[GitHub] [lucene] s1monw commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.
s1monw commented on code in PR #12205: URL: https://github.com/apache/lucene/pull/12205#discussion_r1138779734

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:

```diff
@@ -604,7 +620,9 @@ long markForFullFlush() {
        * blocking indexing.*/
       pruneBlockedQueue(flushingQueue);
       assert assertBlockedFlushes(documentsWriter.deleteQueue);
+      assert numQueued == flushQueue.size();
```

Review Comment: Can you please add a message to the assert? If it fails, it's very hard to reason about what the actual values were, and since this is concurrency that might be crucial.

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:

```diff
@@ -437,11 +438,18 @@ public String toString() {
   }

   DocumentsWriterPerThread nextPendingFlush() {
+    if (numQueued == 0 && numPending == 0) {
+      // Common case, avoid taking a lock.
```

Review Comment: It is really hard to reason about concurrency with these volatile reads / writes. I can see how this all works with the queue design we use in IW / DW, but it deserves some more comments. We should really explain why we add these extra cases and why they work, to avoid changes that break common behavior. It's crucial to state that we ensure queue draining after adding to the queue, i.e. in the full-flush case; otherwise we might miss pending flushes.
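The volatile fast path being discussed can be illustrated with a minimal, self-contained sketch (class, field, and method names here are illustrative, not the actual `DocumentsWriterFlushControl` code): a counter written only under the lock but declared volatile lets readers skip the lock entirely in the common empty-queue case.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of the lock-avoidance pattern under discussion; this is
// NOT the actual DocumentsWriterFlushControl code, just the shape of the idea.
public class FlushQueueSketch<T> {
  private final Queue<T> flushQueue = new ArrayDeque<>();
  // Written only while holding the lock; volatile so lock-free readers see it.
  private volatile int numQueued;

  synchronized void enqueue(T dwpt) {
    flushQueue.add(dwpt);
    numQueued = flushQueue.size(); // publish the new size to lock-free readers
  }

  T nextPendingFlush() {
    if (numQueued == 0) {
      // Common case: nothing queued, avoid taking the lock entirely.
      return null;
    }
    synchronized (this) {
      T next = flushQueue.poll();
      numQueued = flushQueue.size();
      return next;
    }
  }
}
```

As the review comment stresses, this pattern is only safe if whoever enqueues work also ensures the queue gets drained afterwards: a reader that observed the counter as zero just before the enqueue will return without flushing, so correctness relies on that post-enqueue draining step rather than on the fast path alone.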
## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:

```diff
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final DocumentsWriterDeleteQueue flushingQueue) {
       iterator.remove();
       addFlushingDWPT(blockedFlush);
       // don't decr pending here - it's already done when DWPT is blocked
+      assert numQueued == flushQueue.size();
```

Review Comment: Same here.

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:

```diff
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final DocumentsWriterDeleteQueue flushingQueue) {
       iterator.remove();
       addFlushingDWPT(blockedFlush);
       // don't decr pending here - it's already done when DWPT is blocked
+      assert numQueued == flushQueue.size();
```

Review Comment: Another question: why are we not incrementing numQueued for every blocked flush? I think that would be more intuitive... the assert is still good.
[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.
jpountz commented on code in PR #12205: URL: https://github.com/apache/lucene/pull/12205#discussion_r1138844621

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:

```diff
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final DocumentsWriterDeleteQueue flushingQueue) {
       iterator.remove();
       addFlushingDWPT(blockedFlush);
       // don't decr pending here - it's already done when DWPT is blocked
+      assert numQueued == flushQueue.size();
```

Review Comment: Can you elaborate a bit? Most places that I updated care about the number of DWPTs that are queued but not blocked, which they use to reason about whether indexing threads should help with flushes.
[GitHub] [lucene] stefanvodita commented on pull request #11729: GITHUB#11728: Improve code clarity for OrdinalMap
stefanvodita commented on PR #11729: URL: https://github.com/apache/lucene/pull/11729#issuecomment-1472199332

Bumping this PR because I think it’s a good change in its current form. The increased expressiveness would be helpful to me if I were using this class.
[GitHub] [lucene] magibney opened a new pull request, #12207: simplify PrefixQuery to avoid requiring Automaton
magibney opened a new pull request, #12207: URL: https://github.com/apache/lucene/pull/12207

For all but extremely selective queries (which are quite fast anyway), the performance of `PrefixQuery` benefits from directly subclassing `MultiTermQuery` rather than `AutomatonQuery`. This PR proposes a new approach that initially seeks the exact threshold term (the first term beyond the range defined by the prefix) for each segment. Once the exact threshold term is known, you can determine a single index (into the term `BytesRef`) that differentiates the threshold term from any term matching the prefix; this is the only index that must be inspected for each term in the `FilteredTermsEnum`. This approach doesn't work for `AutomatonQuery` generally; but for `PrefixQuery`, where matching terms are guaranteed to appear in a single contiguous range, it's quite straightforward. This also removes the need for the arbitrary limitation on prefix length (which was a consequence of recursive parsing of prefixes in the Automaton builder).

Note: this PR still supports lazily building an Automaton in `visitQuery()`, which at the moment is, I think, only necessary to support highlighting. (TODO: I suspect that changing the `QueryVisitor` API to be more explicitly term-oriented, as opposed to Automaton-oriented, may be appropriate, and would remove the need for this accommodation.) I also left (for now, at least) the prefix automaton as leveraged by `Intervals`.

Because this optimization targets the "phase 1"/terms-collection part of `MultiTermQuery`, the benefits are greatest in cases where the number of terms matched by the prefix is large, and where the performance impact of "phase 2" -- running the rewritten query -- doesn't overwhelm the cost of "phase 1".
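The "single differentiating index" idea can be sketched in isolation (names are illustrative; this is not the PR's actual `FilteredTermsEnum` code). Terms are visited in sorted order starting from the prefix, so a term matches the prefix exactly while it still differs from the threshold term at the first index where the prefix and the threshold diverge:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the threshold-term comparison described above;
// PrefixThresholdSketch is a made-up name, not code from the PR.
public class PrefixThresholdSketch {

  // First index at which the prefix and the threshold term diverge.
  static int diffIndex(byte[] prefix, byte[] threshold) {
    int i = 0;
    while (i < prefix.length && i < threshold.length && prefix[i] == threshold[i]) {
      i++;
    }
    return i;
  }

  // A prefix-matching term is at least as long as the prefix, agrees with the
  // threshold before diffIndex, and differs from it exactly at diffIndex; so a
  // single byte comparison decides whether enumeration is still in range.
  static boolean stillInPrefixRange(byte[] term, byte[] threshold, int diffIndex) {
    return term.length > diffIndex && term[diffIndex] != threshold[diffIndex];
  }

  public static void main(String[] args) {
    byte[] prefix = "ab".getBytes(StandardCharsets.UTF_8);
    // Threshold: first indexed term past the prefix range, e.g. "ac".
    byte[] threshold = "ac".getBytes(StandardCharsets.UTF_8);
    int k = diffIndex(prefix, threshold); // k == 1
    System.out.println(stillInPrefixRange("abc".getBytes(StandardCharsets.UTF_8), threshold, k)); // true
    System.out.println(stillInPrefixRange("ac".getBytes(StandardCharsets.UTF_8), threshold, k)); // false
  }
}
```

Under these assumptions, the enumeration seeks to the prefix, accepts terms while the one-byte check holds, and stops at the threshold term itself, which is exactly the contiguous-range property that makes this work for `PrefixQuery` but not for automata in general.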
"Phase 2" cost is proportionally high (relative to field cardinality) for larger indexes, so `wikimedium10k` shows substantial improvement (~50% faster for single-character/broad prefixes), while `wikimedium10m` shows only ~5-10% improvement for the same tasks. For both 10k and 10m, the performance of extremely selective prefixes decreases slightly (~1% in both cases), probably as a result of the extra initial threshold seek. But the performance of the most negatively impacted queries is quite good to begin with, so the absolute gains for broader prefixes should far outweigh the slight regression in the selective case. It's also worth noting that autocomplete (a common use case for prefix queries) tends to involve relatively small indexes, so it would be expected to see greater proportional gains. Fields with content heavily weighted toward particular shared prefixes should also see significant benefits (broad prefixes benefit the most, and there's a limit to how broad a prefix can be on a field with a relatively even distribution of prefixes).
[GitHub] [lucene] magibney commented on pull request #12207: simplify PrefixQuery to avoid requiring Automaton
magibney commented on PR #12207: URL: https://github.com/apache/lucene/pull/12207#issuecomment-1472396185

In evaluating this change with `luceneutil` benchmarks, I found that it had _very_ different effects depending on how many terms were matched by a given prefix -- so I distinguished some of the broader existing prefixes with a different label, and added some new, even broader prefixes: https://github.com/magibney/luceneutil/commit/46aef9a3374f4b2d61dd8bc599a49b15c2665ce3

The outcome from running with the above changes:
```
python3 src/python/localrun.py -source wikimedium10k

Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
Wildcard 2799.63 (12.2%) 2678.39 (9.3%) -4.3% ( -22% - 19%) 0.206
PrefixLow10 7323.41 (11.0%) 7108.30 (9.0%) -2.9% ( -20% - 19%) 0.354
IntNRQ 2789.37 (7.1%) 2742.51 (8.6%) -1.7% ( -16% - 15%) 0.500
Prefix3 2568.94 (7.0%) 2543.68 (10.0%) -1.0% ( -16% - 17%) 0.720
OrHighLow 3252.17 (9.3%) 3228.60 (16.2%) -0.7% ( -24% - 27%) 0.862
Fuzzy1 748.39 (18.4%) 746.53 (9.3%) -0.2% ( -23% - 33%) 0.957
OrHighHigh 2049.79 (11.0%) 2070.90 (15.4%) 1.0% ( -22% - 30%) 0.808
PrefixHigh5 3190.32 (8.8%) 3226.09 (11.8%) 1.1% ( -17% - 23%) 0.734
OrHighMed 1042.59 (11.3%) 1056.00 (17.7%) 1.3% ( -24% - 34%) 0.785
LowSloppyPhrase 3342.34 (10.4%) 3387.24 (7.0%) 1.3% ( -14% - 20%) 0.632
Respell 731.75 (14.6%) 742.81 (5.5%) 1.5% ( -16% - 25%) 0.664
HighPhrase 1323.92 (6.0%) 1348.72 (7.2%) 1.9% ( -10% - 16%) 0.372
LowTerm 12970.93 (15.5%) 13223.24 (17.9%) 1.9% ( -27% - 41%) 0.713
MedSloppyPhrase 4249.81 (7.9%) 4335.44 (7.6%) 2.0% ( -12% - 19%) 0.411
Fuzzy2 228.49 (20.1%) 234.20 (8.7%) 2.5% ( -21% - 39%) 0.610
MedIntervalsOrdered 6441.84 (8.8%) 6613.62 (12.0%) 2.7% ( -16% - 25%) 0.423
MedPhrase 3503.22 (6.1%) 3609.22 (10.3%) 3.0% ( -12% - 20%) 0.258
LowPhrase 3586.38 (8.3%) 3697.41 (7.7%) 3.1% ( -11% - 20%) 0.222
BrowseRandomLabelSSDVFacets 1176.50 (5.4%) 1213.69 (9.9%) 3.2% ( -11% - 19%) 0.209
HighSpanNear 616.16 (8.9%) 636.01 (9.6%) 3.2% ( -14% - 23%) 0.273
HighIntervalsOrdered 897.72 (15.6%) 928.76 (8.6%) 3.5% ( -17% - 32%) 0.385
AndHighHigh 2887.82 (7.5%) 2989.88 (12.5%) 3.5% ( -15% - 25%) 0.279
HighTermMonthSort 3309.27 (4.0%) 3431.91 (5.7%) 3.7% ( -5% - 13%) 0.017
HighTerm 5754.20 (16.1%) 5994.63 (21.7%) 4.2% ( -28% - 50%) 0.489
HighTermDayOfYearSort 3134.24 (5.6%) 3267.92 (4.7%) 4.3% ( -5% - 15%) 0.009
HighSloppyPhrase 1724.32 (6.8%) 1802.54 (10.0%) 4.5% ( -11% - 22%) 0.093
AndHighMed 3483.26 (8.5%) 3654.61 (8.3%) 4.9% ( -10% - 23%) 0.064
MedSpanNear 3135.88 (7.2%) 3294.35 (8.4%) 5.1% ( -9% - 22%) 0.041
LowIntervalsOrdered 2313.84 (9.3%) 2451.25 (8.5%) 5.9% ( -10% - 26%) 0.035
BrowseRandomLabelTaxoFacets 1623.57 (7.8%) 1740.06 (11.9%) 7.2% ( -11% - 29%) 0.024
MedTerm 6399.97 (13.9%) 6862.00 (22.8%) 7.2% ( -25% - 51%) 0.227
LowSpanNear 3017.79 (7.7%) 3262.91 (7.6%) 8.1% ( -6% - 25%) 0.001
AndHighLow 9832.71 (12.2%) 10734.03 (12.9%) 9.2% ( -14% - 38%) 0.021
PKLookup 216.97 (21.3%) 240.13 (19.2%) 10.7% ( -24% - 65%) 0.096
BrowseDateSSDVFacets 1492.26 (20.9%) 1651.87 (6.0%) 10.7% ( -13% - 47%) 0.028
BrowseDayOfYearTaxoFacets 4411.61 (19.6%) 4884.69 (7.5%) 10.7% ( -13% - 47%) 0.022
BrowseMonthTaxoFacets 4160.47 (17.0%) 4664.68 (7.8%) 12.1% ( -10% - 44%
```
[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.
jpountz commented on code in PR #12205: URL: https://github.com/apache/lucene/pull/12205#discussion_r1139164751

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:

```diff
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final DocumentsWriterDeleteQueue flushingQueue) {
       iterator.remove();
       addFlushingDWPT(blockedFlush);
       // don't decr pending here - it's already done when DWPT is blocked
+      assert numQueued == flushQueue.size();
```

Review Comment: @s1monw I tried a different approach where `numPending` also includes queued and blocked DWPTs. Is it the sort of thing you had in mind?
[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.
jpountz commented on code in PR #12205: URL: https://github.com/apache/lucene/pull/12205#discussion_r1139165859

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:

```diff
@@ -133,7 +156,7 @@ private boolean assertMemory() {
     // peak document
     final long expected =
         (2 * ramBufferBytes)
-            + ((numPending + numFlushingDWPT() + numBlockedFlushes()) * peakDelta)
+            + ((numPending + numFlushingDWPT()) * peakDelta)
```

Review Comment: Is it a bug that queued flushes were not considered before?
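For reference, the bound being asserted works out as `2 * ramBufferBytes + (numPending + numFlushing) * peakDelta`. A hypothetical illustration with made-up numbers (`MemoryBoundSketch` is not Lucene code, and the figures below are invented for the example):

```java
// Hypothetical illustration of the memory bound in the assertMemory() diff
// above; MemoryBoundSketch is a made-up name, and all numbers are invented.
public class MemoryBoundSketch {

  // expected = 2 * ramBufferBytes + (numPending + numFlushing) * peakDelta
  static long expectedBytes(long ramBufferBytes, int numPending, int numFlushing, long peakDelta) {
    return 2 * ramBufferBytes + (long) (numPending + numFlushing) * peakDelta;
  }

  public static void main(String[] args) {
    // e.g. a 16 MB RAM buffer, 3 pending + 2 flushing DWPTs, 1 MB peak delta each
    long mb = 1024L * 1024L;
    System.out.println(expectedBytes(16 * mb, 3, 2, mb) / mb); // prints 37
  }
}
```

The question in the review comment is whether queued flushes should also have contributed to this bound before the change; in the PR's revised accounting, `numPending` covers queued and blocked DWPTs as well, which is why the `numBlockedFlushes()` term drops out of the formula.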