[GitHub] [lucene] jpountz commented on a diff in pull request #12206: Fully reuse postings enums when flushing sorted indexes.

2023-03-16 Thread via GitHub


jpountz commented on code in PR #12206:
URL: https://github.com/apache/lucene/pull/12206#discussion_r1137412879


##
lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java:
##
@@ -71,7 +71,11 @@ public ByteBuffersDataInput(List buffers) {
   this.blockMask = (1 << blockBits) - 1;
 }
 
-this.size = Arrays.stream(blocks).mapToLong(block -> 
block.remaining()).sum();
+long size = 0;
+for (ByteBuffer block : blocks) {
+  size += block.remaining();
+}
+this.size = size;

Review Comment:
   When running some workloads with index sorting enabled, these two streams 
showed up on the profile and changing the logic to be more procedural indeed 
made things a bit faster. For reference, `toDataInput` is called on every 
unique term when index sorting is enabled, and often the postings list is short.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on a diff in pull request #12206: Fully reuse postings enums when flushing sorted indexes.

2023-03-16 Thread via GitHub


mikemccand commented on code in PR #12206:
URL: https://github.com/apache/lucene/pull/12206#discussion_r1138515062


##
lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java:
##
@@ -71,7 +71,11 @@ public ByteBuffersDataInput(List buffers) {
   this.blockMask = (1 << blockBits) - 1;
 }
 
-this.size = Arrays.stream(blocks).mapToLong(block -> 
block.remaining()).sum();
+long size = 0;
+for (ByteBuffer block : blocks) {
+  size += block.remaining();
+}
+this.size = size;

Review Comment:
   This also helps greatly when looking at stack traces that traverse through 
stream'd code!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #12206: Fully reuse postings enums when flushing sorted indexes.

2023-03-16 Thread via GitHub


jpountz merged PR #12206:
URL: https://github.com/apache/lucene/pull/12206


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] s1monw commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.

2023-03-16 Thread via GitHub


s1monw commented on code in PR #12205:
URL: https://github.com/apache/lucene/pull/12205#discussion_r1138779734


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -604,7 +620,9 @@ long markForFullFlush() {
* blocking indexing.*/
   pruneBlockedQueue(flushingQueue);
   assert assertBlockedFlushes(documentsWriter.deleteQueue);
+  assert numQueued == flushQueue.size();

Review Comment:
   can you please add a message to the assert. it's very hard to reason if that 
fails what the actual values were and since this is concurrency it might be 
crucial.



##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -437,11 +438,18 @@ public String toString() {
   }
 
   DocumentsWriterPerThread nextPendingFlush() {
+if (numQueued == 0 && numPending == 0) {
+  // Common case, avoid taking a lock.

Review Comment:
   It is really hard to reason about concurrency with these volatile reads / 
writes. I can see how this all works with the queue design we use in IW / DW 
but it deserves some more comments. We should really tell why we do these extra 
cases and why it works to avoid changes that break common behavior. it's 
crucial that we state that we will ensure queue draining after we added to the 
queue ie. in the full flush case otherwise we might miss pending flushes. 



##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final 
DocumentsWriterDeleteQueue flushingQueue) {
 iterator.remove();
 addFlushingDWPT(blockedFlush);
 // don't decr pending here - it's already done when DWPT is blocked
+assert numQueued == flushQueue.size();

Review Comment:
   same here 



##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final 
DocumentsWriterDeleteQueue flushingQueue) {
 iterator.remove();
 addFlushingDWPT(blockedFlush);
 // don't decr pending here - it's already done when DWPT is blocked
+assert numQueued == flushQueue.size();

Review Comment:
   another question, why are we not incrementing the numQueued for every 
blocked flush? I think this would be more intuitive... the assert is still good



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.

2023-03-16 Thread via GitHub


jpountz commented on code in PR #12205:
URL: https://github.com/apache/lucene/pull/12205#discussion_r1138844621


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final 
DocumentsWriterDeleteQueue flushingQueue) {
 iterator.remove();
 addFlushingDWPT(blockedFlush);
 // don't decr pending here - it's already done when DWPT is blocked
+assert numQueued == flushQueue.size();

Review Comment:
   Can you elaborate a bit? Most places that I updated care about the number of 
DWPTs that are queued but not blocked, which they use to reason about whether 
indexing threads should help with flushes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] stefanvodita commented on pull request #11729: GITHUB#11728: Improve code clarity for OrdinalMap

2023-03-16 Thread via GitHub


stefanvodita commented on PR #11729:
URL: https://github.com/apache/lucene/pull/11729#issuecomment-1472199332

   Bumping this PR because I think it’s a good change in its current form. The 
increased expressiveness would be helpful to me if I was using this class.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] magibney opened a new pull request, #12207: simplify PrefixQuery to avoid requiring Automaton

2023-03-16 Thread via GitHub


magibney opened a new pull request, #12207:
URL: https://github.com/apache/lucene/pull/12207

   For all but extremely selective queries (which are quite fast anyway), the 
performance of `PrefixQuery` benefits from directly subclassing 
`MultiTermQuery`, as opposed to subclassing `AutomatonQuery`.
   
   This PR proposes a new approach that initially seeks the exact threshold 
term (the first beyond the range defined by prefix) for each segment. Once you 
know the exact threshold term, you can determine a single index (of term 
BytesRef) that differentiates the threshold term from any term matching the 
prefix; this is the only index we must inspect for each term in the 
`FilteredTermsEnum`.
   
   This approach doesn't work for `AutomatonQuery` generally; but for 
`PrefixQuery`, where matching terms are guaranteed to appear in a single 
contiguous range, it's quite straightforward. This also also removes the need 
for the arbitrary limitation on prefix length (which was a consequence of 
recursive parsing of prefixes in Automaton builder).
   
   Note: this PR still supports lazily building an Automaton in `visitQuery()`, 
which at the moment is I think only necessary to support highlighting. (TODO: I 
suspect that changing the `QueryVisitor` API to be more explicitly 
term-oriented as opposed to Automaton-oriented may be appropriate, and would 
remove the need for this accommodation). I also left (for now, at least) the 
prefix automaton as leveraged by `Intervals`.
   
   Because this optimization targets the "phase 1/terms collection" part of 
`MultiTermQuery`, the benefits are greatest in cases where the number of terms 
matched by the prefix is large, and where the performance impact of "phase 2" 
-- running the rewritten query -- doesn't overwhelm the cost of "phase 1". 
"phase 2" cost is proportionally high (relative to field cardinality) for 
larger indexes, so `wikimedium10k` shows substantial improvement (~50% faster 
for single-character/broad prefixes), while `wikimedium10m` shows only ~5-10% 
improvement for the same tasks.
   
   For both 10k and 10m, it appears that performance of extremely selective 
prefixes decreases slightly (~1% for both cases), probably as a result of the 
extra initial limit threshold seek. But the performance of the most 
negatively-impacted queries is quite good to begin with, so the absolute gains 
for broader prefixes should far outweigh the slight regression for the 
selective case.
   
   It's also worth noting that autocomplete (a common use case for prefix 
queries) tends to have relatively small indexes, so would expect to see greater 
proportional gains.
   
   Fields with content heavily weighted toward particular shared prefixes 
should also see significant benefits (broad prefixes benefitting the most, and 
there's a limit to how broad a prefix can be on a field with a relatively even 
distribution of prefixes).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] magibney commented on pull request #12207: simplify PrefixQuery to avoid requiring Automaton

2023-03-16 Thread via GitHub


magibney commented on PR #12207:
URL: https://github.com/apache/lucene/pull/12207#issuecomment-1472396185

   In evaluating this change with `luceneutil` benchmarks, I found that the 
change had _very_ different effects depending on how many terms were matched by 
a given prefix -- so I distinguished some of the broader existing prefixes with 
a different label, and added some new, even broader prefixes: 
https://github.com/magibney/luceneutil/commit/46aef9a3374f4b2d61dd8bc599a49b15c2665ce3
   
   The outcome from running with the above changes
   
   ```
   python3 src/python/localrun.py -source wikimedium10k
   
   TaskQPS baseline  StdDevQPS 
my_modified_version  StdDevPct diff p-value
   Wildcard 2799.63 (12.2%) 2678.39  
(9.3%)   -4.3% ( -22% -   19%) 0.206
PrefixLow10 7323.41 (11.0%) 7108.30  
(9.0%)   -2.9% ( -20% -   19%) 0.354
 IntNRQ 2789.37  (7.1%) 2742.51  
(8.6%)   -1.7% ( -16% -   15%) 0.500
Prefix3 2568.94  (7.0%) 2543.68 
(10.0%)   -1.0% ( -16% -   17%) 0.720
  OrHighLow 3252.17  (9.3%) 3228.60 
(16.2%)   -0.7% ( -24% -   27%) 0.862
 Fuzzy1  748.39 (18.4%)  746.53  
(9.3%)   -0.2% ( -23% -   33%) 0.957
 OrHighHigh 2049.79 (11.0%) 2070.90 
(15.4%)1.0% ( -22% -   30%) 0.808
PrefixHigh5 3190.32  (8.8%) 3226.09 
(11.8%)1.1% ( -17% -   23%) 0.734
  OrHighMed 1042.59 (11.3%) 1056.00 
(17.7%)1.3% ( -24% -   34%) 0.785
LowSloppyPhrase 3342.34 (10.4%) 3387.24  
(7.0%)1.3% ( -14% -   20%) 0.632
Respell  731.75 (14.6%)  742.81  
(5.5%)1.5% ( -16% -   25%) 0.664
 HighPhrase 1323.92  (6.0%) 1348.72  
(7.2%)1.9% ( -10% -   16%) 0.372
LowTerm12970.93 (15.5%)13223.24 
(17.9%)1.9% ( -27% -   41%) 0.713
MedSloppyPhrase 4249.81  (7.9%) 4335.44  
(7.6%)2.0% ( -12% -   19%) 0.411
 Fuzzy2  228.49 (20.1%)  234.20  
(8.7%)2.5% ( -21% -   39%) 0.610
MedIntervalsOrdered 6441.84  (8.8%) 6613.62 
(12.0%)2.7% ( -16% -   25%) 0.423
  MedPhrase 3503.22  (6.1%) 3609.22 
(10.3%)3.0% ( -12% -   20%) 0.258
  LowPhrase 3586.38  (8.3%) 3697.41  
(7.7%)3.1% ( -11% -   20%) 0.222
BrowseRandomLabelSSDVFacets 1176.50  (5.4%) 1213.69  
(9.9%)3.2% ( -11% -   19%) 0.209
   HighSpanNear  616.16  (8.9%)  636.01  
(9.6%)3.2% ( -14% -   23%) 0.273
   HighIntervalsOrdered  897.72 (15.6%)  928.76  
(8.6%)3.5% ( -17% -   32%) 0.385
AndHighHigh 2887.82  (7.5%) 2989.88 
(12.5%)3.5% ( -15% -   25%) 0.279
  HighTermMonthSort 3309.27  (4.0%) 3431.91  
(5.7%)3.7% (  -5% -   13%) 0.017
   HighTerm 5754.20 (16.1%) 5994.63 
(21.7%)4.2% ( -28% -   50%) 0.489
  HighTermDayOfYearSort 3134.24  (5.6%) 3267.92  
(4.7%)4.3% (  -5% -   15%) 0.009
   HighSloppyPhrase 1724.32  (6.8%) 1802.54 
(10.0%)4.5% ( -11% -   22%) 0.093
 AndHighMed 3483.26  (8.5%) 3654.61  
(8.3%)4.9% ( -10% -   23%) 0.064
MedSpanNear 3135.88  (7.2%) 3294.35  
(8.4%)5.1% (  -9% -   22%) 0.041
LowIntervalsOrdered 2313.84  (9.3%) 2451.25  
(8.5%)5.9% ( -10% -   26%) 0.035
BrowseRandomLabelTaxoFacets 1623.57  (7.8%) 1740.06 
(11.9%)7.2% ( -11% -   29%) 0.024
MedTerm 6399.97 (13.9%) 6862.00 
(22.8%)7.2% ( -25% -   51%) 0.227
LowSpanNear 3017.79  (7.7%) 3262.91  
(7.6%)8.1% (  -6% -   25%) 0.001
 AndHighLow 9832.71 (12.2%)10734.03 
(12.9%)9.2% ( -14% -   38%) 0.021
   PKLookup  216.97 (21.3%)  240.13 
(19.2%)   10.7% ( -24% -   65%) 0.096
   BrowseDateSSDVFacets 1492.26 (20.9%) 1651.87  
(6.0%)   10.7% ( -13% -   47%) 0.028
  BrowseDayOfYearTaxoFacets 4411.61 (19.6%) 4884.69  
(7.5%)   10.7% ( -13% -   47%) 0.022
  BrowseMonthTaxoFacets 4160.47 (17.0%) 4664.68  
(7.8%)   12.1% ( -10% -   44%

[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.

2023-03-16 Thread via GitHub


jpountz commented on code in PR #12205:
URL: https://github.com/apache/lucene/pull/12205#discussion_r1139164751


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final 
DocumentsWriterDeleteQueue flushingQueue) {
 iterator.remove();
 addFlushingDWPT(blockedFlush);
 // don't decr pending here - it's already done when DWPT is blocked
+assert numQueued == flushQueue.size();

Review Comment:
   @s1monw I tried a different approach where `numPending` also includes queued 
and blocked DWPTs. Is it the sort of thing you had in mind?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.

2023-03-16 Thread via GitHub


jpountz commented on code in PR #12205:
URL: https://github.com/apache/lucene/pull/12205#discussion_r1139165859


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -133,7 +156,7 @@ private boolean assertMemory() {
   // peak document
   final long expected =
   (2 * ramBufferBytes)
-  + ((numPending + numFlushingDWPT() + numBlockedFlushes()) * 
peakDelta)
+  + ((numPending + numFlushingDWPT()) * peakDelta)

Review Comment:
   is it a bug that queued flushes were not considered before?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org