[GitHub] [lucene] jpountz commented on pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.

2023-03-17 Thread via GitHub


jpountz commented on PR #12199:
URL: https://github.com/apache/lucene/pull/12199#issuecomment-1473272224

   I suspect this change is the source of the speedup when indexing vectors on https://home.apache.org/~mikemccand/lucenebench/indexing.html, though maybe more because of the introduced affinity between indexing threads and DWPTs than because of reduced contention, since contention generally shows up when indexing is fast, which isn't the case with vectors?
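
   For illustration only (this is a hypothetical sketch of thread-to-DWPT affinity, not the actual change in #12199; the class and names below are made up), the affinity idea amounts to each indexing thread preferring the DWPT it used last:

   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;
   import java.util.function.Supplier;

   // Hypothetical illustration of thread-to-DWPT affinity, not Lucene's actual pool.
   final class AffinityPoolSketch<DWPT> {
     private final Deque<DWPT> freeList = new ArrayDeque<>();        // idle DWPTs
     private final ThreadLocal<DWPT> lastUsed = new ThreadLocal<>(); // per-thread affinity

     synchronized DWPT acquire(Supplier<DWPT> factory) {
       DWPT preferred = lastUsed.get();
       // Prefer the DWPT this thread used last time, if it is currently idle.
       if (preferred != null && freeList.remove(preferred)) {
         return preferred;
       }
       // Otherwise take any idle DWPT, or create a new one.
       DWPT dwpt = freeList.poll();
       if (dwpt == null) {
         dwpt = factory.get();
       }
       lastUsed.set(dwpt);
       return dwpt;
     }

     synchronized void release(DWPT dwpt) {
       freeList.push(dwpt);
     }
   }
   ```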





[GitHub] [lucene] MarcusSorealheis opened a new pull request, #12208: Explain term automaton queries

2023-03-17 Thread via GitHub


MarcusSorealheis opened a new pull request, #12208:
URL: https://github.com/apache/lucene/pull/12208

   ### Description
   
   This is a draft PR to address #12178 that I wrote in my evenings while I'm out of office and recovering (so I may be slow to respond). I'm using the open source search community to keep my edges from dulling, and because it's fun. This PR is very likely rough and needs a lot of touch-up, as I haven't been focused on this work for a while. It may also serve as a helpful guide for others who want to take this issue on and do a much better job.
   
   I need to update the tests to evaluate the functionality, including but not limited to checking the explain output for matching docs. I started with the null handling, since that is where my colleague Akaash's issue originated.
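
   As a rough reference only (this is the generic `Weight#explain` pattern with placeholder wording, not the actual code in this PR), explaining a matching doc typically means advancing a fresh scorer to that doc and reporting its score:

   ```java
   // Sketch of the generic Weight#explain pattern; placeholder text, not this PR's code.
   // This method lives inside the query's Weight implementation.
   @Override
   public Explanation explain(LeafReaderContext context, int doc) throws IOException {
     Scorer scorer = scorer(context);
     if (scorer != null && scorer.iterator().advance(doc) == doc) {
       float score = scorer.score();
       return Explanation.match(score, "weight(" + getQuery() + "), score of doc " + doc);
     }
     return Explanation.noMatch("no matching terms in doc " + doc);
   }
   ```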
   
   For context, for those unfamiliar with me: I have mostly worked on janitorial things in this repo, though I am close to some of the adjacent machine learning and developer productivity worlds. I'm a product person, not a dev. I volunteer to help open source projects for happiness and rehabilitation.
   
   
   





[GitHub] [lucene] s1monw commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.

2023-03-17 Thread via GitHub


s1monw commented on code in PR #12205:
URL: https://github.com/apache/lucene/pull/12205#discussion_r1140021701


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final DocumentsWriterDeleteQueue flushingQueue) {
       iterator.remove();
       addFlushingDWPT(blockedFlush);
       // don't decr pending here - it's already done when DWPT is blocked
+      assert numQueued == flushQueue.size();

Review Comment:
   All I was referring to was why we don't do `numQueued++` and `numQueued--` in the corresponding places instead of `numQueued = flushQueue.size()`. I don't think we should put _blocked_ in there, since we can't really do anything with it for the time being, but I haven't thought about it too much.
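
   For illustration (a simplified sketch, not the actual `DocumentsWriterFlushControl` code), maintaining the counter at the points that mutate the queue, and only asserting it against the queue size, looks like this:

   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;

   // Sketch: keep the counter in sync where the queue is mutated, and only
   // *assert* that it agrees with the queue size. DWPT is a stand-in type.
   final class FlushQueueSketch<DWPT> {
     private final Deque<DWPT> flushQueue = new ArrayDeque<>();
     private int numQueued;

     synchronized void enqueueFlush(DWPT dwpt) {
       flushQueue.add(dwpt);
       numQueued++; // increment where the queue grows
       assert numQueued == flushQueue.size();
     }

     synchronized DWPT nextPendingFlush() {
       DWPT dwpt = flushQueue.poll();
       if (dwpt != null) {
         numQueued--; // decrement where the queue shrinks
       }
       assert numQueued == flushQueue.size();
       return dwpt;
     }
   }
   ```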






[GitHub] [lucene] Hanyakubin opened a new issue, #12209: CPU usage continuously increasing when IMAP SEARCH command coming from user agent

2023-03-17 Thread via GitHub


Hanyakubin opened a new issue, #12209:
URL: https://github.com/apache/lucene/issues/12209

   ### Description
   
   We use Apache James to provide visual voicemail for iPhone users, and the Lucene module is used in James to store the necessary information for index search.
   
   However, the CPU usage of the James server keeps increasing, as shown here:
   
![MicrosoftTeams-image](https://user-images.githubusercontent.com/128135965/225887687-05f06f26-d4fd-4c9f-bf7d-11e25f6c22ee.png)
   
   Below is the command sequence from the iPhone:
   - SELECT INBOX
   - UID SEARCH 1:* NOT DELETED
   - UID SEARCH 1:* UNSEEN
   - UID STORE [uid_1] +FLAGS.SILENT (\Seen)
   - UID STORE [uid_2] +FLAGS.SILENT (\Seen)
   - ...
   - LOGOUT
   
   After several tests, we suspected there might be a problem in the implementation of the SEARCH command, so in our test environment we changed the search query to
   - UID SEARCH UID ALL
   
   instead of the `1:* NOT DELETED` / `1:* UNSEEN` queries. The CPU usage then became stable:
   ![MicrosoftTeams-image (3)](https://user-images.githubusercontent.com/128135965/225890304-e0c5a474-3342-487f-ad39-8b8710ca961d.png)
   
   We checked the source code and found the following block in James:
   
   ```java
   public Flux search(SearchQuery query, MailboxSession mailboxSession) throws MailboxException {
       if (query.equals(LIST_ALL_QUERY) || query.equals(LIST_FROM_ONE)) {
           return listAllMessageUids(mailboxSession);
       }
       return index.search(mailboxSession, getMailboxEntity(), query);
   }
   ```
   
   When the query is LIST_ALL_QUERY or LIST_FROM_ONE, execution goes into the if block and does not use Lucene's index search, and the CPU usage stays stable. In the other cases, where the index.search method is called, the CPU usage keeps increasing.
   
   The question is why. Is there a performance problem in Lucene, or can this issue be resolved with the correct configuration?
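
   For illustration only (this is not James's actual query construction, and the `flags` field name and term are assumptions), a criterion like `NOT DELETED` has to be executed as a real query against the Lucene index, unlike the `listAllMessageUids` shortcut above. With the Lucene 3.x API that James 3.6.0 uses, such a search looks roughly like:

   ```java
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.BooleanClause.Occur;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.MatchAllDocsQuery;
   import org.apache.lucene.search.TermQuery;
   import org.apache.lucene.search.TopDocs;

   // Illustrative sketch only; the "flags" field name is hypothetical, not James's schema.
   final class NotDeletedSearchSketch {
     static TopDocs searchNotDeleted(IndexSearcher searcher) throws java.io.IOException {
       BooleanQuery query = new BooleanQuery();
       query.add(new MatchAllDocsQuery(), Occur.MUST);                           // every message
       query.add(new TermQuery(new Term("flags", "\\Deleted")), Occur.MUST_NOT); // minus deleted ones
       // Unlike the LIST_ALL_QUERY shortcut, this evaluates the query against the
       // index on every SEARCH command.
       int maxDoc = Math.max(1, searcher.getIndexReader().maxDoc());
       return searcher.search(query, maxDoc);
     }
   }
   ```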
   
   
   ### Version and environment details
   
   James: 3.6.0
   Lucene: 3.6.2
   JVM: OpenJDK 64-Bit Server VM
   OS: CentOS 7.2





[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.

2023-03-17 Thread via GitHub


jpountz commented on code in PR #12205:
URL: https://github.com/apache/lucene/pull/12205#discussion_r1140418586


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -634,7 +652,9 @@ private void pruneBlockedQueue(final DocumentsWriterDeleteQueue flushingQueue) {
       iterator.remove();
       addFlushingDWPT(blockedFlush);
       // don't decr pending here - it's already done when DWPT is blocked
+      assert numQueued == flushQueue.size();

Review Comment:
   Ack. I changed to increments/decrements.






[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.

2023-03-17 Thread via GitHub


jpountz commented on code in PR #12205:
URL: https://github.com/apache/lucene/pull/12205#discussion_r1139165859


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java:
##
@@ -133,7 +156,7 @@ private boolean assertMemory() {
       // peak document
       final long expected =
           (2 * ramBufferBytes)
-              + ((numPending + numFlushingDWPT() + numBlockedFlushes()) * peakDelta)
+              + ((numPending + numFlushingDWPT()) * peakDelta)

Review Comment:
   is it a bug that queued flushes were not considered before?






[GitHub] [lucene] MarcusSorealheis commented on pull request #12208: Explain term automaton queries

2023-03-17 Thread via GitHub


MarcusSorealheis commented on PR #12208:
URL: https://github.com/apache/lucene/pull/12208#issuecomment-1474501209

   @rmuir or @mikemccand, do you have suggestions on who would be a good fit to advise me here on what types of tests or review are needed?
   
   There are a few obvious things I need to fix, but I also recognize there may be some desired strategies around the design here. I intend to add more robust tests, since this PR touches some important areas of the codebase.
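
   As a rough starting point (a minimal sketch with placeholder variables such as `searcher`, `query`, and `nonMatchingDoc`; not the tests in this PR), one common check is that `IndexSearcher#explain` agrees with the hits it explains:

   ```java
   // Sketch: inside a test method, assert that explain() is consistent with the hits.
   // searcher, query, and nonMatchingDoc are placeholders set up elsewhere in the test.
   TopDocs hits = searcher.search(query, 10);
   for (ScoreDoc hit : hits.scoreDocs) {
     Explanation explanation = searcher.explain(query, hit.doc);
     assertTrue(explanation.isMatch());
     assertEquals(hit.score, explanation.getValue().floatValue(), 1e-5f);
   }
   // A doc that should not match should explain as a non-match.
   assertFalse(searcher.explain(query, nonMatchingDoc).isMatch());
   ```

   The test framework's `CheckHits` utilities cover much of this kind of verification as well, if that is preferred.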





[GitHub] [lucene] zacharymorn commented on a diff in pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-17 Thread via GitHub


zacharymorn commented on code in PR #12194:
URL: https://github.com/apache/lucene/pull/12194#discussion_r1140939880


##
lucene/core/src/java/org/apache/lucene/search/DocIdSetIterator.java:
##
@@ -211,4 +216,22 @@ protected final int slowAdvance(int target) throws IOException {
    * may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.
    */
   public abstract long cost();
+
+  /**
+   * Returns the next doc ID that may not be a match. Note that this API will essentially provide
+   * the following two guarantees:
Review Comment:
   Makes sense!






[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-17 Thread via GitHub


zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1474732221

   > Sorry for the delayed response, so I saw it in my optimized version of 
`BitSet#or` that does use the `peekNextNonMatchingDocID` API. I also found the 
bug, it turns out we just weren't resetting the `lastNonMatchingDoc` in the 
`reset()` function. I created a [pull 
request](https://github.com/zacharymorn/lucene/pull/1) with my `BitSet#or` 
changes and the (one-liner) bug fix.
   > 
   > I'm also planning on benchmarking this and will post here when I do.
   
   Thanks @mdmarshmallow for the PR and benchmark results. I have left a few comments there. I feel the reason you are not seeing much change in the benchmark could be that the implementation in `BitSet#or` might get overridden by subclasses, and/or that the subclasses' implementations do not utilize their specialized data structures?
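
   For reference (a minimal sketch assuming the `peekNextNonMatchingDocID` API proposed in this PR and a `FixedBitSet`; this is not the code in the linked PR), the intended win is turning a run of matches into a single range `set` call:

   ```java
   import java.io.IOException;

   import org.apache.lucene.search.DocIdSetIterator;
   import org.apache.lucene.util.FixedBitSet;

   // Sketch: set whole runs of bits at once instead of one doc at a time.
   // peekNextNonMatchingDocID() is the API proposed in this PR, not an existing method.
   final class OrWithRunsSketch {
     static void or(FixedBitSet bits, DocIdSetIterator iterator) throws IOException {
       for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; ) {
         int end = Math.min(iterator.peekNextNonMatchingDocID(), bits.length());
         if (end > doc) {
           bits.set(doc, end);            // set the whole run [doc, end) at once
           doc = iterator.advance(end);   // jump past the run
         } else {
           bits.set(doc);                 // no known run: set a single bit
           doc = iterator.nextDoc();
         }
       }
     }
   }
   ```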





[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-17 Thread via GitHub


zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1474743257

   > Maybe we could try to leverage the geonames dataset (there's a few 
benchmarks for it in lucene-util), which has a few low-cardinality fields like 
the time zone or country. Then enable index sorting on these fields. And make 
sure we're getting performance when using the time zone in required or excluded 
clauses?
   
   Thanks @jpountz for the suggestion! I actually did something similar by modifying the benchmarking code to support sorting the index by the `month` field, which is a `KeywordField`, and creating a corresponding query with the following syntax to exercise the `ReqExclScorer - Posting - DocValue` path that I have changed:
   
   `ReqExclWithDocValues: newbenchmarktest//names docvalues:-month:[Jan TO Aug]`
   
   However, while I was able to run this query, the test actually failed when verifying top matches & scores between the baseline and the modified code. Upon further digging, I realized a potential issue with this API idea: Lucene does a lot of two-phase iteration, and a two-phase iterator's approximation may provide a superset of the actual matches. If we were to use this API to find and skip over a bunch of doc IDs from the approximation, wouldn't the result be inaccurate?
   
   For example, in the `skippingReqApproximation` DISI below, which I put inside `ReqExclScorer`, `exclApproximation.peekNextNonMatchingDocID()` may report an inaccurate, overly long run of matches, since the approximation yields a superset of the real matches. If we then confirmed the boundary by actually checking matches on its iterator, we would basically fall back to a linear scan of the iterator, which defeats the purpose of the new API?
   
   ```java
   final DocIdSetIterator skippingReqApproximation =
       new DocIdSetIterator() {
         @Override
         public int docID() {
           // Delegate to the required clause's approximation.
           return reqApproximation.docID();
         }

         @Override
         public int nextDoc() throws IOException {
           return advance(docID() + 1);
         }

         @Override
         public int advance(int target) throws IOException {
           // This exclNonMatchingDoc could be inaccurate, as exclApproximation
           // provides a superset of the matches of its underlying iterator.
           int exclNonMatchingDoc = exclApproximation.peekNextNonMatchingDocID();

           if (exclApproximation.docID() < target && target < exclNonMatchingDoc) {
             return reqApproximation.advance(exclNonMatchingDoc);
           } else {
             return reqApproximation.advance(target);
           }
         }

         @Override
         public long cost() {
           return reqApproximation.cost();
         }
       };
   ```
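
   To make the superset concern concrete, here is a toy two-phase iterator (not Lucene code, made-up numbers): the approximation accepts every doc ID in [0, 100), but `matches()` only confirms multiples of four, so a run peeked from the approximation alone would cover many docs that are not real matches:

   ```java
   // Toy illustration of the superset problem: the approximation accepts all of
   // [0, 100), but matches() only confirms multiples of four. A run of "matches"
   // read from the approximation alone would therefore be far too long.
   TwoPhaseIterator twoPhase =
       new TwoPhaseIterator(DocIdSetIterator.all(100)) {
         @Override
         public boolean matches() throws IOException {
           return approximation.docID() % 4 == 0; // only a quarter of the approximation
         }

         @Override
         public float matchCost() {
           return 1f;
         }
       };
   ```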
   
   Please let me know if you have any suggestions on this.
   
   

