[GitHub] [lucene] jpountz commented on pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
jpountz commented on PR #12199: URL: https://github.com/apache/lucene/pull/12199#issuecomment-1473272224 I suspect this change to be the source of the speedup when indexing vectors on https://home.apache.org/~mikemccand/lucenebench/indexing.html, but maybe more because of the introduced affinity between indexing threads and DWPTs than because of reduced contention since contention generally shows up when indexing is fast, which isn't the case with vectors? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] MarcusSorealheis opened a new pull request, #12208: Explain term automaton queries
MarcusSorealheis opened a new pull request, #12208: URL: https://github.com/apache/lucene/pull/12208 ### Description This is a draft PR to address #12178 that I wrote in my evenings, as I'm out of office and recovering (possibly slow to respond). Using the open source search community to prevent my edges from dulling and because it's fun. This PR is very likely bad and needs a lot of touch up as I haven't been focused on this work for a while. It may also be a helpful guide for others who are looking to take this issue on and do a much better job. I need to update the tests to evaluate the functionality including but not limited to evaluating the explain output for matching docs. I started with the handling null due to the origin of my colleague Akaash's issue. for context to those unfamiliar, I have mostly worked on janitorial things in this repo, though I am close to some of the adjacent machine learning and developer productivity worlds. I'm a product person not a dev. I volunteer to help open source projects for happiness and rehabilitation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] s1monw commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.
s1monw commented on code in PR #12205: URL: https://github.com/apache/lucene/pull/12205#discussion_r1140021701 ## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java: ## @@ -634,7 +652,9 @@ private void pruneBlockedQueue(final DocumentsWriterDeleteQueue flushingQueue) { iterator.remove(); addFlushingDWPT(blockedFlush); // don't decr pending here - it's already done when DWPT is blocked +assert numQueued == flushQueue.size(); Review Comment: All I was referring to was why we don't do `numQueued++` and `numQueued--` in the corresponding places instead of `numQueued=flushQueue.size()` I don't think we should put _blocked_ in there since we can't really do anything with it for the time being but I haven't thought about it too much. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Hanyakubin opened a new issue, #12209: CPU usage continuously increasing when IMAP SEARCH command coming from user agent
Hanyakubin opened a new issue, #12209: URL: https://github.com/apache/lucene/issues/12209 ### Description We use Apache James to provide voice visual mail for iPhone users and Lucene module is used in James to store nessecary information for index search. But the CPU usage of James server is continuously increasing like this:  Below is the command sequence from iPhone: - SELECT INBOX - UID SEARCH 1:* NOT DELETED - UID SEARCH 1:* UNSEEN - UID STORE [uid_1] +FLAGS.SILENT (\Seen)) - UID STORE [uid_2] +FLAGS.SILENT (\Seen)) - .. - LOGOUT After several tests, we think maybe there are some problems in the implement of SEARCH command, so we change the search query command request to - UID SEARCH UID ALL instead of 1:* NOT/UN~ in our test environment. Then the CPU usage became stable like this:  We checked sourcecode and found there is a block in James like below: `public Flux search(SearchQuery query, MailboxSession mailboxSession) throws MailboxException {` `if (query.equals(LIST_ALL_QUERY) || query.equals(LIST_FROM_ONE)) {` `return listAllMessageUids(mailboxSession);` `}` `return index.search(mailboxSession, getMailboxEntity(), query);` `}` When the query is LIST_ALL_QUERY or LIST_FROM_ONE, it will go into the if block, not using index search of Lucene and the CPU usage is stable. Otherwise, in the other query cases when index.search method being called, the CPU usage will increase. The question is why. Is there some performance problem in Lucene, or this issue can be resolved by correct configuration? ### Version and environment details James: 3.6.0 Lucene: 3.6.2 JVM: OpenJDK 64-Bit Server VM OS: CentOS7.2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.
jpountz commented on code in PR #12205: URL: https://github.com/apache/lucene/pull/12205#discussion_r1140418586 ## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java: ## @@ -634,7 +652,9 @@ private void pruneBlockedQueue(final DocumentsWriterDeleteQueue flushingQueue) { iterator.remove(); addFlushingDWPT(blockedFlush); // don't decr pending here - it's already done when DWPT is blocked +assert numQueued == flushQueue.size(); Review Comment: Ack. I changed to increments/decrements. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12205: Remove remaining sources of contention on indexing.
jpountz commented on code in PR #12205: URL: https://github.com/apache/lucene/pull/12205#discussion_r1139165859 ## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java: ## @@ -133,7 +156,7 @@ private boolean assertMemory() { // peak document final long expected = (2 * ramBufferBytes) - + ((numPending + numFlushingDWPT() + numBlockedFlushes()) * peakDelta) + + ((numPending + numFlushingDWPT()) * peakDelta) Review Comment: is it a bug that queued flushes were not considered before? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] MarcusSorealheis commented on pull request #12208: Explain term automaton queries
MarcusSorealheis commented on PR #12208: URL: https://github.com/apache/lucene/pull/12208#issuecomment-1474501209 @rmuir or @mikemccand Do you have suggestions on who could be a good fit to advise me here on what types of tests or review? There's a few that are obvious that I need to fix, but I also recognize there maybe some desired strategies around design here. I intend to add more robust tests. This PR touches some important areas of the codebase. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on a diff in pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
zacharymorn commented on code in PR #12194: URL: https://github.com/apache/lucene/pull/12194#discussion_r1140939880 ## lucene/core/src/java/org/apache/lucene/search/DocIdSetIterator.java: ## @@ -211,4 +216,22 @@ protected final int slowAdvance(int target) throws IOException { * may be a rough heuristic, hardcoded value, or otherwise completely inaccurate. */ public abstract long cost(); + + /** + * Returns the next doc ID that may not be a match. Note that this API will essentially provide + * the following two guarantees: Review Comment: Makes sense! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
zacharymorn commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1474732221 > Sorry for the delayed response, so I saw it in my optimized version of `BitSet#or` that does use the `peekNextNonMatchingDocID` API. I also found the bug, it turns out we just weren't resetting the `lastNonMatchingDoc` in the `reset()` function. I created a [pull request](https://github.com/zacharymorn/lucene/pull/1) with my `BitSet#or` changes and the (one-liner) bug fix. > > I'm also planning on benchmarking this and will post here when I do. Thanks @mdmarshmallow for the PR and benchmark result. I have left a few comments there. I feel the reasons you are not seeing much change from benchmark could be that the implementation in `BitSet#or` might got overridden by subclasses, and/or the implementations in subclasses did not utilize their specialized data structures? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
zacharymorn commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1474743257 > Maybe we could try to leverage the geonames dataset (there's a few benchmarks for it in lucene-util), which has a few low-cardinality fields like the time zone or country. Then enable index sorting on these fields. And make sure we're getting performance when using the time zone in required or excluded clauses? Thanks @jpountz for the suggestion! I actually did something similar by modifying the benchmarking code to support sorting index by `month` field, which is a `KeywordField`, and creating corresponding query with the following syntax to exercise the `ReqExclScorer - Posting - DocValue` path that I have changed: `ReqExclWithDocValues: newbenchmarktest//names docvalues:-month:[Jan TO Aug]` However, while I was able to run this query, the test actually failed when verifying top matches & scores between baseline and modified code. Upon further digging, I realized a potential issue to this API idea. As Lucene does a lot of two phase iterations, and two phase iterator's approximation may provide a superset of the actual matches. If we were to use this API to find and ignore / skip over a bunch of doc ids from approximation, wouldn't the result be inaccurate? For example, in the below `skippingReqApproximation` disi I put inside `ReqExclScorer`, `exclApproximation.peekNextNonMatchingDocID()` may provide an inaccurate, longer run of matches as approximation provides superset matches. If we were to further confirm the boundary by actually checking matches on its iterator, then we basically resort to linear scan on the iterator, which defeats the purpose of this new API? ``` final DocIdSetIterator skippingReqApproximation = new DocIdSetIterator() { @Override public int docID() {} @Override public int nextDoc() throws IOException {} @Override public int advance(int target) throws IOException { // this exclNonMatchingDoc could be inaccruate, as exclApproximation provides a superset of matches than its underlying iterator int exclNonMatchingDoc = exclApproximation.peekNextNonMatchingDocID(); if (exclApproximation.docID() < target && target < exclNonMatchingDoc) { return reqApproximation.advance(exclNonMatchingDoc); } else { return reqApproximation.advance(target); } } @Override public long cost() {} }; ``` Please let me know if you have any suggestion on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org