[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping
[ https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562072#comment-17562072 ]

Adrien Grand commented on LUCENE-10616:
---

Thanks [~joe hou] for giving it a try!

The high-level idea looks good to me, of somehow leveraging information in the {{StoredFieldVisitor}} to only decompress the bits that matter.

In terms of implementation, I would like to see if we can avoid introducing the new {{StoredFieldVisitor#hasMoreFieldsToVisit}} method and rely on {{StoredFieldVisitor#needsField}} returning {{STOP}} instead.

The fact that decompressing data and decoding decompressed data are interleaved also makes the code harder to test. I wonder if we could change the signature of {{Decompressor#decompress}} to return an {{InputStream}} that would decompress data lazily instead of filling a {{BytesRef}}, so that it's possible to stop decompressing early while still being able to test decompression and decoding in isolation?

> Moving to dictionaries has made stored fields slower at skipping
>
> Key: LUCENE-10616
> URL: https://issues.apache.org/jira/browse/LUCENE-10616
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that is caused by LUCENE-9486.
>
> Say your documents have two stored fields, one that is 100B and is stored first, and the other one that is 100kB, and you are only interested in the first one. While the idea behind blocks of stored fields is to store multiple documents in the same block to leverage redundancy across documents, sometimes documents are larger than the block size. As soon as documents are larger than 2x the block size, our stored fields format splits such large documents into multiple blocks, so that you wouldn't need to decompress everything only to retrieve a couple small fields.
>
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so only retrieving the first field value would only need to decompress 16kB of data. With the move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have blocks of 80kB, so stored fields would now need to decompress 80kB of data, 5x more than before.
>
> With dictionaries, our blocks are now split into 10 sub blocks. We happen to eagerly decompress all sub blocks that intersect with the stored document, which is why we would decompress 80kB of data, but this is an implementation detail. It should be possible to decompress these sub blocks lazily so that we would only decompress those that intersect with one of the field values that the user is interested in retrieving?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
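To make the arithmetic in the issue description concrete, here is a minimal sketch of which sub blocks would need decompressing for a given field value. It assumes an 80kB block split into 10 equally sized 8kB sub blocks (the equal split is an assumption, the issue only gives the totals), and the class and method names are hypothetical, not Lucene APIs.

```java
// Hypothetical helper, not part of Lucene: computes which sub blocks overlap
// a byte range of the decompressed block.
final class SubBlockRange {
    // Assumption: 80kB block / 10 sub blocks = 8kB per sub block.
    static final int SUB_BLOCK_SIZE = 8 * 1024;

    /** Returns {first, last} (both inclusive) indices of the sub blocks overlapping [offset, offset + length). */
    static int[] intersecting(long offset, long length, int subBlockSize) {
        if (offset < 0 || length <= 0 || subBlockSize <= 0) {
            throw new IllegalArgumentException("offset must be >= 0, length and subBlockSize must be > 0");
        }
        int first = (int) (offset / subBlockSize);
        int last = (int) ((offset + length - 1) / subBlockSize);
        return new int[] {first, last};
    }

    /** Number of bytes that must be decompressed to read the given range lazily. */
    static long bytesToDecompress(long offset, long length, int subBlockSize) {
        int[] range = intersecting(offset, length, subBlockSize);
        return (long) (range[1] - range[0] + 1) * subBlockSize;
    }

    public static void main(String[] args) {
        // A 100B field stored first only touches the first sub block.
        System.out.println(bytesToDecompress(0, 100, SUB_BLOCK_SIZE));
        // Reading the whole 80kB block touches all 10 sub blocks, like eager decompression.
        System.out.println(bytesToDecompress(0, 10L * SUB_BLOCK_SIZE, SUB_BLOCK_SIZE));
    }
}
```

Under these assumptions, retrieving only the 100B field would decompress one 8kB sub block rather than the full 80kB block.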
[GitHub] [lucene] jpountz commented on a diff in pull request #996: LUCENE-10151: Some fixes to query timeouts.
jpountz commented on code in PR #996:
URL: https://github.com/apache/lucene/pull/996#discussion_r912762338

## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java: ##
@@ -2033,6 +2034,15 @@ protected LeafSlice[] slices(List leaves) {
       }
       ret.setSimilarity(classEnvRule.similarity);
       ret.setQueryCachingPolicy(MAYBE_CACHE_POLICY);
+      if (random().nextBoolean()) {

Review Comment:
Right, actually a timeout would change expectations about the output of `IndexSearcher`, so I don't think we can do it here. We'd need to do this in tests that are specific to query timeouts?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #996: LUCENE-10151: Some fixes to query timeouts.
jpountz commented on code in PR #996:
URL: https://github.com/apache/lucene/pull/996#discussion_r912762557

## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ##
@@ -85,7 +85,11 @@ public class IndexSearcher {
   private static QueryCache DEFAULT_QUERY_CACHE;
   private static QueryCachingPolicy DEFAULT_CACHING_POLICY = new UsageTrackingQueryCachingPolicy();
   private QueryTimeout queryTimeout = null;
-  private boolean partialResult = false;
+  // TODO: does partialResult need to be volatile? It can be set on one of the threads of the

Review Comment:
Agreed, I'll keep the comment but remove the TODO.
[jira] [Created] (LUCENE-10640) Can TimeLimitingBulkScorer exponentially grow the window size?
Adrien Grand created LUCENE-10640:
---

Summary: Can TimeLimitingBulkScorer exponentially grow the window size?
Key: LUCENE-10640
URL: https://issues.apache.org/jira/browse/LUCENE-10640
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand

{{TimeLimitingBulkScorer}} scores 100 documents at a time. Unfortunately, bulk scorers have non-zero overhead for {{BulkScorer#score}} since they need to set the scorer, figure out how to combine the Scorer with the competitive iterator of the collector, etc. Larger windows of doc IDs would help better amortize such costs.

Could we grow the window of scored doc IDs exponentially, maybe with guarantees such as making sure that the new window is at most 50% of the doc IDs that have been scored so far, so that this exponential growth could only exceed the configured timeout by 50%?
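The growth rule proposed above can be sketched as follows. This is only an illustration of the idea, not the `TimeLimitingBulkScorer` implementation: the class and method names are hypothetical, and keeping the current 100-document window as a floor is an assumption made so that the first windows are not empty.

```java
// Hypothetical sketch of the proposed window growth, not Lucene code.
final class WindowGrowth {
    // The issue states that windows are currently fixed at 100 doc IDs.
    static final int INITIAL_WINDOW = 100;

    /**
     * Size of the next window: at most 50% of the doc IDs scored so far,
     * but never smaller than the initial window (assumed floor).
     */
    static int nextWindow(long scoredSoFar) {
        long half = scoredSoFar / 2;
        return (int) Math.min(Integer.MAX_VALUE, Math.max(INITIAL_WINDOW, half));
    }

    /** Number of windows (timeout checks) needed to cover totalDocs doc IDs. */
    static int windowsNeeded(long totalDocs) {
        long scored = 0;
        int windows = 0;
        while (scored < totalDocs) {
            scored += nextWindow(scored);
            windows++;
        }
        return windows;
    }

    public static void main(String[] args) {
        System.out.println(windowsNeeded(1_000));
        System.out.println(windowsNeeded(1_000_000));
    }
}
```

Once past the floor, each window is half of what has already been scored, so the scored range grows by a factor of 1.5 per window and the number of `BulkScorer#score` calls grows logarithmically with the number of doc IDs instead of linearly.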
[jira] [Created] (LUCENE-10641) IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches
Adrien Grand created LUCENE-10641:
---

Summary: IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches
Key: LUCENE-10641
URL: https://issues.apache.org/jira/browse/LUCENE-10641
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand

{{IndexSearcher}} only checks the query timeout in the collection phase for now. It should check the timeout in other operations that may take time such as intersecting a fuzzy automaton with a terms dictionary, evaluating points that fall into a range or running a vector search.

This should be possible to do by wrapping the IndexReader's data structures in the same way as {{ExitableDirectoryReader}}?
[GitHub] [lucene-jira-archive] mikemccand commented on issue #1: Fix markup conversion error
mikemccand commented on issue #1:
URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173701233

> I found at least one test issue in the test repo https://github.com/mocobeta/sandbox-lucene-10557/issues appears in google's top search result. I didn't think it happens so quickly, but I might have to make the repo private. If there is anyone who is interested in reviewing/debugging this issue, please let me know. I'll give you access to the repo.

Hmm ... we could maybe rename the repository so that it [falls under one of the `robots.txt` rules at github.com](https://github.com/robots.txt)? At least one answer on Stack Overflow suggested this approach. Of course, it is brittle: if the `robots.txt` changes, the web crawlers will see the content again, but maybe for our short-term purposes it is acceptable?

Here's the current `robots.txt` content:

```
# If you would like to crawl GitHub contact us via https://support.github.com?tags=dotcom-robots
# We also provide an extensive API: https://docs.github.com
User-agent: baidu
crawl-delay: 1

User-agent: *
Disallow: /*/pulse
Disallow: /*/tree/
Disallow: /gist/
Disallow: /*/forks
Disallow: /*/stars
Disallow: /*/download
Disallow: /*/revisions
Disallow: /*/issues/new
Disallow: /*/issues/search
Disallow: /*/commits/
Disallow: /*/commits/*?author
Disallow: /*/commits/*?path
Disallow: /*/branches
Disallow: /*/tags
Disallow: /*/contributors
Disallow: /*/comments
Disallow: /*/stargazers
Disallow: /*/archive/
Disallow: /*/blame/
Disallow: /*/watchers
Disallow: /*/network
Disallow: /*/graphs
Disallow: /*/raw/
Disallow: /*/compare/
Disallow: /*/cache/
Disallow: /.git/
Disallow: */.git/
Disallow: /*.git$
Disallow: /search/advanced
Disallow: /search
Disallow: */search
Disallow: /*q=
Disallow: /*.atom
Disallow: /ekansa/Open-Context-Data
Disallow: /ekansa/opencontext-*
Disallow: */tarball/
Disallow: */zipball/
Disallow: /*source=*
Disallow: /*ref_cta=*
Disallow: /*plan=*
Disallow: /*return_to=*
Disallow: /*ref_loc=*
Disallow: /*setup_organization=*
Disallow: /*source_repo=*
Disallow: /*ref_page=*
Disallow: /*source=*
Disallow: /*referrer=*
Disallow: /*report=*
Disallow: /*author=*
Disallow: /*since=*
Disallow: /*until=*
Disallow: /*commits?author=*
Disallow: /*report-abuse?report=*
Disallow: /*tab=*
Allow: /*?tab=achievements&achievement=*
Disallow: /account-login
Disallow: /Explodingstuff/
```

So maybe we could name/rename this test repo with a prefix of `forks-` or `stars-`? Of course, GitHub might disallow this, but it's worth a shot?
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1:
URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173755435

```
Disallow: /*/forks
```

I think the first wildcard `*` would match all repository names. It looks like this entry disallows crawling the forked repository list page?

I made the test repo read-only ("archived") - at least nobody can update or add comments on that.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1:
URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173774134

ah ok, `/*/forks` should match any path that includes "forks" after the second `/`. I'll change the repository name if it's needed; for now, there seems to be no substantial bad effect (we'd need to be careful not to increase its site rank).
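The matching behaviour discussed in this thread can be checked with a small sketch. It assumes the wildcard semantics commonly implemented by major crawlers (`*` matches any sequence of characters including `/`, a trailing `$` anchors the end of the path, and rules otherwise match as prefixes); individual crawlers may differ. The class name and the `forks-sandbox` repository name are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: translates a robots.txt path rule into a regex,
// assuming '*' = any characters and a trailing '$' = end of path.
final class RobotsRule {
    static Pattern compile(String rule) {
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;
        StringBuilder regex = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                // Quote every literal character so '.', '?', etc. are not regex operators.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        if (anchored) {
            regex.append('$');
        }
        return Pattern.compile(regex.toString());
    }

    /** True if the rule matches the path as a prefix (or fully, when anchored with '$'). */
    static boolean matches(String rule, String path) {
        return compile(rule).matcher(path).lookingAt();
    }

    public static void main(String[] args) {
        // The current repository name is not covered by the rule.
        System.out.println(matches("/*/forks", "/mocobeta/sandbox-lucene-10557/issues"));
        // A hypothetical rename with a "forks-" prefix would be covered.
        System.out.println(matches("/*/forks", "/mocobeta/forks-sandbox/issues"));
    }
}
```

Under these assumptions, `/*/forks` covers both the forks list page of any repository and any repository whose name starts with `forks`, which is what the renaming idea relies on.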
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1:
URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173841068

I reviewed several hundred issues in the latest test migration. The conversion errors are not uncommon, and readers would often have to go back to Jira to reach the correct/original information - it's not a great experience. This is still a major blocker for migration to me.
[GitHub] [lucene] msokolov commented on a diff in pull request #996: LUCENE-10151: Some fixes to query timeouts.
msokolov commented on code in PR #996:
URL: https://github.com/apache/lucene/pull/996#discussion_r913038692

## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java: ##
@@ -2033,6 +2034,15 @@ protected LeafSlice[] slices(List leaves) {
       }
       ret.setSimilarity(classEnvRule.similarity);
       ret.setQueryCachingPolicy(MAYBE_CACHE_POLICY);
+      if (random().nextBoolean()) {

Review Comment:
I wonder: if we enabled this randomly with a large timeout value, say 5 minutes, which should never trigger an actual timeout during unit tests, would it exercise a different code path?
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562216#comment-17562216 ]

ASF subversion and git services commented on LUCENE-10151:
---

Commit 81d4a7a69f1c9085e40df412be87de22d0aa8cd6 in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=81d4a7a69f1 ]

LUCENE-10151: Some fixes to query timeouts. (#996)

I noticed some minor bugs in the original PR #927 that this PR should fix:
- When a timeout is set, we would no longer catch `CollectionTerminatedException`.
- I added randomization to `LuceneTestCase` to randomly set a timeout, it would have caught the above bug.
- Fixed visibility of `TimeLimitingBulkScorer`.

> Add timeout support to IndexSearcher
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Greg Miller
> Priority: Minor
> Fix For: 9.3
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to {{IndexSearcher}}. This would enable users to (optionally) specify a maximum time budget for search execution. If the search "times out", partial results would be available.
>
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). Thread for reference:
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>
> A couple things to watch out for with this change:
> # We want to make sure it's robust to a two-phase query evaluation scenario where the "approximate" step matches a large number of candidates but the "confirmation" step matches very few (or none). This is a particularly tricky case.
> # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
> # We want to make sure it plays nice with the {{LRUCache}} since it iterates the query to pre-populate a {{BitSet}} when caching. That step shouldn't be allowed to overrun the timeout. The proper way to handle this probably needs some thought.
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562219#comment-17562219 ]

Adrien Grand commented on LUCENE-10151:
---

bq. I've merged this now to main and backported to 9.x

Did you forget to push to branch_9x? I cannot see the change there.
[GitHub] [lucene] jpountz merged pull request #996: LUCENE-10151: Some fixes to query timeouts.
jpountz merged PR #996:
URL: https://github.com/apache/lucene/pull/996
[GitHub] [lucene] stefanvodita opened a new pull request, #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
stefanvodita opened a new pull request, #1004:
URL: https://github.com/apache/lucene/pull/1004

Replace all usages of `SortedSetDocValues.NO_MORE_ORDS` in tests and start using `SortedSetDocValues.docValueCount()`.

Jira: https://issues.apache.org/jira/browse/LUCENE-10603
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562230#comment-17562230 ]

Stefan Vodita commented on LUCENE-10603:
---

Hi Greg! I thought I'd help out. [Here|https://github.com/apache/lucene/pull/1004]'s a PR with the test changes.

> Improve iteration of ords for SortedSetDocValues
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Lu Xugang
> Assignee: Lu Xugang
> Priority: Trivial
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should we refactor the implementation of ords iterations to use docValueCount instead of NO_MORE_ORDS, similar to what SortedNumericDocValues did?
>
> From
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}
[GitHub] [lucene] jpountz opened a new pull request, #1005: LUCENE-10636: Avoid computing the same scores multiple times.
jpountz opened a new pull request, #1005:
URL: https://github.com/apache/lucene/pull/1005

`BlockMaxMaxscoreScorer` would previously compute the score twice for essential scorers.
[GitHub] [lucene] jpountz commented on pull request #1005: LUCENE-10636: Avoid computing the same scores multiple times.
jpountz commented on PR #1005:
URL: https://github.com/apache/lucene/pull/1005#issuecomment-1174000758

luceneutil on `wikimedium10m` seems to confirm that this gives a noticeable speedup to disjunctions:

```
Task            QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff            p-value
AndHighLow           1621.76  (3.1%)                   1611.29  (2.5%)   -0.6% (  -6% -   5%)  0.476
OrNotHighLow         1514.17  (2.9%)                   1507.60  (3.0%)   -0.4% (  -6% -   5%)  0.638
AndHighMed            184.49  (3.9%)                    187.19  (5.2%)    1.5% (  -7% -  10%)  0.312
OrNotHighMed         1442.80  (3.4%)                   1464.42  (4.6%)    1.5% (  -6% -   9%)  0.242
OrHighNotLow         1877.95  (3.9%)                   1907.58  (5.3%)    1.6% (  -7% -  11%)  0.282
AndHighHigh            84.87  (4.0%)                     86.25  (5.8%)    1.6% (  -7% -  11%)  0.301
OrNotHighHigh        1025.32  (3.4%)                   1043.39  (2.9%)    1.8% (  -4% -   8%)  0.078
OrHighNotHigh        1371.19  (3.7%)                   1400.07  (3.3%)    2.1% (  -4% -   9%)  0.057
OrHighNotMed         1566.12  (3.8%)                   1601.33  (3.9%)    2.2% (  -5% -  10%)  0.064
OrHighLow             788.64  (8.4%)                    845.63  (6.8%)    7.2% (  -7% -  24%)  0.003
OrHighMed             178.01  (7.5%)                    193.39  (6.0%)    8.6% (  -4% -  24%)  0.000
OrHighHigh             68.26 (11.9%)                     74.88  (9.6%)    9.7% ( -10% -  35%)  0.004
```
[GitHub] [lucene] jpountz commented on a diff in pull request #907: LUCENE-10357 Ghost fields and postings/points
jpountz commented on code in PR #907:
URL: https://github.com/apache/lucene/pull/907#discussion_r900220890

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ##
@@ -1378,7 +1378,7 @@ private static Status.TermIndexStatus checkFields(
       computedFieldCount++;

       final Terms terms = fields.terms(field);
-      if (terms == null) {
+      if (terms == Terms.EMPTY) {

Review Comment:
Let's remove this `if` block entirely?

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ##
@@ -1378,7 +1378,7 @@ private static Status.TermIndexStatus checkFields(
       computedFieldCount++;

       final Terms terms = fields.terms(field);
-      if (terms == null) {
+      if (terms == Terms.EMPTY) {

Review Comment:
Then let's fix codecs to return a `Terms` instance that has the correct values for `hasFreqs`, `hasOffsets`, `hasPositions` and `hasPayloads`? E.g. maybe you could add a new `Terms#empty(FieldInfo)` method that does the right thing based on the `FieldInfo` and leverage this method in postings formats?
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562247#comment-17562247 ]

Adrien Grand commented on LUCENE-10151:
---

For reference, I opened new JIRA issues for suggested follow-ups: LUCENE-10640, LUCENE-10641.
[GitHub] [lucene] zacharymorn commented on a diff in pull request #1005: LUCENE-10636: Avoid computing the same scores multiple times.
zacharymorn commented on code in PR #1005:
URL: https://github.com/apache/lucene/pull/1005#discussion_r913203251

## lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java: ##
@@ -35,11 +34,13 @@ class BlockMaxMaxscoreScorer extends Scorer {
   // heap of scorers ordered by doc ID
   private final DisiPriorityQueue essentialsScorers;

-  // list of scorers ordered by maxScore
-  private final LinkedList maxScoreSortedEssentialScorers;
-
+  // array of scorers ordered by maxScore
   private final DisiWrapper[] allScorers;

+  // index of the first essential scorer is the `allScorers` array. All scorers before this index

Review Comment:
```suggestion
  // index of the first essential scorer in the `allScorers` array. All scorers before this index
```
[GitHub] [lucene] zacharymorn commented on a diff in pull request #1005: LUCENE-10636: Avoid computing the same scores multiple times.
zacharymorn commented on code in PR #1005:
URL: https://github.com/apache/lucene/pull/1005#discussion_r913207313

## lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java: ##
@@ -248,6 +251,17 @@ public long cost() {
       @Override
       public boolean matches() throws IOException {
+        // Only sum up scores of non-essential scorers, essential scores were already folded into
+        // the score.
+        for (int i = 0; i < firstEssentialScorerIndex; ++i) {
+          DisiWrapper w = allScorers[i];
+          if (w.doc < doc) {
+            w.doc = w.iterator.advance(doc);
+          }
+          if (w.doc == doc) {
+            score += allScorers[i].scorer.score();
+          }
+        }
         return score() >= minCompetitiveScore;

Review Comment:
Nit: maybe just use `score` instead of `score()` here?
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562266#comment-17562266 ]

ASF subversion and git services commented on LUCENE-10480:
---

Commit a5c99aca1abc9b73a0c68d4f23533311382b718c in lucene's branch refs/heads/branch_9x from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a5c99aca1ab ]

LUCENE-10480: Use BMM scorer for 2 clauses disjunction (#972) (#1002)

(cherry picked from commit 503ec5597331454bf8b6af79b9701cfdccf5)

> Specialize 2-clauses disjunctions
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its invariants: one linked list for the current candidates, one priority queue of scorers that are behind, another one for scorers that are ahead. All this could be simplified in the 2-clauses case, which feels worth specializing for as it's very common that end users enter queries that only have two terms?
[GitHub] [lucene] zacharymorn merged pull request #1002: LUCENE-10480: (Backporting) Use BMM scorer for 2 clauses disjunction
zacharymorn merged PR #1002:
URL: https://github.com/apache/lucene/pull/1002