[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-07-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562072#comment-17562072
 ] 

Adrien Grand commented on LUCENE-10616:
---

Thanks [~joe hou] for giving it a try! The high-level idea looks good to me, of 
somehow leveraging information in the {{StoredFieldVisitor}} to only decompress 
the bits that matter. In terms of implementation, I would like to see if we can 
avoid introducing the new {{StoredFieldVisitor#hasMoreFieldsToVisit}} method 
and rely on {{StoredFieldVisitor#needsField}} returning {{STOP}} instead. The 
fact that decompressing data and decoding decompressed data are interleaved 
also makes the code harder to test. I wonder if we could change the signature of 
{{Decompressor#decompress}} to return an {{InputStream}} that would decompress 
data lazily instead of filling a {{BytesRef}} so that it's possible to stop 
decompressing early while still being able to test decompression and decoding 
in isolation?
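
A minimal sketch of that idea, using only the JDK and not Lucene's actual classes: `Status`, `Visitor` and `LazyInputStream` are hypothetical stand-ins (the real `StoredFieldVisitor` and `Decompressor` have different signatures), and Deflate is swapped in for Lucene's own compression. Decoding reads from a plain `InputStream` and returns as soon as the visitor answers `STOP`, so sub blocks that were never read are never inflated.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class LazyStoredFieldsSketch {
  /** Hypothetical stand-in for the YES/NO/STOP answers of StoredFieldVisitor#needsField. */
  enum Status { YES, NO, STOP }

  /** Hypothetical, much reduced stand-in for StoredFieldVisitor. */
  interface Visitor {
    Status needsField(String name);
    void stringField(String name, String value);
  }

  /** Inflates one independently compressed sub block at a time, only when a read reaches it. */
  static final class LazyInputStream extends InputStream {
    private final List<byte[]> compressed;
    private InputStream current = new ByteArrayInputStream(new byte[0]);
    private int next = 0;
    int inflatedSubBlocks = 0; // exposed so that callers can observe how much work was done

    LazyInputStream(List<byte[]> compressed) {
      this.compressed = compressed;
    }

    @Override
    public int read() throws IOException {
      int b = current.read();
      while (b == -1 && next < compressed.size()) {
        byte[] block = compressed.get(next++);
        try (InputStream inflater = new InflaterInputStream(new ByteArrayInputStream(block))) {
          current = new ByteArrayInputStream(inflater.readAllBytes());
        }
        inflatedSubBlocks++;
        b = current.read();
      }
      return b;
    }
  }

  /** Decoding only sees an InputStream, so it can be tested without any compression. */
  static void visit(InputStream in, Visitor visitor) throws IOException {
    DataInputStream data = new DataInputStream(in);
    int numFields = data.readInt();
    for (int i = 0; i < numFields; i++) {
      String name = data.readUTF();
      Status status = visitor.needsField(name);
      if (status == Status.STOP) {
        return; // nothing past this point gets decompressed
      }
      String value = data.readUTF();
      if (status == Status.YES) {
        visitor.stringField(name, value);
      }
    }
  }

  /** Serializes two fields, splits the bytes into 1kB sub blocks and deflates each of them. */
  static List<byte[]> writeDocument() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(2);
    out.writeUTF("title");
    out.writeUTF("a".repeat(100));
    out.writeUTF("body");
    out.writeUTF("x".repeat(10_000));
    byte[] raw = bytes.toByteArray();
    List<byte[]> subBlocks = new ArrayList<>();
    for (int from = 0; from < raw.length; from += 1024) {
      ByteArrayOutputStream compressed = new ByteArrayOutputStream();
      try (DeflaterOutputStream deflater = new DeflaterOutputStream(compressed)) {
        deflater.write(raw, from, Math.min(1024, raw.length - from));
      }
      subBlocks.add(compressed.toByteArray());
    }
    return subBlocks;
  }

  /** Returns how many sub blocks had to be inflated to retrieve only the given field. */
  static int subBlocksInflatedFor(String wanted) {
    try {
      LazyInputStream in = new LazyInputStream(writeDocument());
      boolean[] seen = {false};
      visit(in, new Visitor() {
        @Override
        public Status needsField(String name) {
          if (seen[0]) {
            return Status.STOP; // the only field of interest was already visited
          }
          return name.equals(wanted) ? Status.YES : Status.NO;
        }

        @Override
        public void stringField(String name, String value) {
          seen[0] = true;
        }
      });
      return in.inflatedSubBlocks;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With this layout, `subBlocksInflatedFor("title")` touches a single sub block, while asking for `"body"` has to inflate all of them. Because `visit` takes any `InputStream`, it could also be exercised with a `ByteArrayInputStream` over uncompressed bytes.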

> Moving to dictionaries has made stored fields slower at skipping
> 
>
> Key: LUCENE-10616
> URL: https://issues.apache.org/jira/browse/LUCENE-10616
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that 
> is caused by LUCENE-9486.
> Say your documents have two stored fields, one that is 100B and is stored 
> first, and the other one that is 100kB, and you are only interested in the 
> first one. While the idea behind blocks of stored fields is to store multiple 
> documents in the same block to leverage redundancy across documents, 
> sometimes documents are larger than the block size. As soon as documents are 
> larger than 2x the block size, our stored fields format splits such large 
> documents into multiple blocks, so that you wouldn't need to decompress 
> everything only to retrieve a couple small fields.
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so only retrieving 
> the first field value would only need to decompress 16kB of data. With the 
> move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have 
> blocks of 80kB, so stored fields would now need to decompress 80kB of data, 
> 5x more than before.
> With dictionaries, our blocks are now split into 10 sub blocks. We happen to 
> eagerly decompress all sub blocks that intersect with the stored document, 
> which is why we would decompress 80kB of data, but this is an implementation 
> detail. It should be possible to decompress these sub blocks lazily so that 
> we would only decompress those that intersect with one of the field values 
> that the user is interested in retrieving?
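
To make the arithmetic above concrete, here is a small sketch. It assumes that the 80kB block is cut into 10 equal sub blocks of 8kB and that field positions are known as uncompressed offsets; `SubBlockMath` and its methods are hypothetical names, not part of Lucene.

```java
final class SubBlockMath {
  /**
   * Returns {first, last} (inclusive) indices of the fixed-size sub blocks that overlap the
   * uncompressed byte range [offset, offset + length).
   */
  static int[] intersectingSubBlocks(long offset, long length, int subBlockSize) {
    if (offset < 0 || length <= 0 || subBlockSize <= 0) {
      throw new IllegalArgumentException("offset >= 0, length > 0 and subBlockSize > 0 are required");
    }
    int first = (int) (offset / subBlockSize);
    int last = (int) ((offset + length - 1) / subBlockSize);
    return new int[] {first, last};
  }

  /** Number of uncompressed bytes produced when inflating only the intersecting sub blocks. */
  static long bytesToDecompress(long offset, long length, int subBlockSize) {
    int[] range = intersectingSubBlocks(offset, length, subBlockSize);
    return (long) (range[1] - range[0] + 1) * subBlockSize;
  }

  public static void main(String[] args) {
    int blockSize = 80 * 1024;
    int subBlockSize = blockSize / 10; // assumption: 10 equal sub blocks per block
    // The 100B field that is stored first only intersects sub block 0.
    System.out.println("lazy:  " + bytesToDecompress(0, 100, subBlockSize) + " bytes");
    // Eager decompression inflates the whole block regardless of the requested field.
    System.out.println("eager: " + blockSize + " bytes");
  }
}
```

Under these assumptions, retrieving the 100B field would inflate 8kB instead of 80kB, which is also less than the 16kB that were decompressed before LUCENE-9486.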



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #996: LUCENE-10151: Some fixes to query timeouts.

2022-07-04 Thread GitBox


jpountz commented on code in PR #996:
URL: https://github.com/apache/lucene/pull/996#discussion_r912762338


##
lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:
##
@@ -2033,6 +2034,15 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   }
   ret.setSimilarity(classEnvRule.similarity);
   ret.setQueryCachingPolicy(MAYBE_CACHE_POLICY);
+  if (random().nextBoolean()) {

Review Comment:
   Right, actually a timeout would change expectations about the output of 
`IndexSearcher`, so I don't think we can do it here. We'd need to do this in 
tests that are specific to query timeouts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #996: LUCENE-10151: Some fixes to query timeouts.

2022-07-04 Thread GitBox


jpountz commented on code in PR #996:
URL: https://github.com/apache/lucene/pull/996#discussion_r912762557


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -85,7 +85,11 @@ public class IndexSearcher {
   private static QueryCache DEFAULT_QUERY_CACHE;
   private static QueryCachingPolicy DEFAULT_CACHING_POLICY = new UsageTrackingQueryCachingPolicy();
   private QueryTimeout queryTimeout = null;
-  private boolean partialResult = false;
+  // TODO: does partialResult need to be volatile? It can be set on one of the threads of the

Review Comment:
   Agreed, I'll keep the comment but remove the TODO.






[jira] [Created] (LUCENE-10640) Can TimeLimitingBulkScorer exponentially grow the window size?

2022-07-04 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10640:
-

 Summary: Can TimeLimitingBulkScorer exponentially grow the window 
size?
 Key: LUCENE-10640
 URL: https://issues.apache.org/jira/browse/LUCENE-10640
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


{{TimeLimitingBulkScorer}} scores 100 documents at a time. Unfortunately, bulk 
scorers have non-zero overhead for {{BulkScorer#score}} since they need to set 
the scorer, figure out how to combine the Scorer with the competitive iterator 
of the collector, etc. Larger windows of doc IDs would help better amortize 
such costs.

Could we grow the window of scored doc IDs exponentially, maybe with guarantees 
such as making sure that the new window is at most 50% of the doc IDs that have 
been scored so far, so that this exponential growth could only exceed the 
configured timeout by at most 50%?
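
One possible reading of this proposal, as a sketch rather than a patch: keep the current window of 100 doc IDs as a floor, and otherwise size each new window as half of what has been scored so far. `GrowingWindows` and its methods are hypothetical names. The 50% bound on the timeout overshoot only holds under the assumption that the time spent is roughly proportional to the number of doc IDs scored.

```java
import java.util.ArrayList;
import java.util.List;

final class GrowingWindows {
  /** Floor for the window size, matching the 100 documents that are scored at a time today. */
  static final int MIN_WINDOW = 100;

  /** Next window: at most 50% of the doc IDs scored so far, but never below the floor. */
  static int nextWindow(long scoredSoFar) {
    return (int) Math.max(MIN_WINDOW, Math.min(Integer.MAX_VALUE, scoredSoFar / 2));
  }

  /** Returns the successive window sizes that would be used to score maxDoc doc IDs. */
  static List<Integer> windows(int maxDoc) {
    List<Integer> sizes = new ArrayList<>();
    int scored = 0;
    while (scored < maxDoc) {
      int window = Math.min(nextWindow(scored), maxDoc - scored);
      sizes.add(window);
      scored += window;
      // A real bulk scorer would check the timeout here, between two windows.
    }
    return sizes;
  }

  public static void main(String[] args) {
    System.out.println(windows(1000)); // prints [100, 100, 100, 150, 225, 325]
    System.out.println(windows(1_000_000).size() + " windows instead of 10000");
  }
}
```

Once past the first few windows, the total number of scored doc IDs grows by a factor of 1.5 per window, so the number of `BulkScorer#score` calls becomes logarithmic in the number of doc IDs instead of linear.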






[jira] [Created] (LUCENE-10641) IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches

2022-07-04 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10641:
-

 Summary: IndexSearcher#setTimeout should also abort query 
rewrites, point ranges and vector searches
 Key: LUCENE-10641
 URL: https://issues.apache.org/jira/browse/LUCENE-10641
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


{{IndexSearcher}} only checks the query timeout in the collection phase for 
now. It should check the timeout in other operations that may take time such as 
intersecting a fuzzy automaton with a terms dictionary, evaluating points that 
fall into a range or running a vector search. This should be possible to do by 
wrapping the IndexReader's data structures in the same way as 
{{ExitableDirectoryReader}}?
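
The wrapping approach can be illustrated without any Lucene types. In the sketch below, a plain `java.util.Iterator` stands in for whatever is being iterated (terms, points, vectors), a `BooleanSupplier` stands in for the timeout check, and `TimeExceededException` is a hypothetical exception; none of these names are Lucene's API. The timeout is only consulted every `checkEvery` calls to keep the overhead of the check low.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.BooleanSupplier;

final class ExitableIteratorSketch {
  /** Hypothetical exception that signals that the time budget was exhausted. */
  static final class TimeExceededException extends RuntimeException {
    TimeExceededException() {
      super("time budget exceeded");
    }
  }

  /** Wraps an iterator so that it consults the timeout every checkEvery calls to next(). */
  static <T> Iterator<T> wrap(Iterator<T> in, BooleanSupplier shouldExit, int checkEvery) {
    return new Iterator<T>() {
      private int calls = 0;

      @Override
      public boolean hasNext() {
        return in.hasNext();
      }

      @Override
      public T next() {
        if (++calls % checkEvery == 0 && shouldExit.getAsBoolean()) {
          throw new TimeExceededException();
        }
        return in.next();
      }
    };
  }

  /** Consumes n values through the wrapper and returns how many were read before exiting. */
  static int consumedBeforeExit(int n, int checkEvery, boolean expired) {
    List<Integer> values = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      values.add(i);
    }
    Iterator<Integer> it = wrap(values.iterator(), () -> expired, checkEvery);
    int consumed = 0;
    try {
      while (it.hasNext()) {
        it.next();
        consumed++;
      }
    } catch (TimeExceededException e) {
      // Whatever was consumed before the timeout remains usable as a partial result.
    }
    return consumed;
  }
}
```

For example, `consumedBeforeExit(5, 3, true)` stops at the third call to `next()`, after two values were consumed, while `consumedBeforeExit(5, 3, false)` consumes all five.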






[GitHub] [lucene-jira-archive] mikemccand commented on issue #1: Fix markup conversion error

2022-07-04 Thread GitBox


mikemccand commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173701233

   > I found that at least one test issue in the test repo 
https://github.com/mocobeta/sandbox-lucene-10557/issues appears in Google's top 
search results. I didn't think it would happen so quickly, but I might have to make 
the repo private. If there is anyone who is interested in reviewing/debugging 
this issue, please let me know. I'll give you access to the repo.
   
   Hmm ... we could maybe rename the repository so that it [falls under one of 
the `robots.txt` rules at github.com](https://github.com/robots.txt)?  At least 
one answer on Stack Overflow suggested this approach.  Of course, it is 
brittle: if the `robots.txt` changes, the web crawlers will see the content 
again, but maybe for our short-term purposes it is acceptable?
   
   Here's the current `robots.txt` content:
   
   ```
   # If you would like to crawl GitHub contact us via 
https://support.github.com?tags=dotcom-robots
   # We also provide an extensive API: https://docs.github.com
   User-agent: baidu
   crawl-delay: 1
   
   
   User-agent: *
   
   Disallow: /*/pulse
   Disallow: /*/tree/
   Disallow: /gist/
   Disallow: /*/forks
   Disallow: /*/stars
   Disallow: /*/download
   Disallow: /*/revisions
   Disallow: /*/issues/new
   Disallow: /*/issues/search
   Disallow: /*/commits/
   Disallow: /*/commits/*?author
   Disallow: /*/commits/*?path
   Disallow: /*/branches
   Disallow: /*/tags
   Disallow: /*/contributors
   Disallow: /*/comments
   Disallow: /*/stargazers
   Disallow: /*/archive/
   Disallow: /*/blame/
   Disallow: /*/watchers
   Disallow: /*/network
   Disallow: /*/graphs
   Disallow: /*/raw/
   Disallow: /*/compare/
   Disallow: /*/cache/
   Disallow: /.git/
   Disallow: */.git/
   Disallow: /*.git$
   Disallow: /search/advanced
   Disallow: /search
   Disallow: */search
   Disallow: /*q=
   Disallow: /*.atom
   
   Disallow: /ekansa/Open-Context-Data
   Disallow: /ekansa/opencontext-*
   Disallow: */tarball/
   Disallow: */zipball/
   
   Disallow: /*source=*
   Disallow: /*ref_cta=*
   Disallow: /*plan=*
   Disallow: /*return_to=*
   Disallow: /*ref_loc=*
   Disallow: /*setup_organization=*
   Disallow: /*source_repo=*
   Disallow: /*ref_page=*
   Disallow: /*source=*
   Disallow: /*referrer=*
   Disallow: /*report=*
   Disallow: /*author=*
   Disallow: /*since=*
   Disallow: /*until=*
   Disallow: /*commits?author=*
   Disallow: /*report-abuse?report=*
   Disallow: /*tab=*
   Allow: /*?tab=achievements&achievement=*
   
   Disallow: /account-login
   Disallow: /Explodingstuff/
   ```
   
   So maybe we could name/rename this test repo with a prefix of `forks-` or 
`stars-`?  Of course, GitHub might disallow this, but it's worth a shot?





[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-07-04 Thread GitBox


mocobeta commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173755435

   ```
   Disallow: /*/forks
   ```
   I think the first wildcard `*` would match all repository names. It looks 
like this entry disallows crawling the forked repository list page? 
   
   I made the test repo read-only ("archived") - at least nobody can update or 
add comments on that.
   ![Screenshot from 2022-07-04 
21-17-47](https://user-images.githubusercontent.com/1825333/177153243-4afb9a0d-8d39-4bc8-aee1-cf546edb3635.png)
   





[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-07-04 Thread GitBox


mocobeta commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173774134

   ah ok, `/*/forks` should match any path that includes "forks" after the 
second `/`. 
   I'll change the repository name if it's needed; for now, there seems to be no 
substantial bad effect (we'd need to be careful not to increase its site 
rank).
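
   The matching discussed in this thread can be checked with a small sketch. It assumes the commonly documented `robots.txt` semantics: a rule is matched against the start of the URL path, `*` matches any sequence of characters (including `/`), and a trailing `$` anchors the end of the path. `RobotsPatternSketch` is a hypothetical helper, and whether a given crawler honours the rule is a separate question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RobotsPatternSketch {
  /** Returns true if the given Disallow rule matches the given URL path. */
  static boolean disallows(String rule, String path) {
    boolean anchored = rule.endsWith("$");
    String body = anchored ? rule.substring(0, rule.length() - 1) : rule;
    StringBuilder regex = new StringBuilder();
    for (char c : body.toCharArray()) {
      // '*' is the only wildcard; every other character is matched literally.
      regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
    }
    Matcher m = Pattern.compile(regex.toString()).matcher(path);
    // Without '$', a rule only needs to match a prefix of the path.
    return anchored ? m.matches() : m.lookingAt();
  }

  public static void main(String[] args) {
    // The list of forks of any repository is covered by the rule.
    System.out.println(disallows("/*/forks", "/apache/lucene/forks"));
    // A repository whose name starts with "forks" would be covered as well.
    System.out.println(disallows("/*/forks", "/mocobeta/forks-sandbox/issues"));
    // The current name of the test repository is not covered.
    System.out.println(disallows("/*/forks", "/mocobeta/sandbox-lucene-10557/issues"));
  }
}
```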





[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-07-04 Thread GitBox


mocobeta commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173841068

   I reviewed several hundred issues in the latest test migration. 
Conversion errors are not uncommon, and readers would often have to go back to 
Jira to reach the correct/original information - it's not a great experience. To 
me, this is still a major blocker for the migration.
   





[GitHub] [lucene] msokolov commented on a diff in pull request #996: LUCENE-10151: Some fixes to query timeouts.

2022-07-04 Thread GitBox


msokolov commented on code in PR #996:
URL: https://github.com/apache/lucene/pull/996#discussion_r913038692


##
lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:
##
@@ -2033,6 +2034,15 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   }
   ret.setSimilarity(classEnvRule.similarity);
   ret.setQueryCachingPolicy(MAYBE_CACHE_POLICY);
+  if (random().nextBoolean()) {

Review Comment:
   I wonder: if we enabled this randomly with a large timeout value, say 5 
minutes, that should never trigger an actual timeout during unit tests, would 
it exercise a different code path?






[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562216#comment-17562216
 ] 

ASF subversion and git services commented on LUCENE-10151:
--

Commit 81d4a7a69f1c9085e40df412be87de22d0aa8cd6 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=81d4a7a69f1 ]

LUCENE-10151: Some fixes to query timeouts. (#996)

I noticed some minor bugs in the original PR #927 that this PR should fix:
 - When a timeout is set, we would no longer catch
   `CollectionTerminatedException`.
 - I added randomization to `LuceneTestCase` to randomly set a timeout, it
   would have caught the above bug.
 - Fixed visibility of `TimeLimitingBulkScorer`.

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.






[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562219#comment-17562219
 ] 

Adrien Grand commented on LUCENE-10151:
---

bq. I've merged this now to main and backported to 9.x

Did you forget to push to branch_9x? I cannot see the change there.







[GitHub] [lucene] jpountz merged pull request #996: LUCENE-10151: Some fixes to query timeouts.

2022-07-04 Thread GitBox


jpountz merged PR #996:
URL: https://github.com/apache/lucene/pull/996





[GitHub] [lucene] stefanvodita opened a new pull request, #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests

2022-07-04 Thread GitBox


stefanvodita opened a new pull request, #1004:
URL: https://github.com/apache/lucene/pull/1004

   Replace all usages of `SortedSetDocValues.NO_MORE_ORDS` in tests and start 
using `SortedSetDocValues.docValueCount()`.
   
   Jira: https://issues.apache.org/jira/browse/LUCENE-10603
   





[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-04 Thread Stefan Vodita (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562230#comment-17562230
 ] 

Stefan Vodita commented on LUCENE-10603:


Hi Greg! I thought I'd help out. 
[Here|https://github.com/apache/lucene/pull/1004]'s a PR with the test changes.

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the implementation of ords iteration to use docValueCount instead 
> of NO_MORE_ORDS?
> This is similar to what SortedNumericDocValues does.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[GitHub] [lucene] jpountz opened a new pull request, #1005: LUCENE-10636: Avoid computing the same scores multiple times.

2022-07-04 Thread GitBox


jpountz opened a new pull request, #1005:
URL: https://github.com/apache/lucene/pull/1005

   `BlockMaxMaxscoreScorer` would previously compute the score twice for essential scorers.





[GitHub] [lucene] jpountz commented on pull request #1005: LUCENE-10636: Avoid computing the same scores multiple times.

2022-07-04 Thread GitBox


jpountz commented on PR #1005:
URL: https://github.com/apache/lucene/pull/1005#issuecomment-1174000758

   luceneutil on `wikimedium10m` seems to confirm that this gives a noticeable 
speedup to disjunctions:
   
   ```
   Task           QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                p-value
   AndHighLow          1621.76   (3.1%)                  1611.29   (2.5%)     -0.6% ( -6% -   5%)  0.476
   OrNotHighLow        1514.17   (2.9%)                  1507.60   (3.0%)     -0.4% ( -6% -   5%)  0.638
   AndHighMed           184.49   (3.9%)                   187.19   (5.2%)      1.5% ( -7% -  10%)  0.312
   OrNotHighMed        1442.80   (3.4%)                  1464.42   (4.6%)      1.5% ( -6% -   9%)  0.242
   OrHighNotLow        1877.95   (3.9%)                  1907.58   (5.3%)      1.6% ( -7% -  11%)  0.282
   AndHighHigh           84.87   (4.0%)                    86.25   (5.8%)      1.6% ( -7% -  11%)  0.301
   OrNotHighHigh       1025.32   (3.4%)                  1043.39   (2.9%)      1.8% ( -4% -   8%)  0.078
   OrHighNotHigh       1371.19   (3.7%)                  1400.07   (3.3%)      2.1% ( -4% -   9%)  0.057
   OrHighNotMed        1566.12   (3.8%)                  1601.33   (3.9%)      2.2% ( -5% -  10%)  0.064
   OrHighLow            788.64   (8.4%)                   845.63   (6.8%)      7.2% ( -7% -  24%)  0.003
   OrHighMed            178.01   (7.5%)                   193.39   (6.0%)      8.6% ( -4% -  24%)  0.000
   OrHighHigh            68.26  (11.9%)                    74.88   (9.6%)      9.7% (-10% -  35%)  0.004
   ```





[GitHub] [lucene] jpountz commented on a diff in pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-07-04 Thread GitBox


jpountz commented on code in PR #907:
URL: https://github.com/apache/lucene/pull/907#discussion_r900220890


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -1378,7 +1378,7 @@ private static Status.TermIndexStatus checkFields(
   computedFieldCount++;
 
   final Terms terms = fields.terms(field);
-  if (terms == null) {
+  if (terms == Terms.EMPTY) {

Review Comment:
   Let's remove this `if` block entirely?



##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -1378,7 +1378,7 @@ private static Status.TermIndexStatus checkFields(
   computedFieldCount++;
 
   final Terms terms = fields.terms(field);
-  if (terms == null) {
+  if (terms == Terms.EMPTY) {

Review Comment:
   Then let's fix codecs to return a `Terms` instance that has the correct 
values for `hasFreqs`, `hasOffsets`, `hasPositions` and `hasPayloads`? E.g. 
maybe you could add a new `Terms#empty(FieldInfo)` method that does the right 
thing based on the `FieldInfo` and leverage this method in postings formats?






[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562247#comment-17562247
 ] 

Adrien Grand commented on LUCENE-10151:
---

For reference, I opened new JIRA issues for suggested follow-ups: LUCENE-10640, 
LUCENE-10641.







[GitHub] [lucene] zacharymorn commented on a diff in pull request #1005: LUCENE-10636: Avoid computing the same scores multiple times.

2022-07-04 Thread GitBox


zacharymorn commented on code in PR #1005:
URL: https://github.com/apache/lucene/pull/1005#discussion_r913203251


##
lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java:
##
@@ -35,11 +34,13 @@ class BlockMaxMaxscoreScorer extends Scorer {
   // heap of scorers ordered by doc ID
   private final DisiPriorityQueue essentialsScorers;
 
-  // list of scorers ordered by maxScore
-  private final LinkedList maxScoreSortedEssentialScorers;
-
+  // array of scorers ordered by maxScore
   private final DisiWrapper[] allScorers;
 
+  // index of the first essential scorer is the `allScorers` array. All scorers before this index

Review Comment:
   ```suggestion
  // index of the first essential scorer in the `allScorers` array. All scorers before this index
   ```






[GitHub] [lucene] zacharymorn commented on a diff in pull request #1005: LUCENE-10636: Avoid computing the same scores multiple times.

2022-07-04 Thread GitBox


zacharymorn commented on code in PR #1005:
URL: https://github.com/apache/lucene/pull/1005#discussion_r913207313


##
lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java:
##
@@ -248,6 +251,17 @@ public long cost() {
 
   @Override
   public boolean matches() throws IOException {
+// Only sum up scores of non-essential scorers, essential scores were already folded into
+// the score.
+for (int i = 0; i < firstEssentialScorerIndex; ++i) {
+  DisiWrapper w = allScorers[i];
+  if (w.doc < doc) {
+w.doc = w.iterator.advance(doc);
+  }
+  if (w.doc == doc) {
+score += allScorers[i].scorer.score();
+  }
+}
 return score() >= minCompetitiveScore;

Review Comment:
   Nit: maybe just use `score` instead of `score()` here ?






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562266#comment-17562266
 ] 

ASF subversion and git services commented on LUCENE-10480:
--

Commit a5c99aca1abc9b73a0c68d4f23533311382b718c in lucene's branch 
refs/heads/branch_9x from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a5c99aca1ab ]

LUCENE-10480: Use BMM scorer for 2 clauses disjunction (#972) (#1002)

(cherry picked from commit 503ec5597331454bf8b6af79b9701cfdccf5)

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?






[GitHub] [lucene] zacharymorn merged pull request #1002: LUCENE-10480: (Backporting) Use BMM scorer for 2 clauses disjunction

2022-07-04 Thread GitBox


zacharymorn merged PR #1002:
URL: https://github.com/apache/lucene/pull/1002

