[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: Explore TermInSet Query that "self optimizes"

2023-02-04 Thread via GitHub
gsmiller commented on code in PR #12089: URL: https://github.com/apache/lucene/pull/12089#discussion_r1072871141 ## lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java: ## @@ -57,4 +57,14 @@ public DisiWrapper(Scorer scorer) { matchCost = 0f; } } + + p

[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416767759 @rmuir I modified the PR to update the existing `TermInSetQuery` in-place, instead of introducing a new sandbox concept. I've added some test coverage and re-ran all the benchmarks. The

[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416779603 Hmm, @gsmiller I really personally don't like all this complexity and optimizations mixed into `TermInSetQuery`. It becomes almost unmanageable. This is why I like the IndexOrDocValu

[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416780926 and when i look at the benchmarks, there are cases where IndexOrDV does better, and there are cases where this one does worse. I'm not seeing a "clear winner", so i'd personally lean towa

[GitHub] [lucene] almogtavor commented on issue #11870: Create a Markdown based documentation

2023-02-04 Thread via GitHub
almogtavor commented on issue #11870: URL: https://github.com/apache/lucene/issues/11870#issuecomment-1416785510 @rmuir @dweiss @uschindler Do you think that a PR in this subject that creates the initial Docusaurus website with some basic Getting Started sections will get accepted? I genera

[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416786686 another way to say it: I see a lot more optimizations and conditionals in the query than I see wins in the benchmark? which of these truly matter? Can we remove some to simplify the

[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416790181 i'll work up a draft PR trying to work it from the opposite direction (keep queries split apart) so my suggestions will be more concrete. You did the hard part already and made a benchmark

[GitHub] [lucene] rmuir opened a new pull request, #12127: Remove useless abstractions in DocValues-based queries

2023-02-04 Thread via GitHub
rmuir opened a new pull request, #12127: URL: https://github.com/apache/lucene/pull/12127 There's no need to make things abstract: DocValues does the right thing Optimizing for where no docs for the field in the segment exist is easy, simple null check (replacing the existing one!) --

[GitHub] [lucene] rmuir commented on pull request #12127: Remove useless abstractions in DocValues-based queries

2023-02-04 Thread via GitHub
rmuir commented on PR #12127: URL: https://github.com/apache/lucene/pull/12127#issuecomment-1416834391 one nice benefit is, not all of these N-subclasses were checking for the case where field doesn't exist in segment and optimizing it by returning `null` Scorer, now the case is always opti

[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416839513 I think some of the benchmarks needs to be revisited here anyway. `DocValuesTermsQuery` is not in great shape, which may lead to bad results from IndexOrDocValuesQuery and cause confusion.

[GitHub] [lucene] rmuir opened a new pull request, #12128: Speed up docvalues set query by making use of sortedness

2023-02-04 Thread via GitHub
rmuir opened a new pull request, #12128: URL: https://github.com/apache/lucene/pull/12128 LongHashSet is used for the set of numbers, but it has some issues: * tries too hard to extend AbstractSet, mostly for testing * causes traps with boxing if you aren't careful * complex hashcode

[GitHub] [lucene] rmuir opened a new pull request, #12129: Speedup sandbox/DocValuesTermsQuery

2023-02-04 Thread via GitHub
rmuir opened a new pull request, #12129: URL: https://github.com/apache/lucene/pull/12129 * Optimize the common case that docs only have single values for the field * In the multivalued case, terminate reading docvalues if they are > maximum set ordinal * Implement ScorerSupplier, so t

[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416888408 I fired up 3 related PRs to at least make it more of a fair fight :) #12127 #12128 #12129 -- This is an automated message from the Apache Git Service. To respond to t

[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
rmuir commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416891188 > Anyway, back to the point about complexity vs. benefit, I 100% agree that relying on `IndexOrDocValues` would be preferable if we can solve for cost over-estimating. I'd pursued this pat

[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-04 Thread via GitHub
gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416895938 Thanks @rmuir! I'll have a look at the PRs tomorrow. I've got a bad case of jet lag kicking in at the moment and my brain is turning to mush. In the meantime, I took a pass at simplifyi

[GitHub] [lucene] rmuir commented on pull request #12129: Speedup sandbox/DocValuesTermsQuery

2023-02-04 Thread via GitHub
rmuir commented on PR #12129: URL: https://github.com/apache/lucene/pull/12129#issuecomment-1416909246 moved this thing to `Sorted(Set)DocValuesField.newSlowSetQuery` as it is a much better API to document its slowness safe;y and to encapsulate the guts. it does not yet claim to help

[GitHub] [lucene] rmuir commented on pull request #12129: Speedup sandbox/DocValuesTermsQuery

2023-02-04 Thread via GitHub
rmuir commented on PR #12129: URL: https://github.com/apache/lucene/pull/12129#issuecomment-1416915982 Another use-case this works with is for the IP address types, which is really just some sugar over `BinaryPoint`: `IndexOrDocValuesQuery(InetAddressPoint.newSetQuery(..), SortedSetDocValu