[GitHub] [lucene] mrkm4ntr closed pull request #12149: No need to try packing for singleton merge
mrkm4ntr closed pull request #12149: No need to try packing for singleton merge URL: https://github.com/apache/lucene/pull/12149 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tylerbertrand opened a new pull request, #12150: Gradle optimizations
tylerbertrand opened a new pull request, #12150: URL: https://github.com/apache/lucene/pull/12150

## Description

Ported recent build improvements from Apache Solr ([#1319](https://github.com/apache/solr/pull/1319) and [#1345](https://github.com/apache/solr/pull/1345)) with some modifications. In addition, the following build issues were addressed:

1. Task `validateSourcePatterns` was including the `.gradle` directory for both `buildSrc` and `dev-tools/missing-doclet` as inputs, causing this task not to be cacheable.
   * Resolved by excluding all `.gradle` directories as inputs to `validateSourcePatterns`, not just the `.gradle` directory for the root project.
2. With the changes to `copyTestResources`/`processTestResources`, the test `TestCustomAnalyzer.testStopWordsFromFile` was failing because `teststop.txt` was no longer at the root of the classes directory.
   * Resolved by following the convention set by other tests: providing the path to the file with `this.getDataPath("teststop.txt").toString()`

## Improvements

Improved overall build time:
* [Build scan timeline summary before changes](https://scans.gradle.com/s/wyryx5onlnlls#timeline)
* [Build scan timeline summary after changes](https://scans.gradle.com/s/qupifnuiysdfq#timeline)

Reduced the number of executed cacheable tasks (cache misses):
* [Build scan before changes](https://scans.gradle.com/s/wyryx5onlnlls/timeline?cacheability=cacheable&outcome=success,failed&sort=longest)
* [Build scan after changes](https://scans.gradle.com/s/qupifnuiysdfq/timeline?cacheability=cacheable&outcome=success,failed)

Increased the number of cacheable tasks:
* [Non-cacheable tasks before changes](https://scans.gradle.com/s/wyryx5onlnlls/timeline?cacheability=any-non-cacheable&outcome=success,failed&sort=longest)
* [Non-cacheable tasks after changes](https://scans.gradle.com/s/qupifnuiysdfq/timeline?cacheability=any-non-cacheable&outcome=success,failed)

Reduced the number of eagerly created tasks:
* [Eagerly created tasks before changes](https://scans.gradle.com/s/wyryx5onlnlls/performance/configuration#summary-task-created-immediately)
* [Eagerly created tasks after changes](https://scans.gradle.com/s/qupifnuiysdfq/performance/configuration#summary-task-created-immediately)
[GitHub] [lucene] tylerbertrand commented on issue #12145: port gradle improvements to Lucene
tylerbertrand commented on issue #12145: URL: https://github.com/apache/lucene/issues/12145#issuecomment-1431528200 Submitted [Gradle Optimizations PR](https://github.com/apache/lucene/pull/12150)
[GitHub] [lucene] gsmiller opened a new issue, #12151: Benchmark Current Approaches for TermInSetQuery Evaluation
gsmiller opened a new issue, #12151: URL: https://github.com/apache/lucene/issues/12151

### Description

Given recent efforts and discussions around "term in set" query implementations and performance, I wanted to "back up" a bit and simply document the performance characteristics of the _current_ implementations for "term in set" query clauses. I ran a series of ad hoc benchmarks with a simple benchmark "tool" (thanks @rmuir for the inspiration/foundation in the benchmark tooling): [TiSBench.java.txt](https://github.com/apache/lucene/files/10744456/TiSBench.java.txt)

The results, along with a description of the problem, the setup, and my thoughts on the results, are as follows. I'm going to immediately mark this issue as "resolved," but let's use it as a place to discuss/challenge the current performance characteristics (for anyone interested):

## Background

Some search use-cases require evaluating relatively large "term-in-set" clauses, testing whether or not a given doc matches at least one term in a provided set of terms (within a specific field). These clauses may be used as allow- or deny-lists, but do not contribute to the score of a doc. Here are two real-world, motivating scenarios:

First, imagine searching over a catalog of products, where products have been assigned a [UNSPSC](https://en.wikipedia.org/wiki/UNSPSC) categorization identifier. At query-time, we may need to restrict search results based on these codes, either filtering out or only including products found within certain category codes. The list of codes we're interested in including/excluding can be modeled as a term disjunction contained within a `FILTER` or `MUST_NOT` boolean clause. In this case, the relationship between a UNSPSC identifier and documents is one-to-many within our index.

Next, imagine building a dating app where users search over other user profiles for prospective dates. Users also have a way to indicate they are "not interested" in a specific profile, meaning that profile should be excluded from their results in future searches. Assuming we index a unique PK identifier with each profile doc, we can represent this "block list" semantic with a boolean `MUST_NOT` clause term disjunction provided with each query. Where this use-case differs from the UNSPSC case is in the relationship of profile IDs to docs. While UNSPSC to doc is a one-to-many relationship, profile IDs to docs are a strict one-to-one relationship.

Of course, it should be mentioned that both of these problems could be "inverted" if the allow-/deny-lists are static to the user, by pre-computing which users should _not_ see any given document and indexing a list of "blocked users" for each one. At query-time, we no longer need a set of terms, but just need to pass the user's unique ID. This approach has some downsides: 1) block-lists can be slow to update depending on the index update pipeline, 2) in cases like UNSPSC, a single update to a block-list will result in many documents needing to be updated, 3) the index size grows with respect to the number of users/block-lists in the system, which may not be a reasonable scaling factor depending on what drives the business. For these reasons, we _assume_ it is a reasonable use-case to want to provide relatively large "term in set" clauses at query-time.

These use-cases can be served through at least three different, functionally equivalent, implementations:

1. A standard `BooleanQuery` disjunction clause, which manages a priority queue of individual term postings and does "doc at a time" scoring. The downside of this approach tends to be the cost overhead of PQ management.
2. A `TermInSetQuery` clause, which fully iterates all term postings up-front and builds an on-heap bitset representing the result of the disjunction, effectively doing "term at a time" scoring. The downside of this approach is that it must fully iterate all the postings, and can't take advantage of skipping.
3. A second-phase doc values filter, which requires values to also be indexed in a columnar-store DV field and checks every "candidate" doc in a post-filtering type approach (i.e., `SortedSetDocValuesField#newSlowSetQuery`). The downside of this approach is that it must check every candidate, and can't take advantage of skipping.

`IndexOrDocValuesQuery` can also be used to "wrap" an "index query" (one of `BooleanQuery` or `TermInSetQuery`) and a "doc values" query (`SortedSetDocValuesField#newSlowSetQuery`). This query compares the estimated lead cost of the overall query to the estimated cost of the index query to decide whether a postings-based approach or a doc-values approach would be more likely to perform best.

In an effort to determine the relative performance characteristics of these different approaches, I ran benchmarks over various different use-cases.
[GitHub] [lucene] gsmiller closed issue #12151: Benchmark Current Approaches for TermInSetQuery Evaluation
gsmiller closed issue #12151: Benchmark Current Approaches for TermInSetQuery Evaluation URL: https://github.com/apache/lucene/issues/12151
[GitHub] [lucene] dweiss commented on a diff in pull request #12150: Gradle optimizations
dweiss commented on code in PR #12150: URL: https://github.com/apache/lucene/pull/12150#discussion_r1107408873

## gradle/validation/jar-checks.gradle:

```diff
@@ -231,7 +238,8 @@ subprojects {
       }
     }
   }
-
+  def f = new File(project.buildDir.path + "/" + outputFileName)
+  f.text = errors
```

Review Comment: This uses the local encoding. It should use UTF-8.
[GitHub] [lucene] dweiss commented on pull request #12150: Gradle optimizations
dweiss commented on PR #12150: URL: https://github.com/apache/lucene/pull/12150#issuecomment-1431703638

I don't like how Gradle requires that extra verbosity (command line providers) for the sake of achieving optimal task-caching behavior, but I'm fine with these changes and I think I understand what's going on.
[GitHub] [lucene] uschindler commented on a diff in pull request #12150: Gradle optimizations
uschindler commented on code in PR #12150: URL: https://github.com/apache/lucene/pull/12150#discussion_r1107482290

## gradle/validation/jar-checks.gradle:

```diff
@@ -231,7 +238,8 @@ subprojects {
       }
     }
   }
-
+  def f = new File(project.buildDir.path + "/" + outputFileName)
+  f.text = errors
```

Review Comment: This was also part of the Solr PR and was overlooked there. In general I don't understand why this was added at all. What is the output file used for?
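For context on the encoding concern raised in this review thread: Groovy's `f.text = errors` setter writes with the platform default charset, so output can vary between machines. A minimal Java sketch of the explicit-charset alternative (class, method, and file names here are illustrative, not the actual build code):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReportWriter {
    // Hypothetical stand-in for writing the jar-checks error report.
    // Writing with an explicit charset keeps the bytes on disk stable
    // regardless of the JVM's file.encoding default.
    static String roundTrip(String errors) throws Exception {
        Path out = Files.createTempFile("jar-checks", ".txt");
        Files.writeString(out, errors, StandardCharsets.UTF_8);
        return Files.readString(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Non-ASCII content survives the round trip with UTF-8 forced.
        System.out.println(roundTrip("licence check: naïve — UTF-8 safe"));
    }
}
```

In the Groovy build script the equivalent fix would be passing the charset explicitly (e.g. `f.setText(errors, "UTF-8")` rather than the bare `f.text =` assignment).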
[GitHub] [lucene] jtibshirani merged pull request #12148: Improve DocAndScoreQuery#toString
jtibshirani merged PR #12148: URL: https://github.com/apache/lucene/pull/12148
[GitHub] [lucene] dweiss commented on pull request #12150: Gradle optimizations
dweiss commented on PR #12150: URL: https://github.com/apache/lucene/pull/12150#issuecomment-1431912086

An output file is required for caching - otherwise there is no link between inputs and outputs (up-to-date checks). A task without outputs is never skipped in incremental builds (it's never 'up to date').
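The point about outputs being required for up-to-date checks can be sketched with a minimal Gradle (Groovy DSL) task registration. This is a hypothetical illustration, not the actual jar-checks script; the task, input, and file names are made up:

```groovy
// With inputs declared but NO outputs, Gradle has nothing to fingerprint
// on the output side, so the task is never "UP-TO-DATE" and never cached.
tasks.register("checkJars") {
  inputs.files(configurations.runtimeClasspath)

  // Declaring an output file closes the input->output link: Gradle can now
  // skip the task when nothing changed, and store/restore it from the cache.
  def reportFile = layout.buildDirectory.file("jar-checks.txt")
  outputs.file(reportFile)

  doLast {
    // Write with an explicit charset rather than the platform default.
    reportFile.get().asFile.setText("no errors", "UTF-8")
  }
}
```

This is why the reviewed change writes `errors` to a file at all: the file is not consumed by anything downstream, it exists so the validation task has a declared output for incremental builds and the build cache.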