[GitHub] [lucene] mrkm4ntr closed pull request #12149: No need to try packing for singleton merge

2023-02-15 Thread via GitHub


mrkm4ntr closed pull request #12149: No need to try packing for singleton merge
URL: https://github.com/apache/lucene/pull/12149


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tylerbertrand opened a new pull request, #12150: Gradle optimizations

2023-02-15 Thread via GitHub


tylerbertrand opened a new pull request, #12150:
URL: https://github.com/apache/lucene/pull/12150

   ## Description
   
   
   
   Ported recent build improvements from Apache Solr 
([#1319](https://github.com/apache/solr/pull/1319) and 
[#1345](https://github.com/apache/solr/pull/1345)) with some modifications.
   
   In addition, the following build issues were addressed:
   1. Task `validateSourcePatterns` was including the `.gradle` directory for 
both `buildSrc` and `dev-tools/missing-doclet` as inputs, causing this task to 
not be cacheable.
  * Resolved by excluding all `.gradle` directories as inputs to 
`validateSourcePatterns`, not just the `.gradle` directory for the root project.
   2. With the changes to `copyTestResources`/`processTestResources`, the test 
`TestCustomAnalyzer.testStopWordsFromFile` was failing because `teststop.txt` 
was no longer at the root of the classes directory.
  * Resolved by following the convention set by other tests: providing the 
path to the file with `this.getDataPath("teststop.txt").toString()`
   
   ## Improvements
   
   Improved overall build time:
 * [Build scan timeline summary before 
changes](https://scans.gradle.com/s/wyryx5onlnlls#timeline)
 * [Build scan timeline summary after 
changes](https://scans.gradle.com/s/qupifnuiysdfq#timeline)
   
   Reduced number of executed cacheable tasks (cache misses):
 * [Build scan before 
changes](https://scans.gradle.com/s/wyryx5onlnlls/timeline?cacheability=cacheable&outcome=success,failed&sort=longest)
 * [Build scan after 
changes](https://scans.gradle.com/s/qupifnuiysdfq/timeline?cacheability=cacheable&outcome=success,failed)
   
   Improved the number of cacheable tasks:
 * [Non-cacheable tasks before 
changes](https://scans.gradle.com/s/wyryx5onlnlls/timeline?cacheability=any-non-cacheable&outcome=success,failed&sort=longest)
 * [Non-cacheable tasks after 
changes](https://scans.gradle.com/s/qupifnuiysdfq/timeline?cacheability=any-non-cacheable&outcome=success,failed)
   
   Reduced the number of eagerly created tasks:
 * [Eagerly created tasks before 
changes](https://scans.gradle.com/s/wyryx5onlnlls/performance/configuration#summary-task-created-immediately)
 * [Eagerly created tasks after 
changes](https://scans.gradle.com/s/qupifnuiysdfq/performance/configuration#summary-task-created-immediately)
   





[GitHub] [lucene] tylerbertrand commented on issue #12145: port gradle improvements to Lucene

2023-02-15 Thread via GitHub


tylerbertrand commented on issue #12145:
URL: https://github.com/apache/lucene/issues/12145#issuecomment-1431528200

   Submitted [Gradle Optimizations 
PR](https://github.com/apache/lucene/pull/12150)





[GitHub] [lucene] gsmiller opened a new issue, #12151: Benchmark Current Approaches for TermInSetQuery Evaluation

2023-02-15 Thread via GitHub


gsmiller opened a new issue, #12151:
URL: https://github.com/apache/lucene/issues/12151

   ### Description
   
   Given recent efforts and discussions around "term in set" query 
implementations and performance, I wanted to "back up" a bit and simply 
document the performance characteristics of the _current_ implementations for 
"term in set" query clauses.
   
   I ran a series of ad hoc benchmarks with a simple benchmark "tool" (thanks 
@rmuir for the inspiration/foundation in the benchmark tooling): 
[TiSBench.java.txt](https://github.com/apache/lucene/files/10744456/TiSBench.java.txt)
   
   The results, along with a description of the problem, setup and my thoughts 
on the results are as follows. I'm going to immediately mark this issue as 
"resolved," but let's use it as a place to discuss/challenge the current 
performance characteristics (for anyone interested):
   
   ## Background
   Some search use-cases require evaluating relatively large "term-in-set" 
clauses, testing whether-or-not a given doc
   matches at least one term in a provided set of terms (within a specific 
field). These clauses may be used as allow-
   or deny-lists, but do not contribute to the score of a doc. Here are two 
real-world, motivating scenarios:
   
   First, imagine searching over a catalog of products, where products have 
been assigned a
   [UNSPSC](https://en.wikipedia.org/wiki/UNSPSC) categorization identifier. At 
query-time, we may need to restrict search 
   results based on these codes, either filtering out or only including 
products found within certain category codes. The
   list of codes we're interested in including/excluding can be modeled as a 
term disjunction contained within a `FILTER`
   or `MUST_NOT` boolean clause. In this case, the relationship between a 
UNSPSC identifier and documents is one-to-many
   within our index.
   
   Next, imagine building a dating app where users search over other user 
profiles for prospective dates. Users also have
   a way to indicate they are "not interested" in a specific profile, meaning 
that profile should be excluded from their
   results in future searches. Assuming we index a unique PK identifier with 
each profile doc, we can represent this
   "block list" semantic with a boolean `MUST_NOT` clause term disjunction 
provided with each query. Where this use-case 
   differs from the UNSPSC case is in the relationship of profile IDs to docs. 
While UNSPSC to doc is a one-to-many 
   relationship, profile ID to doc is a strict one-to-one relationship.
   
   Of course, it should be mentioned that both of these problems could be 
"inverted" if the allow-/deny-lists are static
   to the user by pre-computing which users should _not_ see any given document 
and indexing a list of "blocked users"
   for each one. At query-time, we no longer need a set of terms, but just need 
to pass the user's unique ID. This
   approach has some downsides: 1) block-lists can be slow to update depending 
on the index update pipeline, 2) in cases
   like UNSPSC, a single update to a block-list will result in many documents 
needing to be updated, 3) the index size
   grows with-respect-to the number of users/block-lists in the system, which 
may not be a reasonable scaling factor
   depending on what drives the business. For these reasons, we _assume_ it is 
a reasonable use-case to want to provide
   relatively large lists of "term in set" clauses at query-time.
   
   These use-cases can be served through at least three different, functionally 
equivalent, implementations:
   1. A standard `BooleanQuery` disjunction clause, which manages a priority 
queue of individual term postings and does
   "doc at a time" scoring. The downside of this approach tends to be the cost 
overhead of PQ management.
   2. A `TermInSetQuery` clause, which fully iterates all term postings 
up-front and builds an on-heap bitset representing
   the result of the disjunction, effectively doing "term at a time" scoring. 
The downside of this approach is that it
   must fully iterate all the postings, and can't take advantage of skipping.
   3. A second-phase doc values filter, which requires values to also be 
indexed in a columnar store DV field and checks
   every "candidate" doc in a post-filtering type approach (i.e., 
`SortedSetDocValuesField#newSlowSetQuery`). The
   downside of this approach is that it must check every candidate, and can't 
take advantage of skipping.
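
   The three strategies above can be sketched with a toy in-memory model. This is only an illustration, not Lucene code: postings are modeled as sorted `int[]` doc-ID lists keyed by term, doc values as a per-doc set of strings, and the class and method names (`TermInSetSketch`, `docAtATime`, `termAtATime`, `docValuesFilter`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Toy model of the three strategies; names and data layout are assumptions, not Lucene APIs. */
class TermInSetSketch {

  /** Strategy 1: doc-at-a-time merge of the matching posting lists through a priority queue. */
  static List<Integer> docAtATime(Map<String, int[]> postings, Set<String> terms) {
    // Each queue entry is {currentDoc, listIndex, offset}, ordered by currentDoc.
    PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
    List<int[]> lists = new ArrayList<>();
    for (String term : terms) {
      int[] p = postings.get(term);
      if (p != null && p.length > 0) {
        pq.add(new int[] {p[0], lists.size(), 0});
        lists.add(p);
      }
    }
    List<Integer> hits = new ArrayList<>();
    while (!pq.isEmpty()) {
      int[] top = pq.poll();
      // Several terms may match the same doc; emit each doc once.
      if (hits.isEmpty() || hits.get(hits.size() - 1) != top[0]) {
        hits.add(top[0]);
      }
      int[] p = lists.get(top[1]);
      int next = top[2] + 1;
      if (next < p.length) {
        pq.add(new int[] {p[next], top[1], next}); // the PQ upkeep is the overhead noted above
      }
    }
    return hits;
  }

  /** Strategy 2: term-at-a-time; every posting list is fully iterated into a bitset up front. */
  static BitSet termAtATime(Map<String, int[]> postings, Set<String> terms) {
    BitSet bits = new BitSet();
    for (String term : terms) {
      for (int doc : postings.getOrDefault(term, new int[0])) {
        bits.set(doc);
      }
    }
    return bits;
  }

  /** Strategy 3: post-filter; every candidate doc's own values are checked against the set. */
  static List<Integer> docValuesFilter(
      int[] candidates, Map<Integer, Set<String>> docValues, Set<String> terms) {
    List<Integer> hits = new ArrayList<>();
    for (int doc : candidates) {
      Set<String> values = docValues.getOrDefault(doc, Set.of());
      if (!Collections.disjoint(values, terms)) {
        hits.add(doc);
      }
    }
    return hits;
  }
}
```

   The first two methods return the same doc set for the same terms. The third returns only the candidates that match, so its cost follows the number of candidates rather than the length of the posting lists.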
   
   `IndexOrDocValuesQuery` can also be used to "wrap" an "index query" (one of 
`BooleanQuery` or `TermInSetQuery`) and a
   "doc values" query (`SortedSetDocValuesField#newSlowSetQuery`). This query 
compares the estimated lead cost to the
   estimated cost of the index query to decide if a posting-approach or a 
docvalues-approach would be more likely
   to perform best.
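
   The cost comparison can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the Lucene implementation: the class `CostBasedChoice`, the method `choose`, and the `slack` divisor are hypothetical, and the exact threshold Lucene applies is not given in this thread.

```java
/** Hypothetical sketch of a lead-cost vs. index-query-cost decision. */
class CostBasedChoice {
  enum Approach { POSTINGS, DOC_VALUES }

  /**
   * @param leadCost estimated number of candidate docs the rest of the query will visit
   * @param indexQueryCost estimated number of docs the index (postings) query would produce
   * @param slack assumed divisor biasing the choice; must be positive
   */
  static Approach choose(long leadCost, long indexQueryCost, int slack) {
    long threshold = indexQueryCost / slack;
    // Many candidates relative to the index query: evaluate the postings once.
    // Few candidates: checking each one against doc values is likely cheaper.
    return threshold <= leadCost ? Approach.POSTINGS : Approach.DOC_VALUES;
  }
}
```

   For example, with `slack = 8` an index query estimated at 8,000 docs is verified through doc values while the lead cost stays below 1,000, and is evaluated through postings from 1,000 upward.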
   
   In an effort to determine the relative performance characteristics of these 
different approaches, I ran benchmarks over
   a range of use-cases.

[GitHub] [lucene] gsmiller closed issue #12151: Benchmark Current Approaches for TermInSetQuery Evaluation

2023-02-15 Thread via GitHub


gsmiller closed issue #12151: Benchmark Current Approaches for TermInSetQuery 
Evaluation
URL: https://github.com/apache/lucene/issues/12151





[GitHub] [lucene] dweiss commented on a diff in pull request #12150: Gradle optimizations

2023-02-15 Thread via GitHub


dweiss commented on code in PR #12150:
URL: https://github.com/apache/lucene/pull/12150#discussion_r1107408873


##
gradle/validation/jar-checks.gradle:
##
@@ -231,7 +238,8 @@ subprojects {
   }
 }
   }
-
+  def f = new File(project.buildDir.path + "/" + outputFileName)
+  f.text = errors

Review Comment:
   This writes the file using the platform default encoding. It should use UTF-8 explicitly.
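
   A minimal sketch of an encoding-explicit write, shown in plain Java so it can be run outside the build. The class `Utf8ReportWriter` and method `writeReport` are hypothetical names and not part of the PR; a Gradle script can call the same JDK API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes the report with an explicit charset instead of the platform default. */
class Utf8ReportWriter {
  static void writeReport(Path file, String errors) throws IOException {
    // Encoding the string ourselves keeps the bytes on disk identical on every machine.
    Files.write(file, errors.getBytes(StandardCharsets.UTF_8));
  }
}
```

   Fixing the charset makes the file contents independent of the machine's locale settings, which matters if the file is treated as a task output for caching.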






[GitHub] [lucene] dweiss commented on pull request #12150: Gradle optimizations

2023-02-15 Thread via GitHub


dweiss commented on PR #12150:
URL: https://github.com/apache/lucene/pull/12150#issuecomment-1431703638

   I don't like how gradle requires that extra verbosity (command line 
providers) for the sake of achieving the optimal task-caching behavior, but I'm 
fine with these changes and I think I understand what's going on.





[GitHub] [lucene] uschindler commented on a diff in pull request #12150: Gradle optimizations

2023-02-15 Thread via GitHub


uschindler commented on code in PR #12150:
URL: https://github.com/apache/lucene/pull/12150#discussion_r1107482290


##
gradle/validation/jar-checks.gradle:
##
@@ -231,7 +238,8 @@ subprojects {
   }
 }
   }
-
+  def f = new File(project.buildDir.path + "/" + outputFileName)
+  f.text = errors

Review Comment:
   This was also part of the Solr PR and was overlooked there.
   
   In general, I don't understand why this was added at all. What is the output 
file used for?






[GitHub] [lucene] jtibshirani merged pull request #12148: Improve DocAndScoreQuery#toString

2023-02-15 Thread via GitHub


jtibshirani merged PR #12148:
URL: https://github.com/apache/lucene/pull/12148





[GitHub] [lucene] dweiss commented on pull request #12150: Gradle optimizations

2023-02-15 Thread via GitHub


dweiss commented on PR #12150:
URL: https://github.com/apache/lucene/pull/12150#issuecomment-1431912086

   An output file is required for caching - otherwise there is no link between 
inputs-and-outputs (up-to-date checks). A task without outputs is never skipped 
in incremental builds (it's never 'up to date').
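
   The rule can be modeled in a few lines. This is a deliberately simplified sketch and not Gradle's implementation: the class `UpToDateSketch`, its `execute` method, and the use of a list hash as the input fingerprint are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an up-to-date check; names and fingerprinting are hypothetical. */
class UpToDateSketch {
  // Input fingerprint recorded after the previous execution of each task.
  private final Map<String, Integer> previous = new HashMap<>();

  /** Returns true if the task action had to run, false if it was skipped as up to date. */
  boolean execute(String task, List<String> inputContents, boolean declaresOutputs) {
    if (!declaresOutputs) {
      // Without outputs there is nothing to tie the inputs to, so the task always runs.
      return true;
    }
    int fingerprint = inputContents.hashCode();
    Integer before = previous.put(task, fingerprint);
    return before == null || before != fingerprint;
  }
}
```

   In this model a task that declares outputs runs on the first call, is skipped on the second call with unchanged inputs, and runs again once an input changes. A task that declares no outputs runs on every call.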

