[GitHub] [lucene] benwtrent opened a new pull request, #12130: Fix TestFeatureField#testBasicsNonScoringCase test

2023-02-06 Thread via GitHub


benwtrent opened a new pull request, #12130:
URL: https://github.com/apache/lucene/pull/12130

   Sometimes the randomized test searcher in Lucene's test framework wraps the 
reader. Consequently, we need to use the reader provided by the test 
`IndexSearcher`; otherwise the reader used when creating the weight with the 
searcher may differ from the one used to access the leaf context for the scorer.
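
A minimal sketch of this failure mode, using stand-in types. `Reader`, `PlainReader`, `WrappingReader` and `Searcher` are hypothetical and only mimic a searcher that may wrap the reader it is given; they are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code) of a test searcher that may wrap the reader it is given. */
final class ReaderWrapSketch {
  interface Reader {
    List<String> leaves();
  }

  static final class PlainReader implements Reader {
    private final List<String> segments;

    PlainReader(List<String> segments) {
      this.segments = segments;
    }

    @Override
    public List<String> leaves() {
      return segments;
    }
  }

  /** Wrapper whose leaves are distinct objects from those of the wrapped reader. */
  static final class WrappingReader implements Reader {
    private final Reader in;

    WrappingReader(Reader in) {
      this.in = in;
    }

    @Override
    public List<String> leaves() {
      List<String> wrapped = new ArrayList<>();
      for (String leaf : in.leaves()) {
        wrapped.add("wrapped(" + leaf + ")");
      }
      return wrapped;
    }
  }

  /** Stand-in for the randomized test searcher: it sometimes wraps the reader. */
  static final class Searcher {
    private final Reader reader;

    Searcher(Reader original, boolean wrap) {
      this.reader = wrap ? new WrappingReader(original) : original;
    }

    /** The reader the searcher actually searches; leaves should be taken from here. */
    Reader getReader() {
      return reader;
    }
  }

  public static void main(String[] args) {
    Reader original = new PlainReader(List.of("seg0", "seg1"));
    Searcher searcher = new Searcher(original, true);
    System.out.println(original.leaves());             // leaves of the reader we opened
    System.out.println(searcher.getReader().leaves()); // leaves the searcher works with
  }
}
```

In Lucene itself the corresponding accessor is `IndexSearcher#getIndexReader()`; taking leaf contexts from it keeps the weight and the scorer on the same reader.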


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-06 Thread via GitHub


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419161514

   @rmuir I grabbed your patch for adding a `ScorerSupplier` to 
`DocValuesTermsQuery` (#12129) and reran benchmarks. The gap between IndexOrDV 
and the "self-optimizing" TermInSetQuery has closed with this change. It looks 
like I was wrong about the way IndexOrDV plans PK-type queries. I thought it 
was choosing to use doc values based on what I saw in profiler output, but what 
I was really seeing was the up-front ordinal lookups in `DocValuesTermsQuery` 
as a result of not having the `ScorerSupplier` abstraction. With your patch, 
that goes away.
   
   The only gap that remains now is when the field is _not_ a PK-style field 
but the terms being used in the disjunction have a low aggregate cost (relative 
to the other terms in the field; e.g., `Medium Cardinality + Low Cost Country 
Code Filter Terms`). In this case, IndexOrDV is always using doc values (due to 
the field-level stats used for cost), but—by doing some term-seeking—we could 
better decide to use postings. 
   
   Here are updated benchmark results: 
[TiSBenchResults_Simplified_DVSSPatch.md.txt](https://github.com/apache/lucene/files/10663766/TiSBenchResults_Simplified_DVSSPatch.md.txt)
   (Note that "low cardinality" cases are kind of terrible still because the 
TiSQuery is being rewritten to a BooleanQuery)
   
   > to me the issue is a problem with TermInSetQuery ScorerSupplier cost method
   
   +1. Maybe there's a way to address this remaining gap by being smarter about 
the cost function without term-seeking? That would be ideal.
   
   I also played around with the idea of a "cost iterator" abstraction on 
`ScorerSupplier` as a way to allow something like `TermInSetQuery` to provide 
incremental costs to `IndexOrDocValuesQuery` as it term-seeks. This feels 
clunky to me, and I'm not proposing it as a "good idea" right now, but I'll 
share it as another approach. I was able to get comparable benchmark results 
with this technique, and it still allows `IndexOrDocValuesQuery` to "own" the 
decision between postings and doc values: 
https://github.com/apache/lucene/compare/main...gsmiller:lucene:explore/tis-score-supplier-cost-iterator.
 Benchmark results for this approach are here: 
[TiSBenchResults_SSIterator.md.txt](https://github.com/apache/lucene/files/10663947/TiSBenchResults_SSIterator.md.txt).
 It feels overly complicated though.
   





[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-06 Thread via GitHub


rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419214500

   That's good that it made progress. I will look more into it tonight. I want 
to get these patches landed to simplify benchmarking.
   
   It's true there is one benchmark where this combined query does better 
("Medium Cardinality + Low Cost Country Code Filter Terms") but there is also 
one benchmark where it does substantially worse ("Low Cardinality + High Cost 
Country Code Filter Terms").
   
   So net/net I would say they are equivalent. But I will look into this case 
to see if we can still do better.





[GitHub] [lucene] benwtrent merged pull request #12130: Fix TestFeatureField#testBasicsNonScoringCase test

2023-02-06 Thread via GitHub


benwtrent merged PR #12130:
URL: https://github.com/apache/lucene/pull/12130





[GitHub] [lucene] uschindler opened a new pull request, #12131: Port over gradle setting generator from Solr

2023-02-06 Thread via GitHub


uschindler opened a new pull request, #12131:
URL: https://github.com/apache/lucene/pull/12131

   In Apache Solr we improved the local settings generation so that it is done 
directly in gradlew startup (similar to the gradle downloader).
   
   This has several positive effects:
   - We can do our GitHub CI and Jenkins checks in one go, as the file is now 
generated before gradle even starts, so the build will succeed on first run.
   - The template file is editable by committers without going into script 
files. The number of processors for threads is inserted by templating.
   
   See https://github.com/apache/solr/pull/1320 and 
https://issues.apache.org/jira/browse/SOLR-16641 for details.
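
A rough sketch of the templating step described above. The `@MAX_WORKERS@` placeholder name, the "half of the CPUs" rule and the class name are assumptions made for illustration; only `org.gradle.workers.max` is a real Gradle property.

```java
/** Hypothetical sketch of filling a settings template before Gradle starts. */
final class GradlePropertiesTemplateSketch {
  /** Replaces the (assumed) @MAX_WORKERS@ placeholder with a worker count derived from the CPU count. */
  static String render(String template, int cpus) {
    int workers = Math.max(1, cpus / 2); // assumption: use about half of the CPUs, at least one
    return template.replace("@MAX_WORKERS@", Integer.toString(workers));
  }

  public static void main(String[] args) {
    String template = "org.gradle.workers.max=@MAX_WORKERS@\n";
    System.out.print(render(template, Runtime.getRuntime().availableProcessors()));
  }
}
```

Keeping the defaults in a plain template file means that changing them requires no edits to the launcher scripts.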





[GitHub] [lucene] jpountz commented on a diff in pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-06 Thread via GitHub


jpountz commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1097579916


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -380,21 +431,28 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
 // cost estimates.
 final long cost;
 final long queryTermsCount = termData.size();
-long potentialExtraCost = indexTerms.getSumDocFreq();
+final long sumDocFreq = indexTerms.getSumDocFreq();
+long potentialExtraCost = sumDocFreq;
 final long indexedTermCount = indexTerms.size();
 if (indexedTermCount != -1) {
   potentialExtraCost -= indexedTermCount;
 }
 cost = queryTermsCount + potentialExtraCost;
 
+final boolean isPrimaryKeyField = indexedTermCount != -1 && sumDocFreq == indexedTermCount;

Review Comment:
   Since `terms.size()` is an optional index statistic, maybe check `sumDocFreq 
== docCount` instead?
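
For reference, the arithmetic under discussion can be sketched in plain Java. The class and method names are illustrative, and the statistics are passed in as plain numbers rather than read from `Terms`. Note that the two checks are related but not identical: `sumDocFreq == indexedTermCount` holds when every term occurs in exactly one document, while `sumDocFreq == docCount` holds when every document that has the field has exactly one term.

```java
/** Sketch of the per-segment cost estimate and the two heuristics discussed in this review. */
final class TermInSetCostSketch {
  /** Mirrors the arithmetic in the hunk; an indexedTermCount of -1 means "not available". */
  static long estimateCost(long queryTermsCount, long sumDocFreq, long indexedTermCount) {
    long potentialExtraCost = sumDocFreq;
    if (indexedTermCount != -1) {
      potentialExtraCost -= indexedTermCount;
    }
    return queryTermsCount + potentialExtraCost;
  }

  /** Check from the diff: every indexed term occurs in exactly one document. */
  static boolean uniqueTerms(long sumDocFreq, long indexedTermCount) {
    return indexedTermCount != -1 && sumDocFreq == indexedTermCount;
  }

  /** Check suggested in the review: every document with the field has exactly one term. */
  static boolean singleTermPerDoc(long sumDocFreq, long docCount) {
    return sumDocFreq == docCount;
  }

  public static void main(String[] args) {
    // Primary-key field: 100 docs, 100 distinct ids. Both checks hold.
    System.out.println(uniqueTerms(100, 100) + " " + singleTermPerDoc(100, 100));
    // Single-valued country code: 100 docs, 5 distinct terms. Only the docCount check holds.
    System.out.println(uniqueTerms(100, 5) + " " + singleTermPerDoc(100, 100));
  }
}
```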



##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -258,13 +271,41 @@ public Matches matches(LeafReaderContext context, int doc) throws IOException {
* On the given leaf context, try to either rewrite to a disjunction if there are few matching
* terms, or build a bitset containing matching docs.
*/
-  private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
+  private WeightOrDocIdSet rewrite(
+  LeafReaderContext context, long leadCost, boolean isPrimaryKeyField, DocValuesType dvType)
+  throws IOException {
 final LeafReader reader = context.reader();
 
 Terms terms = reader.terms(field);
 if (terms == null) {
   return null;
 }
+
+long costThreshold = Long.MAX_VALUE;
+if (dvType == DocValuesType.SORTED || dvType == DocValuesType.SORTED_SET) {
+  // Establish a threshold for switching to doc values. Give postings a significant
+  // advantage for the primary-key case, since many of the primary-key terms may not
+  // actually be in this segment. The 8x factor is arbitrary, based on IndexOrDVQuery,
+  // but has performed well in benchmarks:
+  costThreshold = isPrimaryKeyField ? leadCost << 3 : leadCost;
+
+  if (termData.size() > costThreshold) {
+// If the number of terms is > the number of candidates, DV should perform better.

Review Comment:
   I wonder if this is right given that the doc-values query still eagerly 
evaluates all terms against the terms dictionary. For this to work correctly, 
we'd need a query that looks up terms lazily rather than eagerly?
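
The threshold rule in the hunk above can be isolated as a small pure function. This is a sketch: the names are illustrative, and the overflow guard on the shift is an addition made for this example, not something present in the diff.

```java
/** Sketch of the "switch to doc values" threshold from the hunk above. */
final class DocValuesThresholdSketch {
  static long costThreshold(boolean hasSortedDocValues, boolean isPrimaryKeyField, long leadCost) {
    if (!hasSortedDocValues) {
      return Long.MAX_VALUE; // no doc values: the threshold can never be exceeded
    }
    if (!isPrimaryKeyField) {
      return leadCost;
    }
    // leadCost << 3 is leadCost * 8. Saturate instead of overflowing (guard added in this sketch).
    return leadCost > (Long.MAX_VALUE >> 3) ? Long.MAX_VALUE : leadCost << 3;
  }

  /** Doc values are preferred only when there are more query terms than the threshold. */
  static boolean preferDocValues(long termCount, long costThreshold) {
    return termCount > costThreshold;
  }

  public static void main(String[] args) {
    long threshold = costThreshold(true, true, 10);
    System.out.println(threshold);
    System.out.println(preferDocValues(100, threshold));
  }
}
```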






[GitHub] [lucene] jpountz merged pull request #12116: Improve document API for stored fields.

2023-02-06 Thread via GitHub


jpountz merged PR #12116:
URL: https://github.com/apache/lucene/pull/12116





[GitHub] [lucene] colvinco commented on pull request #12131: Port over gradle setting generator from Solr

2023-02-06 Thread via GitHub


colvinco commented on PR #12131:
URL: https://github.com/apache/lucene/pull/12131#issuecomment-1419339752

   There's another reference in smokeTestRelease.py 
https://github.com/apache/lucene/blob/8df59fc878795dd94e10d4c15a7bc4f1a919843b/dev-tools/scripts/smokeTestRelease.py#L612-L613





[GitHub] [lucene] colvinco closed pull request #12123: Generate gradle.properties from gradlew

2023-02-06 Thread via GitHub


colvinco closed pull request #12123: Generate gradle.properties from gradlew
URL: https://github.com/apache/lucene/pull/12123





[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-06 Thread via GitHub


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419344879

   > but there is also one benchmark where it does substantially worse ("Low 
Cardinality + High Cost Country Code Filter Terms").
   
   100%. The issue here is that `TermInSetQuery` gets rewritten to a 
`BooleanQuery` because there are fewer than 16 terms, so it doesn't have a 
chance to "self-optimize" to use doc values. We can fix this by not eagerly 
rewriting to a `BooleanQuery`, but I held off doing that for now. So this is 
"easily" fixable I think.





[GitHub] [lucene] uschindler commented on pull request #12131: Port over gradle setting generator from Solr

2023-02-06 Thread via GitHub


uschindler commented on PR #12131:
URL: https://github.com/apache/lucene/pull/12131#issuecomment-1419364815

   > There's another reference in smokeTestRelease.py
   > 
   > 
https://github.com/apache/lucene/blob/8df59fc878795dd94e10d4c15a7bc4f1a919843b/dev-tools/scripts/smokeTestRelease.py#L612-L613
   
   Fixed.





[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-02-06 Thread via GitHub


gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1097623311


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -258,13 +271,41 @@ public Matches matches(LeafReaderContext context, int doc) throws IOException {
* On the given leaf context, try to either rewrite to a disjunction if there are few matching
* terms, or build a bitset containing matching docs.
*/
-  private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
+  private WeightOrDocIdSet rewrite(
+  LeafReaderContext context, long leadCost, boolean isPrimaryKeyField, DocValuesType dvType)
+  throws IOException {
 final LeafReader reader = context.reader();
 
 Terms terms = reader.terms(field);
 if (terms == null) {
   return null;
 }
+
+long costThreshold = Long.MAX_VALUE;
+if (dvType == DocValuesType.SORTED || dvType == DocValuesType.SORTED_SET) {
+  // Establish a threshold for switching to doc values. Give postings a significant
+  // advantage for the primary-key case, since many of the primary-key terms may not
+  // actually be in this segment. The 8x factor is arbitrary, based on IndexOrDVQuery,
+  // but has performed well in benchmarks:
+  costThreshold = isPrimaryKeyField ? leadCost << 3 : leadCost;
+
+  if (termData.size() > costThreshold) {
+// If the number of terms is > the number of candidates, DV should perform better.

Review Comment:
   I'm not sure actually. The up-front term-seeking you refer to is certainly a 
cost, but it doesn't scale with the number of lead hits. So this can still be 
cheaper. But also, +1 to the idea of trying out on-demand term seeking for 
these situations!
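
To make the trade-off concrete, here is a toy model in which a sorted `String[]` stands in for the doc-values terms dictionary. It is not how either query is implemented; it only shows that the eager variant does work proportional to the number of query terms before looking at any document, while an on-demand variant does work proportional to the number of candidates.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Set;

/** Toy comparison of eager and on-demand term resolution against a sorted dictionary. */
final class TermLookupSketch {
  /** Eager: one dictionary seek per query term, done before any document is examined. */
  static BitSet eagerOrdinals(String[] sortedDictionary, Set<String> queryTerms) {
    BitSet ords = new BitSet(sortedDictionary.length);
    for (String term : queryTerms) {
      int ord = Arrays.binarySearch(sortedDictionary, term);
      if (ord >= 0) {
        ords.set(ord);
      }
    }
    return ords;
  }

  /** Eager matching: after the up-front seeks, each candidate costs only a bit test. */
  static int countEager(String[] sortedDictionary, Set<String> queryTerms, int[] candidateOrds) {
    BitSet ords = eagerOrdinals(sortedDictionary, queryTerms);
    int hits = 0;
    for (int ord : candidateOrds) {
      if (ords.get(ord)) {
        hits++;
      }
    }
    return hits;
  }

  /** On-demand: no up-front work; each candidate's ordinal is resolved to its term when needed. */
  static int countLazy(String[] sortedDictionary, Set<String> queryTerms, int[] candidateOrds) {
    int hits = 0;
    for (int ord : candidateOrds) {
      if (queryTerms.contains(sortedDictionary[ord])) {
        hits++;
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    String[] dictionary = {"a", "b", "c", "d"};
    Set<String> query = Set.of("b", "d", "zz"); // "zz" is not in this segment's dictionary
    int[] candidates = {0, 1, 1, 3};            // ordinals of the documents supplied by the lead
    System.out.println(countEager(dictionary, query, candidates));
    System.out.println(countLazy(dictionary, query, candidates));
  }
}
```

Both variants return the same matches; which one is cheaper depends on how the number of query terms compares with the number of lead hits.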






[GitHub] [lucene] rmuir merged pull request #12127: Remove useless abstractions in DocValues-based queries

2023-02-06 Thread via GitHub


rmuir merged PR #12127:
URL: https://github.com/apache/lucene/pull/12127





[GitHub] [lucene] rmuir merged pull request #12128: Speed up docvalues set query by making use of sortedness

2023-02-06 Thread via GitHub


rmuir merged PR #12128:
URL: https://github.com/apache/lucene/pull/12128





[GitHub] [lucene] jpountz commented on pull request #12054: Introduce a new `KeywordField`.

2023-02-06 Thread via GitHub


jpountz commented on PR #12054:
URL: https://github.com/apache/lucene/pull/12054#issuecomment-1419458041

   I updated this PR to:
- add a `Field.Store` parameter to the constructor that does not rely on 
Field's guessing
- update the demo to pass `Field.Store.YES` as a value for this parameter 
instead of adding a separate `StoredField`
- add a `newSetQuery` that creates a `TermInSetQuery` and hopefully soon 
benefits from @gsmiller's optimization





[GitHub] [lucene] rmuir merged pull request #12129: Speedup sandbox/DocValuesTermsQuery

2023-02-06 Thread via GitHub


rmuir merged PR #12129:
URL: https://github.com/apache/lucene/pull/12129





[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

2023-02-06 Thread via GitHub


rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1097736939


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.util.Collection;
+import java.util.Objects;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.ConstantScoreQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.search.SortedSetSelector;
+import org.apache.lucene.search.SortedSetSortField;
+import org.apache.lucene.search.TermInSetQuery;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Field that indexes a per-document String or {@link BytesRef} into an inverted index for fast
+ * filtering, stores values in a columnar fashion using {@link DocValuesType#SORTED_SET} doc values
+ * for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.
+ * This field does not support scoring: queries produce constant scores. If you also need to store

Review Comment:
   We can nuke this sentence about "if you also need to store the value" now






[GitHub] [lucene] janhoy commented on pull request #12065: Update copyright year in NOTICE.txt file.

2023-02-06 Thread via GitHub


janhoy commented on PR #12065:
URL: https://github.com/apache/lucene/pull/12065#issuecomment-1419513356

   Interesting find. At least we don't include years in every single file as 
some projects do, so it is not a huge burden. We are not obliged to keep or 
remove years; we can do as we want. 
   
   It's not a big deal to me, but I think I lean towards keeping only the year 
of initial publication, as proposed 
[here](https://matija.suklje.name/how-and-why-to-properly-write-copyright-statements-in-your-code#why-keep-the-year)
 and by Roy Fielding 
[here](https://daniel.haxx.se/blog/2023/01/08/copyright-without-years/#comment-26544).





[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.

2023-02-06 Thread via GitHub


rmuir commented on code in PR #12054:
URL: https://github.com/apache/lucene/pull/12054#discussion_r1097738909


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.util.Collection;
+import java.util.Objects;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.ConstantScoreQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.search.SortedSetSelector;
+import org.apache.lucene.search.SortedSetSortField;
+import org.apache.lucene.search.TermInSetQuery;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Field that indexes a per-document String or {@link BytesRef} into an inverted index for fast
+ * filtering, stores values in a columnar fashion using {@link DocValuesType#SORTED_SET} doc values
+ * for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.
+ * This field does not support scoring: queries produce constant scores. If you also need to store
+ * the value, you should add a separate {@link StoredField} instance. If you need more fine-grained
+ * control you can use {@link StringField}, {@link SortedDocValuesField} or {@link
+ * SortedSetDocValuesField}, and {@link StoredField}.
+ *
+ * This field defines static factory methods for creating common query objects:
+ *
+ * 
+ *   {@link #newExactQuery} for matching a value.
+ *   {@link #newSetQuery} for matching any of the values coming from a set.
+ *   {@link #newSortField} for matching a value.
+ * 
+ */
+public class KeywordField extends Field {
+
+  private static final FieldType FIELD_TYPE = new FieldType();
+  private static final FieldType FIELD_TYPE_STORED;
+
+  static {
+FIELD_TYPE.setIndexOptions(IndexOptions.DOCS);
+FIELD_TYPE.setOmitNorms(true);
+FIELD_TYPE.setTokenized(false);
+FIELD_TYPE.setDocValuesType(DocValuesType.SORTED_SET);
+FIELD_TYPE.freeze();
+
+FIELD_TYPE_STORED = new FieldType(FIELD_TYPE);
+FIELD_TYPE_STORED.setStored(true);
+FIELD_TYPE_STORED.freeze();
+  }
+
+  private final StoredValue storedValue;
+
+  /**
+   * Creates a new KeywordField.
+   *
+   * @param name field name
+   * @param value the BytesRef value
+   * @param stored whether to store the field
+   * @throws IllegalArgumentException if the field name or value is null.
+   */
+  public KeywordField(String name, BytesRef value, Store stored) {
+super(name, value, stored == Field.Store.YES ? FIELD_TYPE_STORED : FIELD_TYPE);
+if (stored == Store.YES) {
+  storedValue = new StoredValue(value);
+} else {
+  storedValue = null;
+}
+  }
+
+  /**
+   * Creates a new KeywordField from a String value, by indexing its UTF-8 representation.
+   *
+   * @param name field name
+   * @param value the BytesRef value
+   * @param stored whether to store the field
+   * @throws IllegalArgumentException if the field name or value is null.
+   */
+  public KeywordField(String name, String value, Store stored) {
+super(name, value, stored == Field.Store.YES ? FIELD_TYPE_STORED : FIELD_TYPE);
+if (stored == Store.YES) {
+  storedValue = new StoredValue(value);
+} else {
+  storedValue = null;
+}
+  }
+
+  @Override
+  public BytesRef binaryValue() {
+BytesRef binaryValue = super.binaryValue();
+if (binaryValue != null) {
+  return binaryValue;
+} else {
+  return new BytesRef(stringValue());
+}
+  }
+
+  @Override
+  public void setStringValue(String value) {
+super.setStringValue(value);
+if (storedValue != null) {
+  storedValue.setStringValue(value);
+}
+  }
+
+  @Override
+  public void setBytesValue(BytesRef value) {
+super.setBytesValue(value);
+if (storedValue != null) {
+  storedValue.setBinaryValue(value);
+}
+  }
+
+  @Override
+  public StoredValue storedValue() {
+return storedValue;
+  }
+
+  /**
+   * Create a query for matching an exact {@link BytesRef} valu

[GitHub] [lucene] uschindler commented on pull request #12123: Generate gradle.properties from gradlew

2023-02-06 Thread via GitHub


uschindler commented on PR #12123:
URL: https://github.com/apache/lucene/pull/12123#issuecomment-1419549026

   Oh, I did not see that PR. Sorry, I created a duplicate!





[GitHub] [lucene] uschindler commented on pull request #12123: Generate gradle.properties from gradlew

2023-02-06 Thread via GitHub


uschindler commented on PR #12123:
URL: https://github.com/apache/lucene/pull/12123#issuecomment-141914

   See #12131





[GitHub] [lucene] uschindler commented on pull request #12131: Port over gradle setting generator from Solr

2023-02-06 Thread via GitHub


uschindler commented on PR #12131:
URL: https://github.com/apache/lucene/pull/12131#issuecomment-1419575660

   Hi @colvinco, I merged your PR into my branch and found only a small 
difference in the windows script, which I fixed. Not sure why Solr did not 
apply the JAVA_OPTS for the generator.
   
   





[GitHub] [lucene] uschindler merged pull request #12131: Port over gradle setting generator from Solr

2023-02-06 Thread via GitHub


uschindler merged PR #12131:
URL: https://github.com/apache/lucene/pull/12131





[GitHub] [lucene] rmuir opened a new pull request, #12132: Implement ScorerSupplier for Sorted(Set)DocValuesField#newSlowRangeQuery

2023-02-06 Thread via GitHub


rmuir opened a new pull request, #12132:
URL: https://github.com/apache/lucene/pull/12132

   Similar to the use of ScorerSupplier in #12129, implement it here too, 
because creating a Scorer requires `lookupTerm()` operations in the DV terms 
dictionary. 
   
   This results in wasted effort and random accesses if, based on the cost(), 
IndexOrDocValuesQuery decides not to use this query.
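
To illustrate the idea of deferring expensive setup until a query is actually chosen, here is a minimal, self-contained sketch. It does not use Lucene's classes: `LazySupplier` and `Planner` are hypothetical names, and picking the lower `cost()` is a deliberately simplified stand-in for whatever selection rule the real query applies.

```java
import java.util.function.Supplier;

/** Hypothetical two-phase supplier: cost() is cheap, expensive setup runs only in get(). */
final class LazySupplier<T> {
  private final long cost;
  private final Supplier<T> expensiveInit;

  LazySupplier(long cost, Supplier<T> expensiveInit) {
    this.cost = cost;
    this.expensiveInit = expensiveInit;
  }

  /** Cheap estimate; must not trigger the expensive setup. */
  long cost() {
    return cost;
  }

  /** Runs the deferred setup (for example, term dictionary lookups) and returns the result. */
  T get() {
    return expensiveInit.get();
  }
}

/** Hypothetical planner that only initializes the supplier it actually selects. */
final class Planner {
  static <T> T pick(LazySupplier<T> a, LazySupplier<T> b) {
    // Simplified rule (an assumption): take the supplier with the lower estimated cost.
    return (a.cost() <= b.cost() ? a : b).get();
  }
}
```

With this shape, a supplier that loses the cost comparison never pays for its setup, which is the waste described above.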





[GitHub] [lucene] uschindler opened a new pull request, #12133: Simplify LongHashSet by completely removing java.util.Set APIs

2023-02-06 Thread via GitHub


uschindler opened a new pull request, #12133:
URL: https://github.com/apache/lucene/pull/12133

   Instead, return a LongStream for toString() and testing (and possibly other 
use cases).
   
   This is a followup to @rmuir's PR #12128 and removes even more code.





[GitHub] [lucene] uschindler merged pull request #12133: Simplify LongHashSet by completely removing java.util.Set APIs

2023-02-06 Thread via GitHub


uschindler merged PR #12133:
URL: https://github.com/apache/lucene/pull/12133





[GitHub] [lucene] uschindler opened a new pull request, #12134: Add tests for size() and contains() to LongHashSet

2023-02-06 Thread via GitHub


uschindler opened a new pull request, #12134:
URL: https://github.com/apache/lucene/pull/12134

   Another followup for #12128: Due to previously only testing the 
`java.util.Set` interface, the actual testing code never verified that `size()` 
and the actual call to `contains(long)` worked correctly.





[GitHub] [lucene] uschindler commented on pull request #12134: Add tests for size() and contains() to LongHashSet

2023-02-06 Thread via GitHub


uschindler commented on PR #12134:
URL: https://github.com/apache/lucene/pull/12134#issuecomment-1419932992

   I found a bug: the first test passes, but the second one fails:
   
   ```java
   public void testSameValue() {
     LongHashSet set2 = new LongHashSet(new long[] {42L, 42L});
     assertEquals(1, set2.size());
     assertEquals(42L, set2.minValue);
     assertEquals(42L, set2.maxValue);
   }

   public void testSameMissingPlaceholder() {
     LongHashSet set2 = new LongHashSet(new long[] {Long.MIN_VALUE, Long.MIN_VALUE});
     assertEquals(1, set2.size());
     assertEquals(Long.MIN_VALUE, set2.minValue);
     assertEquals(Long.MIN_VALUE, set2.maxValue);
   }
   ```
   
   The problem is that `MISSING` is counted twice, because it is not added to 
the hash table and is handled separately in the constructor. The fix is easy...
   
   Will commit a fix, too.
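
For readers following along, the double-counting issue can be sketched with a simplified open-addressing set. This is not Lucene's `LongHashSet`: `SketchLongSet` and its members are hypothetical, and the fix shown (count the sentinel only the first time it is seen) is one plausible approach, not necessarily the one committed.

```java
import java.util.Arrays;

/** Simplified, hypothetical open-addressing set of longs; not Lucene's actual class. */
final class SketchLongSet {
  // Sentinel marking an empty slot, so it can never be stored in the table itself.
  private static final long MISSING = Long.MIN_VALUE;

  private final long[] table;
  private final int mask;
  private boolean hasMissingValue;
  private int size;

  SketchLongSet(long... values) {
    // Power-of-two capacity of at least twice the input length, so a free slot always exists.
    int capacity = 2;
    while (capacity < values.length * 2) {
      capacity <<= 1;
    }
    table = new long[capacity];
    Arrays.fill(table, MISSING);
    mask = capacity - 1;
    for (long value : values) {
      if (add(value)) {
        size++;
      }
    }
  }

  private boolean add(long value) {
    if (value == MISSING) {
      // The sentinel lives outside the table. Counting it unconditionally here would
      // double-count duplicates; only the first occurrence counts as newly added.
      boolean added = !hasMissingValue;
      hasMissingValue = true;
      return added;
    }
    int slot = Long.hashCode(value) & mask;
    while (table[slot] != MISSING) {
      if (table[slot] == value) {
        return false; // already present
      }
      slot = (slot + 1) & mask; // linear probing
    }
    table[slot] = value;
    return true;
  }

  boolean contains(long value) {
    if (value == MISSING) {
      return hasMissingValue;
    }
    int slot = Long.hashCode(value) & mask;
    while (table[slot] != MISSING) {
      if (table[slot] == value) {
        return true;
      }
      slot = (slot + 1) & mask;
    }
    return false;
  }

  int size() {
    return size;
  }
}
```

In this sketch, both a duplicated ordinary value and a duplicated `Long.MIN_VALUE` yield a size of 1, which is the behavior the two tests above expect.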





[GitHub] [lucene] jimmykobe1171 commented on a diff in pull request #12126: Refactor part of IndexFileDeleter and ReplicaFileDeleter into a common utility class

2023-02-06 Thread via GitHub


jimmykobe1171 commented on code in PR #12126:
URL: https://github.com/apache/lucene/pull/12126#discussion_r1098024345


##
lucene/replicator/src/java/org/apache/lucene/replicator/nrt/CopyJob.java:
##
@@ -206,7 +206,7 @@ private synchronized void _transferAndCancel(CopyJob prevJob) throws IOException
   if (Node.VERBOSE_FILES) {
 dest.message("remove partial file " + prevJob.current.tmpName);
   }
-  dest.deleter.deleteNewFile(prevJob.current.tmpName);
+  dest.deleter.deleteIfNoRef(prevJob.current.tmpName);

Review Comment:
   It seems like **deleteIfNoRef** is always safer than **deleteNewFile**. Do 
we still need the **deleteNewFile** method?






[GitHub] [lucene] uschindler commented on pull request #12134: Add tests for size() and contains() to LongHashSet

2023-02-06 Thread via GitHub


uschindler commented on PR #12134:
URL: https://github.com/apache/lucene/pull/12134#issuecomment-1419949170

   Fixed. The code is actually more readable now.





[GitHub] [lucene] uschindler merged pull request #12134: Add tests for size() and contains() to LongHashSet

2023-02-06 Thread via GitHub


uschindler merged PR #12134:
URL: https://github.com/apache/lucene/pull/12134





[GitHub] [lucene] zacharymorn commented on issue #11428: Handle soft deletes via LiveDocsFormat [LUCENE-10392]

2023-02-06 Thread via GitHub


zacharymorn commented on issue #11428:
URL: https://github.com/apache/lucene/issues/11428#issuecomment-1420073239

   Thanks @dnhatn @rmuir @s1monw for the additional information! Yeah, I can 
see now how changing it to use live docs instead of relying on an explicit 
field would potentially require changes to the `softUpdateDocument` API to 
differentiate between a "regular" doc and a tombstone doc, and would also make 
the live docs format itself and its usage more complicated (indeed, nothing can 
beat a bitset in terms of simplicity!). I'll pause exploring this on my end 
then, but I'll be happy to work on it further if preferences change down the 
road.

