[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803395995



##
File path: 
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
##
@@ -147,6 +165,11 @@ NeighborQueue searchLevel(
   continue;
 }
 
+numVisited++;
+if (numVisited > visitedLimit) {
+  throw new CollectionTerminatedException();

Review comment:
   This may be an abuse of `CollectionTerminatedException`. Another idea 
would be to try to pass back the information that the search was terminated 
early in `TopDocs.TotalHits` (but this also doesn't seem ideal).
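
One way to picture the alternative raised above is a traversal that returns its partial result together with a flag, instead of unwinding with an exception. The following is a minimal sketch on a toy adjacency-list graph. `BoundedGraphSearch`, `Result`, and `terminatedEarly` are hypothetical names chosen for illustration; none of them are part of Lucene or of this PR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only: not HnswGraphSearcher and not a Lucene API.
final class BoundedGraphSearch {

  /** Nodes visited so far, plus whether the visit limit stopped the search. */
  static final class Result {
    final List<Integer> visited;
    final boolean terminatedEarly;

    Result(List<Integer> visited, boolean terminatedEarly) {
      this.visited = visited;
      this.terminatedEarly = terminatedEarly;
    }
  }

  /** Breadth-first traversal that visits at most visitedLimit nodes. */
  static Result search(int[][] neighbors, int entryPoint, int visitedLimit) {
    boolean[] seen = new boolean[neighbors.length];
    Deque<Integer> queue = new ArrayDeque<>();
    List<Integer> visited = new ArrayList<>();
    queue.add(entryPoint);
    seen[entryPoint] = true;
    while (!queue.isEmpty()) {
      if (visited.size() >= visitedLimit) {
        // Limit reached with work remaining: report it in the result, no exception.
        return new Result(visited, true);
      }
      int node = queue.poll();
      visited.add(node);
      for (int next : neighbors[node]) {
        if (!seen[next]) {
          seen[next] = true;
          queue.add(next);
        }
      }
    }
    return new Result(visited, false);
  }

  public static void main(String[] args) {
    int[][] graph = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
    Result r = search(graph, 0, 2);
    System.out.println(r.visited + " terminatedEarly=" + r.terminatedEarly);
  }
}
```

The caller then decides what to do with an incomplete result. How such a flag would map onto `TopDocs.TotalHits` is left open here, as it is in the comment.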




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-10 Thread Feng Guo (Jira)
Feng Guo created LUCENE-10417:
-

 Summary: IntNRQ task performance decreased in nightly benchmark
 Key: LUCENE-10417
 URL: https://issues.apache.org/jira/browse/LUCENE-10417
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs
Reporter: Feng Guo


Probably related to LUCENE-LUCENE-10315,  I'll dig.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[jira] [Updated] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-10 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10417:
--
Description: 
Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html

Probably related to LUCENE-LUCENE-10315,  I'll dig.

  was:Probably related to LUCENE-LUCENE-10315,  I'll dig.


> IntNRQ task performance decreased in nightly benchmark
> --
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-LUCENE-10315,  I'll dig.






[jira] [Updated] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-10 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10417:
--
Description: 
Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html

Probably related to LUCENE-10315,  I'll dig.

  was:
Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html

Probably related to LUCENE-LUCENE-10315,  I'll dig.


> IntNRQ task performance decreased in nightly benchmark
> --
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315,  I'll dig.






[jira] [Assigned] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-10 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo reassigned LUCENE-10417:
-

Assignee: Feng Guo

> IntNRQ task performance decreased in nightly benchmark
> --
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315,  I'll dig.






[GitHub] [lucene] mocobeta commented on a change in pull request #671: Add custom composite action to set up CI environments

2022-02-10 Thread GitBox


mocobeta commented on a change in pull request #671:
URL: https://github.com/apache/lucene/pull/671#discussion_r803409108



##
File path: .github/workflows/gradle-precommit.yml
##
@@ -26,12 +26,9 @@ jobs:
 steps:
 - uses: actions/checkout@v2
 
-- name: Set up JDK
-  uses: actions/setup-java@v2
+- uses: ./.github/actions/setup-action
   with:
-distribution: 'adopt-hotspot'
 java-version: ${{ matrix.java }}

Review comment:
   I don't think the strategy matrix can be shared across workflows. If the 
target Java versions have to be hard-coded in workflow files anyway, the shared 
action in this PR wouldn't be much help for us. 







[jira] [Created] (LUCENE-10418) Improve Query rewriting for non-scoring clauses

2022-02-10 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10418:
-

 Summary: Improve Query rewriting for non-scoring clauses
 Key: LUCENE-10418
 URL: https://issues.apache.org/jira/browse/LUCENE-10418
 Project: Lucene - Core
  Issue Type: Task
Reporter: Adrien Grand


Query rewriting is occasionally important for performance, e.g. it may allow 
using an optimized bulk scorer instead of the default bulk scorer like in the 
example from LUCENE-10412.

One case where we could simplify queries is the non-scoring case. All layers 
of query wrappers that only affect scoring, like BoostQuery and 
ConstantScoreQuery, can be removed, which might help identify new opportunities for 
rewriting. For instance, we have several rewrite rules that optimize for 
MatchAllDocsQuery and would fail to recognize it if it is behind a 
ConstantScoreQuery or a BoostQuery. Boolean queries can also simplify 
themselves in the non-scoring case, by changing MUST clauses to FILTER clauses, 
or removing fully optional SHOULD clauses.
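
The simplifications described above can be sketched on a toy query tree. This is a minimal illustration under stated assumptions: the classes below (`Q`, `Boost`, `ConstantScore`, `Bool`, `Clause`) are hypothetical stand-ins, not Lucene's `Query` classes, and the sketch ignores `MUST_NOT` clauses and any minimum-should-match setting, so a `SHOULD` clause is treated as fully optional whenever a required clause is present.

```java
import java.util.ArrayList;
import java.util.List;

// Toy query tree for illustration; these are NOT Lucene's Query classes.
final class NonScoringRewriteSketch {
  interface Q {}

  static final class MatchAll implements Q {}

  static final class Term implements Q {
    final String text;
    Term(String text) { this.text = text; }
  }

  /** Wrapper that only changes scores. */
  static final class Boost implements Q {
    final Q inner;
    final float boost;
    Boost(Q inner, float boost) { this.inner = inner; this.boost = boost; }
  }

  /** Wrapper that only changes scores. */
  static final class ConstantScore implements Q {
    final Q inner;
    ConstantScore(Q inner) { this.inner = inner; }
  }

  enum Occur { MUST, FILTER, SHOULD }

  static final class Clause {
    final Occur occur;
    final Q query;
    Clause(Occur occur, Q query) { this.occur = occur; this.query = query; }
  }

  static final class Bool implements Q {
    final List<Clause> clauses;
    Bool(List<Clause> clauses) { this.clauses = clauses; }
  }

  /** Simplifies a query whose scores will never be read. */
  static Q rewriteNoScores(Q q) {
    if (q instanceof Boost) {
      return rewriteNoScores(((Boost) q).inner); // drop score-only wrapper
    }
    if (q instanceof ConstantScore) {
      return rewriteNoScores(((ConstantScore) q).inner); // drop score-only wrapper
    }
    if (q instanceof Bool) {
      List<Clause> in = ((Bool) q).clauses;
      boolean hasRequired = false;
      for (Clause c : in) {
        hasRequired |= c.occur != Occur.SHOULD;
      }
      List<Clause> out = new ArrayList<>();
      for (Clause c : in) {
        if (c.occur == Occur.SHOULD && hasRequired) {
          continue; // optional clause: cannot change which documents match
        }
        Occur occur = c.occur == Occur.MUST ? Occur.FILTER : c.occur;
        out.add(new Clause(occur, rewriteNoScores(c.query)));
      }
      return new Bool(out);
    }
    return q;
  }
}
```

Once the wrappers are gone, a rule that looks for a bare match-all query would see `MatchAll` directly, which is the kind of follow-up opportunity the description refers to.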






[jira] [Commented] (LUCENE-10418) Improve Query rewriting for non-scoring clauses

2022-02-10 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490053#comment-17490053
 ] 

Adrien Grand commented on LUCENE-10418:
---

I initially thought of adding a `boolean needsScores` parameter to 
{{Query#rewrite}} to address this case, but non-scoring optimizations are 
mostly applicable to ConstantScoreQuery, BoostQuery and BooleanQuery, so I 
tried an approach that only specializes rewriting for these queries while 
keeping the existing API.

> Improve Query rewriting for non-scoring clauses
> ---
>
> Key: LUCENE-10418
> URL: https://issues.apache.org/jira/browse/LUCENE-10418
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Query rewriting is occasionally important for performance, e.g. it may allow 
> using an optimized bulk scorer instead of the default bulk scorer like in the 
> example from LUCENE-10412.
> One case where we could simplify queries is the non-scoring case. All 
> layers of query wrappers that only affect scoring, like BoostQuery and 
> ConstantScoreQuery, can be removed, which might help identify new 
> opportunities for rewriting. For instance, we have several rewrite rules that 
> optimize for MatchAllDocsQuery and would fail to recognize it if it is behind 
> a ConstantScoreQuery or a BoostQuery. Boolean queries can also simplify 
> themselves in the non-scoring case, by changing MUST clauses to FILTER 
> clauses, or removing fully optional SHOULD clauses.






[GitHub] [lucene] romseygeek commented on a change in pull request #668: LUCENE-10414: Add fn:fuzzyTerm interval function to flexible query parser

2022-02-10 Thread GitBox


romseygeek commented on a change in pull request #668:
URL: https://github.com/apache/lucene/pull/668#discussion_r803476165



##
File path: 
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/nodes/intervalfn/FuzzyTerm.java
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.queryparser.flexible.standard.nodes.intervalfn;
+
+import java.util.Locale;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.queries.intervals.Intervals;
+import org.apache.lucene.queries.intervals.IntervalsSource;
+import org.apache.lucene.search.FuzzyQuery;
+
+/**
+ * An interval function equivalent to {@link FuzzyQuery}. A fuzzy term expands to a disjunction of
+ * intervals of terms that are within the specified {@code maxEdits} from the provided term. A limit
+ * of {@code maxExpansions} prevents the internal implementation from blowing up on too many
+ * potential candidate terms.
+ */
+public class FuzzyTerm extends IntervalFunction {
+  private final String term;
+  private final int maxEdits;
+  private final Integer maxExpansions;
+
+  public FuzzyTerm(String term, Integer maxEdits, Integer maxExpansions) {
+    this.term = term;
+    this.maxEdits = maxEdits == null ? FuzzyQuery.defaultMaxEdits : maxEdits;
+    this.maxExpansions = maxExpansions == null ? Intervals.DEFAULT_MAX_EXPANSIONS : maxExpansions;
+  }
+
+  @Override
+  public IntervalsSource toIntervalSource(String field, Analyzer analyzer) {
+    var fuzzyQuery = new FuzzyQuery(new Term(field, term), maxEdits);

Review comment:
   A static method on FuzzyQuery fits with what we have elsewhere for 
PrefixQuery and WildcardQuery; let's do that.







[GitHub] [lucene] gautamworah96 commented on a change in pull request #658: LUCENE-10378 Implement Weight#count for PointRangeQuery

2022-02-10 Thread GitBox


gautamworah96 commented on a change in pull request #658:
URL: https://github.com/apache/lucene/pull/658#discussion_r803485066



##
File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
##
@@ -369,6 +369,52 @@ private void intersect(IntersectVisitor visitor, PointTree pointTree) throws IOE
     }
   }
 
+  /**
+   * Finds the number of points matching the provided range conditions. Using this method is faster
+   * than calling {@link #intersect(IntersectVisitor)} to get the count of intersecting points. This
+   * method does not enforce live documents, therefore it should only be used when there are no
+   * deleted documents.
+   */
+  public final long countPoints(IntersectVisitor visitor) throws IOException {
+    final PointTree pointTree = getPointTree();
+    long countPoints = countPoints(visitor, pointTree);
+    assert pointTree.moveToParent()
+        == false; // just checking to make sure we ended the tree search at the root node
+    return countPoints;
+  }
+
+  private long countPoints(IntersectVisitor visitor, PointTree pointTree) throws IOException {
+    Relation r = visitor.compare(pointTree.getMinPackedValue(), pointTree.getMaxPackedValue());
+    switch (r) {
+      case CELL_OUTSIDE_QUERY:
+        // This cell is fully outside the query shape: return 0 as the count of its nodes
+        return 0;
+      case CELL_INSIDE_QUERY:
+        // This cell is fully inside the query shape: return the size of the entire node as the
+        // count
+        return pointTree.size();
+      case CELL_CROSSES_QUERY:
+        /*
+        The cell crosses the shape boundary, or the cell fully contains the query, so we fall
+        through and do full counting.
+        */
+        if (pointTree.moveToChild()) {
+          int cellCount = 0;
+          do {
+            cellCount += countPoints(visitor, pointTree);
+          } while (pointTree.moveToSibling());
+          pointTree.moveToParent();
+          return cellCount;
+        } else {
+          // we have reached a leaf node here.
+          pointTree.visitDocValues(visitor);
+          return 0; // the visitor has safely recorded the number of leaf nodes that matched
+        }
+      default:
+        throw new IllegalArgumentException("Unreachable code");
+    }
+  }
+

Review comment:
   Got it. Makes sense. This implementation only deals with query-specific 
loopholes. `PointValues` has nothing to do with these query-level 
optimizations. Fixed in the next commit.
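
The counting recursion in the diff above (outside: 0, inside: whole cell size, crossing: descend or check the leaf) can be tried out on a toy structure. The sketch below is a hypothetical one-dimensional stand-in; `RangeCountSketch` and `Node` are made-up names and do not use Lucene's `PointValues` or `PointTree` API. As in the diff, counting sizes this way assumes that every stored value should be counted, that is, no deleted documents.

```java
// Toy one-dimensional tree for illustration; not Lucene's PointValues/PointTree API.
final class RangeCountSketch {

  static final class Node {
    final int min;
    final int max;
    final long size;
    final int[] values; // non-null only for leaves, sorted ascending
    final Node left;
    final Node right;

    private Node(int min, int max, long size, int[] values, Node left, Node right) {
      this.min = min;
      this.max = max;
      this.size = size;
      this.values = values;
      this.left = left;
      this.right = right;
    }

    /** Leaf holding a non-empty, ascending list of values. */
    static Node leaf(int... sortedValues) {
      int last = sortedValues[sortedValues.length - 1];
      return new Node(sortedValues[0], last, sortedValues.length, sortedValues, null, null);
    }

    /** Inner node; assumes every value in left is <= every value in right. */
    static Node inner(Node left, Node right) {
      return new Node(left.min, right.max, left.size + right.size, null, left, right);
    }
  }

  /** Counts values in [lo, hi], descending only into cells that cross the range. */
  static long count(Node node, int lo, int hi) {
    if (node.max < lo || node.min > hi) {
      return 0; // cell outside the range
    }
    if (node.min >= lo && node.max <= hi) {
      return node.size; // cell inside the range: no need to look at its values
    }
    if (node.values != null) {
      long matches = 0; // leaf crossing the boundary: test each value
      for (int v : node.values) {
        if (v >= lo && v <= hi) {
          matches++;
        }
      }
      return matches;
    }
    return count(node.left, lo, hi) + count(node.right, lo, hi);
  }

  public static void main(String[] args) {
    Node root = Node.inner(Node.inner(Node.leaf(1, 2, 3), Node.leaf(4, 5, 6)), Node.leaf(7, 8, 9));
    System.out.println(count(root, 2, 7));
  }
}
```

Only the two leaves that straddle the range boundaries are scanned value by value; the middle leaf is counted from its size alone, which is where the saving over a full intersect comes from.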







[GitHub] [lucene] gautamworah96 commented on a change in pull request #658: LUCENE-10378 Implement Weight#count for PointRangeQuery

2022-02-10 Thread GitBox


gautamworah96 commented on a change in pull request #658:
URL: https://github.com/apache/lucene/pull/658#discussion_r803485794



##
File path: lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java
##
@@ -369,6 +376,45 @@ public Scorer scorer(LeafReaderContext context) throws IOException {
         return scorerSupplier.get(Long.MAX_VALUE);
       }
 
+      @Override
+      public int count(LeafReaderContext context) throws IOException {
+        LeafReader reader = context.reader();
+
+        PointValues values = reader.getPointValues(field);
+        if (checkValidPointValues(values) == false) {
+          return 0;
+        }
+
+        if (reader.hasDeletions() == false
+            && numDims == 1
+            && values.getDocCount() == values.size()) {
+          // if all documents have at-most one point
+          final int[] intersectingLeafNodeCount = {0};
+          // create a custom IntersectVisitor that records the number of leafNodes that matched
+          final IntersectVisitor visitor =
+              new IntersectVisitor() {
+                @Override
+                public void visit(int docID) {
+                  intersectingLeafNodeCount[0]++;

Review comment:
   Done. Thanks for the idea @iverase. Looks much cleaner now (+ removes 
the inconsistency of adding the leaf node count separately).







[GitHub] [lucene] dweiss commented on a change in pull request #668: LUCENE-10414: Add fn:fuzzyTerm interval function to flexible query parser

2022-02-10 Thread GitBox


dweiss commented on a change in pull request #668:
URL: https://github.com/apache/lucene/pull/668#discussion_r803527778



##
File path: 
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/nodes/intervalfn/FuzzyTerm.java
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.queryparser.flexible.standard.nodes.intervalfn;
+
+import java.util.Locale;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.queries.intervals.Intervals;
+import org.apache.lucene.queries.intervals.IntervalsSource;
+import org.apache.lucene.search.FuzzyQuery;
+
+/**
+ * An interval function equivalent to {@link FuzzyQuery}. A fuzzy term expands to a disjunction of
+ * intervals of terms that are within the specified {@code maxEdits} from the provided term. A limit
+ * of {@code maxExpansions} prevents the internal implementation from blowing up on too many
+ * potential candidate terms.
+ */
+public class FuzzyTerm extends IntervalFunction {
+  private final String term;
+  private final int maxEdits;
+  private final Integer maxExpansions;
+
+  public FuzzyTerm(String term, Integer maxEdits, Integer maxExpansions) {
+    this.term = term;
+    this.maxEdits = maxEdits == null ? FuzzyQuery.defaultMaxEdits : maxEdits;
+    this.maxExpansions = maxExpansions == null ? Intervals.DEFAULT_MAX_EXPANSIONS : maxExpansions;
+  }
+
+  @Override
+  public IntervalsSource toIntervalSource(String field, Analyzer analyzer) {
+    var fuzzyQuery = new FuzzyQuery(new Term(field, term), maxEdits);

Review comment:
   I have already - see commit 86c9756







[GitHub] [lucene] iverase commented on a change in pull request #658: LUCENE-10378 Implement Weight#count for PointRangeQuery

2022-02-10 Thread GitBox


iverase commented on a change in pull request #658:
URL: https://github.com/apache/lucene/pull/658#discussion_r803536317



##
File path: lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java
##
@@ -369,6 +378,100 @@ public Scorer scorer(LeafReaderContext context) throws IOException {
         return scorerSupplier.get(Long.MAX_VALUE);
       }
 
+      @Override
+      public int count(LeafReaderContext context) throws IOException {
+        LeafReader reader = context.reader();
+
+        PointValues values = reader.getPointValues(field);
+        if (checkValidPointValues(values) == false) {
+          return 0;
+        }
+
+        if (reader.hasDeletions() == false
+            && numDims == 1
+            && values.getDocCount() == values.size()) {
+          // if all documents have at-most one point
+          return (int) pointCount(values.getPointTree(), this::relate, this::matches);
+        }
+        return super.count(context);
+      }
+
+      /**
+       * Finds the number of points matching the provided range conditions. Using this method is
+       * faster than calling {@link PointValues#intersect(IntersectVisitor)} to get the count of
+       * intersecting points. This method does not enforce live documents, therefore it should only
+       * be used when there are no deleted documents.
+       *
+       * @param pointTree start node of the count operation
+       * @param nodeComparator comparator to be used for checking whether the internal node is
+       * inside the range
+       * @param leafComparator comparator to be used for checking whether the leaf node is inside
+       * the range
+       * @return count of points that match the range
+       */
+      private long pointCount(
+          PointValues.PointTree pointTree,
+          BiFunction<byte[], byte[], Relation> nodeComparator,
+          Predicate<byte[]> leafComparator)
+          throws IOException {
+        final int[] matchingLeafNodeCount = {0};
+        // create a custom IntersectVisitor that records the number of leafNodes that matched
+        final IntersectVisitor visitor =
+            new IntersectVisitor() {
+              @Override
+              public void visit(int docID) {
+                // this branch should be unreachable
+                throw new UnsupportedOperationException(
+                    "This IntersectVisitor does not perform any actions on a "
+                        + "docID="
+                        + docID
+                        + " node being visited");
+              }
+
+              @Override
+              public void visit(int docID, byte[] packedValue) {
+                if (leafComparator.test(packedValue)) {
+                  matchingLeafNodeCount[0]++;
+                }
+              }
+
+              @Override
+              public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+                return nodeComparator.apply(minPackedValue, maxPackedValue);
+              }
+            };
+        Relation r =

Review comment:
   I think we should move the recursive part into its own method and reuse 
the IntersectVisitor? 







[jira] [Assigned] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss reassigned LUCENE-10419:


Assignee: Dawid Weiss

> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from StringBuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.






[jira] [Created] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10419:


 Summary: Identify occasional validateSourcePatterns error on CI 
servers
 Key: LUCENE-10419
 URL: https://issues.apache.org/jira/browse/LUCENE-10419
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Dawid Weiss


{code}

What went wrong: Execution failed for task 
':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0

{code}

 

This annoys me. It's a message from StringBuilder.substring somewhere - let's 
get the stack of that first and see where the bug is.
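
For reference, this kind of failure is easy to reproduce in isolation: asking an empty `StringBuilder` for `substring(1, 0)` throws a `StringIndexOutOfBoundsException`. The sketch below is a hypothetical standalone demo (`SubstringRangeDemo` is a made-up name, unrelated to the build code); the exact wording of the exception message varies between JDK versions, so the demo prints whatever the running JDK produces rather than assuming a particular text.

```java
// Hypothetical standalone demo; unrelated to the validateSourcePatterns build code.
final class SubstringRangeDemo {

  /** Returns "ok" if the range is valid, otherwise the exception type and message. */
  static String describeFailure(StringBuilder sb, int start, int end) {
    try {
      sb.substring(start, end);
      return "ok";
    } catch (StringIndexOutOfBoundsException e) {
      // The message text depends on the JDK; the exception type does not.
      return e.getClass().getSimpleName() + ": " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    // Empty builder, start past the end: the same shape as "start 1, end 0, length 0".
    System.out.println(describeFailure(new StringBuilder(), 1, 0));
  }
}
```

In the build itself the interesting part is the stack trace rather than the message, which is what the debugging code is meant to capture.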






[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490121#comment-17490121
 ] 

ASF subversion and git services commented on LUCENE-10419:
--

Commit 1f1da12c89baea3db689135cf4325d231c7025f3 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1f1da12 ]

LUCENE-10419: add debugging code.


> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from StringBuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.






[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490124#comment-17490124
 ] 

ASF subversion and git services commented on LUCENE-10419:
--

Commit 9289b94329adcf712c72bb2cbe056c161b7d7188 in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9289b94 ]

LUCENE-10419: add debugging code.


> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from StringBuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.






[GitHub] [lucene] dweiss merged pull request #668: LUCENE-10414: Add fn:fuzzyTerm interval function to flexible query parser

2022-02-10 Thread GitBox


dweiss merged pull request #668:
URL: https://github.com/apache/lucene/pull/668


   





[jira] [Commented] (LUCENE-10414) Add fn:fuzzyTerm interval function to flexible query parser

2022-02-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490129#comment-17490129
 ] 

ASF subversion and git services commented on LUCENE-10414:
--

Commit f6cebac3337926ca871b922241976a4ba4799c70 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f6cebac ]

LUCENE-10414: Add fn:fuzzyTerm interval function to flexible query parser (#668)



> Add fn:fuzzyTerm interval function to flexible query parser
> ---
>
> Key: LUCENE-10414
> URL: https://issues.apache.org/jira/browse/LUCENE-10414
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Searching for "fuzzy" terms within interval expressions is currently 
> impossible. The Intervals class does expose the necessary low-level machinery 
> to make it happen though.
>  
> PR: [https://github.com/apache/lucene/pull/668/files]






[jira] [Commented] (LUCENE-10414) Add fn:fuzzyTerm interval function to flexible query parser

2022-02-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490130#comment-17490130
 ] 

ASF subversion and git services commented on LUCENE-10414:
--

Commit 9a293da5967ff272529a532106e64baecf28f24c in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9a293da ]

LUCENE-10414: Add fn:fuzzyTerm interval function to flexible query parser (#668)



> Add fn:fuzzyTerm interval function to flexible query parser
> ---
>
> Key: LUCENE-10414
> URL: https://issues.apache.org/jira/browse/LUCENE-10414
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Searching for "fuzzy" terms within interval expressions is currently 
> impossible. The Intervals class does expose the necessary low-level machinery 
> to make it happen though.
>  
> PR: [https://github.com/apache/lucene/pull/668/files]






[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene

2022-02-10 Thread Praveen Nishchal (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490161#comment-17490161
 ] 

Praveen Nishchal commented on LUCENE-8739:
--

Hi Adrien,

Thank you for your feedback! I am a little unclear as to why we should wait for 
Panama before adding a new JNI-based codec. That codec would not be part of the 
Lucene core; as mentioned, it would be an unofficial codec included under 
Lucene/codecs. Given the tremendous performance benefits, shouldn’t the 
customers (users) be allowed to use JNI in their deployments if they choose to?

> ZSTD Compressor support in Lucene
> -
>
> Key: LUCENE-8739
> URL: https://issues.apache.org/jira/browse/LUCENE-8739
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/codecs
>Reporter: Sean Torres
>Priority: Minor
>  Labels: features
> Attachments: image-2022-01-11-02-18-11-402.png, 
> image-2022-01-11-02-18-57-752.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff. 
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/]






[jira] [Created] (LUCENE-10420) Move functional interfaces in IOUtils to top-level interfaces

2022-02-10 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-10420:
--

 Summary: Move functional interfaces in IOUtils to top-level 
interfaces
 Key: LUCENE-10420
 URL: https://issues.apache.org/jira/browse/LUCENE-10420
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Tomoko Uchida


Suggested at https://github.com/apache/lucene/pull/643#discussion_r802285404.






[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene

2022-02-10 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490174#comment-17490174
 ] 

Adrien Grand commented on LUCENE-8739:
--

My opinion is that there are interesting benefits, but they are not worth the 
cost of adding an extra dependency on the library that provides the JNI 
bindings. Sure, it performs better on retrieval than BEST_COMPRESSION, but if 
retrieval is what a user cares most about, then BEST_SPEED is an even better 
option.

> ZSTD Compressor support in Lucene
> -
>
> Key: LUCENE-8739
> URL: https://issues.apache.org/jira/browse/LUCENE-8739
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/codecs
>Reporter: Sean Torres
>Priority: Minor
>  Labels: features
> Attachments: image-2022-01-11-02-18-11-402.png, 
> image-2022-01-11-02-18-57-752.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff. 
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/]






[GitHub] [lucene] mocobeta commented on a change in pull request #643: LUCENE-10400: revise binary dictionaries' constructor in kuromoji

2022-02-10 Thread GitBox


mocobeta commented on a change in pull request #643:
URL: https://github.com/apache/lucene/pull/643#discussion_r803637475



##
File path: lucene/core/src/java/org/apache/lucene/util/IOUtils.java
##
@@ -526,4 +526,17 @@ public static void fsync(Path fileToSync, boolean isDir) throws IOException {
   public interface IOFunction<T, R> {
     R apply(T t) throws IOException;
   }
+
+  /**
+   * A resource supplier function that may throw an IOException.
+   *
+   * Note that this would open a resource such as a File. Consumers should make sure to close the
+   * resource (e.g., use try-with-resources)
+   *
+   * @see java.util.function.Supplier
+   */
+  @FunctionalInterface

Review comment:
   Hi, could anybody review #673?







[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490193#comment-17490193
 ] 

Dawid Weiss commented on LUCENE-10419:
--

Captured this:
{code:java}
> Task :lucene:analysis:icu:validateSourcePatterns FAILED
java.lang.StringIndexOutOfBoundsException: start 1, end 854, length 0
at 
java.base/java.lang.AbstractStringBuilder.checkRangeSIOOBE(AbstractStringBuilder.java:1810)
at 
java.base/java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:1070)
at java.base/java.lang.StringBuilder.substring(StringBuilder.java:87)
at 
java.base/java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:1022)
at java.base/java.lang.StringBuilder.substring(StringBuilder.java:87)
at 
org.apache.rat.analysis.license.FullTextMatchingLicense.match(FullTextMatchingLicense.java:100)
at 
org.apache.rat.analysis.util.HeaderMatcherMultiplexer.match(HeaderMatcherMultiplexer.java:40)
at org.apache.rat.analysis.IHeaderMatcher$match$0.call(Unknown Source)
at 
ValidateSourcePatternsTask$_check_closure2$_closure6.doCall(/home/jenkins/workspace/Lucene-main-Linux/gradle/validation/validate-source-patterns.gradle:177)
at jdk.internal.reflect.GeneratedMethodAccessor698.invoke(Unknown 
Source)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at 
org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at 
org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:263)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
at 
org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:38)
at 
org.codehaus.groovy.runtime.callsite.BooleanReturningMethodInvoker.invoke(BooleanReturningMethodInvoker.java:49)
at 
org.codehaus.groovy.runtime.callsite.BooleanClosureWrapper.call(BooleanClosureWrapper.java:52)
at 
org.codehaus.groovy.runtime.DefaultGroovyMethods.any(DefaultGroovyMethods.java:2642)
at 
org.codehaus.groovy.runtime.DefaultGroovyMethods.any(DefaultGroovyMethods.java:2674)
at org.codehaus.groovy.runtime.dgm$13.invoke(Unknown Source)
at 
org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoMetaMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:247)
at 
org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:56)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
at 
ValidateSourcePatternsTask$_check_closure2.doCall(/home/jenkins/workspace/Lucene-main-Linux/gradle/validation/validate-source-patterns.gradle:177)
at jdk.internal.reflect.GeneratedMethodAccessor699.invoke(Unknown 
Source)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at 
org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at 
org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:263)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
at 
org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:38)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:148)
at 
ValidateSourcePatternsTask$_check_closure3.doCall(/home/jenkins/workspace/Lucene-main-Linux/gradle/validation/validate-source-patterns.gradle:186)
at jdk.internal.reflect.GeneratedMethodAccessor702.invoke(Unknown 
Source)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at 
org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at 
org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:263)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
at 
org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:38)
at 
ValidateSourcePatternsTask$_check_closure5.doCall(/home/jenkins/workspace/Lucene-main-Linux/gradle/validation/validate-source-patterns.gradle:244)
at jdk.internal.reflect.GeneratedMethodAccessor700.invoke(Unknown 
Source)
at 
java.base/jdk.internal.reflect.DelegatingMethod

[GitHub] [lucene] msokolov commented on a change in pull request #673: LUCENE-10420: Move functional interfaces in IOUtils to top-level interfaces

2022-02-10 Thread GitBox


msokolov commented on a change in pull request #673:
URL: https://github.com/apache/lucene/pull/673#discussion_r803661345



##
File path: lucene/core/src/java/org/apache/lucene/util/IOUtils.java
##
@@ -521,22 +523,11 @@ public static void fsync(Path fileToSync, boolean isDir) throws IOException {
* A Function that may throw an IOException
*
* @see java.util.function.Function
+   * @deprecated was replaced by {@link org.apache.lucene.util.IOFunction}.
*/
   @FunctionalInterface
+  @Deprecated(forRemoval = true, since = "9.1")
   public interface IOFunction<T, R> {
     R apply(T t) throws IOException;
   }
-
-  /**
-   * A resource supplier function that may throw an IOException.
-   *
-   * Note that this would open a resource such as a File. Consumers should make sure to close the
-   * resource (e.g., use try-with-resources)
-   *
-   * @see java.util.function.Supplier
-   */
-  @FunctionalInterface
-  public interface IOSupplier {

Review comment:
   just curious; why are we able to remove this one, while the others are 
merely deprecated?







[jira] [Commented] (LUCENE-10177) Rename VectorValues#dimension to VectorValues#getNumDimensions?

2022-02-10 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490196#comment-17490196
 ] 

Michael Sokolov commented on LUCENE-10177:
--

Heh, I prefer {{dimension()}} and would probably do the rename in the other 
direction, but I won't block this.

> Rename VectorValues#dimension to VectorValues#getNumDimensions?
> ---
>
> Key: LUCENE-10177
> URL: https://issues.apache.org/jira/browse/LUCENE-10177
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Major
>
> This would make it consistent with PointValues#getNumDimensions.






[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490212#comment-17490212
 ] 

Uwe Schindler commented on LUCENE-10419:


Hi,
Wouldn't it be a good idea to pass --stacktrace by default on Jenkins jobs?
I can change this. This would have made the debugging code obsolete.

> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from stringbuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.






[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490214#comment-17490214
 ] 

Uwe Schindler commented on LUCENE-10419:


Looks like a bug in Rat. Maybe it found an empty file?
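
To make the empty-input hypothesis concrete, here is a minimal sketch of how a substring call on an empty StringBuilder fails. It uses only java.lang; the class and method names are made up for illustration, and whether Rat actually hits this path is an assumption, not something the stack trace confirms. The exact wording of the exception message depends on the JDK version, so the sketch only reports whether an exception was raised.

```java
// Sketch only: shows that substring(1) on an empty StringBuilder throws
// StringIndexOutOfBoundsException. The class name is hypothetical and this
// is not Rat's code.
final class EmptyBuilderSubstring {

  /** Returns the exception message, or null if substring(1) succeeded. */
  static String substringFromOne(CharSequence content) {
    StringBuilder sb = new StringBuilder(content);
    try {
      sb.substring(1); // start index 1 is past the end of an empty builder
      return null;
    } catch (StringIndexOutOfBoundsException e) {
      return e.getMessage(); // wording differs between JDK versions
    }
  }

  public static void main(String[] args) {
    System.out.println(substringFromOne(""));   // a range-check message
    System.out.println(substringFromOne("ab")); // null, no exception
  }
}
```

If the failing file really were empty, a length check before the substring call would avoid the exception.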

> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from stringbuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.






[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490218#comment-17490218
 ] 

Uwe Schindler commented on LUCENE-10419:


We should log the file path in the catch block. Maybe it tried some binary ICU 
file or, as said before, an empty one.

> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from stringbuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.






[GitHub] [lucene] mikemccand commented on a change in pull request #630: LUCENE-10371 Make IndexRearranger able to arrange segment in a determined order

2022-02-10 Thread GitBox


mikemccand commented on a change in pull request #630:
URL: https://github.com/apache/lucene/pull/630#discussion_r803674222



##
File path: lucene/misc/src/java/org/apache/lucene/misc/index/IndexRearranger.java
##
@@ -84,6 +99,28 @@ public void execute() throws Exception {
   }
   executor.shutdown();
 }
+List ordered = new ArrayList<>();
+try (IndexReader reader = DirectoryReader.open(output)) {
+  for (DocumentSelector ds : documentSelectors) {
+boolean found = false;
+for (LeafReaderContext context : reader.leaves()) {
+  SegmentReader sr = (SegmentReader) context.reader();
+  if (ds.getFilteredLiveDocs(sr).nextSetBit(0) != DocIdSetIterator.NO_MORE_DOCS) {
+if (found) {
+  throw new IllegalStateException(
+  "A document selector can't match more than 1 rearranged segments");

Review comment:
   Hmm maybe include some details in the exception message about which 
doc(s) in which segment(s) were duplicated?
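
A rough sketch of what a more detailed message could look like. The helper name, the selector index, and the segment-name list are all hypothetical; the PR may well carry this information in a different form.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: builds an exception message that names the selector and the
// segments it matched. All names here are hypothetical, not from the PR.
final class SelectorErrorMessage {

  static String duplicateMatchMessage(int selectorIndex, List<String> segmentNames) {
    return "Document selector #" + selectorIndex
        + " matched " + segmentNames.size()
        + " rearranged segments " + segmentNames
        + "; expected at most 1";
  }

  public static void main(String[] args) {
    throw new IllegalStateException(duplicateMatchMessage(2, Arrays.asList("_0", "_3")));
  }
}
```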







[GitHub] [lucene] msokolov commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


msokolov commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803668310



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -76,17 +81,23 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
 
   @Override
   public Query rewrite(IndexReader reader) throws IOException {
-BitSet[] bitSets = null;
+TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
 
+BitSetCollector filterCollector = null;
 if (filter != null) {
+  filterCollector = new BitSetCollector(reader.leaves().size());
   IndexSearcher indexSearcher = new IndexSearcher(reader);
-  bitSets = new BitSet[reader.leaves().size()];
-  indexSearcher.search(filter, new BitSetCollector(bitSets));
+  indexSearcher.search(filter, filterCollector);

Review comment:
   for another day, but I am realizing that we have no opportunity to make 
use of per-segment concurrency here, as we ordinarily do in 
`IndexSearcher.search()`. To do so, we'd need to consider some API change 
though. Perhaps instead of using `rewrite` for this, we could make use of 
`Query`'s two-phase iteration mode of operation. Just a thought for later - 
I'll go open an issue elsewhere.

##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, BitSetCollector filterCollector)
   throws IOException {
-// If the filter is non-null, then it already handles live docs
-if (bitsFilter == null) {
-  bitsFilter = ctx.reader().getLiveDocs();
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+  if (filterIterator == null || filterIterator.cost() == 0) {
+return NO_RESULTS;
+  }
+
+  if (filterIterator.cost() <= k) {
+// If there <= k possible matches, short-circuit and perform exact search, since HNSW must
+// always visit at least k documents
+return exactSearch(ctx, target, k, filterIterator);
+  }
+
+  try {
+// The filter iterator already incorporates live docs
+Bits acceptDocs = filterIterator.getBitSet();
+int visitedLimit = (int) filterIterator.cost();
+return ctx.reader().searchNearestVectors(field, target, kPerLeaf, acceptDocs, visitedLimit);
+  } catch (
+  @SuppressWarnings("unused")
+  CollectionTerminatedException e) {
+// We stopped the kNN search because it visited too many nodes, so fall back to exact search
+return exactSearch(ctx, target, k, filterIterator);
+  }
 }
+  }
 
-TopDocs results = ctx.reader().searchNearestVectors(field, target, kPerLeaf, bitsFilter);
-if (results == null) {
+  private TopDocs exactSearch(
+  LeafReaderContext context, float[] target, int k, DocIdSetIterator acceptIterator)
+  throws IOException {
+FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+if (fi == null || fi.getVectorDimension() == 0) {
+  // The field does not exist or does not index vectors
   return NO_RESULTS;
 }
-if (ctx.docBase > 0) {
-  for (ScoreDoc scoreDoc : results.scoreDocs) {
-scoreDoc.doc += ctx.docBase;
-  }
+
+VectorSimilarityFunction similarityFunction = fi.getVectorSimilarityFunction();
+VectorValues vectorValues = context.reader().getVectorValues(field);
+
+HitQueue queue = new HitQueue(k, false);

Review comment:
   Did you consider using the pre-populated version? We might be creating 
and discarding a lot of `ScoreDoc`s here.
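
For readers unfamiliar with the pre-populated variant (the diff passes `false` as the second `HitQueue` argument), the idea can be sketched with a toy top-k heap. This is not Lucene's `HitQueue`; the class, field, and method names below are invented, and the sketch only assumes the general technique: allocate k sentinel entries once, then overwrite the weakest entry in place instead of creating a new object per candidate.

```java
import java.util.Arrays;

// Toy stand-in for a pre-populated top-k queue. Hypothetical code, not
// Lucene's HitQueue: it only illustrates reusing k pre-allocated entries.
final class PrePopulatedTopK {

  static final class Hit {
    int doc = Integer.MAX_VALUE;           // sentinel doc id
    float score = Float.NEGATIVE_INFINITY; // sentinel score
  }

  private final Hit[] heap; // min-heap on score; heap[0] is the weakest hit

  PrePopulatedTopK(int k) {
    if (k < 1) {
      throw new IllegalArgumentException("k must be >= 1");
    }
    heap = new Hit[k];
    for (int i = 0; i < k; i++) {
      heap[i] = new Hit(); // the only allocations: k objects, made up front
    }
  }

  /** Offers a candidate, overwriting the weakest entry if it is competitive. */
  void offer(int doc, float score) {
    Hit top = heap[0];
    if (score <= top.score) {
      return; // not competitive, nothing allocated or discarded
    }
    top.doc = doc;
    top.score = score;
    siftDown(0);
  }

  private void siftDown(int i) {
    int n = heap.length;
    while (true) {
      int left = 2 * i + 1;
      int right = left + 1;
      int min = i;
      if (left < n && heap[left].score < heap[min].score) min = left;
      if (right < n && heap[right].score < heap[min].score) min = right;
      if (min == i) return;
      Hit tmp = heap[i];
      heap[i] = heap[min];
      heap[min] = tmp;
      i = min;
    }
  }

  /** Returns collected doc ids, best score first, skipping unused sentinels. */
  int[] topDocs() {
    Hit[] copy = heap.clone();
    Arrays.sort(copy, (a, b) -> Float.compare(b.score, a.score));
    return Arrays.stream(copy)
        .filter(h -> h.doc != Integer.MAX_VALUE)
        .mapToInt(h -> h.doc)
        .toArray();
  }
}
```

The trade-off is that results must be filtered for leftover sentinels when fewer than k candidates were offered.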

##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, BitSetCollector filterCollector)
   throws IOException {
-// If the filter is non-null, then it already handles live docs
-if (bitsFilter == null) {
-  bitsFilter = ctx.reader().getLiveDocs();
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.g

[GitHub] [lucene] mocobeta commented on a change in pull request #673: LUCENE-10420: Move functional interfaces in IOUtils to top-level interfaces

2022-02-10 Thread GitBox


mocobeta commented on a change in pull request #673:
URL: https://github.com/apache/lucene/pull/673#discussion_r803700144



##
File path: lucene/core/src/java/org/apache/lucene/util/IOUtils.java
##
@@ -521,22 +523,11 @@ public static void fsync(Path fileToSync, boolean isDir) throws IOException {
* A Function that may throw an IOException
*
* @see java.util.function.Function
+   * @deprecated was replaced by {@link org.apache.lucene.util.IOFunction}.
*/
   @FunctionalInterface
+  @Deprecated(forRemoval = true, since = "9.1")
   public interface IOFunction<T, R> {
     R apply(T t) throws IOException;
   }
-
-  /**
-   * A resource supplier function that may throw an IOException.
-   *
-   * Note that this would open a resource such as a File. Consumers should make sure to close the
-   * resource (e.g., use try-with-resources)
-   *
-   * @see java.util.function.Supplier
-   */
-  @FunctionalInterface
-  public interface IOSupplier {

Review comment:
   This was added in #643 by me and is still not shipped to the public (I 
will remove this also from the 9x branch.)
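
As a rough illustration of the deprecate-then-remove path discussed here, the sketch below shows a top-level functional interface whose method may throw IOException, next to a nested copy kept only for compatibility. All names are stand-ins, and having the nested interface extend the top-level one is just one possible migration approach; the diff does not show whether the PR does this.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical top-level functional interface; apply may throw IOException. */
@FunctionalInterface
interface IOFunctionSketch<T, R> {
  R apply(T t) throws IOException;
}

final class LegacyUtilsSketch {
  /**
   * Nested copy kept for callers that already depend on it.
   *
   * @deprecated replaced by the top-level {@link IOFunctionSketch}.
   */
  @Deprecated
  @FunctionalInterface
  interface IOFunction<T, R> extends IOFunctionSketch<T, R> {}
}

final class IOFunctionDemo {
  /** Applies fn and converts a checked IOException into an unchecked one. */
  static <T, R> R applyUnchecked(IOFunctionSketch<T, R> fn, T arg) {
    try {
      return fn.apply(arg);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    IOFunctionSketch<String, Integer> length = s -> {
      if (s == null) {
        throw new IOException("no input");
      }
      return s.length();
    };
    System.out.println(applyUnchecked(length, "lucene"));
  }
}
```

An interface that was never released has no external callers, so it can be deleted without this intermediate step.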







[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490270#comment-17490270
 ] 

Dawid Weiss commented on LUCENE-10419:
--

I do log the path - see the bottom of that quote. I don't have the time to look 
into this now - will do it later. Indeed looks like a bug in rat somewhere.

> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from stringbuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.






[GitHub] [lucene] mocobeta commented on pull request #673: LUCENE-10420: Move functional interfaces in IOUtils to top-level interfaces

2022-02-10 Thread GitBox


mocobeta commented on pull request #673:
URL: https://github.com/apache/lucene/pull/673#issuecomment-1034956590


   Thanks @msokolov for taking a look. I will keep this open for a day or two, 
then merge it if there is no disapproval.





[GitHub] [lucene] msokolov commented on a change in pull request #672: LUCENE-10418: Optimize `Query#rewrite` in the non-scoring case.

2022-02-10 Thread GitBox


msokolov commented on a change in pull request #672:
URL: https://github.com/apache/lucene/pull/672#discussion_r803701669



##
File path: lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java
##
@@ -191,51 +191,55 @@ boolean isPureDisjunction() {
 return clauses.iterator();
   }
 
-  private BooleanQuery rewriteNoScoring() {
-boolean keepShould =
+  BooleanQuery rewriteNoScoring() {
+boolean actuallyRewritten = false;
+BooleanQuery.Builder newQuery =
+new BooleanQuery.Builder().setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
+
+final boolean keepShould =
 getMinimumNumberShouldMatch() > 0
 || (clauseSets.get(Occur.MUST).size() + clauseSets.get(Occur.FILTER).size() == 0);
 
-if (clauseSets.get(Occur.MUST).size() == 0 && keepShould) {
-  return this;
-}
-BooleanQuery.Builder newQuery = new BooleanQuery.Builder();
-
-newQuery.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
 for (BooleanClause clause : clauses) {
-  switch (clause.getOccur()) {
-case MUST:
-  {
-newQuery.add(clause.getQuery(), Occur.FILTER);
-break;
-  }
-case SHOULD:
-  {
-if (keepShould) {
-  newQuery.add(clause);
-}
-break;
-  }
-case FILTER:
-case MUST_NOT:
-default:
-  {
-newQuery.add(clause);
-  }
+  Query query = clause.getQuery();
+  Query rewritten = ConstantScoreQuery.rewriteNoScoring(query);
+  BooleanClause.Occur occur = clause.getOccur();
+  if (occur == Occur.SHOULD && keepShould == false) {
+// ignore clause
+actuallyRewritten = true;
+  } else if (occur == Occur.MUST) {
+// replace MUST clauses with FILTER clauses
+newQuery.add(rewritten, Occur.FILTER);
+actuallyRewritten = true;
+  } else if (query != rewritten) {
+newQuery.add(rewritten, occur);
+actuallyRewritten = true;
+  } else {
+newQuery.add(clause);
   }
 }
 
+if (actuallyRewritten == false) {
+  return this;
+}
+
 return newQuery.build();
   }
 
   @Override
   public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
   throws IOException {
-BooleanQuery query = this;
 if (scoreMode.needsScores() == false) {
-  query = rewriteNoScoring();
+  Query rewritten = rewriteNoScoring();
+  if (this != rewritten) {
+// Pass it back to IndexSearcher#rewrite, which might find new opportunities for rewriting

Review comment:
   this goes beyond the non-scoring case right? In theory it could result 
in additional rewrites for scoring queries as well?

##
File path: lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java
##
@@ -63,6 +65,22 @@ public Query rewrite(IndexReader reader) throws IOException {
 return super.rewrite(reader);
   }
 
+  /**
+   * Perform some simplifications that are only legal when a query is not expected to produce
+   * scores.
+   */
+  static Query rewriteNoScoring(Query query) {

Review comment:
   It might be nice to enable other queries to also be aware of the 
scoring/non-scoring mode. I think we have other queries that can have child 
queries, like `DisjunctionMaxQuery` and maybe positional queries. I mean, this 
is already a step forward - progress! Just wondering if there are other 
opportunities.







[GitHub] [lucene] rmuir commented on a change in pull request #672: LUCENE-10418: Optimize `Query#rewrite` in the non-scoring case.

2022-02-10 Thread GitBox


rmuir commented on a change in pull request #672:
URL: https://github.com/apache/lucene/pull/672#discussion_r803745240



##
File path: lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java
##
@@ -63,6 +65,22 @@ public Query rewrite(IndexReader reader) throws IOException {
 return super.rewrite(reader);
   }
 
+  /**
+   * Perform some simplifications that are only legal when a query is not 
expected to produce
+   * scores.
+   */
+  static Query rewriteNoScoring(Query query) {

Review comment:
   I question how much complexity we should add to optimize really 
degenerate inputs such as `ConstantScoreQuery(DisjunctionMaxQuery())`. I also 
think it might be better to put such logic here, not in e.g. booleanquery. The 
optimization is specific to CSQ, no? For example for your DisjunctionMaxQuery 
case:
   ```java
   } else if (query instanceof DisjunctionMaxQuery) {
     // since we don't care about scoring, turn it into a simple BooleanQuery...
     // does this even make it faster?
     var builder = new BooleanQuery.Builder();
     for (Query subQuery : (DisjunctionMaxQuery) query) {
       builder.add(subQuery, Occur.SHOULD);
     }
     return builder.build();
   }
   ```
   It might also make the logic easier to follow for the BooleanQuery case too, 
especially the recursive piece. I personally think it is a lot better than 
adding `if needsScores == false` conditional logic everywhere to that already 
hairy code. If you move it to CSQ, then there's no conditional anymore, and it 
just seems like a better home.
   
   In general, it's messy either way because I think we make it messy. I hate 
that we have `Query.rewrite` but here now we have it happening in 
`Query.createWeight` too. 
   
   It is also unclear to me if this optimization happens in all the correct 
places, where scores are not needed. This doesn't necessarily mean we need to 
add more abstractions or API complexity to make it work cleanly. For example in 
`IndexSearcher.count`, when it has to fall back to `search()` to do the 
counting, it doesn't need scores. It can wrap the query in a ConstantScoreQuery 
to get the optimizations. Probably BooleanQuery could do the same with 
its `FILTER` clauses?
   
   It is just one potential option, to really make this "non-scoring rewrite"  
case easier to optimize everywhere: wrap it in a ConstantScoreQuery and you get 
all the optimizations. There are probably other alternatives we can consider 
too.
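
To make the clause handling under discussion concrete, here is a minimal sketch of the non-scoring simplification shown in the diff: MUST clauses become FILTER clauses, SHOULD clauses are dropped when they are not needed, and the original object is returned when nothing changed. It is a toy model only. `Occur`, `Clause`, and this `rewriteNoScoring(List, boolean)` are hypothetical stand-ins, not Lucene's classes, and the sketch assumes the caller passes `keepShould == false` only when SHOULD clauses cannot change which documents match.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: Occur and Clause are hypothetical stand-ins, not Lucene classes.
class NoScoringRewriteSketch {
  enum Occur { MUST, FILTER, SHOULD, MUST_NOT }

  static final class Clause {
    final String query;
    final Occur occur;

    Clause(String query, Occur occur) {
      this.query = query;
      this.occur = occur;
    }
  }

  // Returns the input list itself when no clause changed, so callers can
  // detect "nothing to rewrite" with a reference comparison.
  static List<Clause> rewriteNoScoring(List<Clause> clauses, boolean keepShould) {
    List<Clause> rewritten = new ArrayList<>();
    boolean actuallyRewritten = false;
    for (Clause clause : clauses) {
      if (clause.occur == Occur.SHOULD && !keepShould) {
        // Assumed safe to drop: the clause only contributed to the score.
        actuallyRewritten = true;
      } else if (clause.occur == Occur.MUST) {
        // Without scores, MUST and FILTER match the same documents.
        rewritten.add(new Clause(clause.query, Occur.FILTER));
        actuallyRewritten = true;
      } else {
        rewritten.add(clause);
      }
    }
    return actuallyRewritten ? rewritten : clauses;
  }

  public static void main(String[] args) {
    List<Clause> clauses =
        List.of(new Clause("title:lucene", Occur.MUST), new Clause("body:search", Occur.SHOULD));
    for (Clause c : rewriteNoScoring(clauses, false)) {
      System.out.println(c.occur + " " + c.query);
    }
  }
}
```

Returning the same instance when nothing changed is what lets a caller decide whether another rewrite pass is worthwhile, which is the point of the `this != rewritten` check in the diff.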







[GitHub] [lucene] mocobeta opened a new pull request #674: trivial updates on github actions

2022-02-10 Thread GitBox


mocobeta opened a new pull request #674:
URL: https://github.com/apache/lucene/pull/674


   - upgrade actions/setup-java to v2 in the hunspell regression workflow 
(aligned with the main workflow)
   - migrate the distribution to 'temurin' ([supported 
distributions](https://github.com/actions/setup-java#supported-version-syntax))





[GitHub] [lucene] mocobeta commented on pull request #674: trivial updates on github actions

2022-02-10 Thread GitBox


mocobeta commented on pull request #674:
URL: https://github.com/apache/lucene/pull/674#issuecomment-1035042297


   Do you have a minute to take a look? @dweiss 





[GitHub] [lucene] dweiss commented on pull request #674: trivial updates on github actions

2022-02-10 Thread GitBox


dweiss commented on pull request #674:
URL: https://github.com/apache/lucene/pull/674#issuecomment-1035046672


   It'd be interesting to randomize those distributions using a custom action, 
perhaps? I have no experience here whatsoever, but I bet it's possible...





[GitHub] [lucene] mayya-sharipova commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


mayya-sharipova commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803804328



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws 
IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits 
bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, 
BitSetCollector filterCollector)
   throws IOException {
-// If the filter is non-null, then it already handles live docs
-if (bitsFilter == null) {
-  bitsFilter = ctx.reader().getLiveDocs();
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, 
Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+  if (filterIterator == null || filterIterator.cost() == 0) {
+return NO_RESULTS;
+  }
+
+  if (filterIterator.cost() <= k) {
+// If there <= k possible matches, short-circuit and perform exact 
search, since HNSW must
+// always visit at least k documents
+return exactSearch(ctx, target, k, filterIterator);
+  }
+
+  try {
+// The filter iterator already incorporates live docs
+Bits acceptDocs = filterIterator.getBitSet();
+int visitedLimit = (int) filterIterator.cost();
+return ctx.reader().searchNearestVectors(field, target, kPerLeaf, 
acceptDocs, visitedLimit);
+  } catch (
+  @SuppressWarnings("unused")
+  CollectionTerminatedException e) {

Review comment:
   I agree. Also, throwing an Exception is an expensive operation in 
comparison with just returning a value.







[GitHub] [lucene] mayya-sharipova commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


mayya-sharipova commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803804328



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws 
IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits 
bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, 
BitSetCollector filterCollector)
   throws IOException {
-// If the filter is non-null, then it already handles live docs
-if (bitsFilter == null) {
-  bitsFilter = ctx.reader().getLiveDocs();
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, 
Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+  if (filterIterator == null || filterIterator.cost() == 0) {
+return NO_RESULTS;
+  }
+
+  if (filterIterator.cost() <= k) {
+// If there <= k possible matches, short-circuit and perform exact 
search, since HNSW must
+// always visit at least k documents
+return exactSearch(ctx, target, k, filterIterator);
+  }
+
+  try {
+// The filter iterator already incorporates live docs
+Bits acceptDocs = filterIterator.getBitSet();
+int visitedLimit = (int) filterIterator.cost();
+return ctx.reader().searchNearestVectors(field, target, kPerLeaf, 
acceptDocs, visitedLimit);
+  } catch (
+  @SuppressWarnings("unused")
+  CollectionTerminatedException e) {

Review comment:
   I agree and also prefer not to throw an Exception if possible; it is an 
expensive operation to throw an Exception in comparison with just returning a 
value.







[GitHub] [lucene] javanna opened a new pull request #675: LUCENE-10385: Avoid SimpleText codec in TestIndexSortSortedNumericDocValuesRangeQuery

2022-02-10 Thread GitBox


javanna opened a new pull request #675:
URL: https://github.com/apache/lucene/pull/675


   The recently introduced testCount (added with LUCENE-10385) verifies that 
the Weight#count optimization kicks in. When the SimpleText codec is used, 
`DocValues#unwrapSingleton` returns null, which disables the optimization and 
makes the test fail.
   
   Relates to #635





[GitHub] [lucene] jpountz merged pull request #675: LUCENE-10385: Avoid SimpleText codec in TestIndexSortSortedNumericDocValuesRangeQuery

2022-02-10 Thread GitBox


jpountz merged pull request #675:
URL: https://github.com/apache/lucene/pull/675


   





[GitHub] [lucene] mocobeta commented on pull request #674: trivial updates on github actions

2022-02-10 Thread GitBox


mocobeta commented on pull request #674:
URL: https://github.com/apache/lucene/pull/674#issuecomment-1035107842


   Thank you, I will merge this soon.
   
   > It'd be interesting to randomize those distributions using a custom 
action, perhaps? I have no experience here whatsoever, but I bet it's 
possible...
   
   I have never tried writing such a complex action; maybe it could be 
implemented by some bash script? (I am not sure what level of flexibility the 
Actions sandbox gives users.)





[GitHub] [lucene] mocobeta merged pull request #674: trivial updates on github actions

2022-02-10 Thread GitBox


mocobeta merged pull request #674:
URL: https://github.com/apache/lucene/pull/674


   





[GitHub] [lucene] dweiss commented on pull request #674: trivial updates on github actions

2022-02-10 Thread GitBox


dweiss commented on pull request #674:
URL: https://github.com/apache/lucene/pull/674#issuecomment-1035112715


   I think these actions are javascript, basically. I've never written one 
myself, so can't help. Don't worry about it, it was just a wild idea - we have 
lots of randomization elsewhere.





[GitHub] [lucene] mocobeta commented on pull request #671: Add custom composite action to set up CI environments

2022-02-10 Thread GitBox


mocobeta commented on pull request #671:
URL: https://github.com/apache/lucene/pull/671#issuecomment-1035126874


   I'm closing this. I think we'd need more complex custom actions, or ones 
written fully from scratch, to avoid duplicating the JDK set-up across workflows.





[GitHub] [lucene] mocobeta closed pull request #671: Add custom composite action to set up CI environments

2022-02-10 Thread GitBox


mocobeta closed pull request #671:
URL: https://github.com/apache/lucene/pull/671


   





[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490365#comment-17490365
 ] 

Dawid Weiss commented on LUCENE-10419:
--

I did take a look at rat's source code. It looks like a concurrency bug 
somewhere with the stringbuilder containing junk. I can't reproduce the same 
error locally though, no matter what. Very strange. I upgraded rat on main to 
0.13; can't see how it's going to help, but it's better than nothing.

> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from stringbuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490372#comment-17490372
 ] 

ASF subversion and git services commented on LUCENE-10419:
--

Commit 21c5b42063e7a82339136f4da1041d1d7d3d3c1f in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=21c5b42 ]

LUCENE-10419: upgrade rat to 0.13.





[GitHub] [lucene] jpountz commented on a change in pull request #672: LUCENE-10418: Optimize `Query#rewrite` in the non-scoring case.

2022-02-10 Thread GitBox


jpountz commented on a change in pull request #672:
URL: https://github.com/apache/lucene/pull/672#discussion_r803883301



##
File path: lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java
##
@@ -63,6 +65,22 @@ public Query rewrite(IndexReader reader) throws IOException {
 return super.rewrite(reader);
   }
 
+  /**
+   * Perform some simplifications that are only legal when a query is not 
expected to produce
+   * scores.
+   */
+  static Query rewriteNoScoring(Query query) {

Review comment:
   This was my reasoning too, `DisjunctionMaxQuery` suggests scoring 
matters, so it didn't look worth optimizing for.
   
   I tried to improve the PR a bit with your ideas @rmuir.







[GitHub] [lucene] jpountz commented on a change in pull request #672: LUCENE-10418: Optimize `Query#rewrite` in the non-scoring case.

2022-02-10 Thread GitBox


jpountz commented on a change in pull request #672:
URL: https://github.com/apache/lucene/pull/672#discussion_r803887266



##
File path: lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java
##
@@ -114,7 +124,19 @@ public long cost() {
   @Override
   public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, 
float boost)
   throws IOException {
-final Weight innerWeight = searcher.createWeight(query, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
+final ScoreMode innerScoreMode;
+switch (scoreMode) {
+  case TOP_SCORES:
+innerScoreMode = ScoreMode.COMPLETE_NO_SCORES;
+break;
+  case TOP_DOCS_WITH_SCORES:
+innerScoreMode = ScoreMode.TOP_DOCS;
+break;
+  default:
+innerScoreMode = scoreMode;
+break;
+}

Review comment:
   I had to add this because the additional wrapping in IndexSearcher made 
a couple of tests fail because ConstantScoreQuery was not propagating the score 
mode correctly.







[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803937272



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws 
IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits 
bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, 
BitSetCollector filterCollector)
   throws IOException {
-// If the filter is non-null, then it already handles live docs
-if (bitsFilter == null) {
-  bitsFilter = ctx.reader().getLiveDocs();
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, 
Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+  if (filterIterator == null || filterIterator.cost() == 0) {

Review comment:
   I can add a comment explaining how I'm using the `BitSetIterator` here 
to capture both the bitset and the (exact) cardinality.







[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803950304



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws 
IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits 
bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, 
BitSetCollector filterCollector)
   throws IOException {
-// If the filter is non-null, then it already handles live docs
-if (bitsFilter == null) {
-  bitsFilter = ctx.reader().getLiveDocs();
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, 
Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+  if (filterIterator == null || filterIterator.cost() == 0) {
+return NO_RESULTS;
+  }
+
+  if (filterIterator.cost() <= k) {
+// If there <= k possible matches, short-circuit and perform exact 
search, since HNSW must
+// always visit at least k documents
+return exactSearch(ctx, target, k, filterIterator);
+  }
+
+  try {
+// The filter iterator already incorporates live docs
+Bits acceptDocs = filterIterator.getBitSet();
+int visitedLimit = (int) filterIterator.cost();
+return ctx.reader().searchNearestVectors(field, target, kPerLeaf, 
acceptDocs, visitedLimit);
+  } catch (
+  @SuppressWarnings("unused")
+  CollectionTerminatedException e) {
+// We stopped the kNN search because it visited too many nodes, so 
fall back to exact search
+return exactSearch(ctx, target, k, filterIterator);
+  }
 }
+  }
 
-TopDocs results = ctx.reader().searchNearestVectors(field, target, 
kPerLeaf, bitsFilter);
-if (results == null) {
+  private TopDocs exactSearch(
+  LeafReaderContext context, float[] target, int k, DocIdSetIterator 
acceptIterator)
+  throws IOException {
+FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+if (fi == null || fi.getVectorDimension() == 0) {
+  // The field does not exist or does not index vectors
   return NO_RESULTS;
 }
-if (ctx.docBase > 0) {
-  for (ScoreDoc scoreDoc : results.scoreDocs) {
-scoreDoc.doc += ctx.docBase;
-  }
+
+VectorSimilarityFunction similarityFunction = 
fi.getVectorSimilarityFunction();
+VectorValues vectorValues = context.reader().getVectorValues(field);
+
+HitQueue queue = new HitQueue(k, false);

Review comment:
   Oh this is good to know about, I'll try to switch over.







[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803965732



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws 
IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits 
bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, 
BitSetCollector filterCollector)
   throws IOException {
-// If the filter is non-null, then it already handles live docs
-if (bitsFilter == null) {
-  bitsFilter = ctx.reader().getLiveDocs();
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, 
Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+  if (filterIterator == null || filterIterator.cost() == 0) {
+return NO_RESULTS;
+  }
+
+  if (filterIterator.cost() <= k) {
+// If there <= k possible matches, short-circuit and perform exact 
search, since HNSW must
+// always visit at least k documents
+return exactSearch(ctx, target, k, filterIterator);
+  }
+
+  try {
+// The filter iterator already incorporates live docs
+Bits acceptDocs = filterIterator.getBitSet();
+int visitedLimit = (int) filterIterator.cost();
+return ctx.reader().searchNearestVectors(field, target, kPerLeaf, 
acceptDocs, visitedLimit);
+  } catch (
+  @SuppressWarnings("unused")
+  CollectionTerminatedException e) {

Review comment:
   I agree, it's nice to avoid using exceptions for normal control flow. 
I'm not too concerned from a performance perspective though: exceptions aren't 
thrown in a "hot loop" and I didn't see a perf hit in testing.
   
   If we go the route of using `TopDocs`, I'd prefer to avoid 'null' since 
that's a bit overloaded (indicates the field is missing or does not have 
vectors). Brainstorming ideas:
   * Just return `EMPTY_TOPDOCS`.
   * Still return best score docs and the visited count. But use `EQUAL_TO` for 
`TotalHits.Relation` if the search completed normally, otherwise use 
`GREATER_THAN_OR_EQUAL_TO`. 
   * Use a special subtype of `TopDocs` instead, which has an explicit 
"complete" flag?
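
As a rough illustration of the last idea (an explicit "complete" flag instead of an exception), here is a minimal sketch. Everything in it is hypothetical: `Result`, `limitedSearch`, and `exactSearch` are toy stand-ins that walk an array of candidate doc IDs, not Lucene's `TopDocs` or the HNSW searcher. It only shows how a caller could fall back to exact search by checking a flag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy types; this is not Lucene's TopDocs or HnswGraphSearcher.
class EarlyTerminationSketch {
  static final class Result {
    final List<Integer> docs;
    final int visited;
    final boolean complete;

    Result(List<Integer> docs, int visited, boolean complete) {
      this.docs = docs;
      this.visited = visited;
      this.complete = complete;
    }
  }

  // Stand-in for a graph search: visits candidates in order and stops once
  // visitedLimit nodes have been visited, reporting that through the flag.
  static Result limitedSearch(int[] candidates, int visitedLimit) {
    List<Integer> docs = new ArrayList<>();
    int visited = 0;
    for (int doc : candidates) {
      if (visited >= visitedLimit) {
        return new Result(docs, visited, false); // terminated early
      }
      visited++;
      docs.add(doc);
    }
    return new Result(docs, visited, true);
  }

  // Stand-in for the exact (brute-force) path: considers every candidate.
  static List<Integer> exactSearch(int[] candidates) {
    List<Integer> docs = new ArrayList<>();
    for (int doc : candidates) {
      docs.add(doc);
    }
    return docs;
  }

  // The caller checks the flag rather than catching an exception.
  static List<Integer> search(int[] candidates, int visitedLimit) {
    Result result = limitedSearch(candidates, visitedLimit);
    return result.complete ? result.docs : exactSearch(candidates);
  }

  public static void main(String[] args) {
    int[] candidates = {7, 8, 9, 10};
    System.out.println(limitedSearch(candidates, 2).complete);
    System.out.println(search(candidates, 2));
  }
}
```

The same shape would work with the `TotalHits.Relation` idea: the boolean would simply be replaced by checking which relation came back.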







[GitHub] [lucene] jtibshirani commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r804009517



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -70,18 +118,104 @@ public Query rewrite(IndexReader reader) throws 
IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf) throws 
IOException {
-Bits liveDocs = ctx.reader().getLiveDocs();
-TopDocs results = ctx.reader().searchNearestVectors(field, target, 
kPerLeaf, liveDocs);
-if (results == null) {
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, 
BitSetCollector filterCollector)
+  throws IOException {
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, 
Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+  if (filterIterator == null || filterIterator.cost() == 0) {
+return NO_RESULTS;
+  }
+
+  if (filterIterator.cost() <= k) {
+// If there <= k possible matches, short-circuit and perform exact 
search, since HNSW must
+// always visit at least k documents
+return exactSearch(ctx, target, k, filterIterator);
+  }
+
+  try {
+// The filter iterator already incorporates live docs
+Bits acceptDocs = filterIterator.getBitSet();
+int visitedLimit = (int) filterIterator.cost();
+return ctx.reader().searchNearestVectors(field, target, kPerLeaf, 
acceptDocs, visitedLimit);
+  } catch (
+  @SuppressWarnings("unused")
+  CollectionTerminatedException e) {
+// We stopped the kNN search because it visited too many nodes, so 
fall back to exact search
+return exactSearch(ctx, target, k, filterIterator);
+  }
+}
+  }
+
+  private TopDocs exactSearch(
+  LeafReaderContext context, float[] target, int k, DocIdSetIterator 
acceptIterator)
+  throws IOException {
+FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+if (fi == null || fi.getVectorDimension() == 0) {
+  // The field does not exist or does not index vectors
   return NO_RESULTS;
 }
-if (ctx.docBase > 0) {
-  for (ScoreDoc scoreDoc : results.scoreDocs) {
-scoreDoc.doc += ctx.docBase;
+
+VectorSimilarityFunction similarityFunction = 
fi.getVectorSimilarityFunction();
+VectorValues vectorValues = context.reader().getVectorValues(field);
+
+HitQueue queue = new HitQueue(k, false);
+DocIdSetIterator iterator =
+ConjunctionUtils.intersectIterators(List.of(acceptIterator, 
vectorValues));

Review comment:
   I just noticed: maybe we should move this intersection earlier to when 
we execute the filter into a bitset. The way we do it now, our assessment of 
the filter selectivity is inaccurate when docs are missing vectors.
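
   To illustrate the selectivity concern, here is a minimal, self-contained 
sketch. It uses `java.util.BitSet` as a stand-in for Lucene's bitset classes, 
and the class and method names are hypothetical rather than part of this PR.

```java
import java.util.BitSet;

/** Hypothetical illustration; java.util.BitSet stands in for Lucene's bitset classes. */
final class FilterSelectivitySketch {

  /** Number of docs the filter accepts, ignoring whether they have a vector. */
  static int filterOnlyCost(BitSet filterMatches) {
    return filterMatches.cardinality();
  }

  /** Number of docs that match the filter AND have a vector. */
  static int intersectedCost(BitSet filterMatches, BitSet docsWithVectors) {
    BitSet both = (BitSet) filterMatches.clone(); // do not mutate the caller's bitset
    both.and(docsWithVectors);
    return both.cardinality();
  }

  public static void main(String[] args) {
    BitSet filterMatches = new BitSet(10);
    filterMatches.set(0, 8); // docs 0..7 match the filter
    BitSet docsWithVectors = new BitSet(10);
    docsWithVectors.set(6, 10); // only docs 6..9 have a vector

    System.out.println(filterOnlyCost(filterMatches)); // 8
    System.out.println(intersectedCost(filterMatches, docsWithVectors)); // 2
  }
}
```

   With `k = 5`, the filter-only cost (8) would send us down the HNSW path even 
though only 2 documents can actually be returned, whereas the intersected cost 
(2) would short-circuit to exact search.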







[GitHub] [lucene] mayya-sharipova commented on a change in pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

2022-02-10 Thread GitBox


mayya-sharipova commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r804028845



##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws 
IOException {
 return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits 
bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, 
BitSetCollector filterCollector)
   throws IOException {
-// If the filter is non-null, then it already handles live docs
-if (bitsFilter == null) {
-  bitsFilter = ctx.reader().getLiveDocs();
+
+if (filterCollector == null) {
+  Bits acceptDocs = ctx.reader().getLiveDocs();
+  return ctx.reader()
+  .searchNearestVectors(field, target, kPerLeaf, acceptDocs, 
Integer.MAX_VALUE);
+} else {
+  BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+  if (filterIterator == null || filterIterator.cost() == 0) {
+return NO_RESULTS;
+  }
+
+  if (filterIterator.cost() <= k) {
+// If there <= k possible matches, short-circuit and perform exact 
search, since HNSW must
+// always visit at least k documents
+return exactSearch(ctx, target, k, filterIterator);
+  }
+
+  try {
+// The filter iterator already incorporates live docs
+Bits acceptDocs = filterIterator.getBitSet();
+int visitedLimit = (int) filterIterator.cost();
+return ctx.reader().searchNearestVectors(field, target, kPerLeaf, 
acceptDocs, visitedLimit);
+  } catch (
+  @SuppressWarnings("unused")
+  CollectionTerminatedException e) {

Review comment:
   I very much liked the idea of "a special subtype of TopDocs instead, which 
has an explicit "complete" flag".
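
   As a rough sketch of what that could look like (the class below is a 
hypothetical stand-in, not Lucene's `TopDocs`, and the toy search methods only 
exist to make the example self-contained):

```java
import java.util.Arrays;

/** Hypothetical stand-in for a TopDocs subtype carrying an explicit "complete" flag. */
final class KnnResultSketch {
  final int[] docs;
  final boolean complete; // false when the search stopped at the visited limit

  KnnResultSketch(int[] docs, boolean complete) {
    this.docs = docs;
    this.complete = complete;
  }

  /** Toy approximate search: collects up to k candidates, gives up after visitedLimit visits. */
  static KnnResultSketch approximateSearch(int[] candidates, int k, int visitedLimit) {
    int[] hits = new int[Math.min(k, candidates.length)];
    int found = 0;
    int visited = 0;
    for (int doc : candidates) {
      if (found == hits.length) {
        break; // collected everything we were asked for
      }
      if (visited == visitedLimit) {
        // Signal early termination through the flag rather than an exception
        return new KnnResultSketch(Arrays.copyOf(hits, found), false);
      }
      visited++;
      hits[found++] = doc;
    }
    return new KnnResultSketch(Arrays.copyOf(hits, found), true);
  }

  /** Toy exact search: simply takes the first k candidates. */
  static KnnResultSketch exactSearch(int[] candidates, int k) {
    return new KnnResultSketch(Arrays.copyOf(candidates, Math.min(k, candidates.length)), true);
  }

  /** The caller checks the flag instead of catching an exception. */
  static KnnResultSketch search(int[] candidates, int k, int visitedLimit) {
    KnnResultSketch result = approximateSearch(candidates, k, visitedLimit);
    return result.complete ? result : exactSearch(candidates, k);
  }
}
```

   The fallback to exact search then becomes an ordinary branch on 
`result.complete`, with no `try`/`catch` around the kNN call.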







[GitHub] [lucene] gautamworah96 commented on a change in pull request #658: LUCENE-10378 Implement Weight#count for PointRangeQuery

2022-02-10 Thread GitBox


gautamworah96 commented on a change in pull request #658:
URL: https://github.com/apache/lucene/pull/658#discussion_r804116906



##
File path: lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java
##
@@ -369,6 +378,100 @@ public Scorer scorer(LeafReaderContext context) throws 
IOException {
 return scorerSupplier.get(Long.MAX_VALUE);
   }
 
+  @Override
+  public int count(LeafReaderContext context) throws IOException {
+LeafReader reader = context.reader();
+
+PointValues values = reader.getPointValues(field);
+if (checkValidPointValues(values) == false) {
+  return 0;
+}
+
+if (reader.hasDeletions() == false
+&& numDims == 1
+&& values.getDocCount() == values.size()) {
+  // if all documents have at-most one point
+  return (int) pointCount(values.getPointTree(), this::relate, 
this::matches);
+}
+return super.count(context);
+  }
+
+  /**
+   * Finds the number of points matching the provided range conditions. 
Using this method is
+   * faster than calling {@link PointValues#intersect(IntersectVisitor)} 
to get the count of
+   * intersecting points. This method does not enforce live documents, 
therefore it should only
+   * be used when there are no deleted documents.
+   *
+   * @param pointTree start node of the count operation
+   * @param nodeComparator comparator to be used for checking whether the 
internal node is
+   * inside the range
+   * @param leafComparator comparator to be used for checking whether the 
leaf node is inside
+   * the range
+   * @return count of points that match the range
+   */
+  private long pointCount(
+  PointValues.PointTree pointTree,
+  BiFunction<byte[], byte[], Relation> nodeComparator,
+  Predicate<byte[]> leafComparator)
+  throws IOException {
+final int[] matchingLeafNodeCount = {0};
+// create a custom IntersectVisitor that records the number of 
leafNodes that matched
+final IntersectVisitor visitor =
+new IntersectVisitor() {
+  @Override
+  public void visit(int docID) {
+// this branch should be unreachable
+throw new UnsupportedOperationException(
+"This IntersectVisitor does not perform any actions on a "
++ "docID="
++ docID
++ " node being visited");
+  }
+
+  @Override
+  public void visit(int docID, byte[] packedValue) {
+if (leafComparator.test(packedValue)) {
+  matchingLeafNodeCount[0]++;
+}
+  }
+
+  @Override
+  public Relation compare(byte[] minPackedValue, byte[] 
maxPackedValue) {
+return nodeComparator.apply(minPackedValue, maxPackedValue);
+  }
+};
+Relation r =

Review comment:
   I've implemented a method signature that I thought would be simpler to 
understand. It restricts all increment/counting operations to the 
`matchingLeafNodeCount` array. The second `pointCount` function just returns `void`.
   
   IMO, the other, slightly more complex approach resulted in a method 
signature like 
   ```
 private long pointCount(
 IntersectVisitor visitor,
 PointValues.PointTree pointTree,
 BiFunction<byte[], byte[], Relation> nodeComparator,
 Predicate<byte[]> leafComparator,
 int[] matchingLeafNodeCount)
   ```
   Here is a 
[branch](https://github.com/gautamworah96/lucene/commit/fe937df49def4dc3cd512fef6c7d39ef53023fb1)
 that implements this method signature and adds `matchingLeafNodeCount[0]` to the 
final count.







[GitHub] [lucene] jtibshirani commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

2022-02-10 Thread GitBox


jtibshirani commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r804189512



##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##
@@ -138,9 +140,20 @@ public void writeField(FieldInfo fieldInfo, 
KnnVectorsReader knnVectorsReader)
 
   long vectorIndexOffset = vectorIndex.getFilePointer();
   // build the graph using the temporary vector data
+  int count = docsWithField.cardinality();
+  int[] docIds = null;
+  if (count < maxDoc) {

Review comment:
   Although it was a bit fragile, I preferred the previous approach of 
passing `null` with a clear comment. Now it seems like we're doing (potentially 
significant?) extra work that will not be used.

##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##
@@ -206,14 +214,19 @@ private void writeMeta(
 meta.writeVLong(vectorIndexOffset);
 meta.writeVLong(vectorIndexLength);
 meta.writeInt(field.getVectorDimension());
-meta.writeInt(docIds.length);
-for (int docId : docIds) {
-  // TODO: delta-encode, or write as bitset
-  meta.writeVInt(docId);
+
+// write docIDs
+meta.writeInt(count);
+if (docIds == null) {
+  meta.writeShort((short) -1); // dense marker, each document has a vector 
value

Review comment:
   Any reason not to use `writeByte` here?
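
   For reference, a small sketch of the size difference. It uses 
`java.io.DataOutputStream` as a stand-in for Lucene's output classes, so the 
class below is hypothetical and only shows the byte counts:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical illustration; java.io.DataOutputStream stands in for Lucene's output classes. */
final class DenseMarkerSketch {

  /** Writes a -1 marker either as a short or as a byte and returns the encoded size. */
  private static int encodedSize(boolean asShort) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      if (asShort) {
        out.writeShort(-1);
      } else {
        out.writeByte(-1);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e); // not expected for an in-memory stream
    }
    return bytes.size();
  }

  static int bytesForShortMarker() {
    return encodedSize(true);
  }

  static int bytesForByteMarker() {
    return encodedSize(false);
  }

  public static void main(String[] args) {
    System.out.println(bytesForShortMarker()); // 2
    System.out.println(bytesForByteMarker()); // 1
  }
}
```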

##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##
@@ -372,7 +393,9 @@ int size() {
   implements RandomAccessVectorValues, RandomAccessVectorValuesProducer {
 
 final int dimension;
+final int size;

Review comment:
   Small comment: maybe we can make all of these variables (including the 
new ones) private.







[GitHub] [lucene] jtibshirani commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

2022-02-10 Thread GitBox


jtibshirani commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1035648183


   Additional motivation for this PR: it could help with performance of exact 
search (in https://github.com/apache/lucene/pull/656). When all docs have 
vectors, we can avoid a binary search in `VectorValues#advance`.
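
   A minimal sketch of the difference, assuming a sorted `docIds` array for the 
sparse case (the class and method names are hypothetical and do not mirror the 
actual reader code):

```java
import java.util.Arrays;

/** Hypothetical illustration of mapping a target doc ID to a vector ordinal. */
final class VectorOrdinalSketch {

  /**
   * Sparse case: docIds is sorted and lists the docs that have a vector. Finds the ordinal of the
   * first doc ID >= target with a binary search, or -1 if there is none.
   */
  static int advanceSparse(int[] docIds, int target) {
    int idx = Arrays.binarySearch(docIds, target);
    if (idx < 0) {
      idx = -idx - 1; // insertion point: index of the first doc ID greater than target
    }
    return idx < docIds.length ? idx : -1;
  }

  /** Dense case: every doc has a vector, so the ordinal equals the doc ID. */
  static int advanceDense(int maxDoc, int target) {
    return target < maxDoc ? target : -1;
  }

  public static void main(String[] args) {
    int[] docIds = {2, 5, 9};
    System.out.println(advanceSparse(docIds, 6)); // 2 (doc 9)
    System.out.println(advanceDense(10, 6)); // 6
  }
}
```

   In the dense case the lookup is a single comparison, while the sparse case 
pays for a binary search on every `advance` call.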





[jira] [Commented] (LUCENE-10176) Remove VectorValues#size()

2022-02-10 Thread spike liu (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490670#comment-17490670
 ] 

spike liu commented on LUCENE-10176:


I would like to work on this.

> Remove VectorValues#size()
> --
>
> Key: LUCENE-10176
> URL: https://issues.apache.org/jira/browse/LUCENE-10176
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Major
>
> This method doesn't seem to be used anywhere except by 
> SimpleTextKnnVectorsReader#search, which uses it in an incorrect way by using 
> it as the total number of hits matching a nearest-neighbor search (it is 
> incorrect because this number might be higher than the number of vectors 
> having a value because of deletes).






[GitHub] [lucene] spike-liu opened a new pull request #676: Lucene-10176: Remove VectorValues#size()

2022-02-10 Thread GitBox


spike-liu opened a new pull request #676:
URL: https://github.com/apache/lucene/pull/676


   https://issues.apache.org/jira/browse/LUCENE-10176





[jira] [Commented] (LUCENE-10176) Remove VectorValues#size()

2022-02-10 Thread spike liu (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490671#comment-17490671
 ] 

spike liu commented on LUCENE-10176:


https://github.com/apache/lucene/pull/676

> Remove VectorValues#size()
> --
>
> Key: LUCENE-10176
> URL: https://issues.apache.org/jira/browse/LUCENE-10176
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Major
>
> This method doesn't seem to be used anywhere except by 
> SimpleTextKnnVectorsReader#search, which uses it in an incorrect way by using 
> it as the total number of hits matching a nearest-neighbor search (it is 
> incorrect because this number might be higher than the number of vectors 
> having a value because of deletes).






[GitHub] [lucene] mocobeta commented on pull request #674: trivial updates on github actions

2022-02-10 Thread GitBox


mocobeta commented on pull request #674:
URL: https://github.com/apache/lucene/pull/674#issuecomment-1035913042


   I am not sure if randomizing the distribution per test run is possible 
without forking the setup-java action, but I think a matrix test may be easy 
(if it makes sense to run workflows for multiple distributions on every PR).





[jira] [Commented] (LUCENE-10419) Identify occasional validateSourcePatterns error on CI servers

2022-02-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490698#comment-17490698
 ] 

Dawid Weiss commented on LUCENE-10419:
--

{code:java}
Unhandled exception while validating patterns on file: 
/home/jenkins/workspace/Lucene-9.x-Linux/lucene/test-framework/src/java/org/apache/lucene/tests/analysis/standard/WordBreakTestUnicode_12_1_0.java{code}
Different file. This has to be a race condition or a JVM bug somewhere on your 
machine, Uwe. As far as I remember, this doesn't happen anywhere else, only on 
Policeman Jenkins. Very strange.

> Identify occasional validateSourcePatterns error on CI servers
> --
>
> Key: LUCENE-10419
> URL: https://issues.apache.org/jira/browse/LUCENE-10419
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> {code}
> What went wrong: Execution failed for task 
> ':lucene:analysis:icu:validateSourcePatterns'. > start 1, end 0, length 0
> {code}
>  
> This annoys me. It's a message from stringbuilder.substring somewhere - let's 
> get the stack of that first and see where the bug is.
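
For context, a range like the one in the message (start 1, end 0, length 0) is 
rejected by `StringBuilder.substring` on an empty builder. The sketch below is 
a hypothetical reproduction only, not the code path from the build, and the 
exact wording of the exception message may depend on the JDK version:

```java
/** Hypothetical reproduction; not the code path from the build. */
final class SubstringRangeSketch {

  /** Returns true if substring(start, end) on an empty builder rejects the range. */
  static boolean rejects(int start, int end) {
    try {
      new StringBuilder().substring(start, end);
      return false;
    } catch (StringIndexOutOfBoundsException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(rejects(1, 0)); // start > end on an empty builder
    System.out.println(rejects(0, 0)); // empty range, accepted
  }
}
```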






[GitHub] [lucene] iverase commented on a change in pull request #658: LUCENE-10378 Implement Weight#count for PointRangeQuery

2022-02-10 Thread GitBox


iverase commented on a change in pull request #658:
URL: https://github.com/apache/lucene/pull/658#discussion_r804415672



##
File path: lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java
##
@@ -369,6 +378,100 @@ public Scorer scorer(LeafReaderContext context) throws 
IOException {
 return scorerSupplier.get(Long.MAX_VALUE);
   }
 
+  @Override
+  public int count(LeafReaderContext context) throws IOException {
+LeafReader reader = context.reader();
+
+PointValues values = reader.getPointValues(field);
+if (checkValidPointValues(values) == false) {
+  return 0;
+}
+
+if (reader.hasDeletions() == false
+&& numDims == 1
+&& values.getDocCount() == values.size()) {
+  // if all documents have at-most one point
+  return (int) pointCount(values.getPointTree(), this::relate, 
this::matches);
+}
+return super.count(context);
+  }
+
+  /**
+   * Finds the number of points matching the provided range conditions. 
Using this method is
+   * faster than calling {@link PointValues#intersect(IntersectVisitor)} 
to get the count of
+   * intersecting points. This method does not enforce live documents, 
therefore it should only
+   * be used when there are no deleted documents.
+   *
+   * @param pointTree start node of the count operation
+   * @param nodeComparator comparator to be used for checking whether the 
internal node is
+   * inside the range
+   * @param leafComparator comparator to be used for checking whether the 
leaf node is inside
+   * the range
+   * @return count of points that match the range
+   */
+  private long pointCount(
+  PointValues.PointTree pointTree,
+  BiFunction<byte[], byte[], Relation> nodeComparator,
+  Predicate<byte[]> leafComparator)
+  throws IOException {
+final int[] matchingLeafNodeCount = {0};
+// create a custom IntersectVisitor that records the number of 
leafNodes that matched
+final IntersectVisitor visitor =
+new IntersectVisitor() {
+  @Override
+  public void visit(int docID) {
+// this branch should be unreachable
+throw new UnsupportedOperationException(
+"This IntersectVisitor does not perform any actions on a "
++ "docID="
++ docID
++ " node being visited");
+  }
+
+  @Override
+  public void visit(int docID, byte[] packedValue) {
+if (leafComparator.test(packedValue)) {
+  matchingLeafNodeCount[0]++;
+}
+  }
+
+  @Override
+  public Relation compare(byte[] minPackedValue, byte[] 
maxPackedValue) {
+return nodeComparator.apply(minPackedValue, maxPackedValue);
+  }
+};
+Relation r =

Review comment:
   That is correct, but why are you passing the `nodeComparator` and the 
`leafComparator` here? They're not needed anymore, as they are part of the 
IntersectVisitor.
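
   To make the point concrete, here is a minimal, self-contained sketch using a 
toy tree over `int` values. None of these types are Lucene's, and all names are 
hypothetical. The recursive helper only takes the visitor and the node, because 
the visitor already captures both comparators:

```java
import java.util.function.BiFunction;
import java.util.function.Predicate;

/** Hypothetical toy model; Node and Visitor stand in for PointTree and IntersectVisitor. */
final class PointCountSketch {
  enum Relation { INSIDE, OUTSIDE, CROSSES }

  interface Visitor {
    Relation compare(int min, int max);
    void visit(int value);
  }

  static final class Node {
    final int min;
    final int max;
    final int[] values; // non-null only for leaves
    final Node[] children; // non-null only for inner nodes

    private Node(int min, int max, int[] values, Node[] children) {
      this.min = min;
      this.max = max;
      this.values = values;
      this.children = children;
    }

    static Node leaf(int... sortedValues) {
      return new Node(sortedValues[0], sortedValues[sortedValues.length - 1], sortedValues, null);
    }

    static Node inner(Node... ordered) {
      return new Node(ordered[0].min, ordered[ordered.length - 1].max, null, ordered);
    }

    long size() {
      if (values != null) return values.length;
      long total = 0;
      for (Node child : children) total += child.size();
      return total;
    }
  }

  static long pointCount(
      Node root,
      BiFunction<Integer, Integer, Relation> nodeComparator,
      Predicate<Integer> leafComparator) {
    final long[] matchingLeafValues = {0};
    Visitor visitor =
        new Visitor() {
          @Override
          public Relation compare(int min, int max) {
            return nodeComparator.apply(min, max);
          }

          @Override
          public void visit(int value) {
            if (leafComparator.test(value)) matchingLeafValues[0]++;
          }
        };
    // Fully-inside subtrees are counted by size; crossing leaves are counted by the visitor
    return countInside(visitor, root) + matchingLeafValues[0];
  }

  /** Needs only the visitor and the node: the comparators live inside the visitor. */
  private static long countInside(Visitor visitor, Node node) {
    switch (visitor.compare(node.min, node.max)) {
      case OUTSIDE:
        return 0;
      case INSIDE:
        return node.size();
      default: // CROSSES
        if (node.values != null) {
          for (int value : node.values) visitor.visit(value);
          return 0;
        }
        long total = 0;
        for (Node child : node.children) total += countInside(visitor, child);
        return total;
    }
  }

  /** Convenience wrapper: counts values in the closed range [lo, hi]. */
  static long countInRange(Node root, int lo, int hi) {
    return pointCount(
        root,
        (min, max) -> {
          if (max < lo || min > hi) return Relation.OUTSIDE;
          if (min >= lo && max <= hi) return Relation.INSIDE;
          return Relation.CROSSES;
        },
        value -> value >= lo && value <= hi);
  }
}
```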



