date:20221210

[GitHub] [lucene] dweiss commented on a diff in pull request #11998: Migrate away from per-segment-per-threadlocals on SegmentReader

2022-12-10 Thread GitBox



dweiss commented on code in PR #11998:
URL: https://github.com/apache/lucene/pull/11998#discussion_r1045027949


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java:
##
@@ -113,34 +116,84 @@ public Fields getTermVectors(int docID) throws 
IOException {
 return fields == null ? null : new AssertingFields(fields);
   }
 
+  @Override
+  public TermVectors termVectors() throws IOException {
+return new AssertingTermVectors(super.termVectors());
+  }
+
+  @Override
+  public StoredFields storedFields() throws IOException {
+return new AssertingStoredFields(super.storedFields());
+  }
+
+  /** Wraps a StoredFields but with additional asserts */
+  public static class AssertingStoredFields extends StoredFields {
+private final StoredFields in;
+private final Thread creationThread = Thread.currentThread();
+
+public AssertingStoredFields(StoredFields in) {
+  this.in = in;
+}
+
+@Override
+public void document(int docID, StoredFieldVisitor visitor) throws 
IOException {
+  assertThread("StoredFields", creationThread);
+  in.document(docID, visitor);
+}
+  }
+
+  /** Wraps a TermVectors but with additional asserts */
+  public static class AssertingTermVectors extends TermVectors {
+private final TermVectors in;
+private final Thread creationThread = Thread.currentThread();
+
+public AssertingTermVectors(TermVectors in) {
+  this.in = in;
+}
+
+@Override
+public Fields get(int doc) throws IOException {
+  assertThread("TermVectors", creationThread);
+  Fields fields = in.get(doc);
+  return fields == null ? null : new AssertingFields(fields);
+}
+  }
+
   /** Wraps a Fields but with additional asserts */
   public static class AssertingFields extends FilterFields {
+private final Thread creationThread = Thread.currentThread();

Review Comment:
   +1. It's not just thread-safety but strict thread-confinement. I assume 
we're fine with this stricter rule within tests/codebase. I don't see any 
reason why these objects should be passed around (?).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on a diff in pull request #11998: Migrate away from per-segment-per-threadlocals on SegmentReader

2022-12-10 Thread GitBox



rmuir commented on code in PR #11998:
URL: https://github.com/apache/lucene/pull/11998#discussion_r1045078398


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java:
##
@@ -113,34 +116,84 @@ public Fields getTermVectors(int docID) throws 
IOException {
 return fields == null ? null : new AssertingFields(fields);
   }
 
+  @Override
+  public TermVectors termVectors() throws IOException {
+return new AssertingTermVectors(super.termVectors());
+  }
+
+  @Override
+  public StoredFields storedFields() throws IOException {
+return new AssertingStoredFields(super.storedFields());
+  }
+
+  /** Wraps a StoredFields but with additional asserts */
+  public static class AssertingStoredFields extends StoredFields {
+private final StoredFields in;
+private final Thread creationThread = Thread.currentThread();
+
+public AssertingStoredFields(StoredFields in) {
+  this.in = in;
+}
+
+@Override
+public void document(int docID, StoredFieldVisitor visitor) throws 
IOException {
+  assertThread("StoredFields", creationThread);
+  in.document(docID, visitor);
+}
+  }
+
+  /** Wraps a TermVectors but with additional asserts */
+  public static class AssertingTermVectors extends TermVectors {
+private final TermVectors in;
+private final Thread creationThread = Thread.currentThread();
+
+public AssertingTermVectors(TermVectors in) {
+  this.in = in;
+}
+
+@Override
+public Fields get(int doc) throws IOException {
+  assertThread("TermVectors", creationThread);
+  Fields fields = in.get(doc);
+  return fields == null ? null : new AssertingFields(fields);
+}
+  }
+
   /** Wraps a Fields but with additional asserts */
   public static class AssertingFields extends FilterFields {
+private final Thread creationThread = Thread.currentThread();

Review Comment:
   I had the idea to add these checks for stored/fields vectors just to fail a 
test clearly rather than with some "crazy" behavior that would be difficult to 
debug. 
   
   Looks like i wasn't the first person to have this idea, as @jpountz already 
added most of them 8 years ago: 
https://github.com/apache/lucene/commit/fc94b0b4d9e0
   
   For the postings API, previously we had existing checks on `TermsEnum`, 
`PostingsEnum`. But `Terms` and `Fields` were not checked. Seems like an 
oversight to me, they are all in the same codec postings api (also used by term 
vectors). So, depending on codec's implementation, codec might do e.g. any 
number of disk reads or whatever it wants and these should be checked too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on a diff in pull request #12003: Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery

2022-12-10 Thread GitBox



jpountz commented on code in PR #12003:
URL: https://github.com/apache/lucene/pull/12003#discussion_r1045084049


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##
@@ -212,9 +212,9 @@ public boolean isCacheable(LeafReaderContext ctx) {
   @Override
   public int count(LeafReaderContext context) throws IOException {
 if (context.reader().hasDeletions() == false) {
-  BoundedDocIdSetIterator disi = getDocIdSetIteratorOrNull(context);
-  if (disi != null && disi.delegate == null) {
-return disi.lastDoc - disi.firstDoc;
+  DocIdSetIterator disi = getDocIdSetIteratorOrNull(context);
+  if (disi != null && disi instanceof BoundedDocIdSetIterator == 
false) {
+return Math.toIntExact(disi.cost());

Review Comment:
   I worry that this might be a bit fragile since cost() has no guarantee to be 
accurate. I wonder if we could make `getDocIdSetIteratorOrNull()` return both a 
`DocIdSetIterator` and a number of matches (possibly -1 when unknown) to make 
this less trappy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on a diff in pull request #11998: Migrate away from per-segment-per-threadlocals on SegmentReader

2022-12-10 Thread GitBox



jpountz commented on code in PR #11998:
URL: https://github.com/apache/lucene/pull/11998#discussion_r1045086896


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java:
##
@@ -113,34 +116,84 @@ public Fields getTermVectors(int docID) throws 
IOException {
 return fields == null ? null : new AssertingFields(fields);
   }
 
+  @Override
+  public TermVectors termVectors() throws IOException {
+return new AssertingTermVectors(super.termVectors());
+  }
+
+  @Override
+  public StoredFields storedFields() throws IOException {
+return new AssertingStoredFields(super.storedFields());
+  }
+
+  /** Wraps a StoredFields but with additional asserts */
+  public static class AssertingStoredFields extends StoredFields {
+private final StoredFields in;
+private final Thread creationThread = Thread.currentThread();
+
+public AssertingStoredFields(StoredFields in) {
+  this.in = in;
+}
+
+@Override
+public void document(int docID, StoredFieldVisitor visitor) throws 
IOException {
+  assertThread("StoredFields", creationThread);
+  in.document(docID, visitor);
+}
+  }
+
+  /** Wraps a TermVectors but with additional asserts */
+  public static class AssertingTermVectors extends TermVectors {
+private final TermVectors in;
+private final Thread creationThread = Thread.currentThread();
+
+public AssertingTermVectors(TermVectors in) {
+  this.in = in;
+}
+
+@Override
+public Fields get(int doc) throws IOException {
+  assertThread("TermVectors", creationThread);
+  Fields fields = in.get(doc);
+  return fields == null ? null : new AssertingFields(fields);
+}
+  }
+
   /** Wraps a Fields but with additional asserts */
   public static class AssertingFields extends FilterFields {
+private final Thread creationThread = Thread.currentThread();

Review Comment:
   +1 to check Terms and Fields too. Most instances returned by codecs should 
be thread-safe but it's still good to make sure consumers don't share them 
across threads for consistency with other codec APIs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz opened a new issue, #12005: Should we fail retrieving doc values on the wrong type?

2022-12-10 Thread GitBox



jpountz opened a new issue, #12005:
URL: https://github.com/apache/lucene/issues/12005

   ### Description
   
   Lucene currently treats the different doc-value types as if they were 
independent features and `LeafReader#getXXXDocValues()` returns `null` in case 
of mismatch, e.g. retrieving binary doc values on a field that indexed numeric 
doc values.
   
   I recently learned that this behavior is considered trappy enough that some 
code bases [forbid these 
methods](https://github.com/apache/lucene/pull/11950#discussion_r1029470218), 
which makes me wonder if we should change our stance and fail 
`LeafReader#getXXXDocValues()` in case of mismatch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on a diff in pull request #12004: Move byte vector queries into new KnnByteVectorQuery

2022-12-10 Thread GitBox



jpountz commented on code in PR #12004:
URL: https://github.com/apache/lucene/pull/12004#discussion_r1045087956


##
lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java:
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Objects;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.document.KnnVectorField;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Uses {@link KnnVectorsReader#search(String, BytesRef, int, Bits, int)} to 
perform nearest
+ * neighbour search.
+ *
+ * This query also allows for performing a kNN search subject to a filter. 
In this case, it first
+ * executes the filter for each leaf, then chooses a strategy dynamically:
+ *
+ * 
+ *   If the filter cost is less than k, just execute an exact search
+ *   Otherwise run a kNN search subject to the filter
+ *   If the kNN search visits too many vectors without completing, stop 
and run an exact search
+ * 
+ */
+public class KnnByteVectorQuery extends AbstractKnnVectorQuery {
+
+  private static final TopDocs NO_RESULTS = TopDocsCollector.EMPTY_TOPDOCS;
+
+  private final BytesRef target;
+
+  /**
+   * Find the k nearest documents to the target vector according 
to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find
+   * @throws IllegalArgumentException if k is less than 1
+   */
+  public KnnByteVectorQuery(String field, BytesRef target, int k) {
+this(field, target, k, null);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according 
to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find
+   * @param filter a filter applied before the vector search
+   * @throws IllegalArgumentException if k is less than 1
+   */
+  public KnnByteVectorQuery(String field, BytesRef target, int k, Query 
filter) {
+super(field, k, filter);
+this.target = target;
+  }
+
+  @Override
+  protected TopDocs approximateSearch(LeafReaderContext context, Bits 
acceptDocs, int visitedLimit)
+  throws IOException {
+TopDocs results =
+context.reader().searchNearestVectors(field, target, k, acceptDocs, 
visitedLimit);
+return results != null ? results : NO_RESULTS;
+  }
+
+  @Override
+  VectorScorer createVectorScorer(LeafReaderContext context, FieldInfo fi) 
throws IOException {
+if (fi.getVectorEncoding() != VectorEncoding.BYTE) {
+  return null;
+}
+return VectorScorer.create(context, fi, target);
+  }
+
+  @Override
+  public String toString(String field) {
+return getClass().getSimpleName()
++ ":"
++ this.field
++ "["
++ target.bytes[target.offset]
++ ",...]["
++ k
++ "]";
+  }
+
+  @Override
+  public void visit(QueryVisitor visitor) {
+if (visitor.acceptField(field)) {
+  visitor.visitLeaf(this);
+}
+  }

Review Comment:
   It looks the same in the parent class, maybe it doesn't need to be 
overridden?



##
lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java:
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed

[GitHub] [lucene] jpountz commented on pull request #11997: Add IntField, LongField, FloatField and DoubleField

2022-12-10 Thread GitBox



jpountz commented on PR #11997:
URL: https://github.com/apache/lucene/pull/11997#issuecomment-1345272269

   > What if we always set the docvalue type to SORTED_NUMERIC, even if the 
user is just storing single values? Lucene will optimize the storage if all 
docs are single-valued, and it's already a common pattern to look for this 
optimization on read
   
   This sounds good to me. The main issue I see with this is that our default 
`SortField`s fail if the field is indexed with multi-valued doc values. But 
maybe we could make it more lenient and sort on the min value for ascending 
sorts and max values for descending sorts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gf2121 opened a new pull request, #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

2022-12-10 Thread GitBox



gf2121 opened a new pull request, #12006:
URL: https://github.com/apache/lucene/pull/12006

   In LatLonPointQueries we encode query ints to bytes, and compare bytes by 
decode bytes back to int in `ArrayUtil#compareUnsigned4`. We can directly 
compare ints instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

2022-12-10 Thread GitBox



rmuir commented on PR #12006:
URL: https://github.com/apache/lucene/pull/12006#issuecomment-1345293418

   @gf2121 does this execute faster? I think originally the reason `byte[]` was 
compared, was that it avoided having to "decode" values with 
`NumericUtils.sortableBytesToInt`.
   
   But, maybe this `NumericUtils.sortableBytesToInt` method has become faster 
now that it is implemented with varhandle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gf2121 commented on pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

2022-12-10 Thread GitBox



gf2121 commented on PR #12006:
URL: https://github.com/apache/lucene/pull/12006#issuecomment-1345297296

   Thanks @rmuir for feedback! This is the current implementation of 
`ArrayUtil#compareUnsigned4`:
   
   ```
   public static int compareUnsigned4(byte[] a, int aOffset, byte[] b, int 
bOffset) {
 return Integer.compareUnsigned(
 (int) BitUtil.VH_BE_INT.get(a, aOffset), (int) 
BitUtil.VH_BE_INT.get(b, bOffset));
   }
   ```
   
   I think it's equivalent to
   ```
   public static int compareUnsigned4(byte[] a, int aOffset, byte[] b, int 
bOffset) {
 return Integer.compare(NumericUtils.sortableBytesToInt(a, aOffset), 
NumericUtils.sortableBytesToInt(b, bOffset));
   }
   ```
   
   Because they are both doing a varhandle get and compare unsigned. The point 
of this PR is to avoid the decoding of one side as the query values are 
consistent during the intersect. 
   
   I'll do some JMH benchmark to confirm if there is any speed up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #12003: Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery

2022-12-10 Thread GitBox



gsmiller commented on code in PR #12003:
URL: https://github.com/apache/lucene/pull/12003#discussion_r1045104269


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##
@@ -212,9 +212,9 @@ public boolean isCacheable(LeafReaderContext ctx) {
   @Override
   public int count(LeafReaderContext context) throws IOException {
 if (context.reader().hasDeletions() == false) {
-  BoundedDocIdSetIterator disi = getDocIdSetIteratorOrNull(context);
-  if (disi != null && disi.delegate == null) {
-return disi.lastDoc - disi.firstDoc;
+  DocIdSetIterator disi = getDocIdSetIteratorOrNull(context);
+  if (disi != null && disi instanceof BoundedDocIdSetIterator == 
false) {
+return Math.toIntExact(disi.cost());

Review Comment:
   That's a great point. I'll work on making this less fragile. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gf2121 commented on pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

2022-12-10 Thread GitBox



gf2121 commented on PR #12006:
URL: https://github.com/apache/lucene/pull/12006#issuecomment-1345303453

   Here is the JMH result:
   ```
   Benchmark  Mode  Cnt  Score   Error   
Units
   ReadInts24Benchmark.compareUnsigned4  thrpt   10  0.224 ± 0.003  
ops/us
   ReadInts24Benchmark.numericUtilDecode thrpt   10  0.224 ± 0.006  
ops/us
   ReadInts24Benchmark.numericUtilDecodeOneSide  thrpt   10  0.995 ± 0.006  
ops/us
   ```
   
   * compareUnsigned4 is using `Integer.compareUnsigned((int) 
BitUtil.VH_BE_INT.get(a, aOffset), (int) BitUtil.VH_BE_INT.get(b, bOffset));`
   
   * numericUtilDecode is using 
`Integer.compare(NumericUtils.sortableBytesToInt(a, aOffset), 
NumericUtils.sortableBytesToInt(b, bOffset));`
   
   * numericUtilDecodeOneSide is using `aValue > 
NumericUtils.sortableBytesToInt(b, bOffset)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

2022-12-10 Thread GitBox



rmuir commented on PR #12006:
URL: https://github.com/apache/lucene/pull/12006#issuecomment-1345341579

   makes sense to me, as numericutils is basically just varhandle fetch with an 
xor. thanks for testing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #11901: Github#11869: Add RangeOnRangeFacetCounts

2022-12-10 Thread GitBox



gsmiller commented on code in PR #11901:
URL: https://github.com/apache/lucene/pull/11901#discussion_r1045109686


##
lucene/facet/src/java/org/apache/lucene/facet/rangeonrange/RangeOnRangeFacetCounts.java:
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.rangeonrange;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.BinaryRangeDocValues;
+import org.apache.lucene.document.RangeFieldQuery;
+import org.apache.lucene.facet.FacetCountsWithFilterQuery;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.PriorityQueue;
+
+abstract class RangeOnRangeFacetCounts extends FacetCountsWithFilterQuery {
+
+  private final byte[][] encodedRanges;
+  private final String[] labels;
+  private final int numEncodedValueBytes;
+  private final int dims;
+
+  /** Counts, initialized in by subclass. */
+  protected final int[] counts;
+
+  /** Our field name. */
+  protected final String field;
+
+  /** Total number of hits. */
+  protected int totCount;
+
+  private final ArrayUtil.ByteArrayComparator comparator;
+
+  /** Type of "range overlap" we want to count. */
+  RangeFieldQuery.QueryType queryType;
+
+  protected RangeOnRangeFacetCounts(
+  String field,
+  FacetsCollector hits,
+  RangeFieldQuery.QueryType queryType,
+  Query fastMatchQuery,
+  int numEncodedValueBytes,
+  byte[][] encodedRanges,
+  String[] labels)
+  throws IOException {
+super(fastMatchQuery);
+
+assert encodedRanges.length == labels.length;
+assert encodedRanges[0].length % (2 * numEncodedValueBytes) == 0;
+
+this.encodedRanges = encodedRanges;
+this.field = field;
+this.labels = labels;
+this.numEncodedValueBytes = numEncodedValueBytes;
+this.dims = encodedRanges[0].length / (2 * this.numEncodedValueBytes);
+this.queryType = queryType;
+this.comparator = 
ArrayUtil.getUnsignedComparator(this.numEncodedValueBytes);
+this.counts = new int[encodedRanges.length];
+count(field, hits.getMatchingDocs());
+  }
+
+  /** Counts from the provided field. */
+  protected void count(String field, List 
matchingDocs)
+  throws IOException {
+// TODO: We currently just exhaustively check the ranges in each document 
with every range in
+// the ranges array.
+// We might be able to do something more efficient here by grouping the 
ranges array into a
+// space partitioning
+// data structure of some sort.
+
+int missingCount = 0;
+
+for (FacetsCollector.MatchingDocs hits : matchingDocs) {
+
+  BinaryRangeDocValues binaryRangeDocValues =
+  new BinaryRangeDocValues(
+  DocValues.getBinary(hits.context.reader(), field), dims, 
numEncodedValueBytes);
+
+  final DocIdSetIterator it = createIterator(hits);

Review Comment:
   Forgot to mention earlier, but I love that you've added support for 
fast-match queries. And that it was so straight-forward to do so after that 
functionality was factored out into its own class. Nice!



##
lucene/facet/src/java/org/apache/lucene/facet/rangeonrange/LongRangeOnRangeFacetCounts.java:
##
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or imp

[GitHub] [lucene] gf2121 merged pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

2022-12-10 Thread GitBox



gf2121 merged PR #12006:
URL: https://github.com/apache/lucene/pull/12006


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gf2121 opened a new pull request, #12007: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (Backport 9x)

2022-12-10 Thread GitBox



gf2121 opened a new pull request, #12007:
URL: https://github.com/apache/lucene/pull/12007

   Backport of https://github.com/apache/lucene/pull/12006


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gf2121 merged pull request #12007: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (Backport 9x)

2022-12-10 Thread GitBox



gf2121 merged PR #12007:
URL: https://github.com/apache/lucene/pull/12007


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller opened a new pull request, #12008: Remove unnecessary NaN checks from LongRange#verifyAndEncode

2022-12-10 Thread GitBox



gsmiller opened a new pull request, #12008:
URL: https://github.com/apache/lucene/pull/12008

   ### Description
   
   Minor tweak to logic that appears to have been copy/pasted from 
`DoubleRange`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller merged pull request #12003: Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery

2022-12-10 Thread GitBox



gsmiller merged PR #12003:
URL: https://github.com/apache/lucene/pull/12003


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12008: Remove unnecessary NaN checks from LongRange#verifyAndEncode

2022-12-10 Thread GitBox



rmuir commented on PR #12008:
URL: https://github.com/apache/lucene/pull/12008#issuecomment-1345382186

   I think we should open a followup issue to look into this check. I enabled 
the check (without your patch) to make sure it finds this bug, and it did, and 
it also finds other stuff: 
   [log.txt](https://github.com/apache/lucene/files/10200907/log.txt)
   
   I think if we really intend to cast a long to a double, we can do it 
explicitly.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12003: Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery

2022-12-10 Thread GitBox



gsmiller commented on PR #12003:
URL: https://github.com/apache/lucene/pull/12003#issuecomment-1345428478

   Thanks @jpountz. Merged/backported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on issue #12005: Should we fail retrieving doc values on the wrong type?

2022-12-10 Thread GitBox



rmuir commented on issue #12005:
URL: https://github.com/apache/lucene/issues/12005#issuecomment-1345435529

   Don't we have the `DocValues` class already for this?
   
   If you are writing an algorithm on string docvalues, you just call 
`DocValues.getSortedSet`. 
   * If the user indexed it as single-valued `SORTED`, it gift-wraps it as a 
singleton SortedSet instance
   * If the user indexed it as multi-valued `SORTED_SET`, it returns it
   * If the user indexed it as something else (e.g. number), it throws error.
   
   Then, if you wish, you can specialize the single-valued case by checking 
`DocValues.unwrapSingleton`.
   * If the user indexed it as single-valued `SORTED`, you can process it with 
specialized algorithm.
   * If the user indexed it as multi-valued `SORTED` but no documents actually 
have multiple values, you can process it with specialized algorithm (since 
codecs use this same "gift-wrapping").
   
   I think for stuff like queries this is always the way to go. Maybe just 
javadoc-link it from LeafReader.java?
   
   But IMO leafreader should stay simple and not do such "shenanigans" as 
things like CheckIndex, merge, etc need to work low-level.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] dweiss commented on a diff in pull request #11998: Migrate away from per-segment-per-threadlocals on SegmentReader

[GitHub] [lucene] rmuir commented on a diff in pull request #11998: Migrate away from per-segment-per-threadlocals on SegmentReader

[GitHub] [lucene] jpountz commented on a diff in pull request #12003: Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery

[GitHub] [lucene] jpountz commented on a diff in pull request #11998: Migrate away from per-segment-per-threadlocals on SegmentReader

[GitHub] [lucene] jpountz opened a new issue, #12005: Should we fail retrieving doc values on the wrong type?

[GitHub] [lucene] jpountz commented on a diff in pull request #12004: Move byte vector queries into new KnnByteVectorQuery

[GitHub] [lucene] jpountz commented on pull request #11997: Add IntField, LongField, FloatField and DoubleField

[GitHub] [lucene] gf2121 opened a new pull request, #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

[GitHub] [lucene] rmuir commented on pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

[GitHub] [lucene] gf2121 commented on pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

[GitHub] [lucene] gsmiller commented on a diff in pull request #12003: Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery

[GitHub] [lucene] gf2121 commented on pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

[GitHub] [lucene] rmuir commented on pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

[GitHub] [lucene] gsmiller commented on a diff in pull request #11901: Github#11869: Add RangeOnRangeFacetCounts

[GitHub] [lucene] gf2121 merged pull request #12006: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries

[GitHub] [lucene] gf2121 opened a new pull request, #12007: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (Backport 9x)

[GitHub] [lucene] gf2121 merged pull request #12007: Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (Backport 9x)

[GitHub] [lucene] gsmiller opened a new pull request, #12008: Remove unnecessary NaN checks from LongRange#verifyAndEncode

[GitHub] [lucene] gsmiller merged pull request #12003: Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery

[GitHub] [lucene] rmuir commented on pull request #12008: Remove unnecessary NaN checks from LongRange#verifyAndEncode

[GitHub] [lucene] gsmiller commented on pull request #12003: Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery

[GitHub] [lucene] rmuir commented on issue #12005: Should we fail retrieving doc values on the wrong type?

22 matches

Site Navigation

Mail list logo

Footer information