Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2049741831


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:
##
@@ -115,6 +117,38 @@ public void testSkipIndexWithSortAndLowInterval() throws IOException {
 .setCodec(TestUtil.alwaysDocValuesFormat(new Lucene90DocValuesFormat(3)));
   }
 
+  public void testMultiRangePointTreeCollector() throws IOException {

Review Comment:
   Okay, I ran the test a few times and I understand what the error is.
   
   ```
   java.lang.AssertionError: expected:<[1=>978, 3=>1004, 4=>1000, 2=>1022, 
0=>996]> but was:<[4=>5000, 1=>4890, 2=>5110, 3=>5020, 0=>4980]>
   ```
   
   The actual values are 5 times the expected ones. Essentially, the logic efficiently collects values for the entire segment, so with intra-segment concurrency each of the 5 partitions repeats the whole-segment collection and the counts come out 5 times too high. I am wondering if the collectors can be made aware of intra-segment concurrency and avoid the optimized path when it is enabled?
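To make the failure mode concrete, here is a minimal standalone sketch (illustrative names and values, not code from the PR) of what happens when each of the 5 intra-segment partitions runs the whole-segment bulk collection:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionDoubleCounting {
  // Simulates the bug: each intra-segment partition runs the whole-segment
  // bulk collection, so every bucket count is multiplied by the partition count.
  static Map<Long, Integer> collect(long[] values, long bucketWidth, int partitions) {
    Map<Long, Integer> counts = new HashMap<>();
    for (int p = 0; p < partitions; p++) { // each partition repeats the work
      for (long v : values) {
        counts.merge(Math.floorDiv(v, bucketWidth), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    long[] values = {0, 1, 2, 3, 4}; // one value per bucket with width 1
    Map<Long, Integer> counts = collect(values, 1, 5);
    System.out.println(counts.get(0L)); // 5 instead of the expected 1
  }
}
```

Gating the bulk path so that only one partition runs it (or skipping it entirely when the segment is partitioned) restores the expected counts.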



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2049534867


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:
##
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collect(
+  final PointValues pointValues,
+  final long bucketWidth,
+  final LongIntHashMap collectorCounts,
+  final int maxBuckets)
+  throws IOException {
+// TODO: Do we really need pointValues.getDocCount() == pointValues.size()
+if (pointValues == null
+|| pointValues.getNumDimensions() != 1
+|| pointValues.getDocCount() != pointValues.size()
+|| ArrayUtil.getValue(pointValues.getBytesPerDimension()) == null) {
+  return false;
+}
+
+final Function<byte[], Long> byteToLong =
+ArrayUtil.getValue(pointValues.getBytesPerDimension());
+final long minValue = getLongFromByte(byteToLong, pointValues.getMinPackedValue());
+final long maxValue = getLongFromByte(byteToLong, pointValues.getMaxPackedValue());
+long leafMinBucket = Math.floorDiv(minValue, bucketWidth);
+long leafMaxBucket = Math.floorDiv(maxValue, bucketWidth);
+
+// We want the # of leaf nodes to be more than the # of buckets so that we can
+// completely skip over some of the leaf nodes. The higher this ratio, the more
+// efficient this is than the naive approach!
+if ((pointValues.size() / 512) < (leafMaxBucket - leafMinBucket)) {
+  return false;

Review Comment:
   The user does not need to do anything. We will just skip the optimized path and fall back to the original execution plan.
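As a back-of-envelope illustration of the threshold in the diff above (assuming roughly 512 points per BKD leaf, with hypothetical segment sizes):

```java
public class LeafSkipHeuristic {
  // Whether the optimized point-tree path is worthwhile: with ~512 points per
  // BKD leaf, size / 512 approximates the leaf count, and the bulk path only
  // helps when there are more leaves than buckets, i.e. some leaves fall
  // entirely inside a single bucket and can be counted without visiting docs.
  static boolean useOptimizedPath(long size, long minValue, long maxValue, long bucketWidth) {
    long leafMinBucket = Math.floorDiv(minValue, bucketWidth);
    long leafMaxBucket = Math.floorDiv(maxValue, bucketWidth);
    return (size / 512) >= (leafMaxBucket - leafMinBucket);
  }

  public static void main(String[] args) {
    // 1M points in ~1953 leaves vs. 100 buckets: many leaves sit entirely
    // inside one bucket, so the optimized path pays off.
    System.out.println(useOptimizedPath(1_000_000, 0, 100_000, 1_000)); // true
    // 10k points in ~19 leaves vs. 100 buckets: fall back to the naive path.
    System.out.println(useOptimizedPath(10_000, 0, 100_000, 1_000)); // false
  }
}
```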






Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2049535183


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:
##
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collect(
+  final PointValues pointValues,
+  final long bucketWidth,
+  final LongIntHashMap collectorCounts,
+  final int maxBuckets)
+  throws IOException {
+// TODO: Do we really need pointValues.getDocCount() == pointValues.size()
+if (pointValues == null
+|| pointValues.getNumDimensions() != 1
+|| pointValues.getDocCount() != pointValues.size()
+|| ArrayUtil.getValue(pointValues.getBytesPerDimension()) == null) {
+  return false;

Review Comment:
   The user does not need to do anything. We will just skip the optimized path and fall back to the original execution plan.






Re: [PR] Upgrade to gradle 8.14-rc-2 [lucene]

2025-04-17 Thread via GitHub


dweiss commented on PR #14519:
URL: https://github.com/apache/lucene/pull/14519#issuecomment-2814109034

   I would like to wait until the official release is made (not rc).





Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]

2025-04-17 Thread via GitHub


github-actions[bot] commented on PR #14371:
URL: https://github.com/apache/lucene/pull/14371#issuecomment-2814249617

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Upgrade to gradle 8.14-rc-2 [lucene]

2025-04-17 Thread via GitHub


harshavamsi commented on PR #14519:
URL: https://github.com/apache/lucene/pull/14519#issuecomment-2814127985

   > I would like to wait until the official release is made (not rc).
   
   I'll mark this for draft and will pick up once 8.14 is fully released.





Re: [PR] Remove sloppySin calculations [lucene]

2025-04-17 Thread via GitHub


rmuir commented on PR #14516:
URL: https://github.com/apache/lucene/pull/14516#issuecomment-2814266775

   It will mostly impact geo sorting.
   
   But replacing the fast SloppyMath calculation here with slower equivalents 
from OpenJDK won't buy you any improved error rate: precision is purposefully 
truncated to ensure stable sorts: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SloppyMath.java#L73-L74
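A generic sketch of the stability technique being described (not the actual SloppyMath code): clearing low-order mantissa bits yields sort keys that compare identically even when two trig implementations differ by a few ulps.

```java
public class TruncatedSortKey {
  // Zero out the low bitsToClear mantissa bits of a double so that values
  // differing only by tiny floating-point noise map to the same sort key.
  static double truncate(double x, int bitsToClear) {
    long mask = -1L << bitsToClear;
    return Double.longBitsToDouble(Double.doubleToRawLongBits(x) & mask);
  }

  public static void main(String[] args) {
    double a = 0.1 + 0.2; // 0.30000000000000004, one ulp away from 0.3
    double b = 0.3;
    System.out.println(a == b);                             // false
    System.out.println(truncate(a, 16) == truncate(b, 16)); // true
  }
}
```

This is why swapping in a more precise sin/cos would not change the final sort keys: the extra precision is deliberately discarded.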
   
   





Re: [I] Tone down TestIndexWriterDelete.testDeleteAllRepeated (OOMs sometimes) [lucene]

2025-04-17 Thread via GitHub


dweiss commented on issue #14508:
URL: https://github.com/apache/lucene/issues/14508#issuecomment-2812002598

   I think it'd be good to set up Jenkins the way Robert does: point Gradle's temp folders at some tmpfs mount, at least on Linux. This would spare your disks from significant wear.





Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2050147491


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:
##
@@ -115,6 +117,38 @@ public void testSkipIndexWithSortAndLowInterval() throws IOException {
 .setCodec(TestUtil.alwaysDocValuesFormat(new Lucene90DocValuesFormat(3)));
   }
 
+  public void testMultiRangePointTreeCollector() throws IOException {

Review Comment:
   Okay, I have added logic to collect only for one of the partitions when intra-segment concurrency is enabled. I ran the test 50k times after that without any failures:
   
   ```
   > Task :lucene:core:testClasses
   > Task :lucene:core:test
   > Task :lucene:core:wipeTaskTemp
   
   > Task :lucene:sandbox:test
   :lucene:sandbox:test (SUCCESS): 5 test(s)
   
   > Task :lucene:sandbox:wipeTaskTemp
   The slowest suites (exceeding 1s) during this run:
 580.41s TestHistogramCollectorManager (:lucene:sandbox)
   
   BUILD SUCCESSFUL in 9m 58s
   246 actionable tasks: 181 executed, 65 up-to-date
   ```






Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2050117622


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java:
##
@@ -53,11 +59,22 @@ public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
   // The segment has no values, nothing to do.
   throw new CollectionTerminatedException();
 }
+
+// We can use multi range traversal logic to collect the histogram on numeric
+// field indexed as point range query for MATCH_ALL cases
+final PointValues pointValues = context.reader().getPointValues(field);
+if (query instanceof MatchAllDocsQuery && !context.reader().hasDeletions()) {

Review Comment:
   Yes, I will open a follow-up once we are close to merging this PR. I will keep this conversation open until the issue is created.






Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2050115412


##
lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java:
##
@@ -801,4 +802,16 @@ public static int compareUnsigned4(byte[] a, int aOffset, byte[] b, int bOffset)
 return Integer.compareUnsigned(
 (int) BitUtil.VH_BE_INT.get(a, aOffset), (int) BitUtil.VH_BE_INT.get(b, bOffset));
   }
+
+  public static Function<byte[], Long> getValue(int numBytes) {

Review Comment:
   Thanks for suggesting `NumericUtils`. I wasn't aware of `NumericUtils#sortableBytesToLong`; it is exactly what I needed to flip the sign bit correctly!
   
   ```
   // Flip the sign bit back
   v ^= 0x8000000000000000L;
   ```
   
   I have removed this method from `ArrayUtil` and added it to `PointTreeBulkCollector`. Please let me know if you still see any concerns.
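For readers following along, the decoding under discussion can be sketched as a simplified stand-in for `NumericUtils#sortableBytesToLong` (big-endian read, then flip the sign bit back), paired with its encoder:

```java
public class SortableBytes {
  // Encode: flip the sign bit so signed longs sort correctly as unsigned bytes.
  static void longToSortableBytes(long value, byte[] out, int off) {
    value ^= 0x8000000000000000L; // flip sign bit
    for (int i = 7; i >= 0; i--) {
      out[off + i] = (byte) value;
      value >>>= 8;
    }
  }

  // Decode: read big-endian bytes, then flip the sign bit back.
  static long sortableBytesToLong(byte[] in, int off) {
    long v = 0;
    for (int i = 0; i < 8; i++) {
      v = (v << 8) | (in[off + i] & 0xFFL);
    }
    return v ^ 0x8000000000000000L; // flip the sign bit back
  }

  public static void main(String[] args) {
    byte[] buf = new byte[8];
    longToSortableBytes(-42L, buf, 0);
    System.out.println(sortableBytesToLong(buf, 0)); // -42
  }
}
```

Without the sign-bit flip, negative longs would sort after positive ones in the byte-wise unsigned comparison that the points index relies on.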






Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]

2025-04-17 Thread via GitHub


vigyasharma commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2811827103

   > Suppose forceMergeDeletes() returned the MergeSpec
   
   This could be a _"good first issue"_; I'll create a spin-off issue for it. We can close it if others disagree with the idea.





Re: [PR] Ensuring skip list is read for fields indexed with only DOCS [lucene]

2025-04-17 Thread via GitHub


expani commented on code in PR #14511:
URL: https://github.com/apache/lucene/pull/14511#discussion_r2047828296


##
lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:
##
@@ -282,6 +288,10 @@ public PostingsEnum postings(
   @Override
   public ImpactsEnum impacts(FieldInfo fieldInfo, BlockTermState state, int flags)
   throws IOException {
+if (state.docFreq <= BLOCK_SIZE) {
+  // no skip data
+  return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
+}

Review Comment:
   `DUMMY_IMPACTS` in the postings reader is a `List<Impact>` (10.1.0),
   whereas
   the `DUMMY_IMPACTS` returned by `SlowImpactsEnum` (9.11.1) was an `Impacts`, which would still apply for cases where using the skip list doesn't make sense (`state.docFreq <= BLOCK_SIZE`).
   
   I picked this check from `Lucene99PostingsReader`, where we stored a default impact of freq=1 even if the field was indexed with `IndexOptions.DOCS`:
   https://github.com/apache/lucene/blob/main/lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene99/Lucene99PostingsWriter.java#L275









[PR] Use a non-deprecated assertThat, and change several test assertions to use assertThat [lucene]

2025-04-17 Thread via GitHub


thecoop opened a new pull request, #14518:
URL: https://github.com/apache/lucene/pull/14518

   The main reason for this PR is using a non-deprecated `assertThat` by default. I've also updated a few uses of old-style assertions to use `assertThat` and `Matchers`, which give a more descriptive error message. There's obviously loads more that could be updated; this is a few to start with.





[PR] Upgrade to gradle 8.14-rc-2 [lucene]

2025-04-17 Thread via GitHub


harshavamsi opened a new pull request, #14519:
URL: https://github.com/apache/lucene/pull/14519

   ### Description
   
   
   
   I was trying to do a fresh import of Lucene into IntelliJ 2025.1 when it threw
   
   ```
   java.lang.ClassCastException: class org.codehaus.groovy.runtime.GStringImpl 
cannot be cast to class java.lang.String 
(org.codehaus.groovy.runtime.GStringImpl is in unnamed module of loader 
org.gradle.internal.classloader.VisitableURLClassLoader @2280cdac; 
java.lang.String is in module java.base of loader 'bootstrap')
   in ToolingStreamApiUtils.writeCollection(ToolingStreamApiUtils.java:145)
   FAILURE: Build failed with an exception.
   * What went wrong:
   java.io.NotSerializableException: 
org.gradle.api.internal.file.DefaultFilePropertyFactory$FixedFile
   org.gradle.api.internal.file.DefaultFilePropertyFactory$FixedFile
   ```
   
   Looks like the issue is related to https://github.com/gradle/gradle/issues/32606 and is fixed in Gradle 8.14. Raising this PR in case we want to upgrade to 8.14.





Re: [PR] Use a non-deprecated assertThat, and change several test assertions to use assertThat [lucene]

2025-04-17 Thread via GitHub


thecoop commented on code in PR #14518:
URL: https://github.com/apache/lucene/pull/14518#discussion_r2048959333


##
lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:
##
@@ -810,10 +810,7 @@ public MergeSpecification findMerges(CodecReader... 
readers) throws IOException
 }
 for (MergePolicy.OneMerge merge : merges) {
   if (merge.getMergeInfo() != null) {
-assertFalse(
-Arrays.stream(c.destDir.listAll())
-.collect(Collectors.toSet())
-.containsAll(merge.getMergeInfo().files()));
+assertFalse(Set.of(c.destDir.listAll()).containsAll(merge.getMergeInfo().files()));

Review Comment:
   This doesn't easily translate into a matcher, so I've changed it to use `Set.of` to make it a bit more readable.






Re: [PR] Remove sloppySin calculations [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on PR #14516:
URL: https://github.com/apache/lucene/pull/14516#issuecomment-2813787215

   > I don't agree with the such changes without supporting benchmarks, sorry. 
Especially it is bad news to just replace 3 functions (sin,cos,asin) all at 
once in YOLO fashion.
   
   Thanks for raising these concerns. The intention was to just remove the custom `sloppySin`. Removing `SloppyMath.cos` and `SloppyMath.asin` was a mistake. Let me work on providing supporting benchmarks. I am wondering if there is something in the `LuceneUtil` benchmarks that can be used for this purpose?





Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2049533078


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:
##
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {

Review Comment:
   Do we need an explicit `@lucene.experimental` even for classes within sandbox? As per my understanding, everything in sandbox is experimental by default?






Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


stefanvodita commented on code in PR #14439:
URL: https://github.com/apache/lucene/pull/14439#discussion_r2048509305


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:
##
@@ -115,6 +117,38 @@ public void testSkipIndexWithSortAndLowInterval() throws IOException {
 .setCodec(TestUtil.alwaysDocValuesFormat(new Lucene90DocValuesFormat(3)));
   }
 
+  public void testMultiRangePointTreeCollector() throws IOException {

Review Comment:
   Let's run this test more? I've checked it out and seen it both pass and fail.



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:
##
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collect(
+  final PointValues pointValues,
+  final long bucketWidth,
+  final LongIntHashMap collectorCounts,
+  final int maxBuckets)
+  throws IOException {
+// TODO: Do we really need pointValues.getDocCount() == pointValues.size()
+if (pointValues == null
+|| pointValues.getNumDimensions() != 1
+|| pointValues.getDocCount() != pointValues.size()
+|| ArrayUtil.getValue(pointValues.getBytesPerDimension()) == null) {
+  return false;
+}
+
+final Function<byte[], Long> byteToLong =
+ArrayUtil.getValue(pointValues.getBytesPerDimension());

Review Comment:
   Should we only call this once?



##
lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:
##
@@ -115,6 +117,38 @@ public void testSkipIndexWithSortAndLowInterval() throws IOException {
 .setCodec(TestUtil.alwaysDocValuesFormat(new Lucene90DocValuesFormat(3)));
   }
 
+  public void testMultiRangePointTreeCollector() throws IOException {
+Directory dir = newDirectory();
+IndexWriter w = new IndexWriter(dir, new IndexWriterConfig());
+
+long[] values = new long[5000];
+
+for (int i = 0; i < values.length; i++) {
+  values[i] = RandomizedTest.between(0, 5000); // Generates a random integer

Review Comment:
   Let's use `random()` inherited from `LuceneTestCase` to ensure failures are reproducible.
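The reproducibility point can be illustrated outside the test framework (the seed value here is hypothetical; `LuceneTestCase#random()` derives its seed from the test run so that failures can be replayed):

```java
import java.util.Arrays;
import java.util.Random;

public class SeededData {
  // Same seed -> same data, so a failing run can be reproduced exactly.
  static long[] generate(long seed, int n, int bound) {
    Random random = new Random(seed);
    long[] values = new long[n];
    for (int i = 0; i < n; i++) {
      values[i] = random.nextInt(bound + 1); // value in [0, bound]
    }
    return values;
  }

  public static void main(String[] args) {
    long[] a = generate(42L, 5, 5000);
    long[] b = generate(42L, 5, 5000);
    System.out.println(Arrays.equals(a, b)); // true
  }
}
```

An unseeded source such as `RandomizedTest.between` outside the framework's control would make the failing data unrecoverable.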



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:
##
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collec

Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]

2025-04-17 Thread via GitHub


kkewwei commented on code in PR #14397:
URL: https://github.com/apache/lucene/pull/14397#discussion_r2048995276


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:
##
@@ -512,6 +512,7 @@ private void doReset(int docID) throws IOException {
   bytes.offset = bytes.length = 0;
   for (int decompressed = 0; decompressed < totalLength; ) {
 final int toDecompress = Math.min(totalLength - decompressed, 
chunkSize);
+decompressor.reset();
 decompressor.decompress(fieldsStream, toDecompress, 0, 
toDecompress, spare);

Review Comment:
   I tried but failed in just relying on the outer `reuseIfPossible` to decide 
whether to cache the PreSet Dict. In the following case, the outer code must 
call `reset` to clear the cache. We have two chunks:
   1. chunk0 [doc0(length>0)]
   2. chunk1 [doc0(length=0), doc1(length=1)]
   
   Steps are as follows:
   1. Reading chunk0/doc0: `reuseIfPossible`=false.
   2. Reading chunk1/doc0: `reuseIfPossible`=false. As the length is 0, lucene 
will not read the PreSet Dict, so the PreSet Dict is not cached.
   3. Reading chunk1/doc1: doc1 is in the current chunk1, so 
`reuseIfPossible`=true, but the PreSet Dict is not cached, and lucene will 
throw an exception.
   
   In this case, we should call `reset` in step 1.
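   The chunk-switch scenario above can be reduced to a minimal sketch. This is 
not the Lucene API — `PresetDictCacheDemo`, its methods, and the byte arrays 
are all hypothetical — it only shows why a zero-length read leaves a stale 
cached dict in place until `reset()` is called on the chunk switch:

```java
// Hypothetical sketch of a decompressor that caches a per-chunk preset dict.
public class PresetDictCacheDemo {
    private byte[] cachedDict; // null means "no dict cached for current chunk"

    // Must be called on every chunk switch to invalidate the cache.
    public void reset() { cachedDict = null; }

    // Stand-in for decompress(): loads and caches the dict on the first
    // non-empty read; a zero-length read never touches the dict at all.
    public byte[] decompress(byte[] chunkDict, int length) {
        if (length == 0) {
            return new byte[0]; // dict not loaded for this read
        }
        if (cachedDict == null) {
            cachedDict = chunkDict;
        }
        return cachedDict; // stand-in for "decompress using cached dict"
    }

    public static void main(String[] args) {
        PresetDictCacheDemo d = new PresetDictCacheDemo();
        d.decompress(new byte[] {42}, 1); // chunk0/doc0: caches chunk0's dict
        // chunk switch WITHOUT reset() — reproduces the scenario above:
        d.decompress(new byte[] {7}, 0);  // chunk1/doc0: length 0, dict untouched
        byte[] got = d.decompress(new byte[] {7}, 1); // chunk1/doc1
        System.out.println(got[0]); // prints 42: stale chunk0 dict reused
    }
}
```

   Calling `reset()` between the two chunks makes the third read load chunk1's 
dict instead of reusing chunk0's.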
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]

2025-04-17 Thread via GitHub


jainankitk commented on PR #14439:
URL: https://github.com/apache/lucene/pull/14439#issuecomment-2810960212

   I have updated the PR, and the code flow is now as follows:
   
   * `HistogramCollector` overrides `setWeight` for accessing the 
underlying `Query`
   * To keep things simple, just optimizing for the `MATCH_ALL_DOCS` query and no 
deleted documents for now
   * The optimized path is enabled only if a `pointTree` is built for the field
   * There are a few other conditions for the optimized path to be enabled; being 
conservative for now, we fall back to the original path otherwise
   * Added a small unit test to verify it works as expected
   
   Will add a few more unit tests once I get some feedback on the code changes. 
Can also include a small performance unit test that demonstrates that pointTree-based 
collection is faster than the docValues-based collector.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove sloppySin calculations [lucene]

2025-04-17 Thread via GitHub


rmuir commented on code in PR #14516:
URL: https://github.com/apache/lucene/pull/14516#discussion_r2048713697


##
lucene/core/src/java/org/apache/lucene/geo/Rectangle.java:
##
@@ -28,9 +25,6 @@
 import static org.apache.lucene.geo.GeoUtils.MIN_LON_RADIANS;
 import static org.apache.lucene.geo.GeoUtils.checkLatitude;
 import static org.apache.lucene.geo.GeoUtils.checkLongitude;
-import static org.apache.lucene.geo.GeoUtils.sloppySin;
-import static org.apache.lucene.util.SloppyMath.asin;
-import static org.apache.lucene.util.SloppyMath.cos;

Review Comment:
   This is also changing asin and cos at the same time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Remove sloppySin calculations [lucene]

2025-04-17 Thread via GitHub


jainankitk opened a new pull request, #14516:
URL: https://github.com/apache/lucene/pull/14516

   ### Description
   
   Wondering if we really need sloppySin anymore. On modern hardware with JDK 
17 and recent processors:
   * The performance difference should be negligible
   * Modern JVMs heavily optimize Math.sin calls
   * The approximation error from sloppySin (~0.1%) is probably not worth the 
insignificant performance benefit!
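   To make the error claim concrete, here is a self-contained sketch that 
measures the worst-case error of a classic "sloppy" sine against `Math.sin`. 
Bhaskara I's rational approximation is used purely as a stand-in — Lucene's 
`SloppyMath` implementation is table-based and different — the point is only 
the measurement methodology:

```java
public class SinApproxError {
    // Bhaskara I's approximation of sin(x), valid for x in [0, PI].
    // Illustrative stand-in for a "sloppy" sin; NOT Lucene's implementation.
    public static double sloppySin(double x) {
        double a = x * (Math.PI - x);
        return 16 * a / (5 * Math.PI * Math.PI - 4 * a);
    }

    // Scans [0, PI] on a fine grid and returns the max absolute error.
    public static double maxError() {
        double max = 0;
        for (int i = 0; i <= 100_000; i++) {
            double x = Math.PI * i / 100_000;
            max = Math.max(max, Math.abs(sloppySin(x) - Math.sin(x)));
        }
        return max;
    }

    public static void main(String[] args) {
        // Bhaskara's known worst case is roughly 1.6e-3 (~0.16%)
        System.out.printf("max abs error: %.5f%n", maxError());
    }
}
```

   An error on the order of 1e-3 is visible in geo distance calculations, which 
is part of why trading it away for a marginal speedup is questionable.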
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensuring skip list is read for fields indexed with only DOCS [lucene]

2025-04-17 Thread via GitHub


expani commented on code in PR #14511:
URL: https://github.com/apache/lucene/pull/14511#discussion_r2047828296


##
lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:
##
@@ -282,6 +288,10 @@ public PostingsEnum postings(
   @Override
   public ImpactsEnum impacts(FieldInfo fieldInfo, BlockTermState state, int 
flags)
   throws IOException {
+if (state.docFreq <= BLOCK_SIZE) {
+  // no skip data
+  return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
+}

Review Comment:
   `DUMMY_IMPACTS` in the PostingsReader was a `List<Impact>`, whereas the 
DUMMY_IMPACTS returned by `SlowImpactsEnum` is an `Impacts`, which would still 
apply for cases where using the skip list doesn't make sense 
(`state.docFreq <= BLOCK_SIZE`).
   
   I picked this check from `Lucene99PostingsReader`, where we stored a default 
impact of freq=1 even if the field was indexed with `IndexOptions.DOCS`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Strange stack traces for new bitset focused doc iterators [lucene]

2025-04-17 Thread via GitHub


benwtrent commented on issue #14517:
URL: https://github.com/apache/lucene/issues/14517#issuecomment-2812913299

   Thinking about the first stack trace more, it may actually be because 
`windowBase > windowMax`. Following the trace up, we reach 
`DenseConjunctionBulkScorer#scoreWindow`, where `int bitsetWindowMax = (int) 
Math.min(minDocIDRunEnd, (long) WINDOW_SIZE + min);`
   
   Here `bitsetWindowMax` is passed as `windowMax` down the stack, and `min` is 
passed as the window base. 
   
   If `minDocIDRunEnd` somehow got set to something that is actually less than 
`min`, we could end up in this position.
   
   Looking at how its set:
   
   ```
   for (DisiWrapper w : iterators) {
 int docIdRunEnd = w.docIDRunEnd();
 if (w.docID() > min || docIdRunEnd < minRunEndThreshold) {
   windowApproximations.add(w.approximation());
   if (w.twoPhase() != null) {
 windowTwoPhases.add(w.twoPhase());
   }
 } else {
   minDocIDRunEnd = Math.min(minDocIDRunEnd, docIdRunEnd);
 }
   }
   ```
   Possibly `docIDRunEnd()` as returned from one of the iterators is actually 
< `min` but its `docID() > min`? That seems weird...
   
   I will keep reading. Maybe it's further up the stack and the min/max provided 
to `scoreWindow` is already out of whack...
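   The suspected invariant violation can be shown in isolation. A minimal 
sketch of the `bitsetWindowMax` computation quoted above (the constant and 
method here are hypothetical stand-ins for the real 
`DenseConjunctionBulkScorer` code):

```java
public class WindowInvariantDemo {
    static final int WINDOW_SIZE = 4096; // stand-in for the scorer's constant

    // Mirrors: int bitsetWindowMax = (int) Math.min(minDocIDRunEnd,
    //                                               (long) WINDOW_SIZE + min);
    public static int bitsetWindowMax(int min, int minDocIDRunEnd) {
        return (int) Math.min(minDocIDRunEnd, (long) WINDOW_SIZE + min);
    }

    public static void main(String[] args) {
        // Healthy case: run end beyond min => a non-empty window.
        System.out.println(bitsetWindowMax(10_000, 20_000)); // prints 14096

        // Pathological case matching the stack trace: if minDocIDRunEnd ever
        // drops below min, windowMax ends up below windowBase.
        int max = bitsetWindowMax(10_000, 9_000);
        System.out.println(max < 10_000); // prints "true"
    }
}
```

   So any path that lets `minDocIDRunEnd` get below `min` would produce a 
window with `windowBase > windowMax` and trip the downstream bitset range code.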


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org