Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk commented on code in PR #14439: URL: https://github.com/apache/lucene/pull/14439#discussion_r2049741831

## lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:

@@ -115,6 +117,38 @@ public void testSkipIndexWithSortAndLowInterval() throws IOException {
             .setCodec(TestUtil.alwaysDocValuesFormat(new Lucene90DocValuesFormat(3))));
   }

+  public void testMultiRangePointTreeCollector() throws IOException {

Review Comment:
Okay, I ran the test a few times and I understand what the error is.
```
java.lang.AssertionError: expected:<[1=>978, 3=>1004, 4=>1000, 2=>1022, 0=>996]> but was:<[4=>5000, 1=>4890, 2=>5110, 3=>5020, 0=>4980]>
```
The actual values are 5 times the expected ones. Essentially, the logic efficiently collects values for the entire segment, so with intra-segment concurrency it ends up collecting 5 times over. I am wondering if the collectors can be made aware of intra-segment concurrency, and not use the optimized path when it is enabled?
Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk commented on code in PR #14439: URL: https://github.com/apache/lucene/pull/14439#discussion_r2049534867

## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:

@@ -0,0 +1,219 @@
+/* (ASF license header) */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collect(
+      final PointValues pointValues,
+      final long bucketWidth,
+      final LongIntHashMap collectorCounts,
+      final int maxBuckets)
+      throws IOException {
+    // TODO: Do we really need pointValues.getDocCount() == pointValues.size()
+    if (pointValues == null
+        || pointValues.getNumDimensions() != 1
+        || pointValues.getDocCount() != pointValues.size()
+        || ArrayUtil.getValue(pointValues.getBytesPerDimension()) == null) {
+      return false;
+    }
+
+    final Function<byte[], Long> byteToLong =
+        ArrayUtil.getValue(pointValues.getBytesPerDimension());
+    final long minValue = getLongFromByte(byteToLong, pointValues.getMinPackedValue());
+    final long maxValue = getLongFromByte(byteToLong, pointValues.getMaxPackedValue());
+    long leafMinBucket = Math.floorDiv(minValue, bucketWidth);
+    long leafMaxBucket = Math.floorDiv(maxValue, bucketWidth);
+
+    // We want that # leaf nodes is more than # buckets so that we can completely skip over
+    // some of the leaf nodes. Higher this ratio, more efficient it is than naive approach!
+    if ((pointValues.size() / 512) < (leafMaxBucket - leafMinBucket)) {
+      return false;

Review Comment:
The user does not need to get anything. We will just skip the optimized path and fall back to the original execution plan.
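For intuition on the leaf-to-bucket guard above, here is a rough worked example (a sketch only; it assumes the default BKD leaf size of roughly 512 points, which is what the divisor in the quoted code suggests):

```java
public class BucketLeafRatioExample {
  public static void main(String[] args) {
    // Hypothetical numbers, for illustration only.
    long size = 1_000_000; // points in the segment
    long bucketWidth = 1_000;
    long minValue = 0, maxValue = 100_000;

    long leaves = size / 512; // BKD leaves hold ~512 points each
    long buckets =
        Math.floorDiv(maxValue, bucketWidth) - Math.floorDiv(minValue, bucketWidth);

    // ~1953 leaves vs. 100 buckets: many leaves fall entirely inside a single
    // bucket, so their counts can be taken from the point tree without
    // visiting individual points.
    System.out.println("use optimized path: " + (leaves >= buckets));
  }
}
```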
Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk commented on code in PR #14439: URL: https://github.com/apache/lucene/pull/14439#discussion_r2049535183

## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:

@@ -0,0 +1,219 @@
+/* (ASF license header) */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collect(
+      final PointValues pointValues,
+      final long bucketWidth,
+      final LongIntHashMap collectorCounts,
+      final int maxBuckets)
+      throws IOException {
+    // TODO: Do we really need pointValues.getDocCount() == pointValues.size()
+    if (pointValues == null
+        || pointValues.getNumDimensions() != 1
+        || pointValues.getDocCount() != pointValues.size()
+        || ArrayUtil.getValue(pointValues.getBytesPerDimension()) == null) {
+      return false;

Review Comment:
The user does not need to get anything. We will just skip the optimized path and fall back to the original execution plan.
Re: [PR] Upgrade to gradle 8.14-rc-2 [lucene]
dweiss commented on PR #14519: URL: https://github.com/apache/lucene/pull/14519#issuecomment-2814109034

I would like to wait until the official release is made (not rc).
Re: [PR] Speed up advancing within a sparse block in IndexedDISI. [lucene]
github-actions[bot] commented on PR #14371: URL: https://github.com/apache/lucene/pull/14371#issuecomment-2814249617

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Upgrade to gradle 8.14-rc-2 [lucene]
harshavamsi commented on PR #14519: URL: https://github.com/apache/lucene/pull/14519#issuecomment-2814127985

> I would like to wait until the official release is made (not rc).

I'll mark this as draft and will pick it up once 8.14 is fully released.
Re: [PR] Remove sloppySin calculations [lucene]
rmuir commented on PR #14516: URL: https://github.com/apache/lucene/pull/14516#issuecomment-2814266775

It will mostly impact geo sorting. But replacing the fast SloppyMath calculation here with slower equivalents from OpenJDK won't buy you any improved error rate: precision is purposefully truncated to ensure stable sorts: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SloppyMath.java#L73-L74
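The stable-sort point can be illustrated with a small sketch (this is not SloppyMath's actual code; the key name and values are made up): when two documents have distances that differ only below the truncated precision, they compare equal under the sort key on every run, instead of flipping order depending on rounding.

```java
import java.util.Comparator;

public class StableSortKeyExample {
  // Deliberate precision loss: distances that differ only below float
  // precision produce the same key, so ties resolve consistently.
  static float sortKey(double distanceMeters) {
    return (float) distanceMeters;
  }

  public static void main(String[] args) {
    double a = 1234.5678901234;
    double b = 1234.5678901239; // differs from 'a' only below float precision
    Comparator<Double> byKey = Comparator.comparing(StableSortKeyExample::sortKey);
    System.out.println(byKey.compare(a, b)); // 0: a stable tie
  }
}
```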
Re: [I] Tone down TestIndexWriterDelete.testDeleteAllRepeated (OOMs sometimes) [lucene]
dweiss commented on issue #14508: URL: https://github.com/apache/lucene/issues/14508#issuecomment-2812002598

I think it'd be good to set up jenkins the way Robert does: point gradle's temp folders at some tmpfs mount, at least on linux. This would save your disks from wearing out significantly.
Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk commented on code in PR #14439: URL: https://github.com/apache/lucene/pull/14439#discussion_r2050147491

## lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:

@@ -115,6 +117,38 @@ public void testSkipIndexWithSortAndLowInterval() throws IOException {
             .setCodec(TestUtil.alwaysDocValuesFormat(new Lucene90DocValuesFormat(3))));
   }

+  public void testMultiRangePointTreeCollector() throws IOException {

Review Comment:
Okay, I have added logic for collecting only on one of the partitions if intra-segment concurrency is enabled. Ran the test 50k times after that without any failures:
```
> Task :lucene:core:testClasses
> Task :lucene:core:test
> Task :lucene:core:wipeTaskTemp
> Task :lucene:sandbox:test
:lucene:sandbox:test (SUCCESS): 5 test(s)
> Task :lucene:sandbox:wipeTaskTemp

The slowest suites (exceeding 1s) during this run:
  580.41s TestHistogramCollectorManager (:lucene:sandbox)

BUILD SUCCESSFUL in 9m 58s
246 actionable tasks: 181 executed, 65 up-to-date
```
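One way such a guard can look (a sketch only, with invented names; the PR's actual implementation may differ) is to remember which segments have already been bulk-collected, so that exactly one partition of each segment takes the optimized path:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.index.LeafReaderContext;

// With intra-segment concurrency, several collectors share one segment, so an
// optimization that scans the whole segment must run at most once per segment.
class OncePerSegmentGuard {
  private final Set<Integer> bulkCollected = ConcurrentHashMap.newKeySet();

  // Returns true for exactly one caller per segment; 'ord' identifies the
  // segment within the top-level reader.
  boolean tryClaim(LeafReaderContext context) {
    return bulkCollected.add(context.ord);
  }
}
```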
Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk commented on code in PR #14439: URL: https://github.com/apache/lucene/pull/14439#discussion_r2050117622

## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java:

@@ -53,11 +59,22 @@ public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
       // The segment has no values, nothing to do.
       throw new CollectionTerminatedException();
     }
+
+    // We can use multi range traversal logic to collect the histogram on numeric
+    // field indexed as point range query for MATCH_ALL cases
+    final PointValues pointValues = context.reader().getPointValues(field);
+    if (query instanceof MatchAllDocsQuery && !context.reader().hasDeletions()) {

Review Comment:
Yes, I will open a follow-up once we are close to merging this PR. I will keep this conversation open until the issue is created.
Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk commented on code in PR #14439: URL: https://github.com/apache/lucene/pull/14439#discussion_r2050115412

## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java:

@@ -801,4 +802,16 @@ public static int compareUnsigned4(byte[] a, int aOffset, byte[] b, int bOffset) {
     return Integer.compareUnsigned(
         (int) BitUtil.VH_BE_INT.get(a, aOffset), (int) BitUtil.VH_BE_INT.get(b, bOffset));
   }
+
+  public static Function<byte[], Long> getValue(int numBytes) {

Review Comment:
Thanks for suggesting `NumericUtils`. I wasn't aware of `NumericUtils#sortableBytesToLong`, exactly what I needed to flip the sign bit correctly!
```
// Flip the sign bit back
v ^= 0x8000000000000000L;
```
I have removed this method from `ArrayUtil` and added it in `PointTreeBulkCollector`. Please let me know if you still see any concerns.
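For reference, a minimal round-trip with `NumericUtils` (a sketch; the variable names are illustrative):

```java
import org.apache.lucene.util.NumericUtils;

public class SortableBytesExample {
  public static void main(String[] args) {
    long value = -42L;
    byte[] packed = new byte[Long.BYTES];
    NumericUtils.longToSortableBytes(value, packed, 0);

    // Decodes the big-endian sortable encoding, flipping the sign bit back,
    // so negative values round-trip correctly.
    long decoded = NumericUtils.sortableBytesToLong(packed, 0);
    System.out.println(decoded); // -42
  }
}
```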
Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]
vigyasharma commented on issue #14431: URL: https://github.com/apache/lucene/issues/14431#issuecomment-2811827103

> Suppose forceMergeDeletes() returned the MergeSpec

This could be a _"good first issue"_, I'll create a spin-off issue for the same. We can close it if others disagree with the idea.
Re: [PR] Ensuring skip list is read for fields indexed with only DOCS [lucene]
expani commented on code in PR #14511: URL: https://github.com/apache/lucene/pull/14511#discussion_r2047828296

## lucene/core/src/java/org/apache/lucene/codecs/lucene103/Lucene103PostingsReader.java:

@@ -282,6 +288,10 @@ public PostingsEnum postings(
   @Override
   public ImpactsEnum impacts(FieldInfo fieldInfo, BlockTermState state, int flags)
       throws IOException {
+    if (state.docFreq <= BLOCK_SIZE) {
+      // no skip data
+      return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
+    }

Review Comment:
DUMMY_IMPACTS in the PostingsReader is a `List<Impact>` (10.1.0), whereas the DUMMY_IMPACTS returned by SlowImpactsEnum (9.11.1) was an `Impacts`, which would still apply for cases where using the skip list doesn't make sense (`state.docFreq <= BLOCK_SIZE`).

I picked this check from `Lucene99PostingsReader`, where we stored a default impact of freq=1 even if the field was indexed with `IndexOptions.DOCS`:

https://github.com/apache/lucene/blob/main/lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene99/Lucene99PostingsWriter.java#L275
[PR] Use a non-deprecated assertThat, and change several test assertions to use assertThat [lucene]
thecoop opened a new pull request, #14518: URL: https://github.com/apache/lucene/pull/14518

The main reason for this PR is using an un-deprecated `assertThat` by default. I've also updated a few uses of old-style assertions to use `assertThat` and `Matchers` that give a more descriptive error message. There's obviously loads more that could be updated; this is a few to start with.
[PR] Upgrade to gradle 8.14-rc-2 [lucene]
harshavamsi opened a new pull request, #14519: URL: https://github.com/apache/lucene/pull/14519

### Description

I was trying to do a fresh import of Lucene into IntelliJ 2025.1 when it threw

```
java.lang.ClassCastException: class org.codehaus.groovy.runtime.GStringImpl cannot be cast to class java.lang.String (org.codehaus.groovy.runtime.GStringImpl is in unnamed module of loader org.gradle.internal.classloader.VisitableURLClassLoader @2280cdac; java.lang.String is in module java.base of loader 'bootstrap') in ToolingStreamApiUtils.writeCollection(ToolingStreamApiUtils.java:145)

FAILURE: Build failed with an exception.

* What went wrong:
java.io.NotSerializableException: org.gradle.api.internal.file.DefaultFilePropertyFactory$FixedFile
org.gradle.api.internal.file.DefaultFilePropertyFactory$FixedFile
```

The issue looks related to https://github.com/gradle/gradle/issues/32606 and is fixed in gradle 8.14. Raising this PR in case we want to upgrade to 8.14.
Re: [PR] Use a non-deprecated assertThat, and change several test assertions to use assertThat [lucene]
thecoop commented on code in PR #14518: URL: https://github.com/apache/lucene/pull/14518#discussion_r2048959333

## lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:

@@ -810,10 +810,7 @@ public MergeSpecification findMerges(CodecReader... readers) throws IOException {
     }
     for (MergePolicy.OneMerge merge : merges) {
       if (merge.getMergeInfo() != null) {
-        assertFalse(
-            Arrays.stream(c.destDir.listAll())
-                .collect(Collectors.toSet())
-                .containsAll(merge.getMergeInfo().files()));
+        assertFalse(Set.of(c.destDir.listAll()).containsAll(merge.getMergeInfo().files()));

Review Comment:
This doesn't easily translate into a matcher, so I've changed it to use `Set.of` to make it a bit more readable.
Re: [PR] Remove sloppySin calculations [lucene]
jainankitk commented on PR #14516: URL: https://github.com/apache/lucene/pull/14516#issuecomment-2813787215

> I don't agree with the such changes without supporting benchmarks, sorry. Especially it is bad news to just replace 3 functions (sin,cos,asin) all at once in YOLO fashion.

Thanks for raising these concerns. The intention was to just remove the custom `sloppySin`; removing `SloppyMath.cos` and `SloppyMath.asin` was a mistake. Let me work on providing a supporting benchmark. I am wondering if there is something in the `luceneutil` benchmarks that can be used for this purpose?
Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk commented on code in PR #14439: URL: https://github.com/apache/lucene/pull/14439#discussion_r2049533078

## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:

@@ -0,0 +1,219 @@
+/* (ASF license header) */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {

Review Comment:
Do we need explicit `@lucene.experimental` even for classes within sandbox? As per my understanding, everything in sandbox is experimental by default?
Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
stefanvodita commented on code in PR #14439: URL: https://github.com/apache/lucene/pull/14439#discussion_r2048509305

## lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:

@@ -115,6 +117,38 @@ public void testSkipIndexWithSortAndLowInterval() throws IOException {
             .setCodec(TestUtil.alwaysDocValuesFormat(new Lucene90DocValuesFormat(3))));
   }

+  public void testMultiRangePointTreeCollector() throws IOException {

Review Comment:
Let's run this test more? I've checked it out and seen it both pass and fail.

## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:

@@ -0,0 +1,219 @@
+/* (ASF license header) */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collect(
+      final PointValues pointValues,
+      final long bucketWidth,
+      final LongIntHashMap collectorCounts,
+      final int maxBuckets)
+      throws IOException {
+    // TODO: Do we really need pointValues.getDocCount() == pointValues.size()
+    if (pointValues == null
+        || pointValues.getNumDimensions() != 1
+        || pointValues.getDocCount() != pointValues.size()
+        || ArrayUtil.getValue(pointValues.getBytesPerDimension()) == null) {
+      return false;
+    }
+
+    final Function<byte[], Long> byteToLong =
+        ArrayUtil.getValue(pointValues.getBytesPerDimension());

Review Comment:
Should we only call this once?

## lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java:

@@ -115,6 +117,38 @@ public void testSkipIndexWithSortAndLowInterval() throws IOException {
             .setCodec(TestUtil.alwaysDocValuesFormat(new Lucene90DocValuesFormat(3))));
   }

+  public void testMultiRangePointTreeCollector() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig());
+
+    long[] values = new long[5000];
+
+    for (int i = 0; i < values.length; i++) {
+      values[i] = RandomizedTest.between(0, 5000); // Generates a random integer

Review Comment:
Let's use `random()` inherited from `LuceneTestCase` to ensure failures are reproducible.

## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/PointTreeBulkCollector.java:

@@ -0,0 +1,219 @@
+/* (ASF license header) */
+package org.apache.lucene.sandbox.facet.plain.histograms;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.function.Function;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.ArrayUtil;
+
+class PointTreeBulkCollector {
+  static boolean collect(
Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]
kkewwei commented on code in PR #14397: URL: https://github.com/apache/lucene/pull/14397#discussion_r2048995276

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:

@@ -512,6 +512,7 @@ private void doReset(int docID) throws IOException {
       bytes.offset = bytes.length = 0;
       for (int decompressed = 0; decompressed < totalLength; ) {
         final int toDecompress = Math.min(totalLength - decompressed, chunkSize);
+        decompressor.reset();
         decompressor.decompress(fieldsStream, toDecompress, 0, toDecompress, spare);

Review Comment:
I tried but failed in relying only on the outer `reuseIfPossible` to decide whether to cache the `PreSet Dict`. In the following case, the outer code must call `reset` to clear the cache. We have two chunks:
1. chunk0 [doc0(length>0)]
2. chunk1 [doc0(length=0), doc1(length=1)]

Steps are as follows:
1. Read chunk0/doc0, `reuseIfPossible`=false.
2. Read chunk1/doc0, `reuseIfPossible`=false. As the length is 0, lucene will not read the `PreSet Dict`, so the `PreSet Dict` is not cached.
3. Read chunk1/doc1. In this case, doc1 is in the current chunk1, so `reuseIfPossible`=true, but the `PreSet Dict` is not cached yet, and lucene will throw an exception.

In this case, we should call `reset` in step 1.
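A sketch of the invalidation this comment argues for (hypothetical names; not the PR's code): the cached dictionary has to be dropped whenever decompression moves to a different chunk, even if the previous read never loaded a dictionary at all.

```java
import java.util.function.LongFunction;

// Cache the preset dictionary per chunk, and clear it on chunk boundaries,
// including after reads (length == 0) that never loaded a dictionary.
class CachedDictDecompressor {
  private byte[] cachedDict; // preset dict of 'cachedChunk', or null
  private long cachedChunk = -1;

  byte[] dictFor(long chunk, LongFunction<byte[]> loadDict) {
    if (chunk != cachedChunk || cachedDict == null) {
      cachedDict = loadDict.apply(chunk); // (re)load when the cache is stale or empty
      cachedChunk = chunk;
    }
    return cachedDict;
  }

  void reset() { // called on chunk boundaries, as in the diff above
    cachedDict = null;
    cachedChunk = -1;
  }
}
```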
Re: [PR] Logic for collecting Histogram efficiently using Point Trees [lucene]
jainankitk commented on PR #14439: URL: https://github.com/apache/lucene/pull/14439#issuecomment-2810960212

I have updated the PR, and the code flow is like below now:

* `HistogramCollector` overrides `setWeight` for accessing the underlying `Query`
* To keep things simple, just optimizing for the `MATCH_ALL_DOCS` query and no deleted documents for now
* The optimized path is enabled only if a `pointTree` is built for the field
* There are a few other conditions for the optimized path being enabled. Being conservative for now, and falling back to the original otherwise. A rough sketch of the gating follows this list.
* Added a small unit test to verify it works as expected

Will add a few more unit tests once I get some feedback on the code changes. Can also include a small performance unit test that demonstrates that pointTree-based collection is faster than the docValues-based collector.
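Roughly, the gating described in the first bullets can be sketched as below (simplified, with illustrative names; see the HistogramCollector diff quoted earlier in this thread for the real condition):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;

final class HistogramGateSketch {
  // Returns true when the optimized point-tree path may be attempted for this
  // segment; callers fall back to the DocValues collector when it returns false.
  static boolean canUsePointTree(Query query, LeafReaderContext context, String field)
      throws IOException {
    PointValues pointValues = context.reader().getPointValues(field);
    return query instanceof MatchAllDocsQuery
        && context.reader().hasDeletions() == false
        && pointValues != null;
  }
}
```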
Re: [PR] Remove sloppySin calculations [lucene]
rmuir commented on code in PR #14516: URL: https://github.com/apache/lucene/pull/14516#discussion_r2048713697

## lucene/core/src/java/org/apache/lucene/geo/Rectangle.java:

@@ -28,9 +25,6 @@
 import static org.apache.lucene.geo.GeoUtils.MIN_LON_RADIANS;
 import static org.apache.lucene.geo.GeoUtils.checkLatitude;
 import static org.apache.lucene.geo.GeoUtils.checkLongitude;
-import static org.apache.lucene.geo.GeoUtils.sloppySin;
-import static org.apache.lucene.util.SloppyMath.asin;
-import static org.apache.lucene.util.SloppyMath.cos;

Review Comment:
This is also changing asin and cos at the same time.
[PR] Remove sloppySin calculations [lucene]
jainankitk opened a new pull request, #14516: URL: https://github.com/apache/lucene/pull/14516

### Description

Wondering if we really need sloppySin anymore. For modern hardware with JDK 17 on recent processors:

* The performance difference should be negligible
* Modern JVMs heavily optimize Math.sin calls
* The approximation error from sloppySin (~0.1%) is probably not worth the insignificant performance benefit!
Re: [I] Strange stack traces for new bitset focused doc iterators [lucene]
benwtrent commented on issue #14517: URL: https://github.com/apache/lucene/issues/14517#issuecomment-2812913299

Thinking about the first stack trace more, it may actually be because `windowBase > windowMax`.

Following the trace up, we reach `DenseConjunctionBulkScorer#scoreWindow`, where

`int bitsetWindowMax = (int) Math.min(minDocIDRunEnd, (long) WINDOW_SIZE + min);`

Here `bitsetWindowMax` is passed as `windowMax` down the stack, and `min` is passed as the window base. If `minDocIDRunEnd` somehow got set to something that is actually less than `min`, we could end up in this position. Looking at how it's set:

```
for (DisiWrapper w : iterators) {
  int docIdRunEnd = w.docIDRunEnd();
  if (w.docID() > min || docIdRunEnd < minRunEndThreshold) {
    windowApproximations.add(w.approximation());
    if (w.twoPhase() != null) {
      windowTwoPhases.add(w.twoPhase());
    }
  } else {
    minDocIDRunEnd = Math.min(minDocIDRunEnd, docIdRunEnd);
  }
}
```

Possibly `docIDRunEnd()` as returned from one of the iterators is actually < `min` but its `docID() > min`? That seems weird... I will keep reading. Maybe it's further up the stack and the min/max provided to `scoreWindow` is already out of whack...
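The suspected inconsistency is easy to state numerically (illustrative values only; `WINDOW_SIZE` here is just a placeholder constant):

```java
public class WindowInvariantExample {
  static final int WINDOW_SIZE = 4096; // placeholder, mirrors the quoted expression

  public static void main(String[] args) {
    int min = 10_000;           // window base
    int minDocIDRunEnd = 9_500; // hypothetically below 'min'

    int bitsetWindowMax = (int) Math.min(minDocIDRunEnd, (long) WINDOW_SIZE + min);
    // windowMax < windowBase: a negative-length window, which would explain an
    // out-of-bounds failure further down the stack.
    System.out.println(
        "windowBase=" + min + " windowMax=" + bitsetWindowMax
            + " valid=" + (bitsetWindowMax >= min));
  }
}
```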