Re: [PR] Add target search concurrency to TieredMergePolicy [lucene]
jpountz commented on code in PR #13430: URL: https://github.com/apache/lucene/pull/13430#discussion_r1680544831

## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:

@@ -1095,6 +1095,12 @@ public static TieredMergePolicy newTieredMergePolicy(Random r) {
   } else {
     tmp.setSegmentsPerTier(TestUtil.nextInt(r, 10, 50));
   }
+  if (rarely(r)) {
+    tmp.setTargetSearchConcurrency(TestUtil.nextInt(r, 2, 20));
+  } else {
+    tmp.setTargetSearchConcurrency(TestUtil.nextInt(r, 10, 50));

Review Comment: I would have expected the opposite: that we commonly set a reasonable target search concurrency, and rarely set a high value?
Re: [PR] Ensure to use IOContext.READONCE when reading segment files [lucene]
ChrisHegarty merged PR #13574: URL: https://github.com/apache/lucene/pull/13574
Re: [PR] Compute facets while collecting [lucene]
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1680705815 ## lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/TestRangeFacet.java: ## @@ -0,0 +1,1654 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.sandbox.facet; + +import static org.apache.lucene.facet.FacetsConfig.DEFAULT_INDEX_FIELD_NAME; + +import com.carrotsearch.randomizedtesting.generators.RandomNumbers; +import java.io.IOException; +import java.util.List; +import org.apache.lucene.document.Document; +import org.apache.lucene.document.DoubleDocValuesField; +import org.apache.lucene.document.DoublePoint; +import org.apache.lucene.document.LongPoint; +import org.apache.lucene.document.NumericDocValuesField; +import org.apache.lucene.document.SortedNumericDocValuesField; +import org.apache.lucene.facet.DrillDownQuery; +import org.apache.lucene.facet.DrillSideways; +import org.apache.lucene.facet.FacetField; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.FacetsConfig; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.facet.MultiDoubleValuesSource; +import org.apache.lucene.facet.MultiLongValuesSource; +import org.apache.lucene.facet.range.DoubleRange; +import org.apache.lucene.facet.range.LongRange; +import org.apache.lucene.facet.range.Range; +import org.apache.lucene.facet.taxonomy.TaxonomyReader; +import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader; +import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.sandbox.facet.abstracts.OrdLabelBiMap; +import org.apache.lucene.sandbox.facet.ranges.DoubleRangeFacetCutter; +import org.apache.lucene.sandbox.facet.ranges.LongRangeFacetCutter; +import org.apache.lucene.sandbox.facet.ranges.RangeOrdLabelBiMap; +import org.apache.lucene.sandbox.facet.recorders.CountFacetRecorder; +import org.apache.lucene.sandbox.facet.taxonomy.TaxonomyFacetsCutter; +import org.apache.lucene.search.CollectorOwner; +import org.apache.lucene.search.DoubleValues; +import org.apache.lucene.search.DoubleValuesSource; +import org.apache.lucene.search.Explanation; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.LongValuesSource; +import org.apache.lucene.search.MatchAllDocsQuery; +import org.apache.lucene.search.MultiCollectorManager; +import org.apache.lucene.store.Directory; +import org.apache.lucene.tests.index.RandomIndexWriter; +import org.apache.lucene.tests.search.DummyTotalHitCountCollector; +import org.apache.lucene.tests.util.TestUtil; +import org.apache.lucene.util.IOUtils; 
+import org.apache.lucene.util.NumericUtils; + +/** + * Test sandbox facet ranges. Mostly test cases from LongRangeFacetCounts adopted for sandbox + * faceting. + */ +public class TestRangeFacet extends SandboxFacetTestCase { + + public void testBasicLong() throws Exception { +Directory d = newDirectory(); +RandomIndexWriter w = new RandomIndexWriter(random(), d); +Document doc = new Document(); +NumericDocValuesField field = new NumericDocValuesField("field", 0L); +doc.add(field); +for (long l = 0; l < 100; l++) { + field.setLongValue(l); + w.addDocument(doc); +} + +// Also add Long.MAX_VALUE +field.setLongValue(Long.MAX_VALUE); +w.addDocument(doc); + +IndexReader r = w.getReader(); +w.close(); + +IndexSearcher s = newSearcher(r); +LongRange[] inputRanges = +new LongRange[] { + new LongRange("less than 10", 0L, true, 10L, false), + new LongRange("less than or equal to 10", 0L, true, 10L, true), + new LongRange("over 90", 90L, false, 100L, false), + new LongRange("90 or above", 90L, true, 100L, false), + new LongRange("over 1000", 1000L, false, Long.MAX_VALUE, true), +}; + +MultiLongValuesSource valuesSource = MultiLongValuesSource.fromLongField("field"); +LongRangeFacetCutter longRangeFacetCutter = +LongRangeFacetCutter.create("field", valuesSource, inputR
Re: [PR] [9.x] Ensure to use IOContext.READONCE when reading segment files [lucene]
uschindler commented on code in PR #13578: URL: https://github.com/apache/lucene/pull/13578#discussion_r1680744068

## lucene/test-framework/src/java/org/apache/lucene/tests/store/MockDirectoryWrapper.java:

@@ -814,6 +814,14 @@ public synchronized IndexInput openInput(String name, IOContext context) throws
   context = LuceneTestCase.newIOContext(randomState, context);
   final boolean confined = context == IOContext.READONCE;

Review Comment: I think in 9.x this must be `context.readOnce`.
Re: [PR] [9.x] Ensure to use IOContext.READONCE when reading segment files [lucene]
uschindler commented on code in PR #13578: URL: https://github.com/apache/lucene/pull/13578#discussion_r1680744068

## lucene/test-framework/src/java/org/apache/lucene/tests/store/MockDirectoryWrapper.java:

@@ -814,6 +814,14 @@ public synchronized IndexInput openInput(String name, IOContext context) throws
   context = LuceneTestCase.newIOContext(randomState, context);
   final boolean confined = context == IOContext.READONCE;

Review Comment: I think in 9.x this must be `final boolean confined = context.readOnce`.
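A minimal sketch of the distinction discussed above, assuming the 9.x `IOContext` still exposes a public `readOnce` flag while main canonicalizes `IOContext.READONCE` to a single instance; the class and method names are illustrative, not the merged code:

```java
import org.apache.lucene.store.IOContext;

final class ConfinedCheckSketch {
  // main branch: READONCE is a canonical constant, so an identity check is safe
  static boolean confinedOnMain(IOContext context) {
    return context == IOContext.READONCE;
  }

  // 9.x: equivalent contexts may be distinct objects, so test the readOnce flag instead
  static boolean confinedOn9x(IOContext context) {
    return context.readOnce;
  }
}
```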
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1680751732 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java: ## @@ -1792,61 +1794,88 @@ public DocValuesSkipper getSkipper(FieldInfo field) throws IOException { if (input.length() > 0) { input.prefetch(0, 1); } +// TODO: should we write to disk the actual max level for this segment? return new DocValuesSkipper() { - int minDocID = -1; - int maxDocID = -1; - long minValue, maxValue; - int docCount; + final int[] minDocID = new int[SKIP_INDEX_MAX_LEVEL]; + final int[] maxDocID = new int[SKIP_INDEX_MAX_LEVEL]; + + { +for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) { + minDocID[i] = maxDocID[i] = -1; +} + } + + final long[] minValue = new long[SKIP_INDEX_MAX_LEVEL]; + final long[] maxValue = new long[SKIP_INDEX_MAX_LEVEL]; + final int[] docCount = new int[SKIP_INDEX_MAX_LEVEL]; + int levels; @Override public void advance(int target) throws IOException { if (target > entry.maxDocId) { - minDocID = DocIdSetIterator.NO_MORE_DOCS; - maxDocID = DocIdSetIterator.NO_MORE_DOCS; + // skipper is exhausted + for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) { +minDocID[i] = maxDocID[i] = DocIdSetIterator.NO_MORE_DOCS; + } } else { + // find next interval + assert target > maxDocID[0] : "target must be bigger that current interval"; while (true) { -maxDocID = input.readInt(); -if (maxDocID >= target) { - minDocID = input.readInt(); - maxValue = input.readLong(); - minValue = input.readLong(); - docCount = input.readInt(); +levels = input.readByte(); Review Comment: I modified the implementation so we will have access to higher levels at any point. Let me know what you think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Compute facets while collecting [lucene]
stefanvodita commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1680778354 ## lucene/demo/src/java/org/apache/lucene/demo/facet/SandboxFacetsExample.java: ## @@ -0,0 +1,714 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.demo.facet; + +import static org.apache.lucene.facet.FacetsConfig.DEFAULT_INDEX_FIELD_NAME; +import static org.apache.lucene.sandbox.facet.ComparableUtils.rankCountOrdToComparable; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.analysis.core.WhitespaceAnalyzer; +import org.apache.lucene.document.Document; +import org.apache.lucene.document.DoubleDocValuesField; +import org.apache.lucene.document.NumericDocValuesField; +import org.apache.lucene.facet.DrillDownQuery; +import org.apache.lucene.facet.DrillSideways; +import org.apache.lucene.facet.FacetField; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.FacetsConfig; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.facet.MultiLongValuesSource; +import org.apache.lucene.facet.range.LongRange; +import org.apache.lucene.facet.taxonomy.FacetLabel; +import org.apache.lucene.facet.taxonomy.TaxonomyReader; +import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader; +import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.IndexWriterConfig.OpenMode; +import org.apache.lucene.sandbox.facet.ComparableUtils; +import org.apache.lucene.sandbox.facet.FacetFieldCollector; +import org.apache.lucene.sandbox.facet.FacetFieldCollectorManager; +import org.apache.lucene.sandbox.facet.abstracts.OrdLabelBiMap; +import org.apache.lucene.sandbox.facet.abstracts.OrdToComparable; +import org.apache.lucene.sandbox.facet.abstracts.OrdinalIterator; +import org.apache.lucene.sandbox.facet.abstracts.Reducer; +import org.apache.lucene.sandbox.facet.ordinal_iterators.TopnOrdinalIterator; +import org.apache.lucene.sandbox.facet.ranges.LongRangeFacetCutter; +import org.apache.lucene.sandbox.facet.ranges.RangeOrdLabelBiMap; +import org.apache.lucene.sandbox.facet.recorders.CountFacetRecorder; +import org.apache.lucene.sandbox.facet.recorders.LongAggregationsFacetRecorder; +import org.apache.lucene.sandbox.facet.recorders.MultiFacetsRecorder; +import org.apache.lucene.sandbox.facet.taxonomy.TaxonomyChildrenOrdinalIterator; +import org.apache.lucene.sandbox.facet.taxonomy.TaxonomyFacetsCutter; +import org.apache.lucene.sandbox.facet.taxonomy.TaxonomyOrdLabelBiMap; +import 
org.apache.lucene.search.CollectorOwner; +import org.apache.lucene.search.DoubleValuesSource; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.LongValuesSource; +import org.apache.lucene.search.MatchAllDocsQuery; +import org.apache.lucene.search.MultiCollectorManager; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.search.TopScoreDocCollectorManager; +import org.apache.lucene.store.ByteBuffersDirectory; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.IOUtils; + +/** Demo for sandbox faceting. */ +public class SandboxFacetsExample {

Review Comment: There are other places as well where we use the `SandboxFacet` name. For something descriptive, how about `MatchTimeFacet`? For something cute, how about thinking around precious stones? I like [`Lustre`](https://en.wikipedia.org/wiki/Lustre_(mineralogy)). If you want something in between cute and technical, [`Facetron`](https://facetron.com/thefacetron/) is a faceting machine.
Re: [PR] Add target search concurrency to TieredMergePolicy [lucene]
carlosdelest commented on code in PR #13430: URL: https://github.com/apache/lucene/pull/13430#discussion_r1680829569

## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:

@@ -1095,6 +1095,12 @@ public static TieredMergePolicy newTieredMergePolicy(Random r) {
   } else {
     tmp.setSegmentsPerTier(TestUtil.nextInt(r, 10, 50));
   }
+  if (rarely(r)) {
+    tmp.setTargetSearchConcurrency(TestUtil.nextInt(r, 2, 20));
+  } else {
+    tmp.setTargetSearchConcurrency(TestUtil.nextInt(r, 10, 50));

Review Comment: I see - I was mimicking the segments per tier approach here, but I guess that makes sense. Changing it
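For reference, a minimal sketch of what the swapped randomization could look like (usually a modest target search concurrency, rarely a high one); the exact bounds are illustrative guesses, not the committed values:

```java
import java.util.Random;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.tests.util.TestUtil;

final class TargetSearchConcurrencySketch {
  // Swapped branches as suggested above: the common case picks a modest value,
  // the rare case a high one. The bounds 1..8 and 10..50 are assumptions.
  static void configure(TieredMergePolicy tmp, Random r, boolean rare) {
    if (rare) {
      tmp.setTargetSearchConcurrency(TestUtil.nextInt(r, 10, 50));
    } else {
      tmp.setTargetSearchConcurrency(TestUtil.nextInt(r, 1, 8));
    }
  }
}
```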
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
rmuir commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1680841649

## lucene/core/build.gradle:

@@ -14,10 +14,43 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
+plugins {
+  id "c"
+}
 apply plugin: 'java-library'
+apply plugin: 'c'
 description = 'Lucene core library'
+model {
+  toolChains {
+    gcc(Gcc) {
+      target("linux_aarch64"){
+        cppCompiler.withArguments { args ->
+          args << "-O3 --shared"

Review Comment: @goankur at a glance it seems you are ignoring the return value from it. I think it should be `accum = vdotq_s32(accum, v1, v2)`
Re: [I] Significant drop in recall for 8 bit Scalar Quantizer [lucene]
benwtrent commented on issue #13519: URL: https://github.com/apache/lucene/issues/13519#issuecomment-2233036266

> In any case, you have worked with the code far longer than I have, so if you are confident about it please go ahead and commit :)

@naveentatikonda could you test my branch as well? I want to make sure my numbers were not flukes on my end. We should also test with different data sets & different similarity metrics. We should verify that euclidean is also fixed, etc.
Re: [PR] [9.x] Ensure to use IOContext.READONCE when reading segment files [lucene]
ChrisHegarty commented on code in PR #13578: URL: https://github.com/apache/lucene/pull/13578#discussion_r1680862628

## lucene/test-framework/src/java/org/apache/lucene/tests/store/MockDirectoryWrapper.java:

@@ -814,6 +814,14 @@ public synchronized IndexInput openInput(String name, IOContext context) throws
   context = LuceneTestCase.newIOContext(randomState, context);
   final boolean confined = context == IOContext.READONCE;

Review Comment: The failure is because of a use of `READ` rather than `READONCE` when checksumming a segments file in `SegmentInfos::readCommit`. So, the new check in the MockDirectoryWrapper is working as expected - finding a real problem. Orthogonally, we could expand the check as you suggested, but it is not strictly necessary here, unless you think that it should be done anyway?
Re: [PR] [9.x] Ensure to use IOContext.READONCE when reading segment files [lucene]
uschindler commented on code in PR #13578: URL: https://github.com/apache/lucene/pull/13578#discussion_r1680884653

## lucene/test-framework/src/java/org/apache/lucene/tests/store/MockDirectoryWrapper.java:

@@ -814,6 +814,14 @@ public synchronized IndexInput openInput(String name, IOContext context) throws
   context = LuceneTestCase.newIOContext(randomState, context);
   final boolean confined = context == IOContext.READONCE;

Review Comment: Ah nice. We have removed the IOContext in checksum IndexInput in main branch. 👍 I think the suggested change here is not needed. It's legacy code anyways.
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
ChrisHegarty commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233089190

> Can someone take a stab at summarizing this for CHANGES.txt, thus avoiding details a reader is unlikely to know about? Like me :-). How will a user of Lucene benefit?

```
* GITHUB#13570, GITHUB#13574, GITHUB#13535: Avoid performance degradation with closing shared Arenas.
  Index files are mmapped one-to-one with the JDK's foreign shared Arena. The JVM deoptimizes the top
  few frames of all threads when closing a shared Arena (see JDK-8335480). We mitigate this situation
  by 1) using a confined Arena where appropriate, and 2) grouping files from the same segment to a
  single shared Arena.
```
Re: [PR] Add target search concurrency to TieredMergePolicy [lucene]
jpountz merged PR #13430: URL: https://github.com/apache/lucene/pull/13430
Re: [I] Add a MergePolicy wrapper that preserves search concurrency? [lucene]
jpountz closed issue #12877: Add a MergePolicy wrapper that preserves search concurrency? URL: https://github.com/apache/lucene/issues/12877
Re: [PR] Compute facets while collecting [lucene]
mikemccand commented on PR #13568: URL: https://github.com/apache/lucene/pull/13568#issuecomment-2233193543

> Computing facets during collection is more expensive, because we need to collect in each searcher slice, and then merge results in `CollectorManager#reduce`. At the same time, it reduces latency, because initial collection is done in parallel and there is an opportunity to reuse values computed for a document and use for the doc for all the different collectors (drillsideways).

+1, this is a nice side effect of this approach. It would resolve https://github.com/apache/lucene/issues/13503.
Re: [PR] [9.x] Ensure to use IOContext.READONCE when reading segment files [lucene]
uschindler commented on code in PR #13578: URL: https://github.com/apache/lucene/pull/13578#discussion_r1681009450

## lucene/test-framework/src/java/org/apache/lucene/tests/store/MockDirectoryWrapper.java:

@@ -814,6 +814,14 @@ public synchronized IndexInput openInput(String name, IOContext context) throws
   context = LuceneTestCase.newIOContext(randomState, context);
   final boolean confined = context == IOContext.READONCE;

Review Comment: I think we should maybe still make the check like this. This would cover more problems.
Re: [PR] Add levels to DocValues skipper index [lucene]
jpountz commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1680897686 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java: ## @@ -207,65 +210,133 @@ void accumulate(long value) { maxValue = Math.max(maxValue, value); } +void accumulate(SkipAccumulator other) { + minDocID = Math.min(minDocID, other.minDocID); + maxDocID = Math.max(maxDocID, other.maxDocID); + minValue = Math.min(minValue, other.minValue); + maxValue = Math.max(maxValue, other.maxValue); + docCount += other.docCount; +} + void nextDoc(int docID) { maxDocID = docID; ++docCount; } -void writeTo(DataOutput output) throws IOException { - output.writeInt(maxDocID); - output.writeInt(minDocID); - output.writeLong(maxValue); - output.writeLong(minValue); - output.writeInt(docCount); +public static SkipAccumulator merge(List list, int index, int length) { + SkipAccumulator acc = new SkipAccumulator(list.get(index).minDocID); + for (int i = 0; i < length; i++) { +acc.accumulate(list.get(index + i)); + } + return acc; } } private void writeSkipIndex(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { assert field.hasDocValuesSkipIndex(); -// TODO: This disk compression once we introduce levels -long start = data.getFilePointer(); -SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); +final long start = data.getFilePointer(); +final SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); long globalMaxValue = Long.MIN_VALUE; long globalMinValue = Long.MAX_VALUE; int globalDocCount = 0; int maxDocId = -1; +final List> accumulators = new ArrayList<>(SKIP_INDEX_MAX_LEVEL); +for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) { + accumulators.add(new ArrayList<>()); +} SkipAccumulator accumulator = null; -int counter = 0; +final int maxAccumulators = 1 << (SKIP_INDEX_LEVEL_SHIFT * (SKIP_INDEX_MAX_LEVEL - 1)); for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - if (counter == 0) { + if (accumulator == null) { accumulator = new SkipAccumulator(doc); +accumulators.get(0).add(accumulator); } accumulator.nextDoc(doc); for (int i = 0, end = values.docValueCount(); i < end; ++i) { accumulator.accumulate(values.nextValue()); } - if (++counter == skipIndexIntervalSize) { + if (accumulator.docCount == skipIndexIntervalSize) { globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); globalMinValue = Math.min(globalMinValue, accumulator.minValue); globalDocCount += accumulator.docCount; maxDocId = accumulator.maxDocID; -accumulator.writeTo(data); -counter = 0; +accumulator = null; +if (accumulators.size() == maxAccumulators) { + writeLevels(accumulators); + for (List accumulatorList : accumulators) { +accumulatorList.clear(); + } +} } } -if (counter > 0) { - globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); - globalMinValue = Math.min(globalMinValue, accumulator.minValue); - globalDocCount += accumulator.docCount; - maxDocId = accumulator.maxDocID; - accumulator.writeTo(data); +if (accumulators.isEmpty() == false) { + if (accumulator != null) { +globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); +globalMinValue = Math.min(globalMinValue, accumulator.minValue); +globalDocCount += accumulator.docCount; +maxDocId = accumulator.maxDocID; + } + writeLevels(accumulators); } meta.writeLong(start); // record the start in meta meta.writeLong(data.getFilePointer() - start); // record the length +assert globalDocCount == 0 || globalMaxValue >= 
globalMinValue; meta.writeLong(globalMaxValue); meta.writeLong(globalMinValue); +assert globalDocCount <= maxDocId + 1; meta.writeInt(globalDocCount); meta.writeInt(maxDocId); } + private void writeLevels(List> accumulators) throws IOException { +for (int i = 1; i < accumulators.size(); i++) { + buildLevel(accumulators.get(i), accumulators.get(i - 1)); +} +int totalAccumulators = accumulators.get(0).size(); +for (int index = 0; index < totalAccumulators; index++) { + // compute how many levels we need to write for the current accumulator + final int levels = getLevels(index, totalAccumulators); + // build the levels + final SkipAccumulator[] accLevels = new SkipAccumulator[levels]; + for (int level = 0; level < levels; level++) { +accLevels[level] = +accumulators.get(level).get(index
Re: [PR] [9.x] Ensure to use IOContext.READONCE when reading segment files [lucene]
ChrisHegarty commented on code in PR #13578: URL: https://github.com/apache/lucene/pull/13578#discussion_r1681020288

## lucene/test-framework/src/java/org/apache/lucene/tests/store/MockDirectoryWrapper.java:

@@ -814,6 +814,14 @@ public synchronized IndexInput openInput(String name, IOContext context) throws
   context = LuceneTestCase.newIOContext(randomState, context);
   final boolean confined = context == IOContext.READONCE;

Review Comment: yes, I agree. Done.
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
ChrisHegarty commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1681055574

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java:

@@ -125,4 +134,31 @@ private final MemorySegment[] map(
     }
     return segments;
   }
+
+  public ConcurrentHashMap attachment() {
+    return new ConcurrentHashMap<>();
+  }
+
+  /**
+   * Gets an arena for the give path, potentially aggregating files from the same segment into a
+   * single ref counted shared arena. A ref counted shared arena, if created will be added to the
+   * given arenas map.
+   */
+  static Arena getSharedArena(Path p, ConcurrentHashMap arenas) {
+    String filename = p.getFileName().toString();
+    String segmentName = IndexFileNames.parseSegmentName(filename);
+    if (filename.length() == segmentName.length()) {
+      // no segment found; just use a shared segment
+      return Arena.ofShared();

Review Comment: Ok, https://github.com/apache/lucene/pull/13574 is merged, and we have enforcement in the tests to disallow opening `segment_*` with a context other than `READONCE`. Unless I'm mistaken, then this concern has been addressed.
Re: [PR] [9.x] Ensure to use IOContext.READONCE when reading segment files [lucene]
ChrisHegarty merged PR #13578: URL: https://github.com/apache/lucene/pull/13578
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681076840 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java: ## @@ -207,65 +210,133 @@ void accumulate(long value) { maxValue = Math.max(maxValue, value); } +void accumulate(SkipAccumulator other) { + minDocID = Math.min(minDocID, other.minDocID); + maxDocID = Math.max(maxDocID, other.maxDocID); + minValue = Math.min(minValue, other.minValue); + maxValue = Math.max(maxValue, other.maxValue); + docCount += other.docCount; +} + void nextDoc(int docID) { maxDocID = docID; ++docCount; } -void writeTo(DataOutput output) throws IOException { - output.writeInt(maxDocID); - output.writeInt(minDocID); - output.writeLong(maxValue); - output.writeLong(minValue); - output.writeInt(docCount); +public static SkipAccumulator merge(List list, int index, int length) { + SkipAccumulator acc = new SkipAccumulator(list.get(index).minDocID); + for (int i = 0; i < length; i++) { +acc.accumulate(list.get(index + i)); + } + return acc; } } private void writeSkipIndex(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { assert field.hasDocValuesSkipIndex(); -// TODO: This disk compression once we introduce levels -long start = data.getFilePointer(); -SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); +final long start = data.getFilePointer(); +final SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); long globalMaxValue = Long.MIN_VALUE; long globalMinValue = Long.MAX_VALUE; int globalDocCount = 0; int maxDocId = -1; +final List> accumulators = new ArrayList<>(SKIP_INDEX_MAX_LEVEL); +for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) { + accumulators.add(new ArrayList<>()); +} SkipAccumulator accumulator = null; -int counter = 0; +final int maxAccumulators = 1 << (SKIP_INDEX_LEVEL_SHIFT * (SKIP_INDEX_MAX_LEVEL - 1)); for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - if (counter == 0) { + if (accumulator == null) { accumulator = new SkipAccumulator(doc); +accumulators.get(0).add(accumulator); } accumulator.nextDoc(doc); for (int i = 0, end = values.docValueCount(); i < end; ++i) { accumulator.accumulate(values.nextValue()); } - if (++counter == skipIndexIntervalSize) { + if (accumulator.docCount == skipIndexIntervalSize) { globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); globalMinValue = Math.min(globalMinValue, accumulator.minValue); globalDocCount += accumulator.docCount; maxDocId = accumulator.maxDocID; -accumulator.writeTo(data); -counter = 0; +accumulator = null; +if (accumulators.size() == maxAccumulators) { Review Comment: I removed the List of List so this is not an issue any more -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681079162

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java:

@@ -207,65 +210,133 @@ void accumulate(long value) {
   maxValue = Math.max(maxValue, value);
 }
+void accumulate(SkipAccumulator other) {
+  minDocID = Math.min(minDocID, other.minDocID);
+  maxDocID = Math.max(maxDocID, other.maxDocID);

Review Comment: Done. The assert needs to be `minDocID <= other.minDocId && maxDocId < other.maxDocID` because we initialize the accumulator with the minDocID.
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681080290 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java: ## @@ -207,65 +210,133 @@ void accumulate(long value) { maxValue = Math.max(maxValue, value); } +void accumulate(SkipAccumulator other) { + minDocID = Math.min(minDocID, other.minDocID); + maxDocID = Math.max(maxDocID, other.maxDocID); + minValue = Math.min(minValue, other.minValue); + maxValue = Math.max(maxValue, other.maxValue); + docCount += other.docCount; +} + void nextDoc(int docID) { maxDocID = docID; ++docCount; } -void writeTo(DataOutput output) throws IOException { - output.writeInt(maxDocID); - output.writeInt(minDocID); - output.writeLong(maxValue); - output.writeLong(minValue); - output.writeInt(docCount); +public static SkipAccumulator merge(List list, int index, int length) { + SkipAccumulator acc = new SkipAccumulator(list.get(index).minDocID); + for (int i = 0; i < length; i++) { +acc.accumulate(list.get(index + i)); + } + return acc; } } private void writeSkipIndex(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { assert field.hasDocValuesSkipIndex(); -// TODO: This disk compression once we introduce levels -long start = data.getFilePointer(); -SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); +final long start = data.getFilePointer(); +final SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); long globalMaxValue = Long.MIN_VALUE; long globalMinValue = Long.MAX_VALUE; int globalDocCount = 0; int maxDocId = -1; +final List> accumulators = new ArrayList<>(SKIP_INDEX_MAX_LEVEL); Review Comment: yes, this list of list is awful. I was thinking in reusing ti but it creates more issues that it helps. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681080747 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java: ## @@ -207,65 +210,133 @@ void accumulate(long value) { maxValue = Math.max(maxValue, value); } +void accumulate(SkipAccumulator other) { + minDocID = Math.min(minDocID, other.minDocID); + maxDocID = Math.max(maxDocID, other.maxDocID); + minValue = Math.min(minValue, other.minValue); + maxValue = Math.max(maxValue, other.maxValue); + docCount += other.docCount; +} + void nextDoc(int docID) { maxDocID = docID; ++docCount; } -void writeTo(DataOutput output) throws IOException { - output.writeInt(maxDocID); - output.writeInt(minDocID); - output.writeLong(maxValue); - output.writeLong(minValue); - output.writeInt(docCount); +public static SkipAccumulator merge(List list, int index, int length) { + SkipAccumulator acc = new SkipAccumulator(list.get(index).minDocID); + for (int i = 0; i < length; i++) { +acc.accumulate(list.get(index + i)); + } + return acc; } } private void writeSkipIndex(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { assert field.hasDocValuesSkipIndex(); -// TODO: This disk compression once we introduce levels -long start = data.getFilePointer(); -SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); +final long start = data.getFilePointer(); +final SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); long globalMaxValue = Long.MIN_VALUE; long globalMinValue = Long.MAX_VALUE; int globalDocCount = 0; int maxDocId = -1; +final List> accumulators = new ArrayList<>(SKIP_INDEX_MAX_LEVEL); +for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) { + accumulators.add(new ArrayList<>()); +} SkipAccumulator accumulator = null; -int counter = 0; +final int maxAccumulators = 1 << (SKIP_INDEX_LEVEL_SHIFT * (SKIP_INDEX_MAX_LEVEL - 1)); for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - if (counter == 0) { + if (accumulator == null) { accumulator = new SkipAccumulator(doc); +accumulators.get(0).add(accumulator); } accumulator.nextDoc(doc); for (int i = 0, end = values.docValueCount(); i < end; ++i) { accumulator.accumulate(values.nextValue()); } - if (++counter == skipIndexIntervalSize) { + if (accumulator.docCount == skipIndexIntervalSize) { globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); globalMinValue = Math.min(globalMinValue, accumulator.minValue); globalDocCount += accumulator.docCount; maxDocId = accumulator.maxDocID; -accumulator.writeTo(data); -counter = 0; +accumulator = null; +if (accumulators.size() == maxAccumulators) { + writeLevels(accumulators); + for (List accumulatorList : accumulators) { +accumulatorList.clear(); + } +} } } -if (counter > 0) { - globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); - globalMinValue = Math.min(globalMinValue, accumulator.minValue); - globalDocCount += accumulator.docCount; - maxDocId = accumulator.maxDocID; - accumulator.writeTo(data); +if (accumulators.isEmpty() == false) { + if (accumulator != null) { +globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); +globalMinValue = Math.min(globalMinValue, accumulator.minValue); +globalDocCount += accumulator.docCount; +maxDocId = accumulator.maxDocID; + } + writeLevels(accumulators); } meta.writeLong(start); // record the start in meta meta.writeLong(data.getFilePointer() - start); // record the length +assert globalDocCount == 0 || globalMaxValue >= 
globalMinValue; meta.writeLong(globalMaxValue); meta.writeLong(globalMinValue); +assert globalDocCount <= maxDocId + 1; meta.writeInt(globalDocCount); meta.writeInt(maxDocId); } + private void writeLevels(List> accumulators) throws IOException { +for (int i = 1; i < accumulators.size(); i++) { + buildLevel(accumulators.get(i), accumulators.get(i - 1)); +} +int totalAccumulators = accumulators.get(0).size(); +for (int index = 0; index < totalAccumulators; index++) { + // compute how many levels we need to write for the current accumulator + final int levels = getLevels(index, totalAccumulators); + // build the levels + final SkipAccumulator[] accLevels = new SkipAccumulator[levels]; + for (int level = 0; level < levels; level++) { +accLevels[level] = +accumulators.get(level).get(index
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681081567 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java: ## @@ -207,65 +210,133 @@ void accumulate(long value) { maxValue = Math.max(maxValue, value); } +void accumulate(SkipAccumulator other) { + minDocID = Math.min(minDocID, other.minDocID); + maxDocID = Math.max(maxDocID, other.maxDocID); + minValue = Math.min(minValue, other.minValue); + maxValue = Math.max(maxValue, other.maxValue); + docCount += other.docCount; +} + void nextDoc(int docID) { maxDocID = docID; ++docCount; } -void writeTo(DataOutput output) throws IOException { - output.writeInt(maxDocID); - output.writeInt(minDocID); - output.writeLong(maxValue); - output.writeLong(minValue); - output.writeInt(docCount); +public static SkipAccumulator merge(List list, int index, int length) { + SkipAccumulator acc = new SkipAccumulator(list.get(index).minDocID); + for (int i = 0; i < length; i++) { +acc.accumulate(list.get(index + i)); + } + return acc; } } private void writeSkipIndex(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { assert field.hasDocValuesSkipIndex(); -// TODO: This disk compression once we introduce levels -long start = data.getFilePointer(); -SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); +final long start = data.getFilePointer(); +final SortedNumericDocValues values = valuesProducer.getSortedNumeric(field); long globalMaxValue = Long.MIN_VALUE; long globalMinValue = Long.MAX_VALUE; int globalDocCount = 0; int maxDocId = -1; +final List> accumulators = new ArrayList<>(SKIP_INDEX_MAX_LEVEL); +for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) { + accumulators.add(new ArrayList<>()); +} SkipAccumulator accumulator = null; -int counter = 0; +final int maxAccumulators = 1 << (SKIP_INDEX_LEVEL_SHIFT * (SKIP_INDEX_MAX_LEVEL - 1)); for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - if (counter == 0) { + if (accumulator == null) { accumulator = new SkipAccumulator(doc); +accumulators.get(0).add(accumulator); } accumulator.nextDoc(doc); for (int i = 0, end = values.docValueCount(); i < end; ++i) { accumulator.accumulate(values.nextValue()); } - if (++counter == skipIndexIntervalSize) { + if (accumulator.docCount == skipIndexIntervalSize) { globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); globalMinValue = Math.min(globalMinValue, accumulator.minValue); globalDocCount += accumulator.docCount; maxDocId = accumulator.maxDocID; -accumulator.writeTo(data); -counter = 0; +accumulator = null; +if (accumulators.size() == maxAccumulators) { + writeLevels(accumulators); + for (List accumulatorList : accumulators) { +accumulatorList.clear(); + } +} } } -if (counter > 0) { - globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); - globalMinValue = Math.min(globalMinValue, accumulator.minValue); - globalDocCount += accumulator.docCount; - maxDocId = accumulator.maxDocID; - accumulator.writeTo(data); +if (accumulators.isEmpty() == false) { + if (accumulator != null) { +globalMaxValue = Math.max(globalMaxValue, accumulator.maxValue); +globalMinValue = Math.min(globalMinValue, accumulator.minValue); +globalDocCount += accumulator.docCount; +maxDocId = accumulator.maxDocID; + } + writeLevels(accumulators); } meta.writeLong(start); // record the start in meta meta.writeLong(data.getFilePointer() - start); // record the length +assert globalDocCount == 0 || globalMaxValue >= 
globalMinValue; meta.writeLong(globalMaxValue); meta.writeLong(globalMinValue); +assert globalDocCount <= maxDocId + 1; meta.writeInt(globalDocCount); meta.writeInt(maxDocId); } + private void writeLevels(List> accumulators) throws IOException { +for (int i = 1; i < accumulators.size(); i++) { + buildLevel(accumulators.get(i), accumulators.get(i - 1)); +} +int totalAccumulators = accumulators.get(0).size(); +for (int index = 0; index < totalAccumulators; index++) { + // compute how many levels we need to write for the current accumulator + final int levels = getLevels(index, totalAccumulators); + // build the levels + final SkipAccumulator[] accLevels = new SkipAccumulator[levels]; Review Comment: yes, I did not realize when I modified the way we were writing the levels -- This is an automated message
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681082271 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java: ## @@ -1792,61 +1794,91 @@ public DocValuesSkipper getSkipper(FieldInfo field) throws IOException { if (input.length() > 0) { input.prefetch(0, 1); } +// TODO: should we write to disk the actual max level for this segment? return new DocValuesSkipper() { - int minDocID = -1; - int maxDocID = -1; - long minValue, maxValue; - int docCount; + final int[] minDocID = new int[SKIP_INDEX_MAX_LEVEL]; + final int[] maxDocID = new int[SKIP_INDEX_MAX_LEVEL]; + + { +for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) { + minDocID[i] = maxDocID[i] = -1; +} + } + + final long[] minValue = new long[SKIP_INDEX_MAX_LEVEL]; + final long[] maxValue = new long[SKIP_INDEX_MAX_LEVEL]; + final int[] docCount = new int[SKIP_INDEX_MAX_LEVEL]; + int levels; @Override public void advance(int target) throws IOException { if (target > entry.maxDocId) { - minDocID = DocIdSetIterator.NO_MORE_DOCS; - maxDocID = DocIdSetIterator.NO_MORE_DOCS; + // skipper is exhausted + for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) { +minDocID[i] = maxDocID[i] = DocIdSetIterator.NO_MORE_DOCS; + } } else { + // find next interval + assert target > maxDocID[0] : "target must be bigger that current interval"; while (true) { -maxDocID = input.readInt(); -if (maxDocID >= target) { - minDocID = input.readInt(); - maxValue = input.readLong(); - minValue = input.readLong(); - docCount = input.readInt(); +levels = input.readByte(); +assert levels <= SKIP_INDEX_MAX_LEVEL && levels > 0 +: "level out of range [" + levels + "]"; +boolean competitive = true; +// check if current interval is competitive or we can jump to the next position +for (int level = levels - 1; level >= 0; level--) { + if ((maxDocID[level] = input.readInt()) < target) { +input.skipBytes(SKIP_INDEX_JUMP_LENGTH_PER_LEVEL[level]); // the jump for the level +competitive = false; +break; + } + minDocID[level] = input.readInt(); + maxValue[level] = input.readLong(); + minValue[level] = input.readLong(); + docCount[level] = input.readInt(); +} +if (competitive) { + // adjust levels + while (levels < SKIP_INDEX_MAX_LEVEL) { +if (maxDocID[levels] == -1 || maxDocID[levels] < target) { Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681083206 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java: ## @@ -1194,24 +1194,27 @@ public int numLevels() { @Override public int minDocID(int level) { assertThread("Doc values skipper", creationThread); - Objects.checkIndex(level, numLevels()); int minDocID = in.minDocID(level); assert minDocID <= in.maxDocID(level); - if (level > 0) { -assert minDocID <= in.minDocID(level - 1); + if (minDocID != -1 && minDocID != DocIdSetIterator.NO_MORE_DOCS) { +Objects.checkIndex(level, numLevels()); + } + for (int i = 0; i < level; i++) { +assert minDocID >= in.minDocID(i); Review Comment: exactly -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681096883

## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:

@@ -3301,17 +3301,17 @@ private static void checkDocValueSkipper(FieldInfo fi, DocValuesSkipper skipper)
       if (skipper.maxDocID(0) == NO_MORE_DOCS) {
         break;
       }
+      if (skipper.minDocID(0) < doc) {
+        throw new CheckIndexException(
+            "skipper dv iterator for field: "
+                + fieldName
+                + " reports wrong minDocID, got "
+                + skipper.minDocID(0)
+                + " < "
+                + doc);
+      }
       int levels = skipper.numLevels();
       for (int level = 0; level < levels; level++) {
-        if (skipper.minDocID(level) < doc) {
-          throw new CheckIndexException(
-              "skipper dv iterator for field: "
-                  + fieldName
-                  + " reports wrong minDocID, got "
-                  + skipper.minDocID(level)
-                  + " < "
-                  + doc);
-        }

Review Comment: The current condition is `skipper.minDocID(level) < doc` where doc is the doc provided to advance. So the provided doc needs to be lower than or equal to the minDocID not to trigger the error. It checks that we have advanced to the closest interval forward. This check is strange, as it is only valid for implementations with exact bounds and it is clearly not valid for levels above 0.
Re: [PR] Add levels to DocValues skipper index [lucene]
jpountz commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681097434 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java: ## @@ -1194,24 +1194,27 @@ public int numLevels() { @Override public int minDocID(int level) { assertThread("Doc values skipper", creationThread); - Objects.checkIndex(level, numLevels()); int minDocID = in.minDocID(level); assert minDocID <= in.maxDocID(level); - if (level > 0) { -assert minDocID <= in.minDocID(level - 1); + if (minDocID != -1 && minDocID != DocIdSetIterator.NO_MORE_DOCS) { +Objects.checkIndex(level, numLevels()); + } + for (int i = 0; i < level; i++) { +assert minDocID >= in.minDocID(i); Review Comment: Could we make `numLevels` return 1 to avoid this issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add levels to DocValues skipper index [lucene]
iverase commented on code in PR #13563: URL: https://github.com/apache/lucene/pull/13563#discussion_r1681112760 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java: ## @@ -1194,24 +1194,27 @@ public int numLevels() { @Override public int minDocID(int level) { assertThread("Doc values skipper", creationThread); - Objects.checkIndex(level, numLevels()); int minDocID = in.minDocID(level); assert minDocID <= in.maxDocID(level); - if (level > 0) { -assert minDocID <= in.minDocID(level - 1); + if (minDocID != -1 && minDocID != DocIdSetIterator.NO_MORE_DOCS) { +Objects.checkIndex(level, numLevels()); + } + for (int i = 0; i < level; i++) { +assert minDocID >= in.minDocID(i); Review Comment: yes, done and revert changes here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Are we properly accounting for `NeighborArray.rwlock`? [lucene]
msokolov opened a new issue, #13580: URL: https://github.com/apache/lucene/issues/13580 ### Description Really three issues:
1. `OnHeapHnswGraph.ramBytesUsed` has a complicated job - it's hard to tell whether it takes into account the probably quite significant RAM usage of the `ReentrantReadWriteLock` that gets created for every node in the graph.
2. We create these locks even in the case where we are flushing (via `HnswGraphBuilder`) and don't ever use them. We only need locks available when merging.
3. We could hash the node ids and share a smaller pool of locks among them and thus save a bunch of allocations.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
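For readers skimming the issue, here is a minimal sketch of what idea (3) above could look like. It is illustrative only, not code from `OnHeapHnswGraph` or the eventual fix; the class and method names are made up:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical striped lock pool: hash node ids onto a small, fixed set of
// read-write locks instead of allocating one ReentrantReadWriteLock per node.
final class StripedNodeLocks {
  private final ReentrantReadWriteLock[] locks;

  StripedNodeLocks(int stripes) {
    locks = new ReentrantReadWriteLock[stripes];
    for (int i = 0; i < stripes; i++) {
      locks[i] = new ReentrantReadWriteLock();
    }
  }

  // Nodes that hash to the same stripe share a lock; correctness is preserved,
  // only contention may increase slightly.
  ReentrantReadWriteLock forNode(int node) {
    return locks[Math.floorMod(node * 0x9E3779B9, locks.length)];
  }
}
```

With a few dozen stripes this caps both the allocations and the RAM that `ramBytesUsed` would need to account for, at the cost of occasionally sharing a lock between unrelated nodes.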
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
magibney commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233474047 This is looking good. I think making it possible for MMapDirectory to customize/bypass grouping is still important -- something along the lines of what Uwe was thinking in [these](#issuecomment-2228827755) [comments](#issuecomment-2228845198): >>Maybe the file pattern should really be a Predicate to separate the implementation. Maybe add a method to return protected Optional getArenaGroupingKey(String filename) (or as Function>), which returns empty optional if no grouping allowed or the segment number otherwise as key. > >This could be similarily implemented like the setPreload(BiPredicate) method. >It could even be simpler by passing the grouping key to the provider method directly in addition to filename. Then the code that figures out which group is used can reside in MMapDirectory only. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233498077 I am fine with everything - BUT: we should still check how the grouping works when an additional `*.del` file is written for an existing segment (same for softdeletes and docvalues updates). If it is treated as the same segment, at least for del files it would not cause big issues (because those are quite small), but if we have in-place atomic updates, we could have files already closed and deleted, but the arena would still be there. I will do some checks on Windows, which is very sensitive about deleting still-open files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
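To make the concern concrete: live-docs and doc-values-update files carry the same `_N` segment prefix as the base segment files, so a grouping keyed on the segment-name prefix would pull them into the same shared Arena. A quick illustrative check, assuming the grouping keys on `IndexFileNames.parseSegmentName` as the later review diffs suggest (the file names below are hypothetical examples):

```java
import org.apache.lucene.index.IndexFileNames;

public class GroupingKeyCheck {
  public static void main(String[] args) {
    // hypothetical names: base compound file, live-docs generation, doc-values update
    String[] files = {"_0.cfs", "_0_1.liv", "_0_2_Lucene90_0.dvd"};
    for (String f : files) {
      System.out.println(f + " -> " + IndexFileNames.parseSegmentName(f));
    }
    // If all three print "_0", the update files would share the segment's Arena,
    // which is exactly the lifecycle question raised above.
  }
}
```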
Re: [I] Are we properly accounting for `NeighborArray.rwlock`? [lucene]
msokolov commented on issue #13580: URL: https://github.com/apache/lucene/issues/13580#issuecomment-2233498918 I also found we are acquiring these locks for writing in `HnswGraphBuilder.addDiverseNeighbors` even in the single-threaded case where it is not needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Compute facets while collecting [lucene]
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1681197072 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -252,9 +213,9 @@ public int hashCode() { final int prime = 31; Review Comment: I'm not sure what the benefits of doing a deep hashCode would be in this case? But I believe we are not changing the behavior with this change: neither FacetsCollectorManager nor CollectorOwner overrides the default hashCode method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233534218 On Windows I got the following test error - actually what I expected (Windows is not allowed to delete files which are still open or mmapped):
```
TestIndexWriter > testDeleteUnusedFiles FAILED
    java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([6C3A14A35BBB8424:D30C462C33E329CB]:0)
        at org.junit.Assert.fail(Assert.java:87)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertTrue(Assert.java:53)
        at org.apache.lucene.index.TestIndexWriter.testDeleteUnusedFiles(TestIndexWriter.java:1303)
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
ChrisHegarty commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233544810 > I am fine with everything - BUT: we should still check how the grouping works when an additional `*.del` file is written for an existing segment (same for softdeletes and docvalues updates). If it is treated as the same segment, at least for del files it would not cause big issues (because those are quite small), but if we have in-place atomic updates, we could have files already closed and deleted, but the arena would still be there. > > I will do some checks on Windows, which is very sensitive about deleting still-open files. Same, I'm checking too. Additionally, I'm still sketching out a few options around API support for a `groupingFunction`, etc., to allow for an opt-out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233547375 It fails reproducibly. Let me look into it :-) Not sure what the issue is (if it is deletes, softdeletes or docvalues updates). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
ChrisHegarty commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233549425 > On Windows I got the following test error - actually what I expected (Windows is not allowed to delete files which are still open or mmapped Ha! You're too fast. I'm still spinning up a VM to test Windows. > It fails reproducibly. Let me look into it :-) Not sure what the issue is (if it is deletes, softdeletes or docvalues updates). Thanks. It's with you. Appreciate it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Expose flat vectors in "user space" [lucene]
benwtrent commented on issue #13468: URL: https://github.com/apache/lucene/issues/13468#issuecomment-2233567273 Maybe this is done? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Are we properly accounting for `NeighborArray.rwlock`? [lucene]
benwtrent commented on issue #13580: URL: https://github.com/apache/lucene/issues/13580#issuecomment-2233568004 @msokolov related: https://github.com/apache/lucene/issues/12732 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233584894 > It fails reproducibly. Let me look into it :-) Not sure what the issue is (if it is deletes, softdeletes or docvalues updates). This test failure is "interesting". This test enforces MMapDirectory and it actually tests that the file can't be deleted. It is a CFS file:
```
// NOTE: here we rely on "Windows" behavior, i.e. even though IW wanted to delete _0.cfs
// since it was merged away, because we have a reader open against this file,
// it should still be here:
assertTrue(Files.exists(indexPath.resolve("_0.cfs")));
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233610476 Test also fails in the main branch, so this is not new. Must have something to do with our previous changes. Interestingly it does not fail on [Policeman Jenkins](https://jenkins.thetaphi.de/job/Lucene-main-Windows/). I need more beers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestSnapshotDeletionPolicy#testMultiThreadedSnapshotting assertion failure [lucene]
aoli-al commented on issue #13571: URL: https://github.com/apache/lucene/issues/13571#issuecomment-2233656919 Hi @jpountz and @benwtrent, I saw your previous discussion about the test failure. Do you think this patch will help you to replay the failure? Also, let me know if you have any questions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233693234 Hi, this failure is unrelated. The difference is: locally I use Windows 11, Jenkins uses Windows 10. Actually it looks like on Windows 11 it is in fact possible to delete memory mapped files. I have to confirm this (something rings a bell in my head). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Compute facets while collecting [lucene]
mikemccand commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1681230468 ## lucene/core/src/java/org/apache/lucene/search/CollectorOwner.java: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.Collections; +import java.util.List; + +/** + * Like {@link CollectorManager}, but it owns the collectors its manager creates. It is convenient + * that clients of the class don't have to worry about keeping the list of collectors, as well as + * about making the collectors type (C) compatible when reduce is called. Instance of this class + * also caches results of {@link CollectorManager#reduce(Collection)}. Review Comment: `also caches` -> `also cache`? ## lucene/core/src/java/org/apache/lucene/search/CollectorOwner.java: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.Collections; +import java.util.List; + +/** + * Like {@link CollectorManager}, but it owns the collectors its manager creates. It is convenient + * that clients of the class don't have to worry about keeping the list of collectors, as well as + * about making the collectors type (C) compatible when reduce is called. Instance of this class Review Comment: `collectors` -> `collector's`? `instance` -> `instances`? ## lucene/core/src/java/org/apache/lucene/search/CollectorOwner.java: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.Collections; +import java.util.List; + +/** + * Like {@link CollectorManager}, but it owns the collectors its manager creates. It is convenient Review Comment: Maybe also explain that it wraps an existing (already created) `CollectorManager`? ## lucene/core/src/java/org/apache/lucene/search/CollectorOwner.java: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the Li
Re: [I] Are we properly accounting for `NeighborArray.rwlock`? [lucene]
msokolov commented on issue #13580: URL: https://github.com/apache/lucene/issues/13580#issuecomment-2233756360 ooh thanks @benwtrent I had forgotten about that. I'm working up a new version based on (3) above that I hope will reduce this usage and only require it for concurrent mergers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2233938475 So somebody who has more knowledge about how the filenames of deletion and/or softdeletes/docvalues instant update files are generated should help us. I think deletion files should not be a problem, as those are READONCE, but for the docvalues updates we may possibly need to have them on the exclusion list for grouping. I have no time to look into this now; I will check how to fix the test with Windows 11 and possibly set up a Windows 11 node on Policeman Jenkins. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Expose flat vectors in "user space" [lucene]
msokolov commented on issue #13468: URL: https://github.com/apache/lucene/issues/13468#issuecomment-2233960860 Yes, thanks - I'll resolve -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Expose flat vectors in "user space" [lucene]
msokolov closed issue #13468: Expose flat vectors in "user space" URL: https://github.com/apache/lucene/issues/13468 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestSnapshotDeletionPolicy#testMultiThreadedSnapshotting assertion failure [lucene]
benwtrent commented on issue #13571: URL: https://github.com/apache/lucene/issues/13571#issuecomment-2233982449 This is indeed interesting @aoli-al I will have to dig deeper to see. This part of the codebase is...opaque to say the least :/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Gradle build: cleanup of dependency resolution and consolidation of dependency versions [lucene]
dsmiley commented on code in PR #13484: URL: https://github.com/apache/lucene/pull/13484#discussion_r1681635416 ## versions.lock: ## Review Comment: Overall, thanks for doing this @dweiss! What script generates versions.lock? The "because" sections look mysterious. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
ChrisHegarty commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2234125875 I added a grouping function to the API - see [4591f7f](https://github.com/apache/lucene/pull/13570/commits/4591f7f904f0f76d4ef3fa5b4accee184183ddf1) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
ChrisHegarty commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2234130516 > So somebody who has more knowledge about how the filenames of deletion and/or softdeletes/docvalues instant update files are generated should help us. > > I think deletion files should not be a problem, as those are READONCE, but for the docvalues updates we may possibly need to have them on the exclusion list for grouping. I'll try to look into this, and also get someone to help, if needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Compute facets while collecting [lucene]
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1681693916 ## lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/SandboxFacetTestCase.java: ## @@ -0,0 +1,407 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.sandbox.facet; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.FacetsCollector.MatchingDocs; +import org.apache.lucene.facet.FacetsConfig; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.facet.taxonomy.FacetLabel; +import org.apache.lucene.facet.taxonomy.TaxonomyFacetLabels; +import org.apache.lucene.facet.taxonomy.TaxonomyFacetLabels.FacetLabelReader; +import org.apache.lucene.facet.taxonomy.TaxonomyReader; +import org.apache.lucene.sandbox.facet.abstracts.OrdLabelBiMap; +import org.apache.lucene.sandbox.facet.abstracts.OrdToComparable; +import org.apache.lucene.sandbox.facet.abstracts.OrdinalIterator; +import org.apache.lucene.sandbox.facet.ordinal_iterators.TopnOrdinalIterator; +import org.apache.lucene.sandbox.facet.recorders.CountFacetRecorder; +import org.apache.lucene.sandbox.facet.taxonomy.TaxonomyChildrenOrdinalIterator; +import org.apache.lucene.sandbox.facet.taxonomy.TaxonomyOrdLabelBiMap; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.TestUtil; +import org.apache.lucene.util.BytesRef; + +public abstract class SandboxFacetTestCase extends LuceneTestCase { + // we don't have access to overall count for all facets from count recorder, + // and we can't compute it as a SUM of values for each facet ID because we need to respect cases + // where + // the same doc belongs to multiple facets (e.g. overlapping ranges and + // multi value fields). We can add an extra range that includes everything, + // or consider supporting overall count in CountFacetRecorder. But it is not exactly the value + // we can get now, as this value wouldn't respect top-n cutoff. Is this value a must have facets + // feature? + static final int VALUE_CANT_BE_COMPUTED = -5; Review Comment: > Also, what is out plan to address this permanently? I've changed the comment back to TODO and added some details, WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Compute facets while collecting [lucene]
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1681713454 ## lucene/sandbox/src/test/org/apache/lucene/sandbox/facet/SandboxFacetTestCase.java: ## @@ -0,0 +1,407 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.sandbox.facet; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.FacetsCollector.MatchingDocs; +import org.apache.lucene.facet.FacetsConfig; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.facet.taxonomy.FacetLabel; +import org.apache.lucene.facet.taxonomy.TaxonomyFacetLabels; +import org.apache.lucene.facet.taxonomy.TaxonomyFacetLabels.FacetLabelReader; +import org.apache.lucene.facet.taxonomy.TaxonomyReader; +import org.apache.lucene.sandbox.facet.abstracts.OrdLabelBiMap; +import org.apache.lucene.sandbox.facet.abstracts.OrdToComparable; +import org.apache.lucene.sandbox.facet.abstracts.OrdinalIterator; +import org.apache.lucene.sandbox.facet.ordinal_iterators.TopnOrdinalIterator; +import org.apache.lucene.sandbox.facet.recorders.CountFacetRecorder; +import org.apache.lucene.sandbox.facet.taxonomy.TaxonomyChildrenOrdinalIterator; +import org.apache.lucene.sandbox.facet.taxonomy.TaxonomyOrdLabelBiMap; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.TestUtil; +import org.apache.lucene.util.BytesRef; + +public abstract class SandboxFacetTestCase extends LuceneTestCase { + // we don't have access to overall count for all facets from count recorder, + // and we can't compute it as a SUM of values for each facet ID because we need to respect cases + // where + // the same doc belongs to multiple facets (e.g. overlapping ranges and + // multi value fields). We can add an extra range that includes everything, + // or consider supporting overall count in CountFacetRecorder. But it is not exactly the value + // we can get now, as this value wouldn't respect top-n cutoff. Is this value a must have facets + // feature? + static final int VALUE_CANT_BE_COMPUTED = -5; + + /** + * Utility method that uses {@link FacetLabelReader} to get facet labels for each hit in {@link + * MatchingDocs}. The method returns {@code List>} where outer list has one entry + * per document and inner list has all {@link FacetLabel} entries that belong to a document. The + * inner list may be empty if no {@link FacetLabel} are found for a hit. 
+ * + * @param taxoReader {@link TaxonomyReader} used to read taxonomy during search. This instance is + * expected to be open for reading. + * @param fc {@link FacetsCollector} A collector with matching hits. + * @param dimension facet dimension for which labels are requested. A null value fetches labels + * for all dimensions. + * @return {@code List} where outer list has one non-null entry per document. and + * inner list contain all {@link FacetLabel} entries that belong to a document. + * @throws IOException when a low-level IO issue occurs. + */ + public List> getAllTaxonomyFacetLabels( + String dimension, TaxonomyReader taxoReader, FacetsCollector fc) throws IOException { +List> actualLabels = new ArrayList<>(); +TaxonomyFacetLabels taxoLabels = +new TaxonomyFacetLabels(taxoReader, FacetsConfig.DEFAULT_INDEX_FIELD_NAME); +for (MatchingDocs m : fc.getMatchingDocs()) { + FacetLabelReader facetLabelReader = taxoLabels.getFacetLabelReader(m.context); + DocIdSetIterator disi = m.bits.iterator(); + while (disi.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { +actualLabels.add(allFacetLabels(disi.docID(), dimension, facetLabelReader)); + } +} +return actualLabels; + } + + /** + * Utility method to get all facet labels for an input docId and dimension using the supplied + * {@link FacetLabelReader}. +
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
magibney commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1681711345 ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -83,6 +86,19 @@ public class MMapDirectory extends FSDirectory { */ public static final BiPredicate<String, IOContext> NO_FILES = (filename, context) -> false; + /** Argument for {@link #setGroupingFunction(Function)} that configures no grouping. */ + public static final Function<String, Optional<String>> NO_GROUPING = filename -> Optional.empty(); + + /** Argument for {@link #setGroupingFunction(Function)} that configures grouping by segment. */ + public static final Function<String, Optional<String>> GROUP_BY_SEGMENT = + filename -> { +String segmentName = IndexFileNames.parseSegmentName(filename); +if (filename.length() == segmentName.length()) { + return Optional.empty(); +} +return Optional.of(segmentName); + }; Review Comment: I think Uwe had floated the idea of doing extra validation that we're actually getting a segment name -- e.g. something like:
```java
String segmentPrefix = IndexFileNames.parseSegmentName(filename);
int segmentPrefixLength = segmentPrefix.length();
if (segmentPrefixLength > 1 && segmentPrefixLength != filename.length() && segmentPrefix.charAt(0) == '_') {
  String segmentName = segmentPrefix.substring(1);
  try {
    // try to parse a proper segment identifier
    Long.parseLong(segmentName, Character.MAX_RADIX);
    return Optional.of(segmentName);
  } catch (NumberFormatException ex) {
    // unable to parse as a segment identifier
  }
}
return Optional.empty();
```
If we can account for all explicit exceptions (where we don't expect to be able to parse a segment name) we could even throw `IllegalArgumentException` here instead of returning `Optional.empty()`? (this would prevent grouping from being used in cases where it should not be, by failing hard -- and could be a good complement to ensuring `READ_ONCE` is used where appropriate). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
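For context on how the opt-out knob under discussion would be used, here is a rough sketch based on the constants in the diff above. The API was still in flux on this PR, so treat names like `setGroupingFunction`, `GROUP_BY_SEGMENT` and `NO_GROUPING` as proposed, not final:

```java
import java.nio.file.Path;
import org.apache.lucene.store.MMapDirectory;

public class ArenaGroupingExample {
  public static void main(String[] args) throws Exception {
    try (MMapDirectory dir = new MMapDirectory(Path.of("/tmp/example-index"))) {
      // default in the proposal: one shared Arena per segment
      dir.setGroupingFunction(MMapDirectory.GROUP_BY_SEGMENT);

      // opt out entirely: every file keeps its own Arena, as before this PR
      // dir.setGroupingFunction(MMapDirectory.NO_GROUPING);
    }
  }
}
```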
[PR] HnswLock: access locks via hash and only use for concurrent indexing [lucene]
msokolov opened a new pull request, #13581: URL: https://github.com/apache/lucene/pull/13581 Addresses https://github.com/apache/lucene/issues/13580 by adding a locking wrapper for OnHeapHnswGraph's NeighborArrays, and supplying this when running concurrent merges. With this:
1. We no longer guard access to rows with locks when graph build is done in a single thread.
2. We only create a fixed number of locks rather than a lock per node
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] HnswLock: access locks via hash and only use for concurrent indexing [lucene]
msokolov commented on PR #13581: URL: https://github.com/apache/lucene/pull/13581#issuecomment-2234263814 Tested with 1M 256-dim docs ... at least it doesn't seem to make anything worse?

# M=16,width=50
Tests with 1M 256-d vectors, M=16, beam-width-index=50

| condition | merge threads | recall | latency | index time |
|---|---|---|---|---|
| candidate | 12 | 0.927 | 2.37 | 177206 |
| candidate | 12 | 0.928 | 2.57 | 174283 |
| baseline | 12 | 0.927 | 2.49 | 178528 |
| baseline | 12 | 0.927 | 2.40 | 180748 |
| candidate | 1 | 0.925 | 2.37 | 176221 |
| candidate | 1 | 0.925 | 2.36 | 186181 |
| baseline | 1 | 0.926 | 2.46 | 176939 |
| baseline | 1 | 0.926 | 2.55 | 185635 |

# M=32,width=100
Tests with 1M 256-d vectors, M=32, beam-width-index=100

| condition | merge threads | recall | latency | index time |
|---|---|---|---|---|
| candidate | 12 | 0.969 | 2.54 | 390794 |
| candidate | 12 | 0.969 | 2.51 | 434838 |
| baseline | 12 | 0.969 | 2.55 | 393600 |
| baseline | 12 | 0.967 | 2.82 | 441376 |
| candidate | 1 | 0.969 | 2.48 | 428682 |
| candidate | 1 | 0.970 | 2.49 | 392494 |
| baseline | 1 | 0.970 | 2.72 | 403489 |
| baseline | 1 | 0.968 | 2.44 | 409693 |
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Stop bounding outer window. [lucene]
jpountz opened a new pull request, #13582: URL: https://github.com/apache/lucene/pull/13582 Currently `MaxScoreBulkScorer` requires its "outer" window to be at least `WINDOW_SIZE`. The intuition there was that we should make sure we should use the whole range of the bit set that we are using to collect matches. The downside is that it may force us to use an upper level in the skip list that has worse upper bounds for the scores. luceneutil suggests that this is not a good trade-off: removing this requirement makes some queries a bit slower, but `OrHighMin` and `OrHighRare` much faster: ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value CountOrHighHigh 56.30 (36.8%) 52.86 (28.4%) -6.1% ( -52% - 93%) 0.623 Or2Terms2StopWords 162.62 (2.7%) 153.87 (4.2%) -5.4% ( -11% -1%) 0.000 CountOrHighMed 103.04 (24.6%) 99.36 (19.0%) -3.6% ( -37% - 53%) 0.667 IntNRQ 344.56 (9.9%) 333.25 (11.0%) -3.3% ( -22% - 19%) 0.407 OrStopWords 32.05 (4.6%) 31.15 (6.8%) -2.8% ( -13% -9%) 0.201 LowTerm 899.50 (4.9%) 877.72 (5.7%) -2.4% ( -12% -8%) 0.227 MedTerm 506.67 (6.2%) 496.16 (7.1%) -2.1% ( -14% - 12%) 0.412 HighTerm 444.00 (6.1%) 435.85 (7.3%) -1.8% ( -14% - 12%) 0.467 CountTerm 9416.88 (3.5%) 9260.39 (5.2%) -1.7% ( -9% -7%) 0.319 OrHighNotLow 367.96 (7.0%) 362.09 (6.0%) -1.6% ( -13% - 12%) 0.515 HighTermDayOfYearSort 857.10 (3.8%) 849.83 (3.8%) -0.8% ( -8% -7%) 0.556 OrHighNotMed 328.60 (7.0%) 325.94 (6.1%) -0.8% ( -12% - 13%) 0.743 Prefix3 287.21 (2.4%) 285.09 (1.7%) -0.7% ( -4% -3%) 0.349 Or3Terms 156.94 (3.1%) 155.80 (4.0%) -0.7% ( -7% -6%) 0.589 Fuzzy2 87.16 (1.3%) 86.59 (1.1%) -0.7% ( -3% -1%) 0.156 OrHighHigh 75.93 (2.3%) 75.62 (2.1%) -0.4% ( -4% -4%) 0.611 HighTermTitleBDVSort 13.53 (3.3%) 13.48 (6.8%) -0.4% ( -10% - 10%) 0.859 CountAndHighMed 120.03 (2.9%) 119.60 (1.9%) -0.4% ( -5% -4%) 0.705 Fuzzy1 89.56 (1.1%) 89.25 (1.2%) -0.3% ( -2% -1%) 0.422 AndStopWords 29.15 (3.5%) 29.05 (3.6%) -0.3% ( -7% -7%) 0.811 And3Terms 155.12 (2.2%) 154.91 (2.5%) -0.1% ( -4% -4%) 0.883 And2Terms2StopWords 149.66 (2.5%) 149.48 (2.3%) -0.1% ( -4% -4%) 0.897 Respell 49.39 (1.4%) 49.34 (1.3%) -0.1% ( -2% -2%) 0.826 Wildcard 77.63 (3.2%) 77.56 (3.0%) -0.1% ( -6% -6%) 0.935 OrNotHighLow 964.64 (2.9%) 964.50 (2.2%) -0.0% ( -4% -5%) 0.987 OrHighNotHigh 233.73 (7.3%) 233.73 (6.5%)0.0% ( -12% - 14%) 1.000 CountAndHighHigh 41.12 (2.4%) 41.15 (2.2%)0.1% ( -4% -4%) 0.937 HighTermMonthSort 3589.46 (1.3%) 3594.78 (2.7%)0.1% ( -3% -4%) 0.853 TermDTSort 360.82 (8.1%) 362.55 (5.2%)0.5% ( -11% - 14%) 0.852 PKLookup 285.81 (1.8%) 287.35 (1.7%)0.5% ( -2% -4%) 0.418 OrNotHighHigh 265.21 (7.0%) 266.65 (6.3%)0.5% ( -11% - 14%) 0.830 AndHighHigh 68.85 (2.6%) 69.25 (2.2%)0.6% ( -4% -5%) 0.527 Phrase 11.48 (3.1%) 11.56 (4.1%)0.6% ( -6% -8%) 0.644 AndHighMed 147.82 (2.3%) 148.76 (2.0%)0.6% ( -3% -5%) 0.438 AndHighLow 759.97 (3.8%) 765.97 (2.2%)0.8% ( -5% -7%) 0.502 HighTermTitleSort 141.20 (2.9%) 142.32 (3.1%)0.8% ( -5% -6%) 0.479 OrNotHighMed 377.50 (6.6%) 380.75 (5.8%)0.9% ( -10% - 14%)
Re: [PR] Compute facets while collecting [lucene]
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1681768843 ## lucene/core/src/java/org/apache/lucene/search/CollectorOwner.java: ## @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.Collections; +import java.util.List; + +/** + * Like {@link CollectorManager}, but it owns the collectors its manager creates. It is convenient + * that clients of the class don't have to worry about keeping the list of collectors, as well as + * about making the collectors type (C) compatible when reduce is called. Instance of this class + * also caches results of {@link CollectorManager#reduce(Collection)}. + * + * Note that instance of this class ignores any {@link Collector} created by {@link + * CollectorManager#newCollector()} directly, not through {@link #newCollector()} + * + * @lucene.experimental + */ +public final class CollectorOwner { Review Comment: The problem with using `CollectorManager` is that the reduce method takes `Collection<C> collectors` as an argument. It means that users not only have to manage the list of collectors themselves, but also make sure that their `CollectorManager` and collection of `Collector`s are of the same type `C`. It becomes a problem for DrillSideways, where we want to use different `CollectorManager`s for different dimensions. In theory we can overcome that by keeping a list of types for dimensions, and cast Collectors' collection items to the right type before calling `#reduce`. But creating the `CollectorOwner` class looks like a better solution to me - it allows DrillSideways and IndexSearcher to be ignorant of collector's and result's types. As a bonus, it manages the list of Collectors for both `DrillSideways` and `IndexSearcher`. Please let me know if you think it's not worth it - we can revisit the idea of keeping a list of types in DrillSideways? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
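To illustrate the shape being described, here is a minimal sketch (not the actual sandbox class): the wrapper creates and retains its own collectors, so callers never have to thread the `C` type parameter through to `reduce`, and the reduced result is computed once and cached.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.CollectorManager;

// Hypothetical sketch of the "owner" idea discussed above.
final class CollectorOwnerSketch<C extends Collector, T> {
  private final CollectorManager<C, T> manager;
  private final List<C> collectors = new ArrayList<>();
  private T result;
  private boolean reduced;

  CollectorOwnerSketch(CollectorManager<C, T> manager) {
    this.manager = manager;
  }

  // The owner, not the caller, keeps track of every collector it hands out.
  C newCollector() throws IOException {
    C collector = manager.newCollector();
    collectors.add(collector);
    return collector;
  }

  // Reduce lazily and cache the result; the element type always matches the manager.
  T getResult() throws IOException {
    if (reduced == false) {
      result = manager.reduce(collectors);
      reduced = true;
    }
    return result;
  }
}
```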
Re: [PR] Compute facets while collecting [lucene]
epotyom commented on PR #13568: URL: https://github.com/apache/lucene/pull/13568#issuecomment-2234352412 Thank you for reviewing @mikemccand ! I've made the changes and updated the branch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1681841607 ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -167,6 +188,7 @@ public MMapDirectory(Path path, LockFactory lockFactory, long maxChunkSize) thro throw new IllegalArgumentException("Maximum chunk size for mmap must be >0"); } this.chunkSizePower = Long.SIZE - 1 - Long.numberOfLeadingZeros(maxChunkSize); +this.groupingFunction = GROUP_BY_SEGMENT; Review Comment: See above. ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -102,6 +118,11 @@ public class MMapDirectory extends FSDirectory { */ public static final long DEFAULT_MAX_CHUNK_SIZE; + /** A provider specific context object or null, that will be passed to openInput. */ + final Object attachment = PROVIDER.attachment(); + + Function> groupingFunction; Review Comment: Maybe initialize it here (like for preload). Can also be private. ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -184,6 +206,21 @@ public void setPreload(BiPredicate preload) { this.preload = preload; } + /** Review Comment: Maybe add some short description also to class definition (like for preload). ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -83,6 +86,19 @@ public class MMapDirectory extends FSDirectory { */ public static final BiPredicate NO_FILES = (filename, context) -> false; + /** Argument for {@link #setGroupingFunction(Function)} that configures no grouping. */ + public static final Function> NO_GROUPING = filename -> Optional.empty(); + + /** Argument for {@link #setGroupingFunction(Function)} that configures grouping by segment. */ + public static final Function> GROUP_BY_SEGMENT = + filename -> { +String segmentName = IndexFileNames.parseSegmentName(filename); +if (filename.length() == segmentName.length()) { + return Optional.empty(); +} +return Optional.of(segmentName); + }; Review Comment: Do we also have a function that parses the Long without reimplementing it here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Aggregate files from the same segment into a single Arena [lucene]
uschindler commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1681847294 ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -83,6 +86,19 @@ public class MMapDirectory extends FSDirectory { */ public static final BiPredicate NO_FILES = (filename, context) -> false; + /** Argument for {@link #setGroupingFunction(Function)} that configures no grouping. */ + public static final Function> NO_GROUPING = filename -> Optional.empty(); + + /** Argument for {@link #setGroupingFunction(Function)} that configures grouping by segment. */ + public static final Function> GROUP_BY_SEGMENT = + filename -> { +String segmentName = IndexFileNames.parseSegmentName(filename); +if (filename.length() == segmentName.length()) { + return Optional.empty(); +} +return Optional.of(segmentName); + }; Review Comment: Maybe add a method to validate and extract a segment name to `IndexFileNames` class. We possibly also have one in test framework that could be moved... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] HnswLock: access locks via hash and only use for concurrent indexing [lucene]
benwtrent commented on PR #13581: URL: https://github.com/apache/lucene/pull/13581#issuecomment-2234506747 It's weird that the merge threads have no impact on index build time. Did you create a bunch of segments and then force merge to exercise this code path? Or was the buffer size too large to even allow merges? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1682009005 ## lucene/core/build.gradle: ## @@ -14,10 +14,43 @@ * See the License for the specific language governing permissions and * limitations under the License. */ +plugins { + id "c" +} apply plugin: 'java-library' +apply plugin: 'c' description = 'Lucene core library' +model { + toolChains { +gcc(Gcc) { + target("linux_aarch64"){ +cppCompiler.withArguments { args -> + args << "-O3 --shared" Review Comment: That was it, just needed a fresh pair of eyes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
goankur commented on PR #13572: URL: https://github.com/apache/lucene/pull/13572#issuecomment-2235177271 > Do we even need to use intrinsics? function is so simple that the compiler seems to do the right thing, e.g. use `SDOT` dot production instruction, given the correct flags: > > https://godbolt.org/z/KG1dPnrqn > > https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions- > I haven't benchmarked, just seems `SDOT` is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle. > > my cheap 2021 phone has `asimddp` feature in /proc/cpuinfo, dot product support seems widespread. > > You can use it directly via intrinsic, too, no need to use add/multiply intrinsic: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product > > But unless it is really faster than what GCC does with simple C, no need. With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). Interesting thing to note is that both C-based implementations have `10X` better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
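For reference, the quantity being benchmarked is a signed int8 dot product; below is a plain scalar Java sketch of the same semantics. The name and class are illustrative only; the implementations actually compared above are the native C variants and the Panama-vectorized Java one.

```java
public final class ScalarInt8DotProduct {
  // Scalar reference: each byte widens to int, products accumulate in a 32-bit int.
  // The vectorized versions (SDOT on ARM, the Panama vector API on the JVM)
  // are expected to compute the same value, just faster.
  public static int dot(byte[] a, byte[] b) {
    int acc = 0;
    for (int i = 0; i < a.length; i++) {
      acc += a[i] * b[i];
    }
    return acc;
  }
}
```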
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
rmuir commented on PR #13572: URL: https://github.com/apache/lucene/pull/13572#issuecomment-2235209531 > With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). Interesting thing to note is that both C-based implementations have `10X` better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison). This seems correct to me. The Java Vector API is not performant for the integer case. Hotspot doesn't understand ARM very well and definitely doesn't have instructions such as `SDOT` in its vocabulary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Early terminate visit BKD leaf when current value greater than upper point in sorted dim. [lucene]
vsop-479 commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-2235220539 > if there are slowdowns due to the extra check for each visited point. Maybe the cost of this extra check can be reduced by the [branch predictor](https://en.wikipedia.org/wiki/Branch_predictor). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]
rmuir commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1682041365 ## lucene/core/build.gradle: ## @@ -14,12 +14,59 @@ * See the License for the specific language governing permissions and * limitations under the License. */ +plugins { + id "c" +} apply plugin: 'java-library' +apply plugin: 'c' description = 'Lucene core library' +model { + toolChains { +gcc(Gcc) { + target("linux_aarch64"){ +path '/usr/bin/' +cCompiler.executable 'gcc10-cc' +cCompiler.withArguments { args -> + args << "--shared" + << "-O3" + << "-march=armv8.2-a+dotprod" Review Comment: for simple case of benchmarking i would just try `-march=native` to compile the best it can for the specific cpu/microarch. Then you could test your `dot8s` on ARM 256-bit SVE (graviton3, graviton4) which might be interesting... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org