[GitHub] [lucene] akhgeek30 opened a new issue, #11864: ArrayIndexOutOfBoundException
akhgeek30 opened a new issue, #11864:
URL: https://github.com/apache/lucene/issues/11864

### Description

Steps to reproduce

1. Query = abc-ghi
2. Create a synonym file as Synonym.txt = { abc,def ghi,jkl }
3. Schema to be followed managed-schema

Error :

`java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:653)
    at org.apache.solr.parser.SolrQueryParserBase.newSynonymQuery(SolrQueryParserBase.java:617)
    at org.apache.lucene.util.QueryBuilder.analyzeGraphBoolean(QueryBuilder.java:533)
    at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:320)
    at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:240)
    at org.apache.solr.parser.SolrQueryParserBase.newFieldQuery(SolrQueryParserBase.java:524)
    at org.apache.solr.parser.QueryParser.newFieldQuery(QueryParser.java:62)
    at org.apache.solr.parser.SolrQueryParserBase.getFieldQuery(SolrQueryParserBase.java:1072)
    at org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:806)
    at org.apache.solr.parser.QueryParser.Term(QueryParser.java:421)
    at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:278)
    at org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)
    at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)
    at org.apache.solr.parser.QueryParser.Query(QueryParser.java:222)
    at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)
    at org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)
    at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)
    at org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)
    at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)
    at org.apache.solr.parser.QueryParser.Query(QueryParser.java:222)
    at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:131)
    at org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:260)
    at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:49)
    at org.apache.solr.search.QParser.getQuery(QParser.java:173)
    at org.apache.solr.search.ExtendedDismaxQParser.getBoostQueries(ExtendedDismaxQParser.java:566)
    at org.apache.solr.search.ExtendedDismaxQParser.parse(ExtendedDismaxQParser.java:187)
    at org.apache.solr.search.QParser.getQuery(QParser.java:173)
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:159)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:272)
    at `

Found Issue in org/apache/lucene/util/QueryBuilder.java

    protected Query newSynonymQuery(Term terms[]) {
      SynonymQuery.Builder builder = new SynonymQuery.Builder(**_terms[0].field()_**);
      for (Term term : terms) {
        builder.addTerm(term);
      }
      return builder.build();
    }

### Version and environment details

Version > 8.0.0

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
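The failure mode described in the report (reading `terms[0]` when the array is empty) can be reproduced without Lucene. The following is a minimal sketch under stated assumptions: `SketchTerm` is a hypothetical stand-in for Lucene's `Term`, and the null-returning guard is only one possible way to handle empty input, not the project's actual fix.

```java
/** Hypothetical stand-in for Lucene's Term; for illustration only. */
class SketchTerm {
  private final String field;
  private final String text;

  SketchTerm(String field, String text) {
    this.field = field;
    this.text = text;
  }

  String field() {
    return field;
  }

  String text() {
    return text;
  }
}

class SynonymGuardSketch {
  /**
   * Same shape as the reported method: the field name is read from terms[0].
   * A zero-length array makes this throw ArrayIndexOutOfBoundsException.
   */
  static String fieldOfFirstTerm(SketchTerm[] terms) {
    return terms[0].field();
  }

  /** One possible guard (an assumption, not the upstream fix): null for empty input. */
  static String fieldOfFirstTermOrNull(SketchTerm[] terms) {
    if (terms == null || terms.length == 0) {
      return null;
    }
    return terms[0].field();
  }

  public static void main(String[] args) {
    SketchTerm[] empty = new SketchTerm[0];
    System.out.println("guarded: " + fieldOfFirstTermOrNull(empty));
    try {
      fieldOfFirstTerm(empty);
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("unguarded access failed: " + e.getMessage());
    }
  }
}
```

Whether an empty term array should be rejected earlier in the analysis chain or tolerated at this point is a design question the report leaves open.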
[GitHub] [lucene] iverase opened a new pull request, #11865: Fix duplicate entry in CHANGES.txt
iverase opened a new pull request, #11865:
URL: https://github.com/apache/lucene/pull/11865

Seems to be a leftover from the last commit.
[GitHub] [lucene] iverase merged pull request #11865: Fix duplicate entry in CHANGES.txt
iverase merged PR #11865:
URL: https://github.com/apache/lucene/pull/11865
[GitHub] [lucene] benwtrent commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections
benwtrent commented on code in PR #11860: URL: https://github.com/apache/lucene/pull/11860#discussion_r1000640297 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java: ## @@ -0,0 +1,505 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.lucene.codecs.lucene95; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.index.*; +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.search.TotalHits; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.RamUsageEstimator; +import org.apache.lucene.util.hnsw.HnswGraph; +import org.apache.lucene.util.hnsw.HnswGraphSearcher; +import org.apache.lucene.util.hnsw.NeighborQueue; +import org.apache.lucene.util.packed.DirectMonotonicReader; +import org.apache.lucene.util.packed.PackedInts; + +/** + * Reads vectors from the index segments along with index data structures supporting KNN search. 
+ * + * @lucene.experimental + */ +public final class Lucene95HnswVectorsReader extends KnnVectorsReader { + + private final FieldInfos fieldInfos; + private final Map fields = new HashMap<>(); + private final IndexInput vectorData; + private final IndexInput vectorIndex; + + Lucene95HnswVectorsReader(SegmentReadState state) throws IOException { +this.fieldInfos = state.fieldInfos; +int versionMeta = readMetadata(state); +boolean success = false; +try { + vectorData = + openDataInput( + state, + versionMeta, + Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION, + Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME); + vectorIndex = + openDataInput( + state, + versionMeta, + Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION, + Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME); + success = true; +} finally { + if (success == false) { +IOUtils.closeWhileHandlingException(this); + } +} + } + + private int readMetadata(SegmentReadState state) throws IOException { +String metaFileName = +IndexFileNames.segmentFileName( +state.segmentInfo.name, state.segmentSuffix, Lucene95HnswVectorsFormat.META_EXTENSION); +int versionMeta = -1; +try (ChecksumIndexInput meta = state.directory.openChecksumInput(metaFileName, state.context)) { + Throwable priorE = null; + try { +versionMeta = +CodecUtil.checkIndexHeader( +meta, +Lucene95HnswVectorsFormat.META_CODEC_NAME, +Lucene95HnswVectorsFormat.VERSION_START, +Lucene95HnswVectorsFormat.VERSION_CURRENT, +state.segmentInfo.getId(), +state.segmentSuffix); +readFields(meta, state.fieldInfos); + } catch (Throwable exception) { +priorE = exception; + } finally { +CodecUtil.checkFooter(meta, priorE); + } +} +return versionMeta; + } + + private static IndexInput openDataInput( + SegmentReadState state, int versionMeta, String fileExtension, String codecName) + throws IOException { +String fileName = +IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, fileExtension); +IndexInput in = state.directory.openInput(fileName, 
state.context); +boolean success = false; +try { + int versionVectorData = + CodecUtil.checkIndexHeader( + in, + codecName, + Lucene95HnswVectorsFormat.VERSION_START, + Lucene95HnswVectorsFormat.VERSION_CURRENT, + state.segmentInfo.getId(), + state.segmentSuffix); + if (versionMeta != versionVe
[GitHub] [lucene] mikemccand commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mikemccand commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r1000886175

## lucene/misc/src/java/org/apache/lucene/misc/store/ByteTrackingIndexOutput.java: ##
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.IndexOutput;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {

Review Comment:
Maybe open a follow-on issue to add a `FilterIndexOutput`? These delegators are spooky when they are not properly tested...
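The delegator pattern discussed in this review (wrap an output, forward every write, count the bytes) can be sketched with `java.io` alone. This is a hedged analogue, not the code from the PR: `CountingOutputStreamSketch` and `countBytes` are hypothetical names, and `FilterOutputStream` stands in for the `FilterIndexOutput` that the comment proposes.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical byte-tracking delegator built on java.io rather than Lucene. */
class CountingOutputStreamSketch extends FilterOutputStream {
  private final AtomicLong bytesWritten;

  CountingOutputStreamSketch(OutputStream delegate, AtomicLong bytesWritten) {
    super(delegate);
    this.bytesWritten = bytesWritten;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    bytesWritten.incrementAndGet();
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    // Forward the whole range in one call. FilterOutputStream.write(byte[])
    // routes through this method, so each write path is counted exactly once.
    out.write(b, off, len);
    bytesWritten.addAndGet(len);
  }

  /** Writes each chunk through the wrapper and returns the tracked byte count. */
  static long countBytes(byte[]... chunks) {
    AtomicLong counter = new AtomicLong();
    try (OutputStream o = new CountingOutputStreamSketch(new ByteArrayOutputStream(), counter)) {
      for (byte[] chunk : chunks) {
        o.write(chunk);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return counter.get();
  }

  public static void main(String[] args) {
    System.out.println(countBytes(new byte[3], new byte[5]));
  }
}
```

The hazard the reviewer points at shows up here as well: if a delegator overrides only some write methods, an untested path can either bypass the counter or count the same bytes twice, which is one argument for a shared, well-tested filter base class.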
[GitHub] [lucene] mikemccand commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mikemccand commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1285872091

This looks great to me! I love all the engagement (83+ comments!) and how it iterated to such a simple solution.

I left a small comment for a follow-on issue ... and it looks like `CHANGES.txt` is conflicting again.

@mdmarshmallow maybe open another follow-on issue in `luceneutil` to add this to nightly benchmarks? It'd be great to see the impact on WAF over time of interesting index-time changes...

I'll push this in a few days if nobody objects.
[GitHub] [lucene] NightOwl888 opened a new issue, #11866: On many analyzers, the getDefaultStopSet() method returns a modifiable set, contrary to the docs
NightOwl888 opened a new issue, #11866:
URL: https://github.com/apache/lucene/issues/11866

### Description

Several of the analyzers state that they are supposed to return an unmodifiable `CharArraySet`, but the set that is returned is writable, as you can see in the source.

https://github.com/apache/lucene/blob/cc342ea7407c729a743123d8f7957aff6c6f9792/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchAnalyzer.java#L67-L92

Note that the `Snowball` sets are also returned as writable.

### Version and environment details

All versions, all environments
[GitHub] [lucene] rmuir commented on issue #11866: On many analyzers, the getDefaultStopSet() method returns a modifiable set, contrary to the docs
rmuir commented on issue #11866:
URL: https://github.com/apache/lucene/issues/11866#issuecomment-1285890056

The example is not correct. `WordlistLoader.getSnowballWordSet()` returns an unmodifiableSet.
[GitHub] [lucene] benwtrent commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections
benwtrent commented on code in PR #11860: URL: https://github.com/apache/lucene/pull/11860#discussion_r1000928799 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java: ## @@ -0,0 +1,505 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.lucene.codecs.lucene95; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.index.*; +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.search.TotalHits; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.RamUsageEstimator; +import org.apache.lucene.util.hnsw.HnswGraph; +import org.apache.lucene.util.hnsw.HnswGraphSearcher; +import org.apache.lucene.util.hnsw.NeighborQueue; +import org.apache.lucene.util.packed.DirectMonotonicReader; +import org.apache.lucene.util.packed.PackedInts; + +/** + * Reads vectors from the index segments along with index data structures supporting KNN search. 
+ * + * @lucene.experimental + */ +public final class Lucene95HnswVectorsReader extends KnnVectorsReader { + + private final FieldInfos fieldInfos; + private final Map fields = new HashMap<>(); + private final IndexInput vectorData; + private final IndexInput vectorIndex; + + Lucene95HnswVectorsReader(SegmentReadState state) throws IOException { +this.fieldInfos = state.fieldInfos; +int versionMeta = readMetadata(state); +boolean success = false; +try { + vectorData = + openDataInput( + state, + versionMeta, + Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION, + Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME); + vectorIndex = + openDataInput( + state, + versionMeta, + Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION, + Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME); + success = true; +} finally { + if (success == false) { +IOUtils.closeWhileHandlingException(this); + } +} + } + + private int readMetadata(SegmentReadState state) throws IOException { +String metaFileName = +IndexFileNames.segmentFileName( +state.segmentInfo.name, state.segmentSuffix, Lucene95HnswVectorsFormat.META_EXTENSION); +int versionMeta = -1; +try (ChecksumIndexInput meta = state.directory.openChecksumInput(metaFileName, state.context)) { + Throwable priorE = null; + try { +versionMeta = +CodecUtil.checkIndexHeader( +meta, +Lucene95HnswVectorsFormat.META_CODEC_NAME, +Lucene95HnswVectorsFormat.VERSION_START, +Lucene95HnswVectorsFormat.VERSION_CURRENT, +state.segmentInfo.getId(), +state.segmentSuffix); +readFields(meta, state.fieldInfos); + } catch (Throwable exception) { +priorE = exception; + } finally { +CodecUtil.checkFooter(meta, priorE); + } +} +return versionMeta; + } + + private static IndexInput openDataInput( + SegmentReadState state, int versionMeta, String fileExtension, String codecName) + throws IOException { +String fileName = +IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, fileExtension); +IndexInput in = state.directory.openInput(fileName, 
state.context); +boolean success = false; +try { + int versionVectorData = + CodecUtil.checkIndexHeader( + in, + codecName, + Lucene95HnswVectorsFormat.VERSION_START, + Lucene95HnswVectorsFormat.VERSION_CURRENT, + state.segmentInfo.getId(), + state.segmentSuffix); + if (versionMeta != versionVe
[GitHub] [lucene] NightOwl888 commented on issue #11866: On many analyzers, the getDefaultStopSet() method returns a modifiable set, contrary to the docs
NightOwl888 commented on issue #11866:
URL: https://github.com/apache/lucene/issues/11866#issuecomment-1285916552

I attempted to modify it, and it is succeeding.

```
SoraniAnalyzer.getDefaultStopSet().Add("foo33") // returns true
```
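The question in this thread is whether a shared default set rejects writes. A minimal sketch of the distinction, assuming plain `java.util` sets in place of Lucene's `CharArraySet` (all names below are hypothetical, and the sketch takes no position on what the Lucene analyzers actually return):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class StopSetSketch {
  // A shared default that is handed out as-is: callers can mutate it.
  private static final Set<String> WRITABLE_DEFAULT =
      new HashSet<>(Arrays.asList("a", "the"));

  // The same content behind an unmodifiable view: mutation attempts throw.
  private static final Set<String> READ_ONLY_DEFAULT =
      Collections.unmodifiableSet(new HashSet<>(Arrays.asList("a", "the")));

  static Set<String> getWritableDefault() {
    return WRITABLE_DEFAULT;
  }

  static Set<String> getReadOnlyDefault() {
    return READ_ONLY_DEFAULT;
  }

  /** Returns true if the set accepted the add call without throwing. */
  static boolean canAdd(Set<String> set, String word) {
    try {
      set.add(word);
      return true;
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("writable accepts add: " + canAdd(getWritableDefault(), "foo33"));
    System.out.println("read-only accepts add: " + canAdd(getReadOnlyDefault(), "foo33"));
  }
}
```

A probe like `canAdd` is a quick way to check the contract of any returned set; with a writable shared default, one caller's `add` becomes visible to every later caller.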
[GitHub] [lucene] jtibshirani opened a new pull request, #11867: Add monster test that indexes 1M vectors
jtibshirani opened a new pull request, #11867:
URL: https://github.com/apache/lucene/pull/11867

This is a rough draft of a large-scale test for kNN vectors. It tests a large dataset of kNN vectors to check for issues that only show up when segments are very large, like overflow. The dataset is based on the StackOverflow track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector. I tried developing a test using random vectors, but HNSW can become quite slow and ineffective when the data doesn't have structure.

Steps to run the test

1. Download the dataset: `wget https://rally-tracks.elastic.co/so_vector/documents.bin`
2. Move the dataset to the resources folder: `mv documents.bin lucene/core/src/resources/`
3. Start the test: `./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1`

Relates to #11863.
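As an illustration of the kind of overflow a large-segment test can surface, here is a minimal sketch of a byte-offset computation done in `int` versus `long` arithmetic. The class name, the method names, and the dimension of 768 are assumptions chosen for the example; they are not taken from the PR or from the dataset.

```java
class VectorOffsetSketch {
  /** Byte offset of a vector computed entirely in int arithmetic; can wrap around. */
  static int offsetWithIntMath(int ord, int dim) {
    return ord * dim * Float.BYTES;
  }

  /** Same computation widened to long before multiplying. */
  static long offsetWithLongMath(int ord, int dim) {
    return (long) ord * dim * Float.BYTES;
  }

  public static void main(String[] args) {
    int ord = 1_000_000; // the millionth vector
    int dim = 768; // hypothetical dimension
    System.out.println("int math:  " + offsetWithIntMath(ord, dim));
    System.out.println("long math: " + offsetWithLongMath(ord, dim));
  }
}
```

With these inputs the true offset is 3,072,000,000 bytes, which exceeds `Integer.MAX_VALUE`, so the `int` version wraps to a negative number. Small test datasets never reach that range, which is why such a bug would only appear at this scale.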
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001074649

## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/

Review Comment:
This tries to make a 3GB jar file as part of `:lucene:core:jar` task. For me it takes an eternity due to the zipping of the file into the jar. I dropped the file in `src/test` folder instead and the test is running with it.
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001077394

## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ * 3. Start the test:
+ *    ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *    -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();
+    if (random().nextBoolean()) {
+      iwc.setIndexSort(new Sort(new SortField("sortkey", SortField.Type.INT)));
+    }
+    String fieldName = "field";
+    VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.DOT_PRODUCT;
+
+    URL documentsPath = getClass().getClassLoader().getResource("documents.bin");
+    assertNotNull(documentsPath);
+
+    try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI()));
+        Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors"));

Review Comment:
If we use `newFSDirectory()` instead, then we get a checkindex at the end too. It can give more confidence in tests like these (as well as confidence there is no overflow in checkindex itself).
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001089104

## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ * 3. Start the test:
+ *    ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *    -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();

Review Comment:
We may want to specify the codec explicitly via `iwc.setCodec(TestUtil.getDefaultCodec())`. Otherwise, at least maybe suppress simpletext or anything that could be very slow.
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001090648 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.search.Sort; +import org.apache.lucene.search.SortField; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.LuceneTestCase.Monster; + +import java.io.IOException; +import java.io.InputStream; +import java.net.URL; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.FloatBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Path; +import java.nio.file.Paths; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + + +/** + * Tests a large dataset of kNN vectors to check for issues that only show up when + * segments are very large, like overflow. The dataset is based on the StackOverflow + * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector. + * + * Steps to run the test + * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin + * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/ + * 3. 
Start the test: + * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \ + * -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1 + */ +@TimeoutSuite(millis = 10_800_000) // 3 hour timeout +@Monster("takes ~2 hours and needs 2GB heap") +public class TestManyKnnVectors extends LuceneTestCase { + public void testLargeSegment() throws Exception { +IndexWriterConfig iwc = newIndexWriterConfig(); Review Comment: also, maybe consider not using random IW config but instead specifying one that will more efficiently run the test. For example configuring rambuffer to be large or whatever. It is a tradeoff that other monster tests take so that they are a little less monstrous, but still test the thing we want to test.
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001183142 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.search.Sort; +import org.apache.lucene.search.SortField; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.LuceneTestCase.Monster; + +import java.io.IOException; +import java.io.InputStream; +import java.net.URL; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.FloatBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Path; +import java.nio.file.Paths; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + + +/** + * Tests a large dataset of kNN vectors to check for issues that only show up when + * segments are very large, like overflow. The dataset is based on the StackOverflow + * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector. + * + * Steps to run the test + * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin + * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/ + * 3. 
Start the test: + * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \ + * -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1 + */ +@TimeoutSuite(millis = 10_800_000) // 3 hour timeout +@Monster("takes ~2 hours and needs 2GB heap") +public class TestManyKnnVectors extends LuceneTestCase { + public void testLargeSegment() throws Exception { +IndexWriterConfig iwc = newIndexWriterConfig(); Review Comment: i'd be happy to propose some changes. running the test was entirely too slow without this on my machine, I was gonna hit the test timeout :) so I did the following and restarted the test:
* Removed randomized `newIndexWriterConfig` as we want performance and not lots of merging or anything. especially for this test!
* set big rambuffer (200MB)
* set default codec (TestUtil.getDefaultCodec)
* Removed unrelated randomized indexsort and numericdocvalues field
```
IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(TestUtil.getDefaultCodec());
iwc.setRAMBufferSizeMB(200);
```
It seems to be chugging along faster on my slow 2018 2-core computer :)
[GitHub] [lucene] jtibshirani commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
jtibshirani commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001200789 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.search.Sort; +import org.apache.lucene.search.SortField; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.LuceneTestCase.Monster; + +import java.io.IOException; +import java.io.InputStream; +import java.net.URL; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.FloatBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Path; +import java.nio.file.Paths; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + + +/** + * Tests a large dataset of kNN vectors to check for issues that only show up when + * segments are very large, like overflow. The dataset is based on the StackOverflow + * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector. + * + * Steps to run the test + * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin + * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/ + * 3. 
Start the test: + * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \ + * -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1 + */ +@TimeoutSuite(millis = 10_800_000) // 3 hour timeout +@Monster("takes ~2 hours and needs 2GB heap") +public class TestManyKnnVectors extends LuceneTestCase { + public void testLargeSegment() throws Exception { +IndexWriterConfig iwc = newIndexWriterConfig(); Review Comment: These are good points, I'll push the suggested changes. I guess my computer is beefier, I completed runs in under 2 hours each and confirmed it fails before the change, succeeds after.
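For readers following the thread: the "fails before the change" result refers to the overflow the test's Javadoc mentions. The test's own comment (quoted later in this thread) notes that 1_000_000 * 768 * 4 exceeds Integer.MAX_VALUE. Below is a minimal, Lucene-free sketch of that arithmetic. The class and method names (`VectorByteSize`, `asInt`, `asLong`, `asIntExact`) are made up for illustration and are not taken from Lucene or from the PR.

```java
// Hypothetical illustration only; not code from Lucene or from PR #11867.
final class VectorByteSize {
  /** Multiplies in 32-bit int arithmetic, which silently wraps around on overflow. */
  static int asInt(int numVectors, int dims, int bytesPerValue) {
    return numVectors * dims * bytesPerValue;
  }

  /** Widens to long before multiplying, so the product is exact. */
  static long asLong(int numVectors, int dims, int bytesPerValue) {
    return (long) numVectors * dims * bytesPerValue;
  }

  /** Stays in int arithmetic but throws ArithmeticException instead of wrapping. */
  static int asIntExact(int numVectors, int dims, int bytesPerValue) {
    return Math.multiplyExact(Math.multiplyExact(numVectors, dims), bytesPerValue);
  }

  public static void main(String[] args) {
    int numVectors = 1_000_000; // value used in the test
    int dims = 768;             // value used in the test
    System.out.println("int:  " + asInt(numVectors, dims, Float.BYTES));  // -1222967296
    System.out.println("long: " + asLong(numVectors, dims, Float.BYTES)); // 3072000000
  }
}
```

Widening to `long` before the multiplication is the usual remedy for this class of bug; whether the actual fix took that form is not shown in this thread.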
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001206144 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.search.Sort; +import org.apache.lucene.search.SortField; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.LuceneTestCase.Monster; + +import java.io.IOException; +import java.io.InputStream; +import java.net.URL; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.FloatBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Path; +import java.nio.file.Paths; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + + +/** + * Tests a large dataset of kNN vectors to check for issues that only show up when + * segments are very large, like overflow. The dataset is based on the StackOverflow + * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector. + * + * Steps to run the test + * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin + * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/ + * 3. 
Start the test: + * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \ + * -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1 + */ +@TimeoutSuite(millis = 10_800_000) // 3 hour timeout +@Monster("takes ~2 hours and needs 2GB heap") +public class TestManyKnnVectors extends LuceneTestCase { + public void testLargeSegment() throws Exception { +IndexWriterConfig iwc = newIndexWriterConfig(); Review Comment: It's probably beefier :) But I think maybe you also got luckier with the `newIndexWriterConfig`, my computer was just merging and merging and merging. Now with the changes it actually spends its time indexing (albeit maxing out just one cpu core all bottlenecked on `dotProduct()`). Of course, the test could be modified to use multiple cores as a next step. With the changes, I actually see your messages such as `1> Indexed 780000 vectors out of 1000000` rather than being flooded with constant merging. And that's 780k out of 1M after only 40 minutes on my machine, so it may complete in under an hour for me already.
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001212579 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -61,11 +61,13 @@ @Monster("takes ~2 hours and needs 2GB heap") public class TestManyKnnVectors extends LuceneTestCase { public void testLargeSegment() throws Exception { -// Make sure to use the default codec instead of a random one -IndexWriterConfig iwc = newIndexWriterConfig().setCodec(TestUtil.getDefaultCodec()); +IndexWriterConfig iwc = new IndexWriterConfig(); +iwc.setCodec(TestUtil.getDefaultCodec()); // Make sure to use the default codec instead of a random one +iwc.setRAMBufferSizeMB(3_000); // Use a 3GB buffer to create a single large segment Review Comment: Maybe use a smaller value here... otherwise we need to change the docs of `@Monster` annotation and your comments about configuring test heap sizes. And I think it's good to keep monster tests less monstrous. I'm using 200MB buffer, still with your suggested 2GB heap. It seems to flush about 600-700k docs in each segment. There's no merges happening until the test asks for it with a forceMerge(1), which is running now. Compared to the 3GB buffer, yeah, I've gotta suffer the forceMerge in the end, not sure how long that's gonna take, but it exercises the merge code as well as the flush code, and keeps the heap memory usage lower, for less monstrosity in the test. May be the right tradeoff.
[GitHub] [lucene] jtibshirani commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
jtibshirani commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001214017 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -61,11 +61,13 @@ @Monster("takes ~2 hours and needs 2GB heap") public class TestManyKnnVectors extends LuceneTestCase { public void testLargeSegment() throws Exception { -// Make sure to use the default codec instead of a random one -IndexWriterConfig iwc = newIndexWriterConfig().setCodec(TestUtil.getDefaultCodec()); +IndexWriterConfig iwc = new IndexWriterConfig(); +iwc.setCodec(TestUtil.getDefaultCodec()); // Make sure to use the default codec instead of a random one +iwc.setRAMBufferSizeMB(3_000); // Use a 3GB buffer to create a single large segment Review Comment: D'oh, yes this shouldn't be bigger than the suggested heap size...
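The point made in the two comments above (a 3_000 MB RAM buffer cannot sit inside the suggested 2GB heap, while 200 MB can) comes down to a simple size comparison. Here is a rough sketch of such a sanity check using only the JDK. `RamBufferCheck` and `fitsInHeap` are hypothetical names, not part of Lucene or its test framework, and the check is only a necessary condition, since the heap must also hold everything else the test allocates.

```java
// Hypothetical helper for illustration; not part of Lucene or its test framework.
final class RamBufferCheck {
  /** Returns true if a buffer of bufferMB megabytes is strictly smaller than maxHeapBytes. */
  static boolean fitsInHeap(double bufferMB, long maxHeapBytes) {
    double bufferBytes = bufferMB * 1024 * 1024;
    return bufferBytes < maxHeapBytes;
  }

  public static void main(String[] args) {
    long twoGb = 2L * 1024 * 1024 * 1024; // heap size suggested via -Xmx2g
    System.out.println("3000MB buffer fits in 2GB heap: " + fitsInHeap(3_000, twoGb)); // false
    System.out.println("200MB buffer fits in 2GB heap:  " + fitsInHeap(200, twoGb));   // true
    // The same check against the JVM actually running this code:
    System.out.println("200MB fits in this JVM's heap:  "
        + fitsInHeap(200, Runtime.getRuntime().maxMemory()));
  }
}
```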
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001217406 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -0,0 +1,131 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.LuceneTestCase.Monster; +import org.apache.lucene.tests.util.TestUtil; + +import java.io.IOException; +import java.net.URL; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.FloatBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Paths; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + + +/** + * Tests a large dataset of kNN vectors to check for issues that only show up when + * segments are very large, like overflow. The dataset is based on the StackOverflow + * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector. + * + * Steps to run the test + * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin + * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/ + * 3. 
Start the test: + * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \ + * -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1 + */ +@TimeoutSuite(millis = 10_800_000) // 3 hour timeout +@Monster("takes ~2 hours and needs 2GB heap") +public class TestManyKnnVectors extends LuceneTestCase { + public void testLargeSegment() throws Exception { +IndexWriterConfig iwc = new IndexWriterConfig(); +iwc.setCodec(TestUtil.getDefaultCodec()); // Make sure to use the default codec instead of a random one +iwc.setRAMBufferSizeMB(200); // Use a 200MB buffer to create larger initial segments + +String fieldName = "field"; +VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.DOT_PRODUCT; + +URL documentsPath = getClass().getClassLoader().getResource("documents.bin"); +assertNotNull(documentsPath); + +try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI())); + Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors")); + IndexWriter iw = new IndexWriter(dir, iwc)) { + + // This data is enough to trigger the overflow bug in issue #11858, + // since 1_000_000 * 768 * 4 > Integer.MAX_VALUE + int numVectors = 1_000_000; + int dims = 768; + + VectorReader vectorReader = new VectorReader(input, dims); + for (int i = 0; i < numVectors; i++) { +float[] vector = vectorReader.next(); +Document doc = new Document(); +doc.add(new KnnVectorField(fieldName, vector, similarityFunction)); +iw.addDocument(doc); +if (VERBOSE && i % 10_000 == 0) { + System.out.println("Indexed " + i + " vectors out of " + numVectors); +} + } Review Comment: We can improve the output for this long-running test. I had to fill in the gaps with `jstack` otherwise: I would also consider changing the loop to be `for (int i = 1; i <= numVectors; i++)`. Then the print will say "Indexed 1000000 vectors out of 1000000 vectors" at the very end, so that you know indexing is complete. This does not happen today.
Maybe also here before the `forceMerge`:
```
if (VERBOSE) {
  System.out.println("forceMerge()ing to one segment...");
}
```
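The loop-bounds suggestion in the comment above can be checked without Lucene. The following is a small sketch, using made-up names (`ProgressMessages`, `zeroBased`, `oneBased`) and small numbers so that it runs instantly. It collects the progress messages each loop shape would print and shows that only the one-based loop ends with a message reporting the full count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two loop shapes discussed in the review.
final class ProgressMessages {
  /** Zero-based loop, as in the quoted test: reports whenever i % step == 0. */
  static List<String> zeroBased(int numVectors, int step) {
    List<String> messages = new ArrayList<>();
    for (int i = 0; i < numVectors; i++) {
      if (i % step == 0) {
        messages.add("Indexed " + i + " vectors out of " + numVectors);
      }
    }
    return messages;
  }

  /** One-based loop, as suggested in the review: the last report covers all vectors. */
  static List<String> oneBased(int numVectors, int step) {
    List<String> messages = new ArrayList<>();
    for (int i = 1; i <= numVectors; i++) {
      if (i % step == 0) {
        messages.add("Indexed " + i + " vectors out of " + numVectors);
      }
    }
    return messages;
  }

  public static void main(String[] args) {
    List<String> before = zeroBased(100, 10);
    List<String> after = oneBased(100, 10);
    System.out.println(before.get(before.size() - 1)); // Indexed 90 vectors out of 100
    System.out.println(after.get(after.size() - 1));   // Indexed 100 vectors out of 100
  }
}
```

With the zero-based loop the first message also reports 0 vectors, which is another reason the one-based form reads better in a verbose log.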
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001224833 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.search.Sort; +import org.apache.lucene.search.SortField; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.LuceneTestCase.Monster; + +import java.io.IOException; +import java.io.InputStream; +import java.net.URL; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.FloatBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Path; +import java.nio.file.Paths; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + + +/** + * Tests a large dataset of kNN vectors to check for issues that only show up when + * segments are very large, like overflow. The dataset is based on the StackOverflow + * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector. + * + * Steps to run the test + * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin + * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/ Review Comment: I think for this one i just suggest changing the code comment to say `mv documents.bin lucene/core/src/test/`. It makes for a faster experience. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on code in PR #11867: URL: https://github.com/apache/lucene/pull/11867#discussion_r1001226287 ## lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; +import org.apache.lucene.index.DirectoryReader; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexWriterConfig; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.index.VectorValues; +import org.apache.lucene.search.Sort; +import org.apache.lucene.search.SortField; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.LuceneTestCase.Monster; + +import java.io.IOException; +import java.io.InputStream; +import java.net.URL; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.FloatBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Path; +import java.nio.file.Paths; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + + +/** + * Tests a large dataset of kNN vectors to check for issues that only show up when + * segments are very large, like overflow. The dataset is based on the StackOverflow + * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector. + * + * Steps to run the test + * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin + * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/ + * 3. 
Start the test: + * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \ + * -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1 + */ +@TimeoutSuite(millis = 10_800_000) // 3 hour timeout +@Monster("takes ~2 hours and needs 2GB heap") +public class TestManyKnnVectors extends LuceneTestCase { + public void testLargeSegment() throws Exception { +IndexWriterConfig iwc = newIndexWriterConfig(); +if (random().nextBoolean()) { + iwc.setIndexSort(new Sort(new SortField("sortkey", SortField.Type.INT))); +} +String fieldName = "field"; +VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.DOT_PRODUCT; + +URL documentsPath = getClass().getClassLoader().getResource("documents.bin"); +assertNotNull(documentsPath); + +try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI())); + Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors")); Review Comment: Maybe by using `newFSDirectory` instead, we can remove the loop that reads the vectors from all the docs at the end? I would just nuke the loop thru all the docs myself, and keep the checks that e.g. vector field exists with the dimensions you expect. that's good to have in the test. CheckIndex will read all the vectors though, but more thoroughly and probably not cost the test really any more runtime either.
[GitHub] [lucene] rmuir commented on pull request #11867: Add monster test that indexes 1M vectors
rmuir commented on PR #11867: URL: https://github.com/apache/lucene/pull/11867#issuecomment-1286335299

With the current test I hit the exception on the 9.4 tag: BUILD FAILED in 2h 24m 45s, with a 2GB heap. I never saw any significant time (e.g. 0.1%) in GC or other JVM threads when inspecting the running test. The initial indexing takes about an hour and then the forceMerge takes an eternity (over an hour), but it works:

```
org.apache.lucene.document.TestManyKnnVectors > testLargeSegment FAILED
    java.lang.IllegalStateException: Vector data length 307200 not matching size=100 * dim=768 * byteSize=4 = -1222967296
        at __randomizedtesting.SeedInfo.seed([CF186B7BCEFCCF79:EBD7012A6CACC57]:0)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.validateFieldEntry(Lucene94HnswVectorsReader.java:185)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.readFields(Lucene94HnswVectorsReader.java:156)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.readMetadata(Lucene94HnswVectorsReader.java:103)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.<init>(Lucene94HnswVectorsReader.java:64)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsFormat.fieldsReader(Lucene94HnswVectorsFormat.java:157)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.<init>(PerFieldKnnVectorsFormat.java:219)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat.fieldsReader(PerFieldKnnVectorsFormat.java:81)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:157)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:91)
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179)
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:221)
        at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:536)
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138)
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:598)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:91)
        at org.apache.lucene.document.TestManyKnnVectors.testLargeSegment(TestManyKnnVectors.java:94)
```
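The negative product in the exception message looks like a 32-bit integer overflow. As a minimal sketch, assuming the segment holds 1,000,000 vectors (taken from the PR title, not from the log, whose size and length figures read differently in this archive) with 768 float dimensions each, the snippet below reproduces the logged value `-1222967296`. The class and method names are hypothetical and are not Lucene code.

```java
// Hypothetical illustration only; not the Lucene reader code.
final class VectorDataLengthOverflow {

  // Multiplies in 32-bit int arithmetic, so the result wraps past Integer.MAX_VALUE.
  public static int lengthAsInt(int size, int dim, int byteSize) {
    return size * dim * byteSize;
  }

  // Widens the first operand to long, so the whole product is computed in 64 bits.
  public static long lengthAsLong(int size, int dim, int byteSize) {
    return (long) size * dim * byteSize;
  }

  public static void main(String[] args) {
    int size = 1_000_000;       // assumption: 1M vectors, per the PR title
    int dim = 768;              // from the exception message
    int byteSize = Float.BYTES; // 4, from the exception message

    System.out.println(lengthAsInt(size, dim, byteSize));  // prints -1222967296
    System.out.println(lengthAsLong(size, dim, byteSize)); // prints 3072000000
  }
}
```

Widening to `long` before multiplying is one common way to avoid the wrap. `Math.multiplyExact` is an alternative that throws `ArithmeticException` instead of silently wrapping.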
[GitHub] [lucene] mdmarshmallow commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1286344459 Thanks Mike, I added an issue to `luceneutil`: https://github.com/mikemccand/luceneutil/issues/208
[GitHub] [lucene] mdmarshmallow opened a new issue, #11868: Add a FilterIndexOutput
mdmarshmallow opened a new issue, #11868: URL: https://github.com/apache/lucene/issues/11868 ### Description We have several subclasses of `IndexOutput` that wrap a delegate; the most recent was added in this PR: https://github.com/apache/lucene/pull/11796. It would be a good idea to add a `FilterIndexOutput`, similar to `FilterDirectory`, to make sure all these delegators get tested properly (suggested by @mikemccand here: https://github.com/apache/lucene/pull/11796/files#r1000886175).
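The delegation pattern the issue describes can be sketched without Lucene on the classpath. Everything below is hypothetical: `SimpleOutput` is a cut-down stand-in for `IndexOutput`, `FilterOutput` plays the role of the proposed `FilterIndexOutput`, and `ByteTrackingOutput` loosely mirrors the byte-counting wrapper from PR #11796. None of these names or signatures are the actual Lucene API.

```java
import java.io.ByteArrayOutputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for IndexOutput, reduced to what the sketch needs.
abstract class SimpleOutput implements AutoCloseable {
  public abstract void writeByte(byte b);
  public abstract void writeBytes(byte[] b, int offset, int length);
  public abstract long getFilePointer();
  @Override public abstract void close();
}

// Hypothetical filter base class: forwards every call to the wrapped delegate,
// so subclasses override only the methods they care about.
class FilterOutput extends SimpleOutput {
  protected final SimpleOutput delegate;

  public FilterOutput(SimpleOutput delegate) {
    this.delegate = delegate;
  }

  @Override public void writeByte(byte b) { delegate.writeByte(b); }
  @Override public void writeBytes(byte[] b, int offset, int length) {
    delegate.writeBytes(b, offset, length);
  }
  @Override public long getFilePointer() { return delegate.getFilePointer(); }
  @Override public void close() { delegate.close(); }
}

// Counts bytes on the way through; all other behavior is inherited from the filter.
class ByteTrackingOutput extends FilterOutput {
  private final AtomicLong bytesWritten;

  public ByteTrackingOutput(SimpleOutput delegate, AtomicLong bytesWritten) {
    super(delegate);
    this.bytesWritten = bytesWritten;
  }

  @Override public void writeByte(byte b) {
    super.writeByte(b);
    bytesWritten.incrementAndGet();
  }

  @Override public void writeBytes(byte[] b, int offset, int length) {
    super.writeBytes(b, offset, length);
    bytesWritten.addAndGet(length);
  }
}

// In-memory sink so the sketch runs without touching the file system.
class InMemoryOutput extends SimpleOutput {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

  @Override public void writeByte(byte b) { buffer.write(b); }
  @Override public void writeBytes(byte[] b, int offset, int length) {
    buffer.write(b, offset, length);
  }
  @Override public long getFilePointer() { return buffer.size(); }
  @Override public void close() { /* nothing to release for an in-memory buffer */ }
}
```

With a shared base class like this, a single test could check by reflection that the filter overrides every method of the abstract type, which appears to be the kind of coverage the issue is after.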
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r1001276262

## lucene/misc/src/java/org/apache/lucene/misc/store/ByteTrackingIndexOutput.java: ##
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.IndexOutput;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {

Review Comment:
I made an issue to track this here: https://github.com/apache/lucene/issues/11868
[GitHub] [lucene] MarcusSorealheis commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048
MarcusSorealheis commented on PR #874: URL: https://github.com/apache/lucene/pull/874#issuecomment-1286509849 Should we punish and exclude customers who cannot complete the requisite dimensionality-reduction steps, or allow them to explore with very expensive compute? Many popular large language models surpass the current threshold, for better or worse.
[GitHub] [lucene] JavaCoderCff closed pull request #271: LUCENE-9969:TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i…
JavaCoderCff closed pull request #271: LUCENE-9969:TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i… URL: https://github.com/apache/lucene/pull/271