[GitHub] [lucene] akhgeek30 opened a new issue, #11864: ArrayIndexOutOfBoundException

2022-10-20 Thread GitBox


akhgeek30 opened a new issue, #11864:
URL: https://github.com/apache/lucene/issues/11864

   ### Description
   
   Steps to reproduce
   1. Query = abc-ghi
   2. Create a synonym file as
   Synonym.txt = {
   abc,def
   ghi,jkl
   }
   
   3. Schema to be followed
   managed-schema
   
   Error :
   `java.lang.ArrayIndexOutOfBoundsException: 0
       at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:653)
       at org.apache.solr.parser.SolrQueryParserBase.newSynonymQuery(SolrQueryParserBase.java:617)
       at org.apache.lucene.util.QueryBuilder.analyzeGraphBoolean(QueryBuilder.java:533)
       at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:320)
       at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:240)
       at org.apache.solr.parser.SolrQueryParserBase.newFieldQuery(SolrQueryParserBase.java:524)
       at org.apache.solr.parser.QueryParser.newFieldQuery(QueryParser.java:62)
       at org.apache.solr.parser.SolrQueryParserBase.getFieldQuery(SolrQueryParserBase.java:1072)
       at org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:806)
       at org.apache.solr.parser.QueryParser.Term(QueryParser.java:421)
       at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:278)
       at org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)
       at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)
       at org.apache.solr.parser.QueryParser.Query(QueryParser.java:222)
       at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)
       at org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)
       at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)
       at org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)
       at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)
       at org.apache.solr.parser.QueryParser.Query(QueryParser.java:222)
       at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:131)
       at org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:260)
       at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:49)
       at org.apache.solr.search.QParser.getQuery(QParser.java:173)
       at org.apache.solr.search.ExtendedDismaxQParser.getBoostQueries(ExtendedDismaxQParser.java:566)
       at org.apache.solr.search.ExtendedDismaxQParser.parse(ExtendedDismaxQParser.java:187)
       at org.apache.solr.search.QParser.getQuery(QParser.java:173)
       at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:159)
       at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:272)
   `
   Found Issue in org/apache/lucene/util/QueryBuilder.java
   
   protected Query newSynonymQuery(Term terms[]) {
     // terms[0] is the expression highlighted by the reporter; it throws when terms is empty
     SynonymQuery.Builder builder = new SynonymQuery.Builder(terms[0].field());
     for (Term term : terms) {
       builder.addTerm(term);
     }
     return builder.build();
   }
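
The failing expression can be illustrated without Lucene on the classpath. This is only a minimal sketch, assuming the array that reaches `newSynonymQuery` is empty (which `ArrayIndexOutOfBoundsException: 0` suggests). `TermSketch` and `SynonymGuardSketch` are hypothetical stand-ins, not Lucene classes, and the guard shown is not necessarily the fix the project would adopt.

```java
import java.util.Optional;

/** Hypothetical stand-in for Lucene's Term: a field name plus a text value. */
final class TermSketch {
  final String field;
  final String text;

  TermSketch(String field, String text) {
    this.field = field;
    this.text = text;
  }
}

final class SynonymGuardSketch {
  /** Returns the field of the first term, or empty when there are no terms at all. */
  static Optional<String> fieldOf(TermSketch[] terms) {
    if (terms == null || terms.length == 0) {
      // Reading terms[0] here would throw ArrayIndexOutOfBoundsException.
      return Optional.empty();
    }
    return Optional.of(terms[0].field);
  }

  public static void main(String[] args) {
    System.out.println(fieldOf(new TermSketch[0]));
    System.out.println(fieldOf(new TermSketch[] {new TermSketch("title", "abc")}));
  }
}
```

A caller could treat the empty result as "no synonym query to build" instead of letting the array access fail.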
 
   
   ### Version and environment details
   
   Version > 8.0.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase opened a new pull request, #11865: Fix duplicate entry in CHANGES.txt

2022-10-20 Thread GitBox


iverase opened a new pull request, #11865:
URL: https://github.com/apache/lucene/pull/11865

   Seems to be a leftover from the last commit.





[GitHub] [lucene] iverase merged pull request #11865: Fix duplicate entry in CHANGES.txt

2022-10-20 Thread GitBox


iverase merged PR #11865:
URL: https://github.com/apache/lucene/pull/11865





[GitHub] [lucene] benwtrent commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-10-20 Thread GitBox


benwtrent commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1000640297


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java:
##
@@ -0,0 +1,505 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene95;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.index.*;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TotalHits;
+import org.apache.lucene.store.ChecksumIndexInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.HnswGraphSearcher;
+import org.apache.lucene.util.hnsw.NeighborQueue;
+import org.apache.lucene.util.packed.DirectMonotonicReader;
+import org.apache.lucene.util.packed.PackedInts;
+
+/**
+ * Reads vectors from the index segments along with index data structures supporting KNN search.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene95HnswVectorsReader extends KnnVectorsReader {
+
+  private final FieldInfos fieldInfos;
+  private final Map fields = new HashMap<>();
+  private final IndexInput vectorData;
+  private final IndexInput vectorIndex;
+
+  Lucene95HnswVectorsReader(SegmentReadState state) throws IOException {
+    this.fieldInfos = state.fieldInfos;
+    int versionMeta = readMetadata(state);
+    boolean success = false;
+    try {
+      vectorData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION,
+              Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME);
+      vectorIndex =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
+              Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME);
+      success = true;
+    } finally {
+      if (success == false) {
+        IOUtils.closeWhileHandlingException(this);
+      }
+    }
+  }
+
+  private int readMetadata(SegmentReadState state) throws IOException {
+    String metaFileName =
+        IndexFileNames.segmentFileName(
+            state.segmentInfo.name, state.segmentSuffix, Lucene95HnswVectorsFormat.META_EXTENSION);
+    int versionMeta = -1;
+    try (ChecksumIndexInput meta = state.directory.openChecksumInput(metaFileName, state.context)) {
+      Throwable priorE = null;
+      try {
+        versionMeta =
+            CodecUtil.checkIndexHeader(
+                meta,
+                Lucene95HnswVectorsFormat.META_CODEC_NAME,
+                Lucene95HnswVectorsFormat.VERSION_START,
+                Lucene95HnswVectorsFormat.VERSION_CURRENT,
+                state.segmentInfo.getId(),
+                state.segmentSuffix);
+        readFields(meta, state.fieldInfos);
+      } catch (Throwable exception) {
+        priorE = exception;
+      } finally {
+        CodecUtil.checkFooter(meta, priorE);
+      }
+    }
+    return versionMeta;
+  }
+
+  private static IndexInput openDataInput(
+      SegmentReadState state, int versionMeta, String fileExtension, String codecName)
+      throws IOException {
+    String fileName =
+        IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, fileExtension);
+    IndexInput in = state.directory.openInput(fileName, state.context);
+    boolean success = false;
+    try {
+      int versionVectorData =
+          CodecUtil.checkIndexHeader(
+              in,
+              codecName,
+              Lucene95HnswVectorsFormat.VERSION_START,
+              Lucene95HnswVectorsFormat.VERSION_CURRENT,
+              state.segmentInfo.getId(),
+              state.segmentSuffix);
+      if (versionMeta != versionVe

[GitHub] [lucene] mikemccand commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-10-20 Thread GitBox


mikemccand commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r1000886175


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,91 @@
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.IndexOutput;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {

Review Comment:
   Maybe open a follow-on issue to add a `FilterIndexOutput`?  These delegators 
are spooky when they are not properly tested...
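
The kind of delegator under discussion can be sketched with plain `java.io` types. This is a rough illustration only: `ByteCountingOutputStreamSketch` is a hypothetical name, it wraps an `OutputStream` rather than Lucene's `IndexOutput`, and it is not the code from the PR. It shows why such wrappers need tests: every write path has to be forwarded and counted exactly once.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical delegator that counts every byte passed to the wrapped stream. */
final class ByteCountingOutputStreamSketch extends FilterOutputStream {
  private final AtomicLong bytesWritten;

  ByteCountingOutputStreamSketch(OutputStream delegate, AtomicLong bytesWritten) {
    super(delegate);
    this.bytesWritten = bytesWritten;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    bytesWritten.incrementAndGet();
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    // Forward the bulk call directly; the inherited version would loop over
    // write(int) one byte at a time.
    out.write(b, off, len);
    bytesWritten.addAndGet(len);
  }

  /** Writes six bytes through the wrapper and returns {counted, bytes in sink}. */
  static long[] demo() {
    AtomicLong counter = new AtomicLong();
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (ByteCountingOutputStreamSketch stream =
        new ByteCountingOutputStreamSketch(sink, counter)) {
      stream.write(1);
      stream.write(new byte[] {2, 3, 4, 5, 6});
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new long[] {counter.get(), sink.size()};
  }

  public static void main(String[] args) {
    long[] result = demo();
    System.out.println(result[0] + " bytes counted, " + result[1] + " bytes in sink");
  }
}
```

A shared filter base class would let each subclass override only the methods it cares about, which is the motivation for a follow-on `FilterIndexOutput`.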






[GitHub] [lucene] mikemccand commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-10-20 Thread GitBox


mikemccand commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1285872091

   This looks great to me!  I love all the engagement (83+ comments!) and how 
it iterated to such a simple solution.  I left a small comment for a follow-on 
issue ... and it looks like `CHANGES.txt` is conflicting again
   
   @mdmarshmallow maybe open another follow-on issue in `luceneutil` to add 
this to nightly benchmarks?  It'd be great to see impact on WAF over time of 
interesting index-time changes...
   
   I'll push this in a few days if nobody objects.





[GitHub] [lucene] NightOwl888 opened a new issue, #11866: On many analyzers, the getDefaultStopSet() method returns a modifiable set, contrary to the docs

2022-10-20 Thread GitBox


NightOwl888 opened a new issue, #11866:
URL: https://github.com/apache/lucene/issues/11866

   ### Description
   
   Several of the analyzers state that they are supposed to return an 
unmodifiable `CharArraySet`, but the set that is returned is writable, as you 
can see in the source.
   
   
https://github.com/apache/lucene/blob/cc342ea7407c729a743123d8f7957aff6c6f9792/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchAnalyzer.java#L67-L92
   
   Note that the `Snowball` sets are also returned as writable.
   
   ### Version and environment details
   
   All versions, all environments





[GitHub] [lucene] rmuir commented on issue #11866: On many analyzers, the getDefaultStopSet() method returns a modifiable set, contrary to the docs

2022-10-20 Thread GitBox


rmuir commented on issue #11866:
URL: https://github.com/apache/lucene/issues/11866#issuecomment-1285890056

   The example is not correct. `WordlistLoader.getSnowballWordSet()` returns an 
unmodifiableSet.





[GitHub] [lucene] benwtrent commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-10-20 Thread GitBox


benwtrent commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1000928799



[GitHub] [lucene] NightOwl888 commented on issue #11866: On many analyzers, the getDefaultStopSet() method returns a modifiable set, contrary to the docs

2022-10-20 Thread GitBox


NightOwl888 commented on issue #11866:
URL: https://github.com/apache/lucene/issues/11866#issuecomment-1285916552

   I attempted to modify it, and the modification succeeded.
   
   ```
   SoraniAnalyzer.getDefaultStopSet().Add("foo33")
   
   // returns true
   
   ```
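
The difference between a writable and a read-only default set can be shown with `java.util` collections. This is a minimal sketch under stated assumptions: `StopSetSketch` and its methods are hypothetical, and `Collections.unmodifiableSet` stands in for Lucene's `CharArraySet.unmodifiableSet`. It does not reproduce any particular analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class StopSetSketch {
  // A default set handed out as-is: callers can change it for everyone.
  private static final Set<String> WRITABLE_DEFAULT =
      new HashSet<>(Arrays.asList("the", "a"));

  // The same words behind a read-only view: mutation attempts are rejected.
  private static final Set<String> READ_ONLY_DEFAULT =
      Collections.unmodifiableSet(new HashSet<>(Arrays.asList("the", "a")));

  static Set<String> writableStopSet() {
    return WRITABLE_DEFAULT;
  }

  static Set<String> readOnlyStopSet() {
    return READ_ONLY_DEFAULT;
  }

  /** Returns true if the word was added, false if the set refused the change. */
  static boolean tryAdd(Set<String> set, String word) {
    try {
      return set.add(word);
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("writable: " + tryAdd(writableStopSet(), "foo33"));
    System.out.println("read-only: " + tryAdd(readOnlyStopSet(), "foo33"));
  }
}
```

A test along these lines, run against each analyzer's `getDefaultStopSet()`, would show which sets actually honour the documented contract.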





[GitHub] [lucene] jtibshirani opened a new pull request, #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


jtibshirani opened a new pull request, #11867:
URL: https://github.com/apache/lucene/pull/11867

   This is a rough draft of a large-scale test for kNN vectors.
   
   It tests a large dataset of kNN vectors to check for issues that only show up when
   segments are very large, like overflow. The dataset is based on the StackOverflow
   track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
   I tried developing a test using random vectors, but HNSW can become quite slow
   and ineffective when the data doesn't have structure.

   Steps to run the test
   1. Download the dataset: `wget https://rally-tracks.elastic.co/so_vector/documents.bin`
   2. Move the dataset to the resources folder: `mv documents.bin lucene/core/src/resources/`
   3. Start the test: `./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1`
   
   Relates to #11863.
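
The kind of overflow such a test is meant to surface can be sketched in a few lines. This is only an illustration under assumed numbers: the vector count, the dimension of 768, and the class `VectorOffsetSketch` are hypothetical and not taken from the PR or from Lucene's code. It shows that a byte offset computed in `int` arithmetic wraps around once the segment is large enough, while the same product in `long` arithmetic does not.

```java
final class VectorOffsetSketch {
  /** Byte offset of a vector computed entirely in int arithmetic (can wrap around). */
  static int offsetAsInt(int ord, int dim) {
    return ord * dim * Float.BYTES;
  }

  /** The same offset with the product widened to long before multiplying. */
  static long offsetAsLong(int ord, int dim) {
    return (long) ord * dim * Float.BYTES;
  }

  public static void main(String[] args) {
    int ord = 1_000_000; // hypothetical vector ordinal
    int dim = 768; // hypothetical vector dimension
    System.out.println("int arithmetic:  " + offsetAsInt(ord, dim));
    System.out.println("long arithmetic: " + offsetAsLong(ord, dim));
  }
}
```

Small test datasets never reach the point where the two methods disagree, which is why a dedicated large-segment test is useful.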





[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001074649


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/

Review Comment:
   This tries to make a 3GB jar file as part of `:lucene:core:jar` task. For me 
it takes an eternity due to the zipping of the file into the jar. I dropped the 
file in `src/test` folder instead and the test is running with it.






[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001077394


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ *   3. Start the test:
+ *      ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *        -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();
+    if (random().nextBoolean()) {
+      iwc.setIndexSort(new Sort(new SortField("sortkey", SortField.Type.INT)));
+    }
+    String fieldName = "field";
+    VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.DOT_PRODUCT;
+
+    URL documentsPath = getClass().getClassLoader().getResource("documents.bin");
+    assertNotNull(documentsPath);
+
+    try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI()));
+        Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors"));

Review Comment:
   if we use `newFSDirectory()` instead, then we get a checkindex at the end 
too. It can give more confidence in tests like these (as well as confidence 
there is no overflow in checkindex itself).






[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001089104


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ *   3. Start the test:
+ * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *   -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();

Review Comment:
   We may want to specify the codec explicitly via 
`iwc.setCodec(TestUtil.getDefaultCodec())`. Otherwise, at least suppress 
SimpleText or any other codec that could be very slow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001090648


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ *   3. Start the test:
+ * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *   -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();

Review Comment:
   Also, consider not using a random IW config, and instead specify one 
that will run the test more efficiently, for example by configuring a large 
RAM buffer. It is a tradeoff that other monster tests make so that 
they are a little less monstrous, but still test the thing we want to test.






[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001183142


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ *   3. Start the test:
+ * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *   -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();

Review Comment:
   I'd be happy to propose some changes. Running the test was entirely too slow 
without this on my machine; I was gonna hit the test timeout :) So I did the 
following and restarted the test:
   * Removed the randomized `newIndexWriterConfig`, as we want performance and not 
lots of merging or anything, especially for this test!
   * Set a big RAM buffer (200MB)
   * Set the default codec (`TestUtil.getDefaultCodec`)
   * Removed the unrelated randomized index sort and NumericDocValues field
   
   ```
   IndexWriterConfig iwc = new IndexWriterConfig();
   iwc.setCodec(TestUtil.getDefaultCodec());
   iwc.setRAMBufferSizeMB(200);
   ```
   
   It seems to be chugging along faster on my slow 2018 2-core computer :)






[GitHub] [lucene] jtibshirani commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


jtibshirani commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001200789


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ *   3. Start the test:
+ * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *   -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();

Review Comment:
   These are good points, I'll push the suggested changes.
   
   I guess my computer is beefier: I completed runs in under 2 hours each and 
confirmed the test fails before the change and succeeds after.
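
For context, the test's own comment describes the failure it reproduces as an overflow: `1_000_000 * 768 * 4 > Integer.MAX_VALUE`. The following is a minimal stand-alone sketch of that arithmetic using only the JDK. The class and method names are hypothetical and are not taken from Lucene; it only illustrates how a byte count computed in `int` wraps around while the same count in `long` does not.

```java
final class VectorByteSizeOverflow {
  // Byte size computed with int arithmetic; silently wraps around on overflow.
  static int sizeAsInt(int numVectors, int dims) {
    return numVectors * dims * Float.BYTES;
  }

  // Same computation, widened to long before the first multiplication.
  static long sizeAsLong(int numVectors, int dims) {
    return (long) numVectors * dims * Float.BYTES;
  }

  public static void main(String[] args) {
    int numVectors = 1_000_000; // values taken from the test under review
    int dims = 768;
    System.out.println("int:  " + sizeAsInt(numVectors, dims));
    System.out.println("long: " + sizeAsLong(numVectors, dims));
    System.out.println("exceeds Integer.MAX_VALUE: "
        + (sizeAsLong(numVectors, dims) > Integer.MAX_VALUE));
  }
}
```

With these values the `long` result is 3,072,000,000 bytes, which does not fit in a signed 32-bit `int`, so the `int` version comes out negative.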






[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001206144


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ *   3. Start the test:
+ * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *   -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();

Review Comment:
   It's probably beefier :) But I think maybe you also got luckier with the 
`newIndexWriterConfig`; my computer was just merging and merging and merging. 
Now with the changes it actually spends its time indexing (albeit maxing out 
just one CPU core, all bottlenecked on `dotProduct()`). Of course, the test 
could be modified to use multiple cores as a next step.
   
   With the changes, I actually see your messages such as `1> Indexed 780000 
vectors out of 1000000` rather than being flooded with constant merging. And 
that's 780k out of 1M after only 40 minutes on my machine, so it may complete in 
under an hour for me already.
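
The "under an hour" estimate is a simple proportion. As a rough sketch, assuming the indexing rate stays constant and ignoring the final `forceMerge` (both assumptions, not measurements), 780k of 1M vectors in 40 minutes extrapolates to a little over 51 minutes for the indexing phase. The class and method names below are hypothetical.

```java
final class IndexingEta {
  // Linear extrapolation: total time = elapsed time * (total items / items done).
  static double estimateTotalMinutes(long done, long total, double elapsedMinutes) {
    if (done <= 0) {
      throw new IllegalArgumentException("done must be positive");
    }
    return elapsedMinutes * ((double) total / done);
  }

  public static void main(String[] args) {
    // Figures quoted in the comment above: 780k of 1M vectors after 40 minutes.
    double totalMinutes = estimateTotalMinutes(780_000, 1_000_000, 40.0);
    System.out.printf("estimated indexing time: %.1f minutes%n", totalMinutes);
  }
}
```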






[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001212579


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -61,11 +61,13 @@
 @Monster("takes ~2 hours and needs 2GB heap")
 public class TestManyKnnVectors extends LuceneTestCase {
   public void testLargeSegment() throws Exception {
-    // Make sure to use the default codec instead of a random one
-    IndexWriterConfig iwc = newIndexWriterConfig().setCodec(TestUtil.getDefaultCodec());
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setCodec(TestUtil.getDefaultCodec()); // Make sure to use the default codec instead of a random one
+    iwc.setRAMBufferSizeMB(3_000); // Use a 3GB buffer to create a single large segment

Review Comment:
   Maybe use a smaller value here... otherwise we need to change the docs of the 
`@Monster` annotation and your comments about configuring test heap sizes. And 
I think it's good to keep monster tests less monstrous.
   
   I'm using a 200MB buffer, still with your suggested 2GB heap. It seems to 
flush about 600-700k docs in each segment. There are no merges happening until 
the test asks for one with a `forceMerge(1)`, which is running now. Compared to 
the 3GB buffer, yeah, I've gotta suffer the forceMerge in the end, and I'm not 
sure how long that's gonna take, but it exercises the merge code as well as the 
flush code, and keeps the heap memory usage lower, for less monstrosity in the 
test. It may be the right tradeoff.
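
The mismatch is easy to check mechanically. The sketch below is a hypothetical helper (not part of Lucene or its test framework) that compares a RAM buffer size in MB against a heap limit in bytes. It uses only `Runtime.getRuntime().maxMemory()` from the JDK to show the current JVM's limit.

```java
final class RamBufferCheck {
  // True if a buffer of the given size (in MB) is strictly smaller than the heap limit.
  static boolean fitsInHeap(double ramBufferMB, long maxHeapBytes) {
    long bufferBytes = (long) (ramBufferMB * 1024 * 1024);
    return bufferBytes < maxHeapBytes;
  }

  public static void main(String[] args) {
    long maxHeap = Runtime.getRuntime().maxMemory(); // reflects -Xmx when it is set
    System.out.println("max heap bytes: " + maxHeap);
    System.out.println("200MB buffer fits:  " + fitsInHeap(200, maxHeap));
    System.out.println("3000MB buffer fits: " + fitsInHeap(3_000, maxHeap));
  }
}
```

With the 2GB heap suggested in the test's documentation, a 200MB buffer passes this check and a 3,000MB buffer does not. Fitting under the limit is a necessary condition only; the test needs heap for more than the indexing buffer.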






[GitHub] [lucene] jtibshirani commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


jtibshirani commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001214017


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -61,11 +61,13 @@
 @Monster("takes ~2 hours and needs 2GB heap")
 public class TestManyKnnVectors extends LuceneTestCase {
   public void testLargeSegment() throws Exception {
-    // Make sure to use the default codec instead of a random one
-    IndexWriterConfig iwc = newIndexWriterConfig().setCodec(TestUtil.getDefaultCodec());
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setCodec(TestUtil.getDefaultCodec()); // Make sure to use the default codec instead of a random one
+    iwc.setRAMBufferSizeMB(3_000); // Use a 3GB buffer to create a single large segment

Review Comment:
   D'oh, yes this shouldn't be bigger than the suggested heap size...






[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001217406


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+import org.apache.lucene.tests.util.TestUtil;
+
+import java.io.IOException;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ *   3. Start the test:
+ * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *   -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setCodec(TestUtil.getDefaultCodec()); // Make sure to use the default codec instead of a random one
+    iwc.setRAMBufferSizeMB(200); // Use a 200MB buffer to create larger initial segments
+
+    String fieldName = "field";
+    VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.DOT_PRODUCT;
+
+    URL documentsPath = getClass().getClassLoader().getResource("documents.bin");
+    assertNotNull(documentsPath);
+
+    try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI()));
+         Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors"));
+         IndexWriter iw = new IndexWriter(dir, iwc)) {
+
+      // This data is enough to trigger the overflow bug in issue #11858,
+      // since 1_000_000 * 768 * 4 > Integer.MAX_VALUE
+      int numVectors = 1_000_000;
+      int dims = 768;
+
+      VectorReader vectorReader = new VectorReader(input, dims);
+      for (int i = 0; i < numVectors; i++) {
+        float[] vector = vectorReader.next();
+        Document doc = new Document();
+        doc.add(new KnnVectorField(fieldName, vector, similarityFunction));
+        iw.addDocument(doc);
+        if (VERBOSE && i % 10_000 == 0) {
+          System.out.println("Indexed " + i + " vectors out of " + numVectors);
+        }
+      }

Review Comment:
   We can improve the output for this long-running test; as it stands, I had to 
fill in the gaps with `jstack`.
   I would also consider changing the loop to be `for (int i = 1; i <= 
numVectors; i++)`. Then the print will say "Indexed 1000000 vectors out of 
1000000 vectors" at the very end, so that you know indexing is complete. This 
does not happen today.
   
   Maybe also here before the `forceMerge`:
   ```
   if (VERBOSE) {
     System.out.println("forceMerge()ing to one segment...");
   }
   ```
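
A minimal sketch of why the loop bounds matter for the progress message, assuming the same `i % interval == 0` condition as in the test. The class and method names are hypothetical, and the vector count is shrunk to 50,000 to keep the example quick. It collects the counts that would be printed for a zero-based loop and for a one-based loop.

```java
import java.util.ArrayList;
import java.util.List;

final class ProgressLogging {
  // Returns the values of i for which a progress line would be printed.
  static List<Integer> printedCounts(int numVectors, int interval, boolean oneBased) {
    List<Integer> printed = new ArrayList<>();
    int first = oneBased ? 1 : 0;
    int last = oneBased ? numVectors : numVectors - 1;
    for (int i = first; i <= last; i++) {
      if (i % interval == 0) {
        printed.add(i);
      }
    }
    return printed;
  }

  public static void main(String[] args) {
    int numVectors = 50_000; // scaled down from the test's 1_000_000
    int interval = 10_000;
    System.out.println("zero-based: " + printedCounts(numVectors, interval, false));
    System.out.println("one-based:  " + printedCounts(numVectors, interval, true));
  }
}
```

The zero-based loop starts its output at 0 and never reaches `numVectors`, while the one-based loop ends exactly on `numVectors`, which is what makes the final line a completion signal.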




[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001224833


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/

Review Comment:
   For this one, I'd just suggest changing the code comment to say `mv 
documents.bin lucene/core/src/test/`. It makes for a faster experience.






[GitHub] [lucene] rmuir commented on a diff in pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on code in PR #11867:
URL: https://github.com/apache/lucene/pull/11867#discussion_r1001226287


##
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.Monster;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URL;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+
+/**
+ * Tests a large dataset of kNN vectors to check for issues that only show up when
+ * segments are very large, like overflow. The dataset is based on the StackOverflow
+ * track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
+ *   3. Start the test:
+ * ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true \
+ *   -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
+ */
+@TimeoutSuite(millis = 10_800_000) // 3 hour timeout
+@Monster("takes ~2 hours and needs 2GB heap")
+public class TestManyKnnVectors extends LuceneTestCase {
+  public void testLargeSegment() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();
+    if (random().nextBoolean()) {
+      iwc.setIndexSort(new Sort(new SortField("sortkey", SortField.Type.INT)));
+    }
+    String fieldName = "field";
+    VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.DOT_PRODUCT;
+
+    URL documentsPath = getClass().getClassLoader().getResource("documents.bin");
+    assertNotNull(documentsPath);
+
+    try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI()));
+        Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors"));

Review Comment:
   Maybe by using `newFSDirectory` instead, we can remove the loop at the end
   that reads the vectors from all the docs? I would just nuke the loop through
   all the docs myself and keep the checks that, e.g., the vector field exists
   with the dimensions you expect. Those are good to have in the test.
   
   CheckIndex will still read all the vectors, but more thoroughly, and it
   probably won't cost the test any more runtime either.



[GitHub] [lucene] rmuir commented on pull request #11867: Add monster test that indexes 1M vectors

2022-10-20 Thread GitBox


rmuir commented on PR #11867:
URL: https://github.com/apache/lucene/pull/11867#issuecomment-1286335299

   With the current test I hit the exception on the 9.4 tag: BUILD FAILED in
   2h 24m 45s, with a 2GB heap. I never saw any significant time (e.g. 0.1%) in
   GC or other JVM threads when inspecting the running test. The initial
   indexing takes about an hour and then the forcemerge takes an eternity (over
   an hour), but it works:
   ```
   org.apache.lucene.document.TestManyKnnVectors > testLargeSegment FAILED
   java.lang.IllegalStateException: Vector data length 3072000000 not matching size=1000000 * dim=768 * byteSize=4 = -1222967296
   at __randomizedtesting.SeedInfo.seed([CF186B7BCEFCCF79:EBD7012A6CACC57]:0)
   at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.validateFieldEntry(Lucene94HnswVectorsReader.java:185)
   at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.readFields(Lucene94HnswVectorsReader.java:156)
   at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.readMetadata(Lucene94HnswVectorsReader.java:103)
   at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.<init>(Lucene94HnswVectorsReader.java:64)
   at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsFormat.fieldsReader(Lucene94HnswVectorsFormat.java:157)
   at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.<init>(PerFieldKnnVectorsFormat.java:219)
   at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat.fieldsReader(PerFieldKnnVectorsFormat.java:81)
   at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:157)
   at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:91)
   at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179)
   at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:221)
   at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:536)
   at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138)
   at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:598)
   at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
   at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:91)
   at org.apache.lucene.document.TestManyKnnVectors.testLargeSegment(TestManyKnnVectors.java:94)
   ```
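
The negative expected length in the exception message is what 32-bit integer overflow would produce. A minimal sketch, assuming the 1M vectors named in the PR title together with `dim=768` and `byteSize=4` from the message; the class and method names are hypothetical and this is not the code in `Lucene94HnswVectorsReader`:

```java
public class VectorDataLengthOverflow {
    // Multiplies in 32-bit int arithmetic, which silently wraps on overflow.
    static int lengthAsInt(int size, int dim, int byteSize) {
        return size * dim * byteSize;
    }

    // Widens to long before multiplying, so the product is exact.
    static long lengthAsLong(int size, int dim, int byteSize) {
        return (long) size * dim * byteSize;
    }

    public static void main(String[] args) {
        int size = 1_000_000; // assumed: number of vectors, from the PR title
        int dim = 768;        // dimensions per vector, from the exception message
        int byteSize = 4;     // bytes per float, from the exception message
        System.out.println(lengthAsInt(size, dim, byteSize));  // -1222967296
        System.out.println(lengthAsLong(size, dim, byteSize)); // 3072000000
    }
}
```

The exact product, 3,072,000,000, exceeds `Integer.MAX_VALUE` (2,147,483,647), so the `int` version wraps around to -1,222,967,296. Widening one operand to `long` before multiplying avoids the wrap; `Math.multiplyExact` is an alternative when an `ArithmeticException` on overflow is preferable to a silent wrap.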
   


[GitHub] [lucene] mdmarshmallow commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-10-20 Thread GitBox


mdmarshmallow commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1286344459

   Thanks Mike, I added an issue to `luceneutil`: 
https://github.com/mikemccand/luceneutil/issues/208


[GitHub] [lucene] mdmarshmallow opened a new issue, #11868: Add a FilterIndexOutput

2022-10-20 Thread GitBox


mdmarshmallow opened a new issue, #11868:
URL: https://github.com/apache/lucene/issues/11868

   ### Description
   
   We have several subclasses of `IndexOutput` that wrap a delegate; the most
   recent one was added in this PR: https://github.com/apache/lucene/pull/11796.
   It would be a good idea to add a `FilterIndexOutput`, similar to
   `FilterDirectory`, so that all of these delegators get tested properly
   (suggested by @mikemccand here:
   https://github.com/apache/lucene/pull/11796/files#r1000886175).
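
For readers unfamiliar with the filter (delegating) pattern the issue refers to, here is a rough sketch using the JDK's `java.io.FilterOutputStream` as a stand-in. It is not Lucene's `IndexOutput` API, and `ByteCountingOutputStream` is a hypothetical name chosen for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicLong;

public class ByteCountingOutputStream extends FilterOutputStream {
    // Running total of bytes forwarded to the delegate.
    private final AtomicLong bytesWritten = new AtomicLong();

    public ByteCountingOutputStream(OutputStream delegate) {
        super(delegate);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten.incrementAndGet();
    }

    // Overridden so bulk writes go straight to the delegate and are counted
    // once, instead of falling back to the byte-at-a-time default.
    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        bytesWritten.addAndGet(len);
    }

    public long getBytesWritten() {
        return bytesWritten.get();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ByteCountingOutputStream counting = new ByteCountingOutputStream(sink)) {
            counting.write(new byte[] {1, 2, 3});
            counting.write(4);
            System.out.println(counting.getBytesWritten()); // 4
        }
    }
}
```

A shared filter base class like this gives every wrapper the same forwarding behavior by default, which is the property that makes it practical to test all delegators in one place.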


[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-10-20 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r1001276262


##
lucene/misc/src/java/org/apache/lucene/misc/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.lucene.store.IndexOutput;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {

Review Comment:
   I made an issue to track this here: 
https://github.com/apache/lucene/issues/11868



[GitHub] [lucene] MarcusSorealheis commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

2022-10-20 Thread GitBox


MarcusSorealheis commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1286509849

   Should we punish and exclude customers who cannot complete the requisite
   steps of dimensionality reduction, or allow them to explore with very
   expensive compute? Many popular large language models surpass the current
   threshold, for better or worse.


[GitHub] [lucene] JavaCoderCff closed pull request #271: LUCENE-9969:TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i…

2022-10-20 Thread GitBox


JavaCoderCff closed pull request #271: LUCENE-9969:TaxoArrays, a member 
variable of the DirectoryTaxonomyReader class, i…
URL: https://github.com/apache/lucene/pull/271

