[GitHub] [lucene] jpountz commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-18 Thread GitBox


jpountz commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1025393090


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java:
##
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene95;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.4 vector format, which encodes numeric vector values and an 
optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW 
search. The format
+ * consists of three files:
+ *
+ * .vec (vector data) file
+ *
+ * For each field:
+ *
+ * 
+ *   Vector data ordered by field, document ordinal, and vector dimension. 
When the
+ *   vectorEncoding is BYTE, each sample is stored as a single byte. When 
it is FLOAT32, each
+ *   sample is stored as an IEEE float in little-endian byte order.
+ *   DocIds encoded by {@link IndexedDISI#writeBitSet(DocIdSetIterator, 
IndexOutput, byte)},
+ *   note that this is only written in the sparse case
+ *   OrdToDoc encoded by {@link 
org.apache.lucene.util.packed.DirectMonotonicWriter}, note
+ *   that this is also only written in the sparse case
+ * 
+ *
+ * .vex (vector index)
+ *
+ * Stores graphs connecting the documents for each field, organized as a 
list of nodes' neighbours,
+ * as follows:
+ *
+ * 
+ *   For each level:
+ *   
+ * For each node:
+ * 
+ *   [int32] the number of neighbor nodes
+ *   array[int32] the neighbor ordinals
+ *   array[int32] padding if the number of the node's 
neighbors is less than
+ *   the maximum number of connections allowed on this level. 
Padding is equal to
+ *   ((maxConnOnLevel – the number of neighbours) * 4) bytes.
+ * 
+ *   
+ * 
+ *
+ * .vem (vector metadata) file
+ *
+ * For each field:
+ *
+ * 
+ *   [int32] field number
+ *   [int32] vector similarity function ordinal
+ *   [vlong] offset to this field's vectors in the .vec file
+ *   [vlong] length of this field's vectors, in bytes
+ *   [vlong] offset to this field's index in the .vex file
+ *   [vlong] length of this field's index data, in bytes
+ *   [int] dimension of this field's vectors
+ *   [int] the number of documents having values for this field
+ *   [int8] if equal to -1, dense – all documents have values for 
a field. If equal to
+ *   0, sparse – some documents are missing values.
+ *   DocIds encoded by {@link 
IndexedDISI#writeBitSet(DocIdSetIterator, IndexOutput, byte)}
+ *   OrdToDoc encoded by {@link 
org.apache.lucene.util.packed.DirectMonotonicWriter}, note
+ *   that this is only written in the sparse case
+ *   [int] the maximum number of connections (neighbours) that each 
node can have
+ *   [int] number of levels in the graph
+ *   Graph nodes by level. For each level
+ *   
+ * [int] the number of nodes on this level
+ * array[int] for levels greater than 0, the list of nodes on 
this level, stored as
+ * their level 0 ordinals.
+ *   
+ * 
+ *
+ * @lucene.experimental
+ */
+public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
+
+  static final String META_CODEC_NAME = "lucene95HnswVectorsFormatMeta";

Review Comment:
   nit: we generally use titlecase for codec names
   ```suggestion
 static final String META_CODEC_NAME = "Lucene95HnswVectorsFormatMeta";
   ```
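
   (Editorial aside: to make the format description above concrete, here is a hedged sketch of how a field could opt into an HNSW vectors format through Lucene's per-field codec hook. The `Lucene95Codec` name and the `(maxConn, beamWidth)` constructor arguments are assumptions based on the pattern earlier Lucene 9.x codecs follow, not something stated in this PR.)
   ```java
   // Sketch only: per-field selection of an HNSW vectors format, following the
   // existing Lucene 9.x per-field codec pattern. Names and arguments are assumed.
   import org.apache.lucene.codecs.KnnVectorsFormat;
   import org.apache.lucene.codecs.lucene95.Lucene95Codec;
   import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
   import org.apache.lucene.index.IndexWriterConfig;

   public class HnswCodecSketch {
     public static IndexWriterConfig newConfig() {
       IndexWriterConfig iwc = new IndexWriterConfig();
       iwc.setCodec(
           new Lucene95Codec() {
             @Override
             public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
               // maxConn = 16 neighbours per node, beamWidth = 100 while building the graph
               return new Lucene95HnswVectorsFormat(16, 100);
             }
           });
       return iwc;
     }
   }
   ```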



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---

[GitHub] [lucene] jpountz commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-18 Thread GitBox


jpountz commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1026184428


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java:
##
@@ -0,0 +1,497 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene95;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.index.*;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TotalHits;
+import org.apache.lucene.store.ChecksumIndexInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.HnswGraphSearcher;
+import org.apache.lucene.util.hnsw.NeighborQueue;
+import org.apache.lucene.util.packed.DirectMonotonicReader;
+
+/**
+ * Reads vectors from the index segments along with index data structures 
supporting KNN search.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene95HnswVectorsReader extends KnnVectorsReader {
+
+  private final FieldInfos fieldInfos;
+  private final Map fields = new HashMap<>();
+  private final IndexInput vectorData;
+  private final IndexInput vectorIndex;
+
+  Lucene95HnswVectorsReader(SegmentReadState state) throws IOException {
+this.fieldInfos = state.fieldInfos;
+int versionMeta = readMetadata(state);
+boolean success = false;
+try {
+  vectorData =
+  openDataInput(
+  state,
+  versionMeta,
+  Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION,
+  Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME);
+  vectorIndex =
+  openDataInput(
+  state,
+  versionMeta,
+  Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
+  Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME);
+  success = true;
+} finally {
+  if (success == false) {
+IOUtils.closeWhileHandlingException(this);
+  }
+}
+  }
+
+  private int readMetadata(SegmentReadState state) throws IOException {
+String metaFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene95HnswVectorsFormat.META_EXTENSION);
+int versionMeta = -1;
+try (ChecksumIndexInput meta = 
state.directory.openChecksumInput(metaFileName, state.context)) {
+  Throwable priorE = null;
+  try {
+versionMeta =
+CodecUtil.checkIndexHeader(
+meta,
+Lucene95HnswVectorsFormat.META_CODEC_NAME,
+Lucene95HnswVectorsFormat.VERSION_START,
+Lucene95HnswVectorsFormat.VERSION_CURRENT,
+state.segmentInfo.getId(),
+state.segmentSuffix);
+readFields(meta, state.fieldInfos);
+  } catch (Throwable exception) {
+priorE = exception;
+  } finally {
+CodecUtil.checkFooter(meta, priorE);
+  }
+}
+return versionMeta;
+  }
+
+  private static IndexInput openDataInput(
+  SegmentReadState state, int versionMeta, String fileExtension, String 
codecName)
+  throws IOException {
+String fileName =
+IndexFileNames.segmentFileName(state.segmentInfo.name, 
state.segmentSuffix, fileExtension);
+IndexInput in = state.directory.openInput(fileName, state.context);
+boolean success = false;
+try {
+  int versionVectorData =
+  CodecUtil.checkIndexHeader(
+  in,
+  codecName,
+  Lucene95HnswVectorsFormat.VERSION_START,
+  Lucene95HnswVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  if (versionMeta != versionVec

[GitHub] [lucene] rmuir commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-18 Thread GitBox


rmuir commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1026351356


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java:
##
@@ -0,0 +1,497 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene95;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.index.*;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TotalHits;
+import org.apache.lucene.store.ChecksumIndexInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.HnswGraphSearcher;
+import org.apache.lucene.util.hnsw.NeighborQueue;
+import org.apache.lucene.util.packed.DirectMonotonicReader;
+
+/**
+ * Reads vectors from the index segments along with index data structures 
supporting KNN search.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene95HnswVectorsReader extends KnnVectorsReader {
+
+  private final FieldInfos fieldInfos;
+  private final Map fields = new HashMap<>();
+  private final IndexInput vectorData;
+  private final IndexInput vectorIndex;
+
+  Lucene95HnswVectorsReader(SegmentReadState state) throws IOException {
+this.fieldInfos = state.fieldInfos;
+int versionMeta = readMetadata(state);
+boolean success = false;
+try {
+  vectorData =
+  openDataInput(
+  state,
+  versionMeta,
+  Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION,
+  Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME);
+  vectorIndex =
+  openDataInput(
+  state,
+  versionMeta,
+  Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
+  Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME);
+  success = true;
+} finally {
+  if (success == false) {
+IOUtils.closeWhileHandlingException(this);
+  }
+}
+  }
+
+  private int readMetadata(SegmentReadState state) throws IOException {
+String metaFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene95HnswVectorsFormat.META_EXTENSION);
+int versionMeta = -1;
+try (ChecksumIndexInput meta = 
state.directory.openChecksumInput(metaFileName, state.context)) {
+  Throwable priorE = null;
+  try {
+versionMeta =
+CodecUtil.checkIndexHeader(
+meta,
+Lucene95HnswVectorsFormat.META_CODEC_NAME,
+Lucene95HnswVectorsFormat.VERSION_START,
+Lucene95HnswVectorsFormat.VERSION_CURRENT,
+state.segmentInfo.getId(),
+state.segmentSuffix);
+readFields(meta, state.fieldInfos);
+  } catch (Throwable exception) {
+priorE = exception;
+  } finally {
+CodecUtil.checkFooter(meta, priorE);
+  }
+}
+return versionMeta;
+  }
+
+  private static IndexInput openDataInput(
+  SegmentReadState state, int versionMeta, String fileExtension, String 
codecName)
+  throws IOException {
+String fileName =
+IndexFileNames.segmentFileName(state.segmentInfo.name, 
state.segmentSuffix, fileExtension);
+IndexInput in = state.directory.openInput(fileName, state.context);
+boolean success = false;
+try {
+  int versionVectorData =
+  CodecUtil.checkIndexHeader(
+  in,
+  codecName,
+  Lucene95HnswVectorsFormat.VERSION_START,
+  Lucene95HnswVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  if (versionMeta != versionVecto

[GitHub] [lucene] jpountz commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-18 Thread GitBox


jpountz commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1026383814


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java:
##
@@ -0,0 +1,497 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene95;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.index.*;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TotalHits;
+import org.apache.lucene.store.ChecksumIndexInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.HnswGraphSearcher;
+import org.apache.lucene.util.hnsw.NeighborQueue;
+import org.apache.lucene.util.packed.DirectMonotonicReader;
+
+/**
+ * Reads vectors from the index segments along with index data structures 
supporting KNN search.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene95HnswVectorsReader extends KnnVectorsReader {
+
+  private final FieldInfos fieldInfos;
+  private final Map fields = new HashMap<>();
+  private final IndexInput vectorData;
+  private final IndexInput vectorIndex;
+
+  Lucene95HnswVectorsReader(SegmentReadState state) throws IOException {
+this.fieldInfos = state.fieldInfos;
+int versionMeta = readMetadata(state);
+boolean success = false;
+try {
+  vectorData =
+  openDataInput(
+  state,
+  versionMeta,
+  Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION,
+  Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME);
+  vectorIndex =
+  openDataInput(
+  state,
+  versionMeta,
+  Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
+  Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME);
+  success = true;
+} finally {
+  if (success == false) {
+IOUtils.closeWhileHandlingException(this);
+  }
+}
+  }
+
+  private int readMetadata(SegmentReadState state) throws IOException {
+String metaFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene95HnswVectorsFormat.META_EXTENSION);
+int versionMeta = -1;
+try (ChecksumIndexInput meta = 
state.directory.openChecksumInput(metaFileName, state.context)) {
+  Throwable priorE = null;
+  try {
+versionMeta =
+CodecUtil.checkIndexHeader(
+meta,
+Lucene95HnswVectorsFormat.META_CODEC_NAME,
+Lucene95HnswVectorsFormat.VERSION_START,
+Lucene95HnswVectorsFormat.VERSION_CURRENT,
+state.segmentInfo.getId(),
+state.segmentSuffix);
+readFields(meta, state.fieldInfos);
+  } catch (Throwable exception) {
+priorE = exception;
+  } finally {
+CodecUtil.checkFooter(meta, priorE);
+  }
+}
+return versionMeta;
+  }
+
+  private static IndexInput openDataInput(
+  SegmentReadState state, int versionMeta, String fileExtension, String 
codecName)
+  throws IOException {
+String fileName =
+IndexFileNames.segmentFileName(state.segmentInfo.name, 
state.segmentSuffix, fileExtension);
+IndexInput in = state.directory.openInput(fileName, state.context);
+boolean success = false;
+try {
+  int versionVectorData =
+  CodecUtil.checkIndexHeader(
+  in,
+  codecName,
+  Lucene95HnswVectorsFormat.VERSION_START,
+  Lucene95HnswVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  if (versionMeta != versionVec

[GitHub] [lucene] rmuir commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-18 Thread GitBox


rmuir commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1026406375


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java:
##
@@ -0,0 +1,497 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene95;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.index.*;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TotalHits;
+import org.apache.lucene.store.ChecksumIndexInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.HnswGraphSearcher;
+import org.apache.lucene.util.hnsw.NeighborQueue;
+import org.apache.lucene.util.packed.DirectMonotonicReader;
+
+/**
+ * Reads vectors from the index segments along with index data structures 
supporting KNN search.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene95HnswVectorsReader extends KnnVectorsReader {
+
+  private final FieldInfos fieldInfos;
+  private final Map fields = new HashMap<>();
+  private final IndexInput vectorData;
+  private final IndexInput vectorIndex;
+
+  Lucene95HnswVectorsReader(SegmentReadState state) throws IOException {
+this.fieldInfos = state.fieldInfos;
+int versionMeta = readMetadata(state);
+boolean success = false;
+try {
+  vectorData =
+  openDataInput(
+  state,
+  versionMeta,
+  Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION,
+  Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME);
+  vectorIndex =
+  openDataInput(
+  state,
+  versionMeta,
+  Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
+  Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME);
+  success = true;
+} finally {
+  if (success == false) {
+IOUtils.closeWhileHandlingException(this);
+  }
+}
+  }
+
+  private int readMetadata(SegmentReadState state) throws IOException {
+String metaFileName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene95HnswVectorsFormat.META_EXTENSION);
+int versionMeta = -1;
+try (ChecksumIndexInput meta = 
state.directory.openChecksumInput(metaFileName, state.context)) {
+  Throwable priorE = null;
+  try {
+versionMeta =
+CodecUtil.checkIndexHeader(
+meta,
+Lucene95HnswVectorsFormat.META_CODEC_NAME,
+Lucene95HnswVectorsFormat.VERSION_START,
+Lucene95HnswVectorsFormat.VERSION_CURRENT,
+state.segmentInfo.getId(),
+state.segmentSuffix);
+readFields(meta, state.fieldInfos);
+  } catch (Throwable exception) {
+priorE = exception;
+  } finally {
+CodecUtil.checkFooter(meta, priorE);
+  }
+}
+return versionMeta;
+  }
+
+  private static IndexInput openDataInput(
+  SegmentReadState state, int versionMeta, String fileExtension, String 
codecName)
+  throws IOException {
+String fileName =
+IndexFileNames.segmentFileName(state.segmentInfo.name, 
state.segmentSuffix, fileExtension);
+IndexInput in = state.directory.openInput(fileName, state.context);
+boolean success = false;
+try {
+  int versionVectorData =
+  CodecUtil.checkIndexHeader(
+  in,
+  codecName,
+  Lucene95HnswVectorsFormat.VERSION_START,
+  Lucene95HnswVectorsFormat.VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  if (versionMeta != versionVecto

[GitHub] [lucene] dweiss commented on pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


dweiss commented on PR #11947:
URL: https://github.com/apache/lucene/pull/11947#issuecomment-1320064830

   I initially wanted to use Python but without additional libraries it 
overwhelmed me and I wanted to keep it simple and self-contained. There is some 
extra verbosity (XML processing in Java is hell) but it does work and I staged 
artifacts for Lucene 10.0.0 successfully (and dropped the repository 
afterwards).
   
   Now... I'm not so sure how to test the wizard bit. I based the changes on 
@HoustonPutman 's work on Solr side. The script can accept user name and 
password in environment variables, on command line or will prompt for password 
if not provided... but I haven't tested this as I'm not sure how to test it in 
isolation (@janhoy - is there any way to do it?).
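
   For reference, the environment-variable form would presumably look something like the line below (a sketch: the ASF_USERNAME/ASF_PASSWORD variable names come from the script's option handling quoted later in this thread, and the artifact path is just the example path used elsewhere in the thread):
   ```
   ASF_USERNAME=dweiss ASF_PASSWORD=... java dev-tools/scripts/StageArtifacts.java /release/candidate/maven-artifacts
   ```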
   





[GitHub] [lucene] dweiss commented on issue #11329: Add an equivalent of ant's stage-maven-artifacts for the release wizard [LUCENE-10293]

2022-11-18 Thread GitBox


dweiss commented on issue #11329:
URL: https://github.com/apache/lucene/issues/11329#issuecomment-1320066031

   Patch implemented in #11947 





[GitHub] [lucene] dweiss commented on pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


dweiss commented on PR #11947:
URL: https://github.com/apache/lucene/pull/11947#issuecomment-1320072940

   For stand-alone use, it's simple enough too:
   ```
   java dev-tools\scripts\StageArtifacts.java -u dweiss 
/release/candidate/maven-artifacts
   ```
   will prompt for password for nexus access.





[GitHub] [lucene] jpountz commented on pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


jpountz commented on PR #11947:
URL: https://github.com/apache/lucene/pull/11947#issuecomment-1320074546

   Thanks @dweiss I'll give it a try!





[GitHub] [lucene] madrob commented on a diff in pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


madrob commented on code in PR #11947:
URL: https://github.com/apache/lucene/pull/11947#discussion_r1026512732


##
dev-tools/scripts/StageArtifacts.java:
##
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import org.w3c.dom.Document;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.Console;
+import java.io.IOException;
+import java.io.StringReader;
+import java.net.Authenticator;
+import java.net.HttpURLConnection;
+import java.net.PasswordAuthentication;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.net.URLEncoder;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+import java.net.http.HttpResponse;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Sonatype nexus artifact staging/deployment script. This could be made
+ * nicer, but this keeps it to JDK classes only.
+ *
+ * The implementation is based on the REST API documentation of
+ * <a href="https://oss.sonatype.org/nexus-staging-plugin/default/docs/index.html">nexus-staging-plugin</a>
+ * and on anecdotal evidence and reverse-engineered information from around
+ * the web... Weird that such a crucial piece of infrastructure has such 
obscure
+ * documentation.
+ */
+public class StageArtifacts {
+  private static final String DEFAULT_NEXUS_URI = 
"https://repository.apache.org";
+
+  private static class Params {
+URI nexusUri = URI.create(DEFAULT_NEXUS_URI);
+String userName;
+char[] userPass;
+Path mavenDir;
+String description;
+
+private static char[] envVar(String envVar) {
+  var value = System.getenv(envVar);
+  return value == null ? null : value.toCharArray();
+}
+
+static Params parse(String[] args) {
+  try {
+var params = new Params();
+for (int i = 0; i < args.length; i++) {
+  switch (args[i]) {
+case "-n":
+case "--nexus":
+  params.nexusUri = URI.create(args[++i]);
+  break;
+case "-u":
+case "--user":
+  params.userName = args[++i];
+  break;
+case "-p":
+case "--password":
+  params.userPass = args[++i].toCharArray();
+  break;
+case "--description":
+  params.description = args[++i];
+  break;
+
+case "-h":
+case "--help":
+  System.out.println("java " + StageArtifacts.class.getName() + " 
[options] path");
+  System.out.println("  -u, --user  User name for 
authentication.");
+  System.out.println("  better: ASF_USERNAME env. 
var.");
+  System.out.println("  -p, --password  Password for 
authentication.");
+  System.out.println("  better: ASF_PASSWORD env. 
var.");
+  System.out.println("  -n, --nexus URL to Apache Nexus 
(optional).");
+  System.out.println("  --description  Staging repo description 
(optional).");
+  System.out.println("");
+  System.out.println("  pathPath to maven artifact 
directory.");
+  System.out.println("");
+  System.out.println(" Password can be omitted for console 
prompt-input.");

Review Comment:
   Use multi-line string?
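
   For illustration, a sketch of what that could look like with a Java text block (assuming Java 15+ is acceptable for dev-tools scripts); the help text is reflowed from the wrapped println calls above, and the dynamic class-name lookup is dropped for brevity:
   ```java
   // Sketch of the suggested multi-line string; replaces the println sequence above.
   System.out.println(
       """
       java StageArtifacts [options] path
         -u, --user        User name for authentication.
                           better: ASF_USERNAME env. var.
         -p, --password    Password for authentication.
                           better: ASF_PASSWORD env. var.
         -n, --nexus       URL to Apache Nexus (optional).
         --description     Staging repo description (optional).

         path              Path to maven artifact directory.

        Password can be omitted for console prompt-input.
       """);
   ```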



##
dev-tools/scripts/StageArtifacts.java:
##
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance 

[GitHub] [lucene] gsmiller commented on pull request #11901: Github#11869: Add RangeOnRangeFacetCounts

2022-11-18 Thread GitBox


gsmiller commented on PR #11901:
URL: https://github.com/apache/lucene/pull/11901#issuecomment-1320139204

   @mdmarshmallow it's similar to FacetSets but a bit different since FacetSets 
work over stored points, while this would work over stored ranges. I think it 
would make sense to eventually support multiple dimensions here, since the data 
(`LongRangeDocValueFields`) allow for it. I'm thinking about a user who is 
storing multi-dim ranges and creating queries that filter over those same 
ranges, but has no way to facet over them.
   
   As for how to handle faceting of multi-dim ranges, I think the logic would 
be the same as the "slow" query you reference, and would depend on the 
`RangeFieldQuery#QueryType` specified.
   
   But... as for concrete next steps, I see the value in simplifying the 
problem to a single dimension to get started. That likely covers most use-cases 
anyway. But I'd also like to create a path towards supporting multi-dim without 
having to create a completely new API. Could we re-use 
`LongRangeDocValueFields` for the indexing piece of this, and create a faceting 
implementation that functions over that data, but only make it work for single 
dimensions to start? Maybe the faceting API you create is simplified to the 
single-dimension case and it validates that the underlying field being faceted 
is single-dim? That way, in the future, if we add multi-dim support, users can 
keep using the same field type for their documents but the API gets generalized?
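
   (A purely hypothetical sketch of the single-dimension guard described above; the class and method names are invented for illustration and are not Lucene's actual API.)
   ```java
   // Hypothetical illustration only -- not Lucene's actual API.
   // Reject multi-dimensional range fields until multi-dim faceting is supported.
   final class SingleDimRangeCheck {
     static void checkSingleDimension(String field, int numDims) {
       if (numDims != 1) {
         throw new IllegalArgumentException(
             "Field '" + field + "' stores " + numDims
                 + " range dimensions; this faceting implementation currently"
                 + " supports single-dimension ranges only");
       }
     }
   }
   ```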





[GitHub] [lucene] benwtrent commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-18 Thread GitBox


benwtrent commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1026551961


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java:
##
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene95;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.4 vector format, which encodes numeric vector values and an 
optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW 
search. The format
+ * consists of three files:
+ *
+ * .vec (vector data) file
+ *
+ * For each field:
+ *
+ * 
+ *   Vector data ordered by field, document ordinal, and vector dimension. 
When the
+ *   vectorEncoding is BYTE, each sample is stored as a single byte. When 
it is FLOAT32, each
+ *   sample is stored as an IEEE float in little-endian byte order.
+ *   DocIds encoded by {@link IndexedDISI#writeBitSet(DocIdSetIterator, 
IndexOutput, byte)},
+ *   note that this is only written in the sparse case
+ *   OrdToDoc encoded by {@link 
org.apache.lucene.util.packed.DirectMonotonicWriter}, note
+ *   that this is also only written in the sparse case
+ * 
+ *
+ * .vex (vector index)
+ *
+ * Stores graphs connecting the documents for each field, organized as a 
list of nodes' neighbours,
+ * as follows:
+ *
+ * 
+ *   For each level:
+ *   
+ * For each node:
+ * 
+ *   [int32] the number of neighbor nodes
+ *   array[int32] the neighbor ordinals
+ *   array[int32] padding if the number of the node's 
neighbors is less than
+ *   the maximum number of connections allowed on this level. 
Padding is equal to
+ *   ((maxConnOnLevel – the number of neighbours) * 4) bytes.
+ * 
+ *   
+ * 
+ *
+ * .vem (vector metadata) file
+ *
+ * For each field:
+ *
+ * 
+ *   [int32] field number
+ *   [int32] vector similarity function ordinal
+ *   [vlong] offset to this field's vectors in the .vec file
+ *   [vlong] length of this field's vectors, in bytes
+ *   [vlong] offset to this field's index in the .vex file
+ *   [vlong] length of this field's index data, in bytes
+ *   [int] dimension of this field's vectors
+ *   [int] the number of documents having values for this field
+ *   [int8] if equal to -1, dense – all documents have values for 
a field. If equal to
+ *   0, sparse – some documents are missing values.
+ *   DocIds encoded by {@link 
IndexedDISI#writeBitSet(DocIdSetIterator, IndexOutput, byte)}
+ *   OrdToDoc encoded by {@link 
org.apache.lucene.util.packed.DirectMonotonicWriter}, note
+ *   that this is only written in the sparse case
+ *   [int] the maximum number of connections (neighbours) that each 
node can have
+ *   [int] number of levels in the graph
+ *   Graph nodes by level. For each level
+ *   
+ * [int] the number of nodes on this level
+ * array[int] for levels greater than 0, the list of nodes on 
this level, stored as
+ * their level 0 ordinals.
+ *   
+ * 
+ *
+ * @lucene.experimental
+ */
+public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
+
+  static final String META_CODEC_NAME = "lucene95HnswVectorsFormatMeta";

Review Comment:
   I can fix that here, 92, 94 were not title cased.




[GitHub] [lucene] msokolov commented on pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-18 Thread GitBox


msokolov commented on PR #11860:
URL: https://github.com/apache/lucene/pull/11860#issuecomment-1320161291

   +1 to the awesomeness - thanks for iterating on this fruit! - how 
high-hanging it is depends on one's perspective I guess.
   
   I have to say I am mildly amused that we are now using IndexedDISI to access 
the node neighbors / "postings" since the very initial hacky implementation of 
this hnsw index was based on SortedNumericDocValues, which is very similar 
under the hood I think. We decided to implement a new codec instead so that 
this could evolve independently, which I think made sense at the time, but it 
seems we've finally come full circle, at least sharing the implementation if 
not the actual field format.





[GitHub] [lucene] rmuir opened a new issue, #11948: clean up smoketester GPG leaks

2022-11-18 Thread GitBox


rmuir opened a new issue, #11948:
URL: https://github.com/apache/lucene/issues/11948

   ### Description
   
   smoketester leaks a GPG agent on my computer everytime it runs. @risdenk 
pointed out this fix from solr: 
https://github.com/apache/solr/commit/0cfef740617cc40585e3121e0b41e5cc8002471f





[GitHub] [lucene] dweiss commented on a diff in pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


dweiss commented on code in PR #11947:
URL: https://github.com/apache/lucene/pull/11947#discussion_r1026571426


##
dev-tools/scripts/StageArtifacts.java:
##
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import org.w3c.dom.Document;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.Console;
+import java.io.IOException;
+import java.io.StringReader;
+import java.net.Authenticator;
+import java.net.HttpURLConnection;
+import java.net.PasswordAuthentication;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.net.URLEncoder;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+import java.net.http.HttpResponse;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Sonatype nexus artifact staging/deployment script. This could be made
+ * nicer, but this keeps it to JDK classes only.
+ *
+ * The implementation is based on the REST API documentation of
+ * <a href="https://oss.sonatype.org/nexus-staging-plugin/default/docs/index.html">nexus-staging-plugin</a>
+ * and on anecdotal evidence and reverse-engineered information from around
+ * the web... Weird that such a crucial piece of infrastructure has such 
obscure
+ * documentation.
+ */
+public class StageArtifacts {
+  private static final String DEFAULT_NEXUS_URI = 
"https://repository.apache.org";
+
+  private static class Params {
+URI nexusUri = URI.create(DEFAULT_NEXUS_URI);
+String userName;
+char[] userPass;
+Path mavenDir;
+String description;
+
+private static char[] envVar(String envVar) {
+  var value = System.getenv(envVar);
+  return value == null ? null : value.toCharArray();
+}
+
+static Params parse(String[] args) {
+  try {
+var params = new Params();
+for (int i = 0; i < args.length; i++) {
+  switch (args[i]) {
+case "-n":
+case "--nexus":
+  params.nexusUri = URI.create(args[++i]);
+  break;
+case "-u":
+case "--user":
+  params.userName = args[++i];

Review Comment:
   there's a catch block that does it, below. it's a waste of time to elaborate 
much on arg validation if it's used from the wizard anyway?



##
dev-tools/scripts/StageArtifacts.java:
##
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import org.w3c.dom.Document;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.Console;
+import java.io.IOException;
+import java.io.StringReader;
+import java.net.Authenticator;
+import java.net.HttpURLConnection;
+import java.net.PasswordAuthentication;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.net.URLEncoder;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+im

[GitHub] [lucene] dweiss commented on a diff in pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


dweiss commented on code in PR #11947:
URL: https://github.com/apache/lucene/pull/11947#discussion_r1026572395


##
dev-tools/scripts/StageArtifacts.java:
##
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import org.w3c.dom.Document;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.Console;
+import java.io.IOException;
+import java.io.StringReader;
+import java.net.Authenticator;
+import java.net.HttpURLConnection;
+import java.net.PasswordAuthentication;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.net.URLEncoder;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+import java.net.http.HttpResponse;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Sonatype nexus artifact staging/deployment script. This could be made
+ * nicer, but this keeps it to JDK classes only.
+ *
+ * The implementation is based on the REST API documentation of
+ * <a href="https://oss.sonatype.org/nexus-staging-plugin/default/docs/index.html">nexus-staging-plugin</a>
+ * and on anecdotal evidence and reverse-engineered information from around
+ * the web... Weird that such a crucial piece of infrastructure has such 
obscure
+ * documentation.
+ */
+public class StageArtifacts {
+  private static final String DEFAULT_NEXUS_URI = 
"https://repository.apache.org";
+
+  private static class Params {
+URI nexusUri = URI.create(DEFAULT_NEXUS_URI);
+String userName;
+char[] userPass;
+Path mavenDir;
+String description;
+
+private static char[] envVar(String envVar) {
+  var value = System.getenv(envVar);
+  return value == null ? null : value.toCharArray();
+}
+
+static Params parse(String[] args) {
+  try {
+var params = new Params();
+for (int i = 0; i < args.length; i++) {
+  switch (args[i]) {
+case "-n":
+case "--nexus":
+  params.nexusUri = URI.create(args[++i]);
+  break;
+case "-u":
+case "--user":
+  params.userName = args[++i];
+  break;
+case "-p":
+case "--password":
+  params.userPass = args[++i].toCharArray();
+  break;
+case "--description":
+  params.description = args[++i];
+  break;
+
+case "-h":
+case "--help":
+  System.out.println("java " + StageArtifacts.class.getName() + " 
[options] path");
+  System.out.println("  -u, --user  User name for 
authentication.");
+  System.out.println("  better: ASF_USERNAME env. 
var.");
+  System.out.println("  -p, --password  Password for 
authentication.");
+  System.out.println("  better: ASF_PASSWORD env. 
var.");
+  System.out.println("  -n, --nexus URL to Apache Nexus 
(optional).");
+  System.out.println("  --description  Staging repo description 
(optional).");
+  System.out.println("");
+  System.out.println("  pathPath to maven artifact 
directory.");
+  System.out.println("");
+  System.out.println(" Password can be omitted for console 
prompt-input.");
+  System.exit(0);
+
+default:
+  if (params.mavenDir != null) {
+throw new RuntimeException("Exactly one maven artifact 
directory should be provided.");
+  }
+  params.mavenDir = Paths.get(args[i]);
+  break;
+  }
+}
+
+if (params.userName == null) {
+  var v = envVar("ASF_USERNAME");
+  if (v != null) {
+params.userName = new String(v);
+   

[GitHub] [lucene] dweiss commented on a diff in pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


dweiss commented on code in PR #11947:
URL: https://github.com/apache/lucene/pull/11947#discussion_r1026573130


##
dev-tools/scripts/StageArtifacts.java:
##
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import org.w3c.dom.Document;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.Console;
+import java.io.IOException;
+import java.io.StringReader;
+import java.net.Authenticator;
+import java.net.HttpURLConnection;
+import java.net.PasswordAuthentication;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.net.URLEncoder;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+import java.net.http.HttpResponse;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Sonatype nexus artifact staging/deployment script. This could be made
+ * nicer, but this keeps it to JDK classes only.
+ *
+ * The implementation is based on the REST API documentation of
+ * <a href="https://oss.sonatype.org/nexus-staging-plugin/default/docs/index.html">nexus-staging-plugin</a>
+ * and on anecdotal evidence and reverse-engineered information from around
+ * the web... Weird that such a crucial piece of infrastructure has such 
obscure
+ * documentation.
+ */
+public class StageArtifacts {
+  private static final String DEFAULT_NEXUS_URI = 
"https://repository.apache.org";
+
+  private static class Params {
+URI nexusUri = URI.create(DEFAULT_NEXUS_URI);
+String userName;
+char[] userPass;
+Path mavenDir;
+String description;
+
+private static char[] envVar(String envVar) {
+  var value = System.getenv(envVar);
+  return value == null ? null : value.toCharArray();
+}
+
+static Params parse(String[] args) {
+  try {
+var params = new Params();
+for (int i = 0; i < args.length; i++) {
+  switch (args[i]) {
+case "-n":
+case "--nexus":
+  params.nexusUri = URI.create(args[++i]);
+  break;
+case "-u":
+case "--user":
+  params.userName = args[++i];
+  break;
+case "-p":
+case "--password":
+  params.userPass = args[++i].toCharArray();
+  break;
+case "--description":
+  params.description = args[++i];
+  break;
+
+case "-h":
+case "--help":
+  System.out.println("java " + StageArtifacts.class.getName() + " 
[options] path");
+  System.out.println("  -u, --user  User name for 
authentication.");
+  System.out.println("  better: ASF_USERNAME env. 
var.");
+  System.out.println("  -p, --password  Password for 
authentication.");
+  System.out.println("  better: ASF_PASSWORD env. 
var.");
+  System.out.println("  -n, --nexus URL to Apache Nexus 
(optional).");
+  System.out.println("  --description  Staging repo description 
(optional).");
+  System.out.println("");
+  System.out.println("  pathPath to maven artifact 
directory.");
+  System.out.println("");
+  System.out.println(" Password can be omitted for console 
prompt-input.");
+  System.exit(0);
+
+default:
+  if (params.mavenDir != null) {
+throw new RuntimeException("Exactly one maven artifact 
directory should be provided.");
+  }
+  params.mavenDir = Paths.get(args[i]);
+  break;
+  }
+}
+
+if (params.userName == null) {
+  var v = envVar("ASF_USERNAME");
+  if (v != null) {
+params.userName = new String(v);
+   

[GitHub] [lucene] dweiss opened a new pull request, #11949: Add star import check/validation

2022-11-18 Thread GitBox


dweiss opened a new pull request, #11949:
URL: https://github.com/apache/lucene/pull/11949

   It's been a few times that I saw a comment on misc. PRs mentioning we want 
to avoid star imports. Let's just automate it? Seems like we already have tools 
to help out here.





[GitHub] [lucene] msokolov commented on issue #11830: Store HNSW graph connections more compactly

2022-11-18 Thread GitBox


msokolov commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1320183235

   Hey this looks great! Awesome to see the storage gains with no loss in
   query time
   
   On Thu, Nov 17, 2022 at 2:25 PM Benjamin Trent ***@***.***>
   wrote:
   
   > I changed the PR to move towards delta encoding & vint. Even with storing
   > the memory offsets within vex, the storage improvements are much better
   > than PackedInts.
   >
   > Table with some numbers around the size improvements for different data
   > sets & parameters:
   > | packed_vex_mb_size | vex_mb_size | packed_index_build_time | index_build_time | params | dataset | percent_reduction |
   > |---|---|---|---|---|---|---|
   > | 79.9 | 161.6 | 767 | 784 | {'M': 16, 'efConstruction': 100} | glove-100-angular | 50.55693069 |
   > | 108.4 | 464.1 | 1138 | 1225 | {'M': 48, 'efConstruction': 100} | glove-100-angular | 76.64296488 |
   > | 2.3 | 8.2 | 36 | 36 | {'M': 16, 'efConstruction': 100} | mnist-784-euclidean | 71.95121951 |
   > | 2.4 | 23.5 | 36 | 36 | {'M': 48, 'efConstruction': 100} | mnist-784-euclidean | 89.78723404 |
   > | 66.1 | 392.2 | 501 | 572 | {'M': 48, 'efConstruction': 100} | sift-128-euclidean | 83.1463539 |
   > | 59.7 | 136.6 | 449 | 516 | {'M': 16, 'efConstruction': 100} | sift-128-euclidean | 56.29575403 |
   >
   > For the curious, here are the QPS numbers (higher is better) for packed
   > (delta & vint) vs baseline:
   >
   > Glove: [chart omitted]
   >
   > MNist: [chart omitted]
   >
   > SIFT: [chart omitted]
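
   (Editorial aside, to make the encoding mentioned above concrete: a minimal
   sketch of delta + vint encoding of one node's neighbor ordinals. It is
   illustrative only and not the exact layout used by the PR; ByteBuffersDataOutput
   and writeVInt are existing Lucene APIs, the demo class name is made up.)

   ```java
   import java.io.IOException;
   import java.util.Arrays;
   import org.apache.lucene.store.ByteBuffersDataOutput;

   // Writes a neighbor list as [vint count][vint delta]...: sorting the ordinals keeps
   // the deltas small, so most neighbors take one or two bytes instead of a fixed four.
   public class NeighborDeltaVIntDemo {
     static void writeNeighbors(int[] neighbors, ByteBuffersDataOutput out) throws IOException {
       int[] sorted = neighbors.clone();
       Arrays.sort(sorted);
       out.writeVInt(sorted.length);
       int previous = 0;
       for (int ordinal : sorted) {
         out.writeVInt(ordinal - previous); // delta to the previous ordinal
         previous = ordinal;
       }
     }

     public static void main(String[] args) throws IOException {
       ByteBuffersDataOutput out = new ByteBuffersDataOutput();
       writeNeighbors(new int[] {42, 7, 43, 1024}, out);
       System.out.println("bytes written: " + out.size()); // well under 4 bytes per neighbor
     }
   }
   ```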





[GitHub] [lucene] gsmiller opened a new pull request, #11950: Fix NPE in BinaryRangeFieldRangeQuery when field does not exist or is of wrong type

2022-11-18 Thread GitBox


gsmiller opened a new pull request, #11950:
URL: https://github.com/apache/lucene/pull/11950

   ### Description
   
   This fixes a bug where variants of `BinaryRangeFieldRangeQuery` will result 
in an NPE if the field doesn't exist in a segment.





[GitHub] [lucene] dweiss commented on a diff in pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


dweiss commented on code in PR #11947:
URL: https://github.com/apache/lucene/pull/11947#discussion_r1026588147


##
dev-tools/scripts/StageArtifacts.java:
##
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import org.w3c.dom.Document;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.Console;
+import java.io.IOException;
+import java.io.StringReader;
+import java.net.Authenticator;
+import java.net.HttpURLConnection;
+import java.net.PasswordAuthentication;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.net.URLEncoder;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+import java.net.http.HttpResponse;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Sonatype nexus artifact staging/deployment script. This could be made
+ * nicer, but this keeps it to JDK classes only.
+ *
+ * The implementation is based on the REST API documentation of
+ * <a href="https://oss.sonatype.org/nexus-staging-plugin/default/docs/index.html">nexus-staging-plugin</a>
+ * and on anecdotal evidence and reverse-engineered information from around
+ * the web... Weird that such a crucial piece of infrastructure has such obscure
+ * documentation.
+ */
+public class StageArtifacts {
+  private static final String DEFAULT_NEXUS_URI = "https://repository.apache.org";
+
+  private static class Params {
+    URI nexusUri = URI.create(DEFAULT_NEXUS_URI);
+    String userName;
+    char[] userPass;
+    Path mavenDir;
+    String description;
+
+    private static char[] envVar(String envVar) {
+      var value = System.getenv(envVar);
+      return value == null ? null : value.toCharArray();
+    }
+
+    static Params parse(String[] args) {
+      try {
+        var params = new Params();
+        for (int i = 0; i < args.length; i++) {
+          switch (args[i]) {
+            case "-n":
+            case "--nexus":
+              params.nexusUri = URI.create(args[++i]);
+              break;
+            case "-u":
+            case "--user":
+              params.userName = args[++i];

Review Comment:
   I added that method, just for you, @madrob :)






[GitHub] [lucene] dweiss merged pull request #11949: Add star import check/validation

2022-11-18 Thread GitBox


dweiss merged PR #11949:
URL: https://github.com/apache/lucene/pull/11949





[GitHub] [lucene] dweiss commented on pull request #11949: Add star import check/validation

2022-11-18 Thread GitBox


dweiss commented on PR #11949:
URL: https://github.com/apache/lucene/pull/11949#issuecomment-1320195854

   I'll backport to 9x manually.





[GitHub] [lucene] rmuir merged pull request #11936: Lower gradle heap: 3GB is unnecessary

2022-11-18 Thread GitBox


rmuir merged PR #11936:
URL: https://github.com/apache/lucene/pull/11936





[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

2022-11-18 Thread GitBox


agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320221923

   If we use only post-filter in KnnVectorQuery, then we have to set k = 
Integer.MAX_VALUE (or another very big value) and calculate similarity with all 
vectors. So the complexity would be O(n). 
   
   I had another idea: we can check the similarity while we are traversing the
   graph. If the similarity is less than the threshold, we can discard that node
   and stop exploring that path. In that case we set k = Integer.MAX_VALUE and
   set the similarityThreshold value, but the time complexity would be between
   O(log(n)) and O(n) (it depends on the number of vectors with similarity
   greater than the threshold). I hope this allows us to solve tasks like the
   ones I described above
   (https://github.com/apache/lucene/pull/11946#issuecomment-1318924833) more
   efficiently.
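
   (Editorial aside: a toy sketch of the pruning idea described above, for
   illustration only. Lucene's real traversal lives in HnswGraphSearcher and
   differs in detail; the Graph and Scorer interfaces here are made up.)

   ```java
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.BitSet;
   import java.util.List;

   // Expands a proximity graph from an entry point and prunes any path whose node
   // falls below the similarity threshold, so low-similarity regions are never explored.
   final class ThresholdGraphSearch {
     interface Graph {
       int[] neighbors(int node);
     }

     interface Scorer {
       float similarity(int node); // similarity between the query vector and this node's vector
     }

     static List<Integer> search(Graph graph, Scorer scorer, int entryPoint, float threshold) {
       List<Integer> results = new ArrayList<>();
       BitSet visited = new BitSet();
       ArrayDeque<Integer> frontier = new ArrayDeque<>();
       frontier.add(entryPoint);
       visited.set(entryPoint);
       while (!frontier.isEmpty()) {
         int node = frontier.poll();
         if (scorer.similarity(node) < threshold) {
           continue; // discard this node and stop exploring the path through it
         }
         results.add(node);
         for (int neighbor : graph.neighbors(node)) {
           if (!visited.get(neighbor)) {
             visited.set(neighbor);
             frontier.add(neighbor);
           }
         }
       }
       return results;
     }
   }
   ```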





[GitHub] [lucene] rmuir opened a new issue, #11951: TestStressIndexing can sometime take minutes

2022-11-18 Thread GitBox


rmuir opened a new issue, #11951:
URL: https://github.com/apache/lucene/issues/11951

   ### Description
   
   I've seen this happen several times, so I think it may not be hard to
   reproduce; I have not tried to use the seed yet:
   
   ```
   > Task :randomizationInfo
   Running tests with randomization seed: tests.seed=6D4E6284011DCBC9
   ...
   The slowest tests (exceeding 500 ms) during this run:
 212.58s TestStressIndexing.testStressIndexAndSearching (:lucene:core)
   ```
   
   ### Gradle command to reproduce
   
   _No response_





[GitHub] [lucene] gsmiller commented on a diff in pull request #11928: GH#11922: Allow DisjunctionDISIApproximation to short-circuit

2022-11-18 Thread GitBox


gsmiller commented on code in PR #11928:
URL: https://github.com/apache/lucene/pull/11928#discussion_r1026694386


##
lucene/core/src/java/org/apache/lucene/search/DisjunctionDISIApproximation.java:
##
@@ -45,29 +51,54 @@ public long cost() {
 
   @Override
   public int docID() {
-    return subIterators.top().doc;
+    return docID;
   }
 
-  @Override
-  public int nextDoc() throws IOException {
+  private int doNext(int target) throws IOException {
+    if (target == DocIdSetIterator.NO_MORE_DOCS) {
+      docID = DocIdSetIterator.NO_MORE_DOCS;
+      return docID;
+    }
+
     DisiWrapper top = subIterators.top();
-    final int doc = top.doc;
     do {
-      top.doc = top.approximation.nextDoc();
+      top.doc = top.approximation.advance(target);
+      if (top.doc == target) {
+        subIterators.updateTop();
+        docID = target;
+        return docID;
+      }
       top = subIterators.updateTop();
-    } while (top.doc == doc);
+    } while (top.doc < target);
+    docID = top.doc;
 
-    return top.doc;
+    return docID;
+  }
+
+  @Override
+  public int nextDoc() throws IOException {
+    if (docID == DocIdSetIterator.NO_MORE_DOCS) {
+      return docID;
+    }

Review Comment:
   Right, good point. Will simplify.






[GitHub] [lucene] gsmiller commented on a diff in pull request #11928: GH#11922: Allow DisjunctionDISIApproximation to short-circuit

2022-11-18 Thread GitBox


gsmiller commented on code in PR #11928:
URL: https://github.com/apache/lucene/pull/11928#discussion_r1026697770


##
lucene/MIGRATE.md:
##
@@ -102,6 +102,12 @@ Lucene 9.2 or stay with 9.0.
 
 See LUCENE-10558 for more details and workarounds.
 
+### DisjunctionDISIApproximation behavior change

Review Comment:
   OK, yeah- makes sense. I'll move to the 9.5 section and remove the `MIGRATE` 
entry.






[GitHub] [lucene] dweiss merged pull request #11947: Add self-contained artifact upload script for apache nexus (#11329)

2022-11-18 Thread GitBox


dweiss merged PR #11947:
URL: https://github.com/apache/lucene/pull/11947





[GitHub] [lucene] dweiss closed issue #11329: Add an equivalent of ant's stage-maven-artifacts for the release wizard [LUCENE-10293]

2022-11-18 Thread GitBox


dweiss closed issue #11329: Add an equivalent of ant's stage-maven-artifacts 
for the release wizard [LUCENE-10293]
URL: https://github.com/apache/lucene/issues/11329





[GitHub] [lucene] gsmiller commented on a diff in pull request #11928: GH#11922: Allow DisjunctionDISIApproximation to short-circuit

2022-11-18 Thread GitBox


gsmiller commented on code in PR #11928:
URL: https://github.com/apache/lucene/pull/11928#discussion_r1026737221


##
lucene/core/src/java/org/apache/lucene/search/DisjunctionDISIApproximation.java:
##
@@ -45,29 +51,54 @@ public long cost() {
 
   @Override
   public int docID() {
-return subIterators.top().doc;
+return docID;
   }
 
-  @Override
-  public int nextDoc() throws IOException {
+  private int doNext(int target) throws IOException {
+if (target == DocIdSetIterator.NO_MORE_DOCS) {
+  docID = DocIdSetIterator.NO_MORE_DOCS;
+  return docID;
+}

Review Comment:
   That's fair. I removed the check.






[GitHub] [lucene] gsmiller commented on pull request #11928: GH#11922: Allow DisjunctionDISIApproximation to short-circuit

2022-11-18 Thread GitBox


gsmiller commented on PR #11928:
URL: https://github.com/apache/lucene/pull/11928#issuecomment-1320356774

   @jpountz thanks for the implementation feedback! I've updated the PR, but 
still plan to do more benchmarking to really understand the benefit, etc. 
before looking to actually merge this. I'll follow up here once I've been able 
to do that.





[GitHub] [lucene] msokolov commented on pull request #11946: add similarity threshold for hnsw

2022-11-18 Thread GitBox


msokolov commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320401647

   > If we use only post-filter in KnnVectorQuery, then we have to set k = 
Integer.MAX_VALUE (or another very big value) and calculate similarity with all 
vectors. So the complexity would be O(n).
   
   No, we don't have to do that. We can simply post-filter. Think of it like 
this - we want K matches with score > T. So we get the K top-scoring matches. 
If any have score less than T, we drop them. It's the same result as if we did 
the thresholding while collecting.
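
   (Editorial aside: a minimal sketch of the post-filtering described above,
   assuming the caller already has the TopDocs from a top-K vector search.
   TopDocs and ScoreDoc are existing Lucene classes; the helper name is made up.)

   ```java
   import java.util.Arrays;
   import org.apache.lucene.search.ScoreDoc;
   import org.apache.lucene.search.TopDocs;

   // Keeps only the hits from a top-K result whose score meets the threshold T.
   public final class ScoreThresholdPostFilter {
     public static ScoreDoc[] filter(TopDocs topDocs, float threshold) {
       return Arrays.stream(topDocs.scoreDocs)
           .filter(scoreDoc -> scoreDoc.score >= threshold)
           .toArray(ScoreDoc[]::new);
     }
   }
   ```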





[GitHub] [lucene] msokolov commented on pull request #11945: Decrease test time for TestManyKnnDocs.testLargeSegment

2022-11-18 Thread GitBox


msokolov commented on PR #11945:
URL: https://github.com/apache/lucene/pull/11945#issuecomment-1320406270

   oh nice plan, thanks everyone





[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

2022-11-18 Thread GitBox


agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320416549

   But we don't know K - that's the problem. The task which I want to solve 
sounds like this: find documents with similarity >= 0.76 (for example). We 
don't have the number of such documents in advance.
   





[GitHub] [lucene] msokolov commented on pull request #11946: add similarity threshold for hnsw

2022-11-18 Thread GitBox


msokolov commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320438152

   OK, can we start by providing a post-filter? I think this will be a more
   common use case. I want to find the best docs and ensure that none of them
   are terrible. It is less disruptive and doesn't require changes to the codec.
   Can you explain why you want the "find all docs with score > T" behavior? That
   is going to be a scary thing. What if someone asks for T==0? Then the
   computation and memory requirements are unbounded. I don't think this is a
   search use case - it's some kind of analytics thing that you should do in
   Spark or some kind of off-line computation system.
   
   On Fri, Nov 18, 2022 at 2:01 PM Alexey Gorlenko ***@***.***>
   wrote:
   
   > But we don't know K - that's the problem. The task which I want to solve
   > sounds like this: find documents with similarity >= 0.76 (for example). We
   > don't have the number of such documents in advance.
   >
   > —
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   > You are receiving this because you commented.Message ID:
   > ***@***.***>
   >
   





[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

2022-11-18 Thread GitBox


agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320508166

   > Can you explain why you want the "find all docs with score > T"?
   
   For example, we want to show the user only documents that are suitable for
   them. We have a custom scorer (based on an ML model, for example) which
   calculates a score. Next, we compare that score with a threshold to determine
   whether a document is suitable for the user or not. But usually that scorer is
   too computationally expensive to run on every document that passed the
   filters. To deal with this problem we can build another, much simpler model
   that selects candidates for the heavy model. One of the basic approaches for
   building that light model is knn: we have a vector (embedding) for the user or
   the user's query, and we have a vector (embedding) for every document. So we
   just find the nearest documents and pass them to the heavy scorer. But we
   don't know K in that case; we know only the threshold, and that threshold is
   defined during the development of the ranking model. Such tasks naturally
   arise in recommendation and ranking systems.
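
   (Editorial aside: a compact sketch of the two-stage setup described above,
   with made-up interface names; it is not code from this PR.)

   ```java
   import java.util.Comparator;
   import java.util.List;
   import java.util.stream.Collectors;

   // A cheap similarity pass selects candidates above a threshold; only those
   // candidates are scored and ordered by the expensive ranking model.
   final class TwoStageRanker {
     interface CheapScorer {
       float similarity(int doc); // e.g. embedding similarity
     }

     interface HeavyScorer {
       float score(int doc); // e.g. the ML ranking model
     }

     static List<Integer> rank(List<Integer> docs, CheapScorer cheap, HeavyScorer heavy, float threshold) {
       return docs.stream()
           .filter(doc -> cheap.similarity(doc) >= threshold) // candidate generation
           .sorted(Comparator.comparingDouble((Integer doc) -> heavy.score(doc)).reversed())
           .collect(Collectors.toList());
     }
   }
   ```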
   
   > That is going to be a scary thing. What if someone asks for T==0? Then the 
computation and memory requirements are unbounded.
   
   The same result can already be achieved by setting K = 1000...00, so I don't
   think we add a new vulnerability here. Maybe it is worth adding a warning to
   the documentation (for K and for similarityThreshold).
   
   
   If you still think it's a bad idea to support such functionality in Lucene, I
   will rewrite this PR for the post-filter case. But I think it can be useful
   for people who add ML ranking to search systems based on Lucene.





[GitHub] [lucene] benwtrent commented on pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-11-18 Thread GitBox


benwtrent commented on PR #11860:
URL: https://github.com/apache/lucene/pull/11860#issuecomment-1320660919

   OK, I did some more performance testing @jpountz @rmuir 
   
   Every once in a while, I see some extreme 100%/99.9% latency spikes in KNN 
search times. This happened on about half of the runs I did locally. It seems 
like spikes like this are typical in the 99.9%, but these are way more extreme:
   
   ```
   100th percentile service time | knn-search-10-100 | 7.02746 |  215.784 | 208.757   | ms | +2970.58% |
   ```
   
   Additionally, flush has indeed been affected. I tested more thoroughly with a
   larger data set:
   
   ```
   refresh-after-index |   23.5835  |   37.4973  | 13.9138  | ms |   +59.00% |
   ```
   I typically see an increase of 10-20 ms over the baseline.
   
   All the numbers: 
https://gist.github.com/benwtrent/cc9718bf8cb6d353cee37964472f98df





[GitHub] [lucene] mdmarshmallow commented on pull request #11901: Github#11869: Add RangeOnRangeFacetCounts

2022-11-18 Thread GitBox


mdmarshmallow commented on PR #11901:
URL: https://github.com/apache/lucene/pull/11901#issuecomment-1320700985

   Ah yeah, good point on the FacetSets... I actually already use
   `LongRangeDocValuesField` here: `public class LongRangeDocValuesFacetField
   extends LongRangeDocValuesField`. The difference is that the
   `LongRangeDocValuesFacetField` constructor enforces a single dimension. I'd
   imagine if we want to extend it, it shouldn't be hard to just add another
   constructor here.
   
   With regards to this though:
   >As for how to handle faceting of multi-dim ranges, I think the logic would 
be the same as the "slow" query you reference, and would depend on the 
RangeFieldQuery#QueryType specified.
   
   I think I will need to change this PR a bit then. Right now, the query can
   have multiple ranges, all of which need to "match" the stored range; that
   corresponds to the "cross product" idea I talked about earlier, which we won't
   be doing. So the way to go from here, I think, would be to either enforce only
   one range in the query or go ahead and do the multidimensional implementation.
   I think the latter makes more sense since I will need to change code either
   way.
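
   (Editorial aside: a hypothetical sketch of the "single dimension only"
   constraint mentioned above; names and signatures are illustrative and are not
   the PR's actual code.)

   ```java
   // A facet-field constructor that accepts multi-dimensional range arrays but
   // rejects anything other than exactly one dimension.
   final class SingleDimRangeFacetFieldSketch {
     final String name;
     final long min;
     final long max;

     SingleDimRangeFacetFieldSketch(String name, long[] mins, long[] maxes) {
       if (mins.length != 1 || maxes.length != 1) {
         throw new IllegalArgumentException("facet ranges must be single-dimensional");
       }
       this.name = name;
       this.min = mins[0];
       this.max = maxes[0];
     }
   }
   ```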

