Re: [PR] Compute facets while collecting [lucene]

2024-08-10 Thread via GitHub


epotyom commented on code in PR #13568:
URL: https://github.com/apache/lucene/pull/13568#discussion_r1712610080


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/misc/LongValueFacetCutter.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.misc;

Review Comment:
   For other reviewers, resolving this thread: as per the discussion at 
https://github.com/apache/lucene/pull/13568#discussion_r1692249664, we've 
deprecated the `FacetRollup` interface.






Re: [PR] Compute facets while collecting [lucene]

2024-08-10 Thread via GitHub


epotyom commented on code in PR #13568:
URL: https://github.com/apache/lucene/pull/13568#discussion_r1712617910


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/cutters/ranges/LongRangeFacetCutter.java:
##
@@ -0,0 +1,413 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.cutters.ranges;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+import org.apache.lucene.facet.MultiLongValues;
+import org.apache.lucene.facet.MultiLongValuesSource;
+import org.apache.lucene.facet.range.LongRange;
+import org.apache.lucene.sandbox.facet.cutters.FacetCutter;
+import org.apache.lucene.sandbox.facet.cutters.LeafFacetCutter;
+import org.apache.lucene.search.LongValues;
+import org.apache.lucene.search.LongValuesSource;
+
+/** {@link RangeFacetCutter} for ranges of long values. It's based on LongRangeCounter class. */
+public abstract class LongRangeFacetCutter extends RangeFacetCutter {
+
+  MultiLongValuesSource valuesSource;
+  LongValuesSource singleValues; // TODO: refactor - weird that we have both multi and single here.
+  LongRangeAndPos[] sortedRanges;
+
+  int requestedRangeCount;
+
+  List elementaryIntervals;
+
+  /** elementary interval boundaries used for efficient counting (bsearch to find interval) */
+  long[] boundaries;
+
+  int[] pos;
+
+  // Default interval position, when elementary interval is mapped to this interval
+  // it is skipped.
+  static final int SKIP_INTERVAL_POSITION = -1;
+
+  /** Create {@link FacetCutter} for provided value source and long ranges. */
+  public static LongRangeFacetCutter create(
+      String field,
+      MultiLongValuesSource longValuesSource,
+      LongValuesSource singleLongValuesSource,
+      LongRange[] longRanges) {
+    if (areOverlappingRanges(longRanges)) {
+      return new OverlappingLongRangeFacetCutter(
+          field, longValuesSource, singleLongValuesSource, longRanges);
+    }
+    return new NonOverlappingLongRangeFacetCutter(
+        field, longValuesSource, singleLongValuesSource, longRanges);
+  }
+
+  public static LongRangeFacetCutter create(
+      String field, MultiLongValuesSource longValuesSource, LongRange[] longRanges) {
+    return create(field, longValuesSource, null, longRanges);
+  }
+
+  // caller handles conversion of Doubles and DoubleRange to Long and LongRange
+  // ranges need not be sorted
+  LongRangeFacetCutter(
+      String field,
+      MultiLongValuesSource longValuesSource,
+      LongValuesSource singleLongValuesSource,
+      LongRange[] longRanges) {
+    super(field);
+    valuesSource = longValuesSource;
+    if (singleLongValuesSource != null) {
+      singleValues = singleLongValuesSource;
+    } else {
+      singleValues = MultiLongValuesSource.unwrapSingleton(valuesSource);
+    }
+
+    sortedRanges = new LongRangeAndPos[longRanges.length];
+    requestedRangeCount = longRanges.length;
+
+    for (int i = 0; i < longRanges.length; i++) {
+      sortedRanges[i] = new LongRangeAndPos(longRanges[i], i);
+    }
+
+    Arrays.sort(this.sortedRanges, Comparator.comparingLong(r -> r.range.min));
+    elementaryIntervals = buildElementaryIntervals();
+
+    // Keep track of elementary interval boundary ends (for binary search) along with the requested
+    // range they map back to (and -1 when they map to a "gap" range in case of ExclusiveRanges):
+    boundaries = new long[elementaryIntervals.size()];
+    pos = new int[elementaryIntervals.size()];
+    Arrays.fill(pos, SKIP_INTERVAL_POSITION);
+    int currRange = 0;
+    for (int i = 0; i < boundaries.length; i++) {
+      boundaries[i] = elementaryIntervals.get(i).end;
+      if (currRange < sortedRanges.length) {
+        LongRangeAndPos curr = sortedRanges[currRange];
+        if (boundaries[i] == curr.range.max) {
+          pos[i] = curr.pos;
+          currRange++;
+        }
+      }
+    }
+  }
+
+  abstract List buildElementaryIntervals();
+
+  private static boolean areOverlappingRanges(LongRange[] ranges) {
+    if (ranges.length == 0) {
+      return false;
+    }
+
+    // Copy before sorting so we don't mess with the caller's o
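
For intuition, here is a minimal, self-contained sketch of the counting scheme 
the constructor above prepares, assuming non-overlapping ranges whose elementary 
intervals tile the full long domain (gap intervals mapped to 
`SKIP_INTERVAL_POSITION`). The names below, like `ElementaryIntervalCounter` and 
`countValue`, are hypothetical and not part of the PR:

```java
import java.util.Arrays;

// Hypothetical illustration: each elementary interval ends at boundaries[i]
// (ascending, inclusive end), and pos[i] maps it back to a requested range,
// or -1 for a "gap" interval that belongs to no requested range.
final class ElementaryIntervalCounter {
  static final int SKIP_INTERVAL_POSITION = -1;

  private final long[] boundaries;
  private final int[] pos;
  private final int[] counts; // one counter per requested range

  ElementaryIntervalCounter(long[] boundaries, int[] pos, int requestedRangeCount) {
    this.boundaries = boundaries;
    this.pos = pos;
    this.counts = new int[requestedRangeCount];
  }

  void countValue(long value) {
    // bsearch: an exact hit means the value sits at an interval's inclusive
    // end; otherwise the insertion point is the first interval whose end
    // exceeds the value, i.e. the containing interval.
    int i = Arrays.binarySearch(boundaries, value);
    if (i < 0) {
      i = -i - 1;
    }
    if (i < boundaries.length && pos[i] != SKIP_INTERVAL_POSITION) {
      counts[pos[i]]++;
    }
  }

  int count(int requestedRangeIndex) {
    return counts[requestedRangeIndex];
  }
}
```

Each collected value then costs a single binary search over `boundaries`, 
independent of the number of requested ranges.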

Re: [PR] Optimize decoding blocks of postings using the vector API. [lucene]

2024-08-10 Thread via GitHub


uschindler commented on code in PR #13636:
URL: https://github.com/apache/lucene/pull/13636#discussion_r1712624281


##
lucene/core/src/java21/org/apache/lucene/internal/vectorization/MemorySegmentPostingDecodingUtil.java:
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.internal.vectorization;
+
+import java.io.IOException;
+import java.lang.foreign.MemorySegment;
+import java.nio.ByteOrder;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorSpecies;
+import org.apache.lucene.store.IndexInput;
+
+final class MemorySegmentPostingDecodingUtil extends PostingDecodingUtil {
+
+  private static final VectorSpecies<Long> LONG_SPECIES =
+      PanamaVectorConstants.PRERERRED_LONG_SPECIES;
+
+  private final IndexInput in;
+  private final MemorySegment memorySegment;
+
+  MemorySegmentPostingDecodingUtil(IndexInput in, MemorySegment memorySegment) {
+    this.in = in;
+    this.memorySegment = memorySegment;
+  }
+
+  @Override
+  public void readLongs(int count, long[] b) throws IOException {
+    if (count < LONG_SPECIES.length()) {
+      in.readLongs(b, 0, count);
+    } else {
+      long offset = in.getFilePointer();
+      long endOffset = offset + count * Long.BYTES;
+      int loopBound = LONG_SPECIES.loopBound(count - 1);
+      for (int i = 0;
+          i < loopBound;
+          i += LONG_SPECIES.length(), offset += LONG_SPECIES.length() * Long.BYTES) {
+        LongVector.fromMemorySegment(LONG_SPECIES, memorySegment, offset, ByteOrder.LITTLE_ENDIAN)

Review Comment:
   Hi, just a guess: `copyMemory()` has some overhead for very short arrays.
   
   The good old ByteBuffers (MappedByteBuffer) had an optimization in 
ByteBuffer.getLongs() and other methods to copy a few items with a for-loop 
instead of copyMemory, see 
https://github.com/openjdk/jdk/blob/358d77dafbe0e35d5b20340fccddc0fb8f3db82a/src/java.base/share/classes/java/nio/X-Buffer.java.template#L954
   
   With MemorySegment there's no such optimization anymore. We discussed this in 
the early days of MemorySegment usage in Lucene, and Maurizio (from OpenJDK) at 
that time suggested trying a simple loop instead of copyMemory for very small 
arrays `<= Bits.JNI_COPY_TO_ARRAY_THRESHOLD`. The value is 6 array entries: 
https://github.com/openjdk/jdk/blob/358d77dafbe0e35d5b20340fccddc0fb8f3db82a/src/java.base/share/classes/java/nio/Bits.java#L232-L236 
(I did this, but with Mike's benchmarks I did not see a large difference, only 
< stddev).
   
   It could be that the above code uses a loop and not copyMemory behind the 
scenes and therefore has a small speedup for long arrays < 6 items.
   
   We could, separately from this issue, implement some "short array scalar 
optimization" in `MemorySegmentIndexInput#getBytes/getLongs/getFloats/` 
like in the `XBuffer` classes.
   
   I'd revert this additional complexity here and instead look into optimizing 
`MemorySegmentIndexInput`. Doing this would speed up other usages in Lucene, 
too! It is quite easy to add a simple if/else here: 
https://github.com/apache/lucene/blob/5aa1aa98ea2cdb37d7d8d48657703a5eb7fbaefa/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java#L222-L233
   
   Basically, as a quick try, add `if (length <= 6) { super.readLongs(); 
return; }` here and measure again.
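
For concreteness, a minimal sketch of what such a short-array scalar fallback 
could look like. This is a standalone, hypothetical helper, not the actual 
`MemorySegmentIndexInput` code; the threshold mirrors the JDK's 
`Bits.JNI_COPY_TO_ARRAY_THRESHOLD`:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

final class ShortArrayReadSketch {
  // Little-endian, unaligned long layout, matching Lucene's on-disk byte order.
  private static final ValueLayout.OfLong LAYOUT_LE_LONG =
      ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

  // Mirrors Bits.JNI_COPY_TO_ARRAY_THRESHOLD (6 elements) from the JDK.
  private static final int SCALAR_COPY_THRESHOLD = 6;

  static void readLongs(MemorySegment segment, long pos, long[] dst, int offset, int length) {
    if (length <= SCALAR_COPY_THRESHOLD) {
      // Plain scalar loop: skips the fixed setup cost of the bulk copy path,
      // which dominates for very small arrays.
      for (int i = 0; i < length; i++) {
        dst[offset + i] = segment.get(LAYOUT_LE_LONG, pos + (long) i * Long.BYTES);
      }
    } else {
      // Bulk copy (copyMemory under the hood) for larger arrays.
      MemorySegment.copy(segment, LAYOUT_LE_LONG, pos, dst, offset, length);
    }
  }
}
```

The same pattern would apply to getBytes/getFloats; whether it actually pays 
off would need to be confirmed by re-running the benchmarks.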






Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]

2024-08-10 Thread via GitHub


msokolov commented on issue #13565:
URL: https://github.com/apache/lucene/issues/13565#issuecomment-2282182975

   Thinking about the implementation a bit, I realized that when we reorder the 
vector storage for the benefit of HNSW, we will still need a way to iterate over 
vector values in docid order, and we need to map from vector ord to docid when 
searching. None of the existing vector formats handles this: they are optimized 
for vectors that are stored in docid order. To make some progress, I'd start 
with an initial implementation that stores these mappings in a naive way, e.g. 
as fully-populated arrays, and we can use that to measure how much improvement 
we see in graph storage size and search performance. Then we could revisit and 
use some more efficient data structure for the ord/doc mapping. Since the 
ordinals would no longer be increasing with docid we can't use 
DirectMonotonicReader/Writer any more; it needs to be something more like what 
SortedNumericDocValues does. I'm not super familiar with what we have - I 
wonder if there is some reusable code that would help.
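
A minimal sketch of such a naive, fully-populated mapping, just to make the 
measurement baseline concrete (class and method names are hypothetical):

```java
import java.util.Arrays;

// Naive ord<->doc mappings as fully-populated arrays: ordToDoc supports
// mapping a vector ordinal to its docid at search time; docToOrd supports
// iterating vector values in docid order despite the reordered storage.
final class NaiveOrdDocMap {
  private final int[] ordToDoc; // vector ordinal -> docid
  private final int[] docToOrd; // docid -> vector ordinal, or -1 if no vector

  NaiveOrdDocMap(int[] ordToDoc, int maxDoc) {
    this.ordToDoc = ordToDoc;
    this.docToOrd = new int[maxDoc];
    Arrays.fill(docToOrd, -1);
    for (int ord = 0; ord < ordToDoc.length; ord++) {
      docToOrd[ordToDoc[ord]] = ord;
    }
  }

  int ordToDoc(int ord) {
    return ordToDoc[ord];
  }

  int docToOrd(int doc) {
    return docToOrd[doc];
  }
}
```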





Re: [I] Lucene99FlatVectorsReader.getFloatVectorValues(): NPE: Cannot read field "vectorEncoding" because "fieldEntry" is null [lucene]

2024-08-10 Thread via GitHub


msokolov commented on issue #13626:
URL: https://github.com/apache/lucene/issues/13626#issuecomment-2282191793

   It makes sense to me to add null checks. I don't think PerFieldCodec should 
be required? The usage seems legit to me - basically it defines a set of params 
to use with all knn fields.





Re: [PR] Unify how missing field entries are handled in knn formats [lucene]

2024-08-10 Thread via GitHub


msokolov commented on PR #13641:
URL: https://github.com/apache/lucene/pull/13641#issuecomment-2282193063

   > It is possible to inappropriately use the knn formats and attempt to merge 
segments with mismatched field names.
   
   I don't understand this - what's inappropriate about it? I guess we have 
something like segment 1 has field A and segment 2 has field B. Shouldn't the 
merged segment have both fields?





Re: [I] testMergeStability failing for Knn formats [lucene]

2024-08-10 Thread via GitHub


msokolov commented on issue #13640:
URL: https://github.com/apache/lucene/issues/13640#issuecomment-2282193339

   hmm thanks I'll take a look soon





Re: [I] testMergeStability failing for Knn formats [lucene]

2024-08-10 Thread via GitHub


msokolov commented on issue #13640:
URL: https://github.com/apache/lucene/issues/13640#issuecomment-2282198464

   I didn't know about this constraint until now. Basically what happens is 
during merge we check for disconnected components and attempt to *add* 
connections to connect them. So it makes sense that we might be adding some 
bytes to the vex file. Maybe a way to avoid this is to skip recreating the HNSW 
graph when merging a single segment. Honestly I don't know why we would be 
doing that. I'll dig a bit more.





[PR] Two fixes for recently-added HnswGraphBuilder.connectComponents: [lucene]

2024-08-10 Thread via GitHub


msokolov opened a new pull request, #13642:
URL: https://github.com/apache/lucene/pull/13642

   1. properly set frozen flag to avoid re-duplicative work
   2. don't try to join a node to itself





Re: [PR] Two fixes for recently-added HnswGraphBuilder.connectComponents: [lucene]

2024-08-10 Thread via GitHub


msokolov commented on PR #13642:
URL: https://github.com/apache/lucene/pull/13642#issuecomment-2282219044

   addresses #13640 
   
   With this fix, the failing test seed from that issue succeeds. I re-ran the 
test with tests.iters=1000 and didn't get any failures.





Re: [I] testMergeStability failing for Knn formats [lucene]

2024-08-10 Thread via GitHub


msokolov commented on issue #13640:
URL: https://github.com/apache/lucene/issues/13640#issuecomment-2282219274

   OK, I guess we would need to actually build a graph when merging a single 
segment in case there are deletions. In any case it would be nice if the graph 
reconnection were stable. This test exposes some interesting problems! Yay for 
our tests.





Re: [PR] Unify how missing field entries are handled in knn formats [lucene]

2024-08-10 Thread via GitHub


benwtrent commented on PR #13641:
URL: https://github.com/apache/lucene/pull/13641#issuecomment-2282231690

   @msokolov maybe?
   
   Looking at the code, this has never worked this way. Field entry existence 
was just assumed to be handled by the per-field codec.
   
   I tried digging into the other formats that have a "per field" version 
(e.g. doc values), and when I tested, they also threw an NPE in a scenario 
like this.
   
   So, I opted to keep what seems like the status quo.
   
   But I am happy to change it if I am wrong about the intended behavior.





Re: [PR] Two fixes for recently-added HnswGraphBuilder.connectComponents: [lucene]

2024-08-10 Thread via GitHub


msokolov merged PR #13642:
URL: https://github.com/apache/lucene/pull/13642

