Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2025-02-17 Thread via GitHub


HoustonPutman commented on code in PR #13914:
URL: https://github.com/apache/lucene/pull/13914#discussion_r1958610123


##
lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java:
##
@@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) {
* is used to compute the equi-weight per bin.
*/
   public static List computeDynamicNumericRanges(
-  long[] values, long[] weights, int len, long totalWeight, int topN) {
+  long[] values, long[] weights, int len, long totalValue, long 
totalWeight, int topN) {
 assert values.length == weights.length && len <= values.length && len >= 0;
 assert topN >= 0;
 List dynamicRangeResult = new ArrayList<>();
 if (len == 0 || topN == 0) {
   return dynamicRangeResult;
 }
 
-new InPlaceMergeSorter() {
-  @Override
-  protected int compare(int index1, int index2) {
-int cmp = Long.compare(values[index1], values[index2]);
-if (cmp == 0) {
-  // If the values are equal, sort based on the weights.
-  // Any weight order is correct as long as it's deterministic.
-  return Long.compare(weights[index1], weights[index2]);
-}
-return cmp;
-  }
+double rangeWeightTarget = (double) totalWeight / topN;
+double[] kWeights = new double[topN];
+for (int i = 0; i < topN; i++) {
+  kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;
+}
 
-  @Override
-  protected void swap(int index1, int index2) {
-long tmp = values[index1];
-values[index1] = values[index2];
-values[index2] = tmp;
-tmp = weights[index1];
-weights[index1] = weights[index2];
-weights[index2] = tmp;
-  }
-}.sort(0, len);
+WeightedSelector.WeightRangeInfo[] kIndexResults =
+new WeightedSelector() {
+  private long pivotValue;
+  private long pivotWeight;
 
-long accuWeight = 0;
-long valueSum = 0;
-int count = 0;
-int minIdx = 0;
+  @Override
+  protected long getWeight(int i) {
+return weights[i];
+  }
 
-double rangeWeightTarget = (double) totalWeight / Math.min(topN, len);
+  @Override
+  protected long getValue(int i) {
+return values[i];
+  }
 
-for (int i = 0; i < len; i++) {
-  accuWeight += weights[i];
-  valueSum += values[i];
-  count++;
+  @Override
+  protected void swap(int index1, int index2) {
+long tmp = values[index1];
+values[index1] = values[index2];
+values[index2] = tmp;
+tmp = weights[index1];
+weights[index1] = weights[index2];
+weights[index2] = tmp;
+  }
 
-  if (accuWeight >= rangeWeightTarget) {
+  @Override
+  protected void setPivot(int i) {
+pivotValue = values[i];
+pivotWeight = weights[i];
+  }
+
+  @Override
+  protected int comparePivot(int j) {
+int cmp = Long.compare(pivotValue, values[j]);
+if (cmp == 0) {
+  // If the values are equal, sort based on the weights.
+  // Any weight order is correct as long as it's deterministic.
+  return Long.compare(pivotWeight, weights[j]);
+}
+return cmp;
+  }
+}.select(0, len, totalValue, 0, totalWeight, 0, kWeights);
+
+int lastIdx = -1;
+long lastTotalValue = 0;
+long lastTotalWeight = 0;
+for (int kIdx = 0; kIdx < topN; kIdx++) {

Review Comment:
   Sounds good! It's pretty trivial to switch over to the other implementation, 
so happy to test that out when the benchmark is available!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2025-02-17 Thread via GitHub


HoustonPutman commented on PR #13914:
URL: https://github.com/apache/lucene/pull/13914#issuecomment-2663827155

   > I know you mentioned there is a change in behavior in the caveat, but I do 
believe that this example should probably return ranges with equal counts.
   
   This one was a `<=` that should have been a `<`. I've fixed it and the tests 
pass.
   
   > I've tried to track down this bug, and I haven't quite fixed it, but I 
believe the fix is related to these lines:
   
   Yeah this one was an issue with floating point math. I've set it such that 
the last quantile will always be the total, no need to do math for that.
   
   The test still returns different results, but at least it fails without an 
exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Relation check within 1D BKD Leaves [lucene]

2025-02-17 Thread via GitHub


jpountz commented on PR #14244:
URL: https://github.com/apache/lucene/pull/14244#issuecomment-2663735068

   I understand the idea, but this looks a bit like a hack to me. I'm also not 
too fond of optimizing point-in-set queries, as terms would be a better way to 
index the data than points if the goal is to search for exact values?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-17 Thread via GitHub


jpountz commented on code in PR #14204:
URL: https://github.com/apache/lucene/pull/14204#discussion_r1958608604


##
lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
##
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {

Review Comment:
   > So I suppose it's not too terrible to not allow NO_MORE_ORDS value when 
counting.
   
   I'm a bit confused as to what your suggestion is then. For instance if the 
value is -5 and the interval is 10, the bucket would be computer as 
`Math.floorDiv(-5, 10)`, which returns -1.
   
   Based on your points, it seems like the least fragile approach would be to 
make the facet cutter responsible for densifying bucket ordinals?
   
   So in the end, it would make sense to have two histogram implementations, 
one as a cutter, that densifies ordinals using a hash table that is similar to 
your branch, and another one as a raw collector manager for users who are 
interested in computing counts per histogram bucket but nothing else like this 
PR?
   
   > Are you running runGeoBench.cmd to get results?
   
   I'm running the geonames benchmark, that consists or running the 
`IndexGeoNames#main` then `SearchGeoNames#main`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce multiSelect for ScalarQuantizer [lucene]

2025-02-17 Thread via GitHub


HoustonPutman closed pull request #13919: Introduce multiSelect for 
ScalarQuantizer
URL: https://github.com/apache/lucene/pull/13919


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]

2025-02-17 Thread via GitHub


stefanvodita commented on code in PR #14237:
URL: https://github.com/apache/lucene/pull/14237#discussion_r1958258387


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/utils/BaseFacetBuilder.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.utils;

Review Comment:
   I get your argument. Let's keep it as is then.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] [Feature] Add support for passing extra information with KNNVectorField [lucene]

2025-02-17 Thread via GitHub


msokolov commented on issue #14247:
URL: https://github.com/apache/lucene/issues/14247#issuecomment-2663422716

   I wonder if there might be a better way to accomplish your actual goal.  
Adding "extra data" doesn't seem like a good idea to me since it inherently 
blurs the function of the data format. Can you describe the intended use case 
more fully?  I guess this is some kind of clustering? What does that mean? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]

2025-02-17 Thread via GitHub


epotyom commented on code in PR #14237:
URL: https://github.com/apache/lucene/pull/14237#discussion_r1958063799


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/utils/BaseFacetBuilder.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.facet.utils;

Review Comment:
   I want to keep `*FacetOrchstrator`s in the same package as `FacetBuilder`s, 
so that we can keep some `FacetBuilder` method package-private. So the only 
file that maybe doesn't belong in the same package is `ComparableUtils.java`. 
At the same time the idea of ComparableUtils is similar - make it easier for 
users to use the new module, so I thought we could reduce number of packages 
and keep meaningful package structure at the same time... But I don't feel 
strongly about it. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-17 Thread via GitHub


benwtrent commented on PR #14160:
URL: https://github.com/apache/lucene/pull/14160#issuecomment-2663547774

   > But the only question is whether we could somehow make it available in on 
10.x since it would be a behavior change? I'm not sure if that question is 
still relevant though. If the results are uniformly better and there is no API 
breakage, would there be any concern with backport to 10.x?
   
   
   @msokolov I did backport to 10.x, but due to the recall differences I kept 
the current default behavior in 10.x. But, folks can opt-in to use it by 
supplying a query time parameter. 
   
   I do think we can get uniformly better results eventually, but I thought it 
was "good enough" for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove duplicates from the hnsw recall testing [lucene]

2025-02-17 Thread via GitHub


msokolov commented on PR #14234:
URL: https://github.com/apache/lucene/pull/14234#issuecomment-2663482993

   thanks, @benwtrent ... this test was a bit hacky. Makes sense to remove 
noisy data from a test like this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

2025-02-17 Thread via GitHub


msokolov commented on PR #14160:
URL: https://github.com/apache/lucene/pull/14160#issuecomment-2663491200

   > Do you think the behavior change/result change is worth waiting for a 
major? I do think folks should be able to use this now, but be able to opt out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-02-17 Thread via GitHub


jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2663774846

   Thanks for running benchmarks. I'm confused as to why running inner loops of 
size 512 would be to much better than inner loops of size 32. This doesn't feel 
right? Does luceneutil also report worse performance with the inner loop of 
size 32?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]

2025-02-17 Thread via GitHub


epotyom commented on code in PR #14237:
URL: https://github.com/apache/lucene/pull/14237#discussion_r1958141105


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/utils/CommonFacetBuilder.java:
##
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.facet.utils;
+
+import java.io.IOException;
+import org.apache.lucene.sandbox.facet.cutters.FacetCutter;
+import org.apache.lucene.sandbox.facet.labels.OrdToLabel;
+
+/**
+ * Common {@link FacetBuilder} that works with any {@link FacetCutter} and 
{@link OrdToLabel}
+ * provided by user.
+ */
+public final class CommonFacetBuilder extends 
BaseFacetBuilder {
+
+  private final FacetCutter cutter;
+  private final OrdToLabel ordToLabel;
+
+  public CommonFacetBuilder(String dimension, FacetCutter cutter, OrdToLabel 
ordToLabel) {
+super(dimension);
+this.cutter = cutter;
+this.ordToLabel = ordToLabel;
+  }
+
+  @Override
+  FacetCutter createFacetCutter() {

Review Comment:
   Ah you are right, this method doesn't have to create FacetCutter, e.g. 
FacetCutter might be created in advance and returned by the method. Renamed to 
`getFacetCutter`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


houserjohn commented on code in PR #14238:
URL: https://github.com/apache/lucene/pull/14238#discussion_r1958559535


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java:
##
@@ -285,6 +285,23 @@ private static void 
assertDynamicNumericRangeValidProperties(
   pairOffset += count;
 }
 
+// The minimum/maximum of each range is actually the smallest/largest value
+for (int pairOffset = 0, rangeIdx = 0; rangeIdx < 
mockDynamicRangeResult.size(); rangeIdx++) {
+  DynamicRangeUtil.DynamicRangeInfo rangeInfo = 
mockDynamicRangeResult.get(rangeIdx);
+  int count = rangeInfo.count();
+  long min = Long.MAX_VALUE;
+  long max = Long.MIN_VALUE;
+  for (int i = pairOffset; i < pairOffset + count; i++) {

Review Comment:
   Good point. Added in latest revision.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-17 Thread via GitHub


epotyom commented on code in PR #14204:
URL: https://github.com/apache/lucene/pull/14204#discussion_r1958677872


##
lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
##
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {

Review Comment:
   >I'm a bit confused as to what your suggestion is then.
   
   I think we changing `NO_MORE_ORDS` constant value to `Long.MAX_VALUE` or 
`Long.MIN_VALUE` solves this problem (after changing facet ord type from int to 
long).
   
   > So in the end, it would make sense to have two histogram implementations, 
one as a cutter, that densifies ordinals using a hash table that is similar to 
your branch, and another one as a raw collector manager for users who are 
interested in computing counts per histogram bucket but nothing else like this 
PR?
   
   It seems reasonable. And we can look into changing facet ord type to long as 
a separate issue.
   
   > I'm running the geonames benchmark, that consists or running the 
IndexGeoNames#main then SearchGeoNames#main.
   
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-17 Thread via GitHub


gaoj0017 commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2664639299

   As we have consistently emphasized in both public and private 
communications, we are concerned that the **OSQ method employs an idea 
highly similar to the one presented in our [extended RaBitQ 
paper](https://arxiv.org/pdf/2409.09913), which was published months earlier 
than OSQ - however, Elastic does not acknowledge extended RaBitQ as a prior art 
but as a parallel work**. Specifically, the high-level idea of OSQ is to 
try different parameters for scalar quantization on a per-vector basis as 
described in Elastic’s reply. This idea is highly similar to one of the key 
ideas of our extended RaBitQ (as described in Section 3.2.2 of our extended 
RaBitQ paper): it involves testing various parameters for each individual 
vector, performing rounding (i.e., scalar quantization) based on these 
parameters, and then calculating the similarity between the rounded vector and 
the original vector. It finally selects the optimal parameter and corresponding 
rounded vector. With this, ou
 r extended RaBitQ method achieves significant performance improvements over 
LVQ, the state-of-the-art variant of scalar quantization by then. For example, 
under 2-bit quantization, the distance estimation error of LVQ is larger than 
that of the extended RaBitQ by orders of magnitude.
   
   In addition, Elastic insists on categorizing our RaBitQ and its extension 
methods as a variant of PQ to highlight the differences from OSQ, despite our 
repeated refutation of this point in our responses in private meetings. We 
believe we have made it clear that our RaBitQ and its extension are not 
variants of PQ, but provide optimized methods of binary and scalar 
quantization, respectively.
   
   We respect OSQ’s differences from extended RaBitQ and would be pleased to 
see if they can make further improvements. However, these differences do not 
negate the similarity between the two methods. At the same time, we hope 
Elastic will respect (1) the significant performance improvements brought by 
our extended RaBitQ method in the realm of scalar quantization and (2) the fact 
that our extended RaBitQ method was published months earlier than OSQ. 
   
   Finally, we would like to reiterate our two points:
   
   1. **The so-called BBQ majorly follows our RaBitQ method;**
   2. **The OSQ method (introduced in this PR) has its major idea similar 
to our extended RaBitQ method (published months earlier) and our extended 
RaBitQ method is a prior art which achieves good accuracy at 1-bit/2-bit 
binary/scalar quantization for the first time.**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Improve user-facing docs. [lucene]

2025-02-17 Thread via GitHub


jpountz opened a new pull request, #14251:
URL: https://github.com/apache/lucene/pull/14251

   This improves user-facing docs in Lucene's package javadocs:
- Make some docs up-to-date, e.g. some of them were still referring to 
oal.index.Fields.
- More emphasis of structured search and knn search.
- More guidance on how to index data in `oal.document`.
- Add API docs for impacts.
- Discourage less strongly to change the similarity at search time.
- Recommend using Query factory methods on Field classes instead of 
creating queries manually.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-17 Thread via GitHub


epotyom commented on code in PR #14204:
URL: https://github.com/apache/lucene/pull/14204#discussion_r1958677872


##
lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
##
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {

Review Comment:
   >I'm a bit confused as to what your suggestion is then.
   
   I think that changing `NO_MORE_ORDS` constant value to `Long.MAX_VALUE` or 
`Long.MIN_VALUE` solves this problem (after changing facet ord type from int to 
long).
   
   > So in the end, it would make sense to have two histogram implementations, 
one as a cutter, that densifies ordinals using a hash table that is similar to 
your branch, and another one as a raw collector manager for users who are 
interested in computing counts per histogram bucket but nothing else like this 
PR?
   
   It seems reasonable. And we can look into changing facet ord type to long as 
a separate issue.
   
   > I'm running the geonames benchmark, that consists or running the 
IndexGeoNames#main then SearchGeoNames#main.
   
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


gsmiller commented on PR #14238:
URL: https://github.com/apache/lucene/pull/14238#issuecomment-2664264734

   Looks good. Thanks @houserjohn ! I see @stefanvodita also added his approval 
so I will go ahead and get this merged. Thanks again!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Disable the query cache by default. [lucene]

2025-02-17 Thread via GitHub


github-actions[bot] commented on PR #14187:
URL: https://github.com/apache/lucene/pull/14187#issuecomment-2664271825

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Bump floor segment size to 16MB. [lucene]

2025-02-17 Thread via GitHub


github-actions[bot] commented on PR #14189:
URL: https://github.com/apache/lucene/pull/14189#issuecomment-2664271800

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new Directory implementation for AWS S3 [lucene]

2025-02-17 Thread via GitHub


github-actions[bot] commented on PR #13949:
URL: https://github.com/apache/lucene/pull/13949#issuecomment-2664272130

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] [Feature] Add support for passing extra information with KNNVectorField [lucene]

2025-02-17 Thread via GitHub


navneet1v commented on issue #14247:
URL: https://github.com/apache/lucene/issues/14247#issuecomment-2663976487

   > I wonder if there might be a better way to accomplish your actual goal. 
Adding "extra data" doesn't seem like a good idea to me since it inherently 
blurs the function of the data format. Can you describe the intended use case 
more fully? I guess this is some kind of clustering? What does that mean?
   
   As mentioned in the description one use-case I was thinking was, lets say I 
have an identifier named tenantId for the vector and while doing the search I 
always want to get results for that specific tenant hence I want to build 
multiple HNSW graphs at a segment level 1 per tenant. So that when search comes 
in for a specific tenant, rather than searching over the large graph containing 
all the segments, search can traverse the single smaller graph for that 
specific tenant.
   
   
   Alternative to this can be by using filters with k-NN, but in filters we 
need to run the query and collect all the filters docs before going in the HNSW 
search. This adds extra latency and also reduces the recall if the tenant is 
very small.
   
   @msokolov please let me know if you have more questions happy to ans that. 
:) 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-17 Thread via GitHub


epotyom commented on code in PR #14204:
URL: https://github.com/apache/lucene/pull/14204#discussion_r1958677872


##
lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
##
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {

Review Comment:
   >I'm a bit confused as to what your suggestion is then.
   
   I think that changing `NO_MORE_ORDS` constant value to `Long.MAX_VALUE` or 
`Long.MIN_VALUE` solves this problem (after changing facet ord type from int to 
long). With min bucket width > 1, we will never have bucket ID that is equal to 
Long.MAX_VALUE or Long.MIN_VALUE.
   
   > So in the end, it would make sense to have two histogram implementations, 
one as a cutter, that densifies ordinals using a hash table that is similar to 
your branch, and another one as a raw collector manager for users who are 
interested in computing counts per histogram bucket but nothing else like this 
PR?
   
   It seems reasonable. And we can look into changing facet ord type to long as 
a separate issue.
   
   > I'm running the geonames benchmark, that consists or running the 
IndexGeoNames#main then SearchGeoNames#main.
   
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


houserjohn commented on code in PR #14238:
URL: https://github.com/apache/lucene/pull/14238#discussion_r1958855333


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java:
##
@@ -76,13 +95,241 @@ public void 
testComputeDynamicNumericRangesWithOneLargeWeight() {
 long[] values = new long[] {45, 32, 52, 14, 455, 342, 53};
 long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534};
 
-// value 14 has its own bin since the weight is large, and the rest of 
values fall the other bin
-expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 
14L, 14L, 14D));
+// value 14 has its own bin since the weight is large, and the rest of the 
values fall in the
+// other bin
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 
14L, 14L, 14D));
 expectedRangeInfoList.add(
-new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 
163.16D));
+new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 
163.16D));
 assertDynamicNumericRangeResults(values, weights, 4, 55109, 
expectedRangeInfoList);
   }
 
+  public void testComputeDynamicNumericRangesWithLargeTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// More requested ranges (TopN) than values should return ranges with 
weights larger than the
+// average weight - excluding the last range
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 
277L, 277L, 277D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 
439L, 439L, 439D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 42, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSingleTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// A single requested range (TopN) should return a single range regardless 
of the weights
+// provided
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 
277L, 794L, 499.25D));
+assertDynamicNumericRangeResults(values, weights, 1, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithTwoTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// Two requested ranges (TopN) should return two ranges where the first 
range's weight is equal
+// or larger than half of the total weight
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 
277L, 439L, 358.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 2, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameWeightsShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = i;
+  weights[i] = 50;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
0L, 24L, 12.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
25L, 49L, 37.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
50L, 74L, 62.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
75L, 99L, 87.0D));
+assertDynamicNumericRangeResults(values, weights, 4, 5000L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameValuesShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long totalWeight = 0;
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = 50;
+  weights[i] = i;
+  totalWeight += i;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 
50L, 50L, 50D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 
50L, 50L, 50D));

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-17 Thread via GitHub


msokolov commented on PR #14226:
URL: https://github.com/apache/lucene/pull/14226#issuecomment-2664096507

   @benwtrent I'm curious how you managed to re-use the scores. I poked around 
a little and I guess it requires some new plumbing / class casts since the API 
isn't really designed for it? Do you have a patch you wouldn't mind posting?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] [Feature] Add support for passing extra information with KNNVectorField [lucene]

2025-02-17 Thread via GitHub


navneet1v commented on issue #14247:
URL: https://github.com/apache/lucene/issues/14247#issuecomment-2664091298

   > If the use case is multitenancy, it seems you would never want to search 
across tenants, so this would apply not only to KNN search but to all kinds of 
search? I agree the impact on KNN search is outsized, but would it make sense 
to build a separate index per tenant?
   
   @msokolov 
   multi-tenancy is just one use case and I added it as an example. Having 
separate indices make sense for smaller number of tenants but if the number of 
tenants goes to millions then multiple indices is not a great solution as a lot 
of pressure on the operating system in terms of managing files and directories. 
   
   If as a user I am just doing vector search and not other searches like text 
then this kind of solution is more intuitive and easy to use. 
   
   
   BTW there is another way to solve this problem, if the Lucene Codec Writers 
can get access to other fields, then we don't need to pass any information with 
vector. We can put let say a tenant id in BDV and then access it during the 
HNSWWriter to build the correct graph. WDYT about that? 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-17 Thread via GitHub


benwtrent commented on PR #14226:
URL: https://github.com/apache/lucene/pull/14226#issuecomment-2664103686

   @msokolov I can provide a quick patch tomorrow against main. As I said, it 
would work for stuff as it is now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


gsmiller commented on PR #14238:
URL: https://github.com/apache/lucene/pull/14238#issuecomment-2664293726

   Just merged to main. Thanks again @houserjohn 🎉   (I marked this under the 
10.2 milestone based on the CHANGES entry; I'll keep an eye out for a backport 
PR and/or can help with this as needed)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


gsmiller merged PR #14238:
URL: https://github.com/apache/lucene/pull/14238


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


gsmiller commented on code in PR #14238:
URL: https://github.com/apache/lucene/pull/14238#discussion_r1958713253


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java:
##
@@ -76,13 +95,243 @@ public void 
testComputeDynamicNumericRangesWithOneLargeWeight() {
 long[] values = new long[] {45, 32, 52, 14, 455, 342, 53};
 long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534};
 
-// value 14 has its own bin since the weight is large, and the rest of 
values fall the other bin
-expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 
14L, 14L, 14D));
+// value 14 has its own bin since the weight is large, and the rest of the 
values fall in the
+// other bin
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 
14L, 14L, 14D));
 expectedRangeInfoList.add(
-new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 
163.16D));
+new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 
163.16D));
 assertDynamicNumericRangeResults(values, weights, 4, 55109, 
expectedRangeInfoList);
   }
 
+  public void testComputeDynamicNumericRangesWithLargeTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// More requested ranges (TopN) than values should return ranges with 
weights larger than the
+// average weight - excluding the last range
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 
277L, 277L, 277D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 
439L, 439L, 439D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 42, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSingleTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// A single requested range (TopN) should return a single range regardless 
of the weights
+// provided
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 
277L, 794L, 499.25D));
+assertDynamicNumericRangeResults(values, weights, 1, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithTwoTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// Two requested ranges (TopN) should return two ranges where the first 
range's weight is equal
+// or larger than half of the total weight
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 
277L, 439L, 358.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 2, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameWeightsShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = i;
+  weights[i] = 50;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
0L, 24L, 12.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
25L, 49L, 37.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
50L, 74L, 62.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
75L, 99L, 87.0D));
+assertDynamicNumericRangeResults(values, weights, 4, 5000L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameValuesShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long totalWeight = 0;
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = 50;
+  weights[i] = i;
+  totalWeight += i;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 
50L, 50L, 50D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 
50L, 50L, 50D));
+

Re: [I] [Feature] Add support for passing extra information with KNNVectorField [lucene]

2025-02-17 Thread via GitHub


msokolov commented on issue #14247:
URL: https://github.com/apache/lucene/issues/14247#issuecomment-2664078595

   If the use case is multitenancy, it seems you would never want to search 
across tenants, so this would apply not only to KNN search but to all kinds of 
search? I agree the impact on KNN search is outsized, but would it make sense 
to build a separate index per tenant?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


houserjohn commented on code in PR #14238:
URL: https://github.com/apache/lucene/pull/14238#discussion_r1958838623


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java:
##
@@ -76,13 +95,243 @@ public void 
testComputeDynamicNumericRangesWithOneLargeWeight() {
 long[] values = new long[] {45, 32, 52, 14, 455, 342, 53};
 long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534};
 
-// value 14 has its own bin since the weight is large, and the rest of 
values fall the other bin
-expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 
14L, 14L, 14D));
+// value 14 has its own bin since the weight is large, and the rest of the 
values fall in the
+// other bin
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 
14L, 14L, 14D));
 expectedRangeInfoList.add(
-new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 
163.16D));
+new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 
163.16D));
 assertDynamicNumericRangeResults(values, weights, 4, 55109, 
expectedRangeInfoList);
   }
 
+  public void testComputeDynamicNumericRangesWithLargeTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// More requested ranges (TopN) than values should return ranges with 
weights larger than the
+// average weight - excluding the last range
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 
277L, 277L, 277D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 
439L, 439L, 439D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 42, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSingleTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// A single requested range (TopN) should return a single range regardless 
of the weights
+// provided
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 
277L, 794L, 499.25D));
+assertDynamicNumericRangeResults(values, weights, 1, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithTwoTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// Two requested ranges (TopN) should return two ranges where the first 
range's weight is equal
+// or larger than half of the total weight
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 
277L, 439L, 358.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 2, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameWeightsShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = i;
+  weights[i] = 50;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
0L, 24L, 12.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
25L, 49L, 37.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
50L, 74L, 62.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
75L, 99L, 87.0D));
+assertDynamicNumericRangeResults(values, weights, 4, 5000L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameValuesShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long totalWeight = 0;
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = 50;
+  weights[i] = i;
+  totalWeight += i;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 
50L, 50L, 50D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 
50L, 50L, 50D));

Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


gsmiller commented on code in PR #14238:
URL: https://github.com/apache/lucene/pull/14238#discussion_r1958849147


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java:
##
@@ -76,13 +95,241 @@ public void 
testComputeDynamicNumericRangesWithOneLargeWeight() {
 long[] values = new long[] {45, 32, 52, 14, 455, 342, 53};
 long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534};
 
-// value 14 has its own bin since the weight is large, and the rest of 
values fall the other bin
-expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 
14L, 14L, 14D));
+// value 14 has its own bin since the weight is large, and the rest of the 
values fall in the
+// other bin
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 
14L, 14L, 14D));
 expectedRangeInfoList.add(
-new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 
163.16D));
+new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 
163.16D));
 assertDynamicNumericRangeResults(values, weights, 4, 55109, 
expectedRangeInfoList);
   }
 
+  public void testComputeDynamicNumericRangesWithLargeTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// More requested ranges (TopN) than values should return ranges with 
weights larger than the
+// average weight - excluding the last range
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 
277L, 277L, 277D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 
439L, 439L, 439D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 42, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSingleTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// A single requested range (TopN) should return a single range regardless 
of the weights
+// provided
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 
277L, 794L, 499.25D));
+assertDynamicNumericRangeResults(values, weights, 1, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithTwoTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// Two requested ranges (TopN) should return two ranges where the first 
range's weight is equal
+// or larger than half of the total weight
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 
277L, 439L, 358.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 2, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameWeightsShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = i;
+  weights[i] = 50;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
0L, 24L, 12.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
25L, 49L, 37.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
50L, 74L, 62.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
75L, 99L, 87.0D));
+assertDynamicNumericRangeResults(values, weights, 4, 5000L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameValuesShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long totalWeight = 0;
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = 50;
+  weights[i] = i;
+  totalWeight += i;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 
50L, 50L, 50D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 
50L, 50L, 50D));
+

Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


houserjohn commented on code in PR #14238:
URL: https://github.com/apache/lucene/pull/14238#discussion_r1958839222


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java:
##
@@ -76,13 +95,243 @@ public void 
testComputeDynamicNumericRangesWithOneLargeWeight() {
 long[] values = new long[] {45, 32, 52, 14, 455, 342, 53};
 long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534};
 
-// value 14 has its own bin since the weight is large, and the rest of 
values fall the other bin
-expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 
14L, 14L, 14D));
+// value 14 has its own bin since the weight is large, and the rest of the 
values fall in the
+// other bin
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 
14L, 14L, 14D));
 expectedRangeInfoList.add(
-new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 
163.16D));
+new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 
163.16D));
 assertDynamicNumericRangeResults(values, weights, 4, 55109, 
expectedRangeInfoList);
   }
 
+  public void testComputeDynamicNumericRangesWithLargeTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// More requested ranges (TopN) than values should return ranges with 
weights larger than the
+// average weight - excluding the last range
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 
277L, 277L, 277D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 
439L, 439L, 439D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 42, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSingleTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// A single requested range (TopN) should return a single range regardless 
of the weights
+// provided
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 
277L, 794L, 499.25D));
+assertDynamicNumericRangeResults(values, weights, 1, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithTwoTopN() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[] {487, 439, 794, 277};
+long[] weights = new long[] {59, 508, 736, 560};
+
+// Two requested ranges (TopN) should return two ranges where the first 
range's weight is equal
+// or larger than half of the total weight
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 
277L, 439L, 358.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 
487L, 794L, 640.5D));
+assertDynamicNumericRangeResults(values, weights, 2, 1863L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameWeightsShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = i;
+  weights[i] = 50;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
0L, 24L, 12.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
25L, 49L, 37.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
50L, 74L, 62.0D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 
75L, 99L, 87.0D));
+assertDynamicNumericRangeResults(values, weights, 4, 5000L, 
expectedRangeInfoList);
+  }
+
+  public void testComputeDynamicNumericRangesWithSameValuesShuffled() {
+List expectedRangeInfoList = new 
ArrayList<>();
+long totalWeight = 0;
+long[] values = new long[100];
+long[] weights = new long[100];
+for (int i = 0; i < 100; i++) {
+  values[i] = 50;
+  weights[i] = i;
+  totalWeight += i;
+}
+
+// Shuffling the values and weights should not change the answer between 
runs
+// We expect that returned ranges should come in a strict, deterministic 
order
+// with the same values and weights
+shuffleValuesWeights(values, weights);
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 
50L, 50L, 50D));
+expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 
50L, 50L, 50D));

Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


stefanvodita commented on code in PR #14238:
URL: https://github.com/apache/lucene/pull/14238#discussion_r1957977134


##
lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java:
##
@@ -285,6 +285,23 @@ private static void 
assertDynamicNumericRangeValidProperties(
   pairOffset += count;
 }
 
+// The minimum/maximum of each range is actually the smallest/largest value
+for (int pairOffset = 0, rangeIdx = 0; rangeIdx < 
mockDynamicRangeResult.size(); rangeIdx++) {
+  DynamicRangeUtil.DynamicRangeInfo rangeInfo = 
mockDynamicRangeResult.get(rangeIdx);
+  int count = rangeInfo.count();
+  long min = Long.MAX_VALUE;
+  long max = Long.MIN_VALUE;
+  for (int i = pairOffset; i < pairOffset + count; i++) {

Review Comment:
   Seeing the other change in this file - why isn't it applicable here? If we 
know the pairs are sorted, don't we also know where the min and the max are 
already?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Figure out why hunspell tests occasionally fail and make them more consistent [lucene]

2025-02-17 Thread via GitHub


dweiss commented on issue #14235:
URL: https://github.com/apache/lucene/issues/14235#issuecomment-2662323672

   Ok, the scheduled actions works as expected:
   https://github.com/apache/lucene/actions/runs/13362768562


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]

2025-02-17 Thread via GitHub


epotyom commented on PR #14237:
URL: https://github.com/apache/lucene/pull/14237#issuecomment-2662789904

   Thanks for reviewing @stefanvodita !
   
   > I have some non-blocking concerns that this is a lot of new code, with 
more abstractions, and it doesn't have a user waiting to use it. There are 
arguments against those concerns as well and I personally like the new 
aggregations, but I thought it was worth raising the point.
   
   I agree with the concern, at the same time it might be one of these chicken 
and egg cases when users avoid using the feature just because it doesn't have 
simple enough API for simple use cases?
   
   One use case for the new utils I have in mind is luceneutil perf tests - it 
puts me off that [we need 90 
lines](https://github.com/mikemccand/luceneutil/blob/dac76d58c579bf440ff4c339622c236202ec65d0/src/main/perf/SearchTask.java#L261-L352)
 to test performance of the new API, while for existing facets module it is 
[only two 
lines](https://github.com/mikemccand/luceneutil/blob/dac76d58c579bf440ff4c339622c236202ec65d0/src/main/perf/SearchTask.java#L403-L404).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]

2025-02-17 Thread via GitHub


epotyom commented on code in PR #14237:
URL: https://github.com/apache/lucene/pull/14237#discussion_r1958053677


##
lucene/demo/src/java/org/apache/lucene/demo/facet/SandboxFacetsExample.java:
##
@@ -130,6 +135,88 @@ void index() throws IOException {
 IOUtils.close(indexWriter, taxoWriter);
   }
 
+  /**
+   * Example for {@link FacetBuilder} usage - simple API that provides results 
in a format very
+   * similar to classic facets module. It doesn't give all flexibility 
available with {@link

Review Comment:
   Good idea, I'll do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]

2025-02-17 Thread via GitHub


houserjohn commented on code in PR #14238:
URL: https://github.com/apache/lucene/pull/14238#discussion_r1957934311


##
lucene/CHANGES.txt:
##
@@ -20,6 +20,7 @@ New Features
 
 Improvements
 -
+* GITHUB#14238: Improve test coverage of Dynamic Range Faceting. (John Houser)

Review Comment:
   Addressed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-17 Thread via GitHub


epotyom commented on code in PR #14204:
URL: https://github.com/apache/lucene/pull/14204#discussion_r1958022460


##
lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
##
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {

Review Comment:
   > Then it would be nice to make ordinals `long`s rather than `int`
   
   It's an interesting idea. The issue is many facet implementations don't need 
`long` ordinals, e.g. Taxonomy or SSDV facets counting use `int`s, even though 
at least in some cases they index as `long`s. It looks like  using `long` for 
them might be wasteful. Overall, using `long` for something that is essential a 
group id in a grouping mechanism seems excessive.
   
   At the same time `long` facet ordinals can simplify and improve performance 
for some FacetCutter implementations, in particular `LongValueFacetCutter` . 
Also, as you've mentioned, it is FacetRecorder responsibility to keep counts in 
a dense data structure, so it might be fine to move to `long`.
   
   In any case it requires a separate effort, and I think we should run 
[luceneutil #325](https://github.com/mikemccand/luceneutil/pull/325) and Amazon 
internal perf tests before making the decision. I can create an issue for it.
   
   > ordinals seem to be expected to be positive
   
   Yes, it is also a limitation in the current API. IIRC the only thing that 
relies on it is `OrdinalIterator#NO_MORE_ORDS = -1`. We can probably reserve 
some other value for it, e.g. `Long.MAX_VALUE` or `MIN_VALUE` , it should work 
for most cases, including histogram since bucketWidth has to be greater than 2. 
It's a bit fragile for LongValueFacetCutter - we'd have to throw runtime error 
is the value in the index is `NO_MORE_ORDS`. Although the implementation is 
already fragile as it uses LongIntHashMap which [size is limited by array max 
size](https://github.com/apache/lucene/blob/05fc9a834d85fdbefbaf4545fb34ab0f38c0c2a7/lucene/core/src/java/org/apache/lucene/internal/hppc/HashContainers.java#L48).
  So I suppose it's not too terrible to not allow NO_MORE_ORDS value when 
counting.
   
   > Separately I played with the quick/dirty benchmark I had created, which 
seems to have got a bit more than 2x slower.
   
   I'd like to look into it - maybe there is something we can improve. Are you 
running runGeoBench.cmd to get results?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add histogram facet capabilities. [lucene]

2025-02-17 Thread via GitHub


epotyom commented on code in PR #14204:
URL: https://github.com/apache/lucene/pull/14204#discussion_r1958022460


##
lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
##
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {

Review Comment:
   > Then it would be nice to make ordinals `long`s rather than `int`
   
   It's an interesting idea. The issue is many facet implementations don't need 
`long` ordinals, e.g. Taxonomy or SSDV facets counting use `int`s, even though 
at least in some cases they index as `long`s. It looks like  using `long` for 
them might be wasteful. Overall, using `long` for something that is essential a 
group id in a grouping mechanism seems excessive.
   
   At the same time `long` facet ordinals can simplify and improve performance 
for some FacetCutter implementations, in particular `LongValueFacetCutter` . 
Also, as you've mentioned, it is FacetRecorder responsibility to keep counts in 
a dense data structure, so it might be fine to move to `long`.
   
   In any case it requires a separate effort, and I think we should run 
[luceneutil #325](https://github.com/mikemccand/luceneutil/pull/325) and Amazon 
internal perf tests before making the decision. I can create an issue for it.
   
   > ordinals seem to be expected to be positive
   
   Yes, it is also a limitation in the current API. IIRC the only thing that 
relies on it is `OrdinalIterator#NO_MORE_ORDS = -1`. We can probably reserve 
some other value for it, e.g. `Long.MAX_VALUE` or `MIN_VALUE` , it should work 
for most cases, including histogram since bucketWidth has to be greater than 2. 
It's a bit fragile for LongValueFacetCutter - we'd have to throw runtime error 
is the value in the index is `NO_MORE_ORDS`. Although the implementation is 
already fragile as it uses LongIntHashMap which [size is limited by an array 
size](https://github.com/apache/lucene/blob/05fc9a834d85fdbefbaf4545fb34ab0f38c0c2a7/lucene/core/src/java/org/apache/lucene/internal/hppc/HashContainers.java#L48).
  So I suppose it's not too terrible to not allow NO_MORE_ORDS value when 
counting.
   
   > Separately I played with the quick/dirty benchmark I had created, which 
seems to have got a bit more than 2x slower.
   
   I'd like to look into it - maybe there is something we can improve. Are you 
running runGeoBench.cmd to get results?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Accept nodes where score == Math.nextUp(results.minCompetitiveSimilarity()) after #14215 [lucene]

2025-02-17 Thread via GitHub


iverase opened a new pull request, #14249:
URL: https://github.com/apache/lucene/pull/14249

   In https://github.com/apache/lucene/pull/12770 we refine how we were 
searching the graph by only considering nodes that were making the values of 
the score better, basically the condition in line 290 become `friendSimilarity 
> minAcceptedSimilarity`.
   
   In #14215, to improve the termination check and avoid edge cases when having 
many duplicated vectors, we change how we compute the minAcceptedSimilarity to 
`Math.nextUp(results.minCompetitiveSimilarity())`. 
   
   Because we didn't change the first condition, now we are not considering 
nodes where `friendSimilarity 
== minAcceptedSimilarity` although those nodes will have better score?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org