Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]
HoustonPutman commented on code in PR #13914: URL: https://github.com/apache/lucene/pull/13914#discussion_r1958610123 ## lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java: ## @@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) { * is used to compute the equi-weight per bin. */ public static List computeDynamicNumericRanges( - long[] values, long[] weights, int len, long totalWeight, int topN) { + long[] values, long[] weights, int len, long totalValue, long totalWeight, int topN) { assert values.length == weights.length && len <= values.length && len >= 0; assert topN >= 0; List dynamicRangeResult = new ArrayList<>(); if (len == 0 || topN == 0) { return dynamicRangeResult; } -new InPlaceMergeSorter() { - @Override - protected int compare(int index1, int index2) { -int cmp = Long.compare(values[index1], values[index2]); -if (cmp == 0) { - // If the values are equal, sort based on the weights. - // Any weight order is correct as long as it's deterministic. - return Long.compare(weights[index1], weights[index2]); -} -return cmp; - } +double rangeWeightTarget = (double) totalWeight / topN; +double[] kWeights = new double[topN]; +for (int i = 0; i < topN; i++) { + kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget; +} - @Override - protected void swap(int index1, int index2) { -long tmp = values[index1]; -values[index1] = values[index2]; -values[index2] = tmp; -tmp = weights[index1]; -weights[index1] = weights[index2]; -weights[index2] = tmp; - } -}.sort(0, len); +WeightedSelector.WeightRangeInfo[] kIndexResults = +new WeightedSelector() { + private long pivotValue; + private long pivotWeight; -long accuWeight = 0; -long valueSum = 0; -int count = 0; -int minIdx = 0; + @Override + protected long getWeight(int i) { +return weights[i]; + } -double rangeWeightTarget = (double) totalWeight / Math.min(topN, len); + @Override + protected long getValue(int i) { +return values[i]; + } -for (int i = 0; i < len; i++) { - accuWeight += weights[i]; - valueSum += values[i]; - count++; + @Override + protected void swap(int index1, int index2) { +long tmp = values[index1]; +values[index1] = values[index2]; +values[index2] = tmp; +tmp = weights[index1]; +weights[index1] = weights[index2]; +weights[index2] = tmp; + } - if (accuWeight >= rangeWeightTarget) { + @Override + protected void setPivot(int i) { +pivotValue = values[i]; +pivotWeight = weights[i]; + } + + @Override + protected int comparePivot(int j) { +int cmp = Long.compare(pivotValue, values[j]); +if (cmp == 0) { + // If the values are equal, sort based on the weights. + // Any weight order is correct as long as it's deterministic. + return Long.compare(pivotWeight, weights[j]); +} +return cmp; + } +}.select(0, len, totalValue, 0, totalWeight, 0, kWeights); + +int lastIdx = -1; +long lastTotalValue = 0; +long lastTotalWeight = 0; +for (int kIdx = 0; kIdx < topN; kIdx++) { Review Comment: Sounds good! It's pretty trivial to switch over to the other implementation, so happy to test that out when the benchmark is available! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]
HoustonPutman commented on PR #13914: URL: https://github.com/apache/lucene/pull/13914#issuecomment-2663827155

> I know you mentioned there is a change in behavior in the caveat, but I do believe that this example should probably return ranges with equal counts.

This one was a `<=` that should have been a `<`. I've fixed it and the tests pass.

> I've tried to track down this bug, and I haven't quite fixed it, but I believe the fix is related to these lines:

Yeah this one was an issue with floating point math. I've set it such that the last quantile will always be the total, no need to do math for that. The test still returns different results, but at least it fails without an exception.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
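For illustration, a minimal sketch of the quantile-target idea described above: cumulative equi-weight targets where the last target is pinned to the exact total weight, so floating-point accumulation cannot drift past the final value. This is a hypothetical sketch, not the PR's actual code, and all names are illustrative.

```java
// Hypothetical sketch (not the PR's code): equi-weight quantile targets for topN
// ranges, with the last target set to the exact total weight instead of being
// accumulated, so floating-point rounding cannot leave the final range short.
final class WeightTargets {
  static double[] compute(long totalWeight, int topN) {
    double perRange = (double) totalWeight / topN;
    double[] targets = new double[topN];
    for (int i = 0; i < topN - 1; i++) {
      targets[i] = perRange * (i + 1); // cumulative target for range i
    }
    targets[topN - 1] = totalWeight; // the last quantile is always the total
    return targets;
  }
}
```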
Re: [PR] Relation check within 1D BKD Leaves [lucene]
jpountz commented on PR #14244: URL: https://github.com/apache/lucene/pull/14244#issuecomment-2663735068 I understand the idea, but this looks a bit like a hack to me. I'm also not too fond of optimizing point-in-set queries, as terms would be a better way to index the data than points if the goal is to search for exact values? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add histogram facet capabilities. [lucene]
jpountz commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1958608604

## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:
## @@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {

Review Comment:

> So I suppose it's not too terrible to not allow NO_MORE_ORDS value when counting.

I'm a bit confused as to what your suggestion is then. For instance, if the value is -5 and the interval is 10, the bucket would be computed as `Math.floorDiv(-5, 10)`, which returns -1.

Based on your points, it seems like the least fragile approach would be to make the facet cutter responsible for densifying bucket ordinals? So in the end, it would make sense to have two histogram implementations: one as a cutter, that densifies ordinals using a hash table, similar to your branch, and another one as a raw collector manager for users who are interested in computing counts per histogram bucket but nothing else, like this PR?

> Are you running runGeoBench.cmd to get results?

I'm running the geonames benchmark, that consists of running IndexGeoNames#main then SearchGeoNames#main.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
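To make the bucketing arithmetic above concrete, here is a small self-contained example (assumed semantics for illustration only, not the collector's actual code). It shows why `Math.floorDiv` places negative values in the expected bucket, whereas plain integer division would truncate -5 / 10 to 0.

```java
// Assumed bucketing semantics for illustration: bucket = floorDiv(value, interval).
public class BucketDemo {
  public static void main(String[] args) {
    long interval = 10;
    long[] values = {-15, -5, 0, 9, 10};
    for (long value : values) {
      long bucket = Math.floorDiv(value, interval); // -5 -> bucket -1, unlike -5 / 10 == 0
      System.out.println(value + " -> bucket " + bucket);
    }
  }
}
```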
Re: [PR] Introduce multiSelect for ScalarQuantizer [lucene]
HoustonPutman closed pull request #13919: Introduce multiSelect for ScalarQuantizer URL: https://github.com/apache/lucene/pull/13919 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]
stefanvodita commented on code in PR #14237: URL: https://github.com/apache/lucene/pull/14237#discussion_r1958258387 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/utils/BaseFacetBuilder.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.sandbox.facet.utils; Review Comment: I get your argument. Let's keep it as is then. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] [Feature] Add support for passing extra information with KNNVectorField [lucene]
msokolov commented on issue #14247: URL: https://github.com/apache/lucene/issues/14247#issuecomment-2663422716 I wonder if there might be a better way to accomplish your actual goal. Adding "extra data" doesn't seem like a good idea to me since it inherently blurs the function of the data format. Can you describe the intended use case more fully? I guess this is some kind of clustering? What does that mean? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]
epotyom commented on code in PR #14237: URL: https://github.com/apache/lucene/pull/14237#discussion_r1958063799 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/utils/BaseFacetBuilder.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.sandbox.facet.utils; Review Comment: I want to keep `*FacetOrchstrator`s in the same package as `FacetBuilder`s, so that we can keep some `FacetBuilder` method package-private. So the only file that maybe doesn't belong in the same package is `ComparableUtils.java`. At the same time the idea of ComparableUtils is similar - make it easier for users to use the new module, so I thought we could reduce number of packages and keep meaningful package structure at the same time... But I don't feel strongly about it. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]
benwtrent commented on PR #14160: URL: https://github.com/apache/lucene/pull/14160#issuecomment-2663547774

> But the only question is whether we could somehow make it available in on 10.x since it would be a behavior change? I'm not sure if that question is still relevant though. If the results are uniformly better and there is no API breakage, would there be any concern with backport to 10.x?

@msokolov I did backport to 10.x, but due to the recall differences I kept the current default behavior in 10.x. But folks can opt in to use it by supplying a query-time parameter. I do think we can get uniformly better results eventually, but I thought it was "good enough" for now.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove duplicates from the hnsw recall testing [lucene]
msokolov commented on PR #14234: URL: https://github.com/apache/lucene/pull/14234#issuecomment-2663482993 thanks, @benwtrent ... this test was a bit hacky. Makes sense to remove noisy data from a test like this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]
msokolov commented on PR #14160: URL: https://github.com/apache/lucene/pull/14160#issuecomment-2663491200

> Do you think the behavior change/result change is worth waiting for a major?

I do think folks should be able to use this now, but be able to opt out.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Use Vector API to decode BKD docIds [lucene]
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2663774846

Thanks for running benchmarks. I'm confused as to why running inner loops of size 512 would be so much better than inner loops of size 32. This doesn't feel right? Does luceneutil also report worse performance with the inner loop of size 32?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]
epotyom commented on code in PR #14237: URL: https://github.com/apache/lucene/pull/14237#discussion_r1958141105 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/utils/CommonFacetBuilder.java: ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.sandbox.facet.utils; + +import java.io.IOException; +import org.apache.lucene.sandbox.facet.cutters.FacetCutter; +import org.apache.lucene.sandbox.facet.labels.OrdToLabel; + +/** + * Common {@link FacetBuilder} that works with any {@link FacetCutter} and {@link OrdToLabel} + * provided by user. + */ +public final class CommonFacetBuilder extends BaseFacetBuilder { + + private final FacetCutter cutter; + private final OrdToLabel ordToLabel; + + public CommonFacetBuilder(String dimension, FacetCutter cutter, OrdToLabel ordToLabel) { +super(dimension); +this.cutter = cutter; +this.ordToLabel = ordToLabel; + } + + @Override + FacetCutter createFacetCutter() { Review Comment: Ah you are right, this method doesn't have to create FacetCutter, e.g. FacetCutter might be created in advance and returned by the method. Renamed to `getFacetCutter` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
houserjohn commented on code in PR #14238: URL: https://github.com/apache/lucene/pull/14238#discussion_r1958559535 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java: ## @@ -285,6 +285,23 @@ private static void assertDynamicNumericRangeValidProperties( pairOffset += count; } +// The minimum/maximum of each range is actually the smallest/largest value +for (int pairOffset = 0, rangeIdx = 0; rangeIdx < mockDynamicRangeResult.size(); rangeIdx++) { + DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(rangeIdx); + int count = rangeInfo.count(); + long min = Long.MAX_VALUE; + long max = Long.MIN_VALUE; + for (int i = pairOffset; i < pairOffset + count; i++) { Review Comment: Good point. Added in latest revision. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Binary vector format for flat and hnsw vectors [lucene]
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2664639299

As we have consistently emphasized in both public and private communications, we are concerned that the **OSQ method employs an idea highly similar to the one presented in our [extended RaBitQ paper](https://arxiv.org/pdf/2409.09913), which was published months earlier than OSQ - however, Elastic does not acknowledge extended RaBitQ as a prior art but as a parallel work**.

Specifically, the high-level idea of OSQ is to try different parameters for scalar quantization on a per-vector basis as described in Elastic’s reply. This idea is highly similar to one of the key ideas of our extended RaBitQ (as described in Section 3.2.2 of our extended RaBitQ paper): it involves testing various parameters for each individual vector, performing rounding (i.e., scalar quantization) based on these parameters, and then calculating the similarity between the rounded vector and the original vector. It finally selects the optimal parameter and corresponding rounded vector. With this, our extended RaBitQ method achieves significant performance improvements over LVQ, the state-of-the-art variant of scalar quantization by then. For example, under 2-bit quantization, the distance estimation error of LVQ is larger than that of the extended RaBitQ by orders of magnitude.

In addition, Elastic insists on categorizing our RaBitQ and its extension methods as a variant of PQ to highlight the differences from OSQ, despite our repeated refutation of this point in our responses in private meetings. We believe we have made it clear that our RaBitQ and its extension are not variants of PQ, but provide optimized methods of binary and scalar quantization, respectively.

We respect OSQ’s differences from extended RaBitQ and would be pleased to see if they can make further improvements. However, these differences do not negate the similarity between the two methods. At the same time, we hope Elastic will respect (1) the significant performance improvements brought by our extended RaBitQ method in the realm of scalar quantization and (2) the fact that our extended RaBitQ method was published months earlier than OSQ.

Finally, we would like to reiterate our two points:

1. **The so-called BBQ majorly follows our RaBitQ method;**
2. **The OSQ method (introduced in this PR) has its major idea similar to our extended RaBitQ method (published months earlier) and our extended RaBitQ method is a prior art which achieves good accuracy at 1-bit/2-bit binary/scalar quantization for the first time.**

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
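As a purely illustrative aside, the per-vector parameter search described in the comment above can be sketched in a few lines: try several quantization parameters for one vector, round with each, measure reconstruction error against the original, and keep the best. This is a generic sketch of that pattern only; it is not the code of RaBitQ, extended RaBitQ, or OSQ, and all names are hypothetical.

```java
// Generic sketch of per-vector parameter selection for scalar quantization
// (illustrative only; not any paper's or library's actual algorithm).
final class PerVectorStepSearch {
  /** Returns the candidate step whose rounded vector is closest to the original. Steps are assumed positive. */
  static double bestStep(double[] vector, double[] candidateSteps, int bits) {
    double lower = Double.POSITIVE_INFINITY;
    for (double x : vector) {
      lower = Math.min(lower, x); // per-vector lower bound used as the offset
    }
    long maxLevel = (1L << bits) - 1;
    double bestStep = candidateSteps[0];
    double bestError = Double.POSITIVE_INFINITY;
    for (double step : candidateSteps) {
      double error = 0;
      for (double x : vector) {
        long level = Math.round((x - lower) / step);        // scalar quantization (rounding)
        level = Math.max(0, Math.min(maxLevel, level));     // clamp to the available levels
        double reconstructed = lower + level * step;
        error += (x - reconstructed) * (x - reconstructed); // squared reconstruction error
      }
      if (error < bestError) {                              // keep the best-fitting parameter
        bestError = error;
        bestStep = step;
      }
    }
    return bestStep;
  }
}
```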
[PR] Improve user-facing docs. [lucene]
jpountz opened a new pull request, #14251: URL: https://github.com/apache/lucene/pull/14251

This improves user-facing docs in Lucene's package javadocs:

- Bring some docs up to date, e.g. some of them were still referring to oal.index.Fields.
- More emphasis on structured search and knn search.
- More guidance on how to index data in `oal.document`.
- Add API docs for impacts.
- Discourage changing the similarity at search time less strongly.
- Recommend using Query factory methods on Field classes instead of creating queries manually.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
gsmiller commented on PR #14238: URL: https://github.com/apache/lucene/pull/14238#issuecomment-2664264734 Looks good. Thanks @houserjohn ! I see @stefanvodita also added his approval so I will go ahead and get this merged. Thanks again! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Disable the query cache by default. [lucene]
github-actions[bot] commented on PR #14187: URL: https://github.com/apache/lucene/pull/14187#issuecomment-2664271825 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Bump floor segment size to 16MB. [lucene]
github-actions[bot] commented on PR #14189: URL: https://github.com/apache/lucene/pull/14189#issuecomment-2664271800 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new Directory implementation for AWS S3 [lucene]
github-actions[bot] commented on PR #13949: URL: https://github.com/apache/lucene/pull/13949#issuecomment-2664272130 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] [Feature] Add support for passing extra information with KNNVectorField [lucene]
navneet1v commented on issue #14247: URL: https://github.com/apache/lucene/issues/14247#issuecomment-2663976487

> I wonder if there might be a better way to accomplish your actual goal. Adding "extra data" doesn't seem like a good idea to me since it inherently blurs the function of the data format. Can you describe the intended use case more fully? I guess this is some kind of clustering? What does that mean?

As mentioned in the description, one use case I was thinking of was: let's say I have an identifier named tenantId for the vector, and while doing the search I always want to get results for that specific tenant, hence I want to build multiple HNSW graphs at a segment level, one per tenant. That way, when a search comes in for a specific tenant, rather than searching over the large graph containing all the tenants, the search can traverse the single smaller graph for that specific tenant.

An alternative to this is using filters with k-NN, but with filters we need to run the query and collect all the filtered docs before going into the HNSW search. This adds extra latency and also reduces the recall if the tenant is very small.

@msokolov please let me know if you have more questions, happy to answer them. :)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add histogram facet capabilities. [lucene]
epotyom commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1958677872 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.histogram; + +import java.io.IOException; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.index.DocValuesSkipper; +import org.apache.lucene.index.DocValuesType; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.NumericDocValues; +import org.apache.lucene.index.SortedNumericDocValues; +import org.apache.lucene.internal.hppc.LongIntHashMap; +import org.apache.lucene.search.CollectionTerminatedException; +import org.apache.lucene.search.Collector; +import org.apache.lucene.search.LeafCollector; +import org.apache.lucene.search.Scorable; +import org.apache.lucene.search.ScoreMode; + +final class HistogramCollector implements Collector { Review Comment: >I'm a bit confused as to what your suggestion is then. I think that changing `NO_MORE_ORDS` constant value to `Long.MAX_VALUE` or `Long.MIN_VALUE` solves this problem (after changing facet ord type from int to long). With min bucket width > 1, we will never have bucket ID that is equal to Long.MAX_VALUE or Long.MIN_VALUE. > So in the end, it would make sense to have two histogram implementations, one as a cutter, that densifies ordinals using a hash table that is similar to your branch, and another one as a raw collector manager for users who are interested in computing counts per histogram bucket but nothing else like this PR? It seems reasonable. And we can look into changing facet ord type to long as a separate issue. > I'm running the geonames benchmark, that consists or running the IndexGeoNames#main then SearchGeoNames#main. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
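A tiny check of the sentinel reasoning above, under the stated assumption that bucket ids come from `Math.floorDiv(value, interval)` with an interval of at least 2 (illustrative only, not the project's code):

```java
public class SentinelCheck {
  public static void main(String[] args) {
    long interval = 2; // smallest interval under the assumption above
    long largestBucket = Math.floorDiv(Long.MAX_VALUE, interval);
    long smallestBucket = Math.floorDiv(Long.MIN_VALUE, interval);
    // Real bucket ids stay strictly inside the long range, so Long.MAX_VALUE
    // (or Long.MIN_VALUE) can never collide with one and is safe as a sentinel.
    System.out.println(largestBucket < Long.MAX_VALUE);  // true
    System.out.println(smallestBucket > Long.MIN_VALUE); // true
  }
}
```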
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
houserjohn commented on code in PR #14238: URL: https://github.com/apache/lucene/pull/14238#discussion_r1958855333 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java: ## @@ -76,13 +95,241 @@ public void testComputeDynamicNumericRangesWithOneLargeWeight() { long[] values = new long[] {45, 32, 52, 14, 455, 342, 53}; long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534}; -// value 14 has its own bin since the weight is large, and the rest of values fall the other bin -expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 14L, 14L, 14D)); +// value 14 has its own bin since the weight is large, and the rest of the values fall in the +// other bin +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 14L, 14L, 14D)); expectedRangeInfoList.add( -new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 163.16D)); +new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 163.16D)); assertDynamicNumericRangeResults(values, weights, 4, 55109, expectedRangeInfoList); } + public void testComputeDynamicNumericRangesWithLargeTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// More requested ranges (TopN) than values should return ranges with weights larger than the +// average weight - excluding the last range +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 277L, 277L, 277D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 439L, 439L, 439D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 42, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSingleTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// A single requested range (TopN) should return a single range regardless of the weights +// provided +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 277L, 794L, 499.25D)); +assertDynamicNumericRangeResults(values, weights, 1, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithTwoTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// Two requested ranges (TopN) should return two ranges where the first range's weight is equal +// or larger than half of the total weight +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 277L, 439L, 358.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 2, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameWeightsShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = i; + weights[i] = 50; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 0L, 24L, 12.0D)); +expectedRangeInfoList.add(new 
DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 25L, 49L, 37.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 50L, 74L, 62.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 75L, 99L, 87.0D)); +assertDynamicNumericRangeResults(values, weights, 4, 5000L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameValuesShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long totalWeight = 0; +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = 50; + weights[i] = i; + totalWeight += i; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 50L, 50L, 50D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 50L, 50L, 50D));
Re: [PR] OptimisticKnnVectorQuery [lucene]
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2664096507 @benwtrent I'm curious how you managed to re-use the scores. I poked around a little and I guess it requires some new plumbing / class casts since the API isn't really designed for it? Do you have a patch you wouldn't mind posting? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] [Feature] Add support for passing extra information with KNNVectorField [lucene]
navneet1v commented on issue #14247: URL: https://github.com/apache/lucene/issues/14247#issuecomment-2664091298

> If the use case is multitenancy, it seems you would never want to search across tenants, so this would apply not only to KNN search but to all kinds of search? I agree the impact on KNN search is outsized, but would it make sense to build a separate index per tenant?

@msokolov multi-tenancy is just one use case, and I added it as an example. Having separate indices makes sense for a smaller number of tenants, but if the number of tenants goes into the millions then multiple indices is not a great solution, as it puts a lot of pressure on the operating system in terms of managing files and directories. If, as a user, I am just doing vector search and not other searches like text, then this kind of solution is more intuitive and easy to use.

BTW, there is another way to solve this problem: if the Lucene Codec Writers can get access to other fields, then we don't need to pass any information with the vector. We can put, let's say, a tenant id in BDV (binary doc values) and then access it during the HNSWWriter to build the correct graph. WDYT about that?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
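As a rough illustration of the doc-values-based grouping idea above, the sketch below buckets per-segment doc ids by tenant before building one graph per tenant. It uses plain Java collections only; no actual Lucene writer API is implied, and the tenant lookup function is a hypothetical stand-in for reading a per-document value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch: group per-segment doc ids by tenant id so that one
// vector graph could be built per tenant instead of one graph for all docs.
final class TenantGrouping {
  static Map<String, List<Integer>> groupByTenant(int maxDoc, IntFunction<String> tenantIdForDoc) {
    Map<String, List<Integer>> docsByTenant = new HashMap<>();
    for (int doc = 0; doc < maxDoc; doc++) {
      String tenant = tenantIdForDoc.apply(doc); // stand-in for a doc-values read
      docsByTenant.computeIfAbsent(tenant, t -> new ArrayList<>()).add(doc);
    }
    return docsByTenant; // build one graph per entry
  }
}
```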
Re: [PR] OptimisticKnnVectorQuery [lucene]
benwtrent commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2664103686 @msokolov I can provide a quick patch tomorrow against main. As I said, it would work for stuff as it is now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
gsmiller commented on PR #14238: URL: https://github.com/apache/lucene/pull/14238#issuecomment-2664293726 Just merged to main. Thanks again @houserjohn 🎉 (I marked this under the 10.2 milestone based on the CHANGES entry; I'll keep an eye out for a backport PR and/or can help with this as needed) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
gsmiller merged PR #14238: URL: https://github.com/apache/lucene/pull/14238 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
gsmiller commented on code in PR #14238: URL: https://github.com/apache/lucene/pull/14238#discussion_r1958713253 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java: ## @@ -76,13 +95,243 @@ public void testComputeDynamicNumericRangesWithOneLargeWeight() { long[] values = new long[] {45, 32, 52, 14, 455, 342, 53}; long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534}; -// value 14 has its own bin since the weight is large, and the rest of values fall the other bin -expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 14L, 14L, 14D)); +// value 14 has its own bin since the weight is large, and the rest of the values fall in the +// other bin +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 14L, 14L, 14D)); expectedRangeInfoList.add( -new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 163.16D)); +new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 163.16D)); assertDynamicNumericRangeResults(values, weights, 4, 55109, expectedRangeInfoList); } + public void testComputeDynamicNumericRangesWithLargeTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// More requested ranges (TopN) than values should return ranges with weights larger than the +// average weight - excluding the last range +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 277L, 277L, 277D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 439L, 439L, 439D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 42, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSingleTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// A single requested range (TopN) should return a single range regardless of the weights +// provided +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 277L, 794L, 499.25D)); +assertDynamicNumericRangeResults(values, weights, 1, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithTwoTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// Two requested ranges (TopN) should return two ranges where the first range's weight is equal +// or larger than half of the total weight +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 277L, 439L, 358.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 2, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameWeightsShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = i; + weights[i] = 50; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 0L, 24L, 12.0D)); +expectedRangeInfoList.add(new 
DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 25L, 49L, 37.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 50L, 74L, 62.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 75L, 99L, 87.0D)); +assertDynamicNumericRangeResults(values, weights, 4, 5000L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameValuesShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long totalWeight = 0; +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = 50; + weights[i] = i; + totalWeight += i; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 50L, 50L, 50D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 50L, 50L, 50D)); +
Re: [I] [Feature] Add support for passing extra information with KNNVectorField [lucene]
msokolov commented on issue #14247: URL: https://github.com/apache/lucene/issues/14247#issuecomment-2664078595 If the use case is multitenancy, it seems you would never want to search across tenants, so this would apply not only to KNN search but to all kinds of search? I agree the impact on KNN search is outsized, but would it make sense to build a separate index per tenant? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
houserjohn commented on code in PR #14238: URL: https://github.com/apache/lucene/pull/14238#discussion_r1958838623 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java: ## @@ -76,13 +95,243 @@ public void testComputeDynamicNumericRangesWithOneLargeWeight() { long[] values = new long[] {45, 32, 52, 14, 455, 342, 53}; long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534}; -// value 14 has its own bin since the weight is large, and the rest of values fall the other bin -expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 14L, 14L, 14D)); +// value 14 has its own bin since the weight is large, and the rest of the values fall in the +// other bin +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 14L, 14L, 14D)); expectedRangeInfoList.add( -new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 163.16D)); +new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 163.16D)); assertDynamicNumericRangeResults(values, weights, 4, 55109, expectedRangeInfoList); } + public void testComputeDynamicNumericRangesWithLargeTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// More requested ranges (TopN) than values should return ranges with weights larger than the +// average weight - excluding the last range +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 277L, 277L, 277D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 439L, 439L, 439D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 42, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSingleTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// A single requested range (TopN) should return a single range regardless of the weights +// provided +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 277L, 794L, 499.25D)); +assertDynamicNumericRangeResults(values, weights, 1, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithTwoTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// Two requested ranges (TopN) should return two ranges where the first range's weight is equal +// or larger than half of the total weight +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 277L, 439L, 358.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 2, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameWeightsShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = i; + weights[i] = 50; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 0L, 24L, 12.0D)); +expectedRangeInfoList.add(new 
DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 25L, 49L, 37.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 50L, 74L, 62.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 75L, 99L, 87.0D)); +assertDynamicNumericRangeResults(values, weights, 4, 5000L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameValuesShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long totalWeight = 0; +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = 50; + weights[i] = i; + totalWeight += i; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 50L, 50L, 50D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 50L, 50L, 50D));
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
gsmiller commented on code in PR #14238: URL: https://github.com/apache/lucene/pull/14238#discussion_r1958849147 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java: ## @@ -76,13 +95,241 @@ public void testComputeDynamicNumericRangesWithOneLargeWeight() { long[] values = new long[] {45, 32, 52, 14, 455, 342, 53}; long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534}; -// value 14 has its own bin since the weight is large, and the rest of values fall the other bin -expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 14L, 14L, 14D)); +// value 14 has its own bin since the weight is large, and the rest of the values fall in the +// other bin +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 14L, 14L, 14D)); expectedRangeInfoList.add( -new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 163.16D)); +new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 163.16D)); assertDynamicNumericRangeResults(values, weights, 4, 55109, expectedRangeInfoList); } + public void testComputeDynamicNumericRangesWithLargeTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// More requested ranges (TopN) than values should return ranges with weights larger than the +// average weight - excluding the last range +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 277L, 277L, 277D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 439L, 439L, 439D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 42, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSingleTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// A single requested range (TopN) should return a single range regardless of the weights +// provided +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 277L, 794L, 499.25D)); +assertDynamicNumericRangeResults(values, weights, 1, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithTwoTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// Two requested ranges (TopN) should return two ranges where the first range's weight is equal +// or larger than half of the total weight +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 277L, 439L, 358.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 2, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameWeightsShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = i; + weights[i] = 50; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 0L, 24L, 12.0D)); +expectedRangeInfoList.add(new 
DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 25L, 49L, 37.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 50L, 74L, 62.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 75L, 99L, 87.0D)); +assertDynamicNumericRangeResults(values, weights, 4, 5000L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameValuesShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long totalWeight = 0; +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = 50; + weights[i] = i; + totalWeight += i; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 50L, 50L, 50D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 50L, 50L, 50D)); +
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
houserjohn commented on code in PR #14238: URL: https://github.com/apache/lucene/pull/14238#discussion_r1958839222 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java: ## @@ -76,13 +95,243 @@ public void testComputeDynamicNumericRangesWithOneLargeWeight() { long[] values = new long[] {45, 32, 52, 14, 455, 342, 53}; long[] weights = new long[] {143, 23, 1, 52343, 53, 12, 2534}; -// value 14 has its own bin since the weight is large, and the rest of values fall the other bin -expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 14L, 14L, 14D)); +// value 14 has its own bin since the weight is large, and the rest of the values fall in the +// other bin +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343L, 14L, 14L, 14D)); expectedRangeInfoList.add( -new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 163.16D)); +new DynamicRangeUtil.DynamicRangeInfo(6, 2766L, 32L, 455L, 163.16D)); assertDynamicNumericRangeResults(values, weights, 4, 55109, expectedRangeInfoList); } + public void testComputeDynamicNumericRangesWithLargeTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// More requested ranges (TopN) than values should return ranges with weights larger than the +// average weight - excluding the last range +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 277L, 277L, 277D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 439L, 439L, 439D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 42, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSingleTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// A single requested range (TopN) should return a single range regardless of the weights +// provided +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 1863L, 277L, 794L, 499.25D)); +assertDynamicNumericRangeResults(values, weights, 1, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithTwoTopN() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[] {487, 439, 794, 277}; +long[] weights = new long[] {59, 508, 736, 560}; + +// Two requested ranges (TopN) should return two ranges where the first range's weight is equal +// or larger than half of the total weight +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 1068L, 277L, 439L, 358.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D)); +assertDynamicNumericRangeResults(values, weights, 2, 1863L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameWeightsShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = i; + weights[i] = 50; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 0L, 24L, 12.0D)); +expectedRangeInfoList.add(new 
DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 25L, 49L, 37.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 50L, 74L, 62.0D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 75L, 99L, 87.0D)); +assertDynamicNumericRangeResults(values, weights, 4, 5000L, expectedRangeInfoList); + } + + public void testComputeDynamicNumericRangesWithSameValuesShuffled() { +List expectedRangeInfoList = new ArrayList<>(); +long totalWeight = 0; +long[] values = new long[100]; +long[] weights = new long[100]; +for (int i = 0; i < 100; i++) { + values[i] = 50; + weights[i] = i; + totalWeight += i; +} + +// Shuffling the values and weights should not change the answer between runs +// We expect that returned ranges should come in a strict, deterministic order +// with the same values and weights +shuffleValuesWeights(values, weights); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 50L, 50L, 50D)); +expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 50L, 50L, 50D));
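The `shuffleValuesWeights` helper referenced in the tests above is not shown in this excerpt; a minimal sketch of a paired Fisher-Yates shuffle that keeps each value aligned with its weight could look like the following (the random source is an assumption; the actual test likely draws from LuceneTestCase's random()):

```java
// Sketch only: shuffle values and weights with the same permutation so each
// (value, weight) pair stays intact; only the order of the pairs changes.
private static void shuffleValuesWeights(long[] values, long[] weights) {
  java.util.Random random = new java.util.Random();
  for (int i = values.length - 1; i > 0; i--) {
    int j = random.nextInt(i + 1);
    long tmpValue = values[i];
    values[i] = values[j];
    values[j] = tmpValue;
    long tmpWeight = weights[i];
    weights[i] = weights[j];
    weights[j] = tmpWeight;
  }
}
```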
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
stefanvodita commented on code in PR #14238: URL: https://github.com/apache/lucene/pull/14238#discussion_r1957977134 ## lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java: ## @@ -285,6 +285,23 @@ private static void assertDynamicNumericRangeValidProperties( pairOffset += count; } +// The minimum/maximum of each range is actually the smallest/largest value +for (int pairOffset = 0, rangeIdx = 0; rangeIdx < mockDynamicRangeResult.size(); rangeIdx++) { + DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(rangeIdx); + int count = rangeInfo.count(); + long min = Long.MAX_VALUE; + long max = Long.MIN_VALUE; + for (int i = pairOffset; i < pairOffset + count; i++) { Review Comment: Seeing the other change in this file - why isn't it applicable here? If we know the pairs are sorted, don't we also know where the min and the max are already? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
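A minimal sketch of the simplification being suggested, assuming the (value, weight) pairs backing each range are already sorted ascending by value and that `DynamicRangeInfo` exposes `min()`/`max()` accessors (the `sortedValues` array and the accessor names are assumptions for illustration):

```java
// If the pairs are sorted by value, the extremes of each range can be read
// directly from the first and last positions instead of scanning the range.
for (int pairOffset = 0, rangeIdx = 0; rangeIdx < mockDynamicRangeResult.size(); rangeIdx++) {
  DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(rangeIdx);
  int count = rangeInfo.count();
  assertEquals(sortedValues[pairOffset], rangeInfo.min());
  assertEquals(sortedValues[pairOffset + count - 1], rangeInfo.max());
  pairOffset += count;
}
```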
Re: [I] Figure out why hunspell tests occasionally fail and make them more consistent [lucene]
dweiss commented on issue #14235: URL: https://github.com/apache/lucene/issues/14235#issuecomment-2662323672 Ok, the scheduled action works as expected: https://github.com/apache/lucene/actions/runs/13362768562 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]
epotyom commented on PR #14237: URL: https://github.com/apache/lucene/pull/14237#issuecomment-2662789904 Thanks for reviewing @stefanvodita ! > I have some non-blocking concerns that this is a lot of new code, with more abstractions, and it doesn't have a user waiting to use it. There are arguments against those concerns as well and I personally like the new aggregations, but I thought it was worth raising the point. I agree with the concern; at the same time, it might be one of those chicken-and-egg cases where users avoid the feature just because it doesn't have a simple enough API for simple use cases? One use case for the new utils I have in mind is luceneutil perf tests: it puts me off that [we need 90 lines](https://github.com/mikemccand/luceneutil/blob/dac76d58c579bf440ff4c339622c236202ec65d0/src/main/perf/SearchTask.java#L261-L352) to test performance of the new API, while for the existing facets module it is [only two lines](https://github.com/mikemccand/luceneutil/blob/dac76d58c579bf440ff4c339622c236202ec65d0/src/main/perf/SearchTask.java#L403-L404). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
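For reference, the "only two lines" pattern in the classic facets module looks roughly like this (a sketch; it assumes a `FacetsCollector`, `FacetsConfig`, and `TaxonomyReader` are already set up, and the dimension name and topN are illustrative):

```java
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;

// Classic facets module: one object aggregates, one call reads the top buckets.
Facets facets = new FastTaxonomyFacetCounts(taxoReader, facetsConfig, facetsCollector);
FacetResult result = facets.getTopChildren(10, "Author");
```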
Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]
epotyom commented on code in PR #14237: URL: https://github.com/apache/lucene/pull/14237#discussion_r1958053677 ## lucene/demo/src/java/org/apache/lucene/demo/facet/SandboxFacetsExample.java: ## @@ -130,6 +135,88 @@ void index() throws IOException { IOUtils.close(indexWriter, taxoWriter); } + /** + * Example for {@link FacetBuilder} usage - simple API that provides results in a format very + * similar to classic facets module. It doesn't give all flexibility available with {@link Review Comment: Good idea, I'll do that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests [lucene]
houserjohn commented on code in PR #14238: URL: https://github.com/apache/lucene/pull/14238#discussion_r1957934311 ## lucene/CHANGES.txt: ## @@ -20,6 +20,7 @@ New Features Improvements - +* GITHUB#14238: Improve test coverage of Dynamic Range Faceting. (John Houser) Review Comment: Addressed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add histogram facet capabilities. [lucene]
epotyom commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1958022460 ## lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java: ## @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.histogram; + +import java.io.IOException; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.index.DocValuesSkipper; +import org.apache.lucene.index.DocValuesType; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.NumericDocValues; +import org.apache.lucene.index.SortedNumericDocValues; +import org.apache.lucene.internal.hppc.LongIntHashMap; +import org.apache.lucene.search.CollectionTerminatedException; +import org.apache.lucene.search.Collector; +import org.apache.lucene.search.LeafCollector; +import org.apache.lucene.search.Scorable; +import org.apache.lucene.search.ScoreMode; + +final class HistogramCollector implements Collector { Review Comment: > Then it would be nice to make ordinals `long`s rather than `int` It's an interesting idea. The issue is many facet implementations don't need `long` ordinals, e.g. Taxonomy or SSDV facets counting use `int`s, even though at least in some cases they index as `long`s. It looks like using `long` for them might be wasteful. Overall, using `long` for something that is essentially a group id in a grouping mechanism seems excessive. At the same time, `long` facet ordinals can simplify and improve performance for some FacetCutter implementations, in particular `LongValueFacetCutter`. Also, as you've mentioned, it is the FacetRecorder's responsibility to keep counts in a dense data structure, so it might be fine to move to `long`. In any case it requires a separate effort, and I think we should run [luceneutil #325](https://github.com/mikemccand/luceneutil/pull/325) and Amazon internal perf tests before making the decision. I can create an issue for it. > ordinals seem to be expected to be positive Yes, it is also a limitation in the current API. IIRC the only thing that relies on it is `OrdinalIterator#NO_MORE_ORDS = -1`. We can probably reserve some other value for it, e.g. `Long.MAX_VALUE` or `MIN_VALUE`; it should work for most cases, including histogram since bucketWidth has to be greater than 2. It's a bit fragile for LongValueFacetCutter - we'd have to throw a runtime error if the value in the index is `NO_MORE_ORDS`. Although the implementation is already fragile as it uses LongIntHashMap, whose [size is limited by the max array size](https://github.com/apache/lucene/blob/05fc9a834d85fdbefbaf4545fb34ab0f38c0c2a7/lucene/core/src/java/org/apache/lucene/internal/hppc/HashContainers.java#L48).
So I suppose it's not too terrible to not allow the NO_MORE_ORDS value when counting. > Separately I played with the quick/dirty benchmark I had created, which seems to have got a bit more than 2x slower. I'd like to look into it - maybe there is something we can improve. Are you running runGeoBench.cmd to get results? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
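A hypothetical illustration of the sentinel change discussed above (the long-based iterator below is not an existing Lucene interface; the current sandbox API uses int ordinals with `NO_MORE_ORDS = -1`):

```java
import java.io.IOException;

// Hypothetical long-ordinal iterator: Long.MAX_VALUE is reserved as the exhaustion
// sentinel so that negative ordinals (e.g. raw long values read from the index)
// remain legal facet ordinals.
interface LongOrdinalIterator {
  long NO_MORE_ORDS = Long.MAX_VALUE;

  long nextOrd() throws IOException;
}
```

A cutter that maps raw index values straight to ordinals would then have to reject the reserved value at collection time, which is the runtime check mentioned above.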
[PR] Accept nodes where score == Math.nextUp(results.minCompetitiveSimilarity()) after #14215 [lucene]
iverase opened a new pull request, #14249: URL: https://github.com/apache/lucene/pull/14249 In https://github.com/apache/lucene/pull/12770 we refined how we search the graph by only considering nodes that improve the score; the condition at line 290 became `friendSimilarity > minAcceptedSimilarity`. In #14215, to improve the termination check and avoid edge cases with many duplicated vectors, we changed how we compute minAcceptedSimilarity to `Math.nextUp(results.minCompetitiveSimilarity())`. Because we didn't change the first condition, we are now not considering nodes where `friendSimilarity == minAcceptedSimilarity`, although those nodes would still have a better score than the current minimum competitive similarity? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
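A simplified sketch of the interaction being described (the variable names follow the PR text; the surrounding searcher code is condensed):

```java
// Since #14215 the acceptance threshold sits one ulp above the queue's minimum:
float minAcceptedSimilarity = Math.nextUp(results.minCompetitiveSimilarity());

// The traversal check was left as a strict comparison, so a neighbor whose score
// equals minAcceptedSimilarity is skipped even though it still beats the current
// minimum competitive similarity; this PR proposes accepting it (>=).
if (friendSimilarity > minAcceptedSimilarity) {
  // explore the neighbor / try to add it to the result queue
}
```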