[GitHub] [lucene] gsmiller opened a new issue, #11742: MatchingFacetSetsCounts doesn't properly implement getTopChildren
gsmiller opened a new issue, #11742: URL: https://github.com/apache/lucene/issues/11742

### Description

`MatchingFacetSetsCounts#getTopChildren` currently just delegates to `#getAllChildren`, which isn't really correct. We should properly implement "top children."

### Version and environment details

_No response_
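For illustration, here is a minimal sketch of what a "top children" selection typically looks like: keep only the N highest-count children with a bounded min-heap instead of returning everything. The `LabelAndCount` record and `topN` helper are hypothetical and are not the actual Lucene patch.

```java
import java.util.PriorityQueue;

final class TopChildrenSketch {
  // Hypothetical child representation; real facet counts carry labels/ordinals and counts.
  record LabelAndCount(String label, int count) {}

  static LabelAndCount[] topN(LabelAndCount[] allChildren, int topN) {
    // Min-heap ordered by count, so the smallest of the current top-N is evicted first.
    PriorityQueue<LabelAndCount> pq =
        new PriorityQueue<>((a, b) -> Integer.compare(a.count(), b.count()));
    for (LabelAndCount child : allChildren) {
      pq.offer(child);
      if (pq.size() > topN) {
        pq.poll(); // drop the current smallest so only the top N remain
      }
    }
    // Drain the heap from smallest to largest, filling the result back-to-front
    // so callers get children in descending count order.
    LabelAndCount[] result = new LabelAndCount[pq.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = pq.poll();
    }
    return result;
  }
}
```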
[GitHub] [lucene] gsmiller commented on issue #11574: TopN is not being used in getTopChildren() [LUCENE-10538]
gsmiller commented on issue #11574: URL: https://github.com/apache/lucene/issues/11574#issuecomment-1235534712

Let's close this out. This issue grew into a few different spin-offs, which are now all resolved. But it looks like we introduced another case of not properly implementing "top children" with facet sets, so I created a dedicated issue to carry that forward: #11742
[GitHub] [lucene] gsmiller closed issue #11574: TopN is not being used in getTopChildren() [LUCENE-10538]
gsmiller closed issue #11574: TopN is not being used in getTopChildren() [LUCENE-10538] URL: https://github.com/apache/lucene/issues/11574
[GitHub] [lucene] mayya-sharipova opened a new pull request, #11743: LUCENE-10592 Better estimate memory for HNSW graph
mayya-sharipova opened a new pull request, #11743: URL: https://github.com/apache/lucene/pull/11743

Better estimate the memory used by OnHeapHnswGraph, and add tests. Also, don't over-allocate arrays in NeighborArray. Relates to #992
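For reference, a rough sketch of the per-`NeighborArray` accounting visible in the review diff further down in this thread: payload arrays plus array headers, an object reference, and two int fields. Reading the `Integer.BYTES * 2` term as "two int fields" is my interpretation, not something stated in the PR.

```java
import org.apache.lucene.util.RamUsageEstimator;

final class NeighborArrayEstimate {
  // Approximate heap bytes for one NeighborArray holding up to nsize neighbors,
  // mirroring the formula in the OnHeapHnswGraph#ramBytesUsed diff below.
  static long bytesPerNeighborArray(int nsize) {
    return (long) nsize * (Integer.BYTES + Float.BYTES)   // node ids + scores payload
        + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2L   // headers of the two arrays
        + RamUsageEstimator.NUM_BYTES_OBJECT_REF          // reference to the NeighborArray
        + Integer.BYTES * 2L;                             // assumed: two int fields on the object
  }

  public static void main(String[] args) {
    System.out.println(bytesPerNeighborArray(16)); // e.g. a node with up to 16 neighbors
  }
}
```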
[GitHub] [lucene] gsmiller opened a new pull request, #11744: Remove LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality
gsmiller opened a new pull request, #11744: URL: https://github.com/apache/lucene/pull/11744

### Description

`LongValueFacetCounts#getTopChildrenSortByCount` does exactly the same thing as the more standard `LongValueFacetCounts#getTopChildren`, so we can tighten up our API. Note that this PR would go with 10.0, and I'll put up a separate PR to mark it deprecated against 9.x.
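To make the redundancy concrete, a hedged usage sketch: both calls below are expected to produce an equivalent top-N-by-count result, which is why the `SortByCount` variant can be retired. The field name `"price"`, the `topN` value, and the assumed pre-removal signature `getTopChildrenSortByCount(int)` are illustrative only.

```java
import java.io.IOException;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.LongValueFacetCounts;

final class RedundantApiExample {
  // Both results are top-N children ordered by count for the numeric field the
  // counts were built over ("price" here is a made-up field name).
  static void compare(LongValueFacetCounts counts) throws IOException {
    FacetResult viaStandard = counts.getTopChildren(10, "price");   // standard Facets API
    FacetResult viaLegacy = counts.getTopChildrenSortByCount(10);   // redundant variant being removed
    System.out.println(viaStandard);
    System.out.println(viaLegacy);
  }
}
```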
[GitHub] [lucene] rishabhmaurya opened a new issue, #11745: Store summarized results in internal nodes of BKD for time series points
rishabhmaurya opened a new issue, #11745: URL: https://github.com/apache/lucene/issues/11745

### Description

Time series points have a timestamp, measurement, and dimensions associated with them. The common queries are range queries on timestamp, metric aggregation on measurement, and grouping on dimensions, or a similar query with a histogram on timestamp.

**Proposal:** Prototype can be found [here](https://github.com/rishabhmaurya/lucene/pull/1)

1. Introduce a new time series point as a field in Lucene - `TSPoint` - which can be added as:

   ```java
   Document doc = new Document();
   doc.add(new TSIntPoint("tsid1", "cpu", timestamp, measurement));
   ```

   `tsid1` is the time series ID of the point. It will be the unit of storage for time series points, and for the prototype each of them represents a unique field in Lucene. `timestamp` is the actual point in the BKD tree on which the index is created. Full definition can be found [here](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/document/TSIntPoint.java).

2. Interface for a decomposable aggregate function, which can be defined as part of the index configuration - a sum function, for example:

   ```java
   new BKDSummaryWriter.SummaryMergeFunction() {
     @Override
     public int getSummarySize() {
       return Integer.BYTES;
     }

     @Override
     public void merge(byte[] a, byte[] b, byte[] c) {
       packBytes(unpackBytes(a) + unpackBytes(b), c);
     }

     @Override
     public Integer unpackBytes(byte[] val) {
       return NumericUtils.sortableBytesToInt(val, 0);
     }

     @Override
     public void packBytes(Integer val, byte[] res) {
       NumericUtils.intToSortableBytes(val, res, 0);
     }
   };
   ```

3. New query per `LeafReader` to perform range queries on TSPoint and retrieve summarized results:

   ```java
   LeafReader leafReader;
   PointValues points = leafReader.getPointValues("tsid1");
   TSPointQuery tsPointQuery = new TSPointQuery("tsid1", lowerBoundTimestamp, upperBoundTimestamp);
   byte[] res =
       tsPointQuery.getSummary((BKDWithSummaryReader.BKDSummaryTree) points.getPointTree(), mergeFunction);
   ```

   Instead of BKDReader and BKDWriter, we will be using [BKDSummaryWriter](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/util/bkd/BKDSummaryWriter.java) and [BKDSummaryReader](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/util/bkd/BKDWithSummaryReader.java), which support writing summaries in the internal nodes of the tree. Changes to the [IntersectVisitor](https://github.com/rishabhmaurya/lucene/blob/0c07f677c4caf45b1499f236adc0217992e46722/lucene/core/src/java/org/apache/lucene/index/PointValues.java#L320-L337) interface are here.

**Comparison with DocValues**

Below is a comparison of running the unit test for the [DocValue](https://github.com/rishabhmaurya/lucene/pull/1/commits/157215cd2c4787787748625e10b58001583c6e6e#diff-87c31ac3b1cd1ef45b15c84d9cf1c5bab03feb94f199256f859f14ed4747abd2R129) approach vs. the [TSPoint](https://github.com/rishabhmaurya/lucene/pull/1/commits/157215cd2c4787787748625e10b58001583c6e6e#diff-87c31ac3b1cd1ef45b15c84d9cf1c5bab03feb94f199256f859f14ed4747abd2R52) approach. This test ingests `1000` docs against a given TSID and performs a range query on timestamp 100 times against the same TSID. The merge function used is `sum`.
| DocValues approach | TSPoint approach |
| --- | --- |
| Indexing took: 42948 ms | Indexing took: 32985 ms |
| Matching docs count: 1304624 \| Segments: 3 \| DiskAccess: 1304624 | Matching docs count: 8784032 \| Segments: 10 \| DiskAccess: 302 |
| Search took: 12382 ms | Search took: 50 ms |

This is not an apples-to-apples comparison, since the number of segments is 3 in the DocValues approach whereas it is 10 in the TSPoint approach.

**Limitations of this feature**

* Doc deletion is currently not supported. We need to evaluate how important it is and possibly find a way to support it in the future.
* Only [decomposable](https://en.wikipedia.org/wiki/Aggregate_function#Decomposable_aggregate_functions) aggregation functions can be supported, e.g. min, max, sum, avg, count.

**Todos**

* Implementation for multiple TSIDs. For now we need to create a new field with the same name as the TSID for a time series.
* Segment merge for BKD with summaries. Currently, the unit tests disable merging, perform the search across multiple segments, and accumulate the results.
* Pluggable merge function to merge two `TSPoint`s. Currently it's hardcoded in `FieldInfo.java`, which isn't the right place.
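A tiny, self-contained sketch of the decomposability property the proposal relies on (purely illustrative, not code from the prototype): a parent node's summary can be computed from its children's summaries alone, which is what lets a range query read one internal-node summary instead of visiting every leaf point. It reuses `NumericUtils` the same way the sum example above does.

```java
import org.apache.lucene.util.NumericUtils;

final class DecomposableSumDemo {
  static byte[] pack(int v) {
    byte[] out = new byte[Integer.BYTES];
    NumericUtils.intToSortableBytes(v, out, 0);
    return out;
  }

  static int unpack(byte[] b) {
    return NumericUtils.sortableBytesToInt(b, 0);
  }

  public static void main(String[] args) {
    // Two child summaries (e.g. partial sums stored on two subtrees)...
    byte[] left = pack(40);
    byte[] right = pack(2);
    // ...merged into the parent's summary, in the spirit of SummaryMergeFunction#merge.
    byte[] parent = new byte[Integer.BYTES];
    NumericUtils.intToSortableBytes(unpack(left) + unpack(right), parent, 0);
    System.out.println(unpack(parent)); // 42: same result as summing all underlying points
  }
}
```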
[GitHub] [lucene] gsmiller opened a new pull request, #11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality
gsmiller opened a new pull request, #11746: URL: https://github.com/apache/lucene/pull/11746

### Description

This is a companion (backport) PR to #11744 that marks the functionality deprecated instead of outright removing it.
[GitHub] [lucene] msokolov commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph
msokolov commented on code in PR #11743: URL: https://github.com/apache/lucene/pull/11743#discussion_r961747487

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ##

```diff
@@ -104,8 +104,8 @@ public void removeLast() {
   }

   public void removeIndex(int idx) {
-    System.arraycopy(node, idx + 1, node, idx, size - idx);
-    System.arraycopy(score, idx + 1, score, idx, size - idx);
+    System.arraycopy(node, idx + 1, node, idx, size - idx - 1);
```

Review Comment: oh, good! Is it OK when `size - idx - 1` == 0? I think so; at least I checked the javadocs and they say only length < 0 raises an error.

## lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java: ##

```diff
@@ -175,20 +175,28 @@ public long ramBytesUsed() {
     long neighborArrayBytes0 =
         nsize0 * (Integer.BYTES + Float.BYTES)
             + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-            + RamUsageEstimator.NUM_BYTES_OBJECT_REF;
+            + RamUsageEstimator.NUM_BYTES_OBJECT_REF
+            + Integer.BYTES * 2;
     long neighborArrayBytes =
         nsize * (Integer.BYTES + Float.BYTES)
             + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-            + RamUsageEstimator.NUM_BYTES_OBJECT_REF;
-
+            + RamUsageEstimator.NUM_BYTES_OBJECT_REF
+            + Integer.BYTES * 2;
     long total = 0;
     for (int l = 0; l < numLevels; l++) {
       int numNodesOnLevel = graph.get(l).size();
```

Review Comment: were we overestimating before because not all the nodes were populated?

## lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java: ##

```diff
@@ -222,6 +229,33 @@ public void testPrintValues() {
     System.out.println("LONG_SIZE = " + LONG_SIZE);
   }

+  public void testHnswGraph() throws IOException {
+    int size = atLeast(2000);
+    int dim = randomIntBetween(100, 1024);
+    int M = randomIntBetween(4, 96);
+    VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
+    VectorEncoding vectorEncoding;
+    if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {
```

Review Comment: I don't think you need this check -- it's OK now to use BYTE encoding with other similarities.
[GitHub] [lucene] gsmiller merged pull request #11737: Simplify dense optimization check in TermInSetQuery
gsmiller merged PR #11737: URL: https://github.com/apache/lucene/pull/11737
[GitHub] [lucene] msokolov opened a new pull request, #11747: update DOAP and releaseWizard to reflect migration to github
msokolov opened a new pull request, #11747: URL: https://github.com/apache/lucene/pull/11747

While preparing the 9.4.0 release, I ran across some references to JIRA and updated those. I also found that the region names used in the holiday.py module installed by pip for me were different from those used in releaseWizard.py, and the ones in the latest module seem to be more standard (two-letter ISO codes), so I switched to those. Finally, I fixed some references to outdated locations in the DOAP.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph
mayya-sharipova commented on code in PR #11743: URL: https://github.com/apache/lucene/pull/11743#discussion_r962070814

## lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java: ##

```diff
@@ -222,6 +229,33 @@ public void testPrintValues() {
     System.out.println("LONG_SIZE = " + LONG_SIZE);
   }

+  public void testHnswGraph() throws IOException {
+    int size = atLeast(2000);
+    int dim = randomIntBetween(100, 1024);
+    int M = randomIntBetween(4, 96);
+    VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
+    VectorEncoding vectorEncoding;
+    if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {
```

Review Comment: Thanks, good to know. I copied that part from another test in `TestHnswGraph`. I've corrected it here as well as in that test. Addressed in 2e01deab2e1678b4687250d22f62ad0cd6196bc9
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph
mayya-sharipova commented on code in PR #11743: URL: https://github.com/apache/lucene/pull/11743#discussion_r962071131

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ##

```diff
@@ -104,8 +104,8 @@ public void removeLast() {
   }

   public void removeIndex(int idx) {
-    System.arraycopy(node, idx + 1, node, idx, size - idx);
-    System.arraycopy(score, idx + 1, score, idx, size - idx);
+    System.arraycopy(node, idx + 1, node, idx, size - idx - 1);
```

Review Comment: Thank you for bringing my attention to this. I checked, and indeed `length = 0` is permissible in the `System.arraycopy` function.
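A quick standalone check of the point above (illustrative only): `System.arraycopy` with `length == 0` is a legal no-op, so the corrected `removeIndex` call is safe even when the removed element is the last one.

```java
public class ArrayCopyZeroLength {
  public static void main(String[] args) {
    int[] node = {1, 2, 3};
    int idx = 2;  // removing the last element
    int size = 3;
    // length == 0 here; per the javadocs only a negative length (or out-of-range
    // positions) throws, so this is a harmless no-op.
    System.arraycopy(node, idx + 1, node, idx, size - idx - 1);
    System.out.println("ok");
  }
}
```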
[GitHub] [lucene] gsmiller commented on issue #11553: Improve performance of SortedSetDV faceting by iterating on class types [LUCENE-10517]
gsmiller commented on issue #11553: URL: https://github.com/apache/lucene/issues/11553#issuecomment-1235993045

@ChrisHegarty I was looking back through faceting improvements and trying to catch up on this change, which I hadn't followed closely when it was being made. When you get a chance (not urgent), could you help me understand where `invokeinterface` was coming into play before your change? I can't reason about where an interface-defined method would be involved with `#nextDoc()`, but I'm sure this is just my detailed JVM ignorance surfacing. Hoping you could help me understand some detail here. Thanks!
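As general background on the question (a hedged illustration of the JVM detail, not a claim about where it occurred in the SortedSetDV faceting code): the bytecode instruction emitted for a call is chosen by the receiver's *static* type, so the same method call compiles to `invokeinterface` or `invokevirtual` depending on how the variable is declared.

```java
import java.util.ArrayList;
import java.util.List;

public class DispatchDemo {
  public static void main(String[] args) {
    List<String> asInterface = new ArrayList<>();
    ArrayList<String> asClass = new ArrayList<>();
    asInterface.add("a"); // compiles to invokeinterface: static type of the receiver is an interface
    asClass.add("b");     // compiles to invokevirtual: static type of the receiver is a class
    System.out.println(asInterface.size() + asClass.size());
  }
}
```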
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph
mayya-sharipova commented on code in PR #11743: URL: https://github.com/apache/lucene/pull/11743#discussion_r962072588

## lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java: ##

```diff
@@ -175,20 +175,28 @@ public long ramBytesUsed() {
     long neighborArrayBytes0 =
         nsize0 * (Integer.BYTES + Float.BYTES)
             + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-            + RamUsageEstimator.NUM_BYTES_OBJECT_REF;
+            + RamUsageEstimator.NUM_BYTES_OBJECT_REF
+            + Integer.BYTES * 2;
     long neighborArrayBytes =
         nsize * (Integer.BYTES + Float.BYTES)
             + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-            + RamUsageEstimator.NUM_BYTES_OBJECT_REF;
-
+            + RamUsageEstimator.NUM_BYTES_OBJECT_REF
+            + Integer.BYTES * 2;
     long total = 0;
     for (int l = 0; l < numLevels; l++) {
       int numNodesOnLevel = graph.get(l).size();
```

Review Comment: In comparison with what `RamUsageTester::ramUsed` reports, we are actually underestimating. I still need to do more investigation. Also, we preemptively allocate memory for all `NeighborArray` objects, which may never get populated with real data. We probably need to look into this.
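To make the over-allocation concern above concrete, a small standalone sketch (sizes are illustrative, not measurements of `OnHeapHnswGraph`): a neighbor array allocated eagerly for a node occupies heap whether or not that node's neighbor list is ever filled.

```java
public class PreallocationDemo {
  public static void main(String[] args) {
    int maxConn = 16;
    // Allocated up front for a node, even if its neighbor list stays empty:
    int[] node = new int[maxConn + 1];
    float[] score = new float[maxConn + 1];
    // Payload bytes only (array headers and object overhead not included here).
    long approxBytes = (long) node.length * Integer.BYTES + (long) score.length * Float.BYTES;
    System.out.println("approx payload bytes per empty NeighborArray: " + approxBytes);
  }
}
```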