[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1149554008 > `facetset` for the exact implementation and `hyperrectangle` for the range implementation If I understand correctly, @gsmiller's proposal was more about efficiency, right? So if you have a large number of points that you'd like to index, an R-Tree might be a better data structure for these range matches than the current linear scan we do in `RangeFSM`. But that doesn't mean we shouldn't have `RangeFSM` for small-dimensionality cases, or cases with a small number of FacetSets per doc, right? They might even perform better? IMO, and to echo Greg's "progress not perfection", I feel it's better if we introduce the `HyperRectangle` impls (1) when we have a good use case to demonstrate the need for it (it just helps to reason about it) and (2) when we're actually ready to implement it "more efficiently". That is, in this PR I feel like we can do w/ `RangeFSM` and later (if and when we have a better alternative for large dims) document and reference the alternative? > For the second question, I think we should keep this as a `long[]` based API as we know we want to make the KD tree and R tree optimizations in the future If we think that KD-Trees can be more efficient for `FacetSetMatcher` impls and we're sure that we'll get there _soon_, then yeah, we can move to a `long[]` based API. Another alternative I thought of is to keep the `byte[]` API, but introduce a helper method on `FSM` which converts the `byte[]` to `long[]` for whoever needs it. To re-iterate what I previously wrote about API back-compat, I feel that putting the `@lucene.experimental` tag on the classes is enough to have us not worry about changing it. Especially if the API hasn't been released yet, and definitely if we intend to follow with a KD-Tree impl very soon. A year from now (just an example) it's a different story, but for now I don't think we need to finalize the API for good. 
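The `byte[]`-to-`long[]` helper floated above could be sketched roughly as follows. This is only an illustration, not the actual `FacetSetMatcher` API: the class name `FacetSetBytes`, the method name `toLongs`, and the assumption that each dimension is packed as 8 big-endian bytes are all hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical helper sketching the suggested byte[] -> long[] conversion.
// Assumes each dimension is stored as 8 big-endian bytes; this matches the
// fixed-width packing discussed in the PR but is not a committed API.
public class FacetSetBytes {
  public static long[] toLongs(byte[] packed) {
    long[] dims = new long[packed.length / Long.BYTES];
    ByteBuffer buf = ByteBuffer.wrap(packed); // big-endian by default
    for (int i = 0; i < dims.length; i++) {
      dims[i] = buf.getLong();
    }
    return dims;
  }
}
```

A matcher that prefers `long`-based comparisons could call such a helper once per candidate, leaving the on-disk `byte[]` encoding unchanged.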
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551429#comment-17551429 ] Adrien Grand commented on LUCENE-10396: --- Another potential use-case we are interested in for Elasticsearch would be to have the ability to visit a single document per field value for a field that is the primary index sort. This has a few applications, one of them is to compute the number of unique values of this primary sort field for documents that match a query. The collector could implement {{LeafCollector#competitiveIterator}} by using the sparse index to skip all documents that have the same value as the last collected hit. > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. 
In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
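The "store every N-th value" idea can be sketched in a few lines. This is only an illustration of the concept (the class name, method names, and sampling interval are made up, and a real sparse index would get its own file format abstraction as the issue says):

```java
import java.util.Arrays;

// Sketch of the sparse-index idea: sample every N-th value of the primary
// sort field, then binary-search the samples to bound where a value can occur.
public class SparseIndexSketch {
  static final int N = 4; // sampling interval, chosen arbitrarily for the demo

  // sortedValues: the index-sorted doc values; returns every N-th value.
  static long[] sample(long[] sortedValues) {
    long[] samples = new long[(sortedValues.length + N - 1) / N];
    for (int i = 0; i < samples.length; i++) {
      samples[i] = sortedValues[i * N];
    }
    return samples;
  }

  // Index of the sampled block that may contain `target`: binary-search the
  // samples; the true boundary then lies within N docs of the block start.
  static int candidateBlock(long[] samples, long target) {
    int idx = Arrays.binarySearch(samples, target);
    return idx >= 0 ? idx : Math.max(0, -idx - 2);
  }
}
```

A range query on the primary sort field would binary-search the samples to find a candidate block, then scan at most N docs to locate the exact boundary, instead of touching every value; this is what would make `IndexSortSortedNumericDocValuesRangeQuery` less I/O intensive.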
[jira] [Created] (LUCENE-10604) Improve ability to test and debug triangulation algorithm in Tessellator
Craig Taverner created LUCENE-10604: --- Summary: Improve ability to test and debug triangulation algorithm in Tessellator Key: LUCENE-10604 URL: https://issues.apache.org/jira/browse/LUCENE-10604 Project: Lucene - Core Issue Type: Test Components: core/other Reporter: Craig Taverner While working on https://issues.apache.org/jira/browse/LUCENE-10563, it was found to be extremely difficult to debug the inner workings of the triangulation algorithm in the Tessellator class. To make things easier, we made use of a monitor interface that tests could use to instrument the algorithm and receive detailed information on the state of the algorithm's progress. This issue requests that this interface and a test implementation of it be included in Lucene to facilitate future bug-fixing of this complex algorithm.
[jira] [Commented] (LUCENE-10604) Improve ability to test and debug triangulation algorithm in Tessellator
[ https://issues.apache.org/jira/browse/LUCENE-10604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551470#comment-17551470 ] Ignacio Vera commented on LUCENE-10604: --- +1 This algorithm is very difficult to debug for big, complex polygons. I have seen the monitor interface in action and it was extremely useful to understand what the algorithm was doing. > Improve ability to test and debug triangulation algorithm in Tessellator > > > Key: LUCENE-10604 > URL: https://issues.apache.org/jira/browse/LUCENE-10604 > Project: Lucene - Core > Issue Type: Test > Components: core/other >Reporter: Craig Taverner >Priority: Minor > Labels: geo, pull-request-available > > While working on https://issues.apache.org/jira/browse/LUCENE-10563, it was > found to be extremely difficult to debug the inner workings of the > triangulation algorithm in the Tessellator class. To make things easier, we > made use of a monitor interface that tests could use to instrument the > algorithm and receive detailed information on the state of the algorithms > progress. This issue requests that this interface and a test implementation > of it be included in Lucene to facilitate future bug-fixing of this complex > algorithm.
[GitHub] [lucene] iverase merged pull request #933: LUCENE-10563: Unable to Tessellate polygon
iverase merged PR #933: URL: https://github.com/apache/lucene/pull/933
[jira] [Commented] (LUCENE-10563) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551492#comment-17551492 ] ASF subversion and git services commented on LUCENE-10563: -- Commit 1dceff12c8bb764255927be55e46f4d435f47aed in lucene's branch refs/heads/main from Craig Taverner [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1dceff12c8b ] LUCENE-10563: Fix failure to tessellate complex polygon (#933) > Unable to Tessellate polygon > > > Key: LUCENE-10563 > URL: https://issues.apache.org/jira/browse/LUCENE-10563 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.1 >Reporter: Yixun Xu >Priority: Major > Attachments: polygon-1.json, polygon-2.json, polygon-3.json > > Time Spent: 0.5h > Remaining Estimate: 0h > > Following up to LUCENE-10470, I found some more polygons that cause > {{Tessellator.tessellate}} to throw "Unable to Tessellate shape", which are > not covered by the fix to LUCENE-10470. I attached the geojson of 3 failing > shapes that I got, and this is the > [branch|https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1#diff-5e8e8052af8b8618e7e4325b7d69def4d562a356acbfea3e983198327c7c8d18R17-R19] > I am testing on that demonstrates the tessellation failures. > > [^polygon-1.json] > [^polygon-2.json] > [^polygon-3.json]
[jira] [Commented] (LUCENE-10563) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551493#comment-17551493 ] ASF subversion and git services commented on LUCENE-10563: -- Commit b7231bb54884f9ce0232430c4a60cdb5753c6b82 in lucene's branch refs/heads/branch_9x from Craig Taverner [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b7231bb5488 ] LUCENE-10563: Fix failure to tessellate complex polygon (#933) # Conflicts: # lucene/CHANGES.txt > Unable to Tessellate polygon > > > Key: LUCENE-10563 > URL: https://issues.apache.org/jira/browse/LUCENE-10563 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.1 >Reporter: Yixun Xu >Priority: Major > Attachments: polygon-1.json, polygon-2.json, polygon-3.json > > Time Spent: 0.5h > Remaining Estimate: 0h > > Following up to LUCENE-10470, I found some more polygons that cause > {{Tessellator.tessellate}} to throw "Unable to Tessellate shape", which are > not covered by the fix to LUCENE-10470. I attached the geojson of 3 failing > shapes that I got, and this is the > [branch|https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1#diff-5e8e8052af8b8618e7e4325b7d69def4d562a356acbfea3e983198327c7c8d18R17-R19] > I am testing on that demonstrates the tessellation failures. > > [^polygon-1.json] > [^polygon-2.json] > [^polygon-3.json]
[jira] [Commented] (LUCENE-10604) Improve ability to test and debug triangulation algorithm in Tessellator
[ https://issues.apache.org/jira/browse/LUCENE-10604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551498#comment-17551498 ] Craig Taverner commented on LUCENE-10604: - I have created a PR for this at https://github.com/apache/lucene/pull/946 > Improve ability to test and debug triangulation algorithm in Tessellator > > > Key: LUCENE-10604 > URL: https://issues.apache.org/jira/browse/LUCENE-10604 > Project: Lucene - Core > Issue Type: Test > Components: core/other >Reporter: Craig Taverner >Priority: Minor > Labels: geo, pull-request-available > > While working on https://issues.apache.org/jira/browse/LUCENE-10563, it was > found to be extremely difficult to debug the inner workings of the > triangulation algorithm in the Tessellator class. To make things easier, we > made use of a monitor interface that tests could use to instrument the > algorithm and receive detailed information on the state of the algorithms > progress. This issue requests that this interface and a test implementation > of it be included in Lucene to facilitate future bug-fixing of this complex > algorithm.
[jira] [Resolved] (LUCENE-10563) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-10563. --- Fix Version/s: 9.3 Assignee: Ignacio Vera Resolution: Fixed Note that polygon 1 was a real bug but the other 2 polygons are illegal and the error that is thrown is legit. > Unable to Tessellate polygon > > > Key: LUCENE-10563 > URL: https://issues.apache.org/jira/browse/LUCENE-10563 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.1 >Reporter: Yixun Xu >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.3 > > Attachments: polygon-1.json, polygon-2.json, polygon-3.json > > Time Spent: 0.5h > Remaining Estimate: 0h > > Following up to LUCENE-10470, I found some more polygons that cause > {{Tessellator.tessellate}} to throw "Unable to Tessellate shape", which are > not covered by the fix to LUCENE-10470. I attached the geojson of 3 failing > shapes that I got, and this is the > [branch|https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1#diff-5e8e8052af8b8618e7e4325b7d69def4d562a356acbfea3e983198327c7c8d18R17-R19] > I am testing on that demonstrates the tessellation failures. > > [^polygon-1.json] > [^polygon-2.json] > [^polygon-3.json]
[GitHub] [lucene] jpountz commented on a diff in pull request #935: LUCENE-10599: Improve LogMergePolicy's handling of maxMergeSize.
jpountz commented on code in PR #935: URL: https://github.com/apache/lucene/pull/935#discussion_r892145811 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## @@ -568,23 +568,41 @@ public MergeSpecification findMerges( // Finally, record all merges that are viable at this level: int end = start + mergeFactor; while (end <= 1 + upto) { -boolean anyTooLarge = false; boolean anyMerging = false; +long mergeSize = 0; +long mergeDocs = 0; for (int i = start; i < end; i++) { final SegmentInfoAndLevel segLevel = levels.get(i); final SegmentCommitInfo info = segLevel.info; - anyTooLarge |= - (size(info, mergeContext) >= maxMergeSize - || sizeDocs(info, mergeContext) >= maxMergeDocs); if (mergingSegments.contains(info)) { anyMerging = true; break; } + long segmentSize = size(info, mergeContext); + long segmentDocs = sizeDocs(info, mergeContext); + if (mergeSize + segmentSize > maxMergeSize || mergeDocs + segmentDocs > maxMergeDocs) { +// This merge is full, stop adding more segments to it +if (i == start) { + // This segment alone is too large, return a singleton merge + if (verbose(mergeContext)) { +message( +"" + i + " is larger than the max merge size/docs; ignoring", mergeContext); + } + end = i + 1; +} else { + // Previous segments are under the max merge size, return them + end = i; +} +break; + } + mergeSize += segmentSize; + mergeDocs += segmentDocs; } -if (anyMerging) { - // skip -} else if (!anyTooLarge) { +if (anyMerging || end - start <= 1) { + // skip: there is an ongoing merge at the current level or the computed merge has a single + // segment and this merge policy doesn't do singleton merges +} else { Review Comment: Correct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [lucene] jpountz commented on a diff in pull request #935: LUCENE-10599: Improve LogMergePolicy's handling of maxMergeSize.
jpountz commented on code in PR #935: URL: https://github.com/apache/lucene/pull/935#discussion_r892164198 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## @@ -568,23 +568,41 @@ public MergeSpecification findMerges( // Finally, record all merges that are viable at this level: Review Comment: > how could it guarantee that the level decided in range of [start,end) won't contain segments that have lower level than levelBottom? LogMergePolicy doesn't try to provide this guarantee. It's actually important it doesn't try to provide this guarantee, otherwise it could end up with lots of unmerged segments. For instance imagine that you have 9 segments (S1..S9) on level 10 then one segment (S10) on level 9, one more segment on level 10 (S11) and then potentially other segments. If LogMergePolicy refused to merge segments that are on a lower level then it could never merge together segments S1..S10. This is because segment S10 can only be merged with segments that are on a higher level because both the previous and the next segment are on a higher level, and LogMergePolicy only merges adjacent segments. This is a downside of LogMergePolicy compared to TieredMergePolicy: because it only performs merges of adjacent segments, it sometimes has to return merges where not all segments are on the same level.
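The S1..S11 example can be checked mechanically. The snippet below is illustrative only (not LogMergePolicy code): it verifies that the level-9 segment S10 (0-based index 9) has no window of adjacent segments at its own level or lower, so an adjacent-only merge policy must merge it with higher-level neighbors.

```java
// Illustration of the adjacency constraint from the comment above.
public class AdjacentMergeDemo {
  // Levels of segments S1..S11: nine at level 10, S10 at level 9, S11 at level 10.
  static final int[] LEVELS = {10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 10};

  // True if some contiguous window of `size` segments containing index `i`
  // consists only of segments at LEVELS[i]'s level or lower.
  static boolean canMergeAtOwnLevel(int i, int size) {
    for (int start = Math.max(0, i - size + 1);
        start <= i && start + size <= LEVELS.length;
        start++) {
      boolean ok = true;
      for (int j = start; j < start + size; j++) {
        if (LEVELS[j] > LEVELS[i]) { // a higher-level neighbor blocks this window
          ok = false;
          break;
        }
      }
      if (ok) {
        return true;
      }
    }
    return false;
  }
}
```

Since no window around S10 stays at level 9 or lower, an adjacent-only policy either mixes levels in a merge or leaves S10 unmerged forever, which is exactly the trade-off described above.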
[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester
kaivalnp commented on code in PR #932: URL: https://github.com/apache/lucene/pull/932#discussion_r892189591 ## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ## @@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) { return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]); } +public void setBitSet(BitSet bitSet, int cost) { + bitSets[ord] = bitSet; Review Comment: Thank you @jtibshirani @msokolov for your input! I agree that we shouldn't modify core classes for tests, and we can create a separate issue for possible collection optimizations However I feel that this test is important as well, for deciding which query would be suitable for pre-filtering (by comparing with post-filtering + over selection time) and we should address this change
[GitHub] [lucene] mikemccand commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
mikemccand commented on PR #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-1149911664 My beasting has not uncovered any more failures! I think this is ready! I'll try to merge soon, just to `main` ... let's let it bake there for a while before backporting to 9.x? Thanks @vigyasharma.
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892375896 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: Maybe we should just pass the depth here so we are not allocating those strings every time?
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892379156 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -131,16 +136,24 @@ public static List tessellate(final Polygon polygon, boolean checkSelf } // Calculate the tessellation using the doubly LinkedList. List result = -earcutLinkedList(polygon, outerNode, new ArrayList<>(), State.INIT, mortonOptimized); +earcutLinkedList( +polygon, outerNode, new ArrayList<>(), State.INIT, mortonOptimized, monitor, 0); if (result.size() == 0) { + notifyMonitor("FAILED", monitor, null, result); Review Comment: Shall we extract static strings instead for FAILED, COMPLETED, ...? ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); + earcutLinkedList( + polygon, searchNode, tessellation, State.INIT, mortonOptimized, monitor, depth); + earcutLinkedList( + polygon, splitNode, tessellation, State.INIT, mortonOptimized, monitor, depth); + notifyMonitorSplitEnd("SPLIT[" + depth + "]", monitor); Review Comment: same here ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1472,60 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algoirithm. This is of use for debugging complex cases, as well as gaining Review Comment: s/algoirithm/algorithm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[jira] [Created] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
sun wuqiang created LUCENE-10605: Summary: fix error in 32bit jvm object alignment gap calculation Key: LUCENE-10605 URL: https://issues.apache.org/jira/browse/LUCENE-10605 Project: Lucene - Core Issue Type: Improvement Components: core/other Affects Versions: 9.2 Environment: jdk 7 32-bit jdk 8 32-bit Reporter: sun wuqiang Attachments: image-2022-06-08-20-50-27-712.png, image-2022-06-08-21-24-57-674.png ArrayUtil.*oversize*(int minTargetSize, int bytesPerElement) is used to calculate the optimal length of an array during expansion. According to the current logic, in order to avoid the space wasted by the *object alignment gap*, on a *32-bit* JVM the array length will be selected from the numbers in the "current optional" column of the table below. But the results aren't perfect. See the table below. !image-2022-06-08-20-50-27-712.png! I used *jol-core* to calculate the object alignment gap:
{code:xml}
<dependency>
  <groupId>org.openjdk.jol</groupId>
  <artifactId>jol-core</artifactId>
  <version>0.16</version>
  <scope>compile</scope>
</dependency>
{code}
Execute the following code:
{code:java}
System.out.println(ClassLayout.parseInstance(new byte[6]).toPrintable());
{code}
!image-2022-06-08-21-24-57-674.png! To further verify that the tool's results are correct, I wrote the following code to infer how much space arrays of different lengths actually occupy, based on when the OOM occurs. The conclusion is consistent with jol-core.
{code:java}
// -Xms16m -Xmx16m
// Used to infer the memory space occupied
// by arrays of various lengths
public static void main(String[] args) {
  byte[][] arr = new byte[1024 * 1024][];
  for (int i = 0; i < arr.length; i++) {
    if (i % 100 == 0) {
      System.out.println(i);
    }
    // According to when the OOM occurs
    // on a 32-bit JVM,
    // arrays with lengths from 5 to 12
    // occupy the same amount of memory
    arr[i] = new byte[5];
  }
}
{code}
*new byte[5]* and *new byte[12]* use the same amount of memory. In addition, *-XX:ObjectAlignmentInBytes* should also affect the return value of this method, but I don't know whether supporting that is necessary.
If necessary, I will modify that as well. Thank you very much!
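The report's numbers can be reproduced with simple arithmetic. The constants below are assumptions, not measured values: the usual 32-bit layout of 12 header bytes for a byte[] (8-byte header plus 4-byte length field) and the default 8-byte object alignment.

```java
// Back-of-the-envelope check of the claim that new byte[5] and new byte[12]
// occupy the same memory on a 32-bit JVM.
public class AlignmentCheck {
  static final int HEADER = 12;   // assumed: 8-byte header + 4-byte array length
  static final int ALIGNMENT = 8; // default -XX:ObjectAlignmentInBytes

  // Bytes actually occupied by a byte[] of the given length, rounded up
  // to the alignment boundary.
  static long occupied(int arrayLength) {
    long raw = HEADER + (long) arrayLength;
    return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
  }
}
```

Under these assumptions, lengths 5 through 12 all round up to 24 bytes, so an oversize routine that stops at length 5 leaves 7 bytes of alignment padding unused; length 13 crosses into the next 8-byte bucket.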
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892406529 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: Since the split function is called very infrequently (a couple of times for an entire tessellation session), I presume this object allocation is negligible (note how a few lines later we have `return true` so even though this is a while loop, it is called only once). The option of passing in the depth only and creating the string deeper in the stack will result in exactly the same number of object constructions. However, I can see the advantage of test code not creating the strings at all. Since this is called very little, it is only a tiny saving, but a saving nevertheless. I kept it as a string here mainly for symmetry with the other functions (notifyMonitor).
[GitHub] [lucene] wormday opened a new pull request, #949: LUCENE-10605: fix error in 32bit jvm object alignment gap calculation
wormday opened a new pull request, #949: URL: https://github.com/apache/lucene/pull/949 ### Description https://issues.apache.org/jira/browse/LUCENE-10605 ArrayUtil.oversize() generates array lengths that do not eliminate the object alignment gap, wasting some space
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892428974 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -131,16 +136,24 @@ public static List tessellate(final Polygon polygon, boolean checkSelf } // Calculate the tessellation using the doubly LinkedList. List result = -earcutLinkedList(polygon, outerNode, new ArrayList<>(), State.INIT, mortonOptimized); +earcutLinkedList( +polygon, outerNode, new ArrayList<>(), State.INIT, mortonOptimized, monitor, 0); if (result.size() == 0) { + notifyMonitor("FAILED", monitor, null, result); Review Comment: Good idea
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892429207 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1472,60 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining Review Comment: Definitely!
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892430679 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: I realize that even though we create these strings infrequently, we should not create them at all if `monitor==null`, so I moved the creation one level down in the stack to get that behaviour. Hopefully this is sufficient?
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892431465 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); + earcutLinkedList( + polygon, searchNode, tessellation, State.INIT, mortonOptimized, monitor, depth); + earcutLinkedList( + polygon, splitNode, tessellation, State.INIT, mortonOptimized, monitor, depth); + notifyMonitorSplitEnd("SPLIT[" + depth + "]", monitor); Review Comment: Moved creation down one level to get it after the `if(monitor==null)` check.
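The allocation-avoidance pattern discussed in this thread can be sketched in isolation (class and method names here are illustrative, not the actual Tessellator code): the status string is concatenated only after the null check, so a disabled monitor costs nothing.

```java
// Illustrative sketch of the pattern from the review thread, not the real
// Tessellator code: build the "SPLIT[depth]" label only when a monitor is
// actually attached, so no String garbage is created otherwise.
final class MonitorNotifySketch {
  interface Monitor {
    void currentState(String status);
  }

  static void notifyMonitorSplit(Monitor monitor, int depth) {
    if (monitor == null) {
      return; // fast path: no string concatenation, no allocation
    }
    monitor.currentState("SPLIT[" + depth + "]"); // built only when needed
  }
}
```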
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892480730 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: static?
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892481023 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: and final
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892482033 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: yes, that is what I was thinking, thanks!
[GitHub] [lucene] iverase commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
iverase commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892487112 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: I see, it is an interface so they are static and final by default, sorry :)
[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
msokolov commented on PR #947: URL: https://github.com/apache/lucene/pull/947#issuecomment-1150043099 If folks have time to review, that would be great. The main things to focus on, I think, are: 1. how the HnswGraphSearcher/Writer implementation is forked into `float[]` and `BytesRef` using generics; 2. adding a VERSION to the index format so we can handle back-compat in a reasonable way; 3. the fact that some of the details of VectorSimilarityFunction are now directly handled in util.hnsw -- I didn't see a better way, but it blurs the abstraction boundary
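Point 1 can be illustrated roughly in miniature (hypothetical names; the actual HnswGraphSearcher refactor is in the PR itself): the traversal/scoring call site is written once against a type parameter, and only the similarity computation differs between `float[]` and byte-based vectors.

```java
// Standalone sketch of forking an implementation over the vector type via
// generics: one generic scoring entry point, two type-specific similarity
// functions. Names are illustrative, not Lucene's actual classes.
import java.util.function.BiFunction;

final class VectorScoreSketch {
  // The shared code path: works for any vector representation T.
  static <T> float score(T query, T candidate, BiFunction<T, T, Float> similarity) {
    return similarity.apply(query, candidate);
  }

  // float[]-specific similarity (dot product).
  static float dotFloat(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  // byte[]-specific similarity, e.g. for 8-bit quantized vectors.
  static float dotBytes(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  public static void main(String[] args) {
    float f = score(new float[] {1f, 2f}, new float[] {3f, 4f}, VectorScoreSketch::dotFloat);
    float b = score(new byte[] {1, 2}, new byte[] {3, 4}, VectorScoreSketch::dotBytes);
    System.out.println(f + " " + b); // 11.0 11.0
  }
}
```

This is also where the abstraction blur mentioned in point 3 shows up: the similarity function must know the concrete vector representation, so it cannot live entirely behind the generic interface.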
[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester
jtibshirani commented on code in PR #932: URL: https://github.com/apache/lucene/pull/932#discussion_r892605221 ## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ## @@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) { return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]); } +public void setBitSet(BitSet bitSet, int cost) { + bitSets[ord] = bitSet; Review Comment: Sounds good, it definitely makes sense to do both changes. @kaivalnp would you like to open a new JIRA issue focused on using cached/ precomputed filters directly? I'm also happy to file the issue and tag you for credit + awareness.
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892608782 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: It is in an interface, so `public static final` is implied.
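For readers following along, the rule the thread settled on can be shown in two lines (a standalone example, not the Tessellator source):

```java
// Fields declared in an interface are implicitly public, static and final,
// so the two declarations below are equivalent.
interface ConstantsDemo {
  String FAILED = "FAILED";
  public static final String COMPLETED = "COMPLETED"; // redundant modifiers
}
```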
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892626127 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -817,8 +841,12 @@ private static final boolean splitEarcut( sortByMortonWithReset(searchNode); sortByMortonWithReset(splitNode); } - earcutLinkedList(polygon, searchNode, tessellation, State.INIT, mortonOptimized); - earcutLinkedList(polygon, splitNode, tessellation, State.INIT, mortonOptimized); + notifyMonitorSplit("SPLIT[" + depth + "]", monitor, searchNode, splitNode); Review Comment: Great. I made a similar change to the main call too, so all string creation (other than failed) is done after the if statement.
[GitHub] [lucene] craigtaverner commented on a diff in pull request #946: LUCENE-10604: Add Tessellation monitor for easier debugging of triangulation algorithm
craigtaverner commented on code in PR #946: URL: https://github.com/apache/lucene/pull/946#discussion_r892627220 ## lucene/core/src/java/org/apache/lucene/geo/Tessellator.java: ## @@ -1444,6 +1471,71 @@ public static final boolean pointInPolygon( return false; } + /** + * Implementation of this interface will receive calls with internal data at each step of the + * triangulation algorithm. This is of use for debugging complex cases, as well as gaining insight + * into the way the algorithm works. Data provided includes a status string containing the current + * mode, list of points representing the current linked-list of internal nodes used for + * triangulation, and a list of triangles so far created by the algorithm. + */ + public interface Monitor { +String FAILED = "FAILED"; +String COMPLETED = "COMPLETED"; Review Comment: Woops, I see you already realized that. I responded once I saw the first message, but my UI had not updated to show the others.
[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing
[ https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551712#comment-17551712 ] Adrien Grand commented on LUCENE-10602: --- I work with Chris and I suggested that he open this issue; I'll try to provide a bit more context. It's not uncommon for us to have nodes that handle several TBs of data. With documents that need ~250 bytes each, which is also typical, this gives ~4.4B documents per TB of data. Caching a query for 4.4B documents requires ~520MB of memory assuming one bit per document. So if we want to be able to cache, say, 4 queries across 1TB of data, then we need ~2GB of heap for the query cache. We could give less memory to the cache, but this would increase the risk that every new entry in the cache evicts a hot entry from the cache. This is potentially an issue for the query cache since computing cache entries has overhead: it requires evaluating all documents that match the query, while the query that is being cached might be used in a conjunction that only requires evaluating a subset of the matching docs. But we're also seeing the opposite case, when the cache is oversized for the amount of data that a node handles. And because the cache only evicts when it's full or when segments get closed, the cache will often grow until it's completely full, even though most cache entries never get used. We don't know at node startup time how much data this node is going to handle eventually, which makes it impossible to size the query cache correctly. So if Lucene's query cache could evict entries from the cache when they appear to be very little used, this would help avoid spending large amounts of heap on useless cache entries. 
> Dynamic Index Cache Sizing > -- > > Key: LUCENE-10602 > URL: https://issues.apache.org/jira/browse/LUCENE-10602 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Chris Earle >Priority: Major > > Working with Lucene's filter cache, it has become apparent that it can be an > enormous drain on the heap and therefore the JVM. After extensive usage of an > index, it is not uncommon to tune performance by shrinking or altogether > removing the filter cache. > Lucene tracks hit/miss stats of the filter cache, but it does nothing with > the data other than inform an interested user about the effectiveness of > their index's caching. > It would be interesting if Lucene would be able to tune the index filter > cache heuristically based on actual usage (age, frequency, and value). > This could ultimately be used to give GBs of heap back to an individual > Lucene instance instead of burning it on cache storage that's not effectively > used (or useful). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
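The sizing figures quoted in the discussion (1 TiB of data, ~250 bytes per document, one bit per document per cached query) can be checked with quick arithmetic:

```java
// Back-of-the-envelope check of the heap numbers quoted in the discussion.
final class QueryCacheSizingMath {
  public static void main(String[] args) {
    long oneTiB = 1L << 40;             // bytes in 1 TiB
    long docs = oneTiB / 250;           // ~4.4 billion documents
    long bitsetBytes = docs / 8;        // one bit per doc per cache entry
    long fourEntries = 4 * bitsetBytes; // heap needed to cache 4 queries
    System.out.println(docs);                     // 4398046511
    System.out.println(bitsetBytes / (1 << 20));  // 524 (~520 MiB per entry)
    System.out.println(fourEntries / (1L << 30)); // 2 (~2 GiB total)
  }
}
```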
[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing
[ https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551717#comment-17551717 ] Uwe Schindler commented on LUCENE-10602: If all this is needed, why not implement your own QueryCache instance for ES? Then you have more control over when and how something is cached or evicted. There's no need to use IndexSearcher#DEFAULT_QUERY_CACHE; you can set your own (globally, as the default per node, or per searcher). Maybe it would be good to get more context, but actually the caching policy is more up to the application. IMHO, it should be enough to implement a sophisticated https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/QueryCache.java The LRUQueryCache in Lucene is just an example.
[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing
[ https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551719#comment-17551719 ] Uwe Schindler commented on LUCENE-10602: Elasticsearch could for example subclass LRUQueryCache and disallow caching of results for some (huge) shards or make sure no smaller entries are expunged.
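Subclassing `LRUQueryCache` is one route; the least-recently-used bookkeeping such a cache builds on can be sketched independently with a plain access-ordered `LinkedHashMap`. This is a standalone toy that budgets by entry count (Lucene's real cache also accounts for RAM usage), illustrating usage-based eviction as the issue suggests:

```java
// Toy usage-aware cache: an access-ordered LinkedHashMap drops the
// least-recently-used entry once the size budget is exceeded.
import java.util.LinkedHashMap;
import java.util.Map;

final class UsageLruSketch<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  UsageLruSketch(int maxEntries) {
    super(16, 0.75f, true); // accessOrder=true: iteration order tracks usage
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries; // evict the least-recently-used entry
  }
}
```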
[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing
[ https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551720#comment-17551720 ] Uwe Schindler commented on LUCENE-10602: Theoretically you could also serialize huge Bitsets to disk (temp directory) and, with MMapDirectory (and/or Java 19 MemorySegments -> next week at BBUZZ), make an off-heap cache for large shards. I think there is a lot of flexibility in that small interface.
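A bare-bones sketch of that off-heap direction (standalone code using only the JDK, not tied to MMapDirectory or the QueryCache interface): write the bitset's words to a temp file and test bits through a memory-mapped view rather than a heap array.

```java
// Toy illustration: a bitset persisted to a temp file and queried through a
// memory-mapped buffer, so the bits live off-heap.
import java.io.IOException;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedBitSetSketch {
  // Test one bit in a long-word bitset exposed as a LongBuffer.
  static boolean get(LongBuffer words, long bit) {
    return (words.get((int) (bit >>> 6)) & (1L << (bit & 63))) != 0;
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("bitset", ".bin");
    long[] words = new long[4]; // capacity for 256 bits
    words[1] |= 1L << 5;        // set bit 69 (word 1, bit 5)
    try (FileChannel ch =
        FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, words.length * 8L);
      buf.asLongBuffer().put(words);        // write the words through the mapping
      LongBuffer view = buf.asLongBuffer(); // fresh read view over the same region
      System.out.println(get(view, 69) + " " + get(view, 70)); // true false
    }
    Files.deleteIfExists(file);
  }
}
```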
[GitHub] [lucene] zhaih commented on a diff in pull request #935: LUCENE-10599: Improve LogMergePolicy's handling of maxMergeSize.
zhaih commented on code in PR #935: URL: https://github.com/apache/lucene/pull/935#discussion_r892682597 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## @@ -568,23 +568,41 @@ public MergeSpecification findMerges( // Finally, record all merges that are viable at this level: Review Comment: Got it, thanks!
[jira] [Commented] (LUCENE-4574) FunctionQuery ValueSource value computed twice per document
[ https://issues.apache.org/jira/browse/LUCENE-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551732#comment-17551732 ] Kevin Risden commented on LUCENE-4574: -- [~dsmiley] / [~jpountz] - is it possible this was addressed by LUCENE-6263 (and somewhat related LUCENE-8405)? I don't see this behavior currently. Even when requesting the score, the score is now cached, so the function isn't computed twice. > FunctionQuery ValueSource value computed twice per document > --- > > Key: LUCENE-4574 > URL: https://issues.apache.org/jira/browse/LUCENE-4574 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 4.0, 4.1 >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-4574.patch, LUCENE-4574.patch, LUCENE-4574.patch, > LUCENE-4574.patch, Test_for_LUCENE-4574.patch > > > I was working on a custom ValueSource and did some basic profiling and > debugging to see if it was being used optimally. To my surprise, the value > was being fetched twice per document in a row. This computation isn't > exactly cheap to calculate so this is a big problem. I was able to > work-around this problem trivially on my end by caching the last value with > corresponding docid in my FunctionValues implementation. 
> Here is an excerpt of the code path to the first execution: > {noformat} > at > org.apache.lucene.queries.function.docvalues.DoubleDocValues.floatVal(DoubleDocValues.java:48) > at > org.apache.lucene.queries.function.FunctionQuery$AllScorer.score(FunctionQuery.java:153) > at > org.apache.lucene.search.TopFieldCollector$OneComparatorScoringMaxScoreCollector.collect(TopFieldCollector.java:291) > at org.apache.lucene.search.Scorer.score(Scorer.java:62) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:588) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280) > {noformat} > And here is the 2nd call: > {noformat} > at > org.apache.lucene.queries.function.docvalues.DoubleDocValues.floatVal(DoubleDocValues.java:48) > at > org.apache.lucene.queries.function.FunctionQuery$AllScorer.score(FunctionQuery.java:153) > at > org.apache.lucene.search.ScoreCachingWrappingScorer.score(ScoreCachingWrappingScorer.java:56) > at > org.apache.lucene.search.FieldComparator$RelevanceComparator.copy(FieldComparator.java:951) > at > org.apache.lucene.search.TopFieldCollector$OneComparatorScoringMaxScoreCollector.collect(TopFieldCollector.java:312) > at org.apache.lucene.search.Scorer.score(Scorer.java:62) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:588) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280) > {noformat} > The 2nd call appears to use some score caching mechanism, which is all well > and good, but that same mechanism wasn't used in the first call so there's no > cached value to retrieve.
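The workaround the reporter describes, caching the last computed value keyed by docid inside a FunctionValues implementation, can be sketched standalone (hypothetical class, not Lucene's actual FunctionValues API):

```java
// Illustrative sketch of the reporter's workaround: memoize the last value
// by docid so a second call for the same document skips the expensive
// computation. Not Lucene's FunctionValues API.
import java.util.function.IntToDoubleFunction;

final class LastValueCachingValues {
  private final IntToDoubleFunction expensive;
  private int lastDoc = -1;
  private double lastValue;
  int computations = 0; // for demonstration only

  LastValueCachingValues(IntToDoubleFunction expensive) {
    this.expensive = expensive;
  }

  double doubleVal(int doc) {
    if (doc != lastDoc) { // only recompute when the docid changes
      lastValue = expensive.applyAsDouble(doc);
      lastDoc = doc;
      computations++;
    }
    return lastValue;
  }
}
```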
[GitHub] [lucene] pminkov commented on pull request #940: Use similarity.tf() in MoreLikeThis
pminkov commented on PR #940: URL: https://github.com/apache/lucene/pull/940#issuecomment-1150241668 I'm uploading the selected words file: [mlt-selected-words.txt](https://github.com/apache/lucene/files/8863975/mlt-selected-words.txt) And the input file: [plots.txt](https://github.com/apache/lucene/files/8863978/plots.txt)
[GitHub] [lucene] msokolov commented on pull request #948: Fix typo in PostingsReaderBase docstring
msokolov commented on PR #948: URL: https://github.com/apache/lucene/pull/948#issuecomment-1150259511 It looks to me as if the writer may have intended `ImpactsEnum` there? Maybe check git blame to see when that edit was made...
[jira] [Commented] (LUCENE-4574) FunctionQuery ValueSource value computed twice per document
[ https://issues.apache.org/jira/browse/LUCENE-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551745#comment-17551745 ] Kevin Risden commented on LUCENE-4574: -- Ugh, actually just after I sent that I found a reproducing case in Solr - I don't know if this is Solr specific. Still digging. It looks like bf=functionquery and sort=SOME_FIELD and fl=score. TopFieldCollector.populateScores is the secondary score calculation after MaxScore is already trying to calculate the scores.
[GitHub] [lucene] msokolov commented on pull request #945: Try to fix the gradle compilation in idea
msokolov commented on PR #945:
URL: https://github.com/apache/lucene/pull/945#issuecomment-1150273382

Thanks, @dweiss -- still seems to work for me after your latest fixes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (LUCENE-4574) FunctionQuery ValueSource value computed twice per document
[ https://issues.apache.org/jira/browse/LUCENE-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551750#comment-17551750 ]

David Smiley commented on LUCENE-4574:
--------------------------------------

Maybe, but more recently see LUCENE-10252, which Solr does not yet have, as it was done in 9.1 (Solr is still on Lucene 9.0).

> FunctionQuery ValueSource value computed twice per document
>
>                 Key: LUCENE-4574
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4574
[GitHub] [lucene] vigyasharma commented on pull request #948: Fix typo in PostingsReaderBase docstring
vigyasharma commented on PR #948:
URL: https://github.com/apache/lucene/pull/948#issuecomment-1150395272

> It looks to me as if the writer may have intended `ImpactsEnum` there? Maybe check git blame to see when that edit was made...

Good idea. I checked git history and found [this commit](https://github.com/apache/lucene/commit/c13216934c58870b487a9bd04b2fd4ea24431000#diff-dde065fe5990e588a2a00cb720b3df719655db2cb84d0cb2168519445fedf993), where

```java
/**
 * of this class to manage creation of {@link DocsEnum} and
 * {@link DocsAndPositionsEnum} instances.
 */
```

was replaced by

```java
/**
 * of this class to manage creation of {@link org.apache.lucene.index.PostingsEnum} and
 * {@link org.apache.lucene.index.PostingsEnum} instances.
 */
```

I guess earlier we had separate `DocsEnum` and `DocsAndPositionsEnum`, but now both can be handled within `PostingsEnum`? The class, however, does provide for creating `ImpactsEnum` (added later [here](https://github.com/apache/lucene/commit/af680af77f3f80c779e038a0ad8a136c9dcb9f5d)), and we can mention that in the docstring. I'll update the PR.
[GitHub] [lucene] vigyasharma commented on pull request #948: Fix typo in PostingsReaderBase docstring
vigyasharma commented on PR #948:
URL: https://github.com/apache/lucene/pull/948#issuecomment-1150432604

The "force push" was a rebase on main.
[GitHub] [lucene] gsmiller commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1150439971

Let me try to summarize my understanding of the "future optimization" debate as it pertains to this proposed API/implementation and see if we're on the same page or if I'm overlooking something.

The current proposal encapsulates/delegates matching logic behind `FacetSetMatcher`, which is responsible for determining whether or not that `FSM` instance matches a provided point. There's `RangeFSM`, which knows how to match based on whether or not the point is contained in the n-dim range, and there's `ExactFSM`, which just does exact point equivalence. The "protocol" for doing facet aggregation/counting is implemented inside `MatchingFacetSetsCounts`, which delegates to the `FSM#matches` method.

The inefficiency is that, because `MatchingFacetSetsCounts` can make no assumptions about the `FSM` instances provided to it, it must iterate _all_ provided `FSM` instances for every indexed point for every provided hit. For the single point case (`ExactFSM`), this is crazy, right? Even if there are a small number of provided `ExactFSM` instances we're matching against, doing a linear scan of all of them for every point is pretty dang inefficient, especially for a case where there are many provided hits with many indexed points each. I think the same logic generally holds true for the range case as well, but maybe that's more debatable. But the problem is, `MatchingFacetSetsCounts` doesn't know anything about these `FSM` instances it gets and can't do anything other than just ask them all, "hey, do you match this point?"

And so the debate seems to be how to set up the API to allow for future optimizations, or whether we should even worry about it at all. I personally think we should design with this optimization in mind, but I think we're close, and I don't actually think the current proposal needs to really change to allow for future optimizations.
This is where I get a little fuzzy on the conversation that's taken place, as I haven't totally caught up on the various proposals and conversations. But if we kept the implementation as it currently exists, and in the future we want to put these optimizations in place, could we not just add a method to `FSM` that exposes the min/max values for each dimension? Then `MatchingFacetSetsCounts` could inspect the underlying "hyperrectangles" represented by each `FSM` by looking at these min/max values before it counts, and decide how it wants to proceed. The `matches` method is potentially still useful/needed depending on how flexible we want this thing to be: if a user creates an `FSM` implementation that's more complicated than just a "hyperrectangle" (e.g., some complicated geometry), the contract could be that the provided min/max is a bounding box, but `matches` still needs to be called to actually confirm the match.

I point this out not to propose implementing it now, but to say that I think we have options to extend what is currently proposed here if/when we want to optimize. Does this make sense, or am I talking about a completely different problem or missing key parts of the conversation that's happened? Apologies if I have.
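The extension Greg sketches in words (keep `matches` as the source of truth, additionally expose a per-dimension bounding box) might look roughly like the following. These interface and class names are hypothetical illustrations, not the API in the PR:

```java
// Illustrative sketch only; names are made up and not the PR's API. The idea:
// matches() stays the exact test, while boundingMin()/boundingMax() expose a
// bounding box that the counting code could use to organize matchers into an
// R-tree/KD-tree or do cheap reject tests before calling matches().
interface FacetSetMatcherSketch {
  /** Exact test: does this matcher accept the given point? */
  boolean matches(long[] point);

  /** Bounding box contract: min[i] <= point[i] <= max[i] for every accepted point. */
  long[] boundingMin();
  long[] boundingMax();
}

final class RangeMatcherSketch implements FacetSetMatcherSketch {
  private final long[] min, max;

  RangeMatcherSketch(long[] min, long[] max) {
    this.min = min;
    this.max = max;
  }

  @Override
  public boolean matches(long[] point) {
    // For a pure range matcher the bounding box IS the match criterion.
    for (int i = 0; i < point.length; i++) {
      if (point[i] < min[i] || point[i] > max[i]) return false;
    }
    return true;
  }

  @Override public long[] boundingMin() { return min; }
  @Override public long[] boundingMax() { return max; }
}
```

For a matcher with more complicated geometry, the box would only be a coarse filter and `matches` would still confirm the hit, which is the two-phase contract described above.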
[GitHub] [lucene] mdmarshmallow commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1150471678

So based on everyone's comments:

1. It seems like we should ditch the `hyperrectangle` implementation, and `facetset` does everything we need for right now.
2. When we decide to optimize this (right after this PR is merged, ideally), we would let `MatchingFacetSetsCount` take a look at the `FSM`s passed to it and then determine whether it should put them into an R-tree, a KD-tree, or just linearly scan, based on the `min` and `max` of each `FSM`.

I think this makes sense, but we also shouldn't discuss it too much here, as I think this is for another PR. The point is that we can optimize the `facetsets` package in its current state. With that being said, I do plan on writing the KD-tree and R-tree optimizations as soon as this is merged, so I am still for this remaining a `long[]` API.
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Issue Type: Bug  (was: Improvement)

> fix error in 32bit jvm object alignment gap calculation
> -------------------------------------------------------
>
>                 Key: LUCENE-10605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10605
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/other
>    Affects Versions: 9.2
>         Environment: jdk 7 32-bit
>                      jdk 8 32-bit
>            Reporter: sun wuqiang
>            Priority: Trivial
>         Attachments: image-2022-06-08-20-50-27-712.png, image-2022-06-08-21-24-57-674.png
>
> ArrayUtil.{*}oversize{*}(int minTargetSize, int bytesPerElement)
> This method is used to calculate the optimal length of an array during expansion.
>
> According to the current logic, in order to avoid space waste caused by the *object alignment gap*, in a *32-bit* JVM the array length will be chosen from the numbers in the +current optional+ columns of the table below. But the results aren't perfect.
> See the table below.
> !image-2022-06-08-20-50-27-712.png!
>
> I used *jol-core* to calculate the object alignment gap:
> {code:xml}
> <dependency>
>     <groupId>org.openjdk.jol</groupId>
>     <artifactId>jol-core</artifactId>
>     <version>0.16</version>
>     <scope>compile</scope>
> </dependency>
> {code}
>
> Execute the following code:
> {code:java}
> System.out.println(ClassLayout.parseInstance(new byte[6]).toPrintable());
> {code}
>
> !image-2022-06-08-21-24-57-674.png!
>
> To further verify that the tool's results are correct, I wrote the following code to infer how much space arrays of different lengths actually occupy, based on when the OOM occurs. The conclusion is consistent with jol-core.
> {code:java}
> // -Xms16m -Xmx16m
> // Used to infer the memory space occupied
> // by arrays of various lengths
> public static void main(String[] args) {
>     byte[][] arr = new byte[1024 * 1024][];
>     for (int i = 0; i < arr.length; i++) {
>         if (i % 100 == 0) {
>             System.out.println(i);
>         }
>         // According to the OOM occurrence time in a 32-bit JVM,
>         // arrays with lengths from 5 to 12 occupy
>         // the same amount of memory
>         arr[i] = new byte[5];
>     }
> }
> {code}
> *new byte[5]* and *new byte[12]* use the same amount of memory.
>
> In addition, +*-XX:ObjectAlignmentInBytes*+ should also affect the return value of this method, but I don't know whether it is necessary to handle that. If necessary, I will modify it together. Thank you very much!
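The alignment arithmetic behind the report can be sketched as follows. The header size and alignment are assumptions for a 32-bit JVM (12-byte `byte[]` header, default 8-byte object alignment); actual values depend on the JVM and `-XX:ObjectAlignmentInBytes`:

```java
// Sketch of the alignment arithmetic discussed above, assuming a 32-bit JVM
// where a byte[] has a 12-byte header and objects are padded to an 8-byte
// boundary. Under these assumptions byte[5] through byte[12] all occupy
// 24 bytes, so when growing past byte[5] the "best" next length is 12, not 8.
final class AlignmentSketch {
  static final int ARRAY_HEADER_BYTES = 12; // assumed 32-bit byte[] header size
  static final int OBJECT_ALIGNMENT = 8;    // assumed -XX:ObjectAlignmentInBytes

  /** Total heap bytes occupied by a byte[length], padded to the alignment. */
  static int occupiedBytes(int length) {
    int raw = ARRAY_HEADER_BYTES + length;
    // round up to the next multiple of OBJECT_ALIGNMENT
    return ((raw + OBJECT_ALIGNMENT - 1) / OBJECT_ALIGNMENT) * OBJECT_ALIGNMENT;
  }

  /** Largest array length that still fits in the same padded footprint. */
  static int bestLengthForSameFootprint(int length) {
    return occupiedBytes(length) - ARRAY_HEADER_BYTES;
  }
}
```

For example, `occupiedBytes(8)` and `occupiedBytes(12)` both evaluate to 24 under these assumptions, which is the waste the issue says `oversize` leaves on the table when it returns 8.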
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Attachment: image-2022-06-09-08-25-55-289.png
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Attachment: image-2022-06-09-08-26-36-528.png
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Description: edited to add an example:

For example, if I want to expand byte[2], I call the method oversize(2,1) to get the size of the next array, which returns 8. But byte[8] is not the best result: since byte[8] and byte[12] use the same memory space (both are 24 bytes due to the alignment gap), it's best to return 12 here. See the table below.

!image-2022-06-09-08-26-36-528.png!

[remainder of the description unchanged]
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Description: minor wording edits [otherwise unchanged]
[jira] [Updated] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation
[ https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sun wuqiang updated LUCENE-10605:
---------------------------------
    Affects Version/s: 8.11.1
                       (was: 9.2)
[jira] [Created] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
Kaival Parikh created LUCENE-10606:
-----------------------------------

             Summary: Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
                 Key: LUCENE-10606
                 URL: https://issues.apache.org/jira/browse/LUCENE-10606
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/search
            Reporter: Kaival Parikh

While working on this [PR|https://github.com/apache/lucene/pull/932] to add prefilter testing support, we saw that hit collection took a long time for BitSetIterator-backed scorers, due to iteration over the entire underlying BitSet and copying it into an internal one. (Link to [numbers|https://github.com/apache/lucene/pull/932#discussion_r96850], second table.)

These BitSetIterators can be frequent (as they are used in LRUQueryCache), and bulk collection can be optimized with more knowledge of the underlying iterator.
[jira] [Commented] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551976#comment-17551976 ]

Kaival Parikh commented on LUCENE-10606:

Instead of collecting hit-by-hit using a LeafCollector, we can break down the search by instantiating a weight, creating scorers, and checking the underlying iterator. If it is backed by a BitSet, we can directly update the reference (as we won't be editing it). Otherwise, we can create a new BitSet from the iterator using BitSet.of.

This way the collection is optimized (and can be advantageous, as LRUQueryCache internally uses a BitSet, so such iterators will be common). Sample [code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection]
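The reuse-or-materialize branch described in the comment can be sketched roughly as follows. This is a hedged illustration only: `DocIterator`, `BitSetBackedIterator`, and `PrefilterCollector` are hypothetical stand-ins built on `java.util.BitSet`, not Lucene's actual `DocIdSetIterator`/`BitSetIterator`/`BitSet.of` API.

```java
import java.util.BitSet;

// Hypothetical minimal stand-in for a doc-id iterator.
interface DocIterator {
    int nextDoc(); // next matching doc id, or -1 when exhausted
}

// Stand-in for an iterator that is backed by a BitSet
// (analogous to Lucene's BitSetIterator, e.g. from LRUQueryCache).
final class BitSetBackedIterator implements DocIterator {
    private final BitSet bits;
    private int cur = -1;

    BitSetBackedIterator(BitSet bits) { this.bits = bits; }

    BitSet getBitSet() { return bits; }

    @Override
    public int nextDoc() {
        cur = bits.nextSetBit(cur + 1);
        return cur;
    }
}

final class PrefilterCollector {
    // If the iterator is already backed by a BitSet, reuse the reference
    // (it is never modified, so no copy is needed: O(1)). Otherwise
    // materialize a BitSet by iterating, analogous to BitSet.of.
    static BitSet collect(DocIterator it, int maxDoc) {
        if (it instanceof BitSetBackedIterator) {
            return ((BitSetBackedIterator) it).getBitSet();
        }
        BitSet bits = new BitSet(maxDoc);
        for (int doc = it.nextDoc(); doc >= 0; doc = it.nextDoc()) {
            bits.set(doc);
        }
        return bits;
    }
}
```

The fast path skips the per-document iteration entirely, which is where the large pre-filter speedups in the benchmark tables below plausibly come from when the filter is served out of the query cache.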
[jira] [Updated] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaival Parikh updated LUCENE-10606:
---
Description:
While working on this [PR|https://github.com/apache/lucene/pull/932] to add prefilter testing support, we saw that hit collection took a long time for BitSetIterator-backed scorers (due to iteration over the entire underlying BitSet, and copying it into an internal one)

These BitSetIterators can be frequent (as they are used in LRUQueryCache), and bulk collection can be optimized with more knowledge of the underlying iterator

was:
While working on this [PR|https://github.com/apache/lucene/pull/932] to add prefilter testing support, we saw that hit collection took a long time for BitSetIterator-backed scorers (due to iteration over the entire underlying BitSet, and copying it into an internal one) (Link to [numbers|https://github.com/apache/lucene/pull/932#discussion_r96850], second table)

These BitSetIterators can be frequent (as they are used in LRUQueryCache), and bulk collection can be optimized with more knowledge of the underlying iterator
[jira] [Comment Edited] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551976#comment-17551976 ]

Kaival Parikh edited comment on LUCENE-10606 at 6/9/22 6:21 AM:

Instead of collecting hit-by-hit using a LeafCollector, we can break down the search by instantiating a weight, creating scorers, and checking the underlying iterator. If it is backed by a BitSet, we can directly update the reference (as we won't be editing it). Otherwise, we can create a new BitSet from the iterator using BitSet.of.

This way the collection is optimized (and can be advantageous, as LRUQueryCache internally uses a BitSet, so such iterators will be common). Sample [code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection]

This is the baseline:
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.967|1.55|0.976|19.65|
|0.6|166|0.964|1.94|0.981|17.79|
|0.4|250|0.961|2.69|0.986|14.71|
|0.2|500|0.958|4.78|0.992|11.19|
|0.1|1000|0.959|8.53|0.994|11.50|
|0.01|1|0.937|58.32|1.000|10.34|

This is after alternate collection:
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.961|1.56|0.976|1.64|
|0.6|166|0.960|1.93|0.980|1.98|
|0.4|250|0.961|2.68|0.987|2.68|
|0.2|500|0.960|4.76|0.991|4.55|
|0.1|1000|0.961|8.50|0.995|7.84|
|0.01|1|0.953|58.17|1.000|9.71|
[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester
kaivalnp commented on code in PR #932: URL: https://github.com/apache/lucene/pull/932#discussion_r893116012

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
@@ -225,6 +225,11 @@
   public BitSetIterator getIterator(int contextOrd) {
     return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
   }

+  public void setBitSet(BitSet bitSet, int cost) {
+    bitSets[ord] = bitSet;

Review Comment:
Thank you! I have opened a JIRA [issue](https://issues.apache.org/jira/browse/LUCENE-10606), hope it is okay. Please feel free to suggest edits / alternate approaches / provide feedback.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries
[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551976#comment-17551976 ]

Kaival Parikh edited comment on LUCENE-10606 at 6/9/22 6:29 AM:

Instead of collecting hit-by-hit using a LeafCollector, we can break down the search by instantiating a weight, creating scorers, and checking the underlying iterator. If it is backed by a BitSet, we can directly update the reference (as we won't be editing it). Otherwise, we can create a new BitSet from the iterator using BitSet.of.

This way the collection is optimized (and can be advantageous, as LRUQueryCache internally uses a BitSet, so such iterators will be common). Sample [code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection]

This is the baseline:
||selectivity||pre-filter recall||pre-filter time||
|0.8|0.976|19.65|
|0.6|0.981|17.79|
|0.4|0.986|14.71|
|0.2|0.992|11.19|
|0.1|0.994|11.50|
|0.01|1.000|10.34|

This is after alternate collection:
||selectivity||pre-filter recall||pre-filter time||
|0.8|0.976|1.64|
|0.6|0.980|1.98|
|0.4|0.987|2.68|
|0.2|0.991|4.55|
|0.1|0.995|7.84|
|0.01|1.000|9.71|