Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
gf2121 commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1831412787 Hi @mikemccand @jpountz ! The [newest nightly benchmark](https://home.apache.org/~mikemccand/lucenebench/2023.11.28.18.03.59.html) shows we get back some speed for [PKLookUp](https://home.apache.org/~mikemccand/lucenebench/PKLookup.html), but still not as good as the performance before the merge of https://github.com/apache/lucene/pull/12631. This result is not quite same as what i saw locally, but reasonable for me as we do more `Outputs#read` anyway. Since we are [preparing for the release of 9.9.0](https://lists.apache.org/thread/6bxf0fo1ypblxn2hbdfyg38y3hc8lo9g). I'm not sure if it is an acceptable trade-off now. Should we backport this or revert https://github.com/apache/lucene/pull/12631? I'd appreciate your suggestions :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Multi-value Support for KnnVectorField [lucene]
alessandrobenedetti commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1831534483 Hi @david-sitsky, the multi-valued vectors in Lucene's contribution is now paused for lack of fundings. I'll resume it from my side if and when I get some sponsors :) The nested documents approach on the other hand has been released with Lucene 9.8! You read the various considerations that apply in the thread! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]
ChrisHegarty commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1831540117 Hi, I'd like to ask a clarifying question as part of the _9.9.0_ release manager duties, since this currently-unresolved issue is associated with _Milestone 9.9.0_. It's clear that progress has been made on this issue, with the addition of the new `TaxonomyReader::getBulkOrdinals` method. Is more work planned, or should this issue be closed? It's fine either way, I just want to ensure that we are clear about the targeted release, if any. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] configuration items of the alg file are adapted to the 9.0 branch [LUCENE-10100] [lucene]
ChrisHegarty commented on issue #11138: URL: https://github.com/apache/lucene/issues/11138#issuecomment-1831542918 Hi, I'd like to ask a clarifying question as part of the _9.9.0_ release manager duties, since this currently-unresolved issue is associated with Milestone 9.9.0. Is more work planned, or should this issue be closed? It's fine either way, I just want to ensure that we are clear about the targeted release, if any. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs [lucene]
Shradha26 commented on issue #12558: URL: https://github.com/apache/lucene/issues/12558#issuecomment-1831628034 Hey @gsmiller, thanks for trying this out again - were you able to reproduce this in 9_8 again? I'd found the bug in 9_7 and I don't think it was an issue in 9_8. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] configuration items of the alg file are adapted to the 9.0 branch [LUCENE-10100] [lucene]
mikemccand commented on issue #11138: URL: https://github.com/apache/lucene/issues/11138#issuecomment-1831658096 Ahh thanks @ChrisHegarty -- no more work for this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] configuration items of the alg file are adapted to the 9.0 branch [LUCENE-10100] [lucene]
mikemccand closed issue #11138: configuration items of the alg file are adapted to the 9.0 branch [LUCENE-10100] URL: https://github.com/apache/lucene/issues/11138 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Fix intermittently failing TestSortedSetFieldSource [lucene]
ChrisHegarty opened a new pull request, #12850: URL: https://github.com/apache/lucene/pull/12850 This commit fixes the intermittently failing TestSortedSetFieldSource. The test assertions depend on doc order which may be affected by merging. The fix is to trivially avoid merging for the very small index, with just two docs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]
mikemccand commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1831660442 Thanks @ChrisHegarty -- this one is done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]
ChrisHegarty commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1831661212 The test previously has been seen to fail with: ``` org.junit.ComparisonFailure: expected: but was: at __randomizedtesting.SeedInfo.seed([45D4C56DF2092811:7D67E193D5FAFCC0]:0) at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:117) at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:146) at org.apache.lucene.queries.function.TestSortedSetFieldSource.testSimple(TestSortedSetFieldSource.java:59) ... ``` After the change the test has not failed in several hundreds of thousands of runs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]
mikemccand closed issue #12180: Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting URL: https://github.com/apache/lucene/issues/12180 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
mikemccand commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1831665131 I don't think we should revert #12631 -- that one was a nice disk space improvement, and the `PKLookup` performance is quite noisy -- going up or down by just as much when we upgrade JDKs or switch to Panama MMap impl. Thanks @gf2121. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]
mikemccand commented on PR #12847: URL: https://github.com/apache/lucene/pull/12847#issuecomment-1831667475 > The test failed with just `Error: The operation was canceled.` but I can't tell why it happened. The same PR in my local branch works: [dungba88#20](https://github.com/dungba88/lucene/pull/20) Weird. I restarted the failed job. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] [9.x] Fix intermittently failing TestSortedSetFieldSource [lucene]
ChrisHegarty opened a new pull request, #12851: URL: https://github.com/apache/lucene/pull/12851 Backport of: * Fix intermittently failing TestSortedSetFieldSource #12850 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]
mikemccand commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1408286029 ## lucene/CHANGES.txt: ## @@ -176,7 +178,8 @@ API Changes * GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph build. (Shubham Chaudhary) -* +* GITHUB#12758, GITHUB#12803: Remove FST constructor with DataInput for metadata. Please Review Comment: Ahh are you adding the missing 9.9 CHANGES.txt entries for recent FST changes here? Great! ## lucene/CHANGES.txt: ## @@ -246,6 +249,8 @@ Improvements minimal FST. Inspired by this Rust FST implemention: https://blog.burntsushi.net/transducers (Mike McCandless) +* GITHUB#12738: NodeHash now stores the FST nodes data instead of just node addresses (Anh Dung Bui) Review Comment: And here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
gf2121 commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1831673998 Thanks @mikemccand for feedback! I'll backport this to 9.9.0 then. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]
dungba88 commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1409117086 ## lucene/CHANGES.txt: ## @@ -176,7 +178,8 @@ API Changes * GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph build. (Shubham Chaudhary) -* +* GITHUB#12758, GITHUB#12803: Remove FST constructor with DataInput for metadata. Please Review Comment: Yeah, I'm wondering if I should split it in 2 PR, or this change itself can be backported to 9.9 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]
mikemccand commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1409134920 ## lucene/CHANGES.txt: ## @@ -176,7 +178,8 @@ API Changes * GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph build. (Shubham Chaudhary) -* +* GITHUB#12758, GITHUB#12803: Remove FST constructor with DataInput for metadata. Please Review Comment: No worries -- I can merge in this PR and then backport everything -- the `fstSizeInBytes` is important for 9.9.0 as well. Thanks for fixing the CHANGES! Please check if we missed anything else if you can! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]
mikemccand merged PR #12847: URL: https://github.com/apache/lucene/pull/12847 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]
mikemccand commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1831732238 > The test assertions depend on doc order which may be affected by merging. Hmm, I don't fully understand the test, but did the sort order truly rely on `docid` tie-breaking? If so then this is the right fix. I just want to make sure we are not hiding a bug that e.g. causes this sorter to incorrectly sort / compare documents across segments that were not in fact ties by the sort criteria. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]
dungba88 commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1409158917 ## lucene/CHANGES.txt: ## @@ -176,7 +178,8 @@ API Changes * GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph build. (Shubham Chaudhary) -* +* GITHUB#12758, GITHUB#12803: Remove FST constructor with DataInput for metadata. Please Review Comment: Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Do we suboptimally call ByteBuffersDataOutput.toDataInput()? [lucene]
dungba88 opened a new issue, #12852: URL: https://github.com/apache/lucene/issues/12852 ### Description See this comment: https://github.com/apache/lucene/pull/12624#issuecomment-1829633268 > I wonder: are there other places in Lucene that might fall prey to this performance trap (calling toDataInput frequently while continuing to append bytes to the DataOutput)? FreqProxTermsWriter seems to use this for each term to re-read the postings (maybe for the static sort use case). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1409177467 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return copy; } + /** + * Returns an array whose size is at least {@code minLength}, generally over-allocating + * exponentially, but never allocating more than {@code maxLength} elements. + */ + public static int[] growInRange(int[] array, int minLength, int maxLength) { +if (array.length >= minLength) { + return array; +} +if (minLength > maxLength) { + throw new IllegalArgumentException( + "requested minimum array length " + + minLength + + " is larger than requested maximum array length " + + maxLength); +} + +int potentialLength = oversize(minLength, Integer.BYTES); +if (potentialLength > maxLength) { Review Comment: I think `growExact(array, Math.min(potentialLength, maxLength))` would be clearer? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1409183001 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return copy; } + /** + * Returns an array whose size is at least {@code minLength}, generally over-allocating + * exponentially, but never allocating more than {@code maxLength} elements. + */ + public static int[] growInRange(int[] array, int minLength, int maxLength) { +if (array.length >= minLength) { + return array; +} +if (minLength > maxLength) { + throw new IllegalArgumentException( + "requested minimum array length " + + minLength + + " is larger than requested maximum array length " + + maxLength); +} + +int potentialLength = oversize(minLength, Integer.BYTES); +if (potentialLength > maxLength) { + return growExact(array, maxLength); +} +return growExact(array, potentialLength); + } + /** * Returns an array whose size is at least {@code minSize}, generally over-allocating * exponentially Review Comment: Would it make sense if we delegate the existing `grow(int[], int)` to the new one to avoid having double code path? Something like `return grow(array, minSize, Integer.MAX_VALUE)` would work I guess. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1409183001 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return copy; } + /** + * Returns an array whose size is at least {@code minLength}, generally over-allocating + * exponentially, but never allocating more than {@code maxLength} elements. + */ + public static int[] growInRange(int[] array, int minLength, int maxLength) { +if (array.length >= minLength) { + return array; +} +if (minLength > maxLength) { + throw new IllegalArgumentException( + "requested minimum array length " + + minLength + + " is larger than requested maximum array length " + + maxLength); +} + +int potentialLength = oversize(minLength, Integer.BYTES); +if (potentialLength > maxLength) { + return growExact(array, maxLength); +} +return growExact(array, potentialLength); + } + /** * Returns an array whose size is at least {@code minSize}, generally over-allocating * exponentially Review Comment: Would it make sense if we delegate the existing `grow(int[], int)` to the new one to avoid having double code path? Something like `return growInRange(array, minSize, Integer.MAX_VALUE)` would work I guess. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]
ChrisHegarty commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1831806426 My understanding of this part of the test is that it asserts the sort order of the multi-valued string values in the `value` field. The changes here do not affect this assertion, the sort order is still checked. There are just two docs in the index. The first doc has a single value for the `value` field, `baz`. The second doc has two values for the `value` field, `foo` and `bar`. The test expects to find `baz` and `bar` (since `bar` is less than `foo`), for the `values` field in doc1 and doc2 respectively. The test assertions are based on doc order in the index. Does this address your concern? Or have I missed the point? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
benwtrent commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1831865170 I ran knnPerfTest in Lucene util against this PR. 100k cohere vectors. Flushing was depending on memory usage (Do we check the ram usage of onHeapgraph during indexing to determine when to flush?) main branch: ``` recall latency nDocfanout maxConn beamWidth index 0.912 0.77 10 0 16 100 48602 ``` This branch: ``` recall latency nDocfanout maxConn beamWidth index ms 0.912 0.75 10 0 16 100 165929 ``` Indexing is about 3.5x slower with this change. I don't know if its due to the OnHeapGraph memory estimation being slow or the node resizing. I am gonna run a profiler to see whats up. Good news is that search latency and recall are unchanged. Forced merging time seems about the same as well :). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
benwtrent commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1831889329 Yeah, `ramBytesUsed` changes here are adding significant overhead to indexing. Here is a cpu profile. [pr12844-768-10-wall.jfr.zip](https://github.com/apache/lucene/files/13500964/pr12844-768-10-wall.jfr.zip) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1832050665 I re-ran the Test2BFST with the new change, it looks much better ``` 1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585] 1> 0...: took 0 seconds 1> 100...: took 27 seconds 1> 200...: took 54 seconds 1> 300...: took 82 seconds 1> 400...: took 109 seconds 1> 500...: took 137 seconds 1> 600...: took 165 seconds 1> 700...: took 192 seconds 1> 800...: took 219 seconds 1> 900...: took 247 seconds 1> 1000...: took 275 seconds 1> 1100...: took 302 seconds ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
jpountz commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1832082403 +1 to not reverting #12631. The fact that our terms index used to not share prefixes properly was a bug, I'm glad it's fixed even though it may make performance slightly slower because more nodes have non-empty outputs to contribute. FWIW it looks like your change did not only help PKLookup, but possibly also [`CountTerm`](https://home.apache.org/~mikemccand/lucenebench/CountTerm.html), [`Respell`](https://home.apache.org/~mikemccand/lucenebench/Respell.html), [`Fuzzy1`](https://home.apache.org/~mikemccand/lucenebench/Fuzzy1.html) and [`Fuzzy2`](https://home.apache.org/~mikemccand/lucenebench/Fuzzy2.html). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy collected acc(maxFreqs) into empty acc, rather than merge them. [lucene]
jpountz commented on PR #12846: URL: https://github.com/apache/lucene/pull/12846#issuecomment-1832112426 > Sorry, I am still confused about this. If we clear an unEmpty treeset, how to deal with its competitive entries? This looks similar to the arraycopy that you perform on the `maxFreqs` array, ignoring existing values? We would just need to update the contract of this method to say something like `Copy the provided accumulator into this accumulator, ignoring its existing state. This is effectively the same as calling clear() then addAll().`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1832333168 Thank you for running the benchmarks @benwtrent, you were quicker than I was. I was worried the modified `ramBytesUsed` would be slow. Maybe we can come up with a smarter way to do the estimation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1409603364 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return copy; } + /** + * Returns an array whose size is at least {@code minLength}, generally over-allocating + * exponentially, but never allocating more than {@code maxLength} elements. + */ + public static int[] growInRange(int[] array, int minLength, int maxLength) { +if (array.length >= minLength) { + return array; +} +if (minLength > maxLength) { + throw new IllegalArgumentException( + "requested minimum array length " + + minLength + + " is larger than requested maximum array length " + + maxLength); +} Review Comment: Initially, I had placed the validation first. I changed it to cover cases where INITIAL_CAPACITY >= minSize > maxSize. In these cases, we've already allocated beyond maxSize, so we might as well use that memory. Do you think that ends up being more confusing than validating first? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]
ChrisHegarty commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1832391520 > This new failure looks like it's due to the fact that MockRandomMergePolicy now randomly reverses the doc ID order on merge? That's it exactly. I confirmed this through local instrumentation of MockRandomMergePolicy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]
gsmiller opened a new pull request, #12853: URL: https://github.com/apache/lucene/pull/12853 In GH#12558, @Shradha26 discovered that faceting implementations are sometimes getting empty MatchingDoc lists, resulting in undesired behavior. We should consider it a bug for MatchingDocs to not be populated, even if there are no collected hits. After chasing the issue a bit, it appears that drill-sideways can miss calling `#finish` on the FacetsCollectors it manages if it end up not scoring any docs. This PR addresses it. I made a couple related changes in this PR to get at fixing the bug: 1. Added `DrillSideways#createDrillSidewaysFacetsCollectorManager` as a protected method that users can extend to create specific facet collecting implementations. This seems reasonable to me as we already have a similar hook for the drill-down collector. I found this hook useful for testing. If we don't want to expose this API surface, I can adjust the testing approach, but I think it's reasonable to expose personally. 2. Changed the `DrillSidewaysScorer` to accept `LeafCollectors` instead of `FacetCollectors`. This seems like a better division of responsibilities anyway, and is simpler, but it also lets `DrillSidewaysQuery` manage the creation of the leaf collectors when setting up the `BulkScorer`. This is necessary if `DrillSidewaysQuery` is going to call `#finish` in no-scoring cases since `#finish` relies on a leaf scorer having been created already from the collector. 3. I also added a TODO to go remove `DrillSideways#createDrillDownFacetsCollector` since it is no longer used and is a trappy hook for users. I'll tackle this in a separate PR though since we need a deprecation back port and it's not really related to this (just something I came across). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]
gsmiller commented on code in PR #12853: URL: https://github.com/apache/lucene/pull/12853#discussion_r1409693152 ## lucene/CHANGES.txt: ## @@ -112,6 +112,9 @@ Bug Fixes * GITHUB#12220: Hunspell: disallow hidden title-case entries from compound middle/end +* GITHUB#12558: Ensure #finish is called on all drill-sideways FacetsCollectors even when no hits are scored. Review Comment: Put this under 10.0 for now. After 9.9, if we add a section for 9.10 I will back port it there (no reason not to, we just don't have the section setup). Also, if 9.9 re-spins I will back port to 9.9, but we shouldn't block that release for this fix IMO. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs [lucene]
gsmiller commented on issue #12558: URL: https://github.com/apache/lucene/issues/12558#issuecomment-1832460996 I opened a PR here that fixes the root cause of this bug: https://github.com/apache/lucene/pull/12853 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] IntTaxonomyFacets test case [lucene]
gsmiller closed pull request #12563: IntTaxonomyFacets test case URL: https://github.com/apache/lucene/pull/12563 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] IntTaxonomyFacets test case [lucene]
gsmiller commented on PR #12563: URL: https://github.com/apache/lucene/pull/12563#issuecomment-1832462625 Closing in favor of #12853 that fixes the underlying root cause and adds testing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
zhaih commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1832571805 @benwtrent Thanks for running the benchmark. I looked at the profile and I think we call `ramBytesUsed` after every document is indexed to control the flush [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java#L206). @stefanvodita I can think of two options here: 1. just use maxSize as before for estimation, it's not too accurate but at least is a good upper bound 2. Carefully account the extra memory used when we add new nodes to the graph and accumulate it by an `AtomicLong`, including the cost for the node itself as well as the cost of resizing the existing neighbor's NeighborArray. Which I think is doable (in both single thread and multi thread situation) but a bit tricky. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated [lucene]
gsmiller opened a new pull request, #12854: URL: https://github.com/apache/lucene/pull/12854 This extension hook is only referenced by the also-deprecated `DrillSideways#search(DrillDownQuery, Collector)`, so let's mark it deprecated as well and remove it on main. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Remove DrillSideways#createDrillDownFacetsCollector in favor of the manager-based hook [lucene]
gsmiller opened a new pull request, #12855: URL: https://github.com/apache/lucene/pull/12855 This extension hook is no longer used on the main branch, so let's remove it. Opened a corresponding PR to mark this deprecated on 9x (#12854) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] [Minor] Fix the only use of java.lang.String#toLowerCase() with no Locale [lucene]
slow-J opened a new pull request, #12856: URL: https://github.com/apache/lucene/pull/12856 I was looking at this old issue: https://github.com/apache/lucene/issues/4390 `Flexible Queryparser has some Locale violation with lowercasing`, which was already fixed with the work done in [GH#5271](https://github.com/apache/lucene/issues/5271) But I came across this single use of `.toLowerCase()` with no Locale in Checksum.java. TIL: forbidden apis doesn't run against the `package org.apache.lucene.gradle`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1832732598 Thanks for the suggestions @zhaih! I have to think about option 2 a bit more. If we change `ramBytesUsed` back, performance recovers (Mostly? I'm not sure how noisy this benchmark is). Benchmark config: ``` dim = 100 doc_vectors = '%s/data/enwiki-20120502-lines-1k-100d.vec' % constants.BASE_DIR query_vectors = '%s/util/tasks/vector-task-minilm.vec' % constants.BASE_DIR ``` main: ``` recall latency nDocfanout maxConn beamWidth visited index ms 0.7200.16 1 0 64 250 100 25511.00 post-filter 0.5420.23 10 0 64 250 100 43620 1.00 post-filter 0.5120.27 20 0 64 250 100 108235 1.00 post-filter ``` This PR: ``` recall latency nDocfanout maxConn beamWidth visited index ms 0.7200.16 1 0 64 250 100 50281.00 post-filter 0.5420.23 10 0 64 250 100 315369 1.00 post-filter 0.5120.28 20 0 64 250 100 1578461 1.00 post-filter ``` This PR with the old memory estimation: ``` recall latency nDocfanout maxConn beamWidth visited index ms 0.720 0.16 1 0 64 250 100 24971.00 post-filter 0.542 0.23 10 0 64 250 100 43886 1.00 post-filter 0.512 0.28 20 0 64 250 100 117152 1.00 post-filter ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Reproducible TestDrillSideways failure [lucene]
gsmiller commented on issue #12418: URL: https://github.com/apache/lucene/issues/12418#issuecomment-1832886094 I think I know what's going on here, and I believe #12853 fixes it. In both seeds mentioned here, I've noticed that the drill-sideways query short-circuits on at least one segment, knowing it has no results. Specifically, a null BulkScorer gets returned from DrillSidewaysQuery. The problem with this is that the "sideways" facet collectors get left hanging, and nothing calls #finish on them. I reproduced with both seeds on 9x and main, and verified the above fix works in all cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Is it correct for facets to assume positive aggregation values? [lucene]
gsmiller commented on issue #12585: URL: https://github.com/apache/lucene/issues/12585#issuecomment-1832904037 I think I'd be curious if there are many real-world use-cases out there that have a non-positive association between each document and facet label? I would guess it's pretty uncommon. I actually can't really contrive an example either. @stefanvodita have you encountered real-world use-cases that would benefit from this? Apologies if you describe it somewhere and I'm missing it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Removing TermInSetQuery array ctor [lucene]
gsmiller commented on PR #12837: URL: https://github.com/apache/lucene/pull/12837#issuecomment-1832910639 +1 to this simplification. Let's also get rid of `public TermInSetQuery(RewriteMethod rewriteMethod, String field, BytesRef... terms)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Add a new static method for KeywordField#newSetQuery to support collections parameter [lucene]
gsmiller commented on issue #12243: URL: https://github.com/apache/lucene/issues/12243#issuecomment-1832911784 +1 to shrinking the API surface of the `TermInSetQuery` ctors and getting rid of the varargs flavors since they're pure syntactic sugar. Looks like #12837 is the related PR, so left a small comment over there. Thanks @slow-J ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]
mikemccand commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1833004513 Thanks for the detailed explanation @ChrisHegarty and @jpountz -- fix looks good! Sneaky random merge policies... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy collected acc(maxFreqs) into empty acc, rather than merge them. [lucene]
vsop-479 commented on PR #12846: URL: https://github.com/apache/lucene/pull/12846#issuecomment-1833216581 @jpountz I applied copy for all level's empty acc, Please take a look when you get a chance. As so far I just performed copy on empty acc that cleared by ``clear``. Do you mean we could skip the ``clear`` (mainly ``Arrays.fill(maxFreqs, 0)``) after ``writeSkipData``, and override the stale acc in next writing directly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org