Re: [PR] Copy collected acc(maxFreqs) into empty acc, rather than merge them. [lucene]

2023-11-29 Thread via GitHub
vsop-479 commented on PR #12846: URL: https://github.com/apache/lucene/pull/12846#issuecomment-1833216581 @jpountz I applied copy for all level's empty acc, Please take a look when you get a chance. As so far I just performed copy on empty acc that cleared by ``clear``. Do you mean we co

Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]

2023-11-29 Thread via GitHub
mikemccand commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1833004513 Thanks for the detailed explanation @ChrisHegarty and @jpountz -- fix looks good! Sneaky random merge policies... -- This is an automated message from the Apache Git Service. To re

Re: [I] Add a new static method for KeywordField#newSetQuery to support collections parameter [lucene]

2023-11-29 Thread via GitHub
gsmiller commented on issue #12243: URL: https://github.com/apache/lucene/issues/12243#issuecomment-1832911784 +1 to shrinking the API surface of the `TermInSetQuery` ctors and getting rid of the varargs flavors since they're pure syntactic sugar. Looks like #12837 is the related PR, so lef

Re: [PR] Removing TermInSetQuery array ctor [lucene]

2023-11-29 Thread via GitHub
gsmiller commented on PR #12837: URL: https://github.com/apache/lucene/pull/12837#issuecomment-1832910639 +1 to this simplification. Let's also get rid of `public TermInSetQuery(RewriteMethod rewriteMethod, String field, BytesRef... terms)`? -- This is an automated message from the Apache

Re: [I] Is it correct for facets to assume positive aggregation values? [lucene]

2023-11-29 Thread via GitHub
gsmiller commented on issue #12585: URL: https://github.com/apache/lucene/issues/12585#issuecomment-1832904037 I think I'd be curious if there are many real-world use-cases out there that have a non-positive association between each document and facet label? I would guess it's pretty uncomm

Re: [I] Reproducible TestDrillSideways failure [lucene]

2023-11-29 Thread via GitHub
gsmiller commented on issue #12418: URL: https://github.com/apache/lucene/issues/12418#issuecomment-1832886094 I think I know what's going on here, and I believe #12853 fixes it. In both seeds mentioned here, I've noticed that the drill-sideways query short-circuits on at least one segment,

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
stefanvodita commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1832732598 Thanks for the suggestions @zhaih! I have to think about option 2 a bit more. If we change `ramBytesUsed` back, performance recovers (Mostly? I'm not sure how noisy this bench

[PR] [Minor] Fix the only use of java.lang.String#toLowerCase() with no Locale [lucene]

2023-11-29 Thread via GitHub
slow-J opened a new pull request, #12856: URL: https://github.com/apache/lucene/pull/12856 I was looking at this old issue: https://github.com/apache/lucene/issues/4390 `Flexible Queryparser has some Locale violation with lowercasing`, which was already fixed with the work done in [GH#5271

[PR] Remove DrillSideways#createDrillDownFacetsCollector in favor of the manager-based hook [lucene]

2023-11-29 Thread via GitHub
gsmiller opened a new pull request, #12855: URL: https://github.com/apache/lucene/pull/12855 This extension hook is no longer used on the main branch, so let's remove it. Opened a corresponding PR to mark this deprecated on 9x (#12854) -- This is an automated message from the Apache Git S

[PR] Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated [lucene]

2023-11-29 Thread via GitHub
gsmiller opened a new pull request, #12854: URL: https://github.com/apache/lucene/pull/12854 This extension hook is only referenced by the also-deprecated `DrillSideways#search(DrillDownQuery, Collector)`, so let's mark it deprecated as well and remove it on main. -- This is an automated

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
zhaih commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1832571805 @benwtrent Thanks for running the benchmark. I looked at the profile and I think we call `ramBytesUsed` after every document is indexed to control the flush [here](https://github.com/apac

Re: [PR] IntTaxonomyFacets test case [lucene]

2023-11-29 Thread via GitHub
gsmiller commented on PR #12563: URL: https://github.com/apache/lucene/pull/12563#issuecomment-1832462625 Closing in favor of #12853 that fixes the underlying root cause and adds testing -- This is an automated message from the Apache Git Service. To respond to the message, please log on

Re: [PR] IntTaxonomyFacets test case [lucene]

2023-11-29 Thread via GitHub
gsmiller closed pull request #12563: IntTaxonomyFacets test case URL: https://github.com/apache/lucene/pull/12563 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe

Re: [I] IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs [lucene]

2023-11-29 Thread via GitHub
gsmiller commented on issue #12558: URL: https://github.com/apache/lucene/issues/12558#issuecomment-1832460996 I opened a PR here that fixes the root cause of this bug: https://github.com/apache/lucene/pull/12853 -- This is an automated message from the Apache Git Service. To respond to t

Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-11-29 Thread via GitHub
gsmiller commented on code in PR #12853: URL: https://github.com/apache/lucene/pull/12853#discussion_r1409693152 ## lucene/CHANGES.txt: ## @@ -112,6 +112,9 @@ Bug Fixes * GITHUB#12220: Hunspell: disallow hidden title-case entries from compound middle/end +* GITHUB#12558: E

[PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-11-29 Thread via GitHub
gsmiller opened a new pull request, #12853: URL: https://github.com/apache/lucene/pull/12853 In GH#12558, @Shradha26 discovered that faceting implementations are sometimes getting empty MatchingDoc lists, resulting in undesired behavior. We should consider it a bug for MatchingDocs to not b

Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]

2023-11-29 Thread via GitHub
ChrisHegarty commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1832391520 > This new failure looks like it's due to the fact that MockRandomMergePolicy now randomly reverses the doc ID order on merge? That's it exactly. I confirmed this through loca

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
stefanvodita commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1409603364 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return cop

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
stefanvodita commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1832333168 Thank you for running the benchmarks @benwtrent, you were quicker than I was. I was worried the modified `ramBytesUsed` would be slow. Maybe we can come up with a smarter way to do

Re: [PR] Copy collected acc(maxFreqs) into empty acc, rather than merge them. [lucene]

2023-11-29 Thread via GitHub
jpountz commented on PR #12846: URL: https://github.com/apache/lucene/pull/12846#issuecomment-1832112426 > Sorry, I am still confused about this. If we clear an unEmpty treeset, how to deal with its competitive entries? This looks similar to the arraycopy that you perform on the `maxF

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-29 Thread via GitHub
jpountz commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1832082403 +1 to not reverting #12631. The fact that our terms index used to not share prefixes properly was a bug, I'm glad it's fixed even though it may make performance slightly slower because m

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-29 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1832050665 I re-ran the Test2BFST with the new change, it looks much better ``` 1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585] 1> 0...: took 0

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
benwtrent commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1831889329 Yeah, `ramBytesUsed` changes here are adding significant overhead to indexing. Here is a cpu profile. [pr12844-768-10-wall.jfr.zip](https://github.com/apache/lucene/files/13

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
benwtrent commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1831865170 I ran knnPerfTest in Lucene util against this PR. 100k cohere vectors. Flushing was depending on memory usage (Do we check the ram usage of onHeapgraph during indexing to determine whe

Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]

2023-11-29 Thread via GitHub
ChrisHegarty commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1831806426 My understanding of this part of the test is that it asserts the sort order of the multi-valued string values in the `value` field. The changes here do not affect this assertion, th

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1409183001 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return copy;

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1409183001 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return copy;

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-29 Thread via GitHub
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1409177467 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return copy;

[I] Do we suboptimally call ByteBuffersDataOutput.toDataInput()? [lucene]

2023-11-29 Thread via GitHub
dungba88 opened a new issue, #12852: URL: https://github.com/apache/lucene/issues/12852 ### Description See this comment: https://github.com/apache/lucene/pull/12624#issuecomment-1829633268 > I wonder: are there other places in Lucene that might fall prey to this performance t

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-29 Thread via GitHub
dungba88 commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1409158917 ## lucene/CHANGES.txt: ## @@ -176,7 +178,8 @@ API Changes * GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph bui

Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]

2023-11-29 Thread via GitHub
mikemccand commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1831732238 > The test assertions depend on doc order which may be affected by merging. Hmm, I don't fully understand the test, but did the sort order truly rely on `docid` tie-breaking? I

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-29 Thread via GitHub
mikemccand merged PR #12847: URL: https://github.com/apache/lucene/pull/12847 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-29 Thread via GitHub
mikemccand commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1409134920 ## lucene/CHANGES.txt: ## @@ -176,7 +178,8 @@ API Changes * GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph b

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-29 Thread via GitHub
dungba88 commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1409117086 ## lucene/CHANGES.txt: ## @@ -176,7 +178,8 @@ API Changes * GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph bui

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-29 Thread via GitHub
gf2121 commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1831673998 Thanks @mikemccand for feedback! I'll backport this to 9.9.0 then. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-29 Thread via GitHub
mikemccand commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1408286029 ## lucene/CHANGES.txt: ## @@ -176,7 +178,8 @@ API Changes * GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph b

[PR] [9.x] Fix intermittently failing TestSortedSetFieldSource [lucene]

2023-11-29 Thread via GitHub
ChrisHegarty opened a new pull request, #12851: URL: https://github.com/apache/lucene/pull/12851 Backport of: * Fix intermittently failing TestSortedSetFieldSource #12850 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-29 Thread via GitHub
mikemccand commented on PR #12847: URL: https://github.com/apache/lucene/pull/12847#issuecomment-1831667475 > The test failed with just `Error: The operation was canceled.` but I can't tell why it happened. The same PR in my local branch works: [dungba88#20](https://github.com/dungba88/luce

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-29 Thread via GitHub
mikemccand commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1831665131 I don't think we should revert #12631 -- that one was a nice disk space improvement, and the `PKLookup` performance is quite noisy -- going up or down by just as much when we upgrade

Re: [PR] Fix intermittently failing TestSortedSetFieldSource [lucene]

2023-11-29 Thread via GitHub
ChrisHegarty commented on PR #12850: URL: https://github.com/apache/lucene/pull/12850#issuecomment-1831661212 The test previously has been seen to fail with: ``` org.junit.ComparisonFailure: expected: but was: at __randomizedtesting.SeedInfo.seed([45D4C56DF2092811:7D67E193D5FAF

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-29 Thread via GitHub
mikemccand closed issue #12180: Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting URL: https://github.com/apache/lucene/issues/12180 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-29 Thread via GitHub
mikemccand commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1831660442 Thanks @ChrisHegarty -- this one is done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above t

[PR] Fix intermittently failing TestSortedSetFieldSource [lucene]

2023-11-29 Thread via GitHub
ChrisHegarty opened a new pull request, #12850: URL: https://github.com/apache/lucene/pull/12850 This commit fixes the intermittently failing TestSortedSetFieldSource. The test assertions depend on doc order which may be affected by merging. The fix is to trivially avoid merging for t

Re: [I] configuration items of the alg file are adapted to the 9.0 branch [LUCENE-10100] [lucene]

2023-11-29 Thread via GitHub
mikemccand commented on issue #11138: URL: https://github.com/apache/lucene/issues/11138#issuecomment-1831658096 Ahh thanks @ChrisHegarty -- no more work for this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [I] configuration items of the alg file are adapted to the 9.0 branch [LUCENE-10100] [lucene]

2023-11-29 Thread via GitHub
mikemccand closed issue #11138: configuration items of the alg file are adapted to the 9.0 branch [LUCENE-10100] URL: https://github.com/apache/lucene/issues/11138 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL a

Re: [I] IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs [lucene]

2023-11-29 Thread via GitHub
Shradha26 commented on issue #12558: URL: https://github.com/apache/lucene/issues/12558#issuecomment-1831628034 Hey @gsmiller, thanks for trying this out again - were you able to reproduce this in 9_8 again? I'd found the bug in 9_7 and I don't think it was an issue in 9_8. -- This is an

Re: [I] configuration items of the alg file are adapted to the 9.0 branch [LUCENE-10100] [lucene]

2023-11-29 Thread via GitHub
ChrisHegarty commented on issue #11138: URL: https://github.com/apache/lucene/issues/11138#issuecomment-1831542918 Hi, I'd like to ask a clarifying question as part of the _9.9.0_ release manager duties, since this currently-unresolved issue is associated with Milestone 9.9.0.

Re: [I] Add Facets#getSpecificValues (bulk) and bulk path -> ordinal lookup for taxonomy faceting [lucene]

2023-11-29 Thread via GitHub
ChrisHegarty commented on issue #12180: URL: https://github.com/apache/lucene/issues/12180#issuecomment-1831540117 Hi, I'd like to ask a clarifying question as part of the _9.9.0_ release manager duties, since this currently-unresolved issue is associated with _Milestone 9.9.0_.

Re: [I] Multi-value Support for KnnVectorField [lucene]

2023-11-29 Thread via GitHub
alessandrobenedetti commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1831534483 Hi @david-sitsky, the multi-valued vectors in Lucene's contribution is now paused for lack of fundings. I'll resume it from my side if and when I get some sponsors :)

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-29 Thread via GitHub
gf2121 commented on PR #12699: URL: https://github.com/apache/lucene/pull/12699#issuecomment-1831412787 Hi @mikemccand @jpountz ! The [newest nightly benchmark](https://home.apache.org/~mikemccand/lucenebench/2023.11.28.18.03.59.html) shows we get back some speed for [PKLookUp](http