Re: [PR] Remove IOException from DocIdSet#iterator signature [lucene]

2025-02-24 Thread via GitHub
javanna commented on PR #14284: URL: https://github.com/apache/lucene/pull/14284#issuecomment-2681021215 > FWIW it looks like there are other cleanups to be done in this class, e.g. removing DocIdSet#all(int) and DocIdSet#bits() (another remainder from the query/filter merge). Thanks

[PR] Remove unused DocIdSet#all method [lucene]

2025-02-24 Thread via GitHub
javanna opened a new pull request, #14288: URL: https://github.com/apache/lucene/pull/14288 DocIdSet#all is no longer relevant since Query and Filter were merged. We can remove it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to Git

Re: [PR] Remove IOException from DocIdSet#iterator signature [lucene]

2025-02-24 Thread via GitHub
javanna commented on PR #14284: URL: https://github.com/apache/lucene/pull/14284#issuecomment-2680997183 > pulling an iterator shouldn't throw an I/O exception agreed, that was also my thinking when I made the change. > Should this be a 11.0 change rather than 10.2? I do

Re: [PR] Introduce DocIdStream#intoBitset to speed up cache [lucene]

2025-02-24 Thread via GitHub
gf2121 commented on PR #14277: URL: https://github.com/apache/lucene/pull/14277#issuecomment-2680655769 > (Longer-term, I'm thinking of removing the query cache https://github.com/apache/lucene/pull/14187) Thanks for explanation, I understand the motivation that we are focusing on sk

Re: [PR] Introduce DocIdStream#intoBitset to speed up cache [lucene]

2025-02-24 Thread via GitHub
gf2121 closed pull request #14277: Introduce DocIdStream#intoBitset to speed up cache URL: https://github.com/apache/lucene/pull/14277

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-24 Thread via GitHub
dungba88 commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2680433274 @msokolov I didn't see the change to cap `perLeafTopK` in the [latest commit](https://github.com/apache/lucene/pull/14226/commits/5b06e168c683e3edf36987379091357c298f0f28#diff-6bf79d1f0e

Re: [PR] Backport error-prone changes from main to branch_10x [lucene]

2025-02-24 Thread via GitHub
asfgit merged PR #14286: URL: https://github.com/apache/lucene/pull/14286

Re: [PR] Backport error-prone changes from main to branch_10x [lucene]

2025-02-24 Thread via GitHub
rmuir commented on PR #14286: URL: https://github.com/apache/lucene/pull/14286#issuecomment-2680113100 Thank you @msfroh!

Re: [PR] Add posTagFormat parameter for OpenNLPPOSFilter [lucene]

2025-02-24 Thread via GitHub
github-actions[bot] commented on PR #14194: URL: https://github.com/apache/lucene/pull/14194#issuecomment-2680017293 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

[PR] improve checkJavadocLinks.py to detect "invalid reference" [lucene]

2025-02-24 Thread via GitHub
rmuir opened a new pull request, #14287: URL: https://github.com/apache/lucene/pull/14287 See background by @mkhludnev on the dev list: https://lists.apache.org/thread/pm1szr9og6qhmjzp371xwk0mvwxxkd1l In some cases: "invalid reference" is generated, passes through Xdoclint and broken

Re: [PR] Backport error-prone changes from main to branch_10x [lucene]

2025-02-24 Thread via GitHub
rmuir commented on PR #14286: URL: https://github.com/apache/lucene/pull/14286#issuecomment-2679880554 This crazy SleepingLockWrapper still lives in main/, but its `DEFAULT_POLL_INTERVAL` is `static final` there. I wonder if this got fixed in another commit or something.

Re: [PR] Backport error-prone changes from main to branch_10x [lucene]

2025-02-24 Thread via GitHub
msfroh commented on PR #14286: URL: https://github.com/apache/lucene/pull/14286#issuecomment-2679883061 > This crazy SleepingLockWrapper still lives in main/, but its `DEFAULT_POLL_INTERVAL` is `static final` there. I wonder if this got fixed in another commit or something. Yeah, I f

Re: [PR] Backport error-prone changes from main to branch_10x [lucene]

2025-02-24 Thread via GitHub
msfroh commented on PR #14286: URL: https://github.com/apache/lucene/pull/14286#issuecomment-2679879073 > Looks like CI fails in a check on core/. there might be more, if it gets past core/: > Oh... dang, I forgot to run locally with error-prone enabled. I'll do that and clean up

Re: [PR] Backport error-prone changes from main to branch_10x [lucene]

2025-02-24 Thread via GitHub
rmuir commented on PR #14286: URL: https://github.com/apache/lucene/pull/14286#issuecomment-2679876336 Looks like CI fails in a check on core/. there might be more, if it gets past core/: ``` /home/runner/work/lucene/lucene/lucene/core/src/java/org/apache/lucene/store/SleepingLock

[PR] Backport error-prone changes from main to branch_10x [lucene]

2025-02-24 Thread via GitHub
msfroh opened a new pull request, #14286: URL: https://github.com/apache/lucene/pull/14286 ### Description This change backports the following PRs to 10.x: * https://github.com/apache/lucene/commit/4cd4f8e2f99b8e300fdd996b54920625a257acf7 * https://github.com/apache/lucene/p

Re: [PR] Introduce DocIdStream#intoBitset to speed up cache [lucene]

2025-02-24 Thread via GitHub
jpountz commented on PR #14277: URL: https://github.com/apache/lucene/pull/14277#issuecomment-2679834513 To be honest, I'm a bit on the fence about introducing specialization for caching. This is why I was wondering if faceting could benefit from it too, though I'm not a fan of the fact tha

Re: [PR] Disable the query cache by default. [lucene]

2025-02-24 Thread via GitHub
jpountz commented on PR #14187: URL: https://github.com/apache/lucene/pull/14187#issuecomment-2679832873 For the record, another downside of the query cache: the fact that it caches per segment doesn't play nicely with intra segment concurrency. E.g. if you have a single large segment that

Re: [PR] Introduce DocIdStream#intoBitset to speed up cache [lucene]

2025-02-24 Thread via GitHub
jpountz commented on PR #14277: URL: https://github.com/apache/lucene/pull/14277#issuecomment-2679821035 I had thought of something in-between your previous PR and this one. E.g. adding `BulkScorer#intoBitSet` with a similar signature and contract as `DocIdSetIterator#intoBitSet`.

Re: [I] improve checkJavadocLinks.py to detect "invalid reference" [lucene]

2025-02-24 Thread via GitHub
rmuir commented on issue #14285: URL: https://github.com/apache/lucene/issues/14285#issuecomment-2679794279 ![Image](https://github.com/user-attachments/assets/fd952f55-8984-4586-9316-2c05e47f6538)

[I] improve checkJavadocLinks.py to detect "invalid reference" [lucene]

2025-02-24 Thread via GitHub
rmuir opened a new issue, #14285: URL: https://github.com/apache/lucene/issues/14285 ### Description See background by @mkhludnev on the dev list: https://lists.apache.org/thread/pm1szr9og6qhmjzp371xwk0mvwxxkd1l In some cases: "invalid reference" is generated, passes through Xd

Re: [I] Allow skip_factor to be set dynamically within QueryCache [lucene]

2025-02-24 Thread via GitHub
sgup432 commented on issue #14183: URL: https://github.com/apache/lucene/issues/14183#issuecomment-2679778282 @jpountz Just checking if you’ve had a chance to look into this. As mentioned, I believe dynamically adjusting `skip_factor` would be beneficial. Additionally, we can also introduc

Re: [PR] Enable error-prone checks for NonFinalStaticField [lucene]

2025-02-24 Thread via GitHub
dweiss commented on PR #14228: URL: https://github.com/apache/lucene/pull/14228#issuecomment-2679643415 Thanks, @msfroh ! I think it should be possible to cherry pick a series of commits from main - perhaps with minor adjustments. When you try to cherry pick this commit from main, you'll se

[PR] Remove IOException from DocIdSet#iterator signature [lucene]

2025-02-24 Thread via GitHub
javanna opened a new pull request, #14284: URL: https://github.com/apache/lucene/pull/14284 There is no implementation of DocIdSet#iterator that throws IOException. This commit proposes removing IOException from its signature.

Re: [PR] Enable error-prone checks for NonFinalStaticField [lucene]

2025-02-24 Thread via GitHub
msfroh commented on PR #14228: URL: https://github.com/apache/lucene/pull/14228#issuecomment-2679564727 > I think a backport to 10x would be nice but perhaps it's worth another issue since it's not trivial. I can take care of that. After going down the rabbit hole once with t

Re: [PR] Enhance DictionaryCompoundWordTokenFilter [lucene]

2025-02-24 Thread via GitHub
rmuir commented on PR #14278: URL: https://github.com/apache/lucene/pull/14278#issuecomment-2679383631 Yes, I'm just suggesting to split it. We can add this new parameter here, backport to minor release 10.2.0, no breaking changes. Separately we can default it to `true` for 11.0?

Re: [I] Refactor QueryCache to improve concurrency and performance [lucene]

2025-02-24 Thread via GitHub
sgup432 commented on issue #14222: URL: https://github.com/apache/lucene/issues/14222#issuecomment-2679321128 >But in a typical workload we expect to be spending most of our time executing queries rather than caching them, which will reduce the amount of time spent acquiring locks, and the

Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]

2025-02-24 Thread via GitHub
pseudo-nymous commented on PR #14279: URL: https://github.com/apache/lucene/pull/14279#issuecomment-2679232219 There were small fixes which I missed in the previous commit. This didn't have any history attached to it and I kept the commit which had comments. Also, I'll keep this in check

Re: [PR] Enhance DictionaryCompoundWordTokenFilter [lucene]

2025-02-24 Thread via GitHub
rmuir commented on PR #14278: URL: https://github.com/apache/lucene/pull/14278#issuecomment-2679239148 for changing defaults, my goto would be, if we could do that as a followup PR, for a major release. We can expose this parameter in a minor release without hurting anyone, but if w

Re: [PR] Enhance DictionaryCompoundWordTokenFilter [lucene]

2025-02-24 Thread via GitHub
renatoh commented on PR #14278: URL: https://github.com/apache/lucene/pull/14278#issuecomment-2679257613 I would argue, at least in German, nothing but longestMatch=true and skipping forward makes any sense. Without skipping forward the filter extracts a lot of nonsense and in my opinio

Re: [PR] Enhance DictionaryCompoundWordTokenFilter [lucene]

2025-02-24 Thread via GitHub
rmuir commented on PR #14278: URL: https://github.com/apache/lucene/pull/14278#issuecomment-2679218267 I'm not really opinionated on it, was just brainstorming because I had to look at the source code to figure out what the parameter was doing. And I agree, it is surprising behavior

Re: [PR] Enhance DictionaryCompoundWordTokenFilter [lucene]

2025-02-24 Thread via GitHub
renatoh commented on PR #14278: URL: https://github.com/apache/lucene/pull/14278#issuecomment-2679176831 > looks good to me. I wonder about the name of the parameter, maybe "greedy" would be more intuitive as a way to describe what it is doing? not saying "consumeChars" is a good name

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-24 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2679008833 @gaoj0017 > The OSQ method (introduced in this PR) has its major idea similar to our extended RaBitQ method and our extended RaBitQ method is a prior art which achieves good ac

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-24 Thread via GitHub
benwtrent commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2679029410 > so I rebased and removed the reuse-scores part of this since it was conflicting with other changes and doesn't seem worth preserving @msokolov thanks for confirming and digging

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-24 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2679022162 > I wonder where can I find the code for the benchmarks that you are mentioning in the description? Thanks! @lpld I patched a version of Lucene util, sort of like this: https://

Re: [PR] Remove field argument from DocIdSetBuilder constructor [lucene]

2025-02-24 Thread via GitHub
javanna merged PR #14282: URL: https://github.com/apache/lucene/pull/14282

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-24 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2678966117 There were some conflicts with other recent changes, so I rebased and removed the reuse-scores part of this since it was conflicting with other changes and doesn't seem worth preserving

Re: [PR] Address completion fields testing gap and truly allow loading FST off heap [lucene]

2025-02-24 Thread via GitHub
javanna commented on PR #14270: URL: https://github.com/apache/lucene/pull/14270#issuecomment-2678961874 @jpountz do you have opinions on this? Who else should I ping otherwise?

Re: [PR] Remove field argument from DocIdSetBuilder constructor [lucene]

2025-02-24 Thread via GitHub
javanna commented on PR #14282: URL: https://github.com/apache/lucene/pull/14282#issuecomment-2678958683 thanks for the reference, I found that in the history but I did not have all the context, it's good to know. I was also wondering if this could be perceived as a breaking change and I th

Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-02-24 Thread via GitHub
jpountz commented on PR #14273: URL: https://github.com/apache/lucene/pull/14273#issuecomment-2678921217 > Looks like we can do similar trick for range facets and long values facets? This is right.

Re: [I] Weird Performance regression due to: #13907 [lucene]

2025-02-24 Thread via GitHub
jpountz commented on issue #14281: URL: https://github.com/apache/lucene/issues/14281#issuecomment-2678895153 This is interesting, the linked change is expected to replace some potentially expensive calls to madvise (since they need to iterate all pages) with cheaper checks, so I wouldn't h

[I] Lack of coverage of DenseConjunctionBulkScorer with min competitive scores and competitive iterators [lucene]

2025-02-24 Thread via GitHub
jpountz opened a new issue, #14283: URL: https://github.com/apache/lucene/issues/14283 `DenseConjunctionBulkScorer` has good test coverage, but we don't test that it correctly reacts to the min competitive score being set to a higher value than the constant score, or to a competitive iterat

Re: [PR] Enhance DictionaryCompoundWordTokenFilter [lucene]

2025-02-24 Thread via GitHub
rmuir commented on code in PR #14278: URL: https://github.com/apache/lucene/pull/14278#discussion_r1967791155 ## lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java: ## @@ -682,4 +687,41 @@ protected TokenStreamComponents createCo

Re: [PR] Enhance DictionaryCompoundWordTokenFilter [lucene]

2025-02-24 Thread via GitHub
rmuir commented on PR #14278: URL: https://github.com/apache/lucene/pull/14278#issuecomment-2678643974 looks good to me. I wonder about the name of the parameter, maybe "greedy" would be more intuitive as a way to describe what it is doing?

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-02-24 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2678496890 I pushed a version that re-uses scores *and* limits per-leaf topK to global topK. The former didn't make very much difference, but the latter change did improve things quite a bit. He

[PR] Remove field argument from DocIdSetBuilder constructor [lucene]

2025-02-24 Thread via GitHub
javanna opened a new pull request, #14282: URL: https://github.com/apache/lucene/pull/14282 The argument is unused, its callers can stop providing it.

Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-02-24 Thread via GitHub
epotyom commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r1967691866 ## lucene/core/src/java/org/apache/lucene/search/DocIdStream.java: ## @@ -34,12 +33,35 @@ protected DocIdStream() {} * Iterate over doc IDs contained in this strea

Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]

2025-02-24 Thread via GitHub
dweiss commented on PR #14279: URL: https://github.com/apache/lucene/pull/14279#issuecomment-2678495642 Please do not force-push your changes. Keep prior commits so that the history is preserved and comments are attached where they should be. Thanks!

Re: [I] Refactor QueryCache to improve concurrency and performance [lucene]

2025-02-24 Thread via GitHub
msokolov commented on issue #14222: URL: https://github.com/apache/lucene/issues/14222#issuecomment-2678412268 This shows a nice improvement on the microbenchmark! But in a typical workload we expect to be spending most of our time executing queries rather than caching them, which will redu

Re: [PR] Reuse entry point scores and provide mechanisms to provide scores for directly entry points [lucene]

2025-02-24 Thread via GitHub
benwtrent commented on PR #14256: URL: https://github.com/apache/lucene/pull/14256#issuecomment-2678310512 > it might suffice to create a wrapping RandomVectorScorer that would supply the cached scores, while delegating the others to the underlying scorer? 🤔 Maybe we could wrap the

Re: [PR] Reuse entry point scores and provide mechanisms to provide scores for directly entry points [lucene]

2025-02-24 Thread via GitHub
benwtrent commented on code in PR #14256: URL: https://github.com/apache/lucene/pull/14256#discussion_r1967569988 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -229,25 +230,32 @@ public void addGraphNode(int node, UpdateableRandomVectorScorer

Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]

2025-02-24 Thread via GitHub
pseudo-nymous commented on code in PR #14279: URL: https://github.com/apache/lucene/pull/14279#discussion_r1967566449 ## .github/workflows/verify-changelog-and-set-milestone.yml: ## @@ -0,0 +1,100 @@ +name: "Change Log Entry Verifier and Milestone Setter" +run-name: Change log e

Re: [PR] Enable error-prone checks for NonFinalStaticField [lucene]

2025-02-24 Thread via GitHub
dweiss commented on PR #14228: URL: https://github.com/apache/lucene/pull/14228#issuecomment-2678269055 I've merged this. There are conflicts when backporting to branch_10x and they sort of depend on previous fixes to spotless, so I leave this patch on main only. I think a backport to 10x w

Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]

2025-02-24 Thread via GitHub
pseudo-nymous commented on code in PR #14279: URL: https://github.com/apache/lucene/pull/14279#discussion_r1967542895 ## .github/workflows/verify-changelog-and-set-milestone.yml: ## @@ -0,0 +1,100 @@ +name: "Change Log Entry Verifier and Milestone Setter" +run-name: Change log e

Re: [PR] Enable error-prone checks for NonFinalStaticField [lucene]

2025-02-24 Thread via GitHub
dweiss merged PR #14228: URL: https://github.com/apache/lucene/pull/14228

Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]

2025-02-24 Thread via GitHub
pseudo-nymous commented on code in PR #14279: URL: https://github.com/apache/lucene/pull/14279#discussion_r1967537809 ## .github/workflows/verify-changelog-and-set-milestone.yml: ## @@ -0,0 +1,100 @@ +name: "Change Log Entry Verifier and Milestone Setter" +run-name: Change log e

Re: [PR] Adds github action to verify changelog entry and set milestone to PRs [lucene]

2025-02-24 Thread via GitHub
pseudo-nymous commented on code in PR #14279: URL: https://github.com/apache/lucene/pull/14279#discussion_r1967532868 ## .github/workflows/verify-changelog-and-set-milestone.yml: ## @@ -0,0 +1,100 @@ +name: "Change Log Entry Verifier and Milestone Setter" +run-name: Change log e

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-02-24 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2678033882 Hi @benwtrent, that's an amazing amount of work. I wonder where can I find the code for the benchmarks that you are mentioning in the description? Thanks!

[PR] ExceptionInInitializerError in ScorerUtil [lucene]

2025-02-24 Thread via GitHub
imario42 opened a new pull request, #14280: URL: https://github.com/apache/lucene/pull/14280 lazy initialize the ScorerUtil DEFAULT_IMPACTS_ENUM_CLASS to prevent initialization issues with this class if the thread gets interrupted. ### Description The ScorerUtils fails to initi
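
For readers unfamiliar with the failure mode described in this PR, here is a minimal sketch of the general idea of resolving a value on first use rather than in a static initializer. It is not the PR's code: the class `LazyInitSketch`, the method `impactsEnumClass()`, and the `lookup()` stand-in are all hypothetical names chosen for illustration. The assumption is that a failure inside a static initializer leaves the class unusable for the rest of the JVM's life, whereas a failure inside an ordinary method only affects that one call.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Standalone sketch; every name here is hypothetical and not Lucene's ScorerUtil code.
final class LazyInitSketch {
  // Counts how often the lookup runs, for demonstration only.
  static final AtomicInteger LOOKUPS = new AtomicInteger();

  // Resolved on first use instead of in a static initializer.
  private static volatile Class<?> cachedClass;

  private LazyInitSketch() {}

  static Class<?> impactsEnumClass() {
    Class<?> c = cachedClass;
    if (c == null) {
      // If lookup() throws, the exception reaches this caller only. The
      // enclosing class is already initialized, so a later call can retry.
      c = lookup();
      cachedClass = c;
    }
    return c;
  }

  // Hypothetical stand-in for whatever work resolves the class.
  private static Class<?> lookup() {
    LOOKUPS.incrementAndGet();
    return String.class;
  }
}
```

Two threads racing on the first call may both run `lookup()`. That is harmless here because the result is the same either way, and it avoids holding a lock on the hot path.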

Re: [PR] Reciprocal Rank Fusion (RRF) in TopDocs [lucene]

2025-02-24 Thread via GitHub
javanna commented on code in PR #13470: URL: https://github.com/apache/lucene/pull/13470#discussion_r1967234360 ## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ## @@ -350,4 +354,89 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits,

Re: [PR] Reciprocal Rank Fusion (RRF) in TopDocs [lucene]

2025-02-24 Thread via GitHub
jpountz commented on PR #13470: URL: https://github.com/apache/lucene/pull/13470#issuecomment-2677780646 Thanks for taking a look. I have a bias for the latter, as I was planning on improving the docs of the oal.search package as a follow-up to provide guidance wrt how to do hybrid search b

Re: [PR] Reciprocal Rank Fusion (RRF) in TopDocs [lucene]

2025-02-24 Thread via GitHub
javanna commented on PR #13470: URL: https://github.com/apache/lucene/pull/13470#issuecomment-263959 This looks good to me. Perhaps we could mark the new static method experimental, especially if we think we are going to want to support more ways of combining topdocs soon enough. I don'
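
As background for this thread, the following is a small standalone sketch of reciprocal rank fusion as it is commonly defined: each document's fused score is the sum, over all input rankings, of 1 / (k + rank), with ranks starting at 1. It does not use or reproduce the `TopDocs` method proposed in the PR. The class `RrfSketch`, the method `fuse`, and the use of string keys for documents are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of reciprocal rank fusion; not Lucene's TopDocs API.
final class RrfSketch {
  private RrfSketch() {}

  /**
   * Fuses ranked lists of document keys, each ordered best-first.
   * k is the smoothing constant; 60 is a commonly used value.
   */
  static Map<String, Double> fuse(List<List<String>> rankings, int k) {
    Map<String, Double> scores = new HashMap<>();
    for (List<String> ranking : rankings) {
      for (int i = 0; i < ranking.size(); i++) {
        int rank = i + 1; // ranks are 1-based
        // Add this ranking's contribution to the document's running total.
        scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
      }
    }
    return scores;
  }
}
```

With the rankings `[a, b]` and `[b, c]`, document `b` appears in both lists and so ends up with the highest fused score. Sorting the map's entries by descending value would give the fused order.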